Part 3: Comparisons & Data
3.1 Head-to-Head: Same Prompt, Different Techniques
Task: "Add error handling to this function"
| Technique | Prompt | ~Input Tokens | Output Quality |
|---|---|---|---|
| Verbose English | "Hey, could you please add comprehensive error handling to this function? I'd like it to handle all edge cases including null inputs, invalid types, and network errors. Please explain your changes." | ~40 | Good, but verbose output |
| Caveman lite | "Add error handling to this function. Cover null inputs, invalid types, network errors." | ~16 | Good |
| Caveman full | "Add error handling. Cover: null input, bad type, network error." | ~12 | Good |
| Caveman ultra | "Error handling: null/bad-type/net-err." | ~7 | Good (may need context) |
| Structured | `fn: add error handling\n- null input\n- invalid type\n- network error` | ~12 | Good |
| Code-centric | `# TODO: handle None, TypeError, ConnectionError` | ~8 | Good |
All six produce correct error handling code. The token costs range from 7 to 40. That's a 5.7x difference for the same result.
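You can reproduce counts like these yourself. A minimal sketch using the tiktoken library (token counts are tokenizer-specific, so expect small deviations from the table above):

```python
# pip install tiktoken
import tiktoken

# cl100k_base is the GPT-4-family encoding; other models tokenize differently.
enc = tiktoken.get_encoding("cl100k_base")

prompts = {
    "verbose":       "Hey, could you please add comprehensive error handling to "
                     "this function? I'd like it to handle all edge cases including "
                     "null inputs, invalid types, and network errors. Please "
                     "explain your changes.",
    "caveman full":  "Add error handling. Cover: null input, bad type, network error.",
    "caveman ultra": "Error handling: null/bad-type/net-err.",
    "code-centric":  "# TODO: handle None, TypeError, ConnectionError",
}

for name, text in prompts.items():
    print(f"{name:14}: {len(enc.encode(text)):3} tokens")
```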
3.2 Language Comparison Tables
Single Sentence Comparison
Source: Ivan Krivyakov's analysis using the Hugging Face tokenizer playground (GPT-4 setting).
| Language | Sentence | Characters | UTF-8 Bytes | Tokens | Cost vs. English |
|---|---|---|---|---|---|
| 🇬🇧 English | I met a huge dog | 16 | 16 | 5 | 1.0x |
| 🇪🇸 Spanish | Conocí a un perro enorme | 24 | 25 | 8 | 1.6x |
| 🇵🇱 Polish | Spotkałem ogromnego psa | 23 | 24 | 8 | 1.6x |
| 🇮🇸 Icelandic | Ég hitti risastóran hund | 24 | 26 | 10 | 2.0x |
| 🇨🇳 Chinese | 我遇见了一只大狗 | 8 | 24 | 11 | 2.2x |
| 🇯🇵 Japanese | 大きな犬に出会った | 9 | 27 | 11 | 2.2x |
| 🇷🇺 Russian | Я встретил огромную собаку | 26 | 49 | 14 | 2.8x |
| 🇮🇱 Hebrew | פגשתי כלב ענק | 13 | 24 | 16 | 3.2x |
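The character/byte/token relationship in this table is easy to verify. A small sketch with tiktoken (cl100k_base approximates the GPT-4 setting used above; exact counts may differ slightly from the playground):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

samples = {
    "English": "I met a huge dog",
    "Spanish": "Conocí a un perro enorme",
    "Russian": "Я встретил огромную собаку",
}

for lang, s in samples.items():
    # Characters are Unicode code points; Cyrillic needs 2 UTF-8 bytes each,
    # which is one reason non-Latin scripts fragment into more tokens.
    print(f"{lang}: chars={len(s)}, bytes={len(s.encode('utf-8'))}, "
          f"tokens={len(enc.encode(s))}")
```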
Large-Scale Averages
Source: Capodieci/Castillo research across larger text samples.
| Language | Avg Token Cost vs. English | Characters per Token | Verdict |
|---|---|---|---|
| English | 1.0x | 4.75 | ✅ Best for prompts |
| Spanish | ~1.3-1.6x | ~3.5 | ⚠️ 30-60% more expensive |
| German | ~1.4-1.6x | ~3.2 | ⚠️ 40-60% more expensive |
| Mandarin Chinese | ~1.76x | 1.33 | ❌ 76% more expensive |
| Japanese | ~2.12x | 1.41 | ❌ 112% more expensive |
| Korean | ~2.36x | ~1.2 | ❌ 136% more expensive |
| Russian | ~2.5-2.8x | ~2.0 | ❌ 150-180% more expensive |
3.3 Technique-by-Technique Matrix
The complete comparison of every technique covered in this guide:
| # | Technique | Input Savings | Output Savings | Quality Impact | Effort | Best For |
|---|---|---|---|---|---|---|
| Communication Style | | | | | | |
| A1 | Caveman-speak (full) | 30-50% | 40-55%† | Minimal | Low | All Copilot interactions |
| A2 | Intensity levels (lite→ultra) | 15-70% | 15-55%† | Varies by level | Low | Tuning compression |
| A3 | Wenyan modes | ❌ Negative | ❌ Negative | Degrades | Low | Demo only |
| A4 | Structured patterns | 20-40% | 30-50% | Improves | Low | Technical prompts |
| Prompt Engineering | | | | | | |
| B1 | Precise prompts | 30-60% | 30-60% | Improves | Low | Every interaction |
| B2 | Ask for diffs, not rewrites | – | 50-90% | Neutral+ | Low | Code modifications |
| B3 | One task per prompt | 20-40% | 20-40% | Improves | Low | Complex requests |
| B4 | Constrain output format | – | 40-80% | Depends | Low | Data extraction |
| B5 | System instructions for terseness | – | 30-60% | Good | Medium | All interactions |
| B6 | Few-shot vs zero-shot | +20-50% (cost) | 30-60% less wasted output | Improves | Medium | Novel patterns only |
| B7 | Retune prompts to target model guide | Indirect | Indirect | Improves | Low | Model upgrades, app prompts, agent profiles |
| Context Management | | | | | | |
| C1 | Limit context (file selection) | 50-90% | – | Varies | Medium | Large codebases |
| C2 | Compressed instructions file | 40-60% of file | – | None | Low | Every repo |
| C3 | Progressive on-demand guidance | 60-90% of optional guidance | – | Positive | High | Teams with reusable prompt files |
| C4 | Start new conversations | 80%+ | – | Lose context | Low | Long sessions |
| Output Control | | | | | | |
| D1 | Code-only responses | – | 40-70% | Good | Low | Code generation |
| D2 | Structured output (JSON/tables) | – | 30-60% | Depends | Low | Data tasks |
| D3 | Limit response length | – | Variable | Risk truncation | Low | Quick answers |
| Agent-Specific | | | | | | |
| E1 | `copilot-setup-steps.yml` | 10-30% | – | Improves | Medium | Coding Agent |
| E2 | Precise issue descriptions | 20-50% | – | Improves | Low | Coding Agent |
| E3 | Custom agent profiles | 10-30% | – | Improves | Medium | Coding Agent |
| E4 | Plan files | 15-40% | – | Depends | Medium | Complex agent tasks |
| Memory/State | | | | | | |
| F1 | Compress memory files | 40-60% per load | – | None | Low | Persistent contexts |
| F2 | Terse commit messages | ~5-15 tokens | – | None | Low | Agent reads git |
| F3 | Terse review comments | – | 60-80% | None | Low | PR workflows |
| Model Selection | | | | | | |
| G1 | Lower-cost models for simple tasks | N/A | N/A | Varies | Low | Simple tasks |
| G2 | Auto model selection | N/A | N/A | Good | None | Default choice |
| G3 | Draft cheap, polish premium | N/A | N/A | Good | Low | Iteration-heavy work |
| G4 | Model mixing by task type | 38-79% of requests | – | None (SWE-bench data) | Low | All workflows |
| G5 | Reasoning effort / thinking effort | Vendor-reported savings | – | Vendor-recommended, not benchmarked | Low | Supported reasoning models in Copilot, CLI, and API |
| Session Management | | | | | | |
| H1 | Ask Mode for simple questions | 60-90% | – | Good | Low | Simple questions |
| Context File Management (covers `copilot-instructions.md`, `AGENTS.md`, `CLAUDE.md` – same always-on context) | | | | | | |
| I1 | Prune always-on context to landmines only | Variable (file size) | – | Improves | Low | All agent workflows |
| I2 | Delete LLM-generated context files | 20-23% total | – | Improves | Low | Projects with /init output |
| I3 | Bug-tracker approach to context | Variable | – | Improves | Low | Living projects |
| I4 | Consolidate duplicate context files (one file, not two) | Duplicate cost | – | Neutral | Low | Repos with both `AGENTS.md` + `copilot-instructions.md` |
| MCP & Tool Management | | | | | | |
| J1 | Audit MCP servers (disable unused) | 5K-190K/task | – | None | Low | Agent mode users |
| J2 | Per-workspace MCP config | Variable | – | None | Medium | Multi-project setups |
| J3 | Minimize tool calls (instructions) | 10-30% | – | Neutral | Low | Agent mode |
| J4 | Compress tool output with RTK | 60-90% of shell cmd output | – | None | Low | Agent / Coding Agent – any AI tool |
| Agent Mode Configuration | | | | | | |
| K1 | Precise prompts + acceptance criteria | 30-60% | – | Improves | Low | Agent tasks |
| K2 | Plan files for complex tasks | 15-40% | – | Improves | Medium | Multi-step agent tasks |
| K3 | Cap agent maxTurns | Variable | – | Risk truncation | Low | All agent tasks |
| K4 | Mode selection (Ask/Edit/Agent) | 60-90% | – | Good | Low | Every interaction |
† A1/A2 output savings require system-level terse output instructions (see B5). Writing terse prompts alone saves input tokens; output tokens are only reduced if the model is instructed to respond tersely.
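What a B5-style terseness block can look like in practice: an illustrative sketch for a repo's `copilot-instructions.md` (the wording here is hypothetical; any plain-language equivalent works, since instructions files are freeform):

```markdown
<!-- Illustrative terse-output rules for copilot-instructions.md -->
- Answer with code first; add prose only when asked.
- Never restate the question or summarize your changes.
- Prefer diffs over full-file rewrites.
- Default to one-sentence answers for factual questions.
```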
The Big Winners
If you do nothing else, do these seven. Ranked by impact-to-effort ratio:
- Caveman-speak – 30-50% input token savings; combine with B5 for 40-55% output savings
- Precise prompts – 30-60% savings, just a habit change
- Code-only / constrain output – 40-80% output savings, one instruction
- Shrink always-on context (`copilot-instructions.md` + `AGENTS.md`) – compress filler, prune to landmines only (see the sketch after this list), delete LLM-generated boilerplate. Compounds on every interaction and agent step; 20-23% agent-task reduction plus better correctness
- Ask Mode for simple questions – 60-90% savings by avoiding Agent overhead
- Audit MCP servers – disable unused servers, save 5K-190K tokens per agent task
- Retune prompts to target model guide – not a per-request shrink; improves first-pass quality and avoids rework after model changes
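For the always-on context item above, "landmines only" means keeping just the rules whose violation breaks something. A hypothetical pruned `copilot-instructions.md` (the project specifics below are invented for illustration):

```markdown
<!-- Hypothetical landmines-only copilot-instructions.md -->
- Never edit applied migrations in migrations/; always add a new file.
- src/legacy/ is frozen: do not refactor or reformat it.
- All test data comes from tests/factories.py, never inline fixtures.
```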
3.4 Quality Impact Assessment
Does compression hurt output quality? The research says: rarely, and only at extreme levels.
| Compression Level | Quality Impact | Evidence |
|---|---|---|
| Lite (drop filler) | None | Models trained on diverse text, understand clean prose |
| Full (drop articles, fragments) | Negligible | Models handle fragments well; technical terms preserved |
| Ultra (abbreviations, arrows) | Minor risk | Complex multi-step instructions may be misread |
| Wenyan (Classical Chinese) | Moderate risk | Models understand wenyan less reliably than English |
| Extreme (single words only) | Significant risk | Ambiguity increases, model may misinterpret |
The threshold: When you find yourself re-explaining or getting wrong results, you've compressed too far. Back off one level.
Model-specific notes: All major models (GPT-4, Claude, Gemini) handle caveman-full well. Ultra works for experienced users who know the domain. Wenyan is unreliable for code generation tasks.
Diminishing Returns
The savings curve is not linear. The first 30% of compression (dropping filler) is free. The next 20% (fragments, abbreviations) is nearly free. Beyond that, each additional compression point risks quality.
Savings vs. Quality Risk:
Quality ██████████████████████░░░░░░░░
Risk    ░░░░░░░░░░░░░░░░░░░░░░████████
        0%     20%     40%     60%     80%
                  Token Savings →
        lite      full      ultra   extreme
Sweet spot: full caveman (30-50% input token savings; 40-55% output savings with terse system instructions). Maximum return, negligible risk.
Next: Practical Setup →