
Part 3: Comparisons & Data

← Back to Guide


3.1 Head-to-Head: Same Prompt, Different Techniques

Task: "Add error handling to this function"

| Technique | Prompt | ~Input Tokens | Output Quality |
|---|---|---|---|
| Verbose English | "Hey, could you please add comprehensive error handling to this function? I'd like it to handle all edge cases including null inputs, invalid types, and network errors. Please explain your changes." | ~40 | Good, but verbose output |
| Caveman lite | "Add error handling to this function. Cover null inputs, invalid types, network errors." | ~16 | Good |
| Caveman full | "Add error handling. Cover: null input, bad type, network error." | ~12 | Good |
| Caveman ultra | "Error handling: null/bad-type/net-err." | ~7 | Good (may need context) |
| Structured | `fn: add error handling\n- null input\n- invalid type\n- network error` | ~12 | Good |
| Code-centric | `# TODO: handle None, TypeError, ConnectionError` | ~8 | Good |

All six produce correct error handling code. The token costs range from 7 to 40. That's a 5.7x difference for the same result.
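All six prompts converge on essentially the same shape of code. A minimal Python sketch of that result (the `fetch_json` name and the use of `urllib` are illustrative assumptions, not from the source):

```python
import urllib.error
import urllib.request


def fetch_json(url, timeout=5):
    """Fetch a resource, covering the three cases every prompt variant names."""
    # Null input
    if url is None:
        raise ValueError("url must not be None")
    # Invalid type
    if not isinstance(url, str):
        raise TypeError(f"url must be str, got {type(url).__name__}")
    # Network errors
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read()
    except (urllib.error.URLError, TimeoutError) as exc:
        raise ConnectionError(f"request to {url} failed: {exc}") from exc
```

Any of the six prompts, pointed at a bare `urlopen` call, tends to produce these three guard clauses; only the token bill differs.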

3.2 Language Comparison Tables

Single Sentence Comparison

Source: Ivan Krivyakov's analysis using the Hugging Face tokenizer playground (GPT-4 setting).

| Language | Sentence | Characters | UTF-8 Bytes | Tokens | Cost vs. English |
|---|---|---|---|---|---|
| 🇬🇧 English | I met a huge dog | 16 | 16 | 5 | 1.0x |
| 🇪🇸 Spanish | Conocí a un perro enorme | 24 | 25 | 8 | 1.6x |
| 🇵🇱 Polish | Spotkałem ogromnego psa | 23 | 24 | 8 | 1.6x |
| 🇮🇸 Icelandic | Ég hitti risastóran hund | 24 | 26 | 10 | 2.0x |
| 🇨🇳 Chinese | 我遇见了一只大狗 | 8 | 24 | 11 | 2.2x |
| 🇯🇵 Japanese | 大きな犬に出会った | 9 | 27 | 11 | 2.2x |
| 🇷🇺 Russian | Я встретил огромную собаку | 26 | 49 | 14 | 2.8x |
| 🇮🇱 Hebrew | פגשתי כלב ענק | 13 | 24 | 16 | 3.2x |
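The Characters and UTF-8 Bytes columns are easy to verify with stdlib Python; only the Tokens column needs the actual GPT-4 tokenizer, so it is not reproduced here:

```python
samples = {
    "English": "I met a huge dog",
    "Chinese": "我遇见了一只大狗",
    "Russian": "Я встретил огромную собаку",
}

for lang, text in samples.items():
    # ASCII is 1 byte/char, Cyrillic 2, CJK 3; this is why byte (and token)
    # counts diverge from character counts outside English.
    chars, nbytes = len(text), len(text.encode("utf-8"))
    print(f"{lang}: {chars} chars, {nbytes} bytes")
# → English: 16 chars, 16 bytes
# → Chinese: 8 chars, 24 bytes
# → Russian: 26 chars, 49 bytes
```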

Large-Scale Averages

Source: Capodieci/Castillo research across larger text samples.

| Language | Avg Token Cost vs. English | Characters per Token | Verdict |
|---|---|---|---|
| English | 1.0x | 4.75 | ✅ Best for prompts |
| Spanish | ~1.3-1.6x | ~3.5 | ⚠️ 30-60% more expensive |
| German | ~1.4-1.6x | ~3.2 | ⚠️ 40-60% more expensive |
| Mandarin Chinese | ~1.76x | 1.33 | ❌ 76% more expensive |
| Japanese | ~2.12x | 1.41 | ❌ 112% more expensive |
| Korean | ~2.36x | ~1.2 | ❌ 136% more expensive |
| Russian | ~2.5-2.8x | ~2.0 | ❌ 150-180% more expensive |

3.3 Technique-by-Technique Matrix

The complete comparison of every technique covered in this guide:

| # | Technique | Input Savings | Output Savings | Quality Impact | Effort | Best For |
|---|---|---|---|---|---|---|
| | **Communication Style** | | | | | |
| A1 | Caveman-speak (full) | 30-50% | 40-55%† | Minimal | Low | All Copilot interactions |
| A2 | Intensity levels (lite→ultra) | 15-70% | 15-55%† | Varies by level | Low | Tuning compression |
| A3 | Wenyan modes | ❌ Negative | ❌ Negative | Degrades | Low | Demo only |
| A4 | Structured patterns | 20-40% | 30-50% | Improves | Low | Technical prompts |
| | **Prompt Engineering** | | | | | |
| B1 | Precise prompts | 30-60% | 30-60% | Improves | Low | Every interaction |
| B2 | Ask for diffs, not rewrites | — | 50-90% | Neutral+ | Low | Code modifications |
| B3 | One task per prompt | 20-40% | 20-40% | Improves | Low | Complex requests |
| B4 | Constrain output format | — | 40-80% | Depends | Low | Data extraction |
| B5 | System instructions for terseness | — | 30-60% | Good | Medium | All interactions |
| B6 | Few-shot vs zero-shot | +20-50% | -30-60% wasted | Improves | Medium | Novel patterns only |
| B7 | Retune prompts to target model guide | Indirect | Indirect | Improves | Low | Model upgrades, app prompts, agent profiles |
| | **Context Management** | | | | | |
| C1 | Limit context (file selection) | 50-90% | — | Varies | Medium | Large codebases |
| C2 | Compressed instructions file | 40-60% of file | — | None | Low | Every repo |
| C3 | Progressive on-demand guidance | 60-90% of optional guidance | — | Positive | High | Teams with reusable prompt files |
| C4 | Start new conversations | 80%+ | — | Lose context | Low | Long sessions |
| | **Output Control** | | | | | |
| D1 | Code-only responses | — | 40-70% | Good | Low | Code generation |
| D2 | Structured output (JSON/tables) | — | 30-60% | Depends | Low | Data tasks |
| D3 | Limit response length | — | Variable | Risk truncation | Low | Quick answers |
| | **Agent-Specific** | | | | | |
| E1 | copilot-setup-steps.yml | 10-30% | — | Improves | Medium | Coding Agent |
| E2 | Precise issue descriptions | 20-50% | — | Improves | Low | Coding Agent |
| E3 | Custom agent profiles | 10-30% | — | Improves | Medium | Coding Agent |
| E4 | Plan files | 15-40% | — | Depends | Medium | Complex agent tasks |
| | **Memory/State** | | | | | |
| F1 | Compress memory files | 40-60% per load | — | None | Low | Persistent contexts |
| F2 | Terse commit messages | ~5-15 tokens | — | None | Low | Agent reads git |
| F3 | Terse review comments | — | 60-80% | None | Low | PR workflows |
| | **Model Selection** | | | | | |
| G1 | Lower-cost models for simple tasks | N/A | N/A | Varies | Low | Simple tasks |
| G2 | Auto model selection | N/A | N/A | Good | None | Default choice |
| G3 | Draft cheap, polish premium | N/A | N/A | Good | Low | Iteration-heavy work |
| G4 | Model mixing by task type | 38-79% of requests | — | None (SWE-bench data) | Low | All workflows |
| G5 | Reasoning effort / thinking effort | Vendor-reported savings | — | Vendor-recommended, not benchmarked | Low | Supported reasoning models in Copilot, CLI, and API |
| | **Session Management** | | | | | |
| H1 | Ask Mode for simple questions | 60-90% | — | Good | Low | Simple questions |
| | **Context File Management** (covers copilot-instructions.md, AGENTS.md, CLAUDE.md; same always-on context) | | | | | |
| I1 | Prune always-on context to landmines only | Variable (file size) | — | Improves | Low | All agent workflows |
| I2 | Delete LLM-generated context files | 20-23% total | — | Improves | Low | Projects with /init output |
| I3 | Bug-tracker approach to context | Variable | — | Improves | Low | Living projects |
| I4 | Consolidate duplicate context files (one file, not two) | Duplicate cost | — | Neutral | Low | Repos with both AGENTS.md + copilot-instructions.md |
| | **MCP & Tool Management** | | | | | |
| J1 | Audit MCP servers (disable unused) | 5K-190K/task | — | None | Low | Agent mode users |
| J2 | Per-workspace MCP config | Variable | — | None | Medium | Multi-project setups |
| J3 | Minimize tool calls (instructions) | 10-30% | — | Neutral | Low | Agent mode |
| J4 | Compress tool output with RTK | 60-90% of shell cmd output | — | None | Low | Agent / Coding Agent; any AI tool |
| | **Agent Mode Configuration** | | | | | |
| K1 | Precise prompts + acceptance criteria | 30-60% | — | Improves | Low | Agent tasks |
| K2 | Plan files for complex tasks | 15-40% | — | Improves | Medium | Multi-step agent tasks |
| K3 | Cap agent maxTurns | Variable | — | Risk truncation | Low | All agent tasks |
| K4 | Mode selection (Ask/Edit/Agent) | 60-90% | — | Good | Low | Every interaction |

† A1/A2 output savings require system-level terse output instructions (see B5). Writing terse prompts alone saves input tokens; output tokens are only reduced if the model is instructed to respond tersely.
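To make B5 and C2 concrete, a compressed always-on instructions file can stay under ~50 tokens. An illustrative `.github/copilot-instructions.md` (the stack details are invented for the example):

```markdown
Respond tersely: code first, minimal prose, no restating the question.
Prefer diffs over full-file rewrites.
Stack: TypeScript strict, vitest, pnpm.
Landmine: never edit files under generated/.
```

Because this file rides along on every request, each token trimmed here is saved on every interaction, not just once.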

The Big Winners

If you do nothing else, do these seven. Ranked by impact-to-effort ratio:

  1. Caveman-speak — 30-50% input token savings; combine with B5 for 40-55% output savings
  2. Precise prompts — 30-60% savings, just a habit change
  3. Code-only / constrain output — 40-80% output savings, one instruction
  4. Shrink always-on context (copilot-instructions.md + AGENTS.md) — compress filler, prune to landmines only, delete LLM-generated boilerplate. Compounds on every interaction and agent step; 20-23% agent-task reduction plus better correctness
  5. Ask Mode for simple questions — 60-90% savings by avoiding Agent overhead
  6. Audit MCP servers — disable unused servers, save 5K-190K tokens per agent task
  7. Retune prompts to target model guide — not a per-request shrink; improves first-pass quality and avoids rework after model changes
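For item 6, the audit is usually just a config edit. A trimmed workspace MCP config might look like this after removing unused servers (the file path follows VS Code's `.vscode/mcp.json` convention; the remaining server entry is illustrative):

```json
{
  "servers": {
    "github": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-github"]
    }
  }
}
```

Each server removed here stops shipping its tool schemas with every agent request, which is where the 5K-190K per-task savings come from.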

3.4 Quality Impact Assessment

Does compression hurt output quality? The research says: rarely, and only at extreme levels.

| Compression Level | Quality Impact | Evidence |
|---|---|---|
| Lite (drop filler) | None | Models trained on diverse text understand clean prose |
| Full (drop articles, fragments) | Negligible | Models handle fragments well; technical terms preserved |
| Ultra (abbreviations, arrows) | Minor risk | Complex multi-step instructions may be misread |
| Wenyan (Classical Chinese) | Moderate risk | Models understand wenyan less reliably than English |
| Extreme (single words only) | Significant risk | Ambiguity increases; model may misinterpret |

The threshold: When you find yourself re-explaining or getting wrong results, you've compressed too far. Back off one level.

Model-specific notes: All major models (GPT-4, Claude, Gemini) handle caveman-full well. Ultra works for experienced users who know the domain. Wenyan is unreliable for code generation tasks.

Diminishing Returns

The savings curve is not linear. The first 30% of compression (dropping filler) is free. The next 20% (fragments, abbreviations) is nearly free. Beyond that, each additional compression point risks quality.

Savings vs. Quality Risk:

Quality  ████████████████████████████████████░░░░░░░░░
Risk     ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░███████████████
         0%       20%      40%      60%      80%
                      Token Savings →
               lite     full      ultra    extreme

Sweet spot: full caveman (30-50% input token savings; 40-55% output savings with terse system instructions). Maximum return, negligible risk.


Next: Practical Setup →