
2.2 Natural Language Choice — Checking the CJK Assumption



2.2.1 The Token Economics of Human Languages

There's a persistent assumption: "Write prompts in Chinese or Japanese — fewer characters means fewer tokens."

Measured in actual tokens, that assumption does not hold up.

Here is a sample comparison. Test sentence: "I met a huge dog" (and equivalents):

| Language | Sentence | Characters | Tokens | vs. English |
| --- | --- | --- | --- | --- |
| English | I met a huge dog | 16 | 5 | 1.0x (baseline) |
| Spanish | Conocí a un perro enorme | 24 | 8 | 1.6x |
| Polish | Spotkałem ogromnego psa | 23 | 8 | 1.6x |
| Icelandic | Ég hitti risastóran hund | 24 | 10 | 2.0x |
| Chinese | 我遇见了一只大狗 | 8 | 11 | 2.2x |
| Japanese | 大きな犬に出会った | 9 | 11 | 2.2x |
| Russian | Я встретил огромную собаку | 26 | 14 | 2.8x |
| Hebrew | פגשתי כלב ענק | 13 | 16 | 3.2x |

Chinese uses half as many characters as English (8 vs. 16), yet it costs more than twice as many tokens (11 vs. 5).
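You can reproduce these counts yourself. Here is a minimal sketch using OpenAI's tiktoken library with its cl100k_base encoding (an assumption on my part: the table above does not say which tokenizer produced its counts). Exact numbers vary between vocabularies, but the ranking stays roughly stable:

```python
# Count tokens for a subset of the table's sample sentences.
# Requires: pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # GPT-4-era BPE vocabulary

samples = {
    "English":  "I met a huge dog",
    "Spanish":  "Conocí a un perro enorme",
    "Chinese":  "我遇见了一只大狗",
    "Japanese": "大きな犬に出会った",
    "Russian":  "Я встретил огромную собаку",
}

baseline = len(enc.encode(samples["English"]))
for lang, text in samples.items():
    n = len(enc.encode(text))
    # Character counts use Python's len(), which counts code points.
    print(f"{lang:10s} {len(text):3d} chars  {n:3d} tokens  {n / baseline:.1f}x")
```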

2.2.2 Why This Happens

BPE tokenizers are trained on data that is majority English. English text gets the best compression ratios because:

  1. Common English words map to a single token; the tokenizer has dedicated vocabulary entries for them.
  2. CJK characters cost roughly 1-1.4 tokens each (not 2-3, as commonly claimed). Each CJK character is 3 bytes in UTF-8, and those byte sequences appear less often in training data. That is still about 5-7x more expensive per character than English (~0.2 tokens per character), which is why 8 Chinese characters cost 11 tokens while 16 English characters cost only 5 (see the byte-level sketch after this list).
  3. Non-Latin alphabets (Cyrillic, Hebrew, Arabic) fare even worse, at 2.5-3.2x English.
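Point 2 is easy to see at the byte level. A short sketch (plain Python, no tokenizer needed) of why a CJK character is a heavier input to a byte-level BPE than a Latin letter:

```python
# Each Latin letter is 1 byte in UTF-8; each CJK character is 3 bytes.
# Byte-level BPE merges frequent byte sequences into single tokens, and
# English sequences dominate the training data, so "dog" merges into one
# token while 狗 typically ends up as 1-2 tokens on its own.
for ch in "dog狗犬":
    b = ch.encode("utf-8")
    print(f"{ch!r}: {len(b)} byte(s) -> {b!r}")
```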

2.2.3 Large-Scale Data

Across larger text samples (Capodieci/Castillo research):

| Language | Avg Token Cost vs. English | Characters per Token |
| --- | --- | --- |
| English | 1.0x | 4.75 |
| Spanish | ~1.3-1.6x | ~3.5 |
| German | ~1.4-1.6x | ~3.2 |
| Mandarin Chinese | ~1.76x | 1.33 |
| Japanese | ~2.12x | 1.41 |
| Korean | ~2.36x | ~1.2 |
| Russian | ~2.5-2.8x | ~2.0 |
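The characters-per-token column is straightforward to reproduce on your own text samples. A sketch, again assuming tiktoken's cl100k_base (the Capodieci/Castillo figures may come from a different tokenizer, so expect somewhat different absolute values):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def chars_per_token(text: str) -> float:
    """Average number of characters packed into each token of `text`."""
    return len(text) / len(enc.encode(text))

# Single sentences are noisy; run this over a large sample per language
# to get ratios comparable to the table above.
print(round(chars_per_token("I met a huge dog"), 2))  # English: several chars/token
print(round(chars_per_token("我遇见了一只大狗"), 2))    # Chinese: near 1 char/token
```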

2.2.4 What About Classical Chinese (Wenyan 文言文)?

Classical Chinese is extraordinarily information-dense. Where modern Chinese might use 8 characters, wenyan might use 4. But the tokenizer doesn't care about information density — it cares about byte sequences.

Example — "Explain database connection pooling":

| Mode | Text | ~Tokens |
| --- | --- | --- |
| English (full caveman) | "Pool reuse open DB conn. Skip handshake → fast." | ~12 |
| Wenyan-full | "池reuse conn。skip handshake → fast。" | ~15 |
| Wenyan-ultra | "池reuse conn。skip→fast。" | ~12 |

At best, wenyan-ultra ties with terse English. At worst, it costs more. And the model understands English better because its training data is English-dominated.

Verdict: Wenyan modes are interesting for creative/educational purposes. They are not recommended for actual token optimization. Use terse English instead.

2.2.5 Practical Takeaways

  1. Always use English for prompts and instructions. It's roughly 1.3-3x more token-efficient than other languages.
  2. Don't Google Translate your prompts into CJK languages hoping to save tokens; you'll spend more.
  3. If you're bilingual, you might write more concisely in your native language, but the token cost will still be higher.
  4. Code output stays in English regardless of prompt language: variable names, comments, and docs will be in English.
  5. Transliteration helps for non-Latin scripts. Writing Russian in Latin characters cuts tokens by roughly 20-45% (see the table below).

The Transliteration Effect

Converting non-Latin scripts to Latin characters helps:

| Script | Sentence | Tokens | vs. Native |
| --- | --- | --- | --- |
| Russian (Cyrillic) | Я встретил огромную собаку | 14 | baseline |
| Russian (Transliterated) | Ya vstretil ogromnuyu sobaku | 11 | 21% cheaper |
| Hebrew (Native) | פגשתי כלב ענק | 16 | baseline |
| Hebrew (Transliterated) | pgSti klv 3nq | 9 | 44% cheaper |
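If you want to automate the transliteration step, here is a sketch using the third-party unidecode package for ASCII romanization plus tiktoken for counting. Both libraries are assumptions: the source doesn't say how its transliterations were produced, and unidecode's output may differ slightly from the hand-rolled forms above.

```python
# Requires: pip install tiktoken unidecode
import tiktoken
from unidecode import unidecode

enc = tiktoken.get_encoding("cl100k_base")

native = "Я встретил огромную собаку"
latin = unidecode(native)  # ASCII romanization, e.g. "Ya vstretil ogromnuyu sobaku"

print(latin)
print(len(enc.encode(native)), "tokens (Cyrillic)")
print(len(enc.encode(latin)), "tokens (transliterated)")
```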

Key Finding

English is the most token-efficient language for LLM prompts. Period.

CJK languages use fewer characters, but each character costs ~1-1.4 tokens. The net result is 1.7-2.4x MORE tokens than English for the same meaning.

Don't write prompts in Chinese hoping to save tokens. You'll spend more.


Next: Context Management →