STL Token-Efficiency Benchmark
tiktoken cl100k_base.
Three experiments, 16 samples, fully reproducible. Run date: 2026-05-13.
Methodology
- Tokenizer: tiktoken cl100k_base — frozen vocabulary, byte-level BPE family shared with modern Claude / GPT tokenizers
- Baselines: JSON-pretty (indent=2, what LLMs emit) vs JSON-minified (separators=(",",":"))
- Reproducibility: cl100k_base is a frozen BPE vocab, so results are identical across machines and runs
Experiment 01 — STL vs JSON
Six samples, each expressed three equivalent ways. Token counts measured directly.
| Sample | STL | JSON-pretty | JSON-min | Δ pretty | Δ min |
|---|---|---|---|---|---|
| 1. Minimal edge | 9 | 18 | 11 | +100.0% | +22.2% |
| 2. Simple typed edge | 23 | 43 | 25 | +87.0% | +8.7% |
| 3. Causal claim | 38 | 64 | 40 | +68.4% | +5.3% |
| 4. Empirical lesson | 54 | 81 | 57 | +50.0% | +5.6% |
| 5. CJK anchors | 52 | 78 | 54 | +50.0% | +3.8% |
| 6. 3-edge chain | 110 | 190 | 118 | +72.7% | +7.3% |
| TOTAL | 286 | 474 | 305 | +65.7% | +6.6% |
Why STL wins — four cost drivers
| Cost driver | JSON | STL |
|---|---|---|
| Quoted keys | "confidence": 0.95 | confidence=0.95 |
| Modifier wrapping | "modifiers": { ... } | single ::mod(...) call |
| Directionality | two keys ("source", "target") | one arrow token → |
| Multi-edge framing | array brackets + per-object braces | newline-separated |
Samples dominated by long free-text values (description or lesson) see
smaller relative savings: the text bytes themselves tokenize identically in both formats.
The structural advantage shows up best on edge-dense, value-light data.
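The two JSON baselines are produced with the standard `json` module. A minimal sketch, using an illustrative edge dict (not one of the benchmark samples):

```python
import json

# Hypothetical edge record mirroring the benchmark's sample shape.
edge = {
    "source": "OAuth_Refresh_Token",
    "target": "Auth_Failure",
    "modifiers": {"rule": "empirical", "confidence": 0.95},
}

# JSON-pretty: indent=2, the form LLMs typically emit.
pretty = json.dumps(edge, indent=2)

# JSON-minified: separators=(",", ":") removes all optional whitespace.
minified = json.dumps(edge, separators=(",", ":"))

print(len(pretty), len(minified))  # minified is always the shorter byte string
```

Token counts in the tables come from running both strings through tiktoken's cl100k_base encoding.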
Sample #4 audit — same edge, three forms
The OAuth refresh-token lesson, written as STL and as both JSON forms. Token counts measured directly.
STL
[OAuth_Refresh_Token] → [Auth_Failure] ::mod( rule="empirical", confidence=0.95, lesson="Testing-mode refresh tokens expire after 7 days; must publish app to Production", occurred_time="2026-04-23" )
JSON-pretty
{
"source": "OAuth_Refresh_Token",
"target": "Auth_Failure",
"modifiers": {
"rule": "empirical",
"confidence": 0.95,
"lesson": "Testing-mode refresh tokens expire after 7 days; must publish app to Production",
"occurred_time": "2026-04-23"
}
}
JSON-minified
{"source":"OAuth_Refresh_Token","target":"Auth_Failure","modifiers":{"rule":"empirical","confidence":0.95,"lesson":"Testing-mode refresh tokens expire after 7 days; must publish app to Production","occurred_time":"2026-04-23"}}
Same semantic content; 54 / 81 / 57 tokens. STL beats both JSON forms while staying the most human-readable.
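The STL surface form can be emitted mechanically from the same dict. A minimal sketch, with the grammar inferred from the sample above rather than taken from a reference implementation:

```python
def to_stl(edge: dict) -> str:
    """Render an edge dict as one STL line (sketch; grammar inferred
    from the sample above, not a reference implementation)."""
    parts = []
    for key, value in edge.get("modifiers", {}).items():
        # Strings keep their quotes; numbers and other literals do not.
        parts.append(f'{key}="{value}"' if isinstance(value, str) else f"{key}={value}")
    mod_str = f" ::mod( {', '.join(parts)} )" if parts else ""
    return f"[{edge['source']}] → [{edge['target']}]{mod_str}"

edge = {
    "source": "OAuth_Refresh_Token",
    "target": "Auth_Failure",
    "modifiers": {"rule": "empirical", "confidence": 0.95},
}
print(to_stl(edge))
# [OAuth_Refresh_Token] → [Auth_Failure] ::mod( rule="empirical", confidence=0.95 )
```

Note how every JSON cost driver from the table disappears: no quoted keys, no modifiers wrapper, and directionality collapses into the single arrow token.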
Experiment 02 — Quote-removal null result
Hypothesis: Removing the quote pair around modifier string values should save tokens. Finding: Savings ≈ 0% (−0.4% across 5 samples; effectively noise).
| Sample | A: quoted | B: hybrid | C: none | B vs A | C vs A |
|---|---|---|---|---|---|
| Simple typed | 23 | 23 | 23 | +0.0% | +0.0% |
| Causal | 38 | 38 | 38 | +0.0% | +0.0% |
| Empirical lesson | 53 | 53 | 53 | +0.0% | +0.0% |
| Long description | 44 | 44 | 43 | +0.0% | −2.3% |
| CJK heavy | 88 | 88 | 88 | +0.0% | +0.0% |
| TOTAL | 246 | 246 | 245 | +0.0% | −0.4% |
Why — BPE has already merged =" into one token
=" is a single BPE token in cl100k_base —
the pattern is ubiquitous in JSON/Python/JS training data, so the tokenizer learned it as one piece.
Dropping the opening quote saves zero because there was no separate " token to drop.
The closing quote sometimes saves 1 token, but that gain is offset by the ambiguity it
introduces: unquoted free-text values containing `,`, `=`, or `)` would break parsing.
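The ambiguity is easy to demonstrate with a naive comma-splitting parser (a hypothetical sketch, not the real STL parser):

```python
def naive_fields(body: str) -> list:
    """Split a modifier body on commas, trimming surrounding whitespace."""
    return [p.strip() for p in body.split(",")]

# With quotes dropped, a comma inside a free-text value is
# indistinguishable from a field separator.
unquoted = "rule=empirical, lesson=expired, then rotated"
print(naive_fields(unquoted))
# ['rule=empirical', 'lesson=expired', 'then rotated']  (3 fields instead of 2)
```

A quote-aware parser can recover the quoted form, but once the quotes are gone no parser can tell the two readings apart. Zero savings plus lost parsability makes quote removal a clear no.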
Recorded as BPE_Equals_Quote_Merge_Insight.
Experiment 03 — §4.2 default value omission
Once a parser can fill in well-defined defaults (confidence, rule, strength),
you can omit those fields from the surface form. Five representative edge styles measured:
| Edge style | Before | After | Saving |
|---|---|---|---|
| Definitional (is_a) | 28 | 16 | −42.9% |
| Causal | 34 | 29 | −14.7% |
| Empirical (with lesson) | 53 | 48 | −9.4% |
| Definitional + description | 35 | 30 | −14.3% |
| Role / spec | 38 | 32 | −15.8% |
| TOTAL | 188 | 155 | −17.6% |
Headline number cited in STL_Operational_Protocol.md §4.2.5.
Definitional edges save the most (−42.9%) because their rule field is the most predictable.
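Default omission reduces to a dict filter before serialization. A minimal sketch; the default values below are illustrative assumptions, and the authoritative table lives in STL_Operational_Protocol.md §4.2:

```python
# Hypothetical per-edge-type defaults (illustrative only; see
# STL_Operational_Protocol.md §4.2 for the authoritative table).
DEFAULTS = {
    "is_a": {"rule": "definitional", "confidence": 1.0},
    "causes": {"rule": "plausible", "confidence": 0.8},
}

def omit_defaults(edge_type: str, modifiers: dict) -> dict:
    """Drop every modifier whose value matches the edge type's default."""
    defaults = DEFAULTS.get(edge_type, {})
    return {k: v for k, v in modifiers.items() if defaults.get(k) != v}

mods = {"rule": "definitional", "confidence": 1.0, "description": "core taxonomy link"}
print(omit_defaults("is_a", mods))
# {'description': 'core taxonomy link'}
```

The parser reverses the filter on read, so the surface form shrinks while the reconstructed edge stays identical.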
How to reproduce
One command per experiment
# Requires tiktoken. The STG repo's venv already has it:
~/.stg/venv/bin/python 01_stl_vs_json.py
~/.stg/venv/bin/python 02_quote_removal.py
~/.stg/venv/bin/python 03_default_omission.py

# Or install fresh:
pip install tiktoken
python 01_stl_vs_json.py
All scripts are deterministic. cl100k_base is a frozen vocabulary,
so results are identical across machines and runs.
File layout
research/token-efficiency/
├── README.md               # source doc
├── 01_stl_vs_json.py       # STL-vs-JSON benchmark
├── 02_quote_removal.py     # quote-removal null-result experiment
├── 03_default_omission.py  # §4.2 token-savings measurement
└── results_2026-05-13.txt  # full output snapshot of first run
Tokenizer caveats
This benchmark uses cl100k_base (GPT-4 family). Reasoning:
- Publicly available. No API key required to reproduce.
- Same BPE family as modern Claude / GPT tokenizers — all are byte-level BPE trained on overlapping web corpora.
- Stable. The vocabulary is frozen; no version drift between measurements.
Citation
"STL saves 39.7% tokens vs JSON-pretty / 6.2% vs JSON-minified; §4.2 default-value omission saves an additional ~17.6% average. Measured 2026-05-13 with tiktoken cl100k_base. Source: scos-lab/semantic-tension-language/research/token-efficiency/"