STL Token-Efficiency Benchmark

Empirical measurements with tiktoken cl100k_base. Three experiments, 16 samples, fully reproducible. Run date: 2026-05-13.
39.7%   STL saves vs JSON-pretty · the form LLMs actually produce when asked for structured output
 6.2%   STL saves vs JSON-minified · comparison against the mathematical floor of "fair" JSON
17.6%   additional save — §4.2 defaults · when the parser fills in default modifier values

Methodology

Tokenizer     tiktoken cl100k_base — frozen vocabulary, BPE-byte family shared with modern Claude / GPT tokenizers
Samples       6 representative edges spanning minimal → complex, EN + CN, single edge → 3-edge chain
Compared      STL native vs JSON-pretty (indent=2, what LLMs emit) vs JSON-minified (separators=(",",":"))
Excluded      ultra-minified JSON with 1-char keys — unfair, sacrifices semantic clarity
Determinism   cl100k_base is a frozen BPE vocab — results identical across machines and runs
Run env       Python 3.12.3 · tiktoken 0.12.0 · 2026-05-13T01:00:10Z
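
All three scripts reduce to the same counting primitive. A minimal sketch of it (the function name here is illustrative, not the scripts' actual internals):

import tiktoken

# Frozen vocabulary: encode() returns the same ids on every machine and run.
enc = tiktoken.get_encoding("cl100k_base")

def count_tokens(text: str) -> int:
    return len(enc.encode(text))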

Experiment 01 — STL vs JSON

Six samples, each expressed three equivalent ways. Token counts measured directly.

Sample                   STL   JSON-pretty   JSON-min   Δ pretty    Δ min
1. Minimal edge            9            18         11    +100.0%   +22.2%
2. Simple typed edge      23            43         25     +87.0%    +8.7%
3. Causal claim           38            64         40     +68.4%    +5.3%
4. Empirical lesson       54            81         57     +50.0%    +5.6%
5. CJK anchors            52            78         54     +50.0%    +3.8%
6. 3-edge chain          110           190        118     +72.7%    +7.3%
TOTAL                    286           474        305     +65.7%    +6.6%
STL          286 tokens · baseline
JSON-min     305 tokens · +6.6%
JSON-pretty  474 tokens · +65.7%
Bar lengths proportional to token counts. JSON-pretty is the realistic LLM-output baseline.
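
Note the two ways these totals get quoted: the Δ columns express JSON overhead relative to STL, while the headline cards at the top express STL savings relative to JSON. Both follow from the same TOTAL row:

stl, pretty, minified = 286, 474, 305   # TOTAL row above
print((pretty - stl) / stl)             # 0.657 -> the "+65.7%" Δ-pretty column
print((pretty - stl) / pretty)          # 0.397 -> headline "STL saves 39.7%"
print((minified - stl) / minified)      # 0.062 -> headline "6.2% vs JSON-minified"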

Why STL wins — four cost drivers

Cost driver          JSON                                  STL
Quoted keys          "confidence": 0.95                    confidence=0.95
Modifier wrapping    "modifiers": { ... }                  single ::mod(...) call
Directionality       two keys ("source", "target")         one arrow token
Multi-edge framing   array brackets + per-object braces    newline-separated
Honest caveat. Savings scale with structural overhead share. Edges dominated by long free-text values (a 500-word description or lesson) see smaller relative savings — the text bytes themselves tokenize identically in both formats. The structural advantage shows up best on edge-dense, value-light data.
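
The caveat is easy to check directly: a free-text value costs the same tokens no matter what wraps it, so only the wrapper is in play. For instance, with the sample #4 lesson string from the audit below:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
lesson = ("Testing-mode refresh tokens expire after 7 days; "
          "must publish app to Production")
# This count is paid identically by STL, JSON-pretty, and JSON-minified;
# the formats differ only in the structural tokens around the value.
print(len(enc.encode(lesson)))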

Sample #4 audit — same edge, three forms

The OAuth refresh-token lesson, written as STL and as both JSON forms. Token counts measured directly.

STL

[OAuth_Refresh_Token] → [Auth_Failure]
::mod(
  rule="empirical",
  confidence=0.95,
  lesson="Testing-mode refresh tokens expire after 7 days; must publish app to Production",
  occurred_time="2026-04-23"
)
197 chars · 54 tokens

JSON-pretty

{
  "source": "OAuth_Refresh_Token",
  "target": "Auth_Failure",
  "modifiers": {
    "rule": "empirical",
    "confidence": 0.95,
    "lesson": "Testing-mode refresh tokens expire after 7 days; must publish app to Production",
    "occurred_time": "2026-04-23"
  }
}
267 chars · 81 tokens

JSON-minified

{"source":"OAuth_Refresh_Token","target":"Auth_Failure","modifiers":{"rule":"empirical","confidence":0.95,"lesson":"Testing-mode refresh tokens expire after 7 days; must publish app to Production","occurred_time":"2026-04-23"}}
227 chars · 57 tokens

Same semantic content; 54 / 81 / 57 tokens. STL beats both JSON forms while staying the most human-readable.
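
The two JSON counts follow mechanically from the methodology (indent=2 for pretty, separators=(",",":") for minified). A standalone check along the lines of what 01_stl_vs_json.py measures:

import json
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
edge = {
    "source": "OAuth_Refresh_Token",
    "target": "Auth_Failure",
    "modifiers": {
        "rule": "empirical",
        "confidence": 0.95,
        "lesson": "Testing-mode refresh tokens expire after 7 days; "
                  "must publish app to Production",
        "occurred_time": "2026-04-23",
    },
}
pretty = json.dumps(edge, indent=2)
minified = json.dumps(edge, separators=(",", ":"))
# Expected per the audit above: 81 and 57.
print(len(enc.encode(pretty)), len(enc.encode(minified)))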

Experiment 02 — Quote-removal null result

Hypothesis: Removing the quote pair around modifier string values should save tokens. Finding: Savings ≈ 0% (−0.4% across 5 samples; effectively noise).

Sample             A: quoted   B: hybrid   C: none   B vs A   C vs A
Simple typed              23          23        23    +0.0%    +0.0%
Causal                    38          38        38    +0.0%    +0.0%
Empirical lesson          53          53        53    +0.0%    +0.0%
Long description          44          44        43    +0.0%    −2.3%
CJK heavy                 88          88        88    +0.0%    +0.0%
TOTAL                    246         246       245    +0.0%    −0.4%

Why — BPE has already merged =" into one token

A: quoted   rule="causal"
            rule · =" · ca · usal · "
            5 tokens

C: none     rule=causal
            rule · = · ca · usal
            4 tokens (saved 1: the closing quote; the opening quote cost nothing, since it was never a separate token)

The merge: =" is a single BPE token in cl100k_base — the pattern is ubiquitous in JSON/Python/JS training data, so the tokenizer learned it as one piece. Dropping the opening quote saves zero because there was no separate " token to drop. The closing quote sometimes saves 1 token, but that gain is offset by new ambiguity: free-text values containing , or = or ) would break parsing.
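
The merge is directly observable by decoding each token id back to its substring; the =" piece shows up as one unit:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for s in ('rule="causal"', "rule=causal"):
    ids = enc.encode(s)
    # Experiment 02 reported 5 tokens for the quoted form, 4 for the bare one.
    print(len(ids), [enc.decode([i]) for i in ids])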
Lesson kept in STG. Character-count intuition does not transfer to token cost. Always verify with a tokenizer before changing a wire format. This null result killed a tempting protocol change before it was made — see STG node BPE_Equals_Quote_Merge_Insight.

Experiment 03 — §4.2 default value omission

Once a parser can fill in well-defined defaults (confidence, rule, strength), you can omit those fields from the surface form. Five representative edge styles measured:

Edge style                   Before   After   Saving
Definitional (is_a)              28      16   −42.9%
Causal                           34      29   −14.7%
Empirical (with lesson)          53      48    −9.4%
Definitional + description       35      30   −14.3%
Role / spec                      38      32   −15.8%
TOTAL                           188     155   −17.6%

Headline number cited in STL_Operational_Protocol.md §4.2.5. Definitional edges save the most (−42.9%) because their rule field is the most predictable.
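
The emit-side logic is a one-liner once a defaults table exists. A sketch, with placeholder default values (the authoritative table lives in STL_Operational_Protocol.md §4.2):

# Placeholder defaults for illustration only; see §4.2 for the real values.
DEFAULTS = {"rule": "is_a", "confidence": 1.0, "strength": 1.0}

def strip_defaults(modifiers: dict) -> dict:
    # Drop any field the parser will reconstruct from the defaults table.
    return {k: v for k, v in modifiers.items() if DEFAULTS.get(k) != v}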

How to reproduce

One command per experiment
# Requires tiktoken. The STG repo's venv already has it:
~/.stg/venv/bin/python 01_stl_vs_json.py
~/.stg/venv/bin/python 02_quote_removal.py
~/.stg/venv/bin/python 03_default_omission.py

# Or install fresh:
pip install tiktoken
python 01_stl_vs_json.py

All scripts are deterministic. cl100k_base is a frozen vocabulary, so results are identical across machines and runs.

File layout
research/token-efficiency/
├── README.md                   # source doc
├── 01_stl_vs_json.py           # STL-vs-JSON benchmark
├── 02_quote_removal.py         # quote-removal null-result experiment
├── 03_default_omission.py      # §4.2 token-savings measurement
└── results_2026-05-13.txt      # full output snapshot of first run

Tokenizer caveats

This benchmark uses cl100k_base (GPT-4 family). Reasoning:

  • Publicly available. No API key required to reproduce.
  • Same BPE family as modern Claude / GPT tokenizers — all are byte-level BPE trained on overlapping web corpora.
  • Stable. The vocabulary is frozen; no version drift between measurements.
Anthropic does not publish Claude's exact tokenizer, so absolute token counts may differ by ±5%. The relative ratios (STL vs JSON, defaults vs explicit) are stable across BPE families because the same structural patterns get merged in all of them.
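
The ratio-stability claim can be spot-checked against a second public vocabulary: tiktoken also ships o200k_base (GPT-4o family). Absolute counts shift; the structural ratio should barely move. A sketch with an illustrative (not benchmarked) edge:

import tiktoken

stl_text = '[A] → [B]\n::mod(rule="causal", confidence=0.9)'  # illustrative edge
json_text = '{"source":"A","target":"B","modifiers":{"rule":"causal","confidence":0.9}}'

for name in ("cl100k_base", "o200k_base"):
    enc = tiktoken.get_encoding(name)
    s, j = len(enc.encode(stl_text)), len(enc.encode(json_text))
    print(f"{name}: STL={s} JSON-min={j} overhead={(j - s) / s:+.1%}")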

Citation

When citing these numbers in spec changes, STG nodes, or other agent context:
"STL saves 39.7% tokens vs JSON-pretty / 6.2% vs JSON-minified;
 §4.2 default-value omission saves an additional ~17.6% average.
 Measured 2026-05-13 with tiktoken cl100k_base.
 Source: scos-lab/semantic-tension-language/research/token-efficiency/"