STL Token-Efficiency Benchmark

Empirical measurements with tiktoken cl100k_base. Three experiments, 16 samples, fully reproducible. Run date: 2026-05-13.
39.7%   STL saves vs JSON-pretty · the form LLMs actually produce when asked for structured output
 6.2%   STL saves vs JSON-minified · comparison against the mathematical floor of "fair" JSON
17.6%   additional save — §4.2 defaults · when the parser fills in default modifier values

Methodology

Tokenizer     tiktoken cl100k_base — frozen vocabulary, BPE-byte family shared with modern Claude / GPT tokenizers
Samples       6 representative edges spanning minimal → complex, EN + CN, single edge → 3-edge chain
Compared      STL native vs JSON-pretty (indent=2, what LLMs emit) vs JSON-minified (separators=(",",":"))
Excluded      ultra-minified JSON with 1-char keys — unfair, sacrifices semantic clarity
Determinism   cl100k_base is a frozen BPE vocab — results identical across machines and runs
Run env       Python 3.12.3 · tiktoken 0.12.0 · 2026-05-13T01:00:10Z
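
All three scripts reduce to the same counting primitive. A minimal sketch of it (the function name here is illustrative, not the scripts' actual internals):

import tiktoken

# Frozen vocabulary: encode() returns the same ids on every machine and run.
enc = tiktoken.get_encoding("cl100k_base")

def count_tokens(text: str) -> int:
    return len(enc.encode(text))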

Experiment 01 — STL vs JSON

Six samples, each expressed three equivalent ways. Token counts measured directly.

Sample                   STL   JSON-pretty   JSON-min   Δ pretty    Δ min
1. Minimal edge            9            18         11    +100.0%   +22.2%
2. Simple typed edge      23            43         25     +87.0%    +8.7%
3. Causal claim           38            64         40     +68.4%    +5.3%
4. Empirical lesson       54            81         57     +50.0%    +5.6%
5. CJK anchors            52            78         54     +50.0%    +3.8%
6. 3-edge chain          110           190        118     +72.7%    +7.3%
TOTAL                    286           474        305     +65.7%    +6.6%
STL          286 tokens · baseline
JSON-min     305 tokens · +6.6%
JSON-pretty  474 tokens · +65.7%
Bar lengths proportional to token counts. JSON-pretty is the realistic LLM-output baseline.
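
Note the two ways these totals get quoted: the Δ columns express JSON overhead relative to STL, while the headline cards at the top express STL savings relative to JSON. Both follow from the same TOTAL row:

stl, pretty, minified = 286, 474, 305   # TOTAL row above
print((pretty - stl) / stl)             # 0.657 -> the "+65.7%" Δ-pretty column
print((pretty - stl) / pretty)          # 0.397 -> headline "STL saves 39.7%"
print((minified - stl) / minified)      # 0.062 -> headline "6.2% vs JSON-minified"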

Why STL wins — four cost drivers

Cost driver          JSON                                  STL
Quoted keys          "confidence": 0.95                    confidence=0.95
Modifier wrapping    "modifiers": { ... }                  single ::mod(...) call
Directionality       two keys ("source", "target")         one arrow token
Multi-edge framing   array brackets + per-object braces    newline-separated
Honest caveat. Savings scale with structural overhead share. Edges dominated by long free-text values (a 500-word description or lesson) see smaller relative savings — the text bytes themselves tokenize identically in both formats. The structural advantage shows up best on edge-dense, value-light data.
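
The caveat is easy to check directly: a free-text value costs the same tokens no matter what wraps it, so only the wrapper is in play. For instance, with the sample #4 lesson string from the audit below:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
lesson = ("Testing-mode refresh tokens expire after 7 days; "
          "must publish app to Production")
# This count is paid identically by STL, JSON-pretty, and JSON-minified;
# the formats differ only in the structural tokens around the value.
print(len(enc.encode(lesson)))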

Sample #4 audit — same edge, three forms

The OAuth refresh-token lesson, written as STL and as both JSON forms. Token counts measured directly.

STL

[OAuth_Refresh_Token] → [Auth_Failure]
::mod(
  rule="empirical",
  confidence=0.95,
  lesson="Testing-mode refresh tokens expire after 7 days; must publish app to Production",
  occurred_time="2026-04-23"
)
197 chars · 54 tokens

JSON-pretty

{
  "source": "OAuth_Refresh_Token",
  "target": "Auth_Failure",
  "modifiers": {
    "rule": "empirical",
    "confidence": 0.95,
    "lesson": "Testing-mode refresh tokens expire after 7 days; must publish app to Production",
    "occurred_time": "2026-04-23"
  }
}
267 chars · 81 tokens

JSON-minified

{"source":"OAuth_Refresh_Token","target":"Auth_Failure","modifiers":{"rule":"empirical","confidence":0.95,"lesson":"Testing-mode refresh tokens expire after 7 days; must publish app to Production","occurred_time":"2026-04-23"}}
227 chars · 57 tokens

Same semantic content; 54 / 81 / 57 tokens. STL beats both JSON forms while staying the most human-readable.
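
The two JSON counts follow mechanically from the methodology (indent=2 for pretty, separators=(",",":") for minified). A standalone check along the lines of what 01_stl_vs_json.py measures:

import json
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
edge = {
    "source": "OAuth_Refresh_Token",
    "target": "Auth_Failure",
    "modifiers": {
        "rule": "empirical",
        "confidence": 0.95,
        "lesson": "Testing-mode refresh tokens expire after 7 days; "
                  "must publish app to Production",
        "occurred_time": "2026-04-23",
    },
}
pretty = json.dumps(edge, indent=2)
minified = json.dumps(edge, separators=(",", ":"))
# Expected per the audit above: 81 and 57.
print(len(enc.encode(pretty)), len(enc.encode(minified)))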

Experiment 02 — Quote-removal null result

Hypothesis: Removing the quote pair around modifier string values should save tokens. Finding: Savings ≈ 0% (−0.4% across 5 samples; effectively noise).

Sample             A: quoted   B: hybrid   C: none   B vs A   C vs A
Simple typed              23          23        23    +0.0%    +0.0%
Causal                    38          38        38    +0.0%    +0.0%
Empirical lesson          53          53        53    +0.0%    +0.0%
Long description          44          44        43    +0.0%    −2.3%
CJK heavy                 88          88        88    +0.0%    +0.0%
TOTAL                    246         246       245    +0.0%    −0.4%

Why — BPE has already merged =" into one token

A: quoted   rule="causal"
            rule · =" · ca · usal · "
            5 tokens

C: none     rule=causal
            rule · = · ca · usal
            4 tokens (saved 1: the closing quote; the opening quote cost nothing, since it was never a separate token)

The merge: =" is a single BPE token in cl100k_base — the pattern is ubiquitous in JSON/Python/JS training data, so the tokenizer learned it as one piece. Dropping the opening quote saves zero because there was no separate " token to drop. The closing quote sometimes saves 1 token, but that gain is offset by new ambiguity: free-text values containing , or = or ) would break parsing.
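
The merge is directly observable by decoding each token id back to its substring; the =" piece shows up as one unit:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for s in ('rule="causal"', "rule=causal"):
    ids = enc.encode(s)
    # Experiment 02 reported 5 tokens for the quoted form, 4 for the bare one.
    print(len(ids), [enc.decode([i]) for i in ids])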
Lesson kept in STG. Character-count intuition does not transfer to token cost. Always verify with a tokenizer before changing a wire format. This null result killed a tempting protocol change before it was made — see STG node BPE_Equals_Quote_Merge_Insight.

Experiment 03 — §4.2 default value omission

Once a parser can fill in well-defined defaults (confidence, rule, strength), you can omit those fields from the surface form. Five representative edge styles measured:

Edge style                   Before   After   Saving
Definitional (is_a)              28      16   −42.9%
Causal                           34      29   −14.7%
Empirical (with lesson)          53      48    −9.4%
Definitional + description       35      30   −14.3%
Role / spec                      38      32   −15.8%
TOTAL                           188     155   −17.6%

Headline number cited in STL_Operational_Protocol.md §4.2.5. Definitional edges save the most (−42.9%) because their rule field is the most predictable.
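
The emit-side logic is a one-liner once a defaults table exists. A sketch, with placeholder default values (the authoritative table lives in STL_Operational_Protocol.md §4.2):

# Placeholder defaults for illustration only; see §4.2 for the real values.
DEFAULTS = {"rule": "is_a", "confidence": 1.0, "strength": 1.0}

def strip_defaults(modifiers: dict) -> dict:
    # Drop any field the parser will reconstruct from the defaults table.
    return {k: v for k, v in modifiers.items() if DEFAULTS.get(k) != v}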

How to reproduce

One command per experiment
# Requires tiktoken. The STG repo's venv already has it:
~/.stg/venv/bin/python 01_stl_vs_json.py
~/.stg/venv/bin/python 02_quote_removal.py
~/.stg/venv/bin/python 03_default_omission.py

# Or install fresh:
pip install tiktoken
python 01_stl_vs_json.py

All scripts are deterministic. cl100k_base is a frozen vocabulary, so results are identical across machines and runs.

File layout
research/token-efficiency/
├── README.md                   # source doc
├── 01_stl_vs_json.py           # STL-vs-JSON benchmark
├── 02_quote_removal.py         # quote-removal null-result experiment
├── 03_default_omission.py      # §4.2 token-savings measurement
└── results_2026-05-13.txt      # full output snapshot of first run

Tokenizer caveats

This benchmark uses cl100k_base (GPT-4 family). Reasoning:

  • Publicly available. No API key required to reproduce.
  • Same BPE family as modern Claude / GPT tokenizers — all are byte-level BPE trained on overlapping web corpora.
  • Stable. The vocabulary is frozen; no version drift between measurements.
Anthropic does not publish Claude's exact tokenizer, so absolute token counts may differ by ±5%. The relative ratios (STL vs JSON, defaults vs explicit) are stable across BPE families because the same structural patterns get merged in all of them.
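
The ratio-stability claim can be spot-checked against a second public vocabulary: tiktoken also ships o200k_base (GPT-4o family). Absolute counts shift; the structural ratio should barely move. A sketch with an illustrative (not benchmarked) edge:

import tiktoken

stl_text = '[A] → [B]\n::mod(rule="causal", confidence=0.9)'  # illustrative edge
json_text = '{"source":"A","target":"B","modifiers":{"rule":"causal","confidence":0.9}}'

for name in ("cl100k_base", "o200k_base"):
    enc = tiktoken.get_encoding(name)
    s, j = len(enc.encode(stl_text)), len(enc.encode(json_text))
    print(f"{name}: STL={s} JSON-min={j} overhead={(j - s) / s:+.1%}")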

Citation

When citing these numbers in spec changes, STG nodes, or other agent context:
"STL saves 39.7% tokens vs JSON-pretty / 6.2% vs JSON-minified;
 §4.2 default-value omission saves an additional ~17.6% average.
 Measured 2026-05-13 with tiktoken cl100k_base.
 Source: scos-lab/semantic-tension-language/research/token-efficiency/"