Problem Statement
Problem
TOON's tabular row compression requires all fields in an array of objects to be scalar primitives. When any field holds a nested object, detect_tabular_header returns None and the encoder falls back to verbose indented list output.
A real example:
Current output (verbose, no tabular compression):
blocks[2]:
- type: header
content: Q2 Sales Report
bbox:
x: 0.05
y: 0.05
width: 0.9
height: 0.04
confidence: high
confidence_score: 0.98
- type: paragraph
content: This report summarizes sales...
bbox:
...
I tried to use key folding. If I understand correctly, key folding is called on detect_tabular_header which only sees the Python dict and not the folded representation. The only workaround today is to manually pre-flatten the data at every call.
Proposed Solution
How TypeScript solves this — design reference
The TypeScript package exposes a replacer option on encode(). Its architecture is worth understanding before describing the Python proposal.
Key design principle: the replacer is a separate pre-encoding pass, not inline logic.
From the TypeScript source — packages/toon/src/encode/replacer.ts:
encodeJsonValue(
options.replacer
? applyReplacer(normalizedValue, options.replacer) // pre-pass
: normalizedValue, // no-op if absent
options, 0
);
The replacer walks the full data tree and returns a new transformed value. The encoder then runs on that value exactly as if no replacer had been specified — the encoder never sees the replacer. This keeps encoding logic clean and makes the replacer composable with any future encoder change.
TypeScript signature:
replacer: (key: string, value: unknown, path: string[]) => unknown
key — current key name; "" for the root call; str(i) for array elements
value — current value, already normalized
path — path from root to the current node; object keys are str, array indices are int (mirrors TypeScript's readonly (string | number)[])
- Return the same value to leave it unchanged; children are still traversed
- Return
undefined to drop the key or element from output
TypeScript implementation — four functions, ~50 lines:
applyReplacer(root, replacer) // entry: call replacer on root, then recurse
transformChildren(value, r, path) // route to object or array transformer
transformObject(obj, r, path) // iterate keys → call replacer → normalize → recurse
transformArray(arr, r, path) // same with str(i) as key
Proposed Python API
Mirror TypeScript exactly, including the pre-pass architecture.
Signature:
Replacer = Callable[[str, Any, List[Union[str, int]]], Any]
encode(value, {"replacer": fn}) # fn: Replacer
The replacer supports two distinct operations:
Option A — flatten a nested field into its parent (e.g. bbox)
Return a new parent object when the replacer is called on the containing dict.
The nested field is replaced with flat scalar siblings, which is what detect_tabular_header requires.
The data is fully preserved — just restructured.
from toon_format import encode
def flatten_bbox(key, value, path):
if isinstance(value, dict) and isinstance(value.get("bbox"), dict):
bbox = value["bbox"]
return {k: v for k, v in value.items() if k != "bbox"} | {
"bbox_x": bbox["x"], "bbox_y": bbox["y"],
"bbox_w": bbox["width"], "bbox_h": bbox["height"],
}
return value # leave everything else unchanged
result = encode(blocks, {"replacer": flatten_bbox})
Output — tabular compression now triggers:
blocks[2]{type,content,bbox_x,bbox_y,bbox_w,bbox_h,confidence,confidence_score}:
header,Q2 Sales Report,0.05,0.05,0.9,0.04,high,0.98
paragraph,This report summarizes sales...,0.05,0.12,0.9,0.06,high,0.97
Option B — drop fields entirely (e.g. LLM-irrelevant metadata)
Return the OMIT sentinel (Python has no undefined) to remove a key or array element from the output.
Useful for fields like embed, enriched, and enrichment_success that are preprocessing artefacts and add no value when the output is consumed by an LLM.
OMIT is exported from the top-level package.
from toon_format import encode, OMIT
_SKIP = {"embed", "enriched", "enrichment_success"}
def strip_metadata(key, value, path):
if key in _SKIP:
return OMIT
return value
result = encode(response, {"replacer": strip_metadata})
Both options can be combined in a single replacer function.
Alternatives Considered
No response
SPEC Compliance
No response
Additional Context
No response
Problem Statement
Problem
TOON's tabular row compression requires all fields in an array of objects to be scalar primitives. When any field holds a nested object,
detect_tabular_headerreturnsNoneand the encoder falls back to verbose indented list output.A real example:
Current output (verbose, no tabular compression):
I tried to use key folding. If I understand correctly, key folding is called on detect_tabular_header which only sees the Python dict and not the folded representation. The only workaround today is to manually pre-flatten the data at every call.
Proposed Solution
How TypeScript solves this — design reference
The TypeScript package exposes a
replaceroption onencode(). Its architecture is worth understanding before describing the Python proposal.Key design principle: the replacer is a separate pre-encoding pass, not inline logic.
From the TypeScript source —
packages/toon/src/encode/replacer.ts:The replacer walks the full data tree and returns a new transformed value. The encoder then runs on that value exactly as if no replacer had been specified — the encoder never sees the replacer. This keeps encoding logic clean and makes the replacer composable with any future encoder change.
TypeScript signature:
key— current key name;""for the root call;str(i)for array elementsvalue— current value, already normalizedpath— path from root to the current node; object keys arestr, array indices areint(mirrors TypeScript'sreadonly (string | number)[])undefinedto drop the key or element from outputTypeScript implementation — four functions, ~50 lines:
Proposed Python API
Mirror TypeScript exactly, including the pre-pass architecture.
Signature:
The replacer supports two distinct operations:
Option A — flatten a nested field into its parent (e.g.
bbox)Return a new parent object when the replacer is called on the containing dict.
The nested field is replaced with flat scalar siblings, which is what
detect_tabular_headerrequires.The data is fully preserved — just restructured.
Output — tabular compression now triggers:
Option B — drop fields entirely (e.g. LLM-irrelevant metadata)
Return the
OMITsentinel (Python has noundefined) to remove a key or array element from the output.Useful for fields like
embed,enriched, andenrichment_successthat are preprocessing artefacts and add no value when the output is consumed by an LLM.OMITis exported from the top-level package.Both options can be combined in a single replacer function.
Alternatives Considered
No response
SPEC Compliance
No response
Additional Context
No response