Skip to content

feat: add replacer option to encode() for pre-encoding value transformation #60

Description

@LijieZhou

Problem Statement

Problem

TOON's tabular row compression requires all fields in an array of objects to be scalar primitives. When any field holds a nested object, detect_tabular_header returns None and the encoder falls back to verbose indented list output.

A real example:

Current output (verbose, no tabular compression):

blocks[2]:
  - type: header
    content: Q2 Sales Report
    bbox:
      x: 0.05
      y: 0.05
      width: 0.9
      height: 0.04
    confidence: high
    confidence_score: 0.98
  - type: paragraph
    content: This report summarizes sales...
    bbox:
      ...

I tried to use key folding. If I understand correctly, key folding is called on detect_tabular_header which only sees the Python dict and not the folded representation. The only workaround today is to manually pre-flatten the data at every call.

Proposed Solution

How TypeScript solves this — design reference

The TypeScript package exposes a replacer option on encode(). Its architecture is worth understanding before describing the Python proposal.

Key design principle: the replacer is a separate pre-encoding pass, not inline logic.

From the TypeScript source — packages/toon/src/encode/replacer.ts:

encodeJsonValue(
  options.replacer
    ? applyReplacer(normalizedValue, options.replacer)  // pre-pass
    : normalizedValue,                                  // no-op if absent
  options, 0
);

The replacer walks the full data tree and returns a new transformed value. The encoder then runs on that value exactly as if no replacer had been specified — the encoder never sees the replacer. This keeps encoding logic clean and makes the replacer composable with any future encoder change.

TypeScript signature:

replacer: (key: string, value: unknown, path: string[]) => unknown
  • key — current key name; "" for the root call; str(i) for array elements
  • value — current value, already normalized
  • path — path from root to the current node; object keys are str, array indices are int (mirrors TypeScript's readonly (string | number)[])
  • Return the same value to leave it unchanged; children are still traversed
  • Return undefined to drop the key or element from output

TypeScript implementation — four functions, ~50 lines:

applyReplacer(root, replacer)       // entry: call replacer on root, then recurse
transformChildren(value, r, path)   // route to object or array transformer
transformObject(obj, r, path)       // iterate keys → call replacer → normalize → recurse
transformArray(arr, r, path)        // same with str(i) as key

Proposed Python API

Mirror TypeScript exactly, including the pre-pass architecture.

Signature:

Replacer = Callable[[str, Any, List[Union[str, int]]], Any]

encode(value, {"replacer": fn})  # fn: Replacer

The replacer supports two distinct operations:

Option A — flatten a nested field into its parent (e.g. bbox)

Return a new parent object when the replacer is called on the containing dict.
The nested field is replaced with flat scalar siblings, which is what detect_tabular_header requires.
The data is fully preserved — just restructured.

from toon_format import encode

def flatten_bbox(key, value, path):
    if isinstance(value, dict) and isinstance(value.get("bbox"), dict):
        bbox = value["bbox"]
        return {k: v for k, v in value.items() if k != "bbox"} | {
            "bbox_x": bbox["x"], "bbox_y": bbox["y"],
            "bbox_w": bbox["width"], "bbox_h": bbox["height"],
        }
    return value  # leave everything else unchanged

result = encode(blocks, {"replacer": flatten_bbox})

Output — tabular compression now triggers:

blocks[2]{type,content,bbox_x,bbox_y,bbox_w,bbox_h,confidence,confidence_score}:
  header,Q2 Sales Report,0.05,0.05,0.9,0.04,high,0.98
  paragraph,This report summarizes sales...,0.05,0.12,0.9,0.06,high,0.97

Option B — drop fields entirely (e.g. LLM-irrelevant metadata)

Return the OMIT sentinel (Python has no undefined) to remove a key or array element from the output.
Useful for fields like embed, enriched, and enrichment_success that are preprocessing artefacts and add no value when the output is consumed by an LLM.
OMIT is exported from the top-level package.

from toon_format import encode, OMIT

_SKIP = {"embed", "enriched", "enrichment_success"}

def strip_metadata(key, value, path):
    if key in _SKIP:
        return OMIT
    return value

result = encode(response, {"replacer": strip_metadata})

Both options can be combined in a single replacer function.

Alternatives Considered

No response

SPEC Compliance

No response

Additional Context

No response

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions