Pivot plugin to simple XML element parser by rjrodger · Pull Request #1 · jsonicjs/xml

rjrodger · 2026-04-19T16:49:45Z

Replace the CSV grammar and plugin with an XML element-only variant.
Parses a single rooted XML document into { name, children } where
children is an array of strings (text nodes) and nested elements.

Handles open/close tags, self-closing tags (<tag/>), nested and
mixed content, and reports mismatched close tags as an error. A
custom lexer matcher tokenizes <tag>, </tag>, and <tag/> as
single tokens; whitespace and JSON structural tokens are disabled
so text is preserved verbatim between tags.
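
As an illustration of the `{ name, children }` shape described above, here is a minimal sketch. The type names `XmlNode` and `XmlElement`, the helper `textContent`, and the assumption that a self-closing tag yields an empty `children` array are all hypothetical; the PR text only specifies the two field names.

```typescript
// Hypothetical type names; only { name, children } comes from the PR text.
type XmlNode = string | XmlElement

interface XmlElement {
  name: string
  children: XmlNode[]
}

// Assumed result shape for the input: <a>x<b/>y</a>
const example: XmlElement = {
  name: 'a',
  children: ['x', { name: 'b', children: [] }, 'y'],
}

// Concatenate every text node under an element, depth first.
function textContent(el: XmlElement): string {
  return el.children
    .map((c) => (typeof c === 'string' ? c : textContent(c)))
    .join('')
}
```

Mixed content keeps its document order because text nodes and elements share one array.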

Extend the simple element parser to cover the commonly used parts of the XML specification beyond bare elements.

Lexer changes:
- The custom tag matcher now parses attributes (including single- and double-quoted values with entity decoding) and recognises comments (`<!-- ... -->`), CDATA sections (`<![CDATA[ ... ]]>`), processing instructions (`<? ... ?>`, including the XML declaration) and DOCTYPE declarations with optional internal subsets. Comments, PIs and DOCTYPEs are emitted as an `#XIG` token and dropped via IGNORE; CDATA is emitted verbatim as `#TX` with no entity processing.
- A text modifier decodes the five predefined entities (amp, lt, gt, quot, apos) plus numeric character references (`&#N;` and `&#xN;`) from text nodes; attribute values are decoded inline.

Data structure changes:
- Each element now has `attributes`, `localName` and optional `prefix`/`namespace` fields in addition to `name` and `children`.
- A post-parse walk resolves namespace URIs from xmlns/xmlns:* declarations across nested scopes with proper inheritance and override semantics.

Options:
- `namespaces` (default true) - enable namespace resolution
- `entities` (default true) - enable entity decoding
- `customEntities` - additional named entities

Grammar:
- `xml` rule skips whitespace text nodes between the document prolog (declaration, DOCTYPE, comments) and the root element, and after the root element, so real-world documents with blank lines parse cleanly.
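
The entity decoding described above can be sketched roughly as follows. This is not the plugin's code: `decodeEntities` and its `custom` parameter are hypothetical names, and the name pattern is a simplified ASCII-only approximation.

```typescript
// The five predefined XML entities.
const PREDEFINED = new Map<string, string>([
  ['amp', '&'], ['lt', '<'], ['gt', '>'], ['quot', '"'], ['apos', "'"],
])

// Decode predefined, custom, and numeric references in a single pass.
// Unknown names and out-of-range code points are left untouched.
function decodeEntities(
  text: string,
  custom: Map<string, string> = new Map(),
): string {
  const ref = /&(#x[0-9a-fA-F]+|#[0-9]+|[A-Za-z_][\w.-]*);/g
  return text.replace(ref, (whole: string, body: string): string => {
    if (body.startsWith('#')) {
      const hex = body[1] === 'x'
      const cp = parseInt(body.slice(hex ? 2 : 1), hex ? 16 : 10)
      return cp <= 0x10ffff ? String.fromCodePoint(cp) : whole
    }
    return PREDEFINED.get(body) ?? custom.get(body) ?? whole
  })
}
```

Because the replacement runs once over the input, `&amp;lt;` decodes to the literal text `&lt;` rather than to `<`.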

- go/xml.go: Go port of the XML plugin, with the same data shape and feature set as the TypeScript version (elements + attributes + mixed content, predefined + numeric entity decoding, namespace resolution, comments/CDATA/PI/DOCTYPE handling). Goes through jsonic/go's declarative GrammarSpec with auto-wired @xml-bc / @child-bc state actions.
- go/xml_test.go: Go test suite driven by the shared TSV spec files plus an explicit jsonic-embedded-XML test case.
- test/spec/*.tsv: shared parse fixtures with four columns (name, input, expected, opts). Input uses escape sequences (\n \r \t \\); expected is raw JSON or `ERROR`/`ERROR:code`; opts is optional plugin options JSON. Splits cases across basic, attributes, entities, namespaces, structure, errors, and a w3c spec of standardised/real-world XML documents (Atom, SOAP, SVG, RSS, XHTML, DOCTYPEs, not-well-formed).
- test/xml.test.ts: the TypeScript test suite now auto-discovers and runs every TSV spec file, and adds the jsonic-embedded-XML test case.
- Remove the leftover CSV Go package, CSV docs, CSV fixtures, and coverage artifact from the original repo layout.
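
To make the four-column TSV fixture format concrete, here is a hedged sketch of reading one fixture line. `SpecCase`, `unescapeInput` and `parseSpecLine` are hypothetical names, and the real test runners may structure this differently.

```typescript
// Hypothetical in-memory form of one TSV fixture row.
interface SpecCase {
  name: string
  input: string
  expectError: boolean
  errorCode?: string
  expected?: unknown
  opts?: unknown
}

// Expand the \n \r \t \\ escapes used in the input column.
function unescapeInput(s: string): string {
  return s.replace(/\\([nrt\\])/g, (_m: string, c: string): string =>
    c === 'n' ? '\n' : c === 'r' ? '\r' : c === 't' ? '\t' : '\\')
}

// Split a row into name, input, expected, opts.
function parseSpecLine(line: string): SpecCase {
  const cols = line.split('\t')
  const rawExpected = cols[2] ?? ''
  const rawOpts = cols[3] ?? ''
  const spec: SpecCase = {
    name: cols[0] ?? '',
    input: unescapeInput(cols[1] ?? ''),
    expectError: false,
  }
  if (rawExpected === 'ERROR' || rawExpected.startsWith('ERROR:')) {
    spec.expectError = true
    // Anything after "ERROR:" is the expected error code.
    if (rawExpected.length > 6) spec.errorCode = rawExpected.slice(6)
  } else {
    spec.expected = JSON.parse(rawExpected)
  }
  if (rawOpts !== '') spec.opts = JSON.parse(rawOpts)
  return spec
}
```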

embed mode
----------

Adds an `embed: true` plugin option that extends Jsonic's own grammar so a literal XML element (`<tag>…</tag>` or `<tag/>`) can appear wherever a Jsonic value is expected: inside maps, lists, or at the top level. Default behaviour (`embed: false`) remains pure-XML parsing with the JSON rules stripped.

When `embed: true` the plugin:
- keeps the full JSON/JSONIC grammar in place, including the structural fixed tokens `{ } [ ] : ,`;
- splices two alternates into the `val` rule so `#XOP`/`#XSC` tokens dispatch to the `element` rule;
- tracks XML nesting depth in `ctx.u.xmlDepth`: while depth > 0 the custom matcher also claims any run of non-`<` characters as a single `#TX` node (optionally entity-decoded), so Jsonic's lexer can't reinterpret a comma or colon inside XML text as a JSON separator;
- resolves namespaces on close of an `element` rule that sits directly under a `val` rule.

The embedded-XML tests in both `test/xml.test.ts` and `go/xml_test.go` now use real literal XML in Jsonic source instead of stuffing the document inside a string.

W3C XML Conformance Test Suite
------------------------------

Added `scripts/fetch-xml-suite.sh` to download the 2013-09-23 snapshot of the W3C XML Test Suite (xmltest) into `test/xmlconf/` (gitignored; the suite is owned by W3C and contributors and is not redistributed). Both languages pick it up automatically when present:
- `go/xmlconf_test.go` iterates `xmltest/valid/sa/*.xml` and `xmltest/not-wf/sa/*.xml`, counting successful parses and expected rejections, and asserts each count stays above a regression floor.
- An equivalent `describe(..., { skip: ... })` block in `test/xml.test.ts` does the same for Node.

Current numbers (regression guard in parentheses):
- valid/sa : 116 / 120 parsed (floor 110)
- not-wf/sa : 39 / 186 rejected (floor 30)

The handful of `valid/sa` misses are UTF-16 BOM files and tests that use non-Latin tag names, both out of scope for the current parser. Many `not-wf/sa` tests hinge on character-level WF constraints our structural parser doesn't enforce, hence the conservative floor.

Add a batch of XML 1.0 well-formedness constraints that the structural matcher can enforce without DTD support. The custom matcher now also owns text-token emission whenever the parser is inside an open element (depth > 0) in both pure and embed modes, so the same validation path applies to all character data.

New errors raised at lex time:
- `comment_double_dash`: `--` inside a comment body
- `cdata_terminator_in_text`: `]]>` in non-CDATA character data
- `pi_target_invalid`: `<? ?>` with missing/invalid target
- `lt_in_attr_value`: literal `<` in an attribute value
- `bad_entity_ref`: malformed `&...;` reference (in text or attribute values)
- `duplicate_attribute`: same attribute name twice in one tag
- `xml_invalid_tag`: `</>` empty close tag

W3C xmltest/not-wf/sa rejection rate climbs from 39/186 to 54/186 with no regression in valid/sa (116/120). All 87 TS tests and the Go suite still pass.
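
A few of these checks are simple enough to sketch in isolation. The function names below are hypothetical and the real matcher performs the checks inline while lexing; only the error codes are taken from the list above.

```typescript
// Each check returns one of the PR's error codes, or null when the input passes.

// XML comments may not contain "--" inside the body.
function checkCommentBody(body: string): string | null {
  return body.includes('--') ? 'comment_double_dash' : null
}

// Character data outside CDATA may not contain the "]]>" terminator.
function checkCharData(text: string): string | null {
  return text.includes(']]>') ? 'cdata_terminator_in_text' : null
}

// Attribute values may not contain a literal "<".
function checkAttrValue(value: string): string | null {
  return value.includes('<') ? 'lt_in_attr_value' : null
}

// The same attribute name may not appear twice in one tag.
function checkAttrNames(names: string[]): string | null {
  const seen = new Set<string>()
  for (const n of names) {
    if (seen.has(n)) return 'duplicate_attribute'
    seen.add(n)
  }
  return null
}
```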

XML 1.0 §2.2 only allows tab (#x09), LF (#x0A), CR (#x0D) and code points >= #x20 in document content. Other C0 controls (form feed, ESC, ...) make a document not well-formed. Apply a `checkChars` validation in every place the matcher emits text-like content: character data inside an open element, CDATA section bodies, comment bodies, processing instruction bodies, and attribute values. The new error code is `invalid_xml_char`. W3C xmltest/not-wf/sa rejection rate moves from 54/186 to 58/186 with no regression in valid/sa.
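
The character rule as stated above (tab, LF, CR, or any code point >= #x20) can be sketched like this. `isAllowedChar` and `firstInvalidChar` are hypothetical names, not the plugin's `checkChars`, and the sketch follows this commit's wording rather than the complete `Char` production in the specification.

```typescript
// True for tab, LF, CR, or any code point at or above U+0020.
function isAllowedChar(cp: number): boolean {
  return cp === 0x09 || cp === 0x0a || cp === 0x0d || cp >= 0x20
}

// Return the index of the first disallowed character, or -1 if none.
// A caller would raise `invalid_xml_char` when the result is not -1.
function firstInvalidChar(text: string): number {
  for (let i = 0; i < text.length; i++) {
    if (!isAllowedChar(text.charCodeAt(i))) return i
  }
  return -1
}
```

Scanning by UTF-16 code unit is sufficient here because every disallowed character in this rule is below U+0020.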

XML 1.0 §2.1 requires exactly one root (document) element. The xml rule previously allowed `r: xml` to skip trailing whitespace and then re-attempt `{ p: element }`, which let documents with multiple top-level elements parse with the last one winning. Track a per-parse `ctx.u.rootSeen` flag, set in `@xml-bc` once the root element's node has been hoisted. Add a new `@no-root-yet` condition gating the `{ p: element }` alternate so subsequent attempts fail with "unexpected" rather than silently producing a wrong tree. W3C xmltest/not-wf/sa rejection rate moves from 58/186 to 60/186.

Replace the ASCII-only NameStartChar/NameChar regex with the full XML 1.0 Fifth Edition character set:

    NameStartChar = ':' | [A-Z] | '_' | [a-z] | the Unicode letter and ideograph blocks listed in §2.3 [4]
    NameChar      = NameStartChar | '-' | '.' | [0-9] | #xB7 | combining-mark blocks (§2.3 [4a])

Surrogate pairs in the JS implementation and multibyte UTF-8 sequences in the Go implementation are now read as single characters via a shared `readName` helper. The same predicates also gate entity-reference name validation.

W3C xmltest/valid/sa pass count moves from 116/120 to 117/120 (the Thai-named element test); the remaining three misses are UTF-16/UTF-32 BOM files. Two new TSV cases (`tag-name-unicode-thai`, `tag-name-unicode-greek`) lock in the behaviour for both runtimes.
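
A rough sketch of range-based name predicates and a surrogate-aware reader follows. The ranges are the ones listed in XML 1.0 Fifth Edition productions [4] and [4a]; the function bodies are assumptions and only the name `readName` is taken from the text above.

```typescript
// NameStartChar ranges from XML 1.0 (Fifth Edition) production [4].
const NAME_START: Array<[number, number]> = [
  [0x3a, 0x3a], [0x41, 0x5a], [0x5f, 0x5f], [0x61, 0x7a],
  [0xc0, 0xd6], [0xd8, 0xf6], [0xf8, 0x2ff], [0x370, 0x37d],
  [0x37f, 0x1fff], [0x200c, 0x200d], [0x2070, 0x218f], [0x2c00, 0x2fef],
  [0x3001, 0xd7ff], [0xf900, 0xfdcf], [0xfdf0, 0xfffd], [0x10000, 0xeffff],
]

// Extra NameChar ranges from production [4a]: '-', '.', digits, #xB7, combining marks.
const NAME_EXTRA: Array<[number, number]> = [
  [0x2d, 0x2e], [0x30, 0x39], [0xb7, 0xb7], [0x300, 0x36f], [0x203f, 0x2040],
]

function inRanges(cp: number, ranges: Array<[number, number]>): boolean {
  return ranges.some(([lo, hi]) => cp >= lo && cp <= hi)
}

function isNameStartChar(cp: number): boolean {
  return inRanges(cp, NAME_START)
}

function isNameChar(cp: number): boolean {
  return isNameStartChar(cp) || inRanges(cp, NAME_EXTRA)
}

// Read a Name starting at `start`; returns '' when no valid name begins there.
// codePointAt combines a surrogate pair into one code point.
function readName(src: string, start: number): string {
  let i = start
  while (i < src.length) {
    const cp = src.codePointAt(i)
    if (cp === undefined) break
    const ok = i === start ? isNameStartChar(cp) : isNameChar(cp)
    if (!ok) break
    i += cp > 0xffff ? 2 : 1
  }
  return src.slice(start, i)
}
```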

The custom matcher now strips a UTF-8 BOM (raw bytes EF BB BF or the single character U+FEFF) at sI=0 in both runtimes, so files saved with a BOM parse cleanly without the caller having to massage the input.

For UTF-16 / UTF-32 encoded input the runtime can't sniff bytes from inside an already-decoded JS / Go string. The plugin therefore exposes a public `decodeBOM(src)` helper:
- Go: `xml.DecodeBOM(string)`: accepts a string of raw bytes, detects UTF-8/16/32 BOMs and transcodes to a UTF-8 Go string (with `unicode/utf16` for the UTF-16 paths).
- TS: `decodeBOM(src)`: accepts either a Node Buffer / Uint8Array or a "binary" JS string, detects UTF-8/16/32 BOMs, falls back to UTF-8 when no BOM is present, and returns a decoded Unicode string. A leading U+FEFF is stripped if the input is already a Unicode string.

The W3C conformance test runners pass file contents through this helper before parsing, so the three UTF-16 documents in xmltest/valid/sa now parse:

valid/sa: 120 / 120 parsed successfully
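
BOM detection of the kind described above might look like the following sketch. It is not the plugin's `decodeBOM`: `sniffBOM` and `decodeUtf16` are hypothetical helpers, and UTF-8 and UTF-32 decoding are left out to keep the example short.

```typescript
type BomEncoding = 'utf-8' | 'utf-16le' | 'utf-16be' | 'utf-32le' | 'utf-32be' | 'none'

interface BomInfo {
  encoding: BomEncoding
  length: number // number of BOM bytes to skip
}

// Identify a leading byte order mark.
// UTF-32LE is tested before UTF-16LE because FF FE 00 00 begins with FF FE.
function sniffBOM(b: Uint8Array): BomInfo {
  const at = (i: number): number => b[i] ?? -1
  if (at(0) === 0xff && at(1) === 0xfe && at(2) === 0x00 && at(3) === 0x00) {
    return { encoding: 'utf-32le', length: 4 }
  }
  if (at(0) === 0x00 && at(1) === 0x00 && at(2) === 0xfe && at(3) === 0xff) {
    return { encoding: 'utf-32be', length: 4 }
  }
  if (at(0) === 0xef && at(1) === 0xbb && at(2) === 0xbf) {
    return { encoding: 'utf-8', length: 3 }
  }
  if (at(0) === 0xff && at(1) === 0xfe) return { encoding: 'utf-16le', length: 2 }
  if (at(0) === 0xfe && at(1) === 0xff) return { encoding: 'utf-16be', length: 2 }
  return { encoding: 'none', length: 0 }
}

// Decode UTF-16 bytes into a JS string, one 16-bit code unit at a time.
// A trailing odd byte is ignored.
function decodeUtf16(b: Uint8Array, offset: number, littleEndian: boolean): string {
  const view = new DataView(b.buffer, b.byteOffset, b.byteLength)
  let out = ''
  for (let i = offset; i + 1 < b.length; i += 2) {
    out += String.fromCharCode(view.getUint16(i, littleEndian))
  }
  return out
}
```

JS strings are already sequences of UTF-16 code units, so surrogate pairs pass through `decodeUtf16` without special handling.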

XML 1.0 §2.11 requires that any literal CR (#xD) or CR-LF (#xD #xA) be replaced with a single LF (#xA) before parsing. §3.3.3 further requires that for CDATA-typed attributes (the default in the absence of a DTD) every TAB / LF / CR in the source be replaced with a single SPACE before entity references are decoded. Add `normaliseLineEndings` and `normaliseAttrWhitespace` helpers in both runtimes and apply them in the matcher's text, CDATA and attribute paths. Five new TSV cases (`text-crlf-normalised`, `text-cr-normalised`, `attr-tab-normalised`, `attr-newline-normalised`, `attr-crlf-normalised`) cover the new behaviour.
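
The two normalisation steps can be approximated with a pair of regular expressions. The helper names match the ones in the text above, but the bodies are an assumed sketch rather than the plugin's implementation.

```typescript
// §2.11: replace CR-LF and lone CR with a single LF.
function normaliseLineEndings(s: string): string {
  return s.replace(/\r\n?/g, '\n')
}

// §3.3.3 for CDATA-typed attributes: each TAB, LF or CR becomes one SPACE.
// CR-LF is matched first so the pair yields a single space, not two.
function normaliseAttrWhitespace(s: string): string {
  return s.replace(/\r\n|[\t\n\r]/g, ' ')
}
```

For example, `normaliseAttrWhitespace('a\r\nb')` yields `'a b'`, because the line ending counts as one character after line-ending normalisation.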

XML 1.0 §2.10 (xml:space) and §2.12 (xml:lang) define two special attributes that an element may use to signal whitespace handling and language identification. Both inherit down through descendants, just like xmlns declarations. Fold xml:space and xml:lang propagation into the existing namespace walk via a shared `xmlScope` value that carries the active prefix map plus the current `space` and `lang`. Each element gains a `space` field when the active value differs from the default "default" (typically "preserve"), and a `lang` field when any in-scope element specifies xml:lang. A new test/spec/xmlspace-lang.tsv covers the inherited-and-overridden cases for both runtimes.
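
The inheritance behaviour can be illustrated with a small recursive walk. This sketch omits the prefix map that the real `xmlScope` also carries; `ScopedElement`, `Scope` and `applyScope` are hypothetical names.

```typescript
interface ScopedElement {
  name: string
  attributes: { [key: string]: string | undefined }
  children: Array<string | ScopedElement>
  space?: string
  lang?: string
}

interface Scope {
  space: string
  lang?: string
}

// Propagate xml:space and xml:lang down the tree; a child's own attribute overrides.
function applyScope(el: ScopedElement, parent: Scope = { space: 'default' }): void {
  const scope: Scope = { space: el.attributes['xml:space'] ?? parent.space }
  const lang = el.attributes['xml:lang'] ?? parent.lang
  if (lang !== undefined) scope.lang = lang

  // `space` is recorded only when it differs from "default".
  if (scope.space !== 'default') el.space = scope.space
  if (scope.lang !== undefined) el.lang = scope.lang

  for (const child of el.children) {
    if (typeof child !== 'string') applyScope(child, scope)
  }
}
```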

Per Namespaces in XML 1.0 §2 ("Reserved prefixes and namespace names"):
- The "xml" prefix is fixed to http://www.w3.org/XML/1998/namespace. It MAY be redeclared but only to that exact URI; redeclaring it to any other namespace name is an error.
- The "xmlns" prefix is fixed to http://www.w3.org/2000/xmlns/ and MUST NOT be declared.
- Neither URI may be bound to any other prefix or used as the default namespace.
- A prefixed element or attribute name that has no in-scope binding is an error (xmlns and xml are implicitly bound).

The namespace resolver now pre-binds the xml prefix, validates each xmlns/xmlns:* declaration, and walks the tree checking that every prefixed element and attribute has a binding. Errors short-circuit resolution and surface as parse errors via two new codes:
- `reserved_namespace`
- `unbound_prefix`

Six new TSV cases under test/spec/errors.tsv exercise the reserved-prefix and unbound-prefix paths; two positive cases under test/spec/namespaces.tsv lock in the implicit `xml:` binding and the correct explicit `xmlns:xml=` redeclaration.
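
The reserved-prefix rules listed above reduce to a few comparisons. In this sketch `checkNsDeclaration` and `checkPrefix` are hypothetical helpers, and an empty prefix stands for a default-namespace declaration (`xmlns="..."`).

```typescript
const XML_NS = 'http://www.w3.org/XML/1998/namespace'
const XMLNS_NS = 'http://www.w3.org/2000/xmlns/'

// Validate one xmlns / xmlns:prefix declaration.
// `prefix` is '' for the default namespace.
function checkNsDeclaration(prefix: string, uri: string): 'reserved_namespace' | null {
  // The xmlns prefix must never be declared.
  if (prefix === 'xmlns') return 'reserved_namespace'
  // The xml prefix may only be redeclared to its fixed URI.
  if (prefix === 'xml') return uri === XML_NS ? null : 'reserved_namespace'
  // Neither reserved URI may be bound to another prefix or used as the default.
  if (uri === XML_NS || uri === XMLNS_NS) return 'reserved_namespace'
  return null
}

// Check that a prefix used on an element or attribute has a binding in scope.
// xml and xmlns are implicitly bound.
function checkPrefix(prefix: string, inScope: Set<string>): 'unbound_prefix' | null {
  if (prefix === 'xml' || prefix === 'xmlns' || inScope.has(prefix)) return null
  return 'unbound_prefix'
}
```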

Extract `<!ENTITY name "value">` general internal entity declarations from the DOCTYPE internal subset and use them when resolving entity references in text and attribute values. Parameter entity declarations (`<!ENTITY % name ...>`) and external entity declarations (`<!ENTITY name SYSTEM "...">` / `PUBLIC "..."`) are recognised but skipped; we don't fetch external resources.

Implementation:
- The DOCTYPE matcher path now records the byte range of the `[ ]` internal subset and runs `parseDoctypeEntities` over it. The extracted map is stored on the per-parse context (`ctx.u.dtdEntities` in TS, `ctx.U["dtdEntities"]` in Go).
- The entity decoder is now a closure that takes an optional `dtd` map. The matcher's text and attribute paths look the map up via `lex.ctx` and pass it through. The five predefined entities and any plugin-time `customEntities` always take precedence, matching the XML 1.0 rule that the predefined entities are always available.
- Recursive entity expansion is supported, with cycle detection via a `seen` set: a cyclic reference breaks the cycle (the original `&name;` is left in place) instead of looping.

Entity values are stored verbatim. Character and entity references inside an entity value are expanded only when the outer entity is referenced (matches XML 1.0 §4.4 "Bypassed" treatment for general entity declarations).

A new test/spec/dtd-entities.tsv covers the basic, recursive, and edge cases (single-quoted values, parameter-entity skip, external entity skip, predefined-entity precedence) for both runtimes. All 118 TS tests and the Go suite pass.
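
The cycle-safe recursive expansion can be sketched as below. `expandEntity` is a hypothetical name; the sketch ignores predefined and custom entities and uses an ASCII-only name pattern, so it only shows the `seen`-set idea.

```typescript
// Expand a DTD-declared entity, following nested references.
// A name that is undeclared, or already being expanded, is left as &name;.
function expandEntity(
  name: string,
  dtd: Map<string, string>,
  seen: Set<string> = new Set(),
): string {
  const value = dtd.get(name)
  if (value === undefined || seen.has(name)) return `&${name};`

  seen.add(name)
  const out = value.replace(
    /&([A-Za-z_][\w.-]*);/g,
    (_m: string, inner: string): string => expandEntity(inner, dtd, seen),
  )
  // Remove the name again so sibling references to it still expand.
  seen.delete(name)
  return out
}
```

With `a` declared as `&b;` and `b` declared as `&a;`, expanding `a` returns the literal text `&a;` instead of recursing forever.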

XML 1.0 §4.1 requires every named entity reference to resolve to a declared entity (predefined, custom, or DOCTYPE-declared). Add a new `strictEntities` option (default `true`) that enforces this in `checkEntityRefs`. When set to `false`, references to unknown names pass through unexpanded (legacy behaviour useful for templating).

While testing the new check, the DOCTYPE depth tracker was found to treat `]` and `>` characters inside quoted entity values as if they ended the internal subset, which made declarations like `<!ENTITY rsqb "]">` cut the subset short and any subsequent `]` references reach the validator as undeclared. The tracker now skips over single- and double-quoted strings while walking the DOCTYPE, restoring the W3C valid/sa pass count to 120/120.

Conformance changes:
- valid/sa : 120/120 (unchanged)
- not-wf/sa : 60/186 -> 64/186 (+4 strict-entity catches)

The legacy "unknown-passthrough" test was renamed to "unknown-rejected" with a new "unknown-passthrough-lenient" variant that opts in via `{strictEntities: false}`.
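
The quote-skipping fix can be illustrated with a simplified scanner. `findDoctypeEnd` is a hypothetical helper, not the plugin's tracker, and it assumes there are no comments containing quote characters inside the internal subset.

```typescript
// Find the index just past the '>' that closes a DOCTYPE declaration.
// `start` points at or after "<!DOCTYPE". Returns -1 if the declaration is unterminated.
function findDoctypeEnd(src: string, start: number): number {
  let inSubset = false
  for (let i = start; i < src.length; i++) {
    const c = src[i]
    if (c === '"' || c === "'") {
      // Skip the whole quoted string so ']' and '>' inside it are ignored.
      const close = src.indexOf(c, i + 1)
      if (close < 0) return -1
      i = close
    } else if (c === '[') {
      inSubset = true
    } else if (c === ']') {
      inSubset = false
    } else if (c === '>' && !inSubset) {
      return i + 1
    }
  }
  return -1
}
```

Without the quoted-string branch, the `]` inside the entity value `"]"` would end the internal subset early.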

Parse `<!ATTLIST element attr type defaultDecl>` declarations from the DOCTYPE internal subset and use them to fill in attributes that are missing from element instances. Both literal defaults and the `#FIXED "value"` form are honoured; `#REQUIRED` and `#IMPLIED` declarations contribute nothing because they have no default value.

Implementation:
- `parseDoctypeAttlists` scans for each `<!ATTLIST>` declaration, skips the AttType (a bare uppercase identifier, an enumeration `( ... )`, or `NOTATION ( ... )`), and collects the default value. The result is keyed by element name then attribute name and stored on the per-parse context as `dtdAttrDefaults`.
- The `@element-open` and `@element-selfclose` actions consult that map via `applyAttrDefaults` and merge in any defaults that the parsed element does not already provide.

A new test/spec/dtd-attlist.tsv exercises basic defaults, override by an instance attribute, multiple declarations on one element, `#FIXED`, enumeration types, the no-default `#REQUIRED`/`#IMPLIED` forms, and per-element scoping. All 126 TS tests and the Go suite pass; W3C conformance numbers are unchanged.
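
The merge step amounts to filling in keys the element does not already have. In this sketch the name `applyAttrDefaults` comes from the text above, but the signature, the `AttrDefaults` type and the use of nested `Map`s are assumptions.

```typescript
// Assumed shape: element name -> (attribute name -> default value).
type AttrDefaults = Map<string, Map<string, string>>

// Return a copy of `attrs` with any missing defaults for `elementName` filled in.
// Attributes present on the instance always win over declared defaults.
function applyAttrDefaults(
  elementName: string,
  attrs: Record<string, string>,
  defaults: AttrDefaults,
): Record<string, string> {
  const forElement = defaults.get(elementName)
  if (forElement === undefined) return attrs

  const merged: Record<string, string> = { ...attrs }
  forElement.forEach((value: string, key: string) => {
    if (!Object.prototype.hasOwnProperty.call(merged, key)) merged[key] = value
  })
  return merged
}
```

Keying by element name first gives the per-element scoping: a default declared for one element never leaks onto another element with the same attribute name.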

claude added 15 commits April 18, 2026 22:36

rjrodger merged commit d34ff56 into main Apr 19, 2026
7 of 9 checks passed

rjrodger merged 15 commits into main from claude/xml-parser-elements-bNQB2
