Pivot plugin to simple XML element parser #1
Merged
Replace the CSV grammar and plugin with an XML element-only variant.
Parses a single rooted XML document into `{ name, children }` where
`children` is an array of strings (text nodes) and nested elements.
Handles open/close tags, self-closing tags (`<tag/>`), nested and
mixed content, and reports mismatched close tags as an error. A
custom lexer matcher tokenizes `<tag>`, `</tag>`, and `<tag/>` as
single tokens; whitespace and JSON structural tokens are disabled
so text is preserved verbatim between tags.
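For orientation, the result shape described above can be written down as a type. This is only a sketch: the names `XmlElement` and `example` are hypothetical, and it assumes a self-closing tag yields an empty `children` array.

```typescript
// Hypothetical type name; the description above only fixes the
// { name, children } shape.
interface XmlElement {
  name: string
  // Text nodes are plain strings; nested elements are objects.
  children: Array<string | XmlElement>
}

// The tree one would expect for the input: <a>x<b/>y</a>
// (assumption: <b/> becomes an element with no children).
const example: XmlElement = {
  name: 'a',
  children: ['x', { name: 'b', children: [] }, 'y'],
}
```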
Extend the simple element parser to cover the commonly used parts of
the XML specification beyond bare elements.
Lexer changes:
- The custom tag matcher now parses attributes (including single- and
double-quoted values with entity decoding) and recognises comments
(<!-- ... -->), CDATA sections (<![CDATA[ ... ]]>), processing
instructions (<? ... ?>, including the XML declaration) and
DOCTYPE declarations with optional internal subsets. Comments, PIs
and DOCTYPEs are emitted as an #XIG token and dropped via IGNORE;
CDATA is emitted verbatim as #TX with no entity processing.
- A text modifier decodes the five predefined entities (amp, lt, gt,
quot, apos) plus numeric character references (&#N; and &#xN;)
from text nodes; attribute values are decoded inline.
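The entity decoding described above can be sketched roughly as follows. This is an illustration, not the plugin's code: `decodeEntitiesSketch` and its `custom` parameter are hypothetical names, and unknown names are simply left untouched.

```typescript
// The five entities predefined by XML 1.0.
const PREDEFINED = new Map<string, string>([
  ['amp', '&'], ['lt', '<'], ['gt', '>'], ['quot', '"'], ['apos', "'"],
])

// Hypothetical helper: decodes predefined, custom and numeric references
// in a single pass, so "&amp;lt;" becomes "&lt;" and is not decoded twice.
function decodeEntitiesSketch(
  text: string,
  custom: Map<string, string> = new Map(),
): string {
  const ref = /&(#x[0-9a-fA-F]+|#[0-9]+|[A-Za-z_][\w.-]*);/g
  return text.replace(ref, (whole: string, body: string) => {
    if (body.charAt(0) === '#') {
      const hex = body.charAt(1) === 'x'
      const cp = parseInt(body.slice(hex ? 2 : 1), hex ? 16 : 10)
      // Leave out-of-range references as written instead of throwing.
      return cp <= 0x10ffff ? String.fromCodePoint(cp) : whole
    }
    const named = PREDEFINED.get(body)
    if (named !== undefined) return named
    const extra = custom.get(body)
    return extra !== undefined ? extra : whole
  })
}
```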
Data structure changes:
- Each element now has `attributes`, `localName` and optional
`prefix`/`namespace` fields in addition to `name` and `children`.
- A post-parse walk resolves namespace URIs from xmlns/xmlns:*
declarations across nested scopes with proper inheritance and
override semantics.
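A minimal sketch of such a post-parse walk, assuming the element fields listed above. The names `NsElem` and `resolveNamespacesSketch` are hypothetical, and this version does no validation of the declarations it finds.

```typescript
interface NsElem {
  name: string
  attributes: Record<string, string>
  children: Array<string | NsElem>
  localName?: string
  prefix?: string
  namespace?: string
}

// Hypothetical walk: each element gets its own copy of the parent's
// prefix map, so declarations are inherited by descendants but never
// leak to siblings. The default namespace is stored under the key ''.
function resolveNamespacesSketch(
  el: NsElem,
  parentScope: Map<string, string> = new Map(),
): void {
  const scope = new Map(parentScope)
  for (const key of Object.keys(el.attributes)) {
    if (key === 'xmlns') scope.set('', el.attributes[key])
    else if (key.indexOf('xmlns:') === 0) scope.set(key.slice(6), el.attributes[key])
  }
  const colon = el.name.indexOf(':')
  const prefix = colon < 0 ? '' : el.name.slice(0, colon)
  el.localName = colon < 0 ? el.name : el.name.slice(colon + 1)
  if (prefix !== '') el.prefix = prefix
  const uri = scope.get(prefix)
  if (uri !== undefined) el.namespace = uri
  for (const child of el.children) {
    if (typeof child !== 'string') resolveNamespacesSketch(child, scope)
  }
}
```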
Options:
- `namespaces` (default true) - enable namespace resolution
- `entities` (default true) - enable entity decoding
- `customEntities` - additional named entities
Grammar:
- `xml` rule skips whitespace text nodes between the document
prolog (declaration, DOCTYPE, comments) and the root element, and
after the root element, so real-world documents with blank lines
parse cleanly.
- go/xml.go: Go port of the XML plugin, with the same data shape and
feature set as the TypeScript version (elements + attributes + mixed
content, predefined + numeric entity decoding, namespace resolution,
comments/CDATA/PI/DOCTYPE handling). Goes through jsonic/go's
declarative GrammarSpec with auto-wired @xml-bc / @child-bc state
actions.
- go/xml_test.go: Go test suite driven by the shared TSV spec files
plus an explicit jsonic-embedded-XML test case.
- test/spec/*.tsv: shared parse fixtures with four columns (name,
input, expected, opts). Input uses escape sequences (\n \r \t \\);
expected is raw JSON or `ERROR`/`ERROR:code`; opts is optional plugin
options JSON. Splits cases across basic, attributes, entities,
namespaces, structure, errors, and a w3c spec of standardised/
real-world XML documents (Atom, SOAP, SVG, RSS, XHTML, DOCTYPEs,
not-well-formed).
- test/xml.test.ts: the TypeScript test suite now auto-discovers and
runs every TSV spec file, and adds the jsonic-embedded-XML test case.
- Remove the leftover CSV Go package, CSV docs, CSV fixtures, and
coverage artifact from the original repo layout.
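To make the TSV fixture format concrete, here is a rough sketch of how one row could be read. `SpecCase`, `unescapeInput` and `parseSpecLine` are hypothetical names, and the real test runners may differ in details such as header or comment handling.

```typescript
interface SpecCase {
  name: string
  input: string
  expected: string // raw JSON, or ERROR / ERROR:code
  opts?: string    // optional plugin options JSON
}

// Decode the four escape sequences the fixtures use: \n \r \t \\
// A single regex pass keeps "\\n" (escaped backslash, then n) intact.
function unescapeInput(s: string): string {
  return s.replace(/\\([nrt\\])/g, (_m: string, c: string) =>
    c === 'n' ? '\n' : c === 'r' ? '\r' : c === 't' ? '\t' : '\\')
}

// Split one tab-separated row into its four columns.
function parseSpecLine(line: string): SpecCase {
  const cols = line.split('\t')
  return {
    name: cols[0] || '',
    input: unescapeInput(cols[1] || ''),
    expected: cols[2] || '',
    opts: cols[3] || undefined,
  }
}
```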
embed mode
----------
Adds an `embed: true` plugin option that extends Jsonic's own grammar so
a literal XML element (`<tag>…</tag>` or `<tag/>`) can appear wherever
a Jsonic value is expected — inside maps, lists, or at the top level.
Default behaviour (`embed: false`) remains pure-XML parsing with the
JSON rules stripped.
When `embed: true` the plugin:
- keeps the full JSON/JSONIC grammar in place, including the
structural fixed tokens `{ } [ ] : ,`;
- splices two alternates into the `val` rule so `#XOP`/`#XSC` tokens
dispatch to the `element` rule;
- tracks XML nesting depth in `ctx.u.xmlDepth`: while depth > 0 the
custom matcher also claims any run of non-`<` characters as a
single `#TX` node (optionally entity-decoded), so Jsonic's lexer
can't reinterpret a comma or colon inside XML text as a JSON
separator;
- resolves namespaces on close of an `element` rule that sits
directly under a `val` rule.
The embedded-XML tests in both `test/xml.test.ts` and `go/xml_test.go`
now use real literal XML in Jsonic source instead of stuffing the
document inside a string.
W3C XML Conformance Test Suite
------------------------------
Added `scripts/fetch-xml-suite.sh` to download the 2013-09-23 snapshot
of the W3C XML Test Suite (xmltest) into `test/xmlconf/` (gitignored —
the suite is owned by W3C and contributors and is not redistributed).
Both languages pick it up automatically when present:
- `go/xmlconf_test.go` iterates `xmltest/valid/sa/*.xml` and
`xmltest/not-wf/sa/*.xml`, counting successful parses and
expected rejections, and asserts each count stays above a
regression floor.
- An equivalent `describe(..., { skip: ... })` block in
`test/xml.test.ts` does the same for Node.
Current numbers (regression guard in parentheses):
- valid/sa : 116 / 120 parsed (floor 110)
- not-wf/sa : 39 / 186 rejected (floor 30)
The handful of `valid/sa` misses are UTF-16 BOM files and tests that
use non-Latin tag names — both out of scope for the current parser.
Many `not-wf/sa` tests hinge on character-level WF constraints our
structural parser doesn't enforce, hence the conservative floor.
Add a batch of XML 1.0 well-formedness constraints that the structural
matcher can enforce without DTD support. The custom matcher now also
owns text-token emission whenever the parser is inside an open
element (depth > 0) in both pure and embed modes, so the same
validation path applies to all character data.
New errors raised at lex time:
  comment_double_dash        -- inside a comment body
  cdata_terminator_in_text   ]]> in non-CDATA character data
  pi_target_invalid          <? ?> with missing/invalid target
  lt_in_attr_value           literal `<` in an attribute value
  bad_entity_ref             malformed `&...;` reference
                             (in text or attribute values)
  duplicate_attribute        same attribute name twice in one tag
  xml_invalid_tag            </> empty close tag
W3C xmltest/not-wf/sa rejection rate climbs from 39/186 to 54/186 with
no regression in valid/sa (116/120). All 87 TS tests and the Go suite
still pass.
XML 1.0 §2.2 only allows tab (#x09), LF (#x0A), CR (#x0D) and code points >= #x20 in document content. Other C0 controls (form feed, ESC, ...) make a document not well-formed. Apply a `checkChars` validation in every place the matcher emits text-like content: character data inside an open element, CDATA section bodies, comment bodies, processing instruction bodies, and attribute values. The new error code is `invalid_xml_char`. W3C xmltest/not-wf/sa rejection rate moves from 54/186 to 58/186 with no regression in valid/sa.
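A character check along these lines might look like the sketch below. The function names are hypothetical. It follows the full `Char` production of XML 1.0 §2.2, so it also rejects lone surrogates, U+FFFE and U+FFFF, which goes slightly beyond the C0-control case described above.

```typescript
// XML 1.0 Char production:
// #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
function isXmlChar(cp: number): boolean {
  return cp === 0x9 || cp === 0xa || cp === 0xd ||
    (cp >= 0x20 && cp <= 0xd7ff) ||
    (cp >= 0xe000 && cp <= 0xfffd) ||
    (cp >= 0x10000 && cp <= 0x10ffff)
}

// Returns the string index of the first disallowed character, or -1.
// Walks by code point so surrogate pairs count as one character.
function firstInvalidChar(text: string): number {
  let i = 0
  while (i < text.length) {
    const cp = text.codePointAt(i) as number
    if (!isXmlChar(cp)) return i
    i += cp > 0xffff ? 2 : 1
  }
  return -1
}
```

A caller would raise `invalid_xml_char` whenever the result is not -1.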
XML 1.0 §2.1 requires exactly one root (document) element. The xml
rule previously allowed `r: xml` to skip trailing whitespace and then
re-attempt `{ p: element }`, which let documents with multiple
top-level elements parse with the last one winning.
Track a per-parse `ctx.u.rootSeen` flag, set in `@xml-bc` once the
root element's node has been hoisted. Add a new `@no-root-yet`
condition gating the `{ p: element }` alternate so subsequent attempts
fail with "unexpected" rather than silently producing a wrong tree.
W3C xmltest/not-wf/sa rejection rate moves from 58/186 to 60/186.
Replace the ASCII-only NameStartChar/NameChar regex with the full XML
1.0 Fifth Edition character set:
NameStartChar = ':' | [A-Z] | '_' | [a-z] | the Unicode letter and
ideograph blocks listed in §2.3 [4]
NameChar = NameStartChar | '-' | '.' | [0-9] | #xB7 |
combining-mark blocks (§2.3 [4a])
Surrogate pairs in the JS implementation and multibyte UTF-8 sequences
in the Go implementation are now read as single characters via a
shared `readName` helper. The same predicates also gate entity-
reference name validation.
W3C xmltest/valid/sa pass count moves from 116/120 to 117/120 (the
Thai-named element test); the remaining three misses are UTF-16/
UTF-32 BOM files. Two new TSV cases (`tag-name-unicode-thai`,
`tag-name-unicode-greek`) lock in the behaviour for both runtimes.
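The Fifth Edition name productions translate fairly directly into range tables. The sketch below is illustrative only: `readNameAt` is a hypothetical stand-in, not the plugin's `readName`, and the ranges are copied from XML 1.0 §2.3 productions [4] and [4a].

```typescript
type Range = [number, number]

// NameStartChar, production [4].
const NAME_START: Range[] = [
  [0x3a, 0x3a], [0x41, 0x5a], [0x5f, 0x5f], [0x61, 0x7a],
  [0xc0, 0xd6], [0xd8, 0xf6], [0xf8, 0x2ff], [0x370, 0x37d],
  [0x37f, 0x1fff], [0x200c, 0x200d], [0x2070, 0x218f],
  [0x2c00, 0x2fef], [0x3001, 0xd7ff], [0xf900, 0xfdcf],
  [0xfdf0, 0xfffd], [0x10000, 0xeffff],
]

// Additions for NameChar, production [4a]: '-' '.' [0-9] #xB7 and
// the two combining/extender ranges.
const NAME_EXTRA: Range[] = [
  [0x2d, 0x2e], [0x30, 0x39], [0xb7, 0xb7], [0x300, 0x36f], [0x203f, 0x2040],
]

function inRanges(cp: number, ranges: Range[]): boolean {
  return ranges.some(([lo, hi]) => cp >= lo && cp <= hi)
}

function isNameStartChar(cp: number): boolean {
  return inRanges(cp, NAME_START)
}

function isNameChar(cp: number): boolean {
  return isNameStartChar(cp) || inRanges(cp, NAME_EXTRA)
}

// Hypothetical reader: returns the name starting at `start`, or '' if
// the character there cannot start a name. Advances by code point so a
// surrogate pair is treated as one character.
function readNameAt(src: string, start: number): string {
  let i = start
  while (i < src.length) {
    const cp = src.codePointAt(i) as number
    const ok = i === start ? isNameStartChar(cp) : isNameChar(cp)
    if (!ok) break
    i += cp > 0xffff ? 2 : 1
  }
  return src.slice(start, i)
}
```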
The custom matcher now strips a UTF-8 BOM (raw bytes EF BB BF or the
single character U+FEFF) at sI=0 in both runtimes, so files saved with
a BOM parse cleanly without the caller having to massage the input.
For UTF-16 / UTF-32 encoded input the runtime can't sniff bytes from
inside an already-decoded JS / Go string. The plugin therefore exposes
a public `decodeBOM(src)` helper:
- Go: `xml.DecodeBOM(string)` — accepts a string of raw bytes,
detects UTF-8/16/32 BOMs and transcodes to a UTF-8 Go
string (with `unicode/utf16` for the UTF-16 paths).
- TS: `decodeBOM(src)` — accepts either a Node Buffer / Uint8Array
or a "binary" JS string, detects UTF-8/16/32 BOMs, falls
back to UTF-8 when no BOM is present, and returns a decoded
Unicode string. A leading U+FEFF is stripped if the input
is already a Unicode string.
The W3C conformance test runners pass file contents through this
helper before parsing, so the three UTF-16 documents in
xmltest/valid/sa now parse:
valid/sa: 120 / 120 parsed successfully
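The BOM sniffing step can be sketched as below. This is not the plugin's `decodeBOM`: `detectBOM` and `decodeUtf16` are hypothetical helpers that cover only detection and the UTF-16 path, and the UTF-8 and UTF-32 transcoding is left out.

```typescript
type BomInfo = {
  encoding: 'utf-8' | 'utf-16le' | 'utf-16be' | 'utf-32le' | 'utf-32be'
  length: number // number of BOM bytes to skip
}

function detectBOM(b: Uint8Array): BomInfo | null {
  // UTF-32 must be tested first: its little-endian BOM (FF FE 00 00)
  // begins with the UTF-16LE BOM (FF FE).
  if (b.length >= 4 && b[0] === 0xff && b[1] === 0xfe && b[2] === 0 && b[3] === 0)
    return { encoding: 'utf-32le', length: 4 }
  if (b.length >= 4 && b[0] === 0 && b[1] === 0 && b[2] === 0xfe && b[3] === 0xff)
    return { encoding: 'utf-32be', length: 4 }
  if (b.length >= 3 && b[0] === 0xef && b[1] === 0xbb && b[2] === 0xbf)
    return { encoding: 'utf-8', length: 3 }
  if (b.length >= 2 && b[0] === 0xff && b[1] === 0xfe)
    return { encoding: 'utf-16le', length: 2 }
  if (b.length >= 2 && b[0] === 0xfe && b[1] === 0xff)
    return { encoding: 'utf-16be', length: 2 }
  return null
}

// JS strings are sequences of UTF-16 code units, so each byte pair maps
// directly onto one string element; surrogate pairs survive unchanged.
function decodeUtf16(b: Uint8Array, offset: number, littleEndian: boolean): string {
  let out = ''
  for (let i = offset; i + 1 < b.length; i += 2) {
    const unit = littleEndian ? b[i] | (b[i + 1] << 8) : (b[i] << 8) | b[i + 1]
    out += String.fromCharCode(unit)
  }
  return out
}
```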
XML 1.0 §2.11 requires that any literal CR (#xD) or CR-LF (#xD #xA) be replaced with a single LF (#xA) before parsing. §3.3.3 further requires that for CDATA-typed attributes (the default in the absence of a DTD) every TAB / LF / CR in the source be replaced with a single SPACE before entity references are decoded. Add `normaliseLineEndings` and `normaliseAttrWhitespace` helpers in both runtimes and apply them in the matcher's text, CDATA and attribute paths. Five new TSV cases (`text-crlf-normalised`, `text-cr-normalised`, `attr-tab-normalised`, `attr-newline-normalised`, `attr-crlf-normalised`) cover the new behaviour.
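Both normalisation steps reduce to a single regular expression each. The sketch below reuses the helper names from the description for readability, but it is an illustrative re-implementation rather than the plugin's code.

```typescript
// §2.11: CR LF and lone CR both become a single LF.
function normaliseLineEndings(s: string): string {
  return s.replace(/\r\n?/g, '\n')
}

// §3.3.3 for CDATA-typed attributes: each TAB, LF or CR becomes one
// space. CR LF is matched first so the pair yields a single space,
// which is what applying §2.11 beforehand would give.
function normaliseAttrWhitespace(s: string): string {
  return s.replace(/\r\n|[\t\n\r]/g, ' ')
}
```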
XML 1.0 §2.10 (xml:space) and §2.12 (xml:lang) define two special attributes that an element may use to signal whitespace handling and language identification. Both inherit down through descendants, just like xmlns declarations. Fold xml:space and xml:lang propagation into the existing namespace walk via a shared `xmlScope` value that carries the active prefix map plus the current `space` and `lang`. Each element gains a `space` field when the active value differs from the default "default" (typically "preserve"), and a `lang` field when any in-scope element specifies xml:lang. A new test/spec/xmlspace-lang.tsv covers the inherited-and-overridden cases for both runtimes.
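The inheritance rule can be sketched as a small recursive walk. `ScopedElem` and `applyXmlScope` are hypothetical names, and this version tracks only `space` and `lang`, whereas the change described above folds them into the namespace walk.

```typescript
interface ScopedElem {
  name: string
  attributes: Record<string, string>
  children: Array<string | ScopedElem>
  space?: string
  lang?: string
}

// An element's own xml:space / xml:lang overrides the inherited value;
// otherwise the parent's active value is passed down unchanged.
function applyXmlScope(el: ScopedElem, space: string = 'default', lang?: string): void {
  const ownSpace = el.attributes['xml:space']
  const ownLang = el.attributes['xml:lang']
  const activeSpace = ownSpace !== undefined ? ownSpace : space
  const activeLang = ownLang !== undefined ? ownLang : lang
  // Only record `space` when it differs from the default.
  if (activeSpace !== 'default') el.space = activeSpace
  if (activeLang !== undefined) el.lang = activeLang
  for (const child of el.children) {
    if (typeof child !== 'string') applyXmlScope(child, activeSpace, activeLang)
  }
}
```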
Per Namespaces in XML 1.0 §3 (namespace constraint "Reserved Prefixes and Namespace Names"):
- The "xml" prefix is fixed to http://www.w3.org/XML/1998/namespace.
It MAY be redeclared but only to that exact URI; redeclaring it
to any other namespace name is an error.
- The "xmlns" prefix is fixed to http://www.w3.org/2000/xmlns/ and
MUST NOT be declared.
- Neither URI may be bound to any other prefix or used as the
default namespace.
- A prefixed element or attribute name that has no in-scope binding
is an error (xmlns and xml are implicitly bound).
The namespace resolver now pre-binds the xml prefix, validates each
xmlns/xmlns:* declaration, and walks the tree checking that every
prefixed element and attribute has a binding. Errors short-circuit
resolution and surface as parse errors via two new codes:
reserved_namespace
unbound_prefix
Six new TSV cases under test/spec/errors.tsv exercise the
reserved-prefix and unbound-prefix paths; two positive cases under
test/spec/namespaces.tsv lock in the implicit `xml:` binding and the
correct explicit `xmlns:xml=` redeclaration.
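The declaration rules listed above fit into one small predicate. This is a sketch with a hypothetical function name; only the `reserved_namespace` code and the two URIs come from the text.

```typescript
const XML_NS = 'http://www.w3.org/XML/1998/namespace'
const XMLNS_NS = 'http://www.w3.org/2000/xmlns/'

// Hypothetical check for a single xmlns / xmlns:* declaration.
// `prefix` is '' for a default-namespace declaration (xmlns="...").
// Returns an error code, or null when the declaration is allowed.
function checkNamespaceDeclaration(prefix: string, uri: string): string | null {
  // The xmlns prefix must never be declared.
  if (prefix === 'xmlns') return 'reserved_namespace'
  // The xml prefix may be redeclared, but only to its fixed URI.
  if (prefix === 'xml') return uri === XML_NS ? null : 'reserved_namespace'
  // Neither reserved URI may be bound to another prefix or used as
  // the default namespace.
  if (uri === XML_NS || uri === XMLNS_NS) return 'reserved_namespace'
  return null
}
```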
Extract `<!ENTITY name "value">` general internal entity declarations
from the DOCTYPE internal subset and use them when resolving entity
references in text and attribute values. Parameter entity
declarations (`<!ENTITY % name ...>`) and external entity
declarations (`<!ENTITY name SYSTEM "...">` / `PUBLIC "..."`) are
recognised but skipped — we don't fetch external resources.
Implementation:
- The DOCTYPE matcher path now records the byte range of the `[ ]`
internal subset and runs `parseDoctypeEntities` over it. The
extracted map is stored on the per-parse context
(`ctx.u.dtdEntities` in TS, `ctx.U["dtdEntities"]` in Go).
- The entity decoder is now a closure that takes an optional `dtd`
map. The matcher's text and attribute paths look the map up
via `lex.ctx` and pass it through. The five predefined
entities and any plugin-time `customEntities` always take
precedence, matching the XML 1.0 rule that the predefined
entities are always available.
- Recursive entity expansion is supported, with cycle detection
via a `seen` set: a cyclic reference breaks the cycle (the
original `&name;` is left in place) instead of looping.
Entity values are stored verbatim. Character and entity references
inside an entity value are expanded only when the outer entity is
referenced (matches XML 1.0 §4.4 "Bypassed" treatment for general
entity declarations).
A new test/spec/dtd-entities.tsv covers the basic, recursive, and
edge cases (single-quoted values, parameter-entity skip, external
entity skip, predefined-entity precedence) for both runtimes. All
118 TS tests and the Go suite pass.
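Recursive expansion with a `seen` set can be sketched as follows. `expandEntitySketch` is a hypothetical name, the `dtd` map stands in for the extracted declarations, and predefined/custom entity precedence and numeric references are left out to keep the cycle handling visible.

```typescript
// Expands `&name;` using values stored verbatim in `dtd`. Nested
// references are expanded lazily, at the point of use.
function expandEntitySketch(
  name: string,
  dtd: Map<string, string>,
  seen: Set<string> = new Set(),
): string {
  const value = dtd.get(name)
  // Unknown name, or a name already being expanded further up the
  // call chain: leave the reference as written to break the cycle.
  if (value === undefined || seen.has(name)) return '&' + name + ';'
  seen.add(name)
  const out = value.replace(/&([A-Za-z_][\w.-]*);/g, (_m: string, inner: string) =>
    expandEntitySketch(inner, dtd, seen))
  // Remove on the way out so the same entity may be referenced twice
  // along different, non-cyclic paths.
  seen.delete(name)
  return out
}
```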
XML 1.0 §4.1 requires every named entity reference to resolve to a
declared entity (predefined, custom, or DOCTYPE-declared). Add a new
`strictEntities` option (default `true`) that enforces this in
`checkEntityRefs`. When set to `false`, references to unknown names
pass through unexpanded (legacy behaviour useful for templating).
While testing the new check, the DOCTYPE depth tracker was found to
treat `]` and `>` characters inside quoted entity values as if they
ended the internal subset, which made declarations like
`<!ENTITY rsqb "]">` cut the subset short and any subsequent
`&rsqb;` references reach the validator as undeclared. The tracker
now skips over single- and double-quoted strings while walking the
DOCTYPE, restoring the W3C valid/sa pass count to 120/120.
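The quote-aware scan can be illustrated with a short sketch. `findSubsetEnd` is a hypothetical name, and the sketch deliberately ignores comments and processing instructions inside the subset, which a full tracker would also need to skip.

```typescript
// Returns the index of the `]` that closes the internal subset, or -1
// if it is unterminated. `start` is the index just after the opening `[`.
function findSubsetEnd(src: string, start: number): number {
  let i = start
  while (i < src.length) {
    const ch = src[i]
    if (ch === '"' || ch === "'") {
      // Jump to the matching quote so `]` and `>` inside a literal
      // are never mistaken for structure.
      const close = src.indexOf(ch, i + 1)
      if (close < 0) return -1
      i = close + 1
    } else if (ch === ']') {
      return i
    } else {
      i++
    }
  }
  return -1
}
```

With a declaration such as `<!ENTITY rsqb "]">` in the subset, a plain search for the first `]` stops inside the quoted value, while the scan above carries on to the real closing bracket.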
Conformance changes:
- valid/sa : 120/120 (unchanged)
- not-wf/sa : 60/186 -> 64/186 (+4 strict-entity catches)
The legacy "unknown-passthrough" test was renamed to
"unknown-rejected" with a new "unknown-passthrough-lenient" variant
that opts in via `{strictEntities: false}`.
Parse `<!ATTLIST element attr type defaultDecl>` declarations from
the DOCTYPE internal subset and use them to fill in attributes that
are missing from element instances. Both literal defaults and the
`#FIXED "value"` form are honoured; `#REQUIRED` and `#IMPLIED`
declarations contribute nothing because they have no default value.
Implementation:
- `parseDoctypeAttlists` scans for each `<!ATTLIST>` declaration,
skips the AttType (a bare uppercase identifier, an enumeration
`( ... )`, or `NOTATION ( ... )`), and collects the default
value. The result is keyed by element name then attribute
name and stored on the per-parse context as
`dtdAttrDefaults`.
- The `@element-open` and `@element-selfclose` actions consult
that map via `applyAttrDefaults` and merge in any defaults
that the parsed element does not already provide.
A new test/spec/dtd-attlist.tsv exercises basic defaults, override
by an instance attribute, multiple declarations on one element,
`#FIXED`, enumeration types, the no-default `#REQUIRED`/`#IMPLIED`
forms, and per-element scoping. All 126 TS tests and the Go suite
pass; W3C conformance numbers are unchanged.
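The merge step can be sketched as below, assuming defaults keyed by element name and then attribute name as described. `mergeAttrDefaults` and `AttrDefaults` are hypothetical names, not the plugin's `applyAttrDefaults`.

```typescript
// element name -> (attribute name -> default value)
type AttrDefaults = Map<string, Map<string, string>>

// Returns the instance attributes with any missing defaults filled in.
// Attributes written on the element always win over the DTD default.
function mergeAttrDefaults(
  elementName: string,
  attrs: Record<string, string>,
  defaults: AttrDefaults,
): Record<string, string> {
  const forElement = defaults.get(elementName)
  if (forElement === undefined) return attrs
  const merged: Record<string, string> = { ...attrs }
  forElement.forEach((value, attr) => {
    if (!Object.prototype.hasOwnProperty.call(merged, attr)) merged[attr] = value
  })
  return merged
}
```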