Skip to content

find_duplicated_keys misses escape-spelled duplicate keys (does not decode .strings escapes) #740

Description

@jkmassel

TL;DR

StringsFileValidationHelper.find_duplicated_keys compares the literal source bytes of a quoted key, but plutil (which L10nHelper.read_strings_file_as_hash uses to build the comparison hash) decodes .strings escape sequences first. Two entries whose keys are spelled differently but decode to the same bytes — e.g. "\U0041" and "A" — are one duplicate to plutil (which silently keeps only the last value) and two distinct keys to the scanner. The scanner neither reports the duplicate nor raises, so a placeholder change hidden by that duplicate is silently missed.

Root cause

helper/ios/ios_strings_file_validation_helper.rb buffers a quoted key's bytes verbatim (it writes the backslash and the following character as-is) and never decodes .strings escapes, while plutil normalizes them. So the scanner's key string differs from plutil's canonical key whenever the spelling isn't byte-identical.

Confirmed repro

"\U0041" = "first";
"A"      = "second";

plutil -convert json{"A":"second"} (collapsed, last value wins). find_duplicated_keys{} (no duplicate detected, no raise). A placeholder change between first and second would be invisible.

Scope & realism

  • Pre-existing. The escape handling predates Add ios_lint_localization_placeholder_changes action #738 (git blame3cda14f1 / 821594e0). The new ios_lint_localization_placeholder_changes action inherits the gap by newly depending on the scanner; the same gap already exists in ios_lint_localizations' check_duplicate_keys path.
  • ≈ nil in practice. It needs the same logical key present twice in one base file, spelled with inconsistent escapes ("\U0041" and "A"). Real .strings keys are identifiers or English source text; an actual duplicate would be spelled identically both times and is caught. Nobody hand-writes \U0041 for A. Filing this for completeness, not urgency.
  • Survives the fail-closed change. PR Add ios_lint_localization_placeholder_changes action #738's "fail closed when a .strings file can't be scanned" only made scanner raises safe; here the scanner returns {} without raising, so it's a silent miss the fail-closed path doesn't cover.

Decode spec (what a faithful canonicalizer must reproduce)

Derived empirically against plutil (macOS 26.x). The surprises are what make full decoding non-trivial:

  • Octal \NNN — greedy, max 3 digits; overflow wraps & 0xFF (\501A, \400→NUL); high bytes 0x80–0xFD decode via the NeXTSTEP string encoding, not Latin-1 (decisive proof: \271 U+201E, where Latin-1 would be ¹); 0xFE/0xFF are undefined → NUL (\377\0). Requires shipping a 256-entry NeXTSTEP→codepoint table — no Ruby built-in encoding reproduces it.
  • Unicode \U — uppercase only; variable 0–4 hex digits (not fixed 4); 0 digits → NUL; BMP only; surrogate pairs combine (\UD83D\UDE00→😀); a lone/broken surrogate rejects the whole file. (\U0001F600 is not 😀 — it's \x01 + literal F600.)
  • Strip rule\ + anything not recognized → drop the backslash, keep the char. So \u lowercase is literal u, \x41 is literal x41 (not C hex), \e is not ESC.
  • Recognized single-char escapes — exactly \a \b \f \n \r \t \v \" \\ (plus \U, octal). C's set minus \e, \?, \x.
  • Backslash runs\\ → one \; an odd run immediately before the closing quote rejects the file.
  • No Unicode normalizationplutil keeps NFC and NFD distinct, so compare raw decoded bytes. Adding unicode_normalize would manufacture false-positive duplicates.

Differential examples (the scanner currently misses every pair)

Spelling plutil canonical key Note
"\U0041""\101""A" A unicode ≡ octal ≡ literal
"\U9" TAB (0x09) \U variable width, 1 digit
"\UD83D\UDE00""😀" 😀 surrogate pair combines
"A" literal u0041 lowercase \u is not unicode
"\x41""x41" (≠ "A") x41 \x is not hex
"\e""e" (≠ ESC) e \e not recognized
"\200" U+00A0 NeXTSTEP, not Latin-1
"\271" U+201E decisive NeXTSTEP proof
"\377""\400""\0" NUL 0xFE/0xFF undefined; & 0xFF overflow
"a\tb" ≡ literal-TAB key a⇥b escaped ≡ literal control char
"\z""z", "\."".", "\$""$" letter/punct only strip rule
NFC é vs NFD é 2 distinct keys no normalization

Already safe (no change needed)

  • UTF-16 / BOM duplicate detection works (read_utf8_lines uses rb:BOM|UTF-8).
  • No Unicode normalization needed — the byte-comparing scanner already matches plutil.
  • Structural syntax is a strict fail-closed subset — across the full ASCII-punctuation sweep, quoted/unquoted collisions, comment hazards and multi-semicolon cases there is zero silent divergence and zero false duplicate; the only gap ($ / : in unquoted keys) RAISES (safe). The few fail-open edges (trailing backslash before EOF, value-side escaped quote, empty/BOM-only file) are all mooted because read_strings (plutil) now runs before the duplicate check and rejects those files first.

Proposed fix

Option A — count-mismatch backstop (recommended, right-sized). The action already has plutil's distinct-key count (new_strings.size). Have the scanner also return its total parsed-entry count; if scanner_entries > plutil_distinct_keys, plutil collapsed a duplicate the scanner's literal compare missed → fail closed, reusing the existing abort path. Across the full corpus this flags every missed-dup case with zero false positives (distinct-but-similar keys, NFC/NFD, and \U0001F600 vs 😀 all keep equal counts). No escape decoding, no NeXTSTEP table, no surrogate logic. Can live in the action without changing the scanner's raise behavior, so ios_lint_localizations is unaffected.

Option B — full canonicalization. Decode keys to match plutil byte-for-byte (the spec above) inside find_duplicated_keys, fixing both callers. Correct, but heavy (NeXTSTEP table, surrogate handling, backslash-run parity, reject conditions) — a faithful subset of CoreFoundation's old-style plist string decoder. Justified mainly if we also want cross-platform parsing off the macOS-only plutil shell-out, and it should be gated on a differential-parity test against plutil.

Acceptance criteria

  • The repro above is no longer a silent miss (detected as a duplicate, or fails closed).
  • A differential test corpus against plutil covering the spec rows (unicode / octal / named / strip / backslash-run / literal-vs-escaped), asserting no false positives on distinct-but-similar keys and NFC/NFD.

Related

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't workingrubyPull requests that update ruby code

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions