encoding/unicode: fix surrogate pair handling in UTF-16 decoder #62

Open
mohammadmseet-hue wants to merge 1 commit into golang:master from mohammadmseet-hue:fix-utf16-surrogate

@mohammadmseet-hue commented Apr 7, 2026

The isHighSurrogate function in the UTF-16 decoder incorrectly
checks the low surrogate range (U+DC00..U+DFFF) instead of the
high surrogate range (U+D800..U+DBFF).

func isHighSurrogate(r rune) bool {
    return 0xDC00 <= r && r <= 0xDFFF // BUG: low range
}

This causes two consecutive low surrogates to be incorrectly
consumed as a 4-byte surrogate pair, producing 1 replacement
character instead of 2. N consecutive low surrogates produce
N/2 replacement characters instead of N, causing data loss.

This violates:

  • Unicode Standard Chapter 3, Table 3-8 (surrogate pair
    definition: high U+D800..U+DBFF then low U+DC00..U+DFFF)
  • WHATWG Encoding Standard (each unpaired surrogate produces
    one U+FFFD)

Go's own unicode/utf16.Decode handles this correctly.

The fix:

  • Correct isHighSurrogate to check U+D800..U+DBFF
  • Add isLowSurrogate for U+DC00..U+DFFF
  • Only attempt pair decoding when first code unit is a high
    surrogate and second is a low surrogate
  • Unpaired surrogates of either type produce U+FFFD via the
    existing RuneLen check
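
The steps above can be sketched as follows. This is an illustration
under stated assumptions, not the code from the patch: decodeUnits
and its []uint16 input are hypothetical (the real decoder reads
bytes and has to handle endianness and short input), while
utf16.DecodeRune, utf8.RuneLen and utf8.RuneError are
standard-library names.

```go
package main

import (
	"fmt"
	"unicode/utf16"
	"unicode/utf8"
)

// isHighSurrogate reports whether r is in the high range U+D800..U+DBFF.
func isHighSurrogate(r rune) bool {
	return 0xD800 <= r && r <= 0xDBFF
}

// isLowSurrogate reports whether r is in the low range U+DC00..U+DFFF.
func isLowSurrogate(r rune) bool {
	return 0xDC00 <= r && r <= 0xDFFF
}

// decodeUnits is a hypothetical stand-in for the decoder loop. It works
// on whole code units and ignores byte order and buffering.
func decodeUnits(units []uint16) []rune {
	out := make([]rune, 0, len(units))
	for i := 0; i < len(units); i++ {
		r := rune(units[i])
		// Attempt pair decoding only for a high surrogate that is
		// immediately followed by a low surrogate.
		if isHighSurrogate(r) && i+1 < len(units) && isLowSurrogate(rune(units[i+1])) {
			r = utf16.DecodeRune(r, rune(units[i+1]))
			i++ // the low surrogate has been consumed
		}
		// A lone surrogate cannot be encoded in UTF-8, so RuneLen
		// returns -1 and the rune is replaced with U+FFFD.
		if utf8.RuneLen(r) < 0 {
			r = utf8.RuneError
		}
		out = append(out, r)
	}
	return out
}

func main() {
	fmt.Printf("%U\n", decodeUnits([]uint16{0xDC00, 0xDC01})) // two unpaired lows
	fmt.Printf("%U\n", decodeUnits([]uint16{0xD83D, 0xDE00})) // valid pair
	fmt.Printf("%U\n", decodeUnits([]uint16{0xD83D, 0x0041})) // unpaired high, then 'A'
}
```

In this sketch the two low surrogates yield two U+FFFD runes, the
valid pair yields U+1F600, and the unpaired high surrogate yields
U+FFFD followed by U+0041.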

@gopherbot
Contributor

This PR (HEAD: 7dc5635) has been imported to Gerrit for code review.

Please visit Gerrit at https://go-review.googlesource.com/c/text/+/763560.

Important tips:

  • Don't comment on this PR. All discussion takes place in Gerrit.
  • You need a Gmail or other Google account to log in to Gerrit.
  • To change your code in response to feedback:
    • Push a new commit to the branch used by your GitHub PR.
    • A new "patch set" will then appear in Gerrit.
    • Respond to each comment by marking as Done in Gerrit if implemented as suggested. You can alternatively write a reply.
    • Critical: you must click the blue Reply button near the top to publish your Gerrit responses.
    • Multiple commits in the PR will be squashed by GerritBot.
  • The title and description of the GitHub PR are used to construct the final commit message.
    • Edit these as needed via the GitHub web interface (not via Gerrit or git).
    • You should word wrap the PR description at ~76 characters unless you need longer lines (e.g., for tables or URLs).
  • See the Sending a change via GitHub and Reviews sections of the Contribution Guide as well as the FAQ for details.

The isHighSurrogate function incorrectly checks the low
surrogate range (U+DC00..U+DFFF) instead of the high
surrogate range (U+D800..U+DBFF). This causes two problems:

1. Two consecutive low surrogates are incorrectly consumed
   as a surrogate pair (4 bytes producing 1 replacement
   char), instead of being treated as two individual
   unpaired surrogates (2 bytes producing 1 replacement
   char each). This causes data loss.

2. A valid high surrogate followed by a low surrogate is not
   combined into a pair, because the high surrogate fails the
   isHighSurrogate check. This produces replacement
   characters instead of a valid decoded character.

The fix corrects isHighSurrogate to check U+D800..U+DBFF,
adds isLowSurrogate for U+DC00..U+DFFF, and restructures
the caller to only attempt pair decoding when the first
code unit is a high surrogate and the second is a low
surrogate. Unpaired surrogates of either type fall through
to the existing RuneLen check which produces U+FFFD.

This aligns with the Unicode Standard Chapter 3 Table 3-8
and the WHATWG Encoding Standard, and matches the behavior
of Go's standard library unicode/utf16.Decode function.
@gopherbot
Contributor

Message from Gopher Robot:

Patch Set 1:

(1 comment)


Please don’t reply on this GitHub thread. Visit golang.org/cl/763560.
After addressing review feedback, remember to publish your drafts!

@gopherbot
Contributor

This PR (HEAD: 96288b8) has been imported to Gerrit for code review.

Please visit Gerrit at https://go-review.googlesource.com/c/text/+/763560.

@gopherbot
Contributor

Message from Mohammad Seet:

Patch Set 2:

(1 comment)

