encoding/unicode: fix surrogate pair handling in UTF-16 decoder#62
Open
mohammadmseet-hue wants to merge 1 commit intogolang:masterfrom
Open
encoding/unicode: fix surrogate pair handling in UTF-16 decoder#62mohammadmseet-hue wants to merge 1 commit intogolang:masterfrom
mohammadmseet-hue wants to merge 1 commit intogolang:masterfrom
Conversation
Contributor
|
This PR (HEAD: 7dc5635) has been imported to Gerrit for code review. Please visit Gerrit at https://go-review.googlesource.com/c/text/+/763560. Important tips:
|
The isHighSurrogate function incorrectly checks the low surrogate range (U+DC00..U+DFFF) instead of the high surrogate range (U+D800..U+DBFF). This causes two problems: 1. Two consecutive low surrogates are incorrectly consumed as a surrogate pair (4 bytes producing 1 replacement char), instead of being treated as two individual unpaired surrogates (2 bytes producing 1 replacement char each). This causes data loss. 2. An unpaired low surrogate followed by a valid high surrogate would not be combined into a pair, producing two replacement characters instead of a valid decoded character. The fix corrects isHighSurrogate to check U+D800..U+DBFF, adds isLowSurrogate for U+DC00..U+DFFF, and restructures the caller to only attempt pair decoding when the first code unit is a high surrogate and the second is a low surrogate. Unpaired surrogates of either type fall through to the existing RuneLen check which produces U+FFFD. This aligns with the Unicode Standard Chapter 3 Table 3-8 and the WHATWG Encoding Standard, and matches the behavior of Go's standard library unicode/utf16.Decode function.
7dc5635 to
96288b8
Compare
Contributor
|
Message from Gopher Robot: Patch Set 1: (1 comment) Please don’t reply on this GitHub thread. Visit golang.org/cl/763560. |
Contributor
|
This PR (HEAD: 96288b8) has been imported to Gerrit for code review. Please visit Gerrit at https://go-review.googlesource.com/c/text/+/763560. Important tips:
|
Contributor
|
Message from Mohammad Seet: Patch Set 2: (1 comment) Please don’t reply on this GitHub thread. Visit golang.org/cl/763560. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
The isHighSurrogate function in the UTF-16 decoder incorrectly
checks the low surrogate range (U+DC00..U+DFFF) instead of the
high surrogate range (U+D800..U+DBFF).
This causes two consecutive low surrogates to be incorrectly
consumed as a 4-byte surrogate pair, producing 1 replacement
character instead of 2. N consecutive low surrogates produce
N/2 replacement characters instead of N, causing data loss.
This violates:
definition: high U+D800..U+DBFF then low U+DC00..U+DFFF)
one U+FFFD)
Go's own unicode/utf16.Decode handles this correctly.
The fix:
surrogate and second is a low surrogate
existing RuneLen check