From 6c423249f34fa90e6feaf89842382acb1ca0d1ba Mon Sep 17 00:00:00 2001 From: Rex Jaeschke Date: Wed, 18 Mar 2026 10:25:59 -0400 Subject: [PATCH 1/5] Support UTF-8 string literals --- standard/lexical-structure.md | 29 ++++++++++++++++++++++++++--- 1 file changed, 26 insertions(+), 3 deletions(-) diff --git a/standard/lexical-structure.md b/standard/lexical-structure.md index 6b8639349..90216554f 100644 --- a/standard/lexical-structure.md +++ b/standard/lexical-structure.md @@ -904,6 +904,8 @@ A verbatim string literal consists of an `@` character followed by a double-quo In a verbatim string literal, the characters between the delimiters are interpreted verbatim, with the only exception being a *Quote_Escape_Sequence*, which represents one double-quote character. In particular, simple escape sequences, and hexadecimal and Unicode escape sequences are not processed in verbatim string literals. A verbatim string literal may span multiple lines. +All string literal forms may optionally have a trailing *Utf8_Suffix*. The representation of each form is discussed below. + ```ANTLR String_Literal : Regular_String_Literal @@ -911,7 +913,7 @@ String_Literal ; fragment Regular_String_Literal - : '"' Regular_String_Literal_Character* '"' + : '"' Regular_String_Literal_Character* '"' Utf8_Suffix? ; fragment Regular_String_Literal_Character @@ -927,7 +929,7 @@ fragment Single_Regular_String_Literal_Character ; fragment Verbatim_String_Literal - : '@"' Verbatim_String_Literal_Character* '"' + : '@"' Verbatim_String_Literal_Character* '"' Utf8_Suffix? 
; fragment Verbatim_String_Literal_Character @@ -942,6 +944,10 @@ fragment Single_Verbatim_String_Literal_Character fragment Quote_Escape_Sequence : '""' ; + +fragment Utf8_Suffix + : 'u8' | 'U8' + ; ``` > *Example*: The example @@ -974,7 +980,24 @@ fragment Quote_Escape_Sequence > *Note*: Since a hexadecimal escape sequence can have a variable number of hex digits, the string literal `"\x123"` contains a single character with hex value `123`. To create a string containing the character with hex value `12` followed by the character `3`, one could write `"\x00123"` or `"\x12"` + `"3"` instead. *end note* -The type of a *String_Literal* is `string`. +A *String_Literal* that does not contain a *Utf8_Suffix* is a ***UTF-16 string literal***, whose type is `string`. + +A *String_Literal* that contains a *Utf8_Suffix* is a ***UTF-8 string literal***, whose type is `System.ReadOnlySpan<byte>` (an indexable collection type), and whose value contains a UTF-8 byte representation of the string. A null terminator (a byte with value zero) is placed beyond the last byte in memory (and outside the length of the `ReadOnlySpan<byte>`) in order to support scenarios that expect null-terminated byte strings. A UTF-8 string literal is not a constant. A UTF-8 string literal without its *Utf8_Suffix* shall be valid UTF-16. (For example, `"\uDC00\uDD00"u8` is ill-formed as one low surrogate cannot be followed by another.) + +> *Note*: While every UTF-8 string literal is a `ReadOnlySpan<byte>`, not every `ReadOnlySpan<byte>` represents a UTF-8 string literal. See the description of UTF-8 string concatenation in [§12.13.5](expressions.md#12135-addition-operator). *end note* + + + +> *Note*: As `ReadOnlySpan<byte>` is a ref struct type, a UTF-8 string literal cannot be converted to `object` or used as a type parameter ([§16.2.3]( structs.md#1623-ref-modifier)). 
*end note* + + + +> *Example*: Here are examples of each form of string literal: +> | **Encoding** | **Type** | **Regular String Literal** | **Verbatim String Literal** | **Raw String Literal** | +> |--------------|----------------------|---------------------|--------------------|--------------------| +> | UTF-16 | `string` | `"Hello"` | `@"Hello"` | `"""Hello"""` | +> | UTF-8 | `ReadOnlySpan<byte>` | `"Hello"u8` | `@"Hello"u8` | `"""Hello"""u8` | +> *end example* Each string literal does not necessarily result in a new string instance. When two or more string literals that are equivalent according to the string equality operator ([§12.15.8](expressions.md#12158-string-equality-operators)), appear in the same assembly, these string literals refer to the same string instance. From 751bf0875b1ce9fe5088db9ee13a99b9bcb54238 Mon Sep 17 00:00:00 2001 From: Rex Jaeschke Date: Wed, 18 Mar 2026 10:34:42 -0400 Subject: [PATCH 2/5] Support UTF-8 string literals --- standard/expressions.md | 33 +++++++++++++++++++++++++++------ 1 file changed, 27 insertions(+), 6 deletions(-) diff --git a/standard/expressions.md b/standard/expressions.md index 2066e33ff..0efcb8c2d 100644 --- a/standard/expressions.md +++ b/standard/expressions.md @@ -4286,7 +4286,7 @@ Lifted ([§12.4.8](expressions.md#1248-lifted-operators)) forms of the unlifted For an operation of the form `x + y`, binary operator overload resolution ([§12.4.5](expressions.md#1245-binary-operator-overload-resolution)) is applied to select a specific operator implementation. The operands are converted to the parameter types of the selected operator, and the type of the result is the return type of the operator. -The predefined addition operators are listed below. For numeric and enumeration types, the predefined addition operators compute the sum of the two operands. When one or both operands are of type `string`, the predefined addition operators concatenate the string representation of the operands. 
+The predefined addition operators are listed below. For numeric and enumeration types, the predefined addition operators compute the sum of the two operands. When one or both operands are of type `string`, or both are of type `ReadOnlySpan<byte>`, the predefined addition operators concatenate the string representation of the operands. - Integer addition: @@ -4336,7 +4336,7 @@ The predefined addition operators are listed below. For numeric and enumeration ``` At run-time these operators are evaluated exactly as `(E)((U)x + (U)y`). -- String concatenation: +- UTF-16 string concatenation: ```csharp string operator +(string x, string y); @@ -4344,8 +4344,8 @@ The predefined addition operators are listed below. For numeric and enumeration string operator +(object x, string y); ``` - These overloads of the binary `+` operator perform string concatenation. If an operand of string concatenation is `null`, an empty string is substituted. Otherwise, any non-`string` operand is converted to its string representation by invoking the virtual `ToString` method inherited from type `object`. If `ToString` returns `null`, an empty string is substituted. - + These overloads of the binary `+` operator perform concatenation of UTF-16 strings. If an operand is `null`, an empty UTF-16 string is substituted. Otherwise, any non-`string` operand that is not a ref struct ([§16.2.3]( structs.md#1623-ref-modifier)) is converted to its UTF-16 string representation by invoking the virtual `ToString` method inherited from type `object`. If `ToString` returns `null`, an empty UTF-16 string is substituted. + > *Example*: > > @@ -4373,7 +4373,28 @@ The predefined addition operators are listed below. For numeric and enumeration > > *end example* - The result of the string concatenation operator is a `string` that consists of the characters of the left operand followed by the characters of the right operand. The string concatenation operator never returns a `null` value. 
A `System.OutOfMemoryException` may be thrown if there is not enough memory available to allocate the resulting string. + The result of the operator is a `string` that consists of the characters of the left operand followed by the characters of the right operand. The string concatenation operator never returns a `null` value. A `System.OutOfMemoryException` may be thrown if there is not enough memory available to allocate the resulting string. +- UTF-8 string concatenation: + + ```csharp + ReadOnlySpan<byte> operator +(ReadOnlySpan<byte> x, ReadOnlySpan<byte> y); + ``` + + This overload of the binary `+` operator performs concatenation of UTF-8 string literals and the concatenated results thereof (which is much more restrictive than for UTF-16 string concatenation). The operands shall be UTF-8-encoded values. + The result of the operator is a ReadOnlySpan<byte> that consists of the bytes of the left operand followed by the bytes of the right operand. The result may be used directly as an operand to the UTF-8 string concatenation operator. + + > *Example*: + > + > + > ```csharp + > ReadOnlySpan<byte> sp1 = "ABC"u8 + "DEF"u8; // OK + > ReadOnlySpan<byte> sp2 = sp1 + "DEF"u8; // error + > ReadOnlySpan<byte> sp3 = "ABC"u8 + "DEF"u8 + "123"u8; // OK + > ReadOnlySpan<byte> sp4 = "ABC"u8 + (ReadOnlySpan<byte>)stackalloc byte[] + > { (byte)'D', (byte)'E', (byte)'F', (byte)'\x0' }; // error + > ``` + > + > In the case of `sp1`, both operands are UTF-8 string literals. However, once `sp1` is initialized, that UTF-8 pedigree is no longer tracked. That is, `sp1` itself is not seen as being UTF-8 encoded. As such, it is not permitted to be an operand in the case of the initialization of `sp2`. In the initializer for `sp3`, the left pair of operands is evaluated, and as they are both UTF-8 string literals, the result is deemed to also be UTF-8 encoded, so it can further be used as the left operand of the right operator. 
In the case of `sp4`, while both operands are `ReadOnlySpan<byte>`s, only the left operand is UTF-8 encoded, even though the `Span<byte>` returned by `stackalloc` has the internal form of a UTF-8 string literal (that is, an array of bytes with a null-byte terminator). See [§6.4.5.6](lexical-structure.md#6456-string-literals). *end example* - Delegate combination. Every delegate type implicitly provides the following predefined operator, where `D` is the delegate type: ```csharp @@ -7493,7 +7514,7 @@ A *constant_expression* of type `nint` shall have a value in the range \[-214748 Only the following constructs are permitted in constant expressions: -- Literals (including the `null` literal). +- Literals (including the `null` literal, but excluding UTF-8 string literals). - Constant interpolated strings. - References to `const` members of class, struct, and interface types. - References to members of enumeration types. From 602d009788bbd470aafac4b37411485c72bed5a4 Mon Sep 17 00:00:00 2001 From: Rex Jaeschke Date: Wed, 18 Mar 2026 10:36:57 -0400 Subject: [PATCH 3/5] Support UTF-8 string literals --- standard/attributes.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/standard/attributes.md b/standard/attributes.md index 9384f8602..525132098 100644 --- a/standard/attributes.md +++ b/standard/attributes.md @@ -826,7 +826,7 @@ The line number may be affected by `#line` directives ([§6.5.8](lexical-structu The attribute `System.Runtime.CompilerServices.CallerFilePathAttribute` is allowed on optional parameters when there is a standard implicit conversion ([§10.4.2](conversions.md#1042-standard-implicit-conversions)) from `string` to the parameter’s type. -If a function invocation from a location in source code omits an optional parameter with the `CallerFilePathAttribute`, then a string literal representing that location’s file path is used as an argument to the invocation instead of the default parameter value. 
+If a function invocation from a location in source code omits an optional parameter with the `CallerFilePathAttribute`, then a UTF-16 string literal representing that location’s file path is used as an argument to the invocation instead of the default parameter value. The format of the file path is implementation-dependent. @@ -836,7 +836,7 @@ The file path may be affected by `#line` directives ([§6.5.8](lexical-structure The attribute `System.Runtime.CompilerServices.CallerMemberNameAttribute` is allowed on optional parameters when there is a standard implicit conversion ([§10.4.2](conversions.md#1042-standard-implicit-conversions)) from `string` to the parameter’s type. -If a function invocation from a location within the body of a function member or within an attribute applied to the function member itself or its return type, parameters or type parameters in source code omits an optional parameter with the `CallerMemberNameAttribute`, then a string literal representing the name of that member is used as an argument to the invocation instead of the default parameter value. (In the case of a function invocation from a top-level statement ([§7.1.3](basic-concepts.md#713-using-top-level-statements)), the member name is that generated by the implementation.) +If a function invocation from a location within the body of a function member or within an attribute applied to the function member itself or its return type, parameters or type parameters in source code omits an optional parameter with the `CallerMemberNameAttribute`, then a UTF-16 string literal representing the name of that member is used as an argument to the invocation instead of the default parameter value. (In the case of a function invocation from a top-level statement ([§7.1.3](basic-concepts.md#713-using-top-level-statements)), the member name is that generated by the implementation.) 
For invocations that occur within generic methods, only the method name itself is used, without the type parameter list. From c7d51765127b66070d01fa53c1091d904a71b590 Mon Sep 17 00:00:00 2001 From: Rex Jaeschke Date: Wed, 18 Mar 2026 10:59:24 -0400 Subject: [PATCH 4/5] fix md formatting --- standard/lexical-structure.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/standard/lexical-structure.md b/standard/lexical-structure.md index 90216554f..d733cbd17 100644 --- a/standard/lexical-structure.md +++ b/standard/lexical-structure.md @@ -993,10 +993,12 @@ A *String_Literal* that contains a *Utf8_Suffix* is a ***UTF-8 string literal*** > *Example*: Here are examples of each form of string literal: +> > | **Encoding** | **Type** | **Regular String Literal** | **Verbatim String Literal** | **Raw String Literal** | > |--------------|----------------------|---------------------|--------------------|--------------------| > | UTF-16 | `string` | `"Hello"` | `@"Hello"` | `"""Hello"""` | > | UTF-8 | `ReadOnlySpan<byte>` | `"Hello"u8` | `@"Hello"u8` | `"""Hello"""u8` | +> > *end example* Each string literal does not necessarily result in a new string instance. When two or more string literals that are equivalent according to the string equality operator ([§12.15.8](expressions.md#12158-string-equality-operators)), appear in the same assembly, these string literals refer to the same string instance. From 2e4a8bc3fbfe77989389226af968a6e5085ff8d8 Mon Sep 17 00:00:00 2001 From: Bill Wagner Date: Mon, 4 May 2026 15:05:12 -0400 Subject: [PATCH 5/5] Review and updates Review and add additional rules for UTF-8 strings. 
--- standard/expressions.md | 9 +++++---- standard/lexical-structure.md | 2 +- standard/structs.md | 2 ++ 3 files changed, 8 insertions(+), 5 deletions(-) diff --git a/standard/expressions.md b/standard/expressions.md index 0efcb8c2d..3da043cc6 100644 --- a/standard/expressions.md +++ b/standard/expressions.md @@ -4286,7 +4286,7 @@ Lifted ([§12.4.8](expressions.md#1248-lifted-operators)) forms of the unlifted For an operation of the form `x + y`, binary operator overload resolution ([§12.4.5](expressions.md#1245-binary-operator-overload-resolution)) is applied to select a specific operator implementation. The operands are converted to the parameter types of the selected operator, and the type of the result is the return type of the operator. -The predefined addition operators are listed below. For numeric and enumeration types, the predefined addition operators compute the sum of the two operands. When one or both operands are of type `string`, or both are of type `ReadOnlySpan<byte>`, the predefined addition operators concatenate the string representation of the operands. +The predefined addition operators are listed below. For numeric and enumeration types, the predefined addition operators compute the sum of the two operands. When one or both operands are of type `string`, the predefined addition operators concatenate the string representation of the operands. When both operands are of type `ReadOnlySpan<byte>` and both are semantically UTF-8 byte representations, the predefined addition operator concatenates the bytes of the operands. - Integer addition: @@ -4344,7 +4344,7 @@ The predefined addition operators are listed below. For numeric and enumeration string operator +(object x, string y); ``` - These overloads of the binary `+` operator perform concatenation of UTF-16 strings. If an operand is `null`, an empty UTF-16 string is substituted. 
Otherwise, any non-`string` operand that is not a ref struct ([§16.2.3]( structs.md#1623-ref-modifier)) is converted to its UTF-16 string representation by invoking the virtual `ToString` method inherited from type `object`. If `ToString` returns `null`, an empty UTF-16 string is substituted. + These overloads of the binary `+` operator perform concatenation of UTF-16 strings. If an operand is `null`, an empty UTF-16 string is substituted. Otherwise, any non-`string` operand that is not a ref struct ([§16.2.3](structs.md#1623-ref-modifier)) is converted to its UTF-16 string representation by invoking the virtual `ToString` method inherited from type `object`. If `ToString` returns `null`, an empty UTF-16 string is substituted. > *Example*: > @@ -4380,8 +4380,9 @@ The predefined addition operators are listed below. For numeric and enumeration ReadOnlySpan<byte> operator +(ReadOnlySpan<byte> x, ReadOnlySpan<byte> y); ``` - This overload of the binary `+` operator performs concatenation of UTF-8 string literals and the concatenated results thereof (which is much more restrictive than for UTF-16 string concatenation). The operands shall be UTF-8-encoded values. - The result of the operator is a ReadOnlySpan<byte> that consists of the bytes of the left operand followed by the bytes of the right operand. The result may be used directly as an operand to the UTF-8 string concatenation operator. + This overload of the binary `+` operator performs concatenation of UTF-8 string literals and the results of other applications of this operator (which is much more restrictive than UTF-16 string concatenation). It is applicable if and only if both operands are *semantically UTF-8 byte representations*. An operand is *semantically a UTF-8 byte representation* if it is a UTF-8 string literal, the result of an application of this operator, or a parenthesized expression whose enclosed expression is *semantically a UTF-8 byte representation*. 
+ The result of the operator is a `ReadOnlySpan<byte>` that consists of the bytes of the left operand followed by the bytes of the right operand. The result is itself *semantically a UTF-8 byte representation*, and so may be used as an operand to a further application of this operator. > *Example*: > diff --git a/standard/lexical-structure.md b/standard/lexical-structure.md index d733cbd17..54dcbc20c 100644 --- a/standard/lexical-structure.md +++ b/standard/lexical-structure.md @@ -988,7 +988,7 @@ A *String_Literal* that contains a *Utf8_Suffix* is a ***UTF-8 string literal*** -> *Note*: As `ReadOnlySpan<byte>` is a ref struct type, a UTF-8 string literal cannot be converted to `object` or used as a type parameter ([§16.2.3]( structs.md#1623-ref-modifier)). *end note* +> *Note*: Because `ReadOnlySpan<byte>` is a ref struct type, the value of a UTF-8 string literal cannot be implicitly converted to `object`, nor can `ReadOnlySpan<byte>` be used as a type argument ([§16.2.3](structs.md#1623-ref-modifier)). *end note* diff --git a/standard/structs.md b/standard/structs.md index 3803084ea..093264da3 100644 --- a/standard/structs.md +++ b/standard/structs.md @@ -911,6 +911,8 @@ Any expression whose compile-time type is not a ref struct has a safe-context of A `default` expression, for any type, has safe-context of caller-context. +A UTF-8 string literal ([§6.4.5.6](lexical-structure.md#6456-string-literals)) has a safe-context of caller-context. + For any non-default expression whose compile-time type is a ref struct has a safe-context defined by the following sections. The safe-context records which context a value may be copied into. Given an assignment from an expression `E1` with a safe-context `S1`, to an expression `E2` with safe-context `S2`, it is an error if `S2` is a wider context than `S1`.