Length of the "White Heavy Check Mark" #109055

Closed
jjcarrier opened this issue Oct 20, 2024 · 5 comments

Comments

@jjcarrier

Description

The symbol ✅ (U+2705) is reported as having a length of 1, while other visually wide Unicode characters have a length of 2. Higher-layer components such as PowerShell and Windows Terminal behave abnormally because of this.

Reproduction Steps

In a .NET Core console application, run:

Console.WriteLine($"The length of ✅ is {"✅".Length}");
Console.WriteLine($"The length of 🆕 is {"🆕".Length}");
Console.WriteLine($"The length of 🔊 is {"🔊".Length}");
Console.WriteLine($"The length of 🔌 is {"🔌".Length}");

Additionally, in a Windows Terminal + PowerShell Core instance run:

'✅'.Length
'✅'  |  Format-Hex
'🆕'.Length
'🆕'  |  Format-Hex
'🔊'.Length
'🔊'  |  Format-Hex
'🔌'.Length
'🔌'  |  Format-Hex

Confirm that the lengths between the C# and pwsh tests are in agreement.

Paste the line '✅'.Length back into the console and move the cursor to the end of the line. Notice that the cursor is placed before the h character.

Expected behavior

I suspect that ✅ should report a length of 2 like the other characters in order to resolve this issue.

Actual behavior

The length of ✅ will be reported as 1, while other characters (such as 🆕) are reported as 2.

When interacting on the console with a PowerShell statement containing ✅ (such as '✅'.Length), navigating the cursor through the line reveals a discrepancy between how the line is presented and how it is processed at the lower level (readline).

My understanding is that the correct behavior would be to treat this symbol as two characters, like the other symbols mentioned above. That would resolve the odd behavior seen at the terminal level, but perhaps it goes against standards or introduces other complications.

Regression?

No response

Known Workarounds

No response

Configuration

No response

Other information

Tested on:

dotnet --version
8.0.403

$PSVersionTable.PSVersion

Major  Minor  Patch  PreReleaseLabel BuildLabel
-----  -----  -----  --------------- ----------
7      4      5

dotnet-policy-service bot added the untriaged label Oct 20, 2024
Contributor

Tagging subscribers to this area: @dotnet/area-system-console
See info in area-owners.md if you want to be subscribed.

@vcsjones
Member

vcsjones commented Oct 20, 2024

The character U+2705 is a single code unit in length, and .NET's Length is measured in 16-bit code units. A length of one seems correct to me.

The others cannot be represented as a single 16-bit code unit. They need to be represented by two (a surrogate pair). Let's use 🆕 as an example. In UTF-16 code units, it gets encoded as 0xD83C, 0xDD95. We can prove this by changing your example to

Console.WriteLine($"The length of \u2705 is {"\u2705".Length}");
Console.WriteLine($"The length of \uD83C\uDD95 is {"\uD83C\uDD95".Length}");

In the console, this will print

The length of ✅ is 1
The length of 🆕 is 2

So we can see that the "white heavy check mark" only takes a single 16-bit code unit to encode. The 🆕 requires two, so it is encoded as a surrogate pair.
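As a rough sketch of the same point, the snippet below dumps the individual UTF-16 code units of each string and, for comparison, counts code points with `EnumerateRunes`. It assumes .NET Core 3.0 or later (where `EnumerateRunes` is available) and C# top-level statements; the variable names are only illustrative.

```csharp
using System;
using System.Linq;

// U+2705 fits in one UTF-16 code unit; U+1F195 needs a surrogate pair.
string check = "\u2705";
string newButton = "\uD83C\uDD95";

foreach (string s in new[] { check, newButton })
{
    // Length counts UTF-16 code units (chars), so list each unit in hex.
    string units = string.Join(" ", s.Select(c => $"0x{(int)c:X4}"));

    // EnumerateRunes walks Unicode scalar values (code points) instead.
    int codePoints = s.EnumerateRunes().Count();

    Console.WriteLine($"Length={s.Length}, code points={codePoints}, units={units}");
}
```

Both strings contain a single code point; only the number of code units needed to store it differs.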

@tannergooding
Member

tannergooding commented Oct 20, 2024

Length is determined by the number of UTF-16 code units required to encode the character. It has nothing to do with the visual width or visual representation of the data.

✅ is 0x0000_2705, which is 1 char
🆕 is 0x0001_F195, which is 2 char, represented as the surrogate pair 0xD83C, 0xDD95
🔊 is 0x0001_F50A which is 2 char, represented as the surrogate pair 0xD83D, 0xDD0A
🔌 is 0x0001_F50C which is 2 char, represented as the surrogate pair 0xD83D, 0xDD0C

Any Unicode code point in the range [0x0001_0000, 0x0010_FFFF] must be encoded as a surrogate pair, that is, as two UTF-16 code units.
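For illustration, the surrogate pairs listed above can be reproduced by hand. This is a minimal sketch of the standard UTF-16 arithmetic, not how the runtime implements it; `ToSurrogatePair` is a hypothetical helper name, and the built-in `char.ConvertFromUtf32` is only used as a cross-check.

```csharp
using System;

// Split a supplementary code point into its UTF-16 surrogate pair by hand.
static (char High, char Low) ToSurrogatePair(int codePoint)
{
    if (codePoint < 0x10000 || codePoint > 0x10FFFF)
        throw new ArgumentOutOfRangeException(nameof(codePoint));

    int offset = codePoint - 0x10000;              // leaves a 20-bit value
    char high = (char)(0xD800 + (offset >> 10));   // top 10 bits
    char low = (char)(0xDC00 + (offset & 0x3FF));  // bottom 10 bits
    return (high, low);
}

var (hi, lo) = ToSurrogatePair(0x1F195); // 🆕
Console.WriteLine($"0x{(int)hi:X4}, 0x{(int)lo:X4}"); // prints "0xD83C, 0xDD95"

// The built-in conversion should agree with the manual arithmetic.
Console.WriteLine(char.ConvertFromUtf32(0x1F195) == $"{hi}{lo}"); // prints "True"
```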

Not all visual glyphs are represented by a single code point either. Combining and joining characters are used frequently, especially with emoji. For example, 👨‍👩‍👧‍👦 is 11 UTF-16 code units: 👨, 👩, 👧, and 👦 (each 2 UTF-16 code units) joined by three instances of 0x200D, the "Zero width joiner" character.
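A small sketch of that breakdown, with the family emoji written as escapes so the joiners are visible. It assumes .NET Core 3.0 or later for `EnumerateRunes`. The text element count comes from `System.Globalization.StringInfo`, and the value it reports depends on the runtime version, so no output is claimed for that line.

```csharp
using System;
using System.Globalization;
using System.Linq;

// 👨‍👩‍👧‍👦 as four emoji joined by the zero width joiner (U+200D).
string family = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466";

Console.WriteLine(family.Length);                    // 11 UTF-16 code units
Console.WriteLine(family.EnumerateRunes().Count());  // 7 code points (4 emoji + 3 joiners)

// Text elements approximate user-perceived characters. .NET 5 and later are
// expected to treat the whole sequence as one element; older runtimes segment
// it differently.
Console.WriteLine(new StringInfo(family).LengthInTextElements);
```

None of these three numbers is a display width. How many terminal cells a glyph occupies is decided by the terminal and the font, not by the string's length.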

@jjcarrier
Author

jjcarrier commented Oct 20, 2024

The 👨‍👩‍👧‍👦 you mention seems to be even more problematic with Windows Terminal. I guess the issue should be moved there instead.

dotnet-policy-service bot removed the untriaged label Oct 20, 2024
teo-tsirpanis closed this as not planned Oct 20, 2024
teo-tsirpanis added the area-System.Text.Encoding and untriaged labels and removed the area-System.Console and untriaged labels Oct 20, 2024
Contributor

Tagging subscribers to this area: @dotnet/area-system-text-encoding
See info in area-owners.md if you want to be subscribed.
