Count codepoints instead of bytes, when determining width #136

illfygli · 2024-08-30T10:38:16Z

A complete solution would be to count the display-width of grapheme clusters, but that would require adding a dependency on something like zg, which is quite heavy (the library is great, Unicode is just complex).

Counting codepoints will ensure that typical non-ASCII text is supported, at least, but you can still throw it off with more complex Unicode constructions, which might not be so useful in help text anyway.
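To make the trade-off concrete, here is a small sketch (not part of the PR) comparing byte length with codepoint count. The sample strings are made up for illustration; `std.unicode.utf8CountCodepoints` is the standard library function the PR itself calls.

```zig
const std = @import("std");

test "bytes vs. codepoints" {
    // Each of å, æ and ø takes two bytes in UTF-8 but is a single codepoint.
    const danish = "blåbærgrød";
    try std.testing.expectEqual(@as(usize, 13), danish.len);
    try std.testing.expectEqual(@as(usize, 10), try std.unicode.utf8CountCodepoints(danish));

    // "e" followed by U+0301 (combining acute accent) renders as a single "é",
    // yet it is two codepoints, so codepoint counting over-reports its width.
    const combining = "e\u{301}";
    try std.testing.expectEqual(@as(usize, 3), combining.len);
    try std.testing.expectEqual(@as(usize, 2), try std.unicode.utf8CountCodepoints(combining));
}
```

The second case is the kind of "more complex Unicode construction" that only grapheme-cluster width counting would handle correctly.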

I also made a commit for compatibility with current Zig.

Hejsil

Looking good. Just one tiny thing:

clap/codepoint_counting_writer.zig

Hejsil · 2024-08-30T10:48:24Z

+            const amt = try self.child_stream.write(bytes);
+            self.codepoints_written += try std.unicode.utf8CountCodepoints(bytes);
+            return amt;


I don't think it's worth the complexity to handle, but there should at least be a comment here explaining that splitting a codepoint and writing it with two writes will fail.

illfygli · 2024-08-30

Good point. Do you think it makes sense to do utf8CountCodepoints(bytes[0..amt]), and disregard the TruncatedInput error, or something like that?

Hejsil · 2024-08-30

Didn't even think about partial writes from the child_stream. That makes this more complicated.

We can declare calling CodepointCountingWriter.write with partial codepoints unsupported (for now), since clap is the only user of this writer.

We do have to handle partial writes from the child_stream though, since we have no control over them. I'd recommend calling self.child_stream.writeAll(bytes) instead, to ensure all the bytes are written.

illfygli · 2024-08-30

I have to go to an appointment now, but I have an idea for how to make it behave right without too much hassle. Will try it when I get back!

illfygli · 2024-08-30

I added a modified version of the utf8CountCodepoints function, which returns the number of complete codepoints when the string is truncated.

So writing truncated UTF-8 will result in a partial write, leaving the incomplete codepoint for the next call.

This makes writeAll work as expected, and write works the same as before, reporting the same byte count as written to the child.

A bit more code than I would have liked, but conceptually simple.
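The idea described above can be sketched roughly as follows. This is not the code from the PR: the names `Counted` and `countCompleteCodepoints` are hypothetical, and for brevity the sketch only inspects the lead byte of each sequence (via `std.unicode.utf8ByteSequenceLength`) without validating continuation bytes.

```zig
const std = @import("std");

/// Hypothetical result type: how many bytes form complete codepoints,
/// and how many codepoints those bytes contain.
const Counted = struct {
    bytes: usize,
    codepoints: usize,
};

/// Counts the complete codepoints at the start of `s`. A sequence that is
/// cut off at the end of `s` is not an error; it is simply left uncounted so
/// the caller can report a partial write and see those bytes again later.
/// Assumption: continuation bytes are not validated in this sketch.
fn countCompleteCodepoints(s: []const u8) error{InvalidUtf8}!Counted {
    var i: usize = 0;
    var n: usize = 0;
    while (i < s.len) {
        const len = std.unicode.utf8ByteSequenceLength(s[i]) catch return error.InvalidUtf8;
        if (i + len > s.len) break; // truncated trailing sequence
        i += len;
        n += 1;
    }
    return Counted{ .bytes = i, .codepoints = n };
}
```

A writer built on this would pass only the first `bytes` bytes to the child stream and add `codepoints` to its running total. For the input "bl" followed by the first byte of "å", the function reports 2 bytes and 2 codepoints, so a writeAll loop would retry from the start of "å" once the rest of it is available.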

A complete solution would be to count grapheme clusters, but that would require adding a dependency on something like zg. Counting codepoints will ensure that typical non-ASCII text is supported, but you can still throw it off with more complex Unicode constructions, which might not be so useful in help text. Fixes Hejsil#75

Hejsil · 2024-08-30T14:46:10Z

Thanks!

illfygli force-pushed the master branch from 8008236 to 07b8cb0 Compare August 30, 2024 14:05

illfygli added 2 commits August 30, 2024 16:34

adapt to latest zig changes

71a987e

illfygli force-pushed the master branch from 07b8cb0 to 71a987e Compare August 30, 2024 14:35

Hejsil merged commit 2d9db15 into Hejsil:master Aug 30, 2024
13 checks passed
