Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Count codepoints instead of bytes, when determining width #136

Merged
merged 2 commits into from
Aug 30, 2024

Conversation

illfygli
Copy link
Contributor

A complete solution would be to count the display-width of grapheme clusters, but that would require adding a dependency on something like zg, which is quite heavy (the library is great, Unicode is just complex).

Counting codepoints will ensure that typical non-ASCII text is supported, at least, but you can still throw it off with more complex Unicode constructions, which might not be so useful in help text anyway.

I also made a commit for compatibility with current Zig.

Copy link
Owner

@Hejsil Hejsil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good. Just one tiny thing

clap/codepoint_counting_writer.zig Show resolved Hide resolved
Comment on lines 22 to 24
const amt = try self.child_stream.write(bytes);
self.codepoints_written += try std.unicode.utf8CountCodepoints(bytes);
return amt;
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it's worth the complexity to handle, but there should at least be a comment here explaining that splitting a codepoint and writing it with two writes will fail.

Copy link
Contributor Author

@illfygli illfygli Aug 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Do you think it makes sense to do utf8CountCodepoints(bytes[0..amt]), and disregard the TruncatedInput error, or something like that?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't even think about partial writes from the child_stream. That makes this more complicated.

Calling CodepointCountingWriter.write with partial codepoints we can specify as unsupported (currently) since clap is the only user of this writer.

We have to handle partial writes from the child_stream though, since we have no control here. I'd recommend calling self.child_stream.writeAll(bytes) instead, to ensure all the bytes are written.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have to go to an appointment now, but I have an idea for how to make it behave right without too much hassle. Will try it when I get back!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a modified version of the utf8CountCodepoints function, which returns the number of complete codepoints, if the string is truncated.

So writeing truncated UTF-8 will result in a partial write, leaving the incomplete codepoint for the next call.

This makes writeAll work as expected, and write works the same as before, reporting the same byte count as written to the child.

A bit more code then I would have liked, but conceptually simple

A complete solution would be to count grapheme clusters, but that would
require adding a dependency on something like zg.

Counting codepoints will ensure that typical non-ASCII text is
supported, but you can still throw it off with more complex Unicode
constructions, which might not be so useful in help text.

Fixes Hejsil#75
@Hejsil Hejsil merged commit 2d9db15 into Hejsil:master Aug 30, 2024
13 checks passed
@Hejsil
Copy link
Owner

Hejsil commented Aug 30, 2024

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants