-
-
Notifications
You must be signed in to change notification settings - Fork 65
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Count codepoints instead of bytes, when determining width #136
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good. Just one tiny thing
clap/codepoint_counting_writer.zig
Outdated
const amt = try self.child_stream.write(bytes); | ||
self.codepoints_written += try std.unicode.utf8CountCodepoints(bytes); | ||
return amt; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it's worth the complexity to handle, but there should at least be a comment here explaining that splitting a codepoint and writing it with two writes
will fail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. Do you think it makes sense to do utf8CountCodepoints(bytes[0..amt])
, and disregard the TruncatedInput
error, or something like that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Didn't even think about partial writes from the child_stream
. That makes this more complicated.
Calling CodepointCountingWriter.write
with partial codepoints we can specify as unsupported (currently) since clap
is the only user of this writer.
We have to handle partial writes from the child_stream
though, since we have no control here. I'd recommend calling self.child_stream.writeAll(bytes)
instead, to ensure all the bytes are written.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have to go to an appointment now, but I have an idea for how to make it behave right without too much hassle. Will try it when I get back!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a modified version of the utf8CountCodepoints
function, which returns the number of complete codepoints, if the string is truncated.
So write
ing truncated UTF-8 will result in a partial write, leaving the incomplete codepoint for the next call.
This makes writeAll
work as expected, and write
works the same as before, reporting the same byte count as written to the child.
A bit more code then I would have liked, but conceptually simple
A complete solution would be to count grapheme clusters, but that would require adding a dependency on something like zg. Counting codepoints will ensure that typical non-ASCII text is supported, but you can still throw it off with more complex Unicode constructions, which might not be so useful in help text. Fixes Hejsil#75
Thanks! |
A complete solution would be to count the display-width of grapheme clusters, but that would require adding a dependency on something like zg, which is quite heavy (the library is great, Unicode is just complex).
Counting codepoints will ensure that typical non-ASCII text is supported, at least, but you can still throw it off with more complex Unicode constructions, which might not be so useful in help text anyway.
I also made a commit for compatibility with current Zig.