Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add \U00000000 notation for large unicode literals #5443

Draft
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

eric-wieser
Copy link
Contributor

@eric-wieser eric-wieser commented Sep 24, 2024

This matches Python and Rust

Read this section before submitting

  • Ensure your PR follows the External Contribution Guidelines.
  • Please make sure the PR has excellent documentation and tests. If we label it missing documentation or missing tests then it needs fixing!
  • Include the link to your RFC or bug issue in the description.
  • If the issue does not already have approval from a developer, submit the PR as draft.
  • The PR title/description will become the commit message. Keep it up-to-date as the PR evolves.
  • A toolchain of the form leanprover/lean4-pr-releases:pr-release-NNNN for Linux and M-series Macs will be generated upon build. To generate binaries for Windows and Intel-based Macs as well, write a comment containing release-ci on its own line.
  • If you rebase your PR onto nightly-with-mathlib then CI will test Mathlib against your PR.
  • You can manage the awaiting-review, awaiting-author, and WIP labels yourself, by writing a comment containing one of these labels on its own line.
  • Remove this section, up to and including the --- before submitting.

Closes #0000 (RFC or bug issue number fixed by this PR, if any)

@@ -660,6 +660,9 @@ def quotedCharCoreFn (isQuotable : Char → Bool) (inString : Bool) : ParserFn :
andthenFn hexDigitFn hexDigitFn c (s.next' input i h)
else if curr == 'u' then
andthenFn hexDigitFn (andthenFn hexDigitFn (andthenFn hexDigitFn hexDigitFn)) c (s.next' input i h)
else if curr == 'U' then
(andthenFn hexDigitFn <| andthenFn hexDigitFn <| andthenFn hexDigitFn <| andthenFn hexDigitFn <|
andthenFn hexDigitFn <| andthenFn hexDigitFn <| andthenFn hexDigitFn hexDigitFn) c (s.next' input i h)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing missing is a check that this is less than 0x110000 and not in the range [0xd800, 0xdfff]. That said, there's a missing check for u that it's not in the surrogate range [0xd800, 0xdfff]. Right now, '\ud800' gives '\x00'.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now, '\ud800' gives '\x00'

I guess this is related to #5445. Is Char.ofNat 0xd80f = 0 really desirable? Python seems to make a different choice here.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Python passes these pairs through, right? And "\U00110000" is a syntax error?

@github-actions github-actions bot added the toolchain-available A toolchain is available for this PR, at leanprover/lean4-pr-releases:pr-release-NNNN label Sep 24, 2024
@leanprover-community-mathlib4-bot
Copy link
Collaborator

Mathlib CI status (docs):

  • ❗ Batteries/Mathlib CI will not be attempted unless your PR branches off the nightly-with-mathlib branch. Try git rebase 1b4ee185e879dddc6c50976b9467c16e9b0f864b --onto e551a366a0bbb27d5f853cc8e87cbd381a76ffc0. (2024-09-24 01:36:07)

@digama0
Copy link
Collaborator

digama0 commented Sep 24, 2024

Can we have rust's variadic syntax \u{12abc} instead? That would avoid the need to take more escape code letters. (This is also a way of saying that your "and Rust" is incorrect - this is not how rust does unicode escapes.)

@eric-wieser
Copy link
Contributor Author

Doesn't rust support both \U00123456 and \u{123456}?

@digama0
Copy link
Collaborator

digama0 commented Sep 24, 2024

nope, "\U00123456" yields Syntax Error: unknown character escape. Rust doesn't have the \u1234 form either, the only accepted forms are \x12 and \u{123}.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
toolchain-available A toolchain is available for this PR, at leanprover/lean4-pr-releases:pr-release-NNNN
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants