Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dafny fails with most languages when using UTF-16 escape sequences #5737

Open
Tracked by #5561
ajewellamz opened this issue Aug 29, 2024 · 4 comments
Open
Tracked by #5561

Dafny fails with most languages when using UTF-16 escape sequences #5737

ajewellamz opened this issue Aug 29, 2024 · 4 comments
Labels
during 4: bad execution of correct program A bug in the Dafny compiler that causes a correct Dafny program to execute incorrectly kind: bug Crashes, unsoundness, incorrect output, etc. If possible, add a `part:` label part: code-generation Support for transpiling Dafny to another language. If relevant, add a `lang:` tag

Comments

@ajewellamz
Copy link
Collaborator

ajewellamz commented Aug 29, 2024

Dafny version

4.7.0 (master as of 29 Aug, except for Rust which is latest feat_rust)

Code to produce this issue

module foo {
  method bar() {
    var decoded := "\uD835\uDFC1";
    var expected_redecoded := "𝟁";
    print decoded, "\n", expected_redecoded, "\n";
    print |decoded|, "\n", |expected_redecoded|, "\n";
    assert decoded == expected_redecoded;
    expect "\uD835\uDFC1" ==  "𝟁";
  }

  method Main() {
    print "Hello World!\n";
    bar();
  }
}

Command to run and resulting output

This command succeeds as expected :
dafny run -t cs --unicode-char:false foo.dfy

js also works correctly.

With a target of python or rust, we get weird runtime errors.

With java, it fails to run, for reasons that don't seem string related.

go mostly works, but prints `��` instead of `𝟁`

With cpp a compiler error "invalid universal character"

What happened?

I would expect all languages to behave as cs.

What type of operating system are you experiencing the problem on?

Mac

@ajewellamz ajewellamz added the kind: bug Crashes, unsoundness, incorrect output, etc. If possible, add a `part:` label label Aug 29, 2024
@MikaelMayer MikaelMayer added the part: code-generation Support for transpiling Dafny to another language. If relevant, add a `lang:` tag label Aug 29, 2024
@MikaelMayer
Copy link
Member

It should be possible to fix this in Rust by passing the previous char when decoding in --unicode-char:false

struct DafnyPrintContext {
    in_seq: bool,
    previousUtf16Char: Option<u16>
}

@ajewellamz
Copy link
Collaborator Author

ajewellamz commented Sep 3, 2024

dafny run -t py --unicode-char:false will fail, regardless of the dafny code involved, if it includes a string like "𝟁".
That is, it will claim success when generating code, and the python will fail when it tries to compile.

@lucasmcdonald3
Copy link
Contributor

More information on the issue in Python:

If my Dafny source code has strings outside the BMP (ex. var a := "𝟁"), when I translate to Python with --unicode-char:false, then I expect the generated Python to contain a series of UTF-16 surrogate pairs representing "𝟁" (i.e. "\uD835\uDFC1").
However, Dafny currently writes "𝟁" literally.

My understanding is that --unicode-char:false means "Dafny strings are sequences of 16-bit UTF-16 code units".
Right now, Dafny code containing a string

var a :=  "𝟁"

translated to Python with --unicode-char:false results in

d_0_a_ = _dafny.Seq("𝟁")

However, Python will not interpret "𝟁" as a UTF-16 character by default:

Python 3.12.3 (main, May  9 2024, 13:24:59) [Clang 15.0.0 (clang-1500.3.9.4)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import _dafny
>>> s = _dafny.Seq("𝟁")
>>> s.Elements
['𝟁']
>>> len(s.Elements)
1
>>> str.encode(s.Elements[0])
b'\xf0\x9d\x9f\x81'
>>> len(str.encode(s.Elements[0]))
4

Python is interpreting this as a single 4-byte UTF-32 character, and not as 2 2-byte UTF-16 characters.
This causes issues when using the DafnyRuntimePython's .string_of method:

>>> s = _dafny.Seq("𝟁")
>>> _dafny.string_of(s)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/lucmcdon/.pyenv/versions/3.12.3/lib/python3.12/site-packages/_dafny/__init__.py", line 28, in string_of
    return value.__dafnystr__()
           ^^^^^^^^^^^^^^^^^^^^
  File "/Users/lucmcdon/.pyenv/versions/3.12.3/lib/python3.12/site-packages/_dafny/__init__.py", line 172, in __dafnystr__
    return string_from_utf_16(self.Elements)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/lucmcdon/.pyenv/versions/3.12.3/lib/python3.12/site-packages/_dafny/__init__.py", line 24, in string_from_utf_16
    return b''.join(ord(c).to_bytes(2, 'little') for c in utf_16_code_units).decode("utf-16-le", errors = 'replace')
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/lucmcdon/.pyenv/versions/3.12.3/lib/python3.12/site-packages/_dafny/__init__.py", line 24, in <genexpr>
    return b''.join(ord(c).to_bytes(2, 'little') for c in utf_16_code_units).decode("utf-16-le", errors = 'replace')
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
OverflowError: int too big to convert

So Dafny generates invalid code with --unicode-char:false because the string it generates is not a "sequence of 16-bit UTF-16 code units".

I would expect that translating "𝟁" with --unicode-char:false would instead write "\uD835\uDFC1", which is interpreted as a series of UTF-16 code units:

>>> s = _dafny.Seq("\uD835\uDFC1")
>>> s.Elements
['\ud835', '\udfc1']
>>> len(s.Elements[0])
1
>>> _dafny.string_of(s)
'𝟁'

So, this seems to only be an issue with translating characters outside the BMP in strings in Dafny source code to Python (and possibly other runtimes).
This can be resolved by writing any character inside a Dafny string that is outside the BMP as a UTF-16 surrogate pair during translation.

@lucasmcdonald3
Copy link
Contributor

Dafny developers can work around this by replacing Unicode characters whose UTF-16 encodings lie outside the BMP in strings inside Dafny source code with UTF-16 surrogate pairs.
The MPL will work around this in aws/aws-cryptographic-material-providers-library#672

@robin-aws robin-aws added the during 4: bad execution of correct program A bug in the Dafny compiler that causes a correct Dafny program to execute incorrectly label Sep 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
during 4: bad execution of correct program A bug in the Dafny compiler that causes a correct Dafny program to execute incorrectly kind: bug Crashes, unsoundness, incorrect output, etc. If possible, add a `part:` label part: code-generation Support for transpiling Dafny to another language. If relevant, add a `lang:` tag
Projects
None yet
Development

No branches or pull requests

4 participants