MY_DATA - HTML doesn't handle an accented character in column name #432

Open
martinwork opened this issue May 28, 2024 · 14 comments

Comments

@martinwork
Collaborator

martinwork commented May 28, 2024

Arising from support ticket https://support.microbit.org/helpdesk/tickets/75447 (private)

The HTML display, Download and Copy are all truncated just before the accented character.

https://makecode.microbit.org/_U3ibMVU94dfH

image

image

@microbit-carlos
Collaborator

microbit-carlos commented May 29, 2024

Thanks Martin!
Does that only break the column name or does it break the rest of the table rendering and/or data visualisation?
Is the offline table also affected?

@martinwork
Collaborator Author

martinwork commented May 29, 2024

The HTML page doesn't show the data. Download and Copy also truncate at the accented character, so don't include any data. Visual preview shows no data.

Offline too.
image

The page source could be used to retrieve the data.
image

@martinwork
Collaborator Author

The same happens with accents in the data
https://makecode.microbit.org/_93aHvuLvCJwq
image
image
image

@microbit-carlos
Collaborator

Thanks Martin, if the table rows are not being shown due to a character in the table title we should prioritise this to be fixed in the milestone after next (which should come out really soon).

@microbit-carlos microbit-carlos modified the milestones: datalog, v0.2.68 May 29, 2024
@martinwork
Collaborator Author

@microbit-carlos Knowing little about such things, I just discovered what the problem is... The browser expects UTF-8 because of meta charset=utf-8, but the CSV data is not UTF-8. Most characters >127 become character 65533 (U+FFFD) in the outerHTML string, and the script terminates the CSV at any such character.
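
The effect described above can be reproduced outside the browser. This is a minimal sketch in Python, assuming the logged bytes are single-byte encoded (è stored as E8) and that the page script cuts the CSV at the first U+FFFD. The sample row and `truncate_at_replacement` are hypothetical stand-ins for illustration, not the actual MY_DATA script:

```python
# Hypothetical log bytes: "lumière" with è stored as the single byte E8.
raw = b"lumi\xe8re,1\n"

# Decoding as UTF-8 (what meta charset=utf-8 asks for) cannot interpret the
# lone E8 byte, so it is replaced with U+FFFD, the replacement character.
decoded = raw.decode("utf-8", errors="replace")


def truncate_at_replacement(text: str) -> str:
    """Hypothetical stand-in: stop at the first U+FFFD, as the issue describes."""
    end = text.find("\ufffd")
    return text if end == -1 else text[:end]


print(repr(decoded))
print(repr(truncate_at_replacement(decoded)))
```

Under these assumptions everything from the accented character onwards is lost, which matches the truncated display, Download and Copy output reported above.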

@microbit-carlos
Collaborator

Ah, thanks for checking Martin!

Do you know how the characters with UTF-8 value end up being 0xFFFD? Would that be a fix from the MakeCode side?

@martinwork
Collaborator Author

> Do you know how the characters with UTF-8 value end up being 0xFFFD? Would that be a fix from the MakeCode side?

Not really! AIUI, it happens when a byte sequence is not valid UTF-8; I think the browser assumes from the charset that the source data is UTF-8. I don't know if the script can get the raw data before it's interpreted as UTF-8.

The script relies on finding FFFD (usually corresponding to an FF in the source) to terminate the string.
https://github.com/lancaster-university/codal-microbit-v2/blob/master/resources/logfs/basic-header.html#L72

MicroBitLog could avoid the problem by replacing non-ascii characters, with ? for example.
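
As an illustration of that replacement idea only (MicroBitLog itself is C++, and `ascii_only` is a hypothetical helper, not part of CODAL), a sketch in Python:

```python
def ascii_only(text: str, placeholder: str = "?") -> str:
    """Replace every character outside 7-bit ASCII with a placeholder."""
    return "".join(ch if ord(ch) < 128 else placeholder for ch in text)


print(ascii_only("lumi\u00e8re"))  # \u00e8 is è
print(ascii_only("Zo\u00e9"))      # \u00e9 is é
```

The trade-off is that the accented characters are lost, but the rest of the row stays readable and the log is never cut short.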

Assuming the script can't get at the source data, to be able to retrieve non-ascii characters, I guess the logged data needs to be UTF-8.

I wonder what code page MakeCode uses? If the charset is changed to a code page, all characters are valid and the script needs some other way to find the end of the data.

@microbit-carlos
Collaborator

microbit-carlos commented Jul 26, 2024

Right, so if I understand that right, it's possible that it is MakeCode that is encoding non-ascii characters to 0xFFFD, and in that case we could update MicroBitLog to check for non-ascii characters and replace them, is that right?

I think the part that I find a bit weird is that if MakeCode didn't support UTF-8 encoding, then it'd likely encode non-ascii characters to a single byte instead of two bytes. Or, if it uses any different type of encoding, which is able to output the same two bytes to identify "bad data" (0xFFFD), then it's very likely that type of encoding would be able to deal with an accent correctly.

@microbit-carlos
Collaborator

microbit-carlos commented Jul 26, 2024

Okay, so trying the original programme (https://makecode.microbit.org/_U3ibMVU94dfH) with MakeCode live and opening the HTML file with a hex editor, lumière appears correctly encoded in UTF-8, which should be 6c756d69e87265 (using https://www.hexhero.com/converters/utf8-to-hex), specifically where e8 = è.

image

But the HTML doesn't render, even if it does have data:
image
makecode.microbit.org version: 6.0.28
Microsoft MakeCode version: 9.0.19
microbit runtime version: v2.2.0-rc6
codal-microbit-v2 runtime version: v0.2.63


Similarly, with the other programme created by @martinwork https://makecode.microbit.org/_93aHvuLvCJwq the Zoé (5a6fe9) string is correctly encoded.
In this case I pressed A, then B, then A, then B, so the table should have 4 rows (Zoe, Zoé, Zoe, Zoé). The hex viewer shows the right data for the 4 rows, but the HTML breaks at the first é (e9) character (where the highlight starts):

image image

MY_DATA.HTM.zip


@martinwork how did you read the 0xFFFD characters?
Is it possible that it's CODAL (or the JS reading the data) that does the wrong thing when reading characters outside the 7-bit ASCII range (like e8 for è or e9 for é)?

@martinwork
Collaborator Author

We are typing at the same time!

The UTF-8 for e grave is C3A8
https://www.compart.com/en/unicode/U+00E8
I copied lumière into Windows notepad, saved it with UTF-8 encoding and loaded it as binary
image
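
The same comparison can be made without a hex editor. A small sketch in Python, using Latin-1 as a stand-in for whichever single-byte encoding produced the E8 byte (that choice is an assumption):

```python
word = "lumi\u00e8re"  # "lumière", \u00e8 is è

utf8_hex = word.encode("utf-8").hex()       # è becomes two bytes: c3 a8
latin1_hex = word.encode("latin-1").hex()   # è becomes one byte: e8

print(utf8_hex)
print(latin1_hex)
```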

The logged data in the MY_DATA.HTM file from the first example above contains single byte E8 characters.
image
and
image

I think in the process of loading the HTM file, the browser is interpreting this single byte text in the light of meta charset=utf-8 in the HTML header. The byte E8 after an ASCII byte is not valid UTF-8 so the browser converts it to FFFD in its 2 byte character set.

When the MY_DATA script scans the outerHTML string, it finds FFFD.
https://github.com/lancaster-university/codal-microbit-v2/blob/master/resources/logfs/basic-header.html#L72

In most cases micro:bit can only handle ASCII. I guess perhaps MicroBitLog could pass through UTF-8, but I'm not sure how the UTF-8 could be created.

@microbit-carlos
Collaborator

microbit-carlos commented Jul 29, 2024

Yes, sorry, you are right Martin, I trusted a bad UTF8-to-hex converter, which I think was actually using extended ASCII (in which case è does equal e8) instead of UTF-8, which would be C3A8.

Right, so something in MakeCode or CODAL is either using extended ASCII, CP-1252, or similar (è -> E8).
Or maybe the way the encoding is implemented is that each string character is converted in MakeCode using String.charCodeAt(), and as JavaScript strings are UTF-16 by default, that function would return 00E8, which might easily be stored as simply E8.
(Something like this implementation, which I've just randomly found in the pxt source code, so no idea if this code is related at all, just an example).

After that, as you mentioned, the browser reads the HTML file as UTF-8 and when it encounters the invalid byte E8 it converts it into the replacement character (FFFD).

> In most cases micro:bit can only handle ASCII. I guess perhaps MicroBitLog could pass through UTF-8, but I'm not sure how the UTF-8 could be created.

Yeah, for things like displaying text CODAL only deals with ASCII, but strings that are processed outside of the device (like serial messages or data to display in an HTML file or CSV file), where CODAL doesn't need to parse the string, those should be fine in whatever encoding, no?
Of course, now that CODAL can read data back from the datalog, maybe that could be a problem when not using ASCII.

@martinwork
Collaborator Author

Ah, yes I did not find the MakeCode implementation.

> strings that are processed outside of the device (like serial messages or data to display in an HTML file or CSV file), where CODAL doesn't need to parse the string, those should be fine in whatever encoding, no?

I was immediately thinking "yes", but I found previous related posts...
https://github.com/microsoft/pxt-microbit/issues?q=is%3Aissue+unicode
https://github.com/microsoft/pxt-microbit/issues?q=is%3Aissue+accent

All the above and linked issues seem interesting, though I haven't looked very carefully yet. I don't know what is happening here yet...
microsoft/pxt-microbit#4467 example project: https://makecode.microbit.org/_UJTCHTRPc2Jf

@microbit-carlos
Collaborator

microbit-carlos commented Jul 29, 2024

Right, yes, thanks for finding those!

I think mostly the issue is that MakeCode doesn't do UTF-8 (or 16 or whatever) encoding, so users are struggling to get strings working when the encoding is not done "correctly".

> > strings that are processed outside of the device (like serial messages or data to display in an HTML file or CSV file), where CODAL doesn't need to parse the string, those should be fine in whatever encoding, no?
>
> I was immediately thinking "yes", but I found previous related posts...

My comment here was mostly along the lines of "if MakeCode encoded all strings as UTF-8 (like MicroPython does), there shouldn't be any problem to pass those strings along, as long as we didn't try to scroll them on the display or similar".

Based on the comments from microsoft/pxt-microbit#2372 (comment) the main drawback would be code size, however trying to enable it in a project as shown in microsoft/pxt#6988 didn't seem to work for me.


Doing this quick test it does look like MakeCode is doing something along the lines of encoding to UTF-16 and only keeping the LSB (e.g. String.charCodeAt() & 0xff).

By creating a simple programme with serial.writeLine("Ž") we can see the data we get via serial (it can also be seen in the string that ends up in flash, tested by adding an easy-to-find string followed by the non-ASCII characters).
Ž in CP-1250/1252 should be 0x8E and in UTF-16 is 0x01 0x7D.
With this code we do get 0x7D, so it does look like the encoding was done in UTF-16, and only the LSB is kept.
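
To make that hypothesis concrete, here is a sketch in Python of "encode as UTF-16, keep only the low byte of each code unit". `low_bytes_of_utf16` is a hypothetical name, and this models the observed behaviour only; it is not taken from the MakeCode source:

```python
def low_bytes_of_utf16(text: str) -> bytes:
    """Encode as UTF-16 and keep only the low byte of each 16-bit code unit."""
    units = text.encode("utf-16-be")  # big-endian: high byte first, then low byte
    return units[1::2]                # every second byte is a code unit's low byte


print(low_bytes_of_utf16("\u017d").hex())  # \u017d is Ž
print(low_bytes_of_utf16("\u00e8").hex())  # \u00e8 is è
```

This gives 7d for Ž, matching the serial test, and e8 for è, matching the byte found in the MY_DATA.HTM file.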

Using character 𐀀 results in two code-units 0xD800 and 0xDC00 (each code-unit being a "surrogate", which is used to be able to encode more characters than one code-unit of 16-bits on its own would allow) and once it is compiled to a string in MakeCode I can confirm it ends up as two zeros (0x00 0x00) in flash (although running the programme doesn't work, I'm guessing the null characters in the middle of the string break something).

Which makes sense if encoded like this:
image

An easier example with surrogates that can be seen on serial would be 🍺 which is 0xD83C + 0xDF7A, and ends up being 0x3C 0x7A, which is <z in ASCII.
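
The surrogate values quoted above follow from the standard UTF-16 arithmetic. A short sketch in Python that derives them and then applies the same "keep the low byte" assumption (`surrogate_pair` is a hypothetical helper name):

```python
def surrogate_pair(code_point: int) -> tuple:
    """Split a code point above U+FFFF into its UTF-16 high and low surrogates."""
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low


high, low = surrogate_pair(0x1F37A)        # U+1F37A is the beer mug emoji
kept = bytes([high & 0xFF, low & 0xFF])    # low byte of each surrogate
print(hex(high), hex(low), kept)
```

The same function gives 0xD800 and 0xDC00 for U+10000, whose low bytes are both zero, consistent with the two null bytes seen in flash.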

@martinwork
Collaborator Author

The utf8 option didn't seem to work for me either. I guess there might be problems any time a string is passed to C++ in CODAL or an extension and not simply passed on.

The example https://makecode.microbit.org/_UJTCHTRPc2Jf does work better than it appeared. CoolTerm sees the expected single byte characters sent from micro:bit, plus the padding spaces MakeCode adds, but MakeCode rejects it with console message "invalid utf8 serial data", and displays nothing in the console.
