Provide macros corresponding to the Unicode general categories #126

Open

sjakobi opened this issue Jul 27, 2018 · 11 comments

Comments

@sjakobi
Member

sjakobi commented Jul 27, 2018

Right now it seems very difficult to write a rule, e.g. for words starting with an uppercase letter.

@erikd
Member

erikd commented Jul 27, 2018

What have you tried? I thought this was as simple as:

   [A-Z][a-zA-Z]*

which means one uppercase letter and then zero or more mixed-case letters.

@sjakobi
Member Author

sjakobi commented Jul 28, 2018

The Unicode uppercase letter category is quite a bit larger than [A-Z]. Then there are questions like: should e.g. $upper include titlecase letters?

It might make sense to have macros both for the predicates Haskell programmers are used to from Data.Char and for the Unicode general categories…
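For a concrete illustration of the difference, here is a quick check against Data.Char from base (the example characters are illustrative):

    import Data.Char (isUpper, generalCategory, GeneralCategory (..))

    -- 'É' (U+00C9) is an uppercase letter well outside [A-Z].
    -- 'ǅ' (U+01C5) is a titlecase letter: Data.Char.isUpper accepts
    -- it, but a macro tracking only the Lu category would not.
    main :: IO ()
    main = do
      print (isUpper 'É', generalCategory 'É')  -- (True, UppercaseLetter)
      print (isUpper 'ǅ', generalCategory 'ǅ')  -- (True, TitlecaseLetter)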

@erikd
Member

erikd commented Jul 28, 2018

Ah ok. I think my response shows that your initial question was not specific enough.

If you specify exactly what it is you want to do and why the current functionality is insufficient, you will get much more useful responses than mine above.

@sjakobi
Member Author

sjakobi commented Jul 28, 2018

If you specify exactly what it is you want to do and why the current functionality is insufficient

Right. :) I should have done that first. :)

I want to detect Haskell identifiers. For that I need the following character sets (roughly sketched in code after this list):

  • lowercase, uppercase and titlecase letters
  • decimal digits
  • symbols and punctuation
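A rough rendering of those sets as Data.Char predicates (a sketch only: the names are invented, the category lists follow the Haskell Report loosely, and '_' and '\'' are the extra identifier characters Haskell allows):

    import Data.Char (generalCategory, GeneralCategory (..))

    -- Identifier head: lowercase, uppercase and titlecase letters
    -- (plus '_', which Haskell treats like a lowercase letter).
    isIdentStart :: Char -> Bool
    isIdentStart c = c == '_' || generalCategory c `elem`
      [LowercaseLetter, UppercaseLetter, TitlecaseLetter]

    -- Identifier tail additionally allows decimal digits and '\''.
    isIdentPart :: Char -> Bool
    isIdentPart c = isIdentStart c
      || generalCategory c == DecimalNumber
      || c == '\''

    -- "Symbols and punctuation" (operator characters), approximated
    -- by the symbol and punctuation general categories.
    isSymbolChar :: Char -> Bool
    isSymbolChar c = generalCategory c `elem`
      [ MathSymbol, CurrencySymbol, ModifierSymbol, OtherSymbol
      , ConnectorPunctuation, DashPunctuation, OtherPunctuation ]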

@erikd
Member

erikd commented Jul 29, 2018

Forgive me for being pedantic here, but that is not what you are asking for.

You rejected my suggestion above, saying that [A-Z] does not include Unicode. This suggests you want more than just "lowercase, uppercase and titlecase letters", because for most people with English as a first language, that means [a-z] for lowercase and [A-Z] for uppercase.

Furthermore "symbols and punctuation" can mean different things in different programming languages and even in different human languages so there is not one single solution.

Maybe looking at the lexer for GHC itself will provide you some inspiration.

@sjakobi
Member Author

sjakobi commented Jul 29, 2018

Maybe looking at the lexer for GHC itself will provide you some inspiration.

Thanks. Yeah, in the end I want a lexer that detects the same identifiers that GHC itself will lex.

But I don't want to replicate GHC's strange Unicode workaround.

If Alex could provide macros corresponding to the Unicode general categories, building the lexer would be quite easy.
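(For context, that workaround is, roughly, that GHC's Alex-generated lexer never feeds real code points to the DFA: every non-ASCII character is collapsed to a single fake byte that records only its character class, keeping the state machine small. A minimal sketch of the idea; the byte values and the grouping are illustrative, not GHC's actual ones:)

    import Data.Char (generalCategory, GeneralCategory (..))
    import Data.Word (Word8)

    -- Collapse a character to a pseudo-byte: ASCII passes through
    -- unchanged, everything else is reduced to a class tag, so the
    -- DFA only ever sees 256 input symbols.
    classifyChar :: Char -> Word8
    classifyChar c
      | c <= '\x7f' = fromIntegral (fromEnum c)
      | otherwise = case generalCategory c of
          UppercaseLetter -> 0xf0
          TitlecaseLetter -> 0xf0  -- lumped in with uppercase
          LowercaseLetter -> 0xf1
          OtherLetter     -> 0xf2
          DecimalNumber   -> 0xf3
          MathSymbol      -> 0xf4
          _               -> 0xf5  -- everything else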

@sjakobi sjakobi changed the title from "Provide macros corresponding to the Data.Char predicates" to "Provide macros corresponding to the Unicode general categories" on Jul 29, 2018
@simonmar
Member

I think my concern with lexing UTF-8 directly in Alex for Haskell source code was that the generated state machine might be huge. I didn't actually do that experiment, though; I'd be interested in the results.

@JKTKops

JKTKops commented Jan 3, 2021

I'm not sure what to look at in the generated Haskell files to see whether that's a problem. Language.Javascript lexes UTF-8 directly, and I did something similar while experimenting with R7RS Scheme parsing. Perhaps someone who knows what to look for could check whether this approach causes problems?

@Ericson2314
Collaborator

Per #165, I would like to unfuse the UTF-8 and user-written automata to declutter the implementation, which we speculate is a bit confused because it might predate Char properly supporting Unicode.

(Even better would be to then implement proper automaton composition, allowing the user to choose whether or not to fuse the automata (when the underlying string is byte- rather than character-oriented), and to start exploring the proper categorical semantics of the language specs themselves! But I am getting starry-eyed and off-topic.)

Back to the point, once things can work Char-by-Char nice and simply, I hope character classes for arbitrary Unicode code-points will be a breeze.

@Javran

Javran commented Mar 18, 2021

This will be very useful indeed! I just did something similar to #126 (comment) and wish such support existed.

@Javran

Javran commented Mar 20, 2021

Just want to add a few notes on this issue:

(A bit of background: I'm following the Java SE 16 spec to write a parser for fun, so my knowledge below is based on my experience following that spec.)

One workaround I tried is to let Alex accept a wider language: Java forbids Unicode outside identifiers and literals, so I can take advantage of that fact and be precise only in the \x00-\x7F range:

$JavaIdentifierStartLite = [\x24\x41-\x5A\x5F\x61-\x7A\x80-\x10ffff]
$JavaIdentifierPartLite = [$JavaIdentifierStartLite\x00-\x08\x0E-\x1B\x30-\x39\x7F\x80-\x10ffff]

and then I can deal with the non-ASCII characters in an AlexAction. However, this doesn't work well, for several reasons:

  • Java SE 16 follows Unicode 13.0, but Data.Char.generalCategory depends on the UnicodeData shipped with GHC itself, so I'm observing some differences due to this misalignment
  
  • Another thing I attempted was to store the set of accepted Chars in a set (this set can be obtained on a JVM language by iterating through all Unicode code point values and calling Character.isJavaIdentifierStart and Character.isJavaIdentifierPart). However, both Data.IntSet and Data.HashSet are very slow even on my small test suites; in comparison, making Alex accurately recognize identifiers performs much better (the set approach is sketched below)
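For reference, the set-membership variant from the second point looks roughly like this (Data.Char.isAlpha stands in for the JVM-derived predicate, which would really come from an external table):

    import qualified Data.IntSet as IS
    import Data.Char (isAlpha, ord)

    -- Precompute the accepted code points into an IntSet, then test
    -- membership per character; this is the approach that turned out
    -- to be slow compared to letting Alex's DFA handle the ranges.
    identStartSet :: IS.IntSet
    identStartSet =
      IS.fromList [ cp | cp <- [0 .. 0x10FFFF], isAlpha (toEnum cp) ]

    isJavaIdentifierStartLite :: Char -> Bool
    isJavaIdentifierStartLite = (`IS.member` identStartSet) . ord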

My key takeaways:

  • Data.IntSet and Data.Set are way slower than having Alex handle it (please let me know if there are other alternatives; my Alex approach kind of "cheated" by grouping consecutive ranges like \x30000-\x3134A rather than storing every integer individually)
  • If we were to support general category macros, should we provide a way for the user to supply a UnicodeData version? (And how exactly should this be implemented? One possible shape is sketched below.)
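A sketch of one possible shape for that, assuming a user-supplied UnicodeData.txt in the standard semicolon-separated layout (field 0 is the code point in hex, field 2 the general category); real code would also need to handle the First/Last range-marker entries:

    import qualified Data.Map.Strict as M

    -- Parse lines like "0041;LATIN CAPITAL LETTER A;Lu;..." into a
    -- map from code point to the two-letter category name.
    parseUnicodeData :: String -> M.Map Int String
    parseUnicodeData src = M.fromList
      [ (read ("0x" ++ cp), cat)
      | l <- lines src
      , let fields = splitOn ';' l
      , (cp : _name : cat : _) <- [fields]
      ]
      where
        splitOn :: Char -> String -> [String]
        splitOn sep s = case break (== sep) s of
          (w, [])       -> [w]
          (w, _ : rest) -> w : splitOn sep rest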
