Get rid of overly long lines in generated .hs file #84

alanz · 2016-01-11T12:36:12Z

Some packages provide the generated files to ease installation. If a user wants to open these in an editor or run tooling over them, this presents challenges.

e.g. http://hackage.haskell.org/package/Agda-2.4.2.2/docs/src/Agda-Syntax-Parser-Lexer.html#lexer

Consider splitting the generated code over multiple lines for

alex_base
alex_table
alex_check
alex_dflt
alex_accept

The text was updated successfully, but these errors were encountered:

sergv · 2017-06-09T15:30:02Z

Apparently, ghc 8.0.2 is currently picky about the way unboxed byte arrays are written.

For instance, consider basic_typeclass.g test. When I fiddle the way alex_base is defined, the compiled executable produces segfaults in the generated code on Debian 9.0 x64. In particular, the following (which is generated currently), works

alex_base :: AlexAddr
alex_base = AlexA# "\xf8\xff\xff\xff\xd5\xff\xff\xff\xfa\xff\xff\xff\x00\x00\x00\x00\xa5\xff\xff\xff\xa6\xff\xff\xff\xa7\xff\xff\xff\xa8\xff\xff\xff\xa9\xff\xff\xff\xaa\xff\xff\xff\xab\xff\xff\xff\xac\xff\xff\xff\xad\xff\xff\xff\xae\xff\xff\xff\xaf\xff\xff\xff\xbb\xff\xff\xff\xbc\xff\xff\xff\xf6\xff\xff\xff\x10\x00\x00\x00\x11\x00\x00\x00\x2c\x00\x00\x00\x2d\x00\x00\x00\x2e\x00\x00\x00\x2f\x00\x00\x00\x30\x00\x00\x00\x31\x00\x00\x00\x32\x00\x00\x00\x33\x00\x00\x00\x34\x00\x00\x00\x35\x00\x00\x00\x00\x00\x00\x00\x87\x00\x00\x00\x00\x00\x00\x00"#

while this produces a segfault

alex_base :: AlexAddr
alex_base = AlexA# "\xf8\
                   \\xff\xff\xff\xd5\xff\xff\xff\xfa\xff\xff\xff\x00\x00\x00\x00\xa5\xff\xff\xff\xa6\xff\xff\xff\xa7\xff\xff\xff\xa8\xff\xff\xff\xa9\xff\xff\xff\xaa\xff\xff\xff\xab\xff\xff\xff\xac\xff\xff\xff\xad\xff\xff\xff\xae\xff\xff\xff\xaf\xff\xff\xff\xbb\xff\xff\xff\xbc\xff\xff\xff\xf6\xff\xff\xff\x10\x00\x00\x00\x11\x00\x00\x00\x2c\x00\x00\x00\x2d\x00\x00\x00\x2e\x00\x00\x00\x2f\x00\x00\x00\x30\x00\x00\x00\x31\x00\x00\x00\x32\x00\x00\x00\x33\x00\x00\x00\x34\x00\x00\x00\x35\x00\x00\x00\x00\x00\x00\x00\x87\x00\x00\x00\x00\x00\x00\x00"#

simonmar · 2017-06-11T12:42:31Z

This could be due to CPP interacting badly with the backslash-end-of-line sequence, see https://downloads.haskell.org/~ghc/master/users-guide/phases.html#cpp-and-string-gaps

alanz · 2017-06-11T12:54:34Z

I think the haskell multi-line string syntax is bad, given standard linux backslash at end of line behaviour.

sergv · 2017-06-11T12:57:04Z

But, then, what would be the best way to split these long bytearrays into manageable chunks?

alanz · 2017-06-11T13:02:01Z

Well, knowing that there is a CPP preprocessor for them always gives you options. Just use the CPP standard, so you end up with ``` alex_base :: AlexAddr alex_base = AlexA# "\xf8\\xff\xff\xff\xd5\xff\xff\xff\xfa\xff\xff\xff\x00\ \x00\x00\x00\xa5\xff\xff\xff\xa6\xff\xff\xff\xa7\xff\ \xff\xff\xa8\xff\xff\xff\xa9\xff\xff\xff\xaa\xff\xff\xff\xab\xff\xff\xff\xac\xff\xff\xff\xad\xff\xff\xff\xae\xff\xff\xff\xaf\xff\xff\xff\xbb\xff\xff\xff\xbc\xff\xff\xff\xf6\xff\xff\xff\x10\x00\x00\x00\x11\x00\x00\x00\x2c\x00\x00\x00\x2d\x00\x00\x00\x2e\x00\x00\x00\x2f\x00\x00\x00\x30\x00\x00\x00\x31\x00\x00\x00\x32\x00\x00\x00\x33\x00\x00\x00\x34\x00\x00\x00\x35\x00\x00\x00\x00\x00\x00\x00\x87\x00\x00\x00\x00\x00\x00\x00"# ```

…

On 11 June 2017 at 14:57, Sergey Vinokurov ***@***.***> wrote: But, then, what would be the best way to split these long bytearrays into manageable chunks? — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#84 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAZAB6b_Ke_pesgsh-p1bWJVC1Ej5Fxeks5sC-QhgaJpZM4HCQZX> .

andreasabel · 2022-01-25T08:33:23Z

Apparently fixed by #107.
The CHANGELOG is silent about it, though.
At least in 3.2.6-generated files we do not see long hexstrings anymore, but long lists with just one entry per line.

sergv · 2022-01-25T21:34:40Z

Unfortunately the #107 fix was reverted back since it didn't play well with cpphs which didn't like multiline string literals like

"foo\
\bar\
baz\
quux"

This is the original discussion that led to revert #116.

Perhaps cpphs is less of an issue nowadays and the fix could be restored? Thinking today about the cpphs not being able to handle the backslashes at end of lines at the time I'm not sure that it was cpphs doing the right thing and C preprocessors found in gcc and clang doing the worng thing. Given that standard C preprocessors could handle it I think cpphs being unable to do so is/was a bug, but admittedly I haven't checked the spec.

andreasabel · 2022-01-27T09:29:10Z

Ah sorry, I was mistaken. The problem still exists for alex -g. Without option -g, it uses listArray to construct the automaton table, with -g it hacks it together with Addr# and a huge string literal.

I suppose the -g optimization would only affect the startup-time of the lexer (once for each startup of the executable containing the lexer). So how valuable is that optimization?
@simonmar would probably know more...

sergv · 2022-02-02T21:26:59Z

I did some benchmarking of fast-tags lexer on a set of hackage packages (up to 10000 Haskell files totalling 70Mb) and ghc bytestring literal is slightly but consistently better than arrays performancewise.

I also tried to mess around with generated lexers and manually switch boxed arrays to unboxed ones, replace indexing with unsafe indexing that doesn't do bounds checking and even replace arrays with vectors from the vector package. The best I managed to get is a version that uses unboxed arrays in the generated lexer where possible and uses unsafe indexing (TestLexerArrayUnboxed below).

My benchmark results as reported by tasty-bench (look at X s ± Y ms bits, total time is not relevant). TestLexerGhcOpt is the lexer produced by alex -g, TestLexerVanilla is the lexer produced by alex without any options, and TestLexerArrayUnboxed and TestLexerVector are my modifications of TestLexerVanilla to try different arrays:

number of files   = 9801
total text length = 71068208
All
  TestLexerGhcOpt:       OK (42.86s)
    6.013 s ± 140 ms
  TestLexerVanilla:      OK (51.18s)
    7.225 s ±  76 ms
  TestLexerArrayUnboxed: OK (47.70s)
    6.679 s ±  53 ms
  TestLexerVector:       OK (20.92s)
    6.888 s ± 175 ms

However, runtime is not the only thing that changes when alex array backend is used. Big array constants that are generated when -g is not passed tend to take a lot of time to compile and somehow produce significantly more (order of magnitude?) object code. For the results below, the object files for modules generated by alex (and some of them further modified by me to use different arrays) look like:

-rw-r--r-- 1 sergey sergey  39M Feb  2 14:52 /tmp/dist/build/x86_64-linux/ghc-8.10.7/fast-tags-2.0.1/b/benchmarkAlexOpt/build/benchmarkAlexOpt/benchmarkAlexOpt-tmp/TestLexerArrayUnboxed.o
-rw-r--r-- 1 sergey sergey 1.4M Feb  2 02:23 /tmp/dist/build/x86_64-linux/ghc-8.10.7/fast-tags-2.0.1/b/benchmarkAlexOpt/build/benchmarkAlexOpt/benchmarkAlexOpt-tmp/TestLexerGhcOpt.o
-rw-r--r-- 1 sergey sergey  38M Feb  2 02:24 /tmp/dist/build/x86_64-linux/ghc-8.10.7/fast-tags-2.0.1/b/benchmarkAlexOpt/build/benchmarkAlexOpt/benchmarkAlexOpt-tmp/TestLexerVanilla.o
-rw-r--r-- 1 sergey sergey  39M Feb  2 14:55 /tmp/dist/build/x86_64-linux/ghc-8.10.7/fast-tags-2.0.1/b/benchmarkAlexOpt/build/benchmarkAlexOpt/benchmarkAlexOpt-tmp/TestLexerVector.o

Even after stripping the difference is dramatic

-rw-r--r-- 1 sergey sergey  16M Feb  2 21:15 /tmp/dist/build/x86_64-linux/ghc-8.10.7/fast-tags-2.0.1/b/benchmarkAlexOpt/build/benchmarkAlexOpt/benchmarkAlexOpt-tmp/TestLexerArrayUnboxed.o
-rw-r--r-- 1 sergey sergey 763K Feb  2 21:15 /tmp/dist/build/x86_64-linux/ghc-8.10.7/fast-tags-2.0.1/b/benchmarkAlexOpt/build/benchmarkAlexOpt/benchmarkAlexOpt-tmp/TestLexerGhcOpt.o
-rw-r--r-- 1 sergey sergey  16M Feb  2 21:15 /tmp/dist/build/x86_64-linux/ghc-8.10.7/fast-tags-2.0.1/b/benchmarkAlexOpt/build/benchmarkAlexOpt/benchmarkAlexOpt-tmp/TestLexerVanilla.o
-rw-r--r-- 1 sergey sergey  16M Feb  2 21:15 /tmp/dist/build/x86_64-linux/ghc-8.10.7/fast-tags-2.0.1/b/benchmarkAlexOpt/build/benchmarkAlexOpt/benchmarkAlexOpt-tmp/TestLexerVector.o

I think the -g flag is a clear winner so far and personally I wouldn't like to get 1. longer compile times, 2. blowup in executable size and 3. performance hit (10% at least) just to be able to very occasionally look at the modules generated by alex. Surely, I tested only one lexer which is somewhat involved but doesn't exercise all alex features and it's not clear how representative it is of the more general uses of alex. I also have not tested standard backends that alex comes with.

PS I can share my messy benchmarks if anyone is curious.

andreasabel · 2022-02-02T21:54:49Z

Great investigation, @sergv !

sergv · 2022-07-09T08:02:37Z

@andreasabel I have an idea. I propose to restore my previous commit that makes lines smaller at the cost of meddling with preprocessor - this should mostly work with recent gcc and clang. For others I'll add a flag for alex, similar to --ghc or --debug that will disable long lines and thus e.g. projects that use only cpphs could always pass this flag and it'll work for them. Please share any objections.

andreasabel · 2023-04-14T19:44:16Z

@sergv : Sorry for the delayed response and thanks for the investigation.

What you suggest could be a solution but I wonder whether this issue is worth any complications, as it is purely aesthetic.

this should mostly work with recent gcc and clang.

To verify this, we would have to diversify our CI.

sergv · 2023-04-14T19:55:15Z

I can take a look at restoring commit and diversifying CI. It wouldn't cause that much complications (see my old PR) logicwise, yet I recall being pissed whenever I had to open Alex generated files. I didn't open any lately and admittedly it's the editors who are wrong at not being able to handle files with very long lines. Yet in practice we have the editors we have and looks like the issue can be alleviated in alex. As a user I definitely did want this - that's why I did the PR before that fixed it.

andreasabel · 2023-04-14T20:06:22Z

I can take a look at restoring commit and diversifying CI.

I suggest to create another workflow (independent of the semi-autogenerated haskell-ci.yml) that tests under Windows and macOS.

alanz mentioned this issue Jan 15, 2016

ghc-mod eating up my memory DanielG/ghc-mod#716

Open

This was referenced Jun 15, 2017

Get rid of long lines in produced files #107

Merged

Feature request: provide Template Haskell quasiquoter that generates lexer definition in-place #108

Open

This was referenced Sep 7, 2017

Segfault using Alex 3.2.2 and cpphs agda/agda#2736

Closed

Compiled with alex-3.2.2, Agda segfaults #116

Closed

andreasabel closed this as completed Jan 25, 2022

andreasabel added the feature request label Jan 25, 2022

andreasabel changed the title ~~Feature request: manage generated line length~~ Get rid of overly long lines in generated .hs file Jan 25, 2022

andreasabel mentioned this issue Jan 25, 2022

Is this long columns of ints expected behaviour on the example? #161

Closed

andreasabel reopened this Feb 2, 2022

andreasabel added the output formatting Conerning the layout of the generated files label Feb 2, 2022

sergv mentioned this issue May 13, 2023

Simplified test suite that runs on Windows and MacOS #238

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Get rid of overly long lines in generated .hs file #84

Get rid of overly long lines in generated .hs file #84

alanz commented Jan 11, 2016

sergv commented Jun 9, 2017

simonmar commented Jun 11, 2017

alanz commented Jun 11, 2017

sergv commented Jun 11, 2017

alanz commented Jun 11, 2017 via email

andreasabel commented Jan 25, 2022

sergv commented Jan 25, 2022 •

edited

Loading

andreasabel commented Jan 27, 2022

sergv commented Feb 2, 2022

andreasabel commented Feb 2, 2022

sergv commented Jul 9, 2022

andreasabel commented Apr 14, 2023

sergv commented Apr 14, 2023

andreasabel commented Apr 14, 2023

Get rid of overly long lines in generated .hs file #84

Get rid of overly long lines in generated .hs file #84

Comments

alanz commented Jan 11, 2016

sergv commented Jun 9, 2017

simonmar commented Jun 11, 2017

alanz commented Jun 11, 2017

sergv commented Jun 11, 2017

alanz commented Jun 11, 2017 via email

andreasabel commented Jan 25, 2022

sergv commented Jan 25, 2022 • edited Loading

andreasabel commented Jan 27, 2022

sergv commented Feb 2, 2022

andreasabel commented Feb 2, 2022

sergv commented Jul 9, 2022

andreasabel commented Apr 14, 2023

sergv commented Apr 14, 2023

andreasabel commented Apr 14, 2023

sergv commented Jan 25, 2022 •

edited

Loading