Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Get rid of overly long lines in generated .hs file #84

Open
alanz opened this issue Jan 11, 2016 · 14 comments
Open

Get rid of overly long lines in generated .hs file #84

alanz opened this issue Jan 11, 2016 · 14 comments
Labels
feature request output formatting Conerning the layout of the generated files

Comments

@alanz
Copy link

alanz commented Jan 11, 2016

Some packages provide the generated files to ease installation. If a user wants to open these in an editor or run tooling over them, this presents challenges.

e.g. http://hackage.haskell.org/package/Agda-2.4.2.2/docs/src/Agda-Syntax-Parser-Lexer.html#lexer

Consider splitting the generated code over multiple lines for

alex_base
alex_table
alex_check
alex_dflt
alex_accept
@sergv
Copy link
Contributor

sergv commented Jun 9, 2017

Apparently, ghc 8.0.2 is currently picky about the way unboxed byte arrays are written.

For instance, consider basic_typeclass.g test. When I fiddle the way alex_base is defined, the compiled executable produces segfaults in the generated code on Debian 9.0 x64. In particular, the following (which is generated currently), works

alex_base :: AlexAddr
alex_base = AlexA# "\xf8\xff\xff\xff\xd5\xff\xff\xff\xfa\xff\xff\xff\x00\x00\x00\x00\xa5\xff\xff\xff\xa6\xff\xff\xff\xa7\xff\xff\xff\xa8\xff\xff\xff\xa9\xff\xff\xff\xaa\xff\xff\xff\xab\xff\xff\xff\xac\xff\xff\xff\xad\xff\xff\xff\xae\xff\xff\xff\xaf\xff\xff\xff\xbb\xff\xff\xff\xbc\xff\xff\xff\xf6\xff\xff\xff\x10\x00\x00\x00\x11\x00\x00\x00\x2c\x00\x00\x00\x2d\x00\x00\x00\x2e\x00\x00\x00\x2f\x00\x00\x00\x30\x00\x00\x00\x31\x00\x00\x00\x32\x00\x00\x00\x33\x00\x00\x00\x34\x00\x00\x00\x35\x00\x00\x00\x00\x00\x00\x00\x87\x00\x00\x00\x00\x00\x00\x00"#

while this produces a segfault

alex_base :: AlexAddr
alex_base = AlexA# "\xf8\
                   \\xff\xff\xff\xd5\xff\xff\xff\xfa\xff\xff\xff\x00\x00\x00\x00\xa5\xff\xff\xff\xa6\xff\xff\xff\xa7\xff\xff\xff\xa8\xff\xff\xff\xa9\xff\xff\xff\xaa\xff\xff\xff\xab\xff\xff\xff\xac\xff\xff\xff\xad\xff\xff\xff\xae\xff\xff\xff\xaf\xff\xff\xff\xbb\xff\xff\xff\xbc\xff\xff\xff\xf6\xff\xff\xff\x10\x00\x00\x00\x11\x00\x00\x00\x2c\x00\x00\x00\x2d\x00\x00\x00\x2e\x00\x00\x00\x2f\x00\x00\x00\x30\x00\x00\x00\x31\x00\x00\x00\x32\x00\x00\x00\x33\x00\x00\x00\x34\x00\x00\x00\x35\x00\x00\x00\x00\x00\x00\x00\x87\x00\x00\x00\x00\x00\x00\x00"#

@simonmar
Copy link
Member

This could be due to CPP interacting badly with the backslash-end-of-line sequence, see https://downloads.haskell.org/~ghc/master/users-guide/phases.html#cpp-and-string-gaps

@alanz
Copy link
Author

alanz commented Jun 11, 2017

I think the haskell multi-line string syntax is bad, given standard linux backslash at end of line behaviour.

@sergv
Copy link
Contributor

sergv commented Jun 11, 2017

But, then, what would be the best way to split these long bytearrays into manageable chunks?

@alanz
Copy link
Author

alanz commented Jun 11, 2017 via email

@andreasabel
Copy link
Member

Apparently fixed by #107.
The CHANGELOG is silent about it, though.
At least in 3.2.6-generated files we do not see long hexstrings anymore, but long lists with just one entry per line.

@andreasabel andreasabel changed the title Feature request: manage generated line length Get rid of overly long lines in generated .hs file Jan 25, 2022
@sergv
Copy link
Contributor

sergv commented Jan 25, 2022

Unfortunately the #107 fix was reverted back since it didn't play well with cpphs which didn't like multiline string literals like

"foo\
\bar\
baz\
quux"

This is the original discussion that led to revert #116.

Perhaps cpphs is less of an issue nowadays and the fix could be restored? Thinking today about the cpphs not being able to handle the backslashes at end of lines at the time I'm not sure that it was cpphs doing the right thing and C preprocessors found in gcc and clang doing the worng thing. Given that standard C preprocessors could handle it I think cpphs being unable to do so is/was a bug, but admittedly I haven't checked the spec.

@andreasabel
Copy link
Member

Ah sorry, I was mistaken. The problem still exists for alex -g. Without option -g, it uses listArray to construct the automaton table, with -g it hacks it together with Addr# and a huge string literal.

I suppose the -g optimization would only affect the startup-time of the lexer (once for each startup of the executable containing the lexer). So how valuable is that optimization?
@simonmar would probably know more...

@sergv
Copy link
Contributor

sergv commented Feb 2, 2022

I did some benchmarking of fast-tags lexer on a set of hackage packages (up to 10000 Haskell files totalling 70Mb) and ghc bytestring literal is slightly but consistently better than arrays performancewise.

I also tried to mess around with generated lexers and manually switch boxed arrays to unboxed ones, replace indexing with unsafe indexing that doesn't do bounds checking and even replace arrays with vectors from the vector package. The best I managed to get is a version that uses unboxed arrays in the generated lexer where possible and uses unsafe indexing (TestLexerArrayUnboxed below).

My benchmark results as reported by tasty-bench (look at X s ± Y ms bits, total time is not relevant). TestLexerGhcOpt is the lexer produced by alex -g, TestLexerVanilla is the lexer produced by alex without any options, and TestLexerArrayUnboxed and TestLexerVector are my modifications of TestLexerVanilla to try different arrays:

number of files   = 9801
total text length = 71068208
All
  TestLexerGhcOpt:       OK (42.86s)
    6.013 s ± 140 ms
  TestLexerVanilla:      OK (51.18s)
    7.225 s ±  76 ms
  TestLexerArrayUnboxed: OK (47.70s)
    6.679 s ±  53 ms
  TestLexerVector:       OK (20.92s)
    6.888 s ± 175 ms

However, runtime is not the only thing that changes when alex array backend is used. Big array constants that are generated when -g is not passed tend to take a lot of time to compile and somehow produce significantly more (order of magnitude?) object code. For the results below, the object files for modules generated by alex (and some of them further modified by me to use different arrays) look like:

-rw-r--r-- 1 sergey sergey  39M Feb  2 14:52 /tmp/dist/build/x86_64-linux/ghc-8.10.7/fast-tags-2.0.1/b/benchmarkAlexOpt/build/benchmarkAlexOpt/benchmarkAlexOpt-tmp/TestLexerArrayUnboxed.o
-rw-r--r-- 1 sergey sergey 1.4M Feb  2 02:23 /tmp/dist/build/x86_64-linux/ghc-8.10.7/fast-tags-2.0.1/b/benchmarkAlexOpt/build/benchmarkAlexOpt/benchmarkAlexOpt-tmp/TestLexerGhcOpt.o
-rw-r--r-- 1 sergey sergey  38M Feb  2 02:24 /tmp/dist/build/x86_64-linux/ghc-8.10.7/fast-tags-2.0.1/b/benchmarkAlexOpt/build/benchmarkAlexOpt/benchmarkAlexOpt-tmp/TestLexerVanilla.o
-rw-r--r-- 1 sergey sergey  39M Feb  2 14:55 /tmp/dist/build/x86_64-linux/ghc-8.10.7/fast-tags-2.0.1/b/benchmarkAlexOpt/build/benchmarkAlexOpt/benchmarkAlexOpt-tmp/TestLexerVector.o

Even after stripping the difference is dramatic

-rw-r--r-- 1 sergey sergey  16M Feb  2 21:15 /tmp/dist/build/x86_64-linux/ghc-8.10.7/fast-tags-2.0.1/b/benchmarkAlexOpt/build/benchmarkAlexOpt/benchmarkAlexOpt-tmp/TestLexerArrayUnboxed.o
-rw-r--r-- 1 sergey sergey 763K Feb  2 21:15 /tmp/dist/build/x86_64-linux/ghc-8.10.7/fast-tags-2.0.1/b/benchmarkAlexOpt/build/benchmarkAlexOpt/benchmarkAlexOpt-tmp/TestLexerGhcOpt.o
-rw-r--r-- 1 sergey sergey  16M Feb  2 21:15 /tmp/dist/build/x86_64-linux/ghc-8.10.7/fast-tags-2.0.1/b/benchmarkAlexOpt/build/benchmarkAlexOpt/benchmarkAlexOpt-tmp/TestLexerVanilla.o
-rw-r--r-- 1 sergey sergey  16M Feb  2 21:15 /tmp/dist/build/x86_64-linux/ghc-8.10.7/fast-tags-2.0.1/b/benchmarkAlexOpt/build/benchmarkAlexOpt/benchmarkAlexOpt-tmp/TestLexerVector.o

I think the -g flag is a clear winner so far and personally I wouldn't like to get 1. longer compile times, 2. blowup in executable size and 3. performance hit (10% at least) just to be able to very occasionally look at the modules generated by alex. Surely, I tested only one lexer which is somewhat involved but doesn't exercise all alex features and it's not clear how representative it is of the more general uses of alex. I also have not tested standard backends that alex comes with.

PS I can share my messy benchmarks if anyone is curious.

@andreasabel
Copy link
Member

Great investigation, @sergv !

@andreasabel andreasabel reopened this Feb 2, 2022
@andreasabel andreasabel added the output formatting Conerning the layout of the generated files label Feb 2, 2022
@sergv
Copy link
Contributor

sergv commented Jul 9, 2022

@andreasabel I have an idea. I propose to restore my previous commit that makes lines smaller at the cost of meddling with preprocessor - this should mostly work with recent gcc and clang. For others I'll add a flag for alex, similar to --ghc or --debug that will disable long lines and thus e.g. projects that use only cpphs could always pass this flag and it'll work for them. Please share any objections.

@andreasabel
Copy link
Member

@sergv : Sorry for the delayed response and thanks for the investigation.

What you suggest could be a solution but I wonder whether this issue is worth any complications, as it is purely aesthetic.

this should mostly work with recent gcc and clang.

To verify this, we would have to diversify our CI.

@sergv
Copy link
Contributor

sergv commented Apr 14, 2023

I can take a look at restoring commit and diversifying CI. It wouldn't cause that much complications (see my old PR) logicwise, yet I recall being pissed whenever I had to open Alex generated files. I didn't open any lately and admittedly it's the editors who are wrong at not being able to handle files with very long lines. Yet in practice we have the editors we have and looks like the issue can be alleviated in alex. As a user I definitely did want this - that's why I did the PR before that fixed it.

@andreasabel
Copy link
Member

I can take a look at restoring commit and diversifying CI.

I suggest to create another workflow (independent of the semi-autogenerated haskell-ci.yml) that tests under Windows and macOS.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request output formatting Conerning the layout of the generated files
Projects
None yet
Development

No branches or pull requests

4 participants