Unable to parse only one part of the file #298

Open
jvratanar opened this issue Jul 12, 2018 · 5 comments
@jvratanar

Is it possible to parse only, say, the first half of a file with ANTLR4? I am parsing large files and am using UnbufferedCharStream and UnbufferedTokenStream. I am not building a parse tree, and am using parser actions instead of the visitor/listener patterns. With these I was able to save a significant amount of RAM and improve parse speed.

However, it still takes around 15 s to parse the whole file. The parsed file is divided into two sections: the first half holds metadata, the second the actual data. The majority of the time is spent in the data section, as there are more than 3 million lines to be parsed; the metadata section has only around 20,000 lines. Is it possible to parse only the first half, which would improve parse speed significantly? Is it possible to inject EOF manually after the metadata section? How about dividing the file into two?

@sharwell
Member

sharwell commented Jul 12, 2018

Without manipulating data structures during the parse or creating customized implementations of ICharStream and/or ITokenStream, dividing the file into smaller units is the only straightforward way to limit buffering during parsing. UnbufferedCharStream and UnbufferedTokenStream place no limits on lookahead (all of which is buffered), so it's easy and common for them to provide no advantage. I would not recommend using unbuffered streams to address file-size issues; rather, break up the inputs separately and parse them using normal streams.
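Breaking the input up before it ever reaches ANTLR can be sketched roughly as follows (Java shown for the reference runtime; the C# target is analogous). The `DATA` section marker is an assumption taken from the workaround posted later in this thread, and `readUntilMarker` is a hypothetical helper, not part of any ANTLR API:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class SectionSplitter {
    // Reads lines up to (but not including) the section marker, returning
    // only the metadata prefix. The rest of the reader is left untouched,
    // so the data section could still be streamed separately if needed.
    public static String readUntilMarker(Reader input, String marker) throws IOException {
        BufferedReader reader = new BufferedReader(input);
        StringBuilder metadata = new StringBuilder();
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.trim().equals(marker)) {
                break; // stop before the data section
            }
            metadata.append(line).append('\n');
        }
        return metadata.toString();
    }

    public static void main(String[] args) throws IOException {
        String file = "name=example\nversion=2\nDATA\n1,2,3\n4,5,6\n";
        String meta = readUntilMarker(new StringReader(file), "DATA");
        System.out.print(meta);
        // The extracted prefix can then be handed to a normal buffered
        // ANTLR stream, e.g.: CharStream cs = CharStreams.fromString(meta);
    }
}
```

Since only the metadata prefix is materialized, the parser buffers about 20,000 lines instead of 3 million, regardless of which stream class is used.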

@jvratanar
Author

Could you give an example of how to divide the file into smaller units? I would not want to create multiple files on disk from one big one, as parsing will be done on a web server as part of file-upload validation.

@sharwell
Member

@jvratanar The process would be highly application-specific, but the basic assumption to operate under is that ANTLR will buffer all of the data available via StreamReader.ReadToEnd. The unbuffered char streams "try" to avoid buffering that much data, but in any trade-off between parsing accuracy and minimal buffering, ANTLR will always choose parsing accuracy.

A custom Stream implementation should allow control over the amount of data fed into ANTLR.
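A custom stream of this kind could be sketched as a bounded reader that reports end-of-stream after a fixed number of characters (Java shown; for the C# target the same idea would apply to a System.IO.TextReader or Stream wrapper). `BoundedReader` is a hypothetical helper, not part of the ANTLR runtime:

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;

// A Reader that reports end-of-stream after `limit` characters, so a
// downstream consumer (e.g. an ANTLR CharStream constructed from this
// Reader) never sees more than the first `limit` characters of input.
public class BoundedReader extends FilterReader {
    private long remaining;

    public BoundedReader(Reader in, long limit) {
        super(in);
        this.remaining = limit;
    }

    @Override
    public int read() throws IOException {
        if (remaining <= 0) return -1;     // pretend the input has ended
        int c = super.read();
        if (c != -1) remaining--;
        return c;
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        if (remaining <= 0) return -1;
        int n = super.read(buf, off, (int) Math.min(len, remaining));
        if (n > 0) remaining -= n;
        return n;
    }
}
```

This requires knowing (or finding) the byte/character offset of the section boundary up front; the Emit() override posted later in this thread avoids that by cutting off at a marker token instead.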

@jvratanar
Author

I thought UnbufferedCharStream and UnbufferedTokenStream were the way to go when dealing with large files, as recommended in the ANTLR book (pages 241 and 245). I understand the process is application-specific, but I would really appreciate any examples of dividing input into smaller units or of manually implementing ICharStream and/or ITokenStream. I was not able to find any online and do not have a good idea of how to deal with this in a web application.

In this web application, users upload files which are then parsed and validated. However, the files are big, and parsing takes a long time if I parse the whole file. Do you have any ideas on how to go about this?

@jvratanar
Author

jvratanar commented Jul 13, 2018

I was able to solve it like this:

public override IToken Emit()
{
    string tokenText = base.Text;
    // When only the metadata section is wanted, treat the "DATA"
    // section marker as end of input instead of emitting it as a token.
    if (this.metaDataOnly && tokenText == "DATA")
        return base.EmitEOF();
    return base.Emit();
}
