Enable CoNLL output. #31

Open

wants to merge 1 commit into base: master
Conversation

ayrtonmassey

This patch adds the CoNLL output of Stanford CoreNLP to the JSON annotation.

The data is returned in two forms:

 - In its raw form as `conll_raw`, in the same format as given when CoreNLP is run
   from the command line using the flag `-outputFormat conll`

 - Per-sentence as `deps_conll`, which adds CoNLL dependencies to each sentence.

To enable the CoNLL output, pass `"outputFormat": "conll"` in the
`configdict` when creating a new `CoreNLP` instance.
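
For example, here is a minimal usage sketch (the jar path and annotator list are just placeholders, and it assumes the wrapper's usual `CoreNLP(configdict=..., corenlp_jars=...)` constructor and `parse_doc()` call):

```python
from stanford_corenlp_pywrapper import CoreNLP

# Enable CoNLL output alongside the usual annotators.
proc = CoreNLP(
    configdict={
        "annotators": "tokenize, ssplit, pos, lemma, ner, parse",
        "outputFormat": "conll",          # the option added by this patch
    },
    corenlp_jars=["/path/to/corenlp/*"],  # placeholder path
)

doc = proc.parse_doc("The quick brown fox jumped over the lazy dog.")

# With the patch applied, the document carries the raw CoNLL dump ...
print(doc["conll_raw"])

# ... and each sentence carries its own CoNLL dependencies.
for sentence in doc["sentences"]:
    print(sentence["deps_conll"])
```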
@ayrtonmassey
Author

There are a couple of issues with the code I've written. Firstly, both of the new functions throw exceptions (an IOException from writing to an OutputStream, and a NumberFormatException from the use of Integer.parseInt()).

I added catch blocks for them, but I wasn't sure how to respond: in the case of an IOException, the CoNLL annotation will not occur but the rest of the annotation will still be returned. A NumberFormatException, however, will result in some sentences having a CoNLL annotation and others not.

I doubt either of these will occur, since the output is taken directly from CoreNLP, but it's possible.

I'm also not sure what happens if a blank document is given - it just occurred to me to test that now.

This is my first pull request, so I apologise if it's a bit messed up!

@brendano
Owner

Thanks! One question I have is: what's the purpose of having CoNLL output? If it's to be compatible with other systems that want to input or output CoNLL format, why is the version here slightly different, using JSON objects instead of the tab-separated format in CoNLL? Or, why use this wrapper code at all instead of using CoreNLP directly? What exactly is the use case?

@ayrtonmassey
Author

I'm trying to use SEMAFOR, which accepts CoNLL data as input, to perform semantic frame analysis. Since I'm already using the wrapper for NER/coref, it'd be nice to get the CoNLL output as well rather than running a separate program. This means I don't have to:

 - Run two instances of Stanford CoreNLP: one with the wrapper for NER/coref, the other directly to obtain CoNLL output.
 - Try to integrate a separate system, e.g. MaltParser.

If the wrapper is already doing the annotation, I may as well have it produce the CoNLL output too - especially as the wrapper is already integrated with my software.

I did include the raw tab-separated CoNLL data under `conll_raw` since I wasn't sure which form was preferable; for some reason Stanford uses its own CoNLL format instead of CoNLL-X or CoNLL-U. For me, including the CoNLL data per-sentence as JSON objects makes it possible to reconstruct the data in CoNLL-X format, although I assume people looking to use this feature would want the raw data, so I included both.
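
Roughly, the reconstruction I have in mind looks something like this (just a sketch: it assumes each `deps_conll` entry is a per-token row of `[index, word, lemma, pos, ner, head, deprel]`, i.e. the columns CoreNLP prints with `-outputFormat conll`, and it pads the remaining CoNLL-X fields with underscores; the NER column has no CoNLL-X slot, so it is dropped):

```python
def sentence_to_conllx(deps_conll):
    """Convert one sentence's deps_conll rows into CoNLL-X formatted lines.

    Assumes each row is [index, word, lemma, pos, ner, head, deprel],
    matching the columns CoreNLP prints with -outputFormat conll.
    """
    lines = []
    for idx, word, lemma, pos, ner, head, deprel in deps_conll:
        # CoNLL-X columns: ID FORM LEMMA CPOSTAG POSTAG FEATS HEAD DEPREL PHEAD PDEPREL
        fields = [str(idx), word, lemma, pos, pos, "_", str(head), deprel, "_", "_"]
        lines.append("\t".join(fields))
    return "\n".join(lines)


# `doc` is the dict returned by parse_doc(); sentences are separated by a blank line.
conllx_doc = "\n\n".join(
    sentence_to_conllx(sentence["deps_conll"]) for sentence in doc["sentences"]
)
```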
