Question

Hi folks: I'm using the Stanford CoreNLP software to process hundreds of letters by different people (each about 10KB). After I get the output, I need to further process it and add information at the level of tokens, sentences, and letters. I'm quite new to NLP and was wondering what the typical or best way would be to output the pipeline results from Stanford CoreNLP to permit further processing?

I'm guessing the typical approach would be to output to XML. If I do, I estimate that will take about a GB of disk space, and I wonder how quick and easy it would be to load that much XML back into Java for further processing and adding information?
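For concreteness, here is roughly what I imagine the XML route looking like, a minimal sketch that assumes the pipeline's `xmlPrint` method; the annotator list and file name are just placeholders:

```java
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

import java.io.PrintWriter;
import java.util.Properties;

public class XmlExport {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Placeholder annotator list; use whatever your pipeline actually needs.
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma, parse");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        Annotation letter = new Annotation("My dear friend, I write to you...");
        pipeline.annotate(letter);

        // One XML file per letter; xmlPrint writes the standard CoreNLP XML format.
        try (PrintWriter out = new PrintWriter("letter-001.xml")) {
            pipeline.xmlPrint(letter, out);
        }
    }
}
```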

An alternative might be to have CoreNLP serialize the annotation objects it produces and then load those back for processing. One advantage: I wouldn't have to figure out how to convert a sentence parse string back into a tree for further processing. One disadvantage: annotation objects contain many different types of objects that I'm still quite rough at manipulating, and the documentation on these in Stanford CoreNLP seems slim to me.
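By serializing I mean something like the sketch below: plain Java serialization of the `Annotation` object (which, as I understand it, is what CoreNLP's own `serialized` output format uses by default); the class and method names here are just mine for illustration:

```java
import edu.stanford.nlp.pipeline.Annotation;

import java.io.*;

public class AnnotationIo {

    // Save a processed Annotation with plain Java serialization.
    static void save(Annotation doc, File file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(
                new BufferedOutputStream(new FileOutputStream(file)))) {
            out.writeObject(doc);
        }
    }

    // Load it back for further processing; the cast is safe as long as
    // the file was written by save() above.
    static Annotation load(File file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(
                new BufferedInputStream(new FileInputStream(file)))) {
            return (Annotation) in.readObject();
        }
    }
}
```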

Solution

This is really a matter of what you want to do afterwards. Serialization is probably the most straightforward and fastest approach; the downside is that you need to understand the CoreNLP data structures.
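To give you a feel for those structures, here is a minimal, untested sketch of the usual traversal over a processed document, using only the standard `CoreAnnotations` keys:

```java
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.util.CoreMap;

public class WalkAnnotation {

    static void walk(Annotation doc) {
        // A document is a list of sentences (CoreMaps)...
        for (CoreMap sentence : doc.get(CoreAnnotations.SentencesAnnotation.class)) {
            // ...and each sentence is a list of tokens (CoreLabels).
            for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
                String word = token.get(CoreAnnotations.TextAnnotation.class);
                String pos = token.get(CoreAnnotations.PartOfSpeechAnnotation.class);
                System.out.println(word + "/" + pos);
                // Your own information can be attached the same way, with
                // token.set(...) and a custom CoreAnnotation key.
            }
        }
    }
}
```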

If you want to read the output in another language, or read it into your own data structures, save it as XML instead.

I would go with the first option (serialization).
