Question

Hi folks: I'm using the Stanford CoreNLP software to process hundreds of letters by different people (each about 10KB). After I get the output, I need to further process it and add information at the level of tokens, sentences, and letters. I'm quite new to NLP and was wondering what the typical or best way would be to output the pipeline results from Stanford CoreNLP to permit further processing?

I'm guessing the typical approach would be to output to XML. If I do, I estimate that will take about a GB of disk space, and I wonder how quick and easy it would be to load that much XML back into Java for further processing and adding information?
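For concreteness, here is roughly what I imagine the XML route looking like, a minimal sketch that assumes the pipeline's `xmlPrint` method; the annotator list and file name are just placeholders:

```java
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

import java.io.PrintWriter;
import java.util.Properties;

public class XmlExport {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Placeholder annotator list; use whatever your pipeline actually needs.
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma, parse");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        Annotation letter = new Annotation("My dear friend, I write to you...");
        pipeline.annotate(letter);

        // One XML file per letter; xmlPrint writes the standard CoreNLP XML format.
        try (PrintWriter out = new PrintWriter("letter-001.xml")) {
            pipeline.xmlPrint(letter, out);
        }
    }
}
```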

An alternative might be to have CoreNLP serialize the annotation objects it produces and then load those back for processing. One advantage: I wouldn't have to figure out how to convert a sentence parse string back into a tree for further processing. One disadvantage: annotation objects contain many different types of objects that I'm still quite rough at manipulating, and the documentation on these in Stanford CoreNLP seems slim to me.
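By serializing I mean something like the sketch below: plain Java serialization of the `Annotation` object (which, as I understand it, is what CoreNLP's own `serialized` output format uses by default); the class and method names here are just mine for illustration:

```java
import edu.stanford.nlp.pipeline.Annotation;

import java.io.*;

public class AnnotationIo {

    // Save a processed Annotation with plain Java serialization.
    static void save(Annotation doc, File file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(
                new BufferedOutputStream(new FileOutputStream(file)))) {
            out.writeObject(doc);
        }
    }

    // Load it back for further processing; the cast is safe as long as
    // the file was written by save() above.
    static Annotation load(File file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(
                new BufferedInputStream(new FileInputStream(file)))) {
            return (Annotation) in.readObject();
        }
    }
}
```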

Solution

This is really a matter of what you want to do afterwards. Serialization is probably the most straightforward and fastest approach; the downside is that you need to understand the CoreNLP data structures.
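To give you a feel for those structures, here is a minimal, untested sketch of the usual traversal over a processed document, using only the standard `CoreAnnotations` keys:

```java
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.util.CoreMap;

public class WalkAnnotation {

    static void walk(Annotation doc) {
        // A document is a list of sentences (CoreMaps)...
        for (CoreMap sentence : doc.get(CoreAnnotations.SentencesAnnotation.class)) {
            // ...and each sentence is a list of tokens (CoreLabels).
            for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
                String word = token.get(CoreAnnotations.TextAnnotation.class);
                String pos = token.get(CoreAnnotations.PartOfSpeechAnnotation.class);
                System.out.println(word + "/" + pos);
                // Your own information can be attached the same way, with
                // token.set(...) and a custom CoreAnnotation key.
            }
        }
    }
}
```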

If you want to read the output in another language, or read it into your own data structures, save it as XML instead.

I would go with the first option (serialization).
