Question

I am trying to parse out sentences from a huge amount of text using Java. I started off with NLP tools like OpenNLP and Stanford's Parser.

But here is where I get stuck: though both these parsers are pretty great, they fail when it comes to non-uniform text.

For example, in my text most sentences are delimited by a period, but in some cases, like bullet points, they aren't. Here both the parsers fail miserably.

I even tried setting the option in the Stanford parser for multiple sentence terminators, but the output was not much better!

Any ideas??

Edit: To make it simpler, I am looking to parse text where the delimiter is either a new line ("\n") or a period (".") ...

Solution

First you have to clearly define the task. What, precisely, is your definition of 'a sentence?' Until you have such a definition, you will just wander in circles.

Second, cleaning dirty text is usually a rather different task from 'sentence splitting'. The various NLP sentence chunkers assume relatively clean input text. Getting from HTML, extracted PowerPoint, or other noise to text is another problem.

Third, Stanford and other large-caliber tools are statistical, so they are guaranteed to have a non-zero error rate. The less your data looks like what they were trained on, the higher the error rate.

OTHER TIPS

Write a custom sentence splitter. You could use something like the Stanford splitter as a first pass and then write a rule-based post-processor to correct mistakes.

I did something like this for biomedical text I was parsing. I used the GENIA splitter and then fixed stuff after the fact.
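Here is a minimal sketch of that two-pass idea, using Stanford CoreNLP for the first pass. This answer doesn't show code, so this is my own illustration; the bullet-handling rule in the second pass is just one example of the kind of fix a post-processor might apply, not the rule set I actually used.

    import edu.stanford.nlp.ling.CoreAnnotations;
    import edu.stanford.nlp.pipeline.Annotation;
    import edu.stanford.nlp.pipeline.StanfordCoreNLP;
    import edu.stanford.nlp.util.CoreMap;

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Properties;

    public class TwoPassSplitter {

        public static List<String> split(String text) {
            // First pass: Stanford's statistical tokenizer and sentence splitter.
            Properties props = new Properties();
            props.setProperty("annotators", "tokenize,ssplit");
            StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

            Annotation document = new Annotation(text);
            pipeline.annotate(document);

            List<String> fixed = new ArrayList<>();
            for (CoreMap sentence : document.get(CoreAnnotations.SentencesAnnotation.class)) {
                String s = sentence.get(CoreAnnotations.TextAnnotation.class);
                // Second pass (example rule): re-split any "sentence" that still
                // contains embedded bullet markers at the start of a line.
                for (String part : s.split("(?m)^\\s*[\\u2022*\\-]\\s*")) {
                    if (!part.trim().isEmpty()) {
                        fixed.add(part.trim());
                    }
                }
            }
            return fixed;
        }
    }

The statistical pass handles the clean prose, and the cheap rules mop up the formats it was never trained on.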

EDIT: If you are taking HTML as input, then you should preprocess it first, for example by handling bulleted lists. Then apply your splitter.
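For example, with an HTML parser like jsoup (my choice here, not something this answer prescribes), you can terminate every bullet item with a period before flattening the document to text, so that a downstream splitter sees each item as its own sentence:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class HtmlPreprocessor {

        // Ensure every <li> item ends with a period, so that once the HTML
        // is flattened to plain text the splitter sees a sentence boundary.
        public static String toSplitterFriendlyText(String html) {
            Document doc = Jsoup.parse(html);
            for (Element li : doc.select("li")) {
                String item = li.text().trim();
                if (!item.isEmpty() && !item.endsWith(".")) {
                    li.appendText(".");
                }
            }
            return doc.text();
        }
    }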

There's one more excellent toolkit for natural language processing: GATE. It has a number of sentence splitters, including the standard ANNIE sentence splitter (which doesn't fit your needs completely) and the RegEx sentence splitter. Use the latter for any tricky splitting.

The exact pipeline for your purpose is:

  1. Document Reset PR.
  2. ANNIE English Tokenizer.
  3. ANNIE RegEx Sentence Splitter.

You can also use GATE's JAPE rules for even more flexible pattern matching. (See the Tao for the full GATE documentation.)
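If you want to drive that pipeline from Java (GATE Embedded) rather than from the GATE Developer GUI, it looks roughly like the sketch below. The plugin-loading call and the resource class names are my assumptions based on the standard ANNIE plugin and vary between GATE versions, so check them against your installation.

    import gate.Corpus;
    import gate.Document;
    import gate.Factory;
    import gate.Gate;
    import gate.ProcessingResource;
    import gate.creole.SerialAnalyserController;

    import java.io.File;

    public class GatePipeline {

        public static void main(String[] args) throws Exception {
            Gate.init();
            // Load the ANNIE plugin; the exact mechanism depends on your GATE version.
            Gate.getCreoleRegister().registerDirectories(
                    new File(Gate.getPluginsHome(), "ANNIE").toURI().toURL());

            // The pipeline from this answer: reset, tokenizer, RegEx splitter.
            SerialAnalyserController pipeline = (SerialAnalyserController)
                    Factory.createResource("gate.creole.SerialAnalyserController");
            pipeline.add((ProcessingResource)
                    Factory.createResource("gate.creole.annotdelete.AnnotationDeletePR"));
            pipeline.add((ProcessingResource)
                    Factory.createResource("gate.creole.tokeniser.DefaultTokeniser"));
            pipeline.add((ProcessingResource)
                    Factory.createResource("gate.creole.splitter.RegexSentenceSplitter"));

            Corpus corpus = Factory.newCorpus("corpus");
            Document doc = Factory.newDocument("First sentence. Second one\nA bullet-like line");
            corpus.add(doc);
            pipeline.setCorpus(corpus);
            pipeline.execute();

            // Sentences end up as "Sentence" annotations in the default set.
            System.out.println(doc.getAnnotations().get("Sentence"));
        }
    }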

If you would like to stick with Stanford NLP or OpenNLP, then you'd better retrain the model. Almost all of the tools in these packages are machine-learning based. Only with customized training data can they give you an ideal model and performance.

Here is my suggestion: manually split the sentences based on your criteria. I guess a couple of thousand sentences is enough. Then call the API or the command line to retrain the sentence splitter. Then you're done!
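With OpenNLP, for instance, the retraining step itself is only a few lines. This sketch assumes a hypothetical training file named sentences.train containing your manually split data in OpenNLP's format (one sentence per line), and uses the 1.9.x API:

    import opennlp.tools.sentdetect.SentenceDetectorFactory;
    import opennlp.tools.sentdetect.SentenceDetectorME;
    import opennlp.tools.sentdetect.SentenceModel;
    import opennlp.tools.sentdetect.SentenceSample;
    import opennlp.tools.sentdetect.SentenceSampleStream;
    import opennlp.tools.util.MarkableFileInputStreamFactory;
    import opennlp.tools.util.ObjectStream;
    import opennlp.tools.util.PlainTextByLineStream;
    import opennlp.tools.util.TrainingParameters;

    import java.io.BufferedOutputStream;
    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.OutputStream;
    import java.nio.charset.StandardCharsets;

    public class RetrainSentenceDetector {

        public static void main(String[] args) throws Exception {
            // One manually split sentence per line.
            ObjectStream<String> lines = new PlainTextByLineStream(
                    new MarkableFileInputStreamFactory(new File("sentences.train")),
                    StandardCharsets.UTF_8);
            ObjectStream<SentenceSample> samples = new SentenceSampleStream(lines);

            SentenceModel model = SentenceDetectorME.train(
                    "en", samples, new SentenceDetectorFactory(),
                    TrainingParameters.defaultParams());

            // Save the custom model so it can be loaded like the stock en-sent.bin.
            try (OutputStream out = new BufferedOutputStream(
                    new FileOutputStream("en-custom-sent.bin"))) {
                model.serialize(out);
            }
        }
    }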

But first of all, one thing you need to figure out is, as said in the answers above: "First you have to clearly define the task. What, precisely, is your definition of 'a sentence'?"

I'm using Stanford NLP and OpenNLP in my project, Dishes Map, a delicious-dishes discovery engine based on NLP and machine learning. They're working very well!

For a similar case, what I did was first separate the text into different lines based on where I wanted it to split: in your case, text starting with bullets (or, more exactly, text with a line-break tag at the end). This also solves the similar problem that can occur if you are working with HTML. After separating the text into individual lines, you can send each line for sentence detection, which will be more accurate.
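A minimal sketch of this two-stage approach, using OpenNLP's detector for the second stage (any splitter would do; the model file name is the stock English one):

    import opennlp.tools.sentdetect.SentenceDetectorME;
    import opennlp.tools.sentdetect.SentenceModel;

    import java.io.FileInputStream;
    import java.io.InputStream;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    public class LineThenSentenceSplitter {

        // Stage 1: pre-split on line breaks (covers bullets and breaks derived
        // from <br> tags). Stage 2: run the statistical detector per chunk.
        public static List<String> split(String text, SentenceDetectorME detector) {
            List<String> sentences = new ArrayList<>();
            for (String chunk : text.split("\\R+")) { // \R = any line break (Java 8+)
                chunk = chunk.trim();
                if (!chunk.isEmpty()) {
                    Collections.addAll(sentences, detector.sentDetect(chunk));
                }
            }
            return sentences;
        }

        public static void main(String[] args) throws Exception {
            try (InputStream in = new FileInputStream("en-sent.bin")) {
                SentenceDetectorME detector = new SentenceDetectorME(new SentenceModel(in));
                split("Intro line.\n- first bullet\n- second bullet", detector)
                        .forEach(System.out::println);
            }
        }
    }

This way the detector never sees a bullet run-on; each line is already a clean unit, and the period-delimited prose inside a line is still split statistically.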

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow