Question

I need to detect sentence boundaries in HTML. There is lots of sentence boundary detection software out there (java.text.BreakIterator is the one I'm using), but all of it assumes plain text. HTML is richer than that, and includes some clues as to where sentences break.

For example, <p>, <ul>/<li>, <td> and other tags mark sentence boundaries, or at least indicate that a sentence probably doesn't extend across them. <b>, <i>, <em>, <span>, <a> and a few other tags can appear inside a sentence.

Is anyone aware of any software that takes advantage of HTML markup, in addition to the normal NLP stuff, in determining sentence boundaries?


Solution

The solution I implemented has three steps: (1) split the document into separate blocks at every HTML tag except the inline ones (<i>, <b>, <span>, etc.); (2) strip the inline tags from each block; (3) look for sentences within each block using traditional methods.
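
A minimal Java sketch of that approach, using java.text.BreakIterator as the "traditional method". This is not the original implementation: the class name HtmlSentenceSplitter, the method name sentences, and the list of inline tags are all assumptions made for illustration. It also uses regular expressions in place of a real HTML parser, so it does not decode entities or handle comments, scripts, or malformed markup.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical helper class; the name and API are illustrative only.
final class HtmlSentenceSplitter {

    // Assumed set of inline tags that may occur inside a sentence.
    // The \b keeps <b> from matching <br> or <body>, and <u> from matching <ul>.
    private static final Pattern INLINE_TAG = Pattern.compile(
            "</?(?:a|abbr|b|cite|code|em|i|small|span|strong|sub|sup|u)\\b[^>]*>",
            Pattern.CASE_INSENSITIVE);

    // Every other tag is treated as a block boundary.
    private static final Pattern ANY_TAG = Pattern.compile("<[^>]+>");

    static List<String> sentences(String html) {
        List<String> result = new ArrayList<>();

        // Remove inline tags first. This gives the same blocks as splitting on
        // non-inline tags and then stripping inline tags from each block.
        String withoutInline = INLINE_TAG.matcher(html).replaceAll("");

        // Split into blocks at every remaining (non-inline) tag.
        for (String block : ANY_TAG.split(withoutInline)) {
            String text = block.trim();
            if (text.isEmpty()) {
                continue;
            }

            // Run ordinary plain-text sentence detection inside the block.
            BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
            it.setText(text);
            int start = it.first();
            for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
                String sentence = text.substring(start, end).trim();
                if (!sentence.isEmpty()) {
                    result.add(sentence);
                }
            }
        }
        return result;
    }
}
```

Calling HtmlSentenceSplitter.sentences on "<p>Hello <b>world</b>. How are you?</p><ul><li>First item</li><li>Second item</li></ul>" keeps "Hello world." as one sentence despite the <b> tag, and returns each list item separately even though the items have no closing punctuation. For real-world pages, a proper HTML parser would be a safer way to find the block boundaries.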

Licensed under: CC-BY-SA with attribution