Question

I'm looking for a way to trim an unknown text to a certain length while keeping only full sentences.

So text like this

"Were you born 1. 3. 1987 in Prague? Štěpán Jr. lives there for 3 years now! "

should be turned into

"Were you born 1. 3. 1987 in Prague? "

for a character limit of 50 or 40 (and of 20 with --find-next-sentence-ending).

I've read many SO questions. Most of the answers were variations of

substr($text, 0, strrpos($text, '.') + 1);

But that obviously fails for the sentence above and others like it. Others suggest using the Stanford Parser or OpenNLP. They are really cool, but not usable for a typical application. You would not install Java on your Ruby/PHP server just to trim a text, right? So I'm looking for an 80/20 solution that is language-agnostic and able to handle the typical cases that appear.

I couldn't think of a more problematic sentence than this one: it contains a date, a sentence ending that is not a dot, a non-ASCII character at the beginning of the next sentence, and a non-ending dot in the middle of the sentence that crosses the limit.

I also created a gist (https://gist.github.com/4051035) for you to fork and play with. Forking ensures that users can click through to different solutions to this problem, so please use it ;) I wanted to make this question community wiki, but it seems that works only for answers, not for questions. So please add any suggestions or relevant SO questions in the comments. Thanks.

Solution

If about 80% precision is enough for you, then you can apply a simple rule:

  • Each '?' and '!' marks the end of a sentence.
  • When you find a dot, check whether the next word starts with an upper-case letter but is not entirely in upper case (remember, this is only an 80/20 rule).
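
The rule above could be sketched roughly as follows. This is only an illustration in Python, not a tested library: the function names `sentence_ends` and `trim_to_sentences` and the `find_next` flag (standing in for --find-next-sentence-ending from the question) are made up here, and treating a dot at the very end of the text as a sentence end is an extra assumption.

```python
import re


def sentence_ends(text):
    """Return the index just past every mark that the 80/20 rule treats as a sentence end."""
    ends = []
    for match in re.finditer(r"[.?!]", text):
        pos = match.end()
        if match.group() in "?!":
            ends.append(pos)  # '?' and '!' always end a sentence
            continue
        rest = text[pos:]
        if not rest.strip():
            ends.append(pos)  # assumption: a dot at the end of the text ends a sentence
            continue
        if not rest[0].isspace():
            continue  # dot inside a token, e.g. "3.14"
        next_word = rest.split()[0]
        # Next word must start with an upper-case letter but not be all upper case.
        if next_word[0].isupper() and not next_word.isupper():
            ends.append(pos)
    return ends


def trim_to_sentences(text, limit, find_next=False):
    """Keep only whole sentences that fit into `limit` characters."""
    ends = sentence_ends(text)
    fitting = [end for end in ends if end <= limit]
    if fitting:
        return text[:fitting[-1]]
    if find_next and ends:
        return text[:ends[0]]  # no sentence fits: fall back to the first full one
    return ""
```

With the sentence from the question, `trim_to_sentences(text, 50)` and `trim_to_sentences(text, 40)` both return "Were you born 1. 3. 1987 in Prague?" (without the trailing space), and a limit of 20 gives the same result only with `find_next=True`. Taken literally, the rule misses a sentence that is followed by a one-letter word such as "I", because that word counts as all upper case.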

If you need something better, then I'm afraid you need an NLP library. If you have PHP/Ruby hosting, you should be able to use NLTK. It is written in Python and has great support.

Other tips

My rough idea for solving this would be to find the last sentence separator (i.e. dot-space), check whether there are dot-space separated numbers or a known pattern around that position, and if there are, cowardly pick the previous sentence. Maybe also calculate the next non-number dot-space position, and if it is within some soft limit (say $limit+10), pick the current sentence.
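
One possible reading of this idea, sketched in Python. Everything named here is an assumption for illustration: the abbreviation list `KNOWN_ABBREVIATIONS` stands in for the "known pattern", a dot right after a digit is treated as a non-ending dot, and `slack=10` mirrors the suggested $limit+10.

```python
import re

# Hypothetical list of "known patterns"; extend it for your own texts.
KNOWN_ABBREVIATIONS = ("Jr", "Sr", "Mr", "Mrs", "Dr")


def is_non_ending_dot(text, dot_index):
    """Cowardly guess: a dot after a digit or a known abbreviation does not end a sentence."""
    before = text[:dot_index]
    if before[-1:].isdigit():
        return True  # e.g. the dots in "1. 3. 1987"
    words = before.split()
    return bool(words) and words[-1] in KNOWN_ABBREVIATIONS


def trim_with_soft_limit(text, limit, slack=10):
    """Trim to whole sentences, allowing the current sentence to overrun by `slack`."""
    ends = [
        match.end()
        for match in re.finditer(r"[.?!](?=\s|$)", text)
        if not (match.group() == "." and is_non_ending_dot(text, match.start()))
    ]
    before = [end for end in ends if end <= limit]
    after = [end for end in ends if end > limit]
    # Keep the sentence crossing the limit if it finishes within the soft limit.
    if after and after[0] <= limit + slack:
        return text[:after[0]]
    # Otherwise fall back to the last sentence that fits completely.
    return text[:before[-1]] if before else ""
```

For the sentence from the question, limits of 50 and 40 both yield "Were you born 1. 3. 1987 in Prague?", and so does a limit of 30, because the question mark sits within the 10-character slack. The digit check is deliberately cowardly: a sentence that really ends in a number, such as "It happened in 1987.", is not recognized as ending there.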

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow