Question

Is there a library for Python or JavaScript that tokenizes a string into sentences and puts each sentence on its own line?

For example, from:

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vestibulum aliquet leo in urna hendrerit placerat. Donec adipiscing dignissim adipiscing. Duis adipiscing mollis cursus. Etiam fringilla elit nec enim sagittis a auctor nisi gravida. Nunc sollicitudin, leo sit amet consequat pharetra, mi orci vestibulum mi, a suscipit odio tellus tincidunt erat. Suspendisse a consequat turpis. Morbi eget ante leo, a dignissim mi.

to

Lorem ipsum dolor sit amet, consectetur adipiscing elit.\n
Vestibulum aliquet leo in urna hendrerit placerat.\n
Donec adipiscing dignissim adipiscing.\n
Duis adipiscing mollis cursus.\n
Etiam fringilla elit nec enim sagittis a auctor nisi gravida.\n
Nunc sollicitudin, leo sit amet consequat pharetra, mi orci vestibulum mi, a suscipit odio tellus tincidunt erat.\n
Suspendisse a consequat turpis.\n
Morbi eget ante leo, a dignissim mi.

Solution

You are looking for a natural language library.

For Python, there is the Natural Language Toolkit (NLTK). For example, take a look at its PunktSentenceTokenizer.

The PunktSentenceTokenizer divides a text into a list of sentences, by using an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences. It must be trained on a large collection of plaintext in the target language before it can be used. The algorithm for this tokenizer is described in Kiss & Strunk (2006):

Kiss, Tibor and Strunk, Jan (2006): Unsupervised Multilingual Sentence Boundary Detection. Computational Linguistics 32: 485-525.

The NLTK data package includes a pre-trained Punkt tokenizer for English.
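
As a rough sketch of that suggestion, the snippet below puts each sentence on its own line with PunktSentenceTokenizer. It assumes NLTK is installed (pip install nltk) and uses an untrained tokenizer with default parameters, which is enough for plain text like the sample but knows no abbreviations. The helper name one_sentence_per_line is made up for this example. For real English text, the pre-trained model mentioned above (available through nltk.sent_tokenize once the NLTK data has been downloaded) is probably the better choice.

```python
def one_sentence_per_line(text):
    """Return `text` with each detected sentence on its own line.

    Hypothetical helper for illustration; not part of NLTK.
    """
    # Imported here so the function can be defined even where NLTK is missing.
    from nltk.tokenize.punkt import PunktSentenceTokenizer

    # No training text is passed, so Punkt falls back to its default
    # parameters: it knows no abbreviations or collocations.
    tokenizer = PunktSentenceTokenizer()
    sentences = tokenizer.tokenize(text)  # list of sentence strings
    return "\n".join(sentences)
```

Calling print(one_sentence_per_line(s)) on the paragraph from the question should then print one sentence per line.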

OTHER TIPS

In Python, use str.replace():

>>> s = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vestibulum aliquet leo in urna hendrerit placerat. Donec adipiscing dignissim adipiscing. Duis adipiscing mollis cursus. Etiam fringilla elit nec enim sagittis a auctor nisi gravida. Nunc sollicitudin, leo sit amet consequat pharetra, mi orci vestibulum mi, a suscipit odio tellus tincidunt erat. Suspendisse a consequat turpis. Morbi eget ante leo, a dignissim mi."
>>> print(s.replace('. ', '.\n'))
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Vestibulum aliquet leo in urna hendrerit placerat.
Donec adipiscing dignissim adipiscing.
Duis adipiscing mollis cursus.
Etiam fringilla elit nec enim sagittis a auctor nisi gravida.
Nunc sollicitudin, leo sit amet consequat pharetra, mi orci vestibulum mi, a suscipit odio tellus tincidunt erat.
Suspendisse a consequat turpis.
Morbi eget ante leo, a dignissim mi.

Also, you may be interested in the textwrap module.
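
For context, textwrap wraps at a fixed column width rather than at sentence boundaries, so it complements the sentence split rather than replacing it. A minimal sketch with the standard library follows; the shortened sample text and the width of 40 are arbitrary choices for illustration.

```python
import textwrap

text = ("Lorem ipsum dolor sit amet, consectetur adipiscing elit. "
        "Vestibulum aliquet leo in urna hendrerit placerat.")

# fill() breaks the text at whitespace so that no line exceeds `width` characters.
wrapped = textwrap.fill(text, width=40)
print(wrapped)
```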

If you're just looking for JavaScript that does this, you could use:

var str = "Lorem ipsum 4.00 dolor sit amet, consectetur adipiscing elit. Vestibulum aliquet leo in urna hendrerit placerat. Donec adipiscing dignissim adipiscing. Duis adipiscing mollis cursus. Etiam fringilla elit nec enim sagittis a auctor nisi gravida. Nunc etc.... sollicitudin, leo sit amet consequat pharetra, mi orci vestibulum mi, a suscipit odio tellus tincidunt erat. Suspendisse a consequat turpis. Morbi eget ante leo, a dignissim mi."

str = str.replace(/(\S\.)\s*([A-Z])/g, "$1\n$2");

You can see it work here: http://jsfiddle.net/jfriend00/NR5Nc/.

This regular expression inserts a newline only where a non-whitespace character and a period are followed by optional whitespace and then a capital letter. That keeps it from splitting on things like 4.00 and etc...., which don't actually end a sentence. It is also flexible about the amount of whitespace between sentences, and it replaces that whitespace with a single newline.
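
Since the question also asks about Python, the same idea can be sketched with the re module. This is a direct port of the regular expression above, not a different algorithm, and the function name split_on_capital is invented for this example. Like the JavaScript version, it would still break after abbreviations such as "Dr." when a capitalized word follows.

```python
import re

# A non-whitespace character plus a period, optional whitespace, then a capital letter.
SENTENCE_END = re.compile(r"(\S\.)\s*([A-Z])")

def split_on_capital(text):
    """Insert a newline between a sentence-ending period and the next capital letter."""
    # \1 keeps the end of the sentence, \2 keeps the capital that starts the next one;
    # whatever whitespace sat between them is replaced by a single newline.
    return SENTENCE_END.sub(r"\1\n\2", text)

sample = "Lorem ipsum 4.00 dolor. Nunc etc.... sollicitudin erat. Morbi eget."
print(split_on_capital(sample))
```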

Licensed under: CC-BY-SA with attribution