Question

I know this might sound easy. I thought about using the first dot (.) as the sentence boundary, but when abbreviations and short forms come along, I am rendered helpless.

e.g. -

Sir Winston Leonard Spencer-Churchill, KG, OM, CH, TD, PC, DL, FRS, Hon. RA (30 November 1874 – 24 January 1965) was a British politician and statesman known for his leadership of the United Kingdom during the Second World War. He is widely regarded as one of the great wartime leaders and served as Prime Minister twice. A noted statesman and orator, Churchill was also an officer in the British Army, a historian, a writer, and an artist.

Here, the first dot comes after Hon., but I want the complete first sentence, ending at Second World War.

Is it possible, people?


Solution

If you use nltk you can add abbreviations, like this:

>>> import nltk
>>> sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
>>> sent_detector._params.abbrev_types.add('hon')
>>> sent_detector.tokenize(your_text)
['Sir Winston Leonard Spencer-Churchill, KG, OM, CH, TD, PC, DL, FRS, Hon. RA 
(30 November 1874 \xe2\x80\x93 24 January 1965) was a British politician and 
statesman known for his leadership of the United Kingdom during the Second 
World War.', 
'He is widely regarded as one of the great wartime leaders and served as Prime 
Minister twice.', 
'A noted statesman and orator, Churchill was also an officer in the British Army,
a historian, a writer, and an artist.']

This approach is based on Kiss & Strunk 2006, which reports that the F-score (harmonic mean of precision and recall) is between 91% and 99% for Punkt, depending on the test corpus.

Kiss, Tibor, and Jan Strunk. 2006. "Unsupervised Multilingual Sentence Boundary Detection." Computational Linguistics 32(4): 485-525.

OTHER TIPS

This is in general impossible. Abbreviations, numeric values ("$23.45", "32.5 degrees"), quotations ("he said: 'ha! you'll never [...]'") or names with punctuation (e.g. "Panic! At the Disco") or even whole subordinate clauses in brackets that are basically their own sentence ("the cook (who is also an excellent painter!) [...]") mean that you can't just split the text by dots and exclamation/question marks or use any other 'simple' approach.
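For instance, even a naive regex split on sentence punctuation mangles two of the cases above (the example string here is made up purely for illustration):

```python
import re

text = "Panic! At the Disco played. Tickets cost $23.45 each."

# Splitting on '.', '!' or '?' breaks both the band name and the price:
naive = re.split(r'[.!?]\s*', text)
print(naive)
# ['Panic', 'At the Disco played', 'Tickets cost $23', '45 each', '']
```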

Basically, to solve the general case you'd need a parser for natural language (in which case you may be better off using Prolog instead of Python) with a grammar that handles all these special cases. If you can reduce the problem to a less general one, e.g. only needing to deal with abbreviations and quotations, you may be able to hack something together - but you'd still need some sort of parser or state machine, as regular expressions are not powerful enough for these kinds of things.
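To sketch what such a state machine might look like, here is a toy illustration only - the abbreviation list is made up, the quote handling covers only double quotes, and apostrophes, nested quotes and everything else are ignored:

```python
ABBREVS = {'mr.', 'dr.', 'hon.', 'e.g.', 'i.e.'}  # made-up, non-exhaustive list

def split_sentences(text):
    sentences = []
    start = 0
    in_quote = False
    for i, ch in enumerate(text):
        if ch == '"':
            in_quote = not in_quote              # track quoted spans
        elif ch in '.!?' and not in_quote:
            last_word = text[:i + 1].rsplit(None, 1)[-1].lower()
            if last_word in ABBREVS:             # a period in an abbreviation is not a sentence end
                continue
            if i + 1 == len(text) or text[i + 1].isspace():
                sentences.append(text[start:i + 1].strip())
                start = i + 1
    if text[start:].strip():                     # trailing fragment without final punctuation
        sentences.append(text[start:].strip())
    return sentences

print(split_sentences('Dr. Jones arrived. He said "wait!" and left.'))
# ['Dr. Jones arrived.', 'He said "wait!" and left.']
```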

Have you looked into the natural language toolkit, nltk? It appears to have a sentence tokenizer available. http://nltk.googlecode.com/svn/trunk/doc/api/nltk.tokenize-module.html

The first sentence on Wikipedia almost always says what something is, was, are or were. Therefore, a possible solution is to not end the sentence until a linking verb (is, was, are, were) has been reached. This will not work 100% accurately of course, but here is a possible solution:

def get_first_sentence(my_string):
    # Verbs that (heuristically) mark the main clause of a definition sentence
    linking_verbs = set(['was', 'is', 'are', 'were'])

    first_sentence = []
    seen_linking_verb = False
    for word in my_string.split(' '):
        first_sentence.append(word)
        if word in linking_verbs:
            seen_linking_verb = True
        # Only treat a dot as the sentence end once a linking verb has been seen
        if '.' in word and seen_linking_verb:
            break

    return ' '.join(first_sentence)

Example 1:

Sir Winston Leonard Spencer-Churchill, KG, OM, CH, TD, PC, DL, FRS, Hon. RA (30 November 1874 – 24 January 1965) was a British politician and statesman known for his leadership of the United Kingdom during the Second World War. He is widely regarded as one of the great wartime leaders and served as Prime Minister twice. A noted statesman and orator, Churchill was also an officer in the British Army, a historian, a writer, and an artist.

my_string_1 = 'Sir Winston Leonard Spencer-Churchill, KG, OM, CH, TD, PC, DL, FRS, Hon. RA (30 November 1874 – 24 January 1965) was a British politician and statesman known for his leadership of the United Kingdom during the Second World War. He is widely regarded as one of the great wartime leaders and served as Prime Minister twice. A noted statesman and orator, Churchill was also an officer in the British Army, a historian, a writer, and an artist.'
first_sentence_1 =  get_first_sentence(my_string_1)

Result:

>>> first_sentence_1
'Sir Winston Leonard Spencer-Churchill, KG, OM, CH, TD, PC, DL, FRS, Hon. RA (30 November 1874 \xe2\x80\x93 24 January 1965) was a British politician and statesman known for his leadership of the United Kingdom during the Second World War.'

Example 2:

Python is a general-purpose, high-level programming language[11] whose design philosophy emphasizes code readability. Its syntax is said to be clear[12] and expressive.[13] Python has a large and comprehensive standard library.[14]

my_string_2 = 'Python is a general-purpose, high-level programming language[11] whose design philosophy emphasizes code readability. Its syntax is said to be clear[12] and expressive.[13] Python has a large and comprehensive standard library.[14]'
first_sentence_2 = get_first_sentence(my_string_2)

Result:

>>> first_sentence_2
'Python is a general-purpose, high-level programming language[11] whose design philosophy emphasizes code readability.'

Example 3:

China (Listeni/ˈtʃaɪnə/; Chinese: 中国; pinyin: Zhōngguó; see also Names of China), officially the People's Republic of China (PRC), is the world's most-populous country, with a population of over 1.3 billion. Covering approximately 9.6 million square kilometres, the East Asian state is the world's second-largest country by land area,[13] and the third- or fourth-largest in total area, depending on the definition of total area.[14]

my_string_3 = "China (Listeni/ˈtʃaɪnə/; Chinese: 中国; pinyin: Zhōngguó; see also Names of China), officially the People's Republic of China (PRC), is the world's most-populous country, with a population of over 1.3 billion. Covering approximately 9.6 million square kilometres, the East Asian state is the world's second-largest country by land area,[13] and the third- or fourth-largest in total area, depending on the definition of total area.[14]"
first_sentence_3 = get_first_sentence(my_string_3)

Result:

>>> first_sentence_3

    "China (Listeni/\xcb\x88t\xca\x83a\xc9\xaan\xc9\x99/; Chinese: \xe4\xb8\xad\xe5\x9b\xbd; pinyin: Zh\xc5\x8dnggu\xc3\xb3; see also Names of China), officially the People's Republic of China (PRC), is the world's most-populous country, with a population of over 1.3"

You can see the limitation in the last example, where the sentence is cut off too early because of the '.' in 1.3.

Also, the above is probably better done with a regex.
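For example, the same linking-verb idea could be sketched with re (the function name here is invented for illustration). Requiring whitespace or end-of-string after the period also happens to avoid the 1.3 problem:

```python
import re

def get_first_sentence_re(text):
    # Find the first linking verb ...
    verb = re.search(r'\b(?:is|was|are|were)\b', text)
    if not verb:
        return text
    # ... then the first period followed by whitespace (or end of string) after it
    end = re.search(r'\.(?=\s|$)', text[verb.end():])
    if not end:
        return text
    return text[:verb.end() + end.end()]
```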

Just an idea.

While a lot of the people here have good points, natural language processing is in fact a very difficult task, and a huge amount of research has been done on it with very unreliable results. However, there are solutions out there. Many people have mentioned the Natural Language Toolkit, which is one of the most powerful natural language processing tools in existence. NLTK does in fact have a ready-built sentence tokenizer, and while it is not perfect, it is very good. It is called the PunktSentenceTokenizer, and it filters out abbreviations fairly well. It has a good deal of trouble with more slangy speech, but for formal prose like you have above it works wonderfully. Documentation can be found here: http://nltk.googlecode.com/svn/trunk/doc/api/nltk.tokenize.punkt.PunktSentenceTokenizer-class.html

from nltk import tokenize

def print_sentences(text):
    test = tokenize.punkt.PunktSentenceTokenizer()
    return test.sentences_from_text(text)

Sadly, it doesn't actually work for the example you have put forth, but it does have a very detailed lookup and it catches a lot of abbreviations. I think a good deal of the problem with this example is that "Hon." is also a proper noun, and a dictionary will likely see it as such. It is possible to custom-configure your dictionary in nltk to catch this particular case, as in fraxel's answer. However, the simple tokenizer will not catch a lot of other abbreviations, or price notation, or other such common cases, which the punkt tokenizer will catch.

If you stick to the convention that a period ends a sentence only if it is followed by a space or a newline, you can do something like this (note that this convention still breaks on the example above, since "Hon." is also followed by a space):

s="Sir Winston Leonard Spencer-Churchill, KG, OM, CH, TD, PC, DL, FRS, Hon. RA (30 November 1874 – 24 January 1965) was a British politician and statesman known for his leadership of the United Kingdom during the Second World War. He is widely regarded as one of the great wartime leaders and served as Prime Minister twice. A noted statesman and orator, Churchill was also an officer in the British Army, a historian, a writer, and an artist."
sentence_delimiters = ['. ', '.\n', '? ', '?\n', '! ', '!\n']
pos = [s.find(delimiter) for delimiter in sentence_delimiters]
pos = min([p for p in pos if p >= 0])
print s[:pos]
Licensed under: CC-BY-SA with attribution