Question

Is it possible to tokenize a text so that a first and last name are combined into one token? For example, if my text is:

text = "Barack Obama is the President"

Then:

text.split()

results in:

['Barack', 'Obama', 'is', 'the', 'President']

How can I recognize the first and last name, so that I get only ['Barack Obama', 'is', 'the', 'President'] as tokens?

Is there a way to achieve it in Python?

Solution

What you are looking for is a named entity recognition (NER) system. I suggest you do not treat this as part of tokenization: tokenize first, then identify the entities in a separate step.

For Python you can use https://pypi.python.org/pypi/ner/

Example from the site:

>>> tagger.json_entities("Alice went to the Museum of Natural History.")
'{"ORGANIZATION": ["Museum of Natural History"], "PERSON": ["Alice"]}'
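
Once a tagger has returned the entity strings, combining them into single tokens is a separate, simple step. The following is only a minimal sketch of that step, assuming the entity names are already available as a plain list (hard-coded here to stand in for the tagger output). `tokenize_with_entities` is a hypothetical helper written for illustration; it is not part of the `ner` package.

```python
import re

def tokenize_with_entities(text, entities):
    """Split text on whitespace, keeping each known entity as one token.

    `entities` is assumed to be a list of strings found by a separate
    NER step; this function does no entity recognition itself.
    """
    if not entities:
        return text.split()
    # Try longer entities first so a long name wins over a shorter one
    # that happens to be its prefix.
    ordered = sorted(entities, key=len, reverse=True)
    alternatives = "|".join(re.escape(e) for e in ordered)
    # An entity only counts if it is not immediately followed by a word
    # character; anything else falls back to a run of non-whitespace.
    pattern = r"(?:" + alternatives + r")(?!\w)|\S+"
    return re.findall(pattern, text)

text = "Barack Obama is the President"
entities = ["Barack Obama"]  # assumed output of the NER step
print(tokenize_with_entities(text, entities))
# ['Barack Obama', 'is', 'the', 'President']
```

Note that punctuation directly after an entity ends up as its own token in this sketch, so a real pipeline would probably want a proper tokenizer instead of the `\S+` fallback.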

Other tips

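Here's a regular expression that produces the tokens asked for in your question. It matches either a single word beginning with a lowercase letter, or one capitalized word optionally followed by a space and a second capitalized word. Because it looks only at capitalization, it will also join adjacent capitalized words that are not names.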

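import re
text = "Barack Obama is the President"
# \w* rather than \w+ so that one-letter words such as "a" or "I" are not skipped
re.findall(r"[a-z]\w*|[A-Z]\w*(?: [A-Z]\w*)?", text)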

outputs

['Barack Obama', 'is', 'the', 'President']
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow