Question

What are some accurate ways to extract person and organization names from text? I'd like to draw affiliation networks based on partnerships, etc., mentioned in the text.

I've tried several methods:
• NLTK POS tagging, which was too slow, so I dropped it.
• Regexes that match consecutive words whose first letter is capitalized. However, this produced a lot of exceptions and spurious captures, many of which were not very relevant (e.g., when someone randomly capitalized 'Social Innovation Award'). It also missed names that were only one word.

Do others have any other ideas?

Example of text

obin Cardozo\r\n\r\nEd Greenspon\r\n\r\nFarouk Jiwa\r\n\r\nDavid Pecaut\r\n\r\nMartha 
Piper\r\n\r\nThe award was presented during the closing dinner of the Social 
Entrepreneurship\r\nSummit held at MaRS Centre for Social Innovation in Toronto. The event 
gathered\r\nover 250 business, academic and social thought leaders from the 
social\r\nentrepreneurship sector in Canada who had convened for a full day of 
inspiration\r\nand engagement on ways to address some of the most pressing issues of our 
times.\r\n\r\nAn often under-recognized community, social entrepreneurs create and lead 
an\r\norganization that are aimed at catalyzing systemic social change through new\r\nideas, 
products, services, methodologies and changes in attitude.\r\n\r\nHosted in partnership by 
MaRS Centre, The Boston Consulting Group (BCG), the\r\nCentre for Social Innovation and the Toronto City Summit Alliance, the Social\r\nEntrepreneurship Summit 

Solution

First clean your data:

>>> text = """obin Cardozo\r\n\r\nEd Greenspon\r\n\r\nFarouk Jiwa\r\n\r\nDavid Pecaut\r\n\r\nMartha Piper\r\n\r\nThe award was presented during the closing dinner of the Social Entrepreneurship\r\nSummit held at MaRS Centre for Social Innovation in Toronto. The event gathered\r\nover 250 business, academic and social thought leaders from the social\r\nentrepreneurship sector in Canada who had convened for a full day of inspiration\r\nand engagement on ways to address some of the most pressing issues of our times.\r\n\r\nAn often under-recognized community, social entrepreneurs create and lead an\r\norganization that are aimed at catalyzing systemic social change through new\r\nideas, products, services, methodologies and changes in attitude.\r\n\r\nHosted in partnership by MaRS Centre, The Boston Consulting Group (BCG), the\r\nCentre for Social Innovation and the Toronto City Summit Alliance, the Social\r\nEntrepreneurship Summit"""
>>> # Split into paragraphs on blank lines; replace in-paragraph line breaks with a space so words are not glued together
>>> text = [i.replace('\r\n', ' ').strip() for i in text.split('\r\n\r\n')]
>>> text
['obin Cardozo', 'Ed Greenspon', 'Farouk Jiwa', 'David Pecaut', 'Martha Piper', 'The award was presented during the closing dinner of the Social Entrepreneurship Summit held at MaRS Centre for Social Innovation in Toronto. The event gathered over 250 business, academic and social thought leaders from the social entrepreneurship sector in Canada who had convened for a full day of inspiration and engagement on ways to address some of the most pressing issues of our times.', 'An often under-recognized community, social entrepreneurs create and lead an organization that are aimed at catalyzing systemic social change through new ideas, products, services, methodologies and changes in attitude.', 'Hosted in partnership by MaRS Centre, The Boston Consulting Group (BCG), the Centre for Social Innovation and the Toronto City Summit Alliance, the Social Entrepreneurship Summit']

Then you will need a full-blown Named Entity Recognizer (NER). Try NLTK's ne_chunk as a starting point, and then move on to a more state-of-the-art NER system:

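# Requires the NLTK tokenizer, POS tagger and NE chunker data (install via nltk.download)
from nltk import sent_tokenize, word_tokenize, pos_tag
from nltk import ne_chunk_sents  # called batch_ne_chunk in NLTK 2.x

# For each paragraph: split into sentences, tokenize and POS-tag each one,
# then chunk all of its sentences in one batch (one tree per sentence).
chunked_text = [list(ne_chunk_sents(pos_tag(word_tokenize(j)) for j in sent_tokenize(i))) for i in text]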
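
The chunker returns one tree per sentence, so the names still have to be pulled out of those trees. Here is a minimal sketch of that step, assuming NLTK 3 style trees (subtrees expose `label()` and `leaves()`, plain tokens are `(word, tag)` tuples). The helper names `extract_entities` and `cooccurrence_edges` are hypothetical, and counting entities that share a sentence is only one rough way to approximate the affiliation links the question asks about:

```python
from collections import Counter
from itertools import combinations


def extract_entities(tree, labels=("PERSON", "ORGANIZATION")):
    """Collect (label, name) pairs from one chunked sentence.

    `tree` is assumed to behave like an nltk.tree.Tree produced by the NE
    chunker: iterating it yields (word, tag) tuples or labelled subtrees.
    """
    entities = []
    for node in tree:
        # Plain tokens are tuples and have no label(); entities are subtrees.
        if hasattr(node, "label") and node.label() in labels:
            name = " ".join(word for word, _tag in node.leaves())
            entities.append((node.label(), name))
    return entities


def cooccurrence_edges(entities_per_sentence):
    """Count how often two entity names appear in the same sentence."""
    edges = Counter()
    for entities in entities_per_sentence:
        # Deduplicate and sort so each pair is stored in one fixed order.
        names = sorted({name for _label, name in entities})
        edges.update(combinations(names, 2))
    return edges
```

To use it, call `extract_entities` on every sentence tree the chunker produced and pass the resulting lists to `cooccurrence_edges`. The edge counts can then be loaded into whatever graph tool you prefer. Expect noise: the NLTK chunker will mislabel some names, so the edges are worth a manual check.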
Licensed under: CC-BY-SA with attribution