Question

I'm trying to filter names out of text blobs. Currently I'm generating a word list and filtering it by hand, but I still have ~8k words to go, so I'm looking for a better way. I could grab a dictionary and filter out the dictionary words, but that would also cull names like smith and cliff.

What I need is either of the following:

  • a list of common names (I'd need the >5k most common names)
  • a list of names that also happen to be words

I figure between them, I can do a combined blacklist/whitelist to get what I need.

Solution

US Census name list: http://www.census.gov/genealogy/www/

That should get you one angle on the problem, anyway.
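As a rough sketch of how a name list could be combined with a dictionary, the snippet below sorts tokens into three buckets: names only, names that are also ordinary words, and everything else. The function names (`load_words`, `classify`) and the tiny inline word lists are made up for illustration; in practice the sets would be loaded from the downloaded name list and whatever dictionary file you have.

```python
def load_words(lines):
    """Normalise an iterable of lines into a set of lowercase words."""
    return {line.strip().lower() for line in lines if line.strip()}


def classify(tokens, names, dictionary):
    """Split tokens into names, ambiguous name/word overlaps, and the rest."""
    result = {"name": [], "ambiguous": [], "other": []}
    for token in tokens:
        key = token.lower()
        if key in names and key in dictionary:
            # e.g. "smith" or "cliff": both a name and a dictionary word
            result["ambiguous"].append(token)
        elif key in names:
            result["name"].append(token)
        else:
            result["other"].append(token)
    return result


# Illustrative stand-ins for the real name list and dictionary (assumptions).
names = load_words(["Smith", "Cliff", "Johnson", "Nguyen"])
dictionary = load_words(["smith", "cliff", "table", "river"])

buckets = classify(["Smith", "table", "Nguyen", "cliff", "xyzzy"], names, dictionary)
print(buckets)
```

The "ambiguous" bucket is the list of names that also happen to be words, so only that much smaller set would still need a look by hand.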

Edited: changed the URL, per the comment below about the page moving. Does nobody believe in HTTP 302 anymore?

OTHER TIPS

From a post I found at Quora:

CMU's NELL project has collected a huge list of proper nouns from the web and categorized them by type. You can browse online at: NELL KnowledgeBase Browser and download the data at: Resources & Data.

Web scraping the results for, say, personUS seems more efficient than what I did, which was to extract a list of names from phrases tagged as "person" in their big tab-delimited CSV file. Either way, you'll be using regex.
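For what it's worth, the extraction approach might look something like the sketch below. The column layout is an assumption, not the real file's schema: it pretends the phrase is in column 0 and the category tag in column 1, so check the actual header and adjust `name_col` and `category_col`. The sample rows are invented purely to show the behaviour.

```python
import csv
import io
import re

# Matches category tags such as "person" or "personus" (assumed tag format).
PERSON_RE = re.compile(r"\bperson\w*", re.IGNORECASE)


def extract_person_names(handle, name_col=0, category_col=1):
    """Collect phrases whose category column looks like a person tag."""
    reader = csv.reader(handle, delimiter="\t")
    found = set()
    for row in reader:
        # Skip rows too short to contain both columns.
        if len(row) <= max(name_col, category_col):
            continue
        if PERSON_RE.search(row[category_col]):
            found.add(row[name_col].strip())
    return sorted(found)


# Invented sample data standing in for the real tab-delimited file.
sample = io.StringIO(
    "Ada Lovelace\tperson\n"
    "Pittsburgh\tcity\n"
    "Grace Hopper\tpersonus\n"
    "truncated row\n"
)
people = extract_person_names(sample)
print(people)
```

For a real file you would pass an open file object instead of the `io.StringIO` sample, e.g. `open(path, newline="", encoding="utf-8")`.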

Licensed under: CC-BY-SA with attribution