List of proper names?
23-09-2019
Problem
I'm trying to filter names out of text blobs. Currently I'm just generating a word list and filtering it by hand, but I've got ~8k words to go, so I'm looking for a better way. I could grab a dictionary and filter out everything in it, but that would also cull names that happen to be ordinary words, like smith and cliff.
What I need is either of the following:
- a list of common names (I'd need the >5k most common names)
- a list of names that also happen to be words
I figure between them, I can do a combined blacklist/whitelist to get what I need.
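The combined blacklist/whitelist idea might look something like the sketch below. It is only an illustration: `tokenize` and `split_candidates` are hypothetical helper names, and the name and dictionary sets are assumed to be lowercase sets you have already loaded from whatever lists you end up using.

```python
import re

def tokenize(text):
    """Split a text blob into alphabetic tokens (apostrophes kept)."""
    return re.findall(r"[A-Za-z']+", text)

def split_candidates(tokens, common_names, dictionary_words):
    """Sort tokens into likely names, ambiguous tokens, and everything else.

    common_names and dictionary_words are assumed to be sets of
    lowercase strings supplied by the caller.
    """
    names, ambiguous, other = set(), set(), set()
    for token in tokens:
        key = token.lower()
        if key in common_names and key in dictionary_words:
            # Both a name and a word (e.g. "smith", "cliff"):
            # leave these for manual review or a context check.
            ambiguous.add(key)
        elif key in common_names:
            names.add(key)
        else:
            other.add(key)
    return names, ambiguous, other
```

Only the `ambiguous` bucket should then need checking by hand, which ought to be much smaller than the full word list.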
Solution
US Census name list: http://www.census.gov/genealogy/www/
That should get you one angle on the problem, anyway.
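A rough sketch of pulling the top N names out of a frequency-ranked list follows. It assumes a plain-text layout with one name per line, the name in the first whitespace-separated column, and rows already sorted by rank; check the actual file you download, since that layout is an assumption here. `top_names` is a hypothetical helper.

```python
def top_names(lines, limit=5000):
    """Return the first `limit` names, lowercased, from a ranked name list.

    Assumes the name is the first whitespace-separated field on each
    line and that the lines are already ordered from most to least common.
    """
    names = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        names.append(fields[0].lower())
        if len(names) >= limit:
            break
    return names
```

With a real file you would pass the open file handle as `lines`, for example `top_names(open(path), limit=5000)`, where `path` is wherever you saved the download.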
Edited: changed the URL, per the comment below about the page moving. Nobody believes in HTTP 302 anymore?
Other tips
From a post I found at Quora:
CMU's NELL project has collected a huge list of proper nouns from the web and categorized them by type. You can browse online at: NELL KnowledgeBase Browser and download the data at: Resources & Data.
Web scraping the results for, say, personUS seems more efficient than what I did, which was extracting a list of names from the phrases tagged as "person" in their big tab-delimited data file. Either way you'll be using regex.
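For the tab-delimited route, a minimal sketch is below. It uses Python's `csv` module for the column split rather than regex, and the column positions are assumptions: the real file's layout is not described here, so `entity_col` and `category_col` are hypothetical parameters you would need to set after inspecting the header.

```python
import csv

def person_entities(handle, entity_col=0, category_col=1, tag="person"):
    """Collect entities whose category column mentions `tag`.

    `handle` is any iterable of tab-delimited lines (an open file works).
    The column indices are guesses and must match the real file.
    """
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    found = set()
    for row in reader:
        if len(row) <= max(entity_col, category_col):
            continue  # skip short or malformed rows
        if tag in row[category_col].lower():
            found.add(row[entity_col].strip())
    return found
```

The substring match is deliberately loose so that categories such as personUS are caught as well; tighten it if it pulls in categories you don't want.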