Question

I want to recognize named entities in a specific field (e.g. baseball). I know there are tools available like StanfordNER, LingPipe, AlchemyAPI and I have done a little testing with them. But what I want them to be is field specific as I mentioned earlier. How this is possible?

Was it helpful?

Solution

One approach may be to

  1. Use a general (non-domain specific) tool to detect people's names

  2. Use a subject classifier to filter out texts that are not in the domain

If the total size of the data set is sufficient and the accuracy of the extractor and classifier good enough, you can use the result to obtain a list of people's names that are closely related to the domain in question (e.g. by restricting the results to those that are mentioned significantly more often in domain-specific texts than in other texts).

In the case of baseball, this should be a fairly good way of getting a list of people related to baseball. It would, however, not be a good way to obtain a list of baseball players only. For the latter it would be necessary to analyse the precise context in which the names are mentioned and the things said about them; but perhaps that is not required.

Edit: By subject classifier I mean the same as what other people might refer to simply as categorization, document classification, domain classification, or similar. Examples of ready-to-use tools include the classifier in Python-NLTK (see here for examples) and the one in LingPipe (see here).

OTHER TIPS

Have a look at smile-ner.appspot.com which covers 250+ categories. In particaul, it covers a lot of persons/teams/clubs on sports. May be useful for your purpose.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top