Question

I was checking out Stanford CoreNLP in order to understand NER and POS tagging. But what if I want to create custom tags for entities, like <title>Nights</title>, <genre>Jazz</genre>, <year>1992</year>? How can I do it? Is CoreNLP useful in this case?

Solution

Out of the box, CoreNLP is restricted to the entity types it documents: PERSON, LOCATION, ORGANIZATION, MISC, DATE, TIME, MONEY, NUMBER. So no, it will not recognize other entity types (title, genre, ...) on its own; you cannot assume it will "intuitively" pick them up :)

In practice, you'll have to choose one of the following:

  1. Find another NER system that already tags those types.
  2. Address this tagging task using knowledge-based / unsupervised approaches.
  3. Search for extra resources (corpora) that contain the types you want to recognize, and re-train a supervised NER system (CoreNLP or other).
  4. Build (and possibly annotate) your own resources - then you'll have to define an annotation scheme, rules, etc. - quite an interesting part of the work!
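
To give an idea of what option 2 can look like, here is a minimal knowledge-based sketch in Python. It is only an illustration: the gazetteer entries, the year pattern and the `tag_tokens` helper are all made up for this example, and a real system would need much larger lists and multi-word matching.

```python
import re

# Hypothetical gazetteers (assumption): a real system would load curated lists.
GAZETTEER = {
    "genre": {"jazz", "blues", "rock"},
    "title": {"nights"},
}

# Assumption: a "year" is a standalone four-digit token starting with 19 or 20.
YEAR_PATTERN = re.compile(r"^(19|20)\d{2}$")


def tag_tokens(tokens):
    """Return (token, label) pairs, using "O" for tokens outside any entity."""
    tagged = []
    for token in tokens:
        label = "O"
        if YEAR_PATTERN.match(token):
            label = "year"
        else:
            # Case-insensitive lookup in each gazetteer; first match wins.
            for entity_type, entries in GAZETTEER.items():
                if token.lower() in entries:
                    label = entity_type
                    break
        tagged.append((token, label))
    return tagged


if __name__ == "__main__":
    for token, label in tag_tokens("Nights is a Jazz album from 1992".split()):
        print(token, label)
```

Output like this can also serve as a rough first pass that you then correct by hand, which is usually faster than annotating from scratch.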

Indeed, unless you find an existing system that fulfills your needs, some effort will be required! Unsupervised approaches may help you bootstrap a system, so that you can see whether you need to find / annotate a dedicated corpus. If you do, it is better to split the data into train/dev/test parts, so that you can assess how well the resulting system performs on unseen data.
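
A train/dev/test split can be as simple as the sketch below. The `split_corpus` name, the 80/10/10 proportions and the fixed seed are assumptions chosen for illustration, not a recommendation from any particular toolkit.

```python
import random


def split_corpus(sentences, dev_fraction=0.1, test_fraction=0.1, seed=13):
    """Shuffle annotated sentences and return (train, dev, test) lists."""
    shuffled = list(sentences)
    # A fixed seed keeps the split reproducible across runs.
    random.Random(seed).shuffle(shuffled)
    n_dev = int(len(shuffled) * dev_fraction)
    n_test = int(len(shuffled) * test_fraction)
    dev = shuffled[:n_dev]
    test = shuffled[n_dev:n_dev + n_test]
    train = shuffled[n_dev + n_test:]
    return train, dev, test


if __name__ == "__main__":
    corpus = [f"sentence {i}" for i in range(100)]
    train, dev, test = split_corpus(corpus)
    print(len(train), len(dev), len(test))
```

Tune on the dev part only and keep the test part untouched until the end, otherwise the final score will be optimistic.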

OTHER TIPS

Have a look at this FAQ (http://nlp.stanford.edu/software/crf-faq.shtml): it explains how to train the Stanford CRF classifier on your own data, which lets you build a model for new classes such as title, genre or year.
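
As far as I remember, the training file described there is plain text with one token per line, a tab, then its label, with "O" as the background label. Please check the FAQ for the exact format and the properties file it expects. Under that assumption, a small Python sketch for serializing your annotations could look like this (the `to_crf_tsv` helper, the sample data and the blank line between sentences are my own choices, not part of Stanford NER):

```python
def to_crf_tsv(sentences):
    """Serialize [[(token, label), ...], ...] as token<TAB>label lines.

    Assumption: sentences are separated by one blank line.
    """
    blocks = []
    for sentence in sentences:
        lines = [f"{token}\t{label}" for token, label in sentence]
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks) + "\n"


if __name__ == "__main__":
    # Hypothetical annotated sample using the custom types from the question.
    sample = [
        [("Nights", "title"), ("is", "O"), ("a", "O"), ("Jazz", "genre"), ("album", "O")],
        [("Released", "O"), ("in", "O"), ("1992", "year")],
    ]
    print(to_crf_tsv(sample), end="")
```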

Licensed under: CC-BY-SA with attribution