Question

I have been reading about LingPipe for NLP and found that it can identify mentions of people, locations and organizations. My question is: if I have a training set of documents that contain mentions of, say, software projects, can I use it to train a named entity recognizer? Once training is complete, I should be able to feed a test set of documents to the trained model and identify the mentions of software projects in them.

Is this kind of generic NER possible? If so, which features should I feed to the model?

Thanks, Abhishek S

Solution

Provided that you have enough training data in which the software projects are tagged, this is possible.

If you use LingPipe, I would try a character n-gram model first. These models are simple and usually do the job. If the results are not good enough, some of the standard NER features are:

  • tokens
  • part of speech (POS)
  • capitalization
  • punctuation
  • character signatures, for example: (LUCENE -> AAAAAA -> A), (Lucene -> Aaaaaa -> Aa), (Lucene-core -> Aaaaaa-aaaa -> Aa-a)
  • a gazetteer (a list of known software projects), if you can compile one from Wikipedia, SourceForge or an internal resource
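
The character signatures in the list above can be computed with a few lines of code. This is only a minimal sketch, independent of LingPipe: the function names `word_shape` and `compressed_shape` are made up for illustration, and mapping digits to `9` is an assumption that the answer does not mention.

```python
import re


def word_shape(token: str) -> str:
    """Replace each character by its class, keeping other characters as-is.

    Uppercase -> 'A', lowercase -> 'a', digit -> '9' (the digit class is an
    assumption; the answer only shows letters and a hyphen).
    """
    classes = []
    for ch in token:
        if ch.isupper():
            classes.append("A")
        elif ch.islower():
            classes.append("a")
        elif ch.isdigit():
            classes.append("9")
        else:
            classes.append(ch)
    return "".join(classes)


def compressed_shape(token: str) -> str:
    """Collapse every run of identical class symbols into a single symbol."""
    return re.sub(r"(.)\1+", r"\1", word_shape(token))


for name in ("LUCENE", "Lucene", "Lucene-core"):
    print(name, "->", word_shape(name), "->", compressed_shape(name))
```

Both the full and the compressed signature can be used as features; the compressed form generalizes better across names of different lengths.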

Finally, for each token you can add contextual features: the tokens before the current one (t-1, t-2, ...), the tokens after it (t+1, t+2, ...), and their bigram combinations (t-2^t-1), (t+1^t+2).
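
As a rough illustration of these contextual features, the sketch below builds a feature dictionary for one token position. It is not tied to LingPipe; the function name `context_features`, the feature keys and the `<PAD>` placeholder for positions outside the sentence are all assumptions chosen for this example.

```python
PAD = "<PAD>"  # hypothetical placeholder for positions outside the sentence


def context_features(tokens, i):
    """Return context features for tokens[i] using a window of two tokens."""

    def tok(offset):
        j = i + offset
        return tokens[j] if 0 <= j < len(tokens) else PAD

    return {
        "t0": tok(0),
        "t-1": tok(-1),
        "t-2": tok(-2),
        "t+1": tok(1),
        "t+2": tok(2),
        # Bigram combinations of the neighbouring tokens.
        "t-2^t-1": tok(-2) + "^" + tok(-1),
        "t+1^t+2": tok(1) + "^" + tok(2),
    }


sentence = "We migrated to Lucene last year".split()
print(context_features(sentence, 3))
```

A dictionary like this, computed for every token, can then be passed to whichever sequence classifier you train.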

Other answers

Of course you can. Just get training data that covers all the categories you need and follow the tutorial at http://alias-i.com/lingpipe/demos/tutorial/ne/read-me.html. No feature tuning is required, since LingPipe uses only hardcoded features (token shapes, word sequences and n-grams).

Licensed under: CC-BY-SA with attribution