Question

I'm developing a semi-automatic annotation tool for medical texts, and I am completely lost on how to find the RDF triples to annotate.

I am currently trying an NLP-based approach. I have already looked into Stanford NER and OpenNLP, and neither has a model for extracting disease names.

My questions are:

* How can I create a new NER model for extracting disease names, and can I get any help from OpenNLP or Stanford NER?
* Is there another approach altogether, other than NLP, for extracting RDF triples from text?

Any help would be appreciated! Thanks.

Solution

I have done something similar to what you need with both OpenNLP and LingPipe. I found LingPipe's exact dictionary-based chunking good enough for my use case and used that. Documentation is available here: http://alias-i.com/lingpipe/demos/tutorial/ne/read-me.html
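
To give a feel for what exact dictionary-based chunking does, here is a minimal sketch in plain Python. It is not LingPipe's API, only an illustration of the idea: look up token sequences in a gazetteer and prefer the longest match. The `DISEASES` entries and the function names are made up for this example; a real gazetteer would be loaded from a medical vocabulary.

```python
import re

# Hypothetical mini-gazetteer (assumption); each entry is a tuple of lowercase tokens.
DISEASES = {
    ("diabetes",),
    ("diabetes", "mellitus"),
    ("chronic", "kidney", "disease"),
}
MAX_LEN = max(len(entry) for entry in DISEASES)


def tokenize(text):
    """Split text into (lowercased token, start offset, end offset) triples."""
    return [(m.group().lower(), m.start(), m.end()) for m in re.finditer(r"\w+", text)]


def find_diseases(text):
    """Return (start, end, surface) for each longest exact dictionary match."""
    tokens = tokenize(text)
    matches = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first so "diabetes mellitus" beats "diabetes".
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            candidate = tuple(tok for tok, _, _ in tokens[i:i + n])
            if candidate in DISEASES:
                start, end = tokens[i][1], tokens[i + n - 1][2]
                matches.append((start, end, text[start:end]))
                i += n  # skip past the matched tokens
                break
        else:
            i += 1  # no entry starts at this token
    return matches


print(find_diseases("Patient has diabetes mellitus and chronic kidney disease."))
```

Matching is case-insensitive here because tokens are lowercased before lookup; whether that is appropriate depends on your vocabulary.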

You can find a small demo here:

If a gazetteer/dictionary approach isn't good enough for you, you can try creating your own model; OpenNLP has an API for training models as well. Documentation is here: http://opennlp.apache.org/documentation/1.5.2-incubating/manual/opennlp.html#tools.namefind.training
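
Training your own model mostly comes down to producing annotated sentences. As far as I understand the manual linked above, the OpenNLP name finder expects one whitespace-tokenized sentence per line, with each entity wrapped in `<START:type> ... <END>`. The sketch below only shows how such lines could be generated from token spans; the helper name `to_opennlp_line` and the `disease` label are my own choices, not part of OpenNLP.

```python
def to_opennlp_line(tokens, spans, label="disease"):
    """Render one tokenized sentence as a name finder training line.

    tokens: list of token strings
    spans:  list of (first_token_index, end_token_index_exclusive) pairs
    """
    starts = {start for start, _ in spans}
    ends = {end for _, end in spans}
    out = []
    for i, tok in enumerate(tokens):
        if i in starts:
            out.append(f"<START:{label}>")  # opening tag before the first entity token
        out.append(tok)
        if i + 1 in ends:
            out.append("<END>")  # closing tag after the last entity token
    return " ".join(out)


tokens = ["The", "patient", "has", "diabetes", "mellitus", "."]
print(to_opennlp_line(tokens, [(3, 5)]))
```

One way to bootstrap such data is to use the dictionary matches as the spans and then correct them by hand, since the model can only be as good as the annotations it is trained on.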

Extracting RDF triples from natural language is a different problem from identifying named entities. NER is a related and perhaps necessary step, but it is not enough. To extract an RDF statement from natural language, you need to identify the entities that act as the subject and the object of the statement, identify the verb or relationship that links them, and then map all three to URIs.
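
To make those steps concrete, here is a deliberately naive sketch: it takes entities that an NER step has already found, looks for an "entity verb entity" pattern, and maps each part to a URI. All URIs, the two lookup tables, and the function names are hypothetical and exist only for this example; real relation extraction needs much more than substring matching.

```python
# Hypothetical URI mappings (assumption); real ones would come from an ontology.
ENTITY_URIS = {
    "metformin": "http://example.org/drug/metformin",
    "diabetes mellitus": "http://example.org/disease/diabetes_mellitus",
}
RELATION_URIS = {
    "treats": "http://example.org/relation/treats",
    "causes": "http://example.org/relation/causes",
}


def extract_triples(sentence, entities):
    """Find '<subject> <verb> <object>' patterns among already-recognized entities."""
    text = sentence.lower()
    known = [e for e in entities if e in ENTITY_URIS]  # drop entities with no URI
    triples = []
    for subj in known:
        for obj in known:
            if subj == obj:
                continue
            for verb, pred_uri in RELATION_URIS.items():
                if f"{subj} {verb} {obj}" in text:
                    triples.append((ENTITY_URIS[subj], pred_uri, ENTITY_URIS[obj]))
    return triples


def to_ntriples(triple):
    """Serialize a (subject, predicate, object) URI triple as one N-Triples line."""
    return "<{}> <{}> <{}> .".format(*triple)


for t in extract_triples("Metformin treats diabetes mellitus.", ["metformin", "diabetes mellitus"]):
    print(to_ntriples(t))
```

The NER step only supplies the `entities` argument; the relation detection and the URI mapping are the extra work described above.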

Licensed under: CC-BY-SA with attribution