Question

I have a strange problem.

I have a list of sentences (around 0.1 million) which I want to tag using the Stanford Named Entity Recognition (NER) tagger. I was tagging them with the following code, taken from the Stanford NER demo website (Java demo code).

for (String str : sentences) {
    System.out.print(classifier.classifyToString(str, "slashTags", false));
}
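For completeness, here is a minimal self-contained version of what I am running; the model path is only a placeholder for whichever serialized classifier gets loaded, and the two sample sentences stand in for my real list:

import edu.stanford.nlp.ie.AbstractSequenceClassifier;
import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.ling.CoreLabel;

import java.util.Arrays;
import java.util.List;

public class NerTaggingExample {
    public static void main(String[] args) throws Exception {
        // Placeholder path to the serialized 3-class English model
        String serializedClassifier = "classifiers/english.all.3class.distsim.crf.ser.gz";
        AbstractSequenceClassifier<CoreLabel> classifier =
                CRFClassifier.getClassifier(serializedClassifier);

        // In my real code this list holds ~0.1 million sentences; this is just a sample
        List<String> sentences = Arrays.asList(
                "IBM Corporation Introduction",
                "Barack Obama was born in Hawaii.");

        for (String str : sentences) {
            System.out.print(classifier.classifyToString(str, "slashTags", false));
        }
    }
}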

I thought everything was going fine until I manually checked and found that some sentences which should have been tagged were not tagged at all. But when I hand-picked those untagged sentences into a small sample list and ran them through the same code, they did get tagged. So I am confused about where I am going wrong. Roughly 1,000 - 1,500 sentences come back untagged when I run the full dataset, yet the same sentences are tagged correctly when I run them in a separate list. Does the size of the dataset (0.1 million sentences) have any impact on the classifier?

For example, consider the sentence "IBM Corporation Introduction". Sentences like this appear in considerable numbers in my 0.1 million dataset. When I tag the full dataset with the code above, many such sentences get no tags at all; but when I hand-pick them, put them in a list, and tag that list, they are tagged.
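To find these cases I collect every sentence whose tokens all come back with the background label "O". This is only a rough sketch of my check; the model path is a placeholder and loadSentences() is a stand-in for however the 0.1 million sentences are actually read in:

import edu.stanford.nlp.ie.AbstractSequenceClassifier;
import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;

import java.util.ArrayList;
import java.util.List;

public class FindUntaggedSentences {
    public static void main(String[] args) throws Exception {
        // Placeholder model path
        AbstractSequenceClassifier<CoreLabel> classifier =
                CRFClassifier.getClassifier("classifiers/english.all.3class.distsim.crf.ser.gz");

        List<String> sentences = loadSentences(); // stand-in for reading my 0.1 million sentences
        List<String> untagged = new ArrayList<>();

        for (String str : sentences) {
            boolean hasEntity = false;
            // classify() returns one List<CoreLabel> per sentence it finds in the input string
            for (List<CoreLabel> sentence : classifier.classify(str)) {
                for (CoreLabel token : sentence) {
                    String label = token.get(CoreAnnotations.AnswerAnnotation.class);
                    if (label != null && !"O".equals(label)) {
                        hasEntity = true;
                    }
                }
            }
            if (!hasEntity) {
                untagged.add(str); // sentences that got no entity tags at all
            }
        }
        System.out.println("Sentences with no entity tags: " + untagged.size());
    }

    private static List<String> loadSentences() {
        // hypothetical loader; replace with the real input source
        return new ArrayList<>();
    }
}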

I have tried several approaches, and I always end up with the same result: no tagging for sentences like the one above when I tag the entire dataset.

I tried the following three different ways (compared in the sketch after this list):

1. classifier.classifyToString(inputString, "slashTags", false)
2. classifier.classify(inputString)
3. classifier.classifyToCharacterOffsets(inputString)
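Here is roughly how I compare the three calls on one of the problem sentences (again, the model path is just a placeholder):

import edu.stanford.nlp.ie.AbstractSequenceClassifier;
import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.util.Triple;

import java.util.List;

public class CompareNerCalls {
    public static void main(String[] args) throws Exception {
        // Placeholder model path
        AbstractSequenceClassifier<CoreLabel> classifier =
                CRFClassifier.getClassifier("classifiers/english.all.3class.distsim.crf.ser.gz");

        String inputString = "IBM Corporation Introduction";

        // 1. slash-tagged string, e.g. "IBM/ORGANIZATION Corporation/ORGANIZATION Introduction/O"
        System.out.println(classifier.classifyToString(inputString, "slashTags", false));

        // 2. token-level labels, one List<CoreLabel> per detected sentence
        for (List<CoreLabel> sentence : classifier.classify(inputString)) {
            System.out.println(sentence);
        }

        // 3. character offsets of each recognized entity span
        for (Triple<String, Integer, Integer> span : classifier.classifyToCharacterOffsets(inputString)) {
            System.out.println(span.first() + " [" + span.second() + ", " + span.third() + ")");
        }
    }
}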

Any ideas or suggestions as to where I am going wrong?

Thanks


Solution
