Training n-gram NER with Stanford NLP

Question 1

It had been a long wait here for an answer. I have not been able to figure out the way to get it done using Stanford Core. However mission accomplished. I have used the LingPipe NLP libraries for the same. Just quoting the answer here because, I think someone else could benefit from it.

Please check out the Lingpipe licencing before diving in for an implementation in case you are a developer or researcher or what ever.

Lingpipe provides various NER methods.

1) Dictionary Based NER

2) Statistical NER (HMM Based)

3) Rule Based NER etc.

I have used the Dictionary as well as the statistical approaches.

First one is a direct look up methodology and the second one being a training based.

An example for the dictionary based NER can be found here

The statstical approach requires a training file. I have used the file with the following format -

<root>
<s> data line with the <ENAMEX TYPE="myentity">entity1</ENAMEX>  to be trained</s>
...
<s> with the <ENAMEX TYPE="myentity">entity2</ENAMEX>  annotated </s>
</root>

I then used the following code to train the entities.

import java.io.File;
import java.io.IOException;

import com.aliasi.chunk.CharLmHmmChunker;
import com.aliasi.corpus.parsers.Muc6ChunkParser;
import com.aliasi.hmm.HmmCharLmEstimator;
import com.aliasi.tokenizer.IndoEuropeanTokenizerFactory;
import com.aliasi.tokenizer.TokenizerFactory;
import com.aliasi.util.AbstractExternalizable;

@SuppressWarnings("deprecation")
public class TrainEntities {

    static final int MAX_N_GRAM = 50;
    static final int NUM_CHARS = 300;
    static final double LM_INTERPOLATION = MAX_N_GRAM; // default behavior

    public static void main(String[] args) throws IOException {
        File corpusFile = new File("inputfile.txt");// my annotated file
        File modelFile = new File("outputmodelfile.model"); 

        System.out.println("Setting up Chunker Estimator");
        TokenizerFactory factory
            = IndoEuropeanTokenizerFactory.INSTANCE;
        HmmCharLmEstimator hmmEstimator
            = new HmmCharLmEstimator(MAX_N_GRAM,NUM_CHARS,LM_INTERPOLATION);
        CharLmHmmChunker chunkerEstimator
            = new CharLmHmmChunker(factory,hmmEstimator);

        System.out.println("Setting up Data Parser");
        Muc6ChunkParser parser = new Muc6ChunkParser();  
        parser.setHandler( chunkerEstimator);

        System.out.println("Training with Data from File=" + corpusFile);
        parser.parse(corpusFile);

        System.out.println("Compiling and Writing Model to File=" + modelFile);
        AbstractExternalizable.compileTo(chunkerEstimator,modelFile);
    }

}

And to test the NER I used the following class

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.Set;

import com.aliasi.chunk.Chunk;
import com.aliasi.chunk.Chunker;
import com.aliasi.chunk.Chunking;
import com.aliasi.util.AbstractExternalizable;

public class Recognition {
    public static void main(String[] args) throws Exception {
        File modelFile = new File("outputmodelfile.model");
        Chunker chunker = (Chunker) AbstractExternalizable
                .readObject(modelFile);
        String testString="my test string";
            Chunking chunking = chunker.chunk(testString);
            Set<Chunk> test = chunking.chunkSet();
            for (Chunk c : test) {
                System.out.println(testString + " : "
                        + testString.substring(c.start(), c.end()) + " >> "
                        + c.type());

        }
    }
}

Code Courtesy : Google :)

Question 2

The answer is basically given in your quoted example, where "Emma Woodhouse" is a single name. The default models we supply use IO encoding, and assume that adjacent tokens of the same class are part of the same entity. In many circumstances, this is almost always true, and keeps the models simpler. However, if you don't want to do that you can train NER models with other label encodings, such as the commonly used IOB encoding, where you would instead label things:

Emma    B-PERSON
Woodhouse    I-PERSON

Then, adjacent tokens of the same category but not the same entity can be represented.

Question 3

I faced the same challenge of tagging ngram phrases for automative domain.I was looking for an efficient keyword mapping that can be used to create training files at a later stage. I ended up using regexNER in the NLP pipeline, by providing a mapping file with the regular expressions (ngram component terms) and their corresponding label. Note that there is no NER machine learning achieved in this case. Hope this information helps someone!