Stanford NER prop file meaning of DistSim

Question

"DistSim" refers to using features based on word classes/clusters, built using distributional similarity clustering methods (e.g., Brown clustering, exchange clustering). Word classes group words which are similar, semantically and/or syntactically, and allow an NER system to generalize better, including handling words not in the training data of the NER system better. Many of our distributed models use a distributional similarity clustering features as well as word identity features, and gain significantly from doing so. In Stanford NER, there are a whole bunch of flags/properties that affect how distributional similarity is interpreted/used: useDistSim, distSimLexicon, distSimFileFormat, distSimMaxBits, casedDistSim, numberEquivalenceDistSim, unknownWordDistSimClass, and you need to look at the code in NERFeatureFactory.java to decode the details, but in the simple case, you just need the first two, and they need to be used while training the model, as well as at test time. The default format of the lexicon is just a text file with a series of lines with two tab separated columns of word clusterName. The cluster names are arbitrary.