Question

I used Weka Explorer:

  • Loaded the ARFF file
  • Applied StringToWordVector filter
  • Selected IBk as the best classifier
  • Generated/Saved my_model.model binary

In my java code I deserialize the model:

    URL curl = ClassUtility.findClasspathResource( "models/my_model.model" );
    final Classifier cls = (Classifier) weka.core.SerializationHelper.read( curl.openConnection().getInputStream() );

Now I have the classifier, BUT I also somehow need the information about the filter. What I am stuck on is: how do I prepare an instance to be classified by my deserialized model, i.e. how do I apply the filter before classification? (The raw instance that I have to classify has a text field with tokens in it. The filter was supposed to transform that into a set of new attributes.)

I even tried to use a FilteredClassifier, where I set the classifier to the deserialized one and the filter to a manually created instance of StringToWordVector:

    final StringToWordVector filter = new StringToWordVector();
    filter.setOptions(new String[]{"-C", "-P x_", "-L"});
    FilteredClassifier fcls = new FilteredClassifier();
    fcls.setFilter(filter);
    fcls.setClassifier(cls);

The above does not work either. It throws the exception:

Exception in thread "main" java.lang.NullPointerException: No output instance format defined

What I am trying to avoid is doing the training in the Java code. It can be very slow, I may end up with multiple classifiers to train (different algorithms as well), and I want my app to start fast.


Solution

Your problem is that your model doesn't know anything about what the filter did to the data. The StringToWordVector filter transforms the data, and that transformation depends on the input (training) data. A model trained on the transformed data set will only work on data that underwent the exact same transformation. To guarantee this, the filter needs to be part of your model.

Using a FilteredClassifier is the correct idea, but you have to use it from the beginning:

  • Load the ARFF file
  • Select FilteredClassifier as classifier
  • Select StringToWordVector as filter for it
  • Select IBk as classifier for the FilteredClassifier
  • Generate/Save the model to my_model.model

The trained and serialized model will then also contain the initialized filter, including the information on how to transform data.
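
Once the model is trained and saved this way, the deserialized object is the whole FilteredClassifier, which applies the filter to incoming instances itself. For illustration, here is a minimal sketch of classifying a raw string instance with such a model; the attribute names ("text", "class") and the class values are placeholders that must match your original, unfiltered ARFF header:

    import java.util.ArrayList;

    import weka.classifiers.Classifier;
    import weka.core.Attribute;
    import weka.core.DenseInstance;
    import weka.core.Instances;
    import weka.core.SerializationHelper;
    import weka.core.Utils;

    public class PredictWithSavedFilteredClassifier {

        public static void main(String[] args) throws Exception {
            // the saved model is the whole FilteredClassifier (StringToWordVector + IBk)
            Classifier cls = (Classifier) SerializationHelper.read("models/my_model.model");

            // rebuild a header that matches the *raw* (unfiltered) training structure;
            // the attribute names and class values below are placeholders
            ArrayList<Attribute> attrs = new ArrayList<Attribute>();
            attrs.add(new Attribute("text", (ArrayList<String>) null)); // string attribute
            ArrayList<String> classValues = new ArrayList<String>();
            classValues.add("spam");
            classValues.add("ham");
            attrs.add(new Attribute("class", classValues));             // nominal class
            Instances header = new Instances("runtime", attrs, 1);
            header.setClassIndex(header.numAttributes() - 1);

            // one unlabeled instance containing the raw text
            double[] values = new double[header.numAttributes()];
            values[0] = header.attribute(0).addStringValue("the raw text with tokens to classify");
            values[1] = Utils.missingValue();                           // class is unknown
            header.add(new DenseInstance(1.0, values));

            // the FilteredClassifier applies its internal StringToWordVector before IBk runs
            double pred = cls.classifyInstance(header.firstInstance());
            System.out.println("predicted class: " + header.classAttribute().value((int) pred));
        }
    }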

Other tips

Another way to do this is to apply the same filter to your testing data that was used on the training data. I describe the procedure step by step; in your case you only need to follow the steps after loading your serialized classifier.

  • Create your training file (e.g. training.arff)
  • Create Instances from the training file: Instances trainingData = ..
  • Use StringToWordVector to transform your string attributes to a numeric representation:

sample code:

    // useIdf, minTermFreq, maxGrams, minGrams and useStemmer are configuration values defined elsewhere
    StringToWordVector filter = new StringToWordVector();
    filter.setWordsToKeep(1000000);
    if(useIdf){
        filter.setIDFTransform(true);
    }
    filter.setTFTransform(true);
    filter.setLowerCaseTokens(true);
    filter.setOutputWordCounts(true);
    filter.setMinTermFreq(minTermFreq);
    filter.setNormalizeDocLength(new SelectedTag(StringToWordVector.FILTER_NORMALIZE_ALL,StringToWordVector.TAGS_FILTER));
    NGramTokenizer t = new NGramTokenizer();
    t.setNGramMaxSize(maxGrams);
    t.setNGramMinSize(minGrams);    
    filter.setTokenizer(t);  
    WordsFromFile stopwords = new WordsFromFile();
    stopwords.setStopwords(new File("data/stopwords/stopwords.txt"));
    filter.setStopwordsHandler(stopwords);
    if (useStemmer){
        Stemmer s = new /*Iterated*/LovinsStemmer();
        filter.setStemmer(s);
    }
    filter.setInputFormat(trainingData);
  • Apply the filter to trainingData: trainingData = Filter.useFilter(trainingData, filter);

  • Select a classifier to create your model

sample code for LibLinear classifier

        Classifier cls = null;
        LibLINEAR liblinear = new LibLINEAR();
        liblinear.setSVMType(new SelectedTag(0, LibLINEAR.TAGS_SVMTYPE));
        liblinear.setProbabilityEstimates(true);
        // liblinear.setBias(1); // default value
        cls = liblinear;
        cls.buildClassifier(trainingData);
  • Save model

sample code

    System.out.println("Saving the model...");
    ObjectOutputStream oos;
    oos = new ObjectOutputStream(new FileOutputStream(path+"mymodel.model"));
    oos.writeObject(cls);
    oos.flush();
    oos.close();
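
Since the question mentions wanting to start the app fast without retraining, a possible variation (just a sketch; the file name is made up, and it assumes your Weka version offers SerializationHelper.writeAll/readAll) is to serialize the already-initialized filter together with the classifier, so that no training data is needed at prediction time:

    // store the trained classifier and the initialized filter in one file
    weka.core.SerializationHelper.writeAll(path + "mymodel_with_filter.model",
            new Object[]{cls, filter});

    // ... later, at prediction time, no training data is required:
    Object[] parts = weka.core.SerializationHelper.readAll(path + "mymodel_with_filter.model");
    Classifier loadedCls = (Classifier) parts[0];
    StringToWordVector loadedFilter = (StringToWordVector) parts[1];
    // loadedFilter already knows the training dictionary, so raw testing data
    // can be transformed with Filter.useFilter(testingData, loadedFilter)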
  • Create a testing file (e.g. testing.arff)

  • Create Instances from the testing file: Instances testingData = ...

  • Load classifier

sample code

    Classifier myCls = (Classifier) weka.core.SerializationHelper.read(path + "mymodel.model");
  • Use the same StringToWordVector filter as above, or create a new one for testingData, but remember to use the trainingData for this call: filter.setInputFormat(trainingData); This keeps the format of the training set and will not add words that are not in the training set.

  • Apply the filter to testingData: testingData = Filter.useFilter(testingData, filter);

  • Classify!

sample code

    for (int j = 0; j < testingData.numInstances(); j++) {
        double res = myCls.classifyInstance(testingData.get(j));
    }
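
If you need the class label or the per-class probabilities rather than the numeric index returned by classifyInstance, here is a small follow-up sketch (this assumes a nominal class attribute and that the class index has been set on testingData):

    for (int j = 0; j < testingData.numInstances(); j++) {
        double res = myCls.classifyInstance(testingData.get(j));
        String label = testingData.classAttribute().value((int) res);      // predicted label
        double[] dist = myCls.distributionForInstance(testingData.get(j)); // per-class probabilities
        System.out.println(label + " " + java.util.Arrays.toString(dist));
    }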