Question

I have about 44 million training examples across about 6,200 categories. After training, the model comes out to about 450 MB.

And while testing with 5 parallel mappers (each given enough RAM), classification proceeds at a rate of ~4 items per second, which is WAY too slow.

How can I speed things up? One way I can think of is to reduce the word corpus, but I fear losing accuracy. I had maxDFPercent set to 80.
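
For reference, the kind of document-frequency pruning that maxDFPercent does could be sketched in scikit-learn like this (a toy illustration with made-up thresholds, not my actual pipeline):

    # Toy illustration of document-frequency pruning, analogous to Mahout's
    # maxDFPercent; the corpus and thresholds are made up for this example.
    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["the cat sat", "the dog ran", "the cat ran"]

    # max_df=0.8 drops terms in more than 80% of documents (like maxDFPercent=80);
    # min_df=2 drops low-support terms appearing in fewer than 2 documents.
    vec = CountVectorizer(max_df=0.8, min_df=2)
    X = vec.fit_transform(docs)
    print(vec.get_feature_names_out())  # ['cat' 'ran']; 'the' pruned as too frequent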

Another way I thought of was to run the items through a clustering algorithm and empirically maximize the number of clusters while keeping the items within each category restricted to a single cluster. This would allow me to build separate models for each cluster and thereby (possibly) decrease training and testing time.
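
Roughly, I imagine the cluster-then-classify routing working like this (a minimal scikit-learn sketch; everything here is hypothetical, not my actual setup):

    # Minimal sketch of cluster-then-classify: route each item to a cluster,
    # then score it against that cluster's (smaller) model.
    import numpy as np
    from sklearn.cluster import MiniBatchKMeans
    from sklearn.naive_bayes import MultinomialNB

    def train_per_cluster(X, y, n_clusters=4):
        """Fit a k-means router plus one small NB model per cluster."""
        y = np.asarray(y)
        km = MiniBatchKMeans(n_clusters=n_clusters, n_init=3, random_state=0)
        cluster_ids = km.fit_predict(X)
        models = {}
        for c in range(n_clusters):
            mask = cluster_ids == c
            if mask.any():  # guard against empty clusters
                models[c] = MultinomialNB().fit(X[mask], y[mask])
        return km, models

    def classify(km, models, X):
        """Route each item to its cluster, then use that cluster's model."""
        cluster_ids = km.predict(X)
        out = np.empty(X.shape[0], dtype=object)
        for c, model in models.items():
            mask = cluster_ids == c
            if mask.any():
                out[mask] = model.predict(X[mask])
        return out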

Any other thoughts?

Edit:

After some of the answers given below, I started contemplating some form of down-sampling: running a clustering algorithm, identifying groups of items that are "highly" close to one another, and then taking a union of a few samples from those tightly-knit groups together with samples from the items that are not so close to one another.

I also started thinking about using some form of data-normalization technique that incorporates edit distances over n-grams (http://lucene.apache.org/core/4_1_0/suggest/org/apache/lucene/search/spell/NGramDistance.html).
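
Something along these lines (a rough Python sketch of n-gram overlap; note that Lucene's NGramDistance actually computes an n-gram-based edit distance, so this only approximates the idea):

    # Rough sketch of n-gram string similarity. Lucene's NGramDistance is an
    # n-gram-weighted edit distance; this Jaccard overlap only approximates it.
    def char_ngrams(s, n=2):
        s = f" {s} "  # pad so word boundaries contribute n-grams
        return {s[i:i + n] for i in range(len(s) - n + 1)}

    def ngram_similarity(a, b, n=2):
        ga, gb = char_ngrams(a, n), char_ngrams(b, n)
        if not ga or not gb:
            return 0.0
        return len(ga & gb) / len(ga | gb)

    print(ngram_similarity("color", "colour"))  # ~0.62 for near-duplicates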

I'm also considering using the Hadoop Streaming API to leverage some of the ML libraries available in Python, listed here http://pydata.org/downloads/ and here http://scikit-learn.org/stable/modules/svm.html#svm (these, I think, use the liblinear mentioned in one of the answers below).
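
A streaming mapper could be as simple as this sketch (the pickle file names are hypothetical; the model would be trained offline and shipped to the workers, e.g. with -files):

    #!/usr/bin/env python
    # Hypothetical Hadoop Streaming mapper: load a pre-trained scikit-learn
    # vectorizer and model, then classify one "id<TAB>text" record per line.
    import pickle
    import sys

    with open("vectorizer.pkl", "rb") as f:  # assumed artifact names
        vectorizer = pickle.load(f)
    with open("model.pkl", "rb") as f:
        model = pickle.load(f)

    for line in sys.stdin:
        doc_id, _, text = line.rstrip("\n").partition("\t")
        features = vectorizer.transform([text])
        print(f"{doc_id}\t{model.predict(features)[0]}")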

Solution 3

"but i fear losing accuracy" Have you actually tried using less features or less documents? You may not lose as much accuracy as you fear. There may be a few things at play here:

  • Such a large number of documents is unlikely to come from the same time period. Over time, the content of a stream will inevitably drift, and words indicative of one class may become indicative of another. In a way, adding data from this year to a classifier trained on last year's data just confuses it. You may get much better performance if you train on less data.
  • The majority of features are not helpful, as @Anony-Mousse already said. You might want to perform some form of feature selection before you train your classifier; this will also speed up training. I've had good results in the past with mutual information (see the sketch after this list).
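
A minimal sketch of the kind of selection I mean, using scikit-learn (a toy corpus; on real data you would pick k empirically, e.g. around the 200k I mention below):

    # Toy sketch of feature selection before training: keep only the k terms
    # carrying the most information about the label. chi2 is a cheaper
    # alternative that behaves similarly to mutual information on count data.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_selection import SelectKBest, mutual_info_classif

    docs = ["buy cheap pills now", "cheap pills online",
            "meeting at noon", "lunch at noon today"]
    labels = [1, 1, 0, 0]  # 1 = spam, 0 = ham

    X = CountVectorizer().fit_transform(docs)
    selector = SelectKBest(mutual_info_classif, k=4)  # keep 4 best terms
    X_small = selector.fit_transform(X, labels)
    print(X_small.shape)  # (4, 4): same documents, far fewer features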

I've previously trained classifiers on a data set of similar scale and found that the system worked best with only 200k features, and that using more than 10% of the data for training did not improve accuracy at all.

PS: Could you tell us a bit more about your problem and data set?


Edit after question was updated: Clustering is a good way of selecting representative documents, but it will take a long time. You will also have to re-run it periodically as new data come in.

I don't think edit distance is the way to go. Typical algorithms are quadratic in the length of the input strings, and you might have to run them for each pair of words in the corpus. That's a long time!

I would again suggest that you give random sampling a shot. You say you are concerned about accuracy, but you are using Naive Bayes. If you wanted the best model money can buy, you would go for a non-linear SVM, and you probably wouldn't live to see it finish training. People resort to classifiers with known issues (there is a reason Naive Bayes is called naive) because they are much faster than the alternatives, while performance is often just a tiny bit worse. Let me give you an example from my experience:

  • RBF SVM: 85% F1 score, training time ~1 month
  • Linear SVM: 83% F1 score, training time ~1 day
  • Naive Bayes: 82% F1 score, training time ~1 day

You find the same thing in the literature: paper. Out of curiosity, what kind of accuracy are you getting?
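
If you want to reproduce that kind of comparison yourself, here is a small sketch using scikit-learn and the 20 newsgroups corpus as a stand-in (your numbers will differ):

    # Sketch of the speed/accuracy trade-off: Naive Bayes vs. a liblinear-backed
    # linear SVM on identical features. 20 newsgroups is only a stand-in corpus.
    import time
    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics import f1_score
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.svm import LinearSVC

    data = fetch_20newsgroups(subset="all")
    X = TfidfVectorizer(max_df=0.8).fit_transform(data.data)
    Xtr, Xte, ytr, yte = train_test_split(X, data.target, random_state=0)

    for clf in (MultinomialNB(), LinearSVC()):
        start = time.time()
        clf.fit(Xtr, ytr)
        f1 = f1_score(yte, clf.predict(Xte), average="macro")
        print(f"{type(clf).__name__}: F1={f1:.3f}, {time.time() - start:.1f}s")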

Other tips

Prune stopwords and otherwise useless words (too low support, etc.) as early as possible.

Depending on how you use clustering, it may actually make the test phase in particular even more expensive.

Try tools other than Mahout. I found Mahout to be really slow in comparison; it seems to come with a really high overhead somewhere.

Using fewer training examples would be an option. You will see that after a certain number of training examples, your classification accuracy on unseen examples won't increase any further. I would recommend trying to train with 100, 500, 1000, 5000, ... examples per category, using 20% for cross-validating the accuracy. When it stops increasing, you have found the amount of data you need, which may be a lot less than you are using now.
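
A sketch of that experiment (assuming X is a feature matrix with non-negative counts and y an array of category labels; the helper is hypothetical):

    # Sketch of the plateau experiment: train on n examples per category and
    # validate on a held-out 20% split; stop increasing n once accuracy flattens.
    import numpy as np
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import MultinomialNB

    def accuracy_per_train_size(X, y, sizes=(100, 500, 1000, 5000)):
        y = np.asarray(y)
        Xtr, Xval, ytr, yval = train_test_split(X, y, test_size=0.2,
                                                random_state=0)
        results = {}
        for n in sizes:
            # take up to n examples from each category
            idx = np.concatenate([np.flatnonzero(ytr == c)[:n]
                                  for c in np.unique(ytr)])
            model = MultinomialNB().fit(Xtr[idx], ytr[idx])
            results[n] = accuracy_score(yval, model.predict(Xval))
        return results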

Another approach would be to use another library. For document classification I find liblinear very, very fast. It may be more low-level than Mahout.

Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow