Question

I'm trying to perform LDA topic modeling with Mallet 2.0.7. I can train an LDA model and get good results, judging by the output of the training session. I can also use the inferencer built during that process and get similar results when re-processing my training file. However, if I take an individual file from the larger training set and process it with the inferencer, I get very different results, which are not good.

My understanding is that the inferencer should use a fixed model plus only features local to each document, so I do not understand why I would get different results when processing 1 file versus the 1,000 in my training set. I am not doing frequency cutoffs, which would seem to be a global operation that could have this kind of effect. You can see the other parameters I'm using in the commands below; they're mostly defaults. Changing the number of iterations to 0 or 100 didn't help.
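The intuition above — a fixed model plus document-local counts — can be sketched as a toy collapsed Gibbs inference step. This is a minimal illustration, not Mallet's actual `TopicInferencer` code; the function name, toy weights, and hyperparameters are all assumptions for demonstration:

```python
# Toy sketch of per-document topic inference against a FIXED model.
# Only the document-local counts (doc_topic) change; the trained
# topic-word weights stay frozen, so one document's result should
# not depend on which other documents are in the batch.
import random

def infer_doc_topics(doc_tokens, topic_word, num_topics, alpha=0.5,
                     iters=100, seed=0):
    """Estimate one document's topic mixture given fixed topic-word weights.

    topic_word: dict mapping (topic, word) -> weight from the trained model
                (toy numbers here, not a real Mallet model).
    """
    rng = random.Random(seed)
    assign = [rng.randrange(num_topics) for _ in doc_tokens]
    doc_topic = [0] * num_topics
    for z in assign:
        doc_topic[z] += 1
    for _ in range(iters):
        for i, w in enumerate(doc_tokens):
            doc_topic[assign[i]] -= 1  # remove token's current assignment
            # Sampling weight: local doc-topic count times FIXED model weight.
            weights = [(doc_topic[k] + alpha) * topic_word.get((k, w), 1e-9)
                       for k in range(num_topics)]
            r = rng.random() * sum(weights)
            k = 0
            while r > weights[k]:      # draw a topic proportional to weight
                r -= weights[k]
                k += 1
            assign[i] = k
            doc_topic[k] += 1
    denom = len(doc_tokens) + num_topics * alpha
    return [(c + alpha) / denom for c in doc_topic]

# Toy model: topic 0 favours wine words, topic 1 favours pet words.
model = {(0, "wine"): 0.9, (0, "grape"): 0.8,
         (1, "dog"): 0.9, (1, "pet"): 0.8}
mix = infer_doc_topics(["wine", "grape", "wine"], model, num_topics=2)
```

Because the model weights are held fixed, a wine-heavy document lands on the wine topic regardless of what else is in the batch — which is exactly the behavior the question expects and isn't seeing.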

Import data:

bin/mallet import-dir \
  --input trainingDataDir \
  --output train.data \
  --remove-stopwords TRUE \
  --keep-sequence TRUE \
  --gram-sizes 1,2 \
  --keep-sequence-bigrams TRUE

Train:

time ../bin/mallet train-topics \
  --input ../train.data \
  --inferencer-filename lda-inferencer-model.mallet \
  --num-top-words 50 \
  --num-topics 100 \
  --num-threads 3 \
  --num-iterations 100 \
  --doc-topics-threshold 0.1 \
  --output-topic-keys topic-keys.txt \
  --output-doc-topics doc-topics.txt

Topics assigned during training to one file in particular; topic #14 is about wine, which is correct:

998 file:/.../29708933509685249 14  0.31684981684981683 
> grep "^14\t" topic-keys.txt 
14  0.5 wine spray cooking car climate top wines place live honey sticking ice prevent collection market hole climate_change winery tasting california moldova vegas horses converted paper key weather farmers_market farmers displayed wd freezing winter trouble mexico morning spring earth round mici torrey_pines barbara kinda nonstick grass slide tree exciting lots 

Run inference on entire train batch:

../bin/mallet infer-topics \
  --input ../train.data \
  --inferencer lda-inferencer-model.mallet \
  --output-doc-topics inf-train.1 \
  --num-iterations 100

Inference on the training batch gives a very similar score:

998 /.../29708933509685249 14 0.37505087505087503 

Run inference on another data file comprising only that one txt file:

../bin/mallet infer-topics \
  --input ../one.data \
  --inferencer lda-inferencer-model.mallet \
  --output-doc-topics inf-one.2 \
  --num-iterations 100

Inference on the single document produces topics 80 and 36, which are very different (topic 14 is given a near-zero score):

0 /.../29708933509685249 80 0.3184778184778185 36 0.19067969067969068
> grep "^80\t" topic-keys.txt 
80  0.5 tips dog care pet safety items read policy safe offer pay avoid stay important privacy services ebay selling terms person meeting warning poster message agree sellers animals public agree_terms follow pets payment fraud made privacy_policy send description puppy emailed clicking safety_tips read_safety safe_read stay_safe services_stay payment_services transaction_payment offer_transaction classifieds_offer 

Solution

The problem was an incompatibility between the train.data and one.data files. Even though I had been careful to use all of the same options, two data files will by default use different Alphabets (the mapping between words and integers). To correct this, use the --use-pipe-from [MALLET TRAINING FILE] option when importing; specifying the other import options then seems to be unnecessary. Thanks to David Mimno.

bin/mallet import-dir \
  --input [trainingDataDirWithOneFile] \
  --output one.data \
  --use-pipe-from train.data
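The Alphabet mismatch can be sketched in a few lines. This is a simplified illustration of the idea, not Mallet's actual Alphabet class: each import assigns integer ids in order of first appearance, so two independently imported files map the same word to different ids, and the inferencer's topic-word counts get indexed with the wrong integers:

```python
# Sketch of why two independently imported data files are incompatible:
# each import builds its own word -> integer mapping in order of first
# appearance (a simplified stand-in for Mallet's Alphabet).
def build_alphabet(tokens):
    alphabet = {}
    for w in tokens:
        if w not in alphabet:
            alphabet[w] = len(alphabet)
    return alphabet

# Vocabulary seen during the full training import vs. a lone document.
train_alphabet = build_alphabet(["the", "wine", "tasting", "dog", "pet"])
one_alphabet = build_alphabet(["wine", "tasting"])

# "wine" gets id 1 in the training alphabet but id 0 on its own, so a
# model trained against the first mapping misreads the second file.
print(train_alphabet["wine"], one_alphabet["wine"])  # → 1 0
```

`--use-pipe-from` avoids this by reusing the training file's import pipeline, and with it the original Alphabet, for the new file.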
Licensed under: CC-BY-SA with attribution