Question

If I'm not mistaken, topic modeling (LDA) is not replicable, i.e. it gives different results in different runs. Where does this randomness come from, why is it necessary, and what can be done to address it or gain more stability across runs?
Thanks for the help.


Solution

LDA is a Bayesian model. This means the desired result is a posterior probability distribution over the random vectors of interest (e.g. the probability of topic assignments, having seen some data).

Inference for many Bayesian models is done by Markov Chain Monte Carlo (MCMC). Indeed, the Wikipedia article on LDA notes that Gibbs sampling is a popular inference technique for it. MCMC draws random samples to build an approximation of the posterior distribution.
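To make the source of randomness concrete, here is a minimal sketch of a collapsed Gibbs sampler for LDA. It is not taken from any particular library; the function name, toy corpus, and hyperparameter values are made up for illustration. The random initialisation and the random draw at each token are exactly where run-to-run variability enters.

```python
import numpy as np

def gibbs_lda(docs, n_topics, n_vocab, alpha=0.1, beta=0.01, n_iters=200, seed=None):
    """Collapsed Gibbs sampling for LDA on a toy corpus of word-id lists."""
    rng = np.random.default_rng(seed)       # all randomness flows through this generator
    n_dk = np.zeros((len(docs), n_topics))  # doc-topic counts
    n_kw = np.zeros((n_topics, n_vocab))    # topic-word counts
    n_k = np.zeros(n_topics)                # tokens per topic

    # random initialisation -- one source of run-to-run variability
    z = [[rng.integers(n_topics) for _ in doc] for doc in docs]
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                # remove this token's current assignment from the counts
                n_dk[d, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1
                # full conditional p(z = k | all other assignments, data)
                p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + n_vocab * beta)
                p /= p.sum()
                # the random draw -- different every run unless the seed is fixed
                k = rng.choice(n_topics, p=p)
                z[d][i] = k
                n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
    return z, n_dk, n_kw

# toy corpus: 4 documents over a 6-word vocabulary
docs = [[0, 1, 2, 0], [0, 1, 0], [3, 4, 5, 4], [4, 5, 3]]
z, n_dk, n_kw = gibbs_lda(docs, n_topics=2, n_vocab=6, seed=0)
print(n_kw)  # topic-word counts; these differ between unseeded runs
```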

Variational inference methods should typically be deterministic, but I'm not too familiar with the VB inference for this particular model.

Also, one can typically replicate runs of randomised algorithms by setting the random number generator's seed (useful if your purpose is scientific).
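For example, with scikit-learn's LatentDirichletAllocation (which uses variational inference but starts from a random initialisation), fixing random_state makes repeated runs identical. A minimal sketch on a made-up toy corpus:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["cats purr and sleep", "dogs bark and run",
          "stocks rise and fall", "markets crash and recover"]
X = CountVectorizer().fit_transform(corpus)

# identical seeds -> identical fitted topic-word matrices
lda_a = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
lda_b = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
print(np.allclose(lda_a.components_, lda_b.components_))   # True

# a different seed generally gives a (slightly or very) different solution
lda_c = LatentDirichletAllocation(n_components=2, random_state=1).fit(X)
print(np.allclose(lda_a.components_, lda_c.components_))   # typically False
```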

In either case, if the results from a Bayesian model show huge variability in the parameters of interest, it may be telling you that the model is not a good fit for the dataset, or that the dataset is not big enough for the model you are fitting.

EDIT: I don't know which inference method (Gibbs, VB etc.) the backend of your software is using, so it's not possible to determine what type (if any) of randomisation is going on.

For scientific purposes, you'll probably want to read up some more on Bayesian inference. Standard software (e.g. the LDA in scikit-learn) will give you a summary of the outputs of the inference (e.g. most coders just want the best assignment of docs to topics). There's more information hanging around behind the scenes which you might be able to access, and which could be useful.
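For example, scikit-learn's fitted LDA object exposes the full (approximate) doc-topic and topic-word distributions, not just the single best topic per document. A sketch on a toy corpus, assuming a recent scikit-learn:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["cats purr and sleep", "dogs bark and run",
          "stocks rise and fall", "markets crash and recover"]
vec = CountVectorizer()
X = vec.fit_transform(corpus)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# full doc-topic distribution (each row sums to 1), not just the argmax topic
print(lda.transform(X).round(2))

# topic-word distribution: normalise the variational pseudo-counts in components_
topic_word = lda.components_ / lda.components_.sum(axis=1, keepdims=True)
vocab = vec.get_feature_names_out()
for k, row in enumerate(topic_word):
    top_words = [vocab[i] for i in row.argsort()[::-1][:3]]
    print(f"topic {k}:", top_words)
```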

E.g. (roughly), for scientific applications of Gibbs sampling we'd typically run multiple chains, drop the first N samples generated by each (the burn-in), and check whether the remaining samples look like they came from the same distribution. If you're concerned about dependence on seeds etc. and your backend is a Gibbs sampler, you'll want to look into MCMC convergence diagnostics for this model.
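A rough, informal analogue that works with any backend (it is not a formal convergence diagnostic) is to fit the model several times with different seeds, match topics across runs, and look at how similar the matched topic-word distributions are. A sketch using scikit-learn and SciPy on a toy corpus:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = ["cats purr and sleep", "dogs bark and run",
          "stocks rise and fall", "markets crash and recover"]
X = CountVectorizer().fit_transform(corpus)

def topic_word(seed, n_topics=2):
    """Fit LDA with a given seed and return normalised topic-word distributions."""
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=seed).fit(X)
    return lda.components_ / lda.components_.sum(axis=1, keepdims=True)

runs = [topic_word(seed) for seed in range(5)]
ref = runs[0]
for i, other in enumerate(runs[1:], start=1):
    sim = cosine_similarity(ref, other)        # n_topics x n_topics similarity matrix
    rows, cols = linear_sum_assignment(-sim)   # best one-to-one topic matching
    print(f"run {i}: mean matched similarity = {sim[rows, cols].mean():.3f}")
# values near 1 across runs suggest the topics are stable; low or erratic values
# suggest sensitivity to initialisation (or too little data / too many topics)
```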

OTHER TIPS

When you are using Mallet, you can fix the random seed using the command-line flag --random-seed. This makes your results reproducible.
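For instance, a sketch of invoking Mallet's train-topics from Python with the seed pinned; the file names are placeholders, and it assumes the mallet binary is on your PATH and that corpus.mallet was produced by an earlier import step (e.g. mallet import-file):

```python
import subprocess

# hypothetical paths; corpus.mallet comes from an earlier `mallet import-file` step
subprocess.run([
    "mallet", "train-topics",
    "--input", "corpus.mallet",
    "--num-topics", "20",
    "--random-seed", "42",                     # same seed + same input -> same topic model
    "--output-topic-keys", "topic_keys.txt",   # top words per topic
    "--output-doc-topics", "doc_topics.txt",   # per-document topic proportions
], check=True)
```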

It does not and cannot change the fact that different random seeds produce different topic models.

The answers about Gibbs Sampling are misleading, in my opinion.

Their reasoning is that Gibbs sampling introduces some error because it relies on random sampling. All Bayesian methods that use MCMC techniques face this issue, and it's really not that much of a problem: under standard conditions, you can put bounds on your MCMC error.

But this is not one of those standard conditions, because Latent Dirichlet Allocation leads to a multi-modal posterior! As such, standard sampling methods will typically find a single mode and sample around it. They may well miss other modes, which may in fact have higher posterior probability than the one found. In these types of problems, one often runs several starts from different initial parameters and compares the solutions. There are also more advanced samplers that try to deal with this problem, such as samplers that occasionally attempt a jump far from the current mode in the hope of hitting another one. Odds are that if you find a mode, it's probably a pretty good solution, even if not optimal, but that is certainly not guaranteed!
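A pragmatic version of the multiple-restarts idea, sketched with scikit-learn's approximate log-likelihood score (with a Gibbs-based backend you would compare held-out likelihood or topic coherence instead); the seeds, corpus, and topic count are made up:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["cats purr and sleep", "dogs bark and run",
          "stocks rise and fall", "markets crash and recover"]
X = CountVectorizer().fit_transform(corpus)

# several restarts from different random initialisations; each may settle near a different mode
fits = [LatentDirichletAllocation(n_components=2, random_state=seed).fit(X)
        for seed in range(10)]
scores = [m.score(X) for m in fits]     # approximate log-likelihood of the training data
best = fits[int(np.argmax(scores))]     # keep the highest-scoring mode found
print("per-restart scores:", [round(s, 1) for s in scores])
# 'best' is the best mode seen across restarts -- often good, but not guaranteed to be global
```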

As for making results replicable: if you want to get the same solution, you have to start from the same place, with the same random seed.

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange