Question

LDA has two hyperparameters; tuning them changes the induced topics.

What do the alpha and beta hyperparameters contribute to LDA?

How do the topics change if one or the other hyperparameter increases or decreases?

Why are they hyperparameters and not just parameters?


Solution

The Dirichlet distribution is a multivariate distribution over the probability simplex. Its parameter is a vector $\alpha$ of size $K$, and its density has the form $\frac{1}{B(\alpha)} \cdot \prod\limits_{i} x_i^{\alpha_i - 1}$, where $x$ is a vector of size $K$ with $\sum_i x_i = 1$ and each $x_i \ge 0$.
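As a minimal illustration (not part of the original answer), here is what draws from a Dirichlet look like in practice: each sample is itself a probability vector on the simplex. The parameter values below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = np.array([2.0, 1.0, 0.5])       # K = 3 concentration parameters (arbitrary illustrative values)
samples = rng.dirichlet(alpha, size=4)  # four independent draws from Dirichlet(alpha)

print(samples)               # each row is a point on the 3-simplex
print(samples.sum(axis=1))   # every draw sums to 1
```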

Now, LDA uses constructs like:

  • a document can contain multiple topics (this multiplicity is why we need a Dirichlet distribution), and one Dirichlet distribution models this document-topic relation
  • a word can also belong to multiple topics when you consider it outside of any particular document, so a second Dirichlet models this topic-word relation

The previous two are distributions you never observe directly in the data; this is why they are called latent, or hidden.
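To make the two Dirichlets concrete, here is a rough numpy sketch of LDA's generative story for a single document (the alpha and beta values are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

K, V, n_words = 3, 8, 20            # topics, vocabulary size, words in one document
alpha = np.full(K, 0.1)             # document-topic prior (hypothetical value)
beta = np.full(V, 0.01)             # topic-word prior (hypothetical value)

phi = rng.dirichlet(beta, size=K)   # one word distribution per topic (K x V), drawn from Dirichlet(beta)
theta = rng.dirichlet(alpha)        # latent topic mixture of this document, drawn from Dirichlet(alpha)

z = rng.choice(K, size=n_words, p=theta)             # latent topic assignment of each word
w = np.array([rng.choice(V, p=phi[k]) for k in z])   # observed word ids

print("topic mixture theta:", np.round(theta, 2))
print("word ids:", w)
```

In real data you only see `w`; `theta`, `phi`, and `z` are the hidden quantities LDA tries to recover.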

Now, in Bayesian inference you use Bayes' rule to infer the posterior probability. For simplicity, let's say you have data $x$ and a model for this data governed by some parameters $\theta$. To infer values for these parameters, in full Bayesian inference you infer their posterior probability using Bayes' rule: $$p(\theta|x,\alpha) = \frac{p(x|\theta)p(\theta|\alpha)}{p(x|\alpha)} \iff \text{posterior probability} = \frac{\text{likelihood}\times \text{prior probability}}{\text{marginal likelihood}}$$ Note that an $\alpha$ appears here. This is your initial belief about the distribution of $\theta$, and it is the parameter of the prior distribution. Usually the prior is chosen to be conjugate (so that the posterior has the same distributional form as the prior), and it is set either to encode some knowledge, if you have any, or to have maximum entropy if you know nothing.
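As a concrete instance of conjugacy (a standard result, written out here as a sketch rather than taken from the answer), consider a categorical/multinomial likelihood with a Dirichlet prior, which is exactly the pairing LDA uses. If $n_k$ counts how often outcome $k$ was observed, then $$p(\theta \mid x, \alpha) \;\propto\; p(x \mid \theta)\, p(\theta \mid \alpha) \;\propto\; \prod_{k} \theta_k^{n_k} \cdot \prod_{k} \theta_k^{\alpha_k - 1} \;=\; \prod_{k} \theta_k^{(\alpha_k + n_k) - 1},$$ so the posterior is again a Dirichlet, now with parameters $\alpha_k + n_k$; each $\alpha_k$ acts like a pseudo-count added to the observed counts.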

The parameters of the prior are called hyperparameters. So in LDA both topic distributions, the one over documents and the one over words, also have corresponding priors, usually denoted alpha and beta, and because they are parameters of prior distributions they are called hyperparameters.

Now about choosing priors. If you plot some Dirichlet distributions you will notice that when all the individual parameters $\alpha_k$ have the same value, the pdf is symmetric on the simplex defined by the $x$ values, i.e. the minimum or maximum of the pdf lies at the center of the simplex.

If all the $\alpha_k$ have values lower than 1, the maximum is found at the corners of the simplex, whereas if all the $\alpha_k$ are equal and greater than 1 the maximum is found at the center.

It is easy to see that if the values of the $\alpha_k$ are not all equal the symmetry is broken and the maximum lies closer to the components with bigger values.
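You can check this numerically from the pdf formula given earlier; the sketch below (my own illustration, not from the answer) evaluates the log density at the center of a 3-simplex and near one of its corners:

```python
import numpy as np
from scipy.special import gammaln  # log of the Gamma function

def dirichlet_log_pdf(x, alpha):
    """Log density of Dirichlet(alpha) at a point x on the simplex."""
    x, alpha = np.asarray(x), np.asarray(alpha)
    log_B = gammaln(alpha).sum() - gammaln(alpha.sum())   # log of the normalizer B(alpha)
    return np.sum((alpha - 1.0) * np.log(x)) - log_B

center = np.array([1/3, 1/3, 1/3])          # middle of the 3-simplex
near_corner = np.array([0.98, 0.01, 0.01])  # close to one corner

for a in ([0.5, 0.5, 0.5], [5.0, 5.0, 5.0]):
    a = np.array(a)
    print(a, "center:", round(dirichlet_log_pdf(center, a), 2),
             "near corner:", round(dirichlet_log_pdf(near_corner, a), 2))
# alpha < 1: the density is larger near the corners (sparse mixtures favoured)
# alpha > 1: the density peaks at the center (even mixtures favoured)
```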

Additionally, note that the prior produces a smoother pdf when the parameter values are near 1. So if you have great confidence that something is distributed in a way you know, with a high degree of confidence, then values far from 1 should be used; if you do not have such knowledge, then values near 1 encode this lack of knowledge. It is easy to see from the formula of the distribution itself why 1 plays such a special role for the Dirichlet.

Another way to understand this is to see that the prior encodes prior knowledge. At the same time, you can think of the prior as encoding data seen before: this data was not seen by the algorithm itself, it was seen by you, you learned something from it, and you can set the prior according to what you learned. So in the prior parameters (hyperparameters) you also encode how big that previously seen data set was, because the sum of the $\alpha_k$ can be read as the size of this more or less imaginary data set. The bigger that prior data set, the bigger your confidence, the bigger the values of $\alpha_k$ you can choose, and the sharper the pdf around its maximum, which also means fewer doubts.
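A small numerical sketch of this pseudo-count reading (illustrative values only): the posterior of a Dirichlet prior combined with observed counts is Dirichlet with parameters $\alpha_k + n_k$, so the larger the $\alpha_k$, the more the "imaginary" prior data outweighs what was actually observed.

```python
import numpy as np

counts = np.array([8, 1, 1])          # counts actually observed in the data

weak_prior = np.full(3, 0.1)          # small alpha: tiny imaginary data set, the data dominates
strong_prior = np.full(3, 100.0)      # large alpha: big imaginary data set, the prior dominates

for alpha in (weak_prior, strong_prior):
    posterior_mean = (alpha + counts) / (alpha + counts).sum()   # mean of Dirichlet(alpha + counts)
    print("sum(alpha) =", alpha.sum(), "-> posterior mean:", np.round(posterior_mean, 3))
```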

Hope it helped.

OTHER TIPS

Assuming symmetric Dirichlet distributions (for simplicity), a low alpha value places more weight on having each document composed of only a few dominant topics (whereas a high value will return many more relatively dominant topics). Similarly, a low beta value places more weight on having each topic composed of only a few dominant words.
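For a hands-on feel, here is a small sketch with scikit-learn's `LatentDirichletAllocation`, where the document-topic prior plays the role of alpha and the topic-word prior the role of beta; the tiny corpus and the prior values are made up for illustration.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "cats purr and chase mice",
    "dogs bark and chase cats",
    "stocks rise as markets rally",
    "investors sell stocks when markets fall",
]

X = CountVectorizer().fit_transform(docs)   # document-term count matrix

# Low priors push towards sparse mixtures: few dominant topics per document,
# few dominant words per topic. Raise them to get flatter mixtures.
lda = LatentDirichletAllocation(
    n_components=2,
    doc_topic_prior=0.1,     # alpha in the notation above
    topic_word_prior=0.01,   # beta in the notation above
    random_state=0,
)
doc_topics = lda.fit_transform(X)           # per-document topic proportions
print(doc_topics.round(2))
```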

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange