Question

So I'm reading this paper, which uses a maximum entropy (max ent) classifier for sentence extraction. The parametric form of a conditional max ent model is:

$$ P(c|s) = \frac{1}{Z(s)} \exp \sum_i \lambda_if_i(c,s)$$

where $f_i(c,s)$ is a feature which has a weight $\lambda_i$ associated with it.
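Concretely, this is just a softmax over weighted sums of feature values. A minimal sketch (the feature functions, weights, and class names below are invented for illustration, not taken from the paper):

```python
import math

def maxent_prob(weights, features, classes, s):
    """P(c|s) = exp(sum_i lambda_i * f_i(c, s)) / Z(s), where Z(s) sums
    the exponentiated scores over all classes.
    `features` is a list of functions f_i(c, s); `weights` the matching lambda_i."""
    scores = {c: sum(w * f(c, s) for w, f in zip(weights, features))
              for c in classes}
    z = sum(math.exp(v) for v in scores.values())  # partition function Z(s)
    return {c: math.exp(v) / z for c, v in scores.items()}

# toy example: two classes and two hand-made binary features
feats = [lambda c, s: 1.0 if (c == "extract" and "important" in s) else 0.0,
         lambda c, s: 1.0 if c == "skip" else 0.0]
probs = maxent_prob([2.0, 0.5], feats, ["extract", "skip"],
                    "an important sentence")
```

The probabilities sum to one by construction, and the question of training is exactly the question of choosing the `weights` vector.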

Now, the paper states that conjugate gradient descent was used to find the optimal set of weights (page 3), and this is what I'm unable to comprehend. How is conjugate gradient descent applied to compute the optimal set of weights?
Solution

Conjugate gradient descent is a refinement of gradient descent. Plain gradient descent minimizes an objective function by repeatedly stepping in the direction of steepest descent, i.e. along the negative gradient. Conjugate gradient descent instead chooses each new search direction as a linear combination of the current negative gradient and the previous search direction, so that successive directions are mutually conjugate and a step does not undo the progress made along earlier directions. For a quadratic objective in $n$ unknowns it converges in at most $n$ steps, which is why the (linear) conjugate gradient method is also a standard solver for large, sparse systems of linear equations.
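As a sketch of the idea (not the paper's implementation), here is the linear conjugate gradient method on a small quadratic objective $f(x) = \tfrac{1}{2}x^\top A x - b^\top x$, whose minimizer solves $Ax = b$:

```python
import numpy as np

def conjugate_gradient(A, b, x0=None, tol=1e-10, max_iter=None):
    """Minimize f(x) = 0.5 x^T A x - b^T x for symmetric positive definite A,
    i.e. solve A x = b, by searching along mutually conjugate directions."""
    n = len(b)
    x = np.zeros(n) if x0 is None else x0.astype(float)
    r = b - A @ x          # residual = negative gradient of f at x
    d = r.copy()           # first search direction is steepest descent
    for _ in range(max_iter or n):
        alpha = (r @ r) / (d @ A @ d)     # exact step length along d
        x = x + alpha * d
        r_new = r - alpha * (A @ d)
        if np.linalg.norm(r_new) < tol:
            break
        beta = (r_new @ r_new) / (r @ r)  # Fletcher-Reeves coefficient
        d = r_new + beta * d              # mix new gradient with previous direction
        r = r_new
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = conjugate_gradient(A, b)  # exact for a 2x2 system in at most 2 steps
```

The line `d = r_new + beta * d` is the whole trick: the new direction blends the current negative gradient with the previous direction instead of restarting from the raw gradient each time.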

Gradient descent and its variations are general methods for finding the best parameters. The usual practice is to define your specific model's objective (here, the log-likelihood of the training data) and its gradient, and then call a separate optimization package to search for the optimal parameter values (the weights $\lambda_i$), rather than implementing the optimizer yourself.
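For example, a binary max ent model (equivalently, logistic regression) can be fit by handing its negative log-likelihood and gradient to a generic conjugate gradient routine. The synthetic data and the use of `scipy.optimize.minimize` below are illustrative assumptions, not the paper's setup:

```python
import numpy as np
from scipy.optimize import minimize

# Toy binary max ent model: rows of X hold feature values f_i, y the gold labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = (X @ true_w + rng.normal(size=100) > 0).astype(float)

def nll(w):
    """Negative log-likelihood of the logistic model under weights w."""
    z = X @ w
    return np.sum(np.logaddexp(0.0, z) - y * z)

def grad(w):
    """Gradient of the negative log-likelihood with respect to w."""
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    return X.T @ (p - y)

# Hand the objective and gradient to a generic conjugate gradient optimizer.
result = minimize(nll, np.zeros(3), jac=grad, method="CG")
w_hat = result.x
```

The optimizer never needs to know anything about max ent models; it only ever evaluates `nll` and `grad`, which is exactly the separation of concerns recommended above.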

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange