Computing Gaussian Mixture Model with Expectation Maximization using Mapreduce
-
21-12-2019 - |
Pregunta
Where I can find a Mapreduce implementation(or tutorial) for computing GMM using EM?
Also is it really feasible to do it using mapreduce framework?
I have found this lecture link but it does not contains detail description, also if my data(mean and variance) is large then how do I pass it from mapper to reducer?
Solución
OK, You said, you have experience implementing EM algorithm. It's easy to explain.
EM algorithm
As you know, EM algorithm is an iterative method for finding maximum likelihood. One iteration is consist of two steps that Expectation(E-step) and Maximization step (M-step).
In E-step, the likelihoods of the samples are calculated from the previous model. Let n
be the number of samples, we can get n
likelihoods.
Here, the likelihood calculation is performed independently. So this can be performed using multi-processor environment.
Assume that we have 4 CPUs on single machine, n/4
likelihoods can are calculated by each CPU. It's 4 times faster (the IO time is ignored)
In M-step, the new model is derived by the likelihoods.
To mapreduce
Computation EM
- E-step can be extend to
mapper
tasks on mapreduce because of independent each other samples.- input
- key: anything
- value: a sample
- output
- key: anything
- value: likelihood of the input sample
- input
- M-step can be extend to
reducer
task (this can be multiple reducers, but simply I recommend one reducer).- input
- key: anything
- value: likelihood
- output
- key: anything
- value: the next model
- input
data representation
- the training samples are located at some directory on
HDFS
- this will be input of
mapreduce
- this will be input of
- the previous model is also located at
HDFS
but different directory to training samples- USE distributed cache to let
mappers
know where the model is
- USE distributed cache to let
iteration
One mapreduce task is analogues to one iteration of EM algorithm. So you need to iterate the next mapreduce task until converged
I have explained briefly. You will face many problems while you implement.
I hope this helps.