Expectation–maximization algorithm

The Expectation-Maximization algorithm ( EM algorithm short, rarely Estimation -Maximization algorithm ) is an algorithm of mathematical statistics.

The core idea of the EM algorithm is to start with a model of random, and alternating to improve the allocation of the data to the individual parts of the model ( Expectation step) and the parameters of the model to the most recent assignment ( Maximization step ).

In both steps, while the quality of the result is improved: the E- step, points are better allocated, in the M- step, the model is modified so that it better fits the data. If no substantial improvement rather than more, one ends the procedure.

The procedure typically takes only "local" optima. Thus, it is often necessary to repeatedly enter the process, and select the best result thus found.

  • 2.1 Advantages of the EM clustering

Mathematical formulation


There are objects in front of a certain number of properties. The properties take on random values ​​. Some properties can be measured, but not others. Formally speaking, the objects are instances of a multi-dimensional random variable X, which is subject to an unknown probability distribution p ( x); some dimensions are " observable ", others are "hidden". The aim is to determine the unknown probability distribution.

First, it is believed, p ( x) is from a few parameters are known. Usually one chooses p as a mixing distribution, ie as a weighted sum of so many standard distributions, such as are present observable properties. If one chooses, for example, as a standard distribution, the normal distribution, the unknown parameters are, respectively, the mean μ and the variance. The goal is now to determine the parameters Θ of the presumed probability distribution. Choose a mixed distribution, as Θ also includes the sum of the weighting factors.

Usually it is of such a problem with the maximum likelihood method. This is not possible here, however, since the desired parameters of the hidden properties depend - and these are unknown. To reach the goal anyway, so a method is needed that estimates the hidden parameter with this Endparametern. That is the function of the EM algorithm.

The EM algorithm works iteratively and leads in each passage two steps:

The algorithm ends when the determined parameters no longer change significantly.

Proven to, the sequence of the particular Θ, ie, the algorithm terminates in any case. However, the result is usually only a local optimum, which is good, but not optimal though. It is therefore necessary to repeat the algorithm many times with different starting values ​​to find an optimal possible outcome.

Formulation as a random experiment

Formally, it is assumed the EM algorithm that the values ​​of the observed stochastic variable come about in the following way: First, choose one of the incoming random variables and gain their value as the final result. This means that just a weight takes on a value of one, and all others are zero. In the approximation of the weights by the EM algorithm, but this is usually not the case. The probability density of a target value can be in normal distribution assumption and constant variance of the individual random variables represented as:

The terms used have the following meanings:

  • : Weight of the - th random variable for the th value of the target
  • : Number of weights
  • : Value of the -th target
  • : sample
  • : Expectation value of the - th random variable
  • : variance

EM clustering

The EM - clustering is a method of cluster analysis, the data with a " Mixture of Gaussians " model - represents - ie as a superposition of normal distributions. This model is randomly or heuristically initialized and then refined with the general EM principle.

The EM - clustering consists of several iterations of the Expectation and Maximization steps.

  • In the initialization step has to be chosen freely. Take part to the fact that exactly one arbitrary random variable (just any training instance ) corresponds to this expected value, ie laws.
  • The Expectation step, the expected values ​​of the weights are calculated on the assumption that the expected values ​​of the random variable corresponding to the calculated incoming in the Maximization - step. This is only possible if it is not the first iteration. The expectation values ​​can be represented as
  • In the maximization step, the expected values ​​of the probability distributions of the individual random variables are determined in which the probability of the arrival of the sample result is maximized. This is done under the assumption that the exact values ​​of the weights respectively corresponding to their expected values ​​( maximum-likelihood algorithm). The estimated in this way expectation values ​​result from at normal distribution assumption
  • The parameters are further approximated their fair values ​​by repeating the Expectation Maximization and steps.

Advantages of the EM clustering

  • Mathematical model of the data with normal distributions
  • Allows clusters of different size (as opposed to k -means )
  • Can detect correlated clusters and represent the covariance matrix
  • Can deal with incomplete observations

K -Means as EM method

And the k- means algorithm can be understood as the EM algorithm that models the data as a Voronoi cell.

An EM variant of the k- means algorithm is as follows:

  • In the initialization step, the cluster centers are selected randomly or with a heuristic.
  • In the Expectation- step data objects are assigned to their nearest cluster center, ie
  • In the maximization step, the cluster centers are recalculated as the mean of their associated data points:

Other instances of the EM algorithm

  • Estimating the parameters of a mixed distribution
  • Baum-Welch algorithm for estimating parameters of a hidden Markov model,
  • Learning a Bayesian network with hidden variables

Evidence for the maximization step in the normal distribution assumption

To be proved:

Rather than maximize, can also be maximized, since the logarithm is a strictly monotonically increasing function.

The minuend is a constant, so it is sufficient to minimize the subtrahend:

A square sum is always a non-negative value. Therefore, the absolute minimum is limited by 0 and thus a local minimum. You set the 1st derivative of this term after each to 0:


Because the derivation of all summands of the sum over j are 0, except for j = k Consequently: