Clustering

1 minute read

Published:


EM-Algorithm

E-step

  • Compute the probability that point $n$ is generated by distribution $k$. There are $N$ points and $K$ clusters in total.
  • The EM algorithm is an efficient iterative procedure to compute the Maximum Likelihood (ML) estimate in the presence of missing or hidden data.
    • In soft K-means, we don't know what proportion of each instance belongs to each cluster.
  • In Maximum Likelihood estimation, we wish to estimate the model parameter(s) for which the observed data are the most likely.
  • Each iteration of the EM algorithm consists of two steps:
    • E-step: the missing data are estimated given the observed data and the current estimate of the model parameters.
    • M-step: the likelihood function is maximized under the assumption that the missing data are known.
\[P^{(i)}(k|n)=\frac{P_k^{(i)}g(x_n,m_k^{(i)},\sigma_k^{(i)})}{\sum\limits_{j=1}^{K} P_j^{(i)}g(x_n,m_j^{(i)},\sigma_j^{(i)})}\]

where $P_k=P(X=k)$ is the mixing proportion of cluster $k$ and $g(x_n,m_k,\sigma_k)$ is the Gaussian density with mean $m_k$ and (isotropic) standard deviation $\sigma_k$.
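The E-step can be sketched in NumPy. This is a minimal illustration assuming isotropic Gaussians with a scalar $\sigma_k$ per cluster; the function and variable names are illustrative, not from any particular library:

```python
import numpy as np

def e_step(X, means, sigmas, priors):
    """Responsibilities P(k|n) for an isotropic Gaussian mixture.

    X: (N, D) data, means: (K, D) cluster means m_k,
    sigmas: (K,) per-cluster standard deviations,
    priors: (K,) mixing proportions P_k.
    """
    N, D = X.shape
    K = means.shape[0]
    resp = np.zeros((N, K))
    for k in range(K):
        sq = np.sum((X - means[k]) ** 2, axis=1)        # ||x_n - m_k||^2
        # isotropic Gaussian density g(x_n, m_k, sigma_k)
        norm = (2.0 * np.pi * sigmas[k] ** 2) ** (-D / 2.0)
        resp[:, k] = priors[k] * norm * np.exp(-sq / (2.0 * sigmas[k] ** 2))
    resp /= resp.sum(axis=1, keepdims=True)             # normalize over clusters
    return resp
```

Each row of `resp` sums to 1: a point's probability mass is shared softly across the $K$ clusters.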

M-step

\[\begin{aligned} m_k^{(i+1)}& =\frac{\sum\limits_{n=1}^{N}P^{(i)}(k|n)x_n}{\sum\limits_{n=1}^{N}P^{(i)}(k|n)} \\ \sigma_k^{(i+1)}&=\sqrt{\frac{1}{D}\frac{\sum\limits_{n=1}^{N}P^{(i)}(k|n)\Vert x_n-m_k^{(i+1)} \Vert^2}{\sum\limits_{n=1}^{N}P^{(i)}(k|n)}}\\ P_k^{(i+1)}&=\frac{1}{N}\sum\limits_{n=1}^{N}P^{(i)}(k|n) \end{aligned}\]
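The three M-step updates above translate almost line by line into NumPy. A minimal sketch under the same isotropic-Gaussian assumption, with illustrative names:

```python
import numpy as np

def m_step(X, resp):
    """Update means m_k, stds sigma_k, and priors P_k from responsibilities.

    X: (N, D) data, resp: (N, K) responsibilities P(k|n) from the E-step.
    """
    N, D = X.shape
    K = resp.shape[1]
    Nk = resp.sum(axis=0)                    # effective number of points per cluster
    means = (resp.T @ X) / Nk[:, None]       # responsibility-weighted means
    sigmas = np.zeros(K)
    for k in range(K):
        sq = np.sum((X - means[k]) ** 2, axis=1)
        # sigma_k = sqrt( (1/D) * sum_n P(k|n) ||x_n - m_k||^2 / sum_n P(k|n) )
        sigmas[k] = np.sqrt((resp[:, k] @ sq) / (D * Nk[k]))
    priors = Nk / N                          # P_k = average responsibility
    return means, sigmas, priors
```

Alternating `e_step` and `m_step` until the parameters stop changing gives the full EM loop.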

EM and K-means

  • Notice the similarity between EM for normal mixtures and K-means:
    • The E-step corresponds to assigning points to clusters.
    • The M-step corresponds to updating the centres.
  • K-means is a simplified EM.
  • K-means makes a hard decision (each point belongs to exactly one cluster) while EM makes a soft decision (each point has a fractional responsibility) when updating the parameters of the model.
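The hard/soft distinction can be made concrete: the K-means assignment step is the EM E-step with the responsibilities collapsed to one-hot vectors. A hypothetical sketch:

```python
import numpy as np

def hard_assign(X, means):
    """K-means assignment step: each point goes entirely to its nearest centre.

    Returns one-hot "responsibilities" -- the hard-decision analogue of
    the soft responsibilities P(k|n) computed by the EM E-step.
    """
    # squared distances from every point to every centre, shape (N, K)
    d2 = np.sum((X[:, None, :] - means[None, :, :]) ** 2, axis=2)
    labels = d2.argmin(axis=1)
    resp = np.zeros_like(d2)
    resp[np.arange(len(X)), labels] = 1.0    # all mass on the nearest centre
    return resp
```

Feeding these one-hot responsibilities into the M-step mean update recovers the ordinary K-means centre update, which is one way to see K-means as a simplified EM.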