===================================================
Gaussian mixture models
===================================================

.. contents:: Tables of contents

EM is a package which enables to create Gaussian Mixture Models
(diagonal and full covariance matrices supported), to sample them, 
and to estimate them from data using Expectation Maximization algorithm.
It can also draw confidence ellipsoides for multivariate models, and 
compute the Bayesian Information Criterion to assess the number of
clusters in the data. In a near future, I hope to add so-called 
online EM (ie recursive EM) and variational Bayes implementation.

EM is implemented in python, and uses the excellent numpy and scipy
packages. Numpy is a python packages which gives python a fast 
multi-dimensional array capabilities (ala matlab and the likes); scipy
leverages numpy to build common scientific features for signal processing,
linear algebra, statistics, etc...

.. warning::

  The code in the `scikits.learn.em` submodule is not as mature as the 
  rest of the scikit and is prone to changing.
     


Basic usage
============

Once you are inside a python interpreter, you can import the package using the
follwing command::


    from scikits.learn import em


Creating, sampling and plotting a mixture
-----------------------------------------

Importing the package gives access to 3 classes: GM (for Gausssian Mixture),
GMM (Gaussian Mixture Model) and EM (for Expectation Maximization). The first
class GM can be used to create an artificial Gaussian Model, samples it, or
plot it. The following example show how to create a 2 dimension Gaussian Model
with 3 components, sample it and plot its confidence ellipsoids with
matplotlib:

.. literalinclude::  ../examples/em/basic_example1.py


which plots this figure:

.. image:: em_data/example1.png 
    :width: 500
    :height: 400

There are basically two ways to create a GM instance: either an empty one (eg
without mean, weights and covariances) by giving hyper parameters (dimension,
number of clusters, type of covariance matrices) during instanciation, or
giving all parameters using the class method GM.fromvalues. The first method is
mostly useful as a container for learning. There are also methods to create
random (but valid !) parameters for a Gaussian Mixture: this can be done by the
function method GM.generate_params (a class method).


Basic estimation of mixture parameters from data
------------------------------------------------

If you want to learn a Gaussian mixture from data with EM, you need to use two
classes from em: GMM and EM. You first create a GMM object from a GM instance;
then you can give the GMM object as a parameter to the EM class to compute
iterations of EM; once the EM has finished the computation, the GM instance of
GMM contains the computed parameters.


.. literalinclude::  ../auto_examples/em/basic_example2.py


GMM class do all the hard work for learning: it can compute the sufficient
statistics given the current state of the model, and update its parameters from
the sufficient statistics; the EM class uses a GMM instance to compute several
iterations. The idea is that you can implements a different mixture model and
uses the same EM class if you want (there are several optimized models for GMM,
which are subclasses of GMM).


Advanced topics
===============

Bayesian Information Criterion and automatic clustering
-------------------------------------------------------

The GMM class is also able to compute the bayesian information criterion (BIC),
which can be used to assess the number of clusters into the data.  It was first
suggested by Schwarz (see bibliography), who gave a Bayesian argument for
adopting the BIC. The BIC is derived from an approximation of the integrated
likelihood of the model, based on regularity assumptions.  The following code
generates an artificial mixture of 4 clusters, runs EM with models of 1 to 6
clusters, and prints which number of clusters is the most likely from the BIC:



.. literalinclude::  ../auto_examples/em/basic_example3.py


which plots this figure:

.. image:: em_data/Bic_example.png 
    :width: 500
    :height: 400

The above example also shows that you can control the plotting parameters by
using returned handles from plot methods. This can be useful for complex
drawing.

Examples 
=========

Using EM for pdf estimation
---------------------------

The following example uses the old faithful dataset and is available in the
example directory. It models the joint distribution (d(t), w(t+1)), where d(t)
is the duration time, and w(t+1) the waiting time for the next eruption. It
selects the best model using the BIC.

.. figure:: em_data/pdfestimation.png 
    :width: 500
    :height: 400

    isodensity curves for the old faithful data modeled by a 1, 2, 3 and 4
    componenits model (up to bottom, left to right).


Using EM for clustering
-----------------------

TODO (this is fundamentally the same than pdf estimation, though)

supervised learning (e.g discriminative analysis)
-------------------------------------------------

The following example shows how to do classification using discriminative
analysis on the famous iris dataset. The idea is to model each class
distribution P(data|class) by a mixture of Gaussian, and to classify the data
with Maximum A Posteriori (eg takes the class which maximizes P(class | data)).
Not surprisingly, the errors lie on the border between the two classes which
are not linearly separable.

.. figure:: em_data/dclass.png 
    :width: 600
    :height: 400

    learning data points on the left with isodensity curves as estimated by
    each mixture. On the right, the misclassified points are in red, green are
    correctly classified.


Notes on performance
--------------------

The package is implemented in python (100% of the code has a python
implementation, so that it can be used for educational purpose too), but thanks
to the Moore Law, it is reasonably fast so that it can be used for other
problems than toys problem. On my computer (linux on bi xeon 3.2 Ghz, 2Gb RAM),
running 10 iterations of EM algorithm on 100 000 samples of dimension 15, for a
diagonal model with 30 components, takes around 1 minute and 15 seconds: this
makes the implementation usable for moderately complex problems such as speaker
recognition using MFCC. If this is too slow, there is a C implementation for
Gaussian densities which can be enabled by commenting out one line in the file
gmm_em.py, which should gives a speed up of a factor 2 at least; this has not
been tested much, though, so beware.

Also, increasing the number of components or dimension is much more expensive
than increasing the number of samples: a 5 dimension model of 100 components
will be much slower to estimate with 500 samples than a 5 dimension 10
components with 5000 samples. This is because loops on dimension or components
are in python, whereas loops on samples are in C (through numpy).  I don't
think there is an easy fix to this problem. 

Full covariances will be slow, because you cannot avoid nested loop in python
this case without insane amount of memory. A C version may be
implemented, but this is not my top priority; most of the time, you should
avoid full covariance models if possible.

..
 Notes
 -----
 
 I believe the current API simple and powerful enough, except 
 maybe for plotting (if you think otherwise, I would be happy to hear
 your suggestions). Now, I am considering adding some more functionalities
 to the toolbox:
 
 - add simple methods for regularization of covariance matrix (easy)
 - add bayes prior (using variational Bayes approximation) for overfitting and
   model selection problems (not trivial, but doable)
 - improve online EM
 
 Other things which are doable but which I don't intend to implement are:
 
 - add other models (mixtures of multinomial: easy, simple HMM: easy, other ?)
 - add bayes prior using MCMC (hard, use PyMC for sampling ?)
