Part 4. Multivariate Statistics

$$ \def\pr{\hbox{Pr}} \def\var{\hbox{var}} \def\cov{\hbox{cov}} \def\corr{\hbox{corr}} \def\dmX{\un{\mathcal{X}}} \def\dmG{\un{\mathcal{G}}} \def\dmK{\un{\mathcal{K}}} \def\dmS{\un{\mathcal{S}}} \def\dmC{\un{\mathcal{C}}} \def\dmZ{\un{\mathcal{Z}}} \def\bma{{\mbox{\boldmath $\alpha$}}} \def\bmb{{\mbox{\boldmath $\beta$}}} \def\bmu{{\mbox{\boldmath $\mu$}}} \def\bme{{\mbox{\boldmath $\epsilon$}}} \def\bmS{{\mbox{\boldmath $\Sigma$}}} \def\bmL{{\mbox{\boldmath $\Lambda$}}} \def\bmd{{\mbox{\boldmath $\delta$}}} \def\bmD{{\mbox{\boldmath $\Delta$}}} \def\bmG{{\mbox{\boldmath $\Gamma$}}} \def\bmphi{{\mbox{\boldmath $\phi$}}} \def\bmPhi{{\mbox{\boldmath $\Phi$}}} \def\bmpsi{{\mbox{\boldmath $\psi$}}} \def\bmtheta{{\mbox{\boldmath $\theta$}}} \def\eq{\begin{equation}} \def\eeq{\end{equation}} \def\i{{\bf i}} \def\un#1{{\bf #1}}$$

The mean of the random vector $\un Z = (Z_1,Z_2)^\top$ is the vector of mean values of $Z_1$ and~$Z_2$,

$$ \langle \un Z\rangle = \pmatrix{\langle Z_1\rangle \cr \langle Z_2\rangle}. $$

The covariance of random variables $Z_1$ and $Z_2$ is a measure of how their realizations are dependent upon each other and is defined to be the mean of the random variable

$$(Z_1-\langle Z_1\rangle)(Z_2-\langle Z_2\rangle),$$

i.e.,

$$ \cov(Z_1,Z_2) = \left\langle\ (Z_1-\langle Z_1\rangle)(Z_2-\langle Z_2\rangle)\ \right\rangle. $$

Two simple consequences of the definition of covariance are:

$$\eqalign{ \cov(Z_1,Z_2) &= \langle Z_1 Z_2\rangle - \langle Z_1\rangle\langle Z_2\rangle \cr \cov(Z_1,Z_1) &= \var(Z_1).} $$

The correlation of $Z_1$ and $Z_2$ is defined by

$$ \rho_{12} = {\cov(Z_1,Z_2)\over\sqrt{\var(Z_1)\var(Z_2)}}. $$

The correlation is unitless and restricted to values

$$-1\le \rho_{12}\le 1.$$

If $|\rho_{12}|=1$, then $Z_1$ and $Z_2$ are linearly dependent.
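
As a quick numerical illustration (a sketch with simulated data, not part of the lecture material), the sample analogues of these definitions can be computed directly in Python:

```python
import numpy as np

# Illustration only: two correlated samples (simulated).
rng = np.random.default_rng(0)
z1 = rng.normal(size=1000)
z2 = 0.6 * z1 + 0.8 * rng.normal(size=1000)        # linear mix => nonzero correlation

# Sample analogues of cov(Z1,Z2) and rho_12 as defined above.
cov12 = np.mean((z1 - z1.mean()) * (z2 - z2.mean()))
rho12 = cov12 / np.sqrt(z1.var() * z2.var())
print(cov12, rho12)                                 # rho12 lies in [-1, 1]
```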

Covariance matrix

Let $\un a = (a_1,a_2)^\top$ be any constant vector. Then the variance of the random variable $\un a^\top\un Z = a_1Z_1+a_2Z_2$ is given by

$$\eqalign{ \var(\un a^\top \un Z) &= \cov(a_1Z_1+a_2Z_2,a_1Z_1+a_2Z_2)\cr &=a_1^2\var(Z_1)+a_1a_2\cov(Z_1,Z_2)+a_1a_2\cov(Z_2,Z_1)+a_2^2\var(Z_2)\cr &=(a_1,a_2) \pmatrix{\var(Z_1)&\cov(Z_1,Z_2)\cr\cov(Z_2,Z_1)&\var(Z_2)} \pmatrix{a_1\cr a_2}.} $$

The matrix in the above equation is the variance-covariance matrix $\bf\Sigma$, i.e.,

$$ {\bf\Sigma} = \pmatrix{\var(Z_1)&\cov(Z_1,Z_2)\cr\cov(Z_2,Z_1)&\var(Z_2)}. $$

Therefore

$$ \var(\un a^\top \un Z) = \un a^\top\bf\Sigma\un a. $$

Note that $\bf\Sigma$ is symmetric, i.e.,

$$ \bf\Sigma^\top = \bf\Sigma $$

and positive definite, that is, for any nonzero real vector $\un a$,

$$ \un a^\top\bf\Sigma\un a > 0. $$

$\un a^\top\bf\Sigma\un a $ is called a quadratic form.

The covariance matrix can be written as the mean of an outer product:

$$ \bf\Sigma=\langle\ (\un Z-\langle\un Z\rangle)(\un Z-\langle\un Z\rangle)^\top\ \rangle = \langle\un Z\un Z^\top\rangle -\langle\un Z\rangle\langle\un Z\rangle^\top. $$

For example

$$ \eqalign{ \left\langle\pmatrix{Z_1-\langle Z_1\rangle\cr Z_2-\langle Z_2\rangle} (Z_1-\langle Z_1\rangle,Z_2-\langle Z_2\rangle)\right\rangle &=\left\langle \pmatrix{(Z_1-\langle Z_1\rangle)^2 & (Z_1-\langle Z_1\rangle)(Z_2-\langle Z_2\rangle)\cr (Z_2-\langle Z_2\rangle)(Z_1-\langle Z_1\rangle) & (Z_2-\langle Z_2\rangle)^2 }\right\rangle \cr &= \pmatrix{\var(Z_1)&\cov(Z_1,Z_2)\cr\cov(Z_2,Z_1)&\var(Z_2)}. } $$

If $\langle\un Z\rangle=\un 0$, we can write simply

$$ \bf\Sigma=\langle \un Z\un Z^\top\rangle. $$
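
The identity $\var(\un a^\top\un Z)=\un a^\top\bf\Sigma\un a$ and the outer-product form of $\bf\Sigma$ can be checked numerically. The following sketch uses simulated two-component data and numpy's covariance estimator; the numbers and names are purely illustrative:

```python
import numpy as np

# Illustration only: sample covariance matrix and the quadratic form a^T Sigma a.
rng = np.random.default_rng(1)
Z = rng.multivariate_normal(mean=[0.0, 0.0],
                            cov=[[2.0, 0.8], [0.8, 1.0]], size=5000)  # n x 2 samples
Sigma = np.cov(Z, rowvar=False)          # 2 x 2 sample covariance matrix
a = np.array([0.3, -1.2])                # an arbitrary constant vector

lhs = np.var(Z @ a, ddof=1)              # sample variance of a^T Z
rhs = a @ Sigma @ a                      # quadratic form a^T Sigma a
print(lhs, rhs)                          # agree up to sampling error
```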

A random vector $\un Z=(Z_1,Z_2)^\top$ is often described by a {\it multivariate normal density function} $p(\un z)$, given by

$$ p(\un z) = {1\over (2\pi)^{N/2}\sqrt{|\bf\Sigma|}} \exp\left(-{1\over 2}(\un z-\bf\mu)^\top \bf\Sigma^{-1} (\un z-\bf\mu)\right), $$

where $N$ is the dimension of $\un Z$; in this case $N=2$.

In analogy to the univariate case,

$$ \langle\un Z\rangle = \bf\mu,\quad \langle\ (\un Z-\langle\un Z\rangle)(\un Z-\langle\un Z\rangle)^\top\ \rangle = \bf\Sigma. $$

We write $$ \un Z \sim \mathcal{N}(\bf\mu,\bf\Sigma). $$
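
As a sanity check on the density formula (again a sketch with arbitrarily chosen $\bf\mu$ and $\bf\Sigma$), one can compare a direct evaluation with scipy's implementation:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Evaluate the bivariate normal density with the formula above and compare
# with scipy (mu, Sigma and z chosen for illustration only).
mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.8], [0.8, 1.0]])
z = np.array([0.5, 0.0])
N = 2

d = z - mu
p_manual = np.exp(-0.5 * d @ np.linalg.inv(Sigma) @ d) / \
           ((2 * np.pi) ** (N / 2) * np.sqrt(np.linalg.det(Sigma)))
p_scipy = multivariate_normal.pdf(z, mean=mu, cov=Sigma)
print(p_manual, p_scipy)                 # identical up to floating point error
```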

Theorem:

If two random variables $Z_1,Z_2$ have a bivariate normal distribution, they are independent if and only if $\rho_{12}=0$, that is, if and only if they are uncorrelated.

In general a zero correlation does not imply that two random variables are independent. However this theorem says that it does if the variables are normally distributed.
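
A standard counterexample, sketched numerically below with simulated data: $Z_2 = Z_1^2$ is completely determined by $Z_1$, yet the two are essentially uncorrelated because the odd moments of a symmetric distribution vanish.

```python
import numpy as np

# Zero correlation does not imply independence (unless jointly normal):
# Z2 = Z1**2 depends completely on Z1 but is (nearly) uncorrelated with it.
rng = np.random.default_rng(2)
z1 = rng.normal(size=100000)
z2 = z1 ** 2                              # cov(Z1, Z2) = E[Z1^3] = 0 for symmetric Z1
print(np.corrcoef(z1, z2)[0, 1])          # close to zero despite full dependence
```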

Parameter estimation

In the multivariate case, the sample functions of interest are those which can be used to estimate the vector mean and variance-covariance matrix of a distribution $P(\un z)$. These are the sample mean vector

$$ \bar{\un Z} = {1\over n}\sum_{i=1}^n\un Z(i) $$

and the sample variance-covariance matrix

$$ \un S = {1\over n-1}\sum_{i=1}^n( \un Z(i) - \bar{\un Z} ) (\un Z(i) - \bar{\un Z})^\top. $$

As in the scalar case, these are unbiased estimators:

$$ \langle\bar{\un Z}\rangle = \langle\un Z\rangle\quad \langle\un S\rangle = \bf\Sigma $$
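
A minimal sketch of the two estimators on simulated data (the parameter values are illustrative):

```python
import numpy as np

# Simulate n observations of a two-component normal vector and form the
# sample mean vector and the sample covariance matrix.
rng = np.random.default_rng(3)
mu = np.array([1.0, 2.0])
Sigma = np.array([[2.0, 0.8], [0.8, 1.0]])
Z = rng.multivariate_normal(mu, Sigma, size=2000)   # rows are observations Z(i)

Zbar = Z.mean(axis=0)                               # sample mean vector
S = (Z - Zbar).T @ (Z - Zbar) / (len(Z) - 1)        # sample covariance matrix
print(Zbar)                                         # estimates mu
print(S)                                            # estimates Sigma (same as np.cov(Z, rowvar=False))
```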

If the observations are drawn from $\mathcal{N}(\bf\mu,\bf\Sigma)$, then the sample mean vector

$$ \bar{\un Z} = {1\over n}\sum_{\nu=1}^n \un Z(\nu) $$

is multivariate normally distributed with mean $\bf\mu$ and covariance matrix $\bf\Sigma/n$.

For zero-mean random vectors $\un Z(\nu)$, $\nu=1\dots n$, let

$$ \un X = (n-1)\un S = \sum_{\nu=1}^n \un Z(\nu)\un Z(\nu)^\top $$

The probability density function of $\un X$ is the Wishart density with $n$ degrees of freedom

$$ p_W(\un x) ={|\un x|^{(n-N-1)/2}\exp(-{\rm tr}(\bf\Sigma^{-1}\un x)/2) \over 2^{Nn/2}\pi^{N(N-1)/4}|\bf\Sigma|^{n/2}\prod_{i=1}^N\Gamma[(n+1-i)/2]} $$

for $\un x$ positive definite, and $0$ otherwise.
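
Since the Wishart density itself is awkward to visualize, here is a hedged Monte Carlo sketch instead: for zero-mean normal vectors the matrix $\un X=\sum_\nu \un Z(\nu)\un Z(\nu)^\top$ has expectation $n\bf\Sigma$, which the simulation below reproduces (all numbers are illustrative).

```python
import numpy as np

# Monte Carlo sketch: X = sum_nu Z(nu) Z(nu)^T for zero-mean normal Z(nu)
# is Wishart distributed; its mean is n * Sigma.
rng = np.random.default_rng(4)
Sigma = np.array([[2.0, 0.8], [0.8, 1.0]])
n, trials = 10, 20000

X_mean = np.zeros((2, 2))
for _ in range(trials):
    Z = rng.multivariate_normal([0.0, 0.0], Sigma, size=n)   # n zero-mean observations
    X_mean += Z.T @ Z / trials                               # running average of X
print(X_mean)          # approximately n * Sigma
print(n * Sigma)
```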

Exercise 1:

Show that the Wishart distribution with $n$ degrees of freedom reduces to the chi-square distribution with $n$ degrees of freedom in the univariate case $\un Z \to Z$ and $\bf\Sigma\to\sigma^2\to 1$.

Principal Components Analysis

The principal components transformation generates linear combinations of multivariate observations which have maximum variance.

Consider a set of measurements represented by the random vector $\un Z$ and assume that $\langle\un Z\rangle=\un 0$, so that the covariance matrix is

$$\bf\Sigma=\langle\un Z \un Z^\top\rangle.$$

We seek a linear combination $Y = \un a^\top \un Z$ whose variance is maximum. Remember

$$ \var(Y)=\un a^\top \bf\Sigma \un a. $$

The maximization only makes sense if we restrict $\un a$ in some way. A convenient constraint is $\un a^\top \un a=1$. So we maximize the Lagrange function

$$ L = \un a^\top \bf\Sigma \un a - \lambda(\un a^\top \un a - 1). $$

Taking the derivative,

$$ {\partial L \over \partial \un a} = 2\bf\Sigma\un a - 2\lambda\un a = \un 0. $$

This is the eigenvalue problem

$$ \bf\Sigma\un a = \lambda\un a $$

for the symmetric, positive definite matrix $\bf\Sigma$.

Call the (orthogonal and normalized) eigenvectors $\un a_1\dots \un a_N$. Assume that they are sorted according to decreasing eigenvalue

$$\lambda_1\ge \dots\ge\lambda_N.$$

These eigenvectors are the principal axes and the corresponding linear combinations $\un a_i^\top\un Z$ are projections along the principal axes, called the principal components of $\un Z$.

The individual principal components

$$ Y_1 = \un a_1^\top\un Z,\ Y_2=\un a_2^\top\un Z,\ \dots,\ Y_N=\un a_N^\top\un Z $$

can be expressed more compactly as a random vector $\un Y$ by writing

$$ \un Y = \un A^\top\un Z, $$

where $\un A$ is the matrix whose columns comprise the eigenvectors, that is,

$$ \un A = (\un a_1\dots \un a_N). $$

Since the eigenvectors are orthogonal and normalized, $\un A$ is an orthogonal matrix:

$$ \un A^\top \un A=\un I. $$

If the covariance matrix of the principal components vector $\un Y$ is called $\bf\Sigma'$, then we have

$$ \bf\Sigma' = \langle \un Y \un Y^\top \rangle= \langle\un A^\top \un Z \un Z^\top \un A\rangle =\un A^\top \bf\Sigma \un A = \pmatrix{\lambda_1 & 0 & \cdots &0\cr 0 &\lambda_2& \cdots&0\cr \vdots&\vdots&\ddots&\vdots\cr 0 & 0& \cdots&\lambda_N} $$

or

$$ \un A^\top \bf\Sigma \un A = \bf\Lambda $$

The eigenvalues are thus seen to be the variances of the principal components, and all of the covariances are zero.

The first principal component $Y_1$ has maximum variance

$$\var(Y_1)=\lambda_1,$$

the second principal component $Y_2$ has maximum variance

$$\var(Y_2)=\lambda_2$$

subject to the condition that it is uncorrelated with $Y_1$, and so on. The fraction of the total variance in the original dataset which is accounted for by the first $r$ principal components is

$$ {\lambda_1+\dots +\lambda_r\over\lambda_1+\dots +\lambda_r+\dots +\lambda_N} = {\lambda_1+\dots +\lambda_r \over\ {\rm tr} \bf\Lambda}. $$
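
A compact sketch of the whole transformation on simulated three-component data (variable names are illustrative; numpy's `eigh` returns the eigenvalues in ascending order, so they are re-sorted):

```python
import numpy as np

# Principal components transformation on simulated zero-mean data.
rng = np.random.default_rng(5)
Sigma = np.array([[3.0, 1.2, 0.5],
                  [1.2, 2.0, 0.3],
                  [0.5, 0.3, 1.0]])
Z = rng.multivariate_normal(np.zeros(3), Sigma, size=5000)   # rows are observations

S = np.cov(Z, rowvar=False)                  # sample covariance matrix
lam, A = np.linalg.eigh(S)                   # eigenvalues ascending, eigenvectors in columns
idx = np.argsort(lam)[::-1]                  # sort eigenvalues in decreasing order
lam, A = lam[idx], A[:, idx]

Y = Z @ A                                    # principal components Y = A^T Z, row-wise
print(np.cov(Y, rowvar=False).round(3))      # approximately diag(lambda_1, ..., lambda_N)
print(lam[:2].sum() / lam.sum())             # variance fraction explained by first 2 PCs
```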

Data design matrix

Suppose that we have made independent observations $\un z(i)$ of $\un Z(i)$, $i=1\dots n$. These may be arranged into an $n\times N$ matrix $\dmZ$ in which each $N$-component observation vector forms a row, i.e.,

$$ \dmZ = \pmatrix{\un z(1)^\top \cr \un z(2)^\top\cr\vdots\cr\un z(n)^\top}. $$

This is a very useful construct for computer programs.

For two-dimensional observations $$ \un z(i) = \pmatrix{z_1(i)\cr z_2(i)} = (z_1(i), z_2(i))^\top,\quad i = 1\dots n $$

Mean $$ \bar z_1 = {1\over n}\sum_{i=1}^n z_1(i),\quad \bar z_2= {1\over n}\sum_{i=1}^n z_2(i) $$

$$ \bar{\un z} = \pmatrix{\bar z_1 \cr \bar z_2} = {1\over n}\pmatrix{z_1(1) & z_1(2) & \cdots & z_1(n)\cr z_2(1) & z_2(2) & \cdots & z_2(n)} \pmatrix {1\cr 1\cr \vdots\cr 1} = {1\over n}\dmZ^\top\un 1_n $$

Covariance matrix estimate $$ \un s = {1\over n-1}\pmatrix{ \sum_i(z_1(i)-\bar z_1)^2 & \sum_i (z_1(i)-\bar z_1)(z_2(i)-\bar z_2) \cr \sum_i (z_1(i)-\bar z_1)(z_2(i)-\bar z_2) & \sum_i(z_2(i)-\bar z_2)^2 } $$

Data matrix $$ \dmZ =\pmatrix{z_1(1) &z_2(1) \cr z_1(2) & z_2(2) \cr \vdots & \vdots \cr z_1(n) & z_2(n)} $$

Column centered data matrix

$$ \tilde\dmZ =\pmatrix{z_1(1)-\bar z_1 & z_2(1) -\bar z_2 \cr z_1(2)-\bar z_1& z_2(2)-\bar z_2 \cr \vdots & \vdots \cr z_1(n)-\bar z_1 & z_2(n) -\bar z_2} $$

So the covariance matrix estimate can be written as

$$ \un s = {1\over n-1}\tilde\dmZ^\top\tilde\dmZ $$
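
These formulas translate directly into array operations; the following sketch uses simulated data and checks the result against numpy's own covariance estimator:

```python
import numpy as np

# Data matrix, sample mean via (1/n) Z^T 1_n, centering, and covariance estimate.
rng = np.random.default_rng(6)
n = 1000
Z = rng.multivariate_normal([1.0, -2.0], [[2.0, 0.8], [0.8, 1.0]], size=n)  # n x 2 data matrix

ones = np.ones(n)
zbar = Z.T @ ones / n                    # sample mean vector, (1/n) Z^T 1_n
Ztilde = Z - zbar                        # column-centered data matrix
s = Ztilde.T @ Ztilde / (n - 1)          # covariance matrix estimate
print(zbar, s, np.cov(Z, rowvar=False), sep="\n")   # s agrees with np.cov
```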

Exercise 2:

(a) Show that the covariance matrix estimate can also be written in the form

$$ (n-1)\un s = \dmZ^\top\un H\dmZ, $$

where the symmetric centering matrix $\un H$ is given by

$$ \un H = \un I_{nn} - {1\over n}\un 1_n \un 1_n^\top. $$

(b) Show that the matrix $\un H$ is idempotent, i.e., $$ \un H \un H = \un H. $$

Let $\un x$ be any $N$-component vector. Then $$ \un x^\top\un s\un x = {1\over n-1} \un x^\top\dmZ^\top\un H\dmZ\un x = {1\over n-1} \un x^\top\dmZ^\top\un H\un H\dmZ\un x = {1\over n-1}\un y^\top\un y \ge 0, $$ where $\un y = \un H\dmZ\un x$. Therefore $\un s$ is positive semi-definite.
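
A numerical check (not a proof, and not a substitute for working the exercise) of the centering-matrix identities and of positive semi-definiteness, using simulated data:

```python
import numpy as np

# Centering matrix H: idempotency, (n-1) s = Z^T H Z, and s >= 0.
rng = np.random.default_rng(7)
n = 200
Z = rng.multivariate_normal([1.0, -2.0], [[2.0, 0.8], [0.8, 1.0]], size=n)

H = np.eye(n) - np.ones((n, n)) / n                 # H = I - (1/n) 1 1^T
print(np.allclose(H @ H, H))                        # H is idempotent
s_H = Z.T @ H @ Z / (n - 1)                         # (n-1) s = Z^T H Z
print(np.allclose(s_H, np.cov(Z, rowvar=False)))    # agrees with the usual estimate
print(np.all(np.linalg.eigvalsh(s_H) >= -1e-12))    # s is positive semi-definite
```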

So in practice, one estimates the covariance matrix $\bf\Sigma$ in terms of the data matrix $\dmZ$, and then solves the eigenvalue problem in the form

$$ \un s\un a = {1\over n-1}\tilde\dmZ^\top\tilde\dmZ\un a = \lambda\un a. $$

Alternatively, consider the SVD of the $n\times N$ data matrix $\dmZ$:

$$\dmZ = \un U\un W\un V^\top,\quad \un U^\top\un U = \un V^\top\un V = \un I,\ \un W\hbox{ is diagonal}.$$

Then

$$\eqalign{ (n-1)\bf\Sigma &= \dmZ^\top\dmZ = \un V\un W\un U^\top\un U\un W\un V^\top\cr &= \un V \un W^2\un V^\top }$$

or

$$\un V^\top\bf\Sigma\un V = {1\over n-1}\un W^2 = \bf\Lambda,$$

so $\un V$ is the matrix of eigenvectors and the eigenvalues are proportional to the squares of the diagonal elements of $\un W$.
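
The equivalence can be verified numerically; the sketch below (simulated data, illustrative dimensions) computes the spectrum both from the SVD of the centered data matrix and from the eigendecomposition of the sample covariance matrix:

```python
import numpy as np

# PCA via SVD of the centered data matrix vs. eigendecomposition of s.
rng = np.random.default_rng(8)
n = 2000
Z = rng.multivariate_normal(np.zeros(3),
                            [[3.0, 1.2, 0.5], [1.2, 2.0, 0.3], [0.5, 0.3, 1.0]], size=n)
Ztilde = Z - Z.mean(axis=0)                              # column-centered data matrix

U, w, Vt = np.linalg.svd(Ztilde, full_matrices=False)    # Ztilde = U W V^T
lam_svd = w ** 2 / (n - 1)                               # eigenvalues from singular values

lam, A = np.linalg.eigh(np.cov(Z, rowvar=False))
print(np.allclose(lam_svd, np.sort(lam)[::-1]))          # same spectrum
# The columns of V (rows of Vt) agree with the eigenvectors in A up to sign.
```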

**Example:** (“Factor analysis”)

We had for the principal components transformation

$$\un Y = \un A^\top\un Z$$

and we can recover the original variables by inverting,

$$\un Z = (\un A^\top)^{-1}\un Y = \un A \un Y.$$

Suppose that the first $r$ PCs account for (explain) most of the variance in the data and let $\un A = (\un A_r,\un A_{r-})$, where

$$\un A_r = (\un a_1,\dots \un a_r),\quad \un A_{r-} = (\un a_{r+1},\dots \un a_N),$$

and similarly

$$\un Y =\pmatrix{\un Y_r \cr \un Y_{r-}}.$$

Now reconstruct $\un Z$ from $\un Y_r = (Y_1\dots Y_r)^\top$:

$$\un Z \approx \un A_r \un Y_r.$$

The *reconstruction error* is

$$\bf\epsilon = \un Z - \un A_r \un Y_r = \un Z - \un A_r \un A_r^\top \un Z = (\un I - \un A_r \un A_r^\top)\un Z = \un A_{r-}\un A_{r-}^\top \un Z,$$

so we can write

$$\un Z = \un A_r\un Y_r + \bf\epsilon.$$

This is in the form of a *factor analysis model*. In this case the “factors” are the first $r$ principal components. The elements of $\un A_r$ are called the “factor loadings”.

The variance-covariance matrix for the error term is

$$\eqalign{ \langle \bf\epsilon\bf\epsilon^\top\rangle &= \langle\un A_{r-}\un A_{r-}^\top\un Z\un Z^\top\un A_{r-}\un A_{r-}^\top\rangle \cr &= \langle\un A_{r-}\un Y_{r-}\un Y_{r-}^\top\un A_{r-}^\top\rangle \cr &= \un A_{r-}\bf\Lambda_{r-}\un A_{r-}^\top.}$$
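
A short sketch of the reconstruction and of the error covariance on simulated data (the choice $r=2$, $N=3$ and all variable names are purely illustrative):

```python
import numpy as np

# Reconstruct Z from the first r principal components and compare the
# covariance of the reconstruction error with A_{r-} Lambda_{r-} A_{r-}^T.
rng = np.random.default_rng(9)
Sigma = np.array([[3.0, 1.2, 0.5], [1.2, 2.0, 0.3], [0.5, 0.3, 1.0]])
Z = rng.multivariate_normal(np.zeros(3), Sigma, size=10000)

lam, A = np.linalg.eigh(np.cov(Z, rowvar=False))
idx = np.argsort(lam)[::-1]
lam, A = lam[idx], A[:, idx]

r = 2
Ar, Arm = A[:, :r], A[:, r:]                       # A_r and A_{r-}
Y = Z @ A                                          # principal components
Z_hat = Y[:, :r] @ Ar.T                            # reconstruction A_r Y_r
eps = Z - Z_hat                                    # reconstruction error
print(np.cov(eps, rowvar=False).round(4))          # error covariance
print(((Arm * lam[r:]) @ Arm.T).round(4))          # A_{r-} Lambda_{r-} A_{r-}^T: same matrix
```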
In [1]:
%matplotlib inline
In [2]:
run dispms -f 20010525 -e 3 -p [1,2,3]
In [3]:
run pca /home/mort/stats2016/20010525
------------PCA ---------------
Wed Oct 26 16:03:04 2016
Input /home/mort/stats2016/20010525
Eigenvalues: [  6.53264847e+03   3.94881303e+02   1.39378715e+02   3.23104634e+01
   1.48719152e+01   3.02456833e+00]
result written to: /home/mort/stats2016/20010525_pca
elapsed time: 0.387809991837
In [4]:
run dispms -f /home/mort/stats2016/20010525_pca -p [1,2,3] -e 3 \
-F /home/mort/stats2016/20010525_pca -P [4,5,6] -E 3 