Statistics for Data Analysis

Mort Canty

mort.canty@gmail.com

November, 2016

$$ \def\pr{\hbox{Pr}} \def\var{\hbox{var}} \def\cov{\hbox{cov}} \def\corr{\hbox{corr}} \def\dmX{\un{\mathcal{X}}} \def\dmG{\un{\mathcal{G}}} \def\dmK{\un{\mathcal{K}}} \def\dmS{\un{\mathcal{S}}} \def\dmC{\un{\mathcal{C}}} \def\dmZ{\un{\mathcal{Z}}} \def\bma{{\mbox{\boldmath $\alpha$}}} \def\bmb{{\mbox{\boldmath $\beta$}}} \def\bmu{{\mbox{\boldmath $\mu$}}} \def\bme{{\mbox{\boldmath $\epsilon$}}} \def\bmS{{\mbox{\boldmath $\Sigma$}}} \def\bmL{{\mbox{\boldmath $\Lambda$}}} \def\bmd{{\mbox{\boldmath $\delta$}}} \def\bmD{{\mbox{\boldmath $\Delta$}}} \def\bmG{{\mbox{\boldmath $\Gamma$}}} \def\bmphi{{\mbox{\boldmath $\phi$}}} \def\bmPhi{{\mbox{\boldmath $\Phi$}}} \def\bmpsi{{\mbox{\boldmath $\psi$}}} \def\bmtheta{{\mbox{\boldmath $\theta$}}} \def\eq{\begin{equation}} \def\eeq{\end{equation}} \def\i{{\bf i}} \def\un#1{{\bf #1}}$$

**Topics:**

  • Vectors and matrices

  • Random variables

  • Point and interval parameter estimation

  • Multivariate statistics

  • Linear regression

  • Multiple linear regression, Kalman filter

  • Hypothesis testing

  • Data assimilation

  • Spatial statistics

**References:**

J. E. Freund (2003), *Mathematical Statistics with Applications*, 7th Ed., Prentice Hall.

K. V. Mardia et al. (1979), *Multivariate Analysis*, Academic Press.

M. J. Canty (2014), *Image Analysis, Classification and Change Detection in Remote Sensing*, 3rd Ed., Taylor and Francis.

**Course Material:**

http://mortcanty.github.io/src/stats2016.html

Jupyter notebook environment with IPython

Download and install Anaconda2 (Python 2.7 version)

https://www.continuum.io/downloads

In a command window enter

jupyter notebook

Point your browser (preferably Firefox) to localhost:8888

Part 1. Vectors and Matrices

The *sum* of two column vectors is given by

$$\un x + \un y = \pmatrix{x_1\cr x_2}+\pmatrix{y_1\cr y_2}=\pmatrix{x_1+y_1\cr x_2+y_2},$$

and their *inner product* by

$$\un x^\top \un y =(x_1,x_2) \pmatrix{y_1\cr y_2}=x_1y_1+x_2y_2.$$

The *length* or *(2-)norm* of the vector $\un x$ is

$$\|\un x\| = \sqrt{x_1^2+x_2^2} = \sqrt{\un x^\top \un x}\ .$$

The inner product can be written as

$$\un x^\top \un y = \|\un x\|\|\un y\|\cos(\theta)$$

where $\theta$ is the angle subtended by $\un x$ and $\un y$. If $\theta=\pi/2$ then the vectors are *orthogonal*.
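
A quick numerical check of these definitions (a minimal sketch; the two vectors are arbitrary choices):

import numpy as np

x = np.array([3.0, 4.0])
y = np.array([4.0, -3.0])
# inner product and 2-norm
print(np.dot(x, y))              # 0.0, so x and y are orthogonal
print(np.linalg.norm(x))         # 5.0
# angle subtended by x and y
costheta = np.dot(x, y)/(np.linalg.norm(x)*np.linalg.norm(y))
print(np.degrees(np.arccos(costheta)))   # 90.0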

The product of two $2\times 2$ matrices is given by

$$\eqalign{ \un A \un B &=\pmatrix{a_{11} & a_{12}\cr a_{21}&a_{22}} \pmatrix{b_{11} & b_{12}\cr b_{21}& b_{22}}\cr &=\pmatrix{a_{11}b_{11}+a_{12}b_{21} & a_{11}b_{12}+a_{12}b_{22}\cr a_{21}b_{11}+a_{22}b_{21} & a_{21}b_{12}+a_{22}b_{22}}}$$

The matrix product $\un A\un B$ is allowed whenever $\un A$ has the same number of columns as $\un B$ has rows.

So if $\un A$ has dimension $\ell\times m$ and $\un B$ has dimension $m\times n$, then $\un A\un B$ is $\ell\times n$ with elements

$$(\un A\un B)_{ij} = \sum_{k=1}^m a_{ik}b_{kj}\quad i=1\dots \ell,\ j=1\dots n.$$

Matrix multiplication is not *commutative*, i.e. $\un A\un B\ne\un B\un A$ in general. However it is *associative*:

$$(\un A\un B)\un C = \un A(\un B\un C)= \un A \un B \un C. \tag 1 $$
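
These two properties are easy to verify numerically (a minimal sketch with random matrices):

import numpy as np

np.random.seed(1)
P = np.random.rand(2,2)
Q = np.random.rand(2,2)
R = np.random.rand(2,2)
# not commutative: P Q differs from Q P
print(np.allclose(P.dot(Q), Q.dot(P)))                 # False
# but associative: (P Q) R equals P (Q R)
print(np.allclose(P.dot(Q).dot(R), P.dot(Q.dot(R))))   # True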

The *outer product* of two vectors of equal length, written $\un x \un y^\top$, is a matrix, e.g.,

$$\un x \un y^\top = \pmatrix{x_1\cr x_2} (y_1, y_2)= \pmatrix{x_1&0\cr x_2&0} \pmatrix{y_1&y_2\cr0&0} =\pmatrix{x_1y_1&x_1y_2\cr x_2y_1&x_2y_2}.$$
In [1]:
from numpy import *

# a random 3x1 column vector
A = mat(random.rand(3,1))
print A
print
# the outer product A*A.T is a rank-one (and hence singular) 3x3 matrix
B = A*A.T
print B
print
# its determinant is zero up to rounding error
print linalg.det(B)
[[ 0.3860485 ]
 [ 0.14471887]
 [ 0.16162092]]

[[ 0.14903345  0.0558685   0.06239351]
 [ 0.0558685   0.02094355  0.0233896 ]
 [ 0.06239351  0.0233896   0.02612132]]

-3.75144182929e-38

Matrices, like vectors, have a transposed form, obtained by interchanging their rows and columns.

$$\un A = \pmatrix{a_{11} & a_{12}\cr a_{21}&a_{22}},\quad\un A^\top = \pmatrix{a_{11} & a_{21}\cr a_{12}&a_{22}}.$$

Transposition has the properties $$\eqalign{ (\un A+\un B)^\top &= \un A^\top + \un B^\top\cr (\un A\un B)^\top &= \un B^\top\un A^\top.}\tag 2$$

The determinant of a $2\times 2$ matrix is given by

$$|\un A| = a_{11}a_{22}-a_{12}a_{21}.$$

The determinant has the properties $$\eqalign{ |\un A\un B| &= |\un A| |\un B| \cr |\un A^\top| &= |\un A|.}\tag 3$$

The $2\times 2$ *identity matrix* is

$$\un I = \pmatrix{1&0\cr 0&1},$$

and for any $\un A$,

$$\un I \un A = \un A \un I= \un A.$$

The *matrix inverse* $\un A^{-1}$ of a square matrix $\un A$ is defined in terms of the identity matrix: $$\un A^{-1} \un A = \un A \un A^{-1} = \un I.$$

For example:

$$\un A^{-1}={1\over |\un A|}\pmatrix{a_{22}& -a_{12}\cr -a_{21}&a_{11}}.$$

The matrix inverse has the properties $$\eqalign{ (\un A\un B)^{-1} &= \un B^{-1}\un A^{-1}\cr (\un A^{-1})^\top &= (\un A^\top)^{-1}.}\tag 4$$
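
A numerical spot check of properties (2), (3) and (4) (a sketch; random matrices are almost surely nonsingular):

import numpy as np

np.random.seed(0)
P = np.random.rand(3,3)
Q = np.random.rand(3,3)
# (P Q)^T = Q^T P^T
print(np.allclose((P.dot(Q)).T, Q.T.dot(P.T)))
# |P Q| = |P| |Q|
print(np.allclose(np.linalg.det(P.dot(Q)), np.linalg.det(P)*np.linalg.det(Q)))
# (P Q)^{-1} = Q^{-1} P^{-1}
print(np.allclose(np.linalg.inv(P.dot(Q)), np.linalg.inv(Q).dot(np.linalg.inv(P))))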

If the transpose of a square matrix is its inverse, i.e.,

$$\un A^\top\un A = \un A\un A^\top = \un I,\tag 5$$

then it is referred to as an *orthonormal matrix*.

The *trace* of a square matrix is the sum of its diagonal elements:

$${\rm tr}\ \un A = a_{11}+a_{22}.$$

The trace has the properties $$\eqalign{ {\rm tr}(\un A + \un B) &= {\rm tr} \un A + {\rm tr} \un B \cr {\rm tr}(\un A\un B) &= {\rm tr}(\un B\un A).}\tag 6$$

In [2]:
# a random 3x3 matrix is almost surely nonsingular, so C.I*C gives the identity
C = mat(random.rand(3,3))
print C.I*C
print trace(C.I*C)
[[  1.00000000e+00  -7.86079002e-17  -1.11975035e-16]
 [  8.93952029e-17   1.00000000e+00   2.79320690e-16]
 [ -7.44800763e-17   4.32583335e-17   1.00000000e+00]]
3.0
In [3]:
# B is singular, so B.I is numerically meaningless: note the huge entries
print B.I
[[ -2.48612733e+16   1.07341737e+18  -9.01777393e+17]
 [  6.91393290e+17  -1.84434380e+18  -0.00000000e+00]
 [ -5.59704760e+17  -9.12504570e+17   2.15398984e+18]]

The $p\times p$ matrix $\un A$ is *symmetric* if

$$\un A^\top = \un A.$$

The $p\times p$ symmetric matrix $\un A$ is *positive definite* if

$$\un x^\top\un A\un x > 0\quad\hbox{for all nonzero }p\hbox{-dimensional vectors } \un x.$$

The expression $\un x^\top\un A\un x$ is called a *quadratic form*.

**Exercise 1:** Prove with the help of (6) that $\un x^\top\un A\un x = {\rm tr}(\un A \un x\un x^\top)$.

Hint: Start with $$ y = \un x^\top\un A\un x, $$ multiply from the left and right with $\un x$ and $\un x^\top$, $$ \un x y \un x^\top = \un x\un x^\top\un A\un x\un x^\top, $$ and note that, since $y$ is a scalar, the LHS can be written $\un x \un x^\top y$.

Cholesky decomposition: If $\un A$ is symmetric positive definite, there exists a lower triangular matrix $\un L$ such that $$\un A = \un L\un L^\top.$$

In [4]:
# construct a symmetric positive definite matrix
C = mat(random.rand(100,3))
S = C.T*C/100
print S
print
# its lower triangular Cholesky factor
L = linalg.cholesky(S)
print L
print
# reconstruct S
print L*L.T
[[ 0.3630402   0.26191309  0.28761121]
 [ 0.26191309  0.32861809  0.25951777]
 [ 0.28761121  0.25951777  0.3711582 ]]

[[ 0.60252817  0.          0.        ]
 [ 0.4346902   0.3737145   0.        ]
 [ 0.47734069  0.13920374  0.3520318 ]]

[[ 0.3630402   0.26191309  0.28761121]
 [ 0.26191309  0.32861809  0.25951777]
 [ 0.28761121  0.25951777  0.3711582 ]]

If $|\un A|=0$, then $\un A$ has no inverse and is said to be a *singular matrix*. If $\un A$ is *nonsingular*, then the equation

$$ \un A \un x = \un 0 $$

only has the *trivial solution* $\un x=\un 0$. Thus, if $\un A$ is non-singular and $\un A\un x = \un 0$,

$$ \un A^{-1}\un A\un x = \un I\un x = \un x = \un 0. $$

If $\un A$ is singular, the equation has at least one *nontrivial solution* $\un x \ne \un 0$.
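
For example, the rank-one matrix $\un B=\un A\un A^\top$ from the first code cell is singular, and any vector orthogonal to $\un A$ is a nontrivial solution of $\un B\un x=\un 0$. A minimal sketch, reusing the notebook variables A and B defined above:

# x is orthogonal to A by construction, so B*x = A*(A.T*x) = 0
x = mat([[A[1,0]], [-A[0,0]], [0.0]])
print(B*x)     # essentially the zero vector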

A set of vectors $\un x_1\dots \un x_r$ is said to be *linearly dependent* if there exists a set of scalars $c_1\dots c_r$, not all of which are zero, such that $$\sum_{i=1}^r c_i\un x_i = \un 0.$$ Otherwise they are *linearly independent*.

A matrix is said to have *rank* $r$ if the maximum number of linearly independent columns is $r$.
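
numpy can report the rank directly; a quick check on the matrices from the cells above (an illustrative sketch):

print(linalg.matrix_rank(B))   # 1: B = A*A.T has only one linearly independent column
print(linalg.matrix_rank(S))   # 3: S is (almost surely) of full rank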

**Eigenvalues and eigenvectors**

An *eigenvalue problem*: Find *eigenvectors* $\un x$ and *eigenvalues* $\lambda$ that satisfy the equation

$$\un A \un x = \lambda \un x,\tag 7$$

where $\un A$ is a symmetric and positive definite matrix. This can be written equivalently as

$$(\un A-\lambda\un I)\un x = \un 0,$$

so for a nontrivial solution we must have

$$|\un A-\lambda\un I| = 0.$$

This is called the *characteristic equation*. If $\un x$ has dimension $N$, then the characteristic equation has $N$ solutions $\lambda_i$, $i=1\dots N$. Correspondingly there are $N$ eigenvectors $\un x=\un x^{(i)}$, $i=1\dots N$.
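
The connection can be checked numerically: numpy's poly returns the coefficients of the characteristic polynomial of a square matrix and roots returns its solutions, which coincide (up to ordering) with the output of linalg.eig. A sketch using the matrix S from above:

# coefficients of |S - lambda I| as a polynomial in lambda
c = poly(S)
print(roots(c))          # roots of the characteristic equation
print(linalg.eig(S)[0])  # eigenvalues computed directly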

**Example:** A symmetric $2\times 2$ matrix eigenvalue problem, $\un A\un x = \lambda \un x$. The characteristic equation is

$$(a_{11}-\lambda)(a_{22}-\lambda)-a^2_{12} = 0.$$

This is a quadratic equation in $\lambda$ with solutions

$$\eqalign{ \lambda^{(1)} & = {1\over 2}\left( a_{11}+a_{22} +\sqrt{(a_{11}+a_{22})^2-4(a_{11}a_{22}-a_{12}^2)}\right)\cr \lambda^{(2)} & = {1\over 2}\left( a_{11}+a_{22} -\sqrt{(a_{11}+a_{22})^2-4(a_{11}a_{22}-a_{12}^2)}\right).}$$

Thus there are two eigenvalues and, correspondingly, two eigenvectors

$$\un x^{(1)},\quad\un x^{(2)}.$$

The eigenvectors can be obtained by first substituting $\lambda^{(1)}$ and then $\lambda^{(2)}$ into Equation (7) and solving for $x_1$ and $x_2$ each time.
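
A numerical check of the closed-form eigenvalues (a minimal sketch; the symmetric $2\times 2$ matrix is an arbitrary choice):

import numpy as np

A2 = np.array([[2.0, 1.0], [1.0, 3.0]])
a11, a12, a22 = A2[0,0], A2[0,1], A2[1,1]
d = np.sqrt((a11 + a22)**2 - 4*(a11*a22 - a12**2))
# closed-form solutions of the characteristic equation
print(((a11 + a22 + d)/2, (a11 + a22 - d)/2))
# eigenvalues from numpy (returned in ascending order)
print(np.linalg.eigvalsh(A2))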

**Exercise 2:** Show that the eigenvectors are orthogonal, $$(\un x^{(1)})^\top \un x^{(2)} = 0.$$

The eigenvectors can always be chosen to have unit length,

$$\|\un x^{(1)}\| = \|\un x^{(2)}\| = 1.$$

The matrix formed by two eigenvectors, for instance $$\un U =(\un x^{(1)},\un x^{(2)}) = \pmatrix{x^{(1)}_1 & x^{(2)}_1\cr x^{(1)}_2 & x^{(2)}_2},$$ is orthonormal: $$\un U^\top\un U = \un I$$ and is said to *diagonalize* the matrix $\un A$: $$\un U^\top \un A \un U = \pmatrix{\lambda^{(1)} & 0\cr 0 & \lambda^{(2)}} = \bf\Lambda.\tag 8$$

In [5]:
# eigenvalues and eigenvectors of the symmetric matrix S
lmda,U = linalg.eig(S)
print lmda
print
print U
print
# the eigenvector matrix U is orthonormal ...
print U.T*U
print
# ... and diagonalizes S
print U.T*S*U
[ 0.89482078  0.07820002  0.08979569]

[[ 0.59039778  0.79684008  0.12836024]
 [ 0.54568212 -0.27690209 -0.79092114]
 [ 0.59469445 -0.53700198  0.59830376]]

[[  1.00000000e+00   5.01338207e-17   1.16485150e-16]
 [  5.01338207e-17   1.00000000e+00  -4.96955456e-15]
 [  1.16485150e-16  -4.96955456e-15   1.00000000e+00]]

[[  8.94820784e-01   1.57752446e-17   5.84431399e-17]
 [  4.82286377e-17   7.82000157e-02  -4.43556613e-16]
 [  4.50094104e-17  -4.47257593e-16   8.97956913e-02]]

**Singular value decomposition (SVD)**

Simply by rewriting (8), we get a special form of *singular value decomposition* (SVD) for symmetric matrices:

$$\un A = \un U\bf\Lambda\un U^\top.$$

This says that any symmetric matrix $\un A$ can be factored into the product of an orthonormal matrix $\un U$ times a diagonal matrix $\bf\Lambda$ times the transpose of $\un U$ (also called the *eigen-decomposition of $\un A$*). It can be written in the form

$$ \un A = \sum_{i=1}^n \lambda_i\un x^{(i)}\un x^{(i)\top}, $$

where $n$ is the dimension of $\un A$ and $\un x^{(i)}$ is the unit eigenvector belonging to $\lambda_i$.

The SVD of a general $m\times n$ matrix $\un A$ with $m>n$ is

$$\un A = \un U\un W \un V^\top$$

where $\un U$ is $m\times n$ with $\un U^\top\un U = \un I_n$,

$$\un W = \pmatrix{w_1 & \cdots & 0 \cr \vdots &\ddots &\vdots\cr 0 &\cdots & w_n}$$ is the diagonal matrix of *singular values* $w_i\ge 0$,

and $\un V$ is $n\times n$ with $\un V^\top\un V = \un V\un V^\top = \un I_n$.
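
A minimal sketch of the rectangular case with an arbitrary $5\times 3$ matrix:

import numpy as np

np.random.seed(0)
M = np.random.rand(5,3)
# economy-size SVD: U is 5x3, w holds the three singular values, Vt is V^T
U5, w5, Vt = np.linalg.svd(M, full_matrices=False)
print(np.allclose(U5.T.dot(U5), np.eye(3)))          # U^T U = I_3
print(np.allclose(M, U5.dot(np.diag(w5)).dot(Vt)))   # M = U W V^T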

In [6]:
# SVD of the symmetric positive definite matrix S
U,w,_ = linalg.svd(S)
print U
print
print w
# SVD of the singular matrix B: only one nonzero singular value
U,w,_ = linalg.svd(B)
print
print U
print
print w
[[-0.59039778  0.12836024 -0.79684008]
 [-0.54568212 -0.79092114  0.27690209]
 [-0.59469445  0.59830376  0.53700198]]

[ 0.89482078  0.08979569  0.07820002]

[[-0.87177606  0.48711743  0.0521834 ]
 [-0.32680464 -0.49888006 -0.80269385]
 [-0.36497291 -0.71682306  0.59410393]]

[  1.96098318e-01   5.43071062e-18   6.01670419e-19]

If $\un A$ is singular and we sort the eigenvalues and eigenvectors in order of decreasing eigenvalue, then the eigendecomposition reads

$$ \un A = \sum_{i=1}^r \lambda_i\un x^{(i)}\un x^{(i)\top}, $$

where $r$ is the number of nonzero eigenvalues. Accordingly,

$$ \un A = \un {U_r}\bf\Lambda_r\un U_r^\top, $$

where

$$ \un U_r = (\un {x}^{(1)},\dots , \un x^{(r)}),\quad \bf\Lambda_r = \pmatrix{\lambda_1 & 0 & \cdots & 0 \cr 0 & \lambda_2 & \cdots & 0 \cr \vdots & \vdots & \ddots & \vdots \cr 0 & 0 & \cdots & \lambda_r} . $$

The pseudoinverse of the symmetric, singular matrix $\un A$ is defined as

$$ \un A^+ = \un U_r\bf\Lambda_r^{-1}\un U_r^\top. $$
In [7]:
# B has rank one, so its pseudoinverse involves only the first singular value and vector
u = U[:,0]
print 'pseudo inverse =\n %s' % str(u*(1/w[0])*u.T)
pseudo inverse =
 [[ 3.87557377  1.45284502  1.62252613]
 [ 1.45284502  0.54463127  0.60824001]
 [ 1.62252613  0.60824001  0.67927776]]

Suppose $\un A$ is an $N\times N$ symmetric matrix. Then its eigenvectors $\un x_i$, $i=1\dots N$, can be chosen to be orthonormal (mutually orthogonal and of unit length), and any $N$-component vector $\un u$ can be expressed as a linear combination of them,

$$\un u = \sum_{j=1}^N\beta_j\un x_j.$$

But then we have

$$\un u^\top\un A\un u = \un u^\top\sum_i\beta_i\lambda_i\un x_i = \sum_{j=1}^N\beta_j\un x_j^\top\sum_i\beta_i\lambda_i\un x_i =\sum_i\beta_i^2\lambda_i,$$

where $\lambda_i$, $i=1\dots N$, are the eigenvalues of $\un A$.

So $\un A$ is positive definite if and only if all of its eigenvalues are positive.
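
This gives a simple numerical test for positive definiteness; a sketch reusing S and B from the cells above:

# S = C.T*C/100 is positive definite: all eigenvalues are positive
print(linalg.eigvalsh(S))
# B = A*A.T is only positive semidefinite: two eigenvalues are (numerically) zero
print(linalg.eigvalsh(B))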

**Lagrange multipliers**

A *vector partial derivative* is written in the form ${\partial\over\partial\un x}$. For instance in two dimensions:

$${\partial\over \partial\un x} = \pmatrix{1\cr 0}{\partial\over \partial x_1}+\pmatrix{0\cr 1}{\partial\over \partial x_2}.$$

This operator (also called the *gradient*) arranges the partial derivatives of any scalar function of the vector $\un x$ with respect to each of its components into a column vector, e.g.,

$${\partial f(\un x)\over\partial\un x}=\pmatrix{{\partial f(\un x)\over\partial x_1}\cr {\partial f(\un x)\over\partial x_2}}.$$

The operations with vector derivatives correspond closely to operations with ordinary scalar derivatives:

$$\eqalign{{\partial \over \partial\un x}(\un x^\top \un y) &= \un y\quad\hbox{analogous to} \quad {\partial\over \partial x}xy=y\cr {\partial \over \partial\un x}(\un x^\top \un x) &= 2\un x\quad\hbox{analogous to} \quad {\partial \over \partial x}x^2=2x.}$$

For quadratic forms we have

$$\eqalign{ {\partial\over\partial \un x}(\un x^\top \un A \un y)&=\un A \un y\cr {\partial\over\partial \un y}(\un x^\top \un A \un y)&=\un A^\top \un x}$$

and

$${\partial\over\partial \un x}(\un x^\top \un A \un x)=\un A \un x+\un A^\top\un x.$$

If $\un A$ is a symmetric matrix

$${\partial\over\partial \un x}(\un x^\top \un A \un x)=2\un A \un x.$$
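
The symmetric case can be verified with a finite-difference gradient (a minimal sketch; the matrix and the point are arbitrary choices):

import numpy as np

A2 = np.array([[2.0, 1.0], [1.0, 3.0]])   # symmetric
f = lambda x: x.dot(A2).dot(x)            # quadratic form x^T A x
x0 = np.array([0.5, -1.0])
eps = 1e-6
# central-difference approximation to the gradient of f at x0
grad = np.array([(f(x0 + eps*e) - f(x0 - eps*e))/(2*eps) for e in np.eye(2)])
print(grad)           # approximately 2 A x0
print(2*A2.dot(x0))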

**Exercise 3:** What is ${\partial\over\partial\un x} {1 \over \un x^\top\un A \un y}$?

Suppose $x^*$ is a stationary point of the function $f(x)$, by which is meant that its first derivative vanishes at that point:

$$ {d\over dx}f(x^*)={d\over dx}f(x)\Big|_{x=x^*}= 0. $$

Then $f(x^*)$ is a local minimum if the second derivative at $x^*$ is positive,

$$ {d^2\over dx^2}f(x^*)>0. $$

This is easily seen from the Taylor expansion of $f(x)$ about $x^*$:

$$ f(x)= f(x^*)+(x-x^*){d\over dx}f(x^*)+{1\over 2}(x-x^*)^2{d^2\over dx^2}f(x^*)+\dots\ . $$

The Taylor expansion of a scalar function of a vector $\un x$ about $\un x^*$ is

$$f(\un x) = f(\un x^*)+(\un x-\un x^*)^\top{\partial f(\un x^*)\over\partial\un x} +{1\over 2}(\un x-\un x^*)^\top\un H(\un x^*)(\un x-\un x^*)+\dots,$$

where $\un H$ is called the *Hessian matrix*. Its elements are given by

$$\left(\un H(\un x^*)\right)_{ij} = {\partial^2 f(\un x^*)\over\partial x_i\partial x_j},$$

and it is a symmetric matrix. Note that we can write

$$\un H(\un x^*)= {\partial\over\partial\un x} {\partial\over\partial\un x^\top} f(\un x^*) = {\partial^2 f(\un x^*)\over\partial \un x\partial\un x^\top}.$$

At the stationary point the vector derivative with respect to $\un x$ vanishes,

$${\partial f(\un x^*)\over\partial\un x}=\un 0,$$

so we have

$$f(\un x) \approx f(\un x^*)+{1\over 2}(\un x-\un x^*)^\top\un H(\un x^*)(\un x-\un x^*).$$

The condition for a local minimum is that the Hessian matrix be positive definite at the point $\un x^*$.
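
For the quadratic form $f(\un x)=\un x^\top\un A\un x$ with symmetric positive definite $\un A$, the Hessian is $2\un A$ everywhere, so the stationary point $\un x^*=\un 0$ is a minimum. A finite-difference sketch (same arbitrary matrix as above):

import numpy as np

A2 = np.array([[2.0, 1.0], [1.0, 3.0]])
f = lambda x: x.dot(A2).dot(x)
x0 = np.zeros(2)                          # the stationary point of f
h = 1e-4
# finite-difference Hessian of f at x0
H = np.array([[(f(x0 + h*(ei + ej)) - f(x0 + h*ei) - f(x0 + h*ej) + f(x0))/h**2
               for ej in np.eye(2)] for ei in np.eye(2)])
print(H)                                  # approximately 2 A
print(np.all(np.linalg.eigvalsh(H) > 0))  # True: H is positive definite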

Suppose that we wish to find the position $\un x^*$ of a minimum (or maximum) of a function $f(\un x)$. If there are no constraints, then we solve the set of equations

$${\partial f(\un x)\over \partial x_i}= 0,\quad i=1,2\dots$$

or, in terms of our notation for vector derivatives,

$${\partial f(\un x)\over \partial\un x}=\un 0,$$

and examine the Hessian matrix at the solution point.

Suppose that $\un x$ is *constrained* by the condition

$$h(\un x) = 0.$$

For example, in two dimensions we might have the constraint

$$h(\un x)=x_1^2+x_2^2-1=0$$

which requires $\un x$ to lie on a circle of radius 1.

A very general way to find an extremum of $f(\un x)$ subject to $h(\un x)=0$ is to determine an *unconstrained* minimum or maximum of the *Lagrange function*

$$L(\un x) = f(\un x)+\lambda h(\un x).$$

The quantity $\lambda$ is a *Lagrange multiplier*. To find the extremum, we solve the set of equations $$\eqalign{ {\partial\over \partial \un x}(f(\un x)+\lambda h(\un x))&= \un 0,\cr {\partial\over \partial \lambda}(f(\un x)+\lambda h(\un x))&= 0}$$ for $\un x$ and $\lambda$.
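
The same kind of constrained extremum can also be found numerically, for example with scipy's SLSQP solver (a sketch; the objective here is an arbitrary linear function and the constraint is the unit circle used as an example above):

import numpy as np
from scipy.optimize import minimize

f = lambda x: x[0] + x[1]                 # objective
h = lambda x: x[0]**2 + x[1]**2 - 1       # constraint h(x) = 0 (unit circle)
res = minimize(f, x0=[1.0, 0.0], method='SLSQP',
               constraints=[{'type': 'eq', 'fun': h}])
print(res.x)    # approximately (-0.707, -0.707), i.e. -(1,1)/sqrt(2)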

**Exercise 4:** Find the minimum of

$$f(\un x) = ax_1^2 + bx_2^2$$

for $a,b>0$ subject to the constraint

$$x_1+x_2=1.$$

**Example:** (PCA) Find the maximum of

$$f(\un x) = \un x^\top\un C\un x,$$

where $\un C$ is symmetric positive definite, subject to the constraint $$h(\un x) = \un x^\top\un x - 1 = 0.$$ The Lagrange function is

$$L(\un x) = \un x^\top\un C \un x - \lambda(\un x^\top \un x - 1).$$

Any extremum of $L$ occurs at a value of $\un x$ for which

$${\partial L\over \partial\un x}=2\un C \un x-2\lambda\un x=\un 0,$$

which is the eigenvalue problem

$$\un C \un x = \lambda\un x.$$

Let $\un u$ be a unit length eigenvector with eigenvalue $\lambda$. Then

$$\un u^\top\un C\un u = \lambda\un u^\top\un u = \lambda.$$

So to maximize $f(\un x)$ we must choose the eigenvector of $\un C$ with maximum eigenvalue.
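
A numerical illustration (a self-contained sketch): the quadratic form evaluated at the leading unit eigenvector equals the largest eigenvalue, and a random unit vector never does better.

import numpy as np

np.random.seed(0)
G = np.random.rand(100, 3)
Cov = G.T.dot(G)/100.0                  # symmetric positive definite
lam, V = np.linalg.eigh(Cov)            # eigenvalues in ascending order
v = V[:, -1]                            # unit eigenvector with the largest eigenvalue
print(v.dot(Cov).dot(v))                # equals lam[-1], the maximum of x^T C x
print(lam[-1])
# a random unit vector gives a smaller value
x = np.random.rand(3) - 0.5
x = x/np.linalg.norm(x)
print(x.dot(Cov).dot(x))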