**Topics:**
Vectors and matrices
Random variables
Point and interval parameter estimation
Multivariate statistics
Linear regression
Multiple linear regression, Kalman filter
Hypothesis testing
Data assimilation
Spatial statistics
**References:**
J. E. Freund (2003), *Mathematical Statistics with Applications*, 7th Ed., Prentice Hall.
K. V. Mardia et al. (1979), *Multivariate Analysis*, Academic Press.
M. J. Canty (2014), *Image Analysis, Classification and Change Detection in Remote Sensing*, 3rd Ed., Taylor and Francis.
**Course Material:**
Download and install Anaconda2 (Python 2.7 version)
https://www.continuum.io/downloads
In a command window enter
jupyter notebook
Point your browser (preferably Firefox) to localhost:8888
The *sum* of two column vectors is given by
$$\un x + \un y = \pmatrix{x_1\cr x_2}+\pmatrix{y_1\cr y_2}=\pmatrix{x_1+y_1\cr x_2+y_2},$$and their *inner product* by
$$\un x^\top \un y =(x_1,x_2) \pmatrix{y_1\cr y_2}=x_1y_1+x_2y_2.$$The *length* or *(2-)norm* of the vector $\un x$ is
$$\|\un x\| = \sqrt{x_1^2+x_2^2} = \sqrt{\un x^\top \un x}\ .$$The inner product can be written as
$$\un x^\top \un y = \|\un x\|\|\un y\|\cos(\theta)$$where $\theta$ is the angle subtended by $\un x$ and $\un y$. If $\theta=\pi/2$ then the vectors are *orthogonal*.
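These operations can be checked directly in numpy (a minimal sketch; the two column vectors are arbitrary example values):
from numpy import *
# two column vectors (arbitrary example values)
x = mat([[1.0],[2.0]])
y = mat([[3.0],[-1.0]])
# sum, inner product and 2-norm
print x + y
print (x.T*y)[0,0]
print sqrt((x.T*x)[0,0])
# angle subtended by x and y, from the inner product formula
print arccos((x.T*y)[0,0]/(linalg.norm(x)*linalg.norm(y)))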
The product of two $2\times 2$ matrices is given by
$$\eqalign{ \un A \un B &=\pmatrix{a_{11} & a_{12}\cr a_{21}&a_{22}} \pmatrix{b_{11} & b_{12}\cr b_{21}& b_{22}}\cr &=\pmatrix{a_{11}b_{11}+a_{12}b_{21} & a_{11}b_{12}+a_{12}b_{22}\cr a_{21}b_{11}+a_{22}b_{21} & a_{21}b_{12}+a_{22}b_{22}}}$$The matrix product $\un A\un B$ is allowed whenever $\un A$ has the same number of columns as $\un B$ has rows.
So if $\un A$ has dimension $\ell\times m$ and $\un B$ has dimension $m\times n$, then $\un A\un B$ is $\ell\times n$ with elements
$$(\un A\un B)_{ij} = \sum_{k=1}^m a_{ik}b_{kj}\quad i=1\dots \ell,\ j=1\dots n.$$Matrix multiplication is not *commutative*, i.e. $\un A\un B\ne\un B\un A$ in general. However it is *associative*:
$$(\un A\un B)\un C = \un A(\un B\un C)= \un A \un B \un C. \tag 1 $$The *outer product* of two vectors of equal length, written $\un x \un y^\top$, is a matrix, e.g.,
$$\un x \un y^\top = \pmatrix{x_1\cr x_2} (y_1, y_2)= \pmatrix{x_1&0\cr x_2&0} \pmatrix{y_1&y_2\cr0&0} =\pmatrix{x_1y_1&x_1y_2\cr x_2y_1&x_2y_2}.$$
from numpy import *
# a random 3-component column vector
A = mat(random.rand(3,1))
print A
print
# the outer product A A^T is a 3x3 matrix of rank 1 ...
B = A*A.T
print B
print
# ... so its determinant is (numerically) zero
print linalg.det(B)
Matrices, like vectors, have a transposed form, obtained by interchanging their rows and columns.
$$\un A = \pmatrix{a_{11} & a_{12}\cr a_{21}&a_{22}},\quad\un A^\top = \pmatrix{a_{11} & a_{21}\cr a_{12}&a_{22}}.$$Transposition has the properties $$\eqalign{ (\un A+\un B)^\top &= \un A^\top + \un B^\top\cr (\un A\un B)^\top &= \un B^\top\un A^\top.}\tag 2$$
The determinant of a $2\times 2$ matrix is given by
$$|\un A| = a_{11}a_{22}-a_{12}a_{21}.$$The determinant has the properties $$\eqalign{ |\un A\un B| &= |\un A| |\un B| \cr |\un A^\top| &= |\un A|.}\tag 3$$
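Properties (2) and (3) can be verified numerically (a quick sketch with arbitrary random matrices; the differences printed should be zero up to rounding):
from numpy import *
P = mat(random.rand(2,2))
Q = mat(random.rand(2,2))
# transposition property (2): (PQ)^T = Q^T P^T
print (P*Q).T - Q.T*P.T
# determinant properties (3): |PQ| = |P||Q| and |P^T| = |P|
print linalg.det(P*Q) - linalg.det(P)*linalg.det(Q)
print linalg.det(P.T) - linalg.det(P)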
The $2\times 2$ *identity matrix* is
$$\un I = \pmatrix{1&0\cr 0&1},$$and for any $\un A$,
$$\un I \un A = \un A \un I= \un A.$$The *matrix inverse* $\un A^{-1}$ of a square matrix $\un A$ is defined in terms of the identity matrix: $$\un A^{-1} \un A = \un A \un A^{-1} = \un I.$$
For a $2\times 2$ matrix, for example:
$$\un A^{-1}={1\over |\un A|}\pmatrix{a_{22}& -a_{12}\cr -a_{21}&a_{11}}.$$The matrix inverse has the properties $$\eqalign{ (\un A\un B)^{-1} &= \un B^{-1}\un A^{-1}\cr (\un A^{-1})^\top &= (\un A^\top)^{-1}.}\tag 4$$
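The inverse properties (4) can be checked in the same way (again a sketch; random matrices are nonsingular with probability one):
from numpy import *
P = mat(random.rand(2,2))
Q = mat(random.rand(2,2))
# inverse properties (4): (PQ)^-1 = Q^-1 P^-1 and (P^-1)^T = (P^T)^-1
print (P*Q).I - Q.I*P.I
print P.I.T - P.T.I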
If the transpose of a square matrix is its inverse, i.e.,
$$\un A^\top\un A = \un A\un A^\top = \un I,\tag 5$$then it is referred to as an *orthonormal matrix*.
The *trace* of a square matrix is the sum of its diagonal elements:
$${\rm tr}\ \un A = a_{11}+a_{22}.$$The trace has the properties $$\eqalign{ {\rm tr}(\un A + \un B) &= {\rm tr} \un A + {\rm tr} \un B \cr {\rm tr}(\un A\un B) &= {\rm tr}(\un B\un A).}\tag 6$$
# a random 3x3 matrix; C.I*C should be the identity up to rounding
C = mat(random.rand(3,3))
print C.I*C
# the trace of the 3x3 identity is 3
print trace(C.I*C)
# B from above is singular, so B.I is meaningless (or raises an error)
print B.I
The $p\times p$ matrix $\un A$ is *symmetric* if
$$\un A^\top = \un A.$$The $p\times p$ symmetric matrix $\un A$ is *positive definite* if
$$\un x^\top\un A\un x > 0\quad\hbox{for all }p\hbox{-dimensional vectors } \un x.$$The expression $\un x^\top\un A\un x$ is called a *quadratic form*.
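A small illustration (a sketch: the matrix is constructed as $\un M^\top\un M$, which is symmetric and, for a random $\un M$ with full column rank, positive definite):
from numpy import *
M = mat(random.rand(10,3))
Q = M.T*M
# the quadratic form z^T Q z is positive for any nonzero z
for _ in range(5):
    z = mat(random.rand(3,1)) - 0.5
    print (z.T*Q*z)[0,0]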
**Exercise 1:** Prove with the help of (6) that $\un x^\top\un A\un x = {\rm tr}(\un A \un x\un x^\top)$.
Hint: Start with $$ y = \un x^\top\un A\un x, $$ multiply from the left and right with $\un x$ and $\un x^\top$, $$ \un x y \un x^\top = \un x\un x^\top\un A\un x\un x^\top, $$ and note that, since $y$ is a scalar, the LHS can be written $\un x \un x^\top y$.
Cholesky decomposition: If $\un A$ is symmetric positive definite, there exists a lower triangular matrix $\un L$ such that $$\un A = \un L\un L^\top.$$
# construct a symmetric positive definite matrix
C = mat(random.rand(100,3))
S = C.T*C/100
print S
print
# lower triangular Cholesky factor L with S = L L^T
L = linalg.cholesky(S)
print L
print
# check: L L^T reproduces S
print L*L.T
If $|\un A|=0$, then $\un A$ has no inverse and is said to be a *singular matrix*. If $\un A$ is *nonsingular*, then the equation
$$ \un A \un x = \un 0 $$only has the *trivial solution* $\un x=\un 0$. Thus, if $\un A$ is nonsingular and $\un A\un x = \un 0$,
$$ \un A^{-1}\un A\un x = \un I\un x = \un x = \un 0. $$If $\un A$ is singular, the equation has at least one *nontrivial solution* $\un x \ne \un 0$.
A set of vectors $\un x_1\dots \un x_r$ is said to be *linearly dependent* if there exists a set of scalars $c_1\dots c_r$, not all of which are zero, such that $$\sum_{i=1}^r c_i\un x_i = \un 0.$$ Otherwise they are *linearly independent*.
A matrix is said to have *rank* $r$ if the maximum number of linearly independent columns is $r$.
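numpy can report the rank directly (a sketch; matrix_rank estimates the number of linearly independent columns from the SVD):
from numpy import *
# a rank-one matrix formed from a single column vector
a = mat(random.rand(3,1))
print linalg.matrix_rank(a*a.T)
# a generic random 3x3 matrix has full rank 3 (with probability one)
print linalg.matrix_rank(mat(random.rand(3,3)))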
**Eigenvalues and eigenvectors**
An *eigenvalue problem*: Find *eigenvectors* $\un x$ and *eigenvalues* $\lambda$ that satisfy the equation
$$\un A \un x = \lambda \un x,\tag 7$$where $\un A$ is a symmetric and positive definite matrix. This can be written equivalently as
$$(\un A-\lambda\un I)\un x = \un 0,$$so for a nontrivial solution we must have
$$|\un A-\lambda\un I| = 0.$$This is called the *characteristic equation*. If $\un x$ has dimension $N$, then the characteristic equation has $N$ solutions $\lambda_i$, $i=1\dots N$. Correspondingly there are $N$ eigenvectors $\un x=\un x^{(i)}$, $i=1\dots N$.
**Example:** A $2\times 2$ symmetric matrix eigenvalue problem, $\un A\un x = \lambda \un x$. The characteristic equation is
$$(a_{11}-\lambda)(a_{22}-\lambda)-a^2_{12} = 0.$$This is a quadratic equation in $\lambda$ with solutions
$$\eqalign{ \lambda^{(1)} & = {1\over 2}\left( a_{11}+a_{22} +\sqrt{(a_{11}+a_{22})^2-4(a_{11}a_{22}-a_{12}^2)}\right)\cr \lambda^{(2)} & = {1\over 2}\left( a_{11}+a_{22} -\sqrt{(a_{11}+a_{22})^2-4(a_{11}a_{22}-a_{12}^2)}\right).}$$Thus there are two eigenvalues and, correspondingly, two eigenvectors
$$\un x^{(1)},\quad\un x^{(2)}.$$The eigenvectors can be obtained by first substituting $\lambda^{(1)}$ and then $\lambda^{(2)}$ into Equation (7) and solving for $x_1$ and $x_2$ each time.
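The closed-form eigenvalues can be compared with numpy's (a sketch with an arbitrary symmetric $2\times 2$ matrix; eigh is appropriate here because the matrix is symmetric, and it returns the eigenvalues in ascending order):
from numpy import *
# an arbitrary symmetric 2x2 matrix
a11, a12, a22 = 2.0, 0.5, 1.0
A2 = mat([[a11, a12], [a12, a22]])
# eigenvalues from the quadratic formula above
d = sqrt((a11 + a22)**2 - 4*(a11*a22 - a12**2))
print (a11 + a22 + d)/2, (a11 + a22 - d)/2
# numpy's eigenvalues (ascending order)
print linalg.eigh(A2)[0]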
**Exercise 2:** Show that the eigenvectors are orthogonal, $$(\un x^{(1)})^\top \un x^{(2)} = 0.$$
The eigenvectors can always be chosen to have unit length,
$$\|\un x^{(1)}\| = \|\un x^{(2)}\| = 1.$$The matrix formed by the two eigenvectors, $$\un U =(\un x^{(1)},\un x^{(2)}) = \pmatrix{x^{(1)}_1 & x^{(2)}_1\cr x^{(1)}_2 & x^{(2)}_2},$$ is then orthonormal: $$\un U^\top\un U = \un I,$$ and is said to *diagonalize* the matrix $\un A$: $$\un U^\top \un A \un U = \pmatrix{\lambda^{(1)} & 0\cr 0 & \lambda^{(2)}} = \bf\Lambda.\tag 8$$
# eigenvalues and eigenvectors of the symmetric matrix S
lmda, U = linalg.eig(S)
print lmda
print
print U
print
# the matrix of eigenvectors is orthonormal ...
print U.T*U
print
# ... and diagonalizes S, cf. Equation (8)
print U.T*S*U
**Singular value decomposition (SVD)**
Simply by rewriting (8), we get a special form of *singular value decomposition* (SVD) for symmetric matrices:
$$\un A = \un U\bf\Lambda\un U^\top.$$This says that any symmetric matrix $\un A$ can be factored into the product of an orthonormal matrix $\un U$ times a diagonal matrix $\bf\Lambda$ times the transpose of $\un U$ (also called the *eigen-decomposition of $\un A$*). It can be written in the form
$$ \un A = \sum_{i=1}^n \lambda_i\un x^{(i)}\un x^{(i)\top}. $$The SVD of a general $m\times n$ matrix $\un A$ with $m>n$ is
$$\un A = \un U\un W \un V^\top$$where $\un U$ is $m\times n$ with $\un U^\top\un U = \un I_n$,
$$\un W = \pmatrix{w_1 & \cdots & 0 \cr \vdots &\ddots &\vdots\cr 0 &\cdots & w_n}$$and $\un V$ is $n\times n$ with $\un V^\top\un V = \un V\un V^\top = \un I_n$.
# SVD of the symmetric positive definite matrix S
U,w,_ = linalg.svd(S)
print U
print
print w
# SVD of the singular matrix B: all but one singular value are (numerically) zero
U,w,_ = linalg.svd(B)
print
print U
print
print w
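The same decomposition applies to a general rectangular matrix (a sketch; the $5\times 3$ matrix G and the variable names Ug, wg, Vgt are arbitrary and chosen so as not to overwrite U and w above):
from numpy import *
G = mat(random.rand(5,3))
Ug, wg, Vgt = linalg.svd(G, full_matrices=False)
# Ug is 5x3 with Ug^T Ug = I_3, Vgt is the 3x3 orthonormal V^T
print allclose(Ug.T*Ug, eye(3))
# reconstruct G = U W V^T
print allclose(Ug*mat(diag(wg))*Vgt, G)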
If $\un A$ is singular and we order the eigenvalues and eigenvectors in order of decreasing eigenvalue, then the eigendecomposition reads
$$ \un A = \sum_{i=1}^r \lambda_i\un x^{(i)}\un x^{(i)\top}, $$where $r$ is the number of nonzero eigenvalues. Accordingly,
$$ \un A = \un {U_r}\bf\Lambda_r\un U_r^\top, $$where
$$ \un U_r = (\un {x}^{(1)},\dots , \un x^{(r)}),\quad \bf\Lambda_r = \pmatrix{\lambda_1 & 0 & \cdots & 0 \cr 0 & \lambda_2 & \cdots & 0 \cr \vdots & \vdots & \ddots & \vdots \cr 0 & 0 & \cdots & \lambda_r} . $$The pseudoinverse of the symmetric, singular matrix $\un A$ is defined as
$$ \un A^+ = \un U_r\bf\Lambda_r^{-1}\un U_r^\top. $$
# B has rank 1, so only the first singular value and vector contribute
u = U[:,0]
print 'pseudo inverse =\n %s' % str(u*(1/w[0])*u.T)
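As a cross-check applied to the same B, numpy's built-in pseudoinverse (linalg.pinv, which is also based on the SVD) should give the same matrix:
print linalg.pinv(B)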
Suppose $\un A$ is a $N\times N$ symmetric matrix. Then its eigenvectors $\un x_i$, $i=1\dots N$, are orthogonal and any $N$-component vector $\un u$ can be expressed as a linear combination of them,
$$\un u = \sum_{j=1}^N\beta_j\un x_j.$$But then we have
$$\un u^\top\un A\un u = \un u^\top\sum_i\beta_i\lambda_i\un x_i = \sum_{j=1}^N\beta_j\un x_j^\top\sum_i\beta_i\lambda_i\un x_i =\sum_i\beta_i^2\lambda_i,$$where $\lambda_i$, $i=1\dots N$, are the eigenvalues of $\un A$ and the last step uses the orthonormality of the eigenvectors, $\un x_j^\top\un x_i=\delta_{ij}$.
So $\un A$ is positive definite if and only if all of its eigenvalues are positive.
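This criterion is easy to check numerically (a sketch; eigvalsh returns the eigenvalues of a symmetric matrix in ascending order):
from numpy import *
# eigenvalues of a symmetric positive definite matrix are all positive
M = mat(random.rand(20,3))
print linalg.eigvalsh(M.T*M)
# a symmetric indefinite matrix has eigenvalues of both signs
print linalg.eigvalsh(mat([[1.0, 2.0], [2.0, 1.0]]))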
**Lagrange multipliers**
A *vector partial derivative* is written in the form ${\partial\over\partial\un x}$. For instance in two dimensions:
$${\partial\over \partial\un x} = \pmatrix{1\cr 0}{\partial\over \partial x_1}+\pmatrix{0\cr 1}{\partial\over \partial x_2}.$$This operator (also called the *gradient*) arranges the partial derivatives of any scalar function of the vector $\un x$ with respect to each of its components into a column vector, e.g.,
$${\partial f(\un x)\over\partial\un x}=\pmatrix{\partial f(\un x)\over\partial x_1\cr {\partial f(\un x)\over\partial x_2}}.$$The operations with vector derivatives correspond closely to operations with ordinary scalar derivatives:
$${\partial \over \partial\un x}(\un x^\top \un y) = \un y\quad\hbox{analogous to} \quad {\partial\over \partial x}xy=y$$$${\partial \over \partial\un x}(\un x^\top \un x) = 2\un x\quad\hbox{analogous to} \quad {\partial \over \partial x}x^2=2x.$$For quadratic forms we have
$$\eqalign{ {\partial\over\partial \un x}(\un x^\top \un A \un y)&=\un A \un y\cr {\partial\over\partial \un y}(\un x^\top \un A \un y)&=\un A^\top \un x}$$and
$${\partial\over\partial \un x}(\un x^\top \un A \un x)=\un A \un x+\un A^\top\un x.$$If $\un A$ is a symmetric matrix
$${\partial\over\partial \un x}(\un x^\top \un A \un x)=2\un A \un x.$$**Exercise 3:** What is ${\partial\over\partial\un x} {1 \over \un x^\top\un A \un y}$?
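The quadratic-form rule above can be verified with central differences (a sketch; the matrix, the point and the step size are arbitrary):
from numpy import *
# finite-difference check of d/dx (x^T A x) = (A + A^T) x
A = mat(random.rand(3,3))
x = mat(random.rand(3,1))
eps = 1e-6
grad = mat(zeros((3,1)))
for i in range(3):
    dx = mat(zeros((3,1)))
    dx[i,0] = eps
    grad[i,0] = ((x + dx).T*A*(x + dx) - (x - dx).T*A*(x - dx))[0,0]/(2*eps)
print grad
print (A + A.T)*x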
Suppose $x^*$ is a stationary point of the function $f(x)$, by which is meant that its first derivative vanishes at that point:
$$ {d\over dx}f(x^*)={d\over dx}f(x)\Big|_{x=x^*}= 0. $$Then $f(x^*)$ is a local minimum if the second derivative at $x^*$ is positive,
$$ {d^2\over dx^2}f(x^*)>0. $$This is easily seen from the Taylor expansion of $f(x)$ about $x^*$:
$$ f(x)= f(x^*)+(x-x^*){d\over dx}f(x^*)+{1\over 2}(x-x^*)^2{d^2\over dx^2}f(x^*)+\dots\ . $$The Taylor expansion of a scalar function of a vector $\un x$ about $\un x^*$ is
$$f(\un x) = f(\un x^*)+(\un x-\un x^*)^\top{\partial f(\un x^*)\over\partial\un x} +{1\over 2}(\un x-\un x^*)^\top\un H(\un x^*)(\un x-\un x^*)+\dots,$$where $\un H$ is called the *Hessian matrix*. Its elements are given by
$$\left(\un H(\un x^*)\right)_{ij} = {\partial^2 f(\un x^*)\over\partial x_i\partial x_j},$$and it is a symmetric matrix. Note that we can write
$$\un H(\un x^*)= {\partial\over\partial\un x} {\partial\over\partial\un x^\top} f(\un x^*) = {\partial^2 f(\un x^*)\over\partial \un x\partial\un x^\top}.$$At the stationary point the vector derivative with respect to $\un x$ vanishes,
$${\partial f(\un x^*)\over\partial\un x}=\un 0,$$so we have
$$f(\un x) \approx f(\un x^*)+{1\over 2}(\un x-\un x^*)^\top\un H(\un x^*)(\un x-\un x^*).$$The condition for a local minimum is that the Hessian matrix be positive definite at the point $\un x^*$.
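As a simple worked case, take $f(\un x)=\un x^\top\un A\un x$ with $\un A$ symmetric. The derivative rules above give $${\partial f\over\partial\un x}=2\un A\un x,\qquad \un H={\partial^2 f\over\partial\un x\,\partial\un x^\top}=2\un A,$$so $\un x^*=\un 0$ is a stationary point, and the requirement that the Hessian be positive definite is here just the requirement that $\un A$ itself be positive definite.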
Suppose that we wish to find the position $\un x^*$ of a minimum (or maximum) of a function $f(\un x)$. If there are no constraints, then we solve the set of equations
$${\partial f(\un x)\over \partial x_i}= 0,\quad i=1,2\dots$$or, in terms of our notation for vector derivatives,
$${\partial f(\un x)\over \partial\un x}=\un 0,$$and examine the Hessian matrix at the solution point.
Suppose that $\un x$ is *constrained* by the condition
$$h(\un x) = 0.$$For example, in two dimensions we might have the constraint
$$h(\un x)=x_1^2+x_2^2-1=0$$which requires $\un x$ to lie on a circle of radius 1.
A very general way to find an extremum of $f(\un x)$ subject to $h(\un x)=0$ is to determine an *unconstrained* minimum or maximum of the *Lagrange function*
$$L(\un x) = f(\un x)+\lambda h(\un x).$$The quantity $\lambda$ is a *Lagrange multiplier*. To find the extremum, we solve the set of equations $$\eqalign{ {\partial\over \partial \un x}(f(\un x)+\lambda h(\un x))&= \un 0,\cr {\partial\over \partial \lambda}(f(\un x)+\lambda h(\un x))&= 0}$$ for $\un x$ and $\lambda$.
**Exercise 4:** Find the minimum of
$$f(\un x) = ax_1^2 + bx_2^2$$for $a,b>0$ subject to the constraint
$$x_1+x_2=1.$$**Example:** (PCA) Find the maximum of
$$f(\un x) = \un x^\top\un C\un x,$$where $\un C$ is symmetric positive definite, subject to the constraint $$h(\un x) = \un x^\top\un x - 1 = 0.$$ The Lagrange function is
$$L(\un x) = \un x^\top\un C \un x - \lambda(\un x^\top \un x - 1).$$Any extremum of $L$ occurs at a value of $\un x$ for which
$${\partial L\over \partial\un x}=2\un C \un x-2\lambda\un x=\un 0,$$which is the eigenvalue problem
$$\un C \un x = \lambda\un x.$$Let $\un u$ be an eigenvector with eigenvalue $\lambda$. Then
$$\un u^\top\un C\un u = \lambda\un u^\top\un u = \lambda.$$So to maximize $f(\un x)$ we must choose the eigenvector of $\un C$ with maximum eigenvalue.
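A numerical confirmation of this result (a sketch; the data matrix is arbitrary, and eigh is used because $\un C$ is symmetric):
from numpy import *
# a symmetric positive definite matrix C
D = mat(random.rand(50,4))
C = D.T*D/50
lam, V = linalg.eigh(C)     # eigenvalues in ascending order
x = V[:, -1]                # unit eigenvector with the largest eigenvalue
# the quadratic form attains the largest eigenvalue at this eigenvector ...
print (x.T*C*x)[0,0], lam[-1]
# ... and a smaller value at any other unit vector
y = mat(random.rand(4,1))
y = y/linalg.norm(y)
print (y.T*C*y)[0,0]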