Summary

We introduced the notion of a random variable and its associated probability distribution and density functions

Then we defined the first two moments of those functions, mean and variance

Considering two random variables, we defined joint and marginal distributions and the idea of statistical independence

Two important distributions are normal and chi-square:

normally distributed random variables sum to normal distributions 

squares of independent, standard normally distributed random variables sum to chi-square distributions

We saw some ways of calculating the distributions of functions of a random variable with a known distribution

We defined conditional probability and discussed Bayes' theorem, which allows us to correct our prior assumptions on the basis of new observations

We introduced samples and sample functions for estimating mean and variance of probability distributions and discussed the probability distributions of these point estimators

We gave some examples of interval estimation for the mean and variance of normal distributions

We extended the concept of a random variable to a random vector and defined the variance-covariance matrix of a multivariate distribution

We also extended the idea of a point estimator to the multivariate case and gave expressions for estimating the vector mean and the covariance matrix

As an example we considered the principal components transformation, which makes use of the estimated covariance matrix to construct new random variables that successively maximize variance.

Then we looked at the theory of linear regression, first of all for one independent variable, and derived expressions for the slope and intercept parameters, their uncertainties and goodness of fit. We also explained the coefficient of determination.

We generalized to multiple linear regression, derived the normal equation and pseudoinverse, also giving a formula for the error covariance matrix. We looked at the Kalman filter for multiple linear regression on sequential measurement data.

Part 7. Hypothesis Testing

$$ \def\pr{\hbox{Pr}} \def\var{\hbox{var}} \def\cov{\hbox{cov}} \def\tr{\hbox{tr}} \def\corr{\hbox{corr}} \def\dmX{\un{\mathcal{X}}} \def\dmG{\un{\mathcal{G}}} \def\dmK{\un{\mathcal{K}}} \def\dmS{\un{\mathcal{S}}} \def\dmC{\un{\mathcal{C}}} \def\dmZ{\un{\mathcal{Z}}} \def\bma{{\mbox{\boldmath $\alpha$}}} \def\bmb{{\mbox{\boldmath $\beta$}}} \def\bmu{{\mbox{\boldmath $\mu$}}} \def\bme{{\mbox{\boldmath $\epsilon$}}} \def\bmS{{\mbox{\boldmath $\Sigma$}}} \def\bmL{{\mbox{\boldmath $\Lambda$}}} \def\bmd{{\mbox{\boldmath $\delta$}}} \def\bmD{{\mbox{\boldmath $\Delta$}}} \def\bmG{{\mbox{\boldmath $\Gamma$}}} \def\bmphi{{\mbox{\boldmath $\phi$}}} \def\bmPhi{{\mbox{\boldmath $\Phi$}}} \def\bmpsi{{\mbox{\boldmath $\psi$}}} \def\bmtheta{{\mbox{\boldmath $\theta$}}} \def\eq{\begin{equation}} \def\eeq{\end{equation}} \def\i{{\bf i}} \def\un#1{{\bf #1}}$$

Maximum likelihood

Consider a random sample $\un z(i)$, $i=1\dots m$, from a population whose density function is $p(\un z\mid \theta)$. The *likelihood function* for the sample is

$$L(\theta) = \prod_{i=1}^m p(\un z(i)\mid \theta).$$

Its logarithm

$$\mathcal{L}(\theta) = \sum_{i=1}^{m} \log p(\un z(i)\mid \theta),$$

is called the *log-likelihood*. Taking products is justified when the $\un z(i)$ are realizations of independent random vectors. The parameter set $\hat\theta$ which maximizes the likelihood function or its logarithm, i.e., which makes the observed realizations most probable, is called the *maximum likelihood estimate* of $\theta$.

For $m$ samples from the multivariate normal distribution:

$$ p(\un z) = {1\over (2\pi)^{N/2}\sqrt{|\bf\Sigma|}} \exp\left(-{1\over 2}(\un z-\bf\mu)^\top \bf\Sigma^{-1} (\un z-\bf\mu)\right), $$

we get

$$\mathcal{L}(\mu,{\bf\Sigma}) = \sum_{i=1}^{m} \log p(\un z(i)\mid {\bf\mu},{\bf\Sigma}) = -{mN\over 2}\log 2\pi-{m\over 2}\log|{\bf\Sigma}|-{1\over 2} \sum_{i=1}^{m} (\un z(i)-\mu)^\top{\bf\Sigma}^{-1}(\un z(i)-\mu).$$

To maximize with respect to $\mu$ we set

$${\partial\mathcal{L}(\mu,{\bf\Sigma})\over\partial\mu}= \sum_{i=1}^{m} {\bf\Sigma}^{-1}(\un z(i)-\mu) =\un 0,$$

giving $$\hat\mu = {1\over m}\sum_{i=1}^{m} \un z(i),$$ which is the realization $ \bar{\un z}$ of the unbiased estimator for the sample mean.

In a similar way one can show that

$$\hat{\bf\Sigma} = {1\over m}\sum_{i=1}^{m}(\un z(i)-\hat\mu)(\un z(i)-\hat\mu)^\top,$$

which is (almost) the realization $\un s$ of the unbiased sample covariance matrix estimator, except that the denominator $m-1$ is replaced by $m$, a difference which can usually be neglected for large $m$.
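As a quick numerical check, here is a minimal sketch (the sample size and the true parameters below are arbitrary choices) which draws a sample from a multivariate normal distribution, computes the maximum likelihood estimates of $\mu$ and ${\bf\Sigma}$, and compares the latter with numpy's unbiased estimate, which uses the denominator $m-1$:

In [ ]:
import numpy as np

np.random.seed(123)
m, N = 1000, 3
mu = np.array([1.0, 2.0, 3.0])
Sigma = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.3],
                  [0.0, 0.3, 1.5]])
Z = np.random.multivariate_normal(mu, Sigma, size=m)
# maximum likelihood estimates (denominator m)
mu_hat = Z.mean(axis=0)
Sigma_hat = (Z - mu_hat).T.dot(Z - mu_hat)/m
print('mu_hat = %s' % str(mu_hat))
print('Sigma_hat (ML, denominator m):')
print(Sigma_hat)
# unbiased sample covariance (denominator m-1) for comparison
print('np.cov (denominator m-1):')
print(np.cov(Z, rowvar=False))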

Some terminology

A *statistical hypothesis* is a conjecture about the distributions of one or more random variables. It might, for instance, be an assertion about the mean of a distribution, or about the equivalence of the variances of two different distributions.

We distinguish between *simple* hypotheses, for which the distributions are completely specified, for example: *the mean of a normal distribution with variance $\sigma^2$ is $\mu=0$*, and *composite* hypotheses, for which this is not the case, e.g., *the mean is $\mu\ge 0$*.

In order to test such assertions on the basis of samples of the distributions involved, it is also necessary to formulate *alternative* hypotheses. To distinguish these from the original assertions, the latter are called *null* hypotheses.

Thus we might be interested in testing the simple null hypothesis $\mu = 0$ against the composite alternative hypothesis $\mu\ne 0$.

An appropriate sample function for deciding whether or not to reject the null hypothesis in favor of its alternative is referred to as a *test statistic*.

An appropriate *test procedure* will partition the possible realizations of the test statistic into two subsets: an acceptance region for the null hypothesis and a rejection region. The latter is referred to as the *critical region*.

**Definition:** Referring to the null hypothesis as $H_0$, there are two kinds of errors which can arise from any test procedure:

  1. $H_0$ may be rejected when in fact it is true. This is called an *error of the first kind* and the probability that it will occur is denoted $\alpha$.

  2. $H_0$ may be accepted when in fact it is false, which is called an *error of the second kind* with probability of occurrence $\beta$.

The probability of obtaining a value of the test statistic within the critical region when $H_0$ is true is thus $\alpha$. The probability $\alpha$ is also referred to as the *level of significance* of the test.

It is generally the case that the lower the value of $\alpha$, the higher is the probability $\beta$ of making a second kind error. Traditionally significance levels of 0.01 or 0.05 are used.

Such values are obviously arbitrary, and for exploratory data analysis it is common to avoid specifying them altogether. Instead the *$P$-value* for the test is stated:

**Definition:** Given the observed value of a test statistic, the *$P$-value* is the lowest level of significance at which the null hypothesis could have been rejected.

High $P$-values provide evidence in favor of accepting the null hypothesis, without actually forcing one to commit to a decision.

Example: IAEA nuclear material accountancy

$H_0:$ No material is missing (simple hypothesis)

$H_1:$ One "significant quantity" is missing (simple hypothesis).

Test statistic:

$$ MUF = Inventory_0 + Input - Output - Inventory_1 $$

which might be assumed to be normally distributed with mean 0 under $H_0$.

$\alpha$ is the false alarm probability.

$1-\beta$ is the detection probability.
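To make this concrete, here is a minimal sketch with purely illustrative numbers (the measurement uncertainty of MUF and the significant quantity below are assumptions, not IAEA values): under $H_0$ the MUF statistic is normal with mean 0, the alarm threshold follows from the false alarm probability $\alpha$, and the detection probability $1-\beta$ follows from the shifted distribution under $H_1$:

In [ ]:
import scipy.stats as st

sigma_muf = 2.0    # assumed standard deviation of MUF (illustrative only)
sq = 8.0           # one "significant quantity" (illustrative only)
alpha = 0.05       # false alarm probability
# alarm threshold for a one-sided test of H_0: mean MUF = 0
threshold = st.norm.ppf(1 - alpha, loc=0, scale=sigma_muf)
# detection probability 1 - beta under H_1: mean MUF = sq
detection = 1 - st.norm.cdf(threshold, loc=sq, scale=sigma_muf)
print('alarm threshold = %f' % threshold)
print('detection probability = %f' % detection)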

Likelihood ratio test

Consider a random sample $\un z(i)$, $i=1\dots m$, from a population whose density function is $p(\un z\mid \theta)$. Here $\theta$ is the parameter set of the density function, e.g., $\theta = \{\mu,\sigma^2\}$. The likelihood function for the sample is $$L(\theta) = \prod_{i=1}^m p(\un z(i)\mid \theta).$$

**Definition:** Let $\omega$ be the space of all possible values of the parameters $\theta$ and let $\omega_0$ be a subset of that space. The likelihood ratio test for the null hypothesis $\theta\in\omega_0$ against the alternative $\theta\in\omega-\omega_0$ has the critical region

$$Q = {\max_{\theta\in \omega_0} L(\theta) \over \max_{\theta\in\omega} L(\theta)} \le k.$$

This definition simply reflects the fact that, if $H_0$ is true, the maximum likelihood for $\theta$ when restricted to $\omega_0$ should be close to the maximum likelihood for $\theta$ without that restriction. Therefore, if the likelihood ratio is small (less than or equal to some small value $k$), then $H_0$ should be rejected.

In many cases the likelihood ratio test will lead to a test statistic whose distribution is unknown. The LRT has, however, an important asymptotic property:

Theorem 1

If $\omega$ is a region of $R^q$ and $\omega_0$ is an $r$-dimensional subregion, then for each $\theta\in\omega_0$, $-2\log Q$ has an asymptotic chi-square distribution with $q-r$ degrees of freedom as the sample size $m\to\infty$.

**Example:** Given random samples $z_1,z_2\dots z_m$ from a normal distribution with mean $\mu$ and known variance $\sigma^2$. The likelihood ratio test at significance level $\alpha$ for the simple hypothesis $ H_0:\ \mu = \mu_0 $ against the alternative composite hypothesis $ H_1:\ \mu \ne \mu_0 $ has the critical region

$${L(\mu_0)\over L(\hat\mu)}\le k,$$

where $\hat\mu$ is the value which maximizes the likelihood function over all possible values of $\mu$,

$$\hat\mu = \bar z = {1\over m}\sum_{i=1}^m z_i.$$

The critical region is then

$$\eqalign{ Q = {L(\mu_0)\over L(\hat\mu)} &= \exp\left(-\sum_{i=1}^m(z_i-\mu_0)^2/2\sigma^2\right)\Bigg/\exp\left(-\sum_{i=1}^m(z_i-\bar z)^2/2\sigma^2\right)\cr &=\exp\left(-{1\over 2\sigma^2}\sum_{i=1}^m\left[(z_i-\mu_0)^2-(z_i-\bar z)^2\right]\right)\cr &= \exp\left(-{1\over 2\sigma^2/m}(\bar z - \mu_0)^2\right) \le k.} $$

Equivalently,

$$ e^{-x^2/2} \le k,\quad\hbox{or}\quad x^2 \ge -2\log k, $$

where

$$ x = {\bar z - \mu_0 \over \sigma/\sqrt{m}}. $$

So the critical region can be written in the form

$$\left|{\bar z - \mu_0 \over \sigma/\sqrt{m}}\right| > k_\alpha.$$

But under $H_0$ we know that $\bar z$ is normally distributed with mean $\mu_0$ and variance $\sigma^2/m$. Therefore $x$ has the standard normal distribution, with cumulative distribution function $\Phi(x)$, and $k_\alpha$ is determined by

$$\Phi(-k_\alpha) + 1- \Phi(k_\alpha) = \alpha,$$

or

$$1- \Phi(k_\alpha) = \alpha/2.$$

Note that the above likelihood ratio test statistic $Q$ satisfies Theorem 1:

$$ -2\log Q = -2\cdot\left(-{1\over 2\sigma^2/m}(\bar z - \mu_0)^2\right)= {(\bar z - \mu_0)^2\over\sigma^2/m}, $$

which is chi-square distributed with one degree of freedom. Since $\omega_0$ consists of the single point $\mu_0$, its dimension is $r=0$, whereas $\omega$ is all real values of $\mu$, so $q=1$. Hence $q-r=1$. In this simple case, Theorem 1 holds for all values of $m$.
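Here is a minimal sketch of this test on simulated data (the sample size, $\mu_0$, $\sigma$ and the true mean shift are arbitrary choices): we compute $x$, the critical value $k_\alpha$ and the $P$-value, and verify that $-2\log Q = x^2$ gives the same $P$-value via the chi-square distribution with one degree of freedom:

In [ ]:
import numpy as np
import scipy.stats as st

np.random.seed(42)
m, mu0, sigma = 25, 0.0, 1.0
# data generated with a small shift away from mu0
z = np.random.normal(mu0 + 0.3, sigma, m)
x = (np.mean(z) - mu0)/(sigma/np.sqrt(m))
alpha = 0.05
k_alpha = st.norm.ppf(1 - alpha/2)
print('|x| = %f, critical value k_alpha = %f' % (abs(x), k_alpha))
print('P-value (standard normal) = %f' % (2*(1 - st.norm.cdf(abs(x)))))
print('P-value (chi-square, 1 d.o.f.) = %f' % (1 - st.chi2.cdf(x**2, 1)))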

T and F tests

Suppose however that the variance $\sigma^2$ is unknown and that we wish nevertheless to make a statement about $\mu$ in terms of some realization of an appropriate statistic. To treat this and similar problems it is first necessary to define some additional distribution functions.

Theorem 2 If the random variable $Z$ is standard normally distributed and, independently of $Z$, $Y$ is chi-square distributed with $m$ degrees of freedom, then the random variable

$$ T = { Z\over \sqrt{Y / m}}\tag 7 $$

is Student-t distributed with $m$ degrees of freedom. The corresponding density is given by

$$ p_{t;m}(z) = d_m\left( 1 + {z^2\over m}\right)^{-(m+1)/2},\quad -\infty < z < \infty, $$

with $d_m$ being a normalization factor. The Student-t distribution converges to the standard normal distribution for $m\to\infty$.

In [ ]:
# The t-distribution is symmetric about 0 and for m > 30
# almost identical to the standard normal distribution:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as st

z = np.linspace(-5, 5, 200)
plt.plot(z, st.t.pdf(z, 5), label='t, m = 5')
plt.plot(z, st.norm.pdf(z), label='standard normal')
plt.legend()
plt.show()

The Student-t distribution is used to make statements regarding the mean when the variance is unknown. Thus if $Z_i$, $i=1\dots m$, is a sample from a normal distribution with mean $\mu_0$ and unknown variance, and

$$\bar Z = {1\over m}\sum_{i=1}^m Z_i,\quad S = {1\over m-1}\sum_{i=1}^m (Z_i-\bar Z)^2,$$

then the random variable

$$T = {\bar Z - \mu_0 \over \sqrt{S/m} }\tag 9$$

is Student-t distributed with $m-1$ degrees of freedom and provides a likelihood ratio test for $H_0: \mu=\mu_0$ against $H_1: \mu\ne\mu_0$.

Proof:

In (7) set $Z = {\bar Z - \mu_0 \over \sigma/\sqrt{m}}$, where $\sigma$ is the true (unknown) standard deviation of the $Z_i$. This has a standard normal distribution. In the denominator, replace $Y$ by $(m-1)S/\sigma^2$, which has the chi-square distribution with $\nu = m-1$ degrees of freedom, and replace $m$ by $m-1$:

$$ T = { Z\over \sqrt{Y / (m-1)}} = { \dfrac{\bar Z - \mu_0}{\sigma/\sqrt{m}} \over \sqrt{\dfrac{(m-1)S}{\sigma^2}\Big/(m-1)}} = {\bar Z - \mu_0 \over \sqrt{S/m}}. $$

This is the $T$ statistic (9), and by Theorem 2 it is Student-t distributed with $m-1$ degrees of freedom.

Example:

In 16 one-hour tests, the gasoline consumption of an engine averaged 8.5 liters per 100 km with a standard deviation of 2.0 liters. Test the null hypothesis that the manufacturer's claim of an average consumption of 7.0 liters is correct.

In (9), $m = 16$, $\mu_0 = 7.0$, $\bar z = 8.5$ and $s = 2.0^2$ (the realization of the sample variance $S$), giving

$$ t = {8.5 - 7.0 \over 2.0\big/\sqrt{16}} = 3.0 $$

We can reject the null hypothesis:

In [ ]:
print('p-value = %f' % (1-st.t.cdf(3.0,15)))

If $Z_i$, $i=1\dots n$, and $Y_i$, $i=1\dots m$, are samples from normal distributions with equal variance, then the random variable

$$T_d = {\bar Z-\bar Y - (\mu_Z-\mu_Y)\over \sqrt{\left({1\over n}+{1\over m}\right){(n-1)S_Z+(m-1)S_Y\over n+m-2}}}$$

is Student-t distributed with $n+m-2$ degrees of freedom. It may be used to test the hypothesis $\mu_Z=\mu_Y$. For instance, if $n=m$, under the null hypothesis the test statistic becomes

$$ T_d = {\bar Z - \bar Y \over \sqrt{(S_Z +S_Y)/m}}. $$
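The pooled two-sample test is available in scipy as st.ttest_ind, which assumes equal variances by default; a minimal sketch on simulated data (the sample sizes, means and common variance are arbitrary choices):

In [ ]:
import numpy as np
import scipy.stats as st

np.random.seed(1)
n, m = 20, 25
z = np.random.normal(5.0, 2.0, n)
y = np.random.normal(6.0, 2.0, m)      # same variance, shifted mean
# pooled-variance two-sample t-test with n+m-2 degrees of freedom
t_stat, p_value = st.ttest_ind(z, y)
print('t = %f, p-value = %f' % (t_stat, p_value))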

**Definition:** The *F-probability density function* with $m$ and $n$ degrees of freedom is given by

$$ p_{f;m,n}(z) = \cases{ c_{mn} z^{(m-2)/2}\left(1+{m\over n}\ z\right)^{-(m+n)/2} & for $z>0$\cr 0 & otherwise,} $$

with normalization factor $c_{mn}$.

**Theorem 3:** If $Y$ and $Z$ are independently chi-square distributed with $m$ and $n$ degrees of freedom, respectively, then

$$F = {{1\over m}Y \over {1\over n}Z }\tag{10}$$

is $F$-distributed with $m$ and $n$ degrees of freedom.

In [ ]:
z = np.linspace(0,5,200)
plt.plot(z,st.f.pdf(z,10,12))
plt.show()

The $F$-distribution can be used for hypothesis tests regarding two variances. If $S_Z$ and $S_Y$ are sample variances for samples of sizes $n$ and $m$, respectively, drawn from normally distributed populations with variances $\sigma_Z^2$ and $\sigma_Y^2$, then $$F = {\sigma_Y^2S_Z\over \sigma_Z^2S_Y}$$ is a random variable having an $F$-distribution with $n-1$ and $m-1$ degrees of freedom. It provides a likelihood ratio test for

$$H_0:\sigma_Z^2 = \sigma_Y^2, \quad H_1: \sigma_Z^2 \ne \sigma_Y^2.$$
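A minimal sketch of the variance ratio test on simulated data (the sample sizes and standard deviations are arbitrary choices): under $H_0$ the ratio of the sample variances is $F$-distributed with $n-1$ and $m-1$ degrees of freedom, and a two-sided $P$-value follows from the $F$ cumulative distribution:

In [ ]:
import numpy as np
import scipy.stats as st

np.random.seed(7)
n, m = 30, 25
z = np.random.normal(0.0, 1.0, n)
y = np.random.normal(0.0, 1.5, m)      # larger variance
s_z = np.var(z, ddof=1)                # unbiased sample variances
s_y = np.var(y, ddof=1)
F = s_z/s_y
# two-sided P-value for H_0: equal variances
p = 2*min(st.f.cdf(F, n-1, m-1), 1 - st.f.cdf(F, n-1, m-1))
print('F = %f, p-value = %f' % (F, p))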

Example: Automatic radiometric normalization of satellite imagery

In [ ]:
ls
In [ ]:
%matplotlib inline
%run dispms -f /home/mort/stats2016/20010626 -e 1 -p [4,5,6] \
-F /home/mort/stats2016/20010829 -E 1 -P [4,5,6]
In [ ]:
%run iMad /home/mort/stats2016/20010626 /home/mort/stats2016/20010829
In [ ]:
%run dispms -f /home/mort/stats2016/MAD(20010626-20010829) -e 3 -p [7,7,7] \
-F /home/mort/stats2016/MAD(20010626-20010829) -E 3 -P [4,5,6]
In [ ]:
%run radcal /home/mort/stats2016/MAD(20010626-20010829)
In [ ]:
%run dispms -f /home/mort/stats2016/20010829 -e 1 -p [4,5,6] \
-F /home/mort/stats2016/20010829_norm -E 1 -P [4,5,6]

ANOVA: More than two populations

The case often arises when one wants to compare the mean values of several random variables, for instance to determine whether several medications are equally effective.


Suppose that $X_1, X_2,\dots, X_k$ are normally distributed random variables with means $\mu_1,\mu_2,\dots,\mu_k$ and common variance $\sigma^2$. We want to test the null hypothesis

$$ H_0: \quad \mu_1 = \mu_2 = \dots = \mu_k. $$

Let

$$ X_{11},X_{12},\dots,X_{1n_1},\quad X_{21},X_{22},\dots,X_{2n_2},\quad\dots\quad X_{k1},X_{k2},\dots,X_{kn_k} $$

be independent samples drawn from the $k$ distributions. Call this the overall sample, consisting of $k$ partial samples. The total number of observations is

$$ n = \sum_{i=1}^k n_i. $$

The overall sample mean and variance are then

$$ \bar X = {1\over n}\sum_{i=1}^k\sum_{j=1}^{n_i} X_{ij},\quad S = {1\over n-1}\sum_{i=1}^k\sum_{j=1}^{n_i}(X_{ij}-\bar X)^2. $$

Similarly for the partial samples,

$$ \bar X_i = {1\over n_i}\sum_{j=1}^{n_i}X_{ij}, \quad S_i= {1\over n_i-1}\sum_{j=1}^{n_i}(X_{ij}-\bar X_i)^2. $$

Now consider

$$\eqalign{ \sum_{j=1}^{n_i}(X_{ij}-\bar X)^2 &= \sum_{j=1}^{n_i}(X_{ij}-\bar X_i + \bar X_i-\bar X)^2 = \sum_{j=1}^{n_i}\left((X_{ij}-\bar X_i) + (\bar X_i-\bar X)\right)^2 \cr &= \sum_{j=1}^{n_i}(X_{ij}-\bar X_i)^2 + 2(\bar X_i-\bar X)\sum_{j=1}^{n_i}(X_{ij}-\bar X_i) + n_i(\bar X_i-\bar X)^2 \cr &= (n_i-1)S_i + n_i(\bar X_i-\bar X)^2, } $$

since the second term vanishes. Therefore

$$ (n-1)S = \sum_{i=1}^k\sum_{j=1}^{n_i}(X_{ij}-\bar X)^2 = \sum_{i=1}^k (n_i-1)S_i + \sum_{i=1}^k n_i(\bar X_i-\bar X)^2.\tag{11} $$

Thus the overall sample variance can be split into a component that only depends upon the partial sample variances and a component that depends on the square of the distances between the partial sample means and the overall sample mean.

Only the second term in (11) depends upon the partial sample means. So if we consider the ratio

$$ {\sum_i n_i(\bar X_i-\bar X)^2\over \sum_i (n_i-1)S_i} = {\sum_i n_i(\bar X_i-\bar X)^2\over \sum_i\sum_j (X_{ij}-\bar X_i)^2 } $$

then large values indicate that the means are different. This appears to be a good test statistic, but what is its probability distribution?

Under the null hypothesis all the $X_{ij}$ have the same (unknown) variance $\sigma^2$ and $\bar X_i$ has variance $\sigma^2/n_i$. Rewriting the above ratio as

$$ {\sum_i {(\bar X_i-\bar X)^2\over \sigma^2/n_i} \over \sum_i\sum_j {(X_{ij}-\bar X_i)^2 \over \sigma^2}} $$

we see that the numerator is chi-square distributed with $k-1$ degrees of freedom and the denominator is chi-square distributed with $n-k$ degrees of freedom. So if we construct the ratio

$$ {{1\over k-1}\sum_i n_i(\bar X_i-\bar X)^2\over {1\over n-k}\sum_i\sum_j (X_{ij}-\bar X_i)^2 } = {(n-k)\sum_i n_i(\bar X_i-\bar X)^2\over (k-1)\sum_i\sum_j (X_{ij}-\bar X_i)^2 } $$

and compare with (10), we obtain a test statistic with the $F$-distribution with $k-1$ and $n-k$ degrees of freedom (the numerator and denominator can be shown to be independent).

In [ ]:
import numpy as np
from scipy.stats import f

# three partial samples
X1 = [21,22,26]
X2 = [30,29,28]
X3 = [25,27]
X = X1+X2+X3
X1bar = np.mean(X1)
X2bar = np.mean(X2)
X3bar = np.mean(X3)
Xbar = np.mean(X)
# n = 8 observations in k = 3 partial samples
# between-group sum of squares, weighted by n-k
num = (8-3)*(3*(X1bar-Xbar)**2+3*(X2bar-Xbar)**2+2*(X3bar-Xbar)**2)
# within-group sum of squares, weighted by k-1
# (np.var uses the denominator n_i, so n_i*np.var(Xi) is the sum of squared deviations)
den = (3-1)*(3*np.var(X1)+3*np.var(X2)+2*np.var(X3))

test = num/den

print('test statistic = %f' % test)

print('rejection threshold = %f' % f.ppf(1-0.05,2,5))

# P-value

print('p-value = %f' % (1-f.cdf(test,2,5)))

So, at a significance level of $0.05$ we can reject the hypothesis that the means are all the same.
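As a cross-check, scipy's one-way ANOVA routine computes the same $F$ statistic and $P$-value directly from the partial samples:

In [ ]:
from scipy.stats import f_oneway

# same partial samples as in the cell above
F, p = f_oneway([21,22,26], [30,29,28], [25,27])
print('F = %f, p-value = %f' % (F, p))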

Example: Wishart test (polarimetric SAR change detection)

The power received by a radar transmitting/receiving antenna reflected from a distributed (as opposed to point) target a distance $D$ from the antenna is given by

$$ P_R = {P_TG_TG_R\lambda^2\sigma^o\Delta_a\Delta_r\over (4\pi)^3D^4}\ [W], $$

where $P_T$ is the transmitted power $[W]$, $\lambda$ is the operating wavelength $[m]$, $G_T$ ($G_R$) is the transmitting (receiving) antenna gain, $\Delta_a$ ($\Delta_r$) is the azimuth (ground range) resolution $[m]$ and $\sigma^o$ is the unitless scattering coefficient (referred to as the *radar cross section*) of the target surface.

A fully polarimetric SAR measures a $2\times 2$ scattering matrix $\un S$ at each resolution cell on the ground. The scattering matrix relates the incident and the backscattered electric fields $\un E^i$ and $\un E^b$ according to

$$ \un E^b = \un S\un E^i \quad\hbox{or}\quad \pmatrix{E_h^b \cr E_v^b} =\pmatrix{S_{hh} & S_{hv}\cr S_{vh} & S_{vv}}\pmatrix{E_h^i \cr E_v^i}. $$

The per-pixel polarimetric information in the scattering matrix $\un S$, under the assumption of reciprocity ($S_{hv} = S_{vh}$), can be expressed as a three-component complex vector

$$ \un s = \pmatrix{S_{hh}\cr \sqrt{2}S_{hv}\cr S_{vv}}, $$

where the $\sqrt{2}$ ensures that the total intensity (received signal power) is consistent.

The total intensity is referred to as the span,

$$ {\rm span} = \un s^+\un s = |S_{hh}|^2 + 2|S_{hv}|^2 + |S_{vv}|^2, $$

and the corresponding gray-scale image as the *span image*.

Another representation of the polarimetric signal is the *complex covariance matrix* $\un C$, obtained by taking the complex outer product of $\un s$ with itself:

$$ \un C = \un s\un s^+ = \pmatrix{ |S_{hh}|^2 & \sqrt{2}S_{hh}S_{hv}^* & S_{hh}S_{vv}^* \cr \sqrt{2}S_{hv}S_{hh}^* & 2|S_{hv}|^2 & \sqrt{2}S_{hv}S_{vv}^* \cr S_{vv}S_{hh}^* & \sqrt{2}S_{vv}S_{hv}^* & |S_{vv}|^2 }. $$

The diagonal elements of $\un C$ are real numbers, with span $= {\rm Tr}(\un C)$, and the off-diagonal elements are complex. The covariance matrix representation contains all of the information in the polarized signal.
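A small sketch (the scattering vector below is made up purely for illustration) showing the outer product construction of $\un C$ and the identity span $= {\rm Tr}(\un C)$:

In [ ]:
import numpy as np

# an arbitrary (illustrative) scattering vector s = (S_hh, sqrt(2) S_hv, S_vv)
s = np.array([1.0+0.5j, 0.2-0.1j, 0.8+0.3j]).reshape(3,1)
C = s.dot(s.conj().T)                    # complex outer product s s^+
span = (s.conj().T.dot(s)).real[0,0]     # total intensity
print('span       = %f' % span)
print('trace of C = %f' % C.trace().real)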

Multi-look processing essentially corresponds to the averaging of neighborhood pixels, with the objective of reducing speckle and compressing the data. In practice the averaging is often not performed in the spatial domain, but rather in the frequency domain during range/azimuth compression of the received signal. The covariance matrix can be averaged over the number of looks to give

$$\eqalign{ \bar{\un C} & ={1\over m}\sum_{\nu=1}^m \un s(\nu)\un s(\nu)^+ = \langle \un s\un s^+ \rangle \cr & = \pmatrix{ \langle |S_{hh}|^2\rangle & \langle\sqrt{2}S_{hh}S_{hv}^*\rangle & \langle S_{hh}S_{vv}^*\rangle \cr \langle\sqrt{2} S_{hv}S_{hh}^*\rangle & \langle 2|S_{hv}|^2\rangle & \langle\sqrt{2}S_{hv}S_{vv}^*\rangle \cr \langle S_{vv}S_{hh}^*\rangle & \langle\sqrt{2}S_{vv}S_{hv}^*\rangle & \langle |S_{vv}|^2\rangle }.} $$

The quantity

$$ m\bar{\un C} = \sum_{\nu=1}^m \un s(\nu)\un s(\nu)^+ =: \un x $$

is a realization of a complex sample matrix which is known to have a *complex Wishart distribution* with $N\times N$ covariance matrix $\bmS$ (here $N=3$) and $m$ degrees of freedom,

$$ p_{W_c}(\un x) ={|\un x|^{(m-N)}\exp(-{\rm tr}({\bf\Sigma}^{-1}\un x)) \over \pi^{N(N-1)/2}|{\bf\Sigma}|^{m}\prod_{i=1}^N\Gamma(m+1-i)},\quad m \ge N, $$

provided that the vectors $\un s(\nu)$ are independent and drawn from a complex multivariate normal distribution.

We represent a pixel vector in an $m$ look-averaged polarimetric SAR image in covariance matrix format by $\bar{\un C}$, where

$$ m\bar{\un C} = \un x = \sum_{\nu=1}^m \un s(\nu)\un s(\nu)^+ $$

is a realization of a random matrix $\un X$ with a complex Wishart distribution. If

$$ \un X = \sum_{\nu=1}^m\un Z(\nu)\un Z(\nu)^+,\quad \un Z(\nu) \sim \mathcal{N}_C(\un 0,\bf\Sigma),\ \nu=1\dots m, $$

is complex Wishart distributed with covariance matrix $\bf\Sigma$ and $m$ degrees of freedom, then the maximum likelihood estimate for $\bf\Sigma$ is

$$ \hat{\bf\Sigma} = {1\over m}\sum_{\nu=1}^m\un z(\nu)\un z(\nu)^+ = {1\over m}\un x, $$

so that each pixel $\bar{\un C}$ is the maximum likelihood estimate of $\bf\Sigma$.

To simplify the notation, define

$$ \Gamma_N(m) = \pi^{N(N-1)/2} \prod_{i=1}^N\Gamma(m+1-i). $$

Then, for two $m$-look quad polarimetric covariance images $\un X_1$ and $\un X_2$, the multivariate densities are

$$\eqalign{ p(\un x_1\mid m,{\bf\Sigma_1}) &= { |\un x_1|^{m-3}\exp(-\tr({\bf\Sigma}_1^{-1}\un x_1)) \over |{\bf\Sigma}_1|^m\Gamma_3(m)}\cr p(\un x_2\mid m,{\bf\Sigma_2}) &= { |\un x_2|^{m-3}\exp(-\tr({\bf\Sigma}_2^{-1}\un x_2)) \over |{\bf\Sigma}_2|^m\Gamma_3(m)}. } $$

We now define the null (or no-change) simple hypothesis

$$ H_0:\quad {\bf\Sigma}_1 = {\bf\Sigma}_2 = {\bf\Sigma}, $$

against the alternative composite hypothesis $$ H_1:\quad {\bf\Sigma}_1 \ne {\bf\Sigma}_2. $$

Under $H_0$ the likelihood for ${\bf\Sigma}$ is given by

$$ L({\bf\Sigma}) = p(\un x_1\mid m,{\bf\Sigma})p(\un x_2\mid m,{\bf\Sigma}) = { |\un x_1|^{m-3}|\un x_2|^{m-3}\exp(-\tr({\bf\Sigma}^{-1}(\un x_1+ \un x_2))) \over |{\bf\Sigma}|^{2m}\Gamma_3(m)^2}. $$

The maximum likelihood estimate of ${\bf\Sigma}$ is

$$ \hat{\bf\Sigma} = {1\over 2m}(\un x_1+\un x_2). $$

Hence the maximum likelihood under the null hypothesis is

$$ L(\hat{\bf\Sigma}) = { |\un x_1|^{m-3}|\un x_2|^{m-3}\exp(-2m\cdot\tr(\un I)) \over \left({1\over 2m}\right)^{3\cdot 2m}|\un x_1+\un x_2|^{2m}\Gamma_3(m)^2 }, $$

where $\un I$ is the $3\times 3$ identity matrix and $\tr(\un I)=3$.

Under $H_1$ we obtain, similarly, the maximum likelihood for ${\bf\Sigma}_1$ and ${\bf\Sigma}_2$ as

$$ L(\hat{\bf\Sigma}_1,\hat{\bf\Sigma}_2) = { |\un x_1|^{m-3}|\un x_2|^{m-3}\exp(-2m\cdot\tr(\un I)) \over \left({1\over m}\right)^{3m}\left({1\over m}\right)^{3m} |\un x_1|^m |\un x_2|^m\Gamma_3(m)^2 } $$

Then the likelihood ratio test has the critical region

$$ Q = {L(\hat{\bf\Sigma}) \over L(\hat{\bf\Sigma}_1,\hat{\bf\Sigma}_2) } = 2^{6m}{ |\un x_1|^m |\un x_2|^m \over |\un x_1 + \un x_2|^{2m} } \le k. $$

For a large number of looks $m$, we can apply the asymptotic approximation described above to obtain the following distribution for the test statistic: As $m\to\infty$, the quantity

$$ -2\log Q = -2m( 6\log 2 + \log|\un x_1| + \log|\un x_2| -2\log|\un x_1+\un x_2| ) $$

is a realization of a chi-square random variable with 9 degrees of freedom, i.e.,

$$ \pr(-2\log Q\le z) \simeq P_{\chi^2;9}(z). $$

This follows from the fact that the number of parameters required to specify a $3\times 3$ complex covariance matrix is 9: three for the real diagonal elements and $2\times 3$ for the three complex elements above the diagonal. Thus the parameter space $\omega$ for $({\bf\Sigma}_1,{\bf\Sigma}_2)$ has dimension $q=2\times 9$, and the subspace $\omega_0$ for the simple null hypothesis has dimension $r=9$, so $q-r=9$.
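A minimal simulation of the bitemporal test for a single pixel (the number of looks and the covariance matrix below are arbitrary choices): we generate two look-averaged sample matrices from the same complex normal distribution (no change), evaluate $-2\log Q$ and read off the no-change $P$-value from the chi-square distribution with 9 degrees of freedom:

In [ ]:
import numpy as np
import scipy.stats as st

def sample_matrix(Sigma, m):
    # sum of m outer products s s^+ of complex normal vectors with covariance Sigma
    N = Sigma.shape[0]
    L = np.linalg.cholesky(Sigma)
    x = np.zeros((N,N), dtype=complex)
    for _ in range(m):
        s = L.dot((np.random.normal(size=(N,1)) + 1j*np.random.normal(size=(N,1)))/np.sqrt(2))
        x += s.dot(s.conj().T)
    return x

np.random.seed(0)
m = 16                                          # number of looks (illustrative)
Sigma = np.diag([3.0, 1.0, 2.0]).astype(complex)
x1 = sample_matrix(Sigma, m)                    # no change: same Sigma for both images
x2 = sample_matrix(Sigma, m)
det = lambda x: np.abs(np.linalg.det(x))
lnQ = m*(6*np.log(2) + np.log(det(x1)) + np.log(det(x2)) - 2*np.log(det(x1 + x2)))
print('-2 log Q = %f' % (-2*lnQ))
print('P-value  = %f' % (1 - st.chi2.cdf(-2*lnQ, 9)))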

Multitemporal data

The preceding discussion generalizes in a straightforward way to a time series of $n$ images (Conradsen et al. (2016)). In the equation for the test statistic $Q$ above, the numerator consists of a product of $n$ determinants $|\un x_1|\cdot|\un x_2|\cdots|\un x_n|$ and the denominator similarly involves the determinant of the sum of the $n$ observations, $|\un x_1+\un x_2+\dots+\un x_n|$. The multitemporal test, referred to as the *omnibus test*, will in general be more powerful (have a higher detection probability) at the same significance level than a bitemporal test involving just the first and last images. Furthermore, the test statistic can be factored into a product of test statistics that can be used to ascertain the time or times at which significant changes occur in the series.

For each pixel:

Test $\Sigma_1 = \Sigma_2$ against $\Sigma_1 \ne \Sigma_2$

If we don't reject $H_0$ then test $\Sigma_1 = \Sigma_2=\Sigma_3$ against $\Sigma_1 = \Sigma_2 \ne \Sigma_3$

If we reject, then start over from that point and test $\Sigma_3=\Sigma_4$ against $\Sigma_3 \ne \Sigma_4$, etc.

In [ ]: