Part 6. Multiple Linear Regression

$$ \def\pr{\hbox{Pr}} \def\var{\hbox{var}} \def\cov{\hbox{cov}} \def\corr{\hbox{corr}} \def\dmX{\un{\mathcal{X}}} \def\dmG{\un{\mathcal{G}}} \def\dmK{\un{\mathcal{K}}} \def\dmS{\un{\mathcal{S}}} \def\dmC{\un{\mathcal{C}}} \def\dmZ{\un{\mathcal{Z}}} \def\bma{{\mbox{\boldmath $\alpha$}}} \def\bmb{{\mbox{\boldmath $\beta$}}} \def\bmu{{\mbox{\boldmath $\mu$}}} \def\bme{{\mbox{\boldmath $\epsilon$}}} \def\bmS{{\mbox{\boldmath $\Sigma$}}} \def\bmL{{\mbox{\boldmath $\Lambda$}}} \def\bmd{{\mbox{\boldmath $\delta$}}} \def\bmD{{\mbox{\boldmath $\Delta$}}} \def\bmG{{\mbox{\boldmath $\Gamma$}}} \def\bmphi{{\mbox{\boldmath $\phi$}}} \def\bmPhi{{\mbox{\boldmath $\Phi$}}} \def\bmpsi{{\mbox{\boldmath $\psi$}}} \def\bmtheta{{\mbox{\boldmath $\theta$}}} \def\eq{\begin{equation}} \def\eeq{\end{equation}} \def\i{{\bf i}} \def\un#1{{\bf #1}}$$

The statistical model for linear regression may be written more generally in the form

$$Y(i) = w_0 + \sum_{j=1}^{N} w_jx_j(i) + R(i),\quad i=1\dots n,\tag 1$$

relating $n$ measurements of the $N$ independent variables $x_1\dots x_N$ to a measured dependent variable $Y$ via the parameters $w_0,w_1\dots w_N$.

Equivalently, in vector notation we can write

$$Y(i) = \un w^\top\un x(i) + R(i),\quad i=1\dots n,$$

where $\un x=(x_0=1,x_1\dots x_N)^\top$ and $\un w=(w_0,w_1\dots w_N)^\top$.

The random variables $R(i)$ represent the measurement uncertainty in the realizations $y(i)$ of $Y(i)$. We again assume that they are uncorrelated and normally distributed with zero mean and variance $\sigma^2$, whereas the values $\un x(i)$ are, as before, assumed to be exact. Now we wish to determine the best value for the parameter vector $\un w$.

Introducing the $n\times (N+1)$ data matrix

$$\dmX = \pmatrix{\un x(1)^\top\cr\vdots\cr \un x(n)^\top},$$

we can write

$$\un Y = \dmX \un w + \un R,$$

where $\un Y = (Y(1)\dots Y(n))^\top$, $\un R=(R(1)\dots R(n))^\top$ and, by assumption,

$$\bf\Sigma_R = \langle\un R \un R^\top\rangle = \sigma^2\un I.$$

The identity matrix $\un I$ is $n\times n$.

The goodness of fit function is

$$z(\un w) =\sum_{i=1}^n\left[{y(i)-\un w^\top\un x(i)\over \sigma}\right]^2 ={1\over\sigma^2}(\un y-\dmX\un w)^\top(\un y-\dmX\un w).$$

The goodness of fit function is minimized by solving the equations

$${\partial z(\un w)\over\partial w_k}=0,\quad k=0\dots N.$$

Expanding:

$$ z(\un w) = {1\over\sigma^2}(\un y^\top\un y-\un y^\top\dmX\un w-\un w^\top\dmX^\top\un y+\un w^\top\dmX^\top\dmX\un w), $$

so that, using the rules for vector differentiation,

$$ {\partial z(\un w)\over \partial\un w} = 0 = {1\over\sigma^2}(-\dmX^\top\un y -\dmX^\top \un y + 2\dmX^\top\dmX\un w) $$

from which we obtain the so-called *normal equation*,

$$\dmX^\top \un y = (\dmX^\top \dmX) \un w\ .$$

The estimated parameters of the model are obtained by solving for $\un w$,

$$ \hat{\un w}=(\un{\mathcal{X}}^\top \un{\mathcal{X}})^{-1} \un{\mathcal{X}}^\top \un y =: \dmX^+ \un y. $$

The matrix

$$ \dmX^+ =(\dmX^\top \dmX)^{-1} \dmX^\top\tag 2 $$

is the pseudoinverse of the data matrix $\dmX$.

In order to obtain the uncertainty in the estimate $\hat{\un w}$, we can think of $\un w$ as a random vector with mean value $\langle\un w\rangle$. Its covariance matrix is then given by

$$\eqalign{ \bf\Sigma_w &= \big\langle(\un w-\langle\un w\rangle) (\un w-\langle\un w\rangle)^\top\big\rangle\cr &\approx \big\langle(\un w-\hat{\un w}) (\un w-\hat{\un w})^\top\big\rangle\cr &= \big\langle(\un w-\dmX^+ \un y) (\un w-\dmX^+ \un y)^\top\big\rangle\cr &= \big\langle(\un w-\dmX^+ (\dmX \un w+\un r)) (\un w-\dmX^+ (\dmX \un w+\un r))^\top\big\rangle.} $$

But from Equation (2) we see that $\dmX^+ \dmX=\un I$, so

$$\eqalign{ \bf\Sigma_w &\approx \langle(-\dmX^+ \un r) (-\dmX^+ \un r)^\top\rangle = \dmX^+ \langle\un r \un r^\top\rangle {\dmX^+}^\top\cr &= \sigma^2\dmX^+ {\dmX^+}^\top. } $$

Again with Equation (2) we have finally

$$ \bf\Sigma_w \approx \sigma^2(\dmX^\top \dmX)^{-1} \dmX^\top \dmX((\dmX^\top\dmX)^{-1})^\top = \sigma^2(\dmX^\top \dmX)^{-1}. $$
In [1]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import linregress

# simulate n noisy observations of the straight line y = a + b*x
a = 1.0
b = 2.0
n = 200
sigma = 0.2
x = np.linspace(0, 1, n)
y = a + b*x + np.random.normal(scale=sigma, size=n)

# ordinary least squares fit with scipy (returns slope, intercept, r, p, stderr)
bb, aa, _, _, s = linregress(x, y)
print('a = %f, b = %f, s2 = %f' % (aa, bb, s**2))
print()
# data matrix
X = np.mat(np.zeros((n, 2)))
X[:, 0] = 1
X[:, 1] = np.mat(x).T
# pseudoinverse
Xp = (X.T*X).I*X.T
# least squares solution
y = np.mat(y).T
w = Xp*y
print(w.ravel())
print()
# covariance matrix of the parameter estimates
print((sigma**2)*(X.T*X).I)
a = 1.027349, b = 1.949169, s2 = 0.002255

[[ 1.02734885  1.94916912]]

[[ 0.00079403 -0.00118806]
 [-0.00118806  0.00237612]]
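
As a quick cross-check (a sketch that is not part of the original notebook; it assumes the variables X and y from the cell above are still defined), numpy's built-in least squares routines should reproduce the same estimate $\hat{\un w}$. In practice np.linalg.lstsq or np.linalg.pinv is also numerically preferable to forming $(\un{\mathcal{X}}^\top \un{\mathcal{X}})^{-1}$ explicitly.
In [ ]:
# Cross-check (sketch): numpy's pseudoinverse and lstsq should agree with w above
Xa = np.asarray(X)                # data matrix as a plain ndarray
ya = np.asarray(y).ravel()        # observations as a flat array
print(np.linalg.pinv(Xa) @ ya)                  # Moore-Penrose pseudoinverse
print(np.linalg.lstsq(Xa, ya, rcond=None)[0])   # least squares solver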

The Kalman filter

Consider the statistical model given by Equation (1):

$$ Y(i) = \sum_{j=0}^N w_j x_j(i) + R(i),\quad i=1\dots \nu. $$

This model relates the independent variables $\un x(i) = (1,x_1(i)\dots x_N(i))^\top$ to a measured quantity $Y(i)$ via the parameters $\un w= (w_0,w_1\dots w_N)^\top$. The index $\nu$ is now intended to represent the number of measurements that have been made so far. The random variables $R(i)$ represent the measurement uncertainty in the realizations $y(i)$ of $Y(i)$. We again assume that they are uncorrelated and normally distributed with zero mean and unit variance ($\sigma^2=1$), whereas the values $\un x(i)$ are exact.

We wish to determine the best values for the parameters $\un w$. Equation (1) can be written in terms of a data matrix $\un{\mathcal{X}}_\nu$ as

$$ \un Y_\nu = \un{\mathcal{X}}_\nu \un w + \un R_\nu, $$

where

$$ \un{\mathcal{X}}_\nu = \pmatrix{\un x(1)^\top\cr\vdots\cr\un x(\nu)^\top}, $$

$\un Y_\nu = (Y(1)\dots Y(\nu))^\top$ and $\un R_\nu = (R(1)\dots R(\nu))^\top$.

As was shown above, the best solution in the least squares sense for the parameter vector $\un w$ is given by

$$ \un w(\nu)=[(\un{\mathcal{X}}_\nu^\top \un{\mathcal{X}}_\nu)^{-1} \un{\mathcal{X}}_\nu^\top] \un y_\nu = \bf\Sigma(\nu) \un{\mathcal{X}}_\nu^\top \un y_\nu,\tag 3 $$

where the expression in square brackets is the pseudoinverse of $\un{\mathcal{X}}_\nu$ and where $\bf\Sigma(\nu)$ is an estimate of the covariance matrix of $\un w$,

$$ \bf\Sigma(\nu) = (\un{\mathcal{X}}_\nu^\top \un{\mathcal{X}}_\nu)^{-1}.\tag 4 $$

Suppose a new observation $(\un x(\nu+1),$ $y(\nu+1))$ becomes available. Now we must solve the least squares problem

$$ \pmatrix{\un Y_\nu\cr Y(\nu+1)} = \pmatrix{\un{\mathcal{X}}_\nu \cr \un x(\nu+1)^\top}\un w +\un R_{\nu+1}.\tag 5 $$

With Equation (3), the solution is

$$ \un w(\nu+1) = \bf\Sigma(\nu+1)\pmatrix{\un{\mathcal{X}}_\nu \cr \un x(\nu+1)^\top}^\top\pmatrix{\un y_\nu\cr y(\nu+1)}.\tag 6 $$

Inverting Equation (4) with $\nu\to \nu+1$, we obtain a recursive formula for the new covariance matrix $\bf\Sigma(\nu+1)$:

$$ \bf\Sigma(\nu+1)^{-1} = \pmatrix{\un{\mathcal{X}}_\nu \cr \un x(\nu+1)^\top}^\top\pmatrix{\un{\mathcal{X}}_\nu \cr \un x(\nu+1)^\top} =\un{\mathcal{X}}_\nu^\top\un{\mathcal{X}}_\nu + \un x(\nu+1)\un x(\nu+1)^\top $$

or

$$ \bf\Sigma(\nu+1)^{-1} = \bf\Sigma(\nu)^{-1} + \un x(\nu+1)\un x(\nu+1)^\top.\tag 7 $$

To obtain a similar recursive formula for $\un w(\nu+1)$ we multiply Equation (6) out, giving

$$ \un w(\nu+1) = \bf\Sigma(\nu+1)(\un{\mathcal{X}}_\nu^\top\un y_\nu + \un x(\nu+1) y(\nu+1)), $$

and then replace $\un{\mathcal{X}}_\nu^\top\un y_\nu$ with $\un{\mathcal{X}}_\nu^\top\un{\mathcal{X}}_\nu\,\un w(\nu)$, which follows from Equation (3), to obtain

$$ \un w(\nu+1) = \bf\Sigma(\nu+1)\Big(\un{\mathcal{X}}_\nu^\top\un{\mathcal{X}}_\nu\un w(\nu) + \un x(\nu+1) y(\nu+1)\Big). $$

Using Equations (4) and (7),

$$\eqalign{ &\un w(\nu+1) = \bf\Sigma(\nu+1)\Big(\bf\Sigma(\nu)^{-1}\un w(\nu) + \un x(\nu+1) y(\nu+1)\Big)\cr & = \bf\Sigma(\nu+1)\Big[\bf\Sigma(\nu+1)^{-1}\un w(\nu)-\un x(\nu+1)\un x(\nu+1)^\top\un w(\nu) + \un x(\nu+1) y(\nu+1)\Big].} $$

This simplifies to

$$ \un w(\nu+1) = \un w(\nu) + \un K(\nu+1)\Big[y(\nu+1)-\un x(\nu+1)^\top\un w(\nu)\Big],\tag 8 $$

where the Kalman gain $\un K(\nu+1)$ is given by

$$ \un K(\nu+1) = \bf\Sigma(\nu+1)\un x(\nu+1).\tag 9 $$

Equations (7)-(9) define a so-called Kalman filter for the least squares problem of Equation (1). For observations

$$ \un x(\nu+1)\quad \hbox{and}\quad y(\nu+1) $$

the system response $\un x(\nu+1)^\top\un w(\nu)$ is calculated and compared in Equation (8) with the measurement $y(\nu+1)$. The innovation, that is, the difference between the measurement and the system response, is then multiplied by the Kalman gain $\un K(\nu+1)$ determined by Equations (7) and (9), and the old estimate $\un w(\nu)$ of the parameter vector is corrected to the new value $\un w(\nu+1)$.
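
The cell below gives a minimal sketch of this recursion (my own illustration, not part of the original text). It reuses the constants a, b and n from the first code cell, simulates new data with unit measurement variance ($\sigma^2=1$) as assumed above, initializes $\un w(\nu)$ and $\bf\Sigma(\nu)$ from a small batch solution via Equations (3) and (4), and then applies Equations (7)-(9) one observation at a time; the final estimate should agree with the batch least squares solution over all observations.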

In [ ]:
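# Sketch of the Kalman filter recursion of Equations (7)-(9) for the straight-line
# model; reuses a, b, n from the first cell and assumes unit variance noise
# (sigma^2 = 1), so that Sigma(nu) = (X_nu^T X_nu)^{-1} as in Equation (4).
np.random.seed(0)                                  # illustrative seed, chosen arbitrarily
nu0 = 10                                           # size of the initial batch solution
x_all = np.linspace(0, 1, n)
y_all = a + b*x_all + np.random.normal(size=n)     # unit variance measurement noise
X0 = np.column_stack((np.ones(nu0), x_all[:nu0]))
Sigma = np.linalg.inv(X0.T @ X0)                   # Equation (4)
w = Sigma @ X0.T @ y_all[:nu0]                     # Equation (3)
for nu in range(nu0, n):
    xx = np.array([1.0, x_all[nu]])                # new observation x(nu+1)
    yy = y_all[nu]                                 # new measurement y(nu+1)
    Sigma = np.linalg.inv(np.linalg.inv(Sigma) + np.outer(xx, xx))   # Equation (7)
    K = Sigma @ xx                                 # Kalman gain, Equation (9)
    w = w + K*(yy - xx @ w)                        # innovation update, Equation (8)
print(w)
# for comparison: the batch least squares solution over all n observations,
# which the recursion should reproduce up to rounding error
X_all = np.column_stack((np.ones(n), x_all))
print(np.linalg.lstsq(X_all, y_all, rcond=None)[0])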