The statistical model for linear regression may be written more generally in the form
$$Y(i) = w_0 + \sum_{j=1}^{N} w_jx_j(i) + R(i),\quad i=1\dots n,\tag 1$$relating $n$ measurements of the $N$ independent variables $x_1\dots x_N$ to a measured dependent variable $Y$ via the parameters $w_0,w_1\dots w_N$.
Equivalently, in vector notation we can write
$$Y(i) = \un w^\top\un x(i) + R(i),\quad i=1\dots n,$$where $\un x=(x_0=1,x_1\dots x_N)^\top$ and $\un w=(w_0,w_1\dots w_N)^\top$.
The random variables $R(i)$ represent the measurement uncertainty in the realizations $y(i)$ of $Y(i)$. We again assume that they are uncorrelated and normally distributed with zero mean and variance $\sigma^2$, whereas the values $\un x(i)$ are, as before, assumed to be exact. Now we wish to determine the best value for the parameter vector $\un w$.
Introducing the $n\times (N+1)$ data matrix
$$\dmX = \pmatrix{\un x(1)^\top\cr\vdots\cr \un x(n)^\top},$$we can write
$$\un Y = \dmX \un w + \un R,$$where $\un Y = (Y(1)\dots Y(n))^\top$, $\un R=(R(1)\dots R(n))^\top$ and, by assumption,
$$\bf\Sigma_R = \langle\un R \un R^\top\rangle = \sigma^2\un I.$$The identity matrix $\un I$ is $n\times n$.
The goodness of fit function is
$$z(\un w) =\sum_{i=1}^n\left[{y(i)-\un w^\top\un x(i)\over \sigma}\right]^2 ={1\over\sigma^2}(\un y-\dmX\un w)^\top(\un y-\dmX\un w).$$It is minimized by solving the equations
$${\partial z(\un w)\over\partial w_k}=0,\quad k=0\dots N.$$Expanding:
$$ z(\un w) = {1\over\sigma^2}(\un y^\top\un y-\un y^\top\dmX\un w-\un w^\top\dmX^\top\un y+\un w^\top\dmX^\top\dmX\un w), $$so that, using the rules for vector differentiation,
$$ {\partial z(\un w)\over \partial\un w} = {1\over\sigma^2}(-\dmX^\top\un y -\dmX^\top \un y + 2\dmX^\top\dmX\un w) = 0, $$from which we obtain the so-called *normal equation*,
$$\dmX^\top \un y = (\dmX^\top \dmX) \un w\ .$$The estimated parameters of the model are obtained by solving for $\un w$,
$$ \hat{\un w}=(\dmX^\top \dmX)^{-1} \dmX^\top \un y =: \dmX^+ \un y. $$The matrix
$$ \dmX^+ =(\dmX^\top \dmX)^{-1} \dmX^\top $$is the pseudoinverse of the data matrix $\dmX$.
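As a quick numerical check (a minimal sketch assuming NumPy; the small data matrix below is arbitrary), np.linalg.pinv computes the same pseudoinverse via the singular value decomposition and agrees with the explicit formula whenever $\dmX$ has full column rank:

import numpy as np
# a small data matrix with full column rank (values chosen arbitrarily)
X = np.array([[1.0, 0.1],
              [1.0, 0.5],
              [1.0, 0.9]])
# explicit pseudoinverse (X^T X)^{-1} X^T
Xp = np.linalg.inv(X.T @ X) @ X.T
# should agree with numpy's SVD-based pseudoinverse
print(np.allclose(Xp, np.linalg.pinv(X)))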
In order to obtain the uncertainty in the estimate $\hat{\un w}$, we can think of $\un w$ as a random vector with mean value $\langle\un w\rangle$. Its covariance matrix is then given by
$$\eqalign{ \bf\Sigma_w &= \big\langle(\un w-\langle\un w\rangle) (\un w-\langle\un w\rangle)^\top\big\rangle\cr &\approx \big\langle(\un w-\hat{\un w}) (\un w-\hat{\un w})^\top\big\rangle\cr &= \big\langle(\un w-\dmX^+ \un y) (\un w-\dmX^+ \un y)^\top\big\rangle\cr &= \big\langle(\un w-\dmX^+ (\dmX \un w+\un r)) (\un w-\dmX^+ (\dmX \un w+\un r))^\top\big\rangle.} $$But from the definition of the pseudoinverse, $\dmX^+ \dmX=(\dmX^\top \dmX)^{-1}\dmX^\top \dmX=\un I$, so
$$\eqalign{ \bf\Sigma_w &\approx \langle(-\dmX^+ \un r) (-\dmX^+ \un r)^\top\rangle = \dmX^+ \langle\un r \un r^\top\rangle {\dmX^+}^\top\cr &= \sigma^2\dmX^+ {\dmX^+}^\top. } $$Substituting the definition of $\dmX^+$ once more, we have finally
$$ \bf\Sigma_w \approx \sigma^2(\dmX^\top \dmX)^{-1} \dmX^\top \dmX((\dmX^\top\dmX)^{-1})^\top = \sigma^2(\dmX^\top \dmX)^{-1}. $$The calculation is illustrated below for simulated data from a straight line:

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import linregress, pearsonr

# simulated data for the model y = a + b*x + noise
a = 1.0
b = 2.0
n = 200
sigma = 0.2
x = np.linspace(0, 1, n)
y = a + b*x + np.random.normal(scale=sigma, size=n)

# reference fit with scipy: slope, intercept, r-value, p-value, stderr of slope
bb, aa, _, _, s = linregress(x, y)
# s**2 estimates the variance of the slope; compare it with the (1,1) entry
# of the covariance matrix printed below
print('a = %f, b = %f, s2 = %f' % (aa, bb, s**2))
print()

# data matrix with a column of ones for the intercept term
X = np.zeros((n, 2))
X[:, 0] = 1
X[:, 1] = x
# pseudoinverse
Xp = np.linalg.inv(X.T @ X) @ X.T
# least squares solution
w = Xp @ y
print(w)
print()
# covariance matrix of the parameter estimates
print(sigma**2 * np.linalg.inv(X.T @ X))
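As a cross-check (a sketch that assumes the arrays X and y from the listing above are still in scope), np.linalg.lstsq solves the same least squares problem without forming $(\dmX^\top\dmX)^{-1}$ explicitly, which is numerically preferable when $\dmX^\top\dmX$ is ill-conditioned:

# least squares solution via LAPACK, avoiding the explicit inverse
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w_lstsq)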
Consider the statistical model given by Equation (1):
$$ Y(i) = \sum_{j=0}^N w_j x_j(i) + R(i),\quad i=1\dots \nu.\tag 2 $$This model relates the independent variables $\un x(i) = (1,x_1(i)\dots x_N(i))^\top$ to a measured quantity $Y(i)$ via the parameters $\un w= (w_0,w_1\dots w_N)^\top$. The index $\nu$ is now intended to represent the number of measurements that have been made so far. The random variables $R(i)$ represent the measurement uncertainty in the realizations $y(i)$ of $Y(i)$. We again assume that they are uncorrelated and normally distributed with zero mean and unit variance ($\sigma^2=1$), whereas the values $\un x(i)$ are exact.
We wish to determine the best values for the parameters $\un w$. Equation (2) can be written in terms of a data matrix $\un{\mathcal{X}}_\nu$ as
$$ \un Y_\nu = \un{\mathcal{X}}_\nu \un w + \un R_\nu, $$where
$$ \un{\mathcal{X}}_\nu = \pmatrix{\un x(1)^\top\cr\vdots\cr\un x(\nu)^\top}, $$$\un Y_\nu = (Y(1)\dots Y(\nu))^\top$ and $\un R_\nu = (R(1)\dots R(\nu))^\top$.
As was shown above, the best solution in the least squares sense for the parameter vector $\un w$ is given by
$$ \un w(\nu)=[(\un{\mathcal{X}}_\nu^\top \un{\mathcal{X}}_\nu)^{-1} \un{\mathcal{X}}_\nu^\top] \un y_\nu = \bf\Sigma(\nu) \un{\mathcal{X}}_\nu^\top \un y_\nu,\tag 3 $$where the expression in square brackets is the pseudoinverse of $\un{\mathcal{X}}_\nu$ and where $\bf\Sigma(\nu)$ is an estimate of the covariance matrix of $\un w$,
$$ \bf\Sigma(\nu) = (\un{\mathcal{X}}_\nu^\top \un{\mathcal{X}}_\nu)^{-1}.\tag 4 $$Suppose a new observation $(\un x(\nu+1),$ $y(\nu+1))$ becomes available. Now we must solve the least squares problem
$$ \pmatrix{\un Y_\nu\cr Y(\nu+1)} = \pmatrix{\un{\mathcal{X}}_\nu \cr \un x(\nu+1)^\top}\un w +\un R_{\nu+1}.\tag 5 $$With Equation (3), the solution is
$$ \un w(\nu+1) = \bf\Sigma(\nu+1)\pmatrix{\un{\mathcal{X}}_\nu \cr \un x(\nu+1)^\top}^\top\pmatrix{\un y_\nu\cr y(\nu+1)}.\tag 6 $$Inverting Equation (4) with $\nu\to \nu+1$, we obtain a recursive formula for the new covariance matrix $\bf\Sigma(\nu+1)$:
$$ \bf\Sigma(\nu+1)^{-1} = \pmatrix{\un{\mathcal{X}}_\nu \cr \un x(\nu+1)^\top}^\top\pmatrix{\un{\mathcal{X}}_\nu \cr \un x(\nu+1)^\top} =\un{\mathcal{X}}_\nu^\top\un{\mathcal{X}}_\nu + \un x(\nu+1)\un x(\nu+1)^\top $$or
$$ \bf\Sigma(\nu+1)^{-1} = \bf\Sigma(\nu)^{-1} + \un x(\nu+1)\un x(\nu+1)^\top.\tag 7 $$To obtain a similar recursive formula for $\un w(\nu+1)$ we multiply Equation (6) out, giving
$$ \un w(\nu+1) = \bf\Sigma(\nu+1)(\un{\mathcal{X}}_\nu^\top\un y_\nu + \un x(\nu+1) y(\nu+1)), $$and then use Equation (3) in the form $\un{\mathcal{X}}_\nu^\top\un y_\nu = \un{\mathcal{X}}_\nu^\top\un{\mathcal{X}}_\nu\un w(\nu)$ to obtain
$$ \un w(\nu+1) = \bf\Sigma(\nu+1)\Big(\un{\mathcal{X}}_\nu^\top\un{\mathcal{X}}_\nu\un w(\nu) + \un x(\nu+1) y(\nu+1)\Big). $$Using Equations (4) and (7),
$$\eqalign{ &\un w(\nu+1) = \bf\Sigma(\nu+1)\Big(\bf\Sigma(\nu)^{-1}\un w(\nu) + \un x(\nu+1) y(\nu+1)\Big)\cr & = \bf\Sigma(\nu+1)\Big[\bf\Sigma(\nu+1)^{-1}\un w(\nu)-\un x(\nu+1)\un x(\nu+1)^\top\un w(\nu) + \un x(\nu+1) y(\nu+1)\Big].} $$This simplifies to
$$ \un w(\nu+1) = \un w(\nu) + \un K(\nu+1)\Big[y(\nu+1)-\un x(\nu+1)^\top\un w(\nu)\Big],\tag 8 $$where the Kalman gain $\un K(\nu+1)$ is given by
$$ \un K(\nu+1) = \bf\Sigma(\nu+1)\un x(\nu+1).\tag 9 $$Equations (7)--(9) define a so-called Kalman filter for the least squares problem of Equation (1). For a new observation $\un x(\nu+1)$, $y(\nu+1)$, the system response $\un x(\nu+1)^\top\un w(\nu)$ is calculated and compared in Equation (8) with the measurement $y(\nu+1)$. The innovation, that is, the difference between the measurement and the system response, is then multiplied by the Kalman gain $\un K(\nu+1)$ determined by Equations (9) and (7), and the old estimate $\un w(\nu)$ of the parameter vector $\un w$ is corrected to the new value $\un w(\nu+1)$.
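The recursion is easily verified numerically. The following sketch (assuming NumPy; the initialization with a small batch of nu0 observations and the variable names are illustrative choices, not part of the derivation) processes simulated straight-line data one observation at a time with Equations (7)--(9) and compares the result with the batch pseudoinverse solution of Equation (3). For clarity it inverts $\bf\Sigma(\nu+1)^{-1}$ at every step; a more efficient implementation would update $\bf\Sigma$ directly, for example with the Sherman-Morrison identity.

import numpy as np

# simulated straight-line data with unit measurement variance (sigma^2 = 1),
# matching the assumption of Equation (2)
rng = np.random.default_rng(1)
n = 200
x1 = np.linspace(0, 1, n)
X = np.column_stack([np.ones(n), x1])    # rows are x(i)^T = (1, x_1(i))
y = 1.0 + 2.0*x1 + rng.normal(size=n)

# initialize with a small batch of nu0 observations, Equations (3) and (4)
nu0 = 10
Sigma = np.linalg.inv(X[:nu0].T @ X[:nu0])
w = Sigma @ X[:nu0].T @ y[:nu0]

# recursive (Kalman filter) updates for the remaining observations
for nu in range(nu0, n):
    x_new, y_new = X[nu], y[nu]
    Sigma = np.linalg.inv(np.linalg.inv(Sigma) + np.outer(x_new, x_new))  # Equation (7)
    K = Sigma @ x_new                                                     # Equation (9)
    w = w + K*(y_new - x_new @ w)                                         # Equation (8)

# the recursion reproduces the batch pseudoinverse solution
w_batch = np.linalg.inv(X.T @ X) @ X.T @ y
print(w)
print(w_batch)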