The Least-squares Solution

Ordinary least squares (OLS) is a linear least squares method for estimating the unknown parameters in a linear regression model. OLS chooses the parameters of a linear function of a set of explanatory variables by the principle of least squares: it minimizes the sum of the squared differences between the observed values of the dependent variable in the given dataset and the values predicted by the linear function of the explanatory variables.

Geometrically, this is the sum of the squared distances, measured parallel to the axis of the dependent variable, between each data point in the set and the corresponding point on the regression surface; the smaller the differences, the better the model fits the data. The resulting estimator can be expressed by a simple formula, especially in the case of simple linear regression, in which there is a single regressor on the right side of the regression equation.
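For the simple linear regression case, the estimator reduces to a slope equal to the sample covariance of the regressor and the response divided by the sample variance of the regressor, with the intercept chosen so that the line passes through the point of means. The following is a minimal sketch of that formula in plain Python; the function name `simple_ols` and the data are hypothetical and serve only as illustration.

```python
def simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y ~ a + b*x."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Slope: sum of cross-deviations divided by sum of squared x-deviations.
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x must not be constant")
    slope = sxy / sxx
    # Intercept: the fitted line passes through (x_mean, y_mean).
    intercept = y_mean - slope * x_mean
    return intercept, slope


# Hypothetical data lying exactly on the line y = 1 + 2x.
x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]
intercept, slope = simple_ols(x, y)
```

Because the sample data lie exactly on a line, the fitted intercept and slope recover that line; with noisy data they would instead give the line with the smallest sum of squared vertical distances.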

The OLS estimator is consistent when the regressors are exogenous and, by the Gauss–Markov theorem, optimal in the class of linear unbiased estimators when the errors are homoscedastic and serially uncorrelated. Under these conditions, OLS provides minimum-variance mean-unbiased estimation when the errors have finite variances. Under the additional assumption that the errors are normally distributed, OLS is the maximum likelihood estimator.

Linear model

Figure: Okun's law in macroeconomics states that in an economy GDP growth should depend linearly on changes in the unemployment rate. Here the ordinary least squares method is used to construct the regression line describing this law.

Suppose the data consists of $n$ observations $\left\{\mathbf {x} _{i},y_{i}\right\}_{i=1}^{n}$. Each observation $i$ includes a scalar response $y_{i}$ and a column vector $\mathbf {x} _{i}$ of $p$ regressors (explanatory variables), i.e., $\mathbf {x} _{i}=\left[x_{i1},x_{i2},\dots ,x_{ip}\right]^{\mathsf {T}}$. In a linear regression model, the response variable $y_{i}$ is a linear function of the regressors:

y_{i}=\beta _{1}\ x_{i1}+\beta _{2}\ x_{i2}+\cdots +\beta _{p}\ x_{ip}+\varepsilon _{i},

or in vector form,

y_{i}=\mathbf {x} _{i}^{\mathsf {T}}{\boldsymbol {\beta }}+\varepsilon _{i},\,

where $\mathbf {x} _{i}$, as introduced previously, is a column vector of the $i$-th observation of all the explanatory variables; ${\boldsymbol {\beta }}$ is a $p\times 1$ vector of unknown parameters; and the scalar $\varepsilon _{i}$ represents the unobserved random error of the $i$-th observation. $\varepsilon _{i}$ accounts for the influences upon the response $y_{i}$ from sources other than the explanatory variables $\mathbf {x} _{i}$. This model can also be written in matrix notation as

\mathbf {y} =\mathrm {X} {\boldsymbol {\beta }}+{\boldsymbol {\varepsilon }},\,

where $\mathbf {y}$ and ${\boldsymbol {\varepsilon }}$ are $n\times 1$ vectors of the response variables and the errors of the $n$ observations, and $\mathrm {X}$ is an $n\times p$ matrix of regressors, sometimes called the design matrix, whose row $i$ is $\mathbf {x} _{i}^{\mathsf {T}}$ and contains the $i$-th observation of all the explanatory variables.
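The matrix form can be made concrete with a small sketch that stores $\mathrm {X}$ as a list of rows and evaluates $y_{i}=\mathbf {x} _{i}^{\mathsf {T}}{\boldsymbol {\beta }}+\varepsilon _{i}$ row by row. The function name `linear_model` and all numbers below are hypothetical; in a real problem ${\boldsymbol {\beta }}$ and ${\boldsymbol {\varepsilon }}$ are unknown, and they are fixed here only to show how the pieces fit together.

```python
def linear_model(X, beta, eps):
    """Return y with y[i] = sum_j X[i][j] * beta[j] + eps[i]."""
    if any(len(row) != len(beta) for row in X):
        raise ValueError("each row of X must have len(beta) entries")
    if len(X) != len(eps):
        raise ValueError("X and eps must have the same number of rows")
    return [sum(xij * bj for xij, bj in zip(row, beta)) + e
            for row, e in zip(X, eps)]


# n = 3 observations, p = 2 regressors; row i of X is x_i transposed.
X = [[2.0, 1.0],
     [3.0, 0.0],
     [5.0, 4.0]]
beta = [0.5, 2.0]          # assumed parameter vector (p entries)
eps = [0.1, -0.2, 0.0]     # assumed error terms (n entries)
y = linear_model(X, beta, eps)
```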

As a rule, a constant term is included in the set of regressors $\mathrm {X}$, say, by taking $x_{i1}=1$ for all $i=1,\dots ,n$. The coefficient $\beta _{1}$ corresponding to this regressor is called the intercept.

Regressors do not have to be independent of one another: any relationship between them is permitted, so long as it is not an exact linear relationship. For instance, we might suspect that the response depends linearly both on a value and on its square, in which case we would include one regressor whose value is the square of another regressor. The model would then be quadratic in the second regressor, but it is nonetheless still considered a linear model because it remains linear in the parameters (${\boldsymbol {\beta }}$).
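A short sketch may help show why a squared regressor does not break linearity. It builds a design matrix with columns $1$, $x$ and $x^{2}$, then checks that the predictions are linear in ${\boldsymbol {\beta }}$: predicting with a linear combination of two parameter vectors gives the same linear combination of their predictions. The helper names and the values are hypothetical.

```python
def quadratic_design_matrix(x):
    """Rows [1, x_i, x_i**2]: a constant, a value, and its square."""
    return [[1.0, xi, xi ** 2] for xi in x]


def predict(X, beta):
    """Return the vector X @ beta for X given as a list of rows."""
    return [sum(xij * bj for xij, bj in zip(row, beta)) for row in X]


x = [0.0, 1.0, 2.0, 3.0]
X = quadratic_design_matrix(x)

# Two arbitrary parameter vectors and a linear combination of them.
beta_a = [1.0, 0.0, 2.0]
beta_b = [0.0, 3.0, -1.0]
combined = [2.0 * a + 5.0 * b for a, b in zip(beta_a, beta_b)]

# Linearity in the parameters: predict(2a + 5b) == 2*predict(a) + 5*predict(b).
lhs = predict(X, combined)
rhs = [2.0 * pa + 5.0 * pb
       for pa, pb in zip(predict(X, beta_a), predict(X, beta_b))]
```

The predictions are a curved function of `x`, yet `lhs` and `rhs` agree, which is the sense in which the model is linear.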

Matrix/vector formulation

Consider an overdetermined system

\sum _{j=1}^{p}X_{ij}\beta _{j}=y_{i},\ (i=1,2,\dots ,n),

of $n$ linear equations in $p$ unknown coefficients, $\beta _{1},\beta _{2},\dots ,\beta _{p}$, with $n>p$. (Note: for a linear model as above, not all elements of $\mathrm {X}$ contain information on the data points. The first column is populated with ones, $X_{i1}=1$; only the other columns contain actual data. So here $p$ is equal to the number of non-constant regressors plus one.) This can be written in matrix form as

\mathrm {X} {\boldsymbol {\beta }}=\mathbf {y} ,

where

\mathrm {X} ={\begin{bmatrix}X_{11}&X_{12}&\cdots &X_{1p}\\X_{21}&X_{22}&\cdots &X_{2p}\\\vdots &\vdots &\ddots &\vdots \\X_{n1}&X_{n2}&\cdots &X_{np}\end{bmatrix}},\qquad {\boldsymbol {\beta }}={\begin{bmatrix}\beta _{1}\\\beta _{2}\\\vdots \\\beta _{p}\end{bmatrix}},\qquad \mathbf {y} ={\begin{bmatrix}y_{1}\\y_{2}\\\vdots \\y_{n}\end{bmatrix}}.

Such a system usually has no exact solution, so the goal is instead to find the coefficients ${\boldsymbol {\beta }}$ which fit the equations "best", in the sense of solving the quadratic minimization problem

{\hat {\boldsymbol {\beta }}}={\underset {\boldsymbol {\beta }}{\operatorname {arg\,min} }}\,S({\boldsymbol {\beta }}),

where the objective function $S$ is given by

S({\boldsymbol {\beta }})=\sum _{i=1}^{n}{\biggl |}y_{i}-\sum _{j=1}^{p}X_{ij}\beta _{j}{\biggr |}^{2}={\bigl \|}\mathbf {y} -\mathrm {X} {\boldsymbol {\beta }}{\bigr \|}^{2}.
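The objective $S$ can be evaluated directly from its definition. The sketch below computes the residuals $y_{i}-\sum _{j}X_{ij}\beta _{j}$ and sums their squares for two candidate coefficient vectors; the function name `sum_of_squares` and the data are hypothetical.

```python
def sum_of_squares(X, y, beta):
    """Return S(beta) = sum_i (y_i - sum_j X_ij * beta_j) ** 2."""
    residuals = [yi - sum(xij * bj for xij, bj in zip(row, beta))
                 for row, yi in zip(X, y)]
    return sum(r * r for r in residuals)


# Three observations, an intercept column of ones and one data column.
X = [[1.0, 0.0],
     [1.0, 1.0],
     [1.0, 2.0]]
y = [1.0, 2.0, 2.0]

s_a = sum_of_squares(X, y, [1.0, 0.5])  # residuals 0, 0.5, 0
s_b = sum_of_squares(X, y, [0.0, 1.0])  # residuals 1, 1, 0
```

Here the first candidate gives the smaller value of $S$ and therefore fits these equations better than the second, although neither is necessarily the minimizer.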

A justification for choosing this criterion is given in Properties below. Provided that the $p$ columns of the matrix $\mathrm {X}$ are linearly independent, this minimization problem has a unique solution, obtained by solving the normal equations

(\mathrm {X} ^{\mathsf {T}}\mathrm {X} ){\hat {\boldsymbol {\beta }}}=\mathrm {X} ^{\mathsf {T}}\mathbf {y} \ .
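As an illustration, the normal equations can be solved numerically by forming $\mathrm {X} ^{\mathsf {T}}\mathrm {X}$ and $\mathrm {X} ^{\mathsf {T}}\mathbf {y}$ and applying Gaussian elimination. The sketch below uses only the Python standard library; the helper names (`transpose`, `solve`, `ols`) and the data are hypothetical. It is meant for small, well-conditioned problems: forming $\mathrm {X} ^{\mathsf {T}}\mathrm {X}$ squares the condition number of $\mathrm {X}$, so numerical libraries typically rely on a QR or singular value decomposition instead.

```python
def transpose(A):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*A)]


def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix
    for k in range(n):
        pivot = max(range(k, n), key=lambda r: abs(M[r][k]))
        if abs(M[pivot][k]) < 1e-12:
            raise ValueError("singular system: columns of X are "
                             "linearly dependent")
        M[k], M[pivot] = M[pivot], M[k]
        for r in range(k + 1, n):
            factor = M[r][k] / M[k][k]
            for c in range(k, n + 1):
                M[r][c] -= factor * M[k][c]
    z = [0.0] * n
    for k in reversed(range(n)):  # back substitution
        tail = sum(M[k][c] * z[c] for c in range(k + 1, n))
        z[k] = (M[k][n] - tail) / M[k][k]
    return z


def ols(X, y):
    """Return beta_hat solving (X^T X) beta_hat = X^T y."""
    Xt = transpose(X)
    gram = [[sum(a * b for a, b in zip(ci, cj)) for cj in Xt] for ci in Xt]
    moment = [sum(a * b for a, b in zip(col, y)) for col in Xt]
    return solve(gram, moment)


# Intercept column of ones plus one data column; n = 3 > p = 2.
X = [[1.0, 0.0],
     [1.0, 1.0],
     [1.0, 2.0]]
y = [1.0, 2.0, 2.0]
beta_hat = ols(X, y)

# At the solution the residual vector is orthogonal to every column of X.
residuals = [yi - sum(xij * bj for xij, bj in zip(row, beta_hat))
             for row, yi in zip(X, y)]
orthogonality = [sum(a * r for a, r in zip(col, residuals))
                 for col in transpose(X)]
```

The final check restates the normal equations as $\mathrm {X} ^{\mathsf {T}}(\mathbf {y} -\mathrm {X} {\hat {\boldsymbol {\beta }}})=\mathbf {0}$, which gives a quick way to verify a computed solution.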

The matrix $\mathrm {X} ^{\mathsf {T}}\mathrm {X}$ is known as the Gram matrix and the matrix $\mathrm {X} ^{\mathsf {T}}\mathbf {y}$ is known as the moment matrix of regressand by regressors. Finally, ${\hat {\boldsymbol {\beta }}}$ is the coefficient vector of the least-squares hyperplane, expressed as