tags: linear-regression; statistics; regularization; ridge; machine-learning

Enter the Ridge: derivation and properties

2020-11-01

Intro

Ridge regression is a linear (in the parameters) regression technique. It was popularized in machine learning circles by Friedman, Hastie, Tibshirani and company with the introduction of the elastic-net, which reduces to the Ridge when the LASSO weight is set to zero. It's often taken as the benchmark that other algorithms are compared against, since it's a powerful yet simple technique that gives good results in practice. But its roots lie on deeper grounds: it's an estimation technique that trades a tolerable increase in bias for a reduction in variance.

One of its main selling points is that it's a strictly convex optimization problem. That means it has a single optimal solution, the global one. This solution can be found in closed form, much as with the normal equations of ordinary least squares (OLS). Let's start from there.

Ordinary least squares (OLS)

Arguably, linear regression is the first learning algorithm. It can also be shown to underlie most statistical inference (sometimes quite out of sight). The problem statement is simple: minimize the residual sum of squares in the model below.

y = \beta X + \epsilon

$y$, $\beta$ and $\epsilon$ are row vectors; $X$ is a matrix with independent variables as rows and samples as columns. Our objective function, the residual sum of squares, is then simply $\epsilon\epsilon^T$.

\mathcal L_\text{OLS} = \|\epsilon\|_2^2 = \sum_i^N (y_i - \beta X_i)^2 = \epsilon\epsilon^T = (y - \beta X)(y - \beta X)^T

The minimum can be found by taking the derivative with respect to $\beta$:

\frac{\partial}{\partial \beta}\left(\epsilon\epsilon^T\right) = -2X(y - \beta X)^T

Setting this to zero, we retrieve:

yX^T = \beta XX^T

Solving for $\beta$:

\beta = yX^T(XX^T)^{-1}
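
As a minimal sketch of this closed-form solution (assuming NumPy and the row-vector convention above, with $X$ of shape $p \times N$ and $y$ of shape $1 \times N$; the helper name `ols` is just for illustration):

```python
import numpy as np

def ols(X, y):
    """OLS coefficients for y = beta X + eps, with X of shape (p, N) and y of shape (1, N)."""
    # beta = y X^T (X X^T)^{-1}; solve the linear system instead of inverting explicitly
    return np.linalg.solve(X @ X.T, X @ y.T).T

# Quick check against a known beta
rng = np.random.default_rng(0)
p, N = 3, 200
X = rng.standard_normal((p, N))
beta_true = np.array([[1.0, -2.0, 0.5]])
y = beta_true @ X + 0.1 * rng.standard_normal((1, N))
print(ols(X, y))  # should be close to [1.0, -2.0, 0.5]
```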

The Ridge Regression problem statement

We obtained an estimator for the OLS coefficients. Like all estimators, it has an associated variance (the precision, i.e. the spread of the estimate's distribution) and bias (the accuracy, i.e. the difference between the expected estimate and the true parameter). On the bias front the OLS estimator actually does well: it is unbiased. On the other hand, depending on the problem, it can have quite a large variance. For example, ill-posed problems ($n < p$) have unbounded variance.
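
As a quick illustration of that last point (a sketch, assuming NumPy and the shapes used above), when $n < p$ the matrix $XX^T$ is singular, so the OLS solution is not even uniquely defined:

```python
import numpy as np

rng = np.random.default_rng(0)

# p = 5 variables but only n = 3 samples: X is p x n
p, n = 5, 3
X = rng.standard_normal((p, n))

gram = X @ X.T                      # p x p, but rank at most n < p
print(np.linalg.matrix_rank(gram))  # 3 -> singular, cannot be inverted
```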

What if we could trade some of that excess variance for some increase in bias?

That's one justification for Tikhonov regularization. We add a term to the objective function that prioritizes low-norm solutions; $\Gamma$ is the Tikhonov matrix.

\mathcal L_\text{Tikhonov} = \|y - \beta X\|_2^2 + \|\Gamma\beta^T\|_2^2 = (y - \beta X)(y - \beta X)^T + \beta\Gamma^T\Gamma\beta^T

The derivation is, again, quite straightforward:

\frac{\partial}{\partial \beta}\left(\mathcal L_\text{Tikhonov}\right) = 2\Gamma^T\Gamma\beta^T - 2X(y - \beta X)^T

Setting this to zero:

\Gamma^T\Gamma\beta^T = X(y - \beta X)^T

Transposing:

\beta\Gamma^T\Gamma = (y - \beta X)X^T = yX^T - \beta XX^T

Solving for $\beta$:

\beta = yX^T\left(XX^T + \Gamma^T\Gamma\right)^{-1}

The usual Ridge regression assumes the Tikhonov matrix to be a multiple of the identity: $\Gamma = \sqrt\lambda\, \mathbb I$.
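
As a minimal sketch (same assumptions and conventions as the OLS snippet above; `ridge` is a hypothetical helper name), the closed-form Ridge solution with $\Gamma = \sqrt\lambda\, \mathbb I$ is:

```python
import numpy as np

def ridge(X, y, lam):
    """Ridge coefficients: beta = y X^T (X X^T + lambda I)^{-1}, with X of shape (p, N)."""
    p = X.shape[0]
    # Gamma = sqrt(lambda) I, so Gamma^T Gamma = lambda I
    return np.linalg.solve(X @ X.T + lam * np.eye(p), X @ y.T).T
```

Note that for $\lambda > 0$ the matrix $XX^T + \lambda\mathbb I$ is positive definite, hence invertible, even in the ill-posed $n < p$ case.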

We can see that Tikhonov regularization amounts to the same solution as before, except that we augment $XX^T$ with $\Gamma^T\Gamma$.

In fact, this augmentation is more than figurative (see this answer). Imagine we augmented the samples with virtual samples hand-crafted so that their cross product $X_\text{virtual}X_\text{virtual}^T$ matches $\Gamma^T\Gamma$.

X = \left[\begin{matrix} X_\text{original} & X_\text{virtual} \end{matrix}\right]

If we do that:

X X^T = X_\text{original} X_\text{original}^T + X_\text{virtual} X_\text{virtual}^T = X_\text{original} X_\text{original}^T + \Gamma^T\Gamma

Thus, it suffices to have $X_\text{virtual} = \Gamma^T$.

The rest of the solution requires the product $yX^T$. Since we augmented $X$, we must augment $y$ in a way that keeps this product intact. The correct augmentation is simply a row of zeros (with $p$ elements, where $p$ is the number of variables), so that the product with the virtual samples is null.

y = \left[\begin{matrix} y_\text{original} & y_\text{virtual} \end{matrix}\right] = \left[\begin{matrix} y_\text{original} & \mathbb 0 \end{matrix}\right]

Doing this augmentation, the Ridge solution becomes an ordinary OLS solution on the augmented data:

\beta = y_\text{original} X_\text{original}^T\left(X_\text{original} X_\text{original}^T + \Gamma^T\Gamma\right)^{-1} = y X^T\left(X X^T\right)^{-1}
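
To make the identification concrete, here is a small numerical check (again a sketch, assuming NumPy) that solving the OLS normal equations on the augmented data reproduces the Ridge solution:

```python
import numpy as np

rng = np.random.default_rng(1)
p, N, lam = 4, 50, 2.0
X = rng.standard_normal((p, N))
y = rng.standard_normal((1, N))

# Ridge closed form: beta = y X^T (X X^T + lambda I)^{-1}
beta_ridge = np.linalg.solve(X @ X.T + lam * np.eye(p), X @ y.T).T

# OLS on the augmented problem: X_virtual = Gamma^T = sqrt(lambda) I, y_virtual = 0
X_aug = np.hstack([X, np.sqrt(lam) * np.eye(p)])
y_aug = np.hstack([y, np.zeros((1, p))])
beta_aug = np.linalg.solve(X_aug @ X_aug.T, X_aug @ y_aug.T).T

print(np.allclose(beta_ridge, beta_aug))  # True
```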

Properties

An interesting emergent property of Ridge regression is that, unlike for the OLS solution, predictions and residuals are no longer orthogonal, i.e.:

\hat y\,\epsilon^T = \left(\beta X\right)\left(y - \beta X\right)^T \neq 0

In fact, this very observation[1] was the main motivation for me to write this post.

Due to the identification with the augmentation, however, we can see that this is simply an artifact. Or rather, orthogonality of predictions and residuals is maintained if you take into consideration the virtual samples we introduced. After all, we've shown that the OLS and the Ridge solutions are obtained from the same equations.
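
Spelling this out as a short worked step: the augmented problem satisfies the OLS normal equations, so $\beta X_\text{aug}\left(y_\text{aug} - \beta X_\text{aug}\right)^T = 0$. Splitting the product into original and virtual samples (with $X_\text{virtual} = \Gamma^T$ and $y_\text{virtual} = \mathbb 0$) shows exactly how much correlation is left over on the original samples:

\hat y\,\epsilon^T = \beta X\left(y - \beta X\right)^T = \beta\Gamma^T\Gamma\beta^T \geq 0

For the usual Ridge ($\Gamma = \sqrt\lambda\,\mathbb I$) this is simply $\lambda\|\beta\|_2^2$.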

Another interesting property, or rather interpretation, is the correspondence with Bayesian inference. In the Bayesian point of view, probabilities encode a degree of certainty. In such a framework the linear regression parameters, for example, each have a distribution (and so do most other constructs). By Bayes' theorem, the posterior probability is proportional to the product of the likelihood and the prior; taking the logarithm, that product becomes a sum. The residual sum of squares corresponds to the Gaussian log-likelihood with constant diagonal (isotropic) covariance. The regularization term $\|\Gamma\beta^T\|_2^2$, in turn, corresponds to a Gaussian log-prior on the parameters with covariance $\Sigma = \left(\Gamma^T\Gamma\right)^{-1}$ and zero mean (other means are also possible, and the Tikhonov framework accommodates them). Other regularization schemes may (or may not) correspond to other prior distributions.
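
Writing that correspondence out (a sketch, assuming a Gaussian likelihood with isotropic residual variance $\sigma^2$ and the zero-mean prior $\beta^T \sim \mathcal N(0, \Sigma)$ with $\Sigma = \left(\Gamma^T\Gamma\right)^{-1}$, dropping additive constants):

-\log p(\beta \mid y, X) = \frac{1}{2\sigma^2}\|y - \beta X\|_2^2 + \frac{1}{2}\beta\Gamma^T\Gamma\beta^T + \text{const}

Multiplying through by $2\sigma^2$ recovers the Tikhonov objective with $\Gamma$ rescaled to $\sigma\Gamma$; for the usual Ridge, the effective $\lambda$ is the ratio of the noise variance to the prior variance of the coefficients.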

I may talk a little more about them in future posts, especially the LASSO and Total Variation regularizations. I hope you enjoyed the read thus far! Until the next one!

[1] Why does regularization wreck orthogonality of predictions and residuals in linear regression?