Last-layer Bayes neural nets
Bayesian and other probabilistic inference in overparameterized ML
January 11, 2017 — February 9, 2023
Consider the classic linear model. We have a (column) vector \(\mathbf{y}=[y_1,y_2,\dots,y_n]^T\) of \(n\) observations, and an \(n\times p\) matrix \(\mathbf{X}\) of \(p\) covariates, where each column corresponds to a different covariate and each row to a different observation.
We assume the observations are related to the covariates by \[ \mathbf{y}=\mathbf{Xb}+\mathbf{e} \] where \(\mathbf{b}=[b_1,b_2,\dots,b_p]^T\) is the vector of model parameters, which we don’t yet know, and \(\mathbf{e}\) is the “residual” vector. Legendre and Gauss pioneered the estimation of the parameters of a linear model by minimising the sum of squared residuals, \(\mathbf{e}^T\mathbf{e}\), i.e. \[ \begin{aligned}\hat{\mathbf{b}} &=\operatorname{arg min}_\mathbf{b} (\mathbf{y}-\mathbf{Xb})^T (\mathbf{y}-\mathbf{Xb})\\ &=\operatorname{arg min}_\mathbf{b} \|\mathbf{y}-\mathbf{Xb}\|_2^2\\ &=\mathbf{X}^+\mathbf{y} \end{aligned} \] where the pseudo-inverse \(\mathbf{X}^+\) is computed by a numerical solver, using one of the many carefully optimised methods that exist for least-squares problems.
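To make that concrete, here is a minimal numerical sketch in numpy; the synthetic data, noise scale, and coefficient values are invented purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3

# Synthetic data: covariates X, (made-up) true coefficients, noisy observations y.
X = rng.normal(size=(n, p))
b_true = np.array([1.5, -2.0, 0.5])
y = X @ b_true + 0.1 * rng.normal(size=n)

# Least-squares estimate via the pseudo-inverse, b_hat = X^+ y ...
b_hat_pinv = np.linalg.pinv(X) @ y

# ... or, more typically, via a dedicated least-squares solver.
b_hat_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(b_hat_pinv)
print(b_hat_lstsq)
```

Both routes give the same estimate here; in practice one prefers the dedicated solver, which avoids forming the pseudo-inverse explicitly.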
So far there is no statistical argument, merely function approximation.
However, it turns out that if we assume the errors \(e_i\) are i.i.d. (or at least independent with zero mean and constant variance), then this procedure also has a statistical justification: under i.i.d. Gaussian errors the least-squares estimate coincides with the maximum-likelihood estimate, and under the weaker constant-variance assumption the Gauss–Markov theorem tells us it is the best linear unbiased estimator.
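A minimal sketch of the maximum-likelihood linkage, assuming i.i.d. Gaussian noise \(e_i\sim\mathcal{N}(0,\sigma^2)\) and writing \(\mathbf{x}_i^T\) for the \(i\)-th row of \(\mathbf{X}\): \[ \begin{aligned} \log p(\mathbf{y}\mid\mathbf{X},\mathbf{b},\sigma^2) &=\sum_{i=1}^{n}\log \mathcal{N}(y_i \mid \mathbf{x}_i^T\mathbf{b}, \sigma^2)\\ &=-\frac{n}{2}\log(2\pi\sigma^2) -\frac{1}{2\sigma^2}\|\mathbf{y}-\mathbf{Xb}\|_2^2, \end{aligned} \] so maximising the likelihood over \(\mathbf{b}\) is exactly minimising the squared residuals, whatever the value of \(\sigma^2\).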
🏗 more exposition of these. Linkage to Maximum likelihood.
For now, handball to Lu (2022).