Linear Regression Part 2: Validation

The problem with OLS is that it can fit the training data too well. The test data may then have a pretty awful \(R^2\), which screams overfitting. This post, taken from the MMAI 863 course at Smith, details some ways to bring the training fit and the test fit closer together.

The Residual Sum of Squares (RSS) is a form of loss function: the cost associated with the selected parameters. We want to minimize the loss function on the test data, without (of course) actually operating on the test data.

Shrinkage

Shrinkage is a form of regularization closely related to feature selection. Each feature in a multivariate dataset has a stronger or weaker influence on the fit. Sometimes one needs to lessen the effect of (or eliminate entirely) the weaker influences to prevent overfitting; sometimes one wants to rein in the outsized effect of the stronger ones.

Two regression techniques based on OLS are Lasso Regression and Ridge Regression. Both work by selecting a hyperparameter: a value decided by you, the user, that adjusts the results of the algorithm. Usually the data scientist will set the hyperparameter, run the algorithm, then reset it and rerun to see whether the training and test fits improve.

Lasso and Ridge Regressions make use of a hyperparameter named Lambda (\(\lambda\)).

Lasso Regression

Lasso Regression lowers the effect of the weaker influences. If its \(\lambda\) is sufficiently large, its effect on the optimization (i.e., changing the parameters over several runs to find the minimum of the loss function) is to set the coefficients \(\beta\) of the weakest features to 0.

Recall that for multivariate OLS, we want to minimize RSS:

$$ RSS = {\sum^n_{i=1}(y_i-\beta_0-\sum^p_{j=1}\beta_{j}x_{ij})^2} $$
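As a concrete illustration, here is a minimal sketch of computing RSS for a fitted multivariate OLS model. The toy data, variable names, and the use of numpy and scikit-learn are my own assumptions, not part of the course material.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data (illustrative only): n = 100 observations, p = 3 features
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 2.0 + X @ np.array([1.5, -0.7, 0.3]) + rng.normal(scale=0.5, size=100)

ols = LinearRegression().fit(X, y)

# RSS: sum of squared differences between observed and predicted y
rss = np.sum((y - ols.predict(X)) ** 2)
print(f"Training RSS: {rss:.2f}")
```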

For lasso regression, we want to minimize L1:

$$ L1 = RSS + \lambda{\sum^p_{j=1}|\beta_j|} $$
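Here is a minimal sketch of lasso regression with scikit-learn; the toy data and the choice of penalty are assumptions on my part. Note that scikit-learn calls the \(\lambda\) hyperparameter alpha.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Toy data (illustrative only): the third feature has a very weak influence
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 2.0 + X @ np.array([1.5, -0.7, 0.01]) + rng.normal(scale=0.5, size=100)

# alpha plays the role of lambda; 0.1 is an arbitrary choice
lasso = Lasso(alpha=0.1).fit(X, y)

# With a large enough penalty, the weakest coefficient is driven to (or very near) zero
print(lasso.coef_)
```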

Ridge Regression

Ridge regression looks a lot like lasso regression, but rather than penalizing the absolute value of \(\beta_j\), we penalize its square.

We want to minimize L2:

$$ L2 = RSS + \lambda{\sum^p_{j=1}\beta^2_j} $$

Ridge regression will always keep all the features, no matter the value of \(\lambda\). The penalty instead has the effect of shrinking the coefficients, hitting the largest ones hardest. Ridge regression is subject to interpretation problems if the number of predictors p is high.
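A matching sketch for ridge regression, under the same assumptions (toy data, scikit-learn's alpha standing in for \(\lambda\)):

```python
import numpy as np
from sklearn.linear_model import Ridge

# Toy data (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 2.0 + X @ np.array([1.5, -0.7, 0.3]) + rng.normal(scale=0.5, size=100)

# alpha plays the role of lambda; 10.0 is an arbitrary choice
ridge = Ridge(alpha=10.0).fit(X, y)

# All three coefficients stay in the model, but they are shrunk toward zero,
# with the largest ones shrinking by the largest amounts
print(ridge.coef_)
```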

Validation

Validation gives a direct estimate of what the test error is going to be, but it is computationally expensive. Basically, one pulls a random subset of the training data out and uses it as intermediate test data. If the results suck, one can readjust the model (i.e., change hyperparameters or even replace the model) without compromising the actual test data.
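A minimal sketch of a single validation split, assuming scikit-learn and made-up training data; the 20% holdout fraction and the ridge model are arbitrary choices.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge

# Toy "training data" standing in for whatever you actually have
rng = np.random.default_rng(1)
X_train = rng.normal(size=(200, 3))
y_train = 2.0 + X_train @ np.array([1.5, -0.7, 0.3]) + rng.normal(scale=0.5, size=200)

# Pull 20% of the training data out as an intermediate (validation) set
X_sub, X_val, y_sub, y_val = train_test_split(
    X_train, y_train, test_size=0.2, random_state=42)

ridge = Ridge(alpha=1.0).fit(X_sub, y_sub)

# R^2 on the held-out data estimates test performance without touching the real test set
print(ridge.score(X_val, y_val))
```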

Cross-validation

Cross-validation is the repetition of validation with different holdouts. Two example techniques are LOOCV (leave-one-out cross-validation) and K-Fold.

LOOCV

Take out one instance from the data set as the validator. Train on the remaining data. Test against the validator. Put the instance back and take out another instance as the validator. Rinse and repeat.
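A minimal LOOCV sketch with scikit-learn, again on made-up data; the model and scoring choice are assumptions.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.linear_model import LinearRegression

# Small toy data set, since LOOCV refits the model n times
rng = np.random.default_rng(2)
X = rng.normal(size=(30, 2))
y = 1.0 + X @ np.array([2.0, -1.0]) + rng.normal(scale=0.3, size=30)

# One fit per observation: train on n-1 points, validate on the single holdout
scores = cross_val_score(LinearRegression(), X, y,
                         cv=LeaveOneOut(), scoring="neg_mean_squared_error")
print(-scores.mean())  # average squared error across the n holdouts
```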

K-Fold

Create k subsets of the data set, randomly populated. Similarly to LOOCV, pull one subset (a fold) out. Train against the remaining data. Validate against the fold. Put the fold back, and withdraw another fold. Rinse and repeat.
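And the K-Fold equivalent, with k = 5 as an arbitrary choice:

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LinearRegression

# Toy data (illustrative only)
rng = np.random.default_rng(3)
X = rng.normal(size=(100, 3))
y = 2.0 + X @ np.array([1.5, -0.7, 0.3]) + rng.normal(scale=0.5, size=100)

# 5 folds: each fold takes one turn as the validation set
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=kfold)
print(scores)         # one R^2 per fold
print(scores.mean())  # the averaged cross-validation estimate
```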

Minimizing the Loss Function through Gradient Descent

The previous post had an equation that determined the optimal one-dimensional OLS fit in a single pass. Multivariate OLS, among other algorithms, needs several tries. But how does one determine whether the parameters need to be adjusted up or down? How does one know the parameters are as good a fit as they can get? One answer: convex optimization, as detailed by the Lagrange multiplier in Basic Calculus 4.

Gradient Descent is an algorithm that takes small steps in the direction in which the function decreases the most, i.e., opposite to the gradient (the direction of steepest increase). It stops when the slope reaches 0 or the loss starts climbing again.

$$ \beta^{\tau+1}_i = \beta^\tau_i-\eta\nabla $$

where \(\tau\) is the iteration number, \(\eta\) is the learning rate (the size of each step), and \(\nabla\) is the gradient of the loss function with respect to \(\beta_i\), evaluated at the current \(\beta^\tau\).

The smaller the \(\eta\), the more likely the gradient descent process is to settle at slope = 0, but the longer it takes. A smaller \(\eta\) is also more susceptible to getting stuck in a local minimum rather than reaching the global minimum.
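To make the update rule concrete, here is a from-scratch sketch of gradient descent on the RSS loss for linear regression. The stopping tolerance, step count, and learning rate are assumptions chosen so the toy example converges; this is not the course's reference implementation.

```python
import numpy as np

def gradient_descent_ols(X, y, eta=0.001, n_steps=1000, tol=1e-8):
    """Minimize RSS for linear regression by plain gradient descent."""
    Xb = np.column_stack([np.ones(len(X)), X])  # prepend a column of 1s for beta_0
    beta = np.zeros(Xb.shape[1])
    for _ in range(n_steps):
        residuals = y - Xb @ beta
        grad = -2.0 * Xb.T @ residuals          # gradient of RSS with respect to beta
        if np.linalg.norm(grad) < tol:          # effectively slope = 0
            break
        beta -= eta * grad                      # step opposite to the gradient
    return beta

# Toy data (illustrative only); the result should land near [1.0, 2.0, -1.0]
rng = np.random.default_rng(4)
X = rng.normal(size=(100, 2))
y = 1.0 + X @ np.array([2.0, -1.0]) + rng.normal(scale=0.3, size=100)

print(gradient_descent_ols(X, y))
```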

Batches

Determining the best learning rate

We can use a grid search over the hyperparameter values to find the best \(\eta\).

Populate a grid with candidate values of the tuning parameter (\(\lambda\) for lasso or ridge, \(\eta\) here). Get the cross-validation error for each grid element. Select the tuning-parameter value from the box with the lowest error. Re-fit the model with that value.
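A minimal grid-search sketch using scikit-learn's GridSearchCV, with ridge's \(\lambda\) (alpha) as the tuning parameter; the grid values and the 5-fold setting are arbitrary choices, and the same pattern applies to a learning rate.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge

# Toy data (illustrative only)
rng = np.random.default_rng(5)
X = rng.normal(size=(100, 3))
y = 2.0 + X @ np.array([1.5, -0.7, 0.3]) + rng.normal(scale=0.5, size=100)

# The grid of candidate lambda (alpha) values is an arbitrary choice
grid = {"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}

# Each candidate is scored with 5-fold cross-validation;
# the winner is automatically re-fit on the full data set
search = GridSearchCV(Ridge(), grid, cv=5).fit(X, y)

print(search.best_params_)     # the lambda with the best cross-validated score
print(search.best_estimator_)  # the re-fit model
```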