Machine Learning Concepts
This post is a simple list of concepts the student should know before tackling the details behind ML algorithms such as linear regression. Although it is based on the MMAI863 Session Four slides at Smith School of Business, “statistical learning” will be skipped as it is not used for the rest of the programme.
General Concepts
- Inputs vs Outputs
  - Inputs are also known as predictors, independent variables, features, or X.
  - Outputs are also known as responses, dependent variables, or Y.
- Inference vs Prediction
  - Inference means understanding the relationship between X and Y.
  - Prediction means estimating future Y for given future X.
- Supervised, Unsupervised, and Reinforcement Learning
  - Supervised: Assume that, given a set of predictors and their associated responses, \(Y=f(X) + \epsilon\), where \(\epsilon\) is unavoidable random noise. The point of supervised ML is to discover an efficient function f() that mimics the relation between X and Y, but isn’t influenced by \(\epsilon\) (a minimal sketch follows this list).
  - Unsupervised: There is a given set of predictors, but there are no responses at all, i.e., for set X there is no set Y. The point of unsupervised ML is to find relationships between the variables in set X.
  - Reinforcement Learning: Develop an algorithm based on feedback from the environment.
- Non-parametric estimation of f:
  - Do not assume a form of f.
  - Try several forms of f and tweak each one close enough to the data points, but not so close that statistical random noise makes f too “wiggly.” Being influenced by noise is called “overfitting.”
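To make the supervised setup concrete, here is a minimal sketch (it assumes numpy, which is not part of the course material, and the choice of a linear f is mine): a known f, irreducible noise \(\epsilon\), and an estimate \(\hat{f}\) recovered only from the (X, Y) pairs.

import numpy as np

rng = np.random.default_rng(42)

# The "true" relationship f (normally unknown) plus irreducible noise epsilon
def f(x):
    return 2.0 * x + 1.0

X = np.linspace(0, 10, 50)
epsilon = rng.normal(0.0, 1.0, size=X.shape)
Y = f(X) + epsilon

# Supervised learning: estimate f_hat from the (X, Y) pairs alone
coefficients = np.polyfit(X, Y, deg=1)   # fit a straight line
f_hat = np.poly1d(coefficients)

print(coefficients)   # close to [2.0, 1.0], but never exactly f
print(f_hat(12.0))    # a prediction for a future X

The fitted coefficients land close to, but never exactly on, the true ones; the gap that remains is due to the irreducible noise.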
Model Lifecycle
- Collect data. This takes a while and can be frustrating and expensive.1
- Prepare/clean the data: fix missing data, eliminate errors, treat outliers.
- Perform exploratory data analysis, then add feature engineering (new features derived from existing data; for example, dates in sales data can reveal seasonal patterns).
- Develop (“train”) the algorithm (the “model”):
  - Split the collected data 50/25/25% for training, validation, and testing (see the sketch after this list).
  - “Fit” (train) the model with the 50% of the data allocated for training.
  - Adjust hyperparameters with the validation data.
  - Test the algorithm on the test data.
- Deploy the algorithm.2
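A minimal sketch of the 50/25/25 split mentioned above, assuming scikit-learn’s train_test_split and placeholder data (the variable names are mine):

from sklearn.model_selection import train_test_split

# Placeholder features and responses standing in for the prepared data set
X = list(range(100))
y = list(range(100))

# First set aside 50% for training...
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)

# ...then split the remainder evenly, giving 25% validation and 25% test overall
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

print(len(X_train), len(X_val), len(X_test))   # 50 25 25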
Model Accuracy
Remember, predictions are future Y values generated by the model based on any future X inputs. The testing data is used as “future X” and the “future Y” results are compared against the real testing responses.
Predicted Y is denoted with the “hat” symbol: \(\hat{Y}\). The estimated function f (our model) developed during training is denoted \(\hat{f}\). So, \(\hat{Y}=\hat{f}(X)\).
Note that \(\hat{f}\) will never be exactly f. There may be no such thing as a perfect estimate of f. As the model gets trained, the reducible errors are driven down. However, unmeasurable variations such as human behaviour or unavailable predictors will still knock \(\hat{f}\) away from f. These irreducible errors are denoted \(\epsilon\); the model should not try to fit them during training, otherwise it overfits.
Accuracy vs Interpretability Trade-off
- The more restrictive the model, the more interpretable it is.
- The more flexible the model, the more accurately it can fit the data. However, more flexible models are prone to overfitting.
Regression vs Classification
- A Regression Model deals with quantitative outputs: numerical values.
- A Classification Model deals with qualitative outputs: discrete categories such as {duck, dog, mouse, swan}.
Assessing Regression Model Accuracy
Since no single ML approach works best for all data sets, one needs to select the best model through a measure of quality of fit. For example, a common measure for regression models is the mean squared error (MSE):
$$ MSE = {1\over{n}}{\sum^n_{i=1}}{(y_{i}-\hat{f}(x_i))^2} $$
Or more simply:
$$ MSE = {1\over{n}}{\sum^n_{i=1}}{(y_{i}-\hat{y}_i)^2} $$
In Python terms:
def mse(Y: list, Y_pred: list) -> float:
    sigma = 0
    n = len(Y)
    for i in range(n):
        sigma += (Y[i] - Y_pred[i])**2
    return sigma/n
Or more simply,
from sklearn.metrics import mean_squared_error as mse
Let’s illustrate. Let us say the test responses (i.e., the outputs) for X are
\(Y=[1.3,2.4,3.2,5.5,7.3,7.9,8.0]\). This is the 25% of the original dataset
set aside for testing.
Let us say that the trained model ran through X and came up with its
predicted responses \(\hat{Y}=[1.3,2.5,3.1,4.8,7.6,8.0,8.5]\).
Y = [1.3,2.4,3.2,5.5,7.3,7.9,8.0] # From test dataset
Y_pred = [1.3,2.5,3.1,4.8,7.6,8.0,8.5]
mse(Y, Y_pred)
0.12285714285714287
The goal of measuring accuracy is to train the model by assessing and lowering its training MSE but, in the end, to find the model that gives the lowest test MSE. If one has trained the model to a low training MSE but the test MSE is much higher, then the model has overfitted the data: the irreducible errors have had too much influence. To illustrate, here are predictions that are further off:
Y_pred = [0.5,2.8,3.5,4.1,8.6,10.0,10.5]
mse(Y, Y_pred)
2.1714285714285713
Training goes through three stages (illustrated in the sketch after this list):
- Stage 1: underfitting
- Stage 2: good fit (the ideal scenario: low training and test MSE)
- Stage 3: overfitting
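A hedged sketch of the three stages (it assumes numpy and uses a polynomial-degree sweep of my own choosing, not something from the slides): as flexibility grows, the training MSE keeps falling, while the test MSE typically bottoms out and then rises again.

import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def make_data(n: int):
    # Noisy samples from a "true" f(x) = sin(2x)
    x = rng.uniform(0, 3, n)
    return x, np.sin(2 * x) + rng.normal(0, 0.2, n)

def mse(y, y_pred) -> float:
    return float(np.mean((y - y_pred) ** 2))

x_train, y_train = make_data(20)
x_test, y_test = make_data(200)

for degree in (1, 5, 15):   # roughly: underfit, good fit, overfit
    model = Polynomial.fit(x_train, y_train, degree)
    print(degree, mse(y_train, model(x_train)), mse(y_test, model(x_test)))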
Assessing Classification Model Accuracy
One cannot use MSE on qualitative responses.
$$ (\text{duck} - \text{mouse})^2 = ? $$
Classification models use several techniques for model accuracy assessment:
- Confusion Matrix
- Accuracy
- Precision
- Recall
- F1 Score
- Fβ Score
- The ROC Curve and the AUC
The Confusion Matrix
The confusion matrix shows the number of true and false positives, and true and false negatives.
| | Predicted null | Predicted non-null | Total |
|---|---|---|---|
| True null | True negative (TN) | False positive (FP) | N |
| True non-null | False negative (FN) | True positive (TP) | P |
| Total | \(\hat{N}\) | \(\hat{P}\) | |
- False Positives are type I errors
- False Negatives are type II errors.
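A minimal sketch of computing the matrix above with scikit-learn (the labels are made up; 1 stands for non-null, 0 for null):

from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]   # 0 = null, 1 = non-null
y_pred = [0, 0, 0, 1, 0, 1, 1, 1, 1, 1]

# For binary labels the matrix is laid out as
# [[TN, FP],
#  [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)   # 3 1 1 5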
Accuracy
The proportion of predictions the model gets right. It works best with balanced data sets.
$$ Accuracy = {{TP + TN}\over{TP + TN + FP + FN}} $$
Conversely, it is horrible with unbalanced data sets.
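A tiny sketch of why (the labels are made up): on an unbalanced data set, a model that always predicts the majority class scores a high accuracy while catching none of the non-null cases.

from sklearn.metrics import accuracy_score

# 95 null cases, 5 non-null cases; the "model" always predicts null
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100

print(accuracy_score(y_true, y_pred))   # 0.95, yet every non-null case is missed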
Precision
\(P={TP\over{TP+FP}}\) The proportion of predicted non-nulls (positives) that are actually non-null.
Recall
\(R={TP\over{TP+FN}}\) The proportion of true non-nulls correctly predicted.
F1 Score
\(F_1={2PR\over{P+R}}\) The harmonic mean of precision and recall; it balances the two metrics but sits closer to the lower one.
Fβ Score
\(F_\beta={(1+\beta^{2})PR\over{\beta^{2}P+R}}\) generalizes the F1 Score. β is an adjustment that gives more weight to either Precision (β < 1) or Recall (β > 1).
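A hedged sketch of these scores with scikit-learn, reusing the made-up labels from the confusion-matrix sketch:

from sklearn.metrics import precision_score, recall_score, f1_score, fbeta_score

y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 0, 1, 1, 1, 1, 1]

p = precision_score(y_true, y_pred)        # TP / (TP + FP) = 5/6
r = recall_score(y_true, y_pred)           # TP / (TP + FN) = 5/6
f1 = f1_score(y_true, y_pred)              # harmonic mean of p and r
f2 = fbeta_score(y_true, y_pred, beta=2)   # beta > 1 weights recall more heavily
print(p, r, f1, f2)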
ROC Curve and the AUC
One can influence classification models through a probability threshold. By default, a case is predicted as non-null when \(Pr(\text{non-null} \mid X=x) > 0.5\), i.e., better than even odds. Changes to this threshold should be decided by someone (yes, a real person) with domain knowledge.
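A small sketch of the effect of moving that threshold (made-up predicted probabilities, assuming numpy): raising it trades false positives for false negatives.

import numpy as np

# Made-up predicted probabilities of the non-null class and the true labels
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y_score = np.array([0.1, 0.3, 0.2, 0.6, 0.4, 0.8, 0.7, 0.9, 0.65, 0.95])

for threshold in (0.3, 0.5, 0.7):
    y_pred = (y_score > threshold).astype(int)        # predict non-null when Pr > threshold
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))   # type I errors
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))   # type II errors
    print(threshold, fp, fn)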
Alternatively, one can look at an ROC curve to see the effect of different thresholds on the type I (FP) and type II (FN) errors. The overall performance of a classification model across all thresholds is summarized by the Area Under the Curve (AUC). The bigger the AUC, the lower the two types of error, and the better the model.
For an example AUC and ROC, see https://machinelearningmastery.com/roc-curves-and-precision-recall-curves-for-classification-in-python/. Know the difference between Sensitivity (the TP rate) and 1-Specificity (the FP rate): the ROC curve plots the former against the latter.
An AUC of 1.0 is perfect. An AUC of 0.5 is no better than random.
ROC is the acronym for “Receiver Operating Characteristics.”
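A minimal sketch with scikit-learn, reusing the made-up scores from the threshold sketch above:

from sklearn.metrics import roc_curve, roc_auc_score

y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_score = [0.1, 0.3, 0.2, 0.6, 0.4, 0.8, 0.7, 0.9, 0.65, 0.95]

# The ROC curve plots sensitivity (TP rate) against 1 - specificity (FP rate)
# at every threshold; the AUC summarizes it in a single number.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(roc_auc_score(y_true, y_score))   # about 0.96 for these made-up scores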
1. Much data is proprietary or subject to misuse. In practice, forms and contracts may need to be signed to use the data. ↩︎
2. Later courses teach that once the hyperparameters have been decided on and the testing does not show overfitting, the algorithm is retrained on the entire data set (do NOT readjust the hyperparameters) and then deployed. ↩︎