Machine Learning Concepts
This post is a simple list of concepts the student should know before tackling the details behind ML algorithms such as linear regression. Although it is based on the MMAI863 Session Four slides at Smith School of Business, “statistical learning” will be skipped as it is not used for the rest of the programme.
General Concepts
- Inputs vs Outputs
  - Inputs are also known as predictors, independent variables, features, or X.
  - Outputs are also known as responses, dependent variables, or Y.
- Inference vs Prediction
  - Inference means understanding the relationship between X and Y.
  - Prediction means estimating future Y for given future X.
- Supervised, Unsupervised, and Reinforcement Learning
  - Supervised: Assume that, given a set of predictors and their associated responses, \(Y=f(X) + \epsilon\), where \(\epsilon\) is unavoidable random noise. The point of supervised ML is to discover an efficient function f() that mimics the relation between X and Y, but isn’t influenced by \(\epsilon\) (a minimal sketch follows this list).
  - Unsupervised: There is a given set of predictors, but there are no responses at all, i.e., for set X there is no set Y. The point of unsupervised ML is to find relationships between the variables in set X.
  - Reinforcement Learning: Develop an algorithm based on feedback from the environment.
- Non-parametric estimation of f:
  - Do not assume a form of f.
  - Try several forms of f and tweak each one close enough to the data points, but not so close that statistical random noise makes f too “wiggly.” Being influenced by noise is called “overfitting.”
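To make the supervised setup concrete, here is a minimal sketch (it assumes numpy, which is not part of the course material, and the choice of a linear f is mine): a known f, irreducible noise \(\epsilon\), and an estimate \(\hat{f}\) recovered only from the (X, Y) pairs.

import numpy as np

rng = np.random.default_rng(42)

# The "true" relationship f (normally unknown) plus irreducible noise epsilon
def f(x):
    return 2.0 * x + 1.0

X = np.linspace(0, 10, 50)
epsilon = rng.normal(0.0, 1.0, size=X.shape)
Y = f(X) + epsilon

# Supervised learning: estimate f_hat from the (X, Y) pairs alone
coefficients = np.polyfit(X, Y, deg=1)   # fit a straight line
f_hat = np.poly1d(coefficients)

print(coefficients)   # close to [2.0, 1.0], but never exactly f
print(f_hat(12.0))    # a prediction for a future X

The fitted coefficients land close to, but never exactly on, the true ones; the gap that remains is due to the irreducible noise.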
Model Lifecycle
- Collect data. This takes a while and can be frustrating and expensive.1
- Prepare/clean the data: fix missing data, eliminate errors, treat outliers.
- Perform exploratory data analysis, then add feature engineering (new features derived from existing data; for example, dates in sales data can reveal seasonal patterns).
- Develop (“train”) the algorithm (the “model”):
  - Split the collected data 50/25/25% for training, validation, and testing (see the sketch after this list).
  - “Fit” (train) the model with the 50% of the data allocated for training.
  - Adjust hyperparameters with the validation data.
  - Test the algorithm on the test data.
- Deploy the algorithm.2
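A minimal sketch of the 50/25/25 split mentioned above, assuming scikit-learn’s train_test_split and placeholder data (the variable names are mine):

from sklearn.model_selection import train_test_split

# Placeholder features and responses standing in for the prepared data set
X = list(range(100))
y = list(range(100))

# First set aside 50% for training...
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)

# ...then split the remainder evenly, giving 25% validation and 25% test overall
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

print(len(X_train), len(X_val), len(X_test))   # 50 25 25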
Model Accuracy
Remember, predictions are future Y values generated by the model based on any future X inputs. The testing data is used as “future X” and the “future Y” results are compared against the real testing responses.
Predicted Y is denoted with the “hat” symbol: \(\hat{Y}\). The estimated function f (our model) developed during training is denoted \(\hat{f}\). So, \(\hat{Y}=\hat{f}(X)\).
Note that \(\hat{f}\) will never be exactly f. There may be no such thing as a perfect estimate of f. As the model gets trained, the reducible errors are driven down. However, unmeasurable variations such as human behaviour or unavailable predictors will still knock \(\hat{f}\) away from f. These irreducible errors are denoted \(\epsilon\); the model should not try to fit them during training, otherwise it overfits.
Accuracy vs Interpretability Trade-off
- The more restrictive the model, the more interpretable it is.
- The more flexible the model, the more accurately it can fit the data. However, more flexible models are prone to overfitting.
Regression vs Classification
- A Regression Model deals with quantitative outputs: numerical values.
- A Classification Model deals with qualitative outputs: discrete categories such as {duck, dog, mouse, swan}.
Assessing Regression Model Accuracy
Since no single ML approach works best for all data sets, one needs to select the best model through a measure of quality of fit. For example, a common measure for regression models is the mean squared error (MSE):
$$ MSE = {1\over{n}}{\sum^n_{i=1}}{(y_{i}-\hat{f}(x_i))^2} $$
Or more simply:
$$ MSE = {1\over{n}}{\sum^n_{i=1}}{(y_{i}-\hat{y}_i)^2} $$
In Python terms:
def mse(Y: list, Y_pred: list) -> float:
    sigma = 0
    n = len(Y)
    for i in range(n):
        sigma += (Y[i] - Y_pred[i])**2
    return sigma/n
Or more simply,
from sklearn.metrics import mean_squared_error as mse
Let’s illustrate. Let us say the test responses (i.e., the outputs) for X are
\(Y=[1.3,2.4,3.2,5.5,7.3,7.9,8.0]\). This is the 25% of the original dataset
set aside for testing.
Let us say that the trained model ran through X and came up with its
predicted responses \(\hat{Y}=[1.3,2.5,3.1,4.8,7.6,8.0,8.5]\).
Y = [1.3,2.4,3.2,5.5,7.3,7.9,8.0] # From test dataset
Y_pred = [1.3,2.5,3.1,4.8,7.6,8.0,8.5]
mse(Y, Y_pred)
0.12285714285714287
The goal of measuring accuracy is to train the model by assessing and lowering its training MSE but, in the end, to find the model that gives the lowest test MSE. If one has trained the model to a low training MSE but the test MSE is much higher, then the model has overfitted the data: the irreducible errors have had too much influence. To illustrate, here are predictions that are further off:
Y_pred = [0.5,2.8,3.5,4.1,8.6,10.0,10.5]
mse(Y, Y_pred)
2.1714285714285713
Training goes through three stages (illustrated in the sketch after this list):
- Stage 1: underfitting
- Stage 2: good fit (the ideal scenario: low training and test MSE)
- Stage 3: overfitting
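A hedged sketch of the three stages (it assumes numpy and uses a polynomial-degree sweep of my own choosing, not something from the slides): as flexibility grows, the training MSE keeps falling, while the test MSE typically bottoms out and then rises again.

import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def make_data(n: int):
    # Noisy samples from a "true" f(x) = sin(2x)
    x = rng.uniform(0, 3, n)
    return x, np.sin(2 * x) + rng.normal(0, 0.2, n)

def mse(y, y_pred) -> float:
    return float(np.mean((y - y_pred) ** 2))

x_train, y_train = make_data(20)
x_test, y_test = make_data(200)

for degree in (1, 5, 15):   # roughly: underfit, good fit, overfit
    model = Polynomial.fit(x_train, y_train, degree)
    print(degree, mse(y_train, model(x_train)), mse(y_test, model(x_test)))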
Assessing Classification Model Accuracy
One cannot use MSE on qualitative responses.
$$ (\text{duck} - \text{mouse})^2 = ? $$
Classification models use several techniques for model accuracy assessment:
- Confusion Matrix
- Accuracy
- Precision
- Recall
- F1 Score
- Fβ Score
- The ROC Curve and the AUC
The Confusion Matrix
The confusion matrix shows the number of true and false positives, and true and false negatives.
| | Predicted null | Predicted non-null | Total |
|---|---|---|---|
| True null | True negative (TN) | False positive (FP) | N |
| True non-null | False negative (FN) | True positive (TP) | P |
| Total | \(\hat{N}\) | \(\hat{P}\) | |
- False Positives are type I errors
- False Negatives are type II errors.
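A minimal sketch of computing the matrix above with scikit-learn (the labels are made up; 1 stands for non-null, 0 for null):

from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]   # 0 = null, 1 = non-null
y_pred = [0, 0, 0, 1, 0, 1, 1, 1, 1, 1]

# For binary labels the matrix is laid out as
# [[TN, FP],
#  [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)   # 3 1 1 5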
Accuracy
The proportion of predictions the model gets right. It works best with balanced data sets.
$$ Accuracy = {{TP + TN}\over{TP + TN + FP + FN}} $$
Conversely, it is horrible with unbalanced data sets.
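A tiny sketch of why (the labels are made up): on an unbalanced data set, a model that always predicts the majority class scores a high accuracy while catching none of the non-null cases.

from sklearn.metrics import accuracy_score

# 95 null cases, 5 non-null cases; the "model" always predicts null
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100

print(accuracy_score(y_true, y_pred))   # 0.95, yet every non-null case is missed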
Precision
\(P={TP\over{TP+FP}}\) The proportion of predicted non-nulls (positives) that are actually non-null.
Recall
\(R={TP\over{TP+FN}}\) The proportion of true non-nulls correctly predicted.
F1 Score
\(F_1={2PR\over{P+R}}\) The harmonic mean of precision and recall; it balances the two metrics but sits closer to the lower one.
Fβ Score
\(F_\beta={(1+\beta^{2})PR\over{\beta^{2}P+R}}\) generalizes the F1 Score. β is an adjustment that gives more weight to either Precision (β < 1) or Recall (β > 1).
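A hedged sketch of these scores with scikit-learn, reusing the made-up labels from the confusion-matrix sketch:

from sklearn.metrics import precision_score, recall_score, f1_score, fbeta_score

y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 0, 1, 1, 1, 1, 1]

p = precision_score(y_true, y_pred)        # TP / (TP + FP) = 5/6
r = recall_score(y_true, y_pred)           # TP / (TP + FN) = 5/6
f1 = f1_score(y_true, y_pred)              # harmonic mean of p and r
f2 = fbeta_score(y_true, y_pred, beta=2)   # beta > 1 weights recall more heavily
print(p, r, f1, f2)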
ROC Curve and the AUC
One can influence classification models through a probability threshold. By default, a case is predicted as non-null when \(Pr(\text{non-null} \mid X=x) > 0.5\), i.e., better than even odds. Changes to this threshold should be decided by someone (yes, a real person) with domain knowledge.
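A small sketch of the effect of moving that threshold (made-up predicted probabilities, assuming numpy): raising it trades false positives for false negatives.

import numpy as np

# Made-up predicted probabilities of the non-null class and the true labels
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y_score = np.array([0.1, 0.3, 0.2, 0.6, 0.4, 0.8, 0.7, 0.9, 0.65, 0.95])

for threshold in (0.3, 0.5, 0.7):
    y_pred = (y_score > threshold).astype(int)        # predict non-null when Pr > threshold
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))   # type I errors
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))   # type II errors
    print(threshold, fp, fn)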
Alternatively, one can look at an ROC curve to see the effect of different thresholds on the type I (FP) and type II (FN) errors. The overall performance of a classification model across all thresholds is summarized by the Area Under the Curve (AUC). The bigger the AUC, the lower the two types of error, and the better the model.
For an example AUC and ROC, see https://machinelearningmastery.com/roc-curves-and-precision-recall-curves-for-classification-in-python/. Know the difference between Sensitivity (the TP rate) and 1-Specificity (the FP rate): the ROC curve plots the former against the latter.
An AUC of 1.0 is perfect. An AUC of 0.5 is no better than random.
ROC is the acronym for “Receiver Operating Characteristics.”
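A minimal sketch with scikit-learn, reusing the made-up scores from the threshold sketch above:

from sklearn.metrics import roc_curve, roc_auc_score

y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_score = [0.1, 0.3, 0.2, 0.6, 0.4, 0.8, 0.7, 0.9, 0.65, 0.95]

# The ROC curve plots sensitivity (TP rate) against 1 - specificity (FP rate)
# at every threshold; the AUC summarizes it in a single number.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(roc_auc_score(y_true, y_score))   # about 0.96 for these made-up scores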
1. Much data is proprietary or subject to misuse. In practice, forms and contracts may need to be signed to use the data. ↩︎
2. Later courses teach that once the hyperparameters have been decided on and the testing does not show overfitting, the algorithm is retrained on the entire data set (do NOT readjust the hyperparameters) and then deployed. ↩︎