
Section 6 Parameter Tuning

All of our models, for both classification and regression techniques, require finding values of \(\Theta\) that work well for the model and for our desired use of the data. But there are also many other parameters that we can vary as part of our optimization technique. How do we find the right mix of parameters? How do we deal with models that are overfitting or underfitting? What does it mean for a model to work well? What does desired use of the data mean? We will address various aspects of these questions in this section. Regularization is an important strategy in parameter tuning to help control overfitting. Cross-validation and parameter searches help us find the right mix of parameters. There are additional model evaluation metrics for classification models that will help us determine whether the model is working well in different situations.

Subsection 6.1 Regularization

There are three different types of error that can occur in a machine learning model: Bias, Variance, and Irreducible Error.

Irreducible Error: This is error that is inherent in the data itself. We may be able to clean up the data in the preprocessing stage, but some error is usually still present in the data.

Bias: The bias of an estimator is its average error for different training sets. If there is high bias then the model isn't doing a very good job of making predictions. Note: This is not related to the bias term in linear regression.

Variance: The variance of an estimator indicates how sensitive it is to varying training sets. If there is high variance then the model will do very well on the training set, but not very well on the testing set.

Question 6.1.
Does high bias correspond to underfitting or overfitting? What about high variance? Answer
High bias corresponds to underfitting and high variance corresponds to overfitting.

Generally we are always trying to find a balance in our machine learning techniques between high bias and high variance. For example, in polynomial regression if the degree is too low we will have high bias and if the degree is too high we will have high variance.

Question 6.2.
What do we do if a model has high bias? What do we do if a model has high variance? Answer
If a model has high bias then the underlying model is just not good, so we should try a different model. If a model has high variance, then the model is working too hard to memorize the training data, so we might consider a different model, but we might also try to reduce the model's ability to memorize the training data. There are two main strategies here, add regularization or add early stopping.

Regularization is one tool to help with overfitting (high variance). We are going to add a penalty to the cost function to constrain the size of the \(\Theta\) values. There are three different types of regularization based on the type of penalty we add to the cost function.

The first type is called Ridge Regression or \(l2\) penalty. For this technique we change the cost function to

\begin{equation*} J(\Theta) = MSE(\Theta) +\frac{\alpha}{2}\sum_{i=1}^m \theta_i^2 \end{equation*}

where \(MSE(\Theta)\) stands for the mean squared error cost function for Linear Regression, but would be replaced by the appropriate cost function for the given algorithm, for example log-loss (cross-entropy loss) for Logistic Regression. For Ridge Regression we penalize the \(l2\) norm in the cost function. In other words, we try to minimize the cost function with respect to the \(\theta\) values, but we add a term that penalizes the cost when the \(\theta\) values are large. Here the penalty is the \(l2\) norm, or Euclidean distance from the origin. (You may also see this written as the \(l^2\) norm, but sklearn favors \(l2\text{.}\))

Question 6.3.
There is a new parameter \(\alpha\text{.}\) What does this do? Answer
The parameter \(\alpha\) controls the amount of regularization. If \(\alpha\) is small then there is very little regularization, and if \(\alpha\) is large then we are giving a large penalty.

Question 6.4.
Why does the sum start at 1 and not 0? Answer
\(\theta_0\) controls the intercept of our function. This is not part of overfitting and there are other parameters of our models that will allow us to set this to 0 if we want.

Question 6.5.
What is the impact of adding this penalty to the cost function? What effect will it have on the \(\theta_i\) values for \(i \geq 1\text{?}\) Answer
It will make the \(\theta_i\) values closer to zero.

Question 6.6.
When should we use ridge regularization? Answer
Almost always we should use at least a little regularization in our models. Nearly all of the sklearn models implement this automatically, and ridge regularization is the standard type of regularization to try first.

Graphs for different values of the regularization coefficient \(\alpha\) are given in Figure 6.7.

Figure 6.7. Ridge Regression Image from Chapter 4
Question 6.8.
What is regularization doing to the graph and why? Answer
It is flattening the linear graph because it is keeping \(\theta_1\) smaller. It is smoothing out the polynomial graph because it is keeping \(\theta_i\) smaller.

There are different ways to access this in sklearn models using either ridge or l2 penalty terminology. But there are other ways to penalize the \(\theta\) values.
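For example, here is a minimal sketch using sklearn's Ridge estimator (X and y are assumed to be an already-scaled feature matrix and target values; the alpha value is illustrative):

from sklearn.linear_model import Ridge

# alpha is the regularization strength; alpha=0 recovers ordinary Linear Regression
ridge_reg = Ridge(alpha=1.0)
ridge_reg.fit(X, y)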

The second type of regularization is called Lasso Regression or \(l1\) penalty. For this technique we change the cost function to

\begin{equation*} J(\Theta) = MSE(\Theta) +\alpha\sum_{i=1}^m |\theta_i|. \end{equation*}

In Lasso Regression we penalize the \(l1\) norm in the cost function. Note that again the parameter \(\alpha\) controls the amount of regularization.

Question 6.9.
Why is this named Lasso? Answer
Lasso stands for Least Absolute Shrinkage and Selection Operator.

Lasso Regression tends to eliminate the weights of the least important features. It is especially useful if we have more features than data points. However, it can cause trouble with optimization because gradient descent may bounce around; that is, there may be convergence issues.

Graphs for different values of the regularization coefficient \(\alpha\) are given in Figure 6.10.

Figure 6.10. Lasso Regression Image from Chapter 4
Question 6.11.
How does Lasso Regression compare to Ridge Regression (as in Figure 6.7)? Answer
It has a stronger flattening effect and for \(\alpha=1\) appears to be constant.

Lasso Regression will be referenced in sklearn models using either lasso or l1 penalty terminology.

The third type of regularization is called ElasticNet Regression or \(l1\)-ratio. For this technique we change the cost function to

\begin{equation*} J(\Theta) = MSE(\Theta) +r\alpha\sum_{i=1}^m |\theta_i| + \alpha\frac{1-r}{2}\sum_{i=1}^m \theta_i^2. \end{equation*}

Elastic Net Regression is a mixture of Ridge and Lasso. We now have two parameters, \(\alpha\) which is the amount of regularization and \(r\) which is the mix ratio (also called the l1-ratio). The mix ratio controls how much of Ridge vs Lasso you want to do. In general, \(0\lt r\lt 1\text{.}\) If \(r=1\text{,}\) then it is just Lasso. If \(r=0\text{,}\) then it is just Ridge.

In general, Ridge Regression is a good default (\(r=0\)). Lasso Regression is good if you suspect that you only have a few good features (\(r=1\)). Elastic Net is preferred over Lasso when several features are strongly correlated or the number of features is greater than the number of training instances.
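A similar sketch for the other two penalties (again assuming a scaled feature matrix X and target y; the alpha and l1_ratio values are illustrative):

from sklearn.linear_model import Lasso, ElasticNet

# Lasso: l1 penalty only; tends to drive the least important weights to zero
lasso_reg = Lasso(alpha=0.1)
lasso_reg.fit(X, y)

# ElasticNet: l1_ratio plays the role of the mix ratio r
# (l1_ratio=1 is Lasso, l1_ratio=0 is Ridge)
elastic_reg = ElasticNet(alpha=0.1, l1_ratio=0.5)
elastic_reg.fit(X, y)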

Early stopping is another way we can try to reduce overfitting. We train the model only until the testing score starts getting worse. The steps for this method are

  1. Train the model in epochs.
  2. Track the training and testing errors.
  3. Stop once you start doing worse on the testing data.

For example, in Figure 6.12 we would stop training after approximately 240 epochs.

Figure 6.12. Early Stopping Image from Chapter 4

This is usually one of the parameters you can choose in many of the machine learning models. That is, you can often apply this by choosing early_stopping=True.
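For example, a minimal sketch with sklearn's SGDRegressor (the validation fraction and patience values are illustrative choices):

from sklearn.linear_model import SGDRegressor

# early_stopping=True holds out validation_fraction of the training data and stops
# once the validation score fails to improve for n_iter_no_change consecutive epochs
sgd_reg = SGDRegressor(max_iter=1000, early_stopping=True,
                       validation_fraction=0.2, n_iter_no_change=5)
sgd_reg.fit(X, y)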

See Jupyter notebook for details on implementation for these techniques.

Subsection 6.2 Parameter Searching

We now have many parameters for our models that we want to explore, and we would like a way to find the best parameter or combination of parameters. Generally this requires separating out testing data, measuring the performance of our model on that testing data, and comparing its performance across parameters. We are going to talk about ways to do that more efficiently.

The first issue we have seen with this technique is that the way the data is split between training and testing can really impact the performance and accuracy of a model. However, sometimes we have small data sets, so it can be hard to do this well. One technique we can apply to help with this is called \(k\)-fold cross validation. In this technique we break the data into \(k\) sets, use each set in turn as the test data and the remaining data as training data. We create the model \(k\) different times (on \(k\) different training sets) and examine all the scores as well as the average score.

Example 6.13.

Suppose the data set has 500 elements and we want to apply 5-fold cross validation. We separate the data into five folds where each fold has a random selection of 100 data points.

fold 1: 100 data points
fold 2: 100 data points
fold 3: 100 data points
fold 4: 100 data points
fold 5: 100 data points
Question 6.14.
If our 500 data points have 300 dogs and 200 cats, how should we distribute them across folds? Answer
Each fold should have 60 dogs and 40 cats. But each fold should have a random set of 60 dogs and 40 cats, not the first 60 dogs in the data set, etc.

Of course, sklearn has a built-in for this: cross_val_score.

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
knn = KNeighborsClassifier(n_neighbors = 3)
cv_scores = cross_val_score(knn, X, y, cv=3)   
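This returns one score per fold; a common next step (a small sketch using the cv_scores array above) is to look at the individual fold scores and their average:

print(cv_scores)         # one accuracy score per fold
print(cv_scores.mean())  # the average score across folds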

The full list of parameters is:

cross_val_score(estimator, X, y=None, groups=None, scoring=None, 
cv=None, n_jobs=None, verbose=0, fit_params=None, 
pre_dispatch='2*n_jobs', error_score=nan) 

The main ones we will use are

  1. The estimator is the machine learning technique (kNN, PCA, LinearRegression (with or without polynomial features), LogisticRegression (with or without polynomial features), etc) that we want to apply to the data.
  2. \(X\) is the set of features for our data (should be scaled!).
  3. \(y\) is the labeled data if a supervised technique is being used. No \(y\) is entered if using an unsupervised technique.
  4. scoring=None means the estimator's default scoring method is used.
  5. cv is the number of folds. (The default, cv=None, uses 5 folds.) If the estimator is a classification technique then the data is split into folds using a stratified method. A random split is used for regression techniques.

Question 6.15.
What percentage of the data is in a testing set if five folds are used? Answer
\(\frac{1}{5}=.20\text{.}\) Thus, the default value of \(cv=5\) corresponds to our standard 20/80 split for testing/training data. In general, the size of the testing set will be approximately \(\frac{m}{cv}\) if \(m\) is the number of data points, and \(cv\) is the number of folds.

Question 6.16.
What does stratified mean? Answer
Remember that stratified means splitting the data so that the proportion of classes in each fold are as close as possible to the proportions of classes in the overall data set.

Question 6.17.
What are the pros and cons of cross validation? Answer
It is useful to see how sensitive the model is to the makeup of the training and testing set. We can see best case/worst case examples. The main downside is that we have to run \(k\) models which takes \(k\) times as long. Thus, this strategy is best for smaller data sets.

Remember that when feature scaling is applied to data, it should only be done on the training data and not the full data set. Otherwise it leaks data about the testing set. For cross validation it must be done on each training set individually. (The PIPELINE feature will really help here!)

We will return to the fruit data set from exam 1 for an example of cross validation.

We are going to revisit kNN as our classification technique and use all four features. Since the scales of the features are quite different, it is very important to use feature scaling now. We will create a Pipeline to combine StandardScaler and KNeighborsClassifier, as sketched below.
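A sketch of that setup (assuming X holds the four fruit features and y the fruit labels):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

# inside cross_val_score the scaler is refit on each training fold,
# so no information about the test fold leaks into the scaling
knn_pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
cv_scores = cross_val_score(knn_pipe, X, y, cv=5)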

Question 6.18.
What happens if we use six folds? Answer
We will get an error because it is impossible to do stratified separation. There are only five mandarins.

Question 6.19.
How many folds should we use in cross validation? Answer
Depends on the data! For this data set, we want five or fewer, but smaller is probably better because having only one mandarin per fold makes it difficult for the model to learn about mandarins. For a train/test split we often use an 80/20 split, and the default value of five folds would correspond to this split.

Nearly all of the models have multiple parameters to tune. How do we find the best combination of parameters? For example, in the previous fruit data set example, what is the best value of \(k\) to use in \(k\) nearest neighbors? For a single parameter, we can use a validation curve. A validation_curve will produce an \(m \times n \) array of scores for training and testing where \(m= \) the number of parameters specified in the range and \(n= \) the number of folds for cross validation. We can then graph an average score across folds to see how the score varies as we change the single parameter.

from sklearn.model_selection import validation_curve

cvnum = 3   # number of cross-validation folds (3 here, matching Example 6.20)
param_range = range(1, 8, 2)
train_scores, test_scores = validation_curve(KNeighborsClassifier(), X, y,
            param_name='n_neighbors', param_range=param_range, cv=cvnum)

The full list of parameters is

validation_curve(estimator, X, y, param_name, param_range, 
	groups=None, cv=None, scoring=None, n_jobs=None, 
	pre_dispatch='all',verbose=0, error_score=nan).
 

Most of these are the same as cross_val_score():

  1. The estimator is the machine learning technique (kNN, PCA, LinearRegression (with or without polynomial features), LogisticRegression (with or without polynomial features), etc) that we want to apply to the data.
  2. \(X\) is the set of features for our data
  3. \(y\) is the labeled data if a supervised technique is being used. No \(y\) is entered if using an unsupervised technique.
  4. scoring=None means the estimator's default scoring method is used.
  5. cv is the number of folds. (The default, cv=None, uses 5 folds.) If the estimator is a classification technique then the data is split into folds using a stratified method. A random split is used for regression techniques.

The new ones are param_name and param_range where we specify the parameter we want to vary.

Example 6.20.

If we call

 
train_scores, test_scores = validation_curve(KNeighborsClassifier(), X, y,
param_name='n_neighbors',param_range=[1,3,5,7], cv=3). 

This will produce an \(m \times n \) array of scores for training and testing where \(m \) is the number of parameters specified in the range and \(n \) is the number of folds for cross validation.

fold 1 fold 2 fold 3
k=1
k=3
k=5
k=7

Let's try it!

Now that we have all the scores, it is easiest to analyze them with a plot. We will plot the average across folds as the number of neighbors increases.
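One way to make that plot, sketched under the assumption that train_scores, test_scores, and param_range come from the validation_curve call above:

import numpy as np
import matplotlib.pyplot as plt

# average across the folds (axis=1) for each parameter value
train_means = np.mean(train_scores, axis=1)
test_means = np.mean(test_scores, axis=1)

plt.plot(param_range, train_means, label='training score')
plt.plot(param_range, test_means, label='testing score')
plt.xlabel('n_neighbors')
plt.ylabel('score')
plt.legend()
plt.show()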

Question 6.21.
What do we think of our model? What's the best value of \(k\text{?}\) Is this model a keeper? Answer
The best value of \(k\) is 3 or 5, but the gap between training and testing scores is quite large, so we are not happy with this as a model.
Question 6.22.
Why is this model not very good? Answer
We skipped the feature scaling part!!!! We should have used our Pipeline instead of KNeighborsClassifier. Let's go back and fix it!

Note that we have to call the parameter name differently now. It's the name we used in the Pipeline for KNeighborsClassifier() followed by a double underscore to reference the parameter.
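For example, if the pipeline was built with make_pipeline as in the earlier sketch, the step name defaults to the lowercased class name, so the call might look like this (with an explicit Pipeline the prefix would instead be whatever name you gave the step):

# 'kneighborsclassifier' is the step name, '__n_neighbors' references its parameter
train_scores, test_scores = validation_curve(knn_pipe, X, y,
            param_name='kneighborsclassifier__n_neighbors',
            param_range=param_range, cv=cvnum)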

We can search over multiple parameters with a parameter search. The main techniques combine cross validation with either an exhaustive search over all combinations of parameters in GridSearchCV or random search over combinations of parameters in RandomizedSearchCV. We will begin with GridSearchCV. This will perform an exhaustive search over all combinations of parameters entered, each combination calculated over \(cv\) folds. The full parameter list is

 from sklearn.model_selection import GridSearchCV
 GridSearchCV(estimator, param_grid, scoring=None, n_jobs=None,
    iid='deprecated', refit=True, cv=None, verbose=0,
    pre_dispatch='2*n_jobs', error_score=nan, return_train_score=False)

Most of these should look familiar. We specify the parameters we want to examine with a dictionary for param_grid. We can access the parameters that return the best score on the testing set with grid_knn_acc.best_params_ and their corresponding score in grid_knn_acc.best_score_. Note that the default setting will not return the scores on the training data, to reduce the number of calculations made. It is still useful to compare training and testing scores, but it may be better to go back and do that on just the best model rather than storing all the scores on the training data. A full set of results can be obtained with grid_knn_acc.cv_results_.
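A minimal sketch of a grid search over a single parameter (the grid values are illustrative; the name grid_knn_acc matches the variable referenced above):

from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# param_grid is a dictionary mapping parameter names to the values to try
param_grid = {'n_neighbors': [1, 3, 5, 7, 9]}
grid_knn_acc = GridSearchCV(KNeighborsClassifier(), param_grid, cv=3)
grid_knn_acc.fit(X, y)

print(grid_knn_acc.best_params_)   # parameter combination with the best mean test score
print(grid_knn_acc.best_score_)    # the corresponding mean cross-validated score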

Example without a Pipeline and with only one parameter.

Question 6.23.
What's up with the rankings? Is \(k=3\) really the best? Answer
Yes, but it is a tie with \(k=5\text{.}\) So we aren't completely certain about which is the best model here.

Example with a Pipeline and two parameters.

Question 6.24.
How many times did knn_model create a different model in the above GridSearchCV? Answer
There are 10 different choices for \(k\text{,}\) 5 different choices for \(p\) and 3 different folds. Thus, there are \(10*5*3=150\) different models that are evaluated. Is that too many?

Depending on the size of our model, an exhaustive grid search might not be the way to go. Another option is RandomizedSearchCV. A fixed number of parameter combinations is examined here (specified by n_iter=10). Parameters are specified with a dictionary or list of dictionaries in param_distributions. Ranges may be specified as a list (recommended for discrete parameters) or a distribution (recommended for continuous parameters). We'll mostly stick to lists of ranges for parameters. The full list of parameters is

 RandomizedSearchCV(estimator, param_distributions, n_iter=10,
    scoring=None, n_jobs=None, iid='deprecated', refit=True,
    cv=None,verbose=0, pre_dispatch='2*n_jobs', random_state=None,
    error_score=nan, return_train_score=False)
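A minimal sketch of how this might be called (the parameter lists here are hypothetical):

from sklearn.model_selection import RandomizedSearchCV

# only n_iter randomly chosen combinations from these lists are evaluated
param_distributions = {'n_neighbors': list(range(1, 20, 2)), 'p': [1, 2, 3]}
rand_knn = RandomizedSearchCV(KNeighborsClassifier(), param_distributions,
                              n_iter=10, cv=3, random_state=42)
rand_knn.fit(X, y)
print(rand_knn.best_params_)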

Let's examine a different classification technique. Let's apply LogisticRegression to the fruit data set and examine the effects of regularization. Note that the regularization parameter for Logistic Regression is C where \(C=\frac{1}{\alpha}\text{.}\) So large values of C indicate small amounts of regularization. We usually want to vary C by powers of ten. How do you think Logistic Regression will do compared to k Nearest Neighbors?
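A sketch of such a search (the grid of C values is illustrative and varies C by powers of ten; the pipeline keeps the feature scaling inside each fold):

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV

# small C means a lot of regularization, large C means very little
log_pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
param_grid = {'logisticregression__C': [1e-3, 1e-2, 1e-1, 1, 1e1, 1e2]}
grid_log = GridSearchCV(log_pipe, param_grid, cv=3)
grid_log.fit(X, y)
print(grid_log.best_params_, grid_log.best_score_)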

Question 6.25.
What does the notation 1e-30 mean? Which method does better? Why? Answer
1e-30 indicates scientific notation so \(1 \times 10^{-30}\text{.}\) k Nearest Neighbors did better. Possibly linear decision boundaries don't make sense on this data set. Maybe we should try LogisticRegression with Polynomial features?

Subsection 6.3 Evaluating Classification Models

As we are adding models to our collection, how do we determine which is the best one for our data? What does best even mean? For classification models, the accuracy of the model is always a good score to examine, but it may not help us determine the best model for our situation.

Question 6.26.
Suppose we have a fruit data set with 45 apples and 5 oranges. If our model is 90% accurate, is that a good score? Answer
Probably not, it's probably just predicting apples all the time. In this case, we might think let's just add more oranges to our data set, but oftentimes we are trying to identify things that are actually fairly rare. For example, whether a tumor is benign or malignant. Thankfully, benign tumors are much more common. But we really, really want to be able to identify the malignant ones, and not just always predict benign. In general, data that is highly skewed toward one class presents additional challenges, and the accuracy score is not generally as helpful as we want.

Some other model evaluation techniques that are often used include Confusion Matrices, Precision vs Recall graphs, and ROC curves.

A Confusion Matrix records how accurate our model is in each class relative to each other class. For two classes, there are four categories,

  1. True Negative (TN): The data has class 0 and the algorithm predicts class 0.
  2. False Negative (FN): The data has class 1 and the algorithm predicts class 0.
  3. False Positive (FP): The data has class 0 and the algorithm predicts class 1.
  4. True Positive (TP): The data has class 1 and the algorithm predicts class 1.

and we can visualize this as a \(2 \times 2\) matrix.

predict 0 predict 1
class 0 TN FP
class 1 FN TP

For \(k\) classes, we can visualize this as a \(k \times k\) matrix. Note that TN, FP, FN, TP no longer make sense in this case as there are many more than four cases.

predict 0 predict 1 \(\cdots \) predict k
class 0
class 1
\(\vdots \)
class k
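In sklearn a confusion matrix can be computed with confusion_matrix; a minimal sketch, assuming an already-fitted classifier clf (a hypothetical name) and labels y:

from sklearn.metrics import confusion_matrix

# rows correspond to the true classes, columns to the predicted classes
y_pred = clf.predict(X)   # clf is assumed to be a classifier fitted earlier
print(confusion_matrix(y, y_pred))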

In the last example for logistic regression, even our best model didn't do very well. Why not? For simplicity we are looking at the confusion matrix on the full data set, but really we should only be looking at it on the testing data.

Question 6.27.
What fruits is this model having difficulty predicting? How does this help us build a better model? Answer
It's definitely having a lot of trouble correctly identifying oranges. In particular, it has a lot of trouble distinguishing oranges from apples. Maybe we should use a different model. But maybe we should add another feature that would help us distinguish between apples and oranges.

Question 6.28.
What is the accuracy of the model? How many did the model get wrong? What fraction of the oranges were correctly predicted to be orange? What fraction of the fruits predicted to be oranges are really oranges? Answer
  1. The diagonals indicate the correct values, so our accuracy is
    \begin{equation*} \frac{15+16+5+8}{59}=44/59=74.6\%. \end{equation*}
  2. The off-diagonals indicate the fruits that were predicted incorrectly.
    \begin{equation*} \frac{15}{59}=25.4\%. \end{equation*}
  3. There are 19 oranges and the model correctly predicted 8 of them. Thus the fraction of oranges correctly predicted to be oranges is
    \begin{equation*} 8/19=42\%\text{.} \end{equation*}
    This is called Recall or True Positive Rate.
  4. The model predicted 12 fruits to be oranges, but only 8 of them were actually oranges. Thus the fraction of the orange predictions that are correct is
    \begin{equation*} 8/12=66.7\%\text{.} \end{equation*}
    This is called Precision.

The trade-off between Precision and Recall is usually a function of our parameter tuning. We often want to analyze this to choose the best model based on how we are going to be using the data.

The term Recall (also called True Positive Rate) is defined as \(\frac{\text{TP}}{\text{TP}+\text{FN}}\) for two classes. In general, this involves all entries in the row for class 1 (orange). What fraction of positive instances are correctly predicted as positive?

The term Precision (also known as Positive Predictive Value) is defined to be \(\frac{\text{TP}}{\text{TP}+\text{FP}}\) for two classes. In general, this involves all entries in the column for predicting class 1. What fraction of positive predictions are correct?

In some cases precision might be more important than recall and vice versa.
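Both scores are available directly in sklearn; a sketch for the two-class case, assuming binary label arrays y_true and y_pred (for a multi-class problem like the fruit data you would also pass an averaging option such as average='macro'):

from sklearn.metrics import precision_score, recall_score

# by default these score the positive class (class 1) for binary labels
print(precision_score(y_true, y_pred))  # TP / (TP + FP)
print(recall_score(y_true, y_pred))     # TP / (TP + FN)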

Question 6.29.
Suppose we are using machine learning to diagnose tumors and class 1 indicates a malignant tumor and class 0 indicates a tumor is not malignant (benign). What does recall mean in this case? What does precision mean in this case? Are they equally important or is one of them more important? Answer
Recall is the percentage of tumors that are malignant that we correctly predicted as malignant. Precision is the fraction of tumors predicted to be malignant that are actually malignant. If we predict a malignant tumor is benign, then that person will not get treatment and could die. If we falsely predict a benign tumor is malignant, then that person will have more tests and possibly unnecessary surgery, but will live. Thus we care the most about missing a malignant tumor so we care more about recall.

Question 6.30.
Suppose we are using machine learning to predict if an email is spam (class 1) or not (class 0). What does recall mean in this case? What does precision mean in this case? Are they equally important or is one of them more important? Answer
Recall is the fraction of the spam emails that we correctly identified as spam. Precision is the fraction of the emails that are predicted to be spam that are actually spam. If we predict an email is not spam when it is, then it will survive the spam filter and clutter our inbox. If we predict an email is spam when it is a valid email, then that email dies in the spam filter and we don't know it exists. Possibly we missed having really important information. Thus, in this case, we probably care more about precision.

Ideally, we would like both precision and recall to be 100%, but generally as we vary the parameters, some parameters will have better precision at the expense of recall and vice-versa. Thus, we can examine precision versus recall scores to determine the parameters that give us the best model based on the desired use of the model/data.

We can also graph precision versus recall as parameters are varying. But this only works with binary classes.

from sklearn.metrics import precision_recall_curve

A sample precision versus recall curve from your textbook is below.
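A sketch of how such a curve might be produced, assuming binary labels y_true and scores y_scores for the positive class (for example from decision_function or from the second column of predict_proba):

import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

# the curve traces precision and recall as the decision threshold is varied
precisions, recalls, thresholds = precision_recall_curve(y_true, y_scores)

plt.plot(recalls, precisions)
plt.xlabel('recall')
plt.ylabel('precision')
plt.show()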

Another strategy we might use for a classification problem with two classes is to examine the trade-off between the true positive rate and the false positive rate.

The term True Positive Rate = TPR (same as Recall) is defined as \(\frac{\text{TP}}{\text{TP}+\text{FN}}\) for two classes. In general, this involves all entries in the row for class 1. What fraction of class 1 is correctly predicted to be class 1?

The term False Positive Rate = FPR (also known as the probability of false alarm) is defined as \(\frac{\text{FP}}{\text{FP}+\text{TN}}\) for two classes. In general, this involves all entries in the row for class 0. What fraction of class 0 is incorrectly predicted to be class 1?

A ROC Curve (Receiver Operating Characteristic Curve) is created by plotting the True Positive Rate as a function of the False Positive Rate as various parameters are varied (FPR on the \(x\)-axis, TPR on the \(y\)-axis). It was originally developed for operators of military radar receivers, where the goal is to find a balance between correctly detecting an event and raising false alarms. Ideally, we want the TPR to be 100% and the FPR to be 0%. However, normally as we vary parameters we can only detect all the true positives if we allow some false alarms, and we can only have no false alarms with a very poor ability to detect the desired event. Thus, we want to identify a model that corresponds to an appropriate trade-off between detection and false alarms, determined by the given situation.

Image created by Martin Thoma, taken from Wikipedia
Figure 6.31. A Sample ROC Curve
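A sketch of how a ROC curve might be computed, using the same assumed y_true and y_scores as in the precision-recall sketch above:

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

# the curve traces TPR against FPR as the decision threshold is varied
fpr, tpr, thresholds = roc_curve(y_true, y_scores)

plt.plot(fpr, tpr)
plt.xlabel('false positive rate (FPR)')
plt.ylabel('true positive rate (TPR)')
plt.show()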

ROC curves are generally better when observations are balanced across classes. Precision-Recall Curves are generally better for imbalanced data sets.