Machine Learning Class Notes

Question 7.1.

What machine learning algorithms do we have so far? What are the main steps in most of our supervised machine learning algorithms?

Answer 1

So far we have one unsupervised learning algorithm, Principal Component Analysis, and five supervised learning algorithms:

Linear Regression
Linear Regression with Polynomial Features
Logistic Regression
Logistic Regression with Polynomial Features
k Nearest Neighbors

Answer 2

We generally have six steps in our regression models.

Get, process, clean data for \(X\) and \(y\text{.}\) Split into training and testing. Then explore/analyze training data set to understand the data. If necessary, perform feature scaling (fit on only training data, but applied to all data.)
Question 7.2.
Why should we perform data analysis on only the training data and not the testing data set?
Answer
We need the testing set to be an unbiased measure of the performance of our model. If we explore the testing set as part of our data analysis, we are leaking information about the testing set to ourselves (and therefore to our model). We've been blurring this line a bit because of the small size of some of our data sets, but as we move to larger data sets we want to be more careful about this. (And it's very important for real-world implementation.)

Question 7.3.
When is feature scaling necessary? Why is it important?
Answer
Almost always. It's especially important if the data features have widely different scales. In most cases it impacts the convergence of the optimization technique. In some cases it also affects the ability of the model to make accurate predictions.

Question 7.4.
Our parameter searching from last time actually adds a step here. What did we add?
Answer
Split data into train, validate, and test. Validation data is used to tune parameters/hyperparameters, and testing is reserved for an unbiased estimate of how the final model performs on data that was not used to build the model or tune the parameters.
Learn Function \(F\) such that \(F(X) \approx y \text{.}\)

Question 7.5.

Which of our previous models do not learn a function that approximates \(y\text{?}\)
Answer

PCA is unsupervised so there is no y to approximate.

kNN does not learn a function, or follow this six step process. kNN memorizes the data and must search through the entire data set to find the \(k\) nearest points. It then makes a prediction based on majority vote.
Choose a Model.

Question 7.6.

What types of models have we used so far?
Answer

Linear, logistic and polynomial!

For the linear model

\begin{equation*} h_\Theta (X) = X \Theta . \end{equation*}

For the logistic model

\begin{equation*} h_\Theta(X)=g({X}{\Theta})= \frac{1}{1+e^{-{X}{\Theta}}}. \end{equation*}

(Also softmax model for multinomial logistic regression.)

Our polynomial model doesn't actually change the model, it just adds polynomial features.
Choose a Cost Function.

Question 7.7.

What did we recently add to our cost functions and what type of cost functions do we have so far? What are we looking for in a cost function in general?
Answer

We added regularization here!

We are using either mean squared error in Linear Regression or log-loss (or cross entropy loss) in Logistic Regression (multinomial), plus an appropriate regularization.

For the mean squared error

\begin{equation*} J = \frac{1}{2m} \sum_{i=1}^m \left[ h_\Theta(X^{(i)}) - y^{(i)} \right]^2 + \alpha \sum_i \theta_i^2 \end{equation*}

For log-loss

\begin{equation*} J= -\frac{1}{m} \sum_{i=1}^m \left[ y^{(i)}\log(h_\Theta(X^{(i)})) +(1-y^{(i)})\log(1-h_\Theta(X^{(i)})) \right] + \frac{1}{C} \sum_i \theta_i^2. \end{equation*}

Note: this is ridge or l2 penalty regularization.

The important features of our cost function are some way to measure error in our prediction and a smooth/convex function that is easy to optimize.
Solve for \(\Theta\text{.}\)
Make Predictions!
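
The six steps above can be sketched end to end in plain Python. This is only a minimal sketch under stated assumptions, not the class implementation: the data set is made up, the helper names (fit_scaler, scale, cost, gradient_descent) are hypothetical, the model is logistic with the log-loss plus ridge penalty from above, and plain gradient descent is assumed as the way to solve for \(\Theta\text{.}\)

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_scaler(values):
    """Mean and standard deviation computed from the training values only."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(var)

def scale(values, mean, std):
    """Apply previously fitted statistics to any data set."""
    return [(v - mean) / std for v in values]

def cost(theta, xs, ys, C):
    """Log-loss plus (1/C) * theta_1^2; theta_0 is not penalized."""
    total = 0.0
    for x, y in zip(xs, ys):
        h = sigmoid(theta[0] + theta[1] * x)
        total += y * math.log(h) + (1 - y) * math.log(1 - h)
    return -total / len(xs) + theta[1] ** 2 / C

def gradient_descent(xs, ys, C, rate=0.5, steps=2000):
    """Assumed solver: repeatedly step against the gradient of cost()."""
    theta = [0.0, 0.0]
    m = len(xs)
    for _ in range(steps):
        errors = [sigmoid(theta[0] + theta[1] * x) - y for x, y in zip(xs, ys)]
        g0 = sum(errors) / m
        g1 = sum(e * x for e, x in zip(errors, xs)) / m + 2 * theta[1] / C
        theta = [theta[0] - rate * g0, theta[1] - rate * g1]
    return theta

# Step 1: get data (made up here), split, then scale with training statistics.
rng = random.Random(0)
data = [(x, 1 if x > 50 else 0) for x in range(0, 100, 5)]
rng.shuffle(data)
train, test = data[:15], data[15:]
mean, std = fit_scaler([x for x, _ in train])
X_train, y_train = scale([x for x, _ in train], mean, std), [y for _, y in train]
X_test, y_test = scale([x for x, _ in test], mean, std), [y for _, y in test]

# Steps 2-5: logistic model, regularized log-loss, solve for Theta.
theta = gradient_descent(X_train, y_train, C=100.0)

# Step 6: make predictions on the held-out test data.
predictions = [1 if sigmoid(theta[0] + theta[1] * x) >= 0.5 else 0 for x in X_test]
accuracy = sum(p == y for p, y in zip(predictions, y_test)) / len(y_test)
print(theta, accuracy)
```

Note that the scaler statistics come from the training split only and are then reused on the test split, which is the "fit on training, apply to all" rule from step 1.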

Answer 9

We have two more supervised learning algorithms (support vector machines and neural networks) and one more unsupervised learning algorithm (clustering).

Answer 10

We want the decision boundary that will do the best job of classifying new data. The black and blue lines seem problematic since they are so close to the data. The green and brown lines are much better.

Answer 11

The first and main thing we need to change is the cost function. We need a cost function that will ensure a wide margin and still work well for optimization. We will then make some changes to our model to allow non-linear decision boundaries (kernel trick). Lastly we will make some more modifications to allow for a small number of points to live outside the margins (soft margin versus hard margin).

Answer 12

Yay Calculus! We find the minimum by setting the derivative equal to 0! \(y'=6x-12=0\) if \(x=2\text{.}\) Similarly \(y'=2x-4=0\) if \(x=2\text{.}\)

Answer 13

No! The constant \(C\) controls the weighting between regularization and the cost function so eliminating \(C\) would result in a loss of flexibility in our model. Different values of \(C\) will change the value of \(\Theta\) that minimizes the cost.

Question 7.17.

How? What does regularization do again?
Answer

Regularization tries to control overfitting by constraining the size of \(\Theta\text{.}\)
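
To see why \(C\) cannot simply be dropped, here is a small sketch with a made-up one-parameter cost, \(J(\theta) = C(\theta-1)^2 + \theta^2\text{.}\) The toy cost and the function names are assumptions chosen only for illustration. Setting the derivative \(2C(\theta-1)+2\theta\) equal to 0 gives the minimizer \(\theta = C/(C+1)\text{,}\) which clearly moves as \(C\) changes.

```python
def toy_cost(theta, C):
    """Made-up cost: C weights the error term, theta**2 is the penalty."""
    return C * (theta - 1.0) ** 2 + theta ** 2

def best_theta(C):
    """Minimizer found by setting the derivative 2C(theta - 1) + 2*theta to 0."""
    return C / (C + 1.0)

for C in (0.01, 1.0, 100.0):
    print(C, best_theta(C))
```

Small \(C\) pulls the minimizer toward 0 (the penalty dominates) and large \(C\) pulls it toward 1 (the error term dominates).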

Answer 15

We could definitely specify this in terms of \(y^i\) instead. Let \(t^i=(-1)^{y^i+1}\text{.}\) Yay, math!

Answer 16

Linear is almost always way nicer! But corners are bad! The function is not differentiable there! So the non-differentiable point is a problem for optimization methods, but the linear component simplifies the calculations enough that this is still a computational advantage.

Answer 17

Large margin classification. So the other main reason for changing the cost function is to create the large margin. The rough idea is that the location for where the function is zero and the 'gap' between -1 and 1 allow us to create a large margin.

Answer 18

Depends on the class of \(X^i\text{.}\) If \(X^i\) has class 0 then

\begin{equation*} \max(0,1 + X\Theta)=0 \text{ if } X\Theta \leq -1 \text{.} \end{equation*}

If \(X^i\) has class 1 then

\begin{equation*} \max(0,1 - X\Theta)=0 \text{ if } X\Theta \geq 1 \text{.} \end{equation*}

\begin{equation*} J = C \sum_{i=1}^m \max(0,1- t^i X^i \Theta) + \sum_{i=1} \theta_i^2 \text{ where } t^i= \begin{cases} +1 \amp \text{if } y^i=1 \\ -1 \amp \text{if } y^i=0 \end{cases} \end{equation*}

If \(\max(0,1 + X\Theta)=0\) then \(J=\sum_i \theta_i^2\text{.}\) So the cost is primarily dependent on \(\|\Theta\|\) in this case. We want to explore why forcing \(\|\Theta\|\) to be small to minimize the cost will correspond to a large margin classifier.
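
The cost above can be evaluated directly. The following is a minimal sketch, with made-up data points and a hypothetical function name hinge_cost; it uses \(t^i=(-1)^{y^i+1}\) to turn labels 0/1 into -1/+1 and leaves \(\theta_0\) out of the penalty.

```python
def hinge_cost(theta, X, y, C):
    """J = C * sum(max(0, 1 - t * (x . theta))) + sum(theta_i^2 for i >= 1)."""
    total = 0.0
    for row, label in zip(X, y):
        t = (-1) ** (label + 1)  # y = 1 gives t = +1, y = 0 gives t = -1
        score = sum(xi * th for xi, th in zip(row, theta))
        total += max(0.0, 1.0 - t * score)
    penalty = sum(th ** 2 for th in theta[1:])  # theta_0 is not penalized
    return C * total + penalty

# Made-up points; each row starts with 1 so that theta_0 acts as the intercept.
X = [[1, 2.0, 2.0], [1, -2.0, -2.0], [1, 0.5, 0.0]]
y = [1, 0, 1]
theta = [0.0, 0.5, 1.0]

print(hinge_cost(theta, X[:2], y[:2], C=1000))  # both points outside the margin
print(hinge_cost(theta, X, y, C=1))             # third point is inside the margin
```

When every point satisfies its margin condition, the first call returns only the penalty term, no matter how large \(C\) is.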

Answer 19

It's the length of the vector \(\Theta \) which is

\begin{equation*} \sqrt{\sum_{i=0} \theta_i^2}\text{.} \end{equation*}

We could also write this using dot product notation as \(\sum_{i=0} \theta_i^2 =\Theta \cdot \Theta.\) What else do we remember about dot products from Calculus?

Answer 20

There are two features, \(x_1, x_2 \text{,}\) and \(X \Theta= 0+\frac{1}{2} x_1 + 1 x_2 =\frac{1}{2} x_1+x_2\text{.}\) Thus,

\begin{equation*} X \Theta=0 \text{ corresponds to } \frac{1}{2} x_1+x_2=0 \text{ or } x_2= -\frac{1}{2} x_1. \end{equation*}

\begin{equation*} X \Theta=1 \text{ corresponds to } \frac{1}{2} x_1+x_2=1 \text{ or } x_2= 1- \frac{1}{2} x_1. \end{equation*}

\begin{equation*} X \Theta=-1 \text{ corresponds to } \frac{1}{2} x_1+x_2=-1 \text{ or } x_2= -1 - \frac{1}{2} x_1. \end{equation*}

Aha! This looks like a decision boundary with margin!
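
As a quick numerical check, the three lines can be tabulated for \(\Theta=(0,\frac{1}{2},1)\text{.}\) This is a small sketch; the function name boundary_x2 is hypothetical.

```python
def boundary_x2(x1, theta, level):
    """Solve theta0 + theta1*x1 + theta2*x2 = level for x2."""
    theta0, theta1, theta2 = theta
    return (level - theta0 - theta1 * x1) / theta2

theta = (0.0, 0.5, 1.0)
for x1 in (-2.0, 0.0, 2.0):
    # x2 on the lines X Theta = -1, 0, 1 at this value of x1
    print(x1, [boundary_x2(x1, theta, level) for level in (-1, 0, 1)])
```

At every \(x_1\) the three values of \(x_2\) differ by exactly 1, so the two outer lines are parallel copies of the decision boundary.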

Answer 21

Remember we are examining the case where \(J \approx \|\Theta\|^2\text{,}\) so minimizing the cost means making \(\|\Theta\|\) small. We will start by viewing \(\Theta=(\theta_1,\theta_2)\) as a two dimensional vector. (We are assuming \(\theta_0=0\) and dropping it.) Then \(X \Theta = (x_1,x_2)\cdot(\theta_1,\theta_2)\) can be thought of as a dot product.

Question 7.28.

When is a dot product equal to zero?
Answer

When the vectors are perpendicular!

Thus \(\Theta\) must be perpendicular to the decision boundary (and through the origin since \(\theta_0=0\text{.}\) ) Let's consider the data point \(x^i \text{.}\) It is class 1 so we must have \(X \cdot \Theta \geq 1\text{.}\)

Question 7.29.

What is \(X \cdot \Theta\) in general? That is, what is the formula for a dot product?
Answer

\(\|X\| \; \|\Theta\| \cos(\phi)\) where \(\phi\) is the angle between \(X\) and \(\Theta\text{.}\)

Thus, \(\|X\| \; \|\Theta\| \cos(\phi) \geq 1.\) Let's solve for \(\|\Theta\|\) (assuming \(\cos(\phi) \gt 0\)): \(\|\Theta\| \geq \frac{1}{\|X\| \cos(\phi)} \text{.}\)

Which value of \(\phi \) will produce a smaller value of \(\|\Theta\|\text{?}\) Hint: \(\cos(0)=1\) and \(\cos(90^\circ)=0\text{.}\)
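
Starting from the inequality \(\|X\| \; \|\Theta\| \cos(\phi) \geq 1\text{,}\) the smallest admissible \(\|\Theta\|\) can be computed for a few angles. This is a small sketch; the function name and the choice \(\|X\|=2\) are made up for illustration.

```python
import math

def min_theta_norm(x_norm, phi_degrees):
    """Smallest ||Theta|| with ||X|| * ||Theta|| * cos(phi) >= 1 (needs cos(phi) > 0)."""
    return 1.0 / (x_norm * math.cos(math.radians(phi_degrees)))

for phi in (0, 30, 60, 85):
    print(phi, round(min_theta_norm(2.0, phi), 3))
```

The required norm grows as \(\phi\) moves from \(0\) toward \(90^\circ\text{,}\) so a data point lined up with \(\Theta\) allows the smallest \(\|\Theta\|\text{.}\)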

Answer 24

If \(C=\infty\) then we only care about the first term and do not care about the second term. That is, we do not care about a large margin, we only care about margin violations. This will produce a hard margin model. All points must be outside the margin. You can try this value in the model, with C=float("inf").

If \(C=0\) then we only care about the second term and do not care about the first term. That is, we do not care about margin violations, we only care about a wide margin. This will produce a soft margin model. Note that \(C=0\) is not positive and this is not actually a valid parameter for the model.

Of course, we don't usually want to be at either of these extremes and the goal is to find the right balance between the hard margin and the soft margin.

Answer 25

A large value of C will mean little regularization. This is an important parameter to tune in our model.

Answer 26

The margin will tell us how far each class boundary line is shifted from the decision boundary line. Suppose \(\Theta=(0,\theta_1,\theta_2)\text{.}\) The class boundary lines correspond to \(X \Theta=1 \) and \(X \Theta=-1 \text{.}\) The decision boundary corresponds to \(X \Theta=0 \text{.}\) Since \(X \Theta= \theta_1 x_1 + \theta_2 x_2 \) we have

\begin{equation*} X \Theta=-1 \text{ corresponds to }\theta_1 x_1 + \theta_2 x_2=-1 \text{ or } x_2= \frac{-1}{\theta_2} - \frac{\theta_1}{\theta_2} x_1. \end{equation*}

\begin{equation*} X \Theta=0 \text{ corresponds to }\theta_1 x_1 + \theta_2 x_2=0 \text{ or } x_2= -\frac{\theta_1}{\theta_2} x_1. \end{equation*}

\begin{equation*} X \Theta=1 \text{ corresponds to }\theta_1 x_1 + \theta_2 x_2=1 \text{ or } x_2= \frac{1}{\theta_2} - \frac{\theta_1}{\theta_2} x_1. \end{equation*}

Thus the margin is \(\frac{1}{\theta_2}\text{.}\) (Note that this is not the same as the width of the margin, because it is a vertical shift, not a perpendicular distance.)
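
The difference between the vertical shift and the perpendicular width can be checked numerically. This is a minimal sketch assuming \(\theta_0=0\text{;}\) the function names are hypothetical, and the perpendicular distance uses the standard point-to-line distance formula, which gives \(1/\sqrt{\theta_1^2+\theta_2^2}\) between the lines \(X\Theta=0\) and \(X\Theta=1\text{.}\)

```python
import math

def vertical_shift(theta2):
    """Vertical offset between the lines X Theta = 0 and X Theta = 1."""
    return 1.0 / theta2

def perpendicular_distance(theta1, theta2):
    """Perpendicular distance between X Theta = 0 and X Theta = 1 (theta_0 = 0)."""
    return 1.0 / math.hypot(theta1, theta2)

# Example with theta_1 = 1/2 and theta_2 = 1.
print(vertical_shift(1.0), perpendicular_distance(0.5, 1.0))
```

The perpendicular distance is never larger than the vertical shift, and the two agree only when \(\theta_1=0\) (a horizontal decision boundary).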

Answer 27

Try it and see! Solution

	C=1000	C=inf	C=0.1	C=0.001
\(\theta_0\)	0.108	0.108	0.080	0.001
\(\theta_1\)	0.685	0.685	0.611	0.013
\(\theta_2\)	0.730	0.730	0.454	0.012
margin	1.369	1.369	2.204	82.189

Large values of \(C\) correspond to smaller margins and fewer margin violations. (None in this case.) Smaller values of \(C\) correspond to larger margins with more margin violations. In the case of \(C=0.001\) all the points are in the margin!
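
The margin row of the table can be reproduced from the \(\theta_2\) row using margin \(=\frac{1}{\theta_2}\text{.}\) This is a small sketch using the rounded table values, so the results agree with the table only up to the rounding of \(\theta_2\text{.}\)

```python
# theta_2 values copied from the table above (rounded to three decimals).
theta_2 = {"C=1000": 0.730, "C=inf": 0.730, "C=0.1": 0.454, "C=0.001": 0.012}

# Margin as the vertical shift 1 / theta_2.
margins = {name: 1.0 / value for name, value in theta_2.items()}

for name, margin in margins.items():
    print(name, round(margin, 3))
```

The computed margins grow as \(C\) shrinks, matching the trend in the table.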

Answer 28

Try it and see!

Answer 29

\(e^{-\gamma x^2 }\) is a bell-shaped function varying from 1 (at \(x=0\)) to 0 as \(x \to \pm \infty\text{.}\) \(\gamma \) is a parameter which controls the width of the bell shape of \(e^{-\gamma x^2 }\text{.}\)

Adding this feature to our data produces a value near 1 if \(x\) is near the landmark, in this case \(l=0 \text{,}\) and produces a value near \(0\) if \(x\) is not near the landmark. Thus, it is a similarity metric for how similar \(x\) is to \(l\text{.}\) The gamma parameter controls how close \(x\) needs to be to \(l\) to produce values near \(1\text{.}\) For \(\gamma=0.1\) values of \(-5 \lt x \lt 5 \) will have scores noticeably above zero and be counted as similar to \(l\text{.}\) For \(\gamma=10\) only values of \(-1 \lt x \lt 1 \) will have scores noticeably above zero and be counted as similar to \(l\text{.}\) So larger values of \(\gamma\) mean only points very close to \(l\) will be viewed as similar, and smaller values of \(\gamma\) mean a wider range of points will be viewed as similar to \(l\text{.}\)
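
The similarity feature can be evaluated directly to see the effect of \(\gamma\text{.}\) This is a minimal sketch with a hypothetical function name, using the landmark \(l=0\) and the two values of \(\gamma\) discussed above.

```python
import math

def rbf_similarity(x, landmark, gamma):
    """Gaussian similarity exp(-gamma * (x - landmark)^2) between x and the landmark."""
    return math.exp(-gamma * (x - landmark) ** 2)

for gamma in (0.1, 10.0):
    # Similarity to the landmark l = 0 at increasing distances.
    print(gamma, [rbf_similarity(x, 0.0, gamma) for x in (0.0, 1.0, 5.0)])
```

With \(\gamma=0.1\) a point at distance 1 still scores about 0.9, while with \(\gamma=10\) the same point scores essentially 0.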

Answer 30

Try it and see!

Answer 31

The standard implementation uses every point in the data set as a landmark, and drops the original features. This creates a data set with the same number of features as data points. Note, this makes our visualization above inaccurate. We can't visualize 9D space to use all points as landmarks, but if we use \(l=0\) and \(l=3\) and \(\gamma=.1\) the transformed feature space is two dimensional and easily separable with a linear function.
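
The feature transformation can be sketched as follows. The nine one-dimensional points below are an assumption (the original data set is not reproduced here), and rbf_features is a hypothetical helper name; the landmarks \(l=0\text{,}\) \(l=3\) and \(\gamma=0.1\) are the values mentioned above.

```python
import math

def rbf_features(xs, landmarks, gamma):
    """Replace each 1D point by its similarity to every landmark."""
    return [[math.exp(-gamma * (x - l) ** 2) for l in landmarks] for x in xs]

xs = [float(v) for v in range(-4, 5)]  # assumed data: nine points from -4 to 4

# Two landmarks: each point becomes a point in a 2D feature space.
two_d = rbf_features(xs, [0.0, 3.0], gamma=0.1)

# Every data point as a landmark: as many features as data points.
all_landmarks = rbf_features(xs, xs, gamma=0.1)

print(len(two_d), len(two_d[0]))
print(len(all_landmarks), len(all_landmarks[0]))
```

Using every point as a landmark turns nine 1D points into a 9 by 9 feature matrix whose diagonal entries are all 1, since each point is perfectly similar to itself.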

Answer 32

If your model is overfitting, you should reduce \(\gamma\) and \(C\text{;}\) if the model is underfitting, you should increase them. Remember that a hard margin classification requires all data points to be outside the margin, while a soft margin classification allows some points to land inside the margin or on the wrong side of the margin. The goal in soft margin classification is to find a good balance between keeping the margin as large as possible and limiting the number of points that are allowed to be in the margin or on the wrong side of the margin. Essentially, the regularization parameter allows the model to vary between hard margin and soft margin. \(C=.001\) and \(\gamma=0.1\) is clearly underfitting (both are small), and \(C=1000\) and \(\gamma=5\) is clearly overfitting (both are large). The other cases are definitely better. I'd lean towards \(C=1000\) and \(\gamma=0.1\) as the best of these models.

Answer 33

Always try LinearSVC first. If the training set is not too large (less than 50,000 data points) you should try Gaussian RBF kernel next. But if your training set has special structure, then you may want to experiment with other kernels especially if there are kernels that are specialized for that type of structure.

Answer 34

SVM works best when the number of training instances is equal to or greater than the number of features. If the number of training instances is less than the number of features there is a dual form of the SVM problem that will be more effective.

Answer 35

Use grid search or randomized search! It is often faster to first do a very coarse grid search, then a finer grid search around the best values found. It is important to understand what each hyperparameter does (and how it relates to overfitting versus underfitting!) to search more effectively!
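
A coarse-then-fine grid search can be sketched in a few lines. Everything specific here is an assumption for illustration: validation_score is a made-up stand-in for training a model and scoring it on the validation set, and the grids are arbitrary powers of 10.

```python
import itertools
import math

def validation_score(C, gamma):
    """Hypothetical stand-in for a validation score; it peaks near C=30, gamma=0.3."""
    return (-(math.log10(C) - math.log10(30)) ** 2
            - (math.log10(gamma) - math.log10(0.3)) ** 2)

def grid_search(C_values, gamma_values):
    """Return the (C, gamma) pair with the highest validation score."""
    return max(itertools.product(C_values, gamma_values),
               key=lambda pair: validation_score(*pair))

# Coarse pass: powers of 10.
coarse_C = [10.0 ** k for k in range(-3, 4)]
coarse_gamma = [10.0 ** k for k in range(-3, 2)]
best_C, best_gamma = grid_search(coarse_C, coarse_gamma)

# Fine pass: quarter-decade steps around the coarse winner.
fine_C = [best_C * 10.0 ** (k / 4) for k in range(-4, 5)]
fine_gamma = [best_gamma * 10.0 ** (k / 4) for k in range(-4, 5)]
best_C_fine, best_gamma_fine = grid_search(fine_C, fine_gamma)

print(best_C, best_gamma)
print(best_C_fine, best_gamma_fine)
```

Because the fine grid contains the coarse winner, the second pass can only match or improve the validation score. In practice the scoring function would fit the SVM on the training data and evaluate it on the validation data.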

Answer 36

\(\theta_0\) is included in \(\Theta \) but not in \(\sum_{i=1} \theta_i^2 \text{.}\)

Section 7 Support Vector Machines

Question 7.1.

Question 7.2.

Question 7.3.

Question 7.4.

Question 7.5.

Question 7.6.

Question 7.7.

Question 7.8.

Subsection 7.1 Introduction

Question 7.9.

Question 7.13.

Subsection 7.2 Revising the cost function.

Question 7.15.

Question 7.16.

Question 7.17.

Question 7.20.

Question 7.21.

Question 7.22.

Question 7.23.

Question 7.25.

Question 7.26.

Question 7.27.

Question 7.28.

Question 7.29.

Question 7.30.

Subsection 7.3 Linear Implementation

Question 7.31.

Question 7.32.

Question 7.33.

Subsection 7.4 Kernel Trick

Example 7.34.

Question 7.35.

Question 7.36.

Question 7.37.

Question 7.38.

Question 7.39.

Question 7.40.

Question 7.41.

Question 7.42.