Skip to main content

Section 7 Support Vector Machines

Are we ready for another machine learning algorithm? How about some Calculus?

Question 7.1.

What machine learning algorithms do we have so far? What are the main steps in most of our supervised machine learning algorithms?

Answer 1

So far we have one unsupervised learning algorithms, Principal Component Analysis, and five supervised learning algorithms:

  1. Linear Regression
  2. Linear Regression with Polynomial Features
  3. Logistic Regression
  4. Logistic Regression with Polynomial Features
  5. k Nearest Neighbors
Answer 2

We generally have six steps in our regression models.

  1. Get, process, clean data for \(X\) and \(y\text{.}\) Split into training and testing. Then explore/analyze training data set to understand the data. If necessary, perform feature scaling (fit on only training data, but applied to all data.)

    Question 7.2.
    Why should we perform data analysis on only the training data and not the testing data set? Answer
    We need the testing set to be an unbiased measure of the performance of our model. If we explore the testing set as part of our data analysis we are leaking information about the testing set to us (and therefore our model.) We've been blurring this line a bit because of the small size of some of our data sets, but as we move to larger data sets we want to be more careful about this. (And its very important for real world implementation.)
    Question 7.3.
    When is feature scaling necessary? Why is it important? Answer
    Almost always. Its especially important if the data features have widely different scales. In most cases it impacts the convergence of the optimization technique. In some cases it also affects the ability of the model to make accurate predictions.
    Question 7.4.
    Our parameter searching from last time actually adds a step here. What did we add? Answer
    Split data into train, validate, and test. Validation data is used to tune parameters/hyperparameters and testing is reserved for an unbiased estimate of final model that can be applied to data that was not used to build the model or tune the parameters.

  2. Learn Function \(F\) such that \(F(X) \approx y \text{.}\)

    Question 7.5.

    Which of our previous models do not learn a function that approximates \(y\text{?}\)

    Answer

    PCA is unsupervised so there is no y to approximate.

    kNN does not learn a function, or follow this six step process. kNN memorizes the data and must search through the entire data set to find the \(k\) nearest points. It then makes a prediction based on majority vote.

  3. Choose a Model.

    Question 7.6.

    What types of models have we used so far?

    Answer

    Linear, logistic and polynomial!

    For the linear model

    \begin{equation*} h_\Theta (X) = X \Theta . \end{equation*}

    For the logistic model

    \begin{equation*} h_\Theta(X)=g({X}{\Theta})= \frac{1}{1+e^{-{X}{\Theta}}}. \end{equation*}

    (Also softmax model for multinomial logistic regression.)

    Our polynomial model doesn't actually change the model, it just adds polynomial features.

  4. Choose a Cost Function.

    Question 7.7.

    What did we recently add to our cost functions and what type of cost functions do we have so far? What are we looking for in a cost function in general?

    Answer

    We added regularization here!

    We are using either mean squared error in Linear Regression or log-loss (or cross entropy loss) in Logistic Regression (multinomial), plus an appropriate regularization.

    For the mean squared error

    \begin{equation*} J = \frac{1}{2m} \sum_{i=1}^m \left[ h_\theta(X^{(i)} - y^{(i)} \right]^2 + \alpha \sum_i \theta_i^2 \end{equation*}

    For log-loss

    \begin{equation*} J= -\frac{1}{m} \sum_{i=1} ^m y_i\log(h_\Theta(X^{(i)})) +(1-y_i)\log(1-h_\Theta(X{(i)})) + \frac{1}{C} \sum_i \theta_i^2. \end{equation*}

    Note: this is ridge or l2 penalty regularization.

    The important features of our cost function are some way to measure error in our prediction and a smooth/convex function that is easy to optimize.

  5. Solve for \(\Theta\text{.}\)

  6. Make Predictions!

Question 7.8.
What other fun and exciting topics are left in this class? Answer
We have two more supervised learning algorithms (support vector machines and neural networks) and one more unsupervised learning algorithm (clustering).

Subsection 7.1 Introduction

Our next type of machine learning algorithm is called Support Vector Machines. Support Vector Machines (SVM) is one of the most popular and powerful machine learning algorithms. The idea was originally developed at AT&T Bell Laboratories The SVM algorithm dates to 1963 and work done by Vapnik and Chervonenkis. In the 1990's, Vapnik, Boser and Guyon expanded the model to include a non-linear component (kernel trick), and Cortes and Vapnik developed a more robust model for outliers (soft margin versus hard margin classification). It is based on statistical learning frameworks or VC theory and can be used for classification and regression. It can also be used for outlier detection, however it is especially powerful for classification of complex small to medium size data sets (up to approximately 50,000 data points). The idea is that we want a large margin classifier. What does large margin mean?

We will illustrate with the simple classification example below. In this case we want to find a linear decision boundary to separate the data set into two classes. Four lines are given that all separate the two classes of data.

Question 7.9.
Which decision boundary do we like the best and why?
Figure 7.10. Four Options for Decision Boundaries
Answer
We want the decision boundary that will do the best job of classifying new data. The black and blue lines seem problematic since they are so close to the data. The green and brown lines are much better.

The idea for using support vector machines for classification is that we want to find a line that has the largest possible distance from all the points in the data set. There should be a large margin \(m\) between the line and the data, and a margin of \(2m\) between the two different data sets.

Figure 7.11. A Large Margin for Classification

The points on the edge of the margin (points A and B) support the model. That is, adding more points would not change the decision boundary as long as they are outside the margin. These points are called support vectors.

The idea for using support vector machines for regression is similar except that we want the wide margin to contain all the points in our data set rather than separate them.

Figure 7.12. A Large Margin for Regression

Question 7.13.
Which aspects of our six step machine learning summary will we need to change? Answer
The first and main thing we need to change is the cost function. We need a cost function that will ensure a wide margin and still work well for optimization. We will then make some changes to our model to allow non-linear decision boundaries (kernel trick). Lastly we will make some more modifications to allow for a small number of points to live outside the margins (soft margin versus hard margin).
Figure 7.14. Soft Margin Situations

Subsection 7.2 Revising the cost function.

How we are going to revise the cost function and why is going to take some work! The idea is to use a modified version of the cost function for logistic regression. Recall that binary classification with logistic regression uses the log-loss cost function.

\begin{equation*} J= \frac{1}{m} \sum_{i=1} ^m -y_i\log\left( \frac{1}{1+e^{-{X}{\Theta}}} \right) -(1-y_i)\log\left( 1-\frac{1}{1+e^{-{X}{\Theta}}} \right) + \alpha \sum_i \theta_i^2 \end{equation*}

Note I've moved the minus signs back inside the sum here. There are three modifications that we will make to this function; two of them are minor notational changes and the last one is the substantial change.

  1. Change \(\alpha\) to \(C\) and move from regularization term to cost term.

    \begin{equation*} J= C\frac{1}{m} \sum_{i=1} ^m -y_i\log\left( \frac{1}{1+e^{-{X}{\Theta}}} \right) -(1-y_i)\log\left( 1-\frac{1}{1+e^{-{X}{\Theta}}} \right) + \sum_i \theta_i^2. \end{equation*}

    This is the same as replacing \(\alpha\) with \(\frac{1}{C}\) and multiplying by \(C\text{.}\) This is mostly a notational preference but it also indicates a shift in importance of the \(\sum_i \theta_i^2\) term. Note that now large values of \(C\) mean little regularization and low values of \(C\) mean lots of regularization.

  2. Eliminate the constant \(\frac{1}{m}\text{.}\)

    \begin{equation*} J= C \sum_{i=1} ^m -y_i\log\left( \frac{1}{1+e^{-{X}{\Theta}}} \right) -(1-y_i)\log\left( 1-\frac{1}{1+e^{-{X}{\Theta}}} \right) + \sum_i \theta_i^2. \end{equation*}

    This is also a minor change. We could view it as a minor notational change to replace the constant \(\frac{C}{m}\) with a different constant \(C\text{.}\) The important point is that \(\frac{1}{m} \operatorname{Cost}(\Theta)\) will be minimized at exactly the same place as \(\operatorname{Cost}(\Theta)\) so that eliminating the \(\frac{1}{m}\) will have no impact on the value of \(\Theta\) that minimizes the cost.

    Question 7.15.
    What value of \(x\) minimizes \(y=3x^2-12x\text{?}\) What value of \(x\) minimizes \(y=\frac{1}{3}(3x^2-12x)=x^2-4x\text{.}\) Answer
    Yay Calculus! We find the minimum by setting the derivative equal to 0! \(y'=6x-12=0\) if \(x=2\text{.}\) Similarly \(y'=2x-4=0\) if \(x=2\text{.}\)

    Question 7.16.
    Can we just eliminate \(C\) also for the same reason? Answer
    No! The constant \(C\) controls the weighting between regularization and the cost function so eliminating \(C\) would result in a loss of flexibility in our model. Different values of \(C\) will change the value of \(\Theta\) that minimizes the cost.
    Question 7.17.
    How? What does regularization do again? Answer
    Regularization tries to control overfitting by constraining the size of \(\Theta\text{.}\)

  3. This is the big change!! Flatten the cost function into a piecewise linear function. Remember that we have two cases in the cost function, corresponding to the two classes of \(X^i\text{.}\)

    Figure 7.18. Cost Function if \(y^i=1\) (class 1)
    Figure 7.19. Cost Function if \(y^i=0\) (class 0)

    To convert this to a formula we first replace the log pieces with the piecewise linear pieces.

    \begin{equation*} J= C \sum_{i=1} ^m -y_i\log\left( \frac{1}{1+e^{-{X}{\Theta}}} \right) -(1-y_i)\log\left( 1-\frac{1}{1+e^{-{X}{\Theta}}} \right) + \sum_i \theta_i^2\\ = C \sum_{i=1} ^m y^i \left(\max(0,1-X\Theta) \right) +(1-y^i)\left(\max(0, 1+X\Theta \right) + \sum_i \theta_i^2 \end{equation*}

    Then we have some fun with algebra! Remember that the \(y^i \) and \(1- y^i \) are either zero or one, so we are really just adding either \(\max(0,1-X\Theta) \) or \(\max(0,1+X\Theta).\) The only difference between these two pieces is the plus/minus sign. So we can combine this into a single piece with a new variable \(t^i\) to account for the plus/minus sign.

    \begin{equation*} J = C \sum_{i=1} ^m \left(\max(0,1- t^i X \Theta) \right) \text{ where } t^i= \begin{cases} +1 \amp \text{ if } y^i=1 \\ -1 \amp \text{ if } y^i=0 \end{cases} \; \; + \sum_i \theta_i^2 \end{equation*}

    Question 7.20.
    Do we need the new variable \(t^i\text{?}\) Could we specify this in terms of \(y^i\text{?}\) Answer
    We could definitely specify this in terms of \(y^i\) instead. Let \(t^i=(-1)^{y^i+1}\text{.}\) Yay, math!

Okay, we have a new cost function! This cost function is commonly referred to the hinge loss cost function. But why did we do this?!?

Question 7.21.
What aspect of the function \(\max(0,1\pm X\Theta)\)
looks way nicer than the log-loss function and what looks less nice? Answer
Linear is almost always way nicer! But corners are bad! The function is not differentiable there! So the non-differentiable point is a problem for optimization methods, but the linear component simplifies the calculations enough that this is still a computational advantage.
Question 7.22.
What was the goal of support vector machines again? Answer
Large margin classification. So the other main reason for changing the cost function is to create the large margin. The rough idea is that the location for where the function is zero and the 'gap' between -1 and 1 allow us to create a large margin.
Question 7.23.
When is \(\max(0,1\pm X\Theta)\) equal to zero? What is \(J\) in that case? Answer
Depends on the class of \(X^i\text{.}\) If \(X^i\) has class 0 then
\begin{equation*} \max(0,1 + X\Theta)=0 \text{ if } X\Theta \leq -1 \text{.} \end{equation*}
If \(X^i\) has class 1 then
\begin{equation*} \max(0,1 - X\Theta)=0 \text{ if } X\Theta \geq 1 \text{.} \end{equation*}
Figure 7.24. Cost Functions
\begin{equation*} J = C \sum_{i=1} ^m \left(\max(0,1- t^i X \Theta) \right) \text{ where } t^i= \begin{cases} \amp +1 \text{ if } y^i=1 \\ \amp -1 \text{ if } y^i=0 \end{cases} \; \; + \sum_{i=1} \theta_i^2 \end{equation*}
If \(\max(0,1 + X\Theta)=0\) then \(J=\sum_i \theta_i^2\text{.}\) So the cost is primarily dependent on \(\|\Theta\|\) in this case. We want to explore why forcing \(\|\Theta\|\) to be small to minimize the cost will correspond to a large margin classifier.
Question 7.25.
What does \(\|\Theta\| \) notation mean? Answer
Its the length of the vector \(\Theta \) which is
\begin{equation*} \sqrt{\sum_{i=0} \theta_i^2}\text{.} \end{equation*}
We could also write this using dot product notation as \(\sum_{i=0} \theta_i^2 =\Theta \cdot \Theta.\) What else do we remember about dot products from Calculus?
Question 7.26.
Suppose \(\Theta=(0,\frac{1}{2},1)\text{.}\) What does \(X \Theta=0 \text{,}\) \(X \Theta=1 \text{,}\) \(X \Theta=-1 \) correspond to? How many features are there? Answer
There are two features, \(x_1, x_2 \text{,}\) and \(X \Theta= 0+\frac{1}{2} x_1 + 1 x_2 =\frac{1}{2} x_1+x_2\text{.}\) Thus,
\begin{equation*} X \Theta=0 \text{ corresponds to } \frac{1}{2} x_1+x_2=0 \text{ or } x_2= -\frac{1}{2} x_1. \end{equation*}
\begin{equation*} X \Theta=1 \text{ corresponds to } \frac{1}{2} x_1+x_2=1 \text{ or } x_2= 1- \frac{1}{2} x_1. \end{equation*}
\begin{equation*} X \Theta=-1 \text{ corresponds to } \frac{1}{2} x_1+x_2=-1 \text{ or } x_2= -1 - \frac{1}{2} x_1. \end{equation*}
Aha! This looks like a decision boundary with margin!
Question 7.27.
Why will the black line have a lower cost than the blue line?
Answer
Remember we are examining the case where \(J \approx \|\Theta\|\text{.}\) We will start by viewing \(\Theta=(\theta_1,\theta_2)\) as a two dimensional vector. (We are assuming \(\theta_0=0\) and dropping it.) Then \(X \Theta = (x_1,x_2)\cdot(\theta_1,\theta_2)\) can be thought of as a dot product.
Question 7.28.
When is a dot product equal to zero? Answer
When the vectors are perpendicular!
Thus \(\Theta\) must be perpendicular to the decision boundary (and through the origin since \(\theta_0=0\text{.}\) ) Let's consider the data point \(x^i \text{.}\) It is class 1 so we must have \(X \cdot \Theta \geq 1\text{.}\)
Question 7.29.
What is \(X \cdot \Theta\) in general? That is, what is the formula for a dot product? Answer
\(\|X\| \; \|\Theta\| \cos(\phi)\) where \(\phi\) is the angle between \(X\) and \(\Theta\text{.}\)
Thus, \(\|X\| \; \|\Theta\| \cos(\phi) \geq 1.\) Let's solve for \(\|\Theta\|\text{.}\) \(\|\Theta\| \geq \frac{\|X\|}{\cos(\phi)} \text{.}\)
Which value of \(\phi \) will produce a smaller value of \(\|\Theta\|\text{?}\) Hint: \(\cos(0)=1\) and \(\cos(90^\circ)=0\text{.}\)

In summary, the cost function

\begin{equation*} J = C \sum_{i=1} ^m \left(\max(0,1- t^i X \Theta) \right) \text{ where } t^i= \begin{cases} \amp +1 \text{ if } y^i=1 \\ \amp -1 \text{ if } y^i=0 \end{cases} \; \; + \sum_{i=1} \theta_i^2 \end{equation*}

The first term in this function essentially computes the sum of the margin violations. It will be zero if the point is outside the margin and on the correct side of the margin for its class. If it is not zero it will be related to the distance to the decision boundary. Minimizing this term ensures that the model makes the margin violations as small and as few as possible.

The second term will push the model to have a small weight vector \(\Theta\) which corresponds to a larger margin.

Question 7.30.

What is the value of \(C\) doing in our model? The value of \(C\) must be positive. What happens in the extreme cases? What happens if \(C=0\text{?}\) What happens if \(C=\infty\text{?}\)

Answer

If \(C=\infty\) then we only care about the first term and do not care about the second term. That is, we do not care about a large margin, we only care about margin violations. This will produce a hard margin model. All points must be outside the margin. You can try this value in the model, with C=float("inf").

If \(C=0\) then we only care about the second term and do not care about the first term. That is, we do not care about margin violations, we only care about a wide margin. This will produce a soft margin model. Note that \(C=0\) is not positive and this is not actually a valid parameter for the model.

Of course, we don't usually want to be at either of these extremes and the goal is to find the right balance between the hard margin and the soft margin.

Subsection 7.3 Linear Implementation

The are several ways to implement SVM for classification with linear models. The most efficient is usually from sklearn.svm import LinearSVC and would be implemented as LinearSVC(C=1000, loss="hinge", random_state=42). We will specify the loss="hinge" to correspond to the cost function discussed above. The default loss is squared_hinge which, not surprisingly, squares the hinge loss. The disadvantage of this is that the linear function becomes a quadratic function, but the advantage is that there is no longer a non-differentiable point. This model will punish large errors more significantly than small errors, which may or may not be helpful. We will stick with the hinge loss, but in general you should consider which loss is most beneficial for your situation.

Question 7.31.
What does C=1000 mean in the model? Answer
A large value of C will mean little regularization. This is an important parameter to tune in our model.
The default parameters and their default values are

  LinearSVC(penalty='l2',loss='squared_hinge',dual=True,
    tol=0.0001,C=1.0,multi_class='ovr',fit_intercept=True,
    intercept_scaling=1,class_weight=None,verbose=0,
    random_state=None,max_iter=1000)
  

Most of these parameters should look familiar by now. We can use the attributes .intercept_ and .coef_ to find the equation of the decision boundary and calculate the margin. As usual we can use .predict to make predictions on new data.

Question 7.32.
What is the margin? How do we calculate it from .intercept_ and .coef_? Answer
The margin will tell us how far each class boundary line is shifted from the decision boundary line. Suppose \(\Theta=(0,\theta_1,\theta_2)\text{.}\) The class boundary lines correspond to \(X \Theta=1 \) and \(X \Theta=-1 \text{.}\) The decision boundary corresponds to \(X \Theta=0 \text{.}\) Since \(X \Theta= \theta_1 x_1 + \theta_2 x_2 \) we have
\begin{equation*} X \Theta=-1 \text{ corresponds to }\theta_1 x_1 + \theta_2 x_2=-1 \text{ or } x_2= \frac{-1}{\theta_2} - \frac{\theta_1}{\theta_2} x_1. \end{equation*}
\begin{equation*} X \Theta=0 \text{ corresponds to }\theta_1 x_1 + \theta_2 x_2=0 \text{ or } x_2= -\frac{\theta_1}{\theta_2} x_1. \end{equation*}
\begin{equation*} X \Theta=1 \text{ corresponds to }\theta_1 x_1 + \theta_2 x_2=1 \text{ or } x_2= \frac{1}{\theta_2} - \frac{\theta_1}{\theta_2} x_1. \end{equation*}
Thus the margin is \(\frac{1}{\theta_2}\text{.}\) (Note that this is not the same as the width of the margin, because it is a vertical shift, not a perpendicular distance.)

Let's look at applying SVM to our simple data set from before. It's too small to train/test split, but feature scaling is incredibly important to SVM so we will create a Pipeline to combine StandardScaler and LinearSVC.

Let's graph our decision boundary and margins! Remember that our decision boundary is \(X \Theta=0 \) which corresponds to

\begin{equation*} \theta_0+\theta_1 x_1 + \theta_2 x_2=0 \text{ or } x_2= -\frac{\theta_0}{\theta_2}-\frac{\theta_1}{\theta_2} x_1 \end{equation*}

and the margins are just shifts of this by \(\frac{1}{\theta_2}\text{.}\)

Question 7.33.
What happens if we change the value of C=float("inf")? What about C=.1? C=.001? Answer
Try it and see! Solution
C=1000 C=inf C=0.1 C=0.001
\(\theta_0\) 0.108 0.108 0.080 0.001
\(\theta_1\) 0.685 0.685 0.611 0.013
\(\theta_2\) 0.730 0.730 0.454 0.012
margin 1.369 1.369 2.204 82.189
Large values of \(C\) correspond to smaller margins and fewer margin violations. (None in this case.) Smaller values of \(C\) correspond to larger margins with more margin violations. In the case of \(C=0.001\) all the points are in the margin!

Subsection 7.4 Kernel Trick

In many cases a line is too simple to separate data points effectively. In this case, we can add more features, which is referred to as a kernel. The first type of kernel is a polynomial kernel. For a polynomial kernel we add polynomial features to our data set. (Just like we did for polynomial regression!) We can then find a linear separation in a higher dimensional space.

Example 7.34.
An example of a one dimensional data set is given on the left. There is no way to separate this data with a line! In the image on the right we have added a second feature \(x_2=(x_1)^2 \) and expanded to a second dimension. We can now separate the data with a line!

The kernel trick refers to a technique to make use of all the additional polynomial features without actually having to explicitly construct them. We could do this by using the PolynomialFeatures() and adding it to our Pipeline, but there's an even easier way, using SVC instead of LinearSVC.

We implement SVC with a polynomial feature by using kernel ="poly" and we will need to explore the best choice of parameters degree, C, and coef0. The hyperparameter coef0 controls how much the model is influenced by high- degree terms versus low-degree terms. Small values of coef0 will be closer to a linear decision boundary (favor the low degree terms) and and high values of coef0 will emphasize the high degree terms.

The coef_ attribute is not valid any more for a non-linear kernel, but there is a new attribute that allows us to examine the support vectors support_vectors_ .

To graph the decision boundary, we will have to return to contour maps and using .predict.

Question 7.35.
What do you think happens as we vary C? Answer
Try it and see!

But maybe polynomial features aren't really what we want. Another way to add features is to create new features that measure the similarity of each data point to a set of landmarks. This kernel is called the Radial Basis Kernel and uses the Gaussian Radial Basis function to determine similarity. Let \(l \) be a landmark point then the Guassian Radial Basis function is

\begin{equation*} \phi_\gamma(x,l) = e^{-\gamma \| x-l \|^2} \end{equation*}

Remember that \(\| x-l \|\) is the Euclidean distance from \(x\) to \(l\text{.}\)

Question 7.36.
Consider the landmark \(l=0\text{.}\) What does \(e^{-\gamma x^2 }\) look like? What is \(\gamma \text{?}\) What does adding this add a feature tell us about our data? Answer
\(e^{-\gamma x^2 }\) is a bell-shaped function varying from 0 (at \(x=0\)) to 1 \(as x \to \pm \infty\text{.}\) \(\gamma \) is a parameter which controls the width of the bell shape of \(e^{-\gamma x^2 }\text{.}\)
Adding this feature to our data produces a value near 1 if \(x\) is near the landmark, in this case \(l=0 \text{,}\) and produces a value of \(0\) if \(x\) is not near the landmark. Thus, it is a similarity metric, for how similar \(x\) is to \(l\text{.}\) The gamma parameter controls how close we need to be to \(x\) to produce values near \(1\text{.}\) For \(\gamma=0.1\) values of \(-5 \lt x \gt 5 \) will have nonzero scores to be counted as similar to \(x\text{.}\) For \(\gamma=10\) values of \(-1 \lt x \gt 1 \) will have nonzero scores to be counted as similar to \(x\text{.}\) So larger values of \(\gamma\) mean only points very similar to \(x\text{.}\) will be viewed as similar, and smaller values of \(\gamma\) mean a wider range of points will be viewed as similar to \(x\text{.}\)

Question 7.37.
What happens as we vary \(\gamma\) and \(l\text{?}\) Answer
Try it and see!

Thus, the idea here is to use each landmark to create a feature.

\begin{equation*} f_1(x)=e^{-\gamma \| x-l_1 \|^2}. \end{equation*}

Question 7.38.
What should we use for landmarks? Answer
The standard implementation uses every point in the data set as a landmark, and drops the original features. This creates a data set with the same number of features as data points. Note, this makes our visualization above inaccurate. We can't visual 9D space to use all points as landmarks, but if we use \(l=0\) and \(l=3\) and \(\gamma=.1\) the transformed feature space is two dimensional and easily separable with a linear function.

We implement this similarly with SVC using kernel ="rbf" and we will need to explore the best choice of parameters gamma and C.

Question 7.39.
What is the impact of varying gamma and C? When will C overfit? When will gamma overfit? How do we vary between hard margin and soft margin classification? Which of these combinations seems the best?
Answer
If your model is overfitting, you should reduce \(\gamma\) and \(C\text{;}\) if the model is underfitting, you should increase them. Remember that a hard margin classification will require all data points must be outside the margin and a soft margin classification allows some points to land inside the margin or on the wrong side of the margin. The goal in soft margin classification is to find a good balance between keeping the margin as large as possible and limiting points the number of points that are allowed to be in the margin or on the wrong side of the margin. Essentially the regularlization parameter allows the model to vary between hard margin and soft margin. \(C=.001\) and \(\gamma=0.1\) is clearly underfitting (both are small) \(C=1000\) and \(\gamma=5\) is clearly overfitting (both are large). The other cases are definitely better. I'd lean towards \(C=1000\) and \(\gamma=0.1\) as the best of these models.

Other kernels are possible, but there are some conditions that must be satisfied in order to apply the kernel trick. (See your text for details about Mercer's Theorem if you are interested. Chapter 5, p. 171.) Generally, Radial Basis Kernel is most frequently used kernel with SVC(and is the default kernel.) You can also use SVC with a kernel=linear, but LinearSVC uses a different optimization strategy (liblinear versus libsvm) and will work better for larger data sets.

Question 7.40.
That's a lot of kernels! How do we decide which to use? Answer
Always try LinearSVC first. If the training set is not too large (less than 50,000 data points) you should try Gaussian RBF kernel next. But if your training set has special structure, then you may want to experiment with other kernels especially if there are kernels that are specialized for that type of structure.

Question 7.41.
What is the dual parameter in SVC? Answer
SVM works best when the number of training instances is equal to or greater than the number of features. If the number of training instances is less than the number of features there is a dual form of the SVM problem that will be more effective.

Question 7.42.
There are a lot of hyperparameters to tune. How do we do that? Answer
Use grid search or randomized search! It is often faster to first do a very coarse grid search, then a finer grid search around the best values found. It is important to understand what each hyperparameter does (and how it relates to overfitting versus underfitting!) to search more effectively!

A few last notes:
  1. Feature scaling is very important to the models performance (and not just for the optimization)! Some features may be ignored without feature scaling.
  2. In order to apply the kernel trick, there is also a slight modification to the regularization part of the cost function. The \(\sum \theta_i \) is replaced with \(\Theta^T M \Theta \) where \(M\) depends on the kernel and is chosen for computation reasons. Note that if \(M\) is the identity matrix, then it is almost \(\sum \theta_i \text{.}\) Why almost? Answer
    \(\theta_0\) is included in \(\Theta \) but not \(\sum \theta_{i=1} \text{.}\)
  3. For support vector machines, the output of a model when predicting on a new data point is a class rather than a prediction. (This is different from logistic regression. )
  4. There's a lot more cool stuff to know about support vector machines!! Check out Chapter 5 of Hands-on Machine Learning