
Section 9 Neural Networks

Neural networks are some of the newest and most exciting machine learning models. However, neural networks build off the ideas that you have learned so far. Neural Networks may also be called artificial neural networks (ANN) or deep learning. The goal is to create a machine learning algorithm that can handle really complex problems. Neural networks are already used in a wide array of applications!

  1. image classification
  2. speech recognition
  3. recommendations (video and others)
  4. fake videos, or photos, or text, or ... (GANs = Generative Adversarial Networks)
  5. writing text

Subsection 9.1 Artificial Neurons

We will begin with an intuitive approach. Neural networks are modeled based on how the human brain works.

Here's an image of a Neuron in our brain. It takes inputs from other neurons in the form of neurochemicals. It has to decide to produce a signal or not. That signal travels down the axon and outputs to other neurons. These neurons then connect to other neurons in the brain and so on. The idea of how a single neuron works is not too complicated, but when billions of neurons combine to act simultaneously, our brains can do some amazing things! We will convert the human brain model to a mathematical model in the following way.

Inputs will be our features. Each circle represents a feature. The dotted circle represents the bias term which we always set to one. The middle circle represents the activation function. We can choose any activation function here, but to connect to our previous work we will start with the sigmoid function

\begin{equation*} g(\Theta^T X)= \frac{1}{1+e^{-\Theta^T X}}. \end{equation*}

The activation function determines how strongly the neuron should react to the input, or if it should react at all. There are many different choices for the activation function, for example a hinge-shaped function (such as ReLU) or a linear function. However, linear functions end up being too simple, so we generally want some non-linear activation function. The output \(g\) represents the output of the activation function. In the simplest case, this will be the answer to our question of making a prediction for the input \(X\text{.}\) The first addition we are going to make to this model is to put weights on each of the inputs.

The new process is that the input features are multiplied by the \(\Theta\) values and then activated with a sigmoid function.

Question 9.1.
This should feel really familiar. If there is just a single neuron, what process are we implementing? Answer
This is just logistic regression!

So how do we turn a single neuron into a neural network? Add lots of them and multiple layers of neurons!

The input layer still corresponds to the features. (The bias term is still here, but it is not generally included in neural network diagrams.) In this image, there are three hidden layers, but there could be any number of these hidden layers in general. They are called hidden because the inputs and outputs of these layers are hidden inside the model. The depth refers to the number of hidden layers. The width refers to the number of neurons per layer. Each layer can have a different number of neurons. The output layer is always the last layer and corresponds to “the answer”, that is, the prediction for input \(X\) based on our model. At each layer, the outputs from the previous layer are the inputs for the next layer. The \(\Theta\) values correspond to the weights on each of the lines between layers.

We will introduce some notation using a simpler example.

In this example, there are three total layers and one hidden layer. The \(\Theta\) values will weight how much the feature is used in the activation function in the next step.

\(\theta_{ij} ^{(J)}\) maps from unit \(i\) in layer \(J\) to unit \(j\) in layer \(J+1\text{.}\)

\(a_i ^{(J)}\) represents the activation of unit \(i\) in layer \(J\text{.}\)

In this example, there are three input features in the input layer, three activation units in the hidden second layer, and a final output in the third layer. The output in the third layer is the answer or prediction for \(X\text{.}\) There will be two sets of \(\Theta\) values in this example, one from layer 1 to layer 2, \(\Theta_{12}\text{,}\) and one from layer 2 to layer 3, \(\Theta_{23}\text{.}\)

Question 9.2.
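What dimensions will the \(\Theta_{12}\) matrix have? What about \(\Theta_{23}\text{?}\) Answer
The matrix \(\Theta_{12}\) will have dimension \(4 \times 3\text{:}\) four rows for the three input features plus the bias term, and three columns for the three hidden units. The matrix \(\Theta_{23}\) will have dimension \(4 \times 1\text{.}\)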
We can represent the steps between layer 1 and layer 2 mathematically:
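As a rough illustration of these steps, the sketch below pushes one input through a network with three inputs, three hidden units, and one output. The weight values are hypothetical and chosen only to show the shapes; the helper name `layer_forward` is also just for this example. Each layer prepends the bias value 1, forms the weighted sums, and applies the sigmoid.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer_forward(inputs, theta):
    """theta has one row per input (bias first) and one column per unit in the next layer."""
    with_bias = [1.0] + list(inputs)          # prepend the bias term
    n_out = len(theta[0])
    return [
        sigmoid(sum(with_bias[i] * theta[i][j] for i in range(len(with_bias))))
        for j in range(n_out)
    ]

# Hypothetical weights, chosen only to illustrate the dimensions.
theta_12 = [[0.1, -0.2, 0.3],
            [0.4, 0.5, -0.6],
            [-0.7, 0.8, 0.9],
            [1.0, -1.1, 1.2]]                 # 4 x 3: (3 inputs + bias) by 3 hidden units
theta_23 = [[0.5], [-1.0], [1.5], [2.0]]      # 4 x 1: (3 hidden units + bias) by 1 output

x = [1.0, 0.0, 2.0]
a2 = layer_forward(x, theta_12)               # activations of the hidden layer
a3 = layer_forward(a2, theta_23)              # the prediction for x
print(a2, a3)
```

The row counts of the two matrices are one more than the number of units feeding into them, which is where the extra bias row comes from.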

When building a neural net, we need to

  1. Choose the architecture (choose depth and width and activation, choose features, scale features, etc.)
  2. Train the neural network (solve for theta values, use an optimizer, but more later. vary learning rates, and regularization.)
  3. Tune and test (hyperparameter tuning, depth, width, activation are new hyperparameters as well as regularization, learning rates) There's a lot here! So leave time for this in the homework!

To get a deep intuition for what's happening we will use some simple functions. The functions we will use are the logic gates OR, AND, NOT, and XOR. These are common functions from electrical engineering. Example: Consider two inputs \(x_1\) and \(x_2\text{.}\) They can be 1=ON or 0=OFF. The OR function tests if at least one of them is on.

\(x_1\) \(x_2\) \(x_1\) OR \(x_2\)
0 0 0
1 0 1
0 1 1
1 1 1
We want to create a simple two layer neuron that will give the same output as the OR function on two input features. That is, we want to find \(\Theta\) values in the image below.
We need the activation function

\begin{equation*} g(z)= \frac{1}{1+e^{-{z}}} \end{equation*}

to produce 0 or 1.

Question 9.3.
When will the sigmoid function output 0? When 1? Answer
Remember the sigmoid function is \(g(z)=\frac{1}{1+e^{-z}}\text{,}\) so if \(z\) is a large negative number then it will output approximately 0 and if \(z\) is a large positive number then it will output approximately 1. (Here large is probably just \(|z| \gt 5 \text{.}\))
So how can we pick \(\Theta\) values to produce the desired output? Remember \(z=\theta_0 + \theta_1 *x_1 + \theta_2 *x_2\text{.}\)

We want to match up the output values for each input value. We'll begin with input \((0,0)\text{.}\)

\begin{equation*} a_1^{(2)}(0,0)=g(\theta_0 + \theta_1 *0 + \theta_2 *0) =g(\theta_0). \end{equation*}

In order for \(g(\theta_0)=0\text{,}\) we can pick \(\theta_0 = -10\) or any “large” negative number.

Next we examine input \((1,0)\text{.}\)

\begin{equation*} a_1^{(2)}(1,0)=g(\theta_0 + \theta_1 *1 + \theta_2 *0) =g(\theta_0+ \theta_1) =g(-10 +\theta_1). \end{equation*}

In order for \(g(-10 +\theta_1)=1\text{,}\) we need to pick \(\theta_1\) so that \(-10 +\theta_1\) is a “large” positive number. Thus, we can pick \(\theta_1 = 20\text{.}\)

Next we examine input \((0,1)\text{.}\)

\begin{equation*} a_1^{(2)}(0,1)=g(\theta_0 + \theta_1 *0 + \theta_2 *1) =g(\theta_0+ \theta_2) =g(-10 +\theta_2). \end{equation*}

In order for \(g(-10 +\theta_2)=1\text{,}\) we need to pick \(\theta_2\) so that \(-10 +\theta_2\) is a “large” positive number. Thus, we can pick \(\theta_2 = 20\text{.}\)

Note we don't have any \(\theta_i\) left to pick, so we better check that this works for the last input case. (If not, we would need more layers.)

Checking input \((1,1)\text{.}\)

\begin{equation*} a_1^{(2)}(1,1)=g(\theta_0 + \theta_1 *1 + \theta_2 *1) =g(\theta_0+ \theta_1+ \theta_2) =g(-10 +20 +20) =1. \end{equation*}

Yay, it works! Thus our activation function \(g(-10+20x_1+20x_2)\) will produce the same output as OR in all cases for two features.
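A quick numerical check of the OR neuron, as a minimal sketch: the function name `or_neuron` is just for this example. It prints the raw sigmoid outputs so you can see how close to 0 and 1 they actually are.

```python
import math

def g(z):
    """Sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-z))

def or_neuron(x1, x2):
    # theta_0 = -10, theta_1 = 20, theta_2 = 20, as chosen above
    return g(-10 + 20 * x1 + 20 * x2)

for x1, x2 in [(0, 0), (1, 0), (0, 1), (1, 1)]:
    print(x1, x2, or_neuron(x1, x2))
```

The outputs are never exactly 0 or 1, only very close, which is why the “large” \(|z|\) values matter.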

Activity 9.1.

Construct explicit \(\Theta\) values that will produce the same function as \(x_1\) AND \(x_2\) for a two layer neuron with two input features.

\(x_1\) \(x_2\) \(x_1\) AND \(x_2\)
0 0 0
1 0 0
0 1 0
1 1 1

Construct explicit \(\Theta\) values that will produce the same function as NOT \(x_1\) for a two layer neuron with one input feature.

\(x_1\) NOT \(x_1\)
0 1
1 0

For AND, one solution is \(\theta_0 = -30, \theta_1= 20, \theta_2 = 20\text{.}\) For NOT, one solution is \(\theta_0=10, \theta_1=-20\text{.}\)
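The suggested solutions can be checked by rounding the sigmoid output to the nearest integer and comparing with the truth tables. This is only a sketch; the helper `neuron` and the variable names are hypothetical.

```python
import math

def neuron(theta, inputs):
    """Single sigmoid neuron; theta[0] is the bias weight."""
    z = theta[0] + sum(t * x for t, x in zip(theta[1:], inputs))
    return 1.0 / (1.0 + math.exp(-z))

AND_THETA = [-30, 20, 20]
NOT_THETA = [10, -20]

# Round each output to 0 or 1 and collect the resulting truth tables.
and_table = {(x1, x2): round(neuron(AND_THETA, [x1, x2]))
             for x1 in (0, 1) for x2 in (0, 1)}
not_table = {x1: round(neuron(NOT_THETA, [x1])) for x1 in (0, 1)}

print(and_table)
print(not_table)
```

Other choices work too: any weights that make \(z\) strongly negative or strongly positive in the right cases give the same tables.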
Example 9.4.

Sometimes a single neuron is not enough to encode a function. Consider the exclusive or function, XOR, which is true if exactly one of \(x_1\) or \(x_2\) is true.

\(x_1\) \(x_2\) \(x_1\) XOR \(x_2\)
0 0 0
1 0 1
0 1 1
1 1 0

Question 9.5.
Why is it not possible to construct a two layer model with a single neuron? Answer
For a single neuron we would need to find \(\theta_0,\theta_1,\theta_2\) such that \(z(x_1,x_2)=\theta_0+\theta_1 x_1+\theta_2 x_2\) satisfies
\begin{equation*} z(0,0)=\theta_0 \lt 0, \\ z(1,0)=\theta_0+\theta_1 \gt 0,\\ z(0,1)=\theta_0+\theta_2 \gt 0, \\ \text{ and } z(1,1)=\theta_0+\theta_1+\theta_2 \lt 0\text{.} \end{equation*}
It is not possible to solve all of these simultaneously. If
\begin{equation*} \theta_0 \lt 0, \\ \theta_0+\theta_1 \gt 0, \\ \theta_0+\theta_2 \gt 0, \end{equation*}
then
\begin{equation*} \theta_2 \gt -\theta_0 \gt 0 \text{ and }\theta_0+\theta_1+\theta_2 \gt 0 + \theta_2 \gt -\theta_0 \gt 0, \end{equation*}
which contradicts the last condition.

So let's add a hidden layer with two neurons in the hidden layer.

Question 9.6.
How many \(\Theta\) values do we need to assign here? Answer
There are six \(\Theta\) values between layers 1 and 2 and three \(\Theta\) values between layers 2 and 3. So there are nine different \(\Theta\) values to assign.

This is a lot of \(\Theta\) values to try to assign by hand, so we want to try to build this from logic gates that we already know. We can think of \(x_1\) XOR \(x_2\) as \(x_1\) OR \(x_2\) but NOT (\(x_1\) AND \(x_2\) ).

To see this in a truth table,

\(x_1\) \(x_2\) \(x_1\) XOR \(x_2\) \(x_1\) AND \(x_2\) NOT(\(x_1\) AND \(x_2\)) \(x_1\) OR \(x_2\) [\(x_1\) OR \(x_2\)] AND [NOT(\(x_1\) AND \(x_2\)) ]
0 0 0 0 1 0 0
1 0 1 0 1 1 1
0 1 1 0 1 1 1
1 1 0 1 0 1 0

Thus, we can produce XOR by choosing \(\Theta\) values so that \(a_1^{(2)} \) evaluates the \(x_1\) OR \(x_2\) function, \(a_2^{(2)} \) evaluates NOT(\(x_1\) AND \(x_2\)), and \(a_1^{(3)} \) evaluates \(a_1^{(2)} \) AND \(a_2^{(2)}\text{.}\)

The \(\Theta\) values for OR were -10,20,20. Thus,

\begin{equation*} \theta_{01} ^{(1)}=-10 \end{equation*}
\begin{equation*} \theta_{11} ^{(1)}=20 \end{equation*}
\begin{equation*} \theta_{21} ^{(1)}=20 \end{equation*}

The \(\Theta\) values for AND were -30,20,20. Thus,

\begin{equation*} \theta_{01} ^{(2)}=-30 \end{equation*}
\begin{equation*} \theta_{11} ^{(2)}=20 \end{equation*}
\begin{equation*} \theta_{21} ^{(2)}=20 \end{equation*}

In order to create the NOT of a function, we want to switch all the zeros to ones and all the ones to zeros. The simplest way to do that is to switch all the signs of the coefficients. So the \(\Theta\) values for NOT AND are 30,-20,-20. Thus,

\begin{equation*} \theta_{02} ^{(1)}=30 \end{equation*}
\begin{equation*} \theta_{12} ^{(1)}=-20 \end{equation*}
\begin{equation*} \theta_{22} ^{(1)}=-20 \end{equation*}

This produces the three activation functions:

\begin{equation*} a_1^{(2)} = g(-10+20x_1+20x_2) \end{equation*}
\begin{equation*} a_2^{(2)}=g(30-20x_1-20x_2) \end{equation*}
\begin{equation*} a_1^{(3)} = g(-30 +20a_1^{(2)}+20a_2^{(2)}) \end{equation*}

Let's check all the cases to show this works.

\(x_1\) \(x_2\) \(a_1^{(2)} = g(-10+20x_1+20x_2)\) \(a_2^{(2)}=g(30-20x_1-20x_2)\) \(g(-30 +20a_1^{(2)}+20a_2^{(2)})\)
0 0 \(g(-10)=0\) \(g(30)=1\) \(g(-30+20)=0\)
1 0 \(g(-10+20)=1\) \(g(30-20)=1\) \(g(-30+20+20)=1\)
0 1 \(g(-10+20)=1\) \(g(30-20)=1\) \(g(-30+20+20)=1\)
1 1 \(g(-10+20+20)=1\) \(g(30-20-20)=0 \) \(g(-30+20)=0\)
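The same check can be run numerically. The sketch below wires the three activations together exactly as in the table and rounds the final output; the function name `xor_network` is just for this example.

```python
import math

def g(z):
    """Sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-z))

def xor_network(x1, x2):
    a1 = g(-10 + 20 * x1 + 20 * x2)     # hidden unit 1: x1 OR x2
    a2 = g(30 - 20 * x1 - 20 * x2)      # hidden unit 2: NOT (x1 AND x2)
    return g(-30 + 20 * a1 + 20 * a2)   # output unit: a1 AND a2

inputs = [(0, 0), (1, 0), (0, 1), (1, 1)]
xor_outputs = [round(xor_network(x1, x2)) for x1, x2 in inputs]
print(xor_outputs)
```

Note that the hidden activations are only approximately 0 or 1, so the output unit sees values like 19.999 instead of 20; the margins in the chosen weights are large enough that this does not change the answer.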

Subsection 9.2 Perceptrons

In the previous section, we discussed simple artificial neurons, but we didn't have a way to train the neurons using data. We could choose specific \(\Theta\) values to make the neuron behave like a logic gate, but of course, this isn't actually machine learning. In this section, we will discuss the idea of perceptrons and see how to train a model using data.

Recall, a neural network has an input layer, an output layer, and possibly any number of hidden layers, as in the image below.

It is very important that the activation function is non-linear. If a linear function were used instead, we would just be using a fancy linear regression model. The depth and width of a neural network would not matter because the entire model could be compressed to a single linear model.
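To see the collapse concretely, here is a small sketch with made-up weights for a network with two inputs, two hidden units, and one output, all using a linear (identity) activation. Composing the two layers gives the same result as a single layer with weights \(W = W_2 W_1\) and bias \(b = W_2 b_1 + b_2\text{.}\) All names and numbers here are hypothetical.

```python
def linear_layer(weights, bias, x):
    """Apply y = W x + b with a linear (identity) activation."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

# Hypothetical weights for a 2 -> 2 -> 1 network with no non-linearity.
W1, b1 = [[1.0, 2.0], [3.0, -1.0]], [0.5, -0.5]
W2, b2 = [[2.0, -1.0]], [1.0]

def two_layer(x):
    return linear_layer(W2, b2, linear_layer(W1, b1, x))

# The same map written as a single layer: W = W2 W1, b = W2 b1 + b2.
W = [[sum(W2[0][k] * W1[k][j] for k in range(2)) for j in range(2)]]
b = [sum(W2[0][k] * b1[k] for k in range(2)) + b2[0]]

def one_layer(x):
    return linear_layer(W, b, x)

print(two_layer([1.0, 1.0]), one_layer([1.0, 1.0]))
```

No matter how many linear layers are stacked, the same algebra folds them into one, which is why depth only helps once a non-linear activation sits between the layers.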

The activation function we used was the sigmoid function applied to a weighted sum of the input features, and the goal was to learn the weights, or choose the weights appropriately.

For a perceptron, also called a threshold logic unit (TLU), we will use a different activation. We will use a step function. Two options for a step function are the Heaviside function and the sign function.

We will primarily use the Heaviside function. Note that the output is either a 0 or a 1. This will correspond to binary classification. Our perceptron model assumes just two layers, an input and an output layer, and the two layers are fully connected. That is, every unit in the input layer is connected to every unit in the output layer. In the image below, the left hand TLU is fully connected, but the right hand TLU is not fully connected.
We will eventually want to allow the more complicated case of layers not being fully connected, but for a basic perceptron model we will require this fully connected condition.

The changes we are making to move from an artificial neuron/logic gate model to a perceptron model are: no hidden layers, a change of activation function from the sigmoid function to the Heaviside step function, and a requirement that the layers be fully connected.

Example 9.7.

We will examine how to train a simple perceptron with two input features and two TLU's. Each TLU outputs either 0 or 1.

Suppose that the input features represent weight and furriness and we are trying to classify each instance as either cat or dog. Each TLU should predict one of these classes. Thus, the first TLU, \(a_1\text{,}\) will predict if the instance should be labeled cat (1) or not cat (0), and the second TLU, \(a_2\text{,}\) will do the same for dog. Our output from the perceptron will be a vector of length two. If the output is [0,1] then we would predict that the instance is a dog. However, the output could be [1,1]. What do we predict in this case? In this case, the perceptron is predicting both cat and dog, which does not make sense. Note that we do not get a probability score, we only get 0 or 1, so we can't determine if the perceptron has a higher degree of confidence in cat or dog.

How do we use data to train a perceptron? The idea from neuroscience is "cells that fire together, wire together". The machine learning version is that the connection weight \(\theta\) is larger for components that have a strong dependence on each other. We will train the perceptron by feeding in one piece of data at a time, evaluating how well it predicts a classification, and updating the \(\theta\)s to improve the prediction. This requires labeled data, so this is supervised learning. More specifically,

  1. Send in data \(x = \left[ \begin{matrix} x_1 \\ x_2 \end{matrix} \right]\)
  2. Perceptron makes prediction \(\hat{y} = \left[ \begin{matrix} a_1 \\ a_2 \end{matrix} \right]\)
  3. Compare the prediction to the correct classification \({y} = \left[ \begin{matrix} y_1 \\ y_2 \end{matrix} \right]\)
  4. Update each \(\theta_{ij} \) individually based on how closely the predictions matched the classifications. No change is needed if prediction is correct.

The update formula at the \(s+1\)th step for input unit \(i\) and output unit \(j\text{,}\) is given by

\begin{equation*} \theta_{ij,s+1}=\text{old} \Theta + \text{ update term}\\ = \theta_{ij,s}+ \eta(y_j-a_j)x_i. \end{equation*}

Note if the prediction is correct, \(y_j=a_j\) so the update term is 0 and \(\theta_{ij,s+1} = \theta_{ij,s}\text{.}\) If the prediction is incorrect, then the update term depends on the learning rate, \(\eta\text{.}\)

Let's see how to train our perceptron with one data point, where \(x = \left[ \begin{matrix} x_0 \\ x_1 \\ x_2 \end{matrix} \right] = \left[ \begin{matrix} 1 \\ 0.5 \\ 1 \end{matrix} \right]\text{.}\) Remember that \(x_0\) is the bias term which we always set to one. We will assume this data point corresponds to a cat so that \({y} = \left[ \begin{matrix} 1 \\ 0 \end{matrix} \right]\text{.}\) We need to initialize the \(\theta_{ij,0}\) to some value, so we will set them all equal to 1. There are two TLUs in our model, so we examine each one individually.

\begin{equation*} a_1=h(\theta_{01,0}+\theta_{11,0}x_1+\theta_{21,0}x_2)\\ =h(1+1*0.5+1*1)=h(2.5)=1 \end{equation*}

Remember that \(h\) is the Heaviside step function, so it outputs 1 as long as its input is greater than 0. Thus, \(a_1=y_1\) and our model predicts the correct answer.

\begin{equation*} a_2=h(\theta_{02,0}+\theta_{12,0}x_1+\theta_{22,0}x_2)\\ =h(1+1*0.5+1*1)=h(2.5)=1 \end{equation*}

In this case, \(a_2 \neq y_2\) and our model predicts the incorrect answer.

Let's apply our update function corresponding to step 1. We need to specify a learning rate. Generally we want this to be small, so we will pick \(\eta=0.1\text{.}\)

\begin{equation*} \theta_{ij,s+1} = \theta_{ij,s}+ \eta(y_j-a_j)x_i. \\ = \theta_{ij,s}+ 0.1(y_j-a_j)x_i. \end{equation*}

Since the prediction \(a_1\) is correct, \((y_1-a_1)=1-1=0\text{,}\) thus \(\theta_{01}=1, \theta_{11}=1, \theta_{21}=1\) remain unchanged. Let's consider \(\theta_{02,1}\text{.}\)

\begin{equation*} \theta_{02,1}=\theta_{02,0}+0.1(y_2-a_2)*x_0 =1+0.1(0-1)*1 =1-0.1=0.9 \end{equation*}

Here \(x_0=1\) is the bias term.

Question 9.8.
What should \(\theta_{12,1}\) and \(\theta_{22,1}\) be? Answer
\begin{equation*} \theta_{12,1}=\theta_{12,0}+0.1(y_2-a_2)*x_1 =1+0.1(-1)(0.5)=0.95 \end{equation*}
\begin{equation*} \theta_{22,1}=\theta_{22,0}+0.1(y_2-a_2)*x_2 = 1+0.1(-1)(1)=0.9 \end{equation*}
We can now visualize the revised perceptron and are ready to send in the next point.
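The single update worked out above can be reproduced with a few lines of code. This is a minimal sketch of the update rule \(\theta_{ij,s+1} = \theta_{ij,s}+ \eta(y_j-a_j)x_i\text{,}\) not a full training loop; the function name `perceptron_step` and the nested-list layout for the weights are choices made only for this example.

```python
def heaviside(z):
    """Step activation: 1 when the input is positive, otherwise 0."""
    return 1 if z > 0 else 0

def perceptron_step(theta, x, y, eta):
    """One update. theta[i][j] is the weight from input i (bias first) to TLU j."""
    n_in, n_out = len(theta), len(theta[0])
    # Prediction of each TLU for this data point.
    a = [heaviside(sum(theta[i][j] * x[i] for i in range(n_in)))
         for j in range(n_out)]
    # Update rule: theta_ij <- theta_ij + eta * (y_j - a_j) * x_i
    new_theta = [[theta[i][j] + eta * (y[j] - a[j]) * x[i] for j in range(n_out)]
                 for i in range(n_in)]
    return a, new_theta

theta0 = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]   # all weights initialized to 1
x = [1.0, 0.5, 1.0]                             # x_0 = 1 is the bias term
y = [1, 0]                                      # this data point is a cat

a, theta1 = perceptron_step(theta0, x, y, eta=0.1)
print(a)
print(theta1)
```

The first column of `theta1` (the cat TLU) is unchanged because that prediction was correct, while the second column (the dog TLU) shrinks in proportion to each input value.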

This update process is highly dependent on the size of \(x\) and \(\eta\text{.}\)

Question 9.9.
What should we do first? Answer
We should apply feature scaling first! It's very important here also!
We will continue this update process for each data point. One pass through every data point is referred to as one epoch. After sending each data point through once, we will send the data through again. We will continue running through epochs until hopefully we obtain convergence. Really there should be lots of data points, maybe millions, so we won't do this by hand.

We can examine perceptrons via sklearn, using the Perceptron class from sklearn.linear_model. We will still apply regularization in this model. Previously, the regularization penalty was in the cost function. We don't have that in this case, but we can still penalize large theta values during training.

Switch to Jupyter notebook!

Coming next time: Multilayer perceptrons! We want to create more complex models with multiple perceptrons. But the training is harder and we need backpropagation to update theta values. For a single perceptron, it was easy to solve for the thetas. With more layers, it will take a lot more work to trace back through all the hidden layers to update the theta values.

Subsection 9.3 Multi-Layer Perceptrons

How do we build neural networks from perceptrons? We need multilayer perceptrons, also known as deep neural networks. In this section we will provide an overview of training and architecture and an introduction to Keras.

This setup for a multilayer perceptron model is referred to as a deep neural network if there are lots of hidden layers. It may also be referred to as a feedforward network because the signals only flow in the forward direction. Note that images for neural networks don't usually include the bias term, but it is still included in all layers.

How do we train a Neural Net with lots of hidden layers? Let's recall simple regression.

We could view this as a simple neural network. How did we do training here? We used gradient descent to minimize the cost function J. We updated the \(\Theta\) values with a learning rate and the slope/partial derivative of J with respect to the appropriate theta. That is,

new theta = old theta - (learning rate)(slope of J)

How does \(\theta_1\) affect the cost J? It's just a single partial derivative, \(\frac{\partial{J}}{\partial{\theta_1}} .\)

What does it mean in a deep neural network? We need a new cost function J, and we need to calculate many partial derivatives, \(\frac{\partial{J}}{\partial{\theta_{ij} ^n}} .\)

Every line in the network is a partial derivative that we need to take. This is a lot of derivatives! We need the multivariable chain rule! This is a lot of work!!! But there is a math trick that will help. This is the idea of backpropagation.

We will talk more about the details of backpropagation in the next section, but as an overview, there are two main steps to backpropagation, repeated for each batch.

  1. Forward pass. Send a small batch of the data into the network. Keep track of each neuron's output. Calculate the cost using a measure of the network error, such as MSE, cross entropy loss, etc.
  2. Backward pass. Compute error due to each output. Calculate gradients of error going backward. Update each theta (similar to gradient descent technique).
  3. Repeat for each batch.

Each epoch corresponds to going through the entire data set, batch by batch, one time. Normally, we will go through multiple epochs.

In the forward direction, we have to track all possible paths, which is a lot. However, in the backward direction we can calculate each piece and store it, so it's easier at the next step. We start closer to the cost function.

Things to keep in mind:

  1. We are taking derivatives. If we are using the Heaviside step function as in the perceptrons, there is a point where the derivative is undefined, and the derivative is zero at all other points. This is a problem. We need activation functions that have derivatives and have nonzero slope.
  2. We need to be smart about initializing thetas. We shouldn't initialize them all to zero, because every neuron in a layer would then compute the same output and receive the same update, which makes it hard to learn. Normally we will initialize randomly between 0 and 1. We want to avoid certain symmetries. Dividing by zero issues can lead to convergence problems. There can also be issues based on the choice of initialization.

Some popular activation functions are shown below.

The modern functions are useful because their derivatives are easier to compute. ReLU is super easy to differentiate, but it has a corner where it is not differentiable. Other variants have been added because we don't always want zero slope on the left side. However, ReLU is currently the most common activation function.
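For reference, here is a small sketch of a few common activation functions and their derivatives. The leaky ReLU slope `alpha=0.01` and the choice to report a slope of 0 at the ReLU corner are conventions assumed for this example, not values taken from these notes.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)               # largest at z = 0, where it equals 0.25

def tanh_prime(z):
    return 1.0 - math.tanh(z) ** 2     # largest at z = 0, where it equals 1

def relu(z):
    return max(0.0, z)

def relu_prime(z):
    return 1.0 if z > 0 else 0.0       # corner at z = 0: slope set to 0 by convention

def leaky_relu(z, alpha=0.01):
    return z if z > 0 else alpha * z   # small nonzero slope on the left side

def leaky_relu_prime(z, alpha=0.01):
    return 1.0 if z > 0 else alpha

for z in (-5.0, 0.0, 5.0):
    print(z, sigmoid_prime(z), tanh_prime(z), relu_prime(z), leaky_relu_prime(z))
```

Comparing the printed slopes at \(z=\pm 5\) shows the saturation of the sigmoid and hyperbolic tangent, while ReLU keeps a slope of 1 everywhere on the right.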

What is the output from an activation function? For the sigmoid function it will be a number between 0 and 1, so we will have to interpret it for classification. It is important that the activation functions are non-linear. (See other class notes for an example of how linear functions could be reduced to a single linear function.)

We can use deep neural networks for either regression or classification. In both cases we need to decide

  • How many input neurons? (This should correspond to the number of inputs.)
  • How many hidden layers?
  • How many neurons per hidden layer?
  • How many output neurons? (If output is a single value, then only need one neuron here. But could have as many output neurons as things we want to predict. For example, an (x,y) location would have two outputs.)
  • Output activation.
  • Loss function.

For regression, there is usually no activation on the output. Typical loss functions are MSE or MAE. MAE is a little bit faster computationally for some cases.

There are other loss functions, and you can make up your own, but it is standard to use MSE.

For classification, most of the steps are the same. The main differences are in the cost function and the activation function on the output layer. To get a classification we want the output to be 0 or 1. We can apply the logistic/sigmoid function at that step to classify.

For the multiclass case, we will have one output per class. If we don't care about a probability interpretation, then the logistic function is ok. If we do care about interpreting answers as probabilities, then we should use the softmax function. Once we have all the outputs, we can apply softmax or argmax to see which class got the highest score, to predict a class.

For the loss function, we want to use cross entropy loss. Remember we saw this before in logistic regression and if \(k=2\) then cross entropy loss reduces to the simpler logistic regression cost function (log-loss).
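A short sketch of softmax and cross entropy loss for a single instance follows. The scores are made up, the function names are hypothetical, and the last lines check that with two classes the cross entropy agrees with the log-loss from logistic regression.

```python
import math

def softmax(scores):
    """Turn raw output scores into probabilities that sum to 1."""
    m = max(scores)                           # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, true_class):
    """Cross entropy loss for one instance with a single correct class."""
    return -math.log(probs[true_class])

def log_loss(p, y):
    """Logistic regression cost for one instance; p is the predicted P(y = 1)."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

scores = [2.0, 1.0, 0.1]                      # hypothetical outputs for three classes
probs = softmax(scores)
predicted = probs.index(max(probs))           # argmax picks the predicted class
print(probs, predicted, cross_entropy(probs, true_class=0))

# With k = 2 classes, cross entropy matches the log-loss.
p = 0.8
print(cross_entropy([1 - p, p], 1), log_loss(p, 1))
```

Softmax keeps the ordering of the scores, so argmax of the probabilities and argmax of the raw scores predict the same class; softmax is only needed when the values should read as probabilities.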

To implement Neural Networks, there are great packages!

Keras is great for building neural networks and the easiest place to start for experimentation. Keras builds a model one layer at a time. The easiest way is the sequential model, which always assumes that the outputs from one layer are the inputs for the next layer. (More complex models can be built with the functional API.)

TensorFlow and PyTorch are more advanced frameworks. There are a lot more options to control here. (Keras is built on top of TensorFlow.)

Switch to Jupyter notebook for Keras example of simple neural net.

Subsection 9.4 Back Propagation

In this section we will introduce the concept of back propagation, its benefits and pitfalls, and a brief introduction to optimizers. The idea of back propagation is doing multivariable calculus on a graph. It makes training much, much faster, on the scale of reducing the time from 200,000 years to one week. It's an important reason why neural networks have become so powerful. A full discussion of optimizers would involve lots of linear algebra and multivariable calculus. We aren't going to go through all of those details, but encourage you to research those details! (See references in your text.)

Remember that the goal in a neural network is to find \(\Theta\) values that minimize a loss function \(J\text{.}\)

As part of training, we want to update all of the \(\Theta\) values based on partial derivatives, \(\frac{\partial J}{\partial \theta_{ij}^L}\text{.}\)

We will need partial derivatives from Calculus, so we will include a quick reminder of what partial derivatives are. Suppose we have a function of two variables, \(f(x,y)=2x+3y+xy\text{.}\) Since there are two variables, we can take the derivative with respect to either \(x\) or \(y\text{.}\) The idea is to treat the other variable as a constant.

Example 9.10.
Compute \(\frac{\partial f}{\partial x}\) for \(f(x,y)=2x+3y+xy\text{.}\) We treat \(y\) as a constant in this case. So \(\frac{\partial f}{\partial x}=2+0+y.\)

The next idea we need is the chain rule for partial derivatives. We can think of the chain rule as a mobster who needs to shake down every term to get the needed information. In this case, we consider a function \(f(g,h)\) where \(g,h\) are both functions of two variables \(x\) and \(y\text{.}\) That is, \(g(x,y)\) and \(h(x,y)\text{.}\) The chain rule for partial derivatives is

\begin{equation*} \frac{\partial f}{\partial x}= \frac{\partial f}{\partial g}\frac{\partial g}{\partial x} + \frac{\partial f}{\partial h}\frac{\partial h}{\partial x}\text{.} \end{equation*}

Both \(g(x,y)\) and \(h(x,y)\) have to give information about \(x\text{.}\)

Example 9.11.
Suppose \(f(x,y) = s(x,y) p(x)\) where \(s(x,y)=x+y\) and \(p(x)=2x-3\text{.}\) There are two ways we can calculate derivatives: we can use the chain rule or we can use algebraic simplification. For the chain rule
\begin{equation*} \frac{\partial f}{\partial x}= \frac{\partial f}{\partial s}\frac{\partial s}{\partial x} + \frac{\partial f}{\partial p}\frac{\partial p}{\partial x} =p*1+s*2=(2x-3)+2(x+y)=4x+2y-3. \end{equation*}
For algebraic simplification we multiply out s and p.
\begin{equation*} f=(x+y)(2x-3)=2x^2+2xy-3x-3y. \end{equation*}
We can then compute the partial derivative without the chain rule and get \(\frac{\partial f}{\partial x}=4x+2y-3 \text{.}\)

We can visualize the functions and derivatives as a graph to connect this idea to neural networks. The chain rule is easier for a computer, since algebraic simplification requires symbolic algebra. The functions correspond to nodes.

We can evaluate the composite function by plugging in the values for \(x\) and \(y\) and adding all inputs to a node.
We can now envision the connections between nodes as partial derivatives.
The chain rule corresponds to multiplying along paths and summing the paths that enter the final node.
So we could calculate derivatives this way, but we won't because there is a combinatorial explosion when 'summing up all possible paths'. For our simple example, it isn't so bad.
However, for a larger network, this becomes unmanageable!
We need to be smarter. We need to rethink how to do the flow. Feedforward networks are built to quickly propagate information through the net. The structure of the graph will help. There are two possible modes for differentiation:

  • forward-mode differentiation
  • backward-mode differentiation (Also called backprop or reverse mode.)

Both modes can help us find \(\frac{\partial f}{\partial x}\) at a point \((x,y)=(1,2)\text{.}\) Step 1: forward pass of the data. Keep track of the simple derivatives and store them.

In forward mode: find \(\frac{\partial}{\partial x}\) at every node by summing the incoming flow.
Plugging in the values for \(x=1\) and \(y=2\) gives
Then we can find
Question 9.12.
What is \(\frac{\partial f}{\partial y}\) for the same graph?
Using the graph we get
We can check that this is right: \(f=(x+y)(2x-3)=2x^2+2xy-3x-3y\) so \(\frac{\partial f}{\partial y}=2x-3=2(1)-3=-1.\)
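One way to see forward mode in code is with "dual numbers" that carry a value together with its derivative with respect to one chosen input. This is a minimal sketch for the running example \(f=(x+y)(2x-3)\text{;}\) the `Dual` class is hypothetical and supports only the operations this example needs.

```python
class Dual:
    """A value together with its derivative with respect to one chosen input."""

    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    @staticmethod
    def lift(other):
        # Plain numbers are constants, so their derivative is 0.
        return other if isinstance(other, Dual) else Dual(other, 0.0)

    def __add__(self, other):
        other = Dual.lift(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)
    __radd__ = __add__

    def __sub__(self, other):
        other = Dual.lift(other)
        return Dual(self.value - other.value, self.deriv - other.deriv)

    def __mul__(self, other):
        other = Dual.lift(other)
        # Product rule applied node by node.
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)
    __rmul__ = __mul__

def f(x, y):
    s = x + y
    p = 2 * x - 3
    return s * p

# One forward pass per input: seed the input of interest with derivative 1.
df_dx = f(Dual(1.0, 1.0), Dual(2.0, 0.0)).deriv
df_dy = f(Dual(1.0, 0.0), Dual(2.0, 1.0)).deriv
print(df_dx, df_dy)
```

Notice that two separate passes were needed, one per input. With millions of weights, that is one pass per weight, which is the cost that backward mode avoids.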

This is better than redoing the derivative each time, but it's still a lot of work in a large network.

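So we want to talk about the idea for backward mode! We'll start at the output and move to the input. Instead of starting with \(\frac{\partial}{\partial x}\text{,}\) we want to start with \(\frac{\partial f}{\partial}\text{.}\)
Last step: start from \(\frac{\partial f}{\partial f}=1\) and work backward. With \(f=sp\text{,}\) \(s=x+y\text{,}\) and \(p=2x-3\text{,}\)
\begin{equation*} \frac{\partial f}{\partial p}=\frac{\partial f}{\partial f}\frac{\partial f}{\partial p}=s, \qquad \frac{\partial f}{\partial s}=\frac{\partial f}{\partial f}\frac{\partial f}{\partial s}=p, \qquad \frac{\partial f}{\partial x}=\frac{\partial f}{\partial s}\frac{\partial s}{\partial x}+\frac{\partial f}{\partial p}\frac{\partial p}{\partial x}\text{.} \end{equation*}
Then \(\frac{\partial f}{\partial y}\) is easy because we already calculated everything. In one backward calculation we found the partial derivatives for every single edge in the graph! Isn't one backward pass better than a huge number of forward passes? There's a ton of linear algebra under the hood.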
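The backward pass for the running example \(f=(x+y)(2x-3)\) can be written out by hand in a few lines. This is a sketch for this one graph only, not a general backprop implementation; the function name and the dictionary of results are choices made for the example.

```python
def f_with_gradients(x, y):
    # Forward pass: evaluate each node and remember the values.
    s = x + y
    p = 2 * x - 3
    f = s * p

    # Backward pass: start from df/df = 1 and move toward the inputs.
    df_df = 1.0
    df_ds = df_df * p                   # f = s * p, so df/ds = p
    df_dp = df_df * s                   # and df/dp = s
    df_dx = df_ds * 1.0 + df_dp * 2.0   # sum over both paths from x
    df_dy = df_ds * 1.0                 # y only reaches f through s
    return f, {"s": df_ds, "p": df_dp, "x": df_dx, "y": df_dy}

value, grads = f_with_gradients(1.0, 2.0)
print(value, grads)
```

A single forward pass and a single backward pass give the derivative with respect to every node at once, and each stored intermediate (`df_ds`, `df_dp`) is reused rather than recomputed.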
Problems with backpropagation: this idea isn't new, and we can do the derivatives pretty easily, but three problems come up in practice.

  1. Vanishing gradients.
  2. Exploding gradients.
  3. Unstable gradients.

In the case of vanishing gradients, one of the activations doesn't send a very strong signal. That is, there are small partial derivatives at some layer. Remember that the idea of gradient descent is that we are hiking down a hill into a valley, taking steps proportional to the steepness of the hill. If we hit a flat point, then there won't be any update to that theta value. The algorithm might not converge, or it will take a long time.

For the case of exploding gradients, this can easily diverge since we are likely to take way too big of a step. Or it could take longer to converge.

For the case of unstable gradients, different layers are learning at different rates. It is difficult to pick a learning rate that works for all of these layers.
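A rough numerical picture of vanishing gradients: the chain rule multiplies one activation slope per layer, and the sigmoid slope is never larger than 0.25. The sketch below multiplies only those slopes and ignores the weight factors that also appear in the chain rule, so it is an illustration under that simplifying assumption rather than a real gradient computation.

```python
import math

def sigmoid_prime(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

def chained_slope(z_values):
    """Product of sigmoid slopes along one path, one factor per layer."""
    product = 1.0
    for z in z_values:
        product *= sigmoid_prime(z)
    return product

best_case_10_layers = chained_slope([0.0] * 10)   # every layer at its steepest point
saturated_10_layers = chained_slope([5.0] * 10)   # every layer in the flat region
print(best_case_10_layers, saturated_10_layers)
```

Even in the best case the signal reaching the early layers is roughly a million times weaker after ten sigmoid layers. Factors larger than 1 (from large weights) compound the same way in the other direction, which is the exploding case.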

We need our signal to flow forward to give us a prediction, but backward to get an update. Need signal to flow in both directions. Makes it hard to choose a good initialization and the activation function has a huge impact on neural net. see book for references. Here's the idea:

We need to initialize the thetas so that we control the volume of the signal from one layer to the next. Think of a radio signal: an amp set to zero and an amp set to max are both bad. Each layer needs to be initialized in a way that preserves the variance of the signal. Combining good initialization with good activation functions is when neural networks really took off in effectiveness. Our programs do take care of much of this, but there are still choices to make.

Activation functions: the sigmoid could give very little gradient to propagate, since its slope is near zero in many places. The hyperbolic tangent is similar. These traditional functions saturate; that is, there are too many places where the slope is near zero. The modern functions have fewer places with zero slope, hence the change to these units.

What type of math is involved in backprop? Mathematical statistics and probability. There are lots of optimizers; see Table 11-2 in the text. Joanna's defaults are RMSProp and Adam. Both use the idea of momentum. Plain gradient descent takes a step and stops. The idea of momentum is to keep going in the same direction: save past gradient information, and if the gradient continues in the same direction, keep going at a higher rate. RMSProp just keeps track of recent gradients, so it has a short window of momentum influence. Adam stands for adaptive moment estimation. If you have issues with convergence or instability, then try others. Each of these could be a numerical analysis class on its own. This is an open field of study!
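The momentum idea can be sketched in a few lines. This is classical momentum on a one-dimensional cost function, not RMSProp or Adam themselves, and the cost function, learning rate, and momentum value `beta` are all assumptions chosen for illustration.

```python
def gradient_descent(grad, theta, learning_rate, n_steps):
    """Plain gradient descent: take a step, then start fresh."""
    for _ in range(n_steps):
        theta = theta - learning_rate * grad(theta)
    return theta

def momentum_descent(grad, theta, learning_rate, beta, n_steps):
    """Gradient descent with momentum: the velocity remembers past
    gradients, so repeated steps in the same direction speed up."""
    velocity = 0.0
    for _ in range(n_steps):
        velocity = beta * velocity - learning_rate * grad(theta)
        theta = theta + velocity
    return theta

def grad(theta):
    """Gradient of the gently sloped cost J(theta) = 0.01 * theta**2."""
    return 0.02 * theta

theta_plain = gradient_descent(grad, theta=10.0, learning_rate=0.1, n_steps=50)
theta_momentum = momentum_descent(grad, theta=10.0, learning_rate=0.1,
                                  beta=0.9, n_steps=50)
print(theta_plain, theta_momentum)
```

On this gentle slope the plain version barely moves in 50 steps, while the momentum version gets much closer to the minimum at zero because the gradient keeps pointing the same way.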

Subsection 9.5 Keras and Tensorflow

Topics: a review of the sequential model, the functional model, callbacks (keep track of runs and keep the best one), and the visualization tool called TensorBoard.

Regression Neural Net and Keras Tools

The idea of the sequential model is that one layer flows to the next and all inputs are used. With 8 features, one hidden layer of 30 neurons, and a single output, counting the bias terms gives (8+1)*30 + 31*1 = 301 parameters.

    model = keras.models.Sequential([ each layer ])
    model.compile()

SGD is ok as the optimizer because the model is simple. What should we see in the loss? It should decrease, but it could increase a little now and then, since stochastic gradient descent might move a little randomly. Evaluate the trained model with model.evaluate(testing data).

Functional API

Think about layers as functions, so we specify the input to each layer and don't just assume it takes all entries from the last layer. We can think about deep and wide: some features could go deep (through many hidden layers) or wide (not through all layers).

    from tensorflow import keras

    keras.backend.clear_session()
    # the input shape is the shape of one sample, without the batch dimension
    input_ = keras.layers.Input(shape=X_train_s.shape[1:])
    hidden1 = keras.layers.Dense(30, activ)(input_)
    hidden2 = keras.layers.Dense(30, activ)(hidden1)
    # the wide path (raw input) joins the deep path (hidden2)
    concat = keras.layers.concatenate([input_, hidden2])
    output = keras.layers.Dense(1)(concat)
    model = keras.models.Model(inputs=[input_], outputs=[output])

Here activ stands for whichever activation function you choose, and the second hidden layer is assumed to have 30 neurons like the first.

When wide and when deep? This is based on the data, but mostly on experimentation. It could help if we had really different types of data that we wanted to use for a single answer, maybe voice data and image data.

Really helpful tools

Saving and restoring a model: keep good notes about what you are trying while experimenting. It can take a long time to train, and you don't want to redo that. Save the model to a file such as "name.h5" and restore it with keras.models.load_model("name.h5").

You can also save the model in the .fit command by using a callback. Build the model, compile it, and then create the callback

    name_cb = keras.callbacks.ModelCheckpoint("best", save_best_only=True)

and pass callbacks=[name_cb] to .fit.

There's a lot more to play with here! Explore the documentation!!

TensorBoard: pretty cool!
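The parameter count mentioned above for the sequential model can be checked with a short helper. This is a sketch that assumes fully connected layers with one bias term per neuron; the function name `count_dense_parameters` is made up for this example and is not part of Keras.

```python
def count_dense_parameters(layer_sizes):
    """Count the weights and biases in a fully connected network.

    layer_sizes lists the number of units in each layer, starting
    with the number of input features.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        # each of the n_out neurons has n_in weights plus one bias
        total += (n_in + 1) * n_out
    return total

# 8 input features, 30 hidden neurons, 1 output neuron
n_params = count_dense_parameters([8, 30, 1])
print(n_params)
```

In Keras, model.summary() reports the same kind of count for each layer, which is a useful check that the model was built as intended.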

Subsection 9.6 Hyperparameter Tuning

Today we will learn about a few tools for fine tuning our neural network models. If you are training neural networks professionally then you should absolutely read chapter 11 along with all of the papers referenced in chapter 11.

Hyperparameter Tuning

  • GridSearchCV or RandomizedSearchCV are useful if you wrap your model in sklearn

What are some important hyperparameters?

  • Number of Hidden Layers
  • Number of Neurons Per Layer
  • Learning Rate
  • Optimizer Used
  • Batch Size
  • Activation Functions
  • Number of Epochs


  • If you are training a very complicated neural network you almost always want to add in some regularization!

Types of regularization

  • Simple L1 and L2 - same as before applied during training
  • Dropout - drop neurons during training with some probability - all neurons used during prediction
  • Monte Carlo Dropout - boosts performance of a trained dropout model - applied after training during prediction
  • Max-Norm - a special way of rescaling after each training step
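To illustrate the dropout item in the list above, here is a minimal NumPy sketch. It assumes the common "inverted dropout" form, in which the surviving activations are rescaled during training so that nothing needs to change at prediction time. The function name and the sample data are illustrative only.

```python
import numpy as np

def dropout(activations, rate, training, rng):
    """Drop each neuron with probability `rate` during training.

    Survivors are scaled by 1 / (1 - rate) so the expected signal is
    unchanged. During prediction all neurons are used as they are.
    """
    if not training:
        return activations
    keep_mask = rng.random(activations.shape) >= rate
    return activations * keep_mask / (1.0 - rate)

rng = np.random.default_rng(0)
a = np.ones((1000, 30))  # pretend activations from a hidden layer

train_out = dropout(a, rate=0.5, training=True, rng=rng)
test_out = dropout(a, rate=0.5, training=False, rng=rng)

print(train_out.mean(), test_out.mean())
```

About half of the training outputs are zero and the rest are doubled, so the average signal stays close to 1, while the prediction outputs are untouched.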

See Jupyter Notebook.

Subsection 9.7 Convolutional Neural Network

The idea again comes from neuroscience and examining the visual cortex.

  1. Each individual neuron has a limited field of vision.
  2. Receptive fields overlap.
  3. Some look for larger/complex patterns and others will look for small/simple details.

For example, a house may be identified by a triangle and vertical and horizontal line segments.

Can we do this with an artificial neural network? In fact, we can! Yann LeCun and his collaborators developed this idea, called convolutional neural networks (or CNNs), in the late 1980s and 1990s; the well-known LeNet-5 architecture appeared in 1998. But it needed more powerful computers and more labeled data to take off and become useful.

The convolutional layer represents the visual cortex. We connect this to a deep fully connected neural network to do the decision making.

But wait! Haven't we already done image classification with deep NN? MNIST has very small images that are extremely simple. So a plain NN worked ok. But it won't work as well for larger, more complex images.

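If we wanted to send a 100 by 100 image through a traditional neural network, we must first flatten it to an array of 10,000 entries. This can be connected to any number of neurons in the first layer, say 1000. That is 10,000 times 1000 weights plus 1000 bias terms, or about 10 million parameters in the first layer alone. This is a huge parameter set.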

It's worse than that though! We are losing the spatial connection between pixels. It's really hard to identify images from isolated pixels. We really need to know how the pixels are connected.

Convolution is a mathematical operation. You may have seen it with the Laplace transform, where the convolution of two functions is the integral of a pointwise product: \((f*g)(t)=\int_0^t f(\tau)g(t-\tau)\,d\tau\text{.}\)

Convolution layers are related but different.

What is a convolutional layer? Each convolutional layer maps a subset of the image to a single neuron in the next layer. Each neuron is connected to a small window of pixels. The windows overlap and cover the entire image. Convolutional layers may or may not reduce the size from layer to layer, depending on how the windows overlap. The input layer is not flattened. It is a tensor, which contains a width, a height, and a certain number of channels. The number of channels depends on the color: 1 channel for grayscale and 3 channels for RGB, all layered on top of each other.

This use of tensors leads to the name "TensorFlow". Both Keras and TensorFlow expect tensors as input.

More specifically we examine a convolutional layer in

In this example, we have a 3 by 3 window, also called a kernel. We multiply each pixel in the kernel window by its weight, sum up the results, and add the bias.

One issue that must be decided is what to do at the boundaries of the image. We could fill with zeros (padding="same") or we could ignore the boundary points (padding="valid").

The next question is how big of a step should we take.

This is called the stride. The stride tells us how much the windows overlap. It can also reduce the dimension. A stride of (1,1) does not reduce the dimension. (Take 1 step left/right and 1 step up/down.) But a stride of (2,2) would reduce the dimension. With a larger stride, some pixels belong to only one window, and the amount of reduction also depends on the window size.
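The combined effect of padding and stride on the size of the output can be sketched with a small helper. The formulas below are the usual ones for "same" and "valid" padding along one dimension; the function name `conv_output_size` is made up for this example.

```python
import math

def conv_output_size(input_size, kernel_size, stride, padding):
    """Number of output neurons along one dimension of a convolutional layer."""
    if padding == "same":
        # zeros are added at the boundary so that only the stride shrinks the output
        return math.ceil(input_size / stride)
    if padding == "valid":
        # no padding: every window must fit entirely inside the image
        return (input_size - kernel_size) // stride + 1
    raise ValueError("padding must be 'same' or 'valid'")

# A 100 pixel wide image with a 3 pixel wide kernel
for stride in (1, 2):
    for padding in ("same", "valid"):
        print(stride, padding, conv_output_size(100, 3, stride, padding))
```

With a stride of 1 and padding="same" the width stays at 100, while a stride of 2 roughly halves it.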

The next idea is filters.

The goal is to add layers to the map that allow for the theta values in each layer to specialize and recognize different patterns.

Each convolutional layer could have any number of filters, maybe as many as 64. Each filter produces its own sheet of outputs, and each sheet has different theta values. One sheet could identify line segments, another areas of red color, etc. We don't determine what the filters do; the CNN will learn this as part of its training. The output sheets are called feature maps. Within a single convolutional layer you might have 64 feature maps, each of which learns something different. Each neuron in the next layer sums over all of the feature maps in the layer before it. Each convolutional layer plays the same role as a hidden layer in the NN.
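Because every neuron in a feature map uses the same theta values, a convolutional layer needs far fewer parameters than a fully connected layer. The following arithmetic sketch compares the two for a 100 by 100 grayscale image; the 3 by 3 kernel and the 64 filters are example values, and the variable names are illustrative.

```python
# Fully connected: every one of the 1000 neurons sees all 10,000 pixels.
n_pixels = 100 * 100
n_dense_neurons = 1000
dense_params = (n_pixels + 1) * n_dense_neurons  # +1 for the bias term

# Convolutional: each filter has one small kernel per input channel
# plus one bias, and those values are reused at every window position.
kernel_height, kernel_width = 3, 3
n_channels = 1   # grayscale
n_filters = 64
conv_params = (kernel_height * kernel_width * n_channels + 1) * n_filters

print(dense_params, conv_params)
```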

The Math! If we want to calculate what is happening in neuron \(i,j\text{,}\) we must sum over the height and width of the kernel window and sum over all the feature maps of the previous layer.
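One common way to write this sum is sketched below. The notation is an assumption made for this sketch: \(z_{i,j,k}\) is the output of neuron \(i,j\) in feature map \(k\text{,}\) \(b_k\) is the bias for that feature map, \(x\) is the output of the previous layer with \(n'\) feature maps (or channels), \(\theta_{u,v,k',k}\) are the kernel weights, \(f_h\) and \(f_w\) are the height and width of the kernel, and \(s_h\) and \(s_w\) are the strides.

\begin{equation*} z_{i,j,k} = b_k + \sum_{u=0}^{f_h-1}\sum_{v=0}^{f_w-1}\sum_{k'=0}^{n'-1} x_{i',j',k'}\,\theta_{u,v,k',k}, \qquad i' = i\,s_h+u,\quad j' = j\,s_w+v. \end{equation*}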

This formula multiplies the value by a weight, similar to previous ideas of linear regression.
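The same calculation can be sketched directly in NumPy for a single filter. This is an unoptimized illustration that assumes no padding (the "valid" case) and an image stored as height by width by channels. It is not how Keras computes convolutions internally, and the function name is made up for this example.

```python
import numpy as np

def conv_single_filter(image, weights, bias, stride=1):
    """Compute one feature map from an (H, W, C) image and one (kh, kw, C) filter."""
    height, width, _ = image.shape
    kh, kw, _ = weights.shape
    out_h = (height - kh) // stride + 1
    out_w = (width - kw) // stride + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # the window of pixels (over all channels) that neuron i, j sees
            window = image[i * stride:i * stride + kh,
                           j * stride:j * stride + kw, :]
            # multiply each value by its weight, sum, and add the bias
            feature_map[i, j] = np.sum(window * weights) + bias
    return feature_map

# A 5 by 5 grayscale "image" and a 3 by 3 averaging kernel
image = np.arange(25, dtype=float).reshape(5, 5, 1)
weights = np.ones((3, 3, 1)) / 9.0
feature_map = conv_single_filter(image, weights, bias=0.0)
print(feature_map)
```

A layer with 64 filters would repeat this with 64 different sets of weights and stack the resulting feature maps.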

Convolutional neural nets work really well, but they require a huge amount of memory for large images! This is especially a problem for backpropagation, since we have to keep so many terms in memory. It is easy to run into memory errors. There are some fixes for this:

  • small batch size
  • increase stride to reduce dimension
  • 16 bit vs 32 bit floats (save fewer decimal places)
  • distribute data across multiple machines/processors
  • pooling layers

What are pooling layers? The idea is to pool data to reduce dimensionality.

We can subsample to reduce

  • memory
  • computation
  • overfitting

There are two main ways to apply pooling. We can use either max pooling or average pooling. In max pooling, take the maximum value and ignore the rest. That is, examine each window and find the max. For average pooling, take the average of the values in the window.
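Both kinds of pooling can be sketched in NumPy. This sketch assumes non-overlapping square windows (the stride equals the window size) on a single feature map; the function name `pool2d` is illustrative.

```python
import numpy as np

def pool2d(feature_map, pool_size, mode):
    """Max or average pooling over non-overlapping pool_size x pool_size windows."""
    height, width = feature_map.shape
    out_h, out_w = height // pool_size, width // pool_size
    # drop any leftover rows and columns that do not fill a whole window
    trimmed = feature_map[:out_h * pool_size, :out_w * pool_size]
    # split the rows and the columns into (window index, position in window)
    windows = trimmed.reshape(out_h, pool_size, out_w, pool_size)
    if mode == "max":
        return windows.max(axis=(1, 3))
    if mode == "average":
        return windows.mean(axis=(1, 3))
    raise ValueError("mode must be 'max' or 'average'")

feature_map = np.array([[ 1.0,  2.0,  5.0,  6.0],
                        [ 3.0,  4.0,  7.0,  8.0],
                        [ 9.0, 10.0, 13.0, 14.0],
                        [11.0, 12.0, 15.0, 16.0]])

max_pooled = pool2d(feature_map, pool_size=2, mode="max")
avg_pooled = pool2d(feature_map, pool_size=2, mode="average")
print(max_pooled)
print(avg_pooled)
```

Either way, a 4 by 4 feature map becomes a 2 by 2 feature map, which is the reduction in memory and computation described above.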

One consequence of pooling is that it adds invariance to small translations. This can be both good and bad. For example, if you want to classify a dog in the snow, you might want to ignore small bits of snow on the dog. Of course, if you want to identify that it is snowing, then you won't want to ignore that information.

We do still want to apply regularization in a CNN. We can use kernel regularizers as before, specifying either l1 or l2 norms. Dropout can also be applied, but it must be done differently because of the spatial data. We don't want to drop out parts of the image; instead we will probabilistically drop out entire feature maps.

An example of a full CNN is pictured below.

Note that the convolutional layers come first, and then there are the layers of a traditional deep neural network.

There are a number of decisions to make:

  • kernel size
  • # of filters
  • activation
  • padding
  • regularization

Sometimes we add normalization and regularization between convolutional and pooling layers.

Sometimes batch normalization is used to help avoid exploding gradients. We would just apply this after the first convolutional layer.

For a pooling layer you must specify the kernel size and the type of pooling (max or average).

You can have as many convolutional and pooling layers as you want. Then flatten and send through a traditional deep NN as before.

Time for an example in the Jupyter notebook. It follows a blog post, so check it out.