## Answers


Back-propagation is the essence of neural-net training. It is the practice of fine-tuning the weights of a neural net based on the error rate (i.e. loss) obtained in the previous epoch (i.e. iteration). Proper tuning of the weights ensures lower error rates, making the model more reliable by increasing its generalization.

So how does this process actually work? Let’s learn by example!

In order to keep this example as focused as possible, we’re only going to touch on related concepts (e.g. loss functions, optimization functions, etc.) without explaining them, as those topics deserve their own series.

First off, let’s set up the model components

Imagine that we have a deep neural network that we’d like to train. The purpose of training is to build a model that performs the XOR (exclusive OR) function with two inputs and three hidden units, such that the training set (truth table) looks like the following:

| X1 | X2 | Y |
|----|----|---|
| 0  | 0  | 0 |
| 0  | 1  | 1 |
| 1  | 0  | 1 |
| 1  | 1  | 0 |
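For reference, the truth table above can be written as a small dataset sketch in Python (the variable name is mine, not from the article):

```python
# XOR truth table as (X1, X2, Y) rows.
dataset = [
    (0, 0, 0),
    (0, 1, 1),
    (1, 0, 1),
    (1, 1, 0),
]

# Sanity check: Y is the exclusive OR of X1 and X2 in every row.
assert all(y == (x1 ^ x2) for x1, x2, y in dataset)
```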

Moreover, we need an activation function that determines the activation value at every node in the neural net. For simplicity, let’s choose the identity activation function:

f(a) = a

We also need a hypothesis function that determines the input to the activation function. This function is going to be the standard, ever-famous:

h(X) = W0.X0 + W1.X1 + W2.X2

or, more generally,

h(X) = Σ Wi.Xi over all pairs (Wi, Xi)
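As a sketch, both functions can be written in a few lines of Python (the names h and f mirror the text; everything else here is mine):

```python
def f(a):
    """Identity activation: f(a) = a."""
    return a

def h(W, X):
    """Hypothesis: the weighted sum of inputs, sum(Wi * Xi)."""
    return sum(w * x for w, x in zip(W, X))

# e.g. with all weights 1 and inputs {bias = 1, X1 = 0, X2 = 0}:
print(f(h([1, 1, 1], [1, 0, 0])))  # 1
```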

Let’s also choose the loss function to be the standard cost function of logistic regression, which looks a little complicated but is actually fairly simple.

Furthermore, we’re going to use the Batch Gradient Descent optimization function to determine in what direction we should adjust the weights to get a lower loss than the one we currently have. Finally, the learning rate will be 0.1 and all the weights will be initialized to 1.

Our Neural Network

Let’s finally draw a diagram of our long-awaited neural net.

It should look something like this:

The leftmost layer is the input layer, which takes X0 as the bias term of value 1, and X1 and X2 as input features. The layer in the middle is the first hidden layer, which also takes a bias term Z0 of value 1. Finally, the output layer has only one output unit D0, whose activation value is the actual output of the model (i.e. h(x)).

Now we forward-propagate

It is now time to feed forward the information from one layer to the next. This goes through two steps that happen at every node/unit in the network:

1- Getting the weighted sum of inputs of a particular unit using the h(x) function we defined earlier.

2- Plugging the value we get from step 1 into the activation function we have (f(a) = a in this example) and using the activation value we get (i.e. the output of the activation function) as the input feature for the connected nodes in the next layer.

Note that units X0, X1, X2 and Z0 do not have any units connected to them that provide inputs. Therefore, the steps mentioned above do not occur in those nodes. However, for the rest of the nodes/units, this is how it all happens throughout the neural net for the first input sample in the training set:

Unit Z1:

h(x) = W0.X0 + W1.X1 + W2.X2
     = 1 . 1 + 1 . 0 + 1 . 0
     = 1 = a

z = f(a) = a => z = f(1) = 1

and the same goes for the rest of the units:

Unit Z2:

h(x) = W0.X0 + W1.X1 + W2.X2

= 1 . 1 + 1 . 0 + 1 . 0

= 1 = a

z = f(a) = a => z = f(1) = 1

Unit Z3:

h(x) = W0.X0 + W1.X1 + W2.X2
     = 1 . 1 + 1 . 0 + 1 . 0
     = 1 = a

z = f(a) = a => z = f(1) = 1

Unit D0:

h(x) = W0.Z0 + W1.Z1 + W2.Z2 + W3.Z3
     = 1 . 1 + 1 . 1 + 1 . 1 + 1 . 1
     = 4 = a

z = f(a) = a => z = f(4) = 4
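The four unit computations above can be reproduced with a short sketch (variable names are mine; the logic is exactly the weighted-sum-then-activation steps described earlier):

```python
def f(a):                     # identity activation
    return a

def h(W, X):                  # weighted sum of inputs
    return sum(w * x for w, x in zip(W, X))

X = [1, 0, 0]                 # [bias X0, X1, X2] for the sample {0, 0}
W_hidden = [[1, 1, 1]] * 3    # weights into Z1, Z2, Z3, all initialized to 1

Z = [1] + [f(h(w, X)) for w in W_hidden]   # [bias Z0, Z1, Z2, Z3]
W_out = [1, 1, 1, 1]          # weights into D0

D0 = f(h(W_out, Z))
print(Z, D0)                  # [1, 1, 1, 1] 4
```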

As we mentioned earlier, the activation value (z) of the final unit (D0) is the output of the entire model. Therefore, our model predicted an output of 4 for the set of inputs {0, 0}. Calculating the loss/cost of the current iteration would follow:

Loss = actual_y - predicted_y

= 0 - 4

= -4

The actual_y value comes from the training set, while the predicted_y value is what our model yielded. So the cost at this iteration is equal to -4.

So where is Back-propagation?

According to our example, we now have a model that does not give accurate predictions (it gave us the value 4 rather than 0), which is attributed to the fact that its weights have not been tuned yet (they are all equal to 1). We also have the loss, which is equal to -4.

Back-propagation is all about feeding this loss backwards in such a way that we can fine-tune the weights based on it. The optimization function (Gradient Descent in our example) will help us find the weights that will, hopefully, yield a smaller loss in the next iteration. So let’s get to it!

If feeding forward happened using the following functions:

f(a) = a

Then feeding backward will happen through the partial derivatives of those functions. There is no need to go through the work of arriving at these derivatives. All we need to know is that the above functions will follow:

f'(a) = 1

J'(w) = Z . delta

where Z is simply the z value we obtained from the activation function calculation in the feed-forward step, while delta is the loss of the unit at the other end of the weighted link.

I know it’s a lot of information to absorb in one sitting, but I suggest you take your time and really understand what is happening at every step before going further.

Calculating the deltas

Now we need to find the loss at every unit/node in the neural net. Why is that? Well, think of it this way: every loss the deep learning model arrives at is actually the mess caused by all the nodes, accumulated into one number. Therefore, we need to find out which node is responsible for most of the loss in every layer, so that we can penalize it, in a sense, by giving it a smaller weight value and thus lessening the total loss of the model.

Calculating the delta of every unit can be problematic. However, thanks to Mr. Andrew Ng, we have a shortcut formula for the whole thing:

delta_0 = w . delta_1 . f'(z)

where delta_0, w and f'(z) are the values of the same unit, while delta_1 is the loss of the unit on the other side of the weighted link.

You can think of it this way: in order to get the loss of a node (e.g. Z0), we multiply the value of its corresponding f'(z) by the loss of the node it is connected to in the next layer (delta_1), and by the weight of the link connecting both nodes.

This is exactly how back-propagation works. We do the delta calculation step at every unit, back-propagating the loss into the neural net, and finding out what loss every node/unit is responsible for.

Let’s calculate those deltas and get it over with!

delta_D0 = total_loss = -4

delta_Z0 = W . delta_D0 . f'(Z0) = 1 . (-4) . 1 = -4

delta_Z1 = W . delta_D0 . f'(Z1) = 1 . (-4) . 1 = -4

delta_Z2 = W . delta_D0 . f'(Z2) = 1 . (-4) . 1 = -4

delta_Z3 = W . delta_D0 . f'(Z3) = 1 . (-4) . 1 = -4
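These delta calculations follow mechanically from delta = w . delta_next . f'(z); here is a sketch (variable names are mine):

```python
def f_prime(z):               # derivative of the identity activation
    return 1

delta_D0 = -4                 # the loss of the whole model
W_out = [1, 1, 1, 1]          # weights on the links Z0..Z3 -> D0
Z = [1, 1, 1, 1]              # activations from the forward pass

# delta for each hidden unit: w * delta_D0 * f'(z)
deltas = [w * delta_D0 * f_prime(z) for w, z in zip(W_out, Z)]
print(deltas)                 # [-4, -4, -4, -4]
```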

There are a few things to note here:

The loss of the final unit (i.e. D0) is equal to the loss of the whole model. This is because it is the output unit, and its loss is the accumulated loss of all the units together, as we said earlier.

The function f'(z) will always give the value 1, no matter what the input (i.e. z) is equal to. This is because the partial derivative, as we said earlier, follows: f'(a) = 1

The input nodes/units (X0, X1 and X2) do not have delta values, as there is nothing those nodes control in the neural net. They are only there as a link between the data set and the neural net. This is, incidentally, why the input layer is usually not included in the layer count.

Updating the weights

All that is left now is to update all the weights we have in the neural net. This follows the Batch Gradient Descent formula:

W := W - alpha . J'(W)

where W is the weight at hand, alpha is the learning rate (i.e. 0.1 in our example) and J'(W) is the partial derivative of the cost function J(W) with respect to W.

Again, there is no need for us to get into the math. So let’s use Mr. Andrew Ng’s partial derivative of the function:

J'(W) = Z . delta

where Z is the Z value obtained through forward-propagation, and delta is the loss at the unit on the other end of the weighted link.

Now we apply the Batch Gradient Descent weight update to all the weights, using the partial-derivative values we obtained at every step. It is worth emphasizing that the Z values of the input nodes (X0, X1, and X2) are equal to 1, 0, 0, respectively.

The 1 is the value of the bias unit, while the zeroes are actually the feature input values coming from the data set. One last note is that there is no particular order in which to update the weights. You can update them in any order you like, as long as you do not make the mistake of updating any weight twice in the same iteration.

In order to calculate the new weights, let’s give the links in our neural net names: W for the weights between the input and hidden layers, and V for the weights between the hidden and output layers.

New weight calculations will happen as follows:

W10 := W10 - alpha . Z_X0 . delta_Z1
     = 1 - 0.1 . 1 . (-4) = 1.4

W20 := W20 - alpha . Z_X0 . delta_Z2
     = 1 - 0.1 . 1 . (-4) = 1.4

and similarly for the remaining weights (recall that Z_X1 = Z_X2 = 0 for this sample, so the updates to the weights fed by X1 and X2 come out to zero):

W30 := 1.4

W11 := 1

W21 := 1

W31 := 1

W12 := 1

W22 := 1

W32 := 1

V00 := V00 - alpha . Z_Z0 . delta_D0
     = 1 - 0.1 . 1 . (-4) = 1.4

V01 := 1.4

V02 := 1.4

V03 := 1.4
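The updates above can be sketched with the rule W := W - alpha . Z . delta (variable names are mine; note that Z_X1 = Z_X2 = 0 for this sample, so the updates to the weights fed by X1 and X2 come out to zero):

```python
alpha = 0.1
X = [1, 0, 0]                          # Z values of X0, X1, X2 for sample {0, 0}
delta_hidden = [-4, -4, -4]            # deltas of Z1, Z2, Z3
delta_D0 = -4

# W[i][j] connects input X_j to hidden unit Z_(i+1)
W = [[1, 1, 1] for _ in range(3)]
for i in range(3):
    for j in range(3):
        W[i][j] -= alpha * X[j] * delta_hidden[i]

# V[j] connects hidden activation Z_j to D0 (all those Z values are 1)
Z = [1, 1, 1, 1]
V = [1, 1, 1, 1]
V = [v - alpha * z * delta_D0 for v, z in zip(V, Z)]

print(W)   # first column moves to 1.4; columns fed by X1, X2 stay at 1
print(V)   # every output-layer weight moves to 1.4
```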

It is important to note here that the model is not trained properly yet, as we have only back-propagated through one sample from the training set. Repeating everything we did for all the samples will yield a model with better accuracy as we go, getting closer to the minimum loss/cost at every step.
