As more layers using certain activation functions are added to a neural network, the gradients of the loss function approach zero, making the network hard to train.
Certain activation functions, like the sigmoid function, squash a large input space into a small output range between 0 and 1. Therefore, a large change in the input of the sigmoid function causes only a small change in the output. Hence, the derivative becomes small, especially when |x| is large.
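To make the point about the small derivative concrete, here is a minimal sketch that evaluates the sigmoid and its derivative at a few inputs. The helper names (sigmoid, sigmoid_derivative) and the sample inputs are my own choices for illustration; they are not taken from the model trained below.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid: maps any real input into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_derivative(x: float) -> float:
    """Derivative of the sigmoid, computed as s(x) * (1 - s(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


# Illustrative inputs only: the derivative peaks at x = 0 and shrinks as |x| grows.
for x in (0.0, 2.0, 5.0, 10.0):
    print(f"x = {x:5.1f}  sigmoid = {sigmoid(x):.6f}  derivative = {sigmoid_derivative(x):.6f}")
```

The derivative is largest at x = 0, where it equals 0.25, and it falls off quickly as the input moves away from zero.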
I used a scikit-learn (sklearn) dataset to train the neural network. Here is the dataset:
Let's concentrate on the first layer of the model.
Weights before Training:
[-0.1077401 , 0.43455356, -0.60425615, -0.53927976, -0.190355,-0.12216991, 0.17259805, -0.29025432, -0.34041786, 0.5119183 ]
Weights after 1st Epoch:
[ 0.53107524, 0.12173636, 0.5073124 , 0.54416114, -0.19135918,-0.08547983, 0.56848377, 0.44454524, -0.6231898 , 0.0601779 ]
Change in Weight:
[-0.00113249, -0.00108778, -0.00292063, -0.00071526, -0.00165403,-0.00002235, -0.00202656, -0.00181794, 0.00232458, -0.00081211]
As you can see, the change in weight is very small, which means that the weights and biases of the initial layers are not updated effectively with each training epoch.
This can make the whole model inaccurate, because the later layers build on what the initial layers learn.
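Why the initial layers suffer most can be sketched with the chain rule: the gradient that reaches an early layer is multiplied by one activation derivative per layer it passes through. The snippet below is a simplified illustration, not the network trained above. It assumes every weight on the path equals 1 and every pre-activation equals 0, which is the best case for the sigmoid; gradient_scale is a hypothetical helper name.

```python
import math


def sigmoid_derivative(x: float) -> float:
    """Derivative of the logistic sigmoid, s(x) * (1 - s(x))."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)


def gradient_scale(pre_activations: list[float]) -> float:
    """Product of sigmoid derivatives along one path through the network.

    Assumption for illustration: all weights on the path are 1, so only the
    activation derivatives scale the gradient.
    """
    scale = 1.0
    for z in pre_activations:
        scale *= sigmoid_derivative(z)
    return scale


# Best case for the sigmoid: every pre-activation sits at 0 (derivative 0.25).
for depth in (1, 3, 5, 10):
    print(f"{depth:2d} sigmoid layers -> gradient scaled by {gradient_scale([0.0] * depth):.2e}")
```

Even in this best case the factor is 0.25 raised to the number of layers, so after ten layers the gradient is scaled to less than one millionth of its original size.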
Batch normalization reduces this problem by normalizing the input so that |x| doesn’t reach the outer edges of the sigmoid function. Most of the normalized input then falls in the central region of the curve (the green region of the sigmoid plot), where the derivative isn’t too small.
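Here is a minimal sketch of the normalization step, assuming a single feature and a small batch of made-up pre-activation values. It leaves out the learned scale and shift parameters that a real batch normalization layer also applies, and the names normalize and mean_sigmoid_derivative are hypothetical helpers, not part of any library.

```python
import math
import statistics


def sigmoid_derivative(x: float) -> float:
    """Derivative of the logistic sigmoid, s(x) * (1 - s(x))."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)


def normalize(batch: list[float], eps: float = 1e-5) -> list[float]:
    """Shift the batch to zero mean and scale it to (roughly) unit variance."""
    mean = statistics.fmean(batch)
    variance = statistics.pvariance(batch, mu=mean)
    return [(x - mean) / math.sqrt(variance + eps) for x in batch]


def mean_sigmoid_derivative(batch: list[float]) -> float:
    """Average sigmoid derivative over the batch."""
    return statistics.fmean(sigmoid_derivative(x) for x in batch)


# Made-up pre-activations that reach far into the flat tails of the sigmoid.
raw = [-12.0, -7.5, -3.0, 4.0, 9.0, 15.0]
normed = normalize(raw)

print("mean derivative, raw inputs:       ", mean_sigmoid_derivative(raw))
print("mean derivative, normalized inputs:", mean_sigmoid_derivative(normed))
```

After normalization the values cluster around zero, so the average derivative is much larger than it is for the raw inputs.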
Another remedy is to use an activation function such as ReLU, which doesn't cause a small derivative for positive inputs.
Weights before Training:
[-0.7054342 , 0.55165845, -0.44873548, -0.62863475, 0.61142164, -0.4026693 , 0.4864213 , -0.48546308, 0.32516545, 0.49556655]
Weights after 1st Epoch:
[ 0.53107524, 0.12173636, 0.5073124 , 0.54416114, -0.19135918,-0.08547983, 0.56848377, 0.44454524, -0.6231898 , 0.0601779 ]
Change in Weight:
[ 3.5640223, 5.7049985, -10.301827 , 8.888125 , -7.2175856, -6.165191 , -9.538114 , 3.937869 , -5.6343074, 5.639493 ]
Here the change in weight is significant, because the model uses the ReLU activation function.
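The reason ReLU keeps the updates large can be sketched the same way as for the sigmoid: its derivative is exactly 1 for positive inputs, so multiplying derivatives across layers does not shrink the gradient. This is an illustrative sketch with made-up pre-activations and all weights assumed to be 1; relu, relu_derivative and gradient_scale are hypothetical helper names.

```python
def relu(x: float) -> float:
    """Rectified linear unit: passes positive inputs through, clips the rest to 0."""
    return max(0.0, x)


def relu_derivative(x: float) -> float:
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0


def gradient_scale(pre_activations: list[float]) -> float:
    """Product of ReLU derivatives along one path (weights assumed to be 1)."""
    scale = 1.0
    for z in pre_activations:
        scale *= relu_derivative(z)
    return scale


# Made-up positive pre-activations: the gradient passes through unchanged.
for depth in (1, 3, 5, 10):
    print(f"{depth:2d} ReLU layers -> gradient scaled by {gradient_scale([1.5] * depth)}")

# Caveat: a negative pre-activation anywhere on the path zeroes the gradient.
print("path with one inactive unit:", gradient_scale([1.5, -0.3, 2.0]))
```

The trade-off is that a unit whose input is negative passes no gradient at all, which is why variants such as Leaky ReLU exist.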



