Batch Normalization Notes
(Post still in progress. If I find any subtleties in the future I will update this post accordingly)
What is it?
- A method for (in theory) improving the performance of your neural network.
- In particular, it helps with the issue of internal covariate shift:
- the means and variances of the inputs to the internal layers change during neural network training.
- you can see how this is a huge issue in deep neural networks - small changes in the initial hidden layers get propagated and grow during training, affecting the deeper hidden layers later on!
- at each step, the distribution of inputs each layer sees will change without BatchNormalization (a tiny toy demo of this is sketched right below).
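Here is a small toy demo of that shift (my own illustration - the data, the layer size, and the "pretend gradient step" are all made up just to show the effect): a hidden layer's output distribution depends on the weights before it, so when those weights get updated, the next layer suddenly sees differently distributed inputs.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 4))              # a fixed batch of input data
W1 = rng.normal(scale=0.5, size=(4, 8))    # weights of the first hidden layer

def hidden_output(W):
    # ReLU hidden layer: these outputs are the *inputs* to the next layer
    return np.maximum(0.0, X @ W)

before = hidden_output(W1)
W1 = W1 + rng.normal(scale=0.1, size=W1.shape)   # pretend this is one training update
after = hidden_output(W1)

# the per-feature statistics the next layer sees have shifted,
# even though the raw input data X never changed
print("means before:", before.mean(axis=0))
print("means after: ", after.mean(axis=0))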
How do I Use It?
- In your neural network, add a layer where you do the Batch Norm Transform:
- in practice, use `keras` for building this BatchNormalization layer - see the keras BatchNormalization Docs.
- so if you're using `keras`, doing a batch norm transform is quite simply adding a BatchNormalization layer to your model: `model.add(keras.layers.BatchNormalization())`
- of course you can tweak some parameters if you wish - see the docs for more information.
- if you’re interested in the math and implementing it yourself (which I’m always a big fan of), that’s coming up soon.
- for now, know that the Batch Norm Transform normalizes your data to have a $\mu = 0$ and $\sigma = 1$.
- also, the actual per-feature normalization for a feature $x$ in a “mini-batch” (call it $B$) looks like:
- \[\hat{x}_B^{(p)} \leftarrow \frac{x_B^{(p)} - \mu_B^{(p)}}{\sqrt{\left(\sigma_B^{(p)}\right)^2 + \epsilon}}\]
- where $p$ represents the $p^{\text{th}}$ feature (a quick NumPy sketch of this transform follows right after this list).
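And here is a minimal NumPy sketch of that transform, in case you want to play with it outside of keras (the function name, epsilon value, and batch shape are just my assumptions for the example - this is not the keras implementation):
import numpy as np

def batch_norm_transform(X, eps=1e-5):
    # X is one mini-batch B with shape (batch_size, num_features);
    # normalize each feature (column) p to mean 0 and variance 1
    mu = X.mean(axis=0)        # mu_B^(p): one mean per feature
    var = X.var(axis=0)        # (sigma_B^(p))^2: one variance per feature
    return (X - mu) / np.sqrt(var + eps)

# sanity check on a fake mini-batch of 32 samples and 8 features
B = 5.0 * np.random.randn(32, 8) + 3.0
X_hat = batch_norm_transform(B)
print(X_hat.mean(axis=0))   # roughly 0 for every feature
print(X_hat.std(axis=0))    # roughly 1 for every feature
Note that the actual keras layer also learns a per-feature scale and shift on top of this normalization (the center and scale arguments mentioned in the docs control them).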
Any Drawbacks?
- Gradient Explosion in deep neural networks.
- this Wikipedia page needs a lot of work, wow. Don’t look at me!
Any Funky Things I Should Know?
- Ok, apparently there’s a lot of debate regarding whether Batch Norm directly reduces the internal covariate shift phenomenon.
- A lot of scientists who are smarter than I am and know much more about Deep Learning say that what Batch Norm actually does is smooth the optimization landscape (the loss function being minimized during training).
- this smoothness supposedly (I want to prove it / see a proof) allows for a larger range of usable learning rates and faster convergence.
Give Us An Example!
- Here is how I initialized a neural network with BatchNormalization layers (two of them):
from tensorflow import keras
from tensorflow.keras.models import Sequential

# initialize my model
model = Sequential()
# now we stack the layers using .add()
# layer 1 - dense layer with 8 nodes
# notice keras lets us do this without defining an input layer
model.add(keras.layers.Dense(8))
# layer 2 - batch norm time!
model.add(keras.layers.BatchNormalization())
# and so forth...
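And if you want to see the rest filled in, here is a fuller sketch with both BatchNormalization layers - the layer sizes, activations, optimizer, and loss are placeholder choices for illustration, not something from a real project:
from tensorflow import keras
from tensorflow.keras.models import Sequential

model = Sequential()
# dense layer 1 followed by batch norm layer 1
model.add(keras.layers.Dense(8, activation="relu"))
model.add(keras.layers.BatchNormalization())
# dense layer 2 followed by batch norm layer 2,
# this time with the (default) momentum and epsilon parameters written out
model.add(keras.layers.Dense(8, activation="relu"))
model.add(keras.layers.BatchNormalization(momentum=0.99, epsilon=1e-3))
# an output layer, e.g. for binary classification
model.add(keras.layers.Dense(1, activation="sigmoid"))
model.compile(optimizer="adam", loss="binary_crossentropy")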