(Post still in progress. If I find any subtleties in the future I will update this post accordingly)

What is it?

  • Method for (in theory) increasing the performance of your neural network:
    • In particular, it helps with the issue of internal covariate shift:
      • the means and variances of the inputs to the internal layers change during neural network training.
      • you can see how this is a huge issue in deep neural networks - small changes in the early hidden layers get propagated and grow during training, affecting the deeper hidden layers later on!
      • at each training step, the distribution of inputs each layer sees will change without BatchNormalization (a short sketch illustrating this follows below).
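
  • to make this concrete, here's a tiny toy sketch (my own NumPy example, not from the Batch Norm paper) of how a hidden layer's outputs - i.e. the inputs to the next layer - drift after a single weight update:

import numpy as np

rng = np.random.default_rng(0)

# a fixed mini-batch of inputs with 10 features
X = rng.normal(size=(256, 10))

# weights of a first hidden layer with 8 nodes
W1 = rng.normal(size=(10, 8)) * 0.5

def hidden_activations(W):
    # ReLU hidden layer - its outputs are the inputs to the next layer
    return np.maximum(X @ W, 0.0)

before = hidden_activations(W1)

# pretend a gradient update nudged the first layer's weights
W1 = W1 + 0.1 * rng.normal(size=W1.shape)
after = hidden_activations(W1)

# the distribution feeding the next layer has shifted, even though X didn't change
print("mean / std before update:", before.mean(), before.std())
print("mean / std after  update:", after.mean(), after.std())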

How do I Use It?

  • In your neural network, add a layer where you do the Batch Norm Transform:
    • in practice, use keras for building this BatchNormalization layer.
      • keras BatchNormalization Docs
      • so if you're using keras, doing a batch norm transform is as simple as adding a BatchNormalization layer to your model:
        • model.add(keras.layers.BatchNormalization())
        • of course you can tweak some parameters if you wish - see the docs for more information.
    • if you’re interested in the math and implementing it yourself (which I’m always a big fan of), that’s coming up soon.
      • for now, know that the Batch Norm Transform normalizes each feature of your mini-batch to have $\mu = 0$ and $\sigma = 1$ (the full transform then applies a learnable scale $\gamma$ and shift $\beta$ on top of this).
      • also, the actual normalization per feature for a feature $x$ in a “mini-batch” (call it $B$) looks like this (a small NumPy sketch follows the list):
        • \[\hat{x}_B^{(p)} \leftarrow \frac{x_B^{(p)} - \mu_B^{(p)}}{\sqrt{\left(\sigma_B^{(p)}\right)^2 + \epsilon}}\]
        • where $p$ denotes the $p^{\text{th}}$ feature, $\mu_B^{(p)}$ and $\sigma_B^{(p)}$ are that feature's mean and standard deviation over the mini-batch $B$, and $\epsilon$ is a small constant for numerical stability.
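
  • to see the formula in action, here's a minimal NumPy sketch of the transform (my own toy code, not the Keras implementation - the $\epsilon$ value and the $\gamma$, $\beta$ parameters are just illustrative defaults):

import numpy as np

def batch_norm_forward(X_B, gamma, beta, eps=1e-3):
    # X_B is a mini-batch of shape (batch_size, num_features)
    mu = X_B.mean(axis=0)                     # per-feature mean over the batch
    var = X_B.var(axis=0)                     # per-feature variance over the batch
    X_hat = (X_B - mu) / np.sqrt(var + eps)   # the normalization from the formula above
    return gamma * X_hat + beta               # learnable scale and shift

# toy mini-batch: 32 examples, 4 features, roughly mean 5 and std 3
rng = np.random.default_rng(0)
X_B = rng.normal(loc=5.0, scale=3.0, size=(32, 4))

out = batch_norm_forward(X_B, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0))   # roughly 0 per feature
print(out.std(axis=0))    # roughly 1 per feature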

Any Drawbacks?

  • Gradient explosion in deep neural networks - stacking many Batch Norm layers in a very deep network can make the gradients blow up, particularly at initialization.
    • this Wikipedia page needs a lot of work, wow. Don’t look at me!

Any Funky Things I Should Know?

  • Ok, apparently there’s a lot of debate regarding whether Batch Norm directly reduces the internal covariate shift phenomenon.
    • A lot of scientists who are smarter than I am and know much more about Deep Learning argue that what Batch Norm actually does is smooth the optimization landscape (the loss surface being optimized during training) - see, for example, “How Does Batch Normalization Help Optimization?” by Santurkar et al. (2018).
      • this smoothness supposedly (I want to prove it / see a proof) allows for a wider range of usable learning rates and faster convergence.

Give Us An Example!

  • Here is how I initialized a neural network with BatchNormalization layers (two of them):

# imports - here I'm assuming the keras that ships with TensorFlow
from tensorflow import keras
from tensorflow.keras.models import Sequential

# initialize my model
model = Sequential()

# now we stack the layers using .add()

# layer 1 - dense layer with 8 nodes
# notice keras lets us do this without defining an input layer
model.add(keras.layers.Dense(8))

# layer 2 - batch norm time!
model.add(keras.layers.BatchNormalization())

# layer 3 - another dense layer with 8 nodes
model.add(keras.layers.Dense(8))

# layer 4 - and there's the second batch norm layer
model.add(keras.layers.BatchNormalization())

# and so forth...
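
  • a quick usage note (mine, not from the original snippet): the BatchNormalization layer behaves differently during training, where it normalizes with each mini-batch's statistics, and during inference, where it uses the moving averages it tracked during training - Keras handles this switch for you in fit() vs predict(). Here's a hypothetical continuation of the model above, with placeholder data, loss, and optimizer:

import numpy as np

# finish the toy model with an output layer and compile it
# (the loss and optimizer here are placeholders, not recommendations)
model.add(keras.layers.Dense(1))
model.compile(optimizer="adam", loss="mse")

# fake data: 128 examples with 10 features
X = np.random.normal(size=(128, 10))
y = np.random.normal(size=(128, 1))

# during fit(), BatchNormalization normalizes with each mini-batch's statistics
model.fit(X, y, batch_size=32, epochs=2, verbose=0)

# during predict(), it switches to the moving averages tracked during training
preds = model.predict(X)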

References and More Reading:

  1. Wikipedia
  2. How To Use Batch Normalization With Keras?
  3. Easy To Read Keras Docs
  4. More Keras Docs (API)