Batch Normalization Notes
(Post still in progress. If I find any subtleties in the future I will update this post accordingly)
What is it?
- A method for (in theory) improving the performance of your neural network.
- In particular, it helps with the issue of internal covariate shift:
- the means and variances of the inputs to the internal layers change during neural network training.
- you can see how this is a huge issue in deep neural networks - small changes in the initial hidden layers get propagated and grow during training, affecting the deeper hidden layers later on!
- at each step, the distribution of inputs each layer sees will change without BatchNormalization (a tiny toy demo of this is sketched right below).
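Here is a small toy demo of that shift (my own illustration - the data, the layer size, and the "pretend gradient step" are all made up just to show the effect): a hidden layer's output distribution depends on the weights before it, so when those weights get updated, the next layer suddenly sees differently distributed inputs.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 4))              # a fixed batch of input data
W1 = rng.normal(scale=0.5, size=(4, 8))    # weights of the first hidden layer

def hidden_output(W):
    # ReLU hidden layer: these outputs are the *inputs* to the next layer
    return np.maximum(0.0, X @ W)

before = hidden_output(W1)
W1 = W1 + rng.normal(scale=0.1, size=W1.shape)   # pretend this is one training update
after = hidden_output(W1)

# the per-feature statistics the next layer sees have shifted,
# even though the raw input data X never changed
print("means before:", before.mean(axis=0))
print("means after: ", after.mean(axis=0))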
How do I Use It?
- In your neural network, add a layer where you do the Batch Norm Transform:
- in practice, use `keras` for building this BatchNormalization layer - see the keras BatchNormalization Docs.
- so if you're using `keras`, doing a batch norm transform is quite simply adding a BatchNormalization layer to your model: `model.add(keras.layers.BatchNormalization())`
- of course you can tweak some parameters if you wish - see the docs for more information.
- if you’re interested in the math and implementing it yourself (which I’m always a big fan of), that’s coming up soon.
- for now, know that the Batch Norm Transform normalizes your data to have a $\mu = 0$ and $\sigma = 1$.
- also, the actual per-feature normalization for a feature $x$ in a “mini-batch” (call it $B$) looks like:
- \[\hat{x}_B^{(p)} \leftarrow \frac{x_B^{(p)} - \mu_B^{(p)}}{\sqrt{\left(\sigma_B^{(p)}\right)^2 + \epsilon}}\]
- where $p$ represents the $p^{\text{th}}$ feature (a quick NumPy sketch of this transform follows right after this list).
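And here is a minimal NumPy sketch of that transform, in case you want to play with it outside of keras (the function name, epsilon value, and batch shape are just my assumptions for the example - this is not the keras implementation):
import numpy as np

def batch_norm_transform(X, eps=1e-5):
    # X is one mini-batch B with shape (batch_size, num_features);
    # normalize each feature (column) p to mean 0 and variance 1
    mu = X.mean(axis=0)        # mu_B^(p): one mean per feature
    var = X.var(axis=0)        # (sigma_B^(p))^2: one variance per feature
    return (X - mu) / np.sqrt(var + eps)

# sanity check on a fake mini-batch of 32 samples and 8 features
B = 5.0 * np.random.randn(32, 8) + 3.0
X_hat = batch_norm_transform(B)
print(X_hat.mean(axis=0))   # roughly 0 for every feature
print(X_hat.std(axis=0))    # roughly 1 for every feature
Note that the actual keras layer also learns a per-feature scale and shift on top of this normalization (the center and scale arguments mentioned in the docs control them).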
Any Drawbacks?
- Gradient Explosion in deep neural networks.
- this Wikipedia page needs a lot of work, wow. Don’t look at me!
Any Funky Things I Should Know?
- Ok, apparently there’s a lot of debate regarding whether Batch Norm directly reduces the internal covariate shift phenomenon.
- A lot of scientists who are smarter than I am and know much more about Deep Learning say that what Batch Norm actually does is smooth the optimization landscape (the loss function being minimized during training).
- this smoothness supposedly (I want to prove it / see a proof) allows for a larger range of usable learning rates and faster convergence.
Give Us An Example!
- Here is how I initialized a neural network with BatchNormalization layers (two of them):
from tensorflow import keras
from tensorflow.keras.models import Sequential

# initialize my model
model = Sequential()
# now we stack the layers using .add()
# layer 1 - dense layer with 8 nodes
# notice keras lets us do this without defining an input layer
model.add(keras.layers.Dense(8))
# layer 2 - batch norm time!
model.add(keras.layers.BatchNormalization())
# and so forth...
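And if you want to see the rest filled in, here is a fuller sketch with both BatchNormalization layers - the layer sizes, activations, optimizer, and loss are placeholder choices for illustration, not something from a real project:
from tensorflow import keras
from tensorflow.keras.models import Sequential

model = Sequential()
# dense layer 1 followed by batch norm layer 1
model.add(keras.layers.Dense(8, activation="relu"))
model.add(keras.layers.BatchNormalization())
# dense layer 2 followed by batch norm layer 2,
# this time with the (default) momentum and epsilon parameters written out
model.add(keras.layers.Dense(8, activation="relu"))
model.add(keras.layers.BatchNormalization(momentum=0.99, epsilon=1e-3))
# an output layer, e.g. for binary classification
model.add(keras.layers.Dense(1, activation="sigmoid"))
model.compile(optimizer="adam", loss="binary_crossentropy")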