$$
y = \frac{x - \mathrm{E}[x]}{ \sqrt{\mathrm{Var}[x] + \epsilon}} * \gamma + \beta
$$
The mean and standard deviation are calculated per-dimension over the
mini-batches, and \(\gamma\) and \(\beta\) are learnable parameter
vectors of size C (where C is the input size). By default, the elements
of \(\gamma\) are set to 1 and the elements of \(\beta\) are set to 0.
The standard deviation is calculated via the biased estimator,
equivalent to torch_var(input, unbiased = FALSE).
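To make the formula concrete, here is a minimal sketch (assuming the torch R
package's nn_batch_norm1d constructor and standard tensor methods; the shapes
and tolerance are illustrative) that reproduces the layer's training-mode
output by hand:

```r
library(torch)

x  <- torch_randn(8, 4)              # mini-batch of 8 samples, 4 features (C = 4)
bn <- nn_batch_norm1d(num_features = 4)

y_layer <- bn(x)                     # gamma = 1 and beta = 0 by default

# Per-feature statistics over the batch, using the biased variance estimator
mu  <- x$mean(dim = 1, keepdim = TRUE)
v   <- x$var(dim = 1, unbiased = FALSE, keepdim = TRUE)
y_manual <- (x - mu) / torch_sqrt(v + bn$eps)

# The two results should agree up to floating-point tolerance
torch_allclose(y_layer, y_manual, atol = 1e-6)
```

Because \(\gamma\) and \(\beta\) default to 1 and 0, the affine term drops out
of the manual computation above.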
Also by default, during training this layer keeps running estimates of its
computed mean and variance, which are then used for normalization during
evaluation. The running estimates are kept with a default momentum
of 0.1.
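This momentum is the exponential-moving-average factor for the running
statistics, not the optimizer notion of momentum. As a sketch (assuming the
update rule conventionally used by this family of batch-normalization layers),
a running estimate \(\hat{x}\) is updated from the current batch statistic
\(x_t\) as:

$$
\hat{x}_\text{new} = (1 - \text{momentum}) \times \hat{x} + \text{momentum} \times x_t
$$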
If track_running_stats is set to FALSE, this layer does not keep running
estimates, and batch statistics are used at evaluation time as well.
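As a hedged usage sketch (same assumed constructor as above), disabling the
running statistics looks like this; note that even in eval mode the layer then
normalizes with the statistics of the batch it is given:

```r
# Illustrative example: disable running statistics entirely
bn_nostats <- nn_batch_norm1d(num_features = 4, track_running_stats = FALSE)
bn_nostats$eval()                    # eval mode still uses batch statistics here
y <- bn_nostats(torch_randn(8, 4))
```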