SLIDE 1

AMMI – Introduction to Deep Learning
6.4. Batch normalization

François Fleuret
https://fleuret.org/ammi-2018/
Sun Sep 30 10:42:14 CAT 2018

ÉCOLE POLYTECHNIQUE FÉDÉRALE DE LAUSANNE

SLIDE 2

We saw that maintaining proper statistics of the activations and derivatives was a critical issue to allow the training of deep architectures.

SLIDE 3

It was the main motivation behind Xavier’s weight initialization rule.

SLIDE 4

A different approach consists of explicitly forcing the activation statistics during the forward pass by re-normalizing them. Batch normalization, proposed by Ioffe and Szegedy (2015), was the first method introducing this idea.

SLIDE 5

“Training Deep Neural Networks is complicated by the fact that the distribution of each layer’s inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization /.../” (Ioffe and Szegedy, 2015)

SLIDE 6

Batch normalization can be done anywhere in a deep architecture, and forces the activations’ first and second order moments, so that the following layers do not need to adapt to their drift.

SLIDE 7

During training, batch normalization shifts and rescales according to the mean and variance estimated on the batch.

  • Processing a batch jointly is unusual. Operations used in deep models can virtually always be formalized per-sample.

SLIDE 8

During test, it simply shifts and rescales according to the empirical moments estimated during training.

SLIDE 9

If x_b ∈ ℝ^D, b = 1, …, B are the samples in the batch, we first compute the empirical per-component mean and variance on the batch

  m̂_batch = (1/B) ∑_{b=1}^{B} x_b

  v̂_batch = (1/B) ∑_{b=1}^{B} (x_b − m̂_batch)²

SLIDE 10

from which we compute the normalized values z_b ∈ ℝ^D and the outputs y_b ∈ ℝ^D

  ∀b = 1, …, B,

    z_b = (x_b − m̂_batch) / √(v̂_batch + ε)

    y_b = γ ⊙ z_b + β,

where ⊙ is the Hadamard component-wise product, and γ ∈ ℝ^D and β ∈ ℝ^D are parameters to optimize.
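As an illustration (a minimal sketch added here, not part of the original slides), the training-time forward pass above can be written in a few lines of PyTorch and checked against nn.BatchNorm1d with its default γ = 1 and β = 0:

import torch
from torch import nn

B, D, eps = 8, 4, 1e-5
x = torch.randn(B, D)
gamma, beta = torch.ones(D), torch.zeros(D)

m_batch = x.mean(dim = 0)                          # empirical per-component mean
v_batch = ((x - m_batch) ** 2).mean(dim = 0)       # empirical (biased) per-component variance
z = (x - m_batch) / torch.sqrt(v_batch + eps)      # normalized z_b
y = gamma * z + beta                               # output y_b

# Should match nn.BatchNorm1d in train mode (defaults gamma = 1, beta = 0, eps = 1e-5)
print(torch.allclose(y, nn.BatchNorm1d(D)(x), atol = 1e-6))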

SLIDE 11

During inference, batch normalization shifts and rescales each component of the input x independently, according to the statistics estimated during training:

  y = γ ⊙ (x − m̂) / √(v̂ + ε) + β.

Hence, during inference, batch normalization performs a component-wise affine transformation.

SLIDE 12


  • As for dropout, the model behaves differently during train and test.
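A minimal sketch of this difference, not from the original slides, with arbitrary made-up data: in train mode nn.BatchNorm1d normalizes with the statistics of the current batch, while in eval mode it applies the fixed affine transformation built from its running estimates.

import torch
from torch import nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(3)
x = torch.randn(8, 3) * 4.0 + 1.0

bn.train()
y_train = bn(x)          # normalizes with the batch statistics, updates the running estimates

bn.eval()
y_eval = bn(x)           # uses running_mean / running_var only

print(torch.allclose(y_train, y_eval))   # False in general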

SLIDE 13

As with dropout, batch normalization is implemented as separate modules that process the input components separately.

>>> x = torch.empty(1000, 3).normal_()
>>> x = x * torch.tensor([2., 5., 10.]) + torch.tensor([-10., 25., 3.])
>>> x.mean(0)
tensor([ -9.9555,  24.9327,   3.0933])
>>> x.std(0)
tensor([ 1.9976,  4.9463,  9.8902])
>>> bn = nn.BatchNorm1d(3)
>>> with torch.no_grad():
...     bn.bias.copy_(torch.tensor([2., 4., 8.]))
...     bn.weight.copy_(torch.tensor([1., 2., 3.]))
...
Parameter containing:
tensor([ 2., 4., 8.])
Parameter containing:
tensor([ 1., 2., 3.])
>>> y = bn(x)
>>> y.mean(0)
tensor([ 2.0000,  4.0000,  8.0000])
>>> y.std(0)
tensor([ 1.0005,  2.0010,  3.0015])

SLIDE 14

As for any other module, we have to compute the derivatives of the loss ℒ with respect to the input values and the parameters. For clarity, since components are processed independently, in what follows we consider a single dimension and do not index it.

SLIDE 15

We have

  m̂_batch = (1/B) ∑_{b=1}^{B} x_b

  v̂_batch = (1/B) ∑_{b=1}^{B} (x_b − m̂_batch)²

  ∀b = 1, …, B,

    z_b = (x_b − m̂_batch) / √(v̂_batch + ε)

    y_b = γ z_b + β.

From which

  ∂ℒ/∂γ = ∑_b (∂ℒ/∂y_b) (∂y_b/∂γ) = ∑_b (∂ℒ/∂y_b) z_b

  ∂ℒ/∂β = ∑_b (∂ℒ/∂y_b) (∂y_b/∂β) = ∑_b ∂ℒ/∂y_b.
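These two expressions can be checked against autograd. The snippet below is a sanity-check sketch (not from the original slides), using an arbitrary loss and applying the formulas per component to an nn.BatchNorm1d layer:

import torch
from torch import nn

torch.manual_seed(0)
B, D = 16, 5
x = torch.randn(B, D)
bn = nn.BatchNorm1d(D)
with torch.no_grad():
    bn.weight.uniform_(0.5, 1.5)    # gamma
    bn.bias.uniform_(-1.0, 1.0)     # beta

y = bn(x)
loss = (y ** 2).sum()               # an arbitrary scalar loss
loss.backward()

# Recompute z_b and dL/dy_b by hand, then apply the formulas above
m = x.mean(0)
v = ((x - m) ** 2).mean(0)
z = (x - m) / torch.sqrt(v + bn.eps)
dl_dy = (2 * y).detach()

print(torch.allclose(bn.weight.grad, (dl_dy * z).sum(0), atol = 1e-4))   # dL/dgamma
print(torch.allclose(bn.bias.grad, dl_dy.sum(0), atol = 1e-4))           # dL/dbeta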

SLIDE 16

Since each input in the batch impacts all the outputs of the batch, the derivative of the loss with respect to an input is quite complicated.

SLIDE 17

  ∀b = 1, …, B,  ∂ℒ/∂z_b = γ ∂ℒ/∂y_b

SLIDE 18

  ∂ℒ/∂v̂_batch = −(1/2) (v̂_batch + ε)^(−3/2) ∑_{b=1}^{B} (∂ℒ/∂z_b) (x_b − m̂_batch)

  ∂ℒ/∂m̂_batch = −(1/√(v̂_batch + ε)) ∑_{b=1}^{B} ∂ℒ/∂z_b

SLIDE 19

  ∀b = 1, …, B,

    ∂ℒ/∂x_b = (∂ℒ/∂z_b) (1/√(v̂_batch + ε)) + (2/B) (∂ℒ/∂v̂_batch) (x_b − m̂_batch) + (1/B) ∂ℒ/∂m̂_batch
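The whole chain can be verified numerically. Below is a sketch (not from the original slides) that applies the formulas above to a single component, as in the derivation, and compares the result with the gradient computed by autograd:

import torch

torch.manual_seed(0)
B, eps = 8, 1e-5
x = torch.randn(B, requires_grad = True)
gamma = torch.tensor(1.7)
beta = torch.tensor(-0.3)

m = x.mean()
v = ((x - m) ** 2).mean()
z = (x - m) / torch.sqrt(v + eps)
y = gamma * z + beta
loss = (y ** 3).sum()                     # an arbitrary scalar loss
loss.backward()

# Manual gradients, following the formulas above
with torch.no_grad():
    dl_dy = 3 * y ** 2
    dl_dz = gamma * dl_dy
    dl_dv = - 0.5 * (v + eps) ** (-1.5) * (dl_dz * (x - m)).sum()
    dl_dm = - dl_dz.sum() / torch.sqrt(v + eps)
    dl_dx = dl_dz / torch.sqrt(v + eps) + 2.0 / B * dl_dv * (x - m) + dl_dm / B

print(torch.allclose(x.grad, dl_dx, atol = 1e-4))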

SLIDE 20

In the standard implementation, the m̂ and v̂ used at test time are estimated with a moving average during training, so that batch normalization can be implemented as a module which does not need an additional pass through the samples during training.
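For instance, nn.BatchNorm1d keeps running_mean and running_var updated with an exponential moving average (momentum 0.1 by default). A small illustrative sketch with made-up data, not from the original slides:

import torch
from torch import nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(3, momentum = 0.1)

for _ in range(100):                      # every forward pass in train mode updates the running stats
    x = torch.randn(64, 3) * 5.0 + 2.0
    _ = bn(x)

print(bn.running_mean)                    # close to the true mean, ~ 2 for each component
print(bn.running_var)                     # close to the true variance, ~ 25 for each component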

SLIDE 21

Results on ImageNet’s LSVRC2012:

[Plot: single-crop validation accuracy (0.4 to 0.8) vs. number of training steps (5M to 30M) for Inception, BN-Baseline, BN-x5, BN-x30, and BN-x5-Sigmoid.]

Figure 2: Single crop validation accuracy of Inception and its batch-normalized variants, vs. the number of training steps.

Model           Steps to 72.2%   Max accuracy
Inception       31.0 · 10⁶       72.2%
BN-Baseline     13.3 · 10⁶       72.7%
BN-x5            2.1 · 10⁶       73.0%
BN-x30           2.7 · 10⁶       74.8%
BN-x5-Sigmoid                    69.8%

Figure 3: For Inception and the batch-normalized variants, the number of training steps required to reach the maximum accuracy of Inception (72.2%), and the maximum accuracy achieved by the network.

(Ioffe and Szegedy, 2015)

SLIDE 22

The authors state that with batch normalization

  • samples have to be shuffled carefully,
  • the learning rate can be greater,
  • dropout and local normalization are not necessary,
  • the influence of L2 regularization should be reduced.

SLIDE 23

Deep MLP on a 2d “disc” toy example, with naive Gaussian weight initialization, cross-entropy, standard SGD, η = 0.1.

def create_model(with_batchnorm, nc = 32, depth = 16):
    modules = []

    modules.append(nn.Linear(2, nc))
    if with_batchnorm: modules.append(nn.BatchNorm1d(nc))
    modules.append(nn.ReLU())

    for d in range(depth):
        modules.append(nn.Linear(nc, nc))
        if with_batchnorm: modules.append(nn.BatchNorm1d(nc))
        modules.append(nn.ReLU())

    modules.append(nn.Linear(nc, 2))

    return nn.Sequential(*modules)

We try different standard deviations for the weights

with torch.no_grad():
    for p in model.parameters():
        p.normal_(0, std)

SLIDE 24

[Plot: test error vs. weight std (0.001 to 10), for the baseline and with batch normalization.]

SLIDE 25

The position of batch normalization relative to the non-linearity is not clear.

“We add the BN transform immediately before the nonlinearity, by normalizing x = Wu + b. We could have also normalized the layer inputs u, but since u is likely the output of another nonlinearity, the shape of its distribution is likely to change during training, and constraining its first and second moments would not eliminate the covariate shift. In contrast, Wu + b is more likely to have a symmetric, non-sparse distribution, that is ’more Gaussian’ (Hyvärinen and Oja, 2000); normalizing it is likely to produce activations with a stable distribution.” (Ioffe and Szegedy, 2015)

  … → Linear → BN → ReLU → …

SLIDE 26

However, this argument goes both ways: activations after the non-linearity are less “naturally normalized” and benefit more from batch normalization. Experiments are generally in favor of this solution, which is the current default.

  … → Linear → ReLU → BN → …
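In PyTorch the two orderings differ by a single line. A minimal sketch of the two variants discussed above (layer sizes are arbitrary, not from the original slides):

from torch import nn

nc = 32

# BN before the non-linearity, as in Ioffe and Szegedy (2015)
block_bn_before = nn.Sequential(nn.Linear(nc, nc), nn.BatchNorm1d(nc), nn.ReLU())

# BN after the non-linearity, the ordering reported above as the usual default
block_bn_after = nn.Sequential(nn.Linear(nc, nc), nn.ReLU(), nn.BatchNorm1d(nc))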

SLIDE 27

As for dropout, properly using batch normalization on a convolutional map requires parameter sharing. The module torch.nn.BatchNorm2d (respectively torch.nn.BatchNorm3d) processes samples as multi-channel 2d maps (respectively multi-channel 3d maps) and normalizes each channel separately, with a γ and a β for each.
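A short sketch, not from the original slides, showing that nn.BatchNorm2d computes its statistics over the batch and the two spatial dimensions, separately for each channel (sizes are arbitrary):

import torch
from torch import nn

x = torch.randn(16, 3, 28, 28)            # a batch of 16 three-channel 2d maps
bn = nn.BatchNorm2d(3)                    # one gamma and one beta per channel

with torch.no_grad():
    y = bn(x)

print(y.mean(dim = (0, 2, 3)))            # ~ 0 for each of the 3 channels
print(y.std(dim = (0, 2, 3)))             # ~ 1 for each of the 3 channels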

SLIDE 28

Another normalization in the same spirit is the layer normalization proposed by Ba et al. (2016). Given a single sample x ∈ ℝ^D, it normalizes the components of x, hence normalizing activations across the layer instead of across the batch:

  µ = (1/D) ∑_{d=1}^{D} x_d

  σ = √( (1/D) ∑_{d=1}^{D} (x_d − µ)² )

  ∀d,  y_d = (x_d − µ) / σ

SLIDE 29

Although it gives slightly worse improvements than BN, it has the advantage of behaving similarly during train and test, and of processing samples individually.
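A minimal sketch of these formulas, not from the original slides: the statistics are computed per sample, across the D components, so the batch dimension plays no role.

import torch

x = torch.randn(16, 32)                            # B = 16 samples, D = 32 components
mu = x.mean(dim = 1, keepdim = True)
sigma = x.std(dim = 1, unbiased = False, keepdim = True)
y = (x - mu) / sigma

print(y.mean(dim = 1))                             # ~ 0 for every sample
print(y.std(dim = 1, unbiased = False))            # ~ 1 for every sample

# torch.nn.LayerNorm(32) implements the same idea, with an eps term and an
# optional learnable per-component affine transformation (gamma, beta).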

SLIDE 30

The end

SLIDE 31

References

  • J. L. Ba, J. R. Kiros, and G. E. Hinton. Layer normalization. CoRR, abs/1607.06450, 2016.
  • A. Hyvärinen and E. Oja. Independent component analysis: Algorithms and applications. Neural Networks, 13(4-5):411–430, 2000.
  • S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning (ICML), 2015.