SLIDE 1

Understanding Noise in Machine Learning

Runtian Zhai

School of Electronics Engineering and Computer Science
School of Mathematical Science (double major)
Peking University
zhairuntian@pku.edu.cn

July 16, 2019

A talk at UCLA.

SLIDE 2

Introduction: Noise is Everywhere

Noise is everywhere:
• Since as early as the 1920s, statisticians have been searching for ways to combat noise in collected data.
• In machine learning, the problem is more serious: training sets are labeled by humans, and humans make mistakes.
• In computer vision, many images are corrupted, blurred, or compressed.
• An even more dangerous kind of noise is adversarial examples: crafted perturbations that aim to fool a particular classifier.

SLIDE 3

Learning with Noise

People have proposed various ways to learn with noise:
• Some propose detection methods that find noisy samples in a dataset so they can be removed.
• Others suggest that even noisy samples can be useful. For instance, co-teaching helps networks learn on noisy datasets.
• Many defense methods have been proposed to fight adversarial examples. The most successful so far is adversarial training, i.e. training on adversarial examples generated on the fly.

SLIDE 4

Outline

1. How Noise Affects Training
   • Critical Learning Periods
   • Different Kinds of Noise are Different
   • Two Phases of Learning

2. Noise Fits More Slowly Than Clean Data
   • Zhang's Experiment and Its Explanation
   • Measuring Dataset Complexity
   • Detecting Noise

3. More Theoretical Approaches
   • Influence Function
   • Neural Tangent Kernel
   • Fourier Analysis

SLIDE 5

Critical Learning Periods

Critical Learning Periods in Deep Networks. Achille et al. (UCLA) [7], ICLR 2019.
In biology, the first several weeks after a baby animal's birth, known as the critical learning period, are critical for its intellectual development. The same is true in deep learning: if a network is trained on noisy images during the first several epochs, it can never reach high performance, even if it is trained on clean images later on.

SLIDE 6

Experiment I

To show that this biological behavior also exists in deep learning, the authors ran the following experiment: they trained an All-CNN on CIFAR-10. During the first N epochs the network was trained on noisy images; after that, it was trained on clean images for another 160 epochs. The noisy images were blurred: each 32 × 32 image was first downsampled to 8 × 8 and then upsampled back to 32 × 32 with bilinear interpolation.
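A minimal sketch of this blur operation, assuming PyTorch; the paper specifies bilinear upsampling, while using bilinear for the downsampling step as well is my assumption (function name illustrative):

```python
import torch
import torch.nn.functional as F

def blur_deficit(images: torch.Tensor) -> torch.Tensor:
    """Blur a batch of 32x32 images (shape (N, C, 32, 32)):
    downsample to 8x8, then upsample back with bilinear interpolation."""
    small = F.interpolate(images, size=(8, 8), mode="bilinear", align_corners=False)
    return F.interpolate(small, size=(32, 32), mode="bilinear", align_corners=False)
```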

SLIDE 7

Result: Early Deficit Has Irremediable Negative Effect

SLIDE 8

Experiment II

They trained the network on noisy images for 40 epochs starting from epoch N, and on clean images in all other epochs. These 40 epochs are called the deficit window. They tested how much test accuracy decreased for different choices of the window onset N.

SLIDE 9

Result: Early Epochs are More Critical

SLIDE 10

Different Kinds of Noise

The authors repeated the first experiment with different kinds of noise:
• Blur: 32 × 32 images downsampled to 8 × 8, then upsampled to 32 × 32 with bilinear interpolation
• Vertical flip: flip the images vertically
• Label permutation: use random labels
• Noise: replace all images with random noise
They also tested networks of different depths.

SLIDE 11

Different Kinds of Noise are Different

• For Noise, the effect is not as strong.
• For Vertical flip and Label permutation, the effect is very weak.
• The deeper the network, the stronger the effect.

SLIDE 12

Two Phases of Learning

The authors performed Fisher information analysis on the training process: they used the trace of the Fisher Information Matrix (FIM) to measure how much information the network had learned. Training has two phases:
• In Phase I, the FIM trace rises quickly, showing that the network is learning.
• In Phase II, the FIM trace drops dramatically (while performance is still improving), showing that the network starts to forget.
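A coarse Monte-Carlo sketch of such a measurement, assuming PyTorch; the exact estimator used in the paper may differ. Labels are drawn from the model's own predictive distribution, and each batch-mean gradient is treated as one sample of the score (an exact trace would need per-example gradients):

```python
import torch
import torch.nn.functional as F

def fim_trace_estimate(model, loader, device="cpu"):
    """Estimate tr(FIM) = E[||grad_theta log p(y|x)||^2] by sampling y
    from the model's predictive distribution. Uses the batch-mean
    gradient as one score sample, which is a coarse approximation."""
    model.eval()
    total, batches = 0.0, 0
    for x, _ in loader:
        x = x.to(device)
        logits = model(x)
        # Sample labels from the model's own predictive distribution.
        y = torch.distributions.Categorical(logits=logits).sample()
        loss = F.cross_entropy(logits, y)
        grads = torch.autograd.grad(loss, list(model.parameters()))
        total += sum((g ** 2).sum().item() for g in grads)
        batches += 1
    return total / batches
```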

SLIDE 13

Two Phases of Learning (cont.)

Many other papers [10, 11] also found two phases during training, from the optimization perspective. It is well known that noise is fit more slowly than clean data during training. Many recent papers argue that in Phase I the network fits clean data, while in Phase II it fits noise, so it only appears to be forgetting useful information. This also explains why early stopping works.

SLIDE 14

Outline

1. How Noise Affects Training
   • Critical Learning Periods
   • Different Kinds of Noise are Different
   • Two Phases of Learning

2. Noise Fits More Slowly Than Clean Data
   • Zhang's Experiment and Its Explanation
   • Measuring Dataset Complexity
   • Detecting Noise

3. More Theoretical Approaches
   • Influence Function
   • Neural Tangent Kernel
   • Fourier Analysis

SLIDE 15

Deep Networks Can Fit Random Labels

Understanding Deep Learning Requires Rethinking Generalization. Zhang et al. [5], ICLR 2017.
In this paper, the authors added many kinds of noise to CIFAR-10 (random labels, random pixels, Gaussian noise, etc.) and trained an Inception model on it. They found that deep networks fit noisy data easily, but doing so takes much longer than fitting clean data.

SLIDE 16

Explaining the Results

Fine-Grained Analysis of Optimization and Generalization for Overparameterized Two-Layer Neural Networks. Arora et al. [4], ICML 2019.
In this paper, the authors prove for an overparameterized two-layer fully-connected network that:
• GD (gradient descent) converges (achieves zero training loss) even on datasets with random labels;
• GD converges more slowly on random labels than on clean labels;
• label noise can harm generalization.

SLIDE 17

Basic Setting

A two-layer ReLU network with m neurons is

$$f_{W,a}(x) = \frac{1}{\sqrt{m}} \sum_{r=1}^{m} a_r \, \sigma(w_r^\top x)$$

where $x \in \mathbb{R}^d$ is the input, $W = (w_1, \dots, w_m) \in \mathbb{R}^{d \times m}$ is the weight of the first layer, and $a = (a_1, \dots, a_m)^\top \in \mathbb{R}^m$ is the weight of the second layer. Assume $\|x\|_2 = 1$ and $|y| \le 1$.

At initialization, $w_r(0) \sim N(0, \kappa^2 I)$ and $a_r \sim \mathrm{unif}(\{-1, 1\})$. Fix the second layer $a$ and train only the first layer $W$. Denote by $W(k)$ the value of $W$ at step $k$.

SLIDE 18

Trajectory Based Analysis

Use MSE (mean squared error) as the loss function:

$$\Phi(W) = \frac{1}{2} \sum_{i=1}^{n} \left(y_i - f_{W,a}(x_i)\right)^2$$

Let the trajectory of the network be $u = (u_1, \dots, u_n)^\top$, where $u_i = f_{W,a}(x_i)$. Then the loss is $\Phi(W) = \frac{1}{2}\|y - u\|_2^2$, where $y = (y_1, \dots, y_n)^\top$. Train with GD with learning rate $\eta$.

Define the Gram matrix $H^\infty$ by

$$H^\infty_{ij} = \mathbb{E}_{w \sim N(0, I)}\left[x_i^\top x_j \, \mathbb{I}\{w^\top x_i \ge 0,\; w^\top x_j \ge 0\}\right] = \frac{x_i^\top x_j \left(\pi - \arccos(x_i^\top x_j)\right)}{2\pi}, \quad \forall i, j \in [n]$$
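The closed form makes $H^\infty$ easy to compute. A minimal NumPy sketch, assuming the rows of X are unit-norm inputs (function name mine):

```python
import numpy as np

def gram_h_inf(X: np.ndarray) -> np.ndarray:
    """H^inf for unit-norm inputs X of shape (n, d):
    H_ij = x_i.x_j * (pi - arccos(x_i.x_j)) / (2*pi)."""
    G = np.clip(X @ X.T, -1.0, 1.0)   # inner products, clipped to guard arccos
    return G * (np.pi - np.arccos(G)) / (2.0 * np.pi)
```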

SLIDE 19

Main Theorem

Assumptions: the initial variance $\kappa^2$ and the learning rate $\eta$ are small enough, and the width $m$ is large enough.

Lemma: under these assumptions, during training the real trajectory $\{u(k)\}_{k=0}^{\infty}$ stays close to another sequence $\{\tilde{u}(k)\}_{k=0}^{\infty}$ with the linear update rule

$$\tilde{u}(k+1) = \tilde{u}(k) - \eta H^\infty \left(\tilde{u}(k) - y\right)$$

By analyzing the dynamics of $\tilde{u}(k)$ one can prove that

$$\Phi(W(k)) \approx \frac{1}{2} \left\| (I - \eta H^\infty)^k \, y \right\|_2^2$$

uniformly for all $k \ge 0$ with high probability. If $H^\infty$ is positive definite, then $\Phi(W(k)) \to 0$ as $k \to \infty$, i.e. GD always converges, even if $y$ is random.

SLIDE 20

Convergence Rate

Write the eigendecomposition $H^\infty = \sum_{i=1}^{n} \lambda_i v_i v_i^\top$. Then

$$\|y - \tilde{u}(k)\|_2^2 = \sum_{i=1}^{n} (1 - \eta \lambda_i)^{2k} \, (v_i^\top y)^2$$

Since $u(k)$ is very close to $\tilde{u}(k)$, the right-hand side can be used to estimate the convergence rate. For a set of labels $y$, if they align with the top eigenvectors (i.e. $(v_i^\top y)^2$ is large for large $\lambda_i$), then GD converges quickly; otherwise it converges more slowly.
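A small sketch of this prediction, assuming $\tilde{u}(0) = 0$ (approximately true under the small-initialization assumption); it evaluates the spectral formula directly:

```python
import numpy as np

def predicted_error_curve(H: np.ndarray, y: np.ndarray, eta: float, steps: int) -> np.ndarray:
    """||y - u~(k)||^2 = sum_i (1 - eta*lam_i)^(2k) * (v_i . y)^2 for k = 0..steps-1,
    from the eigendecomposition of the symmetric PSD matrix H."""
    lam, V = np.linalg.eigh(H)
    proj = (V.T @ y) ** 2             # alignment of labels with each eigenvector
    return np.array([((1 - eta * lam) ** (2 * k) * proj).sum() for k in range(steps)])
```

For the same H, labels aligned with the top eigenvectors give a curve that drops much faster than labels spread uniformly across eigenvectors, which is exactly the clean-versus-random contrast on the next slide.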

SLIDE 21

Experimental Result

The authors showed by experiment that clean labels align with the top eigenvectors almost perfectly, whereas random labels spread their mass across eigenvectors roughly at random. This explains why GD fits noisy data more slowly.

SLIDE 22

Analysis of Generalization

Theorem (informal version of Theorem 5.1). If the underlying data distribution $\mathcal{D}$ is non-degenerate and the training set is sampled i.i.d. from $\mathcal{D}$, then for any 1-Lipschitz loss function $\ell : \mathbb{R} \times \mathbb{R} \to [0, 1]$ such that $\ell(y, y) = 0$, with probability at least $1 - \delta$ over the random initialization and the training samples, the two-layer ReLU network $f_{W(k),a}$ trained by GD has population loss $L_{\mathcal{D}}(f_{W(k),a}) = \mathbb{E}_{(x,y) \sim \mathcal{D}}[\ell(f_{W(k),a}(x), y)]$ bounded as

$$L_{\mathcal{D}}(f_{W(k),a}) \le \sqrt{\frac{2\, y^\top (H^\infty)^{-1} y}{n}} + 3\sqrt{\frac{\log(6/\delta)}{2n}} + o(1)$$

SLIDE 23

Dataset Complexity

The bound above implies that $\sqrt{2\, y^\top (H^\infty)^{-1} y / n}$ can be viewed as a complexity measure of the data: the more complex a dataset, the harder it is for a deep model to fit it and to generalize on it. To measure the complexity of different kinds of noise, I ran several experiments on noisy CIFAR-10, measuring complexity with this metric.
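A minimal sketch of computing the metric, reusing the $H^\infty$ routine from the trajectory-based analysis above (function name mine):

```python
import numpy as np

def data_complexity(H: np.ndarray, y: np.ndarray) -> float:
    """Arora et al.'s complexity measure sqrt(2 * y^T (H^inf)^{-1} y / n)."""
    alpha = np.linalg.solve(H, y)     # solve H a = y; avoids the explicit inverse
    return float(np.sqrt(2.0 * (y @ alpha) / len(y)))
```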

SLIDE 24

My Experiment

Noise Type                     Complexity
No noise                       33.47
Random labels                  55.32
Same labels                     2.80
AE of a standard model         45.42
AE of an adv. trained model    41.61
Gaussian (σ = 0.02)            30.17
Gaussian (σ = 0.05)            22.59

I arbitrarily selected two classes of CIFAR-10 (cars and birds) and took 500 images from each class. I then normalized all images so that $\|x\|_2 \le 1$ and set the labels to $\{+1, -1\}$. Finally, I computed the complexity with the formula above.

SLIDE 25

My Experiment

For random labels, the dataset becomes more complex (see the table on Slide 24), so it is harder to optimize and generalize on. If all labels are the same (+1), the complexity is close to 0, so the dataset is very easy for a model to fit, as expected.

SLIDE 26

My Experiment

Adversarial examples (AE) also make the dataset more complex (see the table on Slide 24). I generated AE for a normally trained model and for an adversarially trained model; the AE of the normally trained model yield the more complex dataset.

SLIDE 27

My Experiment

However, the metric fails to capture the complexity of Gaussian noise: under this metric, the dataset with the larger Gaussian noise appears simpler (see the table on Slide 24).

SLIDE 28

Detecting Noise

Since noise is fit more slowly, we can detect it in a simple way: the more slowly a sample is fit, the more likely it is to be noise. A better method is co-teaching [12]: two networks F1 and F2 are trained simultaneously and teach one another, with F1 training only on samples that have small loss under F2, and vice versa. Such samples are unlikely to be noise, because each network fits clean data faster than noise.
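A minimal sketch of one co-teaching update, assuming PyTorch classifiers; the keep-ratio schedule and other details of Han et al. [12] are omitted:

```python
import torch
import torch.nn.functional as F

def coteach_step(f1, f2, opt1, opt2, x, y, keep_ratio=0.8):
    """Each network trains only on the fraction of the batch its peer
    fits best (smallest loss); clean samples tend to have small loss
    because clean data is fit faster than noise."""
    with torch.no_grad():
        l1 = F.cross_entropy(f1(x), y, reduction="none")  # per-sample losses
        l2 = F.cross_entropy(f2(x), y, reduction="none")
    k = max(1, int(keep_ratio * len(y)))
    idx1 = torch.argsort(l2)[:k]      # f1 trains on what f2 finds easy
    idx2 = torch.argsort(l1)[:k]      # and vice versa
    for f, opt, idx in ((f1, opt1, idx1), (f2, opt2, idx2)):
        opt.zero_grad()
        F.cross_entropy(f(x[idx]), y[idx]).backward()
        opt.step()
```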

SLIDE 29

Self Distillation

Inspired by co-teaching, Dong et al.* suggest that a model can teach itself, using its own previous checkpoints: at epoch N, the network trains only on samples that had small loss at epoch N − n₀. In addition, several ICML 2019 papers [1, 2, 3] address how to detect noisy data in the training set; all of them exploit the fact that noise is fit more slowly than clean data.

* View this paper at http://www.runtianz.cn/doc/AIR.pdf. The link expires on Friday. This paper is under review. Please do not distribute.

SLIDE 30

Outline

1. How Noise Affects Training
   • Critical Learning Periods
   • Different Kinds of Noise are Different
   • Two Phases of Learning

2. Noise Fits More Slowly Than Clean Data
   • Zhang's Experiment and Its Explanation
   • Measuring Dataset Complexity
   • Detecting Noise

3. More Theoretical Approaches
   • Influence Function
   • Neural Tangent Kernel
   • Fourier Analysis

SLIDE 31

Influence Function: Detecting Outliers

Understanding Black-box Predictions via Influence Functions. Koh et al. [13], best paper at ICML 2017.
The influence function measures how much influence a training sample has on the final classifier. It is a classical technique in statistics. Samples with large influence can be regarded as outliers. Moreover, the influence function can be computed efficiently with second-order optimization techniques.

SLIDE 32

Influence Function

Let the training samples be $z_i = (x_i, y_i) \in \mathcal{X} \times \mathcal{Y}$, $i = 1, \dots, n$, and let the model $F_\theta$ be parameterized by $\theta$. The empirical risk is $\frac{1}{n}\sum_{i=1}^{n} L(z_i, \theta)$, and the ERM (empirical risk minimizer) is

$$\hat{\theta} = \arg\min_\theta \frac{1}{n}\sum_{i=1}^{n} L(z_i, \theta)$$

Consider the change in $\hat{\theta}$ when a point $z$ is removed from the training set. Let the ERM after removing $z$ be $\hat{\theta}_{-z}$. Classical statistics tells us that

$$\hat{\theta}_{-z} - \hat{\theta} \approx -\frac{1}{n} I(z), \qquad I(z) = \left.\frac{d\hat{\theta}_{\epsilon,z}}{d\epsilon}\right|_{\epsilon=0} = -H_{\hat{\theta}}^{-1} \nabla_\theta L(z, \hat{\theta})$$

where $H_{\hat{\theta}} = \frac{1}{n}\sum_{i=1}^{n} \nabla^2_\theta L(z_i, \hat{\theta})$ is the Hessian.
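A dense-Hessian sketch of $I(z)$, workable only for small models; Koh et al. [13] instead compute the Hessian-inverse-vector product implicitly (e.g. with conjugate gradients) rather than forming $H_{\hat{\theta}}$:

```python
import numpy as np

def influence(grad_z: np.ndarray, hessian: np.ndarray) -> np.ndarray:
    """I(z) = -H^{-1} grad_theta L(z, theta_hat), where `hessian` is the
    empirical-risk Hessian at the ERM and `grad_z` the loss gradient at z.
    Removing z shifts the ERM by roughly -(1/n) * I(z)."""
    return -np.linalg.solve(hessian, grad_z)
```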

SLIDE 33

Influence Function (cont.)

$I(z)$ is the influence function of $z$: the greater its norm, the more influence $z$ has on the model. Furthermore, if we perturb $z = (x, y)$ to $z_\delta = (x + \delta, y)$ and the resulting ERM is $\hat{\theta}_{z_\delta,-z}$, then

$$\hat{\theta}_{z_\delta,-z} - \hat{\theta} \approx -\frac{1}{n}\left(I(z_\delta) - I(z)\right)$$

When $\delta$ is a certain kind of noise, we can use this formula to estimate the complexity of that noise.

SLIDE 34

NTK: A Powerful Tool

Neural Tangent Kernel: Convergence and Generalization in Neural Networks. Jacot et al. [6], NIPS 2018.
The authors prove that in the infinite-width limit, a fully-connected network trained with GD evolves along the kernel gradient with respect to the NTK Θ, and that Θ converges in probability to a deterministic limiting kernel Θ∞. Consequently, the kernel principal components of Θ∞ with the highest eigenvalues are fit first; components with low eigenvalues can be regarded as noise.

SLIDE 35

Fourier Analysis

Many recent papers propose to analyze the effect of noise using Fourier analysis. For example:
• On the Spectral Bias of Neural Networks. Rahaman et al. [8], ICML 2019.
• A Fourier Perspective on Model Robustness in Computer Vision. Yin et al. [9], arXiv:1906.08988.

SLIDE 36

Fourier Analysis

Some results from Fourier analysis:
• Neural networks are prone to learning low-frequency functions, i.e. functions without local fluctuations.
• Both Gaussian noise and adversarial examples are high-frequency noise (see the sketch below), which is why normally trained networks are so vulnerable to them.
• Common defense methods such as training with Gaussian noise and adversarial training improve robustness to high-frequency noise, but reduce robustness to low-frequency noise.
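A small sketch that makes the high-frequency claim testable, assuming a grayscale image as a 2-D array; the cutoff radius is an illustrative choice of mine, not from the papers:

```python
import numpy as np

def high_freq_fraction(img: np.ndarray, radius: int = 8) -> float:
    """Fraction of spectral energy outside a centered low-frequency disk.
    Gaussian noise concentrates its energy outside the disk; natural
    images concentrate theirs inside it."""
    spec = np.abs(np.fft.fftshift(np.fft.fft2(img))) ** 2
    h, w = spec.shape
    yy, xx = np.ogrid[:h, :w]
    low = (yy - h // 2) ** 2 + (xx - w // 2) ** 2 <= radius ** 2
    return float(spec[~low].sum() / spec.sum())
```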

SLIDE 37

Take-aways

• Deep learning has critical learning periods. Noise during these periods degrades a network's performance significantly, while noise in other periods has a much weaker effect.
• There are two phases during training: in Phase I the network learns; in Phase II it forgets. The model achieves optimal performance if training stops early, at the phase transition.
• Neural networks fit noise more slowly than clean data, and we can use this fact to detect noise, as in co-teaching and self-distillation.
• Different kinds of noise have different complexities and different levels of impact on networks. Several metrics can measure the complexity of noise, such as Arora's metric and influence functions.

SLIDE 38

Open Problem: Finding Relevant Data

Background: in semi-supervised learning (SSL), we have a huge amount of unlabeled data, and we only want to train on data that looks similar to the limited labeled data we have. The question: how do we find data relevant to our task in a huge dataset?
Details: CIFAR-10 and CIFAR-100 are subsets of a large unlabeled dataset called 80 Million Tiny Images. I would like to select the 500k images from it that are most relevant to CIFAR classification. To test performance, I will run an SSL algorithm and examine its results.
Suggestions: we can try Arora's measure of data complexity, influence functions, NTK analysis, Fourier analysis, etc. Please contact me if you are interested in solving this problem.

SLIDE 39

Any Questions?

SLIDE 40

Thank you.

SLIDE 41

References

[1] Thulasidasan et al. (2019). Combating Label Noise in Deep Learning Using Abstention. ICML 2019.
[2] Shen et al. (2019). Learning with Bad Training Data via Iterative Trimmed Loss Minimization. ICML 2019.
[3] Chen et al. (2019). Understanding and Utilizing Deep Neural Networks Trained with Noisy Labels. ICML 2019.

SLIDE 42

References

[4] Arora et al. (2019). Fine-Grained Analysis of Optimization and Generalization for Overparameterized Two-Layer Neural Networks. ICML 2019.
[5] Zhang et al. (2017). Understanding Deep Learning Requires Rethinking Generalization. ICLR 2017.
[6] Jacot et al. (2018). Neural Tangent Kernel: Convergence and Generalization in Neural Networks. NIPS 2018.

SLIDE 43

References

[7] Achille et al. (2019). Critical Learning Periods in Deep Networks. ICLR 2019.
[8] Rahaman et al. (2019). On the Spectral Bias of Neural Networks. ICML 2019.
[9] Yin et al. (2019). A Fourier Perspective on Model Robustness in Computer Vision. arXiv:1906.08988.

SLIDE 44

References

[10] Shwartz-Ziv et al. (2017). Opening the Black Box of Deep Neural Networks via Information. arXiv:1703.00810.
[11] Li et al. (2017). Convergence Analysis of Two-layer Neural Networks with ReLU Activation. NIPS 2017.
[12] Han et al. (2018). Co-teaching: Robust Training of Deep Neural Networks with Extremely Noisy Labels. NIPS 2018.

SLIDE 45

References

[13] Koh et al. (2017). Understanding Black-box Predictions via Influence Functions. ICML 2017.
