SLIDE 1

Understanding Noise in Machine Learning

Runtian Zhai

School of Electronics Engineering and Computer Science
School of Mathematical Science (double major)
Peking University
zhairuntian@pku.edu.cn

July 16, 2019

A talk at UCLA.

SLIDE 2

Introduction: Noise is Everywhere

Noise is everywhere:
• Since as early as the 1920s, statisticians have been searching for ways to combat noise in collected data.
• In machine learning, the problem is more serious: training sets are labeled by humans, and humans make mistakes.
• In computer vision, many images are corrupted, blurred, or compressed.
• An even more dangerous kind of noise is adversarial examples: crafted perturbations that aim to fool a particular classifier.

SLIDE 3

Learning with Noise

People have proposed various ways to learn with noise:
• Some propose detection methods that find noisy samples in a dataset so they can be removed.
• Others suggest that even noisy samples can be useful. For instance, co-teaching helps networks learn on noisy datasets.
• Many defense methods have been proposed to fight adversarial examples. The most successful so far is adversarial training, i.e. training on adversarial examples generated on the fly.

SLIDE 4

Outline

1. How Noise Affects Training
   • Critical Learning Periods
   • Different Kinds of Noise are Different
   • Two Phases of Learning

2. Noise Fits More Slowly Than Clean Data
   • Zhang's Experiment and Its Explanation
   • Measuring Dataset Complexity
   • Detecting Noise

3. More Theoretical Approaches
   • Influence Function
   • Neural Tangent Kernel
   • Fourier Analysis

SLIDE 5

Critical Learning Periods

Critical Learning Periods in Deep Networks. Achille et al. (UCLA) [7], ICLR 2019.
In biology, the first several weeks after a baby animal's birth, known as the critical learning period, are critical for its intellectual development. The same is true in deep learning: if a network is trained on noisy images during the first several epochs, it can never reach high performance, even if it is trained on clean images later on.

SLIDE 6

Experiment I

To show that this biological behavior also exists in deep learning, the authors ran the following experiment: they trained an All-CNN on CIFAR-10. During the first N epochs the network was trained on noisy images; after that, it was trained on clean images for another 160 epochs. The noisy images were blurred: each 32 × 32 image was first downsampled to 8 × 8 and then upsampled back to 32 × 32 with bilinear interpolation.
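A minimal sketch of this blur operation, assuming PyTorch; the paper specifies bilinear upsampling, while using bilinear for the downsampling step as well is my assumption (function name illustrative):

```python
import torch
import torch.nn.functional as F

def blur_deficit(images: torch.Tensor) -> torch.Tensor:
    """Blur a batch of 32x32 images (shape (N, C, 32, 32)):
    downsample to 8x8, then upsample back with bilinear interpolation."""
    small = F.interpolate(images, size=(8, 8), mode="bilinear", align_corners=False)
    return F.interpolate(small, size=(32, 32), mode="bilinear", align_corners=False)
```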

SLIDE 7

Result: Early Deficit Has Irremediable Negative Effect

SLIDE 8

Experiment II

They trained the network on noisy images for 40 epochs starting from epoch N, and on clean images in all other epochs. These 40 epochs are called the deficit window. They tested how much test accuracy decreased for different choices of the window onset N.

SLIDE 9

Result: Early Epochs are More Critical

SLIDE 10

Different Kinds of Noise

The authors repeated the first experiment with different kinds of noise:
• Blur: 32 × 32 images downsampled to 8 × 8, then upsampled to 32 × 32 with bilinear interpolation
• Vertical flip: flip the images vertically
• Label permutation: use random labels
• Noise: replace all images with random noise
They also tested networks of different depths.

SLIDE 11

Different Kinds of Noise are Different

• For Noise, the effect is not as strong.
• For Vertical flip and Label permutation, the effect is very weak.
• The deeper the network, the stronger the effect.

SLIDE 12

Two Phases of Learning

The authors performed Fisher information analysis on the training process: they used the trace of the Fisher Information Matrix (FIM) to measure how much information the network had learned. Training has two phases:
• In Phase I, the FIM trace rises quickly, showing that the network is learning.
• In Phase II, the FIM trace drops dramatically (while performance is still improving), showing that the network starts to forget.
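A coarse Monte-Carlo sketch of such a measurement, assuming PyTorch; the exact estimator used in the paper may differ. Labels are drawn from the model's own predictive distribution, and each batch-mean gradient is treated as one sample of the score (an exact trace would need per-example gradients):

```python
import torch
import torch.nn.functional as F

def fim_trace_estimate(model, loader, device="cpu"):
    """Estimate tr(FIM) = E[||grad_theta log p(y|x)||^2] by sampling y
    from the model's predictive distribution. Uses the batch-mean
    gradient as one score sample, which is a coarse approximation."""
    model.eval()
    total, batches = 0.0, 0
    for x, _ in loader:
        x = x.to(device)
        logits = model(x)
        # Sample labels from the model's own predictive distribution.
        y = torch.distributions.Categorical(logits=logits).sample()
        loss = F.cross_entropy(logits, y)
        grads = torch.autograd.grad(loss, list(model.parameters()))
        total += sum((g ** 2).sum().item() for g in grads)
        batches += 1
    return total / batches
```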

SLIDE 13

Two Phases of Learning (cont.)

Many other papers [10, 11] also found two phases during training, from the optimization perspective. It is well known that noise is fit more slowly than clean data during training. Many recent papers argue that in Phase I the network fits clean data, while in Phase II it fits noise, so it only appears to be forgetting useful information. This also explains why early stopping works.

SLIDE 14

Outline

1. How Noise Affects Training
   • Critical Learning Periods
   • Different Kinds of Noise are Different
   • Two Phases of Learning

2. Noise Fits More Slowly Than Clean Data
   • Zhang's Experiment and Its Explanation
   • Measuring Dataset Complexity
   • Detecting Noise

3. More Theoretical Approaches
   • Influence Function
   • Neural Tangent Kernel
   • Fourier Analysis

SLIDE 15

Deep Networks Can Fit Random Labels

Understanding Deep Learning Requires Rethinking Generalization. Zhang et al. [5], ICLR 2017.
In this paper, the authors added many kinds of noise to CIFAR-10 (random labels, random pixels, Gaussian noise, etc.) and trained an Inception model on it. They found that deep networks fit noisy data easily, but doing so takes much longer than fitting clean data.

SLIDE 16

Explaining the Results

Fine-Grained Analysis of Optimization and Generalization for Overparameterized Two-Layer Neural Networks. Arora et al. [4], ICML 2019.
In this paper, the authors prove for an overparameterized two-layer fully-connected network that:
• GD (gradient descent) converges (achieves zero training loss) even on datasets with random labels;
• GD converges more slowly on random labels than on clean labels;
• label noise can harm generalization.

SLIDE 17

Basic Setting

A two-layer ReLU network with m neurons is

$$f_{W,a}(x) = \frac{1}{\sqrt{m}} \sum_{r=1}^{m} a_r \, \sigma(w_r^\top x)$$

where $x \in \mathbb{R}^d$ is the input, $W = (w_1, \dots, w_m) \in \mathbb{R}^{d \times m}$ is the weight of the first layer, and $a = (a_1, \dots, a_m)^\top \in \mathbb{R}^m$ is the weight of the second layer. Assume $\|x\|_2 = 1$ and $|y| \le 1$.

At initialization, $w_r(0) \sim N(0, \kappa^2 I)$ and $a_r \sim \mathrm{unif}(\{-1, 1\})$. Fix the second layer $a$ and train only the first layer $W$. Denote by $W(k)$ the value of $W$ at step $k$.

SLIDE 18

Trajectory Based Analysis

Use MSE (mean squared error) as the loss function:

$$\Phi(W) = \frac{1}{2} \sum_{i=1}^{n} \left(y_i - f_{W,a}(x_i)\right)^2$$

Let the trajectory of the network be $u = (u_1, \dots, u_n)^\top$, where $u_i = f_{W,a}(x_i)$. Then the loss is $\Phi(W) = \frac{1}{2}\|y - u\|_2^2$, where $y = (y_1, \dots, y_n)^\top$. Train with GD with learning rate $\eta$.

Define the Gram matrix $H^\infty$ by

$$H^\infty_{ij} = \mathbb{E}_{w \sim N(0, I)}\left[x_i^\top x_j \, \mathbb{I}\{w^\top x_i \ge 0,\; w^\top x_j \ge 0\}\right] = \frac{x_i^\top x_j \left(\pi - \arccos(x_i^\top x_j)\right)}{2\pi}, \quad \forall i, j \in [n]$$
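The closed form makes $H^\infty$ easy to compute. A minimal NumPy sketch, assuming the rows of X are unit-norm inputs (function name mine):

```python
import numpy as np

def gram_h_inf(X: np.ndarray) -> np.ndarray:
    """H^inf for unit-norm inputs X of shape (n, d):
    H_ij = x_i.x_j * (pi - arccos(x_i.x_j)) / (2*pi)."""
    G = np.clip(X @ X.T, -1.0, 1.0)   # inner products, clipped to guard arccos
    return G * (np.pi - np.arccos(G)) / (2.0 * np.pi)
```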

SLIDE 19

Main Theorem

Assumptions: the initial variance $\kappa^2$ and the learning rate $\eta$ are small enough, and the width $m$ is large enough.

Lemma: under these assumptions, during training the real trajectory $\{u(k)\}_{k=0}^{\infty}$ stays close to another sequence $\{\tilde{u}(k)\}_{k=0}^{\infty}$ with the linear update rule

$$\tilde{u}(k+1) = \tilde{u}(k) - \eta H^\infty \left(\tilde{u}(k) - y\right)$$

By analyzing the dynamics of $\tilde{u}(k)$ one can prove that

$$\Phi(W(k)) \approx \frac{1}{2} \left\| (I - \eta H^\infty)^k \, y \right\|_2^2$$

uniformly for all $k \ge 0$ with high probability. If $H^\infty$ is positive definite, then $\Phi(W(k)) \to 0$ as $k \to \infty$, i.e. GD always converges, even if $y$ is random.

SLIDE 20

Convergence Rate

Write the eigendecomposition $H^\infty = \sum_{i=1}^{n} \lambda_i v_i v_i^\top$. Then

$$\|y - \tilde{u}(k)\|_2^2 = \sum_{i=1}^{n} (1 - \eta \lambda_i)^{2k} \, (v_i^\top y)^2$$

Since $u(k)$ is very close to $\tilde{u}(k)$, the right-hand side can be used to estimate the convergence rate. For a set of labels $y$, if they align with the top eigenvectors (i.e. $(v_i^\top y)^2$ is large for large $\lambda_i$), then GD converges quickly; otherwise it converges more slowly.
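A small sketch of this prediction, assuming $\tilde{u}(0) = 0$ (approximately true under the small-initialization assumption); it evaluates the spectral formula directly:

```python
import numpy as np

def predicted_error_curve(H: np.ndarray, y: np.ndarray, eta: float, steps: int) -> np.ndarray:
    """||y - u~(k)||^2 = sum_i (1 - eta*lam_i)^(2k) * (v_i . y)^2 for k = 0..steps-1,
    from the eigendecomposition of the symmetric PSD matrix H."""
    lam, V = np.linalg.eigh(H)
    proj = (V.T @ y) ** 2             # alignment of labels with each eigenvector
    return np.array([((1 - eta * lam) ** (2 * k) * proj).sum() for k in range(steps)])
```

For the same H, labels aligned with the top eigenvectors give a curve that drops much faster than labels spread uniformly across eigenvectors, which is exactly the clean-versus-random contrast on the next slide.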

SLIDE 21

Experimental Result

The authors showed by experiment that clean labels align with the top eigenvectors almost perfectly, whereas random labels spread their mass across eigenvectors roughly at random. This explains why GD fits noisy data more slowly.

SLIDE 22

Analysis of Generalization

Theorem (informal version of Theorem 5.1). If the underlying data distribution $\mathcal{D}$ is non-degenerate and the training set is sampled i.i.d. from $\mathcal{D}$, then for any 1-Lipschitz loss function $\ell : \mathbb{R} \times \mathbb{R} \to [0, 1]$ such that $\ell(y, y) = 0$, with probability at least $1 - \delta$ over the random initialization and the training samples, the two-layer ReLU network $f_{W(k),a}$ trained by GD has population loss $L_{\mathcal{D}}(f_{W(k),a}) = \mathbb{E}_{(x,y) \sim \mathcal{D}}[\ell(f_{W(k),a}(x), y)]$ bounded as

$$L_{\mathcal{D}}(f_{W(k),a}) \le \sqrt{\frac{2\, y^\top (H^\infty)^{-1} y}{n}} + 3\sqrt{\frac{\log(6/\delta)}{2n}} + o(1)$$

SLIDE 23

Dataset Complexity

The bound above implies that $\sqrt{2\, y^\top (H^\infty)^{-1} y / n}$ can be viewed as a complexity measure of the data: the more complex a dataset, the harder it is for a deep model to fit it and to generalize on it. To measure the complexity of different kinds of noise, I ran several experiments on noisy CIFAR-10, measuring complexity with this metric.
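A minimal sketch of computing the metric, reusing the $H^\infty$ routine from the trajectory-based analysis above (function name mine):

```python
import numpy as np

def data_complexity(H: np.ndarray, y: np.ndarray) -> float:
    """Arora et al.'s complexity measure sqrt(2 * y^T (H^inf)^{-1} y / n)."""
    alpha = np.linalg.solve(H, y)     # solve H a = y; avoids the explicit inverse
    return float(np.sqrt(2.0 * (y @ alpha) / len(y)))
```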

SLIDE 24

My Experiment

Noise Type                     Complexity
No noise                       33.47
Random labels                  55.32
Same labels                     2.80
AE of a standard model         45.42
AE of an adv. trained model    41.61
Gaussian (σ = 0.02)            30.17
Gaussian (σ = 0.05)            22.59

I arbitrarily selected two classes of CIFAR-10 (cars and birds) and took 500 images from each class. I then normalized all images so that $\|x\|_2 \le 1$ and set the labels to $\{+1, -1\}$. Finally, I computed the complexity with the formula above.

SLIDE 25

My Experiment

For random labels, the dataset becomes more complex (see the table on Slide 24), so it is harder to optimize and generalize on. If all labels are the same (+1), the complexity is close to 0, so the dataset is very easy for a model to fit, as expected.

SLIDE 26

My Experiment

Adversarial examples (AE) also make the dataset more complex (see the table on Slide 24). I generated AE for a normally trained model and for an adversarially trained model; the AE of the normally trained model yield the more complex dataset.

SLIDE 27

My Experiment

However, the metric fails to capture the complexity of Gaussian noise: under this metric, the dataset with the larger Gaussian noise appears simpler (see the table on Slide 24).

SLIDE 28

Detecting Noise

Since noise is fit more slowly, we can detect it in a simple way: the more slowly a sample is fit, the more likely it is to be noise. A better method is co-teaching [12]: two networks F1 and F2 are trained simultaneously and teach one another, with F1 training only on samples that have small loss under F2, and vice versa. Such samples are unlikely to be noise, because each network fits clean data faster than noise.
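A minimal sketch of one co-teaching update, assuming PyTorch classifiers; the keep-ratio schedule and other details of Han et al. [12] are omitted:

```python
import torch
import torch.nn.functional as F

def coteach_step(f1, f2, opt1, opt2, x, y, keep_ratio=0.8):
    """Each network trains only on the fraction of the batch its peer
    fits best (smallest loss); clean samples tend to have small loss
    because clean data is fit faster than noise."""
    with torch.no_grad():
        l1 = F.cross_entropy(f1(x), y, reduction="none")  # per-sample losses
        l2 = F.cross_entropy(f2(x), y, reduction="none")
    k = max(1, int(keep_ratio * len(y)))
    idx1 = torch.argsort(l2)[:k]      # f1 trains on what f2 finds easy
    idx2 = torch.argsort(l1)[:k]      # and vice versa
    for f, opt, idx in ((f1, opt1, idx1), (f2, opt2, idx2)):
        opt.zero_grad()
        F.cross_entropy(f(x[idx]), y[idx]).backward()
        opt.step()
```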

SLIDE 29

Self Distillation

Inspired by co-teaching, Dong et al.* suggest that a model can teach itself, using its own previous checkpoints: at epoch N, the network trains only on samples that had small loss at epoch N − n₀. In addition, several ICML 2019 papers [1, 2, 3] address how to detect noisy data in the training set; all of them exploit the fact that noise is fit more slowly than clean data.

* View this paper at http://www.runtianz.cn/doc/AIR.pdf. The link expires on Friday. This paper is under review. Please do not distribute.

SLIDE 30

Outline

1. How Noise Affects Training
   • Critical Learning Periods
   • Different Kinds of Noise are Different
   • Two Phases of Learning

2. Noise Fits More Slowly Than Clean Data
   • Zhang's Experiment and Its Explanation
   • Measuring Dataset Complexity
   • Detecting Noise

3. More Theoretical Approaches
   • Influence Function
   • Neural Tangent Kernel
   • Fourier Analysis

SLIDE 31

Influence Function: Detecting Outliers

Understanding Black-box Predictions via Influence Functions. Koh et al. [13], best paper at ICML 2017.
The influence function measures how much influence a training sample has on the final classifier. It is a classical technique in statistics. Samples with large influence can be regarded as outliers. Moreover, the influence function can be computed efficiently with second-order optimization techniques.

SLIDE 32

Influence Function

Let the training samples be $z_i = (x_i, y_i) \in \mathcal{X} \times \mathcal{Y}$, $i = 1, \dots, n$, and let the model $F_\theta$ be parameterized by $\theta$. The empirical risk is $\frac{1}{n}\sum_{i=1}^{n} L(z_i, \theta)$, and the ERM (empirical risk minimizer) is

$$\hat{\theta} = \arg\min_\theta \frac{1}{n}\sum_{i=1}^{n} L(z_i, \theta)$$

Consider the change in $\hat{\theta}$ when a point $z$ is removed from the training set. Let the ERM after removing $z$ be $\hat{\theta}_{-z}$. Classical statistics tells us that

$$\hat{\theta}_{-z} - \hat{\theta} \approx -\frac{1}{n} I(z), \qquad I(z) = \left.\frac{d\hat{\theta}_{\epsilon,z}}{d\epsilon}\right|_{\epsilon=0} = -H_{\hat{\theta}}^{-1} \nabla_\theta L(z, \hat{\theta})$$

where $H_{\hat{\theta}} = \frac{1}{n}\sum_{i=1}^{n} \nabla^2_\theta L(z_i, \hat{\theta})$ is the Hessian.
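A dense-Hessian sketch of $I(z)$, workable only for small models; Koh et al. [13] instead compute the Hessian-inverse-vector product implicitly (e.g. with conjugate gradients) rather than forming $H_{\hat{\theta}}$:

```python
import numpy as np

def influence(grad_z: np.ndarray, hessian: np.ndarray) -> np.ndarray:
    """I(z) = -H^{-1} grad_theta L(z, theta_hat), where `hessian` is the
    empirical-risk Hessian at the ERM and `grad_z` the loss gradient at z.
    Removing z shifts the ERM by roughly -(1/n) * I(z)."""
    return -np.linalg.solve(hessian, grad_z)
```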

SLIDE 33

Influence Function (cont.)

$I(z)$ is the influence function of $z$: the greater its norm, the more influence $z$ has on the model. Furthermore, if we perturb $z = (x, y)$ to $z_\delta = (x + \delta, y)$ and the resulting ERM is $\hat{\theta}_{z_\delta,-z}$, then

$$\hat{\theta}_{z_\delta,-z} - \hat{\theta} \approx -\frac{1}{n}\left(I(z_\delta) - I(z)\right)$$

When $\delta$ is a certain kind of noise, we can use this formula to estimate the complexity of that noise.

SLIDE 34

NTK: A Powerful Tool

Neural Tangent Kernel: Convergence and Generalization in Neural Networks. Jacot et al. [6], NIPS 2018.
The authors prove that in the infinite-width limit, a fully-connected network trained with GD evolves along the kernel gradient with respect to the NTK Θ, and that Θ converges in probability to a deterministic limiting kernel Θ∞. Consequently, the kernel principal components of Θ∞ with the highest eigenvalues are fit first; components with low eigenvalues can be regarded as noise.

SLIDE 35

Fourier Analysis

Many recent papers propose to analyze the effect of noise using Fourier analysis. For example:
• On the Spectral Bias of Neural Networks. Rahaman et al. [8], ICML 2019.
• A Fourier Perspective on Model Robustness in Computer Vision. Yin et al. [9], arXiv:1906.08988.

SLIDE 36

Fourier Analysis

Some results from Fourier analysis:
• Neural networks are prone to learning low-frequency functions, i.e. functions without local fluctuations.
• Both Gaussian noise and adversarial examples are high-frequency noise (see the sketch below), which is why normally trained networks are so vulnerable to them.
• Common defense methods such as training with Gaussian noise and adversarial training improve robustness to high-frequency noise, but reduce robustness to low-frequency noise.
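A small sketch that makes the high-frequency claim testable, assuming a grayscale image as a 2-D array; the cutoff radius is an illustrative choice of mine, not from the papers:

```python
import numpy as np

def high_freq_fraction(img: np.ndarray, radius: int = 8) -> float:
    """Fraction of spectral energy outside a centered low-frequency disk.
    Gaussian noise concentrates its energy outside the disk; natural
    images concentrate theirs inside it."""
    spec = np.abs(np.fft.fftshift(np.fft.fft2(img))) ** 2
    h, w = spec.shape
    yy, xx = np.ogrid[:h, :w]
    low = (yy - h // 2) ** 2 + (xx - w // 2) ** 2 <= radius ** 2
    return float(spec[~low].sum() / spec.sum())
```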

SLIDE 37

Take-aways

• Deep learning has critical learning periods. Noise during these periods degrades a network's performance significantly, while noise in other periods has a much weaker effect.
• There are two phases during training: in Phase I the network learns; in Phase II it forgets. The model achieves optimal performance if training stops early, at the phase transition.
• Neural networks fit noise more slowly than clean data, and we can use this fact to detect noise, as in co-teaching and self-distillation.
• Different kinds of noise have different complexities and different levels of impact on networks. Several metrics can measure the complexity of noise, such as Arora's metric and influence functions.

SLIDE 38

Open Problem: Finding Relevant Data

Background: in semi-supervised learning (SSL), we have a huge amount of unlabeled data, and we only want to train on data that looks similar to the limited labeled data we have. The question: how do we find data relevant to our task in a huge dataset?
Details: CIFAR-10 and CIFAR-100 are subsets of a large unlabeled dataset called 80 Million Tiny Images. I would like to select the 500k images from it that are most relevant to CIFAR classification. To test performance, I will run an SSL algorithm and examine its results.
Suggestions: we can try Arora's measure of data complexity, influence functions, NTK analysis, Fourier analysis, etc. Please contact me if you are interested in solving this problem.

SLIDE 39

Any Questions?

SLIDE 40

Thank you.

SLIDE 41

References

[1] Thulasidasan et al. (2019). Combating Label Noise in Deep Learning Using Abstention. ICML 2019.
[2] Shen et al. (2019). Learning with Bad Training Data via Iterative Trimmed Loss Minimization. ICML 2019.
[3] Chen et al. (2019). Understanding and Utilizing Deep Neural Networks Trained with Noisy Labels. ICML 2019.

SLIDE 42

References

[4] Arora et al. (2019). Fine-Grained Analysis of Optimization and Generalization for Overparameterized Two-Layer Neural Networks. ICML 2019.
[5] Zhang et al. (2017). Understanding Deep Learning Requires Rethinking Generalization. ICLR 2017.
[6] Jacot et al. (2018). Neural Tangent Kernel: Convergence and Generalization in Neural Networks. NIPS 2018.

SLIDE 43

References

[7] Achille et al. (2019). Critical Learning Periods in Deep Networks. ICLR 2019.
[8] Rahaman et al. (2019). On the Spectral Bias of Neural Networks. ICML 2019.
[9] Yin et al. (2019). A Fourier Perspective on Model Robustness in Computer Vision. arXiv:1906.08988.

SLIDE 44

References

[10] Shwartz-Ziv et al. (2017). Opening the Black Box of Deep Neural Networks via Information. arXiv:1703.00810.
[11] Li et al. (2017). Convergence Analysis of Two-layer Neural Networks with ReLU Activation. NIPS 2017.
[12] Han et al. (2018). Co-teaching: Robust Training of Deep Neural Networks with Extremely Noisy Labels. NIPS 2018.

SLIDE 45

References

[13] Koh et al. (2017). Understanding Black-box Predictions via Influence Functions. ICML 2017.
