Learning To Stop While Learning To Predict - PowerPoint PPT Presentation
SLIDE 1

Learning To Stop While Learning To Predict

Xinshi Chen1, Hanjun Dai2, Yu Li3, Xin Gao3, Le Song1,4

1Georgia Tech, 2Google Brain, 3KAUST, 4Ant Financial

ICML 2020

SLIDE 2

Dynamic Depth

π’šπŸ π’šπŸ‘

stop? stop, output π’šπŸ“

π’šπŸ’

no

stop?

no

stop?

no

π’šπŸ“

stop?

yes

π’šπŸ π’šπŸ‘

stop? stop, output π’šπŸ‘

no

stop?

yes stop at different depths for different input samples.

stopped depth=4 stopped depth=2

5-minute Core Message
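Conceptually, dynamic-depth inference is a loop that runs one layer at a time and halts the first time the stopping policy fires. A minimal sketch, assuming a generic list of layer functions and a stop policy that returns a probability (all names here are illustrative, not the paper's code):

```python
def dynamic_depth_forward(y0, layer_fns, stop_policy, threshold=0.5):
    """Run layers sequentially; halt the first time the stop policy fires.

    layer_fns   : list of U functions, y_u = layer_fns[u](y_{u-1})
    stop_policy : maps (input, current state) -> stop probability in [0, 1]
    Returns (output_state, stopped_depth).
    """
    y = y0
    for u, g in enumerate(layer_fns, start=1):
        y = g(y)
        # stop when the policy fires; the last layer always stops
        if stop_policy(y0, y) >= threshold or u == len(layer_fns):
            return y, u

# toy example: each layer adds 0.3; stop once the state exceeds 1.0,
# so this particular input stops at depth 4
layers = [lambda y: y + 0.3] * 5
policy = lambda y0, y: 1.0 if y > 1.0 else 0.0
out, depth = dynamic_depth_forward(0.0, layers, policy)
```

Different inputs (here, different starting values) would exit the loop at different depths, which is exactly the behavior the figure illustrates.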

SLIDE 3

Motivation

  • 1. Task-imbalanced Meta Learning

[Figure: a meta-parameter θ is adapted by gradient steps ∇θℒ1 and ∇θℒ2 into task-specific parameters θ_task1 and θ_task2.]

Task 1: fewer samples; Task 2: more samples. Different tasks need different numbers of gradient steps for adaptation.

SLIDE 4

Motivation

  • 2. Data-driven Algorithm Design

π’šπ’–

(output)

π’š π’šπ’–

not satisfied

stop criteria

hand-designed update step Traditional algorithms have certain stop criteria to determine the number of iterations for each problem. E.g.,

  • iterate until convergence
  • early stopping to avoid over-fitting

Deep learning based algorithms usually have a fixed number of iterations in the architecture.

5-minute Core Message

SLIDE 5

Motivation

  • 3. Others

Image Denoising

  • Images with different noise levels may need different numbers of denoising steps.

Image Recognition

  • 'Early exits' have been proposed to improve computational efficiency and avoid 'over-thinking'.

[Teerapittayanon et al., 2016; Zamir et al., 2017; Huang et al., 2018; Kaya et al., 2019]

SLIDE 6

Predictive Model with Stopping Policy

Predictive model F_θ

  • Transforms the input y to generate a path of states y_1, …, y_U

Stopping policy ρ_φ

  • Sequentially observes the states y_u and determines the probability of stopping at layer u

[Figure: the predictive model maps the input y through layers with parameters θ_2, …, θ_5 to states y_2, …, y_5; after each state the stop policy ρ_φ decides whether to stop; here it stops and outputs y_5.]

Variational stop time distribution r_φ

  • Stop time distribution induced by the stopping policy ρ_φ:

    r_φ(u) = ρ_φ(y_u) ∏_{τ=1}^{u−1} (1 − ρ_φ(y_τ))
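Concretely, the induced distribution can be computed from the per-layer stop probabilities. An illustrative numpy sketch (our naming, not the authors' code), with the last layer absorbing the leftover mass so that r_φ sums to one:

```python
import numpy as np

def stop_time_distribution(rho):
    """r(u) = rho_u * prod_{tau<u} (1 - rho_tau); the last layer
    takes the remaining mass so the result sums to 1."""
    rho = np.asarray(rho, dtype=float)
    survive = np.cumprod(1.0 - rho)               # Pr[not stopped by layer u]
    r = np.empty_like(rho)
    r[0] = rho[0]
    r[1:] = rho[1:] * survive[:-1]                # stop exactly at layer u
    r[-1] = survive[-2] if len(rho) > 1 else 1.0  # force stop at the last layer
    return r

rho = [0.1, 0.3, 0.5, 0.2, 0.9]                  # per-layer stop probabilities
r = stop_time_distribution(rho)
assert np.isclose(r.sum(), 1.0)
```

The cumulative product is exactly the "Pr[not stopped before u]" term annotated on the slide.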

SLIDE 7

How to learn the optimal (F_θ, ρ_φ) efficiently?

  • Design a joint training objective:

    ℒ(F_θ, r_φ)

  • Introduce an oracle stop time distribution:

    r*|F_θ := argmin_{r ∈ 𝒫} ℒ(F_θ, r)

  • Then we decompose the learning procedure into two stages:

    (i) the oracle model learning stage;
    (ii) the imitation learning stage.

[Diagram: stage (i) learns the optimal F_θ* by optimizing ℒ(F_θ, r*|F_θ); stage (ii) learns the optimal r_φ* by minimizing the KL divergence between the oracle r*|F_θ* and r_φ.]
SLIDE 8

Advantages of our training procedure

✓ Principled

  • The two components are optimized towards a joint objective.

✓ Tuning-free

  • Weights of the different layers in the loss are given automatically by the oracle distribution.
  • For different input samples, the weights on the layers can be different.

✓ Efficient

  • Instead of updating θ and φ alternately, θ is optimized in the 1st stage, and then φ is optimized in the 2nd stage.

✓ Generic

  • Can be applied to a diverse range of applications.

✓ Better understanding

  • A variational Bayes perspective, for better understanding the proposed model and joint training.
  • A reinforcement learning perspective, for better understanding the learning of the stop policy.
SLIDE 9

Experiments

  • Learning to optimize: sparse recovery
  • Task-imbalanced meta learning: few-shot learning
  • Image denoising
  • Some observations on image recognition tasks

SLIDE 10

Problem Formulation - Models

Predictive model F_θ

  • y_u = g_θu(y_{u−1}), for u = 1, 2, …, U

Stopping policy ρ_φ

  • ρ_u = ρ_φ(y, y_u), for u = 1, 2, …, U

Variational stop time distribution r_φ (induced by ρ_φ)

  • r_φ(u) = ρ_u ∏_{τ=1}^{u−1} (1 − ρ_τ) for u < U, where the product is Pr[not stopped before u]; the last layer U takes the remaining probability mass.
  • Helps design the training objective and the algorithm.
SLIDE 11

Problem Formulation – Optimization Objective

ℒ(F_θ, r_φ; y, z) = 𝔼_{u∼r_φ}[ m(z, y_u; θ) ] − γ H(r_φ)

(the loss in expectation over the stop time u, minus an entropy term)

  • Variational Bayes Perspective

    min_{θ,φ} ℒ(F_θ, r_φ; y, z)  is equivalent to  max_{θ,φ} ELBO_γ(F_θ, r_φ; y, z)

    (i.e., a γ-VAE / ELBO interpretation)
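Numerically, the objective is just an r-weighted average of per-layer losses minus γ times the entropy of r. A small sketch under our own naming (the loss values and ε-smoothing are illustrative):

```python
import numpy as np

def joint_objective(per_layer_losses, r, gamma=1.0):
    """L(F, r) = E_{u ~ r}[m(z, y_u)] - gamma * H(r)."""
    r = np.asarray(r, dtype=float)
    m = np.asarray(per_layer_losses, dtype=float)
    entropy = -np.sum(r * np.log(r + 1e-12))   # H(r), with a small epsilon
    return float(np.dot(r, m) - gamma * entropy)

# with gamma = 0 and all mass on one layer, L is just that layer's loss
loss_point = joint_objective([2.0, 1.0, 0.5], [0.0, 0.0, 1.0], gamma=0.0)
```

With γ > 0, the entropy bonus keeps r from collapsing onto a single layer too early in training.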

SLIDE 12

Training Algorithm – Stage I

Oracle stop time distribution:

    r_θ*(·|z, y) := argmax_{r ∈ 𝒫} ELBO_γ(F_θ, r; y, z) = q_θ(z|u, y)^{1/γ} / Σ_{u'=1}^{U} q_θ(z|u', y)^{1/γ}

Interpretation:

  • It is the optimal stop time distribution given a predictive model F_θ.
  • When γ = 1, the oracle is the true posterior: r_θ*(u|z, y) = q_θ(u|z, y).
  • This posterior is computationally tractable, but it requires knowledge of the true label z.

Stage I. Oracle model learning:

    max_θ (1/|𝒟|) Σ_{(y,z)∈𝒟} ELBO_γ(F_θ, r_θ*; y, z) = max_θ (1/|𝒟|) Σ_{(y,z)∈𝒟} Σ_{u=1}^{U} r_θ*(u|z, y) log q_θ(z|u, y)

    where q_θ(z|u, y) is the likelihood of the output at the u-th layer.
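For a fixed F_θ, the oracle above is a tempered softmax over the per-layer log-likelihoods log q_θ(z|u, y). A sketch with hypothetical inputs, computed in log space for stability:

```python
import numpy as np

def oracle_stop_distribution(log_q, gamma=1.0):
    """r*(u|z,y) proportional to q(z|u,y)^{1/gamma}, i.e. softmax(log_q / gamma)."""
    s = np.asarray(log_q, dtype=float) / gamma
    s -= s.max()                       # stabilize the exponentials
    w = np.exp(s)
    return w / w.sum()

log_q = np.log([0.1, 0.2, 0.4, 0.3])  # per-layer likelihoods of the true label
r_star = oracle_stop_distribution(log_q, gamma=1.0)
```

With γ = 1 this recovers the normalized likelihoods (the true posterior over u); as γ shrinks, the oracle concentrates on the best-performing layer.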
SLIDE 13

Training Algorithm – Stage II

Stage II. Imitation with a sequential policy

Recall: the variational stop time distribution r_φ(u|y) is induced by the sequential policy ρ_φ.

Hope: r_φ(u|y) mimics the oracle distribution r*(u|z, y) (computed with the stage-I optimal F_θ), by minimizing the forward KL divergence:

    KL(r* ‖ r_φ) = − Σ_{u=1}^{U} r*(u|z, y) log r_φ(u|y) − H(r*)

Note: if we use the reverse KL divergence instead, it is equivalent to solving a maximum-entropy RL problem.
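Since H(r*) does not depend on φ, minimizing this forward KL amounts to a cross-entropy between the oracle and the induced distribution. An illustrative sketch (names are ours):

```python
import numpy as np

def forward_kl(r_star, r_phi):
    """KL(r* || r_phi) = -sum_u r*(u) log r_phi(u) - H(r*)."""
    p = np.asarray(r_star, dtype=float)
    q = np.asarray(r_phi, dtype=float)
    cross_entropy = -np.sum(p * np.log(q + 1e-12))
    entropy = -np.sum(p * np.log(p + 1e-12))   # H(r*), constant in phi
    return float(cross_entropy - entropy)

p = np.array([0.2, 0.5, 0.3])
assert abs(forward_kl(p, p)) < 1e-6            # KL is zero when they match
```

In practice the gradient w.r.t. φ flows only through the cross-entropy term.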

SLIDE 14

Experiment I - Learning To Optimize: Sparse Recovery

  • Task: Recover y* from its noisy measurements c = B y* + ε
  • Traditional approach:
    – LASSO formulation: min_y ½ ‖c − B y‖² + λ ‖y‖₁
    – Solved by iterative algorithms such as ISTA
  • Learning-based algorithm:
    – Learned ISTA (LISTA) is a deep architecture designed based on ISTA update steps
  • Ablation study: whether LISTA with adaptive depth (LISTA-stop) is better than LISTA.
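For context, one ISTA iteration is a gradient step on the quadratic term followed by soft-thresholding (the proximal step for the ℓ1 penalty). A minimal sketch with a fixed step size, not LISTA's learned weights:

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(B, c, lam, num_iters=100):
    """Minimize 0.5 * ||c - B y||^2 + lam * ||y||_1 by iterative
    shrinkage-thresholding: gradient step, then soft-threshold."""
    step = 1.0 / np.linalg.norm(B, 2) ** 2   # 1 / Lipschitz constant of grad
    y = np.zeros(B.shape[1])
    for _ in range(num_iters):
        grad = B.T @ (B @ y - c)
        y = soft_threshold(y - step * grad, lam * step)
    return y

# toy problem: with B = I the LASSO solution is soft_threshold(c, lam)
y_hat = ista(np.eye(3), np.array([2.0, 0.05, -1.0]), lam=0.1)
```

LISTA unrolls a fixed number of such updates into network layers with learned matrices; LISTA-stop additionally learns when to stop unrolling per input.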

SLIDE 15

Experiment II – Task-imbalanced Meta Learning

  • Task: Task-imbalanced few-shot learning. Each task contains k shots for each class, where k can vary.
  • Our variant, MAML-stop:

    – Built on top of MAML, but MAML-stop learns how many adaptation gradient-descent steps are needed for each task.

[Figure: results under the task-imbalanced setting and the vanilla setting.]

SLIDE 16

Experiment III – Image Denoising

  • Our variant, DnCNN-stop:

    – Built on top of DnCNN, one of the most popular models for the denoising task.

*Noise levels 65 and 75 are not observed during training.