Multiple Instance Learning for Fast, Stable and Early RNN Predictions - PowerPoint PPT Presentation




SLIDE 1

The Edge of Machine Learning

Multiple Instance Learning for Fast, Stable and Early RNN Predictions

Don Dennis, Microsoft Research India. Joint work with Chirag P., Harsha and Prateek. Accepted to NIPS ’18.

SLIDE 2

Algorithms for the IDE - EdgeML

  • A library of machine learning algorithms
  • Trained on the cloud
  • Able to run on the tiniest of IoT devices

Arduino Uno

SLIDE 3

Previous Work: EdgeML Classifiers

Code: https://github.com/Microsoft/EdgeML

Bonsai (Kumar et al., ICML ’17), ProtoNN (Gupta et al., ICML ’17), Fast(G)RNN (Kusupati et al., NIPS ’18)

SLIDE 4

Previous Work: EdgeML Applications

Code: En route

Wake Word (work in progress), GesturePod (Patil et al., to be submitted)

SLIDE 5

Problem

SLIDE 6

Problem

  • Given a time series data point, classify it as a certain class.
  • GesturePod:
    – Data: accelerometer and gyroscope information
    – Task: detect whether a gesture was performed

SLIDE 7

Problem

(Figure: classifying the full time series with ProtoNN and Bonsai.)

Expensive! Prohibitive on IoT devices.

SLIDE 11

RNNs are Expensive

  • For time series data, T RNN updates are performed.
  • T is determined by the data labelling process. Example – GesturePod: 2 seconds.


SLIDE 13

RNNs are Expensive

Observe how k << T (the signature occupies only k of the T steps).

  • The RNN runs over a longer data point – unnecessarily large T and prediction time.
  • Predictors must recognize signatures at different offsets – this requires larger predictors.
  • Compute is sequential.
  • There is also lag.

SLIDE 14

RNNs are Expensive

Solution? Approach 1 of 2: exploit the fact that k << T and learn a smaller classifier. How?

SLIDES 15-25

How ?

  • STEP 1: Divide X into smaller instances.
  • STEP 2: Identify positive instances. Discard negative (noise) instances.
  • STEP 3: Use these instances to train a smaller classifier.

Note! Most of the instances are just noise.

Robust Learning – standard techniques don’t apply:
  • Too much noise.
  • Ignores temporal structure of the data.

Traditional Multiple Instance Learning (MIL) – standard techniques don’t apply:
  • Heterogeneous.
  • Ignores temporal structure of the data.
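STEP 1 above amounts to simple overlapping windowing. A minimal sketch, assuming NumPy arrays; `make_instances`, the window width `w`, and the stride are illustrative choices, not the talk's exact parameters:

```python
import numpy as np

def make_instances(X, w, stride):
    """Split a (T, d) time series into overlapping (w, d) instances."""
    T = X.shape[0]
    starts = range(0, T - w + 1, stride)
    return np.stack([X[s:s + w] for s in starts])

# Toy series: T = 10 steps, d = 2 features (e.g. accelerometer axes).
X = np.arange(20).reshape(10, 2)
Z = make_instances(X, w=4, stride=2)
print(Z.shape)  # (4, 4, 2): four instances of width 4
```

Each instance is much shorter than the full series, which is what lets the downstream classifier be smaller.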

SLIDE 26

How ?

Property 1: Positive instances are clustered together.
Property 2: The number of positive instances can be estimated.

Exploit temporal locality with MIL/robust-learning techniques.

SLIDE 27

Algorithm: MI-RNN

A two-phase algorithm that alternates between identifying positive instances and training on them.
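The two-phase alternation can be outlined as below. This is a hedged skeleton only: `fit` and `relabel` are assumed placeholder callables, not the paper's actual RNN training and instance-selection routines.

```python
import numpy as np

def mi_rnn_train(bags, bag_labels, rounds, fit, relabel):
    """Skeleton of the alternation: train on the current instance labels,
    then re-identify which instances are positive, and repeat."""
    N, n_inst = bags.shape[:2]
    # Initially, every instance inherits its bag's (data point's) label.
    inst_labels = np.repeat(bag_labels[:, None], n_inst, axis=1)
    model = None
    for _ in range(rounds):
        model = fit(bags, inst_labels)                   # phase 1: train
        inst_labels = relabel(model, bags, bag_labels)   # phase 2: relabel
    return model, inst_labels

# Stub run: 2 bags of 3 instances each; relabel marks everything negative.
model, labels = mi_rnn_train(
    np.zeros((2, 3, 4, 1)), np.array([1, 0]), rounds=2,
    fit=lambda b, y: y.sum(),                  # stand-in "model"
    relabel=lambda m, b, y: np.zeros((2, 3), dtype=int))
```

The stub shows only the control flow; in the real algorithm `relabel` uses the trained classifier's confidences, as the following steps describe.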

SLIDE 28

Algorithm: MI-RNN

  • Step 1: Assign labels – each instance inherits the label of its source data point.

SLIDE 31

Algorithm: MI-RNN

  • Step 2: Train a classifier on this data.

True positive instances are correctly labeled. Mislabeled (noise) instances look common to all classes, so the classifier is confused by them and assigns them low prediction confidence.

SLIDE 36

Algorithm: MI-RNN

  • Step 3: Wherever possible, use the classifier’s prediction scores to pick the top-κ instances. The selection should satisfy Property 1 and Property 2.
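One way to realize this top-κ selection is to keep the best contiguous run of κ instances. This is an assumed sketch consistent with the two properties, not the paper's exact selection rule:

```python
import numpy as np

def top_kappa_window(conf, kappa):
    """Label the contiguous run of kappa instances with the highest total
    confidence as positive: contiguity enforces Property 1, and the fixed
    count kappa uses the estimate from Property 2."""
    sums = [conf[s:s + kappa].sum() for s in range(len(conf) - kappa + 1)]
    start = int(np.argmax(sums))
    labels = np.zeros(len(conf), dtype=int)
    labels[start:start + kappa] = 1
    return labels

print(top_kappa_window(np.array([0.1, 0.2, 0.9, 0.8, 0.1]), kappa=2))
# [0 0 1 1 0]
```

Requiring a contiguous run is what distinguishes this from ordinary top-κ thresholding, which would ignore the temporal structure of the data.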

SLIDE 38

Algorithm: MI-RNN

  • Step 4: Repeat with the new labels.

SLIDE 39

MI-RNN: Does It Work?

  • Of course!
  • Theoretical analysis: convergence to the global optimum in linear time for nice data.
  • Experiments: significantly improved accuracy while saving computation.
    – Various tasks: activity recognition, audio keyword detection, gesture recognition.

SLIDE 43

MI-RNN: Does It Work?

Dataset (task)                Hidden Dim   LSTM    MI-RNN   Savings %
HAR-6 (activity detection)    8            89.54   91.92    62.5
                              16           92.90   93.89
                              32           93.04   91.78
Google-13 (audio)             16           86.99   89.78    50.5
                              32           89.84   92.61
                              64           91.13   93.16
WakeWord-2 (audio)            8            98.07   98.08    50.0
                              16           98.78   99.07
                              32           99.01   98.96

MI-RNN better than LSTM almost always.

SLIDE 45

MI-RNN: Does It Work?

Dataset (task)                    Hidden Dim   LSTM    MI-RNN   Savings %
GesturePod-6 (gesture detection)  8            –       98.00    50
                                  32           94.04   99.13
                                  48           97.13   98.43
DSA-19 (activity detection)       32           84.56   87.01    28
                                  48           85.35   89.60
                                  64           85.17   88.11

MI-RNN better than LSTM almost always.

SLIDE 46

MI-RNN: Savings?

Dataset       LSTM Hidden   LSTM    MI-RNN Hidden   MI-RNN   Savings   Savings at ~1% drop
HAR-6         32            93.04   16              93.89    10.5x     42x
Google-13     64            91.13   32              92.61    8x        32x
WakeWord-2    32            99.01   16              99.07    8x        32x
GesturePod-6  48            97.13   8               98.00    72x
DSA-19        64            85.17   32              87.01    5.5x

  • MI-RNN achieves the same or better accuracy with ½ or ¼ of the LSTM hidden dimension.

SLIDE 48

MI-RNN in Action

Synthetic MNIST: detecting the presence of a zero.

SLIDE 49

MI-RNN in Action

SLIDE 50

RNNs are Expensive

Solution? Approach 2 of 2: Early Prediction. How?

SLIDE 51

Can we do even better?

  • In many cases, looking only at a small prefix is enough to classify/reject: Early Prediction.

SLIDE 52

Can we do even better?

  • Existing work:
    – Assumes a pretrained classifier and uses secondary classifiers
    – Template-matching approaches
    – A separate policy for early classification
  • Not feasible!

SLIDE 53

Early Prediction

Our approach –
  • Inference: predict at each step; stop as soon as prediction confidence is high.
  • Training: incentivize early prediction by rewarding correct and early detections.

SLIDE 54

Algorithm: E-RNN

Early Loss:
Regular Loss:

Incentivizes early and consistent prediction.
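The two loss formulas on this slide appear only as images in the transcript. As an assumed illustration of the idea (not the paper's exact losses): a regular loss scores only the final step's prediction, while an early loss averages the per-step losses, so being right early and staying right is rewarded.

```python
import numpy as np

def regular_loss(step_probs, y):
    """Cross-entropy on the final step's prediction only."""
    return -np.log(step_probs[-1, y])

def early_loss(step_probs, y):
    """Average cross-entropy over every step: correct predictions made
    early (and held consistently) lower the loss."""
    return -np.mean(np.log(step_probs[:, y]))

# Two runs with the same final prediction; one is confident from step 0.
early = np.array([[0.1, 0.9], [0.1, 0.9], [0.1, 0.9]])
late  = np.array([[0.9, 0.1], [0.6, 0.4], [0.1, 0.9]])
print(regular_loss(early, 1) == regular_loss(late, 1))  # True
print(early_loss(early, 1) < early_loss(late, 1))       # True
```

The regular loss cannot distinguish the two runs, while the early loss prefers the one that commits to the right class sooner.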

SLIDE 56

E-RNN: How well does it work?

  • Abysmally badly.
  • On GesturePod-6, we lose 10-12% accuracy when attempting to predict early.
  • The model gets confused easily due to common prefixes!

(Figure: a positive and a negative datapoint sharing a common prefix.)

SLIDE 60

E-RNN: How well does it work?

  • MI-RNN can help! Instances are very tight around the signatures.
  • Low confusion – the common prefixes are small.

(Figure: a positive and a negative datapoint.)

SLIDE 63

Algorithm: EMI-RNN

  • Combine the MI-RNN training routine with the E-RNN loss function and train jointly.
  • Not only do you predict on smaller windows, you also predict early very often!

SLIDE 64

EMI-RNN: Results

SLIDE 65

EMI-RNN: Results

For HAR-6, we are 8x faster at hidden size 8, with better accuracy.

SLIDE 66

EMI-RNN: Results

Comparing across hidden sizes, savings are amplified by 4-16x.

SLIDE 67

Raspberry Pi0

Device           Hidden Dim   LSTM (ms)   MI-RNN (ms)   EMI-RNN (ms)
RPi0 (22.5 ms)   16           28.14       14.06         5.62
                 32           74.46       37.41         14.96
                 64           226.1       112.6         45.03
RPi3 (26.39 ms)  16           12.76       6.48          2.59
                 32           33.10       16.47         6.58
                 64           92.09       46.28         18.51

RPi0: 1 GHz single-core CPU, 512 MB RAM.

SLIDE 68

Conclusions and Future Work

  • 8x-72x savings with MI-RNN; additional savings from early prediction.
  • Performance better than or matching LSTM.
  • A 10x performance gain away from Arduino-class devices:
    – EMI-FastGRNN
    – Rolling LSTM

SLIDE 69

Thank You!

SLIDE 70

Next Talk

Support Recovery for Orthogonal Matching Pursuit: Upper and Lower Bounds
Somani et al., NIPS ’18