Multiple Instance Learning for Fast, Stable and Early RNN Predictions

  1. The Edge of Machine Learning Multiple Instance Learning for Fast, Stable and Early RNN Predictions Don Dennis, Microsoft Research India. Joint work with Chirag P., Harsha and Prateek. Accepted to NIPS ’18 1

  2. Algorithms for the IDE - EdgeML • A library of machine learning algorithms • Trained on the cloud • Able to run on the tiniest of IoT devices (e.g., Arduino Uno) 2

  3. Previous Work: EdgeML Classifiers • ProtoNN (Gupta et al., ICML ’17) • Bonsai (Kumar et al., ICML ’17) • Fast(G)RNN (Kusupati et al., NIPS ’18) Code: https://github.com/Microsoft/EdgeML 3

  4. Previous Work: EdgeML Applications • GesturePod (Patil et al., to be submitted) • Wake Word (work in progress) Code: en route 4

  5. Problem 5

  6. Problem • Given a time series data point, classify it into one of a set of classes. • GesturePod: – Data: accelerometer and gyroscope readings – Task: detect whether a gesture was performed 6

  7. Problem 7

  8. Problem 8

  9. Problem ProtoNN and Bonsai 9

  10. Problem Expensive! Prohibitive on IoT Devices ProtoNN and Bonsai 10

  11. RNNs are Expensive • For time series data: • T RNN updates are performed: • T is determined by the data labelling process. Example GesturePod – 2 seconds. 11
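
Not on the slide, but as a rough illustration of where the cost comes from: the recurrent cell has to be applied once per time step, so a clip of T steps needs T sequential updates before a prediction can be made. A minimal PyTorch sketch, where the hidden size, input size and T are arbitrary illustrative choices:

```python
import torch
import torch.nn as nn

T, input_dim, hidden_dim = 200, 6, 32      # e.g. ~2 s of accelerometer + gyroscope samples
cell = nn.GRUCell(input_dim, hidden_dim)   # one RNN update step

x = torch.randn(T, input_dim)              # a single time-series data point
h = torch.zeros(1, hidden_dim)
for t in range(T):                         # T sequential updates; cannot be parallelised over time
    h = cell(x[t].unsqueeze(0), h)
# `h` now summarises the whole clip and would be fed to the classifier head.
```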

  13. RNNs are Expensive Observe that k << T: the signature is much shorter than the clip. • The RNN runs over the full data point – unnecessarily large T and prediction time. • The predictor must recognize signatures at different offsets – requires larger predictors. • Compute is sequential. • There is also prediction lag. 13

  14. RNNs are Expensive Solution? Approach 1 of 2: Exploit the fact that k << T and learn a smaller classifier. How? 14

  15. How? • STEP 1: Divide X into smaller instances. 15

  17. How? • STEP 1: Divide X into smaller instances. • STEP 2: Identify positive instances. Discard negative (noise) instances. 17

  19. How? • STEP 1: Divide X into smaller instances. • STEP 2: Identify positive instances. Discard negative (noise) instances. • STEP 3: Use these instances to train a smaller classifier. 19

  20. How? • STEP 1: Divide X into smaller instances. • STEP 2: Identify positive instances. Discard negative (noise) instances. • STEP 3: Use these instances to train a smaller classifier. Note! Most of the instances are just noise. 20

  22. How? • STEP 1: Divide X into smaller instances. • STEP 2: Identify positive instances. Discard negative (noise) instances. • STEP 3: Use these instances to train a smaller classifier. Robust Learning 22

  23. How? • STEP 1: Divide X into smaller instances. • STEP 2: Identify positive instances. Discard negative (noise) instances. • STEP 3: Use these instances to train a smaller classifier. Robust Learning: standard techniques don’t apply. • Too much noise. • Ignores temporal structure of the data. 23

  24. How? • STEP 1: Divide X into smaller instances. • STEP 2: Identify positive instances. Discard negative (noise) instances. • STEP 3: Use these instances to train a smaller classifier. Robust Learning: standard techniques don’t apply. • Too much noise. • Ignores temporal structure of the data. Traditional Multiple Instance Learning (MIL) 24

  25. How? • STEP 1: Divide X into smaller instances. • STEP 2: Identify positive instances. Discard negative (noise) instances. • STEP 3: Use these instances to train a smaller classifier. Robust Learning: standard techniques don’t apply. • Too much noise. • Ignores temporal structure of the data. Traditional Multiple Instance Learning (MIL): standard techniques don’t apply. • Heterogeneous. • Ignores temporal structure of the data. 25

  26. How? Exploit temporal locality with MIL/robust learning techniques. Property 1: positive instances are clustered together. Property 2: the number of positive instances can be estimated. 26
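
As a concrete illustration of Step 1 and the two properties, here is a minimal sketch of splitting one labelled clip into overlapping fixed-width instances. The width and stride values (and the function name) are arbitrary assumptions, not numbers from the talk:

```python
import numpy as np

def make_instances(x, width, stride):
    """Split a (T, d) time series into overlapping (width, d) instances."""
    T = x.shape[0]
    starts = range(0, T - width + 1, stride)
    return np.stack([x[s:s + width] for s in starts])   # shape: (num_instances, width, d)

x = np.random.randn(200, 6)                  # one labelled clip: T = 200 steps, 6 sensor channels
bag = make_instances(x, width=48, stride=16)
print(bag.shape)                             # (10, 48, 6)
# Property 1: the instances that contain the gesture are contiguous within this bag.
# Property 2: roughly (signature length k) / stride of them are positive, so their count can be estimated.
```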

  27. Algorithm: MI-RNN Two-phase algorithm – alternates between identifying positive instances and training on the positive instances. 27

  28. Algorithm: MI-RNN • Step 1: Assign labels – each instance inherits the label of its source data point. 28

  31. Algorithm: MI-RNN • Step 2: Train classifier on this data 31

  32. Algorithm: MI-RNN • Step 2: Train classifier on this data. (Figure: true positive instances are correctly labeled.) 32

  33. Algorithm: MI-RNN • Step 2: Train classifier on this data. (Figure: true positive instances are correctly labeled; mislabeled instances are common to all classes.) 33

  34. Algorithm: MI-RNN • Step 2: Train classifier on this data. (Figure: mislabeled instances are common to all classes.) 34

  35. Algorithm: MI-RNN • Step 2: Train classifier on this data. (Figure: mislabeled instances are common to all classes.) Classifier will be confused. Low prediction confidence. 35

  36. Algorithm: MI-RNN • Step 3: Wherever possible, use the classifier’s prediction scores to pick the top-κ instances. The selection should satisfy Property 1 and Property 2. 36
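
A sketch of one way to implement this top-κ step so that it respects both properties: instead of taking the κ highest-scoring instances independently, pick the contiguous window of κ instances with the largest total score for the bag's class. The function and variable names, and the use of summed scores, are my assumptions rather than the exact rule on the slide:

```python
import numpy as np

def top_kappa_window(scores, kappa):
    """scores: (num_instances,) classifier confidence for the bag's class.
    Returns the start index of the best contiguous window of kappa instances."""
    window_sums = np.convolve(scores, np.ones(kappa), mode="valid")  # score sum of each window
    return int(np.argmax(window_sums))

scores = np.array([0.1, 0.2, 0.1, 0.7, 0.9, 0.8, 0.2, 0.1])
start = top_kappa_window(scores, kappa=3)        # -> 3
positive = range(start, start + 3)               # these instances are relabelled positive
# all other instances in the bag are treated as negative ("noise")
```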

  38. Algorithm: MI-RNN • Step 4: Repeat with new labels 38
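
Putting Steps 1-4 together, a high-level sketch of the alternating procedure. `train_classifier` and `instance_scores` are hypothetical placeholders for the instance-level RNN and its scoring function; this is an illustration of the idea, not the authors' implementation:

```python
import numpy as np

def mi_rnn(bags, bag_labels, kappa, rounds, train_classifier, instance_scores):
    """bags: list of (num_instances, width, d) arrays, one bag per labelled data point.
    bag_labels: one class per bag, with 0 meaning 'negative / no signature'."""
    # Step 1: every instance starts with the label of its source data point.
    labels = [np.full(len(b), y) for b, y in zip(bags, bag_labels)]

    for _ in range(rounds):
        # Step 2: train an instance-level classifier on the current instance labels.
        clf = train_classifier(bags, labels)

        # Step 3: in each positive bag, keep only the best contiguous window of kappa instances.
        for i, (bag, y) in enumerate(zip(bags, bag_labels)):
            if y == 0:                                   # negative bags stay all-negative
                continue
            s = instance_scores(clf, bag, y)             # (num_instances,) confidence for class y
            start = int(np.argmax(np.convolve(s, np.ones(kappa), mode="valid")))
            new_labels = np.zeros(len(bag), dtype=int)
            new_labels[start:start + kappa] = y
            labels[i] = new_labels                       # Step 4: repeat with the new labels
    return clf, labels
```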

  39. MI-RNN: Does It Work? 39

  40. MI-RNN: Does It Work? • Of course! 40

  41. MI-RNN: Does It Work? • Of course! • Theoretical analysis: convergence to the global optimum in linear time for nice data 41

  42. MI-RNN: Does It Work? • Of course! • Theoretical analysis: convergence to the global optimum in linear time for nice data • Experiments: significantly improves accuracy while saving computation – Various tasks: activity recognition, audio keyword detection, gesture recognition 42

  43. MI-RNN: Does It Work?

      Dataset                            Hidden Dim   LSTM    MI-RNN   Savings %
      HAR-6 (Activity detection)         8            89.54   91.92    62.5
                                         16           92.90   93.89
                                         32           93.04   91.78
      Google-13 (Audio)                  16           86.99   89.78    50.5
                                         32           89.84   92.61
                                         64           91.13   93.16
      WakeWord-2 (Audio)                 8            98.07   98.08    50.0
                                         16           98.78   99.07
                                         32           99.01   98.96

  44. MI-RNN: Does It Work? (Same table as above.) MI-RNN is better than LSTM almost always. 44

  45. MI-RNN: Does It Work?

      Dataset                            Hidden Dim   LSTM    MI-RNN   Savings %
      GesturePod-6 (Gesture detection)   8            -       98.00    50
                                         32           94.04   99.13
                                         48           97.13   98.43
      DSA-19 (Activity detection)        32           84.56   87.01    28
                                         48           85.35   89.60
                                         64           85.17   88.11

      MI-RNN is better than LSTM almost always.

  46. MI-RNN: Savings?

      Dataset        LSTM Hidden Dim   LSTM    MI-RNN Hidden Dim   MI-RNN   Savings   Savings at ~1% drop
      HAR-6          32                93.04   16                  93.89    10.5x     42x
      Google-13      64                91.13   32                  92.61    8x        32x
      WakeWord-2     32                99.01   16                  99.07    8x        32x
      GesturePod-6   48                97.13   8                   98.00    72x       -
      DSA-19         64                85.17   32                  87.01    5.5x      -

  47. MI-RNN: Savings? (Same table as above.) MI-RNN achieves the same or better accuracy with ½ or ¼ of the LSTM hidden dim. 47

  48. MI-RNN in Action Synthetic MNIST: Detecting the presence of Zero. 48

  49. MI-RNN in Action 49

  50. RNNs are Expensive Solution? Approach 2 of 2: Early Prediction How? 50

  51. Can we do even better? • In many cases, looking only at a small prefix is enough to classify/reject. Early Prediction 51

  52. Can we do even better? • Existing work: – Assumes a pretrained classifier and uses secondary classifiers – Template-matching approaches – A separate policy for early classification • Not feasible! 52

  53. Early Prediction Our Approach Inference: Predict at each step – stop as soon as prediction confidence is high. Training: Incentivize early prediction by rewarding correct and early detections. 53
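
A minimal sketch of this inference rule: run the RNN one step at a time and stop as soon as the most likely class crosses a confidence threshold. The 0.9 threshold, the sizes, and the GRU/linear-head combination are illustrative assumptions, not values from the talk:

```python
import torch
import torch.nn as nn

def predict_early(cell, head, x, threshold=0.9):
    """cell: nn.GRUCell; head: nn.Linear classifier; x: (T, d) instance."""
    h = torch.zeros(1, cell.hidden_size)
    for t in range(x.shape[0]):
        h = cell(x[t].unsqueeze(0), h)                 # one RNN step
        probs = torch.softmax(head(h), dim=-1)
        conf, cls = probs.max(dim=-1)
        if conf.item() >= threshold:                   # confident enough: stop early
            return cls.item(), t + 1                   # predicted class, steps consumed
    return cls.item(), x.shape[0]                      # never confident: use the full instance

cell = nn.GRUCell(6, 32)
head = nn.Linear(32, 5)                                # e.g. 5 gesture classes
label, steps_used = predict_early(cell, head, torch.randn(48, 6))
```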

  54. Algorithm: E-RNN Regular Loss: Early Loss: 54

  55. Algorithm: E-RNN Regular Loss: Early Loss: incentivizes early and consistent prediction. 55
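
The loss formulas on these slides are images, so the following is only an assumed form for illustration: the regular loss penalises the prediction at the final step only, while an early loss averages the classification loss over every step, so the model is rewarded for being correct (and staying correct) as early as possible:

```python
import torch
import torch.nn.functional as F

def regular_loss(step_logits, y):
    """step_logits: (T, num_classes) per-step outputs; y: scalar class-index tensor."""
    return F.cross_entropy(step_logits[-1:], y.view(1))   # only the final step matters

def early_loss(step_logits, y):
    T = step_logits.shape[0]
    target = y.view(1).expand(T)                          # the true label at every step
    return F.cross_entropy(step_logits, target)           # mean loss over all T steps
```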

  56. E-RNN: How well does it work? 56

  57. E-RNN: How well does it work? • Abysmally bad :( 57

  58. E-RNN: How well does it work? • Abysmally bad :( • In GesturePod-6, we lose 10-12% accuracy attempting to predict early. 58

  59. E-RNN: How well does it work? • Abysmally bad :( • In GesturePod-6, we lose 10-12% accuracy attempting to predict early. • Gets confused easily due to common prefixes! (Figure: a positive and a negative data point that share a common prefix.) 59
