


  1. DATA ANALYTICS USING DEEP LEARNING GT 8803 // FALL 2018 // JOY ARULRAJ LECTURE #02: ACCELERATING MACHINE LEARNING INFERENCE WITH PROBABILISTIC PREDICATES

  2. ANNOUNCEMENTS • Course webpage: – https://jarulraj.github.io/data-analytics-course/ • Start thinking about project topics – Read assigned papers for inspiration • No classes next week • Submit reviews in PDF format – GT username as filename

  3. TODAY’S PAPER • Accelerating Machine Learning Inference with Probabilistic Predicates – Query optimization – ML inference queries • Slides based on a presentation by Yao Lu @ SIGMOD 2018

  4. TODAY’S PAPER • Layers of a data analytics system: – Machine Translation – Query Processing – Storage Management – Hardware Acceleration

  5. TODAY’S AGENDA • Problem Overview • Key Idea • Technical Details • Experiments • Discussion

  6. ML INFERENCE ON BIG-DATA PLATFORMS • SQL + user-defined functions – On unstructured data blobs – Videos, images, and unstructured text

  7. ML INFERENCE • Untrained neural network → Training → Inference on new data • Source: What’s the Difference Between Deep Learning Training and Inference?, Michael Copeland, August 2016, NVIDIA Blog

  8. ML INFERENCE QUERY EXAMPLE • Find images of oranges – Images → UDF_YOLOv2 → σ_orange → Result – UDF_YOLOv2 answers: has person? has bear? has orange? …

  9. ML INFERENCE QUERY EXAMPLE • Inference takes time – Even when the predicate has low selectivity – Perhaps only 1-in-100 images have oranges • Reason – Every image has to be processed by all the UDFs – Images → UDF_YOLOv2 → σ_orange → Result

  10. PROBLEM OVERVIEW • How can we accelerate such inference queries? – Images → UDF_YOLOv2 → σ_orange → Result

  11. SOLUTION #1: PREDICATE PUSHDOWN • Traditional query optimization technique – Move filtering of data as close to the source as possible to avoid loading unnecessary data into higher-level operators – Join Tables A, B → Predicates on A, B → Result • Cannot push predicates below the UDF – No “contains orange” column exists – Need to construct it using the UDF
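To make the pushdown idea concrete, here is a minimal sketch (not from the paper; the toy tables and predicate are hypothetical) showing that filtering before a join produces the same answer while doing far less work:

```python
# Classic predicate pushdown on toy in-memory "tables".
def join(a_rows, b_rows, key):
    # naive nested-loop equi-join
    return [{**a, **b} for a in a_rows for b in b_rows if a[key] == b[key]]

a = [{"id": i, "x": i % 3} for i in range(100)]
b = [{"id": i, "y": i % 5} for i in range(100)]

# Plan 1: join first, filter afterward (examines 100 x 100 = 10,000 pairs).
plan1 = [r for r in join(a, b, "id") if r["y"] == 0]

# Plan 2: push the predicate below the join (only 20 b-rows reach the join).
b_filtered = [r for r in b if r["y"] == 0]
plan2 = join(a, b_filtered, "id")

assert plan1 == plan2  # identical results, far less work in plan 2
```

The slide's point is that this rewrite is unavailable here: there is no "contains orange" column to filter on until the UDF has already run.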

  12. SOLUTION #2: PRE-COMPUTING • Pre-computing all possible columns – High cost, since there are too many UDFs & query predicates • Not a good fit for ad-hoc queries – Only certain columns of certain images will be required • Not a good fit for online queries – Need to do inference on live data

  13. KEY IDEA • Accelerate queries by early filtering – Images → Filter → UDF_YOLOv2 → σ_orange → Result

  14. KEY IDEA • Early filter constraints • Performance – Utility of data reduction >> execution cost of the early filter • Accuracy – Early filtering should not increase false negatives

  15. EARLY FILTERING • Images → Filter → UDF_YOLOv2 → σ_orange → Result • [Diagram: blobs passed by the filter are either true positives or false positives]

  16. EARLY FILTERING • Images → Filter → UDF_YOLOv2 → σ_orange → Result • [Diagram: blobs dropped by the filter (the data reduction) are either true negatives or false negatives]

  17. EARLY FILTERING • Images → Filter → UDF_YOLOv2 → σ_orange → Result • [Diagram: reducing false positives comes at the price of more false negatives]

  18. PROBABILISTIC EARLY FILTERING • Unlike queries on relational data – ML applications have built-in tolerance for errors – ML UDFs themselves generate false positives & false negatives • So, filters can also be probabilistic! – Reducing accuracy can increase the data reduction rate
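A tiny numeric sketch (hypothetical labels, not from the paper) of how a probabilistic filter is scored: data reduction counts the blobs dropped, while accuracy counts the true positives that survive the filter:

```python
# truth[i]: does blob i actually satisfy the predicate (contains an orange)?
truth = [1, 0, 0, 0, 1, 0, 0, 0, 0, 1]
# keep[i]: does the probabilistic early filter let blob i through?
keep  = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]

data_reduction = 1 - sum(keep) / len(keep)                       # 0.6 of blobs dropped
false_negatives = sum(t and not k for t, k in zip(truth, keep))  # 1 positive missed
accuracy = 1 - false_negatives / sum(truth)                      # 2/3 of positives survive
```

Loosening the filter drives false negatives toward zero at the cost of a smaller data reduction; tightening it does the reverse.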

  19. PROBABILISTIC PREDICATES • Goal: query speedup + desired accuracy – Train binary classifiers – Group input blobs into two categories • Blobs that disagree with the query predicate • Blobs that may agree with the query predicate • Classifiers are called probabilistic predicates – Characterized by <data reduction rate, execution cost, accuracy>

  20. PROBABILISTIC PREDICATES (PPs) • Images → PP_orange → UDF_YOLOv2 → σ_orange → Result – UDF: 10-30 fps; PP: 50-1K fps • Low filter execution cost – High data reduction – Minimal impact on accuracy

  21. PROBABILISTIC PREDICATES (PPs) • Images → PP_orange → UDF_YOLOv2 → σ_orange → Result – UDF: 10-30 fps; PP: 50-1K fps • Apply PP directly on the raw blob – 5-1000x faster than the UDF – Accuracy vs. data-reduction curve
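The fps numbers above imply the arithmetic below. The concrete costs are hypothetical, but they show why a cheap PP with high data reduction dominates end-to-end cost:

```python
# Expected per-blob cost of the pipeline PP -> UDF (hypothetical costs).
udf_cost = 100.0   # ms per blob for the heavy UDF (~10 fps)
pp_cost = 1.0      # ms per blob for the PP (~1000 fps)
reduction = 0.9    # fraction of blobs the PP discards

baseline = udf_cost                              # every blob hits the UDF
with_pp = pp_cost + (1 - reduction) * udf_cost   # 1 + 0.1 * 100 = 11 ms
speedup = baseline / with_pp                     # ~9.1x
```

The speedup approaches 1/(1 - reduction) as the PP's own cost becomes negligible relative to the UDF.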

  22. SYSTEM WORKFLOW: BASELINE SYSTEM W/O PPs • a) Baseline system w/o PPs: Query → QO → Plan; Plan + Input → Results

  23. SYSTEM WORKFLOW: CONSTRUCTING PPs • b) Constructing PPs: past queries, query inputs & query results → PP Trainer → PPs

  24. SYSTEM WORKFLOW: FULL SYSTEM W/ PPs • c) Full system w/ PPs: Query + PPs → QO* → Plan*; Plan* + Input → Results

  25. CHALLENGES • How to build useful PPs? – Good trade-off between data reduction rate, cost, and accuracy • How to support complex query predicates? – Using simple PPs for ad-hoc queries

  26. PART-1: BUILDING USEFUL PPs • Probabilistic predicate – Can be thought of as a decision boundary separating two classes – Any classifier that can identify inputs far away from the decision boundary is a useful PP • Use different techniques for building PPs – Support vector machines (SVMs) – Deep neural networks, etc.

  27. SIMPLE PP USING LINEAR CLASSIFIER • PP_orange: f(x) = w · x + b – PP discards a blob when f(x) ≤ th • Threshold th sets the accuracy/reduction tradeoff – Conservative th: accuracy = 3/3, reduction = 5/10 – Aggressive th: accuracy = 2/3, reduction = 7/10
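A runnable sketch of the linear PP above on toy 1-D blobs. The weights, scores, and labels are hypothetical, so the exact numbers differ from the slide, but the threshold mechanics are the same:

```python
w, b = 1.0, 0.0
def f(x):                      # linear classifier score
    return w * x + b

# (score input, actually-contains-orange) for 10 toy blobs
blobs = [(-3, False), (-2, False), (-2, False), (-1, False), (0, False),
         (0, False), (1, True), (1, False), (2, True), (3, True)]

def filter_stats(th):
    """Discard blobs with f(x) <= th; report (accuracy, reduction)."""
    kept = [(x, pos) for x, pos in blobs if f(x) > th]
    accuracy = sum(pos for _, pos in kept) / sum(pos for _, pos in blobs)
    reduction = 1 - len(kept) / len(blobs)
    return accuracy, reduction

conservative = filter_stats(th=0.5)   # keeps all positives: (1.0, 0.6)
aggressive = filter_stats(th=1.5)     # trades accuracy for reduction: (2/3, 0.8)
```

Sweeping th traces out the accuracy-versus-data-reduction curve used to tune a PP.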

  28. PPs FOR ARBITRARY DATA BLOBS • Input blob characteristics – Linearly separable or not – High-dimensional or not – Sparse or dense • Blob types: documents, images, videos, audio

  29. PPs FOR ARBITRARY DATA BLOBS • Linear SVM: f(x) = w · x + b < th – Linearly-separable data; low training & inference cost • Kernel density estimator: f(x) = kde+(x) / kde−(x) < th – Nonlinearly-separable data • Shallow DNN: f_j(x) = h_j(W_j · f_{j−1}(x) + b_j) < th – Cost manageable w/ GPU • Random forest etc.: any function that fits, f(x) < th
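A minimal sketch of the KDE-style PP from the list above, on 1-D toy data with a Gaussian kernel; the bandwidth and threshold are hypothetical choices, not values from the paper:

```python
import math

def kde(samples, x, h=0.5):
    """Gaussian kernel density estimate at x with bandwidth h."""
    norm = len(samples) * h * math.sqrt(2 * math.pi)
    return sum(math.exp(-0.5 * ((x - s) / h) ** 2) for s in samples) / norm

pos = [2.0, 2.2, 1.8]      # training blobs that satisfy the predicate
neg = [-2.0, -1.8, -2.2]   # training blobs that do not

def pp_keeps(x, th=1.0):
    # keep the blob when the positive/negative density ratio exceeds th
    return kde(pos, x) / (kde(neg, x) + 1e-300) > th

# a blob near the positive cluster passes; one near the negative cluster is dropped
```

Because the score is a density ratio rather than a linear score, this PP handles the nonlinearly-separable case the slide assigns it to.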

  30. PPs FOR ARBITRARY DATA BLOBS • Add dimension reduction – Example: feature hashing, principal component analysis • Add model selection – Select the best model

  31. MODEL SELECTION • Given different PP methods, select the PP that maximizes data reduction rate – Test each PP on a small sample of data • Model selection insights – The input dataset determines PP selection – Given a blob type, the same PP applies across different predicates & accuracy thresholds
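The selection rule can be sketched as follows. The candidate (reduction, accuracy) measurements are hypothetical stand-ins for numbers that would come from testing each PP on a labeled sample:

```python
# Pick the PP with the highest data reduction whose measured accuracy
# meets the query's target accuracy.
candidates = {
    "linear_svm":  {"reduction": 0.50, "accuracy": 0.99},
    "kde":         {"reduction": 0.70, "accuracy": 0.96},
    "shallow_dnn": {"reduction": 0.80, "accuracy": 0.90},
}

def select_pp(candidates, target_accuracy):
    ok = {k: v for k, v in candidates.items() if v["accuracy"] >= target_accuracy}
    return max(ok, key=lambda k: ok[k]["reduction"]) if ok else None

best = select_pp(candidates, target_accuracy=0.95)   # -> "kde"
```

Raising the accuracy target shrinks the feasible set, eventually forcing the cheapest, most conservative PP (or no PP at all).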

  32. PART-2: SUPPORTING COMPLEX PREDICATES • Queries with complex or new predicates – Large space of possible predicates – Costly to train/store a PP for each predicate – PPs for complex predicates do not generalize • Pick the best PP combination – A query optimization problem – Inputs: available PPs, predicate, target accuracy – Goal: find the PP combination that maximizes reduction/cost

  33. COMBINING PPs USING QUERY OPTIMIZATION (QO) • Solution – Build PPs for simple predicates (e.g., PP_red for Red, PP_SUV for SUV) – Use QO to assemble PP combinations (e.g., PP_red ∧ PP_SUV for Red ∧ SUV) – # PPs trained << # predicates

  34. QUERY OPTIMIZATION OVER PPs • Predicate: (𝜏_orange ∨ 𝜏_banana) ∧ 𝜏_red ∧ 𝜏_SUV • Conventional query optimization techniques – Order predicates by data reduction/cost – Do not consider combining predicates

  35. STEP #1: SELECT CANDIDATE PP EXPRESSIONS • Explore necessary conditions of the predicate to improve speedup – Available: PP_red, PP_SUV, PP_orange, PP_banana – Predicate: (𝜏_orange ∨ 𝜏_banana) ∧ 𝜏_red ∧ 𝜏_SUV – Necessary conditions: PP_SUV ∧ PP_red ∧ (PP_orange ∨ PP_banana); PP_SUV ∧ PP_red; PP_orange ∨ PP_banana; PP_red; PP_SUV; … • Exponentially many choices – Greedily find the PP combination with the best reduction/cost
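The greedy step can be sketched as follows. The per-PP reductions and costs are hypothetical, and combining conjunctive PPs here assumes their reductions compose independently, which the toy model states explicitly:

```python
# Candidate PP expressions are necessary conditions of the query predicate;
# score each by data reduction per unit cost and pick greedily.
pps = {"red": (0.6, 1.0), "suv": (0.5, 1.0), "orange": (0.4, 2.0)}  # (reduction, cost)

def combo_stats(names):
    survive, cost = 1.0, 0.0
    for n in names:
        r, c = pps[n]
        survive *= (1 - r)   # independence assumption for a conjunction
        cost += c
    return 1 - survive, cost

candidates = [("red",), ("suv",), ("orange",), ("red", "suv")]

def score(combo):
    red, cost = combo_stats(combo)
    return red / cost

best = max(candidates, key=score)
```

With these numbers the single cheap, highly-selective PP wins: ("red", "suv") reduces more data (0.8) but at twice the cost, so its reduction-per-cost is lower than PP_red alone.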
