DATA ANALYTICS USING DEEP LEARNING
GT 8803 // FALL 2018 // JOY ARULRAJ
LECTURE #02: ACCELERATING MACHINE LEARNING INFERENCE WITH PROBABILISTIC PREDICATES

ANNOUNCEMENTS
• Course webpage:
  – https://jarulraj.github.io/data-analytics-course/
• Start thinking about project topics
  – Read assigned papers for inspiration
• No classes next week
• Submit reviews in PDF format
  – GT username as filename

TODAY’S PAPER
• Accelerating Machine Learning Inference with Probabilistic Predicates
  – Query optimization
  – ML inference queries
• Slides based on a presentation by Yao Lu @ SIGMOD 2018

TODAY’S PAPER
[Figure: layers of a data analytics system: machine translation, query processing, storage management, hardware acceleration]

TODAY’S AGENDA
• Problem Overview
• Key Idea
• Technical Details
• Experiments
• Discussion

ML INFERENCE ON BIG-DATA PLATFORMS
• SQL + user-defined functions (UDFs)
  – On unstructured data blobs
  – Videos, images, and unstructured text

ML INFERENCE
[Figure: untrained neural network → training → inference on new data]
Source: What’s the Difference Between Deep Learning Training and Inference?, Michael Copeland, August 2016, NVIDIA Blog

ML INFERENCE QUERY EXAMPLE
• Find images of oranges
  – Images → UDF_YOLOv2 → σ_orange → Result
  – UDF_YOLOv2 → has person? → has bear? → has orange? → ⋯

ML INFERENCE QUERY EXAMPLE
• Inference takes time
  – Even when the predicate has low selectivity
  – Perhaps only 1-in-100 images have oranges
• Reason
  – Every image has to be processed by all the UDFs
  – Images → UDF_YOLOv2 → σ_orange → Result

PROBLEM OVERVIEW
• How can we accelerate such inference queries?
  – Images → UDF_YOLOv2 → σ_orange → Result

SOLUTION #1: PREDICATE PUSHDOWN
• Traditional query optimization technique
  – Move filtering of data as close to the source as possible to avoid loading unnecessary data into higher-level operators
  – Join Tables A, B → Predicates on A, B → Result
• Cannot push predicates below the UDF
  – No “contains orange” column exists
  – Need to construct it using UDF

SOLUTION #2: PRE-COMPUTING
• Pre-computing all possible columns
  – High cost since too many UDFs & query predicates
• Not a good fit for ad-hoc queries
  – Since only certain columns corresponding to certain images will be required
• Not a good fit for online queries
  – Need to do inference on live data

KEY IDEA
• Accelerate queries by early filtering
  – Images → Filter → UDF_YOLOv2 → σ_orange → Result

KEY IDEA
• Early filter constraints
• Performance
  – Utility of data reduction >> Execution cost of early filter
• Accuracy
  – Early filtering should not increase false negatives
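
The performance constraint can be made concrete with a back-of-the-envelope cost model. The sketch below is an illustration under stated assumptions, not a formula from the slides: it supposes the filter costs `filter_cost` per blob, the UDF costs `udf_cost` per blob, and the filter discards a fraction `reduction` of the input. All names and numbers are hypothetical.

```python
def expected_speedup(udf_cost: float, filter_cost: float, reduction: float) -> float:
    """Speedup of (filter + UDF on survivors) over running the UDF on every blob.

    Assumes per-blob costs are constant and the filter runs on every blob.
    """
    if not 0.0 <= reduction <= 1.0:
        raise ValueError("reduction must be in [0, 1]")
    # Every blob pays for the filter; only the surviving fraction pays for the UDF.
    cost_with_filter = filter_cost + (1.0 - reduction) * udf_cost
    return udf_cost / cost_with_filter


# Hypothetical costs: UDF takes 50 ms per blob, filter takes 1 ms per blob.
good_filter = expected_speedup(udf_cost=50.0, filter_cost=1.0, reduction=0.8)
useless_filter = expected_speedup(udf_cost=50.0, filter_cost=1.0, reduction=0.0)
print(f"80% reduction: {good_filter:.2f}x, 0% reduction: {useless_filter:.2f}x")
```

Under this model a filter that discards nothing makes the query slightly slower, which is why the data reduction has to outweigh the filter's own execution cost.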

EARLY FILTERING
• Images → Filter → UDF_YOLOv2 → σ_orange → Result
• Blobs passed by the filter: TRUE POSITIVE, FALSE POSITIVE

EARLY FILTERING
• Images → Filter → UDF_YOLOv2 → σ_orange → Result
• Blobs discarded by the filter (DATA REDUCTION): FALSE NEGATIVE, TRUE NEGATIVE

EARLY FILTERING
• Images → Filter → UDF_YOLOv2 → σ_orange → Result
• FALSE POSITIVE –, FALSE NEGATIVE ↑
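
The four outcomes above can be tallied directly. The following is a minimal sketch, assuming we have per-blob ground truth (`is_match`, whether the blob really satisfies the predicate) and the filter's decision (`kept`). The function name and the toy data are hypothetical. "Accuracy" here means the fraction of true matches that survive the filter, and "reduction" means the fraction of all blobs discarded.

```python
def filter_metrics(is_match, kept):
    """Summarize an early filter: false negatives, accuracy, and data reduction."""
    total = len(is_match)
    positives = sum(1 for m in is_match if m)
    kept_positives = sum(1 for m, k in zip(is_match, kept) if m and k)
    discarded = sum(1 for k in kept if not k)
    return {
        # Matches the filter threw away; these can never reach the result.
        "false_negatives": positives - kept_positives,
        "accuracy": kept_positives / positives if positives else 1.0,
        "reduction": discarded / total if total else 0.0,
    }


# Toy data: 10 blobs, 3 real matches; the filter keeps 3 blobs and drops 7.
is_match = [True, False, False, True, False, False, False, True, False, False]
kept = [True, True, False, True, False, False, False, False, False, False]
metrics = filter_metrics(is_match, kept)
print(metrics)
```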

PROBABILISTIC EARLY FILTERING
• Unlike queries on relational data
  – ML applications have in-built tolerance for errors
  – ML UDFs generate false positives & false negatives
• So, filters can also be probabilistic!
  – Reducing accuracy can increase data reduction rate

PROBABILISTIC PREDICATES
• Goal: query speedup + desired accuracy
  – Train binary classifiers
  – Group input blobs into two categories
    • Blobs that disagree with the query predicate
    • Blobs that may agree with the query predicate
• Classifiers are called probabilistic predicates
  – <Data reduction rate, Execution cost, Accuracy>

PROBABILISTIC PREDICATES (PPs)
• Images → PP_orange → UDF_YOLOv2 → σ_orange → Result
  – Throughput labels in the figure: 10-30 fps, 50-1K fps
• Low filter execution cost
  – High data reduction
  – Minimal impact on accuracy

PROBABILISTIC PREDICATES (PPs)
• Images → PP_orange → UDF_YOLOv2 → σ_orange → Result
  – Throughput labels in the figure: 10-30 fps, 50-1K fps
• Apply PP directly on raw blob
  – 5-1000x faster than UDF
  – Accuracy vs data-reduction curve
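
The pipeline on this slide can be sketched as a loop that runs the cheap PP first and invokes the UDF only on survivors. This is an illustrative sketch, not the system's implementation: `run_with_pp`, `toy_udf`, and the per-image "orange fraction" feature are hypothetical stand-ins for a real PP and a real detector.

```python
def run_with_pp(blobs, pp_score, th, udf, predicate):
    """Apply a PP on raw blobs, then run the UDF and predicate on survivors.

    Returns the matching blobs and the number of UDF invocations.
    """
    results = []
    udf_calls = 0
    for blob in blobs:
        if pp_score(blob) <= th:
            continue  # PP discards the blob; the expensive UDF is never called
        udf_calls += 1
        if predicate(udf(blob)):
            results.append(blob)
    return results, udf_calls


def toy_udf(blob):
    """Stand-in for an object detector: returns the set of detected labels."""
    return {"orange"} if blob[1] > 0.5 else set()


# Each toy image is (id, fraction of orange-colored pixels); values are made up.
images = [("a", 0.05), ("b", 0.9), ("c", 0.3), ("d", 0.6), ("e", 0.0)]
matches, calls = run_with_pp(
    images,
    pp_score=lambda blob: blob[1],
    th=0.2,
    udf=toy_udf,
    predicate=lambda labels: "orange" in labels,
)
print([name for name, _ in matches], calls)
```

In this toy run the UDF is invoked on three of the five images instead of all five, and both true matches are still returned.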

SYSTEM WORKFLOW: BASELINE SYSTEM W/O PPs
[Figure (a), baseline system w/o PPs: Query, QO, Input, Plan, Results]

SYSTEM WORKFLOW: CONSTRUCTING PPs
[Figure (b), constructing PPs: Queries, Query Inputs, Query Results, PP Trainer, PPs]

SYSTEM WORKFLOW: FULL SYSTEM W/ PPs
[Figure (c), full system w/ PPs: Query, QO*, Input, PPs, Plan*, Results]

CHALLENGES
• How to build useful PPs?
  – Good trade-off between data reduction rate, cost, and accuracy
• How to support complex query predicates?
  – Using simple PPs for ad-hoc queries

PART-1: BUILDING USEFUL PPs
• Probabilistic predicate
  – Can be thought of as a decision boundary separating two classes
  – Any classifier that can identify inputs far away from the decision boundary is a useful PP
• Use different techniques for building PPs
  – Support vector machines (SVMs)
  – Deep neural networks, etc.

SIMPLE PP USING LINEAR CLASSIFIER
• PP_orange: $f(x) = w \cdot x + b$
  – Classes: has orange (+), has no orange (-)
  – PP discards blobs with $f(x) \le th$
• Setting the accuracy/reduction tradeoff threshold ($th$)
  – [Figure: ten blobs ordered by increasing $f(x)$: - - - - - + - + - +]
  – Lower $th$: Accuracy = 3/3, Reduction = 5/10
  – Higher $th$: Accuracy = 2/3, Reduction = 7/10
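
The two operating points on this slide can be reproduced with a small threshold search. This is a minimal sketch, assuming the ten blobs already have scores from some linear classifier. The scores 1.0 to 10.0 are made up so that the label order matches the figure, and `pick_threshold` is a hypothetical helper, not the paper's procedure.

```python
import math


def pick_threshold(scores, labels, target_accuracy):
    """Return (th, reduction): the threshold with the highest data reduction
    that still keeps at least `target_accuracy` of the positive blobs.
    A blob is discarded when its score is <= th.
    """
    positives = sum(labels)
    if positives == 0:
        raise ValueError("need at least one positive example")
    best_th, best_reduction = -math.inf, 0.0
    for th in sorted(set(scores)):
        kept_pos = sum(1 for s, y in zip(scores, labels) if y and s > th)
        accuracy = kept_pos / positives
        reduction = sum(1 for s in scores if s <= th) / len(scores)
        if accuracy >= target_accuracy and reduction > best_reduction:
            best_th, best_reduction = th, reduction
    return best_th, best_reduction


# Ten blobs ordered by score; labels follow the slide: - - - - - + - + - +
scores = [float(i) for i in range(1, 11)]
labels = [0, 0, 0, 0, 0, 1, 0, 1, 0, 1]

strict = pick_threshold(scores, labels, target_accuracy=1.0)
relaxed = pick_threshold(scores, labels, target_accuracy=2 / 3)
print(strict, relaxed)
```

With a target accuracy of 3/3 the search stops at a reduction of 5/10, and relaxing the target to 2/3 raises the reduction to 7/10, matching the two settings on the slide.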

PPs FOR ARBITRARY DATA BLOBS
• Input blob characteristics
  – Linearly separable or not
  – High dimensional data
  – Sparse or dense
• Blob types: Documents, Images, Videos, Audio

PPs FOR ARBITRARY DATA BLOBS
• Linear SVM: $f_{lsvm}(x) = w \cdot x + b < th$
• Kernel density estimator: $f_{kde}(x) = d_{+}(x) / d_{-}(x) < th$
• Shallow DNN: $f_i(x) = g_i(W_i \cdot f_{i-1}(x) + b_i) < th$
• More? Random forest etc.: any function that fits $f(x) < th$
• [Figure: training cost and inference cost per method (DNN cost shown w/ GPU); methods ordered from linearly-separable data to nonlinearly-separable data]

PPs FOR ARBITRARY DATA BLOBS
• Same PP methods: linear SVM, kernel density estimator, shallow DNN, random forest etc.
• + Dimension reduction
  – Example: feature hashing, principal component analysis
• + Model selection
  – Select the best model
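
To make the kernel density estimator variant concrete, here is a minimal sketch of a density-ratio PP. It assumes one-dimensional features and a Gaussian kernel with a hand-picked bandwidth; the slides do not specify the kernel or how features are obtained, so those choices and all names are assumptions. A real deployment would first apply a dimension reduction step to high-dimensional blobs.

```python
import math


def gaussian_kde(points, bandwidth):
    """Return a 1-D Gaussian kernel density estimate fitted to `points`."""
    norm = 1.0 / (len(points) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in points
        )

    return density


def kde_ratio_pp(positives, negatives, bandwidth, th):
    """Build a PP that discards x when d_plus(x) / d_minus(x) < th."""
    d_plus = gaussian_kde(positives, bandwidth)
    d_minus = gaussian_kde(negatives, bandwidth)

    def keep(x):
        # The small constant avoids division by zero far from all negatives.
        return d_plus(x) / (d_minus(x) + 1e-12) >= th

    return keep


# Made-up training features: matches cluster near 2.0, non-matches near 0.0.
keep = kde_ratio_pp(
    positives=[1.8, 2.0, 2.2],
    negatives=[-0.2, 0.0, 0.1, 0.3],
    bandwidth=0.3,
    th=1.0,
)
print(keep(2.0), keep(0.0))
```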

MODEL SELECTION
• Given different PP methods, select the PP that maximizes data reduction rate
  – Test PP on a small sample of data
• Model selection insights
  – Input dataset determines PP selection
  – Given a blob type, same PP applies for different predicates & accuracy thresholds
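
Model selection as described here amounts to scoring each candidate PP on a small labeled sample and keeping the one with the highest data reduction at the target accuracy. The sketch below assumes each candidate is just a scoring function; `select_pp`, `reduction_at_accuracy`, and the two toy candidates are hypothetical.

```python
def reduction_at_accuracy(scores, labels, target_accuracy):
    """Best fraction of blobs discarded (score <= th) while keeping at least
    `target_accuracy` of the positives, over all thresholds th."""
    positives = sum(labels)
    best = 0.0
    for th in set(scores):
        kept_pos = sum(1 for s, y in zip(scores, labels) if y and s > th)
        if positives and kept_pos / positives >= target_accuracy:
            best = max(best, sum(1 for s in scores if s <= th) / len(scores))
    return best


def select_pp(candidates, sample, labels, target_accuracy):
    """Return (name, reduction) of the candidate with the highest reduction."""
    best_name, best_reduction = None, -1.0
    for name, score_fn in candidates.items():
        scores = [score_fn(x) for x in sample]
        reduction = reduction_at_accuracy(scores, labels, target_accuracy)
        if reduction > best_reduction:
            best_name, best_reduction = name, reduction
    return best_name, best_reduction


# Toy sample of 2-D feature vectors; a blob matches when its first feature is high.
sample = [(0.1, 0.9), (0.2, 0.1), (0.8, 0.5), (0.9, 0.3), (0.3, 0.7), (0.7, 0.2)]
labels = [0, 0, 1, 1, 0, 1]
candidates = {
    "first_feature": lambda x: x[0],   # informative for this predicate
    "second_feature": lambda x: x[1],  # uninformative for this predicate
}
winner = select_pp(candidates, sample, labels, target_accuracy=1.0)
print(winner)
```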

PART-2: SUPPORTING COMPLEX PREDICATES
• Queries with complex or new predicates
  – Large space of possible predicates
  – Costly to train/store a PP for each predicate
  – PPs for complex predicates do not generalize
• Pick best PP combination
  – Query optimization problem
  – Inputs: available PPs, predicate, target accuracy
  – Goal: find PP combination ⇒ max reduction / cost

COMBINING PPs USING QUERY OPTIMIZATION (QO)
• Solution
  – Build PPs for simple predicates
  – Use QO to assemble PP combinations
• # PPs trained << # predicates
• [Figure: the predicate Red ∧ SUV is assembled from the trained PPs PP_Red and PP_SUV; two further trained PPs are shown whose labels are garbled in the source]
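
A minimal sketch of how simple PPs might be composed for Red ∧ SUV: a blob can only satisfy the conjunction if it passes both PPs, and a disjunction is ruled out only when every PP rejects the blob. The combinators `pp_and` and `pp_or`, the toy features, and the cutoffs are all hypothetical, and the sketch ignores how a target accuracy would be divided among the individual PPs.

```python
def pp_and(*pps):
    """PP for a conjunction: keep a blob only if every PP keeps it."""
    return lambda blob: all(pp(blob) for pp in pps)


def pp_or(*pps):
    """PP for a disjunction: keep a blob if at least one PP keeps it."""
    return lambda blob: any(pp(blob) for pp in pps)


# Hypothetical cheap PPs over made-up per-frame features.
def pp_red(blob):
    return blob["redness"] > 0.3


def pp_suv(blob):
    return blob["boxiness"] > 0.4


frames = [
    {"id": 1, "redness": 0.9, "boxiness": 0.8},
    {"id": 2, "redness": 0.9, "boxiness": 0.1},
    {"id": 3, "redness": 0.1, "boxiness": 0.9},
    {"id": 4, "redness": 0.0, "boxiness": 0.0},
]

red_and_suv = pp_and(pp_red, pp_suv)
red_or_suv = pp_or(pp_red, pp_suv)
kept_and = [f["id"] for f in frames if red_and_suv(f)]
kept_or = [f["id"] for f in frames if red_or_suv(f)]
print(kept_and, kept_or)
```

Only two PPs are trained here, yet they cover both the conjunctive and the disjunctive predicate, which is the point of "# PPs trained << # predicates".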

QUERY OPTIMIZATION OVER PPs
• Predicate: (σ_orange ∨ σ_banana) ∧ σ_X ∧ σ_dog
  – X stands for a label that is garbled in the source
• Conventional query optimization technique
  – Orders predicates by data reduction/cost
  – Does not focus on combining predicates

STEP #1: SELECT CANDIDATE PP EXPRESSIONS
• Explore necessary conditions to satisfy predicate for improving speedup
  – Available: PP_X, PP_dog, PP_orange, PP_banana (X stands for a label that is garbled in the source)
  – Predicate: (σ_orange ∨ σ_banana) ∧ σ_X ∧ σ_dog
• Necessary conds.
  – ⇒ PP_dog ∧ PP_X ∧ (PP_orange ∨ PP_banana)
  – ⇒ PP_dog ∧ PP_X
  – ⇒ PP_orange ∨ PP_banana
  – ⇒ PP_X
  – ⇒ PP_dog
• Greedily find a PP combination that has the best reduction / cost
  – Exponentially many choices
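
One way to picture the greedy step is forward selection over the conjuncts of a necessary condition. The sketch below is an assumption-laden illustration, not the paper's optimizer: it treats PP reductions as independent, ranks candidates by marginal reduction per unit cost, and stops once the next PP would cost more than the UDF work it saves. The PP names, reductions, and costs are invented.

```python
def greedy_conjunction(pps, udf_cost):
    """Greedily build a conjunction of PPs.

    pps: dict mapping name -> (reduction, cost per blob), with cost > 0.
    Returns (chosen names, combined reduction, total PP cost per blob).
    Assumes PP reductions are independent of each other.
    """
    remaining = dict(pps)
    chosen, kept_fraction, total_cost = [], 1.0, 0.0
    while remaining:
        # Marginal reduction per unit cost, given the blobs that still survive.
        def ratio(name):
            reduction, cost = remaining[name]
            return kept_fraction * reduction / cost

        name = max(remaining, key=ratio)
        reduction, cost = remaining[name]
        saved_udf_work = kept_fraction * reduction * udf_cost
        if cost >= saved_udf_work:
            break  # best remaining PP no longer pays for itself
        del remaining[name]
        chosen.append(name)
        kept_fraction *= 1.0 - reduction
        total_cost += cost
    return chosen, 1.0 - kept_fraction, total_cost


# Hypothetical (reduction, cost in ms) estimates for three available PPs.
available = {"pp_a": (0.6, 1.0), "pp_b": (0.5, 2.0), "pp_c": (0.2, 5.0)}
plan = greedy_conjunction(available, udf_cost=50.0)
print(plan)
```

The greedy pass examines each PP at most once per round, which avoids enumerating the exponentially many combinations, at the price of possibly missing the best one.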
