

SLIDE 1

Scalable Kernel Density Classification via Threshold-Based Pruning

Edward Gan & Peter Bailis

SLIDE 2

MacroBase: Analytics on Fast Streams

  • Increasing streaming data: manufacturing, sensors, mobile
  • Multi-dimensional data with latent anomalies
  • Running in production (see CIDR '17, SIGMOD '17)
  • End-to-end operator cascades for:
  • Feature transformation
  • Statistical classification
  • Data summarization

SLIDE 3

Example: Space Shuttle Sensors

8 sensors total, e.g. “Fuel Flow” and “Flight Speed” [UCI Repository]

  Speed | Flow | Status
  ------|------|----------
  28    | 27   | Fpv Close
  34    | 43   | High
  52    | 30   | Rad Flow
  28    | 40   | Rad Flow
  …     | …    | …

End goal: explain anomalous speed/flow measurements.
Problem: model the distribution of speed/flow measurements.

SLIDE 4

Difficulties in Data Modelling

[Figure: data histogram vs. a single Gaussian model: poor fit]

SLIDE 5

Difficulties in Data Modelling

[Figure: data histogram vs. a mixture of Gaussians: inaccurate, gaps not captured]

SLIDE 6

Kernel Density Estimation (KDE)

[Figure: data histogram vs. kernel density estimate: much better fit]

SLIDE 7

KDE: Statistical Gold Standard

  • Guaranteed to converge to the underlying distribution
  • Provides normalized, true probability densities
  • Few assumptions about shape of distribution: inferred from data

SLIDE 8

KDE Usage

Galaxy Mass Distribution [Sloan Digital Sky Survey]

Distribution of Bowhead Whales [L.T. Quackenbush et al., Arctic 2010]

SLIDE 9

KDE Definition

  • Each point in the dataset contributes a kernel
  • Kernel: a localized Gaussian “bump”
  • The kernels are summed to form the estimate
  • The result is a mixture of N Gaussians, where N is the dataset size

[Figure: training data, per-point kernels, final estimate]
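The per-point kernel sum can be sketched in a few lines of Python. This is a 1-D illustration with a Gaussian kernel and bandwidth `h`, not the talk's actual code:

```python
import math

def kde_density(x, data, h=1.0):
    """Naive KDE: average a Gaussian kernel centered on each training point.
    Costs O(N) per query, so scoring every point in the dataset is O(N^2)."""
    norm = 1.0 / (len(data) * h * math.sqrt(2 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - p) / h) ** 2) for p in data)

data = [1.0, 1.2, 5.0, 5.1]           # two small clusters
print(kde_density(1.1, data, h=0.5))  # high density near a cluster
print(kde_density(3.0, data, h=0.5))  # low density in the gap between clusters
```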

SLIDE 10

Problem: KDE does not scale

  • O(N) to compute a single density f(x)
  • O(N²) to compute densities for all points in the data
  • 2 hours to compute on 1M points (2.9GHz Core i5)

How can we speed this up?
SLIDE 11

Strawman Optimization: Histograms

Training dataset → binned counts → grid computation

  • Benefit: runtime depends on grid size rather than N
  • Problem: bin explosion in high dimensions

[Wand, J. of Computational and Graphical Statistics 1994]

SLIDE 12

Stepping Back: What users need

Anomaly explanation:
SELECT flight_mode FROM shuttle_sensors WHERE kde(flow, speed) < threshold

Hypothesis testing:
SELECT color FROM galaxies WHERE kde(x, y, z) < threshold

SLIDE 13

From Estimation to Classification

SELECT flight_mode FROM shuttle_sensors WHERE kde(flow, speed) < threshold

Pipeline: Training Data → KDE Model → Classification

SLIDE 14

End to End Query

Kernel Density Estimation, then a Threshold Filter:
  High if f(x) ≥ t
  Low if f(x) < t

Pipeline: Training Data → Densities → Classification

Exact densities are unnecessary for the final output.

SLIDE 15

End to End Query

Kernel Density Estimation combined with the Threshold Filter:
  High if f(x) ≥ t
  Low if f(x) < t

Pipeline: Training Data → Classification (the intermediate densities are skipped)

SLIDE 16

Recap

  • KDE can model complex distributions
  • Problem: KDE scales quadratically with dataset size
  • Real Usage: KDE + Predicates = Kernel Density Classification
  • Idea: Apply Predicate Pushdown to KDE

SLIDE 17

tkdc Algorithm Overview

  1. Pick a threshold
  2. Repeat: calculate bounds on point density
  3. Stop when we can make a classification

SLIDE 18

Classifying the density based on bounds

[Figure: upper and lower bounds bracketing the true density f(x) against the threshold]

Threshold pruning rules: a point is classified as soon as both bounds fall on the same side of the threshold.
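In code, the two pruning rules amount to an interval test against the threshold. `classify_with_bounds` is a hypothetical helper, not the authors' implementation:

```python
def classify_with_bounds(lower, upper, threshold):
    """Threshold pruning rules: classify as soon as the bound interval
    [lower, upper] lies entirely on one side of the threshold."""
    if lower >= threshold:
        return "High"   # even the lowest possible density clears the threshold
    if upper < threshold:
        return "Low"    # even the highest possible density is below the threshold
    return None         # interval straddles the threshold: keep refining bounds

print(classify_with_bounds(0.8, 1.2, 0.5))  # "High"
print(classify_with_bounds(0.1, 0.3, 0.5))  # "Low"
print(classify_with_bounds(0.3, 0.9, 0.5))  # None
```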

SLIDE 19

Iterative Refinement

[Figure: the upper and lower bounds tighten over algorithm iterations until a pruning rule is hit]

How to compute bounds?

SLIDE 20

k-d tree Spatial Indices

  • Divide N-dimensional space one axis at a time
  • Nodes for each region track # of points + bounding box
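A minimal k-d tree carrying exactly the two pieces of state the pruning bounds need, a point count and an axis-aligned bounding box, might look like the following sketch (names and structure are illustrative, not the tKDC source):

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class KDNode:
    """k-d tree node: point count plus bounding box, as the slide describes."""
    count: int
    box_lo: Tuple[float, ...]
    box_hi: Tuple[float, ...]
    left: Optional["KDNode"] = None
    right: Optional["KDNode"] = None

def build(points: List[Tuple[float, ...]], depth: int = 0) -> Optional[KDNode]:
    """Divide N-dimensional space one axis at a time (cycling through axes)."""
    if not points:
        return None
    dims = len(points[0])
    axis = depth % dims
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    lo = tuple(min(p[i] for p in pts) for i in range(dims))
    hi = tuple(max(p[i] for p in pts) for i in range(dims))
    node = KDNode(len(pts), lo, hi)
    if len(pts) > 1:
        node.left = build(pts[:mid], depth + 1)
        node.right = build(pts[mid:], depth + 1)
    return node

root = build([(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 3.0)])
print(root.count, root.box_lo, root.box_hi)
```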

SLIDE 21

Bounding the densities

Given from the k-d tree: a bounding box and the # of points contained.
The total contribution of a region to f(x) can then be bounded between a minimum and a maximum contribution.

[Gray & Moore, ICDM 2003]
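Given a node's bounding box and point count, Gray & Moore-style bounds can be sketched as follows: the node's points are at least `d_min` and at most `d_max` away from the query, so its total contribution lies between `n*K(d_max)` and `n*K(d_min)`. An unnormalized Gaussian kernel and illustrative names are assumed:

```python
import math

def region_bounds(query, box_lo, box_hi, n_pts, h=1.0):
    """Bound a k-d tree node's total kernel contribution from its
    bounding box alone. Returns (lower, upper) bounds."""
    d_min_sq = d_max_sq = 0.0
    for q, lo, hi in zip(query, box_lo, box_hi):
        d_min_sq += max(lo - q, 0.0, q - hi) ** 2  # closest any point can be
        d_max_sq += max(q - lo, hi - q) ** 2       # farthest any point can be
    k = lambda d2: math.exp(-0.5 * d2 / h ** 2)    # unnormalized Gaussian kernel
    return n_pts * k(d_max_sq), n_pts * k(d_min_sq)

lo_b, hi_b = region_bounds((0.0, 0.0), (1.0, 1.0), (2.0, 2.0), n_pts=10)
print(lo_b, hi_b)
```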

SLIDE 22

Iterative Refinement

Start from an initial estimate at the k-d tree root node, then split.
Priority queue: split the nodes with the largest uncertainty first.

[Figure: steps 1-3 of successive node splits refining the bounds on f(x)]
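Putting the pieces together, the priority-queue refinement loop can be sketched in 1-D, with regions as slices of the sorted data standing in for k-d tree nodes. This is a simplified illustration of the idea, assuming an unnormalized threshold, not the real traversal:

```python
import heapq
import math

def classify_tkdc_1d(x, data, threshold, h=1.0):
    """Threshold-based pruning sketch: each slice of n sorted points
    contributes between n*K(d_max) and n*K(d_min); split the most
    uncertain slice first and stop once the total bounds classify."""
    pts = sorted(data)
    K = lambda d: math.exp(-0.5 * (d / h) ** 2)  # unnormalized Gaussian kernel

    def region(i, j):  # bound the contribution of the slice pts[i:j]
        if pts[i] <= x <= pts[j - 1]:
            lo_d = 0.0
        else:
            lo_d = min(abs(x - pts[i]), abs(x - pts[j - 1]))
        hi_d = max(abs(x - pts[i]), abs(x - pts[j - 1]))
        n = j - i
        lb, ub = n * K(hi_d), n * K(lo_d)
        return (-(ub - lb), lb, ub, i, j)  # max-heap keyed on uncertainty

    root = region(0, len(pts))
    heap = [root]
    total_lb, total_ub = root[1], root[2]
    while True:
        if total_lb >= threshold:
            return "High"
        if total_ub < threshold:
            return "Low"
        neg_unc, lb, ub, i, j = heapq.heappop(heap)
        if neg_unc == 0.0:  # every region is exact: bounds have converged
            return "High" if total_lb >= threshold else "Low"
        total_lb -= lb
        total_ub -= ub
        mid = (i + j) // 2
        for a, b in ((i, mid), (mid, j)):  # split and re-bound the halves
            if b > a:
                r = region(a, b)
                total_lb += r[1]
                total_ub += r[2]
                heapq.heappush(heap, r)

data = [1.0, 1.1, 1.2, 5.0]
print(classify_tkdc_1d(1.05, data, threshold=1.0, h=0.5))  # "High"
print(classify_tkdc_1d(3.00, data, threshold=1.0, h=0.5))  # "Low"
```

Python's `heapq` is a min-heap, so the negated uncertainty key pops the widest bound gap first; fully resolved regions (zero gap) sink to the bottom.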

SLIDE 23

tkdc Algorithm Overview

  1. Pick a threshold
     • User-specified
     • Automatically inferred
  2. Calculate bounds on a density
     • k-d tree bounding boxes
  3. Refine the bounds until we can classify
     • Priority-queue guided region splitting

SLIDE 24

Automatic Threshold Selection

  • Probability densities are hard to work with: unpredictable, with a huge range of magnitudes
  • Good default: capture a set % of the data

SELECT Quantile(kde(A,B), 1%) FROM shuttle_sensors

  • Bootstrapping
  • Classification for computing thresholds
  • See paper for details

[Figure: kernel classification feeding back a better threshold estimate]
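The brute-force version of the default rule (score everything, take the quantile) is easy to state; the paper's contribution is inferring this threshold without scoring every point. `quantile_threshold` is an illustrative name:

```python
def quantile_threshold(densities, pct=0.01):
    """Choose the density threshold so that roughly `pct` of the training
    points fall below it. Brute force for illustration only: it needs every
    density, which is exactly the O(N^2) cost tkdc avoids."""
    s = sorted(densities)
    idx = min(len(s) - 1, int(pct * len(s)))
    return s[idx]

dens = [0.01, 0.02, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2]
print(quantile_threshold(dens, pct=0.1))  # 0.02: one point (10%) sits below it
```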

SLIDE 25

tkdc Complete Algorithm

  1. Pick a threshold
     • Inferred given a desired % level
  2. Calculate bounds on a density
     • k-d tree bounding boxes
  3. Refine the bounds until we can make a classification
     • Priority-queue guided region splitting

SLIDE 26

Theorem: Expected Runtime

With n training points in d dimensions, the expected cost drops from 100·n to roughly 100·n^((d-1)/d):

  • 100 million points, 2 dimensions: 100·n^(1/2) vs. 100·n, a speedup of ≈ 10,000x
  • 100 million points, 8 dimensions: 100·n^(7/8) vs. 100·n, a speedup of ≈ 10x
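The two speedup figures follow directly from the n^((d-1)/d) bound, since the constant 100 cancels and the ratio simplifies to n^(1/d); a quick check of the arithmetic:

```python
n = 100_000_000  # 100 million training points

def speedup(n, d):
    """Ratio of brute-force cost 100*n to the bounded cost 100*n^((d-1)/d),
    which simplifies to n^(1/d)."""
    return (100 * n) / (100 * n ** ((d - 1) / d))

print(round(speedup(n, 2)))  # ≈ 10,000x in 2 dimensions
print(round(speedup(n, 8)))  # ≈ 10x in 8 dimensions
```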

SLIDE 27

Runtime in practice: Experimental Setup

Single-threaded, in-memory.
Total Time = Training Time + Threshold Estimation + Classify All.
Threshold = 1% classification rate.

Baselines:
  • simple: naïve for loop over all points
  • kdtree: k-d tree approximate density estimation, no threshold
  • radial: iterates through points, pruning those beyond a certain radius

SLIDE 28

KDE Performance Improvement

[Figure: tkdc runs 1000x-5000x faster than the radial and kdtree baselines]

SLIDE 29

Threshold Pruning Contribution

SLIDE 30

tkdc scales well with dataset size

[Figure: our algorithm, tkdc, shows an asymptotic speedup as dataset size grows]

SLIDE 31

Conclusion

SELECT flight_mode FROM shuttle_sensors WHERE kde(flow, speed) < threshold

  • KDE: powerful but expensive
  • Real queries: MacroBase
  • Pipeline: Training Data → KDE Model → Classification
  • Systems techniques: predicate pushdown, k-d tree indices
  • Result: 1000x and asymptotic speedups

https://github.com/stanford-futuredata/tKDC