

SLIDE 1

Scalable Kernel Density Classification via Threshold-Based Pruning

Edward Gan & Peter Bailis

SLIDE 2

MacroBase: Analytics on Fast Streams

  • Increasing streaming data: manufacturing, sensors, mobile
  • Multi-dimensional data with latent anomalies
  • Running in production (see CIDR '17, SIGMOD '17)
  • End-to-end operator cascades for:
  • Feature transformation
  • Statistical classification
  • Data summarization

SLIDE 3

Example: Space Shuttle Sensors

8 sensors total, e.g. “Fuel Flow” and “Flight Speed” [UCI Repository]

  Speed | Flow | Status
  ------|------|----------
  28    | 27   | Fpv Close
  34    | 43   | High
  52    | 30   | Rad Flow
  28    | 40   | Rad Flow
  …     | …    | …

End goal: explain anomalous speed/flow measurements.
Problem: model the distribution of speed/flow measurements.

SLIDE 4

Difficulties in Data Modelling

[Figure: data histogram vs. a single Gaussian model: poor fit]

SLIDE 5

Difficulties in Data Modelling

[Figure: data histogram vs. a mixture of Gaussians: inaccurate, gaps not captured]

SLIDE 6

Kernel Density Estimation (KDE)

[Figure: data histogram vs. kernel density estimate: much better fit]

SLIDE 7

KDE: Statistical Gold Standard

  • Guaranteed to converge to the underlying distribution
  • Provides normalized, true probability densities
  • Few assumptions about shape of distribution: inferred from data

SLIDE 8

KDE Usage

Galaxy Mass Distribution [Sloan Digital Sky Survey]

Distribution of Bowhead Whales [L.T. Quackenbush et al., Arctic 2010]

SLIDE 9

KDE Definition

  • Each point in the dataset contributes a kernel
  • Kernel: a localized Gaussian “bump”
  • The kernels are summed to form the estimate
  • The result is a mixture of N Gaussians, where N is the dataset size

[Figure: training data, per-point kernels, final estimate]
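The per-point kernel sum can be sketched in a few lines of Python. This is a 1-D illustration with a Gaussian kernel and bandwidth `h`, not the talk's actual code:

```python
import math

def kde_density(x, data, h=1.0):
    """Naive KDE: average a Gaussian kernel centered on each training point.
    Costs O(N) per query, so scoring every point in the dataset is O(N^2)."""
    norm = 1.0 / (len(data) * h * math.sqrt(2 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - p) / h) ** 2) for p in data)

data = [1.0, 1.2, 5.0, 5.1]           # two small clusters
print(kde_density(1.1, data, h=0.5))  # high density near a cluster
print(kde_density(3.0, data, h=0.5))  # low density in the gap between clusters
```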

SLIDE 10

Problem: KDE does not scale

  • O(N) to compute a single density f(x)
  • O(N²) to compute densities for all points in the data
  • 2 hours to compute on 1M points (2.9GHz Core i5)

How can we speed this up?
SLIDE 11

Strawman Optimization: Histograms

Training dataset → binned counts → grid computation

  • Benefit: runtime depends on grid size rather than N
  • Problem: bin explosion in high dimensions

[Wand, J. of Computational and Graphical Statistics 1994]

SLIDE 12

Stepping Back: What users need

Anomaly explanation:
SELECT flight_mode FROM shuttle_sensors WHERE kde(flow, speed) < threshold

Hypothesis testing:
SELECT color FROM galaxies WHERE kde(x, y, z) < threshold

SLIDE 13

From Estimation to Classification

SELECT flight_mode FROM shuttle_sensors WHERE kde(flow, speed) < threshold

Pipeline: Training Data → KDE Model → Classification

SLIDE 14

End to End Query

Kernel Density Estimation, then a Threshold Filter:
  High if f(x) ≥ t
  Low if f(x) < t

Pipeline: Training Data → Densities → Classification

Exact densities are unnecessary for the final output.

SLIDE 15

End to End Query

Kernel Density Estimation combined with the Threshold Filter:
  High if f(x) ≥ t
  Low if f(x) < t

Pipeline: Training Data → Classification (the intermediate densities are skipped)

SLIDE 16

Recap

  • KDE can model complex distributions
  • Problem: KDE scales quadratically with dataset size
  • Real Usage: KDE + Predicates = Kernel Density Classification
  • Idea: Apply Predicate Pushdown to KDE

SLIDE 17

tkdc Algorithm Overview

  1. Pick a threshold
  2. Repeat: calculate bounds on point density
  3. Stop when we can make a classification

SLIDE 18

Classifying the density based on bounds

[Figure: upper and lower bounds bracketing the true density f(x) against the threshold]

Threshold pruning rules: a point is classified as soon as both bounds fall on the same side of the threshold.
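In code, the two pruning rules amount to an interval test against the threshold. `classify_with_bounds` is a hypothetical helper, not the authors' implementation:

```python
def classify_with_bounds(lower, upper, threshold):
    """Threshold pruning rules: classify as soon as the bound interval
    [lower, upper] lies entirely on one side of the threshold."""
    if lower >= threshold:
        return "High"   # even the lowest possible density clears the threshold
    if upper < threshold:
        return "Low"    # even the highest possible density is below the threshold
    return None         # interval straddles the threshold: keep refining bounds

print(classify_with_bounds(0.8, 1.2, 0.5))  # "High"
print(classify_with_bounds(0.1, 0.3, 0.5))  # "Low"
print(classify_with_bounds(0.3, 0.9, 0.5))  # None
```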

SLIDE 19

Iterative Refinement

[Figure: the upper and lower bounds tighten over algorithm iterations until a pruning rule is hit]

How to compute bounds?

SLIDE 20

k-d tree Spatial Indices

  • Divide N-dimensional space one axis at a time
  • Nodes for each region track # of points + bounding box
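A minimal k-d tree carrying exactly the two pieces of state the pruning bounds need, a point count and an axis-aligned bounding box, might look like the following sketch (names and structure are illustrative, not the tKDC source):

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class KDNode:
    """k-d tree node: point count plus bounding box, as the slide describes."""
    count: int
    box_lo: Tuple[float, ...]
    box_hi: Tuple[float, ...]
    left: Optional["KDNode"] = None
    right: Optional["KDNode"] = None

def build(points: List[Tuple[float, ...]], depth: int = 0) -> Optional[KDNode]:
    """Divide N-dimensional space one axis at a time (cycling through axes)."""
    if not points:
        return None
    dims = len(points[0])
    axis = depth % dims
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    lo = tuple(min(p[i] for p in pts) for i in range(dims))
    hi = tuple(max(p[i] for p in pts) for i in range(dims))
    node = KDNode(len(pts), lo, hi)
    if len(pts) > 1:
        node.left = build(pts[:mid], depth + 1)
        node.right = build(pts[mid:], depth + 1)
    return node

root = build([(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 3.0)])
print(root.count, root.box_lo, root.box_hi)
```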

SLIDE 21

Bounding the densities

Given from the k-d tree: a bounding box and the # of points contained.
The total contribution of a region to f(x) can then be bounded between a minimum and a maximum contribution.

[Gray & Moore, ICDM 2003]
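Given a node's bounding box and point count, Gray & Moore-style bounds can be sketched as follows: the node's points are at least `d_min` and at most `d_max` away from the query, so its total contribution lies between `n*K(d_max)` and `n*K(d_min)`. An unnormalized Gaussian kernel and illustrative names are assumed:

```python
import math

def region_bounds(query, box_lo, box_hi, n_pts, h=1.0):
    """Bound a k-d tree node's total kernel contribution from its
    bounding box alone. Returns (lower, upper) bounds."""
    d_min_sq = d_max_sq = 0.0
    for q, lo, hi in zip(query, box_lo, box_hi):
        d_min_sq += max(lo - q, 0.0, q - hi) ** 2  # closest any point can be
        d_max_sq += max(q - lo, hi - q) ** 2       # farthest any point can be
    k = lambda d2: math.exp(-0.5 * d2 / h ** 2)    # unnormalized Gaussian kernel
    return n_pts * k(d_max_sq), n_pts * k(d_min_sq)

lo_b, hi_b = region_bounds((0.0, 0.0), (1.0, 1.0), (2.0, 2.0), n_pts=10)
print(lo_b, hi_b)
```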

SLIDE 22

Iterative Refinement

Start from an initial estimate at the k-d tree root node, then split.
Priority queue: split the nodes with the largest uncertainty first.

[Figure: steps 1-3 of successive node splits refining the bounds on f(x)]
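Putting the pieces together, the priority-queue refinement loop can be sketched in 1-D, with regions as slices of the sorted data standing in for k-d tree nodes. This is a simplified illustration of the idea, assuming an unnormalized threshold, not the real traversal:

```python
import heapq
import math

def classify_tkdc_1d(x, data, threshold, h=1.0):
    """Threshold-based pruning sketch: each slice of n sorted points
    contributes between n*K(d_max) and n*K(d_min); split the most
    uncertain slice first and stop once the total bounds classify."""
    pts = sorted(data)
    K = lambda d: math.exp(-0.5 * (d / h) ** 2)  # unnormalized Gaussian kernel

    def region(i, j):  # bound the contribution of the slice pts[i:j]
        if pts[i] <= x <= pts[j - 1]:
            lo_d = 0.0
        else:
            lo_d = min(abs(x - pts[i]), abs(x - pts[j - 1]))
        hi_d = max(abs(x - pts[i]), abs(x - pts[j - 1]))
        n = j - i
        lb, ub = n * K(hi_d), n * K(lo_d)
        return (-(ub - lb), lb, ub, i, j)  # max-heap keyed on uncertainty

    root = region(0, len(pts))
    heap = [root]
    total_lb, total_ub = root[1], root[2]
    while True:
        if total_lb >= threshold:
            return "High"
        if total_ub < threshold:
            return "Low"
        neg_unc, lb, ub, i, j = heapq.heappop(heap)
        if neg_unc == 0.0:  # every region is exact: bounds have converged
            return "High" if total_lb >= threshold else "Low"
        total_lb -= lb
        total_ub -= ub
        mid = (i + j) // 2
        for a, b in ((i, mid), (mid, j)):  # split and re-bound the halves
            if b > a:
                r = region(a, b)
                total_lb += r[1]
                total_ub += r[2]
                heapq.heappush(heap, r)

data = [1.0, 1.1, 1.2, 5.0]
print(classify_tkdc_1d(1.05, data, threshold=1.0, h=0.5))  # "High"
print(classify_tkdc_1d(3.00, data, threshold=1.0, h=0.5))  # "Low"
```

Python's `heapq` is a min-heap, so the negated uncertainty key pops the widest bound gap first; fully resolved regions (zero gap) sink to the bottom.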

SLIDE 23

tkdc Algorithm Overview

  1. Pick a threshold
     • User-specified
     • Automatically inferred
  2. Calculate bounds on a density
     • k-d tree bounding boxes
  3. Refine the bounds until we can classify
     • Priority-queue guided region splitting

SLIDE 24

Automatic Threshold Selection

  • Probability densities are hard to work with: unpredictable, with a huge range of magnitudes
  • Good default: capture a set % of the data

SELECT Quantile(kde(A,B), 1%) FROM shuttle_sensors

  • Bootstrapping
  • Classification for computing thresholds
  • See paper for details

[Figure: kernel classification feeding back a better threshold estimate]
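The brute-force version of the default rule (score everything, take the quantile) is easy to state; the paper's contribution is inferring this threshold without scoring every point. `quantile_threshold` is an illustrative name:

```python
def quantile_threshold(densities, pct=0.01):
    """Choose the density threshold so that roughly `pct` of the training
    points fall below it. Brute force for illustration only: it needs every
    density, which is exactly the O(N^2) cost tkdc avoids."""
    s = sorted(densities)
    idx = min(len(s) - 1, int(pct * len(s)))
    return s[idx]

dens = [0.01, 0.02, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2]
print(quantile_threshold(dens, pct=0.1))  # 0.02: one point (10%) sits below it
```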

SLIDE 25

tkdc Complete Algorithm

  1. Pick a threshold
     • Inferred given a desired % level
  2. Calculate bounds on a density
     • k-d tree bounding boxes
  3. Refine the bounds until we can make a classification
     • Priority-queue guided region splitting

SLIDE 26

Theorem: Expected Runtime

With n training points in d dimensions, the expected cost drops from 100·n to roughly 100·n^((d-1)/d):

  • 100 million points, 2 dimensions: 100·n^(1/2) vs. 100·n, a speedup of ≈ 10,000x
  • 100 million points, 8 dimensions: 100·n^(7/8) vs. 100·n, a speedup of ≈ 10x
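The two speedup figures follow directly from the n^((d-1)/d) bound, since the constant 100 cancels and the ratio simplifies to n^(1/d); a quick check of the arithmetic:

```python
n = 100_000_000  # 100 million training points

def speedup(n, d):
    """Ratio of brute-force cost 100*n to the bounded cost 100*n^((d-1)/d),
    which simplifies to n^(1/d)."""
    return (100 * n) / (100 * n ** ((d - 1) / d))

print(round(speedup(n, 2)))  # ≈ 10,000x in 2 dimensions
print(round(speedup(n, 8)))  # ≈ 10x in 8 dimensions
```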

SLIDE 27

Runtime in practice: Experimental Setup

Single-threaded, in-memory.
Total Time = Training Time + Threshold Estimation + Classify All.
Threshold = 1% classification rate.

Baselines:
  • simple: naïve for loop over all points
  • kdtree: k-d tree approximate density estimation, no threshold
  • radial: iterates through points, pruning those beyond a certain radius

SLIDE 28

KDE Performance Improvement

[Figure: tkdc runs 1000x-5000x faster than the radial and kdtree baselines]

SLIDE 29

Threshold Pruning Contribution

SLIDE 30

tkdc scales well with dataset size

[Figure: our algorithm, tkdc, shows an asymptotic speedup as dataset size grows]

SLIDE 31

Conclusion

SELECT flight_mode FROM shuttle_sensors WHERE kde(flow, speed) < threshold

  • KDE: powerful but expensive
  • Real queries: MacroBase
  • Pipeline: Training Data → KDE Model → Classification
  • Systems techniques: predicate pushdown, k-d tree indices
  • Result: 1000x and asymptotic speedups

https://github.com/stanford-futuredata/tKDC