SLIDE 1

Sketching Linear Classifiers Over Data Streams

Kai Sheng Tai, Vatsal Sharan, Peter Bailis, Gregory Valiant
Stanford University

SLIDE 2

High-dimensional linear classifiers on streams

  • Ubiquitous: spam detection, ad click prediction, network traffic classification, ...
  • Fast: computationally cheap inference and updates
  • Adaptive: updated online in response to changing data distributions

Problem: high memory usage. Lots of features ⇒ more expressive classifiers, BUT more memory needed to store weights.

SLIDE 3

Example: Traffic classification with limited memory

[Diagram: a network packet (Version: IPv4, Src: 136.0.1.1, Dest: 129.0.1.1, ...) is expanded into prefix features (Src[:1] = 136, Dest[:1] = 129, Src[:2] = 136.0, Dest[:2] = 129.0, ...); a classifier running on a network switch uses them to accept or reject (filter) each packet.]

We want classifiers that adhere to strict memory budgets (e.g., 1 MB), but we also want accuracy: more features and feature combinations.
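A minimal sketch of the prefix featurization the diagram suggests (illustrative only; the function name and feature-string format are assumptions, not the paper's code):

    def prefix_features(src, dest):
        """Expand source/destination IPs into octet-prefix features,
        e.g., Src[:1] = 136, Src[:2] = 136.0, ..."""
        feats = []
        for name, ip in (("Src", src), ("Dest", dest)):
            octets = ip.split(".")
            for k in range(1, len(octets) + 1):
                feats.append(f"{name}[:{k}] = " + ".".join(octets[:k]))
        return feats

    print(prefix_features("136.0.1.1", "129.0.1.1"))
    # ['Src[:1] = 136', 'Src[:2] = 136.0', ..., 'Dest[:4] = 129.0.1.1']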

SLIDE 4

More broadly: Online learning on memory-constrained devices

SLIDE 5

Problem: How to restrict memory usage while preserving accuracy?

Proposal 1: Use only most informative features?

  • In streaming setting, often don’t know feature importance a priori
  • Feature importance can change over time (e.g., spam classification)

Proposal 2: Use only most frequent features?

  • Most frequent ≠ most informative
SLIDE 6

Sketches for memory-limited stream processing

Can we adapt existing sketching algorithms for use in memory-limited streaming classification?

There is a long line of work on memory-efficient sketches for stream processing, e.g., identifying the k most frequent items in a stream (the “heavy hitters” problem):

  • Count-Sketch [Charikar, Chen & Farach-Colton ‘02]
  • Count-Min Sketch [Cormode & Muthukrishnan ‘05]

Yes: our contribution is the Weight-Median Sketch, a new sketch for linear classifiers. Main idea: where heavy-hitter sketches recover the most frequent items, the WM-Sketch recovers the highest-magnitude weights.
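For reference, a minimal Count-Sketch in Python (an illustrative sketch, not the paper's implementation; production versions use pairwise-independent hash families rather than Python's built-in hash, which is only stable within a single process):

    import numpy as np

    class CountSketch:
        """Minimal Count-Sketch: s rows, each hashing an item to one of
        `width` buckets with a pseudo-random +/-1 sign."""
        def __init__(self, s, width, seed=0):
            rng = np.random.default_rng(seed)
            self.s, self.width = s, width
            self.table = np.zeros((s, width))
            self.bucket_seed = [int(v) for v in rng.integers(1 << 31, size=s)]
            self.sign_seed = [int(v) for v in rng.integers(1 << 31, size=s)]

        def _bucket(self, r, i):
            return hash((self.bucket_seed[r], i)) % self.width

        def _sign(self, r, i):
            return 1 if hash((self.sign_seed[r], i)) % 2 == 0 else -1

        def update(self, i, count=1):
            for r in range(self.s):
                self.table[r, self._bucket(r, i)] += self._sign(r, i) * count

        def query(self, i):
            # Median across rows of the signed bucket values.
            return float(np.median([self._sign(r, i) *
                                    self.table[r, self._bucket(r, i)]
                                    for r in range(self.s)]))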

SLIDE 7

This work: Sketched linear classifiers with online updates

Instead of a high-dimensional classifier, maintain a memory-efficient sketch

SLIDE 14

This work: Sketched linear classifiers with online updates

  • Instead of a high-dimensional classifier, maintain a memory-efficient sketch
  • Update the classifier as new examples are observed

How accurate is the sketched classifier? How do the sketched weights relate to the weights of the original, high-dimensional model?

SLIDE 15

Related Work

Streaming algorithms:

  • Finding frequent items in data streams [Charikar et al. ‘02, Cormode & Muthukrishnan ‘05, etc.]
  • Identifying differences between streams [Schweller et al. ‘04, Cormode & Muthukrishnan ‘05, etc.]

Machine learning:

  • Resource-constrained learning [Konecny et al. ‘15, Gupta et al. ‘17, Kumar et al. ‘17]
  • Sparsity-inducing regularization [Tibshirani ‘96 & many others]
  • Learning compressed classifiers (e.g., feature hashing) [Shi et al. ‘09, Weinberger et al. ‘09, Calderbank et al. ‘09]

SLIDE 16
  • 1. algorithm
  • 2. evaluation
  • 3. applications
SLIDE 17

The WM-Sketch: an extension of the Count-Sketch

[Diagram: Count-Sketch update vs. WM-Sketch update; each structure is an s × k/s array, and index i is hashed into one bucket per row. The Count-Sketch applies count increments; the WM-Sketch applies gradient estimates.]

Count-Sketch: maintain a low-dimensional sketch of counts (an s × k/s array). Update: hash each index i to s buckets (one per row) and apply a signed additive update.

WM-Sketch: maintain a low-dimensional sketch of weights with the same layout. Update: gradient descent on the sketched weights.
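A minimal sketch of this update under simplifying assumptions (logistic loss, L2 regularization, and averaging over rows instead of the paper's scaled random projection); it extends the illustrative CountSketch above:

    import math

    class WMSketch(CountSketch):
        """Weight-Median Sketch (simplified): the Count-Sketch table now
        holds weights, updated by online gradient descent."""
        def __init__(self, s, width, lr=0.1, l2=1e-4, seed=0):
            super().__init__(s, width, seed)
            self.lr, self.l2 = lr, l2

        def score(self, x):
            # Inner product with the sketched weights; averaging over the
            # s rows keeps each row's cell tracking the same weight scale.
            return sum(v * self._sign(r, i) * self.table[r, self._bucket(r, i)]
                       for i, v in x.items()
                       for r in range(self.s)) / self.s

        def update_example(self, x, y):
            """One online step of L2-regularized logistic regression on the
            sketched weights; x maps feature -> value, y is +1 or -1."""
            g = -y / (1.0 + math.exp(y * self.score(x)))  # dloss/dscore
            self.table *= (1.0 - self.lr * self.l2)       # weight decay
            for i, v in x.items():
                for r in range(self.s):
                    self.table[r, self._bucket(r, i)] -= (
                        self.lr * g * v * self._sign(r, i) / self.s)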

SLIDE 18

The WM-Sketch: an extension of the Count-Sketch

[Diagram: Count-Sketch query vs. WM-Sketch query; index i is hashed into one bucket per row of the s × k/s array.]

Same query procedure for both: hash i to one bucket per row, read the s signed cells, and compute their median.

  • Count-Sketch → low-error estimates of the largest counts
  • WM-Sketch → low-error estimates of the largest-magnitude weights

(Note: standard feature hashing does not support weight recovery.)
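Continuing the illustrative code above, weight recovery is just the inherited median query (toy data, hypothetical feature names):

    # A toy stream where "spam_word" is strongly predictive.
    wm = WMSketch(s=5, width=64)
    for _ in range(500):
        wm.update_example({"spam_word": 1.0, "noise": 1.0}, y=+1)
        wm.update_example({"ham_word": 1.0, "noise": 1.0}, y=-1)

    print(wm.query("spam_word"))  # large positive estimate
    print(wm.query("ham_word"))   # large negative estimate
    print(wm.query("noise"))      # near zero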

SLIDE 19

WM-Sketch analysis: Guarantees on weight approximation error

We compare the optimal weights for the original data, w* (i.e., the minimizer of the empirical loss), to those recovered from the optimal sketched weights, w_est.

Theorem (informal). Let d be the dimension of the data. With high probability, w_est is a good approximation of the high-magnitude entries of w*: the maximum entrywise approximation error is small, and the required sketch dimension is much smaller than d.

SLIDE 20

Important optimization in practice: Store large weights in a heap

[Diagram: a min-heap ordered by weight magnitude (e.g., i → 5.0, j → -4.2, k → 3.5, ...) holds the large weights; the sketch holds the small weights.]

  • Anytime queries for the estimated top-k weights
  • Reduces “bad” collisions with large weights in the sketch
  • Significantly improves classification accuracy and weight recovery accuracy in practice
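A minimal sketch of the heap idea (simplified: here the heap only caches references to large weights, whereas the slide's version also keeps the large weights out of the sketch itself; class and method names are assumptions):

    import heapq

    class TopKTracker:
        """Track the estimated top-K weights alongside a WMSketch."""
        def __init__(self, sketch, k=10):
            self.sketch, self.k = sketch, k
            self.heap = []       # min-heap of (|weight estimate|, feature)
            self.members = set()

        def refresh(self, feature):
            """Call after updating `feature`; admits it to the top-K set
            if its estimated magnitude beats the smallest heap entry."""
            if feature in self.members:
                return
            est = abs(self.sketch.query(feature))
            if len(self.heap) < self.k:
                heapq.heappush(self.heap, (est, feature))
                self.members.add(feature)
            elif est > self.heap[0][0]:
                _, evicted = heapq.heapreplace(self.heap, (est, feature))
                self.members.discard(evicted)
                self.members.add(feature)

        def top_k(self):
            # Anytime query: re-read fresh estimates from the sketch.
            return sorted(((f, self.sketch.query(f)) for _, f in self.heap),
                          key=lambda t: -abs(t[1]))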

SLIDE 21
  • 1. algorithm
  • 2. evaluation
  • 3. applications
  • Classification accuracy
  • Weight recovery accuracy
SLIDE 22

Classification accuracy: WM-Sketch improves on Feature Hashing

[Plot: classification error vs. memory budget. Baselines: the error of uncompressed logistic regression, using only the most frequent features, and feature hashing + heap.]

SLIDE 23

Weight recovery: WM-Sketch improves on heavy hitters

[Plot: weight recovery error vs. memory budget (lower is better). Baselines: tracking the most frequent features, and feature hashing + heap.]

SLIDE 24
  • 1. algorithm
  • 2. evaluation
  • 3. applications
  • Network monitoring
  • Identifying correlated events
SLIDE 25

Network monitoring: what are the largest relative differences?

[Diagram: as on Slide 3, packets at a network switch are expanded into prefix features (Src[:1] = 136, Dest[:1] = 129, ...); here packets come from two flows, Flow A and Flow B, and feed logistic regression with the WM-Sketch.]

  • Largest weights → features (e.g., IP prefixes) with the largest relative differences between the flows
  • Previous work: “relative deltoids” in data streams [CM ‘05]
  • Outperforms Count-Min baselines (even when the baselines are given an 8x memory budget)
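The reduction the slide describes, as an illustrative snippet building on the earlier code (packet_stream is a hypothetical source yielding (packet, flow) pairs):

    # Label packets +1 if from Flow A, -1 if from Flow B; the features
    # with the largest-magnitude weights differ most between the flows.
    wm = WMSketch(s=5, width=2048)
    tracker = TopKTracker(wm, k=20)

    for packet, flow in packet_stream():
        x = {f: 1.0 for f in prefix_features(packet.src, packet.dest)}
        wm.update_example(x, y=+1 if flow == "A" else -1)
        for f in x:
            tracker.refresh(f)

    print(tracker.top_k())  # prefixes most indicative of Flow A vs. Flow B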

SLIDE 26

Explaining outliers: which features indicate anomalies?

[Table: telemetry rows, e.g., IP 136.0.1.1 / San Francisco / 10 ms; 161.0.1.1 / New York / 12 ms; 129.0.1.1 / Houston / 500 ms. Rows are labeled +1 if anomalous (e.g., latency above the 99th percentile) and -1 otherwise, then fed to logistic regression with the WM-Sketch. Largest recovered weights: City=Houston +4.2, City=Austin +2.0, IP=129.x.x.x +1.5, ...]

  • Return the features most indicative of being an outlier (weights can be interpreted as log-odds ratios)
  • Streaming outlier explanation (e.g., MacroBase [Bailis et al. ‘17])
  • Outperforms heavy-hitter-based methods for identifying “high-risk” features
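Because the model is logistic regression, each recovered weight reads directly as a log-odds ratio; a quick check using the slide's City=Houston weight:

    import math

    w_houston = 4.2  # recovered weight for City=Houston (from the slide)
    # All else equal, City=Houston multiplies the odds of the
    # outlier label by exp(4.2), roughly 67x.
    print(math.exp(w_houston))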

SLIDE 27

Identifying correlations: which events tend to co-occur?

[Table: labeled token pairs from a text stream]

  Token 1     Token 2     Label
  United      States      +1
  computer    science     +1
  computer    the         -1
  ...         ...         ...

Real co-occurring pairs are labeled +1 and synthetic pairs -1; the stream feeds logistic regression with the WM-Sketch.

[Table: largest recovered weights: (United, States) +4.5; (Barack, Obama) +4.0; (computer, science) +2.0; ...]

  • Features = event pairs
  • Largest weights → events that are strongly correlated
  • Exact counter-based approach: 188 MB memory
  • Approximation with WM-Sketch: 1.4 MB memory ⇒ >100x less memory usage
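One way to realize that pair-feature stream (illustrative; the negative-sampling scheme here is an assumption, the slide only mentions synthetic event pairs):

    import itertools, random

    def pair_examples(window, vocab, n_negative=1):
        """Yield (features, label): real co-occurring token pairs get +1,
        randomly assembled ("synthetic") pairs get -1."""
        for a, b in itertools.combinations(window, 2):
            yield {(a, b): 1.0}, +1
            for _ in range(n_negative):
                yield {(random.choice(vocab), random.choice(vocab)): 1.0}, -1

    vocab = ["United", "States", "computer", "science", "the"]
    for x, y in pair_examples(["United", "States", "computer"], vocab):
        wm.update_example(x, y)   # wm: the WMSketch defined earlier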

SLIDE 28

Paper: tinyurl.com/wmsketch
Code: github.com/stanford-futuredata/wmsketch
kaishengtai.github.io | kst@cs.stanford.edu | @kaishengtai

Takeaways

Weight-Median Sketch:

  • Count-Sketch for linear classification
  • Improved feature hashing
  • Lightweight, memory-efficient classifiers everywhere

Stream processing:

  • Many tasks can be formulated as classification problems
  • Still lots of room for exploration