
One Sketch for All ❦ Joel A. Tropp, Department of Mathematics, The University of Michigan, jtropp@umich.edu. Joint work with Anna C. Gilbert, Martin J. Strauss, Roman Vershynin. Research supported in part by NSF and DARPA.

or, Heavy Hitters on Steroids* *Allegedly One Sketch for All (MMDS 2006) 2

The Heavy Hitters Problem [Figure: example signal with values between −1 and 1 at positions 0 to 250] Data: A signal s with d real entries. Query: Find locations and magnitudes of the m largest entries ❧ Interesting case: d is massive and m is big ❧ Easy if signal is explicit (aggregate / one pass model) ❧ Challenging in streaming data model

Streaming Data Model ❧ Think of components of s as items in WalMart inventory ❧ Cash register records a sequence of additive updates, e.g., ..., Beer +3, Diapers −1, Ammo +50, Beer +2, ... ❧ Total sales are implicitly determined by the sum of updates ❧ Query: What items were sold or returned most? Reference: Muthukrishnan 2003

Consequences of Streaming Model ❧ Must be able to process updates quickly ❧ Linear processing useful for signed additive updates: Φ(s + u) = Φs + Φu ❧ The signal evolves, so the heavy hitters evolve ❧ Must respond correctly to a query at any time
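The linearity property Φ(s + u) = Φs + Φu can be illustrated with a minimal sketch. The random ±1 matrix, the item-to-index map, and the function names below are assumptions made for illustration only; they are not the measurement matrices used in the talk.

```python
import random

def make_sketch_matrix(n_rows, d, seed=0):
    """Random +/-1 matrix standing in for Phi (an illustrative choice only)."""
    rng = random.Random(seed)
    return [[rng.choice((-1, 1)) for _ in range(d)] for _ in range(n_rows)]

def apply_update(sketch, phi, index, delta):
    """Add delta times column `index` of Phi to the sketch, in O(n_rows) time."""
    for r, row in enumerate(phi):
        sketch[r] += delta * row[index]

d, n_rows = 8, 4
items = {"Beer": 0, "Diapers": 1, "Ammo": 2}  # hypothetical item-to-index map
phi = make_sketch_matrix(n_rows, d)

stream = [("Beer", 3), ("Diapers", -1), ("Ammo", 50), ("Beer", 2)]
sketch = [0] * n_rows
signal = [0] * d  # kept here only to check the result; a streaming algorithm would not store it
for item, delta in stream:
    apply_update(sketch, phi, items[item], delta)
    signal[items[item]] += delta

# Sketching the accumulated signal directly gives the same vector.
direct = [sum(row[j] * signal[j] for j in range(d)) for row in phi]
```

Each update touches only one column of Φ, so the sketch can follow the stream without ever materializing s. A practical scheme would also generate the entries of Φ pseudorandomly on demand rather than store the matrix, which this toy version does not attempt.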

Sublinearity in Dimension ❧ Since d is massive, want to limit resource usage to polylog(d): storage, computation time, randomness ❧ Locations and magnitudes of m heavy hitters take about m log(d/m) bits of storage. Moral: Heavy Hitters is possible with sublinear resources
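The m log(d/m) figure can be checked with a short counting argument: naming the locations of m heavy hitters means choosing one of C(d, m) supports, and (d/m)^m ≤ C(d, m) ≤ (ed/m)^m. The snippet below is a small numerical check of that standard bound; the values of d and m are arbitrary illustrative choices, and it counts only locations, not the bits needed for the magnitudes.

```python
import math

def bits_for_support(d, m):
    """Bits needed to name one of the C(d, m) possible sets of m positions."""
    return math.log2(math.comb(d, m))

d, m = 1_000_000, 100  # illustrative sizes, not taken from the talk
lower = m * math.log2(d / m)            # from C(d, m) >= (d/m)^m
upper = m * math.log2(math.e * d / m)   # from C(d, m) <= (e d/m)^m
exact = bits_for_support(d, m)
```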

Sketching ❧ A synopsis data structure maintains a small sketch of the data ❧ In many cases, sketch is a random linear projection ❧ Sketch supports two operations: ❧ Update revises the sketch to reflect a change in the data ❧ Query returns an estimate of a data statistic ❧ For Heavy Hitters, ❧ Update supports signed additive changes to one signal component ❧ Query returns m signal positions and approximate values Reference: Gibbons–Matias 1998

One Sketch for All ❧ Many randomized sketches offer guarantees of the form: "On each signal, with high probability, the query succeeds" ❧ This may be too weak if ❧ Many queries are made or ❧ Updates are adaptive, adversarial, worst-case, etc. ❧ Better to have a guarantee of the form: "With high probability, on all signals, the query succeeds" ❧ This criterion has not appeared in data stream literature, but see Candès et al. 2004 and Donoho 2004
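The difference between the two guarantees is the order of the quantifiers. Written out, with δ as an assumed symbol for the failure probability and the probability taken over the random choice of the sketch Φ:

```latex
% Per-signal guarantee: the sketch may fail on a different set of signals for each draw of \Phi
\forall s:\quad \Pr_{\Phi}\bigl[\text{query succeeds on } s\bigr] \;\ge\; 1 - \delta

% Uniform guarantee: a single draw of \Phi works for every signal at once
\Pr_{\Phi}\bigl[\,\forall s:\ \text{query succeeds on } s\,\bigr] \;\ge\; 1 - \delta
```

Under the per-signal form, a union bound over q queries gives failure probability at most qδ, and only when the queried signals are chosen independently of Φ. The uniform form needs neither restriction, which is why it tolerates many queries and adaptive updates.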

Desiderata for Heavy Hitters. Want a synopsis data structure with these properties: 1. Uniformity: Sketch works for all signals simultaneously 2. Optimal Size: Sketch uses m polylog(d) storage 3. Optimal Speed: Update and query times are m polylog(d) 4. High Quality: Answer to query has near-optimal error

Algorithm 1: Chaining Pursuit ❧ Uniform: Yes ❧ Storage: $O(m \log^2 d)$ ❧ Update time: Amortized $m^{o(1)} \, \mathrm{polylog}(d)$ ❧ Query time: $m^{1+o(1)} \, \mathrm{polylog}(d)$ ❧ Error bounds: $\|s - \hat{s}\|_1 \le C \log m \, \|s - s_m\|_1$ and $\|s - \hat{s}\|_{\text{weak-}1} \le C \, \|s - s_m\|_1$

Algorithm 2: HHS Pursuit ❧ Uniform: Yes ❧ Storage: $m \, \mathrm{polylog}(d) / \varepsilon^2$ ❧ Update time: $m \, \mathrm{polylog}(d) / \varepsilon^2$ ❧ Query time: $m^2 \, \mathrm{polylog}(d) / \varepsilon^4$ ❧ Error bounds: $\|s - \hat{s}\|_1 \le (1 + \varepsilon) \, \|s - s_m\|_1$ and $\|s - \hat{s}\|_2 \le \|s - s_m\|_2 + \frac{\varepsilon}{\sqrt{m}} \, \|s - s_m\|_1$
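Both sets of error bounds compare the estimate against the best m-term approximation s_m, which keeps the m largest-magnitude entries of s. The following sketch computes that benchmark and checks a hand-made estimate against the two HHS-style inequalities. The signal, the estimate `s_hat`, and the value of `eps` are invented for illustration; the estimate is not the output of either algorithm.

```python
import math

def best_m_term(s, m):
    """Keep the m largest-magnitude entries of s and zero the rest (s_m in the slides)."""
    keep = sorted(range(len(s)), key=lambda i: abs(s[i]), reverse=True)[:m]
    out = [0.0] * len(s)
    for i in keep:
        out[i] = s[i]
    return out

def l1(x):
    return sum(abs(v) for v in x)

def l2(x):
    return math.sqrt(sum(v * v for v in x))

def diff(a, b):
    return [x - y for x, y in zip(a, b)]

s = [5.0, -0.1, 0.2, -4.0, 0.05, 3.0, -0.1, 0.0]  # made-up signal
m, eps = 3, 0.5
s_m = best_m_term(s, m)
tail_l1 = l1(diff(s, s_m))  # benchmark error in the l1 norm
tail_l2 = l2(diff(s, s_m))  # benchmark error in the l2 norm

s_hat = [4.98, 0.0, 0.0, -4.01, 0.0, 3.0, 0.0, 0.0]  # hypothetical estimate
meets_l1_bound = l1(diff(s, s_hat)) <= (1 + eps) * tail_l1
meets_l2_bound = l2(diff(s, s_hat)) <= tail_l2 + eps / math.sqrt(m) * tail_l1
```

When s is exactly m-sparse both tails vanish, so bounds of this form force exact recovery; for other signals the error is allowed to scale with how far s is from sparse.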

Compressible Signals ❧ Results nontrivial for compressible signals, whose k-th largest entry satisfies $|s_{(k)}| \le C k^{-\alpha}$ for $\alpha \ge 1$ ❧ Tail behavior for $\alpha > 1$: $\|s - s_m\|_1 \asymp m^{1-\alpha}$ and $\|s - s_m\|_2 \asymp m^{1/2-\alpha}$ ❧ Compressible signals are extremely common
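The tail rates can be checked numerically. The sketch below assumes the extremal case s_(k) = k^(−α) with α > 1, so that the tail sums converge, and compares the tails with m^(1−α) and m^(1/2−α). The values of α, d, and m are arbitrary illustrative choices.

```python
def tail_l1(alpha, m, d):
    """l1 norm of the entries ranked m+1..d when the k-th largest entry is k**-alpha."""
    return sum(k ** -alpha for k in range(m + 1, d + 1))

def tail_l2(alpha, m, d):
    """l2 norm of the same tail."""
    return sum(k ** (-2 * alpha) for k in range(m + 1, d + 1)) ** 0.5

alpha, d = 1.5, 200_000
ms = (10, 100, 1000)
# If the claimed rates hold, these ratios stay within constant factors as m grows.
ratios_l1 = [tail_l1(alpha, m, d) / m ** (1 - alpha) for m in ms]
ratios_l2 = [tail_l2(alpha, m, d) / m ** (0.5 - alpha) for m in ms]
```

For α = 1.5 an integral comparison predicts ratios near 1/(α − 1) = 2 in the ℓ1 case and near 1/√(2α − 1) ≈ 0.71 in the ℓ2 case, slightly reduced by truncating the sum at d.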

Related Work ❧ Properties compared: Uniform, Opt. Storage, Sublin. Query ❧ GMS; CM; CRT, Don: each is marked as lacking one of the three properties ❧ Chaining; HHS: marked as having all three. Remark: The numerous contributions in this area are not strictly comparable. References: Gilbert et al. 2002, 2005; Cormode–Muthukrishnan 2005; Candès–Romberg–Tao 2004; Donoho 2004, ...

Dimension Reduction for Sparse Vectors ❧ Let $X \subset \ell_1^d$ be the set of all m-sparse signals ❧ The Chaining sketch embeds X in $\ell_1$ with dimension $O(m \log^2 d)$ ❧ The embedding is bi-Lipschitz with polylogarithmic distortion ❧ Chaining algorithm allows sublinear-time reconstruction of sparse signals from their sketches ❧ Tolerant to noise in signal and in sketch ❧ Log error may be connected with lower bounds [Charikar–Sahai 2002]

Contributions ❧ Ask new questions: 1. Is a uniform guarantee possible? 2. What is the best error bound? ❧ New technical ideas: 1. Restricted isometries 2. Operator norm bounds ❧ Careful analysis: 1. Detailed results on random matrices 2. Understanding and controlling noise propagation

Overall Structure of Algorithms 1. Identify candidate heavy hitters 2. Estimate their magnitudes 3. Cull the herd 4. Update the sketch 5. Iterate the procedure

Different Intuitions. Chaining Algorithm: ❧ Finds a constant proportion of the heavy hitters at each iteration ❧ Requires careful culling of candidate heavy hitters ❧ Careful analysis of “internal noise”. HHS Algorithm: ❧ Finds a constant proportion of the signal energy at each iteration ❧ Must identify heavy hitters near noise level to find signal energy ❧ Careful analysis of batch estimation procedure

Locating a Heavy Hitter ❧ Suppose the signal contains one “spike” and no noise ❧ $\log_2 d$ bit tests will identify its location, e.g., $B_1 s = \begin{bmatrix} 0&0&0&0&1&1&1&1 \\ 0&0&1&1&0&0&1&1 \\ 0&1&0&1&0&1&0&1 \end{bmatrix} \begin{bmatrix} 0&0&1&0&0&0&0&0 \end{bmatrix}^{T} = \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix} \begin{matrix} \text{MSB} \\ \\ \text{LSB} \end{matrix}$ (bit-test matrix · signal = location in binary)
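The bit-test idea can be reproduced in a few lines for the noise-free, single-spike case shown on the slide. The function names are hypothetical, and the code assumes d is a power of two and that the signal has exactly one nonzero entry.

```python
def bit_test_matrix(d):
    """Row b holds bit b (MSB first) of every position 0..d-1; d must be a power of two."""
    n_bits = d.bit_length() - 1
    return [[(j >> b) & 1 for j in range(d)] for b in reversed(range(n_bits))]

def locate_spike(signal):
    """Recover the position of the single nonzero entry from log2(d) bit tests."""
    d = len(signal)
    tests = [sum(row[j] * signal[j] for j in range(d)) for row in bit_test_matrix(d)]
    position = 0
    for t in tests:  # a nonzero test means that bit of the position is 1
        position = (position << 1) | (1 if t != 0 else 0)
    return position

s = [0, 0, 1, 0, 0, 0, 0, 0]  # the example above: one spike at position 2
location = locate_spike(s)    # binary 010
```

With several spikes or with noise the tests no longer read off a single position, which is why the spikes must first be isolated.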

Isolating Heavy Hitters ❧ To use bit tests, the measurements need to isolate many spikes ❧ Assign each of d signal positions at random to one of O(m) different subsets ❧ Repeat to drive down failure probability [Figure: four signal plots, positions 0 to 250, values between −1 and 1]
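A small simulation shows why repetition helps. It assumes 4m subsets, fully random assignment, and 20 repetitions, all of which are illustrative choices; the talk's constructions use limited randomness, which this sketch does not model.

```python
import random

def isolate(d, n_buckets, rng):
    """Assign each of d positions to one of n_buckets subsets uniformly at random."""
    return [rng.randrange(n_buckets) for _ in range(d)]

def isolated_spikes(spikes, assignment):
    """Spikes that share their subset with no other spike."""
    counts = {}
    for p in spikes:
        counts[assignment[p]] = counts.get(assignment[p], 0) + 1
    return {p for p in spikes if counts[assignment[p]] == 1}

rng = random.Random(1)
d, m, n_trials = 256, 8, 20
spikes = rng.sample(range(d), m)

ever_isolated = set()
per_trial = []
for _ in range(n_trials):
    alone = isolated_spikes(spikes, isolate(d, 4 * m, rng))
    per_trial.append(len(alone))
    ever_isolated |= alone
```

With 4m subsets a given spike is alone in its subset with probability (1 − 1/(4m))^(m−1), about 0.8 here, so a single trial usually isolates most spikes and a handful of independent trials isolates every spike at least once with overwhelming probability.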

The Sketches Chaining: ❧ Multiple trials of isolation + bit tests HHS: ❧ Multiple trials of isolation + noise reduction + bit tests ❧ Separate sketch for estimation

Estimation for HHS ❧ Maintain separate sketch v to estimate size of candidates: $v = P F s$, where P is a random projection to $m \, \mathrm{polylog}(d) / \varepsilon^2$ coordinates and F is the DFT ❧ Given list L of candidates, estimate magnitudes with LS: $\hat{s}_L = (P F_L)^{\dagger} v$ ❧ Error estimate via new norm bound for restricted isometries: $\|P F x\|_2 \le c \left( \|x\|_2 + \frac{1}{\sqrt{m}} \|x\|_1 \right)$
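The least-squares step can be sketched with NumPy. This example assumes P selects a random subset of rows of the unitary DFT matrix, that the signal is exactly supported on the candidate list L, and that d is prime so any submatrix of the DFT has full rank. All sizes and values are illustrative and are not taken from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_rows = 101, 20  # d prime; n_rows plays the role of the sketch length

F = np.fft.fft(np.eye(d)) / np.sqrt(d)            # unitary DFT matrix
rows = rng.choice(d, size=n_rows, replace=False)  # P: keep a random subset of rows (assumption)
PF = F[rows, :]

s = np.zeros(d)
L = [7, 40, 88]                                   # candidate positions (here: the true support)
s[L] = [3.0, -2.0, 1.5]
v = PF @ s                                        # estimation sketch v = P F s

# Least squares restricted to the candidate columns: s_L = pinv(P F_L) v
s_L = np.linalg.pinv(PF[:, L]) @ v
```

Because the signal here is exactly supported on L and there is no noise, the pseudoinverse recovers the magnitudes up to rounding. With noise or with spurious candidates in L, the estimate is only approximate, which is the situation the norm bound on the slide is meant to control.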

Chaining Algorithm
Inputs: Number of spikes m, sketches, random projectors
Output: A list of m spike locations and values
For each of O(log m) passes:
  For each trial:
    For each measurement:
      Use bit tests to identify the spike position
      Use a bit test to estimate the spike magnitude
    Retain m/2^k distinct spikes with largest values
  Retain spike positions that appear in most trials
  Estimate final spike magnitudes using medians
  Encode the spikes using the projection operator
  Subtract the encoded spikes from the sketch
Prune output to largest m spikes
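A heavily simplified, single-pass version of the inner loop can be simulated as follows. This is a sketch under strong assumptions and not the Chaining Pursuit algorithm itself: the signal is exactly sparse and noise-free, the measurements are computed directly from the signal instead of being read from a stored sketch, and the per-pass culling, encoding, and subtraction steps are omitted. The name `one_pass` and all parameters are hypothetical.

```python
import random
from collections import Counter
from statistics import median

def one_pass(signal, n_buckets, n_trials, seed=0):
    """Isolation + bit tests per trial, then majority vote and median estimate."""
    rng = random.Random(seed)
    d = len(signal)
    n_bits = max(1, (d - 1).bit_length())
    votes, estimates = Counter(), {}
    for _ in range(n_trials):
        # Isolation: send every position to a random bucket.
        bucket_of = [rng.randrange(n_buckets) for _ in range(d)]
        members = [[] for _ in range(n_buckets)]
        for j, b in enumerate(bucket_of):
            members[b].append(j)
        for b in range(n_buckets):
            total = sum(signal[j] for j in members[b])  # magnitude measurement
            if total == 0:
                continue
            pos = 0
            for bit in reversed(range(n_bits)):
                # Bit test: mass of the bucket at positions whose `bit` is 1.
                test = sum(signal[j] for j in members[b] if (j >> bit) & 1)
                pos = (pos << 1) | int(abs(test) > abs(total - test))
            if pos < d and bucket_of[pos] == b:  # discard decodings outside the bucket
                votes[pos] += 1
                estimates.setdefault(pos, []).append(total)
    # Keep positions found in most trials; estimate each value by a median.
    return {p: median(estimates[p]) for p, c in votes.items() if 2 * c > n_trials}

signal = [0.0] * 256
for position, value in [(10, 5.0), (77, -3.0), (150, 2.5), (201, 4.0)]:
    signal[position] = value
recovered = one_pass(signal, n_buckets=64, n_trials=21)
```

Voting across trials filters out positions decoded from buckets where two spikes collided, and the median discards the occasional corrupted magnitude, mirroring the "appear in most trials" and "using medians" steps in the pseudocode above.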

HHS Algorithm
Inputs: Number of spikes m, sketches, random projectors
Output: A list of m spike locations and values
Run Chaining Pursuit to get first signal estimate
For each of O(log m) passes:
  For each measurement:
    Use bit tests to identify a spike position
  Retain spikes that appear frequently
  Use LS to estimate magnitudes of new candidate spikes
  Retain largest O(m) spikes identified to date
  Encode the spikes using the projection operators
  Subtract the encoded spikes from the original sketch
Prune output to largest m spikes

To learn more... Web: http://www.umich.edu/~jtropp E-mail: jtropp@umich.edu ❧ Matlab code for Chaining Pursuit* is freely available! ❧ GSTV, “Sublinear approximation of compressible signals,” SPIE IIM, April 2006 ❧ GSTV, “Algorithmic dimension reduction in the ℓ1 norm for sparse vectors,” submitted April 2006 ❧ HHS Pursuit still in preparation...
