Recap: Hashing-based sketch techniques summarize large data sets

Recap
• Hashing-based sketch techniques summarize large data sets
• Summarize vectors:
  – Test equality (fingerprints)
  – Recover approximate entries (count-min, count sketch)
  – Approximate Euclidean norm ($F_2$) and dot product
  – Approximate number of non-zero entries ($F_0$)
  – Approximate set membership (Bloom filter)
Streams, Sketching and Big Data

Advanced Topics
• $L_p$ Sampling
  – $L_0$ sampling and graph sketching
  – $L_2$ sampling and frequency moment estimation
• Matrix computations
  – Sketches for matrix multiplication
  – Compressed matrix multiplication
• Hashing to check computation
  – Matrix product checking
  – Vector product checking
• Lower bounds for streaming and sketching
  – Basic hard problems (Index, Disjointness)
  – Hardness via reductions

Sampling from Sketches
• Given inputs with positive and negative weights
• Want to sample based on the overall frequency distribution
  – Sample from support set of n possible items
  – Sample proportional to (absolute) weights
  – Sample proportional to some function of weights
• How to do this sampling effectively?
• Recent approach: $L_p$ sampling

$L_p$ Sampling
• $L_p$ sampling: use sketches to sample $i$ with probability $(1 \pm \epsilon)\, f_i^p / \|f\|_p^p$
• "Efficient" solutions developed of size $O(\epsilon^{-2} \log^2 n)$
  – [Monemizadeh, Woodruff 10] [Jowhari, Saglam, Tardos 11]
• $L_0$ sampling enables novel "graph sketching" techniques
  – Sketches for connectivity, sparsifiers [Ahn, Guha, McGregor 12]
• $L_2$ sampling allows optimal estimation of frequency moments

$L_0$ Sampling
• $L_0$ sampling: sample $i$ with probability $(1 \pm \epsilon)\, f_i^0 / F_0$
  – i.e., sample (near) uniformly from items with non-zero frequency
• General approach: [Frahling, Indyk, Sohler 05, C., Muthu, Rozenbaum 05]
  – Sub-sample all items (present or not) with probability $p$
  – Generate a sub-sampled vector of frequencies $f_p$
  – Feed $f_p$ to a k-sparse recovery data structure
• Allows reconstruction of $f_p$ if $F_0 < k$
  – If $f_p$ is k-sparse, sample from reconstructed vector
  – Repeat in parallel for exponentially shrinking values of $p$

Sampling Process
[Figure: sub-sampling levels from p=1 down to p=1/U, each feeding a k-sparse recovery structure]
• Exponential set of probabilities, p = 1, 1/2, 1/4, 1/8, 1/16, … 1/U
  – Let $N = F_0 = |\{ i : f_i \neq 0 \}|$
  – Want there to be a level where k-sparse recovery will succeed
  – At level $p$, expected number of items selected $S$ is $Np$
  – Pick level $p$ so that $k/3 < Np \le 2k/3$
• Chernoff bound: $1 \le S \le k$, except with probability exponentially small in $k$
  – Pick $k = O(\log 1/\delta)$ to get $1-\delta$ probability
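
The level structure above can be simulated in a few lines. This is only a sketch under stated assumptions: the class name `L0SamplerSketch` is hypothetical, a fixed random value per item stands in for a hash function, and each level stores its sub-sampled vector exactly in a dictionary instead of a real k-sparse recovery sketch (recovery is treated as "failing" whenever a level holds more than k non-zero items).

```python
import random


class L0SamplerSketch:
    """Toy L0 sampler: nested sub-sampling levels plus a stand-in for k-sparse recovery."""

    def __init__(self, universe, k, seed=0):
        self.k = k
        # Levels j = 0, 1, ... give p = 1, 1/2, 1/4, ... down to roughly 1/U.
        self.levels = max(1, universe.bit_length())
        rng = random.Random(seed)
        # Assumption: one fixed random value per item plays the role of a hash.
        self.rank = [rng.random() for _ in range(universe)]
        # One sub-sampled frequency vector f_p per level (kept exactly here).
        self.vectors = [dict() for _ in range(self.levels)]

    def update(self, item, weight):
        """Apply f[item] += weight at every level where the item survives."""
        for j in range(self.levels):
            if self.rank[item] < 2.0 ** -j:  # survives level j with prob. 2^-j
                vec = self.vectors[j]
                vec[item] = vec.get(item, 0) + weight
                if vec[item] == 0:
                    del vec[item]  # deletions cancel earlier insertions

    def sample(self, rng=random):
        """Return (item, frequency) from the densest recoverable level, or None."""
        for vec in self.vectors:
            if 1 <= len(vec) <= self.k:  # "k-sparse recovery succeeds"
                item = rng.choice(sorted(vec))
                return item, vec[item]
        return None  # every level is empty or holds more than k items
```

Because a surviving item at level j also survives at every lower level, the levels are nested; the sampler returns `None` in the (unlikely, for suitable k) case where no level holds between 1 and k items.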

k-Sparse Recovery
• Given vector $x$ with at most $k$ non-zeros, recover $x$ via sketching
  – A core problem in compressed sensing/compressive sampling
• First approach: Use Count-Min sketch of $x$
  – Probe all $U$ items, find those with non-zero estimated frequency
  – Slow recovery: takes $O(U)$ time
• Faster approach: also keep sum of item identifiers in each cell
  – Sum/count will reveal item id
  – Avoid false positives: keep fingerprint of items in each cell
• Can keep a sketch of size $O(k \log U)$ to recover up to $k$ items
• Contents of cell $j$:
  – Sum: $\sum_{i : h(i)=j} i$
  – Count: $\sum_{i : h(i)=j} x_i$
  – Fingerprint: $\sum_{i : h(i)=j} x_i r^i$
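
A single cell of the faster approach might look like the following sketch. The class name `OneSparseCell`, the modulus `P`, and the choice to weight the identifier sum by `x_i` (so that sum/count equals the item id under weighted updates) are assumptions made for illustration; a full k-sparse structure would hash items into many such cells, which is not shown here.

```python
P = (1 << 61) - 1  # a Mersenne prime; fingerprint arithmetic is done modulo P


class OneSparseCell:
    """One sketch cell holding a sum, a count, and a fingerprint."""

    def __init__(self, r):
        self.r = r            # fingerprint base, ideally chosen at random in [2, P-1]
        self.count = 0        # sum of x_i over items in this cell
        self.id_sum = 0       # sum of x_i * i (assumption: weighted by x_i)
        self.fingerprint = 0  # sum of x_i * r^i mod P

    def update(self, i, delta):
        """Apply x_i += delta for an item i that hashes to this cell."""
        self.count += delta
        self.id_sum += delta * i
        self.fingerprint = (self.fingerprint + delta * pow(self.r, i, P)) % P

    def recover(self):
        """Return (i, x_i) if the cell appears to hold exactly one item, else None."""
        if self.count == 0 or self.id_sum % self.count != 0:
            return None
        i = self.id_sum // self.count  # sum/count reveals the candidate id
        if i < 0:
            return None
        # The fingerprint check rejects cells where several items collided.
        if (self.count * pow(self.r, i, P)) % P != self.fingerprint:
            return None
        return i, self.count
```

The fingerprint can still accept a collided cell with small probability over the choice of `r`, which is why `r` would normally be drawn at random.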

Uniformity
• Also need to argue sample is uniform
  – Failure to recover could bias the process
• $\Pr[i \text{ would be picked if } k=n] = 1/F_0$ by symmetry
• $\Pr[i \text{ is picked}] = \Pr[i \text{ would be picked if } k=n \,\wedge\, S \le k] \ge (1-\delta)/F_0$
• So $(1-\delta)/N \le \Pr[i \text{ is picked}] \le 1/N$
• Sufficiently uniform (pick $\delta = \epsilon$)

Application: Graph Sketching
• Given $L_0$ sampler, use to sketch (undirected) graph properties
• Connectivity: want to test if there is a path between all pairs
• Basic alg: repeatedly contract edges between components
• Use $L_0$ sampling to provide edges on vector of adjacencies
• Problem: as components grow, sampling most likely to produce internal links

Graph Sketching
• Idea: use clever encoding of edges [Ahn, Guha, McGregor 12]
• Encode edge (i,j) with i<j as ((i,j),+1) in the vector of node i, and as ((i,j),-1) in the vector of node j
• When node i and node j get merged, sum their $L_0$ sketches
  – Contribution of edge (i,j) exactly cancels out
• Only non-internal edges remain in the $L_0$ sketches
• Use independent sketches for each iteration of the algorithm
  – Only need O(log n) rounds with high probability
• Result: O(poly-log n) space per node for connectivity
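
The cancellation can be seen directly on the signed vectors themselves, before any sketching. In this sketch the helper names `node_vector` and `merge` are hypothetical, and exact vectors stand in for the $L_0$ sketches; since the sketches are linear, summing sketches corresponds to summing these vectors.

```python
from collections import Counter


def node_vector(node, edges):
    """Signed edge vector of `node`: +1 on (i, j) if node == i < j, -1 if node == j."""
    vec = Counter()
    for a, b in edges:
        i, j = min(a, b), max(a, b)
        if node == i:
            vec[(i, j)] += 1
        elif node == j:
            vec[(i, j)] -= 1
    return vec


def merge(*vectors):
    """Sum vectors coordinate-wise and drop entries that cancel to zero."""
    total = Counter()
    for vec in vectors:
        for edge, value in vec.items():
            total[edge] += value
    return {edge: value for edge, value in total.items() if value != 0}


edges = [(1, 2), (2, 3), (1, 3), (3, 4)]
# Merge nodes 1, 2, 3 into one component: internal edges cancel,
# leaving only the edge (3, 4) that leaves the component.
component = merge(*(node_vector(v, edges) for v in (1, 2, 3)))
```

Sampling a non-zero coordinate of `component` therefore always yields an edge leaving the merged component, which is what the contraction step needs.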

Other Graph Results via sketching
• K-connectivity via connectivity
  – Use connectivity result to find and remove a spanning forest
  – Repeat k times to generate k spanning forests $F_1, F_2, \ldots, F_k$
  – Theorem: G is k-connected if $\bigcup_{i=1}^{k} F_i$ is k-connected
• Bipartiteness via connectivity:
  – Compute c = number of connected components in G
  – Generate G' over $V \cup V'$ so $(u,v) \in E \Rightarrow (u, v') \in E',\ (u', v) \in E'$
  – If G is bipartite, G' has 2c components, else it has <2c components
• (Weight of the) Minimum spanning tree:
  – Round edge weights to powers of $(1+\epsilon)$
  – Define $n_i$ = number of components on edges lighter than $(1+\epsilon)^i$
  – Fact: weight of MST on rounded weights is $\sum_i \epsilon (1+\epsilon)^i n_i$
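
The bipartiteness reduction can be checked on small graphs as follows. This is a sketch only: an exact union-find component counter (`count_components`, a hypothetical helper) stands in for the sketch-based connectivity routine, and the copy $v'$ of a node $v$ is represented as the pair `(v, 1)`.

```python
def count_components(nodes, edges):
    """Count connected components with a simple union-find."""
    parent = {v: v for v in nodes}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for u, v in edges:
        parent[find(u)] = find(v)
    return len({find(v) for v in nodes})


def is_bipartite(nodes, edges):
    """G is bipartite iff its double cover G' has exactly 2c components."""
    c = count_components(nodes, edges)
    cover_nodes = [(v, side) for v in nodes for side in (0, 1)]
    cover_edges = []
    for u, v in edges:
        cover_edges.append(((u, 0), (v, 1)))  # (u, v')
        cover_edges.append(((u, 1), (v, 0)))  # (u', v)
    return count_components(cover_nodes, cover_edges) == 2 * c
```

For example, a 4-cycle doubles into two disjoint 4-cycles (2c components), while a triangle doubles into a single 6-cycle (fewer than 2c).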

Application: $F_k$ via $L_2$ Sampling
• Recall, $F_k = \sum_i f_i^k$
• Suppose $L_2$ sampling samples $f_i$ with probability $f_i^2 / F_2$
  – And also estimates sampled $f_i$ with relative error $\epsilon$
• Estimator: $X = F_2 f_i^{k-2}$ (with estimates of $F_2$, $f_i$)
  – Expectation: $E[X] = F_2 \sum_i f_i^{k-2} \cdot f_i^2 / F_2 = F_k$
  – Variance: $\mathrm{Var}[X] \le E[X^2] = \sum_i \frac{f_i^2}{F_2} (F_2 f_i^{k-2})^2 = F_2 F_{2k-2}$
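
The expectation and second moment above can be verified exactly on a small vector. This sketch assumes idealized $L_2$ sampling (exact probabilities $f_i^2/F_2$ and exact values of $F_2$ and $f_i$), and uses $|f_i|$ so that negative weights are handled; the function names are hypothetical.

```python
from fractions import Fraction


def moment(f, k):
    """Frequency moment F_k = sum_i |f_i|^k."""
    return sum(abs(x) ** k for x in f)


def estimator_moments(f, k):
    """Exact E[X] and E[X^2] for X = F_2 * |f_i|^(k-2), with i drawn w.p. f_i^2 / F_2."""
    F2 = moment(f, 2)
    mean = sum(Fraction(x * x, F2) * F2 * abs(x) ** (k - 2) for x in f)
    second = sum(Fraction(x * x, F2) * (F2 * abs(x) ** (k - 2)) ** 2 for x in f)
    return mean, second


f = [3, -1, 2, 0, 5]
mean, second = estimator_moments(f, 3)
# mean equals F_3 and second equals F_2 * F_4 for this vector.
```

Using `Fraction` keeps the arithmetic exact, so the identities hold with equality rather than up to rounding.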

Rewriting the Variance
• Want to express variance $F_2 F_{2k-2}$ in terms of $F_k$ and domain size $n$
• Hölder's inequality: $\langle x, y \rangle \le \|x\|_p \|y\|_q$ for $1 \le p, q$ with $1/p + 1/q = 1$
  – Generalizes Cauchy-Schwarz inequality, where $p = q = 2$
• So pick $p = k/(k-2)$ and $q = k/2$ for $k > 2$. Then
  $\langle 1^n, (f_i)^2 \rangle \le \|1^n\|_{k/(k-2)} \, \|(f_i)^2\|_{k/2}$
  $F_2 \le n^{(k-2)/k} F_k^{2/k}$  (1)
• Also, since $\|x\|_{p+a} \le \|x\|_p$ for any $p \ge 1$, $a > 0$
  – Thus $\|x\|_{2k-2} \le \|x\|_k$ for $k \ge 2$
  – So $F_{2k-2} = \|f\|_{2k-2}^{2k-2} \le \|f\|_k^{2k-2} = F_k^{2-2/k}$  (2)
• Multiply (1) * (2): $F_2 F_{2k-2} \le n^{1-2/k} F_k^2$
  – So variance is bounded by $n^{1-2/k} F_k^2$
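
As a sanity check, the final bound can be tested numerically on arbitrary vectors. This is a minimal sketch, assuming $n$ is taken as the vector length; the function name `variance_bound_holds` and the small relative tolerance for floating-point rounding are illustrative choices.

```python
import random


def moment(f, k):
    """Frequency moment F_k = sum_i |f_i|^k."""
    return sum(abs(x) ** k for x in f)


def variance_bound_holds(f, k):
    """Check F_2 * F_{2k-2} <= n^(1 - 2/k) * F_k^2 for a vector f of length n."""
    n = len(f)
    lhs = moment(f, 2) * moment(f, 2 * k - 2)
    rhs = n ** (1 - 2 / k) * moment(f, k) ** 2
    return lhs <= rhs * (1 + 1e-9)  # tolerance for float rounding


rng = random.Random(0)
checks = [
    variance_bound_holds([rng.randint(-50, 50) for _ in range(30)], k)
    for k in (3, 4, 5)
    for _ in range(100)
]
```

The bound is tight for a vector with a single entry (both sides equal $f_1^{2k}$), which is why the comparison allows a small tolerance.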

$F_k$ Estimation
• For $k \ge 3$, we can estimate $F_k$ via $L_2$ sampling:
  – Variance of our estimate is $O(F_k^2 n^{1-2/k})$
  – Take mean of $n^{1-2/k} \epsilon^{-2}$ repetitions to reduce variance
  – Apply Chebyshev inequality: constant prob of good estimate
  – Chernoff bounds: taking the median of $O(\log 1/\delta)$ repetitions reduces the failure probability to $\delta$
• How to instantiate this?
  – Design method for approximate $L_2$ sampling via sketches
  – Show that this gives relative error approximation of $f_i$
  – Use approximate value of $F_2$ from sketch
  – Complicates the analysis, but bound stays similar
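
The mean-then-median recipe can be sketched as below. This assumes idealized $L_2$ sampling drawn with `random.choices` and exact knowledge of $F_2$ and $f_i$, rather than the sketch-based sampler the slides go on to describe; the function name and parameters are hypothetical.

```python
import random
import statistics


def estimate_fk(f, k, reps_per_mean, groups, rng):
    """Median of `groups` means, each over `reps_per_mean` draws of X = F_2 * |f_i|^(k-2)."""
    F2 = sum(x * x for x in f)
    weights = [x * x for x in f]  # L2 sampling: Pr[i] = f_i^2 / F_2
    means = []
    for _ in range(groups):
        draws = rng.choices(f, weights=weights, k=reps_per_mean)
        means.append(statistics.mean([F2 * abs(x) ** (k - 2) for x in draws]))
    return statistics.median(means)


rng = random.Random(7)
f = list(range(1, 21))
estimate = estimate_fk(f, 3, reps_per_mean=400, groups=9, rng=rng)
true_f3 = sum(x ** 3 for x in f)  # 44100 for f = 1..20
```

Averaging shrinks the variance (the Chebyshev step), and the median over independent groups drives down the chance that the final answer is far off (the Chernoff step).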

$L_2$ Sampling Outline
• For each $i$, draw $u_i$ uniformly in the range 0…1
  – From vector of frequencies $f$, derive $g$ so $g_i = f_i / \sqrt{u_i}$
  – Sketch $g$ vector
• Sample: return $(i, f_i)$ if there is unique $i$ with $g_i^2 > t = F_2/\epsilon$ threshold
  – $\Pr[g_i^2 > t \,\wedge\, \forall j \ne i : g_j^2 < t] = \Pr[g_i^2 > t] \prod_{j \ne i} \Pr[g_j^2 < t]$
    $= \Pr[u_i < \epsilon f_i^2 / F_2] \prod_{j \ne i} \Pr[u_j > \epsilon f_j^2 / F_2]$
    $= (\epsilon f_i^2 / F_2) \prod_{j \ne i} (1 - \epsilon f_j^2 / F_2) \approx \epsilon f_i^2 / F_2$
• Probability of returning anything is not so big: $\sum_i \epsilon f_i^2 / F_2 = \epsilon$
  – Repeat $O(\frac{1}{\epsilon} \log \frac{1}{\delta})$ times to improve chance of sampling
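
The outline can be simulated directly when the vector $f$ is available in full. This is a sketch under that assumption: $g$ is computed exactly instead of being sketched, $F_2$ is exact, and the function names are hypothetical.

```python
import random


def l2_sample_once(f, eps, rng):
    """One attempt: return (i, f_i) if exactly one g_i^2 exceeds t = F_2 / eps, else None."""
    F2 = sum(x * x for x in f)
    t = F2 / eps
    hits = []
    for i, x in enumerate(f):
        u = 1.0 - rng.random()  # uniform in (0, 1], so the division is safe
        if x * x / u > t:       # g_i^2 = f_i^2 / u_i
            hits.append(i)
    if len(hits) == 1:
        return hits[0], f[hits[0]]
    return None


def l2_sample(f, eps, attempts, rng):
    """Repeat the attempt until one succeeds; give up after `attempts` tries."""
    for _ in range(attempts):
        result = l2_sample_once(f, eps, rng)
        if result is not None:
            return result
    return None
```

Each attempt succeeds with probability about `eps`, which is why the slides repeat it on the order of $(1/\epsilon) \log(1/\delta)$ times.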

$L_2$ sampling continued
• Given (estimated) $g_i$ s.t. $g_i^2 \ge F_2/\epsilon$, estimate $f_i = u_i g_i$
• Sketch size $O(\epsilon^{-1} \log n)$ means estimate of $f_i^2$ has error $(\epsilon f_i^2 + u_i^2)$
  – With high prob, no $u_i < 1/\mathrm{poly}(n)$, and so $F_2(g) = O(F_2(f) \log n)$
  – Since estimated $f_i^2 / u_i^2 \ge F_2/\epsilon$, $u_i^2 \le \epsilon f_i^2 / F_2$
• Estimating $f_i^2$ with error $\epsilon f_i^2$ sufficient for estimating $F_k$
• Many details omitted
  – See Precision Sampling paper [Andoni Krauthgamer Onak 11]

Advanced Topics
• $L_p$ Sampling
  – $L_0$ sampling and graph sketching
  – $L_2$ sampling and frequency moment estimation
• Matrix computations
  – Sketches for matrix multiplication
  – Compressed matrix multiplication
• Hashing to check computation
  – Matrix product checking
  – Vector product checking
• Lower bounds for streaming and sketching
  – Basic hard problems (Index, Disjointness)
  – Hardness via reductions

Matrix Sketching
• Given matrices A, B, want to approximate matrix product AB
• Compute normed error of approximation C: $\|AB - C\|$
• Give results for the Frobenius (entrywise) norm $\|\cdot\|_F$
  – $\|C\|_F = \left( \sum_{i,j} C_{i,j}^2 \right)^{1/2}$
  – Results rely on sketches, so this norm is most natural
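
For concreteness, the error measure can be computed as below on small dense matrices stored as lists of rows. This is a plain-Python sketch with hypothetical helper names; it computes AB exactly, so it only illustrates how an approximation C would be scored, not how C is obtained.

```python
import math


def frobenius_norm(C):
    """||C||_F = sqrt of the sum of squared entries."""
    return math.sqrt(sum(v * v for row in C for v in row))


def matmul(A, B):
    """Exact product of two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def product_error(A, B, C):
    """Frobenius-norm error ||AB - C||_F of an approximation C to the product AB."""
    AB = matmul(A, B)
    diff = [[x - y for x, y in zip(r1, r2)] for r1, r2 in zip(AB, C)]
    return frobenius_norm(diff)


A = [[1, 2], [3, 4]]
identity = [[1, 0], [0, 1]]
approx = [[1, 2], [3, 5]]  # differs from A * I in a single entry, by 1
error = product_error(A, identity, approx)
```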
