Homomorphic Sketches Shrinking Big Data without Sacrificing - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Homomorphic Sketches

Shrinking Big Data without Sacrificing Structure

Andrew McGregor

University of Massachusetts

slide-2
SLIDE 2

Can test whether two n bit files are identical by comparing O(log n) bit fingerprints of each file.

?=?
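A minimal sketch of the fingerprint idea, assuming a polynomial (Rabin-Karp style) fingerprint over a fixed 61-bit prime. The names `fingerprint` and `probably_equal` and the choice of modulus are illustrative and not from the slides; an O(log n) bit fingerprint would pick a prime of size polynomial in n.

```python
import random

P = (1 << 61) - 1  # fixed Mersenne prime; an illustrative choice, not from the slides

def fingerprint(data: bytes, r: int) -> int:
    """Evaluate sum(data[i] * r**i) mod P with Horner's rule."""
    acc = 0
    for byte in reversed(data):
        acc = (acc * r + byte) % P
    return acc

def probably_equal(x: bytes, y: bytes, r: int) -> bool:
    """Compare two files using only their lengths and fingerprints."""
    return len(x) == len(y) and fingerprint(x, r) == fingerprint(y, r)

# Both parties must use the same random evaluation point r.
r = random.Random(0).randrange(1, P)

a = b"the quick brown fox " * 1000
b = bytearray(a)
b[1234] ^= 1  # flip one bit of one byte

print(probably_equal(a, bytes(a), r))  # identical files always match
print(probably_equal(a, bytes(b), r))  # differing files collide only with small probability
```

Two equal-length files that differ give a non-zero polynomial of degree below n, so a random r is a root with probability at most (n-1)/P.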

slide-3
SLIDE 3

?≈?

More generally, can construct sketches of files to estimate Hamming distance between the files. Many results such as distinct elements, entropy, frequency moments, quantiles, histograms, linear regression, clustering, shape approximation...

slide-4
SLIDE 4

Basic Idea: Treat file as vector; use linear projections to reduce dimension while preserving properties. Extensive theory with connections to compressed sensing, metric embeddings; widely applicable since parallelizable and suitable for stream processing. Most existing work concerns numerical statistics of data such as frequency and feature vectors...

[Figure: a sketch matrix M applied to the data vector v yields the much shorter sketch Mv.]
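The two properties claimed above (parallelizable, suitable for streams) both follow from linearity. The sketch below assumes a random ±1 matrix as M; the helper names and the sizes are hypothetical choices for illustration only.

```python
import random

def make_sketch_matrix(rows, cols, seed=0):
    """Random +/-1 matrix M; the seed plays the role of shared randomness."""
    rng = random.Random(seed)
    return [[rng.choice((-1, 1)) for _ in range(cols)] for _ in range(rows)]

def apply_sketch(M, v):
    """Compute the sketch Mv."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def stream_update(sketch, M, index, delta):
    """Process the stream update v[index] += delta by touching one column of M."""
    for k, row in enumerate(M):
        sketch[k] += row[index] * delta

n, k = 1000, 20
M = make_sketch_matrix(k, n)
rng = random.Random(1)
u = [rng.randint(0, 9) for _ in range(n)]
v = [rng.randint(0, 9) for _ in range(n)]

# Parallel: sketch two parts separately, then add the sketches.
merged = [x + y for x, y in zip(apply_sketch(M, u), apply_sketch(M, v))]
direct = apply_sketch(M, [x + y for x, y in zip(u, v)])
print(merged == direct)

# Streaming: build the sketch one update at a time, never storing the vector.
streamed = [0] * k
for i, x in enumerate(u):
    stream_update(streamed, M, i, x)
print(streamed == apply_sketch(M, u))
```

Deletions are handled the same way, by passing a negative `delta`.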

slide-5
SLIDE 5

Is it possible to analyze richer combinatorial and group-theoretic structure via linear sketches? Can we make compression “homomorphic” and run algorithms on sketched data?

[Diagram: BIG DATA → Algorithm → ANSWER, versus BIG DATA → Compress → small data → Algorithm → ANSWER.]

slide-6
SLIDE 6

Suppose n files encode rows of an adjacency matrix, e.g., each file is a list of friends in a social network. Theorem: Can check graph connectivity with O(polylog n) bit fingerprints of each file.

slide-7
SLIDE 7

Hamming distance isn’t robust to misalignments. Theorem: Can check equality of files up to rotation with fingerprints of length D(n) polylog n. More generally, we have homomorphic fingerprints: given a fingerprint, can compute the fingerprint of rotation.

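[Figure: “The quick brown fox jumped over the lazy dog.” → CYCLIC ROTATION → “quick brown fox jumped over the lazy dog. The”. A corresponding FINGERPRINT OPERATION maps the fingerprint of the first text to the fingerprint of the second.]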

* D(n) is the number of divisors of n.

slide-8
SLIDE 8
  • I. Connectivity
      a) Connectivity via O(polylog n) bit Fingerprints
      b) Extension to Estimating Cuts and Eigenvalues
  • II. Misalignment

Joint work with Kook Jin Ahn and Sudipto Guha

slide-9
SLIDE 9

Sketches for Connectivity

  • Theorem: Can check graph connectivity w.h.p. using an O(polylog n) bit fingerprint of each adjacency list.
  • Corollary: Can monitor connectivity in a dynamic graph stream where edges are both inserted and deleted.

  • Note: Previous stream work assumed no edge deletions.
  • e.g., [Feigenbaum, Kannan, McGregor, Suri, Zhang 2004, 2005], [McGregor 2005]
  • [Jowhari, Ghodsi 2005], [Zelke 2008], [Sarma, Gollapudi, Panigrahy 2008, 2009]
  • [Ahn, Guha 2009, 2011], [Konrad, Magniez, Mathieu 2012], [Goel, Kapralov, Khanna 2012]
slide-10
SLIDE 10
  • Suppose there’s a bridge (u,v) in the graph, i.e., Alice and Bob have a friendship that is essential to global connectivity.

  • It seems that at least one of their fingerprints needs Ω(n) bits:
  • One of their fingerprints must contain info about the bridge.
  • Alice and Bob don’t know their friendship is special.
  • Alice and Bob may each have Ω(n) friends.

This can’t be possible?!

slide-11
SLIDE 11
How we do it...

  • Template: Exploit homomorphic properties of linear sketches and emulate a classical algorithm in sketch space.

[Diagram: Original Graph → Sketch → Sketch Space; running the Algorithm on either side yields the same ANSWER.]

slide-12
SLIDE 12

Ingredient 1: Basic Algorithm

Algorithm (Spanning Forest):

  1. For each node: pick an incident edge.
  2. For each connected component: pick an incident edge.
  3. Repeat until there are no edges between connected components.

Lemma: After O(log n) rounds, the selected edges include a spanning forest.
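The rounds of this basic algorithm can be sketched directly on an explicit edge list, before any sketching is involved. This is a plain Borůvka-style illustration; the function name and the union-find bookkeeping are assumptions, not taken from the slides.

```python
def spanning_forest(n, edges):
    """Each round, every current component picks one edge leaving it."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    forest = []
    while True:
        picked = {}  # component root -> one incident edge leaving that component
        for u, v in edges:
            cu, cv = find(u), find(v)
            if cu != cv:
                picked.setdefault(cu, (u, v))
                picked.setdefault(cv, (u, v))
        if not picked:  # no edges between components remain
            return forest
        for u, v in picked.values():
            if find(u) != find(v):  # skip edges made redundant earlier in this round
                parent[find(u)] = find(v)
                forest.append((u, v))

edges = [(0, 1), (1, 2), (2, 0), (2, 3), (4, 5)]
forest = spanning_forest(7, edges)
print(len(forest))  # number of nodes minus number of connected components
```

Every component with an outgoing edge merges with at least one other in each round, which is where the O(log n) round bound comes from.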

slide-13
SLIDE 13

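Ingredient 2: Sketching Neighborhoods

For node i, let $a_i$ be the vector indexed by node pairs. Non-zero entries, one per edge {i,j} incident to i: $a_i[i,j] = 1$ if $j > i$ and $a_i[i,j] = -1$ if $j < i$.

[Figure: example graph on nodes 1, 2, 3, 4, 5, with vector coordinates ordered {1,2} {1,3} {1,4} {1,5} {2,3} {2,4} {2,5} {3,4} {3,5} {4,5}. Non-zero entries shown: $a_1$ has 1, 1; $a_2$ has −1, 1; $a_1 + a_2$ has 1, 1, since the {1,2} entries cancel.]

Lemma: For any subset of nodes $S \subset V$,

$$\mathrm{support}\Big(\sum_{i \in S} a_i\Big) = E(S, V \setminus S)$$

Lemma: There exists a random $M: \mathbb{R}^N \to \mathbb{R}^{\mathrm{polylog}\, N}$ such that for any $a \in \mathbb{R}^N$, one can deduce some $e \in \mathrm{support}(a)$ from $Ma$. [Jowhari, Saglam, Tardos 2011]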
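The cut lemma can be checked by brute force on a small graph. The edge list below is a hypothetical example on nodes 1 to 5 (the graph in the original figure is not recoverable), and the vectors are stored sparsely as dictionaries keyed by node pair.

```python
from itertools import combinations

def node_vector(i, edges):
    """Sparse a_i: +1 on pair {i,j} if j > i, -1 if j < i, for each edge at i."""
    vec = {}
    for u, v in edges:
        lo, hi = min(u, v), max(u, v)
        if i == lo:
            vec[(lo, hi)] = 1
        elif i == hi:
            vec[(lo, hi)] = -1
    return vec

def add_vectors(x, y):
    """Entry-wise sum of two sparse vectors, dropping entries that cancel to zero."""
    out = dict(x)
    for key, val in y.items():
        out[key] = out.get(key, 0) + val
        if out[key] == 0:
            del out[key]
    return out

def cut_edges(S, edges):
    """E(S, V \\ S): edges with exactly one endpoint in S."""
    return {(min(u, v), max(u, v)) for u, v in edges if (u in S) != (v in S)}

nodes = [1, 2, 3, 4, 5]
edges = [(1, 2), (1, 3), (2, 4), (3, 4), (4, 5)]  # hypothetical example graph

def lemma_holds(S):
    total = {}
    for i in S:
        total = add_vectors(total, node_vector(i, edges))
    return set(total) == cut_edges(set(S), edges)

print(all(lemma_holds(S)
          for size in range(1, len(nodes))
          for S in combinations(nodes, size)))
```

An edge with both endpoints in S contributes +1 from its smaller endpoint and −1 from its larger one, so only edges leaving S survive in the sum.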

slide-14
SLIDE 14

Recipe: Sketch & Compute on Sketches

Sketch for node j: $M a_j$

Run Algorithm in Sketch Space:

  • Use $M a_j$ to get an incident edge on each node j.
  • For i = 2 to log n: to get an incident edge on component $S \subset V$, use:

$$\sum_{j \in S} M a_j = M\Big(\sum_{j \in S} a_j\Big) \;\longrightarrow\; e \in \mathrm{support}\Big(\sum_{j \in S} a_j\Big) = E(S, V \setminus S)$$

Detail: Actually each player sends log n independent sketches $M_1 a_j, M_2 a_j, \ldots$ and the central player uses $M_i a_j$ when emulating the ith iteration of the algorithm.
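The recipe can be emulated end to end with a toy sampler. The sketch below is not the [Jowhari, Saglam, Tardos 2011] construction: it substitutes a simple stand-in that subsamples coordinates at geometric rates and accepts a level only if it passes a 1-sparse check (sum, index-weighted sum, and a polynomial fingerprint). All names and the parameters `LEVELS`, `rounds`, and `reps` are assumptions chosen for a small demo.

```python
import random
from itertools import combinations

P = (1 << 61) - 1  # prime modulus for the 1-sparse check (illustrative choice)
LEVELS = 8         # number of subsampling levels; enough for the small demo below

def make_sampler(num_coords, rng):
    """Shared randomness for one linear sketch: base r and a geometric depth per coordinate."""
    r = rng.randrange(2, P)
    depth = []
    for _ in range(num_coords):
        d = 0
        while d < LEVELS - 1 and rng.random() < 0.5:
            d += 1
        depth.append(d)
    return r, depth

def sketch(sampler, vec):
    """Linear sketch of sparse vec: per level (sum x, sum e*x, sum x*r^e mod P)."""
    r, depth = sampler
    out = [[0, 0, 0] for _ in range(LEVELS)]
    for e, x in vec.items():
        for level in range(depth[e] + 1):  # coordinate e survives at levels 0..depth[e]
            cell = out[level]
            cell[0] += x
            cell[1] += x * e
            cell[2] = (cell[2] + x * pow(r, e, P)) % P
    return out

def add_sketches(a, b):
    return [[p[0] + q[0], p[1] + q[1], (p[2] + q[2]) % P] for p, q in zip(a, b)]

def recover(sampler, sk, num_coords):
    """Return a coordinate of the support if some level passes the 1-sparse check."""
    r, _ = sampler
    for s0, s1, s2 in sk:
        if s0 != 0 and s1 % s0 == 0:
            e = s1 // s0
            if 0 <= e < num_coords and s2 == (s0 * pow(r, e, P)) % P:
                return e
    return None

def sketch_components(n, edges, rounds=6, reps=12, seed=0):
    rng = random.Random(seed)
    pairs = list(combinations(range(n), 2))
    index = {p: k for k, p in enumerate(pairs)}
    vecs = [{} for _ in range(n)]
    for u, v in edges:
        lo, hi = min(u, v), max(u, v)
        vecs[lo][index[(lo, hi)]] = 1
        vecs[hi][index[(lo, hi)]] = -1
    samplers = [[make_sampler(len(pairs), rng) for _ in range(reps)] for _ in range(rounds)]
    # Players: node j sketches only its own vector a_j, with fresh sketches per round.
    sent = [[[sketch(s, vecs[j]) for s in row] for row in samplers] for j in range(n)]
    # Central player: from here on, only the sketches are used.
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def groups():
        comps = {}
        for j in range(n):
            comps.setdefault(find(j), []).append(j)
        return list(comps.values())

    for t in range(rounds):
        found = []
        for members in groups():
            for k in range(reps):
                total = sent[members[0]][t][k]
                for j in members[1:]:
                    total = add_sketches(total, sent[j][t][k])  # sketch of the summed vector
                e = recover(samplers[t][k], total, len(pairs))
                if e is not None:
                    found.append(pairs[e])
                    break
        for u, v in found:
            parent[find(u)] = find(v)
    return sorted(groups())

demo_edges = [(0, 1), (1, 2), (2, 3), (4, 5), (5, 6), (4, 6)]
print(sketch_components(8, demo_edges))
```

The result is correct only with high probability: a recovered pair is a real cut edge unless the fingerprint check collides, and a component can fail to find an edge in a given round if no level happens to be 1-sparse.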

slide-15
SLIDE 15

Extension to Sparsification

  • Theorem: Can test k-connectivity using O(k polylog n) bit fingerprints of each adjacency list.
  • Theorem: Can (1+ε)-approximate every graph cut using $O(\epsilon^{-2}\, \mathrm{polylog}\, n)$ bit fingerprints of each adjacency list.
  • Theorem: Can construct a spectral sparsifier H using $O(\epsilon^{-2}\, n^{2/3}\, \mathrm{polylog}\, n)$ bit fingerprints of each adjacency list, i.e., $(1-\epsilon)\, x^T L_G x \le x^T L_H x \le (1+\epsilon)\, x^T L_G x$ for all x, where $L_G$ and $L_H$ are the Laplacians of G and H.
slide-16
SLIDE 16

k-Connectivity

Basic Algorithm:

For i = 1 to k:
  • Let $F_i$ be a spanning forest of $G(V, E - F_1 - \ldots - F_{i-1})$.

Lemma: $F_1 + \ldots + F_k$ contains either all the edges across a cut in G or ≥ k of them. Call such a graph a k-skeleton.

Emulation in Sketch Space:

Sketch: Simultaneously construct k independent connectivity sketches $M_1(G), M_2(G), \ldots, M_k(G)$.

Run Algorithm in Sketch Space:
  • Use $M_1(G)$ to find a spanning forest $F_1$ of G.
  • Use $M_2(G) - M_2(F_1) = M_2(G - F_1)$ to find $F_2$.
  • Use $M_3(G) - M_3(F_1) - M_3(F_2) = M_3(G - F_1 - F_2)$ to find $F_3$...
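The basic (non-sketched) k-skeleton construction and its lemma can be checked on a small graph. This sketch assumes a simple graph given as a list of distinct edges; `k_skeleton` and `lemma_holds` are illustrative names, and the spanning forest here comes from union-find rather than from sketches.

```python
from itertools import combinations

def spanning_forest(n, edges):
    """Any spanning forest of the given edges, via union-find."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    forest = []
    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            forest.append((u, v))
    return forest

def k_skeleton(n, edges, k):
    """F_1 + ... + F_k, where F_i is a spanning forest of G minus F_1..F_{i-1}."""
    remaining = list(edges)
    skeleton = []
    for _ in range(k):
        forest = spanning_forest(n, remaining)
        skeleton.extend(forest)
        chosen = set(forest)
        remaining = [e for e in remaining if e not in chosen]
    return skeleton

def lemma_holds(n, edges, k):
    """Every cut keeps all of its edges in the skeleton, or at least k of them."""
    skeleton = k_skeleton(n, edges, k)
    for mask in range(1, 2 ** n - 1):
        crossing = lambda e: ((mask >> e[0]) & 1) != ((mask >> e[1]) & 1)
        in_graph = sum(1 for e in edges if crossing(e))
        in_skeleton = sum(1 for e in skeleton if crossing(e))
        if in_skeleton != in_graph and in_skeleton < k:
            return False
    return True

complete_graph = list(combinations(range(6), 2))
print([lemma_holds(6, complete_graph, k) for k in (1, 2, 3)])
```

The sketch-space version replaces each `spanning_forest` call with the connectivity sketch, and replaces the edge removal with the subtraction $M_i(G) - M_i(F_1) - \ldots$ shown above.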

slide-17
SLIDE 17

(1+ε)-Approx of All Cuts

Theorem (Fung et al.): Sample edge e w/p $p_e$ and weight by $1/p_e$. If $p_e = \epsilon^{-2} \log^2 n / c_e$, where $c_e$ is the size of the min cut separating the endpoints of e, then all cuts are preserved up to factor 1+ε.

Algorithm: Let $G_i$ be the graph with edges sampled w/p $2^{-i}$. Construct a k-skeleton $H_i$ for each $G_i$, where $k = 2\epsilon^{-2} \log^2 n$.

Theorem: e is in some $H_i$ w/p at least $p_e$.

Proof: Let C be the edges in a min u-v cut in G, where e = (u,v). For $i = -\log p_e$, we have $|C \cap G_i| < k$ by the Chernoff bound. Hence $e \in H_i$ iff $e \in G_i$, which happens w/p $p_e$.

  i             1       2       3       4       ...   −log p_e        ...   log n
  P[e∈G_i]      1/2     1/4     1/8     1/16    ...   p_e             ...   1/n
  E[|C∩G_i|]    c_e/2   c_e/4   c_e/8   c_e/16  ...   ε^-2 log^2 n    ...   c_e/n
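The arithmetic behind the proof step can be written out as follows, assuming logarithms are base 2 and that $-\log p_e$ is an integer level (both assumptions are for illustration; the slide does not state them):

```latex
\[
  i^{*} = -\log_2 p_e
  \quad\Longrightarrow\quad
  \Pr[e \in G_{i^{*}}] = 2^{-i^{*}} = p_e ,
\]
\[
  \mathbb{E}\,\lvert C \cap G_{i^{*}} \rvert
  = c_e \, 2^{-i^{*}}
  = c_e \, p_e
  = \epsilon^{-2} \log^{2} n
  = \frac{k}{2} .
\]
```

The expected number of surviving cut edges is therefore half of k, which is the gap the Chernoff bound uses to conclude $|C \cap G_{i^{*}}| < k$ with high probability.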

slide-18
SLIDE 18
  • I. Connectivity
  • II. Misalignment
      a) Testing Equality with Rotation
      b) Matching Lower Bound

Joint work with Alexandr Andoni, Assaf Goldberger, Ely Porat

slide-19
SLIDE 19

Fingerprints for Rotation

  • Theorem: There’s a D(n) polylog n bit fingerprint F that is:
      • Useful: F(a) and F(b) determine if a, b ∈ ℤ^n are rotations w.h.p.
      • Homomorphic: From F(a) can construct F(any rotation of a).
      • Linear: From F(a) and F(b) can compute F(a+b).
  • Theorem: Fingerprints with above properties need D(n) bits.
  • Extension: (t + D(n)) polylog n bit fingerprints F(a) and F(b) determine if a, b are within t substitutions of being rotations.

[Figure: “The quick brown fox jumped over the lazy dog.” → CYCLIC ROTATION → “quick brown fox jumped over the lazy dog. The”]

slide-20
SLIDE 20

False Start: Fermat’s Little Theorem

Rabin-Karp: For some p and r, encode $a = a_0 a_1 a_2 \ldots a_{n-1}$ as

$$f(r, a) = a_0 + a_1 r + a_2 r^2 + \ldots + a_{n-1} r^{n-1} \bmod p$$

Fermat’s Little Thm: If p = n+1 is prime, $r^n = 1 \bmod p$ and so,

$$r f(r, a_0 a_1 \ldots a_{n-1}) = a_0 r + a_1 r^2 + a_2 r^3 + \ldots + a_{n-1} r^n = a_{n-1} + a_0 r + a_1 r^2 + \ldots + a_{n-2} r^{n-1} = f(r, a_{n-1} a_0 \ldots a_{n-2})$$

So, if b is a k-shift of a then

$$g(r) = r^k f(r, a) - f(r, b) = 0$$

Schwartz-Zippel: If r is random and g non-zero:

$$P[g(r) = 0] \le (n-1)/p = 1 - O(1/n)$$

Conclusion: No false negatives but likely false positives.
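Both halves of the conclusion can be seen on a tiny instance with n = 6 and p = 7. The helper names and the example strings are hypothetical; `rotate` shifts right so that one shift moves the last symbol to the front, matching the identity on this slide.

```python
def f(r, a, p):
    """Rabin-Karp encoding: a_0 + a_1 r + ... + a_{n-1} r^(n-1) mod p."""
    return sum(x * pow(r, i, p) for i, x in enumerate(a)) % p

def rotate(a, k):
    """Cyclic right shift by k positions."""
    k %= len(a)
    return a[-k:] + a[:-k] if k else list(a)

def looks_like_rotation(r, a, b, p):
    """Fingerprint-only test: is r^k f(r,a) = f(r,b) for some shift k?"""
    return any(pow(r, k, p) * f(r, a, p) % p == f(r, b, p) for k in range(len(a)))

n, p = 6, 7  # p = n + 1 is prime, so r^n = 1 mod p for every r in 1..p-1
a = [3, 1, 4, 1, 5, 2]

# No false negatives: the identity holds for every r and every shift k.
no_false_negatives = all(
    pow(r, k, p) * f(r, a, p) % p == f(r, rotate(a, k), p)
    for r in range(1, p) for k in range(n)
)
print(no_false_negatives)

# False positive: x and y are not rotations of each other, yet with r = 3
# (a primitive root mod 7) the powers r^k reach every non-zero residue.
x = [1, 0, 0, 0, 0, 0]
y = [2, 0, 0, 0, 0, 0]
print(looks_like_rotation(3, x, y, p))
```

With a primitive root as r, any two strings with non-zero encodings pass the test, which is why the field is too small for this approach.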

slide-21
SLIDE 21

Beyond Schwartz-Zippel

Evaluate g on roots of $x^n - 1$, but work in a larger field.

$x^n - 1$ factorizes as D(n) irreducible polys over the rationals:

$$x^{10} - 1 = \Phi_1(x)\,\Phi_2(x)\,\Phi_5(x)\,\Phi_{10}(x) = (x - 1)(1 + x)(1 + x + x^2 + x^3 + x^4)(1 - x + x^2 - x^3 + x^4)$$

At least one $\Phi_i$ has no shared roots with g:
  • If $\Phi_i$ shares one root, $\Phi_i$ divides g (Abel’s Irred. Thm).
  • They can’t all divide g because g has degree ≤ n−1.

Suffices to test g on an arbitrary root of each $\Phi_i$.

Bad News: Can’t guarantee g(r) has finite precision.

Good News: Work modulo a random p. Can show $\Phi_i$ still doesn’t share roots with g whp by analyzing the resultant.
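The factorization of $x^n - 1$ into one cyclotomic polynomial per divisor of n can be computed with exact integer polynomial division. This is an illustrative sketch; polynomials are coefficient sequences with the lowest degree first, and the function names are not from the slides.

```python
from functools import lru_cache

def poly_mul(a, b):
    """Multiply two integer polynomials."""
    out = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out

def poly_div_exact(num, den):
    """Divide num by a monic den; the remainder must be zero."""
    num = list(num)
    quotient = [0] * (len(num) - len(den) + 1)
    for k in range(len(quotient) - 1, -1, -1):
        quotient[k] = num[k + len(den) - 1]  # den is monic, so no division needed
        for j, y in enumerate(den):
            num[k + j] -= quotient[k] * y
    assert not any(num), "division was not exact"
    return quotient

@lru_cache(maxsize=None)
def cyclotomic(n):
    """Phi_n = (x^n - 1) divided by Phi_d for every proper divisor d of n."""
    poly = [-1] + [0] * (n - 1) + [1]
    for d in range(1, n):
        if n % d == 0:
            poly = poly_div_exact(poly, cyclotomic(d))
    return tuple(poly)

def factors_of_x_n_minus_1(n):
    """One irreducible factor per divisor of n, so D(n) factors in total."""
    return [cyclotomic(d) for d in range(1, n + 1) if n % d == 0]

factors = factors_of_x_n_minus_1(10)
print(len(factors))
print(factors)
```

For n = 10 the four factors correspond to the divisors 1, 2, 5, and 10, matching D(10) = 4.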

slide-22
SLIDE 22

Lower Bound: Basic Idea

Can recover D(n) bits about a from F(a) by summing the fingerprints of rotations:

$$F(a_0a_1a_2a_3a_4a_5) + F(a_1a_2a_3a_4a_5a_0) + \ldots + F(a_5a_0a_1a_2a_3a_4) = F(\alpha\alpha\alpha\alpha\alpha\alpha), \qquad \alpha = \sum_i a_i$$

To deduce α from F(αααααα): compute F(gggggg) and compare for all g until it matches.

$$F(a_0a_1a_2a_3a_4a_5) + F(a_2a_3a_4a_5a_0a_1) + F(a_4a_5a_0a_1a_2a_3) = F(\beta\gamma\beta\gamma\beta\gamma), \qquad \beta = a_0 + a_2 + a_4$$

To deduce β from F(βγβγβγ): compute F(g g′ g g′ g g′) and compare for all g, g′ = α − g until it matches.

And so on for other divisors of n...
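The rotation sums behind this argument can be checked numerically. The fingerprint `F` below is only a stand-in (polynomial evaluation mod a prime, which is linear); the slides' actual fingerprint is not reproduced here, and the values of `p`, `r`, and `a` are arbitrary.

```python
def rotations_sum(a, step):
    """Sum the left rotations of a by 0, step, 2*step, ... (step divides len(a))."""
    n = len(a)
    total = [0] * n
    for shift in range(0, n, step):
        for i in range(n):
            total[i] += a[(i + shift) % n]
    return total

def left_rotate(a, shift):
    return a[shift:] + a[:shift]

def F(a, r=5, p=1_000_003):
    """Stand-in linear fingerprint: polynomial evaluation at r modulo the prime p."""
    return sum(x * pow(r, i, p) for i, x in enumerate(a)) % p

a = [1, 2, 3, 4, 5, 6]

print(rotations_sum(a, 1))  # every entry equals a_0 + ... + a_5
print(rotations_sum(a, 2))  # entries alternate a_0+a_2+a_4 and a_1+a_3+a_5

# Linearity: the sum of the fingerprints is the fingerprint of the sum.
lhs = sum(F(left_rotate(a, s)) for s in range(0, 6, 2)) % 1_000_003
rhs = F(rotations_sum(a, 2))
print(lhs == rhs)
```

Each divisor of n gives one such periodic sum, which is how D(n) pieces of information about a can be extracted from F(a).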

slide-23
SLIDE 23

Thanks!

  • Homomorphic Sketches: Compress using sketches such that we can run algorithms on compressed data directly. Resulting algorithms are parallelizable + streamable.
  • Graphs: Dimensionality reduction for preserving structural properties. Enables dynamic graph streaming.
  • Fingerprinting with Misalignments: Tight bounds on size of fingerprint necessary for testing equality up to rotations.

slide-24
SLIDE 24