Algorithms for Querying Noisy Distributed/Streaming Datasets

Qin Zhang, Indiana University Bloomington. Sublinear Algo Workshop @ JHU, Jan 9, 2016.


  1. Algorithms for Querying Noisy Distributed/Streaming Datasets. Qin Zhang, Indiana University Bloomington. Sublinear Algo Workshop @ JHU, Jan 9, 2016.

  2. The “big data” models
     The streaming model (Alon, Matias and Szegedy 1996)
     – high-speed online data
     – limited storage
     The k-site model
     – data is stored in a distributed fashion
     – limited network bandwidth
     [Figures: a stream processed by a CPU with limited RAM; a coordinator C connected to sites $S_1, S_2, S_3, \ldots, S_k$.]

  4. k-site model
     k sites and 1 coordinator.
     – Each site has a 2-way communication channel with the coordinator.
     – Each site $S_i$ has a piece of data $x_i$. The coordinator has $\emptyset$.
     Task: compute $f(x_1, \ldots, x_k)$ together via communication.
     – The coordinator reports the answer.
     – Computation is divided into rounds.
     Goal: minimize both
     • the total #bits of communication ($o(\text{Input})$; at best $\mathrm{polylog}(\text{Input})$)
     • and #rounds ($O(1)$ or $\mathrm{polylog}(\text{Input})$).
     – No constraint on the #bits that can be sent or received by each site in each round (usually balanced).
     – Local computation is not counted (usually linear).
     [Figure: coordinator C, holding $\emptyset$, exchanging messages in one round with sites $S_1, S_2, S_3, \ldots, S_k$ holding $x_1, x_2, x_3, \ldots, x_k$.]
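To make the two cost measures concrete, here is a minimal simulation sketch of the model, not anything shown in the talk. The `Coordinator` class, the one-integer-per-site message format, and the choice of $f$ as the total item count are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Coordinator:
    """Hypothetical coordinator that tallies the two costs the model charges."""
    bits: int = 0
    rounds: int = 0

    def run_round(self, sites: List[List[int]],
                  local: Callable[[List[int]], int]) -> List[int]:
        """One round: every site sends a single integer to the coordinator."""
        self.rounds += 1
        # Local computation at each site is free in this model.
        messages = [local(x_i) for x_i in sites]
        # Charge each message by its binary length (at least 1 bit).
        self.bits += sum(max(1, msg.bit_length()) for msg in messages)
        return messages


def total_items(sites: List[List[int]]) -> Tuple[int, int, int]:
    """Compute f(x_1, ..., x_k) = sum_i |x_i| in one round.

    Returns (answer, rounds used, bits communicated).
    """
    coordinator = Coordinator()
    answer = sum(coordinator.run_round(sites, len))
    return answer, coordinator.rounds, coordinator.bits
```

With three sites holding 3, 1, and 2 items, the messages are the integers 3, 1, and 2, so the protocol uses one round and 2 + 1 + 2 = 5 bits, far fewer than shipping the data itself.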

  6. k-site model (cont.)
     Communication → time, energy, bandwidth, ...
     Abstraction: the MapReduce model (Input → Map → Shuffle → Reduce → Output) and the BSP model can be viewed as a coordinator C with sites $S_1, S_2, S_3, \ldots, S_k$.
     Also network monitoring, sensor networks, etc.

  7. We will start with the k-site model and mention the streaming model at the end.

  8. Sketching
     Q: How many distinct elements ($F_0$) are in the union of the k bags?
     global sketch = merge{local sketches}
     [Figure: each site $S_1, S_2, S_3, \ldots, S_k$ sends a local sketch to the coordinator C.]
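One standard mergeable sketch for noise-free $F_0$ is the k-minimum-values (KMV) sketch. The talk does not say which sketch it has in mind, so the code below is only an illustrative sketch: the function names, the SHA-256 hash, and the sketch size `t` are assumptions.

```python
import hashlib


def h(item: str) -> float:
    """Hash an item to a pseudo-random point in the unit interval."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def local_sketch(bag, t=64):
    """Local sketch of one site: the t smallest distinct hash values."""
    return sorted({h(x) for x in bag})[:t]


def merge(sketches, t=64):
    """Global sketch: the t smallest values among all local sketches."""
    return sorted(set().union(*sketches))[:t]


def estimate_f0(sketch, t=64):
    """Estimate the number of distinct items from a KMV sketch."""
    if len(sketch) < t:
        return len(sketch)  # fewer than t distinct hashes: the count is exact
    return (t - 1) / sketch[t - 1]  # standard KMV estimator
```

Merging the local sketches gives exactly the sketch of the union, so each site only needs to send `t` numbers. Note that this relies on identical items hashing to identical values.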

  11. Linear sketching
      Random linear mapping $M : \mathbb{R}^n \to \mathbb{R}^k$ where $k \ll n$, such that $g(Mx) \approx f(x)$.
      – $x$: the data, e.g., a frequency vector
      – $M$: the linear mapping
      – $Mx$: the sketching vector
      Perfect for distributed and streaming computation.
      Simple and useful: used in many statistical/graph/algebraic problems in streaming, compressive sensing, ...
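The reason linearity suits distributed data is that $M(x + y) = Mx + My$, so the coordinator can add up the sites' sketches. The following is a small sketch under assumptions: it uses a random $\pm 1$ matrix and the AMS-style estimator for $F_2$ as one example of $g$; the function names and parameters are hypothetical.

```python
import random


def make_sketch_matrix(rows, n, seed=0):
    """Random +/-1 matrix M with `rows` rows (the sketch size) and n columns."""
    rng = random.Random(seed)
    return [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(rows)]


def apply_sketch(M, x):
    """Compute the sketching vector Mx."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]


def estimate_f2(sketch):
    """g(Mx): the mean squared coordinate estimates F_2(x) = sum_j x_j^2."""
    return sum(v * v for v in sketch) / len(sketch)


# Two sites sketch their own frequency vectors with the same shared M.
M = make_sketch_matrix(rows=16, n=4, seed=1)
x, y = [3, 0, 2, 1], [1, 5, 0, 2]
# The coordinator adds the sketches coordinate-wise to get the sketch of x + y.
merged = [a + b for a, b in zip(apply_sketch(M, x), apply_sketch(M, y))]
```

Here `merged` equals the sketch of the combined frequency vector `x + y`, without either site revealing its full vector.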

  14. But what if the data is noisy?
      Real-world distributed datasets are often noisy!
      We (have to) consider similar items as one element. Then how to compute $F_0$?
      Cannot use linear sketches :(
      [Figure: coordinator C with sites $S_1, S_2, S_3, \ldots, S_k$ holding variants of one record: names “Joseph Smith” / “Joe Smith”, streets “800 Mt. Road” / “800 Mount Av” / “800 Mountain Av”, cities “Springfield” / “springfield”.]

  16. Noisy data is universal
      Music, images, ... after compression, resizing, reformatting, etc.
      Queries of the same meaning sent to Google: “sublinear algorithm workshop 2016”, “JHU sublinear algorithm”, “sublinear John Hopkins”.

  18. Related to Entity Resolution
      Entity resolution: identify and link/group different manifestations of the same real-world object.
      Very important in data cleaning/integration. It has been studied for 40 years in DB, also in AI, NT. E.g., [Gill & Goldacre’03, Koudas et al.’06, Elmagarmid et al.’07, Herzog et al.’07, Dong & Naumann’09, Willinger et al.’09, Christen’12] for introductions, and [Getoor and Machanavajjhala’12] for a tutorial.
      Centralized: detect items representing the same entity, merge/output all distinct entities.
      In the big data models, we want communication/space-efficient algorithms ($o(\text{input size})$); we cannot afford a comprehensive de-duplication.

  20. Our problems and goal
      Problem: how to perform robust statistical estimation in the k-site model communication-efficiently?
      Assume all parties are provided with an oracle (e.g., a distance function and a threshold) determining whether two items $u$, $v$ represent the same entity (denoted by $u \sim v$) or not.
      We will design a framework so that users can plug in any “distance function” at run time.
      Goal: minimize communication & #rounds.
      [Figure: coordinator C connected to sites $S_1, S_2, S_3, \ldots, S_k$.]
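A plug-in oracle of this kind can be sketched as a closure over a distance function and a threshold. The names `make_oracle`, `abs_diff`, and `case_mismatch` below are hypothetical and are not part of the framework described in the talk.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def make_oracle(distance: Callable[[T, T], float],
                threshold: float) -> Callable[[T, T], bool]:
    """Build the test u ~ v from any distance function and a threshold."""
    def same_entity(u: T, v: T) -> bool:
        return distance(u, v) <= threshold
    return same_entity


# Two illustrative distance functions that a user might plug in at run time.
def abs_diff(a: float, b: float) -> float:
    """Distance between two numeric readings."""
    return abs(a - b)


def case_mismatch(a: str, b: str) -> float:
    """0 if two strings agree up to letter case, 1 otherwise."""
    return 0.0 if a.lower() == b.lower() else 1.0


near = make_oracle(abs_diff, threshold=0.5)
same_city = make_oracle(case_mismatch, threshold=0.0)
```

Under these assumptions, `near(1.0, 1.3)` holds while `near(1.0, 2.0)` does not, and `same_city("Springfield", "springfield")` holds. An algorithm written against `same_entity` never needs to know which distance was chosen.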

  22. Remarks
      Remark 1. We do not specify the distance function in our algorithms, for two reasons:
      (1) It allows our algorithms to work with any distance function.
      (2) Sometimes it is very hard to assume that similarities between items can be expressed by a well-known distance function: “AT&T Corporation” is closer to “IBM Corporation” than to “AT&T Corp” under the edit distance!
      Remark 2. We assume transitivity: if $u \sim v$ and $v \sim w$, then $u \sim w$. In other words, the noise is “well-shaped”. One may come up with the following problematic situation: we have $a \sim b$, $b \sim c$, ..., $y \sim z$, but $a \not\sim z$. For many specific metric spaces, our algorithms still work if the number of “outliers” is small.
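The edit-distance claim in Remark 1 can be checked directly. The sketch below assumes the standard case-sensitive Levenshtein distance with unit costs for insertion, deletion, and substitution; the talk does not say which variant it means.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the row-by-row dynamic program."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


to_ibm = edit_distance("AT&T Corporation", "IBM Corporation")
to_corp = edit_distance("AT&T Corporation", "AT&T Corp")
```

Turning “AT&T” into “IBM” takes 4 edits, whereas dropping “oration” takes 7, so `to_ibm` is 4 and `to_corp` is 7: the different company is the nearer string.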

  24. Remarks (cont.)
      Remark 3. Will clustering help? Answer: NO. #clusters can be linear.
      Remark 4. Does there exist a magic hash function that (1) maps (only) items in the same group into the same bucket and (2) can be described succinctly? Answer: NO. For specific metrics, tools such as LSH may help.

  25. A few notations
      • We have k sites (machines), each holding a multiset of items $S_i$.
      • Let multiset $S = \bigcup_{i \in [k]} S_i$, and let $m = |S|$.
      • Under the transitivity assumption, $S$ can be partitioned into a set of groups $G = \{G_1, \ldots, G_n\}$. Each group $G_i$ represents a distinct universe element.
      • $\tilde{O}(\cdot)$ hides $\mathrm{poly}\log(m/\epsilon)$ factors.
      [Figure: coordinator C connected to sites $S_1, S_2, S_3, \ldots, S_k$.]
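The notation can be illustrated with a centralized reference computation of the groups. This is only a sketch for fixing ideas: it gathers all items in one place, which is exactly what the communication-efficient algorithms aim to avoid. The function name and the numeric example are assumptions.

```python
def partition_into_groups(sites, same_entity):
    """Partition the union S of all sites' multisets into groups G_1, ..., G_n.

    Relies on the transitivity assumption: comparing an item against one
    representative per group is enough to decide its group.
    """
    groups = []
    for site in sites:
        for item in site:
            for group in groups:
                if same_entity(item, group[0]):
                    group.append(item)
                    break
            else:
                groups.append([item])  # item starts a new group
    return groups


# Hypothetical example: noisy numeric readings, with u ~ v when |u - v| <= 0.5.
sites = [[1.0, 1.1, 5.0], [5.2, 9.0], [0.9]]
groups = partition_into_groups(sites, lambda u, v: abs(u - v) <= 0.5)
m = sum(len(g) for g in groups)  # m = |S|
n = len(groups)                  # n = number of distinct entities
```

In this example $m = 6$ and $n = 3$, with group sizes 3, 2, and 1.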

  26. Our results

      | Problem | Noisy data: comm. (items) | Noisy data: rounds | Noise-free data: comm. (bits) |
      |---|---|---|---|
      | $F_0$ | $\tilde{O}(\min\{k/\epsilon^3, k^2/\epsilon^2\})$ | $O(1)$ | $\Omega(k/\epsilon^2)$ [WZ12, WZ14] |
      | $L_0$-sampling | $\tilde{O}(k)$ | $O(1)$ | $\Omega(k)$ |
      | $F_p$ ($p \ge 1$) | $\tilde{O}((k^{p-1} + k^3)/\epsilon^3)$ | $O(1)$ | $\Omega(k^{p-1}/\epsilon^2)$ [WZ12] |
      | $(\phi, \epsilon)$-HH | $\tilde{O}(\min\{k/\epsilon, 1/\epsilon^2\})$ | $O(1)$ | $\Omega(\min\{\sqrt{k}/\epsilon, 1/\epsilon^2\})$ [HYZ12, WZ12] |
      | Entropy | $\tilde{O}(k/\epsilon^2)$ | $O(1)$ | $\Omega(k/\epsilon^2)$ [WZ12] |

      1. p-th frequency moment: $F_p(S) = \sum_{i \in [n]} |G_i|^p$. We consider $F_0$ and $F_p$ ($p \ge 1$), and allow a $(1 + \epsilon)$-approximation.
      2. $L_0$-sampling on $S$: return a group $G_i$ (or an arbitrary item in $G_i$) uniformly at random from $G$.
      3. $(\phi, \epsilon)$-heavy-hitter of $S$ ($0 < \epsilon \le \phi \le 1$) (definition omitted).
      4. Empirical entropy: $\mathrm{Entropy}(S) = \sum_{i \in [n]} \frac{|G_i|}{m} \log \frac{m}{|G_i|}$. We allow a $(1 + \epsilon)$-approximation.
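As a reference point for the definitions above, the exact quantities can be computed from the group sizes once the groups are known. This is a minimal sketch and not one of the distributed algorithms from the talk. The function names are hypothetical, and the logarithm base (2) is an assumption, since the slide does not specify it.

```python
import math
import random


def frequency_moment(group_sizes, p):
    """F_p(S) = sum_i |G_i|^p, with F_0 defined as the number of groups."""
    if p == 0:
        return len(group_sizes)
    return sum(size ** p for size in group_sizes)


def empirical_entropy(group_sizes):
    """Entropy(S) = sum_i (|G_i|/m) * log(m/|G_i|), using log base 2 here."""
    m = sum(group_sizes)
    return sum((size / m) * math.log2(m / size) for size in group_sizes)


def l0_sample(groups, rng):
    """Return one group uniformly at random, regardless of group sizes."""
    return rng.choice(groups)


sizes = [3, 2, 1]  # hypothetical group sizes |G_1|, |G_2|, |G_3|
f0 = frequency_moment(sizes, 0)
f1 = frequency_moment(sizes, 1)
f2 = frequency_moment(sizes, 2)
```

For these sizes, $F_0 = 3$, $F_1 = 6$ (which equals $m$), and $F_2 = 9 + 4 + 1 = 14$. Two groups of equal size give an entropy of exactly 1 bit.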
