Algorithms for Querying Noisy Distributed/Streaming Datasets
Sublinear Algo Workshop @ JHU Jan 9, 2016
Qin Zhang, Indiana University Bloomington
The big data models

The streaming model (Alon, Matias and Szegedy 1996)
– high-speed online data
– limited storage

The k-site model
– data is stored distributedly
– limited network bandwidth
[Figure: k sites S1, . . . , Sk, each linked to a coordinator C]
The k-site model

k sites and 1 coordinator:
– each site has a 2-way communication channel with the coordinator;
– each site Si has a piece of data xi; the coordinator has ∅.

Task: compute f(x1, . . . , xk) together via communication.
– The coordinator reports the answer.
– Computation is divided into rounds.

Goal: minimize both communication and the number of rounds.
– No constraint on the number of bits that can be sent or received by each site in each round (usually balanced).
– Local computation is not counted (usually linear).
Abstraction
An abstraction of:
– The BSP model.
– The MapReduce model (Input → Map → Shuffle → Reduce → Output).

Communication → time, energy, bandwidth, . . .
Also applies to network monitoring, sensor networks, etc.
Sketching: each site builds a local sketch; global sketch = merge{local sketches}.
Q: How many distinct elements (F0) in the union of the k bags?
Linear sketches

A random linear mapping M : R^n → R^k, where k ≪ n, maps the data x (e.g., a frequency vector) to a sketching vector Mx, so that g(Mx) ≈ f(x) for a suitable function g.
Perfect for distributed and streaming computation.
Simple and useful: linear sketches are used for many statistical/graph/algebraic problems in streaming, compressive sensing, . . .
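To make "perfect for distributed and streaming computation" concrete, here is a minimal AMS-style linear sketch for F2 = ||x||^2 in Python. This is a standard textbook example, not the talk's construction; the parameters and the two-site setup are illustrative.

```python
import numpy as np

# A minimal AMS-style linear sketch for F2 = ||x||_2^2 (illustrative, not
# the talk's construction). M has i.i.d. +/-1 entries, and
# E[||Mx||^2 / k] = ||x||^2, with roughly 1/sqrt(k) relative error.
rng = np.random.default_rng(0)
n, k = 10_000, 400
M = rng.choice([-1.0, 1.0], size=(k, n))

x1 = np.zeros(n); x1[rng.integers(0, n, size=500)] = 1.0  # site 1's frequency vector
x2 = np.zeros(n); x2[rng.integers(0, n, size=500)] = 1.0  # site 2's frequency vector

# Linearity gives mergeability: each site sketches locally, and the
# coordinator simply adds the sites' length-k sketches.
merged = M @ x1 + M @ x2            # equals M @ (x1 + x2)
print(np.dot(merged, merged) / k)   # estimate of F2(x1 + x2)
print(np.dot(x1 + x2, x1 + x2))     # exact F2, for comparison
```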
Real world distributed datasets are often noisy!
Joseph Smith, 800 Mountain Av springfield
Joe Smith, 800 Mount Av Springfield
Joseph Smith, 800 Mt. Road Springfield
Joe Smith, 800 Mt. Road Springfield
We (have to) consider similar items as the same element.
Cannot use linear sketches :(
Music, images, . . . after compression, resizing, reformatting, etc.
“sublinear algorithm workshop 2016”
“JHU sublinear algorithm”
“sublinear John Hopkins”
Queries of the same meaning sent to Google
Related to Entity Resolution: identify and link/group different manifestations of the same real-world object. Very important in data cleaning / integration; it has been studied for 40 years in DB, and also in AI, NT.

Centralized setting: detect items representing the same entity, then merge/output all distinct entities.

E.g., [Gill & Goldacre'03, Koudas et al.'06, Elmagarmid et al.'07, Herzog et al.'07, Dong & Naumann'09, Willinger et al.'09, Christen'12] for introductions, and [Getoor and Machanavajjhala'12] for a tutorial.
In the big data models, we want communication/space-efficient algorithms (o(input size)); we cannot afford a comprehensive de-duplication.
Problem: how can we perform robust statistical estimation communication-efficiently in the k-site model?

Goal: minimize communication & #rounds.

Assume all parties are provided with an oracle (e.g., a distance function and a threshold) determining whether two items u, v are similar, i.e., whether u ∼ v.

We will design a framework so that users can plug in any "distance function" at run time.
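A minimal sketch of what such a plug-in oracle could look like. The interface make_oracle and the edit-distance stand-in below are hypothetical illustrations, not the paper's API; difflib is used only as a cheap dissimilarity proxy.

```python
import difflib

# A hypothetical plug-in oracle: the user supplies any distance function and
# threshold at run time, and the algorithms only ever call same_group(u, v).
def make_oracle(dist, threshold):
    def same_group(u, v):            # decides whether u ~ v
        return dist(u, v) <= threshold
    return same_group

# Example plug-in: a rough edit-distance-like dissimilarity (a stand-in, not
# true edit distance), with threshold 3.
def edit_like(u, v):
    sim = difflib.SequenceMatcher(None, u, v).ratio()
    return (1.0 - sim) * max(len(u), len(v))

same_group = make_oracle(edit_like, 3)
print(same_group("Joe Smith, 800 Mt. Road Springfield",
                 "Joseph Smith, 800 Mt. Road Springfield"))
```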
Remark 1. We do not specify the distance function in our algorithms, for two reasons:
(1) It allows our algorithms to work with any distance function.
(2) It is sometimes very hard to assume that similarities between items can be captured by a well-known distance function: "AT&T Corporation" is closer to "IBM Corporation" than to "AT&T Corp" under the edit distance!
Remark 2. We assume transitivity: if u ∼ v and v ∼ w, then u ∼ w. In other words, the noise is "well-shaped". One may come up with the following problematic situation: we have a ∼ b, b ∼ c, . . . , y ∼ z, and yet a ≁ z. For many specific metric spaces, our algorithms still work if the number of such "outliers" is small.
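To see what transitivity buys: with a transitive ∼, grouping can compare each new item against a single representative per group. A minimal sketch (same_group stands in for the similarity oracle; the function name is illustrative):

```python
# Group items under a transitive similarity oracle: one oracle call per
# existing group suffices, since x ~ representative implies x ~ whole group.
# Without transitivity, this procedure can misgroup items.
def group_items(items, same_group):
    reps, groups = [], []            # one representative per group
    for x in items:
        for i, r in enumerate(reps):
            if same_group(r, x):     # x ~ r, so by transitivity x joins r's group
                groups[i].append(x)
                break
        else:
            reps.append(x)           # x starts a new group
            groups.append([x])
    return groups
```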
Remark 3. Would clustering help? Answer: NO. The number of clusters can be linear in the input size.

Remark 4. Does there exist a magic hash function that (1) maps (only) items in the same group into the same bucket and (2) can be described succinctly? Answer: NO.
For specific metrics, tools such as LSHs may help.
Notation: let S = S1 ∪ · · · ∪ Sk, and let m = |S|. The items of S form a set of groups G = {G1, . . . , Gn}; each group Gi represents a distinct universe element.

Õ(·) hides poly log(m/ǫ) factors.
Results (communication for noisy data vs. lower bounds for noise-free data):

Problem      | Noisy data: comm. (items) | Rounds | Noise-free data: comm. (bits)
F0           | Õ(min{k/ǫ^3, k^2/ǫ^2})    | Õ(1)   | Ω(k/ǫ^2) [WZ12, WZ14]
L0-sampling  | Õ(k)                      | Õ(1)   | Ω(k)
Fp (p ≥ 1)   | Õ((k^(p−1) + k^3)/ǫ^3)    | O(1)   | Ω(k^(p−1)/ǫ^2) [WZ12]
(φ, ǫ)-HH    | Õ(min{k/ǫ, 1/ǫ^2})        | O(1)   | Ω(min{√k/ǫ, 1/ǫ^2}) [HYZ12, WZ12]
Entropy      | Õ(k/ǫ^2)                  | O(1)   | Ω(k/ǫ^2) [WZ12]

Definitions:
– Fp = Σ_{i∈[n]} |Gi|^p; we consider F0 and Fp (p ≥ 1), and allow a (1 + ǫ)-approximation.
– L0-sampling: sample a group uniformly at random from G.
– Entropy: Σ_{i∈[n]} (|Gi|/m) log(m/|Gi|); we allow a (1 + ǫ)-approximation.
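For reference, these statistics are trivial to compute exactly once the group sizes |G1|, . . . , |Gn| are known; the hard part of the talk is approximating them with sublinear communication. A minimal centralized implementation of the definitions above:

```python
import math

# Exact (centralized, noise-resolved) reference implementations of the
# statistics, given the list of group sizes [|G_1|, ..., |G_n|].
def f0(sizes):
    return len(sizes)                                    # F0 = number of groups

def fp(sizes, p):
    return sum(s ** p for s in sizes)                    # F_p = sum_i |G_i|^p

def entropy(sizes):
    m = sum(sizes)                                       # m = |S|
    return sum((s / m) * math.log(m / s) for s in sizes) # sum (|G_i|/m) log(m/|G_i|)
```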
Q: How many distinct elements/groups in the union of the k bags?
Important in traffic monitoring, query optimization, . . . We want a (1 + ǫ)-approximation.
Two algorithms for robust F0:
– Simple: Õ(k^2/ǫ^2) comm., 2 rounds.
– A bit more complicated: Õ(k/ǫ^3) comm., Õ(1) rounds.

The latter is better than Õ(k^2/ǫ^2) bits in the sense that (1) we want to scale in k, and (2) it is used in the algorithm for ℓ0-sampling with ǫ = Θ(1).
Algorithm Simple-Sampling (assuming local de-duplication is done at each site)

Let m = Σ_{i∈[k]} |Si|. For j = 1, . . . , η:
(a) jointly sample a random item uj ∈ S; let G_uj be the group containing uj;
(b) jointly compute |G_uj|, and set Xj = 1/|G_uj|.
Output (m/η) · Σj Xj as the estimate of F0.

Theorem. Simple-Sampling gives a (1 + ǫ)-approximation of F0 with probability 2/3 using Õ(k^2/ǫ^2) bits and 2 rounds.
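A centralized simulation may help fix ideas. The estimator below follows the reconstruction above (Xj = 1/|G_uj|, scaled by m/η); group_of is a hypothetical helper standing in for the rounds of oracle calls, so this sketch shows the statistics, not the communication protocol.

```python
import random

# Centralized simulation of Simple-Sampling (a sketch, under the estimator
# X_j = 1/|G_u| reconstructed above). A uniform item u lands in group G with
# probability |G|/m, so E[X_j] = F0/m and the output is unbiased. In the
# k-site model, sampling u and counting |G_u| are done jointly, at roughly
# k communication per sample, giving O~(k^2/eps^2) for eta ~ k/eps^2.
def simple_sampling(items, group_of, eta):
    m = len(items)                      # m = sum_i |S_i| after local de-dup
    total = 0.0
    for _ in range(eta):
        u = random.choice(items)        # "jointly sample a random item u in S"
        g = group_of(u)                 # resolve u's group (oracle calls)
        size = sum(1 for v in items if group_of(v) == g)  # |G_u|
        total += 1.0 / size             # X_j = 1/|G_u|
    return m * total / eta              # (m/eta) * sum_j X_j
```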
Main idea: reduce the variance of Xj in Simple-Sampling.
– If we could partition all groups in G into classes G0, . . . , G_log k such that Gℓ = {G ∈ G : |G| ∈ (2^(ℓ−1), 2^ℓ]}, and run Simple-Sampling on each class individually, we could shave a factor of k off the number of samples Xj needed (η: k/ǫ^2 → 1/ǫ^2).
– However, we cannot afford to partition the groups into classes in the distributed setting.

Our techniques: local hierarchical partition + distributed rejection sampling.
Our techniques in detail:

Local hierarchical partition: at site i, place about |Si|/2^ℓ items at level ℓ (levels 1, . . . , log k).
[Figure: each site's items partitioned into levels log k, log k − 1, . . . , 1 at Site 1, Site 2, . . . , Site k]

For a group G with representatives e1, e2, . . . , ek ∈ G across the sites, define level(G) = max_i level(ei).
Note: level(G) ≠ class(G), but they are close :)

This creates an inconsistency: we may have u ∼ v, yet u and v are sampled at different levels at different sites.

Distributed rejection sampling resolves the inconsistency: the k sites jointly sample items as before, but only for those items e with level(e) = level(Ge) (how?) compute 1/w(Ge) as Xj.

Repeat until we get Õ(1/ǫ^2) Xj's for each level of groups, and then run the estimation of Simple-Sampling for each level.
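The following sketch shows the two ingredients in isolation, under one reading of the slide: a geometric level assignment so that about |Si|/2^ℓ items land at level ℓ, plus the accept test of the rejection step. The function names are illustrative and the paper's actual partition may differ.

```python
import random

# A minimal sketch of the two techniques (one possible reading of the slide,
# not the paper's exact protocol).
def assign_level(max_level):
    # Geometric levels: P(level = l) ~ 1/2^l, truncated at max_level = log k,
    # so site i places about |S_i|/2^l of its items at level l.
    l = 1
    while l < max_level and random.random() < 0.5:
        l += 1
    return l

def group_level(rep_levels):
    # level(G) = max_i level(e_i) over the group's representatives e_1, ..., e_k
    return max(rep_levels)

def accept(item_level, grp_level):
    # Distributed rejection sampling: keep a sampled item e only when
    # level(e) = level(G_e), making samples consistent across sites.
    return item_level == grp_level
```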
Other problems:
– L0-sampling: Õ(k) communication and Õ(1) rounds. Use the algorithm for F0 as a subroutine.
– Fp (p ≥ 1): Õ((k^(p−1) + k^3)/ǫ^3) comm. and Õ(1) rounds. Adapt an algorithm by Kannan, Vempala and Woodruff (COLT 2014).
– (φ, ǫ)-heavy hitters: Õ(min{k/ǫ, 1/ǫ^2}) comm. and O(1) rounds. Easy.
– Entropy: Õ(k/ǫ^2) comm. and O(1) rounds. Adapt a streaming algorithm by Chakrabarti, Cormode and McGregor (SODA 2007).
[Figure: the streaming model (CPU with limited RAM)]
Q: Can we adapt the algorithms for the k-site model to the streaming model?
– Simple-Sampling needs to revisit the data (2 rounds);
– the advanced sampling algorithm needs even more rounds.

Not sure whether we can do it for general metric spaces. We can for some specific metric spaces: for example, for O(1)-dimensional Euclidean space and well-shaped datasets, there is a streaming algorithm using Õ(1/ǫ^2) space (Chen and Zhang, 2016).
Robust distinct elements (F0) in the streaming model: given a threshold α, partition the items of the input set S into a minimum number of groups G = {G1, . . . , Gn} such that ∀p, q ∈ Gi, d(p, q) ≤ α. Items are mapped into points in the Euclidean space.
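For comparison, the Θ(n)-space greedy baseline used in the experiments below is easy to state. Here is a sketch under the Euclidean setting above; the function name and the tuple-of-coordinates representation are illustrative.

```python
import math

# A sketch of the greedy baseline (Theta(n) space): keep one center per group
# opened so far, and open a new group whenever a point is farther than alpha
# from every existing center. Storing all centers is exactly what the
# O~(1/eps^2)-space sketch is designed to avoid.
def greedy_f0(stream, alpha):
    centers = []
    for p in stream:                 # p is a tuple of coordinates
        if all(math.dist(p, c) > alpha for c in centers):
            centers.append(p)        # p opens a new group
    return len(centers)              # number of groups = robust F0 estimate
```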
Experiments (dataset: I500k100x5d):
– Baseline (greedy algo.): Θ(n) space.
– Sketch (our algo.): Õ(1/ǫ^2) space.
– CellCount (streaming comparison): Õ(1/ǫ^2) space.
28-4
– Can we get the optimal upper bound ˜ O(k/ǫ2) for F0? – Can we remove the k3 factor in the communication cost for Fp?
Streaming model
(Now we can only do for some specific metrics use LSHs)
k-site model
References:
– Communication-Efficient Computation on Distributed Noisy Datasets. Zhang, SPAA 2015.
– Streaming Algorithms for Robust Distinct Elements. Chen and Zhang, SIGMOD 2016.