SLIDE 1

Efficient Algorithms for Streaming Datasets with Near-Duplicates

Theory and Applications of Hashing, May 4, 2017

Qin Zhang, Indiana University Bloomington

Based on work with: Djamal Belazzougui (CERIST), Di Chen (HKUST), Jiecao Chen (IUB), Haoyu Zhang (IUB)

SLIDE 2

Disclaimer

Not really a survey talk; results are all very recent, and solutions may be quite premature.

Agenda

  • 1. Background and motivation
  • 2. Distinct elements on data with near-duplicates
  • 3. Similarity join under edit distance

SLIDE 3

The Streaming Model

Model of computation:
– high-speed online data
– want space/time-efficient algorithms

[Figure: a stream 1 7 9 1 7 3 2 feeding a CPU with a small RAM]

E.g., what is the number of distinct elements?
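
Not from the slides: a minimal sketch of one classical answer to this question, a k-minimum-values (KMV) estimator for the number of distinct elements; all names and parameters here are my own illustration, not the talk's algorithm.

```python
import heapq
import random

class KMV:
    """k-minimum-values F0 estimator (illustrative sketch).

    Hash each item to a pseudo-random value in (0, 1) and keep the k
    smallest values seen; if the k-th smallest is v, estimate F0 as (k-1)/v.
    """
    def __init__(self, k=256, seed=42):
        self.k = k
        self.salt = random.Random(seed).random()
        self.heap = []       # max-heap via negation: the k smallest hash values
        self.kept = set()    # hash values currently kept

    def _hash(self, item):
        # Deterministic pseudo-random map of the item into (0, 1).
        return (hash((self.salt, item)) % (2**61 - 1) + 1) / 2**61

    def update(self, item):
        h = self._hash(item)
        if h in self.kept:                   # duplicate item: no effect
            return
        if len(self.heap) < self.k:
            heapq.heappush(self.heap, -h)
            self.kept.add(h)
        elif h < -self.heap[0]:              # beats the current k-th smallest
            evicted = -heapq.heappushpop(self.heap, -h)
            self.kept.discard(evicted)
            self.kept.add(h)

    def estimate(self):
        if len(self.heap) < self.k:
            return len(self.heap)            # fewer than k distinct items seen
        return (self.k - 1) / (-self.heap[0])

kmv = KMV()
for x in [1, 7, 9, 1, 7, 3, 2] * 1000 + list(range(5000)):
    kmv.update(x)
print(kmv.estimate())   # close to the true F0 of 5000
```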

SLIDE 4

Linear sketches

Problem: given a data vector x ∈ R^d, compute f(x). We can do this using linear sketches: recover g(Mx) ≈ f(x).

[Figure: Mx = M · x, where M is a linear mapping (sometimes embedding a hash function) and Mx is the sketching vector]

Simple and useful: used extensively in streaming/distributed algorithms, compressive sensing, . . .

SLIDE 5

Linear sketches in the streaming model

View each incoming element i as updating x ← x + ei. The sketching vector can be updated incrementally: M(x + ei) = Mx + Mei = Mx + Mi, where Mi is the i-th column of M.

space = size of the sketch Mx
time ≤ space (usually)

[Figure: the stream 1 7 9 1 7 3 2 updating the sketch Mx kept in RAM]
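
To make the update rule concrete, here is a minimal AMS-style linear sketch for F2 (an assumed illustration, not from the slides): the sketch is Mx for a random ±1 matrix M, and the stream update x ← x + ei becomes one column addition.

```python
import random
import statistics

class AMSSketchF2:
    """AMS linear sketch for F2 = sum_i x_i^2 (illustrative).

    Row j stores z_j = sum_i s_j(i) * x_i for random signs s_j(i) in {-1, +1},
    i.e. the sketch is Mx for a random +/-1 matrix M.  By linearity, the
    update x <- x + e_i is z_j <- z_j + s_j(i): time = O(rows) <= space.
    """
    def __init__(self, rows=64, seed=0):
        self.rows = rows
        self.seed = seed
        self.z = [0.0] * rows

    def _sign(self, j, i):
        # In theory 4-wise independent signs suffice; plain hashing is fine here.
        return 1 if hash((self.seed, j, i)) & 1 else -1

    def update(self, i, delta=1):
        # M(x + delta * e_i) = Mx + delta * M_i (the i-th column of M).
        for j in range(self.rows):
            self.z[j] += delta * self._sign(j, i)

    def estimate_f2(self):
        return statistics.median(zj * zj for zj in self.z)

sk = AMSSketchF2()
for item in [1, 7, 9, 1, 7, 3, 2]:
    sk.update(item)
print(sk.estimate_f2())   # approximates the true F2 = 4 + 4 + 1 + 1 + 1 = 11
```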

SLIDE 6

Real-world data is often noisy

music, images, videos... after compression, resizing, Photoshop, etc.

Queries with the same meaning sent to Google: “theory and applications of hashing”, “theory application of hash”, “dagstuhl hashing”, “dagstuhl seminar hash”

SLIDE 7

Robust streaming algorithms

We have to consider near-duplicates as one element. Then how do we compute f(x)?

SLIDE 8

Linear sketches do not work

Why? Items representing the same entity may be hashed into different coordinates of the sketching vector.

SLIDE 9

Magic hash functions?

Does there exist a magic hash function that can (1) map only items representing the same element into the same bucket, and (2) be described succinctly?

Answer: (in general) no. Some hash functions may help (will discuss later).

SLIDE 10

History and the New Question

Related to Entity Resolution: identify and group different manifestations of the same real-world object.

A key problem in data cleaning/integration; studied for 40+ years in DB, and also in AI and NT.

Previous solutions use at least linear space: they detect items representing the same entity and output all distinct entities.

Question: can we analyze data with near-duplicates in the streaming model space/time-efficiently?

SLIDE 11

Distinct Elements

  • Data: points in a metric space
  • Problem: compute the number of robust distinct elements (robust F0)

(Useful in: traffic monitoring, query optimization, . . .)

Robust F0: given a threshold α, partition the input item set S into a minimum-cardinality set of groups G = {G1, . . . , Gn} such that ∀p, q ∈ Gi, d(p, q) ≤ α; the robust F0 is the number of groups n.

– Chen, Z., SIGMOD 2016 (will discuss today)
– Chen, Z., ???? (extends to sliding windows and ℓ0-sampling)

SLIDE 12

Well-shaped dataset

(α, β)-sparse dataset: pairs of items in the same group have distance at most α; pairs of items in different groups have distance at least β.

If the separation ratio β/α > 2, we call the dataset well-shaped. A natural partition exists for a well-shaped dataset. (Will talk about general datasets later.)
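
Why the partition is natural when β/α > 2: a greedy grouping by representatives recovers it exactly. A minimal sketch of this Θ(n)-space baseline (my own reconstruction; the function and variable names are hypothetical):

```python
def greedy_partition(points, alpha, dist):
    """Put each point into the first group whose representative is within
    alpha; otherwise open a new group.

    On an (alpha, beta)-sparse dataset with beta > 2*alpha this recovers the
    natural partition: same-group points are within alpha of the group's
    representative, while different-group points are at distance >= beta >
    alpha and therefore never merge.  Linear space: not a streaming algorithm.
    """
    reps, groups = [], []
    for p in points:
        for rep, grp in zip(reps, groups):
            if dist(p, rep) <= alpha:
                grp.append(p)
                break
        else:
            reps.append(p)
            groups.append([p])
    return groups

# 1D example with alpha = 1, beta = 10 (so beta/alpha > 2): two groups.
pts = [0.0, 0.5, 0.9, 20.0, 20.4]
print(len(greedy_partition(pts, 1.0, lambda a, b: abs(a - b))))   # -> 2
```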

SLIDE 13

Algorithm for (α, β) (β > 2α) well-shaped datasets in 2D

[Figure: groups G1, G2, G3 on a random grid G of side length α/2]

SLIDE 14

Simple sampling (needs two passes)

Algorithm Simple Sampling:
1. Sample η = Õ(1/ε²) non-empty cells; call the sample C.
2. In a second pass, compute for each sampled cell C the weight w(C) = 1/w(GC), where GC is the (only) group intersecting C, and w(GC) is the number of cells GC intersects.
3. Output (z/η) · Σ_{C∈C} w(C), where z is the number of non-empty cells in G.

Gives a (1 + ε)-approximation of robust F0 using Õ(1/ε²) bits of space and 2 passes.
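
A sketch of Simple Sampling for 2D points, under the simplifying assumption that each point's group id is given (the real algorithm infers a cell's group from its neighborhood); all helper names are hypothetical:

```python
import math
import random
from collections import defaultdict

def cell_of(p, side, shift):
    """Cell of point p in a randomly shifted grid with the given side length."""
    return (math.floor((p[0] + shift[0]) / side),
            math.floor((p[1] + shift[1]) / side))

def simple_sampling_f0(points, group_ids, alpha, eta=200, seed=1):
    rng = random.Random(seed)
    side = alpha / 2
    shift = (rng.uniform(0, side), rng.uniform(0, side))

    # Pass 1: collect the non-empty cells and sample eta of them.
    nonempty = {cell_of(p, side, shift) for p in points}
    z = len(nonempty)
    sample = rng.sample(sorted(nonempty), min(eta, z))

    # Pass 2: w(C) = 1 / (#cells touched by the group intersecting C).  With
    # side alpha/2 and beta > 2*alpha, each cell meets exactly one group.
    cells_of_group = defaultdict(set)
    group_of_cell = {}
    for p, g in zip(points, group_ids):
        c = cell_of(p, side, shift)
        cells_of_group[g].add(c)
        group_of_cell[c] = g
    total = sum(1 / len(cells_of_group[group_of_cell[c]]) for c in sample)

    # Each group contributes w(G) cells of weight 1/w(G) each, so the sum over
    # all non-empty cells equals the number of groups; rescale the sample.
    return (z / len(sample)) * total
```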

SLIDE 15

Bucket sampling

  • Cannot sample cells early: most sampled cells will be empty and thus useless for the estimation.
  • Cannot sample late: cannot obtain the “neighborhood” information needed to compute w(C) for a sampled cell C.

What to do? We sample a collection of cells implicitly, but only maintain the neighborhood information for “non-empty” sampled cells.

Maintain the collection using a hash function h: the sample is all cells C with h(C) = 1. Maintain h s.t. |{C | h(C) = 1 ∧ ∃p ∈ S, d(p, C) ≤ α}| = O(1/ε²).
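
A rough sketch of the implicit-sampling idea (my own simplification, not the paper's exact algorithm): cells are “sampled” by a hash at a geometrically decreasing rate, and we keep one point per non-empty cell adjacent to a sampled cell; when too much is stored, the sampling rate is halved.

```python
import random
from collections import defaultdict

class BucketSampler:
    """One-pass bucket sampling sketch (illustrative; names are hypothetical).

    Cell C is in the sample iff hash(C) == 0 mod 2**level, i.e. h(C) = 1 with
    probability 2**-level, and the samples are nested across levels.  For each
    sampled cell we keep one point per non-empty neighboring cell (a one-ring
    stand-in for 'within distance alpha'): the info needed to compute w(C).
    """
    def __init__(self, side, budget=400, seed=0):
        self.side, self.budget, self.seed = side, budget, seed
        self.level = 0
        self.kept = defaultdict(dict)   # sampled cell -> {neighbor cell: one point}

    def _cell(self, p):
        return (int(p[0] // self.side), int(p[1] // self.side))

    def _sampled(self, cell):
        return hash((self.seed, cell)) % (2 ** self.level) == 0

    def update(self, p):
        c = self._cell(p)
        # p contributes neighborhood info to every sampled cell adjacent to c.
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                s = (c[0] + dx, c[1] + dy)
                if self._sampled(s):
                    self.kept[s].setdefault(c, p)
        # Too many cells touched: subsample harder and drop what leaves.
        while len(self.kept) > self.budget:
            self.level += 1
            self.kept = defaultdict(dict, {cell: nbrs
                                           for cell, nbrs in self.kept.items()
                                           if self._sampled(cell)})
```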

SLIDE 16

Bucket sampling (cont.)

[Figure: groups G1, G2, G3 on the grid; a sampled cell stores one point of each non-empty neighboring cell, used to compute the weight of the sampled cell]

For a well-shaped dataset, we can get a (1 + ε)-approximation of robust F0 using Õ(1/ε²) bits of space and Õ(1) time per item.

SLIDE 17

General datasets

For general datasets, we introduce F0-ambiguity: the F0-ambiguity of S is the minimum δ s.t. there exists T ⊆ S such that
  • S\T is well-shaped
  • F0(S\T) ≥ (1 − δ)F0(S)

Unfortunately, approximating δ is hard: we cannot differentiate between δ = 0 and δ = 1/2 without Ω(m) space, by a reduction from the Diameter problem.

However, we can still guarantee the following even without knowing the value of δ: for a dataset with F0-ambiguity δ, we can get a (1 + O(ε + δ))-approximation of robust F0 using Õ(1/ε²) bits.

SLIDE 18

Generalization

We say h is ρ-smart on a well-shaped dataset S and its natural group partition if it satisfies:

  • Small “image radius”: each group is adjacent to (at most) ρ hash buckets on average.
    – We say a group G is adjacent to a hash bucket B if ∃p, q ∈ S s.t. p ∈ G, h(q) = B and d(p, q) ≤ α.
  • No false positives: items from different groups are hashed into disjoint buckets.

This is all we really need in the analysis of Random Grid + 2D Euclidean space.

SLIDE 19

Locality sensitive hashing (LSH)

We say a hash family H is (ℓ, u, p1, p2)-sensitive if for any two items p, q:
1. if d(p, q) ≤ ℓ, then Pr_{h∈H}[h(p) = h(q)] ≥ p1;
2. if d(p, q) ≥ u, then Pr_{h∈H}[h(p) = h(q)] ≤ p2.

A hash function h is called η-concentrated on a (well-shaped) S if for any G ∈ G, |{h(x) | ∃y ∈ G s.t. d(x, y) ≤ α}| ≤ η.
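
A concrete instance of the definition (an assumed example, not from the slides): the random-hyperplane (SimHash) family for angular distance; its collision probability 1 − θ/π matches the sensitivity claimed for Random Projection LSH on the next slide.

```python
import math
import random

def simhash_family(dim, seed):
    """One random-hyperplane hash: h(x) = sign(<r, x>) with Gaussian r.

    For vectors at angle theta, Pr[h(p) = h(q)] = 1 - theta/pi, so the family
    is (l, u, 1 - l/pi, 1 - u/pi)-sensitive w.r.t. angular distance.
    """
    rng = random.Random(seed)
    r = [rng.gauss(0, 1) for _ in range(dim)]
    return lambda x: 1 if sum(ri * xi for ri, xi in zip(r, x)) >= 0 else 0

# Empirical check of the collision probability at angle 0.3 rad.
p, q = [1.0, 0.0], [math.cos(0.3), math.sin(0.3)]
hits = sum(simhash_family(2, s)(p) == simhash_family(2, s)(q)
           for s in range(20000))
print(hits / 20000, 1 - 0.3 / math.pi)   # both close to 0.905
```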

SLIDE 20

The connections

S: an (α, β)-sparse (β > 2α) dataset, |S| = m.
H: a (2α, β, p1, p2)-sensitive LSH family that is η-concentrated on S.
F: the k-fold hash family of H; let f be a random member of F.
Then f is 100(η(1 − p1) + p1)k-smart on S w.pr. 0.99 − m²p2^k.

Gaussian LSH for Euclidean distance: (α, β, p(α), p(β))-sensitive and O(1)-concentrated ⇒ O(1)-smart when β/α ≥ log m.
Random Projection LSH for Cosine distance: (α, β, 1 − α/π, 1 − β/π)-sensitive and O(1)-concentrated ⇒ O(1)-smart when α ≤ 1/log m and Ω(1) ≤ β < π.

Not every LSH can be made ρ-smart, e.g., Min-Hash.
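
For completeness, the k-fold family used above is plain concatenation; a two-line sketch (assuming the simhash_family helper from the previous example):

```python
def k_fold(make_h, k, seed=0):
    """f(x) = (h1(x), ..., hk(x)) with independent draws from the base family:
    near pairs still collide w.pr. >= p1**k, far pairs only w.pr. <= p2**k."""
    hs = [make_h(1000 * seed + i) for i in range(k)]
    return lambda x: tuple(h(x) for h in hs)

f = k_fold(lambda s: simhash_family(2, s), k=10)
```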

SLIDE 21

Experiments

Dataset: 4,000,000 images from ImageNet. Experiments run on a desktop PC with 8GB of RAM and a 4-core 3.40GHz Intel i7 CPU.

I500k100x5d means the dataset consists of
– 500k images,
– each with 100 near-duplicates on average,
– mapped into points in 5-dim Euclidean space (the feature space).

SLIDE 22

Correctness (known α)

[Figure: accuracy results when α is known]

SLIDE 23

Correctness (unknown α)

Dataset: I500k100x5d. Compared algorithms:
– Baseline (greedy algorithm): Θ(n) space
– Sketch (our algorithm): Õ(1/ε²) space
– CellCount (streaming algorithm for comparison): Õ(1/ε²) space

[Figure: accuracy results when α is unknown]

SLIDE 24

Running time

[Figure: running-time results]

SLIDE 25

What if a metric X has no good LSH, e.g., edit distance?

Idea: embed X into another metric which has a good LSH.

SLIDE 26

Edit Similarity Join

  • Problem: given strings s1, . . . , sn over an alphabet Σ and a threshold K, edit similarity (self-)join outputs all pairs (si, sj) s.t. ED(si, sj) ≤ K.
  • A central problem in databases; studied extensively in the literature.

– Zhang, Z., KDD 2017

SLIDE 27

Previous work

Most existing approaches are signature-based, using different filtering methods. They fall short on long strings and relatively large thresholds.

In a recent string similarity search/join competition (SIGMOD Record 2014), it was reported that “an error rate (K/N) of 20% ∼ 25% pushes today’s techniques to the limit”.

In our experiments, the previous best algorithms could not finish within 10 hours on a collection of 20,000 DNA sequences, each of length 20,000, at a 1% error rate.

However, long strings and large thresholds are critical to applications in bioinformatics; the edit distances of human DNAs are mostly in the range of 1% ∼ 10%.

SLIDE 28

Our work

We propose an algorithm called EmbedJoin. EmbedJoin scales well up to an error threshold of 20%, which is far beyond the reach of existing algorithms.

SLIDE 29

Our approach

We first try to embed edit distance into some “easier” space for which an LSH exists. Previously, the best embedding from ED to ℓ1 incurred a distortion of 2^{O(√(log n log log n))} (Ostrovsky, Rabani; JACM’07). Recently, Chakraborty, Goldenberg and Koucky (STOC’16) gave a (weak) embedding from ED to Hamming with distortion O(K); call this the CGK embedding.

We basically do: CGK + LSH (for Hamming). This finds a set of candidate pairs (i, j) with ED(si, sj) ≤ K in a streaming fashion. A further verification step removes all false positives.

SLIDE 30

Our main tool – the CGK embedding

The CGK embedding is parameterized by a random string r ∈ {0, 1}^{6n} and maps f : s ∈ {0, 1}^n → s′ ∈ {0, 1}^{3n}. Two counters i and j are both initialized to 1. For j = 1, 2, . . .:
1. s′[j] ← s[i].
2. If r[(2j − 1) + s[i]] = 1, then i ← i + 1. Stop when i = n + 1.
3. j ← j + 1.

[Figure: s being copied to s′, with the random bits of r deciding when the pointer i advances]

Property: if ed(s, t) = k, then k/2 ≤ HAM(f(s), f(t)) ≤ O(k²) w.pr. 0.99.
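
A direct transcription of the embedding into code (binary alphabet; the zero-padding once i passes n is my assumption about the output-length convention):

```python
import random

def cgk_embed(s, r):
    """CGK embedding of a 0/1 list s, following the steps on the slide:
    copy s[i] to the output and advance i iff the random bit
    r[(2j - 1) + s[i]] is 1 (the slide's indices are 1-based)."""
    n = len(s)
    out = []
    i = 0                                    # 0-based version of the slide's i = 1
    for j in range(1, 3 * n + 1):
        if i < n:
            out.append(s[i])
            if r[(2 * j - 1) + s[i] - 1]:    # -1 converts the 1-based r-index
                i += 1
        else:
            out.append(0)                    # padding convention (assumption)
    return out

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

rng = random.Random(0)
s = [rng.randint(0, 1) for _ in range(200)]
t = list(s); t[50] = 1 - t[50]; del t[120]; t.append(rng.randint(0, 1))  # ed(s,t) <= 3
r = [rng.randint(0, 1) for _ in range(6 * len(s))]
print(hamming(cgk_embed(s, r), cgk_embed(t, r)))   # small, as the Property predicts
```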

SLIDE 31

CGK as a random walk

Run the CGK embedding on two strings s and t with the same random string r; let p and q be the pointers into s and t, respectively.

[Figure: CGK applied to s and t in parallel, producing s′ and t′]

The shift (p − q) is a random walk on the line.

SLIDE 32

CGK as a random walk (cont.)

[Figure: the optimal matching (blue) between s and t; the current shift is p − q = 2]

The blue area belongs to the optimal matching. The shift is p − q = 2. If the random walk w.r.t. the shift goes left by 2 steps, (p, q) will hit one of the blue (matching) edges, and the embedding will be “synchronized” afterwards (i.e., the bits we write to s′ and t′ will be the same at every step).

SLIDE 33

Observations

If ed(s, t) = k, then k/2 ≤ HAM(f(s), f(t)) ≤ O(k²) w.pr. 0.99.

The upper bound O(k²) comes from the fact that a simple random walk on the integer line, starting from the origin, hits position k within O(k²) steps w.pr. 0.99.

Observation 1: as long as the gap is preserved, a slightly larger distortion is not a problem for the LSH step.

Observation 2: running CGK multiple times and taking the min helps in practice.
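
Putting the pieces together, a toy EmbedJoin-style pipeline (a simplification with assumed parameters, not the paper's algorithm): repeat CGK several times per Observation 2, bucket the embeddings by a bit-sampling LSH for Hamming distance, then verify candidates exactly. Assumes the cgk_embed sketch above and binary strings padded to a common length.

```python
import random
from itertools import combinations

def edit_distance(a, b):
    """Standard O(|a||b|) dynamic program; used only for final verification."""
    m, n = len(a), len(b)
    d = list(range(n + 1))
    for i in range(1, m + 1):
        prev, d[0] = d[0], i
        for j in range(1, n + 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1,
                                   prev + (a[i - 1] != b[j - 1]))
    return d[n]

def embed_join(strings, K, reps=8, lsh_bits=24, seed=0):
    rng = random.Random(seed)
    n = max(len(s) for s in strings)
    padded = [s + [0] * (n - len(s)) for s in strings]    # simplifying assumption
    candidates = set()
    for _ in range(reps):                                  # Observation 2
        r = [rng.randint(0, 1) for _ in range(6 * n)]
        positions = rng.sample(range(3 * n), lsh_bits)     # bit-sampling LSH
        buckets = {}
        for idx, s in enumerate(padded):
            emb = cgk_embed(s, r)
            buckets.setdefault(tuple(emb[p] for p in positions), []).append(idx)
        for group in buckets.values():
            candidates.update(combinations(group, 2))
    # Verification removes all false positives.
    return sorted((i, j) for i, j in candidates
                  if edit_distance(strings[i], strings[j]) <= K)
```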

SLIDE 34

Experimental results – datasets

  • UNIREF: a dataset of UniRef90 protein sequence data from the UniProt project
  • TREC: a dataset of references from Medline, consisting of titles and abstracts from 270 medical journals
  • GENXXX: datasets of human genome sequences of 50 individuals, obtained from the 1000 Genomes Project

SLIDE 35

Experimental results – tested algorithms

We compare our algorithm with the previous best algorithms reported in a recent experimental study [Jiang et al., VLDB’14]:

  • 1. EmbedJoin
  • 2. PassJoin [Li, Deng, Wang, Feng; PVLDB’11]: partition-based, uses the pigeonhole principle
  • 3. EDJoin [Xiao, Wang, Lin; PVLDB’08]: signature (q-gram) based, prefix filtering
  • 4. AdaptJoin [Wang, Li, Feng; SIGMOD’12]: improves prefix filtering by learning the tradeoff between the number of signatures and the filtering power
  • 5. QChunk [Qin et al.; SIGMOD’11]: improves prefix filtering by using q-chunks in place of q-grams

SLIDE 36

Accuracy of EmbedJoin

[Figure: accuracy results]

SLIDE 37

Running time comparisons

[Figure: four panels – UNIREF, vary K; UNIREF, vary n; GEN50kS, vary K; GEN50kS, vary n]

SLIDE 38

Space usage comparisons

[Figure: four panels – UNIREF, vary K; UNIREF, vary n; GEN50kS, vary K; GEN50kS, vary n]

SLIDE 39

The scalability of EmbedJoin

[Figure: three panels – GEN20kS, vary K/n; GEN20kL, vary K/n; GEN320kS, vary K/n]

SLIDE 40

What if we want to find all pairs (x, y) with ED(x, y) ≤ K exactly? Embedding + LSH does not work.

Solution: use sketches. (In streaming, we have to store a sketch for each string.)

SLIDE 41

Sketching Edit Distance

[Figure: x → sk(x), y → sk(y); applications to distributed similarity join and to streaming]

Our sketches can be constructed by one scan of the input.

– Belazzougui, Z., FOCS 2016

SLIDE 42

Previous and our results

The first sketching/streaming algorithm with poly(K, log n) size/space (more precisely, O(K⁸ log⁵ n)). The sketch can be constructed in Õ(n) time assuming K ≤ n^{0.1}. Given sk(x) and sk(y), we can compute the at most K edits transforming x into y in poly(K, log n) time.

Previously: an Ω(n) lower bound for linear sketches (Andoni, Goldberger, McGregor, Porat; STOC 2013).

SLIDE 43

The idea

We can view an alignment A between s and t as a non-crossing bipartite matching.

[Figure: matching edges between s and t; consecutive parallel edges form clusters]

A can be compressed by writing down all singletons and the starting/ending edges of each cluster; denote this by sk(A). Note: the size of sk(OPT) is only O(k log n).

Main idea: given alignments A1, . . . , Aρ, let I = ∩_{j∈[ρ]} Aj. If there exists an optimal alignment that goes through all edges in I, then we can obtain an optimal alignment using sk(A1), . . . , sk(Aρ).
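
A small sketch of the compression step, under my reading of the figure that a “cluster” is a maximal run of consecutive parallel edges (helper names are hypothetical):

```python
def sketch_alignment(edges):
    """Compress a non-crossing matching given as sorted (u, v) pairs: keep each
    cluster's first and last edge; a singleton cluster is kept as one edge."""
    sk, start = [], 0
    for idx in range(1, len(edges) + 1):
        run_ends = (idx == len(edges)
                    or edges[idx][0] != edges[idx - 1][0] + 1
                    or edges[idx][1] != edges[idx - 1][1] + 1)
        if run_ends:
            first, last = edges[start], edges[idx - 1]
            sk.append((first, last) if first != last else (first,))
            start = idx
    return sk

# Matching for s = "abXcdef" vs t = "abcdefY": the shift changes after the edit,
# so there are two clusters and sk keeps 4 edges however long the strings are.
edges = [(0, 0), (1, 1), (3, 2), (4, 3), (5, 4), (6, 5)]
print(sketch_alignment(edges))   # [((0, 0), (1, 1)), ((3, 2), (6, 5))]
```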

SLIDE 44

The idea (cont.)

The CGK embedding naturally gives an alignment.

[Figure: CGK run on s and t with pointers p and q]

The random walk state sequence ((p1, q1), (p2, q2), . . .) contains an alignment A. A can be constructed greedily, and sk(A) has size poly(K, log n).

Key lemma: if we take ρ = poly(K, log n) random walks, giving alignments A1, . . . , Aρ, then there is an optimal alignment that contains I = ∩_{j∈[ρ]} Aj.

SLIDE 45

The idea (cont.)

sk(Ai) corresponds to the differences between s′ and t′ in the Hamming space, which we know how to find. Additional structures are needed for the reverse mapping (Hamming space → edit space) to recover all the edits.

SLIDE 46

The idea (cont.)

Key lemma: if we take ρ = poly(K, log n) random walks, giving alignments A1, . . . , Aρ, then there is an optimal alignment that contains I = ∩_{j∈[ρ]} Aj.

  • Anchor: given ρ random walks generated according to the CGK embedding, we say a pair (u, v) is an anchor if s[u] = t[v] and (u, v) ∈ I.

Claim: w.pr. 1 − 1/n², there is an optimal alignment going through all anchors.

Proof idea: we focus on a “greedy” optimal matching O. Suppose, for contradiction, that O does not pass through an anchor (u, v). Then we can find a matching M in the left neighborhood of (u, v) which may “mislead” a random walk; that is, with non-trivial probability the random walk will “follow” M and consequently miss (u, v).

SLIDE 47

Conclusion and open problems

The motivation of this line of work: can we process noisy data (with near-duplicates) in the streaming model space/time-efficiently?

Linear sketches do not work; we propose a framework using LSHs with special properties.

For metrics that do not have good LSHs, embedding is a possible solution, but it introduces distortion.

For exact thresholds, we may need sketches.

Many open problems. General question: what can we do in the streaming model for a general metric, given a threshold for near-duplicates? A similar question can be asked for the distributed model.

slide-84
SLIDE 84

48-1

Thank you! Questions?