Stream Algorithmics
Albert Bifet
March 2012

Data Streams Big Data & Real Time

Data Streams
◮ Sequence is potentially infinite
◮ High amount of data: sublinear space
◮ High speed of arrival: sublinear time per example
◮ Once an element from a data stream has been processed, it is discarded or archived

Data Stream Algorithmics Example
Puzzle: Finding Missing Numbers
◮ Let $\pi$ be a permutation of $\{1, \ldots, n\}$.
◮ Let $\pi_{-1}$ be $\pi$ with one element missing.
◮ $\pi_{-1}[i]$ arrives in increasing order of $i$.
Task: Determine the missing number.
Naive solution: use an $n$-bit vector to memorize all the numbers ($O(n)$ space).
Data streams solution: $O(\log(n))$ space. Store
$\frac{n(n+1)}{2} - \sum_{j \le i} \pi_{-1}[j]$.
Once the last element has arrived, this running value is the missing number.
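The running-sum solution to the puzzle can be sketched in a few lines of Python. This is only an illustration: the function name `find_missing` and the sample stream are made up for the example, and the stream is assumed to contain every number of {1, ..., n} except one.

```python
def find_missing(stream, n):
    """Return the one element of {1, ..., n} that never appears in the stream."""
    # Start from the sum of the full permutation, n(n+1)/2.
    remaining = n * (n + 1) // 2
    # Subtract each arriving element; only one number is kept in memory.
    for value in stream:
        remaining -= value
    return remaining


# Hypothetical stream: the numbers 1..6 with 4 missing.
print(find_missing(iter([1, 2, 3, 5, 6]), 6))  # 4
```

The stored value never exceeds n(n+1)/2, so it fits in O(log n) bits.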

Data Streams: Approximation algorithms
◮ Small error rate with high probability
◮ An algorithm $(\epsilon, \delta)$-approximates $F$ if it outputs $\tilde{F}$ for which $\Pr[\,|\tilde{F} - F| > \epsilon F\,] < \delta$.

Data Stream Algorithmics Examples
1. Compute the number of distinct pairs of IP addresses seen in a router
2. Compute the top-k most used words in tweets
Two problems: find the number of distinct items and find the most frequent items.

8 Bits Counter
1 0 1 0 1 0 1 0
What is the largest number we can store in 8 bits?

8 Bits Counter
[Plots: $f(x) = \log(1 + x)/\log(2)$ and $f(x) = \log(1 + x/30)/\log(1 + 1/30)$, each drawn for $x$ in $[0, 10]$ and in $[0, 100]$. Both functions satisfy $f(0) = 0$ and $f(1) = 1$.]

8 Bits Counter
MORRIS APPROXIMATE COUNTING ALGORITHM
Init counter c ← 0
for every event in the stream
    do rand = random number between 0 and 1
       if rand < p
           then c ← c + 1
What is the largest number we can store in 8 bits?
◮ With $p = 1/2$ we can store $2 \times 256$, with standard deviation $\sigma = \sqrt{n}/2$.
◮ With $p = 2^{-c}$, $E[2^c] = n + 2$ with variance $\sigma^2 = n(n+1)/2$.
◮ If $p = b^{-c}$, then $E[b^c] = n(b-1) + b$ and $\sigma^2 = (b-1)\,n(n+1)/2$.
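A minimal Python sketch of the Morris counter with $p = b^{-c}$ follows. The names `morris_count` and `morris_estimate` are invented for this example. One assumption needs stating: the counter here starts at c = 0, as in the pseudocode, which gives $E[b^c] = n(b-1) + 1$; the expectations quoted on the slide ($n + 2$ for $b = 2$) correspond to a counter that starts at 1. The estimate below inverts the c = 0 form.

```python
import random


def morris_count(n_events, base=2.0, rng=None):
    """Process n_events and return the small counter c (started at 0)."""
    rng = rng or random.Random()
    c = 0
    for _ in range(n_events):
        # Increment with probability base**(-c), so c grows like log_base(n).
        if rng.random() < base ** (-c):
            c += 1
    return c


def morris_estimate(c, base=2.0):
    """Unbiased estimate of n for a counter started at 0: (base**c - 1)/(base - 1)."""
    return (base ** c - 1) / (base - 1)


# Average many independent counters to see the estimate concentrate near n.
rng = random.Random(42)  # fixed seed, chosen arbitrarily for repeatability
runs = [morris_estimate(morris_count(500, rng=rng)) for _ in range(1000)]
print(sum(runs) / len(runs))
```

With base 2, an 8-bit register holds c up to 255, so the counter can represent counts on the order of $2^{255}$ at the cost of a large variance. A base closer to 1 trades range for accuracy.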

Data Stream Algorithmics Examples
1. Compute the number of distinct pairs of IP addresses seen in a router (IPv4: 32 bits, IPv6: 128 bits)
2. Compute the top-k most used words in tweets
Find the number of distinct items

Data Stream Algorithmics
Find the number of distinct items (IPv4: 32 bits, IPv6: 128 bits)

Size     Binary size   Memory unit
10^3     2^10          kilobyte (kB/KB)
10^6     2^20          megabyte (MB)
10^9     2^30          gigabyte (GB)
10^12    2^40          terabyte (TB)
10^15    2^50          petabyte (PB)
10^18    2^60          exabyte (EB)
10^21    2^70          zettabyte (ZB)
10^24    2^80          yottabyte (YB)

Data Stream Algorithmics Example
1. Compute the number of distinct pairs of IP addresses seen in a router (IPv4: 32 bits, IPv6: 128 bits)
Using 256 words of 32 bits gives an accuracy of 5%.
Find the number of distinct items

Data Stream Algorithmics Example
1. Compute the number of distinct pairs of IP addresses seen in a router
Selecting n random numbers,
◮ half of these numbers have the first bit as zero,
◮ a quarter have the first and second bits as zero,
◮ an eighth have the first, second and third bits as zero, and so on.
A pattern $0^i 1$ appears with probability $2^{-(i+1)}$, so $n \approx 2^{i+1}$.
Find the number of distinct items

Data Stream Algorithmics
FLAJOLET-MARTIN PROBABILISTIC COUNTING ALGORITHM
Init bitmap[0 … L − 1] ← 0
for every item x in the stream
    do index = ρ(hash(x))    ▷ position of the least significant 1-bit
       if bitmap[index] = 0
           then bitmap[index] = 1
b ← position of leftmost zero in bitmap
return $2^b / 0.77351$
$E[pos] \approx \log_2(\varphi n) \approx \log_2(0.77351 \cdot n)$, $\sigma(pos) \approx 1.12$

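Data Stream Algorithmics

item x   hash(x)   ρ(hash(x))   bitmap
a        0110      1            01000
b        1001      0            11000
c        0111      1            11000
d        1100      0            11000
e        0101      1            11000
f        1010      0            11000

In this example the bit strings are read left to right, so ρ(hash(x)) is the index of the first 1-bit as written.
$b = 2$, $n \approx 2^2 / 0.77351 = 5.17$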
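The worked table can be replayed with a short Python sketch. The hash values are taken directly from the table as 4-bit strings rather than computed by a real hash function, and `rho` follows the convention the table appears to use (index of the first 1 reading left to right). The function names are chosen for this example only.

```python
PHI = 0.77351  # correction constant from the slides


def rho(bits):
    """Index of the first '1' in a bit string, reading left to right."""
    return bits.index("1")  # raises ValueError on an all-zero string


def fm_estimate(hashes, length=5):
    """Flajolet-Martin estimate from an iterable of hash bit strings."""
    bitmap = [0] * length
    for bits in hashes:
        bitmap[rho(bits)] = 1
    # b is the position of the leftmost zero in the bitmap.
    b = bitmap.index(0) if 0 in bitmap else length
    return 2 ** b / PHI


# Hash values of items a..f, copied from the table.
table_hashes = ["0110", "1001", "0111", "1100", "0101", "1010"]
print(round(fm_estimate(table_hashes), 2))  # 5.17
```

The estimate 5.17 is close to the true number of distinct items, 6. Repeating an item changes nothing, because it sets a bit that is already set.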

Data Stream Algorithmics
FLAJOLET-MARTIN PROBABILISTIC COUNTING ALGORITHM
Init bitmap[0 … L − 1] ← 0
for every item x in the stream
    do index = ρ(hash(x))    ▷ position of the least significant 1-bit
       if bitmap[index] = 0
           then bitmap[index] = 1
b ← position of leftmost zero in bitmap
return $2^b / 0.77351$

Init M ← −∞
for every item x in the stream
    do M = max(M, ρ(h(x)))
b ← M + 1    ▷ position of leftmost zero in bitmap
return $2^b / 0.77351$

Data Stream Algorithmics
Stochastic Averaging
Perform $m$ experiments in parallel: $\sigma' = \sigma / \sqrt{m}$
Relative accuracy is $0.78 / \sqrt{m}$
HYPERLOGLOG COUNTER
◮ the stream is divided into $m = 2^b$ substreams
◮ the estimation uses the harmonic mean
◮ Relative accuracy is $1.04 / \sqrt{m}$

Data Stream Algorithmics
HYPERLOGLOG COUNTER
Init M[0 … m − 1] ← −∞
for every item x in the stream
    do index = $h_b(x)$
       M[index] = max(M[index], ρ($h^b(x)$))
return $\alpha_m m^2 / \sum_{j=0}^{m-1} 2^{-M[j]}$
$h(x) = 010011000111$, $h_3(x) = 001$ and $h^3(x) = 011000111$
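A compact Python sketch of a HyperLogLog counter is given below, under several assumptions that are not in the slides: items are hashed with SHA-1 truncated to 64 bits, the first b bits select the register, ρ is the 1-based position of the first 1-bit in the remaining bits, registers start at 0 instead of −∞, and $\alpha_m$ uses the usual approximation $0.7213 / (1 + 1.079/m)$ for $m \ge 128$. No small-range or large-range correction is applied, so the raw estimate is only meaningful when the number of distinct items is well above m.

```python
import hashlib


def _hash64(item):
    """Deterministic 64-bit hash of an item (SHA-1 truncated; an assumption)."""
    digest = hashlib.sha1(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class HyperLogLog:
    def __init__(self, b=10):
        self.b = b
        self.m = 1 << b                      # m = 2**b registers
        self.registers = [0] * self.m        # 0 instead of -infinity
        self.alpha = 0.7213 / (1 + 1.079 / self.m)  # approximation for m >= 128

    def add(self, item):
        h = _hash64(item)
        width = 64 - self.b
        index = h >> width                   # first b bits pick the register
        rest = h & ((1 << width) - 1)        # remaining bits
        # 1-based position of the first 1-bit in the remaining bits.
        rank = width - rest.bit_length() + 1
        if rank > self.registers[index]:
            self.registers[index] = rank

    def estimate(self):
        # Normalized harmonic mean of 2**M[j] over all registers.
        return self.alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)


hll = HyperLogLog(b=10)
for i in range(100_000):
    hll.add(f"item-{i}")
print(round(hll.estimate()))
```

With m = 1024 registers the expected relative error is about $1.04/\sqrt{1024} \approx 3\%$, using 1024 small registers regardless of how many items are seen.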

Methodology
Paolo Boldi, Facebook: Four degrees of separation
Big Data does not need big machines, it needs big intelligence

Data Stream Algorithmics Examples
1. Compute the number of distinct pairs of IP addresses seen in a router
2. Compute the top-k most used words in tweets
Find the most frequent items

Data Stream Algorithmics
MAJORITY
Init counter c ← 0
for every item s in the stream
    do if counter is zero
           then pick up the item
       if item is the same
           then increment counter
           else decrement counter
Find the item that is contained in more than half of the instances
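The MAJORITY pseudocode can be sketched in Python as follows. The function name `majority_candidate` is chosen for this example. As an added caveat, the returned item is only guaranteed to be the majority item when one actually exists; otherwise a second pass over the data is needed to verify the candidate.

```python
def majority_candidate(stream):
    """Return the only item that can occur in more than half of the stream."""
    candidate, count = None, 0
    for item in stream:
        if count == 0:
            # Counter is zero: pick up the current item.
            candidate, count = item, 1
        elif item == candidate:
            count += 1
        else:
            count -= 1
    return candidate


print(majority_candidate("abaacaa"))  # a
```

Only one item and one counter are stored, whatever the length of the stream.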

Data Stream Algorithmics
FREQUENT
for every item i in the stream
    do if item i is not monitored
           do if < k items monitored
                  then add a new item with count 1
                  else if an item z whose count is zero exists
                           then replace this item z by the new one
                           else decrement all counters by one
           else    ▷ item i is monitored
                increase its counter by one
Figure: Algorithm FREQUENT to find most frequent items
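A direct Python transcription of the FREQUENT pseudocode might look like the sketch below. The function name `frequent` is made up for the example, and one detail is an assumption: when a zero-count item is replaced, the new item starts with count 1, which the pseudocode does not state.

```python
def frequent(stream, k):
    """Monitor at most k items; return the monitored items and their counters."""
    counts = {}
    for item in stream:
        if item in counts:
            # Item is monitored: increase its counter by one.
            counts[item] += 1
        elif len(counts) < k:
            counts[item] = 1
        else:
            # Look for a monitored item whose count has dropped to zero.
            zero = next((z for z, c in counts.items() if c == 0), None)
            if zero is not None:
                del counts[zero]
                counts[item] = 1  # assumption: a replacement starts at 1
            else:
                # No free slot: decrement all counters by one.
                for key in counts:
                    counts[key] -= 1
    return counts


print(frequent("abacabad", 2))  # {'a': 2, 'b': 0}
```

The counters underestimate the true frequencies (here `a` occurs 4 times), but a heavily repeated item stays monitored.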

Data Stream Algorithmics
LOSSY COUNTING
for every item i in the stream
    do if item i is not monitored
           then add a new item with count 1 + ∆
           else    ▷ item i is monitored
                increase its counter by one
       if ⌊n/k⌋ ≠ ∆
           then ∆ = ⌊n/k⌋
                decrement all counters by one
                remove items with zero counts
Figure: Algorithm LOSSY COUNTING to find most frequent items
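For comparison, here is a Python sketch of Lossy Counting in the widely used Manku and Motwani formulation. This is a swapped-in variant, not a transcription of the pseudocode above: instead of decrementing every counter, each entry stores its count together with a maximum error ∆, and entries are pruned at bucket boundaries. The function name and the parameter `epsilon` (with bucket width ⌈1/ε⌉) are assumptions of this example.

```python
import math


def lossy_counting(stream, epsilon):
    """Return surviving items with their counted frequencies."""
    width = math.ceil(1 / epsilon)      # bucket width
    entries = {}                        # item -> [count, max_error]
    n = 0
    for item in stream:
        n += 1
        bucket = math.ceil(n / width)   # id of the current bucket
        if item in entries:
            entries[item][0] += 1
        else:
            # A new item may have been missed in up to bucket - 1 earlier buckets.
            entries[item] = [1, bucket - 1]
        if n % width == 0:
            # Bucket boundary: drop entries that cannot be frequent.
            stale = [key for key, (f, d) in entries.items() if f + d <= bucket]
            for key in stale:
                del entries[key]
    return {key: f for key, (f, d) in entries.items()}


print(lossy_counting("aabaac", 0.5))  # {'a': 4}
```

In this formulation each reported count undercounts the true frequency by at most εn.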

Data Stream Algorithmics
SPACE SAVING
for every item i in the stream
    do if item i is not monitored
           do if < k items monitored
                  then add a new item with count 1
                  else replace the item with the lowest counter
                       increase its counter by one
           else    ▷ item i is monitored
                increase its counter by one
Figure: Algorithm SPACE SAVING to find most frequent items
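The SPACE SAVING pseudocode translates almost line for line into Python. The sketch below uses a plain dictionary and a linear scan for the minimum, which keeps it short but is not how an efficient implementation would find the lowest counter; the function name is chosen for this example.

```python
def space_saving(stream, k):
    """Monitor exactly k counters; return items with their (over)estimates."""
    counts = {}
    for item in stream:
        if item in counts:
            counts[item] += 1
        elif len(counts) < k:
            counts[item] = 1
        else:
            # Replace the item with the lowest counter and
            # let the newcomer inherit that counter plus one.
            victim = min(counts, key=counts.get)
            counts[item] = counts.pop(victim) + 1
    return counts


print(space_saving("aaabcb", 2))  # {'a': 3, 'b': 3}
```

The counters always sum to the number of items processed, and each one is an upper bound on the true frequency of its item (here `b` is reported as 3 but occurs twice).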

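Data Stream Algorithmics
Figure: A CM sketch structure example with $\epsilon = 0.4$ and $\delta = 0.02$. An item $j$ is mapped by the hash functions $h_1(j), \ldots, h_4(j)$ to one cell in each of rows 1 to 4, and each of these cells is increased by $I$.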

Count-Min Sketch
A two-dimensional array with width $w$ and depth $d$:
$w = \left\lceil \frac{e}{\epsilon} \right\rceil, \quad d = \left\lceil \ln \frac{1}{\delta} \right\rceil$
It uses space $wd = \frac{e}{\epsilon} \ln \frac{1}{\delta}$ with update time $d = \ln \frac{1}{\delta}$.
CM-Sketch computes frequency data, adding and removing real values.
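A minimal Count-Min sketch in Python, sized from ε and δ as above, could look like this. The class and method names are illustrative. The per-row hash functions are an assumption: a SHA-1 hash salted with the row number stands in for the pairwise-independent family that the analysis requires.

```python
import hashlib
import math


class CountMinSketch:
    def __init__(self, epsilon, delta):
        self.width = math.ceil(math.e / epsilon)       # w = ceil(e / epsilon)
        self.depth = math.ceil(math.log(1 / delta))    # d = ceil(ln(1 / delta))
        self.table = [[0] * self.width for _ in range(self.depth)]

    def _cell(self, row, item):
        # Row-salted SHA-1 as a stand-in hash function for this row.
        digest = hashlib.sha1(f"{row}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def update(self, item, amount=1):
        # One cell per row is touched, so update time is d.
        for row in range(self.depth):
            self.table[row][self._cell(row, item)] += amount

    def estimate(self, item):
        # With non-negative updates, every row overestimates; take the minimum.
        return min(self.table[row][self._cell(row, item)]
                   for row in range(self.depth))


cms = CountMinSketch(epsilon=0.4, delta=0.02)
print(cms.width, cms.depth)  # 7 4
for word in ["a", "b", "a", "c", "a"]:
    cms.update(word)
print(cms.estimate("a") >= 3)  # True
```

With ε = 0.4 and δ = 0.02 the table has 7 columns and 4 rows, matching the four hash functions in the figure. Passing a negative `amount` removes occurrences, although the minimum is then no longer guaranteed to be an upper bound.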

Data Stream Algorithmics
Problem: Given a data stream, choose k items with the same probability, storing only k elements in memory.
RESERVOIR SAMPLING
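The slide names reservoir sampling without giving pseudocode. One standard way to do it is sketched below (often called Algorithm R); the function name and the optional `rng` parameter are choices made for this example.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return k items chosen uniformly at random from a stream of unknown length."""
    rng = rng or random.Random()
    reservoir = []
    for t, item in enumerate(stream, start=1):
        if t <= k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the t-th item with probability k / t,
            # replacing a uniformly chosen slot.
            j = rng.randrange(t)
            if j < k:
                reservoir[j] = item
    return reservoir


print(reservoir_sample(range(1000), 5, rng=random.Random(0)))
```

After t items have been seen, each of them is in the reservoir with probability k/t, and only k elements are ever stored.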
