SLIDE 1
Topics in TCS: ℓ0-sampling
Raphaël Clifford
SLIDE 5
Introduction to ℓ0 sampling
Over a large data set that assigns counts to tokens, the goal of an ℓ0-sampler is to draw (approximately) uniformly from the set of tokens with non-zero frequency. This is non-trivial because we want to use small space and counts can be both positive and negative. Consider a stream of visits by customers to the busy website of some business or organization. An analyst might want to sample uniformly from the set of all distinct customers who visited the website (ℓ0-sampling). Or an analyst might want to sample customers with probability proportional to their visit frequency (ℓ1-sampling).
SLIDE 8
Approximate ℓ0 sampling
The ℓ0-sampling problem cannot be solved exactly in sublinear space by a deterministic algorithm. We will see a randomised approximate algorithm. Let f0 be the number of tokens with non-zero frequency. Define the target probability for token i as

  πi = 1/f0, if i ∈ supp(f)
  πi = 0, otherwise

We assume that f ≠ 0.
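As an illustration of this target distribution (computed offline, not in small space), here is a minimal sketch on a toy stream of (token, count) updates; note that a token whose counts cancel to zero drops out of supp(f):

```python
from collections import Counter

# Toy stream of (token, count) updates; counts may be negative.
f = Counter()
for j, c in [(2, 3), (5, 1), (7, 2), (5, -1)]:
    f[j] += c

support = sorted(j for j, c in f.items() if c != 0)  # supp(f)
pi = {j: 1 / len(support) for j in support}          # pi_i = 1/f0 on supp(f)

print(support)  # token 5 cancels to zero and is excluded
print(pi)
```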
SLIDE 12
The overall idea
We will sample substreams randomly in such a way that there is a good chance that one is strictly 1-sparse. We will run a sparse recovery algorithm on each substream. Our method for achieving this is called "geometric sampling" as each substream samples tokens with geometrically decreasing probability. We will use our sparse recovery and detection algorithm to report the index of a token with non-zero frequency. The reported token will be uniformly sampled from all tokens with non-zero frequency.
SLIDE 13
ℓ0-sampling algorithm
Where log n is written it should be read as ⌈log2 n⌉. We will write Dℓ for the ℓth instance of a 1-sparse recovery algorithm.

initialise
  for each ℓ from 0 to log n
    choose hℓ : [n] → {0, 1}^ℓ uniformly at random
    set Dℓ = 0

process(j, c)
  for each ℓ from 0 to log n
    if hℓ(j) = 0 then          # probability 2^−ℓ
      feed (j, c) to Dℓ        # 1-sparse recovery

output
  for each ℓ from 0 to log n
    if Dℓ reports strictly 1-sparse
      output the recovered pair (i, c) and stop   # token, frequency
  output FAIL
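The pseudocode above can be sketched in Python. This is a minimal illustration, not the lecture's exact construction: the class names are mine, substream membership is decided by a seeded hash in place of hℓ, and the 1-sparse detector uses the standard sums Σc, Σj·c, Σj²·c, whose simple algebraic test can give false positives that the lecture's fingerprint-based detector (failure probability O(1/n²)) avoids.

```python
import random

class OneSparseDetector:
    """Tracks w0 = sum of c, w1 = sum of j*c, w2 = sum of j*j*c.

    If the net frequency vector of the substream is strictly 1-sparse with
    support {i} and frequency c, then w0 = c, w1 = i*c, w2 = i*i*c, so
    w0*w2 == w1*w1. The converse test can err on adversarial inputs;
    a fingerprint fixes this, omitted here for brevity."""

    def __init__(self):
        self.w0 = self.w1 = self.w2 = 0

    def update(self, j, c):
        self.w0 += c
        self.w1 += j * c
        self.w2 += j * j * c

    def recover(self):
        if (self.w0 != 0 and self.w1 % self.w0 == 0
                and self.w0 * self.w2 == self.w1 ** 2):
            return self.w1 // self.w0, self.w0  # (token, frequency)
        return None  # not detected as strictly 1-sparse


class L0Sampler:
    """Geometric sampling: token j enters level ell with probability
    2**-ell, for ell = 0 .. ~ceil(log2 n)."""

    def __init__(self, n, seed=0):
        self.levels = n.bit_length()  # roughly ceil(log2 n)
        self.seed = seed
        self.D = [OneSparseDetector() for _ in range(self.levels + 1)]

    def _in_level(self, j, ell):
        # Stand-in for h_ell(j) = 0: deterministic per (j, ell), prob 2**-ell.
        return random.Random(hash((self.seed, ell, j))).random() < 2.0 ** -ell

    def process(self, j, c):
        for ell in range(self.levels + 1):
            if self._in_level(j, ell):
                self.D[ell].update(j, c)

    def output(self):
        for d in self.D:
            r = d.recover()
            if r is not None:
                return r
        return None  # FAIL


# Demo: net support is {2, 7} (token 5 cancels). On FAIL we rerun with
# fresh randomness, matching the repetition argument later in the slides.
sample, seed = None, 0
while sample is None:
    sampler = L0Sampler(8, seed=seed)
    for j, c in [(2, 3), (5, 1), (7, 2), (5, -1)]:
        sampler.process(j, c)
    sample = sampler.output()
    seed += 1
print(sample)  # (2, 3) or (7, 2)
```

Because net-zero tokens contribute nothing to (w0, w1, w2), they vanish from every substream, so only tokens in supp(f) can ever be reported.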
SLIDE 16
ℓ0-sampling algorithm example
[Figure: Frequency vector f over tokens 1 to 8]
- The tokens with non-zero frequency are 2, 5, 7.
- We make 4 substreams.
- With high probability we return 7.

  ℓ       Prob.   Tokens included
  ℓ = 0   1       2, 5, 7
  ℓ = 1   1/2     2, 5
  ℓ = 2   1/4     7
  ℓ = 3   1/8     2

process(j, c)
  for each ℓ from 0 to log n
    if hℓ(j) = 0 then
      feed (j, c) to Dℓ
SLIDE 20
ℓ0-sampling analysis I
- Let d = |supp(f)|. We want to compute a lower bound for the probability that a substream is strictly 1-sparse.
- For a fixed level ℓ, define the indicator r.v. Xj = 1 if token j is selected at level ℓ, so p := E Xj = 2^−ℓ, and write q = 1 − p. Let S = X1 + · · · + Xd. The event that the substream is strictly 1-sparse is {S = 1}.
- We have E Xj = p, and E(XjXk) = p = p² + pq if j = k, and p² otherwise.
- By Chebyshev (Markov's inequality applied to (S − 1)²),
  Pr(S ≠ 1) = Pr(|S − 1| ≥ 1) ≤ E(S − 1)² = E(S²) − 2 E(S) + 1
  = Σ_{j,k∈[d]} E(XjXk) − 2 Σ_{j∈[d]} E(Xj) + 1 = d²p² + dpq − 2dp + 1.
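A quick numerical sanity check of this bound (assuming the Xj are independent, so S is Binomial(d, p) and Pr(S = 1) = dp(1 − p)^(d−1) exactly): the exact probability always dominates the lower bound 2dp − d²p² − dpq = dp(1 − (d − 1)p) implied by the derivation above.

```python
# Exact Pr(S = 1) for S ~ Binomial(d, p) versus the Chebyshev-derived
# lower bound dp(1 - (d - 1)p).
for d in [1, 2, 5, 10, 100, 1000]:
    for p in [1 / (4 * d), 1 / (3 * d), 0.999 / (2 * d)]:
        exact = d * p * (1 - p) ** (d - 1)
        bound = d * p * (1 - (d - 1) * p)
        assert exact >= bound - 1e-12, (d, p, exact, bound)
```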
SLIDE 25
ℓ0-sampling analysis II
- Pr(S ≠ 1) = Pr(|S − 1| ≥ 1) ≤ d²p² + dpq − 2dp + 1.
- The probability that a substream is strictly 1-sparse is therefore at least 2dp − d²p² − dpq = dp(1 − (d − 1)p) > dp(1 − dp).
- If p = c/d for c ∈ (0, 1) then the probability that a substream is strictly 1-sparse is at least c(1 − c).
- Consider the level ℓ such that 1/(4d) ≤ 1/2^ℓ < 1/(2d). This constrains ℓ to a unique value for any d ≥ 1.
- At this level dp ∈ [1/4, 1/2), so the probability that the substream at level ℓ is strictly 1-sparse is at least (1/4)(1 − 1/4) = 3/16 > 1/8.
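Both claims about this special level can be checked numerically: for every d there is exactly one level ℓ with 1/(4d) ≤ 2^−ℓ < 1/(2d), and at that level dp(1 − dp) is at least 3/16.

```python
# Uniqueness of the level and the 3/16 bound, checked for many d.
for d in range(1, 2000):
    ells = [ell for ell in range(64)
            if 1 / (4 * d) <= 2.0 ** -ell < 1 / (2 * d)]
    assert len(ells) == 1            # exactly one qualifying level
    dp = d * 2.0 ** -ells[0]
    assert 0.25 <= dp < 0.5          # d * p lands in [1/4, 1/2)
    assert dp * (1 - dp) >= 3 / 16 - 1e-12
```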
SLIDE 27
ℓ0-sampling analysis III
- By repeating the whole procedure O(log(1/δ)) times we reduce the probability that no substream is strictly 1-sparse to O(δ). To see this, note each repetition succeeds with probability at least 1/8, so x independent repetitions all fail with probability at most (7/8)^x; setting (7/8)^x = δ gives x = log2(1/δ)/log2(8/7).
- Each run of the 1-sparse algorithm fails with probability O(1/n²), and so the overall probability of failure is O(log n · log(1/δ)/n²).
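The repetition count can be computed directly (a sketch; the 7/8 is the per-repetition failure bound 1 − 1/8 from the previous slide):

```python
import math

# Smallest x with (7/8)**x <= delta, i.e. x = ceil(log2(1/delta)/log2(8/7)).
def repetitions(delta):
    return math.ceil(math.log2(1 / delta) / math.log2(8 / 7))

for delta in [0.1, 0.01, 0.001]:
    x = repetitions(delta)
    assert (7 / 8) ** x <= delta < (7 / 8) ** (x - 1)
```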
SLIDE 32
ℓ0-sampling summary
The ℓ0-sampling problem asks us to sample independently and uniformly from the tokens with non-zero frequency. We use geometric sampling together with the 1-sparse recovery and detection algorithm. The space is O(log n) · O(log(1/δ)) · O(log n + log M) = O(log n · log(1/δ) · (log n + log M)) bits. The time per arriving (token, count) pair is O(log n · log(1/δ)). The probability of failure, because one of the 1-sparse algorithm instances gives a false positive, is O(log n · log(1/δ)/n²).