Finding Interesting Correlations with Conditional Heavy Hitters
Katsiaryna Mirylenka (University of Trento)
Themis Palpanas (University of Trento)
Graham Cormode (AT&T Labs)
Divesh Srivastava (AT&T Labs)
Streaming Data Processing
Much big data arrives in the form of streams of updates
– Each item in the stream gives more information
– The stream is too large to store or forward
Much prior work on streaming algorithms using small space
– For "heavy hitters" (frequent items, frequent itemsets)
– For quantiles, entropy, and other statistical quantities
– For data mining and machine learning (clustering, classifiers)
Common application domains:
– Network health monitoring (anomaly detection)
– Intrusion detection over streams of events
Limitations of current approaches
Existing streaming primitives not always suited to these cases:
Tracking heavy hitters in network monitoring is too crude
– Some sources or destinations are always popular
– These may drown out the informative cases
– Want to study data at a finer level of detail
Frequent itemset mining in intrusion detection is not scalable
– Enormous search space of possible combinations
– Existing algorithms need a lot of space
– Do not offer 'real-time' performance
Want a mining primitive between these two extremes
– Finer than heavy hitters, simpler than frequent itemsets
Conditional Heavy Hitters
Observation: much data can be abstracted as pairs of items
– (Source, destination) in network data
– (Current, next) states in Markov chain models
– Pairs of attributes in database systems
The first item is primary, the other is secondary
– Abstract as (parent, child) pairs
Introduce the notion of conditional heavy hitters:
– (parent, child) pairs where the child is frequent given the parent
– We formalize this definition, and give algorithms to find them
[Figure: a parent node with its children child1, child2, child3, …, childn]
Conditional Heavy Hitters Definitions
Given parents p and children c, define:
– fp as the frequency (count) of parent p in the stream
– fp,c as the frequency (count) of the pair (p,c) in the stream
– Pr[p] as the probability of p, fp/n
– Pr[c|p] as the conditional probability of c given p, fp,c/fp
Conditional heavy hitters are those (p,c) pairs with Pr[c|p] > φ, for a threshold φ
– This needs refinement: if fp = fp,c = 1, then Pr[c|p] = 1
– So restrict attention to those with the top-τ largest fp,c values
Still a technically difficult problem
– A lower bound shows that a lot of space is needed to give guarantees
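To make the definitions concrete, here is a tiny exact (offline) computation in Python; the toy stream and the threshold value are purely illustrative.

```python
# A tiny exact (offline) illustration of the definitions above.
# The stream contents and the threshold phi are purely illustrative.
from collections import Counter

stream = [("a", 1), ("a", 1), ("a", 2), ("b", 3), ("b", 3), ("b", 3)]

f_p = Counter(p for p, _ in stream)   # parent frequencies f_p
f_pc = Counter(stream)                # pair frequencies f_{p,c}

phi = 0.6                             # conditional-probability threshold
for (p, c), f in f_pc.items():
    cond = f / f_p[p]                 # Pr[c|p] = f_{p,c} / f_p
    if cond > phi:
        print(f"({p}, {c}) is a CHH: Pr[{c}|{p}] = {cond:.2f}")
```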
Outline
Introduce a sequence of four algorithms to find Conditional Heavy Hitters (CHH)
– The initial two algorithms store information on all parents
– The subsequent two track approximate information on parents
– An experimental study identifies where each algorithm performs best
Space Saving Algorithm for HH
The basic building block is an algorithm for heavy hitters (HH)
SpaceSaving is an efficient HH algorithm [Metwally et al. '05]
– Keeps information about k different items and their counts
– If the next item in the stream is stored, update its count
– If not, overwrite the least frequent item and update its count
– Guarantees error at most n/k on any count
SpaceSaving (SS) often performs very well in practice
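A minimal Python sketch of SpaceSaving follows; it uses a plain dictionary of monitored counters, whereas the original algorithm uses a "stream-summary" structure for O(1) updates.

```python
# A minimal sketch of SpaceSaving over a dictionary of monitored counters.
# The original algorithm uses a "stream-summary" structure for O(1) updates;
# the linear-time min below keeps the sketch short.

class SpaceSaving:
    def __init__(self, k):
        self.k = k                    # number of monitored counters
        self.counts = {}              # item -> estimated count

    def update(self, item):
        if item in self.counts:
            self.counts[item] += 1    # monitored item: increment
        elif len(self.counts) < self.k:
            self.counts[item] = 1     # free slot: start monitoring
        else:
            # overwrite the least frequent item; the new item inherits
            # its count + 1, overestimating by at most n/k
            victim = min(self.counts, key=self.counts.get)
            self.counts[item] = self.counts.pop(victim) + 1

    def estimate(self, item):
        return self.counts.get(item, 0)
```

Usage is a single pass: create `SpaceSaving(k)` and call `update(x)` for each stream item.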
1. GlobalHH Algorithm
A natural first approach to the CHH problem (a sketch follows the figure below):
– Keep exact statistics on parent frequencies
– Keep approximate counts of (parent, child) pairs via SS
– Use the approximate and exact information to estimate Pr[c|p]
– Output CHHs based on these estimates
Provides guarantees on the estimated values:
– The error in the estimate of Pr[c|p] is at most n/(k fp)
– The error improves if the distribution is skewed
[Figure: GlobalHH layout: exact counts for parents; an SS summary over (parent, child) pairs]
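A minimal sketch of GlobalHH, reusing the SpaceSaving class from the earlier sketch; the threshold parameter phi is an assumption of this sketch, not taken from the paper.

```python
# A minimal sketch of GlobalHH: exact parent counts plus one SpaceSaving
# summary over (parent, child) pairs. Reuses the SpaceSaving class above.
from collections import defaultdict

class GlobalHH:
    def __init__(self, k):
        self.f_p = defaultdict(int)   # exact parent counts
        self.pairs = SpaceSaving(k)   # approximate pair counts

    def update(self, p, c):
        self.f_p[p] += 1
        self.pairs.update((p, c))

    def report(self, phi):
        # estimated Pr[c|p] = approximate f_{p,c} / exact f_p
        return [(p, c) for (p, c), f in self.pairs.counts.items()
                if f / self.f_p[p] > phi]
```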
2. CondHH Algorithm
The previous algorithm is not tuned to the CHH definition
– SS prunes based on raw frequency
– Instead, CondHH prunes based on (estimated) Pr[c|p]
Introduce the ConditionalSpaceSaving (CSS) algorithm (a sketch follows the figure below):
– Keeps information about k different items and their counts
– If the next item in the stream is stored, update its count
– If not, overwrite the item with the lowest Pr[c|p] estimate and update its count
– Uses some implementation tricks to make updates fast
CondHH: use CSS for (parent, child) pairs to estimate Pr[c|p]
[Figure: CondHH layout: exact counts for parents; a CSS summary over (parent, child) pairs]
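A minimal sketch of CondHH with a dictionary-based ConditionalSpaceSaving. Having the new pair inherit the evicted count + 1 follows the SpaceSaving convention and is an assumption here; the paper's fast-update tricks are omitted.

```python
# A minimal sketch of CondHH: eviction picks the monitored pair with the
# lowest estimated Pr[c|p] rather than the lowest raw count.
from collections import defaultdict

class CondHH:
    def __init__(self, k):
        self.k = k
        self.f_p = defaultdict(int)   # exact parent counts
        self.pairs = {}               # monitored (p, c) -> estimated count

    def update(self, p, c):
        self.f_p[p] += 1
        if (p, c) in self.pairs:
            self.pairs[(p, c)] += 1
        elif len(self.pairs) < self.k:
            self.pairs[(p, c)] = 1
        else:
            # evict the pair whose estimated Pr[c|p] is smallest
            victim = min(self.pairs,
                         key=lambda pc: self.pairs[pc] / self.f_p[pc[0]])
            # assumed SpaceSaving-style inheritance of the evicted count
            self.pairs[(p, c)] = self.pairs.pop(victim) + 1
```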
3. FamilyHH Algorithm
The previous algorithms assumed we could store all parents
– Not realistic as the domain of parents increases
FamilyHH is the natural generalization of GlobalHH (a sketch follows the figure below):
– Keep one SS for parents, and another SS for (parent, child) pairs
– Use both approximate counts to estimate Pr[c|p]
Similar worst-case guarantees to GlobalHH
– Given O(k) space, the error in Pr[c|p] is at most n/(k fp)
[Figure: FamilyHH layout: an SS summary over parents; a second SS summary over (parent, child) pairs]
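A minimal sketch of FamilyHH, reusing the SpaceSaving class from the earlier sketch; how to split memory between the two summaries (k_parents vs. k_pairs) is left as a parameter here.

```python
# A minimal sketch of FamilyHH: both the parent counts and the pair counts
# are now approximate. Reuses the SpaceSaving class above.

class FamilyHH:
    def __init__(self, k_parents, k_pairs):
        self.parents = SpaceSaving(k_parents)  # approximate f_p
        self.pairs = SpaceSaving(k_pairs)      # approximate f_{p,c}

    def update(self, p, c):
        self.parents.update(p)
        self.pairs.update((p, c))

    def report(self, phi):
        # both numerator and denominator are now estimates
        out = []
        for (p, c), f in self.pairs.counts.items():
            fp = self.parents.estimate(p)
            if fp > 0 and f / fp > phi:
                out.append((p, c))
        return out
```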
4. SparseHH Algorithm
The last algorithm is the most involved (a structural sketch follows the figure below)
– Keep SS on parents, and CSS on (parent, child) pairs
Given a new (parent, child) pair, we need to initialize its fp,c estimate
– Can use additional data structures to track this information
– Use hashing/Bloom filter techniques to minimize space
– Experimentally determine how to divide the available memory
No worst-case guarantees on performance
– So we compare all the algorithms empirically
[Figure: SparseHH layout: an SS summary over parents; a CSS summary over (parent, child) pairs]
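A structural sketch of SparseHH under explicit assumptions: a plain Python set stands in for the Bloom filter, and the initialization rule for a newly admitted pair (inherit the evicted count only if the pair was seen before) is one plausible reading; the paper chooses these details, and the memory split, experimentally.

```python
# A structural sketch of SparseHH: approximate parent counts, CSS-style
# pair counts, and a "seen before" structure used at initialization.
# Reuses the SpaceSaving class above; the set below is a Bloom-filter
# stand-in and the init rule is an assumption of this sketch.

class SparseHH:
    def __init__(self, k_parents, k_pairs):
        self.parents = SpaceSaving(k_parents)  # approximate f_p
        self.k_pairs = k_pairs
        self.pairs = {}                        # CSS-style monitored pairs
        self.seen = set()                      # Bloom-filter stand-in

    def update(self, p, c):
        self.parents.update(p)
        if (p, c) in self.pairs:
            self.pairs[(p, c)] += 1
            return
        init = 1
        if len(self.pairs) >= self.k_pairs:
            # evict the pair with the lowest estimated Pr[c|p]
            victim = min(self.pairs,
                         key=lambda pc: self.pairs[pc]
                         / max(1, self.parents.estimate(pc[0])))
            evicted = self.pairs.pop(victim)
            if (p, c) in self.seen:
                init = evicted + 1             # assumed initialization rule
        self.pairs[(p, c)] = init
        self.seen.add((p, c))
```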
Algorithm Summary
Algorithm      Parent   (Parent, Child)
1. GlobalHH    Exact    SS
2. CondHH      Exact    CSS
3. FamilyHH    SS       SS
4. SparseHH    SS       CSS
Other algorithms were proposed but performed less well; for more details, see the paper.
Experimental Study
Implemented and evaluated on a variety of data
– WorldCup data of (ClientID, ObjectID) request pairs
– Taxicab GPS data: 54K trajectories in a 2nd-order Markov model
Distinguish between data that is sparse and dense
– Sparse data has few distinct children per parent (on average)
– Dense data has many distinct children per parent (on average)
Measure the precision and recall of CHH recovery
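A small sketch of the sparse/dense measure just described: the average number of distinct children per parent, computed exactly over a toy stream. The function name and any cutoff between "sparse" and "dense" are illustrative, not taken from the paper.

```python
# Average number of distinct children per parent: low values indicate
# sparse data, high values dense data.
from collections import defaultdict

def avg_distinct_children(stream):
    children = defaultdict(set)
    for p, c in stream:
        children[p].add(c)
    return sum(len(s) for s in children.values()) / len(children)

# e.g. avg_distinct_children([("a", 1), ("a", 2), ("b", 1)]) == 1.5
```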
Sparse Data Results
The WorldCup data is sparse: 1 in 10 parents has a CHH child
CondHH and SparseHH do well; both are based on CSS
– They keep very similar information internally
– The other methods are not competitive
[Figure: precision and recall vs. total memory (5–50 MB) for GlobalHH, FamilyHH, CondHH, and SparseHH]
Dense Data Results
The Taxicab data is relatively dense: many parents have a CHH child
CondHH can take more advantage of the available memory
SparseHH converges on CondHH as more memory is used
The other algorithms are not competitive
[Figure: precision and recall vs. total memory (1–4 MB) for GlobalHH, CondHH, and SparseHH]
Throughput and Performance
There is not much variation as memory increases
CondHH and SparseHH are slightly more expensive, due to their more complex processing
Throughput is still about 5 × 10^5 items/second per core
Concluding Remarks
High precision and recall of CHHs is possible on data streams
– The SparseHH algorithm works well over a variety of data types
– CondHH is preferred when the data is more dense
Future work:
– Evaluate for Markov chain parameter estimation
– Compare to other recently proposed definitions
ParentHH Algorithm
Keep a small amount of information for each parent about its child distribution
– Run an instance of SS for each parent
– Track each child distribution accurately
– Use the stored information to estimate Pr[c|p] and output CHHs
Also provides guarantees on accuracy
– Given total space s, the error in the estimate of Pr[c|p] is |P|/s
– P denotes the total number of parents
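A minimal sketch of ParentHH, reusing the SpaceSaving class from the earlier sketch: one small SS instance (s counters) per parent, plus an exact total per parent to normalize the estimate.

```python
# A minimal sketch of ParentHH: one SpaceSaving instance per parent,
# each tracking that parent's child distribution.
from collections import defaultdict

class ParentHH:
    def __init__(self, s):
        self.s = s                      # counters per parent
        self.totals = defaultdict(int)  # exact f_p per parent
        self.children = {}              # parent -> SpaceSaving over children

    def update(self, p, c):
        self.totals[p] += 1
        if p not in self.children:
            self.children[p] = SpaceSaving(self.s)
        self.children[p].update(c)

    def estimate_cond_prob(self, p, c):
        if p not in self.children:
            return 0.0
        return self.children[p].estimate(c) / self.totals[p]
```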