Finding (Recently) Frequent Items in Distributed Data Streams

Amit Manjhi ∗, Kedar Dhamdhere †, Vladislav Shkapenyuk, Christopher Olston
Carnegie Mellon University
{manjhi, vshkap, kedar, olston}@cs.cmu.edu

∗ Supported by an ITR grant from the NSF.
† Supported by NSF ITR grants CCR-0085982 and CCR-0122581.

Abstract

We consider the problem of maintaining frequency counts for items occurring frequently in the union of multiple distributed data streams. Naïve methods of combining approximate frequency counts from multiple nodes tend to result in excessively large data structures that are costly to transfer among nodes. To minimize communication requirements, the degree of precision maintained by each node while counting item frequencies must be managed carefully. We introduce the concept of a precision gradient for managing precision when nodes are arranged in a hierarchical communication structure. We then study the optimization problem of how to set the precision gradient so as to minimize communication, and provide optimal solutions that minimize worst-case communication load over all possible inputs. We then introduce a variant designed to perform well in practice, with input data that does not conform to worst-case characteristics. We verify the effectiveness of our approach empirically using real-world data, and show that our methods incur substantially less communication than naïve approaches while providing the same error guarantees on answers.

1. Introduction

The problem of identifying frequently occurring items in continuous data streams has attracted significant attention recently [4, 9, 12, 14, 19, 22]. Potential applications include identifying large network flows [12], answering iceberg queries [13], computing iceberg cubes [17] and finding frequent itemsets and association rules [1]. However, nearly all prior work on identifying frequent items in data streams and estimating their occurrence frequencies falls short of meeting the needs of the many real-world applications that exhibit one or both of the following two properties:

1. Distributed streams. Streams originate from multiple distributed sources. Data from all sources needs to be aggregated to arrive at the final result, as in the distributed streams model of [15].

2. Time sensitivity. Recent data is more important than older data.

We briefly describe two real-world applications exhibiting the properties just mentioned:

1. Monitoring usage in large-scale distributed systems. Web content providers using the services of a Content Delivery Network (CDN) like Akamai [2] may wish to monitor recent access frequencies of content served (e.g., HTML pages/images), to keep tabs on current "hot spots." The CDN may serve requests from any of a number of cache nodes (Akamai currently has over 10,000 such nodes); typically requests are served by the cache node closest to the end-user making the request in order to minimize latency. Hence, keeping tabs on overall access frequencies requires distributed monitoring across many CDN cache nodes.

2. Detecting malicious activities in networked systems:

(a) Detecting worms. Previously unknown Internet worms can be detected by discovering that a large number of recent traffic flows contain the same bit string [20]. Distributed monitoring can reduce detection time.

(b) Detecting DDoS attacks. Early detection of Distributed Denial of Service (DDoS) attacks is an important topic in network security. While a DDoS attack typically targets a single "victim" node or organization, there is generally no common path that all packets take. In fact, even packets sent to the same destination and originating from within the same organization may follow different routes, due to so-called "hot potato" routing [3]. This property makes it very difficult to detect distributed denial of service attacks effectively by only considering the traffic passing through any single monitoring point, and motivates a distributed monitoring approach. Furthermore, techniques that weigh recent data more than past data may help in early detection of attacks.

1.1 Problem Variants

Both applications outlined above require algorithms for identifying recently frequent items in the union of many distributed streams, and estimating the corresponding occurrence frequencies. In general, we can classify applications of frequent item identification into four categories, in terms of whether they require (a) time-sensitivity and (b) distributed monitoring capability. We briefly describe each problem variant:

(1) Finding frequent items in a single stream: A single node sees an ordered stream of possibly repeating items. The goal is to maintain frequency counts of items whose frequency currently exceeds a user-supplied fraction of the size of the overall stream seen so far.

(2) Finding recently frequent items in a single stream: In this variant recent occurrences of items in the stream are considered more important than older occurrences of items. At any given time, a numeric weight is associated with each item occurrence in the stream that is a function of the amount of time that has elapsed since the appearance of the item in the stream. A commonly used weighting scheme is exponential decay [7], in which weights are assigned according to a negative-exponential function of elapsed time. The goal is to identify items whose cumulative weighted frequency currently exceeds a user-supplied fraction of the total across all items, and provide an estimate of the cumulative weighted frequencies of any such items.

(3) Finding frequent items in the union of distributed streams: In this variant there are $m$ ordered streams $S_1, S_2, \ldots, S_m$, each produced at a different node in a distributed environment and consisting of a sequence of item occurrences. The goal is the same as in Variant (1), except that item frequencies are computed over the union of streams $S_1, S_2, \ldots, S_m$, instead of over a single stream.

(4) Finding recently frequent items in the union of distributed streams: This variant represents the natural combination of Variants (2) and (3).

Of these four variants, only Variant (1) has been studied in prior work. (Some work conducted concurrently with our own [4, 16] also addresses problems quite similar to Variants (2) and (3), but there are significant differences with our work; see Section 4 for further discussion.) Algorithms for time-insensitive frequent item identification over a single stream include those presented in [9, 19, 22]. It is straightforward to extend these algorithms to handle Variant (2), although the effect on the space bounds and error guarantees of the resulting algorithms in some cases is nonobvious.

Variants (3) and (4) present a larger challenge. As we will show, simple adaptations of existing frequent item identification algorithms to work in a distributed setting incur excessive communication. In this paper we present a new framework for distributed frequent item identification that minimizes communication requirements. Before outlining our approach we first provide a formal problem statement that unifies the four variants listed above.

1.2 Unified Problem Statement

Our problem statement extends that of [22]. There are $m \geq 1$ ordered data streams $S_1, S_2, \ldots, S_m$. Each stream $S_i$ consists of a sequence of item occurrences with time-stamps: $\langle o_{i1}, t_{i1} \rangle$, $\langle o_{i2}, t_{i2} \rangle$, etc. Each item occurrence $o_{ij}$ is drawn from a fixed universe $U$ of items, i.e., $\forall i, j,\ o_{ij} \in U$. Arbitrary repetition of item occurrences in streams is allowed. Each stream $S_i$ is monitored by a corresponding monitor node $M_i$, of which there are $m$. Monitored frequency counts for high frequency items are to be supplied to a central root node $R$, which may or may not be the same as one of the monitor nodes.

Let $S$ be the sequence-preserving union of streams $S_1, S_2, \ldots, S_m$. Further, let $c(u)$ be the frequency of occurrence of item $u$ in $S$ up to the current time, weighted by recency of occurrence in an exponentially decaying fashion. Mathematically,

$$c(u) = \sum_{\langle o_i, t_i \rangle \in S,\ o_i = u} \alpha^{\left\lfloor \frac{t_{now} - t_i}{T} \right\rfloor}$$

where $t_{now}$ denotes the current time, and $\alpha$ and $T$ are user-supplied parameters. The parameter $\alpha \in (0, 1]$ controls the aggressiveness of exponential weighting. As a special case, setting $\alpha = 1$ causes all item occurrences to be weighted equally, regardless of age (as in Variants (1) and (3) of Section 1.1). The parameter $T > 0$ controls the frequency with which answers are reported, and also the granularity of time-sensitivity. A time period of $T$ time units is referred to as an epoch.

The objective is to supply, at the end of every epoch (i.e., every $T$ time units), an estimate $\hat{c}(u)$ of $c(u)$ for items occurring in $S$ whose true time-weighted frequency $c(u)$ exceeds a support threshold $\mathcal{T}$. $\mathcal{T}$ is defined as the product of a user-supplied support parameter $s \in [0, 1]$, and the sum of the weighted item occurrences seen so far on all input streams, $N = \sum_{u \in U} c(u)$, i.e., $\mathcal{T} = s \cdot N$. The amount of allowable inaccuracy in the frequency estimates $\hat{c}(u)$ is governed by a user-supplied parameter $\epsilon$. It is required that $0 \leq \epsilon \leq s$ (usually, $\epsilon \ll s$). Each time an answer is produced, it must adhere to the following guarantees:

1. All items whose true time-weighted frequency exceeds $s \cdot N$ are output.

2. No item whose true time-weighted frequency is less than $(s - \epsilon) \cdot N$ is output.

3. Each estimate $\hat{c}(u)$ supplied in the answer satisfies: $\max\{0,\ c(u) - \epsilon \cdot N\} \leq \hat{c}(u) \leq c(u)$.
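The definition of c(u) and the three answer guarantees can be made concrete on a tiny input. The following Python sketch computes the exact time-weighted counts centrally and checks a candidate answer against the guarantees. It is an illustration of the problem statement only, not the distributed algorithm developed in this paper. The function names `decayed_counts` and `satisfies_guarantees`, the parameter name `epoch` (standing for the epoch length T), and the sample streams are all assumptions made for the example.

```python
from math import floor


def decayed_counts(streams, t_now, alpha, epoch):
    """Exact c(u) over the union of streams.

    Each stream is a list of (item, timestamp) pairs. An occurrence at
    time t contributes alpha ** floor((t_now - t) / epoch).
    """
    counts = {}
    for stream in streams:
        for item, t in stream:
            if t > t_now:
                continue  # ignore occurrences that have not happened yet
            weight = alpha ** floor((t_now - t) / epoch)
            counts[item] = counts.get(item, 0.0) + weight
    return counts


def satisfies_guarantees(answer, counts, s, eps):
    """Check a candidate answer {item: estimate} against guarantees 1 to 3."""
    n = sum(counts.values())  # N: total weighted occurrences
    for u, c in counts.items():
        if c > s * n and u not in answer:
            return False  # guarantee 1: frequent item missing
        if c < (s - eps) * n and u in answer:
            return False  # guarantee 2: infrequent item reported
    for u, est in answer.items():
        c = counts.get(u, 0.0)
        if not (max(0.0, c - eps * n) <= est <= c):
            return False  # guarantee 3: estimate outside allowed range
    return True


# Two hypothetical streams, epoch length 4, evaluated at time 8.
streams = [[("a", 0), ("b", 1), ("a", 5)], [("a", 6), ("c", 7)]]
counts = decayed_counts(streams, t_now=8, alpha=0.5, epoch=4)
print(counts)  # {'a': 2.25, 'b': 0.5, 'c': 1.0}
print(satisfies_guarantees({"a": 2.0}, counts, s=0.5, eps=0.1))  # True
```

Here N = 3.75, so with s = 0.5 only item "a" exceeds s · N = 1.875, and any estimate for "a" between 1.875 and 2.25 is acceptable when ε = 0.1.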
