
CS535 Big Data, 3/4/2020, Week 7-B, Sangmi Lee Pallickara



  1. CS535 Big Data, 3/4/2020, Week 7-B
     Sangmi Lee Pallickara, Computer Science, Colorado State University
     http://www.cs.colostate.edu/~cs535 (Spring 2020)

     CS535 BIG DATA
     PART B. GEAR SESSIONS
     SESSION 2: MACHINE LEARNING FOR BIG DATA

     FAQs
     • Lossy Algorithm

     Topics of Today's Class
     • Programming Assignment #2: Lossy Algorithm
     • GEAR Session 2. Machine Learning for Big Data
       • Lecture 2. Distributed Optimization Problem in Machine Learning

     Programming Assignment 2: Lossy Counting Algorithm

     Algorithm
     • Solves the frequent-element problem
       • Motwani, R; Manku, G.S (2002). "Approximate frequency counts over data streams". VLDB '02 Proceedings of the 28th international conference on Very Large Data Bases: 346–357
     • Divide the incoming stream into buckets of width w = 1/ε
     • Buckets are labeled with integers starting from 1
     • Current bucket number = b_current
       • b_current = ⌈N/w⌉, where N is the number of elements seen so far
     • True frequency of an element e = f_e
     • Data structure D: a set of entries (e, f, Δ)
       • e is an element in the stream
       • f is an integer representing its estimated frequency
       • Δ is the maximum possible error in f
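The bucket bookkeeping and the (e, f, Δ) entry can be sketched in a few lines of Python. This is only an illustration, not the assignment's reference code: the names `Entry`, `bucket_width`, `current_bucket`, and `new_entry` are hypothetical, and rounding 1/ε and N/w up to integers is an assumption taken from the Manku and Motwani formulation.

```python
import math
from typing import Hashable, NamedTuple


class Entry(NamedTuple):
    """One (e, f, delta) record of the summary D (hypothetical name)."""
    e: Hashable   # element from the stream
    f: int        # estimated frequency
    delta: int    # maximum possible error in f


def bucket_width(epsilon: float) -> int:
    # w = 1/epsilon, rounded up to an integer (assumption)
    return math.ceil(1 / epsilon)


def current_bucket(n: int, w: int) -> int:
    # b_current = ceil(N / w), computed with integer arithmetic
    return (n + w - 1) // w


def new_entry(e: Hashable, n: int, w: int) -> Entry:
    # A first-time element starts with f = 1 and delta = b_current - 1
    return Entry(e, 1, current_bucket(n, w) - 1)
```

For ε = 0.2 the width is 5, so elements 1 through 5 fall in bucket 1 and element 6 opens bucket 2.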

  2. Algorithm (continued)
     • When an element arrives
       • Look up whether an entry for that element already exists
       • If there is an entry, increase its frequency f by one
       • Otherwise, create a new entry of the form (e, f, Δ) = (e, 1, b_current − 1)
     • When the new elements fill up the bucket (N mod w == 0)
       • Prune elements: (e, f, Δ) is deleted if f + Δ ≤ b_current
     • When the user requests a list of items with threshold s
       • Outputs are the items with f ≥ (s − ε)N

     Example (ε = 0.2, w = 1/ε = 5, i.e. 5 items per "bucket")
     Stream: bucket 1: 1,2,4,3,4 | bucket 2: 3,4,5,4,6 | bucket 3: 7,3,3,6,1 | bucket 4: 1,3,2,4,7

     1st bucket
     [Bucket 1] b_current = 1, inserted: 1 2 4 3 4
     Insert phase: D (before removing): (x=1;f=1;Δ=0) (x=2;f=1;Δ=0) (x=4;f=2;Δ=0) (x=3;f=1;Δ=0)
     Delete phase: delete elements with f + Δ ≤ b_current (=1)
     D (after removing): (x=4;f=2;Δ=0)
     NOTE: entries with f + Δ ≤ 1 are deleted. New elements added have a maximum count error of 0.

     2nd bucket
     [Bucket 2] b_current = 2, inserted: 3 4 5 4 6
     Insert phase: D (before removing): (x=4;f=4;Δ=0) (x=3;f=1;Δ=1) (x=5;f=1;Δ=1) (x=6;f=1;Δ=1)
     Delete phase: delete elements with f + Δ ≤ b_current (=2)
     D (after removing): (x=4;f=4;Δ=0)
     NOTE: entries with f + Δ ≤ 2 are deleted. New elements added have a maximum count error of 1.

     3rd bucket
     [Bucket 3] b_current = 3, inserted: 7 3 3 6 1
     Insert phase: D (before removing): (x=7;f=1;Δ=2) (x=3;f=2;Δ=2) (x=4;f=4;Δ=0) (x=6;f=1;Δ=2) (x=1;f=1;Δ=2)
     Delete phase: delete elements with f + Δ ≤ b_current (=3)
     D (after removing): (x=4;f=4;Δ=0) (x=3;f=2;Δ=2)
     NOTE: entries with f + Δ ≤ 3 are deleted. New elements added have a maximum count error of 2.

     4th bucket
     [Bucket 4] b_current = 4, inserted: 1 3 2 4 7
     Insert phase: D (before removing): (x=4;f=5;Δ=0) (x=3;f=3;Δ=2) (x=1;f=1;Δ=3) (x=2;f=1;Δ=3) (x=7;f=1;Δ=3)
     Delete phase: delete elements with f + Δ ≤ b_current (=4)
     D (after removing): (x=4;f=5;Δ=0) (x=3;f=3;Δ=2)
     NOTE: entries with f + Δ ≤ 4 are deleted. New elements added have a maximum count error of 3.

     Output
     D: (x=4;f=5;Δ=0) (x=3;f=3;Δ=2)
     For the threshold s = 0.3 (so far, N = 20): (s − ε)N = (0.3 − 0.2) × 20 = 2
     There are only two elements available:

     Item   f (estimated)   f (actual)
     4      5               5
     3      3               5

     If s = 0.5? Then (s − ε)N = (0.5 − 0.2) × 20 = 6, so no element will be returned.
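The four-bucket walkthrough can be replayed with a short Python sketch. It is a minimal illustration under stated assumptions rather than the assignment's reference solution: `lossy_count` and `frequent_items` are hypothetical names, D is kept as a plain dict mapping each element to `[f, Δ]`, and w is taken as 1/ε rounded up.

```python
import math


def lossy_count(stream, epsilon):
    """Run Lossy Counting over `stream`; return (D, N)."""
    w = math.ceil(1 / epsilon)           # bucket width (assumes rounding up)
    d = {}                               # element -> [f, delta]
    n = 0
    for e in stream:
        n += 1
        b_current = (n + w - 1) // w     # ceil(n / w)
        if e in d:
            d[e][0] += 1                 # existing entry: f += 1
        else:
            d[e] = [1, b_current - 1]    # new entry: (e, 1, b_current - 1)
        if n % w == 0:                   # bucket boundary: delete phase
            d = {k: v for k, v in d.items() if v[0] + v[1] > b_current}
    return d, n


def frequent_items(d, n, s, epsilon):
    """Return {element: f} for entries with f >= (s - epsilon) * N."""
    return {e: f for e, (f, _delta) in d.items() if f >= (s - epsilon) * n}


# The 20-element stream from the example (four buckets of five items).
STREAM = [1, 2, 4, 3, 4,  3, 4, 5, 4, 6,  7, 3, 3, 6, 1,  1, 3, 2, 4, 7]

summary, total = lossy_count(STREAM, epsilon=0.2)
print(summary)                                    # entries for 4 and 3 remain
print(frequent_items(summary, total, 0.3, 0.2))   # both items pass
print(frequent_items(summary, total, 0.5, 0.2))   # nothing passes
```

With ε = 0.2 the summary ends as 4 → (f=5, Δ=0) and 3 → (f=3, Δ=2), matching the final state of D in the example. The threshold comparison uses floating-point arithmetic, so a production version might prefer exact fractions when (s − ε)N lands on an integer.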

  3. Why does it work?
     • Lemma 1.
       • When b_current is at a bucket boundary (the point where the most recently started bucket has just filled), the approximate value of b_current = ε × N
     • Lemma 2.
       • If an entry (e, f, Δ) is deleted in the delete phase of the algorithm when b_current = k, then the number of occurrences of e (actual count f_e) is less than or equal to k
       • f_e ≤ b_current

     Infrequent Items are NOT included in D
     • Lemma 3.
       • If an item e is not included in D, then f_e ≤ ε × N
       • i.e., the true frequency count of e is less than or equal to ε × N
     • Case 1. Trivial case
       • If e does not appear in the input stream, then trivially the entry (e, f, Δ) was never entered into D and hence (e, f, Δ) ∉ D
       • We then have f_e = 0, and trivially f_e (= 0) ≤ ε × N is true.

     Lemma 3: continued
     • Case 2:
       • If e was in the input stream, and the entry (e, f, Δ) is not in the output set D, then (e, f, Δ) was deleted in some bucket.
       [Figure: timeline over Batch 1 (e has not appeared yet), Batch 2 ((e,f,Δ) deleted), Batch 3 ((e,f,Δ) is not present)]
       • The maximum actual frequency of e is f_e = f + Δ
       • According to Lemma 2, because (e, f, Δ) is deleted in bucket b_current, the actual count at that moment satisfies f_e ≤ b_current

     Lemma 3: continued
     • Now, according to Lemma 1, at any bucket boundary b_current = ε × N
     • Since the entry (e, f, Δ) was deleted at a bucket boundary, at that time (when (e, f, Δ) was deleted): f_e ≤ b_current = ε × N
     • Since Lemma 3 is true (if (e, f, Δ) ∉ D when the algorithm terminates, then the actual frequency of item e satisfies f_e ≤ ε × N),
     • By the rules of negation (the contrapositive): if the actual frequency of item e satisfies f_e > ε × N, then (e, f, Δ) ∈ D when the algorithm terminates

     Difference between true frequency count and approximate frequency count
     • Lemma 4.
       • If (e, f, Δ) ∈ D, then: f ≤ f_e ≤ f + ε × N
     • Proof.
       • Part 1. f ≤ f_e
       • The value f (a variable in the algorithm) counts the occurrences of e in the input after the entry (e, f, Δ) has been inserted in D, and the entry (e, f, Δ) may have been deleted before, so it is obvious that f ≤ f_e

     Lemma 4: continued
     • Part 2. f_e ≤ f + ε × N
       [Figure: timeline over Batch 1, Batch 2, Batch 3 with occurrences of e; (e,f,Δ) deleted in an earlier batch; the algorithm keeps an exact count of e during the period after the entry is re-inserted]
       • The only occurrences of e that the algorithm fails to count are those that appeared prior to the bucket Δ + 1.
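Lemma 3 and Lemma 4 can also be checked empirically against exact counts. The following Python sketch is an illustration only, not part of the lecture material: `lossy_count` and `check_lemmas` are hypothetical names, the summary D is a dict from element to `(f, Δ)`, and w is assumed to be 1/ε rounded up. It compares the summary with `collections.Counter` on a small stream.

```python
import math
from collections import Counter


def lossy_count(stream, epsilon):
    """Return (D, N), where D maps element -> (f, delta)."""
    w = math.ceil(1 / epsilon)          # bucket width (assumes rounding up)
    d, n = {}, 0
    for e in stream:
        n += 1
        b_current = (n + w - 1) // w    # ceil(n / w)
        f, delta = d.get(e, (0, b_current - 1))
        d[e] = (f + 1, delta)
        if n % w == 0:                  # delete phase at a bucket boundary
            d = {k: v for k, v in d.items() if v[0] + v[1] > b_current}
    return d, n


def check_lemmas(stream, epsilon):
    """True if Lemma 3 and Lemma 4 hold for every element of `stream`."""
    d, n = lossy_count(stream, epsilon)
    bound = epsilon * n
    for e, f_e in Counter(stream).items():
        if e in d:
            f, _delta = d[e]
            # Lemma 4: f <= f_e <= f + epsilon * N
            if not (f <= f_e <= f + bound):
                return False
        elif f_e > bound:
            # Lemma 3: an element missing from D has f_e <= epsilon * N
            return False
    return True


# Example stream from the slides, epsilon = 0.2 (so epsilon * N = 4).
STREAM = [1, 2, 4, 3, 4,  3, 4, 5, 4, 6,  7, 3, 3, 6, 1,  1, 3, 2, 4, 7]
print(check_lemmas(STREAM, 0.2))
```

On the example stream, item 3 has f = 3 and true count 5, which sits inside the Lemma 4 interval [3, 3 + 4]; every pruned item occurs at most 3 times, below ε × N = 4.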
