
compsci 514: algorithms for data science
Cameron Musco
University of Massachusetts Amherst. Fall 2019. Lecture 8

logistics
• Problem Set 1 was due this morning in Gradescope.
• Problem Set 2 will be released tomorrow and due 10/10.

summary
Last Class: Finished up MinHash and LSH.
• Application to fast similarity search.
• False positive and negative tuning with length r hash signatures and t hash table repetitions (s-curves).
• Examples of other locality sensitive hash functions (SimHash).
This Class:
• The Frequent Elements (heavy-hitters) problem in data streams.
• Misra-Gries summaries.
• Count-min sketch.

upcoming
Next Time: Random compression methods for high dimensional vectors. The Johnson-Lindenstrauss lemma.
• Building on the idea of SimHash.
After That: Spectral Methods
• PCA, low-rank approximation, and the singular value decomposition.
• Spectral clustering and spectral graph theory.
Will use a lot of linear algebra. May be helpful to refresh.
• Vector dot product, addition, length. Matrix vector multiplication.
• Linear independence, column span, orthogonal bases, rank.
• Eigendecomposition.

hashing for duplicate detection
All different variants of detecting duplicates/finding matches in large datasets. An important problem in many contexts!

the frequent items problem
k-Frequent Items (Heavy-Hitters) Problem: Consider a stream of n items x_1, ..., x_n (with possible duplicates). Return any item that appears at least n/k times. E.g., for n = 9, k = 3: return any item that appears at least 3 times.
• What is the maximum number of items that must be returned? At most k items with frequency ≥ n/k.
• Think of k = 100. Want items appearing ≥ 1% of the time.
• Easy with O(n) space: store the count for each item and return those that appear ≥ n/k times.
• Can we do it with less space? I.e., without storing all n items?
• Similar challenge as with the distinct elements problem.
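The O(n)-space baseline from the slide above can be sketched as follows. This is only an illustration of exact counting, not a low-space method; the function name `frequent_items_exact` and the sample stream are hypothetical and chosen to match the n = 9, k = 3 setting.

```python
from collections import Counter


def frequent_items_exact(stream, k):
    """Return every item that appears at least n/k times.

    Keeps one counter per distinct item, so the space can grow to O(n).
    """
    counts = Counter()
    n = 0
    for x in stream:
        counts[x] += 1
        n += 1
    # count >= n/k, written as count * k >= n to stay in integer arithmetic
    return {x for x, c in counts.items() if c * k >= n}


# Hypothetical stream with n = 9 items; with k = 3 the threshold is 3 occurrences.
stream = [5, 12, 3, 3, 4, 5, 5, 10, 3]
heavy = frequent_items_exact(stream, 3)
print(heavy)  # the items 3 and 5, each appearing 3 times
```

At most k items can ever be returned, since k + 1 items with frequency ≥ n/k would account for more than n stream elements.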

the frequent items problem
Applications of Frequent Items:
• Finding top/viral items (i.e., products on Amazon, videos watched on Youtube, Google searches, etc.)
• Finding very frequent IP addresses sending requests (to detect DoS attacks/network anomalies).
• ‘Iceberg queries’ for all items in a database with frequency above some threshold.
Generally want very fast detection, without having to scan through database/logs. I.e., want to maintain a running list of frequent items that appear in a stream.

frequent itemset mining
Association rule learning: A very common task in data mining is to identify common associations between different events.
• Identified via frequent itemset counting. Find all sets of k items that appear many times in the same basket.
• Frequency of an itemset is known as its support.
• A single basket includes many different itemsets, and with many different baskets an efficient approach is critical.
E.g., baskets are Twitter users and itemsets are subsets of who they follow.
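To make "support" concrete, here is a minimal brute-force sketch that counts the support of every size-k itemset. The function name `itemset_supports`, the grocery baskets, and the support threshold of 3 are all hypothetical; the slides do not prescribe this method, and it enumerates every k-subset of every basket, which is exactly the cost an efficient approach tries to avoid.

```python
from collections import Counter
from itertools import combinations


def itemset_supports(baskets, k):
    """Count, for each size-k itemset, the number of baskets containing it."""
    supports = Counter()
    for basket in baskets:
        # Deduplicate and sort so each itemset maps to one canonical tuple.
        for itemset in combinations(sorted(set(basket)), k):
            supports[itemset] += 1
    return supports


# Hypothetical baskets, for illustration only.
baskets = [
    ["bread", "milk", "eggs"],
    ["bread", "milk"],
    ["milk", "eggs"],
    ["bread", "milk", "eggs"],
]
supports = itemset_supports(baskets, 2)
frequent_pairs = {s for s, c in supports.items() if c >= 3}
print(frequent_pairs)  # the pairs (bread, milk) and (eggs, milk)
```

A basket with b items contributes b-choose-k itemsets, so the number of counters grows quickly with basket size.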

majority in data streams
Majority: Consider a stream of n items x_1, ..., x_n, where a single item appears a majority of the time. Return this item.
• Basically k-Frequent items for k = 2 (and assume a single item has a strict majority.)

boyer-moore algorithm
Boyer-Moore Voting Algorithm: (our first deterministic algorithm)
• Initialize count c := 0, majority element m := ⊥
• For i = 1, ..., n
  • If c = 0, set m := x_i and c := 1.
  • Else if m = x_i, set c := c + 1.
  • Else if m ≠ x_i, set c := c − 1.
Just requires O(log n) bits to store c and space to store m.
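The pseudocode above translates almost line for line into Python. In this sketch `None` stands in for ⊥, and the function name `boyer_moore_majority` and the sample stream are illustrative choices rather than anything from the slides.

```python
def boyer_moore_majority(stream):
    """One pass over the stream, keeping only a candidate m and a counter c."""
    c = 0
    m = None  # stands in for the slide's ⊥
    for x in stream:
        if c == 0:
            m, c = x, 1
        elif m == x:
            c += 1
        else:
            c -= 1
    return m


# "a" appears 4 times out of 7, a strict majority.
print(boyer_moore_majority(["a", "b", "a", "c", "a", "a", "b"]))  # prints a
```

The output is only meaningful under the slide's assumption that a strict majority element exists; without that assumption the returned candidate would need to be verified, for example with a second counting pass.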

correctness of boyer-moore
Boyer-Moore Voting Algorithm:
• Initialize count c := 0, majority element m := ⊥
• For i = 1, ..., n
  • If c = 0, set m := x_i and c := 1.
  • Else if m = x_i, set c := c + 1.
  • Else if m ≠ x_i, set c := c − 1.
Claim: The Boyer-Moore algorithm always outputs the majority element, regardless of what order the stream is presented in.
Proof: Let M be the true majority element. Let s = c when m = M and s = −c otherwise (s is a ‘helper’ variable).
• s is incremented each time M appears, and changes by at most 1 each time any other item appears. So it is incremented more than it is decremented (since M appears a majority of times) and ends at a positive value. ⟹ algorithm ends with m = M.
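The helper variable s from the proof can be tracked alongside the algorithm to see the argument in action. This is a small instrumented sketch, assuming the true majority element M is supplied by the caller purely for the purpose of the trace; the function name `trace_helper_variable` and the sample stream are hypothetical.

```python
def trace_helper_variable(stream, M):
    """Run Boyer-Moore and record s after each item: s = c if m == M, else -c."""
    c, m = 0, None
    trace = []
    for x in stream:
        if c == 0:
            m, c = x, 1
        elif m == x:
            c += 1
        else:
            c -= 1
        trace.append(c if m == M else -c)
    return m, trace


# "a" appears 3 times out of 5, a strict majority.
m, trace = trace_helper_variable(["b", "a", "a", "c", "a"], "a")
print(m, trace)  # a [-1, 0, 1, 0, 1]
```

In the trace, s starts at 0, rises by one at each occurrence of "a" (positions 2, 3, and 5), and falls by one at the two other items, so it ends positive and the final candidate is "a".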

