Data-Intensive Distributed Computing
Part 6: Data Mining (3/4)
CS 431/631 451/651 (Fall 2019)
Ali Abedi
November 5, 2019
These slides are available at https://www.student.cs.uwaterloo.ca/~cs451
Thanks to Jure Leskovec, Anand Rajaraman, Jeff Ullman (Stanford University)
2nd element of the permutation is the first to map to a 1
4th element of the permutation is the first to map to a 1
Note: Another (equivalent) way is to store row indexes: 1
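The two conventions above can be sketched in a few lines of Python. This is a minimal illustration, not the slides' own code: it assumes a column is represented as the set of row indexes that hold a 1, and a permutation as a list of row indexes in permuted order. The names `minhash_position` and `minhash_row` are hypothetical.

```python
def minhash_position(perm, column):
    """1-based position of the first element of `perm` that maps to a 1.

    perm   -- list of row indexes in permuted order (assumed representation)
    column -- set of row indexes where the column holds a 1
    """
    for position, row in enumerate(perm, start=1):
        if row in column:
            return position
    return None  # the column contains no 1s


def minhash_row(perm, column):
    """Equivalent convention: store the row index instead of the position."""
    for row in perm:
        if row in column:
            return row
    return None  # the column contains no 1s


# Hypothetical example data: 5 rows, one permutation, two columns.
perm = [3, 1, 4, 0, 2]
col_a = {1, 2}   # first hit is row 1, the 2nd element of the permutation
col_b = {0, 2}   # first hit is row 0, the 4th element of the permutation
```

Both functions identify the same row; they differ only in whether the signature stores the position within the permutation or the row index itself.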
One of the two cols had to have 1 at position y
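The observation above is the key step in the usual argument that two columns receive the same min-hash value with probability equal to their Jaccard similarity: the first row (position y) at which either column has a 1 produces a match exactly when both columns have a 1 there. The sketch below checks this by brute force over every permutation of a small row set. The set-based column representation and all function names are assumptions made for illustration.

```python
from fractions import Fraction
from itertools import permutations


def first_row(perm, column):
    """Row index of the first element of `perm` that lies in `column`."""
    return next(row for row in perm if row in column)


def agreement_fraction(c1, c2, n_rows):
    """Exact fraction of permutations on which the two min-hash values agree.

    Assumes both columns are non-empty sets of row indexes in range(n_rows).
    """
    agree = 0
    total = 0
    for perm in permutations(range(n_rows)):
        total += 1
        if first_row(perm, c1) == first_row(perm, c2):
            agree += 1
    return Fraction(agree, total)


def jaccard(c1, c2):
    """Jaccard similarity |C1 ∩ C2| / |C1 ∪ C2|."""
    return Fraction(len(c1 & c2), len(c1 | c2))


# Hypothetical columns over 5 rows: they share rows 1 and 3.
c1 = {0, 1, 3}
c2 = {1, 3, 4}
```

For these two columns, exhaustive enumeration and the Jaccard formula give the same value, which is what the argument predicts.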
How to pick a random hash function h(x)?
Universal hashing:
h_{a,b}(x) = ((a·x + b) mod p) mod N
where:
a, b … random integers
p … prime number (p > N)
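A minimal Python sketch of the universal hashing formula follows, used here to simulate permutations when building a min-hash signature. Several details are assumptions rather than part of the slides: a is drawn from [1, p-1] and b from [0, p-1], p is fixed to the Mersenne prime 2^31 - 1, a column is a non-empty set of row indexes, and the names `make_universal_hash` and `minhash_signature` are hypothetical.

```python
import random

P = 2_147_483_647  # 2**31 - 1, a prime; assumed to exceed N


def make_universal_hash(p, n, rng):
    """Return h(x) = ((a*x + b) mod p) mod n with random a, b.

    The ranges for a and b are an assumption; the slide only says
    'random integers'.
    """
    a = rng.randrange(1, p)  # a != 0 so that h actually depends on x
    b = rng.randrange(0, p)

    def h(x):
        return ((a * x + b) % p) % n

    return h


def minhash_signature(column, hash_funcs):
    """One min-hash value per hash function: the smallest h(row) over the column."""
    return [min(h(row) for row in column) for h in hash_funcs]


# Hypothetical usage: 8 hash functions over N = 1000 buckets.
N = 1000
hash_funcs = [make_universal_hash(P, N, random.Random(seed)) for seed in range(8)]
signature = minhash_signature({1, 5, 9}, hash_funcs)
```

Each hash function stands in for one random permutation of the rows, so the signature can be computed without materializing any permutation.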