


  1. CS535 Big Data, 4/22/2020, Week 13-A, Sangmi Lee Pallickara
     CS435 Introduction to Big Data - Spring 2016
     CS535 BIG DATA
     PART B. GEAR SESSIONS
     SESSION 5: ALGORITHMIC TECHNIQUES FOR BIG DATA
     Sangmi Lee Pallickara, Computer Science, Colorado State University
     http://www.cs.colostate.edu/~cs535
     FAQs
     • Your disk quota is 20GB (per student)
     • If you need more space, please let me know ASAP
     Spring 2020, Colorado State University

  2. Topics of Today's Class
     • Part 1: Counting Triangles (from the last lecture)
     • Part 2: Locality Sensitive Hashing
     GEAR Session 4. Large Scale Recommendation Systems and Social Media
     Lecture 4. Social Network Analysis
     Counting Triangles

  3. Why Count Triangles?: Probability for Random Graphs
     • If we start with n nodes and add m edges to a graph at random, there will be an expected number of triangles in the graph
     • There are $\binom{n}{3}$ sets of three nodes
       • Approximately $n^3/6$ sets of three nodes that might be a triangle
     • The probability of an edge between any two given nodes being added
       • $m/\binom{n}{2}$
       • Approximately $2m/n^2$
     • The probability that any set of three nodes has edges between each pair
       • If those edges are independently chosen to be present or absent
       • Approximately $(2m/n^2)^3 = 8m^3/n^6$
     • Thus, the expected number of triangles in a graph of n nodes and m randomly selected edges
       • Approximately $(8m^3/n^6)(n^3/6) = \frac{4}{3}(m/n)^3$
     Why Count Triangles?: How about a Social Network Graph?
     • If a graph is a social network graph,
       • n nodes (n users)
       • m edges (m pairs of friends)
     • Do we expect the number of triangles to be
       a. The same as the value for a random graph
       b. Much greater than the value for a random graph
       c. Much smaller than the value for a random graph
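As a quick numerical check of the random-graph estimate, the sketch below compares the count obtained from the exact binomial coefficients with the closed-form approximation $(8m^3/n^6)(n^3/6)$, which simplifies to $\frac{4}{3}(m/n)^3$. The function names and the sample values of n and m are illustrative assumptions, not taken from the slides; edge independence is assumed, as on the slide.

```python
from math import comb


def expected_triangles_exact(n: int, m: int) -> float:
    """Expected triangles if each pair of nodes is joined independently
    with probability p = m / C(n, 2)."""
    p = m / comb(n, 2)          # probability that a given edge is present
    return comb(n, 3) * p ** 3  # C(n, 3) candidate triples, 3 edges each


def expected_triangles_approx(n: int, m: int) -> float:
    """Large-n approximation: (8 m^3 / n^6) * (n^3 / 6) = (4/3) (m/n)^3."""
    return (4.0 / 3.0) * (m / n) ** 3


# Hypothetical sizes chosen only for illustration.
n, m = 10_000, 50_000
print(expected_triangles_exact(n, m))
print(expected_triangles_approx(n, m))
```

For large n the two values agree closely, which is why the slide works with the simpler approximate form.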

  4. Why Count Triangles?: How about a Social Network Graph?
     • If a graph is a social network graph,
       • n nodes (n users)
       • m edges (m pairs of friends)
     • Do we expect the number of triangles to be
       a. The same as the value for a random graph
       b. Much greater than the value for a random graph
       c. Much smaller than the value for a random graph
     • Why? If A and B are friends, and A is also a friend of C, there should be a much greater chance than average that B and C are also friends
     • Counting the number of triangles helps us to measure the extent to which a graph looks like a social network
     What Else Can We Do with Counting Triangles?
     • Counting the number of triangles helps us to measure the extent to which a graph looks like a social network
     • It also shows some characteristics of social networks
       • E.g., the age of a community is related to the density of triangles

  5. An Algorithm for Finding Triangles
     • Suppose we have a graph of n nodes and m (≥ n) edges. For convenience, assume the nodes are integers 1, 2, . . . , n
     • Heavy hitter
       • A node whose degree is at least $\sqrt{m}$
     • Heavy hitter triangle
       • A triangle all three of whose nodes are heavy hitters
     • Note that the number of heavy hitter nodes is no more than $2\sqrt{m}$
       • Otherwise, the sum of the degrees of the heavy hitter nodes would be more than 2m
       • Each edge contributes to the degree of only two nodes, so the degrees sum to exactly 2m
     1. Preparing the Data Structures
     • Step 1. Compute the degree of each node
       • Examine each edge and add 1 to the count of each of its two nodes
       • The total time required is O(m)
     • Step 2. Create an index on edges, with the pair of nodes at its ends as the key
       • Given two nodes, the index tells us whether the edge between them exists
       • A hash table suffices
       • Building it takes O(m) time
       • The expected time to answer a query about the existence of an edge is a constant
     • Step 3. Create another index of edges, this one with key equal to a single node
       • Given a node v, we can retrieve the nodes adjacent to v in time proportional to the number of those nodes
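The three preparation steps can be sketched in Python as follows. This is a minimal illustration that assumes a simple undirected graph (no self-loops, no duplicate edges) given as a list of node pairs; `build_indexes` and `has_edge` are hypothetical names introduced here, and a Python `set` stands in for the hash table mentioned in Step 2.

```python
from collections import defaultdict


def build_indexes(edges):
    """Build the three structures from Steps 1-3 in one O(m) pass.

    Assumes `edges` lists each undirected edge exactly once.
    """
    degree = defaultdict(int)      # Step 1: node -> degree
    edge_index = set()             # Step 2: hash index keyed by node pair
    adjacency = defaultdict(list)  # Step 3: node -> adjacent nodes
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        edge_index.add((min(u, v), max(u, v)))  # normalise pair order
        adjacency[u].append(v)
        adjacency[v].append(u)
    return degree, edge_index, adjacency


def has_edge(edge_index, u, v):
    """Expected O(1) membership test for the edge {u, v}."""
    return (min(u, v), max(u, v)) in edge_index


# Illustrative graph: a triangle 1-2-3 with a pendant node 4.
degree, edge_index, adjacency = build_indexes([(1, 2), (2, 3), (1, 3), (3, 4)])
print(dict(degree))
print(has_edge(edge_index, 3, 1), has_edge(edge_index, 1, 4))
```

Storing each pair in a fixed order means a lookup does not depend on the order in which the two endpoints are supplied.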

  6. 2. Sorting Nodes
     • Sorting nodes
       • First criterion: by degree
       • Second criterion: if v and u have the same degree, recall that both v and u are integers, so order them numerically
     • Therefore, we say v ≺ u if and only if either
       1) The degree of v is less than the degree of u, or
       2) The degrees of u and v are the same, and v < u
     3. Counting Triangles [1/2]
     • Heavy-Hitter Triangles
       • There are only $O(\sqrt{m})$ heavy-hitter nodes
       • There are $O(m^{3/2})$ possible heavy-hitter triangles, and using the index on edges we can check whether all three edges exist in O(1) time. Therefore, $O(m^{3/2})$ time is needed to find all the heavy-hitter triangles
     • Other Triangles
       • Consider each edge $(v_1, v_2)$
       • If both $v_1$ and $v_2$ are heavy hitters, ignore this edge
       • Suppose that $v_1$ is not a heavy hitter and moreover $v_1 \prec v_2$
       • Let $u_1, u_2, \ldots, u_k$ be the nodes adjacent to $v_1$
       • Note that $k < \sqrt{m}$
       • We can find these nodes, using the index on nodes, in O(k) time, which is surely $O(\sqrt{m})$ time
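The ordering ≺ and the heavy-hitter test can be expressed compactly. The sketch below assumes the node degrees are already available in a dictionary; `precedence_key` and `heavy_hitters` are hypothetical helper names, and the degree table is made-up sample data.

```python
from math import sqrt


def precedence_key(node, degree):
    """Sort key for the ordering: compare by degree first, then by node id."""
    return (degree[node], node)


def heavy_hitters(degree, m):
    """Nodes whose degree is at least sqrt(m), where m is the edge count."""
    threshold = sqrt(m)
    return {v for v, d in degree.items() if d >= threshold}


# Illustrative degrees: a 4-clique on nodes 1-4 with a tail 4-5-6.
degree = {1: 3, 2: 3, 3: 3, 4: 4, 5: 2, 6: 1}
m = sum(degree.values()) // 2  # every edge adds 2 to the degree sum

ordered = sorted(degree, key=lambda v: precedence_key(v, degree))
heavy = heavy_hitters(degree, m)
print(ordered)
print(sorted(heavy), len(heavy) <= 2 * sqrt(m))
```

Comparing `(degree, node)` tuples gives exactly the two-level rule: Python compares the first components and falls back to the node number only on a tie.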

  7. 3. Counting Triangles [2/2]
     • Other Triangles (continued)
       • For each $u_i$ we can use the first index to check whether edge $(u_i, v_2)$ exists in O(1) time
       • We can also determine the degree of $u_i$ in O(1) time, because we have counted all the nodes' degrees
       • We count the triangle $\{v_1, v_2, u_i\}$ if and only if the edge $(u_i, v_2)$ exists, and $v_1 \prec u_i$
       • A triangle is counted only once
         • $v_1$ is the node of the triangle that precedes both other nodes of the triangle according to the ≺ ordering
       • The time to process all the nodes adjacent to $v_1$ is $O(\sqrt{m})$
       • Since there are m edges, the total time spent counting other triangles is $O(m^{3/2})$
     • The time to find heavy hitter triangles is $O(m^{3/2})$ and so is the time to find the other triangles
     • Thus, the total time of the algorithm is $O(m^{3/2})$
     GEAR Session 5. Algorithmic Techniques for Big Data
     Lecture 1. Locality Sensitive Hashing
     Introduction
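Putting the pieces of the triangle-counting algorithm together, a possible Python sketch is shown below. It assumes a simple undirected graph given as a list of integer node pairs, and `count_triangles` is a hypothetical name. One detail is an addition of this sketch rather than something stated on the slides: in the second phase it requires $v_1 \prec v_2 \prec u_i$, so that a triangle whose lowest node is $v_1$ is reported from only one of the two edges incident to $v_1$.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def count_triangles(edges):
    """Count triangles with the heavy-hitter scheme (sketch).

    Assumes each undirected edge appears once and there are no self-loops.
    """
    degree = defaultdict(int)
    adjacency = defaultdict(list)
    edge_index = set()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
        adjacency[a].append(b)
        adjacency[b].append(a)
        edge_index.add((min(a, b), max(a, b)))

    m = len(edge_index)
    heavy = {v for v, d in degree.items() if d >= sqrt(m)}

    def has_edge(a, b):
        return (min(a, b), max(a, b)) in edge_index

    def precedes(a, b):
        # a precedes b: lower degree first, ties broken by node number.
        return (degree[a], a) < (degree[b], b)

    count = 0

    # Phase 1: triangles whose three nodes are all heavy hitters.
    for a, b, c in combinations(sorted(heavy), 3):
        if has_edge(a, b) and has_edge(b, c) and has_edge(a, c):
            count += 1

    # Phase 2: all other triangles, found from their lowest node v1.
    for a, b in edge_index:
        if a in heavy and b in heavy:
            continue  # handled in phase 1
        v1, v2 = (a, b) if precedes(a, b) else (b, a)
        # v1 is not a heavy hitter here, so it has fewer than sqrt(m) neighbours.
        for u in adjacency[v1]:
            # Assumption of this sketch: also require v2 to precede u,
            # which avoids reporting the same triangle from edge (v1, u).
            if precedes(v2, u) and has_edge(u, v2):
                count += 1
    return count


# A 4-clique on nodes 1-4 plus a tail 4-5-6.
example = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4), (4, 5), (5, 6)]
print(count_triangles(example))  # prints 4
```

In the example all four clique nodes are heavy hitters, so the four triangles come from phase 1; a star with a few extra edges between leaves exercises phase 2 instead.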

  8. Traditional hash functions
     • Cryptographic hash functions (e.g., SHA-1)
       • Should be difficult to reverse
     • Designed to map data to an integer that can be used to look up a particular bucket (e.g., in a hash table)
     • Key properties for the non-cryptographic hash functions
       • Efficiently computable
       • Should uniformly distribute the keys
     • Two inputs produce hash outputs that are either the same or different depending on key properties of the inputs
     Locality-sensitive hash functions
     • Hash value collisions are more likely for two input values that are
       • "Close" together than for inputs that are far apart
     • Many different definitions of "closeness"
       • Neighboring
       • Similarity
       • …
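To make the idea of a locality-sensitive hash concrete, here is a minimal sketch of random-hyperplane hashing, one well-known LSH family in which "closeness" means a small angle between vectors. This choice of family, the function names, and the sample vectors are assumptions made for illustration; the slides do not prescribe a particular scheme.

```python
import random


def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random Gaussian vectors; each defines one hash bit."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def lsh_signature(vector, hyperplanes):
    """One bit per hyperplane: which side of the plane the vector lies on."""
    return tuple(
        1 if sum(x * r for x, r in zip(vector, plane)) >= 0 else 0
        for plane in hyperplanes
    )


def hamming(sig_a, sig_b):
    """Number of bit positions in which two signatures differ."""
    return sum(a != b for a, b in zip(sig_a, sig_b))


planes = make_hyperplanes(dim=3, n_bits=16)
a = [1.0, 2.0, 3.0]
near = [1.1, 2.0, 2.9]     # small angle to a
far = [-1.0, -2.0, -3.0]   # opposite direction to a

sig_a = lsh_signature(a, planes)
print(hamming(sig_a, lsh_signature(near, planes)))
print(hamming(sig_a, lsh_signature(far, planes)))
```

Vectors separated by a small angle tend to agree on most bits, so they are likely to share a bucket when groups of bits are used as keys, whereas a cryptographic hash such as SHA-1 is designed so that similar inputs give unrelated outputs.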

  9. Finding the most similar documents
     • Measuring similarity between all pairs of documents
       • Extremely expensive
     • Example
       • 1M documents, signatures of length 250 (4 bytes each)
       • 1M x 1,000 bytes = 1GB
       • Number of comparisons = $\binom{10^6}{2}$ (half a trillion pairs)
       • 1 µs per calculation of similarity
       • About 6 days to complete the computation ($5 \times 10^{11}$ pairs at 1 µs each is about $5 \times 10^{5}$ s)
     • Do we need to calculate the similarity for all of the pairs?
     Distance measure
     • A distance measure over a space is a function d(x,y) that takes two points in the space as arguments and produces a real number that satisfies the following axioms:
       1. d(x,y) ≥ 0 (no negative distance)
       2. d(x,y) = 0 if and only if x = y
       3. d(x,y) = d(y,x) (distance is symmetric)
       4. d(x,y) ≤ d(x,z) + d(z,y) (the triangle inequality)
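The four axioms can be spot-checked on a finite sample of points. The sketch below is a brute-force test harness, not a proof: passing it only shows that no violation occurs among the sampled points. The function name `satisfies_distance_axioms` and the sample points are assumptions for illustration; Euclidean distance (`math.dist`) serves as a known distance measure, and squared Euclidean distance as a function that fails the triangle inequality.

```python
from itertools import product
from math import dist, isclose


def satisfies_distance_axioms(d, points, tol=1e-9):
    """Check the four axioms for d on every pair/triple drawn from points."""
    for x, y in product(points, repeat=2):
        if d(x, y) < 0:                       # axiom 1: non-negativity
            return False
        if (d(x, y) == 0) != (x == y):        # axiom 2: zero iff identical
            return False
        if not isclose(d(x, y), d(y, x), abs_tol=tol):  # axiom 3: symmetry
            return False
    for x, y, z in product(points, repeat=3):
        if d(x, y) > d(x, z) + d(z, y) + tol:  # axiom 4: triangle inequality
            return False
    return True


def squared_euclidean(x, y):
    """Not a distance measure: violates the triangle inequality."""
    return dist(x, y) ** 2


plane_points = [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0), (-2.0, 5.0)]
line_points = [(0.0,), (1.0,), (2.0,)]

print(satisfies_distance_axioms(dist, plane_points))
print(satisfies_distance_axioms(squared_euclidean, line_points))
```

On the three collinear points, the squared distance from 0 to 2 is 4, which exceeds 1 + 1 via the midpoint, so axiom 4 fails for that function.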
