CS535 Big Data 1/30/2019 Week 2- B Sangmi Lee Pallickara Week - PDF document

CS535 Big Data 1/30/2019 Week 2- B Sangmi Lee Pallickara Week 2-A-0 Week 2-A-1 1/30/2019 Colorado State University, Spring 2019 CS535 BIG DATA FAQs • Term project deliverable 0 • Item 1: Your team members PART A. BIG DATA TECHNOLOGY • Item 2: Tentative project titles (up to 3) 3. DISTRIBUTED COMPUTING • Submission deadline: Feb. 1 MODELS FOR SCALABLE BATCH • Via email or canvas COMPUTING • PA1 • Hadoop and Spark installation guides are posted • If you would like to start your homework, please send me an email with your team information. I will assign the port range for your team. Sangmi Lee Pallickara • Quiz 1: February 4. 2019 in class Computer Science, Colorado State University http://www.cs.colostate.edu/~cs535 1/30/2019 Week 2-A-2 1/30/2019 Week 2-A-3 Colorado State University, Spring 2019 Colorado State University, Spring 2019 Topics of Todays Class • Overview of the Programing Assignment 1 • 3. Distributed Computing Models for Scalable Batch Computing • MapReduce Programming Assignment 1 Hyperlink-Induced Topic Search (HITS) Week 2-A-4 Week 2-A-5 1/30/2019 Colorado State University, Spring 2019 1/30/2019 Colorado State University, Spring 2019 This material is built based on Types of Web queries • Kleinberg, Jon. "Authoritative sources in a hyperlinked environment". Journal of the • Yes/No queries ACM . 46 (5): 604–632 • Does Chrome support .ogv video format? • Broad topic queries • Find information about “polar vortex” • Similar-page query • Find pages similar to ‘https://stackoverflow.com’ Image credit: https://www.cnn.com/2019/01/30/weather/winter-weather-wednesday-wxc/index.html http://www.cs.colostate.edu/~cs535 Spring 2019 Colorado State University 1

CS535 Big Data 1/30/2019 Week 2- B Sangmi Lee Pallickara Week 2-A-6 Week 2-A-7 1/30/2019 Colorado State University, Spring 2019 1/30/2019 Colorado State University, Spring 2019 Ranking algorithm to find the most “authoritative” pages Challenge of content-based ranking • To find the small set of the most authoritative pages that are relevant to the query • Most useful pages do not include the keyword (that the users are looking for) • ” computer ” in the APPLE page? • Examples of the authoritative pages • For the topic, “python” • https://www.python.org/ • For the information about “Colorado State University” • https://www.colostate.edu/ • For the images about ”iPhone” • https://www.apple.com/iphone/ Captured Jan.30, 2019 1/30/2019 Week 2-A-8 1/30/2019 Week 2-A-9 Colorado State University, Spring 2019 Colorado State University, Spring 2019 Challenge of content-based ranking Challenge of content-based ranking • How about IBM’s web page? • Pages are not sufficiently descriptive • “ health care ” in Poudre Valley Hospital? Captured Jan.30, 2019 Captured Jan.30, 2019 Week 2-A-10 Week 2-A-11 1/30/2019 Colorado State University, Spring 2019 1/30/2019 Colorado State University, Spring 2019 HITS (Hipertext-Induced Topic Search) HITS (Hypertext-Induced Topic Search) • PageRank captures simplistic view of a network • A.K.A. Hubs and Authorities • Jon Kleinberg 1997 • Topic search • Authority • Automatically determine hubs/authorities • A Web page with good, authoritative content on a specific topic • A Web page that is linked by many hubs • In practice • Performed only on the result set (PageRank is applied on the complete set of documents) • Hub • Developed for the IBM Clever project • A Web page pointing to many authoritative Web pages • Used by Teoma (later Ask.com) • e.g. portal pages (Yahoo) http://www.cs.colostate.edu/~cs535 Spring 2019 Colorado State University 2

CS535 Big Data 1/30/2019 Week 2- B Sangmi Lee Pallickara Week 2-A-12 Week 2-A-13 1/30/2019 Colorado State University, Spring 2019 1/30/2019 Colorado State University, Spring 2019 Understanding Authorities and Hubs [1/2] Understanding Authorities and Hubs [2/2] • Intuitive Idea to find authoritative results using link analysis : • A good hub page points to many good authoritative pages • Not all hyperlinks are related to the conferral of authority • A good authoritative page is pointed to by many good hub pages • Patterns that authoritative pages have • Authoritative Pages share considerable overlap in the sets of pages that point to them. • Authorities and hubs have a mutual reinforcement relationship Authorities Hubs 1/30/2019 Week 2-A-14 1/30/2019 Week 2-A-15 Colorado State University, Spring 2019 Colorado State University, Spring 2019 P1 Calculating Authority/Hub scores [1/3] Calculating Authority/Hub scores [2/3] Let there be n Web pages Each Web page has an authority score a i and a hub Define the n x n adjacency matrix A such that, P1 score h i . A uv = 1 if there is a link from u to v. P2 P4 We define the authority score by summing up the Otherwise A uv = 0 hub scores that point to it, ( ! " = $ ℎ % * %" 0 1 1 1 P1 P4 Graph with pages P2 %&' P3 0 0 1 1 P2 0 1 1 1 j: row # in the matrix 1 0 0 1 0 0 1 1 i: column # in the matrix P3 Graph with pages 0 0 0 1 P3 1 0 0 1 P4 This can be written concisely as, 0 0 0 1 ! = * + ℎ Week 2-A-16 Week 2-A-17 1/30/2019 Colorado State University, Spring 2019 1/30/2019 Colorado State University, Spring 2019 P1 P1 Calculating Authority/Hub scores [3/3] Hubs and Authorities Let’s start arbitrarily from a 0 =1, h 0 =1 , where 1 is the all- one vector. Similarly, we define the hub score by summing up P4 P4 P2 P2 a 0 =(1,1,1,1) the authority scores ! " , h 0 =(1,1,1,1) ) Repeating this, the sequences a 0 , a 1 , a 2 ,… and h 0 , h 1 , h 2 ,… ℎ $ = & ! " * "$ converge (to limits x * and y * ) "'( Graph with pages a 1 =(((1 x 0)+(1 x 0)+(1 x 1)+(1 x 0)), P3 P3 Graph with pages ((1 x 1)+(1 x 0)+(1 x 0)+(1 x 0)), j: row # in the matrix 0 1 1 1 ((1 x 1)+(1 x 1)+(1 x 0)+(1 x 0)), i: column # in the matrix 0 1 1 1 0 0 1 1 ((1 x 1)+(1 x 1)+(1 x 1)+(1 x 1))) = (1,1,2,4) 0 0 1 1 This can be written concisely as, Normalize it: (1/(1+1+2+4), 1/(1+1+2+4), 2/(1+1+2+4), 1 0 0 1 ℎ = *! 4/(1+1+2+4)) = (1/8, 1/8, ¼ , ½ ) 1 0 0 1 0 0 0 1 a 1 = (1/8, 1/8, ¼ , ½ ) ( ß authority values after the first 0 0 0 1 iteration) http://www.cs.colostate.edu/~cs535 Spring 2019 Colorado State University 3

CS535 Big Data 1/30/2019 Week 2- B Sangmi Lee Pallickara Week 2-A-18 Week 2-A-19 1/30/2019 Colorado State University, Spring 2019 1/30/2019 Colorado State University, Spring 2019 Hubs and Authorities Implementing Topic Search using HITS Graph with pages Let’s start arbitrarily from a 0 =1, h 0 =1 , where 1 is the all-one vector. • Step 1. 0 1 1 1 a 0 =(1,1,1,1) • Constructing a focused subgraph based on a query 0 0 1 1 h 0 =(1,1,1,1) a 1 =(1/8, 1/8, ¼ , ½ ) 1 0 0 1 h 1 =(((1/8 x 0)+(1/8 x 1)+(1/4 x 1)+(1/2 x 1)), • Step 2. ((1/8 x 0)+(1/8 x 0)+(1/4 x 1)+(1/2 x 1)), 0 0 0 1 • Iteratively calculate the authority value and hub value of the page in the subgraph ((1/8 x 1)+(1/8 x 0)+(1/4 x 0)+(1/2 x 1)), ((1/8 x 0)+(1/8 x 0)+(1/4 x 0)+(1/2 x 1))) = (7/8,6/8,5/8, 4/8) After the normalization: h 1 =(7/22,6/22,5/22, 4/22) ( ß hub values after the first iteration) 1/30/2019 Week 2-A-20 1/30/2019 Week 2-A-21 Colorado State University, Spring 2019 Colorado State University, Spring 2019 Step 1. Constructing a focused subgraph (root set) Step 2. Constructing a focused subgraph ( base set ) • For each page p ∈ R • Generate a root set from a text-based search engine • Add the set of all pages p points to • e.g. pages containing query words • Add the set of all pages pointing to p Root set Base set Week 2-A-22 Week 2-A-23 1/30/2019 Colorado State University, Spring 2019 1/30/2019 Colorado State University, Spring 2019 Step 3. Initial values Step 4. After the first iteration Nodes Hubs Authority Nodes Hubs Authority P1 1 1 P1 7/22 1/8 P1 P1 P2 1 1 P2 6/22 1/8 P3 1 1 P3 5/22 2/8 P4 1 1 P4 4/22 4/8 Ranks P4 P4 P2 P2 Hub: P1>P2>P3>P4 Ranks Authority: P1=P2<P3<P4 Hub: P1=P2=P3=P4 Authority: P1=P2=P3=P4 Normalization P3 • Original paper: using squares sum (to 1) P3 • You can use sum (to 1) • value = value/(sum of all values) http://www.cs.colostate.edu/~cs535 Spring 2019 Colorado State University 4

CS535 Big Data 1/30/2019 Week 2- B Sangmi Lee Pallickara Week - PDF document

CS535 Big Data 1/30/2019 Week 2- B Sangmi Lee Pallickara Week 2-A-0 Week 2-A-1 1/30/2019 Colorado State University, Spring 2019 CS535 BIG DATA FAQs Term project deliverable 0 Item 1: Your team members PART A. BIG DATA TECHNOLOGY

CS535 Big Data 1/22/2020 Sangmi Lee Pallickara CS535 Big Data | Computer Science Department

CS535 Big Data 3/4/2020 Week 7-B Sangmi Lee Pallickara CS535 Big Data | Computer Science |

CS535 Big Data 2/10/2019 Week 4-A Sangmi Lee Pallickara CS535 Big Data | Computer Science |

CS535 Big Data 2/5/2020 Week 3- B Sangmi Lee Pallickara CS535 Big Data | Computer Science |

CS535 Big Data 1/29/2020 Week 2- B Sangmi Lee Pallickara CS535 Big Data | Computer Science |

CS535 Big Data 4/13/2020 Week 12-A Sangmi Lee Pallickara CS535 Big Data | Computer Science |

CS535 Big Data 4/27/2020 Week 14-A Sangmi Lee Pallickara CS535 Big Data | Computer Science |

CS535 Big Data 3/25/2020 Week 8-B Sangmi Lee Pallickara CS535 Big Data | Computer Science |

CS535 Big Data 2/24/2020 Week 5-B Sangmi Lee Pallickara CS535 Big Data | Computer Science |

CS535 Big Data 1/27/2020 Week 2-A Sangmi Lee Pallickara CS535 Big Data | Computer Science |

CS535 Big Data 2/12/2020 Week 4-B Sangmi Lee Pallickara CS535 Big Data | Computer Science |

CS535 Big Data 03/02/2020 Week 7-A Sangmi Lee Pallickara CS535 Big Data | Computer Science |

CS535 Big Data 3/9/2020 Week 8-A Sangmi Lee Pallickara CS535 Big Data | Computer Science |

CS535 Big Data 2/19/2020 Week 5-B Sangmi Lee Pallickara CS535 Big Data | Computer Science |

FAQs Lossy Algorithm http://www.cs.colostate.edu/~cs535 Spring 2020 Colorado State

Distributed Implementation of the Triplets View CS535 Big Data | Computer Science | Colorado State

Preview question In a 32-bit Linux/x86 program, which of these objects would have the lowest

Data-Intensive Distributed Computing CS 431/631 451/651 (Winter 2019) Part 7: Mutable State (1/2)

Best Vehicles for Estate Tax Planning Now and Best Ways to Draft Them SLATs, DAPTs, GRATs,

Parallelization techniques: Applying Map, Reduce and Cross concepts using bioActors Ilkay

Agenda Move to Digital Publishing Author benefits Using the template Content and

How I Hacked facebook Again! by Orange Tsai Orange Tsai Principal security researcher at

Role of the Plan Commission Rebecca Roberts Center for Land Use Education UW Stevens

AND SETTING GODS PEOPLE FREE ACROSS THE DIOCESE GROWING DISCIPLES AND SETTING GODS PEOPLE