CS535 Big Data, 1/30/2019, Week 2-B, Sangmi Lee Pallickara


  1. Week 2-A-0 (1/30/2019, Colorado State University, Spring 2019)
     CS535 BIG DATA
     PART A. BIG DATA TECHNOLOGY
     3. DISTRIBUTED COMPUTING MODELS FOR SCALABLE BATCH COMPUTING
     Sangmi Lee Pallickara
     Computer Science, Colorado State University
     http://www.cs.colostate.edu/~cs535

     Week 2-A-1: FAQs
     • Term project deliverable 0
       • Item 1: Your team members
       • Item 2: Tentative project titles (up to 3)
       • Submission deadline: Feb. 1
       • Via email or Canvas
     • PA1
       • Hadoop and Spark installation guides are posted
       • If you would like to start your homework, please send me an email with your team information. I will assign the port range for your team.
     • Quiz 1: February 4, 2019, in class

     Week 2-A-2: Topics of Today's Class
     • Overview of Programming Assignment 1
     • 3. Distributed Computing Models for Scalable Batch Computing
     • MapReduce

     Week 2-A-3: Programming Assignment 1
     Hyperlink-Induced Topic Search (HITS)

     Week 2-A-4: This material is built based on
     • Kleinberg, Jon. "Authoritative sources in a hyperlinked environment". Journal of the ACM. 46 (5): 604–632

     Week 2-A-5: Types of Web queries
     • Yes/No queries
       • Does Chrome support .ogv video format?
     • Broad topic queries
       • Find information about “polar vortex”
     • Similar-page query
       • Find pages similar to ‘https://stackoverflow.com’
     Image credit: https://www.cnn.com/2019/01/30/weather/winter-weather-wednesday-wxc/index.html

  2. Week 2-A-6: Ranking algorithm to find the most “authoritative” pages
     • To find the small set of the most authoritative pages that are relevant to the query
     • Examples of the authoritative pages
       • For the topic “python”
         • https://www.python.org/
       • For the information about “Colorado State University”
         • https://www.colostate.edu/
       • For the images about “iPhone”
         • https://www.apple.com/iphone/

     Week 2-A-7: Challenge of content-based ranking
     • Most useful pages do not include the keyword (that the users are looking for)
       • “computer” in the APPLE page?
     [Screenshot captured Jan. 30, 2019]

     Week 2-A-8: Challenge of content-based ranking
     • How about IBM’s web page?
     [Screenshot captured Jan. 30, 2019]

     Week 2-A-9: Challenge of content-based ranking
     • Pages are not sufficiently descriptive
       • “health care” in Poudre Valley Hospital?
     [Screenshot captured Jan. 30, 2019]

     Week 2-A-10: HITS (Hypertext-Induced Topic Search)
     • PageRank captures a simplistic view of a network
     • Authority
       • A Web page with good, authoritative content on a specific topic
       • A Web page that is linked by many hubs
     • Hub
       • A Web page pointing to many authoritative Web pages
       • e.g. portal pages (Yahoo)

     Week 2-A-11: HITS (Hypertext-Induced Topic Search)
     • A.K.A. Hubs and Authorities
     • Jon Kleinberg 1997
     • Topic search
       • Automatically determine hubs/authorities
     • In practice
       • Performed only on the result set (PageRank is applied on the complete set of documents)
     • Developed for the IBM Clever project
     • Used by Teoma (later Ask.com)

  3. Week 2-A-12: Understanding Authorities and Hubs [1/2]
     • Intuitive idea to find authoritative results using link analysis:
       • Not all hyperlinks are related to the conferral of authority
       • Patterns that authoritative pages have
         • Authoritative pages share considerable overlap in the sets of pages that point to them.

     Week 2-A-13: Understanding Authorities and Hubs [2/2]
     • A good hub page points to many good authoritative pages
     • A good authoritative page is pointed to by many good hub pages
     • Authorities and hubs have a mutual reinforcement relationship
     [Figure: Hubs and Authorities]

     Week 2-A-14: Calculating Authority/Hub scores [1/3]
     Let there be n Web pages.
     Define the n x n adjacency matrix A such that
       $A_{uv} = 1$ if there is a link from u to v. Otherwise $A_{uv} = 0$.
     [Figure: graph with pages P1, P2, P3, P4]
     Adjacency matrix of the example graph (rows and columns in the order P1, P2, P3, P4):
       0 1 1 1
       0 0 1 1
       1 0 0 1
       0 0 0 1

     Week 2-A-15: Calculating Authority/Hub scores [2/3]
     Each Web page has an authority score $a_i$ and a hub score $h_i$.
     We define the authority score by summing up the hub scores that point to it,
       $a_i = \sum_{j=1}^{n} h_j A_{ji}$
     j: row # in the matrix
     i: column # in the matrix
     This can be written concisely as,
       $a = A^T h$

     Week 2-A-16: Calculating Authority/Hub scores [3/3]
     Similarly, we define the hub score by summing up the authority scores,
       $h_j = \sum_{i=1}^{n} A_{ji} a_i$
     j: row # in the matrix
     i: column # in the matrix
     This can be written concisely as,
       $h = A a$

     Week 2-A-17: Hubs and Authorities
     Let’s start arbitrarily from $a_0 = 1$, $h_0 = 1$, where 1 is the all-one vector.
       a_0 = (1,1,1,1)
       h_0 = (1,1,1,1)
     Repeating this, the sequences a_0, a_1, a_2, … and h_0, h_1, h_2, … converge (to limits x* and y*).
       a_1 = (((1 x 0)+(1 x 0)+(1 x 1)+(1 x 0)),
              ((1 x 1)+(1 x 0)+(1 x 0)+(1 x 0)),
              ((1 x 1)+(1 x 1)+(1 x 0)+(1 x 0)),
              ((1 x 1)+(1 x 1)+(1 x 1)+(1 x 1))) = (1,1,2,4)
     Normalize it: (1/(1+1+2+4), 1/(1+1+2+4), 2/(1+1+2+4), 4/(1+1+2+4)) = (1/8, 1/8, 1/4, 1/2)
       a_1 = (1/8, 1/8, 1/4, 1/2) (← authority values after the first iteration)

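The update rules a = A^T h and h = A a can be checked mechanically. The following is a minimal Python sketch, assuming the four-page example graph and the simple sum-to-1 normalization from the worked example. The function names (`authority_update`, `hub_update`, `normalize`) are illustrative and do not come from the slides.

```python
from fractions import Fraction

# Adjacency matrix of the example graph: A[u][v] = 1 if page u links to page v.
# Rows and columns are ordered P1, P2, P3, P4.
A = [
    [0, 1, 1, 1],
    [0, 0, 1, 1],
    [1, 0, 0, 1],
    [0, 0, 0, 1],
]

def authority_update(A, h):
    """a = A^T h: sum the hub scores of the pages that link to each page."""
    n = len(A)
    return [sum(h[j] * A[j][i] for j in range(n)) for i in range(n)]

def hub_update(A, a):
    """h = A a: sum the authority scores of the pages that each page links to."""
    n = len(A)
    return [sum(A[j][i] * a[i] for i in range(n)) for j in range(n)]

def normalize(values):
    """Scale the scores so that they sum to 1 (assumed variant, as in the example)."""
    total = sum(values)
    return [Fraction(v) / total for v in values]

# Start from the all-one vector, then do one round: authorities first, hubs second.
h0 = [Fraction(1)] * 4
a1 = normalize(authority_update(A, h0))
h1 = normalize(hub_update(A, a1))

print("a1 =", [str(x) for x in a1])
print("h1 =", [str(x) for x in h1])
```

Exact fractions are used so the result can be compared digit for digit with the hand calculation: `a1` comes out as (1/8, 1/8, 1/4, 1/2). `Fraction` prints values in lowest terms, so a hub score such as 6/22 is shown as 3/11.
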
  4. Week 2-A-18: Hubs and Authorities
     Let’s start arbitrarily from $a_0 = 1$, $h_0 = 1$, where 1 is the all-one vector.
       a_0 = (1,1,1,1)
       h_0 = (1,1,1,1)
       a_1 = (1/8, 1/8, 1/4, 1/2)
     Adjacency matrix of the example graph:
       0 1 1 1
       0 0 1 1
       1 0 0 1
       0 0 0 1
       h_1 = (((1/8 x 0)+(1/8 x 1)+(1/4 x 1)+(1/2 x 1)),
              ((1/8 x 0)+(1/8 x 0)+(1/4 x 1)+(1/2 x 1)),
              ((1/8 x 1)+(1/8 x 0)+(1/4 x 0)+(1/2 x 1)),
              ((1/8 x 0)+(1/8 x 0)+(1/4 x 0)+(1/2 x 1))) = (7/8, 6/8, 5/8, 4/8)
     After the normalization:
       h_1 = (7/22, 6/22, 5/22, 4/22) (← hub values after the first iteration)

     Week 2-A-19: Implementing Topic Search using HITS
     • Step 1.
       • Constructing a focused subgraph based on a query
     • Step 2.
       • Iteratively calculate the authority value and hub value of the page in the subgraph

     Week 2-A-20: Step 1. Constructing a focused subgraph (root set)
     • Generate a root set from a text-based search engine
       • e.g. pages containing query words
     [Figure: root set]

     Week 2-A-21: Step 2. Constructing a focused subgraph (base set)
     • For each page p ∈ R
       • Add the set of all pages p points to
       • Add the set of all pages pointing to p
     [Figure: root set inside the base set]

     Week 2-A-22: Step 3. Initial values
       Nodes   Hubs   Authority
       P1      1      1
       P2      1      1
       P3      1      1
       P4      1      1
     Ranks
       Hub: P1=P2=P3=P4
       Authority: P1=P2=P3=P4
     [Figure: graph with pages P1, P2, P3, P4]

     Week 2-A-23: Step 4. After the first iteration
       Nodes   Hubs   Authority
       P1      7/22   1/8
       P2      6/22   1/8
       P3      5/22   2/8
       P4      4/22   4/8
     Ranks
       Hub: P1>P2>P3>P4
       Authority: P1=P2<P3<P4
     Normalization
     • Original paper: using squares sum (to 1)
     • You can use sum (to 1)
       • value = value/(sum of all values)
     [Figure: graph with pages P1, P2, P3, P4]
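
The steps above (root set, base set, iterative scoring, normalization) can be sketched end to end. This is a minimal single-machine Python sketch, not the MapReduce/Spark solution expected for PA1. The root set {"P1"} is a hypothetical stand-in for the result of a text-based search, the link table reuses the four-page example graph, and all function names are illustrative.

```python
import math

def invert_links(out_links):
    """Derive in-links (who points to p) from out-links (whom p points to)."""
    in_links = {}
    for src, targets in out_links.items():
        for dst in targets:
            in_links.setdefault(dst, set()).add(src)
    return in_links

def build_base_set(root, out_links):
    """For each page p in the root set R, add pages p points to and pages pointing to p."""
    in_links = invert_links(out_links)
    base = set(root)
    for p in root:
        base |= out_links.get(p, set())
        base |= in_links.get(p, set())
    return base

def normalize(scores, norm):
    """norm='sum': values sum to 1. norm='squares': squared values sum to 1."""
    if norm == "sum":
        scale = sum(scores.values())
    elif norm == "squares":
        scale = math.sqrt(sum(v * v for v in scores.values()))
    else:
        raise ValueError("norm must be 'sum' or 'squares'")
    if scale == 0:
        return dict(scores)
    return {p: v / scale for p, v in scores.items()}

def hits(pages, out_links, iterations=20, norm="sum"):
    """Run HITS on the subgraph induced by `pages`, starting from all-one scores."""
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority update: add the hub score of every in-subgraph page linking to p.
        auth = {p: 0.0 for p in pages}
        for q in pages:
            for p in out_links.get(q, set()):
                if p in auth:
                    auth[p] += hub[q]
        auth = normalize(auth, norm)
        # Hub update: add the (new) authority score of every page q links to.
        hub = normalize(
            {q: sum(auth[p] for p in out_links.get(q, set()) if p in auth)
             for q in pages},
            norm,
        )
    return hub, auth

# Example graph from the slides; P4 links to itself.
out_links = {
    "P1": {"P2", "P3", "P4"},
    "P2": {"P3", "P4"},
    "P3": {"P1", "P4"},
    "P4": {"P4"},
}
root = {"P1"}  # hypothetical root set (would come from a text search)
base = build_base_set(root, out_links)
hub, auth = hits(base, out_links, iterations=1, norm="sum")
for p in sorted(base):
    print(p, round(hub[p], 4), round(auth[p], 4))
```

With `iterations=1` and `norm="sum"` the scores match the "after the first iteration" table up to floating-point rounding. Passing `norm="squares"` switches to the normalization used in the original paper, and raising `iterations` lets the scores settle toward their limits.
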
