CS535 Big Data 1/29/2020 Week 2-B Sangmi Lee Pallickara



  1. CS535 Big Data 1/29/2020 Week 2-B Sangmi Lee Pallickara
     CS535 Big Data | Computer Science | Colorado State University
     http://www.cs.colostate.edu/~cs535 Spring 2020 Colorado State University

     PART A. BIG DATA TECHNOLOGY
     3. DISTRIBUTED COMPUTING MODELS FOR SCALABLE BATCH COMPUTING
     Sangmi Lee Pallickara
     Computer Science, Colorado State University
     http://www.cs.colostate.edu/~cs535

     FAQs
     • TP0
       • There may be adjustment of your team composition
     • PA1
       • Hadoop and Spark installation video clips are posted

     Topics of Today's Class
     • Overview of the Programming Assignment 1
     • 3. Distributed Computing Models for Scalable Batch Computing
       • MapReduce

     Programming Assignment 1
     Hyperlink-Induced Topic Search (HITS)

     This material is built based on
     • Kleinberg, Jon. "Authoritative sources in a hyperlinked environment". Journal of the ACM. 46 (5): 604–632

     Types of Web queries
     • Yes/No queries
       • Does Chrome support .ogv video format?
     • Broad topic queries
       • Find information about “Coronavirus”
       Image credit: https://www.cnn.com/2020/01/22/world/wuhan-coronavirus-visual-guide-intl/index.html
     • Similarity query
       • Find person similar to “Justin Bieber”
       Image credit: https://www.google.com/search?source=hp&ei=tMYxXsHaFZO4tQae7ILACQ&q=similar+to+justin+bieber&oq=Similar+to+justin+&gs_l=psy-ab.3.0.0l3j0i22i30l7.546394.575419..576451...17.0..0.184.1712.34j1......0....1..gws-wiz.......0i131j0i70i249j0i10.DWTD5rf16d8

  2. Challenge of content-based ranking for topic search
     • Assume that you are looking for “computer”
     • How about IBM’s web page?
     • “computer” in the APPLE page?
     • O.K… Now, Google?

     Challenge of content-based ranking for topic search
     • Most useful pages do not include the keyword (that the users are looking for)
     • Pages are not sufficiently descriptive!
     • Semantic mismatch
       • Search keys vs. descriptions
     Image credit: https://e360.yale.edu/features/could-massive-storm-surge-barriers-end-the-hudson-rivers-revival

     Ranking algorithm to find the most “authoritative” pages for the given topic
     • To find the small set of the most authoritative pages that are relevant to the query
     • Examples of the authoritative pages
       • For the topic, “python”
         • https://www.python.org/
       • For the information about “Colorado State University”
         • https://www.colostate.edu/

     HITS (Hyperlink-Induced Topic Search)
     • PageRank captures a simplistic view of a network
     • Authority
       • A Web page with good, authoritative content on a specific topic
       • A Web page that is linked to by many hubs
     • Hub
       • A Web page pointing to many authoritative Web pages
       • e.g. portal pages (Yahoo)

  3. HITS (Hyperlink-Induced Topic Search)
     • A.K.A. Hubs and Authorities
     • Jon Kleinberg 1997
     • Topic search
       • Automatically determine hubs/authorities
     • In practice
       • Performed only on the result set (PageRank is applied on the complete set of documents)
       • Developed for the IBM Clever project
       • Used by Teoma (later Ask.com)

     Understanding Authorities and Hubs [1/2]
     • Intuitive idea to find authoritative results using link analysis:
       • Not all hyperlinks are related to the conferral of authority
       • Patterns that authoritative pages have
         • Authoritative pages share considerable overlap in the sets of pages that point to them.
     [Figure: hubs pointing to authorities]

     Understanding Authorities and Hubs [2/2]
     • A good hub page points to many good authoritative pages
     • A good authoritative page is pointed to by many good hub pages
     • Authorities and hubs have a mutual reinforcement relationship

     Calculating Authority/Hub scores [1/3]
     Let there be n Web pages.
     Define the n x n adjacency matrix A such that
     $A_{uv} = 1$ if there is a link from u to v.
     Otherwise $A_{uv} = 0$.

     Graph with pages P1, P2, P3, P4 (row = source page, column = target page):

            P1  P2  P3  P4
        P1   0   1   1   1
        P2   0   0   1   1
        P3   1   0   0   1
        P4   0   0   0   1

     Calculating Authority/Hub scores [2/3]
     Each Web page i has an authority score $a_i$ and a hub score $h_i$.
     We define the authority score of page i by summing up the hub scores of the pages that point to it:

        $a_i = \sum_{j=1}^{n} h_j A_{ji}$

     j: row # in the matrix
     i: column # in the matrix
     This can be written concisely as

        $a = A^{T} h$

     Calculating Authority/Hub scores [3/3]
     Similarly, we define the hub score of a Web page i by summing up the authority scores $a_j$ of the pages it points to:

        $h_i = \sum_{j=1}^{n} a_j A_{ij}$

     i: row # in the matrix
     j: column # in the matrix
     This can be written concisely as

        $h = A a$
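The two update rules above can be sketched in a few lines of Python. This is a minimal illustration, not the assignment's reference solution: the function names `authority_update` and `hub_update` are hypothetical, and the matrix is the 4-page example graph from the slides with rows and columns ordered P1 to P4.

```python
# Adjacency matrix from the slides: A[u][v] = 1 if page u links to page v.
# Rows and columns are ordered P1, P2, P3, P4.
A = [
    [0, 1, 1, 1],  # P1 -> P2, P3, P4
    [0, 0, 1, 1],  # P2 -> P3, P4
    [1, 0, 0, 1],  # P3 -> P1, P4
    [0, 0, 0, 1],  # P4 -> P4
]


def authority_update(A, h):
    """a_i = sum_j h_j * A[j][i]: walk down column i of A (a = A^T h)."""
    n = len(A)
    return [sum(h[j] * A[j][i] for j in range(n)) for i in range(n)]


def hub_update(A, a):
    """h_i = sum_j a_j * A[i][j]: walk along row i of A (h = A a)."""
    n = len(A)
    return [sum(a[j] * A[i][j] for j in range(n)) for i in range(n)]


h0 = [1, 1, 1, 1]               # all-one starting hub vector
a1 = authority_update(A, h0)    # unnormalized authority scores
h1 = hub_update(A, a1)          # unnormalized hub scores computed from a1
print(a1)  # [1, 1, 2, 4]
print(h1)  # [7, 6, 5, 4]
```

Note that the authority update reads a column of A (in-links) while the hub update reads a row (out-links); that is the only difference between the two rules. No normalization is applied in this sketch.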

  4. Hubs and Authorities
     Let’s start arbitrarily from $a_0 = \mathbf{1}$, $h_0 = \mathbf{1}$, where $\mathbf{1}$ is the all-one vector.
     $a_0 = (1, 1, 1, 1)$
     $h_0 = (1, 1, 1, 1)$
     Repeating this, the sequences $a_0, a_1, a_2, \ldots$ and $h_0, h_1, h_2, \ldots$ converge (to limits $x^*$ and $y^*$).

     Graph with pages P1, P2, P3, P4 (row = source page, column = target page):

            P1  P2  P3  P4
        P1   0   1   1   1
        P2   0   0   1   1
        P3   1   0   0   1
        P4   0   0   0   1

     $a_1 = ((1 \times 0)+(1 \times 0)+(1 \times 1)+(1 \times 0),$
            $(1 \times 1)+(1 \times 0)+(1 \times 0)+(1 \times 0),$
            $(1 \times 1)+(1 \times 1)+(1 \times 0)+(1 \times 0),$
            $(1 \times 1)+(1 \times 1)+(1 \times 1)+(1 \times 1)) = (1, 1, 2, 4)$
     Normalize it: $(1/(1+1+2+4), 1/(1+1+2+4), 2/(1+1+2+4), 4/(1+1+2+4)) = (1/8, 1/8, 1/4, 1/2)$
     $a_1 = (1/8, 1/8, 1/4, 1/2)$ (← authority values after the first iteration)

     Hubs and Authorities
     $a_1 = (1/8, 1/8, 1/4, 1/2)$
     $h_1 = ((1/8 \times 0)+(1/8 \times 1)+(1/4 \times 1)+(1/2 \times 1),$
            $(1/8 \times 0)+(1/8 \times 0)+(1/4 \times 1)+(1/2 \times 1),$
            $(1/8 \times 1)+(1/8 \times 0)+(1/4 \times 0)+(1/2 \times 1),$
            $(1/8 \times 0)+(1/8 \times 0)+(1/4 \times 0)+(1/2 \times 1)) = (7/8, 6/8, 5/8, 4/8)$
     After the normalization:
     $h_1 = (7/22, 6/22, 5/22, 4/22)$ (← hub values after the first iteration)

     Implementing Topic Search using HITS
     • Step 1.
       • Constructing a focused subgraph based on a query
     • Step 2.
       • Iteratively calculate the authority value and hub value of the page in the subgraph

     Step 1. Constructing a focused subgraph (root set)
     • Generate a root set from a text-based search engine
       • e.g. pages containing query words
     [Figure: root set]

     Step 2. Constructing a focused subgraph (base set)
     • For each page p ∈ R
       • Add the set of all pages p points to
       • Add the set of all pages pointing to p
     [Figure: base set]

     Step 3. Initial values
     [Figure: graph with pages P1, P2, P3, P4]

        Nodes  Hubs  Authority
        P1     1     1
        P2     1     1
        P3     1     1
        P4     1     1

     Ranks
     Hub: P1=P2=P3=P4
     Authority: P1=P2=P3=P4
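The steps above (root set, base set, all-one initial values, then iterated updates) can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the course's reference implementation: the helper names `build_base_set`, `normalize`, and `hits`, the `links` dictionary, and the choice of `{"P2"}` as root set are all hypothetical. Scores are normalized to sum to 1, as in the worked example; other presentations of HITS normalize differently.

```python
from fractions import Fraction


def build_base_set(root, links):
    """Expand root set R: add pages each p points to and pages pointing to p."""
    base = set(root)
    for p in root:
        base.update(links.get(p, ()))                              # out-links of p
        base.update(q for q, outs in links.items() if p in outs)   # in-links of p
    return base


def normalize(scores):
    """Scale scores so they sum to 1 (the normalization in the worked example)."""
    total = sum(scores.values())
    return {p: Fraction(x) / total for p, x in scores.items()} if total else scores


def hits(pages, links, iterations=1):
    """Run HITS on the subgraph induced by `pages`, starting from all-one scores."""
    pages = sorted(pages)
    in_set = set(pages)
    out = {p: [q for q in links.get(p, ()) if q in in_set] for p in pages}
    auth = {p: Fraction(1) for p in pages}
    hub = {p: Fraction(1) for p in pages}
    for _ in range(iterations):
        # Authority of p: sum of hub scores of the pages that link to p.
        new_auth = {p: Fraction(0) for p in pages}
        for q in pages:
            for p in out[q]:
                new_auth[p] += hub[q]
        auth = normalize(new_auth)
        # Hub of p: sum of the updated authority scores of the pages p links to.
        hub = normalize(
            {p: sum((auth[q] for q in out[p]), Fraction(0)) for p in pages}
        )
    return auth, hub


# Hypothetical link structure matching the 4-page graph in the slides.
links = {
    "P1": ["P2", "P3", "P4"],
    "P2": ["P3", "P4"],
    "P3": ["P1", "P4"],
    "P4": ["P4"],
}
root = {"P2"}                        # hypothetical root set from a text search
base = build_base_set(root, links)   # adds P3, P4 (out-links) and P1 (in-link)
auth, hub = hits(base, links, iterations=1)
print(sorted(base))
print({p: str(s) for p, s in sorted(auth.items())})
print({p: str(s) for p, s in sorted(hub.items())})
```

Using `Fraction` keeps the scores exact, so one iteration can be compared directly with the hand-computed values (1/8, 1/8, 1/4, 1/2) and (7/22, 6/22, 5/22, 4/22); `Fraction` prints reduced forms such as 3/11 for 6/22. For a real crawl, floating-point scores and a convergence threshold would be the more practical choice.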
