Link-based Web Search: Web Search, PageRank, HITS, Stability Issues

Vagelis Hristidis
School of Computer Science, Florida International University
COP 6727, 9/5/2004

Roadmap
• Web Search
• PageRank
• HITS
• Stability Issues
• Current Research

Search the Web

Standard Web Search Engine Architecture
[Figure: architecture diagram with the labels: crawl the web, crawler, eliminate duplicates, DocIds, create an inverted index, inverted index servers, user, query, search engine, show results.]
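The "create an inverted index" step of the architecture can be sketched in a few lines. This is only a minimal illustration, not the design of any real engine: the function names `build_inverted_index` and `search`, the whitespace tokenizer, and the three toy documents are all assumptions made for the example.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercase term to the sorted list of DocIds that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def search(index, query):
    """Return the DocIds that contain every query term (boolean AND)."""
    terms = query.lower().split()
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)


# Toy corpus (hypothetical): DocId -> page text.
docs = {
    1: "link based web search",
    2: "web crawler design",
    3: "search engine index",
}
index = build_inverted_index(docs)
print(search(index, "web search"))
```

The query is answered from the index alone, without rescanning the pages, which is why the index is built ahead of time.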

Limitations of traditional IR analysis
• Text-based ranking function
  - E.g., could www.harvard.edu be recognized as one of the most authoritative pages, given that many other web pages contain "harvard" more often?
• Pages are not sufficiently self-descriptive
  - Usually the term "search engine" doesn't appear on search engine web pages.

Before Google
• Traditional IR ranking
  - Term frequency (tf)
  - Inverse Document Frequency (idf)
  - …
[Figure: a keyword is matched against a Web database of Web pages.]

Link Analysis [Kleinberg98, PageRank]
• Assumptions
  - If the pages pointing to this page are good, then this is also a good page.
  - The words on the links pointing to this page are useful indicators of what this page is about.
• Does it work?
  - Apparently, Google uses it.
• The link structure implies an underlying social structure in the way that pages and links are created, and it is an understanding of this social organization that can provide us the most leverage.

PageRank
• Make use of the link structure of the web to calculate a quality ranking (PageRank) for each web page.
• Each page has a unique PageRank, independent of the keyword query.
• PageRank does NOT express the relevance of a page to a query.

PageRank is a Usage Simulation
• "Random surfer"
  - Given a random URL
  - Clicks randomly on links
  - After a while gets bored and gets a new random URL
• The number of visits to each page is its PageRank.

PageRank Calculation Intuition
• PageRank of page P increases when pages with large PageRanks point to P.

PageRank Calculation
PR(A) = (1-d) + d*(PR(T1)/C(T1) + … + PR(Tn)/C(Tn))
d: damping factor, normally set to 0.85.
T1, …, Tn: pages pointing to page A.
PR(A): PageRank of page A.
PR(Ti): PageRank of page Ti.
C(Ti): the number of links going out of page Ti.
Note: d is needed due to PageRank sinks.
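The random-surfer view can be tried out directly. The sketch below is an illustration under stated assumptions, not the way PageRank is computed in practice: the function name `random_surfer`, the fixed seed, the step count, and the small four-page graph are all chosen for the example, and "getting bored" is modelled as jumping to a uniformly random page with probability 1-d.

```python
import random


def random_surfer(links, steps, d=0.85, seed=0):
    """Count page visits for a surfer who follows a random out-link with
    probability d and otherwise jumps to a uniformly random page."""
    rng = random.Random(seed)
    pages = sorted(links)
    visits = {page: 0 for page in pages}
    current = rng.choice(pages)          # start at a random URL
    for _ in range(steps):
        visits[current] += 1
        out_links = links[current]
        if out_links and rng.random() < d:
            current = rng.choice(out_links)   # click a random link
        else:
            current = rng.choice(pages)       # bored (or stuck): new random URL
    return visits


# Illustrative graph: A -> B, C; B -> C; C -> A; D -> C.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
visits = random_surfer(links, steps=100_000)
total = sum(visits.values())
for page in sorted(visits):
    print(page, round(visits[page] / total, 3))
```

For a graph like this one, where every page has at least one out-link, the visit fractions should approximate the PageRank values from the formula above divided by the number of pages.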

Example of Calculation (1)
All four pages start with PageRank 1. Links: Page A → Page B and Page C; Page B → Page C; Page C → Page A; Page D → Page C.

Example of Calculation (2)
Each page passes on 0.85 of its PageRank, split evenly over its out-links:
• Page A → Page B: 1*0.85/2
• Page A → Page C: 1*0.85/2
• Page B → Page C: 1*0.85
• Page C → Page A: 1*0.85
• Page D → Page C: 1*0.85

Example of Calculation (3)
Page A: 0.85 (from Page C) + 0.15 (not transferred) = 1
Page B: 0.425 (from Page A) + 0.15 (not transferred) = 0.575
Page C: 0.85 (from Page D) + 0.85 (from Page B) + 0.425 (from Page A) + 0.15 (not transferred) = 2.275
Page D: receives none, but has not transferred 0.15 = 0.15

Example of Calculation (4)
Page A: 2.275*0.85 (from Page C) + 0.15 (not transferred) = 2.08375
Page B: 1*0.85/2 (from Page A) + 0.15 (not transferred) = 0.575
Page C: 0.15*0.85 (from Page D) + 0.575*0.85 (from Page B) + 1*0.85/2 (from Page A) + 0.15 (not transferred) = 1.19125
Page D: receives none, but has not transferred 0.15 = 0.15
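The hand calculation above can be replayed in code. This is a minimal sketch assuming synchronous updates (every page is updated from the previous round's values) and the deck's formula PR(A) = (1-d) + d*(…); the function name `pagerank_step` and the `history` list are names introduced for the example.

```python
def pagerank_step(pr, links, d=0.85):
    """One synchronous update of PR(A) = (1-d) + d * sum(PR(T)/C(T))."""
    new_pr = {page: 1.0 - d for page in pr}       # the part "not transferred"
    for source, targets in links.items():
        if not targets:
            continue                               # a sink passes nothing on
        share = d * pr[source] / len(targets)      # split over out-links
        for target in targets:
            new_pr[target] += share
    return new_pr


# Graph of the example: A -> B, C; B -> C; C -> A; D -> C.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
pr = {page: 1.0 for page in links}                 # every page starts at 1

history = []
for _ in range(20):
    pr = pagerank_step(pr, links)
    history.append(pr)

for i in (0, 1, 19):
    print(i + 1, {page: round(value, 5) for page, value in history[i].items()})
```

The first two rounds reproduce the values worked out by hand, and by round 20 the scores have settled close to the converged values reported in the deck.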

Example of Calculation (5)
Page A: 1.490
Page B: 0.783
Page C: 1.577
Page D: 0.15
• After 20 iterations it converges.
• It converges because, with the random jumps, the surfer's walk over the Web data graph is irreducible (strongly connected) and aperiodic.

Google
• Uses PageRank as one of the criteria to rank keyword query results.
• Other criteria (may) include:
  - Term frequencies
  - Term proximities
  - Term position (title, top of page, etc.)
  - Term characteristics (boldface, capitalized, etc.)
  - Link analysis information
  - Category information
  - Popularity information

HITS [Kleinberg98]: Hubs & Authorities
• Jon M. Kleinberg: Authoritative Sources in a Hyperlinked Environment. JACM 46(5): 604-632 (1999)
• HITS (Hypertext-Induced Topic Search) was developed by Jon Kleinberg while visiting IBM Almaden.
• IBM expanded HITS into Clever.
• IBM doesn't see Clever as a real-time search engine, but as a way to create constantly refreshed lists of relevant pages for categories.

Hubs & Authorities
• Rank pages according to the keyword query (in contrast to PageRank).
• Good hub: page that points to many good authorities.
• Good authority: page pointed to by many good hubs.
• Given a keyword query, assign a hub value and an authority value to each page.
• Pages with high authority are the results of the query.

Hubs & Authorities Calculation: Root Set and Base Set
• Use the query terms to collect a root set of pages from a text-based search engine (AltaVista).

Hubs & Authorities Calculation: Root Set and Base Set (Cont'd)
• Expand the root set into the base set by including (up to a designated size cut-off):
  - all pages linked to by pages in the root set
  - all pages that link to a page in the root set
• A typical base set contains roughly 1000-5000 pages.
[Figure: the root set drawn inside the larger base set.]
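The expansion from root set to base set can be sketched as follows. The function name `expand_to_base_set`, the dictionary-based link maps, the order in which neighbours are added, and the default cut-off of 5000 (taken from the upper end of the "typical" range) are all assumptions of this sketch; the slides only state that a size cut-off exists.

```python
def expand_to_base_set(root_set, out_links, in_links, max_size=5000):
    """Add pages linked from, or linking to, root pages until max_size is hit."""
    base = list(dict.fromkeys(root_set))       # root pages, duplicates removed
    seen = set(base)
    for page in list(base):                    # iterate over root pages only
        neighbours = out_links.get(page, []) + in_links.get(page, [])
        for other in neighbours:
            if len(base) >= max_size:
                return base                    # designated size cut-off reached
            if other not in seen:
                seen.add(other)
                base.append(other)
    return base


# Hypothetical link data: page -> pages it links to / pages linking to it.
out_links = {"r1": ["p1", "p2"], "r2": ["p2"]}
in_links = {"r1": ["q1"], "r2": ["q2", "r1"]}

print(expand_to_base_set(["r1", "r2"], out_links, in_links))
print(expand_to_base_set(["r1", "r2"], out_links, in_links, max_size=4))
```

Only direct neighbours of root pages are added, so the base set stays focused on the query even though it is larger than the root set.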

Hubs & Authorities Calculation
• Iterative algorithm on the base set: authority weights a(p) and hub weights h(p).
• Set authority weights a(p) = 1 and hub weights h(p) = 1 for all p.
• Repeat the following two operations (and then re-normalize a and h to have unit norm):

$$a(p) = \sum_{q \text{ points to } p} h(q), \qquad h(p) = \sum_{p \text{ points to } q} a(q)$$

[Figure: page p receiving links from v_1, v_2, v_3 with hub weights h(v_1), h(v_2), h(v_3), and page p linking to v_1, v_2, v_3 with authority weights a(v_1), a(v_2), a(v_3).]

Example: Mini Web
[Figure: a three-page web with pages X, Y and Z.]
M is the adjacency matrix with rows and columns ordered X, Y, Z; H and A are the hub and authority vectors:

$$M = \begin{bmatrix} 1 & 1 & 1 \\ 0 & 0 & 1 \\ 1 & 1 & 0 \end{bmatrix}, \qquad H = \begin{bmatrix} h_x \\ h_y \\ h_z \end{bmatrix}, \qquad A = \begin{bmatrix} a_x \\ a_y \\ a_z \end{bmatrix}$$

$$H = M A, \qquad A = M^T H$$

$$H_i = M M^T H_{i-1}, \qquad A_i = M^T M A_{i-1}$$

Example

$$M^T = \begin{bmatrix} 1 & 0 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{bmatrix}, \qquad M^T M = \begin{bmatrix} 2 & 2 & 1 \\ 2 & 2 & 1 \\ 1 & 1 & 2 \end{bmatrix}, \qquad M M^T = \begin{bmatrix} 3 & 1 & 2 \\ 1 & 1 & 0 \\ 2 & 0 & 2 \end{bmatrix}$$

Iterations 0, 1, 2, 3, …, ∞ (the limit is given up to scaling):

$$H:\ \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix},\ \begin{bmatrix} 6 \\ 2 \\ 4 \end{bmatrix},\ \begin{bmatrix} 28 \\ 8 \\ 20 \end{bmatrix},\ \begin{bmatrix} 132 \\ 36 \\ 96 \end{bmatrix},\ \dots,\ \begin{bmatrix} 2+\sqrt{3} \\ 1 \\ 1+\sqrt{3} \end{bmatrix}$$

$$A:\ \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix},\ \begin{bmatrix} 5 \\ 5 \\ 4 \end{bmatrix},\ \begin{bmatrix} 24 \\ 24 \\ 18 \end{bmatrix},\ \begin{bmatrix} 114 \\ 114 \\ 84 \end{bmatrix},\ \dots,\ \begin{bmatrix} 1+\sqrt{3} \\ 1+\sqrt{3} \\ 2 \end{bmatrix}$$

X is the best hub. X and Y share the largest authority weight ($1+\sqrt{3} \approx 2.73$, against 2 for Z).

Hubs & Authorities Calculation
• Theorem (Kleinberg, 1998). The iterates a(p) and h(p) converge to the principal eigenvectors of $M^T M$ and $M M^T$, where M is the adjacency matrix of the (directed) Web subgraph.
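The Mini Web iterations can be checked with a few lines of plain Python. This is a sketch rather than a production HITS implementation: the helper names `mat_vec`, `transpose` and `hits_iterates` are introduced here, and the vectors are deliberately left un-normalized so that the integers match the iteration table (normalizing only rescales them and does not change the ranking).

```python
def mat_vec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def transpose(matrix):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*matrix)]


# Adjacency matrix of the Mini Web, rows and columns ordered X, Y, Z.
M = [[1, 1, 1],
     [0, 0, 1],
     [1, 1, 0]]
MT = transpose(M)


def hits_iterates(steps):
    """Return the list of (hubs, authorities) pairs for iterations 0..steps."""
    hubs, auths = [1, 1, 1], [1, 1, 1]
    history = [(hubs, auths)]
    for _ in range(steps):
        hubs = mat_vec(M, mat_vec(MT, hubs))      # H_i = M M^T H_{i-1}
        auths = mat_vec(MT, mat_vec(M, auths))    # A_i = M^T M A_{i-1}
        history.append((hubs, auths))
    return history


history = hits_iterates(30)
for i in range(4):
    print(i, history[i])

hubs, auths = history[-1]
print(hubs[0] / hubs[1], 2 + 3 ** 0.5)            # h_x / h_y vs. 2 + sqrt(3)
print(auths[0] / auths[2], (1 + 3 ** 0.5) / 2)    # a_x / a_z vs. (1 + sqrt(3)) / 2
```

After 30 rounds the ratios between components agree with the limiting vectors to many decimal places, which is the convergence to the principal eigenvectors that the theorem describes.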

PageRank vs. Authorities

PageRank (Google)
• computed for all web pages stored in the database prior to the query
• computes authorities only
• trivial and fast to compute

HITS (CLEVER)
• performed on the set of retrieved web pages for each query
• computes authorities and hubs
• easy to compute, but real-time execution is hard

How do we analyze algorithm stability?
General Strategy:
1. Start with the original adjacency matrix, A.
2. Perturb the matrix to get A*: select k nodes in the graph to add or delete.
3. Compute the distance d(r(A), r(A*)), for some distance measure d and objective function r that measures the quality of the results obtained from a matrix.
4. Compute the amount of perturbation p(A, A*), for some distance function p that measures the amount of perturbation.
5. Evaluate the conditions, if any, where small values for p generate large values for d.

PageRank Stability
Theoretical Result:
• If the original k pages to be modified do not have high overall PR scores, then the perturbed scores will not be far from the original.
Note: the result is conditioned on the resetting probability not being too small.
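The general strategy can be tried on a toy graph. Everything specific in this sketch is an assumption chosen for illustration: r is the deck's PageRank formula iterated 100 times, the distance between score vectors is the L1 norm, the perturbation is the removal of one page's out-links (a simplification of adding or deleting nodes), and the helper names `pagerank`, `l1_distance` and `drop_out_links` are hypothetical.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Iterate PR(A) = (1-damping) + damping * sum(PR(T)/C(T)) from all ones."""
    pr = {page: 1.0 for page in links}
    for _ in range(iterations):
        new_pr = {page: 1.0 - damping for page in links}
        for source, targets in links.items():
            for target in targets:
                new_pr[target] += damping * pr[source] / len(targets)
        pr = new_pr
    return pr


def l1_distance(scores_a, scores_b):
    """Distance between two score vectors: sum of absolute differences."""
    return sum(abs(scores_a[page] - scores_b[page]) for page in scores_a)


def drop_out_links(links, page):
    """Perturbed copy of the graph in which `page` no longer links anywhere."""
    perturbed = {p: list(targets) for p, targets in links.items()}
    perturbed[page] = []
    return perturbed


# Four-page example graph: A -> B, C; B -> C; C -> A; D -> C.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
original = pagerank(links)

low = min(original, key=original.get)     # page with the lowest score
high = max(original, key=original.get)    # page with the highest score
dist_low = l1_distance(original, pagerank(drop_out_links(links, low)))
dist_high = l1_distance(original, pagerank(drop_out_links(links, high)))

print(low, round(dist_low, 4))
print(high, round(dist_high, 4))
```

Both perturbations change exactly one page, yet modifying the low-scoring page moves the scores far less than modifying the high-scoring one, in line with the theoretical result.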
