Motivation Topic-Sensitive PageRank Improve search results Current - PDF document

Topic-Sensitive PageRank
Taher H. Haveliwala
Stanford University
taherh@cs.stanford.edu

Motivation
- Improve search results
  - Current engines work well for us “computer types”, but not for novice users
- Exploit search context in a tractable and effective way
  - Current engines can only do so well when optimizing parameters for Joe User issuing query q

Search Context
- Query context
  - Highlighted word on page
  - Previous queries issued
- User context
  - Bookmarks
  - Browsing history
Placing Search in Context: The Concept Revisited [Finkelstein et al. WWW10 ’01]

Link-Based Scoring (HITS)
- HITS (“Hubs and Authorities”)
  - [Kleinberg SODA ’98]
  - Determine important Hub pages and important Authority pages
  - + Query specific rank score
  - - Expensive at runtime

Link-Based Scoring (PageRank)
- PageRank
  - [Page et al. ’98]
  - Assigns a-priori “importance” estimates to pages
  - - Query independent rank score
  - + Inexpensive at runtime
- Algorithm has hooks for “personalization”

Original PageRank
[Diagram. Offline: Web graph → PageRank() → (page → rank). Query-time: query → Query Processor.]

Topic-Sensitive PageRank
- Assigns multiple a-priori “importance” estimates to pages
- One PageRank score per basis topic
- + Query specific rank score
- + Make use of context
- + Inexpensive at runtime
- Related approach: one score per query word was considered in [Richardson, Domingos NIPS ’02] (builds on [Rafiei, Mendelzon WWW ’00])

Topic-Sensitive PageRank
[Diagram. Offline: Web, Yahoo! or ODP → TSPageRank() → ((page, topic) → rank). Query-time: query and context → Classifier → topic → Query Processor.]

Original PageRank Intuition
- “Page is important if many important pages point to it”
  - Many pages point to Yahoo!, so it is “important”
  - Because Yahoo! is important, anyone it prominently points to is “important”

PageRank Diagram
[Figure: graph structure for entire web]

PageRank Diagram
[Figure: initialize all nodes to rank 1]

PageRank Diagram
[Figure: propagate ranks across links (multiplying by link weights)]

PageRank Diagram
[Figures: four further panels showing node ranks changing over successive propagation steps; the last panel is captioned “After a while…”]
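The propagation shown in the diagrams above (initialize every node to rank 1, then repeatedly push rank across links scaled by link weight) can be sketched roughly as follows. This is a minimal illustration, not the deck's implementation: the `damping` value, the equal-share link weights, the handling of pages without out-links, and the toy graph are all assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively propagate rank across links.

    links maps each page to the list of pages it points to
    (every page must appear as a key).
    damping=0.85 is an assumed value; the slides do not state one.
    """
    pages = sorted(links)
    rank = {p: 1.0 for p in pages}            # initialize all nodes to rank 1
    for _ in range(iterations):
        new_rank = {p: 1.0 - damping for p in pages}
        for src in pages:
            out = links[src]
            if not out:
                continue                       # pages without out-links pass nothing on in this sketch
            weight = 1.0 / len(out)            # assumed link weight: equal share per out-link
            for dst in out:
                new_rank[dst] += damping * rank[src] * weight
        rank = new_rank
    return rank


# Hypothetical three-page web, for illustration only.
toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(toy_web)
```

In this toy graph page `c` is pointed to by both other pages and ends up with the highest rank, which matches the intuition that a page is important if important pages point to it.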

Original PageRank
- Input
  - Web graph G
- Output
  - Rank vector r : (page → page importance)
- r = PR(G)

Influencing the Computation
- Uninfluenced: “Page is important if many important pages point to it.”
- Influenced: “Page is important if many important pages point to it, and btw, the following are by definition important pages.”

Influencing the Computation
[Figure: graph structure for entire web]

Influencing the Computation
[Figure: pick a set of influence]

Influenced PageRank
- Input:
  - Web graph G
  - influence vector v
  - v : (page → degree of influence)
- Output:
  - Rank vector r : (page → page importance wrt v)
- r = IPR(G, v)
- How to choose v?

Topic-Sensitive PageRank
[Diagram repeated from the earlier Topic-Sensitive PageRank slide: offline TSPageRank() over Web and Yahoo! or ODP; query-time Classifier and Query Processor.]

Topic-Sensitive PageRank: Part I (preprocessing)
- Goal: Generate multiple a-priori estimates of page importance, each score providing an importance estimate with respect to a topic
- Use the Open Directory as a source of representative basis topics (i.e., use ODP pages to form a set of influence vectors $v_j$)
- Offline preprocessing step, just as with ordinary PageRank

Offline Processing
- Input:
  - Web W
  - Basis topics $[c_1, \ldots, c_{16}]$
  - We use 16 categories (first level of ODP)
- Output:
  - List of rank vectors $[r_1, \ldots, r_{16}]$
  - $r_j$ : (page → page importance wrt topic $c_j$)
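One common way to realize the influenced computation r = IPR(G, v) described above is to mix link propagation with a jump to the influence vector. The sketch below assumes that form; the parameter `alpha`, the treatment of pages without out-links, and the toy graph are assumptions rather than details taken from the slides.

```python
def ipr(links, v, alpha=0.25, iterations=100):
    """Influenced PageRank sketch: r = (1 - alpha) * M r + alpha * v.

    links: page -> list of pages it points to (every page must be a key).
    v: page -> degree of influence, assumed to sum to 1.
    alpha: assumed strength of the influence vector (not given in the slides).
    """
    pages = sorted(links)
    r = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        nxt = {p: alpha * v.get(p, 0.0) for p in pages}
        for src in pages:
            out = links[src]
            # Assumption: a page with no out-links hands its rank back through v.
            targets = {dst: 1.0 / len(out) for dst in out} if out else v
            for dst, share in targets.items():
                nxt[dst] += (1.0 - alpha) * r[src] * share
        r = nxt
    return r


# Hypothetical four-page web; page "d" has no incoming links.
toy_web = {"a": ["b"], "b": ["a", "c"], "c": ["a"], "d": ["c"]}
uniform = {p: 0.25 for p in toy_web}   # no page is favoured
favour_d = {"d": 1.0}                  # "d" is important by definition
r_plain = ipr(toy_web, uniform)
r_influenced = ipr(toy_web, favour_d)
```

With the uniform vector, `d` receives only its small share of the jump; once `d` is declared important through `v`, its own rank rises and so does the rank of the page it links to.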

Offline Processing
For each topic $c_j \in \text{FirstLevel}(\text{ODP})$:
- set
  $$v_j[i] = \begin{cases} \dfrac{1}{|\text{pages}(c_j)|} & \text{if } i \in \text{pages}(c_j) \\ 0 & \text{otherwise} \end{cases}$$
- Compute $r_j = \text{IPR}(W, v_j)$

Graphical Depiction of Part I
[Figure: Sports. Select set of influence, calculate PageRank for all pages.]
For example, $r_{\text{sports}}[d] = .05$

Graphical Depiction of Part I
[Figure: Health. Select set of influence, calculate PageRank for all pages.]
For example, $r_{\text{health}}[d] = .01$

Topic-Sensitive PageRank
[Diagram repeated from the earlier Topic-Sensitive PageRank slide: offline TSPageRank() over Web and Yahoo! or ODP; query-time Classifier and Query Processor.]
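The influence vector defined in the Offline Processing slide above can be built directly from a topic's page list. The following is a small sketch; the page identifiers and topic listings are hypothetical, and restricting each topic to pages that actually occur in the graph is an added assumption.

```python
def influence_vector(topic_pages, all_pages):
    """v_j[i] = 1/|pages(c_j)| if page i is listed under topic c_j, else 0."""
    # Assumption: only count listed pages that are present in the web graph.
    members = set(topic_pages) & set(all_pages)
    if not members:
        raise ValueError("topic has no pages in the web graph")
    share = 1.0 / len(members)
    return {page: (share if page in members else 0.0) for page in all_pages}


# Hypothetical page identifiers and topic listings, for illustration only.
all_pages = ["p1", "p2", "p3", "p4", "p5"]
odp_listing = {"Sports": ["p1", "p2"], "Health": ["p3", "p4", "p5"]}
vectors = {topic: influence_vector(pages, all_pages)
           for topic, pages in odp_listing.items()}
```

Each resulting vector sums to 1, so it can be passed as the influence vector of an influenced PageRank run to obtain that topic's rank vector.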

Topic-Sensitive PageRank: Part II (query processing)
- Goal: calculate some distribution of weights over the 16 topics in our basis
- Use a multinomial Naive Bayes classifier
  - Training set: pages listed in ODP
  - Input: {query} or {query, context}
  - Output: probability distribution (weights) over the basis topics

Two Usage Scenarios
- Classify the query
- Classify the query + context
  - query history
  - words surrounding a highlighted search phrase
  - ...

Classify the Query
- Only the link structure of pages relevant to the query topic will be used to rank pages
- Better to rank query ‘golf’ with the Sports-specific rank vector

Example Topic Distribution
For the query ‘golf’, with no additional context, the distribution of topic weights we would use is:
[Figure: bar chart of topic weights, on a scale from 0 to 1, across the 16 basis topics]
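The query-classification step above can be illustrated with a tiny multinomial Naive Bayes classifier. This is only a sketch under stated assumptions: a uniform prior over topics, add-one smoothing, whitespace tokenization, context handled by appending it to the query, and a made-up two-topic training set standing in for ODP pages.

```python
import math
from collections import Counter


def train(docs_by_topic):
    """docs_by_topic: topic -> list of training texts (stand-ins for ODP-listed pages)."""
    counts = {t: Counter(w for d in docs for w in d.lower().split())
              for t, docs in docs_by_topic.items()}
    vocab = {w for c in counts.values() for w in c}
    return counts, vocab


def topic_weights(text, counts, vocab):
    """Return P(topic | text), assuming a uniform prior and add-one smoothing."""
    log_scores = {}
    for topic, c in counts.items():
        total = sum(c.values()) + len(vocab)
        # Words outside the training vocabulary are ignored.
        log_scores[topic] = sum(math.log((c[w] + 1) / total)
                                for w in text.lower().split() if w in vocab)
    top = max(log_scores.values())
    unnorm = {t: math.exp(s - top) for t, s in log_scores.items()}
    z = sum(unnorm.values())
    return {t: u / z for t, u in unnorm.items()}


# Hypothetical training texts, for illustration only.
training = {
    "Sports": ["golf club tournament score", "football league score"],
    "Business": ["financial services investments", "stock market investments fund"],
}
counts, vocab = train(training)
w_query = topic_weights("golf", counts, vocab)
w_context = topic_weights("golf financial services investments", counts, vocab)
```

On this toy data the bare query leans toward Sports, while the same query with the previous query appended as context leans toward Business.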

Classify the Query Context
- The topic distribution will influence rankings to prefer pages important to the topic of the query context
- If user issues queries about investment opportunities, a follow-up query on ‘golf’ should be ranked with the Business-specific rank vector

Picking the Topic Distribution
If the query is ‘golf’, but the previous query was ‘financial services investments’, then the distribution of topic weights we would use is:
[Figure: bar chart of topic weights, on a scale from 0 to 1, across the 16 basis topics]

Composite Link Score
- Use the distribution w to weight the respective topic-specific ranks, forming the topic-sensitive PageRank score for document d:
  $$s_d = \sum_j w_j \, r_j[d]$$
- Weighted sum of rank vectors itself forms a valid rank vector

Interpretation of Composite Score
- For set of influence vectors $\{v_j\}$:
  $$\sum_j [w_j \cdot \text{IPR}(W, v_j)] = \text{IPR}(W, \sum_j [w_j \cdot v_j])$$
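The composite link score above is a weighted sum over the precomputed topic-specific rank vectors, which is why it is cheap at query time. A minimal sketch follows; the rank values for page `d` reuse the examples given earlier in the slides (.05 for Sports, .01 for Health), while the weights and the dictionary layout are hypothetical.

```python
def composite_score(weights, rank_vectors, page):
    """s_d = sum over topics j of w_j * r_j[d]."""
    return sum(w * rank_vectors[topic].get(page, 0.0)
               for topic, w in weights.items())


# Rank values for page "d" follow the slide examples; the weights are hypothetical.
rank_vectors = {"Sports": {"d": 0.05}, "Health": {"d": 0.01}}
weights = {"Sports": 0.4, "Health": 0.6}
score = composite_score(weights, rank_vectors, "d")
```

With these assumed weights the score works out to 0.4 × .05 + 0.6 × .01 = .026.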

Interpretation
[Figures: two panels, labeled Health and Sports, each showing a set of influence and page d; captions “First set of influence” and “Second set of influence”]

Interpretation
[Figure: Health and Sports sets of influence combined in one graph, with page d]
Topic-sensitive score is PageRank of above graph
For example, $r_{\{\text{sports},\text{health}\}}[d] = .026$

Implementation Platform
- Stanford WebBase repository: 120M pages
- For research experiments, topic weights can be estimated automatically by classifier, or specified explicitly
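The claim that the topic-sensitive score is itself a PageRank over the combined set of influence can be motivated with a short derivation. It assumes, as the slides do not spell out, that IPR(W, v) is the fixed point of $r = (1-\alpha) M r + \alpha v$, where $M$ is the column-stochastic link matrix of W and $0 < \alpha \le 1$ is a mixing parameter (both symbols are introduced here for illustration).

```latex
\begin{aligned}
r &= (1-\alpha)\, M r + \alpha\, v
  \;\;\Longrightarrow\;\;
  \text{IPR}(W, v) = \alpha \,\bigl(I - (1-\alpha) M\bigr)^{-1} v \\
\sum_j w_j \cdot \text{IPR}(W, v_j)
  &= \alpha \,\bigl(I - (1-\alpha) M\bigr)^{-1} \sum_j w_j \cdot v_j
   = \text{IPR}\Bigl(W, \sum_j w_j \cdot v_j\Bigr)
\end{aligned}
```

Under that assumption IPR is linear in its influence vector, so weighting the rank vectors by w gives the same result as running one influenced PageRank with the weighted mixture of influence vectors.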

Does it make a difference?
- Do the different topical rank vectors rank results for queries differently?
- To answer, measure the similarity of induced ranks for some set of test query results
- Details in paper, but short answer is, “yes, the different rank vectors induce different result rankings”

User Study (no search context)
- Test set of 10 queries
- 5 users were each shown top 10 results to queries, when ranked using
  - Standard PageRank vector
  - Topic-Sensitive PageRank vector
- A page in the result was “relevant” if 3 of the 5 users judged it to be relevant

User Study (no search context)
[Figure: user study results]

User Study Follow-up
- After factoring in text-based scoring, the precision values for both standard and topic-sensitive ranking go up
- Topic-sensitive rankings still preferred
- “Precision” not the best metric to use
  - Some pages are “more relevant”
  - Some pages are of “higher quality”
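The relevance rule and precision measure from the user study above can be written out as a short sketch. It reads “3 of the 5 users” as “at least 3”, which is an assumption, and the judgments below are invented for illustration; they are not the study's data.

```python
def is_relevant(judgments, threshold=3):
    """A page counts as relevant if at least `threshold` users judged it relevant."""
    return sum(1 for j in judgments if j) >= threshold


def precision(results, judgments_by_page, threshold=3):
    """Fraction of returned results that count as relevant."""
    if not results:
        return 0.0
    hits = sum(is_relevant(judgments_by_page.get(p, []), threshold)
               for p in results)
    return hits / len(results)


# Hypothetical judgments from 5 users for 4 returned pages (True = judged relevant).
judgments = {
    "p1": [True, True, True, True, True],
    "p2": [True, True, True, False, False],
    "p3": [True, True, False, False, False],
    "p4": [False, False, False, False, False],
}
p_at_4 = precision(["p1", "p2", "p3", "p4"], judgments)
```

Here two of the four pages reach the three-vote threshold, so the precision is 0.5. As the follow-up slide notes, this binary measure ignores degrees of relevance and quality.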

Query for ‘golf’ (topic-sensitivity disabled)
[Screenshot]

Results for ‘golf’
[Screenshot]

Enable History Tracking
[Screenshot: query ‘financial services investments’]

Results
[Screenshot: results for ‘financial services investments’]

‘golf’ again, but query history judged to be Business topic
[Screenshot]

Search Context
Advantages of mediating through basis topics, as opposed to ‘keyword extraction’:
- Flexibility: uniformly treat variety of sources of context and personalization
- Transparency: topic weights are easily interpreted by user
- Privacy: topic weights reveal less unintentionally
- Efficiency: low query time cost, with small additional preprocessing cost

Future Work
- Finer grained set of representative topics, to reflect more accurately user preferences and search context

Future Work
- Graph weighting scheme based on page similarity to ODP category, rather than page membership to ODP category
