Searching for Authority on the WWW (Not just relevance or popularity...) Ido Rosen <ido@cs.uchicago.edu>

Sources of Information on the WWW • Textual content • Images, sounds, multimedia content • Hyperlink digraph (network structure) • Pages are vertices, links are arcs • Refinement: URLs are nodes

Nature of the WWW • Local organization may be a priori. • Global organization “utterly unplanned.” • Billions of agents (users, spiders). • Millions of publishers. • Trillions of vertices, at least. • Too big for simple search.

Searching the WWW • Quality of a search method is defined by the utility of its results. • Utility requires human evaluation. • Utility is closely correlated with relevance. • Algorithmic and storage efficiency are a concern: interactivity/response time.

Search: Queries • Searches are initiated by a user-supplied query. • Three types of queries discussed: • Specific queries. • Broad-topic queries. • Similar-content queries.

Search: Problems • Specific queries: Scarcity. • Required information is scarce and pages are hard to find. • Broad-topic queries: Abundance. • We only want the authoritative pages. (e.g.: Wikipedia itself, not ad-clones.)

Search: Authorities • Possible measures of authority: • Frequency of search term on page. • Problem: Self-descriptive. • Popularity of page. (Rank by number of in-links.) • Problem: Obfuscation by hubs. • Analysis of link structure...

Hyperlinks • Claim: Hyperlinks indicate conferred authority. • Claim: Hyperlinks solve the self-descriptive problem. • What about navigational links? • What about paid advertisements?

Popularity • In some cases, the most authoritative pages aren’t self-descriptive. • Universally popular pages would be considered highly authoritative w.r.t. any query string, even when they are not.

Step 1: Constructing Focused Subgraph • Obtain root set, R, from textual search. • Relatively small, rich in relevant pages, but doesn’t contain most or many of the strongest authorities. • Extremely few intra-R links. • Obtain base set, S, from R by adding any page that points to a page in R or is pointed to by a page in R.
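The expansion from R to S can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the link tables `out_links` and `in_links`, the function names, and the cap `d` on in-pointing pages per root page are hypothetical and not taken from the slides, which say only that any linked page is added.

```python
def expand_root_set(root, out_links, in_links, d=50):
    """Grow a root set R into a base set S.

    root      : iterable of page ids returned by the text search (R)
    out_links : dict mapping page -> set of pages it points to
    in_links  : dict mapping page -> set of pages pointing to it
    d         : assumed cap on in-pointing pages added per root page
    """
    base = set(root)
    for page in root:
        # Add every page that this root page points to.
        base |= out_links.get(page, set())
        # Add at most d pages that point to this root page
        # (sorted only to make the choice deterministic).
        pointing = sorted(in_links.get(page, set()))
        base |= set(pointing[:d])
    return base


def induced_arcs(base, out_links):
    """Arcs of the subgraph induced on the base set."""
    return {(p, q)
            for p in base
            for q in out_links.get(p, set())
            if q in base}
```

The cap keeps one very popular root page from pulling thousands of pages into S; dropping it (a very large `d`) reproduces the literal "add any page" rule.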

Figure 1: Expanding the root set into a base set. R → S

• What about navigational links? • Transverse links (between different domains) vs. intrinsic links (within the same domain). • Delete all intrinsic links. • Caveats? • What about “Google Bombing”? • Set limitations on in-degree or out-degree on a per-domain basis.
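A minimal sketch of both filters, assuming pages are identified by URL and that "same domain" means "same hostname" (so `www.example.com` and `example.com` would count as different, which a real system might not want). The function names and the per-host cap `m` are hypothetical.

```python
from urllib.parse import urlsplit


def host(url):
    """Hostname of a URL, or '' if it has none."""
    return urlsplit(url).hostname or ""


def drop_intrinsic(arcs):
    """Keep only transverse arcs: source and target on different hosts."""
    return {(p, q) for (p, q) in arcs if host(p) != host(q)}


def cap_per_host(arcs, m):
    """Allow at most m pages from any one host to point to a given page."""
    kept = set()
    count = {}
    for p, q in sorted(arcs):  # sorted only for deterministic output
        key = (host(p), q)
        if count.get(key, 0) < m:
            count[key] = count.get(key, 0) + 1
            kept.add((p, q))
    return kept
```

Capping by source host is one way to blunt mass endorsement from a single site; it does nothing against links spread over many hosts.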

Step 2: Computing Hubs and Authorities • Given our focused subgraph G, now what? • Popularity ranking by in-degree? • Popularity ≠ relevance. • Hub: links to multiple relevant authorities. • Authorities: high in-degree and overlap. • Hubs & Authorities: Mutually reinforcing.

Figure 2: A densely linked set of hubs and authorities. (Figure labels: hubs; authorities; unrelated page of large in-degree.)

Iterative Algorithm • Subgraph G = (V, A). • Normalized weights for each page p: authority weight x<p> and hub weight y<p>. • Update operations, I & O. • Mutually reinforcing: • I: x<p> = ∑_{q : (q, p) ∈ A} y<q>. • O: y<p> = ∑_{q : (p, q) ∈ A} x<q>.

Figure 3: The basic operations. I & O

• x is the vector of all authority weights x<p>. • y is the vector of all hub weights y<p>. • Iterate(k): apply I & O k times, normalizing after each round. • Filter(c): obtain the c largest coordinates. • Choosing k is easy: x and y converge as k grows. (3.1)
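The convergence claim is easier to see in matrix form. The symbol M below is an assumption introduced here (the slides use A for the arc set): M is the n × n adjacency matrix of G, with M_{pq} = 1 exactly when (p, q) ∈ A, and z is the all-ones start vector.

```latex
% I and O as matrix-vector products
I:\; x \leftarrow M^{\mathsf{T}} y
\qquad
O:\; y \leftarrow M x

% After k rounds, ignoring the normalization constants
x_k \;\propto\; (M^{\mathsf{T}} M)^{k-1} M^{\mathsf{T}} z
\qquad
y_k \;\propto\; (M M^{\mathsf{T}})^{k} z
```

Each round multiplies by the symmetric matrices M^T M and M M^T, so this is power iteration: x tends toward a principal eigenvector of M^T M and y toward a principal eigenvector of M M^T, provided the start vector is not orthogonal to them.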

Iterate(G, k)
  G: a collection of n linked pages
  k: a natural number
  Let z denote the vector (1, 1, 1, ..., 1) ∈ R^n.
  Set x_0 := z.
  Set y_0 := z.
  For i = 1, 2, ..., k
    Apply the I operation to (x_{i−1}, y_{i−1}), obtaining new x-weights x′_i.
    Apply the O operation to (x′_i, y_{i−1}), obtaining new y-weights y′_i.
    Normalize x′_i, obtaining x_i.
    Normalize y′_i, obtaining y_i.
  End
  Return (x_k, y_k).

Filter(G, k, c)
  G: a collection of n linked pages
  k, c: natural numbers
  (x_k, y_k) := Iterate(G, k).
  Report the pages with the c largest coordinates in x_k as authorities.
  Report the pages with the c largest coordinates in y_k as hubs.
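The two procedures translate almost line for line into Python. This is a sketch, assuming the graph is given as a list of page ids plus a set of (source, target) arcs; the function names, the dictionary representation, and the alphabetical tie-breaking are choices made here, not part of the slides.

```python
import math


def normalize(weights):
    """Scale weights so that their squares sum to 1 (no-op if all zero)."""
    norm = math.sqrt(sum(v * v for v in weights.values()))
    if norm == 0:
        return dict(weights)
    return {p: v / norm for p, v in weights.items()}


def iterate(pages, arcs, k):
    """Run k rounds of the I and O operations; return (x_k, y_k)."""
    x = {p: 1.0 for p in pages}  # authority weights, start at z
    y = {p: 1.0 for p in pages}  # hub weights, start at z
    for _ in range(k):
        # I operation: x<p> = sum of y<q> over arcs (q, p).
        new_x = {p: 0.0 for p in pages}
        for q, p in arcs:
            new_x[p] += y[q]
        # O operation: y<p> = sum of the new x<q> over arcs (p, q).
        new_y = {p: 0.0 for p in pages}
        for p, q in arcs:
            new_y[p] += new_x[q]
        x, y = normalize(new_x), normalize(new_y)
    return x, y


def filter_top(pages, arcs, k, c):
    """Return (authorities, hubs): the c highest-weighted pages of each kind."""
    x, y = iterate(pages, arcs, k)

    def top(weights):
        # Highest weight first; ties broken by page id.
        return sorted(weights, key=lambda p: (-weights[p], p))[:c]

    return top(x), top(y)
```

For example, `filter_top(pages, arcs, k=20, c=10)` would report ten authorities and ten hubs; the value 20 is only an illustrative choice of k.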

Method Quirks • Textual search as black box. • Only probabilistically global. • Does not address scarcity problem.

Similar-Page Queries • “similar:www.example.com” • Very little modification necessary! • Obtain root set from in-pages search. • R = t pages pointing to p. • In-degree still not a good ranking.
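The only change for a similar-page query is how the root set is seeded, which can be sketched as below. The slides do not say how the t in-pointing pages are chosen, so taking the first t in sorted order is purely an assumption for determinism; the function name, the `in_links` table, and the default value of `t` are hypothetical.

```python
def similar_page_root_set(p, in_links, t=200):
    """Root set for a similar-page query: up to t pages that point to p.

    in_links : dict mapping page -> set of pages pointing to it
    t        : illustrative cap on the size of the root set
    """
    pointing = sorted(in_links.get(p, set()))
    return set(pointing[:t])
```

From this root set the base set is built and ranked exactly as for a text query, and the resulting authorities are reported as the pages similar to p.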

Related Work • Standing in social networks. • Influence in scientific citation networks. • PageRank. (i.e.: WWW indices, no hubs)

Multiple Sets of H&A • What about ambiguous query terms? (Terms with several meanings.) • What about different contexts? • What about polarized issues? (Groups that won’t link to one another, but are debating the same topic.) • Clusters exist.

Diffusion and Generalization • Diffusion: pages corresponding to “broader” topics than the query string are returned, or the reference page has insufficient in-degree. • Was the query string too specific? • Possible solutions? • Non-principal eigenvectors. • Textual approaches (e.g.: term-matching)

Conclusions • Abundance problem gets harder each day. • Calls for search engines to consider more than simple relevance and clustering. • Growth of WWW makes indexing harder. • WWW search results must be global; the WWW search process doesn’t have to be. • Quality of results is critical, more so as the WWW grows and becomes polluted.

Conclusions • WWW is social. (Social organization is represented.) • Further avenues: • User traffic pattern analysis. • Eigenvector-based heuristics. (LSA) • Link-based methods for other queries.
