
Introduction to Information Retrieval
TDT4215 Web intelligence
Based on slides from: Christopher Manning and Prabhakar Raghavan
Chapter 19: Web search basics

Brief (non-technical) history

  • Early keyword-based engines ca. 1995-1997
    • Altavista, Excite, Infoseek, Inktomi, Lycos
  • Paid search ranking: Goto (morphed into Overture.com → Yahoo!)
    • Your search ranking depended on how much you paid
    • Auction for keywords: casino was expensive!

Brief (non-technical) history

  • 1998+: Link-based ranking pioneered by Google
    • Blew away all early engines save Inktomi
    • Great user experience in search of a business model
  • Meanwhile Goto/Overture's annual revenues were nearing $1 billion
  • Result: Google added paid search "ads" to the side, independent of search results
    • Yahoo followed suit, acquiring Overture (for paid placement) and Inktomi (for search)
  • 2005+: Google gains search share, dominating in Europe and very strong in North America
  • 2009: Yahoo! and Microsoft propose combined paid search offering

Paid Search Ads

[Figure: a results page showing paid search ads alongside algorithmic results]

Web search basics (Sec. 19.4.1)

[Figure: anatomy of a web search engine, illustrated with the query "miele".
The user issues a search; ad indexes return sponsored links (e.g., CG
Appliance Express, www.vacuums.com, www.best-vacuum.com), while the main
indexes return algorithmic results (e.g., www.miele.com, www.miele.co.uk,
www.miele.de, www.miele.at). A web spider crawls the Web and feeds pages to
the indexer, which builds the indexes.]

User Needs (Sec. 19.4.1)

  • Need [Brod02, RL04]
    • Informational – want to learn about something (~40% / 65%)
      • e.g., low hemoglobin
    • Navigational – want to go to that page (~25% / 15%)
      • e.g., United Airlines
    • Transactional – want to do something (web-mediated) (~35% / 20%)
      • Access a service: e.g., Seattle weather
      • Downloads: e.g., Mars surface images
      • Shop: e.g., Canon S410
    • Gray areas
      • Find a good hub: e.g., car rental Brasil
      • Exploratory search: "see what's there"

How far do people look for results?

[Figure: distribution of how many result pages users examine.
Source: iprospect.com, WhitePaper_2006_SearchEngineUserBehavior.pdf]

Users' empirical evaluation of results

  • Quality of pages varies widely
    • Relevance is not enough
    • Other desirable qualities (non-IR!)
      • Content: trustworthy, diverse, non-duplicated, well maintained
      • Web readability: display correctly & fast
      • No annoyances: pop-ups, etc.
  • Precision vs. recall
    • On the web, recall seldom matters
    • What matters
      • Precision at 1? Precision above the fold?
      • Comprehensiveness – must be able to deal with obscure queries
    • Recall matters when the number of matches is very small
  • User perceptions may be unscientific, but are significant over a large aggregate

Users' empirical evaluation of engines

  • Relevance and validity of results
  • UI – simple, no clutter, error tolerant
  • Trust – results are objective
  • Coverage of topics for polysemic queries
  • Pre/post-process tools provided
    • Mitigate user errors (auto spell check, search assist, …)
    • Explicit: search within results, more like this, refine ...
    • Anticipative: related searches
  • Deal with idiosyncrasies
    • Web-specific vocabulary
      • Impact on stemming, spell-check, etc.
    • Web addresses typed in the search box
    • "The first, the last, the best and the worst …"

The Web document collection (Sec. 19.2)

  • No design/co-ordination
  • Distributed content creation, linking, democratization of publishing
  • Content includes truth, lies, obsolete information, contradictions …
  • Unstructured (text, html, …), semi-structured (XML, annotated photos), structured (databases) …
  • Scale much larger than previous text collections … but corporate records are catching up
  • Growth – slowed down from initial "volume doubling every few months" but still expanding
  • Content can be dynamically generated

Spam (Search Engine Optimization)

The trouble with paid search ads … (Sec. 19.2.2)

  • It costs money. What's the alternative?
  • Search Engine Optimization:
    • "Tuning" your web page to rank highly in the algorithmic search results for select keywords
    • Alternative to paying for placement
    • Thus, intrinsically a marketing function
  • Performed by companies, webmasters and consultants ("search engine optimizers") for their clients
  • Some perfectly legitimate, some very shady

Search engine optimization (Spam) (Sec. 19.2.2)

  • Motives
    • Commercial, political, religious, lobbies
    • Promotion funded by advertising budget
  • Operators
    • Contractors (search engine optimizers) for lobbies, companies
    • Web masters
    • Hosting services
  • Forums
    • E.g., Web master world ( www.webmasterworld.com )
      • Search engine specific tricks
      • Discussions about academic papers 

Simplest forms (Sec. 19.2.2)

  • First-generation engines relied heavily on tf/idf
    • The top-ranked pages for the query maui resort were the ones containing the most maui's and resort's
  • SEOs responded with dense repetitions of chosen terms
    • e.g., maui resort maui resort maui resort
  • Often, the repetitions would be in the same color as the background of the web page
    • Repeated terms got indexed by crawlers
    • But not visible to humans on browsers
  ⇒ Pure word density cannot be trusted as an IR signal (a toy demonstration follows)
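A minimal sketch of why raw term frequency is so easy to game (the scoring function and documents are invented for illustration, not any engine's actual ranking code):

```python
from collections import Counter

def tf_score(query_terms, doc_text):
    """Score a document by the summed raw term frequency of the query terms."""
    counts = Counter(doc_text.lower().split())
    return sum(counts[t] for t in query_terms)

query = ["maui", "resort"]
honest = "a quiet resort on the west shore of maui with ocean views"
stuffed = "maui resort " * 50 + "book now"   # keyword stuffing

print(tf_score(query, honest))   # 2
print(tf_score(query, stuffed))  # 100 -- dense repetition dominates the ranking
```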

Variants of keyword stuffing (Sec. 19.2.2)

  • Misleading meta-tags, excessive repetition
  • Hidden text with colors, style sheet tricks, etc.

  Meta-Tags = "… London hotels, hotel, holiday inn, hilton, discount, booking, reservation, sex, mp3, britney spears, viagra, …"

Cloaking (Sec. 19.2.2)

  • Serve fake content to the search engine spider
  • DNS cloaking: switch IP address. Impersonate

[Figure: a cloaking web server asks "Is this a search engine spider?" –
Yes: serve the SPAM page; No: serve the real doc.]
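A minimal sketch of the trick (a hypothetical server-side handler; the bot signatures are assumptions for illustration):

```python
# Hypothetical cloaking handler: crawlers get a keyword-stuffed page,
# human visitors get the real document.
KNOWN_SPIDERS = ("googlebot", "slurp", "bingbot")  # assumed signatures

def serve(user_agent: str) -> str:
    if any(bot in user_agent.lower() for bot in KNOWN_SPIDERS):
        return "<html>keyword-stuffed SPAM page for the spider</html>"
    return "<html>the real page for human visitors</html>"

print(serve("Googlebot/2.1"))          # spam version
print(serve("Mozilla/5.0 (Firefox)"))  # real doc
```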

More spam techniques (Sec. 19.2.2)

  • Doorway pages
    • Pages optimized for a single keyword that re-direct to the real target page
  • Link spamming
    • Mutual admiration societies, hidden links, awards – more on these later
    • Domain flooding: numerous domains that point or re-direct to a target page
  • Robots
    • Fake query stream – rank checking programs
      • "Curve-fit" ranking programs of search engines
    • Millions of submissions via Add-Url

The war against spam

  • Quality signals – prefer authoritative pages based on:
    • Votes from authors (linkage signals)
    • Votes from users (usage signals)
  • Policing of URL submissions
    • Anti-robot test
  • Limits on meta-keywords
  • Robust link analysis
    • Ignore statistically implausible linkage (or text)
    • Use link analysis to detect spammers (guilt by association)
  • Spam recognition by machine learning
    • Training set based on known spam
  • Family-friendly filters
    • Linguistic analysis, general classification techniques, etc.
    • For images: flesh tone detectors, source text analysis, etc.
  • Editorial intervention
    • Blacklists
    • Top queries audited
    • Complaints addressed
    • Suspect pattern detection

More on spam

  • Web search engines have policies on SEO practices they tolerate/block
    • http://help.yahoo.com/help/us/ysearch/index.html
    • http://www.google.com/intl/en/webmasters/
  • Adversarial IR: the unending (technical) battle between SEOs and web search engines
  • Research: http://airweb.cse.lehigh.edu/

Size of the web

What is the size of the web? (Sec. 19.5)

  • Issues
    • The web is really infinite
      • Dynamic content, e.g., calendars
      • Soft 404: www.yahoo.com/<anything> is a valid page
    • Static web contains syntactic duplication, mostly due to mirroring (~30%)
    • Some servers are seldom connected
  • Who cares?
    • Media, and consequently the user
    • Engine design
    • Engine crawl policy. Impact on recall.

What can we attempt to measure? (Sec. 19.5)

  • The relative sizes of search engines
    • The notion of a page being indexed is still reasonably well defined.
    • Already there are problems
      • Document extension: e.g., engines index pages not yet crawled, by indexing anchortext.
      • Document restriction: all engines restrict what is indexed (first n words, only relevant words, etc.)
  • The coverage of a search engine relative to another particular crawling process.

New definition? (Sec. 19.5)

  (IQ is whatever the IQ tests measure.)

  • The statically indexable web is whatever search engines index.
  • Different engines have different preferences
    • Max url depth, max count/host, anti-spam rules, priority rules, etc.
  • Different engines index different things under the same URL:
    • Frames, meta-keywords, document restrictions, document extensions, ...

Relative Size from Overlap (Sec. 19.5)

Given two engines A and B:
sample URLs randomly from A, check if contained in B, and vice versa.

  A ∩ B = (1/2) · Size A
  A ∩ B = (1/6) · Size B

  (1/2) · Size A = (1/6) · Size B  ⇒  Size A / Size B = (1/6) / (1/2) = 1/3

Each test involves: (i) sampling, (ii) checking. (The arithmetic is spelled out in the sketch below.)
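A minimal sketch of the computation, using the slide's example fractions (1/2 and 1/6 are assumed measurement outcomes, not real data):

```python
# Fraction of URLs sampled from A found in B, and vice versa.
frac_A_in_B = 1 / 2   # estimates |A ∩ B| / |A|
frac_B_in_A = 1 / 6   # estimates |A ∩ B| / |B|

# Both fractions estimate the same intersection, so
# frac_A_in_B * |A| = |A ∩ B| = frac_B_in_A * |B|, giving:
size_A_over_size_B = frac_B_in_A / frac_A_in_B
print(size_A_over_size_B)  # 0.333... = 1/3
```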

Sampling URLs (Sec. 19.5)

  • Ideal strategy: generate a random URL and check for containment in each index.
  • Problem: random URLs are hard to find! It is enough to generate a random URL contained in a given engine.
  • Approach 1: generate a random URL contained in a given engine
    • Suffices for the estimation of relative size
  • Approach 2: random walks / IP addresses
    • In theory: might give us a true estimate of the size of the web (as opposed to just relative sizes of indexes)

Statistical methods (Sec. 19.5)

  • Approach 1
    • Random queries
    • Random searches
  • Approach 2
    • Random IP addresses
    • Random walks

Random URLs from random queries (Sec. 19.5)

  • Generate random query: how?
    • Lexicon: 400,000+ words from a web crawl
    • Not an English dictionary
  • Conjunctive queries: w1 AND w2
    • e.g., vocalists AND rsi
  • Get 100 result URLs from engine A
  • Choose a random URL as the candidate to check for presence in engine B
  • This distribution induces a probability weight W(p) for each page p
  • Conjecture: W(SE_A) / W(SE_B) ~ |SE_A| / |SE_B| (a sketch of the sampling step follows)
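A minimal sketch of the sampling step (the five-word lexicon and the search() callable are placeholders; a real experiment uses a 400,000+ word crawl-derived lexicon and a live engine):

```python
import random

lexicon = ["vocalists", "rsi", "hemoglobin", "topographic", "watermarking"]

def random_conjunctive_query():
    w1, w2 = random.sample(lexicon, 2)
    return f"{w1} AND {w2}"

def sample_candidate_url(search):
    """search(query) -> list of result URLs (hypothetical engine API).
    Pick a random URL from the top 100 results of a random query."""
    urls = search(random_conjunctive_query())[:100]
    return random.choice(urls) if urls else None
```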

Query-Based Checking (Sec. 19.5)

  • Strong query to check whether an engine B has a document D (a sketch follows):
    • Download D. Get its list of words.
    • Use 8 low-frequency words as an AND query to B
    • Check if D is present in the result set
  • Problems:
    • Near duplicates
    • Frames
    • Redirects
    • Engine time-outs
    • Is an 8-word query good enough?
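A minimal sketch of strong-query construction (here "low frequency" is measured within the document itself; a real system would rank words by corpus-wide frequency, and search() is a hypothetical engine API):

```python
from collections import Counter

def strong_query(doc_text, k=8):
    """Build an AND query from the k lowest-frequency words of D."""
    counts = Counter(doc_text.lower().split())
    rarest = sorted(counts, key=counts.get)[:k]
    return " AND ".join(rarest)

def engine_has_doc(doc_text, doc_url, search):
    """True if engine B returns D for D's own strong query."""
    return doc_url in search(strong_query(doc_text))
```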
Advantages & disadvantages (Sec. 19.5)

  • Statistically sound under the induced weight.
  • Biases induced by random query
    • Query bias: favors content-rich pages in the language(s) of the lexicon
    • Ranking bias: solution: use conjunctive queries & fetch all
    • Checking bias: duplicates, impoverished pages omitted
    • Document or query restriction bias: engine might not deal properly with an 8-word conjunctive query
    • Malicious bias: sabotage by engine
    • Operational problems: time-outs, failures, engine inconsistencies, index modification.

Random searches (Sec. 19.5)

  • Choose random searches extracted from a local log [Lawrence & Giles 97] or build "random searches" [Notess]
  • Use only queries with small result sets.
  • Count normalized URLs in result sets.
  • Use ratio statistics.

Advantages & disadvantages (Sec. 19.5)

  • Advantage
    • Might be a better reflection of the human perception of coverage
  • Issues
    • Samples are correlated with the source of the log
    • Duplicates
    • Technical statistical problems (must have non-zero results, ratio average not statistically sound)

Random searches (Sec. 19.5)

  • 575 & 1050 queries from the NEC RI employee logs
  • 6 engines in 1998, 11 in 1999
  • Implementation:
    • Restricted to queries with < 600 results in total
    • Counted URLs from each engine after verifying query match
    • Computed size ratio & overlap for individual queries
    • Estimated index size ratio & overlap by averaging over all queries

Queries from the Lawrence and Giles study (Sec. 19.5)

  • adaptive access control
  • neighborhood preservation topographic
  • hamiltonian structures
  • right linear grammar
  • pulse width modulation neural
  • unbalanced prior probabilities
  • ranked assignment method
  • internet explorer favourites importing
  • karvel thornber
  • zili liu
  • softmax activation function
  • bose multidimensional system theory
  • gamma mlp
  • dvi2pdf
  • john oliensis
  • rieke spikes exploring neural
  • video watermarking
  • counterpropagation network
  • fat shattering dimension
  • abelson amorphous computing

Random IP addresses (Sec. 19.5)

  • Generate random IP addresses
  • Find a web server at the given address
    • If there's one, collect all pages from the server
  • From this, choose a page at random (a sketch of the probing step follows)
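A minimal sketch of the probing step (illustration only; the studies cited on the next slide also excluded empty and authorization-required servers, and then crawled every page on responding hosts):

```python
import random
import urllib.request

def random_ip():
    """Draw a uniformly random IPv4 address."""
    return ".".join(str(random.randint(0, 255)) for _ in range(4))

def probe(ip, timeout=2):
    """Return homepage bytes if a web server answers at this IP, else None."""
    try:
        with urllib.request.urlopen(f"http://{ip}/", timeout=timeout) as resp:
            return resp.read()
    except OSError:
        return None  # no server, connection refused, or timed out
```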
Random IP addresses (Sec. 19.5)

  • HTTP requests to random IP addresses
    • Ignored: empty, authorization required, or excluded
  • [Lawr99] estimated 2.8 million IP addresses running crawlable web servers (16 million total) from observing 2500 servers
  • OCLC, using IP sampling, found 8.7 M hosts in 2001
  • Netcraft [Netc02] accessed 37.2 million hosts in July 2002
  • [Lawr99] exhaustively crawled 2500 servers and extrapolated
    • Estimated size of the web to be 800 million pages
    • Estimated use of metadata descriptors:
      • Meta tags (keywords, description) in 34% of home pages, Dublin Core metadata in 0.3%

Advantages & disadvantages (Sec. 19.5)

  • Advantages
    • Clean statistics
    • Independent of crawling strategies
  • Disadvantages
    • Doesn't deal with duplication
    • Many hosts might share one IP, or not accept requests
    • No guarantee all pages are linked to the root page
      • E.g., employee pages
    • Power law for # pages/host generates bias towards sites with few pages
      • But the bias can be accurately quantified IF the underlying distribution is understood
    • Potentially influenced by spamming (multiple IPs for the same server to avoid IP block)

Random walks (Sec. 19.5)

  • View the Web as a directed graph
  • Build a random walk on this graph (a sketch follows)
    • Includes various "jump" rules back to visited sites
      • Does not get stuck in spider traps!
      • Can follow all links!
    • Converges to a stationary distribution
      • Must assume the graph is finite and independent of the walk
      • Conditions are not satisfied (cookie crumbs, flooding)
      • Time to convergence not really known
    • Sample from the stationary distribution of the walk
    • Use the "strong query" method to check coverage by a search engine
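A minimal sketch of a random walk with jumps on a toy web graph (the graph, the 15% jump probability, and the burn-in heuristic are all invented for illustration):

```python
import random

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a", "d"], "d": ["a"]}  # page -> outlinks

def random_walk(start="a", steps=100_000, jump_prob=0.15):
    visited, page = [start], start
    for _ in range(steps):
        if random.random() < jump_prob or not graph[page]:
            page = random.choice(visited)      # "jump" back to a visited site
        else:
            page = random.choice(graph[page])  # follow a random outlink
        visited.append(page)
    return visited[len(visited) // 2:]         # drop burn-in samples

sample_page = random.choice(random_walk())  # approx. stationary distribution
```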

Advantages & disadvantages (Sec. 19.5)

  • Advantages
    • "Statistically clean" method, at least in theory!
    • Could work even for an infinite web (assuming convergence) under certain metrics
  • Disadvantages
    • List of seeds is a problem
    • Practical approximation might not be valid
    • Non-uniform distribution
    • Subject to link spamming

Conclusions (Sec. 19.5)

  • No sampling solution is perfect
  • Lots of new ideas ...
  • ... but the problem is getting harder
  • Quantitative studies are fascinating and a good research problem

Duplicate detection (Sec. 19.6)

Duplicate documents (Sec. 19.6)

  • The web is full of duplicated content
  • Strict duplicate detection = exact match
    • Not as common
  • But many, many cases of near duplicates
    • E.g., the last-modified date is the only difference between two copies of a page

Duplicate/Near-Duplicate Detection (Sec. 19.6)

  • Duplication: exact match can be detected with fingerprints
  • Near-duplication: approximate match
  • Overview
    • Compute syntactic similarity with an edit-distance measure
    • Use a similarity threshold to detect near-duplicates
      • E.g., similarity > 80% ⇒ documents are "near duplicates"
      • Not transitive, though sometimes used transitively

Computing Similarity (Sec. 19.6)

  • Features:
    • Segments of a document (natural or artificial breakpoints)
    • Shingles (word n-grams)
    • a rose is a rose is a rose →
      a_rose_is_a  rose_is_a_rose  is_a_rose_is  a_rose_is_a
  • Similarity measure between two docs (= sets of shingles)
    • Set intersection
    • Specifically, Size_of_Intersection / Size_of_Union (a sketch follows)
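A minimal sketch of shingling and the intersection-over-union (Jaccard) measure (example strings invented for illustration):

```python
def shingles(text, k=4):
    """The set of word k-grams of a text."""
    words = text.lower().split()
    return {"_".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b)

s1 = shingles("a rose is a rose is a rose")
s2 = shingles("a rose is a rose")
print(s1)               # {'a_rose_is_a', 'rose_is_a_rose', 'is_a_rose_is'}
print(jaccard(s1, s2))  # 2/3: shared shingles over all shingles
```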

Shingles + Set Intersection (Sec. 19.6)

  • Computing exact set intersection of shingles between all pairs of documents is expensive/intractable
  • Approximate using a cleverly chosen subset of shingles from each (a sketch)
  • Estimate Size_of_Intersection / Size_of_Union based on a short sketch

[Figure: Doc A → shingle set A → sketch A; Doc B → shingle set B → sketch B;
the two sketches are compared to estimate the Jaccard coefficient.]

Sketch of a document (Sec. 19.6)

  • Create a "sketch vector" (of size ~200) for each document
    • Documents that share ≥ t (say 80%) corresponding vector elements are near duplicates
  • For doc D, sketch_D[i] is computed as follows:
    • Let f map all shingles in the universe to 0..2^m (e.g., f = fingerprinting)
    • Let π_i be a random permutation on 0..2^m
    • Pick MIN { π_i(f(s)) } over all shingles s in D
  (a sketch of this construction follows)

Computing Sketch[i] for Doc 1 (Sec. 19.6)

[Figure: the shingles of Document 1 shown as points on the number line
0..2^64. Start with the 64-bit fingerprints f(shingle); permute the number
line with π_i; pick the min value as Sketch[i].]

Test if Doc1.Sketch[i] = Doc2.Sketch[i] (Sec. 19.6)

[Figure: the permuted fingerprints of Document 1 and Document 2 on the
number line 0..2^64, with minima A and B. Are these equal? Test for 200
random permutations: π_1, π_2, …, π_200.]

However… (Sec. 19.6)

[Figure: the fingerprints of Documents 1 and 2 on the number line, with
minima A and B in two configurations.]

A = B iff the shingle with the MIN value in the union of Doc 1 and Doc 2 is
common to both (i.e., lies in the intersection).

Claim: this happens with probability
  Size_of_Intersection / Size_of_Union.

Why?

Set Similarity of sets C_i, C_j (Sec. 19.6)

  Jaccard(C_i, C_j) = |C_i ∩ C_j| / |C_i ∪ C_j|

  • View sets as columns of a matrix A; one row for each element in the
    universe. a_ij = 1 indicates the presence of item i in set j.
  • Example:

      C1  C2
       0   1
       1   0
       1   1
       0   0
       1   1
       0   1      Jaccard(C1, C2) = 2/5 = 0.4

Key Observation (Sec. 19.6)

  • For columns C_i, C_j, there are four types of rows:

         C_i  C_j
     A    1    1
     B    1    0
     C    0    1
     D    0    0

  • Overload notation: A = # of rows of type A
  • Claim:
      Jaccard(C_i, C_j) = A / (A + B + C)

"Min" Hashing (Sec. 19.6)

  • Randomly permute rows
  • Hash h(C_i) = index of the first row with a 1 in column C_i
  • Surprising property:
      P[ h(C_i) = h(C_j) ] = Jaccard(C_i, C_j)
  • Why? Both sides equal A / (A + B + C)
    • Look down columns C_i, C_j until the first non-type-D row
    • h(C_i) = h(C_j) ⟺ that row is of type A

Min-Hash sketches (Sec. 19.6)

  • Pick P random row permutations
  • MinHash sketch
    • Sketch_D = list of P indexes of first rows with 1 in column C_D
  • Similarity of signatures
    • Let sim[sketch(C_i), sketch(C_j)] = fraction of permutations where the MinHash values agree
    • Observe: E[ sim(sig(C_i), sig(C_j)) ] = Jaccard(C_i, C_j) (a sketch follows)
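A minimal sketch of comparing two signatures (it pairs with the sketch() function sketched earlier; both are illustrations, not a library API):

```python
def sig_similarity(sig_a, sig_b):
    """Fraction of permutations where the MinHash values agree."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

# Expected behavior: sig_similarity(sketch(s1, pis), sketch(s2, pis))
# approximates jaccard(s1, s2) as the number of permutations grows.
```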
Example (Sec. 19.6)

  Input matrix:
        C1  C2  C3
    R1   1   0   1
    R2   0   1   1
    R3   1   0   0
    R4   1   0   1
    R5   0   1   0

  Signatures (S1 S2 S3):
    Perm 1 = (12345):  1  2  1
    Perm 2 = (54321):  4  5  4
    Perm 3 = (34512):  3  5  4
    Perm 4 = (32154):  3  2  2

  Similarities:         1-2   1-3   2-3
    Col-Col             0.00  0.50  0.25
    Sig-Sig (3 perm)    0.00  0.67  0.00
    Sig-Sig (4 perm)    0.00  0.50  0.25

Altered by: Stein L. Tomassen

Implementation Trick (Sec. 19.6)

  • Permuting the universe even once is prohibitive
  • Row hashing
    • Pick P hash functions h_k: {1,…,n} → {1,…,O(n)}
    • Ordering under h_k gives a random permutation of rows
  • Super-shingles
    • Group the sketch into non-overlapping n-grams (super-shingles)
    • Hash each group (super-shingle)
    • Only compare documents based on super-shingle agreement
    • Store and sort super-shingles by different columns
  (a sketch of both tricks follows)

Altered by: Stein L. Tomassen
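A minimal sketch of row hashing plus super-shingles (the chunk size, the number of hash functions, and the toy shingle fingerprints are illustrative choices):

```python
import random

PRIME = (1 << 61) - 1

def make_hashers(p):
    """P random hash functions; ordering under each acts as a permutation."""
    coeffs = [(random.randrange(1, PRIME), random.randrange(PRIME)) for _ in range(p)]
    return [lambda x, a=a, b=b: (a * x + b) % PRIME for a, b in coeffs]

def minhash_sketch(shingle_ids, hashers):
    return [min(h(s) for s in shingle_ids) for h in hashers]

def super_shingles(sk, chunk=4):
    """Hash non-overlapping chunks of the sketch; documents sharing any
    super-shingle become candidate near-duplicates."""
    return {hash(tuple(sk[i:i + chunk])) for i in range(0, len(sk), chunk)}

hashers = make_hashers(200)
sk = minhash_sketch({101, 202, 303}, hashers)   # toy shingle fingerprints
print(len(super_shingles(sk)))                  # up to 50 super-shingles
```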

Example (Sec. 19.6)

  • Text = a rose is a rose is a rose
  • 4-grams = { a_rose_is_a, rose_is_a_rose, is_a_rose_is }
  • Hash-1 = { 256, 456, 123 }, MinHash-1 = { 123 }
  • Hash-2 = { 756, 156, 235 }, MinHash-2 = { 156 }
  • …
  • Hash-199 = { 187, 154, 456 }, MinHash-199 = { 154 }
  • Hash-200 = { 106, 594, 133 }, MinHash-200 = { 106 }
  • Sketch = { 123, 156, …, 154, 106 }
  • 4-gram super-shingles = { { 123, 156 }, …, { 154, 106 } }
  • Super-shingles = { 2456, …, 7543 }

Altered by: Stein L. Tomassen

More resources

  • IIR Chapter 19