3/17/09


Search Engine Architecture
CISC489/689-010, Lecture #2
Wednesday, Feb. 11
Ben Carterette


Search Engine Architecture

• A software architecture consists of software components, the interfaces provided by those components, and the relationships between them
  – describes a system at a particular level of abstraction
• Architecture of a search engine is determined by two requirements
  – effectiveness (quality of results) and efficiency (response time and throughput)




Indexing Process

[Diagram] Documents (e-mails, web pages, Word docs, news articles, …) → Text acquisition (crawler, feeds, filter, …) → Corpus (accessible data store) → Text transformation (parsing, stopping, stemming, extraction, …) → Index creation (document/term stats, weighting, inversion, …) → Server(s)


Indexing Process

• Text acquisition
  – identifies and stores documents for indexing
• Text transformation
  – transforms documents into index terms or features
• Index creation
  – takes index terms and creates data structures (indexes) to support fast searching
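The three stages can be sketched as a toy end-to-end pipeline (all function names and documents here are illustrative, not from the slides; a real system is far more elaborate at every stage):

```python
import re

def text_acquisition():
    # Toy stand-in for a crawler or feed: returns (doc_id, raw text) pairs.
    return [("d1", "Search engines index documents."),
            ("d2", "Documents are transformed into index terms.")]

def text_transformation(text):
    # Parse raw text into tokens; real systems also do stopping,
    # stemming, and information extraction here.
    return re.findall(r"[a-z]+", text.lower())

def index_creation(docs):
    # Build a term -> set-of-doc_ids structure to support fast searching.
    index = {}
    for doc_id, text in docs:
        for term in text_transformation(text):
            index.setdefault(term, set()).add(doc_id)
    return index

index = index_creation(text_acquisition())
print(sorted(index["documents"]))  # both toy documents contain "documents"
```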




Query Process

[Diagram] Corpus (accessible data store) → Server(s): Ranking f(Q,D) → Evaluation (precision, recall, clicks, …)


Query Process

• User interaction
  – supports creation and refinement of query, display of results
• Ranking
  – uses query and indexes to generate ranked list of documents
• Evaluation
  – monitors and measures effectiveness and efficiency (primarily offline)




Details: Text Acquisition

• Crawler
  – Identifies and acquires documents for search engine
  – Many types – web, enterprise, desktop
  – Web crawlers follow links to find documents
    • Must efficiently find huge numbers of web pages (coverage) and keep them up-to-date (freshness)
    • Single site crawlers for site search
    • Topical or focused crawlers for vertical search
  – Document crawlers for enterprise and desktop search
    • Follow links and scan directories


Text Acquisition

• Feeds
  – Real-time streams of documents
    • e.g., web feeds for news, blogs, video, radio, TV
  – RSS is a common standard
    • RSS “reader” can provide new XML documents to search engine
• Conversion
  – Convert variety of documents into a consistent text plus metadata format
    • e.g., HTML, XML, Word, PDF, etc. → XML
  – Convert text encoding for different languages
    • Using a Unicode standard like UTF-8



Text Acquisition

• Document data store
  – Stores text, metadata, and other related content for documents
    • Metadata is information about a document, such as type and creation date
    • Other content includes links and anchor text
  – Provides fast access to document contents for search engine components
    • e.g., result list generation
  – Could use a relational database system
    • More typically, a simpler, more efficient storage system is used due to huge numbers of documents


Text Transformation

• Parser
  – Processing the sequence of text tokens in the document to recognize structural elements
    • e.g., titles, links, headings, etc.
  – Tokenizer recognizes “words” in the text
    • must consider issues like capitalization, hyphens, apostrophes, non-alpha characters, separators
  – Markup languages such as HTML, XML often used to specify structure
    • Tags used to specify document elements
      – e.g., <h2> Overview </h2>
    • Document parser uses syntax of markup language (or other formatting) to identify structure




Text Transformation

• Stopping
  – Remove common words
    • e.g., “and”, “or”, “the”, “in”
  – Some impact on efficiency and effectiveness
  – Can be a problem for some queries
• Stemming
  – Group words derived from a common stem
    • e.g., “computer”, “computers”, “computing”, “compute”
  – Usually effective, but not for all queries
  – Benefits vary for different languages
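A minimal sketch of stopping and stemming (the stemmer here is a toy suffix-stripper for illustration only, not a real algorithm such as Porter's):

```python
STOPWORDS = {"and", "or", "the", "in"}

def stop(tokens):
    # Stopping: remove common words that carry little content.
    return [t for t in tokens if t not in STOPWORDS]

def stem(token):
    # Toy stemmer: strip a few common suffixes so related forms
    # ("computers", "computing") group under one stem.
    for suffix in ("ing", "ers", "er", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

tokens = ["the", "computers", "and", "computing"]
print([stem(t) for t in stop(tokens)])  # ['comput', 'comput']
```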


Text Transformation

• Link Analysis
  – Makes use of links and anchor text in web pages
  – Link analysis identifies popularity and community information
    • e.g., PageRank
  – Anchor text can significantly enhance the representation of pages pointed to by links
  – Significant impact on web search
    • Less important in other applications



Text Transformation

• Information Extraction
  – Identify classes of index terms that are important for some applications
  – e.g., named entity recognizers identify classes such as people, locations, companies, dates, etc.
• Classifier
  – Identifies class-related metadata for documents
    • i.e., assigns labels to documents
    • e.g., topics, reading levels, sentiment, genre
  – Use depends on application


Index Creation

• Document Statistics
  – Gathers counts and positions of words and other features
  – Used in ranking algorithm
• Weighting
  – Computes weights for index terms
  – Used in ranking algorithm
  – e.g., tf.idf weight
    • Combination of term frequency in document and inverse document frequency in the collection
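One common form of the tf.idf weight can be sketched as follows (this is just one of many variants; the counts in the example are made up):

```python
import math

def tfidf(term_count_in_doc, num_docs, num_docs_with_term):
    # tf: how often the term occurs in this document.
    tf = term_count_in_doc
    # idf: log of (collection size / documents containing the term);
    # rare terms get a high idf, common terms a low one.
    idf = math.log(num_docs / num_docs_with_term)
    return tf * idf

# A term appearing 3 times in a document, found in 10 of 1000 documents:
print(tfidf(3, 1000, 10))  # 3 * ln(100), roughly 13.8
```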




Index Creation

• Inversion
  – Core of indexing process
  – Converts document-term information to term-document for indexing
    • Difficult for very large numbers of documents
  – Format of inverted file is designed for fast query processing
    • Must also handle updates
    • Compression used for efficiency
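Inversion can be sketched as turning per-document term lists into per-term posting lists (a toy in-memory version; real systems must invert collections that do not fit in memory, typically by merging sorted runs on disk):

```python
def invert(doc_terms):
    # doc_terms: doc_id -> list of terms (document-term order).
    # Result: term -> list of (doc_id, count) postings (term-document order).
    postings = {}
    for doc_id, terms in sorted(doc_terms.items()):
        counts = {}
        for term in terms:
            counts[term] = counts.get(term, 0) + 1
        for term, count in counts.items():
            postings.setdefault(term, []).append((doc_id, count))
    return postings

index = invert({"d1": ["cat", "dog", "cat"], "d2": ["dog"]})
print(index["dog"])  # [('d1', 1), ('d2', 1)]
```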


Index Creation

• Index Distribution
  – Distributes indexes across multiple computers and/or multiple sites
  – Essential for fast query processing with large numbers of documents
  – Many variations
    • Document distribution, term distribution, replication
  – P2P and distributed IR involve search across multiple sites



User Interaction

• Query input
  – Provides interface and parser for a query language
  – Most web queries are very simple; other applications may use forms
  – Query language used to describe more complex queries and results of query transformation
    • e.g., Boolean queries, Indri and Galago query languages
    • similar to the SQL language used in database applications
    • IR query languages also allow content and structure specifications, but focus on content


User Interaction

• Query transformation
  – Improves initial query, both before and after initial search
  – Includes text transformation techniques used for documents
  – Spell checking and query suggestion provide alternatives to the original query
  – Query expansion and relevance feedback modify the original query with additional terms



User Interaction

• Results output
  – Constructs the display of ranked documents for a query
  – Generates snippets to show how queries match documents
  – Highlights important words and passages
  – Retrieves appropriate advertising in many applications
  – May provide clustering and other visualization tools


Ranking

• Scoring
  – Calculates scores for documents using a ranking algorithm
  – Core component of search engine
  – Basic form of score is a sum over terms: score(Q, D) = Σ_t q_t · d_t
    • q_t and d_t are query and document term weights for term t
  – Many variations of ranking algorithms and retrieval models
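The basic score, a sum of q_t times d_t over terms, is a sparse dot product; a minimal sketch (the term weights below are made-up numbers, not from any real weighting scheme):

```python
def score(query_weights, doc_weights):
    # Basic ranking score: sum of q_t * d_t over terms t the query contains;
    # terms absent from the document contribute zero.
    return sum(q * doc_weights.get(term, 0.0)
               for term, q in query_weights.items())

q = {"tropical": 1.0, "fish": 1.0}
d1 = {"tropical": 0.5, "fish": 0.8, "tank": 0.3}
d2 = {"fish": 0.9}
print(score(q, d1), score(q, d2))  # d1 outranks d2: 1.3 vs. 0.9
```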




Ranking

• Performance optimization
  – Designing ranking algorithms for efficient processing
    • Term-at-a-time vs. document-at-a-time processing
    • Safe vs. unsafe optimizations
• Distribution
  – Processing queries in a distributed environment
  – Query broker distributes queries and assembles results
  – Caching is a form of distributed searching


Evaluation

• Logging
  – Logging user queries and interaction is crucial for improving search effectiveness and efficiency
  – Query logs and clickthrough data used for query suggestion, spell checking, query caching, ranking, advertising search, and other components
• Ranking analysis
  – Measuring and tuning ranking effectiveness
• Performance analysis
  – Measuring and tuning system efficiency




How Does It Really Work?

• This course explains these components of a search engine in more detail
• Often many possible approaches and techniques for a given component
  – Focus is on the most important alternatives
  – i.e., explain a small number of approaches in detail rather than many approaches
  – “Importance” based on research results and use in actual search engines
  – Alternatives described in references


Text Acquisition
Web Crawling, Feeds, and Storage




Web Crawler

• Finds and downloads web pages automatically
  – provides the collection for searching
• Web is huge and constantly growing
• Web is not under the control of search engine providers
• Web pages are constantly changing
• Crawlers also used for other types of data


Retrieving Web Pages

• Every page has a unique uniform resource locator (URL)
• Web pages are stored on web servers that use HTTP to exchange information with client software
  – e.g., …



Retrieving Web Pages

• Web crawler client program connects to a domain name system (DNS) server
• DNS server translates the hostname into an internet protocol (IP) address
• Crawler then attempts to connect to server host using specific port
• After connection, crawler sends an HTTP request to the web server to request a page
  – usually a GET request


Crawling the Web




Web Crawler

• Starts with a set of seeds, which are a set of URLs given to it as parameters
• Seeds are added to a URL request queue
• Crawler starts fetching pages from the request queue
• Downloaded pages are parsed to find link tags that might contain other useful URLs to fetch
• New URLs added to the crawler’s request queue, or frontier
• Continue until no more new URLs or disk full


Web Crawling

• The “long tail” of web pages
  [Figure: utility vs. number of pages; unordered crawling produces the many low-utility pages in the tail]




Web Crawling

• Ordering URLs
  – Crawl URLs in some order of “importance”
  – “Random surfer” model:
    • A user starts on a page and randomly clicks links.
    • Occasionally switches to a different page with no click.
    • What is the probability the user will land on any given page?
  – Higher probability → greater importance.
  – PageRank
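The random surfer model can be sketched as a tiny power-iteration PageRank (toy three-page graph; 0.85 is a commonly used value for the probability of continuing to click rather than jumping):

```python
def pagerank(links, damping=0.85, iterations=50):
    # links: page -> list of pages it links to (every page has outlinks here).
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # With probability (1 - damping) the surfer jumps to a random page...
        new = {p: (1.0 - damping) / n for p in pages}
        # ...otherwise she follows one of the current page's links.
        for page, outlinks in links.items():
            for target in outlinks:
                new[target] += damping * rank[page] / len(outlinks)
        rank = new
    return rank

# a and b both link to c; c links back to a.
ranks = pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})
print(max(ranks, key=ranks.get))  # c collects rank from both a and b
```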


Web Crawling

• Web crawlers spend a lot of time waiting for responses to requests
• To reduce this inefficiency, web crawlers use threads and fetch hundreds of pages at once
• Crawlers could potentially flood sites with requests for pages
• To avoid this problem, web crawlers use politeness policies
  – e.g., delay between requests to same web server




Controlling Crawling

• Even crawling a site slowly will anger some web server administrators, who object to any copying of their data
• Robots.txt file can be used to control crawlers
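Python's standard library includes a robots.txt parser; a crawler might check permission like this (the rules below are made up for illustration):

```python
from urllib.robotparser import RobotFileParser

# Rules a site might serve at http://example.com/robots.txt (made up here).
rules = """
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())
print(parser.can_fetch("MyCrawler", "http://example.com/index.html"))  # True
print(parser.can_fetch("MyCrawler", "http://example.com/private/x"))   # False
```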


Simple Crawler Thread
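The slide's code listing did not survive extraction; a runnable toy sketch of what a single crawler thread's loop might look like (all names and the two-page "web" are hypothetical; a real thread would also track visited URLs, honor robots.txt, and pause between requests to the same server):

```python
import collections

class Frontier:
    # Minimal in-memory frontier: a queue of requested URLs.
    def __init__(self, urls):
        self.queue = collections.deque(urls)
    def done(self):
        return not self.queue
    def next_url(self):
        return self.queue.popleft()
    def add(self, url):
        self.queue.append(url)

def crawler_thread(frontier, permits_crawl, retrieve_url, store_document):
    # One thread's loop: fetch the next requested URL if crawling rules
    # permit, store the text, and add newly discovered links to the frontier.
    while not frontier.done():
        url = frontier.next_url()
        if permits_crawl(url):
            text, links = retrieve_url(url)
            store_document(url, text)
            for link in links:
                frontier.add(link)

# Toy "web": two pages, the first linking to the second.
web = {"/a": ("page a", ["/b"]), "/b": ("page b", [])}
stored = {}
crawler_thread(Frontier(["/a"]),
               permits_crawl=lambda url: True,
               retrieve_url=lambda url: web[url],
               store_document=stored.__setitem__)
print(sorted(stored))  # ['/a', '/b']
```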




Freshness

• Web pages are constantly being added, deleted, and modified
• Web crawler must continually revisit pages it has already crawled to see if they have changed, in order to maintain the freshness of the document collection
  – stale copies no longer reflect the real contents of the web pages


Freshness

• HTTP protocol has a special request type called HEAD that makes it easy to check for page changes
  – returns information about the page, not the page itself
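For example, a crawler might compare the Last-Modified header from a HEAD response against the time it last crawled the page (the header value below is illustrative, not fetched from a real server):

```python
from email.utils import parsedate_to_datetime

def needs_recrawl(last_modified_header, last_crawl_time):
    # True if the server reports a modification after our last crawl.
    return parsedate_to_datetime(last_modified_header) > last_crawl_time

# A Last-Modified header as it might appear in a HEAD response:
header = "Wed, 11 Feb 2009 09:30:00 GMT"
last_crawl = parsedate_to_datetime("Mon, 01 Dec 2008 00:00:00 GMT")
print(needs_recrawl(header, last_crawl))  # True: page changed since our crawl
```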




Freshness

• Not possible to constantly check all pages
  – must check important pages and pages that change frequently
• Freshness is the proportion of pages that are fresh
• Optimizing for this metric can lead to bad decisions, such as not crawling popular sites
• Age is a better metric


Age

• Expected age of a page t days after it was last crawled:
  Age(λ, t) = ∫ from 0 to t of P(page changed at time x) · (t − x) dx
• Web page updates follow the Poisson distribution on average
  – time until the next update is governed by an exponential distribution




Age

• Older a page gets, the more it costs not to crawl it
  – e.g., expected age with mean change frequency λ = 1/7 (one change per week)
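Under the exponential change model, the expected age is the integral of λe^(−λx) · (t − x) from 0 to t, which works out to the closed form t − (1 − e^(−λt))/λ; a numeric sketch checking the two against each other for λ = 1/7:

```python
import math

def expected_age(lam, t, steps=100000):
    # Numerically integrate λ e^(−λx) (t − x) over x in [0, t]:
    # the page changed at time x with density λe^(−λx), and has then
    # been out of date for (t − x) days by crawl time t.
    dx = t / steps
    return sum(lam * math.exp(-lam * (i + 0.5) * dx) * (t - (i + 0.5) * dx) * dx
               for i in range(steps))

lam, t = 1 / 7, 14   # one expected change per week, crawled 14 days ago
closed_form = t - (1 - math.exp(-lam * t)) / lam
print(round(expected_age(lam, t), 4), round(closed_form, 4))
```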


Focused Crawling

• Attempts to download only those pages that are about a particular topic
  – used by vertical search applications
• Rely on the fact that pages about a topic tend to have links to other pages on the same topic
  – popular pages for a topic are typically used as seeds
• Crawler uses a text classifier to decide whether a page is on topic




Deep Web

• Sites that are difficult for a crawler to find are collectively referred to as the deep (or hidden) Web
  – much larger than the conventional Web
• Three broad categories:
  – private sites
    • no incoming links, or may require log in with a valid account
  – form results
    • sites that can be reached only after entering some data into a form
  – scripted pages
    • pages that use JavaScript, Flash, or another client-side language to generate links


Document Feeds

• Many documents are published
  – created at a fixed time and rarely updated again
  – e.g., news articles, blog posts, press releases, email
• Published documents from a single source can be ordered in a sequence called a document feed
  – new documents found by examining the end of the feed




Document Feeds

• Two types:
  – A push feed alerts the subscriber to new documents
  – A pull feed requires the subscriber to check periodically for new documents
• Most common format for pull feeds is called RSS
  – Really Simple Syndication, RDF Site Summary, Rich Site Summary, or ...
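A pull-feed reader might extract new items from RSS XML like this (the feed content is made up; real RSS items carry more fields, such as publication dates and descriptions):

```python
import xml.etree.ElementTree as ET

rss = """<rss version="2.0"><channel>
  <title>Example News</title>
  <item><title>Story one</title><link>http://example.com/1</link></item>
  <item><title>Story two</title><link>http://example.com/2</link></item>
</channel></rss>"""

def feed_items(rss_text):
    # Return (title, link) for each item in an RSS 2.0 feed.
    root = ET.fromstring(rss_text)
    return [(item.findtext("title"), item.findtext("link"))
            for item in root.iter("item")]

print(feed_items(rss)[0])  # ('Story one', 'http://example.com/1')
```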


Conversion

• Text is stored in hundreds of incompatible file formats
  – e.g., raw text, RTF, HTML, XML, Microsoft Word, ODF, PDF
• Other types of files also important
  – e.g., PowerPoint, Excel
• Typically use a conversion tool
  – converts the document content into a tagged text format such as HTML or XML
  – retains some of the important formatting information




Character Encoding

• A character encoding is a mapping between bits and glyphs
  – i.e., getting from bits in a file to characters on a screen
  – Can be a major source of incompatibility
• ASCII is the basic character encoding scheme for English
  – encodes 128 letters, numbers, special characters, and control characters in 7 bits, extended with an extra bit for storage in bytes


Character Encoding

• Other languages can have many more glyphs
  – e.g., Chinese has more than 40,000 characters, with over 3,000 in common use
• Many languages have multiple encoding schemes
  – e.g., CJK (Chinese-Japanese-Korean) family of East Asian languages, Hindi, Arabic
  – must specify encoding
  – can’t have multiple languages in one file
• Unicode developed to address encoding problems
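In Python, for instance, UTF-8 round-trips text that mixes scripts in one string, while decoding the same bytes under a different encoding produces mojibake, which is exactly the incompatibility described above:

```python
text = "café 漢字"                  # mixed Latin and CJK in one string

# UTF-8 can encode every Unicode character; non-ASCII ones take extra bytes.
encoded = text.encode("utf-8")
print(len(text), len(encoded))      # 7 characters become 12 bytes

# The round trip is lossless with the right encoding...
assert encoded.decode("utf-8") == text
# ...while decoding with the wrong one yields garbage characters.
print(encoded.decode("latin-1")[:6])
```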




Storing the Documents

• Many reasons to store converted document text
  – saves crawling time when page is not updated
  – provides efficient access to text for snippet generation, information extraction, etc.
• Database systems can provide document storage for some applications
  – web search engines use customized document storage systems


Storing the Documents

• Requirements for document storage system:
  – Random access
    • request the content of a document based on its URL
    • hash function based on URL is typical
  – Compression and large files
    • reducing storage requirements and efficient access
  – Update
    • handling large volumes of new and modified documents
    • adding new anchor text
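Random access via a hash of the URL can be sketched with a toy in-memory store (the class and bucket scheme are illustrative; a real system would hash to a large file and an offset within it):

```python
import hashlib

class DocumentStore:
    # Toy store: a hash of the URL picks one of a fixed number of buckets,
    # standing in for the large on-disk files a real system would hash into.
    def __init__(self, num_buckets=16):
        self.buckets = [{} for _ in range(num_buckets)]

    def _bucket(self, url):
        digest = hashlib.md5(url.encode("utf-8")).digest()
        return self.buckets[digest[0] % len(self.buckets)]

    def put(self, url, content):
        self._bucket(url)[url] = content

    def get(self, url):
        return self._bucket(url).get(url)

store = DocumentStore()
store.put("http://example.com/a", "<html>doc a</html>")
print(store.get("http://example.com/a"))  # <html>doc a</html>
```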



Large Files

• Store many documents in large files, rather than each document in a file
  – avoids overhead in opening and closing files
  – reduces seek time relative to read time
• Compound document formats
  – used to store multiple documents in a file
  – e.g., TREC Web, Wikipedia XML