Chapter 11: Text Indexing and Matching

SLIDE 1

Chapter 11: Text Indexing and Matching

The best place to hide a dead body is page 2 of Google search results.

  • - anonymous

An engineer is someone who can do for a dime what any fool can do for a dollar.

  • - anonymous

There is nothing that cannot be found through some search engine.

  • - Eric Schmidt

There were 5 Exabytes of information created between the dawn of civilization through 2003, but that much information is now created every 2 days.

  • - Eric Schmidt

SLIDE 2

Outline

mostly following Büttcher/Clarke/Cormack Chapters 2, 3, 4, 6
(alternatively: Manning/Raghavan/Schütze Chapters 3, 4, 5, 6)

11.1 Search Engine Architecture
11.2 Dictionary and Inverted Lists
11.3 Index Compression
11.4 Similarity Search

11.2 mostly BCC Ch.4, 11.3 mostly BCC Ch.6, 11.4 mostly MRS Ch.3

SLIDE 3

11.1 Search Engine Architecture

Figure: pipeline crawl → extract & clean → index → search → rank → present

  • crawl: strategies for crawl schedule and priority queue for crawl frontier
  • extract & clean: handle dynamic pages, detect duplicates, detect spam
  • index: build and analyze Web graph, index all tokens or word stems;
    server farm with 100,000's of computers, distributed/replicated data in a
    high-performance file system, massive parallelism for query processing
  • search: fast top-k queries, query logging, auto-completion
  • rank: scoring function over many data and context criteria
  • present: GUI, user guidance, personalization

SLIDE 4

Content Gathering and Indexing

Figure: content gathering and indexing pipeline
Crawling → Extraction of relevant words → Linguistic methods: stemming
→ Statistically weighted features (terms), bag-of-words representations
→ Indexing into an index (B+-tree) over terms (crisis, love, …) and URLs,
supported by a thesaurus (ontology) with synonyms and sub-/super-concepts

Example document: "Internet crisis: users still love search engines and have trust
in the Internet …"
  • after extraction of relevant words: Internet crisis users …
  • after stemming: Internet crisis user …
  • after thesaurus expansion: Internet Web crisis user love search engine trust faith …

SLIDE 5

Crawling

  • Traverse Web: fetch page by http,
    parse retrieved html content for href links
  • Crawl frontier: maintain priority queue
  • Crawl strategy: breadth-first for broad coverage,
    depth-first for site capturing, clever prioritization
  • Link extraction: handle dynamic pages (Javascript …)
  • Focused Crawling: interleave with classifier
  • Deep Web Crawling: generate form-filling queries

SLIDE 6

Deep Web Crawling

Source: http://deepwebtechblog.com/wringing-science-from-google/

Deep Web (aka. Hidden Web): DB/CMS content items without URLs
→ generate (valid) values for query form fields in order to bring these items to the surface

SLIDE 7

Focused Crawling

Figure: Crawler + Classifier + Link Analysis over the WWW, starting from seeds and
training data, automatically populate an ad-hoc topic directory
(Root → Semistructured Data, Database Technology, Web Retrieval, Data Mining, XML)

critical issues:
  • classifier accuracy
  • feature selection
  • quality of training data

SLIDE 8

Focused Crawling

Figure: Crawler + Classifier + Link Analysis over the WWW, with seeds and training data;
topic directory (Root → Semistructured Data, Database Technology, Web Retrieval,
Data Mining, Social Graphs)

topic-specific archetypes (high confidence, high authority) are used for re-training:
interleave crawler and classifier with periodic re-training

SLIDE 9

Vector Space Model for Content Relevance Ranking

Search engine returns a ranking by descending relevance

Query (set of weighted features): q ∈ [0,1]^|F|
Documents are feature vectors (bags of words): d_i ∈ [0,1]^|F|

Similarity metric (cosine):
  sim(d_i, q) := Σ_{j=1..|F|} d_ij · q_j / ( sqrt(Σ_{j=1..|F|} d_ij²) · sqrt(Σ_{j=1..|F|} q_j²) )

Features are terms (words and other tokens) or term-zone pairs
(term in title/heading/caption/…)
  • can be stemmed/lemmatized (e.g. to unify singular and plural)
  • can also be multi-word phrases (e.g. bigrams)
  • weights e.g. by tf*idf model

SLIDE 10

Vector Space Model: tf*idf Scores

tf(d_i, t_j) = term frequency of term t_j in doc d_i
df(t_j) = document frequency of t_j = #docs with t_j
idf(t_j) = N / df(t_j) with corpus size (total #docs) N
dl(d_i) = doc length of d_i (avgdl: avg. doc length over all N docs)

tf*idf score for single-term query (index weight), with dampening & normalization:
  d_ij = (1 + ln(1 + ln tf(d_i, t_j))) · ln(1 + N/df(t_j))   for tf(d_i, t_j) > 0, 0 else

cosine similarity for ranking (cosine of the angle between q and d vectors
when vectors are L2-normalized) – a sparse scalar product:
  sim(q, d_i) = Σ_{j ∈ q ∩ d_i} q_j · d_ij   where j ∈ q ∩ d_i iff q_j ≠ 0 and d_ij ≠ 0
plus optional length normalization
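A minimal Python sketch of these formulas (dampened index weight and sparse scalar
product); the function and variable names are illustrative, not from the lecture:

  import math
  from collections import Counter

  def index_weight(tf, df, N):
      # dampened tf*idf index weight d_ij as defined above
      if tf <= 0:
          return 0.0
      return (1.0 + math.log(1.0 + math.log(tf))) * math.log(1.0 + N / df)

  def doc_vector(tokens, df, N):
      # bag-of-words -> {term: index weight}
      return {t: index_weight(f, df[t], N) for t, f in Counter(tokens).items()}

  def score(query_weights, doc_weights):
      # sparse scalar product over terms with q_j != 0 and d_ij != 0
      return sum(w * doc_weights[t] for t, w in query_weights.items() if t in doc_weights)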

SLIDE 11

(Many) tf*idf Variants: Pivoted tf*idf Scores

tf(d_i, t_j) = term frequency of term t_j in doc d_i
df(t_j) = document frequency of t_j = #docs with t_j
idf(t_j) = N / df(t_j) with corpus size (total #docs) N
dl(d_i) = doc length of d_i (avgdl: avg. doc length over all N docs)

tf*idf score for single-term query (index weight):
  d_ij = (1 + ln(1 + ln tf(d_i, t_j))) · ln(1 + N/df(t_j))   for tf(d_i, t_j) > 0, 0 else

pivoted tf*idf score (avoids undue favoring of long docs):
  d_ij = (1 + ln(1 + ln tf(d_i, t_j))) / ((1 − s) + s · dl(d_i)/avgdl) · ln(1 + N/df(t_j))

also uses the scalar product for score aggregation

tf*idf scoring often works very well, but it has many ad-hoc tuning issues
→ Chapter 13: more principled ranking models

SLIDE 12

11.2 Indexing with Inverted Lists

Figure: B+ tree or hashmap over the terms (crisis, Internet, trust, …); each term
points to an index list with postings (DocId, score) sorted by DocId

Example query q: Internet crisis trust

Google etc.: > 10 Mio. terms, > 100 Bio. docs, > 50 TB index

Vector space model suggests a term-document matrix, but the data is sparse and queries
are even sparser → use inverted index lists with terms as keys for B+ tree or hashmap

terms can be full words, word stems, word pairs, substrings, N-grams, etc. (whatever „dictionary terms“ we prefer for the application)

  • index-list entries in DocId order for fast Boolean operations
  • many techniques for excellent compression of index lists
  • additional position index needed for phrases, proximity, etc.

(or other precomputed data structures)

SLIDE 13

Dictionary

  • Dictionary maintains information about terms:

– mapping terms to unique term identifiers (e.g. crisis → 3141359)
– location of the corresponding posting list on disk or in memory
– statistics such as document frequency and collection frequency

  • Operations supported by the dictionary:

– Lookups by term
– range searches for prefix and suffix queries (e.g. net*, *net)
– substring matching for wildcard queries (e.g. cris*s)
– Lookups by term identifier

  • Typical implementations:

– B+ trees, hash tables, tries (digital trees), suffix arrays

SLIDE 14

B+ Tree

Figure: example B+-tree over string keys (Aachen, Berlin, Bonn, Erfurt, Essen,
Frankfurt, Jena, Köln, Mainz, Merzig, Paris, Saarbrücken, Trier, Ulm)

  • Paginated hollow multiway search tree with high fanout (→ low depth)
  • Node contents: (child pointer, key) pairs as routers in inner nodes;
    key with id list or record data in leaf nodes
  • Perfectly balanced: all leaves have identical distance to root
  • Search and update efficiency: O(log_k (n/C)) page accesses (disk I/Os)
    with n keys, page storage capacity C, and fanout k

SLIDE 15

Prefix B+ Tree for Keys of Type String

Keys in inner nodes are mere routers for search space partitioning.
Rather than x_i = max{s: s is a key in subtree t_i}, a shorter router y_i with
s_i ≤ y_i < s_{i+1} for all s_i in t_i and all s_{i+1} in t_{i+1} is sufficient,
for example, y_i = shortest string with the above property.
→ even higher fanout, possibly lower depth of the tree

Figure: Prefix B+-tree over the same string keys, with short routers
(e.g. C, Et, K, N) in the inner nodes

SLIDE 16

Posting Lists and Payload

  • Inverted index keeps a posting list for each term
    with the following payload for each posting:
    – document identifier (e.g. d123, d234, …)
    – term frequency (e.g. tf(crisis, d123) = 2, tf(crisis, d234) = 4)
    – score impact (e.g. tf(crisis, d123) * idf(crisis) = 3.75)
    – offsets: positions at which the term occurs in the document

  • Posting lists can be sorted by doc id or sorted by score impact
  • Posting lists are compressed for space and time efficiency

Example posting list for crisis: (d123, 2, [4, 14]), (d234, 4, [47]), (d266, 3, [1, 9, 20])
payload of a posting: tf, offsets

SLIDE 17

Query Processing on Inverted Lists

Merge Algorithm:

  • merge lists for t1 t2 … tz
  • compute score for each document
  • keep top-k results with highest scores

(in priority queue or after sort by score)

Given: query q = t1 t2 … tz with z (conjunctive) keywords,
similarity scoring function score(q, d) for docs d ∈ D, e.g. with
precomputed scores (index weights) s_i(d) for which q_i ≠ 0

Find: top-k results w.r.t. score(q, d) = aggr{s_i(d)}   (e.g.: Σ_{i ∈ q} s_i(d))

Figure: example query q: crisis Internet trust over index lists with (DocId, score)
postings sorted by DocId, accessed via B+ tree or hashmap over the terms

Google: > 10 Mio. terms, > 100 Bio. docs, > 50 TB index

SLIDE 18

Index List Processing by Merge Join

  • Keep L(i) in ascending order of doc ids
  • Compress L(i) by actually storing the gaps between successive doc ids
    (or using some more sophisticated prefix-free code)
  • QP may start with those L(i) lists that are short and have high idf;
    candidate results need to be looked up in the other lists L(j)
  • To avoid having to uncompress the entire list L(j), L(j) is encoded into groups
    of entries with a skip pointer at the start of each group
    → sqrt(n) evenly spaced skip pointers for a list of length n

Example lists:
  L_i: 2 4 9 16 59 66 128 135 291 311 315 591 672 899 …
  L_j: 1 2 3 5 8 17 21 35 39 46 52 66 75 88 …
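A minimal sketch of the conjunctive merge join over two docId-sorted lists
(hypothetical uncompressed in-memory lists; real systems work on compressed
blocks and use the skip pointers described above to jump ahead):

  def intersect(Li, Lj):
      # two-pointer merge of docId-sorted posting lists; returns common doc ids
      result, i, j = [], 0, 0
      while i < len(Li) and j < len(Lj):
          if Li[i] == Lj[j]:
              result.append(Li[i]); i += 1; j += 1
          elif Li[i] < Lj[j]:
              i += 1
          else:
              j += 1
      return result

  # e.g. intersect([2, 4, 9, 16, 59, 66], [1, 2, 3, 5, 8, 17, 21, 66]) -> [2, 66]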

SLIDE 19

Different Query Types

  • conjunctive queries: all words in q = q1 … qk required
  • disjunctive („andish“) queries: subset of q words qualifies,
    more of q yields higher score
  • mixed-mode queries and negations: q = q1 q2 q3 +q4 +q5 –q6
  • phrase queries and proximity queries: q = “q1 q2 q3“ q4 q5 …
  • fuzzy queries: similarity search, e.g. with tolerance to spelling variants (see 11.4)

Keyword queries: all by list processing on inverted indexes;
incl. variant: scan & merge only a subset of the qi lists,
look up long or negated qi lists

SLIDE 20

Forward Index

Forward index maintains information about documents
  • compact representation of content:
    sequence of term identifiers and document length

Forward index can be used for various tasks incl.:
  • result-snippet generation (i.e., show context of query terms)
  • computation of proximity scores for advanced ranking
    (e.g. width of smallest window that contains all query terms)

Example: d123: "the giants played a fantastic season. it is not clear …"
  → d123  dl: 428  content: < 1, 222, 127, 3, 897, 233, 0, 12, 6, 7, … >

SLIDE 21

Index Construction and Updates

Index construction:

  • extract (docId, termId, score) triples from docs
  • can be partitioned & parallelized
  • scores need idf (estimates)
  • sort triples by termId (primary) and docId (secondary)
  • disk-based merge sort (build runs, write to temp, merge runs)
  • can be partitioned & parallelized
  • load index from sorted file(s), using large batches for disk I/O

Index updating:

  • collect batches of updates in separate files
  • sort these files and merge them with index lists

SLIDE 22

Disk-Based Merge-Sort

1) Form runs of records, i.e., sorted subsets of the input data:

  • load M consecutive blocks into memory
  • sort them (using Quicksort or Heapsort)
  • write them to temporary disk space

repeat these steps for all blocks of data

2) Merge runs (into longer runs):

  • load M blocks from M different runs into memory
  • merge the records from these blocks in sort order
  • write output blocks to temporary disk space

and load more blocks from runs as needed

3) Iterate the merge phase until only one output run remains
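A compact Python sketch of the two phases (in-memory lists stand in for disk blocks
and temporary run files; M is the run size in records, e.g. applicable to the
(termId, docId, score) triples of the previous slide):

  import heapq

  def external_sort(records, M):
      # phase 1: form sorted runs of at most M records each
      # (for (termId, docId, score) triples, tuple order gives the desired sort key)
      runs = [sorted(records[i:i + M]) for i in range(0, len(records), M)]
      # phases 2/3: merge the runs; heapq.merge streams them in sort order,
      # a disk-based implementation would read/write blocks of temp files instead
      return list(heapq.merge(*runs))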

SLIDE 23

Map-Reduce Parallelism for Web-Scale Data

Figure: Map tasks M1 … Mn, shuffle, Reduce tasks R1 … Rm

Example input:
  d1: the quick brown fox jumps over the lazy dog
  d2: the quick brown dog jumps over the lazy fox

Map output: (the, d1) (quick, d1) (brown, d1) … (the, d2) (quick, d2) (brown, d2) …
After shuffle: fox: <d1, d2>, quick: <d1, d2>, …, brown: <d1, d2>, dog: <d1, d2>, …
Reduce output: out1: fox: 2, quick: 2, …   out2: brown: 2, dog: 2, …

Automated Scalable 2-Phase Parallelism (bulk synchronous)

  • map function: (hash-) partition inputs onto m compute nodes

local computation, emit (key,value) tuples

  • implicit shuffle: re-group (key,value) data
  • reduce function: aggregate (key,value) sets

Example: counting items (words, phrases, URLs, IP addresses, IP paths, etc.) in Web corpus or traffic/usage log [J. Dean et al. 2004, Hadoop, etc.]

SLIDE 24

Map-Reduce Parallelism

Programming paradigm and infrastructure for scalable, highly parallel data analytics

  • can run on 1000‘s of computers
  • with built-in load balancing & fault-tolerance
    (automatic scheduling & restart of worker processes)

easy programming with key-value pairs:
  Map function:    K × V → (L × W)*    (k1, v1) ↦ (l1, w1), (l2, w2), …
  Reduce function: L × W* → W*         l1, (x1, x2, …) ↦ y1, y2, …

Examples:

  • index building: K=docIds, V= contents, L=termIds, W=docIds
  • click log analysis: K=logs, V=clicks, L=URLs, W=counts
  • web graph reversal: K=docIds, V=(s,t) outlinks, L=t, W=(t,s) inlinks
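A minimal sequential Python sketch of the index-building example in this formalism
(K = docIds, V = contents, L = terms standing in for termIds, W = docIds); the shuffle
and parallel worker scheduling of a real framework are only mimicked here, and all
names are illustrative:

  from collections import defaultdict

  def map_index(doc_id, content):
      # emit (term, docId) pairs for one input document
      return [(term, doc_id) for term in content.split()]

  def reduce_index(term, doc_ids):
      # aggregate all docIds for one term into a posting list
      return (term, sorted(set(doc_ids)))

  def run_mapreduce(docs):
      groups = defaultdict(list)
      for doc_id, content in docs.items():
          for term, d in map_index(doc_id, content):
              groups[term].append(d)            # "shuffle": re-group by key
      return [reduce_index(t, ds) for t, ds in groups.items()]

  # run_mapreduce({"d1": "the quick brown fox", "d2": "the lazy dog"})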

SLIDE 25

Map-Reduce Parallelism for Index Building

Figure: input files → Extractors (Map) → intermediate files, partitioned by term range
(a..c, …, u..z) and sorted → Inverters (Reduce) → output files (inverted lists)

SLIDE 26

Distributed Indexing: Term Partitioning

entire index lists are hashed onto nodes by TermId;
queries are routed to the nodes holding the relevant terms
→ low resource consumption, but susceptible to imbalance (because of data or load skew),
and index maintenance is non-trivial

Figure: index lists a, b, c, d distributed across nodes by term
(e.g. one node holds a and c, another holds b and d)

SLIDE 27

Distributed Indexing: Doc Partitioning (Index Sharding)

index-list entries are hashed onto nodes by DocId;
each complete query is run on each node, and the results are merged
→ perfect load balance, embarrassingly scalable, easy maintenance

Figure: each node holds a full set of index lists a, b, c, d for its document shard

SLIDE 28

Dynamic Indexing

News, tweets, social media require the index to be always fresh

  • New postings are incrementally inserted into inverted lists
  • avoid insertion in middle of long list:

partition long lists, insert in / append to partition, merge partitions lazily

  • Index updates in parallel to queries
  • Light-weight locking needed to ensure consistent reads

(and consistency of the index with parallel updates)

For more detail see e.g. Google Percolator (Peng/Dabek: OSDI 2010)

SLIDE 29

Index Caching

Figure: queries are handled by Query Processors with Query-Result Caches, which forward
them to Index Servers with Inverted-List Caches (cached entries keyed by terms or
term combinations such as a b, a c d, e f, g h)

SLIDE 30

Caching Strategies

What is cached?

  • index lists for individual terms
  • entire query results
  • postings for multi-term intersections

Where is an item cached?

  • in RAM of responsible server-farm node
  • in front-end accelerators or proxy servers
  • as replicas in RAM of all (or many) servers

When are cached items dropped?

  • estimate for each item: temperature = access-rate / size
  • when space is needed, drop item with lowest temperature

Landlord algorithm [Cao/Irani 1997, Young 1998], generalizes LRU-k [O‘Neil 1993]

  • prefetch item if its predicted temperature is higher than

the temperature of the corresponding replacement victims

SLIDE 31

11.3 Index Compression

Heaps' law (empirically observed and postulated):
size of the vocabulary (distinct terms) in a corpus
  E[distinct terms in corpus] = α · n^β
with total number of term occurrences n, and constants α, β (β < 1),
classically α ≈ 20, β ≈ 0.5

Zipf's law (empirically observed and postulated):
relative frequencies of terms in the corpus
  rel. freq. of the k-th most popular term ~ 1 / k^γ
with parameter γ, classically set to 1

The two laws strongly suggest opportunities for compression
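As a rough worked example with the classic constants (illustrative numbers, not from
the lecture): for n = 10^9 term occurrences, Heaps' law with α = 20, β = 0.5 predicts
about 20 · (10^9)^0.5 ≈ 6.3 · 10^5 distinct terms, i.e. the dictionary grows far more
slowly than the corpus, while Zipf's law says most of those occurrences concentrate on
the few most popular terms.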

SLIDE 32

Compression: Why?

  • reduced space consumption on disk or in memory

(and SSD and L3/L2 CPU caches)

  • more cache hits, since more postings fit in cache
  • 10x to 20x faster query processing, since

decompressing may often be done as fast as sequential scan

SLIDE 33

Basics from Information Theory

For two prob. distributions f(x) and g(x), the relative entropy
(Kullback-Leibler divergence) of f to g is:
  D(f ‖ g) := Σ_x f(x) · log₂ ( f(x) / g(x) )
D is the average number of additional bits for coding events of f when using the
optimal code for g; relative entropy measures the (dis-)similarity of probability
or frequency distributions.

Let f(x) be the probability (or relative frequency) of the x-th symbol in some text d.
The entropy of the text (or the underlying prob. distribution f) is:
  H(d) = Σ_x f(x) · log₂ ( 1 / f(x) )
H(d) is a lower bound for the bits per symbol needed with optimal coding.

Cross entropy of f(x) to g(x):
  H(f, g) := H(f) + D(f ‖ g) = − Σ_x f(x) · log₂ g(x)

Jensen-Shannon divergence of f(x) and g(x):
  JS(f, g) = ½ D(f ‖ g) + ½ D(g ‖ f)

SLIDE 34

Compression

  • Text is sequence of symbols (with specific frequencies)
  • Symbols can be
  • letters or other characters from some alphabet Σ
  • strings of fixed length (e.g. trigrams)
  • or words, bits, syllables, phrases, etc.

Limits of compression:
Let p_i be the probability (or relative frequency) of the i-th symbol in text d.
Then the (empirical) entropy of the text:
  H(d) = Σ_i p_i · log₂ (1 / p_i)
is a lower bound for the average number of bits per symbol in any compression
(e.g. Huffman codes)

Note:

compression schemes such as Ziv-Lempel (used in zip) are better because they consider context beyond single symbols; with appropriately generalized notions of entropy the lower-bound theorem does still hold

SLIDE 35

Basic Compression: Huffman Coding

Text in alphabet Σ = {A, B, C, D}
P[A] = 1/2, P[B] = 1/4, P[C] = 1/8, P[D] = 1/8
H(Σ) = 1/2·1 + 1/4·2 + 1/8·3 + 1/8·3 = 7/4

Optimal (prefix-free) code from Huffman tree:
  A → 0, B → 10, C → 110, D → 111

Avg. code length: 0.5·1 + 0.25·2 + 2·0.125·3 = 1.75 bits

SLIDE 36

Basic Compression: Huffman Coding

Text in alphabet Σ = {A, B, C, D}
P[A] = 0.6, P[B] = 0.3, P[C] = 0.05, P[D] = 0.05
H(Σ) = 0.6·log₂(10/6) + 0.3·log₂(10/3) + 0.05·log₂ 20 + 0.05·log₂ 20 ≈ 1.395

Optimal (prefix-free) code from Huffman tree:
  A → 0, B → 10, C → 110, D → 111

Avg. code length: 0.6·1 + 0.3·2 + 0.05·3 + 0.05·3 = 1.5 bits

SLIDE 37

Algorithm for Computing a Huffman Code

Algorithm:
  n := |Σ|
  priority queue Q := Σ, sorted in ascending order by p(s) for s ∈ Σ
  for i := 1 to n−1 do
    z := MakeTreeNode()
    z.left := ExtractMin(Q)
    z.right := ExtractMin(Q)
    p(z) := p(z.left) + p(z.right)
    Insert(Q, z)
  od
  return ExtractMin(Q)

Theorem: The Huffman code constructed with this algorithm is an optimal prefix-free code.

Remark: Huffman codes need to scan a text twice for compression
(or need other sources of text-independent symbol statistics)
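A minimal Python sketch of the same algorithm with a heap as the priority queue
(symbol probabilities as input; ties are broken arbitrarily, so the concrete codewords
may differ from the slides while remaining optimal):

  import heapq

  def huffman_code(probs):
      # probs: dict symbol -> probability; returns dict symbol -> codeword
      heap = [(p, [s]) for s, p in probs.items()]          # one leaf per symbol
      codes = {s: "" for s in probs}
      heapq.heapify(heap)
      while len(heap) > 1:
          p1, left = heapq.heappop(heap)                    # two smallest subtrees
          p2, right = heapq.heappop(heap)
          for s in left:  codes[s] = "0" + codes[s]         # prepend branch bit
          for s in right: codes[s] = "1" + codes[s]
          heapq.heappush(heap, (p1 + p2, left + right))
      return codes

  # huffman_code({"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.125})
  # -> e.g. {"A": "0", "B": "10", "C": "110", "D": "111"} (up to relabeling)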

SLIDE 38

Example: Huffman Coding

Example: |Σ| = 6, Σ = {A, B, C, D, E, F},
P[A] = 0.45, P[B] = 0.13, P[C] = 0.12, P[D] = 0.16, P[E] = 0.09, P[F] = 0.05

Resulting Huffman code (from the tree with internal node weights 0.14, 0.25, 0.3, 0.55, 1.0):
  A → 0, B → 101, C → 100, D → 111, E → 1101, F → 1100

SLIDE 39

Arithmetic Coding

Generalizes Huffman coding

Key idea: for alphabet Σ and probabilities P[s] of symbols s ∈ Σ
  • Map s to an interval of real numbers in [0,1]
    using the cdf values of the symbols, and encode the interval boundaries
  • Choose sums of negative powers of 2 as interval boundaries

Example: Σ = {A, B, C, D} with P[A] = 0.4, P[B] = 0.3, P[C] = 0.2, P[D] = 0.1
→ F(A) = 0.4, F(B) = 0.7, F(C) = 0.9, F(D) = 1.0
(interval boundaries near 2^-1, 2^-2, 2^-3)

Encode a symbol (or symbol sequence) by a binary interval contained
in the symbol's interval

SLIDE 40

General Text Compression: Ziv-Lempel

LZ77 (Adaptive Dictionary) and further variants:

  • scan text & identify in a lookahead window the longest string

that occurs repeatedly and is contained in a backward window

  • replace this string by a „pointer“ to its previous occurrence.

encode text into list of triples <back, count, new> where

  • back is the backward distance to a prior occurrence of the string

that starts at the current position,

  • count is the length of this repeated string, and
  • new is the next symbol that follows the repeated string.

triples themselves can be further encoded (with variable length);
better variants use an explicit dictionary with statistical analysis
(need to scan text twice) and/or a clever permutation of the input string
→ Burrows-Wheeler transform

SLIDE 41

Example: Ziv-Lempel Compression

great for text compression, but not easy to use with index lists

Input: peter_piper_picked_a_peck_of_pickled_peppers
Encoding into triples <back, count, new>:
  <0, 0, p>   for character 1: p
  <0, 0, e>   for character 2: e
  <0, 0, t>   for character 3: t
  <-2, 1, r>  for characters 4-5: er
  <0, 0, _>   for character 6: _
  <-6, 1, i>  for characters 7-8: pi
  <-8, 2, r>  for characters 9-11: per
  <-6, 3, c>  for characters 12-15: _pic
  <0, 0, k>   for character 16: k
  <-7, 1, d>  for characters 17-18: ed
  ...
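A small greedy sketch that produces triples of this form (it prefers the most recent
prior occurrence, which matches the back distances above; an illustrative assumption,
not the exact variant of any production codec, and not optimized):

  def lz77_encode(text):
      # emit triples <back, count, new>: back = distance to a prior occurrence,
      # count = length of the repeated string, new = the symbol that follows it
      i, triples = 0, []
      while i < len(text):
          best_len, best_back = 0, 0
          for j in range(i):                       # candidate match starts
              k = 0
              while i + k < len(text) - 1 and text[j + k] == text[i + k]:
                  k += 1
              if k >= best_len:                    # >= prefers the most recent occurrence
                  best_len, best_back = k, i - j
          triples.append((-best_back if best_len else 0, best_len, text[i + best_len]))
          i += best_len + 1
      return triples

  # lz77_encode("peter_piper_picked_a_peck_of_pickled_peppers")[:4]
  # -> [(0, 0, 'p'), (0, 0, 'e'), (0, 0, 't'), (-2, 1, 'r')]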

SLIDE 42

Index Compression

Posting lists with ordered doc ids have small gaps
→ gap coding: represent the list by its first id and the sequence of gaps;
  gaps in long lists are small, gaps in short lists are long
  → variable bit length coding is good for doc ids and offsets in the payload

Other lists may have many identical or consecutive values
→ run-length coding: represent the list by its first value and the frequency
  of repeated or consecutive values
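A small worked example of gap coding (doc ids chosen for illustration):
the posting list 12, 14, 28, 44, 51, 52 becomes first id 12 followed by the gaps
2, 14, 16, 7, 1, so a long, dense list consists mostly of very small numbers
that compress well with the codes on the next slides.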

SLIDE 43

Gap Compression: Gamma Coding

Encode gaps in inverted lists (successive doc ids), often small integers

Unary coding: gap of size x encoded by x times 0 followed by one 1
(x+1 bits; good for short gaps)
Binary coding: gap of size x encoded by the binary representation of x
(about log2 x bits; good for long gaps)

Elias's γ coding:
  length := floor(log2 x) in unary, followed by
  offset := x − 2^floor(log2 x) in binary
  (1 + ⌊log2 x⌋ + ⌊log2 x⌋ bits)

→ generalization: Golomb code (optimal for geometric distribution of x)
→ still need to pack variable-length codes into bytes or words
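A minimal sketch of Elias γ encoding and decoding for a single positive integer
(bit strings as Python strings, purely for illustration; per Note 1 on the next slide
one would encode x−1 since gaps of size 0 do not occur):

  def gamma_encode(x):
      # x >= 1: floor(log2 x) in unary (zeros + '1'), then x without its leading 1 bit
      length = x.bit_length() - 1                  # = floor(log2 x)
      return "0" * length + "1" + bin(x)[3:]       # bin(x)[3:] = offset in `length` bits

  def gamma_decode(bits):
      # inverse: count leading zeros, then read that many offset bits
      length = bits.index("1")
      return (1 << length) + int(bits[length + 1:length + 1 + length] or "0", 2)

  # gamma_encode(1) -> '1', gamma_encode(4) -> '00100', gamma_encode(17) -> '000010001'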

SLIDE 44

Example for Gamma Coding

Note 1: as there are no gaps of size x = 0, one typically encodes x−1

  x              length (unary)   offset (binary)
  1  = 2^0       1                1
  4  = 2^2       001              100
  17 = 2^4+2^0   00001            10001
  24 = 2^4+2^3   00001            11000
  63 = 2^5+…     000001           111111
  64 = 2^6       0000001          1000000

The leading 1 of the offset can be omitted.

Note 2: a variant called δ coding uses γ encoding for the length

SLIDE 45

Byte or Word Alignment and Variable Byte Coding

Variable bit codes are typically aligned to start on byte or word boundaries
→ some bits per byte or word may be unused (extra 0‘s “padded“)

Variable byte coding uses only 7 bits per byte; the first (i.e. most significant) bit
is a continuation flag → tells which consecutive bytes form one logical unit

Example: var-byte coding of gamma encoded numbers:
  1 0000000   1 0100101   0 1000000   0 0011000
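A minimal sketch of variable byte coding for one non-negative integer (7 payload bits
per byte; here the continuation flag is set on every byte except the last, which is one
of several common conventions and an assumption rather than the slide's exact layout):

  def vbyte_encode(x):
      # split x into 7-bit groups, most significant group first;
      # set the high bit (continuation flag) on all bytes but the last
      groups = []
      while True:
          groups.insert(0, x & 0x7F)
          x >>= 7
          if x == 0:
              break
      return bytes(0x80 | g for g in groups[:-1]) + bytes([groups[-1]])

  def vbyte_decode(data):
      x = 0
      for b in data:
          x = (x << 7) | (b & 0x7F)
      return x

  # vbyte_decode(vbyte_encode(130)) -> 130 (two bytes: 1_0000001 0_0000010)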

SLIDE 46

Golomb Coding / Rice Coding

Golomb coding generalizes Gamma coding: for tunable parameter M (modulus), split x into
  • quotient q = floor(x/M) – stored in unary code with q+1 bits
  • remainder r = x mod M – stored in binary code with ceil(log2 M) bits

Rice coding specializes Golomb coding to the choice M = 2^k
→ processing of encoded numbers can exploit bit-level operations

Let b = ceil(log2 M) → the remainder needs either b or b−1 bits;
can be further optimized to use b−1 bits for the smaller numbers:
  If r < 2^b − M then r is stored with b−1 bits
  If r ≥ 2^b − M then r + 2^b − M is stored with b bits

SLIDE 47

Example for Golomb Coding

Golomb encoding (M=10, b=4), simple variant:
  x    q   bits(q)       r   bits(r)
  33   3   0001          3   0011
  57   5   000001        7   0111
  99   9   0000000001    9   1001

Golomb encoding (M=10, b=4) with additional optimization:
  x    q   bits(q)       r   bits(r)
  33   3   0001          3   011
  57   5   000001        7   1101
  99   9   0000000001    9   1111
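A minimal sketch of this Golomb encoding including the b/(b−1)-bit optimization for the
remainder (plain bit strings, just to mirror the tables above):

  import math

  def golomb_encode(x, M):
      # quotient in unary (q zeros + '1'), remainder in truncated binary
      q, r = x // M, x % M
      b = math.ceil(math.log2(M))
      cutoff = (1 << b) - M                        # = 2^b - M
      code = "0" * q + "1"
      if r < cutoff:
          return code + format(r, "b").zfill(b - 1)            # b-1 bits
      return code + format(r + cutoff, "b").zfill(b)           # b bits

  # golomb_encode(33, 10) -> '0001011' (q=3 in unary, r=3 in 3 bits)
  # golomb_encode(57, 10) -> '0000011101' (q=5 in unary, r=7 stored as 1101)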

SLIDE 48

Practical Index Compression: Layout of Index Postings

Per-word posting layout: header with skip table, followed by block 1 … block N

One block (with n postings):
  • delta to last docId in block, #docs in block: n
  • n−1 docId deltas: Rice-k encoded
  • n tf values: Gamma encoded
  • tf attributes: Huffman encoded
  • tf positions: Huffman encoded

payload (of postings); the layout allows incremental decoding

[Jeff Dean (Google): WSDM‘09]

SLIDE 49

11.4 Similarity Search

Exact Matching:
  • given a string s and a longer string d, find (all) occurrences of s in d;
    the string can be a word or a multi-word phrase
  • algorithms include Knuth-Morris-Pratt, Boyer-Moore, …
    → see Algorithms lecture

Fuzzy Matching:
  • given a string s and a longer string d, find (all) approximate occurrences
    of s in d, e.g. tolerating missing characters or words, typos, etc.
    → this lecture

SLIDE 50

Similarity Search with Edit Distance

Idea: tolerate mis-spellings and other variations of search terms
and score matches based on edit distance

Examples:
  1) query: Microsoft – fuzzy match: Migrosaft – score ~ edit distance 2
  2) query: Microsoft – fuzzy match: Microsiphon – score ~ edit distance 3+5
  3) query: Microsoft Corporation, Redmond, WA –
     fuzzy match at token level: MS Corp., Readmond, USA

SLIDE 51

Similarity Measures on Strings (1)

Hamming distance of strings s1, s2 ∈ Σ* with |s1| = |s2|:
number of different characters (cardinality of {i: s1_i ≠ s2_i})

Levenshtein distance (edit distance) of strings s1, s2 ∈ Σ*:
minimal number of editing operations on s1 (replacement, deletion,
insertion of a character) to change s1 into s2

For edit(i, j) := Levenshtein distance of s1[1..i] and s2[1..j] it holds:
  edit(0, 0) = 0, edit(i, 0) = i, edit(0, j) = j
  edit(i, j) = min { edit(i-1, j) + 1,
                     edit(i, j-1) + 1,
                     edit(i-1, j-1) + diff(i, j) }
  with diff(i, j) = 1 if s1_i ≠ s2_j, 0 otherwise
→ efficient computation by dynamic programming

SLIDE 52

Example for Levenshtein edit distance: s = grate, t = great

edit(s[1..i], t[1..j]) = min { edit(s[1..i-1], t[1..j]) + 1,
                               edit(s[1..i], t[1..j-1]) + 1,
                               edit(s[1..i-1], t[1..j-1]) + diff(s[i], t[j]) }

Figure: dynamic-programming table over the characters of grate (rows) and great
(columns); the bottom-right entry gives edit(grate, great) = 2
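A minimal dynamic-programming sketch of the recurrence above:

  def levenshtein(s, t):
      # edit(i, j) computed row by row; edit(0, j) = j, edit(i, 0) = i
      prev = list(range(len(t) + 1))
      for i, si in enumerate(s, 1):
          cur = [i]
          for j, tj in enumerate(t, 1):
              diff = 0 if si == tj else 1
              cur.append(min(prev[j] + 1,          # deletion
                             cur[j - 1] + 1,       # insertion
                             prev[j - 1] + diff))  # replacement / match
          prev = cur
      return prev[-1]

  # levenshtein("grate", "great") -> 2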

SLIDE 53

Similarity Measures on Strings (2)

Damerau-Levenshtein distance of strings s1, s2 ∈ Σ*:
minimal number of replacement, insertion, deletion, or transposition operations
(exchanging two adjacent characters) for changing s1 into s2

For edit(i, j) := Damerau-Levenshtein distance of s1[1..i] and s2[1..j]:
  edit(0, 0) = 0, edit(i, 0) = i, edit(0, j) = j
  edit(i, j) = min { edit(i-1, j) + 1,
                     edit(i, j-1) + 1,
                     edit(i-1, j-1) + diff(i, j),
                     edit(i-2, j-2) + diff(i-1, j) + diff(i, j-1) + 1 }
  with diff(i, j) = 1 if s1_i ≠ s2_j, 0 otherwise

SLIDE 54

Similarity based on N-Grams

Determine for string s the set or bag of its N-grams:
  G(s) = {substrings of s with length N}   (often trigrams are used, i.e. N=3)

Distance of strings s1 and s2:  |G(s1)| + |G(s2)| − 2·|G(s1) ∩ G(s2)|

Example: G(rodney) = {rod, odn, dne, ney}
         G(rhodnee) = {rho, hod, odn, dne, nee}
         distance(rodney, rhodnee) = 4 + 5 − 2·2 = 5

Alternative similarity measures:
  Jaccard coefficient: |G(s1) ∩ G(s2)| / |G(s1) ∪ G(s2)|
  Dice coefficient:    2·|G(s1) ∩ G(s2)| / (|G(s1)| + |G(s2)|)
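A minimal sketch of these N-gram set measures:

  def ngrams(s, n=3):
      # set of all substrings of length n
      return {s[i:i + n] for i in range(len(s) - n + 1)}

  def ngram_distance(s1, s2, n=3):
      g1, g2 = ngrams(s1, n), ngrams(s2, n)
      return len(g1) + len(g2) - 2 * len(g1 & g2)

  def jaccard(s1, s2, n=3):
      g1, g2 = ngrams(s1, n), ngrams(s2, n)
      return len(g1 & g2) / len(g1 | g2)

  # ngram_distance("rodney", "rhodnee") -> 5   (4 + 5 - 2*2, as in the example)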

SLIDE 55

N-Gram Indexing for Similarity Search

Theorem (Jokinen and Ukkonen 1991): for query string s and a target string t,
the Levenshtein edit distance is bounded via the N-gram bag-overlap:
  edit(s, t) ≤ d  ⟹  |Ngrams(s) ∩ Ngrams(t)| ≥ |s| − (N−1) − d·N

→ for similarity queries with edit-distance tolerance d, perform the query over
inverted lists for N-grams, using the count for score aggregation

SLIDE 56

Example for Jokinen/Ukkonen Theorem

Theorem, used contrapositively for pruning:
  edit(s,t) ≤ d  ⟹  overlap(s,t) ≥ |s| − (N−1) − d·N
  overlap(s,t) < |s| − (N−1) − d·N  ⟹  edit(s,t) > d

s = abababababa, |s| = 11
  N=2 → Ngrams(s) = {ab(5), ba(5)}
  N=3 → Ngrams(s) = {aba(5), bab(4)}
  N=4 → Ngrams(s) = {abab(4), baba(4)}

targets:
  t1 = ababababab,  |t1| = 10
  t2 = abacdefaba,  |t2| = 10
  t3 = ababaaababa, |t3| = 11
  t4 = abababb,     |t4| = 7
  t5 = ababaaabbbb, |t5| = 11

task: find all ti with edit(s, ti) ≤ 2 → prune all ti with edit(s, ti) > 2 = d
overlapBound = |s| − (N−1) − d·N = 6 (for N=2) → prune all ti with overlap(s, ti) < 6

N=2: Ngrams(t1) = {ab(5), ba(4)}
     Ngrams(t2) = {ab(2), ba(2), ac, cd, de, ef, fa}
     Ngrams(t3) = {ab(4), ba(4), aa(2)}
     Ngrams(t4) = {ab(3), ba(2), bb}
     Ngrams(t5) = {ab(3), ba(2), aa(2), bb(3)}
→ prune t2, t4, t5 because overlap(s, tj) < 6 for these tj

SLIDE 57

Similar Document Search

Given a full document d: find similar documents (related pages)

  • Construct representation of d:
    set/bag of terms, set of links, set of query terms that led to clicking d, etc.
  • Define similarity measure:
    overlap, Dice coeff., Jaccard coeff., cosine, etc.
  • Efficiently estimate similarity and design index:
    use approximations based on N-grams (shingles) and statistical estimators
    → min-wise independent permutations / min-hash method:
      compute min(π(D)), min(π(D')) for random permutations π
      of the N-gram sets D and D' of docs d and d',
      and test min(π(D)) = min(π(D'))

SLIDE 58

Min-Wise Independent Permutations (MIPs)

  • aka. Min-Hash Method

MIPs are an unbiased estimator of resemblance:
  P[ min{h(x) | x ∈ A} = min{h(y) | y ∈ B} ] = |A ∩ B| / |A ∪ B|
MIPs can be viewed as repeated sampling of x, y from A, B
(for a random permutation π: P[ min{π(x) | x ∈ S} = π(x) ] = 1/|S|)

Example: set of ids {17, 21, 3, 12, 24, 8}
compute N random permutations with, e.g.:
  h1(x) = 7x + 3 mod 51  →  20 48 24 36 18  8  →  min = 8
  h2(x) = 5x + 6 mod 51  →  40  9 21 15 24 46  →  min = 9
  …
  hN(x) = 3x + 9 mod 51  →   9 21 18 45 30 33  →  min = 9

MIPs vector: minima of the N permutations, e.g. (8, 9, …, 9)

MIPs(set1) = (8, 9, 33, 24, 36, 9), MIPs(set2) = (8, 24, 45, 24, 48, 13)
→ estimated resemblance = 2/6
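A minimal sketch of the MIPs estimator using the example hash functions above
(the linear hash functions and modulus come from the slide; real implementations
use many random permutations over a much larger id space):

  def mips_vector(ids, hash_funcs):
      # one minimum per (pseudo-)random permutation
      return [min(h(x) for x in ids) for h in hash_funcs]

  def estimated_resemblance(v1, v2):
      # fraction of positions where the two MIPs vectors agree
      return sum(a == b for a, b in zip(v1, v2)) / len(v1)

  hash_funcs = [lambda x: (7 * x + 3) % 51,
                lambda x: (5 * x + 6) % 51,
                lambda x: (3 * x + 9) % 51]

  # mips_vector({17, 21, 3, 12, 24, 8}, hash_funcs) -> [8, 9, 9]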

SLIDE 59

Duplicate Elimination [Broder et al. 1997]

Approach:
  • represent each document d as set (or sequence) of shingles (N-grams over tokens)
  • encode shingles by hash fingerprints (e.g., using SHA-1),
    yielding a set of numbers S(d) ⊆ [1..n] with, e.g., n = 2^64
  • compare two docs d, d' that are suspected to be duplicates by
    – resemblance:  |S(d) ∩ S(d')| / |S(d) ∪ S(d')|   (Jaccard coefficient)
    – containment:  |S(d) ∩ S(d')| / |S(d)|
  • drop d' if resemblance or containment is above threshold

duplicates on the Web may be slightly perturbed
→ crawler & indexing are interested in identifying near-duplicates

SLIDE 60

Efficient Duplicate Detection in Large Corpora [Broder et al. 1997]

Solution (avoids comparing all pairs of docs):
  1) for each doc compute shingle-set and MIPs
  2) produce (shingleID, docID) sorted list
  3) produce (docID1, docID2, shingleCount) table with counters for common shingles
  4) identify (docID1, docID2) pairs with shingleCount above threshold
     and add a (docID1, docID2) edge to a graph
  5) compute connected components of the graph (union-find)
     → these are the near-duplicate clusters

Trick for additional speedup of steps 2 and 3:

  • compute super-shingles (meta sketches) for shingles of each doc
  • docs with many common shingles have common super-shingle w.h.p.

SLIDE 61

Similarity Search by Random Hyperplanes

[Charikar 2002]; similarity measure: cosine

  • generate random hyperplanes with normal vector h
  • test if d and d′ are on the same side of the hyperplane

  P[ sign(h·d) = sign(h·d′) ] = 1 − angle(d, d′) / π
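A minimal sketch of random-hyperplane signatures for cosine similarity
(numpy assumed; the dimensionality and number of hyperplanes are illustrative):

  import numpy as np

  def hyperplane_signature(vec, hyperplanes):
      # one sign bit per random hyperplane normal vector h
      return hyperplanes @ vec >= 0

  rng = np.random.default_rng(0)
  hyperplanes = rng.standard_normal((64, 300))     # 64 random normals in R^300

  # the fraction of agreeing bits between two signatures estimates
  # P[sign(h.d) = sign(h.d')], i.e. 1 - angle(d, d')/pi:
  # sig1, sig2 = hyperplane_signature(d, hyperplanes), hyperplane_signature(d2, hyperplanes)
  # est = (sig1 == sig2).mean()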

SLIDE 62

Summary of Chapter 11

  • indexing by inverted lists:
  • posting lists in doc id order (or score impact order)
  • partitioned across server farm for scalability
  • major space and time savings by index compression:

Huffman codes, variable-bit Gamma and Golomb coding

  • similarity search based on edit distances and N-gram overlaps
  • efficient similarity search by min-hash signatures

Happy Holidays and Merry Christmas!

SLIDE 63

Additional Literature for Chapter 11

  • S. Brin, L. Page: The Anatomy of a Large-Scale

Hypertextual Web Search Engine. Computer Networks 30(1-7), 1998

  • M. McCandless, E. Hatcher, O. Gospodnetic: Lucene in Action, Manning 2010
  • C. Gormley, Z. Tong: Elasticsearch – The Definitive Guide, O’Reilly 2015
  • E.C. Dragut, W. Meng, C.T. Yu: Deep Web Query Interface Understanding

and Integration. Morgan & Claypool 2012

  • F. Menczer, G. Pant, P. Srinivasan: Topical web crawlers: Evaluating

adaptive algorithms. ACM Trans. Internet Techn. 4(4): 378-419 (2004)

  • J. Zobel, A. Moffat: Inverted files for text search engines.

ACM Computing Surveys 38(2), 2006

  • X. Long, T. Suel: Three-Level Caching for Efficient Query Processing in

Large Web Search Engines, WWW 2005

  • F. Transier, P. Sanders: Engineering basic algorithms of an

in-memory text search engine. ACM Trans. Inf. Syst. 29(1), 2010

SLIDE 64

Additional Literature for Chapter 11

  • J. Dean, S. Ghemawat: MapReduce: Simplified Data Processing

in Large Clusters, OSDI 2004

  • T. White: Hadoop – The Definitive Guide, O‘Reilly 2015
  • J. Lin, C. Dyer: Data-Intensive Text Processing

with MapReduce, Morgan & Claypool 2010

  • J. Dean: Challenges in Building Large-Scale Information Retrieval Systems,

WSDM 2009, http://videolectures.net/wsdm09_dean_cblirs/

  • D. Peng, F. Dabek: Large-scale Incremental Processing Using Distributed

Transactions and Notifications, OSDI 2010

  • A.Z. Broder, S.C. Glassman, M.S. Manasse, G. Zweig: Syntactic Clustering
    of the Web. Computer Networks 29(8-13): 1157-1166 (1997)
  • M. Henzinger: Finding near-duplicate web pages: a large-scale evaluation
    of algorithms. SIGIR 2006: 284-291
