Chapter V: Indexing & Searching Information Retrieval & Data Mining Universität des Saarlandes, Saarbrücken Winter Semester 2011/12

Chapter V: Indexing & Searching*

V.1 Indexing & Query Processing: inverted indexes, B+-trees, merging vs. hashing, Map-Reduce & distribution, index caching
V.2 Compression: dictionary-based vs. variable-length encoding, Gamma encoding, S16, P-for-Delta
V.3 Top-k Query Processing: heuristic top-k approaches, Fagin's family of threshold algorithms, IO-Top-k, Top-k with incremental merging, and others
V.4 Efficient Similarity Search: high-dimensional similarity search, SpotSigs algorithm, Min-Hashing & Locality Sensitive Hashing (LSH)

*Mostly following Chapters 4 & 5 from Manning/Raghavan/Schütze and Chapter 9 from Baeza-Yates/Ribeiro-Neto, with additions from recent research papers.

IR&DM, WS'11/12, November 29, 2011

V.1 Indexing

- Web, intranet, digital libraries, desktop search
- Unstructured/semistructured data

Processing pipeline: crawl → extract & clean → index → search → rank → present
- crawl: strategies for crawl schedule and priority queue for crawl frontier
- extract & clean: handle dynamic pages, detect duplicates, detect spam
- index: build and analyze Web graph, index all tokens or word stems
- search: fast top-k queries, query logging, auto-completion
- rank: scoring function over many data and context criteria
- present: GUI, user guidance, personalization

Server farms with 10,000s (2002) to 100,000s (2010) of computers, distributed/replicated data in high-performance file systems (GFS, HDFS, …), massive parallelism for query processing (MapReduce, Hadoop, …)

Content Gathering and Indexing

[Figure: pipeline from documents to index. Crawling → Extraction of relevant words → Linguistic methods: stemming, lemmas → Statistically weighted features (terms) → Indexing. Example document: "Web Surfing: In Internet cafes with or without Web Suit ...". Bag-of-words representations along the pipeline: {Surfing, Internet, Cafes} → {Surf, Internet, Cafe} → {Surf, Wave, Internet, WWW, eService, Cafe, Bistro}. A thesaurus (ontology) supplies synonyms and sub-/super-concepts; the index (B+-tree) maps terms such as Bistro and Cafe to URLs.]

Vector Space Model for Relevance Ranking

The search engine ranks documents by descending relevance, using a similarity metric such as the cosine measure:

$$ sim(d_i, q) := \frac{\sum_{j=1}^{|F|} d_{ij}\, q_j}{\sqrt{\sum_{j=1}^{|F|} d_{ij}^2}\ \sqrt{\sum_{j=1}^{|F|} q_j^2}} $$

Query $q \in [0,1]^{|F|}$ (set of weighted features).

Documents are feature vectors (bags of words) $d_i \in [0,1]^{|F|}$, normalized, e.g., using:

$$ d_{ij} := w_{ij} \Big/ \sqrt{\sum_k w_{ik}^2} $$

with, e.g., tf*idf as weights:

$$ w_{ij} := \log\left(1 + \frac{freq(f_j, d_i)}{\max_k freq(f_k, d_i)}\right) \cdot \log\frac{\#docs}{\#docs\ with\ f_j} $$
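The weighting and ranking on this slide can be sketched in a few lines of Python. This is a minimal illustration, assuming the tf*idf weight is read as log(1 + freq / max freq) · log(#docs / #docs with the feature) and that document vectors are L2-normalized; the function names `tfidf_vectors` and `cosine` are hypothetical and not from the lecture.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns (vocab, L2-normalized tf*idf vectors)."""
    vocab = sorted({t for d in docs for t in d})
    n_docs = len(docs)
    # Document frequency: in how many docs does each feature occur?
    df = Counter(t for d in docs for t in set(d))
    vectors = []
    for d in docs:
        freq = Counter(d)
        max_freq = max(freq.values()) if freq else 1
        # Assumed weight: log(1 + tf / max tf) * log(#docs / df)
        w = [math.log(1 + freq[t] / max_freq) * math.log(n_docs / df[t])
             for t in vocab]
        norm = math.sqrt(sum(x * x for x in w))
        vectors.append([x / norm if norm > 0 else 0.0 for x in w])
    return vocab, vectors

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0
```

A query is represented in the same feature space, for instance as a 0/1 vector over `vocab`, and documents are then sorted by descending `cosine` value.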

Combined Ranking with Content & Link Structure

The search engine ranks by descending relevance & authority for a query $q \in [0,1]^{|F|}$ (set of weighted features).

Ranking functions:
• Low-dimensional queries (ad-hoc ranking, Web search): BM25(F), authority scores, recency, document structure, etc.
• High-dimensional queries (similarity search): Cosine, Jaccard, Hamming on bitwise signatures, etc.
+ Dozens more features employed by various search engines

Digression: Basic Hardware Considerations

[Figure: typical computer with CPU, cache (C), main memory (M), secondary storage (HD), and tertiary storage, connected by a bus system.]

- CPU (64 bit @ 2 GHz): 16 GB/s
- Bus system: 32–256 bits @ 200–800 MHz
- Main memory: 3,200 MB/s (DDR-SDRAM @ 200 MHz); 6,400–12,800 MB/s (DDR2, dual channel, 800 MHz)
- Secondary storage (HD): 300 MB/s (SATA-300)

TransferRate = width (number of bits) × clock rate × data per clock / 8 (bytes/sec), where data per clock is typically 1.
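The transfer-rate formula can be checked against the slide's figures with a short calculation. This is only a sketch: the function name is hypothetical, and the 64-bit module width and two transfers per clock used for the DDR-SDRAM line are assumptions that are not stated on the slide.

```python
def transfer_rate_bytes_per_sec(width_bits, clock_hz, data_per_clock=1):
    """TransferRate = width (bits) x clock rate x data per clock / 8."""
    return width_bits * clock_hz * data_per_clock / 8

# CPU: 64 bit @ 2 GHz, one transfer per clock
cpu_rate = transfer_rate_bytes_per_sec(64, 2e9)

# DDR-SDRAM @ 200 MHz; ASSUMPTION: 64-bit wide, 2 transfers per clock
ddr_rate = transfer_rate_bytes_per_sec(64, 200e6, data_per_clock=2)

print(cpu_rate / 1e9, "GB/s")   # 16.0 GB/s
print(ddr_rate / 1e6, "MB/s")   # 3200.0 MB/s
```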

Moore's Law

Gordon Moore (Intel), 1965: "The density of integrated circuits (transistors) will double every 18 months!" (Moore's 1965 article projected a doubling every year, which he revised in 1975 to every two years; the 18-month figure is a widely quoted paraphrase.)
→ Has often been generalized to clock rates of CPUs, disk & memory sizes, etc.
→ Still holds today for integrated circuits!

Source: http://en.wikipedia.org/wiki/Moore%27s_law

More Modern View on Hardware

[Figure: multi-core, multi-CPU computer; each CPU has L1/L2 caches in front of cache (C), main memory (M), and secondary storage (HD).]

- CPU-to-L1-Cache: 3-5 cycles initial latency, then "burst" mode
- CPU-to-L2-Cache: 15-20 cycles latency
- CPU-to-Main-Memory: ~200 cycles latency

• CPU-cache becomes primary storage!
• Main-memory becomes secondary storage!

Data Centers

[Figure: Google Data Center, 2004. Source: J. Dean: WSDM 2009 Keynote]

Different Query Types

Find relevant docs by list processing on inverted indexes.

- Conjunctive queries: all words in q = q1 … qk required
- Disjunctive ("andish") queries: subset of q words qualifies, more of q yields higher score
- Mixed-mode queries and negations: q = q1 q2 q3 +q4 +q5 –q6
- Phrase queries and proximity queries: q = "q1 q2 q3" q4 q5 …
- Vague-match (approximate) queries with tolerance to spelling variants (see Chapter III.5)
- Structured queries and XML-IR: //article[about(.//title, "Harry Potter")]//sec

Including variant:
• scan & merge only subset of qi lists
• lookup long or negated qi lists only for best result candidates
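As a rough illustration of the disjunctive ("andish") case, the sketch below accumulates per-term scores so that documents matching more query words tend to score higher. The index layout (term mapped to a list of (docId, score) postings) and the name `andish_scores` are assumptions made for this example.

```python
from collections import defaultdict

def andish_scores(index, terms):
    """Disjunctive scoring: any matching term qualifies a document,
    and each additional matching term adds to its score."""
    acc = defaultdict(float)            # docId -> accumulated score
    for t in terms:
        for doc_id, s in index.get(t, []):
            acc[doc_id] += s
    # Highest score first; ties broken by ascending docId
    return sorted(acc.items(), key=lambda x: (-x[1], x[0]))
```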

Indexing with Inverted Lists

The vector space model suggests a term-document matrix, but the data is sparse and queries are even sparser. It is better to use inverted index lists, with terms as keys for a B+ tree.

Example query q: {professor research xml}, answered via a B+ tree on terms that points to index lists with postings (docId, score), sorted by docId:

professor: 17: 0.3, 44: 0.4, 52: 0.1, 53: 0.8, 55: 0.6, ...
research: 12: 0.5, 14: 0.4, 28: 0.1, 44: 0.2, 51: 0.6, 52: 0.3, ...
xml: 11: 0.6, 17: 0.1, 28: 0.7, ...

Google: > 10 million terms, > 20 billion docs, > 10 TB index

Terms can be full words, word stems, word pairs, substrings, N-grams, etc. (whatever "dictionary terms" we prefer for the application).

• Index-list entries in docId order for fast Boolean operations
• Many techniques for excellent compression of index lists
• Additional position index needed for phrases, proximity, etc. (or other pre-computed data structures)
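A minimal in-memory sketch of such an index is shown below, assuming per-document term scores have already been computed. The input layout and the name `build_inverted_index` are hypothetical; a real system would keep the lists compressed on disk behind the term dictionary.

```python
from collections import defaultdict

def build_inverted_index(scored_docs):
    """scored_docs: dict docId -> {term: score}.
    Returns dict term -> list of (docId, score) postings in ascending docId order."""
    index = defaultdict(list)
    # Visiting documents in docId order keeps every posting list sorted
    for doc_id in sorted(scored_docs):
        for term, score in scored_docs[doc_id].items():
            index[term].append((doc_id, score))
    return dict(index)

# A few postings taken from the slide's example lists
example = {
    44: {"professor": 0.4, "research": 0.2},
    17: {"professor": 0.3, "xml": 0.1},
    12: {"research": 0.5},
}
index = build_inverted_index(example)
```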

B+-Tree Index for Term Dictionary

[Figure: B+-tree over keywords with m = 3. Root [A-Z]; second level [A-I], [J-Z]; third level [A-D], [E-F], [G-I], [J-K], [L-Q], [R-Z]; leaf level [A-B], [C], [D], [E], [F], [G], [H], [I], …]

• B-tree: balanced tree with internal nodes of fan-out ≤ m
• B+-tree: leaf nodes additionally linked via pointers for efficient range scans
• For term dictionary: leaf entries point to inverted list entries on local disk and/or node in compute cluster
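The two operations a term dictionary needs, exact lookup and range scan over neighboring terms, can be imitated with a sorted array and binary search. Note that this is a stand-in for illustration, not a B+-tree: it mimics only the linked, sorted leaf level, and the class and method names are hypothetical.

```python
import bisect

class SortedTermDictionary:
    """Sorted-array stand-in for the leaf level of a B+-tree term dictionary."""

    def __init__(self, term_to_list):
        self._terms = sorted(term_to_list)   # sorted keys, like linked leaves
        self._lists = dict(term_to_list)     # term -> reference to its index list

    def lookup(self, term):
        """Exact-match lookup by binary search; None if the term is absent."""
        i = bisect.bisect_left(self._terms, term)
        if i < len(self._terms) and self._terms[i] == term:
            return self._lists[term]
        return None

    def range_scan(self, lo, hi):
        """All terms t with lo <= t < hi, in sorted order."""
        i = bisect.bisect_left(self._terms, lo)
        j = bisect.bisect_left(self._terms, hi)
        return self._terms[i:j]
```

A prefix query such as "all terms starting with prof" becomes `range_scan("prof", "prog")`, which is the kind of scan the linked leaves of a B+-tree make cheap.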

Inverted Index for Posting Lists

Documents: d1, …, dn (e.g., d10), each with per-term scores s(t1,d1) = 0.9, …, s(tm,d1) = 0.2.

Index lists:

t1: d10: 0.9, d23: 0.8, d54: 0.8, d67: 0.7, d88: 0.2, …
t2: d10: 0.8, d12: 0.6, d17: 0.6, d23: 0.2, d78: 0.1, …
t3: d10: 0.7, d12: 0.5, d23: 0.4, d88: 0.2, d99: 0.1, …

Index-list entries are usually stored in ascending order of docId (for efficient merge joins) or sorted in descending order of per-term score (impact-ordered lists for top-k style pruning).

Lists are usually compressed and divided into blocks whose sizes are convenient for disk operations.

Query Processing on Inverted Lists

[Figure: the example query q: {professor research xml} with its B+ tree on terms and docId-sorted index lists of postings (docId, score), repeated from the slide "Indexing with Inverted Lists".]

Given: query q = t1 t2 ... tz with z (conjunctive) keywords; similarity scoring function score(q,d) for docs d ∈ D, e.g., $\vec{q} \cdot \vec{d}$, with precomputed scores (index weights) $s_i(d)$ for which $q_i \neq 0$.

Find: top-k results for score(q,d) = aggr{$s_i(d)$} (e.g., $\sum_{i \in q} s_i(d)$).

Join-then-sort algorithm:

$$ \text{top-}k\Big( \sigma_{term=t_1}(index) \bowtie_{DocId} \sigma_{term=t_2}(index) \bowtie_{DocId} \dots \bowtie_{DocId} \sigma_{term=t_z}(index)\ \ \text{order by } s \text{ desc} \Big) $$
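The join-then-sort plan can be sketched as follows, using summation as the aggregation function and a hash join on docId in place of whatever join operator an actual engine would choose. The index layout (term mapped to (docId, score) postings) and the function name are assumptions for this example.

```python
def join_then_sort(index, terms, k):
    """Conjunctive top-k: join the terms' posting lists on docId,
    aggregate scores by summation, then sort by descending score."""
    if not terms:
        return []
    # Selection: fetch each term's list as a docId -> score map
    lists = [dict(index.get(t, [])) for t in terms]
    # Join on docId: keep only docs that appear in every list
    common = set(lists[0])
    for postings in lists[1:]:
        common &= set(postings)
    scored = [(doc_id, sum(p[doc_id] for p in lists)) for doc_id in common]
    # Order by score desc (ties by docId) and cut off after k results
    scored.sort(key=lambda x: (-x[1], x[0]))
    return scored[:k]
```

The weakness of this plan is that it joins and scores every matching document before sorting, even when only k results are wanted.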

Index List Processing by Merge Join

Keep each list L_i in ascending order of doc ids.

Delta encoding: compress L_i by storing the gaps between successive doc ids (or using some more sophisticated prefix-free code).

Query processing (QP) may start with those lists L_i that are short and have high idf.
→ Candidates need to be looked up in other lists L_j.

To avoid having to uncompress the entire list L_j, L_j is encoded into groups (i.e., blocks) of compressed entries with a skip pointer at the start of each block, using sqrt(n) evenly spaced skip pointers for a list of length n.

L_i: … 2 4 9 16 59 66 128 135 291 311 315 591 672 899
L_j: … 1 2 3 5 8 17 21 35 39 46 52 66 75 88 (skip!)
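Both ideas on this slide, gap (delta) encoding and skipping during the merge, can be sketched as below. For simplicity the lists are held uncompressed in memory, so the skip pointers only save comparisons here; in a real index they would also avoid decompressing the skipped blocks. All function names are hypothetical.

```python
import math

def delta_encode(doc_ids):
    """Store gaps between successive (ascending) doc ids."""
    gaps, prev = [], 0
    for d in doc_ids:
        gaps.append(d - prev)
        prev = d
    return gaps

def delta_decode(gaps):
    """Rebuild doc ids from gaps by running summation."""
    out, total = [], 0
    for g in gaps:
        total += g
        out.append(total)
    return out

def intersect_with_skips(short, long_):
    """Intersect two ascending doc-id lists. long_ is treated as having
    about sqrt(n) evenly spaced skip pointers at block starts."""
    n = len(long_)
    step = max(1, math.isqrt(n))        # block size ~ sqrt(n)
    result, j = [], 0
    for doc_id in short:
        # Follow skip pointers while the next block start is still <= doc_id
        nxt = (j // step + 1) * step
        while nxt < n and long_[nxt] <= doc_id:
            j = nxt
            nxt += step
        # Linear scan inside the current block
        while j < n and long_[j] < doc_id:
            j += 1
        if j == n:
            break
        if long_[j] == doc_id:
            result.append(doc_id)
    return result
```

On the slide's two example lists, only doc ids 2 and 66 occur in both, and most entries of L_j between 17 and 52 are never compared individually.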
