inverted index

Inverted Index Lecture 12 Inverted Index 1 December 2014 1 - PowerPoint PPT Presentation



  1. Wentworth Institute of Technology COMP570 – Database Applications | Fall 2014 | Derbinsky Inverted Index Lecture 12 Inverted Index 1 December 2014 1

  2. Outline
     • Background & Motivation
       – Full-Text Search
       – Big-O Review
       – Indexing
     • The Inverted Index
       – An Example
       – Design a relational index
       – Advanced Issues
     • Example in Cognitive Modeling

  3. Problem: Full-Text Search
     • Given: set of “documents” containing “words”
       – General problem in the field of Information Retrieval
     • Task: find “best” document(s) that contain a set of words
     • Requirements
       – Fast & scalable
       – Relevant results (precision, recall, f-score)
       – Expressive queries
       – Up-to-date

  4. Example: Web Search
     document = web page/map listing

  5. Other Examples

  6. Scaling Up: # of “Documents”
     • >60 trillion pages
     • >1 billion users
     • >240 million items

  7. Scaling Up: Search Frequency
     • >68K/s
     • >11K/s

  8. Big-O Review
     • What does O(n) mean?
     • A linear algorithm for full-text search?
       – Find 5 documents that contain “WIT”
     • What is the complexity in terms of documents (d) and average-words-per-document (w)?
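The linear algorithm the slide asks about can be sketched as a plain scan. This is a minimal illustration, not the lecture's own code: the function name `linear_search` and the sample documents (borrowed from the later inverted-index example) are assumptions. In the worst case the scan reads every word of every document, so its cost grows as O(d · w).

```python
def linear_search(documents, word, limit=5):
    """Scan documents in order until `limit` matches are found."""
    matches = []
    for doc_id, text in documents.items():
        # Splitting and comparing touches every word: O(w) per document.
        if word in text.split():
            matches.append(doc_id)
            if len(matches) == limit:
                break
    return matches


# Hypothetical sample data for illustration only.
docs = {
    "D1": "it is what it is",
    "D2": "what is it",
    "D3": "it is a banana",
}
print(linear_search(docs, "what"))
```

When the word is rare or absent, the early exit never triggers and all d documents are read.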

  9. Is Linear-Time Google Possible?
     • Assume simple query:
     • Single listing of all web pages + words
       – 60T pages * 250 w/page * 8 bytes/w ~ 106PB (~4M Blu-ray)
     • Require 1s response
       – 7.5M * 2GHz 64-bit CPU (assume 1 cycle/w)
       – (7.5M * 68K) CPUs * 85W/CPU * $.15/kWh ~ $1.8M/s (~70 CPU/person, 3X US GDP)
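The back-of-the-envelope figures on this slide can be reproduced with a few lines of arithmetic. The sketch below uses only the slide's inputs; the variable names are made up for illustration, and it assumes the ~106PB figure uses binary petabytes (2^50 bytes), which is the interpretation that matches the stated result.

```python
PAGES = 60e12             # 60 trillion pages
WORDS_PER_PAGE = 250
BYTES_PER_WORD = 8
QUERIES_PER_SEC = 68e3    # search frequency from the earlier slide
CLOCK_HZ = 2e9            # 2 GHz CPU, assuming 1 cycle per word
WATTS_PER_CPU = 85
DOLLARS_PER_KWH = 0.15

total_words = PAGES * WORDS_PER_PAGE
# Assumption: "PB" on the slide means 2**50 bytes.
storage_pb = total_words * BYTES_PER_WORD / 2**50
# CPUs needed to scan every word within a 1 s response time.
cpus_per_query = total_words / CLOCK_HZ
total_cpus = cpus_per_query * QUERIES_PER_SEC
# Energy drawn in one second, expressed in kWh, then priced.
kwh_per_second = total_cpus * WATTS_PER_CPU / 1000 / 3600
cost_per_second = kwh_per_second * DOLLARS_PER_KWH

print(f"storage: {storage_pb:.1f} PB")
print(f"CPUs per query: {cpus_per_query:,.0f}")
print(f"cost: ${cost_per_second:,.0f} per second")
```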

  10. Indexing
      • Improve search speed at the cost of extra…
        – Memory for data structure(s)
        – Time to update the data structure(s)
      • Backbone of databases (physical design)
        – Search engines
        – Graphics/game engines
        – Simulation software …

  11. Inverted Index by Example
      Given documents { D1, D2, D3 }:
      – D1 = “it is what it is”
      – D2 = “what is it”
      – D3 = “it is a banana”
      Let’s try some queries: “what”, “a”, “banana”, “apple”
      Inverted Index: Distinct Word List (sorted) → Document Lists
      – “a”: [ D3 ]
      – “banana”: [ D3 ]
      – “is”: [ D1, D2, D3 ]
      – “it”: [ D1, D2, D3 ]
      – “what”: [ D1, D2 ]
      • Describe an algorithm to query this data structure
      • Describe an algorithm to populate these lists
      • Time & Memory: O(?) for Construction & Query
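One possible answer to the populate and query questions on this slide, sketched in Python. The names `build_index` and `query` are illustrative, and a dictionary stands in for the sorted distinct word list; a sorted array with binary search would be another reasonable choice.

```python
from collections import defaultdict


def build_index(documents):
    """Map each distinct word to the sorted list of documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.split():
            postings[word].add(doc_id)   # sets drop repeated words per document
    # Sort the words and each document list to mirror the slide's layout.
    return {word: sorted(ids) for word, ids in sorted(postings.items())}


def query(index, word):
    """Return the document list for `word`, or an empty list if it is absent."""
    return index.get(word, [])


docs = {
    "D1": "it is what it is",
    "D2": "what is it",
    "D3": "it is a banana",
}
index = build_index(docs)
for word in ["what", "a", "banana", "apple"]:
    print(word, query(index, word))
```

Under these assumptions, construction reads each word once (plus the final sorts), and a query is a single dictionary lookup whose cost does not depend on the number of documents.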

  12. Design a Relational Inverted Index
      Develop a set of table(s) and index(es) that support efficient construction and querying of an inverted index
      Assume
      • Documents have a unique id and a path
      • A document is a sequence of words
        – Document d = [ w1, w2, … wn ]
      • Search for a single, exact-match word
        – Does document D have word w?
        – The list of documents D that have word w?
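One possible design for this exercise, sketched with Python's built-in `sqlite3` module. The table names (`documents`, `postings`), column names, helper functions, and file paths are all assumptions made for illustration; the slide leaves the schema open. The composite primary key on `(word, doc_id)` gives the database an index that serves both queries on the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE documents (
    id   INTEGER PRIMARY KEY,
    path TEXT NOT NULL UNIQUE
);
CREATE TABLE postings (
    word   TEXT NOT NULL,
    doc_id INTEGER NOT NULL REFERENCES documents(id),
    PRIMARY KEY (word, doc_id)  -- index ordered by word, then document
);
""")


def add_document(path, text):
    """Insert a document and one posting per distinct word it contains."""
    cur = conn.execute("INSERT INTO documents (path) VALUES (?)", (path,))
    doc_id = cur.lastrowid
    conn.executemany(
        "INSERT OR IGNORE INTO postings (word, doc_id) VALUES (?, ?)",
        [(word, doc_id) for word in text.split()],
    )
    return doc_id


def has_word(doc_id, word):
    """Does document D have word w?"""
    row = conn.execute(
        "SELECT 1 FROM postings WHERE word = ? AND doc_id = ?",
        (word, doc_id),
    ).fetchone()
    return row is not None


def documents_with(word):
    """The list of documents (by path) that have word w."""
    rows = conn.execute(
        "SELECT d.path FROM postings p JOIN documents d ON d.id = p.doc_id "
        "WHERE p.word = ? ORDER BY d.id",
        (word,),
    )
    return [row[0] for row in rows]


# Hypothetical paths for illustration only.
d1 = add_document("d1.txt", "it is what it is")
d2 = add_document("d2.txt", "what is it")
d3 = add_document("d3.txt", "it is a banana")
print(documents_with("what"))
print(has_word(d3, "banana"))
```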

  13. Advanced Issues
      • More expressive query semantics
        – Multi-word
        – Locality: [“what is it”] vs. [“what it is”] vs. [“what”, “is”, “it”]
      • Ranked results
        – Document-ranking algorithm (e.g. PageRank)
        – Efficient ranked retrieval
      • Dynamics
        – Document addition/removal/modification
        – Rank
          • Document changes
          • Integration of real-time variables (e.g. location)
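The locality distinction above can be made concrete with a positional index, which records where each word occurs in each document. This is one common way to support phrase queries and is offered as a sketch, not as the lecture's method; the function names are illustrative, and the documents are reused from the earlier example.

```python
from collections import defaultdict


def build_positional_index(documents):
    """Map word -> document id -> list of positions of that word."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in documents.items():
        for pos, word in enumerate(text.split()):
            index[word][doc_id].append(pos)
    return index


def all_words(index, words):
    """Documents containing every word, in any order."""
    doc_sets = [set(index[w]) if w in index else set() for w in words]
    return sorted(set.intersection(*doc_sets)) if doc_sets else []


def phrase(index, words):
    """Documents containing the words consecutively, in the given order."""
    hits = []
    for doc_id in all_words(index, words):
        starts = index[words[0]][doc_id]
        # A start position s matches if word k appears at position s + k.
        if any(all(s + k in index[w][doc_id] for k, w in enumerate(words))
               for s in starts):
            hits.append(doc_id)
    return hits


docs = {
    "D1": "it is what it is",
    "D2": "what is it",
    "D3": "it is a banana",
}
pidx = build_positional_index(docs)
print(phrase(pidx, ["what", "is", "it"]))
print(phrase(pidx, ["what", "it", "is"]))
print(all_words(pidx, ["what", "is", "it"]))
```

The three calls correspond to the three query forms on the slide: two exact phrases that match different documents, and an unordered multi-word query that matches both.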

  14. Modeling Semantic Memory
      • Semantic memory is a human’s long-term store of facts about the world, independent of the context in which they were originally learned
      • The ACT-R (http://act-r.psy.cmu.edu) model of semantic memory has been successful at explaining a variety of psychological phenomena (e.g. retrieval bias, forgetting)
      • The model does not scale to large memory sizes, which hampers complex experiments

  15. Scale Fail [AFRL ’09]

  16. Memory Representation
      • Document = Node
      • Word = Edge
      Example cue: last(obama), spouse(X)

  17. Ranking [Anderson et al. ’04]
      Predict future usage via history
      Base-level activation: \ln\left( \sum_{j=1}^{n} t_j^{-d} \right)
      [Figure: Base-Level Activation vs. Time; annotation: M_A > M_B]
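The base-level activation formula on this slide, the natural log of the sum of t_j raised to the power -d over the n past uses, can be evaluated directly. In this sketch `ages` holds the time elapsed since each past use (each t_j, assumed positive), and the default decay `d=0.5` is an assumption based on the value conventionally used in ACT-R, not something the slide states.

```python
import math


def base_level_activation(ages, d=0.5):
    """Return ln(sum of t ** -d) over the elapsed times `ages` (all > 0)."""
    return math.log(sum(t ** -d for t in ages))


# Hypothetical usage histories: one memory used recently, one used long ago.
recent = base_level_activation([1, 2, 5])
stale = base_level_activation([50, 60, 80])
print(recent > stale)
```

Memories used more often and more recently receive higher activation, which is how the model predicts future usage from history.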

  18. Example
      [Figure: Semantic Objects: Features]

  19. Inverted Index
      [Figure: Semantic Objects: Features; Inverted Index]

  20. Index Statistics
      [Figure: Semantic Objects: Features; Inverted Index annotated with counts 3, 2, 1, 1, 1, 1]

  21. Top-1 Non-Ranked Retrieval
      [Figure: Inverted Index with counts; Cue; Query Plan; Candidate]
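The figure for this slide did not survive extraction, so the following is only one plausible reading of its labels (cue, query plan, candidate): order the cue's features by index cardinality, enumerate candidates from the shortest list, and return the first one that has all remaining features. The function name, data layout, and sample objects are assumptions made for illustration.

```python
def top1_unranked(index, objects, cue):
    """Return any one object that has every feature in `cue`, or None.

    index:   feature -> list of object ids that have the feature
    objects: object id -> set of features
    """
    if not cue or any(f not in index for f in cue):
        return None
    # Query plan: visit the rarest feature first to minimize candidates.
    plan = sorted(cue, key=lambda f: len(index[f]))
    for candidate in index[plan[0]]:
        if all(f in objects[candidate] for f in plan[1:]):
            return candidate
    return None


# Hypothetical objects and features.
objects = {"o1": {"f1", "f2"}, "o2": {"f1"}, "o3": {"f1", "f3"}}
index = {"f1": ["o1", "o2", "o3"], "f2": ["o1"], "f3": ["o3"]}
print(top1_unranked(index, objects, ["f1", "f3"]))
```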

  22. Introducing Rank
      [Figure: Semantic Objects: Features with ranks; Inverted Index with counts]

  23. Ranked Retrieval Algorithm #1: Sort on Query
      [Figure: Inverted Index; Cue; Query Plan; Candidate]
      Each query scales with the size of the candidate list!

  24. Ranked Retrieval Algorithm #2: Static Sort
      [Figure: Inverted Index with rank-sorted lists; Cue; Query Plan; Candidate]

  25. Ranked Retrieval Algorithm #2: Static Sort
      [Figure: Inverted Index with rank-sorted lists; Cue; Query Plan; Candidate]
      Each rank update scales with feature cardinality!

  26. Hybrid Approach
      • Empirically supported cardinality threshold, θ
      • If (cardinality > θ): Sort on Query [#1]
        – Candidate enumeration scales with # of objects with large cardinality (empirically rare)
      • If (cardinality ≤ θ): Static Sort [#2]
        – Bias updates must be locally efficient
          • Objects affected: O(1)
          • Computation: O(1)
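A simplified sketch of the hybrid idea, restricted to single-feature cues. The class and method names are hypothetical, and this is not the published implementation: lists at or below the threshold θ are kept sorted by rank so the best object is at the front, while longer lists are left unsorted and scanned at query time.

```python
class HybridIndex:
    """Toy hybrid of static sort (small lists) and sort on query (large lists)."""

    def __init__(self, theta):
        self.theta = theta
        self.rank = {}       # object -> rank (higher is better)
        self.features = {}   # object -> list of its features
        self.postings = {}   # feature -> list of objects

    def _maintain(self, feature):
        objs = self.postings[feature]
        # Static sort only for low-cardinality features, so the work
        # per update is bounded by a function of theta.
        if len(objs) <= self.theta:
            objs.sort(key=self.rank.__getitem__, reverse=True)

    def add(self, obj, features, rank):
        self.rank[obj] = rank
        self.features[obj] = list(features)
        for f in features:
            self.postings.setdefault(f, []).append(obj)
            self._maintain(f)

    def update_rank(self, obj, rank):
        self.rank[obj] = rank
        for f in self.features[obj]:
            self._maintain(f)

    def best(self, feature):
        objs = self.postings.get(feature, [])
        if not objs:
            return None
        if len(objs) <= self.theta:
            return objs[0]                           # static sort: front is best
        return max(objs, key=self.rank.__getitem__)  # sort on query: scan


idx = HybridIndex(theta=2)
idx.add("a", ["x", "y"], rank=1.0)
idx.add("b", ["x"], rank=3.0)
idx.add("c", ["x"], rank=2.0)
print(idx.best("x"), idx.best("y"))
```

Here feature `x` exceeds the threshold and is scanned on each query, while `y` stays sorted. A rank update never re-sorts the high-cardinality list, which is the cost the threshold is meant to avoid.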
