Inverted Index: Lecture 12, 1 December 2014

SLIDE 1

Wentworth Institute of Technology COMP570 – Database Applications | Fall 2014 | Derbinsky

Inverted Index

Lecture 12

1 December 2014

SLIDE 2

Outline

  • Background & Motivation
    – Full-Text Search
    – Big-O Review
    – Indexing
  • The Inverted Index
    – An Example
    – Design a relational index
    – Advanced Issues
  • Example in Cognitive Modeling

SLIDE 3

Problem: Full-Text Search

  • Given: set of “documents” containing “words”
    – General problem in the field of Information Retrieval
  • Task: find “best” document(s) that contain a set of words
  • Requirements
    – Fast & scalable
    – Relevant results (precision, recall, f-score)
    – Expressive queries
    – Up-to-date

SLIDE 4

Example: Web Search

document = web page/map listing

SLIDE 5

Other Examples

SLIDE 6

Scaling Up: # of “Documents”

  • >60 trillion pages
  • >1 billion users
  • >240 million items

SLIDE 7

Scaling Up: Search Frequency

  • >68K/s
  • >11K/s

SLIDE 8

Big-O Review

  • What does O(n) mean?
  • A linear algorithm for full-text search?
    – Find 5 documents that contain “WIT”
  • What is the complexity in terms of documents (d) and average-words-per-document (w)?
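One way to picture the linear approach is the sketch below, which scans every word of every document until it has collected k matches. The function name `linear_search` and the sample documents are hypothetical, not from the lecture. In the worst case (few or no matches) it touches all d documents and roughly w words in each, so the work grows as O(d · w).

```python
def linear_search(documents, word, k):
    """Return ids of up to k documents that contain `word`, by brute-force scan."""
    hits = []
    for doc_id, text in documents.items():   # up to d documents
        for token in text.split():           # about w words per document
            if token == word:
                hits.append(doc_id)
                break                        # one match per document is enough
        if len(hits) == k:
            break                            # stop early once k documents are found
    return hits


# Hypothetical sample data for illustration only.
docs = {"D1": "WIT is in Boston", "D2": "hello world", "D3": "go WIT"}
print(linear_search(docs, "WIT", 5))
```

The early exit only helps when the word is common; a query for a rare or missing word still reads everything.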

SLIDE 9

Is Linear-Time Google Possible?

  • Assume simple query:
  • Single listing of all web pages + words
    – 60T pages * 250 w/page * 8 bytes/w ~ 106PB (~4M Blu-ray)
  • Require 1s response
    – 7.5M * 2GHz 64-bit CPU (assume 1 cycle/w)
    – (7.5M * 68K) CPUs * 85W/CPU * $.15/kWh ~ $1.8M/s (~70 CPU/person, 3X US GDP)
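The back-of-the-envelope numbers can be re-derived with a few lines of arithmetic. This is only a check of the slide's own assumptions (60T pages, 250 words/page, 8 bytes/word, 2GHz CPUs at one word per cycle, 68K queries/s, 85W per CPU, $0.15 per kWh); the variable names are made up for this sketch. The ~106PB figure matches when a petabyte is taken as 2^50 bytes; in decimal units the same quantity is 120PB.

```python
# Storage for one flat listing of every word on every page.
pages = 60e12
words_per_page = 250
bytes_per_word = 8
total_words = pages * words_per_page
storage_pib = total_words * bytes_per_word / 2**50   # binary petabytes

# CPUs needed to scan every word within one second, for one query.
cpu_hz = 2e9                                          # 1 word per cycle (assumed)
cpus_per_query = total_words / cpu_hz

# Power cost of serving every query that arrives in a second.
queries_per_second = 68e3
total_cpus = cpus_per_query * queries_per_second
kilowatts = total_cpus * 85 / 1000
kwh_per_second = kilowatts / 3600                     # energy used in one second
cost_per_second = kwh_per_second * 0.15               # dollars

print(f"storage ~ {storage_pib:.1f} PiB")
print(f"CPUs per query ~ {cpus_per_query:.2e}")
print(f"cost ~ ${cost_per_second:,.0f} per second")
```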

SLIDE 10

Indexing

  • Improve search speed at the cost of extra…
    – Memory for data structure(s)
    – Time to update the data structure(s)
  • Backbone of databases (physical design)
    – Search engines
    – Graphics/game engines
    – Simulation software …

SLIDE 11

Inverted Index by Example

Given documents { D1, D2, D3 }:

    – D1 = “it is what it is”
    – D2 = “what is it”
    – D3 = “it is a banana”

Inverted Index:

    – “a”: [ D3 ]
    – “banana”: [ D3 ]
    – “is”: [ D1, D2, D3 ]
    – “it”: [ D1, D2, D3 ]
    – “what”: [ D1, D2 ]

Annotations:

  • Distinct Word List (sorted)
  • Document Lists
  • Describe an algorithm to populate these lists
  • Describe an algorithm to query this data structure
  • Let’s try some queries: “what”, “a”, “banana”, “apple”
  • Time & Memory: O(?) for Construction & Query
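One possible answer to the two "describe an algorithm" prompts, sketched in Python on the three example documents. The function names `build_index` and `query` are invented for this sketch, and a hash map plus a final sort stands in for whatever structure the lecture had in mind. Under that assumption, construction visits each word occurrence once and then sorts the distinct words, and a single-word query is one dictionary lookup.

```python
from collections import defaultdict


def build_index(documents):
    """Map each distinct word to the sorted list of documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.split():
            postings[word].add(doc_id)       # a set ignores repeated words
    # Sort the distinct words and each document list for stable output.
    return {word: sorted(ids) for word, ids in sorted(postings.items())}


def query(index, word):
    """Return the document list for `word`, or an empty list if it is absent."""
    return index.get(word, [])


docs = {
    "D1": "it is what it is",
    "D2": "what is it",
    "D3": "it is a banana",
}
index = build_index(docs)
for w in ("what", "a", "banana", "apple"):
    print(w, query(index, w))
```

The query for “apple” returns an empty list without touching any document, which is the point of paying the construction cost up front.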

SLIDE 12

Design a Relational Inverted Index

Develop a set of table(s) and index(es) that support efficient construction and querying of an inverted index

Assume

  • Documents have a unique id and a path
  • A document is a sequence of words
    – Document d = [ w1, w2, … wn ]
  • Search for a single, exact-match word
    – Does document D have word w?
    – The list of documents D that have word w?
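A minimal sketch of one design that would satisfy these assumptions, using Python's built-in `sqlite3` module. The table names `document` and `posting`, the helper functions, and the sample paths are all hypothetical; other schemas (for example, a separate word table with integer ids) are equally reasonable. The composite primary key on (word, doc_id) gives SQLite an index that serves both query types.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE document (
    id   INTEGER PRIMARY KEY,
    path TEXT NOT NULL UNIQUE
);
-- One row per (word, document) pair; the primary key doubles as the index.
CREATE TABLE posting (
    word   TEXT    NOT NULL,
    doc_id INTEGER NOT NULL REFERENCES document(id),
    PRIMARY KEY (word, doc_id)
);
""")


def add_document(doc_id, path, words):
    """Insert a document and one posting per distinct word."""
    conn.execute("INSERT INTO document (id, path) VALUES (?, ?)", (doc_id, path))
    conn.executemany(
        "INSERT OR IGNORE INTO posting (word, doc_id) VALUES (?, ?)",
        [(w, doc_id) for w in words],
    )


def has_word(doc_id, word):
    """Does document doc_id have this word?"""
    row = conn.execute(
        "SELECT 1 FROM posting WHERE word = ? AND doc_id = ?", (word, doc_id)
    ).fetchone()
    return row is not None


def documents_with(word):
    """Paths of all documents that have this word, ordered by document id."""
    rows = conn.execute(
        "SELECT d.path FROM posting p JOIN document d ON d.id = p.doc_id "
        "WHERE p.word = ? ORDER BY d.id",
        (word,),
    )
    return [r[0] for r in rows]


add_document(1, "d1.txt", "it is what it is".split())
add_document(2, "d2.txt", "what is it".split())
add_document(3, "d3.txt", "it is a banana".split())
print(has_word(3, "what"), documents_with("what"))
```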

SLIDE 13

Advanced Issues

  • More expressive query semantics
    – Multi-word
    – Locality: [“what is it”] vs. [“what it is”] vs. [“what”, “is”, “it”]
  • Ranked results
    – Document-ranking algorithm (e.g. PageRank)
    – Efficient ranked retrieval
  • Dynamics
    – Document addition/removal/modification
    – Rank
      • Document changes
      • Integration of real-time variables (e.g. location)
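To illustrate the locality point, the sketch below stores word positions alongside document ids (a positional index, which is one common way to support phrase queries; the lecture does not say which technique it intends). The function names and the reuse of the three earlier example documents are assumptions made for this sketch.

```python
from collections import defaultdict


def build_positional_index(documents):
    """Map word -> {doc_id: [positions at which the word occurs]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in documents.items():
        for pos, word in enumerate(text.split()):
            index[word][doc_id].append(pos)
    return index


def and_query(index, words):
    """Documents containing all of the words, in any order."""
    if not words or any(w not in index for w in words):
        return []
    candidates = set(index[words[0]])
    for w in words[1:]:
        candidates &= set(index[w])
    return sorted(candidates)


def phrase_query(index, phrase):
    """Documents containing the words of `phrase` consecutively, in order."""
    words = phrase.split()
    hits = []
    for doc_id in and_query(index, words):
        # Try each position of the first word as the start of the phrase.
        for start in index[words[0]][doc_id]:
            if all(start + i in index[w][doc_id] for i, w in enumerate(words)):
                hits.append(doc_id)
                break
    return hits


docs = {
    "D1": "it is what it is",
    "D2": "what is it",
    "D3": "it is a banana",
}
pidx = build_positional_index(docs)
print(phrase_query(pidx, "what is it"))
print(phrase_query(pidx, "what it is"))
print(and_query(pidx, ["what", "is", "it"]))
```

The two phrase queries match different documents even though the unordered multi-word query matches both.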

SLIDE 14

Modeling Semantic Memory

  • Semantic memory is a human’s long-term store of facts about the world, independent of the context in which they were originally learned
  • The ACT-R (http://act-r.psy.cmu.edu) model of semantic memory has been successful at explaining a variety of psychological phenomena (e.g. retrieval bias, forgetting)
  • The model does not scale to large memory sizes, which hampers complex experiments

SLIDE 15

Scale Fail [AFRL ’09]

SLIDE 16

Memory Representation

  • Document = Node
  • Word = edge

Example cue: last(obama), spouse(X)

SLIDE 17

Ranking [Anderson et al. ‘04]


Predict future usage via history

$$\ln\left(\sum_{j=1}^{n} t_j^{-d}\right)$$

[Figure: Base-Level Activation vs. Time; annotation: MA > MB]
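A small numeric sketch of the base-level activation formula. The slide does not define its symbols, so this assumes the usual reading from the ACT-R literature: n past accesses, t_j the time elapsed since the j-th access, and d a decay parameter (0.5 is used here as an assumed default). The function name and the access histories are hypothetical.

```python
import math


def base_level_activation(access_times, now, d=0.5):
    """ln of the sum over past accesses of (time since access) ** -d.

    Every access time must be strictly earlier than `now`.
    """
    return math.log(sum((now - t) ** -d for t in access_times))


# Memory A was used often and recently; memory B only once, long ago.
memory_a = base_level_activation([1, 50, 90], now=100)
memory_b = base_level_activation([1], now=100)
print(memory_a, memory_b, memory_a > memory_b)
```

Frequent and recent use both raise the value, so ranking by it favours memories that history suggests will be needed again.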

SLIDE 18

Example

[Figure: Semantic Objects: Features]

SLIDE 19

Inverted Index

[Figure: Semantic Objects: Features; Inverted Index]

SLIDE 20

Index Statistics

[Figure: Semantic Objects: Features; Inverted Index]

SLIDE 21

Top-1 Non-Ranked Retrieval

[Figure: Inverted Index; labels: Cue, Query Plan, Candidate]

SLIDE 22

Introducing Rank

[Figure: Semantic Objects: Features; Inverted Index]

SLIDE 23

Ranked Retrieval Algorithm #1

Sort on Query

[Figure: Inverted Index; labels: Cue, Query Plan, Candidate]

Each query scales with the size of the candidate list!
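A rough sketch of what sorting at query time could look like; the slide's figure is not recoverable, so this is an interpretation rather than the lecture's exact algorithm. The function name, the feature strings, and the rank values are hypothetical. All candidates matching the cue are gathered first and only then ordered by rank, which is why the cost of every query grows with the candidate list.

```python
def sort_on_query(index, rank, cue):
    """Return objects matching every feature in `cue`, best rank first."""
    postings = [set(index.get(feature, ())) for feature in cue]
    if not postings:
        return []
    candidates = set.intersection(*postings)
    # The sort touches every candidate on every query.
    return sorted(candidates, key=lambda obj: rank[obj], reverse=True)


# Hypothetical inverted index (feature -> object ids) and rank table.
index = {"color:red": [1, 2, 3], "shape:round": [1, 3]}
rank = {1: 0.2, 2: 0.9, 3: 0.5}

print(sort_on_query(index, rank, ["color:red"]))
print(sort_on_query(index, rank, ["color:red", "shape:round"]))
```

Rank updates are cheap under this scheme (one write to the rank table); the cost is moved entirely to the query.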

SLIDE 24

Ranked Retrieval Algorithm #2

Static Sort

[Figure: Inverted Index; labels: Cue, Query Plan, Candidate]

SLIDE 25

Ranked Retrieval Algorithm #2

Static Sort

[Figure: Inverted Index; labels: Cue, Query Plan, Candidate]

Each rank update scales with feature cardinality!
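A rough sketch of the static-sort idea, again as an interpretation since the figure is lost: each feature's list is kept ordered by rank at all times, so the best match for a feature is simply the first entry. The class name `StaticSortIndex`, its methods, and the sample data are hypothetical. The price is paid in `update_rank`, which must reposition the object in one list per feature it has.

```python
import bisect


class StaticSortIndex:
    """Inverted index whose per-feature lists stay sorted by rank."""

    def __init__(self):
        self.postings = {}   # feature -> sorted list of (-rank, obj); best first
        self.features = {}   # obj -> set of its features
        self.rank = {}       # obj -> current rank

    def add(self, obj, features, rank):
        self.features[obj] = set(features)
        self.rank[obj] = rank
        for f in self.features[obj]:
            bisect.insort(self.postings.setdefault(f, []), (-rank, obj))

    def update_rank(self, obj, new_rank):
        old_entry = (-self.rank[obj], obj)
        # One reposition per feature of the object.
        for f in self.features[obj]:
            plist = self.postings[f]
            del plist[bisect.bisect_left(plist, old_entry)]
            bisect.insort(plist, (-new_rank, obj))
        self.rank[obj] = new_rank

    def top1(self, feature):
        plist = self.postings.get(feature)
        return plist[0][1] if plist else None


idx = StaticSortIndex()
idx.add(1, ["color:red", "shape:round"], 0.2)
idx.add(2, ["color:red"], 0.9)
idx.add(3, ["color:red", "shape:round"], 0.5)
before = (idx.top1("color:red"), idx.top1("shape:round"))
idx.update_rank(1, 1.5)
after = (idx.top1("color:red"), idx.top1("shape:round"))
print(before, after)
```

Python lists make each reposition linear in the list length; a balanced tree or a database index would reduce that, but the number of lists touched per update would stay the same.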

SLIDE 26

Hybrid Approach

  • Empirically supported cardinality threshold, θ
  • If (cardinality > θ): Sort on Query [#1]
    – Candidate enumeration scales with # of objects with large cardinality (empirically rare)
  • If (cardinality ≤ θ): Static Sort [#2]
    – Bias updates must be locally efficient
      • Objects affected: O(1)
      • Computation: O(1)

SLIDE 27

Some Results

Inverted index (via SQLite) + new approach was fast and scaled!

>30x faster than off-the-shelf database (on >3x data)!

[Figure: Retrieval Time (msec) vs. Query Size; series: 3.6M, 362K, 40K, 5K]

SLIDE 28

Takeaways

  • A common approach to large-scale search is indexing: using data structure(s) to improve access speed
  • An inverted index is commonly used for full-text search (even in situations that might not look like it)
  • Inverted indexes are fast, scalable, and straightforward to implement
  • Know your indexes/data structures! Careful problem analysis and algorithm development can often beat generic approaches
    – Even if you don’t use a DBMS, DBMS methods can be very useful in a variety of applications!
