Efficient Scoring in Lucene (Stefan Pohl, Nokia Berlin)



slide-1
SLIDE 1

Efficient Scoring in Lucene

Stefan Pohl

Nokia Berlin stefan.pohl@nokia.com

slide-2
SLIDE 2

Agenda

 Motivation

 Review: Query Processing Modes in Lucene

 Scoring Efficiency Optimization

 Experiments

slide-3
SLIDE 3

Motivation

 Speed !

Human Reaction Time: 200 ms* → Backend latency: << 200 ms

 Load ?

→ Secs / Q ↓ means Q / secs ↑

 Why not Scale Out ?

→ Costs

* Steven C. Seow, Designing and Engineering Time: The Psychology of Time Perception in Software, Addison-Wesley Professional, 2008.

slide-4
SLIDE 4

Ranked Retrieval in IR Engines

 Conceptually:

→ sort docs by score (descending)

 Technically:

slide-5
SLIDE 5

Running Example

 Collection

24 900 500 docs, 1kB each, from English Wikipedia (used in Lucene's nightly benchmark: http://people.apache.org/~mikemccand/lucenebench)

 Query: ”The Berlin Buzzwords Conference”

 10 results queried

 Stats:

Term t | Doc. Freq. ft
The | 17,574,107
Berlin | 100,989
Buzzwords | 413
Conference | 207,041

slide-6
SLIDE 6

Conjunctions (AND)

”+The +Berlin +Buzzwords +Conference” → Matching requirement: All terms MUST* occur in result docs

* see o.a.l.search.BooleanClause.Occur.MUST

slide-7
SLIDE 7

Conjunctions (AND)

”+The +Berlin +Buzzwords +Conference”

slide-8
SLIDE 8

Conjunctions (AND)

”+The +Berlin +Buzzwords +Conference” → advance(9) ...uses skip lists

slide-9
SLIDE 9

Conjunctions (AND)

”+The +Berlin +Buzzwords +Conference” → advance(18)

slide-10
SLIDE 10

Conjunctions (AND)

”+The +Berlin +Buzzwords +Conference” → advance(19)

slide-11
SLIDE 11

Conjunctions (AND)

”+The +Berlin +Buzzwords +Conference” → advance(26)

slide-12
SLIDE 12

Conjunctions (AND)

”+The +Berlin +Buzzwords +Conference” → advance(29)

slide-13
SLIDE 13

Conjunctions (AND)

”+The +Berlin +Buzzwords +Conference” → advance(29)

slide-14
SLIDE 14

Conjunctions (AND)

”+The +Berlin +Buzzwords +Conference” → advance(31)

slide-15
SLIDE 15

Conjunctions (AND)

”+The +Berlin +Buzzwords +Conference” → advance(31)

slide-16
SLIDE 16

Conjunctions (AND)

”+The +Berlin +Buzzwords +Conference” → advance(31)

slide-17
SLIDE 17

Conjunctions (AND)

”+The +Berlin +Buzzwords +Conference” Result Set: {31,

slide-18
SLIDE 18

Conjunctions (AND)

”+The +Berlin +Buzzwords +Conference” Result Set: {31}

slide-19
SLIDE 19

Conjunctions (AND)

”+The +Berlin +Buzzwords +Conference”

 Few matches, only a few candidates to score

 Wikipedia 25M: 10 ms

→ Very efficient due to skipping, but 0 results → No partial match !
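The advance() walk-through on the preceding slides can be sketched as a leapfrog intersection. This is a minimal illustration, not Lucene's implementation: the function names are hypothetical, posting lists are plain sorted Python lists, and binary search stands in for the skip lists mentioned on the slides.

```python
from bisect import bisect_left


def advance(postings, pos, target):
    """Index of the first posting >= target, searching from pos onward.

    Binary search is an assumed stand-in for skip lists.
    """
    return bisect_left(postings, target, pos)


def intersect(lists):
    """Doc ids that occur in every sorted posting list (all terms MUST match)."""
    if not lists or any(not postings for postings in lists):
        return []
    lists = sorted(lists, key=len)      # let the rarest term drive the search
    pos = [0] * len(lists)
    result = []
    candidate = lists[0][0]
    while True:
        agreed = True
        for i, postings in enumerate(lists):
            pos[i] = advance(postings, pos[i], candidate)
            if pos[i] == len(postings):
                return result           # one list is exhausted: no more matches
            if postings[pos[i]] != candidate:
                candidate = postings[pos[i]]   # leap to the larger doc id
                agreed = False
                break
        if agreed:
            result.append(candidate)
            candidate += 1


# Hypothetical doc ids, loosely following the slides' example.
the = [2, 5, 9, 18, 19, 26, 29, 31, 40]
berlin = [9, 19, 31]
buzzwords = [31]
conference = [18, 29, 31, 35]
print(intersect([the, berlin, buzzwords, conference]))
```

Because every list only moves forward to the current candidate, most postings of the frequent term are never visited.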

slide-20
SLIDE 20

Disjunctions (OR)

”The Berlin Buzzwords Conference” → k-way merge (using min-heap over terms)*

* see o.a.l.search.BooleanScorer2

slide-21
SLIDE 21

Disjunctions (OR)

”The Berlin Buzzwords Conference” → next()

slide-22
SLIDE 22

Disjunctions (OR)

”The Berlin Buzzwords Conference” → next()

slide-23
SLIDE 23

Disjunctions (OR)

”The Berlin Buzzwords Conference” → next()

slide-24
SLIDE 24

Disjunctions (OR)

”The Berlin Buzzwords Conference” → next()

slide-25
SLIDE 25

Disjunctions (OR)

”The Berlin Buzzwords Conference” → next()

slide-26
SLIDE 26

Disjunctions (OR)

”The Berlin Buzzwords Conference” → next()

slide-27
SLIDE 27

Disjunctions (OR)

”The Berlin Buzzwords Conference” → next()

slide-28
SLIDE 28

Disjunctions (OR)

”The Berlin Buzzwords Conference” → next()

slide-29
SLIDE 29

Disjunctions (OR)

”The Berlin Buzzwords Conference” → next()

slide-30
SLIDE 30

Disjunctions (OR)

”The Berlin Buzzwords Conference” → next()

slide-31
SLIDE 31

Disjunctions (OR)

”The Berlin Buzzwords Conference” → next()

slide-32
SLIDE 32

Disjunctions (OR)

”The Berlin Buzzwords Conference” → next()

slide-33
SLIDE 33

Disjunctions (OR)

”The Berlin Buzzwords Conference” → next()

slide-34
SLIDE 34

Disjunctions (OR)

”The Berlin Buzzwords Conference” → next()

slide-35
SLIDE 35

Disjunctions (OR)

”The Berlin Buzzwords Conference” → next()

slide-36
SLIDE 36

Disjunctions (OR)

”The Berlin Buzzwords Conference”

 No skipping; all postings decompressed, merged & scores computed
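The k-way merge behind the next() steps can be sketched with a min-heap over the current posting of each term. This is an assumption-laden illustration, not BooleanScorer2: the function name and the in-memory layout (term mapped to a list of (doc_id, term_score) pairs) are hypothetical.

```python
import heapq


def union_scores(postings_by_term):
    """Yield (doc_id, score) in doc-id order for docs matching ANY term.

    postings_by_term: {term: [(doc_id, term_score), ...]}, sorted by doc_id.
    Every posting of every term is visited; nothing is skipped.
    """
    iters = {term: iter(plist) for term, plist in postings_by_term.items()}
    heap = []
    for term, it in iters.items():
        first = next(it, None)
        if first is not None:
            heap.append((first[0], term, first[1]))
    heapq.heapify(heap)                 # smallest current doc id on top
    while heap:
        doc = heap[0][0]
        score = 0.0
        # Pop every term positioned on this doc and sum its contribution.
        while heap and heap[0][0] == doc:
            _, term, term_score = heapq.heappop(heap)
            score += term_score
            nxt = next(iters[term], None)
            if nxt is not None:
                heapq.heappush(heap, (nxt[0], term, nxt[1]))
        yield doc, score


# Hypothetical postings; the top 2 are then picked from ALL scored docs.
postings = {
    "the": [(1, 0.25), (2, 0.25), (3, 0.25)],
    "berlin": [(2, 2.0)],
    "conference": [(3, 1.0)],
}
print(heapq.nlargest(2, union_scores(postings), key=lambda hit: hit[1]))
```

The cost is proportional to the total number of postings, which is why a single high-frequency term dominates the query time.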

slide-37
SLIDE 37

Disjunctions (OR)

”The Berlin Buzzwords Conference”

 Wikipedia 25M: 750 ms, 17,628,190 totalHits (vs. 10 queried)

→ Scoring of almost ALL documents

Can we do better?

slide-38
SLIDE 38

Optimized Scoring with Maxscore*

 Maxscore*

  • H. Turtle, J. Flood. Query Evaluation: Strategies and Optimizations, IPM, 31(6), 1995.

 Maxscore Variants

  • A. Z. Broder, D. Carmel, M. Herscovici, A. Soffer, J. Y. Zien. Efficient Query Evaluation using a Two-Level Retrieval Process, in Proc. of CIKM, 2003.

  • T. Strohman, H. Turtle, W. B. Croft. Optimization Strategies for Complex Queries, in Proc. of ACM SIGIR, 2005.

 Maxscore for Block-Compressed Indexes

  • K. Chakrabarti, S. Chaudhuri, V. Ganti. Interval-Based Pruning for Top-k Processing over Compressed Lists, in Proc. of ICDE, 2011.

 Maxscore with Structured Queries

  • S. Pohl, A. Moffat, J. Zobel. Efficient Extended Boolean Retrieval, IEEE TKDE, 24(6), 2012.

slide-39
SLIDE 39

Retrieval Model Scoring Functions

 Lucene's DefaultSimilarity: (formula not recovered)

 BM25: (formula not recovered)

 Scoring functions of (standard) retrieval models are SUMs over term score contributions
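The "sum over term contributions" property can be illustrated with BM25. This is a sketch under stated assumptions: the idf variant and the defaults k1 = 1.2, b = 0.75 are common choices rather than values from the slides, and the function names are hypothetical. It does not reproduce Lucene's DefaultSimilarity.

```python
import math


def bm25_term_score(tf, df, num_docs, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Contribution of ONE query term to a document's score.

    tf: term frequency in the doc, df: number of docs containing the term.
    The idf form below is one common variant (an assumption here).
    """
    idf = math.log(1.0 + (num_docs - df + 0.5) / (df + 0.5))
    length_norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1.0) / (tf + length_norm)


def bm25_score(term_stats, num_docs, doc_len, avg_doc_len):
    """Document score = plain SUM of the per-term contributions.

    term_stats: (tf, df) for each query term that occurs in the document.
    """
    return sum(
        bm25_term_score(tf, df, num_docs, doc_len, avg_doc_len)
        for tf, df in term_stats
    )


# Collection size and doc frequencies from the running example; tf and
# document lengths are made up for illustration.
N = 24_900_500
print(bm25_term_score(1, 413, N, 100, 100))         # "Buzzwords": rare term
print(bm25_term_score(1, 17_574_107, N, 100, 100))  # "The": frequent term
```

Since each term contributes independently, a per-term upper bound on the contribution also bounds how much that term can add to any document's total.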

slide-40
SLIDE 40

Maxscore

→ Order query terms by doc frequency ft

→ Box size refers to term score contribution

slide-41
SLIDE 41

Maxscore

→ At indexing time, determine maxscore s*

slide-42
SLIDE 42

Maxscore

→ At search time, compute cumulative maxscores c*

slide-43
SLIDE 43

Maxscore

→ At search time, compute cumulative maxscores c*

slide-44
SLIDE 44

Maxscore

→ At search time, compute cumulative maxscores c*

slide-45
SLIDE 45

Maxscore

→ At search time, compute cumulative maxscores c*
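A minimal sketch of the cumulative maxscores c*, assuming the terms are ordered from highest to lowest document frequency (the slides do not state the direction) and that each term's maxscore s* is already known. The s* values and function names below are hypothetical; the document frequencies are those of the running example.

```python
from itertools import accumulate


def cumulative_maxscores(terms):
    """terms: list of (term, doc_freq, max_score).

    Returns [(term, c*)], most frequent term first, where c* is the running
    sum of the max scores of that term and all more frequent terms.
    """
    ordered = sorted(terms, key=lambda t: t[1], reverse=True)
    sums = accumulate(t[2] for t in ordered)
    return [(t[0], c) for t, c in zip(ordered, sums)]


def num_non_essential(cumulative, threshold):
    """Count the leading terms whose c* is exceeded by the threshold.

    A doc containing only those terms cannot score above the threshold.
    """
    n = 0
    for _, c in cumulative:
        if threshold > c:
            n += 1
        else:
            break
    return n


terms = [
    ("The", 17_574_107, 0.5),      # max scores are made-up values
    ("Berlin", 100_989, 2.5),
    ("Buzzwords", 413, 6.0),
    ("Conference", 207_041, 2.0),
]
cum = cumulative_maxscores(terms)
print(cum)
print(num_non_essential(cum, threshold=3.0))
```

With these made-up values, a threshold of 3.0 exceeds the c* of "The" and "Conference", so only the lists of the two rarer terms would still need to be merged.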

slide-46
SLIDE 46

Maxscore

→ Score top-k

slide-47
SLIDE 47

Maxscore

→ Score top-k, track lowest score as threshold
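Tracking the lowest top-k score as a threshold is commonly done with a bounded min-heap. The class below is a hypothetical sketch, not Lucene's collector, and assumes k >= 1.

```python
import heapq


class TopK:
    """Keeps the k best (score, doc_id) pairs; the heap root is the threshold."""

    def __init__(self, k):
        self.k = k          # assumed to be >= 1
        self.heap = []      # min-heap: lowest retained score on top

    def threshold(self):
        """Score a new doc must exceed to enter; -inf until k docs are held."""
        return self.heap[0][0] if len(self.heap) == self.k else float("-inf")

    def offer(self, doc_id, score):
        """Insert the doc if it qualifies; return True when it was kept."""
        if len(self.heap) < self.k:
            heapq.heappush(self.heap, (score, doc_id))
            return True
        if score > self.heap[0][0]:
            heapq.heapreplace(self.heap, (score, doc_id))
            return True
        return False

    def results(self):
        """Retained hits, best score first."""
        return sorted(self.heap, reverse=True)


top = TopK(2)
for doc_id, score in [(1, 0.5), (2, 1.5), (3, 0.25), (4, 1.0)]:
    top.offer(doc_id, score)
print(top.threshold(), top.results())
```

The threshold only ever rises, which is what allows more and more terms to be excluded from the merge as scoring proceeds.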

slide-48
SLIDE 48

Maxscore

slide-49
SLIDE 49

Maxscore

slide-50
SLIDE 50

Maxscore

→ Threshold exceeds c*

slide-51
SLIDE 51

Maxscore

→ Merge m-1 terms, advance(16)

slide-52
SLIDE 52

Maxscore

→ Threshold exceeds next c*

slide-53
SLIDE 53

Maxscore

→ Merge m-2 terms, advance(29)
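The whole procedure can be sketched end to end as follows. This is a simplified document-at-a-time illustration under stated assumptions, not the LUCENE-4100 patch: postings are in-memory lists of (doc_id, term_score), s* is computed on the fly instead of at indexing time, bisect stands in for advance(), k >= 1, and all names are hypothetical.

```python
import heapq
from bisect import bisect_left


def maxscore_top_k(postings, k):
    """Return (top-k [(score, doc_id)] best first, number of docs scored).

    postings: {term: [(doc_id, term_score), ...]} sorted by doc_id,
    with non-negative term scores.
    """
    # Most frequent terms first: these are the lists worth skipping in.
    terms = sorted((t for t in postings if postings[t]),
                   key=lambda t: len(postings[t]), reverse=True)
    docs = [[d for d, _ in postings[t]] for t in terms]
    scores = [[s for _, s in postings[t]] for t in terms]
    cum, total = [], 0.0            # c*: running sums of per-term maxima s*
    for term_scores in scores:
        total += max(term_scores)
        cum.append(total)
    pos = [0] * len(terms)
    top = []                        # min-heap; top[0][0] is the threshold
    n = 0                           # leading terms no longer merged
    scored = 0
    while True:
        # Candidates come only from the terms that are still merged.
        cand = min((docs[i][pos[i]] for i in range(n, len(terms))
                    if pos[i] < len(docs[i])), default=None)
        if cand is None:
            break
        score = 0.0
        for i in range(n, len(terms)):
            if pos[i] < len(docs[i]) and docs[i][pos[i]] == cand:
                score += scores[i][pos[i]]
                pos[i] += 1
        # Excluded terms are only probed at the candidate: advance(cand).
        for i in range(n - 1, -1, -1):
            pos[i] = bisect_left(docs[i], cand, pos[i])
            if pos[i] < len(docs[i]) and docs[i][pos[i]] == cand:
                score += scores[i][pos[i]]
        scored += 1
        if len(top) < k:
            heapq.heappush(top, (score, cand))
        elif score > top[0][0]:
            heapq.heapreplace(top, (score, cand))
        # Threshold exceeds the next c*: drop that term from the merge.
        while len(top) == k and n < len(terms) and top[0][0] > cum[n]:
            n += 1
    return sorted(top, reverse=True), scored


# Hypothetical postings: "the" occurs in all 40 docs.
postings = {
    "the": [(d, 0.25) for d in range(1, 41)],
    "conference": [(3, 1.0), (16, 1.5), (29, 1.0), (31, 1.25)],
    "berlin": [(9, 2.0), (19, 2.0), (31, 2.5)],
    "buzzwords": [(31, 6.0)],
}
print(maxscore_top_k(postings, k=2))
```

Documents that contain only excluded terms are never scored, yet the returned top-k matches exhaustive scoring with the same insertion rule. Further refinements, such as abandoning a candidate early once its partial score plus the remaining c* cannot reach the threshold, are omitted here for brevity.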

slide-54
SLIDE 54

Maxscore – Experiments

 ”The Berlin Buzzwords Conference”:

System | Scored Docs | Time [ms]
Lucene40 | 17 628 190 | 750 ± 11
Lucene40 w/ Maxscore | 298 800 | 94 ± 3

→ 8X speed up !

 Hard queries from Lucene Benchmark:

[Chart not recovered; speed-up annotations: 6.6X, 2X]

slide-55
SLIDE 55

Maxscore – Summary

 Most effective for:

  • Large collections

  • Queries w/ high-freq terms, or large result sets resp.

  • Queries w/ many terms

 Benefits

  • Exact (identical results) → easy testing, debugging

  • Negligible overhead → never slower

  • More expensive scoring fct. possible

 Caveats

  • TotalHitCount → approximate, or say ”1000+”

  • Have to decide on Similarity at indexing time

slide-56
SLIDE 56

Conclusion

DON'T BE AFRAID to score millions of docs.

Follow and vote for LUCENE-4100 !

Thank you!