

SLIDE 1

Fast Bag-Of-Words Candidate Selection in Content-Based Instance Retrieval Systems

Michał Siedlaczek¹, Qi Wang¹, Yen-Yu Chen², Torsten Suel¹

¹ Department of Computer Science and Engineering, Tandon School of Engineering, New York University

² Blippar Inc.

December 12, 2018

SLIDE 2

Introduction

SLIDE 3

Problem Statement

◮ Given a database of different types of images
◮ Point phone camera at an object
◮ Recognize it by finding its instance in the database
◮ Implemented as part of an Augmented Reality application
◮ General search in a broad domain

SLIDE 4

Content-Based Instance Retrieval

◮ Given a picture, return its matching instance from the database
◮ Bag-of-words retrieval
  • 1. Extract descriptors, robust against rotation, scaling, etc.
    ◮ Convolutional Neural Networks (CNN) [Zheng 2017]
    ◮ Scale-Invariant Feature Transform (SIFT) [Lowe 1999]
  • 2. Translate feature set into visual words
  • 3. Use standard text search techniques to find candidates
  • 4. Rerank using a complex scoring method
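The quantization step (step 2) can be sketched as a toy example. This is a minimal illustration assuming a precomputed codebook of centroids; real systems use 128-dimensional SIFT or CNN descriptors and approximate nearest-neighbor assignment rather than brute force:

```python
import numpy as np

def quantize(descriptors, codebook):
    """Map each local descriptor to its nearest codebook centroid (visual word)."""
    # Pairwise squared distances between descriptors and centroids.
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def bag_of_words(descriptors, codebook):
    """Turn a set of local descriptors into a visual-word frequency vector."""
    words = quantize(descriptors, codebook)
    return np.bincount(words, minlength=len(codebook))

# Toy example: 2-D "descriptors" and a 3-word codebook.
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
descs = np.array([[0.1, 0.0], [0.9, 1.1], [1.9, 2.0], [0.0, 0.2]])
print(bag_of_words(descs, codebook))  # -> [2 1 1]
```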
SLIDE 5

Inverted Index

SLIDE 6

Document Retrieval

  • 1. Lists for query terms used to find matching documents
  • 2. Matching documents scored to find top N candidates
  • 3. Candidates re-ranked by a complex ranker (e.g., DNN or ML model) [Liu 2009, Wang 2010]
  • 4. Top k < N results returned to user
SLIDE 7

Document Retrieval

Our work:

◮ Queries are pictures
◮ SIFT-generated descriptors translated to visual-word queries
◮ Partial scores stored in index and added up at query time
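The scored-index idea can be sketched as follows. This is a minimal illustration, not the production implementation: each posting carries a precomputed partial score, and a term-at-a-time pass simply sums scores into accumulators:

```python
from collections import defaultdict

def build_scored_index(docs):
    """docs: {doc_id: {term: precomputed_partial_score}}.
    Returns term -> list of (doc_id, partial_score) postings."""
    idx = defaultdict(list)
    for doc_id, terms in docs.items():
        for term, score in terms.items():
            idx[term].append((doc_id, score))
    return idx

def taat_query(idx, query_terms):
    """Term-at-a-time: traverse each query term's list, summing partial scores."""
    acc = defaultdict(float)
    for term in query_terms:
        for doc_id, score in idx.get(term, []):
            acc[doc_id] += score
    return sorted(acc.items(), key=lambda kv: -kv[1])

docs = {1: {"a": 0.5, "b": 0.25}, 2: {"b": 1.0}, 3: {"a": 0.125, "c": 0.5}}
idx = build_scored_index(docs)
print(taat_query(idx, ["a", "b"]))  # -> [(2, 1.0), (1, 0.75), (3, 0.125)]
```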

SLIDE 8

Scored Inverted Index

SLIDE 9

Text Retrieval Algorithms

Exhaustive query processing

◮ Term at a time (TAAT)
◮ Document at a time (DAAT)
◮ Score at a time (SAAT)
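A minimal DAAT sketch (an illustration, not the paper's implementation): cursors on all query lists advance in document-id order, and each matching document is fully scored before moving on:

```python
import heapq

def daat_query(idx, query_terms, k):
    """Document-at-a-time over postings sorted by doc id.
    idx: term -> list of (doc_id, partial_score), sorted by doc_id."""
    # One cursor per query term: (current_doc, position, postings).
    heap = [(p[0][0], 0, p) for t in query_terms if (p := idx.get(t, []))]
    heapq.heapify(heap)
    top = []  # min-heap of (score, doc) holding the current top k
    while heap:
        doc = heap[0][0]
        score = 0.0
        # Pop every cursor positioned on `doc` and sum its partial score.
        while heap and heap[0][0] == doc:
            _, pos, postings = heapq.heappop(heap)
            score += postings[pos][1]
            if pos + 1 < len(postings):
                heapq.heappush(heap, (postings[pos + 1][0], pos + 1, postings))
        heapq.heappush(top, (score, doc))
        if len(top) > k:
            heapq.heappop(top)
    return sorted(top, reverse=True)

idx = {"a": [(1, 0.5), (3, 0.125)], "b": [(1, 0.25), (2, 1.0)]}
print(daat_query(idx, ["a", "b"], 2))  # -> [(1.0, 2), (0.75, 1)]
```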

SLIDE 10

Term at a Time

SLIDE 11

Document at a Time

SLIDE 12

Score at a Time

SLIDE 13

Safe Dynamic Pruning

Non-exhaustive processing

◮ Threshold Algorithm [Fagin 2001]
  ◮ Well-known algorithm used in databases
◮ MaxScore [Turtle 1995]
  ◮ Partitions terms/lists into essential and non-essential
◮ WAND [Broder 2003] (and variations)
  ◮ Find pivot – a document to which all lists can be skipped without missing any top-k document
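The WAND pivot-selection step can be illustrated with a small sketch (simplified to list-level score upper bounds only; a real implementation also advances and re-sorts cursors after each skip):

```python
def find_pivot(cursors, max_scores, threshold):
    """cursors: {term: current doc id of that term's list cursor}.
    max_scores: {term: maximum partial score in that term's list}.
    Returns the first doc id at which the accumulated score upper bound
    exceeds the threshold, or None if no document can enter the top k."""
    # Order terms by the document their cursor currently points to.
    ordered = sorted(cursors.items(), key=lambda kv: kv[1])
    upper = 0.0
    for term, doc in ordered:
        upper += max_scores[term]
        if upper > threshold:
            return doc  # all earlier lists can safely skip to this document
    return None

cursors = {"a": 4, "b": 10, "c": 25}
max_scores = {"a": 1.0, "b": 2.0, "c": 5.0}
print(find_pivot(cursors, max_scores, 2.5))  # -> 10 (1.0 + 2.0 > 2.5)
```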

SLIDE 14

Data Analysis

SLIDE 15

Data Analysis

Objective: better understand how quantitative properties of a bag-of-visual-words corpus and index may impact query efficiency.

Data Set Comparison
◮ BoVW
  ◮ subset of Blippar’s production BoVW collection
  ◮ sampled production queries
◮ Clueweb09-B
  ◮ standard IR text corpus
  ◮ TREC 06-09 Web Query Track topics

SLIDE 16

Data Analysis. 1: Query Lengths

Average query lengths: BoVW 272, Clueweb09-B 2.7

Significance
◮ Large overhead of selecting a posting list during processing in BoVW
◮ DAAT methods slow down significantly

SLIDE 17

Data Analysis. 2: Posting List Lengths

[Histograms of posting list lengths: Clueweb09-B, mean 172.72; BoVW, mean 674.1]

SLIDE 18

Data Analysis. 3: Posting List Max Scores

[Histograms of posting list max scores: Clueweb09-B, mean 14.5; BoVW, mean 142.72]

SLIDE 19

Data Analysis. 4: Length/Max Scores Correlation

◮ Clueweb09-B
  ◮ strong negative correlation (−0.66)
  ◮ Inverse Document Frequency: common words penalized by scoring functions
◮ BoVW
  ◮ almost no correlation (0.06)

Significance: potentially less advantage for dynamic pruning methods such as MaxScore.

SLIDE 20

Data Analysis. 5: Query Term Footprint

Query term footprint: the fraction of the query terms actually contained in the average top-k result.

Clueweb09-B
◮ 60%–95% depending on queries

BoVW
◮ 1.1% for production queries
◮ Conjunctive queries impossible
◮ Negative impact on MaxScore algorithms — few non-essential lists to skip

SLIDE 21

Data Analysis. 6: Index Size

Clueweb09-B

◮ 50 million documents
◮ billions of documents in real life

BoVW

◮ 2.6 million documents
◮ about an order of magnitude more in production
◮ far fewer documents than most large text collections

SLIDE 22

Data Analysis. 7: Accumulator Sparsity

Clueweb09-B

◮ ~15% of documents with non-zero scores

BoVW

◮ ~8% of documents with non-zero scores
◮ potential to improve accumulating and aggregating scores in TAAT processing

SLIDE 23

DAAT vs. TAAT

SLIDE 24

DAAT vs. TAAT

Results on BoVW

[Bar chart: query latency (ms), DAAT vs. TAAT]

◮ ~75% of DAAT instructions select the next posting list

SLIDE 25

DAAT vs. TAAT: Query Lengths

[Plot: latency (ms) by query length range (1-10 to 191-200), DAAT vs. TAAT]

SLIDE 26

TAAT Optimizations

SLIDE 27

TAAT Optimizations: Aggregation (A)

◮ Keep max of each block while traversing
◮ Before aggregating a block, check if max is higher than the current threshold
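A sketch of this aggregation optimization (illustrative only; the data layout and function names are assumptions). Per-block maxima, maintained while traversing postings, let the top-k selection skip an entire block of accumulators with a single comparison:

```python
import heapq

def aggregate_top_k(accumulators, block_maxes, block_size, k):
    """block_maxes[b] holds the max accumulator value in block b,
    kept up to date during posting traversal. During aggregation we
    skip any block whose max cannot beat the current k-th best score."""
    top = []         # min-heap of (score, doc_id) with the current top k
    threshold = 0.0  # current k-th best score
    for b, start in enumerate(range(0, len(accumulators), block_size)):
        if len(top) == k and block_maxes[b] <= threshold:
            continue  # whole block skipped with one comparison
        for i, score in enumerate(accumulators[start:start + block_size]):
            if len(top) < k:
                heapq.heappush(top, (score, start + i))
            elif score > threshold:
                heapq.heapreplace(top, (score, start + i))
            threshold = top[0][0] if len(top) == k else 0.0
    return sorted(top, reverse=True)

acc = [0.1, 0.9, 0.0, 0.2, 0.3, 0.0, 0.8, 0.7, 0.05]
maxes = [max(acc[i:i + 3]) for i in range(0, len(acc), 3)]
print(aggregate_top_k(acc, maxes, 3, 2))  # -> [(0.9, 1), (0.8, 6)]
```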

SLIDE 28

TAAT Optimizations: Prefetch (P)

◮ ~50% of accumulator access instructions miss the L1 cache
◮ We hint the CPU to prefetch accumulators ahead of time
◮ Additionally, we hint that an accumulator can be evicted right after the write instruction

SLIDE 29

TAAT Optimizations: Accumulator Initialization (I)

◮ A cyclic query counter q of size m
◮ At traversal, if qa < q, the accumulator is overwritten, and qa ← q
◮ Otherwise, we increase the accumulator
◮ At q = 0, we erase the accumulators before traversal
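A sketch of this lazy-initialization scheme (illustrative Python; an actual implementation would operate on raw arrays in a systems language). Each accumulator slot remembers the query generation qa it was last written in, so a stale value is simply overwritten instead of zeroing the whole array for every query; a full reset happens only once every m queries, when the counter wraps to 0:

```python
class Accumulators:
    """Accumulator array with lazy reset via a cyclic query counter."""

    def __init__(self, n_docs, m=256):
        self.scores = [0.0] * n_docs
        self.qa = [0] * n_docs   # generation of the last write per slot
        self.q = 0               # current query generation
        self.m = m

    def next_query(self):
        self.q = (self.q + 1) % self.m
        if self.q == 0:          # counter wrapped: do the one real reset
            self.scores = [0.0] * len(self.scores)
            self.qa = [0] * len(self.qa)

    def add(self, doc, partial_score):
        if self.qa[doc] < self.q:          # stale value from an older query
            self.scores[doc] = partial_score
            self.qa[doc] = self.q
        else:                              # already touched in this query
            self.scores[doc] += partial_score

a = Accumulators(n_docs=4)
a.next_query()
a.add(2, 0.5); a.add(2, 0.25)
print(a.scores[2])  # -> 0.75
a.next_query()      # no O(n) reset needed
a.add(2, 1.0)
print(a.scores[2])  # -> 1.0 (stale 0.75 overwritten)
```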

SLIDE 30

TAAT Optimizations

[Bar chart: latency (ms) for TAAT, A, A+P, A+P+I]

SLIDE 31

Early Termination

SLIDE 32

Safe Early Termination

◮ We analyzed the mechanics behind safe early termination techniques:
  ◮ Threshold Algorithm
  ◮ WAND
  ◮ MaxScore
◮ The data shows these techniques to be inefficient

SLIDE 33

Safe Early Termination

Threshold Algorithm: on average, the stopping condition occurs after processing 98% of postings.

MaxScore: given the real final threshold, 97% of terms (98% of the postings) are essential on average.

WAND: almost 80% of the postings have to be visited on average, and over 70% have to be evaluated.

SLIDE 34

Unsafe Score at a Time

[Plot: retrieval quality (N-S) as a function of the percentage of processed postings]
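Unsafe score-at-a-time termination can be sketched as follows (a simplified illustration; real SAAT indexes store each posting list impact-ordered with quantized scores rather than sorting at query time). The highest-scoring postings are processed first, and traversal stops after a fixed budget, trading accuracy for speed:

```python
def saat_query(idx, query_terms, budget_fraction):
    """Unsafe SAAT sketch: process postings in decreasing order of partial
    score and stop after a fixed fraction of them has been accumulated."""
    postings = []
    for t in query_terms:
        postings.extend(idx.get(t, []))
    # Highest-impact postings first (impact-ordered traversal).
    postings.sort(key=lambda p: -p[1])
    budget = max(1, int(len(postings) * budget_fraction))
    acc = {}
    for doc, score in postings[:budget]:
        acc[doc] = acc.get(doc, 0.0) + score
    return sorted(acc.items(), key=lambda kv: -kv[1])

idx = {"a": [(1, 0.5), (3, 0.125)], "b": [(1, 0.25), (2, 1.0)]}
# Only the top 2 of 4 postings are processed.
print(saat_query(idx, ["a", "b"], 0.5))  # -> [(2, 1.0), (1, 0.5)]
```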

SLIDE 35

Conclusions

◮ CBIR bag-of-words collections and queries are much different from textual ones
◮ This impacts the efficiency of known retrieval algorithms
◮ TAAT outperforms DAAT due to query length
◮ TAAT can be further optimized to neutralize its drawbacks
◮ Tested early termination techniques fail in our type of scenario

SLIDE 36

Q&A

SLIDE 37

References

[Broder 2003] Broder, Carmel, Herscovici, Soffer, Zien. Efficient query evaluation using a two-level retrieval process

[Fagin 2001] Fagin, Lotem, Naor. Optimal aggregation algorithms for middleware

[Lowe 1999] Lowe. Object recognition from local scale-invariant features

[Turtle 1995] Turtle, Flood. Query evaluation: strategies and optimizations

[Zheng 2017] Zheng, Yang, Tian. SIFT meets CNN: A decade survey of instance retrieval