

SLIDE 1

Fast Bag-Of-Words Candidate Selection in Content-Based Instance Retrieval Systems

Michał Siedlaczek¹, Qi Wang¹, Yen-Yu Chen², Torsten Suel¹

¹ Department of Computer Science and Engineering, Tandon School of Engineering, New York University

² Blippar Inc.

December 12, 2018

SLIDE 2

Introduction

SLIDE 3

Problem Statement

◮ Given a database of different types of images
◮ Point phone camera at an object
◮ Recognize it by finding its instance in the database
◮ Implemented as part of an Augmented Reality application
◮ General search in a broad domain

SLIDE 4

Content-Based Instance Retrieval

◮ Given a picture, return its matching instance from the database
◮ Bag-of-words retrieval
  • 1. Extract descriptors, robust against rotation, scaling, etc.
    ◮ Convolutional Neural Networks (CNN) [Zheng 2017]
    ◮ Scale-Invariant Feature Transform (SIFT) [Lowe 1999]
  • 2. Translate feature set into visual words
  • 3. Use standard text search techniques to find candidates
  • 4. Rerank using a complex scoring method
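The quantization step (step 2) can be sketched as a toy example. This is a minimal illustration assuming a precomputed codebook of centroids; real systems use 128-dimensional SIFT or CNN descriptors and approximate nearest-neighbor assignment rather than brute force:

```python
import numpy as np

def quantize(descriptors, codebook):
    """Map each local descriptor to its nearest codebook centroid (visual word)."""
    # Pairwise squared distances between descriptors and centroids.
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def bag_of_words(descriptors, codebook):
    """Turn a set of local descriptors into a visual-word frequency vector."""
    words = quantize(descriptors, codebook)
    return np.bincount(words, minlength=len(codebook))

# Toy example: 2-D "descriptors" and a 3-word codebook.
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
descs = np.array([[0.1, 0.0], [0.9, 1.1], [1.9, 2.0], [0.0, 0.2]])
print(bag_of_words(descs, codebook))  # -> [2 1 1]
```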
SLIDE 5

Inverted Index

SLIDE 6

Document Retrieval

  • 1. Lists for query terms used to find matching documents
  • 2. Matching documents scored to find top N candidates
  • 3. Candidates re-ranked by a complex ranker (e.g., DNN or ML model) [Liu 2009, Wang 2010]
  • 4. Top k < N results returned to user
SLIDE 7

Document Retrieval

Our work:

◮ Queries are pictures
◮ SIFT-generated descriptors translated to visual-word queries
◮ Partial scores stored in index and added up at query time
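The scored-index idea can be sketched as follows. This is a minimal illustration, not the production implementation: each posting carries a precomputed partial score, and a term-at-a-time pass simply sums scores into accumulators:

```python
from collections import defaultdict

def build_scored_index(docs):
    """docs: {doc_id: {term: precomputed_partial_score}}.
    Returns term -> list of (doc_id, partial_score) postings."""
    idx = defaultdict(list)
    for doc_id, terms in docs.items():
        for term, score in terms.items():
            idx[term].append((doc_id, score))
    return idx

def taat_query(idx, query_terms):
    """Term-at-a-time: traverse each query term's list, summing partial scores."""
    acc = defaultdict(float)
    for term in query_terms:
        for doc_id, score in idx.get(term, []):
            acc[doc_id] += score
    return sorted(acc.items(), key=lambda kv: -kv[1])

docs = {1: {"a": 0.5, "b": 0.25}, 2: {"b": 1.0}, 3: {"a": 0.125, "c": 0.5}}
idx = build_scored_index(docs)
print(taat_query(idx, ["a", "b"]))  # -> [(2, 1.0), (1, 0.75), (3, 0.125)]
```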

SLIDE 8

Scored Inverted Index

SLIDE 9

Text Retrieval Algorithms

Exhaustive query processing

◮ Term at a time (TAAT)
◮ Document at a time (DAAT)
◮ Score at a time (SAAT)
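A minimal DAAT sketch (an illustration, not the paper's implementation): cursors on all query lists advance in document-id order, and each matching document is fully scored before moving on:

```python
import heapq

def daat_query(idx, query_terms, k):
    """Document-at-a-time over postings sorted by doc id.
    idx: term -> list of (doc_id, partial_score), sorted by doc_id."""
    # One cursor per query term: (current_doc, position, postings).
    heap = [(p[0][0], 0, p) for t in query_terms if (p := idx.get(t, []))]
    heapq.heapify(heap)
    top = []  # min-heap of (score, doc) holding the current top k
    while heap:
        doc = heap[0][0]
        score = 0.0
        # Pop every cursor positioned on `doc` and sum its partial score.
        while heap and heap[0][0] == doc:
            _, pos, postings = heapq.heappop(heap)
            score += postings[pos][1]
            if pos + 1 < len(postings):
                heapq.heappush(heap, (postings[pos + 1][0], pos + 1, postings))
        heapq.heappush(top, (score, doc))
        if len(top) > k:
            heapq.heappop(top)
    return sorted(top, reverse=True)

idx = {"a": [(1, 0.5), (3, 0.125)], "b": [(1, 0.25), (2, 1.0)]}
print(daat_query(idx, ["a", "b"], 2))  # -> [(1.0, 2), (0.75, 1)]
```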

SLIDE 10

Term at a Time

SLIDE 11

Document at a Time

SLIDE 12

Score at a Time

SLIDE 13

Safe Dynamic Pruning

Non-exhaustive processing

◮ Threshold Algorithm [Fagin 2001]
  ◮ Well-known algorithm used in databases
◮ MaxScore [Turtle 1995]
  ◮ Partitions terms/lists into essential and non-essential
◮ WAND [Broder 2003] (and variations)
  ◮ Find pivot – a document to which all lists can be skipped without missing any top-k document
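The WAND pivot-selection step can be illustrated with a small sketch (simplified to list-level score upper bounds only; a real implementation also advances and re-sorts cursors after each skip):

```python
def find_pivot(cursors, max_scores, threshold):
    """cursors: {term: current doc id of that term's list cursor}.
    max_scores: {term: maximum partial score in that term's list}.
    Returns the first doc id at which the accumulated score upper bound
    exceeds the threshold, or None if no document can enter the top k."""
    # Order terms by the document their cursor currently points to.
    ordered = sorted(cursors.items(), key=lambda kv: kv[1])
    upper = 0.0
    for term, doc in ordered:
        upper += max_scores[term]
        if upper > threshold:
            return doc  # all earlier lists can safely skip to this document
    return None

cursors = {"a": 4, "b": 10, "c": 25}
max_scores = {"a": 1.0, "b": 2.0, "c": 5.0}
print(find_pivot(cursors, max_scores, 2.5))  # -> 10 (1.0 + 2.0 > 2.5)
```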

SLIDE 14

Data Analysis

SLIDE 15

Data Analysis

Objective: better understand how quantitative properties of a bag-of-visual-words corpus and index may impact query efficiency.

Data Set Comparison
◮ BoVW
  ◮ subset of Blippar’s production BoVW collection
  ◮ sampled production queries
◮ Clueweb09-B
  ◮ standard IR text corpus
  ◮ TREC 06-09 Web Query Track topics

SLIDE 16

Data Analysis. 1: Query Lengths

Average query lengths: BoVW 272, Clueweb09-B 2.7

Significance
◮ Large overhead of selecting a posting list during processing in BoVW
◮ DAAT methods slow down significantly

SLIDE 17

Data Analysis. 2: Posting List Lengths

[Histograms of posting list lengths: Clueweb09-B, mean 172.72; BoVW, mean 674.1]

SLIDE 18

Data Analysis. 3: Posting List Max Scores

[Histograms of posting list max scores: Clueweb09-B, mean 14.5; BoVW, mean 142.72]

SLIDE 19

Data Analysis. 4: Length/Max Scores Correlation

◮ Clueweb09-B
  ◮ strong negative correlation (−0.66)
  ◮ Inverse Document Frequency: common words penalized by scoring functions
◮ BoVW
  ◮ almost no correlation (0.06)

Significance: potentially less advantage for dynamic pruning methods such as MaxScore.

SLIDE 20

Data Analysis. 5: Query Term Footprint

Query term footprint: the fraction of the query terms actually contained in the average top-k result.

Clueweb09-B
◮ 60%–95% depending on queries

BoVW
◮ 1.1% for production queries
◮ Conjunctive queries impossible
◮ Negative impact on MaxScore algorithms — few non-essential lists to skip

SLIDE 21

Data Analysis. 6: Index Size

Clueweb09-B

◮ 50 million documents
◮ billions of documents in real life

BoVW

◮ 2.6 million documents
◮ about an order of magnitude more in production
◮ far fewer documents than most large text collections

SLIDE 22

Data Analysis. 7: Accumulator Sparsity

Clueweb09-B

◮ ~15% of documents with non-zero scores

BoVW

◮ ~8% of documents with non-zero scores
◮ potential to improve accumulating and aggregating scores in TAAT processing

SLIDE 23

DAAT vs. TAAT

SLIDE 24

DAAT vs. TAAT

Results on BoVW

[Bar chart: query latency (ms), DAAT vs. TAAT]

◮ ~75% of DAAT instructions select the next posting list

SLIDE 25

DAAT vs. TAAT: Query Lengths

[Plot: latency (ms) by query length range (1-10 to 191-200), DAAT vs. TAAT]

SLIDE 26

TAAT Optimizations

SLIDE 27

TAAT Optimizations: Aggregation (A)

◮ Keep max of each block while traversing
◮ Before aggregating a block, check if max is higher than the current threshold
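A sketch of this aggregation optimization (illustrative only; the data layout and function names are assumptions). Per-block maxima, maintained while traversing postings, let the top-k selection skip an entire block of accumulators with a single comparison:

```python
import heapq

def aggregate_top_k(accumulators, block_maxes, block_size, k):
    """block_maxes[b] holds the max accumulator value in block b,
    kept up to date during posting traversal. During aggregation we
    skip any block whose max cannot beat the current k-th best score."""
    top = []         # min-heap of (score, doc_id) with the current top k
    threshold = 0.0  # current k-th best score
    for b, start in enumerate(range(0, len(accumulators), block_size)):
        if len(top) == k and block_maxes[b] <= threshold:
            continue  # whole block skipped with one comparison
        for i, score in enumerate(accumulators[start:start + block_size]):
            if len(top) < k:
                heapq.heappush(top, (score, start + i))
            elif score > threshold:
                heapq.heapreplace(top, (score, start + i))
            threshold = top[0][0] if len(top) == k else 0.0
    return sorted(top, reverse=True)

acc = [0.1, 0.9, 0.0, 0.2, 0.3, 0.0, 0.8, 0.7, 0.05]
maxes = [max(acc[i:i + 3]) for i in range(0, len(acc), 3)]
print(aggregate_top_k(acc, maxes, 3, 2))  # -> [(0.9, 1), (0.8, 6)]
```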

SLIDE 28

TAAT Optimizations: Prefetch (P)

◮ ~50% of accumulator access instructions miss the L1 cache
◮ We hint the CPU to prefetch accumulators ahead of time
◮ Additionally, we hint that an accumulator can be evicted right after the write instruction

SLIDE 29

TAAT Optimizations: Accumulator Initialization (I)

◮ A cyclic query counter q of size m
◮ At traversal, if qa < q, the accumulator is overwritten, and qa ← q
◮ Otherwise, we increase the accumulator
◮ At q = 0, we erase the accumulators before traversal
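A sketch of this lazy-initialization scheme (illustrative Python; an actual implementation would operate on raw arrays in a systems language). Each accumulator slot remembers the query generation qa it was last written in, so a stale value is simply overwritten instead of zeroing the whole array for every query; a full reset happens only once every m queries, when the counter wraps to 0:

```python
class Accumulators:
    """Accumulator array with lazy reset via a cyclic query counter."""

    def __init__(self, n_docs, m=256):
        self.scores = [0.0] * n_docs
        self.qa = [0] * n_docs   # generation of the last write per slot
        self.q = 0               # current query generation
        self.m = m

    def next_query(self):
        self.q = (self.q + 1) % self.m
        if self.q == 0:          # counter wrapped: do the one real reset
            self.scores = [0.0] * len(self.scores)
            self.qa = [0] * len(self.qa)

    def add(self, doc, partial_score):
        if self.qa[doc] < self.q:          # stale value from an older query
            self.scores[doc] = partial_score
            self.qa[doc] = self.q
        else:                              # already touched in this query
            self.scores[doc] += partial_score

a = Accumulators(n_docs=4)
a.next_query()
a.add(2, 0.5); a.add(2, 0.25)
print(a.scores[2])  # -> 0.75
a.next_query()      # no O(n) reset needed
a.add(2, 1.0)
print(a.scores[2])  # -> 1.0 (stale 0.75 overwritten)
```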

SLIDE 30

TAAT Optimizations

[Bar chart: latency (ms) for TAAT, A, A+P, A+P+I]

SLIDE 31

Early Termination

SLIDE 32

Safe Early Termination

◮ We analyzed the mechanics behind safe early termination techniques:
  ◮ Threshold Algorithm
  ◮ WAND
  ◮ MaxScore
◮ The data shows these techniques to be inefficient

SLIDE 33

Safe Early Termination

Threshold Algorithm: on average, the stopping condition occurs after processing 98% of postings.

MaxScore: given the real final threshold, 97% of terms (98% of the postings) are essential on average.

WAND: almost 80% of the postings have to be visited on average, and over 70% have to be evaluated.

SLIDE 34

Unsafe Score at a Time

[Plot: retrieval quality (N-S) as a function of the percentage of processed postings]
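Unsafe score-at-a-time termination can be sketched as follows (a simplified illustration; real SAAT indexes store each posting list impact-ordered with quantized scores rather than sorting at query time). The highest-scoring postings are processed first, and traversal stops after a fixed budget, trading accuracy for speed:

```python
def saat_query(idx, query_terms, budget_fraction):
    """Unsafe SAAT sketch: process postings in decreasing order of partial
    score and stop after a fixed fraction of them has been accumulated."""
    postings = []
    for t in query_terms:
        postings.extend(idx.get(t, []))
    # Highest-impact postings first (impact-ordered traversal).
    postings.sort(key=lambda p: -p[1])
    budget = max(1, int(len(postings) * budget_fraction))
    acc = {}
    for doc, score in postings[:budget]:
        acc[doc] = acc.get(doc, 0.0) + score
    return sorted(acc.items(), key=lambda kv: -kv[1])

idx = {"a": [(1, 0.5), (3, 0.125)], "b": [(1, 0.25), (2, 1.0)]}
# Only the top 2 of 4 postings are processed.
print(saat_query(idx, ["a", "b"], 0.5))  # -> [(2, 1.0), (1, 0.5)]
```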

SLIDE 35

Conclusions

◮ CBIR bag-of-words collections and queries are much different from textual ones
◮ This impacts the efficiency of known retrieval algorithms
◮ TAAT outperforms DAAT due to query length
◮ TAAT can be further optimized to neutralize its drawbacks
◮ Tested early termination techniques fail in our type of scenario

SLIDE 36

Q&A

SLIDE 37

References

[Broder 2003] Broder, Carmel, Herscovici, Soffer, Zien. Efficient query evaluation using a two-level retrieval process

[Fagin 2001] Fagin, Lotem, Naor. Optimal aggregation algorithms for middleware

[Lowe 1999] Lowe. Object recognition from local scale-invariant features

[Turtle 1995] Turtle, Flood. Query evaluation: strategies and optimizations

[Zheng 2017] Zheng, Yang, Tian. SIFT meets CNN: A decade survey of instance retrieval