Berkeley-DB for Text/Multimedia Retrieval


Dec 24, 2022


SLIDE 1

Berkeley-DB for Text/Multimedia Retrieval

Chun Jin

Language Technologies Institute, School of Computer Science, Carnegie Mellon University

SLIDE 2

Motivation

Recent advances in text/multimedia retrieval: good algorithms

Scalability issue:

Continuous data growth
Adding new search features

Try: separating the scalability problem from the retrieval algorithms?

SLIDE 3

Our Goal

Providing a library for application/system building based on Berkeley DB.

System prototyping.

SLIDE 4

Text Retrieval

Vector Space Model (VSM) for Text:

D = {t1, t2, …, tm}
Q = {t1, t2, …, tm}
Sim(D, Q) = cos(D, Q)

To Scale: Inverted Index:

t1 -> (DID, TF, POS1, POS2, …), …
t2 -> (DID, TF, POS1, POS2, …), …
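The inverted index and cosine scoring above can be sketched in Python. This is a minimal illustration, not the library described in the talk: it assumes whitespace tokenisation and raw term-frequency weights, an in-memory dict stands in for Berkeley DB, and the names `build_index` and `cosine_scores` are hypothetical.

```python
import math
from collections import Counter, defaultdict

def build_index(docs):
    """Map each term to a postings list of (doc_id, tf, positions)."""
    index = defaultdict(list)
    for doc_id in sorted(docs):  # keep every postings list sorted by doc_id
        positions = defaultdict(list)
        for pos, term in enumerate(docs[doc_id].lower().split()):
            positions[term].append(pos)
        for term, pos_list in positions.items():
            index[term].append((doc_id, len(pos_list), pos_list))
    return index

def cosine_scores(index, docs, query):
    """Score documents with cos(D, Q) over raw term-frequency vectors."""
    q_tf = Counter(query.lower().split())
    dot = defaultdict(float)
    # Only documents that share a term with the query are ever touched.
    for term, q_weight in q_tf.items():
        for doc_id, tf, _positions in index.get(term, []):
            dot[doc_id] += tf * q_weight
    q_norm = math.sqrt(sum(w * w for w in q_tf.values()))
    scores = {}
    for doc_id, value in dot.items():
        d_tf = Counter(docs[doc_id].lower().split())
        d_norm = math.sqrt(sum(w * w for w in d_tf.values()))
        scores[doc_id] = value / (d_norm * q_norm)
    return scores

docs = {
    1: "berkeley db stores keys",
    2: "text retrieval with berkeley db",
    3: "image features",
}
index = build_index(docs)
scores = cosine_scores(index, docs, "berkeley db")
```

Document 3 shares no term with the query, so it never appears in `scores`; that pruning is what lets the index scale better than comparing the query against every document vector.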

SLIDE 5

Image Retrieval

Feature Space:

M = {f1, f2, …, fm}
Q = {f1, f2, …, fm}
Distance(M, Q) = ||M - Q||

To Scale: Quantization then Index:

fl -> (MID, flm), … for each feature f1, …, fl, …
Match entries with |flQ - flm| < δ
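One way to read "quantization then index" is sketched below in Python. It assumes uniform quantization with a fixed bucket width `step` and a per-dimension key `(dimension, bucket)`; both choices, and the names `build_feature_index` and `candidates`, are illustrative assumptions rather than details given in the slides.

```python
import math
from collections import defaultdict

def build_feature_index(items, step):
    """Key (dimension, bucket) -> list of (media_id, value)."""
    index = defaultdict(list)
    for mid in sorted(items):
        for dim, value in enumerate(items[mid]):
            index[(dim, math.floor(value / step))].append((mid, value))
    return index

def candidates(index, query, step, delta):
    """Per dimension, collect media whose value lies within delta of the query."""
    hits = defaultdict(dict)
    for dim, q in enumerate(query):
        lo = math.floor((q - delta) / step)
        hi = math.floor((q + delta) / step)
        for bucket in range(lo, hi + 1):  # only buckets that can contain a match
            for mid, value in index.get((dim, bucket), []):
                if abs(q - value) < delta:  # the |flQ - flm| < delta test
                    hits[mid][dim] = value
    return hits

def distance(m, q):
    """Euclidean distance ||M - Q|| used to rank the surviving candidates."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(m, q)))

items = {1: [0.10, 0.80], 2: [0.15, 0.20], 3: [0.90, 0.85]}
index = build_feature_index(items, step=0.25)
hits = candidates(index, [0.12, 0.82], step=0.25, delta=0.1)
full_matches = [mid for mid in sorted(hits) if len(hits[mid]) == 2]
```

Only item 1 is within `delta` on both dimensions, so the exact distance needs to be computed for a single candidate instead of the whole collection.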

SLIDE 6

Retrieval Algorithm

1. Get feature entries
2. Compute feature-level similarities
3. Compute document-query similarities

Index: t1 -> (DID, TF), …; t2 -> (DID, TF), …
Query: Q = {TFt1, TFt2, …, TFtm}

SLIDE 7

Retrieval Algorithm

Get feature entries: Berkeley DB:

BTree/Hash indexing
Storage/buffer management

Compute feature-level similarities

Compute document-query similarities: Join:

AND/OR (Inner/Outer) Join methods
Callback to compute Step 2
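The AND/OR join with a callback might look like the following Python sketch. It assumes each postings list is a docID-sorted list of `(doc_id, value)` pairs already fetched from the index; the `join` signature and the `add_scores` callback are hypothetical, and the callback here simply adds values where the talk's design would compute a similarity.

```python
def join(left, right, combine, mode="and"):
    """Join two docID-sorted lists of (doc_id, value) pairs.

    mode="and" keeps documents present in both lists (inner join);
    mode="or" keeps documents present in either list (outer join),
    passing None to the callback for the missing side.
    """
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        (lk, lv), (rk, rv) = left[i], right[j]
        if lk == rk:
            out.append((lk, combine(lv, rv)))
            i += 1
            j += 1
        elif lk < rk:
            if mode == "or":
                out.append((lk, combine(lv, None)))
            i += 1
        else:
            if mode == "or":
                out.append((rk, combine(None, rv)))
            j += 1
    if mode == "or":  # at most one of the two tails is non-empty
        out.extend((k, combine(v, None)) for k, v in left[i:])
        out.extend((k, combine(None, v)) for k, v in right[j:])
    return out

def add_scores(a, b):
    """Example callback: treat a missing side as a zero contribution."""
    return (a or 0) + (b or 0)

t1 = [(1, 2), (3, 1), (7, 4)]
t2 = [(3, 5), (4, 1), (7, 1)]
and_result = join(t1, t2, add_scores, mode="and")
or_result = join(t1, t2, add_scores, mode="or")
```

Because the join only sees docIDs and opaque values, the same routine can serve text or image features; only the callback changes.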

SLIDE 8

System Architecture

Components: Preprocessor, Inverted Indexer, Berkeley DB (Key Indexer, Storage Manager) holding the Inverted Index, Search Engine, Similarity Calculator

Inputs and outputs: Data, Query, Results

SLIDE 9

Development Layers

Retrieval Application: Feature Extraction, Similarity Measures

Retrieval API: Join Methods, Iterator, Inverted Index Formatter

Berkeley DB API: Key Indexer Lib (BTree/Hash, etc.), Storage Management

SLIDE 10

Berkeley DB vs. General DBMS:

Indexing techniques
Storage management
Operation: Join
Developer’s API
SQL
Transaction management
Recovery management

Task BDB DBMS √ √ √ √ √ √ √ √ √ √ X √ X √ X √

SLIDE 11

Merge Join

List MergeJoin(List left, List right, Feature Qrfeature)
    List result = empty list
    while (not left.end() and not right.end())
        lpair = left.current; rpair = right.current
        if lpair.key == rpair.key
            FeatureSim v = Qrfeature.Sim(rpair.data)    // feature sim
            lpair.data = DocSim(lpair.data, v)          // doc sim
            result.append(lpair)
            left = left.next(); right = right.next()
        else if lpair.key < rpair.key
            left = left.next()
        else
            right = right.next()
    return result

Basic algorithm adapted from Wikipedia
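A runnable Python rendering of the merge join is sketched below, keeping the two callbacks separate. The `TermFeature` class, the additive `doc_sim`, and the tuple layouts are assumptions made for illustration; the slides do not specify how `Sim` or `DocSim` are defined.

```python
class TermFeature:
    """Hypothetical query-side feature: weights a posting's TF by the query TF."""

    def __init__(self, query_tf):
        self.query_tf = query_tf

    def sim(self, posting_tf):
        # Feature-level similarity between the query term and one posting.
        return self.query_tf * posting_tf

def doc_sim(accumulated, feature_value):
    """Document-level similarity: fold one feature score into the running total."""
    return accumulated + feature_value

def merge_join(left, right, qr_feature):
    """left: docID-sorted (doc_id, partial score); right: docID-sorted (doc_id, tf)."""
    result = []
    left_it, right_it = iter(left), iter(right)
    lpair, rpair = next(left_it, None), next(right_it, None)
    while lpair is not None and rpair is not None:
        if lpair[0] == rpair[0]:
            v = qr_feature.sim(rpair[1])
            result.append((lpair[0], doc_sim(lpair[1], v)))
            lpair, rpair = next(left_it, None), next(right_it, None)
        elif lpair[0] < rpair[0]:
            lpair = next(left_it, None)  # left docID has no match on the right
        else:
            rpair = next(right_it, None)  # right docID has no match on the left
    return result

partial = [(1, 2.0), (3, 1.0), (7, 4.0)]
postings = [(3, 5), (4, 1), (7, 1)]
joined = merge_join(partial, postings, TermFeature(query_tf=2))
```

Both inputs are consumed through plain iterators, so each list is read once in docID order, which suits postings fetched sequentially from a BTree.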

SLIDE 12

Information Encapsulation

BDB: index key and entry boundary
Iterator: sub-entry boundary
Join: docID key and the rest of data
Similarity function: data structure
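The division of knowledge between layers can be illustrated with a packed postings entry. The byte layout below (little-endian `doc_id`, `tf`, then `tf` positions) is purely an assumed format for this sketch, as are the helper names; the storage layer would see only an opaque byte string per key.

```python
import struct

HEADER = struct.Struct("<II")   # assumed sub-entry header: doc_id, tf
POSITION = struct.Struct("<I")  # assumed encoding of one term position

def pack_postings(postings):
    """Formatter: serialise [(doc_id, [positions])] into one index entry."""
    chunks = []
    for doc_id, positions in postings:
        chunks.append(HEADER.pack(doc_id, len(positions)))
        chunks.extend(POSITION.pack(p) for p in positions)
    return b"".join(chunks)

def iter_postings(entry):
    """Iterator: knows sub-entry boundaries, yields (doc_id, raw data)."""
    offset = 0
    while offset < len(entry):
        doc_id, tf = HEADER.unpack_from(entry, offset)
        offset += HEADER.size
        data = entry[offset:offset + tf * POSITION.size]
        offset += len(data)
        yield doc_id, data  # the join layer needs only doc_id; data stays opaque

def decode_positions(data):
    """Similarity function: the only layer that interprets the data bytes."""
    return [p for (p,) in POSITION.iter_unpack(data)]

entry = pack_postings([(3, [0, 4]), (7, [2])])
decoded = [(doc_id, decode_positions(data)) for doc_id, data in iter_postings(entry)]
```

Changing what is stored per document, for example adding weights, would then touch the formatter and the similarity function but not the join or the storage layer.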

SLIDE 13

Flexibility

Feature similarity:

Term positions for proximity search
Weighted link information
Meta data adjustment

Document-Query similarity:

Cosine
Euclidean
Probabilistic
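As a rough illustration of this flexibility, the Python sketch below passes the document-query similarity in as a function and shows one possible position-based feature similarity. The `rank` helper, the negated Euclidean distance, and the `1 / gap` proximity score are assumptions chosen for the example, not measures specified in the talk.

```python
import math

def cosine(d, q):
    dot = sum(a * b for a, b in zip(d, q))
    norm = math.sqrt(sum(a * a for a in d)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0

def euclidean(d, q):
    """Negated distance, so that a larger value always means more similar."""
    return -math.sqrt(sum((a - b) ** 2 for a, b in zip(d, q)))

def proximity(pos_a, pos_b):
    """Feature-level similarity from term positions: 1 / smallest gap."""
    gap = min(abs(a - b) for a in pos_a for b in pos_b)
    return 1.0 / gap if gap else 1.0

def rank(docs, query, doc_sim):
    """Order document IDs by whichever similarity function is plugged in."""
    return sorted(docs, key=lambda doc_id: doc_sim(docs[doc_id], query), reverse=True)

docs = {"a": [1.0, 1.0], "b": [4.0, 0.0], "c": [2.0, 3.0]}
query = [2.0, 2.0]
by_cosine = rank(docs, query, cosine)
by_euclidean = rank(docs, query, euclidean)
```

With these vectors the two measures disagree on the top result: cosine prefers "a", which points in the same direction as the query, while Euclidean prefers "c", which lies closest to it.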

SLIDE 14

Ongoing Work

Inverted Index Structure Design
Implementing join methods and iterators
Inverted Indexer
Similarity functions and feature extraction

SLIDE 15

Related Work

Commercial Systems:

Google
Endeca
Oracle Text DB/Multimedia DB
IBM Net Search Extender
Thunderstone Texis
YouTube

Research:

CMU
Stanford [Su & Widom IDEAS05]

SLIDE 16

Conclusion

Problem:

Scalability issue on text/multimedia retrieval.

Idea:

Separating the problem from retrieval algorithms.
Layered architecture.

Goal:

Providing a library for application/system building.
Prototyping.

SLIDE 17

Acknowledgements

Thanks to Minglong Shao (CMU) and Zhu Liu (AT&T) for helpful discussions.

Thanks to Jaime Carbonell (CMU) for his continuous support and encouragement.