Berkeley-DB for Text/Multimedia Retrieval



SLIDE 1

Berkeley-DB for Text/Multimedia Retrieval

Chun Jin

Language Technologies Institute School of Computer Science Carnegie Mellon University

SLIDE 2

Motivation

Recent advances in text/multimedia retrieval: good algorithms

Scalability issue:
  Continuous data growth
  Adding new search features

Try: separate the scalability problem from the retrieval algorithms?

SLIDE 3

Our Goal

Providing a library for application/system building based on Berkeley DB.

System prototyping.

SLIDE 4

Text Retrieval

Vector Space Model (VSM) for Text:

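D = {t1, t2, …, tm}
Q = {t1, t2, …, tm}
Sim(D, Q) = cos(D, Q)

To Scale: Inverted Index:

t1 -> (DID, TF, POS1, POS2, …), …
t2 -> (DID, TF, POS1, POS2, …), …

Each entry holds a document ID (DID), a term frequency (TF), and the term positions (POS1, POS2, …).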
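
As a minimal sketch of the two ideas on this slide, the Python below builds a small inverted index with term positions and computes a cosine similarity between term-frequency vectors. The function names, the whitespace tokenizer, and the sample documents are assumptions for illustration only; the slides do not specify them.

```python
import math
from collections import Counter, defaultdict

def build_inverted_index(docs):
    """Map each term to postings of the form (doc_id, tf, positions)."""
    index = defaultdict(list)
    for doc_id in sorted(docs):  # keep postings ordered by document ID
        positions = defaultdict(list)
        for pos, term in enumerate(docs[doc_id].lower().split()):
            positions[term].append(pos)
        for term, pos_list in positions.items():
            index[term].append((doc_id, len(pos_list), pos_list))
    return dict(index)

def cosine(d, q):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(w * q[t] for t, w in d.items() if t in q)
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    return dot / (norm_d * norm_q) if norm_d and norm_q else 0.0

# Hypothetical sample data.
docs = {1: "berkeley db stores keys", 2: "db index db"}
index = build_inverted_index(docs)
doc2 = Counter(docs[2].split())
query = Counter("db index".split())
similarity = cosine(doc2, query)
```

Keeping each posting list sorted by document ID is a design choice that lets later steps combine lists in a single pass.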

SLIDE 5

Image Retrieval

Feature Space

M = {f1, f2, …, fm}
Q = {f1, f2, …, fm}
Distance(M, Q) = ||M - Q||

To Scale: Quantization then Index:

f1 -> …
…
fl -> entries of the form (MID, flm), …

Match condition on feature fl: |flQ - flm| < δ
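
One possible reading of "quantization then index" is sketched below: each feature value is quantized into a bucket of width δ, the bucket becomes part of the index key, and a query inspects only the neighbouring buckets before applying the exact test |flQ - flm| < δ. The bucket scheme, the names, and the sample values are assumptions, not the author's design.

```python
import math
from collections import defaultdict

DELTA = 0.1  # hypothetical matching threshold (the slide's δ)

def bucket(value, delta=DELTA):
    """Quantize a feature value into an integer bucket of width delta."""
    return math.floor(value / delta)

def build_feature_index(media, delta=DELTA):
    """Map (feature number, bucket) to a list of (media_id, value) entries."""
    index = defaultdict(list)
    for mid, features in media.items():
        for l, value in enumerate(features):
            index[(l, bucket(value, delta))].append((mid, value))
    return index

def candidates(index, l, q_value, delta=DELTA):
    """Return media IDs whose feature l lies within delta of q_value."""
    b = bucket(q_value, delta)
    hits = []
    # Values closer than delta can only fall in the same or an adjacent bucket.
    for nb in (b - 1, b, b + 1):
        for mid, value in index.get((l, nb), []):
            if abs(q_value - value) < delta:
                hits.append(mid)
    return sorted(hits)

# Hypothetical media objects, each with two feature values.
media = {"a": [0.12, 0.80], "b": [0.19, 0.30], "c": [0.45, 0.82]}
feature_index = build_feature_index(media)
```

With this layout a query touches at most three buckets per feature instead of scanning every media object.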

SLIDE 6

Retrieval Algorithm

1. Get feature entries
2. Compute feature-level similarities
3. Compute document-query similarities

Inverted index entries: t1 -> (DID, TF), …; t2 -> (DID, TF), …
Query: Q = {TFt1, TFt2, …, TFtm}
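
The three steps can be sketched as a single loop over the query terms. This is an illustration under simple assumptions: the in-memory `INDEX` dictionary stands in for the stored inverted index, the feature-level similarity is a plain product of term frequencies, and the document score is their unnormalized sum rather than a full cosine.

```python
from collections import defaultdict

# Hypothetical inverted index: term -> list of (doc_id, tf) entries.
INDEX = {
    "berkeley": [(1, 2), (3, 1)],
    "db": [(1, 1), (2, 3), (3, 1)],
}

def feature_sim(query_tf, doc_tf):
    """Step 2: feature-level similarity, here a product of term frequencies."""
    return query_tf * doc_tf

def retrieve(index, query):
    """Rank documents by the sum of their feature-level similarities."""
    scores = defaultdict(float)
    for term, query_tf in query.items():
        for doc_id, doc_tf in index.get(term, []):  # Step 1: get feature entries
            # Steps 2 and 3: score the feature, then fold it into the document score.
            scores[doc_id] += feature_sim(query_tf, doc_tf)
    # Highest score first; ties broken by document ID.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

ranking = retrieve(INDEX, {"berkeley": 1, "db": 1})
```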

SLIDE 7

Retrieval Algorithm

1. Get feature entries: Berkeley DB:
   BTree/Hash indexing
   Storage/buffer management

2. Compute feature-level similarities

3. Compute document-query similarities: Join:
   AND/OR (Inner/Outer) join methods
   Callback to compute Step 2

Figure: posting lists t1 -> …, t2 -> …
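
To illustrate the AND/OR distinction, here is a hedged sketch of a join over two posting lists sorted by document ID, with a callback that combines the two sides. The signature, the `mode` flag, and the convention of passing 0.0 for a missing side are assumptions made for this example.

```python
def join(left, right, combine, mode="or"):
    """Join two lists of (doc_id, score) pairs sorted by doc_id.

    mode="or" keeps documents found on either side (outer join);
    mode="and" keeps only documents found on both sides (inner join).
    combine(left_score, right_score) is the callback; a missing side is 0.0.
    """
    keep_unmatched = (mode == "or")
    out, i, j = [], 0, 0
    while i < len(left) or j < len(right):
        if j == len(right) or (i < len(left) and left[i][0] < right[j][0]):
            # Document appears only in the left list.
            if keep_unmatched:
                out.append((left[i][0], combine(left[i][1], 0.0)))
            i += 1
        elif i == len(left) or right[j][0] < left[i][0]:
            # Document appears only in the right list.
            if keep_unmatched:
                out.append((right[j][0], combine(0.0, right[j][1])))
            j += 1
        else:
            # Same document on both sides.
            out.append((left[i][0], combine(left[i][1], right[j][1])))
            i += 1
            j += 1
    return out

# Hypothetical posting lists with precomputed feature-level scores.
t1 = [(1, 0.5), (3, 0.25)]
t2 = [(2, 0.5), (3, 0.25)]
add = lambda a, b: a + b
```

An OR join suits ranked retrieval, where a document matching one query term may still score; an AND join suits conjunctive queries.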

SLIDE 8

System Architecture

Components in the architecture diagram:

Preprocessor
Inverted Indexer
Key Indexer
Storage Manager
Berkeley DB: Inverted Index
Search Engine
Similarity Calculator

Inputs and outputs: Data, Query, Results

SLIDE 9

Development Layers

Retrieval Application: Feature Extraction, Similarity Measures

Retrieval API: Join Methods, Iterator, Inverted Index Formatter

Berkeley DB API: Key Indexer Lib (BTree/Hash, etc.), Storage Management

SLIDE 10

Berkeley DB vs. General DBMS:

Indexing techniques
Storage management
Operation: Join
Developer’s API
SQL
Transaction management
Recovery management

Task BDB DBMS √ √ √ √ √ √ √ √ √ √ X √ X √ X √

SLIDE 11

Merge Join

List MergeJoin(List left, List right, Feature Qrfeature)
    result = left;    // remember the head; left is advanced below
    while (not left.end() and not right.end())
        lpair = left.current; rpair = right.current;
        if lpair.key = rpair.key
            FeatureSim v = Qrfeature.Sim(rpair.data);    // feature sim
            lpair.data = DocSim(lpair.data, v);          // doc sim
            left = left.next(); right = right.next();
        else if lpair.key < rpair.key
            left = left.next();
        else
            right = right.next();
    return result;

Basic algorithm adopted from Wikipedia
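
A runnable Python rendering of the merge join is sketched below. It departs from the slide's pseudocode in one labeled way: it returns a new list holding only the matched documents instead of updating `left` in place. The two callbacks correspond to the feature-sim and doc-sim steps; the concrete lambdas and sample lists are hypothetical.

```python
def merge_join(left, right, feature_sim, doc_sim):
    """Inner merge join of two lists of (doc_id, data) pairs sorted by doc_id.

    left holds partial document scores; right holds one feature's entries.
    """
    result, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        l_key, l_data = left[i]
        r_key, r_data = right[j]
        if l_key == r_key:
            v = feature_sim(r_data)                      # feature sim
            result.append((l_key, doc_sim(l_data, v)))   # doc sim
            i += 1
            j += 1
        elif l_key < r_key:
            i += 1  # document missing from right: skip it
        else:
            j += 1  # document missing from left: skip it
    return result

# Hypothetical inputs: running scores on the left, (doc_id, tf) on the right.
left = [(1, 1.0), (2, 2.0), (4, 0.5)]
right = [(2, 3), (3, 1), (4, 2)]
joined = merge_join(
    left,
    right,
    feature_sim=lambda tf: 2 * tf,    # assumes a query term frequency of 2
    doc_sim=lambda acc, v: acc + v,   # accumulate into the document score
)
```

Because both inputs are sorted by document ID, each list is scanned once, so the cost is linear in the combined list length.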

SLIDE 12

Information Encapsulation

BDB: index key and entry boundary
Iterator: sub-entry boundary
Join: docID key and the rest of data
Similarity function: data structure
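
The division of knowledge between layers might look like the sketch below. A dictionary stands in for Berkeley DB, and the 6-byte entry layout (`doc_id` as uint32, `tf` as uint16) is a hypothetical format chosen only for illustration. The iterator knows where sub-entries begin and end, the caller sees only the docID and an opaque payload, and the similarity function alone decodes that payload.

```python
import struct

ENTRY = struct.Struct("<IH")  # hypothetical layout: doc_id (uint32), tf (uint16)

def pack_postings(postings):
    """Serialize (doc_id, tf) pairs into one value stored under a term key."""
    return b"".join(ENTRY.pack(doc_id, tf) for doc_id, tf in postings)

def iterate(value):
    """Iterator layer: finds sub-entry boundaries, yields (doc_id, payload)."""
    for offset in range(0, len(value), ENTRY.size):
        chunk = value[offset:offset + ENTRY.size]
        (doc_id,) = struct.unpack_from("<I", chunk)
        yield doc_id, chunk[4:]  # payload stays opaque to the caller

def tf_similarity(payload, query_tf=1):
    """Similarity layer: the only code that decodes the payload layout."""
    (tf,) = struct.unpack("<H", payload)
    return query_tf * tf

# The store layer sees only keys and whole values (dict stands in for BDB).
store = {b"db": pack_postings([(1, 1), (2, 3)])}
scores = {doc_id: tf_similarity(payload) for doc_id, payload in iterate(store[b"db"])}
```

Changing the payload format, for example to add term positions, would then touch only the packing code and the similarity function.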

SLIDE 13

Flexibility

Feature similarity:

Term positions for proximity search
Weighted link information
Metadata adjustment

Document-Query similarity:

Cosine
Euclidean
Probabilistic
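
As one example of a pluggable feature similarity, the sketch below uses stored term positions for proximity search: it finds the smallest gap between two terms' positions in a document and turns it into a score. The scoring formula 1 / (1 + gap) and the function names are assumptions, not part of the slides.

```python
def min_gap(positions_a, positions_b):
    """Smallest distance between a position of term A and one of term B.

    Both lists must be sorted in ascending order; returns None if either
    list is empty.
    """
    best, i, j = None, 0, 0
    while i < len(positions_a) and j < len(positions_b):
        gap = abs(positions_a[i] - positions_b[j])
        if best is None or gap < best:
            best = gap
        # Advance the cursor that points at the smaller position.
        if positions_a[i] < positions_b[j]:
            i += 1
        else:
            j += 1
    return best

def proximity_sim(positions_a, positions_b):
    """Hypothetical feature similarity: closer term pairs score higher."""
    gap = min_gap(positions_a, positions_b)
    return 0.0 if gap is None else 1.0 / (1 + gap)
```

For instance, `proximity_sim([3, 40], [10, 41, 90])` finds the pair at positions 40 and 41 and scores it above any document where the two terms sit far apart.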

SLIDE 14

Ongoing Work

Inverted Index Structure Design

Implementing join methods and iterators

Inverted Indexer

Similarity functions and feature extraction

SLIDE 15

Related Work

Commercial Systems:

Google
Endeca
Oracle Text DB/Multimedia DB
IBM Net Search Extender
Thunderstone Texis
YouTube

Research

CMU
Stanford [Su & Widom IDEAS05]

SLIDE 16

Conclusion

Problem:

Scalability issue on text/multimedia retrieval.

Idea:

Separating the problem from retrieval algorithms. Layered architecture.

Goal:

Providing a library for application/system building. Prototyping.

SLIDE 17

Acknowledgements

Thanks to Minglong Shao (CMU) and Zhu Liu (AT&T) for helpful discussions.

Thanks to Jaime Carbonell (CMU) for his continuous support and encouragement.