Distributed Multi-modal Similarity Retrieval David Novak Seminar of - - PowerPoint PPT Presentation

distributed multi modal similarity retrieval
SMART_READER_LITE
LIVE PREVIEW

Distributed Multi-modal Similarity Retrieval David Novak Seminar of - - PowerPoint PPT Presentation

Distributed Multi-modal Similarity Retrieval David Novak Seminar of DISA Lab, October 14, 2014 David Novak Multi-modal Similarity Retrieval DISA Seminar 1 / 17 Outline of the Talk Motivation 1 Similarity Search E ff ectiveness and E ffi


slide-1
SLIDE 1

Distributed Multi-modal Similarity Retrieval

David Novak Seminar of DISA Lab, October 14, 2014

David Novak Multi-modal Similarity Retrieval DISA Seminar 1 / 17

slide-2
SLIDE 2

Outline of the Talk

1

Motivation Similarity Search Effectiveness and Efficiency Multi-modal Search

2

Existing Solutions Similarity Indexing Distributed Key-value Stores

3

Big Data Similarity Retrieval Generic Architecture Specific System

4

Conclusions

David Novak Multi-modal Similarity Retrieval DISA Seminar 2 / 17

slide-3
SLIDE 3

Motivation Similarity Search

Motivation

The similarity is key to human cognition, learning, memory. . .

[cognitive psychology]

David Novak Multi-modal Similarity Retrieval DISA Seminar 3 / 17

slide-4
SLIDE 4

Motivation Similarity Search

Motivation

The similarity is key to human cognition, learning, memory. . .

[cognitive psychology]

everything we can see, hear, measure, observe is in digital form

David Novak Multi-modal Similarity Retrieval DISA Seminar 3 / 17

slide-5
SLIDE 5

Motivation Similarity Search

Motivation

The similarity is key to human cognition, learning, memory. . .

[cognitive psychology]

everything we can see, hear, measure, observe is in digital form Therefore, computers should be able to search data base on similarity

David Novak Multi-modal Similarity Retrieval DISA Seminar 3 / 17

slide-6
SLIDE 6

Motivation Similarity Search

Motivation

The similarity is key to human cognition, learning, memory. . .

[cognitive psychology]

everything we can see, hear, measure, observe is in digital form Therefore, computers should be able to search data base on similarity The similarity search problem has two aspects effectiveness: how to measure similarity of two “objects”

domain specific (photos, X-rays, MRT results, voice, music, EEG,. . . )

David Novak Multi-modal Similarity Retrieval DISA Seminar 3 / 17

slide-7
SLIDE 7

Motivation Similarity Search

Motivation

The similarity is key to human cognition, learning, memory. . .

[cognitive psychology]

everything we can see, hear, measure, observe is in digital form Therefore, computers should be able to search data base on similarity The similarity search problem has two aspects effectiveness: how to measure similarity of two “objects”

domain specific (photos, X-rays, MRT results, voice, music, EEG,. . . )

efficiency: how to realize similarity search fast

using a given similarity measure

  • n very large data collections

David Novak Multi-modal Similarity Retrieval DISA Seminar 3 / 17

slide-8
SLIDE 8

Motivation Effectiveness and Efficiency

Efficiency: Motivation Example

Type of data: general images (photos)

David Novak Multi-modal Similarity Retrieval DISA Seminar 4 / 17

slide-9
SLIDE 9

Motivation Effectiveness and Efficiency

Efficiency: Motivation Example

Type of data: general images (photos) every image has been processed by a deep neural network

to obtain a “semantic characterization” of the image (descriptor)

David Novak Multi-modal Similarity Retrieval DISA Seminar 4 / 17

slide-10
SLIDE 10

Motivation Effectiveness and Efficiency

Efficiency: Motivation Example

Type of data: general images (photos) every image has been processed by a deep neural network

to obtain a “semantic characterization” of the image (descriptor) compared by Euclidean distance, it measures visual similarity of images

David Novak Multi-modal Similarity Retrieval DISA Seminar 4 / 17

slide-11
SLIDE 11

Motivation Effectiveness and Efficiency

Efficiency: Motivation Example

Type of data: general images (photos) every image has been processed by a deep neural network

to obtain a “semantic characterization” of the image (descriptor) compared by Euclidean distance, it measures visual similarity of images

David Novak Multi-modal Similarity Retrieval DISA Seminar 4 / 17

slide-12
SLIDE 12

Motivation Effectiveness and Efficiency

Efficiency: Motivation Example

Type of data: general images (photos) every image has been processed by a deep neural network

to obtain a “semantic characterization” of the image (descriptor) compared by Euclidean distance, it measures visual similarity of images

David Novak Multi-modal Similarity Retrieval DISA Seminar 4 / 17

slide-13
SLIDE 13

Motivation Effectiveness and Efficiency

Efficiency: Motivation Example

Type of data: general images (photos) every image has been processed by a deep neural network

to obtain a “semantic characterization” of the image (descriptor) compared by Euclidean distance, it measures visual similarity of images

Efficiency problem: what if we had 100 million of images with such descriptors each descriptor is a 4096-dimensional float vector

David Novak Multi-modal Similarity Retrieval DISA Seminar 4 / 17

slide-14
SLIDE 14

Motivation Effectiveness and Efficiency

Efficiency: Motivation Example

Type of data: general images (photos) every image has been processed by a deep neural network

to obtain a “semantic characterization” of the image (descriptor) compared by Euclidean distance, it measures visual similarity of images

Efficiency problem: what if we had 100 million of images with such descriptors each descriptor is a 4096-dimensional float vector ⇒ over 1.5 TB of data to be organized for similarity search

answer similarity queries online

David Novak Multi-modal Similarity Retrieval DISA Seminar 4 / 17

slide-15
SLIDE 15

Motivation Multi-modal Search

Real Application: Multi-field Data

real-world application data objects would have many “fields”:

attribute fields (numbers, strings, dates, etc.) (several) descriptors for similarity search keywords/annotations for full-text search, etc.

David Novak Multi-modal Similarity Retrieval DISA Seminar 5 / 17

slide-16
SLIDE 16

Motivation Multi-modal Search

Real Application: Multi-field Data

real-world application data objects would have many “fields”:

attribute fields (numbers, strings, dates, etc.) (several) descriptors for similarity search keywords/annotations for full-text search, etc.

example: { "ID": "image_1", "author": "David Novak", "date": "20140327", "categories": [ "outdoor", "family" ], "DNN_visual_descriptor": [5.431, 0.0042, 0.0, 0.97,... ], "dominant_color": "0x9E, 0xC2, 0x13", "keywords": "summer, beach, ocean, sun, sand" }

David Novak Multi-modal Similarity Retrieval DISA Seminar 5 / 17

slide-17
SLIDE 17

Motivation Multi-modal Search

Objectives

Goal: generic, horizontally scalable system architecture that would allow standard attribute-based access keyword (full-text) search similarity search in “arbitrary” similarity space

David Novak Multi-modal Similarity Retrieval DISA Seminar 6 / 17

slide-18
SLIDE 18

Motivation Multi-modal Search

Objectives

Goal: generic, horizontally scalable system architecture that would allow standard attribute-based access keyword (full-text) search similarity search in “arbitrary” similarity space multi-modal search – combination of several search perspectives, e.g.

direct combination of similarity modalities similarity query with filtering by attribute(s) re-ranking of search result by different criteria

David Novak Multi-modal Similarity Retrieval DISA Seminar 6 / 17

slide-19
SLIDE 19

Motivation Multi-modal Search

Objectives

Goal: generic, horizontally scalable system architecture that would allow standard attribute-based access keyword (full-text) search similarity search in “arbitrary” similarity space multi-modal search – combination of several search perspectives, e.g.

direct combination of similarity modalities similarity query with filtering by attribute(s) re-ranking of search result by different criteria

. . . and do it all on a very large scale

voluminous data collections high query throughput

David Novak Multi-modal Similarity Retrieval DISA Seminar 6 / 17

slide-20
SLIDE 20

Existing Solutions Similarity Indexing

Distance-based Similarity Search

generic similarity search

applicable to many domains

data modeled as metric space (D, δ), where D is a domain of objects and δ is a total distance function δ : D × D − → R+

0 satisfying

postulates of identity, symmetry, and triangle inequality

David Novak Multi-modal Similarity Retrieval DISA Seminar 7 / 17

slide-21
SLIDE 21

Existing Solutions Similarity Indexing

Distance-based Similarity Search

generic similarity search

applicable to many domains

data modeled as metric space (D, δ), where D is a domain of objects and δ is a total distance function δ : D × D − → R+

0 satisfying

postulates of identity, symmetry, and triangle inequality query by example: K-NN(q) returns K objects x from the dataset X ⊆ D with the smallest δ(q, x)

2 3 1

q

David Novak Multi-modal Similarity Retrieval DISA Seminar 7 / 17

slide-22
SLIDE 22

Existing Solutions Similarity Indexing

Similarity Indexing Techniques

Metric-based similarity indexing: two decades of research memory structures for precise K-NN search

David Novak Multi-modal Similarity Retrieval DISA Seminar 8 / 17

slide-23
SLIDE 23

Existing Solutions Similarity Indexing

Similarity Indexing Techniques

Metric-based similarity indexing: two decades of research memory structures for precise K-NN search efficient disk-oriented techniques

precise and approximate (not all objects from K-NN answer returned)

David Novak Multi-modal Similarity Retrieval DISA Seminar 8 / 17

slide-24
SLIDE 24

Existing Solutions Similarity Indexing

Similarity Indexing Techniques

Metric-based similarity indexing: two decades of research memory structures for precise K-NN search efficient disk-oriented techniques

precise and approximate (not all objects from K-NN answer returned)

  • bjects are partitioned and organized on disk by the similarity metric

disk storage

David Novak Multi-modal Similarity Retrieval DISA Seminar 8 / 17

slide-25
SLIDE 25

Existing Solutions Similarity Indexing

Similarity Indexing Techniques

Metric-based similarity indexing: two decades of research memory structures for precise K-NN search efficient disk-oriented techniques

precise and approximate (not all objects from K-NN answer returned)

  • bjects are partitioned and organized on disk by the similarity metric

given query q, the “most-promising” partitions form the candidate set

disk storage

David Novak Multi-modal Similarity Retrieval DISA Seminar 8 / 17

slide-26
SLIDE 26

Existing Solutions Similarity Indexing

Similarity Indexing Techniques

Metric-based similarity indexing: two decades of research memory structures for precise K-NN search efficient disk-oriented techniques

precise and approximate (not all objects from K-NN answer returned)

  • bjects are partitioned and organized on disk by the similarity metric

given query q, the “most-promising” partitions form the candidate set the candidate set SC is refined by calculating δ(q, x), ∀x ∈ SC

disk storage

David Novak Multi-modal Similarity Retrieval DISA Seminar 8 / 17

slide-27
SLIDE 27

Existing Solutions Similarity Indexing

Similarity Indexing Techniques: Metadata Organization

Recently, there were proposed a few indexes of different type

David Novak Multi-modal Similarity Retrieval DISA Seminar 9 / 17

slide-28
SLIDE 28

Existing Solutions Similarity Indexing

Similarity Indexing Techniques: Metadata Organization

Recently, there were proposed a few indexes of different type memory index that organizes only metadata

Novak, D., & Zezula, P. (2014). Rank Aggregation of Candidate Sets for Efficient Similarity Search. In DEXA 2014, Springer.

David Novak Multi-modal Similarity Retrieval DISA Seminar 9 / 17

slide-29
SLIDE 29

Existing Solutions Similarity Indexing

Similarity Indexing Techniques: Metadata Organization

Recently, there were proposed a few indexes of different type memory index that organizes only metadata

Novak, D., & Zezula, P. (2014). Rank Aggregation of Candidate Sets for Efficient Similarity Search. In DEXA 2014, Springer. similarity index IX

candidate set CX⊆ X

k-NN(q)

refined answer

data storage 1

calculate δ(q,c), c ∊ CX

2 3 David Novak Multi-modal Similarity Retrieval DISA Seminar 9 / 17

slide-30
SLIDE 30

Existing Solutions Similarity Indexing

Distributed Similarity Indexes

Distributed Data Structures for metric-based similarity search data partitioned to nodes according to the metric at query time, query-relevant partitions (nodes) accessed

David Novak Multi-modal Similarity Retrieval DISA Seminar 10 / 17

slide-31
SLIDE 31

Existing Solutions Similarity Indexing

Distributed Similarity Indexes

Distributed Data Structures for metric-based similarity search data partitioned to nodes according to the metric at query time, query-relevant partitions (nodes) accessed

GHT?, VPT?, MCAN, M-Chord

David Novak Multi-modal Similarity Retrieval DISA Seminar 10 / 17

slide-32
SLIDE 32

Existing Solutions Similarity Indexing

Distributed Similarity Indexes

Distributed Data Structures for metric-based similarity search data partitioned to nodes according to the metric at query time, query-relevant partitions (nodes) accessed

GHT?, VPT?, MCAN, M-Chord

C

2

p

1

p

2

2*c 3*c c C C

1

p (mchord)

K2 N

2

K

1

N

1

K3 N3 N

4

K N K4

+2 +16 +8 +4

(a) (b)

+1

0 = 32 0 = 32 David Novak Multi-modal Similarity Retrieval DISA Seminar 10 / 17

slide-33
SLIDE 33

Existing Solutions Distributed Key-value Stores

Current Distributed Stores

Currently, many efficient distributed key-value or document stores emerged distributed hash tables

  • bjects organized by IDs (ID-object map)

quick access to “documents” by IDs

secondary indexes on attributes

David Novak Multi-modal Similarity Retrieval DISA Seminar 11 / 17

slide-34
SLIDE 34

Existing Solutions Distributed Key-value Stores

Current Distributed Stores

Currently, many efficient distributed key-value or document stores emerged distributed hash tables

  • bjects organized by IDs (ID-object map)

quick access to “documents” by IDs

secondary indexes on attributes

David Novak Multi-modal Similarity Retrieval DISA Seminar 11 / 17

slide-35
SLIDE 35

Big Data Similarity Retrieval Generic Architecture

Generic Architecture

key-value store (ID-object) on the whole dataset X

worker worker worker worker worker

David Novak Multi-modal Similarity Retrieval DISA Seminar 12 / 17

slide-36
SLIDE 36

Big Data Similarity Retrieval Generic Architecture

Generic Architecture

key-value store (ID-object) on the whole dataset X

worker worker

similarity index IXi

field

worker worker worker

David Novak Multi-modal Similarity Retrieval DISA Seminar 12 / 17

slide-37
SLIDE 37

Big Data Similarity Retrieval Generic Architecture

Generic Architecture

key-value store (ID-object) on the whole dataset X

worker worker

similarity index IXi

field

worker worker

k-NN(q.field)

candidate set CXi⊆ Xi 1 worker refinement on part of CXi 2 merge partial answers 3 refinement on part of CXi 2

David Novak Multi-modal Similarity Retrieval DISA Seminar 12 / 17

slide-38
SLIDE 38

Big Data Similarity Retrieval Generic Architecture

Generic Architecture

key-value store (ID-object) on the whole dataset X

worker worker

similarity index IXi

field

inverted file index IXi

field2

attribute index IXi

field3

similarity index IXj

field

worker worker

k-NN(q.field)

candidate set CXi⊆ Xi 1 worker refinement on part of CXi 2 merge partial answers 3 refinement on part of CXi 2

David Novak Multi-modal Similarity Retrieval DISA Seminar 12 / 17

slide-39
SLIDE 39

Big Data Similarity Retrieval Generic Architecture

System Features

Types of queries ID-object query (often useful to initiate k-NN(q) query) attribute-based queries (secondary indexes) key-word (full-text) queries (Lucene-like index) similarity queries (via similarity indexes)

David Novak Multi-modal Similarity Retrieval DISA Seminar 13 / 17

slide-40
SLIDE 40

Big Data Similarity Retrieval Generic Architecture

System Features

Types of queries ID-object query (often useful to initiate k-NN(q) query) attribute-based queries (secondary indexes) key-word (full-text) queries (Lucene-like index) similarity queries (via similarity indexes) combined similarity queries (late fusion) K-NN query with attribute filtering distributed re-ranking query answer

David Novak Multi-modal Similarity Retrieval DISA Seminar 13 / 17

slide-41
SLIDE 41

Big Data Similarity Retrieval Generic Architecture

System Features

Types of queries ID-object query (often useful to initiate k-NN(q) query) attribute-based queries (secondary indexes) key-word (full-text) queries (Lucene-like index) similarity queries (via similarity indexes) combined similarity queries (late fusion) K-NN query with attribute filtering distributed re-ranking query answer efficient management of multiple data collections X = X1 ∪ X2 ∪ · · · ∪ Xs core key-value store is well horizontally scalable

David Novak Multi-modal Similarity Retrieval DISA Seminar 13 / 17

slide-42
SLIDE 42

Big Data Similarity Retrieval Specific System

Specific System: Large-scale Image Management

100M objects from the CoPhIR dataset (benchmark): { "ID": "002561195", "title": "My wife & daughter on Gold Coast beach", "tags": "summer, beach, ocean, sun, sand, Australia", "mpeg7_scalable_color": "25 36 0 127 69...", "mpeg7_color_layout": "25 41 53 20; 32; -16...", "mpeg7_color_structure": "25 41 53 20; 32;...", "mpeg7_edge_histogram": "5 1 2 3 7 7 3 6...", "mpeg7_homogeneous_texture": "232 201 198 180 201...", "GPS_coordinates": "45.50382, -73.59921", "flickr_user": "david_novak" }

David Novak Multi-modal Similarity Retrieval DISA Seminar 14 / 17

slide-43
SLIDE 43

Big Data Similarity Retrieval Specific System

System Schema

Infinispan (Ispn) ID-object store Lucene I tags,title B+-tree I flickr_user PPP-Codes index I GPS PPP-Codes index I mpeg7

index

worker

Ispn node

worker

Ispn node

index

worker

Ispn node

worker

Ispn node

index

worker

Ispn node

index

David Novak Multi-modal Similarity Retrieval DISA Seminar 15 / 17

slide-44
SLIDE 44

Big Data Similarity Retrieval Specific System

Specific System: Demo

20M objects of this type: { "ID": "002561195", "title": "My wife & daughter on Gold Coast beach", "keywords": "summer, beach, ocean, sun, sand, Australia", "DNN_visual_descriptor": [5.431, 0.0042, 0.0, 0.97,... ] }

demo David Novak Multi-modal Similarity Retrieval DISA Seminar 16 / 17

slide-45
SLIDE 45

Conclusions

Conclusions

We have proposed and alfa-tested system architecture that provides large-scale similarity search ...on a broad family of data + similarity measures is distributed and horizontally scalable

David Novak Multi-modal Similarity Retrieval DISA Seminar 17 / 17

slide-46
SLIDE 46

Conclusions

Conclusions

We have proposed and alfa-tested system architecture that provides large-scale similarity search ...on a broad family of data + similarity measures is distributed and horizontally scalable can manage multi-field data:

attribute, keywords, several similarity modalities many variants of multi-modal search queries

David Novak Multi-modal Similarity Retrieval DISA Seminar 17 / 17

slide-47
SLIDE 47

Conclusions

Conclusions

We have proposed and alfa-tested system architecture that provides large-scale similarity search ...on a broad family of data + similarity measures is distributed and horizontally scalable can manage multi-field data:

attribute, keywords, several similarity modalities many variants of multi-modal search queries

Challenges: full implementation and thorough testing the similarity index can be bottleneck = ⇒ distribute it

David Novak Multi-modal Similarity Retrieval DISA Seminar 17 / 17