

SLIDE 1

Composite Correlation Quantization for Efficient Multimodal Retrieval

Mingsheng Long¹, Yue Cao¹, Jianmin Wang¹, and Philip S. Yu¹,²

¹School of Software, Tsinghua University

²Department of Computer Science, University of Illinois at Chicago

ACM Conference on Research and Development in Information Retrieval (SIGIR 2016)

SLIDE 2

Outline

1. Introduction: Problem; Effectiveness and Efficiency; Previous Work
2. Composite Correlation Quantization: Multimodal Correlation; Composite Quantization; Optimization Framework
3. Evaluation: Results; Discussion
4. Summary

SLIDE 3

Introduction / Problem

Multimodal Understanding

How to utilize multimodal data to understand our real world?

Isomorphic space: integration, fusion, correlation, transfer, ...

SLIDE 4

Introduction / Problem

Multimodal Retrieval

Nearest Neighbor (NN) similarity retrieval across modalities

Database: $\mathcal{X}^{img} = \{x^{img}_1, \ldots, x^{img}_N\}$ and query: $q^{txt}$

Cross-modal NN: $NN(q^{txt}) = \arg\min_{x^{img} \in \mathcal{X}^{img}} d(x^{img}, q^{txt})$

[Figure omitted: (a) I → T (image query on the text DB): top 16 returned tags for a query image tagged ['lake'], precision 0.625; (b) T → I (text query on the image DB): top 16 returned images for the tag query ['sky sun'], precision 0.625.]

Figure: Cross-modal retrieval: similarity retrieval across media modalities.
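Concretely, the exact cross-modal NN above is just a linear scan in a shared space. A minimal sketch of that O(NP) baseline (the shared 4-dim embedding and the `cross_modal_nn` helper are illustrative assumptions, not the paper's code):

```python
import numpy as np

def cross_modal_nn(q_txt, X_img, d):
    """Exact cross-modal NN: scan the image database for the item
    closest to the text query under distance d -- the O(NP) baseline
    that hashing and quantization later accelerate."""
    return int(np.argmin([d(x, q_txt) for x in X_img]))

# Toy usage: both modalities assumed already embedded in a shared space.
rng = np.random.default_rng(0)
X_img = rng.normal(size=(1000, 4))   # N = 1000 database vectors
q_txt = rng.normal(size=4)           # one text query
print(cross_modal_nn(q_txt, X_img, lambda x, q: np.linalg.norm(x - q)))
```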

SLIDE 5

Introduction / Effectiveness and Efficiency

Multimodal Embedding

Multimodal embedding reduces the cross-modal heterogeneity gap

Coupling: $\min \sum_{i=1}^{N} d(z^{img}_i, z^{txt}_i)$ → more flexible

Fusion: $z_i = f(z^{img}_i, z^{txt}_i)$ → tighter relationship

[Diagram omitted: an image and its caption ("A tabby cat is leaning on a wooden table, with one paw on a laser mouse and the other on a black laptop") pass through the image and text mappings into a multimodal embedding, combined either by fusion or by coupling.]
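A small sketch contrasting the two options, with averaging standing in for the fusion function f (both helpers are illustrative stand-ins, not the paper's mappings):

```python
import numpy as np

def coupling_loss(Z_img, Z_txt):
    """Coupling: keep per-modality embeddings and penalize the distance
    between paired rows, sum_i d(z_i^img, z_i^txt)."""
    return float(np.sum(np.linalg.norm(Z_img - Z_txt, axis=1)))

def fuse(Z_img, Z_txt):
    """Fusion: merge paired embeddings into a single z_i = f(z_i^img, z_i^txt);
    averaging is purely a stand-in for the fusion function f."""
    return 0.5 * (Z_img + Z_txt)

rng = np.random.default_rng(0)
Z_img, Z_txt = rng.normal(size=(5, 8)), rng.normal(size=(5, 8))
print(coupling_loss(Z_img, Z_txt), fuse(Z_img, Z_txt).shape)
```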

SLIDE 6

Introduction / Effectiveness and Efficiency

Indexing and Hashing

Approximate Nearest Neighbor (ANN) Search

Exact nearest neighbor search: linear scan, O(NP)
ANN: efficient, acceptable accuracy, practical solutions

Reduce the number of distance computations: O(N′P), N′ ≪ N
  Indexing: tree, neighborhood graph, inverted index, ...
Reduce the cost of each distance computation: O(NP′), P′ ≪ P
  Hashing: Locality-Sensitive Hashing, Spectral Hashing, ...
    Produces only a few distinct distances (curse of dimensionality); limited ability and flexibility of distance approximation (see the quick check below)
  Quantization: Vector Quantization (VQ), Iterative Quantization (ITQ), Product Quantization (PQ), Composite Quantization (CQ)
    Plain K-means: infeasible for medium and long codes (K = 2^H grows exponentially)
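The "few distinct distances" limitation of hashing is easy to verify numerically: H-bit Hamming distances can take at most H + 1 values, while real-valued (quantized) distances stay essentially continuous. A quick illustrative check (mine, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
H, N = 32, 10_000
codes = rng.integers(0, 2, size=(N, H))     # random H-bit hash codes
q_code = rng.integers(0, 2, size=H)
hamming = np.sum(codes != q_code, axis=1)
print(np.unique(hamming).size, "distinct Hamming values (at most H + 1 =", H + 1, ")")

vecs, q = rng.normal(size=(N, H)), rng.normal(size=H)
euclid = np.linalg.norm(vecs - q, axis=1)
print(np.unique(euclid).size, "distinct real-valued distances")
```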

SLIDE 7

Introduction / Previous Work

Multimodal Hashing

[Diagram omitted: a two-stage pipeline compressing an image database from 512-dim floats (20 GB) to 128-bit codes (160 MB).]

Previous work uses a separate two-stage pipeline for multimodal embedding and binary encoding → large information loss and unbalanced encoding.

SLIDE 8

Composite Correlation Quantization

Problem Definition

Definition (Composite Correlation Quantization, CCQ). Given an image set $\{x^1_n\}_{n=1}^{N_1} \subset \mathbb{R}^{P_1}$ and a text set $\{x^2_n\}_{n=1}^{N_2} \subset \mathbb{R}^{P_2}$, learn two correlation mappings $f^1 : \mathbb{R}^{P_1} \to \mathbb{R}^{D}$ and $f^2 : \mathbb{R}^{P_2} \to \mathbb{R}^{D}$ that transform images and texts into a D-dimensional isomorphic latent space, and jointly learn two composite quantizers $q^1 : \mathbb{R}^{D} \to \{0,1\}^{H}$ and $q^2 : \mathbb{R}^{D} \to \{0,1\}^{H}$ that quantize the latent embeddings into compact H-bit binary codes.

SLIDE 9

Composite Correlation Quantization

Overview

A Latent Semantic Analysis (LSA) optimization framework:
$x^v_n \approx R^v C^v b^v_n$, where $R^v$ is a correlation-maximal mapping, $C^v$ is a similarity-preserving codebook, and $b^v_n$ is a compact binary code

Multimodal Embedding: correlation mapping & code fusion
Composite Quantization: isomorphic space (shared codebook)

A "simple and reliable" approach to efficient multimodal retrieval

[Diagram omitted: an image and its caption ("A tabby cat is leaning on a wooden table, with one paw on a laser mouse and the other on a black laptop") are embedded by the image and text mappings into a shared isomorphic codebook, from which composite quantization produces the hash codes.]

SLIDE 10

Composite Correlation Quantization / Multimodal Correlation

Multimodal Correlation

Paired data matrices: $X^1 = [x^1_1, \ldots, x^1_N]$, $X^2 = [x^2_1, \ldots, x^2_N]$
Fused representation matrix: $Z = [z_1, \ldots, z_N]$
Transformation matrices $R^1, R^2$, which transform each $X^v$ into $Z$:

$$\min_{R^1, R^2, Z}\; \lambda_1 \big\| R^{1T} X^1 - Z \big\|_F^2 + \lambda_2 \big\| R^{2T} X^2 - Z \big\|_F^2 \quad (1)$$

[Diagram omitted: $X^1$ and $X^2$ are transformed by $R^1$ and $R^2$ into the fused representation $Z$.]

SLIDE 11

Composite Correlation Quantization / Multimodal Correlation

Multimodal Correlation

This problem is ill-posed and cannot be solved successfully as stated: the objective is driven to zero by trivial solutions such as $Z = 0$, $R^v = 0$.

$$\min_{R^1, R^2, Z}\; \lambda_1 \big\| R^{1T} X^1 - Z \big\|_F^2 + \lambda_2 \big\| R^{2T} X^2 - Z \big\|_F^2 \quad (2)$$

The alternating closed-form updates are

$$Z = \frac{\lambda_1 R^{1T} X^1 + \lambda_2 R^{2T} X^2}{\lambda_1 + \lambda_2} \quad (3)$$

$$R^1 = \big( X^1 X^{1T} \big)^{-1} X^1 Z^T, \qquad R^2 = \big( X^2 X^{2T} \big)^{-1} X^2 Z^T \quad (4)$$

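For concreteness, a sketch of alternating the closed-form updates (3)-(4) (shapes and the loop are illustrative assumptions, not the paper's code). Nothing in this loop prevents the trivial optimum, which is exactly the ill-posedness the constrained objective on the next slide repairs:

```python
import numpy as np

def alternating_updates(X1, X2, D, lam1=1.0, lam2=1.0, iters=20):
    """Alternate Eq. (3) for Z with Eq. (4) for R^1, R^2.
    X1: P1 x N and X2: P2 x N hold paired columns (N >= P assumed so the
    Gram matrices are invertible)."""
    rng = np.random.default_rng(0)
    R1 = rng.normal(size=(X1.shape[0], D))
    R2 = rng.normal(size=(X2.shape[0], D))
    for _ in range(iters):
        # Eq. (3): Z is the lambda-weighted average of the two projections.
        Z = (lam1 * R1.T @ X1 + lam2 * R2.T @ X2) / (lam1 + lam2)
        # Eq. (4): ordinary least-squares solutions for the mappings.
        R1 = np.linalg.solve(X1 @ X1.T, X1 @ Z.T)
        R2 = np.linalg.solve(X2 @ X2.T, X2 @ Z.T)
    return R1, R2, Z
```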

SLIDE 12

Composite Correlation Quantization / Multimodal Correlation

Multimodal Correlation

Add covariance maximization with orthogonal constraints:

$$\min_{R^1, R^2, Z}\; \lambda_1 \Big( \big\| R^{1T} X^1 - Z \big\|_F^2 + \big\| R^{1T}_{\perp} X^1 \big\|_F^2 \Big) + \lambda_2 \Big( \big\| R^{2T} X^2 - Z \big\|_F^2 + \big\| R^{2T}_{\perp} X^2 \big\|_F^2 \Big) \quad (5)$$

Since $[R^v, R^v_{\perp}]$ is orthogonal, the two residual terms per modality combine, giving the equivalent problem

$$\min_{R^1, R^2, Z}\; \lambda_1 \big\| X^1 - R^1 Z \big\|_F^2 + \lambda_2 \big\| X^2 - R^2 Z \big\|_F^2 \quad (6)$$


SLIDE 13

Composite Correlation Quantization / Composite Quantization

Composite Quantization

Learn M codebooks $C = [C_1, \ldots, C_M]$, each with K codewords $C_m = [c_{m1}, \ldots, c_{mK}]$ (cluster centroids of K-means)
Each $z_i$ is approximated by the addition of M codewords, one per codebook, each selected by the binary assignment $b_{mi}$
Code representation: $i_1 i_2 \ldots i_M$, where $i_m = \mathrm{nz}(b_{mi})$
Code length: $M \log_2 K$ (1-of-K encoding)

$$z \approx \hat{z} = C_1 b_1 + C_2 b_2 + \ldots + C_M b_M = c_{1 i_1} + c_{2 i_2} + \ldots + c_{M i_M} \quad (7)$$
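A minimal sketch of the code representation in Eq. (7): M = 4 codebooks with K = 256 codewords each give H = M log2 K = 32-bit codes. The greedy residual encoder below is an illustrative stand-in for the joint assignment optimization, not the paper's algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)
M, K, D = 4, 256, 64            # code length H = M * log2(K) = 32 bits
C = rng.normal(size=(M, K, D))  # M codebooks of K codewords each

def decode(code):
    """Eq. (7): z_hat is the sum of one codeword per codebook."""
    return sum(C[m, i] for m, i in enumerate(code))

def encode(z):
    """Greedy residual assignment, a stand-in for jointly optimizing b_mi."""
    residual, code = z.copy(), []
    for m in range(M):
        i = int(np.argmin(((C[m] - residual) ** 2).sum(axis=1)))
        code.append(i)
        residual = residual - C[m, i]
    return tuple(code)

z = rng.normal(size=D)
code = encode(z)                # e.g. (i1, i2, i3, i4), 8 bits each
print(code, np.linalg.norm(z - decode(code)))   # code + quantization error
```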

SLIDE 14

Composite Correlation Quantization / Composite Quantization

Composite Quantization

Learn M codebooks $C = [C_1, \ldots, C_M]$, each with K codewords $C_m = [c_{m1}, \ldots, c_{mK}]$ (cluster centroids of K-means)
Binary code matrices: $B = [B_1; \ldots; B_M]$, $B_m = [b_{m1}; \ldots; b_{mN}]$
Control binary code quality by quantization error minimization:

$$\min_{Z, C, B}\; \Big\| Z - \sum_{m=1}^{M} C_m B_m \Big\|_F^2 = \sum_{i=1}^{N} \Big\| z_i - \sum_{m=1}^{M} C_m b_{mi} \Big\|_2^2 \quad (8)$$

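One common way to reduce the quantization error of Eq. (8) with the codebooks held fixed is an ICM-style sweep over codebooks. The sketch below is an assumption for illustration, not the paper's published update; integer index arrays stand in for the 1-of-K indicators $b_{mn}$:

```python
import numpy as np

def icm_sweep(Z, C, B):
    """One ICM-style sweep for Eq. (8): with codebooks C held fixed,
    re-pick, for each codebook m, the codeword minimizing each point's
    residual. Z: N x D embeddings; C: M x K x D codebooks; B: M x N
    codeword indices."""
    M = C.shape[0]
    Z_hat = sum(C[m][B[m]] for m in range(M))      # N x D reconstruction
    for m in range(M):
        target = Z - (Z_hat - C[m][B[m]])          # leave codebook m out
        # squared distance from each leave-one-out target to every codeword
        d = ((target[:, None, :] - C[m][None, :, :]) ** 2).sum(axis=-1)
        Z_hat -= C[m][B[m]]                        # drop old codewords
        B[m] = np.argmin(d, axis=1)                # N new assignments
        Z_hat += C[m][B[m]]                        # add new codewords
    return B

# Each per-codebook update can only lower the error, so a sweep is monotone.
rng = np.random.default_rng(0)
M, K, D, N = 4, 16, 8, 100
Z, C = rng.normal(size=(N, D)), rng.normal(size=(M, K, D))
B = rng.integers(0, K, size=(M, N))
err = lambda: np.sum((Z - sum(C[m][B[m]] for m in range(M))) ** 2)
before = err(); B = icm_sweep(Z, C, B); print(before, ">=", err())
```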

SLIDE 15

Composite Correlation Quantization / Optimization Framework

Composite Correlation Quantization

Pro 1: Joint optimization of correlation, covariance & quantization
Pro 2: Semi-paired data quantization through the δ function
Pro 3: Shared codebook & coding enable multimodal retrieval
Pro 4: Easy configuration: $H = M \log_2 K$, $D = \min(\{P_v\}_{v=1}^{V}, H)$

$$\min_{R^v, C, B^v}\; \sum_{v=1}^{V} \sum_{n=1}^{N_v} \lambda_v \Big\| x^v_n - R^v \sum_{m=1}^{M} C_m \delta(b^v_{mn}) \Big\|_2^2$$
$$\text{s.t.}\quad R^{vT} R^v = I_{D \times D},\; R^v \in \mathbb{R}^{P_v \times D},\; \|\delta(b^v_{mn})\|_0 = 1,\; \delta(b^v_{mn}) \in \{0,1\}^K$$
$$\delta(b^v_{mn}) = \begin{cases} b_{mn}, & n = 1 \ldots N_0 \\ b^v_{mn}, & \text{otherwise} \end{cases} \qquad v = 1 \ldots V,\; m = 1 \ldots M,\; n = 1 \ldots N_v \quad (9)$$
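A sketch of evaluating the objective in Eq. (9), with δ realized as masking: the first N0 (paired) points use a shared code, the rest keep modality-specific codes. Shapes, helper names, and the index-array convention are assumptions for illustration:

```python
import numpy as np

def ccq_objective(X, R, C, B, B_shared, N0, lam):
    """Eq. (9) sketch: sum_v lam_v sum_n ||x_n^v - R^v sum_m C_m delta(b_mn^v)||^2.
    delta: the first N0 (paired) points of every modality use the shared
    code B_shared; the remaining points keep the modality-specific code B[v].
    X[v]: N_v x P_v data; R[v]: P_v x D mapping; C: M x K x D codebooks;
    codes are M x N_v integer index arrays (stand-ins for 1-of-K indicators)."""
    M = C.shape[0]
    total = 0.0
    for v in range(len(X)):
        codes = B[v].copy()
        codes[:, :N0] = B_shared[:, :N0]               # delta: share paired codes
        Z_hat = sum(C[m][codes[m]] for m in range(M))  # N_v x D reconstructions
        total += lam[v] * np.sum((X[v] - Z_hat @ R[v].T) ** 2)
    return total
```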

SLIDE 16

Composite Correlation Quantization / Optimization Framework

Approximate Distance Computation

Asymmetric Quantizer Distance: $\|\bar{q}^v - x^v_n\|_2^2 \approx \mathrm{AQD}(\bar{q}^v, x^v_n)$

$$\mathrm{AQD}(\bar{q}^v, x^v_n) = \Big\| \bar{q}^v - \bar{R}^v \sum_{m=1}^{M} C_m b^v_{mn} \Big\|_2^2 = -2 \sum_{m=1}^{M} \big\langle \tilde{q}^v, C_m b^v_{mn} \big\rangle + \Big\| \sum_{m=1}^{M} C_m b^v_{mn} \Big\|_2^2 + \big\| \tilde{q}^v \big\|_2^2 + \big\| \bar{R}^{vT}_{\perp} \bar{q}^v \big\|_2^2 \quad (10)$$

(here $\tilde{q}^v$ denotes the query transformed into the latent space; the last two terms are query-only constants)

Query-specific distance lookup table: store the distances from all M × K codebook elements in $C = [C_1, \ldots, C_M]$ to the query $\bar{q}^v$

O(M) additions for the first term; O(M²) or O(1) additions for the second term

Alternative: cosine distance, $\cos(\bar{q}^v, x^v_n) = \sum_{m=1}^{M} \langle \tilde{q}^v, C_m b^v_{mn} \rangle$
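A sketch of the lookup-table idea: the table of query-codeword inner products is built once per query (M × K entries), after which each database item costs O(M) additions; query-only constants are dropped since they do not change the ranking. The last line checks the identity against a brute-force distance (illustrative code, not the paper's implementation):

```python
import numpy as np

def build_lookup(q_tilde, C):
    """Query-specific table: inner products between the transformed query
    and all M*K codewords, computed once per query."""
    return np.einsum('d,mkd->mk', q_tilde, C)          # shape M x K

def aqd(table, code, sq_norm):
    """Per database item: O(M) table additions for the cross term plus the
    precomputed squared norm ||sum_m C_m b_mn||^2 (an O(1) lookup)."""
    return -2.0 * sum(table[m, i] for m, i in enumerate(code)) + sq_norm

rng = np.random.default_rng(0)
M, K, D = 4, 256, 64
C = rng.normal(size=(M, K, D))
code = tuple(rng.integers(0, K, size=M))
x_hat = sum(C[m, i] for m, i in enumerate(code))       # reconstructed item
q = rng.normal(size=D)                                 # query in latent space
table = build_lookup(q, C)
# Both values agree: ||q - x_hat||^2 - ||q||^2 = -2<q, x_hat> + ||x_hat||^2.
print(aqd(table, code, x_hat @ x_hat), np.sum((q - x_hat) ** 2) - q @ q)
```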

SLIDE 17

Composite Correlation Quantization / Optimization Framework

Approximation Error Analysis

Theorem (Approximation Error Bound). The error of approximating the Euclidean distance with AQD is bounded by

$$\big| d(\bar{q}^v, x^v_n) - d(\bar{q}^v, \hat{x}^v_n) \big| \le \Big\| x^v_n - R^v \sum_{m=1}^{M} C_m b^v_{mn} \Big\|_2. \quad (11)$$

By the triangle inequality, $| d(\bar{q}^v, x^v_n) - d(\bar{q}^v, \hat{x}^v_n) | \le d(x^v_n, \hat{x}^v_n)$. Then

$$d^2(x^v_n, \hat{x}^v_n) = \Big\| R^{vT} x^v_n - \sum_{m=1}^{M} C_m b^v_{mn} \Big\|_2^2 \le \Big\| R^{vT} x^v_n - \sum_{m=1}^{M} C_m b^v_{mn} \Big\|_2^2 + \big\| R^{vT}_{\perp} x^v_n \big\|_2^2 = \Big\| x^v_n - R^v \sum_{m=1}^{M} C_m b^v_{mn} \Big\|_2^2. \quad (12)$$

Quantize by maximizing cross-modal correlation & within-modal covariance.

SLIDE 18

Evaluation

Experiment Setup

Datasets: NUS-WIDE, Wiki, and Flickr1M
Tasks: I → I, T → T, I → T, T → I, I → IT, and T → IT
Methods:
  Unsupervised hashing: CVH, IMH
  Deep hashing: CorrAE + Sign
  Supervised hashing: CMSSH, SCM, QCH
Metrics: MAP@R, precision-recall, precision@R, efficiency

Table: Statistics of the Three Multimodal Benchmark Datasets

Dataset        NUS-WIDE   Wiki    Flickr1M
Complete Set   195,834    2,866   1,000,000
Query Set      2,000      693     1,000
Database       193,834    2,173   24,000
Training Set   10,000     2,173   975,000

SLIDE 19

Evaluation

Search Pipeline

1. Transform the query q of the other modality into the common space.
2. Build the distance lookup table between the transformed query and the codebook elements of the multiple codebooks C.
3. For each of the N database vectors, compute the distance between q and x from the stored code of x via table lookups.
4. Output the nearest vectors.

Indexing for candidate pruning is left out for future work. A sketch of this flow follows.
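Putting the pipeline together under the same illustrative conventions as before (index-based codes, precomputed database norms; this is an assumed sketch, not the paper's implementation):

```python
import numpy as np

def search(q, R, C, db_codes, db_sq_norms, topk=10):
    """Per-query pipeline sketch: (1) transform the query of the other
    modality into the common space, (2) build the M x K query-codebook
    lookup table, (3) score all N database codes with O(M) additions each,
    (4) return the nearest items. db_codes: N x M codeword indices;
    db_sq_norms: precomputed ||sum_m C_m b_mn||^2 per item."""
    q_tilde = R.T @ q                             # into the common space
    table = np.einsum('d,mkd->mk', q_tilde, C)    # query-specific table
    cross = sum(table[m, db_codes[:, m]] for m in range(C.shape[0]))
    scores = -2.0 * cross + db_sq_norms           # AQD up to query constants
    return np.argsort(scores)[:topk]
```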

SLIDE 20

Evaluation / Results

MAP Results

CCQ significantly outperforms the unsupervised hashing methods (CVH, IMH) and the deep hashing method (CorrAE), and generally outperforms the supervised hashing methods (CMSSH, SCM, QCH).

Table: MAP Comparison of Multimodal Retrieval on Standard Datasets

Task     Method              NUS-WIDE   Wiki     Flickr1M
I → T    CorrAE (deep)       0.4699     0.2033   0.6357
         QCH (supervised)    0.5050     0.2368   0.6685
         CCQ (ours)          0.5165     0.2371   0.7183
I → IT   CCQ (ours)          0.5414     0.2529   0.6989
T → I    CorrAE (deep)       0.4634     0.3478   0.6247
         QCH (supervised)    0.5389     0.4411   0.6485
         CCQ (ours)          0.5413     0.4222   0.7165
T → IT   CCQ (ours)          0.7131     0.6394   0.7190

SLIDE 21

Evaluation / Results

NUS-WIDE

Asymmetric difficulty: T → T ≤ T → I ≤ I → T ≤ I → I; when the image modality is of high quality, unsupervised hashing performs well.

[Plots omitted: precision-recall curves of CMSSH, CVH, IMH, CorrAE, SCM, QCH, and CCQ (ours). (a) I → T @ 32 bits; (b) T → I @ 32 bits.]

Figure: Precision-recall curves on NUS-WIDE cross-modal tasks @ 32 bits.

SLIDE 22

Evaluation / Results

Wiki

The low quality of the image modality leads to low cross-modal retrieval performance, a regime that favors supervised hashing methods.

[Plots omitted: precision-recall curves of CMSSH, CVH, IMH, CorrAE, SCM, QCH, and CCQ (ours). (a) I → T @ 32 bits; (b) T → I @ 32 bits.]

Figure: Precision-recall curves on Wiki cross-modal tasks @ 32 bits.

SLIDE 23

Evaluation / Results

Flickr1M

With big data, there is strong motivation to learn accurate models from large-scale datasets (large model capacity).

[Plots omitted: precision-recall curves of CMSSH, CVH, IMH, CorrAE, SCM, QCH, and CCQ (ours). (a) I → T @ 32 bits; (b) T → I @ 32 bits.]

Figure: Precision-recall curves on Flickr1M cross-modal tasks @ 32 bits.

SLIDE 24

Evaluation / Discussion

Semi-Paired Data Quantization

Training with semi-paired data helps when paired data is limited; semi-supervised learning is helpful for partial-modal big data.

[Plots omitted: MAP of I → I, T → T, I → T, and T → I vs. paired data size (0.5-8 × 10³). (a) NUS-WIDE; (b) Flickr1M.]

Figure: MAP of CCQ by varying the numbers of paired points for training.

SLIDE 25

Evaluation / Discussion

Quantization Loss and Query Efficiency

The MAP loss due to binarization/quantization is well controlled by CCQ; query processing efficiency is compared with the state of the art.

[Plots omitted: (a) MAP loss: MAP of IMH, CorrAE, and CCQ on I → I, T → T, I → T, T → I; (b) search efficiency: average query time (ms) of CVH, IMH, CorrAE, and CCQ on Wiki, NUS-WIDE, Flickr25K, Flickr1M.]

Figure: MAP loss by quantization and average search time for each query.

SLIDE 26

Evaluation / Discussion

Scalable Training Complexity

Training scales linearly with the number of samples; a large-scale implementation via the mini-batch paradigm (loading a fraction of the data at a time) is straightforward.

[Plots omitted: (a) training time (× 10³ s) of CVH, CorrAE, CCQ and (b) main memory usage (GB) of CVH, CorrAE, CCQ_batch, CCQ_mini-batch, vs. training data size (2-8 × 10⁵).]

Figure: Training time and memory costs on complete Flickr1M dataset.

SLIDE 27

Evaluation / Discussion

Cross-Modal Tradeoff Sensitivity

Stable parameter sensitivity is important for unsupervised cross-modal retrieval, since without labels, model selection via cross-validation is infeasible.

[Plots omitted: MAP vs. λ (0.1-200) on NUS-WIDE, Wiki, and Flickr1M. (a) I → T @ 32 bits; (b) T → I @ 32 bits.]

Figure: Stable parameter sensitivity for unsupervised cross-modal retrieval.

SLIDE 28

Summary

A composite correlation quantization approach for multimodal retrieval
A seamless optimization framework combining:
  Multimodal Correlation
  Composite Quantization
A learning-bound analysis for approximate similarity retrieval

Future Work

Multimodal Inverted Multi-Index for indexing CCQ codes
Deep neural networks for multimodal correlation embedding

http://ise.thss.tsinghua.edu.cn/~mlong
