

SLIDE 1

Large-Scale Face Manifold Learning

Sanjiv Kumar

Google Research, New York, NY
* Joint work with A. Talwalkar, H. Rowley and M. Mohri

SLIDE 2

Face Manifold Learning

  • 50 x 50 pixel faces vs. 50 x 50 pixel random images: both live in ℝ^2500
  • The space of face images is significantly smaller than the 256^2500 possible images
  • Want to recover the underlying (possibly nonlinear) space! (Dimensionality Reduction)

SLIDE 3

Dimensionality Reduction

  • Linear Techniques

– PCA, Classical MDS
– Assume data lies in a subspace
– Directions of maximum variance

  • Nonlinear Techniques

– Manifold learning methods

  • LLE [Roweis & Saul ’00]
  • ISOMAP [Tenenbaum et al. ’00]
  • Laplacian Eigenmaps [Belkin & Niyogi ’01]

– Assume local linearity of data
– Need densely sampled data as input

Bottleneck: computational complexity ≈ O(n³)!

SLIDE 4

Outline

  • Manifold Learning

– ISOMAP

  • Approximate Spectral Decomposition

– Nystrom and Column-Sampling approximations

  • Large-scale Manifold learning

– 18M face images from the web
– Largest study so far: ~270K points

  • People Hopper – A Social Application on Orkut

SLIDE 5

ISOMAP [Tenenbaum et al., ’00]

  • Find the low-dimensional representation that best preserves geodesic distances between points

SLIDE 6

ISOMAP [Tenenbaum et al., ’00]

  • Find the low-dimensional representation that best preserves geodesic distances between points:

    min_Y Σ_{i,j} ( ||y_i − y_j|| − Δ_ij )²

    where the y_i are the output coordinates and Δ_ij is the geodesic distance between points i and j.

  • Recovers the true manifold asymptotically!

SLIDE 7

ISOMAP [Tenenbaum et al., ’00]

Given n input images:

  • Find t nearest neighbors for each image: O(n²)
  • Find the shortest-path distance Δ_ij for every pair (i, j): O(n² log n)
  • Construct the n × n matrix G with entries the centered Δ_ij²
    – G ~ 18M × 18M dense matrix
  • Optimal k reduced dims: U_k Σ_k^{1/2}, from the top-k eigenvectors U_k and eigenvalues Σ_k of G: O(n³)!
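
For concreteness, here is a minimal numpy/scipy sketch of this exact-Isomap pipeline on a small toy set (parameter values are illustrative; at 18M points this is precisely what becomes infeasible):

    import numpy as np
    from scipy.sparse.csgraph import shortest_path
    from scipy.spatial.distance import cdist

    def isomap(X, t=5, k=2):
        """Exact Isomap; O(n^3) overall, so feasible only for small n."""
        n = X.shape[0]
        D = cdist(X, X)                                  # all pairwise distances: O(n^2)
        # t-nearest-neighbor graph (zeros denote non-edges), symmetrized
        nn = np.argsort(D, axis=1)[:, 1:t + 1]
        A = np.zeros((n, n))
        rows = np.repeat(np.arange(n), t)
        A[rows, nn.ravel()] = D[rows, nn.ravel()]
        A = np.maximum(A, A.T)
        # Geodesic distances = shortest paths in the graph: O(n^2 log n)
        Delta = shortest_path(A, method='D', directed=False)
        # Centered squared distances: G = -1/2 * H * Delta^2 * H
        H = np.eye(n) - np.ones((n, n)) / n
        G = -0.5 * H @ (Delta ** 2) @ H
        # Embedding U_k Sigma_k^{1/2} from the top-k eigenpairs: O(n^3)
        vals, vecs = np.linalg.eigh(G)
        top = np.argsort(vals)[::-1][:k]
        return vecs[:, top] * np.sqrt(np.maximum(vals[top], 0))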

SLIDE 8

Spectral Decomposition

  • Need the eigendecomposition of a symmetric positive semi-definite n × n matrix G: O(n³)
  • For n = 18M, G ≈ 1300 TB
    – ~100,000 machines with 12 GB RAM each
  • Iterative methods [Golub & Van Loan, ’83] [Gorrell, ’06]
    – Jacobi, Arnoldi, Hebbian
    – Need matrix-vector products and several passes over the data
    – Not suitable for large dense matrices
  • Sampling-based methods [Frieze et al., ’98] [Williams & Seeger, ’00]
    – Column-Sampling Approximation
    – Nystrom Approximation
    – Relationship and comparative performance?

SLIDE 9

Approximate Spectral Decomposition

  • Sample l columns of G uniformly at random without replacement to form C (n × l); W (l × l) is the intersection of the sampled columns with the corresponding rows
  • Column-Sampling Approximation: SVD of C [Frieze et al., ’98]
  • Nystrom Approximation: SVD of W [Williams & Seeger, ’00] [Drineas & Mahoney, ’05]
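
In numpy, forming C and W from a precomputed G would look like the sketch below (in the large-scale setting one would compute only these sampled columns, never all of G):

    import numpy as np

    def sample_columns(G, l, seed=0):
        """Sample l columns of a symmetric n x n matrix G without replacement.
        Returns C (n x l), W (l x l), and the sampled indices."""
        n = G.shape[0]
        idx = np.random.default_rng(seed).choice(n, size=l, replace=False)
        C = G[:, idx]        # the sampled columns
        W = C[idx, :]        # intersection of sampled columns and rows
        return C, W, idx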

SLIDE 10

Column-Sampling Approximation

SLIDE 11

Column-Sampling Approximation

SLIDE 12

Column-Sampling Approximation

Cost: O(nl²) for the [n × l] matrix C; O(l³) for the [l × l] step.
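
A minimal sketch of the column-sampling decomposition; the sqrt(n/l) eigenvalue rescaling follows the convention in [Kumar et al., ICML ’09]:

    import numpy as np

    def col_sampling_eigs(C, n, k):
        """Approximate top-k eigenpairs of G from its sampled columns C (n x l)."""
        l = C.shape[1]
        # Thin SVD of C: O(n l^2)
        U, s, _ = np.linalg.svd(C, full_matrices=False)
        U_k = U[:, :k]                    # approximate eigenvectors (orthonormal)
        lam_k = np.sqrt(n / l) * s[:k]    # approximate eigenvalues
        return U_k, lam_k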

SLIDE 13

Nystrom Approximation

(Figure: the n × l sampled-column matrix C, containing the l × l block W.)

SLIDE 14

Nystrom Approximation

SVD of the l × l matrix W: O(l³)

SLIDE 15

Nystrom Approximation

SVD of the l × l matrix W: O(l³)

The approximate eigenvectors of G obtained from C and W are not orthonormal!
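
A matching sketch of the Nystrom decomposition (the (n/l) eigenvalue scaling and the C·U_W·Σ_W⁻¹ eigenvector extension are the standard Nystrom formulas; this assumes the top-k eigenvalues of W are positive):

    import numpy as np

    def nystrom_eigs(C, W, n, k):
        """Approximate top-k eigenpairs of G from C (n x l) and W (l x l)."""
        l = W.shape[0]
        vals, vecs = np.linalg.eigh(W)            # EVD of the small W: O(l^3)
        top = np.argsort(vals)[::-1][:k]
        vals, vecs = vals[top], vecs[:, top]
        lam_k = (n / l) * vals                    # approximate eigenvalues of G
        # Extend W's eigenvectors to all n points; columns are NOT orthonormal
        U_k = np.sqrt(l / n) * (C @ vecs) / vals
        return U_k, lam_k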

SLIDE 16

Nystrom vs. Column-Sampling

  • Experimental Comparison

– A random set of 7K face images
– Eigenvalues, eigenvectors, and low-rank approximations

[Kumar, Mohri & Talwalkar, ICML ’09]

SLIDE 17

Eigenvalues Comparison

(Plot: % deviation from the exact eigenvalues.)

SLIDE 18

Eigenvectors Comparison

(Plot: principal angle with the exact eigenvectors.)

SLIDE 19

Low-Rank Approximations: Spectral Reconstruction

Nystrom gives better reconstruction than Col-Sampling!

SLIDE 20

Low-Rank Approximations

SLIDE 21

Low-Rank Approximations

SLIDE 22

Orthogonalized Nystrom

Nystrom-orthogonal gives worse reconstruction than Nystrom!

SLIDE 23

Low-Rank Approximations: Matrix Projection

SLIDE 24

Low-Rank Approximations: Matrix Projection

SLIDE 25

Low-Rank Approximations: Matrix Projection

G̃_nys = C ( (l/n) W^{-2} ) C^T G

G̃_col = C (C^T C)^{-1} C^T G
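
Both projections, written directly from the two formulas above (a sketch; pinv guards against a singular W or CᵀC):

    import numpy as np

    def project_nystrom(C, W, G):
        """G_nys = C ((l/n) W^-2) C^T G."""
        n, l = C.shape
        M = (l / n) * np.linalg.pinv(W @ W)     # (l/n) W^{-2}
        return C @ (M @ (C.T @ G))

    def project_col_sampling(C, G):
        """G_col = C (C^T C)^-1 C^T G, i.e., projection of G onto span(C)."""
        return C @ (np.linalg.pinv(C.T @ C) @ (C.T @ G))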

SLIDE 26

Low-Rank Approximations: Matrix Projection

Col-Sampling gives better reconstruction than Nystrom!

– Theoretical guarantees in special cases [Kumar et al., ICML ’09]

SLIDE 27

How many columns are needed?

(Plot: number of columns needed to reach 75% relative accuracy.)

  • Sampling Methods

– Theoretical analysis of the uniform sampling method
– Adaptive sampling methods
– Ensemble sampling methods

[Deshpande et al. FOCS ’06] [Kumar et al., ICML ’09] [Kumar et al., AISTATS ’09] [Kumar et al., NIPS ’09]

SLIDE 28

So Far …

  • Manifold Learning

– ISOMAP

  • Approximate Spectral Decomposition

– Nystrom and Column-Sampling approximations

  • Large-scale Face Manifold learning

– 18M face images from the web

  • People Hopper – A Social Application on Orkut

SLIDE 29

Large-Scale Face Manifold Learning

  • Construct Web dataset

– Extracted 18M faces from 2.5B internet images
– ~15 hours on 500 machines
– Faces normalized to zero mean and unit variance

  • Graph construction

– Exact search: ~3 months (on 500 machines)
– Approximate Nearest Neighbor via Spill Trees (5 NN, ~2 days) [Liu et al., ’04] [Talwalkar, Kumar & Rowley, CVPR ’08]
– New methods for hashing-based kNN search: less than 5 hours! [CVPR ’10] [ICML ’10] [ICML ’11]

SLIDE 30

Neighborhood Graph Construction

  • Connect each node (face) with its neighbors
  • Is the graph connected?

– Depth-First Search to find the largest connected component (see the sketch below)
– ~10 minutes on a single machine
– Size of the largest component depends on the number of NN (t)
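
A small scipy sketch of this connectivity check (scipy's connected_components stands in for the DFS described above; the edge-list representation is illustrative):

    import numpy as np
    from scipy.sparse import csr_matrix
    from scipy.sparse.csgraph import connected_components

    def largest_component(edges, n):
        """edges: iterable of (i, j) kNN pairs over n nodes.
        Returns a boolean mask selecting the largest connected component."""
        rows, cols = zip(*edges)
        A = csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(n, n))
        _, labels = connected_components(A, directed=False)
        return labels == np.argmax(np.bincount(labels))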

SLIDE 31

Samples from connected components

(Images: samples from the largest component vs. from smaller components.)

SLIDE 32

Graph Manipulation

  • Approximating Geodesics
    – Shortest paths between all pairs of face images: O(n² log n), infeasible at this scale
  • Key Idea: only a few columns of G are needed for sampling-based decomposition
    – Only requires shortest paths between a few (l) nodes and all other nodes
    – ~1 hour on 500 machines (l = 10K)
  • Computing Embeddings (k = 100)
    – Nystrom: 1.5 hours, 500 machines
    – Col-Sampling: 6 hours, 500 machines
    – Projections: 15 mins, 500 machines
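
The key idea in scipy form: run one Dijkstra per sampled node to get just the l needed columns of the geodesic distance matrix (each source is independent, so the searches parallelize trivially across machines):

    from scipy.sparse.csgraph import dijkstra

    def sampled_geodesics(A, sample_idx):
        """A: sparse n x n weighted kNN graph; sample_idx: the l sampled nodes.
        Returns an l x n array of shortest-path distances to all nodes."""
        return dijkstra(A, directed=False, indices=sample_idx)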

SLIDE 33

18M-Manifold in 2D

(Figure: the 18M-face manifold embedded in 2D with Nystrom Isomap.)

SLIDE 34

Shortest Paths on Manifold

18M samples not enough!

SLIDE 35

Summary

  • Large-scale nonlinear dimensionality reduction using manifold learning on 18M face images

  • Fast approximate SVD based on sampling methods

  • Open Questions

– Does a manifold really exist, or does the data form clusters in low-dimensional subspaces?
– How much data is really enough?

SLIDE 36

People Hopper

  • A fun social application on Orkut
  • Face manifold constructed with Orkut database

– Extracted 13M faces from about 146M profile images
– ~3 days on 50 machines
– Color face image (40 x 48 pixels) → 5760-dim vector
– Faces normalized to zero mean and unit variance in intensity space

  • Shortest path search using bidirectional Dijkstra (see the sketch below)
  • Users can opt out
    – Daily incremental graph update
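
For the hop between two faces, a hedged sketch using networkx's built-in bidirectional Dijkstra (the tiny graph and node ids are illustrative, not the Orkut graph):

    import networkx as nx

    # Toy kNN graph: nodes are face ids, weights are distances between faces.
    G = nx.Graph()
    G.add_weighted_edges_from([(0, 1, 0.3), (1, 2, 0.2), (0, 2, 0.9), (2, 3, 0.4)])

    # Bidirectional Dijkstra expands from both endpoints and meets in the middle,
    # typically touching far fewer nodes than a one-sided search.
    length, path = nx.bidirectional_dijkstra(G, 0, 3)
    print(length, path)   # 0.9 [0, 1, 2, 3]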

SLIDE 37

People Hopper Interface

SLIDE 38

From the Blogs

SLIDE 39

CMU-PIE Dataset

  • 68 people, 13 poses, 43 illuminations, 4 expressions
  • 35,247 faces detected by a face detector
  • Classification and clustering on poses

SLIDE 40

Clustering

  • K-means clustering after transformation (k = 100)

– K fixed to be the same as number of classes

  • Two metrics

– Purity: points within a cluster come from the same class (see the sketch below)
– Accuracy: points from a class form a single cluster

Matrix G is not guaranteed to be positive semi-definite in Isomap!

  • Nystrom: EVD of W (can ignore negative eigenvalues)
  • Col-sampling: SVD of C (signs are lost)!
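
A minimal sketch of the purity metric as defined above (the accuracy metric is the symmetric variant, computed per class rather than per cluster; integer label arrays are assumed):

    import numpy as np

    def purity(cluster_ids, class_ids):
        """Fraction of points whose cluster's majority class matches their class."""
        total = 0
        for c in np.unique(cluster_ids):
            members = class_ids[cluster_ids == c]
            total += np.bincount(members).max()    # majority-class count in cluster c
        return total / len(class_ids)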

SLIDE 41

Optimal 2D embeddings

SLIDE 42

Laplacian Eigenmaps [Belkin & Niyogi, ’01]

Minimize weighted distances between neighbors.

  • Find t nearest neighbors for each image: O(n²)
  • Compute weight matrix W: W_ij = exp(−||x_i − x_j||² / σ²) if i and j are neighbors, 0 otherwise
  • Compute the normalized Laplacian G = I − D^{-1/2} W D^{-1/2}, where D is diagonal with D_ii = Σ_j W_ij
  • Optimal k reduced dims U_k: the bottom eigenvectors of G: O(n³)
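
A small numpy sketch of these steps on toy data (heat-kernel weights as in Belkin & Niyogi; t, k, and σ here are illustrative):

    import numpy as np
    from scipy.spatial.distance import cdist

    def laplacian_eigenmaps(X, t=5, k=2, sigma=1.0):
        """Exact Laplacian Eigenmaps; the final eigendecomposition is O(n^3)."""
        n = X.shape[0]
        D2 = cdist(X, X) ** 2
        # Heat-kernel weights on the symmetrized t-NN graph: O(n^2)
        nn = np.argsort(D2, axis=1)[:, 1:t + 1]
        W = np.zeros((n, n))
        rows = np.repeat(np.arange(n), t)
        W[rows, nn.ravel()] = np.exp(-D2[rows, nn.ravel()] / sigma ** 2)
        W = np.maximum(W, W.T)
        # Normalized Laplacian G = I - D^{-1/2} W D^{-1/2}
        d = W.sum(axis=1)
        Dinv = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
        G = np.eye(n) - Dinv @ W @ Dinv
        # Embedding: bottom eigenvectors of G (skip the trivial constant one)
        vals, vecs = np.linalg.eigh(G)
        return vecs[:, 1:k + 1]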


SLIDE 43

Different Sampling Procedures