RegML 2020 Class 7: Dictionary learning, Lorenzo Rosasco (PowerPoint presentation)



SLIDE 1

RegML 2020 Class 7 Dictionary learning

Lorenzo Rosasco UNIGE-MIT-IIT

SLIDE 2

Data representation

A mapping of data into a new format better suited for further processing.

L.Rosasco, RegML 2020

SLIDE 3

Data representation (cont.)

X data space; a data representation is a map Φ : X → F, to a representation space F. Different names in different fields:
◮ machine learning: feature map
◮ signal processing: analysis operator/transform
◮ information theory: encoder
◮ computational geometry: embedding

SLIDE 4

Outline

Part II: Data representation by learning
◮ Dictionary learning
◮ Metric learning

SLIDE 5

Supervised or Unsupervised?

Supervised (labelled/annotated) data are expensive! Ideally, a good data representation should reduce the need for (human) annotation. . .

⇒ Unsupervised learning of Φ

SLIDE 6

Unsupervised representation learning

Samples S = {x1, . . . , xn} from a distribution ρ on the input space X are available. What are the principles to learn a "good" representation in an unsupervised fashion?

SLIDE 8

Unsupervised representation learning principles

Two main concepts

  • 1. Reconstruction: there exists a map Ψ : F → X such that
Ψ ◦ Φ(x) ∼ x, ∀x ∈ X

  • 2. Similarity preservation: it holds that
Φ(x) ∼ Φ(x′) ⇔ x ∼ x′, ∀x, x′ ∈ X

Most unsupervised work has focused on reconstruction rather than on similarity. We give an overview next.

SLIDE 9

Reconstruction based data representation

Basic idea: the quality of a representation Φ is measured by the reconstruction error provided by an associated reconstruction Ψ,
‖x − Ψ ◦ Φ(x)‖.

SLIDE 11

Empirical data and population

Given S = {x1, . . . , xn}, minimize the empirical reconstruction error

Ê(Φ, Ψ) = (1/n) Σ_{i=1}^n ‖xi − Ψ ◦ Φ(xi)‖²,

as a proxy to the expected reconstruction error

E(Φ, Ψ) = ∫ dρ(x) ‖x − Ψ ◦ Φ(x)‖²,

where ρ is the data distribution (fixed but unknown).

SLIDE 12

Empirical data and population

min_{Φ,Ψ} E(Φ, Ψ),   E(Φ, Ψ) = ∫ dρ(x) ‖x − Ψ ◦ Φ(x)‖²

Caveat. . .
But reconstruction alone is not enough: copying the data, i.e. Ψ ◦ Φ = I, gives zero reconstruction error!

SLIDE 14

Dictionary learning

‖x − Ψ ◦ Φ(x)‖   Let X = R^d, F = R^p.

  • 1. Linear reconstruction:
Ψ ∈ D, with D a subset of the space of linear maps from F to X.

  • 2. Nearest neighbor representation:
Φ(x) = ΦΨ(x) = argmin_{β∈Fλ} ‖x − Ψβ‖²,   Ψ ∈ D,
where Fλ is a subset of F.

SLIDE 16

Linear reconstruction and dictionaries

Each reconstruction Ψ ∈ D can be identified with a dictionary matrix with columns a1, . . . , ap ∈ R^d. The reconstruction of an input x ∈ X corresponds to a suitable linear expansion on the dictionary:

x = Σ_{j=1}^p aj βj,   β1, . . . , βp ∈ R.

SLIDE 17

Nearest neighbor representation

Φ(x) = ΦΨ(x) = argmin_{β∈Fλ} ‖x − Ψβ‖²,   Ψ ∈ D.

The above representation is called nearest neighbor (NN) since, for Ψ ∈ D and Xλ = ΨFλ, the representation Φ(x) provides the closest point to x in Xλ:

d(x, Xλ)² = min_{x′∈Xλ} ‖x − x′‖² = min_{β∈Fλ} ‖x − Ψβ‖².

SLIDE 19

Nearest neighbor representation (cont.)

NN representations are defined by a constrained inverse problem,

min_{β∈Fλ} ‖x − Ψβ‖².

Alternatively, let Fλ = F and add a regularization term Rλ : F → R:

min_{β∈F} ‖x − Ψβ‖² + Rλ(β).

SLIDE 20

Dictionary learning

Then

min_{Ψ,Φ} (1/n) Σ_{i=1}^n ‖xi − Ψ ◦ Φ(xi)‖²

becomes

min_{Ψ∈D} (1/n) Σ_{i=1}^n min_{βi∈Fλ} ‖xi − Ψβi‖²,

where the outer minimization (over Ψ) is the dictionary learning step and the inner minimizations (over the βi) are the representation learning steps.

Dictionary learning
◮ learning a regularized representation on a dictionary. . .
◮ while simultaneously learning the dictionary itself.

SLIDE 21

Examples

The framework introduced above encompasses a large number of approaches:
◮ PCA (& kernel PCA)
◮ KSVD
◮ Sparse coding
◮ K-means
◮ K-flats
◮ . . .

SLIDE 24

Example 1: Principal Component Analysis (PCA)

Let Fλ = Fk = R^k, k ≤ min{n, d}, and D = {Ψ : F → X, linear | Ψ*Ψ = I}.
◮ Ψ is a d × k matrix with orthogonal, unit norm columns,
Ψβ = Σ_{j=1}^k aj βj,   β ∈ F
◮ Ψ* : X → F,   Ψ*x = (⟨a1, x⟩, . . . , ⟨ak, x⟩),   x ∈ X

SLIDE 25

PCA & best subspace

◮ ΨΨ* : X → X,   ΨΨ*x = Σ_{j=1}^k aj ⟨aj, x⟩,   x ∈ X.

(figure: a vector x decomposed into its projection ⟨x, a⟩a along an atom a and the residual x − ⟨x, a⟩a)

◮ P = ΨΨ* is the projection (P = P²) onto the subspace of R^d spanned by a1, . . . , ak.

SLIDE 27

Rewriting PCA

Note that

Φ(x) = Ψ*x = argmin_{β∈Fk} ‖x − Ψβ‖²,   ∀x ∈ X,

so that we can rewrite the PCA minimization as

min_{Ψ∈D} (1/n) Σ_{i=1}^n ‖xi − ΨΨ*xi‖².

Subspace learning
The problem of finding the k-dimensional orthogonal projection giving the best reconstruction.

SLIDE 29

PCA computation

Let X be the n × d data matrix and C = (1/n) XᵀX. The PCA optimization problem is solved by the eigenvectors of C associated with the k largest eigenvalues.
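The eigendecomposition just described can be checked in a few lines; a minimal numpy sketch (numpy and all variable names are my choices, not part of the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 200, 5, 2
X = rng.standard_normal((n, d)) @ rng.standard_normal((d, d))  # toy correlated data

C = X.T @ X / n                    # C = (1/n) X^T X
evals, evecs = np.linalg.eigh(C)   # eigh returns eigenvalues in ascending order
Psi = evecs[:, -k:]                # eigenvectors of the k largest eigenvalues

# Psi has orthonormal columns (Psi* Psi = I) and P = Psi Psi* is a projection
P = Psi @ Psi.T
assert np.allclose(Psi.T @ Psi, np.eye(k)) and np.allclose(P @ P, P)

# empirical reconstruction error (1/n) sum_i ||x_i - Psi Psi* x_i||^2
err = np.mean(np.sum((X - X @ P) ** 2, axis=1))
```

Equivalently, Ψ can be read off the top right singular vectors of X; the eigendecomposition of C is used here because the slides phrase the solution in those terms.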

SLIDE 30

Learning a linear representation with PCA

Subspace learning
The problem of finding the k-dimensional orthogonal projection giving the best reconstruction.

(figure: data in X concentrated around a low-dimensional linear subspace)

PCA assumes the support of the data distribution to be well approximated by a low-dimensional linear subspace.

SLIDE 31

PCA beyond linearity

(figure: data supported on a curved, nonlinear set, which a single linear subspace approximates poorly)
SLIDE 34

Kernel PCA

Consider φ : X → H and K(x, x′) = ⟨φ(x), φ(x′)⟩_H, a feature map and associated (reproducing) kernel. We can consider the empirical reconstruction in the feature space,

min_{Ψ∈D} (1/n) Σ_{i=1}^n min_{βi∈H} ‖φ(xi) − Ψβi‖²_H.

Connection to manifold learning. . .

SLIDE 39

Example 2: Sparse coding

One of the first and most famous dictionary learning techniques. It corresponds to
◮ F = R^p,
◮ p ≥ d, Fλ = {β ∈ F : ‖β‖₁ ≤ λ}, λ > 0,
◮ D = {Ψ : F → X | ‖Ψej‖ ≤ 1}.
Hence,

min_{Ψ∈D} (1/n) Σ_{i=1}^n min_{βi∈Fλ} ‖xi − Ψβi‖²,

with the outer minimization the dictionary learning step and the inner minimizations the sparse representation steps.

SLIDE 40

Sparse coding (cont.)

min_{Ψ∈D} (1/n) Σ_{i=1}^n min_{βi∈R^p, ‖βi‖₁≤λ} ‖xi − Ψβi‖²

◮ The problem is not convex. . . but it is separately convex in the βi's and in Ψ.
◮ An alternating minimization is fairly natural (other approaches are possible, see e.g. [Schnass '15, Elad et al. '06]).

SLIDE 41

Representation computation

Given a dictionary, the problems

min_{β∈Fλ} ‖xi − Ψβ‖²,   i = 1, . . . , n,

are convex and correspond to sparse representation problems. They can be solved using convex optimization techniques.

Splitting/proximal methods
Given β⁰, iterate

β^{t+1} = T_{γλ}(β^t + γΨ*(xi − Ψβ^t)),   t = 0, . . . , Tmax,

with T_{γλ} the soft-thresholding operator.
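The iteration above can be sketched as follows, assuming numpy; the function names and the 1/2 factor in the objective (which only rescales λ relative to the slides) are my choices:

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1: component-wise soft thresholding."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def sparse_code(x, Psi, lam, n_iter=500):
    """ISTA for min_beta (1/2)||x - Psi beta||^2 + lam*||beta||_1, dictionary fixed."""
    gamma = 1.0 / np.linalg.norm(Psi, 2) ** 2  # step size 1/||Psi||_2^2
    beta = np.zeros(Psi.shape[1])
    for _ in range(n_iter):
        # gradient step on the quadratic term, then soft-thresholding
        beta = soft_threshold(beta + gamma * Psi.T @ (x - Psi @ beta), gamma * lam)
    return beta
```

With a dictionary with unit-norm columns and a signal built from a few atoms, the recovered code should be sparse and reconstruct x well.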

SLIDE 43

Dictionary computation

Given Φ(xi) = βi, i = 1, . . . , n, we have

min_{Ψ∈D} (1/n) Σ_{i=1}^n ‖xi − Ψ ◦ Φ(xi)‖² = min_{Ψ∈D} (1/n) ‖X − ΨB‖²_F,

where X is the d × n matrix with columns xi, B is the p × n matrix with columns βi, and ‖·‖_F denotes the Frobenius norm. It is a convex problem, solvable via standard techniques.

Splitting/proximal methods
Given Ψ⁰, iterate

Ψ^{t+1} = P(Ψ^t + γt (X − Ψ^t B)B*),   t = 0, . . . , Tmax,

where P is the projection corresponding to the constraints, applied columnwise:
P(Ψj) = Ψj / ‖Ψj‖ if ‖Ψj‖ > 1,   P(Ψj) = Ψj if ‖Ψj‖ ≤ 1.
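The projected gradient step can be sketched as follows (numpy assumed; the function name and the step-size choice 1/‖B‖² are mine):

```python
import numpy as np

def update_dictionary(X, B, n_iter=200):
    """Projected gradient for min_Psi ||X - Psi B||_F^2 s.t. ||Psi_j|| <= 1.
    X: d x n data matrix (columns x_i); B: p x n code matrix (columns beta_i)."""
    Psi = np.zeros((X.shape[0], B.shape[0]))
    gamma = 1.0 / (np.linalg.norm(B, 2) ** 2 + 1e-12)  # step size 1/||B||_2^2
    for _ in range(n_iter):
        Psi = Psi + gamma * (X - Psi @ B) @ B.T        # gradient step
        norms = np.linalg.norm(Psi, axis=0)            # projection P: rescale
        Psi = Psi / np.maximum(norms, 1.0)             # columns with norm > 1
    return Psi
```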

SLIDE 44

Sparse coding model

◮ Sparse coding assumes the support of the data distribution to be a union of (p choose s) subspaces, i.e. all the s-dimensional subspaces spanned by s of the p dictionary atoms, where s is the sparsity level.
◮ More general penalties encode more general geometric assumptions.

SLIDE 49

Example 3: K-means & vector quantization

K-means is typically seen as a clustering algorithm in machine learning. . . but it is also a classical vector quantization approach.

Here we revisit this point of view from a data representation perspective. K-means corresponds to
◮ Fλ = Fk = {e1, . . . , ek}, the canonical basis in R^k, k ≤ n,
◮ D = {Ψ : F → X | linear}.

SLIDE 53

K-means computation

min_{Ψ∈D} (1/n) Σ_{i=1}^n min_{βi∈{e1,...,ek}} ‖xi − Ψβi‖²

The K-means problem is not convex.

Alternating minimization
  • 1. Initialize the dictionary Ψ⁰.
  • 2. Let Φ(xi) = βi, i = 1, . . . , n, be the solutions of the problems
min_{β∈{e1,...,ek}} ‖xi − Ψβ‖²,   i = 1, . . . , n,
and set Vj = {x ∈ S | Φ(x) = ej} (multiple points have the same representation since k ≤ n).
  • 3. Letting aj = Ψej, we can write
min_{Ψ∈D} (1/n) Σ_{i=1}^n ‖xi − Ψ ◦ Φ(xi)‖² = min_{a1,...,ak∈R^d} (1/n) Σ_{j=1}^k Σ_{x∈Vj} ‖x − aj‖².
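Steps 1-3 above can be sketched in numpy (the function name and the initialization from random data points are my choices):

```python
import numpy as np

def lloyd(X, k, n_iter=50, seed=0):
    """Lloyd's algorithm: alternate the assignment step (2) and the centroid step (3).
    X: n x d data matrix; returns the centroids a_1..a_k and the cluster labels."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # step 1: init from data
    for _ in range(n_iter):
        # step 2 (assignment): nearest centroid, argmin_j ||x_i - a_j||^2
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(axis=1)
        # step 3 (centroid update): mean of each Voronoi set V_j
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels
```

On two well-separated groups of points, the two recovered centroids are the group means.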

SLIDE 55

Step 2: assignment

(figure: data points assigned to the nearest of three centroids c1, c2, c3)

The discrete problems

min_{β∈{e1,...,ek}} ‖xi − Ψβ‖²,   i = 1, . . . , n,

can be seen as an assignment step.

Clusters
The sets Vj = {x ∈ S | Φ(x) = ej} are called Voronoi sets and can be seen as data clusters.

SLIDE 58

Step 3: centroid computation

Consider

min_{Ψ∈D} (1/n) Σ_{i=1}^n ‖xi − Ψ ◦ Φ(xi)‖² = min_{a1,...,ak∈R^d} (1/n) Σ_{j=1}^k Σ_{x∈Vj} ‖x − aj‖²,

where aj = Ψej. The minimization with respect to each column is independent of all the others.

Centroid computation

cj = (1/|Vj|) Σ_{x∈Vj} x = argmin_{aj∈R^d} Σ_{x∈Vj} ‖x − aj‖²,   j = 1, . . . , k.

SLIDE 61

K-means convergence

The computational procedure described before is known as Lloyd's algorithm.
◮ Since it is an alternating minimization approach, the value of the objective function can be shown to decrease with the iterations.
◮ Since there is only a finite number of possible partitions of the data into k clusters, Lloyd's algorithm is guaranteed to converge to a local minimum in a finite number of steps.

SLIDE 67

K-means initialization

Convergence to a global minimum can be ensured (with high probability), provided a suitable initialization.

K-means++ [Arthur, Vassilvitskii '07]
  • 1. Choose a centroid uniformly at random from the data.
  • 2. Compute the distance of each data point to the nearest centroid already chosen.
  • 3. Choose a new centroid from the data, with probabilities proportional to such distances (squared).
  • 4. Repeat steps 2 and 3 until k centroids have been chosen.
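Steps 1-4 can be sketched as follows (numpy assumed; the function name is mine, and degenerate data with all points coincident are not handled):

```python
import numpy as np

def kmeans_pp_init(X, k, rng):
    """K-means++ seeding: each new centroid is drawn from the data with probability
    proportional to its squared distance to the nearest centroid already chosen."""
    centers = [X[rng.integers(len(X))]]                  # step 1: uniform choice
    while len(centers) < k:                              # step 4: repeat until k centers
        d2 = ((X[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(-1).min(axis=1)
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])  # steps 2-3: D^2 sampling
    return np.array(centers)
```

Since already-chosen locations have squared distance zero, they are never re-drawn while distinct locations remain.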

SLIDE 70

K-means & piece-wise representation

(figure: data supported on M = supp{ρ}, quantized by centroids c1, c2, c3; a point x is approximated by its nearest centroid)

◮ K-means representation: extreme sparse representation, only one non-zero coefficient (vector quantization).
◮ K-means reconstruction: piecewise constant approximation of the data; each point is reconstructed by the nearest mean.

This latter perspective suggests extensions of K-means considering higher-order data approximations such as, e.g., piecewise linear ones.

SLIDE 73

K-flats & piece-wise linear representation

(figure: data supported on M = supp{ρ}, approximated by flats Ψ1, Ψ2, Ψ3; a point x is reconstructed by projection on the nearest flat)

[Bradley, Mangasarian '00; Canas, R. '12]
◮ K-flats representation: structured sparse representation; the coefficients are the projection on a flat.
◮ K-flats reconstruction: piecewise linear approximation of the data; each point is reconstructed by projection on the nearest flat.

SLIDE 76

Remarks on K-flats

(figure: data supported on M = supp{ρ}, approximated by flats Ψ1, Ψ2, Ψ3)

◮ Principled way to enrich the K-means representation (cf. softmax).
◮ Geometric, structured dictionary learning.
◮ Non-local approximations.

SLIDE 77

K-flats computations

Alternating minimization
  • 1. Initialize the flats Ψ1, . . . , Ψk.
  • 2. Assign each point to the nearest flat,
Vj = {x ∈ X | ‖x − ΨjΨj*x‖ ≤ ‖x − ΨtΨt*x‖, ∀t ≠ j}.
  • 3. Update the flats by computing a (local) PCA in each cell Vj, j = 1, . . . , k.
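The alternating minimization can be sketched as follows, assuming numpy and, for simplicity, linear flats through the origin (affine flats would subtract each cell's mean before the local PCA; all names are mine):

```python
import numpy as np

def k_flats(X, k, dim, n_iter=30, seed=0):
    """Alternating minimization for K-flats (linear flats through the origin).
    X: n x d data; each flat is given by a d x dim orthonormal basis Psi_j."""
    rng = np.random.default_rng(seed)
    # 1. initialize the flats with random orthonormal bases
    flats = [np.linalg.qr(rng.standard_normal((X.shape[1], dim)))[0] for _ in range(k)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # 2. assign each point to the nearest flat: argmin_j ||x - Psi_j Psi_j* x||^2
        residuals = np.stack([((X - X @ P @ P.T) ** 2).sum(-1) for P in flats])
        labels = residuals.argmin(axis=0)
        # 3. update each flat by a local PCA on its cell V_j
        for j in range(k):
            Xj = X[labels == j]
            if len(Xj) >= dim:
                flats[j] = np.linalg.svd(Xj, full_matrices=False)[2][:dim].T
    return flats, labels
```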

SLIDE 78

Kernel K-means & K-flats

It is easy to extend K-means & K-flats using kernels: φ : X → H, with K(x, x′) = ⟨φ(x), φ(x′)⟩_H. Consider the empirical reconstruction problem in the feature space,

min_{Ψ∈D} (1/n) Σ_{i=1}^n min_{βi∈{e1,...,ek}} ‖φ(xi) − Ψβi‖²_H.

Note: it is easy to see that the computations can be performed in closed form.
◮ Kernel K-means: distance computations.
◮ Kernel K-flats: distance computations + local kernel PCA.

SLIDE 79

Geometric Wavelets (GW)- Reconstruction Trees

◮ Select (rather than compute) a partition of the data space.
◮ Approximate the points in each cell via a vector/plane.

The selection is performed via a multi-scale/coarse-to-fine pruning of a partition tree [Maggioni et al. . . . ].

SLIDE 80

K-means/flats and GW

◮ Both can be seen as piecewise representations.
◮ The data model is a manifold, recovered in the limit when the number of pieces goes to infinity.
◮ GMRA is local (cells are connected) while K-flats is not. . .
◮ . . . but GMRA is multi-scale while K-flats is not. . .

(figure: data supported on M = supp{ρ}, approximated by flats Ψ1, Ψ2, Ψ3)

SLIDE 82

Dictionary learning & matrix factorization

PCA, sparse coding, K-means/flats, and reconstruction trees are some examples of methods based on

(P1)   min_{Ψ∈D} (1/n) Σ_{i=1}^n min_{βi∈Fλ} ‖xi − Ψβi‖²,

with the outer minimization the dictionary learning step and the inner minimizations the representation learning steps. In fact, under mild conditions the above problem is a special case of Matrix Factorization: if the minimizations over the βi's are independent, then

(P1) ⇔ min_{B,Ψ} ‖X − ΨB‖²_F,

where B has columns (βi)i, X is the data matrix, and ‖·‖_F is the Frobenius norm. The equivalence holds for all the methods we saw before!
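For instance, for PCA the inner minimizers are βi = Ψ*xi, and the sum-of-errors and Frobenius objectives coincide; a small numerical check (numpy assumed, names mine):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, k = 6, 40, 2
X = rng.standard_normal((d, n))                     # data matrix, columns x_i
Psi = np.linalg.qr(rng.standard_normal((d, k)))[0]  # any dictionary with Psi* Psi = I

B = Psi.T @ X                                       # inner minimizers beta_i = Psi* x_i
sum_form = np.mean(np.sum((X - Psi @ B) ** 2, axis=0))   # (1/n) sum_i ||x_i - Psi b_i||^2
frob_form = np.linalg.norm(X - Psi @ B, "fro") ** 2 / n  # (1/n) ||X - Psi B||_F^2
assert np.isclose(sum_form, frob_form)              # the two objectives coincide
```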

SLIDE 83

From reconstruction to similarity

We have seen two concepts emerging:
◮ parsimonious reconstruction
◮ similarity preservation

What about similarity preservation?

SLIDE 85

Randomized linear representation

Consider a randomized representation/reconstruction given by a set of random templates, fewer than the data dimension, that is a1, . . . , ak with k < d. Consider Φ : X → F = R^k such that

Φ(x) = Ax = (⟨x, a1⟩, . . . , ⟨x, ak⟩),   ∀x ∈ X,

with A a random matrix with i.i.d. entries and rows a1, . . . , ak.

SLIDE 87

Johnson-Lindenstrauss Lemma

The representation Φ(x) = Ax defines a stable embedding, i.e.

(1 − ε)‖x − x′‖ ≤ ‖Φ(x) − Φ(x′)‖ ≤ (1 + ε)‖x − x′‖,

with high probability and for all x, x′ ∈ C ⊂ X. The precision ε depends on: 1) the number of random atoms k, 2) the set C.

Example: if C is a finite set with |C| = n, then ε ∼ √(log n / k).
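A small numerical check of the embedding (numpy assumed; the customary 1/√k scaling of the Gaussian matrix, which makes the embedded norms unbiased, is my choice):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
n, d, k = 50, 2000, 600
X = rng.standard_normal((n, d))               # a finite set C of n points in R^d

A = rng.standard_normal((k, d)) / np.sqrt(k)  # i.i.d. Gaussian rows, E||Ax||^2 = ||x||^2
Y = X @ A.T                                   # Phi(x) = Ax for every point

# empirical distortion over all pairwise distances
ratios = [np.linalg.norm(Y[i] - Y[j]) / np.linalg.norm(X[i] - X[j])
          for i, j in combinations(range(n), 2)]
eps = max(abs(r - 1.0) for r in ratios)
assert eps < 0.25   # small, consistent with eps ~ sqrt(log n / k)
```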

SLIDE 88

Metric learning

Find D : X × X → R such that x is similar to x′ ⇔ D(x, x′) is small.

  • 1. How do we parameterize D?
  • 2. How do we know whether data points are similar?
  • 3. How do we turn all this into an optimization problem?

SLIDE 91

Metric learning (cont.)

  • 1. How do we parameterize D?
Mahalanobis: D(x, x′) = ⟨x − x′, M(x − x′)⟩, where M is symmetric positive definite, or rather Φ(x) = Bx with M = B*B (using kernels is possible).

  • 2. How do we know whether points are similar?
Most works assume supervised data (xi, xj, yi,j)i,j.

  • 3. How do we turn all this into an optimization problem?
Extensions of classification algorithms such as support vector machines.
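The parameterization in point 1 can be sketched as follows (numpy assumed; B here is random for illustration, whereas in practice it is learned):

```python
import numpy as np

def mahalanobis(x, xp, B):
    """D(x, x') = <x - x', M (x - x')> with M = B^T B: the squared Euclidean
    distance between the learned representations Bx and Bx'."""
    diff = x - xp
    return diff @ (B.T @ B) @ diff

rng = np.random.default_rng(0)
B = rng.standard_normal((2, 4))          # random stand-in for a learned map
x, xp = rng.standard_normal(4), rng.standard_normal(4)

# the two formulations coincide: <x - x', M(x - x')> = ||Bx - Bx'||^2
assert np.isclose(mahalanobis(x, xp, B), np.linalg.norm(B @ x - B @ xp) ** 2)
```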

SLIDE 92

This class

◮ dictionary learning
◮ metric learning

SLIDE 93

Next class

Deep learning!
