SLIDE 1

Part 3: Latent representations and unsupervised learning

Dale Schuurmans University of Alberta

SLIDE 2

Supervised versus unsupervised learning

Prominent training principles

[Diagram] Discriminative: model $y$ from $x$ ($x \to y$), typical for supervised learning. Generative: model $x$ from $y$ ($y \to x$), typical for unsupervised learning.

SLIDE 3

Unsupervised representation learning

Consider generative training: a latent representation $\phi$ generates the observation $x$ ($\phi \to x$).

SLIDE 4

Unsupervised representation learning

Examples

  • dimensionality reduction (PCA, exponential family PCA)
  • sparse coding
  • independent component analysis
  • deep learning
  • …

Usually involves learning both a latent representation for the data and a data reconstruction model.

Context could be: unsupervised, semi-supervised, or supervised.

SLIDE 5

Challenge

Optimal feature discovery appears to be generally intractable

Have to jointly train

  • latent representation
  • data reconstruction model

Usually resort to alternating minimization

(sole exception: PCA)

SLIDE 6

First consider unsupervised feature discovery

SLIDE 7

Unsupervised feature discovery

Single layer case = matrix factorization

$$X \approx B\,\Phi$$

where $X$ is the original data ($n \times t$), $B$ is the learned dictionary ($n \times m$), and $\Phi$ is the new representation ($m \times t$), with $t$ = # training examples, $n$ = # original features, $m$ = # new features.

Choose $B$ and $\Phi$ to minimize the data reconstruction loss

$$L(B\Phi; X) = \sum_{i=1}^{t} L(B\Phi_{:i}; X_{:i})$$

Seek desired structure in the latent feature representation:

  • $\Phi$ low rank : dimensionality reduction
  • $\Phi$ sparse : sparse coding
  • $\Phi$ rows independent : independent component analysis
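
To make the bookkeeping concrete, here is a minimal NumPy sketch (my own illustration, not from the slides) of the factorization shapes and a squared-loss instance of the column-wise reconstruction loss:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, t = 20, 5, 100                 # original features, new features, training examples

X = rng.standard_normal((n, t))      # original data, n x t
B = rng.standard_normal((n, m))      # learned dictionary, n x m
Phi = rng.standard_normal((m, t))    # new representation, m x t

# Squared reconstruction loss, summed column-wise over training examples:
# L(B Phi; X) = sum_i L(B Phi_{:i}; X_{:i})
loss = 0.5 * np.sum((B @ Phi - X) ** 2)
print(loss)
```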

SLIDE 8

Generalized matrix factorization

Assume the reconstruction loss $L(\hat x; x)$ is convex in its first argument.

Bregman divergence

$$L(\hat x; x) = D_F(\hat x \,\|\, x) = D_{F^*}\!\big(f(x) \,\|\, f(\hat x)\big)$$

($F$ a strictly convex potential with transfer $f = \nabla F$.) Tries to make $\hat x \approx x$.

Matching loss

$$L(\hat x; x) = D_F\!\big(\hat x \,\|\, f^{-1}(x)\big) = D_{F^*}\!\big(x \,\|\, f(\hat x)\big)$$

Tries to make $f(\hat x) \approx x$. (A nonlinear predictor, but the loss is still convex in $\hat x$.)

Regular exponential family

$$L(\hat x; x) = -\log p_B(x \mid \phi) = D_F\!\big(\hat x \,\|\, f^{-1}(x)\big) - F^*(x) - \text{const}$$
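
As a concrete illustration (my own sketch, using standard potentials), the Bregman divergence recovers squared loss for $F(u) = \tfrac12\|u\|^2$ and generalized KL for $F(u) = \sum_i u_i \log u_i$:

```python
import numpy as np

def bregman(F, grad_F, x_hat, x):
    """Bregman divergence D_F(x_hat || x) = F(x_hat) - F(x) - <grad F(x), x_hat - x>."""
    return F(x_hat) - F(x) - grad_F(x) @ (x_hat - x)

x_hat = np.array([0.2, 0.5, 0.3])
x = np.array([0.3, 0.4, 0.3])

# F(u) = 0.5 ||u||^2 gives squared loss: D_F(x_hat || x) = 0.5 ||x_hat - x||^2
sq = bregman(lambda u: 0.5 * u @ u, lambda u: u, x_hat, x)

# F(u) = sum u log u gives the generalized KL divergence
kl = bregman(lambda u: np.sum(u * np.log(u)),
             lambda u: np.log(u) + 1.0, x_hat, x)
print(sq, kl)
```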

SLIDE 9

Training problem

$$\min_{B \in \mathbb{R}^{n \times m}} \ \min_{\Phi \in \mathbb{R}^{m \times t}} L(B\Phi; X)$$

How to impose desired structure on $\Phi$?

SLIDE 10

Training problem

$$\min_{B \in \mathbb{R}^{n \times m}} \ \min_{\Phi \in \mathbb{R}^{m \times t}} L(B\Phi; X)$$

How to impose desired structure on $\Phi$? Dimensionality reduction

Fix # features $m < \min(n, t)$.

  • But only known to be tractable if $L(\hat X; X) = \|\hat X - X\|_F^2$ (PCA)
  • No known efficient algorithm for other standard losses

Problem

The $\mathrm{rank}(\Phi) = m$ constraint is too hard.

SLIDE 11

Training problem

$$\min_{B \in \mathcal{B}_2^m} \ \min_{\Phi \in \mathbb{R}^{m \times t}} L(B\Phi; X) + \alpha \|\Phi\|_{2,1}$$

How to impose desired structure on $\Phi$? Relaxed dimensionality reduction (subspace learning)

Add the rank-reducing regularizer $\|\Phi\|_{2,1} = \sum_{j=1}^{m} \|\Phi_{j:}\|_2$, which favors null rows in $\Phi$ (see the sketch below). But need to add a constraint to $B$: $B_{:j} \in \mathcal{B}_2 = \{b : \|b\|_2 \le 1\}$ (otherwise $\Phi$ can be made small just by making $B$ large).
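
A small numerical illustration (mine, not from the slides) of why $\|\cdot\|_{2,1}$ is rank-reducing: among matrices with the same elementwise $\|\cdot\|_{1,1}$, it prefers the one whose nonzeros concentrate in few rows:

```python
import numpy as np

def norm_21(Phi):
    # ||Phi||_{2,1} = sum of Euclidean norms of the rows -> favors null rows
    return np.sum(np.linalg.norm(Phi, axis=1))

def norm_11(Phi):
    # ||Phi||_{1,1} = sum of absolute entries -> favors sparse entries
    return np.sum(np.abs(Phi))

Phi_row_sparse = np.array([[1.0, 2.0, 2.0],
                           [0.0, 0.0, 0.0]])    # one null row (low rank)
Phi_entry_sparse = np.array([[1.0, 0.0, 2.0],
                             [0.0, 2.0, 0.0]])  # scattered nonzeros

print(norm_11(Phi_row_sparse), norm_11(Phi_entry_sparse))  # 5.0  5.0   (tied)
print(norm_21(Phi_row_sparse), norm_21(Phi_entry_sparse))  # 3.0  ~4.24 (row-sparse wins)
```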

SLIDE 12

Training problem

$$\min_{B \in \mathcal{B}_q^m} \ \min_{\Phi \in \mathbb{R}^{m \times t}} L(B\Phi; X) + \alpha \|\Phi\|_{1,1}$$

How to impose desired structure on $\Phi$? Sparse coding

Use the sparsity-inducing regularizer $\|\Phi\|_{1,1} = \sum_{j=1}^{m} \sum_{i=1}^{t} |\Phi_{ji}|$, which favors sparse entries in $\Phi$. Need to add a constraint to $B$: $B_{:j} \in \mathcal{B}_q = \{b : \|b\|_q \le 1\}$ (otherwise $\Phi$ can be made small just by making $B$ large).

SLIDE 13

Training problem

$$\min_{B \in \mathbb{R}^{n \times m}} \ \min_{\Phi \in \mathbb{R}^{m \times t}} L(B\Phi; X) + \alpha D(\Phi)$$

How to impose desired structure on $\Phi$? Independent component analysis

Usually enforces $B\Phi = X$ as a constraint,

  • but interpolation is generally a bad idea
  • instead, just minimize the reconstruction loss plus a dependence measure $D(\Phi)$ as a regularizer

Difficulty

Formulating a reasonable convex dependence penalty.

SLIDE 14

Training problem

Consider subspace learning and sparse coding:

$$\min_{B \in \mathcal{B}^m} \ \min_{\Phi \in \mathbb{R}^{m \times t}} L(B\Phi; X) + \alpha \|\Phi\|$$

The choice of norm $\|\Phi\|$ and constraint set $\mathcal{B}$ determines the type of representation recovered.

SLIDE 15

Training problem

Consider subspace learning and sparse coding:

$$\min_{B \in \mathcal{B}^\infty} \ \min_{\Phi \in \mathbb{R}^{\infty \times t}} L(B\Phi; X) + \alpha \|\Phi\|$$

The choice of norm $\|\Phi\|$ and constraint set $\mathcal{B}$ determines the type of representation recovered.

Problem

Still have the rank constraint imposed by the number of new features $m$.

Idea

Just relax $m \to \infty$:

  • rely on the sparsity-inducing norm $\|\Phi\|$ to select features

SLIDE 16

Training problem

Consider subspace learning and sparse coding:

$$\min_{B \in \mathcal{B}^\infty} \ \min_{\Phi \in \mathbb{R}^{\infty \times t}} L(B\Phi; X) + \alpha \|\Phi\|$$

Still have a problem

The optimization problem is not jointly convex in $B$ and $\Phi$.

SLIDE 17

Training problem

Consider subspace learning and sparse coding:

$$\min_{B \in \mathcal{B}^\infty} \ \min_{\Phi \in \mathbb{R}^{\infty \times t}} L(B\Phi; X) + \alpha \|\Phi\|$$

Still have a problem

The optimization problem is not jointly convex in $B$ and $\Phi$.

Idea 1: Alternate! (see the sketch below)

  • convex in $B$ given $\Phi$
  • convex in $\Phi$ given $B$

Could use any other form of local training.
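
A minimal alternating-minimization sketch (my own illustration, assuming squared loss, an elementwise $\ell_1$ penalty on $\Phi$, unit 2-norm column constraints on $B$, and simple proximal/projected gradient phases):

```python
import numpy as np

def soft_threshold(A, tau):
    return np.sign(A) * np.maximum(np.abs(A) - tau, 0.0)

def alternate(X, m, alpha=0.1, outer=50, inner=20):
    n, t = X.shape
    rng = np.random.default_rng(0)
    B, Phi = rng.standard_normal((n, m)), rng.standard_normal((m, t))
    for _ in range(outer):
        # Phi-phase: proximal gradient on 0.5||B Phi - X||_F^2 + alpha ||Phi||_{1,1}
        step = 1.0 / (np.linalg.norm(B, 2) ** 2 + 1e-12)
        for _ in range(inner):
            Phi = soft_threshold(Phi - step * B.T @ (B @ Phi - X), step * alpha)
        # B-phase: projected gradient on 0.5||B Phi - X||_F^2, columns kept in the unit ball
        step = 1.0 / (np.linalg.norm(Phi, 2) ** 2 + 1e-12)
        for _ in range(inner):
            B -= step * (B @ Phi - X) @ Phi.T
            B /= np.maximum(np.linalg.norm(B, axis=0, keepdims=True), 1.0)
    return B, Phi

X = np.random.default_rng(1).standard_normal((20, 100))
B, Phi = alternate(X, m=10)
print(0.5 * np.sum((B @ Phi - X) ** 2) + 0.1 * np.abs(Phi).sum())
```

Each phase is a convex problem, but the joint objective is not, so this only reaches a local solution; that limitation is exactly what motivates Idea 3.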

SLIDE 18

Training problem

Consider subspace learning and sparse coding:

$$\min_{B \in \mathcal{B}^\infty} \ \min_{\Phi \in \mathbb{R}^{\infty \times t}} L(B\Phi; X) + \alpha \|\Phi\|$$

Still have a problem

The optimization problem is not jointly convex in $B$ and $\Phi$.

Idea 2: Boost!

  • Implicitly fix $B$ to a universal dictionary
  • Keep a row-wise sparse $\Phi$
  • Incrementally select columns of $B$ (the "weak learning problem")
  • Update the sparse $\Phi$

Can prove convergence under broad conditions.

SLIDE 19

Training problem

Consider subspace learning and sparse coding:

$$\min_{B \in \mathcal{B}^\infty} \ \min_{\Phi \in \mathbb{R}^{\infty \times t}} L(B\Phi; X) + \alpha \|\Phi\|$$

Still have a problem?

The optimization problem is not jointly convex in $B$ and $\Phi$.

Idea 3: Solve!

  • Can easily solve for the globally optimal joint $B$ and $\Phi$
  • But requires a significant reformulation

SLIDE 20

A useful observation

SLIDE 21

Equivalent reformulation

Theorem

$$\min_{B \in \mathcal{B}^\infty} \ \min_{\Phi \in \mathbb{R}^{\infty \times t}} L(B\Phi; X) + \alpha \|\Phi\|_{p,1} \;=\; \min_{\hat X \in \mathbb{R}^{n \times t}} L(\hat X; X) + \alpha \,|\!|\!|\hat X|\!|\!|$$

  • $|\!|\!|\cdot|\!|\!|$ is an induced matrix norm on $\hat X$ determined by $\mathcal{B}$ and $\|\cdot\|_{p,1}$

Important fact

Norms are always convex.

Computational strategy

  1. Solve for the optimal response matrix $\hat X$ first (convex minimization)
  2. Then recover the optimal $B$ and $\Phi$ from $\hat X$

SLIDE 22

Example: subspace learning

$$\min_{B \in \mathcal{B}_2^\infty} \ \min_{\Phi \in \mathbb{R}^{\infty \times t}} L(B\Phi; X) + \alpha \|\Phi\|_{2,1} \;=\; \min_{\hat X \in \mathbb{R}^{n \times t}} L(\hat X; X) + \alpha \|\hat X\|_{tr}$$

Recovery

  • Let $U \Sigma V' = \mathrm{svd}(\hat X)$
  • Set $B = U$ and $\Phi = \Sigma V'$

Preserves optimality

  • $\|B_{:j}\|_2 = 1$, hence each column of $B$ lies in $\mathcal{B}_2$
  • $\|\Phi\|_{2,1} = \|\Sigma V'\|_{2,1} = \sum_j \sigma_j \|V_{:j}\|_2 = \sum_j \sigma_j = \|\hat X\|_{tr}$

Thus $L(\hat X; X) + \alpha \|\hat X\|_{tr} = L(B\Phi; X) + \alpha \|\Phi\|_{2,1}$. (See the sketch below.)
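
This recovery is easy to check numerically; a minimal sketch (mine) with a random stand-in for the optimal response matrix $\hat X$:

```python
import numpy as np

rng = np.random.default_rng(0)
X_hat = rng.standard_normal((8, 30))   # stand-in for an optimal response matrix

U, s, Vt = np.linalg.svd(X_hat, full_matrices=False)
B = U                      # dictionary: orthonormal columns, so ||B_{:j}||_2 = 1
Phi = np.diag(s) @ Vt      # representation: Sigma V'

# Row j of Phi is sigma_j * V_{:j}', so its 2-norm is sigma_j:
norm_21 = np.sum(np.linalg.norm(Phi, axis=1))
trace_norm = np.sum(s)
print(np.allclose(B @ Phi, X_hat))        # True: exact reconstruction
print(np.allclose(norm_21, trace_norm))   # True: ||Phi||_{2,1} == ||X_hat||_tr
```

The matching regularizer values are exactly the "preserves optimality" claim above.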

SLIDE 23

Example: sparse coding

$$\min_{B \in \mathcal{B}_q^\infty} \ \min_{\Phi \in \mathbb{R}^{\infty \times t}} L(B\Phi; X) + \alpha \|\Phi\|_{1,1} \;=\; \min_{\hat X \in \mathbb{R}^{n \times t}} L(\hat X; X) + \alpha \|\hat X'\|_{q,1}$$

Recovery

  • $B = \left[ \frac{\hat X_{:1}}{\|\hat X_{:1}\|_q}, \ \ldots, \ \frac{\hat X_{:t}}{\|\hat X_{:t}\|_q} \right]$ (rescaled columns)
  • $\Phi = \mathrm{diag}\!\left( \|\hat X_{:1}\|_q, \ldots, \|\hat X_{:t}\|_q \right)$ (diagonal matrix)

Preserves optimality

  • $\|B_{:j}\|_q = 1$, hence each column of $B$ lies in $\mathcal{B}_q$
  • $\|\Phi\|_{1,1} = \sum_j \|\hat X_{:j}\|_q = \|\hat X'\|_{q,1}$

Thus $L(\hat X; X) + \alpha \|\hat X'\|_{q,1} = L(B\Phi; X) + \alpha \|\Phi\|_{1,1}$. (A sketch follows.)
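
The analogous check for the sparse-coding recovery (again my own sketch, with a random stand-in for $\hat X$ and $q = 2$):

```python
import numpy as np

rng = np.random.default_rng(0)
q = 2
X_hat = rng.standard_normal((8, 30))

col_norms = np.linalg.norm(X_hat, ord=q, axis=0)
B = X_hat / col_norms          # rescaled columns: ||B_{:j}||_q = 1
Phi = np.diag(col_norms)       # diagonal representation

print(np.allclose(B @ Phi, X_hat))                        # True: exact reconstruction
print(np.allclose(np.abs(Phi).sum(), col_norms.sum()))    # ||Phi||_{1,1} == ||X_hat'||_{q,1}
```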

SLIDE 24

Example: sparse coding

Outcome

Sparse coding with $\|\cdot\|_{1,1}$ regularization = vector quantization:

  • drops some examples
  • memorizes the remaining examples

The optimal solution is not overcomplete.

Could not make these observations using local solvers.

SLIDE 25

Simple extensions

  • Missing observations in $X$
  • Robustness to outliers in $X$

$$\min_{S \in \mathbb{R}^{n \times t}} \ \min_{\hat X \in \mathbb{R}^{n \times t}} L\big( (\hat X + S)_\Omega \,;\, X_\Omega \big) + \alpha \,|\!|\!|\hat X|\!|\!| + \beta \|S\|_{1,1}$$

where $\Omega$ = observed indices in $X$ and $S$ = speckled outlier noise. (Jointly convex in $\hat X$ and $S$.)
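
A minimal sketch (my own illustration) of evaluating this objective, assuming squared loss and the trace norm as the induced norm $|\!|\!|\hat X|\!|\!|$:

```python
import numpy as np

def robust_masked_objective(X_hat, S, X, mask, alpha, beta):
    """Squared loss on observed entries + trace norm of X_hat + l1 penalty on S."""
    R = (X_hat + S - X) * mask                 # residual restricted to observed indices
    loss = 0.5 * np.sum(R ** 2)
    trace_norm = np.sum(np.linalg.svd(X_hat, compute_uv=False))
    return loss + alpha * trace_norm + beta * np.abs(S).sum()

rng = np.random.default_rng(0)
X = rng.standard_normal((10, 40))
mask = rng.random((10, 40)) < 0.8              # ~80% of entries observed
print(robust_masked_objective(X.copy(), np.zeros_like(X), X, mask, 1e-3, 1e-2))
```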

SLIDE 26

Explaining the useful result

Theorem

$$\min_{B \in \mathcal{B}^\infty} \ \min_{\Phi \in \mathbb{R}^{\infty \times t}} L(B\Phi; X) + \alpha \|\Phi\|_{p,1} \;=\; \min_{\hat X \in \mathbb{R}^{n \times t}} L(\hat X; X) + \alpha \,|\!|\!|\hat X|\!|\!|$$

for an induced matrix norm $|\!|\!|\hat X|\!|\!| = \|\hat X'\|^*_{(\mathcal{B},\,p^*)}$.

SLIDE 27

Explaining the useful result

Theorem

$$\min_{B \in \mathcal{B}^\infty} \ \min_{\Phi \in \mathbb{R}^{\infty \times t}} L(B\Phi; X) + \alpha \|\Phi\|_{p,1} \;=\; \min_{\hat X \in \mathbb{R}^{n \times t}} L(\hat X; X) + \alpha \,|\!|\!|\hat X|\!|\!|$$

for an induced matrix norm $|\!|\!|\hat X|\!|\!| = \|\hat X'\|^*_{(\mathcal{B},\,p^*)}$.

A dual norm

$$\|\hat X'\|^*_{(\mathcal{B},\,p^*)} = \max_{\|\Lambda'\|_{(\mathcal{B},\,p^*)} \le 1} \mathrm{tr}(\Lambda' \hat X)$$

(standard definition of a dual norm)

SLIDE 28

Explaining the useful result

Theorem

$$\min_{B \in \mathcal{B}^\infty} \ \min_{\Phi \in \mathbb{R}^{\infty \times t}} L(B\Phi; X) + \alpha \|\Phi\|_{p,1} \;=\; \min_{\hat X \in \mathbb{R}^{n \times t}} L(\hat X; X) + \alpha \,|\!|\!|\hat X|\!|\!|$$

for an induced matrix norm $|\!|\!|\hat X|\!|\!| = \|\hat X'\|^*_{(\mathcal{B},\,p^*)}$.

A dual norm

$$\|\hat X'\|^*_{(\mathcal{B},\,p^*)} = \max_{\|\Lambda'\|_{(\mathcal{B},\,p^*)} \le 1} \mathrm{tr}(\Lambda' \hat X)$$

(standard definition of a dual norm)

For a vector-norm induced matrix norm

$$\|\Lambda'\|_{(\mathcal{B},\,p^*)} = \max_{b \in \mathcal{B}} \|\Lambda' b\|_{p^*}$$

(easy to prove this yields a norm on matrices)
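
A quick numerical sanity check (mine): for $\mathcal{B} = \mathcal{B}_2$ and $p^* = 2$ this induced norm is the spectral norm of $\Lambda'$, so random unit vectors should approach it from below:

```python
import numpy as np

rng = np.random.default_rng(0)
Lam = rng.standard_normal((8, 5))   # Lambda

# Induced norm ||Lambda'||_(B_2, 2) = max_{||b||_2 <= 1} ||Lambda' b||_2
best = 0.0
for _ in range(20000):
    b = rng.standard_normal(8)
    b /= np.linalg.norm(b)          # sample from the unit sphere in B_2
    best = max(best, np.linalg.norm(Lam.T @ b))

print(best, np.linalg.norm(Lam.T, 2))   # random-search lower bound vs sigma_max
```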

SLIDE 29

Proof outline

$$\min_{B \in \mathcal{B}^\infty} \ \min_{\Phi \in \mathbb{R}^{\infty \times t}} L(B\Phi; X) + \alpha \|\Phi\|_{p,1}$$
$$= \min_{\hat X \in \mathbb{R}^{n \times t}} \ \min_{B \in \mathcal{B}^\infty} \ \min_{\Phi : B\Phi = \hat X} L(\hat X; X) + \alpha \|\Phi\|_{p,1}$$
$$= \min_{\hat X \in \mathbb{R}^{n \times t}} L(\hat X; X) + \alpha \min_{B \in \mathcal{B}^\infty} \ \min_{\Phi : B\Phi = \hat X} \|\Phi\|_{p,1}$$

SLIDE 30

Proof outline

For any $B \in \mathcal{B}^\infty$ that spans the columns of $\hat X$:

$$\min_{\Phi : B\Phi = \hat X} \|\Phi\|_{p,1} = \min_{\Phi} \max_{\Lambda} \max_{\|V\|_{p^*,\infty} \le 1} \mathrm{tr}(V'\Phi) + \mathrm{tr}\big(\Lambda'(\hat X - B\Phi)\big)$$
$$= \max_{\|V\|_{p^*,\infty} \le 1} \max_{\Lambda} \min_{\Phi} \mathrm{tr}(\Lambda'\hat X) + \mathrm{tr}\big(\Phi'(V - B'\Lambda)\big)$$
$$= \max_{\|V\|_{p^*,\infty} \le 1} \ \max_{\Lambda : B'\Lambda = V} \mathrm{tr}(\Lambda'\hat X) = \max_{\Lambda : \|B'\Lambda\|_{p^*,\infty} \le 1} \mathrm{tr}(\Lambda'\hat X)$$

SLIDE 31

Proof outline

Substituting this back:

$$\cdots = \min_{\hat X \in \mathbb{R}^{n \times t}} L(\hat X; X) + \alpha \min_{B \in \mathcal{B}^\infty} \ \max_{\Lambda : \|B'\Lambda\|_{p^*,\infty} \le 1} \mathrm{tr}(\Lambda'\hat X)$$


SLIDE 33

Proof outline

Minimizing over the dictionary tightens the dual constraint to hold for every $B$:

$$\cdots = \min_{\hat X \in \mathbb{R}^{n \times t}} L(\hat X; X) + \alpha \max_{\Lambda : \|B'\Lambda\|_{p^*,\infty} \le 1 \ \forall B \in \mathcal{B}^\infty} \mathrm{tr}(\Lambda'\hat X)$$


SLIDE 35

Proof outline

Since $\mathcal{B}^\infty$ collects arbitrarily many columns from $\mathcal{B}$, the constraint reduces to individual columns:

$$\cdots = \min_{\hat X \in \mathbb{R}^{n \times t}} L(\hat X; X) + \alpha \max_{\Lambda : \|b'\Lambda\|_{p^*} \le 1 \ \forall b \in \mathcal{B}} \mathrm{tr}(\Lambda'\hat X)$$


SLIDE 37

Proof outline

This is exactly the unit ball of the vector-norm induced matrix norm:

$$\cdots = \min_{\hat X \in \mathbb{R}^{n \times t}} L(\hat X; X) + \alpha \max_{\Lambda : \|\Lambda'\|_{(\mathcal{B},\,p^*)} \le 1} \mathrm{tr}(\Lambda'\hat X)$$


SLIDE 39

Proof outline

Finally, by the definition of the dual norm:

$$\cdots = \min_{\hat X \in \mathbb{R}^{n \times t}} L(\hat X; X) + \alpha \|\hat X'\|^*_{(\mathcal{B},\,p^*)} \qquad \square$$


SLIDE 41

Closed form induced norms

Theorem

$$\min_{B \in \mathcal{B}^\infty} \ \min_{\Phi \in \mathbb{R}^{\infty \times t}} L(B\Phi; X) + \alpha \|\Phi\|_{p,1} \;=\; \min_{\hat X \in \mathbb{R}^{n \times t}} L(\hat X; X) + \alpha \|\hat X'\|^*_{(\mathcal{B},\,p^*)}$$

Special cases

  • $\mathcal{B}_2$, $\|\Phi\|_{2,1}$ : $\|\hat X'\|^*_{(\mathcal{B}_2,\,2)} = \|\hat X\|_{tr}$ (subspace learning)
  • $\mathcal{B}_q$, $\|\Phi\|_{1,1}$ : $\|\hat X'\|^*_{(\mathcal{B}_q,\,\infty)} = \|\hat X'\|_{q,1}$ (sparse coding)
  • $\mathcal{B}_1$, $\|\Phi\|_{p,1}$ : $\|\hat X'\|^*_{(\mathcal{B}_1,\,p^*)} = \|\hat X\|_{p,1}$

SLIDE 42

Some simple experiments

SLIDE 43

Experimental results

  • Alternate: repeatedly optimize over $B$, $\Phi$ successively
  • Global: recover the global joint minimizer over $B$, $\Phi$

SLIDE 44

Experimental results: Sparse coding

Objective value achieved (×10⁻²)

data set    COIL   WBC    BCI    Ionos  G241N
Alternate   1.314  4.918  0.898  1.612  1.312
Global      0.207  0.659  0.306  0.330  0.207

(squared loss, q = 2, α = 10⁻⁵)

SLIDE 45

Experimental results: Sparse coding

Run time (seconds)

data set    COIL   WBC    BCI    Ionos  G241N
Alternate   1.95   10.54  0.88   1.71   2.37
Global      0.06   0.01   0.01   0.01   0.09

(squared loss, q = 2, α = 10⁻⁵)

SLIDE 46

Experimental results: Subspace learning

Objective value achieved (×10⁻²)

data set    COIL   WBC    BCI    Ionos  G241N
Alternate   1.314  4.957  0.903  1.632  1.313
Global      0.072  0.072  0.092  0.079  0.205

(squared loss, α = 10⁻⁵)

SLIDE 47

Experimental results: Subspace learning

Run time (seconds)

data set    COIL   WBC    BCI    Ionos  G241N
Alternate   2.40   9.31   1.12   0.47   2.43
Global      2.18   0.06   0.19   0.06   2.11

(squared loss, α = 10⁻⁵)

SLIDE 48

Catch

Every norm is convex, but not every induced matrix norm is tractable:

$$\|X\|_2 = \sigma_{\max}(X) \qquad \|X\|_1 = \max_j \sum_i |X_{ij}| \qquad \|X\|_\infty = \max_i \sum_j |X_{ij}|$$

$\|X\|_p$ is NP-hard to approximate for $p \notin \{1, 2, \infty\}$.

Question

Any other useful induced matrix norms that are tractable?

Yes!
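
The three tractable cases above are immediate to compute; a minimal NumPy sketch (my own illustration):

```python
import numpy as np

X = np.random.default_rng(0).standard_normal((6, 9))

norm_2 = np.linalg.norm(X, 2)           # sigma_max(X)
norm_1 = np.abs(X).sum(axis=0).max()    # max column sum: max_j sum_i |X_ij|
norm_inf = np.abs(X).sum(axis=1).max()  # max row sum: max_i sum_j |X_ij|
print(norm_2, norm_1, norm_inf)
```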

SLIDE 49

Semi-supervised feature discovery

SLIDE 50

Semi-supervised feature discovery

Block factorization (labeled and unlabeled inputs stacked with the outputs):

$$\begin{bmatrix} X_l & X_u \\ Y_l & \varnothing \end{bmatrix} \approx \begin{bmatrix} B \\ W \end{bmatrix} \begin{bmatrix} \Phi_l & \Phi_u \end{bmatrix} \qquad \big( (n+k) \times (t_l + t_u) \big)$$

where $t_l$ = # labeled, $t_u$ = # unlabeled, $n$ = # original features, $k$ = # output dimensions, $t = t_l + t_u$.

Learn

  • $\Phi = [\Phi_l, \Phi_u]$ : data representation
  • $B$ : input reconstruction model, $f(B\Phi) \approx X$
  • $W$ : output reconstruction model, $h(W\Phi_l) \approx Y_l$

SLIDE 51

Semi-supervised feature discovery

Let $Z = \begin{bmatrix} X_l & X_u \\ Y_l & \varnothing \end{bmatrix}$ and $U = \begin{bmatrix} B \\ W \end{bmatrix}$.

Formulation

$$\min_{B \in \mathcal{B}^\infty} \ \min_{W \in \mathcal{W}^\infty} \ \min_{\Phi \in \mathbb{R}^{\infty \times t}} L_u(B\Phi; X) + \beta L_s(W\Phi_l; Y_l) + \alpha \|\Phi\|_{p,1} \;=\; \min_{\hat Z \in \mathbb{R}^{(n+k) \times t}} \tilde L(\hat Z; Z) + \alpha \|\hat Z'\|^*_{(\mathcal{U},\,p^*)}$$

Note

Imposing separate constraints on $B$ and $W$.

Questions

  • Is the induced norm $\|\hat Z'\|^*_{(\mathcal{U},\,p^*)}$ efficiently computable?
  • Can the optimal $B$, $W$, $\Phi$ be recovered from the optimal $\hat Z$?

SLIDE 52

Example: sparse coding formulation

Regularizer: $\|\Phi\|_{1,1}$. Constraints: $\mathcal{B}_{q_1} = \{b : \|b\|_{q_1} \le 1\}$, $\mathcal{W}_{q_2} = \{w : \|w\|_{q_2} \le \gamma\}$, $\mathcal{U}^{q_2}_{q_1} = \mathcal{B} \times \mathcal{W}$.

Theorem

$$\|\hat Z'\|^*_{(\mathcal{U}^{q_2}_{q_1},\,\infty)} = \sum_j \max\!\Big( \|\hat Z^X_{:j}\|_{q_1}, \ \tfrac{1}{\gamma} \|\hat Z^Y_{:j}\|_{q_2} \Big)$$

  • efficiently computable

Recovery

$$\Phi_{jj} = \max\!\Big( \|\hat Z^X_{:j}\|_{q_1}, \ \tfrac{1}{\gamma} \|\hat Z^Y_{:j}\|_{q_2} \Big) \ \text{(diagonal matrix)}, \qquad U = \hat Z \Phi^{-1}$$

Preserves optimality. But still reduces to a form of vector quantization. (A computational sketch follows.)
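
A minimal sketch (mine) of computing this induced norm and the diagonal recovery; the block sizes n, k and the constants γ, q₁, q₂ are illustrative assumptions:

```python
import numpy as np

def semisup_sparse_norm(Z_hat, n, gamma, q1=2, q2=2):
    """Induced norm: sum_j max(||Z^X_{:j}||_{q1}, (1/gamma) ||Z^Y_{:j}||_{q2})."""
    ZX, ZY = Z_hat[:n], Z_hat[n:]           # input block (n rows), output block (k rows)
    col = np.maximum(np.linalg.norm(ZX, ord=q1, axis=0),
                     np.linalg.norm(ZY, ord=q2, axis=0) / gamma)
    return col.sum(), col

rng = np.random.default_rng(0)
n, k, t, gamma = 10, 3, 25, 0.5
Z_hat = rng.standard_normal((n + k, t))

val, col = semisup_sparse_norm(Z_hat, n, gamma)
Phi = np.diag(col)                # recovery: diagonal Phi
U = Z_hat @ np.diag(1.0 / col)    # U = Z_hat Phi^{-1}
print(val, np.allclose(U @ Phi, Z_hat))
```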

SLIDE 53

Example: subspace learning formulation

Regularizer: $\|\Phi\|_{2,1}$. Constraints: $\mathcal{B}_2 = \{b : \|b\|_2 \le 1\}$, $\mathcal{W}_2 = \{w : \|w\|_2 \le \gamma\}$, $\mathcal{U}^2_2 = \mathcal{B} \times \mathcal{W}$.

Theorem

$$\|\hat Z'\|^*_{(\mathcal{U}^2_2,\,2)} = \max_{\rho \ge 0} \big\| D_\rho^{-1} \hat Z \big\|_{tr} \qquad \text{where } D_\rho = \begin{bmatrix} \sqrt{1+\gamma\rho}\; I^X & 0 \\ 0 & \sqrt{\tfrac{1+\gamma\rho}{\rho}}\; I^Y \end{bmatrix}$$

  • efficiently computable: quasi-concave in $\rho$ (see the sketch below)
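
A minimal sketch (my own illustration) of evaluating this norm by a one-dimensional search over ρ; it uses golden-section search on log ρ, relying on the quasi-concavity noted above, and the block sizes and γ are assumptions:

```python
import numpy as np

def norm_via_rho(Z_hat, n, gamma, iters=60):
    """max_{rho > 0} || D_rho^{-1} Z_hat ||_tr via golden-section search in log(rho)."""
    def objective(log_rho):
        rho = np.exp(log_rho)
        dX = 1.0 / np.sqrt(1.0 + gamma * rho)      # D_rho^{-1} scaling on the X block
        dY = np.sqrt(rho / (1.0 + gamma * rho))    # D_rho^{-1} scaling on the Y block
        D_inv_Z = np.vstack([dX * Z_hat[:n], dY * Z_hat[n:]])
        return np.linalg.svd(D_inv_Z, compute_uv=False).sum()   # trace norm

    lo, hi = np.log(1e-6), np.log(1e6)
    phi = (np.sqrt(5.0) - 1.0) / 2.0
    a, b = hi - phi * (hi - lo), lo + phi * (hi - lo)
    for _ in range(iters):                          # golden-section for a unimodal maximum
        if objective(a) < objective(b):
            lo = a
        else:
            hi = b
        a, b = hi - phi * (hi - lo), lo + phi * (hi - lo)
    return objective((lo + hi) / 2.0)

rng = np.random.default_rng(0)
print(norm_via_rho(rng.standard_normal((13, 25)), n=10, gamma=0.5))
```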

SLIDE 54

Example: subspace learning formulation

Lemma: dual norm

$$\|\Lambda'\|^2_{(\mathcal{U}^2_2,\,2)} = \max_{h : \|h^X\|_2 = 1,\ \|h^Y\|_2 = \gamma} h' \Lambda \Lambda' h = \max_{H \succeq 0,\ \mathrm{tr}(H I^X) = 1,\ \mathrm{tr}(H I^Y) = \gamma} \mathrm{tr}(H \Lambda \Lambda')$$
$$= \min_{\lambda \ge 0,\ \nu \ge 0 \,:\, \Lambda\Lambda' \preceq \lambda I^X + \nu I^Y} \lambda + \gamma\nu = \min_{\lambda \ge 0,\ \nu \ge 0 \,:\, \|D_{\nu/\lambda}\Lambda\|^2_{sp} \le \lambda + \gamma\nu} \lambda + \gamma\nu$$
$$= \min_{\lambda \ge 0,\ \nu \ge 0} \|D_{\nu/\lambda}\Lambda\|^2_{sp} = \min_{\rho \ge 0} \|D_\rho \Lambda\|^2_{sp}$$

SLIDE 55

Example: subspace learning formulation

Can easily derive the target norm from the dual norm:

$$\|\hat Z'\|^*_{(\mathcal{U}^2_2,\,2)} = \max_{\|\Lambda'\|_{(\mathcal{U}^2_2,\,2)} \le 1} \mathrm{tr}(\Lambda'\hat Z) = \max_{\rho \ge 0} \ \max_{\Lambda : \|D_\rho \Lambda\|_{sp} \le 1} \mathrm{tr}(\Lambda'\hat Z)$$
$$= \max_{\rho \ge 0} \ \max_{\tilde\Lambda : \|\tilde\Lambda\|_{sp} \le 1} \mathrm{tr}\big(\tilde\Lambda' D_\rho^{-1} \hat Z\big) = \max_{\rho \ge 0} \big\| D_\rho^{-1} \hat Z \big\|_{tr}$$

(proves the theorem)

SLIDE 56

Example: subspace learning formulation

Computational strategy

Solve in the dual, since $\|\Lambda'\|_{(\mathcal{U}^2_2,\,2)}$ can be computed efficiently via a partitioned power method iteration:

$$\min_{\Lambda} \tilde L^\star(\Lambda; Z) + \alpha^\star \|\Lambda'\|_{(\mathcal{U}^2_2,\,2)}$$

Given $\hat\Lambda$:

  • Recover $\hat Z^X$ and $\hat Z^Y_l$ by solving

$$\min_{\hat Z^X,\, \hat Z^Y_l} L_u(\hat Z^X; X) + L_s(\hat Z^Y_l; Y_l) - \mathrm{tr}\big(\hat Z^{X\prime} \hat\Lambda^X\big) - \mathrm{tr}\big(\hat Z^{Y\prime}_l \hat\Lambda^Y_l\big)$$

  • Recover $\hat Z^Y_u$ by minimizing $\|\hat Z'\|^*_{(\mathcal{U}^2_2,\,2)}$ (keeping $\hat Z^X$, $\hat Z^Y_l$ fixed)

SLIDE 57

Example: subspace learning formulation

Recovery

Given the optimal $\hat Z$, recover $U$ and $\Phi$ iteratively by repeating:

  • $(\Phi^{(\ell)}, \Lambda^{(\ell)}) \in \arg\min_\Phi \max_\Lambda \|\Phi\|_{2,1} + \mathrm{tr}\big(\Lambda'(\hat Z - U^{(\ell)}\Phi)\big)$
  • $u^{(\ell+1)} \in \arg\max_{u \in \mathcal{U}^2_2} \|u'\Lambda^{(\ell)}\|_2$
  • $U^{(\ell+1)} = [U^{(\ell)}, u^{(\ell+1)}]$

Converges to the optimal $U$ and $\Phi$:

  • $U^{(\ell)}\Phi^{(\ell)} = \hat Z$ for all $\ell$
  • $\|\Phi^{(\ell)}\|_{2,1} \to \|\hat Z'\|^*_{(\mathcal{U}^2_2,\,2)}$

SLIDE 58

Some simple experiments

SLIDE 59

Experimental results: Subspace learning

  • Staged: first locally optimize $B$, $\Phi$, then optimize $W$
  • Alternate: repeatedly optimize over $B$, $W$, $\Phi$ successively
  • Global: recover the joint global minimizer over $B$, $\Phi$, $W$

SLIDE 60

Experimental results: Subspace learning

Objective value achieved

data set    COIL   WBC    BCI    Ionos  G241N
Staged      1.384  1.321  0.799  0.769  1.381
Alternate   0.076  0.122  0.609  0.081  0.076
Global      0.070  0.113  0.069  0.078  0.070

(1/3 labeled, 2/3 unlabeled, squared loss, α⋆ = 10, β = 0.1)

SLIDE 61

Experimental results: Subspace learning

Run time (seconds)

data set    COIL   WBC   BCI   Ionos  G241N
Staged      272    73    45    28     290
Alternate   2352   324   227   112    2648
Global      106    8     25    61     94

(1/3 labeled, 2/3 unlabeled, squared loss, α⋆ = 10, β = 0.1)

SLIDE 62

Experimental results: Subspace learning

Transductive generalization error

data set                COIL   WBC    BCI    Ionos  G241N
Staged                  0.476  0.200  0.452  0.335  0.484
Alternate               0.464  0.388  0.440  0.457  0.478
Global                  0.388  0.134  0.380  0.243  0.380
(Lee et al. 2009)       0.414  0.168  0.436  0.350  0.452
(Goldberg et al. 2010)  0.484  0.288  0.540  0.338  0.524

(1/3 labeled, 2/3 unlabeled, squared loss, α⋆ = 10, β = 0.1)

SLIDE 63

Conclusion

Global training can be more efficient than local training

Alternation is inherently slow to converge

Global training simplifies practical application

  • no under-training
  • only need to guard against over-fitting
  • can use standard regularization techniques