Part 3: Latent representations and unsupervised learning
Dale Schuurmans, University of Alberta
Supervised versus unsupervised learning
Prominent training principles
- Discriminative (predict y from x): typical for supervised learning
- Generative (model the data x): typical for unsupervised learning
Unsupervised representation learning
Consider generative training: a latent representation φ generates the data x
Examples
- dimensionality reduction (PCA, exponential family PCA)
- sparse coding
- independent component analysis
- deep learning
. . .
Usually involves learning both
a latent representation for data and a data reconstruction model
Context
could be: unsupervised, semi-supervised, or supervised
Challenge
Optimal feature discovery appears to be generally intractable
Have to jointly train
- latent representation
- data reconstruction model
Usually resort to alternating minimization
(sole exception: PCA)
First consider unsupervised feature discovery
Unsupervised feature discovery
Single layer case = matrix factorization
X ≈ B Φ

- X ∈ R^{n×t}: original data
- B ∈ R^{n×m}: learned dictionary
- Φ ∈ R^{m×t}: new representation

t = # training examples, n = # original features, m = # new features
Choose B and Φ to minimize data reconstruction loss
L(BΦ; X) = Σ_{i=1}^t L(BΦ_{:i}; X_{:i})
Seek desired structure in latent feature representation
- Φ low rank: dimensionality reduction
- Φ sparse: sparse coding
- Φ rows independent: independent component analysis
(a small sketch of the tractable squared-loss case follows below)
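As a concrete instance of this setup, here is a minimal numpy sketch, not from the slides, of the squared-loss case, where the optimal rank-m factorization is given by a truncated SVD (PCA); function and variable names are my own.

```python
import numpy as np

def pca_factorization(X, m):
    """Factor X (n x t) as B (n x m) times Phi (m x t), minimizing ||X - B @ Phi||_F^2.
    For squared loss the optimum is the rank-m truncated SVD (the tractable PCA case)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    B = U[:, :m]                      # learned dictionary (orthonormal columns)
    Phi = s[:m, None] * Vt[:m, :]     # new representation
    return B, Phi

# toy usage: reconstruction loss decreases as m grows
X = np.random.randn(20, 100)
for m in (1, 5, 10):
    B, Phi = pca_factorization(X, m)
    print(m, np.linalg.norm(X - B @ Phi) ** 2)
```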
Generalized matrix factorization
Assume the reconstruction loss L(x̂; x) is convex in its first argument

Bregman divergence
L(x̂; x) = D_F(x̂ ‖ x) = D_{F*}(f(x) ‖ f(x̂))   (F a strictly convex potential with transfer f = ∇F)
Tries to make x̂ ≈ x

Matching loss
L(x̂; x) = D_F(x̂ ‖ f^{-1}(x)) = D_{F*}(x ‖ f(x̂))
Tries to make f(x̂) ≈ x   (a nonlinear predictor, but the loss is still convex in x̂)
Regular exponential family
L(x̂; x) = −log p_B(x | φ) = D_F(x̂ ‖ f^{-1}(x)) − F*(x) − const
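To make the Bregman-divergence definition above concrete, here is a minimal numpy sketch (my own illustration, not from the slides) showing how the squared loss and an exponential-potential loss both arise as D_F(x̂ ‖ x); the function names are illustrative.

```python
import numpy as np

def bregman(F, grad_F, xhat, x):
    """Bregman divergence D_F(xhat || x) = F(xhat) - F(x) - <grad_F(x), xhat - x>.
    Convex in xhat whenever F is convex."""
    return F(xhat) - F(x) - np.dot(grad_F(x), xhat - x)

# F(z) = 0.5 ||z||^2 gives the squared loss: D_F(xhat || x) = 0.5 ||xhat - x||^2
F_sq, g_sq = lambda z: 0.5 * np.dot(z, z), lambda z: z

# F(z) = sum exp(z) (transfer f = exp) gives a non-quadratic, exponential-family-style loss
F_exp, g_exp = lambda z: np.sum(np.exp(z)), lambda z: np.exp(z)

x, xhat = np.array([1.0, 2.0]), np.array([1.5, 1.0])
print(bregman(F_sq, g_sq, xhat, x))     # equals 0.5 * ||xhat - x||^2
print(bregman(F_exp, g_exp, xhat, x))   # still convex in xhat
```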
Training problem

min_{B∈R^{n×m}} min_{Φ∈R^{m×t}} L(BΦ; X)

How to impose desired structure on Φ?
Dimensionality reduction
- Fix the number of features m < min(n, t)
- But only known to be tractable if L(X̂; X) = ‖X̂ − X‖²_F (PCA)
- No known efficient algorithm for other standard losses

Problem
The rank(Φ) = m constraint is too hard
Relaxed dimensionality reduction (subspace learning)

min_{B∈B_2^m} min_{Φ∈R^{m×t}} L(BΦ; X) + α‖Φ‖_{2,1}

- Add the rank-reducing regularizer ‖Φ‖_{2,1} = Σ_{j=1}^m ‖Φ_{j:}‖_2
- Favors null rows in Φ
- But need to add a constraint on B: B_{:j} ∈ B_2 = {b : ‖b‖_2 ≤ 1}
  (otherwise Φ can be made small just by making B large)
Sparse coding

min_{B∈B_q^m} min_{Φ∈R^{m×t}} L(BΦ; X) + α‖Φ‖_{1,1}

- Use the sparsity-inducing regularizer ‖Φ‖_{1,1} = Σ_{j=1}^m Σ_{i=1}^t |Φ_{ji}|
- Favors sparse entries in Φ
- Need to add a constraint on B: B_{:j} ∈ B_q = {b : ‖b‖_q ≤ 1}
  (otherwise Φ can be made small just by making B large)
(a small sketch of these two regularizers follows)
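For concreteness, the two regularizers just defined can be computed in a couple of lines; a tiny numpy sketch (my own):

```python
import numpy as np

def norm_21(Phi):
    # sum over rows j of the Euclidean norm of row Phi[j, :]  (favors null rows)
    return np.sum(np.linalg.norm(Phi, axis=1))

def norm_11(Phi):
    # sum of absolute values of all entries  (favors sparse entries)
    return np.sum(np.abs(Phi))

Phi = np.array([[0.0, 0.0, 0.0],
                [3.0, 4.0, 0.0]])
print(norm_21(Phi))   # 0 + 5 = 5: the null row contributes nothing
print(norm_11(Phi))   # 7
```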
Independent component analysis

min_{B∈R^{n×m}} min_{Φ∈R^{m×t}} L(BΦ; X) + α D(Φ)

- Usually enforces BΦ = X as a constraint, but interpolation is generally a bad idea
- Instead just minimize the reconstruction loss plus a dependence measure D(Φ) as a regularizer
Difficulty
Formulating a reasonable convex dependence penalty
Training problem
Consider subspace learning and sparse coding:

min_{B∈B^m} min_{Φ∈R^{m×t}} L(BΦ; X) + α‖Φ‖

Choice of the norm ‖Φ‖ and the constraint set B determines the type of representation recovered.

Problem
Still have the rank constraint imposed by the number of new features m

Idea
Just relax m → ∞:

min_{B∈B^∞} min_{Φ∈R^{∞×t}} L(BΦ; X) + α‖Φ‖

- Rely on the sparsity-inducing norm ‖Φ‖ to select features
Still have a problem
The optimization problem is not jointly convex in B and Φ
Idea 1: Alternate!
- convex in B given Φ
- convex in Φ given B
- Could use any other form of local training
(a minimal alternating-minimization sketch follows)
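As an illustration of this local strategy, here is a minimal numpy sketch of alternating minimization for the sparse-coding objective (squared loss assumed). It is my own schematic implementation, not the authors' solver: an ISTA loop for the convex Φ-step and a heuristic projected least-squares update for B.

```python
import numpy as np

def alternate_sparse_coding(X, m, alpha, iters=50, inner=50):
    """Local (alternating) training for  min_{B,Phi} ||X - B Phi||_F^2 + alpha ||Phi||_{1,1}
    with columns of B constrained to the unit L2 ball."""
    n, t = X.shape
    rng = np.random.default_rng(0)
    B = rng.standard_normal((n, m))
    B /= np.linalg.norm(B, axis=0, keepdims=True)
    Phi = np.zeros((m, t))
    for _ in range(iters):
        # Phi-step: ISTA on the convex problem in Phi given B
        step = 1.0 / (2.0 * np.linalg.norm(B, 2) ** 2 + 1e-12)
        for _ in range(inner):
            grad = 2 * B.T @ (B @ Phi - X)
            Z = Phi - step * grad
            Phi = np.sign(Z) * np.maximum(np.abs(Z) - step * alpha, 0.0)
        # B-step: unconstrained least squares given Phi, then a simple heuristic
        # projection of each column onto the unit L2 ball
        B = X @ Phi.T @ np.linalg.pinv(Phi @ Phi.T)
        B /= np.maximum(np.linalg.norm(B, axis=0, keepdims=True), 1.0)
    return B, Phi
```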
Idea 2: Boost!
- Implicitly fix B to a universal dictionary
- Keep a row-wise sparse Φ
- Incrementally select a column of B ("weak learning problem")
- Update the sparse Φ
Can prove convergence under broad conditions
Idea 3: Solve!
- Can easily solve for the globally optimal joint B and Φ
- But this requires a significant reformulation
A useful observation
Equivalent reformulation
Theorem

min_{B∈B^∞} min_{Φ∈R^{∞×t}} L(BΦ; X) + α‖Φ‖_{p,1}  =  min_{X̂∈R^{n×t}} L(X̂; X) + α|X̂|

- |·| is an induced matrix norm on X̂ determined by B and ‖·‖_{p,1}

Important fact
Norms are always convex

Computational strategy
- 1. Solve for the optimal response matrix X̂ first (convex minimization)
- 2. Then recover the optimal B and Φ from X̂
(a schematic solver for step 1 follows)
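The strategy above can be sketched generically. The following is my own schematic proximal-gradient loop, assuming a squared loss and that the proximal operator of the induced norm |·| is available (it is for the special cases worked out on the next slides); the names are illustrative.

```python
import numpy as np

def solve_xhat(X, alpha, prox_norm, iters=100, step=0.5):
    """Schematic proximal gradient for  min_Xhat 0.5*||Xhat - X||_F^2 + alpha*|Xhat|,
    where prox_norm(Z, tau) is the proximal operator of tau*|.|."""
    Xhat = X.copy()
    for _ in range(iters):
        grad = Xhat - X                              # gradient of the smooth term
        Xhat = prox_norm(Xhat - step * grad, step * alpha)
    return Xhat
```

For this particular loss a single proximal step at X already gives the exact minimizer; the loop only indicates how other smooth losses L would be handled.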
Example: subspace learning
min_{B∈B_2^∞} min_{Φ∈R^{∞×t}} L(BΦ; X) + α‖Φ‖_{2,1}  =  min_{X̂∈R^{n×t}} L(X̂; X) + α‖X̂‖_tr

Recovery
- Let UΣV′ = svd(X̂)
- Set B = U and Φ = ΣV′

Preserves optimality
- ‖B_{:j}‖_2 = 1, hence B ∈ B_2^n
- ‖Φ‖_{2,1} = ‖ΣV′‖_{2,1} = Σ_j σ_j ‖V_{:j}‖_2 = Σ_j σ_j = ‖X̂‖_tr

Thus L(X̂; X) + α‖X̂‖_tr = L(BΦ; X) + α‖Φ‖_{2,1}
(a small numerical sketch of this solve-and-recover step follows)
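A minimal numpy sketch (mine, assuming the squared loss 0.5‖X̂ − X‖²_F) of the global subspace-learning solve and the SVD-based recovery just described; singular value soft-thresholding is the proximal operator of the trace norm.

```python
import numpy as np

def subspace_learning_global(X, alpha):
    """Global solution of 0.5*||Xhat - X||_F^2 + alpha*||Xhat||_tr, then B, Phi recovery."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    s_thresh = np.maximum(s - alpha, 0.0)          # singular value soft-thresholding
    Xhat = (U * s_thresh) @ Vt
    B = U                                          # columns have unit 2-norm
    Phi = s_thresh[:, None] * Vt                   # Phi = Sigma V'
    # check: ||Phi||_{2,1} equals the trace norm of Xhat
    assert np.isclose(np.sum(np.linalg.norm(Phi, axis=1)), np.sum(s_thresh))
    return Xhat, B, Phi
```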
Example: sparse coding
min_{B∈B_q^∞} min_{Φ∈R^{∞×t}} L(BΦ; X) + α‖Φ‖_{1,1}  =  min_{X̂∈R^{n×t}} L(X̂; X) + α‖X̂′‖_{q,1}

Recovery
- B = [ X̂_{:1}/‖X̂_{:1}‖_q, ..., X̂_{:t}/‖X̂_{:t}‖_q ]   (rescaled columns)
- Φ = diag( ‖X̂_{:1}‖_q, ..., ‖X̂_{:t}‖_q )   (diagonal matrix)

Preserves optimality
- ‖B_{:j}‖_q = 1, hence B ∈ B_q^t
- ‖Φ‖_{1,1} = Σ_j ‖X̂_{:j}‖_q = ‖X̂′‖_{q,1}

Thus L(X̂; X) + α‖X̂′‖_{q,1} = L(BΦ; X) + α‖Φ‖_{1,1}
(a small recovery sketch follows)
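A small numpy sketch (my own illustration) of the recovery just described: rescale the nonzero columns of X̂ and put their q-norms on the diagonal of Φ.

```python
import numpy as np

def sparse_coding_recovery(Xhat, q=2):
    """Recover B (rescaled columns of Xhat) and a diagonal Phi from the optimal Xhat."""
    col_norms = np.linalg.norm(Xhat, ord=q, axis=0)
    keep = col_norms > 0                       # columns dropped by the regularizer stay dropped
    B = Xhat[:, keep] / col_norms[keep]        # each kept column rescaled to unit q-norm
    Phi = np.zeros((keep.sum(), Xhat.shape[1]))
    Phi[np.arange(keep.sum()), np.where(keep)[0]] = col_norms[keep]
    assert np.allclose(B @ Phi, Xhat)
    return B, Phi
```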
Example: sparse coding
Outcome
Sparse coding with ‖·‖_{1,1} regularization = vector quantization
- drops some examples
- memorizes remaining examples
Optimal solution is not overcomplete
Could not make these observations using local solvers
Simple extensions
- Missing observations in X
- Robustness to outliers in X
min_{S∈R^{n×t}} min_{X̂∈R^{n×t}} L( (X̂ + S)_Ω ; X_Ω ) + α|X̂| + β‖S‖_{1,1}

- Ω = observed indices in X
- S = speckled outlier noise
- (jointly convex in X̂ and S)
(a convex-programming sketch of this extension follows)
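One way to handle this jointly convex extension is to hand it to a generic convex solver; the sketch below uses CVXPY with the trace-norm instantiation of |·| from the subspace-learning case. This is my own illustrative choice of tooling, not the authors' solver.

```python
import cvxpy as cp
import numpy as np

def robust_completion(X, mask, alpha, beta):
    """Jointly convex: min over Xhat, S of ||mask * (Xhat + S - X)||_F^2
    + alpha*||Xhat||_tr + beta*||S||_{1,1}; mask is 1 on observed entries, 0 elsewhere."""
    n, t = X.shape
    Xhat = cp.Variable((n, t))
    S = cp.Variable((n, t))
    loss = cp.sum_squares(cp.multiply(mask, Xhat + S - X))
    objective = loss + alpha * cp.normNuc(Xhat) + beta * cp.sum(cp.abs(S))
    cp.Problem(cp.Minimize(objective)).solve()
    return Xhat.value, S.value
```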
Explaining the useful result
Theorem

min_{B∈B^∞} min_{Φ∈R^{∞×t}} L(BΦ; X) + α‖Φ‖_{p,1}  =  min_{X̂∈R^{n×t}} L(X̂; X) + α|X̂|

for an induced matrix norm |X̂| = ‖X̂′‖*_{(B,p*)}

A dual norm
‖X̂′‖*_{(B,p*)} = max_{‖Λ′‖_{(B,p*)} ≤ 1} tr(Λ′X̂)   (standard definition of a dual norm)

...of a vector-norm-induced matrix norm
‖Λ′‖_{(B,p*)} = max_{b∈B} ‖Λ′b‖_{p*}
(easy to prove this yields a norm on matrices; a small numerical check follows)
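A quick numerical sanity check (mine) of the induced-norm definition: for B the unit ℓ2 ball and p* = 2, max_{b∈B} ‖Λ′b‖_2 is just the spectral norm of Λ′.

```python
import numpy as np

rng = np.random.default_rng(0)
Lam = rng.standard_normal((5, 7))

# crude sampling estimate of max_{||b||_2 <= 1} ||Lam' b||_2 over random unit vectors
bs = rng.standard_normal((5, 100000))
bs /= np.linalg.norm(bs, axis=0)
sampled = np.max(np.linalg.norm(Lam.T @ bs, axis=0))

# the sampled value lower-bounds and typically gets close to the spectral norm of Lam'
print(sampled, np.linalg.norm(Lam.T, 2))
```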
Proof outline

min_{B∈B^∞} min_{Φ∈R^{∞×t}} L(BΦ; X) + α‖Φ‖_{p,1}
  = min_{X̂∈R^{n×t}} min_{B∈B^∞} min_{Φ: BΦ=X̂} L(X̂; X) + α‖Φ‖_{p,1}
  = min_{X̂∈R^{n×t}} L(X̂; X) + α min_{B∈B^∞} min_{Φ: BΦ=X̂} ‖Φ‖_{p,1}

For any B ∈ B^∞ that spans the columns of X̂:

min_{Φ: BΦ=X̂} ‖Φ‖_{p,1}
  = min_Φ max_Λ max_{‖V‖_{p*,∞}≤1} tr(V′Φ) + tr(Λ′(X̂ − BΦ))
  = max_{‖V‖_{p*,∞}≤1} max_Λ min_Φ tr(Λ′X̂) + tr(Φ′(V − B′Λ))
  = max_{‖V‖_{p*,∞}≤1} max_{Λ: B′Λ=V} tr(Λ′X̂)
  = max_{Λ: ‖B′Λ‖_{p*,∞}≤1} tr(Λ′X̂)

Therefore

min_{B∈B^∞} min_{Φ∈R^{∞×t}} L(BΦ; X) + α‖Φ‖_{p,1}
  = min_{X̂∈R^{n×t}} L(X̂; X) + α min_{B∈B^∞} max_{Λ: ‖B′Λ‖_{p*,∞}≤1} tr(Λ′X̂)
  = min_{X̂∈R^{n×t}} L(X̂; X) + α max_{Λ: ‖B′Λ‖_{p*,∞}≤1 ∀B∈B^∞} tr(Λ′X̂)
  = min_{X̂∈R^{n×t}} L(X̂; X) + α max_{Λ: ‖b′Λ‖_{p*}≤1 ∀b∈B} tr(Λ′X̂)
  = min_{X̂∈R^{n×t}} L(X̂; X) + α max_{Λ: ‖Λ′‖_{(B,p*)}≤1} tr(Λ′X̂)
  = min_{X̂∈R^{n×t}} L(X̂; X) + α ‖X̂′‖*_{(B,p*)}

done
Closed form induced norms
Theorem

min_{B∈B^∞} min_{Φ∈R^{∞×t}} L(BΦ; X) + α‖Φ‖_{p,1}  =  min_{X̂∈R^{n×t}} L(X̂; X) + α‖X̂′‖*_{(B,p*)}

Special cases
- B_2, ‖Φ‖_{2,1}  →  ‖X̂′‖*_{(B_2,2)} = ‖X̂‖_tr   (subspace learning)
- B_q, ‖Φ‖_{1,1}  →  ‖X̂′‖*_{(B_q,∞)} = ‖X̂′‖_{q,1}   (sparse coding)
- B_1, ‖Φ‖_{p,1}  →  ‖X̂′‖*_{(B_1,p*)} = ‖X̂‖_{p,1}
(these closed forms can be evaluated directly; a sketch follows)
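For reference, the three closed-form norms in the special cases above can be evaluated directly; a small numpy sketch (my own, with illustrative names):

```python
import numpy as np

def trace_norm(Xhat):
    return np.sum(np.linalg.svd(Xhat, compute_uv=False))      # subspace learning case

def col_q1_norm(Xhat, q):
    return np.sum(np.linalg.norm(Xhat, ord=q, axis=0))        # ||Xhat'||_{q,1}: sum of column q-norms

def row_p1_norm(Xhat, p):
    return np.sum(np.linalg.norm(Xhat, ord=p, axis=1))        # ||Xhat||_{p,1}: sum of row p-norms

Xhat = np.random.randn(4, 6)
print(trace_norm(Xhat), col_q1_norm(Xhat, 2), row_p1_norm(Xhat, 2))
```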
Some simple experiments
Experimental results
- Alternate: repeatedly optimize over B and Φ successively
- Global: recover the global joint minimizer over B and Φ
Experimental results: Sparse coding
Objective value achieved

data set    COIL    WBC     BCI     Ionos   G241N
Alternate   1.314   4.918   0.898   1.612   1.312
Global      0.207   0.659   0.306   0.330   0.207
(values ×10^{-2}; squared loss, q = 2, α = 10^{-5})
Experimental results: Sparse coding
Run time (seconds)

data set    COIL    WBC     BCI     Ionos   G241N
Alternate    1.95   10.54    0.88    1.71    2.37
Global       0.06    0.01    0.01    0.01    0.09
(squared loss, q = 2, α = 10^{-5})
Experimental results: Subspace learning
Objective value achieved

data set    COIL    WBC     BCI     Ionos   G241N
Alternate   1.314   4.957   0.903   1.632   1.313
Global      0.072   0.072   0.092   0.079   0.205
(values ×10^{-2}; squared loss, α = 10^{-5})
Experimental results: Subspace learning
Run time (seconds)

data set    COIL    WBC     BCI     Ionos   G241N
Alternate    2.40    9.31    1.12    0.47    2.43
Global       2.18    0.06    0.19    0.06    2.11
(squared loss, α = 10^{-5})
Catch
Every norm is convex, but not every induced matrix norm is tractable:
- ‖X‖_2 = σ_max(X)
- ‖X‖_1 = max_j Σ_i |X_ij|
- ‖X‖_∞ = max_i Σ_j |X_ij|
- ‖X‖_p is NP-hard to approximate for p ∉ {1, 2, ∞}
Question
Any other useful induced matrix norms that are tractable?
Yes!
Semi-supervised feature discovery
[Xl  Xu ; Yl  ∅] ≈ [B ; W] [Φl  Φu]   (an (n+k) × (tl + tu) reconstruction)

- Xl: labeled inputs,  Xu: unlabeled inputs,  Yl: labeled outputs
- tl = # labeled,  tu = # unlabeled,  t = tl + tu
- n = # original features,  k = # output dimensions

Learn
- Φ = [Φl, Φu]: data representation
- B: input reconstruction model,  f(BΦ) ≈ X
- W: output reconstruction model,  h(WΦl) ≈ Yl
Semi-supervised feature discovery
Let Z = [Xl  Xu ; Yl  ∅]   and   U = [B ; W]

Formulation

min_{B∈B^∞} min_{W∈W^∞} min_{Φ∈R^{∞×t}} Lu(BΦ; X) + β Ls(WΦl; Yl) + α‖Φ‖_{p,1}
  = min_{Ẑ∈R^{(n+k)×t}} L̃(Ẑ; Z) + α‖Ẑ′‖*_{(U,p*)}

Note
Imposing separate constraints on B and W

Questions
- Is the induced norm ‖Ẑ′‖*_{(U,p*)} efficiently computable?
- Can the optimal B, W, Φ be recovered from the optimal Ẑ?
Example: sparse coding formulation
Regularizer: ‖Φ‖_{1,1}
Constraints: B_{q1} = {b : ‖b‖_{q1} ≤ 1},  W_{q2} = {w : ‖w‖_{q2} ≤ γ},  U_{q1,q2} = B_{q1} × W_{q2}

Theorem
‖Ẑ′‖*_{(U_{q1,q2},∞)} = Σ_j max( ‖Ẑ^X_{:j}‖_{q1} , (1/γ)‖Ẑ^Y_{:j}‖_{q2} )
- efficiently computable

Recovery
- Φ_{jj} = max( ‖Ẑ^X_{:j}‖_{q1} , (1/γ)‖Ẑ^Y_{:j}‖_{q2} )   (diagonal matrix)
- U = ẐΦ^{-1}

Preserves optimality
But still reduces to a form of vector quantization
(a small sketch of this norm and recovery follows)
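A small numpy sketch (mine) of the closed-form induced norm and the diagonal recovery stated above, assuming Ẑ stacks the input block (first n rows) on top of the output block:

```python
import numpy as np

def induced_norm_sparse(Zhat, n, gamma, q1=2, q2=2):
    """Sum over columns j of max(||Zhat^X_:j||_{q1}, (1/gamma)*||Zhat^Y_:j||_{q2}),
    where the first n rows of Zhat are the input block and the rest the output block."""
    Zx, Zy = Zhat[:n, :], Zhat[n:, :]
    vals = np.maximum(np.linalg.norm(Zx, ord=q1, axis=0),
                      np.linalg.norm(Zy, ord=q2, axis=0) / gamma)
    return np.sum(vals), vals

def recover_U_Phi(Zhat, n, gamma, q1=2, q2=2):
    _, vals = induced_norm_sparse(Zhat, n, gamma, q1, q2)
    Phi = np.diag(vals)                 # diagonal representation
    U = Zhat @ np.linalg.inv(Phi)       # U = Zhat Phi^{-1}  (assumes all vals > 0)
    return U, Phi
```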
Example: subspace learning formulation
Regularizer: ‖Φ‖_{2,1}
Constraints: B_2 = {b : ‖b‖_2 ≤ 1},  W_2 = {w : ‖w‖_2 ≤ γ},  U_{2,2} = B_2 × W_2

Theorem
‖Ẑ′‖*_{(U_{2,2},2)} = max_{ρ≥0} ‖D_ρ^{-1} Ẑ‖_tr   where D_ρ = diag( √(1+γρ) I , √((1+γρ)/ρ) I )
- efficiently computable: quasi-concave in ρ
Example: subspace learning formulation
Lemma: dual norm
‖Λ′‖²_{(U_{2,2},2)}
  = max_{h: ‖h^X‖_2=1, ‖h^Y‖_2=γ} h′ΛΛ′h
  = max_{H: H⪰0, tr(H I^X)=1, tr(H I^Y)=γ} tr(HΛΛ′)
  = min_{λ≥0, ν≥0: ΛΛ′ ⪯ λI^X + νI^Y} λ + γν
  = min_{λ≥0, ν≥0: ‖D_{ν/λ}Λ‖²_sp ≤ λ+γν} λ + γν
  = min_{λ≥0, ν≥0} ‖D_{ν/λ}Λ‖²_sp
  = min_{ρ≥0} ‖D_ρΛ‖²_sp
Example: subspace learning formulation
Can easily derive target norm from dual norm
‖Ẑ′‖*_{(U_{2,2},2)}
  = max_{‖Λ′‖_{(U_{2,2},2)} ≤ 1} tr(Λ′Ẑ)
  = max_{ρ≥0} max_{Λ: ‖D_ρΛ‖_sp ≤ 1} tr(Λ′Ẑ)
  = max_{ρ≥0} max_{Λ̃: ‖Λ̃‖_sp ≤ 1} tr(Λ̃′ D_ρ^{-1} Ẑ)
  = max_{ρ≥0} ‖D_ρ^{-1} Ẑ‖_tr   (proves the theorem)
Example: subspace learning formulation
Computational strategy
Solve in the dual, since ‖Λ′‖_{(U_{2,2},2)} can be computed efficiently via a partitioned power-method iteration:

min_Λ L̃*(Λ; Z) + α* ‖Λ′‖_{(U_{2,2},2)}

Given Λ̂
- Recover Ẑ^X and Ẑ^Y_l by solving
  min_{Ẑ^X, Ẑ^Y} Lu(Ẑ^X; X) + Ls(Ẑ^Y_l; Yl) − tr(Ẑ^X′ Λ̂^X) − tr(Ẑ^Y_l′ Λ̂^Y_l)
- Recover Ẑ^Y_u by minimizing ‖Ẑ′‖*_{(U_{2,2},2)} (keeping Ẑ^X and Ẑ^Y_l fixed)
Example: subspace learning formulation
Recovery
Given the optimal Ẑ, recover U and Φ iteratively by repeating:
- (Φ^(ℓ), Λ^(ℓ)) ∈ arg min_Φ max_Λ ‖Φ‖_{2,1} + tr(Λ′(Ẑ − U^(ℓ)Φ))
- u^(ℓ+1) ∈ arg max_{u∈U_{2,2}} ‖u′Λ^(ℓ)‖_2
- U^(ℓ+1) = [U^(ℓ), u^(ℓ+1)]

Converges to the optimal U and Φ
- U^(ℓ)Φ^(ℓ) = Ẑ for all ℓ
- ‖Φ^(ℓ)‖_{2,1} → ‖Ẑ′‖*_{(U_{2,2},2)}
Some simple experiments
Experimental results: Subspace learning
- Staged: first locally optimize B, Φ, then optimize W
- Alternate: repeatedly optimize over B, W, Φ successively
- Global: recover the joint global minimizer over B, Φ, W
Experimental results: Subspace learning
Objective value achieved
data set    COIL    WBC     BCI     Ionos   G241N
Staged      1.384   1.321   0.799   0.769   1.381
Alternate   0.076   0.122   0.609   0.081   0.076
Global      0.070   0.113   0.069   0.078   0.070
(1/3 labeled, 2/3 unlabeled, squared loss, α* = 10, β = 0.1)
Experimental results: Subspace learning
Run time (seconds)
data set    COIL   WBC   BCI   Ionos  G241N
Staged       272    73    45     28    290
Alternate   2352   324   227    112   2648
Global       106     8    25     61     94
(1/3 labeled, 2/3 unlabeled, squared loss, α* = 10, β = 0.1)
Experimental results: Subspace learning
Transductive generalization error
data set                COIL    WBC     BCI     Ionos   G241N
Staged                  0.476   0.200   0.452   0.335   0.484
Alternate               0.464   0.388   0.440   0.457   0.478
Global                  0.388   0.134   0.380   0.243   0.380
(Lee et al. 2009)       0.414   0.168   0.436   0.350   0.452
(Goldberg et al. 2010)  0.484   0.288   0.540   0.338   0.524
(1/3 labeled, 2/3 unlabeled, squared loss, α* = 10, β = 0.1)
Conclusion
Global training can be more efficient than local training
Alternation is inherently slow to converge
Global training simplifies practical application
- no under-training
- only need to guard against over-fitting
- can use standard regularization techniques