Guaranteed Learning of Latent Variable Models through Tensor Methods
Furong Huang
University of Maryland
furongh@cs.umd.edu ACM SIGMETRICS Tutorial 2018
[Figure: a latent variable model for text, where a hidden choice variable h generates observed words (e.g., "life", "gene", "data", "DNA", "RNA") through topic-word distributions A.]
Pipeline: Unlabeled data → Latent variable model → Tensor decomposition → Inference
“Scalable Latent Tree Model and its Application to Health Analytics”, by F. Huang, N. U. Niranjan, I. Perros, R. Chen, J. Sun, …
[Figure: latent variable model for text, as above.]
Unlabeled data → Latent variable model → Inference: which learning algorithm?
◮ MCMC inference: exponential mixing time
◮ Likelihood-based estimation (e.g., EM / variational): exponentially many critical points
◮ Tensor decomposition inference: decompose a moment tensor into a sum of rank-1 terms (tensor = rank-1 + rank-1 + rank-1)
Tutorial outline (seven parts), including:
◮ Identifiability; parameter recovery via decomposition of exact moments
◮ Decomposition for tensors with linearly independent components; decomposition for tensors with orthogonal components
Maximum-likelihood estimation. Given samples {x_i}_{i=1}^n, the MLE is
θ̂ = arg max_{θ∈Θ} ∏_{i=1}^n Pr_θ(x_i).
◮ No “direct” estimators when some variables are hidden
◮ Local optimization via Expectation-Maximization (EM) (Dempster, Laird, & Rubin, 1977)
Example: mixtures of Gaussians. Given samples {x_i}_{i=1}^n and the number of Gaussian components K, the MLE is
θ̂ = arg max_θ ∑_{i=1}^n log ∑_{h=1}^K [ π_h / ((2π)^{d/2} det(Σ_h)^{1/2}) ] exp( −(1/2)(x_i − µ_h)^⊤ Σ_h^{−1} (x_i − µ_h) ),
which is hard to maximize in general (Hansen, & Popat, 2009; Mahajan, Nimbhorkar, & Varadarajan, 2009; Vattani, 2009; Awasthi, Charikar, Krishnaswamy, & Sinop, 2015).
Even in the well-specified setting, where the samples {x_i}_{i=1}^n are generated by a distribution Pr_θ(x_i) from the model class, EM comes with few global guarantees and spurious local optima can exist (Daskalakis, Tzamos, & Zampetakis, 2016; Zhang, Balakrishnan, Wainwright, & Jordan, 2016).
When does learning become tractable? Add assumptions:
◮ Separation: e.g., assume min_{i≠j} ‖µ_i − µ_j‖² / (σ_i² + σ_j²) is sufficiently large (Dasgupta, 1999; Arora & Kannan, 2001; Vempala & Wang, 2002; …)
◮ Structure: e.g., assume sparsity, separability (anchor words) (Spielman, Wang, & Wright, 2012; Arora, Ge, & Moitra, 2012; …)
◮ Non-degeneracy: e.g., assume µ_1, …, µ_K span a K-dimensional space
The method of moments:
1. Moments: E_θ[f(X)], expressed as a function of the parameters θ
2. Empirical moments, from samples {x_i}_{i=1}^n: (1/n) ∑_{i=1}^n f(x_i)
3. Moment matching: choose θ̂ so that E_θ̂[f(X)] = (1/n) ∑_{i=1}^n f(x_i); the two sides agree as n → ∞
Example: how would you estimate the parameters from {x_i}_{i=1}^n ∼ N(µ, Σ²)?
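To make step 3 concrete, here is a minimal sketch (Python/NumPy, on an assumed synthetic univariate data set) of moment matching for the Gaussian question above: matching E[X] = µ and E[X²] = µ² + σ² to the empirical first and second moments gives closed-form estimates.

```python
import numpy as np

rng = np.random.default_rng(0)
mu_true, sigma_true = 2.0, 1.5                 # assumed ground truth for the demo
x = rng.normal(mu_true, sigma_true, size=10_000)

# Moments as functions of theta = (mu, sigma):  E[X] = mu,  E[X^2] = mu^2 + sigma^2.
m1_hat = x.mean()                              # empirical first moment
m2_hat = (x ** 2).mean()                       # empirical second moment

# Moment matching: solve E_theta[f(X)] = empirical moments for theta.
mu_hat = m1_hat
sigma_hat = np.sqrt(m2_hat - m1_hat ** 2)
print(f"mu_hat = {mu_hat:.3f}, sigma_hat = {sigma_hat:.3f}")
```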
Matrix decompositions are not unique:
e_1 e_1^⊤ + e_2 e_2^⊤ = u_1 u_1^⊤ + u_2 u_2^⊤, where u_1 = [√2/2, √2/2]^⊤ and u_2 = [√2/2, −√2/2]^⊤.
The rank-1 components of a matrix are unique only with an eigenvalue gap.
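A quick numerical check of the identity above (a NumPy sketch; the vectors e_1, e_2, u_1, u_2 are exactly those on the slide):

```python
import numpy as np

e1, e2 = np.eye(2)                              # standard basis vectors
u1 = np.array([np.sqrt(2) / 2,  np.sqrt(2) / 2])
u2 = np.array([np.sqrt(2) / 2, -np.sqrt(2) / 2])

lhs = np.outer(e1, e1) + np.outer(e2, e2)       # e1 e1^T + e2 e2^T
rhs = np.outer(u1, u1) + np.outer(u2, u2)       # u1 u1^T + u2 u2^T
print(np.allclose(lhs, rhs))                    # True: two different rank-1 decompositions
```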
Topic models. K topics, each associated with a distribution {a_h}_{h=1}^K over words.
◮ Mixed-membership version: per document i, w(i) ∈ ∆^{K−1}
◮ Single-topic version (used below): per document i, w(i) ∈ {e_1, …, e_K}
[Figure: word count per document and the topic-word matrix; example topics Politics, Science, Sports, Business, with words such as “game”, “season”, “play”.]
Learning problem: estimate the topics {a_h}_{h=1}^K, given i.i.d. samples of per-document word counts ({c^(i)}_{i=1}^n), where the length of document i is L = ∑_j c^(i)_j.
[Figure: topic-word matrix over topics Politics, Science, Sports, Business.]
Generative process (single-topic model):
◮ Choose h ∼ Cat(w_1, …, w_K)
◮ Generate L words ∼ a_h
First moments:
E[x | topic = h] = a_h;  E[x] = ∑_{h=1}^K P[topic = h] · E[x | topic = h] = ∑_{h=1}^K w_h a_h
Population moments and empirical (plug-in) estimates:
◮ M_1 = E[x] = ∑_{h=1}^K w_h a_h;  M̂_1 = (1/n) ∑_{i=1}^n x^(i)
◮ M_2 = E[x ⊗ x] = ∑_{h=1}^K w_h a_h ⊗ a_h;  M̂_2 = (1/n) ∑_{i=1}^n x^(i) ⊗ x^(i)
◮ M_3 = E[x^⊗3] = ∑_{h=1}^K w_h a_h^⊗3;  M̂_3 = (1/n) ∑_{i=1}^n (x^(i))^⊗3
After a multilinear transform by W^⊤ along each mode:
◮ M_3(W, W, W) = E[(W^⊤ x)^⊗3] = ∑_{h=1}^K w_h (W^⊤ a_h)^⊗3;  M̂_3(W, W, W) = (1/n) ∑_{i=1}^n (W^⊤ x^(i))^⊗3
[Figure: example word counts per document for topics Sports, Education, Crime, with words such as “campus”, “police”, “witness”.]
Why the third-order moment?
◮ Distribution of three-word documents (word triples)
◮ M_3 = E[x ⊗ x ⊗ x] = ∑_{h=1}^K w_h a_h ⊗ a_h ⊗ a_h
◮ M_3: co-occurrence of word triples
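A small sketch of how the empirical moments M̂_1, M̂_2, M̂_3 can be formed from single-topic documents (Python/NumPy). The corpus here is synthetic and the x^(i) are per-document word-frequency vectors, so the plug-in estimates match the population formulas only up to O(1/L) bias terms; these choices are assumptions of the illustration, not part of the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, n, L = 50, 3, 2000, 100                   # vocabulary, topics, documents, words/doc (assumed)
A = rng.dirichlet(np.ones(d), size=K).T         # d x K topic-word matrix, columns a_h
w = rng.dirichlet(np.ones(K))                   # topic probabilities w_h

x = np.empty((n, d))
for i in range(n):
    h = rng.choice(K, p=w)                      # choose topic h ~ Cat(w_1, ..., w_K)
    x[i] = rng.multinomial(L, A[:, h]) / L      # generate L words ~ a_h, keep frequencies

M1_hat = x.mean(axis=0)                                 # ~ sum_h w_h a_h
M2_hat = np.einsum('ni,nj->ij', x, x) / n               # ~ sum_h w_h a_h (x) a_h      (+ O(1/L) bias)
M3_hat = np.einsum('ni,nj,nk->ijk', x, x, x) / n        # ~ sum_h w_h a_h (x) a_h (x) a_h (+ O(1/L) bias)

M1 = A @ w                                              # population first moment, for comparison
print(np.abs(M1_hat - M1).max())
```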
Experiments.
◮ [Plot: perplexity vs. running time (s), tensor method vs. variational inference.]
◮ Datasets: Facebook (n ≈ 20k), Yelp (n ≈ 40k), DBLPsub (n ≈ 0.1m), DBLP (n ≈ 1m).
◮ [Plots: error per group and running times (s) on FB, YP, DBLPsub, DBLP.]
“Online Tensor Methods for Learning Latent Variable Models”, F. Huang, U. Niranjan, M. Hakeem, A. Anandkumar, JMLR 2014. “Tensor Methods on Apache Spark”, F. Huang, A. Anandkumar, Oct. 2015.
Tensor decomposition with linearly independent components.
Problem: given T = ∑_{h=1}^K µ_h^⊗3 with linearly independent components {µ_h}_{h=1}^K, find the components (up to scaling).
Key contraction: for a vector c, T(I, I, c) = ∑_{h=1}^K ⟨µ_h, c⟩ µ_h ⊗ µ_h.
Simultaneous diagonalization:
1: Sample c and c′ independently & uniformly at random from S^{d−1}
2: Return {µ̂_h}_{h=1}^K ← eigenvectors of T(I, I, c) T(I, I, c′)^†, whose eigenvalues are distinct a.s.
Question: is the output {µ̂_h}_{h=1}^K ≡ the unknown components {µ_h}_{h=1}^K (up to scaling)?
Why it works. Write U = [µ_1, …, µ_K] and D_c = diag(⟨µ_1, c⟩, …, ⟨µ_K, c⟩); then T(I, I, c) = U D_c U^⊤ and
T(I, I, c) T(I, I, c′)^† = U (D_c D_{c′}^{−1}) U^† almost surely.
So, for the exact tensor T and a random choice of c and c′:
1. T(I, I, c) and T(I, I, c′) are simultaneously diagonalized by the same factor U;
2. the diagonal entries ⟨µ_h, c⟩ / ⟨µ_h, c′⟩ of D_c D_{c′}^{−1} are distinct almost surely;
3. the {µ_h}_{h=1}^K are the eigenvectors of T(I, I, c) T(I, I, c′)^† with distinct eigenvalues.
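The two steps above fit in a few lines of NumPy. This is a sketch on a synthetic tensor with randomly drawn (hence, almost surely linearly independent) components; the recovered eigenvectors should align with the µ_h up to sign and scale.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K = 8, 3
mu = rng.normal(size=(d, K))                        # components mu_h (a.s. linearly independent)
T = np.einsum('ih,jh,kh->ijk', mu, mu, mu)          # T = sum_h mu_h (x) mu_h (x) mu_h

c  = rng.normal(size=d); c  /= np.linalg.norm(c)    # c  uniform direction on S^{d-1}
cp = rng.normal(size=d); cp /= np.linalg.norm(cp)   # c' uniform direction on S^{d-1}

Mc  = np.einsum('ijk,k->ij', T, c)                  # T(I, I, c)
Mcp = np.einsum('ijk,k->ij', T, cp)                 # T(I, I, c')
evals, evecs = np.linalg.eig(Mc @ np.linalg.pinv(Mcp))

top = np.argsort(-np.abs(evals))[:K]                # keep the K dominant eigenvectors
recovered = np.real(evecs[:, top])                  # they equal the mu_h up to sign/scale
for h in range(K):
    target = mu[:, h] / np.linalg.norm(mu[:, h])
    print(f"component {h}: best |cosine| = {np.abs(recovered.T @ target).max():.4f}")
```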
From exact to empirical moments. Given samples {x_i}_{i=1}^n:
◮ Exact: E[x ⊗ x ⊗ x] = ∑_h w_h a_h ⊗ a_h ⊗ a_h
◮ Empirical: Ê[x ⊗ x ⊗ x] = (1/n) ∑_{i=1}^n x_i ⊗ x_i ⊗ x_i
◮ The estimation error decays like n^{−1/2} in some norm, e.g.,
  ◮ Operator norm: ‖E[x ⊗ x ⊗ x] − Ê[x ⊗ x ⊗ x]‖ ≲ n^{−1/2}, where ‖T‖ := sup_{x,y,z∈S^{d−1}} T(x, y, z)
  ◮ Frobenius norm: ‖E[x ⊗ x ⊗ x] − Ê[x ⊗ x ⊗ x]‖_F ≲ n^{−1/2}, where ‖T‖_F := (∑_{i,j,k} T_{i,j,k}²)^{1/2}
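A small numerical illustration of the n^{−1/2} rate in the Frobenius norm. This sketch assumes x ∼ N(µ, I_d), for which E[x ⊗ x ⊗ x] has a closed form (µ⊗µ⊗µ plus symmetrized µ ⊗ I terms), so the error of the plug-in estimate can be measured exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
mu = rng.normal(size=d)
I = np.eye(d)

# Exact third moment of N(mu, I): E[x_i x_j x_k] = mu_i mu_j mu_k + mu_i d_jk + mu_j d_ik + mu_k d_ij
T_true = (np.einsum('i,j,k->ijk', mu, mu, mu)
          + np.einsum('i,jk->ijk', mu, I)
          + np.einsum('j,ik->ijk', mu, I)
          + np.einsum('k,ij->ijk', mu, I))

fro = lambda T: np.sqrt((T ** 2).sum())             # Frobenius norm ||T||_F

for n in (1_000, 10_000, 100_000):
    x = rng.normal(size=(n, d)) + mu
    T_hat = np.einsum('ni,nj,nk->ijk', x, x, x) / n # empirical third moment
    print(n, fro(T_hat - T_true))                   # error shrinks roughly like n^{-1/2}
```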
Back to simultaneous diagonalization, now with noise. In practice we only observe T̂ = T + E, built from empirical moments:
1: Sample c and c′ independently & uniformly at random from S^{d−1}
2: Return {µ̂_h}_{h=1}^K ← eigenvectors of T̂(I, I, c) T̂(I, I, c′)^†
Do the {µ̂_h}_{h=1}^K still approximate the components {µ_h}_{h=1}^K? The eigendecomposition step is sensitive to perturbations: roughly, ‖E‖ ≤ 1/poly(d) is required. A different approach?
What if the components were orthonormal? (So far the {a_h}_{h=1}^K are only assumed to be linearly independent.) If {a_h}_{h=1}^K had orthonormal columns, then
M_3(I, a_i, a_i) = ∑_h w_h ⟨a_h, a_i⟩² a_h = w_i a_i, ∀i,
◮ i.e., we could read off the {a_h}_{h=1}^K as eigenvectors of the tensor M_3.
◮ But {a_h}_{h=1}^K is not orthogonal in general.
Whitening. Since A = [a_1, …, a_K] ∈ R^{d×K} has full column rank, it is an invertible map onto its column space; a whitening matrix W (built from M_2) can therefore make the transformed components orthogonal.
[Figure: the tensor M_3, a sum of three rank-1 terms, is transformed into the tensor T, also a sum of three rank-1 terms.]
Apply W along each mode of M_3: T := M_3(W, W, W) = ∑_h w_h (W^⊤ a_h)^⊗3.
With the right W, the vectors v_h ∝ W^⊤ a_h, h ∈ [K], are orthonormal, so T has an orthogonal decomposition.
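A sketch of the whitening step in NumPy. Building W from the top-K eigendecomposition of M_2 is one standard choice, assumed here rather than read off the slide; the columns v_h = √w_h · W^⊤ a_h come out orthonormal, and M_3(W, W, W) becomes an orthogonally decomposable K×K×K tensor.

```python
import numpy as np

rng = np.random.default_rng(1)
d, K = 10, 4
A = rng.normal(size=(d, K))                          # full-column-rank "topic" matrix
w = rng.dirichlet(np.ones(K))                        # mixing weights

M2 = (A * w) @ A.T                                   # M2 = sum_h w_h a_h a_h^T
M3 = np.einsum('h,ih,jh,kh->ijk', w, A, A, A)        # M3 = sum_h w_h a_h (x) a_h (x) a_h

# Whitening matrix W = U D^{-1/2} from the rank-K eigendecomposition M2 = U D U^T
evals, evecs = np.linalg.eigh(M2)
U, D = evecs[:, -K:], evals[-K:]
W = U / np.sqrt(D)                                   # so that W^T M2 W = I_K

V = np.sqrt(w) * (W.T @ A)                           # columns v_h = sqrt(w_h) W^T a_h
print(np.allclose(V.T @ V, np.eye(K)))               # True: whitened components are orthonormal

T = np.einsum('abc,ai,bj,ck->ijk', M3, W, W, W)      # T = M3(W, W, W), a K x K x K tensor
```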
Recap: the {a_h}_{h=1}^K are assumed to be linearly independent, not orthonormal. If they were orthonormal, M_3(I, a_i, a_i) = ∑_h w_h ⟨a_h, a_i⟩² a_h = w_i a_i for all i, and the {a_h}_{h=1}^K would be eigenvectors of M_3; whitening reduces the general case to this orthogonal one.
Matrix case: M = ∑_{h=1}^K λ_h v_h ⊗ v_h with orthonormal components {v_h}_{h=1}^K (v_h ⊥ v_{h′}).
M(I, v_i) = ∑_h λ_h ⟨v_i, v_h⟩ v_h = λ_i v_i,
so the {v_h}_{h=1}^K preserve direction under u ↦ M(I, u): they are eigenvectors of M.
Tensor case: T = ∑_{h=1}^K λ_h v_h^⊗3 with orthonormal components {v_h}_{h=1}^K (v_h ⊥ v_{h′}).
T(I, v_i, v_i) = ∑_h λ_h ⟨v_i, v_h⟩² v_h = λ_i v_i,
so the {v_h}_{h=1}^K preserve direction under u ↦ T(I, u, u): they are eigenvectors of T.
Matrix power method. Input: M = ∑_{h=1}^K λ_h v_h^⊗2 with orthonormal components {v_h}_{h=1}^K (v_h ⊥ v_{h′}). Output: {v̂_h}_{h=1}^K, claimed to equal the components w.h.p.
1: for h = 1 : K do
2:   initialize u_0 at random
3:   for i = 1, 2, … do
4:     u_i ← M(I, u_{i−1}) / ‖M(I, u_{i−1})‖
5:   end for
6:   v̂_h ← u_i, λ̂_h ← M(v̂_h, v̂_h)
7:   deflate: M ← M − λ̂_h v̂_h^⊗2
8: end for
Question: is {v̂_h}_{h=1}^K ≡ {v_h}_{h=1}^K w.h.p.?
Tensor power method. Input: T = ∑_{h=1}^K λ_h v_h^⊗3 with orthonormal components {v_h}_{h=1}^K (v_h ⊥ v_{h′}). Output: {v̂_h}_{h=1}^K, claimed to equal the components w.h.p.
1: for h = 1 : K do
2:   initialize u_0 at random
3:   for i = 1, 2, … do
4:     u_i ← T(I, u_{i−1}, u_{i−1}) / ‖T(I, u_{i−1}, u_{i−1})‖
5:   end for
6:   v̂_h ← u_i, λ̂_h ← T(v̂_h, v̂_h, v̂_h)
7:   deflate: T ← T − λ̂_h v̂_h^⊗3
8: end for
Question: is {v̂_h}_{h=1}^K ≡ {v_h}_{h=1}^K w.h.p.?
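A sketch of the deflation-based tensor power method in NumPy. Random restarts and a fixed iteration budget are implementation choices assumed here, not prescribed by the slide.

```python
import numpy as np

def tensor_power_method(T, K, n_iter=50, n_restarts=10, rng=None):
    """Recover (lambda_h, v_h) from an orthogonally decomposable symmetric tensor
    T = sum_h lambda_h v_h^(x)3 via power iteration with deflation (a sketch)."""
    rng = np.random.default_rng() if rng is None else rng
    d = T.shape[0]
    T = T.copy()
    lams, vecs = [], []
    for _ in range(K):
        best_lam, best_u = -np.inf, None
        for _ in range(n_restarts):                       # random restarts
            u = rng.normal(size=d)
            u /= np.linalg.norm(u)
            for _ in range(n_iter):                       # u <- T(I, u, u) / ||T(I, u, u)||
                u = np.einsum('ijk,j,k->i', T, u, u)
                u /= np.linalg.norm(u)
            lam = np.einsum('ijk,i,j,k->', T, u, u, u)    # lambda = T(u, u, u)
            if lam > best_lam:
                best_lam, best_u = lam, u
        lams.append(best_lam)
        vecs.append(best_u)
        T = T - best_lam * np.einsum('i,j,k->ijk', best_u, best_u, best_u)  # deflate
    return np.array(lams), np.column_stack(vecs)

# Usage on a synthetic orthogonal tensor:
rng = np.random.default_rng(0)
d, K = 6, 3
V, _ = np.linalg.qr(rng.normal(size=(d, K)))              # orthonormal components v_h
lam = np.array([3.0, 2.0, 1.0])
T = np.einsum('h,ih,jh,kh->ijk', lam, V, V, V)
lam_hat, V_hat = tensor_power_method(T, K, rng=rng)
print(lam_hat, np.abs(V_hat.T @ V).round(3))              # recovered up to sign/permutation
```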
Convergence of the matrix power method. Given M = ∑_{h=1}^K λ_h v_h^⊗2 such that the corresponding eigenvalues {λ_h}_{h=1}^K are distinct (say λ_1 > λ_2 > … > λ_K), write c_h = ⟨v_h, u_0⟩ for the initialization u_0.
◮ If λ_2/λ_1 < 1 and c_1 ≠ 0, the matrix power method converges to v_1.
◮ The linear transform gives M(I, u_0) = ∑_h λ_h ⟨v_h, u_0⟩ v_h = ∑_h λ_h c_h v_h, i.e., the projection in the v_h direction is scaled by λ_h.
◮ In t iterations, ⟨u_t, v_1⟩² / ∑_i ⟨u_t, v_i⟩² ≥ 1 − K (λ_2/λ_1)^{2t}, up to an initialization-dependent constant.
Convergence of the tensor power method. Given T = ∑_{h=1}^K λ_h v_h^⊗3 such that the {λ_h}_{h=1}^K are distinct, the limit is initialization dependent; write c_h = ⟨v_h, u_0⟩.
◮ If max_{i≠1} λ_i|c_i| / (λ_1|c_1|) < 1 and λ_1|c_1| ≠ 0, the tensor power method converges to v_1.
◮ The bi-linear transform gives T(I, u_0, u_0) = ∑_h λ_h ⟨v_h, u_0⟩² v_h = ∑_h λ_h c_h² v_h, i.e., the projection in the v_h direction is squared and then scaled by λ_h.
◮ In t iterations, ⟨u_t, v_1⟩² / ∑_i ⟨u_t, v_i⟩² ≥ 1 − K (max_{i≠1} λ_i / (λ_1 |c_1|))^{2^{t+1}}: convergence is quadratic, to whichever component the initialization favors.
Spurious eigenvectors. The {v_h}_{h=1}^K are eigenvectors, as T(I, v_h, v_h) = λ_h v_h, but a tensor also has other, spurious eigenvectors.
◮ E.g., when {λ_h}_{h=1}^K ≡ 1, the vector u = (v_1 + v_2)/√2 satisfies T(I, u, u) = u/√2, so it is an eigenvector too.
Does the power method converge to the true components ({v_h}_{h=1}^K) rather than to spurious eigenvectors?
The optimization viewpoint.
Matrix: max_{‖v‖=1} M(v, v). Lagrangian: L(v, λ) := M(v, v) − λ(v^⊤v − 1).
◮ ∇_v L(v, λ) = 2(M(I, v) − λv) = 0: eigenvectors are exactly the stationary points. The power method v ← M(I, v)/‖M(I, v)‖ is a version of gradient ascent.
◮ v_1 is the only local optimum; all other eigenvectors are saddle points.
Tensor: max_{‖v‖=1} T(v, v, v). Lagrangian: L(v, λ) := T(v, v, v) − 1.5λ(v^⊤v − 1).
◮ ∇_v L(v, λ) = 3(T(I, v, v) − λv) = 0: eigenvectors are exactly the stationary points. The power method v ← T(I, v, v)/‖T(I, v, v)‖ is a version of gradient ascent.
◮ The {v_h}_{h=1}^K are the only local optima; all spurious eigenvectors are saddle points.
Robustness to perturbation. Suppose we observe T̂ = T + E with sup_{‖x‖=1} |E(x, x, x)| ≤ ε.
◮ For ε small enough, on the order of λ_min/K, the tensor power method with deflation still recovers each component, with per-component error O(ε/λ_h) (Anandkumar, Ge, Hsu, Kakade, & Telgarsky, 2014).
Further convergence results for tensor power iterations:
(Wang & Lu, 2017)
◮ Simultaneous recovery of eigenvectors
◮ Initialization is not optimal
(Sharan & Valiant, 2017)
◮ Random initialization
◮ Proved convergence for symmetric tensors
Deep learning applications: image classification, speech recognition, text processing.
[Figure: ImageNet (ILSVRC) classification top-5 error (%) by year and model: 28.2 (ILSVRC'10, shallow), 25.8 (ILSVRC'11, shallow), 16.4 (ILSVRC'12, AlexNet, 8 layers), 11.7 (ILSVRC'13, 8 layers), 7.3 (ILSVRC'14, VGG, 19 layers), 6.7 (ILSVRC'14, GoogleNet, 22 layers), 3.57 (ILSVRC'15, ResNet, 152 layers).]
[Figure: layer-by-layer architectures of AlexNet (8 layers, ILSVRC 2012: 11x11 conv 96 /4 pool/2; 5x5 conv 256 pool/2; 3x3 conv 384; 3x3 conv 384; 3x3 conv 256 pool/2; fc 4096; fc 4096; fc 1000), VGG (19 layers, ILSVRC 2014), GoogleNet (22 layers, ILSVRC 2014), and ResNet (152 layers, ILSVRC 2015).]
[Figure: PASCAL VOC 2007 object detection mAP (%): 34 (HOG/DPM, shallow), 58 (AlexNet, RCNN, 8 layers), 66 (VGG, RCNN, 16 layers), 86 (ResNet, Faster RCNN, 101 layers; *with other improvements & more data).]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
The cost of depth: these models are large and computationally demanding.
◮ Ill-suited for smartphones or IoT devices.
Tensor decompositions for an m-order tensor T ∈ R^{I_0×I_1×···×I_{m−1}}.
CANDECOMP/PARAFAC (CP) decomposition: factorize the tensor into a sum of rank-1 tensors, where a rank-1 tensor is an outer product of vectors:
T_{i_0,…,i_{m−1}} = ∑_{r=0}^{R−1} M^{(0)}_{r,i_0} ··· M^{(m−1)}_{r,i_{m−1}}.
Tucker (TK) decomposition: more general than CP; a multilinear operation on a core tensor C, i.e., C(M^{(0)}, …, M^{(m−1)}):
T_{i_0,…,i_{m−1}} = ∑_{r_0=0}^{R_0−1} ··· ∑_{r_{m−1}=0}^{R_{m−1}−1} C_{r_0,…,r_{m−1}} M^{(0)}_{r_0,i_0} ··· M^{(m−1)}_{r_{m−1},i_{m−1}}.
Tensor-Train (TT) decomposition: factorize the tensor into a chain of interconnected lower-order tensors:
T_{i_0,…,i_{m−1}} = ∑_{r_0=0}^{R_0−1} ··· ∑_{r_{m−2}=0}^{R_{m−2}−1} T^{(0)}_{i_0,r_0} T^{(1)}_{r_0,i_1,r_1} ··· T^{(m−1)}_{r_{m−2},i_{m−1}}.
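The three formats written out with einsum for a 3rd-order example (a sketch with small, arbitrary sizes and ranks; the factor shapes follow the index conventions above):

```python
import numpy as np

rng = np.random.default_rng(0)
I0, I1, I2 = 4, 5, 6                                   # tensor dimensions (assumed)

# CP: T[i0,i1,i2] = sum_r M0[r,i0] M1[r,i1] M2[r,i2]
R = 3
M0, M1, M2 = (rng.normal(size=(R, I)) for I in (I0, I1, I2))
T_cp = np.einsum('ra,rb,rc->abc', M0, M1, M2)

# Tucker: T[i0,i1,i2] = sum_{r0,r1,r2} C[r0,r1,r2] M0[r0,i0] M1[r1,i1] M2[r2,i2]
R0, R1, R2 = 2, 3, 2
C = rng.normal(size=(R0, R1, R2))
U0, U1, U2 = rng.normal(size=(R0, I0)), rng.normal(size=(R1, I1)), rng.normal(size=(R2, I2))
T_tk = np.einsum('xyz,xa,yb,zc->abc', C, U0, U1, U2)

# Tensor-train: T[i0,i1,i2] = sum_{r0,r1} G0[i0,r0] G1[r0,i1,r1] G2[r1,i2]
G0, G1, G2 = rng.normal(size=(I0, R0)), rng.normal(size=(R0, I1, R1)), rng.normal(size=(R1, I2))
T_tt = np.einsum('ax,xby,yc->abc', G0, G1, G2)

print(T_cp.shape, T_tk.shape, T_tt.shape)              # all (4, 5, 6)
```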
Compressing a convolutional kernel K ∈ R^{H×W×S×T}: filter height/width H/W, number of input/output channels S/T. The layer maps an input tensor U ∈ R^{X×Y×S} to an output tensor V ∈ R^{X′×Y′×T}.
◮ CP: decompose K into 3 factor tensors, K_{i,j,s,t} = ∑_{r=0}^{R−1} K^{(0)}_{s,r} K^{(1)}_{i,j,r} K^{(2)}_{r,t}. [Figure: CP factor graph over modes H, W, S, T with rank R.]
◮ TK: decompose K into 1 core tensor and 2 factor matrices, K_{i,j,s,t} = ∑_{r_s=0}^{R_s−1} ∑_{r_t=0}^{R_t−1} K^{(0)}_{s,r_s} K^{(1)}_{i,j,r_s,r_t} K^{(2)}_{r_t,t}. [Figure: TK factor graph with ranks R_s, R_t.]
◮ TT: decompose K into 4 factor tensors, K_{i,j,s,t} = ∑_{r_s=0}^{R_s−1} ∑_{r=0}^{R−1} ∑_{r_t=0}^{R_t−1} K^{(0)}_{s,r_s} K^{(1)}_{r_s,i,r} K^{(2)}_{r,j,r_t} K^{(3)}_{r_t,t}. [Figure: TT factor graph with ranks R_s, R, R_t.]
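A sketch (NumPy, with assumed sizes and ranks) that reconstructs the full kernel from each factorization and compares parameter counts; the index layouts match the formulas above.

```python
import numpy as np

H, W, S, T = 3, 3, 64, 64                         # filter size and channel counts (assumed)
R, Rs, Rt = 8, 8, 8                               # decomposition ranks (assumed)
rng = np.random.default_rng(0)

# CP: K[i,j,s,t] = sum_r K0[s,r] K1[i,j,r] K2[r,t]
A0, A1, A2 = rng.normal(size=(S, R)), rng.normal(size=(H, W, R)), rng.normal(size=(R, T))
K_cp = np.einsum('sr,ijr,rt->ijst', A0, A1, A2)

# TK: K[i,j,s,t] = sum_{rs,rt} K0[s,rs] K1[i,j,rs,rt] K2[rt,t]
B0, B1, B2 = rng.normal(size=(S, Rs)), rng.normal(size=(H, W, Rs, Rt)), rng.normal(size=(Rt, T))
K_tk = np.einsum('sa,ijab,bt->ijst', B0, B1, B2)

# TT: K[i,j,s,t] = sum_{rs,r,rt} K0[s,rs] K1[rs,i,r] K2[r,j,rt] K3[rt,t]
C0, C1 = rng.normal(size=(S, Rs)), rng.normal(size=(Rs, H, R))
C2, C3 = rng.normal(size=(R, W, Rt)), rng.normal(size=(Rt, T))
K_tt = np.einsum('sa,air,rjb,bt->ijst', C0, C1, C2, C3)

print("full:", H * W * S * T)                               # 36864 parameters
print("CP  :", S * R + H * W * R + R * T)                   # 1096
print("TK  :", S * Rs + H * W * Rs * Rt + Rt * T)           # 1600
print("TT  :", S * Rs + Rs * H * R + R * W * Rt + Rt * T)   # 1408
```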
Tensorized kernels: reshape the kernel into a higher-order tensor before decomposing.
◮ Factor the channel counts as S = ∏_{i=0}^{m−1} S_i and T = ∏_{i=0}^{m−1} T_i.
◮ The input tensor U ∈ R^{X×Y×S} is tensorized to U′ ∈ R^{X×Y×S_0×···×S_{m−1}}.
◮ The output V ∈ R^{X′×Y′×T} is reshaped to V′ ∈ R^{X′×Y′×T_0×···×T_{m−1}}.
[Figures: tensorized CP, TK, and TT factor graphs over the modes H, W, S_0, …, S_{m−1}, T_0, …, T_{m−1}, with parameter counts expressed in terms of S^{1/m}, T^{1/m}, H, W, and the ranks R, R_s, R_t.]
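A sketch of the tensorization step itself (with an assumed factorization S = S_0 S_1 and T = T_0 T_1, i.e., m = 2): the kernel and the feature maps are only reshaped, and the decompositions of the previous slide are then applied to the resulting higher-order kernel.

```python
import numpy as np

H, W = 3, 3
S0, S1, T0, T1 = 8, 8, 8, 8                      # assumed factorization, m = 2
S, T = S0 * S1, T0 * T1                          # S = T = 64

rng = np.random.default_rng(0)
K = rng.normal(size=(H, W, S, T))                # original 4th-order kernel
K_high = K.reshape(H, W, S0, S1, T0, T1)         # tensorized 6th-order kernel

X, Y = 32, 32
U = rng.normal(size=(X, Y, S))                   # input feature map  U  in R^{X x Y x S}
U_high = U.reshape(X, Y, S0, S1)                 # tensorized input   U' in R^{X x Y x S0 x S1}
print(K_high.shape, U_high.shape)
```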
Compression results (Su, Li, Bhattacharjee, & Huang, 2018). Accuracy (%) at several compression rates:

Method | Compression rate (SPC, E2E): 5%, 10%, 20%, 40% | Compression rate (t-SPC, Seq.): 2%, 5%, 10%, 20%
CP     | 84.02, 86.93, 88.75, 88.75 | 85.7, 89.86, 91.28, –
TK     | 83.57, 86.00, 88.03, 89.35 | 61.06, 71.34, 81.59, 87.11
TT     | 77.44, 82.92, 84.13, 86.64 | 78.95, 84.26, 87.89, –
(Su, Li, Bhattacharjee, & Huang, 2018)
[Figure: training curves over epochs comparing the uncompressed network, SPC-TT, and t-SPC-TT.]
Conclusion and outlook:
◮ Exploit distributional properties, multi-view structure, and other structure to determine usable moment tensors.
◮ Efficient algorithms exist for carrying out the tensor decomposition and recovering the parameters.
◮ Open: handle model misspecification, increase robustness.
◮ Open: learning deep neural network parameters using tensor decomposition?
References:
◮ “A Method of Moments for Mixture Models and Hidden Markov Models”, Anima Anandkumar, Daniel Hsu, and Sham Kakade. Conference on Learning Theory, 2012.
◮ “Tensor Decompositions for Learning Latent Variable Models”, Anima Anandkumar, Rong Ge, Daniel Hsu, Sham Kakade, and Matus Telgarsky. Journal of Machine Learning Research, 2014.
◮ “Escaping From Saddle Points: Online Stochastic Gradient for Tensor Decomposition”, Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Conference on Learning Theory, 2015.
◮ “Online Tensor Methods for Learning Latent Variable Models”, Furong Huang, Niranjan U. N., Mohammad Umar Hakeem, and Anima Anandkumar. Journal of Machine Learning Research, 2016.
◮ “Guaranteed Simultaneous Asymmetric Tensor Decomposition via Orthogonalized Alternating Least Squares”, Jialin Li and Furong Huang, 2018.
◮ “Tensorized Spectrum Preserving Compression for Neural Networks”, Jiahao Su, Jingling Li, Bobby Bhattacharjee, and Furong Huang, 2018.