Tensor Methods for Signal Processing and Machine Learning
Qibin Zhao Tensor Learning Unit RIKEN AIP
2018-6-9 @ Waseda University
Monograph: Tensor Networks for Dimensionality Reduction and Large-Scale Optimization
Andrzej Cichocki, Namgil Lee, Ivan Oseledets, Anh-Huy Phan, Qibin Zhao and Danilo P. Mandic
[Figure: examples of naturally multiway data, e.g. face images indexed by people × expressions × views × illumination, video indexed by (x, y) coordinate × frame, and EEG indexed by channel × time-frequency × epoch.]
Matricization causes a loss of useful multiway information; it is preferable to analyze multi-dimensional data in their own domain.
✓ classification: target y represents a category or class
✓ regression: target y is a real-valued number
✓ density estimation: model the probability distribution of input x
✓ clustering, dimensionality reduction: discover underlying structure in input x
✓ unsupervised learning: model p(X) from unlabeled data D̃ only (find hidden structure)
✓ supervised learning: model p(y | X) from labeled data D
✓ semi-supervised learning: model p(y | X) from both labeled data D and unlabeled data D̃
✓ predict one or more responses (dependent variables, outputs) from a set of predictors (independent variables, inputs)
✓ identify the key predictors (independent variables, inputs)
✓ linear models: simple regression, multiple regression, multivariate regression, generalized linear model, partial least squares (PLS)
✓ nonlinear models: Gaussian process (GP), artificial neural networks (ANN), support vector regression (SVR)
image credit: Laerd Statistics
y = f(x; w, b) = wᵀx + b,

✓ x ∈ R^I is the input vector of independent variables
✓ w ∈ R^I is the vector of regression coefficients
✓ b is the bias
✓ y is the regression output or dependent/target variable
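As a quick illustration (not from the slides; the data and names below are synthetic), this linear model can be fitted by ordinary least squares:

```python
import numpy as np

# minimal sketch: fit y = w^T x + b by least squares on synthetic data
rng = np.random.default_rng(0)
I, M = 5, 200                        # input dimension, number of samples
X = rng.normal(size=(M, I))          # M samples of x in R^I
w_true, b_true = rng.normal(size=I), 0.5
y = X @ w_true + b_true + 0.01 * rng.normal(size=M)

Xb = np.hstack([X, np.ones((M, 1))])           # append ones to estimate the bias
coef, *_ = np.linalg.lstsq(Xb, y, rcond=None)  # least-squares solution
w_hat, b_hat = coef[:-1], coef[-1]
```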
✓ MRI data: x-coordinate × y-coordinate × z-coordinate
✓ fMRI data: time × x-coordinate × y-coordinate × z-coordinate
✓ EEG data: time × frequency × channel
✓ video data: frame × x-coordinate × y-coordinate
✓ face image data: pixel × illumination × expression × viewpoint × identity
✓ climate forecast data: month × location × variable
✓ fluorescence excitation-emission data: sample × excitation × emission
✓ predictor: 3rd-order tensor (MRI images); response: scalar (clinical diagnosis indicating whether one has a disease or not)
✓ predictor: 4th-order tensor (RGB video, or depth video); response: 3rd-order tensor (human motion capture data)
✓ predictor: 4th-order tensor (ECoG signals of a monkey); response: 3rd-order tensor (limb movement trajectories)
✓ vectorizing operations destroy the underlying multiway structure,
e.g. spatial and temporal correlations among voxels in an fMRI are ignored
✓ ultrahigh tensor dimensionality produces a huge number of parameters,
e.g. an fMRI of size 100 × 256 × 256 × 256 yields 1,677 million entries!
✓ difficulty of interpretation, sensitivity to noise, absence of uniqueness
Advantages of tensor-based models and multiway analysis techniques:
✓ naturally preserve multiway structural knowledge, which is useful in mitigating the small-sample-size problem
✓ compactly represent regression coefficients using only a few parameters
✓ ease of interpretation, robustness to noise, uniqueness property
y = f(X; W, b) = ⟨X, W⟩ + b,

✓ X ∈ R^{I1×···×IN} is the input tensor (predictor, or tensor regressor)
✓ W ∈ R^{I1×···×IN} is the tensor of regression coefficients (weights)
✓ b is the bias
✓ y is the regression output or dependent/target variable
✓ ⟨X, W⟩ = vec(X)ᵀ vec(W) is the inner product of two tensors
✓ sparse regularization (e.g. a lasso penalty on W) further improves the performance

The model is trained by minimization of the following squared cost function:

J(W, b) = Σ_{m=1}^{M} ( y_m − ⟨W, X_m⟩ − b )²,

where {(X_m, y_m)}, m = 1, ..., M, are the M pairs of training samples; the trained model f is then used to make predictions.
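A minimal numpy sketch (synthetic data; all names are illustrative) of the tensor inner product and the squared cost defined above:

```python
import numpy as np

rng = np.random.default_rng(0)
shape, M = (4, 5, 6), 50
W = rng.normal(size=shape)                       # coefficient tensor
b = 0.1
Xs = rng.normal(size=(M,) + shape)               # M tensor-valued samples
y = np.array([np.vdot(Xm, W) + b for Xm in Xs])  # <X, W> = vec(X)^T vec(W)

def cost(W, b, Xs, y):
    preds = np.tensordot(Xs, W, axes=3) + b      # contract all tensor modes
    return np.sum((y - preds) ** 2)

print(cost(W, b, Xs, y))                         # ~0 at the true parameters
```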
✓ substantial reduction in dimensionality
y = f(X; W, b) = ⟨X, W⟩ + b,

where the coefficient tensor is assumed to follow a CP decomposition:

W = Σ_{r=1}^{R} u_r^(1) ∘ u_r^(2) ∘ ··· ∘ u_r^(N) = [[U^(1), U^(2), ..., U^(N)]].

✓ a low-rank CP model can provide a sound recovery of many low-rank signals,
e.g. for a 128 × 128 × 128 MRI image, the number of parameters reduces from 2,097,152 to 1,157 via a rank-3 decomposition
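A sketch of this parameter saving (the factor count of 1,152 below is slightly smaller than the slide's 1,157, which presumably also counts auxiliary parameters such as the bias):

```python
import numpy as np

# build a rank-3 CP coefficient tensor W = sum_r u_r^(1) o u_r^(2) o u_r^(3)
rng = np.random.default_rng(0)
I, R = 128, 3
U = [rng.normal(size=(I, R)) for _ in range(3)]   # factor matrices U^(1..3)

# einsum over the shared rank index implements the sum of outer products
W = np.einsum('ir,jr,kr->ijk', U[0], U[1], U[2])

print(W.size)                    # 2,097,152 entries in the full tensor
print(sum(u.size for u in U))    # only 3 * 128 * 3 = 1,152 CP parameters
```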
✓ substantially reduces the dimensionality
✓ provides a sound low-rank approximation to potentially high-rank signals

y = f(X; W, b) = ⟨X, W⟩ + b,

where the coefficient tensor is assumed to follow a Tucker decomposition:

W = G ×_1 U^(1) ×_2 U^(2) ··· ×_N U^(N).

✓ offers freedom in the choice of different ranks when the tensor data is skewed in its dimensions
✓ explicitly models the interactions between factor matrices

The model further generalizes to the case where the coefficient tensor is of higher order than the input tensor, leading to a tensor-valued response (next slide).
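A corresponding sketch of a Tucker-structured coefficient tensor (illustrative sizes):

```python
import numpy as np

# W = G x_1 U^(1) x_2 U^(2) x_3 U^(3): core G contracted with factor matrices
rng = np.random.default_rng(0)
ranks, shape = (3, 4, 5), (10, 20, 30)            # a different rank per mode
G = rng.normal(size=ranks)                        # core tensor
U = [rng.normal(size=(I, r)) for I, r in zip(shape, ranks)]

W = np.einsum('abc,ia,jb,kc->ijk', G, U[0], U[1], U[2])
print(W.shape)                                    # (10, 20, 30)
```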
✓ X ∈ R^{I1×···×IN} is the Nth-order predictor tensor
✓ W ∈ R^{I1×···×IP} is the Pth-order regression coefficient tensor, with P > N
✓ Y ∈ R^{I_{N+1}×···×I_P} is the (P−N)th-order response tensor
✓ ⟨·,·⟩_N denotes a tensor contraction along the first N modes

The general model (covering CP regression, Tucker regression, etc.) is

Y_m = ⟨X_m, W⟩_N + E_m,   m = 1, ..., M.
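The contraction ⟨X, W⟩_N can be sketched with np.tensordot (illustrative sizes):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 5, 6))         # Nth-order predictor (N = 3)
W = rng.normal(size=(4, 5, 6, 7, 8))   # Pth-order coefficient tensor (P = 5)

# contract modes 1..N of X with the first N modes of W
Y = np.tensordot(X, W, axes=([0, 1, 2], [0, 1, 2]))
print(Y.shape)                         # (7, 8): the (P-N)th-order response
```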
PLS regression predicts the response matrix Y from the predictor matrix X and describes their common latent structure:
i) extract a set of latent variables of X and Y by performing a simultaneous decomposition of X and Y, such that the pairwise covariance between the latent variables of X and those of Y is maximized; ii) use the extracted latent variables to predict Y.
X = T Pᵀ + E = Σ_{r=1}^{R} t_r p_rᵀ + E,
Y = T D Cᵀ + F = Σ_{r=1}^{R} d_rr t_r c_rᵀ + F,

✓ X ∈ R^{I×J} is the matrix predictor and Y ∈ R^{I×M} is the matrix response, decomposed simultaneously
✓ T = [t_1, t_2, ..., t_R] ∈ R^{I×R} contains R latent variables from X
✓ U = TD = [u_1, u_2, ..., u_R] ∈ R^{I×R} represents R latent variables from Y, with D = diag(d_11, ..., d_RR)
✓ P and C represent the loadings, which together with the latent variables yield the PLS regression coefficients
Prediction with the partial least squares (PLS) regression algorithm NIPALS-PLS [Wold, 1984] can be performed by

Ŷ = X W̃ D Cᵀ,

where W̃ is a weight matrix obtained from the NIPALS-PLS algorithm.
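A compact sketch of one NIPALS-PLS variant matching the X = TPᵀ + E, Y = TDCᵀ + F form above (a simplified reading, not the exact implementation of [Wold, 1984]):

```python
import numpy as np

def nipals_pls(X, Y, R, n_iter=500, tol=1e-10):
    X, Y = X.copy(), Y.copy()
    T, P, C, D = [], [], [], []
    for _ in range(R):
        u = Y[:, [0]]                              # initialize the Y-score
        for _ in range(n_iter):
            w = X.T @ u; w /= np.linalg.norm(w)    # X weights
            t = X @ w                              # X score
            c = Y.T @ t; c /= np.linalg.norm(c)    # Y loading (normalized)
            u_new = Y @ c                          # Y score
            if np.linalg.norm(u_new - u) < tol:
                u = u_new; break
            u = u_new
        p = X.T @ t / (t.T @ t)                    # X loading
        d = (u.T @ t / (t.T @ t)).item()           # inner regression coefficient
        X -= t @ p.T; Y -= d * (t @ c.T)           # deflation
        T.append(t); P.append(p); C.append(c); D.append(d)
    return np.hstack(T), np.hstack(P), np.hstack(C), np.diag(D)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
Y = X @ rng.normal(size=(10, 3)) + 0.01 * rng.normal(size=(50, 3))
T, P, C, D = nipals_pls(X - X.mean(0), Y - Y.mean(0), R=3)
Y_hat = T @ D @ C.T                                # in-sample fit, Y ~ T D C^T
```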
Higher-order PLS (HOPLS) allows one to predict the response tensor Y from the predictor tensor X and to describe their common latent subspace, but using a block Tucker decomposition [De Lathauwer, 2008].
i) extract a set of latent variables of the tensors X and Y by performing a simultaneous block Tucker decomposition of both X and Y, such that the pairwise covariance between the latent variables of X and those of Y is maximized; ii) use the extracted latent variables to predict the tensor Y.
HOPLS models the predictor tensor and the response tensor by

X = Σ_{r=1}^{R} G_xr ×_1 t_r ×_2 P_r^(1) ··· ×_{N+1} P_r^(N) + E_R,
Y = Σ_{r=1}^{R} G_yr ×_1 t_r ×_2 Q_r^(1) ··· ×_{N+1} Q_r^(N) + F_R,

✓ X ∈ R^{M×I1×···×IN} is the (N+1)th-order predictor tensor obtained by concatenating M samples, and Y ∈ R^{M×J1×···×JN} is the (N+1)th-order response tensor with the same sample size M
✓ t_r ∈ R^M is the latent variable for the r-th component; stacking t_1, ..., t_R defines the latent matrix T
✓ P_r^(n) ∈ R^{In×Ln} and Q_r^(n) ∈ R^{Jn×Kn}, n = 1, ..., N, are the mode-n loadings for the r-th component
✓ G_xr ∈ R^{1×L1×···×LN} and G_yr ∈ R^{1×K1×···×KN} are the core tensors for the r-th component
[Figure: block diagram of the HOPLS model for third-order tensors, X (M×I1×I2) = Σ_r G_xr ×_1 t_r ×_2 P_r^(1) ×_3 P_r^(2) + E and Y (M×J1×J2) = Σ_r G_yr ×_1 t_r ×_2 Q_r^(1) ×_3 Q_r^(2) + F, with cores of size 1×L1×L2 and 1×K1×K2.]
Equivalently, in compact form:

X = G_x ×_1 T ×_2 P̃^(1) ··· ×_{N+1} P̃^(N) + E_R,
Y = G_y ×_1 T ×_2 Q̃^(1) ··· ×_{N+1} Q̃^(N) + F_R,

✓ T = [t_1, ..., t_R] is the latent matrix
✓ P̃^(n) = [P_1^(n), ..., P_R^(n)] and Q̃^(n) = [Q_1^(n), ..., Q_R^(n)] are the concatenated loading matrices
✓ G_x = blockdiag(G_x1, ..., G_xR) ∈ R^{R×RL1×···×RLN} is the core tensor for the input
✓ G_y = blockdiag(G_y1, ..., G_yR) ∈ R^{R×RK1×···×RKN} is the core tensor for the output
[Figure: the compact HOPLS form for third-order tensors, X (M×I1×I2) = G_x ×_1 T ×_2 P̃^(1) ×_3 P̃^(2) + E and Y (M×J1×J2) = G_y ×_1 T ×_2 Q̃^(1) ×_3 Q̃^(2) + F, with T of size M×R and block-diagonal cores G_x (R×RL1×RL2) and G_y (R×RK1×RK2).]
✓ dataset: ECoG food-tracking data
✓ predictor: 4th-order tensor (sample × time × frequency × channel)
✓ response: 3rd-order tensor (sample × time × 3D positions of markers)
figure credit: [Zhao et al., 2013]
Deep neural networks deliver state-of-the-art performance in many large-scale machine learning applications,
✓ e.g. computer vision, speech recognition, text processing, etc.,
and are trained using millions of images on GPUs. However, most of the memory, in some networks even 100% [Xue et al., 2013], is occupied by the weight matrices of the fully-connected layers.
In a typical DNN like VGGNet [Simonyan and Zisserman, 2015], the goal is to represent the dense weight matrix of the fully-connected layers using fewer parameters while keeping enough flexibility to perform signal transformations.
✓ compatible with the existing training algorithms for neural networks
✓ matches the performance of the uncompressed counterparts, with a compression factor of the FC-layer weights of up to 200,000×, leading to a compression factor of the whole network of up to 7×
✓ able to use more hidden units than was feasible before
In the tensor train decomposition (TTD), a tensor is represented by

X(i1, i2, ..., id) ≈ Σ_{α0,...,αd} G1[i1](α0, α1) G2[i2](α1, α2) ··· Gd[id](α_{d−1}, α_d).

✓ e.g. an illustration of the TTD of a 5th-order tensor with cores G1, G2, G3, G4, G5
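A TT-format entry is just a chain of small matrix products; a minimal sketch:

```python
import numpy as np

# cores[k] has shape (r_{k-1}, n_k, r_k), with boundary ranks r_0 = r_d = 1
def tt_entry(cores, index):
    v = np.ones((1, 1))
    for G, i in zip(cores, index):
        v = v @ G[:, i, :]          # multiply by the slice G_k[i_k]
    return v.item()                 # the final product has shape (1, 1)

rng = np.random.default_rng(0)
shapes, ranks = [4, 5, 6], [1, 2, 3, 1]
cores = [rng.normal(size=(ranks[k], shapes[k], ranks[k + 1])) for k in range(3)]
print(tt_entry(cores, (1, 2, 3)))
```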
✓ vector b ∈ R^N, where N = Π_{k=1}^{d} n_k
✓ coordinate ℓ ∈ {1, ..., N} of the vector b
✓ d-dimensional vector-index μ(ℓ) = (μ1(ℓ), μ2(ℓ), ..., μd(ℓ)) of the tensorized b, where μk(ℓ) ∈ {1, ..., nk}
✓ it holds that b(ℓ) = B(μ1(ℓ), ..., μd(ℓ)) for the tensorized B
✓ the TT-format of the tensorized b is called a TT-vector
✓ matrix W ∈ R^{M×N}, where M = Π_{k=1}^{d} mk and N = Π_{k=1}^{d} nk
✓ row coordinate t and column coordinate ℓ of W
✓ d-dimensional vector-indices ν(t) and μ(ℓ) of the tensorized W, where νk(t) ∈ {1, ..., mk} and μk(ℓ) ∈ {1, ..., nk}
✓ it holds that W(t, ℓ) = W((ν1(t), μ1(ℓ)), ..., (νd(t), μd(ℓ)))
✓ the TT-format of the tensorized W is called a TT-matrix
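To obtain the cores in the first place, the standard TT-SVD algorithm of [Oseledets, 2011] applies truncated SVDs to successive unfoldings; a minimal sketch:

```python
import numpy as np

def tt_svd(X, max_rank):
    shape, d = X.shape, X.ndim
    cores, r_prev = [], 1
    C = X.reshape(shape[0], -1)                     # first unfolding
    for k in range(d - 1):
        U, s, Vt = np.linalg.svd(C, full_matrices=False)
        r = min(max_rank, len(s))                   # truncate the TT-rank
        cores.append(U[:, :r].reshape(r_prev, shape[k], r))
        C = (s[:r, None] * Vt[:r]).reshape(r * shape[k + 1], -1)
        r_prev = r
    cores.append(C.reshape(r_prev, shape[-1], 1))
    return cores

X = np.arange(120.0).reshape(4, 5, 6)               # a low-TT-rank test tensor
cores = tt_svd(X, max_rank=4)
full = cores[0]
for G in cores[1:]:                                 # rebuild to verify
    full = np.tensordot(full, G, axes=([full.ndim - 1], [0]))
print(np.linalg.norm(X - full.reshape(X.shape)))    # ~0
```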
A TT-layer computes the output y = Wx + b, where the weight matrix W is stored as a TT-matrix and b is the bias vector. The forward pass is computed directly in the TT format at low complexity, and training computes the gradients w.r.t. the tensor cores by backpropagation.
✓ FC stands for a fully-connected layer
✓ TT'$' stands for a TT-layer with all TT-ranks equal to '$'
✓ MR'$' stands for a fully-connected layer with the matrix rank restricted to '$'
✓ the experiments report the compression factor of the TT-layers, the resulting compression factor of the whole network, and the top-1 and top-5 classification errors
Architecture   TT-layers compr.   vgg-16 compr.   vgg-19 compr.   vgg-16 top-1   vgg-16 top-5   vgg-19 top-1   vgg-19 top-5
FC FC FC       1                  1               1               30.9           11.2           29.0           10.1
TT4 FC FC      50,972             3.9             3.5             31.2           11.2           29.8           10.4
TT2 FC FC      194,622            3.9             3.5             31.5           11.5           30.4           10.9
TT1 FC FC      713,614            3.9             3.5             33.3           12.8           31.9           11.8
TT4 TT4 FC     37,732             7.4             6               32.2           12.3           31.6           11.7
MR1 FC FC      3,521              3.9             3.5             99.5           97.6           99.8           99
MR5 FC FC      704                3.9             3.5             81.7           53.9           79.1           52.4
MR50 FC FC     70                 3.7             3.4             36.7           14.9           34.5           15.8
[Figure: an incomplete tensor with observed entries and missing entries (marked '?') is completed into a full tensor.]
Tensor completion problem: tensor completion applies tensor methods to infer the missing entries of a tensor from partial observations.
✓ recommender systems / collaborative filtering, e.g. movie ratings (Netflix)
✓ social network analysis
Matrix completion methods, based on the factorization Y = UV:
✓ Singular Value Decomposition (SVD)
✓ Non-negative Matrix Factorization (NMF)
✓ Probabilistic Matrix Factorization (PMF)
✓ Gaussian Process Latent Variable Models (GPLVM)
Challenges and regularizations.

Solving scheme 1: a low-rank assumption on the tensor.

[Figure: an incomplete tensor (missing entries marked '?') completed under a low-rank assumption.]

Example: high accuracy low-rank tensor completion (HaLRTC) [Liu, et al., 2013]. The matrix-completion objective

min_X ‖X‖_*   s.t.  X_Ω = T_Ω

is generalized to tensors by penalizing the nuclear norms of all matricizations:

min_X Σ_{i=1}^{n} α_i ‖X_(i)‖_*   s.t.  X_Ω = T_Ω,

where X_(i) denotes the mode-i matricization of the tensor [Kolda, et al., 2009] and Ω indicates the observed indices.
Mode-n matricization of a third-order tensor:

[Figure: mode-1, mode-2, and mode-3 slices of a third-order tensor and the corresponding matricizations X_(1), X_(2), X_(3), each assumed to be low-rank.]

The resulting convex problem is solved iteratively, e.g. by ADMM-type singular value thresholding on each unfolding (but with slow convergence and no analytic solution).
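A sketch of the two primitives such solvers rely on, mode-n unfolding and singular value thresholding (the proximal operator of the nuclear norm):

```python
import numpy as np

def unfold(X, n):
    # mode-n matricization: mode n becomes the rows, all other modes the columns
    return np.moveaxis(X, n, 0).reshape(X.shape[n], -1)

def fold(M, n, shape):
    # inverse of unfold for the same mode and original shape
    full = [shape[n]] + [s for k, s in enumerate(shape) if k != n]
    return np.moveaxis(M.reshape(full), 0, n)

def svt(M, tau):
    # shrink singular values by tau (used on each unfolding in HaLRTC-style ADMM)
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

X = np.random.default_rng(0).normal(size=(4, 5, 6))
X2 = fold(svt(unfold(X, 1), tau=1.0), 1, X.shape)   # one mode-2 shrinkage step
print(X2.shape)                                     # (4, 5, 6)
```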
Solving scheme 2: Bayesian low-rank tensor factorization.

The observation is modeled as Y = X + ε, where Ω indicates the observed indices and O is the corresponding binary indicator tensor. The latent tensor X is assumed to be exactly represented by a CP model,

X = Σ_{r=1}^{R} a_r^(1) ∘ ··· ∘ a_r^(N) = [[A^(1), ..., A^(N)]],

and the noise is assumed to be i.i.d. Gaussian, ε ~ Π_{i1,...,iN} N(0, τ^{-1}).

[Figure: graphical model with hyperparameters (c, d) over λ, factor matrices A^(1), ..., A^(N), hyperparameters (a, b) over the noise precision τ, and the observed tensor Y.]

Likelihood over the observed entries:

p(Y_Ω | {A^(n)}_{n=1}^{N}, τ) = Π_{i1=1}^{I1} ··· Π_{iN=1}^{IN} N( Y_{i1 i2 ... iN} | ⟨ a_{i1}^(1), a_{i2}^(2), ..., a_{iN}^(N) ⟩, τ^{-1} )^{O_{i1···iN}}.

Priors over the factor matrices, with Λ = diag(λ) the precision matrix shared across all modes:

p(A^(n) | λ) = Π_{in=1}^{In} N( a_{in}^(n) | 0, Λ^{-1} ),  ∀n ∈ [1, N],
p(λ) = Π_{r=1}^{R} Ga(λ_r | c_0^r, d_0^r),   p(τ) = Ga(τ | a_0, b_0),

where Ga(x | a, b) = b^a x^{a−1} e^{−bx} / Γ(a). Marginalizing a Gamma-distributed precision out of a Gaussian yields a (sparsity-inducing) Student-t distribution, T(x | 0, λ, ν) = ∫ N(x | 0, τ^{-1}) Ga(τ | a, b) dτ.

The posterior and the predictive distribution over missing entries are

p(Θ | Y_Ω) = p(Θ, Y_Ω) / ∫ p(Θ, Y_Ω) dΘ,
p(Y_{\Ω} | Y_Ω) = ∫ p(Y_{\Ω} | Θ) p(Θ | Y_Ω) dΘ,   where Θ = {A^(1), ..., A^(N), λ, τ}.
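A quick numerical check (a sketch, assuming scipy is available) that marginalizing a Ga(a, b) precision out of a zero-mean Gaussian gives a Student-t with df = 2a and scale sqrt(b/a):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a, b, n = 3.0, 2.0, 200_000
tau = rng.gamma(shape=a, scale=1.0 / b, size=n)   # Ga(tau | a, b), rate b
x = rng.normal(scale=1.0 / np.sqrt(tau))          # x | tau ~ N(0, tau^-1)

t = stats.t(df=2 * a, scale=np.sqrt(b / a))       # analytic marginal
print(np.quantile(x, [0.1, 0.5, 0.9]))            # empirical quantiles...
print(t.ppf([0.1, 0.5, 0.9]))                     # ...match the Student-t
```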
Variational Bayesian inference approximates the posterior with a tractable distribution q(Θ) by minimizing

KL( q(Θ) ‖ p(Θ | Y) ) = ln p(Y) − ∫ q(Θ) ln( p(Y, Θ) / q(Θ) ) dΘ = ln p(Y) − L(q);

since ln p(Y) = L(q) + KL(q‖p) is fixed, maximizing the lower bound L(q) minimizes the KL divergence.
With the mean-field factorization q(Θ) = q_λ(λ) q_τ(τ) Π_{n=1}^{N} q_n(A^(n)), the factor-matrix posteriors are Gaussian,

q_n(A^(n)) = Π_{in=1}^{In} N( a_{in}^(n) | ã_{in}^(n), V_{in}^(n) ),  ∀n, ∀i_n,

with updates

ã_{in}^(n) = E_q[τ] V_{in}^(n) E_q[ A_{in}^{(\n)T} ] vec( Y_{I(O_{in}=1)} ),
V_{in}^(n) = ( E_q[τ] E_q[ A_{in}^{(\n)T} A_{in}^{(\n)} ] + E_q[Λ] )^{-1},

where A_{in}^{(\n)} = ( ⊙_{k≠n} A^(k) )_{I(O_{in}=1)} is the Khatri-Rao product of all factor matrices except the n-th, restricted to the rows corresponding to observed entries. These local updates correspond to variational message passing on the graphical model.
The hyperparameter posteriors are also available in closed form. For λ:

q_λ(λ) = Π_{r=1}^{R} Ga(λ_r | c_M^r, d_M^r),
c_M^r = c_0^r + (1/2) Σ_{n=1}^{N} I_n,
d_M^r = d_0^r + (1/2) Σ_{n=1}^{N} E_q[ a_{·r}^{(n)T} a_{·r}^{(n)} ].

For the noise precision, q_τ(τ) = Ga(τ | a_M, b_M), with posterior parameters updated by

a_M = a_0 + (1/2) Σ_{i1,...,iN} O_{i1,...,iN},
b_M = b_0 + (1/2) E_q[ ‖ O ⊛ ( Y − [[A^(1), ..., A^(N)]] ) ‖_F² ].

All updates are computed by variational message passing on the graphical model.
Simulation: the true factors are drawn as a_{in}^(n) ~ N(0, I_R), ∀n, ∀i_n; the observed tensor is constructed by Y = [[A^(1), ..., A^(N)]] + ε, where ε ~ Π_{i1,...,iN} N(0, σ²) denotes i.i.d. additive noise whose parameter controls the SNR (e.g. noise precision σ^{-2} = 1000).
[Figure: completion performance of FBCP, FBCP-MP, CPWOPT, STDC, HaLRTC, FaLRTC, FCSA, HardC., and KTD at missing rates of 70%, 80%, 90%, and 95%.]
Applications: completion of facial images under varying illuminations, and of frames from surveillance video.
Method    36/270 (T / M)   49/270 (T / M)   64/270 (T / M)   81/270 (T / M)
FBCP      0.06 / 0.10      0.06 / 0.10      0.09 / 0.15      0.12 / 0.20
CPWOPT    0.53 / 0.65      0.56 / 0.61      0.58 / 0.59      0.65 / 0.73
FaLRTC    0.11 / 0.28      0.13 / 0.30      0.15 / 0.31      0.19 / 0.34
HardC.    0.37 / 0.37      0.37 / 0.40      0.37 / 0.40      0.37 / 0.40
Matrix factorization does not work when an entire row or column is missing.
[Figure: visual comparison of the ground truth with FBCP, FaLRTC, and CPWOPT reconstructions.]
Bayesian Sparse Tucker Decomposition

First consider a Bayesian model of a tensor Y ∈ R^{I1×···×IN} that is a noisy measurement of a latent tensor X, i.e., Y = X + ε, where X admits a Tucker representation

X = G ×_1 U^(1) ×_2 U^(2) ··· ×_N U^(N),

so that

vec(Y) | {U^(n)}, G, τ ~ N( (⊗_n U^(n)) vec(G), τ^{-1} I ).

Hierarchical sparsity-inducing priors:
✓ group sparsity priors over the factors: u_{in}^(n) ~ N( 0, Λ^{(n)-1} ), ∀n, ∀i_n
✓ slice sparsity priors over the core: vec(G) | {λ^(n)}, β ~ N( 0, (β ⊗_n Λ^(n))^{-1} ), with β ~ Ga(a_0^β, b_0^β)
✓ shared sparsity patterns between core and factors via λ^(n), with two choices of hyperprior:
  Student-t: λ_{rn}^(n) ~ Ga(a_0^λ, b_0^λ), ∀n, ∀r_n;
  Laplace: λ_{rn}^(n) ~ IG(1, γ/2), ∀n, ∀r_n, with γ ~ Ga(a_0^γ, b_0^γ)
✓ noise precision: τ ~ Ga(a_0^τ, b_0^τ)

The joint distribution of the model factorizes as
p(Y, Θ) = p(Y | {U^(n)}, G, τ) Π_n p(U^(n) | λ^(n)) p(G | {λ^(n)}, β) Π_n p(λ^(n)) p(β) p(γ) p(τ).

[Figure: Tucker decomposition of an I1×I2×I3 tensor into a core G of size R1×R2×R3 and factor matrices U^(1), U^(2), U^(3).]
Model inference proceeds by variational Bayes with the factorization

q(Θ) = q(G) q(β) Π_n q(U^(n)) Π_n q(λ^(n)) q(γ) q(τ).

Posterior over the core:
q(G) = N( vec(G) | vec(G̃), Σ_G ),
vec(G̃) = E[τ] Σ_G ( ⊗_n E[U^(n)T] ) vec(Y),
Σ_G = ( E[β] ⊗_n E[Λ^(n)] + E[τ] ⊗_n E[U^(n)T U^(n)] )^{-1}.

Posterior over the factor matrices, for n = 1, ..., N:
q(U^(n)) = Π_{in=1}^{In} N( u_{in}^(n) | ũ_{in}^(n), Ψ^(n) ),
Ũ^(n) = E[τ] Y_(n) ( ⊗_{k≠n} E[U^(k)] ) E[ G_(n)ᵀ ] Ψ^(n),
Ψ^(n) = ( E[Λ^(n)] + E[τ] E[ G_(n) ( ⊗_{k≠n} U^(k)T U^(k) ) G_(n)ᵀ ] )^{-1}.

Posterior over the sparsity hyperparameters, for n = 1, ..., N:
q(λ^(n)) = Π_{rn=1}^{Rn} Ga( λ_{rn}^(n) | ã_{rn}^(n), b̃_{rn}^(n) ),
ã_{rn}^(n) = a_0^λ + (1/2)( I_n + Π_{k≠n} R_k ),
b̃_{rn}^(n) = b_0^λ + (1/2) E[ u_{·rn}^{(n)T} u_{·rn}^{(n)} ] + (1/2) E[β] E[ vec(G²_{···rn···}) ]ᵀ ⊗_{k≠n} E[λ^(k)].

Posterior over the noise precision, q(τ) = Ga(a_M^τ, b_M^τ), with
a_M^τ = a_0^τ + (1/2) Π_n I_n,
b_M^τ = b_0^τ + (1/2) E[ ‖ vec(Y) − (⊗_n U^(n)) vec(G) ‖_F² ].

Inference thus jointly estimates the core G, the factors U^(n), and the noise precision τ.
Bayesian Sparse Tucker Completion

For partially observed data the model becomes Y_Ω = X_Ω + ε, with X = G ×_1 U^(1) ×_2 U^(2) ··· ×_N U^(N), where Ω denotes the set of observed indices and O is the binary indicator tensor with O_{i1···iN} = 1 if (i1, ..., iN) ∈ Ω. The likelihood is restricted to the observed entries,

p(Y_Ω | ·) = Π_{i1,...,iN} N( Y_{i1···iN} | ( ⊗_n u_{in}^{(n)T} ) vec(G), τ^{-1} )^{O_{i1···iN}},

and the predictive distribution over a missing entry is

p(Y_{i1···iN} | Y_Ω) = ∫ p(Y_{i1···iN} | Θ) p(Θ | Y_Ω) dΘ.
Demonstration of Learning Procedure
Table III: performance of MRI completion evaluated by PSNR and RRSE (each cell: PSNR / RRSE). For noisy MRI, the standard deviation of the Gaussian noise is 3% of the brightest tissue. The MRI tensor is of size 181 × 217 × 165 and each block tensor is of size 50 × 50 × 10.

Missing rate:  50%                    60%                    70%                    80%
Method         Original    Noisy      Original    Noisy      Original    Noisy      Original    Noisy
BSTC-T         27.32/0.11  26.18/0.12 25.30/0.14  24.60/0.15 22.81/0.18  22.35/0.19 20.14/0.25  20.00/0.25
BSTC-L         26.91/0.11  25.57/0.13 24.84/0.15  23.95/0.16 22.76/0.19  22.09/0.20 20.12/0.25  19.80/0.26
iHOOI          22.69/0.19  21.45/0.22 22.47/0.19  21.16/0.22 21.63/0.21  20.11/0.25 18.65/0.30  17.89/0.32
HaLRTC         24.84/0.15  23.60/0.17 22.35/0.19  21.65/0.21 19.93/0.26  19.55/0.27 17.37/0.34  17.15/0.35
[Figure: MRI completion at (a) 50% missing (noisy input, SNR = 20 dB, PSNR = 26 dB) and (b) 80% missing (noisy input, SNR = 20 dB, PSNR = 22 dB), showing the missing-data input and the estimation.]
Robust Bayesian CP factorization: the observation is augmented with a sparse outlier tensor S, i.e., Y_Ω = ( [[A^(1), ..., A^(N)]] + S )_Ω + ε.

[Figure: graphical model with (c_0, d_0) over λ governing the factors A^(1), ..., A^(N), (a_0^γ, b_0^γ) over γ governing S_Ω, and (a_0^τ, b_0^τ) over the noise precision τ.]

Likelihood:

p(Y_Ω | {A^(n)}_{n=1}^{N}, S_Ω, τ) = Π_{i1=1}^{I1} ··· Π_{iN=1}^{IN} N( Y_{i1...iN} | ⟨ a_{i1}^(1), ..., a_{iN}^(N) ⟩ + S_{i1...iN}, τ^{-1} )^{O_{i1···iN}}.

Priors:

p(A^(n) | λ) = Π_{in=1}^{In} N( a_{in}^(n) | 0, Λ^{-1} ), ∀n ∈ [1, N],   p(λ) = Π_{r=1}^{R} Ga(λ_r | c_0, d_0),
p(S_Ω | γ) = Π_{i1,...,iN} N( S_{i1...iN} | 0, γ_{i1...iN}^{-1} )^{O_{i1...iN}},   p(γ) = Π_{i1,...,iN} Ga( γ_{i1...iN} | a_0^γ, b_0^γ ),
p(τ) = Ga(τ | a_0^τ, b_0^τ),

and the joint distribution is the product of the likelihood and all priors.
Demo of the model learning procedure (simulated data with 20 dB observation noise and outliers of magnitude 10*std(X)).
Solving scheme 3: tensor decomposition by gradient-based optimization

Low-rank approximation: find a low-rank tensor decomposition (cores G^(1), ..., G^(N)) from the observed entries only [Yuan, et al., 2017].

[Figure: an incomplete tensor is fitted by a tensor decomposition, and the missing entries are recovered from the learned cores.]

Tensor train decomposition (TTD) [Oseledets, et al., 2011]: decompose a tensor X ∈ R^{I1×I2×···×IN} into the TT format, where each element is

x_{i1···iN} = Π_{n=1}^{N} G^(n)_{in}.

✓ core tensors: G^(n) ∈ R^{r_{n−1}×In×rn}, n = 1, 2, ..., N
✓ slices: G^(n)_{in} ∈ R^{r_{n−1}×rn}
✓ TT-rank: {r_0, r_1, ..., r_N}, with r_0 = r_N = 1
Tensor train stochastic gradient descent (TT-SGD) [Yuan, et al., 2018] treats the incomplete tensor as a sparse tensor, using only the observed entries.

Loss function over the M observed entries:

f(G^(1), G^(2), ..., G^(N)) = (1/2) Σ_{m=1}^{M} ‖ y_m − x_m ‖²,

where y_m = Y(i_1^m, i_2^m, ..., i_N^m) is an observed entry and x_m = Π_{n=1}^{N} G^(n)_{i_n^m} is its TT approximation.

For one observed entry, the gradient with respect to the corresponding slice of each core tensor is

∂f / ∂G^(n)_{i_n^m} = (x_m − y_m) ( G^{>n}_{i^m} G^{<n}_{i^m} )ᵀ,   (11)

where G^{>n}_{i^m} = Π_{k=n+1}^{N} G^(k)_{i_k^m} and G^{<n}_{i^m} = Π_{k=1}^{n−1} G^(k)_{i_k^m}.
[Figure: TT-SGD overview, in which the observed data are cast into a higher-order tensor by tensorization, the TT cores are fitted by the TT-SGD algorithm, and the missing data y_{i1···iN} are recovered by prediction from the learned cores.]
[Yuan, et al., 2018]
Algorithm 2: Tensor-train Stochastic Gradient Descent (TT-SGD)
1: Input: incomplete tensor Y and TT-rank r.
2: Initialization: core tensors G^(1), G^(2), ..., G^(N) of the approximated tensor X.
3: While the optimization stopping condition is not satisfied:
4:   Randomly sample one observed entry from Y w.r.t. index {i1, i2, ..., iN}.
5:   For n = 1 : N
6:     Compute the gradients of the corresponding core-tensor slices by equation (11).
7:   End
8:   Update G^(1)_{i1}, G^(2)_{i2}, ..., G^(N)_{iN} by gradient descent.
9: End while
10: Output: G^(1), G^(2), ..., G^(N).
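A runnable sketch of the TT-SGD update for a single observed entry (synthetic data; the learning rate and initialization are illustrative):

```python
import numpy as np

def tt_entry(cores, idx):
    v = np.ones((1, 1))
    for G, i in zip(cores, idx):
        v = v @ G[:, i, :]
    return v.item()

def tt_sgd_step(cores, idx, y, lr=0.05):
    x = tt_entry(cores, idx)
    grads = []
    for n in range(len(cores)):
        left = np.ones((1, 1))                    # G^{<n}: product of slices k < n
        for k in range(n):
            left = left @ cores[k][:, idx[k], :]
        right = np.eye(cores[n].shape[-1])        # G^{>n}: product of slices k > n
        for k in range(n + 1, len(cores)):
            right = right @ cores[k][:, idx[k], :]
        grads.append((x - y) * (right @ left).T)  # equation (11)
    for n, g in enumerate(grads):                 # update only the sampled slices
        cores[n][:, idx[n], :] -= lr * g

rng = np.random.default_rng(0)
shape, ranks = (6, 7, 8), [1, 3, 3, 1]
cores = [0.5 * rng.normal(size=(ranks[n], shape[n], ranks[n + 1])) for n in range(3)]
Y, idx = rng.normal(size=shape), (2, 4, 1)
for _ in range(200):
    tt_sgd_step(cores, idx, Y[idx])
print(Y[idx], tt_entry(cores, idx))   # the fitted entry approaches the observation
```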
Experiment results: image recovery under 99% random missing, block missing, line missing, and scratch patterns.
High-order tensorization of a 256×256×3 image (from 3-way to 9-way):
1. Reshape 256×256×3 to 2×2×···×2×3 (a 17-way tensor).
2. Permute the modes by {1 9 2 10 3 11 4 12 5 13 6 14 7 15 8 16 17}.
3. Reshape to 4×4×4×4×4×4×4×4×3 (a 9-way tensor).

Better data structure: the first mode represents a 2×2 pixel block, the second mode represents four 2×2 pixel blocks, and so on; this captures more of the structural relations in the data.
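A numpy sketch of this tensorization (note: the slide's permutation {1 9 2 10 ...} is 1-based and MATLAB reshape is column-major; the sketch below expresses the same interleaving idea 0-based with numpy's row-major reshape):

```python
import numpy as np

img = np.zeros((256, 256, 3))              # placeholder image
t17 = img.reshape([2] * 16 + [3])          # step 1: 17-way tensor
perm = [0, 8, 1, 9, 2, 10, 3, 11, 4, 12, 5, 13, 6, 14, 7, 15, 16]
t17 = t17.transpose(perm)                  # step 2: interleave row/column bits
t9 = t17.reshape([4] * 8 + [3])            # step 3: 9-way tensor
print(t9.shape)                            # (4, 4, 4, 4, 4, 4, 4, 4, 3)
```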
[Yuan, et al., 2017]
Comparison of applying tensorization [Yuan, et al., 2017]:
[Figure: completion of a 3D tensor, a 9D tensor, and a 9D tensor with the proposed tensorization under 90% random missing, comparing TT-WOPT, TT-SGD, CP-WOPT, FBCP, TLnR, and HaLRTC.]
Tensor-based denoising (IEEE TIP 2013; IEEE TPAMI 2013):
✓ group similar cubes of the noisy volume by cube-matching and stack each group into a tensor
✓ apply tensor factorization (HOSVD, BCPF) to each group to obtain the denoised estimate
✓ the classical pipeline requires the noise variance to be known; Bayesian factorization provides automatic noise estimation
[Figure: noisy T1 MRI vs. denoised MRI.]
Learning efficient tensor representations with ring structure networks (ICLR Workshop 2018)

Motivation:
✓ the tensor train format is too strict, since TT-ranks are bounded by the ranks of the k-unfolding matricizations
✓ TT gives inconsistent solutions under permutations of the data modes

Proposed model:
✓ a more general model without the boundary-rank constraint
✓ a sum of TTs with partially shared core tensors
✓ tensor ring ranks: r_1 = r_{d+1} need not equal 1
Tensor Ring Decomposition: TT requires r_1 = r_{d+1} = 1, whereas TR allows r_1 = r_{d+1} > 1 and closes the chain of cores with a trace:

T(i_1, i_2, ..., i_d) = Tr{ Z_1(i_1) Z_2(i_2) ··· Z_d(i_d) } = Tr( Π_{k=1}^{d} Z_k(i_k) ),

or, element-wise,

T(i_1, i_2, ..., i_d) = Σ_{α1,...,αd=1}^{r1,...,rd} Π_{k=1}^{d} Z_k(α_k, i_k, α_{k+1}).

[Figure: tensor network diagram of the TR decomposition of T (modes n_1, ..., n_d) into cores Z_1, ..., Z_d connected in a ring through ranks r_1, ..., r_d.]

Theorems 1-2 relate the TR ranks to the ranks of the k-unfoldings T_⟨k⟩: Rank(T_⟨k⟩) ≤ r_1 r_{k+1}, so TR ranks are not limited by the unfolding ranks in the way TT ranks are.

Note that, due to the trace operation, the representation is invariant to circular permutation of the cores:

T(i_1, i_2, ..., i_d) = Tr( Z_2(i_2) Z_3(i_3) ··· Z_d(i_d) Z_1(i_1) ) = ··· = Tr( Z_d(i_d) Z_1(i_1) ··· Z_{d−1}(i_{d−1}) ).
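A TR entry is a trace of a chain of core slices; a minimal sketch, including a check of the circular-shift invariance noted above:

```python
import numpy as np

def tr_entry(cores, idx):
    M = np.eye(cores[0].shape[0])          # r_1 x r_1 identity
    for Z, i in zip(cores, idx):
        M = M @ Z[:, i, :]                 # chain Z_1(i_1) ... Z_d(i_d)
    return np.trace(M)

rng = np.random.default_rng(0)
shape, r = (4, 5, 6), 3                    # equal TR-ranks for simplicity
cores = [rng.normal(size=(r, n, r)) for n in shape]

idx = (1, 2, 3)
v1 = tr_entry(cores, idx)
v2 = tr_entry(cores[1:] + cores[:1], idx[1:] + idx[:1])   # circular shift
print(np.isclose(v1, v2))                  # True, by the trace property
```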
Algorithms: ALS-type updates follow from the mode-k unfolding identity

T_[k] = Z_k(2) ( Z^{≠k}_[2] )ᵀ,

where Z^{≠k} is the tensor obtained by merging all cores except Z_k. Sequential SVD-type algorithms follow from

T_⟨1⟩(i_1, i_2 ··· i_d) = Σ_{α1,α2} Z_1(i_1, α_1 α_2) Z^{>1}(α_1 α_2, i_2 ··· i_d),   (15)
Z^{>1}(α_2 i_2, i_3 ··· i_d α_1) = Σ_{α3} Z_2(α_2 i_2, α_3) Z^{>2}(α_3, i_3 ··· i_d α_1).   (17)
TR arithmetic, addition: let T_1, T_2 ∈ R^{n1×···×nd} have TR decompositions T_1 = ℜ(Z_1, ..., Z_d) and T_2 = ℜ(Y_1, ..., Y_d). The addition of these two tensors, T_3 = T_1 + T_2, can also be represented in TR format, T_3 = ℜ(X_1, ..., X_d), where each core slice is block-diagonal:

X_k(i_k) = blockdiag( Z_k(i_k), Y_k(i_k) ),   i_k = 1, ..., n_k,  k = 1, ..., d.
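A sketch verifying the block-diagonal construction on one entry (tr_entry as in the earlier sketch; block_diag from scipy is assumed available):

```python
import numpy as np
from scipy.linalg import block_diag

def tr_entry(cores, idx):
    M = np.eye(cores[0].shape[0])
    for Z, i in zip(cores, idx):
        M = M @ Z[:, i, :]
    return np.trace(M)

def tr_add(cores_a, cores_b):
    out = []
    for Za, Zb in zip(cores_a, cores_b):
        slices = [block_diag(Za[:, i, :], Zb[:, i, :]) for i in range(Za.shape[1])]
        out.append(np.stack(slices, axis=1))      # shape (ra+rb, n_k, ra+rb)
    return out

rng = np.random.default_rng(1)
A = [rng.normal(size=(2, n, 2)) for n in (3, 4, 5)]
B = [rng.normal(size=(3, n, 3)) for n in (3, 4, 5)]
C, idx = tr_add(A, B), (0, 1, 2)
print(np.isclose(tr_entry(C, idx), tr_entry(A, idx) + tr_entry(B, idx)))  # True
```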
Multilinear product: let T = ℜ(Z_1, ..., Z_d) ∈ R^{n1×···×nd} be a dth-order tensor in TR format and let u_k, k = 1, ..., d, be a set of vectors; then the multilinear product

c = T ×_1 u_1ᵀ ×_2 ··· ×_d u_dᵀ

can be computed by a product on each core:

c = ℜ(X_1, ..., X_d),   where X_k = Σ_{ik=1}^{nk} Z_k(i_k) u_k(i_k).
Hadamard product: for T_1 = ℜ(Z_1, ..., Z_d) and T_2 = ℜ(Y_1, ..., Y_d) of the same size, the element-wise product T_3 = T_1 ⊛ T_2 is also in TR format, T_3 = ℜ(X_1, ..., X_d), where each core slice is the Kronecker product

X_k(i_k) = Z_k(i_k) ⊗ Y_k(i_k),   k = 1, ..., d.
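The corresponding sketch for the Hadamard product, using the Kronecker mixed-product and trace properties:

```python
import numpy as np

def tr_entry(cores, idx):
    M = np.eye(cores[0].shape[0])
    for Z, i in zip(cores, idx):
        M = M @ Z[:, i, :]
    return np.trace(M)

def tr_hadamard(cores_a, cores_b):
    # slice-wise Kronecker products: X_k(i_k) = Z_k(i_k) kron Y_k(i_k)
    return [np.stack([np.kron(Za[:, i, :], Zb[:, i, :])
                      for i in range(Za.shape[1])], axis=1)
            for Za, Zb in zip(cores_a, cores_b)]

rng = np.random.default_rng(2)
A = [rng.normal(size=(2, n, 2)) for n in (3, 4)]
B = [rng.normal(size=(3, n, 3)) for n in (3, 4)]
C, idx = tr_hadamard(A, B), (1, 2)
print(np.isclose(tr_entry(C, idx), tr_entry(A, idx) * tr_entry(B, idx)))  # True
```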
Relations to other formats:

✓ CP as a TR with slice-diagonal cores: if T = Σ_{α=1}^{r} u_α^(1) ∘ ··· ∘ u_α^(d), then the CPD can be viewed as a TR decomposition T = ℜ(V_1, ..., V_d) by defining V_k(i_k) = diag( u^{(k)}(i_k, :) ) for each fixed i_k and k.

✓ Tucker as a TR: if T = G ×_1 U^(1) ×_2 ··· ×_d U^(d) and the core tensor admits a TR decomposition G = ℜ(V_1, ..., V_d), then the Tucker model can be viewed as a TR decomposition T = ℜ(Z_1, ..., Z_d) with the multilinear products Z_k = V_k ×_2 U^(k), k = 1, ..., d.

✓ TR is a sum of TT representations:

T(i_1, ..., i_d) = Tr{ Z_1(i_1) Z_2(i_2) ··· Z_d(i_d) } = Σ_{α1=1}^{r1} z_1(α_1, i_1, :)ᵀ Z_2(i_2) ··· Z_{d−1}(i_{d−1}) z_d(:, i_d, α_1),

i.e. a sum of r_1 TT terms; TT is recovered as the special case where ∃n with r_n = 1.
Tensorizing a 16×16 image block into a 4×4×4×4 block format (via 2×2 → 4×4 → 8×8 → 16×16 blocks): the first mode represents a 2×2 pixel block, the second mode represents four such blocks, and so on.
In terms of representation parameters, tensorization is important and still under-explored.
Table 4: image representation using tensorization and TR decomposition; the number of parameters is compared for SVD, TT, and TR at the same approximation error ε. A block addressing procedure is used to cast an image into a higher-order tensor.

Without tensorization (n = 256, d = 2; TT/TR coincide with SVD):
  ε = 0.1:    SVD 9.7e3,  TT/TR 9.7e3
  ε = 0.01:   SVD 7.2e4,  TT/TR 7.2e4
  ε = 9e-4:   SVD 1.2e5,  TT/TR 1.2e5
  ε = 2e-15:  SVD 1.3e5,  TT/TR 1.3e5

With tensorization:
                  ε = 0.1             ε = 0.01            ε = 2e-3            ε = 1e-14
  n = 16, d = 4:  TT 5.1e3, TR 3.8e3  TT 6.8e4, TR 6.4e4  TT 1.0e5, TR 7.3e4  TT 1.3e5, TR 7.4e4
  n = 4, d = 8:   TT 4.8e3, TR 4.3e3  TT 7.8e4, TR 7.8e4  TT 1.1e5, TR 9.8e4  TT 1.3e5, TR 1.0e5
  n = 2, d = 16:  TT 7.4e3, TR 7.4e3  TT 1.0e5, TR 1.0e5  TT 1.5e5, TR 1.5e5  TT 1.7e5, TR 1.7e5
[Figure 7: classification performance of tensorizing neural networks using the TR representation; training error (%) and testing error (%) versus ranks 2-6, comparing a TT-layer and a TR-layer.]