SLIDE 1

Tensor Methods for Signal Processing and Machine Learning

Qibin Zhao, Tensor Learning Unit, RIKEN AIP

2018-6-9 @ Waseda University

SLIDE 2

Monographs

Tensor Networks for Dimensionality Reduction and Large-Scale Optimization

Andrzej Cichocki, Namgil Lee, Ivan Oseledets, Anh-Huy Phan, Qibin Zhao and Danilo P. Mandic

SLIDE 3

[Figure: facial image ensemble varying by people, expressions, illumination and views]

Multidimensional structured data

  • Data ensembles affected by multiple factors
  • Facial images (expression × people × illumination × views)
  • Collaborative filtering (user × item × time)
  • Multidimensional structured data, e.g.,
  • EEG, ECoG (channel × time × frequency)
  • fMRI (3D volume indexed by Cartesian coordinates)
  • Video sequences (width × height × frame)

SLIDE 4

Tensor Representation of EEG Signals

[Figure: EEG represented as a channel × time-frequency × epoch tensor]

Matricization causes loss of useful multiway information. It is favorable to analyze multi-dimensional data in their own domain.


SLIDE 5

Outline

  • Tensor Regression and Classification
  • TensorNets for Deep Neural Networks Compression
  • (Multi-)Tensor Completion
  • Tensor Denoising


SLIDE 6

Machine Learning Tasks

  • Supervised (and semi-supervised) learning: predict a target y from an input x

✓ classification: target y represents a category or class
✓ regression: target y is a real-valued number

  • Unsupervised learning: no explicit prediction target y

✓ density estimation: model the probability distribution of input x
✓ clustering, dimensionality reduction: discover underlying structure in input x

Unsupervised learning models p(X); supervised and semi-supervised learning model p(y|X)

[Figure: supervised learning uses labeled data D; unsupervised learning uses unlabeled data D̃ (find hidden structure); semi-supervised learning uses both D and D̃]

SLIDE 7

Classical Regression Models

  • Regression models

✓ predict one or more responses (dependent variables, outputs) from a set of predictors (independent variables, inputs)
✓ identify the key predictors (independent variables, inputs)

  • Linear and nonlinear regression models

✓ linear models: simple regression, multiple regression, multivariate regression, generalized linear model, partial least squares (PLS)
✓ nonlinear models: Gaussian process (GP), artificial neural networks (ANN), support vector regression (SVR)

image credit: Laerd Statistics

SLIDE 8

Basic Linear Regression Model

  • A basic linear regression model in vector form is defined as

y = f(x; w, b) = ⟨x, w⟩ + b = wᵀx + b,

✓ x ∈ ℝᴵ is the input vector of independent variables
✓ w ∈ ℝᴵ is the vector of regression coefficients
✓ b is the bias
✓ y is the regression output or dependent/target variable
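A minimal numpy sketch (illustrative, not part of the slides) of fitting w and b by ordinary least squares on synthetic data:

```python
import numpy as np

# Synthetic data for the model y = w^T x + b + noise (hypothetical example)
rng = np.random.default_rng(0)
I, M = 5, 200                               # input dimension, sample count
w_true, b_true = rng.normal(size=I), 0.5
X = rng.normal(size=(M, I))
y = X @ w_true + b_true + 0.01 * rng.normal(size=M)

# Append a column of ones so the bias b is estimated jointly with w
X1 = np.hstack([X, np.ones((M, 1))])
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
w_hat, b_hat = coef[:-1], coef[-1]
print(np.allclose(w_hat, w_true, atol=0.01), round(b_hat, 3))
```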

SLIDE 9

Tensor Data in Real-world Applications

  • Medical imaging data analysis

✓ MRI data: x-coordinate × y-coordinate × z-coordinate
✓ fMRI data: time × x-coordinate × y-coordinate × z-coordinate

  • Neural signal processing

✓ EEG data: time × frequency × channel

  • Computer vision

✓ video data: frame × x-coordinate × y-coordinate
✓ face image data: pixel × illumination × expression × viewpoint × identity

  • Climate data analysis

✓ climate forecast data: month × location × variable

  • Chemistry

✓ fluorescence excitation-emission data: sample × excitation × emission

SLIDE 10

Real-world Regression Tasks with Tensors

  • Goal: to find associations between brain images and clinical outcomes

✓ predictor: 3rd-order tensor (MRI images)
✓ response: scalar clinical diagnosis indicating whether one has some disease or not

SLIDE 11

Real-world Regression Tasks with Tensors Cont

  • Goal: to estimate 3D human pose positions from video sequences

✓ predictor: 4th-order tensor (RGB video or depth video)
✓ response: 3rd-order tensor (human motion capture data)

SLIDE 12

Real-world Regression Tasks with Tensors Cont

  • Goal: to reconstruct motion trajectories from brain signals

✓ predictor: 4th-order tensor (ECoG signals of a monkey)
✓ response: 3rd-order tensor (limb movement trajectories)

SLIDE 13

Motivations from New Regression Challenges

  • Classical regression models transform tensors into vectors via vectorization operations, then feed them to two-way data analysis techniques for solutions

✓ vectorizing operations destroy the underlying multiway structure, i.e., spatial and temporal correlations among voxels in an fMRI are ignored
✓ ultrahigh tensor dimensionality produces huge numbers of parameters, i.e., an fMRI of size 100 × 256 × 256 × 256 yields 167 million!
✓ difficulty of interpretation, sensitivity to noise, absence of uniqueness

  • Tensor-based regression models directly model tensors using multiway factor models and multiway analysis techniques

✓ naturally preserve multiway structural knowledge, which is useful in mitigating the small sample size problem
✓ compactly represent regression coefficients using only a few parameters
✓ ease of interpretation, robustness to noise, uniqueness property

SLIDE 14

Basic Tensor Regression Model

  • A basic linear tensor regression model can be formulated as

y = f(X; W, b) = ⟨X, W⟩ + b,

✓ X ∈ ℝ^{I1×···×IN} is the input tensor predictor or tensor regressor
✓ W ∈ ℝ^{I1×···×IN} is the regression coefficient tensor (also called the model tensor)
✓ b is the bias
✓ y is the regression output or dependent/target variable
✓ ⟨X, W⟩ = vec(X)ᵀ vec(W) is the inner product of two tensors
✓ sparse regularization, like a lasso penalty on W, further improves the performance

  • The learning of the tensor regression model is typically formulated as the minimization of the following squared cost function

J(W, b) = Σ_{m=1}^{M} (y_m − ⟨W, X_m⟩ − b)²,

✓ {(X_m, y_m)}, m = 1, ..., M, are the M pairs of training samples used to fit the TR model
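As a minimal illustration (a sketch assuming nothing beyond the formulas above), the tensor inner product and the cost can be written directly in numpy:

```python
import numpy as np

def tensor_inner(X, W):
    """<X, W> = vec(X)^T vec(W) for two same-shaped tensors."""
    return float(np.sum(X * W))

def squared_cost(W, b, Xs, ys):
    """J(W, b) = sum_m (y_m - <W, X_m> - b)^2 over the training pairs."""
    return sum((y - tensor_inner(W, X) - b) ** 2 for X, y in zip(Xs, ys))

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 5, 6))                         # 3rd-order coefficients
Xs = [rng.normal(size=(4, 5, 6)) for _ in range(10)]   # toy predictors
ys = [tensor_inner(W, X) + 0.1 for X in Xs]            # b = 0.1, noiseless
print(squared_cost(W, 0.1, Xs, ys))                    # ~0.0
```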

SLIDE 15

CP Regression Model

  • The linear CP tensor regression [Zhou et al., 2013] model is given by

y = f(X; W, b) = ⟨X, W⟩ + b,

where the coefficient tensor W is assumed to follow a CP decomposition

W = Σ_{r=1}^{R} u_r^(1) ∘ u_r^(2) ∘ ··· ∘ u_r^(N) = [[U^(1), U^(2), ..., U^(N)]]

  • The advantages of CP regression

✓ substantial reduction in dimensionality, i.e., for a 128 × 128 × 128 MRI image, the parameters reduce from 2,097,157 to 1,157 via a rank-3 decomposition
✓ a low-rank CP model can provide a sound recovery of many low-rank signals
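A small numpy sketch (hypothetical sizes, not the paper's code) of building a CP-structured coefficient tensor and of the parameter saving quoted above:

```python
import numpy as np

def cp_tensor(factors):
    """W = sum_r u_r^(1) o ... o u_r^(N) from factor matrices U^(n) of size I_n x R."""
    R = factors[0].shape[1]
    W = np.zeros([U.shape[0] for U in factors])
    for r in range(R):
        outer = factors[0][:, r]
        for U in factors[1:]:
            outer = np.multiply.outer(outer, U[:, r])  # rank-1 outer product
        W += outer
    return W

rng = np.random.default_rng(0)
U = [rng.normal(size=(128, 3)) for _ in range(3)]      # rank-3 factors
W = cp_tensor(U)
# full tensor: 128**3 = 2,097,152 entries; CP factors: 3 * 128 * 3 = 1,152
print(W.shape, sum(u.size for u in U))                 # (128, 128, 128) 1152
```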

SLIDE 16

Tucker Regression Model

  • The linear Tucker tensor regression [Li et al., 2013] model is given by

y = f(X; W, b) = ⟨X, W⟩ + b,

where the coefficient tensor W is assumed to follow a Tucker decomposition

W = G ×1 U^(1) ×2 U^(2) ··· ×N U^(N)

  • The shared advantages of Tucker regression with CP regression

✓ substantially reduces the dimensionality
✓ provides a sound low-rank approximation to potentially high-rank signals

  • The advantages of Tucker regression over CP regression

✓ offers freedom in the choice of different ranks when tensor data is skewed across dimensions
✓ explicitly models the interactions between factor matrices
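For concreteness, a 3rd-order numpy sketch (illustrative only) of a Tucker-structured coefficient tensor built from a core and factor matrices:

```python
import numpy as np

def tucker_tensor(G, U1, U2, U3):
    """W = G x1 U1 x2 U2 x3 U3 via a single einsum (3rd-order case)."""
    return np.einsum('abc,ia,jb,kc->ijk', G, U1, U2, U3)

rng = np.random.default_rng(0)
G = rng.normal(size=(2, 3, 4))                          # core tensor
U1, U2, U3 = (rng.normal(size=(I, R))
              for I, R in [(10, 2), (11, 3), (12, 4)])
W = tucker_tensor(G, U1, U2, U3)
# full size 10*11*12 from only 2*3*4 + 10*2 + 11*3 + 12*4 parameters
print(W.shape)  # (10, 11, 12)
```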

SLIDE 17

General Linear Tensor Regression Model

  • A general tensor regression model can be obtained when the regression coefficient tensor W is of higher order than the input tensors X_m, leading to

Y_m = ⟨X_m, W⟩_N + E_m,  m = 1, ..., M,

✓ X_m ∈ ℝ^{I1×···×IN} is the Nth-order predictor tensor
✓ W ∈ ℝ^{I1×···×IP} is the Pth-order regression coefficient tensor, with P ≥ N
✓ Y_m ∈ ℝ^{I_{N+1}×···×I_P} is the (P−N)th-order response tensor
✓ ⟨·, ·⟩_N denotes a tensor contraction along the first N modes

  • This model allows the response to be a high-order tensor
  • This model includes many linear tensor regression models as special cases, i.e., CP regression, Tucker regression, etc.

SLIDE 18

PLS for Matrix Regression

  • The goal of partial least squares (PLS) regression is to predict the response matrix Y from the predictor matrix X, and to describe their common latent structure

  • PLS regression consists of two steps

i) extract a set of latent variables of X and Y by performing a simultaneous decomposition of X and Y, such that the pairwise covariance between the latent variables of X and those of Y is maximized
ii) use the extracted latent variables to predict Y

SLIDE 19

PLS for Matrix Regression Cont

  • The standard PLS regression takes the form

X = TPᵀ + E = Σ_{r=1}^{R} t_r p_rᵀ + E,
Y = TDCᵀ + F = Σ_{r=1}^{R} d_{rr} t_r c_rᵀ + F,

✓ X ∈ ℝ^{I×J} is the matrix predictor and Y ∈ ℝ^{I×M} is the matrix response, decomposed simultaneously
✓ T = [t1, t2, ..., tR] ∈ ℝ^{I×R} contains the R latent variables from X
✓ U = TD = [u1, u2, ..., uR] ∈ ℝ^{I×R} represents the R latent variables from Y, which have maximum covariance with those of X
✓ P and C represent loadings (PLS regression coefficients), and D is diagonal

SLIDE 20

PLS for Matrix Regression Cont

  • PLS typically applies a deflation strategy to extract the latent variables t_r and u_r as well as all the loadings

  • A classical algorithm for the extraction process is the nonlinear iterative partial least squares regression algorithm (NIPALS-PLS) [Wold, 1984]

  • Having extracted all the factors, the prediction for a new test point X can be performed by

Ŷ = X W D Cᵀ,

where W is a weight matrix obtained from the NIPALS-PLS algorithm
SLIDE 21

High-order PLS for Tensor Regression

  • The goal of high-order partial least squares (HOPLS) [Zhao et al., 2011] regression is to predict the response tensor Y from the predictor tensor X and to describe their common latent structure

  • HOPLS extends PLS by projecting tensorial data onto a common latent subspace, but using a block Tucker decomposition [De Lathauwer, 2008]

  • Similarly, HOPLS regression consists of two steps

i) extract a set of latent variables of tensors X and Y by performing a simultaneous block Tucker decomposition of both, such that the pairwise covariance between the latent variables of X and those of Y is maximized
ii) use the extracted latent variables to predict tensor Y

SLIDE 22

HOPLS Framework

  • The standard HOPLS performs a joint block Tucker decomposition of both the predictor tensor and the response tensor:

X = Σ_{r=1}^{R} G_xr ×1 t_r ×2 P_r^(1) ··· ×_{N+1} P_r^(N) + E_R,
Y = Σ_{r=1}^{R} G_yr ×1 t_r ×2 Q_r^(1) ··· ×_{N+1} Q_r^(N) + F_R,

✓ X ∈ ℝ^{M×I1×···×IN} is the (N+1)th-order predictor tensor formed by concatenating M samples
✓ Y ∈ ℝ^{M×J1×···×JN} is the (N+1)th-order response tensor with the same sample size M
✓ t_r ∈ ℝᴹ is the latent variable for the r-th component; T = [t1, ..., tR] defines a latent matrix
✓ {P_r^(n) ∈ ℝ^{In×Ln}} and {Q_r^(n) ∈ ℝ^{Jn×Kn}}, n = 1, ..., N, are the mode-n loadings for the r-th component
✓ G_xr ∈ ℝ^{1×L1×···×LN} and G_yr ∈ ℝ^{1×K1×···×KN} are the core tensors for the r-th component

SLIDE 23

HOPLS Framework Cont

[Figure: schematic of the block Tucker (HOPLS) structure, X = Σ_{r=1}^{R} G_xr ×1 t_r ×2 P_r^(1) ··· + E_R and Y = Σ_{r=1}^{R} G_yr ×1 t_r ×2 Q_r^(1) ··· + F_R, showing the latent vectors t_r, the loadings P_r^(n), Q_r^(n) and the core tensors G_xr, G_yr]

SLIDE 24

HOPLS Framework: A Compact Formulation

  • The standard HOPLS can be rewritten in a more compact form:

X = G_x ×1 T ×2 P^(1) ··· ×_{N+1} P^(N) + E_R,
Y = G_y ×1 T ×2 Q^(1) ··· ×_{N+1} Q^(N) + F_R,

✓ T = [t1, ..., tR] is the latent matrix
✓ P^(n) = [P_1^(n), ..., P_R^(n)] and Q^(n) = [Q_1^(n), ..., Q_R^(n)] are the mode-n loading matrices
✓ G_x = blockdiag(G_x1, ..., G_xR) ∈ ℝ^{R×RL1×···×RLN} is the core tensor for the input
✓ G_y = blockdiag(G_y1, ..., G_yR) ∈ ℝ^{R×RK1×···×RKN} is the core tensor for the output

[Figure: block-diagonal core structure of the compact HOPLS form for X and Y]

SLIDE 25

HOPLS Experimental Results

  • Goal: to decode limb movement trajectories from ECoG signals of a monkey

✓ dataset: ECoG food-tracking data
✓ predictor: 4th-order tensor (sample × time × frequency × channel)
✓ response: 3rd-order tensor (sample × time × 3D marker positions)

figure credit [Zhao et al., 2013]

SLIDE 26

Outline

  • Tensor Regression
  • TensorNets for Deep Neural Networks Compression
  • (Multi-)Tensor Completion
  • Tensor Denoising


SLIDE 27

Background

  • Deep Neural Networks (DNNs) achieve state-of-the-art performance in many large-scale machine learning applications

✓ e.g., computer vision, speech recognition, text processing, etc.

  • DNNs have thousands of nodes and millions of learnable parameters and are trained using millions of images on GPUs

  • DNNs reach the hardware limits both in terms of computational power and memory

  • DNNs reach the memory limit, with 89% [Simonyan and Zisserman, 2015] or even 100% [Xue et al., 2013] of the memory occupied by the weight matrices of the fully-connected layers

SLIDE 28

VGGNet Example

  • The huge number of parameters of the FC layers is the bottleneck in a typical DNN like VGGNet [Simonyan and Zisserman, 2015]

SLIDE 29

TensorNet for DNN Compression

  • TensorNet [Novikov et al., 2015] applies the tensor train (TT) [Oseledets, 2011] format to represent the dense weight matrices of the fully-connected layers using fewer parameters, while keeping enough flexibility to perform signal transformations

  • The advantages of TensorNet

✓ compatible with the existing training algorithms for neural networks
✓ matches the performance of the uncompressed counterparts, with a compression factor of the FC-layer weights of up to 200,000 times, leading to a compression factor of the whole network of up to 7 times
✓ able to use more hidden units than was feasible before

SLIDE 30

Tensor Train Decomposition

  • Recall that in index form, the tensor train decomposition (TTD) can be represented by

X(i1, i2, ..., id) ≈ Σ_{α0,...,αd} G1[i1](α0, α1) G2[i2](α1, α2) ··· Gd[id](α_{d−1}, α_d)

✓ i.e., an illustration of the TTD of a 5th-order tensor into cores G1, G2, G3, G4, G5
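A tiny numpy sketch (hypothetical ranks and sizes) that evaluates one element of a TT tensor by chaining the core slices:

```python
import numpy as np

def tt_element(cores, idx):
    """X(i1,...,id) from TT cores G_k of shape (r_{k-1}, n_k, r_k), r_0 = r_d = 1."""
    v = cores[0][:, idx[0], :]                 # shape (1, r_1)
    for G, i in zip(cores[1:], idx[1:]):
        v = v @ G[:, i, :]                     # chain of slice products
    return v.item()                            # final shape (1, 1)

rng = np.random.default_rng(0)
shapes = [(1, 4, 2), (2, 4, 3), (3, 4, 3), (3, 4, 2), (2, 4, 1)]  # 5th-order TT
cores = [rng.normal(size=s) for s in shapes]
print(tt_element(cores, (0, 1, 2, 3, 0)))
```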

SLIDE 31

TT-vector

  • A TT-vector converts a long vector into TT format

✓ vector b ∈ ℝᴺ, where N = Π_{k=1}^{d} n_k
✓ ℓ ∈ {1, ..., N} is a coordinate of the vector b ∈ ℝᴺ
✓ μ(ℓ) = (μ1(ℓ), μ2(ℓ), ..., μd(ℓ)) is the d-dimensional vector-index of the tensorized b, where μk(ℓ) ∈ {1, ..., nk}
✓ it holds that B(μ(ℓ)) = b(ℓ)
✓ the TT-format of B is called a TT-vector

SLIDE 32

TT-matrix

  • A TT-matrix converts a big matrix into TT format

✓ matrix W ∈ ℝ^{M×N}, where M = Π_{k=1}^{d} mk and N = Π_{k=1}^{d} nk
✓ s and ℓ are the row and column coordinates of W
✓ μ(s) and ν(ℓ) are the d-dimensional vector-indices of the tensorized W, where μk(s) ∈ {1, ..., mk} and νk(ℓ) ∈ {1, ..., nk}
✓ it holds that W(μ(s), ν(ℓ)) = W(s, ℓ)
✓ the TT-format of W is called a TT-matrix
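The index bookkeeping behind TT-vectors and TT-matrices is just a linear-to-multi-index map; a small numpy sketch (0-based, whereas the slides use 1-based indexing):

```python
import numpy as np

n = (4, 5, 6)                        # n_k factors, so N = 120
N = int(np.prod(n))
b = np.arange(N, dtype=float)        # a long vector b in R^N
B = b.reshape(n)                     # tensorized b

l = 57                               # a linear coordinate
mu = np.unravel_index(l, n)          # vector-index mu(l) = (mu_1, mu_2, mu_3)
print(mu, B[mu] == b[l])             # B(mu(l)) = b(l) holds
```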

SLIDE 33

TT-layer

  • Fully-connected layers apply a linear transformation y = Wx + b to an N-dimensional input vector x

  • A TT-layer transforms the input (kept in TT-vector format) by the weight matrix W (kept in TT-matrix format) into the output, where W is the weight matrix and b the bias vector

  • The TT-matrix-by-vector operation yields a low computational complexity for the forward pass

  • Learning can be performed by applying back-propagation to the FC layers to compute gradients w.r.t. the tensor cores, as sketched below
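A compact numpy sketch (illustrative, not the TensorNet code) of the TT-matrix-by-vector product that a TT-layer performs, contracting one core at a time instead of materializing W:

```python
import numpy as np

def tt_matvec(cores, x, m, n):
    """y = W x with W in TT-matrix format; cores[k] has shape (r_k, m_k, n_k, r_{k+1})."""
    T = x.reshape(n)[None, ...]                 # tensorize x; add rank axis r_0 = 1
    for G in cores:                             # T axes: (rank, n_k..n_d, m_1..m_{k-1})
        T = np.einsum('an...,amnb->b...m', T, G)
    return T.reshape(int(np.prod(m)))           # flatten (1, m_1, ..., m_d)

rng = np.random.default_rng(0)
m, n = (3, 4), (5, 6)
cores = [rng.normal(size=(1, 3, 5, 2)), rng.normal(size=(2, 4, 6, 1))]
x = rng.normal(size=30)
# check against the explicitly assembled 12 x 30 matrix
W = np.einsum('amnb,bpqc->mpnq', cores[0], cores[1]).reshape(12, 30)
print(np.allclose(tt_matvec(cores, x, m, n), W @ x))  # True
```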

SLIDE 34

TensorNet Experimental Results Cont

  • Substitution of FC layers with TT-layers in the VGG-16 and VGG-19 networks

✓ FC stands for a fully-connected layer
✓ TT$ stands for a TT-layer with all TT-ranks equal to $
✓ MR$ stands for a fully-connected layer with the matrix rank restricted to $
✓ The experiments report the compression factor of the TT-layers, the resulting compression factor of the whole network, and the top-1 and top-5 classification errors

Architecture   TT-layers compr.   vgg-16 compr.   vgg-19 compr.   vgg-16 top1   vgg-16 top5   vgg-19 top1   vgg-19 top5
FC FC FC       1                  1               1               30.9          11.2          29.0          10.1
TT4 FC FC      50 972             3.9             3.5             31.2          11.2          29.8          10.4
TT2 FC FC      194 622            3.9             3.5             31.5          11.5          30.4          10.9
TT1 FC FC      713 614            3.9             3.5             33.3          12.8          31.9          11.8
TT4 TT4 FC     37 732             7.4             6               32.2          12.3          31.6          11.7
MR1 FC FC      3 521              3.9             3.5             99.5          97.6          99.8          99
MR5 FC FC      704                3.9             3.5             81.7          53.9          79.1          52.4
MR50 FC FC     70                 3.7             3.4             36.7          14.9          34.5          15.8

SLIDE 35

Outline

  • Tensor Regression
  • TensorNets for Deep Neural Networks Compression
  • (Multi-)Tensor Completion
  • Tensor Denoising


SLIDE 36

Tensor Completion

[Figure: incomplete tensor with observed and missing entries, and the completed tensor]

Tensor completion problem:

Tensor completion applies tensor methods to infer the missing entries of a tensor from partial observations.

SLIDE 37

Motivation

Recommender systems / collaborative filtering

Movie ratings (Netflix)

Social network analysis

SLIDE 38

Matrix Factorization for Incomplete Data

Y = UV

  • Singular Value Decomposition (SVD)
  • Non-negative Matrix Factorization (NMF)
  • Probabilistic Matrix Factorization (PMF)
  • Gaussian Process Latent Variable Models (GPLVM)

Challenges:

  • ill-posed problem
  • infinite solutions

Regularizations:

  • Low-rank assumption
  • Smoothness, non-negativity
SLIDE 39

Tensor Completion

Solving scheme 1: low-rank assumption on the tensor

[Figure: incomplete tensor (observed and missing entries) completed under the low-rank assumption]

Example: High accuracy LRTC (HaLRTC) [Liu, et al., 2013]

min_X ‖X‖∗  s.t.  X_Ω = T_Ω

min_X Σ_{i=1}^{n} α_i ‖X_(i)‖∗  s.t.  X_Ω = T_Ω

Assumes the tensor matricization of each mode is low-rank.
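A simplified sketch in the spirit of these methods (closer to SiLRTC than to the ADMM-based HaLRTC, written only from the formulas above): threshold the singular values of each mode unfolding, average the folded results, and re-impose the observed entries:

```python
import numpy as np

def unfold(X, mode):
    """Mode-n matricization X_(n)."""
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

def fold(M, mode, shape):
    """Inverse of unfold."""
    rest = [s for i, s in enumerate(shape) if i != mode]
    return np.moveaxis(M.reshape([shape[mode]] + rest), 0, mode)

def svt(M, tau):
    """Singular value thresholding, the proximal operator of the nuclear norm."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def lrtc(T, mask, tau=1.0, n_iter=100):
    """min sum_i alpha_i ||X_(i)||_*  s.t.  X_Omega = T_Omega (simplified iteration)."""
    N, alpha = T.ndim, 1.0 / T.ndim
    X = T * mask
    for _ in range(n_iter):
        X = sum(alpha * fold(svt(unfold(X, i), tau), i, T.shape) for i in range(N))
        X[mask] = T[mask]                      # enforce the equality constraint
    return X
```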

SLIDE 40

Tensor Completion

Mode-n matricization of a third-order tensor [Kolda, et al., 2009]:

[Figure: mode-1, mode-2 and mode-3 slices of a 3rd-order tensor and the corresponding matricizations X(1), X(2), X(3), each assumed low-rank]

SLIDE 41

Technical problems

  • Model selection problem: rank determination; tuning parameter selection
  • Uncertainty information (confidence regions) vs. point estimation by ML, MAP, or optimisation methods
  • Overfitting problem
  • Efficiency (MCMC/Gibbs inference: easy derivation but slow convergence; no analytic solution)

SLIDE 42

Tensor factorization with missing values

  • Problem: an Nth-order tensor Y is partially observed, i.e., Y = X + ε, where Ω indicates the observed indices and O is an indicator tensor
  • The true latent tensor X is represented by a CP model with the minimum rank R:

X = Σ_{r=1}^{R} a_r^(1) ∘ ··· ∘ a_r^(N) = [[A^(1), ..., A^(N)]]

  • The noise is assumed to be i.i.d. Gaussian, ε ~ Π_{i1,...,iN} N(0, τ⁻¹)
  • Sparsity is imposed on the latent dimensions of the factors; marginalizing over the precision yields a Student-t distribution:

T(x | 0, λ, ν) = ∫ N(x | 0, τ⁻¹) Ga(τ | a, b) dτ
SLIDE 43
Bayesian CP factorization [Q. Zhao et al., IEEE TPAMI 2015]

  • Observation model (likelihood):

p(Y_Ω | {A^(n)}_{n=1}^{N}, τ) = Π_{i1=1}^{I1} ··· Π_{iN=1}^{IN} N(Y_{i1i2...iN} | ⟨a_{i1}^(1), a_{i2}^(2), ···, a_{iN}^(N)⟩, τ⁻¹)^{O_{i1···iN}}

  • Priors of the latent factors:

p(A^(n) | λ) = Π_{in=1}^{In} N(a_{in}^(n) | 0, Λ⁻¹), ∀n ∈ [1, N], where Λ = diag(λ) denotes the precision matrix

  • Priors of the hyperparameters:

p(λ) = Π_{r=1}^{R} Ga(λ_r | c_0^r, d_0^r),  p(τ) = Ga(τ | a_0, b_0),

where Ga(x | a, b) = bᵃ x^{a−1} e^{−bx} / Γ(a)

[Figure: graphical model with hyperpriors (c, d) → λ → A^(1), ..., A^(N) → Y ← τ ← (a, b)]
SLIDE 44
Our objective

  • The posterior distribution of all unknowns Θ = {A^(1), ..., A^(N), λ, τ}:

p(Θ | Y_Ω) = p(Θ, Y_Ω) / ∫ p(Θ, Y_Ω) dΘ

  • Predictive distribution for the missing entries:

p(Y_{\Ω} | Y_Ω) = ∫ p(Y_{\Ω} | Θ) p(Θ | Y_Ω) dΘ

  • Analytically intractable; we resort to approximate inference
  • Variational Bayesian inference; expectation propagation
  • Sampling methods such as MCMC (Gibbs)

SLIDE 45
Model learning via Bayesian inference

  • KL divergence between the approximate and true posterior distributions:

KL(q(Θ) ‖ p(Θ|Y)) = ∫ q(Θ) ln [q(Θ) / p(Θ|Y)] dΘ = ln p(Y) − ∫ q(Θ) ln [p(Y, Θ) / q(Θ)] dΘ,

i.e., ln p(Y) = L(q, θ) + KL(q‖p), so maximizing the lower bound L(q, θ) minimizes the KL divergence

  • Factorization of the approximate distribution:

q(Θ) = qλ(λ) qτ(τ) Π_{n=1}^{N} qn(A^(n))

  • Approximate posterior distributions:

qn(A^(n)) = Π_{in=1}^{In} N(a_{in}^(n) | ã_{in}^(n), V_{in}^(n)),
qτ(τ) = Ga(τ | a_M, b_M),
qλ(λ) = Π_{r=1}^{R} Ga(λ_r | c_M^r, d_M^r)

SLIDE 46
Model learning

  • Posterior of the latent factors (updated by variational message passing):

qn(A^(n)) = Π_{in=1}^{In} N(a_{in}^(n) | ã_{in}^(n), V_{in}^(n)),

ã_{in}^(n) = Eq[τ] V_{in}^(n) Eq[A_{in}^{(\n)T}] vec(Y_{I(O_{in}=1)}),
V_{in}^(n) = ( Eq[τ] Eq[A_{in}^{(\n)T} A_{in}^{(\n)}] + Eq[Λ] )⁻¹,

where A_{in}^{(\n)T} = (⊙_{k≠n} A^(k))ᵀ_{I(O_{in}=1)}

[Figure: graphical model with variational message passing among λ, A^(1), ..., A^(N), τ and Y]

SLIDE 47
Model learning

  • Posterior of the hyperparameters (precisions of the latent factors):

qλ(λ) = Π_{r=1}^{R} Ga(λ_r | c_M^r, d_M^r),
c_M^r = c_0^r + (1/2) Σ_{n=1}^{N} In,
d_M^r = d_0^r + (1/2) Σ_{n=1}^{N} Eq[a_{·r}^{(n)T} a_{·r}^{(n)}]

  • Posterior of the noise precision:

qτ(τ) = Ga(τ | a_M, b_M), where the posterior parameters are updated by
a_M = a_0 + (1/2) Σ_{i1,...,iN} O_{i1,...,iN},
b_M = b_0 + (1/2) Eq[ ‖ O ⊛ (Y − [[A^(1), ..., A^(N)]]) ‖²_F ]

[Figure: graphical model with variational message passing]

SLIDE 48

Demonstration of the learning procedure

  • Size: 10 × 10 × 10
  • Rank R = 5
  • Factor matrices A^(n) ∈ ℝ^{In×R} are drawn from a standard normal, i.e., ∀n, ∀in, a_{in}^(n) ~ N(0, I_R)
  • The observed tensor is constructed by Y = X + ε, where ε ~ Π_{i1,...,iN} N(0, σ²) denotes the noise, whose parameter controls the SNR; here σ⁻² = 1000

SLIDE 49

Image Completion

[Figure: image completion results at missing rates 70%, 80%, 90% and 95%, comparing the Observation with FBCP, FBCP-MP, CPWOPT, STDC, HaLRTC, FaLRTC, FCSA, HardC. and KTD]

SLIDE 50

Facial image synthesis

  • 3D Basel face model
  • image size 68 × 68
  • 10 people × 9 poses × 3 illuminations
  • large variations of faces captured from surveillance video
  • Robust face recognition

Method    36/270 (T, M)   49/270 (T, M)   64/270 (T, M)   81/270 (T, M)
FBCP      0.06  0.10      0.06  0.10      0.09  0.15      0.12  0.20
CPWOPT    0.53  0.65      0.56  0.61      0.58  0.59      0.65  0.73
FaLRTC    0.11  0.28      0.13  0.30      0.15  0.31      0.19  0.34
HardC.    0.37  0.37      0.37  0.40      0.37  0.40      0.37  0.40

Matrix factorization does not work when one entire row or column is missing

SLIDE 51

[Figure: completion results of Ground truth, FBCP, FALRTC and CPWOPT]

SLIDE 52
Bayesian Sparse Tucker Decomposition

  • Model assumption: the observed Nth-order tensor Y ∈ ℝ^{I1×···×IN} is a measurement of the latent tensor X, i.e., Y = X + ε, where X admits a Tucker representation

X = G ×1 U^(1) ×2 U^(2) × ··· ×N U^(N)

  • Likelihood function:

vec(Y) | {U^(n)}, G, τ ~ N( (⊗_n U^(n)) vec(G), τ⁻¹ I ),  τ ~ Ga(a_0^τ, b_0^τ)

  • Priors over the model parameters:

u_{in}^(n) | λ^(n) ~ N(0, Λ^(n)⁻¹), ∀n, ∀in,
vec(G) | {λ^(n)}, β ~ N( 0, β⁻¹ (⊗_n Λ^(n))⁻¹ ),  β ~ Ga(a_0^β, b_0^β),
Student-t: λ_{rn}^(n) ~ Ga(a_0^λ, b_0^λ), ∀n, ∀rn;  Laplace: λ_{rn}^(n) ~ IG(1, γ/2), ∀n, ∀rn, with γ ~ Ga(a_0^γ, b_0^γ)

❖ Group sparsity priors over factors
❖ Slice sparsity priors over cores
❖ Shared sparsity patterns between cores and factors

  • Joint distribution of the model:

p(Y, Θ) = p(Y | {U^(n)}, G, τ) Π_n p(U^(n) | λ^(n)) × p(G | {Λ^(n)}, β) Π_n p(λ^(n) | γ) p(γ) p(β) p(τ)

[Figure: Tucker decomposition of a 3rd-order tensor into core G ∈ ℝ^{R1×R2×R3} and factors U^(1), U^(2), U^(3)]

SLIDE 53
Model Inference

  • Variational Bayesian factorization:

q(Θ) = q(G) q(β) Π_n q(U^(n)) Π_n q(λ^(n)) q(γ) q(τ)

  • Posterior of the core tensor:

q(G) = N( vec(G) | vec(G̃), Σ_G ),
vec(G̃) = E[τ] Σ_G (⊗_n E[U^(n)T]) vec(Y),
Σ_G = ( E[β] ⊗_n E[Λ^(n)] + E[τ] ⊗_n E[U^(n)T U^(n)] )⁻¹

  • Posterior of the factor matrices:

q(U^(n)) = Π_{in=1}^{In} N( u_{in}^(n) | ũ_{in}^(n), Ψ^(n) ), n = 1, ..., N,
Ũ^(n) = E[τ] Y_(n) (⊗_{k≠n} E[U^(k)]) E[G_(n)ᵀ] Ψ^(n),
Ψ^(n) = { E[Λ^(n)] + E[τ] E[ G_(n) (⊗_{k≠n} U^(k)T U^(k)) G_(n)ᵀ ] }⁻¹

  • Posterior of λ^(n), n = 1, ..., N:

q(λ^(n)) = Π_{rn=1}^{Rn} Ga( λ_{rn}^(n) | ã_{rn}^(n), b̃_{rn}^(n) ),
ã_{rn}^(n) = a_0^λ + (1/2)( In + Π_{k≠n} Rk ),
b̃_{rn}^(n) = b_0^λ + (1/2) E[u_{·rn}^{(n)T} u_{·rn}^{(n)}] + (1/2) E[β] E[vec(G²_{···rn···})ᵀ] ⊗_{k≠n} E[λ^(k)]

  • Posterior of the noise precision τ: the variational posterior is q(τ) = Ga(a_M^τ, b_M^τ), with

a_M^τ = a_0^τ + (1/2) Π_n In,
b_M^τ = b_0^τ + (1/2) E[ ‖ vec(Y) − (⊗_n U^(n)) vec(G) ‖²_F ]

SLIDE 54
Bayesian Sparse Tucker Completion

  • Model assumption: the Nth-order tensor Y ∈ ℝ^{I1×···×IN} is partially observed, Y_Ω = X_Ω + ε, where X is represented exactly by a Tucker model

X = G ×1 U^(1) ×2 U^(2) × ··· ×N U^(N),

Ω denotes the set of N-tuple indices of the observed entries, and O denotes a binary indicator tensor, i.e., O_{i1···iN} = 1 if (i1, ..., iN) ∈ Ω and zero otherwise

  • Likelihood function:

Y_{i1···iN} | {u_{in}^(n)}, G, τ ~ N( (⊗_n u_{in}^{(n)T}) vec(G), τ⁻¹ )^{O_{i1···iN}}

  • Priors over the model parameters are the same as in BSTD
  • Model inference differs for the core tensor G, the factor matrices U, and the noise precision τ
  • Predictive distribution over the missing entries:

p(Y_{i1···iN} | Y_Ω) = ∫ p(Y_{i1···iN} | Θ) p(Θ | Y_Ω) dΘ

SLIDE 55

Demonstration of Learning Procedure

  • Tensor: 20 x 20 x 20 with 70% missing elements
  • Multilinear rank: 2 x 3 x 3


SLIDE 56

MRI Dataset

Table III: The performance of MRI completion evaluated by PSNR and RRSE. For noisy MRI, the standard deviation of the Gaussian noise is 3% of the brightest tissue. The MRI tensor is of size 181 × 217 × 165 and each block tensor is of size 50 × 50 × 10. Each cell reports PSNR / RRSE.

Missing   50% Orig.      50% Noisy      60% Orig.      60% Noisy      70% Orig.      70% Noisy      80% Orig.      80% Noisy
BSTC-T    27.32 / 0.11   26.18 / 0.12   25.30 / 0.14   24.60 / 0.15   22.81 / 0.18   22.35 / 0.19   20.14 / 0.25   20.00 / 0.25
BSTC-L    26.91 / 0.11   25.57 / 0.13   24.84 / 0.15   23.95 / 0.16   22.76 / 0.19   22.09 / 0.20   20.12 / 0.25   19.80 / 0.26
iHOOI     22.69 / 0.19   21.45 / 0.22   22.47 / 0.19   21.16 / 0.22   21.63 / 0.21   20.11 / 0.25   18.65 / 0.30   17.89 / 0.32
HaLRTC    24.84 / 0.15   23.60 / 0.17   22.35 / 0.19   21.65 / 0.21   19.93 / 0.26   19.55 / 0.27   17.37 / 0.34   17.15 / 0.35

[Figure: (a) 50% missing, noisy SNR = 20 dB, PSNR = 26 dB; (b) 80% missing, noisy SNR = 20 dB, PSNR = 22 dB; missing entries vs. estimation]

SLIDE 57

Bayesian Robust Tensor Factorization [Q. Zhao et al., IEEE TNNLS, 2016]

  • Model specification: the observed tensor is modeled as a low-rank CP term plus a sparse outlier term S plus noise:

p(Y_Ω | {A^(n)}_{n=1}^{N}, S_Ω, τ) = Π_{i1=1}^{I1} ··· Π_{iN=1}^{IN} N( Y_{i1...iN} | ⟨a_{i1}^(1), ···, a_{iN}^(N)⟩ + S_{i1...iN}, τ⁻¹ )^{O_{i1···iN}}

  • Priors:

p(A^(n) | λ) = Π_{in=1}^{In} N(a_{in}^(n) | 0, Λ⁻¹), ∀n ∈ [1, N],  p(λ) = Π_{r=1}^{R} Ga(λ_r | c_0, d_0),
p(S_Ω | γ) = Π_{i1,...,iN} N(S_{i1...iN} | 0, γ_{i1...iN}⁻¹)^{O_{i1...iN}},  p(γ) = Π_{i1,...,iN} Ga(γ_{i1...iN} | a_0^γ, b_0^γ),
p(τ) = Ga(τ | a_0^τ, b_0^τ)

  • Joint distribution:

p(Y_Ω, Θ) = p(Y_Ω | {A^(n)}_{n=1}^{N}, S_Ω, τ) Π_{n=1}^{N} p(A^(n) | λ) p(S_Ω | γ) p(λ) p(γ) p(τ)

[Figure: graphical model with hyperpriors (c0, d0) → λ → A^(1), ..., A^(N); (a0γ, b0γ) → γ → S_Ω; (aτ, bτ) → τ; all feeding Y_Ω]
SLIDE 58
Demo of the model learning procedure

  • Tensor size: 30 × 30 × 30
  • CP rank: R = 3
  • Gaussian noise: SNR = 20 dB
  • Missing rate: 80%
  • Outliers: rate = 5%, magnitude M = 10 · std(X)
  • Maximal rank is set to 10

SLIDE 59

SLIDE 60

Videos with 90% missing pixels


SLIDE 61

Tensor Completion

Solving scheme 3: tensor decomposition by gradient-based optimization [Yuan, et al., 2017]

Find a low-rank tensor decomposition (e.g., a tensor train with cores G^(1), ..., G^(N) and slices G^(n)_{in}, n = 1, ..., N) from the observed entries only.

[Figure: incomplete tensor → tensor decomposition → completed tensor]

SLIDE 62

Tensor Completion

Tensor train decomposition (TTD) [Oseledets, et al., 2011]

Decompose a tensor X ∈ ℝ^{I1×I2×···×IN} into TT format, X = f(G^(1), G^(2), ···, G^(N)):

✓ core tensors: G^(n) ∈ ℝ^{r_{n−1}×In×rn}, n = 1, 2, ···, N
✓ slices: G^(n)_{in} ∈ ℝ^{r_{n−1}×rn}
✓ TT-ranks: {r0, r1, ···, rN}, with r0 = rN = 1

For each element:

x_{i1···iN} = Π_{n=1}^{N} G^(n)_{in}

SLIDE 63

Tensor Completion

Tensor train stochastic gradient descent (TT-SGD) [Yuan, et al., 2018]

Treat the incomplete tensor as a sparse tensor; only the observed entries contribute to the loss.

Loss function for one observed entry y_m = Y(i1^m, i2^m, ···, iN^m):

f(G^(1)_{i1^m}, G^(2)_{i2^m}, ···, G^(N)_{iN^m}) = (1/2) ‖ y_m − Π_{n=1}^{N} G^(n)_{in^m} ‖²_F,

where, following the TTD element formula, the approximation of the entry is x_m = Π_{n=1}^{N} G^(n)_{in^m}.

The gradient for the corresponding slice of each core tensor, for n = 1, ..., N:

∂f / ∂G^(n)_{in^m} = (x_m − y_m) ( G^{>n}_{im} G^{<n}_{im} )ᵀ,

where G^{>n}_{im} = Π_{k=n+1}^{N} G^(k)_{ik^m} and G^{<n}_{im} = Π_{k=1}^{n−1} G^(k)_{ik^m}.
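A direct numpy transcription (a sketch derived only from the formulas above, not the authors' code) of one TT-SGD update for a single observed entry:

```python
import numpy as np

def tt_sgd_step(cores, idx, y, lr=0.01):
    """One stochastic update for the observed entry Y[idx] = y.

    cores[n] has shape (r_n, I_n, r_{n+1}) with r_0 = r_N = 1; only the
    slices G^(n)[:, i_n, :] touched by this entry are updated.
    """
    slices = [G[:, i, :] for G, i in zip(cores, idx)]
    prefix = [np.eye(1)]                      # prefix[n] = G^{<n}, shape (1, r_n)
    for s in slices[:-1]:
        prefix.append(prefix[-1] @ s)
    suffix = [np.eye(1)]                      # suffix[n] = G^{>n}, shape (r_{n+1}, 1)
    for s in reversed(slices[1:]):
        suffix.insert(0, s @ suffix[0])
    x = (prefix[-1] @ slices[-1] @ suffix[-1]).item()   # current estimate x_m
    for n in range(len(cores)):
        grad = (x - y) * (suffix[n] @ prefix[n]).T      # (G^{>n} G^{<n})^T
        cores[n][:, idx[n], :] -= lr * grad
    return x - y                                        # signed residual
```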

SLIDE 64

Tensor Completion

TT-SGD overview [Yuan, et al., 2018]

The full loss over all M observed entries:

f(G^(1), G^(2), ···, G^(N)) = (1/2) Σ_{m=1}^{M} ‖ y_m − x_m ‖²_F

The recovered missing data is approximated by the low-rank TT prediction x_{i1···iN} = Π_{n=1}^{N} G^(n)_{in}; the observed data may first be cast to a higher-order tensor by high-order tensorization.

Algorithm 2: Tensor-train Stochastic Gradient Descent (TT-SGD)
1: Input: incomplete tensor Y and TT-ranks r.
2: Initialization: core tensors G^(1), G^(2), ···, G^(N) of the approximated tensor X.
3: While the optimization stopping condition is not satisfied:
4:   Randomly sample one observed entry of Y with index {i1, i2, ···, iN}.
5:   For n = 1:N
6:     Compute the gradients of the corresponding core slices by equation (11).
7:   End
8:   Update G^(1)_{i1}, G^(2)_{i2}, ···, G^(N)_{iN} by gradient descent.
9: End while
10: Output: G^(1), G^(2), ···, G^(N).

SLIDE 65

Tensor Completion

[Yuan, et al., 2018]

Experiment results

[Figure: image completion under 99% random missing, block missing, line missing and scratch corruption]

SLIDE 66

Tensor Completion

High-order tensorization [Yuan, et al., 2017]

Tensorization of a 256×256×3 image, from 3-way to 9-way:
1. Reshape 256×256×3 to 2×2×…×2×3 (17-way tensor).
2. Permute by {1 9 2 10 3 11 4 12 5 13 6 14 7 15 8 16 17}.
3. Reshape to 4×4×4×4×4×4×4×4×3 (9-way tensor).

Better data structure: the first mode represents a 2×2 pixel block, the second mode represents four 2×2 pixel blocks, and so on. This captures more of the structural relations in the data, as the sketch below shows.

[Figure: the image under the block hierarchy at resolutions 256×256, 128×128 and 64×64]
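The three steps in numpy (a sketch; the slide's 1-based permutation is translated to 0-based indexing):

```python
import numpy as np

def tensorize_image(img):
    """256x256x3 image -> 9-way tensor, following the three steps above."""
    t = img.reshape([2] * 16 + [3])            # 1. 17-way tensor of 2s (and 3)
    perm = [0, 8, 1, 9, 2, 10, 3, 11,          # 2. interleave row and column bits
            4, 12, 5, 13, 6, 14, 7, 15, 16]    #    ({1 9 2 10 ...} in 1-based form)
    return t.transpose(perm).reshape([4] * 8 + [3])  # 3. 9-way tensor of 4s

img = np.arange(256 * 256 * 3, dtype=float).reshape(256, 256, 3)
print(tensorize_image(img).shape)              # (4, 4, 4, 4, 4, 4, 4, 4, 3)
```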

SLIDE 67

Tensor Completion

Comparison of applying tensorization [Yuan, et al., 2017]

[Figure: completion of an image with 90% random missing entries, comparing a 3D tensor, a 9D tensor, and a 9D tensor with the proposed tensorization, for TT-WOPT, TT-SGD, CP-WOPT, FBCP, TLnR and HaLRTC]

SLIDE 68

Outline

  • Tensor Regression
  • TensorNets for Deep Neural Networks Compression
  • (Multi-)Tensor Completion
  • Tensor Denoising


SLIDE 69

Tensor denoising [IEEE TIP 2013; IEEE TPAMI 2013]

[Figure: grouping similar cubes by cube-matching, stacking them into a tensor, then applying tensor factorization (HOSVD, BCPF) to denoise; noisy → denoised]

The noise variance is required!!! → Automatic noise estimation

SLIDE 70

Noisy MRI (T1)

  • 181 × 217 × 165
  • Noise std = 10% of max value
  • PSNR = 22 dB

Denoised MRI

  • PSNR = 36 dB

SLIDE 71

Learning efficient tensor representations with ring structure networks (ICLR Workshop 2018)

Motivation:
  • Tensor train is too strict, since the TT-ranks are bounded by the ranks of the k-unfolding matricizations
  • The boundary constraint r1 = r_{d+1} = 1, and inconsistent solutions under permutations of the data

Proposed model: a more generalized model without the boundary constraint; a sum of TTs with partially shared core tensors; tensor ring (TR) ranks:

T(i1, i2, ..., id) = Tr{Z1(i1) Z2(i2) ··· Zd(id)} = Tr{ Π_{k=1}^{d} Zk(ik) }

For the k-unfolding, Rank(T⟨k⟩) is related to the TR ranks; if T has Rank(T⟨k⟩) = R_{k+1}, there exists a TR decomposition such that ∃k, r1 r_{k+1} ≤ R_{k+1}

SLIDE 72

Tensor Ring Decomposition

T(i1, i2, ..., id) = Σ_{α1,...,αd=1}^{r1,...,rd} Π_{k=1}^{d} Zk(αk, ik, α_{k+1}) = Tr{Z1(i1) Z2(i2) ··· Zd(id)}

[Figure: tensor network diagram of T ∈ ℝ^{n1×···×nd} decomposed into a ring of cores Z1, ..., Zd with ranks r1, r2, ..., r_{k+1}, ...]

  • Circular dimensional permutation invariance (due to the trace operation):

T(i1, i2, ..., id) = Tr(Z2(i2), Z3(i3), ..., Zd(id), Z1(i1)) = ··· = Tr(Zd(id), Z1(i1), ..., Z_{d−1}(i_{d−1}))

Algorithms:
  • Sequential SVDs, e.g., via the 1-unfolding

T⟨1⟩(i1, i2···id) = Σ_{α1,α2} Z1(i1, α1α2) Z^{>1}(α1α2, i2···id),
Z^{>1}(α2 i2, i3···id α1) = Σ_{α3} Z2(α2 i2, α3) Z^{>2}(α3, i3···id α1)

  • ALS algorithm and block-wise ALS, using T_[k] = Z_{k(2)} (Z^{≠k}_{[2]})ᵀ
  • Scalar representation
  • Slice representation
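A small numpy sketch (hypothetical ranks) evaluating a TR element via the trace and checking the circular permutation invariance stated above:

```python
import numpy as np

def tr_element(cores, idx):
    """T(i1,...,id) = Tr{Z1(i1) ... Zd(id)}; cores[k] has shape (r_k, n_k, r_{k+1})."""
    M = cores[0][:, idx[0], :]
    for Z, i in zip(cores[1:], idx[1:]):
        M = M @ Z[:, i, :]
    return float(np.trace(M))

rng = np.random.default_rng(0)
r, n = [2, 3, 4], 5                                  # ring ranks (closes back to r1)
cores = [rng.normal(size=(r[k], n, r[(k + 1) % 3])) for k in range(3)]
shifted = cores[1:] + cores[:1]                      # cyclic shift: Z2, Z3, Z1
print(np.isclose(tr_element(cores, (1, 2, 3)),
                 tr_element(shifted, (2, 3, 1))))    # True: the trace is cyclic
```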

SLIDE 73
Properties of TR Representation

  • Sum of tensors: if T1 = ℜ(Z1, ..., Zd) and T2 = ℜ(Y1, ..., Yd) are TR decompositions of two tensors in ℝ^{n1×···×nd}, then their addition T3 = T1 + T2 can also be represented by T3 = ℜ(X1, ..., Xd), where each core is block-diagonal:

Xk(ik) = diag( Zk(ik), Yk(ik) ),  ik = 1, ..., nk,  k = 1, ..., d

  • Multilinear products: let T = ℜ(Z1, ..., Zd) ∈ ℝ^{n1×···×nd} be a dth-order tensor and {uk}, k = 1, ..., d, a set of vectors; then the multilinear product c = T ×1 u1ᵀ ×2 ··· ×d udᵀ is obtained by a product on each core:

c = ℜ(X1, ..., Xd), where Xk = Σ_{ik=1}^{nk} Zk(ik) uk(ik)

  • Hadamard product of tensors: T3 = T1 ⊛ T2 is given by T3 = ℜ(X1, ..., Xd), where each core is

Xk(ik) = Zk(ik) ⊗ Yk(ik),  k = 1, ..., d

  • Inner product of two tensors: apply the Hadamard product followed by multilinear products with vectors of all ones
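The Hadamard-product property is easy to verify numerically; a sketch (small hypothetical sizes) building the product cores from Kronecker products of slices:

```python
import numpy as np

def tr_full(cores):
    """Dense tensor from TR cores (brute force, small sizes only)."""
    shape = tuple(Z.shape[1] for Z in cores)
    out = np.empty(shape)
    for idx in np.ndindex(shape):
        M = cores[0][:, idx[0], :]
        for Z, i in zip(cores[1:], idx[1:]):
            M = M @ Z[:, i, :]
        out[idx] = np.trace(M)
    return out

def tr_hadamard(c1, c2):
    """Cores of T1 * T2: X_k(i_k) = Z_k(i_k) kron Y_k(i_k)."""
    return [np.stack([np.kron(Z[:, i, :], Y[:, i, :]) for i in range(Z.shape[1])],
                     axis=1)
            for Z, Y in zip(c1, c2)]

rng = np.random.default_rng(0)
r, n = [2, 3, 2], 4
c1 = [rng.normal(size=(r[k], n, r[(k + 1) % 3])) for k in range(3)]
c2 = [rng.normal(size=(r[k], n, r[(k + 1) % 3])) for k in range(3)]
print(np.allclose(tr_full(tr_hadamard(c1, c2)), tr_full(c1) * tr_full(c2)))  # True
```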

SLIDE 74
Relation to Other Models

  • CP decomposition is a special case of TR when the cores are slice-diagonal:

T = Σ_{α=1}^{r} u_α^(1) ∘ ··· ∘ u_α^(d)

Hence, CPD can be viewed as a TR decomposition T = ℜ(V1, ..., Vd), where each core Vk is of size r × nk × r with diagonal slices Vk(ik) = diag(u_{ik}^(k)) for each fixed ik and k

  • Tucker decomposition: T = G ×1 U^(1) ×2 ··· ×d U^(d). Hence, the Tucker model can be viewed as a TR decomposition T = ℜ(Z1, ..., Zd) via the multilinear products Zk = Vk ×2 U^(k), k = 1, ..., d, where the core tensor G is itself given the TR decomposition G = ℜ(V1, ..., Vd)

  • TT decomposition is a special case of TR when ∃n, rn = 1

  • TR is a sum of TT representations; in element-wise form,

T(i1, ..., id) = Tr{Z1(i1) Z2(i2) ··· Zd(id)} = Σ_{α1=1}^{r1} z1(α1, i1, :)ᵀ Z2(i2) ··· Z_{d−1}(i_{d−1}) zd(:, id, α1)

SLIDE 75

Data Structure Reconstruction

16×16 block image to 4×4×4×4 block format:

  • 1. Reshape to 2×2×2×2×2×2×2×2.
  • 2. Permute by {1,5,2,6,3,7,4,8}.
  • 3. Reshape to 4×4×4×4.

The first mode represents a 2×2 pixel block; the second mode represents four such blocks, and so on.

SLIDE 76

Learning efficient tensor representations with ring structure networks (ICLR Workshop 2018)

Representation of the original data with fewer model parameters; tensorization is important and underexplored.

Table 4: Image representation using tensorization and TR decomposition. The number of parameters is compared for SVD, TT and TR at the same approximation errors ε (a block-addressing procedure is used to cast the image).

Without tensorization (n = 256, d = 2):
ε = 0.1: SVD 9.7e3, TT/TR 9.7e3;  ε = 0.01: 7.2e4, 7.2e4;  ε = 9e−4: 1.2e5, 1.2e5;  ε = 2e−15: 1.3e5, 1.3e5

With tensorization:
                ε = 0.1          ε = 0.01         ε = 2e−3         ε = 1e−14
                TT      TR       TT      TR       TT      TR       TT      TR
n = 16, d = 4   5.1e3   3.8e3    6.8e4   6.4e4    1.0e5   7.3e4    1.3e5   7.4e4
n = 4,  d = 8   4.8e3   4.3e3    7.8e4   7.8e4    1.1e5   9.8e4    1.3e5   1.0e5
n = 2,  d = 16  7.4e3   7.4e3    1.0e5   1.0e5    1.5e5   1.5e5    1.7e5   1.7e5

[Figure 7: classification performance of tensorizing neural networks using the TR representation; training and testing errors vs. ranks 2–6 for TT-layer and TR-layer]

SLIDE 77

Discussions

  • What are the most important advantages of tensor methods?
  • Which kinds of tensor methods are promising in the future?