

SLIDE 1

Novel tensor framework for neural networks and model reduction

Shashanka Ubaru¹, Lior Horesh¹, Misha Kilmer², Elizabeth Newman², Haim Avron³, Osman Malik⁴

¹IBM TJ Watson Research Center  ²Tufts University  ³Tel Aviv University  ⁴University of Colorado, Boulder

ICERM Workshop on Algorithms for Dimension and Complexity Reduction

IBM Research / March 2020 / © 2020 IBM Corporation

SLIDE 2

Outline

- Brief introduction to tensors
- Tensor-based graph neural networks
- Tensor neural networks
- Numerical results
- Model reduction for NNs?

SLIDE 3

Introduction

- Much of real-world data is inherently multidimensional.
- Many operators and models are natively multi-way.

SLIDE 4

Tensor Applications

- Machine vision
- Latent semantic tensor indexing
- Medical imaging
- Video surveillance, streaming

Ivanov, Mathies, Vasilescu, Tensor subspace analysis for viewpoint recognition, ICCV, 2009.
Shi, Ling, Hu, Yuan, Xing, Multi-target tracking with motion context in tensor power iteration, CVPR, 2014.

SLIDE 5

Background and Notation

Notation: A ∈ R^(n1×n2×···×nd) - a d-th order tensor

◮ 0th-order tensor - scalar
◮ 1st-order tensor - vector
◮ 2nd-order tensor - matrix
◮ 3rd-order tensor - ...

SLIDE 6

Inside the Box

- Fiber: a vector obtained by fixing all indices of the tensor but one.
- Slice: a matrix obtained by fixing all indices of the tensor but two.

SLIDE 7

Tensor Multiplication

Definition

The k-mode multiplication of a tensor A ∈ R^(n1×n2×···×nd) with a matrix U ∈ R^(j×nk) is denoted by A ×_k U and is of size n1 × ··· × n(k−1) × j × n(k+1) × ··· × nd. Element-wise,

(A ×_k U)_(i1 ··· i(k−1) j i(k+1) ··· id) = Σ_(ik=1)^(nk) a_(i1 i2 ··· id) u_(j ik).
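As a concrete illustration (not from the slides), the k-mode product takes a few lines of NumPy; `mode_k_product` is our own helper name and this is a minimal sketch rather than a reference implementation:

```python
import numpy as np

def mode_k_product(A, U, k):
    """k-mode product A ×_k U for A of shape (n1, ..., nd) and U of shape (j, nk).

    Contracts mode k of A against the second axis of U (of size nk), giving a
    tensor of shape (n1, ..., n_{k-1}, j, n_{k+1}, ..., nd).
    """
    # tensordot sums over A's axis k and U's axis 1, appending the new
    # j-sized axis at the end; moveaxis puts it back in position k.
    return np.moveaxis(np.tensordot(A, U, axes=(k, 1)), -1, k)

# Shape check: a 3 x 4 x 5 tensor times a 7 x 4 matrix along mode 1.
A = np.random.randn(3, 4, 5)
U = np.random.randn(7, 4)
assert mode_k_product(A, U, 1).shape == (3, 7, 5)
```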

SLIDE 8

The ⋆M-Product

Given A ∈ R^(ℓ×p×n), B ∈ R^(p×m×n), and an invertible n × n matrix M, then

C = A ⋆M B = (Â ❆ B̂) ×_3 M⁻¹,

where C ∈ R^(ℓ×m×n), Â = A ×_3 M, B̂ = B ×_3 M, and ❆ multiplies corresponding frontal slices in parallel (the facewise product).

Useful properties: tensor transpose, identity tensor, connection to the Fourier transform, invariance to circulant shifts, ...
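For concreteness, a minimal NumPy sketch of the ⋆M-product, assuming a real invertible M; `m_product` is our name:

```python
import numpy as np

def m_product(A, B, M):
    """C = A ⋆M B for A (l, p, n), B (p, m, n), and an invertible M (n, n)."""
    A_hat = np.tensordot(A, M, axes=(2, 1))   # Â = A ×_3 M
    B_hat = np.tensordot(B, M, axes=(2, 1))   # B̂ = B ×_3 M
    # Facewise product: multiply matching frontal slices Â[:, :, i] @ B̂[:, :, i].
    C_hat = np.einsum('lpn,pmn->lmn', A_hat, B_hat)
    return np.tensordot(C_hat, np.linalg.inv(M), axes=(2, 1))   # ×_3 M⁻¹
```

With M a DFT matrix this reduces to the familiar t-product; the CIFAR-10 experiments later in the talk use a DCT matrix for M.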

SLIDE 9

Tensor Graph Convolutional Networks


SLIDE 10

Dynamic Graphs

Graphs are ubiquitous data structures that represent interactions and structural relationships. In many real-world applications, the underlying graph changes over time. Learning representations of dynamic graphs is essential.

SLIDE 11

Dynamic Graphs - Applications

Corporate/financial networks, natural language understanding (NLU), social networks, neural activity networks, traffic prediction.

SLIDE 12

Graph Convolutional Networks

Graph Neural Networks (GNNs) are popular tools for exploring graph-structured data. Graph Convolutional Networks (GCNs), built on graph convolution filters, extend convolutional neural networks (CNNs) to irregular graph domains. These GNN models operate on a given, static graph.

Image courtesy of Kipf & Welling (2016).

SLIDE 13

Graph Convolutional Networks

Motivation:
- Convolution of two signals x and y: x ⊗ y = F⁻¹(Fx ⊙ Fy), where F is the Fourier transform (DFT matrix).
- Convolution of two node signals x and y on a graph with Laplacian L = UΛU^⊤: x ⊗ y = U(U^⊤x ⊙ U^⊤y).
- Filtered convolution: x ⊗_filt y = h(L)x ⊙ h(L)y, with matrix filter function h(L) = U h(Λ) U^⊤.
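A direct, dense sketch of this spectral convolution, assuming a symmetric Laplacian (fine for small graphs; large-scale GCNs avoid the explicit eigendecomposition; `graph_conv` is our name):

```python
import numpy as np

def graph_conv(L, x, y):
    """x ⊗ y = U (Uᵀx ⊙ Uᵀy), where L = U Λ Uᵀ is a symmetric Laplacian."""
    _, U = np.linalg.eigh(L)                # graph Fourier basis
    return U @ ((U.T @ x) * (U.T @ y))      # transform, multiply pointwise, invert
```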

SLIDE 14

Graph Convolutional Neural Networks

Layer of the initial convolution-based GNNs (Bruna et al., 2016): given a graph Laplacian L ∈ R^(N×N) and node features X ∈ R^(N×F),

H_(i+1) = σ(h_θ(L) H_i W^(i)),

where h_θ is a filter function parametrized by θ, σ is a nonlinear function (e.g., ReLU), and W^(i) is a weight matrix, with H_0 = X.

Defferrard et al. (2016) used a Chebyshev approximation: h_θ(L) = Σ_(k=0)^(K) θ_k T_k(L).

GCN (Kipf & Welling, 2016): each layer takes the form σ(LXW). 2-layer example: Z = softmax(L σ(L X W^(0)) W^(1)).
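The two-layer GCN then amounts to two matrix products per layer and a softmax. A minimal NumPy sketch, assuming L is the pre-normalized operator of Kipf & Welling and σ = ReLU (`gcn_two_layer` is our name):

```python
import numpy as np

def gcn_two_layer(L, X, W0, W1):
    """Z = softmax(L · ReLU(L X W0) · W1); rows of Z are class probabilities."""
    H = np.maximum(L @ X @ W0, 0.0)             # first layer, σ = ReLU
    logits = L @ H @ W1                         # second layer
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)     # row-wise softmax
```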

SLIDE 15

GCN for dynamic graphs

We consider time-varying, or dynamic, graphs.

Goal: extend the GCN framework to the dynamic setting for tasks such as node and edge classification and link prediction.

Our approach: use the tensor framework
- T adjacency matrices A_(::t) ∈ R^(N×N) stacked into a tensor A ∈ R^(N×N×T)
- T node feature matrices X_(::t) ∈ R^(N×F) stacked into a tensor X ∈ R^(N×F×T)

SLIDE 16

TensorGCN

[Figure: TensorGCN pipeline. A dynamic graph over time steps 1, 2, ..., T is encoded as an adjacency tensor (A_1, ..., A_T) and a feature tensor (X_1, ..., X_T); TensorGCN produces an embedding used for graph tasks: link prediction, edge classification, node classification.]

SLIDE 17

TensorGCN

We use the ⋆M-product to extend the standard GCN to dynamic graphs. We propose the tensor GCN model σ(A ⋆M X ⋆M W). 2-layer example:

Z = softmax(A ⋆M σ(A ⋆M X ⋆M W^(0)) ⋆M W^(1))    (1)

We choose M to be lower triangular and banded:

M_(tk) = 1/min(b, t) if max(1, t − b + 1) ≤ k ≤ t, and 0 otherwise.

This can be shown to be consistent with a spatio-temporal message-passing model.

O. Malik, S. Ubaru, L. Horesh, M. Kilmer, and H. Avron, Tensor graph convolutional networks for prediction on dynamic graphs, 2020.
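For concreteness, a sketch of that banded lower-triangular M (0-indexed, so row t averages the current and up to b − 1 preceding time steps, matching M_tk = 1/min(b, t) in the slide's 1-indexed notation; `banded_m` is our name):

```python
import numpy as np

def banded_m(T, b):
    """T × T lower-triangular, banded averaging matrix for TensorGCN."""
    M = np.zeros((T, T))
    for t in range(T):
        lo = max(0, t - b + 1)                  # start of the band in row t
        M[t, lo:t + 1] = 1.0 / (t - lo + 1)     # equals 1 / min(b, t+1) here
    return M
```

Because M is lower triangular, row t touches only time steps k ≤ t, so the embedding at time t mixes current and past information only.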

SLIDE 18

Tensor Neural Networks


SLIDE 19

Neural Networks

Let a_0 be a feature vector with an associated target vector c. Let f be a function which propagates a_0 through connected layers:

a_(j+1) = σ(W_j · a_j + b_j) for j = 0, ..., N−1,

where σ is some nonlinear, monotonic activation function.

Goal: learn the function f which optimizes

min_(f∈H) E(f) ≡ (1/m) Σ_(i=1)^(m) V(c^(i), f(a_0^(i)))  [loss function]  +  R(f)  [regularizer]

H - hypothesis space of functions: rich, restrictive, efficient.

SLIDE 21

Reduced Parameterization

Given an n × n image A_0, stored as a vector a_0 ∈ R^(n²×1) or as a tensor A_0 ∈ R^(n×1×n):

Matrix: a_(j+1) = σ(W_j · a_j + b_j) - n⁴ + n² parameters
Tensor: A_(j+1) = σ(W_j ⋆M A_j + B_j) - n³ + n² parameters
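As a quick sanity check (our arithmetic, not on the slide): for n = 28, as in MNIST, a single matrix layer needs 28⁴ + 28² = 615,440 parameters, while the corresponding tensor layer needs 28³ + 28² = 22,736, roughly a 27× reduction.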

SLIDE 22

Improved Parametrization

Given an n × n image A_0, stored as a_0 ∈ R^(n²×1) and A_0 ∈ R^(n×1×n).

SLIDE 23

Tensor Neural Networks (tNNs)

Forward propagation:
A_(j+1) = σ(W_j ⋆M A_j + B_j)

Objective function:
E = (1/2) ‖W_N · unfold(A_N) − c‖_F²

Backward propagation:
δA_j = W_j^⊤ ⋆M (δA_(j+1) ⊙ σ′(Z_(j+1)))
δW_j = (δA_(j+1) ⊙ σ′(Z_(j+1))) ⋆M A_j^⊤
δB_j = δA_(j+1) ⊙ σ′(Z_(j+1))

where Z_(j+1) = W_j ⋆M A_j + B_j, ⊙ is the pointwise product, and
δA_j := ∂E/∂A_j = (∂E/∂A_(j+1)) · (∂A_(j+1)/∂Z_(j+1)) · (∂Z_(j+1)/∂A_j).

Update parameters: gradient descent!

M. Nielsen, Neural networks and deep learning, 2017
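These update formulas translate directly into code. A minimal NumPy sketch of one tNN layer's forward and backward pass, assuming σ = tanh and a real invertible transform M; `m_product`, `tnn_forward`, and `tnn_backward` are our names, not the authors':

```python
import numpy as np

def m_product(A, B, M):
    """C = A ⋆M B for A (l, p, n), B (p, m, n), invertible M (n, n)."""
    A_hat = np.tensordot(A, M, axes=(2, 1))
    B_hat = np.tensordot(B, M, axes=(2, 1))
    C_hat = np.einsum('lpn,pmn->lmn', A_hat, B_hat)   # facewise product
    return np.tensordot(C_hat, np.linalg.inv(M), axes=(2, 1))

def tnn_forward(W, A, B, M):
    """A_{j+1} = σ(W_j ⋆M A_j + B_j), σ = tanh; returns (A_{j+1}, Z_{j+1})."""
    Z = m_product(W, A, M) + B
    return np.tanh(Z), Z

def tnn_backward(W, A, Z, dA_next, M):
    """Slide's backprop with S = δA_{j+1} ⊙ σ′(Z_{j+1}); returns (δA_j, δW_j, δB_j).

    For a real M, the tensor transpose amounts to transposing each frontal
    slice, i.e. swapping the first two modes.
    """
    S = dA_next * (1.0 - np.tanh(Z) ** 2)          # σ′(Z) for σ = tanh
    dA = m_product(W.transpose(1, 0, 2), S, M)     # W_j^⊤ ⋆M S
    dW = m_product(S, A.transpose(1, 0, 2), M)     # S ⋆M A_j^⊤
    return dA, dW, S                               # δB_j = S
```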

SLIDE 30

Numerical Results


SLIDE 31

TensorGCN - Datasets

Table: Dataset statistics. Partitioning the data into windows of the specified length results in the given number of graphs.

Dataset       | Nodes | Edges   | No. graphs | Window length | Classes | S_train | S_val | S_test
Bitcoin OTC   | 6,005 | 35,569  | 135        | 14 days       | 2       | 95      | 20    | 20
Bitcoin Alpha | 7,604 | 24,173  | 135        | 14 days       | 2       | 95      | 20    | 20
Reddit        | 3,818 | 163,008 | 86         | 14 days       | 2       | 66      | 10    | 10
Chess         | 7,301 | 64,958  | 100        | 31 days       | 3       | 80      | 10    | 10

[Figure: Partitioning of the N × N × T tensor A into training, validation, and testing data along the time mode.]

SLIDE 32

TensorGCN - Edge classification results

Table: Results for edge classification. Performance measure is F1 score.

Method               | Bitcoin OTC | Bitcoin Alpha | Reddit | Chess
WD-GCN               | 0.2062      | 0.1920        | 0.2337 | 0.4311
EvolveGCN            | 0.3284      | 0.1609        | 0.2012 | 0.4351
GCN                  | 0.3317      | 0.2100        | 0.1805 | 0.4342
TensorGCN (proposed) | 0.3529      | 0.2331        | 0.2028 | 0.4708

F1 score = 2 · precision · recall / (precision + recall)

SLIDE 33

TensorGCN - Link Prediction results

Table: Results for link prediction. Performance measure is Mean Average Precision (MAP).

Method               | Bitcoin OTC | Bitcoin Alpha | Reddit | Chess
WD-GCN               | 0.6979      | 0.8067        | 0.1818 | 0.1077
EvolveGCN            | 0.6019      | 0.3474        | 0.1730 | 0.0655
GCN                  | 0.6872      | 0.7392        | 0.1788 | 0.0852
TensorGCN (proposed) | 0.7817      | 0.8094        | 0.1601 | 0.1736

precision = TP / (TP + FP),  recall = TP / (TP + FN)

SLIDE 34

Tensor vs. Matrix Learning: MNIST Database Results

Data: 28 × 28 grayscale images of handwritten digits; 60,000 train, 10,000 test.
Fixed parameters: h = 0.1, α = 0.1, σ = tanh, batch size = 20, 100 epochs.
Learnable parameters: matrix - (28⁴)N + (28²)N; tensor - (28³)N + (28²)N.

E. Newman, L. Horesh, H. Avron, M. Kilmer, Stable tensor neural networks for rapid deep learning, 2019.

SLIDE 35

Tensor vs. Matrix Learning: CIFAR-10 Database Results

Data: 32 × 32 × 3 RGB images from 10 classes; 50,000 train, 10,000 test.
Fixed parameters: h = 0.1, α = 0.01, σ = tanh, batch size = 100, 300 epochs, M = DCT matrix.
Learnable parameters: matrix - (3² · 32⁴)N + (3 · 32²)N; tensor - (3² · 32³)N + (3 · 32²)N.

A. Krizhevsky, Learning multiple layers of features from tiny images, 2009.
E. Newman, L. Horesh, H. Avron, M. Kilmer, Stable tensor neural networks for rapid deep learning, 2019.

SLIDE 36

Model reduction for NN?


SLIDE 37

Recall - Proper Orthogonal Decomposition

Dynamical system (scalar nonlinear PDE):

∂y(t)/∂t = Ay(t) + F(y(t)),

where t ∈ [0, T] denotes time, y(t) = [y_1(t), ..., y_n(t)]^⊤ ∈ R^n, A ∈ R^(n×n) is a constant matrix, and F is a nonlinear function applied entrywise, F = [F(y_1(t)), ..., F(y_n(t))]^⊤.

The discretized system: A y(µ) + F(y(µ)) = 0, with corresponding Jacobian

J(y(µ)) := A + J_F(y(µ)), where J_F(y(µ)) = diag{F′(y_1(µ)), ..., F′(y_n(µ))} ∈ R^(n×n),

and F′ denotes the first derivative of F.

SLIDE 38

Proper Orthogonal Decomposition

POD uses the first k left singular vectors of the snapshot matrix Y = [y_1, ..., y_ns]. Given the SVD of Y, Y = VΣW^⊤, project the system:

∂ỹ(t)/∂t = V_k^⊤ A V_k ỹ(t) + V_k^⊤ F(V_k ỹ(t)).

The reduced-order system becomes

Ã ỹ(µ) + V_k^⊤ F(V_k ỹ(µ)) = 0,

with the corresponding Jacobian

J̃(ỹ(µ)) := Ã + V_k^⊤ J_F(V_k ỹ(µ)) V_k,  where Ã = V_k^⊤ A V_k.
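A minimal sketch of the POD projection via the SVD, in the notation above (function names are ours):

```python
import numpy as np

def pod_basis(Y, k):
    """First k left singular vectors V_k of the snapshot matrix Y (n, ns)."""
    V, _, _ = np.linalg.svd(Y, full_matrices=False)
    return V[:, :k]

def reduced_residual(A, F, Vk, y_red):
    """Galerkin residual Ã ỹ + V_kᵀ F(V_k ỹ), where Ã = V_kᵀ A V_k."""
    y = Vk @ y_red                      # lift the reduced state to R^n
    return Vk.T @ (A @ y + F(y))        # project the full residual to R^k
```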

SLIDE 39

Discrete empirical interpolation method

DEIM approximates the nonlinear function by projecting it onto a subspace of dimension m ≪ n. Given a basis U = [u_1, ..., u_m], we approximate F(τ) ≈ U c(τ), where c(τ) is the corresponding coefficient vector.

Interpolation matrix: P = [e_φ1, ..., e_φm] ∈ R^(n×m), where e_φi is the φi-th column of the identity matrix.

The nonlinear function in the PDE and the Jacobian are approximated as:

F(V_k ỹ(µ)) ≈ U (P^⊤ U)⁻¹ F(P^⊤ V_k ỹ(µ)),
J̃_F(ỹ(µ)) ≈ V_k^⊤ U (P^⊤ U)⁻¹ J_F(P^⊤ V_k ỹ(µ)) P^⊤ V_k.

Chaturantabut and Sorensen, Discrete empirical interpolation for nonlinear model reduction, 2009.
Saibaba, Randomized discrete empirical interpolation method for nonlinear model reduction, 2019.
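The interpolation indices φ are typically chosen greedily; a sketch of the classical Chaturantabut–Sorensen selection (our function name):

```python
import numpy as np

def deim_indices(U):
    """Greedy DEIM indices for a basis U (n, m), so that F ≈ U (PᵀU)⁻¹ PᵀF."""
    n, m = U.shape
    phi = [int(np.argmax(np.abs(U[:, 0])))]
    for j in range(1, m):
        # Interpolate the next basis vector at the points chosen so far ...
        c = np.linalg.solve(U[phi, :j], U[phi, j])
        r = U[:, j] - U[:, :j] @ c
        # ... and add the point where the interpolation residual is largest.
        phi.append(int(np.argmax(np.abs(r))))
    return np.array(phi)
```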

SLIDE 40

PDE based NNs

Residual networks (ResNets): given training data Y = [y_1, y_2, ..., y_s] ∈ R^(n×s) and targets C = [c_1, c_2, ..., c_s] ∈ R^(d×s), an N-layer ResNet is given by

Y_(j+1) = Y_j + σ(A_j Y_j + b_j) for j = 0, ..., N−1.

General formulation: F(θ, Y) = A_2(θ^(3)) σ(N(A_1(θ^(1)) Y, θ^(2))), with forward propagation Y_(j+1) = Y_j + F(θ^(j), Y_j) for j = 0, ..., N−1.

This is a forward Euler discretization of the initial value problem

∂_t Y(θ, t) = F(θ(t), Y(t)) for t ∈ (0, T],  Y(θ, 0) = Y_0.

Ruthotto and Haber, Deep neural networks motivated by partial differential equations, 2019.
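Viewed this way, a ResNet forward pass is just an explicit Euler loop; a sketch with step size h (the experiments earlier used h = 0.1) and σ = tanh:

```python
import numpy as np

def resnet_forward(Y0, layers, h=1.0):
    """Y_{j+1} = Y_j + h·σ(A_j Y_j + b_j): forward Euler for ∂_t Y = σ(A Y + b)."""
    Y = Y0
    for A, b in layers:                          # one (A_j, b_j) pair per layer
        Y = Y + h * np.tanh(A @ Y + b[:, None])  # b broadcast across the s samples
    return Y
```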

SLIDE 41

Model reduction for PDE-NN?

Consider the stable variant of convolutional ResNets with a symmetric layer:

F_sym(θ, Y) = −A(θ)^⊤ σ(N(A(θ) Y, θ)).

The Jacobian of this function with respect to the features is

J_F(Y) = −A(θ)^⊤ diag(σ′(A(θ) Y)) A(θ),

where σ′ is the derivative of the pointwise nonlinearity (so, for monotonic σ, the Jacobian is symmetric negative semidefinite).

Reduced parameters via DEIM: precompute the projection basis U and interpolation matrix P; then

F_sym(θ, Y) ≈ U (P^⊤ U)⁻¹ F̃_sym(θ, P^⊤ Y), where F̃_sym(θ, P^⊤ Y) = −Ã(θ)^⊤ σ(N(Ã(θ) P^⊤ Y, θ)),

and Ã ∈ R^(d×m) is the reduced weight matrix to be learned.

Open question: an effective approach to compute U and P, since the function F depends on θ.
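A sketch of the symmetric layer and its DEIM-reduced counterpart, assuming σ = tanh, interpolation indices φ as returned by the DEIM selection above, and our own hypothetical names (`A_red` plays the role of Ã):

```python
import numpy as np

def f_sym(A, Y):
    """F_sym(θ, Y) = −A(θ)ᵀ σ(A(θ) Y), with σ = tanh and bias omitted."""
    return -A.T @ np.tanh(A @ Y)

def f_sym_deim(A_red, U, phi, Y):
    """F_sym ≈ U (PᵀU)⁻¹ F̃_sym(θ, PᵀY); Pᵀ Y is simply the rows Y[phi]."""
    F_tilde = -A_red.T @ np.tanh(A_red @ Y[phi])    # reduced layer at m points
    return U @ np.linalg.solve(U[phi, :], F_tilde)  # lift through (PᵀU)⁻¹
```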

SLIDE 42

Thank you!
