

SLIDE 1

Machine Learning with Quantum-Inspired Tensor Networks

E.M. Stoudenmire and David J. Schwab
RIKEN AICS, Mar 2017
Advances in Neural Information Processing Systems 29, arxiv:1605.05775

SLIDE 2

Collaboration with David J. Schwab, Northwestern and CUNY Graduate Center

Quantum Machine Learning, Perimeter Institute, Aug 2016

SLIDE 3

Exciting time for machine learning

  • Self-driving cars
  • Language Processing
  • Medicine
  • Materials Science / Chemistry

SLIDE 4

Progress in neural networks and deep learning

[figure: neural network diagram]

SLIDE 5

Convolutional neural network ↔ "MERA" tensor network [side-by-side diagrams]

SLIDE 6

Are tensor networks useful for machine learning?

This Talk

Tensor networks fit naturally into kernel learning. Many benefits for learning:

  • Linear scaling
  • Adaptive
  • Feature sharing

(Also very strong connections to graphical models)

SLIDE 7-8

Machine Learning ↔ Physics (concept map)

Machine learning: Neural Nets · Boltzmann Machines · Supervised Learning · Unsupervised Learning · Kernel Learning · Tensor Networks

Physics: Phase Transitions · Topological Phases · Quantum Monte Carlo Sign Problem · Materials Science & Chemistry

(This talk: supervised kernel learning with tensor networks.)

SLIDE 9

What are Tensor Networks?

SLIDE 10

How do tensor networks arise in physics? Quantum systems are governed by the Schrödinger equation, which is just an eigenvalue problem:

$\hat{H} \vec{\Psi} = E \vec{\Psi}$

SLIDE 11

The problem: $\hat{H}$ is a $2^N \times 2^N$ matrix.

$\hat{H} \vec{\Psi} = E \vec{\Psi} \;\Rightarrow\;$ the wavefunction $\vec{\Psi}$ has $2^N$ components.

SLIDE 12

Natural to view wavefunction as order-N tensor

$|\Psi\rangle = \sum_{\{s\}} \Psi^{s_1 s_2 s_3 \cdots s_N} |s_1 s_2 s_3 \cdots s_N\rangle$
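As a concrete aside (a minimal NumPy sketch of my own, not from the slides): a state vector of $2^N$ amplitudes and an order-N tensor with one two-valued index per spin are literally the same data.

```python
import numpy as np

N = 6                            # number of spins
psi = np.random.randn(2**N)      # 2^N amplitudes: exponential in N
psi /= np.linalg.norm(psi)       # normalize the state

# View the same data as an order-N tensor, one index s_i in {0, 1} per spin
Psi = psi.reshape((2,) * N)

# Amplitude of one configuration (encoding 0 = up, 1 = down)
print(Psi[0, 0, 1, 1, 0, 0])
```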

SLIDE 13

Natural to view wavefunction as order-N tensor

$\Psi^{s_1 s_2 s_3 \cdots s_N} = $ [tensor diagram: one node with open legs $s_1, s_2, s_3, \ldots, s_N$]

SLIDE 14-15

Tensor components related to probabilities of e.g. Ising model spin configurations

[diagrams: $\Psi$ evaluated on specific configurations such as $\uparrow\uparrow\downarrow\downarrow\uparrow\uparrow\uparrow$ and $\downarrow\downarrow\downarrow\uparrow\downarrow\uparrow\uparrow$]

SLIDE 16

Must find an approximation to this exponential problem

$\Psi^{s_1 s_2 s_3 \cdots s_N} = $ [tensor diagram with open legs $s_1, \ldots, s_N$]

SLIDE 17

Simplest approximation (mean field / rank-1): let the spins "do their own thing."

$\Psi^{s_1 s_2 s_3 s_4 s_5 s_6} \simeq \psi^{s_1} \psi^{s_2} \psi^{s_3} \psi^{s_4} \psi^{s_5} \psi^{s_6}$

Expected values of individual spins are ok, but there are no correlations.
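A sketch of the rank-1 (mean-field) approximation in NumPy (my own illustration): the full tensor is replaced by an outer product of N independent two-component vectors, so it stores only 2N numbers and carries no correlations.

```python
import numpy as np
from functools import reduce

N = 6
# One independent two-component vector psi^{s_j} per spin
psis = [np.random.randn(2) for _ in range(N)]

# Product-state tensor: Psi[s1,...,sN] ~= psi1[s1] * psi2[s2] * ... * psiN[sN]
Psi_mf = reduce(np.multiply.outer, psis)
print(Psi_mf.shape)   # (2, 2, 2, 2, 2, 2) entries, but only 2*N free parameters
```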

SLIDE 18-19

Restore correlations locally

[diagram: sites $s_1 \ldots s_6$; starting from the mean-field product $\Psi^{s_1 \cdots s_6} \simeq \psi^{s_1} \psi^{s_2} \psi^{s_3} \psi^{s_4} \psi^{s_5} \psi^{s_6}$, new bond indices $i_1, i_2, \ldots$ are introduced between neighboring tensors]

SLIDE 20

Matrix product state (MPS): restore correlations locally

$\Psi^{s_1 s_2 s_3 s_4 s_5 s_6} \simeq \sum_{i_1 \cdots i_5} \psi^{s_1}_{i_1} \psi^{s_2}_{i_1 i_2} \psi^{s_3}_{i_2 i_3} \psi^{s_4}_{i_3 i_4} \psi^{s_5}_{i_4 i_5} \psi^{s_6}_{i_5}$

  • Local expected values accurate
  • Correlations decay with spatial distance

SLIDE 21-22

"Matrix product state" because retrieving an element is a product of matrices:

$\Psi^{\uparrow\uparrow\uparrow\downarrow\downarrow\downarrow} = $ [product of the six matrices selected by fixing each physical index]
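That statement can be made literal in a few lines. A hedged sketch (with my own boundary conventions: the first and last tensors carry only one bond index): fixing a value for every physical index leaves one matrix per site, and the amplitude is the product of those matrices.

```python
import numpy as np

N, d, m = 6, 2, 4   # sites, physical dimension, bond dimension

# Random MPS: first tensor (d, m), interior tensors (m, d, m), last tensor (m, d)
A = [np.random.randn(d, m)]
A += [np.random.randn(m, d, m) for _ in range(N - 2)]
A += [np.random.randn(m, d)]

def amplitude(spins):
    """Psi^{s1...sN}: fix each physical index, then multiply the matrices."""
    v = A[0][spins[0]]                  # shape (m,)
    for j in range(1, N - 1):
        v = v @ A[j][:, spins[j], :]    # multiply in the j-th (m, m) matrix
    return v @ A[-1][:, spins[-1]]      # contract with the final (m,) vector

print(amplitude([0, 0, 0, 1, 1, 1]))    # e.g. the up-up-up-down-down-down element
```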

SLIDE 23

Tensor diagrams have rigorous meaning

[diagrams: a vector $v_j$ (one line $j$), a matrix $M_{ij}$ (two lines $i, j$), an order-3 tensor $T_{ijk}$ (three lines $i, j, k$)]

SLIDE 24

Joining lines implies contraction, can omit names

X

j

Mijvj

j

i

AijBjk = AB AijBji = Tr[AB]
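These diagram rules map directly onto einsum-style contractions; a quick sketch:

```python
import numpy as np

M = np.random.randn(3, 4)
v = np.random.randn(4)
A = np.random.randn(3, 3)
B = np.random.randn(3, 3)

Mv   = np.einsum('ij,j->i', M, v)      # joined line j: sum_j M_ij v_j
AB   = np.einsum('ij,jk->ik', A, B)    # one joined line: the matrix product AB
trAB = np.einsum('ij,ji->', A, B)      # both lines joined: Tr[AB]

assert np.allclose(trAB, np.trace(A @ B))
```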

SLIDE 25

MPS approximation controlled by bond dimension $m$ (like an SVD rank).

Compress $2^N$ parameters into $N \cdot 2 \cdot m^2$ parameters; an MPS with $m \sim 2^{N/2}$ can represent any tensor.

MPS = matrix product state
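The "bond dimension = SVD rank" analogy can be demonstrated directly (a sketch of my own, compressing across a single cut rather than building the full MPS): reshape the tensor into a matrix at one bond, SVD it, and keep only m singular values.

```python
import numpy as np

N = 10
Psi = np.random.randn(*(2,) * N)        # full tensor: 2^N parameters

# Cut the chain between sites 5 and 6 and SVD across that bond
mat = Psi.reshape(2**5, 2**5)
U, S, Vh = np.linalg.svd(mat, full_matrices=False)

m = 8                                   # bond dimension kept at this cut
approx = (U[:, :m] * S[:m]) @ Vh[:m]    # best rank-m approximation of the cut
print(np.linalg.norm(mat - approx))     # truncation error at this bond
```

Repeating such truncated SVDs bond by bond is how a full MPS with $N \cdot 2 \cdot m^2$ parameters is obtained from the $2^N$ original ones.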

SLIDE 26

Friendly neighborhood of "quantum state space"

[diagram: nested regions of states reachable at bond dimensions m = 1, 2, 4, 8, with the target state $\Psi$]

SLIDE 27

MPS lead to powerful optimization techniques (DMRG algorithm)

MPS = matrix product state

White, PRL 69, 2863 (1992) Stoudenmire, White, PRB 87, 155137 (2013)

SLIDE 28

Besides MPS, other successful tensor networks are PEPS and MERA.

PEPS (2D systems): Verstraete, Cirac, cond-mat/0407066 (2004); Orus, Ann. Phys. 349, 117 (2014)

MERA (critical systems): Evenbly, Vidal, PRB 79, 144108 (2009)

SLIDE 29

Supervised Kernel Learning

SLIDE 30

Supervised Learning

Very common task: labeled training data (= supervised). Input vector $x$, e.g. image pixels.

Find a decision function $f(x)$:
$f(x) > 0 \Rightarrow x \in A$
$f(x) < 0 \Rightarrow x \in B$

SLIDE 31-32

ML Overview: use training data to build model

[diagram: training data points $x_1, x_2, \ldots, x_{16}$]

SLIDE 33

ML Overview: use training data to build model; generalize to unseen test data.

SLIDE 34

ML Overview: popular approaches

Neural Networks: $f(x) = \Phi_2(M_2\, \Phi_1(M_1 x))$

Non-Linear Kernel Learning: $f(x) = W \cdot \Phi(x)$

SLIDE 35

Non-linear kernel learning

Want to separate classes; a linear classifier $f(x) = W \cdot x$ is often insufficient.

[diagram: two classes that no line can separate]

SLIDE 36-37

Non-linear kernel learning: apply a non-linear "feature map" $x \to \Phi(x)$

Decision function: $f(x) = W \cdot \Phi(x)$

SLIDE 38

Non-linear kernel learning

Decision function $f(x) = W \cdot \Phi(x)$: a linear classifier in feature space.

SLIDE 39

Non-linear kernel learning

Example of a feature map: $x = (x_1, x_2, x_3)$ is "lifted" to feature space by

$\Phi(x) = (1, x_1, x_2, x_3, x_1 x_2, x_1 x_3, x_2 x_3)$

SLIDE 40

Proposal for Learning

SLIDE 41

Grayscale image data

SLIDE 42-44

Map pixels to "spins"

[animation: each grayscale pixel value is mapped to a two-component "spin" vector]

SLIDE 45

Local feature map, dimension d = 2, for $x_j \in [0, 1]$:

$\phi(x_j) = \left[ \cos\!\left(\tfrac{\pi}{2} x_j\right),\; \sin\!\left(\tfrac{\pi}{2} x_j\right) \right]$

Crucially, distinct grayscale values are not orthogonal.

x = input
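This local feature map transcribes directly into code (assuming grayscale values already scaled to [0, 1]):

```python
import numpy as np

def local_feature(xj):
    """phi(x_j) = [cos(pi/2 * x_j), sin(pi/2 * x_j)] for x_j in [0, 1]."""
    return np.array([np.cos(np.pi / 2 * xj), np.sin(np.pi / 2 * xj)])

print(local_feature(0.0))        # white pixel -> [1, 0] ("up")
print(local_feature(1.0))        # black pixel -> [0, 1] ("down"), up to rounding
print(local_feature(0.0) @ local_feature(0.5))   # nonzero: not orthogonal
```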

SLIDE 46

Total feature map

$\Phi^{s_1 s_2 \cdots s_N}(x) = \phi^{s_1}(x_1) \otimes \phi^{s_2}(x_2) \otimes \cdots \otimes \phi^{s_N}(x_N)$

  • Tensor product of local feature maps / vectors
  • Just like a product-state wavefunction of spins
  • $\Phi(x)$ is a vector in a $2^N$-dimensional space

φ = local feature map, x = input
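A sketch of the total feature map for a tiny input (my own illustration; for realistic N one never forms this $2^N$-component vector explicitly):

```python
import numpy as np
from functools import reduce

def local_feature(xj):
    return np.array([np.cos(np.pi / 2 * xj), np.sin(np.pi / 2 * xj)])

def total_feature(x):
    """Phi(x) = phi(x_1) (x) phi(x_2) (x) ... (x) phi(x_N), a 2^N-vector."""
    return reduce(np.kron, [local_feature(xj) for xj in x])

x = np.random.rand(8)                # 8 "pixels" in [0, 1]
print(total_feature(x).shape)        # (256,) == 2^8
```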

SLIDE 47

Total feature map, more detailed notation

raw inputs: $x = [x_1, x_2, x_3, \ldots, x_N]$

feature vector: $\Phi(x) = \begin{bmatrix} \phi^1(x_1) \\ \phi^2(x_1) \end{bmatrix} \otimes \begin{bmatrix} \phi^1(x_2) \\ \phi^2(x_2) \end{bmatrix} \otimes \begin{bmatrix} \phi^1(x_3) \\ \phi^2(x_3) \end{bmatrix} \otimes \cdots \otimes \begin{bmatrix} \phi^1(x_N) \\ \phi^2(x_N) \end{bmatrix}$

φ = local feature map, x = input

SLIDE 48

Total feature map, tensor diagram notation

raw inputs: $x = [x_1, x_2, x_3, \ldots, x_N]$

$\Phi^{s_1 s_2 \cdots s_N}(x) = \phi^{s_1} \phi^{s_2} \phi^{s_3} \phi^{s_4} \phi^{s_5} \phi^{s_6} \cdots \phi^{s_N}$ [diagram: N disconnected tensors, one open leg $s_j$ each]

φ = local feature map, x = input

SLIDE 49-52

Construct decision function $f(x) = W \cdot \Phi(x)$

[diagram: the order-N feature tensor $\Phi(x)$ is contracted with the order-N weight tensor $W$ to give the scalar $f(x)$]

SLIDE 53

Main approximation: represent the order-N tensor $W$ as a matrix product state (MPS).

SLIDE 54

MPS form of decision function

[diagram: $f(x) = W \cdot \Phi(x)$ with $W$ an MPS]

SLIDE 55-59

Linear scaling

[diagram: $f(x) = W \cdot \Phi(x)$ with $W$ an MPS]

Can use an algorithm similar to DMRG to optimize. Scaling is $N \cdot N_T \cdot m^3$

N = size of input
N_T = size of training set
m = MPS bond dimension

Could improve with stochastic gradient.
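To see where the linear scaling comes from, here is a minimal evaluation sketch (my own illustration with random placeholder weights and boundary-vector conventions; training would optimize these tensors via DMRG-style sweeps): $\Phi(x)$ is never formed explicitly; each local vector $\phi(x_j)$ is contracted into its MPS tensor, and what remains is a product of small matrices, costing $O(N m^2)$ per sample.

```python
import numpy as np

N, d, m = 16, 2, 10

# W as a random MPS (placeholder; training would optimize these tensors)
W = [np.random.randn(d, m)]
W += [np.random.randn(m, d, m) for _ in range(N - 2)]
W += [np.random.randn(m, d)]

def phi(xj):
    return np.array([np.cos(np.pi / 2 * xj), np.sin(np.pi / 2 * xj)])

def f(x):
    """f(x) = W . Phi(x), contracted site by site: O(N m^2) per sample."""
    v = phi(x[0]) @ W[0]                                 # shape (m,)
    for j in range(1, N - 1):
        v = v @ np.einsum('s,asb->ab', phi(x[j]), W[j])  # absorb phi(x_j), then matrix-multiply
    return v @ (W[-1] @ phi(x[-1]))                      # close with the last site

print(f(np.random.rand(N)))
```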

SLIDE 60

Multi-class extension of model

$f^\ell(x) = W^\ell \cdot \Phi(x)$

The index $\ell$ runs over the possible labels. Predicted label is $\operatorname{argmax}_\ell |f^\ell(x)|$.

[diagram: $\Phi(x)$ contracted with $W^\ell$, which carries an extra open label leg $\ell$]
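The prediction rule itself is one line; a sketch assuming `models` is a hypothetical list of ten single-label decision functions $f^\ell$ like the evaluator sketched above:

```python
import numpy as np

def predict(x, models):
    """Predicted label = argmax_l |f^l(x)|, one decision value per label."""
    scores = np.array([f_l(x) for f_l in models])
    return int(np.argmax(np.abs(scores)))
```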

SLIDE 61

MNIST Experiment

MNIST is a benchmark data set of grayscale handwritten digits (labels = 0, 1, 2, …, 9).

60,000 labeled training images; 10,000 labeled test images.

SLIDE 62

MNIST Experiment One-dimensional mapping

SLIDE 63

MNIST Experiment: results

Bond dimension | Test set error
m = 10         | ~5%   (500/10,000 incorrect)
m = 20         | ~2%   (200/10,000 incorrect)
m = 120        | 0.97% (97/10,000 incorrect)

State of the art is < 1% test set error.

SLIDE 64

MNIST Experiment: demo

Link: http://itensor.org/miles/digit/index.html

SLIDE 65

Understanding Tensor Network Models

[diagram: $f(x) = W \cdot \Phi(x)$]

SLIDE 66

Again assume $W$ is an MPS. Many interesting benefits; two are:

  • 1. Adaptive
  • 2. Feature sharing

SLIDE 67
1. Tensor networks are adaptive

[diagram: grayscale training data; boundary pixels are not useful for learning]

SLIDE 68

2. Feature sharing

[diagram: $f^\ell(x) = W^\ell \cdot \Phi(x)$, with the label index $\ell$ on the central tensor of the MPS]

  • Different central tensors
  • "Wings" shared between models
  • Regularizes models

SLIDE 69-72

2. Feature sharing: progressively learn shared features, delivered to the central tensor carrying the label index $\ell$

[animation: $f^\ell(x)$ built up site by site]

SLIDE 73

Nature of Weight Tensor

Representer theorem says the exact optimal weights satisfy

$W = \sum_j \alpha_j \Phi(x_j)$

[figure: density plots of trained $W^\ell$ for each label $\ell = 0, 1, \ldots, 9$]

SLIDE 74

Nature of Weight Tensor

Representer theorem says the exact optimal weights satisfy $W = \sum_j \alpha_j \Phi(x_j)$. The tensor network approximation can violate this condition for any $\{\alpha_j\}$:

$W_{\mathrm{MPS}} \neq \sum_j \alpha_j \Phi(x_j)$

  • Tensor network learning is not interpolation
  • Interesting consequences for generalization?

SLIDE 75

Some Future Directions

  • Apply to 1D data sets (audio, time series)
  • Other tensor networks: TTN, PEPS, MERA
  • Useful to interpret $|W \cdot \Phi(x)|^2$ as a probability? Could import even more physics insights.
  • Features extracted by elements of tensor network?

SLIDE 76

What functions are realized for arbitrary $W$? Instead of the "spin" local feature map, use*

$\phi(x) = (1, x)$

*Novikov, et al., arxiv:1605.03795

Recall the total feature map is

$\Phi(x) = \begin{bmatrix} 1 \\ x_1 \end{bmatrix} \otimes \begin{bmatrix} 1 \\ x_2 \end{bmatrix} \otimes \begin{bmatrix} 1 \\ x_3 \end{bmatrix} \otimes \cdots \otimes \begin{bmatrix} 1 \\ x_N \end{bmatrix}$

SLIDE 77

N = 2 case, $\phi(x) = (1, x)$:

$\Phi(x) = \begin{bmatrix} 1 \\ x_1 \end{bmatrix} \otimes \begin{bmatrix} 1 \\ x_2 \end{bmatrix} = (1, x_1, x_2, x_1 x_2)$

$f(x) = W \cdot \Phi(x) = (W_{11}, W_{21}, W_{12}, W_{22}) \cdot (1, x_1, x_2, x_1 x_2) = W_{11} + W_{21}\, x_1 + W_{12}\, x_2 + W_{22}\, x_1 x_2$
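A quick numerical check of the N = 2 expansion (a sketch with 0-indexed weights standing in for the slide's $W_{s_1 s_2}$):

```python
import numpy as np

x1, x2 = 0.3, 0.7
W = np.random.randn(2, 2)               # 0-indexed stand-in for W_{s1 s2}

Phi = np.kron([1.0, x1], [1.0, x2])     # = (1, x2, x1, x1*x2) in kron order
f = W.reshape(-1) @ Phi                 # W flattened in the matching order

expected = W[0, 0] + W[1, 0] * x1 + W[0, 1] * x2 + W[1, 1] * x1 * x2
assert np.isclose(f, expected)          # matches the expanded polynomial
```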

SLIDE 78

N = 3 case, $\phi(x) = (1, x)$:

$\Phi(x) = \begin{bmatrix} 1 \\ x_1 \end{bmatrix} \otimes \begin{bmatrix} 1 \\ x_2 \end{bmatrix} \otimes \begin{bmatrix} 1 \\ x_3 \end{bmatrix} = (1, x_1, x_2, x_3, x_1 x_2, x_1 x_3, x_2 x_3, x_1 x_2 x_3)$

$f(x) = W \cdot \Phi(x) = W_{111} + W_{211}\, x_1 + W_{121}\, x_2 + W_{112}\, x_3 + W_{221}\, x_1 x_2 + W_{212}\, x_1 x_3 + W_{122}\, x_2 x_3 + W_{222}\, x_1 x_2 x_3$

SLIDE 79

Novikov, Trofimov, Oseledets, arxiv:1605.03795 (2016)

General N case, $x \in \mathbb{R}^N$:

$f(x) = W \cdot \Phi(x) = W_{111\cdots 1} + W_{211\cdots 1}\, x_1 + W_{121\cdots 1}\, x_2 + W_{112\cdots 1}\, x_3 + \ldots + W_{221\cdots 1}\, x_1 x_2 + W_{212\cdots 1}\, x_1 x_3 + \ldots + W_{222\cdots 1}\, x_1 x_2 x_3 + \ldots + W_{222\cdots 2}\, x_1 x_2 x_3 \cdots x_N$

constant · singles · doubles · triples · … · N-tuple

The model has exponentially many formal parameters.

SLIDE 80

Related Work

Cohen, Sharir, Shashua (arxiv: 1410.0781, 1506.03059, 1603.00162, 1610.04167)

  • tree tensor networks
  • expressivity of tensor network models
  • correlations of data (analogue of entanglement entropy)
  • generative proposal

Novikov, Trofimov, Oseledets (arxiv: 1605.03795)

  • matrix product states + kernel learning
  • stochastic gradient descent
SLIDE 81

Other MPS-related work (MPS = "tensor trains")

Markov random field models: Novikov et al., Proceedings of the 31st ICML (2014)

Large-scale PCA: Lee, Cichocki, arxiv:1410.6895 (2014)

Feature extraction of tensor data: Bengua et al., IEEE Congress on Big Data (2015)

Compressing weights of neural nets: Novikov et al., Advances in Neural Information Processing Systems (2015)