Contrastive Divergence by Accelerated Langevin Dynamics


SLIDE 1

Contrastive Divergence by Accelerated Langevin Dynamics

Masayuki Ohzeki

Kyoto University

2015/08/11 Japan-France Joint Seminar “New Frontiers in Non-equilibrium Physics of Glassy Materials”

This work is in collaboration with M. Yasuda (Yamagata Univ.) and A. Ichiki (Nagoya Univ.).
  • M. Ohzeki (KU)

Japan-France Joint Seminar 2015/08/11 1 / 20

SLIDE 2

1. Accelerated Langevin dynamics: formulation; example: double-valley potential; example: XY model

2. Boltzmann machine learning: basics; contrastive divergence; preliminary result

3. Summary


SLIDE 3

What is accelerated stochastic dynamics?


SLIDE 4

Ordinary Langevin dynamics

The over-damped N-dimensional Langevin dynamics is given by
$$dx = -\frac{\partial E(x)}{\partial x}\,dt + \sqrt{2T}\,dW,$$
where $T$ is the temperature and $W$ is the Wiener process.

Equilibrium distribution

The equilibrium state is
$$P_{\mathrm{eq}}(x) = \frac{1}{Z}\exp\left(-\frac{E(x)}{T}\right).$$

Why use this dynamics?

- Investigation of the probability distribution in the dynamics
- Simulation of natural stochastic dynamics

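The dynamics above can be simulated with a simple Euler-Maruyama discretization. A minimal sketch in Python, using the double-valley potential that appears later in the talk as the example energy (the step size, duration, and starting configuration are illustrative assumptions, not from the slides):

```python
import numpy as np

def langevin_step(x, grad_E, T, dt, rng):
    """One Euler-Maruyama step of dx = -dE/dx dt + sqrt(2T) dW."""
    return x - grad_E(x) * dt + np.sqrt(2.0 * T * dt) * rng.normal(size=x.shape)

# Double-valley potential E(x) = -x^2/2 + x^4/4, so dE/dx = -x + x^3
grad_E = lambda x: -x + x**3

rng = np.random.default_rng(0)
x = np.full(1000, -1.0)            # 1000 particles, all in the left valley
for _ in range(5000):              # total time t = 50 with dt = 0.01
    x = langevin_step(x, grad_E, T=1.0, dt=0.01, rng=rng)

# At T = 1 the equilibrium distribution is symmetric, so <x> relaxes toward 0.
print(abs(x.mean()))
```

After long enough simulation the empirical histogram of `x` approximates $P_{\mathrm{eq}}(x) \propto \exp(-E(x)/T)$, which is the sense in which this dynamics samples the Gibbs distribution.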


SLIDE 8

To evaluate the distribution quickly, we need not use the natural force.


SLIDE 9

Accelerated Langevin dynamics

Let us look for accelerated Langevin dynamics of the simple form
$$dx = \left(-\frac{\partial E(x)}{\partial x} + F(x)\right)dt + \sqrt{2T}\,dW,$$
where $T$ is the temperature and $W$ is the Wiener process.

Condition

The steady state has the Gibbs-Boltzmann distribution
$$P_{\mathrm{ss}}(x) = \frac{1}{Z}\exp\left(-\frac{E(x)}{T}\right).$$
What force can keep the same steady state?



SLIDE 12

Nontrivial force [M. Ohzeki and A. Ichiki (2015)]

Find a solution of the Fokker-Planck equation
$$\frac{\partial P_t(x)}{\partial t} = -\frac{\partial}{\partial x}\left(-\frac{\partial E(x)}{\partial x} + F(x) - T\frac{\partial}{\partial x}\right)P_t(x).$$
The condition reduces to
$$0 = -\frac{\partial}{\partial x}\left(F(x)P_{\mathrm{ss}}(x)\right).$$

- Equilibrium force: $F(x) = 0$
- Exponential force: $F(x) \propto \gamma \exp\left(E(x)/T\right)$
- Rotational force:
$$[F(x)]_{P(i)} = \gamma\left(\left[\frac{\partial E(x)}{\partial x}\right]_{P(i-1)} - \left[\frac{\partial E(x)}{\partial x}\right]_{P(i+1)}\right),$$
where $P(i)$ is a permutation of the indices.



SLIDE 15

Nontrivial force in a duplicated system [M. Ohzeki and A. Ichiki (2015)]

Find a solution of the Fokker-Planck equation for a duplicated system:
$$\frac{\partial P_t(x_1,x_2)}{\partial t} = -\frac{\partial}{\partial x_1}\left(-\frac{\partial E(x_1)}{\partial x_1} + F_1(x_1,x_2) - T\frac{\partial}{\partial x_1}\right)P_t(x_1,x_2) - \frac{\partial}{\partial x_2}\left(-\frac{\partial E(x_2)}{\partial x_2} + F_2(x_1,x_2) - T\frac{\partial}{\partial x_2}\right)P_t(x_1,x_2).$$
The condition reduces to
$$0 = -\frac{\partial}{\partial x_1}\left(F_1(x_1,x_2)P_{\mathrm{ss}}(x_1)P_{\mathrm{ss}}(x_2)\right) - \frac{\partial}{\partial x_2}\left(F_2(x_1,x_2)P_{\mathrm{ss}}(x_1)P_{\mathrm{ss}}(x_2)\right).$$

Nontrivial force in the duplicated system:
$$F_1(x_1,x_2) = \gamma\frac{\partial E(x_2)}{\partial x_2}, \qquad F_2(x_1,x_2) = -\gamma\frac{\partial E(x_1)}{\partial x_1}.$$


SLIDE 16

What does the nontrivial force yield?

- Violation of the detailed balance condition (γ ≠ 0)
- Convergence to a nonequilibrium steady state
- Faster convergence than the equilibrium system, shown analytically by matrix analysis [A. Ichiki and M. Ohzeki (2013)]


SLIDE 17

Example: double-valley potential [M. Ohzeki and A. Ichiki (2015)]

We set N = 1000 particles in a double-valley potential
$$E(x) = -\frac{1}{2}x^2 + \frac{1}{4}x^4.$$

[Figure: plot of $E(x)$ over $-2 \le x \le 2$, marking the initial point $x_0$ and the equilibrium point $x_{\mathrm{eq}}$.]


SLIDE 18

Example: double-valley potential [M. Ohzeki and A. Ichiki (2015)]

We set N = 1000 particles in the double-valley potential $E(x) = -\frac{1}{2}x^2 + \frac{1}{4}x^4$ at t = 5 and T = 1: γ = 0 (red) vs γ = 1 (blue and purple).

[Figures: $\langle x_1\rangle$, $\langle x_2\rangle$, and $\langle x\rangle$ vs time; trajectory in the $(x_1, x_2)$ plane from $(x_0, x_0)$ to $(x_{\mathrm{eq}}, x_{\mathrm{eq}})$.]


SLIDE 19

Example: double-valley potential [M. Ohzeki and A. Ichiki (2015)]

We set N = 1000 particles in the double-valley potential $E(x) = -\frac{1}{2}x^2 + \frac{1}{4}x^4$ at t = 5 and T = 1: γ = 0 (red) vs γ = 2 (blue and purple).

[Figures: $\langle x_1\rangle$, $\langle x_2\rangle$, and $\langle x\rangle$ vs time; trajectory in the $(x_1, x_2)$ plane from $(x_0, x_0)$ to $(x_{\mathrm{eq}}, x_{\mathrm{eq}})$.]


SLIDE 20

Example: double-valley potential [M. Ohzeki and A. Ichiki (2015)]

We set N = 1000 particles in the double-valley potential $E(x) = -\frac{1}{2}x^2 + \frac{1}{4}x^4$ at t = 5 and T = 1: γ = 0 (red) vs γ = 5 (blue and purple).

[Figures: $\langle x_1\rangle$, $\langle x_2\rangle$, and $\langle x\rangle$ vs time; trajectory in the $(x_1, x_2)$ plane from $(x_0, x_0)$ to $(x_{\mathrm{eq}}, x_{\mathrm{eq}})$.]


SLIDE 21

Example: double-valley potential [M. Ohzeki and A. Ichiki (2015)]

We set N = 1000 particles in the double-valley potential $E(x) = -\frac{1}{2}x^2 + \frac{1}{4}x^4$ at t = 5 and T = 1: γ = 0 (red) vs γ = 10 (blue and purple).

[Figures: $\langle x_1\rangle$, $\langle x_2\rangle$, and $\langle x\rangle$ vs time; trajectory in the $(x_1, x_2)$ plane from $(x_0, x_0)$ to $(x_{\mathrm{eq}}, x_{\mathrm{eq}})$.]


SLIDE 22

Example: double-valley potential [M. Ohzeki and A. Ichiki (2015)]

We set N = 1000 particles in the double-valley potential $E(x) = -\frac{1}{2}x^2 + \frac{1}{4}x^4$. We confirm reduction of the correlation time of $x$, defined by
$$\tau = \sum_{t=1}^{\infty}\frac{\langle O_i O_{i+t}\rangle - \langle O_i\rangle^2}{\langle O_i^2\rangle - \langle O_i\rangle^2}.$$

[Figure: $\tau$ vs $\gamma$ for $0 \le \gamma \le 10$.]

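The integrated autocorrelation time above can be estimated directly from a single time series. A minimal sketch; the truncation at the first negative estimate is a common practical heuristic, not something stated in the talk:

```python
import numpy as np

def autocorr_time(samples, max_lag=None):
    """Integrated autocorrelation time:
    tau = sum_{t>=1} (<O_i O_{i+t}> - <O>^2) / (<O^2> - <O>^2)."""
    o = np.asarray(samples, dtype=float) - np.mean(samples)
    var = np.mean(o * o)
    n = len(o)
    max_lag = max_lag or n // 2
    tau = 0.0
    for t in range(1, max_lag):
        c = np.mean(o[:-t] * o[t:]) / var
        if c < 0.0:                 # truncate when the noisy estimate turns negative
            break
        tau += c
    return tau

# Sanity check on an AR(1) process x_{t+1} = a x_t + noise, whose exact
# integrated autocorrelation time is a / (1 - a).
rng = np.random.default_rng(2)
a = 0.8
x = np.empty(200_000)
x[0] = 0.0
for t in range(1, len(x)):
    x[t] = a * x[t - 1] + rng.normal()
tau = autocorr_time(x)
print(tau)                          # should be near a / (1 - a) = 4
```

A smaller τ means effectively more independent samples per unit simulation time, which is the quantity the accelerated dynamics is shown to reduce.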

SLIDE 23

Example: XY model [M. Ohzeki and A. Ichiki (2015)]

We employ the XY model as an interacting many-body system:
$$E(x) = -\sum_{i=1}^{N}\sum_{j\in\partial i}\cos(x_i - x_j),$$
where $x_i \in [0, 2\pi)$ denotes the spin direction. We set N = 10 × 10 spins, average over 1000 independent runs, and compare γ = 0 (red) and γ = 10 (blue and purple) at T = 0.5, below $T_{\mathrm{KT}}$.

[Figures: magnetization vs time and internal energy vs time for $0 \le t \le 100$.]


SLIDE 24

Other accelerated stochastic dynamics:

- In MCMC, by the Suwa-Todo method (optimization of the transition matrix) [H. Suwa and S. Todo (2010)]
- In MCMC, by skewed DBC (global flow in a duplicated system) [Y. Sakai and K. Hukushima (2013)]
- Analytically, by optimization of the master equation (brachistochrone) [K. Takahashi and M. Ohzeki, to be submitted]


SLIDE 25

Our study: nonequilibrium physics → machine learning


SLIDE 26

What is Boltzmann machine learning?


SLIDE 27

Aim

Clarify a generative model of given high-dimensional data $x^{(d)} \in \mathbb{R}^N$ $(d = 1, 2, \dots, D)$ by maximum likelihood estimation, with the learning model
$$P(x|\theta) = \frac{1}{Z(\theta)}\exp\left(-E(x|\theta)\right).$$

[Figure: data points $x^{(1)}, \dots, x^{(7)}$ scattered under the distribution $P(x)$.]



SLIDE 29

How to perform the maximum likelihood estimation?

Compute the logarithm of the likelihood function:
$$L_D(\theta) = \frac{1}{D}\sum_{d=1}^{D}\log P(x = x^{(d)}|\theta).$$
Use the gradient method:
$$\frac{\partial L_D(\theta)}{\partial \theta} = -\frac{1}{D}\sum_{d=1}^{D}\frac{\partial E(x = x^{(d)}|\theta)}{\partial \theta} + \left\langle\frac{\partial E(x|\theta)}{\partial \theta}\right\rangle_{\theta},$$
where the first term is the empirical mean over the data and the second term is the thermal average of the model, $\langle\,\cdots\,\rangle_{\theta} = \sum_x P(x|\theta) \times \cdots$.
Iteratively update to reach the maximum:
$$\theta_{t+1} = \theta_t + \eta\,\frac{\partial L_D(\theta)}{\partial \theta}.$$

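For a toy model where the thermal average is available in closed form, the update loop above can be run exactly. A sketch with the one-parameter Gaussian $E(x|\theta) = \theta x^2/2$ (a hypothetical example, not from the talk), for which $\langle\partial E/\partial\theta\rangle_\theta = \langle x^2\rangle/2 = 1/(2\theta)$:

```python
import numpy as np

rng = np.random.default_rng(4)
theta_true = 2.0
# D data points from P(x|theta_true) ∝ exp(-theta_true * x^2 / 2)
data = rng.normal(scale=1.0 / np.sqrt(theta_true), size=5000)

theta, eta = 1.0, 1.0
for _ in range(200):
    # dL/dtheta = -(1/D) sum_d x_d^2 / 2  +  <x^2>_theta / 2,
    # with the exact thermal average <x^2>_theta = 1/theta
    grad = -np.mean(data**2) / 2.0 + 1.0 / (2.0 * theta)
    theta += eta * grad

print(theta)    # converges to 1 / mean(data^2), close to theta_true = 2
```

The fixed point matches the sufficient statistics of data and model, which is exactly the moment-matching condition the gradient expresses; the difficulty in real Boltzmann machines is that the thermal average is not available in closed form.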


SLIDE 32

How to evaluate the thermal average

Approximation or Monte-Carlo simulation.

Markov-chain Monte-Carlo method:
$$x_{t=0} \xrightarrow{\text{MCMC}} x_{t=1} \xrightarrow{\text{MCMC}} \cdots \xrightarrow{\text{MCMC}} x_{t=T}$$
Slow, but asymptotically exact as $T \to \infty$.

Contrastive divergence:
$$x^{(d)} \xrightarrow{\text{MCMC}} x_{t=1} \xrightarrow{\text{MCMC}} \cdots \xrightarrow{\text{MCMC}} x_{t=T}$$
Early stop, but good performance.

Pseudo-likelihood estimation, etc.: asymptotically exact as $D \to \infty$, but less flexible.

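The contrastive-divergence estimate above can be sketched with Langevin updates as the MCMC kernel: start the chain at the data, run for a short time, and compare sufficient statistics. A minimal sketch on the one-parameter Gaussian toy model $E(x|\theta) = \theta x^2/2$ (the model, γ-free dynamics, and all parameter values are illustrative assumptions, not the talk's setup):

```python
import numpy as np

def cd_gradient(data, theta, T=1.0, dt=0.01, n_steps=100, rng=None):
    """CD-style gradient for E(x|theta) = theta * x^2 / 2:
    start the chain at the data, run a short Langevin simulation,
    and compare the sufficient statistic x^2/2 under data vs. model samples."""
    rng = rng or np.random.default_rng()
    x = data.copy()
    for _ in range(n_steps):   # "CD-1": integration time t = n_steps * dt = 1
        x += -theta * x * dt + np.sqrt(2.0 * T * dt) * rng.normal(size=x.shape)
    return -np.mean(data**2) / 2.0 + np.mean(x**2) / 2.0

rng = np.random.default_rng(5)
theta_true = 2.0
data = rng.normal(scale=1.0 / np.sqrt(theta_true), size=2000)

theta, eta = 1.0, 2.0
for _ in range(300):
    theta += eta * cd_gradient(data, theta, rng=rng)

print(theta)    # fluctuates around a value close to theta_true = 2
```

The early stop makes each gradient cheap and, because the chain starts at the data, the estimate is already informative after a short run; this is the step the accelerated dynamics is meant to speed up further.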


SLIDE 35

Our present study

Let us implement the accelerated Langevin dynamics in contrastive divergence. The speedup of contrastive divergence is essential in machine learning.



SLIDE 37

Preliminary result: simple Gaussian distribution

We assume that the generative model is
$$P(x|\theta) \propto \exp\left(-\frac{1}{2}x^{\mathrm{T}}Jx - h^{\mathrm{T}}x\right).$$
We have D = 1000 data points to infer the original $J$ and $h$, in a test on an extremely small system, N = 1, with γ = 5. A CD-1 step is defined as integration time t = 1 (dt = 0.01).

[Figure: log-likelihood function vs iteration step, comparing MCMC, CD-1, and vDBC-CD-1.]

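For the N = 1 Gaussian model, the log-likelihood monitored above is available in closed form, since $Z = \sqrt{2\pi/J}\,\exp\!\left(h^2/(2J)\right)$. A sketch of such a monitor (the function name and parameter values are ours, for illustration):

```python
import numpy as np

def gaussian_loglik(data, J, h):
    """Average log-likelihood of P(x) ∝ exp(-J x^2 / 2 - h x) for scalar x,
    using log Z = 0.5 * log(2 pi / J) + h^2 / (2 J)."""
    log_Z = 0.5 * np.log(2.0 * np.pi / J) + h**2 / (2.0 * J)
    return np.mean(-0.5 * J * data**2 - h * data) - log_Z

rng = np.random.default_rng(6)
data = rng.normal(size=100_000)        # samples from the model with J = 1, h = 0
ll = gaussian_loglik(data, 1.0, 0.0)
print(ll)                              # near -0.5 * log(2 pi) - 0.5 ≈ -1.419
```

Because the exact value is computable here, learning curves of MCMC-based and CD-based updates can be compared directly, as on the figure above.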

SLIDE 38

Preliminary result: simple Gaussian distribution

Same generative model $P(x|\theta) \propto \exp\left(-\frac{1}{2}x^{\mathrm{T}}Jx - h^{\mathrm{T}}x\right)$, with D = 1000 data points, N = 1, γ = 5, and a CD-1 step defined as integration time t = 1 (dt = 0.01).

[Figures: MSE of $J$ and MSE of $h$ vs iteration step, comparing MCMC, CD-1, and vDBC-CD-1.]


SLIDE 39

Summary of our present study

- We implement the accelerated stochastic dynamics in contrastive divergence, utilizing the violation of the detailed balance condition.
- We confirm its efficiency in terms of the log-likelihood function.
- The speedup of contrastive divergence is essential in machine learning.

