On the Fine-Tuning Parameters in Deep Boltzmann Machines Using Quaternions



SLIDE 1

Talk Outline Restricted Boltzmann Machines Harmony Search Quaternions Methodology and Experiments Conclusions and Futu

On the Fine-Tuning Parameters in Deep Boltzmann Machines Using Quaternions

João Paulo Papa

papa@fc.unesp.br

March 28, 2016

1 / 35

SLIDE 2

1. Restricted Boltzmann Machines
2. Harmony Search
3. Quaternions
4. Methodology and Experiments
5. Conclusions and Future Works

2 / 35

SLIDE 3

Talk Outline

1. Restricted Boltzmann Machines
2. Harmony Search
3. Quaternions
4. Methodology and Experiments
5. Conclusions and Future Works

3 / 35

SLIDE 4

Restricted Boltzmann Machines

Main concepts

RBMs are probabilistic models composed of two layers: a visible layer $\mathbf{v} \in \{0, 1\}^m$ (input) and a hidden layer $\mathbf{h} \in \{0, 1\}^n$, which are connected by a weight matrix $\mathbf{W}_{m \times n}$. Additionally, bias units are attached to each visible and hidden unit.

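[Figure: RBM with visible units v1, ..., vm (biases a1, ..., am) and hidden units h1, ..., hn (biases b1, ..., bn), connected across the two layers by the weights wij of W.]

4 / 35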

SLIDE 5

Restricted Boltzmann Machines

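Main concepts

The energy of an RBM is given by:

$$E(\mathbf{v}, \mathbf{h}) = -\sum_{i=1}^{m} \sum_{j=1}^{n} v_i h_j w_{ij} - \sum_{i=1}^{m} a_i v_i - \sum_{j=1}^{n} b_j h_j, \tag{1}$$

and the probability of a given configuration $(\mathbf{v}, \mathbf{h})$ is computed as follows:

$$P(\mathbf{v}, \mathbf{h}) = \frac{e^{-E(\mathbf{v}, \mathbf{h})}}{Z}, \tag{2}$$

where $Z$ is the so-called normalizing constant (partition function), given by:

$$Z = \sum_{\mathbf{v}, \mathbf{h}} e^{-E(\mathbf{v}, \mathbf{h})}. \tag{3}$$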

5 / 35

SLIDE 6

Restricted Boltzmann Machines

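Main concepts

The probability of a data point $\mathbf{v}$ (visible layer) is defined as follows:

$$P(\mathbf{v}) = \sum_{\mathbf{h}} P(\mathbf{v}, \mathbf{h}) = \frac{\sum_{\mathbf{h}} e^{-E(\mathbf{v}, \mathbf{h})}}{Z}. \tag{4}$$

Let $\mathcal{V} = \{\mathbf{v}^1, \mathbf{v}^2, \ldots, \mathbf{v}^M\}$ be a training set: in short, the RBM training algorithm aims at decreasing the energy of each training sample $\mathbf{v}^k \in \mathbb{R}^m$ in order to increase its probability.

[Figure: energy E over v, before and after a training step.]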
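Equations 1 to 4 can be checked numerically on a toy model. The following is a minimal sketch, assuming NumPy and a tiny RBM (3 visible and 2 hidden units) so that the partition function can be enumerated by brute force; the function names and the random parameters are illustrative and do not come from the slides.

```python
import itertools
import numpy as np

def energy(v, h, W, a, b):
    """Energy of a joint configuration (v, h), following Equation 1."""
    return -(v @ W @ h) - a @ v - b @ h

def all_binary_vectors(size):
    """Enumerate every vector in {0, 1}^size (only feasible for tiny sizes)."""
    return [np.array(bits, dtype=float)
            for bits in itertools.product([0, 1], repeat=size)]

def partition_function(W, a, b):
    """Brute-force Z: sum of exp(-E) over all 2^(m+n) configurations (Equation 3)."""
    m, n = W.shape
    return sum(np.exp(-energy(v, h, W, a, b))
               for v in all_binary_vectors(m)
               for h in all_binary_vectors(n))

def prob_visible(v, W, a, b):
    """P(v): the joint distribution summed over all hidden states (Equation 4)."""
    n = W.shape[1]
    unnormalized = sum(np.exp(-energy(v, h, W, a, b))
                       for h in all_binary_vectors(n))
    return unnormalized / partition_function(W, a, b)

# Hypothetical toy parameters, chosen only for illustration.
rng = np.random.default_rng(0)
m, n = 3, 2
W = rng.normal(scale=0.5, size=(m, n))
a = rng.normal(scale=0.5, size=m)
b = rng.normal(scale=0.5, size=n)

# P(v) must sum to one over all 2^m visible configurations.
total = sum(prob_visible(v, W, a, b) for v in all_binary_vectors(m))
print(f"sum of P(v) over all v = {total:.6f}")
```

The enumeration grows as 2^(m+n), which is why the exact value of Z is out of reach for realistic layer sizes.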

6 / 35

SLIDE 7

Restricted Boltzmann Machines

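Main concepts

The log-likelihood of the training data (using just one training point for the sake of simplicity) is given by:

$$\varphi = \log P(\mathbf{v}) = \varphi^{+} - \varphi^{-}, \tag{5}$$

where

$$\varphi^{+} = \log \sum_{\mathbf{h}} e^{-E(\mathbf{v}, \mathbf{h})} \tag{6}$$

and

$$\varphi^{-} = \log Z = \log \sum_{\mathbf{v}, \mathbf{h}} e^{-E(\mathbf{v}, \mathbf{h})}. \tag{7}$$

Now, the question is: how can we train an RBM?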

7 / 35

SLIDE 8

Restricted Boltzmann Machines

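Main concepts

The training step updates $\mathbf{W}$ in order to maximize the log-likelihood of the training data until a convergence criterion is met (usually a fixed number of iterations/epochs). Stochastic gradient ascent on $\varphi$ (equivalently, stochastic gradient descent on $-\varphi$) is usually employed for this purpose, i.e.:

$$\mathbf{W}^{t+1} \leftarrow \mathbf{W}^{t} + \eta \left( \frac{\partial \varphi^{+}}{\partial \mathbf{W}} - \frac{\partial \varphi^{-}}{\partial \mathbf{W}} \right), \tag{8}$$

where the positive gradient is easy to compute:

$$\frac{\partial \varphi^{+}}{\partial \mathbf{W}} = \mathbf{v}^{T} P(\mathbf{h}|\mathbf{v}). \tag{9}$$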

8 / 35

SLIDE 9

Restricted Boltzmann Machines

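Main concepts

The right-hand term of Equation 9 can be computed as follows:

$$P(\mathbf{h}|\mathbf{v}) = \prod_{j=1}^{n} P(h_j = 1|\mathbf{v}), \tag{10}$$

where

$$P(h_j = 1|\mathbf{v}) = \sigma\left( \sum_{i=1}^{m} w_{ij} v_i + b_j \right). \tag{11}$$

In this case, $\sigma(x) = 1/(1 + \exp(-x))$. However, the main problem concerns the negative gradient, which is given by:

$$\frac{\partial \varphi^{-}}{\partial \mathbf{W}} = \tilde{\mathbf{v}}^{T} P(\tilde{\mathbf{v}}|\mathbf{h}), \tag{12}$$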

9 / 35

SLIDE 10

Restricted Boltzmann Machines

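Main concepts

where $\tilde{\mathbf{v}}$ denotes the estimate (model) of the input data $\mathbf{v}$, and $P(\tilde{\mathbf{v}}|\mathbf{h})$ is given by:

$$P(\tilde{\mathbf{v}}|\mathbf{h}) = \prod_{i=1}^{m} P(\tilde{v}_i = 1|\mathbf{h}), \tag{13}$$

and

$$P(\tilde{v}_i = 1|\mathbf{h}) = \sigma\left( \sum_{j=1}^{n} w_{ij} h_j + a_i \right). \tag{14}$$

The problem is to obtain a proper approximation of the model term, i.e., $\partial \varphi^{-} / \partial \mathbf{W}$, which requires a large number of iterations to be computed.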
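The two conditionals in Equations 11 and 14 are plain logistic functions of a weighted sum, so each one is a single matrix-vector product. The sketch below assumes NumPy, a weight matrix `W` of shape (m, n) and bias vectors `a` (visible) and `b` (hidden); the function names and toy values are illustrative only.

```python
import numpy as np

def sigmoid(x):
    """Logistic function, sigma(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def prob_hidden_given_visible(v, W, b):
    """Vector of P(h_j = 1 | v) for j = 1..n (Equation 11)."""
    return sigmoid(v @ W + b)

def prob_visible_given_hidden(h, W, a):
    """Vector of P(v_i = 1 | h) for i = 1..m (Equation 14)."""
    return sigmoid(W @ h + a)

# Hypothetical toy model: 4 visible units, 3 hidden units.
rng = np.random.default_rng(1)
W = rng.normal(scale=0.1, size=(4, 3))
a = np.zeros(4)
b = np.zeros(3)
v = np.array([1.0, 0.0, 1.0, 1.0])

p_h = prob_hidden_given_visible(v, W, b)
h = (rng.random(p_h.shape) < p_h).astype(float)   # sample a binary hidden state
p_v = prob_visible_given_hidden(h, W, a)

# Positive gradient of Equation 9 as an (m, n) outer product.
positive_gradient = np.outer(v, p_h)
print(p_h.shape, p_v.shape, positive_gradient.shape)
```

Because units within a layer are conditionally independent given the other layer, a whole layer can be sampled in one vectorized step.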

10 / 35

SLIDE 11

Restricted Boltzmann Machines

Main concepts

The task of estimating a conditional probability is usually modelled by means of the Markov Chain Monte Carlo (MCMC) approach, which treats each step towards the approximation of the real data as a step in a Markov chain. A Markov chain is basically a directed and weighted graph that obeys some properties (Ergodic Theorem):

[Figure: a Markov chain drawn as a directed, weighted graph; the transition probabilities shown are 0.1, 0.9, 0.7 and 0.3.]

11 / 35

SLIDE 12

Restricted Boltzmann Machines

Main concepts

One of the most famous approaches for sampling in Markov chains is the so-called Gibbs sampling, which approaches the likelihood solution when $k \to \infty$, where $k$ is the number of iterations.

[Figure: Gibbs sampling chain alternating between the conditionals P(B|A) and P(A|B).]

12 / 35

SLIDE 13

Restricted Boltzmann Machines

Main concepts

How can we use Gibbs sampling for RBMs? Let's say we have a Markov chain $C = \{\mathbf{v}, \tilde{\mathbf{v}}_1, \tilde{\mathbf{v}}_2, \ldots, \tilde{\mathbf{v}}_k\}$ composed of the input data (initial state) $\mathbf{v}$ and its reconstruction at time step $t$, given by $\tilde{\mathbf{v}}_t$.

[Figure: Gibbs chain running from random data to the model approximation, alternating between sampling h from P(h|v) and v from P(v|h) for k steps.]

Problem? (High computational burden, since we need $k \to \infty$)

13 / 35

SLIDE 14

Restricted Boltzmann Machines

Main concepts

Hinton (2002) [a] proposed Contrastive Divergence (CD), which alleviates the computational burden of Gibbs sampling.

[Figure: the same alternating chain of P(h|v) and P(v|h) steps, now started from the training data and truncated after a small number of steps k.]

Usually, k = 1. Problem? (Estimated models tend to stay close to training samples)

[a] Hinton, G. E. “Training products of experts by minimizing contrastive divergence”, Neural Computation, 14(8), 1771-1800, 2002.
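A CD-k weight update can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the presenter's implementation: it assumes NumPy, a single binary training vector, and the commonly used negative-phase statistic (the outer product of the reconstruction with the hidden probabilities it induces). The biases are used for sampling but are not updated, since the slides only give the update rule for W; `cd_k_update` and all toy values are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd_k_update(v0, W, a, b, rng, k=1, eta=0.1):
    """One CD-k update of W for a single binary training vector v0 (k >= 1)."""
    ph0 = sigmoid(v0 @ W + b)                        # P(h | v): positive phase
    h = (rng.random(ph0.shape) < ph0).astype(float)  # sample hidden state
    for _ in range(k):                               # k Gibbs steps
        pv = sigmoid(W @ h + a)                      # P(v~ | h)
        v = (rng.random(pv.shape) < pv).astype(float)
        ph = sigmoid(v @ W + b)                      # P(h~ | v~)
        h = (rng.random(ph.shape) < ph).astype(float)
    positive = np.outer(v0, ph0)                     # data-driven statistics
    negative = np.outer(v, ph)                       # model-driven statistics (assumed form)
    return W + eta * (positive - negative)

# Hypothetical toy example: 4 visible units, 3 hidden units, zero initial weights.
rng = np.random.default_rng(0)
v0 = np.array([1.0, 0.0, 1.0, 0.0])
W0 = np.zeros((4, 3))
a = np.zeros(4)
b = np.zeros(3)
W_new = cd_k_update(v0, W0, a, b, rng, k=1, eta=0.1)
print(W_new)
```

Starting the chain at the training vector is what keeps k small, and it is also why the estimated model tends to stay close to the training samples.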

14 / 35

SLIDE 15

Restricted Boltzmann Machines

Main concepts

After that, we have two main variations of CD:

- Persistent Contrastive Divergence (PCD) [a]
- Fast Persistent Contrastive Divergence (FPCD) [b]

[a] Tieleman T. “Training Restricted Boltzmann Machines using Approximations to the Likelihood Gradient”, Proceedings of the 25th Annual International Conference on Machine Learning, 1064-1071, 2008.

[b] Tieleman T., Hinton G. E. “Using Fast Weights to Improve Persistent Contrastive Divergence”, Proceedings of the 26th Annual International Conference on Machine Learning, 1033-1040, 2009.

15 / 35

SLIDE 16

Deep Belief Networks

Main concepts

RBMs stacked on top of each other and trained greedily.

[Figure: DBN with visible layer v and hidden layers h1, h2, ..., hL connected by the weight matrices W1, W2, ..., WL.]

16 / 35

SLIDE 17

Deep Boltzmann Machines

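Main concepts

- Inference in an intermediate layer depends on both the layer below and the layer above.
- It usually works better than DBNs.

$$P(h^{1}_{j} = 1|\mathbf{v}, \mathbf{h}^{2}) = \varphi\left( \sum_{i=1}^{m^{1}} w^{1}_{ij} v_i + \sum_{z=1}^{n^{2}} w^{2}_{jz} h^{2}_{z} \right), \tag{15}$$

where $\varphi(\cdot)$ here denotes the logistic-sigmoid function (written $\sigma(\cdot)$ in Equation 11).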

17 / 35

SLIDE 18

Talk Outline

1. Restricted Boltzmann Machines
2. Harmony Search
3. Quaternions
4. Methodology and Experiments
5. Conclusions and Future Works

18 / 35

SLIDE 19

Harmony Search

Main concepts

Harmony Search (HS) is a meta-heuristic algorithm inspired by the improvisation process of music players. Each possible solution is modelled as a harmony, and each musician corresponds to one decision variable.

Let $\boldsymbol{\phi} = (\phi_1, \phi_2, \ldots, \phi_N)$ be a set of harmonies that compose the so-called “Harmony Memory”, such that $\phi_i \in \mathbb{R}^M$.

After each iteration, the HS algorithm generates a new harmony vector $\hat{\phi}$ based on memory considerations, pitch adjustments, and randomization (music improvisation).

The new harmony vector $\hat{\phi}$ is then evaluated in order to be accepted into the harmony memory: if $\hat{\phi}$ is better than the worst harmony, the latter is replaced by the new harmony.

19 / 35

SLIDE 20

Harmony Search

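Harmony Memory Considering Rate

$$\hat{\phi}^{j} = \begin{cases} \phi^{j}_{A} & \text{with probability HMCR} \\ \theta \in \Phi_j & \text{with probability } (1 - \text{HMCR}), \end{cases} \tag{16}$$

where $A \sim \mathcal{U}(1, 2, \ldots, N)$, and $\Phi = \{\Phi_1, \Phi_2, \ldots, \Phi_M\}$ stands for the set of feasible values for each decision variable.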

20 / 35

SLIDE 21

Harmony Search

Pitch Adjusting Rate

Further, every component $j$ of the new harmony vector $\hat{\phi}$ is examined to determine whether it should be pitch-adjusted or not, which is controlled by the Pitch Adjusting Rate (PAR) variable, according to Equation 17:

$$\hat{\phi}^{j} = \begin{cases} \hat{\phi}^{j} \pm \varphi_j \varrho & \text{with probability PAR} \\ \hat{\phi}^{j} & \text{with probability } (1 - \text{PAR}). \end{cases} \tag{17}$$
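The improvisation and replacement steps of Harmony Search can be sketched as below. This is an illustrative sketch only, with several assumptions that are not stated on the slides: NumPy is used, "better" means a lower objective value, feasible values form a box [lower, upper], the random factor in Equation 17 is drawn uniformly from [0, 1] with a random sign, and adjusted values are clipped back into the box. The names `improvise`, `harmony_search` and `sphere` are hypothetical.

```python
import numpy as np

def improvise(memory, lower, upper, hmcr, par, bandwidth, rng):
    """Create one new harmony from an (N, M) harmony memory."""
    n_harmonies, n_vars = memory.shape
    new = np.empty(n_vars)
    for j in range(n_vars):
        if rng.random() < hmcr:
            # Memory consideration (Equation 16): reuse variable j of a random harmony.
            new[j] = memory[rng.integers(n_harmonies), j]
        else:
            # Randomization: draw a fresh value from the feasible range.
            new[j] = rng.uniform(lower[j], upper[j])
        if rng.random() < par:
            # Pitch adjustment (Equation 17): shift by +/- a random fraction of the bandwidth.
            new[j] += rng.uniform(-1.0, 1.0) * bandwidth
    return np.clip(new, lower, upper)  # assumption: keep the harmony feasible

def harmony_search(objective, lower, upper, rng, memory_size=10,
                   iterations=200, hmcr=0.7, par=0.5, bandwidth=0.1):
    """Minimize `objective` over the box [lower, upper]."""
    memory = rng.uniform(lower, upper, size=(memory_size, lower.size))
    fitness = np.array([objective(x) for x in memory])
    for _ in range(iterations):
        candidate = improvise(memory, lower, upper, hmcr, par, bandwidth, rng)
        value = objective(candidate)
        worst = np.argmax(fitness)
        if value < fitness[worst]:
            # The new harmony replaces the worst one stored in memory.
            memory[worst] = candidate
            fitness[worst] = value
    best = np.argmin(fitness)
    return memory[best], fitness[best]

def sphere(x):
    """Toy objective used only for this demonstration."""
    return float(np.sum(x ** 2))

lower = np.full(2, -5.0)
upper = np.full(2, 5.0)
best, best_value = harmony_search(sphere, lower, upper, np.random.default_rng(42))
print(best, best_value)
```

In the fine-tuning setting of this talk, the objective would be a measure of model quality for a given set of hyperparameters instead of the toy `sphere` function.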

21 / 35

SLIDE 22

Harmony Search

Improved Harmony Search

$$\text{PAR}^{t} = \text{PAR}_{min} + \frac{\text{PAR}_{max} - \text{PAR}_{min}}{T}\, t, \tag{18}$$

where $T$ stands for the number of iterations, and $\text{PAR}_{min}$ and $\text{PAR}_{max}$ denote the minimum and maximum PAR values, respectively. The bandwidth value at time step $t$ is computed as follows:

$$\varrho^{t} = \varrho_{max} \exp\left( \frac{\ln(\varrho_{min} / \varrho_{max})}{T}\, t \right), \tag{19}$$

where $\varrho_{min}$ and $\varrho_{max}$ stand for the minimum and maximum values of $\varrho$, respectively.
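Equations 18 and 19 are simple schedules: PAR grows linearly from its minimum to its maximum, while the bandwidth decays exponentially from its maximum to its minimum. A small sketch using only the Python standard library; the function names and the numeric bounds are illustrative assumptions.

```python
import math

def par_at(t, T, par_min, par_max):
    """Pitch Adjusting Rate at iteration t (Equation 18): linear increase."""
    return par_min + (par_max - par_min) / T * t

def bandwidth_at(t, T, bw_min, bw_max):
    """Bandwidth at iteration t (Equation 19): exponential decay."""
    return bw_max * math.exp(math.log(bw_min / bw_max) / T * t)

# Hypothetical settings, chosen only to show the shape of the schedules.
T = 100
for t in (0, 50, 100):
    print(t, par_at(t, T, 0.1, 0.9), bandwidth_at(t, T, 0.01, 1.0))
```

Early iterations therefore take large, infrequent pitch adjustments (exploration), and late iterations take small, frequent ones (exploitation).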

22 / 35

SLIDE 23

Talk Outline

1. Restricted Boltzmann Machines
2. Harmony Search
3. Quaternions
4. Methodology and Experiments
5. Conclusions and Future Works

23 / 35

SLIDE 24

Quaternions

Main concepts

- A number system that extends the complex numbers (William Hamilton, 1843).
- Widely used to perform calculations involving three-dimensional rotations (aircraft navigation control and computer vision).
- A quaternion $q$ is composed of one real and three imaginary components, i.e., $q = x_0 + x_1 i + x_2 j + x_3 k$, where $x_0, x_1, x_2, x_3 \in \mathbb{R}$ and $i, j, k$ are imaginary units that satisfy $i^2 = j^2 = k^2 = ijk = -1$.

24 / 35

SLIDE 25

Quaternions

Basic operations

Given two quaternions $q_1 = x_0 + x_1 i + x_2 j + x_3 k$ and $q_2 = y_0 + y_1 i + y_2 j + y_3 k$, the quaternion algebra defines a set of main operations. The addition, for instance, can be defined by:

$$q_1 + q_2 = (x_0 + x_1 i + x_2 j + x_3 k) + (y_0 + y_1 i + y_2 j + y_3 k) = (x_0 + y_0) + (x_1 + y_1) i + (x_2 + y_2) j + (x_3 + y_3) k, \tag{20}$$

while the subtraction is defined as follows:

$$q_1 - q_2 = (x_0 + x_1 i + x_2 j + x_3 k) - (y_0 + y_1 i + y_2 j + y_3 k) = (x_0 - y_0) + (x_1 - y_1) i + (x_2 - y_2) j + (x_3 - y_3) k. \tag{21}$$

25 / 35

SLIDE 26

Quaternions

Basic operations

Another important operation is the norm, which maps a given quaternion to a real-valued number, as follows:

$$N(q_1) = N(x_0 + x_1 i + x_2 j + x_3 k) = \sqrt{x_0^2 + x_1^2 + x_2^2 + x_3^2}. \tag{22}$$
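The three operations in Equations 20 to 22 act component-wise, which a small value class makes explicit. This is a minimal sketch using only the Python standard library; the `Quaternion` class is an illustrative helper, not code from the talk, and it implements only the operations shown on these slides.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class Quaternion:
    """q = x0 + x1*i + x2*j + x3*k with real coefficients."""
    x0: float
    x1: float
    x2: float
    x3: float

    def __add__(self, other):
        # Equation 20: component-wise addition.
        return Quaternion(self.x0 + other.x0, self.x1 + other.x1,
                          self.x2 + other.x2, self.x3 + other.x3)

    def __sub__(self, other):
        # Equation 21: component-wise subtraction.
        return Quaternion(self.x0 - other.x0, self.x1 - other.x1,
                          self.x2 - other.x2, self.x3 - other.x3)

    def norm(self):
        # Equation 22: maps the quaternion to a single real value.
        return math.sqrt(self.x0 ** 2 + self.x1 ** 2 + self.x2 ** 2 + self.x3 ** 2)

q1 = Quaternion(1.0, 2.0, 2.0, 4.0)
q2 = Quaternion(0.5, 0.5, 0.5, 0.5)
print(q1 + q2)
print(q1 - q2)
print(q1.norm())
```

The norm is the operation that turns a four-component quaternion back into one real number, which is what makes it possible to use a quaternion in place of a single real-valued decision variable.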

26 / 35

SLIDE 27

Talk Outline

1. Restricted Boltzmann Machines
2. Harmony Search
3. Quaternions
4. Methodology and Experiments
5. Conclusions and Future Works

27 / 35

SLIDE 28

Methodology

Fine-Tuning RBMs/DBNs/DBMs

[Figure: Models: (a) standard [a] and (b) quaternion-based [b]. In (a), each of the layers 1, 2, ..., L (v, h1, h2, ..., hL) contributes the decision variables n, η, λ and ϕ. In (b), every decision variable is represented by a quaternion (q0, q1, ..., q4L−1), each with components x0, x1, x2 and x3.]

[a] Papa J.P. “Fine-tuning Deep Belief Networks using Harmony Search”, Applied Soft Computing, (in press) 2015.

[b] Papa J.P. “Quaternion-driven Deep Belief Networks Fine-Tuning”, Pattern Recognition Letters, (submitted) 2016.

28 / 35

SLIDE 29

Methodology

Task: Binary Image Reconstruction

Figure: Some training examples from (a) MNIST, (b) CalTech 101 Silhouettes and (c) Semeion datasets.

29 / 35

SLIDE 30

Experiments

Task: Binary Image Reconstruction

Table: Average MSE values considering MNIST dataset.

Method     1L-CD    1L-PCD   1L-FPCD  2L-CD    2L-PCD   2L-FPCD  3L-CD    3L-PCD   3L-FPCD
HS         0.1059   0.1325   0.1324   0.1059   0.1061   0.1057   0.1059   0.1058   0.1057
IHS        0.0903   0.0879   0.0882   0.0885   0.0886   0.0886   0.0887   0.0885   0.0886
GHS        0.1063   0.1062   0.1063   0.1061   0.1063   0.1061   0.1063   0.1065   0.1062
NGHS       0.1066   0.1066   0.1063   0.1065   0.1062   0.1062   0.1069   0.1064   0.1062
SGHS       0.1067   0.1067   0.1062   0.1072   0.1066   0.1063   0.1068   0.1065   0.1064
PSF-HS     0.1005   0.1006   0.0998   0.1032   0.0976   0.1007   0.0992   0.0995   0.0998
RS         0.1105   0.1101   0.1102   0.1105   0.1101   0.1096   0.1108   0.1099   0.1096
Hyper-RS   0.1062   0.1062   0.1060   0.1062   0.1062   0.1060   0.1062   0.1061   0.1062
Hyper-TPE  0.1059   0.1059   0.1058   0.1059   0.1059   0.1057   0.1050   0.1051   0.1051
QHS        0.0876   0.0876   0.0899   0.0876   0.0876   0.0901   0.0876   0.0876   0.0918
QIHS       0.0876   0.0876   0.0888   0.0876   0.0876   0.0882   0.0888   0.0876   0.0888

30 / 35

SLIDE 31

Experiments

Task: Binary Image Reconstruction

Table: Average MSE values considering Semeion Handwritten Digit dataset.

Method     1L-CD    1L-PCD   1L-FPCD  2L-CD    2L-PCD   2L-FPCD  3L-CD    3L-PCD   3L-FPCD
HS         0.2128   0.2128   0.2129   0.2202   0.2128   0.2128   0.2199   0.2128   0.2128
IHS        0.2131   0.2130   0.2128   0.2116   0.2114   0.2121   0.2103   0.2109   0.2119
GHS        0.2133   0.2129   0.2128   0.2129   0.2130   0.2129   0.2129   0.2129   0.2128
NGHS       0.2134   0.2132   0.2131   0.2130   0.2131   0.2129   0.2131   0.2132   0.2130
SGHS       0.2135   0.2131   0.2130   0.2131   0.2131   0.2130   0.2132   0.2132   0.2130
PSF-HS     0.2137   0.2130   0.2130   0.2121   0.2120   0.2124   0.2120   0.2120   0.2121
RS         0.2146   0.2143   0.2145   0.2146   0.2144   0.2139   0.2143   0.2140   0.2140
Hyper-RS   0.2127   0.2129   0.2129   0.2129   0.2129   0.2129   0.2129   0.2129   0.2128
Hyper-TPE  0.2128   0.2128   0.2128   0.2128   0.2128   0.2127   0.2128   0.2128   0.2128
QHS        0.2095   0.2096   0.2143   0.2096   0.2096   0.2142   0.2096   0.2096   0.2170
QIHS       0.2096   0.2096   0.2096   0.2096   0.2159   0.1624   0.2096   0.2096   0.2132

31 / 35

SLIDE 32

Experiments

Task: Binary Image Reconstruction

[Plot: log PL (y-axis) versus iteration (x-axis, 1 to 10), with one curve each for HS, IHS, QHS and QIHS.]

Figure: Logarithm of the pseudo-likelihood over MNIST dataset considering HS, IHS, QHS and QIHS optimization techniques.

32 / 35

SLIDE 33

Experiments

Task: Binary Image Reconstruction by means of DBMs.

Model    PSO-1L   PSO-2L   PSO-3L   HS-1L    HS-2L    HS-3L    IHS-1L   IHS-2L   IHS-3L   RS-1L    RS-2L    RS-3L
CD-DBM   0.19410  0.20956  0.20959  0.19086  0.20958  0.20958  0.19025  0.20956  0.20958  0.19458  0.20960  0.20960
PCD-DBM  0.19130  0.20959  0.20959  0.19048  0.20959  0.20959  0.19078  0.20956  0.20958  0.19463  0.20957  0.20959
CD-DBN   0.19849  0.20963  0.20959  0.20959  0.20960  0.20959  0.19359  0.20961  0.20961  0.19710  0.20962  0.20960
PCD-DBN  0.20309  0.20961  0.20959  0.20015  0.20959  0.20961  0.20009  0.20961  0.20963  0.20361  0.20959  0.20960

Table: Average MSE over the test set for the Semeion Handwritten Digit dataset, considering DBMs [a].

[a] Passos-Júnior “A Meta-Heuristic-driven Approach to Fine-Tune Deep Boltzmann Machines”, International Conference on Pattern Recognition, (submitted) 2016.

33 / 35

SLIDE 34

Talk Outline

1. Restricted Boltzmann Machines
2. Harmony Search
3. Quaternions
4. Methodology and Experiments
5. Conclusions and Future Works

34 / 35

SLIDE 35

Conclusions and Future Works

Conclusions

- Meta-heuristics seem to be useful to fine-tune RBMs, DBNs and DBMs.
- One needs less computational effort.
- Easier than hand-tuning.

Future Works

- To employ quaternion-based optimization considering DBMs.
- To evaluate quaternion spaces with other meta-heuristics.
- To combine different spaces during the optimization.
- To improve chain mixing using meta-heuristics.

35 / 35