SLIDE 1

Neural Probabilistic Models for Melody Prediction, Sequence Labelling and Classification

Srikanth Cherla https://cherla.org

September 13, 2017

SLIDE 2

Outline

1. Introduction: Analysis of Sequences in Music
2. Preliminaries: Restricted Boltzmann Machines, etc.
3. Contribution: The Recurrent Temporal Discriminative RBM
4. Extension: Generalising the RTDRBM
5. Contribution: Generalising the DRBM

SLIDE 3

Next

1. Introduction: Analysis of Sequences in Music
2. Preliminaries: Restricted Boltzmann Machines, etc.
3. Contribution: The Recurrent Temporal Discriminative RBM
4. Extension: Generalising the RTDRBM
5. Contribution: Generalising the DRBM

SLIDE 4

Sequences in Notated Music

  • A wealth of information in notated music
  • Increasingly available
    • in different formats (MIDI, Kern, GP4, etc.)
    • for different kinds of music (classical, rock, pop, etc.)
  • Analysis of sequences key to extracting information
  • Melody — Good starting point for a broader analysis

SLIDE 5

Relevance

Scientific:

  • Computational musicology
  • Organizing music data
  • Aiding acoustic models
  • Music education

Creative:

  • Automatic music generation
  • Compositional assistance

SLIDE 6

Task: Melody Prediction

  • Model a series of musical events s_1^T as follows:

      p(s_1^T) = \prod_{t=1}^{T} p(s_t \mid s_{t-n+1}^{t-1})

  • Conditional probabilities learned from a corpus
  • Information-theoretic measure, cross entropy, to measure a trained model's prediction uncertainty:

      H(p, p_m) = -\sum_{t=1}^{T} p(w_t \mid w_{t-n+1}^{t-1}) \log_2 p_m(w_t \mid w_{t-n+1}^{t-1})

  • How well does a model p_m approximate p?
  • Cross entropy to be minimized
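For illustration, the per-event cross entropy can be estimated as in the minimal sketch below; `predict_proba` is a hypothetical interface standing in for whatever trained model p_m is being evaluated, and is not part of the original work.

```python
import numpy as np

def mean_cross_entropy(melody, predict_proba, n=5):
    """Cross entropy (bits per event) of a predictive model over one melody.
    `melody` is a list of event symbols; `predict_proba(context, event)` is an
    assumed interface returning p_m(event | previous n-1 events)."""
    bits = 0.0
    for t, event in enumerate(melody):
        context = tuple(melody[max(0, t - n + 1):t])
        bits -= np.log2(predict_proba(context, event))
    return bits / len(melody)
```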

SLIDE 7

Motivating Distributed Models

  • Previous work focused on n-gram models
  • No comparative results with other prediction models
  • Thriving neural networks research (Bengio, 2009)
  • Recent success of neural network language models (Bengio, 2003; Collobert et al., 2011; Mikolov et al., 2010)

Start with an evaluation of connectionist models on the melody prediction task.

SLIDE 8

Next

1. Introduction: Analysis of Sequences in Music
2. Preliminaries: Restricted Boltzmann Machines, etc.
3. Contribution: The Recurrent Temporal Discriminative RBM
4. Extension: Generalising the RTDRBM
5. Contribution: Generalising the DRBM

SLIDE 9

Restricted Boltzmann Machine (Smolensky, 1986)

  • Generative, energy-based graphical model.
  • Data v in visible layer, features h in hidden layer.
  • Can model the joint probability p(v) of the data as

      p(v) = \frac{\exp(-\mathrm{FreeEnergy}(v))}{\sum_{v^*} \exp(-\mathrm{FreeEnergy}(v^*))}

    where FreeEnergy(v) = -\log \sum_h \exp(-\mathrm{Energy}(v, h)).

  • Learned using Contrastive Divergence (Hinton, 2002).

[Figure: RBM with visible layer v (the melodic context s_{t-n+1:t}), hidden layer h, and weights W.]
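As a concrete reference, here is a minimal numpy sketch of the free energy of a binary RBM; the parameter names (W for weights, b and c for visible and hidden biases) are assumptions made for illustration.

```python
import numpy as np

def rbm_free_energy(v, W, b, c):
    """FreeEnergy(v) = -b.v - sum_j log(1 + exp(c_j + (vW)_j)) for a binary RBM.
    v: visible vector, W: (n_vis, n_hid) weights, b: visible biases, c: hidden biases."""
    return -v @ b - np.sum(np.logaddexp(0.0, v @ W + c))
```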

SLIDE 10

Discriminative RBM (Larochelle & Bengio, 2008)

  • Discriminative classifier based on the RBM.
  • Data x and class-label y in visible layer.
  • Can model the conditional probability p(y|x) as

      p(y \mid x) = \frac{\exp(-\mathrm{FreeEnergy}(x, y))}{\sum_{y^*} \exp(-\mathrm{FreeEnergy}(x, y^*))}

  • Exact gradient computation is possible.

[Figure: DRBM with input x = s_{t-n+1:t-1}, label y = s_t, hidden layer h, and weights V and U.]
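The conditional above can be evaluated exactly by enumerating the classes. A minimal sketch, assuming binary hidden units and the parameter names V (input weights), U (label weights), b_y (class biases) and c (hidden biases):

```python
import numpy as np

def drbm_conditional(x, V, U, b_y, c):
    """Exact p(y|x) of a DRBM with binary hidden units, using
    -FreeEnergy(x, y) = b_y + sum_j log(1 + exp(c_j + (xV)_j + U_{y,j})).
    V: (n_vis, n_hid), U: (n_classes, n_hid), b_y: (n_classes,), c: (n_hid,)."""
    neg_free_energy = b_y + np.sum(np.logaddexp(0.0, c + x @ V + U), axis=1)
    neg_free_energy -= neg_free_energy.max()   # numerical stability
    p = np.exp(neg_free_energy)
    return p / p.sum()
```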

SLIDE 11

Recurrent Temporal RBM (Sutskever et al., 2009)

  • Generative model for high-dimensional time-series.
  • RBM at time t conditioned on the mean-field hidden state \hat{h}^{(t-1)}.
  • Models the joint probability of a sequence as

      p(v^{(1:T)}, h^{(1:T)}) = \prod_t p(v^{(t)} \mid \hat{h}^{(t-1)}) \; p(h^{(t)} \mid v^{(t)}, \hat{h}^{(t-1)})

  • Learned using Contrastive Divergence and BPTT.

[Figure: RTRBM unrolled in time, with hidden states h(0), h(1), h(2), ..., visible vectors v(1), v(2), ..., weights W and Whh, and per-step biases b(t), c(t).]
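The conditioning works through a deterministic mean-field recurrence over the hidden states. The sketch below shows only that recurrence (parameter names and shapes are assumptions), not the full CD/BPTT training procedure.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def rtrbm_hidden_states(V, W, Whh, c, h0):
    """Mean-field recurrence h_hat(t) = sigmoid(v(t).W + h_hat(t-1).Whh + c).
    V: (T, n_vis) visible vectors, W: (n_vis, n_hid), Whh: (n_hid, n_hid),
    c: (n_hid,) hidden bias, h0: initial hidden state. Returns the T states."""
    h, states = h0, []
    for v in V:
        h = sigmoid(v @ W + h @ Whh + c)
        states.append(h)
    return np.array(states)
```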

SLIDE 12

Next

1. Introduction: Analysis of Sequences in Music
2. Preliminaries: Restricted Boltzmann Machines, etc.
3. Contribution: The Recurrent Temporal Discriminative RBM
4. Extension: Generalising the RTDRBM
5. Contribution: Generalising the DRBM

SLIDE 13

Motivation

  • Discriminative inference on generative RTRBM
  • Possible to carry out discriminative learning
  • Previous work suggested potential improvements

SLIDE 14

Discriminative Learning in the RTRBM (Cherla et al., 2015)

Extend DRBM learning to a recurrent model:

    p(y^{(t)} \mid x^{(1:t)}) = p(y^{(t)} \mid x^{(t)}, \hat{h}^{(t-1)})
                              = \frac{\exp(-\mathrm{FreeEnergy}(x^{(t)}, y^{(t)}))}{\sum_{y^*} \exp(-\mathrm{FreeEnergy}(x^{(t)}, y^*))}

[Figure: RTDRBM unrolled in time, with hidden states h(0), h(1), h(2), ..., inputs x(1), x(2), ..., labels y(1), y(2), ..., weights W (input), U (label), Whh (recurrent), and per-step biases b(t), c(t).]

SLIDE 15

Discriminative Learning in the RTRBM (Cherla et al., 2015)

Apply to an entire sequence to optimize the log-likelihood:

    O = \log p(y^{(1:T)} \mid x^{(1:T)}) = \sum_{t=1}^{T} \log p(y^{(t)} \mid x^{(t)}, \hat{h}^{(t-1)})

[Figure: RTDRBM unrolled in time, as on the previous slide.]
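The objective is accumulated step by step along the sequence. A minimal sketch, where `predict_proba(x, h)` and `step(x, y, h)` are hypothetical interfaces for the per-step class distribution and the hidden-state update:

```python
import numpy as np

def sequence_log_likelihood(X, Y, predict_proba, step, h0):
    """O = sum_t log p(y(t) | x(t), h_hat(t-1)) for one labelled sequence.
    X: (T, n_vis) inputs, Y: (T,) integer labels."""
    h, O = h0, 0.0
    for x, y in zip(X, Y):
        O += np.log(predict_proba(x, h)[y])   # log-probability of the true label
        h = step(x, y, h)                     # update the mean-field hidden state
    return O
```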

SLIDE 16

Discriminative Learning in the RTRBM (Cherla et al., 2015)

  • Recurrent extension of the DRBM.
  • Identical in structure to the RTRBM.
  • Exact gradient of cost computable at each time-step.
  • Back-Propagation Through Time for sequence learning.

[Figure: RTDRBM unrolled in time, as on the previous slides.]

SLIDE 17

Experiments: Melody Corpus

Corpus

  • As used in (Pearce & Wiggins, 2004).
  • A collection of 8 datasets.
  • Folk songs from the Essen Folk Song Collection.
  • Chorale melodies.

Dataset                    No. events   |χ|
Yugoslavian folk songs           2691    25
Alsatian folk songs              4496    32
Swiss folk songs                 4586    34
Austrian folk songs              5306    35
German folk songs                8393    27
Canadian folk songs              8553    25
Chorale melodies                 9227    21
Chinese folk songs              11056    41

SLIDE 18

Experiments: Melody Corpus

Models

  • Non-recurrent: n-grams (b), n-grams (u), FNN, RBMs, DRBMs with context length ∈ {1, 2, 3, 4, 5, 6, 7, 8}.
  • Recurrent: RNN, RTRBM, RTDRBM over entire sequences.

  • Hidden units ∈ {25, 50, 100, 200}
  • Learning rate ∈ {0.01, 0.05}
  • Trained for 500 epochs.
  • Best model determined over a validation set.

Evaluation criterion — cross-entropy:

    H_c(p_{mod}, D_{test}) = -\frac{1}{|D_{test}|} \sum_{s_1^n \in D_{test}} \log_2 p_{mod}(s_n \mid s_1^{n-1})

SLIDE 19

Results

[Plot: cross entropy vs. context length (1–8) for n-gram (b), n-gram (u), FNN, RBM, DRBM, RNN, RTRBM and RTDRBM.]

In general, performance improves with context length.

SLIDE 20

Results

[Plot: cross entropy vs. context length, as on Slide 19.]

n-gram model performance worsens at lower context length.

SLIDE 21

Results

[Plot: cross entropy vs. context length, as on Slide 19.]

Non-recurrent connectionist models outperform n-grams.

SLIDE 22

Results

[Plot: cross entropy vs. context length, as on Slide 19.]

Recurrent connectionist models outperform non-recurrent.

SLIDE 23

Results

[Plot: cross entropy vs. context length, as on Slide 19.]

RTDRBM outperforms RTRBM.

SLIDE 24

Results

[Plot: cross entropy vs. context length, as on Slide 19.]

With a shorter context: DRBM outperforms RBM.

SLIDE 25

Results

[Plot: cross entropy vs. context length, as on Slide 19.]

With a longer context: RBM outperforms DRBM.

SLIDE 26

Results

[Plot: cross entropy vs. context length, as on Slide 19.]

More details and discussion available in the paper.

SLIDE 27

Next

1. Introduction: Analysis of Sequences in Music
2. Preliminaries: Restricted Boltzmann Machines, etc.
3. Contribution: The Recurrent Temporal Discriminative RBM
4. Extension: Generalising the RTDRBM
5. Contribution: Generalising the DRBM

SLIDE 28

Motivation

[Figure: RTDRBM unrolled in time, as in the previous section.]

    \hat{h}^{(t-1)} = \sigma(W x^{(t-1)} + U y^{(t-1)} + c^{(t-1)})
                    = \sigma(W x^{(t-1)} + U y^{(t-1)} + W_{hh} \hat{h}^{(t-2)} + c)

Limitation: the dependence of h^{(t)} on the true label y^{*(t-1)}, which is not suitable for general sequence-labelling problems.

SLIDE 29

Motivation

[Figure: RTDRBM unrolled in time, as on the previous slide.]

    \hat{h}^{(t-1)} = \sigma(W x^{(t-1)} + U y^{(t-1)} + c^{(t-1)})
                    = \sigma(W x^{(t-1)} + U y^{(t-1)} + W_{hh} \hat{h}^{(t-2)} + c)

Solution: replace y^{*(t-1)} (unavailable at test time) with the predicted output y^{(t-1)} of the previous time-step.
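At test time this amounts to greedy decoding, feeding each predicted label back into the recurrence. A rough sketch under the same assumed parameter names as before (`predict_proba` is again a placeholder per-step classifier):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def label_sequence(X, W, U, Whh, c, h0, predict_proba):
    """Greedy sequence labelling: the previous *predicted* label (one-hot)
    replaces the unavailable true label in the hidden-state update."""
    n_classes = U.shape[0]
    h, labels = h0, []
    for x in X:
        y_hat = int(np.argmax(predict_proba(x, h)))   # predict with current hidden state
        labels.append(y_hat)
        y_onehot = np.eye(n_classes)[y_hat]           # feed the prediction back
        h = sigmoid(x @ W + y_onehot @ U + h @ Whh + c)
    return labels
```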

SLIDE 30

Experiments: OCR

Dataset (Taskar et al., 2004)

  • 6,877 English sentences with 52,152 words
  • Each character a 16 × 8 binary image
  • ASCII code label for each image (26 categories)
  • 10 cross-validation folds, one hold-out test set

Method

  • Grid search over model hyperparameters
  • 10-fold cross validation during model selection
  • Models trained over entire sentences

Evaluation: Average Loss Per Sequence

    E(y, y^*) = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{L_i} \sum_{j=1}^{L_i} I\big[(y_i)_j \neq (y_i^*)_j\big]    (1)

SLIDE 31

Experiments: OCR

Baseline Models (Nguyen & Guo, 2007)

  • Multiclass Support Vector Machine (SVMmulticlass)
  • Structured SVM (SVMstruct)
  • Max-Margin Markov Network (M3N)
  • Averaged Perceptron
  • SEARN
  • Conditional Random Field (CRF)
  • Hidden Markov Model (HMM)
  • Structured Learning Ensemble (SLE)

State-of-the-art

  • Neural Conditional Random Fields (NCRF) (Do et al., 2010)
  • Gradient Boosted Conditional Random Fields (GBCRF) (Chen et al., 2015)

SLIDE 32

Results: Baseline

Model           Error (%)
RTDRBM          15.95 (±0.0009)
SLE             20.58
SVMstruct       21.16
HMM             23.70
M3N             25.08
Perceptron      26.40
SEARN           27.02
SVMmulticlass   28.54
CRF             32.30

Table: Comparison between the prediction error (%) of the RTDRBM and models evaluated in (Nguyen & Guo, 2007).

SLIDE 33

Results: State-of-the-art

Model     Error (%)
NCRF       4.44
GBCRF      4.64 (±0.0027)
RTDRBM    15.95 (±0.0009)

Table: Comparison between the prediction error (%) of the RTDRBM and the state-of-the-art on the OCR dataset, namely Neural Conditional Random Fields (NCRF) (Do et al., 2010) and Gradient Boosted Conditional Random Fields (GBCRF) (Chen et al., 2015).

SLIDE 34

Next

1. Introduction: Analysis of Sequences in Music
2. Preliminaries: Restricted Boltzmann Machines, etc.
3. Contribution: The Recurrent Temporal Discriminative RBM
4. Extension: Generalising the RTDRBM
5. Contribution: Generalising the DRBM

SLIDE 35

Motivation

[Figure: DRBM with input x, label y, hidden layer h, and weight matrices V and U.]

  • The DRBM is essentially an RBM with the class label included in its visible layer.
  • Several variants of the RBM have been proposed, with
    • {−1, +1}-binary hidden unit activations.
    • Integer-valued hidden unit activations.
    • Real-valued hidden unit activations.
  • How might the same be achieved for the DRBM?

SLIDE 36

Key Intuition

Generalise the expression for the DRBM conditional distribution p(y|x) as a function of the values that its hidden states can assume, and then derive the conditional distribution for any desired set of such values.

SLIDE 37

Generalising the DRBM Conditional Distribution (Cherla et al., 2017)

Begin with the expression for the conditional distribution:

    P(y \mid x) = \frac{\sum_h P(x, y, h)}{\sum_{y^*} \sum_h P(x, y^*, h)}
                = \frac{\sum_h \exp(-E(x, y, h))}{\sum_{y^*} \sum_h \exp(-E(x, y^*, h))}    (2)

where y is the one-hot encoding of a class label y.

SLIDE 38

Generalising the DRBM Conditional Distribution (Cherla et al., 2017)

This can be generalised as follows (details in the paper):

    P(y \mid x) = \frac{\exp(b_y) \prod_j \sum_k \exp\big(s_k (\sum_i x_i w_{ij} + u_{yj} + c_j)\big)}{\sum_{y^*} \exp(b_{y^*}) \prod_j \sum_k \exp\big(s_k (\sum_i x_i w_{ij} + u_{y^*j} + c_j)\big)}    (3)

where s_k is each of the K states that can be assumed by each hidden unit j of the model.
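Equation (3) can be evaluated directly for any finite set of hidden-unit states. The sketch below is an illustration (parameter names are assumptions): `states=[0, 1]` recovers the standard DRBM of the next slide, `[-1, +1]` the Bipolar variant, and `range(N + 1)` the Binomial one.

```python
import numpy as np
from scipy.special import logsumexp

def generalised_drbm_conditional(x, W, U, b_y, c, states):
    """P(y|x) of Eq. (3) for an arbitrary finite set of hidden-unit states.
    W: (n_vis, n_hid), U: (n_classes, n_hid), b_y: (n_classes,), c: (n_hid,)."""
    s = np.asarray(states, dtype=float)          # the K candidate states s_k
    alpha = c + x @ W + U                        # alpha_{y,j} = c_j + (xW)_j + u_{yj}
    # log sum_k exp(s_k * alpha_{y,j}), then summed over hidden units j
    log_prod = logsumexp(s[:, None, None] * alpha[None, :, :], axis=0).sum(axis=1)
    log_unnorm = b_y + log_prod
    return np.exp(log_unnorm - logsumexp(log_unnorm))
```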

SLIDE 39

(Re-)Deriving the DRBM (Cherla et al., 2017)

The (Bernoulli) DRBM conditional distribution can be derived when the states are s_k ∈ {0, 1}:

    P_{ber}(y \mid x) = \frac{\exp(b_y) \prod_j \sum_{s_k \in \{0,1\}} \exp(s_k \alpha_j)}{\sum_{y^*} \exp(b_{y^*}) \prod_j \sum_{s_k \in \{0,1\}} \exp(s_k \alpha_j^*)}
                      = \frac{\exp(b_y) \prod_j (1 + \exp(\alpha_j))}{\sum_{y^*} \exp(b_{y^*}) \prod_j (1 + \exp(\alpha_j^*))}    (4)

SLIDE 40

The Bipolar DRBM (Cherla et al., 2017)

The Bipolar DRBM conditional distribution can be derived when the states are s_k ∈ {−1, +1}:

    P_{bip}(y \mid x) = \frac{\exp(b_y) \prod_j \sum_{s_k \in \{-1,+1\}} \exp(s_k \alpha_j)}{\sum_{y^*} \exp(b_{y^*}) \prod_j \sum_{s_k \in \{-1,+1\}} \exp(s_k \alpha_j^*)}
                      = \frac{\exp(b_y) \prod_j (\exp(-\alpha_j) + \exp(\alpha_j))}{\sum_{y^*} \exp(b_{y^*}) \prod_j (\exp(-\alpha_j^*) + \exp(\alpha_j^*))}    (5)

SLIDE 41

The Binomial DRBM (Cherla et al., 2017)

The Binomial DRBM conditional distribution can be derived when the states are s_k ∈ {0, ..., N}. The per-unit sum becomes

    S_N = \sum_{s_k=0}^{N} \exp(s_k \alpha_j)
        = 1 + \exp(\alpha_j) \sum_{s_k=0}^{N-1} \exp(s_k \alpha_j)
        = \frac{1 - \exp((N+1) \alpha_j)}{1 - \exp(\alpha_j)}    (6)
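The closed form in Eq. (6) is just the geometric series; a quick numerical check, for illustration only (with α ≠ 0):

```python
import numpy as np

def binomial_sum(alpha, N):
    """S_N = sum_{s=0}^{N} exp(s * alpha), via the closed form of Eq. (6)."""
    return (1.0 - np.exp((N + 1) * alpha)) / (1.0 - np.exp(alpha))

alpha, N = 0.3, 8
explicit = sum(np.exp(s * alpha) for s in range(N + 1))
assert np.isclose(binomial_sum(alpha, N), explicit)
```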

SLIDE 42

Experiments: ML Benchmarks

  • Datasets
    1. MNIST digit classification.
    2. USPS digit classification.
    3. 20 Newsgroups document classification.
  • Grid search with each model evaluated over 10 seeded runs.
  • The value of N (bins) in the Binomial DRBM varied as {2, 4, 8}.
  • Maximise log-likelihood on training and validation set.
  • Report average classification error on the test set:

      E(y, y^*) = \frac{1}{N} \sum_{i=1}^{N} I(y_i \neq y_i^*)

SLIDE 43

Results: MNIST

Model                                         Average Loss (%)
DRBM (n_hid = 500, η_init = 0.05)             1.78 (±0.0012)
Bipolar DRBM (n_hid = 500, η_init = 0.01)     1.84 (±0.0007)
Binomial DRBM (n_hid = 500, η_init = 0.01)    1.86 (±0.0016)

Table: Results on the MNIST dataset. The Binomial DRBM in this table is the one with n_bins = 2.

n_bins   n_hid   η_init   Average Loss (%)
2        500     0.01     1.86
4        500     0.01     1.88
8        500     0.001    1.90

Table: Performance of the Binomial DRBM with different values of n_bins. The differences were within the margin of significance.

SLIDE 44

Results: USPS

Model                                      Average Loss (%)
DRBM (n = 50, η_init = 0.01)               6.90 (±0.0047)
Bipolar DRBM (n = 500, η_init = 0.01)      6.49 (±0.0026)
Binomial DRBM (n = 1000, η_init = 0.01)    6.09 (±0.0014)

Table: Performance on the USPS dataset. The Binomial DRBM in this table is the one with nbins = 8.

n_bins   η_init   n_hid   Average Loss (%)
2        0.01     50      6.90 (±0.0047)
4        0.01     1000    6.48 (±0.0018)
8        0.01     1000    6.09 (±0.0014)

Table: Classification average losses of the Binomial DRBM with different values of nbins.

SLIDE 45

Results: 20 Newsgroups

Model                                      Average Loss (%)
DRBM (n = 50, η_init = 0.01)               28.52 (±0.0049)
Bipolar DRBM (n = 50, η_init = 0.001)      27.75 (±0.0019)
Binomial DRBM (n = 100, η_init = 0.001)    28.17 (±0.0028)

Table: Performance on the 20 Newsgroups dataset. The Binomial DRBM in this table is the one with nbins = 2.

n_bins   η_init   n_hid   Average Loss (%)
2        0.001    100     28.17 (±0.0028)
4        0.001    50      28.24 (±0.0032)
8        0.0001   50      28.76 (±0.0040)

Table: Classification performance of the Binomial DRBM with different values of nbins.

SLIDE 46

Acknowledgements

Parts of the work described above were carried out in collaboration with Son N. Tran (now a researcher at CSIRO) at City, University of London.

SLIDE 47

Thank you!

Questions?
