SLIDE 1

Neural Probabilistic Models for Melody Prediction, Sequence Labelling and Classification

Srikanth Cherla https://cherla.org

September 13, 2017

SLIDE 2

Outline

1. Introduction: Analysis of Sequences in Music
2. Preliminaries: Restricted Boltzmann Machines, etc.
3. Contribution: The Recurrent Temporal Discriminative RBM
4. Extension: Generalising the RTDRBM
5. Contribution: Generalising the DRBM

SLIDE 3

Next

1. Introduction: Analysis of Sequences in Music
2. Preliminaries: Restricted Boltzmann Machines, etc.
3. Contribution: The Recurrent Temporal Discriminative RBM
4. Extension: Generalising the RTDRBM
5. Contribution: Generalising the DRBM

SLIDE 4

Sequences in Notated Music

  • A wealth of information in notated music
  • Increasingly available
    • in different formats (MIDI, Kern, GP4, etc.)
    • for different kinds of music (classical, rock, pop, etc.)
  • Analysis of sequences key to extracting information
  • Melody — Good starting point for a broader analysis

SLIDE 5

Relevance

Scientific:

  • Computational musicology
  • Organizing music data
  • Aiding acoustic models
  • Music education

Creative:

  • Automatic music generation
  • Compositional assistance

SLIDE 6

Task: Melody Prediction

  • Model a series of musical events s_1^T as follows:

      p(s_1^T) = \prod_{t=1}^{T} p(s_t \mid s_{t-n+1}^{t-1})

  • Conditional probabilities learned from a corpus
  • Information-theoretic measure, cross entropy, to measure a trained model's prediction uncertainty:

      H(p, p_m) = -\sum_{t=1}^{T} p(w_t \mid w_{t-n+1}^{t-1}) \log_2 p_m(w_t \mid w_{t-n+1}^{t-1})

  • How well does a model p_m approximate p?
  • Cross entropy to be minimized
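For illustration, the per-event cross entropy can be estimated as in the minimal sketch below; `predict_proba` is a hypothetical interface standing in for whatever trained model p_m is being evaluated, and is not part of the original work.

```python
import numpy as np

def mean_cross_entropy(melody, predict_proba, n=5):
    """Cross entropy (bits per event) of a predictive model over one melody.
    `melody` is a list of event symbols; `predict_proba(context, event)` is an
    assumed interface returning p_m(event | previous n-1 events)."""
    bits = 0.0
    for t, event in enumerate(melody):
        context = tuple(melody[max(0, t - n + 1):t])
        bits -= np.log2(predict_proba(context, event))
    return bits / len(melody)
```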

SLIDE 7

Motivating Distributed Models

  • Previous work focused on n-gram models
  • No comparative results with other prediction models
  • Thriving neural networks research (Bengio, 2009)
  • Recent success of neural network language models (Bengio, 2003; Collobert et al., 2011; Mikolov et al., 2010)

Start with an evaluation of connectionist models on the melody prediction task.

SLIDE 8

Next

1. Introduction: Analysis of Sequences in Music
2. Preliminaries: Restricted Boltzmann Machines, etc.
3. Contribution: The Recurrent Temporal Discriminative RBM
4. Extension: Generalising the RTDRBM
5. Contribution: Generalising the DRBM

SLIDE 9

Restricted Boltzmann Machine (Smolensky, 1986)

  • Generative, energy-based graphical model.
  • Data v in visible layer, features h in hidden layer.
  • Can model the joint probability p(v) of the data as

      p(v) = \frac{\exp(-\mathrm{FreeEnergy}(v))}{\sum_{v^*} \exp(-\mathrm{FreeEnergy}(v^*))}

    where FreeEnergy(v) = -\log \sum_h \exp(-\mathrm{Energy}(v, h)).

  • Learned using Contrastive Divergence (Hinton, 2002).

[Figure: RBM with visible layer v (the melodic context s_{t-n+1:t}), hidden layer h, and weights W.]
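As a concrete reference, here is a minimal numpy sketch of the free energy of a binary RBM; the parameter names (W for weights, b and c for visible and hidden biases) are assumptions made for illustration.

```python
import numpy as np

def rbm_free_energy(v, W, b, c):
    """FreeEnergy(v) = -b.v - sum_j log(1 + exp(c_j + (vW)_j)) for a binary RBM.
    v: visible vector, W: (n_vis, n_hid) weights, b: visible biases, c: hidden biases."""
    return -v @ b - np.sum(np.logaddexp(0.0, v @ W + c))
```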

SLIDE 10

Discriminative RBM (Larochelle & Bengio, 2008)

  • Discriminative classifier based on the RBM.
  • Data x and class-label y in visible layer.
  • Can model the conditional probability p(y|x) as

      p(y \mid x) = \frac{\exp(-\mathrm{FreeEnergy}(x, y))}{\sum_{y^*} \exp(-\mathrm{FreeEnergy}(x, y^*))}

  • Exact gradient computation is possible.

[Figure: DRBM with input x = s_{t-n+1:t-1}, label y = s_t, hidden layer h, and weights V and U.]
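The conditional above can be evaluated exactly by enumerating the classes. A minimal sketch, assuming binary hidden units and the parameter names V (input weights), U (label weights), b_y (class biases) and c (hidden biases):

```python
import numpy as np

def drbm_conditional(x, V, U, b_y, c):
    """Exact p(y|x) of a DRBM with binary hidden units, using
    -FreeEnergy(x, y) = b_y + sum_j log(1 + exp(c_j + (xV)_j + U_{y,j})).
    V: (n_vis, n_hid), U: (n_classes, n_hid), b_y: (n_classes,), c: (n_hid,)."""
    neg_free_energy = b_y + np.sum(np.logaddexp(0.0, c + x @ V + U), axis=1)
    neg_free_energy -= neg_free_energy.max()   # numerical stability
    p = np.exp(neg_free_energy)
    return p / p.sum()
```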

SLIDE 11

Recurrent Temporal RBM (Sutskever et al., 2009)

  • Generative model for high-dimensional time-series.
  • RBM at time t conditioned on the mean-field hidden state \hat{h}^{(t-1)}.
  • Models the joint probability of a sequence as

      p(v^{(1:T)}, h^{(1:T)}) = \prod_t p(v^{(t)} \mid \hat{h}^{(t-1)}) \; p(h^{(t)} \mid v^{(t)}, \hat{h}^{(t-1)})

  • Learned using Contrastive Divergence and BPTT.

[Figure: RTRBM unrolled in time, with hidden states h(0), h(1), h(2), ..., visible vectors v(1), v(2), ..., weights W and Whh, and per-step biases b(t), c(t).]
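The conditioning works through a deterministic mean-field recurrence over the hidden states. The sketch below shows only that recurrence (parameter names and shapes are assumptions), not the full CD/BPTT training procedure.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def rtrbm_hidden_states(V, W, Whh, c, h0):
    """Mean-field recurrence h_hat(t) = sigmoid(v(t).W + h_hat(t-1).Whh + c).
    V: (T, n_vis) visible vectors, W: (n_vis, n_hid), Whh: (n_hid, n_hid),
    c: (n_hid,) hidden bias, h0: initial hidden state. Returns the T states."""
    h, states = h0, []
    for v in V:
        h = sigmoid(v @ W + h @ Whh + c)
        states.append(h)
    return np.array(states)
```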

SLIDE 12

Next

1. Introduction: Analysis of Sequences in Music
2. Preliminaries: Restricted Boltzmann Machines, etc.
3. Contribution: The Recurrent Temporal Discriminative RBM
4. Extension: Generalising the RTDRBM
5. Contribution: Generalising the DRBM

SLIDE 13

Motivation

  • Discriminative inference on generative RTRBM
  • Possible to carry out discriminative learning
  • Previous work suggested potential improvements

SLIDE 14

Discriminative Learning in the RTRBM (Cherla et al., 2015)

Extend DRBM learning to a recurrent model:

    p(y^{(t)} \mid x^{(1:t)}) = p(y^{(t)} \mid x^{(t)}, \hat{h}^{(t-1)})
                              = \frac{\exp(-\mathrm{FreeEnergy}(x^{(t)}, y^{(t)}))}{\sum_{y^*} \exp(-\mathrm{FreeEnergy}(x^{(t)}, y^*))}

[Figure: RTDRBM unrolled in time, with hidden states h(0), h(1), h(2), ..., inputs x(1), x(2), ..., labels y(1), y(2), ..., weights W (input), U (label), Whh (recurrent), and per-step biases b(t), c(t).]

SLIDE 15

Discriminative Learning in the RTRBM (Cherla et al., 2015)

Apply to an entire sequence to optimize the log-likelihood:

    O = \log p(y^{(1:T)} \mid x^{(1:T)}) = \sum_{t=1}^{T} \log p(y^{(t)} \mid x^{(t)}, \hat{h}^{(t-1)})

[Figure: RTDRBM unrolled in time, as on the previous slide.]
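The objective is accumulated step by step along the sequence. A minimal sketch, where `predict_proba(x, h)` and `step(x, y, h)` are hypothetical interfaces for the per-step class distribution and the hidden-state update:

```python
import numpy as np

def sequence_log_likelihood(X, Y, predict_proba, step, h0):
    """O = sum_t log p(y(t) | x(t), h_hat(t-1)) for one labelled sequence.
    X: (T, n_vis) inputs, Y: (T,) integer labels."""
    h, O = h0, 0.0
    for x, y in zip(X, Y):
        O += np.log(predict_proba(x, h)[y])   # log-probability of the true label
        h = step(x, y, h)                     # update the mean-field hidden state
    return O
```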

SLIDE 16

Discriminative Learning in the RTRBM (Cherla et al., 2015)

  • Recurrent extension of the DRBM.
  • Identical in structure to the RTRBM.
  • Exact gradient of cost computable at each time-step.
  • Back-Propagation Through Time for sequence learning.

[Figure: RTDRBM unrolled in time, as on the previous slides.]

SLIDE 17

Experiments: Melody Corpus

Corpus

  • As used in (Pearce & Wiggins, 2004).
  • A collection of 8 datasets.
  • Folk songs from the Essen Folk Song Collection.
  • Chorale melodies.

Dataset                    No. events   |χ|
Yugoslavian folk songs           2691    25
Alsatian folk songs              4496    32
Swiss folk songs                 4586    34
Austrian folk songs              5306    35
German folk songs                8393    27
Canadian folk songs              8553    25
Chorale melodies                 9227    21
Chinese folk songs              11056    41

SLIDE 18

Experiments: Melody Corpus

Models

  • Non-recurrent: n-grams (b), n-grams (u), FNN, RBMs, DRBMs with context length ∈ {1, 2, 3, 4, 5, 6, 7, 8}.
  • Recurrent: RNN, RTRBM, RTDRBM over entire sequences.

  • Hidden units ∈ {25, 50, 100, 200}
  • Learning rate ∈ {0.01, 0.05}
  • Trained for 500 epochs.
  • Best model determined over a validation set.

Evaluation criterion — cross-entropy:

    H_c(p_{mod}, D_{test}) = -\frac{1}{|D_{test}|} \sum_{s_1^n \in D_{test}} \log_2 p_{mod}(s_n \mid s_1^{n-1})

SLIDE 19

Results

[Plot: cross entropy vs. context length (1–8) for n-gram (b), n-gram (u), FNN, RBM, DRBM, RNN, RTRBM and RTDRBM.]

In general, performance improves with context length.

SLIDE 20

Results

[Plot: cross entropy vs. context length, as on Slide 19.]

n-gram model performance worsens at lower context length.

SLIDE 21

Results

[Plot: cross entropy vs. context length, as on Slide 19.]

Non-recurrent connectionist models outperform n-grams.

SLIDE 22

Results

[Plot: cross entropy vs. context length, as on Slide 19.]

Recurrent connectionist models outperform non-recurrent.

SLIDE 23

Results

[Plot: cross entropy vs. context length, as on Slide 19.]

RTDRBM outperforms RTRBM.

SLIDE 24

Results

[Plot: cross entropy vs. context length, as on Slide 19.]

With a shorter context: DRBM outperforms RBM.

SLIDE 25

Results

[Plot: cross entropy vs. context length, as on Slide 19.]

With a longer context: RBM outperforms DRBM.

SLIDE 26

Results

[Plot: cross entropy vs. context length, as on Slide 19.]

More details and discussion available in the paper.

SLIDE 27

Next

1. Introduction: Analysis of Sequences in Music
2. Preliminaries: Restricted Boltzmann Machines, etc.
3. Contribution: The Recurrent Temporal Discriminative RBM
4. Extension: Generalising the RTDRBM
5. Contribution: Generalising the DRBM

SLIDE 28

Motivation

[Figure: RTDRBM unrolled in time, as in the previous section.]

    \hat{h}^{(t-1)} = \sigma(W x^{(t-1)} + U y^{(t-1)} + c^{(t-1)})
                    = \sigma(W x^{(t-1)} + U y^{(t-1)} + W_{hh} \hat{h}^{(t-2)} + c)

Limitation: the dependence of h^{(t)} on the true label y^{*(t-1)}, which is not suitable for general sequence-labelling problems.

SLIDE 29

Motivation

[Figure: RTDRBM unrolled in time, as on the previous slide.]

    \hat{h}^{(t-1)} = \sigma(W x^{(t-1)} + U y^{(t-1)} + c^{(t-1)})
                    = \sigma(W x^{(t-1)} + U y^{(t-1)} + W_{hh} \hat{h}^{(t-2)} + c)

Solution: replace y^{*(t-1)} (unavailable at test time) with the predicted output y^{(t-1)} of the previous time-step.
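At test time this amounts to greedy decoding, feeding each predicted label back into the recurrence. A rough sketch under the same assumed parameter names as before (`predict_proba` is again a placeholder per-step classifier):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def label_sequence(X, W, U, Whh, c, h0, predict_proba):
    """Greedy sequence labelling: the previous *predicted* label (one-hot)
    replaces the unavailable true label in the hidden-state update."""
    n_classes = U.shape[0]
    h, labels = h0, []
    for x in X:
        y_hat = int(np.argmax(predict_proba(x, h)))   # predict with current hidden state
        labels.append(y_hat)
        y_onehot = np.eye(n_classes)[y_hat]           # feed the prediction back
        h = sigmoid(x @ W + y_onehot @ U + h @ Whh + c)
    return labels
```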

SLIDE 30

Experiments: OCR

Dataset (Taskar et al., 2004)

  • 6,877 English sentences with 52,152 words
  • Each character a 16 × 8 binary image
  • ASCII code label for each image (26 categories)
  • 10 cross-validation folds, one hold-out test set

Method

  • Grid search over model hyperparameters
  • 10-fold cross validation during model selection
  • Models trained over entire sentences

Evaluation: Average Loss Per Sequence

    E(y, y^*) = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{L_i} \sum_{j=1}^{L_i} I\big[(y_i)_j \neq (y_i^*)_j\big]    (1)

SLIDE 31

Experiments: OCR

Baseline Models (Nguyen & Guo, 2007)

  • Multiclass Support Vector Machine (SVMmulticlass)
  • Structured SVM (SVMstruct)
  • Max-Margin Markov Network (M3N)
  • Averaged Perceptron
  • SEARN
  • Conditional Random Field (CRF)
  • Hidden Markov Model (HMM)
  • Structured Learning Ensemble (SLE)

State-of-the-art

  • Neural Conditional Random Fields (NCRF) (Do et al., 2010)
  • Gradient Boosted Conditional Random Fields (GBCRF) (Chen et al., 2015)

SLIDE 32

Results: Baseline

Model           Error (%)
RTDRBM          15.95 (±0.0009)
SLE             20.58
SVMstruct       21.16
HMM             23.70
M3N             25.08
Perceptron      26.40
SEARN           27.02
SVMmulticlass   28.54
CRF             32.30

Table: Comparison between the prediction error (%) of the RTDRBM and models evaluated in (Nguyen & Guo, 2007).

SLIDE 33

Results: State-of-the-art

Model     Error (%)
NCRF       4.44
GBCRF      4.64 (±0.0027)
RTDRBM    15.95 (±0.0009)

Table: Comparison between the prediction error (%) of the RTDRBM and the state-of-the-art on the OCR dataset, namely Neural Conditional Random Fields (NCRF) (Do et al., 2010) and Gradient Boosted Conditional Random Fields (GBCRF) (Chen et al., 2015).

SLIDE 34

Next

1. Introduction: Analysis of Sequences in Music
2. Preliminaries: Restricted Boltzmann Machines, etc.
3. Contribution: The Recurrent Temporal Discriminative RBM
4. Extension: Generalising the RTDRBM
5. Contribution: Generalising the DRBM

SLIDE 35

Motivation

[Figure: DRBM with input x, label y, hidden layer h, and weight matrices V and U.]

  • The DRBM is essentially an RBM with the class label included in its visible layer.
  • Several variants of the RBM have been proposed, with
    • {−1, +1}-binary hidden unit activations.
    • Integer-valued hidden unit activations.
    • Real-valued hidden unit activations.
  • How might the same be achieved for the DRBM?

SLIDE 36

Key Intuition

Generalise the expression for the DRBM conditional distribution p(y|x) as a function of the values that its hidden states can assume, and then derive the conditional distribution for any desired set of such values.

SLIDE 37

Generalising the DRBM Conditional Distribution (Cherla et al., 2017)

Begin with the expression for the conditional distribution:

    P(y \mid x) = \frac{\sum_h P(x, y, h)}{\sum_{y^*} \sum_h P(x, y^*, h)}
                = \frac{\sum_h \exp(-E(x, y, h))}{\sum_{y^*} \sum_h \exp(-E(x, y^*, h))}    (2)

where y is the one-hot encoding of a class label y.

SLIDE 38

Generalising the DRBM Conditional Distribution (Cherla et al., 2017)

This can be generalised as follows (details in the paper):

    P(y \mid x) = \frac{\exp(b_y) \prod_j \sum_k \exp\big(s_k (\sum_i x_i w_{ij} + u_{yj} + c_j)\big)}{\sum_{y^*} \exp(b_{y^*}) \prod_j \sum_k \exp\big(s_k (\sum_i x_i w_{ij} + u_{y^*j} + c_j)\big)}    (3)

where s_k is each of the K states that can be assumed by each hidden unit j of the model.
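Equation (3) can be evaluated directly for any finite set of hidden-unit states. The sketch below is an illustration (parameter names are assumptions): `states=[0, 1]` recovers the standard DRBM of the next slide, `[-1, +1]` the Bipolar variant, and `range(N + 1)` the Binomial one.

```python
import numpy as np
from scipy.special import logsumexp

def generalised_drbm_conditional(x, W, U, b_y, c, states):
    """P(y|x) of Eq. (3) for an arbitrary finite set of hidden-unit states.
    W: (n_vis, n_hid), U: (n_classes, n_hid), b_y: (n_classes,), c: (n_hid,)."""
    s = np.asarray(states, dtype=float)          # the K candidate states s_k
    alpha = c + x @ W + U                        # alpha_{y,j} = c_j + (xW)_j + u_{yj}
    # log sum_k exp(s_k * alpha_{y,j}), then summed over hidden units j
    log_prod = logsumexp(s[:, None, None] * alpha[None, :, :], axis=0).sum(axis=1)
    log_unnorm = b_y + log_prod
    return np.exp(log_unnorm - logsumexp(log_unnorm))
```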

SLIDE 39

(Re-)Deriving the DRBM (Cherla et al., 2017)

The (Bernoulli) DRBM conditional distribution can be derived when the states are s_k ∈ {0, 1}:

    P_{ber}(y \mid x) = \frac{\exp(b_y) \prod_j \sum_{s_k \in \{0,1\}} \exp(s_k \alpha_j)}{\sum_{y^*} \exp(b_{y^*}) \prod_j \sum_{s_k \in \{0,1\}} \exp(s_k \alpha_j^*)}
                      = \frac{\exp(b_y) \prod_j (1 + \exp(\alpha_j))}{\sum_{y^*} \exp(b_{y^*}) \prod_j (1 + \exp(\alpha_j^*))}    (4)

SLIDE 40

The Bipolar DRBM (Cherla et al., 2017)

The Bipolar DRBM conditional distribution can be derived when the states are s_k ∈ {−1, +1}:

    P_{bip}(y \mid x) = \frac{\exp(b_y) \prod_j \sum_{s_k \in \{-1,+1\}} \exp(s_k \alpha_j)}{\sum_{y^*} \exp(b_{y^*}) \prod_j \sum_{s_k \in \{-1,+1\}} \exp(s_k \alpha_j^*)}
                      = \frac{\exp(b_y) \prod_j (\exp(-\alpha_j) + \exp(\alpha_j))}{\sum_{y^*} \exp(b_{y^*}) \prod_j (\exp(-\alpha_j^*) + \exp(\alpha_j^*))}    (5)

SLIDE 41

The Binomial DRBM (Cherla et al., 2017)

The Binomial DRBM conditional distribution can be derived when the states are s_k ∈ {0, ..., N}. The per-unit sum becomes

    S_N = \sum_{s_k=0}^{N} \exp(s_k \alpha_j)
        = 1 + \exp(\alpha_j) \sum_{s_k=0}^{N-1} \exp(s_k \alpha_j)
        = \frac{1 - \exp((N+1) \alpha_j)}{1 - \exp(\alpha_j)}    (6)
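The closed form in Eq. (6) is just the geometric series; a quick numerical check, for illustration only (with α ≠ 0):

```python
import numpy as np

def binomial_sum(alpha, N):
    """S_N = sum_{s=0}^{N} exp(s * alpha), via the closed form of Eq. (6)."""
    return (1.0 - np.exp((N + 1) * alpha)) / (1.0 - np.exp(alpha))

alpha, N = 0.3, 8
explicit = sum(np.exp(s * alpha) for s in range(N + 1))
assert np.isclose(binomial_sum(alpha, N), explicit)
```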

SLIDE 42

Experiments: ML Benchmarks

  • Datasets
    1. MNIST digit classification.
    2. USPS digit classification.
    3. 20 Newsgroups document classification.
  • Grid search with each model evaluated over 10 seeded runs.
  • The value of N (bins) in the Binomial DRBM varied as {2, 4, 8}.
  • Maximise log-likelihood on training and validation set.
  • Report average classification error on the test set:

      E(y, y^*) = \frac{1}{N} \sum_{i=1}^{N} I(y_i \neq y_i^*)

SLIDE 43

Results: MNIST

Model                                         Average Loss (%)
DRBM (n_hid = 500, η_init = 0.05)             1.78 (±0.0012)
Bipolar DRBM (n_hid = 500, η_init = 0.01)     1.84 (±0.0007)
Binomial DRBM (n_hid = 500, η_init = 0.01)    1.86 (±0.0016)

Table: Results on the MNIST dataset. The Binomial DRBM in this table is the one with n_bins = 2.

n_bins   n_hid   η_init   Average Loss (%)
2        500     0.01     1.86
4        500     0.01     1.88
8        500     0.001    1.90

Table: Performance of the Binomial DRBM with different values of n_bins. The differences were within the margin of significance.

SLIDE 44

Results: USPS

Model                                      Average Loss (%)
DRBM (n = 50, η_init = 0.01)               6.90 (±0.0047)
Bipolar DRBM (n = 500, η_init = 0.01)      6.49 (±0.0026)
Binomial DRBM (n = 1000, η_init = 0.01)    6.09 (±0.0014)

Table: Performance on the USPS dataset. The Binomial DRBM in this table is the one with nbins = 8.

n_bins   η_init   n_hid   Average Loss (%)
2        0.01     50      6.90 (±0.0047)
4        0.01     1000    6.48 (±0.0018)
8        0.01     1000    6.09 (±0.0014)

Table: Classification average losses of the Binomial DRBM with different values of nbins.

SLIDE 45

Results: 20 Newsgroups

Model                                      Average Loss (%)
DRBM (n = 50, η_init = 0.01)               28.52 (±0.0049)
Bipolar DRBM (n = 50, η_init = 0.001)      27.75 (±0.0019)
Binomial DRBM (n = 100, η_init = 0.001)    28.17 (±0.0028)

Table: Performance on the 20 Newsgroups dataset. The Binomial DRBM in this table is the one with nbins = 2.

n_bins   η_init   n_hid   Average Loss (%)
2        0.001    100     28.17 (±0.0028)
4        0.001    50      28.24 (±0.0032)
8        0.0001   50      28.76 (±0.0040)

Table: Classification performance of the Binomial DRBM with different values of nbins.

SLIDE 46

Acknowledgements

Parts of the work described above were carried out in collaboration with Son N. Tran (now a researcher at CSIRO) at City, University of London.

SLIDE 47

Thank you!

Questions?
