SLIDE 1

GCT634: Musical Applications of Machine Learning

Polyphonic Music Transcription Non-negative Matrix Factorization

Graduate School of Culture Technology, KAIST Juhan Nam

SLIDE 2

Outline

  • Introduction
  • Score-Audio Alignment
  • Multi-Pitch Estimation
  • Non-negative Matrix Factorization (NMF)
SLIDE 3

Polyphonic Music Transcription

  • Converting an acoustic musical signal into some form of music notation
  • MIDI piano roll, staff notation
  • Note information: pitch, onset, offset, loudness

[Diagram: Input → Model → Output]

SLIDE 4

Related Tasks

  • Multi-pitch estimation
    • Single source: piano, guitar
    • Multiple sources: quartet (woodwind, string)
  • Predominant F0 estimation
    • Melody extraction, singing melody
  • Drum transcription
    • Kick, snare, hi-hat
  • Let’s listen to a piece and try to transcribe (hum) the melody
SLIDE 5

Two Directions

  • Performance transcription
    • Detecting the exact timing and dynamics of notes (micro-timing with roughly 10 ms resolution)
    • Frame-level: onset, offset, intensity
    • Piano-roll notation is usually used (performance score)
  • Score transcription
    • Transforms a performance into staff notation
    • Note-level: tempo, beat, downbeat
    • Rhythmic transcription (tempo, beat, downbeat) → temporal quantization
    • Expression detection (pedal, articulation), often phrase-level
    • Instrument identification
    • Very challenging
SLIDE 6

Score and Performance

[Audio examples: MIDI (score), Valentina Lisitsa, Vladimir Horowitz]

SLIDE 7

Where Are The Differences?

  • Tempo
    • Note-level (note onset/offset timings), phrase-level, song-level
  • Dynamics
    • Note-level (note velocity), phrase-level, song-level
  • Different interpretations of the musical expressions in the score
    • Temporal: ritardando, rubato
    • Dynamics: piano, forte, crescendo, …
    • Playing techniques or articulation: legato, staccato
    • Mood and emotion: dolce, grazioso
SLIDE 8

Score-to-Audio Alignment

  • Temporal alignment between a score and audio of the same piece of music
  • Audio-to-audio and MIDI-to-MIDI alignment (where one side is a performance) are also possible
  • Why do we synchronize them?
    • Automatic page turning
    • Performance analysis
    • Score following
    • Auto-accompaniment

[Müller]

SLIDE 9

Algorithm Overview

  • Choose feature representations to compare
    • Often, MIDI is converted to audio so that the alignment is done in the same feature space
  • Compute a similarity matrix between the two feature sequences
    • All possible combinations of local feature pairs
  • Find a path that makes the best alignment on the similarity matrix
    • Dynamic Time Warping (DTW)

[Diagram: Feature Seq. #1 × Feature Seq. #2 → Similarity Matrix (compute the local similarity) → Dynamic Programming (find the best path)]

SLIDE 10

Feature Representations

  • Audio feature representations
    • A frequent choice for piano music is chroma
    • CENS: Chroma Energy Normalized Statistics features (Müller, 2005)

[Figure: CENS features for the MIDI rendition and the Lisitsa recording]

SLIDE 11
Similarity Matrix

  • Similarity between every pair of frame-level features
  • Euclidean or cosine distance
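As a sketch of this step, the pairwise matrix can be computed in a few lines of NumPy. The function and variable names below are illustrative (not from the slides), and the cosine distance 1 − cos(x, y) is used as the local cost:

```python
import numpy as np

def cost_matrix(F1, F2):
    """Pairwise cosine distance between two frame-level feature sequences.

    F1: (d, N) array, feature sequence #1 (e.g. chroma), one column per frame
    F2: (d, M) array, feature sequence #2
    Returns an (N, M) matrix C with C[n, m] = 1 - cosine similarity.
    """
    # Normalize each frame to unit length (epsilon guards silent frames)
    U = F1 / (np.linalg.norm(F1, axis=0, keepdims=True) + 1e-12)
    V = F2 / (np.linalg.norm(F2, axis=0, keepdims=True) + 1e-12)
    return 1.0 - U.T @ V
```

Identical frames get cost 0 and orthogonal frames cost 1, so low-cost cells mark likely alignment points.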

SLIDE 12

Finding the Optimal Path

  • There are many possible paths from one corner to the other

[Figure: similarity matrix between Schumann, Träumerei (Lisitsa) and Schumann, Träumerei (MIDI)]

SLIDE 13

3D Surface Plot of Similarity Matrix

  • Finding the optimal path is analogous to figuring out a hiking trail that takes the minimum effort

SLIDE 14

Dynamic Time Warping

  • Finding an (N, M)-warping path of length L
    • P = (p1, p2, …, pL) where pi = (ni, mi)
  • Three conditions
    • Boundary condition: p1 = (1, 1), pL = (N, M)
    • Monotonicity condition
      • n1 ≤ n2 ≤ … ≤ nL
      • m1 ≤ m2 ≤ … ≤ mL
    • Step size condition
      • Move only upward, rightward, or diagonally (upper-right)

[Müller]

SLIDE 15

Dynamic Time Warping : Bad Examples

[Müller]

SLIDE 16

Dynamic Programming for DTW

  • Algorithm
    • Initialization:
      D(n,1) = sum(C(1:n,1)), n = 1…N
      D(1,m) = sum(C(1,1:m)), m = 1…M
    • Recurrence relation:
      for each n = 1…N, m = 1…M:
        D(n,m) = C(n,m) + min{ D(n−1,m), D(n,m−1), D(n−1,m−1) }
    • Termination:
      D(N,M) is the total alignment cost
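The recursion above can be sketched directly in NumPy (0-based indexing here, unlike the 1-based notation on the slide; names are illustrative). Back-tracking the stored decisions recovers the optimal warping path:

```python
import numpy as np

def dtw(C):
    """Dynamic programming for DTW on a local cost matrix C (N x M).

    Initialization: D[:,0] and D[0,:] are cumulative sums of the border costs.
    Recurrence:     D[n,m] = C[n,m] + min(D[n-1,m], D[n,m-1], D[n-1,m-1]).
    Returns the total cost D[N-1, M-1] and the optimal path (by back-tracking).
    """
    N, M = C.shape
    D = np.zeros((N, M))
    D[:, 0] = np.cumsum(C[:, 0])
    D[0, :] = np.cumsum(C[0, :])
    for n in range(1, N):
        for m in range(1, M):
            D[n, m] = C[n, m] + min(D[n - 1, m], D[n, m - 1], D[n - 1, m - 1])
    # Back-track from (N-1, M-1) to (0, 0), always stepping to the cheapest
    # predecessor allowed by the step-size condition
    path = [(N - 1, M - 1)]
    n, m = N - 1, M - 1
    while (n, m) != (0, 0):
        if n == 0:
            m -= 1
        elif m == 0:
            n -= 1
        else:
            steps = [(n - 1, m - 1), (n - 1, m), (n, m - 1)]
            n, m = min(steps, key=lambda nm: D[nm])
        path.append((n, m))
    return D[N - 1, M - 1], path[::-1]
```

On a 2×2 cost matrix with zeros on the diagonal, the optimal path is the diagonal with total cost 0, as expected.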

SLIDE 17

Dynamic Programming for DTW

  • Toy Example

[Figure: toy example of the local cost matrix C and the accumulated cost D, from Müller]

SLIDE 18

Score and Audio Alignment by DTW

[Figure: cost matrix C(i,j) and accumulated cost D(i,j) with the optimal alignment path]

SLIDE 19

Limitations

  • The optimal path is obtained only after we arrive at the destination (by back-tracking)
    • In other words, DTW works offline
    • What if the sequences are very long?
    • Online version of DTW?
  • Every frame is treated as equally important
    • In general, humans are more sensitive to note onsets
    • Perceptually, not every frame is equally important
SLIDE 20

Online DTW

  • Set a moving search window and calculate the cost only within the window
    • Time and space cost: quadratic → linear
  • The movement is determined by the position that gives the minimum cost within the current window. If that position is at the...
    • Corner: move both up and right (alternately)
    • Upper edge: move up
    • Right edge: move right

[Figure 2: an example of the on-line time warping algorithm with search window c = 4, showing the order of evaluation for a particular sequence of row and column increments; all calculated cells are framed in bold, and the optimal path is coloured grey]

[Dixon, 2005]

SLIDE 21

Automatic Page Turner (JKU, Austria)

SLIDE 22

Onset-sensitive Alignment

  • We are sensitive to the time alignment of note onsets
  • The similarity matrix puts no additional weight on onsets
  • DLNCO features
    • Decaying Locally-adapted Normalized Chroma Onset
    • Capture only onset strength on chroma features
    • Normalize onset energy and note length (by an artificially-created note tail)

[Ewert, 2009]

SLIDE 23

Demo: PerformScore

  • https://jdasam.github.io/PerformScore/
SLIDE 24

Multi-pitch Estimation

  • Two types of polyphonic settings
    • Polyphonic instruments: piano, guitar
    • Ensembles of monophonic instruments: woodwind quintet, string quartet, chorale
  • Three levels of subtasks
    • First level: frame-wise estimation of pitches and polyphony (number of notes)
    • Second level: tracking pitch within a note based on temporal continuity
    • Third level: tracking notes for each sound source, usually for ensembles of monophonic instruments

SLIDE 25

Challenges

  • Many sources are mixed and played simultaneously
    • They are likely to be harmonically related in music
    • Some sources can be masked by others
  • The content changes continuously with musical expression (e.g. vibrato)
  • Compromises
    • Transcribe as many source sounds as possible
    • Or transcribe only the dominant sources: melody, bass, drums
SLIDE 26

Frame-wise Multi-pitch Estimation

  • Three categories of approaches
    • Iterative F0 search: repeatedly finds the predominant F0 and removes its related sources
    • Joint source estimation: examines possible combinations of multiple sources, e.g., NMF
    • Classification-based approach: no prior knowledge of musical acoustics; relies only on supervised learning
SLIDE 27

Iterative F0 estimation

  • Based on repeated cancellation of the harmonic overtones of detected F0s (Klapuri, 2003)
  • Procedure
    1. Set the residual to the original signal
    2. Detect the predominant F0, based on the harmonic sieve method
    3. Apply spectral smoothing to the harmonics of the detected F0
    4. Cancel the smoothed harmonics from the residual:
       Y_R(k) ← max(Y_R(k) − d·Y_D(k), 0)
    5. Repeat steps 2–4 until the residual is sufficiently flat

[Diagram: mixture → F0 detection → cancel the detected sound from the residual → repeat]
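The loop above can be illustrated with a deliberately simplified sketch. This is not Klapuri's actual algorithm: the harmonic sieve is replaced by a plain harmonic-magnitude sum, and spectral smoothing is omitted; only the cancellation rule Y_R ← max(Y_R − d·Y_D, 0) is taken from the slide. All names and parameters are illustrative:

```python
import numpy as np

def iterative_f0(Y, sr, n_fft, max_f0s=4, d=1.0, fmin=80, fmax=1000, n_harm=8):
    """Toy iterative F0 estimation by harmonic cancellation.

    Y: magnitude spectrum of one frame (length n_fft//2 + 1).
    Each iteration (1) scores candidate F0s by summed magnitude at their
    harmonics, (2) forms the detected source Y_D from the winning bins, and
    (3) cancels it from the residual. Returns the list of detected F0s (Hz).
    """
    Y_R = Y.astype(float).copy()
    bin_hz = sr / n_fft
    f0s = []
    for _ in range(max_f0s):
        best_f0, best_score = None, 0.0
        for f0 in np.arange(fmin, fmax, bin_hz):
            bins = (np.arange(1, n_harm + 1) * f0 / bin_hz).round().astype(int)
            bins = bins[bins < len(Y_R)]
            score = Y_R[bins].sum()          # crude predominant-F0 salience
            if score > best_score:
                best_f0, best_score = f0, score
        if best_f0 is None or best_score < 1e-6:
            break                            # residual is (nearly) flat: stop
        bins = (np.arange(1, n_harm + 1) * best_f0 / bin_hz).round().astype(int)
        bins = bins[bins < len(Y_R)]
        Y_D = np.zeros_like(Y_R)
        Y_D[bins] = Y_R[bins]
        Y_R = np.maximum(Y_R - d * Y_D, 0.0)  # cancellation step from the slide
        f0s.append(best_f0)
    return f0s
```

Subharmonic errors (e.g. reporting 110 Hz for a 220 Hz tone) are exactly what the sieve weighting and spectral smoothness criterion in the real method are there to suppress.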

SLIDE 28

Iterative F0 estimation

[Figures: spectral smoothness and iterative estimation; from ECE 477, Computer Audition, Zhiyao Duan, 2014]

SLIDE 29

Iterative F0 estimation

  • Advantages
    • Deterministic: only signal processing, no data-driven training
    • Can handle inharmonicity (e.g. piano) and vibrato
  • Limitations
    • F0 estimation becomes unreliable as the number of iterations increases
    • Spectral smoothing is not accurate enough
SLIDE 30

Joint Source Estimation

  • Based on a model of the sound mixture
    • All sources compete with each other to explain the mixture; we find the subset that is most likely
    • The number of sources is limited
    • Non-negative matrix factorization (NMF) has been the most widely explored
SLIDE 31

Joint Source Estimation

  • How many spectral templates can explain the sources?
SLIDE 32

Joint Source Estimation

  • We can explain the spectrogram with three spectral bases (W) and their corresponding activations (H)
  • Can we decompose V into W and H automatically?

V ≈ WH

SLIDE 33

Non-negative Matrix Factorization (NMF)

  • A matrix factorization algorithm in which all elements are non-negative
    • V (F × T matrix): original data (e.g. spectrogram)
    • W (F × K matrix): K basis vectors (e.g. dictionary)
    • H (K × T matrix): activation matrix (e.g. weights or gains)
  • Note that this provides a compressed representation
    • A low-rank approximation

V ≈ W H

SLIDE 34

Algorithm for NMF

  • V is known; W and H are unknown. How do we solve this?
  • Alternate the estimation (similar to the EM algorithm)
    • Start with a random W
    • Estimate H given W
    • Estimate a new W given H
    • Repeat until convergence
SLIDE 35
Algorithm for NMF

  • If the distance is Euclidean, solve:

    min_{W,H≥0} Σ_{i,j} (V − WH)²_{ij},  with V̂ = WH

  • Estimate H given W: H = (WᵀW)⁻¹WᵀV (least squares!)
  • Make H non-negative: H = max(H, 0)
  • Estimate W given H: W = VHᵀ(HHᵀ)⁻¹ (least squares!)
  • Make W non-negative: W = max(W, 0)
  • Repeat until convergence
  • The problems
    • Requires pseudoinverses at every iteration: expensive, with stability issues
    • Gaussian assumption on the approximation error
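A minimal sketch of this alternating least-squares scheme, assuming NumPy and illustrative names. The pseudoinverse implements each least-squares solve, and `np.maximum(·, 0)` is the clipping step, which is exactly what makes the method ad hoc:

```python
import numpy as np

def nmf_als(V, K, n_iter=200, seed=0):
    """Alternating least squares NMF with clipping (Euclidean cost).

    V: (F, T) non-negative data, K: number of basis vectors.
    Each step solves an unconstrained least-squares problem via the
    pseudoinverse and clips negatives to zero. Returns W (F, K), H (K, T).
    """
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], K)) + 1e-3
    H = None
    for _ in range(n_iter):
        # H = (W^T W)^-1 W^T V, then clip to keep H non-negative
        H = np.maximum(np.linalg.pinv(W) @ V, 0)
        # W = V H^T (H H^T)^-1, then clip to keep W non-negative
        W = np.maximum(V @ np.linalg.pinv(H), 0)
    return W, H
```

The two pseudoinverses per iteration are the cost and stability problem named above; the multiplicative rule on the next slide avoids them entirely.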

SLIDE 36

Algorithm for NMF

  • Instead, we use a special distance
    • A variant of the Kullback-Leibler (KL) divergence:

      min_{W,H≥0} Σ_{f,t} ( V_ft · log( V_ft / (WH)_ft ) − V_ft + (WH)_ft )

  • “Multiplicative” (magic) update rules (Lee and Seung, 2000)
    • Estimate W:  W_fk ← W_fk · Σ_t [ V_ft / (WH)_ft ] · H_kt / Σ_t H_kt
    • Estimate H:  H_kt ← H_kt · Σ_f W_fk · [ V_ft / (WH)_ft ] / Σ_f W_fk
    • Repeat until convergence
  • This is much faster and needs no matrix inversion!
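The multiplicative updates above map directly to matrix operations. A compact NumPy sketch (illustrative names; the small epsilon guards divisions by zero):

```python
import numpy as np

def nmf_kl(V, K, n_iter=200, seed=0, eps=1e-9):
    """Multiplicative-update NMF for the KL divergence (Lee and Seung, 2000).

    V: (F, T) non-negative data, K: number of basis vectors.
    W update:  W *= ((V / WH) @ H^T) / row-sums of H
    H update:  H *= (W^T @ (V / WH)) / column-sums of W
    Non-negativity is preserved automatically, and nothing is inverted.
    """
    rng = np.random.default_rng(seed)
    F, T = V.shape
    W = rng.random((F, K)) + eps
    H = rng.random((K, T)) + eps
    for _ in range(n_iter):
        R = V / (W @ H + eps)                        # ratio V_ft / (WH)_ft
        W *= (R @ H.T) / (H.sum(axis=1) + eps)       # update W
        R = V / (W @ H + eps)                        # recompute with new W
        H *= (W.T @ R) / (W.sum(axis=0)[:, None] + eps)  # update H
    return W, H
```

Because every factor in the update is non-negative, W and H stay non-negative without any clipping, which is the "magic" of the multiplicative form.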

SLIDE 37

Property of NMF

  • The learned bases (W) capture parts of the data
    • An example is explained by a combination of the parts (e.g. additive synthesis)
  • The bases are more structured and interpretable
SLIDE 38

Interpretation of NMF on spectrogram

  • Each column of the spectrogram is approximated as a weighted sum of the basis vectors

SLIDE 39

Interpretation of NMF on spectrogram

  • The whole spectrogram is approximated as a sum of matrix “layers”, each of which is explained by one spectral component

[Figure: spectrogram = layer 1 + layer 2 + layer 3]

SLIDE 40

Source Separation by NMF

  • We can separate each source

[Figure: spectrogram = layer 1 + layer 2 + layer 3; audio demos of the resynthesized results]

SLIDE 41

Supervised Learning

  • Perform NMF separately on isolated training data for each source in a mixture
    • Pre-learn an individual model of each source, e.g., W1, W2, and W3
    • Combine them into a single model W = [W1 W2 W3] that explains a mixture
    • Given a mixture V, perform NMF with W fixed, updating H only
    • The activation H then indicates the strength of the F0s
    • Usually needs sparsity and temporal-continuity constraints on H (Virtanen, 2007)
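The fixed-dictionary step can be sketched by reusing the KL multiplicative rule for H only (illustrative names; the sparsity and continuity constraints from Virtanen, 2007 are omitted here):

```python
import numpy as np

def nmf_fixed_w(V, W, n_iter=100, eps=1e-9):
    """Supervised NMF step: the dictionary W = [W1 W2 ...], pre-learned from
    isolated sources, is held fixed; only the activations H are updated with
    the KL multiplicative rule. The rows of H belonging to Wi then give the
    activation (e.g. F0 strength) of source i in the mixture V.
    """
    rng = np.random.default_rng(0)
    H = rng.random((W.shape[1], V.shape[1])) + eps
    for _ in range(n_iter):
        R = V / (W @ H + eps)                            # ratio V / (WH)
        H *= (W.T @ R) / (W.sum(axis=0)[:, None] + eps)  # update H only
    return H
```

With W fixed, the problem is much better conditioned than full NMF, which is why the supervised setting is the easiest of the three.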
SLIDE 42

Supervised Learning

[Diagram: train W1 and W2 by NMF on isolated sources (throwing away H1 and H2), then run NMF on the mixture with W = [W1 W2] fixed to obtain H = [H1 H2]]

SLIDE 43

Semi-supervised Learning

  • Problem in supervised learning
    • It is difficult to have training data for all individual sources
    • Unknown sources are mixed in the majority of real-world scenarios
  • Semi-supervised learning
    • Learn spectral bases (i.e. dictionaries) for the available sources, say, W1
    • In the testing phase, add new spectral bases W2 to explain the remaining sources in the mixture
    • Fix the trained W1 and update only W2 in the NMF iterations
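The partial update can be sketched by applying the multiplicative W rule only to the W2 block (illustrative names; again a bare-bones version without any extra constraints):

```python
import numpy as np

def nmf_semi(V, W1, K2, n_iter=200, eps=1e-9):
    """Semi-supervised NMF sketch: W1 (trained on the known sources) stays
    fixed; W2 (random init, K2 columns) is free to absorb the remaining
    sources. Only W2 and H receive multiplicative updates each iteration.
    """
    rng = np.random.default_rng(0)
    F, T = V.shape
    K1 = W1.shape[1]
    W2 = rng.random((F, K2)) + eps
    H = rng.random((K1 + K2, T)) + eps
    for _ in range(n_iter):
        W = np.hstack([W1, W2])
        R = V / (W @ H + eps)
        # Update only the W2 block (rows K1: of H are its activations)
        W2 *= (R @ H[K1:].T) / (H[K1:].sum(axis=1) + eps)
        W = np.hstack([W1, W2])
        R = V / (W @ H + eps)
        H *= (W.T @ R) / (W.sum(axis=0)[:, None] + eps)
    return W2, H
```

After convergence, the rows of H belonging to W1 describe the known sources, and H1 from the training phase is simply discarded, as on the next slide.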
SLIDE 44

Semi-supervised Learning

[Diagram: train W1 by NMF on the known source (throwing away H1), then run NMF on the mixture with W = [W1 W2] and H = [H1 H2], where W2 is initialized with random numbers]

SLIDE 45

Unsupervised Learning

  • We have no information about the individual sources
    • Update both W and H for the mixture sound
  • Additional constraints are needed
    • Spectral harmonicity and smoothness on W (Vincent, 2010)
  • Very difficult!
SLIDE 46

Unsupervised Learning

[Audio demos, from [Vincent, 2010]: unsupervised NMF (adapted W); supervised NMF (fixed W); unsupervised NMF with harmonicity & smoothness constraints on W]

SLIDE 47

Issues

  • Number of basis vectors (K)
    • Too small
      • Reconstruction error increases
      • The model is under-fitted
    • Too large
      • The model does not learn parts (the distribution of spectral basis vectors becomes sparse)
      • The model becomes too general, so it can explain other sources well
    • Sparsity is often added to the activations in order to learn “parts”, for example:

      min_{W,H≥0} D(V‖WH) + ‖H‖₁

SLIDE 48

Joint Source Estimation

  • Advantages
    • Compositional model: applicable to any mixture
    • Models can be extended well with additional constraints: e.g. source-filter model, inharmonicity
  • Limitations
    • Can be computationally expensive: long inference time due to iteration
    • Modeled pitches are usually discrete
SLIDE 49

Classification-Based Transcription

  • Train a binary classifier for each note
    • Each classifier is trained with two groups of audio features: one including the note and the other not including it
    • 88 classifiers for polyphonic piano transcription

[Diagram: audio features → per-note on/off classifiers (C4, C#4, …); feature space split into frames including C4 vs. frames not including C4]

SLIDE 50

Classification-Based Transcription

  • Often trained with real music data (not single notes)
    • There are abundant MIDI files for classical piano music, and it is easy to get audio files from them, e.g., using software synthesizers or player pianos

[Figure: MIDI piano roll and the corresponding audio spectrogram]

SLIDE 51

Classification-Based Transcription

  • Audio features
    • Auditory filter bank
    • Spectrogram or log-spectrogram
  • Classifiers
    • Support vector machines
    • Neural networks
  • Multi-label classification problem
    • Approach #1: a separate binary classification for each note; select balanced training sets for each note
    • Approach #2: cross-entropy between the binary label vector and the predicted output (this is more commonly used)

[Diagram: neural network with input, hidden layers, and a multi-label output]
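Approach #2 can be sketched with a single sigmoid output layer trained on the binary cross-entropy against the multi-hot label vector (a real transcriber would add hidden layers; names and sizes here are illustrative):

```python
import numpy as np

def train_multilabel(X, Y, n_iter=500, lr=0.5, seed=0):
    """Multi-label frame classification with sigmoid outputs and BCE loss.

    X: (T, d) frame-level audio features; Y: (T, n_notes) binary labels
    (the piano-roll column for each frame, e.g. n_notes = 88).
    Trains one linear layer by full-batch gradient descent on the mean
    binary cross-entropy. Returns the weights and biases.
    """
    rng = np.random.default_rng(seed)
    Wt = rng.normal(scale=0.01, size=(X.shape[1], Y.shape[1]))
    b = np.zeros(Y.shape[1])
    for _ in range(n_iter):
        P = 1.0 / (1.0 + np.exp(-(X @ Wt + b)))  # per-note probabilities
        G = (P - Y) / len(X)                     # gradient of mean BCE
        Wt -= lr * X.T @ G
        b -= lr * G.sum(axis=0)
    return Wt, b
```

The convenient property of the BCE formulation is that one model handles all 88 notes jointly, instead of maintaining 88 separately balanced training sets.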

SLIDE 52

Viterbi Decoding

  • Temporal smoothing of the predicted outputs
    • A separate HMM for each note, with a binary state (note on/off)
    • 88 initial state distributions (2×1) and transition probability matrices (2×2)

[Figure: input spectrogram, hidden-layer activations, SVM output, and HMM output]

[Nam,2011]
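The per-note smoothing can be sketched as standard Viterbi decoding on a two-state HMM (illustrative names; run once per note to smooth the whole piano roll):

```python
import numpy as np

def viterbi_binary(post, p_init, A):
    """Viterbi smoothing for one note's on/off HMM.

    post:   (T, 2) frame-wise posteriors (off, on) from the classifier
    p_init: (2,)   initial state distribution
    A:      (2, 2) transition probability matrix
    Returns the most likely binary state sequence (0 = off, 1 = on).
    """
    T = len(post)
    logp = np.log(post + 1e-12)
    logA = np.log(A + 1e-12)
    delta = np.log(p_init + 1e-12) + logp[0]
    back = np.zeros((T, 2), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + logA      # scores[i, j]: from state i to j
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + logp[t]
    states = np.zeros(T, dtype=int)
    states[-1] = delta.argmax()
    for t in range(T - 2, -1, -1):          # back-track the best path
        states[t] = back[t + 1, states[t + 1]]
    return states
```

With a "sticky" transition matrix (high self-transition probability), a single frame whose posterior dips below 0.5 in the middle of a note is kept on, which is exactly the smoothing effect illustrated on the slide.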

SLIDE 53

References

  • G. Widmer, “In Search of the Horowitz Factor”, 2003
  • S. Dixon, “Live Tracking of Musical Performance Using On-line Time Warping”, 2005
  • S. Ewert, “High Resolution Audio Synchronization Using Chroma Onset Features”, 2009
  • A. Klapuri, “Multiple Fundamental Frequency Estimation Based on Harmonicity and Spectral Smoothness”, 2003
  • T. Virtanen, “Monaural Sound Source Separation by Nonnegative Matrix Factorization with Temporal Continuity and Sparseness Criteria”, 2007
  • E. Vincent, “Adaptive Harmonic Spectral Decomposition for Multiple Pitch Estimation”, 2010
  • G. Poliner, “A Discriminative Model for Polyphonic Piano Transcription”, 2007
  • J. Nam, “A Classification-Based Polyphonic Piano Transcription Approach Using Learned Feature Representations”, 2011