GCT634: Musical Applications of Machine Learning
Polyphonic Music Transcription and Non-negative Matrix Factorization
Juhan Nam, Graduate School of Culture Technology, KAIST
Outlines
- Introduction
- Score-Audio Alignment
- Multi-Pitch Estimation
- Non-negative Matrix Factorization (NMF)
Polyphonic Music Transcription
- Converting an acoustic musical signal into some form of music
notation
- MIDI piano roll, staff notation
- Note information: pitch, onset, offset, loudness
Model Input Output
Related Tasks
- Multi-pitch estimation
- Single source: piano, guitar
- Multiple source: quartet (woodwind, string)
- Predominant F0 estimation
- Melody extraction, singing melody
- Drum transcription
- Kick, snare, high-hat
- Let’s listen to a piece and try to transcribe (hum) the melody
Two Directions
- Performance transcription
- Detecting exact timing and dynamics of notes (micro-timing with 10ms
resolution or so)
- Frame-level: onset, offset, intensity
- Piano-roll notation is usually used (performance score)
- Score transcription
- Transform performance into staff notation
- Note-level: tempo, beat, downbeat
- Rhythmic transcription (tempo, beat, downbeat) → temporal quantization
- Expression detection (pedal, articulation), often phrase-level
- Instrument identification
- Very challenging
Score and Performance
MIDI (score) Valentina Lisitsa Vladimir Horowitz
Where Are The Differences?
- Tempo
- Note-level, (note onset/offset timings), phrase-level, song-level
- Dynamics
- Note-level, (note velocity), phrase-level, song-level
- Different interpretation of musical expressions in score
- Temporal: ritardando, rubato
- Dynamics: piano, forte, crescendo, …
- Play techniques or articulation: legato, staccato
- Mood and emotion: dolce, grazioso
Score-to-Audio Alignment
- Temporal alignment between score and audio from a piece of
music
- Audio-to-audio and MIDI-to-MIDI (either one is performance) are possible
- Why do we synchronize them?
- Automatic page turning
- Performance analysis
- Score following
- Auto-accompaniment
[Müller]
Algorithm Overview
- Choose feature representations to compare
- Often, MIDI is converted to audio for alignment in the same feature space
- Compute a similarity matrix between two features sequences
- All possible combinations of local feature pairs
- Find a path that makes the best alignment on the similarity
matrix
- Dynamic Time Warping (DTW)
Dynamic Programming
[Diagram: feature sequence #1 and feature sequence #2 are compared in a similarity matrix; first compute the local similarity, then find the best path]
Feature Representations
- Audio feature representations
- Frequent choice for piano music is chroma
- CENS: Chroma Energy Normalized Statistics (Müller, 2005)
[Figure: CENS features of the MIDI rendering and the Lisitsa recording]
- Similarity between every pair of frame-level features
- Euclidean or cosine distance
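The pairwise comparison above can be sketched in a few lines of NumPy. This is a minimal illustration (not the course's actual code), assuming the chroma features are stored as 12 × frames arrays:

```python
import numpy as np

def cosine_similarity_matrix(F1, F2, eps=1e-8):
    """Pairwise cosine similarity between two chroma sequences.

    F1: (12, N) chroma features of sequence 1 (e.g. rendered from MIDI)
    F2: (12, M) chroma features of sequence 2 (e.g. from the recording)
    Returns an (N, M) matrix S with S[n, m] = cos(F1[:, n], F2[:, m]).
    """
    # Normalize each frame to unit length, then a dot product is the cosine
    F1 = F1 / (np.linalg.norm(F1, axis=0, keepdims=True) + eps)
    F2 = F2 / (np.linalg.norm(F2, axis=0, keepdims=True) + eps)
    return F1.T @ F2

# Toy usage with random "chroma" frames
S = cosine_similarity_matrix(np.random.rand(12, 5), np.random.rand(12, 7))
```

Since chroma features are non-negative, cosine similarity lies in [0, 1], which makes the similarity matrix easy to visualize.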
Similarity Matrix
Finding the Optimal Path
- There are so many possible paths from one corner to another
[Figure: similarity matrix and its 3D surface plot for Schumann's Träumerei, Lisitsa recording vs. MIDI]
- Finding the optimal path is analogous to figuring out a trail route
that you can take with minimum effort when hiking
Dynamic Time Warping
- Finding an (N, M)-warped path of length L
- P = (p1, p2, p3, …, pL) where pi = (ni, mi)
- Three conditions
- Boundary condition: p1 = (1, 1), pL = (N, M)
- Monotonicity condition: n1 ≤ n2 ≤ … ≤ nL and m1 ≤ m2 ≤ … ≤ mL
- Step-size condition: move only upward, rightward, or diagonally (upper-right)
[Müller]
Dynamic Time Warping : Bad Examples
[Müller]
Dynamic Programming for DTW
- Algorithm
- Initialization:
D(n,1) = sum(C(1:n,1)), n = 1…N
D(1,m) = sum(C(1,1:m)), m = 1…M
- Recurrence Relation:
For each n = 2…N, for each m = 2…M:
D(n,m) = C(n,m) + min( D(n−1,m), D(n,m−1), D(n−1,m−1) )
- Termination:
D(N,M) is the total alignment distance
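The initialization, recurrence, and back-tracking above can be sketched directly in NumPy. This is a minimal illustration of the dynamic program, not an optimized implementation:

```python
import numpy as np

def dtw_cost(C):
    """Accumulated-cost matrix D over a local cost matrix C (N x M),
    using D(n,m) = C(n,m) + min(D(n-1,m), D(n,m-1), D(n-1,m-1))."""
    N, M = C.shape
    D = np.zeros((N, M))
    D[:, 0] = np.cumsum(C[:, 0])   # boundary: D(n,1) = sum(C(1:n,1))
    D[0, :] = np.cumsum(C[0, :])   # boundary: D(1,m) = sum(C(1,1:m))
    for n in range(1, N):
        for m in range(1, M):
            D[n, m] = C[n, m] + min(D[n-1, m], D[n, m-1], D[n-1, m-1])
    return D

def dtw_path(D):
    """Back-track the optimal warping path from (N-1, M-1) to (0, 0)."""
    n, m = D.shape[0] - 1, D.shape[1] - 1
    path = [(n, m)]
    while n > 0 or m > 0:
        if n == 0:
            m -= 1
        elif m == 0:
            n -= 1
        else:
            # move to the cheapest of the three allowed predecessors
            n, m = min([(n-1, m-1), (n-1, m), (n, m-1)], key=lambda s: D[s])
        path.append((n, m))
    return path[::-1]
```

For a cost matrix that is zero on the diagonal, the optimal path follows the diagonal, as expected.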
Dynamic Programming for DTW
- Toy Example
[Müller] Similarity Matrix (C) Accumulated cost (D)
Score and Audio Alignment by DTW
[Figure: similarity matrix C(i,j) and accumulated cost D(i,j) for the score-audio pair]
Limitations
- The optimal path is obtained only after we arrive at the destination (by
back-tracking)
- In other words, DTW works offline
- What if the sequences are very long?
- Online version of DTW?
- Every frame is equally important
- In general, humans are more sensitive to note onsets
- Perceptually, every frame is not equally important
Online DTW
- Set a moving search window and
calculate the cost only within the window
- Time and space cost: from quadratic to linear
- The movement is determined by the
position that gives a minimum cost within the current window. If the position is ...
- Corner: move both up and right (alternately)
- Upper edge: move up
- Right edge: move right
[Figure: the on-line time warping algorithm with search window c = 4, showing the order of evaluation for a particular sequence of row and column increments; all calculated cells are framed in bold, and the optimal path is coloured grey]
[Dixon, 2005]
Automatic Page Turner (JKU, Austria)
Onset-sensitive Alignment
- We are sensitive to the time alignment of note onsets
- The plain similarity matrix gives no additional weight to onsets
- DLNCO features: Decaying Locally-adapted Normalized Chroma Onset
- Capture only onset strength on chroma features
- Normalize onset energy and note length (by an artificially-created note tail)
[Ewert, 2009]
Demo: PerformScore
- https://jdasam.github.io/PerformScore/
Multi-pitch Estimation
- Two types of polyphonic settings
- Polyphonic instruments: piano, guitar
- Ensemble of monophonic instruments: woodwind quintet, string quartet,
chorale
- Three levels of subtasks
- First-level: frame-wise estimation of pitches and polyphony (number of
notes)
- Second-level: tracking pitch within a note based on temporal continuity
- Third-level: tracking notes for each sound source, usually for ensembles of
monophonic instruments
Challenges
- Many sources are mixed and played simultaneously
- They are likely to be harmonically related in music
- Some sources can be masked by others
- Content changes continuously by musical expressions (e.g. vibrato)
- Compromises
- Transcribe as many source sounds as possible
- Only dominant sources: melody, bass, drum
Frame-wise Multi-pitch Estimation
- Three categories of approaches
- Iterative F0 search: repeatedly finds the predominant F0 and removes its
related sources
- Joint source estimation: examines possible combinations of multiple
sources, e.g., NMF
- Classification-based approach: no prior knowledge of musical acoustics;
relies only on supervised learning
Iterative F0 estimation
- Based on repeated cancellation of harmonic overtones of
detected F0s (Klapuri, 2003)
- Procedure
1. Initialize the residual to the original signal
2. Detect the predominant F0 (based on the harmonic sieve method)
3. Apply spectral smoothing to the harmonics of the detected F0
4. Cancel the smoothed harmonics from the residual
5. Repeat steps 2–4 until the residual is sufficiently flat
F0 detection and sound cancellation from the mixture:
Y_R(k) ← max( Y_R(k) − d·Y_D(k), 0 )
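The cancellation step above is a one-liner in NumPy. A minimal sketch, where `d` is an attenuation factor as in the formula (its value here is an illustrative assumption):

```python
import numpy as np

def cancel_harmonics(Y_residual, Y_detected, d=0.8):
    """One cancellation step of the iterative F0 estimator:
    Y_R(k) <- max(Y_R(k) - d * Y_D(k), 0)

    Y_residual: magnitude spectrum of the residual mixture
    Y_detected: smoothed harmonic spectrum of the detected F0
    d: attenuation factor (illustrative value, not from the slides)
    """
    # Subtract the detected source's smoothed harmonics, clipping at zero
    return np.maximum(Y_residual - d * Y_detected, 0.0)
```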
Iterative F0 estimation
Spectral Smoothness
[Figure from ECE 477 - Computer Audition, Zhiyao Duan, 2014: spectral smoothness in iterative estimation]
Iterative F0 estimation
- Advantages
- Deterministic: only by signal processing and no data-driven training
- Can handle inharmonicity (e.g. piano) and vibrato
- Limitations
- F0 estimation becomes unreliable as the iterations proceed
- Spectral smoothing is not accurate enough
Joint Source Estimation
- Based on a model for sound mixture
- All sources compete with each other to explain the mixture; we find the
subset that is most likely
- The number of sources is limited
- Non-negative matrix factorization (NMF) has been most widely explored
Joint Source Estimation
- How many spectral templates can explain the source?
Joint Source Estimation
- We can explain the spectrogram with three spectral basis vectors (𝑋)
and corresponding activations (𝐼)
- Can we decompose 𝑊 into 𝑋 and 𝐼 automatically?
𝑊 ≈ 𝑋𝐼
Non-negative Matrix Factorization (NMF)
- A matrix factorization in which all elements are constrained to be
non-negative
- 𝑊 (𝑁 × 𝑂 matrix): original data (e.g. spectrogram)
- 𝑋 (𝑁 × 𝐿 matrix): 𝐿 basis vectors (e.g. dictionary)
- 𝐼 (𝐿 × 𝑂 matrix): activation matrix (e.g. weights or gains)
- Note that this provides a compressed representation
- A low-rank approximation
𝑊 (𝑁 × 𝑂) ≈ 𝑋 (𝑁 × 𝐿) · 𝐼 (𝐿 × 𝑂)
Algorithm for NMF
- 𝑊 is known; 𝑋 and 𝐼 are unknown. How?
- Alternate the estimation (similar to the EM algorithm)
- Start with a random 𝑋
- Estimate an 𝐼 given 𝑋
- Estimate a new 𝑋 given 𝐼
- Repeat until convergence
- If the distance is Euclidean, solve the following:
- Estimate 𝐼 given 𝑋: 𝐼 = (𝑋ᵀ𝑋)⁻¹𝑋ᵀ𝑊 (least squares!)
- Make 𝐼 non-negative: 𝐼 = max(𝐼, 0)
- Estimate 𝑋 given 𝐼: 𝑋 = 𝑊𝐼ᵀ(𝐼𝐼ᵀ)⁻¹ (least squares!)
- Make 𝑋 non-negative: 𝑋 = max(𝑋, 0)
- Repeat until convergence
- The problems
- Requires pseudoinverses at every iteration: expensive, with stability issues
- Gaussian assumption on the approximation error
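The projected alternating-least-squares scheme above can be sketched as follows, with `np.linalg.lstsq` standing in for the pseudoinverse solves. A minimal illustration under the slides' notation (W ≈ XI), not a robust implementation:

```python
import numpy as np

def nmf_als(W, L, n_iter=20, seed=0):
    """Projected alternating least squares for W ~= X @ I:
    solve each least-squares subproblem, then clip negatives to zero.

    W: (N, O) non-negative data matrix (e.g. magnitude spectrogram)
    L: number of basis vectors
    """
    rng = np.random.default_rng(seed)
    X = rng.random((W.shape[0], L))
    for _ in range(n_iter):
        # I = argmin ||X I - W||  (normal equations via lstsq), then clip
        I = np.maximum(np.linalg.lstsq(X, W, rcond=None)[0], 0.0)
        # X = argmin ||X I - W||  solved as I^T X^T ~= W^T, then clip
        X = np.maximum(np.linalg.lstsq(I.T, W.T, rcond=None)[0].T, 0.0)
    return X, I
```

On exactly low-rank non-negative data this converges quickly; the clipping step is what makes it a heuristic rather than an exact solver.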
Algorithm for NMF
min_{𝑋,𝐼 ≥ 0} Σₙ,ₘ (𝑊ₙₘ − (𝑋𝐼)ₙₘ)²
𝑊̂ = 𝑋𝐼
Algorithm for NMF
- Instead, we use a special distance
- A variant of the Kullback-Leibler divergence (KL divergence)
- “Multiplicative” (magic) update rules
- Estimate 𝑋: X_{nl} ← X_{nl} · ( Σₘ (𝑊/(𝑋𝐼))_{nm} I_{lm} ) / ( Σₘ I_{lm} )
- Estimate 𝐼: I_{lm} ← I_{lm} · ( Σₙ X_{nl} (𝑊/(𝑋𝐼))_{nm} ) / ( Σₙ X_{nl} )
- Repeat until convergence
- This is much faster and needs no matrix inversion!
min_{𝑋,𝐼 ≥ 0} Σₙ,ₘ ( 𝑊ₙₘ log( 𝑊ₙₘ / (𝑋𝐼)ₙₘ ) − 𝑊ₙₘ + (𝑋𝐼)ₙₘ )
(Lee and Seung, 2000)
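The multiplicative updates above translate almost line-for-line into NumPy. A minimal sketch in the slides' notation (W ≈ XI); the small `eps` guards against division by zero and is an implementation assumption:

```python
import numpy as np

def nmf_kl(W, L, n_iter=50, seed=0, eps=1e-9):
    """NMF with the Lee-Seung multiplicative updates for the
    (generalized) KL divergence, approximating W ~= X @ I."""
    rng = np.random.default_rng(seed)
    N, O = W.shape
    X = rng.random((N, L)) + eps
    I = rng.random((L, O)) + eps
    for _ in range(n_iter):
        # update activations: I <- I * (X^T (W/(XI))) / (column sums of X)
        R = W / (X @ I + eps)
        I *= (X.T @ R) / (X.sum(axis=0)[:, None] + eps)
        # update basis: X <- X * ((W/(XI)) I^T) / (row sums of I)
        R = W / (X @ I + eps)
        X *= (R @ I.T) / (I.sum(axis=1)[None, :] + eps)
    return X, I
```

Note that every operation is an element-wise multiply or a matrix product: non-negativity is preserved automatically, and no matrix inversion is needed.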
Property of NMF
- The learned basis vectors (𝑋) capture “parts”
- An example is explained by a combination of the parts (e.g. additive
synthesis)
- The basis vectors are more structured and interpretable
Interpretation of NMF on spectrogram
- Columns of the spectrogram are a weighted sum of basis
vectors
Interpretation of NMF on spectrogram
- The whole spectrogram is approximated as a sum of matrix
“layers”, each of which is explained by one spectral component.
Source Separation by NMF
- We can separate each source
Resynthesized results:
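Given an NMF decomposition, each source layer can be extracted with a soft "Wiener-style" mask. A minimal sketch under the assumption that S is a magnitude spectrogram and S ≈ X @ I:

```python
import numpy as np

def separate_layers(S, X, I, eps=1e-9):
    """Split a magnitude spectrogram S into per-component layers.

    Layer k is S weighted by the mask outer(X[:,k], I[k,:]) / (X I),
    so the layers sum back (approximately) to S.
    """
    V = X @ I + eps  # full model reconstruction
    return [(np.outer(X[:, k], I[k, :]) / V) * S for k in range(X.shape[1])]
```

Each layer can then be resynthesized, e.g. by pairing it with the mixture phase and inverting the STFT.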
Supervised Learning
- Perform NMF separately for isolated training data of each
source in a mixture
- Pre-learn individual models for each source, e.g., W1, W2, and W3
- Combine them into a single model W = [W1 W2 W3] that explains a mixture
- Given a mixture V, perform the NMF (Fix W and update H only)
- Then, the activation H indicates the strength of F0s
- Usually needs sparsity and temporal continuity on H (Virtanen, 2007)
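The "fix W, update H" step can be sketched with the same KL multiplicative rule, simply skipping the basis update. A minimal illustration, assuming V is the mixture spectrogram and W the pre-learned, concatenated basis:

```python
import numpy as np

def activations_fixed_basis(V, W, n_iter=20, eps=1e-9, seed=0):
    """Supervised NMF step: keep the pre-learned basis W fixed and
    update only the activations H with the KL multiplicative rule."""
    rng = np.random.default_rng(seed)
    H = rng.random((W.shape[1], V.shape[1])) + eps
    for _ in range(n_iter):
        # H <- H * (W^T (V/(WH))) / (column sums of W)
        H *= (W.T @ (V / (W @ H + eps))) / (W.sum(axis=0)[:, None] + eps)
    return H
```

With W = [W1 W2 W3], the rows of H corresponding to each block indicate the activation strength of that source's templates (e.g. F0s) over time.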
Supervised Learning
[Diagram: NMF on source 1 learns W1; NMF on source 2 learns W2; NMF on the mixture with W = [W1 W2] fixed yields H = [H1 H2]; the training-time H1 and H2 are thrown away]
Semi-supervised Learning
- Problem in supervised learning
- It is difficult to have training data of all individual sources.
- Unknown sources are mixed in the majority of real-world scenarios
- Semi-supervised Learning
- Learn spectral basis (i.e. dictionaries) for available sources, say, W1
- In testing phase, add new spectral basis W2 which explains the remaining
sources in the mixture
- Fix the trained W1 and update W2 only in the NMF iteration
Semi-supervised Learning
[Diagram: NMF on the known source learns W1; on the mixture, NMF runs with W = [W1 W2], where W2 is initialized with random numbers, yielding H = [H1 H2]; the training-time H1 is thrown away]
Unsupervised Learning
- We have no information about the individual sources
- Update both W and H for the mixture sound
- Need additional constraints
- Spectral harmonicity and smoothness on W (Vincent, 2010)
- Very difficult!
Unsupervised Learning
[Figure, Vincent 2010: supervised NMF (fixed W) vs. unsupervised NMF (adapted W) vs. unsupervised NMF with harmonicity & smoothness constraints on W]
Issues
- Number of basis vectors (K)
- Too small
- Reconstruction errors will increase
- The model underfits
- Too large
- Does not learn parts (the distribution of spectral basis vectors becomes sparse)
- The model becomes too general, so it can also explain other sources well
- Sparsity is often added to the activation in order to learn “parts”, for example:
min_{W,H ≥ 0} D(V ∥ WH) + λ‖H‖₁
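One common way to realize the L1 penalty in the KL multiplicative scheme is to add the sparsity weight to the denominator of the activation update. A hedged sketch (the value of `lam` is an assumption, and the exact regularized update varies across papers):

```python
import numpy as np

def sparse_kl_update_H(V, W, H, lam=0.1, eps=1e-9):
    """One multiplicative activation update for KL-NMF with an L1
    penalty lam * ||H||_1: the penalty simply enlarges the
    denominator, shrinking the activations toward zero."""
    return H * (W.T @ (V / (W @ H + eps))) / (W.sum(axis=0)[:, None] + lam + eps)
```

Larger `lam` yields sparser activations (each entry shrinks more per update), at the cost of a worse reconstruction.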
Joint Source Estimation
- Advantages
- Compositional model: applicable to any mixture
- Models can be extended well with additional constraints: e.g. source-filter
model, inharmonicity
- Limitations
- Can be computationally expensive: long inference time due to iterative updates
- Modeled pitches are usually discrete
Classification-Based Transcription
- Train a binary classifier for each note
- Each classifier is trained with two groups of audio features: one including
the note and the other not including it
- 88 classifiers for polyphonic piano transcription
[Diagram: audio features feed per-note classifiers (C4, C#4, …), each outputting on/off; in feature space, frames including C4 are separated from frames not including C4]
Classification-Based Transcription
- Often trained with real music data (not single notes)
- There are abundant MIDI files for classical piano music; it is easy to render
audio from them, e.g. using software synthesizers or player pianos
[Figure: MIDI piano roll and the corresponding audio spectrogram]
Classification-Based Transcription
- Audio features
- Auditory filter bank
- Spectrogram or Log-spectrogram
- Classifiers
- Support vector machines
- Neural Network
- Multi-label classification problem
- Approach #1: a separate binary classifier for each note; select
balanced training sets for each note
- Approach #2: cross-entropy between the binary label vector and the predicted
output (this is more commonly used)
[Diagram: neural network with input, hidden layers, and an 88-dimensional output]
Viterbi Decoding
- Temporal smoothing of predicted outputs
- A separate HMM for each note: binary state (note on/off)
- 88 initial state distributions (2×1) and transition probability matrices (2×2)
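The per-note smoothing can be sketched as Viterbi decoding of a 2-state HMM over the classifier's frame-wise probabilities. A minimal illustration; the initial and transition probabilities here are illustrative numbers, not values from the slides:

```python
import numpy as np

def viterbi_binary(p_on, p_init=(0.9, 0.1), trans=((0.99, 0.01), (0.02, 0.98))):
    """Viterbi decoding of a 2-state (off=0 / on=1) HMM, where p_on
    holds the classifier's frame-wise note-on probabilities."""
    T = len(p_on)
    p = np.asarray(p_on, dtype=float)
    emit = np.stack([1.0 - p, p])                 # (2, T) emission probs
    logA = np.log(np.asarray(trans))
    delta = np.log(np.asarray(p_init)) + np.log(emit[:, 0] + 1e-12)
    back = np.zeros((2, T), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + logA            # scores[i, j]: from i to j
        back[:, t] = np.argmax(scores, axis=0)
        delta = scores.max(axis=0) + np.log(emit[:, t] + 1e-12)
    states = np.zeros(T, dtype=int)
    states[-1] = int(np.argmax(delta))
    for t in range(T - 1, 0, -1):                 # back-track the best path
        states[t - 1] = back[states[t], t]
    return states
```

Running 88 such decoders independently, one per note, turns noisy frame-wise predictions into temporally consistent on/off segments.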
[Figure, Nam 2011: input spectrogram, hidden-layer activation, SVM output, and HMM-smoothed note on/off output]
References
- G. Widmer, “In Search of the Horowitz Factor”, 2003
- S. Dixon, “Live Tracking of Musical Performance Using On-line Time Warping”, 2005
- S. Ewert, “High Resolution Audio Synchronization Using Chroma Onset Features”, 2009
- A. Klapuri, “Multiple Fundamental Frequency Estimation Based on Harmonicity and Spectral Smoothness”, 2003
- T. Virtanen, “Monaural Sound Source Separation by Nonnegative Matrix Factorization with Temporal Continuity and Sparseness Criteria”, 2007
- E. Vincent, “Adaptive Harmonic Spectral Decomposition for Multiple Pitch Estimation”, 2010
- G. Poliner, “A Discriminative Model for Polyphonic Piano Transcription”, 2007
- D. Lee and H. Seung, “Algorithms for Non-negative Matrix Factorization”, 2000
- J. Nam, “A Classification-Based Polyphonic Piano Transcription Approach Using Learned Feature Representations”, 2011