
Polyphonic Music Transcription Non-negative Matrix Factorization - PowerPoint PPT Presentation

GCT634: Musical Applications of Machine Learning Polyphonic Music Transcription Non-negative Matrix Factorization Graduate School of Culture Technology, KAIST Juhan Nam Outlines Introduction Score-Audio Alignment Multi-Pitch


  1. GCT634: Musical Applications of Machine Learning Polyphonic Music Transcription Non-negative Matrix Factorization Graduate School of Culture Technology, KAIST Juhan Nam

  2. Outline • Introduction • Score-Audio Alignment • Multi-Pitch Estimation • Non-negative Matrix Factorization (NMF)

  3. Polyphonic Music Transcription • Converting an acoustic musical signal into some form of music notation - MIDI piano roll, staff notation - Note information: pitch, onset, offset, loudness (Diagram: Input → Model → Output)

  4. Related Tasks • Multi-pitch estimation - Single source: piano, guitar - Multiple sources: quartet (woodwind, string) • Predominant F0 estimation - Melody extraction, singing melody • Drum transcription - Kick, snare, hi-hat • Let’s listen to a piece and try to transcribe (hum) the

  5. Two Directions • Performance transcription - Detecting exact timing and dynamics of notes (micro-timing with 10ms resolution or so) - Frame-level: onset, offset, intensity - Piano-roll notation is usually used (performance score) • Score transcription - Transform performance into staff notation - Note-level: tempo, beat, downbeat - Rhythmic transcription (tempo, beat, downbeat) → Temporal quantization - Expression detection (pedal, articulation), often phrase-level - Instrument identification - Very challenging

  6. Score and Performance MIDI (score) Valentina Lisitsa Vladimir Horowitz

  7. Where Are The Differences? • Tempo - Note-level, (note onset/offset timings), phrase-level, song-level • Dynamics - Note-level, (note velocity), phrase-level, song-level • Different interpretation of musical expressions in score - Temporal: ritardando, rubato - Dynamics: piano, forte, crescendo, … - Play techniques or articulation: legato, staccato - Mood and emotion: dolce, grazioso

  8. Score-to-Audio Alignment • Temporal alignment between score and audio from a piece of music - Audio-to-audio and MIDI-to-MIDI (either one is performance) are possible • Why do we synchronize them? - Automatic page turning - Performance analysis - Score following - Auto-accompaniment [Müller]

  9. Algorithm Overview • Choose feature representations to compare - Often, MIDI is converted to audio so that both are aligned in the same feature space • Compute a similarity matrix between the two feature sequences - All possible combinations of local feature pairs • Find the path that gives the best alignment on the similarity matrix - Dynamic Time Warping (DTW) (Diagram: Feature Seq. #1 and Feature Seq. #2 → Similarity Matrix (compute the local similarity) → Dynamic Programming (find the best path))

  10. Feature Representations • Audio feature representations - A frequent choice for piano music is chroma (Figure: chroma features of the MIDI and Lisitsa versions) CENS: Normalized Chroma Features (Müller, 2005)

  11. Similarity Matrix • Similarity between every pair of frame-level features - Euclidean or cosine distance
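A minimal sketch of the pairwise comparison on this slide, assuming each feature sequence is stored as a NumPy array with one frame per row (for example, 12-dimensional chroma vectors). The function name `cosine_distance_matrix` and the toy arrays are illustrative and do not come from the slides.

```python
import numpy as np

def cosine_distance_matrix(X, Y, eps=1e-12):
    """Return C with C[n, m] = 1 - cos(X[n], Y[m]).

    X has shape (N, d) and Y has shape (M, d): one feature vector per frame.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    # Normalize every frame to unit length; eps guards against all-zero frames.
    Xn = X / np.maximum(np.linalg.norm(X, axis=1, keepdims=True), eps)
    Yn = Y / np.maximum(np.linalg.norm(Y, axis=1, keepdims=True), eps)
    # The dot product of unit vectors is the cosine similarity.
    return 1.0 - Xn @ Yn.T

# Toy sequences: A has 2 frames, B has 3 frames, 3 feature dimensions each.
A = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
B = np.array([[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
C = cosine_distance_matrix(A, B)
```

Low values in `C` mark frame pairs that sound alike, so an alignment path should follow the valleys of this matrix.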

  12. Finding the Optimal Path • There are so many possible paths from one corner to another (Figure: similarity matrix; axes: Schumann − Traumerei − Lisitsa and Schumann − Traumerei − MIDI, in frames)

  13. 3D Surface Plot of Similarity Matrix • Finding the optimal path is analogous to finding the trail that takes the least effort on a hike.

  14. Dynamic Time Warping • Finding an (N, M)-warping path of length L - P = (p1, p2, p3, ..., pL) where pi = (ni, mi) • Three conditions - Boundary condition: p1 = (1,1), pL = (N,M) - Monotonicity condition: n1 <= n2 <= … <= nL and m1 <= m2 <= … <= mL - Step size condition: move only upward, rightward, or diagonally (upper-right) [Müller]
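The three conditions can be checked mechanically. The sketch below assumes a path is given as a list of 1-based `(n, m)` tuples, as on the slide; the helper name `is_valid_warping_path` is hypothetical. Because every allowed step is non-decreasing in both indices, the step size check also enforces monotonicity.

```python
def is_valid_warping_path(path, N, M):
    """Check the boundary, monotonicity and step size conditions of a path."""
    # Boundary condition: start at (1, 1) and end at (N, M).
    if not path or path[0] != (1, 1) or path[-1] != (N, M):
        return False
    # Step size condition: one step up, right, or diagonally (upper-right).
    allowed_steps = {(1, 0), (0, 1), (1, 1)}
    for (n0, m0), (n1, m1) in zip(path, path[1:]):
        if (n1 - n0, m1 - m0) not in allowed_steps:
            return False
    return True

good_path = [(1, 1), (2, 2), (2, 3), (3, 4)]
jumping_path = [(1, 1), (3, 3), (3, 4)]    # skips a cell: violates step size
backward_path = [(1, 1), (2, 2), (1, 3), (2, 4), (3, 4)]  # n decreases
```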

  15. Dynamic Time Warping : Bad Examples [Müller]

  16. Dynamic Programming for DTW • Algorithm - Initialization: D(n,1) = sum(C(1:n,1)), n = 1…N; D(1,m) = sum(C(1,1:m)), m = 1…M - Recurrence relation: for each m = 2…M, for each n = 2…N, D(n,m) = C(n,m) + min{ D(n-1,m), D(n,m-1), D(n-1,m-1) } - Termination: D(N,M) is the DTW distance
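The initialization, recurrence and termination above can be sketched as follows. This is a plain-Python illustration, not the lecture's own code: indices are 0-based (so `D[N-1][M-1]` plays the role of D(N,M)), the function name `dtw` is an assumption, and the back-tracking step with its tie-breaking rule (prefer the diagonal) is one reasonable choice among several.

```python
def dtw(C):
    """Return (distance, path) for a cost matrix C given as a list of rows."""
    N, M = len(C), len(C[0])
    D = [[0.0] * M for _ in range(N)]
    # Initialization: cumulative sums along the first column and first row.
    D[0][0] = C[0][0]
    for n in range(1, N):
        D[n][0] = D[n - 1][0] + C[n][0]
    for m in range(1, M):
        D[0][m] = D[0][m - 1] + C[0][m]
    # Recurrence: local cost plus the cheapest of the three predecessors.
    for n in range(1, N):
        for m in range(1, M):
            D[n][m] = C[n][m] + min(D[n - 1][m], D[n][m - 1], D[n - 1][m - 1])
    # Back-tracking from the end cell to recover the optimal path.
    n, m = N - 1, M - 1
    path = [(n, m)]
    while (n, m) != (0, 0):
        if n == 0:
            m -= 1
        elif m == 0:
            n -= 1
        else:
            candidates = [(D[n - 1][m - 1], n - 1, m - 1),
                          (D[n - 1][m], n - 1, m),
                          (D[n][m - 1], n, m - 1)]
            # min() keeps the first minimum, so ties favor the diagonal step.
            _, n, m = min(candidates, key=lambda c: c[0])
        path.append((n, m))
    path.reverse()
    return D[N - 1][M - 1], path

# Toy cost matrix: 2 frames against 3 frames.
toy_C = [[0.0, 0.0, 1.0],
         [1.0, 1.0, 0.0]]
distance, path = dtw(toy_C)
```

Filling `D` costs O(N·M) time and memory, and the path is only available once the last cell has been computed.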

  17. Dynamic Programming for DTW • Toy Example Similarity Matrix ( C ) Accumulated cost ( D ) [Müller]

  18. Score and Audio Alignment by DTW D(i,j) C(i,j)

  19. Limitations • The optimal path is obtained only after we arrive at the destination (by back-tracking) - In other words, DTW works offline - What if the sequences are very long? - Online version of DTW? • Every frame is treated as equally important - In general, humans are more sensitive to note onsets - Perceptually, not every frame is equally important

  20. Online DTW • Set a moving search window and calculate the cost only within the window - Time and space cost: quadratic → linear • The movement is determined by the position that gives a minimum cost within the current window. If the position is ... - Corner: move both up and right (alternatively) - Upper edge: move up - Right edge: move right Figure 2: An example of the on-line time warping algorithm with search window c = 4, showing the order of evaluation for a particular sequence of row and column increments. The axes represent the variables t and j (see Figure 1) respectively. All calculated cells are framed in bold, and the optimal path is coloured grey. [Dixon, 2005]

  21. Automatic Page Turner (JKU, Austria)

  22. Onset-sensitive Alignment • We are sensitive to the time alignment on note onsets. - The similarity matrix gives no additional weight to onsets • DLNCO Features - Decaying Locally-adapted Normalized Chroma Onset - Capture only onset strength on chroma features - Normalize onset energy and note length (by artificially-created note tail) [Ewert, 2009]

  23. Demo: PerformScore • https://jdasam.github.io/PerformScore/

  24. Multi-pitch Estimation • Two types of polyphonic settings - Polyphonic instruments: piano, guitar - Ensemble of monophonic instruments: woodwind quintet, string quartet, chorale • Three levels of subtasks - First-level: frame-wise estimation of pitches and polyphony (number of notes) - Second-level: tracking pitch within a note based on temporal continuity - Third-level: tracking notes for each sound source, usually for ensembles of monophonic instruments

  25. Challenges • Many sources are mixed and played simultaneously - They are likely to be harmonically related in music - Some sources can be masked by others - Content changes continuously by musical expressions (e.g. vibrato) • Compromises - Transcribe as many source sounds as possible - Only dominant sources: melody, bass, drum

  26. Frame-wise Multi-pitch Estimation • Three categories of approaches - Iterative F0 search: repeatedly finds the predominant F0 and removes its related sources - Joint source estimation: examines possible combinations of multiple sources, e.g., NMF - Classification-based approach: no prior knowledge of musical acoustics, relies only on supervised learning

  27. Iterative F0 estimation • Based on repeated cancellation of harmonic overtones of detected F0s (Klapuri, 2003) • Procedure: set the residual to the original, then 1. Detect the predominant F0, based on the harmonic sieve method 2. Apply spectral smoothing to the harmonics of the detected F0 3. Cancel the smoothed harmonics from the residual 4. Repeat steps 2 & 3 until the residual is sufficiently flat (Diagram: F0 detection on the residual Y_R(k); the detected sound Y_D(k) is cancelled from the mixture: Y_R(k) ← max(Y_R(k) − d·Y_D(k), 0))
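The detect, smooth and cancel loop can be illustrated with a deliberately simplified sketch. It is not Klapuri's method: a plain harmonic sum stands in for the harmonic sieve, a median clip stands in for spectral smoothing, and F0 candidates are integer bin indices. The function names, the stopping threshold `min_salience` and the synthetic spectrum are all assumptions made for illustration. Only the cancellation line follows the slide's update Y_R(k) ← max(Y_R(k) − d·Y_D(k), 0).

```python
import numpy as np

def harmonic_bins(f0_bin, n_harmonics, n_bins):
    """Bin indices of the first harmonics of f0_bin that fit in the spectrum."""
    bins = f0_bin * np.arange(1, n_harmonics + 1)
    return bins[bins < n_bins]

def iterative_f0(spectrum, candidates, n_harmonics=4, d=1.0,
                 min_salience=0.5, max_notes=5):
    residual = np.asarray(spectrum, dtype=float).copy()  # residual := original
    detected = []
    for _ in range(max_notes):
        # Detect the predominant F0 (assumption: harmonic sum as salience).
        saliences = [residual[harmonic_bins(f0, n_harmonics, len(residual))].sum()
                     for f0 in candidates]
        best = int(np.argmax(saliences))
        if saliences[best] < min_salience:
            break  # the residual is flat enough, stop
        f0 = candidates[best]
        bins = harmonic_bins(f0, n_harmonics, len(residual))
        # Crude smoothing (assumption): clip each harmonic at the median, so
        # energy shared with another note is not removed completely.
        smoothed = np.minimum(residual[bins], np.median(residual[bins]))
        # Cancel: Y_R(k) <- max(Y_R(k) - d * Y_D(k), 0).
        residual[bins] = np.maximum(residual[bins] - d * smoothed, 0.0)
        detected.append(f0)
    return detected, residual

# Synthetic mixture: a loud note at bin 10 and a softer one at bin 15.
# Their harmonics overlap at bin 30.
spectrum = np.zeros(100)
for h in range(1, 5):
    spectrum[10 * h] += 1.0
    spectrum[15 * h] += 0.6
detected, residual = iterative_f0(spectrum, candidates=list(range(8, 21)))
```

In this toy case the smoothing step is what keeps the second note detectable: without the median clip, cancelling the first note would wipe out bin 30 entirely.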

  28. Iterative F0 estimation Spectral Smoothness Iterative Estimation Spectral Smoothness ECE 477 - Computer Audition, Zhiyao Duan 2014

  29. Iterative F0 estimation • Advantages - Deterministic: uses only signal processing, with no data-driven training - Can handle inharmonicity (e.g. piano) and vibrato • Limitations - F0 estimation becomes unreliable as the number of iterations increases - Spectral smoothing is not accurate enough

  30. Joint Source Estimation • Based on a model for the sound mixture - All sources compete with each other to explain the mixture, and the subset that is most likely is selected - The number of sources is limited - Non-negative matrix factorization (NMF) has been most widely explored

  31. Joint Source Estimation • How many spectral templates can explain the source ?

  32. Joint Source Estimation • We can explain the spectrogram with three spectral bases (𝑋) and corresponding activations (𝐼) • Can we decompose 𝑊 into 𝑋 and 𝐼 automatically? (Figure: 𝑊 ≈ 𝑋𝐼)

  33. Non-negative Matrix Factorization (NMF) • A matrix factorization algorithm in which all elements are non-negative - 𝑊 (𝑁 x 𝑂 matrix): original data (e.g. spectrogram) - 𝑋 (𝑁 x 𝐿 matrix): 𝐿 basis vectors (e.g. dictionary) - 𝐼 (𝐿 x 𝑂 matrix): activation matrix (e.g. weights or gains) • Note that this provides a compressed representation. - A low-rank approximation: 𝑊 ≈ 𝑋𝐼

  34. Algorithm for NMF • 𝑊 is known, and 𝑋 and 𝐼 are unknown. How can we estimate them? • Alternate the estimation (similar to the EM algorithm) - Start with a random 𝑋 - Estimate 𝐼 given 𝑋 - Estimate a new 𝑋 given 𝐼 - Repeat until convergence
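One common way to realize this alternation is with the Lee and Seung multiplicative updates for the squared Euclidean error. The slide does not name an update rule, so that choice is an assumption here. In the sketch, `data`, `basis` and `activations` correspond to 𝑊, 𝑋 and 𝐼 on the slides; the function name `nmf`, the iteration count, the `eps` safeguard and the toy matrices are illustrative. Both factors start from random values, since the activations need some starting point too.

```python
import numpy as np

def nmf(data, n_basis, n_iter=200, eps=1e-9, seed=0):
    """Factorize a non-negative matrix: data ~= basis @ activations."""
    rng = np.random.default_rng(seed)
    n_rows, n_cols = data.shape
    # Random non-negative start.
    basis = rng.random((n_rows, n_basis)) + eps
    activations = rng.random((n_basis, n_cols)) + eps
    errors = []
    for _ in range(n_iter):
        # Estimate the activations given the current basis.
        activations *= (basis.T @ data) / (basis.T @ basis @ activations + eps)
        # Estimate a new basis given the current activations.
        basis *= (data @ activations.T) / (basis @ activations @ activations.T + eps)
        # Track the Frobenius norm of the reconstruction error.
        errors.append(float(np.linalg.norm(data - basis @ activations)))
    return basis, activations, errors

# Toy "spectrogram": 3 frequency bins, 4 frames, built from 2 templates.
true_basis = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
true_activations = np.array([[1.0, 0.0, 2.0, 0.5], [0.0, 1.0, 1.0, 0.5]])
data = true_basis @ true_activations
basis, activations, errors = nmf(data, n_basis=2)
```

Because each update multiplies by a ratio of non-negative terms, the factors stay non-negative without any explicit projection. A fixed iteration count stands in for a convergence test; in practice one might stop when `errors` stops decreasing.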
