 
GCT634: Musical Applications of Machine Learning
Rhythm Transcription, Dynamic Programming
Graduate School of Culture Technology, KAIST
Juhan Nam
Outline
• Overview of Automatic Music Transcription (AMT)
  - Types of AMT Tasks
• Rhythmic Transcription
  - Introduction
  - Onset detection
  - Tempo Estimation
• Dynamic Programming
  - Beat Tracking
Overview of Automatic Music Transcription (AMT)
• Predicting musical score information from audio
  - The primary score information is notes, which are arranged according to rhythm, harmony, and structure
  - Analogous to automatic speech recognition (ASR) for speech signals
[Figure: a model that outputs onsets, beat, tempo, chord, key, and structure]
Types of AMT Tasks
• Rhythm transcription
  - Onset detection
  - Tempo estimation
  - Beat tracking
• Note transcription
  - Monophonic note
  - Polyphonic note
  - Expression detection (e.g. vibrato, pedal)
• Tonal analysis
  - Key estimation
  - Chord recognition
• Structure analysis
  - Musical structure
  - Musical boundary / repetition detection
  - Highlight detection
• Timbre analysis
  - Instrument identification
We will mainly focus on these topics!
Overview of AMT Systems
• Acoustic model
  - Estimates the target information from the input audio (usually a short segment)
• Musical knowledge
  - Music theory (e.g. rhythm, harmony) and performance (e.g. playability)
• Prior/Lexical model
  - Statistical distribution of score-level music information (e.g. chord progressions)
[Figure: block diagram with an acoustic model, a transcription model, musical knowledge, and a prior or lexical model, spanning the audio level to the score level; outputs: beat, tempo, key, chords, notes]
Introduction to Rhythm
• Rhythm
  - A strong, regular, and repeated pattern of sound
  - Distinguishes music from speech
• The most primitive and foundational element of music
  - Melody, harmony, and other musical elements are arranged on the basis of rhythm
• Humans and rhythm
  - Humans have an innate ability to perceive rhythm: heartbeat, walking
  - Associated with motor control: dance, labor songs
Introduction to Rhythm
• Hierarchical structure of rhythm
  - Beat (tactus): the most prominent level, the foot-tapping rate
  - Division (tatum): the temporal atom, e.g. eighth or sixteenth notes
  - Measure (bar): the unit of rhythmic pattern (and also of harmonic change)
• Notations
  - Tempo: beats per minute, e.g. 90 bpm
  - Time signature: e.g. 4/4, 3/4, 6/8
[Figure source: Wikipedia]
Human Perception of Tempo
• McKinney and Moelants (2006)
  - Collected tapping data from 40 human subjects
  - Initial synchronization delay and anticipation (by tempo estimation)
  - Ambiguity in tempo: the beat or its division?
[Figure source: D. Ellis' e4896 slides]
Overview of Rhythm Transcription Systems
• Consists of several cascaded tasks that detect moments of musical stress (accents) and their regularity
[Figure: block diagram with onset detection, tempo estimation, and beat tracking, together with musical knowledge]
Onset Detection
• Identify the starting times of musical events
  - Notes, drum sounds
• Types of onsets
  - Hard onsets: percussive sounds
  - Soft onsets: source-driven sounds (e.g. singing voice, woodwinds, bowed strings)
[Figure source: M. Müller]
Example: Onset Detection
[Figure: waveform of “Eat (꺼내먹어요)” by Zion.T; axes: time [sec] vs. amplitude]
Onset Detection Systems
[Figure: block diagram with audio representations, onset detection function (feature extraction), and decision algorithm (classifier)]
• Onset detection function (ODF)
  - Instantaneous measure of temporal change, often called the “novelty” function
  - Types: time-domain energy, spectral or sub-band energy, phase difference
• Decision algorithm
  - Rule-based approach
  - Learning-based approach
Onset Detection Function (ODF)
• Types of ODFs
  - Time-domain energy
  - Spectral or sub-band energy
  - Phase difference
Time-Domain Onset Detection
• Local energy
  - Usually high at onsets
  - Effective for percussive sounds
• Various versions
  - Frame-level energy
    $\mathrm{ODF}(n) = E(n) = \sum_{m=-N}^{N} \left| x(n+m)\, w(m) \right|^{2}$
  - Half-wave rectification
    $\mathrm{ODF}(n) = H\big( E(n+1) - E(n) \big)$
    $H(r) = \dfrac{r + |r|}{2} = \begin{cases} r, & r \ge 0 \\ 0, & r < 0 \end{cases}$
[Figure: three panels over time [sec]: waveform (amplitude), frame-level energy ODF, half-wave rectified ODF]
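The energy-based ODF can be sketched in a few lines of plain Python. This is a minimal illustration, not the lecture's implementation: the function names, the rectangular window, and the start-aligned (rather than centered) framing are assumptions made for brevity.

```python
def frame_energy(x, win, hop):
    """Local energy E(n): sum of squared, windowed samples per frame.

    Assumption: frames start at multiples of `hop` instead of being centered.
    """
    energies = []
    for start in range(0, len(x) - len(win) + 1, hop):
        energies.append(sum((x[start + m] * win[m]) ** 2 for m in range(len(win))))
    return energies


def half_wave_rectify(r):
    """H(r) = (r + |r|) / 2: keeps increases, maps decreases to zero."""
    return (r + abs(r)) / 2


def energy_odf(x, win, hop):
    """ODF(n) = H(E(n+1) - E(n))."""
    e = frame_energy(x, win, hop)
    return [half_wave_rectify(e[n + 1] - e[n]) for n in range(len(e) - 1)]


# Toy signal: silence, a burst of constant amplitude, then silence again.
signal = [0.0] * 8 + [1.0] * 8 + [0.0] * 8
odf = energy_odf(signal, win=[1.0] * 4, hop=4)
print(odf)  # only the rise in energy at the start of the burst survives
```

The fall in energy at the end of the burst is rectified to zero, which is why half-wave rectification is preferred over the raw difference for onsets.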
Spectral-Based Onset Detection
• Spectral flux
  - Sum of the positive differences between successive frames of the log spectrogram
  - The ODF changes depending on the amount of compression $\rho$
    $Y(n,k) = \log\big( 1 + \rho\, |X(n,k)| \big)$, where $X(n,k)$ is the STFT
    $\mathrm{ODF}(n) = \sum_{k=0}^{N-1} H\big( Y(n+1,k) - Y(n,k) \big)$, where $H(r) = \max(r, 0)$ is half-wave rectification
[Figure: two panels over time [sec]: log spectrogram (frequency [kHz]) and the spectral-flux ODF]
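A minimal sketch of spectral flux, assuming the STFT magnitudes are already available as a list of frames (each a list of bin magnitudes). The names `log_compress`, `spectral_flux`, and `rho` are illustrative, and the toy spectrogram is made up for the example.

```python
import math


def log_compress(mag, rho=1.0):
    """Y(n, k) = log(1 + rho * |X(n, k)|) for every frame n and bin k."""
    return [[math.log(1.0 + rho * v) for v in frame] for frame in mag]


def spectral_flux(mag, rho=1.0):
    """ODF(n) = sum over bins of the positive part of Y(n+1, k) - Y(n, k)."""
    y = log_compress(mag, rho)
    flux = []
    for n in range(len(y) - 1):
        flux.append(sum(max(y[n + 1][k] - y[n][k], 0.0) for k in range(len(y[n]))))
    return flux


# Toy magnitude spectrogram: energy appears in two bins, is held, then disappears.
mag = [
    [0.0, 0.0, 0.0],
    [1.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 0.0],
]
flux = spectral_flux(mag)
print(flux)  # non-zero only where energy appears
```

Raising `rho` compresses the magnitudes more strongly, which makes weak spectral changes contribute relatively more to the ODF.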
Phase Deviation
• The sinusoidal components of a note are continuous while the note is sustained
  - An abrupt change in phase suggests that there may be a new event
• Phase continuation (e.g. during the sustain of a single note)
    $\phi_k(n) - \phi_k(n-1) \approx \phi_k(n-1) - \phi_k(n-2)$
    $\Delta\phi_k(n) = \phi_k(n) - 2\phi_k(n-1) + \phi_k(n-2) \approx 0$
• Deviation from the steady state, averaged over all frequency bins
    $\zeta_p = \dfrac{1}{N} \sum_{k=1}^{N} \left| \Delta\phi_k(n) \right|$
[Source: D. Ellis' e4896 slides]
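The phase-deviation measure can be sketched as follows, assuming the STFT phases are given as a list of frames (each a list of per-bin phases in radians). Wrapping the second difference to the principal range is an added assumption that the slide does not spell out; the function names are illustrative.

```python
import math


def princarg(phi):
    """Wrap an angle to the interval [-pi, pi)."""
    return (phi + math.pi) % (2 * math.pi) - math.pi


def phase_deviation(phase):
    """Mean absolute second difference of phase over bins, for frames n >= 2."""
    dev = []
    for n in range(2, len(phase)):
        total = 0.0
        for k in range(len(phase[n])):
            d2 = phase[n][k] - 2 * phase[n - 1][k] + phase[n - 2][k]
            total += abs(princarg(d2))
        dev.append(total / len(phase[n]))
    return dev


# Two bins with steadily advancing phase; bin 0 jumps by an extra 1 rad at frame 3.
phase = [
    [0.0, 0.0],
    [0.5, 1.0],
    [1.0, 2.0],
    [2.5, 3.0],
    [3.0, 4.0],
]
dev = phase_deviation(phase)
print(dev)  # near zero during the steady state, larger around the jump
```

Because the measure is a second difference, a single phase jump affects two consecutive output values.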
Post-Processing
• DC removal
  - Subtract the mean of the ODF
• Normalization
  - Scale the level of the ODF
• Low-pass filtering
  - Remove small peaks
• Down-sampling
  - For data reduction
[Figure: ODF after low-pass filtering (solid line) (Tzanetakis, 2010)]
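The four post-processing steps can be chained as in the sketch below. The specific choices (peak normalization, a moving-average low-pass filter, and plain decimation) are assumptions for illustration; the slide names the steps but not the filters.

```python
def postprocess(odf, smooth_len=3, down=2):
    """DC removal, normalization, moving-average low-pass, then down-sampling."""
    mean = sum(odf) / len(odf)
    centered = [v - mean for v in odf]                 # DC removal

    peak = max(abs(v) for v in centered) or 1.0        # avoid dividing by zero
    normalized = [v / peak for v in centered]          # scale into [-1, 1]

    half = smooth_len // 2
    smoothed = []
    for n in range(len(normalized)):                   # moving-average low-pass
        window = normalized[max(0, n - half): n + half + 1]
        smoothed.append(sum(window) / len(window))

    return smoothed[::down]                            # keep every `down`-th value


raw_odf = [0.0, 0.0, 4.0, 0.0, 0.0, 0.0, 4.0, 0.0]
clean_odf = postprocess(raw_odf)
print(clean_odf)
```

In practice a low-pass filter should precede down-sampling, as it does here, so that the decimation does not alias fast fluctuations into the reduced ODF.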
Onset Decision Algorithm
• Rule-based approach: peak detection rule
  - Peaks above a threshold are selected as onsets
  - The threshold is often computed adaptively from the ODF
  - A local mean or median is a popular choice for computing the threshold
    $\text{threshold} = \alpha + \beta \cdot \operatorname{median}(\mathrm{ODF})$, where $\alpha$ is the offset and $\beta$ is the scaling
[Figure: ODF and adaptive threshold over time [sec]; median with window size 5]
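The peak detection rule can be sketched with a running median as the adaptive threshold. The window size of 5 follows the figure caption; the function name, the local-maximum test, and the values of `alpha` and `beta` are assumptions chosen for the toy example.

```python
from statistics import median


def pick_onsets(odf, alpha=0.0, beta=1.0, win=5):
    """Return frame indices that are local peaks above alpha + beta * median."""
    half = win // 2
    onsets = []
    for n in range(1, len(odf) - 1):
        local = odf[max(0, n - half): n + half + 1]
        threshold = alpha + beta * median(local)       # adaptive threshold
        is_peak = odf[n] > odf[n - 1] and odf[n] >= odf[n + 1]
        if is_peak and odf[n] > threshold:
            onsets.append(n)
    return onsets


# Toy ODF: one small bump (frame 1) and two strong peaks (frames 4 and 8).
toy_odf = [0, 1, 0, 0, 5, 1, 0, 0, 6, 0, 0]
onsets = pick_onsets(toy_odf, alpha=1.0, beta=1.0, win=5)
print(onsets)  # the small bump does not exceed the threshold
```

The offset `alpha` guards against picking noise in quiet passages where the median is close to zero, while `beta` scales the threshold with the local ODF level.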
Challenging Issue in Onset Detection: Vibrato
[Figure: onset detection using spectral flux]
SuperFlux
• A state-of-the-art rule-based onset detection function
  - S. Böck et al., “Maximum Filter Vibrato Suppression for Onset Detection”, DAFx, 2013
• Step 1: log-spectrogram
  - Makes harmonic partials have the same depth of vibrato contours
    $Y(n,m) = \log\big( 1 + |X(n,k)| \cdot F(k,m) \big)$, where $X(n,k)$ is the STFT and $F(k,m)$ is a filterbank mapping bin $k$ to band $m$
• Step 2: max-filtering
  - Take the maximum in a window on the frequency axis
  - The vibrato contours become thicker
    $Y^{\max}(n,m) = \max\big( Y(n,\, m-l : m+l) \big)$
SuperFlux
[Figure: log-spectrogram and max-filtered log-spectrogram]
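SuperFlux
• Step 3: SuperFlux
  - Take the difference between frames that are $\mu$ frames apart rather than adjacent
  - Assumption: the frame rate is high in onset detection (i.e. small hop size)
    $SF^{*}(n) = \sum_{k=0}^{N-1} H\big( Y(n+\mu, k) - Y(n,k) \big)$
    $\mu = \max\Big( 1, \Big\lfloor \dfrac{N/2 - \min\{ n \mid w(n) > r \}}{h} + 0.5 \Big\rfloor \Big)$, $0 \le r \le 1$, where $w$ is the analysis window of length $N$ and $h$ is the hop size
• Step 4: peak-picking: frame $n$ is an onset if
  - 1) $SF^{*}(n) = \max\big( SF^{*}(n - \mathrm{pre\_max} : n + \mathrm{post\_max}) \big)$
  - 2) $SF^{*}(n) \ge \operatorname{mean}\big( SF^{*}(n - \mathrm{pre\_avg} : n + \mathrm{post\_avg}) \big) + \delta$
  - 3) $n - n_{\text{previous onset}} > \text{combination width}$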
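The max-filtering, lagged difference, and peak-picking steps can be sketched together in plain Python. This is a rough illustration under stated assumptions: the input is an already log-compressed, band-filtered spectrogram given as a list of frames; the difference is taken between the current frame and the max-filtered frame `mu` frames earlier, which is how the cited paper arranges it; all parameter values and function names are illustrative.

```python
def max_filter_freq(y, width=1):
    """Per frame, replace each band by the max over bands m-width .. m+width."""
    return [
        [max(frame[max(0, m - width): m + width + 1]) for m in range(len(frame))]
        for frame in y
    ]


def superflux(y, mu=1, width=1):
    """Sum of positive differences between frame n and max-filtered frame n - mu."""
    y_max = max_filter_freq(y, width)
    sf = [0.0] * len(y)
    for n in range(mu, len(y)):
        sf[n] = sum(max(y[n][m] - y_max[n - mu][m], 0.0) for m in range(len(y[n])))
    return sf


def pick_peaks(sf, pre_max=1, post_max=1, pre_avg=2, post_avg=0, delta=0.1, combine=1):
    """Apply the three peak-picking conditions frame by frame."""
    onsets = []
    for n in range(len(sf)):
        local_max = max(sf[max(0, n - pre_max): n + post_max + 1])
        avg_win = sf[max(0, n - pre_avg): n + post_avg + 1]
        local_mean = sum(avg_win) / len(avg_win)
        far_enough = not onsets or n - onsets[-1] > combine
        if sf[n] == local_max and sf[n] >= local_mean + delta and far_enough:
            onsets.append(n)
    return onsets


# A partial wobbling between bands 0 and 1 (vibrato), then a new partial in band 3.
y = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 1.0],
]
plain = superflux(y, mu=1, width=0)   # width 0 disables the max filter
sf = superflux(y, mu=1, width=1)
print(plain, sf, pick_peaks(sf))
```

Without the max filter, every wobble of the partial registers as flux; with it, only the genuinely new partial in the last frame does.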
SuperFlux
[Figure: max-filtered log-spectrogram and peak-picking result]
Tempo Estimation
• Estimate a regular time interval between beats
  - Tempo is a global attribute of a song: e.g. bpm, or a “mid-tempo” song
• Tempo often changes within a song
  - Intentionally: e.g. for dramatic effect (Top 10 tempo changes)
  - Unintentionally: e.g. re-mastering, live performance
• There are also local tempo changes: e.g. rubato
Tempo Estimation Methods
• Auto-correlation
  - Find the periodicity, as in pitch detection
• Discrete Fourier transform
  - Apply the DFT to the ODF and find the periodicity
• Comb-filter banks
  - Leverage the “oscillating nature” of musical beats
Auto-Correlation
• The auto-correlation function (ACF) is a generic method for detecting the periodicity of a signal
  - Thus, it can be applied to the ODF to find a dominant period that may correspond to the tempo
  - The ACF shows dominant peaks that indicate the dominant tempi
[Figure: two panels over time [sec]: onset detection function (spectral flux) and its auto-correlation]
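ACF-based tempo estimation can be sketched as follows. The function names, the 60 to 200 bpm search range, and the synthetic ODF are assumptions for illustration; a real ODF would come from one of the onset detection functions.

```python
def autocorrelation(odf, max_lag):
    """Unnormalized ACF of the mean-removed ODF for lags 0 .. max_lag."""
    mean = sum(odf) / len(odf)
    x = [v - mean for v in odf]
    return [
        sum(x[n] * x[n + lag] for n in range(len(x) - lag))
        for lag in range(max_lag + 1)
    ]


def estimate_tempo(odf, frame_rate, min_bpm=60.0, max_bpm=200.0):
    """Pick the lag with the largest ACF value inside the bpm range."""
    min_lag = int(round(60.0 * frame_rate / max_bpm))
    max_lag = int(round(60.0 * frame_rate / min_bpm))
    acf = autocorrelation(odf, max_lag)
    best_lag = max(range(min_lag, max_lag + 1), key=lambda lag: acf[lag])
    return 60.0 * frame_rate / best_lag


# Synthetic ODF: one impulse every 50 frames at 100 frames/sec, i.e. a 0.5 s period.
frame_rate = 100.0
pulse_odf = [1.0 if n % 50 == 0 else 0.0 for n in range(1000)]
bpm = estimate_tempo(pulse_odf, frame_rate)
print(bpm)
```

The ACF also peaks at integer multiples of the beat period (here at lag 100 as well as lag 50), which is one source of the beat-versus-division ambiguity in tempo; restricting the search to a plausible bpm range is a simple way to choose among the candidates.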