GCT634: Musical Applications of Machine Learning
Audio Representations
Graduate School of Culture Technology, KAIST
Juhan Nam
Outline
- Time-domain Representation
- Sampling
- Quantization
- Time-Frequency Representations
- Short-time Fourier Transform
- Spectrogram
- Mel-Spectrogram
- Constant-Q transforms
- Auditory Filterbank
Music Representations
- Audio
- Score
Time-Domain Representation of Audio
- Musical sounds arrive at a microphone as vibrations of air.
- In a computer, they are converted into a sequence of binary values via sampling and quantization.
… 0 1 1 0 1 1 0 …
Sampling
- What is an appropriate sampling rate?
- Too high: the data rate increases
- Too low: the original signal becomes hard to reconstruct
- Sampling Theorem
- The sampling rate must be greater than twice the maximum frequency in the signal in order to reconstruct it fully (i.e., to avoid aliasing): $f_s > 2 \cdot f_m$
- $f_s$: sampling rate, $f_m$: maximum frequency
- Half the sampling rate ($f_s / 2$) is called the Nyquist frequency
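The sampling theorem can be checked numerically. The following is a minimal sketch, assuming an illustrative sampling rate of 8 kHz and test tones at 3 kHz and 5 kHz (none of these values come from the slides). A tone above the Nyquist frequency folds back to $f_s - f$ and becomes indistinguishable from a lower tone.

```python
import numpy as np

def sample_sine(freq_hz, sample_rate, duration=1.0):
    """Sample a sine of the given frequency at the given rate."""
    n = np.arange(int(sample_rate * duration))
    return np.sin(2 * np.pi * freq_hz * n / sample_rate)

def dominant_frequency(signal, sample_rate):
    """Return the frequency (Hz) of the strongest bin in the magnitude spectrum."""
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    return freqs[np.argmax(spectrum)]

fs = 8000                          # illustrative sampling rate; Nyquist = 4000 Hz
below = sample_sine(3000, fs)      # below Nyquist: frequency is preserved
above = sample_sine(5000, fs)      # above Nyquist: folds to fs - 5000 = 3000 Hz

print(dominant_frequency(below, fs), dominant_frequency(above, fs))
```

Both signals report the same dominant frequency, which is why the input must be band-limited below $f_s / 2$ before sampling.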
Sampling Rate
- Determined by the bandwidth of signals or hearing limits
- Audio CD: 44.1 kHz (44100 samples per second)
- Speech communication: 8 kHz (8000 samples per second)
Sampling Rate Conversion (Resampling)
- Up-sampling and Down-sampling
- 44.1 kHz CD-quality music is often down-sampled to 22.05 kHz or lower to reduce the data size for analysis tasks
- Resampling is computed by interpolation using a low-pass filter (e.g., a windowed sinc function)
[Figure: down-sampling and up-sampling]
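Down-sampling by a factor of two can be sketched as low-pass filtering followed by discarding every other sample. The filter length (101 taps), the Hann taper, and the cutoff placed slightly below the new Nyquist frequency are assumptions chosen for illustration, not values from the slides. Up-sampling would use the same kind of filter after inserting zeros and is not shown.

```python
import numpy as np

def windowed_sinc_lowpass(cutoff, num_taps=101):
    """FIR low-pass filter; cutoff is a fraction of the sampling rate (0 < cutoff < 0.5)."""
    n = np.arange(num_taps) - (num_taps - 1) / 2
    h = 2 * cutoff * np.sinc(2 * cutoff * n)   # ideal low-pass impulse response
    h *= np.hanning(num_taps)                  # taper the sinc to a finite length
    return h / h.sum()                         # unity gain at DC

def downsample_by_2(x):
    """Low-pass filter below the new Nyquist frequency, then keep every other sample."""
    h = windowed_sinc_lowpass(cutoff=0.25 * 0.9)
    filtered = np.convolve(x, h, mode="same")
    return filtered[::2]

fs = 44100
t = np.arange(fs) / fs                                  # one second of audio
x = np.sin(2 * np.pi * 1000 * t) + np.sin(2 * np.pi * 15000 * t)
y = downsample_by_2(x)                                  # now at 22050 Hz

# For one second of audio, bin k of the spectrum corresponds to k Hz.
spectrum = 2 * np.abs(np.fft.rfft(y)) / len(y)
print(round(spectrum[1000], 3), round(spectrum[7050], 3))
```

Without the low-pass filter, the 15 kHz tone would alias to 22050 - 15000 = 7050 Hz in the down-sampled signal; with it, the 1 kHz tone survives and the alias is strongly attenuated.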
Quantization
- Discretizing the amplitude of real-valued signals
- Round the amplitude to the nearest discrete step
- The discrete steps are determined by the number of bits
- The range of B-bit quantization: $-2^{B-1}$ to $2^{B-1} - 1$
- Audio CD (16 bits): $-2^{15}$ to $2^{15} - 1$ → this is often scaled to -1.0 ~ 1.0
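A minimal sketch of uniform quantization, assuming the input has already been scaled to the range -1.0 ~ 1.0. The function names `quantize` and `dequantize` are illustrative.

```python
import numpy as np

def quantize(x, num_bits):
    """Round x (scaled to -1.0 ~ 1.0) to signed integers with num_bits bits."""
    scale = 2 ** (num_bits - 1)
    q = np.round(np.asarray(x, dtype=float) * scale)
    # Clip to the representable range -2^(B-1) .. 2^(B-1) - 1.
    return np.clip(q, -scale, scale - 1).astype(np.int64)

def dequantize(q, num_bits):
    """Map the integers back to the -1.0 ~ 1.0 range."""
    return q / 2 ** (num_bits - 1)

codes = quantize([-1.0, 0.0, 1.0], num_bits=16)
print(codes)   # the extremes map to the ends of the 16-bit range

x = np.linspace(-1.0, 0.99, 1000)
error = np.abs(x - dequantize(quantize(x, 8), 8))
print(error.max())   # at most half a quantization step when no clipping occurs
```

Each additional bit halves the step size, so the maximum rounding error halves as well.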
Time-Domain Representation of Audio
- Waveforms are a natural representation of audio but are limited for analyzing the content
- Zoom-in view (e.g., one frame): the wave shape is not very intuitive for explaining timbre, particularly when the music is polyphonic
- Zoom-out view: limited to explaining temporal loudness changes
Time-Frequency Representations of Audio
- We hear and observe musical sounds in terms of how the
content changes over time and frequency
Short-Time Fourier Transform (STFT)
- Definition: $X(l, k) = \sum_{n=0}^{N-1} w(n)\, x(n + l \cdot H)\, e^{-j 2\pi k n / N}$
- $H$: hop size, $w(n)$: window, $N$: FFT size
- Computation Steps
- Take a window (one frame)
- Compute the FFT
- Shift by the hop size
- Repeat the above
- This returns a 2D matrix
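The computation steps can be sketched in a few lines of NumPy. This is a minimal sketch rather than a reference implementation: the Hann window, the default sizes, and the use of `np.fft.rfft` (which keeps only the non-negative frequency bins) are assumptions made for illustration.

```python
import numpy as np

def stft(x, n_fft=1024, hop=256):
    """Return a complex matrix of shape (n_fft // 2 + 1, num_frames)."""
    window = np.hanning(n_fft)                       # w(n)
    num_frames = 1 + (len(x) - n_fft) // hop
    # Frame l starts at sample l * hop, matching x(n + l * H) in the definition.
    frames = np.stack([x[l * hop : l * hop + n_fft] for l in range(num_frames)])
    # One FFT per frame; transpose so rows are frequency bins k and columns are frames l.
    return np.fft.rfft(frames * window, axis=1).T

fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 1000 * t)      # 1 kHz test tone, one second long
X = stft(x)
peak_bins = np.abs(X).argmax(axis=0)  # strongest bin in every frame
print(X.shape, peak_bins[0] * fs / 1024)
```

The magnitude of this matrix, often shown in dB, is the spectrogram.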
Windowing
- Types of window functions
- Trade-off between the width of main-lobe and the level of side-lobe
- Tapering windows suppress side-lobe levels at the expense of a wider main lobe.
- The Hann window is the most widely used in music analysis.
[Figure: Rectangular, Triangular, Hann, and Blackman windows; each is shown as amplitude in the time domain and magnitude (dB) in the frequency domain]
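The main-lobe and side-lobe trade-off can be measured directly from a zero-padded FFT of each window. The sketch below assumes an illustrative window length of 64 samples and uses NumPy's `np.bartlett` as the triangular window. The helper `main_lobe_and_side_lobe` is hypothetical and written only for this example.

```python
import numpy as np

def main_lobe_and_side_lobe(window, n_fft=8192):
    """Return (first bin past the main-lobe null, peak side-lobe level in dB)."""
    spectrum = np.abs(np.fft.rfft(window, n_fft))
    mag_db = 20 * np.log10(spectrum / spectrum.max() + 1e-12)
    k = 1
    # Walk down the main lobe until the magnitude starts rising again.
    while k < len(mag_db) - 1 and mag_db[k] <= mag_db[k - 1]:
        k += 1
    return k, mag_db[k:].max()

N = 64  # illustrative window length
results = {
    "rectangular": main_lobe_and_side_lobe(np.ones(N)),
    "triangular": main_lobe_and_side_lobe(np.bartlett(N)),
    "hann": main_lobe_and_side_lobe(np.hanning(N)),
    "blackman": main_lobe_and_side_lobe(np.blackman(N)),
}
for name, (null_bin, side_db) in results.items():
    print(f"{name:12s} main-lobe edge bin {null_bin:4d}, peak side lobe {side_db:6.1f} dB")
```

The rectangular window has the narrowest main lobe but the highest side lobes; the tapered windows reverse that ordering.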
Spectra of Windowed Sines
Time-Frequency Resolution by Window Size
- Trade-off between time and frequency resolutions
- Short window: low frequency-resolution and high time-resolution
- Long window: high frequency-resolution and low time-resolution
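The trade-off can be made concrete with a small calculation: for a window of $N$ samples at sampling rate $f_s$, the FFT bins are spaced $f_s / N$ Hz apart and one frame spans $N / f_s$ seconds. The window sizes below are illustrative assumptions.

```python
def stft_resolution(window_size, sample_rate):
    """Return (frequency bin spacing in Hz, frame duration in seconds)."""
    return sample_rate / window_size, window_size / sample_rate

for n in (256, 4096):
    df, dt = stft_resolution(n, 44100)
    print(f"N={n}: {df:.1f} Hz per bin, {dt * 1000:.1f} ms per frame")
```

The product of the two numbers is always 1, so improving one resolution necessarily degrades the other.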
Discrete Fourier Transform (DFT)
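- Definition
- Inner product of the signal with complex sinusoids: $X(k) = \langle x, s_k \rangle = \sum_{n=0}^{N-1} x(n)\, e^{-j 2\pi k n / N} = X_R(k) + j X_I(k) = A(k)\, e^{j\phi(k)}$
- Complex sinusoids: $s_k(n) = e^{j 2\pi k n / N} = \cos(2\pi k n / N) + j \sin(2\pi k n / N)$ (by Euler’s identity)
- Magnitude spectrum: $|X(k)| = A(k) = \sqrt{X_R^2(k) + X_I^2(k)}$
- Phase spectrum: $\angle X(k) = \phi(k) = \tan^{-1}\left( X_I(k) / X_R(k) \right)$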
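The definition translates directly into code. The sketch below computes the DFT as inner products with complex sinusoids and compares it with `np.fft.fft`. The test signal, a cosine at bin 3 with a phase of 0.5 rad, is an assumption chosen so that the magnitude and phase are easy to read off.

```python
import numpy as np

def dft(x):
    """Direct DFT: inner product of x with each complex sinusoid s_k."""
    N = len(x)
    n = np.arange(N)
    k = n.reshape(-1, 1)
    basis = np.exp(-2j * np.pi * k * n / N)   # row k holds the conjugate of s_k(n)
    return basis @ x

N = 32
x = np.cos(2 * np.pi * 3 * np.arange(N) / N + 0.5)
X = dft(x)
magnitude = np.abs(X)     # sqrt(X_R^2 + X_I^2)
phase = np.angle(X)       # quadrant-aware arctangent of X_I / X_R

print(magnitude[3], phase[3])
```

For this signal the magnitude at bin 3 is N/2 and the phase equals the phase of the cosine.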
Fast Fourier Transform (FFT)
- Matrix multiplication view of DFT
- “Fast Fourier Transform (FFT)” is an efficient algorithm that computes the matrix multiplication quickly
- Based on “divide and conquer”
- Complexity reduction: $O(N^2) \rightarrow O(N \log_2 N)$
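Both views can be sketched side by side. The recursive function below is a textbook radix-2 decimation-in-time FFT, one of several FFT variants and shown here only as an illustration of divide and conquer; it assumes the input length is a power of two.

```python
import numpy as np

def dft_matrix(N):
    """N x N matrix whose (k, n) entry is exp(-j 2 pi k n / N)."""
    n = np.arange(N)
    return np.exp(-2j * np.pi * np.outer(n, n) / N)

def fft_recursive(x):
    """Radix-2 decimation-in-time FFT; len(x) must be a power of two."""
    x = np.asarray(x, dtype=complex)
    N = len(x)
    if N == 1:
        return x
    even = fft_recursive(x[0::2])   # DFT of the even-indexed samples
    odd = fft_recursive(x[1::2])    # DFT of the odd-indexed samples
    twiddle = np.exp(-2j * np.pi * np.arange(N // 2) / N)
    # Combine the two half-size DFTs into the full-size DFT.
    return np.concatenate([even + twiddle * odd, even - twiddle * odd])

rng = np.random.default_rng(0)
x = rng.standard_normal(64)
slow = dft_matrix(64) @ x   # O(N^2) matrix multiplication
fast = fft_recursive(x)     # O(N log2 N) recursion
print(np.allclose(slow, fast))
```

Each level of recursion does O(N) work and there are log2(N) levels, which is where the reduced complexity comes from.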
Frequency Scale in Spectrogram
- Linear frequency scale
- Great to see the harmonic structure of a single tone.
- However, spectral distributions are skewed toward low frequency
“Chopin Prelude E minor” “How insensitive”
Human Pitch Perception
- The basilar membrane in the cochlea acts as a rough spectral analyzer
- It resonates at a different position depending on the frequency of the incoming vibration
- The resonance frequency increases on a log scale along the basilar membrane
[Figure: inner ear and the (unrolled) cochlear basilar membrane, with the oval window and round window labeled (P. Cook)]
Human Pitch Perception
- Pitch Resolution
- Just noticeable difference (JND) increases
as the frequency goes up
- Critical bandwidth
- Frequency bandwidth within which one tone
interferes with the perception of another tone by auditory masking
- Roughly constant at low frequencies but increases linearly with frequency at high frequencies
[Figure: normalized ERB, Mel, and Bark scales plotted against frequency (Hz), from 0 to 2.5 × 10^4 Hz]
Psychoacoustical Pitch Scales
- Mel scale
- Based on pitch ratio of tones
- Bark scale
- Critical band measurement by masking
- Equivalent rectangular bandwidth (ERB) rate
- Critical band measurement using the notched-noise method
Using Matlab code from https://www.speech.kth.se/~giampi/auditoryscales/
Comparison of pitch scales
- Mel: $m = 2595 \log_{10}(1 + f / 700)$
- Bark: $\mathrm{Bark} = 13 \arctan(0.00076 f) + 3.5 \arctan\left( (f / 7500)^2 \right)$
- ERB rate: $\mathrm{ERBS} = 21.4 \log_{10}(1 + 0.00437 f)$
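The three conversions are straightforward to implement. The function names below are illustrative. The Bark constant 0.00076 follows the commonly cited Zwicker and Terhardt formula, which is an assumption on my part about the intended constant.

```python
import numpy as np

def hz_to_mel(f):
    """Mel scale: 2595 * log10(1 + f / 700)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def hz_to_bark(f):
    """Bark scale (Zwicker and Terhardt form, assumed constants)."""
    f = np.asarray(f, dtype=float)
    return 13.0 * np.arctan(0.00076 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)

def hz_to_erbs(f):
    """ERB-rate scale: 21.4 * log10(1 + 0.00437 * f)."""
    return 21.4 * np.log10(1.0 + 0.00437 * np.asarray(f, dtype=float))

freqs = np.array([100.0, 500.0, 1000.0, 4000.0, 16000.0])
for name, fn in (("mel", hz_to_mel), ("bark", hz_to_bark), ("erbs", hz_to_erbs)):
    print(name, np.round(fn(freqs), 2))
```

All three scales are close to linear at low frequencies and compress the high frequencies, which matches the shape of the comparison plot.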
Tuning System in Musical Instrument
- Equal temperament
- $1 : 2^{1/12}$ frequency ratio between two adjacent notes
- Music note number (m, where m = 69 is the note at 440 Hz) and frequency (f) in Hz: $f = 440 \cdot 2^{(m - 69)/12}$, $m = 12 \log_2(f / 440) + 69$
- https://newt.phys.unsw.edu.au/jw/notes.html
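The two formulas are inverses of each other, as a short sketch shows. The function names are illustrative; the note numbers follow the formula's own convention, in which 69 corresponds to 440 Hz.

```python
import math

def note_to_hz(m):
    """f = 440 * 2^((m - 69) / 12)."""
    return 440.0 * 2.0 ** ((m - 69) / 12.0)

def hz_to_note(f):
    """m = 12 * log2(f / 440) + 69."""
    return 12.0 * math.log2(f / 440.0) + 69.0

print(note_to_hz(69), round(note_to_hz(60), 2), hz_to_note(880.0))
print(note_to_hz(70) / note_to_hz(69))   # ratio between adjacent notes
```

Moving up 12 notes doubles the frequency, so note 81 is 880 Hz and note 57 is 220 Hz.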
Log-Spectrogram Using Frequency Mapping
- Mapping linear scale to a perceptual (log-like) scale
- Locate center frequencies according to the frequency mapping
- Linearly interpolate around each center frequency, with a skirt set by the corresponding bandwidth
[Figure: center frequency and bandwidth of one band; linear-frequency spectrogram versus log-frequency spectrogram]
Log-Spectrogram Using Frequency Mapping
- The mapping can be formed as a matrix multiplication: $Y = M \cdot X$ (M: mapping matrix, X: spectrogram, Y: scaled spectrogram)
- Each row of the mapping matrix contains the interpolation coefficients for one output band
- Limitation
- Simple, but the time-frequency resolution is still constrained by the STFT
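A mapping matrix of this kind can be sketched with triangular interpolation weights. Several choices below are assumptions made for illustration: the skirt of each band reaches to the neighboring center frequencies, the centers are log-spaced between 100 Hz and 8 kHz, and the helper name `mapping_matrix` is hypothetical.

```python
import numpy as np

def mapping_matrix(center_freqs, n_fft, sample_rate):
    """Triangular interpolation weights, shape (num_bands, n_fft // 2 + 1)."""
    fft_freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    c = np.asarray(center_freqs, dtype=float)
    M = np.zeros((len(c) - 2, len(fft_freqs)))
    for i in range(1, len(c) - 1):
        lower, center, upper = c[i - 1], c[i], c[i + 1]
        rising = (fft_freqs - lower) / (center - lower)     # 0 at lower, 1 at center
        falling = (upper - fft_freqs) / (upper - center)    # 1 at center, 0 at upper
        M[i - 1] = np.maximum(0.0, np.minimum(rising, falling))
    return M

n_fft, sr = 2048, 22050
centers = np.geomspace(100.0, 8000.0, 42)      # 40 bands plus two edge points
M = mapping_matrix(centers, n_fft, sr)

rng = np.random.default_rng(0)
X = rng.random((n_fft // 2 + 1, 10))           # stand-in magnitude spectrogram, 10 frames
Y = M @ X                                      # scaled spectrogram
print(M.shape, Y.shape)
```

Swapping the log-spaced centers for mel-spaced ones gives a mel-style mapping with the same code.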
Mel-Frequency Spectrogram
- Mel-scaled spectrogram is widely used for music classification
- Usually mapped to a smaller number of mel-scaled bins than the number of FFT bins
Linear-Frequency Spectrogram Mel-Frequency Spectrogram
Constant-Q Transform
- Use a set of sinusoidal kernels with:
- Logarithmically spaced frequencies
- Constant Q = frequency/bandwidth
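The two properties can be sketched as follows. The code assumes that the bandwidth of each bin equals the spacing to the next bin, which gives $Q = 1 / (2^{1/b} - 1)$ for $b$ bins per octave, and that each kernel is a Hann-windowed complex sinusoid lasting $Q$ periods. These are common choices for constant-Q analysis, not details stated in the slides.

```python
import numpy as np

def constant_q_bands(f_min, bins_per_octave, num_bins):
    """Return (center frequencies, bandwidths, Q) for log-spaced bins."""
    k = np.arange(num_bins)
    center = f_min * 2.0 ** (k / bins_per_octave)        # logarithmically spaced
    q = 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)     # same Q for every bin
    return center, center / q, q

def constant_q_kernel(center_hz, q, sample_rate):
    """Windowed complex sinusoid whose length is Q periods of center_hz."""
    length = int(np.ceil(q * sample_rate / center_hz))
    n = np.arange(length)
    return np.hanning(length) * np.exp(2j * np.pi * center_hz * n / sample_rate) / length

center, bandwidth, q = constant_q_bands(f_min=55.0, bins_per_octave=12, num_bins=48)
kernel_lengths = [len(constant_q_kernel(f, q, 22050)) for f in center]
print(round(q, 3), kernel_lengths[0], kernel_lengths[-1])
```

Low-frequency kernels are long (fine frequency resolution) and high-frequency kernels are short (fine time resolution), unlike the fixed window of the STFT.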
Constant-Q Filter Bank
Log-Frequency Spectrogram (mapping) Log-Frequency Spectrogram (Constant-Q transform)
Example: Constant-Q Filter Bank
- Müller’s 88-note filter bank
- The center frequencies are set to the 88 piano notes
- The bandwidth is set to have constant Q, with ±25 cents around the center (Müller, 2011)
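The band layout described here can be sketched numerically. The code assumes equal-tempered note numbers 21 to 108 for the 88 piano keys (with 69 at 440 Hz) and takes the band edges exactly 25 cents below and above each center. It shows only the frequency layout and is not a reproduction of the actual filter design.

```python
def piano_filter_bank():
    """Center frequency and ±25-cent band edges for the 88 piano notes."""
    bands = []
    for m in range(21, 109):                      # note numbers 21..108 (88 notes)
        center = 440.0 * 2.0 ** ((m - 69) / 12.0)
        low = center * 2.0 ** (-25 / 1200)        # 25 cents below the center
        high = center * 2.0 ** (25 / 1200)        # 25 cents above the center
        bands.append((center, low, high))
    return bands

bands = piano_filter_bank()
q_values = [center / (high - low) for center, low, high in bands]
print(len(bands), round(bands[0][0], 2), round(bands[-1][0], 2), round(q_values[0], 2))
```

Because the band edges are defined in cents, the ratio of center frequency to bandwidth is the same for every note, which is the constant-Q property.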
Comparison of Time-Frequency Representations
[Figure: four panels, each plotting frequency against time: spectrogram (short window), spectrogram (long window), mel spectrogram, and constant-Q transform]
Auditory Filter Bank
- A bank of filters that imitates the magnitude and delay of traveling waves on the basilar membrane in the cochlea
- Produces a 3-D representation (time-channel-lag), or “auditory images”
[Figure: processing chain. The input enters at the oval window; the cochlear filter bank (high to low frequency) feeds hair cells (HC) and then auto-correlation functions (ACF) in each channel; the outputs are stabilized and combined into a correlogram and a summary ACF]
Types of Auditory Filter Banks
- Gamma-tone filter banks (R. Patterson)
- Gamma-tone: $g(t) = a\, t^{n-1} e^{-2\pi b t} \cos(2\pi f t + \phi)\, u(t)$
- Pole-Zero Filter Cascade (D. Lyon)
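The gamma-tone impulse response can be evaluated directly from its formula. In the sketch below, the order $n = 4$, the bandwidth $b = 1.019 \cdot \mathrm{ERB}(f)$, and the Glasberg and Moore expression for ERB are commonly used choices that I am assuming here; the slides do not specify them. The amplitude $a$ is chosen so that the peak is 1.

```python
import numpy as np

def gammatone_ir(center_hz, sample_rate, duration=0.05, order=4, phase=0.0):
    """Sampled g(t) = a t^(n-1) exp(-2 pi b t) cos(2 pi f t + phi) for t >= 0."""
    t = np.arange(int(duration * sample_rate)) / sample_rate   # t >= 0, so u(t) = 1
    erb = 24.7 * (4.37 * center_hz / 1000.0 + 1.0)   # assumed ERB formula (Hz)
    b = 1.019 * erb                                   # assumed bandwidth parameter
    g = (t ** (order - 1) * np.exp(-2 * np.pi * b * t)
         * np.cos(2 * np.pi * center_hz * t + phase))
    return g / np.max(np.abs(g))                      # a normalizes the peak to 1

fs = 16000
g = gammatone_ir(1000.0, fs)
spectrum = np.abs(np.fft.rfft(g, fs))   # zero-padded so bin k corresponds to k Hz
peak_hz = int(np.argmax(spectrum))
print(peak_hz)
```

A filter bank is obtained by evaluating this response at a set of center frequencies, typically spaced evenly on the ERB-rate scale.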
Hair-Cell
- (Inner) Hair-cell
- Transforms mechanical movement into neural spikes
- Modeled as a cascade of
- Half-wave rectification
- Compression
- Low-pass filtering
- This is a non-linear process
- Generates new harmonic partials
- Associated with the perception of missing fundamentals
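The cascade and its effect on a missing fundamental can be sketched as follows. The power-law compression with exponent 0.3, the one-pole low-pass filter with a 1 kHz cutoff, and the test signal (harmonics at 400, 600, and 800 Hz with no 200 Hz component) are all assumptions made for illustration.

```python
import numpy as np

def hair_cell(x, sample_rate, cutoff_hz=1000.0, exponent=0.3):
    """Half-wave rectification, compression, then one-pole low-pass filtering."""
    rectified = np.maximum(x, 0.0)                       # half-wave rectification
    compressed = rectified ** exponent                   # power-law compression (assumed form)
    a = np.exp(-2 * np.pi * cutoff_hz / sample_rate)     # one-pole filter coefficient
    out = np.empty_like(compressed)
    state = 0.0
    for i, v in enumerate(compressed):
        state = (1 - a) * v + a * state                  # y[n] = (1 - a) x[n] + a y[n-1]
        out[i] = state
    return out

def amplitude_at(x, freq_hz, sample_rate):
    """Amplitude of the spectral component nearest to freq_hz."""
    k = int(round(freq_hz * len(x) / sample_rate))
    return 2 * np.abs(np.fft.rfft(x)[k]) / len(x)

fs = 16000
t = np.arange(fs) / fs
x = sum(np.sin(2 * np.pi * f * t) for f in (400, 600, 800))   # 200 Hz is absent
y = hair_cell(x, fs)

print(amplitude_at(x, 200, fs), amplitude_at(y, 200, fs))
```

The input has essentially no energy at 200 Hz, while the output does: the non-linearity creates a partial at the common fundamental of the three harmonics.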
Example of Correlogram: Piano
Example of Correlogram: Rock Music
Software
- Tools
- C++: http://soundlab.cs.princeton.edu/software/sndpeek/
- WebAudio: https://musiclab.chromeexperiments.com/Spectrogram
- Audacity, Sonic Visualiser, Adobe Audition, Praat, …
- Libraries
- Librosa (python): http://librosa.github.io/librosa/
- Auditory Toolbox (Matlab):