
SLIDE 1

GCT634: Musical Applications of Machine Learning

Audio Representations

Graduate School of Culture Technology, KAIST Juhan Nam

SLIDE 2

Outline

Time-domain Representation
Sampling
Quantization
Time-Frequency Representations
Short-time Fourier Transform
Spectrogram
Mel-Spectrogram
Constant-Q transforms
Auditory Filterbank

SLIDE 3

Music Representations

Audio
Score

… 0 1 1 0 1 1 0 … … 0 0 1 1 0 1 1 …

SLIDE 4

Time-Domain Representation of Audio

Musical sounds arrive at a microphone as vibrations of the air.
In a computer, they are converted into a sequence of binary values via Sampling and Quantization.

… 0 1 1 0 1 1 0 …

SLIDE 5

Sampling

What is an appropriate sampling rate?
Too high: increases the data rate
Too low: the original signal becomes hard to reconstruct
Sampling Theorem
Sampling rate must be greater than twice the maximum frequency in the signal in order to reconstruct it fully (or to avoid aliasing)
Half the sampling rate ($f_s/2$) is called the Nyquist frequency

$f_s > 2 \cdot f_m$

$f_s$: sampling rate
$f_m$: maximum frequency
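As a minimal sketch of the sampling theorem (the function name `apparent_frequency` and the test tones are illustrative assumptions, not part of the lecture), the following NumPy code samples a cosine and reports the frequency of the strongest FFT bin. A tone below the Nyquist frequency keeps its frequency, while a tone above it is aliased.

```python
import numpy as np

def apparent_frequency(f_tone, fs, duration=1.0):
    """Sample a cosine at rate fs and return the frequency (Hz) of the strongest FFT bin."""
    n = np.arange(int(fs * duration))
    x = np.cos(2 * np.pi * f_tone * n / fs)
    spectrum = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    return freqs[np.argmax(spectrum)]

fs = 8000                              # Nyquist frequency is 4000 Hz
below = apparent_frequency(3000, fs)   # 3000 Hz < 4000 Hz: preserved
above = apparent_frequency(5000, fs)   # 5000 Hz > 4000 Hz: aliases to fs - 5000 = 3000 Hz
```

After sampling, the 5 kHz tone is indistinguishable from the 3 kHz tone, which is why the signal must be band-limited below half the sampling rate before it is sampled.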

SLIDE 6

Sampling Rate

Determined by the bandwidth of signals or hearing limits
Audio CD: 44.1 kHz (44100 samples per second)
Speech communication: 8 kHz (8000 samples per second)

SLIDE 7

Sampling Rate Conversion (Resampling)

Up-sampling and Down-sampling
44.1 kHz CD-quality music is often down-sampled to 22.05 kHz or lower rates to reduce the data size for analysis tasks
Resampling is computed by interpolation using a low-pass filter (e.g. windowed sinc function)

[Figure: down-sampling and up-sampling]
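One possible way to down-sample by a factor of two with a windowed-sinc low-pass filter is sketched below. The function name `downsample_by_2`, the filter length of 101 taps, and the Hann window are assumptions chosen for illustration; production resamplers typically use more elaborate polyphase designs.

```python
import numpy as np

def downsample_by_2(x, num_taps=101):
    """Low-pass filter with a Hann-windowed sinc, then keep every second sample."""
    n = np.arange(num_taps) - (num_taps - 1) / 2
    # Ideal low-pass with cutoff at a quarter of the original rate,
    # which is the Nyquist frequency after down-sampling
    h = 0.5 * np.sinc(0.5 * n)
    h *= np.hanning(num_taps)   # taper the sinc to a finite length
    h /= np.sum(h)              # unity gain at DC
    y = np.convolve(x, h, mode="same")
    return y[::2]

fs = 44100
t = np.arange(fs) / fs
# 440 Hz survives down-sampling; 15 kHz lies above the new Nyquist frequency (11025 Hz)
x = np.sin(2 * np.pi * 440 * t) + np.sin(2 * np.pi * 15000 * t)
y = downsample_by_2(x)          # one second of audio at 22050 Hz
```

Without the low-pass filter, the 15 kHz component would alias to 22050 - 15000 = 7050 Hz in the down-sampled signal.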

SLIDE 8

Quantization

Discretizing the amplitude of real-valued signals
Round the amplitude to the nearest discrete step
The discrete steps are determined by the number of bits
The range of B-bit quantization: $-2^{B-1}$ ~ $2^{B-1}-1$
Audio CD (16 bits): $-2^{15}$ ~ $2^{15}-1$ → this is often scaled to -1.0 ~ 1.0
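A minimal sketch of B-bit quantization, assuming the input is already scaled to roughly -1.0 ~ 1.0 (the function name `quantize` and the clipping at the upper limit are illustrative choices):

```python
import numpy as np

def quantize(x, num_bits):
    """Round a signal in about [-1.0, 1.0] to signed integers of num_bits bits."""
    scale = 2 ** (num_bits - 1)
    q = np.round(x * scale)
    # Clip to the representable range -2^(B-1) ... 2^(B-1) - 1
    q = np.clip(q, -scale, scale - 1)
    return q.astype(np.int64)

x = np.sin(2 * np.pi * np.arange(100) / 100)  # one cycle of a full-scale sine
q16 = quantize(x, 16)                         # integers in -32768 ... 32767
x_hat = q16 / 2 ** 15                         # scaled back to about -1.0 ... 1.0
max_error = np.max(np.abs(x - x_hat))         # quantization error stays within one step
```

Each extra bit halves the step size, so the quantization error shrinks by about 6 dB per bit.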

SLIDE 9

Time-Domain Representation of Audio

Waveforms are a natural representation of audio but limited in analyzing the content
Zoom-in view (e.g. one frame): wave shape is not very intuitive in explaining timbre, particularly when the music is polyphonic
Zoom-out view: limited to explaining temporal loudness change

SLIDE 10

Time-Frequency Representations of Audio

We hear and observe musical sounds in terms of how the content changes over time and frequency

SLIDE 11

Short-Time Fourier Transform (STFT)

Definition

$X(l, k) = \sum_{n=0}^{N-1} w(n)\, x(n + lH)\, e^{-j 2\pi k n / N}$

$H$: hop size, $w(n)$: window, $N$: FFT size

Computation Steps
Take a window (one frame)
Compute FFT
Shift by the hop size
Repeat above
This returns a 2D matrix
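The computation steps above can be sketched in NumPy as follows. The function name `stft`, the Hann window, and the default sizes are assumptions for illustration, and the sketch assumes the signal is at least one window long; libraries such as Librosa add padding and other options.

```python
import numpy as np

def stft(x, n_fft=1024, hop=256):
    """Return a (n_fft // 2 + 1, num_frames) matrix of complex STFT coefficients."""
    window = np.hanning(n_fft)
    num_frames = 1 + (len(x) - n_fft) // hop
    frames = []
    for frame_index in range(num_frames):
        start = frame_index * hop                    # shift by the hop size
        frame = x[start : start + n_fft]             # take one frame
        frames.append(np.fft.rfft(window * frame))   # window it and compute the FFT
    return np.array(frames).T                        # frequency bins x frames

fs = 22050
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t)    # one second of a 440 Hz tone
X = stft(x)
spectrogram = np.abs(X)            # magnitude of the 2D matrix
```

For a real-valued signal only the non-negative frequencies are kept, so each column has `n_fft // 2 + 1` bins.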

SLIDE 12

Windowing

Types of window functions
Trade-off between the width of main-lobe and the level of side-lobe
Tapering windows suppress side-lobe levels at the expense of a wider main lobe.

Hann window is the most widely used in music analysis.

[Figure: Rectangular, Triangular, Hann, and Blackman windows (amplitude) and the spectra of windowed sines (magnitude in dB)]
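The main-lobe versus side-lobe trade-off can be checked numerically. The sketch below is one possible measurement, not the method used to draw the figure: the helper name `lobe_stats`, the window length of 64, and the zero-padded FFT size are all assumptions.

```python
import numpy as np

def lobe_stats(window, n_fft=8192):
    """Return (main-lobe half-width in padded-FFT bins, peak side-lobe level in dB)."""
    mag = np.abs(np.fft.rfft(window, n_fft))       # zero-padded spectrum of the window
    mag_db = 20 * np.log10(mag / mag[0] + 1e-12)   # normalize so the main-lobe peak is 0 dB
    # The main lobe ends at the first local minimum of the magnitude response
    first_null = np.where(np.diff(mag) > 0)[0][0]
    return first_null, np.max(mag_db[first_null:])

rect_width, rect_sidelobe = lobe_stats(np.ones(64))
hann_width, hann_sidelobe = lobe_stats(np.hanning(64))
```

In this setup the Hann window has a main lobe about twice as wide as the rectangular window, while its highest side lobe is markedly lower.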

SLIDE 13

Time-Frequency Resolution by Window Size

Trade-off between time and frequency resolutions
Short window: low frequency-resolution and high time-resolution
Long window: high frequency-resolution and low time-resolution
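As a small worked example of this trade-off, assuming a sampling rate of 22050 Hz and two window sizes picked for illustration (the helper name `resolution` is hypothetical):

```python
def resolution(fs, n_window):
    """Return (bin spacing in Hz, frame duration in seconds) for an n_window-sample frame."""
    return fs / n_window, n_window / fs

fs = 22050  # assumed sampling rate in Hz
short_df, short_dt = resolution(fs, 256)    # short window
long_df, long_dt = resolution(fs, 4096)     # long window
```

With these assumed values, the short window gives bins about 86 Hz apart and spans about 12 ms, while the long window gives bins about 5.4 Hz apart and spans about 186 ms.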

SLIDE 14

Discrete Fourier Transform (DFT)

Definition
Inner product with the signal and complex sinusoids

$X(k) = x(n) \cdot s_k(n) = \sum_{n=0}^{N-1} x(n)\, e^{-j 2\pi k n / N} = X_R(k) + j X_I(k) = A(k)\, e^{j\phi(k)}$

$s_k(n) = e^{j 2\pi k n / N} = \cos(2\pi k n / N) + j \sin(2\pi k n / N)$ (by Euler's identity)

Magnitude spectrum: $|X(k)| = A(k) = \sqrt{X_R^2(k) + X_I^2(k)}$
Phase spectrum: $\angle X(k) = \phi(k) = \tan^{-1}\left(X_I(k) / X_R(k)\right)$
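A direct, unoptimized sketch of the DFT as an inner product with complex sinusoids is shown below. The function name `dft` and the test signal are illustrative assumptions; `np.arctan2` is used as a quadrant-aware form of the arctangent in the phase formula.

```python
import numpy as np

def dft(x):
    """Compute each DFT bin as the inner product of x(n) with a complex sinusoid."""
    N = len(x)
    n = np.arange(N)
    X = np.zeros(N, dtype=complex)
    for k in range(N):
        s_k = np.exp(1j * 2 * np.pi * k * n / N)   # cos(2*pi*k*n/N) + j*sin(2*pi*k*n/N)
        X[k] = np.sum(x * np.conj(s_k))            # conjugation yields the e^{-j...} term
    return X

x = np.cos(2 * np.pi * 3 * np.arange(32) / 32)     # 3 cycles in 32 samples
X = dft(x)
magnitude = np.abs(X)                    # sqrt(real^2 + imag^2)
phase = np.arctan2(X.imag, X.real)       # angle of each complex bin
```

The cosine with three cycles per frame produces a magnitude peak of N/2 = 16 at bin 3 (and at its mirror bin 29).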

SLIDE 15

Fast Fourier Transform (FFT)

Matrix multiplication view of DFT
“Fast Fourier Transform (FFT)” is an efficient algorithm that computes the matrix multiplication fast
Based on “divide and conquer”
Complexity reduction: $O(N^2)$ → $O(N \log_2 N)$
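To make the two views concrete, the sketch below builds the DFT matrix explicitly and compares it with a recursive radix-2 FFT. Both function names are hypothetical, and the recursive version assumes the length is a power of two; real FFT libraries are far more optimized.

```python
import numpy as np

def dft_matrix(N):
    """N x N matrix whose row k holds e^{-j 2 pi k n / N} for n = 0 ... N-1."""
    n = np.arange(N)
    return np.exp(-2j * np.pi * np.outer(n, n) / N)

def fft_recursive(x):
    """Radix-2 decimation-in-time FFT; len(x) must be a power of two."""
    x = np.asarray(x, dtype=complex)
    N = len(x)
    if N == 1:
        return x
    even = fft_recursive(x[0::2])    # divide: DFT of the even-indexed samples
    odd = fft_recursive(x[1::2])     # divide: DFT of the odd-indexed samples
    twiddle = np.exp(-2j * np.pi * np.arange(N // 2) / N)
    # conquer: combine the two half-size DFTs
    return np.concatenate([even + twiddle * odd, even - twiddle * odd])

x = np.random.default_rng(0).standard_normal(64)
X_matrix = dft_matrix(64) @ x    # on the order of N^2 operations
X_fast = fft_recursive(x)        # on the order of N log2 N operations
```

Both paths produce the same spectrum; only the number of operations differs.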

SLIDE 16

Frequency Scale in Spectrogram

Linear frequency scale
Great to see the harmonic structure of a single tone.
However, spectral distributions are skewed toward low frequency

[Figure: spectrograms of “Chopin Prelude E minor” and “How insensitive”]

SLIDE 17

Human Pitch Perception

The basilar membrane in the cochlea is a rough spectral analyzer
It resonates selectively at a different position according to the frequency of the incoming vibration
The resonance frequency increases on a log scale along the basilar membrane

[Figure: (unrolled) cochlear basilar membrane with the oval window and round window; inner ear (P. Cook)]

SLIDE 18

Human Pitch Perception

Pitch Resolution
Just noticeable difference (JND) increases as the frequency goes up
Critical bandwidth
Frequency bandwidth within which one tone interferes with the perception of another tone by auditory masking
Roughly constant at low frequencies but increasing linearly with frequency at high frequencies

SLIDE 19

Psychoacoustical Pitch Scales

Mel scale
Based on pitch ratio of tones
$m = 2595 \log_{10}(1 + f / 700)$
Bark scale
Critical band measurement by masking
$\mathrm{Bark} = 13 \arctan(0.00075 f) + 3.5 \arctan((f / 7500)^2)$
Equivalent rectangular bandwidth (ERB) rate
Critical band measurement using the notched-noise method
$\mathrm{ERBS} = 21.4 \cdot \log_{10}(1 + 0.00437 f)$

[Figure: comparison of pitch scales; normalized ERB, Mel, and Bark scales versus frequency (Hz), from 0 to 2.5 × 10^4. Using Matlab code from https://www.speech.kth.se/~giampi/auditoryscales/]
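The three formulas translate directly into code. The sketch below uses the coefficients exactly as given on this slide; the function names and the normalization step (dividing each scale by its value at 25 kHz) are assumptions made to reproduce a comparison on a common 0 ~ 1 axis.

```python
import numpy as np

def hz_to_mel(f):
    """Mel scale: m = 2595 log10(1 + f / 700)."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def hz_to_bark(f):
    """Bark scale, with the coefficients as given on the slide."""
    return 13.0 * np.arctan(0.00075 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)

def hz_to_erbs(f):
    """ERB-rate scale: ERBS = 21.4 log10(1 + 0.00437 f)."""
    return 21.4 * np.log10(1.0 + 0.00437 * f)

f = np.linspace(0.0, 25000.0, 251)
# Normalize each scale to 1.0 at the highest frequency so the curves can be compared
mel_norm = hz_to_mel(f) / hz_to_mel(f[-1])
bark_norm = hz_to_bark(f) / hz_to_bark(f[-1])
erbs_norm = hz_to_erbs(f) / hz_to_erbs(f[-1])
```

By construction of the mel formula, 1000 Hz maps to approximately 1000 mel.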

SLIDE 20

Tuning System in Musical Instrument

Equal temperament
$1 : 2^{1/12}$ ratio between two adjacent notes
Music note (m) and frequency (f) in Hz

$f = 440 \cdot 2^{(m-69)/12}$

$m = 12 \log_2(f / 440) + 69$

https://newt.phys.unsw.edu.au/jw/notes.html
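A short sketch of the two conversions, with hypothetical function names; note number 69 corresponds to 440 Hz as in the formulas above.

```python
import math

def note_to_hz(m):
    """Frequency in Hz of note number m in equal temperament (note 69 = 440 Hz)."""
    return 440.0 * 2.0 ** ((m - 69) / 12.0)

def hz_to_note(f):
    """Possibly fractional note number of a frequency f in Hz."""
    return 12.0 * math.log2(f / 440.0) + 69.0

a4 = note_to_hz(69)                             # 440 Hz
c4 = note_to_hz(60)                             # nine semitones below 440 Hz
semitone = note_to_hz(70) / note_to_hz(69)      # ratio between adjacent notes
octave_up = hz_to_note(880.0)                   # one octave above note 69
```

Doubling the frequency always adds 12 to the note number, which is the sense in which this scale is logarithmic.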

SLIDE 21

Log-Spectrogram Using Frequency Mapping

Mapping linear scale to a perceptual (log-like) scale
Locate center frequencies according to the frequency mapping
Linear interpolation on the center frequency with the corresponding bandwidth skirt

[Figure: center frequency and bandwidth; linear-frequency spectrogram vs. log-frequency spectrogram]

SLIDE 22

Log-Spectrogram Using Frequency Mapping

The mapping can be formed as matrix multiplication
Each column of the mapping matrix contains the interpolation coefficients
Limitation
Simple, but the time-frequency resolutions are still constrained by the STFT

$Y = M \cdot X$

(M: mapping matrix, X: spectrogram, Y: scaled spectrogram)

SLIDE 23

Mel-Frequency Spectrogram

Mel-scaled spectrogram is widely used for music classification
Usually mapped to a smaller number of mel-scaled bins than the FFT size

[Figure: linear-frequency spectrogram vs. mel-frequency spectrogram]
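One common way to build such a mapping is a bank of triangular bands whose edges are equally spaced on the mel scale. The sketch below is a simplified version under that assumption; the function names are hypothetical, the interpolation coefficients of each band are stored in one row of the matrix so that the product with a spectrogram works directly, and library implementations (for example in Librosa) differ in details such as normalization.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_mapping_matrix(fs, n_fft, n_mels):
    """Return M with shape (n_mels, n_fft // 2 + 1); each row is one triangular band."""
    fft_freqs = np.linspace(0.0, fs / 2, n_fft // 2 + 1)
    # n_mels + 2 edge frequencies, equally spaced on the mel scale
    edges = mel_to_hz(np.linspace(0.0, hz_to_mel(fs / 2), n_mels + 2))
    M = np.zeros((n_mels, len(fft_freqs)))
    for i in range(n_mels):
        low, center, high = edges[i], edges[i + 1], edges[i + 2]
        rising = (fft_freqs - low) / (center - low)      # linear ramp up to the center
        falling = (high - fft_freqs) / (high - center)   # linear ramp down from the center
        M[i] = np.maximum(0.0, np.minimum(rising, falling))
    return M

M = mel_mapping_matrix(fs=22050, n_fft=1024, n_mels=40)
# Stand-in magnitude spectrogram (513 bins x 10 frames) used only to show the shapes
X = np.abs(np.random.default_rng(0).standard_normal((513, 10)))
Y = M @ X    # mel-scaled spectrogram: 40 bands x 10 frames
```

Here 513 linear-frequency bins are reduced to 40 mel bands, which is the kind of size reduction the slide refers to.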

SLIDE 24

Constant-Q Transform

Use a set of sinusoidal kernels with:
Logarithmically spaced frequencies
Constant Q = frequency/bandwidth
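A naive sketch of a constant-Q analysis of a single frame is given below. The function name, the Hann window, the lowest frequency of 110 Hz, and the 12 bins per octave are assumptions for illustration; the sketch assumes the signal is at least as long as the longest kernel, and efficient implementations work differently.

```python
import numpy as np

def constant_q_frame(x, fs, f_min=110.0, bins_per_octave=12, n_bins=36):
    """Constant-Q magnitudes of the start of x, one windowed sinusoidal kernel per bin."""
    Q = 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)               # Q = frequency / bandwidth
    freqs = f_min * 2.0 ** (np.arange(n_bins) / bins_per_octave)   # log-spaced centers
    magnitudes = np.zeros(n_bins)
    for k, f_k in enumerate(freqs):
        N_k = int(np.ceil(Q * fs / f_k))       # kernel length shrinks as f_k grows
        n = np.arange(N_k)
        kernel = np.hanning(N_k) * np.exp(-2j * np.pi * f_k * n / fs) / N_k
        magnitudes[k] = np.abs(np.sum(x[:N_k] * kernel))
    return freqs, magnitudes

fs = 22050
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t)
freqs, cq = constant_q_frame(x, fs)    # the peak falls in the bin centered at 440 Hz
```

Because the kernel length is proportional to 1 / f_k, low bins get long windows (fine frequency resolution) and high bins get short windows (fine time resolution).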

SLIDE 25

Constant-Q Filter Bank

[Figure: log-frequency spectrogram (mapping) vs. log-frequency spectrogram (constant-Q transform)]

SLIDE 26

Example: Constant-Q Filter Bank

Müller’s 88-note filter bank
The center frequency is set to each of the 88 piano notes
The bandwidth is set to have constant Q with +/- 25 cents around the center (Müller, 2011)

SLIDE 27

Comparison of Time-Frequency Representations

[Figure: spectrogram (short window), spectrogram (long window), mel spectrogram, and constant-Q transform; axes: time vs. frequency]

SLIDE 28

Auditory Filter Bank

A filter bank that imitates the magnitude and delay of traveling waves on the basilar membrane in the cochlea
Produces a 3-D representation (time-channel-lag) or “auditory images”

[Figure: cochlear filter bank (oval window, high to low frequency) followed by hair cells (HC), auto-correlation functions (ACF), and a stabilize-and-combine stage, yielding the correlogram and summary ACF]

SLIDE 29

Types of Auditory Filter Banks

Gamma-tone filter banks (R. Patterson)
Gamma-tone: $g(t) = a\, t^{n-1} e^{-2\pi b t} \cos(2\pi f t + \phi)\, u(t)$
Pole-Zero Filter Cascade (D. Lyon)
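The gamma-tone impulse response can be sampled directly from its formula. In the sketch below the order n = 4, the bandwidth parameter b = 125 Hz, and the 50 ms duration are illustrative assumptions rather than values from the lecture.

```python
import numpy as np

def gammatone(fs, f, b, n=4, a=1.0, phi=0.0, duration=0.05):
    """Sampled gamma-tone impulse response; u(t) is implied because t starts at 0."""
    t = np.arange(int(fs * duration)) / fs
    envelope = a * t ** (n - 1) * np.exp(-2 * np.pi * b * t)   # gamma-shaped envelope
    return envelope * np.cos(2 * np.pi * f * t + phi)          # tone at the center frequency

fs = 16000
g = gammatone(fs, f=1000.0, b=125.0)
spectrum = np.abs(np.fft.rfft(g, 4096))
peak_hz = np.argmax(spectrum) * fs / 4096    # lies close to the 1000 Hz center frequency
```

Filtering a signal with a bank of such responses, one per center frequency, gives one channel per position along the modeled basilar membrane.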

SLIDE 30

Hair-Cell

(Inner) Hair-cell
Transform mechanical movement into neural spikes
Modeled as cascade of
Half-wave rectification
Compression
Low-pass filtering
This performs non-linear processing
Generate new harmonic partials
Associated with missing fundamentals
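The three-stage cascade can be sketched as below. The square-root compression, the one-pole low-pass filter, and the 1 kHz cutoff are assumptions for illustration; actual hair-cell models differ in each stage.

```python
import numpy as np

def hair_cell(x, fs, cutoff_hz=1000.0, exponent=0.5):
    """Half-wave rectification, power-law compression, then one-pole low-pass filtering."""
    rectified = np.maximum(x, 0.0)          # half-wave rectification
    compressed = rectified ** exponent      # compression (power law is an assumption)
    # One-pole low-pass filter: y[n] = (1 - a) * v[n] + a * y[n - 1]
    a = np.exp(-2 * np.pi * cutoff_hz / fs)
    y = np.zeros_like(compressed)
    previous = 0.0
    for i, v in enumerate(compressed):
        previous = (1 - a) * v + a * previous
        y[i] = previous
    return y

fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 200 * t)     # pure 200 Hz tone: no energy at 400 Hz
y = hair_cell(x, fs)
X = np.abs(np.fft.rfft(x))          # 1 Hz per bin for a one-second signal
Y = np.abs(np.fft.rfft(y))          # the output now has a partial at 400 Hz
```

The input contains only the 200 Hz component, while the output also contains a clear component at 400 Hz, illustrating how the non-linearity generates new harmonic partials.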

SLIDE 31

Example of Correlogram: Piano

SLIDE 32

Example of Correlogram: Rock Music

SLIDE 33

Software

Tools
C++: http://soundlab.cs.princeton.edu/software/sndpeek/
WebAudio: https://musiclab.chromeexperiments.com/Spectrogram
Audacity, Sonic Visualiser, Adobe Audition, Praat, …
Libraries
Librosa (python): http://librosa.github.io/librosa/
Auditory Toolbox (Matlab):

https://engineering.purdue.edu/~malcolm/interval/1998-010/