Audio Representations Graduate School of Culture Technology, KAIST - - PowerPoint PPT Presentation


slide-1
SLIDE 1

GCT634: Musical Applications of Machine Learning

Audio Representations

Graduate School of Culture Technology, KAIST Juhan Nam

slide-2
SLIDE 2

Outline

  • Time-domain Representation
  • Sampling
  • Quantization
  • Time-Frequency Representations
  • Short-time Fourier Transform
  • Spectrogram
  • Mel-Spectrogram
  • Constant-Q transforms
  • Auditory Filterbank
slide-3
SLIDE 3

Music Representations

  • Audio
  • Score

… 0 1 1 0 1 1 0 … … 0 0 1 1 0 1 1 …

slide-4
SLIDE 4

Time-Domain Representation of Audio

  • Musical sounds arrive at microphones as vibrations of air.
  • In a computer, they are converted into a sequence of binary values via Sampling and Quantization.

… 0 1 1 0 1 1 0 …

slide-5
SLIDE 5

Sampling

  • What is an appropriate sampling rate?
  • Too high: increases the data rate
  • Too low: becomes hard to reconstruct the original signal
  • Sampling Theorem
  • Sampling rate must be greater than twice the maximum frequency in the signal in order to reconstruct it fully (or to avoid aliasing)
  • Half the sampling rate ($f_s/2$) is called the Nyquist frequency

$f_s > 2 \cdot f_m$

$f_s$: sampling rate

$f_m$: maximum frequency
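
As a small illustration of what goes wrong when the condition above is violated, the sketch below samples a 5 kHz tone at 8 kHz (Nyquist frequency 4 kHz) and compares it with a 3 kHz tone sampled at the same rate. The function name `sample_cosine` and the chosen frequencies are illustrative assumptions, not part of the slides.

```python
import math

def sample_cosine(freq_hz, fs_hz, num_samples):
    """Return num_samples samples of a unit-amplitude cosine at freq_hz."""
    return [math.cos(2 * math.pi * freq_hz * n / fs_hz) for n in range(num_samples)]

fs = 8000.0              # sampling rate f_s in Hz; the Nyquist frequency is 4000 Hz
f_above = 5000.0         # tone above the Nyquist frequency (violates f_s > 2 * f_m)
f_alias = fs - f_above   # 3000 Hz: the frequency that the 5 kHz tone folds onto

x_above = sample_cosine(f_above, fs, 64)
x_alias = sample_cosine(f_alias, fs, 64)

# Largest sample-by-sample difference between the two sampled tones.
max_diff = max(abs(a - b) for a, b in zip(x_above, x_alias))
print(max_diff)  # numerically negligible: the two tones are indistinguishable
```

Once sampled, the 5 kHz tone carries exactly the same values as the 3 kHz tone, which is why the original signal cannot be reconstructed.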

slide-6
SLIDE 6

Sampling Rate

  • Determined by the bandwidth of signals or hearing limits
  • Audio CD: 44.1 kHz (44100 samples per second)
  • Speech communication: 8 kHz (8000 samples per second)
slide-7
SLIDE 7

Sampling Rate Conversion (Resampling)

  • Up-sampling and Down-sampling
  • 44.1 kHz CD-quality music is often down-sampled to 22.05 kHz or lower rates to reduce the data size for analysis tasks
  • Resampling is computed by interpolation using a low-pass filter (e.g. windowed sinc function)

[Figure: down-sampling and up-sampling]
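
A minimal sketch of down-sampling by a factor of 2 with a windowed sinc low-pass filter, using only the standard library. The filter length (63 taps), the Hann window, and the function names are assumptions chosen for illustration; practical resamplers use more carefully designed filters.

```python
import math

def windowed_sinc_lowpass(cutoff, num_taps):
    """Hann-windowed sinc low-pass FIR filter.

    cutoff is a fraction of the sampling rate (0 to 0.5); num_taps must be odd.
    """
    center = (num_taps - 1) // 2
    taps = []
    for n in range(num_taps):
        t = n - center
        # Ideal low-pass impulse response, with the t == 0 limit handled explicitly.
        ideal = 2 * cutoff if t == 0 else math.sin(2 * math.pi * cutoff * t) / (math.pi * t)
        hann = 0.5 - 0.5 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(ideal * hann)
    gain = sum(taps)
    return [h / gain for h in taps]  # normalize to unit gain at 0 Hz

def downsample_by_2(x, num_taps=63):
    """Low-pass at the new Nyquist frequency (fs/4), then keep every 2nd sample."""
    h = windowed_sinc_lowpass(0.25, num_taps)
    half = (num_taps - 1) // 2
    y = []
    for m in range(0, len(x), 2):
        acc = 0.0
        for k, hk in enumerate(h):
            i = m + half - k  # centered convolution, so the filter adds no delay
            if 0 <= i < len(x):
                acc += hk * x[i]
        y.append(acc)
    return y

def rms(v):
    return math.sqrt(sum(s * s for s in v) / len(v))

fs = 44100
tone_1k = [math.cos(2 * math.pi * 1000 * n / fs) for n in range(2048)]
tone_15k = [math.cos(2 * math.pi * 15000 * n / fs) for n in range(2048)]

y_1k = downsample_by_2(tone_1k)    # below the new Nyquist (11025 Hz): preserved
y_15k = downsample_by_2(tone_15k)  # above the new Nyquist: removed by the filter
print(rms(y_1k[100:900]), rms(y_15k[100:900]))
```

Without the low-pass filter, the 15 kHz tone would alias to 7050 Hz in the 22.05 kHz signal instead of being suppressed.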

slide-8
SLIDE 8

Quantization

  • Discretizing the amplitude of real-valued signals
  • Round the amplitude to the nearest discrete step
  • The discrete steps are determined by the number of bits
  • The range of B-bit quantization: $-2^{B-1}$ to $2^{B-1}-1$
  • Audio CD (16 bits): $-2^{15}$ to $2^{15}-1$ → this is often scaled to -1.0 ~ 1.0
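
The rounding and scaling described above can be sketched as follows. The function names and the choice to clip out-of-range values are assumptions of this example.

```python
def quantize(x, num_bits):
    """Map an amplitude in [-1.0, 1.0] to the nearest B-bit integer level."""
    scale = 2 ** (num_bits - 1)
    lowest, highest = -scale, scale - 1
    q = round(x * scale)                 # round to the nearest discrete step
    return max(lowest, min(highest, q))  # clip to the representable range

def dequantize(q, num_bits):
    """Scale a B-bit integer level back to roughly -1.0 ~ 1.0."""
    return q / 2 ** (num_bits - 1)

q = quantize(0.3, 16)
error = abs(dequantize(q, 16) - 0.3)
print(q, error)
```

For in-range inputs the error is at most half a step, i.e. 1 / 2^16 for 16-bit audio scaled to -1.0 ~ 1.0.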
slide-9
SLIDE 9

Time-Domain Representation of Audio

  • Waveforms are a natural representation of audio but limited in analyzing the content
  • Zoom-in view (e.g. one frame): wave shape is not very intuitive in explaining timbre, particularly when the music is polyphonic
  • Zoom-out view: limited to explaining temporal loudness change

slide-10
SLIDE 10

Time-Frequency Representations of Audio

  • We hear and observe musical sounds in terms of how the content changes over time and frequency

slide-11
SLIDE 11

Short-Time Fourier Transform (STFT)

  • Definition

$X(l, k) = \sum_{n=0}^{N-1} w(n)\, x(n + l \cdot H)\, e^{-j 2\pi k n / N}$

$H$: hop size, $w(n)$: window, $N$: FFT size

  • Computation Steps
  • Take a window (one frame)
  • Compute FFT
  • Shifting by the hop size
  • Repeat above
  • This returns a 2D matrix
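
The computation steps can be sketched in plain Python as below. For readability this uses a direct DFT per frame rather than an FFT, and a periodic Hann window; both choices, and the function names, are assumptions of the sketch. In practice a library routine (e.g. in Librosa) would be used.

```python
import cmath
import math

def hann(N):
    """Periodic Hann window of length N."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / N) for n in range(N)]

def dft(frame):
    """Direct DFT of one frame (O(N^2); an FFT gives the same result faster)."""
    N = len(frame)
    return [sum(frame[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
            for k in range(N)]

def stft(x, N, H):
    """Return X[l][k]: frame index l, frequency bin k."""
    w = hann(N)
    num_frames = 1 + (len(x) - N) // H
    X = []
    for l in range(num_frames):
        # Take one windowed frame starting at sample l * H, then transform it.
        frame = [w[n] * x[n + l * H] for n in range(N)]
        X.append(dft(frame))
    return X

# Test signal: a sinusoid that falls exactly on bin 8 of a 64-point DFT.
N, H = 64, 16
x = [math.sin(2 * math.pi * 8 * n / N) for n in range(256)]
X = stft(x, N, H)

# Strongest bin (among the non-negative frequencies) in every frame.
peak_bins = [max(range(N // 2 + 1), key=lambda k: abs(frame[k])) for frame in X]
print(len(X), len(X[0]), peak_bins)
```

The result is a 2D matrix with one row per frame and one column per frequency bin; taking the magnitude of each entry gives the spectrogram.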

slide-12
SLIDE 12

Windowing

  • Types of window functions
  • Trade-off between the width of main-lobe and the level of side-lobe
  • Tapering windows suppress side-lobe levels at the expense of a wider main lobe.

  • Hann window is the most widely used in music analysis.

[Figure: window shapes (amplitude) and their magnitude spectra (dB) for the rectangular, triangular, Hann, and Blackman windows; spectra of windowed sines]
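
The main-lobe versus side-lobe trade-off can be checked numerically. The sketch below zero-pads a 64-sample rectangular and Hann window to 1024 bins, walks down the main lobe to its first minimum, and reports the highest side lobe after it. The window length, padding size, and function names are illustrative assumptions.

```python
import cmath
import math

def rectangular(N):
    return [1.0] * N

def hann(N):
    """Symmetric Hann window of length N."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]

def spectrum_db(window, num_bins=1024):
    """Zero-padded magnitude spectrum in dB relative to the peak, bins 0 .. num_bins/2 - 1."""
    mags = []
    for k in range(num_bins // 2):
        s = sum(w * cmath.exp(-2j * math.pi * k * n / num_bins)
                for n, w in enumerate(window))
        mags.append(abs(s))
    peak = mags[0]
    return [20 * math.log10(max(m / peak, 1e-12)) for m in mags]

def main_lobe_and_side_lobe(window, num_bins=1024):
    """Return (bin index of the first minimum, highest side-lobe level in dB)."""
    db = spectrum_db(window, num_bins)
    k = 0
    while k + 1 < len(db) and db[k + 1] < db[k]:
        k += 1  # walk down the main lobe until the spectrum starts rising again
    return k, max(db[k:])

rect_edge, rect_side = main_lobe_and_side_lobe(rectangular(64))
hann_edge, hann_side = main_lobe_and_side_lobe(hann(64))
print(rect_edge, rect_side)
print(hann_edge, hann_side)
```

The Hann window shows a main lobe about twice as wide as the rectangular window's, in exchange for a much lower highest side lobe.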

slide-13
SLIDE 13

Time-Frequency Resolution by Window Size

  • Trade-off between time and frequency resolutions
  • Short window: low frequency-resolution and high time-resolution
  • Long window: high frequency-resolution and low time-resolution
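
As a worked example of this trade-off, the sketch below computes the DFT bin spacing (sampling rate divided by window size) and the duration of one frame for a short and a long window. The 22.05 kHz sampling rate and the window sizes are assumed values; the bin spacing is used here as a simple proxy for frequency resolution and ignores the extra widening caused by the window's main lobe.

```python
def stft_resolution(window_size, sample_rate):
    """Return (DFT bin spacing in Hz, frame duration in ms) for one STFT frame."""
    freq_spacing_hz = sample_rate / window_size
    frame_duration_ms = 1000.0 * window_size / sample_rate
    return freq_spacing_hz, frame_duration_ms

short_f, short_t = stft_resolution(256, 22050)
long_f, long_t = stft_resolution(4096, 22050)
print(short_f, short_t)  # coarse in frequency, fine in time
print(long_f, long_t)    # fine in frequency, coarse in time
```

The product of the two quantities is constant, so improving one resolution necessarily worsens the other.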
slide-14
SLIDE 14

Discrete Fourier Transform (DFT)

  • Definition
  • Inner product with the signal and complex sinusoids

$X(k) = x(n) \cdot s_k(n) = \sum_{n=0}^{N-1} x(n)\, e^{-j 2\pi k n / N} = X_R(k) + j X_I(k) = A(k)\, e^{j\phi(k)}$

$s_k(n) = e^{j 2\pi k n / N} = \cos(2\pi k n / N) + j \sin(2\pi k n / N)$ (by Euler’s identity)

  • Magnitude spectrum: $|X(k)| = A(k) = \sqrt{X_R^2(k) + X_I^2(k)}$
  • Phase spectrum: $\angle X(k) = \phi(k) = \tan^{-1}\left(X_I(k) / X_R(k)\right)$
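
The definition translates almost directly into code. The sketch below computes the DFT as an inner product with complex sinusoids and then takes the magnitude and phase; function names are illustrative. Note that `cmath.phase` uses a two-argument arctangent, which resolves the quadrant that a plain tan⁻¹ of the ratio cannot.

```python
import cmath
import math

def dft(x):
    """X(k) = sum over n of x(n) * exp(-j 2 pi k n / N), for k = 0 .. N-1."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
            for k in range(N)]

def magnitude_spectrum(X):
    return [abs(v) for v in X]          # sqrt(real^2 + imag^2)

def phase_spectrum(X):
    return [cmath.phase(v) for v in X]  # atan2(imag, real)

# A cosine with exactly 3 cycles in 16 samples.
N = 16
x = [math.cos(2 * math.pi * 3 * n / N) for n in range(N)]
X = dft(x)
mag = magnitude_spectrum(X)
phs = phase_spectrum(X)
print([round(m, 6) for m in mag])
```

The energy of the real cosine is split between bin 3 and its mirror bin 13, each with magnitude N/2.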

slide-15
SLIDE 15

Fast Fourier Transform (FFT)

  • Matrix multiplication view of DFT
  • “Fast Fourier Transform (FFT)” is an efficient algorithm that computes the matrix multiplication fast
  • Based on “divide and conquer”
  • Complexity reduction: $O(N^2)$ → $O(N \log_2 N)$
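
One well-known divide-and-conquer scheme is the recursive radix-2 decimation-in-time FFT, sketched below for power-of-two lengths and compared against a direct DFT. The slides do not say which FFT variant is meant, so treating it as radix-2 is an assumption of this example.

```python
import cmath
import math

def dft_direct(x):
    """O(N^2) reference implementation."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
            for k in range(N)]

def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    N = len(x)
    if N == 0 or N & (N - 1):
        raise ValueError("length must be a power of two")
    if N == 1:
        return [complex(x[0])]
    even = fft(x[0::2])  # DFT of the even-indexed samples
    odd = fft(x[1::2])   # DFT of the odd-indexed samples
    out = [0j] * N
    for k in range(N // 2):
        t = cmath.exp(-2j * math.pi * k / N) * odd[k]  # twiddle factor
        out[k] = even[k] + t
        out[k + N // 2] = even[k] - t
    return out

x = [1.0, 2.0, 3.0, 4.0, 0.0, -1.0, -2.0, -3.0]
fast = fft(x)
slow = dft_direct(x)
max_error = max(abs(a - b) for a, b in zip(fast, slow))
print(max_error)
```

Each level of recursion halves the problem and costs O(N) to combine, which gives the O(N log₂ N) total.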
slide-16
SLIDE 16

Frequency Scale in Spectrogram

  • Linear frequency scale
  • Great to see the harmonic structure of a single tone.
  • However, spectral distributions are skewed toward low frequency

“Chopin Prelude E minor” “How insensitive”

slide-17
SLIDE 17

Human Pitch Perception

  • The basilar membrane in the cochlea is a rough spectral analyzer
  • It resonates at a different position selectively according to the frequency of the incoming vibration
  • The resonance frequency increases on a log scale along the basilar membrane

[Figure: inner ear and (unrolled) cochlear basilar membrane, with the oval window and round window labeled (P. Cook)]

slide-18
SLIDE 18

Human Pitch Perception

  • Pitch Resolution
  • Just noticeable difference (JND) increases as the frequency goes up
  • Critical bandwidth
  • Frequency bandwidth within which one tone interferes with the perception of another tone by auditory masking
  • Roughly constant at low frequencies but increasing linearly with frequency at high frequencies

slide-19
SLIDE 19

Psychoacoustical Pitch Scales

  • Mel scale
  • Based on pitch ratio of tones
  • Bark scale
  • Critical band measurement by masking
  • Equivalent rectangular bandwidth (ERB) rate
  • Critical band measurement using the notched-noise method

[Figure: comparison of pitch scales; normalized ERB, Mel, and Bark scales plotted against frequency (Hz). Using Matlab code from https://www.speech.kth.se/~giampi/auditoryscales/]

$m = 2595 \log_{10}(1 + f / 700)$

$\mathrm{Bark} = 13 \arctan(0.00076 f) + 3.5 \arctan((f / 7500)^2)$

$\mathrm{ERBS} = 21.4 \log_{10}(1 + 0.00437 f)$
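
The three scales can be written as small conversion functions. The function names are illustrative, and the inverse mel mapping is derived here by solving the mel formula for f. The Bark function assumes the coefficient 0.00076 of the Zwicker and Terhardt formula (0.76 per kHz).

```python
import math

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel, obtained by solving the mel formula for f."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def hz_to_bark(f):
    # Assumed coefficient: 0.00076 (Zwicker and Terhardt).
    return 13.0 * math.atan(0.00076 * f) + 3.5 * math.atan((f / 7500.0) ** 2)

def hz_to_erbs(f):
    return 21.4 * math.log10(1.0 + 0.00437 * f)

for f in (100.0, 1000.0, 10000.0):
    print(f, hz_to_mel(f), hz_to_bark(f), hz_to_erbs(f))
```

All three mappings are close to linear at low frequencies and compress the high frequencies, which is what the comparison plot illustrates.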

slide-20
SLIDE 20

Tuning System in Musical Instrument

  • Equal temperament
  • 1 : $2^{1/12}$ ratio between two adjacent notes
  • Music note (m) and frequency (f) in Hz

https://newt.phys.unsw.edu.au/jw/notes.html

$f = 440 \cdot 2^{(m-69)/12}$

$m = 12 \log_2(f / 440) + 69$
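
A short sketch of the two conversions, where the note number m = 69 corresponds to 440 Hz as in the formulas. The function names are illustrative.

```python
import math

def note_to_hz(m):
    """Frequency in Hz of note number m (m = 69 is 440 Hz)."""
    return 440.0 * 2.0 ** ((m - 69) / 12.0)

def hz_to_note(f):
    """(Possibly fractional) note number of frequency f in Hz."""
    return 12.0 * math.log2(f / 440.0) + 69.0

semitone_ratio = note_to_hz(70) / note_to_hz(69)  # ratio between adjacent notes
print(note_to_hz(60), hz_to_note(880.0), semitone_ratio)
```

Twelve steps of the same ratio double the frequency, so note 81 is exactly one octave above note 69.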

slide-21
SLIDE 21

Log-Spectrogram Using Frequency Mapping

  • Mapping linear scale to a perceptual (log-like) scale
  • Locate center frequencies according to the frequency mapping
  • Linear interpolation on the center frequency with the corresponding bandwidth skirt

[Figure: a linear-frequency spectrogram mapped to a log-frequency spectrogram, with the center frequency and bandwidth of each band indicated]

slide-22
SLIDE 22

Log-Spectrogram Using Frequency Mapping

  • The mapping can be formed as matrix multiplication
  • Each column of the mapping matrix contains the interpolation coefficients
  • Limitation
  • Simple, but the time-frequency resolution is still constrained by the underlying STFT

Y = M ⋅ X

(M: mapping matrix, X: spectrogram, Y: scaled spectrogram)
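
A minimal sketch of building such a mapping matrix and applying it, using the mel scale for the center frequencies and triangular (linear-interpolation) skirts that reach from one neighboring center to the next. The layout is an assumption of this example: M is stored with one row per output band and X with one row per frequency bin, so that Y = M ⋅ X holds as written. Function names and sizes are illustrative.

```python
import math

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mapping_matrix(num_bands, fft_size, sample_rate):
    """M[i][k]: weight of linear-frequency bin k in output band i."""
    num_bins = fft_size // 2 + 1
    mel_max = hz_to_mel(sample_rate / 2.0)
    # num_bands + 2 edge frequencies, equally spaced on the mel scale.
    edges = [mel_to_hz(mel_max * i / (num_bands + 1)) for i in range(num_bands + 2)]
    M = [[0.0] * num_bins for _ in range(num_bands)]
    for i in range(num_bands):
        lo, center, hi = edges[i], edges[i + 1], edges[i + 2]
        for k in range(num_bins):
            f = k * sample_rate / fft_size
            if lo < f <= center:
                M[i][k] = (f - lo) / (center - lo)   # rising skirt
            elif center < f < hi:
                M[i][k] = (hi - f) / (hi - center)   # falling skirt
    return M

def apply_mapping(M, X):
    """Y = M . X, where X[k][t] is bin k at frame t."""
    num_frames = len(X[0])
    return [[sum(M[i][k] * X[k][t] for k in range(len(X))) for t in range(num_frames)]
            for i in range(len(M))]

M = mapping_matrix(num_bands=20, fft_size=1024, sample_rate=22050)
X = [[1.0, 2.0] for _ in range(513)]  # toy spectrogram: 513 bins, 2 frames
Y = apply_mapping(M, X)
print(len(Y), len(Y[0]))
```

The output has far fewer frequency bands than the input has bins, but each band is still computed from STFT bins of a fixed width, which is the limitation noted above.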

slide-23
SLIDE 23

Mel-Frequency Spectrogram

  • Mel-scaled spectrogram is widely used for music classification
  • Usually mapped to a smaller number of mel-scaled bins than the FFT size

Linear-Frequency Spectrogram Mel-Frequency Spectrogram

slide-24
SLIDE 24

Constant-Q Transform

  • Use a set of sinusoidal kernels with:
  • Logarithmically spaced frequencies
  • Constant Q = frequency/bandwidth
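
To make the two properties concrete, the sketch below lists center frequencies, bandwidths, and kernel lengths for a constant-Q analysis. It assumes one common convention in which the bandwidth equals the spacing between adjacent bins, giving Q = 1 / (2^(1/b) − 1) for b bins per octave, and a kernel length of Q · f_s / f_k samples. These formulas and the parameter values are assumptions, not taken from the slides.

```python
import math

def cqt_parameters(f_min, bins_per_octave, num_bins, sample_rate):
    """Return (Q, [(center_hz, bandwidth_hz, kernel_length), ...])."""
    # Assumed convention: bandwidth equals the spacing to the next bin.
    Q = 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)
    params = []
    for k in range(num_bins):
        f_k = f_min * 2.0 ** (k / bins_per_octave)    # logarithmically spaced
        bandwidth = f_k / Q                           # constant Q = f / bandwidth
        kernel_length = math.ceil(Q * sample_rate / f_k)
        params.append((f_k, bandwidth, kernel_length))
    return Q, params

Q, params = cqt_parameters(f_min=55.0, bins_per_octave=12, num_bins=24, sample_rate=22050)
print(Q)
print(params[0], params[-1])
```

Low-frequency bins get long kernels (fine frequency resolution) and high-frequency bins get short ones (fine time resolution), unlike the STFT, where every bin shares one window length.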
slide-25
SLIDE 25

Constant-Q Filter Bank

Log-Frequency Spectrogram (mapping) Log-Frequency Spectrogram (Constant-Q transform)

slide-26
SLIDE 26

Example: Constant-Q Filter Bank

  • Müller’s 88-note filter bank
  • The center frequency is set to each of 88 piano notes
  • The bandwidth is set to have constant-Q with ±25 cents around the center (Müller, 2011)
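
A sketch of the band layout only (not the filters themselves), under two assumptions: the 88 piano notes are taken as note numbers 21 to 108 with note 69 at 440 Hz, and "±25 cents" is read as the lower and upper band edges around each center frequency.

```python
def piano_filter_bank(cents=25.0):
    """Return [(note_number, center_hz, low_edge_hz, high_edge_hz), ...] for 88 notes."""
    bands = []
    for m in range(21, 109):  # assumed range: A0 (21) to C8 (108)
        center = 440.0 * 2.0 ** ((m - 69) / 12.0)
        low = center * 2.0 ** (-cents / 1200.0)   # 25 cents below the center
        high = center * 2.0 ** (cents / 1200.0)   # 25 cents above the center
        bands.append((m, center, low, high))
    return bands

bands = piano_filter_bank()
q_values = [center / (high - low) for _, center, low, high in bands]
print(len(bands), bands[0][1], bands[-1][1], q_values[0])
```

Because the band edges are defined as fixed ratios of the center frequency, the ratio of center frequency to bandwidth is the same for every note.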

slide-27
SLIDE 27

Comparison of Time-Frequency Representations

[Figure: four panels, each plotted as frequency against time: spectrogram (short window), spectrogram (long window), mel spectrogram, and constant-Q transform]

slide-28
SLIDE 28

Auditory Filter Bank

  • A bank of filters that imitates the magnitude and delay of traveling waves on the basilar membrane in the cochlea
  • Produces a 3-D representation (time-channel-lag), or “auditory images”

[Figure: the input enters at the oval window and passes through cochlear filter banks ordered from high to low frequency; each channel is followed by a hair cell (HC) model and an auto-correlation function (ACF); the outputs are stabilized and combined into a correlogram and a summary ACF]

slide-29
SLIDE 29

Types of Auditory Filter Banks

  • Gamma-tone Filter banks (R. Patterson)
  • Gamma-tone: $g(t) = a\, t^{n-1} e^{-2\pi b t} \cos(2\pi f t + \phi)\, u(t)$
  • Pole-Zero Filter Cascade (D. Lyon)
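
The gamma-tone impulse response can be sampled directly from its formula. In the sketch below, the default order n = 4 and the default bandwidth b = 1.019 · ERB(f), with ERB(f) = 24.7 (4.37 f / 1000 + 1), are commonly used values that are assumed here rather than taken from the slides; the function name is illustrative.

```python
import math

def gammatone(fc, sample_rate, duration, order=4, b=None, a=1.0, phi=0.0):
    """Sample g(t) = a t^(n-1) exp(-2 pi b t) cos(2 pi fc t + phi) for t >= 0."""
    if b is None:
        # Assumed default: bandwidth tied to the equivalent rectangular bandwidth.
        erb = 24.7 * (4.37 * fc / 1000.0 + 1.0)
        b = 1.019 * erb
    num_samples = int(round(duration * sample_rate))
    g = []
    for i in range(num_samples):
        t = i / sample_rate  # u(t) is implicit: only t >= 0 is generated
        envelope = a * t ** (order - 1) * math.exp(-2.0 * math.pi * b * t)
        g.append(envelope * math.cos(2.0 * math.pi * fc * t + phi))
    return g

g = gammatone(fc=1000.0, sample_rate=16000, duration=0.05)
peak_index = max(range(len(g)), key=lambda i: abs(g[i]))
print(len(g), peak_index)
```

A filter bank is obtained by repeating this for a set of center frequencies, typically spaced on an auditory scale.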

slide-30
SLIDE 30

Hair-Cell

  • (Inner) Hair-cell
  • Transform mechanical movement into neural spikes
  • Modeled as a cascade of
  • Half-wave rectification
  • Compression
  • Low-pass filtering
  • This is a non-linear process
  • Generate new harmonic partials
  • Associated with missing fundamentals
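
The cascade can be sketched as three simple stages. The power-law compression, its exponent, the one-pole low-pass filter, and its cutoff frequency are all hypothetical choices made for illustration; the slides do not specify them.

```python
import math

def hair_cell(x, sample_rate, cutoff_hz=1000.0, exponent=0.3):
    """Half-wave rectification, then compression, then a one-pole low-pass filter."""
    alpha = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / sample_rate)
    y = []
    state = 0.0
    for v in x:
        rectified = max(v, 0.0)                 # half-wave rectification
        compressed = rectified ** exponent      # hypothetical power-law compression
        state += alpha * (compressed - state)   # one-pole low-pass smoothing
        y.append(state)
    return y

sample_rate = 16000
x = [math.sin(2.0 * math.pi * 200.0 * n / sample_rate) for n in range(1600)]
y = hair_cell(x, sample_rate)

mean_in = sum(x) / len(x)
mean_out = sum(y) / len(y)
print(mean_in, mean_out)
```

The input sinusoid has zero mean, while the output is non-negative with a clear offset: the non-linearity has created components (a constant term and harmonics) that were not in the input.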
slide-31
SLIDE 31

Example of Correlogram: Piano

slide-32
SLIDE 32

Example of Correlogram: Rock Music

slide-33
SLIDE 33

Software

  • Tools
  • C++: http://soundlab.cs.princeton.edu/software/sndpeek/
  • WebAudio: https://musiclab.chromeexperiments.com/Spectrogram
  • Audacity, Sonic Visualiser, Adobe Audition, Praat, …
  • Libraries
  • Librosa (python): http://librosa.github.io/librosa/
  • Auditory Toolbox (Matlab): https://engineering.purdue.edu/~malcolm/interval/1998-010/