GCT634/AI613: Musical Applications of Machine Learning (Fall 2020)
Audio Data Representations
Juhan Nam
Types of Music Data
- Audio
○ MP3, WAV
- Score (symbolic)
○ MIDI, typesetting script languages (e.g., MusicXML)
- Image
○ Score (scanned image), album/playlist cover, performance video
- Text
○ Metadata, tags, lyrics, reviews
- User Data
○ Listening history, ratings, favorites
Types of Audio Data Representations
- Waveform (digital audio samples): sampling and quantization
- Spectrogram: short-time Fourier transform
- Mel-spectrogram: human pitch perception
- Constant-Q transform: transform into musical (chromatic) scale
Digital Audio Chain
[Figure: microphone → lowpass filter → sampling → quantization (analog-to-digital conversion) → storage, processing (…0 0 1 0 1 0…) → digital-to-analog conversion → lowpass filter → amplifier → loudspeaker]
Sampling and Quantization
Sampling
- Convert continuous-time signals to discrete-time signals by periodically picking up the instantaneous values
○ Represented as a sequence of numbers
○ Sampling period ($T_s$): the amount of time between samples
○ Sampling rate ($f_s = 1/T_s$)
○ Signal notation: $x(t) \to x(nT_s)$
Sampling Theorem
- What is an appropriate sampling rate?
○ Too high: increases the data size in the digital domain
○ Too low: the original signal cannot be reconstructed
- Sampling Theorem
○ The sampling rate must be greater than twice the maximum frequency in the signal in order to reconstruct the original signal: $f_s > 2 f_m$
■ $f_s$: sampling rate
■ $f_m$: maximum frequency of the signal
○ Half the sampling rate is called the Nyquist frequency ($f_s/2$)
Sampling in the Frequency Domain
[Figure: spectrum of a signal band-limited to $\pm f_m$, and its spectrum after sampling, with copies repeated around $\pm f_s$ (spanning $f_s - f_m$ to $f_s + f_m$). When $f_s > 2 f_m$ the copies do not overlap; when $f_s < 2 f_m$ they overlap (alias).]
- The high-frequency content above the Nyquist frequency is folded over (aliasing)
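The folding described above can be checked numerically. The following is a minimal sketch, not part of the original slides: the function names and the 8 kHz / 5 kHz values are chosen only for illustration.

```python
import math

def sample_sine(freq_hz, fs_hz, num_samples):
    """Return x(n*Ts) for x(t) = sin(2*pi*freq*t), with Ts = 1/fs."""
    ts = 1.0 / fs_hz  # sampling period
    return [math.sin(2 * math.pi * freq_hz * n * ts) for n in range(num_samples)]

def alias_frequency(freq_hz, fs_hz):
    """Frequency in [0, fs/2] at which a sampled tone of freq_hz appears."""
    folded = freq_hz % fs_hz
    return min(folded, fs_hz - folded)

fs = 8000.0  # illustrative sampling rate (Nyquist frequency: 4000 Hz)
above_nyquist = sample_sine(5000.0, fs, 16)
folded_tone = sample_sine(alias_frequency(5000.0, fs), fs, 16)
```

With these values the 5 kHz tone yields the same samples as a 3 kHz tone up to a sign flip, so the two cannot be told apart after sampling.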
Sampling Rate
- Determined by the bandwidth of signals or hearing limits
○ Music (CD): 44.1 kHz (consumer) or 48/96/192 kHz (professional)
○ Speech communication: 8 kHz
[Figure: music and speech examples]
Sampling Rate Conversion (Resampling)
- We often increase or decrease the sampling rate
○ 44.1 kHz CD-quality music is often down-sampled to 22.05 kHz or even lower rates to reduce the data size
○ Computed by signal interpolation
■ In down-sampling, interpolation is preceded by a low-pass filter to avoid aliasing noise
■ Windowed sinc function
■ https://ccrma.stanford.edu/~jos/resample/
[Figure: down-sampling and up-sampling]
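As a rough illustration of low-pass filtering before down-sampling, here is a sketch that halves the sampling rate using a Hann-windowed sinc filter. The filter length (63 taps), the cutoff (0.22 cycles per sample, slightly below the new Nyquist frequency of 0.25), and the function names are assumptions for this example. It is not the algorithm of the linked resampling page or of any particular library.

```python
import math

def windowed_sinc_lowpass(cutoff, num_taps):
    """Hann-windowed sinc low-pass FIR; cutoff is in cycles per sample (< 0.5)."""
    assert num_taps % 2 == 1, "use an odd length so the filter has a center tap"
    mid = num_taps // 2
    taps = []
    for i in range(num_taps):
        t = i - mid
        ideal = 2 * cutoff if t == 0 else math.sin(2 * math.pi * cutoff * t) / (math.pi * t)
        hann = 0.5 - 0.5 * math.cos(2 * math.pi * i / (num_taps - 1))
        taps.append(ideal * hann)
    gain = sum(taps)  # normalize for unity gain at 0 Hz
    return [h / gain for h in taps]

def downsample_by_2(signal, num_taps=63, cutoff=0.22):
    """Low-pass filter (zero-padded at the edges), then keep every second sample."""
    taps = windowed_sinc_lowpass(cutoff, num_taps)
    mid = num_taps // 2
    filtered = []
    for n in range(len(signal)):
        acc = 0.0
        for i, h in enumerate(taps):
            j = n + mid - i  # centered convolution, so there is no delay
            if 0 <= j < len(signal):
                acc += h * signal[j]
        filtered.append(acc)
    return filtered[::2]
```

A tone well below the cutoff passes almost unchanged, while a tone above the new Nyquist frequency is strongly attenuated instead of aliasing.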
Quantization
- Discretizing the amplitude of real-valued signals
○ Round the amplitude to the nearest discrete step
○ The discrete steps are determined by the number of bits (bit depth)
■ N bits cover the range $-2^{N-1}$ to $2^{N-1}-1$: 8 bits ($-128$ to $127$), 16 bits ($-32768$ to $32767$)
[Figure: quantization steps between $-2^{N-1}$ and $2^{N-1}-1$]
Quantization
- Determined by the dynamic range of signals
○ Each additional bit adds about 6 dB of dynamic range: N bits → about 6N dB
○ Music (CD): 16 bits (consumer) → 96 dB
○ Speech communication: 8 bits → 48 dB
[Figure: music and speech examples]
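The rounding rule and the 6 dB-per-bit figure can be sketched as follows. The function names are hypothetical, and the input is assumed to be normalized to the range -1.0 to 1.0.

```python
import math

def quantize(x, num_bits):
    """Round x (assumed in [-1.0, 1.0]) to the nearest integer step of an N-bit range."""
    half_range = 2 ** (num_bits - 1)
    step_index = round(x * half_range)
    # Clip so the result stays within -2^(N-1) .. 2^(N-1)-1
    return max(-half_range, min(half_range - 1, step_index))

def dynamic_range_db(num_bits):
    """Ratio between the full range and one step, in dB (about 6.02 dB per bit)."""
    return 20 * math.log10(2 ** num_bits)
```

For 16 bits this gives roughly 96.3 dB, which the slides round to 96 dB.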
Loading Audio Files
- Check the sampling rate and bit depth
○ You can check them using audio software such as Audacity
- Do resampling (usually down-sampling) if necessary
○ Librosa provides resampling when loading audio files
Waveform
- A waveform is a natural representation of audio but is of limited use for analyzing the content
○ It mainly shows how the energy changes over time
Spectrogram
- 2D-image representation of audio using short-time Fourier transform
○ x-axis: time, y-axis: frequency, color: magnitude response
○ It is common to use a dB scale (a log scale) for the magnitude
○ Easy to match what you hear to what you see
- For each short segment (frame)
○ Take a window (one frame)
○ Compute the DFT (FFT)
○ Convert the result to polar coordinates
■ Magnitude and phase
○ Compress the magnitude
■ $20\log_{10} X_{mag}$: decibel
○ Shift by a hop size
- Spectrogram parameters
○ Window size (FFT size)
○ Hop size
○ Window type
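The per-frame steps above can be sketched in plain Python. This is an illustrative version only: it uses a direct DFT rather than an FFT, a Hann window, and a small magnitude floor to avoid taking the log of zero. Those choices and the function names are assumptions.

```python
import cmath
import math

def dft(frame):
    """Direct DFT of one frame (O(N^2); an FFT would be used in practice)."""
    size = len(frame)
    return [sum(frame[n] * cmath.exp(-2j * math.pi * k * n / size) for n in range(size))
            for k in range(size)]

def spectrogram_db(signal, window_size, hop_size, floor=1e-10):
    """Return a list of frames, each a list of dB magnitudes for bins 0..N/2."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / window_size) for n in range(window_size)]
    frames = []
    for start in range(0, len(signal) - window_size + 1, hop_size):  # shift by hop size
        frame = [signal[start + n] * window[n] for n in range(window_size)]  # windowing
        spectrum = dft(frame)[: window_size // 2 + 1]  # keep non-negative frequencies
        frames.append([20 * math.log10(max(abs(c), floor)) for c in spectrum])  # magnitude in dB
    return frames
```

For a pure tone that falls exactly on a frequency bin, every frame peaks at that bin, which appears as a horizontal line in the spectrogram.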
Computing Spectrogram
[Figure: short-time Fourier transform (STFT). Successive frames $x(l-1)$, $x(l)$ are taken with a given window size and hop size; each frame goes through windowing → DFT → $X_{mag}$ → magnitude compression → $X_{spec}$]
Discrete Fourier Transform (DFT)
- Find the frequency (sinusoidal) components of $x(n)$
- Represent $x(n)$ as $x(n) = \sum_{k=0}^{N-1} \frac{1}{N} A(k) \cos\left(\frac{2\pi k n}{N} + \phi(k)\right)$
○ $A(k)$: amplitude (or magnitude) of the sinusoid
○ $\phi(k)$: phase of the sinusoid
○ $N$: size of the DFT or the input segment
○ $k$: frequency bin index (0 to $N-1$); $\frac{k}{N} f_s$ is the frequency at each frequency bin
- The DFT provides a way of finding $A(k)$ and $\phi(k)$
[Image: Pink Floyd, "The Dark Side of the Moon"]
Discrete Fourier Transform (DFT)
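- Use the orthogonality of sinusoids (for bin indices $k$, $l$ other than $0$ and $N/2$)
○ $\sum_{n=0}^{N-1} \cos\left(\frac{2\pi k n}{N}\right) \cos\left(\frac{2\pi l n}{N}\right) = \begin{cases} N/2 & \text{if } k = l \text{ or } k = N - l \text{ (equivalent to } -l\text{)} \\ 0 & \text{otherwise} \end{cases}$
○ $\sum_{n=0}^{N-1} \cos\left(\frac{2\pi k n}{N}\right) \sin\left(\frac{2\pi l n}{N}\right) = 0$
○ $\sum_{n=0}^{N-1} \sin\left(\frac{2\pi k n}{N}\right) \sin\left(\frac{2\pi l n}{N}\right) = \begin{cases} N/2 & \text{if } k = l \\ -N/2 & \text{if } k = N - l \text{ (equivalent to } -l\text{)} \\ 0 & \text{otherwise} \end{cases}$
- The inner product (or correlation) between two cosines or two sines:
○ If the frequencies are the same (including opposite signs), it is non-zero
○ Otherwise, it is zero (they are orthogonal to each other)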
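These orthogonality relations are easy to verify numerically. The helper below is a hypothetical sketch that computes the inner product of two sampled sinusoids over one DFT length.

```python
import math

def inner_product(kind_a, k, kind_b, l, size):
    """Sum over n of a(2*pi*k*n/N) * b(2*pi*l*n/N), where a and b are 'cos' or 'sin'."""
    funcs = {"cos": math.cos, "sin": math.sin}
    return sum(funcs[kind_a](2 * math.pi * k * n / size) *
               funcs[kind_b](2 * math.pi * l * n / size)
               for n in range(size))
```

With $N = 16$, calling it with $k = l = 3$ for two cosines gives about $N/2 = 8$, while $k = 3$, $l = 5$ gives about zero.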
Discrete Fourier Transform (DFT)
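- Inner product of the input with sinusoids (for a real-valued $x(n)$, using the orthogonality of sinusoids)
○ $X_{re}(l) = \sum_{n=0}^{N-1} x(n) \cos\left(\frac{2\pi l n}{N}\right) = \sum_{n=0}^{N-1} \left( \sum_{k=0}^{N-1} \frac{1}{N} A(k) \cos\left(\frac{2\pi k n}{N} + \phi(k)\right) \right) \cos\left(\frac{2\pi l n}{N}\right) = A(l) \cos\phi(l)$
○ $X_{im}(l) = -\sum_{n=0}^{N-1} x(n) \sin\left(\frac{2\pi l n}{N}\right) = -\sum_{n=0}^{N-1} \left( \sum_{k=0}^{N-1} \frac{1}{N} A(k) \cos\left(\frac{2\pi k n}{N} + \phi(k)\right) \right) \sin\left(\frac{2\pi l n}{N}\right) = A(l) \sin\phi(l)$
- The magnitude and phase (writing the bin index as $k$ again)
○ $X_{mag}(k) = A(k) = \sqrt{X_{re}^2(k) + X_{im}^2(k)}$, $X_{phase}(k) = \phi(k) = \tan^{-1}\left(\frac{X_{im}(k)}{X_{re}(k)}\right)$
- The definition of the DFT can be simplified using complex sinusoids
○ $X(k) = \sum_{n=0}^{N-1} x(n) e^{-j 2\pi k n / N} = X_{re}(k) + j X_{im}(k) = A(k) e^{j\phi(k)}$
○ Euler's identity: $e^{j 2\pi k n / N} = \cos\left(\frac{2\pi k n}{N}\right) + j \sin\left(\frac{2\pi k n}{N}\right)$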
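A direct, unoptimized sketch of these formulas follows. The function name is hypothetical, and `math.atan2` is used in place of a plain arctangent so that the phase lands in the correct quadrant.

```python
import math

def dft_polar(x):
    """Return a list of (magnitude, phase) pairs, one per frequency bin k."""
    size = len(x)
    result = []
    for k in range(size):
        # Inner products of the input with a cosine and a (negated) sine at bin k
        x_re = sum(x[n] * math.cos(2 * math.pi * k * n / size) for n in range(size))
        x_im = -sum(x[n] * math.sin(2 * math.pi * k * n / size) for n in range(size))
        result.append((math.hypot(x_re, x_im), math.atan2(x_im, x_re)))
    return result

size = 16
signal = [math.cos(2 * math.pi * 3 * n / size + 0.5) for n in range(size)]
polar = dft_polar(signal)
```

For this unit-amplitude cosine at bin 3 with phase 0.5, the magnitude at bins 3 and 13 is $N/2 = 8$ and the phases are $+0.5$ and $-0.5$. With the $1/N$ factor of the synthesis formula, the two bins together add up to an amplitude of 1.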
Discrete Fourier Transform (DFT)
- Can be viewed as matrix multiplication
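- In practice, we use an FFT algorithm instead of direct multiplication
○ Divide the matrix into small matrices recursively
○ Complexity reduction: $O(N^2) \to O(N \log_2 N)$
[Figure: $x(n)$ is multiplied by the matrices $W_{cos}$ and $W_{sin}$ to give $X_{re}$ and $X_{im}$, which are converted to polar form ($X_{mag}$, $X_{phase}$); the rows are the complex sinusoids $s_k^*(n) = e^{-j 2\pi k n / N}$]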
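One common way to do the recursive split is the radix-2 decimation-in-time FFT, sketched below next to the direct $O(N^2)$ sum. The slides do not say which FFT variant they have in mind, so treat this as one possible instance. It assumes the length is a power of two.

```python
import cmath

def dft_direct(x):
    """Direct DFT: one inner product per bin, O(N^2) operations."""
    size = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / size) for n in range(size))
            for k in range(size)]

def fft_radix2(x):
    """Recursive radix-2 FFT, O(N log2 N); len(x) must be a power of two."""
    size = len(x)
    if size == 1:
        return [complex(x[0])]
    assert size % 2 == 0, "length must be a power of two"
    even = fft_radix2(x[0::2])  # DFT of even-indexed samples
    odd = fft_radix2(x[1::2])   # DFT of odd-indexed samples
    half = size // 2
    twiddled = [cmath.exp(-2j * cmath.pi * k / size) * odd[k] for k in range(half)]
    return ([even[k] + twiddled[k] for k in range(half)] +
            [even[k] - twiddled[k] for k in range(half)])
```

Both functions return the same spectrum up to floating-point error; only the amount of work differs.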
Discrete Fourier Transform (DFT)
- When DFT is applied to musical sounds
○ A musical tone with pitch has a periodic waveform
○ The DFT shows a harmonic spectrum (harmonic overtones)
○ Pitch information can also be extracted
○ The magnitude spectrum is generally sparser than the waveform
[Figure: waveform $x(n)$ and magnitude spectrum $X_{mag}(k)$ with peaks at $F_0$, $2F_0$, $3F_0$]
Effect of Window Type
- Types of window functions
○ Trade-off between the width of the main lobe and the level of the side lobes
○ The Hann window is the most widely used in music analysis
[Figure: spectra of windowed single sinusoids. Panels: rectangular, triangular, Hann, and Blackman windows; each shows the window amplitude and its magnitude (dB)]
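The side-lobe difference can be seen with a small experiment: take a sinusoid that falls between two bins and compare the leakage at a distant bin with and without a Hann window. The signal length, the tone frequency (10.5 bins), and the probe bin (25) are arbitrary choices for this sketch.

```python
import cmath
import math

def magnitude_db_at_bin(x, k):
    """Magnitude (dB) of the DFT of x at bin k."""
    size = len(x)
    value = sum(x[n] * cmath.exp(-2j * math.pi * k * n / size) for n in range(size))
    return 20 * math.log10(max(abs(value), 1e-12))

size = 64
tone = [math.cos(2 * math.pi * 10.5 * n / size) for n in range(size)]  # between bins 10 and 11
hann = [0.5 - 0.5 * math.cos(2 * math.pi * n / size) for n in range(size)]

rect_leak_db = magnitude_db_at_bin(tone, 25)  # rectangular window (no weighting)
hann_leak_db = magnitude_db_at_bin([t * w for t, w in zip(tone, hann)], 25)
```

The Hann-windowed leakage far from the peak is much lower than the rectangular one. The price is a wider main lobe around the peak.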
Effect of Window Size
- Trade-off between time and frequency resolutions
○ Short window: low frequency resolution and high time resolution
○ Long window: high frequency resolution and low time resolution
[Figure: spectrograms with Hop=128, N=256 and with Hop=128, N=4096]
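To put numbers on this trade-off, the sketch below computes the bin spacing ($f_s/N$) and the window duration ($N/f_s$) for the two window sizes in the figure. The sampling rate of 22.05 kHz is an assumption made for the example; the slide does not state it.

```python
def resolution(window_size, fs_hz):
    """Return (frequency bin spacing in Hz, window duration in ms)."""
    return fs_hz / window_size, 1000.0 * window_size / fs_hz

fs = 22050.0  # assumed sampling rate
short_window = resolution(256, fs)   # coarse in frequency, fine in time
long_window = resolution(4096, fs)   # fine in frequency, coarse in time
```

Under this assumption, N=256 gives bins about 86 Hz apart over an 11.6 ms window, while N=4096 gives bins about 5.4 Hz apart over a window of roughly 186 ms.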
Human Ears
- Is the human ear a spectrum analyzer?
○ The ear has a complicated pathway from the eardrum to the auditory nerve
○ The cochlea in the inner ear acts as a bandpass filter bank
○ The membrane resonates at a different position depending on the frequency of the input; the resonance frequency increases on a log scale along the membrane
[Figure: (unrolled) cochlear membrane]
Human Pitch Perception
- Pitch Resolution
○ Just noticeable difference (JND) increases as the frequency goes up
- Mel scale
○ Approximates human pitch resolution based on the pitch ratio of tones
○ Most widely used for speech and music analysis
○ A log frequency scale: $m = 2595 \log_{10}(1 + f/700)$
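The mel formula and its inverse can be written directly. The function names are illustrative.

```python
import math

def hz_to_mel(freq_hz):
    """m = 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel: f = 700 * (10^(m / 2595) - 1)."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)
```

With these constants, 1000 Hz maps to approximately 1000 mel, and equal steps in mel correspond to increasingly wide steps in Hz as the frequency goes up.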
Computing Mel-Spectrogram
- Mapping linear frequency to mel scale
○ A mel-scaled filter bank is used: each filter interpolates linearly around its center frequency, with a skirt of the corresponding bandwidth
○ The high-frequency range is zoomed out and the low-frequency range is relatively zoomed in; the number of frequency bins is usually smaller
[Figure: spectrogram (1024 freq. bins), mel-spectrogram (128 mel bins), and the mel-scaled filter bank (axes: center frequency, bandwidth)]
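A triangular mel filter bank can be sketched as below. This is a simplified construction under stated assumptions: filter edges are equally spaced on the mel scale from 0 Hz to the Nyquist frequency, and no area normalization is applied. Library implementations may differ in these details.

```python
import math

def hz_to_mel(freq_hz):
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filter_bank(num_mels, fft_size, fs_hz):
    """Return num_mels rows of triangular weights over FFT bins 0..fft_size/2."""
    num_bins = fft_size // 2 + 1
    bin_freqs = [k * fs_hz / fft_size for k in range(num_bins)]
    max_mel = hz_to_mel(fs_hz / 2.0)
    # num_mels + 2 edge frequencies, equally spaced in mel
    edges = [mel_to_hz(i * max_mel / (num_mels + 1)) for i in range(num_mels + 2)]
    bank = []
    for m in range(num_mels):
        low, center, high = edges[m], edges[m + 1], edges[m + 2]
        row = []
        for f in bin_freqs:
            rising = (f - low) / (center - low)     # 0 at low edge, 1 at center
            falling = (high - f) / (high - center)  # 1 at center, 0 at high edge
            row.append(max(0.0, min(rising, falling)))
        bank.append(row)
    return bank

def apply_filter_bank(bank, magnitudes):
    """Map one frame of linear-frequency magnitudes to mel bands."""
    return [sum(w * v for w, v in zip(row, magnitudes)) for row in bank]
```

The low filters cover only a few FFT bins while the high filters cover many, which is the zoom-in and zoom-out effect described above.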
Musical scale
- Musical tuning system
○ Equal temperament: frequency ratio of $1 : 2^{1/12}$ per semitone
○ Music note number ($m$) and frequency ($f$) in Hz: $f = 440 \cdot 2^{(m-69)/12}$, $m = 12 \log_2(f/440) + 69$
https://newt.phys.unsw.edu.au/jw/notes.html
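The two conversion formulas translate directly into code. In this sketch the note number $m$ is read as a MIDI note number, where 69 is A4 = 440 Hz. The function names are illustrative.

```python
import math

def note_to_hz(note_number):
    """f = 440 * 2^((m - 69) / 12)."""
    return 440.0 * 2.0 ** ((note_number - 69) / 12.0)

def hz_to_note(freq_hz):
    """m = 12 * log2(f / 440) + 69 (may be fractional for out-of-tune pitches)."""
    return 12.0 * math.log2(freq_hz / 440.0) + 69.0
```

For instance, note 60 (middle C) comes out at about 261.63 Hz, and doubling a frequency always raises the note number by 12.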
Review
- Now we know that we need a log-scale for music
- The log-scale filter bank will look like this
- Question:
○ Can we obtain the log-frequency scale spectrogram directly from waveforms using a time-frequency representation?
[Figure: log-scaled filter bank (axes: center frequency, bandwidth)]
Constant-Q Transform
- Time-frequency representation which uses a set of sinusoidal kernels with log-spaced frequencies
○ As the frequency increases, the sinusoidal kernels become shorter (the bandwidth becomes wider) so that Q (= frequency/bandwidth) stays constant
(Schörkhuber and Klapuri, 2010)
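The relation between Q and kernel length can be illustrated with a short calculation. The sketch assumes that the bandwidth of bin $k$ equals the spacing to the next bin, so $Q = 1/(2^{1/B} - 1)$ for $B$ bins per octave, and that the kernel length is $Q f_s / f_k$ samples. These are common textbook choices, not necessarily the exact ones in the cited paper. The minimum frequency and sampling rate are arbitrary.

```python
import math

def constant_q(bins_per_octave):
    """Q = f_k / (f_{k+1} - f_k) for geometrically spaced center frequencies."""
    return 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)

def kernel_lengths(f_min_hz, num_bins, bins_per_octave, fs_hz):
    """Kernel length in samples for each bin: about Q periods of its center frequency."""
    q = constant_q(bins_per_octave)
    lengths = []
    for k in range(num_bins):
        f_k = f_min_hz * 2.0 ** (k / bins_per_octave)  # log-spaced center frequency
        lengths.append(math.ceil(q * fs_hz / f_k))
    return lengths

# Illustrative values: 3 octaves above 55 Hz, 12 bins per octave, 22.05 kHz
lengths = kernel_lengths(55.0, 37, 12, 22050.0)
```

With 12 bins per octave, Q is about 16.8, and each kernel is half as long as the one an octave below it.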
Constant-Q IIR Filter Bank
- Musically designed constant-Q transform
○ 88 IIR bandpass filters
○ The center frequency corresponds to the pitch of each piano note
○ The bandwidth is set to have constant Q, with ±25 cents around the center (100 cents = 1 semitone)
(Müller, 2011)
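The band edges of such a filter bank follow from the cent definition. The sketch below only computes center frequencies, band edges, and Q for the 88 piano notes (taken here as MIDI notes 21 to 108, A0 to C8). It does not design the IIR filters themselves, and the function name is hypothetical.

```python
def piano_filter_specs(half_width_cents=25.0):
    """Return (center, low edge, high edge, Q) in Hz for each of the 88 piano notes."""
    specs = []
    for midi_note in range(21, 109):  # A0 (27.5 Hz) to C8
        center = 440.0 * 2.0 ** ((midi_note - 69) / 12.0)
        low = center * 2.0 ** (-half_width_cents / 1200.0)   # 25 cents below
        high = center * 2.0 ** (half_width_cents / 1200.0)   # 25 cents above
        specs.append((center, low, high, center / (high - low)))
    return specs
```

Because the edges are a fixed ratio away from each center, the bandwidth grows in proportion to the center frequency, and Q comes out the same (about 34.6) for every filter.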
Example: Constant-Q Transform
- Chromatic music scale
○ The harmonics of notes increase linearly in the constant-Q transform
[Figure: spectrogram vs. constant-Q transform]