Audio Data Representations Juhan Nam Types of Music Data Audio - - PowerPoint PPT Presentation



slide-1
SLIDE 1

GCT634/AI613: Musical Applications of Machine Learning (Fall 2020)

Audio Data Representations

Juhan Nam

slide-2
SLIDE 2

Types of Music Data

  • Audio

○ MP3, WAV

  • Score (symbolic)

○ MIDI, typesetting script languages (e.g., MusicXML)

  • Image

○ Score (scanned image), album/playlist cover, performance video

  • Text

○ Metadata, tags, lyrics, reviews

  • User Data

○ Listening history, rating

slide-3
SLIDE 3

Types of Music Data

  • Audio

○ MP3, WAV

  • Score (symbolic)

○ MIDI, typesetting script languages (e.g., MusicXML)

  • Image

○ Score (scanned image), album/playlist cover, performance video

  • Text

○ Metadata, tags, lyrics, reviews

  • User Data

○ Listening history, favorites or scores

slide-4
SLIDE 4

Types of Audio Data Representations

  • Waveform (digital audio samples): sampling and quantization
  • Spectrogram: short-time Fourier transform
  • Mel-spectrogram: human pitch perception
  • Constant-Q transform: transform into musical (chromatic) scale
slide-5
SLIDE 5

Digital Audio Chain

[Figure: digital audio chain. Microphone → lowpass filter → sampling → quantization → storage, processing (…0 0 1 0 1 0…) → digital-to-analog conversion → lowpass filter → amplifier → loudspeaker. Sampling and quantization make up the analog-to-digital conversion.]

slide-6
SLIDE 6

Sampling and Quantization

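[Figure: the same digital audio chain (microphone → lowpass filter → sampling → quantization → storage, processing → digital-to-analog conversion → lowpass filter → amplifier → loudspeaker), here introducing the sampling and quantization stages of the analog-to-digital conversion.]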

slide-7
SLIDE 7

Sampling

  • Convert continuous-time signals to discrete-time signals by periodically picking up the instantaneous values

○ Represented as a sequence of numbers

○ Sampling period (Ts): the amount of time between samples

○ Sampling rate (fs = 1/Ts)

○ Signal notation: x(t) → x(nTs)
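The mapping x(t) → x(nTs) can be sketched in a few lines. This is a minimal illustration, not part of the original slides: the function name `sample_sine` and the 440 Hz / 8 kHz values are assumptions chosen for the example.

```python
import math

def sample_sine(freq_hz, fs, duration_s, amplitude=1.0):
    """Return the samples x(n*Ts) of amplitude*sin(2*pi*freq_hz*t), with Ts = 1/fs."""
    ts = 1.0 / fs                                # sampling period Ts
    num_samples = int(round(duration_s * fs))    # number of samples in the segment
    return [amplitude * math.sin(2 * math.pi * freq_hz * n * ts)
            for n in range(num_samples)]

# Hypothetical example: 10 ms of a 440 Hz tone sampled at 8 kHz
x = sample_sine(freq_hz=440.0, fs=8000, duration_s=0.01)
```

The continuous-time signal is only evaluated at the instants t = nTs, so the result is just a sequence of numbers indexed by n.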

slide-8
SLIDE 8

Sampling Theorem

  • What is an appropriate sampling rate?

○ Too high: increases the data size in the digital domain

○ Too low: the original signal cannot be reconstructed

  • Sampling Theorem

○ The sampling rate must be greater than twice the maximum frequency in the signal in order to reconstruct the original signal

○ Half the sampling rate is called the Nyquist frequency ($f_s/2$)

$$f_s > 2 \cdot f_m$$

$f_s$: sampling rate

$f_m$: maximum frequency of the signal

slide-9
SLIDE 9

Sampling in the Frequency Domain

[Figure: spectrum of a signal band-limited to ±fm on a frequency axis, and its spectrum after sampling, where copies appear around ±fs (from fs−fm to fs+fm). With fs > 2·fm the copies do not overlap; with fs < 2·fm they overlap and produce aliases.]

  • The high-frequency content above the Nyquist frequency is folded over (aliasing)
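The fold-over can be checked numerically. The sketch below is an illustration under assumed values (fs = 10 kHz and a 7 kHz cosine); `alias_frequency` is a hypothetical helper name, not something from the slides.

```python
import math

def alias_frequency(f, fs):
    """Frequency in [0, fs/2] at which a real sinusoid of frequency f appears after sampling at fs."""
    f_mod = f % fs
    return min(f_mod, fs - f_mod)    # fold around the Nyquist frequency fs/2

fs = 10000.0        # assumed sampling rate
f_high = 7000.0     # above the Nyquist frequency (5000 Hz)
f_alias = alias_frequency(f_high, fs)

# The two cosines produce the same samples, so they cannot be told apart after sampling
x_high = [math.cos(2 * math.pi * f_high * n / fs) for n in range(50)]
x_alias = [math.cos(2 * math.pi * f_alias * n / fs) for n in range(50)]
max_diff = max(abs(a - b) for a, b in zip(x_high, x_alias))
```

Here the 7 kHz component folds to 3 kHz, which is why a lowpass filter is placed before the sampler.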

slide-10
SLIDE 10

Sampling Rate

  • Determined by the bandwidth of signals or hearing limits

○ Music: 44.1 kHz (CD, consumer) or 48/96/192 kHz (professional)

○ Speech communication: 8 kHz

[Figure panels: Music, Speech]

slide-11
SLIDE 11

Sampling Rate Conversion (Resampling)

  • We often increase or decrease the sampling rate

○ 44.1 kHz CD-quality music is often down-sampled to 22.05 kHz or even lower rates to reduce the data size

○ Computed by signal interpolation

■ In down-sampling, the interpolation is preceded by a low-pass filter to avoid aliasing noise

■ Windowed sinc function

■ https://ccrma.stanford.edu/~jos/resample/

[Figure panels: Down-sampling, Up-sampling]
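As a rough sketch of the down-sampling case, the code below low-pass filters with a Hann-windowed sinc and then keeps every `factor`-th sample. It only handles integer factors and is not the algorithm from the linked page; the function names, the 63-tap length, and the cutoff placed at the new Nyquist frequency are assumptions made for illustration.

```python
import math

def windowed_sinc_lowpass(cutoff, num_taps):
    """FIR low-pass taps; cutoff is normalized to the sampling rate (0 < cutoff < 0.5)."""
    mid = (num_taps - 1) / 2.0
    taps = []
    for i in range(num_taps):
        t = i - mid
        # Ideal low-pass impulse response (a sinc), truncated to num_taps samples
        ideal = 2 * cutoff if t == 0 else math.sin(2 * math.pi * cutoff * t) / (math.pi * t)
        hann = 0.5 - 0.5 * math.cos(2 * math.pi * i / (num_taps - 1))
        taps.append(ideal * hann)
    gain = sum(taps)
    return [tap / gain for tap in taps]          # normalize to unity gain at DC

def downsample(x, factor, num_taps=63):
    """Low-pass filter x, then keep every factor-th sample (integer factor only)."""
    taps = windowed_sinc_lowpass(0.5 / factor, num_taps)
    half = (num_taps - 1) // 2
    y = []
    for n in range(0, len(x), factor):
        acc = 0.0
        for i, tap in enumerate(taps):
            j = n + half - i                     # centered convolution (no delay)
            if 0 <= j < len(x):
                acc += tap * x[j]
        y.append(acc)
    return y

# Hypothetical test signal at fs = 8000 Hz: a 100 Hz tone plus a 3500 Hz tone
fs = 8000
x = [math.sin(2 * math.pi * 100 * n / fs) + math.sin(2 * math.pi * 3500 * n / fs)
     for n in range(400)]
y = downsample(x, 2)                             # new sampling rate: 4000 Hz
```

After down-sampling by 2 the 3500 Hz tone lies above the new Nyquist frequency (2000 Hz), so the filter has to remove it before the samples are dropped; away from the edges, `y` is close to the 100 Hz tone alone.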

slide-12
SLIDE 12

Quantization

  • Discretizing the amplitude of real-valued signals

○ Round the amplitude to the nearest discrete step

○ The discrete steps are determined by the number of bits (bit depth)

■ N bits can represent integers from -2^(N-1) to 2^(N-1)-1: 8 bits (-128 to 127), 16 bits (-32768 to 32767)

[Figure: quantization steps between -2^(N-1) and 2^(N-1)-1]
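A minimal sketch of rounding to N-bit levels, assuming the input sample is normalized to roughly [-1.0, 1.0); the function name `quantize` and the clipping behavior are illustrative choices.

```python
def quantize(sample, num_bits):
    """Round a sample in [-1.0, 1.0) to the nearest N-bit signed integer level."""
    max_level = 2 ** (num_bits - 1) - 1      # e.g. 32767 for 16 bits
    min_level = -2 ** (num_bits - 1)         # e.g. -32768 for 16 bits
    level = int(round(sample * 2 ** (num_bits - 1)))
    return max(min_level, min(max_level, level))   # clip to the representable range
```

For example, `quantize(0.5, 8)` maps to level 64 of the 8-bit range, and a full-scale positive input is clipped to the largest level.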
slide-13
SLIDE 13

Quantization

  • Determined by the dynamic range of signals

○ Each additional bit adds about 6 dB of dynamic range: N bits → 6N dB

○ Music (CD): 16 bits (consumer) → 96 dB

○ Speech communication: 8 bits → 48 dB

[Figure panels: Music, Speech]
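The 6 dB per bit rule follows from 20·log10(2) ≈ 6.02 dB, since each extra bit doubles the number of levels. A small sketch, with `dynamic_range_db` as an assumed helper name:

```python
import math

def dynamic_range_db(num_bits):
    """Ratio between full scale and one quantization step, in dB."""
    return 20 * math.log10(2 ** num_bits)

cd_range = dynamic_range_db(16)       # about 96 dB
speech_range = dynamic_range_db(8)    # about 48 dB
```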

slide-14
SLIDE 14

Loading Audio Files

  • Check the sampling rate and bit depth

○ You can check them using audio software such as Audacity

  • Do resampling (usually down-sampling) if necessary

○ Librosa provides resampling when loading audio files
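The sampling rate and bit depth of a WAV file can also be read programmatically. The sketch below uses Python's standard `wave` module on a test tone it writes itself; the helper names and the file name `tone.wav` are made up for the example. With Librosa, `librosa.load(path, sr=22050)` returns the samples resampled to the requested rate.

```python
import math
import os
import struct
import tempfile
import wave

def write_test_tone(path, fs=44100, freq_hz=440.0, duration_s=0.1):
    """Write a mono 16-bit WAV file containing a sine tone (for demonstration only)."""
    num_samples = int(round(fs * duration_s))
    frames = b"".join(
        struct.pack("<h", int(32767 * 0.5 * math.sin(2 * math.pi * freq_hz * n / fs)))
        for n in range(num_samples))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)          # 2 bytes per sample = 16 bits
        wav_file.setframerate(fs)
        wav_file.writeframes(frames)

def inspect_wav(path):
    """Return the sampling rate, bit depth, channel count and length of a WAV file."""
    with wave.open(path, "rb") as wav_file:
        return {"sampling_rate": wav_file.getframerate(),
                "bit_depth": 8 * wav_file.getsampwidth(),
                "channels": wav_file.getnchannels(),
                "num_samples": wav_file.getnframes()}

with tempfile.TemporaryDirectory() as tmp_dir:
    tone_path = os.path.join(tmp_dir, "tone.wav")
    write_test_tone(tone_path)
    info = inspect_wav(tone_path)
```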

slide-15
SLIDE 15

Waveform

  • The waveform is a natural representation of audio but is limited for analyzing the content

○ It mainly shows the temporal energy

slide-16
SLIDE 16

Spectrogram

  • 2D-image representation of audio using short-time Fourier transform

○ x-axis: time, y-axis: frequency, color: magnitude response

○ It is common to use a dB scale (a log scale) for the magnitude

○ Easy to match what you hear to what you see

slide-17
SLIDE 17

Computing Spectrogram

  • For each short segment (frame)

○ Take a window (one frame)

○ Compute the DFT (FFT)

○ Convert the result to polar coordinates

■ Magnitude and phase

○ Compress the magnitude

■ $20\log_{10} X_{mag}$: decibel

○ Shift by a hop size

  • Spectrogram parameters

○ Window size (FFT size)

○ Hop size

○ Window type

[Figure: short-time Fourier transform (STFT). Successive frames $x(l-1)$, $x(l)$ are taken with a given window size and hop size; each frame is windowed, transformed by the DFT into $X_{mag}$, and passed through magnitude compression.]
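The steps above can be sketched with a direct (slow) DFT so that the example needs no external library. The window size of 64, hop size of 16, the periodic Hann window and the small floor that avoids log(0) are all assumptions for illustration; a real implementation would use an FFT.

```python
import cmath
import math

def hann(num_points):
    """Periodic Hann window."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / num_points) for i in range(num_points)]

def dft_half(frame):
    """DFT bins 0..N/2 of a real frame (direct evaluation)."""
    n_points = len(frame)
    return [sum(frame[i] * cmath.exp(-2j * math.pi * k * i / n_points)
                for i in range(n_points))
            for k in range(n_points // 2 + 1)]

def spectrogram_db(x, window_size=64, hop_size=16, floor=1e-10):
    """Return a list of frames, each a list of magnitudes in dB."""
    window = hann(window_size)
    frames = []
    for start in range(0, len(x) - window_size + 1, hop_size):   # shift by the hop size
        frame = [x[start + i] * window[i] for i in range(window_size)]
        magnitude = [abs(c) for c in dft_half(frame)]             # polar form: keep magnitude
        frames.append([20 * math.log10(max(m, floor)) for m in magnitude])
    return frames

# Hypothetical input: a cosine that falls exactly on bin 8 of a 64-point DFT
x = [math.cos(2 * math.pi * 8 * n / 64) for n in range(256)]
frames = spectrogram_db(x)
```

Each row of `frames` is one column of the spectrogram image; for this input every frame peaks at bin 8.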

slide-18
SLIDE 18

Discrete Fourier Transform (DFT)

  • Find the frequency (sinusoidal) components of $x(n)$
  • Represent $x(n)$ with

$$x(n) = \sum_{k=0}^{N-1} \frac{1}{N} A(k) \cos\left(\frac{2\pi k n}{N} + \phi(k)\right)$$

○ $A(k)$: amplitude (or magnitude) of the sinusoid

○ $\phi(k)$: phase of the sinusoid

○ $N$: size of the DFT or of the input segment

○ $k$: frequency bin index (0 to $N-1$); $\frac{k}{N} f_s$ is the frequency at bin $k$

  • The DFT provides a way of finding $A(k)$ and $\phi(k)$

[Figure: Pink Floyd, "The Dark Side of the Moon"]

slide-19
SLIDE 19

Discrete Fourier Transform (DFT)

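  • Use the orthogonality of sinusoids (for bin indices $k$, $l$ other than 0 and $N/2$)

$$\sum_{n=0}^{N-1} \cos\left(\frac{2\pi k n}{N}\right)\cos\left(\frac{2\pi l n}{N}\right) = \begin{cases} N/2 & \text{if } k = l \text{ or } k = N - l \text{ (equivalent to } -l\text{)} \\ 0 & \text{otherwise} \end{cases}$$

$$\sum_{n=0}^{N-1} \cos\left(\frac{2\pi k n}{N}\right)\sin\left(\frac{2\pi l n}{N}\right) = 0$$

$$\sum_{n=0}^{N-1} \sin\left(\frac{2\pi k n}{N}\right)\sin\left(\frac{2\pi l n}{N}\right) = \begin{cases} N/2 & \text{if } k = l \\ -N/2 & \text{if } k = N - l \text{ (equivalent to } -l\text{)} \\ 0 & \text{otherwise} \end{cases}$$

  • The inner product (or correlation) between the two sinusoids:

○ If the frequencies are the same (including opposite signs), it is non-zero

○ Otherwise, it is zero (they are orthogonal to each other)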
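These inner products are easy to verify numerically. The sketch below assumes N = 16 and the bin indices 3 and 5 purely as examples; `inner_product` is a hypothetical helper.

```python
import math

def inner_product(f, g, k, l, num_points):
    """Sum over n of f(2*pi*k*n/N) * g(2*pi*l*n/N), where f and g are math.cos or math.sin."""
    return sum(f(2 * math.pi * k * n / num_points) * g(2 * math.pi * l * n / num_points)
               for n in range(num_points))

N = 16
same = inner_product(math.cos, math.cos, 3, 3, N)          # same frequency: N/2
mirrored = inner_product(math.sin, math.sin, 3, N - 3, N)  # k = N - l: -N/2
different = inner_product(math.cos, math.cos, 3, 5, N)     # different frequencies: 0
mixed = inner_product(math.cos, math.sin, 3, 3, N)         # cosine against sine: 0
```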

slide-20
SLIDE 20

Discrete Fourier Transform (DFT)

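  • Inner product of the input with sinusoids at bin $l$

$$X_{re}(l) = \sum_{n=0}^{N-1} x(n) \cos\left(\frac{2\pi l n}{N}\right) = \sum_{n=0}^{N-1} \left(\sum_{k=0}^{N-1} \frac{1}{N} A(k) \cos\left(\frac{2\pi k n}{N} + \phi(k)\right)\right) \cos\left(\frac{2\pi l n}{N}\right) = A(l) \cos\phi(l)$$

$$X_{im}(l) = -\sum_{n=0}^{N-1} x(n) \sin\left(\frac{2\pi l n}{N}\right) = -\sum_{n=0}^{N-1} \left(\sum_{k=0}^{N-1} \frac{1}{N} A(k) \cos\left(\frac{2\pi k n}{N} + \phi(k)\right)\right) \sin\left(\frac{2\pi l n}{N}\right) = A(l) \sin\phi(l)$$

  • The magnitude and phase (the same holds at every bin, written with bin index $k$ from here on)

$$X_{mag}(k) = A(k) = \sqrt{X_{re}^2(k) + X_{im}^2(k)}, \qquad X_{phase}(k) = \phi(k) = \tan^{-1}\left(\frac{X_{im}(k)}{X_{re}(k)}\right)$$

  • The definition of DFT can be simplified using complex sinusoids

$$X(k) = \sum_{n=0}^{N-1} x(n)\, e^{-j\frac{2\pi k n}{N}} = X_{re}(k) + jX_{im}(k) = A(k)\, e^{j\phi(k)}$$

$$e^{j\frac{2\pi k n}{N}} = \cos\left(\frac{2\pi k n}{N}\right) + j\sin\left(\frac{2\pi k n}{N}\right) \quad \text{(Euler's identity)}$$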
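As a numerical check, the sketch below evaluates the complex DFT directly and recovers the amplitude and phase of a single cosine. The values N = 32, bin 5, amplitude 0.8 and phase 0.6 are arbitrary assumptions. For a real cosine of amplitude a, the energy is split between bins k and N − k, so |X(k)| = a·N/2.

```python
import cmath
import math

def dft(x):
    """Direct evaluation of X(k) = sum_n x(n) * exp(-j*2*pi*k*n/N)."""
    num_points = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / num_points)
                for n in range(num_points))
            for k in range(num_points)]

N, k0, amp, phase = 32, 5, 0.8, 0.6           # assumed test values
x = [amp * math.cos(2 * math.pi * k0 * n / N + phase) for n in range(N)]
X = dft(x)

magnitude = abs(X[k0])                        # sqrt(X_re^2 + X_im^2)
recovered_phase = cmath.phase(X[k0])          # atan2(X_im, X_re)
recovered_amp = 2 * magnitude / N             # undo the split between bins k0 and N - k0
```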

slide-21
SLIDE 21

Discrete Fourier Transform (DFT)

  • Can be viewed as matrix multiplication
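  • In practice, we use an FFT algorithm instead of direct multiplication

○ Divide the matrix into small matrices recursively

○ Complexity reduction: $O(N^2)$ → $O(N \log_2 N)$

[Figure: DFT as matrix multiplication. The input $x(n)$ is multiplied by the matrices $W_{cos}$ and $W_{sin}$ to give $X_{re}$ and $X_{im}$, which are converted to polar form ($X_{mag}$, $X_{phase}$). The basis sinusoids are $s_k^*(n) = e^{-j\frac{2\pi k n}{N}}$.]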
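Both views can be compared in a short sketch: a direct matrix multiplication and a recursive radix-2 FFT. The radix-2 split into even and odd samples is one common FFT variant, used here as an assumption; it requires the length to be a power of two.

```python
import cmath
import math

def dft_matrix_multiply(x):
    """O(N^2): multiply x by the DFT matrix W[k][n] = exp(-j*2*pi*k*n/N)."""
    n_points = len(x)
    w = [[cmath.exp(-2j * math.pi * k * n / n_points) for n in range(n_points)]
         for k in range(n_points)]
    return [sum(w[k][n] * x[n] for n in range(n_points)) for k in range(n_points)]

def fft(x):
    """O(N log N): recursive radix-2 FFT; len(x) must be a power of two."""
    n_points = len(x)
    if n_points == 1:
        return [complex(x[0])]
    even = fft(x[0::2])                       # DFT of the even-indexed samples
    odd = fft(x[1::2])                        # DFT of the odd-indexed samples
    out = [0j] * n_points
    for k in range(n_points // 2):
        twiddle = cmath.exp(-2j * math.pi * k / n_points) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n_points // 2] = even[k] - twiddle
    return out

signal = [float(n % 5) for n in range(16)]    # arbitrary test input
direct = dft_matrix_multiply(signal)
fast = fft(signal)
max_error = max(abs(a - b) for a, b in zip(direct, fast))
```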

slide-22
SLIDE 22

Discrete Fourier Transform (DFT)

  • When DFT is applied to musical sounds

○ A musical tone with pitch has a periodic waveform

○ The DFT shows a harmonic spectrum (harmonic overtones)

○ Pitch information can also be extracted

○ The magnitude spectrum is generally sparser than the waveform

[Figure: waveform $x(n)$ and magnitude spectrum $X_{mag}(k)$ with peaks at $F_0$, $2F_0$, $3F_0$]

slide-23
SLIDE 23

Effect of Window Type

  • Types of window functions

○ Trade-off between the width of the main lobe and the level of the side lobes

○ The Hann window is the most widely used in music analysis

[Figure: rectangular, triangular, Hann and Blackman windows (amplitude) with the spectra of windowed single sinusoids (magnitude in dB)]
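The trade-off can be measured on the window spectra themselves. The sketch below compares a rectangular and a Hann window of 32 points using a zero-padded direct DFT; the window length, the oversampling factor of 16 and the helper names are assumptions for illustration.

```python
import cmath
import math

def rectangular(num_points):
    return [1.0] * num_points

def hann(num_points):
    """Symmetric Hann window."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (num_points - 1))
            for i in range(num_points)]

def peak_sidelobe_db(window, oversample=16):
    """Return (peak side-lobe level in dB relative to the main lobe, index of the first null)."""
    n_fft = len(window) * oversample          # zero-padding gives a finely sampled spectrum
    mags = [abs(sum(w * cmath.exp(-2j * math.pi * k * i / n_fft)
                    for i, w in enumerate(window)))
            for k in range(n_fft // 2)]
    k = 1
    while k < len(mags) - 1 and mags[k] < mags[k - 1]:   # walk down the main lobe
        k += 1
    first_null = k - 1                        # first local minimum = edge of the main lobe
    level = 20 * math.log10(max(mags[first_null:]) / mags[0])
    return level, first_null

rect_level, rect_null = peak_sidelobe_db(rectangular(32))
hann_level, hann_null = peak_sidelobe_db(hann(32))
```

The rectangular window has the narrowest main lobe but side lobes only about 13 dB down, while the Hann window roughly doubles the main-lobe width in exchange for side lobes more than 30 dB down.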

slide-24
SLIDE 24

Effect of Window Size

  • Trade-off between time and frequency resolutions

○ Short window: low frequency resolution and high time resolution

○ Long window: high frequency resolution and low time resolution

[Figure: spectrograms with Hop=128, N=256 and with Hop=128, N=4096]
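The two settings in the figure can be put into numbers. The slide does not state the sampling rate, so 22050 Hz is assumed below; `stft_resolution` is a hypothetical helper.

```python
def stft_resolution(window_size, hop_size, fs):
    """Frequency spacing between DFT bins and the time span of one window and one hop."""
    return {"bin_spacing_hz": fs / window_size,
            "window_duration_ms": 1000.0 * window_size / fs,
            "hop_duration_ms": 1000.0 * hop_size / fs}

short_window = stft_resolution(window_size=256, hop_size=128, fs=22050)
long_window = stft_resolution(window_size=4096, hop_size=128, fs=22050)
```

Under this assumption the 256-point window spans about 11.6 ms with bins about 86 Hz apart, while the 4096-point window spans about 186 ms with bins about 5.4 Hz apart.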

slide-25
SLIDE 25

Human Ears

  • Is the human ear a spectrum analyzer?

○ The ear has a complicated pathway from the eardrum to the auditory nerve

○ The cochlea in the inner ear acts as a bandpass filter bank

○ The membrane resonates at a different position depending on the frequency of the input; the resonance frequency changes on a log scale along the membrane

[Figure: (unrolled) cochlear membrane]

slide-26
SLIDE 26

Human Pitch Perception

  • Pitch Resolution

○ The just noticeable difference (JND) increases as the frequency goes up

  • Mel scale

○ Approximates the human pitch resolution based on the pitch ratio of tones

○ Most widely used for speech and music analysis

○ A log frequency scale

m = 2595 log10(1 + f / 700)
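The formula and its inverse translate directly into code. The function names below are illustrative; note that audio libraries may default to a different mel variant.

```python
import math

def hz_to_mel(freq_hz):
    """m = 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)

def mel_to_hz(mel):
    """Inverse mapping: f = 700 * (10^(m / 2595) - 1)."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)
```

With these constants 1000 Hz maps to roughly 1000 mel, and equal steps in mel correspond to wider and wider steps in Hz as the frequency goes up.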

slide-27
SLIDE 27

Computing Mel-Spectrogram

  • Mapping linear frequency to mel scale

○ A mel-scaled filter bank is used: linear interpolation around each center frequency, with a skirt set by the corresponding bandwidth

○ The high-frequency range is zoomed out and the low-frequency range is relatively zoomed in; the number of frequency bins is usually smaller

[Figure: spectrogram (1024 freq. bins), mel-spectrogram (128 mel bins), and the mel-scaled filter bank with its center frequencies and bandwidths]
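A triangular mel filter bank can be sketched as follows. This is a simplified construction under stated assumptions (triangles with unit peak, edges equally spaced in mel, no area normalization) and not the exact filter bank of any particular library; all names and the example sizes are hypothetical.

```python
import math

def hz_to_mel(freq_hz):
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filter_bank(num_mels, n_fft, fs):
    """Return num_mels rows of triangular weights over the n_fft//2 + 1 linear-frequency bins."""
    mel_max = hz_to_mel(fs / 2.0)
    # num_mels + 2 points equally spaced in mel: the edges and centers of the triangles
    points_hz = [mel_to_hz(mel_max * i / (num_mels + 1)) for i in range(num_mels + 2)]
    bin_hz = [k * fs / n_fft for k in range(n_fft // 2 + 1)]
    bank = []
    for m in range(1, num_mels + 1):
        lower, center, upper = points_hz[m - 1], points_hz[m], points_hz[m + 1]
        row = []
        for f in bin_hz:
            rising = (f - lower) / (center - lower)     # left slope of the triangle
            falling = (upper - f) / (upper - center)    # right slope of the triangle
            row.append(max(0.0, min(rising, falling)))
        bank.append(row)
    return bank

# Hypothetical sizes: 10 mel bands over a 512-point FFT at 16 kHz
bank = mel_filter_bank(num_mels=10, n_fft=512, fs=16000)
```

Multiplying this matrix by one column of the magnitude spectrogram gives one column of the mel-spectrogram; the triangles become wider toward high frequencies.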

slide-28
SLIDE 28

Musical scale

  • Musical tuning system

○ Equal temperament: frequency ratio of 1 : 2^(1/12) per semitone

○ Music note number (m) and frequency (f) in Hz

f = 440 · 2^((m − 69) / 12)

m = 12 log2(f / 440) + 69

https://newt.phys.unsw.edu.au/jw/notes.html
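The two conversions can be written as a pair of small functions; the names are illustrative. Note number 69 corresponds to 440 Hz (A4), as in the formulas.

```python
import math

def note_to_hz(note_number):
    """f = 440 * 2^((m - 69) / 12)."""
    return 440.0 * 2.0 ** ((note_number - 69) / 12.0)

def hz_to_note(freq_hz):
    """m = 12 * log2(f / 440) + 69."""
    return 12.0 * math.log2(freq_hz / 440.0) + 69.0
```

For example, note 60 (middle C) comes out at about 261.63 Hz, and doubling the frequency always adds 12 to the note number.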

slide-29
SLIDE 29

Review

  • Now we know that we need a log-scale for music
  • The log-scale filter bank will look like this
  • Question:

○ Can we obtain the log-frequency scale spectrogram directly from waveforms using a time-frequency representation?

[Figure: log-scaled filter bank with its center frequencies and bandwidths]

slide-30
SLIDE 30

Constant-Q Transform

  • Time-frequency representation which uses a set of sinusoidal kernels with log-spaced frequencies

○ As the frequency increases, the sinusoidal kernels become shorter (and the bandwidth wider) so that Q (= frequency/bandwidth) stays constant

(Schörkhuber and Klapuri, 2010)
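How the kernel length shrinks with frequency can be sketched under a common formulation, assumed here rather than taken from the slide: with B bins per octave, the bandwidth of bin k is f_k·(2^(1/B) − 1), so Q = 1/(2^(1/B) − 1) and the kernel length is Q·fs/f_k samples. The function name and the example values are hypothetical.

```python
import math

def cqt_kernel_lengths(f_min, num_bins, bins_per_octave, fs):
    """Return Q and a list of (center frequency in Hz, kernel length in samples)."""
    q = 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)        # constant Q = frequency / bandwidth
    kernels = []
    for k in range(num_bins):
        f_k = f_min * 2.0 ** (k / bins_per_octave)           # log-spaced center frequencies
        kernels.append((f_k, int(math.ceil(q * fs / f_k))))  # higher frequency: shorter kernel
    return q, kernels

# Hypothetical setting: two octaves from 55 Hz, 12 bins per octave, fs = 22050 Hz
q, kernels = cqt_kernel_lengths(f_min=55.0, num_bins=24, bins_per_octave=12, fs=22050)
lengths = [length for _, length in kernels]
```

Going up one octave doubles the center frequency and halves the kernel length, which keeps the number of cycles inside each kernel the same.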

slide-31
SLIDE 31

Constant-Q IIR Filter Bank

  • Musically designed constant-Q transform

○ 88 IIR bandpass filters

○ The center frequency corresponds to the pitch of each piano note

○ The bandwidth is set to have constant Q, with ±25 cents around the center (100 cents = 1 semitone)

(Müller, 2011)
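The band layout of such a filter bank can be sketched without designing the IIR filters themselves. The code below assumes the 88 piano notes are note numbers 21 (A0) to 108 (C8) under the 440 Hz tuning formula, and computes the band edges 25 cents below and above each center; the function name is hypothetical.

```python
def piano_filter_bands(cents=25.0):
    """Return (center, lower edge, upper edge, Q) for each of the 88 piano notes."""
    bands = []
    for note_number in range(21, 109):                    # A0 .. C8 (assumed mapping)
        center = 440.0 * 2.0 ** ((note_number - 69) / 12.0)
        lower = center * 2.0 ** (-cents / 1200.0)         # 25 cents below the center
        upper = center * 2.0 ** (cents / 1200.0)          # 25 cents above the center
        bands.append((center, lower, upper, center / (upper - lower)))
    return bands

bands = piano_filter_bands()
q_values = [band[3] for band in bands]
```

Because the band edges are defined as a ratio of the center frequency, every filter has the same Q (about 34.6 for ±25 cents).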

slide-32
SLIDE 32

Example: Constant-Q Transform

  • Chromatic music scale

○ The harmonics of notes increase linearly in the constant-Q transform

[Figure panels: Spectrogram, Constant-Q transform]
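One way to see why a log-frequency axis suits a chromatic scale: the distance from a note to its harmonics, measured in log-frequency bins, does not depend on the note. The sketch below assumes 12 bins per octave and a lowest bin at 27.5 Hz; the function names are made up for the example.

```python
import math

def log_frequency_bin(freq_hz, f_min=27.5, bins_per_octave=12):
    """Position of a frequency on a log-frequency axis, in (fractional) bins."""
    return bins_per_octave * math.log2(freq_hz / f_min)

def harmonic_offsets(f0, num_harmonics=4):
    """Offsets in bins from the fundamental to each of its first harmonics."""
    base = log_frequency_bin(f0)
    return [log_frequency_bin(h * f0) - base for h in range(1, num_harmonics + 1)]

offsets_a2 = harmonic_offsets(110.0)
offsets_a3 = harmonic_offsets(220.0)
```

The second harmonic always sits 12 bins above the fundamental and the third about 19 bins above, so stepping up the chromatic scale shifts the whole harmonic pattern by one bin per semitone, whereas on a linear frequency axis the spacing between harmonics grows with the pitch.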