Audio Data Representations Juhan Nam Types of Music Data Audio - PowerPoint PPT Presentation

GCT634/AI613: Musical Applications of Machine Learning (Fall 2020) Audio Data Representations Juhan Nam

Types of Music Data
● Audio
○ MP3, WAV
● Score (symbolic)
○ MIDI, typesetting script languages (e.g., MusicXML)
● Image
○ Score (scanned image), album/playlist cover, performance video
● Text
○ Metadata, tags, lyrics, reviews
● User Data
○ Listening history, rating


Types of Audio Data Representations
● Waveform (digital audio samples): sampling and quantization
● Spectrogram: short-time Fourier transform
● Mel-spectrogram: human pitch perception
● Constant-Q transform: transform into musical (chromatic) scale

Digital Audio Chain
[Figure: microphone → lowpass filter → analog-to-digital conversion (sampling, quantization) → storage and processing (…0 0 1 0 1 0…) → digital-to-analog conversion → lowpass filter → amplifier → loudspeaker]

Sampling and Quantization
[Figure: the digital audio chain, in which sampling and quantization form the analog-to-digital conversion stage]

Sampling
● Convert continuous-time signals to discrete-time signals by periodically picking up the instantaneous values
○ Represented as a sequence of numbers
○ Sampling period ($T_s$): the amount of time between samples
○ Sampling rate ($f_s = 1/T_s$)
○ Signal notation: $x(t) \rightarrow x(nT_s)$
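The notation x(t) → x(nT_s) can be illustrated with a minimal NumPy sketch. The function name `sample_sinusoid` and the 440 Hz / 8 kHz values are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def sample_sinusoid(freq_hz, fs_hz, duration_s):
    """Return the sample times n*T_s and the values x(n*T_s) of a sinusoid."""
    ts = 1.0 / fs_hz                                 # sampling period T_s
    n = np.arange(int(round(duration_s * fs_hz)))    # sample indices n = 0, 1, 2, ...
    t = n * ts                                       # sampling instants n*T_s
    x = np.sin(2.0 * np.pi * freq_hz * t)            # x(t) evaluated at t = n*T_s
    return t, x

# Hypothetical example values: a 440 Hz tone sampled at 8 kHz for 10 ms.
t, x = sample_sinusoid(freq_hz=440.0, fs_hz=8000.0, duration_s=0.01)
print(len(x), t[1] - t[0])
```

The discrete-time signal is just the array `x`; the sampling rate has to be stored alongside it to recover the time axis.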

Sampling Theorem
● What is an appropriate sampling rate?
○ Too high: increases the data size in the digital domain
○ Too low: cannot reconstruct the original signal
● Sampling Theorem
○ The sampling rate must be greater than twice the maximum frequency in the signal in order to reconstruct the original signal: $f_s > 2 f_m$
■ $f_s$: sampling rate
■ $f_m$: maximum frequency of the signal
○ Half the sampling rate is called the Nyquist frequency ($f_s/2$)

Sampling in the Frequency Domain
● The high-frequency content above the Nyquist frequency is folded over ($f_s < 2 f_m$)
[Figure: spectrum of the original signal, occupying $-f_m$ to $f_m$; spectrum after sampling with $f_s > 2 f_m$, where the alias copies centered at $\pm f_s$ do not overlap the original band; spectrum after sampling with $f_s < 2 f_m$, where the copies overlap]
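The folding can be checked numerically. The sketch below is an illustration under assumed values (a 5 kHz cosine sampled at 8 kHz); `alias_frequency` and `dominant_frequency` are hypothetical helper names.

```python
import numpy as np

def alias_frequency(f_hz, fs_hz):
    """Frequency at which a real sinusoid of f_hz appears after sampling at fs_hz."""
    f = f_hz % fs_hz                       # the sampled spectrum repeats every f_s
    return fs_hz - f if f > fs_hz / 2 else f   # fold around the Nyquist frequency

def dominant_frequency(x, fs_hz):
    """Frequency (Hz) of the largest magnitude bin of a real signal."""
    spectrum = np.abs(np.fft.rfft(x))
    return np.argmax(spectrum) * fs_hz / len(x)

fs = 8000.0
n = np.arange(8000)                            # one second of samples
x = np.cos(2.0 * np.pi * 5000.0 * n / fs)      # 5 kHz is above the 4 kHz Nyquist frequency
print(alias_frequency(5000.0, fs), dominant_frequency(x, fs))
```

Both values should agree: once sampled, the 5 kHz tone is indistinguishable from a tone at f_s minus 5 kHz.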

Sampling Rate
● Determined by the bandwidth of signals or hearing limits
○ Music (CD): 44.1 kHz (consumer) or 48/96/192 kHz (professional)
○ Speech communication: 8 kHz
[Figure: music and speech examples]

Sampling Rate Conversion (Resampling)
● We often increase or decrease the sampling rate
○ 44.1 kHz CD-quality music is often down-sampled to 22.05 kHz or even lower rates to reduce the data size
○ Computed by signal interpolation
■ In down-sampling, preceded by a low-pass filter to avoid aliasing noise
■ Windowed sinc function
■ https://ccrma.stanford.edu/~jos/resample/
[Figure: up-sampling and down-sampling]
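A rough sketch of integer-factor down-sampling with a windowed-sinc low-pass filter follows. It is not the algorithm from the linked page: the Hann taper, the 101-tap length, and the cutoff placed exactly at the new Nyquist frequency are all simplifying assumptions, and the function names are hypothetical.

```python
import numpy as np

def windowed_sinc_lowpass(cutoff, num_taps=101):
    """FIR low-pass filter; cutoff is a fraction of the sampling rate (0 < cutoff < 0.5)."""
    n = np.arange(num_taps) - (num_taps - 1) / 2
    h = 2.0 * cutoff * np.sinc(2.0 * cutoff * n)   # ideal low-pass impulse response
    h *= np.hanning(num_taps)                      # taper the truncated sinc (assumed Hann window)
    return h / np.sum(h)                           # normalize to unity gain at DC

def downsample(x, factor, num_taps=101):
    """Low-pass filter at the new Nyquist frequency, then keep every factor-th sample."""
    h = windowed_sinc_lowpass(0.5 / factor, num_taps)
    y = np.convolve(x, h, mode="same")             # symmetric filter, so no net delay
    return y[::factor]

# Assumed test signal: 1 kHz + 15 kHz tones at 44.1 kHz, down-sampled to 22.05 kHz.
fs = 44100
n = np.arange(4410)
x = np.sin(2 * np.pi * 1000 * n / fs) + np.sin(2 * np.pi * 15000 * n / fs)
y = downsample(x, 2)      # 15 kHz is removed before decimation
naive = x[::2]            # without the filter, 15 kHz aliases to 7.05 kHz
```

Comparing the spectra of `y` and `naive` shows why the low-pass filter is needed: only the naive version contains a strong component at 7.05 kHz.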

Quantization
● Discretizing the amplitude of real-valued signals
○ Round the amplitude to the nearest discrete step
○ The number of discrete steps is determined by the number of bits (bit depth)
■ N bits can range from $-2^{N-1}$ to $2^{N-1}-1$: 8 bit (-128 to 127), 16 bit (-32768 to 32767)
[Figure: quantization steps between $-2^{N-1}$ and $2^{N-1}-1$]
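As a minimal sketch of rounding to N-bit integers, assuming the input is already scaled to the range [-1, 1) (an assumption, since the slide does not fix an input range):

```python
import numpy as np

def quantize(x, n_bits):
    """Round values in [-1, 1) to signed integers in [-2^(N-1), 2^(N-1) - 1]."""
    scale = 2 ** (n_bits - 1)
    q = np.round(np.asarray(x, dtype=float) * scale)      # nearest discrete step
    return np.clip(q, -scale, scale - 1).astype(np.int64) # keep within the N-bit range

print(quantize([-1.0, 0.0, 0.5, 1.0], 8))
print(quantize([-1.0, 1.0], 16))
```

The clipping step matters at the top of the range: +1.0 would map to 2^(N-1), which is one step beyond the largest representable value.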

Quantization
● Determined by the dynamic range of signals
○ Adding 1 bit (at the LSB) increases the dynamic range by about 6 dB: N bits → about 6N dB
○ Music (CD): 16 bits (consumer) → 96 dB
○ Speech communication: 8 bits → 48 dB
[Figure: music and speech examples]
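The 6 dB per bit rule follows from 20·log10(2) ≈ 6.02. A short check, using the ratio between the full range and one quantization step as the (assumed) definition of dynamic range:

```python
import math

def dynamic_range_db(n_bits):
    """Dynamic range of N-bit quantization as 20*log10 of the number of steps."""
    return 20.0 * math.log10(2 ** n_bits)

for bits in (8, 16, 24):
    print(bits, round(dynamic_range_db(bits), 2))
```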

Loading Audio Files
● Check the sampling rate and bit depth
○ You can check them using audio software such as Audacity
● Do resampling (usually down-sampling) if necessary
○ Librosa provides resampling when loading audio files
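The sampling rate and bit depth of a WAV file can also be read programmatically. The sketch below uses only Python's standard `wave` module and writes its own short test tone so that it is self-contained; the helper names and the 22050 Hz / 16-bit values are assumptions. With Librosa, passing a target rate as the `sr` argument of `librosa.load` resamples on load.

```python
import math
import os
import struct
import tempfile
import wave

def write_test_tone(path, fs_hz=22050, freq_hz=440.0, n_samples=2205):
    """Write a mono 16-bit PCM sine tone (illustrative test data only)."""
    frames = b"".join(
        struct.pack("<h", int(0.5 * 32767 * math.sin(2 * math.pi * freq_hz * n / fs_hz)))
        for n in range(n_samples)
    )
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)        # mono
        wf.setsampwidth(2)        # 2 bytes per sample = 16 bits
        wf.setframerate(fs_hz)
        wf.writeframes(frames)

def inspect_wav(path):
    """Return the sampling rate, bit depth, channel count and length of a WAV file."""
    with wave.open(path, "rb") as wf:
        return {
            "sampling_rate": wf.getframerate(),
            "bit_depth": 8 * wf.getsampwidth(),
            "channels": wf.getnchannels(),
            "num_samples": wf.getnframes(),
        }

path = os.path.join(tempfile.mkdtemp(), "tone.wav")
write_test_tone(path)
info = inspect_wav(path)
print(info)
```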

Waveform
● The waveform is a natural representation of audio, but it is limited for analyzing the content
○ It mainly shows the temporal energy

Spectrogram
● 2D-image representation of audio using the short-time Fourier transform
○ x-axis: time, y-axis: frequency, color: magnitude response
○ It is common to use dB scale (a log scale) for the magnitude
○ Easy to match what you hear to what you see

Computing Spectrogram
● For each short segment (frame)
○ Take a window (one frame)
○ Compute DFT (FFT)
○ Convert them to polar coordinates
■ Magnitude and phase
○ Compress the magnitude
■ $20 \log_{10} X_{mag}$: decibel
○ Shift by a hop size
● Spectrogram parameters
○ Window size (FFT size)
○ Hop size
○ Window type
[Figure: short-time Fourier transform (STFT). Frames $x(l-1)$ and $x(l)$ of one window size, spaced by the hop size, are windowed, passed through the DFT to obtain $X_{mag}$, and magnitude-compressed to form the spectrogram]
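The steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: it assumes a Hann window, no padding at the signal edges, a signal at least one window long, and a small `eps` to avoid taking the log of zero.

```python
import numpy as np

def spectrogram_db(x, window_size=1024, hop_size=256, eps=1e-10):
    """Magnitude spectrogram in dB, shape (frequency bins, frames)."""
    window = np.hanning(window_size)                        # window type: Hann (assumed)
    n_frames = 1 + (len(x) - window_size) // hop_size       # frames that fit entirely
    frames = np.stack([
        x[i * hop_size : i * hop_size + window_size] * window   # take one windowed frame
        for i in range(n_frames)                                # shift by the hop size
    ])
    spectrum = np.fft.rfft(frames, axis=1)                  # DFT of each frame via FFT
    magnitude = np.abs(spectrum)                            # polar form: keep the magnitude
    return 20.0 * np.log10(magnitude + eps).T               # compress to decibels

# Assumed example: a 1 kHz tone sampled at 8 kHz.
fs = 8000
n = np.arange(8000)
S = spectrogram_db(np.sin(2 * np.pi * 1000 * n / fs))
print(S.shape)
```

Only the non-negative frequency bins are kept (`rfft`), which is sufficient for real-valued audio.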

Discrete Fourier Transform (DFT)
● Find the frequency (sinusoidal) components of $x(n)$
● Represent $x(n)$ with $x(n) = \sum_{k=0}^{N-1} \frac{1}{N} A(k) \cos\left(\frac{2\pi k n}{N} + \phi(k)\right)$
○ $A(k)$: amplitude (or magnitude) of the sinusoid
○ $\phi(k)$: phase of the sinusoid
○ $N$: size of the DFT or the input segment
○ $k$: frequency bin index ($0$ to $N-1$) ($\frac{k}{N} f_s$ is the frequency at each frequency bin)
● DFT provides the way of finding $A(k)$ and $\phi(k)$
[Figure: Pink Floyd, "The Dark Side of the Moon"]

Discrete Fourier Transform (DFT)
● Use the orthogonality of sinusoids (stated for $0 < l < N/2$)
○ $\sum_{n=0}^{N-1} \cos\left(\frac{2\pi k n}{N}\right) \cos\left(\frac{2\pi l n}{N}\right) = N/2$ if $k = l$ or $k = N - l$ (equivalent to $-l$), and $0$ otherwise
○ $\sum_{n=0}^{N-1} \cos\left(\frac{2\pi k n}{N}\right) \sin\left(\frac{2\pi l n}{N}\right) = 0$
○ $\sum_{n=0}^{N-1} \sin\left(\frac{2\pi k n}{N}\right) \sin\left(\frac{2\pi l n}{N}\right) = N/2$ if $k = l$, $-N/2$ if $k = N - l$ (equivalent to $-l$), and $0$ otherwise
● The inner product (or correlation) between the two sinusoids:
○ If the frequencies are the same (including different signs), it is non-zero
○ Otherwise, it is zero (they are orthogonal to each other)
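These orthogonality relations are easy to verify numerically. The sketch below evaluates the three sums for chosen bin indices; the function name and N = 16 are arbitrary illustrative choices.

```python
import numpy as np

def sinusoid_inner_products(k, l, N):
    """Return (cos.cos, cos.sin, sin.sin) inner products for bins k and l over n = 0..N-1."""
    n = np.arange(N)
    ck, sk = np.cos(2 * np.pi * k * n / N), np.sin(2 * np.pi * k * n / N)
    cl, sl = np.cos(2 * np.pi * l * n / N), np.sin(2 * np.pi * l * n / N)
    return float(ck @ cl), float(ck @ sl), float(sk @ sl)

N = 16
print(sinusoid_inner_products(3, 3, N))    # same frequency
print(sinusoid_inner_products(13, 3, N))   # k = N - l, i.e. the negative frequency
print(sinusoid_inner_products(3, 5, N))    # different frequencies
```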

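Discrete Fourier Transform (DFT)
● Inner product of the input with the sinusoids
○ $X_{re}(l) = \sum_{n=0}^{N-1} x(n) \cos\left(\frac{2\pi l n}{N}\right) = \sum_{n=0}^{N-1} \left( \sum_{k=0}^{N-1} \frac{1}{N} A(k) \cos\left(\frac{2\pi k n}{N} + \phi(k)\right) \right) \cos\left(\frac{2\pi l n}{N}\right) = A(l) \cos\phi(l)$
○ $X_{im}(l) = -\sum_{n=0}^{N-1} x(n) \sin\left(\frac{2\pi l n}{N}\right) = -\sum_{n=0}^{N-1} \left( \sum_{k=0}^{N-1} \frac{1}{N} A(k) \cos\left(\frac{2\pi k n}{N} + \phi(k)\right) \right) \sin\left(\frac{2\pi l n}{N}\right) = A(l) \sin\phi(l)$
● The magnitude and phase
○ $X_{mag}(l) = A(l) = \sqrt{X_{re}^2(l) + X_{im}^2(l)}$, $X_{phase}(l) = \phi(l) = \tan^{-1}\left(\frac{X_{im}(l)}{X_{re}(l)}\right)$
● The definition of DFT can be simplified using complex sinusoids
○ $X(k) = \sum_{n=0}^{N-1} x(n) e^{-j 2\pi k n / N} = X_{re}(k) + j X_{im}(k) = A(k) e^{j\phi(k)}$
○ Euler's identity: $e^{j 2\pi k n / N} = \cos\left(\frac{2\pi k n}{N}\right) + j \sin\left(\frac{2\pi k n}{N}\right)$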
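The derivation can be checked with a direct (slow) implementation: take inner products with cosines and sines, convert to polar form, and resynthesize. This sketch assumes a real-valued input and the representation x(n) = (1/N) Σ A(k) cos(2πkn/N + ϕ(k)); the function names are hypothetical.

```python
import numpy as np

def dft_polar(x):
    """Magnitude A(k) and phase phi(k) from inner products with cosines and sines."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    n = np.arange(N)
    k = n.reshape(-1, 1)                      # one row per frequency bin
    angle = 2 * np.pi * k * n / N
    x_re = np.cos(angle) @ x                  # X_re(k) =  sum_n x(n) cos(2 pi k n / N)
    x_im = -(np.sin(angle) @ x)               # X_im(k) = -sum_n x(n) sin(2 pi k n / N)
    return np.sqrt(x_re ** 2 + x_im ** 2), np.arctan2(x_im, x_re)

def reconstruct(magnitude, phase):
    """x(n) = (1/N) * sum_k A(k) cos(2 pi k n / N + phi(k)) for a real signal."""
    N = len(magnitude)
    n = np.arange(N).reshape(-1, 1)           # one row per time index
    k = np.arange(N)
    return (magnitude * np.cos(2 * np.pi * k * n / N + phase)).sum(axis=1) / N

# Assumed example: amplitude 0.5, bin 3, phase 0.7 rad, N = 32.
N = 32
n = np.arange(N)
x = 0.5 * np.cos(2 * np.pi * 3 * n / N + 0.7)
mag, phase = dft_polar(x)
print(mag[3], phase[3])
```

With this convention a sinusoid of amplitude a at bin k gives A(k) = aN/2, because its energy is shared with the mirrored bin N - k.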

Discrete Fourier Transform (DFT)
● Can be viewed as matrix multiplication
○ Each row of the matrix is a complex sinusoid $s_k^*(n) = e^{-j 2\pi k n / N}$
● In practice, we use an FFT algorithm instead of direct multiplication
○ Divide the matrix into small matrices recursively
○ Complexity reduction: $O(N^2) \rightarrow O(N \log_2 N)$
[Figure: the DFT matrix multiplied by $x(n)$ gives $X_{re}$ and $X_{im}$, which are converted to polar form as $X_{mag}$ and $X_{phase}$]
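A small sketch of the matrix view, compared against NumPy's FFT. The matrix construction below is the textbook DFT matrix; the size 64 and the random test input are arbitrary choices.

```python
import numpy as np

def dft_matrix(N):
    """N x N matrix whose row k is the complex sinusoid exp(-j 2 pi k n / N)."""
    n = np.arange(N)
    return np.exp(-2j * np.pi * np.outer(n, n) / N)

x = np.random.default_rng(0).standard_normal(64)
X_direct = dft_matrix(64) @ x       # O(N^2) matrix-vector product
X_fft = np.fft.fft(x)               # O(N log N) FFT
print(np.allclose(X_direct, X_fft))
```

Both paths compute the same transform; only the number of operations differs.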

Discrete Fourier Transform (DFT)
● When DFT is applied to musical sounds
○ A musical tone with pitch has a periodic waveform
○ DFT shows the harmonic spectrum (harmonic overtones)
○ Pitch information can also be extracted
○ The magnitude is generally more sparse than the waveform
[Figure: waveform $x(n)$ and its magnitude spectrum $X_{mag}(k)$ with peaks at $F_0$, $2F_0$, $3F_0$]

Effect of Window Type
● Types of window functions
○ Trade-off between the width of the main lobe and the level of the side lobes
○ The Hann window is the most widely used in music analysis
[Figure: spectra of windowed single sinusoids for the rectangular, triangular, Hann, and Blackman windows. Top row: window amplitude over time; bottom row: magnitude (dB) over frequency]
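The trade-off can be quantified by measuring each window's main-lobe width and peak side-lobe level from a zero-padded FFT. This is a rough sketch; the window length 64, the FFT size 8192, and the helper name are assumptions.

```python
import numpy as np

def peak_sidelobe_db(window, n_fft=8192):
    """Return (bin index of the first spectral null, peak side-lobe level in dB)."""
    spectrum = np.abs(np.fft.rfft(window, n_fft))           # zero-padded spectrum
    spectrum_db = 20 * np.log10(spectrum / spectrum[0] + 1e-12)
    i = 1
    # Walk down the main lobe until the magnitude stops decreasing (first null).
    while i < len(spectrum_db) - 1 and spectrum_db[i + 1] < spectrum_db[i]:
        i += 1
    return i, float(spectrum_db[i:].max())                  # highest level past the main lobe

rect_null, rect_side = peak_sidelobe_db(np.ones(64))
hann_null, hann_side = peak_sidelobe_db(np.hanning(64))
print(rect_null, round(rect_side, 1))
print(hann_null, round(hann_side, 1))
```

The Hann window should show a main lobe roughly twice as wide as the rectangular window, in exchange for much lower side lobes.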

Effect of Window Size
● Trade-off between time and frequency resolutions
○ Short window: low frequency resolution and high time resolution
○ Long window: high frequency resolution and low time resolution
[Figure: spectrograms with Hop=128, N=4096 and with Hop=128, N=256]
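The two settings in the figure can be compared numerically. The sampling rate is not given on the slide, so 22050 Hz is assumed here purely for illustration.

```python
def stft_resolution(fs_hz, window_size, hop_size):
    """Bin spacing and time spans implied by the STFT parameters."""
    return {
        "freq_resolution_hz": fs_hz / window_size,   # spacing between frequency bins
        "window_duration_s": window_size / fs_hz,    # time span covered by one frame
        "hop_duration_s": hop_size / fs_hz,          # time between successive frames
    }

fs = 22050  # assumed sampling rate
long_win = stft_resolution(fs, window_size=4096, hop_size=128)
short_win = stft_resolution(fs, window_size=256, hop_size=128)
print(long_win)
print(short_win)
```

Under this assumption the long window resolves frequencies about 5 Hz apart but smears events over roughly 186 ms, while the short window localizes events within about 12 ms but has bins about 86 Hz wide.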

Human Ears
● Is the human ear a spectrum analyzer?
○ Our ear has a complicated pathway from the eardrum to the auditory nerve
○ The cochlea in the inner ear is a bandpass-filter bank
○ The membrane resonates at a different position depending on the frequency of the input. The resonance frequency increases on a log scale along the membrane
[Figure: the cochlea and its membrane (unrolled)]

Human Pitch Perception
● Pitch resolution
○ The just noticeable difference (JND) increases as the frequency goes up
● Mel scale
○ Approximates the human pitch resolution based on the pitch ratio of tones
○ Most widely used for speech and music analysis
○ A log frequency scale
○ $m = 2595 \log_{10}(1 + f/700)$
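The mel formula and its inverse can be written directly. The inverse is obtained by solving the formula for f; the function names are illustrative.

```python
import math

def hz_to_mel(f_hz):
    """m = 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel: f = 700 * (10^(m / 2595) - 1)."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

for f in (100.0, 1000.0, 4000.0, 8000.0):
    print(f, round(hz_to_mel(f), 1))
```

The scale is close to linear at low frequencies and compresses high frequencies, and 1000 Hz maps to approximately 1000 mel.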
