

SLIDE 1

Juhan Nam

GCT634/AI613: Musical Applications of Machine Learning (Fall 2020)

Machine Learning for Music: Intro

SLIDE 2

Definition of Machine Learning

  • Tom M. Mitchell provided a widely accepted definition:

○ “A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E”

SLIDE 3

Definition of Machine Learning

  • Tasks T

○ Classification, Regression, Transcription, Machine Translation, Structured Output, Anomaly Detection, Synthesis and Sampling, Imputation of Missing Values, Denoising, and Density Estimation (listed from the DL book)

  • Experience E

○ Data and their correspondence: supervised / unsupervised / reinforcement learning

  • Performance P

○ Loss functions, accuracy metrics

SLIDE 4

In Musical Context

  • Tasks T

○ Analysis tasks: music genre/mood classification, music auto-tagging, automatic music transcription, source separation
○ Synthesis tasks: sound synthesis, music generation (automatic music composition or arrangement), expressive performance rendering

  • Experience E

○ Music data (audio, MIDI, text, images) and their correspondence

  • Performance P

○ Objective measures: loss function, accuracy metrics (e.g., F-score)
○ Subjective measures: user test (i.e., human test)
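The objective measures above can be made concrete with a small sketch. Below is a minimal NumPy implementation of precision, recall, and F-score for a single class; the function name and the toy label arrays are illustrative, not from the slides.

```python
import numpy as np

def f_score(y_true, y_pred, positive=1):
    """Precision, recall, and F-score for one class."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == positive) & (y_true == positive))  # true positives
    fp = np.sum((y_pred == positive) & (y_true != positive))  # false positives
    fn = np.sum((y_pred != positive) & (y_true == positive))  # false negatives
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy example: 5 test clips, binary label (e.g., "is jazz")
p, r, f = f_score([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```

Note that the loss function guides training while metrics such as the F-score evaluate the trained model; the two need not be the same quantity.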

SLIDE 5

Classification Tasks in Music

  • Classification is the most commonly used supervised learning approach in music analysis tasks

○ Train the model with audio data and its class labels, then predict labels for new test audio

[Figure: classification model mapping audio to "C2", "C#2", "D2", … — frame-level pitch estimation]

SLIDE 6

Classification Tasks in Music

  • Classification is the most commonly used supervised learning approach in many music analysis tasks

○ Train the model with audio data and its class labels, then predict labels for new test audio

[Figure: classification model mapping audio to "Piano", "Drum", "Guitar", … — note-level instrument recognition]

SLIDE 7

Classification Tasks in Music

  • Classification is the most commonly used supervised learning approach in music analysis tasks

○ Train the model with audio data and its class labels, then predict labels for new test audio

[Figure: classification model mapping audio to "Jazz", "Metal", "Classical", … — segment-level genre classification]

SLIDE 8

Classification Model for Music

  • The classification models are formed with the following steps in common

○ Audio data representation: waveforms, spectrogram, mel-spectrogram
○ Feature extraction: highly depends on the task and the abstraction level

■ Higher-level tasks require longer input sizes and more complex features

○ Classifier: measures the distance between the feature vector and class templates for the final classification

[Diagram (classification model): Audio Data Representation → Feature Extraction → Classifier → "Class #1", "Class #2", "Class #3", …]
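The classifier step described above ("measuring the distance between the feature vector and class templates") can be sketched as a nearest-template classifier in NumPy. The helper names and the toy 2-D features below are hypothetical:

```python
import numpy as np

def fit_templates(features, labels):
    """One template per class: the mean feature vector of that class."""
    classes = np.unique(labels)
    return classes, np.stack([features[labels == c].mean(axis=0) for c in classes])

def classify(x, classes, templates):
    """Assign x to the class whose template is nearest (Euclidean distance)."""
    return classes[np.argmin(np.linalg.norm(templates - x, axis=1))]

# Toy 2-D feature vectors for two well-separated classes
rng = np.random.default_rng(0)
feats = np.vstack([rng.normal(0.0, 0.1, (20, 2)),   # class 0 clustered near (0, 0)
                   rng.normal(1.0, 0.1, (20, 2))])  # class 1 clustered near (1, 1)
labels = np.array([0] * 20 + [1] * 20)

classes, templates = fit_templates(feats, labels)
pred = classify(np.array([0.9, 1.1]), classes, templates)
```

Standard classifiers (logistic regression, SVM) refine this idea, but the distance-to-template view is the common core.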

SLIDE 9

Classification Model for Music

  • It is important to extract good audio features!

[Figure: feature spaces for "Metal", "Jazz", "Classical" — good features separate the classes into distinct clusters; bad features leave them overlapped]

SLIDE 10

Classification Model for Music

  • Traditional machine learning
  • Deep learning
SLIDE 11
  • Use hand-designed features for the task

○ Based on domain knowledge (e.g., acoustics, signal processing)
○ Mel-frequency cepstral coefficients (MFCC), chroma, spectral statistics

  • Use standard classifiers

○ Logistic regression, support vector machine, multi-layer perceptron

[Diagram (traditional machine learning): Audio Data Representation → Hand-designed Features → Classifier → "Class #1", "Class #2", "Class #3", …; the learning algorithm trains only the classifier]

SLIDE 12

Traditional Machine Learning

  • Advantages

○ A small dataset is fine
○ The classifiers are fast to train
○ The hand-designed features are interpretable

  • Disadvantages

○ Requires domain knowledge
○ The feature design is an art
○ The two-stage approach is sub-optimal

  • Good as a baseline algorithm
SLIDE 13

Deep Learning

  • Learn feature representations using neural network modules

○ Better to call it representation learning
○ Fully-connected, convolutional, recurrent, pooling, and non-linear layers
○ Stack more layers as the output has a higher abstraction level
○ The audio data representation can also be learned (end-to-end learning)
○ Gradient-based learning: all neural network modules are differentiable. We can also add a new custom layer as long as it is differentiable

[Diagram (classification model with deep learning): Audio Data Representation → Neural Network Modules (learned features via feature embedding) → Linear Classifier → "Class #1", "Class #2", "Class #3", …; the learning algorithm trains the feature embedding and the classifier together]

SLIDE 14

Deep Learning

  • Advantages

○ Less domain knowledge is required; we can borrow many successful models from other domains (e.g., image or speech)
○ The trained model is reusable (transfer learning)
○ Superior performance in numerous machine learning tasks

SLIDE 15

Deep Learning

  • Disadvantages (or challenges)

○ Requires a large-scale labeled dataset, and the models are slow to train

■ Semi-supervised/unsupervised/self-supervised learning are actively developed

○ Requires regularization to avoid overfitting

■ Many regularization techniques have been studied

○ Designing neural nets and searching hyperparameters is an art

■ Model and hyperparameter optimization is another research topic: e.g., AutoML

○ Understanding learned features is hard

■ Feature visualization techniques
■ Disentangled learning models, where one parameter controls one sub-dimension of the learned features
SLIDE 16

Example: Mel-Frequency Cepstral Coefficient (MFCC)

  • The most popularly used audio feature to extract “timbre”

○ Extracts the spectral envelope from an audio frame, removing pitch information
○ The standard audio feature in legacy speech recognition systems

  • Computation Steps

○ Mel-spectrum: use a mel filter bank
○ Discrete cosine transform (DCT): a small set of cosine kernels with low frequencies. It captures the slowly varying trend of the mel-spectrum over frequency, which corresponds to the spectral envelope

[Pipeline: audio frame → DFT → abs (magnitude spectrum) → mel filterbank (mel-spectrum) → log compression → DCT → MFCC]
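The computation steps above can be sketched end-to-end in NumPy. This is a simplified from-scratch version; the filterbank construction and the DCT scaling are illustrative, and library implementations such as librosa differ in details:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    """Triangular filters with center frequencies evenly spaced on the mel scale."""
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fb[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)  # rising slope
        fb[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)  # falling slope
    return fb

def mfcc(frame, sr, n_mels=60, n_mfcc=13):
    """Steps from the slide: DFT -> abs -> mel filterbank -> log -> DCT."""
    spec = np.abs(np.fft.rfft(frame))                     # magnitude spectrum
    mel = mel_filterbank(n_mels, len(frame), sr) @ spec   # mel-spectrum
    logmel = np.log(mel + 1e-10)                          # log compression
    n = np.arange(n_mels)
    dct = np.cos(np.pi / n_mels * (n[None, :] + 0.5) * np.arange(n_mfcc)[:, None])
    return dct @ logmel                                   # low-order coefficients

sr = 16000
t = np.arange(1024) / sr
coeffs = mfcc(np.sin(2 * np.pi * 440.0 * t), sr)          # one 64 ms frame
```

Keeping only the 13 lowest-frequency cosine kernels is what discards the fine (pitch-related) structure of the mel-spectrum.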

SLIDE 17

Example: Mel-Frequency Cepstral Coefficient (MFCC)

[Figure: magnitude spectrum (512 bins) → mel filterbank → mel-scaled spectrum (60 bins) → DCT → MFCC (13 dim); the inverse DCT and the inverse mel filterbank reconstruct the mel spectrum and the magnitude spectrum]
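The DCT half of the reconstruction path can be sketched directly: truncating to 13 coefficients and inverting recovers a smoothed version of the log-mel spectrum, i.e., its envelope. The stand-in spectrum below is synthetic, not real audio:

```python
import numpy as np

n_mels, n_mfcc = 60, 13
n = np.arange(n_mels)

# DCT-II basis restricted to the first 13 (low-frequency) cosine kernels
D = np.cos(np.pi / n_mels * (n[None, :] + 0.5) * np.arange(n_mfcc)[:, None])

log_mel = np.log(1.0 + np.abs(np.sin(0.1 * n)))  # synthetic stand-in log-mel spectrum
coeffs = D @ log_mel                             # forward: keep only 13 coefficients
# Inverse DCT-II (the k = 0 term is counted at half weight)
recon = (2.0 / n_mels) * (D.T @ coeffs - 0.5 * coeffs[0])
```

Because only slowly varying kernels are kept, `recon` follows the overall shape of `log_mel` while fine detail is discarded — the pitch-removal effect described on the previous slide.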

SLIDE 18

Example: Mel-Frequency Cepstral Coefficient (MFCC)

[Figure: spectrogram, mel-frequency spectrogram, MFCC, and the spectrogram reconstructed from MFCC]

SLIDE 19

Representation Learning Point of View: MFCC

  • We can replace the hand-designed modules with the trainable modules

○ DFT, mel filterbank, and DCT are linear transforms
○ Abs and log compression are non-linear functions
○ In MFCC the linear transforms are designed by hand, but they can be optimized further as trainable modules

[Diagram: the MFCC pipeline (DFT → abs → mel filterbank → log compression → DCT), with each linear transform and non-linear function viewed as a layer of a deep neural network]
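The idea above can be sketched as a forward pass in which each fixed linear transform is replaced by a weight matrix that gradient descent could update. The shapes mirror the MFCC pipeline; the names and random initialization are illustrative, and no actual training is shown:

```python
import numpy as np

rng = np.random.default_rng(0)
n_fft_bins, n_mel, n_out = 513, 60, 13   # shapes matching the MFCC pipeline

# Trainable stand-ins for the fixed linear transforms
W_mel = rng.normal(0.0, 0.01, (n_mel, n_fft_bins))  # replaces the mel filterbank
W_out = rng.normal(0.0, 0.01, (n_out, n_mel))       # replaces the DCT

def forward(magnitude_spectrum):
    """Linear -> non-linearity -> linear, mirroring filterbank -> log -> DCT."""
    h = W_mel @ magnitude_spectrum
    h = np.log(np.abs(h) + 1e-10)       # fixed non-linear function
    return W_out @ h

out = forward(np.abs(rng.normal(size=n_fft_bins)))  # random stand-in spectrum
```

Because every step is differentiable, gradients with respect to `W_mel` and `W_out` exist, which is exactly what lets the transforms be learned instead of hand-designed.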

SLIDE 20

Example: Chroma

  • Musical notes are denoted with a pitch class and an octave number

○ Pitch class: C, C#, D, D#, E, F, F#, G, G#, A, A#, B
○ Octave number: 0, 1, 2, 3, 4, 5, …
○ Examples: C4 (middle C), E3, G5

  • The octave is the most consonant pitch interval

○ Therefore, notes an octave apart belong to the same pitch class

  • This can be represented with a “pitch helix”

○ Chroma: the inherent circularity of pitch organization
○ Height: increases naturally, rising one octave per rotation

[Figure: pitch helix and chroma (Shepard, 2001)]

SLIDE 21

Example: Chroma

  • Compute the energy distribution of an audio frame on 12 pitch classes

○ Convert the frequency to a MIDI note number (n = 12·log2(f/440) + 69) and take the pitch class from the note (e.g., 69 → A4 → A)
○ Extracts harmonic characteristics while removing timbre information
○ Useful in music synchronization, chord recognition, music structure analysis, and music genre classification

  • Computation Steps

○ Projecting the DFT or Constant-Q transform onto 12 pitch classes

[Pipeline: audio frame → DFT or constant-Q transform → abs (magnitude) → chroma mapping → chroma]
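The chroma mapping step can be sketched directly from the note-number formula above: each DFT bin's frequency is converted to a MIDI note, and its magnitude is accumulated into that note's pitch class (C = 0, …, A = 9, B = 11). A minimal NumPy version:

```python
import numpy as np

def chroma(frame, sr):
    """Project the magnitude spectrum of one frame onto 12 pitch classes."""
    spec = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), 1.0 / sr)
    c = np.zeros(12)
    for f, mag in zip(freqs[1:], spec[1:]):      # skip the DC bin
        midi = 12.0 * np.log2(f / 440.0) + 69.0  # frequency -> MIDI note number
        c[int(round(midi)) % 12] += mag          # note number -> pitch class
    return c / (c.sum() + 1e-10)                 # normalize to a distribution

sr = 16000
t = np.arange(4096) / sr
v = chroma(np.sin(2 * np.pi * 440.0 * t), sr)    # A4: energy lands in class A
```

Replacing the DFT with a constant-Q transform (log-spaced frequency bins) makes the mapping per bin sharper, but the pitch-class projection itself is the same.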

SLIDE 22

Example: Chroma

[Figure: spectrogram, chroma, and chroma mapping (reconstructed chroma: Shepard tone)]

SLIDE 23

Representation Learning Point of View: Chroma

  • We can replace the hand-designed modules with the trainable modules

○ DFT, constant-Q transform, and chroma mapping are linear transforms
○ Abs corresponds to a non-linear function
○ In chroma the linear transforms are designed by hand, but they can be optimized further as trainable modules

[Diagram: the chroma pipeline (DFT or constant-Q transform → abs → chroma mapping), with each linear transform and non-linear function viewed as a layer of a deep neural network]

SLIDE 24

Summary

  • Introduced machine learning from the perspective of representation learning (or feature learning)

  • In the traditional machine learning approach, we design feature representations by hand. Once the features are extracted, we use standard machine learning algorithms.

  • In the deep learning approach, we design the network architecture by hand. The feature representations are learned through the neural network modules and the optimization.