GCT634/AI613: Musical Applications of Machine Learning (Fall 2020)
Machine Learning for Music: Intro
Juhan Nam
Definition of Machine Learning
- Tom M. Mitchell provided a widely accepted definition:
○ “A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E”
Definition of Machine Learning
- Tasks T
○ Classification, Regression, Transcription, Machine Translation, Structured Output, Anomaly Detection, Synthesis and Sampling, Imputation of Missing Values, Denoising, and Density Estimation (listed from the Deep Learning book)
- Experience E
○ Data and their correspondence: supervised, unsupervised, and reinforcement learning
- Performance P
○ Loss function, accuracy metrics
In Musical Context
- Tasks T
○ Analysis tasks: music genre/mood classification, music auto-tagging, automatic music transcription, source separation
○ Synthesis tasks: sound synthesis, music generation (automatic music composition or arrangement), expressive performance rendering
- Experience E
○ Music data (audio, MIDI, text, images) and their correspondence
- Performance P
○ Objective measures: loss function, accuracy metrics (e.g., F-score)
○ Subjective measures: user test (i.e., human test)
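As an example of an objective measure, the F-score above combines precision and recall; a minimal sketch in plain Python for binary labels (the toy tag predictions are illustrative, not from the lecture):

```python
def f_score(y_true, y_pred):
    """F1-score for binary labels: harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Toy tag predictions: 3 true positives, 1 false positive, 1 false negative
print(f_score([1, 1, 1, 1, 0, 0], [1, 1, 1, 0, 1, 0]))  # 0.75
```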
Classification Tasks in Music
- Classification is the most commonly used supervised learning approach in music analysis tasks
○ Train the model with audio data and its class labels, then predict labels from new test audio
○ Examples at different time scales:
■ Pitch estimation (frame-level): “C2”, “C#2”, “D2”, …
■ Instrument recognition (note-level): “Piano”, “Drum”, “Guitar”, …
■ Genre classification (segment-level): “Jazz”, “Metal”, “Classical”, …
Classification Model for Music
- Classification models are commonly formed with the following steps:
○ Audio data representation: waveform, spectrogram, mel-spectrogram
○ Feature extraction: highly depends on the task and the abstraction level
■ Higher-level tasks require longer input sizes and more complex features
○ Classifier: measures the distance between the feature vector and class templates for the final classification
[Figure: classification model pipeline: audio data representation → feature extraction → classifier → “Class #1”, “Class #2”, “Class #3”, …]
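The classifier step described above, measuring the distance between a feature vector and class templates, can be sketched as a nearest-centroid classifier; a minimal NumPy sketch with hypothetical toy 2-D features:

```python
import numpy as np

def train_templates(features, labels, n_classes):
    """Class template = mean feature vector of each class (nearest-centroid)."""
    return np.stack([features[labels == c].mean(axis=0) for c in range(n_classes)])

def classify(x, templates):
    """Predict the class whose template is closest in Euclidean distance."""
    dists = np.linalg.norm(templates - x, axis=1)
    return int(np.argmin(dists))

# Hypothetical toy features: class 0 near the origin, class 1 near (1, 1)
feats = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]])
labels = np.array([0, 0, 1, 1])
templates = train_templates(feats, labels, n_classes=2)
print(classify(np.array([0.1, 0.0]), templates))  # 0
```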
Classification Model for Music
- It is important to extract good audio features!
[Figure: “Metal”, “Jazz”, “Classical” examples plotted in feature space: good features vs. bad features]
Classification Model for Music
- Two approaches: traditional machine learning and deep learning
- Traditional machine learning uses hand-designed features for the task
○ Based on domain knowledge (e.g., acoustics, signal processing)
○ Mel-frequency cepstral coefficients (MFCC), chroma, spectral statistics
- It also uses standard classifiers
○ Logistic regression, support vector machine, multi-layer perceptron
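Of the standard classifiers listed above, logistic regression is the simplest to write out; a minimal NumPy sketch trained by plain gradient descent on a hypothetical 1-D feature (the learning rate and step count are illustrative):

```python
import numpy as np

def train_logreg(X, y, lr=0.5, steps=500):
    """Binary logistic regression trained by plain gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid prediction
        grad_w = X.T @ (p - y) / len(y)         # gradient of the log loss
        grad_b = np.mean(p - y)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Hypothetical 1-D "feature": low values -> class 0, high values -> class 1
X = np.array([[0.0], [0.1], [0.2], [0.8], [0.9], [1.0]])
y = np.array([0, 0, 0, 1, 1, 1])
w, b = train_logreg(X, y)
pred = (1.0 / (1.0 + np.exp(-(X @ w + b))) > 0.5).astype(int)
print(pred.tolist())  # [0, 0, 0, 1, 1, 1]
```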
Traditional Machine Learning
[Figure: traditional ML pipeline: audio data representation → hand-designed features → classifier (fit by the learning algorithm) → “Class #1”, “Class #2”, “Class #3”, …]
Traditional Machine Learning
- Advantages
○ A small dataset is fine
○ The classifiers are fast to train
○ The hand-designed features are interpretable
- Disadvantages
○ Requires domain knowledge
○ The feature design is an art
○ The two-stage approach is sub-optimal
- Good as a baseline algorithm
Deep Learning
- Learn feature representations using neural network modules
○ Better to call it representation learning
○ Fully-connected, convolutional, recurrent, pooling, and non-linear layers
○ Stack more layers as the output has a higher abstraction level
○ The audio data representation can also be learned (end-to-end learning)
○ Gradient-based learning: all neural network modules are differentiable. We can also add a new custom layer as long as it is differentiable
[Figure: deep learning pipeline: audio data representation → neural network modules (features learned via feature embedding) → linear classifier → “Class #1”, “Class #2”, “Class #3”, …; all trained by the learning algorithm]
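The idea of stacking differentiable modules (linear transform → non-linearity → linear classifier) can be sketched as a single forward pass; the layer sizes and random weights below are illustrative, not a trained model:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

# A tiny stack of neural network modules: linear -> non-linear -> linear
W1 = rng.normal(size=(16, 8)) * 0.1   # fully-connected layer 1
W2 = rng.normal(size=(8, 3)) * 0.1    # fully-connected layer 2 (3 classes)

def forward(x):
    h = relu(x @ W1)                               # learned feature embedding
    logits = h @ W2                                # linear classifier on top
    return np.exp(logits) / np.exp(logits).sum()   # softmax over classes

probs = forward(rng.normal(size=16))
print(probs.shape)  # (3,)
```

Because every module is differentiable, gradients can flow through the whole stack, which is what lets the feature embedding itself be learned.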
Deep Learning
- Advantages
○ Less domain knowledge required. We can borrow many successful models from other domains (e.g. image or speech)
○ The trained model is reusable (transfer learning)
○ Superior performance in numerous machine learning tasks
Deep Learning
- Disadvantages (or challenges)
○ A large-scale labeled dataset is required, and the models are slow to train
■ Semi-supervised, unsupervised, and self-supervised learning are actively developed
○ Regularization is required to avoid overfitting
■ Many regularization techniques have been studied
○ Designing neural nets and searching hyperparameters is an art
■ Model and hyperparameter optimization is another research topic: e.g., AutoML
○ Understanding learned features is hard
■ Feature visualization techniques
■ Disentangled learning models, where one parameter controls one sub-dimension of the learned features
Example: Mel-Frequency Cepstral Coefficient (MFCC)
- The most popularly used audio feature to extract “timbre”
○ Extracts the spectral envelope from an audio frame, removing pitch information
○ The standard audio feature in legacy speech recognition systems
- Computation steps
○ Mel-spectrum: apply a mel filter bank
○ Discrete cosine transform (DCT): a small set of cosine kernels with low frequencies; it captures the slowly varying trend of the mel-spectrum over frequency, which corresponds to the spectral envelope
[Figure: MFCC computation: DFT → abs (magnitude spectrum) → mel filter bank → log compression (mel-spectrum) → DCT → MFCC]
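The computation steps above can be sketched with NumPy alone (the type-II DCT is written out explicitly to avoid extra dependencies); the number of mel bands, frame length, and sample rate are illustrative assumptions, not the exact settings used in the lecture:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    """Triangular filters spaced uniformly on the mel scale."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fb[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)  # rising slope
        fb[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)  # falling slope
    return fb

def dct2(x):
    """Type-II DCT written as an explicit cosine basis (no SciPy needed)."""
    N = len(x)
    k = np.arange(N)[:, None]
    n = np.arange(N)[None, :]
    return np.cos(np.pi / N * (n + 0.5) * k) @ x

def mfcc_frame(frame, sr, n_mels=40, n_coeff=13):
    spectrum = np.abs(np.fft.rfft(frame))                          # DFT + abs
    mel_spec = mel_filterbank(n_mels, len(frame), sr) @ spectrum   # mel filter bank
    log_mel = np.log(mel_spec + 1e-10)                             # log compression
    return dct2(log_mel)[:n_coeff]                                 # keep low DCT coefficients

# Toy usage: one 1024-sample frame of a 440 Hz tone at 16 kHz
sr = 16000
t = np.arange(1024) / sr
coeffs = mfcc_frame(np.sin(2 * np.pi * 440 * t), sr)
print(coeffs.shape)  # (13,)
```

Keeping only the first few DCT coefficients is exactly the "slowly varying trend" step: higher coefficients, which carry the fine pitch structure, are discarded.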
Example: Mel-Frequency Cepstral Coefficient (MFCC)
[Figure: MFCC analysis and reconstruction: magnitude spectrum (512 bins) → mel filterbank → mel-scaled frequency spectrum (60 bins) → DCT → MFCC (13 dim) → inverse DCT → reconstructed mel spectrum → inverse mel filterbank → reconstructed magnitude spectrum]
Example: Mel-Frequency Cepstral Coefficient (MFCC)
[Figure: spectrogram, mel-frequency spectrogram, MFCC, and the spectrogram reconstructed from MFCC]
Representation Learning Point of View: MFCC
- We can replace the hand-designed modules with the trainable modules
○ The DFT, mel filterbank, and DCT are linear transforms
○ Abs and log compression are non-linear functions
○ In MFCC the linear transforms are designed by hand, but they can be further optimized as trainable modules
[Figure: DFT, mel filterbank, and DCT map to linear transforms; abs and log compression map to non-linear functions; together they form a deep neural network computing MFCC]
Example: Chroma
- Musical notes are denoted with a pitch class and an octave number
○ Pitch classes: C, C#, D, D#, E, F, F#, G, G#, A, A#, B
○ Octave numbers: 0, 1, 2, 3, 4, 5, …
○ Examples: C4 (middle C), E3, G5
- The octave difference is the most consonant pitch interval
○ Therefore, notes an octave apart belong to the same pitch class
- This can be represented with a “pitch helix”
○ Chroma: the inherent circularity of pitch organization
○ Height: increases naturally; one full rotation corresponds to going up one octave
[Figure: pitch helix and chroma (Shepard, 2001)]
Example: Chroma
- Compute the energy distribution of an audio frame over the 12 pitch classes
○ Convert the frequency to a MIDI note number, p = 12 log2(f / 440) + 69, and take the pitch class from the note (e.g., 69 → A4 → A)
○ Extracts harmonic characteristics while removing timbre information
○ Useful for music synchronization, chord recognition, music structure analysis, and music genre classification
- Computation Steps
○ Projecting the DFT or Constant-Q transform onto 12 pitch classes
[Figure: chroma computation: DFT or constant-Q transform → abs (magnitude) → chroma mapping → chroma]
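The frequency-to-pitch-class projection above can be sketched in NumPy for a single frame; the frame length and sample rate are illustrative, and the A4 = 440 Hz tuning reference comes from the note-number formula:

```python
import numpy as np

def chroma_frame(frame, sr):
    """Project the magnitude spectrum of one frame onto 12 pitch classes."""
    spectrum = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    chroma = np.zeros(12)
    for f, mag in zip(freqs[1:], spectrum[1:]):  # skip the DC bin
        midi = 12 * np.log2(f / 440.0) + 69      # frequency -> MIDI note number
        chroma[int(round(midi)) % 12] += mag     # note number -> pitch class (0 = C)
    return chroma / (chroma.sum() + 1e-10)       # normalize to an energy distribution

# Toy usage: an A4 (440 Hz) tone should put most energy in pitch class A (index 9)
sr = 16000
t = np.arange(2048) / sr
c = chroma_frame(np.sin(2 * np.pi * 440 * t), sr)
print(int(np.argmax(c)))  # 9
```

The modulo-12 step is where octave information is discarded: A3, A4, and A5 all land in the same bin, which is exactly the chroma circularity described above.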
Example: Chroma
[Figure: spectrogram → chroma mapping → chroma (reconstructed chroma: Shepard tone)]
Representation Learning Point of View: Chroma
- We can replace the hand-designed modules with the trainable modules
○ The DFT, constant-Q transform, and chroma mapping are linear transforms
○ Abs corresponds to a non-linear function
○ In chroma the linear transforms are designed by hand, but they can be further optimized as trainable modules
[Figure: DFT or constant-Q transform and chroma mapping map to linear transforms; abs maps to a non-linear function; together they form a deep neural network computing chroma]
Summary
- Introduced machine learning from the perspective of representation learning (or feature learning)
- In the traditional machine learning approach, we design feature representations by hand; once the features are extracted, we use standard machine learning algorithms
- In the deep learning approach, we design the network architecture by hand; the feature representations are learned through the neural network