1. GCT634/AI613: Musical Applications of Machine Learning (Fall 2020)
Machine Learning for Music: Intro
Juhan Nam

2. Definition of Machine Learning
● Tom M. Mitchell provided a widely accepted definition:
○ “A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.”

3. Definition of Machine Learning
● Tasks T
○ Classification, regression, transcription, machine translation, structured output, anomaly detection, synthesis and sampling, imputation of missing values, denoising, and density estimation (listed from the DL book)
● Experience E
○ Data and their correspondence: supervised learning / unsupervised learning / reinforcement learning
● Performance P
○ Loss function, accuracy metrics

4. In the Musical Context
● Tasks T
○ Analysis tasks: music genre/mood classification, music auto-tagging, automatic music transcription, source separation
○ Synthesis tasks: sound synthesis, music generation (automatic music composition or arrangement), expressive performance rendering
● Experience E
○ Music data (audio, MIDI, text, images) and their correspondence
● Performance P
○ Objective measures: loss function, accuracy metrics (e.g., F-score)
○ Subjective measures: user test (i.e., human listening test)

5. Classification Tasks in Music
● Classification is the most commonly used supervised learning approach in music analysis tasks
○ Train the model with audio data and its class labels, then predict labels for new test audio
[Figure: a classification model maps audio frames to pitch labels such as “C2”, “C#2”, “D2”, ... (Pitch Estimation, frame-level)]

6. Classification Tasks in Music
● Classification is the most commonly used supervised learning approach in many music analysis tasks
○ Train the model with audio data and its class labels, then predict labels for new test audio
[Figure: a classification model maps individual notes to instrument labels such as “Piano”, “Drum”, “Guitar”, ... (Instrument Recognition, note-level)]

7. Classification Tasks in Music
● Classification is the most commonly used supervised learning approach in music analysis tasks
○ Train the model with audio data and its class labels, then predict labels for new test audio
[Figure: a classification model maps audio segments to genre labels such as “Jazz”, “Metal”, “Classical”, ... (Genre Classification, segment-level)]

8. Classification Model for Music
● Classification models are commonly built from the following steps:
○ Audio data representation: waveform, spectrogram, mel-spectrogram
○ Feature extraction: highly depends on the task and the abstraction level
■ Higher abstraction levels require longer input sizes and more complex features
○ Classifier: measures the distance between the feature vector and class templates to make the final classification
[Figure: Audio Data Representation → Feature Extraction → Classifier → “Class #1”, “Class #2”, “Class #3”, ... (Classification Model)]

9. Classification Model for Music
● It is important to extract good audio features!
[Figure: two feature spaces for the classes “Classical”, “Jazz”, “Metal”: with good features the classes form separable clusters; with bad features they overlap]

10. Classification Model for Music
● Traditional machine learning
● Deep learning

11. Traditional Machine Learning
● Use hand-designed features for the task
○ Based on domain knowledge (e.g., acoustics, signal processing)
○ Mel-frequency cepstral coefficients (MFCC), chroma, spectral statistics
● Use standard classifiers
○ Logistic regression, support vector machine, multi-layer perceptron
[Figure: Audio Data Representation → Hand-designed Features → Classifier (learning algorithm) → “Class #1”, “Class #2”, “Class #3”, ... (Classification Model)]
A minimal code sketch of this two-stage pipeline follows below.
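The sketch assumes librosa and scikit-learn; the feature settings and the training set variables (audio_files, labels) are hypothetical placeholders, not from the slides.

```python
import numpy as np
import librosa
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def extract_features(path, sr=22050):
    """Hand-designed features: frame-level MFCCs summarized over time."""
    y, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # (13, n_frames)
    # Collapse the time axis with simple statistics (mean and std)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

# Stage 1: extract hand-designed features from a labeled training set
X = np.stack([extract_features(f) for f in audio_files])

# Stage 2: train a standard classifier on the extracted features
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
clf.fit(X, labels)

# Predict the class of a new test clip
pred = clf.predict(extract_features("test_clip.wav")[None, :])
```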

12. Traditional Machine Learning
● Advantages
○ A small dataset is fine
○ The classifiers are fast to train
○ The hand-designed features are interpretable
● Disadvantages
○ Requires domain knowledge
○ Feature design is an art
○ The two-stage approach is sub-optimal
● Good as a baseline algorithm

13. Deep Learning
● Learn feature representations using neural network modules
○ Better called representation learning
○ Fully-connected, convolutional, recurrent, pooling, and non-linear layers
○ Stack more layers as the output has a higher abstraction level
○ The audio data representation can also be learned (end-to-end learning)
○ Gradient-based learning: all neural network modules are differentiable; we can also add a new custom layer as long as it is differentiable
[Figure: Audio Data Representation → Neural Network Modules (learned features via feature embedding) → Linear Classifier → “Class #1”, “Class #2”, “Class #3”, ... (Classification Model)]
A minimal model sketch follows below.
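The PyTorch sketch below assumes mel-spectrogram input of shape (batch, 1, n_mels, n_frames); the layer sizes and class count are illustrative, not from the slides.

```python
import torch
import torch.nn as nn

class MusicClassifier(nn.Module):
    """Stacked conv blocks learn the features; a linear layer classifies."""
    def __init__(self, n_classes=3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1),        # pool over time and frequency
        )
        self.classifier = nn.Linear(32, n_classes)

    def forward(self, x):                   # x: (batch, 1, n_mels, n_frames)
        h = self.features(x).flatten(1)     # learned feature embedding
        return self.classifier(h)           # linear classifier on top

model = MusicClassifier()
logits = model(torch.randn(8, 1, 128, 256))  # dummy mel-spectrogram batch
```

Because every module is differentiable, the feature extractor and the classifier are optimized jointly with gradient descent, rather than in two separate stages.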

14. Deep Learning
● Advantages
○ Less domain knowledge is required; we can borrow many successful models from other domains (e.g., image or speech)
○ The trained model is reusable (transfer learning)
○ Superior performance in numerous machine learning tasks

15. Deep Learning
● Disadvantages (or challenges)
○ Requires a large-scale labeled dataset, and the models are slow to train
■ Semi-supervised, unsupervised, and self-supervised learning are actively developed to reduce the labeling burden
○ Requires regularization to avoid overfitting
■ Many regularization techniques have been studied
○ Designing neural nets and searching hyperparameters is an art
■ Model and hyperparameter optimization is another research topic: e.g., AutoML
○ Understanding the learned features is hard
■ Feature visualization techniques
■ Disentangled learning models, where one parameter controls one sub-dimension of the learned features

16. Example: Mel-Frequency Cepstral Coefficients (MFCC)
● The most popularly used audio feature for extracting “timbre”
○ Extracts the spectral envelope from an audio frame, removing pitch information
○ The standard audio feature in legacy speech recognition systems
● Computation steps (a code sketch follows below)
○ Mel-spectrum: apply a mel filterbank to the magnitude spectrum
○ Discrete cosine transform (DCT): project onto a small set of low-frequency cosine kernels; this captures the slowly varying trend of the mel-spectrum over frequency, which corresponds to the spectral envelope
[Figure: DFT → abs (magnitude) → Mel Filterbank → log compression → DCT → MFCC, with the magnitude spectrum and mel-spectrum as intermediate outputs]
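Written out step by step, under assumed frame and filterbank settings (the 60-bin mel filterbank matches the figure on the next slide; clip.wav is a placeholder file):

```python
import numpy as np
import librosa
import scipy.fftpack

def mfcc_frame(frame, sr=22050, n_fft=2048, n_mels=60, n_mfcc=13):
    """MFCC of one windowed audio frame, following the block diagram."""
    magnitude = np.abs(np.fft.rfft(frame, n=n_fft))           # DFT + abs
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    mel_spectrum = mel_fb @ magnitude                         # mel filterbank
    log_mel = np.log(mel_spectrum + 1e-10)                    # log compression
    return scipy.fftpack.dct(log_mel, norm="ortho")[:n_mfcc]  # DCT, keep low kernels

y, sr = librosa.load("clip.wav")        # placeholder audio file
frame = y[:2048] * np.hanning(2048)     # one windowed frame
print(mfcc_frame(frame, sr=sr))         # 13-dimensional MFCC vector
```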

17. Example: Mel-Frequency Cepstral Coefficients (MFCC)
[Figure: forward chain: magnitude spectrum (512 bins) → mel filterbank → mel-scaled spectrum (60 bins) → DCT → MFCC (13 dim); inverse chain: inverse DCT → reconstructed mel spectrum → inverse mel filterbank → reconstructed magnitude spectrum]
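librosa ships approximate inverses for both stages; a sketch of the reconstruction path in the figure, with clip.wav again a placeholder and the mel settings left at librosa defaults:

```python
import librosa
import librosa.feature.inverse

y, sr = librosa.load("clip.wav")                     # placeholder audio file
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # (13, n_frames)

# Inverse DCT: MFCC -> reconstructed mel spectrum
mel_rec = librosa.feature.inverse.mfcc_to_mel(mfcc, n_mels=128)

# Pseudo-inverse mel filterbank: mel spectrum -> reconstructed magnitude spectrum
stft_rec = librosa.feature.inverse.mel_to_stft(mel_rec, sr=sr)
```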

18. Example: Mel-Frequency Cepstral Coefficients (MFCC)
[Figure: spectrogram, mel-frequency spectrogram, MFCC, and the spectrogram reconstructed from the MFCC]

19. Representation Learning Point of View: MFCC
● We can replace the hand-designed modules with trainable modules
○ The DFT, mel filterbank, and DCT are each a linear transform
○ Abs and log compression are non-linear functions
○ In MFCC the linear transforms are designed by hand, but they can be optimized further as trainable modules
[Figure: DFT → abs (magnitude) → Mel Filterbank → log compression → DCT → MFCC, annotated as Linear Transform → Non-linear function → Linear Transform → Non-linear function → Linear Transform, i.e., a deep neural network]
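A PyTorch sketch of the same chain with the linear transforms made trainable. This is a conceptual simplification: the complex DFT is collapsed into a single real linear layer here, and all shapes are illustrative; one could also warm-start the weights from the hand-designed matrices.

```python
import torch
import torch.nn as nn

class TrainableMFCC(nn.Module):
    """Linear -> abs -> Linear -> log -> Linear, mirroring DFT/mel/DCT."""
    def __init__(self, frame_len=2048, n_bins=1025, n_mels=60, n_mfcc=13):
        super().__init__()
        self.dft_like = nn.Linear(frame_len, n_bins, bias=False)  # ~DFT
        self.mel_like = nn.Linear(n_bins, n_mels, bias=False)     # ~mel filterbank
        self.dct_like = nn.Linear(n_mels, n_mfcc, bias=False)     # ~DCT

    def forward(self, frame):                    # frame: (batch, frame_len)
        spec = torch.abs(self.dft_like(frame))   # non-linear: magnitude
        log_mel = torch.log(self.mel_like(spec).clamp(min=1e-8))  # non-linear: log
        return self.dct_like(log_mel)            # every layer is differentiable

features = TrainableMFCC()(torch.randn(4, 2048))  # (4, 13) learned "MFCC"
```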

20. Example: Chroma
● Musical notes are denoted by a pitch class and an octave number
○ Pitch class: C, C#, D, D#, E, F, F#, G, G#, A, A#, B
○ Octave number: 0, 1, 2, 3, 4, 5, ...
○ Examples: C4 (middle C), E3, G5
● The octave is the most consonant pitch interval
○ Therefore, notes an octave apart belong to the same pitch class
● This can be represented with the “pitch helix”
○ Chroma: the inherent circularity of pitch organization
○ Height: increases naturally, gaining one octave per rotation
[Figure: Pitch Helix and Chroma (Shepard, 2001)]
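As a tiny worked example of this notation, assuming the MIDI convention that note 60 = C4:

```python
PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def note_name(midi_number: int) -> str:
    """Split a MIDI note number into pitch class and octave (60 -> 'C4')."""
    pitch_class = PITCH_CLASSES[midi_number % 12]  # chroma: note mod 12
    octave = midi_number // 12 - 1                 # height: one octave per 12 notes
    return f"{pitch_class}{octave}"

print(note_name(60), note_name(52), note_name(79))  # C4 E3 G5
```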

21. Example: Chroma
● Compute the energy distribution of an audio frame over the 12 pitch classes
○ Convert each frequency f to a MIDI note number, p = 12·log2(f/440) + 69, and take the pitch class from the note number (e.g., 69 → A4 → A)
○ Extracts harmonic characteristics while removing timbre information
○ Useful in music synchronization, chord recognition, music structure analysis, and music genre classification
● Computation steps
○ Project the DFT or constant-Q transform onto the 12 pitch classes
[Figure: DFT or Constant-Q Transform → abs (magnitude) → Chroma Mapping → Chroma]
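A naive sketch of this mapping over DFT bins; librosa.feature.chroma_stft is the standard implementation, and this hand-rolled version only illustrates the frequency-to-pitch-class conversion:

```python
import numpy as np

def chroma_from_frame(frame, sr=22050, n_fft=2048):
    """Naive chroma: sum DFT-bin energy into 12 pitch-class buckets."""
    magnitude = np.abs(np.fft.rfft(frame, n=n_fft))
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    chroma = np.zeros(12)
    for f, mag in zip(freqs[1:], magnitude[1:]):  # skip the DC bin
        midi = 12 * np.log2(f / 440.0) + 69       # frequency -> MIDI note number
        chroma[int(round(midi)) % 12] += mag**2   # pitch class = note mod 12
    return chroma / (chroma.max() + 1e-10)
```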

22. Example: Chroma
[Figure: spectrogram → chroma mapping → chromagram; the chroma is also rendered back as Shepard tones (reconstructed chroma)]

23. Representation Learning Point of View: Chroma
● We can replace the hand-designed modules with trainable modules
○ The DFT (or constant-Q transform) and the chroma mapping are each a linear transform
○ Abs corresponds to a non-linear function
○ In chroma the linear transforms are designed by hand, but they can be optimized further as trainable modules
[Figure: DFT or Constant-Q Transform → abs (magnitude) → Chroma Mapping → Chroma, annotated as Linear Transform → Non-linear function → Linear Transform, i.e., a deep neural network]
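One concrete version of this idea, sketched below under illustrative settings: warm-start a trainable linear layer from librosa's hand-designed chroma filterbank and let gradient descent optimize it further along with the rest of the network.

```python
import torch
import torch.nn as nn
import librosa

n_fft = 2048
chroma_fb = librosa.filters.chroma(sr=22050, n_fft=n_fft)  # (12, 1 + n_fft//2)

# Trainable chroma mapping, initialized from the hand-designed transform
chroma_layer = nn.Linear(chroma_fb.shape[1], 12, bias=False)
chroma_layer.weight.data = torch.from_numpy(chroma_fb).float()

magnitude = torch.rand(4, 1 + n_fft // 2)  # dummy magnitude spectra
chroma = chroma_layer(magnitude)           # (4, 12), differentiable end to end
```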

24. Summary
● Introduced machine learning from the perspective of representation learning (or feature learning)
● In the traditional machine learning approach, we design the feature representations by hand. Once the features are extracted, we use standard machine learning algorithms.
● In the deep learning approach, we design the network architecture by hand. The feature representations are learned through the neural network modules and the optimization.
