GCT634/AI613: Musical Applications of Machine Learning (Fall 2020)
Traditional Machine Learning: Unsupervised Learning
Juhan Nam

Traditional Machine Learning Pipeline in Classification Tasks
- A set of hand-designed audio features is selected for a given task and the features are concatenated
○ The majority of them are extracted at the frame level: MFCC, chroma, spectral statistics
○ The concatenated features are complementary to each other
(Diagram: MFCC, spectral statistics, chroma, … concatenated and fed to a classifier that outputs "Class #1", "Class #2", "Class #3", …)
Issues: Redundancy and Dimensionality
- The information in the concatenated feature vectors can be repetitive
- Adding more features increases the dimensionality of the feature vectors, making classification more demanding
Issues: Temporal Summarization
- Taking all the frames as a single vector is too much for classifiers
○ 10–100 frames per second is typical in frame-level processing
- Temporal order is important, so taking multiple features (that capture local temporal order) is acceptable
○ MFCC: concatenated with its frame-wise differences (delta and double-delta)
- However, extracting the long-term temporal dependency is hard!
○ Averaging is OK but too simple
○ DCT over time for each feature dimension is an option
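The two summarization options above can be sketched in a few lines of NumPy. This is a toy example (a hypothetical 20×200 MFCC-like feature matrix, not from the lecture): it averages over time and also keeps a few DCT-II coefficients per feature dimension as a coarse description of the temporal trajectory.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical frame-level features: 20 dimensions x 200 frames (e.g. MFCCs)
frames = rng.normal(size=(20, 200))

# Option 1: averaging over time (simple, discards temporal structure)
mean_summary = frames.mean(axis=1)

# Option 2: DCT-II over time for each feature dimension, keeping only the
# first few coefficients (low-frequency temporal shape)
T, n_coeffs = frames.shape[1], 5
n = np.arange(T)
basis = np.cos(np.pi * (n + 0.5)[None, :] * np.arange(n_coeffs)[:, None] / T)
dct_summary = frames @ basis.T          # shape: (20, n_coeffs)

# Fixed-size representation regardless of the number of frames
summary = np.concatenate([mean_summary, dct_summary.ravel()])
```

Either way, a variable-length frame sequence becomes a single fixed-size vector that a classifier can consume.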
Unsupervised Learning
- Principal Component Analysis (PCA)
○ Learn a linear transform so that the transformed features are de-correlated
○ Dimensionality reduction: 2D for visualization
- K-means
○ Learn K cluster centers and determine the membership
○ Move each data point to a fixed set of learned vectors (cluster centers): vector quantization and one-hot sparse feature representation
- Gaussian Mixture Models (GMM)
○ Learn K Gaussian distribution parameters and the soft membership
○ Density estimation (likelihood estimation): can be used for classification when estimated for each class
Principal Component Analysis
- Correlation and Redundancy
○ We can measure the redundancy between two elements in a feature vector by computing their correlation
○ If some of the elements have high correlations, we can remove the redundant elements
Pearson correlation coefficient between dimensions $i$ and $j$ of a feature vector $y = [y_1, y_2, \ldots, y_D]^T$:

$$r_{ij} = \frac{\sum_n (y_i^{(n)} - \bar{y}_i)(y_j^{(n)} - \bar{y}_j)}{\sqrt{\sum_n (y_i^{(n)} - \bar{y}_i)^2}\,\sqrt{\sum_n (y_j^{(n)} - \bar{y}_j)^2}}$$
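A quick numeric check of the Pearson correlation on toy data (not from the lecture): `y2` is constructed to be mostly a copy of `y1`, so the two dimensions are redundant and their correlation is close to 1.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two feature dimensions observed over 500 frames; y2 is largely a scaled
# copy of y1, i.e. the two dimensions are redundant
y1 = rng.normal(size=500)
y2 = 0.9 * y1 + 0.1 * rng.normal(size=500)

# Pearson correlation coefficient, following the formula above
r = np.sum((y1 - y1.mean()) * (y2 - y2.mean())) / (
    np.sqrt(np.sum((y1 - y1.mean()) ** 2)) * np.sqrt(np.sum((y2 - y2.mean()) ** 2))
)

# np.corrcoef computes the same quantity
assert np.isclose(r, np.corrcoef(y1, y2)[0, 1])
```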
Principal Component Analysis
- Transform the input space ($Y$) into a latent space ($a$) such that the latent space is de-correlated (i.e., each dimension is orthogonal to each other)
○ Linear transform designed to maximize the variance of the first principal component and minimize the variance of the last principal component
(Diagram: $Y$ mapped to $a$ by a set of orthogonal vectors, the principal components)
$$a = XY, \qquad aa^T = D = \begin{pmatrix} \mu_1 & & \\ & \ddots & \\ & & \mu_D \end{pmatrix}$$

The diagonal elements correspond to the variances of the transformed data points on each dimension.
Principal Component Analysis: Eigenvalue Decomposition
- To derive $X$:

$$aa^T = D \;\Rightarrow\; (XY)(XY)^T = D \;\Rightarrow\; XYY^TX^T = D \;\Rightarrow\; X\,\mathrm{Cov}(Y)\,X^T = D$$

- Eigenvalue decomposition of $B = \mathrm{Cov}(Y)$ ($R$: eigenvectors, $\Lambda$: eigenvalue matrix):

$$BR = R\Lambda \;\Rightarrow\; R^{-1}BR = \Lambda \;\Rightarrow\; R^TBR = \Lambda \quad (\text{if } B \text{ is symmetric, } R^{-1} = R^T)$$

- $X$ is obtained from the eigenvectors of $\mathrm{Cov}(Y)$:

$$X = R^T, \qquad By_k = \mu_k y_k, \qquad R = [\,y_1\, y_2 \cdots y_D\,], \qquad \Lambda = \mathrm{diag}(\mu_k)$$
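The derivation can be verified numerically with NumPy's `eigh` on a toy 3-D dataset (not from the lecture): after applying $X = R^T$, the covariance of the latent representation is diagonal with the eigenvalues on its diagonal.

```python
import numpy as np

rng = np.random.default_rng(0)
# Y: D x N data matrix (columns are data points), zero-mean
Y = rng.normal(size=(3, 1000))
Y = Y - Y.mean(axis=1, keepdims=True)

B = np.cov(Y)                      # Cov(Y), a symmetric D x D matrix
mu, R = np.linalg.eigh(B)          # eigenvalues (ascending) and eigenvectors
mu, R = mu[::-1], R[:, ::-1]       # sort descending, as in PCA

X = R.T                            # the PCA transform: X = R^T
a = X @ Y                          # latent representation

# The latent dimensions are de-correlated: Cov(a) is (numerically) diagonal
Cov_a = np.cov(a)
off_diag = Cov_a - np.diag(np.diag(Cov_a))
assert np.allclose(off_diag, 0, atol=1e-10)
# Its diagonal matches the eigenvalues of Cov(Y)
assert np.allclose(np.diag(Cov_a), mu)
```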
Principal Component Analysis: Eigenvalue Decomposition
- In addition, we can normalize the latent space
- $\tilde{X}$ contains a set of orthonormal vectors

$$\tilde{X}\,\mathrm{Cov}(Y)\,\tilde{X}^T = I, \qquad \Lambda^{-1/2}R^TBR\,\Lambda^{-1/2} = I$$

$$\tilde{\Lambda} = \Lambda^{-1/2} = \mathrm{diag}\!\left(\frac{1}{\sqrt{\mu_1}}, \ldots, \frac{1}{\sqrt{\mu_D}}\right)$$

$$\tilde{a} = \tilde{\Lambda}a, \qquad \tilde{X} = \Lambda^{-1/2}R^T = \Lambda^{-1/2}X = \tilde{\Lambda}X$$
Principal Component Analysis In Practice
- In practice, 𝑌 is a huge matrix where each column is a data point
○ Computing the covariance matrix is a bottleneck and so we often sample the input data
$$\mathrm{Cov}(Y) = YY^T$$
Principal Component Analysis In Practice
- Shift the distribution to have zero mean
- The normalization is optional: called PCA whitening
Shifting: $\bar{Y} = Y - \mathrm{mean}(Y)$ → Rotation: $X\bar{Y}$ → Normalization (scaling): $\tilde{\Lambda}X\bar{Y}$
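The shift–rotate–scale pipeline above is PCA whitening, and it is a few lines of NumPy. A minimal sketch on toy correlated 2-D data (not from the lecture): after whitening, the sample covariance is the identity.

```python
import numpy as np

rng = np.random.default_rng(0)
# Correlated 2D data; columns are data points
Y = np.array([[2.0, 0.0], [1.5, 0.5]]) @ rng.normal(size=(2, 2000))

# Shifting: zero mean per dimension
Yc = Y - Y.mean(axis=1, keepdims=True)

# Rotation: project onto the eigenvectors of the covariance
mu, R = np.linalg.eigh(np.cov(Yc))
a = R.T @ Yc

# Normalization (scaling): divide each dimension by its standard deviation
a_white = a / np.sqrt(mu)[:, None]

# After whitening, the covariance is (numerically) the identity
assert np.allclose(np.cov(a_white), np.eye(2), atol=1e-8)
```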
Dimensionality Reduction Using PCA
- We can remove principal components with small variances
○ Sort the variances in the latent space (the eigenvalues) in descending order and remove the tail
○ A common strategy is to accumulate the variances from the first principal component; when the sum reaches 90% or 95% of the total variance, remove the remaining dimensions. This significantly reduces the dimensionality.
- Note that you can reconstruct the original data with some loss
○ You can use PCA as a data compression method
(Plot: sorted variances; the leading components accumulate 95% of the total variance)
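The cumulative-variance strategy can be sketched directly from the eigenvalues. Toy 10-D data (not from the lecture) is built so that most variance lives in a few directions; the 95% rule then keeps only those.

```python
import numpy as np

rng = np.random.default_rng(0)
# 10-D data where most of the variance lives in the first few directions
scales = np.array([5.0, 3.0, 1.0, 0.3, 0.1, 0.05, 0.05, 0.02, 0.02, 0.01])
Y = scales[:, None] * rng.normal(size=(10, 2000))

mu, R = np.linalg.eigh(np.cov(Y))
mu = mu[::-1]                          # variances, descending

# Keep the smallest number of components whose cumulative variance
# reaches 95% of the total
ratio = np.cumsum(mu) / np.sum(mu)
n_keep = int(np.searchsorted(ratio, 0.95) + 1)

assert n_keep < 10   # most dimensions can be dropped
```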
Visualization Using PCA
- Taking the first two or three principal components only for 2D or 3D
visualization
○ A popular feature visualization method, along with t-SNE, for analyzing the latent feature space of trained deep neural networks
source: https://jakevdp.github.io/PythonDataScienceHandbook/05.09-principal-component-analysis.html
K-Means Clustering
- Grouping the data points into K clusters
○ Each point has membership to one of the clusters
○ Each cluster has a cluster center (not necessarily one of the data points)
○ The membership is determined by choosing the nearest cluster center
○ The cluster center is the mean of the data points that belong to the cluster
This is a dilemma!
K-Means: Definition
- The loss function to minimize is defined as:

$$M = \sum_{n=1}^{N}\sum_{l=1}^{K} s_l^{(n)} \left\| y^{(n)} - \nu_l \right\|^2$$

○ Regarded as a problem that learns cluster centers ($\nu_l$) that minimize the loss
○ $s_l^{(n)}$ is the binary indicator of the membership of each data point:

$$s_l^{(n)} = \begin{cases} 1 & \text{if } l = \arg\min_j \left\| y^{(n)} - \nu_j \right\|^2 \\ 0 & \text{otherwise} \end{cases}$$

- Taking the derivative of the loss $M$ w.r.t. the cluster center $\nu_l$:

$$\frac{\partial M}{\partial \nu_l} = \sum_{n=1}^{N} 2\, s_l^{(n)} (y^{(n)} - \nu_l) = 0 \quad\Rightarrow\quad \nu_l = \frac{\sum_{n=1}^{N} s_l^{(n)} y^{(n)}}{\sum_{n=1}^{N} s_l^{(n)}}$$

Again, we need the cluster centers to determine the memberships, but we need the memberships to compute the cluster centers.
Learning Algorithm
- Iterative learning
○ Initialize the cluster centers with random values (a)
○ Compute the memberships of each data point given the cluster centers (b)
○ Update the cluster centers by averaging the data points that belong to them (c)
○ Repeat the two steps above until convergence (d, e, f)
(Figure: iterations (a)–(f) of K-means on 2D data, from the PRML book)
The loss monotonically decreases every iteration
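The iterative algorithm fits in a short NumPy function. A minimal sketch on two toy Gaussian blobs (not lecture data): each iteration alternates the membership step and the update step, and the recorded loss never increases.

```python
import numpy as np

def kmeans(Y, K, n_iter=20, seed=0):
    """Plain K-means. Y: N x D array of data points."""
    rng = np.random.default_rng(seed)
    centers = Y[rng.choice(len(Y), size=K, replace=False)]  # random init
    losses = []
    for _ in range(n_iter):
        # Membership step: assign each point to the nearest center
        d2 = ((Y[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        losses.append(d2[np.arange(len(Y)), labels].sum())
        # Update step: move each center to the mean of its members
        for l in range(K):
            if np.any(labels == l):
                centers[l] = Y[labels == l].mean(axis=0)
    return centers, labels, losses

# Two well-separated blobs; K-means should recover them
rng = np.random.default_rng(1)
Y = np.vstack([rng.normal(0, 0.5, (100, 2)), rng.normal(5, 0.5, (100, 2))])
centers, labels, losses = kmeans(Y, K=2)

# The loss never increases from one iteration to the next
assert all(l2 <= l1 + 1e-9 for l1, l2 in zip(losses, losses[1:]))
```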
Data Compression Using K-means
- Vector Quantization
○ The set of cluster centers is called the "codebook"
○ Encoding maps a sample vector to a single scalar "codebook index" (membership index)
○ The compressed data can be reconstructed using the codebook
○ Example: speech codec (CELP)
■ A component of the speech sound is vector-quantized and the codebook index is transmitted in speech communication
(Diagram: encoding data points $y^{(1)}, y^{(2)}, \ldots$ to codebook indices such as 3, 5, …, and decoding the indices back to code vectors $\nu_3, \nu_5, \ldots$)
Example of a codebook for a 2D Gaussian with 16 code vectors
source:https://wiki.aalto.fi/pages/viewpage.action?pageId=149883153
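Encoding and decoding against a codebook can be sketched as below. The codebook here is a hypothetical set of 16 random code vectors standing in for learned K-means centers.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical learned codebook: K = 16 code vectors in 2D
codebook = rng.normal(size=(16, 2))

# Data to compress
Y = rng.normal(size=(1000, 2))

# Encoding: each vector becomes a single codebook index (nearest code vector)
d2 = ((Y[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
codes = d2.argmin(axis=1)

# Decoding: reconstruct each vector as its code vector (lossy)
Y_hat = codebook[codes]

# 1000 x 2 floats are compressed to 1000 small integers (plus the codebook)
assert codes.shape == (1000,)
assert codes.max() < 16
```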
Codebook-based Feature Summarization
- Compute the histogram of codebook index
○ Represent the codebook index with one-hot vector
■ if K is a large number, it is regarded as a sparse representation of the features
○ Useful for summarizing a long sequence of frame-level features
■ Often called “a bag of features” (computer vision) or a bag of words (NLP)
(Diagram: frames $y^{(1)}, y^{(2)}, \ldots$ encoded as one-hot vectors and summarized into a K-dimensional histogram)
source:https://towardsdatascience.com/bag-of-visual-words-in-a-nutshell-9ceea97ce0fb
a bag of features
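The bag-of-features summarization is just a histogram over codebook indices. A minimal sketch with a hypothetical 8-entry codebook and a 500-frame toy sequence (not lecture data):

```python
import numpy as np

rng = np.random.default_rng(0)
K = 8
codebook = rng.normal(size=(K, 2))          # hypothetical learned codebook

# A long sequence of frame-level feature vectors
frames = rng.normal(size=(500, 2))

# Encoding: nearest code vector per frame (the one-hot index)
d2 = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
codes = d2.argmin(axis=1)

# Summarization: histogram of codebook indices ("bag of features"),
# one K-dimensional vector regardless of the sequence length
bag = np.bincount(codes, minlength=K).astype(float)
bag /= bag.sum()                            # normalize to a distribution

assert np.isclose(bag.sum(), 1.0)
```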
Gaussian Mixture Model (GMM)
- Fit a set of multivariate Gaussian distribution to data
○ Similar to K-means clustering, but it learns not only the cluster centers (means) but also the covariances of the clusters
○ The membership is a soft assignment as a multinomial distribution
■ The multinomial distribution is regarded as a mapping onto a latent space
(Figure: hard assignments in K-means vs. soft assignments in a GMM)
Gaussian Mixture Model (GMM)
- Replace the hard assignment with a multinomial distribution
- Replace a single cluster with a multivariate Gaussian distribution
○ Mean and covariance
$$s_l \in \{0,1\} \; \text{(hard assignment)} \;\longrightarrow\; \rho_l = q(z_l \mid y), \quad \sum_{l=1}^{K}\rho_l = 1 \; \text{(soft assignment)}$$

$$e(y, \nu_l) = \left\|y - \nu_l\right\|^2 \;\longrightarrow\; q(y \mid z_l) = \frac{1}{(2\pi)^{D/2}\,|\Sigma_l|^{1/2}} \exp\!\left(-\frac{1}{2}(y-\nu_l)^T\Sigma_l^{-1}(y-\nu_l)\right)$$

(Figure: multinomial distribution over clusters 1, 2, …, K)
Gaussian Mixture Model (GMM)
- The likelihood of a data point can be computed as a mixture of
Gaussians
- Fit this model to data by maximum likelihood estimation
○ Equivalent to minimizing the negative log-likelihood (this is the loss function)
○ This model fitting is called density estimation
- GMM is also called a latent model
○ z is a latent variable: regarded as hidden causes of the data distribution
$$q(y) = \sum_{z} q(y, z) = \sum_{z} q(z)\,q(y \mid z) = \sum_{l=1}^{K} \rho_l\,\mathcal{N}(y \mid \nu_l, \Sigma_l)$$
Learning Algorithm: K-Means
- Iterative learning
○ Initialize the cluster centers with random values (a)
○ Compute the memberships of each data point given the cluster centers (b)
○ Update the cluster centers by averaging the data points that belong to them (c)
○ Repeat the two steps above until convergence (d, e, f)
Learning Algorithm: GMM
- Iterative learning (analogous to K-means)
○ Initialize the Gaussian distribution parameters with random values
○ Expectation (E step): compute the soft membership of each data point given the Gaussian distributions
○ Maximization (M step): update the Gaussian distribution parameters by maximizing the likelihood given the memberships
○ Repeat the two steps above until convergence
Learning Algorithm
- Initialize the parameters
- E-step
○ Evaluate the “soft” membership of samples given the Gaussian distributions
- M-step
○ Update the parameters that maximize the log-likelihood
$$\delta_l^{(n)} = \frac{\rho_l\,\mathcal{N}(y^{(n)} \mid \nu_l, \Sigma_l)}{\sum_{j}\rho_j\,\mathcal{N}(y^{(n)} \mid \nu_j, \Sigma_j)}, \qquad \theta \in \{\rho_l, \nu_l, \Sigma_l\}$$

$$N_l = \sum_{n}\delta_l^{(n)} \quad \text{(effective number of members)}$$

$$\nu_l = \frac{1}{N_l}\sum_{n}\delta_l^{(n)}y^{(n)}, \qquad \Sigma_l = \frac{1}{N_l}\sum_{n}\delta_l^{(n)}(y^{(n)} - \nu_l)(y^{(n)} - \nu_l)^T \quad \text{(mean and covariance of each cluster)}$$

$$\rho_l = \frac{N_l}{N} \quad \text{(multinomial distribution)}$$
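The E and M steps above can be sketched as a short NumPy function. This is a toy implementation, not an official reference: it adds a small diagonal term to the covariances for numerical stability, an assumption not in the slides, and is run on two synthetic blobs.

```python
import numpy as np

def gmm_em(Y, K, n_iter=50, seed=0):
    """EM for a GMM with full covariances. Y: N x D data array."""
    rng = np.random.default_rng(seed)
    N, D = Y.shape
    rho = np.full(K, 1.0 / K)                         # mixture weights
    nu = Y[rng.choice(N, K, replace=False)].copy()    # means (random init)
    Sigma = np.stack([np.eye(D)] * K)                 # covariances
    for _ in range(n_iter):
        # E-step: log of rho_l * N(y | nu_l, Sigma_l), then normalize
        logp = np.empty((N, K))
        for l in range(K):
            diff = Y - nu[l]
            inv = np.linalg.inv(Sigma[l])
            _, logdet = np.linalg.slogdet(Sigma[l])
            quad = np.einsum("nd,de,ne->n", diff, inv, diff)
            logp[:, l] = np.log(rho[l]) - 0.5 * (D * np.log(2 * np.pi) + logdet + quad)
        logp -= logp.max(axis=1, keepdims=True)       # numerical stability
        delta = np.exp(logp)
        delta /= delta.sum(axis=1, keepdims=True)     # responsibilities
        # M-step: update rho, nu, Sigma from the responsibilities
        Nl = delta.sum(axis=0)
        rho = Nl / N
        nu = (delta.T @ Y) / Nl[:, None]
        for l in range(K):
            diff = Y - nu[l]
            Sigma[l] = (delta[:, l, None] * diff).T @ diff / Nl[l]
            Sigma[l] += 1e-6 * np.eye(D)              # avoid singular covariances
    return rho, nu, Sigma, delta

# Two toy Gaussian blobs
rng = np.random.default_rng(1)
Y = np.vstack([rng.normal(0, 0.5, (200, 2)), rng.normal(4, 0.5, (200, 2))])
rho, nu, Sigma, delta = gmm_em(Y, K=2)

assert np.isclose(rho.sum(), 1.0)                 # multinomial distribution
assert np.allclose(delta.sum(axis=1), 1.0)        # soft memberships sum to 1
```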
Classification Using GMM
- Training: fit one GMM model to each class of data
- Test: use Bayes’ rule for classification
$$q(y \mid z = d_1; \theta_{d_1}), \quad q(y \mid z = d_2; \theta_{d_2}), \quad q(y \mid z = d_3; \theta_{d_3}), \;\ldots$$

$$\hat{z} = \arg\max_{z}\, q(z \mid y = y^{(n)}) = \arg\max_{z}\, \frac{q(y = y^{(n)} \mid z)\,q(z)}{q(y = y^{(n)})} = \arg\max_{z}\, q(y = y^{(n)} \mid z)\,q(z)$$

$q(z)$ is the prior distribution of each class. If you don't have any information on the prior, you can ignore $q(z)$ by assuming that all classes are equally probable.
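A minimal sketch of this classifier on toy 2-class data (not lecture data). To keep it short it fits a single Gaussian per class, i.e. a GMM with K = 1 components; the Bayes-rule decision is the same for any K.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy 2-class data (e.g. features of two genres)
Y1 = rng.normal([0, 0], 0.7, (300, 2))
Y2 = rng.normal([3, 3], 0.7, (300, 2))

# Training: fit one density model per class (here K = 1, a single Gaussian)
def fit_gaussian(Y):
    return Y.mean(axis=0), np.cov(Y.T)

def log_likelihood(Y, nu, Sigma):
    D = Y.shape[1]
    diff = Y - nu
    inv = np.linalg.inv(Sigma)
    _, logdet = np.linalg.slogdet(Sigma)
    quad = np.einsum("nd,de,ne->n", diff, inv, diff)
    return -0.5 * (D * np.log(2 * np.pi) + logdet + quad)

params = [fit_gaussian(Y1), fit_gaussian(Y2)]
prior = np.array([0.5, 0.5])      # equal priors: q(z) could also be dropped

# Test: Bayes' rule -- pick the class maximizing q(y|z) q(z)
Y_test = np.array([[0.1, -0.2], [2.8, 3.1]])
scores = np.stack([log_likelihood(Y_test, nu, S) + np.log(p)
                   for (nu, S), p in zip(params, prior)], axis=1)
pred = scores.argmax(axis=1)

assert pred.tolist() == [0, 1]
```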