

SLIDE 1

Juhan Nam

GCT634/AI613: Musical Applications of Machine Learning (Fall 2020)

Traditional Machine Learning: Unsupervised Learning

SLIDE 2

Traditional Machine Learning Pipeline in Classification Tasks

  • A set of hand-designed audio features is selected for a given task, and the features are concatenated (a sketch follows at the end of this slide)
○ The majority of them are extracted at the frame level: MFCC, chroma, spectral statistics
○ The concatenated features are complementary to each other

[Figure: frame-level features (MFCC, spectral statistics, chroma, …) are concatenated and fed to a classifier that outputs “Class #1”, “Class #2”, “Class #3”, …]
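Below is a minimal sketch of this feature-extraction-and-concatenation step, assuming librosa is available; the file path, sample rate, and exact feature set are placeholders rather than the course's configuration.

```python
import numpy as np
import librosa

# Load an audio file (placeholder path).
y, sr = librosa.load("example.wav", sr=22050)

# Frame-level features: each has shape (n_dims, n_frames).
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)            # 13 x n_frames
chroma = librosa.feature.chroma_stft(y=y, sr=sr)              # 12 x n_frames
centroid = librosa.feature.spectral_centroid(y=y, sr=sr)      #  1 x n_frames
bandwidth = librosa.feature.spectral_bandwidth(y=y, sr=sr)    #  1 x n_frames

# Concatenate the complementary features along the feature dimension.
features = np.concatenate([mfcc, chroma, centroid, bandwidth], axis=0)
print(features.shape)  # (27, n_frames)
```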

SLIDE 3

Issues: Redundancy and Dimensionality

  • The information in the concatenated feature vectors can be repetitive
  • Adding more features increases the dimensionality of the feature vectors, and the classification will become more demanding

[Figure: the same pipeline as before, with a question mark over the concatenated feature vector to highlight its redundancy and high dimensionality]

SLIDE 4

Issues: Temporal Summarization

  • Taking the entire sequence of frames as a single vector is too much for classifiers
○ 10 ~ 100 frames per second is typical in frame-level processing
  • Temporal order is important, so taking multiple features that capture the local temporal order is acceptable (a sketch follows at the end of this slide)
○ MFCC: concatenated with its frame-wise differences (delta and double-delta)
  • However, extracting the long-term temporal dependency is hard!
○ Averaging is OK but too simple
○ DCT over time for each feature dimension is an option

[Figure: the same pipeline, with question marks indicating how the frame-level feature sequence should be summarized over time before classification]
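A minimal sketch of two of these summarization options, assuming a frame-level feature matrix of shape (n_dims, n_frames) such as the one from the earlier sketch; the chosen statistics and the number of DCT coefficients are illustrative.

```python
import numpy as np
import librosa
import scipy.fftpack

def summarize(features, n_dct=10):
    """Summarize a (n_dims, n_frames) feature matrix over time."""
    # Local temporal order: append frame-wise delta and double-delta features.
    delta = librosa.feature.delta(features)
    delta2 = librosa.feature.delta(features, order=2)
    stacked = np.concatenate([features, delta, delta2], axis=0)

    # Simple long-term summary: average (and spread) over time.
    mean_std = np.concatenate([stacked.mean(axis=1), stacked.std(axis=1)])

    # Alternative: keep the first few DCT coefficients over time for each
    # feature dimension, which preserves some slow temporal structure.
    dct = scipy.fftpack.dct(features, axis=1, norm="ortho")[:, :n_dct]
    return mean_std, dct.flatten()

mean_std, dct_summary = summarize(np.random.randn(27, 500))  # placeholder features
```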

SLIDE 5

Unsupervised Learning

  • Principal Component Analysis (PCA)

○ Learn a linear transform so that the transformed features are de-correlated
○ Dimensionality reduction: 2D for visualization

  • K-means

○ Learn K cluster centers and determine the membership of each data point
○ Map each data point to one of a fixed set of learned vectors (the cluster centers): vector quantization and one-hot sparse feature representation

  • Gaussian Mixture Models (GMM)

○ Learn K Gaussian distribution parameters and the soft memberships
○ Density estimation (likelihood estimation): can be used for classification when estimated for each class

SLIDE 6

Principal Component Analysis

  • Correlation and Redundancy

○ We can measure the redundancy between two elements in a feature vector by computing their correlation
○ If some of the elements have high correlations, we can remove the redundant elements

Pearson correlation coefficient between two dimensions $y_j$ and $y_k$ of the feature vector $y = [y_1, y_2, \ldots, y_N]^T$, computed over the data points $i$:

$$r_{jk} = \frac{\sum_{i} \left( y_j^{(i)} - \bar{y}_j \right) \left( y_k^{(i)} - \bar{y}_k \right)}{\sqrt{\sum_{i} \left( y_j^{(i)} - \bar{y}_j \right)^2}\ \sqrt{\sum_{i} \left( y_k^{(i)} - \bar{y}_k \right)^2}}$$
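As a quick illustration (not from the slides), the correlation between feature dimensions can be computed directly from the feature matrix with numpy; np.corrcoef treats each row as one variable.

```python
import numpy as np

features = np.random.randn(27, 1000)    # placeholder: (n_dims, n_frames)

# Pearson correlation between every pair of feature dimensions.
corr = np.corrcoef(features)             # (n_dims, n_dims)

# Flag strongly correlated (redundant) pairs, ignoring the diagonal.
i, j = np.where(np.triu(np.abs(corr) > 0.9, k=1))
print(list(zip(i.tolist(), j.tolist())))
```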

SLIDE 7

Principal Component Analysis

  • Transform the input space (𝑌) into a latent space (𝑎) such that the latent space is de-correlated (i.e., the dimensions are orthogonal to each other)
○ The linear transform is designed to maximize the variance of the first principal component and minimize the variance of the last principal component

[Figure: a scatter plot of the data in the input space 𝑌 with the orthogonal vectors (principal components) of the latent space 𝑎 overlaid]

SLIDE 8

Principal Component Analysis

  • Transform the input space (𝑌) into a latent space (𝑎) such that the latent space is de-correlated (i.e., the dimensions are orthogonal to each other)
○ The linear transform is designed to maximize the variance of the first principal component and minimize the variance of the last principal component

$$a = X Y, \qquad a a^{T} = D = \begin{bmatrix} \mu_1 & & & \\ & \mu_2 & & \\ & & \ddots & \\ & & & \mu_N \end{bmatrix}$$

The diagonal elements of $D$ correspond to the variances of the transformed data points on each dimension.
SLIDE 9

Principal Component Analysis: Eigenvalue Decomposition

  • To derive 𝑋, use the eigenvalue decomposition of Cov(𝑌) (𝑅: eigenvector matrix, Λ: eigenvalue matrix)
  • 𝑋 is obtained from the eigenvectors of Cov(𝑌)

$$aa^{T} = D$$
$$(XY)(XY)^{T} = D$$
$$X Y Y^{T} X^{T} = D$$
$$X\, \mathrm{Cov}(Y)\, X^{T} = D$$

With $B = \mathrm{Cov}(Y)$, the eigenvalue decomposition $BR = R\Lambda$ gives $R^{T} B R = \Lambda$ (if $B$ is symmetric, $R^{-1} = R^{T}$), so $R^{-1} B R = \Lambda$.

Therefore the desired transform is
$$X = R^{T}, \qquad B r_k = \mu_k r_k, \qquad R = [\, r_1\ r_2\ \cdots\ r_N \,], \qquad \Lambda = \mathrm{diag}(\mu_k)$$

where the $r_k$ are the eigenvectors (the columns of $R$) and the $\mu_k$ are the eigenvalues.
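A minimal numpy sketch of this derivation, assuming the data matrix 𝑌 has one column per data point and has already been shifted to zero mean (see the "In Practice" slides below); the data here is a placeholder.

```python
import numpy as np

Y = np.random.randn(4, 1000)                 # placeholder: 4-dim features, 1000 points
Y = Y - Y.mean(axis=1, keepdims=True)        # zero mean

B = np.cov(Y)                                # B = Cov(Y), shape (4, 4)
mu, R = np.linalg.eigh(B)                    # eigenvalues mu, eigenvectors as columns of R

# Sort by descending eigenvalue so the first principal component has the largest variance.
order = np.argsort(mu)[::-1]
mu, R = mu[order], R[:, order]

X = R.T                                      # the PCA rotation: X = R^T
a = X @ Y                                    # de-correlated latent representation
print(np.round(np.cov(a), 3))                # approximately diag(mu)
```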

SLIDE 10

Principal Component Analysis: Eigenvalue Decomposition

  • In addition, we can normalize the latent space so that each dimension has unit variance
  • $\tilde{X}$ contains a set of orthogonal vectors, rescaled so that the transformed dimensions have unit variance

$$\tilde{X}\, \mathrm{Cov}(Y)\, \tilde{X}^{T} = I, \qquad \Lambda^{-1/2} R^{T} B R\, \Lambda^{-1/2} = I$$

$$\tilde{\Lambda} = \Lambda^{-1/2} = \begin{bmatrix} \frac{1}{\sqrt{\mu_1}} & & & \\ & \frac{1}{\sqrt{\mu_2}} & & \\ & & \ddots & \\ & & & \frac{1}{\sqrt{\mu_N}} \end{bmatrix}$$

$$\tilde{a} = \tilde{\Lambda}\, a, \qquad \tilde{X} = \Lambda^{-1/2} R^{T} = \Lambda^{-1/2} X = \tilde{\Lambda} X \qquad (\text{using } R^{T} B R = \Lambda)$$
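A minimal sketch of the extra whitening step, repeating the setup from the previous sketch so that it runs on its own; the small epsilon guarding against tiny eigenvalues is an assumption, not from the slides.

```python
import numpy as np

Y = np.random.randn(4, 1000)
Y = Y - Y.mean(axis=1, keepdims=True)
mu, R = np.linalg.eigh(np.cov(Y))            # eigenvalues and eigenvectors of Cov(Y)

# PCA whitening: rotate with R^T, then scale each dimension by 1/sqrt(mu_k).
X_tilde = np.diag(1.0 / np.sqrt(mu + 1e-8)) @ R.T
a_tilde = X_tilde @ Y
print(np.round(np.cov(a_tilde), 3))          # approximately the identity matrix
```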

SLIDE 11

Principal Component Analysis In Practice

  • In practice, 𝑌 is a huge matrix where each column is a data point

○ Computing the covariance matrix is a bottleneck and so we often sample the input data

$$\mathrm{Cov}(Y) = Y\, Y^{T}$$

[Figure: the wide data matrix 𝑌 (one column per data point) multiplied by its transpose to form the covariance matrix]

SLIDE 12

Principal Component Analysis In Practice

  • Shift the distribution to have zero mean
  • The normalization (scaling) step is optional; when it is included, the procedure is called PCA whitening

[Figure: scatter plots of the data after each step]
Shifting: $\tilde{Y} = Y - \mathrm{mean}(Y)$
Rotation: $X\tilde{Y}$
Normalization (scaling): $\tilde{\Lambda} X \tilde{Y}$

SLIDE 13

Dimensionality Reduction Using PCA

  • We can remove principal components with small variances

○ Sort the variances in the latent space (the eigenvalues) in descending order and remove the tail
○ One strategy is to accumulate the variances from the first principal component; when the running sum reaches 90% or 95% of the sum of all variances, remove the remaining dimensions. This significantly reduces the dimensionality (a numpy sketch follows at the end of this slide).

  • Note that you can reconstruct the original data with some loss

○ You can use PCA as a data compression method

[Figure: the sorted variances (eigenvalues), with a cutoff where the cumulative variance reaches 95%]
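A minimal numpy sketch of this cumulative-variance rule; the 95% threshold matches the slide, while the data and dimensionality are placeholders.

```python
import numpy as np

Y = np.random.randn(20, 1000)                 # placeholder: 20-dim features
Y = Y - Y.mean(axis=1, keepdims=True)

mu, R = np.linalg.eigh(np.cov(Y))
order = np.argsort(mu)[::-1]                  # variances in descending order
mu, R = mu[order], R[:, order]

# Keep the smallest number of components whose variances reach 95% of the total.
ratio = np.cumsum(mu) / np.sum(mu)
k = int(np.searchsorted(ratio, 0.95)) + 1
print(f"keep {k} of {len(mu)} dimensions")

a = R[:, :k].T @ Y                            # reduced representation (k, n_points)
Y_hat = R[:, :k] @ a                          # lossy reconstruction in the original space
```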

SLIDE 14

Visualization Using PCA

  • Take only the first two or three principal components for 2D or 3D visualization
○ A popular feature visualization method, along with t-SNE, for analyzing the latent feature space of trained deep neural networks

source:https://jakevdp.github.io/PythonDataScienceHandbook/05.09-principal-component-analysis.html
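A minimal sketch of a 2D PCA projection in the style of the linked handbook example, assuming scikit-learn and matplotlib are available; the digits dataset is only a stand-in for audio features.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

digits = load_digits()                                  # 64-dimensional feature vectors
proj = PCA(n_components=2).fit_transform(digits.data)   # project onto the first two PCs

plt.scatter(proj[:, 0], proj[:, 1], c=digits.target, cmap="tab10", s=5)
plt.xlabel("principal component 1")
plt.ylabel("principal component 2")
plt.colorbar(label="class")
plt.show()
```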

SLIDE 15

K-Means Clustering

  • Grouping the data points into K clusters
○ Each point has a membership to one of the clusters
○ Each cluster has a cluster center (not necessarily one of the data points)
○ The membership is determined by choosing the nearest cluster center
○ The cluster center is the mean of the data points that belong to the cluster

This is a dilemma: the memberships and the cluster centers each depend on the other!

SLIDE 16

K-Means: Definition

  • The loss function to minimize is defined as:
○ Regarded as a problem of learning the cluster centers $\nu_l$ that minimize the loss
○ $s_l^{(n)}$ is the binary indicator of the membership of each data point
  • Taking the derivative of the loss $M$ with respect to the cluster center $\nu_l$ and setting it to zero gives the center update

$$M = \sum_{n=1}^{N} \sum_{l=1}^{K} s_l^{(n)} \left\| y^{(n)} - \nu_l \right\|^2$$

$$s_l^{(n)} = \begin{cases} 1 & \text{if } l = \arg\min_{j} \left\| y^{(n)} - \nu_j \right\|^2 \\ 0 & \text{otherwise} \end{cases}$$

$$\frac{\partial M}{\partial \nu_l} = -2 \sum_{n=1}^{N} s_l^{(n)} \left( y^{(n)} - \nu_l \right) = 0 \quad\Rightarrow\quad \nu_l = \frac{\sum_{n=1}^{N} s_l^{(n)} y^{(n)}}{\sum_{n=1}^{N} s_l^{(n)}}$$

Again, we should know the cluster centers (to determine the memberships) before we can compute the cluster centers.

SLIDE 17

Learning Algorithm

  • Iterative learning (a numpy sketch follows after the figure)
○ Initialize the cluster centers with random values (a)
○ Compute the memberships of each data point given the cluster centers (b)
○ Update the cluster centers by averaging the data points that belong to them (c)
○ Repeat the two steps above until convergence (d, e, f)

[Figure: six panels (a)–(f) from the PRML book showing the K-means iterations on 2D data, and a plot of the loss J over the iterations]

The loss monotonically decreases at every iteration.
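A minimal numpy sketch of these alternating steps; the data, K, and the convergence check are placeholders, and a practical implementation would add multiple random restarts.

```python
import numpy as np

def kmeans(Y, K, n_iters=100, seed=0):
    """Y: (n_points, n_dims). Returns the cluster centers and memberships."""
    rng = np.random.default_rng(seed)
    centers = Y[rng.choice(len(Y), K, replace=False)]    # (a) random initialization

    for _ in range(n_iters):
        # (b) membership: index of the nearest cluster center for each point
        dist = np.linalg.norm(Y[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)

        # (c) update: each center becomes the mean of the points assigned to it
        new_centers = np.array([Y[labels == k].mean(axis=0) if np.any(labels == k)
                                else centers[k] for k in range(K)])

        if np.allclose(new_centers, centers):             # (d-f) repeat until convergence
            break
        centers = new_centers
    return centers, labels

centers, labels = kmeans(np.random.randn(500, 2), K=3)
```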

SLIDE 18

Data Compression Using K-means

  • Vector Quantization (a sketch follows after the figure)
○ The set of cluster centers is called the “codebook”
○ Encoding maps a sample vector to a single scalar, the “codebook index” (membership index)
○ The compressed data can be reconstructed using the codebook
○ Example: speech codec (CELP)
■ A component of the speech sound is vector-quantized and the codebook index is transmitted in speech communication

[Figure: encoding maps each input vector to a codebook index (e.g., 3, 5, …); decoding maps each index back to the corresponding code vector. Example of a codebook for a 2D Gaussian with 16 code vectors.]

source: https://wiki.aalto.fi/pages/viewpage.action?pageId=149883153
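A minimal sketch of vector quantization with a learned codebook (for example, the cluster centers from the K-means sketch above); the codebook and data here are placeholders.

```python
import numpy as np

def vq_encode(Y, codebook):
    """Map each row of Y to the index of its nearest code vector."""
    dist = np.linalg.norm(Y[:, None, :] - codebook[None, :, :], axis=2)
    return dist.argmin(axis=1)

def vq_decode(indices, codebook):
    """Reconstruct (with loss) by looking up the code vectors."""
    return codebook[indices]

codebook = np.random.randn(16, 2)     # e.g., 16 code vectors learned by K-means
Y = np.random.randn(100, 2)
idx = vq_encode(Y, codebook)           # 100 scalar codebook indices
Y_hat = vq_decode(idx, codebook)       # lossy reconstruction
```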

SLIDE 19

Codebook-based Feature Summarization

  • Compute the histogram of codebook indices (a sketch follows after the figure)
○ Represent each codebook index with a one-hot vector
■ If K is a large number, this is regarded as a sparse representation of the features
○ Useful for summarizing a long sequence of frame-level features
■ Often called a “bag of features” (computer vision) or a “bag of words” (NLP)

[Figure: each input vector is encoded as a K-dimensional one-hot vector, and the one-hot vectors are summarized into a K-dimensional histogram: a bag of features]

source: https://towardsdatascience.com/bag-of-visual-words-in-a-nutshell-9ceea97ce0fb
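A minimal sketch of this histogram summarization; the codebook size and the frame-level feature sequence are placeholders.

```python
import numpy as np

def bag_of_features(frames, codebook):
    """Summarize a (n_frames, n_dims) feature sequence as a K-dimensional histogram."""
    dist = np.linalg.norm(frames[:, None, :] - codebook[None, :, :], axis=2)
    indices = dist.argmin(axis=1)                        # codebook index per frame
    hist = np.bincount(indices, minlength=len(codebook)).astype(float)
    return hist / hist.sum()                             # normalized histogram

codebook = np.random.randn(64, 13)                       # e.g., 64 code vectors for MFCC frames
frames = np.random.randn(500, 13)                        # a long sequence of frame-level features
print(bag_of_features(frames, codebook).shape)           # (64,)
```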

SLIDE 20

Gaussian Mixture Model (GMM)

  • Fit a set of multivariate Gaussian distributions to the data
○ Similar to K-means clustering, but it learns not only the cluster centers (means) but also the covariance of each cluster
○ The membership is a soft assignment given by a multinomial distribution
■ The multinomial distribution is regarded as a mapping onto a latent space

[Figure: side-by-side comparison of K-means (hard assignments) and GMM (soft assignments) on the same data]

SLIDE 21

Gaussian Mixture Model (GMM)

  • Replace the hard assignment with a multinomial distribution
  • Replace a single cluster with a multivariate Gaussian distribution

○ Mean and covariance

K-means (hard assignment):
$$s_l \in \{0, 1\}, \qquad e(y, \nu_l) = \left\| y - \nu_l \right\|^2$$

GMM (soft assignment):
$$\rho_l = p(z_l \mid y), \qquad \sum_{l=1}^{K} \rho_l = 1$$

Each cluster is a multivariate Gaussian with mean $\nu_l$ and covariance $\Sigma_l$:
$$p(y \mid z_l) = \frac{1}{(2\pi)^{d/2}\, \left| \Sigma_l \right|^{1/2}} \exp\!\left( -\tfrac{1}{2} (y - \nu_l)^{T} \Sigma_l^{-1} (y - \nu_l) \right)$$

[Figure: the soft membership as a multinomial distribution over the clusters 1, 2, 3, 4, …, K]

SLIDE 22

Gaussian Mixture Model (GMM)

  • The likelihood of a data point can be computed as a mixture of Gaussians
  • Fit this model to the data by maximum likelihood estimation
○ Equivalent to minimizing the negative log-likelihood (this is the loss function)
○ This model fitting is called density estimation
  • A GMM is also called a latent variable model
○ z is a latent variable: regarded as the hidden cause of the data distribution

$$p(y) = \sum_{z} p(y, z) = \sum_{z} p(z)\, p(y \mid z) = \sum_{l=1}^{K} \rho_l\, \mathcal{N}(y \mid \nu_l, \Sigma_l)$$
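A minimal sketch of evaluating this mixture likelihood with scipy; the mixture weights, means, and covariances below are placeholders rather than fitted values.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Placeholder GMM parameters: weights rho_l, means nu_l, covariances Sigma_l.
rho = np.array([0.5, 0.3, 0.2])
nu = [np.zeros(2), np.array([3.0, 0.0]), np.array([0.0, 3.0])]
Sigma = [np.eye(2), 0.5 * np.eye(2), 2.0 * np.eye(2)]

def gmm_likelihood(y):
    """p(y) = sum_l rho_l * N(y | nu_l, Sigma_l)."""
    return sum(r * multivariate_normal.pdf(y, mean=m, cov=c)
               for r, m, c in zip(rho, nu, Sigma))

y = np.array([1.0, 1.0])
p = gmm_likelihood(y)
print(p, -np.log(p))     # likelihood and negative log-likelihood (the loss)
```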

SLIDE 23

Learning Algorithm: K-Means

  • Iterative learning
○ Initialize the cluster centers with random values (a)
○ Compute the memberships of each data point given the cluster centers (b)
○ Update the cluster centers by averaging the data points that belong to them (c)
○ Repeat the two steps above until convergence (d, e, f)

SLIDE 24

Learning Algorithm: GMM

  • Iterative learning (same structure as K-means)
○ Initialize the cluster centers with random values (a)
○ Compute the memberships of each data point given the cluster centers (b)
○ Update the cluster centers by averaging the data points that belong to them (c)
○ Repeat the two steps above until convergence (d, e, f)

Expectation (E step): compute the memberships
Maximization (M step): update the clusters by minimizing the loss given the memberships (for GMM, update the Gaussian distribution parameters by maximizing the likelihood given the memberships)

SLIDE 25

Learning Algorithm

  • Initialize the parameters
  • E-step
○ Evaluate the “soft” memberships of the samples given the Gaussian distributions
  • M-step
○ Update the parameters to maximize the log-likelihood (a numpy sketch follows at the end of this slide)

E-step (the “soft” membership, or responsibility, of sample $y^{(n)}$ for cluster $l$):
$$\delta_l^{(n)} = \frac{\rho_l\, \mathcal{N}(y^{(n)} \mid \nu_l, \Sigma_l)}{\sum_{j=1}^{K} \rho_j\, \mathcal{N}(y^{(n)} \mid \nu_j, \Sigma_j)}, \qquad \theta \in \{\rho_l, \nu_l, \Sigma_l\}, \qquad N_l = \sum_{n} \delta_l^{(n)}$$

M-step (update the parameters that maximize the log-likelihood):
$$\nu_l = \frac{1}{N_l} \sum_{n} \delta_l^{(n)}\, y^{(n)}, \qquad \Sigma_l = \frac{1}{N_l} \sum_{n} \delta_l^{(n)} \left( y^{(n)} - \nu_l \right) \left( y^{(n)} - \nu_l \right)^{T}, \qquad \rho_l = \frac{N_l}{N}$$

($N_l$: number of memberships; $\rho_l$: multinomial distribution; $\nu_l$, $\Sigma_l$: Gaussian distribution, the mean and covariance of each cluster)
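A minimal numpy/scipy sketch of one round of these E and M updates; the data, K, and initialization are placeholders, and a practical implementation would iterate until the log-likelihood converges and regularize the covariances.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step(Y, rho, nu, Sigma):
    """One EM iteration. Y: (N, d); rho: (K,); nu: (K, d); Sigma: (K, d, d)."""
    N, K = len(Y), len(rho)

    # E-step: responsibilities delta[n, l] = rho_l * N(y_n | nu_l, Sigma_l), normalized over l.
    delta = np.column_stack([rho[l] * multivariate_normal.pdf(Y, mean=nu[l], cov=Sigma[l])
                             for l in range(K)])
    delta /= delta.sum(axis=1, keepdims=True)

    # M-step: update the mixture weights, means, and covariances.
    N_l = delta.sum(axis=0)                              # effective cluster sizes
    rho = N_l / N
    nu = (delta.T @ Y) / N_l[:, None]
    Sigma = np.stack([(delta[:, l, None] * (Y - nu[l])).T @ (Y - nu[l]) / N_l[l]
                      for l in range(K)])
    return rho, nu, Sigma

# Placeholder data and initialization for K = 3 clusters of 2-D points.
Y = np.random.randn(500, 2)
K = 3
rho = np.full(K, 1.0 / K)
nu = Y[np.random.choice(len(Y), K, replace=False)]
Sigma = np.stack([np.eye(2)] * K)
for _ in range(50):
    rho, nu, Sigma = em_step(Y, rho, nu, Sigma)
```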

SLIDE 26

Classification Using GMM

  • Training: fit one GMM to each class of data
  • Test: use Bayes’ rule for classification (a minimal sketch follows at the end of this slide)

Training: fit one GMM per class,
$$p(y \mid z = d_1;\ \theta_{d_1}), \quad p(y \mid z = d_2;\ \theta_{d_2}), \quad p(y \mid z = d_3;\ \theta_{d_3}), \ \ldots$$

Test: classify a sample $y^{(n)}$ with Bayes’ rule,
$$\hat{z} = \arg\max_{z}\ p(z \mid y = y^{(n)}) = \arg\max_{z}\ \frac{p(y = y^{(n)} \mid z)\, p(z)}{p(y = y^{(n)})} = \arg\max_{z}\ p(y = y^{(n)} \mid z)\, p(z)$$

$p(z)$ is the prior distribution of each class. If you don’t have any information on the prior, you can ignore $p(z)$ by assuming that all classes are equally likely.

[Figure: three GMMs, one fitted to each of the classes $d_1$, $d_2$, $d_3$]
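A minimal sketch of this train/test procedure using scikit-learn's GaussianMixture as the per-class density model (an assumption; the course may implement the GMM directly); the data, number of components, and priors are placeholders.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_gmm_classifier(X_train, y_train, n_components=4):
    """Fit one GMM per class; return {class_label: fitted GMM}."""
    return {c: GaussianMixture(n_components=n_components, covariance_type="full")
                 .fit(X_train[y_train == c])
            for c in np.unique(y_train)}

def predict(models, X_test, log_prior=None):
    """Pick the class whose GMM gives the highest (prior-weighted) log-likelihood."""
    classes = sorted(models)
    scores = np.column_stack([models[c].score_samples(X_test) for c in classes])
    if log_prior is not None:        # optional log p(z); omit to assume equal priors
        scores = scores + log_prior
    return np.array(classes)[scores.argmax(axis=1)]

# Placeholder data: two classes of 13-dimensional feature vectors.
X = np.random.randn(200, 13)
labels = np.repeat([0, 1], 100)
models = train_gmm_classifier(X, labels)
print(predict(models, X[:5]))
```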