

SLIDE 1

GCT634: Musical Applications of Machine Learning

Music Classification Overview and Audio Features

Graduate School of Culture Technology, KAIST Juhan Nam

SLIDE 2

Outline

  • Definition of Music Classification Tasks
  • Overview of Music Classification Systems
  • Audio Features
SLIDE 3

Definition

  • Categorizing input audio into labels
  • Labels can be anything, even including note, chord, or beat notations
  • However, we limit them to semantic words such as genre, mood, instrument, era, and other word-based descriptions

(Diagram: Input → Model → Output)

SLIDE 4

Types of Music Classification Tasks

  • Genre/Mood classification
    • Classify music clips into a category
    • Single-label classification
  • Instrument Identification
    • Can be recast as a classification problem
    • Polyphonic cases: predominant-instrument detection (single-label classification) or multiple-instrument detection (multi-label classification)
  • Music Auto-Tagging
    • Labels can be anything (e.g. genre, mood, instrument, era, vocal quality)
    • Multi-label classification
SLIDE 5

Music Genre

  • Numerous genres and their sub-genres
    • http://research.google.com/bigpicture/music/
    • http://en.wikipedia.org/wiki/List_of_popular_music_genres
  • Evolutionary and influence-based
    • https://frananddavesmusicaladventure.wordpress.com/the-music-tree/
    • http://www.historyshots.com/rockmusic/
    • http://techno.org/electronic-music-guide/
  • Based on cultural context
    • Many cultural communities (or countries with a homogeneous culture) have different genre distributions
    • Unique genres (e.g. trot) and different popularity (e.g. metal)
SLIDE 6

Genre Categories in MIREX

US Pop genre classification: Blues, Jazz, Country/Western, Baroque, Classical, Romantic, Electronica, Hip-Hop, Rock, HardRock/Metal

Latin genre classification: Axe, Bachata, Bolero, Forro, Gaucha, Merengue, Pagode, Salsa, Sertaneja, Tango

K-pop classification: Ballad, Dance, Folk, Hip-hop, R&B, Rock, Trot

http://www.music-ir.org/mirex/wiki/2017:Audio_Classification_(Train/Test)_Tasks

  • MIREX (Music Information Retrieval Evaluation eXchange)
    • Community-based algorithm evaluation framework and events
SLIDE 7

Music Mood

  • Russell’s circumplex model of affect
    • “Arousal-Valence” 2D space

(Figure: Russell’s circumplex model, a dimensional model of affect)

Russell, J. A. (1980). A circumplex model of affect. Journal of Personality and Social Psychology, 39: 1161-1178.

SLIDE 8

Music Mood

  • Mood clustering
    • Using mood labels for songs (from “allmusic.com”)
    • Song-by-mood matrix → mood-by-mood correlation matrix → clustering

(Figure: mood labels for songs and albums clustered into five groups, C1-C5)

Hu, X., & Downie, J. S. (2007). Exploring Mood Metadata: Relationships with Genre, Artist and Usage Metadata.

SLIDE 9

Mood Categories in MIREX

  • The five clusters are used in the MIREX mood classification task

Cluster 1: passionate, rousing, confident, boisterous, rowdy
Cluster 2: rollicking, cheerful, fun, sweet, amiable/good-natured
Cluster 3: literate, poignant, wistful, bittersweet, autumnal, brooding
Cluster 4: humorous, silly, campy, quirky, whimsical, witty, wry
Cluster 5: aggressive, fiery, tense/anxious, intense, volatile, visceral

http://www.music-ir.org/mirex/wiki/2017:Audio_Classification_(Train/Test)_Tasks

SLIDE 10

Overview of Music Classification Systems

(Pipeline: Audio Representations → Feature Extraction → Classifier → “Metal” / “Jazz” / “Classical” (?))

SLIDE 11

Overview of Music Classification Systems


  • Audio representations
    • Low-level representations of audio
    • Preserve most of the information in the input data
    • e.g. waveform, spectrogram, mel-spectrogram


SLIDE 12

Overview of Music Classification Systems


  • Feature extraction
    • Summary of acoustic or musical patterns that explain the characteristics of the audio representations
    • e.g. MFCC, chroma, learning-based feature representations


SLIDE 13

Overview of Music Classification Systems


  • Classifiers
    • Determine the category based on the extracted features
    • A learning algorithm is necessary: e.g. SVM, GMM, NN
    • Training and testing


SLIDE 14

It is important to extract good audio features!

(Figure: with good features, “Metal”, “Jazz”, and “Classical” form separable clusters in the feature space; with bad features they overlap)

SLIDE 15

Let’s listen to examples

  • What is the genre of the music?
  • What is the mood of the music?
  • What are the features of the music that explain your answers?
SLIDE 16

Human Knowledge to Explain Music

  • Acoustic Level
    • Loudness
    • Pitch
    • Timbre
  • Musical Level
    • Instrumentation
    • Rhythm
    • Key and scale
    • Chord and melodic pattern
    • Lyrics, structure, singing style, …
SLIDE 17

Two Approaches in Music Classification

  • Feature engineering
    • Features are designed based on domain knowledge and heuristics
    • Traditional approach: e.g. MFCC+GMM model
  • Feature learning
    • Features are learned using optimization algorithms
    • Recent approach: e.g. deep neural networks

Let’s focus on the feature engineering approach first!

SLIDE 18

Feature Engineering Model

  • Feature extraction is divided into several steps

(Pipeline: Frame-Level Audio Features → Temporal Summarization → Normalization)

(G. Tzanetakis)

SLIDE 19

(Frame-Level) Audio Features

  • Loudness
    • Root-mean-square (RMS) of audio frames
  • Timbre features
    • Zero-crossing rate
    • MFCC (w/ delta or double-delta): spectral envelope
    • Spectral summary: centroid, roll-off, …
  • Pitch/harmony features
    • Chroma
  • Rhythm features (not frame-level)
    • Beat histogram, tempogram
SLIDE 20

Zero-Crossing Rate (ZCR)

  • ZCR is low for harmonic (voiced) sounds and high for noisy (unvoiced) sounds
  • Useful for classifying different drum sounds (e.g. bass, snare, hi-hat)
  • For narrow-band periodic signals, it is related to the F0

(Figure: voiced vs. unvoiced waveform examples)
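As a concrete illustration, the ZCR can be computed in a few lines of NumPy; the 440 Hz tone, sample rate, and frame length below are illustrative choices, not values from the slides:

```python
import numpy as np

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    signs = np.sign(frame)
    return float(np.mean(signs[1:] != signs[:-1]))

# Illustrative signals: a harmonic ("voiced") tone vs. white noise ("unvoiced")
sr, n = 22050, 2048
tone = np.sin(2 * np.pi * 440.0 * np.arange(n) / sr)
noise = np.random.default_rng(0).standard_normal(n)
```

For the tone the rate stays near 2·F0/sr, while for the noise it hovers around 0.5, matching the voiced/unvoiced contrast above.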

SLIDE 21

Spectral Statistics

  • Spectral Centroid: “center of gravity” of the spectrum
    • Associated with the brightness of sounds
  • Spectral Roll-off: the frequency below which 85% (or 95%) of the spectral energy is concentrated

SC(t) = Σ_k f_k |X_t(k)| / Σ_k |X_t(k)|

Roll-off R_t satisfies: Σ_{k=1}^{R_t} |X_t(k)| = 0.85 · Σ_{k=1}^{N} |X_t(k)|
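Both statistics can be sketched directly from a magnitude spectrum; whether magnitude or power is used is a convention choice, and the Hann-windowed test tone below is purely illustrative:

```python
import numpy as np

def spectral_centroid(mag, freqs):
    # "Center of gravity": magnitude-weighted mean frequency
    return np.sum(freqs * mag) / np.sum(mag)

def spectral_rolloff(mag, freqs, ratio=0.85):
    # Lowest frequency below which `ratio` of the spectral magnitude lies
    cum = np.cumsum(mag)
    return freqs[np.searchsorted(cum, ratio * cum[-1])]

# Illustrative frame: a Hann-windowed 440 Hz tone at 22.05 kHz
sr, n = 22050, 2048
frame = np.sin(2 * np.pi * 440.0 * np.arange(n) / sr) * np.hanning(n)
mag = np.abs(np.fft.rfft(frame))
freqs = np.fft.rfftfreq(n, d=1.0 / sr)
```

A pure tone yields a centroid and roll-off near its frequency; brighter sounds push both upward.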

SLIDE 22

Examples of Spectral Centroids

(Figure: spectrograms (0-4.5 s, 0-10 kHz) with spectral centroid trajectories. Classical: “Beethoven String Quartet”; Pop: “Video killed the radio star”)

SLIDE 23

Spectral Statistics

  • Spectral Spread (SS): a measure of the bandwidth of the spectrum
  • Spectral Flatness (SF): a measure of the noisiness of the spectrum
    • The ratio between the geometric and arithmetic means

SS(t) = Σ_k (f_k − SC(t))² |X_t(k)| / Σ_k |X_t(k)|

SF(t) = (Π_{k=1}^{K} |X_t(k)|)^{1/K} / ((1/K) Σ_{k=1}^{K} |X_t(k)|)
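A minimal sketch of both measures, again assuming a magnitude spectrum; the small epsilon only guards the logarithm in the geometric mean:

```python
import numpy as np

def spectral_spread(mag, freqs):
    # Magnitude-weighted variance of frequency around the spectral centroid
    sc = np.sum(freqs * mag) / np.sum(mag)
    return np.sum((freqs - sc) ** 2 * mag) / np.sum(mag)

def spectral_flatness(mag, eps=1e-12):
    # Geometric mean over arithmetic mean: close to 1 for noise, near 0 for tones
    m = mag + eps
    return np.exp(np.mean(np.log(m))) / np.mean(m)
```

A perfectly flat spectrum has flatness 1; a single spectral peak has spread 0 and flatness near 0.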

SLIDE 24

Mel-Frequency Cepstral Coefficient (MFCC)

  • Most popularly used audio feature for timbre feature extraction
  • Extracts the spectral envelope from an audio frame
  • Standard audio feature in speech recognition
  • Introduced to the music domain by Logan in 2000
  • Computation steps: (audio frame) → DFT magnitude → mel-scale frequency mapping → log → DCT
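The computation steps can be sketched end-to-end in NumPy. The 40-filter triangular mel bank, the 512-sample frame, and the uniform √(2/N) DCT scaling are illustrative simplifications; library implementations such as librosa differ in details:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    # Triangular filters with centers equally spaced on the mel scale
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        for k in range(l, c):
            fb[i, k] = (k - l) / max(c - l, 1)   # rising slope
        for k in range(c, r):
            fb[i, k] = (r - k) / max(r - c, 1)   # falling slope
    return fb

def dct_ii(x, n_out):
    # Simplified DCT-II with uniform sqrt(2/N) scaling, as on the slide
    N = len(x)
    n = np.arange(N)
    return np.array([np.sqrt(2.0 / N) * np.sum(x * np.cos(np.pi * k * (n + 0.5) / N))
                     for k in range(n_out)])

def mfcc(frame, sr, n_mels=40, n_coef=13):
    mag = np.abs(np.fft.rfft(frame))                    # DFT magnitude
    mel = mel_filterbank(n_mels, len(frame), sr) @ mag  # mel mapping
    return dct_ii(np.log(mel + 1e-10), n_coef)          # log + DCT
```

Keeping only the first ~13 DCT coefficients retains the smooth spectral envelope and discards fine harmonic structure.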

SLIDE 25

Mel-Frequency Spectrogram

  • Convert the linear frequency scale to the mel scale
  • Usually reduces the dimensionality of the spectrum

(Figure: spectrum vs. mel-scaled spectrum)

SLIDE 26

Discrete Cosine Transform

  • Real-valued transform, similar to the DFT
  • De-correlates the mel-scaled log spectrum and reduces the dimensionality again

X_DCT(k) = √(2/N) · Σ_{n=1}^{N} x(n) cos((πk/N)(n − 0.5))

(Figure: spectrum (mel-scaled) → MFCC)
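As a sanity check on the formula, a direct implementation (function name hypothetical) puts all of a constant input's energy into the 0th coefficient, which is why truncating to the first few coefficients compresses smooth log-mel spectra:

```python
import numpy as np

def dct_slide(x):
    # X_DCT(k) = sqrt(2/N) * sum_{n=1}^{N} x(n) * cos(pi*k/N * (n - 0.5))
    N = len(x)
    n = np.arange(1, N + 1)
    return np.array([np.sqrt(2.0 / N) * np.sum(x * np.cos(np.pi * k * (n - 0.5) / N))
                     for k in range(N)])
```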

SLIDE 27

Reconstructed Frequency Spectrum from MFCC

(Figure panels: frequency spectrum (512 bins), mel-scaled spectrum (60 bins), MFCC (13 dims), reconstructed mel-scaled spectrum, reconstructed frequency spectrum)

SLIDE 28

Comparison of Spectrogram and MFCC

(Figure: spectrogram, mel-frequency spectrogram, MFCC, and spectrogram reconstructed from MFCC)

SLIDE 29

Sound Examples of MFCC

  • Original:
  • MFCC reconstruction (using white noise as a source):
SLIDE 30

Post-processing

  • Adding temporal dynamics
    • Short-term dynamics of features are characterized with delta or double-delta
    • 39 MFCCs in speech recognition: 13 MFCCs + 13 delta + 13 double-delta

Δx(n) = (x(n) − x(n − h)) / h
ΔΔx(n) = (Δx(n) − Δx(n − h)) / h
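A minimal sketch of the delta computation using the backward difference above; stacking the three parts yields the 39-dimensional vector used in speech recognition:

```python
import numpy as np

def add_deltas(feat, h=1):
    """Append delta and double-delta to a (n_frames, n_coef) feature matrix."""
    def delta(x):
        d = np.zeros_like(x)
        d[h:] = (x[h:] - x[:-h]) / h   # Δx(n) = (x(n) − x(n−h)) / h
        return d
    d1 = delta(feat)
    return np.concatenate([feat, d1, delta(d1)], axis=1)
```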

SLIDE 31

Pitch and Chroma

  • The basic assumption in tonal harmony is that octave-distant notes belong to the same pitch class
    • No dissonance among them
    • As a result, there are 12 pitch classes
  • Shepard represented octave equivalence with a “pitch helix”
    • Chroma: represents the inherent circularity of pitch organization
    • Height: increases naturally, one octave per rotation

(Figure: pitch helix and chroma (Shepard, 2001))

SLIDE 32

Pitch and Chroma

  • Chroma is independent of the height
  • Shepard tone: a single pitch class across harmonics
    • Constantly rising or falling

(Figure: optical-illusion stairs and the Shepard tone, https://vimeo.com/34749558)

SLIDE 33

Chroma Audio Features

  • Chroma features are audio feature vectors that contain the relative distribution of pitch classes in the audio
    • Ideally, they could be obtained by polyphonic note transcription
    • In practice, chroma features are obtained by projecting all time-frequency energy onto the 12 pitch classes
  • Mainly used for chord recognition, key estimation, music-audio synchronization, and other “score-level” tasks
    • Often used for music classification as well, but not as effective as MFCC
SLIDE 34

Chroma Features: FFT-based approach

  • Compute the spectrogram and a mapping matrix
    • Convert frequency to the musical pitch scale and get the pitch class
    • Set the entry for the corresponding pitch class to one and all others to zero
    • Adjust the non-zero values so that low-frequency content has more weight
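A minimal FFT-based sketch: each bin is rounded to its nearest pitch class and its magnitude accumulated. The low-frequency weighting step from the slide is omitted for brevity, and A4 = 440 Hz tuning is assumed:

```python
import numpy as np

def chroma_from_spectrum(mag, freqs, fref=440.0):
    # Project FFT-bin magnitudes onto 12 pitch classes (C=0, ..., A=9, B=11)
    chroma = np.zeros(12)
    valid = freqs > 20.0                        # skip DC and sub-audio bins
    midi = 69.0 + 12.0 * np.log2(freqs[valid] / fref)
    pc = np.round(midi).astype(int) % 12        # nearest pitch class per bin
    np.add.at(chroma, pc, mag[valid])
    return chroma / (chroma.sum() + 1e-12)
```

For a 440 Hz tone the A bin dominates, and shifting the tone by an octave leaves the vector essentially unchanged.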

SLIDE 35

Chroma Features: Filter-bank approach

  • A filter bank can be used to get a log-scale time-frequency representation
    • Center frequencies are arranged over the 88 piano notes
    • Bandwidths are set to have constant Q and to be robust to +/- 25-cent detuning
  • The outputs that belong to the same pitch class are wrapped and summed

(Müller, 2011)

SLIDE 36

Sound Examples of Chroma

  • Original:
  • MFCC reconstruction (using white noise as a source):
    • Pitch-invariance
  • Chroma reconstruction:
    • Timbre-invariance
SLIDE 37

Feature Summarization and Normalization

  • Summarization
    • Summary statistics
      • Temporal pooling: mean, variance, min, max over a context window
      • Temporal modulation: DFT of the time trajectory over sub-bands
    • Code-book approach
      • Vector quantization: create a codebook by k-means clustering
      • Accumulate the codebook indices as a histogram
  • Normalization
    • Standardization
      • Zero mean: subtract the mean of each dimension
      • Unit variance: divide each dimension by its standard deviation
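A sketch of the simplest of these options, mean/variance pooling plus standardization; note that unit variance requires dividing by the standard deviation, not the variance:

```python
import numpy as np

def summarize(frames):
    # Temporal pooling over a clip: mean and variance per feature dimension
    return np.concatenate([frames.mean(axis=0), frames.var(axis=0)])

def standardize(X, eps=1e-12):
    # Zero mean / unit variance per dimension (fit on training data in practice)
    return (X - X.mean(axis=0)) / (X.std(axis=0) + eps)
```

In practice, the mean and standard deviation are estimated on the training set and reused at test time.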