GCT634: Musical Applications of Machine Learning
Music Classification: Overview and Audio Features
Graduate School of Culture Technology, KAIST
Juhan Nam
Outlines
- Definition of Music Classification Tasks
- Overview of Music Classification Systems
- Audio Features
Definition
- Categorizing input audio into labels
- Labels can be anything, even note, chord, or beat notations
- However, we limit them to semantic words such as genre, mood, instrument, era, and other word-based descriptions
Types of Music Classification Tasks
- Genre/Mood classification
- Classify music clips into a category
- Single-label classification
- Instrument Identification
- Can be recast as a classification problem
- Polyphonic cases: predominant instrument detection (single-label classification) or multiple-instrument detection (multi-label classification)
- Music Auto-Tagging
- Labels can be anything (e.g. genre, mood, instrument, era, vocal quality)
- Multi-label classification
Music Genre
- Numerous genres and their sub-genres
- http://research.google.com/bigpicture/music/
- http://en.wikipedia.org/wiki/List_of_popular_music_genres
- Evolutionary and influence-based
- https://frananddavesmusicaladventure.wordpress.com/the-music-tree/
- http://www.historyshots.com/rockmusic/
- http://techno.org/electronic-music-guide/
- Based on cultural context
- Many cultural communities (or countries with a homogeneous culture) have different genre distributions
- Unique genres (e.g. trot) and different popularity (e.g. metal)
Genre Categories in MIREX
- US Pop Genre Classification: Blues, Jazz, Country/Western, Baroque, Classical, Romantic, Electronica, Hip-Hop, Rock, Hard Rock/Metal
- Latin Genre Classification: Axe, Bachata, Bolero, Forro, Gaucha, Merengue, Pagode, Salsa, Sertaneja, Tango
- K-pop Classification: Ballad, Dance, Folk, Hip-hop, R&B, Rock, Trot
http://www.music-ir.org/mirex/wiki/2017:Audio_Classification_(Train/Test)_Tasks
- MIREX (Music Information Retrieval Evaluation eXchange)
- Community-based algorithm evaluation framework and events
Music Mood
- Russell’s circumplex model of affect
- “Arousal-Valence” 2D space
(Figure: Russell’s circumplex model, a dimensional model from music psychology)
Russell, J. A. (1980). A circumplex model of affect. Journal of Personality and Social Psychology, 39: 1161-1178.
Music Mood
- Mood clustering
- Using mood labels for songs (from allmusic.com)
- Song-by-mood matrix → mood-by-mood correlation matrix → clustering
(Figure: mood-label matrices for albums and songs, clustered into C1-C5)
Hu, X., & Downie, J. S. (2007). Exploring Mood Metadata: Relationships with Genre, Artist and Usage Metadata. In Proceedings of ISMIR.
Mood Categories in MIREX
- The five clusters are used in the MIREX mood classification task
Mood Classification
- Cluster_1: passionate, rousing, confident, boisterous, rowdy
- Cluster_2: rollicking, cheerful, fun, sweet, amiable/good natured
- Cluster_3: literate, poignant, wistful, bittersweet, autumnal, brooding
- Cluster_4: humorous, silly, campy, quirky, whimsical, witty, wry
- Cluster_5: aggressive, fiery, tense/anxious, intense, volatile, visceral
http://www.music-ir.org/mirex/wiki/2017:Audio_Classification_(Train/Test)_Tasks
Overview of Music Classification Systems
(Pipeline: Audio Representations → Feature Extraction → Classifier → “Metal” / “Jazz” / “Classical” (?))
Overview of Music Classification Systems
- Audio representations
- Low-level representation of audio
- Preserve the majority of information in input data
- e.g. waveform, spectrogram, mel-spectrogram
Overview of Music Classification Systems
- Feature extraction
- Summary of acoustic or musical patterns that explain the characteristics of the audio representations
- e.g. MFCC, chroma, learning-based feature representations
Overview of Music Classification Systems
- Classifiers
- Determine the category based on the extracted features
- A learning algorithm is necessary: e.g. SVM, GMM, NN
- Training and Testing
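As a toy illustration of the training/testing split, here is a minimal nearest-centroid classifier in Python standing in for SVM/GMM/NN; the 2-D feature vectors and class labels are made up for the example.

```python
def train_centroids(features, labels):
    """'Training': compute one mean feature vector (centroid) per class."""
    sums, counts = {}, {}
    for x, y in zip(features, labels):
        s = sums.setdefault(y, [0.0] * len(x))
        for d, v in enumerate(x):
            s[d] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in s] for y, s in sums.items()}

def predict(centroids, x):
    """'Testing': assign x to the class with the nearest centroid."""
    def dist2(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    return min(centroids, key=lambda y: dist2(centroids[y], x))

# Hypothetical 2-D features (e.g. mean ZCR, mean spectral centroid)
train_x = [[0.0, 0.1], [0.1, 0.0], [0.9, 1.0], [1.0, 0.9]]
train_y = ["classical", "classical", "metal", "metal"]
model = train_centroids(train_x, train_y)
```

Real systems would use stronger classifiers, but the train/predict interface is the same.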
It is important to extract good audio features!
(Figure: the same classes in feature space with good features (well separated) vs. bad features (overlapping))
Let’s listen to examples
- What is the genre of the music?
- What is the mood of the music?
- What are the features of the music that explain your answers?
Human Knowledge to Explain Music
- Acoustic Level
- Loudness
- Pitch
- Timbre
- Musical Level
- Instrumentation
- Rhythm
- Key and scale
- Chord and melodic pattern
- Lyrics, structure, singing style, …
Two Approaches in Music Classification
- Feature engineering
- Features are designed based on domain knowledge and heuristics
- Traditional approach: e.g. MFCC+GMM model
- Feature learning
- Features are learned using optimization algorithms
- Recent approach: e.g. deep neural networks
Classifier Feature Extraction Audio Representations
Let’s focus on the feature engineering approach first!
Feature Engineering Model
- Feature extraction is divided into several steps
(Pipeline: Frame-Level Audio Features → Temporal Summarization → Normalization)
(G. Tzanetakis)
(Frame-Level) Audio Features
- Loudness
- Root-Mean-Squares (RMS) of audio frames
- Timbre features
- Zero-crossing rate
- MFCC (w/ delta or double-delta): spectral envelope
- Spectral summary: centroid, roll-off, …
- Pitch/Harmony features
- Chroma
- Rhythm features (this is not frame-level)
- Beat histogram, Tempogram
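The simplest of these, RMS loudness, can be sketched in a few lines of Python (a minimal example on a raw list of samples, without windowing):

```python
import math

def rms(frame):
    """Root-mean-square level of one audio frame (a list of samples)."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))

level = rms([1.0, -1.0, 1.0, -1.0])  # a full-scale square wave: RMS = 1.0
```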
Zero-Crossing Rate (ZCR)
- ZCR is low for harmonic (voiced) sounds and high for noisy
(unvoiced) sounds
- Useful to classify different drum sounds (e.g. bass, snare, high-hat)
- For narrow-band periodic signals, it is related to the F0
(Figure: voiced vs. unvoiced waveform examples)
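A minimal sketch of ZCR in Python; the test signals (a low-frequency sine vs. a sign-alternating "noise-like" stand-in) are assumptions for the demo:

```python
import math

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:])
                    if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

# Harmonic signal: a 100 Hz sine at 8 kHz crosses zero rarely
sine = [math.sin(2 * math.pi * 100 * n / 8000) for n in range(800)]
# Noise-like stand-in: the sign flips every sample, so ZCR is near 1
noisy = [(-1) ** n * 0.5 for n in range(800)]
```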
Spectral Statistics
- Spectral Centroid: “Center of gravity” of the spectrum
- Associated with the brightness of sounds
- Spectral Roll-off: the frequency below which 85% (or 95%) of the spectral energy is concentrated

$SC(t) = \dfrac{\sum_k f_k\, X_t(k)}{\sum_k X_t(k)}$

$\sum_{k=1}^{R_t} X_t(k) = 0.85 \sum_{k=1}^{N} X_t(k)$
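The two definitions above translate directly into Python; the toy spectrum (magnitudes per frequency bin) is made up for the example:

```python
def spectral_centroid(mags, freqs):
    """'Center of gravity' of a magnitude spectrum."""
    return sum(f * m for f, m in zip(freqs, mags)) / sum(mags)

def spectral_rolloff(mags, freqs, ratio=0.85):
    """Frequency below which `ratio` of the spectral energy is concentrated."""
    target = ratio * sum(mags)
    cum = 0.0
    for f, m in zip(freqs, mags):
        cum += m
        if cum >= target:
            return f
    return freqs[-1]

# Toy spectrum with all energy in the 100 Hz bin
freqs = [0.0, 100.0, 200.0, 300.0]
mags = [0.0, 1.0, 0.0, 0.0]
```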
Examples of Spectral Centroids
time [sec] frequency [Hz]
0.5 1 1.5 2 2.5 3 3.5 4 4.5 1000 2000 3000 4000 5000 6000 7000 8000 9000 10000
Classical: “Beethoven String Quartet” Pop: “Video killed the radio star”
time [sec] frequency [Hz]
0.5 1 1.5 2 2.5 3 3.5 4 4.5 1000 2000 3000 4000 5000 6000 7000 8000 9000 10000
Spectral Statistics
- Spectral Spread (SS): a measure of the bandwidth of the spectrum
- Spectral flatness (SF): a measure of the noisiness of the
spectrum
- The ratio between the geometric and arithmetic means
$SS(t) = \sqrt{\dfrac{\sum_k (f_k - SC(t))^2\, X_t(k)}{\sum_k X_t(k)}}$

$SF(t) = \dfrac{\left(\prod_k X_t(k)\right)^{1/K}}{\frac{1}{K}\sum_k X_t(k)}$
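Both measures in Python, again on made-up toy spectra (a flat, noise-like spectrum vs. a single-peak, tonal one):

```python
import math

def spectral_spread(mags, freqs):
    """Energy-weighted standard deviation around the spectral centroid."""
    centroid = sum(f * m for f, m in zip(freqs, mags)) / sum(mags)
    var = sum((f - centroid) ** 2 * m for f, m in zip(freqs, mags)) / sum(mags)
    return math.sqrt(var)

def spectral_flatness(mags):
    """Ratio of geometric to arithmetic mean of the magnitude spectrum."""
    K = len(mags)
    geo = math.exp(sum(math.log(m) for m in mags) / K)
    return geo / (sum(mags) / K)

flat = [1.0, 1.0, 1.0, 1.0]        # noise-like: flatness = 1
peaky = [1e-6, 1.0, 1e-6, 1e-6]    # tonal: flatness near 0
```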
Mel-Frequency Cepstral Coefficient (MFCC)
- Most popularly used audio feature for timbre feature extraction
- Extracts the spectral envelope from an audio frame
- Standard audio feature in speech recognition
- Introduced in music domain by Logan in 2000
- Computation Steps
(audio frame) → DFT → mapping freq. scale to mel → log magnitude → DCT
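The chain above can be sketched end to end in plain Python. This is a toy, not a production MFCC: a naive O(N²) DFT, a crude triangular mel filterbank, and no windowing or liftering; the frame size and sample rate in the demo are arbitrary choices.

```python
import math

def mfcc_frame(frame, sr=8000, n_mels=20, n_mfcc=13):
    """Toy MFCC for one frame: |DFT| -> mel filterbank -> log -> DCT-II."""
    N = len(frame)
    n_bins = N // 2 + 1
    # 1) Magnitude spectrum via a naive DFT
    mags = []
    for k in range(n_bins):
        re = sum(frame[n] * math.cos(2 * math.pi * k * n / N) for n in range(N))
        im = sum(frame[n] * math.sin(2 * math.pi * k * n / N) for n in range(N))
        mags.append(math.hypot(re, im))
    # 2) Triangular filters equally spaced on the mel scale
    def hz2mel(f): return 2595.0 * math.log10(1.0 + f / 700.0)
    def mel2hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_max = hz2mel(sr / 2.0)
    bins = [int(round(mel2hz(mel_max * i / (n_mels + 1)) * N / sr))
            for i in range(n_mels + 2)]
    log_mel = []
    for m in range(1, n_mels + 1):
        lo, c, hi = bins[m - 1], bins[m], bins[m + 1]
        e = 0.0
        for k in range(lo, hi):
            w = (k - lo) / (c - lo) if k < c else (hi - k) / (hi - c)
            if 0 <= k < n_bins:
                e += w * mags[k]
        # 3) Log compression
        log_mel.append(math.log(e + 1e-10))
    # 4) DCT-II de-correlates; keep the first n_mfcc coefficients
    return [sum(log_mel[n] * math.cos(math.pi * k / n_mels * (n + 0.5))
                for n in range(n_mels))
            for k in range(n_mfcc)]

coeffs = mfcc_frame([math.sin(2 * math.pi * 400 * n / 8000) for n in range(256)])
```

In practice one would use an optimized library implementation; the point here is only the order of the four steps.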
Mel-Frequency Spectrogram
- Convert linear frequency to mel scale
- Usually reduce the dimensionality of spectrum
(Figure: linear-frequency spectrum vs. mel-scaled spectrum)
Discrete Cosine Transform
- Real-valued transform: similar to DFT
- De-correlate the mel-scaled log spectrum and reduce the dimensionality
again
(Figure: mel-scaled spectrum → MFCC)

$X_{DCT}(k) = \sqrt{\dfrac{2}{N}} \sum_{n=1}^{N} x(n)\,\cos\!\left(\dfrac{\pi k}{N}\,(n - 0.5)\right)$
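A direct transcription of this formula (with the sum over n = 1…N written 0-indexed):

```python
import math

def dct2(x):
    """DCT-II with the sqrt(2/N) scaling from the slide (0-indexed input)."""
    N = len(x)
    return [math.sqrt(2.0 / N) *
            sum(x[n] * math.cos(math.pi * k / N * (n + 0.5)) for n in range(N))
            for k in range(N)]

# A constant signal has all its energy in the k = 0 coefficient
coeffs = dct2([1.0, 1.0, 1.0, 1.0])
```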
Reconstructed Frequency Spectrum from MFCC
(Figure: frequency spectrum (512 bins) → mel-scaled spectrum (60 bins) → MFCC (13 dim) → reconstructed mel-scaled spectrum → reconstructed frequency spectrum)
Comparison of Spectrogram and MFCC
(Figure: spectrogram, mel-frequency spectrogram, MFCC, and the spectrogram reconstructed from MFCC)
Sound Examples of MFCC
- Original:
- MFCC reconstruction (using white-noise as a source):
Post-processing
- Adding temporal dynamics
- Short-term dynamics of features are characterized with delta or double-delta
- 39 MFCCs in speech recognition: 13 MFCCs + 13 delta + 13 double-delta
$\Delta x(n) = \dfrac{x(n) - x(n-h)}{h} \qquad \Delta\Delta x(n) = \dfrac{\Delta x(n) - \Delta x(n-h)}{h}$
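The delta operator in Python; the coefficient trajectory is made up for the example, and applying `delta` twice gives the double-delta:

```python
def delta(x, h=1):
    """First-order difference (x(n) - x(n - h)) / h, defined for n >= h."""
    return [(x[n] - x[n - h]) / h for n in range(h, len(x))]

mfcc_c0 = [0.0, 1.0, 3.0, 6.0]   # made-up trajectory of one coefficient
d = delta(mfcc_c0)               # [1.0, 2.0, 3.0]
dd = delta(d)                    # [1.0, 1.0]
```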
Pitch and Chroma
- The basic assumption in tonal harmony is
that octave-distance notes belong to the same pitch class
- No dissonance among them
- As a result, there are 12 pitch classes
- Shepard represented the octave
equivalence with “pitch helix”
- Chroma: represents the inherent circularity of pitch organization
- Height: increases continuously, with one octave per full rotation of the helix
Pitch Helix and Chroma (Shepard, 2001)
Pitch and Chroma
- Chroma is independent of the height
- Shepard tone: a single pitch class built from octave-spaced harmonics
- Creates the illusion of constantly rising or falling pitch
(Examples: optical-illusion stairs; Shepard tone: https://vimeo.com/34749558)
Chroma Audio Features
- Chroma features are audio feature vectors that contain the relative distribution of pitch classes in the audio
- Ideally, they can be obtained by polyphonic note transcription
- In practice, chroma features are obtained by projecting all time-frequency
energy onto 12 pitch classes
- Mainly used for chord recognition, key estimation, music/audio synchronization, and other “score-level” tasks
- Often used for music classification as well, but not as effective as MFCC
Chroma Features: FFT-based approach
- Compute spectrogram and mapping matrix
- Convert frequency to musical pitch scale and get the pitch class
- Set the entry of the corresponding pitch class to one, and zero otherwise
- Adjust non-zero values so that low-frequency content has more weight
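A simplified sketch of the projection in Python (it folds each spectral bin onto its nearest pitch class; the weighting step from the slide is omitted, and the two-bin toy spectrum is made up):

```python
import math

def pitch_class(freq, ref=440.0):
    """Map a frequency to one of 12 pitch classes (A4 = 440 Hz -> class 9, 'A')."""
    midi = 69 + 12 * math.log2(freq / ref)
    return round(midi) % 12

def chroma_from_spectrum(mags, freqs):
    """Project spectral energy onto the 12 pitch classes and normalize."""
    chroma = [0.0] * 12
    for m, f in zip(mags, freqs):
        if f > 0 and m > 0:
            chroma[pitch_class(f)] += m
    total = sum(chroma) or 1.0
    return [c / total for c in chroma]

# Energy at 440 Hz and 880 Hz (A4 and A5) folds onto the same pitch class
ch = chroma_from_spectrum([1.0, 1.0], [440.0, 880.0])
```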
Chroma Features: Filter-bank approach
- A filter-bank can be used to obtain a log-scale time-frequency representation
- Center frequencies are arranged over the 88 piano notes
- Bandwidths are set to be constant-Q and robust to ±25-cent detuning
- The outputs that belong to the same pitch class are wrapped and summed
(Müller, 2011)
Sound Examples of Chroma
- Original:
- MFCC reconstruction (using white noise as a source): pitch-invariance
- Chroma reconstruction: timbre-invariance
Feature Summarization and Normalization
- Summarization
- Summary statistics
- Temporal pooling: mean, variance, min, max over a context window
- Temporal modulation: DFT of time-trajectory over sub-bands
- Code-book approach
- Vector quantization: create codebook by K-means clustering
- Accumulate the code-book indices as a histogram
- Normalization
- Standardization
- Zero mean: subtract the mean of each dimension
- Unit variance: divide by the standard deviation of each dimension
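Mean/variance pooling and standardization can be sketched as follows; the tiny two-frame input is made up for the example:

```python
import math

def temporal_pool(frames):
    """Mean and variance of each feature dimension over a window of frames."""
    n, dims = len(frames), len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dims)]
    variances = [sum((f[d] - means[d]) ** 2 for f in frames) / n
                 for d in range(dims)]
    return means + variances  # one clip-level vector

def standardize(vectors):
    """Zero mean, unit variance per dimension across a dataset."""
    n, dims = len(vectors), len(vectors[0])
    means = [sum(v[d] for v in vectors) / n for d in range(dims)]
    stds = [math.sqrt(sum((v[d] - means[d]) ** 2 for v in vectors) / n) or 1.0
            for d in range(dims)]
    return [[(v[d] - means[d]) / stds[d] for d in range(dims)] for v in vectors]

pooled = temporal_pool([[0.0, 2.0], [2.0, 4.0]])   # [1.0, 3.0, 1.0, 1.0]
```

The pooled vector concatenates the per-dimension means and variances, turning a variable-length frame sequence into one fixed-size clip-level feature.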