Acoustic Scene Classification by Ensembling Gradient Boosting Machine and Convolutional Neural Networks
DCASE 2017
Eduardo Fonseca, Rong Gong, Dmitry Bogdanov, Olga Slizovskaia, Emilia Gomez and Xavier Serra
Outline
- Introduction
- Proposed System & Results
- Summary
Introduction
- Acoustic Scene Classification (ASC)
⇀ 15 acoustic scenes
[Diagram labels: recording, system, environment]
Introduction
- Traditionally: feature engineering
⇀ feature extraction
⇀ classifier
- Nowadays: data-driven
⇀ learning representations
How about combining both approaches for ASC?
Proposed System
[Block diagram: a 10 s segment feeds two branches. Branch 1: splitting → Freesound Extractor → GBM → score aggregation. Branch 2: pre-processing (mel-spectrogram) → splitting → CNN → score aggregation. Late fusion of both branches → acoustic scene]
- Freesound Extractor (Essentia): http://essentia.upf.edu/documentation/freesound_extractor.html
Gradient Boosting Machine
[Diagram: splitting → n audio snippets → Freesound Extractor → n feature vectors → GBM → score aggregation → acoustic scene]
- Gradient Boosting Machine:
⇀ effective in Kaggle competitions
⇀ multiple weak learners (decision trees)
⇀ added iteratively
- Implementation:
⇀ LightGBM https://github.com/Microsoft/LightGBM
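The idea of weak learners added iteratively can be illustrated with a toy sketch. It is not LightGBM and not the system's configuration: it assumes 1-D regression with squared loss, decision stumps as weak learners, and hypothetical helper names (`fit_stump`, `fit_gbm`) and data chosen only for illustration.

```python
# Toy gradient boosting: each round fits a decision stump to the current
# residuals and adds it, scaled by a learning rate, to the ensemble.
# All names and values here are illustrative assumptions.

def fit_stump(xs, residuals):
    """Pick the threshold split that minimises squared error on the residuals."""
    best = None
    for t in sorted(set(xs))[:-1]:  # largest value excluded so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm = sum(left) / len(left)
        rm = sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return t, lm, rm


def predict_stump(stump, x):
    t, lm, rm = stump
    return lm if x <= t else rm


def fit_gbm(xs, ys, n_rounds=50, learning_rate=0.3):
    base = sum(ys) / len(ys)  # initial prediction: the mean target
    stumps = []
    preds = [base] * len(xs)
    for _ in range(n_rounds):
        # For squared loss, the negative gradient is the plain residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, x)
                 for p, x in zip(preds, xs)]
    return base, stumps, learning_rate


def predict_gbm(model, x):
    base, stumps, lr = model
    return base + lr * sum(predict_stump(s, x) for s in stumps)


def mse(model, xs, ys):
    return sum((predict_gbm(model, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0.1, 0.0, 0.2, 1.1, 0.9, 1.0]
model = fit_gbm(xs, ys)
```

Each added stump corrects part of the error left by the previous ones, so the training error does not increase as rounds are added. A production library such as LightGBM uses deeper trees, multiclass losses, and regularisation.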
Gradient Boosting Machine
- Score aggregation:
⇀ averaging scores across snippets
⇀ argmax
- Results:
⇀ development set, using the provided 4-fold cross-validation setup
⇀ Accuracy: 80.8%
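The score aggregation step (averaging scores across snippets, then argmax) might look like the following minimal sketch. The function name, the three-class example, and the score values are assumptions made for illustration.

```python
def aggregate_scores(snippet_scores):
    """Average per-class scores over all snippets of one segment,
    then pick the class with the highest mean score."""
    n_snippets = len(snippet_scores)
    n_classes = len(snippet_scores[0])
    mean_scores = [
        sum(scores[c] for scores in snippet_scores) / n_snippets
        for c in range(n_classes)
    ]
    predicted = max(range(n_classes), key=lambda c: mean_scores[c])
    return mean_scores, predicted


# Three snippets from one segment, three hypothetical classes.
snippet_scores = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.7, 0.2],
]
mean_scores, predicted = aggregate_scores(snippet_scores)
```

In this example the first snippet alone would favour class 0, but the averaged scores favour class 1, which is the kind of smoothing the aggregation provides.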
Convolutional Neural Network
[Diagram: pre-processing → log-scaled mel-spectrogram → splitting → n T-F patches → CNN → score aggregation → acoustic scene]
- log-scaled mel-spectrogram
⇀ 128 bands
- Time splitting:
⇀ T-F patches of 1.5 s
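Time splitting of a spectrogram into 1.5 s T-F patches can be sketched as below. The frame rate (100 frames per second), the non-overlapping layout, and dropping the incomplete last patch are assumptions; the slides give only the patch duration and the 128 mel bands.

```python
def split_into_patches(spectrogram, patch_frames):
    """Cut a list of time frames into consecutive, non-overlapping patches.
    Trailing frames that do not fill a whole patch are dropped."""
    n_patches = len(spectrogram) // patch_frames
    return [
        spectrogram[i * patch_frames:(i + 1) * patch_frames]
        for i in range(n_patches)
    ]


# Assumed values for illustration only.
frames_per_second = 100
patch_frames = int(1.5 * frames_per_second)  # frames in one 1.5 s patch
n_bands = 128

# Placeholder 10 s segment: one list of 128 band values per time frame.
segment = [[0.0] * n_bands for _ in range(10 * frames_per_second)]
patches = split_into_patches(segment, patch_frames)
```

With these assumed values a 10 s segment yields six full patches, and the remaining second of audio is discarded; an overlapping scheme would avoid that loss.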
Convolutional Neural Network
- Global time-domain pooling (Valenti, 2016)
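Global time-domain pooling collapses the whole time axis of each feature map into a single value, so the output size no longer depends on the patch length. The sketch below is an illustration only: the slide does not state which pooling operator is used, so both a max and a mean variant are shown, and the feature-map values are made up.

```python
def global_time_max_pool(feature_map):
    """feature_map[channel][time] -> one value per channel (max over time)."""
    return [max(channel) for channel in feature_map]


def global_time_mean_pool(feature_map):
    """feature_map[channel][time] -> one value per channel (mean over time)."""
    return [sum(channel) / len(channel) for channel in feature_map]


# Two hypothetical channels, four time steps each.
feature_map = [
    [0.1, 0.9, 0.3, 0.2],
    [0.5, 0.4, 0.6, 0.5],
]
pooled_max = global_time_max_pool(feature_map)
pooled_mean = global_time_mean_pool(feature_map)
```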
Convolutional Neural Network
- Design of convolutional filters:
⇀ spectro-temporal patterns for ASC?
⇀ different rectangular filters (Pons, 2017) (Phan, 2016)
⇀ multiple vertical filter shapes (Q = 1, 2, 3, 4, 5)
[Figures: filter shapes illustrated for Q = 1 and Q = 4]
Recap
- Feature engineering:
⇀ Freesound Extractor
⇀ GBM
- Accuracy: 80.8%
- Data-driven:
⇀ log-scaled mel-spectrogram
⇀ CNN
- Accuracy: 79.9%
How differently do they behave?
Models’ Comparison
- Difference of confusion matrices: (confusion matrix by GBM) − (confusion matrix by CNN)
[Figure: difference matrix; the color scale marks where GBM performs better and where CNN performs better]
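The comparison (confusion matrix by GBM minus confusion matrix by CNN) can be reproduced in a few lines. The labels and predictions below are invented for a three-class toy case and do not reflect the reported results.

```python
def confusion_matrix(true_labels, predicted_labels, n_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(true_labels, predicted_labels):
        matrix[t][p] += 1
    return matrix


def matrix_difference(a, b):
    """Element-wise a - b for two equally sized matrices."""
    return [[x - y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


# Hypothetical labels and predictions for illustration.
true_labels = [0, 0, 1, 1, 2, 2]
gbm_pred = [0, 0, 1, 2, 2, 1]
cnn_pred = [0, 1, 1, 1, 2, 2]

diff = matrix_difference(
    confusion_matrix(true_labels, gbm_pred, 3),
    confusion_matrix(true_labels, cnn_pred, 3),
)
```

On the diagonal, a positive entry means GBM classified more items of that class correctly and a negative entry means CNN did. Off the diagonal the sign flips: a positive entry means GBM made that confusion more often.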
Late Fusion
- GBM:
⇀ prediction probabilities
- CNN:
⇀ softmax activation values
- Late fusion approach:
⇀ arithmetic mean + argmax
- System accuracy on development set:
⇀ 83.0%
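The late fusion rule (arithmetic mean + argmax) is simple enough to sketch directly. The function name and the probability values are assumptions for illustration, with three hypothetical classes instead of the 15 scenes.

```python
def late_fusion(gbm_probs, cnn_probs):
    """Average the two models' per-class scores and pick the best class."""
    fused = [(g + c) / 2 for g, c in zip(gbm_probs, cnn_probs)]
    predicted = max(range(len(fused)), key=lambda i: fused[i])
    return fused, predicted


# Hypothetical per-class outputs for one segment.
gbm_probs = [0.5, 0.4, 0.1]  # GBM prediction probabilities
cnn_probs = [0.2, 0.6, 0.2]  # CNN softmax activation values
fused, predicted = late_fusion(gbm_probs, cnn_probs)
```

Here GBM alone would choose class 0, while the fused scores choose class 1. The unweighted mean treats both models equally; a weighted mean would be one possible refinement of the fusion approach.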
Results
- residential area vs park
- tram vs train
- grocery store vs cafe/restaurant
Challenge Ranking
- accuracy drop
- outperforming the baseline by an absolute 6.3%
Summary
- Ensemble of two models
- Simplicity of models:
⇀ GBM + out-of-the-box feature extractor
⇀ CNN using domain knowledge
⇀ providing complementary information
- Simple late fusion method
- Reasonable results, although there is room for improvement:
⇀ individual models
⇀ fusion approach
Thank you!
References
- H. Phan, L. Hertel, M. Maass, and A. Mertins, “Robust audio event recognition with 1-max pooling convolutional neural networks,” arXiv preprint arXiv:1604.06338, 2016.
- J. Pons, O. Slizovskaia, R. Gong, E. Gómez, and X. Serra, “Timbre analysis of music audio signals with convolutional neural networks,” in 25th European Signal Processing Conference (EUSIPCO), 2017.
- M. Valenti, A. Diment, G. Parascandolo, S. Squartini, and T. Virtanen, “DCASE 2016 acoustic scene classification using convolutional neural networks,” in Proc. Workshop Detection Classif. Acoust. Scenes Events, 2016.