Acoustic Scene Classification by Ensembling Gradient Boosting Machine and Convolutional Neural Networks

SLIDE 1

Acoustic Scene Classification by Ensembling Gradient Boosting Machine and Convolutional Neural Networks

DCASE 2017

Eduardo Fonseca, Rong Gong, Dmitry Bogdanov, Olga Slizovskaia, Emilia Gómez and Xavier Serra

SLIDE 2

Outline

  • Introduction
  • Proposed System & Results
  • Summary

SLIDE 3

Introduction

  • Acoustic Scene Classification (ASC)

⇀ 15 acoustic scenes

[Diagram: recording ⇀ system ⇀ environment]

SLIDE 6

Introduction

  • Traditionally: feature engineering

feature extraction ⇀ classifier

  • Nowadays: data-driven

learning representations

How about combining both approaches for ASC?

SLIDE 7

Proposed System

[Block diagram, two branches per 10s segment:
10s segment ⇀ splitting ⇀ Freesound Extractor ⇀ GBM ⇀ score aggregation
10s segment ⇀ pre-processing (mel-spectrogram) ⇀ splitting ⇀ CNN ⇀ score aggregation
both branches ⇀ late fusion ⇀ acoustic scene]

SLIDE 8

Gradient Boosting Machine

  • Freesound Extractor by Essentia: http://essentia.upf.edu/documentation/freesound_extractor.html

[Diagram: splitting ⇀ n audio snippets ⇀ Freesound Extractor ⇀ n feature vectors ⇀ GBM ⇀ n scores ⇀ score aggregation ⇀ acoustic scene]

SLIDE 10

Gradient Boosting Machine

  • Gradient Boosting Machine:

effective in Kaggle competitions

multiple weak learners (decision trees) ⇀ added iteratively

  • Implementation:

LightGBM https://github.com/Microsoft/LightGBM
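
The idea of weak learners being added iteratively can be illustrated with a toy sketch. This is not LightGBM and not the authors' configuration: it is a minimal gradient boosting loop for 1-D regression with squared loss, where each weak learner is a decision stump fitted to the current residuals. All names (`fit_stump`, `boost`, `learning_rate`) and the data are illustrative assumptions.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression stump to the residuals (least squares)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # candidate split points
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Add stumps one at a time, each fitted to the current residuals."""
    base = sum(ys) / len(ys)  # initial constant prediction
    stumps = []

    def predict(x):
        return base + learning_rate * sum(stump(x) for stump in stumps)

    for _ in range(n_rounds):
        # For squared loss, the negative gradient is the residual.
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
    return predict


def mean_squared_error(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical toy data.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0.1, 0.9, 4.2, 8.8, 16.1, 25.3]
weak = boost(xs, ys, n_rounds=1)
strong = boost(xs, ys, n_rounds=20)
```

On the training data, the model with 20 stumps has a lower error than the model with a single stump, which is the effect of adding weak learners one after another.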

SLIDE 11

Gradient Boosting Machine

  • Score aggregation:

averaging scores across snippets

argmax

  • Results:

development set ⇀ 4-fold cross-validation provided

Accuracy: 80.8%
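
The score aggregation step (average the per-snippet scores, then take the argmax) can be sketched as follows. The function name and the example scores are hypothetical; only the averaging and argmax come from the slide.

```python
def aggregate_scores(snippet_scores):
    """Average per-class scores over snippets; return (label index, mean scores)."""
    n_snippets = len(snippet_scores)
    n_classes = len(snippet_scores[0])
    mean_scores = [
        sum(scores[c] for scores in snippet_scores) / n_snippets
        for c in range(n_classes)
    ]
    predicted = max(range(n_classes), key=mean_scores.__getitem__)
    return predicted, mean_scores


# Hypothetical scores for 3 snippets of one recording over 3 scene classes.
snippet_scores = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.7, 0.2, 0.1],
]
predicted, mean_scores = aggregate_scores(snippet_scores)
```

Here the second snippet on its own would vote for class 1, but the averaged scores favor class 0 for the whole recording.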

SLIDE 13

Convolutional Neural Network

  • log-scaled mel-spectrogram

128 bands

  • Time splitting:

T-F patches of 1.5 s

[Diagram: pre-processing ⇀ log-scaled mel-spectrogram ⇀ splitting ⇀ n T-F patches ⇀ CNN ⇀ n scores ⇀ score aggregation ⇀ acoustic scene]
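
A minimal sketch of the time splitting step, assuming non-overlapping patches and a frame rate of 100 frames per second. Both are assumptions: the slides give only the 1.5 s patch length and the 128 bands, not the hop size or any overlap. The function name is hypothetical.

```python
def split_into_patches(frames, patch_seconds=1.5, frames_per_second=100):
    """Cut a spectrogram (a list of time frames) into fixed-length patches.

    frames_per_second is an assumed value; the slides do not give the hop size.
    Patches do not overlap, and a trailing remainder shorter than one patch
    is dropped.
    """
    patch_len = int(round(patch_seconds * frames_per_second))
    n_patches = len(frames) // patch_len
    return [frames[i * patch_len:(i + 1) * patch_len] for i in range(n_patches)]


# Placeholder 10 s spectrogram: 1000 frames of 128 mel bands (all zeros).
n_bands = 128
spectrogram = [[0.0] * n_bands for _ in range(1000)]
patches = split_into_patches(spectrogram)
```

Under these assumptions a 10 s segment yields 6 patches of 150 frames each, and the last 100 frames are discarded.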

SLIDE 16

Convolutional Neural Network

  • Global time-domain pooling (Valenti, 2016)
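
Global time-domain pooling collapses the whole time axis of a feature map into a single value per channel and frequency bin. The sketch below assumes max pooling and a [channel][frequency][time] layout. Neither is stated on the slide, and the function name is hypothetical.

```python
def global_time_max_pool(feature_map):
    """Collapse the time axis of a [channel][frequency][time] feature map.

    Max pooling is an assumption here; the slide does not name the operator.
    """
    return [[max(time_series) for time_series in channel] for channel in feature_map]


# Hypothetical feature map: 2 channels, 2 frequency bins, 4 time steps.
feature_map = [
    [[0.1, 0.9, 0.3, 0.2], [0.0, 0.4, 0.8, 0.1]],
    [[0.5, 0.2, 0.1, 0.7], [0.6, 0.6, 0.3, 0.2]],
]
pooled = global_time_max_pool(feature_map)
```

After pooling, the output size no longer depends on the number of time steps in the patch.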

SLIDE 18

Convolutional Neural Network

  • Design of convolutional filters:

spectro-temporal patterns for ASC?

different rectangular filters (Pons, 2017) (Phan, 2016)

multiple vertical filter shapes (Q = 1, 2, 3, 4, 5)

[Figure: filter shape for Q = 1]

SLIDE 19

Convolutional Neural Network

  • Design of convolutional filters:

spectro-temporal patterns for ASC?

different rectangular filters (Pons, 2017) (Phan, 2016)

multiple vertical filter shapes (Q = 1, 2, 3, 4, 5)

[Figure: filter shape for Q = 4]

SLIDE 22

Recap

  • Feature engineering:

Freesound Extractor ⇀ GBM

  • Accuracy: 80.8%
  • Data-driven:

log-scaled mel-spectrogram ⇀ CNN

  • Accuracy: 79.9%

How differently do they behave?

SLIDE 24

Models’ Comparison

  • (Confusion matrix by GBM - Confusion matrix by CNN)

[Figure: difference of the two confusion matrices; legend: GBM performs better, CNN performs better]
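
The comparison subtracts the CNN confusion matrix from the GBM confusion matrix element by element. A minimal sketch with made-up labels (the data and function names are hypothetical, not the challenge results):

```python
def confusion_matrix(true_labels, predicted_labels, n_classes):
    """Count matrix with rows = true class, columns = predicted class."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(true_labels, predicted_labels):
        matrix[t][p] += 1
    return matrix


def matrix_difference(a, b):
    """Element-wise a - b."""
    return [[x - y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


# Hypothetical labels for 6 recordings and 3 classes.
true_labels = [0, 0, 1, 1, 2, 2]
gbm_predictions = [0, 0, 1, 2, 2, 1]
cnn_predictions = [0, 1, 1, 1, 2, 2]

difference = matrix_difference(
    confusion_matrix(true_labels, gbm_predictions, 3),
    confusion_matrix(true_labels, cnn_predictions, 3),
)
```

On the diagonal, a positive value means GBM classified more recordings of that class correctly and a negative value means CNN did. Off the diagonal, a positive value means GBM made that particular confusion more often.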

SLIDE 31

Late Fusion

  • GBM:

prediction probabilities

  • CNN:

softmax activation values

  • Late fusion approach:

arithmetic mean + argmax

  • System accuracy on development set:

83.0%
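
The late fusion rule (arithmetic mean of the two models' class scores, then argmax) can be sketched as follows. The function name and the scores are hypothetical.

```python
def late_fusion(gbm_probs, cnn_probs):
    """Arithmetic mean of the two models' class scores, then argmax."""
    fused = [(g + c) / 2 for g, c in zip(gbm_probs, cnn_probs)]
    predicted = max(range(len(fused)), key=fused.__getitem__)
    return predicted, fused


# Hypothetical recording-level scores over 3 scene classes.
gbm_probs = [0.5, 0.4, 0.1]  # GBM prediction probabilities
cnn_probs = [0.2, 0.7, 0.1]  # CNN softmax activation values
predicted, fused = late_fusion(gbm_probs, cnn_probs)
```

In this example GBM alone would choose class 0, while the fused scores choose class 1 because the CNN is more confident there.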

SLIDE 34

Results

  • residential area vs park
  • tram vs train
  • grocery store vs cafe/restaurant

SLIDE 35

Challenge Ranking

  • accuracy drop
  • outperforming baseline by absolute 6.3%

SLIDE 36

Summary

  • Ensemble of two models
  • Simplicity of models:

GBM + out-of-the-box feature extractor

CNN using domain knowledge

⇀ providing complementary information

  • Simple late fusion method
  • Reasonable results although room for improvement

⇀ individual models

⇀ fusion approach

SLIDE 37

Thank you!

SLIDE 38

References

  • H. Phan, L. Hertel, M. Maass, and A. Mertins, “Robust audio event recognition with 1-max pooling convolutional neural networks,” arXiv preprint arXiv:1604.06338, 2016.
  • J. Pons, O. Slizovskaia, R. Gong, E. Gómez, and X. Serra, “Timbre analysis of music audio signals with convolutional neural networks,” in 25th European Signal Processing Conference (EUSIPCO), 2017.
  • M. Valenti, A. Diment, G. Parascandolo, S. Squartini, and T. Virtanen, “DCASE 2016 acoustic scene classification using convolutional neural networks,” in Proc. Workshop Detection Classif. Acoust. Scenes Events, 2016.