SLIDE 1

Acoustic Scene Classification by Ensembling Gradient Boosting Machine and Convolutional Neural Networks

DCASE 2017

Eduardo Fonseca, Rong Gong, Dmitry Bogdanov, Olga Slizovskaia, Emilia Gomez and Xavier Serra

SLIDE 2

Outline

Introduction
Proposed System & Results
Summary

SLIDE 3

Introduction

Acoustic Scene Classification (ASC)

⇀ 15 acoustic scenes

[Figure: diagram with the labels "recording", "system" and "environment"]

SLIDE 6

Introduction

Traditionally: feature engineering

⇀ feature extraction
⇀ classifier

Nowadays: data-driven

⇀ learning representations

How about combining both approaches for ASC?

SLIDE 7

Proposed System

[Figure: system block diagram. A 10s segment feeds two branches:]

⇀ splitting ⇀ Freesound Extractor ⇀ GBM ⇀ score aggregation
⇀ pre-processing (mel-spectrogram) ⇀ splitting ⇀ CNN ⇀ score aggregation

Both branches ⇀ late fusion ⇀ acoustic scene
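The slide names late fusion but does not say which rule combines the two branches. As a rough illustration only, the sketch below assumes each branch outputs one probability per class and blends them with a weighted average before taking the argmax. The function name, the `gbm_weight` parameter and the toy numbers are hypothetical and not taken from the presentation.

```python
def late_fusion(gbm_probs, cnn_probs, gbm_weight=0.5):
    """Blend two per-class probability vectors; return (class index, fused vector).

    gbm_weight is a hypothetical knob; 0.5 gives a plain average.
    """
    if len(gbm_probs) != len(cnn_probs):
        raise ValueError("both models must score the same set of classes")
    fused = [gbm_weight * g + (1.0 - gbm_weight) * c
             for g, c in zip(gbm_probs, cnn_probs)]
    best = max(range(len(fused)), key=lambda k: fused[k])
    return best, fused


# Toy example with three classes (made-up numbers).
gbm_probs = [0.50, 0.40, 0.10]   # GBM alone would pick class 0
cnn_probs = [0.20, 0.70, 0.10]   # CNN alone would pick class 1
predicted, fused = late_fusion(gbm_probs, cnn_probs)
```

Other fusion rules (for example a second-stage classifier trained on the concatenated scores) fit the same block diagram; the average is simply the shortest one to write down.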

SLIDE 8

Gradient Boosting Machine

Freesound Extractor: http://essentia.upf.edu/documentation/freesound_extractor.html

[Figure: GBM pipeline. splitting ⇀ n audio snippets ⇀ Freesound Extractor ⇀ n feature vectors ⇀ GBM ⇀ n scores ⇀ score aggregation ⇀ acoustic scene]

SLIDE 10

Gradient Boosting Machine

Gradient Boosting Machine:

⇀ effective in Kaggle competitions
⇀ multiple weak learners (decision trees) ⇀ added iteratively

Implementation:

⇀ LightGBM https://github.com/Microsoft/LightGBM
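To make "multiple weak learners added iteratively" concrete, here is a minimal sketch of gradient boosting for squared loss on 1-D toy data, using one-split decision stumps as the weak learners. It is not LightGBM and not the system's configuration; the function names, the learning rate and the data are illustrative assumptions.

```python
def fit_stump(xs, residuals):
    """Fit the best one-threshold regression stump (least squares) on 1-D inputs."""
    best = None
    for t in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lmean = sum(left) / len(left)
        rmean = sum(right) / len(right) if right else 0.0
        err = sum((r - (lmean if x <= t else rmean)) ** 2
                  for x, r in zip(xs, residuals))
        if best is None or err < best[0]:
            best = (err, t, lmean, rmean)
    _, t, lmean, rmean = best
    return lambda x: lmean if x <= t else rmean


def boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Start from the mean, then add one stump per round fitted to the residuals."""
    base = sum(ys) / len(ys)
    stumps = []
    preds = [base] * len(xs)
    for _ in range(n_rounds):
        # For squared loss the negative gradient is simply the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)


def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Made-up toy data.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [0.1, 0.0, 0.9, 1.1, 3.0, 3.2, 1.0, 0.8]
weak = boost(xs, ys, n_rounds=1)
strong = boost(xs, ys, n_rounds=30)
```

Each round fits a new stump to what the current ensemble still gets wrong, so the training error of `strong` is lower than that of `weak`. Libraries such as LightGBM apply the same idea with full decision trees, multi-class losses and many engineering optimizations.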

SLIDE 11

Gradient Boosting Machine

Score aggregation:

⇀ averaging scores across snippets
⇀ argmax

Results:

⇀ development set ⇀ 4-fold cross-validation (folds provided)
⇀ Accuracy: 80.8%
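The score aggregation step (average the per-snippet scores, then take the argmax) can be sketched as follows. The function name and the toy scores are assumptions for illustration; the slide only specifies averaging followed by argmax.

```python
def aggregate_scores(snippet_scores):
    """Average per-class scores over all snippets, then pick the top class.

    snippet_scores: list of per-snippet score vectors, one value per class.
    Returns (predicted class index, mean score vector).
    """
    n_snippets = len(snippet_scores)
    n_classes = len(snippet_scores[0])
    mean_scores = [sum(s[c] for s in snippet_scores) / n_snippets
                   for c in range(n_classes)]
    best_class = max(range(n_classes), key=lambda c: mean_scores[c])
    return best_class, mean_scores


# Three snippets from one recording, three classes (made-up numbers).
scores = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.7, 0.2],
]
scene, mean_scores = aggregate_scores(scores)
```

Here the first snippet on its own favours class 0, but the recording-level average favours class 1, which is the point of aggregating before deciding.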

SLIDE 13

Convolutional Neural Network

log-scaled mel-spectrogram

⇀ 128 bands

Time splitting:

⇀ time-frequency (T-F) patches of 1.5 s

[Figure: CNN pipeline. pre-processing ⇀ log-scaled mel-spectrogram ⇀ splitting ⇀ n T-F patches ⇀ CNN ⇀ n scores ⇀ score aggregation ⇀ acoustic scene]
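A minimal sketch of the time-splitting step, assuming the spectrogram is stored as a list of frames with 128 mel-band values each. The frame rate (100 frames per second), the non-overlapping layout and the dropping of a trailing partial patch are assumptions for illustration; the slide only gives the 1.5 s patch length and the 128 bands.

```python
def split_into_patches(spectrogram, patch_seconds=1.5, frames_per_second=100):
    """Cut a spectrogram into consecutive, non-overlapping T-F patches.

    spectrogram: list of frames, each frame a list of mel-band values.
    frames_per_second is a hypothetical frame rate, not taken from the slides.
    A trailing remainder shorter than one patch is dropped.
    """
    patch_len = int(round(patch_seconds * frames_per_second))
    patches = []
    for start in range(0, len(spectrogram) - patch_len + 1, patch_len):
        patches.append(spectrogram[start:start + patch_len])
    return patches


# A silent 10 s segment: 1000 frames x 128 mel bands.
spectrogram = [[0.0] * 128 for _ in range(1000)]
patches = split_into_patches(spectrogram)
```

With these assumed settings a 10 s segment yields six patches of 150 frames each, and the last 100 frames are discarded. Overlapping patches or zero-padding the tail would be equally plausible choices.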

SLIDE 16

Convolutional Neural Network

Global time-domain pooling (Valenti, 2016)
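Global time-domain pooling collapses the whole time axis of each feature map into a single value. The sketch below assumes max pooling and assumes the frequency axis is kept; the slide does not state either choice, so both are illustrative assumptions, as are the function name and the toy values.

```python
def global_time_max_pool(feature_maps):
    """Reduce feature_maps[channel][freq_bin][time_step] over the time axis.

    Returns pooled[channel][freq_bin], the maximum activation over all time steps.
    Max pooling is an assumption; an average would follow the same pattern.
    """
    return [[max(time_series) for time_series in channel]
            for channel in feature_maps]


# Two channels, two frequency bins, three time steps (made-up activations).
maps = [
    [[0.1, 0.9, 0.3], [0.2, 0.2, 0.4]],
    [[0.0, 0.5, 0.1], [0.7, 0.6, 0.2]],
]
pooled = global_time_max_pool(maps)
```

Because the output no longer depends on the number of time steps, the layers after the pooling see a fixed-size input regardless of patch length.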

SLIDE 18

Convolutional Neural Network

Design of convolutional filters:

⇀ spectro-temporal patterns for ASC?
⇀ different rectangular filters (Pons, 2017) (Phan, 2016)
⇀ multiple vertical filter shapes (Q = 1, 2, 3, 4, 5)

[Figure: filter example for Q = 1]

SLIDE 19

Convolutional Neural Network

Design of convolutional filters:

⇀ spectro-temporal patterns for ASC?
⇀ different rectangular filters (Pons, 2017) (Phan, 2016)
⇀ multiple vertical filter shapes (Q = 1, 2, 3, 4, 5)

[Figure: filter example for Q = 4]

SLIDE 22

Recap

Feature engineering:

⇀ Freesound Extractor
⇀ GBM

Accuracy: 80.8%

Data-driven:

⇀ log-scaled mel-spectrogram ⇀ CNN

Accuracy: 79.9%

How differently do they behave?

SLIDE 23

Models’ Comparison

(Confusion matrix by GBM - Confusion matrix by CNN)
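The comparison on this slide is an element-wise difference of two confusion matrices. A small sketch of that computation follows, using made-up labels for three classes; the function names and the data are illustrative and do not reproduce the figure.

```python
def confusion_matrix(true_labels, predicted_labels, n_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(true_labels, predicted_labels):
        matrix[t][p] += 1
    return matrix


def matrix_difference(a, b):
    """Element-wise a - b for two matrices of the same shape."""
    return [[x - y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


# Made-up toy labels for three classes.
true_labels = [0, 0, 1, 1, 2, 2]
gbm_pred = [0, 0, 1, 2, 2, 2]
cnn_pred = [0, 1, 1, 1, 2, 0]

cm_gbm = confusion_matrix(true_labels, gbm_pred, 3)
cm_cnn = confusion_matrix(true_labels, cnn_pred, 3)
diff = matrix_difference(cm_gbm, cm_cnn)
```

Positive diagonal entries of `diff` mark classes where the GBM gets more recordings right than the CNN, and negative diagonal entries mark the opposite. Off-diagonal entries show which confusions are specific to one model.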


SLIDE 24