 
DCASE 2016 CONVOLUTIONAL NEURAL NETWORKS FOR ACOUSTIC SCENE CLASSIFICATION Michele Valenti 1 (valenti.michele.w@gmail.com), Aleksandr Diment 2, Giambattista Parascandolo 2, Stefano Squartini 1, Tuomas Virtanen 2. 1 Università Politecnica delle Marche, Italy; 2 Tampere University of Technology, Finland
Outline • Introduction • Our system • Training modes • Results • Challenge ranking
Introduction: What is “acoustic scene classification”? Audio → scene label, e.g. Home, Car, Forest path
Our system: Overview. Audio → Feature extraction → Sequence splitting → CNN → Scores averaging → Label
Our system: Features. Raw audio → Log-mel spectrogram
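The two ingredients behind the name "log-mel" can be sketched in a few lines: band edges spaced evenly on the mel scale, and logarithmic compression of the band energies. This is only an illustrative sketch. The HTK-style mel formula, the function names, and the `eps` floor are assumptions, and the slides do not state the number of bands or the frequency range.

```python
import math

def hz_to_mel(f_hz):
    """Convert Hz to mel (HTK-style formula, assumed here)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_band_edges(n_bands, f_min, f_max):
    """Return n_bands + 2 edge frequencies in Hz, equally spaced in mel."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_bands + 1)
    return [mel_to_hz(lo + i * step) for i in range(n_bands + 2)]

def log_compress(band_energies, eps=1e-10):
    """Natural log of each band energy; eps (assumed) avoids log(0)."""
    return [math.log(e + eps) for e in band_energies]
```

A full front end would also need a short-time Fourier transform and triangular filters between consecutive edges; those steps are omitted here.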
Our system: Sequence splitting. Raw audio → Log-mel spectrogram → Sequences (segments of the spectrogram)
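Sequence splitting could look roughly like the sketch below, which cuts a list of spectrogram frames into fixed-length sequences. The function name is hypothetical, and two details are assumptions not given on the slides: the sequences do not overlap, and an incomplete tail is dropped.

```python
def split_into_sequences(frames, seq_len):
    """Split a list of feature frames into consecutive sequences.

    Assumptions: sequences are non-overlapping, and frames left over
    after the last full sequence are discarded.
    """
    if seq_len <= 0:
        raise ValueError("seq_len must be positive")
    n_full = len(frames) // seq_len
    return [frames[i * seq_len:(i + 1) * seq_len] for i in range(n_full)]

# Ten dummy frames, three frames per sequence
sequences = split_into_sequences(list(range(10)), 3)
```

Each resulting sequence is classified on its own, so a shorter sequence length yields more, but less informative, inputs per file.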
Our system: Convolutional neural network. Sequence → 128 feature maps → Batch normalization → 128 subsampled feature maps → 256 new feature maps → Time shrinking → Flattening → Fully-connected softmax layer
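To make the layer chain concrete, the sketch below traces tensor shapes through the stages named on the slide. Only the stage order and the 128 and 256 feature-map counts come from the slides. The input size (60 mel bands, 150 frames), the 5×5 and 4×(whole time axis) pooling sizes, and the "same"-padded convolutions are assumptions chosen for illustration.

```python
def conv_same(shape, n_kernels):
    """'Same'-padded convolution (assumed): keeps size, sets channel count."""
    _, n_freq, n_time = shape
    return (n_kernels, n_freq, n_time)

def max_pool(shape, pool_freq, pool_time):
    """Non-overlapping max-pooling: integer-divides each axis by its pool size."""
    channels, n_freq, n_time = shape
    return (channels, n_freq // pool_freq, n_time // pool_time)

def trace_shapes(n_mels=60, n_frames=150, n_classes=15):
    """Return (stage name, shape) pairs; shapes are (channels, freq, time)."""
    trace = []
    x = (1, n_mels, n_frames)
    trace.append(("sequence", x))
    x = conv_same(x, 128)  # batch normalization follows, shape unchanged
    trace.append(("128 feature maps", x))
    x = max_pool(x, 5, 5)  # assumed pooling size
    trace.append(("128 subsampled feature maps", x))
    x = conv_same(x, 256)
    trace.append(("256 new feature maps", x))
    x = max_pool(x, 4, x[2])  # pool over the whole remaining time axis
    trace.append(("time shrinking", x))
    trace.append(("flattening", (x[0] * x[1] * x[2],)))
    trace.append(("fully-connected softmax layer", (n_classes,)))
    return trace

for name, shape in trace_shapes():
    print(f"{name}: {shape}")
```

Under these assumptions the time axis collapses to length 1 before flattening, so the softmax layer sees a fixed-size vector regardless of how many frames remain after the first pooling.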
Our system: Scores averaging. Class prediction scores (one set per sequence) → Σ (average over sequences) → argmax → File’s class
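The averaging and argmax step can be sketched as follows. The function name and the toy scores are made up for illustration; the only part taken from the slide is the rule itself: average the per-sequence class scores, then pick the class with the highest mean.

```python
def classify_file(sequence_scores, class_names):
    """Average per-sequence class scores and return the argmax class.

    sequence_scores: one list of class scores per sequence.
    """
    n_seq = len(sequence_scores)
    n_classes = len(class_names)
    mean_scores = [
        sum(scores[c] for scores in sequence_scores) / n_seq
        for c in range(n_classes)
    ]
    best = max(range(n_classes), key=lambda c: mean_scores[c])
    return class_names[best], mean_scores

# Toy example: three sequences from one file, three classes
scores = [[0.6, 0.3, 0.1],
          [0.2, 0.5, 0.3],
          [0.1, 0.7, 0.2]]
label, means = classify_file(scores, ["home", "car", "forest path"])
```

Here the first sequence alone would vote for "home", but the file-level average favours "car", which is the point of aggregating over sequences.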
Training
Training: Cross-validation setup. Four folds (Fold 1 to Fold 4), each split into a training + validation part and a test part.
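A four-fold setup of this kind can be sketched as below. The round-robin assignment of files to folds is an assumption made purely for illustration; the slides do not say how the folds were built, and a challenge dataset may well ship with predefined fold lists. The function name is hypothetical.

```python
def make_folds(file_ids, n_folds=4):
    """Return a list of (train_val, test) pairs, one per fold.

    Assumption: files are assigned to test folds round-robin by index.
    """
    folds = []
    for k in range(n_folds):
        test = [f for i, f in enumerate(file_ids) if i % n_folds == k]
        train_val = [f for i, f in enumerate(file_ids) if i % n_folds != k]
        folds.append((train_val, test))
    return folds

# Eight dummy file names spread over four folds
folds = make_folds([f"file_{i}" for i in range(8)])
```

Every file lands in exactly one test part, so accuracies from the four test parts together cover the whole development set.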
Training: Non-full training. In fold n, the training + validation data is split into a training part and a validation part. Training and validation accuracies are tracked over the epochs to find the convergence time.
Training: Full training. In fold n, the validation part is added to the training part, so the network trains on all of the training + validation data.
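The relation between the two training modes can be sketched as a two-stage procedure. Several things here are assumptions rather than facts from the slides: that the convergence time is the epoch with the best validation accuracy, that full training restarts and runs for exactly that many epochs, and the `train_fn` callback, which is a hypothetical stand-in for a real training loop.

```python
def convergence_epoch(val_accuracies):
    """1-based epoch with the highest validation accuracy (first on ties)."""
    best = max(range(len(val_accuracies)), key=lambda i: val_accuracies[i])
    return best + 1

def two_stage_training(train_fn, train_set, val_set, max_epochs):
    """Run non-full training, then full training for the convergence time.

    train_fn(train, val, epochs) is a hypothetical callback that trains a
    fresh model and returns one validation accuracy per epoch
    (an empty list when val is None).
    """
    # Stage 1, non-full training: validation data is held out and monitored
    val_acc = train_fn(train_set, val_set, max_epochs)
    n_epochs = convergence_epoch(val_acc)
    # Stage 2, full training: validation data joins the training data
    train_fn(train_set + val_set, None, n_epochs)
    return n_epochs
```

The trade-off is that full training uses more data but has no held-out set left to monitor, which is why the epoch count has to be fixed beforehand.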
Results: Test data. Results are computed on the test part of each of the four folds (Fold 1 to Fold 4).
Results: Sequence length. [Figure: accuracy (%) versus sequence length (0.5, 1.5, 3, 5, 10, 30 s) for non-full training and full training; accuracy axis from 65 to 80%.]
Results: Class accuracies
Class              Accuracy (%)
Beach              75.6
Bus                76.9
Café/Restaurant    74.4
Car                91.0
City center        93.6
Forest path        96.2
Grocery store      88.5
Home               80.8
Library            66.6
Metro station      96.2
Office             97.4
Park               59.0
Residential area   73.1
Train              46.2
Tram               78.2
Annotated confusions: 34.6% → Residential area; 29.5% → Bus
Results: Other classifiers
System                     Sequence length (s)   Accuracy (%), non-full training   Accuracy (%), full training
Baseline GMM (MFCC)        -                     -                                 72.6
Two-layer CNN (MFCC)       5                     67.7                              72.6
Two-layer MLP (log-mel)    -                     66.6                              69.3
One-layer CNN (log-mel)    3                     70.3                              74.8
Two-layer CNN (log-mel)    3                     75.9                              79.0
Challenge ranking: Final training. The extended training set consists of training + validation + test data; the evaluation set is the secret challenge data. The extended training set is first split into new training and new validation parts, and convergence is reached at 400 epochs. The final training then runs for 400 epochs.
Challenge ranking. [Figure: bar chart of challenge accuracies (%), axis from 0 to 100; values shown: 89.7, 88.7, 87.7, 87.2, 86.4, 86.4, 86.2, 85.9, 85.6, 85.4, 84.6, 84.1, 77.2, 62.8.]
Results: Feature comparison
System                     Sequence length (s)   Accuracy (%), non-full training   Accuracy (%), full training
Two-layer CNN (MFCC)       5                     67.7                              72.6
Two-layer CNN (log-mel)    5                     74.1                              78.3