

SLIDE 1

Inferring phonemic classes from CNN activation maps using clustering techniques

Thomas Pellegrini, Sandrine Mouysset

Université de Toulouse; UPS; IRIT; Toulouse, France

thomas.pellegrini@irit.fr, sandrine.mouysset@irit.fr


SLIDE 2

Motivation

Slide from Surya Ganguli, http://goo.gl/YmmqCg


SLIDE 3

Related work in speech: with DNNs

Source: Nagamine et al., "Exploring How Deep Neural Networks Form Phonemic Categories", INTERSPEECH 2015


SLIDE 4

Related work in speech: with DNNs

◮ Single nodes and populations of nodes in a layer are selective to phonetic features

◮ Node selectivity to phonetic features becomes more explicit in deeper layers


SLIDE 5

Related work in speech: with DNNs

◮ Single nodes and populations of nodes in a layer are selective to phonetic features

◮ Node selectivity to phonetic features becomes more explicit in deeper layers

◮ Do these findings still hold with convolutional neural networks?


SLIDE 6

CNN Model used in this study

◮ BREF corpus: 100 hours, 120 native French speakers

◮ Train / dev sets: 90% / 10%, 1.8M / 150K samples

◮ Phone error rate (PER): 20% → accurate enough to make analyzing the model worthwhile


SLIDE 7

Study workflow

Does a CNN encode phonemic categories the way a DNN does?

◮ 100 input samples per phone are fed forward through the network

◮ The outputs of each layer are extracted and fed to either k-means or spectral clustering, with an optional front-end dimension reduction

◮ Remark: 4-d tensors are reshaped into 2-d matrices


SLIDE 8

Study workflow

Does a CNN encode phonemic categories the way a DNN does?

◮ 100 input samples per phone are fed forward through the network

◮ The outputs of each layer are extracted and fed to either k-means or spectral clustering, with an optional front-end dimension reduction (a code sketch follows this list)

◮ Remark: 4-d tensors are reshaped into 2-d matrices

◮ Experiment 1: fixed number of 33 clusters (the size of the French phone set)

◮ Experiment 2: optimal number of clusters determined automatically
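A minimal Python sketch of this workflow, under stated assumptions: the dummy `layer_outputs` array stands in for the real forward-pass outputs of one CNN layer (33 phones × 100 samples each), and the cluster count matches Experiment 1. Note that scikit-learn's KMeans is Euclidean-only; the Manhattan-distance variant named later is sketched on slide 11.

```python
# Sketch of the analysis workflow: reshape 4-d activations to 2-d, then
# cluster with k-means and with spectral clustering (Gaussian affinity).
import numpy as np
from sklearn.cluster import KMeans, SpectralClustering

rng = np.random.default_rng(0)
# Stand-in for one layer's extracted outputs: (samples, maps, height, width).
layer_outputs = rng.standard_normal((3300, 16, 11, 3))  # 33 phones x 100 samples

# Reshape the 4-d activation tensor into a 2-d (samples x features) matrix.
X = layer_outputs.reshape(layer_outputs.shape[0], -1)

# Experiment 1: fixed number of 33 clusters (French phone set size).
labels_km = KMeans(n_clusters=33, n_init=10, random_state=0).fit_predict(X)
labels_sc = SpectralClustering(n_clusters=33, affinity="rbf",
                               random_state=0).fit_predict(X)
```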


SLIDE 9

Dimension reduction

◮ Principal Component Analysis (PCA) applied to the whole activation maps: keep the number of principal components that retains at least 90% of the covariance matrix spectrum (see the sketch below)

[Figure: PCA projections of averaged activations]

http://goo.gl/bbuZn9
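As an illustration, a sketch of this front-end with scikit-learn, where passing a float to `n_components` keeps just enough components to reach the target variance ratio; `X` is the 2-d activation matrix from the workflow sketch above.

```python
# PCA front-end sketch: keep enough principal components to retain at
# least 90% of the covariance matrix spectrum (explained variance).
from sklearn.decomposition import PCA

pca = PCA(n_components=0.90, svd_solver="full")  # float => variance ratio
X_red = pca.fit_transform(X)
print(pca.n_components_, pca.explained_variance_ratio_.sum())
```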


SLIDE 10

Dimension reduction

◮ t-Distributed Stochastic Neighbor Embedding (t-SNE): relies on random walks on neighborhood graphs to extract the local structure of the data while also revealing important global structure (see the sketch below)

[Figure: t-SNE projections of averaged activations]

http://goo.gl/4f3nZ3
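A corresponding sketch for the t-SNE projection; the per-phone averaging (assuming samples are ordered phone by phone) and the perplexity value are assumptions, since the slide only shows the resulting plot.

```python
# t-SNE sketch: project one averaged activation vector per phone to 2-d.
# The (33, 100, features) ordering and the perplexity are assumptions.
from sklearn.manifold import TSNE

X_avg = X.reshape(33, 100, -1).mean(axis=1)  # average the 100 samples per phone
X_2d = TSNE(n_components=2, perplexity=10.0, init="pca",
            random_state=0).fit_transform(X_avg)
```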


SLIDE 11

Clustering methods

We consider two popular clustering techniques, based on linear and non-linear separation respectively:

◮ k-means, computed with the Manhattan distance (a minimal sketch follows this list)

◮ Spectral clustering, which selects the dominant eigenvectors of the Gaussian affinity matrix in order to build a low-dimensional data space in which the data points are grouped into clusters
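scikit-learn's KMeans only supports the Euclidean distance, so a Manhattan-distance variant has to be hand-rolled; the sketch below is one minimal implementation (assignment by cityblock distance, coordinate-wise median update, which minimizes the L1 cost), not the authors' actual code. Spectral clustering with a Gaussian affinity matrix corresponds to `SpectralClustering(affinity="rbf")`, as used in the workflow sketch above.

```python
# Minimal k-means with the Manhattan (L1) distance: cityblock assignment,
# coordinate-wise median update (the median is the L1 minimizer).
import numpy as np
from scipy.spatial.distance import cdist

def kmeans_l1(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        labels = cdist(X, centroids, metric="cityblock").argmin(axis=1)
        new = np.array([np.median(X[labels == j], axis=0)
                        if np.any(labels == j) else centroids[j]
                        for j in range(k)])
        if np.allclose(new, centroids):  # converged
            break
        centroids = new
    return labels, centroids
```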


SLIDE 12

Clustering methods

We consider two popular clustering techniques, based on linear and non-linear separation respectively:

◮ k-means, computed with the Manhattan distance (sketched above)

◮ Spectral clustering, which selects the dominant eigenvectors of the Gaussian affinity matrix in order to build a low-dimensional data space in which the data points are grouped into clusters

Choice of the number of clusters (an illustrative sketch follows this list):

◮ k-means: within- and between-cluster sums of point-to-centroid distances

◮ Spectral clustering: within- and between-cluster affinity measure
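The slide names the criteria but not their exact form; as one hedged illustration only, the ratio of within-cluster to between-cluster point-to-centroid L1 distances can be scanned over candidate cluster counts and the elbow taken. The exact criterion used by the authors may differ.

```python
# Illustrative (assumed) criterion: within- vs between-cluster sums of
# point-to-centroid L1 distances, scanned over candidate cluster counts.
from scipy.spatial.distance import cdist

def within_between_ratio(X, labels, centroids):
    within = sum(cdist(X[labels == j], centroids[j:j + 1], "cityblock").sum()
                 for j in range(len(centroids)))
    between = cdist(centroids, centroids, "cityblock").sum()
    return within / between

scores = {}
for k in range(2, 16):
    labels, cents = kmeans_l1(X_red, k)  # from the sketches above
    scores[k] = within_between_ratio(X_red, labels, cents)
# Pick the "elbow": the k after which the ratio stops improving markedly.
```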


SLIDE 13

Evaluation for experiment 1

Evaluate the resulting clusters with a fixed number of 33 clusters:

$P = \dfrac{tp}{tp + fp}, \qquad R = \dfrac{tp}{tp + fn}, \qquad F = \dfrac{2\,P R}{P + R}$

where tp, fp and fn respectively denote the numbers of true positives, false positives and false negatives.
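A sketch of how these scores could be computed from cluster assignments; mapping each cluster to its majority phone and macro-averaging over phones are assumptions here, as the slide only defines P, R and F.

```python
# Sketch: map each cluster to its majority phone, then macro-average
# precision, recall and F-measure over phones (integer-encoded labels).
import numpy as np

def cluster_prf(phones, clusters):
    phones, clusters = np.asarray(phones), np.asarray(clusters)
    pred = np.empty_like(phones)
    for c in np.unique(clusters):
        idx = clusters == c
        pred[idx] = np.bincount(phones[idx]).argmax()  # majority phone
    P = R = F = 0.0
    classes = np.unique(phones)
    for p in classes:
        tp = np.sum((pred == p) & (phones == p))
        fp = np.sum((pred == p) & (phones != p))
        fn = np.sum((pred != p) & (phones == p))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        P, R = P + prec, R + rec
        F += 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    n = len(classes)
    return P / n, R / n, F / n
```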


SLIDE 14

Experiment 1: 33 clusters

→ Phone-specific clusters become more explicit with layer depth


SLIDE 15

Experiment 2: optimal number of clusters

7 clusters found with spectral clustering (SC):

◮ 3 clusters for the vowels:

  1. 93% of the medium to open vowels: [a], [E], [9]
  2. 83% of the closed vowels: [y], [i], [e]
  3. 60% of the nasal vowels: /a/, /o/, /U/

◮ 4 clusters for the consonants:

  1. 92% of the nasal consonants: /n/, /m/ and /J/
  2. 81% of the fricatives: /S/, /s/, /f/, /Z/
  3. 76% of the rounded vowels: /o/, /u/, /O/, /w/
  4. 68% of the plosive consonants: /p/, /t/, /k/, /b/, /d/, /g/

k-means yields similar clusters.

→ Broad phonetic classes are learned by the network


SLIDE 16

Average activation map example of layer "conv1"

◮ Vowels

◮ This map encodes the mouth aperture (F1) but not the vowel anteriority (F2)


SLIDE 17

Average activation map example of layer "conv1"

◮ Plosives


SLIDE 18

Conclusions and future work

Findings with CNNs are similar to those reported by Nagamine et al. for DNNs:

  1. Phone-specific clusters become more explicit with layer depth
  2. Broad phonetic classes are learned by the network

Ongoing/future work:

◮ Studying the maps that do not correspond to phonemic categories

◮ What is the "gist" of the phone representations for a CNN?


SLIDE 19

Thank you! Q&A

thomas.pellegrini@irit.fr
