Understanding human speech recognition: Reverse-engineering the engineering solution using EMEG and RSA
Cai Wingfield CSLB, Department of Psychology Workshop on Neurocomputation: From Brains to Machines University of Cambridge 25 November 2015
Understanding speech recognition 25 November 2015
Brains and Machines
▸ We’ve seen from previous speakers how:
  ▸ Machine systems are designed to perform the same tasks as humans.
  ▸ The architecture of machine models of (e.g.) vision may relate to that of biological systems.
  ▸ By using methods such as RSA, intermediate-level derived representations in machine systems can be compared with those in the brain.
Speech and vision
▸ Unlike visual objects, speech stimuli are time-sensitive.
▸ There’s no standard neurocomputational model of speech comprehension.
▸ Humans alone amongst animals have this faculty.
▸ The most effective artificial systems’ designs don’t tend to relate to biological models.
▸ However, machines provide a computational model of the process.
Speech recognition
▸ Both human brains and machines can recognise speech accurately.
▸ Both transform raw acoustic input into abstract word “objects”.
▸ Artificial speech recognition (ASR) systems are nearly as good as humans.
▸ In brains, this is mediated by some complex, poorly understood neurobiological process.
▸ We will compare intermediate-level representations in an ASR system and in human auditory cortex using RSA.
HTK: GMM-HMM
[Figure: a triphone GMM-HMM decoder. Context-dependent triphone models (e.g. [sil-w-oh], [w-oh-t], [oh-t-sil]) are chained to recognise the word WHAT; each HMM state emits acoustic frames under a Gaussian mixture model (GMM).]
Young et al. (1997) The HTK Book
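To make the GMM half of the pipeline concrete, here is a minimal sketch of how one HMM state's emission likelihood is evaluated under a diagonal-covariance Gaussian mixture. All parameter values are toy assumptions for illustration, not HTK's actual models:

```python
import numpy as np

def gmm_log_likelihood(x, weights, means, variances):
    """Log-likelihood of observation vector x under a diagonal-covariance
    Gaussian mixture (the emission density of one HMM state)."""
    # Per-component diagonal-Gaussian log densities
    log_norm = -0.5 * np.sum(np.log(2 * np.pi * variances), axis=1)
    log_exp = -0.5 * np.sum((x - means) ** 2 / variances, axis=1)
    comp_ll = np.log(weights) + log_norm + log_exp
    # Log-sum-exp over mixture components for numerical stability
    m = comp_ll.max()
    return m + np.log(np.sum(np.exp(comp_ll - m)))

# Toy state: two mixture components over a 3-dimensional feature vector
weights = np.array([0.6, 0.4])
means = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]])
variances = np.ones((2, 3))
ll = gmm_log_likelihood(np.zeros(3), weights, means, variances)
```

In a full decoder, these per-state likelihoods would be combined with HMM transition probabilities by Viterbi search over the triphone chain.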
Searchlight GLM RSA
[Figure: searchlight GLM RSA. For each searchlight patch, the data RDM is modelled as a weighted sum of dynamic phonetic model RDMs derived from HTK’s state representations:

data RDM = β[ɑ]·RDM[ɑ] + β[æ]·RDM[æ] + … + β[z]·RDM[z] + E

The fitted β values give the contributions of the individual phonetic models.]
Su et al. (2014) Frontiers in Neuroscience
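The GLM step can be sketched in a few lines: vectorise the candidate model RDMs, then fit them to the data RDM by least squares. This is a simplified toy version (synthetic RDMs, plain `lstsq`), not the actual searchlight pipeline of Su et al.:

```python
import numpy as np

rng = np.random.default_rng(0)
n_items = 20
n_pairs = n_items * (n_items - 1) // 2  # upper triangle of the RDM

# Vectorised model RDMs, one per phonetic model (e.g. [ɑ], [æ], [z])
model_rdms = rng.random((3, n_pairs))

# Synthetic data RDM: a weighted sum of the models plus noise
true_betas = np.array([0.8, 0.1, 0.5])
data_rdm = true_betas @ model_rdms + 0.01 * rng.standard_normal(n_pairs)

# GLM: least-squares fit of the model RDMs to the data RDM
X = model_rdms.T                      # (n_pairs, n_models) design matrix
betas, *_ = np.linalg.lstsq(X, data_rdm, rcond=None)
```

Each recovered β estimates how strongly that phonetic model's dissimilarity structure is expressed in the searchlight patch.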
Evidence for sensitivity to phonetic features
[Figure: normalized classifier weights for /ba/ vs /da/, /ba/ vs /ga/ and /da/ vs /ga/, plotted over dorsal–anterior cortex in four subjects. Nature Neuroscience]
Mesgarani et al. (2014) Science
Evidence for sensitivity to phonetic features
Broad categories: Sonorant, Voiced, Obstruent
Place: Labial, Coronal, Dorsal
Manner: Plosive, Fricative, Sibilant, Nasal
Vowel frontness: Front, Central, Back
Vowel closeness: Close, Close-mid, Open-mid, Open
Roundedness: Rounded
IPA phone inventory: ɑː æ ʌ ɔː aʊ aɪ b tʃ d eə e ɜː eɪ f g h ɪə ɪ iː dʒ k l m n ŋ ɒ əʊ ɔɪ p ɹ s ʃ t
Searchlight GLM RSA
The β values give the contributions of the individual phonetic models. For each phonetic feature f, a map of the feature fit is computed as

fit_f = χ_f · β

where χ_f indicates which phone models carry feature f.
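As a worked toy example of fit_f = χ_f · β (all values illustrative, not from the study):

```python
import numpy as np

# β: fitted contribution of each phone model (toy values)
phones = ["ɑ", "s", "z"]
betas = np.array([0.2, 0.7, 0.5])

# χ_f: indicator vector over phones for one feature f (here, sibilance)
chi_sibilant = np.array([0.0, 1.0, 1.0])  # /s/ and /z/ are sibilants

# Feature fit: fit_f = χ_f · β = 0.7 + 0.5 = 1.2
fit_sibilant = chi_sibilant @ betas
```

Computing this dot product at every searchlight location yields the map for that feature.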
Speech recognition
▸ 16 subjects, 400 words, EMEG.
▸ Most features we tested showed significant fit in auditory cortex.
▸ Bilateral HG, STG, STS.
▸ Broad-category features fit best.
▸ Regions on the left tended to be more focussed.
▸ Within-category features showed fits bilaterally.
Wingfield et al. (in prep.) [100, 170] ms
(work in progress)
▸ DNNs have proved very effective in the visual domain.
▸ Hidden-layer representations provide “bottom-up” features which are used to disambiguate speech.
HTK: DNN-HMM
Zhang & Woodland (2015) Submission to InterSpeech
[Figure: DNN-HMM architecture. Input MFCC frames with Δ and Δ² coefficients, stacked over a ±40 ms context window (720 inputs), feed five fully-connected 1000-unit hidden layers, a 26-unit bottleneck (BN) layer, and ~6000 output states.]
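A minimal NumPy sketch of the forward pass through such a bottleneck network. The weights here are random placeholders (a real system would train them with backpropagation), and the layer sizes follow the figure; this illustrates the shape of the computation, not the actual Zhang & Woodland model:

```python
import numpy as np

rng = np.random.default_rng(1)

def layer(x, n_out, activation=True):
    """One fully connected layer with random placeholder weights."""
    w = rng.standard_normal((x.shape[-1], n_out)) * 0.05
    h = x @ w
    return np.maximum(h, 0.0) if activation else h  # ReLU

# 720-dim input: MFCCs + Δ + Δ² stacked over a ±40 ms context window
frame = rng.standard_normal(720)

h = frame
for _ in range(5):
    h = layer(h, 1000)            # five wide hidden layers
bottleneck = layer(h, 26)         # low-dimensional BN layer
logits = layer(bottleneck, 6000, activation=False)  # ~6000 tied states

# Softmax over output states (stabilised by subtracting the max logit)
state_posteriors = np.exp(logits - logits.max())
state_posteriors /= state_posteriors.sum()
```

The key design point is that everything the ~6000-way state classifier needs must pass through the 26-unit bottleneck, forcing a compact intermediate representation of the speech input.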
Individual node responses
▸ The BN architecture provides a low-dimensional feature space sufficient to accurately determine 6000+ phonetic labels.
▸ Dynamic inputs elicit dynamic BN responses.
▸ Can we investigate this BN representation space, and compare it to brain representations?
[Figure: activation of node 04 over time, across words]
Nodes track phonetic features?
[Figure: activation of node 04 over time, across words, alongside a sibilance time-course]
Nodes track phonetic features?
[Figure: activation of node 20 over time, across words, alongside a vowel-frontness time-course]
BN–feature similarity
[Figure: matrix of similarities between BN nodes and phonetic features (scale up to +1), and an MDS plot of the similarity structure]
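One simple way to quantify node–feature similarity is to correlate each node's activation time-course with each feature's on/off time-course. This is a toy sketch with synthetic data (the actual analysis may differ):

```python
import numpy as np

rng = np.random.default_rng(2)
n_timepoints = 200

# Toy activation time-courses for a few BN nodes
nodes = rng.standard_normal((3, n_timepoints))

# Binary feature time-courses (e.g. sibilance on/off at each frame)
features = (rng.random((2, n_timepoints)) > 0.5).astype(float)

# Make node 0 track feature 0, plus noise, so the correlation shows up
nodes[0] = features[0] + 0.3 * rng.standard_normal(n_timepoints)

def zscore(a):
    """Standardise each row to zero mean and unit variance."""
    return (a - a.mean(axis=1, keepdims=True)) / a.std(axis=1, keepdims=True)

# Pearson correlation between every node and every feature
sim = zscore(nodes) @ zscore(features).T / n_timepoints
```

The resulting node-by-feature matrix is what the similarity figure visualises; MDS can then embed its rows in 2-D to show which nodes cluster around which features.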
Summary
▸ We found evidence of regions of articulatory feature representation in human auditory cortex.
▸ We modelled speech-recognition-relevant features using machine systems which perform the task well.
▸ RSA allows comparison of brain states and machine states at the level of representations.
▸ EMEG records rich brain response data over time, non-invasively.
▸ The processes of sound-to-meaning mapping are still poorly understood.
Chao Zhang William Marslen-Wilson Elisabeth Fonteneau Andrew Thwaites Xunying Liu Phil Woodland Li Su
Department of Psychology Department of Engineering Department of Psychiatry
Cai Wingfield