


  1. Understanding human speech recognition: Reverse-engineering the engineering solution using EMEG and RSA
  Cai Wingfield, CSLB, Department of Psychology
  Workshop on Neurocomputation: From Brains to Machines, University of Cambridge, 25 November 2015

  2. Brains and Machines (Understanding speech recognition, 25 November 2015)
  ▸ We have seen from previous speakers how:
  ▸ Machine systems are designed to perform the same tasks as humans.
  ▸ The architectures of machine models of (e.g.) vision may relate to those of biological systems.
  ▸ Using methods such as RSA, intermediate-level derived representations in one can be compared with those in the other.

  3. Speech and vision
  ▸ Unlike visual objects, speech stimuli unfold over time.
  ▸ There is no standard neurocomputational model of speech comprehension.
  ▸ Humans alone among animals have this faculty.
  ▸ The most effective artificial systems' designs do not tend to relate to biological models.
  ▸ Nevertheless, machines provide a computational model of the process.

  4. Speech recognition
  ▸ Both human brains and machines can recognise speech accurately.
  ▸ Recognition transforms raw acoustic input into abstract word “objects” (e.g. “what a lovely day”).
  ▸ Artificial speech recognition (ASR) systems are nearly as good as humans.
  ▸ In brains, this is mediated by a complex, poorly understood neurobiological process.
  ▸ We will compare intermediate-level representations in an ASR system and in human auditory cortex using RSA.

  5. HTK: GMM-HMM
  [Diagram: HTK's GMM-HMM architecture. Triphone HMM states (e.g. [sil-aa-b], [ih-s-jh], [uh-zh-uw]) emit observation probabilities from GMMs; a path through states such as [sil-w-oh] [w-oh-t] [oh-t-sil] yields the word "WHAT".]
  Young et al. (1997), The HTK Book
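The emission side of this architecture can be sketched in a few lines: each HMM state scores an acoustic frame with a diagonal-covariance Gaussian mixture. A minimal NumPy sketch with toy parameters (not HTK's actual models):

```python
import numpy as np

def gmm_log_likelihood(x, weights, means, variances):
    """Log-likelihood of one acoustic frame x under a diagonal-covariance
    Gaussian mixture (the emission density of one HMM state, as in
    GMM-HMM systems such as HTK's)."""
    x = np.asarray(x, dtype=float)
    # Per-component diagonal-Gaussian log densities.
    log_norm = -0.5 * np.sum(np.log(2 * np.pi * variances), axis=1)
    log_exp = -0.5 * np.sum((x - means) ** 2 / variances, axis=1)
    comp = np.log(weights) + log_norm + log_exp
    # Log-sum-exp over mixture components, for numerical stability.
    m = comp.max()
    return m + np.log(np.sum(np.exp(comp - m)))

# Toy example: a 2-component mixture over 3-dimensional features.
w = np.array([0.6, 0.4])
mu = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]])
var = np.ones((2, 3))
ll = gmm_log_likelihood(np.zeros(3), w, mu, var)
```

In a full decoder, these per-state log-likelihoods are combined with HMM transition probabilities in a Viterbi search over the triphone network.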

  6. Searchlight GLM RSA
  Data RDM = β[ɑ]·RDM[ɑ] + β[æ]·RDM[æ] + … + β[z]·RDM[z] + E
  [Diagram: each searchlight-patch data RDM is modelled as a weighted sum of dynamic phonetic model RDMs derived from HTK's states; the βs give the contributions of the individual phonetic models.]
  Su et al. (2014), Frontiers in Neuroscience
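The GLM step above can be sketched numerically: stack the phonetic model RDMs as regressors and solve for the βs by ordinary least squares. A toy NumPy sketch in which the RDMs, stimulus count and β values are all random stand-ins, not the study's data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 40 stimuli, so each RDM has 40*39/2 pairwise entries.
n_pairs = 40 * 39 // 2

# Stand-ins for the HTK-derived phonetic model RDMs ([ɑ], [æ], ..., [z]).
model_rdms = rng.random((5, n_pairs))

# A simulated data RDM from one searchlight patch: a noisy mix of the models.
true_beta = np.array([0.8, 0.0, 0.3, 0.0, 0.5])
data_rdm = true_beta @ model_rdms + 0.05 * rng.standard_normal(n_pairs)

# GLM: data RDM = sum_p beta_p * model RDM_p + error,
# solved by least squares with an intercept column.
X = np.column_stack([np.ones(n_pairs), model_rdms.T])
beta, *_ = np.linalg.lstsq(X, data_rdm, rcond=None)
phone_betas = beta[1:]   # contribution of each phonetic model
```

Repeating this fit at every searchlight location and timepoint gives the spatiotemporal β maps used in the following slides.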

  7. Evidence for sensitivity to phonetic features
  [Figure: normalized classifier weights (0 to 1) over cortex for four subjects, for the contrasts /ba/ vs /da/, /da/ vs /ga/ and /ba/ vs /ga/; dorsal and anterior orientations marked.]
  Chang et al. (2010), Nature Neuroscience; Mesgarani et al. (2014), Science

  8. Evidence for sensitivity to phonetic features
  [Table: binary phonetic-feature assignments for each phone, indexed by IPA symbol and HTK label (aa ae ah ao aw ay b ch d ea eh er ey f g hh ia ih iy jh k l m n ng oh ow oy p r s sh t th uh uw v w y z). Feature groups: broad categories (Sonorant, Voiced, Obstruent); place (Labial, Coronal, Dorsal); manner (Plosive, Fricative, Sibilant, Nasal); vowel frontness (Front, Central, Back); vowel closeness (Close, Close-mid, Open-mid, Open); Rounded.]

  9. Searchlight GLM RSA: map of feature fit
  f = χf · β
  [Figure: the phone–feature table from the previous slide combined with the per-phone βs (contributions of individual phonetic models [ɑ] … [z]): each feature's incidence vector χf is dotted with the βs to give a fit value for that feature.]
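The feature-fit map combines each feature's phone-incidence vector χf with the per-phone βs from the GLM. A toy sketch, assuming a small invented phone inventory and made-up β values (only an excerpt of the full feature table on the slide):

```python
import numpy as np

# Hypothetical phone inventory and per-phone RSA betas at one vertex.
phones = ["s", "sh", "z", "m", "n"]
beta = np.array([0.9, 0.7, 0.8, 0.1, 0.0])

# Binary incidence vectors chi_f: which phones carry each feature.
features = {
    "Sibilant": np.array([1, 1, 1, 0, 0]),
    "Nasal":    np.array([0, 0, 0, 1, 1]),
}

# Map of feature fit: f = chi_f . beta, one value per feature;
# computed per vertex/timepoint, this builds the cortical feature maps.
feature_fit = {name: float(chi @ beta) for name, chi in features.items()}
```

Here the sibilant phones carry large βs, so the sibilance feature scores highly at this (hypothetical) vertex.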

  10. Results
  ▸ 16 subjects, 400 words, EMEG.
  ▸ Most features we tested showed a significant fit in auditory cortex.
  ▸ Bilateral HG, STG and STS.
  ▸ Broad-category features fit best on the right.
  ▸ Regions on the left tended to be more focussed.
  ▸ Within-category features showed fits bilaterally.
  [100, 170] ms window. Wingfield et al. (in prep.)

  11. Moving forward: DNN-based ASR (work in progress)
  ▸ DNNs have proved very effective in the visual domain.
  ▸ Hidden-layer representations provide “bottom-up” features which are used to disambiguate speech.

  12. HTK: DNN-HMM
  [Diagram: bottleneck DNN acoustic model. Inputs are MFCCs, ΔMFCCs and Δ²MFCCs (26 coefficients per [0, 25] ms frame) over a ±40 ms context (720 fully-connected input units); 1000-unit hidden layers feed a 26-unit bottleneck (BN) layer, with ~6000 output targets.]
  Zhang & Woodland (2015), submission to InterSpeech
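For illustration, a bottleneck forward pass of this shape might be sketched as below. The exact layer ordering and the random weights are assumptions for the sketch, not Zhang & Woodland's trained system:

```python
import numpy as np

rng = np.random.default_rng(0)

# Layer sizes following the slide: 720 inputs (MFCC + Δ + Δ² over a
# ±40 ms context), 1000-unit hidden layers, a 26-unit bottleneck (BN)
# layer, and ~6000 output targets. The ordering is an illustrative guess.
sizes = [720, 1000, 1000, 26, 1000, 6000]

# Small random weights stand in for trained parameters (biases omitted).
weights = [rng.standard_normal((m, n)) * 0.01 for m, n in zip(sizes, sizes[1:])]

def forward(x):
    """Forward pass; returns the softmax output and the BN activations."""
    bn = None
    for i, W in enumerate(weights):
        x = x @ W
        if i < len(weights) - 1:
            x = np.maximum(x, 0.0)   # ReLU on hidden layers
        if sizes[i + 1] == 26:
            bn = x                   # capture the low-dimensional BN features
    e = np.exp(x - x.max())          # softmax over the output targets
    return e / e.sum(), bn

probs, bn_features = forward(rng.standard_normal(720))
```

The 26-dimensional `bn_features` vector is the representation compared against brain responses in the slides that follow.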

  13. Individual node responses
  ▸ The BN architecture provides a low-dimensional feature space sufficient to accurately determine 6000+ phonetic labels.
  ▸ Dynamic inputs elicit dynamic BN responses.
  ▸ Can we investigate this BN representation space, and compare it to brain representations?
  [Heat map: Node 04 activation (− to +) across words over time.]

  14. Nodes track phonetic features?
  [Heat maps: Node 04 activation across words over time, alongside a sibilance feature map over the same words and times.]

  15. Nodes track phonetic features?
  [Heat maps: Node 20 activation across words over time, alongside a vowel-frontness feature map over the same words and times.]
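One simple way to quantify "this node tracks that feature" is to correlate a node's activation time course with a binary indicator of the feature. A toy sketch with simulated data (the node name, signal strength and noise level are all assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data: activation of one BN node over 200 frames, and a
# binary indicator of whether the current phone is sibilant.
sibilant = (rng.random(200) < 0.3).astype(float)
node_04 = 0.8 * sibilant + 0.2 * rng.standard_normal(200)

# Pearson correlation between the node's time course and the feature:
# a high value suggests the node tracks that phonetic feature.
r = np.corrcoef(node_04, sibilant)[0, 1]
```

Computing such correlations for every node-feature pair yields a node-by-feature similarity matrix, of the kind visualised on the next slide.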

  16. BN–feature similarity
  [MDS plot of similarities between nodes and features; colour scale from −1 to +1.]
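A node-by-feature similarity matrix like this can be embedded in two dimensions with classical MDS. A sketch using a random stand-in similarity matrix (the matrix size and values are assumptions) and a plain NumPy implementation of classical scaling:

```python
import numpy as np

rng = np.random.default_rng(2)

# Stand-in node-by-feature similarity matrix in [-1, +1]
# (e.g. correlations between BN node responses and feature models).
sim = np.clip(rng.standard_normal((10, 4)), -1.0, 1.0)

# Treat each node's row of similarities as a point; use Euclidean
# distances between rows as the dissimilarities for MDS.
dist = np.linalg.norm(sim[:, None, :] - sim[None, :, :], axis=-1)

# Classical MDS: double-centre the squared distances, then take the
# top-2 eigenvectors of the resulting Gram matrix as 2-D coordinates.
n = dist.shape[0]
J = np.eye(n) - np.ones((n, n)) / n
B = -0.5 * J @ (dist ** 2) @ J
vals, vecs = np.linalg.eigh(B)
order = np.argsort(vals)[::-1][:2]
coords = vecs[:, order] * np.sqrt(np.maximum(vals[order], 0.0))
```

Nodes that respond similarly across features land near one another in the 2-D embedding.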

  17. Summary
  ▸ We found evidence of regions of articulatory-feature representation in human auditory cortex.
  ▸ We modelled speech-recognition-relevant features using machine systems which perform the task well.
  ▸ RSA allows comparison of brain states and machine states at the level of representations.
  ▸ EMEG records rich brain-response data over time, non-invasively.
  ▸ The processes of sound-to-meaning mapping are still poorly understood.

  18. Department of Psychology: Andrew Thwaites, Elisabeth Fonteneau, Cai Wingfield, William Marslen-Wilson
  Department of Engineering: Xunying Liu, Chao Zhang, Phil Woodland
  Department of Psychiatry: Li Su
