Understanding human speech recognition: Reverse-engineering the engineering solution using EMEG and RSA — PowerPoint PPT Presentation


Cai Wingfield, CSLB, Department of Psychology. Workshop on Neurocomputation: From Brains to Machines, University of Cambridge, 25 November 2015.


SLIDE 1

Understanding human speech recognition: Reverse-engineering the engineering solution using EMEG and RSA

Cai Wingfield, CSLB, Department of Psychology
Workshop on Neurocomputation: From Brains to Machines, University of Cambridge, 25 November 2015

SLIDE 2

Understanding speech recognition 25 November 2015

Brains and Machines

▸ We’ve seen from previous speakers how:
▸ Machine systems are designed to perform the same tasks as humans.
▸ The architecture of machine models of (e.g.) vision may relate to those of biological systems.
▸ By using methods such as RSA, intermediate-level derived representations in one may be compared to those in the other.
SLIDE 3

Speech and vision

▸ Unlike visual objects, speech stimuli are time-sensitive.
▸ There’s no standard neurocomputational model of speech comprehension.
▸ Humans alone amongst animals have this faculty.
▸ The most effective artificial systems’ designs don’t tend to relate to biological models.
▸ However, machines provide a computational model of the process.

SLIDE 4

Speech recognition

▸ Both human brains and machines can recognise speech accurately.
▸ Each transforms raw acoustic input into abstract word “objects” (e.g. “what a lovely day”).
▸ Artificial (ASR) systems are nearly as good as humans.
▸ In brains, this is mediated by some complex, poorly understood neurobiological process.
▸ We will compare intermediate-level representations in an ASR system and human auditory cortex using RSA.

SLIDE 5

HTK: GMM-HMM

[Figure: HTK’s GMM-HMM recogniser. Context-dependent triphone HMMs (e.g. [sil-aa-b], [ih-s-k], [uh-zh-uw], …) emit observation probabilities p from Gaussian mixture models (GMMs); decoding the triphone sequence [sil-w-oh] [w-oh-t] [oh-t-sil] recovers the word WHAT.]

Young et al. (1997), The HTK Book

SLIDE 6

Searchlight GLM RSA

[Figure: searchlight GLM RSA. The data RDM from each searchlight patch is modelled as a weighted sum of dynamic phonetic model RDMs built from HTK’s state activations:

data RDM = β[ɑ]·RDM[ɑ] + β[æ]·RDM[æ] + … + β[z]·RDM[z] + E

The fitted β weights give the contributions of the individual phonetic models.]

Su et al. (2014), Frontiers in Neuroscience
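The GLM step can be sketched in a few lines: vectorise the upper triangle of each RDM and solve a least-squares problem for the β weights. This is a minimal sketch, not the published pipeline (function and variable names are illustrative assumptions, and a plain OLS solver stands in for the full procedure with significance testing):

```python
import numpy as np

def fit_rdm_glm(data_rdm, model_rdms):
    """Fit a data RDM as a weighted sum of model RDMs by ordinary least squares.

    data_rdm: (n, n) symmetric dissimilarity matrix from one searchlight patch.
    model_rdms: dict mapping phone label -> (n, n) model RDM.
    Returns a dict of beta weights, one per phonetic model.
    """
    iu = np.triu_indices(data_rdm.shape[0], k=1)   # vectorise upper triangle
    y = data_rdm[iu]
    labels = sorted(model_rdms)
    X = np.column_stack([model_rdms[label][iu] for label in labels])
    X = np.column_stack([X, np.ones(len(y))])      # intercept column
    betas, *_ = np.linalg.lstsq(X, y, rcond=None)
    return dict(zip(labels, betas[:-1]))           # drop the intercept
```

Running this over every searchlight patch and time window yields, for each, one β per phonetic model, as on the slide.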

SLIDE 7

Evidence for sensitivity to phonetic features

[Figure: normalised classifier weights for /ba/ vs /da/, /ba/ vs /ga/ and /da/ vs /ga/ discriminations, mapped over dorsal and anterior auditory cortex in four subjects.]

Chang et al. (2010), Nature Neuroscience; Mesgarani et al. (2014), Science

SLIDE 8

Evidence for sensitivity to phonetic features

[Table: binary phone–feature incidence matrix. Each phone is listed in IPA (ɑː, æ, ʌ, ɔː, …, z) with its HTK label (aa, ae, ah, ao, …, z), and marked 1 for each feature it carries: broad categories (Sonorant, Voiced, Obstruent), place (Labial, Coronal, Dorsal), manner (Plosive, Fricative, Sibilant, Nasal), vowel frontness (Front, Central, Back), vowel closeness (Close, Close-mid, Open-mid, Open), and Rounded.]

SLIDE 9

Searchlight GLM RSA

[Figure: from the searchlight GLM, the vector β of individual phonetic-model contributions is combined with a feature’s row χ_f of the phone–feature incidence matrix to give a map for that feature:

fit_f = χ_f · β]
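The projection fit_f = χ_f · β amounts to summing, for each feature, the betas of the phones that carry it. A toy sketch with hypothetical beta values (the function and its names are illustrative, not the authors’ code):

```python
def feature_fit(betas, incidence):
    """Combine per-phone GLM betas with binary feature incidence rows:
    fit_f = chi_f . beta, i.e. sum the betas of all phones carrying feature f.

    betas: dict phone -> beta weight from the searchlight RDM GLM.
    incidence: dict feature -> set of phones marked 1 for that feature.
    """
    return {feature: sum(betas.get(phone, 0.0) for phone in phones)
            for feature, phones in incidence.items()}

# Hypothetical betas for one searchlight patch:
betas = {"s": 0.4, "z": 0.3, "m": 0.1}
fits = feature_fit(betas, {"Sibilant": {"s", "z"}, "Nasal": {"m"}})
```

Evaluated at every patch, these per-feature fits form the spatial maps reported on the next slide.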
SLIDE 10

Speech recognition

▸ 16 subjects, 400 words, EMEG.
▸ Most features we tested showed significant fit in auditory cortex.
▸ Bilateral HG, STG, STS.
▸ Broad-category features fit best on the right.
▸ Regions on the left tended to be more focussed.
▸ Within-category features showed fits bilaterally.

Wingfield et al. (in prep.), [100, 170] ms

SLIDE 11

Moving forward: DNN-based ASR

(work in progress)

▸ DNNs have proved very effective in the visual domain.
▸ Hidden-layer representations provide “bottom-up” features which are used to disambiguate speech.

SLIDE 12

HTK: DNN-HMM

Zhang & Woodland (2015), submission to InterSpeech

[Figure: HTK DNN-HMM acoustic model. Inputs are MFCCs, ΔMFCCs and Δ²MFCCs (720 values per [0, 25] ms frame with ±40 ms context), feeding fully connected (f.c.) hidden layers of 1000 units, a 26-unit bottleneck (BN) layer, and an output layer of ~6000 states.]
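The layer sizes above can be sketched as a forward pass that reads out the bottleneck activations. This is an illustrative, untrained toy with random weights (ReLU units are an assumption here; the actual HTK DNN may use a different nonlinearity), not the HTK implementation:

```python
import numpy as np

rng = np.random.default_rng(1)

def layer(n_in, n_out):
    """One fully connected layer with (here random, untrained) parameters."""
    return rng.standard_normal((n_in, n_out)) * 0.05, np.zeros(n_out)

def forward_to_bottleneck(x, layers, bn_index):
    """Run the MLP forward and return the bottleneck layer's activations."""
    for i, (W, b) in enumerate(layers):
        x = np.maximum(x @ W + b, 0.0)   # ReLU hidden units (assumed)
        if i == bn_index:
            return x                     # low-dimensional BN features
    return x

# Shapes following the slide: 720 inputs, 1000-unit hidden layers,
# a 26-unit bottleneck, then the stack continues towards the output states.
sizes = [720, 1000, 1000, 1000, 26, 1000]
layers = [layer(a, b) for a, b in zip(sizes, sizes[1:])]
frame = rng.standard_normal(720)         # one stacked-MFCC input frame
bn = forward_to_bottleneck(frame, layers, bn_index=3)
assert bn.shape == (26,)
```

Feeding successive input frames through such a network yields the dynamic 26-dimensional BN responses examined on the following slides.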

SLIDE 13

Individual node responses

▸ The BN architecture provides a low-dimensional feature space sufficient to accurately determine 6000+ phonetic labels.
▸ Dynamic inputs elicit dynamic BN responses.
▸ Can we investigate this BN representation space, and compare it to brain representations?

[Figure: activation of BN node 04 over time, one row per word, on a −/+ colour scale.]
SLIDE 14

Nodes track phonetic features?

[Figure: time × words activation maps for BN node 04 alongside a sibilance feature map, both on a −/+ colour scale.]
SLIDE 15

Nodes track phonetic features?

[Figure: time × words activation maps for BN node 20 alongside a vowel-frontness feature map, both on a −/+ colour scale.]
SLIDE 16

BN–feature similarity

[Figure: matrix of similarities between BN nodes and phonetic features, on a scale from −1 to +1, with an MDS projection of the similarity structure.]
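Node–feature similarities of this kind can be computed as Pearson correlations between each node’s activation time-course and each feature’s indicator time-course; an MDS of the resulting matrix (e.g. with sklearn.manifold.MDS) then visualises its structure. A minimal illustrative sketch (names are assumptions, not the authors’ code):

```python
import numpy as np

def node_feature_similarity(node_acts, feature_tracks):
    """Correlate each BN node's activation time-course with each phonetic
    feature's indicator time-course.

    node_acts: (nodes, T) array of node activations over time.
    feature_tracks: (features, T) array of feature time-courses.
    Returns a (nodes, features) matrix of Pearson correlations in [-1, +1].
    """
    def z(M):                                  # z-score each row over time
        M = M - M.mean(axis=1, keepdims=True)
        return M / M.std(axis=1, keepdims=True)
    N, F = z(node_acts), z(feature_tracks)
    return (N @ F.T) / node_acts.shape[1]      # mean of products = Pearson r
```

A node whose activation rises and falls exactly with, say, sibilance gets a similarity of +1 against that feature, and one with the opposite pattern gets −1, matching the colour scale on the slide.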

SLIDE 17

Summary

▸ We found evidence of regions of articulatory feature representation in human auditory cortex.
▸ We modelled speech-recognition-relevant features using machine systems which perform the task well.
▸ RSA allows comparison of brain states and machine states at the level of representations.
▸ EMEG records rich brain response data over time, non-invasively.
▸ The processes of sound-to-meaning mapping are still poorly understood.

SLIDE 18

Chao Zhang, William Marslen-Wilson, Elisabeth Fonteneau, Andrew Thwaites, Xunying Liu, Phil Woodland, Li Su

Department of Psychology, Department of Engineering, Department of Psychiatry

Cai Wingfield