From Speech Perception to Language
Andrew Nevins (Harvard University). Lectures at Universidade Federal do Rio de Janeiro, May 2006
Your background? Syllable? Heavy syllable? Stress? Secondary stress? Vowel reduction? Graphs/quantitative data? Top-down vs. bottom-up processing?
Abso-bloomin’-lutely
perVERT vs. PERvert: stress can indicate lexical contrasts. Its acoustic correlates involve greater duration, greater amplitude, and pitch contour on the stressed syllable (the vowel carries most of this).
In languages like English and Russian, stress is not always fixed in the same position, so it can be used to contrast different words (e.g. trústy vs. trustée; or pi.sál vs. pí.sal, a mistake to be careful of!) In other languages (Czech, French, Turkish, Polish, Finnish,...) stress is always in a fixed location (e.g. always on the 1st syllable in Czech, always on the last syllable in French, etc.)
Canyoureadthiseasilywithoutpunctuationorspaces? Vroomen 1998: learners can use stress as a cue for word boundaries in an artificial word-monitoring task. Infants at 7.5 months can already segment words from fluent speech. The “bnick” strategy is one way (“stab nick”). Metrical segmentation strategy...
English speakers sometimes (accidentally) take the stressed syllable to be evidence for a word boundary. Thus in a must to a.vóid, a common “slip of the ear” is something like a muscular void. 9-month-old infants prefer to listen to strong-weak words (róbin) over weak-strong words (giráffe) [Jusczyk, Cutler & Redanz 93]
How do infants learn new words? How do they separate the target word from the surrounding context? “Fast mapping” and Carey’s chromium study. Brent & Siskind: one-word utterances occur only 9% of the time. Boundaries between words are not marked by acoustic events.
Two groups of infants: one heard cup and dog during the familiarization phase, the other group heard feet and bike. “The cup was bright and shiny.” “Meg put her cup back on the table.” During the test phase, both groups heard sentences with all 4 words. A later experiment showed they had no preference for tup, bog, zeet, gike: they are not storing words “coarsely”.
Allophony: aspiration vs. flapping vs. glottalization (notate, notable, note). Transitional probabilities? TP(A,B) = P(AB)/P(A): in “prettybaby”, TP(pre,tty) > TP(tty,ba). “Local minima” of TPs might be used to find word boundaries. Based on 2 minutes(!) of exposure to a stream like pulikiberagafodaru, infants have a preference for words with high internal TP (Saffran et al.). Note that statistics are not a panacea...
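To make the transitional-probability idea concrete, here is a minimal sketch in Python. The toy syllable stream, the word "doggy", and the function names are illustrative assumptions, not materials from the Saffran et al. study; the sketch only shows TP(A,B) = count(AB)/count(A) and boundary placement at local minima.

```python
from collections import Counter

def transitional_probabilities(syllables):
    """Return TP(a, b) = count(a followed by b) / count(a) for each adjacent pair."""
    pair_counts = Counter(zip(syllables, syllables[1:]))
    # Count a syllable only where it has a successor, so TPs out of it sum to 1.
    first_counts = Counter(syllables[:-1])
    return {pair: n / first_counts[pair[0]] for pair, n in pair_counts.items()}

def segment_at_local_minima(syllables, tps):
    """Put a word boundary wherever a TP is lower than both neighbouring TPs."""
    seq = [tps[(a, b)] for a, b in zip(syllables, syllables[1:])]
    words, current = [], [syllables[0]]
    for i, tp in enumerate(seq):
        left = seq[i - 1] if i > 0 else float("inf")
        right = seq[i + 1] if i + 1 < len(seq) else float("inf")
        if tp < left and tp < right:
            words.append("".join(current))
            current = []
        current.append(syllables[i + 1])
    words.append("".join(current))
    return words

# Hypothetical toy stream (made up for illustration): pretty baby pretty doggy ...
stream = ["pre", "tty", "ba", "by", "pre", "tty", "do", "ggy",
          "ba", "by", "do", "ggy", "pre", "tty", "ba", "by"]
tps = transitional_probabilities(stream)
segmented = segment_at_local_minima(stream, tps)
```

In this toy stream the word-internal TPs are all 1, while TPs across word boundaries are lower, so the local minima line up with the word edges. Real speech is far noisier, which is one reason statistics alone are not a panacea.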
The Unique Stress Constraint (USC): chewbácca vs. dárthváder. Take the sequence WSSSW: the USC will automatically segment this as [WS][S][SW]. Take SWWWS: already you know there are 2 words, and probabilities can work on the medial W’s.
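The two examples above can be sketched as a small procedure. This is an illustration under simple assumptions (each word carries exactly one S; the function name `usc_segment` is hypothetical), not the model from Yang & Gambell. Medial runs of W between two S's are returned as separate chunks, because the constraint alone does not say which word they belong to.

```python
def usc_segment(pattern):
    """Split a string of S (strong) and W (weak) syllables under the USC.

    Chunks containing an S are (partial) words; chunks made only of W are
    medial weak syllables whose affiliation the constraint leaves open.
    """
    if not pattern or set(pattern) - {"S", "W"} or "S" not in pattern:
        raise ValueError("pattern must be a string of S and W with at least one S")
    first, last = pattern.index("S"), pattern.rindex("S")
    chunks = []
    # Leading weak syllables can only belong to the first stressed word.
    current = pattern[:first]
    i = first
    while i <= last:
        if pattern[i] == "S":
            chunks.append(current + "S")
            current = ""
            i += 1
        else:
            j = i
            while pattern[j] == "W":
                j += 1
            chunks.append(pattern[i:j])  # medial W run: left for other cues
            i = j
    # Trailing weak syllables can only belong to the last stressed word.
    chunks[-1] += pattern[last + 1:]
    return chunks

example_1 = usc_segment("WSSSW")  # adjacent S's force boundaries
example_2 = usc_segment("SWWWS")  # two words; the WWW run is undecided
```

The number of words always equals the number of S's, so for SWWWS the learner knows there are two words and only needs another cue, such as transitional probabilities, to place the boundary among the medial W's.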
Yang & Gambell 2005
If you already know “big”, then extracting “snake” from bigsnake is easy. Kids seem to do this, saying “two dults”, perhaps after doing subtraction on adult, and “I was hajve” after behave.
Yang & Gambell 2005
How much of perception is guided by “training”: what language you already speak? “Top-down” influences on processing
The effects of contrastive status (a linguistic property about the way the lexicon is built up)
It is well known that one’s native phonology affects one’s ability to perceive segmental contrasts; e.g. the difficulty of [l]/[r] perception by Japanese speakers Dupoux & Peperkamp suggest that it may also affect one’s ability to perceive suprasegmental contrasts
Subjects are required to learn 2 CVCV nonwords that differ only in (a) place of articulation of C2 or (b) stress, and to transcribe auditorily presented sequences, i.e. kúpi-kúti vs. mípa-mipá
Longer duration for stressed σ Higher F0 for stressed σ Dupoux & Peperkamp
Speakers of fixed-stress languages are comparatively bad at perceiving contrastive stress
Note that Finnish has fixed initial stress and French has fixed final stress, whereas stress in Spanish is contrastive
“Those guys talk fast!” “I can’t find the word boundaries!”
Syllable-timed rhythm (Sp., It.) vs. stress-timed rhythm (Eng., Du.). Lloyd James/Kenneth Pike: “Machine-gun languages versus Morse-code languages”
Durational isochrony (“even spacing”) is not experimentally upheld. Phonological characteristics (Dauer 1983): (a) more syllable types in stress-timed languages (e.g. CCVC, VCC, etc.); (b) reduction of unstressed syllables. Yet Catalan has the same syllable structure as Spanish, but has vowel reduction; Polish allows complex syllable types, but has no reduction.
Take a look at the spectrogram... which is more salient? Ramus et al. measured vowel/consonant intervals. “Next Tuesday on”: [n][e][kst][u][sd][eio][n]. Measures: %V and Variance(C).
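A rough sketch of how the two measures could be computed from a segmented utterance. The durations below are invented for illustration (they are not measurements from Ramus et al.), and the spread of consonantal intervals is computed here as a population standard deviation, which is an assumption about how "Variance(C)" is operationalized.

```python
from statistics import pstdev

def rhythm_measures(intervals):
    """intervals: list of (kind, duration_ms) pairs, kind being 'V' or 'C'.

    Returns (%V, spread of consonantal interval durations).
    """
    vocalic = [d for kind, d in intervals if kind == "V"]
    consonantal = [d for kind, d in intervals if kind == "C"]
    percent_v = 100 * sum(vocalic) / (sum(vocalic) + sum(consonantal))
    delta_c = pstdev(consonantal)  # assumption: standard deviation, in ms
    return percent_v, delta_c

# "Next Tuesday on" as [n][e][kst][u][sd][eio][n]; durations are hypothetical.
next_tuesday_on = [
    ("C", 60), ("V", 80), ("C", 180), ("V", 100),
    ("C", 120), ("V", 200), ("C", 60),
]
percent_v, delta_c = rhythm_measures(next_tuesday_on)
```

A language with many consonant clusters and vowel reduction would tend to show a lower %V and a larger consonantal spread than a language with mostly CV syllables, which is the intuition behind using these two numbers to separate rhythm classes.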
The next local elections will take place during the winter Le prossime elezioni locali avranno luogo in inverno Tsugi no chiho senkyo wa haruni okonawareru daru Infants hear speech filtered at 400 Hz...
Homework assignment distribution: three parts Feel free to ask questions! nevins@fas.harvard.edu Individual appointments possible Requests for next week’s discussion are encouraged
What makes something a category? How does “speech mode” influence perception?
Suppose A, B are a pair of sounds used contrastively in a language, and A, B differ only along a single acoustic dimension. Tokens of sounds produced in between the extremes of “A”-ness and “B”-ness may be perceived differently depending on whether A and B are used contrastively in the language.
The only acoustic difference: [la] has falling F3 and [ra] has rising F3. Liberman et al. presented a continuum
linguistic stimuli and non-linguistic stimuli.
items 5-8 are categorized as “A” 100% items 5-8 are categorized as “B” 0%
items 1-4 are categorized as “A” 100% items 1-4 are categorized as “B” 0%
Idealized Categorization: 8 stimuli vary along an acoustic dimension in even steps. Nonetheless, they are perceived as belonging to 2 distinct groups
There is a point between each adjacent stimulus
to correctly guess “identical” or “not identical”. Within a “category”, subjects cannot reliably discriminate two acoustically different stimuli
But across “category”, they are perfect, even though the acoustic difference here is the same as other pairs
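The idealized pattern can be written down as a tiny model: discrimination is predicted entirely from the labels. This is a sketch under stated assumptions (8 stimuli, items 1-4 labelled "A", a boundary between items 4 and 5, hypothetical function names), not a description of any particular experiment.

```python
def identify(stimulus, boundary=4.5):
    """Idealized identification: a step function along the continuum."""
    return "A" if stimulus < boundary else "B"

def predicted_discrimination(s1, s2, boundary=4.5):
    """Predicted proportion correct on a same/different judgement.

    If the two stimuli get different labels, discrimination is perfect (1.0);
    if they get the same label, the listener can only guess (0.5).
    """
    return 1.0 if identify(s1, boundary) != identify(s2, boundary) else 0.5

# Every adjacent pair is the same acoustic distance apart (one step).
adjacent_pairs = {(i, i + 1): predicted_discrimination(i, i + 1) for i in range(1, 8)}
```

Only the pair (4, 5) straddles the boundary, so it is the only pair predicted to be discriminated, even though its acoustic difference is the same as that of every other adjacent pair.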
English speakers: stimuli 1-6 categorized as [ra] around 100%; stimuli 7-8 not reliably categorized; stimuli 9-13 categorized as [ra] around 0%. Discrimination of stimuli 3 steps apart varied near the category boundary for English speakers; the discrimination function shows no pattern for Japanese speakers
MMN only for Hindi speakers when -50ms stimulus presented after sequence of -10ms stimuli
On the stimuli that were F3 transitions alone, both populations had non-categorical perception
If a 2-way distinction is contrastive in the language, will it show the unimodal pattern, in which most actual utterances are centered around the middle?
Hint: think about humans’ identification function when there are two contrastive categories along such a continuum
These “bell-curved” distribution functions, with the highest frequency centered symmetrically around a mean, are called Gaussian distributions.
Infants heard 16 tokens on an 8-point ta-da continuum, plus 4 ma and 4 la; 2.3 minutes total. Then they were presented tokens 3 & 6, and tokens 1 & 8. Infants in the bimodal condition looked longer in general. They also looked longer when 3/6 were presented in sequence than when 1 or 8 was presented alone.
These are also not in a unimodal distribution (though we don’t really have evidence that they are “as bimodal” as contrastive pairs). Learning that two categories are allophonic requires noticing that they are found in completely distinct environments. (Notice that Maye et al.’s kids heard the stimuli in identical environments: word-initial and followed by the same vowel.)
Going back to the l/r study, it was interesting that Japanese speakers could distinguish F3 transitions when presented alone
da vs. ga are also distinguished by F3. Duplex Perception (Liberman et al.): the third formant of a da/ga continuum is played to one ear, and the rest of the sound is played to the other ear
Subjects report hearing both a whistle/tone (F3 transition alone) and a da or ga. What does this suggest about which “modules” are passed, and which process, which information?
When played in isolation, the whistles were not perceived categorically
Duplex Perception: Two Modes at Once
Location {head, chin, nose, chest} Movement {circle, arc, wiggle, fingers} Handshape {5,A,G} Palm orientation {out, in, side} Baker, Idsardi, Michnick-Golinkoff, and Petitto (2005)...
15 English-speaking, non-ASL adults 15 ASL-speaking adults Continuum between two handshapes created by measuring finger distances and dividing into even steps 11 points along continuum, with 5/6 as category boundary (as determined in separate identification task) All of the points along the continua were meaningless in ASL (as “ba” and “pa” are in English)
On the discrimination task: if subjects answer “different” 100% of the time, they have perfect accuracy on different pairs, but 0 accuracy on same pairs. So, instead, measure using d’: a number which measures the ability to accurately & reliably tell when two stimuli are different. Hit: stimuli are diff., subject says they’re diff. Correct rejection: stimuli are same, subject says same. False alarm: stimuli are same, subject says they’re diff. Miss: stimuli are diff., subject says they’re same.
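As a hedged illustration of why d’ solves the “always say different” problem, here is the textbook yes/no formula d’ = z(hit rate) - z(false-alarm rate). The counts are made up, the 0.5 correction is one common convention for avoiding infinite z-scores, and the study itself may have used a model specific to same/different designs; treat this as a sketch only.

```python
from statistics import NormalDist

def d_prime(hits, misses, false_alarms, correct_rejections):
    """Sensitivity index from response counts on 'different' and 'same' trials."""
    # Log-linear correction: add 0.5 per cell so rates of 0 or 1 stay finite.
    hit_rate = (hits + 0.5) / (hits + misses + 1)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1)
    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    return z(hit_rate) - z(fa_rate)

# Hypothetical subject who answers "different" on every trial.
always_different = d_prime(hits=20, misses=0, false_alarms=20, correct_rejections=0)

# Hypothetical subject who actually tells the pairs apart most of the time.
discriminating = d_prime(hits=18, misses=2, false_alarms=3, correct_rejections=17)
```

The subject who always answers “different” gets a d’ of 0 despite perfect accuracy on different pairs, because the hit rate and the false-alarm rate are equal; the second subject gets a clearly positive d’.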
Handshape Continuum from “5” to “Flat 0”
ASL speakers show radically different signal detection rates when within vs. across a contrastive category
English speakers show no such trend
Handshape Continuum from “B-bar” to “A-bar”
ASL speakers show radically different signal detection rates when within vs. across a contrastive category
English speakers show no such trend
A linguistic “illusion” based on Japanese tendency for vowel epenthesis: icecream: isu-kurim, christmas: kurisumasu Percept of [u] judgements along a durational continuum for nonce words like [abuno], [ebuzo] Again, this reflects top-down influence
perception of a vowel that isn’t there!
[hjumanz kan ditekt up to tweni faiv segments evri sekond] But for non-linguistic stimuli, sounds can be identified no faster than 7-9 items per second How can linguistic segments be perceived so much faster? Perhaps if they are not perceived at the level
units...
These two instances of “d” have little in common acoustically. The consonant is “carried” as part of the formant transitions of the vowel
Mehler 1981: subjects told to detect pa were faster to do so in pa.lace than in pal.mier; subjects told to detect pal were faster to do so in pal.mier than in pa.lace Ferrand 1994 replicated this result with a naming task
Syllable structure plays a role in lexical access above & beyond that
(a) Uses a reduced set of possible sounds found in spoken language (b) Organized around CV sequences (c) Used without apparent meaning or reference Is it a fundamentally motoric behavior, akin to crawling, a “motor flexing” of the mouth and jaw muscles? Or is it a linguistic activity which rehearses the syllabary?
In response to Petitto’s claim that manual babbling exists, other researchers proposed that it also just reflects motoric development. Petitto (2004) compared two groups of hearing babies: one group had deaf parents. If babbling is linguistic, then hearing babies of deaf parents should exhibit (a) a distinction between linguistic and non-linguistic hand movements at 7 months and (b) non-linguistic hand movements similar to those of the other group.
Infrared emitting diodes with 0.1mm sensitivity placed on babies’ hands while they were in play sessions; videotaped too. Any movement segment (e.g. open-close, waving, etc.) counted; any time objects were in their hands did not count Finally, ASL syllables are only produced within a limited space in front of the signer’s body, basically from above shoulders to below sternum, in front of body. Their finding: both groups produced sets of 2.5-3 Hz movements. Only the babies exposed to ASL produced a consistent set of 1 Hz movements.
Notice sign-exposed group had less 3Hz activity than the speech-exposed group
Babbling correlates with the ambient language
Auditory ba + visual ga = da
The effect works on perceivers with all language backgrounds (e.g., Massaro, Cohen, Gesi, Heredia, & Tsuzaki, 1993; Sekiyama & Tohkura, 1993). The effect works on young infants (Rosenblum, Schmuckler, & Johnson, 1997). The effect works when the visual and auditory components are from speakers of different genders (Green, Kuhl, Meltzoff, & Stevens, 1991). The effect works with highly reduced face images (Rosenblum & Saldaña, 1996). The effect works when observers are unaware that they are looking at a face (Rosenblum & Saldaña, 1996). The effect works when observers touch, rather than look at, the face (Fowler & Dekle, 1991).
The McGurk effect is an additional example of top-down influence on perception
It has also been taken to support the motor theory of speech perception, in which all percepts of speech involve perceiving the motoric gestures that were required to make them, too
And more on vowels!
Categorical perception more robustly found for consonants than vowels, which may be perceived based on “prototypes” (More on how vowel perception works in a bit) Do consonants “matter more” for lexical contrast?
Cultural effects on gender differences in vowels
Speaker normalization
A discontinuous three-consonant root embodies core encyclopedic “concepts”
Roots are put into patterns which give them functional and argument structure. The overarching pattern is that consonants are thus used lexically while vowels are used functionally
No lgs. are known to be like this with the difference that they have vowels-
Most languages have many more Cs than Vs (though cf. Swedish). Harmony is more common for vowels. Disharmony/dissimilation is more common for consonants (Lyman’s Law in Japanese, Grassmann’s Law, Semitic root constraints). Vowel reduction is a widespread phenomenon. Is consonant neutralization as common? Unknown.
eai oiiy i eoe ue eeio
www.uebersetzung.at.
O rato roeu a rolha da garrafa do rei da Rússia.
Appilan pappilan apupapin papupata pankolla kiehuu ja kuohuu. Pappilan paksuposki piski pisti paksun papukeiton poskeensa.
Caramazza 2000 double dissociation: Two Italian aphasics. AS made 3x more errors on Vs than Cs, IFA made 5x more on Cs than Vs pastore minatore
Word Reconstruction Task: is kebra more like cobra or zebra? Listeners hear nonce words like kebra and are told to name the first real word they find. Cutler & van Ooijen: Dutch (16 V, 19 C) vs. Spanish (5 V, 20 C)
Dutch Spanish
For the CELEX English database (words from 2 to 15 phonemes in length), there are 2.2 times as many neighbors resulting from a consonant replacement (e.g., pat as a neighbor for cat) as from a vowel replacement (e.g., kit as a neighbor for cat). The same calculation for Dutch in CELEX produced 1.72 neighbors from consonant replacement for every neighbor from vowel replacement, whereas for a Spanish lexical database of over 75,000 words (Sebastián-Gallés et al., 1996), there were 2.07 neighbors from consonant replacement for every neighbor from a vowel replacement. These ratios are comparable and reflect the fact that across vowel/consonant inventories, the “paths” to lexical neighbors are largely paved by consonants
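The neighbor calculation described above can be sketched on a toy lexicon. Everything here is an illustrative assumption: the five-word lexicon is invented, each letter stands in for one phoneme, and the function name `neighbor_counts` is hypothetical. It does not reproduce the CELEX figures.

```python
VOWELS = set("aeiou")

def neighbor_counts(lexicon):
    """Count lexical neighbours reached by replacing one consonant vs. one vowel.

    Each (word, neighbour) pair is counted once per direction.
    """
    words = set(lexicon)
    phonemes = {p for w in words for p in w}
    consonant_neighbors = vowel_neighbors = 0
    for word in words:
        for i, original in enumerate(word):
            for candidate in phonemes - {original}:
                # Only swap within a class: consonant for consonant, vowel for vowel.
                if (candidate in VOWELS) != (original in VOWELS):
                    continue
                if word[:i] + candidate + word[i + 1:] in words:
                    if original in VOWELS:
                        vowel_neighbors += 1
                    else:
                        consonant_neighbors += 1
    return consonant_neighbors, vowel_neighbors

# Hypothetical toy lexicon, one letter per phoneme.
toy_lexicon = ["kat", "pat", "bat", "kit", "kap"]
c_count, v_count = neighbor_counts(toy_lexicon)
ratio = c_count / v_count
```

On this toy lexicon the consonant-replacement neighbors outnumber the vowel-replacement neighbors four to one. The real ratios reported above are smaller, but point the same way.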
Consonants: provide lexical contrast Vowels: provide rhythmic scaffold, allow for speaker identification, emotive content
http://www.people.fas.harvard.edu/~nevins/speechpercep I will post many of the relevant papers there