
1

A frequency effect conditioned by phonological grammar

Kie Zuraw, UCLA kie@ucla.edu

QITL June 2006

2

Tagalog (Western Austronesian, Philippines)

3

Tagalog (Western Austronesian, Philippines)

About 16 million native speakers (Ethnologue), plus many second-language speakers

Roman alphabet introduced during Spanish rule; encodes distinctions not represented in older writing and probably not contrastive until recently (d vs. r; u vs. o; i vs. e)

4

Tagalog tapping

Schachter & Otanes 1972 (Tagalog Reference Grammar):

  • Spanish and English loans have introduced a contrast between d and r ([ɾ])

disko ‘disc’ vs. risko ‘risk’; kantod ‘limp’ vs. kantor ‘singer’; seda ‘silk’ vs. sera ‘wax’

  • But in native words, there’s (near-)complementary distribution, with r between vowels and d elsewhere

daliri ‘finger’, dapat ‘should’, likod ‘back’, isda ‘fish’, kadkad ‘unfurled’

5

Tagalog tapping

d → r / V__V

d optionally becomes r between vowels

lakad ‘walk’, lakar-an ‘to be walked on’
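As a minimal sketch of the rule d → r / V__V, the following applies tapping to a plain orthographic string. The five-letter vowel set and the function names `tap` and `attach` are assumptions made for illustration; the sketch applies the rule everywhere it can and does not model optionality or morphological conditioning.

```python
import re

# Assumption: orthographic vowels only; the slides do not give a vowel inventory.
VOWELS = "aeiou"

def tap(form: str) -> str:
    """Rewrite d as r wherever it stands directly between two vowels."""
    # Lookbehind and lookahead leave the flanking vowels in place,
    # so neighbouring V d V contexts are each evaluated.
    return re.sub(rf"(?<=[{VOWELS}])d(?=[{VOWELS}])", "r", form)

def attach(stem: str, suffix: str) -> str:
    """Concatenate stem and suffix, then apply tapping to the result."""
    return tap(stem + suffix)

if __name__ == "__main__":
    print(attach("lakad", "an"))  # stem-final d ends up between vowels
    print(tap("isda"))            # d after a consonant is left alone
```

Word-initial and word-final d (dapat, likod) are untouched because the pattern requires a vowel on both sides.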

6

Overview of talk

Tapping is morphologically governed: required in one environment, forbidden in another, optional in a third.

– I’ll propose a prosodic account of this.

In the environment where tapping is variable, it’s tied to word-frequency facts.

– I’ll argue that this reflects the outcome of lexical access.

Grammar needs to be able to refer to outcome of lexical access.

Data on variation are taken from a written corpus, compiled from the Web.


7

Notes on the corpus

“Probably Tagalog” web pages identified by automated queries to Google Web APIs service, then collected.

~100,000 web pages

~20,000,000 Tagalog words

Variety of genres, especially blogs, discussion forums, and newspaper articles.

8

Notes on the corpus

Sources of noise (but the data nevertheless seem fairly clean):

– second-language users of Tagalog
– typing errors
– erroneously identified pages (especially from other Philippine languages)
– page-internal repetition

Do spelling choices reliably reflect writers’ preferred pronunciations? Probably depends on the phenomenon in question; future lab research planned on this.

9

Notes on the corpus

Thanks to:

– Rosie Jones for supplying seed corpus, from CorpusBuilder project (Ghani, Jones, & Mladenić 2004), which inspired the construction of this corpus
– Undergraduate R.A. Ivan Tam for programming
– Undergraduate R.A. Nikki Foster for data entry from dictionary (English 1987)

10

Stem+suffix: obligatory tapping

lakad ‘walk’, lakar-an ‘be walked on’
tamad ‘lazy’, tamar-in ‘be lazy about’

[Histogram: how many suffixed words (of frequency ≥ 10) display each possible rate of tapping? x-axis: rate of tapping; y-axis: count. Almost all tap.]
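The histograms in this talk plot a per-word rate of tapping. A minimal sketch of how such a rate and its histogram could be computed from spelling counts is given below. The input layout (a dict from word to counts of r-spelled and d-spelled tokens), the bin count, and the example numbers are all assumptions for illustration, not the procedure or data actually used for the corpus.

```python
from collections import Counter

def tapping_rate(r_count: int, d_count: int) -> float:
    """Share of a word's tokens that are spelled with r rather than d."""
    total = r_count + d_count
    if total == 0:
        raise ValueError("word has no tokens")
    return r_count / total

def rate_histogram(counts, min_freq=10, bins=5):
    """Count words per tapping-rate bin, skipping words below min_freq tokens."""
    hist = Counter()
    for r_count, d_count in counts.values():
        if r_count + d_count < min_freq:
            continue  # frequency threshold, as in the inclusion criteria
        rate = tapping_rate(r_count, d_count)
        # A rate of exactly 1.0 goes in the last bin rather than a bin of its own.
        hist[min(int(rate * bins), bins - 1)] += 1
    return [hist[i] for i in range(bins)]

if __name__ == "__main__":
    # Hypothetical counts: word -> (tokens spelled with r, tokens spelled with d)
    example = {"lakaran": (48, 2), "tamarin": (10, 0), "rareword": (3, 1)}
    print(rate_histogram(example))
```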

11

Prefix+stem: optional tapping

dumi ‘dirt’, ma-rumi ‘dirty’
dahon ‘leaf’, ma-dahon ‘leafy’

12

Prefix+stem: optional tapping

[Two histograms of rate of tapping (x-axis) against count (y-axis): prefixed words (freq. ≥ 10) that occur in dictionary, and prefixed words (freq. ≥ 10) identified by morphological parser. Some don’t tap; some do tap.]


13

Conditioning factors?

Stress is not predictive

Vowel quality is not predictive

But frequency is promising...

14

Frequency effect within prefixed words (dictionary words only)

[Two histograms of rate of tapping (x-axis) against count (y-axis)]

words more frequent than their roots: majority tap

words less frequent than their roots: majority don’t tap

(Effect of relative frequency exists independent of raw frequency.)

15

Frequency effect within prefixed words (all words identified by morph. parser)

words more frequent than their roots: majority tap

words less frequent than their roots: majority don’t tap

16

Hay’s explanation (of similar effects in English)

Two routes compete in processing of complex words:

decomposed route: un + happy → unhappy
direct route: unhappy

(Hay 2003, Causes and Consequences of Word Structure)

17

The faster route wins

Speed is determined by, among other factors, resting activation level (approximated by frequency).

marumi more frequent than dumi: direct route wins

madahon less frequent than dahon: decomposed route wins

[Diagrams: competing routes to maDumi (ma + dumi vs. whole word) and maDahon (ma + dahon vs. whole word)]

18

Frequency effect: first approximation

Assume tapping inapplicable if VdV sequence comes from two different accessed lexical units.

marumi > dumi: direct route (marumi), tapping

madahon < dahon: indirect route (ma + dahon), no tapping
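The first approximation can be sketched as a comparison of two frequencies. The function names, the treatment of ties (sent to the indirect route here), and the example counts are assumptions for illustration; the slides only state the two clear cases.

```python
def predicted_route(word_freq: int, root_freq: int) -> str:
    """Pick the route that wins under the first approximation.

    Assumption: frequency stands in for resting activation, and a tie
    goes to the indirect (decomposed) route.
    """
    return "direct" if word_freq > root_freq else "indirect"

def predicts_tapping(word_freq: int, root_freq: int) -> bool:
    """Tapping is expected only when the whole word is accessed as one unit."""
    return predicted_route(word_freq, root_freq) == "direct"

if __name__ == "__main__":
    # Hypothetical counts, not corpus figures.
    print(predicts_tapping(word_freq=500, root_freq=100))  # like marumi vs. dumi
    print(predicts_tapping(word_freq=50, root_freq=900))   # like madahon vs. dahon
```

This is a categorical idealization; the following slides compare tapped and untapped words on several frequency measures instead of a single threshold.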


19

Frequency ratio

Tapped words have higher frequency ratio.
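A minimal sketch of the comparison behind this slide: compute a log word-to-root frequency ratio per item and summarize it separately for tapped and untapped words. The base-10 logarithm, the use of the median as the summary, and the example triples are assumptions; the slides do not say which base or statistic was used.

```python
import math
from statistics import median

def log_frequency_ratio(word_freq: float, root_freq: float) -> float:
    """Log (base 10, by assumption) of word frequency over root frequency."""
    if word_freq <= 0 or root_freq <= 0:
        raise ValueError("frequencies must be positive")
    return math.log10(word_freq / root_freq)

def median_log_ratio_by_group(items):
    """items: iterable of (word_freq, root_freq, tapped) triples.

    Returns the median log ratio for each non-empty group, keyed by the
    boolean 'tapped'.
    """
    groups = {True: [], False: []}
    for word_freq, root_freq, tapped in items:
        groups[tapped].append(log_frequency_ratio(word_freq, root_freq))
    return {key: median(vals) for key, vals in groups.items() if vals}

if __name__ == "__main__":
    # Hypothetical items, not corpus data.
    data = [(1000, 10, True), (100, 100, True), (10, 1000, False)]
    print(median_log_ratio_by_group(data))
```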

20

Word frequency alone

Tapped words have higher raw frequency.

21

Root frequency alone

No real difference—see Hay & Baayen 2001 for discussion.

22

Affix frequency

Tapped words have less-frequent affixes.

23

Affix productivity

Baayen’s P: if there is any difference, it’s in the non-predicted direction.
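For reference, Baayen’s P for an affix is the number of hapax legomena containing the affix divided by the total number of tokens containing it. A minimal sketch follows, assuming the tokens have already been identified as carrying the affix; the function name and the token list are hypothetical.

```python
from collections import Counter

def productivity_p(affixed_tokens):
    """Baayen's P: hapax legomena with the affix / all tokens with the affix."""
    tokens = list(affixed_tokens)
    if not tokens:
        raise ValueError("no tokens for this affix")
    type_counts = Counter(tokens)
    hapaxes = sum(1 for count in type_counts.values() if count == 1)
    return hapaxes / len(tokens)

if __name__ == "__main__":
    # Hypothetical token list for one prefix; not corpus data.
    print(productivity_p(["madahon", "madahon", "marumi", "mabato"]))
```

Because the denominator is a token count, P falls as the sample grows, which is one reason comparisons across affixes with very different token frequencies need care.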

24

Affix productivity

But...

Lüdeling & Evert 2003 caution against using P to compare processes with different token frequencies.

Lüdeling, Evert & Heid 2000 show that P requires a well-processed corpus and differentiation of homophonous processes.


25

Affix productivity

I plan to look at some other measures of affix productivity:

– Hay & Baayen’s parsing line
– Lüdeling/Evert vocabulary growth curves

26

Effect of morphology

Recall that suffixed words tap no matter what. Somehow, suffixed words are treated as a unit regardless of lexical access route.

27

Prefix-suffix asymmetries

Common cross-linguistically for rules to apply more readily across a stem-suffix boundary than a prefix-stem boundary. Peperkamp 1997 (Prosodic Words) cites Choctaw, Polish, Hungarian, Indonesian, Japanese, Korean, French (p. 55).

28

Proposed prosodic structures

Two choices for prefixed word, depending on outcome of lexical access (similar to Nespor & Vogel 1986, Peperkamp 1997 for Italian):

direct route: ma-rumi, prefix and stem in one p-word; tapping applies

indirect route: ma-dahon, stem begins its own p-word; tapping doesn’t apply

  • Alignment constraint: accessed lexical unit initiates prosodic word
  • Otherwise, constraint against recursion prefers simple structure.
  • Tapping applies to VdV sequence iff not interrupted by p-word boundary
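The condition that tapping applies to a VdV sequence only when no p-word boundary interrupts it can be sketched by marking boundaries in the string. The `|` boundary symbol, the vowel set, and the function names are assumptions for illustration; the sketch encodes only the left edge of a separately accessed stem, not the full prosodic tree.

```python
import re

BOUNDARY = "|"  # assumed marker for the left edge of a p-word

def tap_within_pwords(parsed: str) -> str:
    """Apply d -> r between vowels, but never across a boundary marker."""
    # A marker sitting between the vowel and d makes the lookbehind fail,
    # so the rule is blocked exactly where a boundary intervenes.
    tapped = re.sub(r"(?<=[aeiou])d(?=[aeiou])", "r", parsed)
    return tapped.replace(BOUNDARY, "")

def prefixed_form(prefix: str, stem: str, route: str) -> str:
    """Build a prefixed word under the given lexical-access outcome."""
    if route == "direct":
        parsed = prefix + stem             # one accessed unit, one p-word
    elif route == "indirect":
        parsed = prefix + BOUNDARY + stem  # stem initiates its own p-word
    else:
        raise ValueError(f"unknown route: {route!r}")
    return tap_within_pwords(parsed)

if __name__ == "__main__":
    print(prefixed_form("ma", "dumi", "direct"))
    print(prefixed_form("ma", "dahon", "indirect"))
```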

29

Proposed prosodic structures

Only one choice for suffixed word, regardless of outcome of lexical access:

[Diagram: candidate p-word structures for stem + suffix; the surviving structure is lakar-an, where tapping applies]

  • Minimality constraint: p-word must dominate at least one foot, and foot must dominate at least two syllables
  • Suffixes are monosyllabic and so can’t head a p-word
  • Adjoining the suffix is no help

30

Stem+stem: no tapping

Most are d-initial stems: dala-dala ‘load carried’

Few are d-final stems: agad-agad ‘at once’

[Histogram: rate of tapping (x-axis) against count (y-axis); nearly all don’t tap]

Probably not a reduplicative identity effect: ka-agad-agar-an, ka-raga-dagan


31

Proposed prosodic structure

Only one choice for compounding reduplication, regardless of outcome of lexical access:

[Diagram: each stem in its own p-word (dala, dala); tapping doesn’t apply]

  • Heading constraint: each stem must head a p-word

32

Local summary

Frequency effect is allowed to surface only in prefixed words.

Otherwise, morphology determines outcome.

There is a grammar (it’s not all processing), but it can refer to the units accessed during lexical retrieval, not just to syntactic units.

33

Online vs. lexicalized

Recall the polarized behavior of prefixed words:

(frequency threshold of 7, to be consistent with upcoming slide...)

34

Online vs. lexicalized

This is very different from what Baroni (1998, 2001) found for Northern Italian intervocalic s-voicing, which is otherwise similar to Tagalog tapping.

Baroni documented robust variation within item, within speaker

35

Online vs. lexicalized

So is the frequency effect active online, or merely lexicalized?

For established prefixed words, perhaps they are lexicalized (and unestablished words are too infrequent in the corpus to see if they vary).

But in two other realms, there seems to be an online frequency effect...

36

Enclitics

One more environment for tapping:

Enclitics daw, din can be raw, rin after vowel-final words (and, less frequently, after consonant-final words)

ako rin ~ ako din ‘me too’

So far, I’ve looked only at bigrams with din/rin (‘too’), not daw/raw (reported speech)


37

Word + enclitic

We see the full range of behaviors:

[Histogram: rate of tapping (x-axis) against count (y-axis), for din/rin after vowel-final words of frequency at least 10]

This suggests that the choice between din and rin is made online (at least in a lot of cases).

38

Frequency effect applies

  • Correlation between frequency ratio and tapping rate is weak, but significant: Spearman’s rho = .197, p<.001

[Scatterplot with supersmoother: rate of tapping against log (phrase/word) frequency ratio, for din/rin after vowel-final words of frequency at least 10]
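Spearman’s rho is the Pearson correlation of the two variables’ ranks. A minimal standard-library sketch is shown below, using average ranks for ties; the function names and the example values are hypothetical, and the sketch does not compute a p-value.

```python
import math

def average_ranks(values):
    """Rank values from 1 upward, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman_rho(xs, ys):
    """Pearson correlation computed on the average ranks of xs and ys."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples of at least 2 points")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one variable is constant")
    return cov / math.sqrt(var_x * var_y)

if __name__ == "__main__":
    # Hypothetical (log frequency ratio, tapping rate) pairs, not corpus data.
    ratios = [-3.0, -2.0, -2.0, -0.5]
    rates = [0.1, 0.4, 0.3, 0.9]
    print(spearman_rho(ratios, rates))
```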

39

Enclitic + enclitic

Enclitics can combine:

bakit pa rin ‘why still also’

Counting disyllabic pronouns as clitics (which might be wrong), there are 23 enclitic+din combinations.

40

Enclitic + enclitic

Stronger correlation: Spearman’s rho = .527, p<.05

[Scatterplot with supersmoother: percentage tapped against log (r+d)/wd1]

41

Two types of variation, both showing a frequency effect

polarized/lexicalized (prefix+stem) vs. continuous/on-the-fly (word+enclitic)

[Histogram: rate of tapping (x-axis) against count (y-axis), for din/rin after vowel-final words of frequency at least 10]

42

Suffixed words

There are 124 words with < 100% tapping

These “errors” show correlation between tapping rate and word-to-root (log) frequency ratio

The more frequent the word as compared to its root, the more tapping

Spearman’s rho=.534, p<.001

[Scatterplot with lowess smoother: rate of tapping against word/root log ratio, for suffixed words with some nontapping and freq>=10]


43

Summary

Tagalog requires tapping in one morphological environment, forbids it in another, and allows variation in a third.

Where variation is allowed, it is conditioned (at least in part) by lexical-access factors.

Some of the variation is lexicalized, and other variation seems to be online.