Unit 7: A multivariate approach to linguistic variation



SLIDE 1

Unit 7: A multivariate approach to linguistic variation

Statistics for Linguists with R – A SIGIL Course

Stefan Evert

Computational Corpus Linguistics Group FAU Erlangen-Nürnberg

www.linguistik.fau.de | www.stefan-evert.de 1 SIGIL Unit #7

SLIDE 2

Linguistic variation

Variation of a quantitative linguistic feature

– frequency of passive, past perfect, split infinitive, …
– frequency of expression, semantic field, topic, …
– association strength, lexical density, productivity, …

across

– languages and language varieties
– regions & social strata
– time (diachronic change)
– individual speakers & discourses

SLIDE 3

Studying linguistic variation

§ Univariate approach

– compare single feature across two or more conditions
– e.g. AmE vs. BrE vs. IndE vs. … / male vs. female / etc.
– corpus frequency comparison

§ Regression approach

– predict single quantity from multiple explanatory factors

§ Multivariate approach

– identify common patterns of variation across multiple different features ➞ correlation analysis
– inductive techniques don't require pre-defined conditions

SLIDE 4

Variation as a nuisance parameter

§ Many aspects of linguistic variation are nuisance parameters in corpus linguistics

– e.g. difference in frequency of passives between AmE and BrE, as well as development from 1960s to 1990s (Unit #2)
– ignore other dimensions such as genre/register variation by pooling frequency data from all texts of each corpus
– corpus is analyzed as a random sample of VP tokens

§ Consequences

– variation ➞ non-randomness ➞ overestimate significance
– discussed in much more detail in Unit #8

SLIDE 5

The multivariate approach

§ Different linguistic features often show similar patterns of variation
§ E.g. passives and nominalizations

[Figure: passives / 1000 words plotted against nominalizations / 1000 words]

SLIDE 6

The multivariate approach

§ Different linguistic features often show similar patterns of variation
§ E.g. passives and nominalizations
§ Such correlations can be exploited to determine major dimensions of variation

[Figure: passives / 1000 words plotted against nominalizations / 1000 words]

SLIDE 7

The multivariate approach

SLIDE 8

The multivariate approach

§ Multivariate analysis exploits correlations between features in order to determine latent dimensions

– interpreted as underlying “causes” of variation

§ An inductive, data-driven approach

– no theoretical assumptions about linguistic variation and categories / sub-corpora to be compared

§ Pioneering work by Doug Biber (1988, 1993, 1995, …)

– “multidimensional analysis” of register variation

§ Related approaches: correspondence analysis, distributional semantics, topic modelling, …
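The correlations that multivariate analysis exploits can be made concrete with a small sketch. The feature names and per-text frequencies below are invented for illustration and are not taken from any corpus; the code computes plain Pearson correlations between feature columns, which is only the starting point for factor analysis and related techniques.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def correlation_matrix(columns):
    """Pairwise correlations for a dict {feature name: list of values}."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical frequencies per 1000 words in five texts (invented numbers).
features = {
    "passives": [8.0, 12.0, 15.0, 21.0, 30.0],
    "nominalizations": [14.0, 22.0, 25.0, 38.0, 52.0],
    "contractions": [20.0, 15.0, 11.0, 6.0, 1.0],
}
corr = correlation_matrix(features)
```

In this toy data, passives and nominalizations rise together while contractions fall, so the first two features correlate positively with each other and negatively with the third.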

SLIDE 9

Biber's multidimensional analysis

Table 5.7 Linguistic features used in the analysis of English

  • A. Tense and aspect markers

1 Past tense
2 Perfect aspect
3 Present tense

  • B. Place and time adverbials

4 Place adverbials (e.g., above, beside, outdoors)
5 Time adverbials (e.g., early, instantly, soon)

  • C. Pronouns and pro-verbs

6 First-person pronouns
7 Second-person pronouns
8 Third-person personal pronouns (excluding it)
9 Pronoun it
10 Demonstrative pronouns (that, this, these, those as pronouns)
11 Indefinite pronouns (e.g., anybody, nothing, someone)
12 Pro-verb do

  • D. Questions

13 Direct WH questions

  • E. Nominal forms

14 Nominalizations (ending in -tion, -ment, -ness, -ity)
15 Gerunds (participial forms functioning as nouns)
16 Total other nouns

  • F. Passives

17 Agentless passives
18 by-passives

  • G. Stative forms

19 be as main verb
20 Existential there

  • H. Subordination features

21 that verb complements (e.g., I said that he went)
22 that adjective complements (e.g., I'm glad that you like it)
23 WH-clauses (e.g., I believed what he told me)
24 Infinitives
25 Present participial adverbial clauses (e.g., Stuffing his mouth with cookies, Joe ran out the door)
26 Past participial adverbial clauses (e.g., Built in a single week, the house would stand for fifty years)
27 Past participial postnominal (reduced relative) clauses (e.g., the solution produced by this process)
28 Present participial postnominal (reduced relative) clauses (e.g., The event causing this decline was ...)
29 that relative clauses on subject position (e.g., the dog that bit me)
30 that relative clauses on object position (e.g., the dog that I saw)
31 WH relatives on subject position (e.g., the man who likes popcorn)
32 WH relatives on object position (e.g., the man who Sally likes)
33 Pied-piping relative clauses (e.g., the manner in which he was told)
34 Sentence relatives (e.g., Bob likes fried mangoes, which is the most disgusting thing I've ever heard of)
35 Causative adverbial subordinator (because)
36 Concessive adverbial subordinators (although, though)
37 Conditional adverbial subordinators (if, unless)
38 Other adverbial subordinators (e.g., since, while, whereas)

  • I. Prepositional phrases, adjectives, and adverbs

39 Total prepositional phrases
40 Attributive adjectives (e.g., the big horse)
41 Predicative adjectives (e.g., The horse is big.)
42 Total adverbs

  • J. Lexical specificity

43 Type-token ratio
44 Mean word length

  • K. Lexical classes

45 Conjuncts (e.g., consequently, furthermore, however)
46 Downtoners (e.g., barely, nearly, slightly)
47 Hedges (e.g., at about, something like, almost)
48 Amplifiers (e.g., absolutely, extremely, perfectly)
49 Emphatics (e.g., a lot, for sure, really)
50 Discourse particles (e.g., sentence-initial well, now, anyway)
51 Demonstratives

  • L. Modals

52 Possibility modals (can, may, might, could)
53 Necessity modals (ought, should, must)
54 Predictive modals (will, would, shall)

  • M. Specialized verb classes

55 Public verbs (e.g., assert, declare, mention)
56 Private verbs (e.g., assume, believe, doubt, know)
57 Suasive verbs (e.g., command, insist, propose)
58 seem and appear

  • N. Reduced forms and dispreferred structures

59 Contractions
60 Subordinator that deletion (e.g., I think [that] he went)
61 Stranded prepositions (e.g., the candidate that I was thinking of)
62 Split infinitives (e.g., He wants to convincingly prove that...)
63 Split auxiliaries (e.g., They were apparently shown to ...)

  • O. Co-ordination

64 Phrasal co-ordination (NOUN and NOUN; ADJ and ADJ; VERB and VERB; ADV and ADV)
65 Independent clause co-ordination (clause-initial and)

  • P. Negation

66 Synthetic negation (e.g., No answer is good enough for Jones)
67 Analytic negation (e.g., That's not likely)

SLIDE 10

Biber's multidimensional analysis

factor analysis (FA)

SLIDE 11

Biber's multidimensional analysis

THE MULTI-DIMENSIONAL APPROACH TO LINGUISTIC ANALYSES OF GENRE VARIATION

[...] co-occur frequently in texts because they serve some shared, underlying communicative functions associated with the situational contexts of the texts.

Table 2 summarizes the co-occurring features associated with each of the five dimensions. The decimal numbers on this table represent the factor "loadings" for each linguistic feature. Loadings can run from -1.0 to +1.0; the further from 0.0 a loading is, the more one can generalize from the factor in question to the particular linguistic feature. Features with higher loadings are thus better representatives of the dimension underlying a factor. In Table 2, only features with loadings larger than 0.35 (plus or minus) are included.

Most of the dimensions consist of two groupings of features, having positive and negative loadings. Positive or negative sign does not indicate a more-or-less relationship; rather, these two groups represent sets of features that occur in a complementary pattern. That is, when the features in one group occur together frequently in a text, the features in the other group are markedly less frequent in that text, and vice versa. To interpret the dimensions, it is important to consider likely reasons for the complementary distribution of these two groups of features as well as the reasons for the co-occurrence pattern within each group.

For example, consider Dimension 2. The features in the top group (the positive loadings above the dashed line on Table 2) are past tense verbs, perfect aspect verbs, third person pronouns and public verbs (primarily speech act verbs), while the features in the bottom group (the negative loadings) are present tense verbs and adjectives. Considering all of the features on Dimension 2, this dimension is interpreted as distinguishing narrative discourse from other types of discourse, [...]

TABLE 2 Summary of the co-occurrence patterns underlying five major dimensions of English.

DIMENSION 1 (Informational vs. Involved)
Positive loadings: nouns 0.80; word length 0.58; prepositional phrases 0.54; type / token ratio 0.54; attributive adjs. 0.47
Negative loadings: private verbs -0.96; that deletions -0.91; contractions -0.90; present tense verbs -0.86; 2nd person pronouns -0.86; do as pro-verb -0.82; analytic negation -0.78; demonstrative pronouns -0.76; general emphatics -0.74; first person pronouns -0.74; pronoun it -0.71; be as main verb -0.71; causative subordination -0.66; discourse particles -0.66; indefinite pronouns -0.62; general hedges -0.58; amplifiers -0.56; sentence relatives -0.55; WH questions -0.52; possibility modals -0.50; non-phrasal coordination -0.48; WH clauses -0.47; final prepositions -0.43

DIMENSION 2 (Narrative versus Non-Narrative)
Positive loadings: past tense verbs 0.90; third person pronouns 0.73; perfect aspect verbs 0.48; public verbs 0.43; synthetic negation 0.40; present participial clauses 0.39
Negative loadings: present tense verbs -0.47; attributive adjs. -0.41

DIMENSION 3 (Elaborated vs. Situated Reference)
Positive loadings: WH relative clauses on object positions 0.63; pied piping constructions 0.61; WH relative clauses on subject position 0.45; phrasal coordination 0.36; nominalizations 0.36
Negative loadings: time adverbials -0.60; place adverbials -0.49; other adverbs -0.46

DIMENSION 4 (Overt Expression of Persuasion)
Positive loadings: infinitives 0.76; prediction modals 0.54; suasive verbs 0.49; conditional subordination 0.47; necessity modals 0.46; split auxiliaries 0.44; possibility modals 0.37
[No complementary features]

DIMENSION 5 (Abstract versus Non-Abstract Style)
Positive loadings: conjuncts 0.48; agentless passives 0.43; past participial clauses 0.42; BY-passives 0.41; past participial WHIZ deletions 0.40; other adverbial subordinators 0.39
[No complementary features]

Computational Linguistics Volume 19, Number 2

[Figure 1: Linguistic characterization of nine spoken and written registers with respect to Dimension 1 ('Informational versus Involved Production') and Dimension 3 ('Elaborated versus Situation-Dependent Reference'). Dimension 1 runs from INFORMATIONAL to INVOLVED, Dimension 3 from SITUATED to ELABORATED. Registers shown: newspaper reportage, academic prose, newspaper editorials, broadcasts, fiction, professional letters, personal letters, spontaneous speeches, conversations.]

SLIDE 12

Pitfalls

§ Design bias: choice of quantitative features
§ Design bias: selection of text samples
§ Involves a miracle

– not clear what quantitative patterns are captured by FA
– magic number: how many factor dimensions?

§ Interpretation bias

– arbitrary cutoff for feature weights (“loadings”)
– risk of reading one's own expectations into features

§ More subtle patterns of variation invisible
§ Significance & reproducibility of results?

SLIDE 13

Reproducing Biber's dimensions

§ Sample of 923 medium-length published texts from written part of British National Corpus (BNC)
§ Covers 4 different text types + male/female authors

– academic writing, non-academic prose, fiction, misc.

§ Biber features extracted automatically with Python script (Gasthaus 2007)

– all frequencies normalized per 1000 words
– data available in R package corpora (BNCbiber)

§ Factor analysis with 4 latent dimensions + varimax

– seems to yield the most clearly structured analysis
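The normalization per 1000 words mentioned above is a simple rescaling of raw counts by text length. A minimal sketch follows; the function names and the example counts are hypothetical, and this is not the extraction script referenced on the slide.

```python
def per_thousand(count, n_tokens):
    """Relative frequency of a feature per 1000 words of running text."""
    if n_tokens <= 0:
        raise ValueError("text length must be positive")
    return 1000.0 * count / n_tokens


def normalize_text(counts, n_tokens):
    """Normalize all raw feature counts of one text to per-1000-word rates."""
    return {name: per_thousand(c, n_tokens) for name, c in counts.items()}


# Hypothetical raw counts for one text of 2400 words (invented numbers).
raw = {"past_tense": 90, "contractions": 12}
rel = normalize_text(raw, 2400)
```

Normalizing makes texts of different lengths comparable, but every feature is then tied to the same denominator, which is one source of the built-in correlations discussed on the following slides.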

SLIDE 14

Design bias: choice of features

[Figure: matrix plot of pairwise relations between the Biber features; both axes list the features by number and name (f01 past tense to f67 neg analytic)]

Annotations in the figure: “correlated with verb frequency”; “correlated with noun frequency” (all features measured per 1000 words)

SLIDE 15

Design bias: choice of text samples

[Figure: 4-factor analysis. Axes: latent dimension 1 (narrative/involved vs. non-narrative/informational) and latent dimension 2 (overt persuasion + other). Text types: academic, fiction, misc_published, prose]

[Figure: Biber's Figure 1 (Dimension 1 vs. Dimension 3; Computational Linguistics Volume 19, Number 2), repeated for comparison and annotated with question marks]

SLIDE 16

Interpretation bias

[Figure: 4-factor analysis. Axes: latent dimension 1 (narrative/involved vs. non-narrative/informational) and latent dimension 2 (overt persuasion + other). Text types: academic, fiction, misc_published, prose]

[Figure: Biber's Figure 1 (Dimension 1 vs. Dimension 3; Computational Linguistics Volume 19, Number 2), repeated for comparison and annotated with question marks]

SLIDE 17

Variation between texts is ignored

[Figure: 4-factor analysis. Axes: latent dimension 1 (narrative/involved vs. non-narrative/informational) and latent dimension 2 (overt persuasion + other). Text types: academic, fiction, misc_published, prose. The plot appears three times on this slide, with the annotation: “confidence” ellipse (➞ significance)]

[Figure: Biber's Figure 1 (Dimension 1 vs. Dimension 3; Computational Linguistics Volume 19, Number 2), repeated for comparison]

SLIDE 18

Design bias: choice of texts (redux)

[Figure: 4-factor analysis. Axes: latent dimension 1 (narrative/involved vs. non-narrative/informational) and latent dimension 2 (overt persuasion + other). Text types: academic, fiction, misc_published, prose]

[Figure: Biber's Figure 1 (Dimension 1 vs. Dimension 3; Computational Linguistics Volume 19, Number 2), repeated for comparison]

Bootstrapping

[Figures: 4-factor analysis on bootstrap samples #1, #3 and #4. Axes: latent dimension 1 and latent dimension 2. Text types: academic, fiction, misc_published, prose]

SLIDE 19

And there's the magic number …

[Figure: 3-factor analysis (bootstrap sample #3). Axes: latent dimension 1 and latent dimension 2. Text types: academic, fiction, misc_published, prose]

[Figure: Biber's Figure 1 (Dimension 1 vs. Dimension 3; Computational Linguistics Volume 19, Number 2), repeated for comparison]

3-factor analysis (instead of 4 factors)

SLIDE 20

Blindness to subtle patterns

§ But research shows that author gender can be identified with high accuracy

– Koppel et al. (2003): 77.3% with function words + POS n-grams
– Gasthaus (2007): 82.9% with SVM on Biber features

§ This dataset: 82.3% accuracy

– baseline: 73.1%

[Figure: texts plotted on latent dimension 1 (narrative/involved vs. non-narrative/informational) and latent dimension 2 (overt persuasion + other), marked by author gender (female, male)]

SLIDE 21

A uniform methodology

(Diwersy, Evert & Neumann 2014; Evert & Neumann 2017)

Online supplement: http://www.stefan-evert.de/PUB/EvertNeumann2017/

SLIDE 22

A uniform methodology

(Diwersy, Evert & Neumann 2014; Evert & Neumann 2017)

§ Axiom: (Euclidean) distance = similarity of texts

– depends crucially on theoretically motivated features

§ Visualization ➞ interpret geometric configuration

– latent dimensions as geometric projections
– orthogonal projection = perspective on data
– method: principal component analysis (PCA)

§ Minimally supervised intervention

– based on externally observable, theory-neutral information
– method: linear discriminant analysis (LDA)

§ Bootstrapping / cross-validation to assess significance
§ Cautious interpretation of feature weights

not a recipe
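The bootstrapping step named above can be sketched as resampling the texts with replacement and recomputing a statistic on every resample. The sketch below is an illustration under stated assumptions: the scores are invented, the statistic is a plain mean, and the percentile interval is one common choice that the slides do not prescribe. In an actual study the statistic would be a whole analysis (for example a factor analysis or an LDA) rerun on each resample.

```python
import random
import statistics


def bootstrap_samples(items, n_samples, seed=0):
    """Yield n_samples resamples of items, each drawn with replacement."""
    rng = random.Random(seed)
    n = len(items)
    for _ in range(n_samples):
        yield [items[rng.randrange(n)] for _ in range(n)]


def percentile_interval(items, statistic, n_samples=1000, level=0.95, seed=0):
    """Percentile bootstrap interval for statistic(items)."""
    values = sorted(statistic(s) for s in bootstrap_samples(items, n_samples, seed))
    lower = values[int((1.0 - level) / 2.0 * n_samples)]
    upper = values[min(n_samples - 1, int((1.0 + level) / 2.0 * n_samples))]
    return lower, upper


# Hypothetical scores of ten texts on one latent dimension (invented numbers).
scores = [-1.8, -0.9, -0.4, 0.1, 0.3, 0.7, 1.2, 1.5, 2.1, 2.6]
low, high = percentile_interval(scores, statistics.mean)
```

A wide interval signals that the result depends strongly on which texts happened to be sampled, which is the concern raised under "Design bias: choice of texts".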

SLIDE 23

Case study: CroCo

§ CroCo: parallel corpus English/German

– English-German and German-English translation pairs
– we use 298 texts from 5 different genres (excluded: instruction manuals, tourism, fiction)

§ 28 lexico-grammatical features (Neumann 2013)

– comparable between languages
– inspired by SFL and translation studies

§ Text = point in 28-dimensional feature space
§ Research hypotheses: shining through (Teich 2003), prestige effect (Toury 2012)

(Neumann 2013; Evert & Neumann 2017)

SLIDE 24

Feature design: avoid “obvious” correlations

[Figure: relative frequency (%) per feature. Features: nn / T, adja / T, nominal / T, finites / S, past / F, passive / V, modals / V, imperatives / S, interrogatives / S, coordination / T, subordination / T, pronouns / T, place adv / T, time adv / T, adv theme / TH, text theme / TH, obj theme / TH, verb theme / TH, subj theme / TH, prep / T, modal adv / T, contractions / T, colloquialism / T, titles / T, lexical density, lexical TTR, token / S]

suitable unit of measurement (not always per 1000 words!)

SLIDE 25

Feature scaling: same contribution to Euclidean distances

[Figure: z-score (= standardized relative frequency) per feature, axis from −5 to 5. Features: nn / T, adja / T, nominal / T, finites / S, past / F, passive / V, modals / V, imperatives / S, interrogatives / S, coordination / T, subordination / T, pronouns / T, place adv / T, time adv / T, adv theme / TH, text theme / TH, obj theme / TH, verb theme / TH, subj theme / TH, prep / T, modal adv / T, contractions / T, colloquialism / T, titles / T, lexical density, lexical TTR, token / S]
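Why scaling matters for Euclidean distances can be shown with a small sketch. The two features and their values are invented, and the use of the sample standard deviation (n − 1) is an assumption; the slides only state that z-scores are used.

```python
import math


def standardize(matrix):
    """Column-wise z-scores for a list of rows (texts) by columns (features)."""
    n = len(matrix)
    cols = list(zip(*matrix))
    means = [sum(c) / n for c in cols]
    sds = [math.sqrt(sum((v - m) ** 2 for v in c) / (n - 1))
           for c, m in zip(cols, means)]
    if any(sd == 0.0 for sd in sds):
        raise ValueError("constant feature cannot be standardized")
    return [[(v - m) / sd for v, m, sd in zip(row, means, sds)] for row in matrix]


def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical texts: [lexical density as a proportion, tokens per sentence].
texts = [
    [0.50, 20.0],
    [0.55, 24.0],
    [0.45, 16.0],
    [0.60, 20.0],
]
z = standardize(texts)

raw_near = euclidean(texts[0], texts[3])   # differs only in lexical density
raw_far = euclidean(texts[0], texts[1])    # dominated by sentence length
z_a = euclidean(z[0], z[3])
z_b = euclidean(z[0], z[1])
```

On the raw values the feature with the larger numeric range dominates the distance; after standardization both features contribute on the same scale, so the ordering of the two distances changes in this toy example.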

SLIDE 26

CroCo: correlation matrix

[Figure: correlation matrix of the CroCo features; both axes list the features (lexical density, nn / T, nominal / T, adja / T, ... subordination / T)]

SLIDE 27

Latent dimensions as perspective on data configuration

§ Instead of “magical” latent dimensions we focus on orthogonal projections as perspectives on the data

– cf. photograph as 2D perspective on 3D object

§ Different perspectives highlight different aspects
§ Multivariate analysis ➞ choice of perspective

– principal component analysis (PCA) = perspective that reflects distances between texts as accurately as possible
– should reveal major dimensions of variation
– advantage over factor analysis (FA): dimensionality does not have to be fixed a priori
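The idea of PCA as an orthogonal projection can be sketched in a few lines. This is an illustrative implementation using power iteration with deflation on the covariance matrix, chosen here only to keep the example free of dependencies; it is an assumption, not the method used in the course, and the data are invented. A real analysis would normally rely on an established library routine.

```python
import math
import random


def center(rows):
    """Subtract the column means from every row."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[v - m for v, m in zip(row, means)] for row in rows]


def covariance(centered):
    """Sample covariance matrix of already centered rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def dominant_eigenpair(matrix, iterations=1000, seed=0):
    """Largest eigenvalue and eigenvector of a symmetric PSD matrix."""
    rng = random.Random(seed)
    d = len(matrix)
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    value = sum(v[i] * sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d))
    return v, value


def pca(rows, k):
    """Return (components, explained variance ratios, projected scores)."""
    centered = center(rows)
    cov = covariance(centered)
    d = len(cov)
    total = sum(cov[i][i] for i in range(d))
    components, explained = [], []
    for _ in range(k):
        v, lam = dominant_eigenpair(cov)
        components.append(v)
        explained.append(lam / total)
        # Deflation: remove the variance along the component just found.
        cov = [[cov[i][j] - lam * v[i] * v[j] for j in range(d)] for i in range(d)]
    scores = [[sum(x * c for x, c in zip(row, comp)) for comp in components]
              for row in centered]
    return components, explained, scores


# Hypothetical texts in a 3-dimensional feature space (invented numbers).
rows = [[1.0, 2.1, 0.2], [2.0, 3.9, -0.1], [3.0, 6.2, 0.1],
        [4.0, 7.8, -0.2], [5.0, 10.1, 0.0]]
components, explained, scores = pca(rows, 2)
```

Because components are extracted one after another, the number of dimensions to inspect can be decided afterwards from the explained variance, which matches the advantage over factor analysis noted above.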

SLIDE 28

CroCo: 3-dimensional projection

[Figure: 3-dimensional projection of the CroCo texts. Legend: language (DE, EN) and genre (essay, popsci, share, speech, web)]

SLIDE 29

CroCo: 4-dimensional projection

[Figure: 4-dimensional projection of the CroCo texts. Legend: language (DE, EN) and genre (essay, popsci, share, speech, web)]

SLIDE 30

CroCo: genre distribution

§ Focus on latent dim's 1 and 3 (register variation)
§ Describe genre by centroid + ellipse
§ Comparison with Hotelling's T² test

– essays vs. Web
– T² = 4.21, df = 2/141, p = .0167 *

[Figure: CroCo genres (essay, popsci, share, speech, web) plotted on latent dimension 1 and latent dimension 3; the plot appears three times on this slide]

SLIDE 31

How about subtle patterns?

§ PCA dim's can't separate translations from original texts

– 62.1% accuracy on first 3 PCA dim's

§ But SVM machine learner can do this with >80% accuracy

– RBF kernel
– 10-fold c.v.

§ Hints at shining through, but no clear-cut evidence

[Figure: PCA projection of the CroCo texts. Legend: language (DE, EN) and status (orig, trans)]

SLIDE 32

Minimally supervised LDA

§ Add minimal amount of supervised knowledge to find a more informative perspective

– evidence for shining through hypothesis from dimension that corresponds to contrast German vs. English
– supervised knowledge: language of original texts only

§ Linear discriminant analysis (LDA)

– maximize separation between German / English originals
– minimize variability within each group
– classical technique related to PCA and ANOVA

§ Project all texts onto LDA discriminant

– complemented by additional PCA dim's for visualization
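The two LDA criteria above (maximize separation between the groups, minimize variability within each group) lead, in the two-group case, to the classical Fisher discriminant w ∝ S_w⁻¹(m_B − m_A), where S_w is the pooled within-group scatter matrix. The sketch below implements that textbook formula on invented two-dimensional data; the group names, values and helper functions are hypothetical and do not reproduce the CroCo analysis.

```python
import math


def mean_vector(rows):
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def scatter(rows, mean):
    """Within-group scatter matrix: sum of outer products of deviations."""
    d = len(mean)
    s = [[0.0] * d for _ in range(d)]
    for row in rows:
        dev = [x - m for x, m in zip(row, mean)]
        for i in range(d):
            for j in range(d):
                s[i][j] += dev[i] * dev[j]
    return s


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[piv][col]) < 1e-12:
            raise ValueError("singular within-group scatter matrix")
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fisher_discriminant(group_a, group_b):
    """Unit-length discriminant axis pointing from group A towards group B."""
    ma, mb = mean_vector(group_a), mean_vector(group_b)
    sa, sb = scatter(group_a, ma), scatter(group_b, mb)
    sw = [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(sa, sb)]
    w = solve(sw, [b - a for a, b in zip(ma, mb)])
    norm = math.sqrt(sum(x * x for x in w))
    return [x / norm for x in w]


def project(rows, w):
    """Discriminant scores: orthogonal projection of each text onto w."""
    return [sum(x * wi for x, wi in zip(row, w)) for row in rows]


# Hypothetical originals: feature 1 varies widely within both groups,
# feature 2 separates the groups (invented numbers).
de_orig = [[0.0, 1.0], [4.0, 1.2], [8.0, 0.9], [2.0, 1.1]]
en_orig = [[1.0, -1.0], [5.0, -0.8], [9.0, -1.1], [3.0, -0.9]]
w = fisher_discriminant(de_orig, en_orig)
de_scores = project(de_orig, w)
en_scores = project(en_orig, w)
```

Texts that were not used to fit the axis (in the case study, the translations) can then be passed to `project` with the same `w`, which is what "project all texts onto LDA discriminant" amounts to.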

SLIDE 33

CroCo: LDA perspective

[Figure: LDA perspective on the CroCo texts, with axis poles labelled English and German. Legend: language (DE, EN) and status (orig, trans)]

SLIDE 34

Discriminant for DE vs. EN confirms shining through & prestige effect

[Figure: density of discriminant scores for DE: orig, DE: trans, EN: orig and EN: trans, with poles labelled English and German; annotated values –1.1 and +1.3; acc = 76.8%]

SLIDE 35

LDA significance: bootstrapping / cross-validation

§ LDA is a supervised ML technique ➞ overtrained?

– Would we find the same discriminant if we trained on a different set of texts?

§ Verification with bootstrap resampling or 10-fold cross-validation

– LDA trained on 90% of data
– discriminant axis shows “wobble” of approx. 10°

§ Discriminant scores from c.v. (10% test data per fold)
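The cross-validation scheme above can be sketched as follows: the texts are split into ten folds, the discriminant is fitted on nine of them, and the held-out tenth is scored. The helper names are hypothetical, and the angle function is one plausible way to quantify the "wobble" between two discriminant axes; the slides do not say how the 10° figure was computed.

```python
import math
import random


def kfold_indices(n, k=10, seed=42):
    """Yield (train, test) index lists for k-fold cross-validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        yield train, held_out


def angle_degrees(u, v):
    """Angle between two axes in degrees, ignoring the sign of direction."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    cosine = min(1.0, abs(dot) / (nu * nv))
    return math.degrees(math.acos(cosine))


# 298 is the number of CroCo texts given on the case-study slide.
splits = list(kfold_indices(298, k=10))
```

Each text lands in exactly one test fold, so every discriminant score in the cross-validated plot comes from a model that never saw that text during training.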

[Figures: projections of the CroCo texts for CV fold #1 to CV fold #5. Legend: language (DE, EN) and status (orig, trans)]

SLIDE 36

Cross-validated discriminant

[Figure: two density plots of discriminant scores for DE: orig, DE: trans, EN: orig and EN: trans, with poles labelled English and German. Panel "LDA on full data set": annotated values –1.1 and +1.3, acc = 76.8%. Panel "10-fold cross-validation": acc = 73.8%]

SLIDE 37

Interpreting discriminant features

[Figure: normalized feature weights of the EN / DE discriminant, weight axis from −0.2 to 0.2. Features: nn_T, adja_T, nominal_T, finites_S, past_F, passive_V, modals_V, imperatives_S, interrogatives_S, coordination_T, subordination_T, pronouns_T, place.adv_T, time.adv_T, adv.theme_TH, text.theme_TH, obj.theme_TH, verb.theme_TH, subj.theme_TH, prep_T, modal.adv_T, contractions_T, colloquialism_T, titles_T, lexical.density, lexical.TTR, token_S]

slide-38
SLIDE 38

Interpreting discriminant features

[Figure: feature values by group (DE, EN) for the DE / EN discriminant (original texts), one panel per feature: nn / T, adja / T, nominal / T, finites / S, past / F, passive / V, modals / V, imperatives / S, interrogatives / S, coordination / T, subordination / T, pronouns / T, place adv / T, time adv / T, adv theme / TH, text theme / TH, obj theme / TH, verb theme / TH, subj theme / TH, prep / T, modal adv / T, contractions / T, colloquialism / T, titles / T, lexical density, lexical TTR, token / S]

slide-39
SLIDE 39

Interpreting discriminant features

[Figure: contribution to axis scores by group (DE, EN) for the DE / EN discriminant (original texts), one panel per feature; features marked (−) in the plot: nn / T, nominal / T, finites / S, past / F, passive / V, modals / V, imperatives / S, interrogatives / S, subordination / T, text theme / TH, prep / T; unmarked features: adja / T, coordination / T, pronouns / T, place adv / T, time adv / T, adv theme / TH, obj theme / TH, verb theme / TH, subj theme / TH, modal adv / T, contractions / T, colloquialism / T, titles / T, lexical density, lexical TTR, token / S]
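A small sketch may help relate feature weights to contributions to axis scores. It rests on assumptions that the slides do not confirm: weights are normalized to unit Euclidean length, feature values are already standardized, and a text's discriminant score is the sum of weight times value over all features. The feature names are taken from the plots; the numbers and the helpers `normalize` and `contributions` are invented for illustration.

```python
import math

def normalize(weights):
    """Rescale a weight vector to unit Euclidean length (assumed normalization)."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {feat: w / norm for feat, w in weights.items()}

def contributions(weights, features):
    """Per-feature contribution to the axis score: weight times feature value."""
    return {feat: weights[feat] * features[feat] for feat in weights}

# hypothetical raw weights and standardized feature values for one text
raw = {"passive_V": -3.0, "pronouns_T": 4.0, "modals_V": 0.0}
text = {"passive_V": 1.0, "pronouns_T": 0.5, "modals_V": 2.0}

w = normalize(raw)
contrib = contributions(w, text)
score = sum(contrib.values())  # the discriminant score is the sum of contributions
for feat, c in contrib.items():
    print(f"{feat:12s} weight {w[feat]:+.2f}  contribution {c:+.2f}")
print(f"axis score {score:+.2f}")
```

Under these assumptions a feature with a large weight still contributes little if its values barely differ from the mean, which is one reason to inspect contributions alongside the weights.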

slide-40
SLIDE 40

Unravelling translationese

[Figure: LDA for trans vs. orig in each language; panels: German, English; legend: DE, EN, orig, trans]

slide-41
SLIDE 41

Case study 2: French regional varieties

§ Lexical differences in regional varieties of French

§ Two nation-wide newspapers each from 6 countries

– Cameroon, France, Ivory Coast, Morocco, Senegal, Tunisia
– two consecutive volumes from each newspaper
– total size approx. 14.5 million tokens

§ Text samples = one week each

§ Features: frequencies of shared colligations

– colligation = lemma-function pairs
– must occur in all subcorpora with f ≥ 100

(Diwersy, Evert & Neumann 2014)
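The feature selection criterion (a colligation must occur in all subcorpora with f ≥ 100) can be sketched as a set intersection. This is only an illustration: the subcorpus keys, the example colligations and their counts are invented, and `shared_features` is a hypothetical helper rather than code from the study.

```python
from collections import Counter

def shared_features(subcorpora, min_freq=100):
    """Keep features whose frequency is >= min_freq in every subcorpus."""
    shared = None
    for counts in subcorpora.values():
        frequent = {feat for feat, f in counts.items() if f >= min_freq}
        shared = frequent if shared is None else shared & frequent
    return sorted(shared or [])

# invented counts of (lemma, function) pairs in three hypothetical subcorpora
subcorpora = {
    "FRA": Counter({("président", "subj"): 250, ("village", "obj"): 40, ("pays", "obj"): 130}),
    "SEN": Counter({("président", "subj"): 310, ("village", "obj"): 180, ("pays", "obj"): 100}),
    "TUN": Counter({("président", "subj"): 120, ("pays", "obj"): 105}),
}

features = shared_features(subcorpora)
print(features)
```

In this toy example the pair ("village", "obj") is dropped because it is too rare in one subcorpus and absent from another, so it could only separate subcorpora by design.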

slide-42
SLIDE 42

FRV: poor choice of features

[Figure: PCA dimensions of the text samples; newspapers: MUTA, TRIB, FRAT, VOIE, LFI, LM, AJD, MAT, SOL, WALFA, LAPRE, TEMPS; countries: CAM, CIV, FRA, MAR, SEN, TUN]

PCA not excluding country-specific words as features: perfect separation

Design bias results in a completely uninteresting model

FA not applicable: features >> texts
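For readers who want to see what a PCA dimension is computationally, here is a minimal sketch of the first principal component found by power iteration. It is not the implementation behind the plots; the data matrix is a toy example and `first_principal_component` is a hypothetical helper. The iteration multiplies by the centered data matrix and its transpose instead of building a features × features covariance matrix, which stays cheap when there are far more features than texts.

```python
import math

def first_principal_component(rows, iterations=200):
    """First PCA axis and scores via power iteration (rows = texts, columns = features)."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    v = [1.0 / math.sqrt(p)] * p  # arbitrary unit-length starting direction
    for _ in range(iterations):
        # project every text onto v, then map the projections back to feature space
        proj = [sum(c[j] * v[j] for j in range(p)) for c in centered]
        new = [sum(proj[i] * centered[i][j] for i in range(n)) for j in range(p)]
        norm = math.sqrt(sum(x * x for x in new))
        v = [x / norm for x in new]
    scores = [sum(c[j] * v[j] for j in range(p)) for c in centered]
    return v, scores

# toy matrix: 4 texts x 3 features; the third feature is constant
rows = [[0.0, 0.0, 1.0], [1.0, 2.0, 1.0], [2.0, 4.0, 1.0], [3.0, 6.0, 1.0]]
axis, scores = first_principal_component(rows)
print([round(a, 3) for a in axis])
print([round(s, 3) for s in scores])
```

The sketch assumes the starting direction is not orthogonal to the first component; a production implementation would use a library routine based on singular value decomposition.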

slide-43
SLIDE 43

FRV: PCA dimensions

[Figure: PCA dimensions of the text samples using only shared words as features; newspapers: MUTA, TRIB, FRAT, VOIE, LFI, LM, AJD, MAT, SOL, WALFA, LAPRE, TEMPS; countries: CAM, CIV, FRA, MAR, SEN, TUN]

Using only shared words as features, PCA no longer reveals any patterns (just a few outliers)

Use LDA to find a meaningful perspective, based on newspaper source

Grouping by country would presuppose that regional varieties exist!

slide-44
SLIDE 44

FRV: LDA dimensions (newspapers)

[Figure: LDA dimensions (newspapers); newspapers: MUTA, TRIB, FRAT, VOIE, LFI, LM, AJD, MAT, SOL, WALFA, LAPRE, TEMPS; countries: CAM, CIV, FRA, MAR, SEN, TUN]

slide-45
SLIDE 45

FRV: LDA dimensions (newspapers)

[Figure: LDA dimensions (newspapers), axes from −10 to 10; newspapers: MUTA, TRIB, FRAT, VOIE, LFI, LM, AJD, MAT, SOL, WALFA, LAPRE, TEMPS; countries: CAM, CIV, FRA, MAR, SEN, TUN]

slide-46
SLIDE 46

FRV: discriminant axes

[Figure: six panels titled CAM, CIV, FRA, MAR, SEN, TUN; x-axis: discriminant score, y-axis: density; each panel shows density curves for CAM, CIV, FRA, MAR, SEN, TUN]

slide-47
SLIDE 47

References

Biber, Douglas (1988). Variation Across Speech and Writing. Cambridge University Press, Cambridge.

Diwersy, Sascha; Evert, Stefan; Neumann, Stella (2014). A weakly supervised multivariate approach to the study of language variation. In B. Szmrecsanyi & B. Wälchli (eds.), Aggregating Dialectology, Typology, and Register Analysis. Linguistic Variation in Text and Speech. De Gruyter, Berlin.

Evert, Stefan; Neumann, Stella (2017). The impact of translation direction on the characteristics of translated texts: a multivariate analysis for English and German. In G. De Sutter, M.-A. Lefer & I. Delaere (eds.), Empirical Translation Studies. New Theoretical and Methodological Traditions (TiLSM 300), pages 47–80. Mouton de Gruyter, Berlin.

Gasthaus, Jan (2007). Prototype-Based Relevance Learning for Genre Classification. B.Sc. thesis, Universität Osnabrück, Institute of Cognitive Science.

Koppel, Moshe; Argamon, Shlomo; Shimoni, Anat R. (2003). Automatically categorizing written texts by author gender. Literary and Linguistic Computing, 17(4), 401–412.

Neumann, Stella (2013). Contrastive Register Variation. A Quantitative Approach to the Comparison of English and German. de Gruyter Mouton, Berlin.

Teich, Elke (2003). Cross-linguistic variation in system and text. A methodology for the investigation of translations and comparable texts. Mouton de Gruyter, Berlin.

Toury, Gideon (2012). Descriptive Translation Studies – and beyond: Revised edition. 2nd ed. John Benjamins, Amsterdam.