
SLIDE 1

2017-02-22 / Jurish, Gerstenberg / APWCF + DiaCollo [1]


Exploring the internal heterogeneity of a corpus of Classical French with DiaCollo

Bryan Jurish, Berlin-Brandenburgische Akademie der Wissenschaften, jurish@bbaw.de

Annette Gerstenberg, Freie Universität Berlin, annette.gerstenberg@fu-berlin.de

Global Philology Open Conference, Universität Leipzig, 22nd February, 2016

SLIDE 2


Overview

The Situation

- Diachronic (heterogeneous) Text Corpora
- Collocation Profiling
- Diachronic Collocation Profiling

DiaCollo

- Requests & Parameters
- Profile, Diffs & Indices

APWCF Corpus

- Background
- Subcorpora
- Sources & Enrichments

Examples

Summary & Conclusion

SLIDE 3


The Situation: Diachronic Text Corpora

- heterogeneous text collections, especially with respect to date of origin
  - other partitionings may be relevant too, e.g. by genre, location, etc.
- increasing number available for linguistic & humanities research, e.g.
  - Deutsches Textarchiv (DTA) (Geyken 2013)
  - Referenzkorpus Altdeutsch (DDD) (Richling 2011)
  - Corpus of Historical American English (COHA) (Davies 2012)
- ... but even putatively “synchronic” corpora have a temporal extension, e.g.
  - DWDS/ZEIT (Kohl ∼ politician vs. “cabbage”) (1946–2016)
  - DDR Presseportal (Ausreise ∼ “departure”) (1945–1993)
  - DWDS/Blogs (“Browser”) (1994–2016)
- should expose temporal effects of e.g. semantic shift, discourse trends
- problematic for conventional natural language processing tools
  - implicit assumptions of homogeneity

SLIDE 4


The Situation: Collocation Profiling

“You shall know a word by the company it keeps” (J. R. Firth)

Basic Idea (Church & Hanks 1990; Manning & Schütze 1999; Evert 2005)

- look up all candidate collocates (w2) occurring with the target term (w1)
- rank candidates by association score
  - “chance” co-occurrences with high-frequency items must be filtered out!
  - statistical methods require a large data sample

What for?

- computational lexicography (Kilgarriff & Tugwell 2002; Didakowski & Geyken 2013)
- neologism detection (Kilgarriff et al. 2015)
- distributional semantics (Schütze 1992; Sahlgren 2006)
- “text mining” / “distant reading” (Heyer et al. 2006; Moretti 2013)

SLIDE 5


Diachronic Collocation Profiling

The Problem: (temporal) heterogeneity

- conventional collocation extractors assume corpus homogeneity
- co-occurrence frequencies are computed only for word-pairs (w1, w2)
- influence of occurrence date (and other document properties) is irrevocably lost

A Solution (sketch)

- represent terms as n-tuples of independent attributes (including occurrence date)
  - alternative: “document” level co-occurrences over sparse TDF matrix
- partition corpus on-the-fly into user-specified intervals (“date slices”, “epochs”)
- collect independent slice-wise profiles into final result set

Advantages

- full support for diachronic axis
- variable query-level granularity
- flexible attribute selection
- multiple association scores

Drawbacks

- sparse data requires larger corpora
- computationally expensive
- large index size
- no syntactic relations (yet)
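The solution sketched above can be illustrated with a toy example. The following is a minimal sketch only: the `(year, tokens)` document format, the function names, and the sample data are all invented for illustration and do not reflect DiaCollo's actual index structures. It partitions a corpus into date slices on the fly and collects an independent document-level co-occurrence profile for each slice.

```python
from collections import Counter, defaultdict


def slice_of(year: int, slice_size: int) -> int:
    """Map a year to the first year of its date slice (0 means one single slice)."""
    if slice_size == 0:
        return 0
    return (year // slice_size) * slice_size


def slice_profiles(docs, target, slice_size):
    """Count document-level co-occurrences with `target`, separately per date slice.

    `docs` is an iterable of (year, tokens) pairs; this toy format is an
    assumption of the sketch, not DiaCollo's index format.
    """
    profiles = defaultdict(Counter)
    for year, tokens in docs:
        types = set(tokens)
        if target not in types:
            continue
        epoch = slice_of(year, slice_size)
        for w2 in types - {target}:
            profiles[epoch][w2] += 1
    return dict(profiles)


# Invented toy corpus: (year, tokens)
docs = [
    (1643, ["paix", "traité", "roi"]),
    (1644, ["paix", "guerre"]),
    (1652, ["paix", "traité"]),
    (1655, ["guerre", "roi"]),
]

by_decade = slice_profiles(docs, "paix", 10)   # one profile per decade
date_independent = slice_profiles(docs, "paix", 0)  # a single profile
```

Because the slice size is a call-time argument, the same data can be profiled at any granularity without rebuilding anything, which is the point of the on-the-fly partitioning described above.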

SLIDE 6


DiaCollo: Requests & Parameters

- Perl API, RESTful web-service (Fielding 2000) + web-form GUI
- accepts user requests as set of parameter=value pairs
- parameter passing via URL query string or HTTP POST request
- common parameters:

Parameter   Description
query       target collocant lemma(ta), regular expression, or DDC query
date        target date(s), interval, or regular expression
slice       epoch granularity or “0” (zero) for a date-independent profile
groupby     projected collocate attributes with optional restrictions
score       association score function for collocate ranking
kbest       maximum number of collocate items to return per epoch
diff        binary score comparison operation for diff profiles
global      request global profile pruning (vs. default epoch-local pruning)
profile     profile type to be computed ({native,tdf,ddc} × {unary,diff})
format      output format or visualization mode (e.g. TSV, JSON, HTML, d3-cloud, ...)
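As a rough illustration of passing parameter=value pairs via the URL query string, the sketch below serializes a request with Python's standard library. The base URL is the APWCF instance cited in the examples; the parameter names come from the table above, but the concrete values (`paix`, `json`, etc.) are hypothetical. Whether the service accepts exactly these spelled-out names and values is an assumption, since the example URLs on the later slides use abbreviated forms such as `sf` and `f`. No request is actually sent.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Base URL of the APWCF DiaCollo instance, as cited in the examples.
BASE_URL = "http://kaskade.dwds.de/dstar/apwcf/diacollo/"


def build_request_url(base_url: str, params: dict) -> str:
    """Serialize parameter=value pairs into a URL query string."""
    return base_url + "?" + urlencode(params)


# Hypothetical request: names from the parameter table, values assumed.
params = {
    "query": "paix",   # target term (invented example)
    "slice": 0,        # 0 = date-independent profile
    "score": "ld",     # log-Dice
    "kbest": 10,       # at most 10 collocates per epoch
    "format": "json",  # assumed spelling of the JSON output mode
}

url = build_request_url(BASE_URL, params)
decoded = parse_qs(urlsplit(url).query)
print(url)
```

A plain dict is used rather than keyword arguments because `global` is a reserved word in Python and could not be passed as a keyword.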

SLIDE 7


DiaCollo: Profiles, Diffs & Indices

Profiles & Diffs

- simple request → unary profile for target term(s) (profile, query)
  - filtered & projected to selected attribute(s) (groupby)
  - trimmed to k-best collocates for target word(s) (score, kbest, global)
  - aggregated into independent epoch-wise sub-intervals (date, slice)
- diff request → comparison of two independent targets (profile, bquery, ...)
  - highlights differences or similarities of target queries (diff)
  - can be used to compare different words (query ≠ bquery) ... or different corpus subsets w.r.t. a given word (e.g. date ≠ bdate)

Indices & Attributes

- compile-time filtering of native indices: frequency thresholds, PoS-tags
- default index attributes: Lemma (l), Pos (p)
- finer-grained queries possible with TDF or DDC back-ends
- batteries not included: corpus preprocessing, analysis, & full-text search index
  - see e.g. Jurish (2003); Geyken & Hanneforth (2006); Jurish et al. (2014), ...
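The k-best trimming step of a unary profile, in its default epoch-local form, might look roughly like the sketch below. The nested-dict profile layout and the function name are assumptions made for illustration; the slides do not specify how DiaCollo represents profiles internally or how global pruning is carried out.

```python
def trim_kbest(profile: dict, kbest: int) -> dict:
    """Keep the `kbest` highest-scoring collocates of each epoch independently.

    `profile` maps epoch -> {collocate: score}; this layout is an assumption
    of the sketch.
    """
    trimmed = {}
    for epoch, scores in profile.items():
        ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
        trimmed[epoch] = dict(ranked[:kbest])
    return trimmed


# Invented scores for two epochs
profile = {
    1640: {"a": 3.0, "b": 1.0, "c": 2.0},
    1650: {"a": 0.5},
}
pruned = trim_kbest(profile, 2)
```

Each epoch is ranked on its own, so a collocate can survive in one epoch and be dropped in another.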

SLIDE 8


DiaCollo: Scoring & Comparison Functions

Selected Association Score Functions

- f: raw collocation frequency = $f_{12}$
- lf: collocation log-frequency = $\log_2(f_{12} + \varepsilon)$
- mi1: pointwise mutual information ≈ $\log_2 \frac{f_{12} \times N}{f_1 \times f_2}$
- milf: pointwise MI × log-frequency ≈ $\log_2 \frac{f_{12} \times N}{f_1 \times f_2} \times \log_2 f_{12}$
- ll: log-likelihood (Dunning 1993) ≈ $\mathrm{sgn}(f_{12} \mid f_1, f_2) \times \log(1 + \log \lambda)$
- ld: log-Dice coefficient (Rychlý 2008) ≈ $14 + \log_2 \frac{2 \times f_{12}}{f_1 + f_2}$

Selected Diff Operations

- diff: raw score difference = $s_a - s_b$
- adiff: absolute score difference = $|s_a - s_b|$
- avg: arithmetic average = $\frac{1}{2}(s_a + s_b)$
- max: maximum = $\max\{s_a, s_b\}$
- min: minimum = $\min\{s_a, s_b\}$
- havg: harmonic average ≈ $\frac{2 \times s_a \times s_b}{s_a + s_b}$
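Some of the formulas above translate directly into code. The sketch below implements mi1, milf, ld and the diff operations exactly as written on this slide. The slide marks several of them with "≈", so DiaCollo's own implementation may add smoothing or other adjustments that are not reproduced here. The log-likelihood score (ll) is omitted because λ is not defined on the slide, and lf is omitted because the value of ε is not given.

```python
import math


def mi1(f12: float, f1: float, f2: float, n: float) -> float:
    """Pointwise mutual information: log2(f12 * N / (f1 * f2))."""
    return math.log2((f12 * n) / (f1 * f2))


def milf(f12: float, f1: float, f2: float, n: float) -> float:
    """Pointwise MI weighted by the log-frequency of the pair."""
    return mi1(f12, f1, f2, n) * math.log2(f12)


def ld(f12: float, f1: float, f2: float) -> float:
    """Log-Dice coefficient: 14 + log2(2 * f12 / (f1 + f2))."""
    return 14 + math.log2((2 * f12) / (f1 + f2))


# Diff operations on two scores sa, sb, keyed by the names on the slide.
DIFF_OPS = {
    "diff": lambda sa, sb: sa - sb,
    "adiff": lambda sa, sb: abs(sa - sb),
    "avg": lambda sa, sb: (sa + sb) / 2,
    "max": max,
    "min": min,
    # Undefined for sa + sb == 0; the sketch does not guard against that.
    "havg": lambda sa, sb: 2 * sa * sb / (sa + sb),
}
```

For instance, `DIFF_OPS["diff"](sa, sb)` is positive when a collocate is more strongly associated with the first target than with the second.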

SLIDE 9


APWCF: From diplomatic correspondence to a corpus of Classical French

- Classical French: same structures as modern French, but
  - linguistic norm is not yet stable
  - variation and patterns of usage of grammatical features
  - linguistic change on the levels of semantics and pragmatics
- Acta Pacis Westphalicae (1643–1648): The French correspondence
  - Diplomatic letters between Paris (government) and diplomats at Münster
  - Ambassadors are committed to achieving diplomatic goals
    convincing the government to adapt the instructions
  - Diplomatic letters: formal constraints versus expressive needs
- Linguistic interest
  - Diachronic variation: comparison with existing resources of Classical French
  - Synchronic variation: genre-internal heterogeneity
- Genre-internal heterogeneity: hypothesis of different levels of formality
  - Two subcorpora: “government” (Paris) and “ambassadors” (Münster)
  - Register-variation reflected in the use of linguistic variables

SLIDE 10


APWCF: Correspondence

SLIDE 11


APWCF: Subcorpora

SLIDE 12


APWCF: Data

- Letters mostly conserved in French Archives
  - Archives du Ministère des Affaires Etrangères
  - microforms: Zentrum für Historische Forschung, Bonn
- Digital edition (PDF, XML)
  - Bayerische Staatsbibliothek
  - Zentrum für historische Forschung Bonn
- Linguistic corpus: AG
  - Part-of-Speech Tagging (PRESTO, Cologne/Lyon)
  - XML / TXM (Lyon)
- Corpus size: 8 volumes of French edition, 2.4M Tokens

SLIDE 13


APWCF: Transcription

Diplomatic transcription, spelling variants preserved

- traittés vs. traittéz vs. traittez
- estat (old) or état (mod.) as appearing in the manuscript
- Punctuation almost preserved, but...

Modernized

- Some adaptations of punctuation
- u/v, i/j modernized
- Capitalization of proper names and titles
- Diacritics normalized (lavis → l’avis, francais → français)
- Abbreviated titles/words: full form

SLIDE 14


Examples

SLIDE 15


Example 1: ledict (chancellery style)

http://kaskade.dwds.de/dstar/apwcf/diacollo/?as=0&bs=0&p=d1&sf=mi1&f=html ...

query: doc.loc=Paris
slice: 0
∼query: doc.loc=Münster
∼slice: 0
score: mi1
groupby: w=ledict

                         Paris       Münster
N                        2,462,443   2,462,443
f1                       746,786     1,153,939
f2 = f12                 284         414
score (mi1)              1.721       1.093
diff (Paris - Münster)   0.6278

- simple example uses “unigrams” comparison profile (f2 = f12)
- pointwise mutual information (mi1) score function
- “Paris” shows a definite preference for the archaic form ledict (chancellery style)
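The figures in this example can be checked by hand. Since f2 = f12 in the unigram comparison profile, the mi1 formula log2(f12 × N / (f1 × f2)) reduces to log2(N / f1). The sketch below recomputes both scores and their difference from the frequencies in the table. It uses only the formula as given on the scoring slide, so it is a plausibility check rather than a reproduction of DiaCollo's internal computation; the results agree with the reported values to within about 0.001.

```python
import math

N = 2_462_443  # corpus size, same for both subcorpus queries

paris = {"f1": 746_786, "f12": 284}
muenster = {"f1": 1_153_939, "f12": 414}


def mi1_unigram(f1: int, f12: int, n: int) -> float:
    """mi1 with f2 = f12, which simplifies to log2(n / f1)."""
    f2 = f12
    return math.log2((f12 * n) / (f1 * f2))


score_paris = mi1_unigram(paris["f1"], paris["f12"], N)
score_muenster = mi1_unigram(muenster["f1"], muenster["f12"], N)
score_diff = score_paris - score_muenster

print(f"Paris:   {score_paris:.4f}")
print(f"Münster: {score_muenster:.4f}")
print(f"diff:    {score_diff:.4f}")
```

Note that with f2 = f12 the score no longer depends on the frequency of ledict itself, only on the relative size of each subcorpus query.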

SLIDE 16


Example 2: PLAIRE (situational context)

http://kaskade.dwds.de/dstar/apwcf/diacollo/?as=0&bs=0&p=d1&sf=ll&f=cloud ...

query: doc.loc=Paris
slice: 0
∼query: doc.loc=Münster
kbest: 25
score: ll
groupby: w,l=PLAIRE

[Attribute cloud: plaira, plairoit, plaisoit, plait, plaist, plue, plaire, plaisant, plaisante, plaît, plaise, plaisent, plaict]

- situational context: lemma PLAIRE, e.g. s’il vous plaît (“please”)
- log-likelihood association scores (robust)
- attribute-cloud visualization: warm colors ∼ Paris, cool colors ∼ Münster

SLIDE 17


Example 3: plaît vs. plaist (orthographic variation)

http://kaskade.dwds.de/dstar/apwcf/diacollo/?as=0&bs=0&p=d1&sf=ll&f=cloud ...

query: doc.loc=Paris
slice: 0
∼query: doc.loc=Münster
score: ll
groupby: w=plaît|plaist

- zoom: “plaist/plaît” from “s’il vous plaîst”
- more frequent in Münster
- relatively more frequent use of the archaic variant plaist
- transcription in general respects orthographic variation
  - typically not transparent in historical editions

SLIDE 18


Example 4: POUVOIR (request & response)

http://kaskade.dwds.de/dstar/apwcf/diacollo/?as=0&bs=0&p=d1&sf=ll&f=cloud ...

query: doc.loc=Paris
slice: 0
∼query: doc.loc=Münster
kbest: 25
score: ll
groupby: w,l=POUVOIR

[Attribute cloud: pouvant, pouvions, peust, pouvons, pourrions, pouvez, peuvent, peult, pourrons, pourroit, pourrez, pouvoir, puissiez, pouvoirs, puissions, pourront, puisse, pussent, puissent, pu, pourra, peut, pouvois, pouvans, pouvoient]

- Paris: request “you could” (puissiez/pourrez)
- Münster: response “we could/will be able” (pouvions/pourrons)

SLIDE 19


Example 5: Speech Acts

[Attribute cloud: parlasmes, escrivit, parleront, parlois, escriroit, communiquer, escriptes, écrire, parlera, parloit, escrittes, parlons, escrivismes, escript, escrive, escrira, escrivoit, escrivons, escriroient, parlions, parlerons, escriront, discuté, communiqué, communicqué, parlent, escris, parlé, escriray, escrite, escrivois, escrivisse, escrivis, escrivés, escripre, parlèrent, escrivez, discuter, parliez, parleray, comuniquer, parloient, escripte, escrivent, escrirons, escrites, parla, parlay, parlant, escriviez]

- Diplomatic negotiations: overt speech act verbs
- Paris: discuter, discuté, escrivez, escrivois, ...
- Münster: comuniquer, escrivons, escrivismes, parlasmes, parlèrent, ...

SLIDE 20


Summary & Conclusion

Diachronic Collocation Profiling

- diachronic text corpora: semantic shift, discourse trends
- conventional tools: implicit assumptions of homogeneity
- diachronic profiling: date-dependent lexemes

DiaCollo

- on-the-fly corpus partitioning: arbitrary query granularity
- DDC/D* integration: fine-grained queries, corpus KWIC links
- RESTful web service: external API, online visualization

APWCF + DiaCollo

- metadata-based filtering: location-specific profiles
- “diff” profile mode: inter-subcorpus comparisons
- metadata-based aggregation: subcorpus preference profiles

SLIDE 21


— The End —

[Word cloud: treu, wirklich, lieb, herzlich, lächeln, gut, schön, persönlich, warm, letzte, lieb, danken, klein, glücklich, kurz, liebenswürdig, jung, ganz, freundschaftlich, freundlich]

Thank you for listening!

http://kaskade.dwds.de/dstar/apwcf/diacollo

http://metacpan.org/release/DiaColloDB

APWCF: http://wikis.fu-berlin.de/pages/viewpage.action?pageId=594936338