Using DiaCollo for Historical Research (Bryan Jurish, Maret Nieländer)



SLIDE 1

Using DiaCollo for Historical Research

Bryan Jurish, Berlin-Brandenburg Academy of Sciences and Humanities, Berlin (jurish@bbaw.de)
Maret Nieländer, Georg Eckert Institute for International Textbook Research, Braunschweig (nielaender@leibniz-gei.de)

CLARIN Annual Conference 2019, Leipzig, Germany, 1st October, 2019

SLIDE 2

2019-10-01 / CAC-2019 / Jurish & Nieländer

Overview

  • Collaborative software development
  • Corpora & collocations
  • DiaCollo: diachronic collocation profiling
  • Use case: Education policy in Die Grenzboten
  • Summary & conclusion
SLIDE 3

Software Development Cycle

Planning

  • identify desiderata & bugs
  • sketch next steps

Implementation

  • coding & documentation
  • release & deployment
SLIDE 4

Corpora & Collocations

Diachronic Text Corpora

  • heterogeneous with respect to date of origin
  • should expose temporal effects of e.g. semantic shift, discourse trends
  • problematic for conventional NLP tools (which assume homogeneity)

Collocation Profiling

(Church & Hanks 1990; Manning & Schütze 1999; Evert 2005)

“You shall know a word by the company it keeps” (J. R. Firth)

  • prompt user for target collocant term(s) of interest ($w_1$)
  • look up all candidate collocates ($w_2$) co-occurring with $w_1$
  • rank candidates by association score
      • score function $\varphi(f_1, f_2, f_{12}, N)$ approximates relevance of $w_2$ to $w_1$
      • “chance” co-occurrences with high-frequency $w_2$ should be filtered out!
      • statistical method requires large data sample
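
The three profiling steps above can be illustrated with a minimal sketch. It assumes sentence-level co-occurrence and plain pointwise mutual information as the score function; both are simplifications chosen for brevity, not DiaCollo's actual implementation. The function name `rank_collocates` and the toy sentences are hypothetical.

```python
import math
from collections import Counter

def rank_collocates(sentences, target, k=3):
    """Rank words co-occurring with `target` by pointwise mutual information."""
    unigram = Counter()  # f1 / f2: number of sentences containing each word
    pair = Counter()     # f12: number of sentences containing target and w2
    for sent in sentences:
        types = set(sent)
        unigram.update(types)
        if target in types:
            pair.update(types - {target})
    n = len(sentences)   # N: sample size (here: number of sentences)
    f1 = unigram[target]
    scored = []
    for w2, f12 in pair.items():
        pmi = math.log2((f12 * n) / (f1 * unigram[w2]))
        scored.append((w2, pmi))
    # Highest score first; ties broken alphabetically for reproducibility
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:k]

# Invented toy sample, far too small for real statistics
toy = [
    ["die", "schule", "und", "die", "kirche"],
    ["die", "schule", "im", "kloster"],
    ["die", "kirche", "im", "dorf"],
    ["die", "stadt", "und", "der", "markt"],
]
print(rank_collocates(toy, "schule"))
```

In this toy sample the high-frequency word "die" co-occurs with the target twice but scores 0, because it occurs in every sentence anyway; this is the "chance" co-occurrence effect that an association score is meant to filter out.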
SLIDE 5

Diachronic Collocation Profiling

The Problem: (temporal) heterogeneity

  • conventional collocation extractors assume corpus homogeneity
  • co-occurrence frequencies are computed only for word-pairs ($w_1$, $w_2$)
  • influence of occurrence date (and other document properties) is irrevocably lost

A Solution (sketch)

  • represent terms as n-tuples of independent attributes, including occurrence date
  • partition corpus on-the-fly into user-specified intervals (“date slices”, “epochs”)
  • collect independent epoch-wise profiles into final result set

Advantages

  • full support for diachronic axis
  • variable query-level granularity
  • flexible attribute selection
  • multiple association scores

Drawbacks

  • sparse data requires larger corpora
  • computationally expensive
  • large index size
  • no syntactic relations (yet)
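
The solution sketched on this slide (date-stamped terms, on-the-fly date slices, one profile per epoch) might look roughly like the following. This is a minimal sketch under stated assumptions: each document is one co-occurrence window, raw frequency stands in for the association score, and the function name `epoch_profiles` and the toy documents are invented for illustration.

```python
from collections import Counter, defaultdict

def epoch_profiles(docs, target, slice_size=10, k=2):
    """Return {epoch_start: top-k (collocate, f12)} for `target`.

    `docs` is an iterable of (year, tokens) pairs. Treating each document
    as one co-occurrence window is an assumption made for brevity.
    """
    counts = defaultdict(Counter)
    for year, tokens in docs:
        types = set(tokens)
        if target in types:
            epoch = year - year % slice_size  # e.g. 1563 -> 1560
            counts[epoch].update(types - {target})
    profiles = {}
    for epoch in sorted(counts):
        ranked = sorted(counts[epoch].items(), key=lambda kv: (-kv[1], kv[0]))
        profiles[epoch] = ranked[:k]
    return profiles

# Invented toy documents, not corpus data
docs = [
    (1563, ["Schule", "Kloster", "Kirche"]),
    (1567, ["Schule", "Kloster", "Pfarrherr"]),
    (1712, ["Schule", "Inspektor", "Universität"]),
    (1718, ["Schule", "Inspektor", "preußisch"]),
]
for epoch, profile in epoch_profiles(docs, "Schule").items():
    print(epoch, profile)
```

Because the slice size is a call-time parameter, the same data can be re-profiled at a different granularity (for example `slice_size=50`) without re-indexing, which is the point of partitioning on-the-fly. The sparse-data drawback is also visible: each epoch only sees its own share of the documents.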
SLIDE 6

DiaCollo: Development

Planning & Evaluation

  • in collaboration with DWDS lexicographers & CLARIN-D historians

Implementation

  • Perl+PDL API, CLI, client/server
      • RESTful D* web-service + GUI
  • various output & visualization formats, e.g.
      • TSV, JSON, HTML, Highcharts, d3-cloud, ...
  • batteries not included
      • tokenization, annotation, full-text search, ...
  • garbage in, garbage out
      • “messy” corpora yield unsatisfying results

Deployment

  • successfully applied to 70 distinct curated corpora at the BBAW, including:
      • Royal Society Philosophical Transactions (1665–1869, 9.8K documents, 35M tokens)
      • Deutsches Textarchiv (1600–1900, 3.6K documents, 205M tokens)
      • DWDS Zeitungen (1946–2019, 16M documents, 6.3G tokens)

SLIDE 7

DiaCollo: Scoring & Comparison Functions

Selected Score Functions

  • f: raw collocation frequency, $= f_{12}$
  • lf: collocation log-frequency, $= \log_2(f_{12} + \varepsilon)$
  • mi: pointwise MI × log-frequency, $\approx \log_2\frac{f_{12} \times N}{f_1 \times f_2} \times \log_2 f_{12}$
  • ll: log-likelihood (Dunning 1993), $\approx \operatorname{sgn}(f_{12} \mid f_1, f_2) \times \log\frac{L(H_0)}{L(H_1)}$
  • ld: log-Dice coefficient (Rychlý 2008), $\approx 14 + \log_2\frac{2 \times f_{12}}{f_1 + f_2}$
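
For concreteness, four of the score functions listed above can be written out directly from the slide's formulas. This is a minimal sketch, not DiaCollo's code: the value of the smoothing constant ε is an assumption, the function names are hypothetical, and the log-likelihood score is omitted because the slide gives it only approximately.

```python
import math

EPS = 0.5  # smoothing constant ε; this value is an assumption

def score_f(f12):
    """Raw collocation frequency."""
    return f12

def score_lf(f12, eps=EPS):
    """Collocation log-frequency."""
    return math.log2(f12 + eps)

def score_mi(f1, f2, f12, n):
    """Pointwise MI weighted by log-frequency (requires f12 > 0)."""
    return math.log2((f12 * n) / (f1 * f2)) * math.log2(f12)

def score_ld(f1, f2, f12):
    """Log-Dice coefficient (requires f12 > 0)."""
    return 14 + math.log2((2 * f12) / (f1 + f2))

# Invented counts for illustration
f1, f2, f12, n = 1000, 500, 64, 1_000_000
print(score_f(f12), score_lf(f12), score_mi(f1, f2, f12, n), score_ld(f1, f2, f12))
```

With these counts the PMI factor is log2(128) = 7 and the log-frequency factor is log2(64) = 6, so `score_mi` yields 42. The log-Dice score reaches its maximum of 14 only when the two words always occur together (f1 = f2 = f12), and it does not depend on the corpus size N.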

Selected Diff Operations

  • diff: raw score difference, $= s_a - s_b$
  • adiff: absolute score difference, $= |s_a - s_b|$
  • avg: arithmetic average, $= \frac{s_a + s_b}{2}$
  • max: maximum, $= \max\{s_a, s_b\}$
  • min: minimum, $= \min\{s_a, s_b\}$
  • havg: harmonic average, $\approx \frac{2 s_a s_b}{s_a + s_b}$
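
The diff operations can be read as element-wise functions applied to two score profiles, one per date slice. The sketch below assumes that a collocate missing from one profile is scored 0 and that the harmonic average is 0 when the two scores sum to 0; both conventions, the names `DIFF_OPS` and `compare_profiles`, and the toy scores are illustrative assumptions.

```python
def _havg(a, b):
    # Harmonic average; returning 0.0 when a + b == 0 is an assumption
    return 2 * a * b / (a + b) if a + b != 0 else 0.0

DIFF_OPS = {
    "diff": lambda a, b: a - b,
    "adiff": lambda a, b: abs(a - b),
    "avg": lambda a, b: (a + b) / 2,
    "max": max,
    "min": min,
    "havg": _havg,
}

def compare_profiles(prof_a, prof_b, op="diff", missing=0.0):
    """Apply a diff operation to every collocate found in either profile."""
    fn = DIFF_OPS[op]
    merged = {
        w: fn(prof_a.get(w, missing), prof_b.get(w, missing))
        for w in set(prof_a) | set(prof_b)
    }
    # Largest absolute value first; ties broken alphabetically
    return sorted(merged.items(), key=lambda kv: (-abs(kv[1]), kv[0]))

# Invented scores for two date slices
slice_a = {"Kloster": 6.0, "Kirche": 5.0}
slice_b = {"Kirche": 4.0, "Inspektor": 7.0}
print(compare_profiles(slice_a, slice_b, op="diff"))
```

Under `diff`, collocates typical of only one slice end up at the top with opposite signs, while collocates shared by both slices (here "Kirche") fall toward the bottom. The averaging operations do the reverse and favour collocates that score well in both slices.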

SLIDE 8

Use Case: Education Policy in Die Grenzboten

SLIDE 9

‘Schule’: DiaCollo Query (DTA)

SLIDE 11

‘Schule’: DiaCollo Collocates (DTA: HTML)

1560–1569

  • association with religious institutions
      • Kloster (“cloister”)
      • Pfarrherr (“pastor”)
      • Kirche (“church”)

1710–1719

  • stronger secular associations
      • Inspektor (“inspector”)
      • preußisch (“Prussian”)
      • Universität (“university”)
  • trend continues as time progresses
SLIDE 12

‘Schule’: DiaCollo Collocates (DTA: lemma-cloud)

[Figure: lemma clouds for two date slices, each with a decade timeline (1560–1710) and a score scale from 0.0 to 8.0.]

  • 1560s: Kirche, Schule, Kloster, partikular, Schulmeister, Pfarrherr, Ordnung, Flecken, Knabe, Fleiß
  • 1710s: Kirche, Schule, Jugend, Universität, Lehrer, mechanisch, preußisch, Inspektor, Besserung, Besuchung

SLIDE 13

Die Grenzboten Corpus

http://brema.suub.uni-bremen.de/grenzboten
http://www.deutschestextarchiv.de/doku/textquellen#grenzboten

Image: SuUB Bremen

  • Die Grenzboten (“the messengers from the border(s)”) was a bi-weekly national-liberal German-language periodical published 1841–1922
  • covered a wide range of politics, literature, and the arts throughout the ‘long’ nineteenth century
  • 270 volumes (ca. 187,000 pages) digitized, OCR’ed, and structured by the SuUB Bremen in the context of a DFG project
      • integrated into the corpus research infrastructure of the Deutsches Textarchiv at the BBAW CLARIN Service Center

SLIDE 14

Are Die Grenzboten concerned with education?

Step 1: query corpus vocabulary database (LexDB)

  • identify relevant terms in the corpus, e.g. Schule (“school”), 1840–1899
      • ... in the Deutsches Textarchiv: 101.52 per million tokens
      • ... in Die Grenzboten: 237.29 per million tokens
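
The comparison in Step 1 rests on relative frequencies, which make corpora of different sizes comparable. As a small worked example, the sketch below normalizes a raw count to a per-million figure and compares the two values reported on the slide. The helper `per_million` and the counts passed to it are hypothetical; only the two per-million figures come from the slide.

```python
def per_million(count, corpus_size):
    """Normalize a raw token count to occurrences per million tokens."""
    return count * 1_000_000 / corpus_size

# Hypothetical counts, only to show the normalization itself
print(per_million(50, 2_000_000))

# Per-million frequencies of 'Schule' (1840-1899) as reported on the slide
dta = 101.52
grenzboten = 237.29
ratio = grenzboten / dta
print(f"Schule is about {ratio:.1f} times as frequent in Die Grenzboten")
```

A ratio above 2 for the same period suggests that the periodical devotes noticeably more attention to schooling than the reference corpus does, which motivates the DiaCollo queries in Step 2.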

Step 2: query DiaCollo

  • identify strong collocates for Schule (“school”)
  • identify possible debates in the corpus via query results
  • close reading in the texts via “keyword-in-context” (KWIC) hyperlinks
SLIDE 15

Education Policy & Religion

Collocate ‘Kirche’ (“church”)

  • persistently prominent throughout the entire Grenzboten corpus
  • 1850s–1880s: konfessionell (“confessional”)
  • 1890s–1910s: Religionsunterricht (“religious education”)

Refining the Search

  • restrict to attributive adjective collocates (groupby: l,p=ADJA)
      • protestantisch (“Protestant”): 1860s
      • katholisch (“Catholic”): 1860s–1870s
      • evangelisch (“Protestant, Evangelical”): 1860s–1870s
      • konfessionell (“confessional”): 1860s–1880s
      • kirchlich (“churchly”): 1870s
  • collocates related to church & religious confession peak in the 1860s–1870s
  • also prominent: öffentlich (“public”; 1840s, 1870s–1900s)
      • KWIC: stance of publicly funded schools w.r.t. church influence in education
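
The effect of a restriction such as `l,p=ADJA` can be sketched as a filter over attribute tuples. The sketch assumes that `l` denotes the lemma and `p` the part-of-speech tag, and it handles only simple `name=value` equality; the real DiaCollo groupby syntax may well be richer. The helper names and the candidate list are hypothetical.

```python
def parse_groupby(spec):
    """Parse a spec like 'l,p=ADJA' into attribute names and restrictions."""
    attrs, restrictions = [], {}
    for part in spec.split(","):
        name, _, value = part.strip().partition("=")
        attrs.append(name)
        if value:
            restrictions[name] = value
    return attrs, restrictions

def filter_collocates(candidates, spec):
    """Keep candidates matching all restrictions, projected onto the attributes."""
    attrs, restrictions = parse_groupby(spec)
    kept = []
    for cand in candidates:
        if all(cand.get(name) == value for name, value in restrictions.items()):
            kept.append(tuple(cand[name] for name in attrs))
    return kept

# Invented candidate collocates with lemma (l) and part-of-speech (p) attributes
candidates = [
    {"l": "katholisch", "p": "ADJA"},
    {"l": "Kirche", "p": "NN"},
    {"l": "konfessionell", "p": "ADJA"},
]
print(filter_collocates(candidates, "l,p=ADJA"))
```

Restricting the candidate set before ranking removes the dominant noun collocates, so that adjectives which would otherwise be ranked too low to appear become visible.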
SLIDE 16

Education Policy: Kulturkampf

Kulturkampf (“cultural struggle”)

  • rights & influences of state (Prussia) vs. church (Pope Pius IX)
  • ultramontan (“ultramontane”): staunch supporters of the Catholic Church

[Figure: raw frequency of ultramontan and Kulturkampf by date (year), 1845–1920.]

Refining the Search: GermaNet thesaurus + paragraph search window

(Hamp & Feldweg 1997; Henrich & Hinrichs 2010)

  • corpus hits show evidence for anti-Catholic opinions in debates on education
      • who should be in charge of education and curricula?
      • how to deal with different religious denominations in schools?

Upshot

  • some important aspects of the debate are not apparent from initial naïve DiaCollo queries
  • informed curiosity & focused investigation lead to very satisfying results
SLIDE 17

Summary & Conclusion

Collaborative Development

  • cyclic process: feedback loop
  • elusive common ground: terminology, research methodology

DiaCollo

  • diachronic text corpora: semantic shift, discourse trends
  • conventional tools: implicit assumptions of homogeneity
  • diachronic profiling: date-dependent lexemes

... as a tool for historical research

  • fluent “blended”/“scalable” reading: distant ↔ close reading
  • digital corpora (sources): quantity, quality, legal issues

SLIDE 18

The End

Thank you for listening!

http://kaskade.dwds.de/~jurish/diacollo
http://metacpan.org/release/DiaColloDB

SLIDE 19

References

SLIDE 20

References

  • K. W. Church and P. Hanks. Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1):22–29, 1990.
  • S. Evert. The Statistics of Word Cooccurrences: Word Pairs and Collocations. PhD thesis, Institut für maschinelle Sprachverarbeitung, Universität Stuttgart, 2005. URL http://elib.uni-stuttgart.de/opus/volltexte/2005/2371/.
  • J. R. Firth. Papers in Linguistics 1934–1951. Oxford University Press, London, 1957.
  • B. Hamp and H. Feldweg. GermaNet – a lexical-semantic net for German. In Proceedings of the ACL workshop Automatic Information Extraction and Building of Lexical Semantic Resources for NLP Applications, 1997.
  • V. Henrich and E. Hinrichs. GernEdiT – the GermaNet editing tool. In Proceedings LREC 2010, pages 2228–2235, 2010. URL http://www.lrec-conf.org/proceedings/lrec2010/pdf/264_Paper.pdf.
  • B. Jurish. DiaCollo: On the trail of diachronic collocations. In K. De Smedt, editor, CLARIN Annual Conference 2015 (Wrocław, Poland, October 14–16 2015), pages 28–31, 2015. URL http://www.clarin.eu/sites/default/files/book%20of%20abstracts%202015.pdf.
  • B. Jurish, A. Geyken, and T. Werneke. DiaCollo: diachronen Kollokationen auf der Spur. In Proceedings DHd 2016: Modellierung – Vernetzung – Visualisierung, pages 172–175, March 2016. URL http://dhd2016.de/boa.pdf#page=172.
  • H. Kermes, S. Degaetano, A. Khamis, J. Knappen, and E. Teich. The Royal Society corpus: From uncharted data to corpus. In Proceedings of LREC 2016, Portorož, Slovenia, 2016. URL http://www.lrec-conf.org/proceedings/lrec2016/summaries/792.html.

SLIDE 21

References

  • C. D. Manning and H. Schütze. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, 1999.
  • A. Stulpe and M. Lemke. Blended reading. In M. Lemke and G. Wiedemann, editors, Text Mining in den Sozialwissenschaften: Grundlagen und Anwendungen zwischen qualitativer und quantitativer Diskursanalyse, pages 17–61. Springer, 2016. doi:10.1007/978-3-658-07224-7_2.