
Constructing a Canonicalized Corpus of Historical German by Text Alignment     Bryan Jurish,     Deutsches Textarchiv Marko Drotschmann, Berlin-Brandenburgische Akademie der Wissenschaften     http://deutschestextarchiv.de   Henriette Ast New Methods in Historical Corpora Manchester, 29 th -30 th April, 2011 NMiHC / 2011-04-30 / Jurish, Drotschmann, Ast / Constructing a canonicalized corpus . . . – p. 1/20

Overview

The Big Picture
  - Canonicalization
  - Desiderata
  - Proposal
Construction
  - Sources
  - Text Alignment
  - Manual Annotation
Applications
  - Test Corpus
  - Canonicalization Lexicon
Conclusion

— The Big Picture —

Canonicalization
a.k.a. (orthographic) ‘standardization’, ‘normalization’, ‘modernization’, ...

The Problem
  - Historical text ∌ orthographic conventions
  - Conventional NLP tools ⇒ strict orthography
      - Fixed lexicon keyed by orthographic form
      - Extant lexemes only

      þe    Olde   Wydgette   Shoppe
      ↓     ↓      ↓          ↓
      the   old    widget     shop

The Approach
  - Map each word w to a unique canonical cognate w̃
      - Synchronically active “extant equivalents”
      - Preserve both root and relevant features of input
  - Defer application analysis to canonical forms

Desiderata

Evaluation
  - Compare various canonicalization functions c(·)
  - Task: information retrieval ⇒ (precision, recall)
  - Retrieval via canonical equivalence:
      $\mathrm{retrieved}_c(w) := [c^{-1} \circ c](w) = \{ v : c(v) = c(w) \}$
  - Relevance requires manual verification!
      $\mathrm{relevant}(w) := \; ?$

Ground-Truth Corpus
  - Manually verified canonicalization pairs $(w, \tilde{w})$
  - “Gold standard” $\tilde{c}(\cdot)$ for training & evaluation:
      $\mathrm{relevant}(w) := \mathrm{retrieved}_{\tilde{c}}(w) = \{ v : \tilde{v} = \tilde{w} \}$
  - Minimize manual annotation effort ... but how?
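The retrieval-based evaluation above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy vocabulary, the names `equivalence_classes` and `precision_recall`, and the micro-averaging over query types are hypothetical choices, since the slides do not say how precision and recall are aggregated.

```python
from collections import defaultdict

def equivalence_classes(c, vocabulary):
    """Map each word w to {v : c(v) == c(w)}, i.e. retrieved_c(w)."""
    by_canon = defaultdict(set)
    for v in vocabulary:
        by_canon[c(v)].add(v)
    return {w: by_canon[c(w)] for w in vocabulary}

def precision_recall(c, gold, vocabulary):
    """Micro-averaged precision and recall of c against the gold function.

    Assumption: every vocabulary item is used once as a query.
    """
    retrieved = equivalence_classes(c, vocabulary)
    relevant = equivalence_classes(gold, vocabulary)
    true_pos = sum(len(retrieved[w] & relevant[w]) for w in vocabulary)
    n_retrieved = sum(len(retrieved[w]) for w in vocabulary)
    n_relevant = sum(len(relevant[w]) for w in vocabulary)
    return true_pos / n_retrieved, true_pos / n_relevant

# Hypothetical gold-standard pairs (w, canonical form), for illustration only.
GOLD = {"Theil": "Teil", "Teil": "Teil", "Thier": "Tier", "Tier": "Tier"}
VOCAB = list(GOLD)

def identity(w):
    """Baseline canonicalizer: every word is its own canonical form."""
    return w

pr, rc = precision_recall(identity, GOLD.get, VOCAB)
```

On this toy data the identity baseline retrieves only the query word itself, so it is fully precise but finds just half of the relevant variants.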

Proposal

Intuitions
  - Contemporary editions of historical works ⇒ already standardized
  - Expect mostly identity canonicalizations w = w̃
    (at least for 18th-19th century German)

Construction (Sketch)
  - Align historical text with a contemporary edition
      - maximize identity alignments
  - Confirm or reject type-wise alignments
      - exploit Heaps’ Law
  - Manually annotate only unconfirmed tokens
      - don’t lose “interesting” anomalous material

— Construction —

Sources

Text Resources
  - Source texts: Deutsches Textarchiv (DTA)
      - Belles lettres, drama, verse, philosophy
  - Target texts: gutenberg.org, zeno.org

Prototype Corpus
  - 13 volumes, published 1780–1880
  - ca. 350k tokens ∼ 28k types (words only)

Ongoing Construction (‘full’ corpus)
  - 129 volumes, published 1780–1901
  - ca. 5.2M tokens ∼ 219k types (words only)

Text Alignment

Preprocessing
  - Tokenization (1 word / line)
  - Transliteration, e.g. (ſ ↦ s), (oͤ ↦ ö)

Basic Alignment (GNU diff)
  - Token-wise LCS
  - > 77% identity, > 94% transliterated identity

Heuristic Alignment
  - For each change hunk:
      - multi-token alignments, e.g. (zwei und vierzig ↦ zweiundvierzig)
      - character-wise ‘best’ match (Levenshtein)
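A rough sketch of this alignment pipeline follows. It is not the authors' implementation: Python's `difflib.SequenceMatcher` stands in for GNU diff, the transliteration table is a small assumed sample, and the hunk heuristics (joining tokens, then picking the Levenshtein-closest candidate) are simplified guesses at the steps listed above. All function names are hypothetical.

```python
import difflib

# Assumed sample of transliterations: long s and superscript-e umlauts.
TRANSLIT = {"\u017f": "s", "a\u0364": "ä", "o\u0364": "ö", "u\u0364": "ü"}

def transliterate(token):
    """Replace historical characters with their modern equivalents."""
    for old, new in TRANSLIT.items():
        token = token.replace(old, new)
    return token

def levenshtein(a, b):
    """Character-wise edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def align(hist_tokens, modern_tokens):
    """Return (historical, modern) pairs; unmatched insertions/deletions are skipped."""
    hist_x = [transliterate(t) for t in hist_tokens]
    matcher = difflib.SequenceMatcher(a=hist_x, b=modern_tokens, autojunk=False)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Identity (after transliteration): keep the original historical form.
            pairs.extend(zip(hist_tokens[i1:i2], modern_tokens[j1:j2]))
        elif tag == "replace":
            modern_hunk = modern_tokens[j1:j2]
            if len(modern_hunk) == 1 and "".join(hist_x[i1:i2]) == modern_hunk[0]:
                # Multi-token alignment: several historical tokens, one modern token.
                pairs.append((" ".join(hist_tokens[i1:i2]), modern_hunk[0]))
            else:
                # Character-wise 'best' match within the change hunk.
                for h, hx in zip(hist_tokens[i1:i2], hist_x[i1:i2]):
                    best = min(modern_hunk, key=lambda m: levenshtein(hx, m))
                    pairs.append((h, best))
    return pairs

hist = "Der Theil i\u017ft zwei und vierzig Jahre alt".split()
modern = "Der Teil ist zweiundvierzig Jahre alt".split()
pairs = align(hist, modern)
```

Here `pairs` contains the identity alignments plus the non-identity candidates that would go on to manual confirmation.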

Type-wise Confirmation

Idea
  - Manually confirm or reject non-identity alignments
  - Exploit Heaps’ Law
      - vocabulary grows sublinearly with corpus size
  - Conservative acceptance only

Results (prototype corpus)
  - Available: 18k tokens ∼ 5.8k types
  - Confirmed: 16k tokens (90%) ∼ 4.5k types (77%)

Throughput
  - ca. 3.95 seconds / pair ≈ 15 words / second
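The saving from working type-wise rather than token-wise can be illustrated with a small sketch. The data and the function names `confirmation_queue` and `token_coverage` are hypothetical; the only idea taken from the slide is that each distinct non-identity alignment pair is judged once, however often it occurs.

```python
from collections import Counter

def confirmation_queue(aligned_pairs):
    """Collapse token-level alignments to distinct non-identity pair types,
    most frequent first, so each type is judged only once."""
    counts = Counter((h, m) for h, m in aligned_pairs if h != m)
    return counts.most_common()

def token_coverage(queue, confirmed):
    """Fraction of non-identity tokens covered by the confirmed pair types."""
    total = sum(n for _, n in queue)
    done = sum(n for pair, n in queue if pair in confirmed)
    return done / total if total else 0.0

# Hypothetical token-level alignments, for illustration only.
aligned = (
    [("Theil", "Teil")] * 3
    + [("seyn", "sein")] * 2
    + [("Thür", "Tür")]
    + [("und", "und")] * 4
)
queue = confirmation_queue(aligned)
coverage = token_coverage(queue, {("Theil", "Teil"), ("seyn", "sein")})
```

In this toy case two manual decisions cover five of the six non-identity tokens.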

Token-wise Annotation

Idea
  - Resolve remaining uncanonicalized tokens (ca. 2%)
  - Retain anomalous canonicalization patterns

Preprocessing Filters
  - Block pruning (≈ 2.2%)
  - Closed-class lexicon

Annotations
  - Canonical form + administrative flags
  - Expert review for problematic cases

Throughput (total)
  - ≈ 1.3 words / second

— Experiments —

Materials

Prototype Corpus ↝ Ground-Truth Relevance
  - Most thoroughly annotated corpus subset
  - 340k tokens; 29k types (words only)

Full Corpus ↝ Canonicalization Lexicon (LEX)
  - $\mathrm{LEX}(w) = \begin{cases} \arg\max_{\tilde{w}} f(w, \tilde{w}) & \text{if } f(w) > 0 \\ w & \text{otherwise} \end{cases}$
  - Strictly disjoint from test corpus (by author)
  - Partially annotated (no expert review)
  - 2.4M tokens; 140k types (words only)

HMM Canonicalization Cascade (Jurish, 2010c)
  - Robust finite-state canonicalizer

Tested methods: ID, LEX, HMM, HMM + LEX
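The LEX definition amounts to a most-frequent-canonical-form lookup with an identity fallback for unseen words. A minimal sketch, assuming the training data is available as (word, canonical form) token pairs; `build_lex` and the sample pairs are hypothetical names and data.

```python
from collections import Counter, defaultdict

def build_lex(training_pairs):
    """Return LEX: w -> most frequent canonical form of w in the training
    pairs, or w itself if w was never observed (f(w) == 0)."""
    freq = defaultdict(Counter)
    for w, canonical in training_pairs:
        freq[w][canonical] += 1
    best = {w: forms.most_common(1)[0][0] for w, forms in freq.items()}
    return lambda w: best.get(w, w)

# Hypothetical annotated tokens, for illustration only.
training = [
    ("Theil", "Teil"),
    ("Theil", "Teil"),
    ("Theil", "Theil"),
    ("seyn", "sein"),
]
lex = build_lex(training)
```

Because unknown words fall back to the identity mapping, such a lexicon can only improve on the ID baseline for types it has actually seen, which matches the sparse-data caveat in the conclusion.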

Results

             % Types                % Tokens
Method       pr     rc     F        pr     rc     F
ID           98.3   57.1   72.2     99.7   79.1   88.2
HMM          97.9   93.2   95.5     99.5   98.9   99.2
LEX          98.3   85.3   91.3     99.7   98.2   98.9
HMM + LEX    98.3   94.8   96.5     99.7   99.2   99.5

(pr = precision, rc = recall, F = F-score)
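As a sanity check on the results, F can be recomputed from the reported precision and recall as their harmonic mean. This assumes F is the balanced F-score, which the slide does not state explicitly; because the inputs are already rounded to one decimal, agreement is only expected to within about 0.1.

```python
def f_score(pr, rc):
    """Balanced F-score: harmonic mean of precision and recall."""
    return 2 * pr * rc / (pr + rc) if pr + rc else 0.0

# (pr, rc, F) as reported, first for types, then for tokens.
RESULTS = {
    "ID": ((98.3, 57.1, 72.2), (99.7, 79.1, 88.2)),
    "HMM": ((97.9, 93.2, 95.5), (99.5, 98.9, 99.2)),
    "LEX": ((98.3, 85.3, 91.3), (99.7, 98.2, 98.9)),
    "HMM+LEX": ((98.3, 94.8, 96.5), (99.7, 99.2, 99.5)),
}

def max_deviation(results):
    """Largest gap between a reported F and the F recomputed from pr and rc."""
    return max(
        abs(f_score(pr, rc) - f)
        for rows in results.values()
        for pr, rc, f in rows
    )

deviation = max_deviation(RESULTS)
```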

Conclusion

Construction
  - Alignment with contemporary edition
  - Type-wise confirmation
  - Token-wise annotation
  ↝ minimal-effort corpus bootstrapping

Applications
  - Simple corpus-based lexicon ⇒ surprisingly effective
      - very high precision
      - mediocre recall for unknown types (sparse data)
  - ‘Exception’ lexicon for HMM canonicalizer
      - best overall performance
      - corpus-based and generative techniques complement one another

þe Olde Laſte Slyde (“The End”)

Thank you for listening!

— Addenda —

[Figure: Token Annotation GUI]

[Figure: GUI: Batch Editor Window]

Administrivia

Class    N      % Edited
LEX      2684   59.22 %
NE        874   19.29 %
JOIN      792   17.48 %
GRAPH     101    2.23 %
SPLIT      72    1.59 %
BUG        40    0.88 %
GONE        8    0.18 %
FM          1    0.02 %
