

Constructing a Canonicalized Corpus of Historical German by Text Alignment. Bryan Jurish, Marko Drotschmann; Deutsches Textarchiv, Berlin-Brandenburgische Akademie der Wissenschaften.


  1. Constructing a Canonicalized Corpus of Historical German by Text Alignment
     Bryan Jurish, Marko Drotschmann, Henriette Ast
     Deutsches Textarchiv, Berlin-Brandenburgische Akademie der Wissenschaften
     http://deutschestextarchiv.de
     New Methods in Historical Corpora, Manchester, 29th-30th April, 2011
     NMiHC / 2011-04-30 / Jurish, Drotschmann, Ast / Constructing a canonicalized corpus . . . – p. 1/20

  2. Overview
     The Big Picture
       Canonicalization
       Desiderata
       Proposal
     Construction
       Sources
       Text Alignment
       Manual Annotation
     Applications
       Test Corpus
       Canonicalization Lexicon
     Conclusion

  3. — The Big Picture —

  4. Canonicalization
     a.k.a. (orthographic) ‘standardization’, ‘normalization’, ‘modernization’, . . .
     The Problem
       Historical text ∌ orthographic conventions
       Conventional NLP tools ⇒ strict orthography
         Fixed lexicon keyed by orthographic form
         Extant lexemes only
       Example: þe Olde Wydgette Shoppe ↦ the old widget shop
     The Approach
       Map each word $w$ to a unique canonical cognate $\tilde{w}$
         Synchronically active “extant equivalents”
         Preserve both root and relevant features of input
       Defer application analysis to canonical forms

  5. Desiderata
     Evaluation
       Compare various canonicalization functions $c(\cdot)$
       Task: information retrieval ⇒ (precision, recall)
       Retrieval via canonical equivalence:
         $\mathrm{retrieved}_c(w) := (c \circ c^{-1})(w) = \{ v : c(v) = c(w) \}$
       Relevance requires manual verification!
         $\mathrm{relevant}(w) := \; ?$
     Ground-Truth Corpus
       Manually verified canonicalization pairs $(w, \tilde{w})$
       “Gold standard” $\tilde{c}(\cdot)$ for training & evaluation
         $\mathrm{relevant}(w) := \mathrm{retrieved}_{\tilde{c}}(w) = \{ v : \tilde{v} = \tilde{w} \}$
       Minimize manual annotation effort . . . but how?
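The evaluation scheme on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the word list, the gold-standard pairs, and the `drop_h` rule are invented toy data, not material from the corpus. The query word is assumed to be in the vocabulary, so the retrieved and relevant sets are never empty.

```python
def retrieved(w, vocab, canon):
    """Return {v in vocab : canon(v) == canon(w)} (retrieval via canonical equivalence)."""
    target = canon(w)
    return {v for v in vocab if canon(v) == target}


def precision_recall(w, vocab, canon, gold):
    """Precision and recall of canonicalizer `canon` for query `w`, judged against `gold`."""
    ret = retrieved(w, vocab, canon)  # retrieved under the tested canonicalizer
    rel = retrieved(w, vocab, gold)   # relevant(w): retrieved under the gold standard
    hits = len(ret & rel)
    return hits / len(ret), hits / len(rel)


# Toy gold-standard pairs (w, canonical form); illustrative only.
GOLD = {"Theil": "Teil", "Teil": "Teil", "Thier": "Tier", "Tier": "Tier"}
VOCAB = set(GOLD)


def gold(w):
    return GOLD[w]


def identity(w):
    # Baseline: every word is its own canonical form.
    return w


def drop_h(w):
    # Hypothetical toy rule (th -> t), not a method from the slides.
    return w.replace("Th", "T").replace("th", "t")


id_scores = precision_recall("Teil", VOCAB, identity, gold)
rule_scores = precision_recall("Teil", VOCAB, drop_h, gold)
print(id_scores, rule_scores)
```

On this toy vocabulary the identity baseline retrieves only the query itself, so it is precise but misses the historical variant, which is the pattern the precision/recall framing is meant to expose.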

  6. Proposal
     Intuitions
       Contemporary editions of historical works ⇒ already standardized
       Expect mostly identity canonicalizations $w = \tilde{w}$
         (at least for 18th-19th century German)
     Construction (Sketch)
       Align historical text with a contemporary edition
         maximize identity alignments
       Confirm or reject type-wise alignments
         exploit Heaps’ Law
       Manually annotate only unconfirmed tokens
         don’t lose “interesting” anomalous material

  7. — Construction —

  8. Sources
     Text Resources
       Source texts: Deutsches Textarchiv (DTA)
         Belles lettres, drama, verse, philosophy
       Target texts: gutenberg.org, zeno.org
     Prototype Corpus
       13 volumes, published 1780–1880
       ca. 350k tokens ∼ 28k types (words only)
     Ongoing Construction (‘full’ corpus)
       129 volumes, published 1780–1901
       ca. 5.2M tokens ∼ 219k types (words only)

  9. Text Alignment
     Preprocessing
       Tokenization (1 word / line)
       Transliteration, e.g. (ſ ↦ s), (oͤ ↦ ö)
     Basic Alignment (GNU diff)
       Token-wise LCS
       > 77% identity, > 94% transliterated identity
     Heuristic Alignment
       For each change hunk:
         multi-token alignments, e.g. (zwei und vierzig ↦ zweiundvierzig)
         character-wise ‘best’ match (Levenshtein)
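A minimal sketch of this pipeline, under several assumptions: Python's `difflib.SequenceMatcher` stands in for GNU diff (it is a different matching algorithm, not LCS-optimal diff), the transliteration table covers only long s and the combining small e, and the hunk heuristic is a simplified guess at the slide's "multi-token" and "best match" steps. All function names are hypothetical.

```python
from difflib import SequenceMatcher


def transliterate(token):
    """Map a few historical glyphs to modern ones (assumed subset of the real table)."""
    token = token.replace("ſ", "s")  # long s
    for base, umlaut in (("a", "ä"), ("o", "ö"), ("u", "ü")):
        token = token.replace(base + "\u0364", umlaut)  # base + combining small e
    return token


def levenshtein(a, b):
    """Character-wise edit distance with unit costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def align_hunk(hist, modern):
    """Heuristically align one changed hunk."""
    # Multi-token case: several historical tokens join into one modern token.
    if len(modern) == 1 and "".join(transliterate(t) for t in hist) == modern[0]:
        return [(" ".join(hist), modern[0])]
    # Otherwise pair each historical token with its closest modern candidate.
    return [(t, min(modern, key=lambda m: levenshtein(transliterate(t), m))) for t in hist]


def align(hist, modern):
    """Token-wise alignment on transliterated keys, then per-hunk heuristics."""
    keys = [transliterate(t) for t in hist]
    matcher = SequenceMatcher(a=keys, b=modern, autojunk=False)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            pairs.extend(zip(hist[i1:i2], modern[j1:j2]))
        elif tag == "replace":
            pairs.extend(align_hunk(hist[i1:i2], modern[j1:j2]))
        # Pure insertions and deletions stay unaligned in this sketch.
    return pairs


hist = ["Der", "Theil", "iſt", "zwei", "und", "vierzig", "Jahre", "alt"]
modern = ["Der", "Teil", "ist", "zweiundvierzig", "Jahre", "alt"]
pairs = align(hist, modern)
for h, m in pairs:
    print(h, "->", m)
```

Matching on transliterated keys while emitting the original tokens is what lets a pair such as (iſt, ist) count as an identity alignment without losing the historical spelling.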

  10. Type-wise Confirmation
      Idea
        Manually confirm or reject non-identity alignments
        Exploit Heaps’ Law: vocabulary grows sublinearly with corpus size
        Conservative acceptance only
      Results (prototype corpus)
        Available: 18k tokens ∼ 5.8k types
        Confirmed: 16k tokens (90%) ∼ 4.5k types (77%)
      Throughput
        ca. 3.95 seconds / pair ≈ 15 words / second
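The leverage of type-wise confirmation comes from judging each distinct (historical, modern) pair once and propagating the decision to every token that shows it. The following sketch illustrates that bookkeeping; the aligned pairs and the `accept` decision function are invented placeholders for the manual judgments, not data from the corpus.

```python
from collections import Counter


def typewise_confirm(aligned, accept):
    """Judge each non-identity pair type once; report token and type coverage."""
    counts = Counter((w, c) for w, c in aligned if w != c)  # non-identity pairs only
    confirmed = {pair for pair in counts if accept(*pair)}
    return {
        "types_available": len(counts),
        "tokens_available": sum(counts.values()),
        "types_confirmed": len(confirmed),
        "tokens_confirmed": sum(counts[pair] for pair in confirmed),
        "confirmed": confirmed,
    }


# Toy token-level alignments; identity pairs need no confirmation.
aligned = (
    [("Theil", "Teil")] * 3
    + [("seyn", "sein")] * 2
    + [("Thür", "Tier")]      # a misalignment that should be rejected
    + [("und", "und")] * 4
)


def accept(w, c):
    # Stand-in for a human annotator's decision (hypothetical).
    return (w, c) != ("Thür", "Tier")


stats = typewise_confirm(aligned, accept)
print(stats["tokens_confirmed"], "of", stats["tokens_available"], "tokens confirmed")
```

Because frequent pair types cover many tokens, a small number of decisions confirms most of the token mass, which matches the slide's gap between the token rate (90%) and the type rate (77%).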

  11. Token-wise Annotation
      Idea
        Resolve remaining uncanonicalized tokens (ca. 2%)
        Retain anomalous canonicalization patterns
      Preprocessing Filters
        Block pruning (≈ 2.2%)
        Closed-class lexicon
      Annotations
        Canonical form + administrative flags
        Expert review for problematic cases
      Throughput (total)
        ≈ 1.3 words / second

  12. — Experiments —

  13. Materials
      Prototype Corpus ⇝ Ground-Truth Relevance
        Most thoroughly annotated corpus subset
        340k tokens; 29k types (words only)
      Full Corpus ⇝ Canonicalization Lexicon (LEX)
        $\mathrm{LEX}(w) = \begin{cases} \arg\max_{\tilde{w}} f(w, \tilde{w}) & \text{if } f(w) > 0 \\ w & \text{otherwise} \end{cases}$
        Strictly disjoint from test corpus (by author)
        Partially annotated (no expert review)
        2.4M tokens; 140k types (words only)
      HMM Canonicalization Cascade (Jurish, 2010c)
        Robust finite-state canonicalizer
      Tested methods: ID, LEX, HMM, HMM+LEX
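The LEX definition on this slide amounts to a frequency table with an identity fallback. Below is a small sketch assuming the training data is available as a list of (historical, canonical) token pairs; the function names and the toy pairs are hypothetical, and ties between equally frequent canonical forms are not treated specially.

```python
from collections import Counter, defaultdict


def build_lex(pairs):
    """For each observed word w, keep the canonical form with the highest f(w, canonical)."""
    freq = defaultdict(Counter)
    for w, canonical in pairs:
        freq[w][canonical] += 1
    return {w: forms.most_common(1)[0][0] for w, forms in freq.items()}


def lex(w, table):
    """LEX(w): most frequent canonical form if w was observed (f(w) > 0), else w itself."""
    return table.get(w, w)


# Toy training pairs; illustrative only.
training = (
    [("Theil", "Teil")] * 2
    + [("Theil", "Theil")]   # a rarer, conflicting annotation
    + [("seyn", "sein")]
    + [("und", "und")] * 3
)
table = build_lex(training)
print(lex("Theil", table), lex("seyn", table), lex("Thür", table))
```

The identity fallback for unseen words is why such a lexicon behaves like the ID baseline on unknown types, which limits its recall on sparse data.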

  14. Results
                     % Types               % Tokens
                   pr     rc     F       pr     rc     F
        ID        98.3   57.1   72.2    99.7   79.1   88.2
        HMM       97.9   93.2   95.5    99.5   98.9   99.2
        LEX       98.3   85.3   91.3    99.7   98.2   98.9
        HMM+LEX   98.3   94.8   96.5    99.7   99.2   99.5
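The slide does not define F. Assuming it is the balanced F-measure, the harmonic mean of precision (pr) and recall (rc), the reported values can be recomputed from the other two columns. The check below uses the figures from this slide and allows a tolerance of 0.1, since pr and rc are themselves rounded to one decimal.

```python
def f_score(pr, rc):
    """Balanced F-measure: harmonic mean of precision and recall (assumed definition)."""
    return 2 * pr * rc / (pr + rc)


# method -> ((pr, rc, F) by types, (pr, rc, F) by tokens), as reported on the slide
RESULTS = {
    "ID":      ((98.3, 57.1, 72.2), (99.7, 79.1, 88.2)),
    "HMM":     ((97.9, 93.2, 95.5), (99.5, 98.9, 99.2)),
    "LEX":     ((98.3, 85.3, 91.3), (99.7, 98.2, 98.9)),
    "HMM+LEX": ((98.3, 94.8, 96.5), (99.7, 99.2, 99.5)),
}

# Largest gap between a recomputed and a reported F value.
max_dev = max(
    abs(f_score(pr, rc) - f)
    for rows in RESULTS.values()
    for pr, rc, f in rows
)

for method, rows in RESULTS.items():
    print(method, [round(f_score(pr, rc), 1) for pr, rc, _ in rows])
```

The recomputed values agree with the reported ones to within rounding, which also confirms how the method labels pair with the rows of numbers.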

  15. Conclusion
      Construction
        Alignment with contemporary edition
        Type-wise confirmation
        Token-wise annotation
        ⇝ minimal-effort corpus bootstrapping
      Applications
        Simple corpus-based lexicon ⇒ surprisingly effective
          very high precision
          mediocre recall for unknown types (sparse data)
        ‘Exception’ lexicon for HMM canonicalizer
          best overall performance
          corpus-based and generative techniques complement one another

  16. þe Olde Laſte Slyde (“The End”)
      Thank you for listening!

  17. — Addenda —

  18. Token Annotation GUI

  19. GUI: Batch Editor Window

  20. Administrivia
        Class      N     % Edited
        LEX      2684    59.22 %
        NE        874    19.29 %
        JOIN      792    17.48 %
        GRAPH     101     2.23 %
        SPLIT      72     1.59 %
        BUG        40     0.88 %
        GONE        8     0.18 %
        FM          1     0.02 %
