Corpus Assembly as Text Data Integration from Digital Libraries and the Web


  1. Corpus Assembly as Text Data Integration from Digital Libraries and the Web. Udo Hahn & Tinghui Duan, Jena University Language & Information Engineering (JULIE) Lab (https://julielab.de/), DFG Graduate School "Romanticism as a Model" (http://modellromantik.uni-jena.de), Friedrich Schiller University Jena, Germany. Jun 3, 2019, Urbana-Champaign, IL. JCDL '19, Session 1A: Generation and Linking

  2. Allgemeine Literatur-Zeitung (1785-1849): General Literature Gazette (ALZ), Jena/Halle, Germany. Very important historical text source for literary studies in German Romanticism (1790-1830)

  3. Allgemeine Literatur-Zeitung (1785-1849): Corpus (Analyse) → Research Result

  4. Allgemeine Literatur-Zeitung (1785-1849): Traditional Workflow: Printed Book (Scan) → Scanned Picture (OCR) → Full Text (Encode, Assemble) → Corpus (Analyse) → Research Result

  5. Allgemeine Literatur-Zeitung (1785-1849): Traditional Workflow: Printed Book [315 Volumes, ≈ 150,000 Pages, ≈ 150,000,000 Tokens] (Scan) → Scanned Picture (OCR) → Full Text (Encode, Assemble) → Corpus (Analyse) → Research Result

  6. Allgemeine Literatur-Zeitung (1785-1849): Traditional Workflow: Printed Book [315 Volumes, ≈ 150,000 Pages, ≈ 150,000,000 Tokens] (Scan) → Scanned Picture (OCR) → Full Text (Encode, Assemble) → Corpus (Analyse) → Research Result. Cost- and Time-Consuming

  7. Allgemeine Literatur-Zeitung (1785-1849): Alternative Workflow: Digital Libraries Full Text (Encode, Assemble) → Corpus (Analyse) → Research Result

  8. Scattered Digital Resources of ALZ. Germany: Bavarian State Library

  9. Scattered Digital Resources of ALZ. Germany: Bavarian State Library. Austria: Austrian National Library

  10. Scattered Digital Resources of ALZ. Germany: Bavarian State Library. Austria: Austrian National Library. Switzerland: University of Lausanne

  11. Scattered Digital Resources of ALZ. Germany: Bavarian State Library. Austria: Austrian National Library. Switzerland: University of Lausanne. UK: University of Oxford

  12. Scattered Digital Resources of ALZ. Germany: Bavarian State Library. Austria: Austrian National Library. Switzerland: University of Lausanne. UK: University of Oxford. USA: Harvard University, Indiana University, New York Public Library, Princeton University, Stanford University, University of Illinois, University of Michigan

  13. Scattered Digital Resources of ALZ. Germany: Bavarian State Library. Austria: Austrian National Library. Switzerland: University of Lausanne. UK: University of Oxford. USA: Harvard University, Indiana University, New York Public Library, Princeton University, Stanford University, University of Illinois, University of Michigan

  14. Scattered Digital Resources of ALZ. Germany: Bavarian State Library. Austria: Austrian National Library. Switzerland: University of Lausanne. UK: University of Oxford. USA: Harvard University, Indiana University, New York Public Library, Princeton University, Stanford University, University of Illinois, University of Michigan

  15. Scattered Digital Resources of ALZ. Germany: Bavarian State Library. Austria: Austrian National Library. Switzerland: University of Lausanne. UK: University of Oxford. USA: Harvard University, Indiana University, New York Public Library, Princeton University, Stanford University, University of Illinois, University of Michigan. In total: 1,200+ Volumes, 600,000+ Pages, 600,000,000+ Tokens

  16. Proposed Workflow: Digital Libraries and the Web (Collect, Correct Metadata)

  17. Proposed Workflow: Digital Libraries and the Web (Collect, Correct Metadata). Example: https://archive.org/details/bub_gb_udTjAAAAMAAJ/

  18. Proposed Workflow: Digital Libraries and the Web (Collect, Correct Metadata) → Full-Texts (Evaluate, Select)

  19. Proposed Workflow: Digital Libraries and the Web (Collect, Correct Metadata) → Full-Texts (Evaluate, Select). 14 different full-text versions for this page!

  20. Proposed Workflow: Digital Libraries and the Web (Collect, Correct Metadata) → Full-Texts (Evaluate, Select) → Best-Quality Full-Texts (Encode, Assemble)

  21. Proposed Workflow: Digital Libraries and the Web (Collect, Correct Metadata) → Full-Texts (Evaluate, Select) → Best-Quality Full-Texts (Encode, Assemble) → Target Corpus

  22. Result: Digital Libraries and the Web (Collect, Correct Metadata) → Full-Texts (Evaluate, Select) → Best-Quality Full-Texts (Encode, Assemble) → Target Corpus. 261 Volumes, 126,612 Pages, 120,369,005 Tokens

  23. Result: Digital Libraries and the Web (Collect, Correct Metadata) → Full-Texts (Evaluate, Select) → Best-Quality Full-Texts (Encode, Assemble) → Target Corpus. 261 Volumes, 126,612 Pages, 120,369,005 Tokens (≈ 82% coverage of 315 Volumes, ≈ 150,000 Pages, ≈ 150,000,000 Tokens)

  24. Result: The Largest Corpus for German Romanticism (https://github.com/JULIELab/ALZ). Digital Libraries and the Web (Collect, Correct Metadata) → Full-Texts (Evaluate, Select) → Best-Quality Full-Texts (Encode, Assemble) → Target Corpus. 261 Volumes, 126,612 Pages, 120,369,005 Tokens (≈ 82% coverage of 315 Volumes, ≈ 150,000 Pages, ≈ 150,000,000 Tokens)

  25. Problems • Restricted Accessibility • Heterogeneous Digitizing Conditions and OCR-Qualities

  26. Conclusion • The Largest Corpus for German Romanticism • Big Potential of DLs for Computational Literary Studies • More Cooperation Between DLs Desirable • Better Metadata and OCR-Quality are Desirable

  27. Corpus Assembly as Text Data Integration from Digital Libraries and the Web. Thank you! Udo Hahn & Tinghui Duan, Jena University Language & Information Engineering (JULIE) Lab (https://julielab.de/), DFG Graduate School "Romanticism as a Model" (http://modellromantik.uni-jena.de), Friedrich Schiller University Jena, Germany
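
The Evaluate and Select steps (slides 18-21) and the coverage figure (slide 23) can be illustrated with a minimal Python sketch. The slides do not say how full-text quality was measured, so the lexicon hit rate below is only an assumed stand-in, and the function names, library labels and toy texts are all hypothetical. The volume, page and token counts are the ones given on slide 23.

```python
def lexicon_hit_rate(text, lexicon):
    """Assumed OCR-quality proxy: share of alphabetic tokens found in a word list."""
    tokens = [t.lower() for t in text.split() if t.isalpha()]
    if not tokens:
        return 0.0
    return sum(t in lexicon for t in tokens) / len(tokens)


def select_best(versions, lexicon):
    """Return the (source, text) pair whose text scores highest on the proxy."""
    return max(versions.items(), key=lambda kv: lexicon_hit_rate(kv[1], lexicon))


def coverage(collected, target):
    """Fraction of the target that was actually collected."""
    return collected / target


# Hypothetical toy data: two OCR versions of the same page from different libraries.
lexicon = {"die", "allgemeine", "literatur", "zeitung"}
versions = {
    "library_a": "Die Allgemeine Literatur Zeitung",
    "library_b": "Dic Allgcmeine Literatur Zeitung",  # typical OCR confusions e/c
}

best_source, best_text = select_best(versions, lexicon)
print(best_source)  # library_a

# Coverage of the assembled corpus against the printed ALZ (figures from slide 23).
print(round(100 * coverage(261, 315), 1))                # 82.9 (volumes)
print(round(100 * coverage(126_612, 150_000), 1))        # 84.4 (pages)
print(round(100 * coverage(120_369_005, 150_000_000), 1))  # 80.2 (tokens)
```

The three ratios fall between roughly 80% and 84%, which is in line with the ≈ 82% coverage stated on slide 23; the slides do not specify which ratio that figure is based on. With 14 versions of a page instead of two, the same `select_best` call applies unchanged.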
