Corpus Assembly as Text Data Integration from Digital Libraries and the Web

SLIDE 1

Corpus Assembly as Text Data Integration from Digital Libraries and the Web

Jena University Language & Information Engineering (JULIE) Lab

https://julielab.de/

DFG Graduate School „Romanticism as a Model“

http://modellromantik.uni-jena.de

Friedrich Schiller University Jena, Germany

Jun 3, 2019 – Urbana-Champaign, IL – JCDL '19 – Session 1A – Generation and Linking

Udo Hahn & Tinghui Duan

SLIDE 2

Jena/Halle, Germany

Allgemeine Literatur-Zeitung (1785-1849)

Very important historical text source for literary studies in German Romanticism (1790-1830)

General Literature Gazette, ALZ

SLIDE 3

Allgemeine Literatur-Zeitung (1785-1849)

Corpus

  • Analyse

Research Result

SLIDE 4

Allgemeine Literatur-Zeitung (1785-1849)

Traditional Workflow

Printed Book

  • Scan

Scanned Picture

  • OCR

Full Text

  • Encode
  • Assemble

Corpus

  • Analyse

Research Result

SLIDE 5

Allgemeine Literatur-Zeitung (1785-1849)

Traditional Workflow

Printed Book

  • Scan

Scanned Picture

  • OCR

Full Text

  • Encode
  • Assemble

Corpus

  • Analyse

Research Result

315 Volumes ≈ 150,000 Pages ≈ 150,000,000 Tokens

SLIDE 6

Allgemeine Literatur-Zeitung (1785-1849)

Traditional Workflow

Printed Book

  • Scan

Scanned Picture

  • OCR

Full Text

  • Encode
  • Assemble

Corpus

  • Analyse

Research Result

Cost- and Time-Consuming

315 Volumes ≈ 150,000 Pages ≈ 150,000,000 Tokens

SLIDE 7

Allgemeine Literatur-Zeitung (1785-1849)

Alternative Workflow

Digital Libraries

Full Text

  • Encode
  • Assemble

Corpus

  • Analyse

Research Result

SLIDE 8

Germany: Bavarian State Library

Scattered Digital Resources of ALZ

SLIDE 9

Austria: Austrian National Library
Germany: Bavarian State Library

Scattered Digital Resources of ALZ

SLIDE 10

Austria: Austrian National Library
Switzerland: University of Lausanne
Germany: Bavarian State Library

Scattered Digital Resources of ALZ

SLIDE 11

UK: University of Oxford
Austria: Austrian National Library
Switzerland: University of Lausanne
Germany: Bavarian State Library

Scattered Digital Resources of ALZ

SLIDE 12

USA: Harvard University, Indiana University, New York Public Library, Princeton University, Stanford University, University of Illinois, University of Michigan
UK: University of Oxford
Austria: Austrian National Library
Switzerland: University of Lausanne
Germany: Bavarian State Library

Scattered Digital Resources of ALZ

SLIDE 13

USA: Harvard University, Indiana University, New York Public Library, Princeton University, Stanford University, University of Illinois, University of Michigan
UK: University of Oxford
Austria: Austrian National Library
Switzerland: University of Lausanne
Germany: Bavarian State Library

Scattered Digital Resources of ALZ

SLIDE 14

USA: Harvard University, Indiana University, New York Public Library, Princeton University, Stanford University, University of Illinois, University of Michigan
UK: University of Oxford
Austria: Austrian National Library
Switzerland: University of Lausanne
Germany: Bavarian State Library

Scattered Digital Resources of ALZ

SLIDE 15

USA: Harvard University, Indiana University, New York Public Library, Princeton University, Stanford University, University of Illinois, University of Michigan
UK: University of Oxford
Austria: Austrian National Library
Switzerland: University of Lausanne
Germany: Bavarian State Library

Scattered Digital Resources of ALZ

1,200+ Volumes 600,000+ Pages 600,000,000+ Tokens

SLIDE 16

Proposed Workflow

Digital Libraries and the Web

  • Collect
  • Correct Metadata
SLIDE 17

Proposed Workflow

Digital Libraries and the Web

  • Collect
  • Correct Metadata

https://archive.org/details/bub_gb_udTjAAAAMAAJ/
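
The Collect step for an item like the one above could start from its URL. The following is a minimal sketch, not the authors' tooling: it assumes the item is hosted on the Internet Archive, extracts the item identifier from the "details" URL, and builds the address of the item's JSON record in the Internet Archive Metadata API. The function names are hypothetical, and no network request is made here.

```python
from urllib.parse import urlparse


def archive_identifier(details_url: str) -> str:
    """Return the item identifier from an archive.org 'details' URL."""
    # Split the URL path into its non-empty segments, e.g. ["details", "<id>"].
    parts = [p for p in urlparse(details_url).path.split("/") if p]
    if len(parts) < 2 or parts[0] != "details":
        raise ValueError(f"not an archive.org details URL: {details_url}")
    return parts[1]


def metadata_url(details_url: str) -> str:
    """Build the URL of the item's JSON metadata record (Metadata API)."""
    return f"https://archive.org/metadata/{archive_identifier(details_url)}"


url = "https://archive.org/details/bub_gb_udTjAAAAMAAJ/"
print(archive_identifier(url))
print(metadata_url(url))
```

Fetching the record and correcting its fields (for example volume number or year) would be a separate step; which fields need correction is not stated on the slide.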

SLIDE 18

Proposed Workflow

Digital Libraries and the Web

  • Collect
  • Correct Metadata

Full-Texts

  • Evaluate
  • Select
SLIDE 19

Proposed Workflow

Digital Libraries and the Web

  • Collect
  • Correct Metadata

Full-Texts

  • Evaluate
  • Select

14 different full-text versions for this page!
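
When several full-text versions of the same page exist, Evaluate and Select amount to scoring each version and keeping the best one. The slides do not say which quality measure was used, so the sketch below substitutes a simple, commonly used proxy: the share of word tokens found in a reference lexicon. The lexicon, the sample texts, and all function names are illustrative assumptions.

```python
import re


def lexicon_hit_rate(text: str, lexicon: set[str]) -> float:
    """Share of alphabetic tokens in `text` that occur in `lexicon`."""
    # [^\W\d_]+ matches runs of letters only (including umlauts), no digits.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    if not tokens:
        return 0.0
    return sum(t in lexicon for t in tokens) / len(tokens)


def select_best(versions: dict[str, str], lexicon: set[str]) -> str:
    """Return the key of the version with the highest lexicon hit rate."""
    return max(versions, key=lambda src: lexicon_hit_rate(versions[src], lexicon))


# Toy data (assumed): two OCR versions of the same line, one with typical errors.
lexicon = {"allgemeine", "literatur", "zeitung", "vom", "jahre"}
versions = {
    "library_a": "Allgemeine Literatur Zeitung vom Jahre 1790",
    "library_b": "Allgemeinc Litcratur Zeitung vorn Jahre 1790",
}
best = select_best(versions, lexicon)
print(best)
```

A lexicon-based score is cheap and needs no ground truth, but it penalises historical spellings and proper names, so a real lexicon for texts from 1785-1849 would have to account for period orthography.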

SLIDE 20

Proposed Workflow

Digital Libraries and the Web

  • Collect
  • Correct Metadata

Full-Texts

  • Evaluate
  • Select

Best-Quality Full-Texts

  • Encode
  • Assemble
SLIDE 21

Proposed Workflow

Digital Libraries and the Web

  • Collect
  • Correct Metadata

Full-Texts

  • Evaluate
  • Select

Best-Quality Full-Texts

  • Encode
  • Assemble

Target Corpus

SLIDE 22

Result

Digital Libraries and the Web

  • Collect
  • Correct Metadata

Full-Texts

  • Evaluate
  • Select

Best-Quality Full-Texts

  • Encode
  • Assemble

Target Corpus

261 Volumes 126,612 Pages 120,369,005 Tokens

SLIDE 23

Result

Digital Libraries and the Web

  • Collect
  • Correct Metadata

Full-Texts

  • Evaluate
  • Select

Best-Quality Full-Texts

  • Encode
  • Assemble

Target Corpus

315 Volumes ≈ 150,000 Pages ≈ 150,000,000 Tokens

261 Volumes 126,612 Pages 120,369,005 Tokens

≈ 82% coverage
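
The coverage figure can be checked against the counts on this slide. The sketch below divides the collected counts (261 volumes, 126,612 pages, 120,369,005 tokens) by the approximate size of the complete run (315 volumes, about 150,000 pages, about 150,000,000 tokens). The slide does not state which unit the 82% refers to, so the ratio is computed for each unit; the variable names are illustrative.

```python
# Approximate size of the complete ALZ run, as given on the slides.
estimated = {"volumes": 315, "pages": 150_000, "tokens": 150_000_000}
# Size of the assembled target corpus, as given on the slides.
collected = {"volumes": 261, "pages": 126_612, "tokens": 120_369_005}

# Coverage per unit: collected count divided by estimated total.
coverage = {unit: collected[unit] / estimated[unit] for unit in estimated}

for unit, ratio in coverage.items():
    print(f"{unit}: {ratio:.1%}")
```

This prints 82.9% for volumes, 84.4% for pages, and 80.2% for tokens, so each ratio lies within about two percentage points of the reported figure of roughly 82%.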

SLIDE 24

Result

Digital Libraries and the Web

  • Collect
  • Correct Metadata

Full-Texts

  • Evaluate
  • Select

Best-Quality Full-Texts

  • Encode
  • Assemble

Target Corpus

The Largest Corpus for German Romanticism

https://github.com/JULIELab/ALZ

315 Volumes ≈ 150,000 Pages ≈ 150,000,000 Tokens

261 Volumes 126,612 Pages 120,369,005 Tokens

≈ 82% coverage

SLIDE 25

Problems

  • Restricted Accessibility
  • Heterogeneous Digitization Conditions and OCR Quality
SLIDE 26

Conclusion

  • The Largest Corpus for German Romanticism
  • Great Potential of Digital Libraries (DLs) for Computational Literary Studies
  • More Cooperation Between DLs Is Desirable
  • Better Metadata and OCR Quality Are Desirable
SLIDE 27

Corpus Assembly as Text Data Integration from Digital Libraries and the Web

Jena University Language & Information Engineering (JULIE) Lab

https://julielab.de/

DFG Graduate School „Romanticism as a Model“

http://modellromantik.uni-jena.de

Friedrich Schiller University Jena, Germany

Udo Hahn & Tinghui Duan

Thank you!