Scholarly Text Curation & Robust Anchoring Requirements


  1. Scholarly Text Curation & Robust Anchoring Requirements
     W3C Workshop on Annotations, San Francisco, California, USA, 2 April 2014
     Timothy W. Cole (t-cole3@illinois.edu), Thomas G. Habing (thabing@illinois.edu)

  2. Anchoring Methods to Support Curatorial Annotation of Scholarly Text Resources
     • Should be fine-grained – for text this means individual words and phrases
     • Should ensure persistence, e.g., even as adjacent content is updated / corrected
     • Can be aligned across derivative formats and serializations, even across repository boundaries
     • Can support search & replace, e.g., the target is the set of all instances found in a specific context
     • Should help distinguish curatorial annotations of a specific digitization & its derivatives from annotations of intellectual substance
     W3C Annotation Workshop 4/2/2014 2 t-cole3@illinois.edu
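The first two requirements (fine-grained targets that persist as nearby content changes) can be illustrated with a quote-plus-context anchor. This is only a sketch under assumptions: `QuoteAnchor`, `make_anchor`, and `resolve` are hypothetical names, not part of the presentation or of any library, and the sample sentence is invented.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class QuoteAnchor:
    """Hypothetical anchor: the exact target text plus a little surrounding context."""
    exact: str
    prefix: str = ""
    suffix: str = ""


def make_anchor(text: str, start: int, end: int, context: int = 12) -> QuoteAnchor:
    """Record text[start:end] together with up to `context` characters on each side."""
    return QuoteAnchor(
        exact=text[start:end],
        prefix=text[max(0, start - context):start],
        suffix=text[end:end + context],
    )


def resolve(anchor: QuoteAnchor, text: str) -> Optional[Tuple[int, int]]:
    """Return (start, end) of the unique match in `text`, or None if absent or ambiguous."""
    needle = anchor.prefix + anchor.exact + anchor.suffix
    first = text.find(needle)
    if first == -1 or text.find(needle, first + 1) != -1:
        return None
    start = first + len(anchor.prefix)
    return start, start + len(anchor.exact)


# Invented sample text: anchor the first "Jrbana", then edit content elsewhere.
original = "The Jrbana Daily Courier reported on the fair. The Jrbana council met."
anchor = make_anchor(original, 4, 10, context=6)
revised = "Front page. " + original.replace("the fair", "the county fair")
span = resolve(anchor, revised)  # character offsets moved, the anchor still resolves
```

The anchor survives edits outside its context window, but an edit inside the recorded prefix or suffix makes `resolve` return None. A fuller design would presumably need fuzzy matching or a fallback selector for that case.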

  3. Use cases
     • Correction of OCR & manual transcriptions
       – HathiTrust DL: 11 million digitized volumes, automated OCR
       – Text Creation Partnership (TCP): 50,000 manually transcribed
       – Support corrections by curators, outside experts, the crowd
     • Correction of automated annotation, e.g., part-of-speech tagging of TEI
     • Distinct from proposed targeting schemes for other kinds of scholarly use cases, e.g., commentary

  4. Why Annotations?
     • So proposed corrections can be reviewed and themselves annotated as needed
     • To share with other repositories
     • To maintain portability of provenance

  6. Veridian Digital Library Software

  8. Align across representations
     • Treat the OCR as an annotation of a segment of PDF / JPEG page image
       – Annotating agent is OCR program
     • Proposed OCR correction is then an annotation of the OCR annotation of the page image
     • Complicating factor – OCR outputs at page level, correction is usually done at line level
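The layering described on this slide can be sketched as a small data structure: an OCR annotation targets a region of the page image, and a correction targets one line of that OCR annotation. All class names, field names, the agent labels, and the example.org URI below are assumptions made for illustration. This is not a serialization of any annotation data model.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class ImageRegion:
    """A rectangular segment of a page image, in pixel coordinates."""
    image_uri: str
    x: int
    y: int
    width: int
    height: int


@dataclass(frozen=True)
class Annotation:
    """Hypothetical annotation: who made it, what it says, what it is about."""
    id: str
    agent: str
    body: str
    target: Union[ImageRegion, "LineTarget"]


@dataclass(frozen=True)
class LineTarget:
    """Points at one line (0-based) of another annotation's body."""
    annotation: Annotation
    line: int


def apply_correction(correction: Annotation) -> str:
    """Return the OCR text with the targeted line replaced by the correction body."""
    target = correction.target
    if not isinstance(target, LineTarget):
        raise TypeError("correction must target a line of an OCR annotation")
    lines = target.annotation.body.split("\n")
    lines[target.line] = correction.body
    return "\n".join(lines)


# Page-level OCR output, produced by an OCR program acting as the annotating agent.
ocr = Annotation(
    id="ocr-1",
    agent="ocr-program",
    body="Jrbana Daily Courier\nMonday evening",
    target=ImageRegion("http://example.org/page1.jpg", 0, 0, 1200, 1800),
)
# Line-level correction: an annotation of the OCR annotation.
fix = Annotation(id="fix-1", agent="curator", body="Urbana Daily Courier",
                 target=LineTarget(ocr, 0))
corrected = apply_correction(fix)
```

Keeping the correction as a separate object leaves the OCR output untouched, so the proposed change can be reviewed or annotated in turn. The `LineTarget` is one possible way to bridge page-level OCR output and line-level correction.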

  9. Annotating repeated errors?
     The string “Jrbana” appears 782 times in OCR texts of the Urbana Daily Courier (1903-1935)
     Do we need to require that users find every instance of “Jrbana”?
     Can we have search-and-replace annotations?
     * “Urbana” appears ~200,000 times
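One way to picture a search-and-replace annotation is as a pattern plus a context (here, a collection of issues) that expands into the set of all matching spans. The sketch below assumes whole-word matching. The function names, issue identifiers, and sample texts are hypothetical.

```python
import re
from typing import Dict, List, Tuple

Span = Tuple[str, int, int]  # (document id, start offset, end offset)


def find_targets(pattern: str, corpus: Dict[str, str]) -> List[Span]:
    """Expand one search-and-replace target into every whole-word match in the corpus."""
    word = re.compile(r"\b" + re.escape(pattern) + r"\b")
    return [
        (doc_id, m.start(), m.end())
        for doc_id, text in sorted(corpus.items())
        for m in word.finditer(text)
    ]


def apply_replacement(text: str, spans: List[Tuple[int, int]], replacement: str) -> str:
    """Replace the given spans, working right to left so earlier offsets stay valid."""
    for start, end in sorted(spans, reverse=True):
        text = text[:start] + replacement + text[end:]
    return text


# Invented two-issue corpus standing in for a newspaper collection.
corpus = {
    "issue-a": "Jrbana Daily Courier. News of Jrbana.",
    "issue-b": "Urbana Daily Courier.",
}
targets = find_targets("Jrbana", corpus)
fixed_a = apply_replacement(
    corpus["issue-a"],
    [(start, end) for doc_id, start, end in targets if doc_id == "issue-a"],
    "Urbana",
)
```

A single annotation of this kind would let a curator propose the correction once and have the targets computed. Whether every match really is an error would still need review.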

  10. http://annolex.at.northwestern.edu/

  13. From TCP-EEBO Collection as hosted at the University of Michigan

  14. //*[@id="doccontent"]/div/div[39]

      <div class="sp">
        <div class="speaker">Valer.</div>
        <p>Oh Collatin <span class="gap">•…</span>! I am a true Cittizen and in this
        I will best shew my selfe to be one, to take part with the stronger. If
        <span class="rend-italic">Se<span class="gap">•…</span> ius</span> ore-come,
        I am Liegeman to <span class="rend-italic">Serutus,</span> &amp; if
        <span class="rend-italic">Ta<span class="gap">•…</span> quin</span> subdue,
        I am for <span class="rend-italic">Viue Tarquinius.</span> </p>
      </div>
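A positional path such as the one on this slide depends on how many sibling elements precede the target. The sketch below uses Python's standard `xml.etree.ElementTree` and a much-reduced, invented stand-in for the page markup to show the same path resolving to a different speech once a sibling is inserted. The document content, the `speaker_at` helper, and the shortened index (`div[2]` rather than `div[39]`) are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

# Reduced, made-up stand-in for the hosted page: two speeches in one container.
DOC = """<body><div id="doccontent"><div>
<div class="sp"><div class="speaker">Other.</div><p>First speech.</p></div>
<div class="sp"><div class="speaker">Valer.</div><p>Second speech.</p></div>
</div></div></body>"""

# Same shape as the slide's XPath, in ElementTree's limited XPath dialect.
PATH = ".//*[@id='doccontent']/div/div[2]"


def speaker_at(root: ET.Element, path: str) -> str:
    """Return the speaker label of the speech that `path` currently selects."""
    speech = root.find(path)
    return speech.find("div[@class='speaker']").text


root = ET.fromstring(DOC)
before = speaker_at(root, PATH)

# A curator inserts a previously missing speech at the top of the container.
container = root.find(".//*[@id='doccontent']/div")
added = ET.Element("div", {"class": "sp"})
label = ET.SubElement(added, "div", {"class": "speaker"})
label.text = "Added."
container.insert(0, added)

after = speaker_at(root, PATH)  # same path, different speech
```

Here `before` is the intended speech and `after` is not, even though nothing in the targeted speech itself changed. This is the kind of fragility the persistence requirement is meant to rule out.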

  15. Relation between anchoring method & what’s being targeted
      • 1st challenge – understanding the annotator’s intention
      • 2nd challenge – using a targeting approach that is consistent with the annotator’s intent
      • Some schemes limit possible range of interpretations
        – Chapter, verse & line approaches (e.g., CTS)

  18. Anchoring Methods to Support Curatorial Annotation of Scholarly Text Resources
      • Should be fine-grained – for text this means individual words and phrases
      • Should ensure persistence, e.g., even as adjacent content is updated / corrected
      • Can be aligned across derivative formats and serializations, even across repository boundaries
      • Can support search & replace, e.g., the target is the set of all instances found in a specific context
      • Should help distinguish curatorial annotations of a specific digitization & its derivatives from annotations of intellectual substance
