
slide-1
SLIDE 1

From monolithic XML for print/web to lean XML for data: realising linked data for dictionaries

Matt Kohl & Sandro Cirulli
Language Technologists
Oxford University Press (OUP)
7 June 2014

slide-2
SLIDE 2

Introduction Oxford University Press

◮ World-renowned dictionary publisher
◮ Licensing partner for lexical data

2/18

slide-3
SLIDE 3

Introduction Shifts in Publishing

◮ New trends & demands
◮ Emerging technologies & markets
◮ Importance of well-structured, semantically rich data
◮ Speed!

3/18

slide-4
SLIDE 4

Data Modelling Our Current Dictionary Data Models

◮ Print-oriented: designed to capture dictionary layout
◮ Monolithic: one enormous document
◮ Permissive: continually loosened to accommodate new texts

Can’t give us the flexibility we need

4/18

slide-5
SLIDE 5

Data Modelling Requirements

A new approach should:

◮ Represent language concepts, not layouts
◮ Enable data reusability for different products & services
◮ Allow only one, clear way to model any given lexical item

5/18

slide-6
SLIDE 6

Data Modelling The New Lexical Schema

6/18

slide-7
SLIDE 7

Data Conversion Moving Data into the Lexical Schema

Conversion Framework Requirements

◮ Scalability: convert 40+ data-sets
◮ Standardization: harmonize variation inside the data-sets
◮ Modularity: enable customization, slotting in & out of QA, etc.

7/18

slide-8
SLIDE 8

Data Conversion Tools

◮ XProc
◮ XSpec
◮ Schematron & XML Schema
◮ Jenkins CI
◮ Agile methodology

8/18

slide-9
SLIDE 9

Data Conversion Simplified XProc pipeline

print-focused XML (+ xml:lang="es") → XSL transformations → Schematron validation → enhanced XML → XSL transformations → XML Schema validation → Schematron validation → Lexical Data

9/18
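
The staged flow on this slide (enhance, transform, validate, and stop on failure) could be approximated outside XProc as follows. This is a minimal Python sketch, not the authors' pipeline: the function names add_language, check_senses_have_definitions and run_pipeline are hypothetical, the check is a plain-Python stand-in for a Schematron rule, and the sample input borrows the se2 and df element names from the OUP XML example later in the deck.

```python
import xml.etree.ElementTree as ET

# Clark-notation name of the xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def add_language(root, lang):
    """Enhancement step: record the data-set language on the root element."""
    root.set(XML_LANG, lang)
    return root


def check_senses_have_definitions(root):
    """Rule-style check (a stand-in for Schematron): each <se2> needs a non-empty <df>."""
    errors = []
    for sense in root.iter("se2"):
        df = sense.find(".//df")
        if df is None or not (df.text or "").strip():
            errors.append("se2 num=%s has no definition" % sense.get("num"))
    return errors


def run_pipeline(xml_text, lang, steps, checks):
    """Enhance the input, apply each transformation in order, then run every check.

    Raises ValueError on the first check that reports errors, so that
    invalid data never reaches the next stage.
    """
    root = add_language(ET.fromstring(xml_text), lang)
    for step in steps:
        root = step(root)
    for check in checks:
        errors = check(root)
        if errors:
            raise ValueError("; ".join(errors))
    return root


SOURCE = (
    '<se2 num="2"><msDict type="core">'
    "<df>Persona que tiene mal carácter o mala intención.</df>"
    "</msDict></se2>"
)
enhanced = run_pipeline(SOURCE, "es", steps=[], checks=[check_senses_have_definitions])
```

Keeping transformations and checks as separate lists mirrors the modularity requirement stated earlier: a QA step can be slotted in or out without touching the other stages.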

slide-18
SLIDE 18

Data Conversion Build Workflow

Check code in SVN → Jenkins build → Linguistic QA → Passes?

Jenkins build:

  • Ant script
  • XSpec unit tests
  • XProc pipeline

Passes?

  • No → Update code
  • Yes → Archive artefacts → Tag release in SVN via Jenkins

10/18

slide-19
SLIDE 19

Results & Discussion Source data

A sense of 'malaúva' from a monolingual Spanish dictionary

  <ACEPCIO ACEP="2">
    <AREA-GEO>Esp</AREA-GEO>
    <NIVELL>coloquial</NIVELL>
    <SIGNIFICAT>Persona que tiene mal carácter o mala intención.</SIGNIFICAT>
    <SINONIM>malaleche.</SINONIM>
  </ACEPCIO>

11/18

slide-21
SLIDE 21

Results & Discussion OUP XML

Print-focused DTD

  <se2 num="2">
    <lg>
      <ge>Esp</ge>
      <reg>coloquial</reg>
    </lg>
    <msDict type="core">
      <df>Persona que tiene mal carácter o mala intención.</df>
      <syn>malaleche</syn>
    </msDict>
  </se2>

New Lexical XSD

  <sense register="informal" region="ES">
    <definitions>
      <definition>
        <text>Persona que tiene mal carácter o mala intención</text>
      </definition>
    </definitions>
    <synonyms>
      <synonym>malaleche</synonym>
    </synonyms>
  </sense>

12/18
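
The mapping between the two samples on this slide can be sketched in Python. This is an illustrative assumption about the conversion, not the authors' XSLT: the function convert_sense and the REGISTER and REGION lookup tables are hypothetical, and the tables hold only the two value pairs that can be read off the example (coloquial → informal, Esp → ES).

```python
import xml.etree.ElementTree as ET

# Assumed lookup tables; only the pairs visible in the slide example are known.
REGISTER = {"coloquial": "informal"}
REGION = {"Esp": "ES"}


def convert_sense(se2):
    """Map one print-oriented <se2> element to a lexical <sense> element."""
    sense = ET.Element("sense")

    # Layout-oriented labels become attributes with normalised values.
    # An unknown label raises KeyError rather than passing through silently.
    reg = se2.findtext("lg/reg")
    if reg is not None:
        sense.set("register", REGISTER[reg.strip()])
    ge = se2.findtext("lg/ge")
    if ge is not None:
        sense.set("region", REGION[ge.strip()])

    # Each <df> becomes definitions/definition/text, without the print full stop.
    definitions = ET.SubElement(sense, "definitions")
    for df in se2.iter("df"):
        definition = ET.SubElement(definitions, "definition")
        ET.SubElement(definition, "text").text = (df.text or "").strip().rstrip(".")

    # Each <syn> becomes synonyms/synonym.
    syns = list(se2.iter("syn"))
    if syns:
        synonyms = ET.SubElement(sense, "synonyms")
        for syn in syns:
            ET.SubElement(synonyms, "synonym").text = (syn.text or "").strip()
    return sense


PRINT_XML = (
    '<se2 num="2"><lg><ge>Esp</ge><reg>coloquial</reg></lg>'
    '<msDict type="core">'
    "<df>Persona que tiene mal carácter o mala intención.</df>"
    "<syn>malaleche</syn></msDict></se2>"
)
sense = convert_sense(ET.fromstring(PRINT_XML))
```

Failing on an unmapped register or region label is a deliberate choice in this sketch: it matches the stated goal of allowing only one clear way to model a given lexical item.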

slide-23
SLIDE 23

Next steps Scale It Up

◮ Consolidate data in an XML database
◮ Build an RDF layer on top of the XML database
◮ Leverage Semantic Web to enhance our data

13/18

slide-24
SLIDE 24

Next Steps Prototype RDF/XML

  <Sense rdf:about="sense:es_noun_malauva_se_2">
    <isDescribedBy rdf:resource="definition:es_noun_malauva_se_2_def_1"/>
    <hasRegister rdf:resource="register:informal"/>
    <hasRegion rdf:resource="region:ES"/>
    <hasSynonym rdf:resource="lemma:a5e644"/>
  </Sense>
  <StandardDefinition rdf:about="definition:es_noun_malauva_se_2_def_1">
    <rdfs:label xml:lang="es">Persona que tiene mal carácter o mala intención</rdfs:label>
  </StandardDefinition>

14/18
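
One way to derive statements like those in the prototype from a lexical sense element is sketched below in Python. The function sense_to_triples is hypothetical, triples are plain tuples rather than objects from an RDF library, and the sense identifier and the lemma lookup are passed in by the caller because the slides do not say how these identifiers are minted.

```python
import xml.etree.ElementTree as ET


def sense_to_triples(sense, sense_id, lemma_ids):
    """Return (subject, predicate, object) tuples for one lexical <sense>.

    sense_id and lemma_ids are supplied by the caller; the identifier
    scheme itself is an assumption based on the prototype's naming.
    """
    subject = "sense:" + sense_id
    triples = []

    # One definition resource per <text>, numbered from 1.
    for n, text in enumerate(sense.iter("text"), start=1):
        definition = "definition:%s_def_%d" % (sense_id, n)
        triples.append((subject, "isDescribedBy", definition))
        triples.append((definition, "rdfs:label", text.text))

    # Attributes of the sense become links to shared resources.
    if sense.get("register"):
        triples.append((subject, "hasRegister", "register:" + sense.get("register")))
    if sense.get("region"):
        triples.append((subject, "hasRegion", "region:" + sense.get("region")))

    # Synonyms point at lemma resources via the caller's lookup table.
    for syn in sense.iter("synonym"):
        triples.append((subject, "hasSynonym", "lemma:" + lemma_ids[syn.text]))
    return triples


LEXICAL_XML = (
    '<sense register="informal" region="ES"><definitions><definition>'
    "<text>Persona que tiene mal carácter o mala intención</text>"
    "</definition></definitions>"
    "<synonyms><synonym>malaleche</synonym></synonyms></sense>"
)
triples = sense_to_triples(
    ET.fromstring(LEXICAL_XML), "es_noun_malauva_se_2", {"malaleche": "a5e644"}
)
```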

slide-25
SLIDE 25

RDF Data extraction Musical terms in English & Spanish

[Figure: English musical terms (choir, chant, air, hook, strain, chorus, chorale, ensemble, song, tune, aria) linked to Spanish terms (coro, coral, hook, choral, conjunto, canción, melodía, aria, estribillo, tono, aire, salmodia)]

15/18

slide-26
SLIDE 26

Inference mechanism

[Diagram: three nodes (word sense X, word sense Y, word sense Z) connected by one hasSynonym link and two hasAntonym links]

16/18
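
The slide does not spell out the inference rule, so the following Python sketch assumes one plausible reading of the diagram: if X hasSynonym Y and Y hasAntonym Z, then X hasAntonym Z. The function infer_antonyms and the tuple representation of triples are hypothetical, and the rule is applied in a single pass rather than to a fixed point.

```python
def infer_antonyms(triples):
    """Derive hasAntonym links implied by a hasSynonym link plus a hasAntonym link."""
    synonyms = {}
    for s, p, o in triples:
        if p == "hasSynonym":
            # Synonymy is treated as symmetric in this sketch.
            synonyms.setdefault(s, set()).add(o)
            synonyms.setdefault(o, set()).add(s)

    known = {(s, o) for s, p, o in triples if p == "hasAntonym"}
    inferred = set()
    for s, o in known:
        for syn in synonyms.get(s, ()):
            # Skip links that are already stated and self-contradictory ones.
            if syn != o and (syn, o) not in known:
                inferred.add((syn, "hasAntonym", o))
    return inferred


FACTS = [("X", "hasSynonym", "Y"), ("Y", "hasAntonym", "Z")]
new_links = infer_antonyms(FACTS)
```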

slide-27
SLIDE 27

Summary

◮ Overall project requirements
  ◮ Moving from products to platforms and services
  ◮ Supporting current business needs while innovating
  ◮ Adapting in nimble ways to fast-changing market requirements
  ◮ Focusing on time and cost efficiency
◮ Data model
  ◮ Content driven
  ◮ Machine interpretable
  ◮ Modular
  ◮ Evolvable/adaptable
◮ Conversion process
  ◮ Highly automated
  ◮ Modular
  ◮ Scalable

17/18

slide-28
SLIDE 28

Thank you for your attention! Any questions?

Matt Kohl: matt.kohl@oup.com
Sandro Cirulli: sandro.cirulli@oup.com