TIB|AV-Portal: Challenges managing audiovisual metadata encoded in RDF



SLIDE 1

Jörg Waitelonis, yovisto GmbH, Semantic Web in Libraries Conference 2016, 28-30th November 2016, Bonn, Germany

Challenges managing audiovisual metadata encoded in RDF

Jörg Waitelonis, yovisto GmbH
Margret Plank, German National Library of Science and Technology (TIB), Hannover
Prof. Dr. Harald Sack, HPI-Potsdam / FIZ Karlsruhe & KIT

SWIB16 Semantic Web in Libraries Conference 2016, 28-30 November 2016, Bonn, Germany | http://swib.org/

TIB|AV-Portal

SLIDE 2

What we do (yovisto):
■ Intelligent Linked Data, Ontology and Metadata Management
■ Knowledge Discovery & Knowledge Mining
■ Video & Image Analysis, User Interfaces, Visualization

Jörg Waitelonis, Prof. Dr. Harald Sack, Christian Hentschel

Based in: August Bebel Str. 26-53, 14482 Potsdam, Germany

SLIDE 3

Hosted and maintained by:
■ yovisto GmbH, Potsdam
■ flowworks GmbH, München (Asset Management, Playout)

Developed in cooperation between:
■ German National Library of Science and Technology (TIB), Hannover
■ Hasso-Plattner-Institute for IT-Systems Engineering (HPI), Potsdam

http://av.tib.eu/

SLIDE 4

More than 8000:
■ Lectures
■ Conference talks
■ Interviews
■ Simulations
■ Visualizations
■ Research data

for:
■ Scientists
■ Lecturers
■ Teachers
■ Learners

http://av.tib.eu/

SLIDE 5

TIB|AV-Portal

[Figure: system overview of the TIB|AV-Portal. Components shown:]
■ Users, customers, uploaders: search, view video, upload
■ Media asset management and streaming
■ TIB curators: ingest, manage metadata (DOI, QA, rights clearance), approval
■ RDF triplestore, search index, workflow management
■ AV analysis: video segmentation, optical character recognition (OCR), speech-to-text (ASR), visual concept detection (VCD)
■ Semantic analysis: context modelling, named entity linking

http://av.tib.eu/

SLIDE 6

TIB|AV-Portal: Semantic metadata analysis with Named Entity Linking

Textual Metadata

Authoritative:
■ Formal, descriptive, technical
■ E.g. title, description, keywords, etc.
■ Manually authored
■ Refers to the entire video (coarse-grained)

Non-authoritative:
■ E.g. ASR/OCR transcripts
■ Automatically extracted
■ Refers to fragments of the video (fine-grained)

Knowledge base (both kinds of metadata are mapped to it):
■ 63,356 GND subject headings (GND = Gemeinsame Normdatei, Integrated Authority File)
■ Incl. English translations from mappings to DBpedia, LCSH, MACS and WTI Thesaurus

[1] Sven Strobel, Paloma Marín-Arraiza: Metadata for Scientific Audiovisual Media: Current Practices and Perspectives of the TIB|AV-Portal. In Proc. of Metadata and Semantics Research: 9th Research Conference, MTSR 2015, Manchester, UK, 2015, Springer

SLIDE 7

http://dx.doi.org/10.5446/357#t=49:03,53:58
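
The link above addresses one segment of a video through a temporal fragment of the form #t=start,end. As a minimal sketch, assuming the times are written as mm:ss or hh:mm:ss as in this URL, such a fragment could be turned into seconds as follows. The function names are hypothetical and not part of the portal.

```python
from urllib.parse import urlparse


def clock_to_seconds(clock: str) -> int:
    """Convert 'mm:ss' or 'hh:mm:ss' into whole seconds."""
    seconds = 0
    for part in clock.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def parse_temporal_fragment(url: str) -> tuple[int, int]:
    """Return (start, end) in seconds for a '#t=start,end' fragment."""
    fragment = urlparse(url).fragment  # e.g. 't=49:03,53:58'
    if not fragment.startswith("t="):
        raise ValueError("no temporal fragment in URL")
    start, end = fragment[2:].split(",")
    return clock_to_seconds(start), clock_to_seconds(end)


start, end = parse_temporal_fragment("http://dx.doi.org/10.5446/357#t=49:03,53:58")
print(start, end, end - start)
```

A player could use the two values to seek to the start of the cited segment and stop at its end.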

SLIDE 8

TIB|AV-Portal: The Data Model & RDF-Export

Why RDF?
■ Extensible
■ Different serialization forms
■ Interoperable
■ Queryable (SPARQL)
■ W3C standard

How?
■ Vocabulary selection
■ Problem: heterogeneous metadata
  ○ authoritative, spatio-temporal, nested annotations
■ Vocabulary discovery: http://lov.okfn.org/
■ Selection criteria

[2] Jörg Waitelonis, Margret Plank, Harald Sack, TIB|AV-Portal: Integrating Automatically Generated Video Annotations into the Web of Data, in Proc. of 20th International Conference on Theory and Practice of Digital Libraries (TPDL 2016)

SLIDE 9

TIB|AV-Portal: The Data Model & RDF-Export

Vocabulary Selection Issues:
■ Availability on the Web
■ Openness
■ Level of complexity/richness
■ Maintained
■ Trustworthy authorship
■ Usage by others / popularity
■ Documentation
■ Adequate meaning
■ Specificity
■ Datatypes
■ Avoid contradictions, e.g.
  ○ Domain & range / sub- & super-class
  ○ Datatype vs. object properties
■ Does it fit the currently used models?

  • cf. http://wiki.dublincore.org/index.php/Vocabulary_evaluation,_selection_and_re-use

SLIDE 10

The Data Model & RDF-Export

tib:video/16453
    schema:name "Wall-crossing and geometry at infinity of Betti moduli spaces"@en ;
    schema:description "Linear algebraic differential equation (in one variable) ..."@en ;
    schema:keywords "Betti moduli"@en , "chaos theory"@en , "singularity"@en ;
    schema:dateCreated "1973-01-01T00:00:00+01:00"^^<http://www.w3.org/2001/XMLSchema#gYear> ;
    schema:duration 1:16:48 ;
    rdf:type schema:Movie ;
    schema:url <https://av.tib.eu/media/16453> ;
    schema:producer gnd:4028361-6 ;
    schema:publisher tib:Institut_des_Hautes__tudes_Scientifiques_%28IH_S%29 ;
    schema:license <http://creativecommons.org/licenses/by/3.0/deed.en> ;
    schema:availability schema:OnlineOnly ;
    bibframe:doi <http://dx.doi.org/10.5446/16453> ;
    schema:thumbnailUrl <https://av.tib.eu/images/avpimg1fdaede78b338bba137140fd805cd382> .

Standard Metadata and Basic Structure
■ DCMI Metadata Terms: http://purl.org/dc/terms/
■ DCMI Type Vocabulary: http://purl.org/dc/dcmitype/
■ schema.org Vocabulary: http://schema.org/
■ Friend of a Friend Vocabulary 0.1: http://xmlns.com/foaf/
■ Bibframe Vocabulary: http://bibframe.org/vocab/

SLIDE 11

tib:video/16453#t=smpte-25:0:05:00:22,0:05:03:00
    dcterms:isPartOf tib:video/16453 .

tib:asr/16453_13753838_7522
    oa:hasTarget tib:video/16453#t=smpte-25:0:05:00:22,0:05:03:00 ;
    oa:annotatedBy tib:annotator/ASR-1.0.0 ;
    rdf:type oa:Annotation ;
    oa:hasBody tib:asr/16453_13753838_7522#char=0,5617 .

tib:asr/16453_13753838_7522#char=0,5617
    rdf:type nif:Context ;
    rdf:type nif:RFC5147String ;
    nif:isString "... five sets ..." .

tib:asr/16453_13753838_7522#char=4743,4747
    nif:referenceContext tib:asr/16453_13753838_7522#char=0,5617 ;
    itsrdf:taIdentRef gnd:4038613-2 ;
    itsrdf:taAnnotatorsRef tib:annotator/NEL-1.0.0 ;
    rdf:type nif:Phrase ;
    rdf:type nif:String ;
    nif:beginIndex "4743" ;
    nif:endIndex "4747" ;
    nif:anchorOf "sets" .

Spatio-temporal Metadata
■ Open Annotation Data Model (OA)
■ NLP Interchange Format (NIF)

  • http://w3.org/ns/oa#
  • http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#
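
The annotation targets in the example above use SMPTE time codes with 25 frames per second (t=smpte-25:start,end), where each time code reads hours:minutes:seconds:frames. The following is a minimal sketch of how such a fragment could be converted into seconds. The function names are hypothetical, and only the smpte-25 variant seen on the slide is handled.

```python
FPS = 25  # 'smpte-25' in the fragment means 25 frames per second


def smpte_to_seconds(timecode: str, fps: int = FPS) -> float:
    """Convert 'h:mm:ss:ff' (hours, minutes, seconds, frames) to seconds."""
    hours, minutes, seconds, frames = (int(p) for p in timecode.split(":"))
    return hours * 3600 + minutes * 60 + seconds + frames / fps


def parse_smpte_fragment(fragment: str) -> tuple[float, float]:
    """Parse 't=smpte-25:START,END' into (start, end) in seconds."""
    prefix = "t=smpte-25:"
    if not fragment.startswith(prefix):
        raise ValueError("expected an smpte-25 temporal fragment")
    start, end = fragment[len(prefix):].split(",")
    return smpte_to_seconds(start), smpte_to_seconds(end)


print(parse_smpte_fragment("t=smpte-25:0:05:00:22,0:05:03:00"))
```

Frame-based time codes keep the annotation aligned with individual video frames, which plain second offsets cannot express exactly.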

TIB|AV-Portal The Data Model & RDF-Export

SLIDE 12

TIB|AV-Portal: The Data Model & RDF-Export, Open Annotation

[Figure: Open Annotation graph. The annotation tib:asr/16453_13753838 (rdf:type oa:Annotation) has oa:annotatedBy tib:annotator/ASR-1.0.0, oa:hasTarget tib:video/16453#t=smpte-25:0:23:12:12,0:23:14:4 and oa:hasBody. The body nodes shown are the text “... the astronaut …” and gnd:11896416X.]

SLIDE 13

TIB|AV-Portal: The Data Model & RDF-Export, NLP Interchange Format (NIF)

[Figure: NIF graph extending the Open Annotation graph. Nodes and properties shown:]
■ tib:asr/16453_13753838: rdf:type oa:Annotation; oa:annotatedBy tib:annotator/ASR-1.0.0; oa:hasTarget tib:video/16453#t=smpte-25:0:23:12:12,0:23:14:4; oa:hasBody tib:asr/16453_13753838#char=0,62
■ tib:asr/16453_13753838#char=0,62: rdf:type nif:Context, nif:RFC5147String; nif:isString “... the astronaut …”
■ tib:asr/16453_13753838#char=23,32: rdf:type nif:String, nif:Phrase; nif:referenceContext tib:asr/16453_13753838#char=0,62; nif:anchorOf “astronaut”; nif:beginIndex 23; nif:endIndex 32; itsrdf:taIdentRef gnd:11896416X; itsrdf:taAnnotatorsRef tib:annotator/NEL-1.0.0

http://av.tib.eu/opendata
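
In NIF, a phrase is addressed by character offsets into its reference context, so the nif:anchorOf text should equal the slice of nif:isString between nif:beginIndex and nif:endIndex. On the slide, 32 - 23 = 9 characters, which matches the length of “astronaut”. The following is a minimal sketch of such a consistency check. The class, the function and the context string are hypothetical, since the slide elides the real transcript text.

```python
from dataclasses import dataclass


@dataclass
class Phrase:
    begin_index: int  # nif:beginIndex
    end_index: int    # nif:endIndex (position after the last character)
    anchor_of: str    # nif:anchorOf


def offsets_consistent(context: str, phrase: Phrase) -> bool:
    """True if the anchor text equals the addressed slice of the context."""
    return context[phrase.begin_index:phrase.end_index] == phrase.anchor_of


# Hypothetical context string; the slide elides the real transcript text.
context = "and then the astronaut left"
begin = context.index("astronaut")
good = Phrase(begin, begin + len("astronaut"), "astronaut")
bad = Phrase(begin + 1, begin + len("astronaut"), "astronaut")

print(offsets_consistent(context, good))
print(offsets_consistent(context, bad))
```

A check of this kind could be run over an RDF export to catch offset errors before the data is published.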

SLIDE 14

TIB|AV-Portal: Data Quality

Standard metadata
■ Concise manual verification and clearing by TIB subject specialists

Automatically created metadata
■ AV analysis
  ○ typical detection errors (ASR, OCR, etc.)
■ Semantic analysis
  ○ missing annotations
  ○ wrong annotations
  ○ knowledge base errors and insufficiencies

Use the verified information to improve subsequent analysis.

SLIDE 15

TIB|AV-Portal: Data Quality, Video Text Recognition

Improving OCR: extend the OCR vocabulary (per video) with
■ subject-specific terminology
■ terminology from manually verified metadata

Example: http://dx.doi.org/10.5446/15907
■ Authoritative metadata: Title: "Lecture on Science and Creativity", Author: “Kroto, Harold”
■ Without the extension, OCR detects “Kyoto”
■ Before OCR: extend the OCR language model and the subsequent spell-correction with terms from the authoritative metadata (e.g. Creativity, Harold, Kroto, Lecture, Science)
➥ OCR now detects “Kroto”.
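
The spell-correction step of the Kroto example can be illustrated with a small sketch. It assumes a plain string-similarity lookup (Python's difflib) against the per-video vocabulary. This stands in for the portal's actual OCR language model, which the slides do not describe, and the function name and cutoff value are hypothetical.

```python
import difflib


def correct_token(token: str, vocabulary: list[str], cutoff: float = 0.75) -> str:
    """Replace an OCR token by the closest vocabulary term, if one is close enough."""
    by_lower = {term.lower(): term for term in vocabulary}
    matches = difflib.get_close_matches(token.lower(), list(by_lower), n=1, cutoff=cutoff)
    return by_lower[matches[0]] if matches else token


# Terms taken from the authoritative metadata of the example video.
vocabulary = ["Creativity", "Harold", "Kroto", "Lecture", "Science"]

print(correct_token("Kyoto", vocabulary))
print(correct_token("Lecture", vocabulary))
print(correct_token("slides", vocabulary))
```

Restricting the vocabulary to one video matters here: a global dictionary would contain both “Kyoto” and “Kroto”, and the correction would have nothing to prefer.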

SLIDE 16

TIB|AV-Portal: Data Quality, Video Text Recognition

Improving OCR: extend the OCR vocabulary (per video) with
■ subject-specific terminology
■ terminology from manually verified metadata
■ related terms from manually verified metadata (with the help of NEL & graph traversal/fact ranking)

Example (image source: apolloarchive.com; text in the frame: “Apollo11 Saturn V Rocket”)
■ Title: “Armstrong landed on the moon”
■ NEL: dbp:Neil_Armstrong, dbp:Moon
■ Without the extension, OCR detects “Pocket”
■ Before OCR: extension/boosting with related technical terms, e.g. dbp:Apollo_11, dbp:Buzz_Aldrin, dbp:Nasa, dbp:Rocket, dbp:Saturn_V-B, dbp:Low_Earth_orbit, etc.
➥ OCR now detects “Rocket”.

SLIDE 17

TIB|AV-Portal: Data Quality, Semantic Analysis

Knowledge base errors and insufficiencies:
■ Sparse linking
  ○ Insufficient for graph-based disambiguation
■ Multiple entries for the same entity
  ○ e.g. “Harald Sack”: http://d-nb.info/gnd/173514537 and http://d-nb.info/gnd/118058045

Solution?
■ Manual intervention (e.g. blacklisting)? ☆☆☆☆☆

SLIDE 18

TIB|AV-Portal: Data Quality, Semantic Analysis

Typical Annotation Errors
01 Missing: terms that have not been annotated
02 Compound Split: entities split into two separate entities
03 General/Specific: a more general instead of the more specific entity has been chosen
04 Wrong Entity: wrongly annotated entities not classified in categories 1-3

Alternative?
■ Manual annotation? ★★★☆☆

The best NEL tools reach F1-measures barely beyond 0.6

(cf. http://gerbil.aksw.org/gerbil/overview, [3])
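
The F1-measure quoted above is the harmonic mean of precision and recall. The following is a minimal sketch of how it could be computed for one text, assuming annotations are compared as exact (surface form, entity) pairs. Benchmark frameworks use more elaborate matching schemes, and the gold and predicted sets below are made up for illustration.

```python
def precision_recall_f1(predicted: set, gold: set) -> tuple[float, float, float]:
    """Compare predicted annotations with gold annotations by exact match."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical gold standard and system output as (surface form, entity) pairs.
gold = {
    ("US Army", "dbp:United_States_Army"),
    ("John von Neumann", "dbp:John_von_Neumann"),
    ("Polanyi", "dbp:Michael_Polanyi"),
}
predicted = {
    ("US Army", "dbp:Army"),  # a general/specific error
    ("John von Neumann", "dbp:John_von_Neumann"),
    ("Polanyi", "dbp:Michael_Polanyi"),
}

precision, recall, f1 = precision_recall_f1(predicted, gold)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

Under exact matching, a single general/specific error costs both precision and recall, because the wrong entity counts as a false positive and the correct one as a miss.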

Examples:
■ 03 General/Specific: “After serving the US Army, he was again readmitted and even studied under John von Neumann.”
  ○ dbp:Army vs. dbp:United_States_Army
■ 02 Compound Split: “Raman carried out ground-breaking work in the field of light scattering, which earned him the 1930 Nobel Prize in Physics.”
  ○ dbp:Nobel_Prize & dbp:Physics vs. dbp:Nobel_Prize_in_Physics
■ 04 Wrong Entity: “In 1909, Polanyi studied to be a physician, obtaining his medical diploma in 1914.”
  ○ dbp:Michael_Polanyi vs. dbp:Michael_Polanyi_Center

SLIDE 19

TIB|AV-Portal: Data Quality, Semantic Analysis

Asked 20 people and a machine to annotate the same text.

[Figure: star ratings per error category for “human” and “machine”. The slide text does not preserve which rating series belongs to which annotator.]

Category               First series   Second series
01 Missing             ☆☆☆☆☆          ★★★★☆
02 Compound Split      ★☆☆☆☆          ★★☆☆☆
03 General/Specific    ★☆☆☆☆          ★★☆☆☆
04 Wrong Entity        ★★★★★          ☆☆☆☆☆

It seems that
★ automated NEL can significantly support the user in localizing entities,
★ and users can support the machines in the actual disambiguation process.

Solution?
■ Semi-automatic annotations! ★★★★★

[4] Tabea Tietz, Joscha Jäger, Jörg Waitelonis, Harald Sack: Semantic Annotation and Information Visualization for Blogposts with refer. In Proc. of the Second International Workshop on Visualization and Interaction for Ontologies and Linked Data (VOILA '16), volume 1704, pages 28-40, 2016

SLIDE 20

TIB|AV-Portal: Data Quality, Semantic Analysis

Semi-automatic annotation and semantic enrichment tools, e.g.: http://refer.cx

SLIDE 21

TIB|AV-Portal: Data Quality, Semantic Analysis

Semi-automatic Annotation:
■ Improve the NEL algorithm by learning from manually corrected annotations
■ Extend the NEL system's knowledge base with information derived from the manually corrected annotations
  ○ add new links to the KB graph between entities of the same document
  ○ add new surface forms to the candidate mapping dictionary

Dataset      Improvement
AIDA         16.2 %
IITB         14.0 %
KORE50       20.3 %
MSNBC         6.4 %
NEEL2016     15.5 %
OKE1         17.2 %
REUTERS      16.5 %
RSS500       13.1 %
SPOTLIGHT    12.6 %

Improvements on repeating annotations (f-measure, standard NEL benchmark datasets [3])
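
The two knowledge base extensions named on this slide (new surface forms for the candidate dictionary, new links between entities of the same document) can be sketched with plain dictionaries. This is an illustration under assumed data structures, not the NEL system's actual implementation. The function name and the example annotations are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def update_knowledge_base(surface_forms, links, corrected_annotations):
    """Add surface forms and co-occurrence links from one corrected document."""
    entities = set()
    for surface_form, entity in corrected_annotations:
        # New surface form for the candidate mapping dictionary.
        surface_forms[surface_form.lower()].add(entity)
        entities.add(entity)
    # New undirected links between all entities of the same document.
    for a, b in combinations(sorted(entities), 2):
        links[a].add(b)
        links[b].add(a)


surface_forms = defaultdict(set)  # surface form -> candidate entities
links = defaultdict(set)          # entity -> linked entities

# Hypothetical manually corrected annotations of one document.
document = [
    ("Armstrong", "dbp:Neil_Armstrong"),
    ("the moon", "dbp:Moon"),
    ("Saturn V", "dbp:Saturn_V"),
]
update_knowledge_base(surface_forms, links, document)

print(sorted(surface_forms["armstrong"]))
print(sorted(links["dbp:Moon"]))
```

On a repeated annotation run, “Armstrong” now has dbp:Neil_Armstrong as a candidate, and the added links give a graph-based disambiguation step more evidence to work with.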

SLIDE 22

TIB|AV-Portal: Summary

■ TIB|AV-Portal exports all metadata as RDF
■ Many quality issues are still present in automatically extracted metadata
■ NEL can also improve audio-visual extraction methods
■ Semi-automatic annotation tools can compensate for machine and human mistakes
■ NEL can be improved by learning from manually corrected annotations

Better structure and open up the content!

[5] Janna Neumann, Margret Plank: TIB's Portal for audiovisual media: New ways of indexing and retrieval. In IFLA Journal, March 2014, 40: 17-23, doi:10.1177/0340035214526531

SLIDE 23

TIB|AV-Portal: References

[1] Sven Strobel, Paloma Marín-Arraiza: Metadata for Scientific Audiovisual Media: Current Practices and Perspectives of the TIB|AV-Portal. In Proc. of Metadata and Semantics Research: 9th Research Conference, MTSR 2015, Manchester, UK, 2015, Springer
[2] Jörg Waitelonis, Margret Plank, Harald Sack: TIB|AV-Portal: Integrating Automatically Generated Video Annotations into the Web of Data. In Proc. of 20th International Conference on Theory and Practice of Digital Libraries (TPDL 2016)
[3] Ricardo Usbeck, Michael Röder, Axel-Cyrille Ngonga Ngomo, Ciro Baron, Andreas Both, Martin Brümmer, Diego Ceccarelli, Marco Cornolti, Didier Cherix, Bernd Eickmann, Paolo Ferragina, Christiane Lemke, Andrea Moro, Roberto Navigli, Francesco Piccinno, Giuseppe Rizzo, Harald Sack, René Speck, Raphaël Troncy, Jörg Waitelonis and Lars Wesemann: GERBIL: General Entity Annotation Benchmark Framework. In 24th WWW conference
[4] Tabea Tietz, Joscha Jäger, Jörg Waitelonis, Harald Sack: Semantic Annotation and Information Visualization for Blogposts with refer. In Proc. of the Second International Workshop on Visualization and Interaction for Ontologies and Linked Data (VOILA '16), volume 1704, pages 28-40, 2016
[5] Janna Neumann, Margret Plank: TIB's Portal for audiovisual media: New ways of indexing and retrieval. In IFLA Journal, March 2014, 40: 17-23, doi:10.1177/0340035214526531

SLIDE 24

TIB|AV-Portal Challenges managing audiovisual metadata encoded in RDF

SWIB16 Semantic Web in Libraries Conference 2016, 28-30 November 2016, Bonn, Germany | http://swib.org/

Thank You!

Jörg Waitelonis

joerg@yovisto.com

Conception & Prototyping, Software Development, UI/UX Design, Consulting & Operations

http://av.tib.eu/opendata