WP3 : entity-fishing service, presented by Tanti Kristanti (INRIA Paris)

SLIDE 1

WP3 : entity-fishing service

Presented by Tanti Kristanti (INRIA – Paris)
For the HIRMEOS Final Workshop
2 June 2019, Marseille, France

SLIDE 2

entity-fishing (1)

  • An open source tool composed of services that automate entity recognition and disambiguation against Wikidata [1]
  • It is not restricted to particular domains, classes of entities or usages [2]
  • Initially developed within the FP7 CENDARI (Collaborative European Digital Archive Infrastructure) project [3]
  • Further developed within the H2020 HIRMEOS (High Integration of Research Monographs in the European Open Science Infrastructure) project to enrich open access digital monographs published on five digital platforms [4]
  • Deployed as part of the national infrastructure Huma-Num in France
  • A stable online service within the DARIAH-EU infrastructure, the European digital research infrastructure for the arts and humanities
  • Distributed under the Apache 2.0 license

[1] Science-Miner, Entity disambiguation, http://science-miner.com/entity-disambiguation/ (accessed 6 May 2019)
[2] Patrice Lopez, Overview: Motivation, 2019, https://nerd.readthedocs.io/en/latest/overview.html (accessed 6 May 2019)
[3] Patrice Lopez, Alexander Meyer, Laurent Romary. CENDARI Virtual Research Environment & Named Entity Recognition techniques. Grenzen überschreiten – Digitale Geisteswissenschaft heute und morgen, Feb 2014, Berlin, Germany, https://hal.inria.fr/hal-01577975 (accessed 6 May 2019)
[4] OAPEN, End user services: Named Entity Recognition and Disambiguation, http://www.oapen.org/content/services-end-user-services (accessed 6 May 2019)

SLIDE 3

entity-fishing (2)

  • The current version (0.0.3) supports English, French, German, Italian and Spanish
  • Based on machine-learning techniques (Gradient Tree Boosting, CRF, word and entity embeddings)
  • For English and French, named entity recognition based on the CRF-based Grobid-NER is used in combination with the disambiguation
  • Machine learning relies on the SMILE library
  • The knowledge base contains:
      • 37 million entities and 154 million statements from Wikidata
      • 15 million word and entity embeddings
  • Project repository: https://github.com/kermitt2/entity-fishing
  • Demo: http://nerd.huma-num.fr/nerd/
  • Documentation: https://nerd.readthedocs.io/en/latest/
SLIDE 4

Examples of Text and PDF File Processing with entity-fishing

SLIDE 5

How to use the entity-fishing services?

[Figure labels: "Response of the service"; "Query parameter to be sent to the service"]

[1] Patrice Lopez, entity-fishing REST API, 2019, https://nerd.readthedocs.io/en/latest/restAPI.html (accessed 13 May 2019)

  • Through the REST API
  • The service can be applied to 4 types of input [1]:
      • text
      • search query
      • weighted vector of terms
      • PDF document
  • REST queries:
      • POST /disambiguate
      • POST /language
      • POST /segmentation
      • POST /customisations
      • GET /kb/concept/{id}
      • GET /kb/term/{term}
      • GET /language?text={text}
      • GET /segmentation?text={text}
      • GET /customisations
      • GET /customisation/{name}
      • PUT /customisation/{profile}
      • DELETE /customisation/{profile}
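
As an illustration of the POST /disambiguate call, here is a minimal Python sketch. It rests on several assumptions that are not stated on the slide and should be checked against the REST API documentation: the JSON field names (text, shortText, termVector, language), the multipart form fields query and file, and the service root URL passed as base_url. The helper names build_query and disambiguate are hypothetical, and the network call relies on the third-party requests package.

```python
import json


def build_query(text=None, short_text=None, term_vector=None, lang="en"):
    """Build the JSON string sent as the `query` form field (assumed name).

    Pass exactly one of `text`, `short_text` (a search query) or
    `term_vector` (a list of (term, weight) pairs). For a PDF, pass none
    of them and attach the file when sending.
    """
    query = {"language": {"lang": lang}}
    if text is not None:
        query["text"] = text
    elif short_text is not None:
        query["shortText"] = short_text
    elif term_vector is not None:
        query["termVector"] = [
            {"term": term, "score": weight} for term, weight in term_vector
        ]
    return json.dumps(query)


def disambiguate(base_url, query_json, pdf_path=None, timeout=60):
    """POST the query (and optionally a PDF) as a multipart form.

    `base_url` is the service root, which is an assumption of this sketch.
    """
    import requests  # third-party; imported lazily so build_query works without it

    url = base_url.rstrip("/") + "/disambiguate"
    files = {"query": (None, query_json)}  # (None, value) sends a plain form field
    if pdf_path is None:
        response = requests.post(url, files=files, timeout=timeout)
    else:
        with open(pdf_path, "rb") as pdf:
            files["file"] = pdf
            response = requests.post(url, files=files, timeout=timeout)
    response.raise_for_status()
    return response.json()
```

Keeping the query construction separate from the HTTP call makes the four input types easy to check without a running service.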
SLIDE 6

WP3 Work

  • Deployment and integration of the entity-fishing services in the partners’ open access platforms
  • The approach: reusability and code sharing
  • Process the following data:
      • 4,000 books in English and French from OpenEdition
      • 2,000 titles in English and German from OAPEN
      • 162 books in English from Ubiquity Press
      • 765 books (606 in German, 159 in English) from UGOE
  • Results: entity-fishing clients in Java, Python and PHP, released under the Apache 2.0 license
      • entity-fishing-client-python: Python client for the entity-fishing service
      • entity-fishing client-php-oe: PHP client for the entity-fishing service, by OpenEdition
      • entity-fishing-client-php: PHP client for the entity-fishing service, by EKT
      • entity-fishing-client-oapen: integration scripts for the OAPEN infrastructure, by OAPEN
  • For validation purposes:
      • Use a CC-BY gold standard HIRMEOS corpus
      • It contains thousands of manually corrected named entity recognition and disambiguation annotations with Wikidata identifiers, which are not present in any existing corpus (e.g. IITB, AQUAINT)

[1] High Integration of Research Monographs in the European Open Science Infrastructure (HIRMEOS), WP3 NERD Work Package Validation (accessed 6 May 2019)
[2] HIRMEOS GitHub, https://github.com/Hirmeos

SLIDE 7

How do partners integrate entity-fishing into their platforms?

SLIDE 8

The OpenEdition Books publishing

  • An entity-fishing PHP client was created and integrated into the Core data enrichment processes
  • Entities are fetched by querying the entity-fishing API services for each chapter
  • Entities are classified as PERSON and LOCATION
  • Entity results are aggregated at book level
  • Location and Person entities at book and chapter level are stored in the Solr index
  • Two facets, Persons and Locations, are added to the front-end interface
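
The chapter-to-book aggregation described above can be sketched as follows. This is an illustration only, not OpenEdition's PHP implementation: it assumes each entity in the service response is a dictionary with rawName and type keys, and the function names facet_values and book_facets are hypothetical.

```python
def facet_values(entities, wanted=("PERSON", "LOCATION")):
    """Group distinct entity names by class, keeping only the wanted classes."""
    facets = {cls: set() for cls in wanted}
    for entity in entities:
        cls = entity.get("type")
        if cls in facets:
            facets[cls].add(entity["rawName"])
    return facets


def book_facets(chapters):
    """Union the per-chapter facet values into one book-level record.

    `chapters` maps a chapter identifier to its list of entities.
    """
    book = {}
    for chapter_entities in chapters.values():
        for cls, names in facet_values(chapter_entities).items():
            book.setdefault(cls, set()).update(names)
    # Sorted lists are convenient for multi-valued fields in a search index.
    return {cls: sorted(names) for cls, names in book.items()}
```

Both the per-chapter sets and the book-level lists could then be written to the index as multi-valued fields, one per facet.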

SLIDE 9

UGOE-SUB

  • entity-fishing is integrated into the publishing workflow of Göttingen University Press (GUP) to enable the semi-automatic indexing of its monographs
  • Titles, abstracts and metadata of the monographs are processed by the entity-fishing API to identify and categorize the named entities
  • Named entities are classified into three classes: PERSON, LOCATION and ORGANIZATION
  • The number of occurrences of each entity is shown
  • The indexed data are displayed as facets, which are made available to users as « Keywords »; this lets users quickly find monographs by the entities that appear in them
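
Counting how often each entity occurs can be sketched with a counter keyed on the entity class and its Wikidata identifier, so that different surface forms of the same entity are counted together. The field names type and wikidataId and the class labels are assumptions about the response format, and keyword_counts is a hypothetical helper, not part of the GUP workflow.

```python
from collections import Counter


def keyword_counts(entities, classes=("PERSON", "LOCATION", "ORGANIZATION")):
    """Count occurrences per (class, Wikidata identifier) pair.

    Entities without a Wikidata identifier or outside `classes` are skipped.
    """
    counts = Counter()
    for entity in entities:
        if entity.get("type") in classes and "wikidataId" in entity:
            counts[(entity["type"], entity["wikidataId"])] += 1
    return counts
```

Calling most_common() on the result gives the keywords ordered by frequency.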

SLIDE 10

EKT / National Documentation Center

  • The current release of Open Monograph Press (OMP) does not support any annotation service, so EKT has extended OMP with entity-fishing support
  • The entity-fishing API service is integrated into the OMP monograph landing pages to annotate the abstract
  • Two phases of implementation:
      • Create a PHP client that acts as a wrapper around the entity-fishing service, hiding its complexity from the user:
          • The complexity of the HTTP protocol is hidden
          • The JSON result of the entity-fishing service is wrapped in high-level class objects
      • Integrate the client into the OMP software
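
The idea of wrapping the JSON result in high-level objects can be sketched as below. EKT's client is written in PHP; this Python version only illustrates the pattern. The Entity class, parse_response, and the JSON keys (entities, rawName, offsetStart, offsetEnd, type, wikidataId) are assumptions made for the example.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Entity:
    """One recognised mention, exposed with typed attributes instead of raw JSON."""
    raw_name: str
    offset_start: int
    offset_end: int
    ner_type: Optional[str] = None
    wikidata_id: Optional[str] = None

    @classmethod
    def from_json(cls, data):
        return cls(
            raw_name=data["rawName"],
            offset_start=data["offsetStart"],
            offset_end=data["offsetEnd"],
            ner_type=data.get("type"),          # absent for non-NER mentions
            wikidata_id=data.get("wikidataId"),  # absent when not disambiguated
        )


def parse_response(body):
    """Turn the raw JSON response body into a list of Entity objects."""
    payload = json.loads(body)
    return [Entity.from_json(item) for item in payload.get("entities", [])]
```

Callers then work with attributes such as entity.wikidata_id and never touch the HTTP layer or the JSON structure directly.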
SLIDE 11

Ubiquity Press (UB)

  • Developed an internal service that receives notifications from the existing company platform when a new article is published, POSTs its content to the entity-fishing API to retrieve all the entities, and stores them locally
  • The entities are shown to the reader as clickable links to the corresponding Wikipedia entry
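
Rendering the stored entities as clickable links can be sketched as follows. This is not Ubiquity Press's code: it assumes each entity carries character offsets (offsetStart, offsetEnd) and a Wikipedia page identifier (wikipediaExternalRef), and it builds links with Wikipedia's curid URL parameter. The function name link_entities is hypothetical.

```python
from html import escape


def link_entities(text, entities, lang="en"):
    """Return `text` as HTML with each linked entity wrapped in an <a> tag."""
    parts, cursor = [], 0
    for entity in sorted(entities, key=lambda e: e["offsetStart"]):
        start, end = entity["offsetStart"], entity["offsetEnd"]
        page_id = entity.get("wikipediaExternalRef")
        if page_id is None or start < cursor:
            continue  # skip entities without a page or overlapping a previous link
        parts.append(escape(text[cursor:start]))
        url = f"https://{lang}.wikipedia.org/?curid={page_id}"
        parts.append(f'<a href="{url}">{escape(text[start:end])}</a>')
        cursor = end
    parts.append(escape(text[cursor:]))
    return "".join(parts)
```

Escaping the surrounding text keeps characters such as & or < in the article from breaking the generated markup.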

SLIDE 12

OAPEN

  • Created scripts to:
      • Call the entity-fishing service with 1) the path to a PDF and 2) the API URL as arguments
      • Store the entity-fishing response locally
      • Combine the entity-fishing results with the unique identifier of the book or chapter in the OAPEN Library
      • Export the database to CSV
  • OAPEN plans to make the data available as a CC0-licensed file, which will be published on the OAPEN Library metadata page
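
The last two script steps, combining results with an identifier and exporting to CSV, can be sketched as below. The column names, the shape of the stored responses and the function write_entities_csv are assumptions made for illustration, not OAPEN's actual scripts.

```python
import csv
import io

FIELDS = ["book_id", "rawName", "type", "wikidataId"]


def write_entities_csv(results, out):
    """Write one CSV row per entity, prefixed with the book or chapter identifier.

    `results` maps an identifier to a stored response (a dict with an
    `entities` list); `out` is any writable text file object.
    """
    writer = csv.DictWriter(out, fieldnames=FIELDS, extrasaction="ignore")
    writer.writeheader()
    for book_id, response in results.items():
        for entity in response.get("entities", []):
            # Fields missing from an entity are written as empty cells.
            writer.writerow({**entity, "book_id": book_id})


# Usage with an in-memory buffer; a real script would open a file instead.
buffer = io.StringIO()
write_entities_csv(
    {"book-1": {"entities": [{"rawName": "Marseille", "type": "LOCATION"}]}},
    buffer,
)
```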

SLIDE 13