Machine Learning Reference Extraction Using GROBID
Jacopo Notarstefano, jacopo.notarstefano [at] cern.ch
October 14th, 2015

Use cases

Curators often find themselves in one of the following situations:

1. They only have the PDF of a paper, but no metadata.
2. The source provides only part of the metadata, but they want to extract more from the PDF.
3. They want to save the time spent entering metadata manually in BibEdit.

GROBID could help them solve these problems.

GROBID

GROBID (GeneRation Of BIbliographical Data) is a machine learning library for parsing unstructured PDFs into structured XML documents, with a focus on technical and scientific publications.

GROBID is a Java library that wraps Wapiti, a C++ toolkit for segmenting and labeling sequences.

GROBID is widely used, for example at ResearchGate, Mendeley, HAL...

Labeling sequences and PDFs, 1/2

Let's see why extracting metadata from a PDF can be reduced to labeling a sequence. Consider, for example, this reference:

G. Isidori and F. Teubert, Status of indirect searches for New Physics with heavy flavour decays after the initial LHC run, Eur.Phys.J.Plus 129 (2014) 40

Labeling sequences and PDFs, 2/2

Extracting metadata can be seen as labeling each word with a category, encoded in this case as a color:

G. Isidori and F. Teubert, Status of indirect searches for New Physics with heavy flavour decays after the initial LHC run, Eur.Phys.J.Plus 129 (2014) 40
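The idea of labeling each word with a category can be sketched in a few lines of Python. This is only an illustration: the label names and their assignment are written by hand here (they mirror the fields of the TEI example in this presentation), and `fields_from_labels` is a hypothetical helper, not part of GROBID. A trained model would predict the labels instead.

```python
from itertools import groupby

# Hand-assigned labels (an assumption for illustration, not GROBID output).
labeled_spans = [
    ("G. Isidori and F. Teubert,", "author"),
    ("Status of indirect searches for New Physics with heavy flavour "
     "decays after the initial LHC run,", "title"),
    ("Eur.Phys.J.Plus", "journal"),
    ("129", "volume"),
    ("(2014)", "date"),
    ("40", "issue"),
]

# The sequence-labeling view: one (token, label) pair per word.
sequence = [
    (token, label)
    for text, label in labeled_spans
    for token in text.split()
]


def fields_from_labels(pairs):
    """Group consecutive tokens that share a label back into metadata fields."""
    fields = {}
    for label, group in groupby(pairs, key=lambda pair: pair[1]):
        fields.setdefault(label, []).append(" ".join(tok for tok, _ in group))
    return fields


fields = fields_from_labels(sequence)
```

Once every token carries a label, recovering the metadata is just a matter of merging runs of identically labeled tokens, which is why the extraction problem reduces to sequence labeling.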

GROBID architecture

In GROBID parlance, the knowledge that is used to extract metadata from data is called a model. GROBID is nothing more than a cascade of models, each acting on the output of the previous one.
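A cascade of this kind can be sketched as a chain of functions, each consuming the output of the one before. The three stages below are hypothetical stand-ins that use simple string rules; in GROBID each stage would be a trained sequence-labeling model, and the stage names here are assumptions chosen for illustration.

```python
from functools import reduce


def segmentation(text):
    # Stage 1: separate the reference section from the rest of the document.
    body, _, references = text.partition("References\n")
    return {"body": body.strip(), "references": references.strip().splitlines()}


def citation(document):
    # Stage 2: split each reference string from stage 1 into coarse fields.
    parsed = []
    for line in document["references"]:
        authors, _, rest = line.partition(", ")
        parsed.append({"authors": authors, "rest": rest})
    return {**document, "references": parsed}


def author_names(document):
    # Stage 3: split the author string from stage 2 into individual names.
    refs = [
        {**ref, "authors": ref["authors"].split(" and ")}
        for ref in document["references"]
    ]
    return {**document, "references": refs}


CASCADE = [segmentation, citation, author_names]


def run_cascade(text, models=CASCADE):
    """Feed the output of each model into the next one."""
    return reduce(lambda data, model: model(data), models, text)


result = run_cascade(
    "Some body text.\n"
    "References\n"
    "G. Isidori and F. Teubert, Status of indirect searches, "
    "Eur.Phys.J.Plus 129 (2014) 40"
)
```

One consequence of this design is that an error made by an early stage propagates to every later stage, so the quality of the first models matters most.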

GROBID's output: TEI

TEI (Text Encoding Initiative) publishes a set of guidelines which specify encoding methods for machine-readable texts. By extension, we will call "TEI" the format described by these guidelines. GROBID's output conforms to a subset of TEI.

    <biblStruct xml:id="b2">
      <analytic>
        <title level="a" type="main">
          Status of indirect searches for New Physics with heavy flavour decays after the initial LHC run
        </title>
        <author>
          <persName>
            <forename type="first">G</forename>
            <surname>Isidori</surname>
          </persName>
        </author>
        <author>
          <persName>
            <forename type="first">F</forename>
            <surname>Teubert</surname>
          </persName>
        </author>
      </analytic>
      <monogr>
        <title level="j">Eur.Phys.J.Plus</title>
        <imprint>
          <biblScope unit="volume">129</biblScope>
          <biblScope unit="issue">40</biblScope>
          <date type="published" when="2014" />
        </imprint>
      </monogr>
    </biblStruct>
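As a minimal sketch of how such a `biblStruct` could be consumed, the following Python reads the fragment with the standard library's `xml.etree.ElementTree`. The function name `parse_bibl_struct` and the keys of the returned dictionary are assumptions made for this example. The fragment here carries no namespace declaration; a complete GROBID document typically declares the TEI namespace, in which case the tag names in the paths would need to be namespace-qualified.

```python
import xml.etree.ElementTree as ET

TEI_SNIPPET = """<biblStruct xml:id="b2">
  <analytic>
    <title level="a" type="main">
      Status of indirect searches for New Physics with heavy flavour decays after the initial LHC run
    </title>
    <author><persName>
      <forename type="first">G</forename><surname>Isidori</surname>
    </persName></author>
    <author><persName>
      <forename type="first">F</forename><surname>Teubert</surname>
    </persName></author>
  </analytic>
  <monogr>
    <title level="j">Eur.Phys.J.Plus</title>
    <imprint>
      <biblScope unit="volume">129</biblScope>
      <biblScope unit="issue">40</biblScope>
      <date type="published" when="2014" />
    </imprint>
  </monogr>
</biblStruct>"""


def parse_bibl_struct(xml_text):
    """Turn one namespace-free biblStruct fragment into a plain dictionary."""
    root = ET.fromstring(xml_text)
    authors = []
    for pers in root.iter("persName"):
        forename = pers.findtext("forename", default="")
        surname = pers.findtext("surname", default="")
        authors.append(f"{forename} {surname}".strip())
    date = root.find("monogr/imprint/date")
    return {
        # Collapse the whitespace introduced by pretty-printing the title.
        "title": " ".join(root.findtext("analytic/title", default="").split()),
        "authors": authors,
        "journal": root.findtext("monogr/title"),
        "volume": root.findtext("monogr/imprint/biblScope[@unit='volume']"),
        "issue": root.findtext("monogr/imprint/biblScope[@unit='issue']"),
        "year": date.get("when") if date is not None else None,
    }


record = parse_bibl_struct(TEI_SNIPPET)
```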

GROBID credits

GROBID is the work of Patrice Lopez, developer at INRIA.

It has been adapted to papers from the HEP community by Joseph Boyd as part of his Master's Thesis for EPFL, under the supervision of Gilles Louppe, Senior Fellow at Inspire.

In particular, Joseph added to GROBID the concept of a collaboration, such as ATLAS or CMS, and considerably improved the accuracy of GROBID's models by providing lots of training data, taken from Inspire.

Caveat emptor

"Converting PDF to XML is a bit like converting hamburgers into cows." (Michael Kay)

That is, GROBID is not magic: it will misclassify things in various ways, and requires lots of training data to function properly.

Invenio-Grobid

Unfortunately, GROBID's XML output is not compatible with what we use in Inspire. So we need to add a layer between GROBID and Invenio: invenio-grobid.

invenio-grobid is the joint work of myself, my supervisor Jan Åge Lavik, and Ilias Koutsakis.

invenio-grobid was recently released on PyPI and is available at https://github.com/inspirehep/invenio-grobid .

Current status and tentative roadmap, 1/2

1. Allow catalogers to upload PDFs
   - Automatic extraction of metadata
   - Possibility to push to system
2. Allow catalogers to edit extracted metadata
   - Perhaps integrate JSONEditor?

Current status and tentative roadmap, 2/2

3. Take over bibliographic reference parsing (i.e. kill Refextract)
   - Consolidate with other services (e.g. CrossRef)
   - Integrate in automatic ingestion workflows
4. Pre-fill user submission forms from PDF
5. invenio-grobid could be used to feed back corrections in GROBID, improving its precision
   - Advanced visual interface to correct extraction (e.g. side-by-side view)
