

  1. CERMINE: automatic extraction of metadata and references from scientific literature. Dominika Tkaczyk, Pawel Szostek, Piotr Jan Dendek, Mateusz Fedoryszak and Lukasz Bolikowski. Interdisciplinary Centre for Mathematical and Computational Modelling, University of Warsaw. 11th IAPR International Workshop on Document Analysis Systems, 7-10 April 2014. D.Tkaczyk et al. (ICM UW) CERMINE DAS 7-10 April 2014 1 / 21

  2. The goal (labels from the slide's figure): TITLE, AUTHORS, AFFILIATIONS, EMAILS, ABSTRACT, KEYWORDS

  3. The goal (labels from the slide's figure): AUTHOR, TITLE, SOURCE, VOLUME, PAGES, YEAR, URL

  4. The motivation: There are documents without metadata. Metadata information may be incomplete or incorrect.

  5. Requirements: The metadata extraction system should be comprehensive, automatic, modular, open and widely available, easily applicable, flexible and able to adapt to new layouts, well tested.

  6. The process [Diagram: a PDF passes through basic structure extraction; metadata extraction then fills the <front> element and references extraction fills the <back> element of the output JATS XML.]

  7. The process [Same diagram as on slide 6.]

  8. Basic structure extraction
     - Character extraction: iText library
     - Page segmentation: Docstrum
     - Reading order resolving: bottom-up, heuristic-based
     - Initial zone classification: SVM (metadata, references, body and other)

  9. The output
     TrueViz XML format:
     - hierarchical structure containing pages, zones, lines, words, characters
     - all elements have bounding boxes
     - reading order is given
     - zones have labels
     [Example fragment from the slide: <Page> <PageID Value="0"/> <Zone> <ZoneID Value="0"/> <ZoneCorners> <Vertex x="55.320" y="34.295"/> <Vertex x="235.704" y="58.295"/> </ZoneCorners> <ZoneNext Value="1"/> <Category Value="TITLE"/> <Line> <Word> <Character> ...]
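
As a rough illustration of how such output could be consumed, the sketch below reads zone labels, reading-order links and bounding boxes with Python's standard `xml.etree.ElementTree`. The `SAMPLE` string is a hypothetical, abbreviated fragment that only reuses the element names visible on the slide; a real TrueViz file also contains lines, words and characters, and the helper name `read_zones` is an assumption made for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, abbreviated fragment modelled on the element names on the slide.
SAMPLE = """
<Page>
  <PageID Value="0"/>
  <Zone>
    <ZoneID Value="0"/>
    <ZoneCorners>
      <Vertex x="55.320" y="34.295"/>
      <Vertex x="235.704" y="58.295"/>
    </ZoneCorners>
    <ZoneNext Value="1"/>
    <Category Value="TITLE"/>
  </Zone>
</Page>
"""


def read_zones(xml_text):
    """Return one dict per zone: id, label, next zone id and bounding box."""
    page = ET.fromstring(xml_text)
    zones = []
    for zone in page.iter("Zone"):
        corners = list(zone.find("ZoneCorners"))
        xs = [float(v.get("x")) for v in corners]
        ys = [float(v.get("y")) for v in corners]
        zones.append({
            "id": zone.find("ZoneID").get("Value"),
            "label": zone.find("Category").get("Value"),
            "next": zone.find("ZoneNext").get("Value"),  # reading-order link
            "bbox": (min(xs), min(ys), max(xs), max(ys)),
        })
    return zones


zones = read_zones(SAMPLE)
for z in zones:
    print(z["id"], z["label"], z["bbox"], "next:", z["next"])
```

Following the `next` links from zone to zone would recover the reading order that the slide says is stored in the file.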

  10. The process [Same diagram as on slide 6, with the metadata extraction step in focus.]

  11. Metadata extraction
      - Metadata zone classification: SVM (abstract, bib info, type, title, affiliation, author, keywords, correspondence, dates and editor)
      - Metadata extraction: simple rule-based
      - [Example XML output, truncated on the slide: <title>System ... <author>M. Kn... <author>J. Illsl... <affiliation>Uni... <keywords>arti... <journal>Journ... <volume>19<v... <date>14.06.1...]

  12. Zone classification
      - classifiers are based on the LibSVM library
      - a zone is represented by 78 features: geometrical, lexical, sequential, formatting, heuristics
      - the best SVM parameters were found by a grid search over a 3-dimensional space of kernel function types, C (penalty parameter) and γ coefficients; at every grid point a 10-fold cross-validation was performed, and we chose the parameters that gave the best mean accuracy
      - the initial classifier was trained on 964 documents with 155,144 zones in total
      - the metadata classifier was trained on 1,934 documents and 45,035 metadata zones in total
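
The parameter selection described on this slide can be sketched as a plain loop: for every (kernel, C, γ) combination, run 10-fold cross-validation and keep the combination with the best mean accuracy. The code below is only an outline of that loop under stated assumptions. The grid values in `GRID` are made up, and `toy_fit_and_score` is a stand-in that does not train anything; a real version would train a LibSVM model on the training fold and return its accuracy on the test fold.

```python
import itertools
import statistics


def k_fold_indices(n_samples, k=10):
    """Yield k (train, test) index lists that partition range(n_samples)."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test


def grid_search(samples, labels, grid, fit_and_score, k=10):
    """Return (best_params, best_mean_accuracy) over all grid points."""
    best_params, best_acc = None, -1.0
    for kernel, c, gamma in itertools.product(
            grid["kernel"], grid["C"], grid["gamma"]):
        params = {"kernel": kernel, "C": c, "gamma": gamma}
        scores = [fit_and_score(params, samples, labels, train, test)
                  for train, test in k_fold_indices(len(samples), k)]
        mean_acc = statistics.mean(scores)
        if mean_acc > best_acc:
            best_params, best_acc = params, mean_acc
    return best_params, best_acc


def toy_fit_and_score(params, samples, labels, train, test):
    """Stand-in scorer (assumption): ignores the data and returns a fake accuracy."""
    score = 0.5
    score += 0.2 if params["kernel"] == "rbf" else 0.0
    score += 0.1 if params["C"] == 10 else 0.0
    score += 0.1 if params["gamma"] == 0.1 else 0.0
    return score


# Hypothetical grid; the values actually searched are not given on the slide.
GRID = {"kernel": ["linear", "rbf"], "C": [1, 10, 100], "gamma": [0.01, 0.1, 1]}
samples = [[float(i)] for i in range(20)]
labels = [i % 2 for i in range(20)]

best_params, best_acc = grid_search(samples, labels, GRID, toy_fit_and_score)
print(best_params)
```

Passing the scorer in as a function keeps the selection loop independent of the SVM library, so the stand-in can be swapped for a real training routine without touching `grid_search`.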

  13. The process [Same diagram as on slide 6, with the references extraction step in focus.]

  14. Parsed reference extraction
      - Reference strings extraction: K-means clustering
      - Reference parsing: CRF
      - [Example XML output, truncated on the slide: <ref> [1] <author>M.K. ... <title>System... <journal>Journ... ... </ref> <ref>...]

  15. Reference strings extraction
      - clustering text lines into two sets: first lines and the rest
      - unsupervised K-means algorithm with Euclidean distance
      - 5 features (based on length, indentation, space between lines and the text)
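
A minimal sketch of the clustering idea follows: K-means with k = 2 and Euclidean distance splits the lines of a reference section into "first lines" and "the rest", and each first line then starts a new reference string. The three features used here (relative length, indentation, gap above the line) are hypothetical stand-ins for the five features mentioned on the slide. Treating the cluster that contains the very first line as the "first lines" cluster is also an assumption of this example.

```python
import math


def kmeans_two(points, iterations=20):
    """Lloyd's algorithm with k=2 and Euclidean distance.

    Deterministic start: the first point and the point farthest from it.
    Returns a list of cluster indices (0 or 1), one per point.
    """
    c0 = points[0]
    c1 = max(points, key=lambda p: math.dist(p, c0))
    centroids = [c0, c1]
    assignment = [0] * len(points)
    for _ in range(iterations):
        assignment = [
            0 if math.dist(p, centroids[0]) <= math.dist(p, centroids[1]) else 1
            for p in points
        ]
        new_centroids = []
        for k in (0, 1):
            members = [p for p, a in zip(points, assignment) if a == k]
            if not members:
                new_centroids.append(centroids[k])  # keep an empty cluster's centroid
            else:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members)))
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return assignment


def group_references(text_lines, assignment):
    """Start a new reference at every line in the same cluster as line 0."""
    first_cluster = assignment[0]
    refs = []
    for text, cluster in zip(text_lines, assignment):
        if cluster == first_cluster or not refs:
            refs.append(text)
        else:
            refs[-1] += " " + text
    return refs


# Toy data: (relative length, indentation, gap above); all values are made up.
features = [
    (1.00, 0.0, 1.0), (0.60, 1.0, 0.0),
    (1.00, 0.0, 1.0), (0.95, 1.0, 0.0), (0.40, 1.0, 0.0),
    (0.98, 0.0, 1.0),
]
text_lines = ["[1] A. Author, Some title,", "Journal A (2001).",
              "[2] B. Author, Another long", "title that wraps over", "lines (2002).",
              "[3] C. Author, Short title (2003)."]

assignment = kmeans_two(features)
references = group_references(text_lines, assignment)
for ref in references:
    print(ref)
```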

  16. Reference parsing
      - example: [8] Y. Wang, I.T. Phillips and R.M. Haralick, Document zone content classification and its performance evaluation, Pattern Recognition 39 (1) (2006), pp. 57–73.
      - Conditional Random Fields token classifier
      - based on GRMM and MALLET packages
      - 42 constant features + the most popular words + features of two preceding and two following tokens
      - the classifier was trained on 1000 citations from Cora-ref + PubMed
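
To make the "two preceding and two following tokens" idea concrete, the sketch below builds a feature row for each token of a citation and copies in the features of its neighbours within a window of two. It covers only the feature-building step, not CRF training. The tokenizer and the five per-token features are hypothetical illustrations, not the 42 features used by the authors.

```python
import re


def tokenize(citation):
    """Split into word tokens and single punctuation characters."""
    return re.findall(r"\w+|[^\w\s]", citation)


def token_features(token):
    """A few illustrative per-token features (hypothetical)."""
    return {
        "is_number": token.isdigit(),
        "is_year": bool(re.fullmatch(r"(19|20)\d{2}", token)),
        "is_capitalized": token[:1].isupper(),
        "is_initial": len(token) == 1 and token.isupper(),
        "is_punct": bool(re.fullmatch(r"\W+", token)),
    }


def windowed_features(tokens, window=2):
    """For each token, merge its own features with those of its neighbours.

    Keys look like "is_year@+0" (the token itself) or "is_number@-2"
    (the token two positions earlier). Neighbours outside the citation
    are simply omitted.
    """
    own = [token_features(t) for t in tokens]
    rows = []
    for i in range(len(tokens)):
        row = {}
        for offset in range(-window, window + 1):
            j = i + offset
            if 0 <= j < len(tokens):
                for name, value in own[j].items():
                    row[f"{name}@{offset:+d}"] = value
        rows.append(row)
    return rows


tokens = tokenize("Y. Wang, Pattern Recognition 39 (2006), pp. 57–73.")
rows = windowed_features(tokens)
print(tokens[8], rows[8]["is_year@+0"], rows[8]["is_number@-2"])
```

A sequence labeller such as a CRF would then take one such row per token and predict a label (author, title, year and so on) for each.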

  17. GROTOAP2 dataset
      - GROund Truth for Open Access Publications
      - built automatically from PubMed Central Open Access Subset
      - ∼60k ground truth files in TrueViz format with corresponding PDF files
      - [Diagram labels: PubMed Central, NLM XML and PDF files, CERMINE tools, zone text matching.]

  18. Results

      Classifier                  Avg. precision   Avg. recall
      initial zone classifier     91.74%           87.31%
      metadata zone classifier    92.49%           93.83%
      reference parsing           90.18%           89.51%

      Field            Precision   Recall
      journal title    68.68%      49.23%
      volume           97.57%      78.57%
      issue            52.50%      56.64%
      pages            51.37%      34.71%
      year             98.79%      89.18%
      DOI              93.60%      57.46%
      ISSN             44.29%      3.01%

      Field            Avg. adjustment
      article title    95.03%
      abstract         91.43%

      Field            Avg. precision   Avg. recall
      authors          87.19%           82.07%
      affiliations     70.13%           59.44%
      keywords         61.11%           68.37%
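
For readers unfamiliar with the measures, the sketch below computes per-class precision and recall from true and predicted zone labels and then averages them over classes. The slide does not say how its averages were formed, so the unweighted mean over classes used here is an assumption, and the label lists are made-up toy data.

```python
def precision_recall(true_labels, predicted_labels):
    """Return {class: (precision, recall)} for every class that occurs."""
    classes = sorted(set(true_labels) | set(predicted_labels))
    pairs = list(zip(true_labels, predicted_labels))
    result = {}
    for c in classes:
        tp = sum(1 for t, p in pairs if t == c and p == c)
        predicted = sum(1 for p in predicted_labels if p == c)
        actual = sum(1 for t in true_labels if t == c)
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        result[c] = (precision, recall)
    return result


# Toy labels using the four initial zone classes named in the slides.
true_labels = ["metadata", "metadata", "body", "body", "references", "other"]
predicted_labels = ["metadata", "body", "body", "body", "references", "body"]

per_class = precision_recall(true_labels, predicted_labels)
avg_precision = sum(p for p, _ in per_class.values()) / len(per_class)
avg_recall = sum(r for _, r in per_class.values()) / len(per_class)
print(per_class)
print(avg_precision, avg_recall)
```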

  19. Future work
      - a new extraction path for extracting structured full text
      - the evaluation of the entire references extraction path
      - comparing the results to other similar systems

  20. Links
      - CERMINE web service: http://cermine.ceon.pl
      - CERMINE source code: https://github.com/CeON/CERMINE
      - GROTOAP2: http://cermine.ceon.pl/grotoap2/

  21. Thank you! Questions? Dominika Tkaczyk, d.tkaczyk@icm.edu.pl. © 2014 Dominika Tkaczyk. This document is distributed under the Creative Commons Attribution 3.0 license. The complete text of the license can be seen here: http://creativecommons.org/licenses/by/3.0/
