CITESEERX DATA: SEMANTICIZING SCHOLARLY PAPERS Jian Wu, IST, - PowerPoint PPT Presentation

CITESEERX DATA: SEMANTICIZING SCHOLARLY PAPERS Jian Wu, IST, Pennsylvania State University Chen Liang, IST, Pennsylvania State University Huaiyu Yang, EECS, Vanderbilt University C. Lee Giles, IST & CSE Pennsylvania State University The International Workshop on Scholarly Big Data (SBD 2016)

CiteSeerX Data: Semanticizing Scholarly Big Data 2 Self-Introduction Dr. C. Lee Giles Dr. Jian Wu David Reese Professor Postdoctoral scholar PI and Director of CiteSeerX Tech leader of CiteSeerX Chen Liang Huaiyu Yang PhD student Undergraduate student Pennsylvania State University Vanderbilt University

CiteSeerX Data: Semanticizing Scholarly Big Data 3 Outline • Scholarly Big Data and the Uniqueness of CiteSeerX Data • Data Acquisition and Extraction • Data Products • Raw Data • Production Database • Production Repository • Data Management and Access • Semantic Entity Extraction From Academic Papers

CiteSeerX Data: Semanticizing Scholarly Big Data 4 Scholarly Data as Big Data • “Volume” • About 120 million #Scholarly Documents scholarly documents on the Web – 120TB or #Documents/Million 120 more [1] 100 • Growing at a rate of >1 80 million annually 60 • English only – factor of 2 40 more with other 20 languages 0 • Compare: NASA Earth Exchange Downscaled Climate Projections dataset (17TB) [1] Khabsa and Giles (2014, PLoS ONE)

CiteSeerX Data: Semanticizing Scholarly Big Data 5 Scholarly Big Data Features • “Variety” • Unstructured: document text • Structured: title, author, citation, etc - metadata • Semi-structured: tables, figures, algorithms, etc. • Rich in facts and knowledge • Related data • Social networks, slides, course material, data “inside” papers • “Velocity” • Scholarly Data is expected to be available in real time • On the whole, scholarly Data can be considered an important instance of big data.

CiteSeerX Data: Semanticizing Scholarly Big Data 6 Digital Library Search Engine (DLSE) • Crawl-based vs. submission-based DLSEs Crawl-based Submission-based Data Source Internet Author upload Metadata Source Automatically Author input + (majority) Extracted Automatically Extracted Data Quality varies high Human Labor (relatively) Low High Accessibility Open (or partially) Subscription • Crawl-based DLSEs are important sources of scholarly data for research tasks such as citation recommendation, author name disambiguation, ontologies, document classification, and Science of Science

CiteSeerX Data: Semanticizing Scholarly Big Data 7 The Uniqueness of CiteSeerX Data • Open-access Scholarly Data sets Datasets DBLP MAG* CiteSeerX Documents 5 million 100 million 7 million Header y y y Citations n y y URLs y y y (publishers) (open + publishers) (open) Full text n n y Disambiguated n n y author names * MAG: Microsoft Academic Graph

CiteSeerX Data: Semanticizing Scholarly Big Data 8 Data Acquisition whitelist Wikipedia PubMed Microsoft URLs arXiv External Academic Central Graph URLs Links user submitted URLs seeds open access web crawling digital repositories crawl repository

CiteSeerX Data: Semanticizing Scholarly Big Data 9 Metadata Extraction crawl repository crawl database crawl repository crawl database PDFLib TET PDFBOX/Xpdf PDFMEF Rule-based filter ML-based Filter ParsCit GROBID ParsCit SVMHeaderParse Currently Under test

CiteSeerX Data: Semanticizing Scholarly Big Data 10 Figures/Table/Barchart Extraction • Data: CiteSeerX papers • Extraction: • Extract figures + tables from papers • Extract metadata from figures + tables metadata metadata metadata • Large scale experiment • 6.7 Million papers in 14 infer semantic days with 8 processes trends cell desc. trends

CiteSeerX Data: Semanticizing Scholarly Big Data 11 Ingestion • Ingestion feeds data title title and metadata to the P.2 P.1 author author production retrieval match system cluster title: Focused Crawling Optimization cluster author: Jian Wu • Ingestion clusters near- paper cluster 1 duplicate documents title • Ingestion generate the P.3 author1 author2 citation graph (next slide) cluster title: Deep web crawling • Relational database cluster author: James Schneider, Mary Wilson • File system paper cluster 2 • Apache Solr

CiteSeerX Data: Semanticizing Scholarly Big Data 12 Type 1 node: clusters with both in-degrees and out- 1 degrees, containing papers, may contain citations Type 2 node (root): clusters with zero in-degree and non-zero out-degrees, only containing papers, i.e., 2 papers that are not cited yet. Type 3 node (leaf): clusters with non-zero in-degree and zero out-degrees, only containing citation records, 3 i.e., records without full text papers. Characteristics : • Directed • No cycles: old papers cannot not cite new papers Paper 1 1 2 Paper 2 1 1 Citation 1 newer older 2 Citation 2

CiteSeerX Data: Semanticizing Scholarly Big Data 13 Name Disambiguation • Challenging due to name variations and entity ambiguity • Task 1: distinguish different entities with the same surface name • Task 2: resolve same entities with different surface names Michael J. Jordan Michael I. Jordan Michael Jordan ? Michael W. Jordan (footballer) Michael Jordan (mycologist) C L Giles Lee Giles C Lee Giles Clyde Lee Giles

CiteSeerX Data: Semanticizing Scholarly Big Data 14 User Correction Figure: user-correction link on a paper summary page. • Users can change almost all metadata fields • New values are effective immediately after changes are submitted • Metadata can be changed multiple times • Version control • About 1 million user corrections since 2008.

CiteSeerX Data: Semanticizing Scholarly Big Data 15 Data Products Document Collection of CiteSeerX • Raw Data • Crawl repository 2015 26 million • 24TB PDFs 2014 • Crawl database • 26 million document URLs 2013 • 2.5 million parent URLs 2012 • 16GB 2011 2010 other page 2009 PDF document URL 2008 1.9 million homepage parent URL 0 5 10 15 20 25 30 Indexed Ingested Crawled

CiteSeerX Data: Semanticizing Scholarly Big Data 16 Data Products • Crawl website http://csxcrawlweb01.ist.psu.edu/ submit a URL to crawl Country ranking by number of docs Domain ranking by number of crawled docs

CiteSeerX Data: Semanticizing Scholarly Big Data 17 What Documents Have We Crawled • Manually label 1000 randomly selected crawled documents others 35,0% • Crawl repository can paper 47,9% be used for documents classification experiments to improve web crawling poster 0,6% • Crawl database can non-en 7,2% be used to generate abstract whitelists and 0,5% resume schedule crawl jobs slides book 0,3% report thesis 4,5% 1,8% 1,5% 0,9%

CiteSeerX Data: Semanticizing Scholarly Big Data 18 Production Databases • citeseerx • csx_citegraph • metadata directly • paper clusters extracted from papers • citation graph database.table description rows citeseerx.papers header metadata 6.8 million citeseerx.authors author metadata 20.6 million citeseerx.cannames authors (disambiguated) 1.2 million citeseerx.citations references 150.2 million citeseerx.citationContext citation context 131.9 million csx_citegraph.clusters citation graph (nodes) 45.7 million csx_citegraph.citegraph citation graph (edges) 112.5 million * Data are collected at the beginning of 2016.

CiteSeerX Data: Semanticizing Scholarly Big Data 19 What Does Citation Graph Look Like in- degree slope=−2.37 out-degree slope=−3.20 out-degree Suitable for large slope=−0.22 scale graph analysis In-degree and out-degree distribution of CiteSeerX Citation Graph. Plots made by SNAP. Data are collected at the beginning of 2016.

CiteSeerX Data: Semanticizing Scholarly Big Data 20 Production Repository • 7 million academic • Classification Accuracy documents (beginning of paper 83.0% 2016) others 7.5% • 9TB report 4.5% • PDF thesis 2.6% academic • XML (metadata) slides 0.8% 92.1% documents book 0.7% • body text abstract 0.3% • reference text non-en 0.3% • full text poster 0.2% • version metadata files resume 0%

CiteSeerX Data: Semanticizing Scholarly Big Data 21 Production Repository • False Negatives • Improving Classification Accuracy • Documents mis-classified as non-academic documents • Classifier based on Machine Learning and Structural others 70.7% features (Caragea et al. 2014 paper 12.3% WSC; Caragea et al. 2016 slides 5.7% IAAI) report 0.7% • Accuracy > 90% academic resume 0.7% 28.3% documents thesis 0.3% abstract 0.3% non-en 0.3% poster 0% book 0%

CITESEERX DATA: SEMANTICIZING SCHOLARLY PAPERS Jian Wu, IST, - PowerPoint PPT Presentation

CITESEERX DATA: SEMANTICIZING SCHOLARLY PAPERS Jian Wu, IST, Pennsylvania State University Chen Liang, IST, Pennsylvania State University Huaiyu Yang, EECS, Vanderbilt University C. Lee Giles, IST & CSE Pennsylvania State University The

Scalable Algorithms for Scholarly Figure Mining and Semantics Sagnik Ray Choudhury

ScholarBase: Towards a Cross-Domain Knowledgebase for Linked Scholarly Data Mahmoud Elbattah

Why open access is better for scholarly societies Stuart M. Shieber Welch Professor of Computer

Open Data Driving Scholarly Communications in 2020 Philip E. Bourne UCSD pbourne@ucsd.edu

Other Writing Assignments Literature Reviews - Theoretical Papers -Case Studies - Issue Papers

Unlocking citations from tens of millions of scholarly papers Dario Taraborelli SWIB 2017

Sentiment Analysis of Peer Review Texts for Scholarly Papers Ke Wang & Xiaojun Wan

Financial Support for Student Doctors to Present Scholarly Work at Conferences Research and

Vice Chancellor for Academic Affairs and Provost Office Competitive Scholarly Research Grants

Journals! The name stays constant; The meaning shifts. Platforms in scholarly publishing

Why is a book? The fate of writing, reading and thinking in a world of digital scholarly

Towards Knowledge-Based Assistance for Scholarly Editing Jana Kittelmann Christoph Wernhard MLU

BG3 - Altmetrics CERN Workshop on Innovations in Scholarly Communication (OAI8) Daniel Beucke |

PROCEDURES FOR PRESENTATION OF PAPERS TO THE HOUSE Papers is the collective term used for various

Hull-House Maps and Papers: A Presentation of Nationalities and Wages Hull-House Maps and Papers:

Hull-House Maps and Papers: A Presentation of Hull-House Maps and Papers: A Presentation of

Using and Extending LIMBO for Descriptive Modeling of Arrival Behaviors Symposium on Software

Update on the F 2 experiment Abel Sun Carnegie Mellon University Hall C Collaboration Meeting,

Selective Phrase Pair Extraction for Improved Statistical Machine Translation Luke S.

Sentence-Level Quality Estimation for MT System Combination Tsuyoshi Okita, Rapha el Rubino,

Randomness extraction from Bell violation with continuous parametric down conversion Lijiong Shen

- Texas Instruments MIT - Signals Information and Algorithms Lab Motivation: Low-Power Wake-up

Beam Extraction and Transport Taneli Kalvas Department of Physics, University of Jyvskyl,

Why Meaco? Trading since 1991 Dehumidifiers are our core business

CITESEERX DATA: SEMANTICIZING SCHOLARLY PAPERS Jian Wu, IST, - PowerPoint PPT Presentation

CITESEERX DATA: SEMANTICIZING SCHOLARLY PAPERS Jian Wu, IST, Pennsylvania State University Chen Liang, IST, Pennsylvania State University Huaiyu Yang, EECS, Vanderbilt University C. Lee Giles, IST & CSE Pennsylvania State University The

Scalable Algorithms for Scholarly Figure Mining and Semantics Sagnik Ray Choudhury

ScholarBase: Towards a Cross-Domain Knowledgebase for Linked Scholarly Data Mahmoud Elbattah

Why open access is better for scholarly societies Stuart M. Shieber Welch Professor of Computer

Open Data Driving Scholarly Communications in 2020 Philip E. Bourne UCSD pbourne@ucsd.edu

Other Writing Assignments Literature Reviews - Theoretical Papers -Case Studies - Issue Papers

Unlocking citations from tens of millions of scholarly papers Dario Taraborelli SWIB 2017

Sentiment Analysis of Peer Review Texts for Scholarly Papers Ke Wang &amp; Xiaojun Wan

Financial Support for Student Doctors to Present Scholarly Work at Conferences Research and

Vice Chancellor for Academic Affairs and Provost Office Competitive Scholarly Research Grants

Journals! The name stays constant; The meaning shifts. Platforms in scholarly publishing

Why is a book? The fate of writing, reading and thinking in a world of digital scholarly

Towards Knowledge-Based Assistance for Scholarly Editing Jana Kittelmann Christoph Wernhard MLU

BG3 - Altmetrics CERN Workshop on Innovations in Scholarly Communication (OAI8) Daniel Beucke |

PROCEDURES FOR PRESENTATION OF PAPERS TO THE HOUSE Papers is the collective term used for various

Hull-House Maps and Papers: A Presentation of Nationalities and Wages Hull-House Maps and Papers:

Hull-House Maps and Papers: A Presentation of Hull-House Maps and Papers: A Presentation of

Using and Extending LIMBO for Descriptive Modeling of Arrival Behaviors Symposium on Software

Update on the F 2 experiment Abel Sun Carnegie Mellon University Hall C Collaboration Meeting,

Selective Phrase Pair Extraction for Improved Statistical Machine Translation Luke S.

Sentence-Level Quality Estimation for MT System Combination Tsuyoshi Okita, Rapha el Rubino,

Randomness extraction from Bell violation with continuous parametric down conversion Lijiong Shen

- Texas Instruments MIT - Signals Information and Algorithms Lab Motivation: Low-Power Wake-up

Beam Extraction and Transport Taneli Kalvas Department of Physics, University of Jyvskyl,

Why Meaco? Trading since 1991 Dehumidifiers are our core business

Sentiment Analysis of Peer Review Texts for Scholarly Papers Ke Wang & Xiaojun Wan