Automatic Identification of Research Articles from Crawled Documents

Automatic Identification of Research Articles from Crawled Documents
Cornelia Caragea¹, Jian Wu², Kyle Williams², Sujatha Das G.¹, Madian Khabsa³, Pradeep Teregowda³, C. Lee Giles²,³
¹ Computer Science and Engineering, University of North Texas
² Information Sciences and Technology, ³ Computer Science and Engineering, Pennsylvania State University
See CIKM 2013 and ICDM 2011 plenaries for more details

Online Research Article Libraries
• Digital libraries store and index research articles
  – Make it easier for researchers to search for scientific information
• Examples of online scholarly digital libraries:
  – CiteSeerX, Microsoft Academic Search, arXiv, ArnetMiner, ACM DL, Google Scholar, PubMed.
• The size of online digital libraries has grown from thousands to many millions of research articles

Large Number of Scholarly Documents on the Web
[Figure: bar chart of collection size in millions (axis from 0 to 120) for Total, Scholar, Web of Science, Academic, and PubMed. Estimates for early 2013. Source: Khabsa, Giles, 2014 (in review).]

Online Research Article Digital Libraries
• Medium for answering questions such as:
  – How do topics emerge, evolve, or disappear?
  – What is a good measure of quality of published works?
  – What are the most promising areas of research?
  – How do authors connect and influence each other?
  – Who are the experts in a field?
  – What works are similar?
  – …

CiteSeerX
http://citeseerx.ist.psu.edu
• CiteSeerX crawls researcher homepages and repositories on the web for research papers in PDF, formerly only in computer science, but now in all fields
• Converts PDF to text
• Automatically extracts OAI metadata and other data
• Automatic citation indexing, links to cited documents, creation of document page, author disambiguation
• Software is open source and can be used to build other such tools
• Data shared with others for research
Statistics:
• ~3 M documents
• Millions of files
• 80 M citations
• 12 M authors
• 2 to 4 M hits per day
• 100K documents added monthly
• 300K documents downloaded monthly
• 800K individual users
• Several Tbytes

CiteSeer (aka ResearchIndex)
[Photos: C. Lee Giles, Kurt Bollacker, Steve Lawrence]
• Project of NEC Research Institute
• Hosted at Princeton, from 1997 – 2004
• Moved to Penn State after collaborators left NEC
• Provided a broad range of unique services including
  – Automatic metadata extraction
  – Autonomous citation indexing
  – Reference linking
  – Full text indexing
  – Similar documents listing
  – Several other pioneering features
• Impact
  – Changed scientific research; preceded Google Scholar
  – Shares code and data

Research with CiteSeerX Data
• Large data set with millions of categories and millions of examples
  – Authors, papers, citations, tables, figures, equations, etc.
  – Downloadable from Amazon S3
• Proven as a powerful resource in many applications that analyze research articles at web-wide scale, including:
  – topic classification of research articles
  – document and citation recommendation
  – author name disambiguation
  – expert search
  – topic evolution
  – collaborator recommendation
• These applications require accurate and representative collections of research articles.
  – This depends on the quality of a classifier that distinguishes research articles from other documents crawled on the Web.

CiteSeerX Growth
[Figure: "CiteSeerX Document Collection", documents in millions (axis from 0 to 14) by year, 2008 to 2013; series: crawled, ingested, indexed.]
• The growth in the number of crawled documents as well as in the number of research papers indexed by CiteSeerX between '08 and '13.

Research Question
Classify Research Papers from Large Focused Crawls
• How to design features that capture the specifics of research articles and result in classification models that accurately and efficiently identify such documents from a collection of documents crawled on the Web?
• Scholar, CiteSeer, and MAS do this, but how well?

Automatic Research Article Classification Methodology
• Classify documents as research if they contain either of the words "references" or "bibliography" in the text
  – Current method in CiteSeer
  – Drawbacks:
    • Will mistakenly classify documents such as CVs or slides as research articles if they contain references
    • Will fail to identify research articles that contain neither of the two words
• Classify documents using a "bag of words" approach
  – Drawbacks:
    • May not capture the specifics of research articles, e.g., due to the diversity of the topics covered in CiteSeerX.
    • For example, an article in HCI may have a different vocabulary space compared to a paper in IR, but some essential terms may persist across papers.
• Better methods?
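The two baselines on this slide can be illustrated with a minimal Python sketch. It is not the CiteSeer implementation: the function names `keyword_rule` and `bag_of_words`, the whole-word matching, the tokenization, and the three sample strings are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical names and sample texts; the slides give no implementation.
KEYWORDS = ("references", "bibliography")

def keyword_rule(text: str) -> bool:
    """Label a document as research if either keyword occurs as a whole word.

    Whole-word, case-insensitive matching is an assumption of this sketch.
    """
    lowered = text.lower()
    return any(re.search(rf"\b{word}\b", lowered) for word in KEYWORDS)

def bag_of_words(text: str) -> Counter:
    """Count lowercase alphabetic tokens, ignoring word order."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

paper = "Abstract. We study crawling. References [1] A. Author."
cv = "Curriculum Vitae. Education. References available on request."
survey = "Abstract. We survey ranking methods. Works Cited [1] B. Author."

print(keyword_rule(paper))   # True
print(keyword_rule(cv))      # True: a CV is wrongly accepted
print(keyword_rule(survey))  # False: a research article is missed
print(bag_of_words(paper)["references"])  # 1
```

The CV and the survey show the two drawbacks listed above. The `Counter` returned by `bag_of_words` is the kind of term-count vector a bag-of-words classifier would be trained on.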

Possible Features for Research Article Identification
Data derived from PDFBox text

Structural (Str) Features for Research Article Identification

Textual Features

Datasets
• Two independent sets of documents sampled from CiteSeerX:
  – 1000 docs sampled from the crawled docs (Crawl)
  – 1500 docs sampled from CiteSeerX that passed the "references" or "bibliography" filter (CiteSeerX)
  – Data is three years old
• Manual labeling:
  – Positive docs: papers in conference proceedings, journal articles, research press releases, book chapters, and technical reports
  – Negative docs: books, theses, long technical documentation of more than 50 pages, slides, posters, incomplete papers/books (e.g., a references list, preface, table, abstract), brochures (e.g., a company introduction, circular, ad, product manual, government report, meeting notes, policy, form instruction, code, installation guide), handouts, homework, schedules, agendas, news, forms, flyers, syllabi, class notes, letters, curricula vitae, resumes, memos, speeches.
• Datasets description:
  – Missing text mostly from scanned documents – used PDFBox
