Building a Web-Scale Dependency-Parsed Corpus from Common Crawl

Alexander Panchenko, Eugen Ruppert, Stefano Faralli, Simone Paolo Ponzetto, and Chris Biemann

Introduction May 10, 2018 Building a Web-Scale Dependency-Parsed Corpus from Common Crawl, Panchenko et al. LREC’18 2/24

Introduction: Motivation

Why are large corpora essential for NLP?
- unsupervised methods, pre-training, and more …
- word embeddings [Mikolov et al., 2013];
- open information extraction [Banko et al., 2007];
- the “unreasonable effectiveness of big data” [Halevy et al., 2009].

Image source: https://goo.gl/egF322

Introduction: Motivation

Some popular datasets used in NLP research:
- BNC: 0.1 billion tokens;
- ukWaC: 2 billion tokens;
- Wikipedia: 3 billion tokens.

Web-scale datasets:
- ClueWeb12: 0.7 billion documents;
- CommonCrawl 2017: 3 billion documents;
- The indexed Web: 5 billion documents;
- The Web: 50 billion documents.

Introduction: Motivation

Difficulties of using Common Crawl directly:
- documents are not linguistically analyzed;
- big data infrastructure and skills are needed.

Objectives of this work: make access to web-scale corpora a commodity:
1. easy-to-use: no download is needed; access via API or web interface;
2. linguistically preprocessed;
3. original texts are available.

Related Work

Related Work: Large-scale text collections

                 WaCkypedia  Wikipedia  PukWaC  GigaWord  ENCOW16  ClueWeb12  Syn.Ngrams
Tokens, 10^9     0.80        2.90       1.91    1.76      16.82    N/A        345.00
Documents, 10^6  1.10        5.47       5.69    4.11      9.22     733.02     3.50
Type             Encyclop.   Encyclop.  Web     News      Web      Web        Books
Source texts     Yes         Yes        Yes     Yes       Yes      Yes        No
Preprocessing    Yes         No         Yes     No        Yes      No         No
NER              No          No         No      No        Yes      No         No
Dep. parsed      Yes         No         Yes     No        Yes      No         Yes

Related Work: Common Crawl as a corpus
- [Laippala & Ginter, 2014]: used Common Crawl to construct a Finnish Parsebank (1.5 billion tokens, 116 million sentences);
- [Pennington et al., 2014]: GloVe embeddings trained on English Common Crawl: 42 and 820 billion tokens (tokenization, no source texts);
- [Grave et al., 2018]: fastText embeddings trained on Common Crawl for 158 languages (tokenization, no source texts).

Building a Web-Scale Corpus

Building a Web-Scale Corpus: Corpus construction approach

[Figure: processing pipeline]
1. The Web
2. Crawling Web Pages: CCBot (Apache Nutch), producing WARC web crawls (§3.1)
3. Preprocessing: C4Corpus (Apache Hadoop), producing filtered preprocessed documents (§3.2)
4. Linguistic Analysis: lefex (Apache Hadoop), producing DepCC: Dependency Parsed Corpus (§3.3)
   - POS Tagging (OpenNLP)
   - Lemmatization (Stanford)
   - Named Entity Recognition (Stanford)
   - Dep. Parsing (Malt + collapsing)
5. Comp. of Distributional Model: JoBimText (Apache Spark), producing Term Vectors and a Distributional Thesaurus (§5.2)

Building a Web-Scale Corpus: Preprocessing of texts

C4Corpus tool [Habernal et al., 2016]:
1. Language detection, license detection, and removal of boilerplate page elements, such as menus;
2. “Exact match” document de-duplication;
3. Removal of near-duplicate documents.

s3://commoncrawl/contrib/c4corpus/CC-MAIN-2016-07
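The two de-duplication steps listed for the preprocessing stage can be illustrated with a small in-memory sketch. This is not the C4Corpus implementation: the slide does not say which algorithm is used for near-duplicates, so word-shingle Jaccard similarity is swapped in here as an assumption, and the function names, the shingle size and the 0.8 threshold are all hypothetical. The pairwise comparison is quadratic, so it only suits toy inputs.

```python
import hashlib
import re


def normalize(text):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", text).strip().lower()


def shingles(text, n=3):
    """Return the set of word n-grams of the normalized text."""
    words = normalize(text).split()
    if len(words) < n:
        return {tuple(words)}
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def deduplicate(docs, threshold=0.8):
    """Drop exact duplicates (by hash), then near-duplicates (by Jaccard).

    The threshold is a hypothetical value chosen for illustration only.
    """
    seen_hashes = set()
    kept, kept_shingles = [], []
    for doc in docs:
        digest = hashlib.sha1(normalize(doc).encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # "exact match" duplicate after normalization
        sh = shingles(doc)
        if any(jaccard(sh, other) >= threshold for other in kept_shingles):
            continue  # near duplicate of a document that was already kept
        seen_hashes.add(digest)
        kept.append(doc)
        kept_shingles.append(sh)
    return kept


docs = [
    "The quick brown fox jumps over the lazy dog",
    "the quick  brown fox jumps over the lazy dog",
    "The quick brown fox jumps over the lazy dog today",
    "Completely different text about dependency parsing",
]
unique_docs = deduplicate(docs)
print(len(unique_docs), "of", len(docs), "documents kept")
```

At web scale the pairwise comparison would have to be replaced by a hashing scheme that groups candidate duplicates, which is presumably why the pipeline runs this stage on Apache Hadoop.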

Building a Web-Scale Corpus: Stages of development of the corpus
… based on the Common Crawl 2016-07 web crawl dump

Stage of the Processing                      Size (.gz)
Input raw web crawl (HTML, WARC)             29,539.4 Gb
Preprocessed corpus (simple HTML)            832.0 Gb
Preprocessed corpus English (simple HTML)    683.4 Gb
Dependency-parsed English corpus (CoNLL)     2,624.6 Gb
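The sizes above imply a few ratios that are easy to work out. The short calculation below uses only the four figures from the table; the dictionary keys and variable names are made up for illustration. All sizes are compressed (.gz), so the ratios describe compressed volumes rather than raw text.

```python
# Compressed sizes in Gb, copied from the table of processing stages.
sizes_gb = {
    "raw_crawl": 29539.4,
    "preprocessed": 832.0,
    "preprocessed_english": 683.4,
    "parsed_english": 2624.6,
}

# Share of the raw crawl that survives boilerplate removal and de-duplication.
kept_after_preprocessing = sizes_gb["preprocessed"] / sizes_gb["raw_crawl"]

# Share of the preprocessed corpus that is English.
english_share = sizes_gb["preprocessed_english"] / sizes_gb["preprocessed"]

# Growth caused by adding CoNLL annotations to the English texts.
annotation_growth = sizes_gb["parsed_english"] / sizes_gb["preprocessed_english"]

print(f"kept after preprocessing: {kept_after_preprocessing:.1%}")
print(f"English share:            {english_share:.1%}")
print(f"annotation growth:        {annotation_growth:.2f}x")
```

Roughly 2.8% of the compressed crawl remains after preprocessing, about 82% of that is English, and the linguistic annotations make the English corpus close to four times larger.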

Building a Web-Scale Corpus: Linguistic analysis of texts
1. POS Tagging and Lemmatization:
   - OpenNLP POS tagger;
   - Stanford lemmatizer.
2. Named Entity Recognition:
   - Stanford NER [Finkel et al., 2005];
   - 7.48 billion occurrences of entities (251.92 billion tokens).
3. Dependency Parsing:
   - Malt parser [Hall et al., 2010];
   - parsing of 1 Mb of text / core in 1–4 min.;
   - … also used in PukWaC [Baroni et al., 2009], ENCOW16 [Schäfer, 2015];
   - collapsing with [Ruppert et al., 2015].
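The annotation layers listed above (POS tag, lemma, named entity, dependency head and relation) end up as columns of a CoNLL file. The reader below is a minimal sketch of how such output could be consumed. The exact column layout is an assumption: ten tab-separated columns in CoNLL-U order with the named entity tag in the last one, 1-based token ids and head 0 for the root. The sample sentence, its tags and all names in the code are hypothetical, not taken from the corpus.

```python
from typing import Iterable, Iterator, List, NamedTuple, Tuple


class Token(NamedTuple):
    idx: int
    form: str
    lemma: str
    pos: str
    head: int
    deprel: str
    ner: str


def read_sentences(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield sentences as token lists; blank lines separate sentences."""
    sentence: List[Token] = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#"):
            continue  # skip comment or metadata lines
        if not line.strip():
            if sentence:
                yield sentence
                sentence = []
            continue
        cols = line.split("\t")
        # Assumed columns: 0=ID 1=FORM 2=LEMMA 3=POS 6=HEAD 7=DEPREL 9=NER
        sentence.append(Token(int(cols[0]), cols[1], cols[2], cols[3],
                              int(cols[6]), cols[7], cols[9]))
    if sentence:
        yield sentence


def dependency_pairs(sentence: List[Token]) -> List[Tuple[str, str, str]]:
    """Return (head lemma, relation, dependent lemma) for each token."""
    by_idx = {t.idx: t for t in sentence}
    pairs = []
    for t in sentence:
        head = by_idx[t.head].lemma if t.head in by_idx else "ROOT"
        pairs.append((head, t.deprel, t.lemma))
    return pairs


# Hypothetical sample in the assumed layout.
SAMPLE = "\n".join([
    "1\tBerlin\tBerlin\tNNP\tNNP\t_\t2\tnsubj\t_\tB-Location",
    "2\thosts\thost\tVBZ\tVBZ\t_\t0\tROOT\t_\tO",
    "3\tconferences\tconference\tNNS\tNNS\t_\t2\tdobj\t_\tO",
    "",
])

for sent in read_sentences(SAMPLE.splitlines()):
    for pair in dependency_pairs(sent):
        print(pair)
```

Triples of this kind (head, relation, dependent) are the sort of syntactic context a distributional model can be computed from, so a streaming reader that never holds more than one sentence in memory is a natural fit for files of this size.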
