Building a Web-Scale Dependency-Parsed Corpus from Common Crawl

Alexander Panchenko, Eugen Ruppert, Stefano Faralli, Simone Paolo Ponzetto, and Chris Biemann

LREC’18, May 10, 2018



Introduction


Motivation

Why are large corpora essential for NLP? Unsupervised methods, pre-training, and more:

  • word embeddings [Mikolov et al., 2013];
  • open information extraction [Banko et al., 2007];
  • the “unreasonable effectiveness of big data” [Halevy et al., 2009].

Image source: https://goo.gl/egF322

Some popular datasets used in NLP research:

  • BNC: 0.1 billion tokens;
  • ukWaC: 2 billion tokens;
  • Wikipedia: 3 billion tokens.

Web-scale datasets:

  • ClueWeb12: 0.7 billion documents;
  • CommonCrawl 2017: 3 billion documents;
  • the indexed Web: 5 billion documents;
  • the Web: 50 billion documents.

Difficulties in using Common Crawl directly:

  • documents are not linguistically analyzed;
  • big data infrastructure and skills are needed.

Objectives of this work: make access to web-scale corpora a commodity, i.e.

  1 easy to use: no download is needed; access via API or web interface;
  2 linguistically preprocessed;
  3 original texts are available.


Related Work

Large-scale text collections

                 WaCkypedia  Wikipedia  PukWaC  GigaWord  ENCOW16  ClueWeb12  Syn.Ngrams
Tokens, 10^9     0.80        2.90       1.91    1.76      16.82    N/A        345.00
Documents, 10^6  1.10        5.47       5.69    4.11      9.22     733.02     3.50
Type             Encyclop.   Encyclop.  Web     News      Web      Web        Books
Source texts     Yes         Yes        Yes     Yes       Yes      Yes        No
Preprocessing    Yes         No         Yes     No        Yes      No         No
NER              No          No         No      No        Yes      No         No
Dep. parsed      Yes         No         Yes     No        Yes      No         Yes

Common Crawl as a corpus

  • [Laippala & Ginter, 2014]: used Common Crawl to construct a Finnish Parsebank (1.5 billion tokens, 116 million sentences);
  • [Pennington et al., 2014]: GloVe embeddings trained on English Common Crawl, 42 and 840 billion tokens (tokenization only, no source texts);
  • [Grave et al., 2018]: fastText embeddings trained on Common Crawl for 157 languages (tokenization only, no source texts).


Building a Web-Scale Corpus

Corpus construction approach

The DepCC (Dependency-Parsed Corpus from Common Crawl) pipeline:

  1 Crawling web pages: CCBot (Apache Nutch) gathers WARC web crawls from the Web (§3.1);
  2 Preprocessing: C4Corpus (Apache Hadoop) yields filtered, preprocessed documents (§3.2);
  3 Linguistic analysis: lefex (Apache Hadoop) (§3.3):
    • POS tagging (OpenNLP);
    • lemmatization (Stanford);
    • named entity recognition (Stanford);
    • dependency parsing (Malt + collapsing);
  4 Computation of a distributional model: JoBimText (Apache Spark) yields term vectors and a distributional thesaurus (§5.2).

Preprocessing of texts

C4Corpus tool [Habernal et al., 2016]: s3://commoncrawl/contrib/c4corpus/CC-MAIN-2016-07

  1 Language detection, license detection, and removal of boilerplate page elements, such as menus;
  2 “exact match” document de-duplication;
  3 removal of near-duplicate documents.
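The de-duplication steps (2 and 3) can be sketched as follows. This is an illustrative toy version, not the C4Corpus implementation: C4Corpus runs on Hadoop and uses scalable near-duplicate detection, whereas this sketch compares documents pairwise by word-shingle Jaccard similarity, and the `jaccard_threshold` value is an arbitrary assumption.

```python
import hashlib

def shingles(text, n=5):
    """Word n-grams ('shingles') used to compare documents for near-duplication."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}

def deduplicate(docs, jaccard_threshold=0.8):
    """Keep one copy per exact duplicate, then drop near-duplicates of kept docs."""
    seen_hashes = set()     # stage 1: exact-match de-duplication via content hashing
    kept, kept_shingles = [], []
    for doc in docs:
        h = hashlib.sha1(doc.encode("utf-8")).hexdigest()
        if h in seen_hashes:
            continue
        seen_hashes.add(h)
        # stage 2: near-duplicate removal via shingle overlap (Jaccard similarity)
        s = shingles(doc)
        if any(len(s & t) / len(s | t) >= jaccard_threshold for t in kept_shingles):
            continue
        kept.append(doc)
        kept_shingles.append(s)
    return kept
```

Pairwise comparison is quadratic in the number of kept documents; at web scale one would use a sketching scheme such as SimHash or MinHash instead.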

Stages of development of the corpus

Corpus sizes, based on the Common Crawl 2016-07 web crawl dump:

Stage of the processing                       Size (.gz)
Input raw web crawl (HTML, WARC)              29,539.4 GB
Preprocessed corpus (simple HTML)             832.0 GB
Preprocessed corpus, English (simple HTML)    683.4 GB
Dependency-parsed English corpus (CoNLL)      2,624.6 GB
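A few ratios follow directly from this table (a back-of-the-envelope calculation from the table's numbers, not figures stated in the paper):

```python
raw_warc = 29_539.4  # GB, gzip-compressed input web crawl (HTML, WARC)
all_html = 832.0     # GB, preprocessed corpus (simple HTML)
en_html  = 683.4     # GB, preprocessed English corpus
en_conll = 2_624.6   # GB, dependency-parsed English corpus (CoNLL)

english_share = en_html / all_html       # fraction of preprocessed corpus that is English
annotation_blowup = en_conll / en_html   # size of CoNLL annotations vs. plain text
filtering_ratio = all_html / raw_warc    # how much preprocessing shrinks the raw crawl

print(f"English share: {english_share:.0%}")               # ~82%
print(f"CoNLL expansion: {annotation_blowup:.1f}x")        # ~3.8x
print(f"Kept after preprocessing: {filtering_ratio:.1%}")  # ~2.8%
```

So boilerplate removal and de-duplication discard about 97% of the compressed input, while the added linguistic annotations inflate the kept English text nearly fourfold.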

Linguistic analysis of texts

  1 POS tagging and lemmatization: OpenNLP POS tagger; Stanford lemmatizer.
  2 Named entity recognition: Stanford NER [Finkel et al., 2005]; 7.48 billion occurrences of entities in 251.92 billion tokens.
  3 Dependency parsing: Malt parser [Hall et al., 2010], as also used in PukWaC [Baroni et al., 2009] and ENCOW16 [Schäfer, 2015]; parses 1 MB of text per core in 1–4 min.; dependency collapsing with [Ruppert et al., 2015].

A sample document: CoNLL format

ID  FORM         LEMMA      UPOSTAG  XPOSTAG  FEATS  HEAD  DEPREL     DEPS         NER
# newdoc url = http://www.poweredbyosteons.org/2012/01/brief-history-of-bioarchaeological.html
# newdoc s3 = s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2016-07/segments...
…
# sent_id = http://www.poweredbyosteons.org/2012/01/brief-history-of-bioarchaeological.html#60
# text = The American Museum of Natural History was established in New York in 1869.
0   The          the        DT       DT       _      2     det        2:det        O
1   American     American   NNP      NNP      _      2     nn         2:nn         B-Org
2   Museum       Museum     NNP      NNP      _      7     nsubjpass  7:nsubjpass  I-Org
3   of           of         IN       IN       _      2     prep       _            I-Org
4   Natural      Natural    NNP      NNP      _      5     nn         5:nn         I-Org
5   History      History    NNP      NNP      _      3     pobj       2:prep_of    I-Org
6   was          be         VBD      VBD      _      7     auxpass    7:auxpass    O
7   established  establish  VBN      VBN      _      7     ROOT       7:ROOT       O
8   in           in         IN       IN       _      7     prep       _            O
9   New          New        NNP      NNP      _      10    nn         10:nn        B-Loc
10  York         York       NNP      NNP      _      8     pobj       7:prep_in    I-Loc
11  in           in         IN       IN       _      7     prep       _            O
12  1869         1869       CD       CD       _      11    pobj       7:prep_in    O
13  .            .          .        .        _      7     punct      7:punct      O
…
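A minimal reader for such sentence blocks might look as follows. It is a sketch, not the official DepCC tooling, and it assumes tab-separated columns named as in the header row above:

```python
# Column names follow the header of the sample document above.
FIELDS = ["id", "form", "lemma", "upostag", "xpostag",
          "feats", "head", "deprel", "deps", "ner"]

def parse_conll_sentence(block):
    """Parse one sentence block into a list of token dicts, skipping '#' comments."""
    tokens = []
    for line in block.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        tok = dict(zip(FIELDS, line.split("\t")))
        tok["id"], tok["head"] = int(tok["id"]), int(tok["head"])
        tokens.append(tok)
    return tokens

sample = (
    "# text = The American Museum of Natural History was established in New York in 1869.\n"
    "0\tThe\tthe\tDT\tDT\t_\t2\tdet\t2:det\tO\n"
    "2\tMuseum\tMuseum\tNNP\tNNP\t_\t7\tnsubjpass\t7:nsubjpass\tI-Org\n"
)
toks = parse_conll_sentence(sample)
print(toks[1]["deprel"])  # nsubjpass
```

Note that token IDs in this corpus start at 0, so `head` values index tokens directly.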

Technical details

Computational settings: Apache Hadoop 2.6 cluster with 16 nodes; 2.75 TB of RAM used; 356 cores used (Intel Xeon E5-2603v4 @ 1.70 GHz).

Running time: 110 hours in total (0.84 MB/min/core); 19,101 tasks, each over 100 MB of input data:

  • min. time per task: 38 min.;
  • median time per task: 1 hour 10 min.;
  • max. time per task: 9 hours 4 min.
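The throughput figure can be cross-checked with a back-of-the-envelope calculation. Assuming all 356 cores were busy for the full 110 hours (in practice straggler tasks leave some cores idle, so this is a lower bound), the result comes out close to the reported 0.84 MB/min/core:

```python
tasks = 19_101
mb_per_task = 100   # MB of input per task
hours = 110         # total wall-clock time
cores = 356

total_mb = tasks * mb_per_task
core_minutes = hours * 60 * cores
throughput = total_mb / core_minutes  # MB per minute per core, lower bound
print(f"{throughput:.2f} MB/min/core")  # 0.81 MB/min/core
```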

Using the corpus on AWS

  • available on the Amazon S3 file system (us-east-1 region);
  • no need to download when using EC2 or EMR;
  • free traffic inside a region.

Indexing the CoNLL files

The indexed corpus is searchable at http://ltdemos.informatik.uni-hamburg.de/depcc (login/password: “reader”).
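Querying such an index over HTTP could look roughly like this. Only the demo URL and the “reader” credentials come from the slides; the Elasticsearch-style backend, the endpoint path, and the field name `text` are assumptions for illustration, so consult the actual API before use:

```python
import json

def build_match_query(phrase, size=10):
    """Build an Elasticsearch-style full-text query body for a sentence index.
    The field name 'text' is a hypothetical mapping, not documented by the slides."""
    return {
        "size": size,
        "query": {"match_phrase": {"text": phrase}},
    }

body = build_match_query("established in New York")
print(json.dumps(body, indent=2))

# To actually run the search (requires network access and the 'requests' package):
# import requests
# resp = requests.post(
#     "http://ltdemos.informatik.uni-hamburg.de/depcc-index/_search",  # hypothetical path
#     json=body, auth=("reader", "reader"))
# hits = resp.json()["hits"]["hits"]
```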


Results

Comparison with other large corpora

                 WaCkypedia  Wikipedia  PukWaC  GigaWord  ENCOW16  ClueWeb12  Syn.Ngrams  DepCC
Tokens, 10^9     0.80        2.90       1.91    1.76      16.82    N/A        345.00      251.92
Documents, 10^6  1.10        5.47       5.69    4.11      9.22     733.02     3.50        364.80
Type             Encyclop.   Encyclop.  Web     News      Web      Web        Books       Web
Source texts     Yes         Yes        Yes     Yes       Yes      Yes        No          Yes
Preprocessing    Yes         No         Yes     No        Yes      No         No          Yes
NER              No          No         No      No        Yes      No         No          Yes
Dep. parsed      Yes         No         Yes     No        Yes      No         Yes         Yes

A model for verb similarity

  • syntactic count-based distributional model [Biemann & Riedl, 2013];
  • weighted using the LMI weighting schema;
  • vectors are converted to unit length;
  • pruning of features:
    • max. number of features per word (fpw);
    • max. number of words per feature (wpf).
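The steps above can be sketched in a toy implementation, assuming the standard LMI formula LMI(w, c) = f(w, c) * log2(f(w, c) * N / (f(w) * f(c))); the wpf pruning is omitted for brevity, and this is not the JoBimText code used in the paper:

```python
import math
from collections import Counter, defaultdict

def lmi_model(pairs, fpw=2000):
    """Sparse word-by-feature model from (word, feature) co-occurrence pairs:
    LMI weighting, top-fpw feature pruning per word, L2 normalization."""
    pair_freq = Counter(pairs)
    word_freq = Counter(w for w, _ in pairs)
    feat_freq = Counter(c for _, c in pairs)
    n = len(pairs)

    model = defaultdict(dict)
    for (w, c), f_wc in pair_freq.items():
        lmi = f_wc * math.log2(f_wc * n / (word_freq[w] * feat_freq[c]))
        if lmi > 0:           # keep only positively associated features
            model[w][c] = lmi

    for w, vec in model.items():
        top = dict(sorted(vec.items(), key=lambda kv: -kv[1])[:fpw])  # fpw pruning
        norm = math.sqrt(sum(v * v for v in top.values()))
        model[w] = {c: v / norm for c, v in top.items()}              # unit length
    return model

# Tiny invented example: features are collapsed dependency edges.
pairs = ([("run", "nsubj:dog")] * 3 + [("run", "dobj:race")] * 2
         + [("eat", "dobj:apple")] * 4 + [("eat", "nsubj:dog")])
model = lmi_model(pairs)
```

With unit-length vectors, the similarity of two words is simply the dot product of their sparse feature maps.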

Evaluation on the verb similarity task

Model                                                         SimVerb3500  SimVerb3000  SimVerb500  SimLex222
Wikipedia+ukWaC+BNC: Count SVD 500-dim [Baroni et al., 2014]  0.196        0.186        0.259       0.200
PolyglotWikipedia: SGNS BOW 300-dim [Gerz et al., 2016]       0.274        0.333        0.265       0.328
8B: SGNS BOW 500-dim [Gerz et al., 2016]                      0.348        0.350        0.378       0.307
8B: SGNS DEPS 500-dim [Gerz et al., 2016]                     0.356        0.351        0.389       0.385
PolyglotWikipedia: SGNS DEPS 300-dim [Gerz et al., 2016]      0.313        0.304        0.401       0.390
Wikipedia: LMI DEPS wpf-1000 fpw-2000                         0.283        0.284        0.271       0.268
Wikipedia+ukWaC+GigaWord: LMI DEPS wpf-1000 fpw-2000          0.376        0.368        0.419       0.183
DepCC: LMI DEPS wpf-1000 fpw-1000                             0.400        0.387        0.477       0.285
DepCC: LMI DEPS wpf-1000 fpw-2000                             0.404        0.392        0.477       0.292
DepCC: LMI DEPS wpf-2000 fpw-2000                             0.399        0.388        0.459       0.268
DepCC: LMI DEPS wpf-5000 fpw-5000                             0.382        0.372        0.442       0.226

Summary

We presented DepCC, a web-scale, linguistically analyzed corpus of English:

  • 365 million documents;
  • 252 billion tokens;
  • 7.5 billion named entity occurrences;
  • 14.3 billion sentences;
  • 2.4 TB, available on Amazon S3.

Access to an index of the sentences and dependency trees is available on request:

  • RESTful HTTP API;
  • web-based search box using Kibana.

State-of-the-art results on the verb similarity task.

Thank you! Questions?

https://commoncrawl.s3.amazonaws.com/contrib/depcc/CC-MAIN-2016-07


References

Banko, M., Cafarella, M. J., Soderland, S., Broadhead, M., & Etzioni, O. (2007). Open information extraction from the web. In Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI’07), volume 7 (pp. 2670–2676). Hyderabad, India: AAAI Press.

Baroni, M., Bernardini, S., Ferraresi, A., & Zanchetta, E. (2009). The WaCky wide web: A collection of very large linguistically processed web-crawled corpora. Language Resources and Evaluation, 43(3), 209–226.

Baroni, M., Dinu, G., & Kruszewski, G. (2014). Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 238–247). Baltimore, MD, USA: Association for Computational Linguistics.

Biemann, C. & Riedl, M. (2013). Text: Now in 2D! A framework for lexical expansion with contextual similarity. Journal of Language Modelling, 1(1), 55–95.

Finkel, J. R., Grenager, T., & Manning, C. (2005). Incorporating non-local information into information extraction systems by Gibbs sampling. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (pp. 363–370). Association for Computational Linguistics.

Gerz, D., Vulić, I., Hill, F., Reichart, R., & Korhonen, A. (2016). SimVerb-3500: A large-scale evaluation set of verb similarity. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (pp. 2173–2182). Austin, TX, USA: Association for Computational Linguistics.

Grave, E., Bojanowski, P., Gupta, P., Joulin, A., & Mikolov, T. (2018). Learning word vectors for 157 languages. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018).

Habernal, I., Zayed, O., & Gurevych, I. (2016). C4Corpus: Multilingual web-size corpus with free license. In Proceedings of the Language Resources and Evaluation Conference (LREC 2016) (pp. 914–922). Portorož, Slovenia: ELRA.

Halevy, A., Norvig, P., & Pereira, F. (2009). The unreasonable effectiveness of data. IEEE Intelligent Systems, 24(2), 8–12.

Hall, J., Nilsson, J., & Nivre, J. (2010). Single malt or blended? A study in multilingual parser optimization. In Trends in Parsing Technology (pp. 19–33). Springer.

Laippala, V. & Ginter, F. (2014). Syntactic n-gram collection from a large-scale corpus of internet Finnish. In Human Language Technologies: The Baltic Perspective, Proceedings of the Sixth International Conference Baltic HLT 2014, volume 268 (pp. 184). Kaunas, Lithuania: IOS Press.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26 (pp. 3111–3119). Harrahs and Harveys, NV, USA: Curran Associates, Inc.

Pennington, J., Socher, R., & Manning, C. D. (2014). GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 1532–1543). Doha, Qatar: Association for Computational Linguistics.

Ruppert, E., Klesy, J., Riedl, M., & Biemann, C. (2015). Rule-based dependency parse collapsing and propagation for German and English. In Proceedings of GSCL (pp. 58–66). Essen, Germany.

Schäfer, R. (2015). Processing and querying large web corpora with the COW14 architecture. In P. Bański, H. Biber, E. Breiteneder, M. Kupietz, H. Lüngen, & A. Witt (Eds.), Proceedings of Challenges in the Management of Large Corpora 3 (CMLC-3) (pp. 28–34). Lancaster: UCREL & IDS.