Alexander Panchenko, Eugen Ruppert, Stefano Faralli, Simone Paolo Ponzetto, and Chris Biemann
Building a Web-Scale Dependency-Parsed Corpus from Common Crawl
May 10, 2018 Building a Web-Scale Dependency-Parsed Corpus from Common Crawl, Panchenko et al. LREC’18 2/24
Why are large corpora essential for NLP?
Unsupervised methods, pre-training, and more: word embeddings [Mikolov et al., 2013]; the “unreasonable effectiveness of big data” [Halevy et al., 2009].
Image source: https://goo.gl/egF322
Introduction
Some popular datasets used in NLP research:
BNC: 0.1 billion tokens; ukWaC: 2 billion tokens; Wikipedia: 3 billion tokens.
Web-scale datasets:
ClueWeb12: 0.7 billion documents; CommonCrawl 2017: 3 billion documents; The indexed Web: 5 billion documents; The Web: 50 billion documents.
Introduction
Difficulties in using Common Crawl directly:
documents are not linguistically analyzed; big-data infrastructure and skills are needed.
Objectives of this work: make access to web-scale corpora a commodity:
1 easy to use: no download needed; access via API or web interface;
2 linguistically preprocessed;
3 original texts are available.
Introduction
                 WaCkypedia  Wikipedia  PukWaC  GigaWord  ENCOW16  ClueWeb12  Syn.Ngrams
Tokens, 10^9           0.80       2.90    1.91      1.76    16.82        N/A      345.00
Documents, 10^6        1.10       5.47    5.69      4.11     9.22     733.02        3.50
Type               Encyclop.  Encyclop.    Web      News      Web        Web       Books
Source texts            Yes        Yes     Yes       Yes      Yes        Yes          No
Preprocessing           Yes         No     Yes        No      Yes         No          No
NER                      No         No      No        No      Yes         No          No
Dep. parsed             Yes         No     Yes        No      Yes         No         Yes
Related Work
[Laippala & Ginter, 2014]: used Common Crawl to construct a Finnish Parsebank (1.5 billion tokens, 116 million sentences);
[Pennington et al., 2014]: GloVe embeddings trained on English Common Crawl: 42 and 840 billion tokens (tokenization, no source texts);
[Grave et al., 2018]: fastText embeddings trained on Common Crawl for 157 languages (tokenization, no source texts).
Related Work
[Pipeline diagram] DepCC construction pipeline:
§3.1 Crawling web pages: CCBot (Apache Nutch), producing WARC web crawls;
§3.2 Preprocessing: C4Corpus (Apache Hadoop), producing filtered, preprocessed documents;
§3.3 Linguistic analysis: lefex (Apache Hadoop): POS tagging (OpenNLP), lemmatization (Stanford), named entity recognition (Stanford);
§5.2 JoBimText (Apache Spark), producing term vectors and a distributional thesaurus.
Building a Web-Scale Corpus
C4Corpus tool [Habernal et al., 2016]: s3://commoncrawl/contrib/c4corpus/CC-MAIN-2016-07
1 language detection, license detection, and removal of boilerplate page elements, such as menus;
2 “exact match” document de-duplication;
3 removal of near-duplicate documents.
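The two de-duplication steps can be sketched in a few lines. This is an illustrative approach only (hash-based exact matching plus word n-gram overlap as a near-duplicate signal), not the actual C4Corpus implementation, which has its own normalization and near-duplicate detection logic.

```python
import hashlib

def normalize(text):
    # collapse whitespace and lowercase before comparing documents
    return " ".join(text.lower().split())

def exact_dedup(docs):
    # "exact match" de-duplication: keep the first document per content hash
    seen, unique = set(), []
    for doc in docs:
        h = hashlib.sha1(normalize(doc).encode("utf-8")).hexdigest()
        if h not in seen:
            seen.add(h)
            unique.append(doc)
    return unique

def jaccard_shingles(a, b, n=3):
    # word n-gram shingle similarity: a simple near-duplicate signal
    def shingles(t):
        words = t.split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}
    sa, sb = shingles(normalize(a)), shingles(normalize(b))
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0
```

Documents whose shingle similarity exceeds a chosen threshold would then be treated as near-duplicates and dropped.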
Building a Web-Scale Corpus
Corpus sizes, based on the Common Crawl 2016-07 web crawl dump:

Stage of the processing                         Size (.gz)
Input raw web crawl (HTML, WARC)               29,539.4 GB
Preprocessed corpus (simple HTML)                 832.0 GB
Preprocessed corpus, English (simple HTML)        683.4 GB
Dependency-parsed English corpus (CoNLL)        2,624.6 GB
Building a Web-Scale Corpus
1 POS tagging and lemmatization: OpenNLP POS tagger; Stanford lemmatizer.
2 Named entity recognition: Stanford NER [Finkel et al., 2005]; 7.48 billion occurrences of entities (251.92 billion tokens).
3 Dependency parsing: Malt parser [Hall et al., 2010]; parses 1 MB of text per core in 1–4 min; also used in PukWaC [Baroni et al., 2009] and ENCOW16 [Schäfer, 2015]; dependency collapsing with [Ruppert et al., 2015].
Building a Web-Scale Corpus
Example of the output format (columns: ID FORM LEMMA UPOSTAG XPOSTAG FEATS HEAD DEPREL DEPS NER; token IDs are zero-based):

# newdoc url = http://www.poweredbyosteons.org/2012/01/brief-history-of-bioarchaeological.html
# newdoc s3 = s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2016-07/segments...
…
# sent_id = http://www.poweredbyosteons.org/2012/01/brief-history-of-bioarchaeological.html#60
# text = The American Museum of Natural History was established in New York in 1869.
0   The          the         DT   DT   _   2    det        2:det        O
1   American     American    NNP  NNP  _   2    nn         2:nn         B-Org
2   Museum       Museum      NNP  NNP  _   7    nsubjpass  7:nsubjpass  I-Org
3   of           of          IN   IN   _   2    prep       _            I-Org
4   Natural      Natural     NNP  NNP  _   5    nn         5:nn         I-Org
5   History      History     NNP  NNP  _   3    pobj       2:prep_of    I-Org
6   was          be          VBD  VBD  _   7    auxpass    7:auxpass    O
7   established  establish   VBN  VBN  _   7    ROOT       7:ROOT       O
8   in           in          IN   IN   _   7    prep       _            O
9   New          New         NNP  NNP  _   10   nn         10:nn        B-Loc
10  York         York        NNP  NNP  _   8    pobj       7:prep_in    I-Loc
11  in           in          IN   IN   _   7    prep       _            O
12  1869         1869        CD   CD   _   11   pobj       7:prep_in    O
13  .            .           .    .    _   7    punct      7:punct      O
…
Building a Web-Scale Corpus
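Token rows in this layout can be read with a few lines of Python. This is a minimal sketch assuming tab-separated columns in the order shown on the slide, not the project's official reader; the field names below simply mirror the column headers.

```python
# Column layout as shown on the slide:
# ID FORM LEMMA UPOSTAG XPOSTAG FEATS HEAD DEPREL DEPS NER
FIELDS = ["id", "form", "lemma", "upostag", "xpostag",
          "feats", "head", "deprel", "deps", "ner"]

def parse_conll(block):
    """Yield one dict per token line, skipping comment and blank lines."""
    for line in block.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        yield dict(zip(FIELDS, line.split("\t")))

# two rows from the example sentence, tab-separated
conll = ("6\twas\tbe\tVBD\tVBD\t_\t7\tauxpass\t7:auxpass\tO\n"
         "7\testablished\testablish\tVBN\tVBN\t_\t7\tROOT\t7:ROOT\tO")

tokens = list(parse_conll(conll))
```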
Computational Settings
Apache Hadoop 2.6 cluster with 16 nodes; 2.75 TB of RAM used; 356 cores used, Intel Xeon E5-2603v4@1.70GHz.
Running Time
Total time: 110 hours (0.84 MB/min/core); 19,101 tasks, each over 100 MB of input data:
Median time per task: 1 hour 10 min.
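The reported figures are roughly mutually consistent, which a back-of-envelope check makes visible; the small gap to the quoted 0.84 MB/min/core is presumably due to rounding in the reported totals.

```python
# Consistency check of the throughput figures on this slide:
# 19,101 tasks x 100 MB input, 356 cores, 110 hours total.
tasks, mb_per_task = 19101, 100
cores, hours = 356, 110

total_mb = tasks * mb_per_task        # total parsed input, ~1.91 TB
core_minutes = cores * hours * 60     # total core-minutes spent
rate = total_mb / core_minutes        # MB per minute per core, ~0.81
```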
Building a Web-Scale Corpus
Available on the Amazon S3 file system (us-east-1 region); no download needed when working from EC2 or EMR; free traffic inside a region.
Building a Web-Scale Corpus
http://ltdemos.informatik.uni-hamburg.de/depcc (login/password: “reader”)
Building a Web-Scale Corpus
                 WaCkypedia  Wikipedia  PukWaC  GigaWord  ENCOW16  ClueWeb12  Syn.Ngrams   DepCC
Tokens, 10^9           0.80       2.90    1.91      1.76    16.82        N/A      345.00  251.92
Documents, 10^6        1.10       5.47    5.69      4.11     9.22     733.02        3.50  364.80
Type               Encyclop.  Encyclop.    Web      News      Web        Web       Books     Web
Source texts            Yes        Yes     Yes       Yes      Yes        Yes          No     Yes
Preprocessing           Yes         No     Yes        No      Yes         No          No     Yes
NER                      No         No      No        No      Yes         No          No     Yes
Dep. parsed             Yes         No     Yes        No      Yes         No         Yes     Yes
Results
Syntactic count-based distributional model [Biemann & Riedl, 2013]; weighted using the LMI weighting schema; vectors are converted to unit length; pruning of features (wpf-N: words per feature; fpw-N: features per word).
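The weighting and normalization steps can be illustrated on a toy example. This is a sketch using the standard LMI definition, LMI(w, f) = n(w, f) * log2(n(w, f) * N / (n(w) * n(f))), over a handful of made-up (word, dependency-feature) pairs; the corpus-scale computation is done with JoBimText on Spark, not like this.

```python
import math
from collections import Counter

# toy (word, dependency-feature) observations; these pairs are invented
pairs = [("eat", "obj:apple"), ("eat", "obj:apple"), ("eat", "obj:idea"),
         ("buy", "obj:apple"), ("buy", "obj:car"), ("buy", "obj:car")]

n_wf = Counter(pairs)                 # joint counts n(w, f)
n_w = Counter(w for w, _ in pairs)    # marginal word counts n(w)
n_f = Counter(f for _, f in pairs)    # marginal feature counts n(f)
N = len(pairs)                        # total number of observations

def lmi(w, f):
    # Lexicographer's Mutual Information of a (word, feature) pair
    return n_wf[(w, f)] * math.log2(n_wf[(w, f)] * N / (n_w[w] * n_f[f]))

def unit_vector(w):
    # LMI-weighted feature vector of w, scaled to unit Euclidean length
    vec = {f: lmi(w, f) for f in n_f if (w, f) in n_wf}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {f: v / norm for f, v in vec.items()} if norm else vec

v = unit_vector("eat")
```

Word similarities are then plain dot products between such unit vectors.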
Results
Model                                                          SimVerb3500  SimVerb3000  SimVerb500  SimLex222
Wikipedia+ukWaC+BNC: Count SVD 500-dim [Baroni et al., 2014]         0.196        0.186       0.259      0.200
PolyglotWikipedia: SGNS BOW 300-dim [Gerz et al., 2016]              0.274        0.333       0.265      0.328
8B: SGNS BOW 500-dim [Gerz et al., 2016]                             0.348        0.350       0.378      0.307
8B: SGNS DEPS 500-dim [Gerz et al., 2016]                            0.356        0.351       0.389      0.385
PolyglotWikipedia: SGNS DEPS 300-dim [Gerz et al., 2016]             0.313        0.304       0.401      0.390
Wikipedia: LMI DEPS wpf-1000 fpw-2000                                0.283        0.284       0.271      0.268
Wikipedia+ukWaC+GigaWord: LMI DEPS wpf-1000 fpw-2000                 0.376        0.368       0.419      0.183
DepCC: LMI DEPS wpf-1000 fpw-1000                                    0.400        0.387       0.477      0.285
DepCC: LMI DEPS wpf-1000 fpw-2000                                    0.404        0.392       0.477      0.292
DepCC: LMI DEPS wpf-2000 fpw-2000                                    0.399        0.388       0.459      0.268
DepCC: LMI DEPS wpf-5000 fpw-5000                                    0.382        0.372       0.442      0.226
Results
We presented DepCC, a web-scale linguistically analyzed corpus of English:
365 million documents; 252 billion tokens; 7.5 billion named entity occurrences; 14.3 billion sentences; 2.4 TB available on Amazon S3.
Access to an index of the sentences and dependency trees is available on request.
RESTful HTTP API; web-based search box using Kibana.
State-of-the-art results on the verb similarity task.
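The Kibana front end mentioned above suggests an Elasticsearch backend behind the RESTful HTTP API. As an illustration only, the sketch below builds a hypothetical phrase query body; the index name, field names, and endpoint are assumptions, not the project's documented schema, and actually sending the query would be a plain HTTP POST to the search endpoint.

```python
import json

def build_query(phrase, field="text", size=10):
    """Build a hypothetical Elasticsearch match_phrase query body.

    The field name 'text' is an assumption about the index schema.
    """
    return {
        "size": size,
        "query": {"match_phrase": {field: phrase}},
    }

# serialize the body as it would be sent in an HTTP request
body = json.dumps(build_query("established in New York"))
```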
Results
https://commoncrawl.s3.amazonaws.com/contrib/depcc/CC-MAIN-2016-07
Results
Banko, M., Cafarella, M. J., Soderland, S., Broadhead, M., & Etzioni, O. (2007). Open information extraction from the web. In Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI'07), volume 7 (pp. 2670–2676). Hyderabad, India: AAAI Press.

Baroni, M., Bernardini, S., Ferraresi, A., & Zanchetta, E. (2009). The WaCky wide web: a collection of very large linguistically processed web-crawled corpora. Language Resources and Evaluation, 43(3), 209–226.

Baroni, M., Dinu, G., & Kruszewski, G. (2014). Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 238–247). Baltimore, MD, USA: Association for Computational Linguistics.

Biemann, C. & Riedl, M. (2013).
Text: now in 2D! A framework for lexical expansion with contextual similarity. Journal of Language Modelling, 1(1), 55–95.

Finkel, J. R., Grenager, T., & Manning, C. (2005). Incorporating non-local information into information extraction systems by Gibbs sampling. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (pp. 363–370). Association for Computational Linguistics.

Gerz, D., Vulić, I., Hill, F., Reichart, R., & Korhonen, A. (2016). SimVerb-3500: A large-scale evaluation set of verb similarity. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (pp. 2173–2182). Austin, TX, USA: Association for Computational Linguistics.

Grave, E., Bojanowski, P., Gupta, P., Joulin, A., & Mikolov, T. (2018). Learning word vectors for 157 languages.
In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018).

Habernal, I., Zayed, O., & Gurevych, I. (2016). C4Corpus: Multilingual web-size corpus with free license. In Proceedings of the Language Resources and Evaluation Conference (LREC 2016) (pp. 914–922). Portorož, Slovenia: ELRA.

Halevy, A., Norvig, P., & Pereira, F. (2009). The unreasonable effectiveness of data. IEEE Intelligent Systems, 24(2), 8–12.

Hall, J., Nilsson, J., & Nivre, J. (2010). Single malt or blended? A study in multilingual parser optimization. In Trends in Parsing Technology (pp. 19–33). Springer.

Laippala, V. & Ginter, F. (2014). Syntactic N-gram collection from a large-scale corpus of internet Finnish.
In Human Language Technologies – The Baltic Perspective: Proceedings of the Sixth International Conference Baltic HLT 2014, volume 268 (p. 184). Kaunas, Lithuania: IOS Press.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26 (pp. 3111–3119). Harrahs and Harveys, NV, USA: Curran Associates, Inc.

Pennington, J., Socher, R., & Manning, C. D. (2014). GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 1532–1543). Doha, Qatar: Association for Computational Linguistics.

Ruppert, E., Klesy, J., Riedl, M., & Biemann, C. (2015). Rule-based dependency parse collapsing and propagation for German and English.
In Proceedings of GSCL (pp. 58–66). Essen, Germany.

Schäfer, R. (2015). Processing and querying large web corpora with the COW14 architecture. In P. Bański, H. Biber, E. Breiteneder, M. Kupietz, H. Lüngen, & A. Witt (Eds.), Proceedings of Challenges in the Management of Large Corpora 3 (CMLC-3) (pp. 28–34). Lancaster: UCREL/IDS.