C2NLU: An Overview

Heike Adel, CIS, LMU Munich
Dagstuhl, January 23, 2017

Contents

W e l c o m e _ t o _ m y _ t a l k   (a character-level "NLU greeting")
◮ Why do we want character-based models?
◮ Which character-based models/research exist?
◮ Which challenges/open questions need to be considered?
Introduction

Motivation
◮ Tokens = manually designed features for NLU models
◮ Models can directly access the data
◮ Models learn their own representation (“features”) of the data
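The contrast between the two bullets above can be made concrete. The following is a minimal sketch of mine (not from the slides): the token view hands the model pre-segmented units, while the character view hands it raw symbols; the alphabet and ID scheme are illustrative choices.

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyz_"   # toy symbol set (my choice)

def token_features(sentence):
    # Token view: the model receives pre-segmented units ("features").
    return sentence.split()

def char_ids(sentence):
    # Character view: the model receives raw symbols and must learn
    # its own representation of the data from them.
    lookup = {c: i for i, c in enumerate(ALPHABET)}
    return [lookup[c] for c in sentence.lower().replace(" ", "_")]

print(token_features("Welcome to my talk"))   # ['Welcome', 'to', 'my', 'talk']
print(char_ids("to my"))                      # [19, 14, 26, 12, 24]
```

Everything a character-based model knows about word identity must be inferred from such ID sequences, which is exactly why it can also handle inputs no tokenizer anticipated.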
◮ English: some difficult cases
◮ Chinese: tokens are not separated by spaces
◮ German: compounds
◮ Turkish: agglutinative language
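A quick illustration of mine (not the slides') of where plain whitespace tokenization breaks down for the languages listed above; the example sentences are my own.

```python
# Plain whitespace tokenization, the baseline these examples break.
def whitespace_tokenize(text):
    return text.split()

# Chinese: no spaces between tokens, so the whole clause is one "token".
print(whitespace_tokenize("我喜欢苹果"))       # ['我喜欢苹果']

# German: the compound stays unsplit, hiding its parts
# ("Apfelbaum" = "Apfel" + "Baum").
print(whitespace_tokenize("der Apfelbaum"))    # ['der', 'Apfelbaum']

# A character-level model sees every symbol directly and can learn
# any segmentation it needs by itself.
print(list("我喜欢苹果"))                      # ['我', '喜', '欢', '苹', '果']
```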
◮ Inflections
◮ Derivations
◮ Wide range of morphological processes (vowel harmony, …)
◮ Example: morphology
◮ Possible application: names/transliterations in end-to-end
Early Work
◮ Information retrieval with character n-grams
◮ Grapheme-to-phoneme conversion
◮ Char align: bilingual character-level alignments [Church 1993]
◮ Prefix and suffix features for tagging rare words
◮ Transliteration
◮ Diacritics restoration [Mihalcea et al. 2002]
◮ POS induction (unsupervised, multilingual) [Clark 2003]
◮ Characters and character n-grams as features for NER
◮ Language identification [Alex 2005]
◮ Morpheme-level features for language models
◮ Language-independent character n-gram language models for …
◮ Hybrid word/subword n-gram language model for OOV words
◮ Characters and character n-grams as input to Restricted Boltzmann Machines
◮ Machine translation based on characters/character n-grams
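The recurring ingredient in this early work is the character n-gram. A small sketch of the generic technique (my code, with a common padding convention, not any specific paper's implementation):

```python
def char_ngrams(word, n=3, pad="#"):
    # Pad so that prefixes and suffixes get their own n-grams --
    # the same kind of features used above for tagging rare words,
    # NER, language identification, and IR.
    padded = pad * (n - 1) + word + pad * (n - 1)
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("cat"))   # ['##c', '#ca', 'cat', 'at#', 't##']
```

Such features need no linguistic resources, which is what made them attractive for language-independent and unsupervised settings.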
Categorization
◮ Tokenization-based models
◮ Bag-of-n-gram models
◮ End-to-end models
◮ e.g., tokenization-based bag-of-n-gram models
◮ e.g., bag-of-n-gram or tokenization-based models trained …
Tokenization-based Models
[Figure: POS tagging with word embeddings combined with character-based embeddings; the misspelling "tabel" still maps near "table"; example sentence "the red … is in the kitchen" tagged DET JJ NN V DET NN IN]
[Figure: the same architecture, with max pooling over the character representations of each word]
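A toy sketch of my own simplification of the idea in the figures: build a word representation from its characters by element-wise max pooling, so a misspelling like "tabel" still lands near "table". The per-character embeddings here are deterministic random vectors, a stand-in for learned parameters.

```python
import random

DIM = 4  # toy embedding size (my choice)

def char_vec(c):
    # Hypothetical fixed embedding per character: a small random vector
    # that is deterministic in the character.
    r = random.Random(ord(c))
    return [r.uniform(-1, 1) for _ in range(DIM)]

def char_based_embedding(word):
    # Element-wise max over the word's character vectors
    # (the max-pooling step in the figure).
    vecs = [char_vec(c) for c in word]
    return [max(col) for col in zip(*vecs)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = lambda x: sum(a * a for a in x) ** 0.5
    return dot / (norm(u) * norm(v))

# "tabel" contains exactly the characters of "table", so the pooled
# vectors coincide and the misspelling gets the same representation.
print(cosine(char_based_embedding("table"), char_based_embedding("tabel")))
```

The example also exposes a limitation: pure pooling ignores character order entirely, which is one reason the slide pairs the char-based embedding with an ordinary word embedding, and why real models pool over CNN or LSTM outputs rather than raw character vectors.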
◮ Concatenation of character embeddings
◮ Convolution layer with multiple filter widths
◮ Max-over-time pooling layer
◮ Highway network [Srivastava et al. 2015]
◮ Long short-term memory network
◮ Softmax output to obtain a distribution over the next word
◮ Cross-entropy loss between next word and prediction
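The pipeline above, sketched shape-by-shape in numpy with random weights; all dimensions are illustrative choices of mine, not the published hyperparameters, and the LSTM/softmax stages are only indicated.

```python
import numpy as np

rng = np.random.default_rng(0)

chars, d_char, n_filters = 6, 15, 25          # a word of 6 characters
X = rng.standard_normal((chars, d_char))      # concatenated char embeddings

# Convolution with multiple filter widths, each followed by
# max-over-time pooling (one feature per filter).
pooled = []
for width in (2, 3, 4):
    W = rng.standard_normal((width * d_char, n_filters))
    windows = np.stack([X[i:i + width].ravel()
                        for i in range(chars - width + 1)])
    pooled.append(np.tanh(windows @ W).max(axis=0))
h = np.concatenate(pooled)                    # word vector: 3 * 25 = 75 dims

# Highway layer [Srivastava et al. 2015]: a gate decides, per dimension,
# how much to transform h and how much to carry it through unchanged.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

Wt = rng.standard_normal((75, 75))
Wh = rng.standard_normal((75, 75))
t = sigmoid(h @ Wt)                           # transform gate
y = t * np.tanh(h @ Wh) + (1 - t) * h         # highway output

print(y.shape)                                # (75,): input to the LSTM LM
```

The highway carry path matters here: it lets useful pooled n-gram features pass to the LSTM unmodified instead of being forced through a nonlinearity.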
Bag-of-n-gram Models
◮ No notion of “words”/token boundaries necessary
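The point above can be sketched directly: a whole text becomes a bag of character n-grams hashed into a fixed-size vector, with no segmentation step anywhere. This is my generic illustration of the technique (n, dimensionality, and the hash are stand-ins, not any paper's settings).

```python
def bag_of_ngrams(text, n=4, dim=1000):
    # Slide a window over the raw string; n-grams freely cross
    # whitespace, so no token boundaries are ever computed.
    vec = [0] * dim
    for i in range(len(text) - n + 1):
        # Built-in hash() is a stand-in for a stable or learned
        # hashing scheme in a real system.
        vec[hash(text[i:i + n]) % dim] += 1
    return vec

v = bag_of_ngrams("the cat sat on the mat")
print(sum(v))   # 19 four-grams in a 22-character string
```

Because n-grams span spaces, the representation captures cross-token patterns ("t sa", "n th") that any tokenization-based model discards by construction.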
End-to-end Models
◮ Mostly with linguistic structure and large vocabulary
◮ Only a few uncapitalized non-words
◮ Balanced parentheses and quotes over long distances (e.g., 30 …)
Conclusion
◮ Direct modeling of input data (i.e., character or byte sequences)
◮ No tokenization necessary; character segments can span token boundaries
◮ Models without a notion of words/tokens can achieve competitive results
◮ Character-level models are especially suited for rare words, …
References
Beatrice Alex: An Unsupervised System for Identifying English Inclusions in German Text. ACL 2005.
Timothy Baldwin, Marco Lui: Language Identification: The Long and the Short of the Matter. NAACL 2010.
Miguel Ballesteros, Chris Dyer, Noah A. Smith: Improved Transition-Based Parsing by Modeling Characters instead of Words with LSTMs. EMNLP 2015.
Maximilian Bisani, Hermann Ney: Joint-sequence models for grapheme-to-phoneme conversion. Speech communication 2008.
Piotr Bojanowski, Edouard Grave, Armand Joulin, Tomáš Mikolov: Enriching Word Vectors with Subword Information. arXiv 2016.
Jan A. Botha, Phil Blunsom: Compositional Morphology for Word Representations and Language Modelling. ICML 2014.
Kris Cao, Marek Rei: A Joint Model for Word Embedding and Word Morphology. ACL 2016.
William B. Cavnar: Using An N-Gram-Based Document Representation With a Vector Processing Retrieval Model. TREC 1994.
Aitao Chen, Jianzhang He, Liangjie Xu, Jason Meggs: Chinese Text Retrieval Without Using a Dictionary. SIGIR 1997.
Xinxiong Chen, Lei Xu, Zhiyuan Liu, Maosong Sun, Huanbo Luan: Joint Learning of Character and Word Embeddings. IJCAI 2015.
Jason P.C. Chiu, Eric Nichols: Named Entity Recognition with Bidirectional LSTM-CNNs. TACL vol. 4, 2016.
Junyoung Chung, Kyunghyun Cho, Yoshua Bengio: A Character-Level Decoder without Explicit Segmentation for Neural Machine Translation. ACL 2016.
Kenneth W. Church: Char align: A Program for Aligning Parallel Texts at the Character Level. ACL 1993.
Alexander Clark: Combining Distributional and Morphological Information for Part of Speech Induction. EACL 2003.
Marta R. Costa-jussà, José A.R. Fonollosa: Character-based Neural Machine Translation. ACL 2016.
Ryan Cotterell, Tim Vieira, Hinrich Schütze: A Joint Model of Orthography and Morphological Segmentation. NAACL 2016.
Marc Damashek: Gauging Similarity with n-Grams: Language-Independent Categorization of Text. Science 1995.
Cícero N. dos Santos, Bianca Zadrozny: Learning Character-Level Representations for Part-of-Speech Tagging. ICML 2014.
Cícero N. dos Santos, Maíra Gatti: Deep Convolutional Neural Networks for Sentiment Analysis of Short Texts. COLING 2014.
Cícero N. dos Santos, Victor Guimarães: Boosting Named Entity Recognition with Neural Character Embeddings. NEWS Workshop 2015.
Ted Dunning: Statistical Identification of Language. Technical Report MCCS 940-273. 1994.
Asli Eyecioglu, Bill Keller: ASOBEK at SemEval-2016 Task 1: Sentence Representation with Character N-gram Embeddings for Semantic Textual Similarity. SemEval 2016.
Manaal Faruqui, Yulia Tsvetkov, Graham Neubig, Chris Dyer: Morphological Inflection Generation Using Character Sequence to Sequence Learning. NAACL 2016.
Stefan Gerdjikov, Klaus U. Schulz: Corpus analysis without prior linguistic knowledge - unsupervised mining of phrases and subphrase structure. arXiv 2016.
Dan Gillick, Cliff Brunk, Oriol Vinyals, Amarnag Subramanya: Multilingual Language Processing From Bytes. NAACL 2016.
David Golub, Xiaodong He: Character-Level Question Answering with Attention. EMNLP 2016.
Haizhou Li, Zhang Min, Su Jian: A Joint Source-Channel Model for Machine Transliteration. ACL 2004.
Teemu Hirsimäki, Mathias Creutz, Vesa Siivola, Mikko Kurimo, Sami Virpioja, Janne Pylkkönen: Unlimited Vocabulary Speech Recognition with Morph Language Models Applied to Finnish. Computer Speech & Language, vol. 20, 2006.
Speech Recognition of Highly Inflectional Language - Czech. Eurospeech 2001.
Aaron Jaech, George Mulcaire, Shobhit Hathi, Mari Ostendorf, Noah A. Smith: Hierarchical Character-Word Models for Language Identification. SocialNLP 2016.
Moonyoung Kang, Tim Ng, Long Nguyen: Mandarin word-character hybrid-input Neural Network Language Model. Interspeech 2011.
Katharina Kann, Hinrich Schütze: MED: The LMU System for the SIGMORPHON 2016 Shared Task on Morphological Reinflection. Sigmorphon 2016.
Katharina Kann, Hinrich Schütze: Single-Model Encoder-Decoder with Explicit Morphological Representation for Reinflection. ACL 2016.
Katharina Kann, Ryan Cotterell, Hinrich Schütze: Neural Morphological Analysis: Encoding-Decoding Canonical Segments. EMNLP 2016.
Ronald M. Kaplan, Martin Kay: Regular Models of Phonological Rule Systems. Computational Linguistics, vol. 20, 1994.
Kimmo Kettunen, Paul McNamee, Feza Baskaya: Using Syllables as Indexing Terms in Full-Text Information Retrieval.
Yoon Kim, Yacine Jernite, David Sontag, Alexander M. Rush: Character-Aware Neural Language Models. AAAI 2016.
Katrin Kirchhoff, Dimitra Vergyri, Jeff Bilmes, Kevin Duh, Andreas Stolcke: Morphology-Based Language Modeling for Conversational Arabic Speech Recognition. Computer Speech & Language, vol. 20, 2006.
Dan Klein, Joseph Smarr, Huy Nguyen, Christopher D. Manning: Named Entity Recognition with Character-Level Models. CoNLL 2003.
Kevin Knight, Jonathan Graehl: Machine transliteration. Computational Linguistics, vol. 24, 1998.
Stefan Kombrink, Mirko Hannemann, Lukáš Burget, Hynek Heřmanský: Recovery of Rare Words in Lecture Speech. TSD 2010.
Wanqiu Kou, Fang Li, Timothy Baldwin: Automatic Labelling of Topic Models using Word Vectors and Letter Trigram Vectors. AIRS 2015.
Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, Chris Dyer: Neural Architectures for Named Entity Recognition. NAACL 2016.
Yves Lepage, Etienne Denoual: Purest Ever Example-Based Machine Translation: Detailed Presentation and Assessment. Machine Translation, vol. 19, 2005.
Wang Ling, Tiago Luís, Luís Marujo, Ramón Fernandez Astudillo, Silvio Amir, Chris Dyer, Alan W. Black, Isabel Trancoso: Finding Function in Form: Compositional Character Models for Open Vocabulary Word Representation. EMNLP 2015.
Minh-Thang Luong, Richard Socher, Christopher D. Manning: Better Word Representations with Recursive Neural Networks for Morphology. CoNLL 2013.
Minh-Thang Luong, Christoper D. Manning: Achieving Open Vocabulary Neural Machine Translation with Hybrid Word-Character Models. ACL 2016.
Xuezhe Ma, Eduard Hovy: End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF. ACL 2016.
Paul McNamee, James Mayfield: Character N-Gram Tokenization for European Language Text Retrieval. Information Retrieval, vol. 7, 2004.
Rada Mihalcea, Vivi Nastase: Letter Level Learning for Language Independent Diacritics Restoration. CoNLL 2002.
Tomáš Mikolov, Ilya Sutskever, Anoop Deoras, Hai-Son Le, Stefan Kombrink, Jan Černocký: Subword Language Modeling with Neural Networks. 2012.
Yasumasa Miyamoto, Kyunghyun Cho: Gated Word-Character Recurrent Language Model. EMNLP 2016.
Thomas Müller, Helmut Schmid, Hinrich Schütze: Efficient Higher-Order CRFs for Morphological Tagging. EMNLP 2013.
Carolina Parada, Mark Dredze, Abhinav Sethy, Ariya Rastrow: Learning Sub-Word Units for Open Vocabulary Speech Recognition. ACL 2011.
Fuchun Peng, Dale Schuurmans, Vlado Keselj, Shaojun Wang: Language Independent Authorship Attribution using Character Level Language Models. EACL 2003.
Barbara Plank, Anders Søgaard, Yoav Goldberg: Multilingual Part-of-Speech Tagging with Bidirectional Long Short-Term Memory Models and Auxiliary Loss. ACL 2016.
Pushpendre Rastogi, Ryan Cotterell, Jason Eisner: Weighting Finite-State Transductions With Neural Context. NAACL 2016.
Adwait Ratnaparkhi: A Maximum Entropy Model for Part-of-Speech Tagging. EMNLP 1996.
Hassan Sajjad, Helmut Schmid, Alexander Fraser, Hinrich Schütze: Statistical Models for Unsupervised, Semi-supervised and Supervised Transliteration Mining. Computational Linguistics 2016.
Hinrich Schütze: Word Space. NIPS 1992.
Hinrich Schütze: Nonsymbolic Text Representation. EACL 2017.
Terrence J. Sejnowski, Charles R. Rosenberg: Parallel networks that learn to pronounce English text. Complex systems 1987.
Rico Sennrich, Barry Haddow, Alexandra Birch: Neural Machine Translation of Rare Words with Subword Units. ACL 2016.
M. Ali Basha Shaik, Amr El-Desoky Mousa, Ralf Schlüter, Hermann Ney: Hybrid Language Models Using Mixed Types of Sub-lexical Units for Open Vocabulary German LVCSR. Interspeech 2011.
M. Ali Basha Shaik, Amr El-Desoky Mousa, Ralf Schlüter, Hermann Ney: Feature-rich Sub-lexical Language Models Using a Maximum Entropy Approach for German LVCSR. Interspeech 2013.
Claude E. Shannon: Prediction and Entropy of Printed English. BSTJ, vol. 30, 1951.
Henning Sperr, Jan Niehues, Alexander Waibel: Letter N-Gram-Based Input Encoding for Continuous Space Language Models. CVSC 2013.
Rupesh K. Srivastava, Klaus Greff, Jürgen Schmidhuber: Highway Networks. ICML 2015.
Ilya Sutskever, James Martens, Geoffrey Hinton: Generating Text with Recurrent Neural Networks. ICML 2011.
Dimitra Vergyri, Katrin Kirchhoff, Kevin Duh, Andreas Stolcke: Morphology-Based Language Modeling for Arabic Speech Recognition. Interspeech 2004.
David Vilar, Jan-T. Peter, Hermann Ney: Can We Translate Letters? WSMT 2007.
Rudra Murthy V, Mitesh Khapra, Pushpak Bhattacharyya: Sharing Network Parameters for Crosslingual Named Entity Recognition. arXiv 2016.
Ekaterina Vylomova, Trevor Cohn, Xuanli He, Gholamreza Haffari: Word Representation Models for Morphologically Rich Languages in Neural Machine Translation. arXiv 2016.
Linlin Wang, Zhu Cao, Yu Xia, Gerard de Melo: Morphological Segmentation with Window LSTM Neural Networks. AAAI 2016.
John Wieting, Mohit Bansal, Kevin Gimpel, Karen Livescu: CHARAGRAM: Embedding Words and Sentences via Character n-Grams. EMNLP 2016.
Yonghui Wu et al.: Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. Technical Report 2016.
Yijun Xiao, Kyunghyun Cho: Efficient Character-level Document Classification by Combining Convolution and Recurrent Layers. arXiv 2016.
Yadollah Yaghoobzadeh, Hinrich Sch¨ utze: Multi-level Representations for Fine-Grained Typing of Knowledge Base Entities. EACL 2017.
Zhilin Yang, Ruslan Salakhutdinov, William Cohen: Multi-Task Cross-Lingual Sequence Tagging from Scratch. arXiv 2016.
Xiang Zhang, Junbo Zhao, Yann LeCun: Character-level Convolutional Networks for Text Classification. NIPS 2015.