SLIDE 1

C2NLU: An Overview

Heike Adel CIS, LMU Munich Dagstuhl January 23, 2017

SLIDE 2

Contents

W e l c o m e _ t o _ m y _ t a l k

NLU greeting

◮ Motivation

◮ Why do we want character-based models?

◮ Previous work

◮ Which character-based models/research exist?

◮ Conclusion

◮ Which challenges/open questions need to be considered?

SLIDE 3

Traditional NLP/NLU

Typical NLP/NLU processing pipeline [Gillick et al. 2015]:

[Pipeline figure: document → tokenization (language-specific) → token sequence → segmentation into sentences → sentences → syntactic analysis (per sentence) → POS tags, syntactic dependencies → semantic analysis → NE tags, semantic roles, ... → NLU]

◮ Pipeline of different modules: prone to subsequent errors
◮ We usually cannot recover from errors (e.g., in tokenization)

SLIDE 4

Idea: C2NLU

◮ Most extreme view: character end-to-end model that gets rid of the traditional pipeline entirely

Is this reasonable?

[Pipeline figure repeated from slide 3]

SLIDE 5

C2NLU: Direct models of data

◮ Traditional machine learning: based on feature engineering

◮ Tokens = manually designed features for NLU models

◮ In contrast: deep learning:

◮ Models can directly access the data, e.g., pixels in vision, acoustic signals in speech recognition

◮ Models learn their own representation (“features”) of the data

⇒ Character-based models: raw-data approach for text

SLIDE 6

C2NLU: Tokenization-free models

◮ Tokenization is difficult

◮ English: some difficult cases

[Yahoo!, San Francisco-Los Angeles flights, #starwars]

◮ Chinese: tokens are not separated by spaces
◮ German: compounds

[Donaudampfschifffahrtsgesellschaftskapitänsmütze → the hat of the captain of the association for shipping with steam-powered vessels on the Danube]

◮ Turkish: agglutinative language

[Bayramlaşamadıklarımızdandır → He is among those with whom we haven’t been able to exchange Season’s greetings]

◮ Problem: difficult/inefficient to correct tokenization decisions

SLIDE 7

C2NLU: More robust against noise

◮ Robust against small perturbations of input
◮ Examples: letter insertions, deletions, substitutions, transpositions

[commputer, compter, compuzer, comptuer]

◮ Examples: space insertions, space deletions

[guacamole → gua camole, ran fast → ranfast]
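To make these noise types concrete, here is a minimal Python sketch (the function name perturb and the edit inventory are ours, for illustration) that applies one random letter-level edit to a word:

import random

def perturb(word, rng=random.Random(0)):
    # Apply one random letter-level edit: insertion, deletion,
    # substitution, or transposition.
    letters = "abcdefghijklmnopqrstuvwxyz"
    i = rng.randrange(len(word))
    op = rng.choice(["insert", "delete", "substitute", "transpose"])
    if op == "insert":
        return word[:i] + rng.choice(letters) + word[i:]
    if op == "delete":
        return word[:i] + word[i + 1:]
    if op == "substitute":
        return word[:i] + rng.choice(letters) + word[i + 1:]
    j = (i + 1) % len(word)          # transpose: swap positions i and j
    chars = list(word)
    chars[i], chars[j] = chars[j], chars[i]
    return "".join(chars)

print([perturb("computer") for _ in range(4)])

A character-level model sees the perturbed string as a slightly different character sequence, whereas a token-level model maps it to an unrelated (or unknown) vocabulary entry.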

SLIDE 8

C2NLU: Robust morphological processing

◮ If we model the sequence of characters of a token, we can in principle learn all morphological regularities:

◮ Inflections
◮ Derivations
◮ Wide range of morphological processes (vowel harmony, agglutination, reduplication, ...)

◮ Modeling words would, e.g., ignore that many words share a common root, prefix or suffix

◮ ⇒ C2NLU: promising framework for incorporating linguistic knowledge about morphology into statistical models

SLIDE 9

C2NLU: Orthographic productivity

◮ Character sequences are not arbitrary, but their predictability is limited

◮ Example: morphology

◮ Properties of names are predictable from character patterns

[Delonda → female name, osinopril → medication]

◮ Modifications of existing words

[staycation, dramedy, Obamacare]

◮ Non-morphological orthographic productivity

[cooooool, Watergate, Dieselgate]

◮ Sound symbolism, phonesthemes

[gl → glitter, gleam, glow]

◮ Onomatopoeia

[oink, tick tock]

SLIDE 10

C2NLU: Out-of-vocabulary (OOV)

◮ No OOVs in character input
◮ OOV generation possible, without the use of special mechanisms

◮ Possible application: names/transliterations in end-to-end machine translation

◮ Open question: How can character-based systems accurately generate OOVs?

SLIDE 11

Early work: Application specific character-based features

◮ The history of character-based features for ML models is long

◮ Information retrieval with character n-grams
[McNamee et al. 2004, Chen et al. 1997, Damashek 1995, Cavnar 1994, de Heer 1974]

◮ Grapheme-to-phoneme conversion
[Bisani et al. 2008, Kaplan et al. 1994, Sejnowski et al. 1987]

◮ Char align: bilingual character-level alignments [Church 1993]

◮ Prefix and suffix features for tagging rare words
[Müller et al. 2013, Ratnaparkhi 1996]

◮ Transliteration
[Sajjad et al. 2016, Li et al. 2004, Knight et al. 1998]

◮ Diacritics restoration [Mihalcea et al. 2002]

◮ POS induction (unsupervised, multilingual) [Clark 2003]

◮ Characters and character n-grams as features for NER
[Klein et al. 2003]

◮ Language identification [Alex 2005]

SLIDE 12

Early work (2): Language modeling and machine translation

◮ Character-based language modeling (non-neural)

“How well can the next letter of a text be predicted when the preceding N letters are known?” [Shannon 1951]

◮ Morpheme-level features for language models; application: speech recognition
[Shaik et al. 2013, Kirchhoff et al. 2006, Vergyri et al. 2004, Ircing et al. 2001]

◮ Language-independent character n-gram language models for authorship attribution [Peng et al. 2003]

◮ Hybrid word/subword n-gram language model for OOV words in speech recognition
[Parada et al. 2011, Shaik et al. 2011, Kombrink et al. 2010, Hirsimäki et al. 2006]

◮ Characters and character n-grams as input to Restricted Boltzmann Machine-based language models; application: machine translation [Sperr et al. 2013]

◮ Character-based machine translation (non-neural)

◮ Machine translation based on characters/character n-grams
[Tiedemann et al. 2013, Vilar et al. 2007, Lepage et al. 2005]

SLIDE 13

Categorization of previous work

◮ Three clusters [Schütze 2017]

◮ Tokenization-based models
◮ Bag-of-n-gram models
◮ End-to-end models

◮ But: also mixtures possible,
◮ e.g., tokenization-based bag-of-n-gram models
◮ e.g., bag-of-n-gram or tokenization-based models trained end-to-end

SLIDE 14

Tokenization-based models

◮ Character-level models based on tokenization (tokenization: necessary pre-processing step)

◮ Model input: tokenized text or individual tokens
◮ Example: word representations based on characters (e.g., for rare words or OOVs)

SLIDE 15

Tokenization-based models: Examples

Example: word representations based on characters

◮ (1) Average of character embeddings

[Figure: the character embeddings of t, a, b, e, l combined by summation/averaging (Σ)]

SLIDE 16

Tokenization-based models: Examples (2)

Example: word representations based on characters

◮ (2) Bidirectional RNN/LSTM over character embeddings

[Figure: a BiRNN/LSTM running over the character embeddings of t, a, b, e, l]

SLIDE 17

Tokenization-based models: Examples (3)

Example: word representations based on characters

◮ (3) CNN over character embeddings

[Figure: convolution with max pooling over the character embeddings of t, a, b, e, l]
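The three composition functions of slides 15-17 can be sketched side by side. A minimal PyTorch illustration with untrained parameters (the toy vocabulary and all dimensions are ours, chosen arbitrarily):

import torch
import torch.nn as nn

char2id = {c: i for i, c in enumerate("abcdefghijklmnopqrstuvwxyz")}
emb = nn.Embedding(26, 16)                            # character embeddings

ids = torch.tensor([[char2id[c] for c in "tabel"]])   # (1, 5)
x = emb(ids)                                          # (1, 5, 16)

# (1) average of character embeddings
avg_repr = x.mean(dim=1)                              # (1, 16)

# (2) bidirectional LSTM: concatenate final forward/backward states
bilstm = nn.LSTM(16, 32, bidirectional=True, batch_first=True)
_, (h, _) = bilstm(x)                                 # h: (2, 1, 32)
lstm_repr = torch.cat([h[0], h[1]], dim=-1)           # (1, 64)

# (3) CNN over characters + max-over-time pooling
conv = nn.Conv1d(16, 32, kernel_size=3, padding=1)
cnn_repr = conv(x.transpose(1, 2)).max(dim=2).values  # (1, 32)

print(avg_repr.shape, lstm_repr.shape, cnn_repr.shape)

All three map a character sequence of any length to a fixed-size vector, which is what lets them stand in for (or complement) a word embedding.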

SLIDE 18

Tokenization-based models: Examples (4)

How to integrate such character-based embeddings into a larger system?
⇒ Example: Hierarchical RNNs [Ling et al. 2016, Luong et al. 2016, Plank et al. 2016, Vylomova et al. 2016, Wang et al. 2016, Yang et al. 2016, Ballesteros et al. 2015]

[Figure: a char-based embedding per token (e.g., t a b e l → “table”) concatenated with its word embedding; a word-level RNN over the sentence “the red ... is in the kitchen” predicts the tags DET JJ NN V DET NN IN]

SLIDE 19

Tokenization-based models: Examples (5)

Hierarchical CNN + FF network [dos Santos et al. 2014ab/2015]

[Figure: a character CNN with max pooling produces a char-based embedding per token, concatenated (conc) with the word embedding and fed to a feed-forward network (NN)]

SLIDE 20

Tokenization-based models: Examples (6)

Hierarchical CNN+RNN [Chiu et al. 2016, Costa-Jussà et al. 2016, Jaech et al. 2016, Kim et al. 2016, V et al. 2016, Vylomova et al. 2016]

[Figure: a character CNN with max pooling per token, concatenated with word embeddings; a word-level RNN predicts the tags DET JJ NN V DET NN IN]

SLIDE 21

Tokenization-based models: Examples (7)

Character-Aware Neural Language Models [Kim et al.2016]

concatenation of character embeddings
→ convolution layer with multiple filters of different widths
→ max-over-time pooling layer
→ highway network [Srivastava et al. 2015]
→ long short-term memory network
→ softmax output to obtain distribution over next word
→ cross-entropy loss between next word and prediction

◮ Combining the character model output with word embeddings did not help in this study
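A hedged PyTorch sketch of the character-level word encoder in this pipeline (hyperparameters are illustrative, not the paper's exact settings; the word-level LSTM language model and softmax on top are omitted):

import torch
import torch.nn as nn

class CharWordEncoder(nn.Module):
    # char embeddings -> multi-width convolutions -> max-over-time
    # pooling -> one highway layer
    def __init__(self, n_chars=100, char_dim=15,
                 widths=(1, 2, 3, 4, 5), n_filters=25):
        super().__init__()
        self.emb = nn.Embedding(n_chars, char_dim)
        self.convs = nn.ModuleList(
            nn.Conv1d(char_dim, n_filters, w) for w in widths)
        out_dim = n_filters * len(widths)
        self.gate = nn.Linear(out_dim, out_dim)   # highway transform gate
        self.proj = nn.Linear(out_dim, out_dim)   # highway nonlinearity

    def forward(self, char_ids):                  # (batch, word_len)
        x = self.emb(char_ids).transpose(1, 2)    # (batch, dim, len)
        pooled = [conv(x).max(dim=2).values for conv in self.convs]
        h = torch.cat(pooled, dim=1)              # (batch, out_dim)
        t = torch.sigmoid(self.gate(h))           # t*g(Wh) + (1-t)*h
        return t * torch.relu(self.proj(h)) + (1 - t) * h

enc = CharWordEncoder()
words = torch.randint(0, 100, (2, 8))             # two 8-character words
print(enc(words).shape)                           # torch.Size([2, 125])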

SLIDE 22

Tokenization-based models: Overview

◮ Hierarchical approaches

char level | upper level | task               | examples
LSTM       | LSTM        | POS tagging/NER    | [Lample et al. 2016, Plank et al. 2016, Yang et al. 2016]
CNN        | FF network  | POS tagging/NER    | [dos Santos et al. 2014b, dos Santos et al. 2015]
CNN        | FF network  | sentiment analysis | [dos Santos et al. 2014a]
CNN        | LSTM        | NER                | [Chiu et al. 2016, V et al. 2016]
LSTM       | LSTM        | dependency parsing | [Ballesteros et al. 2015]

SLIDE 23

Tokenization-based models: Overview (2)

◮ Language identification (including code-switching/language inclusions) [Jaech et al. 2016, Ling 2015, Alex 2005]

◮ Neural language modeling with subword-level input (e.g., characters or morphemes) [Kim et al. 2016, Botha et al. 2014, Shaik et al. 2013, Sperr et al. 2013, Mikolov et al. 2012]

◮ Word representation learning from character- and/or morpheme-level word decompositions (e.g., for rare words) [Vylomova et al. 2016, Chen et al. 2015, Ling et al. 2015, Luong et al. 2013]

◮ Morphological processing of words [Cao et al. 2016, Cotterell et al. 2016, Faruqui et al. 2016, Kann et al. 2016, Rastogi et al. 2016, Wang et al. 2016]

◮ Neural machine translation of rare words with subword units [Costa-Jussà et al. 2016, Luong et al. 2016, Sennrich et al. 2016]

SLIDE 24

Bag-of-n-gram models

◮ Character n-gram embeddings, within-token and cross-token

◮ No notion of “words”/token boundaries necessary

◮ Application: embedding of a piece of text: sum of occurring character n-grams

◮ Example (4-grams; “_” marks a space):

Dagstuhl is located in Germany. → Dags agst gstu stuh tuhl uhl_ hl_i l_is _is_ is_l s_lo _loc loca ocat cate ated ted_ ed_i d_in _in_ in_G n_Ge _Ger Germ erma rman many any.
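Extracting these within- and cross-token n-grams is a one-liner; a small Python sketch (the function name is ours) reproduces the start of the example:

def char_ngrams(text, n=4):
    # all character n-grams, including ones that cross token
    # boundaries (spaces are treated as ordinary characters)
    return [text[i:i + n] for i in range(len(text) - n + 1)]

grams = char_ngrams("Dagstuhl is located in Germany.")
print([g.replace(" ", "_") for g in grams[:8]])
# ['Dags', 'agst', 'gstu', 'stuh', 'tuhl', 'uhl_', 'hl_i', 'l_is']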

SLIDE 25

Bag-of-n-gram models: Examples

WordSpace [Schütze 1992]

◮ Goal: fixed-length distributed semantic representations for text (corpus-based)

◮ Training of k-gram embeddings:
corpus → extract m top character k-grams (m = 5000, k = 4) → extract cooccurrence counts of k-grams → cooc. matrix → SVD → n-dim k-gram embeddings

◮ Application: obtaining vectors for a piece of text:
piece of text → extract k-grams in piece of text → sum k-gram embeddings → bag-of-k-gram embedding of text

◮ Evaluation: word sense disambiguation:
sum embeddings in surrounding contexts → context vectors → cluster ambiguous word context vectors → sense representation
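A toy end-to-end sketch of this pipeline in numpy (the corpus, the ±10-gram cooccurrence window, and n = 10 dimensions are stand-ins of our choosing; the original uses m = 5000 k-grams):

from collections import Counter
import numpy as np

def kgrams(text, k=4):
    return [text[i:i + k] for i in range(len(text) - k + 1)]

corpus = "dagstuhl is located in germany . dagstuhl hosts seminars ."
grams = kgrams(corpus)

# keep the m most frequent k-grams
vocab = [g for g, _ in Counter(grams).most_common(50)]
idx = {g: i for i, g in enumerate(vocab)}

# cooccurrence counts of k-grams within a +/-10-gram window
C = np.zeros((len(vocab), len(vocab)))
for i, g in enumerate(grams):
    if g in idx:
        for h in grams[max(0, i - 10):i + 11]:
            if h in idx:
                C[idx[g], idx[h]] += 1

# SVD of the cooccurrence matrix -> low-dimensional k-gram embeddings
U, S, _ = np.linalg.svd(C)
kgram_emb = U[:, :10] * S[:10]

def text_vector(text):
    # bag-of-k-gram embedding: sum the embeddings of its k-grams
    return np.sum([kgram_emb[idx[g]] for g in kgrams(text) if g in idx],
                  axis=0)

print(text_vector("located in germany").shape)    # (10,)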

SLIDE 26

Bag-of-n-gram models: Examples (2)

CHARAGRAM: Embedding Words and Sentences via Character n-grams [Wieting et al. 2016]

◮ Based on n-gram vectors (learned end-to-end)

◮ Obtaining word/sequence embeddings: summing character n-gram vectors

$g_{\text{CHAR}}(x) = h\Big(b + \sum_{i=1}^{m+1} \sum_{j=1+i-k}^{i} \mathbb{I}[x_j^i \in V]\, W_{x_j^i}\Big)$

◮ Evaluation: word similarity, sentence similarity, part-of-speech tagging
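A minimal numpy sketch of this summation (the vectors are random stand-ins for the end-to-end-trained parameters; as a simplification, every observed n-gram up to length k is admitted to V, b is zero, and h is tanh):

import numpy as np

rng = np.random.default_rng(0)
V = {}                               # n-gram -> vector, created on demand

def gram_vec(g, dim=25):
    if g not in V:
        V[g] = rng.normal(size=dim)
    return V[g]

def charagram(text, k=3, dim=25):
    # h(b + sum of the vectors of all character n-grams); b = 0 here
    total = np.zeros(dim)
    padded = " " + text + " "        # mark sequence boundaries
    for n in range(1, k + 1):
        for i in range(len(padded) - n + 1):
            total += gram_vec(padded[i:i + n], dim)
    return np.tanh(total)

print(charagram("table")[:4])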

SLIDE 27

Bag-of-n-gram models: Overview

◮ Word sense disambiguation [Schütze 1992]
◮ Language identification [Baldwin et al. 2010, Dunning 1994]
◮ Text retrieval [Kettunen et al. 2010, McNamee et al. 2004, Cavnar 1994]
◮ Topic labeling [Kou et al. 2015]
◮ Word/text similarity [Bojanowski et al. 2017, Eyecioglu et al. 2016, Wieting 2016]

SLIDE 28

End-to-end models

◮ Input: sequence of characters or bytes
◮ Directly optimized on an objective
◮ ⇒ Tokenization-free
◮ ⇒ Models learn their own, task-specific representations

SLIDE 29

End-to-end models: Examples

Multilingual Language Processing From Bytes [Gillick et al. 2015]

◮ Segments can begin and end mid-word or even mid-character
◮ Same model for different languages
◮ Evaluation: POS tagging, NER
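Why a segment can end mid-character: in a byte-level model a multi-byte UTF-8 character may be split across segments. A two-line Python illustration (the example string is ours):

text = "Münchner Straße"
print(list(text.encode("utf-8"))[:6])   # [77, 195, 188, 110, 99, 104]
# 'ü' occupies two bytes (195, 188); a byte window can cut between them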

SLIDE 30

End-to-end models: Examples (2)

Efficient Character-level Document Classification by Combining Convolution and Recurrent Layers [Xiao et al. 2016]

[Figure: (x1, x2, …, xT) → embedding layer → convolutional layers → recurrent layers → classification layer → p(y|X)]

◮ Motivation: the receptive field of a convolutional layer is small (usually 3-7)
⇒ many layers necessary to capture long-term dependencies
⇒ combine convolutional networks with recurrent networks

◮ Evaluation: sentiment analysis, ontology classification, question type classification, news categorization
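The receptive-field arithmetic behind this motivation (a standard fact for stride-1, undilated convolutions, stated here for illustration): $L$ stacked convolutions of width $k$ see only $r(L) = L\,(k-1) + 1$ positions, so, e.g., ten layers of width 5 still cover only $10 \cdot 4 + 1 = 41$ characters; hence the recurrent layers on top.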

SLIDE 31

End-to-end models: Examples (3)

Generating Text with Recurrent Neural Networks [Sutskever et al. 2011]

◮ RNN with multiplicative (“gated”) connections for character-level language modeling

◮ Multiplicative connections: each input character specifies a different hidden-to-hidden weight matrix:

$W_{hh}^{(x_t)} = \sum_{m=1}^{M} x_t^{(m)} W_{hh}^{(m)}$

◮ Qualitative analysis of generated text:

◮ Mostly with linguistic structure and large vocabulary
◮ Only a few uncapitalized non-words
◮ Balanced parentheses and quotes over long distances (e.g., 30 characters)
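A minimal numpy sketch of one multiplicative-RNN step (note the paper additionally factorizes $W_{hh}$ so the $M$ matrices share parameters; this unfactorized version, with dimensions of our choosing, is for illustration only):

import numpy as np

M, H = 30, 50                     # alphabet size, hidden size
rng = np.random.default_rng(0)
W_hh = rng.normal(scale=0.01, size=(M, H, H))  # one matrix per character
W_xh = rng.normal(scale=0.01, size=(M, H))

def step(h, char_id):
    # W_hh^{(x_t)} = sum_m x_t^{(m)} W_hh^{(m)}; with a one-hot x_t
    # this reduces to looking up the character's matrix
    W = W_hh[char_id]
    return np.tanh(W @ h + W_xh[char_id])

h = np.zeros(H)
for c in [3, 1, 4]:               # a toy character-id sequence
    h = step(h, c)
print(h.shape)                    # (50,)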

SLIDE 32

End-to-end models: Overview

◮ Machine translation, character/subword input/output [Chung et al. 2016, Wu et al. 2016, Vilar et al. 2007]

◮ Generating character sequences [Sutskever et al. 2011]

◮ Text/document classification with character-level models [Xiao et al. 2016, Zhang et al. 2015]

◮ Sequence labeling [byte input: Gillick et al. 2015, character input: Ma et al. 2016]

◮ Unsupervised, language-independent identification of phrases or words (“meaningful subparts of language”) [Gerdjikov et al. 2016]

◮ Question answering with character-level encoder-decoder [Golub et al. 2016]

◮ Language modeling with combined word-character input [Miyamoto et al. 2016]

◮ Entity typing based on character sequence of name [Yaghoobzadeh et al. 2017]

SLIDE 33

C2NLU: Wrap-up

◮ C2NLU: possibility to overcome the traditional NLP/NLU pipeline

◮ Direct modeling of input data (i.e., character or byte sequences)

◮ No tokenization necessary, character segments can span token boundaries

◮ Previous work shows:
◮ Models without notion of words/tokens can achieve state-of-the-art results
◮ Character-level models especially suited for rare words, morphologically rich languages, language-independent NLU

SLIDE 34

C2NLU: Open questions from proposal

◮ What is the relationship between morphology and character-level models?

◮ Will character-level models reach and/or surpass token-based models and, if so, in which subareas of NLU?

◮ Using character-level models, can we realize universal multilingual processing (English, Chinese, Turkish, etc.)?

◮ If domain knowledge injected into ML models no longer consists of tokenization rules and morphological expertise, what would replace it?

◮ Detecting syntactic and semantic relationships at the character level is more expensive than at the word level. How can we address the resulting challenges in scalability for character-level models?

◮ How can C2NLU systems accurately generate OOVs?

SLIDE 35

References

Beatrice Alex: An Unsupervised System for Identifying English Inclusions in German Text. ACL 2005.

Timothy Baldwin, Marco Lui: Language Identification: The Long and the Short of the Matter. NAACL 2010.

Miguel Ballesteros, Chris Dyer, Noah A. Smith: Improved Transition-Based Parsing by Modeling Characters instead of Words with LSTMs. EMNLP 2015.

Maximilian Bisani, Hermann Ney: Joint-sequence models for grapheme-to-phoneme conversion. Speech communication 2008.

Piotr Bojanowski, Edouard Grave, Armand Joulin, Tomáš Mikolov: Enriching Word Vectors with Subword Information. TACL 2017.

Jan A. Botha, Phil Blunsom: Compositional Morphology for Word Representations and Language Modelling. ICML 2014.

Kris Cao, Marek Rei: A Joint Model for Word Embedding and Word Morphology. ACL 2016.

William B. Cavnar: Using An N-Gram-Based Document Representation With a Vector Processing Retrieval Model. NIST Special Publication 1995.

Aitao Chen, Jianzhang He, Liangjie Xu, Jason Meggs: Chinese Text Retrieval Without Using a Dictionary. SIGIR 1997.

Xinxiong Chen, Lei Xu, Zhiyuan Liu, Maosong Sun, Huanbo Luan: Joint Learning of Character and Word Embeddings. IJCAI 2015.

Jason P.C. Chiu, Eric Nichols: Named Entity Recognition with Bidirectional LSTM-CNNs. TACL vol. 4, 2016.

Junyoung Chung, Kyunghyun Cho, Yoshua Bengio: A Character-Level Decoder without Explicit Segmentation for Neural Machine Translation. ACL 2016.

Kenneth W. Church: Char align: A Program for Aligning Parallel Texts at the Character Level. ACL 1993.

Alexander Clark: Combining Distributional and Morphological Information for Part of Speech Induction. EACL 2003.

Marta R. Costa-jussà, José A.R. Fonollosa: Character-based Neural Machine Translation. ACL 2016.

Ryan Cotterell, Tim Vieira, Hinrich Schütze: A Joint Model of Orthography and Morphological Segmentation. NAACL 2016.

Marc Damashek: Gauging Similarity with n-Grams: Language-Independent Categorization of Text. Science 1995.

SLIDE 36

References (2)

T. de Heer: Experiments with Syntactic Traces in Information Retrieval. Information Storage and Retrieval, vol. 10, 1974.

Cícero N. dos Santos, Bianca Zadrozny: Learning Character-Level Representations for Part-of-Speech Tagging. ICML 2014.

Cícero N. dos Santos, Maíra Gatti: Deep Convolutional Neural Networks for Sentiment Analysis of Short Texts. COLING 2014.

Cícero N. dos Santos, Victor Guimarães: Boosting Named Entity Recognition with Neural Character Embeddings. Named Entity Workshop 2015.

Ted Dunning: Statistical Identification of Language. Technical Report MCCS 940-273. 1994.

Asli Eyecioglu, Bill Keller: ASOBEK at SemEval-2016 Task 1: Sentence Representation with Character N-gram Embeddings for Semantic Textual Similarity. SemEval 2016.

Manaal Faruqui, Yulia Tsvetkov, Graham Neubig, Chris Dyer: Morphological Inflection Generation Using Character Sequence to Sequence Learning. NAACL 2016.

Stefan Gerdjikov, Klaus U. Schulz: Corpus analysis without prior linguistic knowledge - unsupervised mining of phrases and subphrase structure. arXiv 2016.

Dan Gillick, Cliff Brunk, Oriol Vinyals, Amarnag Subramanya: Multilingual Language Processing From Bytes. NAACL 2016.

David Golub, Xiaodong He: Character-Level Question Answering with Attention. EMNLP 2016.

Haizhou Li, Zhang Min, Su Jian: A Joint Source-Channel Model for Machine Transliteration. ACL 2004.

Teemu Hirsimäki, Mathias Creutz, Vesa Siivola, Mikko Kurimo, Sami Virpioja, Janne Pylkkönen: Unlimited Vocabulary Speech Recognition with Morph Language Models Applied to Finnish. Computer Speech & Language, vol. 20, 2006.

P. Ircing, P. Krebc, J. Hajic, S. Khudanpur, F. Jelinek, J. Psutka, W. Byrne: On Large Vocabulary Continuous Speech Recognition of Highly Inflectional Language - Czech. Eurospeech 2001.

Aaron Jaech, George Mulcaire, Shobhit Hathi, Mari Ostendorf, Noah A. Smith: Hierarchical Character-Word Models for Language Identification. SocialNLP 2016.

Moonyoung Kang, Tim Ng, Long Nguyen: Mandarin word-character hybrid-input Neural Network Language Model. Interspeech 2011.

Katharina Kann, Hinrich Schütze: MED: The LMU System for the SIGMORPHON 2016 Shared Task on Morphological Reinflection. SIGMORPHON 2016.

SLIDE 37

References (3)

Katharina Kann, Hinrich Schütze: Single-Model Encoder-Decoder with Explicit Morphological Representation for Reinflection. ACL 2016.

Katharina Kann, Ryan Cotterell, Hinrich Schütze: Neural Morphological Analysis: Encoding-Decoding Canonical Segments. EMNLP 2016.

Ronald M. Kaplan, Martin Kay: Regular Models of Phonological Rule Systems. Computational Linguistics, vol. 20, 1994.

Kimmo Kettunen, Paul McNamee, Feza Baskaya: Using Syllables as Indexing Terms in Full-Text Information Retrieval. Baltic HLT 2010.

Yoon Kim, Yacine Jernite, David Sontag, Alexander M. Rush: Character-Aware Neural Language Models. AAAI 2016.

Katrin Kirchhoff, Dimitra Vergyri, Jeff Bilmes, Kevin Duh, Andreas Stolcke: Morphology-Based Language Modeling for Conversational Arabic Speech Recognition. Computer Speech & Language, vol. 20, 2006.

Dan Klein, Joseph Smarr, Huy Nguyen, Christopher D. Manning: Named Entity Recognition with Character-Level Models. CoNLL 2003.

Kevin Knight, Jonathan Graehl: Machine transliteration. Computational Linguistics, vol. 24, 1998.

Stefan Kombrink, Mirko Hannemann, Lukáš Burget, Hynek Heřmanský: Recovery of Rare Words in Lecture Speech. TSD 2010.

Wanqiu Kou, Fang Li, Timothy Baldwin: Automatic Labelling of Topic Models using Word Vectors and Letter Trigram Vectors. AIRS 2015.

Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, Chris Dyer: Neural Architectures for Named Entity Recognition. NAACL 2016.

Yves Lepage, Etienne Denoual: Purest Ever Example-Based Machine Translation: Detailed Presentation and Assessment. Machine Translation, vol. 19, 2005.

Wang Ling, Tiago Luís, Luís Marujo, Ramón Fernandez Astudillo, Silvio Amir, Chris Dyer, Alan W. Black, Isabel Trancoso: Finding Function in Form: Compositional Character Models for Open Vocabulary Word Representation. EMNLP 2015.

Minh-Thang Luong, Richard Socher, Christopher D. Manning: Better Word Representations with Recursive Neural Networks for Morphology. CoNLL 2013.

Minh-Thang Luong, Christoper D. Manning: Achieving Open Vocabulary Neural Machine Translation with Hybrid Word-Character Models. ACL 2016.

SLIDE 38

References (4)

Xuezhe Ma, Eduard Hovy: End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF. ACL 2016.

Paul McNamee, James Mayfield: Character N-Gram Tokenization for European Language Text Retrieval. Information Retrieval, vol. 7, 2004.

Rada Mihalcea, Vivi Nastase: Letter Level Learning for Language Independent Diacritics Restoration. CoNLL 2002.

Tomáš Mikolov, Ilya Sutskever, Anoop Deoras, Hai-Son Le, Stefan Kombrink, Jan Černocký: Subword Language Modeling with Neural Networks. 2012.

Yasumasa Miyamoto, Kyunghyun Cho: Gated Word-Character Recurrent Language Model. EMNLP 2016.

Thomas Müller, Helmut Schmid, Hinrich Schütze: Efficient Higher-Order CRFs for Morphological Tagging. EMNLP 2013.

Carolina Parada, Mark Dredze, Abhinav Sethy, Ariya Rastrow: Learning Sub-Word Units for Open Vocabulary Speech Recognition. ACL 2011.

Fuchun Peng, Dale Schuurmans, Vlado Keselj, Shaojun Wang: Language Independent Authorship Attribution using Character Level Language Models. EACL 2003.

Barbara Plank, Anders Søgaard, Yoav Goldberg: Multilingual Part-of-Speech Tagging with Bidirectional Long Short-Term Memory Models and Auxiliary Loss. ACL 2016.

Pushpendre Rastogi, Ryan Cotterell, Jason Eisner: Weighting Finite-State Transductions With Neural Context. NAACL 2016.

Adwait Ratnaparkhi: A Maximum Entropy Model for Part-of-Speech Tagging. EMNLP 1996.

Hassan Sajjad, Helmut Schmid, Alexander Fraser, Hinrich Schütze: Statistical Models for Unsupervised, Semi-supervised and Supervised Transliteration Mining. Computational Linguistics 2016.

Hinrich Schütze: Word Space. NIPS 1992.

Hinrich Schütze: Nonsymbolic Text Representation. EACL 2017.

Terrence J. Sejnowski, Charles R. Rosenberg: Parallel networks that learn to pronounce English text. Complex systems 1987.

Rico Sennrich, Barry Haddow, Alexandra Birch: Neural Machine Translation of Rare Words with Subword Units. ACL 2016.

M. Ali Basha Shaik, Amr El-Desoky Mousa, Ralf Schlüter, Hermann Ney: Hybrid Language Models Using Mixed Types of Sub-lexical Units for Open Vocabulary German LVCSR. Interspeech 2011.

SLIDE 39

References (5)

M. Ali Basha Shaik, Amr El-Desoky Mousa, Ralf Schlüter, Hermann Ney: Feature-rich Sub-lexical Language Models Using a Maximum Entropy Approach for German LVCSR. Interspeech 2013.

Claude E. Shannon: Prediction and Entropy of Printed English. BSTJ, vol. 30, 1951.

Henning Sperr, Jan Niehues, Alexander Waibel: Letter N-Gram-Based Input Encoding for Continuous Space Language Models. CVSC 2013.

Rupesh K. Srivastava, Klaus Greff, Jürgen Schmidhuber: Highway Networks. ICML 2015.

Ilya Sutskever, James Martens, Geoffrey Hinton: Generating Text with Recurrent Neural Networks. ICML 2011.

Jörg Tiedemann, Preslav Nakov: Analyzing the Use of Character-Level Translation with Sparse and Noisy Datasets. RANLP 2013.

Dimitra Vergyri, Katrin Kirchhoff, Kevin Duh, Andreas Stolcke: Morphology-Based Language Modeling for Arabic Speech Recognition. Interspeech 2004.

David Vilar, Jan-T. Peter, Hermann Ney: Can We Translate Letters? WSMT 2007.

Rudra Murthy V, Mitesh Khapra, Dr. Pushpak Bhattacharyya: Sharing Network Parameters for Crosslingual Named Entity Recognition. arXiv 2016.

Ekaterina Vylomova, Trevor Cohn, Xuanli He, Gholamreza Haffari: Word Representation Models for Morphologically Rich Languages in Neural Machine Translation. arXiv 2016.

Linlin Wang, Zhu Cao, Yu Xia, Gerard de Melo: Morphological Segmentation with Window LSTM Neural Networks. AAAI 2016.

John Wieting, Mohit Bansal, Kevin Gimpel, Karen Livescu: CHARAGRAM: Embedding Words and Sentences via Character n-Grams. EMNLP 2016.

Yonghui Wu et al.: Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. Technical Report 2016.

Yijun Xiao, Kyunghyun Cho: Efficient Character-level Document Classification by Combining Convolution and Recurrent Layers. arXiv 2016.

Yadollah Yaghoobzadeh, Hinrich Sch¨ utze: Multi-level Representations for Fine-Grained Typing of Knowledge Base Entities. EACL 2017.

Zhiling Yang, Ruslan Salakhutdinov, William Cohen: Multi-Task Cross-Lingual Sequence Tagging from Scratch. arXiv 2016.

Xiang Zhang, Junbo Zhao, Yann LeCun: Character-level Convolutional Networks for Text Classification. NIPS 2015.