SLIDE 1

C2NLU: An Overview

Heike Adel CIS, LMU Munich Dagstuhl January 23, 2017

SLIDE 2

Contents

W e l c o m e _ t o _ m y _ t a l k

NLU greeting

◮ Motivation

◮ Why do we want character-based models?

◮ Previous work

◮ Which character-based models/research exist?

◮ Conclusion

◮ Which challenges/open questions need to be considered?

SLIDE 3

Traditional NLP/NLU

Typical NLP/NLU processing pipeline [Gillick et al. 2015]:

[Pipeline figure: document → tokenization (language-specific) → token sequence → segmentation into sentences → sentences → syntactic analysis (per sentence) → POS tags, syntactic dependencies → semantic analysis → NE tags, semantic roles, ... → NLU]

◮ Pipeline of different modules: prone to subsequent errors
◮ We usually cannot recover from errors (e.g., in tokenization)

SLIDE 4

Idea: C2NLU

◮ Most extreme view: character end-to-end model that gets rid of the traditional pipeline entirely

Is this reasonable?

[Pipeline figure repeated from slide 3]

SLIDE 5

C2NLU: Direct models of data

◮ Traditional machine learning: based on feature engineering

◮ Tokens = manually designed features for NLU models

◮ In contrast: deep learning:

◮ Models can directly access the data, e.g., pixels in vision, acoustic signals in speech recognition

◮ Models learn their own representation (“features”) of the data

⇒ Character-based models: raw-data approach for text

SLIDE 6

C2NLU: Tokenization-free models

◮ Tokenization is difficult

◮ English: some difficult cases

[Yahoo!, San Francisco-Los Angeles flights, #starwars]

◮ Chinese: tokens are not separated by spaces
◮ German: compounds

[Donaudampfschifffahrtsgesellschaftskapitänsmütze → the hat of the captain of the association for shipping with steam-powered vessels on the Danube]

◮ Turkish: agglutinative language

[Bayramlaşamadıklarımızdandır → He is among those with whom we haven’t been able to exchange Season’s greetings]

◮ Problem: difficult/inefficient to correct tokenization decisions

SLIDE 7

C2NLU: More robust against noise

◮ Robust against small perturbations of input
◮ Examples: letter insertions, deletions, substitutions, transpositions

[commputer, compter, compuzer, comptuer]

◮ Examples: space insertions, space deletions

[guacamole → gua camole, ran fast → ranfast]
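To make these noise types concrete, here is a minimal Python sketch (the function name perturb and the edit inventory are ours, for illustration) that applies one random letter-level edit to a word:

import random

def perturb(word, rng=random.Random(0)):
    # Apply one random letter-level edit: insertion, deletion,
    # substitution, or transposition.
    letters = "abcdefghijklmnopqrstuvwxyz"
    i = rng.randrange(len(word))
    op = rng.choice(["insert", "delete", "substitute", "transpose"])
    if op == "insert":
        return word[:i] + rng.choice(letters) + word[i:]
    if op == "delete":
        return word[:i] + word[i + 1:]
    if op == "substitute":
        return word[:i] + rng.choice(letters) + word[i + 1:]
    j = (i + 1) % len(word)          # transpose: swap positions i and j
    chars = list(word)
    chars[i], chars[j] = chars[j], chars[i]
    return "".join(chars)

print([perturb("computer") for _ in range(4)])

A character-level model sees the perturbed string as a slightly different character sequence, whereas a token-level model maps it to an unrelated (or unknown) vocabulary entry.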

SLIDE 8

C2NLU: Robust morphological processing

◮ If we model the sequence of characters of a token, we can in principle learn all morphological regularities:

◮ Inflections
◮ Derivations
◮ Wide range of morphological processes (vowel harmony, agglutination, reduplication, ...)

◮ Modeling words would, e.g., ignore that many words share a common root, prefix or suffix

◮ ⇒ C2NLU: promising framework for incorporating linguistic knowledge about morphology into statistical models

SLIDE 9

C2NLU: Orthographic productivity

◮ Character sequences are not arbitrary, but their predictability is limited

◮ Example: morphology

◮ Properties of names are predictable from character patterns

[Delonda → female name, osinopril → medication]

◮ Modifications of existing words

[staycation, dramedy, Obamacare]

◮ Non-morphological orthographic productivity

[cooooool, Watergate, Dieselgate]

◮ Sound symbolism, phonesthemes

[gl → glitter, gleam, glow]

◮ Onomatopoeia

[oink, tick tock]

SLIDE 10

C2NLU: Out-of-vocabulary (OOV)

◮ No OOVs in character input
◮ OOV generation possible, without the use of special mechanisms

◮ Possible application: names/transliterations in end-to-end machine translation

◮ Open question: How can character-based systems accurately generate OOVs?

SLIDE 11

Early work: Application specific character-based features

◮ The history of character-based features for ML models is long

◮ Information retrieval with character n-grams
[McNamee et al. 2004, Chen et al. 1997, Damashek 1995, Cavnar 1994, de Heer 1974]

◮ Grapheme-to-phoneme conversion
[Bisani et al. 2008, Kaplan et al. 1994, Sejnowski et al. 1987]

◮ Char align: bilingual character-level alignments [Church 1993]

◮ Prefix and suffix features for tagging rare words
[Müller et al. 2013, Ratnaparkhi 1996]

◮ Transliteration
[Sajjad et al. 2016, Li et al. 2004, Knight et al. 1998]

◮ Diacritics restoration [Mihalcea et al. 2002]

◮ POS induction (unsupervised, multilingual) [Clark 2003]

◮ Characters and character n-grams as features for NER
[Klein et al. 2003]

◮ Language identification [Alex 2005]

SLIDE 12

Early work (2): Language modeling and machine translation

◮ Character-based language modeling (non-neural)

“How well can the next letter of a text be predicted when the preceding N letters are known?” [Shannon 1951]

◮ Morpheme-level features for language models; application: speech recognition
[Shaik et al. 2013, Kirchhoff et al. 2006, Vergyri et al. 2004, Ircing et al. 2001]

◮ Language-independent character n-gram language models for authorship attribution [Peng et al. 2003]

◮ Hybrid word/subword n-gram language model for OOV words in speech recognition
[Parada et al. 2011, Shaik et al. 2011, Kombrink et al. 2010, Hirsimäki et al. 2006]

◮ Characters and character n-grams as input to Restricted Boltzmann Machine-based language models; application: machine translation [Sperr et al. 2013]

◮ Character-based machine translation (non-neural)

◮ Machine translation based on characters/character n-grams
[Tiedemann et al. 2013, Vilar et al. 2007, Lepage et al. 2005]

SLIDE 13

Categorization of previous work

◮ Three clusters [Schütze 2017]

◮ Tokenization-based models
◮ Bag-of-n-gram models
◮ End-to-end models

◮ But: also mixtures possible,
◮ e.g., tokenization-based bag-of-n-gram models
◮ e.g., bag-of-n-gram or tokenization-based models trained end-to-end

SLIDE 14

Tokenization-based models

◮ Character-level models based on tokenization (tokenization: necessary pre-processing step)

◮ Model input: tokenized text or individual tokens
◮ Example: word representations based on characters (e.g., for rare words or OOVs)

SLIDE 15

Tokenization-based models: Examples

Example: word representations based on characters

◮ (1) Average of character embeddings

[Figure: the character embeddings of t, a, b, e, l combined by summation/averaging (Σ)]

SLIDE 16

Tokenization-based models: Examples (2)

Example: word representations based on characters

◮ (2) Bidirectional RNN/LSTM over character embeddings

[Figure: a BiRNN/LSTM running over the character embeddings of t, a, b, e, l]

SLIDE 17

Tokenization-based models: Examples (3)

Example: word representations based on characters

◮ (3) CNN over character embeddings

[Figure: convolution with max pooling over the character embeddings of t, a, b, e, l]
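The three composition functions of slides 15-17 can be sketched side by side. A minimal PyTorch illustration with untrained parameters (the toy vocabulary and all dimensions are ours, chosen arbitrarily):

import torch
import torch.nn as nn

char2id = {c: i for i, c in enumerate("abcdefghijklmnopqrstuvwxyz")}
emb = nn.Embedding(26, 16)                            # character embeddings

ids = torch.tensor([[char2id[c] for c in "tabel"]])   # (1, 5)
x = emb(ids)                                          # (1, 5, 16)

# (1) average of character embeddings
avg_repr = x.mean(dim=1)                              # (1, 16)

# (2) bidirectional LSTM: concatenate final forward/backward states
bilstm = nn.LSTM(16, 32, bidirectional=True, batch_first=True)
_, (h, _) = bilstm(x)                                 # h: (2, 1, 32)
lstm_repr = torch.cat([h[0], h[1]], dim=-1)           # (1, 64)

# (3) CNN over characters + max-over-time pooling
conv = nn.Conv1d(16, 32, kernel_size=3, padding=1)
cnn_repr = conv(x.transpose(1, 2)).max(dim=2).values  # (1, 32)

print(avg_repr.shape, lstm_repr.shape, cnn_repr.shape)

All three map a character sequence of any length to a fixed-size vector, which is what lets them stand in for (or complement) a word embedding.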

SLIDE 18

Tokenization-based models: Examples (4)

How to integrate such character-based embeddings into a larger system?
⇒ Example: Hierarchical RNNs [Ling et al. 2016, Luong et al. 2016, Plank et al. 2016, Vylomova et al. 2016, Wang et al. 2016, Yang et al. 2016, Ballesteros et al. 2015]

[Figure: a char-based embedding per token (e.g., t a b e l → “table”) concatenated with its word embedding; a word-level RNN over the sentence “the red ... is in the kitchen” predicts the tags DET JJ NN V DET NN IN]

SLIDE 19

Tokenization-based models: Examples (5)

Hierarchical CNN + FF network [dos Santos et al. 2014ab/2015]

[Figure: a character CNN with max pooling produces a char-based embedding per token, concatenated (conc) with the word embedding and fed to a feed-forward network (NN)]

SLIDE 20

Tokenization-based models: Examples (6)

Hierarchical CNN+RNN [Chiu et al. 2016, Costa-Jussà et al. 2016, Jaech et al. 2016, Kim et al. 2016, V et al. 2016, Vylomova et al. 2016]

[Figure: a character CNN with max pooling per token, concatenated with word embeddings; a word-level RNN predicts the tags DET JJ NN V DET NN IN]

SLIDE 21

Tokenization-based models: Examples (7)

Character-Aware Neural Language Models [Kim et al.2016]

concatenation of character embeddings
→ convolution layer with multiple filters of different widths
→ max-over-time pooling layer
→ highway network [Srivastava et al. 2015]
→ long short-term memory network
→ softmax output to obtain distribution over next word
→ cross-entropy loss between next word and prediction

◮ Combining the character model output with word embeddings did not help in this study
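A hedged PyTorch sketch of the character-level word encoder in this pipeline (hyperparameters are illustrative, not the paper's exact settings; the word-level LSTM language model and softmax on top are omitted):

import torch
import torch.nn as nn

class CharWordEncoder(nn.Module):
    # char embeddings -> multi-width convolutions -> max-over-time
    # pooling -> one highway layer
    def __init__(self, n_chars=100, char_dim=15,
                 widths=(1, 2, 3, 4, 5), n_filters=25):
        super().__init__()
        self.emb = nn.Embedding(n_chars, char_dim)
        self.convs = nn.ModuleList(
            nn.Conv1d(char_dim, n_filters, w) for w in widths)
        out_dim = n_filters * len(widths)
        self.gate = nn.Linear(out_dim, out_dim)   # highway transform gate
        self.proj = nn.Linear(out_dim, out_dim)   # highway nonlinearity

    def forward(self, char_ids):                  # (batch, word_len)
        x = self.emb(char_ids).transpose(1, 2)    # (batch, dim, len)
        pooled = [conv(x).max(dim=2).values for conv in self.convs]
        h = torch.cat(pooled, dim=1)              # (batch, out_dim)
        t = torch.sigmoid(self.gate(h))           # t*g(Wh) + (1-t)*h
        return t * torch.relu(self.proj(h)) + (1 - t) * h

enc = CharWordEncoder()
words = torch.randint(0, 100, (2, 8))             # two 8-character words
print(enc(words).shape)                           # torch.Size([2, 125])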

SLIDE 22

Tokenization-based models: Overview

◮ Hierarchical approaches

char level | upper level | task               | examples
LSTM       | LSTM        | POS tagging/NER    | [Lample et al. 2016, Plank et al. 2016, Yang et al. 2016]
CNN        | FF network  | POS tagging/NER    | [dos Santos et al. 2014b, dos Santos et al. 2015]
CNN        | FF network  | sentiment analysis | [dos Santos et al. 2014a]
CNN        | LSTM        | NER                | [Chiu et al. 2016, V et al. 2016]
LSTM       | LSTM        | dependency parsing | [Ballesteros et al. 2015]

SLIDE 23

Tokenization-based models: Overview (2)

◮ Language identification (including code-switching/language inclusions) [Jaech et al. 2016, Ling 2015, Alex 2005]

◮ Neural language modeling with subword-level input (e.g., characters or morphemes) [Kim et al. 2016, Botha et al. 2014, Shaik et al. 2013, Sperr et al. 2013, Mikolov et al. 2012]

◮ Word representation learning from character- and/or morpheme-level word decompositions (e.g., for rare words) [Vylomova et al. 2016, Chen et al. 2015, Ling et al. 2015, Luong et al. 2013]

◮ Morphological processing of words [Cao et al. 2016, Cotterell et al. 2016, Faruqui et al. 2016, Kann et al. 2016, Rastogi et al. 2016, Wang et al. 2016]

◮ Neural machine translation of rare words with subword units [Costa-Jussà et al. 2016, Luong et al. 2016, Sennrich et al. 2016]

SLIDE 24

Bag-of-n-gram models

◮ Character n-gram embeddings, within-token and cross-token

◮ No notion of “words”/token boundaries necessary

◮ Application: embedding of a piece of text: sum of occurring character n-grams

◮ Example (4-grams; “_” marks a space):

Dagstuhl is located in Germany. → Dags agst gstu stuh tuhl uhl_ hl_i l_is _is_ is_l s_lo _loc loca ocat cate ated ted_ ed_i d_in _in_ in_G n_Ge _Ger Germ erma rman many any.
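Extracting these within- and cross-token n-grams is a one-liner; a small Python sketch (the function name is ours) reproduces the start of the example:

def char_ngrams(text, n=4):
    # all character n-grams, including ones that cross token
    # boundaries (spaces are treated as ordinary characters)
    return [text[i:i + n] for i in range(len(text) - n + 1)]

grams = char_ngrams("Dagstuhl is located in Germany.")
print([g.replace(" ", "_") for g in grams[:8]])
# ['Dags', 'agst', 'gstu', 'stuh', 'tuhl', 'uhl_', 'hl_i', 'l_is']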

SLIDE 25

Bag-of-n-gram models: Examples

WordSpace [Schütze 1992]

◮ Goal: fixed-length distributed semantic representations for text (corpus-based)

◮ Training of k-gram embeddings:
corpus → extract m top character k-grams (m = 5000, k = 4) → extract cooccurrence counts of k-grams → cooc. matrix → SVD → n-dim k-gram embeddings

◮ Application: obtaining vectors for a piece of text:
piece of text → extract k-grams in piece of text → sum k-gram embeddings → bag-of-k-gram embedding of text

◮ Evaluation: word sense disambiguation:
sum embeddings in surrounding contexts → context vectors → cluster ambiguous word context vectors → sense representation
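A toy end-to-end sketch of this pipeline in numpy (the corpus, the ±10-gram cooccurrence window, and n = 10 dimensions are stand-ins of our choosing; the original uses m = 5000 k-grams):

from collections import Counter
import numpy as np

def kgrams(text, k=4):
    return [text[i:i + k] for i in range(len(text) - k + 1)]

corpus = "dagstuhl is located in germany . dagstuhl hosts seminars ."
grams = kgrams(corpus)

# keep the m most frequent k-grams
vocab = [g for g, _ in Counter(grams).most_common(50)]
idx = {g: i for i, g in enumerate(vocab)}

# cooccurrence counts of k-grams within a +/-10-gram window
C = np.zeros((len(vocab), len(vocab)))
for i, g in enumerate(grams):
    if g in idx:
        for h in grams[max(0, i - 10):i + 11]:
            if h in idx:
                C[idx[g], idx[h]] += 1

# SVD of the cooccurrence matrix -> low-dimensional k-gram embeddings
U, S, _ = np.linalg.svd(C)
kgram_emb = U[:, :10] * S[:10]

def text_vector(text):
    # bag-of-k-gram embedding: sum the embeddings of its k-grams
    return np.sum([kgram_emb[idx[g]] for g in kgrams(text) if g in idx],
                  axis=0)

print(text_vector("located in germany").shape)    # (10,)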

SLIDE 26

Bag-of-n-gram models: Examples (2)

CHARAGRAM: Embedding Words and Sentences via Character n-grams [Wieting et al. 2016]

◮ Based on n-gram vectors (learned end-to-end)

◮ Obtaining word/sequence embeddings: summing character n-gram vectors

$g_{\text{CHAR}}(x) = h\Big(b + \sum_{i=1}^{m+1} \sum_{j=1+i-k}^{i} \mathbb{I}[x_j^i \in V]\, W_{x_j^i}\Big)$

◮ Evaluation: word similarity, sentence similarity, part-of-speech tagging
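A minimal numpy sketch of this summation (the vectors are random stand-ins for the end-to-end-trained parameters; as a simplification, every observed n-gram up to length k is admitted to V, b is zero, and h is tanh):

import numpy as np

rng = np.random.default_rng(0)
V = {}                               # n-gram -> vector, created on demand

def gram_vec(g, dim=25):
    if g not in V:
        V[g] = rng.normal(size=dim)
    return V[g]

def charagram(text, k=3, dim=25):
    # h(b + sum of the vectors of all character n-grams); b = 0 here
    total = np.zeros(dim)
    padded = " " + text + " "        # mark sequence boundaries
    for n in range(1, k + 1):
        for i in range(len(padded) - n + 1):
            total += gram_vec(padded[i:i + n], dim)
    return np.tanh(total)

print(charagram("table")[:4])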

SLIDE 27

Bag-of-n-gram models: Overview

◮ Word sense disambiguation [Schütze 1992]
◮ Language identification [Baldwin et al. 2010, Dunning 1994]
◮ Text retrieval [Kettunen et al. 2010, McNamee et al. 2004, Cavnar 1994]
◮ Topic labeling [Kou et al. 2015]
◮ Word/text similarity [Bojanowski et al. 2017, Eyecioglu et al. 2016, Wieting 2016]

SLIDE 28

End-to-end models

◮ Input: sequence of characters or bytes
◮ Directly optimized on an objective
◮ ⇒ Tokenization-free
◮ ⇒ Models learn their own, task-specific representations

SLIDE 29

End-to-end models: Examples

Multilingual Language Processing From Bytes [Gillick et al. 2015]

◮ Segments can begin and end mid-word or even mid-character
◮ Same model for different languages
◮ Evaluation: POS tagging, NER
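Why a segment can end mid-character: in a byte-level model a multi-byte UTF-8 character may be split across segments. A two-line Python illustration (the example string is ours):

text = "Münchner Straße"
print(list(text.encode("utf-8"))[:6])   # [77, 195, 188, 110, 99, 104]
# 'ü' occupies two bytes (195, 188); a byte window can cut between them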

SLIDE 30

End-to-end models: Examples (2)

Efficient Character-level Document Classification by Combining Convolution and Recurrent Layers [Xiao et al. 2016]

[Figure: (x1, x2, …, xT) → embedding layer → convolutional layers → recurrent layers → classification layer → p(y|X)]

◮ Motivation: the receptive field of a convolutional layer is small (usually 3-7)
⇒ many layers necessary to capture long-term dependencies
⇒ combine convolutional networks with recurrent networks

◮ Evaluation: sentiment analysis, ontology classification, question type classification, news categorization
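The receptive-field arithmetic behind this motivation (a standard fact for stride-1, undilated convolutions, stated here for illustration): $L$ stacked convolutions of width $k$ see only $r(L) = L\,(k-1) + 1$ positions, so, e.g., ten layers of width 5 still cover only $10 \cdot 4 + 1 = 41$ characters; hence the recurrent layers on top.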

SLIDE 31

End-to-end models: Examples (3)

Generating Text with Recurrent Neural Networks [Sutskever et al. 2011]

◮ RNN with multiplicative (“gated”) connections for character-level language modeling

◮ Multiplicative connections: each input character specifies a different hidden-to-hidden weight matrix:

$W_{hh}^{(x_t)} = \sum_{m=1}^{M} x_t^{(m)} W_{hh}^{(m)}$

◮ Qualitative analysis of generated text:

◮ Mostly with linguistic structure and large vocabulary
◮ Only a few uncapitalized non-words
◮ Balanced parentheses and quotes over long distances (e.g., 30 characters)
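A minimal numpy sketch of one multiplicative-RNN step (note the paper additionally factorizes $W_{hh}$ so the $M$ matrices share parameters; this unfactorized version, with dimensions of our choosing, is for illustration only):

import numpy as np

M, H = 30, 50                     # alphabet size, hidden size
rng = np.random.default_rng(0)
W_hh = rng.normal(scale=0.01, size=(M, H, H))  # one matrix per character
W_xh = rng.normal(scale=0.01, size=(M, H))

def step(h, char_id):
    # W_hh^{(x_t)} = sum_m x_t^{(m)} W_hh^{(m)}; with a one-hot x_t
    # this reduces to looking up the character's matrix
    W = W_hh[char_id]
    return np.tanh(W @ h + W_xh[char_id])

h = np.zeros(H)
for c in [3, 1, 4]:               # a toy character-id sequence
    h = step(h, c)
print(h.shape)                    # (50,)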

SLIDE 32

End-to-end models: Overview

◮ Machine translation, character/subword input/output [Chung et al. 2016, Wu et al. 2016, Vilar et al. 2007]

◮ Generating character sequences [Sutskever et al. 2011]

◮ Text/document classification with character-level models [Xiao et al. 2016, Zhang et al. 2015]

◮ Sequence labeling [byte input: Gillick et al. 2015, character input: Ma et al. 2016]

◮ Unsupervised, language-independent identification of phrases or words (“meaningful subparts of language”) [Gerdjikov et al. 2016]

◮ Question answering with character-level encoder-decoder [Golub et al. 2016]

◮ Language modeling with combined word-character input [Miyamoto et al. 2016]

◮ Entity typing based on character sequence of name [Yaghoobzadeh et al. 2017]

SLIDE 33

C2NLU: Wrap-up

◮ C2NLU: possibility to overcome the traditional NLP/NLU pipeline

◮ Direct modeling of input data (i.e., character or byte sequences)

◮ No tokenization necessary, character segments can span token boundaries

◮ Previous work shows:
◮ Models without notion of words/tokens can achieve state-of-the-art results
◮ Character-level models especially suited for rare words, morphologically rich languages, language-independent NLU

SLIDE 34

C2NLU: Open questions from proposal

◮ What is the relationship between morphology and character-level models?

◮ Will character-level models reach and/or surpass token-based models and, if so, in which subareas of NLU?

◮ Using character-level models, can we realize universal multilingual processing (English, Chinese, Turkish, etc.)?

◮ If domain knowledge injected into ML models no longer consists of tokenization rules and morphological expertise, what would replace it?

◮ Detecting syntactic and semantic relationships at the character level is more expensive than at the word level. How can we address the resulting challenges in scalability for character-level models?

◮ How can C2NLU systems accurately generate OOVs?

SLIDE 35

References

Beatrice Alex: An Unsupervised System for Identifying English Inclusions in German Text. ACL 2005.

Timothy Baldwin, Marco Lui: Language Identification: The Long and the Short of the Matter. NAACL 2010.

Miguel Ballesteros, Chris Dyer, Noah A. Smith: Improved Transition-Based Parsing by Modeling Characters instead of Words with LSTMs. EMNLP 2015.

Maximilian Bisani, Hermann Ney: Joint-sequence models for grapheme-to-phoneme conversion. Speech communication 2008.

Piotr Bojanowski, Edouard Grave, Armand Joulin, Tomáš Mikolov: Enriching Word Vectors with Subword Information. TACL 2017.

Jan A. Botha, Phil Blunsom: Compositional Morphology for Word Representations and Language Modelling. ICML 2014.

Kris Cao, Marek Rei: A Joint Model for Word Embedding and Word Morphology. ACL 2016.

William B. Cavnar: Using An N-Gram-Based Document Representation With a Vector Processing Retrieval Model. NIST Special Publication 1995.

Aitao Chen, Jianzhang He, Liangjie Xu, Jason Meggs: Chinese Text Retrieval Without Using a Dictionary. SIGIR 1997.

Xinxiong Chen, Lei Xu, Zhiyuan Liu, Maosong Sun, Huanbo Luan: Joint Learning of Character and Word Embeddings. IJCAI 2015.

Jason P.C. Chiu, Eric Nichols: Named Entity Recognition with Bidirectional LSTM-CNNs. TACL vol. 4, 2016.

Junyoung Chung, Kyunghyun Cho, Yoshua Bengio: A Character-Level Decoder without Explicit Segmentation for Neural Machine Translation. ACL 2016.

Kenneth W. Church: Char align: A Program for Aligning Parallel Texts at the Character Level. ACL 1993.

Alexander Clark: Combining Distributional and Morphological Information for Part of Speech Induction. EACL 2003.

Marta R. Costa-jussà, José A.R. Fonollosa: Character-based Neural Machine Translation. ACL 2016.

Ryan Cotterell, Tim Vieira, Hinrich Schütze: A Joint Model of Orthography and Morphological Segmentation. NAACL 2016.

Marc Damashek: Gauging Similarity with n-Grams: Language-Independent Categorization of Text. Science 1995.

SLIDE 36

References (2)

T. de Heer: Experiments with Syntactic Traces in Information Retrieval. Information Storage and Retrieval, vol. 10, 1974.

Cícero N. dos Santos, Bianca Zadrozny: Learning Character-Level Representations for Part-of-Speech Tagging. ICML 2014.

Cícero N. dos Santos, Maíra Gatti: Deep Convolutional Neural Networks for Sentiment Analysis of Short Texts. COLING 2014.

Cícero N. dos Santos, Victor Guimarães: Boosting Named Entity Recognition with Neural Character Embeddings. Named Entity Workshop 2015.

Ted Dunning: Statistical Identification of Language. Technical Report MCCS 940-273. 1994.

Asli Eyecioglu, Bill Keller: ASOBEK at SemEval-2016 Task 1: Sentence Representation with Character N-gram Embeddings for Semantic Textual Similarity. SemEval 2016.

Manaal Faruqui, Yulia Tsvetkov, Graham Neubig, Chris Dyer: Morphological Inflection Generation Using Character Sequence to Sequence Learning. NAACL 2016.

Stefan Gerdjikov, Klaus U. Schulz: Corpus analysis without prior linguistic knowledge - unsupervised mining of phrases and subphrase structure. arXiv 2016.

Dan Gillick, Cliff Brunk, Oriol Vinyals, Amarnag Subramanya: Multilingual Language Processing From Bytes. NAACL 2016.

David Golub, Xiaodong He: Character-Level Question Answering with Attention. EMNLP 2016.

Haizhou Li, Zhang Min, Su Jian: A Joint Source-Channel Model for Machine Transliteration. ACL 2004.

Teemu Hirsimäki, Mathias Creutz, Vesa Siivola, Mikko Kurimo, Sami Virpioja, Janne Pylkkönen: Unlimited Vocabulary Speech Recognition with Morph Language Models Applied to Finnish. Computer Speech & Language, vol. 20, 2006.

P. Ircing, P. Krebc, J. Hajic, S. Khudanpur, F. Jelinek, J. Psutka, W. Byrne: On Large Vocabulary Continuous Speech Recognition of Highly Inflectional Language - Czech. Eurospeech 2001.

Aaron Jaech, George Mulcaire, Shobhit Hathi, Mari Ostendorf, Noah A. Smith: Hierarchical Character-Word Models for Language Identification. SocialNLP 2016.

Moonyoung Kang, Tim Ng, Long Nguyen: Mandarin word-character hybrid-input Neural Network Language Model. Interspeech 2011.

Katharina Kann, Hinrich Schütze: MED: The LMU System for the SIGMORPHON 2016 Shared Task on Morphological Reinflection. SIGMORPHON 2016.

SLIDE 37

References (3)

Katharina Kann, Hinrich Schütze: Single-Model Encoder-Decoder with Explicit Morphological Representation for Reinflection. ACL 2016.

Katharina Kann, Ryan Cotterell, Hinrich Schütze: Neural Morphological Analysis: Encoding-Decoding Canonical Segments. EMNLP 2016.

Ronald M. Kaplan, Martin Kay: Regular Models of Phonological Rule Systems. Computational Linguistics, vol. 20, 1994.

Kimmo Kettunen, Paul McNamee, Feza Baskaya: Using Syllables as Indexing Terms in Full-Text Information Retrieval. Baltic HLT 2010.

Yoon Kim, Yacine Jernite, David Sontag, Alexander M. Rush: Character-Aware Neural Language Models. AAAI 2016.

Katrin Kirchhoff, Dimitra Vergyri, Jeff Bilmes, Kevin Duh, Andreas Stolcke: Morphology-Based Language Modeling for Conversational Arabic Speech Recognition. Computer Speech & Language, vol. 20, 2006.

Dan Klein, Joseph Smarr, Huy Nguyen, Christopher D. Manning: Named Entity Recognition with Character-Level Models. CoNLL 2003.

Kevin Knight, Jonathan Graehl: Machine transliteration. Computational Linguistics, vol. 24, 1998.

Stefan Kombrink, Mirko Hannemann, Lukáš Burget, Hynek Heřmanský: Recovery of Rare Words in Lecture Speech. TSD 2010.

Wanqiu Kou, Fang Li, Timothy Baldwin: Automatic Labelling of Topic Models using Word Vectors and Letter Trigram Vectors. AIRS 2015.

Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, Chris Dyer: Neural Architectures for Named Entity Recognition. NAACL 2016.

Yves Lepage, Etienne Denoual: Purest Ever Example-Based Machine Translation: Detailed Presentation and Assessment. Machine Translation, vol. 19, 2005.

Wang Ling, Tiago Luís, Luís Marujo, Ramón Fernandez Astudillo, Silvio Amir, Chris Dyer, Alan W. Black, Isabel Trancoso: Finding Function in Form: Compositional Character Models for Open Vocabulary Word Representation. EMNLP 2015.

Minh-Thang Luong, Richard Socher, Christopher D. Manning: Better Word Representations with Recursive Neural Networks for Morphology. CoNLL 2013.

Minh-Thang Luong, Christoper D. Manning: Achieving Open Vocabulary Neural Machine Translation with Hybrid Word-Character Models. ACL 2016.

SLIDE 38

References (4)

Xuezhe Ma, Eduard Hovy: End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF. ACL 2016.

Paul McNamee, James Mayfield: Character N-Gram Tokenization for European Language Text Retrieval. Information Retrieval, vol. 7, 2004.

Rada Mihalcea, Vivi Nastase: Letter Level Learning for Language Independent Diacritics Restoration. CoNLL 2002.

Tomáš Mikolov, Ilya Sutskever, Anoop Deoras, Hai-Son Le, Stefan Kombrink, Jan Černocký: Subword Language Modeling with Neural Networks. 2012.

Yasumasa Miyamoto, Kyunghyun Cho: Gated Word-Character Recurrent Language Model. EMNLP 2016.

Thomas Müller, Helmut Schmid, Hinrich Schütze: Efficient Higher-Order CRFs for Morphological Tagging. EMNLP 2013.

Carolina Parada, Mark Dredze, Abhinav Sethy, Ariya Rastrow: Learning Sub-Word Units for Open Vocabulary Speech Recognition. ACL 2011.

Fuchun Peng, Dale Schuurmans, Vlado Keselj, Shaojun Wang: Language Independent Authorship Attribution using Character Level Language Models. EACL 2003.

Barbara Plank, Anders Søgaard, Yoav Goldberg: Multilingual Part-of-Speech Tagging with Bidirectional Long Short-Term Memory Models and Auxiliary Loss. ACL 2016.

Pushpendre Rastogi, Ryan Cotterell, Jason Eisner: Weighting Finite-State Transductions With Neural Context. NAACL 2016.

Adwait Ratnaparkhi: A Maximum Entropy Model for Part-of-Speech Tagging. EMNLP 1996.

Hassan Sajjad, Helmut Schmid, Alexander Fraser, Hinrich Schütze: Statistical Models for Unsupervised, Semi-supervised and Supervised Transliteration Mining. Computational Linguistics 2016.

Hinrich Schütze: Word Space. NIPS 1992.

Hinrich Schütze: Nonsymbolic Text Representation. EACL 2017.

Terrence J. Sejnowski, Charles R. Rosenberg: Parallel networks that learn to pronounce English text. Complex systems 1987.

Rico Sennrich, Barry Haddow, Alexandra Birch: Neural Machine Translation of Rare Words with Subword Units. ACL 2016.

M. Ali Basha Shaik, Amr El-Desoky Mousa, Ralf Schlüter, Hermann Ney: Hybrid Language Models Using Mixed Types of Sub-lexical Units for Open Vocabulary German LVCSR. Interspeech 2011.

SLIDE 39

References (5)

M. Ali Basha Shaik, Amr El-Desoky Mousa, Ralf Schlüter, Hermann Ney: Feature-rich Sub-lexical Language Models Using a Maximum Entropy Approach for German LVCSR. Interspeech 2013.

Claude E. Shannon: Prediction and Entropy of Printed English. BSTJ, vol. 30, 1951.

Henning Sperr, Jan Niehues, Alexander Waibel: Letter N-Gram-Based Input Encoding for Continuous Space Language Models. CVSC 2013.

Rupesh K. Srivastava, Klaus Greff, Jürgen Schmidhuber: Highway Networks. ICML 2015.

Ilya Sutskever, James Martens, Geoffrey Hinton: Generating Text with Recurrent Neural Networks. ICML 2011.

Jörg Tiedemann, Preslav Nakov: Analyzing the Use of Character-Level Translation with Sparse and Noisy Datasets. RANLP 2013.

Dimitra Vergyri, Katrin Kirchhoff, Kevin Duh, Andreas Stolcke: Morphology-Based Language Modeling for Arabic Speech Recognition. Interspeech 2004.

David Vilar, Jan-T. Peter, Hermann Ney: Can We Translate Letters? WSMT 2007.

Rudra Murthy V, Mitesh Khapra, Dr. Pushpak Bhattacharyya: Sharing Network Parameters for Crosslingual Named Entity Recognition. arXiv 2016.

Ekaterina Vylomova, Trevor Cohn, Xuanli He, Gholamreza Haffari: Word Representation Models for Morphologically Rich Languages in Neural Machine Translation. arXiv 2016.

Linlin Wang, Zhu Cao, Yu Xia, Gerard de Melo: Morphological Segmentation with Window LSTM Neural Networks. AAAI 2016.

John Wieting, Mohit Bansal, Kevin Gimpel, Karen Livescu: CHARAGRAM: Embedding Words and Sentences via Character n-Grams. EMNLP 2016.

Yonghui Wu et al.: Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. Technical Report 2016.

Yijun Xiao, Kyunghyun Cho: Efficient Character-level Document Classification by Combining Convolution and Recurrent Layers. arXiv 2016.

Yadollah Yaghoobzadeh, Hinrich Sch¨ utze: Multi-level Representations for Fine-Grained Typing of Knowledge Base Entities. EACL 2017.

Zhiling Yang, Ruslan Salakhutdinov, William Cohen: Multi-Task Cross-Lingual Sequence Tagging from Scratch. arXiv 2016.

Xiang Zhang, Junbo Zhao, Yann LeCun: Character-level Convolutional Networks for Text Classification. NIPS 2015.