

slide-1
SLIDE 1

Character-Aware Neural Language Models

Yoon Kim, Yacine Jernite, David Sontag, Alexander Rush

Harvard SEAS and New York University

Code: https://github.com/yoonkim/lstm-char-cnn

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 1 / 76

slide-2
SLIDE 2

Language Model

Language Model (LM): probability distribution over a sequence of words, p(w1, . . . , wT), for any sequence of length T from a vocabulary V (with wi ∈ V for all i). Important for many downstream applications: machine translation, speech recognition, text generation.

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 2 / 76

slide-3
SLIDE 3

Count-based Language Models

By the chain rule, any distribution can be factorized as

p(w1, . . . , wT) = ∏_{t=1}^{T} p(wt | w1, . . . , wt−1)

Count-based n-gram language models make a Markov assumption:

p(wt | w1, . . . , wt−1) ≈ p(wt | wt−n, . . . , wt−1)

Need smoothing to deal with rare n-grams.
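A minimal sketch of the Markov factorization, assuming a toy corpus and simple add-one smoothing rather than the Kneser-Ney smoothing used as a baseline later in this deck:

    from collections import Counter

    corpus = "the cat sat on the mat the cat ran".split()
    vocab = set(corpus)

    unigram_counts = Counter(corpus)
    bigram_counts = Counter(zip(corpus, corpus[1:]))

    def p(w, prev, alpha=1.0):
        # add-alpha smoothing so unseen bigrams get non-zero probability
        return (bigram_counts[(prev, w)] + alpha) / (unigram_counts[prev] + alpha * len(vocab))

    # Markov (n = 2) factorization: p(w1, ..., wT) ~ p(w1) * prod_t p(wt | wt-1)
    prob = unigram_counts[corpus[0]] / len(corpus)
    for prev, w in zip(corpus, corpus[1:]):
        prob *= p(w, prev)
    print(prob)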

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 3 / 76

slide-4
SLIDE 4

Neural Language Models

Neural Language Models (NLM): represent words as dense vectors in Rn (word embeddings).

wt ∈ R|V| : one-hot representation of word ∈ V at time t
⇒ xt = Xwt : word embedding (X ∈ Rn×|V|, n < |V|)

Train a neural net that composes the history to predict the next word:

p(wt = j | w1, . . . , wt−1) = exp(pj · g(x1, . . . , xt−1) + qj) / Σ_{j′∈V} exp(pj′ · g(x1, . . . , xt−1) + qj′)

i.e. p(wt | w1, . . . , wt−1) = softmax(P g(x1, . . . , xt−1) + q)

pj ∈ Rm, qj ∈ R : output word embedding/bias for word j ∈ V
g : composition function
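An illustrative numpy sketch of this output layer; the toy dimensions and the averaging composition function g are placeholder assumptions, not any of the models described on later slides:

    import numpy as np

    rng = np.random.default_rng(0)
    V, n, m = 10, 8, 16                      # toy vocabulary size, embedding dim, hidden dim

    X = rng.normal(size=(n, V))              # input word embeddings (columns of X)
    P = rng.normal(size=(V, m))              # output word embeddings p_j (rows of P)
    q = np.zeros(V)                          # output biases q_j
    W_g = rng.normal(size=(m, n))            # parameters of the toy composition function g

    def g(x_history):
        # stand-in composition: average the history embeddings, then a tanh layer
        return np.tanh(W_g @ x_history.mean(axis=1))

    def next_word_distribution(history_ids):
        x_history = X[:, history_ids]                    # embed the history words
        scores = P @ g(x_history) + q                    # p_j . g(x_1, ..., x_{t-1}) + q_j
        scores -= scores.max()                           # numerical stability
        return np.exp(scores) / np.exp(scores).sum()     # softmax over the vocabulary

    print(next_word_distribution([1, 4, 7]).round(3))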

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 4 / 76

slide-5
SLIDE 5

Feed-forward NLM (Bengio, Ducharme, and Vincent 2003)

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 5 / 76

slide-6
SLIDE 6

Feed-forward NLM (Bengio, Ducharme, and Vincent 2003)

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 6 / 76

slide-7
SLIDE 7

Feed-forward NLM (Bengio, Ducharme, and Vincent 2003)

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 7 / 76

slide-8
SLIDE 8

Feed-forward NLM (Bengio, Ducharme, and Vincent 2003)

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 8 / 76

slide-9
SLIDE 9

Recurrent Neural Network LM (Mikolov et al. 2011)

Maintain a hidden state vector ht that is recursively calculated:

ht = f(Wxt + Uht−1 + b)

ht ∈ Rm : hidden state at time t (summary of history)
W ∈ Rm×n : input-to-hidden transformation
U ∈ Rm×m : hidden-to-hidden transformation
f(·) : non-linearity

Apply softmax to ht.
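A small numpy sketch of the recurrent update with toy dimensions; the softmax output layer from slide 4 would sit on top of ht:

    import numpy as np

    rng = np.random.default_rng(0)
    n, m = 8, 16                             # embedding dim, hidden dim (toy values)
    W = rng.normal(size=(m, n)) * 0.1        # input-to-hidden
    U = rng.normal(size=(m, m)) * 0.1        # hidden-to-hidden
    b = np.zeros(m)

    def rnn_step(x_t, h_prev):
        return np.tanh(W @ x_t + U @ h_prev + b)

    h = np.zeros(m)                          # h_0
    for _ in range(5):                       # toy sequence of 5 word embeddings
        x_t = rng.normal(size=n)
        h = rnn_step(x_t, h)                 # h_t summarizes w_1, ..., w_t
    # the next-word distribution would then be softmax(P @ h + q), as on slide 4
    print(h.shape)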

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 9 / 76

slide-10
SLIDE 10

Recurrent Neural Network LM (Mikolov et al. 2011)

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 10 / 76

slide-11
SLIDE 11

Recurrent Neural Network LM (Mikolov et al. 2011)

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 11 / 76

slide-12
SLIDE 12

Recurrent Neural Network LM (Mikolov et al. 2011)

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 12 / 76

slide-13
SLIDE 13

Word Embeddings (Collobert et al. 2011; Mikolov et al. 2012)

Key ingredient in Neural Language Models. After training, similar words are close in the vector space. (Not unique to NLMs)

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 13 / 76

slide-14
SLIDE 14

NLM Performance (on Penn Treebank)

Difficult/expensive to train, but performs well.

Language Model                                   Perplexity
5-gram count-based (Mikolov and Zweig 2012)         141.2
RNN (Mikolov and Zweig 2012)                         124.7
Deep RNN (Pascanu et al. 2013)                       107.5
LSTM (Zaremba, Sutskever, and Vinyals 2014)           78.4

Renewed interest in language modeling.

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 14 / 76

slide-15
SLIDE 15

NLM Issue

Issue: the fundamental unit of information is still the word. Separate embeddings for “trading”, “leading”, “training”, etc.

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 15 / 76

slide-16
SLIDE 16

NLM Issue

Issue: the fundamental unit of information is still the word. Separate embeddings for “trading”, “trade”, “trades”, etc.

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 16 / 76

slide-17
SLIDE 17

NLM Issue

No parameter sharing across orthographically similar words. Orthography contains much semantic/syntactic information. How can we leverage subword information for language modeling?

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 17 / 76

slide-18
SLIDE 18

Previous (NLM-based) Work

Use a morphological segmenter as a preprocessing step:

unfortunately ⇒ un (prefix) + fortunate (stem) + ly (suffix)

Luong, Socher, and Manning 2013: Recursive Neural Network over morpheme embeddings
Botha and Blunsom 2014: sum over word/morpheme embeddings

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 18 / 76

slide-19
SLIDE 19

This Work

Main Idea: No morphology, use characters directly.

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 19 / 76

slide-20
SLIDE 20

This Work

Main Idea: No morphology, use characters directly.

Convolutional Neural Networks (CNN) (LeCun et al. 1989):
Central to deep learning systems in vision.
Shown to be effective for NLP tasks (Collobert et al. 2011).
CNNs in NLP typically involve temporal (rather than spatial) convolutions over words.

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 19 / 76

slide-21
SLIDE 21

Network Architecture: Overview

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 20 / 76

slide-22
SLIDE 22

Character-level CNN (CharCNN)

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 21 / 76

slide-23
SLIDE 23

Character-level CNN (CharCNN)

C ∈ Rd×l : matrix representation of word (of length l)
H ∈ Rd×w : convolutional filter matrix
d : dimensionality of character embeddings (e.g. 15)
w : width of convolution filter (e.g. 1–7)

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 22 / 76

slide-24
SLIDE 24

Character-level CNN (CharCNN)

C ∈ Rd×l : matrix representation of word (of length l)
H ∈ Rd×w : convolutional filter matrix
d : dimensionality of character embeddings (e.g. 15)
w : width of convolution filter (e.g. 1–7)

1. Apply a convolution between C and H to obtain a vector f ∈ Rl−w+1:

f[i] = ⟨C[∗, i : i + w − 1], H⟩

where ⟨A, B⟩ = Tr(ABᵀ) is the Frobenius inner product.

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 22 / 76

slide-25
SLIDE 25

Character-level CNN (CharCNN)

C ∈ Rd×l : matrix representation of word (of length l)
H ∈ Rd×w : convolutional filter matrix
d : dimensionality of character embeddings (e.g. 15)
w : width of convolution filter (e.g. 1–7)

1. Apply a convolution between C and H to obtain a vector f ∈ Rl−w+1:

f[i] = ⟨C[∗, i : i + w − 1], H⟩

where ⟨A, B⟩ = Tr(ABᵀ) is the Frobenius inner product.

2. Take the max-over-time (with bias and nonlinearity)

y = tanh(max_i {f[i]} + b)

as the feature corresponding to the filter H (for a particular word).
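A numpy sketch of these two steps for a single filter applied to “absurdity”; the random character embeddings and toy dimensions are illustrative only (the authors' released code, linked on slide 1, is the actual implementation):

    import numpy as np

    rng = np.random.default_rng(0)
    d, w = 15, 3                                              # char embedding dim, filter width
    word = "absurdity"
    char_emb = {c: rng.normal(size=d) for c in set(word)}     # toy character embeddings

    C = np.stack([char_emb[c] for c in word], axis=1)         # C in R^{d x l}
    H = rng.normal(size=(d, w))                               # one convolutional filter matrix
    b = 0.0
    l = C.shape[1]

    # f[i] = <C[:, i : i+w-1], H>, the Frobenius inner product of each window with H
    f = np.array([np.sum(C[:, i:i + w] * H) for i in range(l - w + 1)])
    y = np.tanh(f.max() + b)                                  # max-over-time pooling, bias, tanh
    print(f.shape, round(float(y), 3))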

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 22 / 76

slide-26
SLIDE 26

Character-level CNN (CharCNN)

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 23 / 76

slide-27
SLIDE 27

Character-level CNN (CharCNN)

C ∈ Rd×l : Representation of absurdity

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 24 / 76

slide-28
SLIDE 28

Character-level CNN (CharCNN)

H ∈ Rd×w : Convolutional filter matrix of width w = 3

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 25 / 76

slide-29
SLIDE 29

Character-level CNN (CharCNN)

f[1] = ⟨C[∗, 1 : 3], H⟩

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 26 / 76

slide-30
SLIDE 30

Character-level CNN (CharCNN)

f[1] = ⟨C[∗, 1 : 3], H⟩

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 27 / 76

slide-31
SLIDE 31

Character-level CNN (CharCNN)

f[2] = ⟨C[∗, 2 : 4], H⟩

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 28 / 76

slide-32
SLIDE 32

Character-level CNN (CharCNN)

f[l − 2] = ⟨C[∗, l − 2 : l], H⟩

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 29 / 76

slide-33
SLIDE 33

Character-level CNN (CharCNN)

y[1] = max_i {f[i]}

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 30 / 76

slide-34
SLIDE 34

Character-level CNN (CharCNN)

Each filter picks out a character n-gram

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 31 / 76

slide-35
SLIDE 35

Character-level CNN (CharCNN)

f′[1] = ⟨C[∗, 1 : 2], H′⟩

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 32 / 76

slide-36
SLIDE 36

Character-level CNN (CharCNN)

y[2] = max_i {f′[i]}

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 33 / 76

slide-37
SLIDE 37

Character-level CNN (CharCNN)

Many filter matrices (25–200) per width (1–7)

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 34 / 76

slide-38
SLIDE 38

Character-level CNN (CharCNN)

Add bias, apply nonlinearity

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 35 / 76

slide-39
SLIDE 39

Character-level CNN (CharCNN)

Before: word embedding as input. PTB perplexity: 85.4
Now: output from CharCNN as input. PTB perplexity: 84.6

CharCNN is slower, but convolution operations on GPUs are heavily optimized.

Can we model more complex interactions between character n-grams picked up by the filters?

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 36 / 76

slide-40
SLIDE 40

Highway Network

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 37 / 76

slide-41
SLIDE 41

Highway Network

y : output from CharCNN

Multilayer Perceptron: z = g(Wy + b)

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 38 / 76

slide-42
SLIDE 42

Highway Network

y : output from CharCNN

Multilayer Perceptron: z = g(Wy + b)

Highway Network (Srivastava, Greff, and Schmidhuber 2015):

z = t ⊙ g(WH y + bH) + (1 − t) ⊙ y

WH, bH : affine transformation
t = σ(WT y + bT) : transform gate
1 − t : carry gate

Hierarchical, adaptive composition of character n-grams.
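A hedged numpy sketch of a single highway layer implementing these equations; the dimension (525 = the sum of 25 · w filters for the small model's widths 1–6) and the gate initialization are illustrative choices, not the paper's training setup:

    import numpy as np

    rng = np.random.default_rng(0)
    dim = 525                                # e.g. total CharCNN features in the small model

    W_H = rng.normal(size=(dim, dim)) * 0.01
    b_H = np.zeros(dim)
    W_T = rng.normal(size=(dim, dim)) * 0.01
    b_T = np.full(dim, -2.0)                 # negative transform bias: start by mostly carrying y

    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    relu = lambda a: np.maximum(a, 0.0)

    def highway(y):
        t = sigmoid(W_T @ y + b_T)                           # transform gate
        return t * relu(W_H @ y + b_H) + (1.0 - t) * y       # (1 - t) is the carry gate

    y = rng.normal(size=dim)                 # stand-in for the CharCNN feature vector
    print(highway(y).shape)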

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 38 / 76

slide-43
SLIDE 43

Highway Network

(Diagram: the CharCNN output is the input to the highway network; the highway network's output is the input to the LSTM.)

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 39 / 76

slide-44
SLIDE 44

Highway Network

Model                     Perplexity
Word Model                   85.4
No Highway Layers            84.6
One MLP Layer                92.6
One Highway Layer            79.7
Two Highway Layers           78.9

No more gains with 2+ layers.

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 40 / 76

slide-45
SLIDE 45

Results: English Penn Treebank

                                                        PPL     Size
KN-5 (Mikolov et al. 2012)                             141.2     2 m
RNN (Mikolov et al. 2012)                              124.7     6 m
Deep RNN (Pascanu et al. 2013)                         107.5     6 m
Sum-Prod Net (Cheng et al. 2014)                       100.0     5 m
LSTM-Medium (Zaremba, Sutskever, and Vinyals 2014)      82.7    20 m
LSTM-Huge (Zaremba, Sutskever, and Vinyals 2014)        78.4    52 m
LSTM-Word-Small                                         97.6     5 m
LSTM-Char-Small                                         92.3     5 m
LSTM-Word-Large                                         85.4    20 m
LSTM-Char-Large                                         78.9    19 m

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 41 / 76

slide-46
SLIDE 46

Data

                      Data-s                  Data-l
               |V|     |C|     T       |V|     |C|     T
English (En)   10 k    51      1 m     60 k    197     20 m
Czech (Cs)     46 k    101     1 m     206 k   195     17 m
German (De)    37 k    74      1 m     339 k   260     51 m
Spanish (Es)   27 k    72      1 m     152 k   222     56 m
French (Fr)    25 k    76      1 m     137 k   225     57 m
Russian (Ru)   62 k    62      1 m     497 k   111     25 m

|V| = word vocabulary size, |C| = character vocabulary size, T = number of tokens in training set.

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 42 / 76

slide-47
SLIDE 47

Data

                      Data-s                  Data-l
               |V|     |C|     T       |V|     |C|     T
English (En)   10 k    51      1 m     60 k    197     20 m
Czech (Cs)     46 k    101     1 m     206 k   195     17 m
German (De)    37 k    74      1 m     339 k   260     51 m
Spanish (Es)   27 k    72      1 m     152 k   222     56 m
French (Fr)    25 k    76      1 m     137 k   225     57 m
Russian (Ru)   62 k    62      1 m     497 k   111     25 m

|V| varies quite a bit by language (we effectively use the full vocabulary).

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 43 / 76

slide-48
SLIDE 48

Baselines

Kneser-Ney LM: count-based baseline
Word LSTM: word embeddings as input
Morpheme LBL (Botha and Blunsom 2014): input for word k is

xk + Σ_{j∈Mk} mj

i.e. its word embedding plus the sum of its morpheme embeddings.
Morpheme LSTM: same input as above, but with an LSTM architecture.

Morphemes obtained by running the unsupervised morphological segmenter Morfessor Cat-MAP (Creutz and Lagus 2007).
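A small sketch of that input construction; the segmentation of “unfortunately” is shown for illustration rather than taken from Morfessor output:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 8                                                       # toy embedding dimension
    word_emb = {"unfortunately": rng.normal(size=n)}
    morph_emb = {m: rng.normal(size=n) for m in ("un", "fortunate", "ly")}

    def lbl_input(word, morphemes):
        # x_k = word embedding + sum of the word's morpheme embeddings
        return word_emb[word] + sum(morph_emb[m] for m in morphemes)

    x_k = lbl_input("unfortunately", ["un", "fortunate", "ly"])
    print(x_k.shape)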

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 44 / 76

slide-49
SLIDE 49

Perplexity on Data-S (1 M Tokens)

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 45 / 76

slide-50
SLIDE 50

Perplexity on Data-S (1 M Tokens)

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 46 / 76

slide-51
SLIDE 51

Perplexity on Data-S (1 M Tokens)

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 47 / 76

slide-52
SLIDE 52

Perplexity on Data-S (1 M Tokens)

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 48 / 76

slide-53
SLIDE 53

Perplexity on Data-S (1 M Tokens)

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 49 / 76

slide-54
SLIDE 54

Perplexity on Data-L (17-57 M Tokens)

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 50 / 76

slide-55
SLIDE 55

Learned Word Representations

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 51 / 76

slide-56
SLIDE 56

Learned Word Representations

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 52 / 76

slide-57
SLIDE 57

Learned Word Representations

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 53 / 76

slide-58
SLIDE 58

Learned Word Representations (In Vocab)

(Based on cosine similarity)

In Vocabulary   while          his      you             richard     trading
                although       your     conservatives   jonathan    advertised
Word            letting        her      we              robert      advertising
Embedding       though         my       guys            neil        turnover
                minute         their    i               nancy       turnover

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 54 / 76

slide-59
SLIDE 59

Learned Word Representations (In Vocab)

(Based on cosine similarity)

In Vocabulary        while          his      you             richard     trading
                     although       your     conservatives   jonathan    advertised
Word                 letting        her      we              robert      advertising
Embedding            though         my       guys            neil        turnover
                     minute         their    i               nancy       turnover

                     chile          this     your            hard        heading
Characters           whole          hhs      young           rich        training
(before highway)     meanwhile      is       four            richer      reading
                     white          has      youth           richter     leading

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 54 / 76

slide-60
SLIDE 60

Learned Word Representations (In Vocab)

(Based on cosine similarity)

In Vocabulary        while          his      you             richard     trading
                     although       your     conservatives   jonathan    advertised
Word                 letting        her      we              robert      advertising
Embedding            though         my       guys            neil        turnover
                     minute         their    i               nancy       turnover

                     chile          this     your            hard        heading
Characters           whole          hhs      young           rich        training
(before highway)     meanwhile      is       four            richer      reading
                     white          has      youth           richter     leading

                     meanwhile      hhs      we              eduard      trade
Characters           whole          this     your            gerard      training
(after highway)      though         their    doug            edward      traded
                     nevertheless   your     i               carl        trader

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 54 / 76


slide-65
SLIDE 65

Learned Word Representations (OOV)

Out-of-Vocabulary    computer-aided    misinformed    looooook
                     computer-guided   informed       look
Characters           computerized      performed      cook
(before highway)     disk-drive        transformed    looks
                     computer          inform         shook

                     computer-guided   informed       look
Characters           computer-driven   performed      looks
(after highway)      computerized      outperformed   looked
                     computer          transformed    looking

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 59 / 76


slide-68
SLIDE 68

Convolutional Layer

Does each filter truly pick out a character n-gram?

(Figure: the character embeddings of “a b s u r d i t y” are concatenated into C; a single filter slides over the concatenation and max-over-time pooling produces that filter's output.)
Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 62 / 76

slide-69
SLIDE 69

Convolutional Filters

For each filter, visualize 100 substrings with the highest filter response

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 63 / 76

slide-70
SLIDE 70

Convolutional Filters

For each filter, visualize 100 substrings with the highest filter response

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 64 / 76

slide-71
SLIDE 71

Character N-gram Representations

Character n-grams are grouped as prefixes, suffixes, hyphenated, and others. Prefixes: character n-grams that start with the ‘start-of-word’ character, e.g. {un, {mis (where { marks start-of-word). Suffixes are defined similarly with the ‘end-of-word’ character.
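A hedged sketch of how such prefixes become visible to the filters, padding each word with start/end-of-word markers before building C; the use of literal “{” and “}” symbols follows the slide's notation, while the toy embeddings are assumptions:

    import numpy as np

    rng = np.random.default_rng(0)
    d = 15
    alphabet = "{}" + "abcdefghijklmnopqrstuvwxyz-"
    char_emb = {c: rng.normal(size=d) for c in alphabet}

    def char_matrix(word):
        padded = "{" + word + "}"                               # add start/end-of-word markers
        return np.stack([char_emb[c] for c in padded], axis=1)  # C in R^{d x (l + 2)}

    C = char_matrix("unfortunately")
    print(C.shape)   # a width-3 filter over C[:, 0:3] now "sees" the prefix {un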

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 65 / 76

slide-72
SLIDE 72

Conclusion

A character-aware language model that relies only on character-level inputs: CNN over characters + LSTM. Outperforms strong word/morpheme LSTM baselines.

Much recent work on character inputs:
Santos and Zadrozny 2014: CNN over characters, concatenated with word embeddings, into a CRF.
Zhang and LeCun 2015: deep CNN over characters for document classification.
Ballesteros, Dyer, and Smith 2015: LSTM over characters for parsing.
Ling et al. 2015: LSTM over characters into another LSTM for language modeling/POS-tagging.

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 66 / 76

slide-73
SLIDE 73

Future Work

Subword information on the output.
As an encoder/decoder in neural machine translation.
CharCNN + highway layers for representation learning (e.g. as input into word2vec).

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 67 / 76

slide-74
SLIDE 74

Appendix: Performance vs Corpus/Vocab Size

How does relative performance vary as corpus/vocabulary sizes vary? Experiment on the German large dataset:
Use the first T tokens of the training set.
Take the most frequent K words as the vocabulary and replace the rest with <unk>.
Compare % perplexity reduction going from word to character LSTM.

                      Vocabulary Size
                  10 k    25 k    50 k    100 k
           1 m     17      16      21       –
Training   5 m      8      14      16      21
Size      10 m      9       9      12      15
          25 m      9       8       9      10

Character model outperforms word model in all scenarios.

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 68 / 76

slide-75
SLIDE 75

Appendix: Hyperparameters

                Small                  Large
CNN     d       15                     15
        w       [1, 2, 3, 4, 5, 6]     [1, 2, 3, 4, 5, 6, 7]
        h       [25 · w]               [min{200, 50 · w}]
        f       tanh                   tanh
HW-Net  l       1                      2
        g       ReLU                   ReLU
LSTM    l       2                      2
        m       300                    650

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 69 / 76

slide-76
SLIDE 76

Appendix: Results on Data-S

                 Cs     De     Es     Fr     Ru
B&B    KN-4     545    366    241    274    396
       MLBL     465    296    200    225    304
Small  Word     503    305    212    229    352
       Morph    414    278    197    216    290
       Char     401    260    182    189    278
Large  Word     493    286    200    222    357
       Morph    398    263    177    196    271
       Char     371    239    165    184    261

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 70 / 76

slide-77
SLIDE 77

Appendix: Results on Data-L

                 Cs     De     Es     Fr     Ru     En
B&B    KN-4     862    463    219    243    390    291
       MLBL     643    404    203    227    300    273
Small  Word     701    347    186    202    353    236
       Morph    615    331    189    209    331    233
       Char     578    305    169    190    313    216

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 71 / 76

slide-78
SLIDE 78

Appendix: Effect of Highway Layers (PTB)

                          Small Model    Large Model
No Highway Layers            100.3          84.6
One Highway Layer             92.3          79.7
Two Highway Layers            90.1          78.9
Multilayer Perceptron        111.2          92.6

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 72 / 76

slide-79
SLIDE 79

Appendix: LSTM (Hochreiter and Schmidhuber 1997)

Long short-term memory (LSTM) (Hochreiter and Schmidhuber 1997): augment the RNN with (latent) cell vectors to allow for learning of long-range dependencies.

it = σ(Wixt + Uiht−1 + bi)
ft = σ(Wf xt + Uf ht−1 + bf)
ot = σ(Woxt + Uoht−1 + bo)
gt = tanh(Wgxt + Ught−1 + bg)
ct = ft ⊙ ct−1 + it ⊙ gt
ht = ot ⊙ tanh(ct)
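A compact numpy sketch of one LSTM step implementing these equations, with toy dimensions and random weights rather than trained parameters:

    import numpy as np

    rng = np.random.default_rng(0)
    n, m = 8, 16                                         # input dim, hidden dim (toy values)
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

    # one (W, U, b) triple per gate: input i, forget f, output o, and candidate g
    params = {k: (rng.normal(size=(m, n)) * 0.1, rng.normal(size=(m, m)) * 0.1, np.zeros(m))
              for k in "ifog"}

    def lstm_step(x_t, h_prev, c_prev):
        pre = {k: W @ x_t + U @ h_prev + b for k, (W, U, b) in params.items()}
        i, f, o = sigmoid(pre["i"]), sigmoid(pre["f"]), sigmoid(pre["o"])
        g = np.tanh(pre["g"])
        c = f * c_prev + i * g                           # cell state: gated memory of the past
        h = o * np.tanh(c)
        return h, c

    h, c = lstm_step(rng.normal(size=n), np.zeros(m), np.zeros(m))
    print(h.shape, c.shape)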

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 73 / 76

slide-80
SLIDE 80

References I

Bengio, Yoshua, Réjean Ducharme, and Pascal Vincent (2003). “A Neural Probabilistic Language Model”. In: Journal of Machine Learning Research 3, pp. 1137–1155.
Mikolov, Tomas et al. (2011). “Empirical Evaluation and Combination of Advanced Language Modeling Techniques”. In: Proceedings of INTERSPEECH.
Collobert, Ronan et al. (2011). “Natural Language Processing (almost) from Scratch”. In: Journal of Machine Learning Research 12, pp. 2493–2537.
Mikolov, Tomas et al. (2012). “Subword Language Modeling with Neural Networks”. In: preprint, www.fit.vutbr.cz/~imikolov/rnnlm/char.pdf.
Mikolov, Tomas and Geoffrey Zweig (2012). “Context Dependent Recurrent Neural Network Language Model”. In: Proceedings of SLT.
Pascanu, Razvan et al. (2013). “How to Construct Deep Recurrent Neural Networks”. In: arXiv:1312.6026.

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 74 / 76

slide-81
SLIDE 81

References II

Zaremba, Wojciech, Ilya Sutskever, and Oriol Vinyals (2014). “Recurrent Neural Network Regularization”. In: arXiv:1409.2329.
Luong, Minh-Thang, Richard Socher, and Chris Manning (2013). “Better Word Representations with Recursive Neural Networks for Morphology”. In: Proceedings of CoNLL.
Botha, Jan and Phil Blunsom (2014). “Compositional Morphology for Word Representations and Language Modelling”. In: Proceedings of ICML.
LeCun, Yann et al. (1989). “Handwritten Digit Recognition with a Backpropagation Network”. In: Proceedings of NIPS.
Srivastava, Rupesh Kumar, Klaus Greff, and Jürgen Schmidhuber (2015). “Training Very Deep Networks”. In: arXiv:1507.06228.
Creutz, Mathias and Krista Lagus (2007). “Unsupervised Models for Morpheme Segmentation and Morphology Learning”. In: ACM Transactions on Speech and Language Processing.

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 75 / 76

slide-82
SLIDE 82

References III

Cheng, Wei-Chen et al. (2014). “Language Modeling with Sum-Product Networks”. In: Proceedings of INTERSPEECH.
Santos, Cicero Nogueira dos and Bianca Zadrozny (2014). “Learning Character-level Representations for Part-of-Speech Tagging”. In: Proceedings of ICML.
Zhang, Xiang and Yann LeCun (2015). “Text Understanding From Scratch”. In: arXiv:1502.01710.
Ballesteros, Miguel, Chris Dyer, and Noah A. Smith (2015). “Improved Transition-Based Parsing by Modeling Characters instead of Words with LSTMs”. In: Proceedings of EMNLP.
Ling, Wang et al. (2015). “Finding Function in Form: Compositional Character Models for Open Vocabulary Word Representation”. In: Proceedings of EMNLP.
Hochreiter, Sepp and Jürgen Schmidhuber (1997). “Long Short-Term Memory”. In: Neural Computation 9, pp. 1735–1780.

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 76 / 76