Deep Learning for Natural Language Processing (in 2 hours)

slide-1
SLIDE 1

Deep Learning for Natural Language Processing (in 2 hours)

Eneko Agirre http://ixa2.si.ehu.eus/eneko IXA NLP group http://ixa.eus @eagirre

slide-2
SLIDE 2

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 2

Contents

  • Introduction to Deep learning

– Deep Learning ~ Learning Representations

  • Text as bag of words
  • Text as a sequence
slide-3
SLIDE 3

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 4

Quiz

  • How many of you have done

– A course on linguistics
– A course on computational linguistics (aka NLP)
– A course on machine learning
– A course on deep learning

slide-4
SLIDE 4

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 5

Introduction to Deep Learning

slide-5
SLIDE 5

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 6

What is deep learning

A subfield of machine learning

  • Supervised ML, given a dataset of

examples x with labels y

  • Learn a function f(x)→y

with low training error and low test error

Key manual step: design features to extract key information from x (representation)

  • e.g. weather forecast (wind, temperature, humidity, pressure, precipitations … –

local and nearby locations)

  • e.g. sentiment of tweet

(keywords like “good” “bad”, certain emojis, ...)

Source: staesthetic.wordpress.com

slide-6
SLIDE 6

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 7

What is deep learning

Source: Chris Manning cs224n

x f(x)=y

Source: www.vaetas.cz

Deep? Multiple levels. Deep learning jointly learns the representation and the output

slide-7
SLIDE 7

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 11

ImageNet Classification with Deep Convolutional Neural Networks (Krizhevsky et al. 2012)
Learning hierarchical representations for face verification with convolutional deep belief networks (Huang et al. 2012)

slide-8
SLIDE 8

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 17

Brief history of NLP

  • 1960s: Complex rules and first order logic.

Humans build complex grammars.

  • 1990s: Supervised machine learning.

Humans annotate text, design laborious task-specific features, and apply ML techniques.

  • 2010s: Deep learning.

Learning continuous representations, get rid of task-specific features.

slide-9
SLIDE 9

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 18

Why deep learning for NLP?

  • Technology behind current

speech processing and machine translation

– Advances in the state-of-the-art on most tasks

  • Focus on representation learning

– Learns representations for words and word sequences fitted to the task

… including world and visual knowledge.

– Naturally accounts for graded judgments about language

  • Word similarity: building / house
  • Sentence similarity:

A pony is close to the house / There is a horse in the front yard

  • End-to-end joint learning (vs. pipeline)
  • Transfer models across tasks

(word embeddings, sentence encoders)

slide-10
SLIDE 10

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 19

Contents

  • Introduction to Deep learning

– Deep Learning ~ Learning Representations

  • Text as bag of words

– Text Classification
– Representation learning and word embeddings
– Superhuman: cross-lingual word embeddings

  • Text as a sequence
slide-11
SLIDE 11

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 20

Classification

  • Input

– A document d ∈ D

– A fixed set of classes C = {c1, c2, … cj}

  • Output: a predicted class y ∈ C

slide-12
SLIDE 12

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 23

Classification

Supervised machine learning: learn a classifier from hand-annotated examples:

  • Input: A training set of m hand-labeled

documents (x1, y1) … (xm, ym)

  • Output: A learned classifier f:D→C
slide-13
SLIDE 13

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 24

Text Classification

  • Most NLP tasks can be framed

as text classification

– PoS tagging, parsing, QA, MT ...

  • Simple example to showcase key methods

– Sentiment analysis (polarity)
– Positive or negative movie review

  • Full of zany characters and richly applied satire
  • It was pathetic
  • Modestly accomplished, lifted by two terrific performances.
  • Not NEARLY as funny as its title
slide-14
SLIDE 14

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 25

Documents as feature vectors (bag of words)

slide-15
SLIDE 15

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 26

Feature vectors

Representation of each example (document)

  • Key idea for most machine learning
  • Example as a vector of features x

– All examples same number of features – Features: boolean, integer, real

  • Pre-processing code

to convert from example into feature vector

– Substantial effort (e.g. tokenization)

slide-16
SLIDE 16

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 27

Bag of words

Representation of each example (document) ⇒ Bag of words representation

Source: Sam Bowman

slide-17
SLIDE 17

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 30

Bag of words

Classifier:

  • Input: vector of |V| binary features

Simple sentiment analysis: 1 if the word occurs in the document, 0 otherwise

  • Output of the classifier: discrete

Simple sentiment analysis: two classes

f( [it: 1, the: …, and: …, by: …, terrific: 1, pathetic: …, funny: …, …] ) = c

Positive!

slide-18
SLIDE 18

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 31

Classification: Softmax (Logistic Regression)

slide-19
SLIDE 19

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 35

Softmax classification

  • Given: vectors of weights wc for class c

(e.g. “terrific” gets a high weight for the positive class)

– For each class compute wc x + b – Add non-linearity fc = exp(wc x + b)

– Normalize it to estimate probabilities:

p(y=c|x) ≈ fc / ∑c'∈C fc'

  • That is the softmax function:

softmax(wc·x) = exp(wc·x) / ∑c'∈C exp(wc'·x)
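A minimal numpy sketch of this classifier (the vocabulary size, weights and document vector below are toy values, not from the slides):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())        # subtract max for numerical stability
    return e / e.sum()

x = np.array([1., 0., 0., 1., 0., 1.])   # toy binary bag-of-words vector, |V| = 6
W = np.random.randn(2, 6) * 0.1          # one weight vector wc per class
b = np.zeros(2)

p = softmax(W @ x + b)                   # p(y=c|x) for each class c
print(p, p.argmax())                     # predicted class = argmax over c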

slide-20
SLIDE 20

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 36

Softmax classification

  • Given: vectors of weights wc for class c
  • Output value y = argmaxc softmax(wc x)
  • Task for the training algorithm:

– Given labeled examples – Find wc vectors such that the output value is not

wrong for as many training examples as possible

slide-21
SLIDE 21

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 38

Softmax classification

Estimating weights (parameters)

  • Choose parameters which

minimize error over training data

Loss function J (aka cost function or objective function)
Cross-entropy error on one example (xi, yi)

– We want to maximize the probability of the correct class,

i.e. minimize the negative log probability of the correct class

Ji(W) = −log P(yi=c | xi) = −log( exp(Wc·xi) / ∑c'∈C exp(Wc'·xi) )
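In code, the loss on one labelled example is just the negative log-probability of its correct class (continuing the toy softmax sketch; all values are invented):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

x = np.array([1., 0., 0., 1., 0., 1.])   # toy document vector
y = 0                                     # index of the correct class
W, b = np.random.randn(2, 6) * 0.1, np.zeros(2)

p = softmax(W @ x + b)
J = -np.log(p[y])                         # Ji(W) = -log p(yi = c | xi)
print(J)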

slide-22
SLIDE 22

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 39

Softmax classification

Estimating parameters

  • Search for parameters which

minimize error over training data

  • Back-propagation

Stochastic gradient descent

– Start with random parameters
– Select K examples (mini-batch) at random
– Change parameters a little bit towards the minimum of the loss function

for those K (derivatives!)

– Continue until the loss function converges (or increases); see the sketch below

Source: staesthetic.wordpress.com
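A minimal sketch of that loop with numpy on random toy data (the gradient below is the standard softmax / cross-entropy derivative; learning rate and batch size are arbitrary):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
X = rng.random((100, 6))                 # 100 toy examples, 6 features
y = rng.integers(0, 2, size=100)         # toy labels, 2 classes
W, b = np.zeros((2, 6)), np.zeros(2)     # starting parameters
lr, K = 0.1, 8                           # learning rate, mini-batch size

for step in range(500):
    idx = rng.choice(len(X), K)          # select K examples at random
    p = softmax(X[idx] @ W.T + b)        # forward pass on the mini-batch
    p[np.arange(K), y[idx]] -= 1         # dJ/dlogits = p - one_hot(y)
    W -= lr * (p.T @ X[idx]) / K         # small step towards the minimum
    b -= lr * p.mean(axis=0)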

slide-23
SLIDE 23

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 41

Softmax classification

  • An input layer – the feature vector
  • An output layer – class probabilities

Train, given labeled data (x1,y1) … (xm, ym)

  • For each example (xi,yi):

– Forward: input xi, obtain predicted class ci
– If ci ≠ yi then backward: adjust parameters

slide-24
SLIDE 24

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 42

Softmax classification

Source: http://cs.stanford.edu/people/karpathy/convnetjs/demo/classify2d.html

Intuition:

  • Assume 2D vectors
  • W defines the linear

decision boundary

slide-25
SLIDE 25

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 43

Going deep: Multilayer perceptron

  • An input layer

– just a feature vector

  • One or more hidden layers, each

computed on the layer below – latent features (representations)

  • An output layer, based on the top hidden

layer – class probabilities

  • Also known as Feed Forward
slide-26
SLIDE 26

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 44

Going deep: Multilayer perceptron

Softmax classification: y = softmax(W·x + b)

slide-27
SLIDE 27

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 45

Going deep: Multilayer perceptron

y = softmax(W2·h1 + b2)
h1 = f(W1·h0 + b1)
h0 = f(W0·x + b0)

slide-28
SLIDE 28

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 46

Going deep: Multilayer perceptron

Layers have the same structure. Non-linear functions!

hi = f(Wi·hi−1 + bi)

Sigmoid: σ(x) = 1 / (1 + exp(−x))
Hyperbolic tangent: tanh(x) = (1 − exp(−2x)) / (1 + exp(−2x))
Rectified linear unit: rect(x) = max(0, x)
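A sketch of the forward pass of such an MLP with these non-linearities (layer sizes and random weights are toy choices):

import numpy as np

def sigmoid(z): return 1 / (1 + np.exp(-z))
def rect(z):    return np.maximum(0, z)          # rectified linear unit

rng = np.random.default_rng(0)
x = rng.random(6)                                 # input feature vector
W0, b0 = rng.standard_normal((8, 6)) * 0.1, np.zeros(8)
W1, b1 = rng.standard_normal((8, 8)) * 0.1, np.zeros(8)
W2, b2 = rng.standard_normal((2, 8)) * 0.1, np.zeros(2)

h0 = sigmoid(W0 @ x + b0)                         # h0 = f(W0 x + b0)
h1 = rect(W1 @ h0 + b1)                           # h1 = f(W1 h0 + b1)
z = W2 @ h1 + b2
y = np.exp(z - z.max()); y /= y.sum()             # y = softmax(W2 h1 + b2)
print(y)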

slide-29
SLIDE 29

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 47

Going deep: Multilayer perceptron

Motivation

Without non-linearities there is no extra expressivity: a sequence of linear transformations is itself a linear transformation

How do we train it?

Source: http://cs.stanford.edu/people/karpathy/convnetjs/demo/classify2d.html

slide-30
SLIDE 30

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 48

What about representation learning?

  • Document as bag-of-words:

– Each word ~ a one-hot-vector

(0 0 0 0 0 … 1 … 0 0 0)

– Each document ~ summation of

words in the document (0 0 0 1 0 … 1 … 1 0 0)

  • NLP has been representing

words as distinct features for many years!

– Is there a better alternative?

Source: Sam Bowman
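A toy sketch of this representation (the vocabulary and the document are invented for illustration):

import numpy as np

vocab = ["it", "was", "pathetic", "terrific", "funny", "film"]
word2id = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    v = np.zeros(len(vocab))
    v[word2id[word]] = 1.0
    return v

doc = "it was terrific".split()
bow = sum(one_hot(w) for w in doc)   # document = sum of the one-hot word vectors
print(bow)                           # [1. 1. 0. 1. 0. 0.]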

slide-31
SLIDE 31

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 49

Representation learning

y = softmax(W2·h1 + b2)
h1 = f(W1·h0 + b1)
h0 = f(W0·x + b0)

[figure: a one-hot input vector x multiplied by W0 gives h0]

MLP: What are we learning in W0 when we backpropagate?
slide-32
SLIDE 32

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 50

Representation learning

y = softmax(W2·h1 + b2)
h1 = f(W1·h0 + b1)
h0 = f(W0·x + b0)

[figure: a one-hot input vector x multiplied by W0 gives h0]

MLP: What are we learning in W0 when we backpropagate?

slide-33
SLIDE 33

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 51

Representation learning

Back-propagation allows Neural Networks to learn representations for words while training: word embeddings!

y = softmax(W2·h1 + b2)
h1 = f(W1·h0 + b1)
h0 = f(W0·x + b0)

[figure: the column of W0 selected by the one-hot input x is that word's embedding]
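The same observation in code: multiplying W0 by a one-hot vector just picks out one column of W0, so those columns are the learned word embeddings (toy sizes, random values):

import numpy as np

V, d = 6, 4                              # vocabulary size, embedding dimension
rng = np.random.default_rng(0)
W0 = rng.standard_normal((d, V))         # first-layer weight matrix

x = np.zeros(V); x[3] = 1.0              # one-hot vector for word id 3
print(np.allclose(W0 @ x, W0[:, 3]))     # True: W0 x is column 3 of W0,
                                         # i.e. the embedding of word 3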

slide-34
SLIDE 34

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 52

Representation learning

  • Back-propagation allows Neural Networks

to learn representations for words while training!

– Word embeddings!

Continuous vector space instead of one-hot

  • Are these word embeddings useful?
  • Which task would be the best to learn embeddings

that can be used in other tasks?

  • Can we transfer this representation

from one task to the other?

  • Can we have all languages in one embedding space?
slide-35
SLIDE 35

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 53

Representation learning and word embeddings

slide-36
SLIDE 36

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 54

Word embeddings

  • Let’s represent words as vectors:

Similar words should have vectors which are close to each other

  • If an AI has seen these two sequences

I live in Cambridge I live in Paris

  • … then which one should be more plausible?

I live in Tallinn I live in policeman

Cambridge London Tallinn Paris policeman judge driver cop

WHY?

slide-37
SLIDE 37

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 55

Word embeddings

  • Let’s represent words as vectors:

Similar words should have vectors which are close to each other

  • If an AI has seen these two sequences

I live in Cambridge I live in Paris

  • … then which one should be more plausible?

I live in Tallinn OK I live in policeman

Cambridge London Tallinn Paris policeman judge driver cop

WHY?

slide-38
SLIDE 38

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 59

Word embeddings

Option 1: use co-occurrence matrix PPMI

  • 1A: large sparse matrix
  • 1B: factorize it and use low-

rank dense matrix (SVD)

  • Cambridge

London Tallinn Paris policeman judge driver cop

Option 2: learn low-rank dense matrix directly

  • 2A: MLP on particular

classification task

  • 2B: find a general task

Distributional vector spaces:

slide-39
SLIDE 39

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 60

Word embeddings

Cambridge London Tallinn Paris policeman judge driver cop

… people who keep pet dogs or cats exhibit better mental and physical health …

General task with large quantities of data: guess the missing word (language models)
CBOW: given the context, guess the middle word
SKIP-GRAM: given the middle word, guess the context

Proposed by Mikolov et al. (2013) - Word2vec

… people who keep pet dogs or cats exhibit better mental and physical health …

slide-40
SLIDE 40

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 62

CBOW

Like MLP, one layer, LARGE vocabulary

… people who keep pet dogs or cats exhibit better mental and physical health …

Cross-entropy loss / Softmax expensive! Negative sampling:

J_NEG(wt) = log σ(h·oi) − ∑k=1..K  Ewn∼Pnoise [ log σ(h·on) ]

Source: Mikolov et al. 2013

LARGE number of classes! The vocabulary
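A sketch of a CBOW step with negative sampling in the usual word2vec formulation (the slide's notation above differs slightly from this standard form; vocabulary size, dimensions and word ids are toy values):

import numpy as np

def sigmoid(z): return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
V, d, K = 1000, 50, 5                               # vocabulary, dimension, negatives
W_in = rng.standard_normal((V, d)) * 0.1            # input (context) embeddings
W_out = rng.standard_normal((V, d)) * 0.1           # output embeddings

context, target = [3, 17, 42, 99], 7                # toy context word ids, middle word
h = W_in[context].mean(axis=0)                      # CBOW: average of context vectors
neg = rng.integers(0, V, size=K)                    # K words drawn from the noise distribution

# maximize log sigma(h.v_target) + sum_k log sigma(-h.v_neg_k), written here as a loss
loss = -np.log(sigmoid(h @ W_out[target])) - np.log(sigmoid(-(W_out[neg] @ h))).sum()
print(loss)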

slide-41
SLIDE 41

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 63

Word embeddings

Pre-trained word embeddings can leverage texts with BILLIONS of words!!

Pre-trained word embeddings useful for:

  • Word similarity
  • Word analogy
  • Other tasks like PoS tagging, NERC,

sentiment analysis, etc.

  • Initialize embedding layer in deep learning models

Cambridge London Tallinn Paris policeman judge driver cop

slide-42
SLIDE 42

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 64

Word embeddings

Word similarity

Cambridge London Tallinn Paris policeman judge driver cop

Source: Collobert et al. 2011

slide-43
SLIDE 43

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 65

Word embeddings

Word analogy: a is to b as c is to ? man is to king as woman is to ?

policeman Cambridge London Tallinn Paris judge driver cop

king man woman

?

slide-44
SLIDE 44

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 66

Word embeddings

Word analogy: a is to b as c is to ? man is to king as woman is to ?

policeman Cambridge London Tallinn Paris judge driver cop

a − b ≈ c − d    ⇒    d ≈ c − a + b
king − man + woman ≈ queen
argmax d∈V ( cos(d, c − a + b) )

king man woman queen
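In code, the analogy is solved by vector arithmetic plus cosine similarity; the embeddings below are random placeholders (with real pre-trained vectors the answer comes out as "queen"):

import numpy as np

rng = np.random.default_rng(0)
vocab = ["king", "queen", "man", "woman", "paris", "london"]
E = {w: rng.standard_normal(50) for w in vocab}       # placeholder embeddings

def cos(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

target = E["woman"] - E["man"] + E["king"]            # d ~ c - a + b
best = max((w for w in vocab if w not in {"man", "king", "woman"}),
           key=lambda w: cos(E[w], target))
print(best)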

slide-45
SLIDE 45

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 68

Word embeddings

  • How to use embeddings in a given task

(e.g. MLP sentiment analysis):

– Learn them from scratch (random init.)
– Initialize using pre-trained embeddings

from some other task (e.g. word2vec)

  • Other embeddings:

– GloVe (Pennington et al. 2014)
– fastText (Mikolov et al. 2017)

slide-46
SLIDE 46

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 70

Recap

  • Deep learning: learn representations for words
  • Are they useful for anything?
  • Which task would be the best to learn

embeddings that can be used in other tasks?

  • Can we transfer this representation

from one task to the other?

  • Can we have all languages in one embedding

space?

slide-47
SLIDE 47

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 71

Superhuman abilities: cross-lingual word embeddings

http://aclweb.org/anthology/P18-1073

[Slides 48 to 95: no text extracted (figure-only slides)]
slide-96
SLIDE 96

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 120

Contents

  • Introduction to NLP

– Deep Learning ~ Learning Representations

  • Text as bag of words

– Text Classification
– Representation learning and word embeddings
– Superhuman: cross-lingual word embeddings

  • Text as a sequence: RNN

– Sentence encoders
– Machine Translation
– Superhuman: unsupervised MT

slide-97
SLIDE 97

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 121

From words to sequences

  • Representation for words:
  • One vector for each word (word embeddings)
  • Representation for sequences of words:
  • One vector for each sequence (?!)

– Is it possible to represent a sentence in one

vector at all?

– Let’s go back to MLP

slide-98
SLIDE 98

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 122

From words to sequences

MLP: What is h0 with respect to the words in the input ?

  • Add vectors of words in context (1’s in x),

plus bias, apply non-linearity

h0 = f( ∑i wi + b0 )   (the sentence representation)

slide-99
SLIDE 99

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 123

Sentence encoder

A function: input a sequence of word embeddings

  • Output: a sentence representation

[figure: word embeddings w1, w2, w3 (wi ∈ ℝ^D) go through a sentence encoder that outputs a sentence representation s ∈ ℝ^D']

slide-100
SLIDE 100

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 124

Sentence encoder

Baseline: Continuous bag of words (pre-trained embeddings)

[figure: word embeddings w1, w2, w3 are summed (Σ) into the sentence representation s]

h1 = f(W1·h0 + b1)
h0 = s

slide-101
SLIDE 101

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 125

Sentence encoder

Baseline: MLP (with or without pre-trained embeddings)
Encoding sequences as a continuous bag of words

[figure: word embeddings w1, w2, w3 are combined into the sentence representation s = f(∑i wi + b0)]

h1 = f(W1·h0 + b1)
h0 = s
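A sketch of this continuous-bag-of-words encoder plus MLP classifier (embedding dimension, hidden size and weights are toy values):

import numpy as np

rng = np.random.default_rng(0)
d, d_hid, n_classes = 50, 32, 2
b0 = np.zeros(d)
W1, b1 = rng.standard_normal((d_hid, d)) * 0.1, np.zeros(d_hid)
W2, b2 = rng.standard_normal((n_classes, d_hid)) * 0.1, np.zeros(n_classes)

words = [rng.standard_normal(d) for _ in range(3)]   # embeddings of w1, w2, w3
s = np.tanh(sum(words) + b0)                         # h0 = s = f(sum_i wi + b0)
h1 = np.tanh(W1 @ s + b1)                            # h1 = f(W1 h0 + b1)
z = W2 @ h1 + b2
y = np.exp(z - z.max()); y /= y.sum()                # class probabilities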

slide-102
SLIDE 102

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 126

Problems with bag of words

  • Representation is limited

– repr(John loves Mary) = repr(Mary loves John)
– repr(The end is weak but the film is awesome) = repr(The end is awesome but the film is weak)

  • We need a more powerful representation

which takes into account word order

– Sequences of tokens – Bigrams, trigrams of tokens

slide-103
SLIDE 103

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 127

Sentence encoder

Concatenation

[figure: word embeddings w1, w2, w3 are concatenated into the sentence representation s]

h1 = f(W1·h0 + b1)
h0 = s

slide-104
SLIDE 104

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 128

Sentence encoder

Concatenation

“It is a masterful performance but, for me, not enough to make me a fan of the film. It is not only lovely to look at (the exquisite northern Italian countryside made me want to hop on a plane that moment), but is made even better by the subtle performance of Timothée Chalamet.” → Fantastic movie

h1 = f(W1·h0 + b1)
h0 = s

slide-105
SLIDE 105

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 129

Sentence encoder

Recurrent Neural Network (RNN)

Apply a single function f recursively

hidden layers

w1

w2

w3

0 0 0 0 0 0

f f f

y

x

f

slide-106
SLIDE 106

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 130

Sentence encoder: RNN

Recurrent Neural Network (RNN) Apply a single function f recursively keeping history (hidden) state

hidden layers

w1

w2

w3

0 0 0 0 0 0

f f f

y

x

f

slide-107
SLIDE 107

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 131

Sentence encoder: RNN

Recurrent Neural Network (RNN) Apply a single function f recursively keeping history (hidden) state

hidden layers

w1

w2

w3

0 0 0 0 0 0

f f f

y

x

f

slide-108
SLIDE 108

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 132

Sentence encoder: RNN

Recurrent Neural Network (RNN) Apply a single function f recursively keeping history (hidden) state

hidden layers

w1

w2

w3

0 0 0 0 0 0

f f f

y

x

f

slide-109
SLIDE 109

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 133

Sentence encoder: RNN

Apply a single function f recursively.

[figure: f applied at each position over w1, w2, w3, starting from a zero hidden state; the last output is the sentence representation]

ht = f(ht−1, wt) = tanh(W·[ht−1; wt] + b)
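A sketch of this recurrence: the same function (here a plain tanh RNN cell with toy sizes and random weights) is applied at every position, threading the hidden state along:

import numpy as np

rng = np.random.default_rng(0)
d_word, d_hid = 50, 64
W = rng.standard_normal((d_hid, d_hid + d_word)) * 0.1
b = np.zeros(d_hid)

def f(h_prev, w_t):
    # ht = tanh(W [h_{t-1}; wt] + b)
    return np.tanh(W @ np.concatenate([h_prev, w_t]) + b)

words = [rng.standard_normal(d_word) for _ in range(3)]   # embeddings of w1, w2, w3
h = np.zeros(d_hid)                                       # initial hidden state: zeros
for w in words:
    h = f(h, w)
# h is now the sentence representation (the last hidden state)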

slide-110
SLIDE 110

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 134

Sentence encoder: RNN

Varying sequence length: unroll (different graph for each input)

w1

w2

w3

0 0 0 0 0 0

f f f

sentence representation

w1

w2

0 0 0 0 0 0

f f ⃗

w3

f

w4

f

sentence representation

y

x

f

slide-111
SLIDE 111

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 136

Sentence encoder: RNN

  • Last hidden state is input to classifier

[figure: the last RNN hidden state ht is fed to a classifier g]

g(ht) = softmax(U·ht + c)

slide-112
SLIDE 112

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 137

RNNs offer a lot of flexibility

MLP

Source: Fei-Fei Li & Andrej Karpathy and Justin Johnson

slide-113
SLIDE 113

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 138

RNNs offer a lot of flexibility

Sentiment classification

Source: Fei-Fei Li & Andrej Karpathy and Justin Johnson

slide-114
SLIDE 114

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 139

RNNs offer a lot of flexibility

Machine translation

Source: Fei-Fei Li & Andrej Karpathy and Justin Johnson

slide-115
SLIDE 115

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 140

RNNs offer a lot of flexibility

Image Captioning

Source: Fei-Fei Li & Andrej Karpathy and Justin Johnson

slide-116
SLIDE 116

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 141

RNNs offer a lot of flexibility

Video classification at frame level

Source: Fei-Fei Li & Andrej Karpathy and Justin Johnson

slide-117
SLIDE 117

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 142

Machine Translation: seq2seq

slide-118
SLIDE 118

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 143

Sequence to sequence models

  • For any problem that can be interpreted as a

transformation from one sequence to another:

– Model it as a pair of RNNs:
– An encoder that reads the input sentence and outputs nothing
– A decoder whose starting hidden state is the last hidden state of the encoder, and that generates a sentence (RNN language model)

– Give it lots of data….

Success is guaranteed (Sutskever et al. 2014)

slide-119
SLIDE 119

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 144

Sequence to sequence models

  • They are state-of-the-art in many problems,

some of them fairly novel

– Foreign sentences → translation
– Emails → simple replies (Google)
– Python functions → result
– English sentences → parsing instructions
– ...

slide-120
SLIDE 120

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 145

Sequence to sequence for MT

Combine two RNNs: an encoder and a decoder. Train as a regular RNN (NOTE: N classifiers)

[figure: the encoder RNN (f) reads the source sentence up to </s>; the decoder RNN starts from <s> and predicts the target words (e.g. "txakur", "pelikula", </s>) with a classifier g at each step]

slide-121
SLIDE 121

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 146

Sequence to sequence for MT

Combine two RNN Train as in regular RNN

  • Note that the decoder is conditioned

on the last hidden state of the encoder:

g g

hidden layers hidden layers

g

hidden layers

<s>

txakur pelikula txakur

</s>

pelikula

p(w1, …, wm | hencoder) ≈ ∏t=0..m−1 ŷt,correct

C = hencoder
ht = tanh(W·[ht−1, wt, C] + b)
ŷt = softmax(WS·ht + c)
Jsentence = −(1/m) ∑t=1..m log ŷt,correct
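A sketch of these decoder equations for one training sentence with teacher forcing (vocabulary size, dimensions, weights and word ids are toy values; word id 0 stands in for <s> and id 2 for </s>):

import numpy as np

rng = np.random.default_rng(0)
V, d_word, d_hid = 100, 32, 64
E = rng.standard_normal((V, d_word)) * 0.1                # target-side embeddings
W = rng.standard_normal((d_hid, d_hid + d_word + d_hid)) * 0.1
b = np.zeros(d_hid)
W_S, c = rng.standard_normal((V, d_hid)) * 0.1, np.zeros(V)

C = rng.standard_normal(d_hid)                            # last encoder hidden state
target = [5, 17, 2]                                       # toy target word ids, 2 = </s>

h, prev, J = np.zeros(d_hid), E[0], 0.0                   # start from <s>
for w_id in target:
    h = np.tanh(W @ np.concatenate([h, prev, C]) + b)     # ht = tanh(W[h_{t-1}, wt, C] + b)
    z = W_S @ h + c
    y = np.exp(z - z.max()); y /= y.sum()                 # y_t = softmax(W_S ht + c)
    J += -np.log(y[w_id])                                 # -log y_{t,correct}
    prev = E[w_id]                                        # teacher forcing: feed the gold word
J /= len(target)                                          # J_sentence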

slide-122
SLIDE 122

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 147

Sequence to sequence for MT

  • Test as decoder

– Compute sentence representation

</s>

0 0 0 0 0 0

f f f

slide-123
SLIDE 123

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 148

Sequence to sequence for MT

  • Test as conditional RLM decoder

– Compute sentence representation – Generate first word

g

hidden layers

</s>

0 0 0 0 0 0

<s>

f f f

katu

ŵt = argmaxi ŷt,i

slide-124
SLIDE 124

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 149

Sequence to sequence for MT

  • Test as conditional RLM decoder

– Compute sentence representation – Generate second word

g g

hidden layers hidden layers

</s>

0 0 0 0 0 0

f f f

katu pelikula

<s>

katu

ŵt = argmaxi ŷt,i

slide-125
SLIDE 125

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 150

Sequence to sequence for MT

  • Test as conditional RLM decoder

– Compute sentence representation – Generate next word, stop if </s>

g g

hidden layers hidden layers

</s>

g

hidden layers

0 0 0 0 0 0

f f f

katu pelikula

</s>

pelikula

<s>

katu

ŵt = argmaxi ŷt,i

slide-126
SLIDE 126

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 151

Sequence to sequence for MT

  • These models do not beat statistical MT
  • More complex RNN: LSTM
  • There is a huge loss of information in trying to

cram all the meaning of the input sentence in a single vector

– The decoder is missing key information like

individual words in the input, word order information, length of input, etc.

– Add capability to attend to each of the tokens in

the input: attention!

slide-127
SLIDE 127

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 152

Re-thinking seq2seq for NMT

How can we access the necessary information at each decoding step?

g g

hidden layers hidden layers

</s>

g

hidden layers

0 0 0 0 0 0

<s>

f f f

txakur pelikula txakur

</s>

pelikula

h3 = c = s0   (encoder hidden states h0, h1, h2, h3; decoder states s1, s2, s3)

slide-128
SLIDE 128

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 153

Re-thinking seq2seq for NMT

Learn alignments jointly with the translation (!) (Bahdanau et al. 2015; Luong et al. 2015)

</s>

f f f

ct α t 1 + α t 3 α t 2

0 0 0 0 0 0

st

hidden layers

slide-129
SLIDE 129

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 160

Re-thinking seq2seq for NMT

0 0 0 0 0 0

f f f

ct α t 1 + α t 3 α t 2

</s>

g

hidden layers

<s>

s1

ct = ∑j=1..Tx αtj·hj
αtj = exp(etj) / ∑k=1..Tx exp(etk)
etj = a(st, hj) = st·Wa·hj

t = 1
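A numpy sketch of these three equations for one decoding step (source length, hidden size and weights are toy values):

import numpy as np

rng = np.random.default_rng(0)
Tx, d = 3, 64
H = rng.standard_normal((Tx, d))                     # encoder hidden states h_1..h_Tx
s_t = rng.standard_normal(d)                         # current decoder state
W_a = rng.standard_normal((d, d)) * 0.1

e = np.array([s_t @ W_a @ H[j] for j in range(Tx)])  # e_tj = s_t W_a h_j
alpha = np.exp(e - e.max()); alpha /= alpha.sum()    # alpha_tj = softmax over j
c_t = alpha @ H                                      # c_t = sum_j alpha_tj h_j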

slide-130
SLIDE 130

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 161

Re-thinking seq2seq for NMT

0 0 0 0 0 0

f f f

ct α t 1 + α t 3 α t 2

</s>

g

hidden layers

<s>

katu

s1

ct = ∑j=1..Tx αtj·hj
αtj = exp(etj) / ∑k=1..Tx exp(etk)
etj = a(st, hj) = st·Wa·hj

t = 1

slide-131
SLIDE 131

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 162

Re-thinking seq2seq for NMT

0 0 0 0 0 0

f f f

ct α t 1 + α t 3 α t 2

</s>

g g

hidden layers hidden layers

<s>

katu katu

s1

s2

ct = ∑j=1..Tx αtj·hj
αtj = exp(etj) / ∑k=1..Tx exp(etk)
etj = a(st, hj) = st·Wa·hj

t = 2

slide-132
SLIDE 132

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 163

Re-thinking seq2seq for NMT

0 0 0 0 0 0

f f f

ct α t 1 + α t 3 α t 2

</s>

g g

hidden layers hidden layers

<s>

katu katu pelikula

s1

s2

ct = ∑j=1..Tx αtj·hj
αtj = exp(etj) / ∑k=1..Tx exp(etk)
etj = a(st, hj) = st·Wa·hj

t = 2

slide-133
SLIDE 133

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 164

Re-thinking seq2seq for NMT

0 0 0 0 0 0

f f f

ct α t 1 + α t 3 α t 2

</s>

g g

hidden layers hidden layers

g

hidden layers

<s>

katu pelikula katu

</s>

pelikula

s1

s2

s3

ct = ∑j=1..Tx αtj·hj
αtj = exp(etj) / ∑k=1..Tx exp(etk)
etj = a(st, hj) = st·Wa·hj

slide-134
SLIDE 134

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 166

Re-thinking seq2seq for NMT

Back to sentence representations:

– Instead of the last hidden state of the encoder ...
– Attention: keep information from all of the encoder hidden states

[figure: encoder and decoder; all encoder hidden states c0, c1, c2 are kept]

c = [c0, c1, c2]

slide-135
SLIDE 135

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 167

Attention is all you need (!?)

Transformer (Vaswani et al. 2017)

  • Extend attention to self-attention:

source (and target) with itself

  • Build hidden states using feed forward networks

State-of-the-art MT
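A sketch of single-head self-attention over one sequence (toy sizes and random weights; a real Transformer adds multiple heads, scaling conventions, residual connections, layer normalisation and position information):

import numpy as np

rng = np.random.default_rng(0)
T, d = 4, 64                                       # sequence length, model size
X = rng.standard_normal((T, d))                    # one embedding per token
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))

Q, K, V = X @ Wq, X @ Wk, X @ Wv                   # queries, keys, values
scores = Q @ K.T / np.sqrt(d)                      # every position attends to every position
A = np.exp(scores - scores.max(axis=-1, keepdims=True))
A /= A.sum(axis=-1, keepdims=True)                 # attention weights alpha, rows sum to 1
C = A @ V                                          # one new hidden state per token, in one pass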

slide-136
SLIDE 136

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 168

Attention is all you need (!?)

Transformer (Vaswani et al. 2017)

c1 α11 + α t 3 α12

Feed forward

slide-137
SLIDE 137

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 169

Attention is all you need (!?)

Transformer (Vaswani et al. 2017)

c2 α 21 + α 23 α 22

c1

Feed forward

slide-138
SLIDE 138

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 170

Attention is all you need (!?)

Transformer (Vaswani et al. 2017)

c2

c1

c3 α 31 + α 33 α 32

Feed forward

slide-139
SLIDE 139

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 171

Attention is all you need (!?)

Transformer (Vaswani et al. 2017)

c2

c1

c3 +

Feed forward Feed forward

+

Feed forward

α11 + α t 3 α12

One forward pass

slide-140
SLIDE 140

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 172

Recap

  • Deep learning:

learn representations for sentences

– Encoders like RNNs (one vector plus attention) – Transformers that keep one vector per token

  • Which task would be the best to learn

representations that can be used in other tasks?

  • Can we transfer this representation

from one task to the other?

  • Can we have all languages in one representation

space?

slide-141
SLIDE 141

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 173

Superhuman abilities: unsupervised MT

https://openreview.net/pdf?id=Sy2ogebAW

slide-142
SLIDE 142
slide-143
SLIDE 143

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 175

Unsupervised MT

  • Given:

– Some books in Chinese
– Some other books in Arabic
– A person who knows neither Chinese nor Arabic

  • Can a person learn to translate from Chinese

to Arabic or vice versa?

  • A machine can!

– Leverage unsupervised cross-lingual embeddings

slide-144
SLIDE 144

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 176

Unsupervised MT

[figure: L1 embeddings → Encoder for L1 → attention + softmax → L2 decoder]

slide-145
SLIDE 145

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 177

Unsupervised MT

[figure: fixed cross-lingual embeddings → shared encoder (L1/L2) → attention + softmax → L1 decoder / L2 decoder]

slide-146
SLIDE 146

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 178

Unsupervised MT

Previously: cross-lingual embeddings Steps:

  • Train as autoencoder:

– Given sentence in L1

train shared encoder with L1 decoder

– Given sentence in L2

train shared encoder with L2 decoder

  • Test:

– Given sentence in L1 produce

sentence in L2

– Given sentence in L2 produce

sentence in L1

Fails!! Why?

[figure: fixed cross-lingual embeddings → shared encoder (L1/L2) → attention + softmax → L1 decoder / L2 decoder]

slide-147
SLIDE 147

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 179

Unsupervised MT

  • Improvements

– Denoising autoencoder: introduce noise in the input
– Back-translation (sketched below):

  • Given sentence in L1 translate to L2 and

train shared encoder with L1 decoder to recover original sentence

  • It works!

– Artetxe et al. (2017): The first paper (by 1 day!)

showing that it is possible to translate from one language to the other without bilingual resources
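A schematic sketch of that training loop; add_noise, train_step and translate are hypothetical stand-ins for the real model code, shown only to make the alternation of denoising and back-translation explicit:

import random

def add_noise(sentence):
    # denoising autoencoder: drop and reorder a few words of the input
    words = sentence.split()
    random.shuffle(words)
    return " ".join(w for w in words if random.random() > 0.1)

def train_step(src, tgt, tgt_lang):
    pass              # placeholder: one update of the shared encoder + tgt_lang decoder

def translate(sentence, tgt_lang):
    return sentence   # placeholder: decode with the current model into tgt_lang

mono_l1, mono_l2 = ["a sentence in L1"], ["a sentence in L2"]
for epoch in range(10):
    for s1, s2 in zip(mono_l1, mono_l2):
        train_step(add_noise(s1), s1, "L1")          # reconstruct L1 from noisy L1
        train_step(add_noise(s2), s2, "L2")          # reconstruct L2 from noisy L2
        train_step(translate(s1, "L2"), s1, "L1")    # back-translation: recover s1 from its L2 translation
        train_step(translate(s2, "L1"), s2, "L2")    # back-translation: recover s2 from its L1 translation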

slide-148
SLIDE 148

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 180

Unsupervised MT

  • Significant improvements last year

– NMT shared decoder – Statistical MT initialized with bilingual dictionary,

plus back-translation iterations

  • Results competitive with the state-of-the-art

in 2014

– No bilingual resource!

slide-149
SLIDE 149

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 181

Unsupervised MT

  • Why does it work at all? An intuition:

– Cross-lingual embedding space provides

information for k-best possible translations

– NMT/SMT figures out how to best “combine”

them

slide-150
SLIDE 150

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 182

Conclusions

  • New research area – unsupervised Machine Translation
  • Performance up, 33 BLEU En-Fr
  • Plenty of margin for improvement
  • Code for replicability

https://github.com/artetxem/undreamt

https://github.com/artetxem/monoses

slide-151
SLIDE 151

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 183

References

  • Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2017.

Unsupervised Neural Machine Translation. In ICLR- 2018.

  • Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2018.

Unsupervised Statistical Machine Translation. In EMNLP-2018.

  • (Forthcoming)
slide-152
SLIDE 152

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 184

Last words

slide-153
SLIDE 153

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 185

DL4NLP

  • We have only scratched the surface of

representation learning for NLP

– How do learned representations match

linguistic expectations?

  • Exciting developments in transfer learning:

– Language generation with GPT-2
– Contextual embeddings with BERT
– Multilingual sentence representations

  • Language understanding also requires:

– Complex structure beyond sequence
– Function words beyond embeddings (every, not)

slide-154
SLIDE 154

DL4NLP 186

Resources

  • Books!!
  • Tutorials and implementations in Tensorflow (Keras) and Pytorch websites
  • Stanford NLP and DL courses: https://nlp.stanford.edu/teaching/
  • NYU NLP and DL courses:

https://www.nyu.edu/projects/bowman/teaching.shtml

  • Coursera, Udacity, Nvidia Deep Learning courses
slide-155
SLIDE 155

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 187

And...

  • Short 20 hour, 3 day summer course in July

in San Sebastian (Keras) http://ixa2.si.ehu.es/deep_learning_seminar/ (pre-registration closes 15th of May!)

  • Longer version in January

35 hours, 3 weeks (Tensorflow) (to be announced)

  • Master and PhD opportunities
slide-156
SLIDE 156

DL4NLP in 2 hours - Eneko Agirre - Pavia 2019 188

THANKS!

@eagirre e.agirre@ehu.eus

Acknowledgements:

  • Overall slides: Sam Bowman (NYU), Chris Manning and Richard Socher (Stanford)
  • All source url’s listed in the slides