Modern Neural-Networks approaches to NLP — J.-C. Chappelier


SLIDE 1

Modern Neural-Networks approaches to NLP

J.-C. Chappelier

Laboratoire d’Intelligence Artificielle Faculté I&C

SLIDE 2

Objectives of this lecture


So, is this course a Machine Learning Course?

◮ NLP makes use of Machine Learning (as would Image Processing, for instance)
◮ but good results require:
  ◮ good preprocessing
  ◮ good data (to learn from), relevant annotations
  ◮ good understanding of the pros/cons, features, outputs, results, ...

☞ The goal of this course is to provide you with the core concepts and baseline techniques to achieve the above-mentioned requirements.

(from "Introduction to INLP", J.-C. Chappelier & M. Rajman – slide 22 / 45)

CAVEAT/REMINDER: The goal of this lecture is to give a broad overview of modern Neural Network approaches to NLP. This lecture is worth deepening with a full Deep Learning course, e.g.:
◮ F. Fleuret (Master), Deep Learning (EE-559)
◮ J. Henderson (EDOC), Deep Learning For Natural Language Processing (EE-608)

SLIDE 3

Contents

➀ Introduction
  ◮ What is it all about? What does it change?
  ◮ Why now?
  ◮ Is it worth it?
➁ How does it work?
  ◮ words (word2vec (CBoW, Skip-gram), GloVe, fastText)
  ◮ documents (RNN, CNN, LSTM, GRU)
➂ Conclusion
  ◮ Advantages and drawbacks
  ◮ Future

SLIDE 4

What is it all about?

The modern approach to NLP heavily emphasizes "Neural Networks" and "Deep Learning".
Two key ideas (which are, in fact, quite independent):
◮ make use of a more abstract/algebraic representation of words: use "word embeddings":
  ◮ go from sparse (& high-dimensional) to dense (& less high-dimensional) representations of documents
◮ make use of ("deep") neural networks (= trainable non-linear functions)
Other characteristics:
◮ supervised tasks
◮ better results (at least on usual benchmarks)
◮ less(?) preprocessing/"feature selection"
◮ CPU- and data-hungry

SLIDE 5

How does it work?

◮ Key idea #1: Learning Word Representations
  Typical NLP: Corpus → some algorithm → word/token/n-gram vectors
  Key idea in recent approaches: can we do it in a task-independent way, so as to reduce whatever NL P(rocessing) to some algebraic vector manipulation?
  No longer start "core (NL)P" from words, but from vectors (learned once and for all) that capture general syntactic and semantic information.
◮ Key idea #2: use Neural Networks (NN) to do the "from vectors to output" job
  A NN is simply a R^n → R^m non-linear function with (many) parameters

SLIDE 6

Neural Networks (NN): a 4-slide primer

◮ NN are non-linear, non-parametric (= models with many parameters) functions
◮ The ones we are talking about here are for supervised learning
  ☞ they make use of a loss function to evaluate how well their output fits the desired output
  usual loss: the corpus (negative) log-likelihood: − log P(output | input)
◮ non-linearity: localized on each "neuron" (1-D non-linear function), sigmoid-like (e.g. the logistic function 1/(1+e^{-x})) or ReLU (weird name for a very simple function: max(0, x))

[plots: sigmoid(x) and max(0, x)]

◮ the non-linearity is applied to a linear combination of the inputs: the dot-product of the input (vector) and the parameters ("weight" vector)
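
A minimal sketch (not from the slides; values are arbitrary illustrations) of one artificial neuron in Python/NumPy: a non-linearity applied to the dot-product of input and weights.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # logistic non-linearity 1/(1+e^-z)

def relu(z):
    return np.maximum(0.0, z)         # max(0, z)

def neuron(x, w, b, activation=sigmoid):
    """One neuron: non-linearity applied to the dot-product of input and weights (plus a bias)."""
    return activation(np.dot(w, x) + b)

x = np.array([0.5, -1.2, 3.0])        # input vector (arbitrary values)
w = np.array([0.1, 0.4, -0.2])        # weight vector (the parameters to be learned)
print(neuron(x, w, b=0.1, activation=sigmoid))
print(neuron(x, w, b=0.1, activation=relu))
```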

SLIDE 7

Softmax output function

Another famous non-linearity is the "softmax" function: softmax = generalization of the logistic function from 1-D to n-D (see e.g. "Logistic Regression", 2 weeks ago)

Purpose: turns whatever list of values into a probability distribution:
(x1, ..., xm) → (s1, ..., sm) where si = exp(xi) / Σ_{j=1..m} exp(xj)

Examples:
x = (7, 12, −4, 8, 4) → s ≈ (0.0066, 0.9752, 1.1e−7, 0.0179, 0.0003)
x = (0.33, 0.5, 0.1, 0.07) → s ≈ (0.266, 0.316, 0.211, 0.206)
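
A minimal NumPy sketch of the softmax above; it reproduces the two examples. (The shift by max(x) is a standard numerical-stability trick, not part of the slide; it does not change the result.)

```python
import numpy as np

def softmax(x):
    x = np.asarray(x, dtype=float)
    e = np.exp(x - x.max())        # subtracting the max avoids overflow, result unchanged
    return e / e.sum()             # s_i = exp(x_i) / sum_j exp(x_j)

print(softmax([7, 12, -4, 8, 4]))      # ~ [0.0066, 0.9752, 1.1e-07, 0.0179, 0.0003]
print(softmax([0.33, 0.5, 0.1, 0.07])) # ~ [0.266, 0.316, 0.211, 0.206]
```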

SLIDE 8

Multi-Layer Perceptrons (MLP) a.k.a. Feed-Forward NN (FFNN)

MLP (Rumelhart, 1986): neurons are organized in (a few) layers, from input to output.
Parameters: the "weights" of the network = the input weights of each neuron.
MLP are universal approximators: input: x1, ..., xn (n-dimensional real vector), output: ≃ f(x1, ..., xn) ∈ R^m, to whatever precision decided a priori.
In a probabilistic framework: very often used to approximate the posterior probability P(y1, ..., ym | x1, ..., xn).
Convergence to a local minimum of the loss function (often the mean quadratic error).
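
A minimal sketch (illustration only; dimensions and random weights are arbitrary) of an MLP forward pass: each layer applies a non-linearity to an affine transform of the previous layer, with a softmax output approximating a posterior distribution.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def mlp_forward(x, layers):
    """layers: list of (W, b) pairs; hidden layers use ReLU, the output layer uses softmax."""
    h = x
    for W, b in layers[:-1]:
        h = relu(W @ h + b)                # one layer: non-linearity of an affine transform
    W, b = layers[-1]
    return softmax(W @ h + b)              # ~ P(y_1..y_m | x_1..x_n)

rng = np.random.default_rng(0)
n, hidden, m = 4, 8, 3                     # input dim, hidden dim, number of classes (toy sizes)
layers = [(rng.normal(size=(hidden, n)), np.zeros(hidden)),
          (rng.normal(size=(m, hidden)), np.zeros(m))]
print(mlp_forward(rng.normal(size=n), layers))   # a probability distribution over 3 classes
```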

SLIDE 9

NN learning procedure

General learning procedure (see e.g. Baum-Welch): ➀ Initialize the parameters ➁ Then loop over training data (supervised):

  • 1. Compute (using NN) output from given input
  • 2. Compute loss by comparing output to reference
  • 3. Update parameters: “backpropagation”:

update proportional to the gradient of the loss function

  • 4. Stop when some criterion is fulfilled

(e.g. loss function is small, validation-set error increases, number of steps is reached)
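
The procedure above, as a schematic sketch for a single-layer logistic model (gradients written out by hand; a real NN would backpropagate through all layers). The data, learning rate and stopping threshold are arbitrary illustrations.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                  # toy inputs
y = (X[:, 0] - X[:, 1] > 0).astype(float)      # toy reference outputs (supervised)
w, b, lr = np.zeros(3), 0.0, 0.1               # 1) initialize the parameters

for step in range(200):                        # 2) loop over training data
    out = sigmoid(X @ w + b)                   #    compute output from input
    loss = -np.mean(y * np.log(out + 1e-9) + (1 - y) * np.log(1 - out + 1e-9))
    grad = out - y                             #    compare output to reference
    w -= lr * X.T @ grad / len(y)              #    update proportional to the gradient
    b -= lr * grad.mean()
    if loss < 0.1:                             #    stop when some criterion is fulfilled
        break
```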

SLIDE 10

about Deep Learning (more later)

◮ not all Neural Network (NN) models are deep learners
◮ there is NO need for deep learning to get good "word" embeddings
◮ models: convolutional NN (CNN) or recurrent NN (RNN, incl. LSTM)
◮ they still suffer from the same old problems: overfitting and computational power
A quote from Prof. Michael Jordan (IEEE Spectrum, 2014): "deep learning is largely a rebranding of neural networks, which go back to the 1980s. They actually go back to the 1960s; it seems like every 20 years there is a new wave that involves them. In the current wave, the main success story is the convolutional neural network, but that idea was already present in the previous wave."
Why such a rebirth now? ☞ much more data (user-data pillage), more computational power (GPUs)

SLIDE 11

What is Deep Learning after all?

composition of many functions (neural-net layers) taking advantage of
◮ the chain rule (a.k.a. "back-propagation")
◮ stochastic gradient descent
◮ parameter sharing/localization of computation (a.k.a. "convolutions")
◮ parallel operations on GPUs
This does not differ much from the networks of the 90s: several tricks and algorithmic improvements backed up by

  • 1. large data sets (user-data pillage)
  • 2. large computational resources (GPUs popularized)
  • 3. enthusiasm from academia and industry (hype)

SLIDE 12

Corpus-based linguistics: the evolution

◮ before corpora (< 1970): hand-written rules
◮ first wave (≃ 1980–2015): probabilistic models (HMM, SCFG, CRF, ...)
◮ neural nets and "word" embeddings (1986, 1990, 1997, 2003, 2011, 2013+):
  ◮ MLP: David Rumelhart, 1986
  ◮ RNN: Jeffrey Elman, 1990
  ◮ LSTM: Hochreiter and Schmidhuber, 1997
  ◮ early NN word embeddings: Yoshua Bengio et al., 2003; Collobert & Weston (et al.), 2008 & 2011
  ◮ word2vec (2013), GloVe (2014)
  ◮ ...
◮ transfer learning (2018–): ULMFiT (2018), ELMo (2018), BERT (2018), OpenAI GPT-2 (2019)
  use even more than "word" embeddings: pre-trained early layers feed the later layers of some NN, followed by a (shallow?) task-specific architecture that is trained in a supervised way

SLIDE 13

Is it worth it?

Improved performance on well-known benchmarks

see e.g. https://aclweb.org/aclwiki/POS_Tagging_(State_of_the_art), https://nlpoverview.com/#8, http://nlpprogress.com/

Constituency Parsing on the "Wall Street Journal" corpus:

Model                                   Publication             F1 (%)
Probabilistic context-free grammars     Petrov et al. (2006)    91.80
Recursive neural networks               Socher et al. (2011)    90.29
Feature-based transition parsing        Zhu et al. (2013)       91.30
seq2seq learning with LSTM+Attention    Vinyals et al. (2015)   93.50

PoS-tagging on the "Wall Street Journal" corpus:

Name                 Technique            Publication                  Accuracy (%)
TnT                  HMM                  Brants (2000)                96.5
GENiA Tagger         MaxEnt               Tsuruoka et al. (2005)       97.0
—                    Averaged Perceptron  Collins (2002)               97.1
SVMTool              SVM                  Giménez and Márquez (2004)   97.2
Stanford Tagger 2.0  MaxEnt               Manning (2011)               97.3
structReg            CRF                  Sun (2014)                   97.4
Flair                LSTM-CRF             Akbik et al. (2018)          97.8

SLIDE 14

Contents

➀ Introduction
  ◮ What is it all about? What does it change?
  ◮ Why now?
  ◮ Is it worth it?
➁ How does it work?
  ☞ words
    ◮ word2vec (CBoW, skip-gram)
    ◮ GloVe
    ◮ fastText
  ◮ documents
➂ Conclusion
  ◮ Advantages and drawbacks
  ◮ Future

SLIDE 15

Starting point (reminder)

N "row"

  • bjects

(e.g. documents) x(i) characterized by m "features" (e.g. "words") x(i)

j

  • bjects

features i j = "importance" of feature j for object i N m

(i) j

x

(i) j

x

Vector space model:

1

t

2

t

3

t

1

d

2

d

3

d

◮ tokens/words define the axis ◮ documents are point in the vector space

SLIDE 16

From "word" vectors to "word" embeddings

embedding = vectorial representation + dimension reduction:
from a sparse (m ≃ 10^4–10^5) to a dense (= more compact) representation (m ≃ 10^2–10^3)
Why should dense vectors be better?
◮ More efficient (lower dimension: less data to handle, store, estimate, ...)
◮ capture "the essence" (capture statistical invariants): less noisy? (☞ generalize better)

SLIDE 17

Distributional Semantics

Idea (dates back to Harris (1954) and Firth (1957))

There is a high degree of correlation between the observable co-occurrence characteristics of a term and its meaning.

Example
◮ Some X, for instance, naturally attack rats.
◮ The X on the roof was exposing its back to the shine of the sun.
◮ He heard the mewings of X in the forest.
◮ X is a: ...
Typically, word embeddings are trained by "predicting a word based on its context" (or vice-versa) on a large (unlabeled) corpus.

SLIDE 18

Key idea: illustration

[figure: "word 1", "word 2" and "context A" in the embedding space]

SLIDE 19

Word Embeddings

"Word embedding":
◮ numerical representation of "words" (/"tokens")
◮ a.k.a. "Semantic Vectors", "Distributional Semantics"
◮ objective: relative similarities of representations correlate with syntactic/semantic similarity of words/phrases
◮ two key ideas:
  1. representation(composition of words) = vectorial-composition(representations(word))
     for instance: representation(phrase) = Σ_{word ∈ phrase} representation(word)
  2. remove sparseness, compactify the representation: dimension reduction
◮ have been around for a long time

Harris, Z. (1954), "Distributional structure", Word 10(23):146–162. Firth, J.R. (1957), "A synopsis of linguistic theory 1930-1955", Studies in Linguistic Analysis. pp 1–32.

SLIDE 20

Word Embeddings: different techniques

“Many recent publications (and talks) on word embeddings are surprisingly oblivious of the large body of previous work [...]”

(from https://www.gavagai.se/blog/2015/09/30/a-brief-history-of-word-embeddings/)

Main techniques:
◮ co-occurrence matrix, often reduced (LSI, Hellinger-PCA (2013), GloVe (2014))
◮ probabilistic/distributional (DSIR, LDA)
◮ shallow (Mikolov et al. 2013) or deep Neural Networks (ELMo)
There are theoretical and empirical correspondences between these different models
[see e.g. Levy, Goldberg and Dagan (2015), Pennington et al. (2014), Österlund et al. (2015)].

Popular word embeddings do not come from Deep Learning, but they can then serve as input to Deep Learners.

SLIDE 21

Word embedding “geometry”

◮ The geometry of embeddings should account for desired properties (e.g. syntax, semantics, synonymy, word classes, ...), e.g. predict a new word representation (embedding) from the sum of the embeddings of the words around it
◮ Word embeddings indeed exhibit some semantic compositionality. Some theoretical justification for this behavior was recently given by Gittens et al. (2017): words need to be uniformly distributed in the embedding space.

  • A. Gittens et al. (2017), "Skip-Gram – Zipf + Uniform = Vector Additivity", proc. ACL.
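
A toy sketch of the compositional use described above: sum the embeddings of surrounding words and compare the result to candidate word vectors with cosine similarity. The 2-D vectors are made up purely to show the mechanics, not to make any semantic claim.

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# toy 2-D "embeddings", made up for illustration only
emb = {"black": np.array([0.9, 0.1]), "cat": np.array([0.8, 0.3]),
       "the":   np.array([0.1, 0.1]), "white": np.array([0.85, 0.15])}

context_sum = emb["black"] + emb["the"] + emb["white"]     # sum of surrounding-word embeddings
for w, v in emb.items():
    print(w, round(cosine(v, context_sum), 3))              # similarity of each candidate to the composed context
```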

SLIDE 22

word2vec (Mikolov et al. 2013)

Predict a new word representation (embedding) from the sum of the embeddings of the words around it.
context = (2k+1)-gram around (not including) the word: w_{i−k} ··· w_{i−1} (w_i) w_{i+1} ··· w_{i+k}
Example: "The black cat ate the white mouse". With k = 2 and w = "ate", then c = "black cat the white" (if no other preprocessing).
word2vec comes in 2 flavors:
◮ CBoW (Continuous Bag-of-Words): predicts the current "word" based on its context
◮ Skip-gram: predicts the context from the current "word"

  • T. Mikolov et al. (2013a), "Distributed Representations of Words and Phrases and their Compositionality", proc. NIPS.
  • T. Mikolov et al. (2013b), "Efficient Estimation of Word Representations in Vector Space", proc. ICLR.
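
A small sketch of the (2k+1)-gram context extraction described above (tokenization reduced to a whitespace split for the illustration; a real pipeline would apply its own preprocessing).

```python
def contexts(tokens, k=2):
    """Yield (word, context) pairs; context = up to k tokens on each side, the word itself excluded."""
    for i, w in enumerate(tokens):
        left = tokens[max(0, i - k):i]
        right = tokens[i + 1:i + 1 + k]
        yield w, left + right

sentence = "The black cat ate the white mouse".split()
for word, ctx in contexts(sentence, k=2):
    if word == "ate":
        print(word, ctx)   # ate ['black', 'cat', 'the', 'white']
```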

SLIDE 23

CBoW architecture

[figure: CBoW architecture] © Moucrowap, CC-BY-SA-4.0 2015

SLIDE 24

Skip-gram architecture

[figure: Skip-gram architecture] © Moucrowap, CC-BY-SA-4.0 2015

SLIDE 25

word2vec key ideas

◮ idea #1: unsupervised co-learning of the context representation (c) and the word representation (w), so as to maximize either P(c|w) (skip-gram model) or P(w|c) (CBoW model)
◮ idea #2 ("negative sampling"): also minimize P(w′|c) for words w′ not having c as context
Another key simplification in practice:
◮ turn word prediction (P(w|c)) into binary classification (P(y = 1|w, c))
Example: turn P(X | black cat X the white) (for all words X) into P(Ok | black cat ate the white) (1 number)

SLIDE 26

Illustration

[figure: "word 1", "word 2", "context A", and negative contexts B and C in the embedding space]
SLIDE 27

word2vec method

More formally: the "word embeddings" (i.e. vectors) h_i = h(w(i)) ∈ R^d (for each word w(i) ∈ L) are optimized at the same time as the "reverse projections" m_j ∈ R^d (i.e. the matrix M = (m_j) projects "word embeddings" back to the input space; this corresponds to the weights of the output layer), such that the context log-likelihood L = − Σ_{w ∈ corpus} log Q(c, w) is minimized, where:
◮ in CBoW, using a softmax output layer, Q(c, w) = P(w|c) could be modeled as
  P(w(i)|c) = exp(m_i · h(c)) / Σ_{w(k) ∈ L} exp(m_k · h(c))
  (for a context c of word w(i), h(c) = Σ_{w ∈ c} h(w))
◮ and in skip-gram, Q(c, w) = P(c|w) is modeled similarly.

SLIDE 28

word2vec actual loss

In fact, the softmax is too expensive to compute (and less stable, it seems), so, rather than a softmax, the output is directly σ(m_i · h(c)), with σ() the sigmoid function.
This in fact replaces Q(c, w) = P(w|c) with Q(c, w) = P(y = 1|w, c), the probability of a genuine co-occurrence (i.e. it simplifies a word-prediction task into a binary classification task)...
...which then leads to the idea of learning P(y = 0|w′, c) as well (for some other words w′): negative sampling.
To do this, word2vec draws R negative random samples from the word distribution; the loss function then becomes:
  Σ_{w ∈ corpus} [ log(1 + exp(−m_i · h(c))) + Σ_{r=1..R} log(1 + exp(+m_{j(r)} · h(c))) ]
(where c is the context of w, i is the index of w in the lexicon (i.e. w = w(i)), and j(r) is drawn at random)
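
A NumPy sketch of the loss above for one (word, context) pair with R negative samples. The vectors and the "sampling" are dummy placeholders; a real implementation would loop over the corpus, draw negatives from the word distribution, and update h and m by gradient descent.

```python
import numpy as np

def sgns_loss(h_c, m_pos, m_negs):
    """h_c: summed context embedding h(c); m_pos: output vector m_i of the observed word;
    m_negs: output vectors m_j(r) of the R negative samples."""
    loss = np.log1p(np.exp(-m_pos @ h_c))          # log(1 + exp(-m_i . h(c)))
    for m_neg in m_negs:
        loss += np.log1p(np.exp(m_neg @ h_c))      # log(1 + exp(+m_j(r) . h(c)))
    return loss

rng = np.random.default_rng(0)
d, R = 50, 5                                        # embedding dimension and number of negatives (toy values)
h_c = rng.normal(scale=0.1, size=d)                # stand-in for h(c), the sum of context embeddings
m_pos = rng.normal(scale=0.1, size=d)              # stand-in for m_i
m_negs = rng.normal(scale=0.1, size=(R, d))        # stand-ins for the m_j(r)
print(sgns_loss(h_c, m_pos, m_negs))
```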

SLIDE 29

GloVe (Pennington et al. 2014)

GloVe (Global Vectors) is another famous (non-NN) "word" embedding method which works directly on the word co-occurrence matrix:
◮ normalizing the co-occurrence counts,
◮ log-smoothing them,
◮ then factorizing the matrix to get lower-dimensional representations by minimizing some "reconstruction loss" (difference between the dot-product of word embeddings and the log of the probability of co-occurrence)
GloVe embeddings work better on some data sets, while word2vec embeddings work better on others.
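
The "reconstruction loss" can be sketched as follows (a simplified rendition of the slide's description: dot-products of word vectors vs. log co-occurrences; the actual GloVe objective in Pennington et al. (2014) also adds bias terms and a weighting function f(X_ij), omitted here).

```python
import numpy as np

def glove_like_loss(W, W_tilde, X):
    """Squared difference between dot-products of word/context vectors and log co-occurrence counts.
    W, W_tilde: (V, d) word and context matrices; X: (V, V) co-occurrence counts."""
    loss = 0.0
    for i, j in zip(*np.nonzero(X)):                       # only observed co-occurrences
        loss += (W[i] @ W_tilde[j] - np.log(X[i, j])) ** 2
    return loss

rng = np.random.default_rng(0)
V, d = 6, 4                                                 # toy vocabulary size and embedding dimension
X = rng.integers(0, 5, size=(V, V)).astype(float)           # toy co-occurrence counts
W, W_tilde = rng.normal(size=(V, d)), rng.normal(size=(V, d))
print(glove_like_loss(W, W_tilde, X))                       # to be minimized w.r.t. W and W_tilde
```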

  • J. Pennington, R. Socher, and C. D. Manning (2014), "GloVe: Global Vectors for Word Representation", proc. EMNLP.

SLIDE 30

In practice

In practice, you can either:
◮ construct your own embeddings
  CBoW: for corpora with short sentences but a high number of samples
  Skip-gram: for corpora with long sentences and a low number of samples (infrequent words)
  or GloVe, fastText, ELMo
◮ use existing word embeddings
  word embeddings generally provide helpful features without the need for lengthy training (for the NN)
Some software/models: word2vec, GloVe, Gensim, fastText, ELMo, ...
Advice: when using already computed "word" embeddings, use the same preprocessing that was used to build them: get your vocabulary (words? tokens?) as close to the embeddings' as possible, e.g. gensim.utils.tokenize(): "maximal contiguous sequences of alphabetic characters (no digits!)" (sic)
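
For instance, constructing and querying embeddings with gensim might look like the sketch below (parameter names assume gensim ≥ 4.0, where the dimension is vector_size and sg=1 selects skip-gram / sg=0 CBoW; check your installed version). Note the same tokenizer is used for training and querying, as advised above.

```python
from gensim.models import Word2Vec
from gensim.utils import tokenize

raw = ["The black cat ate the white mouse.",
       "The white cat ate the black mouse."]
sentences = [list(tokenize(s, lowercase=True)) for s in raw]   # same preprocessing throughout

model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)
print(model.wv["cat"][:5])            # first 5 components of the learned 50-dim vector
print(model.wv.most_similar("cat"))   # nearest neighbours by cosine similarity
```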

SLIDE 31

fastText (Joulin, Bojanowski, Mikolov et al. 2017)

Aim: address the token/OoV issue by using embeddings of character n-grams (< token).
More useful for less semantic but more lexical tasks (e.g. morphology, POS-tagging or even NER); also useful for out-of-vocabulary words.
fastText = skip-gram on character n-grams.
The method is fast, which allows quick training of new models on large corpora; it looks promising in terms of speed, scalability, and effectiveness.
A better model: Embeddings from Language Models (ELMo): compute a different word embedding for different contexts.
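
A sketch of the character n-gram decomposition fastText relies on (boundary markers < and > as in Bojanowski et al. (2017); the n-gram range is a parameter, 3–6 by default in the paper). Each n-gram gets its own vector, and a word vector is the sum of its n-gram vectors, which is why out-of-vocabulary words still get a representation.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of '<word>', fastText-style; the full marked word is kept as well."""
    marked = f"<{word}>"
    grams = [marked[i:i + n]
             for n in range(n_min, n_max + 1)
             for i in range(len(marked) - n + 1)]
    return grams + [marked]

print(char_ngrams("where", n_min=3, n_max=3))
# ['<wh', 'whe', 'her', 'ere', 're>', '<where>']
```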

  • A. Joulin et al. (2017), "Bag of Tricks for Efficient Text Classification", proc. EACL.
  • P. Bojanowski et al. (2017), "Enriching Word Vectors with Subword Information", Trans. ACL, vol. 5.
  • M. E. Peters et al. (2018), "Deep Contextualized Word Representations", proc. NAACL.

SLIDE 32

Contents

➀ Introduction
  ◮ What is it all about? What does it change?
  ◮ Why now?
  ◮ Is it worth it?
➁ How does it work?
  ◮ words
  ☞ documents
    ◮ Convolutional Neural Networks (CNN)
    ◮ Recurrent Neural Networks (RNN): LSTM, GRU
➂ Conclusion
  ◮ Advantages and drawbacks
  ◮ Future

SLIDE 33

From "words" to sentences/documents

word2vec: how to go from tokens to compound words, phrases, sentences, documents?
Compounds/Named Entities/Phrases: idioms like "hot potato" or named entities such as "Boston Globe" do not represent the combination of the meanings of the individual words. One solution to this problem, as explored by Mikolov et al. (2013), is to identify such phrases based on word co-occurrence and train embeddings for them separately. More recent methods have explored directly learning n-gram embeddings from unlabeled data.
How to represent a document: average/sum of its word vectors? ☞ not so good
Solution: an effective feature function that extracts higher-level features from the constituent token n-grams: CNN and RNN

SLIDE 34

Convolutional Neural Nets (CNN; Fukushima (1980), Le Cun (1998))

Original key idea (inspired by the visual cortex): share weights

SLIDE 35

CNN for NLP (example)

[figure: CNN for sentence classification] (source: Zhang and Wallace (2015))
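
A NumPy sketch of the core operation in such a text CNN, in the spirit of the figure: a filter slides over windows of h consecutive word embeddings (weights shared across positions), followed by max-over-time pooling. Dimensions and values are arbitrary illustrations.

```python
import numpy as np

def conv_max_pool(E, F, b=0.0):
    """E: (sentence_len, d) word-embedding matrix; F: (h, d) one convolution filter.
    Returns the max-over-time of the feature map (one number per filter)."""
    h = F.shape[0]
    feats = [np.maximum(0.0, np.sum(E[i:i + h] * F) + b)    # ReLU(filter . window + b)
             for i in range(E.shape[0] - h + 1)]
    return max(feats)                                       # max-over-time pooling

rng = np.random.default_rng(0)
E = rng.normal(size=(7, 5))                                 # 7 tokens, 5-dim embeddings (toy values)
filters = [rng.normal(size=(3, 5)) for _ in range(4)]       # 4 filters over 3-token windows
print([round(conv_max_pool(E, F), 3) for F in filters])     # a 4-dim sentence feature vector
```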

SLIDE 36

Recurrent Neural Networks (Elman 1990)

Designed to deal with sequences (of vectors) by composing former intermediate representations (= outputs):
the output is a function of the current input and the previous output.

[figure: recurrent cell unrolled over time] © François Deloche, CC-BY-SA-4.0
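
That recurrence can be written as a single update rule; a minimal sketch (Elman-style, tanh non-linearity, arbitrary dimensions and random weights):

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b):
    """New state = non-linear function of the current input and the previous state."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b)

rng = np.random.default_rng(0)
d, n = 5, 8                                   # input (embedding) dim, hidden dim (toy sizes)
W_xh, W_hh, b = rng.normal(size=(n, d)), rng.normal(size=(n, n)), np.zeros(n)
h = np.zeros(n)                               # initial state
for x_t in rng.normal(size=(4, d)):           # a toy sequence of 4 input vectors
    h = rnn_step(x_t, h, W_xh, W_hh, b)       # the state carries information along the sequence
print(h)
```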

RNN are generalized to:
◮ bidirectional RNN
◮ RNN with gates:
  ◮ Long Short-Term Memory (LSTM; Hochreiter and Schmidhuber (1997))
  ◮ Gated Recurrent Unit (GRU; Cho et al. (2014))

SLIDE 37

Typical simple RNN for NLP

[diagram: x_i → RNN → Classif. → Softmax → y]
◮ x_i: "word" embedding of the i-th word/token
◮ y: output = probability distribution, e.g. y_j ≃ P(Class_j | w_1 ... w_i)
◮ "Classif.": an MLP
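
A possible rendition of this pipeline in PyTorch (a sketch with made-up sizes; the slide does not prescribe a particular library, and the classifier here is a single linear layer rather than the MLP mentioned above).

```python
import torch
import torch.nn as nn

class RNNClassifier(nn.Module):
    def __init__(self, vocab_size=1000, emb_dim=50, hidden=64, n_classes=3):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)        # token id -> "word" embedding x_i
        self.rnn = nn.RNN(emb_dim, hidden, batch_first=True)
        self.clf = nn.Linear(hidden, n_classes)             # "Classif." (here a single linear layer)

    def forward(self, token_ids):                           # token_ids: (batch, seq_len)
        _, h_last = self.rnn(self.emb(token_ids))           # h_last: (1, batch, hidden)
        return torch.softmax(self.clf(h_last[0]), dim=-1)   # y_j ~ P(Class_j | w_1 ... w_i)

model = RNNClassifier()
tokens = torch.randint(0, 1000, (2, 7))                     # 2 dummy sentences of 7 token ids
print(model(tokens))                                        # 2 probability distributions over 3 classes
```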

SLIDE 38

RNN with gates

Limitations of classical RNNs:
◮ vanishing gradients: addressed with a gate neuron/vector, learning to forget some parts of the memory
◮ exploding gradients: addressed by gradient clipping
Gate neuron: a 0/1 selection (elementwise product) of the input components — a filter (= gate) on the input/memory information.

[figure: gating mechanism] (source: http://colah.github.io/posts/2015-08-Understanding-LSTMs/)
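
Gating as an elementwise filter, in a two-line sketch (toy values; in an actual LSTM/GRU the gate is computed by a learned sigmoid layer over the input and the previous state):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

memory = np.array([0.7, -1.3, 2.0, 0.4])          # some memory/state vector (toy values)
gate = sigmoid(np.array([6.0, -6.0, 6.0, 0.0]))   # ~ (1, 0, 1, 0.5): what to keep / forget
print(gate * memory)                              # elementwise product: the filtered information
```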

SLIDE 39

LSTM vs. GRU

[figure: LSTM and GRU cells] (source: Chung et al. (2014))

SLIDE 40

Neuron-type summary

[figure: neuron-type summary] from D. Jurafsky & J. H. Martin, Speech and Language Processing, draft 3rd edition

SLIDE 41

Example applications (1/2)

Image caption generator:

[figure: image caption generator] from Vinyals et al. (2015)

SLIDE 42

Example applications (2/2)

Image question answering engine:

[figure: image question answering] from Malinowski et al. (2015)

SLIDE 43

Conclusion

The modern approach to NLP heavily emphasizes "Neural Networks" and "Deep Learning".
Two key ideas (which are, in fact, quite independent):
◮ "word embeddings":
  ◮ go from sparse (& high-dimensional) to dense (& less high-dimensional) representations of documents
◮ make use of ("deep") neural networks (= trainable non-linear functions)
Models:
◮ word embeddings: word2vec (CBoW, Skip-gram), GloVe, fastText, ELMo
◮ neural networks: CNN, LSTM, GRU
(software: spaCy, Keras, Torch/PyTorch, TensorFlow, scikit-learn, DarkNet)

SLIDE 44

Pros and Cons

◮ Best performance, but requires lots of data (unsupervised for word embeddings, supervised for the task-oriented NN) and lots of CPU(/GPU)
◮ word embeddings are dependent on the applications in which they are used. Labutov and Lipson ("word re-embedding", 2013) proposed task-specific embeddings which retrain the word embeddings to align them with the current task space.
◮ Traditional word embedding algorithms assign a distinct vector to each word. This makes them unable to account for polysemy. Several approaches address this issue, e.g. Upadhyay et al. (2017), ELMo (M. E. Peters et al. (2018)).
◮ discussions on the relevance of word embeddings in the long run have cropped up recently, e.g. Lucy and Gauthier (2017) tried to evaluate how well word vectors capture the necessary facets of conceptual meaning. The authors discovered severe limitations in the perceptual understanding of the concepts behind the words, which cannot be inferred from distributional semantics alone.

SLIDE 45

Future(?): Transfer Learning

Transfer learning (2018–): ULMFiT (2018), ELMo (2018), BERT (2018), OpenAI GPT-2 (2019)
use even more than "word" embeddings: early layers pre-trained on some task 'A' feed the later layers of some NN trained in a supervised way on a task 'B'
based on "transformer" models: include "attention" layers (Vaswani et al., NIPS 2017) in FFNN

SLIDE 46

References

◮ D. Jurafsky & J. H. Martin, Speech and Language Processing, draft 3rd edition, chap. 6, 7 & 9, https://web.stanford.edu/~jurafsky/slp3/, 2019.
◮ Y. Goldberg, Neural Network Methods for Natural Language Processing, Morgan & Claypool Publishers, 2017. https://www.morganclaypool.com/doi/abs/10.2200/S00762ED1V01Y201703HLT037

SLIDE 47

Word Embeddings: some references

  • R. Lebret and R. Collobert (2013), "Word Embeddings through Hellinger PCA", proc. EACL.
  • T. Mikolov et al. (2013a), "Distributed Representations of Words and Phrases and their Compositionality", proc. NIPS.
  • T. Mikolov et al. (2013b), "Efficient Estimation of Word Representations in Vector Space", proc. ICLR.
  • J. Pennington, R. Socher, and C. D. Manning (2014), "GloVe: Global Vectors for Word Representation", proc. EMNLP.
  • O. Levy, Y. Goldberg and I. Dagan (2015), "Improving distributional similarity with lessons learned from word embeddings", Trans. ACL, vol. 3, pp. 211–225.
  • Österlund et al. (2015), "Factorization of Latent Variables in Distributional Semantic Models", proc. EMNLP.
  • A. Joulin et al. (2017), "Bag of Tricks for Efficient Text Classification", proc. EACL.
  • P. Bojanowski et al. (2017), "Enriching Word Vectors with Subword Information", Trans. ACL, vol. 5.
  • A. Gittens et al. (2017), "Skip-Gram – Zipf + Uniform = Vector Additivity", proc. ACL.
  • M. E. Peters et al. (2018), "Deep Contextualized Word Representations", proc. NAACL.
