SLIDE 1

Lecture 17: Language Modelling 2

CS109B Data Science 2

Pavlos Protopapas, Mark Glickman, and Chris Tanner

SLIDE 2

Outline

  • Seq2Seq + Attention
  • Transformers + BERT
  • Embeddings

SLIDE 3

Illustration: http://jalammar.github.io/illustrated-bert/

SLIDE 4

ELMo: Stacked Bi-directional LSTMs

  • ELMo yielded incredibly good word embeddings, which produced state-of-the-art results when applied to many NLP tasks.
  • Main ELMo takeaway: given enough training data, having tons of explicit connections between your vectors is useful (the system can determine how best to use context).

ELMo slides: https://www.slideshare.net/shuntaroy/a-review-of-deep-contextualized-word-representations-peters-2018

SLIDE 5

REFLECTION

So far, for all of our sequential modelling, we have been concerned with emitting 1 output per input datum. Sometimes, though, a sequence is the smallest granularity we care about (e.g., an English sentence).


SLIDE 7

Outline

  • Seq2Seq + Attention
  • Transformers + BERT
  • Embeddings

SLIDE 8

Sequence-to-Sequence (seq2seq)

  • If our input is a sentence in Language A and we wish to translate it to Language B, it is clearly sub-optimal to translate word by word (which is what our current models are suited to do).
  • Instead, let a sequence of tokens be the unit that we ultimately wish to work with (a sequence of length N may emit a sequence of length M).
  • Seq2seq models comprise 2 RNNs: 1 encoder and 1 decoder.
SLIDE 9

Sequence-to-Sequence (seq2seq)

[Diagram: an encoder RNN (input layer and hidden layer) reads the source sentence "The brown dog ran".]

SLIDE 10

Sequence-to-Sequence (seq2seq)

The final hidden state of the encoder RNN is the initial state of the decoder RNN.

SLIDES 11-24

Sequence-to-Sequence (seq2seq)

[Diagram sequence: starting from the <s> token and the encoder's final hidden state, the decoder RNN emits the translation one token at a time ("Le", "chien", "brun", "a", "couru"), feeding each emitted token back in as its next input, until it produces the end-of-sequence token </s>. The complete model maps "The brown dog ran" to "Le chien brun a couru".]

SLIDE 25

Sequence-to-Sequence (seq2seq)

Training occurs as it typically does for RNNs: the loss (from the decoder outputs) is calculated, and we update weights all the way back to the beginning (the encoder).
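A minimal Keras sketch of this encoder-decoder setup, assuming GRU cells and made-up vocabulary and hidden sizes (an illustration, not the course's own code):

    import tensorflow as tf
    from tensorflow.keras import layers, Model

    src_vocab, tgt_vocab, hidden = 5000, 6000, 256

    # Encoder: embed the source tokens and keep only the final hidden state
    enc_in = layers.Input(shape=(None,), dtype="int32", name="source_tokens")
    enc_emb = layers.Embedding(src_vocab, hidden)(enc_in)
    _, enc_state = layers.GRU(hidden, return_state=True)(enc_emb)

    # Decoder: its initial state is the encoder's final hidden state
    dec_in = layers.Input(shape=(None,), dtype="int32", name="target_tokens")  # starts with <s>
    dec_emb = layers.Embedding(tgt_vocab, hidden)(dec_in)
    dec_out = layers.GRU(hidden, return_sequences=True)(dec_emb, initial_state=enc_state)
    logits = layers.Dense(tgt_vocab)(dec_out)              # next-token scores at every step

    model = Model([enc_in, dec_in], logits)
    model.compile(optimizer="adam",
                  loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
    # The loss on the decoder outputs backpropagates through the shared state
    # all the way into the encoder's weights.

During training the decoder is fed the gold target tokens shifted by one position (teacher forcing); at test time it feeds back its own predictions, as in the slides above.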


SLIDE 27

Sequence-to-Sequence (seq2seq)

See any issues with this traditional seq2seq paradigm?

SLIDE 28

Sequence-to-Sequence (seq2seq)

[Diagram: the same encoder-decoder model, highlighting the single hidden state passed from the encoder to the decoder.]

It's crazy that the entire "meaning" of the 1st sequence is expected to be packed into this one embedding, and that the encoder then never interacts w/ the decoder again. Hands free.

SLIDE 29

Sequence-to-Sequence (seq2seq)

Instead, what if the decoder, at each step, pays attention to a distribution of all of the encoder’s hidden states?
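A rough numpy sketch of that idea (plain dot-product scores are assumed here; the lecture's diagrams do not commit to a particular scoring function):

    import numpy as np

    def attention_context(decoder_state, encoder_states):
        # decoder_state: (hidden,), encoder_states: (src_len, hidden)
        scores = encoder_states @ decoder_state          # one score per source position
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()                         # softmax: a distribution over source words
        context = weights @ encoder_states               # weighted sum of encoder hidden states
        return context, weights

    rng = np.random.default_rng(0)
    enc_states = rng.normal(size=(4, 8))                 # "The brown dog ran", hidden size 8
    context, weights = attention_context(rng.normal(size=8), enc_states)
    print(weights.round(2))                              # how much each source word contributes

The context vector is then combined with the decoder's own hidden state when predicting the next output word.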

SLIDES 30-34

seq2seq + Attention

[Diagram sequence: at each decoding step, the decoder attends over all of the encoder's hidden states for "The brown dog ran" while emitting "Le chien brun a couru" one token at a time.]

SLIDE 35

seq2seq + Attention

Attention:

  • greatly improves seq2seq results
  • allows us to visualize the contribution each word gave during each step of the decoder

Image source: Fig 3 in Bahdanau et al., 2015

SLIDE 36

Outline

  • Seq2Seq + Attention
  • Transformers + BERT
  • Embeddings


SLIDE 38

Self-Attention

  • Models direct relationships between all words in a given sequence (e.g., a sentence).
  • Does not require a seq2seq (i.e., encoder-decoder RNN) framework.
  • Each word in a sequence can be transformed into an abstract representation (embedding) based on a weighted sum of the other words in the same sequence.

SLIDE 39

Self-Attention

[Diagram: input vectors for "The brown dog ran" are combined into output representations.]

This is a large simplification. The representations are created using Query, Key, and Value vectors, produced from learned weight matrices during training.
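A hedged numpy sketch of what those Query, Key, and Value projections do (a single attention head with no masking; the weight matrices would normally be learned, and all dimensions here are invented):

    import numpy as np

    def self_attention(X, Wq, Wk, Wv):
        # X: (seq_len, d_model), one row per word
        Q, K, V = X @ Wq, X @ Wk, X @ Wv                   # project inputs to queries/keys/values
        scores = Q @ K.T / np.sqrt(K.shape[1])             # every word scores every other word
        weights = np.exp(scores - scores.max(axis=1, keepdims=True))
        weights /= weights.sum(axis=1, keepdims=True)      # row-wise softmax
        return weights @ V                                 # each output row: weighted sum of the values

    rng = np.random.default_rng(0)
    X = rng.normal(size=(4, 16))                           # "The brown dog ran", d_model = 16
    Wq, Wk, Wv = (rng.normal(size=(16, 16)) for _ in range(3))
    out = self_attention(X, Wq, Wk, Wv)                    # (4, 16): one contextual vector per word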


SLIDE 42

Self-Attention

To recap:

  • Attention determines which pieces of sequence A are most relevant w.r.t. sequence B.
  • Self-attention determines which pieces of sequence A are most relevant w.r.t. sequence A.
  • A Transformer combines both; it has an encoder-decoder component, yet also uses self-attention to refine each respective sequence's representation.

SLIDE 43

Self-Attention

  • Transformers yield state-of-the-art results for machine translation (seq2seq).
  • Transformers handle long-range dependencies better than LSTMs.
  • BERT is an example of a Transformer.

SLIDE 44

BERT

  • BERT is like the encoder portion of a Transformer.
  • Uses self-attention.
  • Uses bi-directional conditioning to perform language modelling.
  • Yet, it doesn't see its own words because it cleverly masks 15% of its words (see the sketch below).
  • Fine-tunes on a sentence/entailment task.
  • BERT provides generalized contextual embeddings which can be fine-tuned toward other classification tasks (e.g., sentiment classification).
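A tiny sketch of the masking idea only (real BERT also keeps or randomly replaces some of the selected tokens and operates on subword pieces; those details are omitted here):

    import random

    def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]"):
        # Hide roughly 15% of the tokens; the model must predict the hidden originals.
        masked, targets = [], []
        for tok in tokens:
            if random.random() < mask_rate:
                masked.append(mask_token)
                targets.append(tok)          # the label the model has to recover
            else:
                masked.append(tok)
                targets.append(None)         # no loss at unmasked positions
        return masked, targets

    print(mask_tokens("the brown dog ran across the park".split()))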

SLIDE 45

Conclusion

  • There has been significant progress in the past few years.
  • Some of the complex models are incredible, but rely on having a lot of data and computational resources (e.g., Transformers).
  • As with all data science and machine learning, it's best to understand your data and your task very well, then clean the data, and start with a simple model (instead of jumping to the most complex model).

SLIDE 46

Conclusion

Models

  • N-gram: count statistics; elementary sequence modelling
  • FFNN: fixed-length context window; basic sequence modelling
  • (Vanilla) RNN: uses context; fair sequence modelling
  • LSTM: great contextual usage; great sequence modelling
  • Seq2Seq: maps 1 sequence to another
  • Attention: determines which elements in sequence A pertain to sequence B
  • Self-Attention: determines great representations for items in a sequence
  • Transformers: learn excellent representations via a seq2seq framework and self-attention

SLIDE 47

Sequential Modelling

[Diagram: a progression of models, from n-grams, FFNN, RNN, and LSTM (sequence modelling, 1-to-1 mapping) to seq2seq, LSTM w/ attention, and the Transformer.]

SLIDE 48

Credit & further resources:

  • Backprop: http://cs231n.github.io/optimization-2/
  • Abigail See’s lectures: http://web.stanford.edu/class/cs224n/index.html
  • Illustrated BERT: http://jalammar.github.io/illustrated-bert/
  • Andrew Ng (Attention): https://www.youtube.com/watch?v=quoGRI-1l0A
  • Illustrated Transformer: https://jalammar.github.io/illustrated-transformer/
  • Google (Transformers): https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html
  • ELMo paper: https://arxiv.org/pdf/1802.05365.pdf
SLIDE 49

Outline

  • Seq2Seq + Attention
  • Transformers + BERT
  • Embeddings

SLIDE 50

The basic idea

  • Observe a bunch of people.
  • Infer personality traits from them.
  • A vector of traits is called an Embedding.
  • Who is more similar? Jay and who?
  • Use the cosine similarity of the vectors (sketched below).

Images by Jay Alammar
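A quick numpy sketch of the similarity computation (the trait vectors below are made up for illustration):

    import numpy as np

    def cosine_similarity(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    jay      = np.array([-0.4, 0.8, 0.5, -0.2, 0.3])   # invented "personality trait" embeddings
    person_1 = np.array([-0.3, 0.2, 0.3, -0.4, 0.9])
    person_2 = np.array([-0.5, 0.4, 0.6, -0.1, 0.1])

    print(cosine_similarity(jay, person_1), cosine_similarity(jay, person_2))
    # The larger value tells us which person is more similar to Jay.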

SLIDE 51

Categorical Data

Example: the Rossmann Kaggle Competition. Rossmann is a 3000-store European drug store chain, and the task is to predict sales 6 weeks in advance. Consider store_id as an example. This is a categorical predictor, i.e., its values come from a finite set. We usually one-hot encode this: a single store is a length-3000 bit-vector with one bit turned on.

SLIDE 52

Categorical Data

What is the problem with this?

  • The 3000 stores have commonalities, but the one-hot encoding does not represent this.
  • Indeed, the dot product (and hence the cosine similarity) of any two distinct one-hot bit-vectors must be 0.
  • It would be useful to learn a lower-dimensional embedding for the purpose of sales prediction.
  • These store "personalities" could then be used in other models (different from the model used to learn the embedding) for sales prediction.
  • The embedding can also be used for other tasks, such as employee turnover prediction.

SLIDE 53

Training an Embedding

  • Normally you would do a linear or MLP regression with sales as the target, and both continuous and categorical features.
  • The game is to replace the one-hot encoded categorical features by "lower-width" embedding features, for each categorical predictor.
  • This is equivalent to considering a neural network with the output of an additional Embedding layer concatenated in (see the sketch below).
  • The Embedding layer is simply a linear regression.

Image from Guo and Berkhahn
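A minimal Keras sketch of this architecture (the continuous-feature width, the embedding width of 10, and the layer sizes are illustrative assumptions, not taken from the Rossmann solution):

    import tensorflow as tf
    from tensorflow.keras import layers, Model

    n_stores, emb_dim = 3000, 10                      # 3000 stores squeezed into 10 dimensions

    store_id = layers.Input(shape=(1,), dtype="int32", name="store_id")   # integer id instead of a one-hot
    other    = layers.Input(shape=(20,), name="continuous_features")      # e.g. promos, day of week, ...

    store_vec = layers.Flatten()(layers.Embedding(n_stores, emb_dim)(store_id))
    x = layers.Concatenate()([store_vec, other])      # embedding output concatenated into the MLP
    x = layers.Dense(64, activation="relu")(x)
    sales = layers.Dense(1)(x)

    model = Model([store_id, other], sales)
    model.compile(optimizer="adam", loss="mse")
    # After training, the Embedding layer's weight matrix (3000 x 10) holds the store "personalities".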

SLIDE 54

Training an Embedding (cont)

We fit the embedding weights along with the rest of the weights in the MLP!

SLIDE 55

Training an Embedding (cont)

SLIDE 56

Embedding is just a linear regression

So why are we giving it another name?

  • It is usually used to lower the dimension of the space.
  • Traditionally we have done linear dimensionality reduction through PCA and truncation, but sparsity can throw a spanner into the works.
  • We train the weights of the embedding regression using gradient descent (or stochastic gradient descent), along with the weights of the downstream task (here, predicting sales 6 weeks in advance).
  • The embedding can be used for alternate tasks, such as finding the similarity of users. See how Spotify does all this.


SLIDE 58

Word Embeddings

Images in this section are from Illustrated word2vec.

SLIDE 59

Obligatory example

See, for example, man → boy as woman → girl, and the similarities of king and queen. These are lower-dimensional GloVe embedding vectors.

SLIDE 60

How do we train word embeddings?

  • We need to choose a downstream task.
  • We could choose Language Modeling: predict the next word.
  • We'll start with random "weights" for the embeddings and other parameters and start learning.
  • A trained model + embeddings would look like this:
SLIDE 61

How do we set up a training set?

Why not look both ways? This leads to the Skip-Gram and CBOW architectures.

SLIDE 62

SKIP-GRAM: Predict Surrounding Words

Choose a window size (here 4) and construct a dataset by sliding a window across.
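A small sketch of that sliding-window construction (a window of 2 words on each side gives the 4 neighbours the slide refers to; the sentence is just an example):

    def skipgram_pairs(tokens, window=2):
        # Build (center_word, context_word) training pairs from a token list.
        pairs = []
        for i, center in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pairs.append((center, tokens[j]))
        return pairs

    print(skipgram_pairs("the quick brown fox jumps over the lazy dog".split()))
    # [('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ('quick', 'brown'), ...]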

SLIDE 63

SKIP-GRAM: Details

SLIDE 64

The probability of an output word, given a central word, is assumed to be given by a softmax of the dot product of the embeddings. Then, assuming a text sequence of length T and window size m, the likelihood function is the product of these probabilities over every window position. We'll use the Negative Log Likelihood (NLL) as the loss.
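The equations on this slide did not survive the text extraction; a standard skip-gram formulation consistent with the description above (with v the center-word embeddings and u the context-word embeddings) is:

    P(w_o \mid w_c) = \frac{\exp(u_{w_o}^{\top} v_{w_c})}{\sum_{i \in V} \exp(u_i^{\top} v_{w_c})}

    L = \prod_{t=1}^{T} \prod_{\substack{-m \le j \le m \\ j \ne 0}} P(w^{(t+j)} \mid w^{(t)}),
    \qquad
    \mathrm{NLL} = -\sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \log P(w^{(t+j)} \mid w^{(t)})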

SLIDE 65

Prediction

With random initial weights, we make a prediction for the surrounding words and calculate the NLL for that prediction. We then backpropagate the NLL's gradients to find new weights, and repeat. Consider two sentences: "I am running." and "I am writing." The "I" and "am" targets will backpropagate to the same input embedding, and so, after some training, "writing" and "running" will be highly correlated. Appropriate correlations will emerge as the corpus size increases.

SLIDE 66

Problems

  • In the forward mode, the calculation of the softmax requires a sum over the entire vocabulary.
  • In the backward mode, the gradients need this sum too.

For large vocabularies, this is very expensive!

SLIDE 67

Changing Tasks

Changing the task from predicting neighbors to asking "are we neighbors?" changes the model from a neural net to a logistic regression.

SLIDE 68

Changing Tasks (cont)

SLIDE 69

Negative Sampling

To fix this, we randomly choose words from our vocabulary as negative examples and label them with 0.

SLIDE 70

Likelihood model

We go back to the old likelihood, but now the probability is approximated using negative sampling. The NLL now has a sum over a K-sized sample of words, rather than over the full vocabulary.
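The formulas are again missing from the extracted text; the standard negative-sampling approximation that matches this description is:

    P(w_o \mid w_c) \approx \sigma(u_{w_o}^{\top} v_{w_c}) \prod_{k=1}^{K} \sigma(-u_{w_k}^{\top} v_{w_c}),
    \qquad w_k \sim P_n(w)

where σ is the sigmoid and the w_k are K "negative" words drawn from a noise distribution P_n. Taking the negative log gives a loss with only K + 1 terms instead of a sum over the whole vocabulary.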

SLIDE 71

Training the model

  • The negative-sampling probabilities are now sigmoids subtracted from 1, whereas the positives are simply sigmoids.
  • We compute the loss over the training examples in our batch.
  • We then backpropagate to obtain gradients and change the embeddings and weights a little, for each batch, in each epoch (a sketch of one such update follows).
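A rough numpy sketch of one update for a single (center, context) pair with K negatives (the learning rate and dimensions are invented; this is the standard skip-gram-with-negative-sampling gradient, not the course's code):

    import numpy as np

    def sgns_step(v_c, u_pos, U_neg, lr=0.025):
        # v_c: center embedding (d,), u_pos: true context embedding (d,), U_neg: (K, d) negatives
        sig = lambda x: 1.0 / (1.0 + np.exp(-x))
        g_pos = sig(u_pos @ v_c) - 1.0             # gradient factor for the positive pair (label 1)
        g_neg = sig(U_neg @ v_c)                   # gradient factors for the negatives (label 0)
        grad_v = g_pos * u_pos + U_neg.T @ g_neg   # gradient w.r.t. the center (input) embedding
        u_pos -= lr * g_pos * v_c                  # update the output embeddings ...
        U_neg -= lr * np.outer(g_neg, v_c)
        v_c   -= lr * grad_v                       # ... and the input embedding
        return v_c, u_pos, U_neg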


SLIDE 74

The result

  • We discard the Context matrix and save the Embedding matrix.
  • We can use the embedding matrix for our next task (perhaps a sentiment classifier), as sketched below.
  • We could have trained embeddings along with that particular task to make the embeddings sentiment-specific. There is always a tension between domain/task-specific embeddings and generic ones.
  • This tension is usually resolved in favor of using generic embeddings, since task-specific datasets tend to be smaller.
  • We can still unfreeze pre-trained embedding layers to modify them for domain-specific tasks via transfer learning.
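A small Keras sketch of that reuse (the saved embedding file, vocabulary size, and classifier head are hypothetical placeholders):

    import numpy as np
    import tensorflow as tf
    from tensorflow.keras import layers, Model

    vocab_size, emb_dim = 10000, 300
    pretrained = np.load("embeddings.npy")      # hypothetical file holding the saved embedding matrix

    inp = layers.Input(shape=(None,), dtype="int32")
    emb = layers.Embedding(vocab_size, emb_dim, trainable=False, name="word_embeddings")
    x = emb(inp)                                # building the layer here ...
    emb.set_weights([pretrained])               # ... lets us load the frozen generic embeddings
    x = layers.GlobalAveragePooling1D()(x)
    out = layers.Dense(1, activation="sigmoid")(x)   # e.g. a sentiment-classifier head

    model = Model(inp, out)
    model.compile(optimizer="adam", loss="binary_crossentropy")

    # ... train the head on the task data, then optionally unfreeze and fine-tune gently:
    emb.trainable = True
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-4), loss="binary_crossentropy")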

SLIDE 75

Usage of word2vec

  • Pre-trained word2vec and other embeddings (such as GloVe) are used everywhere in NLP today.
  • The ideas have been used elsewhere as well: Airbnb and Anghami model sequences of listings and songs using word2vec-like techniques.
  • Alibaba and Facebook use word2vec and graph embeddings for recommendations and social network analysis.