

slide-1
SLIDE 1

Transformers
 Pre-trained Language Models

LING572 Advanced Statistical Methods for NLP March 10, 2020

1

slide-2
SLIDE 2

Announcements

  • Thanks for being here!
  • Please be active on Zoom chat! That’s the only form of interaction; I won’t be able to tell what’s sticking and what’s not without the physical classroom and its visual cues.
  • HW7: excellent. 94 avg, no major comments.
  • HW9: will post this afternoon. Deep Averaging Network for text classification; you will implement: linear layer, L2 regularization, early stopping.
  • Office hours today: https://washington.zoom.us/my/shanest

2

slide-3
SLIDE 3

Outline

  • Transformer Architecture
  • Transfer learning and pre-training
  • History / main idea
  • In NLP: ELMo, BERT, …

3

slide-4
SLIDE 4

Transformer Architecture

4

slide-5
SLIDE 5

5

Paper link (but see Annotated and Illustrated Transformer)

slide-6
SLIDE 6

Full Model

6

encoder decoder

slide-7
SLIDE 7

Transformer Block

7

slide-8
SLIDE 8

Transformer Block

7

Single layer, applied to each position

slide-9
SLIDE 9

Transformer Block

7

What’s this?

Single layer, applied to each position
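The “single layer, applied to each position” is the position-wise feed-forward sub-layer of the block. Below is a minimal PyTorch-style sketch (my own illustration, not from the slides; the dimensions 512/2048 follow the original Transformer paper) showing that the same two-layer MLP is applied independently to the vector at every position:

```python
import torch
import torch.nn as nn

class PositionwiseFFN(nn.Module):
    """Same weights at every position: FFN(x) = W2 * relu(W1 x + b1) + b2."""
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff)
        self.w2 = nn.Linear(d_ff, d_model)

    def forward(self, x):
        # x: (batch, seq_len, d_model); the linear layers act on the last
        # dimension, i.e. each position is transformed independently.
        return self.w2(torch.relu(self.w1(x)))
```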

slide-10
SLIDE 10

Scaled Dot-Product Attention

  • Recall:



 
 
 


  • Putting it together: 


(keys/values in matrices)


  • Stacking multiple queries:


(and scaling)

8

$\mathrm{Attention}(q, K, V) = \sum_j \frac{e^{q \cdot k_j}}{\sum_i e^{q \cdot k_i}} \, v_j$

$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V$

slide-11
SLIDE 11

Scaled Dot-Product Attention

  • Recall:



 
 
 


  • Putting it together: 


(keys/values in matrices)


  • Stacking multiple queries:


(and scaling)

8

$\alpha_j = q \cdot k_j, \quad e_j = e^{\alpha_j} / \sum_{j'} e^{\alpha_{j'}}, \quad c = \sum_j e_j v_j$

$\mathrm{Attention}(q, K, V) = \sum_j \frac{e^{q \cdot k_j}}{\sum_i e^{q \cdot k_i}} \, v_j$

$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V$
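A minimal NumPy sketch (my own illustration, not from the slides) of the scaled dot-product attention formula above; Q, K, V are illustrative arrays of shape (num_queries, d_k), (num_keys, d_k), (num_keys, d_v):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                     # (num_queries, num_keys)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # row-wise softmax
    return weights @ V                                  # (num_queries, d_v)

# Example: 3 queries, 5 keys/values, d_k = d_v = 4
Q, K, V = np.random.randn(3, 4), np.random.randn(5, 4), np.random.randn(5, 4)
print(scaled_dot_product_attention(Q, K, V).shape)      # (3, 4)
```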


slide-13
SLIDE 13

Why multiple queries?

9

slide-14
SLIDE 14

Why multiple queries?

  • seq2seq: single decoder token attends to all encoder states

9

slide-15
SLIDE 15

Why multiple queries?

  • seq2seq: single decoder token attends to all encoder states
  • Transformer: self-attention
  • Every (token) position attends to every other position [including self!]
  • Caveat: in the encoder, and only by default
  • Mask in decoder to attend only to previous positions
  • This masking technique is applied in some Transformer-based LMs
  • So the vector at each position is a query

9

slide-16
SLIDE 16

Multi-headed Attention

  • So far: a single attention mechanism.
  • Could be a bottleneck: need to pay attention to different vectors for different reasons.
  • Multi-headed: several attention mechanisms in parallel.

10
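A rough sketch (my own, not the authors’ code) of multi-head attention: each head projects Q, K, V into a smaller subspace, attends independently, and the head outputs are concatenated and projected back.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention, as on the previous slide."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    """Project X into per-head subspaces, attend in each head independently,
    then concatenate the head outputs and apply the output projection."""
    d_model = X.shape[-1]
    d_head = d_model // num_heads
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    heads = [attention(Q[:, h*d_head:(h+1)*d_head],
                       K[:, h*d_head:(h+1)*d_head],
                       V[:, h*d_head:(h+1)*d_head]) for h in range(num_heads)]
    return np.concatenate(heads, axis=-1) @ W_o

# Toy self-attention: 6 positions, d_model=8, 2 heads, random projections
X = np.random.randn(6, 8)
W = [np.random.randn(8, 8) for _ in range(4)]
print(multi_head_attention(X, *W, num_heads=2).shape)   # (6, 8)
```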


slide-19
SLIDE 19

Representing Order

11

slide-20
SLIDE 20

Representing Order

  • No notion of order in the Transformer. Order is represented via positional encodings.

11

slide-21
SLIDE 21

Representing Order

  • No notion of order in the Transformer. Order is represented via positional encodings.

11

source

slide-22
SLIDE 22

Representing Order

  • No notion of order in the Transformer. Order is represented via positional encodings.
  • Usually fixed, though they can be learned.

11

source

slide-23
SLIDE 23

Representing Order

  • No notion of order in the Transformer. Order is represented via positional encodings.
  • Usually fixed, though they can be learned.
  • Learned encodings give no significant improvement and generalize less well.

11

source
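A small sketch (my own illustration; it assumes the sinusoidal scheme of the original Transformer paper) of the fixed positional encodings: each position gets a vector of sines and cosines at geometrically spaced frequencies, which is added to the token embedding.

```python
import numpy as np

def sinusoidal_positional_encodings(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(same)."""
    positions = np.arange(max_len)[:, None]            # (max_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]           # (1, d_model/2)
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe   # added to the (max_len, d_model) token embeddings

print(sinusoidal_positional_encodings(50, 16).shape)   # (50, 16)
```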

slide-24
SLIDE 24

Initial WMT Results

12

slide-25
SLIDE 25

Initial WMT Results

12

More later on why this is important

slide-26
SLIDE 26

Attention Visualization: Coreference?

13

source

slide-27
SLIDE 27

Transformer: Summary

  • Entirely feed-forward
  • Therefore massively parallelizable
  • RNNs are inherently sequential, a parallelization bottleneck
  • (Self-)attention everywhere
  • Long-term dependencies:
  • LSTM: has to maintain representation of early item
  • Transformer: very short “path-lengths”

14

slide-28
SLIDE 28

Transfer Learning and Pre-training

15

slide-29
SLIDE 29

NLP’s “ImageNet Moment”

16

link

slide-30
SLIDE 30

What is ImageNet?

17

CVPR ‘09

slide-31
SLIDE 31

Why is ImageNet Important?

18

link

slide-32
SLIDE 32

Why is ImageNet Important?

  • 1. Deep learning
  • 2. Transfer learning

18

link

slide-33
SLIDE 33

ILSVRC results

19 source

slide-34
SLIDE 34

ILSVRC results

19 source

AlexNet (CNN)

slide-35
SLIDE 35

Transfer Learning

“We use features extracted from the OverFeat network as a generic image representation to tackle the diverse range of recognition tasks of object image classification, scene recognition, fine grained recognition, attribute detection and image retrieval applied to a diverse set of datasets. We selected these tasks and datasets as they gradually move further away from the original task and data the OverFeat network was trained to solve [cf. ImageNet]. Astonishingly, we report consistent superior results compared to the highly tuned state-of-the-art systems in all the visual classification tasks on various datasets.”

20

slide-36
SLIDE 36

Standard Supervised Learning

21

Task 1 inputs Task 1 outputs

slide-37
SLIDE 37

Standard Supervised Learning

21

Task 1 inputs Task 1 outputs Task 2 inputs Task 2 outputs

slide-38
SLIDE 38

Standard Supervised Learning

21

Task 1 inputs Task 1 outputs Task 2 inputs Task 2 outputs Task 3 inputs Task 3 outputs

slide-39
SLIDE 39

Standard Supervised Learning

21

Task 1 inputs Task 1 outputs Task 2 inputs Task 2 outputs Task 3 inputs Task 3 outputs Task 4 inputs Task 4 outputs

slide-40
SLIDE 40

Standard Learning

  • New task = new model
  • Expensive!
  • Training time
  • Storage space
  • Data availability
  • Can be impossible in low-data regimes

22

slide-41
SLIDE 41

Transfer Learning

23

“pre-training” task inputs “pre-training” task outputs

slide-42
SLIDE 42

Transfer Learning

23

“pre-training” task outputs

slide-43
SLIDE 43

Transfer Learning

23

“pre-training” task outputs Task 1 inputs

slide-44
SLIDE 44

Transfer Learning

23

Task 1 inputs

slide-45
SLIDE 45

Transfer Learning

23

Task 1 inputs Task 1 outputs

slide-46
SLIDE 46

Transfer Learning

23

Task 1 outputs

slide-47
SLIDE 47

Transfer Learning

23

Task 1 outputs Task 2 inputs

slide-48
SLIDE 48

Transfer Learning

23

Task 1 outputs Task 2 inputs Task 2 outputs

slide-49
SLIDE 49

Transfer Learning

23

Task 1 outputs Task 2 outputs

slide-50
SLIDE 50

Transfer Learning

23

Task 1 outputs Task 2 outputs Task 3 inputs

slide-51
SLIDE 51

Transfer Learning

23

Task 1 outputs Task 2 outputs Task 3 outputs Task 3 inputs

slide-52
SLIDE 52

Transfer Learning

23

Task 1 outputs Task 2 outputs Task 3 outputs

slide-53
SLIDE 53

Transfer Learning

23

Task 1 outputs Task 2 outputs Task 3 outputs

Pre-trained model, either:
  • General feature extractor
  • Fine-tuned on tasks
slide-54
SLIDE 54

Example: Scene Parsing

24

slide-55
SLIDE 55

Example: Scene Parsing

25

CVPR ’17 paper

slide-56
SLIDE 56

Example: Scene Parsing

25

CVPR ’17 paper

Pre-trained ResNet

slide-57
SLIDE 57

Transfer Learning in NLP

26

slide-58
SLIDE 58

Where to transfer from?

27

slide-59
SLIDE 59

Where to transfer from?

  • Goal: find a linguistic task that will build general-purpose / transferable

representations

27

slide-60
SLIDE 60

Where to transfer from?

  • Goal: find a linguistic task that will build general-purpose / transferable

representations

  • Possibilities:

27

slide-61
SLIDE 61

Where to transfer from?

  • Goal: find a linguistic task that will build general-purpose / transferable

representations

  • Possibilities:
  • Constituency or dependency parsing

27

slide-62
SLIDE 62

Where to transfer from?

  • Goal: find a linguistic task that will build general-purpose / transferable

representations

  • Possibilities:
  • Constituency or dependency parsing
  • Semantic parsing

27

slide-63
SLIDE 63

Where to transfer from?

  • Goal: find a linguistic task that will build general-purpose / transferable

representations

  • Possibilities:
  • Constituency or dependency parsing
  • Semantic parsing
  • Machine translation

27

slide-64
SLIDE 64

Where to transfer from?

  • Goal: find a linguistic task that will build general-purpose / transferable

representations

  • Possibilities:
  • Constituency or dependency parsing
  • Semantic parsing
  • Machine translation
  • QA

27


slide-66
SLIDE 66

Where to transfer from?

  • Goal: find a linguistic task that will build general-purpose / transferable representations
  • Possibilities:
    • Constituency or dependency parsing
    • Semantic parsing
    • Machine translation
    • QA
  • Scalability issue: all require expensive annotation

27

slide-67
SLIDE 67

Language Modeling

28

slide-68
SLIDE 68

Language Modeling

  • Recent innovation: use language modeling (a.k.a. next word prediction)
  • And variants thereof

28

slide-69
SLIDE 69

Language Modeling

  • Recent innovation: use language modeling (a.k.a. next word prediction)
  • And variants thereof
  • Linguistic knowledge:
  • The students were happy because ____ …
  • The student was happy because ____ …

28

slide-70
SLIDE 70

Language Modeling

  • Recent innovation: use language modeling (a.k.a. next word prediction)
  • And variants thereof
  • Linguistic knowledge:
  • The students were happy because ____ …
  • The student was happy because ____ …
  • World knowledge:
  • The POTUS gave a speech after missiles were fired by _____
  • The Seattle Sounders are so-named because Seattle lies on the Puget _____

28

slide-71
SLIDE 71

Language Modeling is “Unsupervised”

  • An example of “unsupervised” or “semi-supervised” learning
  • NB: I think that “un-annotated” is a better term. Formally, the learning is supervised, but the labels come directly from the data, not from an annotator.
  • E.g.: “Today is the first day of 575.”
    • (<s>, Today)
    • (<s> Today, is)
    • (<s> Today is, the)
    • (<s> Today is the, first)

29
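A tiny sketch (my own illustration, not from the slides) of how these (context, next-word) training pairs come straight out of raw text, with no annotator required:

```python
def lm_training_pairs(sentence):
    """Turn a raw sentence into (context, next_word) pairs for next-word prediction."""
    tokens = ["<s>"] + sentence.split()
    return [(tuple(tokens[:i]), tokens[i]) for i in range(1, len(tokens))]

for context, target in lm_training_pairs("Today is the first day of 575."):
    print(context, "->", target)
# (('<s>',), 'Today'), (('<s>', 'Today'), 'is'), (('<s>', 'Today', 'is'), 'the'), ...
```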

slide-72
SLIDE 72

Data for LM is cheap

30


slide-74
SLIDE 74

Data for LM is cheap

30

Text

slide-75
SLIDE 75

Text is abundant

  • News sites (e.g. Google 1B)
  • Wikipedia (e.g. WikiText103)
  • Reddit
  • ….
  • General web crawling:
  • https://commoncrawl.org/

31

slide-76
SLIDE 76

The Revolution will not be [Annotated]

32

https://twitter.com/rgblong/status/916062474545319938?lang=en

Yann LeCun

slide-77
SLIDE 77

ULMFiT

33

Universal Language Model Fine-tuning for Text Classification (ACL ’18)

slide-78
SLIDE 78

ULMFiT

34

slide-79
SLIDE 79

ULMFiT

35

slide-80
SLIDE 80

Deep Contextualized Word Representations


Peters et al. (2018)

36

slide-81
SLIDE 81

Deep Contextualized Word Representations


Peters et al. (2018)

  • NAACL 2018 Best Paper Award

36

slide-82
SLIDE 82

Deep Contextualized Word Representations


Peters et al. (2018)

  • NAACL 2018 Best Paper Award
  • Embeddings from Language Models (ELMo)
  • [aka the OG NLP Muppet]

36

slide-83
SLIDE 83

Deep Contextualized Word Representations


Peters et al. (2018)

  • Comparison to GloVe:

37

Source / Nearest Neighbors:

GloVe:
  play → playing, game, games, played, players, plays, player, Play, football, multiplayer

biLM:
  “Chico Ruiz made a spectacular play on Alusik’s grounder…” → “Kieffer, the only junior in the group, was commended for his ability to hit in the clutch, as well as his all-round excellent play.”
  “Olivia De Havilland signed to do a Broadway play for Garson…” → “…they were actors who had been handed fat roles in a successful play, and had talent enough to fill the roles competently, with nice understatement.”

slide-84
SLIDE 84

Deep Contextualized Word Representations


Peters et al. (2018)

  • Used in place of other embeddings on multiple tasks:

38

SQuAD = Stanford Question Answering Dataset
SNLI = Stanford Natural Language Inference Corpus
SST-5 = Stanford Sentiment Treebank

figure: Matthew Peters

slide-85
SLIDE 85

BERT: Bidirectional Encoder Representations from Transformers

Devlin et al., NAACL 2019

39

slide-86
SLIDE 86

Overview

  • Encoder Representations from Transformers:
    • Bidirectional: …?
    • BiLSTM (ELMo): left-to-right and right-to-left
    • Self-attention: every token can see every other
  • How do you treat the encoder as an LM (as computing $P(w_t \mid w_{t-1}, w_{t-2}, \ldots, w_1)$)?
    • Don’t: modify the task

40

slide-87
SLIDE 87

Masked Language Modeling

  • Language modeling: next word prediction
  • Masked Language Modeling (a.k.a. the cloze task): fill in the blank
    • Nancy Pelosi sent the articles of ____ to the Senate.
    • Seattle ____ some snow, so UW was delayed due to ____ roads.
  • I.e. $P(w_t \mid w_{t+k}, w_{t+(k-1)}, \ldots, w_{t+1}, w_{t-1}, \ldots, w_{t-(m-1)}, w_{t-m})$
    • (very similar to CBOW: continuous bag of words from word2vec)
  • Auxiliary training task: next sentence prediction.
    • Given sentences A and B, binary classification: did B follow A in the corpus or not?

41

slide-88
SLIDE 88

Schematically

42

slide-89
SLIDE 89

Some details

43

slide-90
SLIDE 90

Some details

  • BASE model:
  • 12 Transformer Blocks
  • Hidden vector size: 768
  • Attention heads / layer: 12
  • Total parameters: 110M

43

slide-91
SLIDE 91

Some details

  • BASE model:
  • 12 Transformer Blocks
  • Hidden vector size: 768
  • Attention heads / layer: 12
  • Total parameters: 110M
  • LARGE model:
  • 24 Transformer Blocks
  • Hidden vector size: 1024
  • Attention heads / layer: 16
  • Total parameters: 340M

43

slide-92
SLIDE 92

Input Representation

44

slide-93
SLIDE 93

Input Representation

44

  • [CLS], [SEP]: special tokens
slide-94
SLIDE 94

Input Representation

44

  • [CLS], [SEP]: special tokens
  • Segment: is this a token from sentence A or B?
slide-95
SLIDE 95

Input Representation

44

  • [CLS], [SEP]: special tokens
  • Segment: is this a token from sentence A or B?
  • Position embeddings: provide position in sequence (learned, not fixed, in this case)
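A minimal sketch (my own illustration with toy dimensions and illustrative token ids; not the authors’ code) of how BERT’s input vectors are formed: the wordpiece embedding, segment (A/B) embedding, and position embedding for each token are simply summed.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy sizes for illustration; BERT-BASE uses vocab=30522, max_len=512, d_model=768
vocab_size, max_len, d_model = 1000, 32, 16

token_emb = rng.normal(size=(vocab_size, d_model))    # wordpiece embeddings
segment_emb = rng.normal(size=(2, d_model))           # sentence A vs. sentence B
position_emb = rng.normal(size=(max_len, d_model))    # learned position embeddings

# Hypothetical ids for: [CLS] tok tok [SEP] tok tok [SEP]
token_ids   = np.array([101, 7, 8, 102, 9, 10, 102])
segment_ids = np.array([0,   0, 0, 0,   1, 1,  1])
positions   = np.arange(len(token_ids))

# The input vector at each position is the sum of the three embeddings
inputs = token_emb[token_ids] + segment_emb[segment_ids] + position_emb[positions]
print(inputs.shape)   # (7, 16): one summed vector per wordpiece
```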

slide-97
SLIDE 97

WordPiece Embeddings

  • Another solution to the OOV problem, from the NMT context (see Wu et al. 2016)
  • Main idea:
    • Fix vocabulary size |V| in advance [for BERT: 30k]
    • Choose |V| wordpieces (subwords) such that the total number of wordpieces in the corpus is minimized
    • Frequent words aren’t split, but rarer ones are (see the tokenizer sketch below)
  • NB: this is a small issue when you transfer to / evaluate on pre-existing tagging datasets with their own vocabularies.

45
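For illustration, a quick way to see WordPiece in action is through the HuggingFace Transformers tokenizer (a library mentioned at the end of this lecture; the splits shown in the comments are indicative, not guaranteed):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

print(tokenizer.tokenize("The students were happy"))
# frequent words stay whole, e.g. ['the', 'students', 'were', 'happy']
print(tokenizer.tokenize("unaffable electrocardiogram"))
# rare words split into '##'-prefixed pieces, e.g. ['una', '##ffa', '##ble', ...]
```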

slide-98
SLIDE 98

Training Details

  • BooksCorpus (800M words) + Wikipedia (2.5B)
  • Masking the input text. 15% of all tokens are chosen. Then:
  • 80% of the time: replaced by designated ‘[MASK]’ token
  • 10% of the time: replaced by random token
  • 10% of the time: unchanged
  • Loss is cross-entropy of the prediction at the masked positions.
  • Max seq length: 512 tokens (final 10%; 128 for first 90%)
  • 1M training steps, batch size 256 = 4 days on 4 or 16 TPUs

46
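A minimal sketch of the 80/10/10 masking recipe above (my own illustration, not the reference implementation); `vocab` is a hypothetical list of token strings:

```python
import random

def mask_for_mlm(tokens, vocab, select_prob=0.15):
    """Choose ~15% of tokens as prediction targets; of those, 80% become
    [MASK], 10% become a random token, 10% are left unchanged."""
    corrupted = list(tokens)
    targets = {}                          # position -> original token
    for i, tok in enumerate(tokens):
        if random.random() < select_prob:
            targets[i] = tok
            r = random.random()
            if r < 0.8:
                corrupted[i] = "[MASK]"
            elif r < 0.9:
                corrupted[i] = random.choice(vocab)
            # else: keep the original token unchanged
    return corrupted, targets             # loss = cross-entropy at target positions
```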

slide-99
SLIDE 99

Initial Results

47

slide-100
SLIDE 100

Ablations

48

  • Not a given (depth doesn’t help ELMo); possibly a difference between fine-tuning vs. feature extraction
  • Many more variations to explore
slide-101
SLIDE 101

Major Application

49

https://www.blog.google/products/search/search-language-understanding-bert/

slide-102
SLIDE 102

Major Application

50

slide-103
SLIDE 103

Pre-trained Neural Models Everywhere

51

General Language Understanding Evaluation (GLUE) / SuperGLUE

slide-104
SLIDE 104

Note on the costs of LMs

52

slide-105
SLIDE 105

Note on the costs of LMs

  • Currently something of an ‘arms race’ between e.g. Google, Facebook, OpenAI, MS, Baidu

52

slide-106
SLIDE 106

Note on the costs of LMs

  • Currently something of an ‘arms race’ between e.g. Google, Facebook, OpenAI, MS, Baidu

  • Hugely expensive
  • Carbon emissions
  • Monetarily
  • Inequitable access

52


slide-110
SLIDE 110

Note on the costs of LMs

  • Currently something of an ‘arms race’ between e.g. Google, Facebook, OpenAI, MS, Baidu
  • Hugely expensive
    • Carbon emissions
    • Monetarily
    • Inequitable access
  • A role for interpretability/analysis:
    • Bigger is better, but which factors really matter?

52

slide-111
SLIDE 111

Sidebar: Word Embeddings

53

slide-112
SLIDE 112

Sidebar: Word Embeddings

  • Aren’t word embeddings like word2vec and GloVe examples of transfer learning?
  • Yes: get linguistic representations from raw text to use in downstream tasks
  • No: not to be used as general-purpose representations

53

slide-113
SLIDE 113

Sidebar: Word Embeddings

54

slide-114
SLIDE 114

Sidebar: Word Embeddings

  • One distinction:
  • Global representations:
  • word2vec, GloVe: one vector for each word type (e.g. ‘play’)
  • Contextual representations (from LMs):
  • Representation of word in context, not independently

54

slide-115
SLIDE 115

Sidebar: Word Embeddings

  • One distinction:
  • Global representations:
  • word2vec, GloVe: one vector for each word type (e.g. ‘play’)
  • Contextual representations (from LMs):
  • Representation of word in context, not independently
  • Another:
  • Shallow (global) vs. Deep (contextual) pre-training

54

slide-116
SLIDE 116

Global Embeddings: Models

55

slide-117
SLIDE 117

Global Embeddings: Models

55

Mikolov et al 2013a (the OG word2vec paper)

slide-118
SLIDE 118

Shallow vs Deep Pre-training

56

[Figure: shallow vs. deep pre-training — shallow: raw tokens → global embedding → model for task; deep: raw tokens → contextual embedding (pre-trained) → model for task]

slide-119
SLIDE 119

State of the Field

  • Manning 2017: “The BiLSTM Hegemony”
  • Right now: “The pre-trained Transformer Hegemony”
  • By default: fine-tune a large pre-trained Transformer on the task you care about
    • Will often yield the best results
    • Beware: often not significantly better than very simple baselines (SVM, etc.)
  • Very useful library to quickly use these models: HuggingFace Transformers (see the sketch below)
    • https://huggingface.co/transformers/
  • Variants of BERT differ on: hyper-parameters, architectural choices, pre-training tasks, …

57
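As referenced above, a minimal sketch of loading a pre-trained BERT with the HuggingFace Transformers library and extracting contextual representations (fine-tuning would add a task-specific head and backpropagate through the whole model):

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("The students were happy because class ended early.",
                   return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One contextual vector per wordpiece: (batch, seq_len, 768 for BERT-BASE)
print(outputs.last_hidden_state.shape)
```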