Advanced Pre-training Language Models: A Brief Introduction
Xiachong Feng
Outline
1. Encoder-Decoder
2. Attention
3. Transformer: Attention Is All You Need
4. Word embedding
Encoder-Decoder Framework
$$\text{Source} = \langle x_1, x_2, \dots, x_m \rangle \qquad \text{Target} = \langle y_1, y_2, \dots, y_n \rangle$$
$$C = F(x_1, x_2, \dots, x_m) \qquad y_i = g(C, y_1, y_2, \dots, y_{i-1})$$
$$y_1 = g(C) \qquad y_2 = g(C, y_1) \qquad y_3 = g(C, y_1, y_2)$$
The fixed context vector $C$ may retain some important information, but it will also lose some information, such as semantics.
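A minimal sketch of this framework, assuming PyTorch; the `Seq2Seq` class name, GRU choice, and dimensions are illustrative, not the deck's implementation:

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Vanilla encoder-decoder: the whole source is squeezed into one vector C."""
    def __init__(self, vocab_size, emb_dim=128, hid_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.decoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, src_ids, tgt_ids):
        # C = F(x_1, ..., x_m): the final encoder state summarizes the whole source
        _, C = self.encoder(self.embed(src_ids))
        # y_i = g(C, y_1, ..., y_{i-1}): every decoding step conditions on the same C
        dec_states, _ = self.decoder(self.embed(tgt_ids), C)
        return self.out(dec_states)  # logits over the target vocabulary
```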
$$y_1 = g(C_1) \qquad y_2 = g(C_2, y_1) \qquad y_3 = g(C_3, y_1, y_2)$$
$$C_i = \sum_{j=1}^{L_x} a_{ij}\, h_j$$
$$\text{Attention}(\text{Query}, \text{Source}) = \sum_{i=1}^{L_x} \text{Similarity}(\text{Query}, \text{Key}_i) \cdot \text{Value}_i$$
Dot: $\text{Similarity}(\text{Query}, \text{Key}_i) = \text{Query} \cdot \text{Key}_i$
Cosine: $\text{Similarity}(\text{Query}, \text{Key}_i) = \dfrac{\text{Query} \cdot \text{Key}_i}{\lVert\text{Query}\rVert\,\lVert\text{Key}_i\rVert}$
MLP: $\text{Similarity}(\text{Query}, \text{Key}_i) = \text{MLP}(\text{Query}, \text{Key}_i)$
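A minimal sketch of these scoring options, assuming PyTorch; the `attention` helper and the toy sizes are hypothetical:

```python
import torch
import torch.nn.functional as F

def attention(query, keys, values, similarity="dot", mlp=None):
    """Weighted sum of values; weights come from a similarity between query and each key."""
    if similarity == "dot":                        # Query · Key_i
        scores = keys @ query
    elif similarity == "cosine":                   # normalized dot product
        scores = F.cosine_similarity(keys, query.unsqueeze(0), dim=-1)
    else:                                          # "mlp": a small learned scorer
        q = query.expand(keys.size(0), -1)
        scores = mlp(torch.cat([q, keys], dim=-1)).squeeze(-1)
    weights = torch.softmax(scores, dim=0)         # a_i, sums to 1 over the L_x keys
    return weights @ values                        # the context vector C

# Toy example: 5 source positions, 16-dimensional keys/values, one query
keys = values = torch.randn(5, 16)
query = torch.randn(16)
context = attention(query, keys, values, similarity="dot")
```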
2014: Recurrent Models of Visual Attention
2014-2015: Attention in neural machine translation
2015-2016: Attention-based RNN/CNN in NLP
2017: Self-attention (Transformer)
Key words
Self-attention: helps the encoder look at other words in the input sentence as it encodes a specific word.
Feed-forward: the same feed-forward network is independently applied to each position.
Encoder-decoder attention: a layer that helps the decoder focus on relevant parts of the input sentence.
Dependent vs. independent: positions depend on each other in the self-attention layer, but flow through the feed-forward layer independently.
As the model processes each word (each position in the input sequence), self attention allows it to look at other positions in the input sequence for clues that can help lead to a better encoding for this word.
Query vector Key vector Value vector
The embedding and encoder input/output vectors have a size of 512; the query, key, and value vectors have a smaller size of 64.
The first step in calculating self-attention is to create three vectors (a query, a key, and a value vector) from each of the encoder's input vectors.
The second step in calculating self-attention is to calculate a score.
The third and fourth steps are to divide the scores by 8 (the square root of the key-vector dimension, 64), then pass the result through a softmax operation.
The fifth step is to multiply each value vector by the softmax score.
The sixth step is to sum up the weighted value vectors.
Scaled Dot-Product Attention
The self-attention calculation in matrix form
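A minimal matrix-form sketch, assuming PyTorch and the paper's d_model = 512, d_k = 64; the function name and the random projection matrices are illustrative only:

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, computed for a whole sequence at once."""
    d_k = K.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # one score per (query, key) pair
    weights = torch.softmax(scores, dim=-1)            # each row sums to 1
    return weights @ V

# Toy example: 6 tokens with d_model = 512, projected down to d_k = 64
X = torch.randn(6, 512)
W_q, W_k, W_v = (torch.randn(512, 64) for _ in range(3))
Z = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)  # shape (6, 64)
```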
Word embedding
FFNN output · Position embedding · Matrices Q, K, V
Self-attention: Q = K = V (queries, keys, and values all come from the same sequence). Attention (encoder-decoder): K = V ≠ Q (keys and values come from a different sequence than the queries).
sequences of words.
$$P(x_1, x_2, \dots, x_m) = p(x_1)\, p(x_2 \mid x_1)\, p(x_3 \mid x_1, x_2) \cdots$$
$$h_t = \tanh(W x_t + b) \qquad y_t = U h_t + c \qquad \text{softmax}(y_t)$$
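A small sketch of such a fixed-window neural LM, assuming PyTorch; `FeedForwardLM` and all sizes are illustrative choices, not the 2003 paper's exact setup:

```python
import torch
import torch.nn as nn

class FeedForwardLM(nn.Module):
    """Fixed-window neural LM in the spirit of the neural probabilistic language model."""
    def __init__(self, vocab_size, window=3, emb_dim=64, hid_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.W = nn.Linear(window * emb_dim, hid_dim)   # h = tanh(W x + b)
        self.U = nn.Linear(hid_dim, vocab_size)         # y = U h + c

    def forward(self, context_ids):                     # (batch, window) of token ids
        x = self.embed(context_ids).flatten(1)          # concatenate the window embeddings
        h = torch.tanh(self.W(x))
        return torch.log_softmax(self.U(h), dim=-1)     # log p(next word | window)
```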
Word2vec (2013); neural probabilistic language model (2003)
Representations model complex characteristics of word use (e.g., syntax and semantics).
ELMo representations are deep: they are a function of all of the internal layers of the biLM.
Higher-level LSTM states capture context-dependent aspects of word meaning, while lower-level states model aspects of syntax.
The biLM jointly maximizes the log likelihood of the forward and backward directions.
The token representation and the softmax layer share some weights between directions, instead of using completely independent parameters.
ELMo is a task-specific combination of the intermediate layer representations in the biLM.
For each token, an L-layer biLM computes a set of 2L + 1 representations.
ELMo collapses all layers into a single vector to combine these representations (in the simplest case, it just selects the top layer).
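For reference, the task-specific layer weighting from the ELMo paper, with softmax-normalized weights $s^{task}$ and a scalar $\gamma^{task}$:

$$\text{ELMo}_k^{task} = \gamma^{task} \sum_{j=0}^{L} s_j^{task}\, \mathbf{h}_{k,j}^{LM}$$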
Concatenate the ELMo vector with the token representation and pass this representation into the task RNN.
ELMo can also be included at the output of the task RNN by introducing another set of output-specific linear weights.
It is also beneficial to regularize the ELMo weights by adding $\lambda \lVert w \rVert_2^2$ to the loss.
Using representations from all biLM layers improves performance over just using the last layer, and including contextual representations from the last layer improves performance over the baseline.
Including ELMo at both the input and output of task architectures improves overall results for some tasks, but for SRL (and coreference resolution, not shown) performance is highest when it is included at just the input layer.
word sense in the source sentence.
These learned representations transfer with little adaptation to a wide range of tasks.
First, a language modeling objective is used to learn the initial parameters of a neural network model. These parameters are then adapted to a target task using the corresponding supervised objective.
The Transformer architecture was chosen because it better captures long-term linguistic structure.
GPT performs well on a wide range of tasks (significantly improving upon the state of the art in 9 out of the 12 tasks studied).
Given an unsupervised corpus of tokens, a standard language modeling objective is used to maximize the following likelihood:
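As stated in the GPT paper, the objective (with context window $k$) and the model's forward pass are:

$$L_1(\mathcal{U}) = \sum_i \log P(u_i \mid u_{i-k}, \dots, u_{i-1}; \Theta)$$
$$h_0 = U W_e + W_p, \qquad h_l = \text{transformer\_block}(h_{l-1})\ \ \forall l \in [1, n], \qquad P(u) = \text{softmax}(h_n W_e^{\top})$$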
$W_e$: token embedding matrix; $W_p$: position embedding matrix; $U$: context vector of tokens; $n$: number of layers.
For fine-tuning, the final transformer block's activation is fed into an added linear output layer.
Including language modeling as an auxiliary objective during fine-tuning helped learning by (a) improving generalization of the supervised model, and (b) accelerating convergence.
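The combined fine-tuning objective from the GPT paper, where $L_2$ is the supervised objective, $L_1$ the language modeling objective, and $\lambda$ its weight:

$$L_3(\mathcal{C}) = L_2(\mathcal{C}) + \lambda \cdot L_1(\mathcal{C})$$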
For tasks with structured inputs, a traversal-style approach converts them into an ordered sequence that the pre-trained model can process. Such inputs include ordered sentence pairs, or triplets of document, question, and answers.
Along a different dimension, the feature-based approach (e.g., ELMo) integrates contextual word embeddings with existing task-specific architectures.
The fine-tuning approach (e.g., GPT) pre-trains a model on an LM objective before fine-tuning that same model for a supervised downstream task.
The masked LM objective is to predict the original vocabulary id of the masked word based only on its context.
Pre-trained representations reduce the need for many heavily engineered task-specific architectures.
BERT's model architecture is a multi-layer bidirectional Transformer encoder.
BERT_BASE: Parameters = 110M (an identical model size to OpenAI GPT, for comparison purposes)
BERT_LARGE: Parameters = 340M
ELMo uses a shallow concatenation of independently trained left-to-right and right-to-left LSTMs.
The input representation is constructed by summing the corresponding token, segment, and position embeddings.
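A minimal sketch of this input construction, assuming PyTorch; `BertInputEmbedding` and the BERT_BASE-style sizes are illustrative:

```python
import torch
import torch.nn as nn

class BertInputEmbedding(nn.Module):
    """Input representation = token + segment + position embeddings, then LayerNorm."""
    def __init__(self, vocab_size=30522, max_len=512, hidden=768):
        super().__init__()
        self.token = nn.Embedding(vocab_size, hidden)
        self.segment = nn.Embedding(2, hidden)        # sentence A vs. sentence B
        self.position = nn.Embedding(max_len, hidden)
        self.norm = nn.LayerNorm(hidden)

    def forward(self, token_ids, segment_ids):        # both (batch, seq_len)
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        x = self.token(token_ids) + self.segment(segment_ids) + self.position(positions)
        return self.norm(x)
```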
Some percentage of the input tokens are masked at random, and only those masked tokens are predicted. The final hidden vectors corresponding to the masked tokens are fed into an output softmax over the vocabulary, as in a standard LM.
Downside 1: this creates a mismatch between pre-training and fine-tuning, since the [MASK] token is never seen during fine-tuning.
Downside 2: only a small fraction of tokens is predicted in each batch, which suggests that more pre-training steps may be required for the model to converge.
The empirical improvements of the masked LM far outweigh the increased training cost.
80% of the time, replace the word with [MASK]: my dog is hairy → my dog is [MASK]
10% of the time, replace the word with a random word: my dog is hairy → my dog is apple
10% of the time, keep the word unchanged: my dog is hairy → my dog is hairy
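A toy sketch of this masking rule in plain Python; the 15% selection rate, the `mask_tokens` name, and the tiny vocabulary are illustrative:

```python
import random

MASK, VOCAB = "[MASK]", ["apple", "dog", "runs", "blue", "hairy"]  # toy vocabulary

def mask_tokens(tokens, mask_prob=0.15):
    """Pick ~15% of positions at random, then apply the 80/10/10 rule to each."""
    inputs, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            targets[i] = tok                      # the model must recover the original token
            r = random.random()
            if r < 0.8:
                inputs[i] = MASK                  # 80%: replace with [MASK]
            elif r < 0.9:
                inputs[i] = random.choice(VOCAB)  # 10%: replace with a random word
            # else: 10% keep the original word unchanged
    return inputs, targets

print(mask_tokens("my dog is hairy".split()))
```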
In order to train a model that understands sentence relationships, BERT also pre-trains on a binarized next sentence prediction task.
When choosing the sentences A and B for each pre-training example, 50% of the time B is the actual next sentence that follows A, and 50% of the time it is a random sentence from the corpus.
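A toy sketch of building such a pair in plain Python; `make_nsp_example` and the two-document corpus are hypothetical:

```python
import random

def make_nsp_example(doc, corpus):
    """Build one next-sentence-prediction pair: (sentence A, sentence B, is_next)."""
    i = random.randrange(len(doc) - 1)
    a = doc[i]
    if random.random() < 0.5:
        return a, doc[i + 1], True                            # 50%: the real next sentence
    other = random.choice([d for d in corpus if d is not doc])
    return a, random.choice(other), False                     # 50%: a random sentence

corpus = [["Sentence A1.", "Sentence A2."], ["Sentence B1.", "Sentence B2."]]
print(make_nsp_example(corpus[0], corpus))
```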
The training loss is the sum of the mean masked LM likelihood and the mean next sentence prediction likelihood.
Training of BERT_BASE was performed on 4 Cloud TPUs in Pod configuration (16 TPU chips total); training of BERT_LARGE was performed on 16 Cloud TPUs (64 TPU chips total). Each pre-training took 4 days to complete.
Word2vec: restricted by window size.
ELMo: not truly contextual.
GPT: unidirectional.
BERT: deeply bidirectional.
(2018).
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. (2018).
https://mp.weixin.qq.com/s/8uZ2SJtzZhzQhoPY7XO9uw
https://mp.weixin.qq.com/s/I315hYPrxV0YYryqsUysXw
If I have forgotten to credit any tutorial, please forgive me. Thanks a lot for all of the excellent materials.