(Even More) Language Modeling: Multi-Task Learning, and Building Blocks of Transformers
CMSC 473/673
Frank Ferraro
Outline
- Multi-Task Learning
- The Attention Mechanism
- Transformer Language Models as General Language Encoders
Multi-class Classification
Given input 𝑦, predict discrete label 𝑧
If 𝑧 ∈ {0, 1} (or 𝑧 ∈ {True, False}), then it is a binary classification task. If 𝑧 ∈ {0, 1, …, 𝐿 − 1} (for finite 𝐿), then it is a multi-class classification task.
Multi-label Classification
(Diagram: single-output vs. multi-output prediction)
Given input 𝑦, predict multiple discrete labels 𝑧 = (𝑧_1, …, 𝑧_𝑀).
If multiple 𝑧_𝑚 are predicted, then it is a multi-label classification task. Each 𝑧_𝑚 could be binary or multi-class.
Remember from Deck 5
Multi-Label vs. Multi-Task
- These can be considered the same thing, but often they're different
- "Task": a thing of interest to predict
- Multi-label classification often involves multiple labels for the same task
  – E.g., sentiment (a tweet could be both "HAPPY" and "EXCITED")
- Multi-task learning is for different "tasks," e.g.,
  – Task 1: Category of document (SPORTS, FINANCE, etc.)
  – Task 2: Sentiment of document
  – Task 3: Part-of-speech per token
  – Task 4: Syntactic parsing
  – …
Multi-Task Learning
Single-Task Learning: train a system to "do one thing" (make predictions for one task):
x → h → y
If you have multiple (T) tasks, then train multiple systems, with different encoders and different decoders:
x → h_1 → y_1,  x → h_2 → y_2,  …,  x → h_T → y_T
Multi-Task Learning: train a single system to "do multiple things" (make predictions for T different tasks):
x → h → (y_1, y_2, …, y_T)
Key idea/assumption: if the tasks are somehow related, can we leverage an ability to do task i well into an ability to do task j well?
Example: could features/embeddings useful for language modeling (task i) also be useful for part-of-speech tagging (task j)?
In the multi-task setup, the same encoder learns good, general features/embeddings, and the different decoders learn how to use those representations for each task.
General Multi-Task Training Procedure
Given:
- T different corpora 𝐷_1, …, 𝐷_T, one per task, where 𝐷_t = {(𝑦_1^t, 𝑧_1^t), …, (𝑦_{N_t}^t, 𝑧_{N_t}^t)}
- An encoder 𝐹 and T different decoders 𝐸_1, …, 𝐸_T
- These have weights (parameters) you need to learn
Until converged or done:
- 1. Select the next task t
- 2. Randomly sample an instance (𝑦_j^t, 𝑧_j^t) from 𝐷_t
- 3. Train the encoder 𝐹 and decoder 𝐸_t on (𝑦_j^t, 𝑧_j^t)
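A minimal PyTorch sketch of this loop, with a toy shared encoder, per-task linear decoders, and made-up stand-in corpora (all dimensions and task names are illustrative, not taken from any paper discussed here):

```python
# Minimal sketch: multi-task training with one shared encoder F and
# per-task decoders E_t (hypothetical tasks, sizes, and random data).
import random
import torch
import torch.nn as nn

feat_dim, hidden_dim = 64, 128
num_labels = {"pos": 17, "sentiment": 3}              # illustrative tasks

encoder = nn.Sequential(nn.Linear(feat_dim, hidden_dim), nn.ReLU())   # F: shared
decoders = nn.ModuleDict({t: nn.Linear(hidden_dim, k)                 # E_t: per task
                          for t, k in num_labels.items()})
loss_fn = nn.CrossEntropyLoss()
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoders.parameters()))

# D_t: stand-in corpora of (y, z) pairs -- random tensors here just so it runs.
corpora = {t: [(torch.randn(feat_dim), torch.randint(k, (1,)).item())
               for _ in range(100)] for t, k in num_labels.items()}

for step in range(1000):                 # "until converged or done"
    t = random.choice(list(corpora))     # 1. select the next task t
    y, z = random.choice(corpora[t])     # 2. randomly sample an instance from D_t
    logits = decoders[t](encoder(y))     # 3. train encoder F and decoder E_t
    loss = loss_fn(logits.unsqueeze(0), torch.tensor([z]))
    opt.zero_grad(); loss.backward(); opt.step()
```

The key design point the sketch illustrates: the encoder's parameters receive gradients from every task, while each decoder is updated only on its own task's examples.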
WARNING: Multi-task learning did not begin in 2008
Two Well-Known Instances of Multi-Task Learning in NLP
- Collobert and Weston (2008, ICML)
- BERT (Devlin et al., 2019, NAACL)
We'll return to BERT later.
Collobert and Weston (2008, ICML)
Core task: Semantic Role Labeling
Present a unified architecture for doing five other, related NLP tasks:
- Part-of-Speech Tagging
- Chunking
- Named Entity Recognition
- Language Modeling
- Prediction of Semantic Relatedness
Semantic Role Labeling (SRL)
- For each predicate (e.g., verb):
  1. find its arguments (e.g., NPs)
  2. determine their semantic roles
John drove Mary from Austin to Dallas in his Toyota Prius. The hammer broke the window.
– agent: Actor of an action
– patient: Entity affected by the action
– source: Origin of the affected entity
– destination: Destination of the affected entity
– instrument: Tool used in performing the action
– beneficiary: Entity for whom the action is performed
Slide thanks to Ray Mooney (modified)
Slide courtesy Jason Eisner, with mild edits
Remember from Deck 4
Uses of Semantic Roles
- Find the answer to a user’s question
– "Who" questions usually want Agents
– "What" questions usually want Patients
– "How" and "with what" questions usually want Instruments
– "Where" questions frequently want Sources/Destinations
– "For whom" questions usually want Beneficiaries
– "To whom" questions usually want Destinations
- Generate text
– Many languages have specific syntactic constructions that must or should be used for specific semantic roles.
- Word sense disambiguation, using selectional restrictions
– The bat ate the bug. (what kind of bat? what kind of bug?)
- Agents (particularly of “eat”) should be animate – animal bat, not baseball bat
- Patients of “eat” should be edible – animal bug, not software bug
– John fired the secretary. John fired the rifle.
  Patients of fire_1 are different than patients of fire_2
Slide thanks to Ray Mooney (modified)
Slide courtesy Jason Eisner, with mild edits
Remember from Deck 4
Part of Speech Tagging
British/Noun  Left/Verb  Waffles/Noun  on/Prep  Falkland/Noun  Islands/Noun
(Diagram: per-token sequence prediction, x_0 → h_0 → y_0 through x_5 → h_5 → y_5.)
(The tag sequence is probably not right!)
Part-of-speech tagging: assign a part-of-speech tag to every word in a sentence
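To see this in action, here is a small sketch using NLTK's off-the-shelf tagger; the tagger and Penn Treebank tag set are NLTK's defaults, not the model from Collobert & Weston, and the example output is only indicative:

```python
# Off-the-shelf POS tagging with NLTK (requires nltk plus the
# 'punkt' and 'averaged_perceptron_tagger' data packages).
import nltk

tokens = nltk.word_tokenize("British Left Waffles on Falkland Islands")
print(nltk.pos_tag(tokens))
# e.g., [('British', 'JJ'), ('Left', 'NNP'), ('Waffles', 'NNP'), ('on', 'IN'), ...]
# (the tagger may or may not resolve the headline's ambiguity the way you expect)
```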
Syntactic Parsing (One Option)
(parse from the Berkeley parser: https://parser.kitaev.io/)
(parse is probably not right!)
Part-of-speech tagging: assign a part-of-speech tag to every word in a sentence Syntactic parsing: produce an analysis of a sentence according to some grammatical rules
Chunking: A Shallow Syntactic Parse
- (Variant 1) For every token, predict whether it's in a noun phrase (NP) or not
  British Left Waffles on Falkland Islands → ✓ ✓ ✓ ✓ ✗ ✗
  Treat this as a sequence prediction problem (see the chunker sketch below)
- (Variant 2) For every token, predict the type of grammatical phrase it should be part of
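Below is a minimal sketch of shallow NP chunking using NLTK's regular-expression chunker over hand-assigned POS tags; the grammar and tags are illustrative and this is not Collobert & Weston's learned chunker:

```python
# A tiny NP chunker (shallow parse) using NLTK's regexp chunker.
import nltk

sentence = [("British", "JJ"), ("Left", "NNP"), ("Waffles", "NNP"),
            ("on", "IN"), ("Falkland", "NNP"), ("Islands", "NNPS")]
grammar = "NP: {<DT>?<JJ>*<NN.*>+}"   # an NP: optional determiner, adjectives, then nouns
chunker = nltk.RegexpParser(grammar)
print(chunker.parse(sentence))
# Tokens grouped under NP nodes are "in a noun phrase"; the rest are left flat.
```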
Collobert & Weston Language Modeling
Our approach so far: predict a word given some previous words (V-class classification over the vocabulary):
q(x_1, …, x_n) = ∏_j q(x_j | x_{<j})
Their approach: predict* whether x_j is the correct word, based on its context (binary classification):
q(z = 1 | d = (x_{j−N}, …, x_{j−1}, x_{j+1}, …, x_{j+N}), x_j)
*They actually use a ranking loss, but it's close enough to what's described here
Collobert & Weston Language Modeling (Example)
Sentence: British Left Waffles on Falkland Islands
Word: "Waffles"
Predict*:
q(z = 1 | d = (Left, on), x_j = Waffles)
q(z = 0 | d = (Left, on), x_j = Hats)   (or any word but "Waffles")
*They actually use a ranking loss, but it's close enough to what's described here
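Here is a rough PyTorch sketch of that idea, scoring a candidate word in a fixed context window and training with a pairwise ranking (hinge) loss in the spirit of the footnote; the vocabulary, window size, and network sizes are all made up for illustration:

```python
# Sketch: score whether a word fits its context window, trained with a
# pairwise ranking loss (the correct word should outscore a corrupted word).
import random
import torch
import torch.nn as nn

vocab = ["British", "Left", "Waffles", "on", "Falkland", "Islands", "Hats"]
word2id = {w: i for i, w in enumerate(vocab)}
emb_dim, window = 16, 1                      # illustrative sizes

emb = nn.Embedding(len(vocab), emb_dim)
scorer = nn.Sequential(nn.Linear((2 * window + 1) * emb_dim, 32),
                       nn.ReLU(), nn.Linear(32, 1))

def score(context, word):
    ids = torch.tensor([word2id[w] for w in context[:window] + [word] + context[window:]])
    return scorer(emb(ids).flatten()).squeeze()   # higher = "word fits this context"

opt = torch.optim.Adam(list(emb.parameters()) + list(scorer.parameters()))
sentence = ["British", "Left", "Waffles", "on", "Falkland", "Islands"]
for _ in range(100):
    j = random.randrange(window, len(sentence) - window)
    context = sentence[j - window:j] + sentence[j + 1:j + 1 + window]
    corrupt = random.choice(vocab)               # any word (e.g., "Hats")
    # hinge ranking loss: s(correct) should beat s(corrupt) by a margin of 1
    loss = torch.clamp(1 - score(context, sentence[j]) + score(context, corrupt), min=0)
    opt.zero_grad(); loss.backward(); opt.step()
```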
Prediction of Semantic Relatedness
Are two words "semantically related"?
- Synonym: different word, same meaning
- Is-a relationships
  – X hypernym Y: X is a (sub)type of Y (car hypernym "motor vehicle")
  – X hyponym Y: X is a (super)type of Y (car hyponym sedan)
- Part/whole relationships
  – X meronym Y: X is a part of Y (window meronym car)
  – X holonym Y: X is the whole, with Y as a part (car holonym window)
- (and others)
WordNet
Knowledge graph containing concept relations.
(Example graph: "sandwich" with hyponyms "hamburger," "hero," and "gyro.")
- hypernym: specific to general (a hamburger is-a sandwich)
- hyponym: general to specific
- Other relationships too:
  – meronymy, holonymy (part of whole, whole of part)
  – troponymy (describing the manner of an event)
  – entailment (what else must happen in an event)
WordNet Knows About Hamburgers
hamburger → sandwich → snack food → dish → nutriment → food → substance → matter → physical entity → entity
(specific → general)
Browsing WordNet
http://wordnetweb.princeton.edu/perl/webwn
Each result is a synset (synonym set); you can get the relationships for each synset.
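These relations can also be explored programmatically; a small sketch using NLTK's WordNet interface (assumes nltk is installed and the WordNet data has been downloaded):

```python
# Browse WordNet programmatically with NLTK
# (requires: pip install nltk, then nltk.download("wordnet") once).
from nltk.corpus import wordnet as wn

for syn in wn.synsets("hamburger"):            # each result is a synset
    print(syn.name(), "-", syn.definition())

burger = wn.synsets("hamburger")[0]
print("hypernyms:", burger.hypernyms())        # more general concepts
print("one hypernym path:", [s.name() for s in burger.hypernym_paths()[0]])

car = wn.synset("car.n.01")
print("meronyms (parts of a car):", car.part_meronyms())
print("hyponyms (kinds of car):", car.hyponyms()[:5])
```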
Results (Error Rate: Lower is Better)
(Figure: error rates across word embedding sizes.)
Outline
- Multi-Task Learning
- The Attention Mechanism
- Transformer Language Models as General Language Encoders
Attention
A mechanism for signaling where in the input to focus ("attend to") when producing some output.
Each attention mechanism results in a probability distribution over the input. There are many ways of computing this.
Example: Translation
The cat is on the chair. → Le chat est sur la chaise.
A variable number of input words must map to a variable number of output words.
(Diagram: an encoder reads the input tokens x_0, …, x_n into hidden states h_0, …, h_n; a decoder produces output tokens y_0, …, y_m. Which parts of the input should it focus on when producing each output word?)
Attention answers this: when producing each output word, it places a probability-weighted focus on the input words (e.g., "chat" attends mostly to "cat").
Vaswani et al. (NeurIPS, 2017)
"Attention Is All You Need": Take-Aways
- 1. Formulation of attention as a query-key-value triple
- 2. "Transformer" model that uses self-attention
- 3. Demonstration that the transformer can outperform sequence-to-sequence recurrent models (but at a large computational cost!)
"Attention Is All You Need" Description of Attention
"An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key."
These query, key, value, and output items will be task dependent.
Translation example:
The cat is on the chair. → Le ◊
- query: the input & the current translation so far
- key: English words ([The], [cat], …, [bandage], …)
- value: French words ([Le], [chat], …, [fromage], …)
- output: the next translated word (here, "chat")
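As a concrete illustration of that description, here is a minimal PyTorch sketch of scaled dot-product attention, the compatibility function used in the paper; the tensor sizes are illustrative:

```python
# Scaled dot-product attention: output = softmax(Q K^T / sqrt(d_k)) V
import math
import torch
import torch.nn.functional as F

def attention(Q, K, V):
    # Q: (n_queries, d_k), K: (n_keys, d_k), V: (n_keys, d_v)
    scores = Q @ K.T / math.sqrt(K.shape[-1])   # compatibility of each query with each key
    weights = F.softmax(scores, dim=-1)         # a probability distribution over the input
    return weights @ V, weights                 # weighted sum of the values

# Illustrative sizes: 4 queries attending over 6 key-value pairs.
Q, K, V = torch.randn(4, 8), torch.randn(6, 8), torch.randn(6, 16)
out, weights = attention(Q, K, V)
print(out.shape, weights.shape)   # torch.Size([4, 16]) torch.Size([4, 6])
```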
Outline
- Multi-Task Learning
- The Attention Mechanism
- Transformer Language Models as General Language Encoders
Two Well-Known (Recent) Instances of Learning from Language Models
- GPT-2 (Radford et al., 2019)
- BERT (Devlin et al., 2019, NAACL)
GPT-2 & BERT (and others) In Practice
- Use TensorFlow code
- Use PyTorch code
→ The Hugging Face transformers package is very popular
GPT-2 Take-Away
Language models can provide an effective way of learning embeddings that are useful for downstream tasks.
- Auto-regressive model that uses a transformer cell:
  q(x_1, …, x_n) = ∏_j q(x_j | x_1, …, x_{j−1})
https://openai.com/blog/gpt-2-1-5b-release/
https://github.com/openai/gpt-2
GPT-2 Model & Representation
(Diagram: input tokens x_0 = BOS, x_1 = British, …, x_N = Islands feed hidden states h_0, …, h_N, computed with transformer cells; each h_j is used to predict the next token y_j, e.g., y_0 = British, y_1 = Left, ….)
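A small sketch of using a pre-trained GPT-2 through the Hugging Face transformers package to get next-token probabilities and generate a continuation; the model name and prompt are just examples:

```python
# Autoregressive next-token prediction with a pre-trained GPT-2
# (requires: pip install transformers torch).
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

ids = tok("British Left Waffles on Falkland", return_tensors="pt").input_ids
with torch.no_grad():
    logits = model(ids).logits              # (1, seq_len, vocab_size)
probs = logits[0, -1].softmax(dim=-1)       # q(x_j | x_1, ..., x_{j-1}) for the next token
top = probs.topk(5)
print([(tok.decode(int(i)), round(p.item(), 3)) for p, i in zip(top.values, top.indices)])

# Or simply generate a continuation:
print(tok.decode(model.generate(ids, max_new_tokens=10, do_sample=False)[0]))
```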
BERT Take-Aways
- 1. Demonstration of a bidirectional transformer for language understanding
- 2. Clean separation of "pre-training" and "fine-tuning" tasks
- 3. Clear demonstration that language model "pre-training" can yield useful embeddings
Pre-training vs. Fine-tuning
Pre-training: learning an encoder to produce effective embeddings through "general" training objectives that are end-task agnostic. BERT uses two:
- 1. Next-Sentence Prediction [NSP]: given two sentences t_1 and t_2, predict whether t_2 follows t_1 in "natural" text
- 2. Masked Language Modeling [MLM]: given a sentence t = x_1 … x_n, mask out (remove) a word x_j and predict what that word should be:
  "The cat chased the mouse" → "The cat [MASK] the mouse"; predict q(x | The cat [MASK] the mouse)
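A quick sketch of MLM in action with a pre-trained BERT via the Hugging Face fill-mask pipeline; bert-base-uncased is one common choice of model:

```python
# Masked language modeling with a pre-trained BERT
# (requires: pip install transformers torch).
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill("The cat [MASK] the mouse."):
    print(round(pred["score"], 3), pred["token_str"])   # top candidate fillers for [MASK]
```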
Fine-Tuning: learning task-specific decoders using the embeddings produced from the pre-training, e.g.,
- RTE
- Question answering
- <Your task here>
(Fig. 1 of Devlin et al.: pre-training, then fine-tuning for each downstream task.)
BERT Representation
- 1. A special [CLS] token should precede the entire input to BERT
- 2. Every sentence should be followed by a special [SEP] token
- 3. The input must be tokenized in a special way
- 4. Segment & position embeddings must be provided
(A tokenizer sketch illustrating these points follows below.)
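The transformers tokenizer handles points 1-3 and produces the segment ids needed for point 4; a brief sketch, using bert-base-uncased as an example:

```python
# BERT input construction with the transformers tokenizer
# (requires: pip install transformers).
from transformers import BertTokenizerFast

tok = BertTokenizerFast.from_pretrained("bert-base-uncased")
enc = tok("British Left Waffles", "on Falkland Islands")   # a two-"sentence" input

print(tok.convert_ids_to_tokens(enc["input_ids"]))
# e.g., ['[CLS]', 'british', 'left', ..., '[SEP]', 'on', ..., '[SEP]']
# (exact word pieces depend on the vocabulary)
print(enc["token_type_ids"])   # segment ids: 0s for the first sentence, 1s for the second
# Position embeddings are added inside the model based on token order.
```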
(Diagram: input tokens x_0 = [CLS], x_1 = British, x_2 = Left, …, x_{N+1} = [SEP] feed hidden states h_0, …, h_{N+1} and outputs y_0, …, y_{N+1}; the [CLS] token precedes the entire input, and the sentence is followed by [SEP].)
BERT Representation (Even More)
(Fig. 2 of Devlin et al., illustrating all four points: [CLS] precedes the entire input, each sentence is followed by [SEP], the input is tokenized into word pieces, and segment & position embeddings are provided for every token.)
Transformer Language Model Take-Aways
- 1. Clean separation of "pre-training" and "fine-tuning" tasks
- 2. Clear demonstration that language model "pre-training" can yield useful embeddings
GPT-2 & BERT (and others) In Practice
- Use TensorFlow code
- Use PyTorch code
→ The Hugging Face transformers package is very popular (Fig. 2, Wolf et al., 2020: https://arxiv.org/pdf/1910.03771.pdf)
Five Broad Categories of Neural Networks
- Single Input, Single Output
- Single Input, Multiple Outputs
- Multiple Inputs, Single Output
- Multiple Inputs, Multiple Outputs ("sequence prediction": no time delay)
- Multiple Inputs, Multiple Outputs ("sequence-to-sequence": with time delay)