Green NLP
Roy Schwartz
Allen Institute for AI / University of Washington / Hebrew University of Jerusalem
Premise: Big Models
T5: 11B parameters
https://www.microsoft.com/en-us/research/blog/turing-nlg-a-17-billion-parameter-language-model-by-microsoft/
https://medium.com/huggingface/distilbert-8cf3380435b5
Problems with Big Models: Research Community
https://syncedreview.com/2019/06/27/the-staggering-cost-of-training-sota-ai-models/
Problems with Big Models: General AI Community
https://towardsdatascience.com/too-big-to-deploy-how-gpt-2-is-breaking-production-63ab29f0897c
Problems with Big Models: Global Community
Strubell et al. (2019)
Green AI
Schwartz*, Dodge*, Smith & Etzioni (2019)
• Goals:
• Enhance reporting of computational budgets
• Add a price tag to scientific results
• Promote efficiency as a core evaluation for NLP
• Inference, training, model selection (e.g., hyperparameter tuning)
• In addition to accuracy
Big Models are Important
• Push the limits of SOTA
• Released large pre-trained models save compute
• Large models are potentially faster to train (Li et al., 2020)
• But big models have concerning side effects
• Inclusiveness, adoption, environment
• Our goal is to mitigate these side effects
Outline
• Enhanced Reporting
• Efficient Methods
Is Model A > Model B?
Reimers & Gurevych (2017)
Model A: F1 = 91.21
Model B: F1 = 90.94
Is Model A > Model B?
Melis et al. (2018)
Perplexity (↓), carefully tuned (1,500 trials)
BERT performs on par with RoBERTa/XLNet with better random seeds
Dodge, Ilharco, Schwartz et al. (2020)
Unfair Comparison
Is Model A > Model B?
Better(?) Comparison
Is Model A > Model B? | Budget
Budget-Aware Comparison
Performance | Budget (Clark et al., 2020)
Expected Validation
Dodge, Gururangan, Card, Schwartz & Smith (2019)
• Input: a set of experimental results {V_1, …, V_n}
• Define V*_k = max_{i ∈ {1,…,k}} V_i
• Expected validation performance: E[V*_k | k]
• k=1: mean({V_1, …, V_n})
• k=2: mean({max(V_i, V_j) ∀ 1 ≤ i < j ≤ n})
• k=n: V*_n = max_{i ∈ {1,…,n}} V_i
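A minimal sketch of one such estimator in Python (the function name is ours). It assumes the k results are i.i.d. draws from the empirical distribution of the observed scores, so P(V*_k ≤ v) = F(v)^k for the empirical CDF F:

```python
import numpy as np

def expected_max_performance(scores, k):
    """Expected maximum validation score over k draws (with replacement)
    from the empirical distribution of observed scores."""
    v = np.sort(np.asarray(scores, dtype=float))
    n = len(v)
    # P(max of k draws <= v_i) = (empirical CDF at v_i) ** k
    cdf = (np.arange(1, n + 1) / n) ** k
    # P(max of k draws == v_i), by differencing the CDF
    pmf = np.diff(np.concatenate(([0.0], cdf)))
    return float(np.dot(v, pmf))
```

With k=1 this reduces to the mean of the observed scores, and as k grows it approaches the observed max, matching the two endpoints of the plots described above.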
Expected Validation
Example: SST5
Expected Validation Properties
• Doesn't require rerunning any experiment
• An analysis of existing results
• More comprehensive than
• Reporting max (the rightmost point in our plots)
• Reporting mean (the leftmost point in our plots)
• https://github.com/dodgejesse/show_your_work
Reporting Recap
• Budget-aware comparison
• Expected validation performance
• Estimation of the amount of computation required to obtain a given accuracy
Reporting Open Questions
• How much will we gain by pouring in more compute?
• What should we report?
• Number of experiments
• Time
• FLOPs
• Energy (kWh)
• Carbon?
• Bigger models, faster training? (Li et al., 2020)
Green NLP Goals
• Enhanced Reporting
• Efficient Methods
Efficient Methods
What are we making more efficient? Inference, Training, Model Selection
What are we measuring? Space, Time, Energy
http://mitchgordon.me/machine/learning/2019/11/18/all-the-ways-to-compress-BERT.html
https://blog.inten.to/speeding-up-bert-5528e18bb4ea
https://blog.rasa.com/compressing-bert-for-faster-prediction-2/
Efficient #inference
• Model distillation #space; #time; #energy
• Hinton et al. (2015); MobileBERT (Sun et al., 2019); DistilBERT (Sanh et al., 2019)
• Pruning #space / Structured Pruning #space; #time; #energy
• Han et al. (2016); SNIP (Lee et al., 2019); LTH (Frankle & Carbin, 2019)
• MorphNet (Gordon et al., 2018); Michel et al. (2019); LayerDrop (Fan et al., 2020)
• Dodge, Schwartz et al. (2019)
• Quantization #space; #time; #energy
• Gong et al. (2014); Q8BERT (Zafrir et al., 2019); Q-BERT (Shen et al., 2019)
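To make the pruning idea concrete, here is a minimal sketch of unstructured magnitude pruning; the function name and this NumPy formulation are illustrative, not taken from any of the cited papers:

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction of weights.

    Unstructured pruning: the pruned tensor keeps its shape, but a
    `sparsity` fraction of entries (those with smallest |w|) become 0.
    Ties at the threshold may prune slightly more than requested.
    """
    w = np.asarray(weights, dtype=float)
    k = int(sparsity * w.size)
    if k == 0:
        return w.copy()
    # k-th smallest absolute value becomes the pruning threshold
    threshold = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    return np.where(np.abs(w) <= threshold, 0.0, w)
```

Structured variants (pruning whole heads, layers, or channels, as in the works cited above) additionally shrink the computation graph, which is what yields the #time and #energy savings.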
#space Efficiency
• Weight Factorization
• ALBERT (Lan et al., 2019); Wang et al. (2019)
• Weight Sharing
• Inan et al. (2016); Press & Wolf (2017)
Early Stopping #modelselection; #time; #energy
• Stop the least promising experiments early on
• Successive halving (Jamieson & Talwalkar, 2016)
• Hyperband (Li et al., 2017)
• Works for random seeds too!
• Dodge, Ilharco, Schwartz et al. (2020)
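A minimal sketch of successive halving (names and the `evaluate(config, budget)` interface are ours): run all configurations on a small budget, keep the better half, double the budget, repeat.

```python
def successive_halving(configs, evaluate, initial_budget=1, rounds=3):
    """Keep the best-scoring half of the configs each round,
    doubling the per-config budget for the survivors.

    `evaluate(config, budget)` returns a validation score
    (hypothetical interface; higher is better).
    """
    survivors = list(configs)
    budget = initial_budget
    for _ in range(rounds):
        scored = [(evaluate(c, budget), c) for c in survivors]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        survivors = [c for _, c in scored[: max(1, len(scored) // 2)]]
        budget *= 2  # surviving configs earn more compute next round
        if len(survivors) == 1:
            break
    return survivors[0]
```

Hyperband wraps this procedure in an outer loop over different initial budgets, hedging against runs that start slowly but finish strong.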
Other Efficient Methods
• Replacing dot-product attention with locality-sensitive hashing
• #inference; #space; #time; #energy
• Reformer (Kitaev et al., 2020)
• More efficient usage of the input
• #inference; #training; #space; #time; #energy
• ELECTRA (Clark et al., 2020)
• Analytical solution of the backward pass
• #inference; #space
• Deep equilibrium model (Bai et al., 2019)
Efficiency/Accuracy Tradeoff #inference; #time; #energy
Schwartz et al. (in review)
Performance | Budget (Clark et al., 2020)
Easy/Hard Instances: Variance in Language
1. The movie was awesome.
2. I could definitely see why this movie received such great critiques, but at the same time I can't help but wonder whether the plot was written by a 12-year-old or by an award-winning writer.
Matching Model and Instance Complexity
Run an efficient model on "easy" instances, and an expensive model on "hard" instances
Pretrained BERT Fine-tuning
[diagram: input passes through all layers 0…n; prediction from the top layer. Slowest, most accurate]
Faster, less Accurate
[diagram: prediction taken from an intermediate layer]
Fastest, least Accurate
[diagram: prediction taken from the bottom layer (layer 0)]
Our Approach: Training Time
[diagram: output classifiers attached to several layers; all classifiers are trained]
Our Approach: Test Time
[diagram: after each classifier layer, check "Is confident?"; if yes, exit early with that prediction; if no, continue to the next layers, up to the full model]
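The test-time loop above can be sketched as follows; the function and its layer/classifier interface are illustrative stand-ins, not the paper's implementation:

```python
def early_exit_predict(layers, classifiers, x, threshold):
    """Run layers in order; after each, consult its classifier and
    exit as soon as the top label probability clears the threshold.

    `layers` are hidden-state transforms; `classifiers[i]` maps the
    hidden state after layer i to a list of label probabilities.
    """
    h = x
    probs = None
    for layer, clf in zip(layers, classifiers):
        h = layer(h)
        probs = clf(h)
        if max(probs) >= threshold:  # confident enough: early exit
            break
    # If no exit fired, `probs` holds the final layer's prediction
    return probs.index(max(probs)), probs
```

Lowering the threshold trades accuracy for speed; raising it toward 1 recovers the full model's behavior.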
Calibrated Confidence Scores
• We interpret the softmax label scores as model confidence
• We calibrate our model to encourage the confidence level to correspond to the probability that the model is correct (DeGroot and Fienberg, 1983)
• We use temperature calibration (Guo et al., 2017):
  pred = argmax_i exp(z_i / T) / Σ_j exp(z_j / T)
• Speed/accuracy tradeoff controlled by a single early-exit confidence threshold
Experiments
• BERT-large-uncased (Devlin et al., 2019)
• Output classifiers added to layers 0, 4, 12, and 23
• Datasets
• 3 text classification, 2 NLI datasets
Baselines
Standard baseline [diagram: full model, prediction from the top layer]
Efficient baselines [diagram: truncated models, prediction from an intermediate layer]
Strong Baselines!
Better Speed/Accuracy Tradeoff
Better Speed/Accuracy Tradeoff
More about our Approach
• No effective growth in parameters
• < 0.005% additional parameters
• Training is not slower
• A single trained model provides multiple options along the speed/accuracy tradeoff
• A single parameter: confidence threshold
• Caveat: requires batch size = 1 during inference
Recap
• Efficient inference
• Simple instances exit early, hard instances get more compute
• Training is not slower than the original BERT model
• One model fits all!
• A single parameter controls the speed/accuracy curve
Efficiency Open Questions
• Can we drastically reduce the price of training BERT?
• Sample efficiency
• What makes a good sparse structure?
• What makes a good hyperparameter/random seed?
Think Green
• Show your work!
• Efficiency, not just accuracy
More about me: Understanding the NLP Development Cycle
• Datasets
• Annotation Artifacts (Schwartz et al., 2017; Gururangan et al., 2018)
• Inoculation by Fine-Tuning: A Method for Analyzing Challenge Datasets (Liu et al., 2019)
• Models
• Rational Recurrences (Schwartz et al., 2018; Peng et al., 2018; Merrill et al., in review)
• LSTMs Exploit Linguistic Attributes of Data (Liu et al., 2018)
• Experiments
• Show your Work (Dodge et al., 2019; 2020)
Amazing Collaborators!
Come to Jerusalem!
Think Green
• Efficiency research opportunities
• Can we drastically reduce the price of training BERT?
• Sample efficiency
• What makes a good sparse structure / hyperparameter / random seed?
• Reporting research opportunities
• How much will we gain by pouring in more compute?
• Better reporting methods