SLIDE 1

Q8BERT: Quantized 8Bit BERT

Ofir Zafrir, Guy Boudoukh, Peter Izsak, Moshe Wasserblat (Intel AI Lab)

EMC2 Workshop @ NeurIPS 2019

SLIDE 2

Motivation

  • Pre-trained Transformer language models such as BERT have demonstrated state-of-the-art results for a variety of NLP tasks
  • BERT poses a challenge to deployment in production environments
  • Google: “Some of the models we can build with BERT are so complex that they push the limits of what we can do using traditional hardware”*
  • Hardware supporting 8bit Integer quantization is emerging
  • Quantization to 8bit Integers can accelerate inference using only 25% of the memory footprint (see the arithmetic sketch below)

* https://www.blog.google/products/search/search-language-understanding-bert/

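The 25% figure follows directly from storage widths: FP32 uses 4 bytes per weight, Int8 uses 1. A quick back-of-the-envelope check in Python, assuming roughly 110M parameters for BERT-Base (the parameter count is an assumption here, not from the slides):

```python
# Memory-footprint arithmetic behind the 25% claim: FP32 stores each
# weight in 4 bytes, Int8 in 1 byte, so the ratio is 1/4 regardless of size.
NUM_PARAMS = 110_000_000  # approximate BERT-Base parameter count (assumption)

fp32_mib = NUM_PARAMS * 4 / 2**20
int8_mib = NUM_PARAMS * 1 / 2**20
print(f"FP32: {fp32_mib:.0f} MiB, Int8: {int8_mib:.0f} MiB "
      f"({int8_mib / fp32_mib:.0%} of the FP32 footprint)")
```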

SLIDE 3

Quantization-Aware Training (QAT)

  • Train Neural Networks (NN) to be quantized at the inference stage
  • Fake quantization is used to introduce the quantization error during training (see the sketch below)
  • We apply Fake Quantization on all GEMM & Word/Position Embedding layers
  • Sensitive operations are kept in FP32 (Softmax, LN, GELU)

[Figure: the training graph, with Fake Quantization applied to the Weights, GEMM, and Bias, alongside the Int8 inference graph with Quantize and Requantize operations]
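To make the fake-quantization step concrete, here is a minimal PyTorch sketch assuming the symmetric linear 8bit scheme described above; `fake_quantize` and `FakeQuantLinear` are illustrative names, not the paper's or NLP Architect's API:

```python
import torch
from torch import nn
from torch.nn import functional as F

def fake_quantize(x: torch.Tensor, num_bits: int = 8) -> torch.Tensor:
    # Symmetric linear quantization: map the largest magnitude onto the
    # signed integer range, [-127, 127] for 8 bits.
    qmax = 2 ** (num_bits - 1) - 1
    scale = qmax / x.abs().max().clamp(min=1e-8)
    # The quantize-dequantize round trip injects the Int8 rounding error.
    x_dq = torch.clamp(torch.round(x * scale), -qmax, qmax) / scale
    # Straight-through estimator: the forward pass sees quantized values,
    # while the backward pass treats the rounding as the identity.
    return x + (x_dq - x).detach()

class FakeQuantLinear(nn.Linear):
    """A GEMM layer whose weights pass through fake quantization in the
    forward pass, while parameters and gradients remain FP32."""
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.linear(x, fake_quantize(self.weight), self.bias)
```

In the paper, activations are quantized as well, with their ranges tracked by an exponential moving average during training; this sketch quantizes only the weights for brevity.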


SLIDE 4

Experiments & Results

  • GLUE benchmark and SQuADv1.1 datasets
  • QAT while fine-tuning pre-trained BERT-Base/Large
  • Reported mean and STD over five experiments
  • Relative error induced by quantization is less than 1% (computed as in the check below)

Task       Metric  Baseline Score (STD)  Our 8bit BERT Score (STD)  Relative Error
CoLA       MCC     58.48 (1.54)          58.48 (1.32)                0.00%
MRPC       F1      90.00 (0.23)          89.56 (0.18)               -0.49%
MRPC-L*    F1      90.86 (0.55)          90.90 (0.29)                0.04%
QNLI       Acc.    90.30 (0.44)          90.62 (0.29)                0.35%
QNLI-L*    Acc.    91.66 (0.15)          91.74 (0.36)                0.09%
QQP        F1      87.84 (0.19)          87.96 (0.35)                0.14%
RTE        Acc.    69.70 (1.50)          68.78 (3.52)               -1.32%
SST-2      Acc.    92.36 (0.59)          92.24 (0.27)               -0.13%
STS-B      PCC     89.62 (0.31)          89.04 (0.17)               -0.65%
STS-B-L*   PCC     90.34 (0.21)          90.12 (0.13)               -0.24%
SQuADv1.1  F1      88.46 (0.15)          87.74 (0.15)               -0.81%


* -L means BERT-Large was used
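Relative error here reads as the change from the baseline score, which matches the signs in the table; a one-line check against the MRPC row:

```python
baseline, quantized = 90.00, 89.56  # MRPC F1 scores from the table above
print(f"{(quantized - baseline) / baseline:.2%}")  # -> -0.49%
```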

SLIDE 5

Comparison with Dynamic Quantization

  • Compare QAT to post-training quantization: Dynamic Quantization (DQ) (a common realization is sketched below)
  • Applied DQ to the baseline models for each dataset
  • The DQ method produces significantly worse results across all tasks

Task       Metric  Baseline Score (STD)  DQ 8bit BERT Score (STD)  Relative Error
CoLA       MCC     58.48 (1.54)          56.74 (0.61)              -2.98%
MRPC       F1      90.00 (0.23)          87.88 (2.03)              -2.36%
MRPC-L*    F1      90.86 (0.55)          88.18 (2.19)              -2.95%
QNLI       Acc.    90.30 (0.44)          89.34 (0.61)              -1.06%
QNLI-L*    Acc.    91.66 (0.15)          88.38 (2.22)              -3.58%
QQP        F1      87.84 (0.19)          84.98 (0.97)              -3.26%
RTE        Acc.    69.70 (1.50)          63.32 (4.58)              -9.15%
SST-2      Acc.    92.36 (0.59)          91.04 (0.43)              -1.43%
STS-B      PCC     89.62 (0.31)          87.66 (0.41)              -2.19%
STS-B-L*   PCC     90.34 (0.21)          83.04 (5.71)              -8.08%
SQuADv1.1  F1      88.46 (0.15)          80.02 (2.38)              -9.54%


*-L means BERT-Large was used
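The slides do not specify the DQ implementation; as one widely used realization of post-training dynamic quantization, PyTorch's built-in pass is shown here on a toy feed-forward stand-in rather than on BERT itself:

```python
import torch
from torch import nn

# Toy stand-in for a Transformer feed-forward block. Dynamic quantization
# converts nn.Linear weights to Int8 once, ahead of time, and chooses
# activation scales on the fly at inference; no retraining is involved,
# which is why it cannot compensate for the quantization error.
model = nn.Sequential(nn.Linear(768, 3072), nn.GELU(), nn.Linear(3072, 768))

dq_model = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)
print(dq_model)  # nn.Linear layers replaced by dynamically quantized versions
```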

SLIDE 6

Conclusions

  • We have presented a method for quantizing BERT to 8bit for a variety of NLP tasks with minimal loss in accuracy
  • We compared our Quantization-Aware Training method to a Post-Training Quantization method and showed that our method produces significantly better results
  • Future directions are to run Q8BERT on supporting hardware, and to apply other compression methods to BERT and combine them
  • We made our work available to the community in our open-source library NLP Architect


NervanaSystems/nlp-architect