Deep (Transfer) Learning for NLP on Small Data Sets Evaluating - PowerPoint PPT Presentation

Public: For presentation at NVIDIA GTC Conference Public Talk ID: S9610 Deep (Transfer) Learning for NLP on Small Data Sets Evaluating efficacy and application of techniques Hanoz Bhathena and Raghav 'Mady' Madhavan UBS Evidence Lab March 20, 2019

Public: For presentation at NVIDIA GTC Conference Public Disclaimer Opinions and views shared here are our personal ones, and not those of UBS or UBS Evidence Lab. Any mention of Companies, Public or Private, and/or their Brands, Products or Services is for illustrative purposes only and does not reflect a recommendation.

Agenda • Problem & Motivation • Transfer Learning Fundamentals • Transfer Learning for small datasets in NLP • Experiments • Results • Conclusion • Future Work • Q & A 2

Problem • Large (labeled) datasets has been the fuel that has powered the deep learning revolution of NLP • However, in common business contexts, labeled data can be scarce • Examples: – Financial documents – Legal documents – Client feedback emails – Classification from Clinical visits • Issues: – Expensive to get labeling services – Data privacy concerns – Experimentation phase (unknown payoff; when to stop tagging?) 3

Motivation Enable building deep learning models when small quantities of labeled data are available Increase usability of deep learning for NLP tasks Decrease time required to develop models Democratize model development beyond NLP experts 4

Deep learning with less labeled data • Transfer learning • Semi-supervised learning • Artificial data augmentation • Weak supervision • Zero-shot learning • One-shot learning • Few shot learning • ……. 5

Deep Transfer Learning Introduction Use a model trained for one or more tasks to solve another different, but somewhat related, task Transfer Learning Pre Training Learning Algorithm Pre- Learning Task Trained Data Specific Algorithm Model Model (Source Domain) Data (Target Domain) After supervised learning — Transfer Learning will be the next driver of ML commercial success - Andrew Ng, NIPS 2016 6

Transfer Learning in Computer Vision Source: Stanford CS231N lecture slides: Fei-Fei Li & Justin Johnson & Serena Yeung 7

Transfer Learning – General Rule Source: Stanford CS231N lecture slides: Fei-Fei Li & Justin Johnson & Serena Yeung 8

So, what about Transfer Learning for NLP? • Is there a source dataset like ImageNet for NLP? • Does this dataset require annotations? Or can we leverage unsupervised learning somehow? • What are some common model architectures for NLP problems that optimize for knowledge transfer? • How low can we go in terms of data requirements in our target domain? • Should we tune the entire pre-trained model or just use it as a feature generator for downstream tasks? 9

Transfer Learning for NLP – Pre-2018 • Word2Vec (Feature based and Fine-tunable) (https://arxiv.org/abs/1310.4546) • Glove (Feature based and Fine-tunable) (https://nlp.stanford.edu/pubs/glove.pdf) • FastText (Feature based and Fine-tunable) (https://arxiv.org/abs/1607.04606) • Sequence Autoencoders (Feature based and Fine-tunable) (https://arxiv.org/abs/1511.01432) • LSTM language model pre-training (Feature based and Fine-tunable) (https://arxiv.org/abs/1511.01432) 10

Transfer Learning for NLP – 2018 and Beyond • Supervised Learning of Universal Sentence Representations from NLI Data (InferSent) (https://arxiv.org/abs/1705.02364) ** • Deep contextualized word representations (ELMo) (https://arxiv.org/abs/1802.05365) • Universal Sentence Encoder (https://arxiv.org/abs/1803.11175) • OpenAI GPT (https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language- unsupervised/language_understanding_paper.pdf) • BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (https://arxiv.org/abs/1810.04805) • Universal Language Model Fine-tuning for Text Classification (ULMFiT) (https://arxiv.org/abs/1801.06146) • GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding (https://arxiv.org/abs/1804.07461, https://github.com/nyu-mll/GLUE-baselines) • OpenAI GPT 2 (https://d4mucfpksywv.cloudfront.net/better-language- models/language_models_are_unsupervised_multitask_learners.pdf) ** This was actually published in 2017 11

What is GLUE and how is our objective different? • Because with exception of WNLI (and perhaps RTE), most of these datasets are still too large to create especially for experimental projects in a commercial setting. • Is it possible to create meaningful deep learning models for classification on just a few hundred samples? Source: Original GLUE paper (https://arxiv.org/abs/1804.07461) 12

Deep contextualized word representations (ELMo) • Generates context dependent word embeddings • Example: the word vector for the word "bank" in the sentence "I am going to the bank" will be different from the vector for the sentence "We can bank on him" • The model comprises of a character level CNN model followed by a L=2 layer bi-directional LSTM model • Weighted average of the embeddings from char-CNN and the hidden vectors from the 2 layer bi-LSTM • Language model pretraining on the 1B Word Benchmark • Pre-trained model is available on Tensorflow-Hub and AllenNLP 13

Universal Sentence Encoder • Two types: Deep Averaging Network (DAN) and Transformer network • Multi-task training on a combination of supervised and unsupervised training objectives • Trained on varied datasets like Wikipedia, web news, blogs • Uses attention to compute context aware word embeddings which are combined into a sentence level representation • Pre-trained model is available on Tensorflow-Hub 14

BERT • Uses the encoder half of Transformer • The input is tokenized using a WordPiece tokenizer (Wu et al., 2016) • Training on a dual task: Masked LM and next sentence prediction • The next sentence prediction task learns to predict, given two sentences A and B, whether the second sentence (B) comes after the first one (A) • This enables the BERT model to understand sentence relationships and thereby a higher level understanding capability compared to just a language model training • Data for pre-training: BookCorpus (800mn words) + English Wikipedia (2.5bn words) • BERT obtains SOTA results on 11 NLP tasks in the GLUE benchmark 15

BERT vs ELMo - Architecture Source: Original BERT paper 16

Experiments: Setup ➢ Feature based learning: Only train the final layer(s) Transfer learning training paradigms ➢ Finetune based learning: Fine tune all layers using a small learning rate ➢ Baseline CNN (with and without pretrained Glove embeddings) ➢ ELMo Models to evaluate ➢ Universal Sentence Encoder ➢ BERT ➢ Mean, Standard Deviation of Out-of-Sample Accuracy after N trials Evaluation Criteria ➢ No explicit attempt to optimize hyperparameters ➢ Some pre-trained model architecture will be well suited for all applications Apriori Expectations ➢ Either finetuning or feature mode will emerge a consistent winner 17

Experiment 1: IMDB Rating Application – Sentiment classification model on IMDB movie reviews – Binary classification problem: positive or negative – 25,000 Training samples; 12,500 positive and 12,500 negative – 25,000 Test samples; 12,500 positive and 12,500 negative 18

Experiment 1: IMDB Rating Application Naïve baseline model: CNN with BatchNorm and Dropout WITHOUT pretrained Glove 100 Trials each Mean Test Accuracy Std. Dev. Test Accuracy 90% 4% 80.9% 3.6% 80% 70% 3% 60% 63.3% 50% 2% 40% 30% 1% 20% 10% 0.2% 0% 0% 100 200 300 400 500 600 1000 Training Size Using 25,000 training sample yields: 87.1% Source: UBS Evidence Lab 19

Experiment 1: IMDB Rating Application More realistic baseline model: CNN with BatchNorm and Dropout WITH pretrained Glove 100 Trials each Mean Test Accuracy Std. Dev. Test Accuracy 84% 4% 82.2% 82% 3.4% 80% 3% 78% 76% 2% 74% 72% 72.4% 1% 70% 0.3% 68% 66% 0% 100 200 300 400 500 600 1000 Training Size Using 25,000 training sample yields: 89.8% Source: UBS Evidence Lab 20

Experiment 1: IMDB Rating Application Universal Sentence Encoder: DAN Fine Tuning based Training – 10 Trials each Feature based Training – 10 Trials each Mean Test Accuracy Std. Dev. Test Accuracy Mean Test Accuracy Std. Dev. Test Accuracy 85% 2% 85% 10% 81.5% 9% 81.0% 1.7% 80% 80% 8% 7% 75% 75% 6% 5.5% 74.1% 1% 5% 70% 70% 4% 3% 0.4% 65% 65% 2% 1.0% 1% 61.8% 60% 0% 60% 0% 100 200 300 400 500 600 1000 100 200 300 400 500 600 1000 Training Size Training Size Using 25,000 training sample yields: 86.6% Using 25,000 training sample yields: 82.6% Source: UBS Evidence Lab 21

Experiment 1: IMDB Rating Application BERT Fine Tuning based Training – 100 Trials each Feature based Training – 10 Trials each Mean Test Accuracy Std. Dev. Test Accuracy Mean Test Accuracy Std. Dev. Test Accuracy 90% 7% 90% 5% 4.7% 88.4% 77.6% 80% 88% 6% 5.8% 4% 70% 86% 5% 60% 84% 3% 57.8% 4% 50% 82% 40% 80% 3% 2% 30% 78% 78.3% 2% 1.2% 20% 76% 1% 1% 0.4% 10% 74% 0% 0% 72% 0% 100 200 300 400 500 600 1000 100 200 300 400 500 750 1000 Training Size Training Size Using 25,000 training sample yields: 92.5% Using 25,000 training sample yields: 81.8% Source: UBS Evidence Lab 22

Deep (Transfer) Learning for NLP on Small Data Sets Evaluating - PowerPoint PPT Presentation

Public: For presentation at NVIDIA GTC Conference Public Talk ID: S9610 Deep (Transfer) Learning for NLP on Small Data Sets Evaluating efficacy and application of techniques Hanoz Bhathena and Raghav 'Mady' Madhavan UBS Evidence Lab March

Transfer Learning in NLP Helping Small Teams Account for Small Datasets Ryan Smith

SI485i : NLP Missing Topics and the Future Who cares about NLP? NLP has expanded quickly

SI425 : NLP Missing Topics and the Future Who cares about NLP? NLP has expanded quickly

Recurrent Neural Networks Graham Neubig Site https://phontron.com/class/nn4nlp2017/ NLP and

Deep Learning for NLP Kiran Vodrahalli Feb 11, 2015 Overview What is NLP? Natural

Deep learning for NLP: Introduction CS 6956: Deep Learning for NLP Words are a very fantastical

Natural Language Processing with Deep Learning CS224N The Future of Deep Learning + NLP Kevin

NLP: Two pictures Wordnet and Word Sense Problem NLP Disambiguation Semantics NLP Trinity

Hao Su July 6, 2017 Outline Overview of 3D deep learning 3D deep learning algorithms

All You Want To Know About CNNs Yukun Zhu Deep Learning Deep Learning Image from

Hybrid NLP Hybrid NLP O UTLINE O UTLINE Problems of Deep and Shallow Processing

NLP Resource Creation and Enrichment using Deep Learning Kevin Patel Guided by: Prof. Shivaram

Deep Neural Networks and Deep Reinforcement Learning Deep Learning, Goodfellow, Bengio and

MATH 105: Finite Mathematics 6-1: Sets Prof. Jonathan Duncan Walla Walla College Winter

Introduction to Deep Processing Techniques for NLP Deep Processing Techniques for NLP Ling 571

Introduction to Deep Processing Techniques for NLP Deep Processing Techniques for NLP Ling 571

Investor Presentation March 2016 1 SAFE HARBOR During the course of this presentation the

Maximizing Face Detection Performance Paulius Micikevicius Developer Technology Engineer, NVIDIA

WMAP 5-Year Results: Measurement of f NL Eiichiro Komatsu University of Texas at Austin

21 February 2018 Important notice and disclaimer CONTENT OF PRESENTATION FOR INFORMATION PURPOSES

Friday 05 July 2019 Welcome Twitter: @AdultPSWNetwork @PCFSWNetwor #PSWJointConference19

ROUND5 Update and Future Directions Hayo Baan 1 , Sauvik Bhattacharya 1 , Scott Fluhrer 2 , Oscar

Commercial & Utility Scale Photovoltaic Array Management Gordy Presher 585-388-4314

Vermont A Health System for the 21 st Century William C. Hsiao Jonathan Gruber Steven Kappel

Deep (Transfer) Learning for NLP on Small Data Sets Evaluating - PowerPoint PPT Presentation

Public: For presentation at NVIDIA GTC Conference Public Talk ID: S9610 Deep (Transfer) Learning for NLP on Small Data Sets Evaluating efficacy and application of techniques Hanoz Bhathena and Raghav 'Mady' Madhavan UBS Evidence Lab March

Transfer Learning in NLP Helping Small Teams Account for Small Datasets Ryan Smith

SI485i : NLP Missing Topics and the Future Who cares about NLP? NLP has expanded quickly

SI425 : NLP Missing Topics and the Future Who cares about NLP? NLP has expanded quickly

Recurrent Neural Networks Graham Neubig Site https://phontron.com/class/nn4nlp2017/ NLP and

Deep Learning for NLP Kiran Vodrahalli Feb 11, 2015 Overview What is NLP? Natural

Deep learning for NLP: Introduction CS 6956: Deep Learning for NLP Words are a very fantastical

Natural Language Processing with Deep Learning CS224N The Future of Deep Learning + NLP Kevin

NLP: Two pictures Wordnet and Word Sense Problem NLP Disambiguation Semantics NLP Trinity

Hao Su July 6, 2017 Outline Overview of 3D deep learning 3D deep learning algorithms

All You Want To Know About CNNs Yukun Zhu Deep Learning Deep Learning Image from

Hybrid NLP Hybrid NLP O UTLINE O UTLINE Problems of Deep and Shallow Processing

NLP Resource Creation and Enrichment using Deep Learning Kevin Patel Guided by: Prof. Shivaram

Deep Neural Networks and Deep Reinforcement Learning Deep Learning, Goodfellow, Bengio and

MATH 105: Finite Mathematics 6-1: Sets Prof. Jonathan Duncan Walla Walla College Winter

Introduction to Deep Processing Techniques for NLP Deep Processing Techniques for NLP Ling 571

Introduction to Deep Processing Techniques for NLP Deep Processing Techniques for NLP Ling 571

Investor Presentation March 2016 1 SAFE HARBOR During the course of this presentation the

Maximizing Face Detection Performance Paulius Micikevicius Developer Technology Engineer, NVIDIA

WMAP 5-Year Results: Measurement of f NL Eiichiro Komatsu University of Texas at Austin

21 February 2018 Important notice and disclaimer CONTENT OF PRESENTATION FOR INFORMATION PURPOSES

Friday 05 July 2019 Welcome Twitter: @AdultPSWNetwork @PCFSWNetwor #PSWJointConference19

ROUND5 Update and Future Directions Hayo Baan 1 , Sauvik Bhattacharya 1 , Scott Fluhrer 2 , Oscar

Commercial &amp; Utility Scale Photovoltaic Array Management Gordy Presher 585-388-4314

Vermont A Health System for the 21 st Century William C. Hsiao Jonathan Gruber Steven Kappel

Commercial & Utility Scale Photovoltaic Array Management Gordy Presher 585-388-4314