SLIDE 1

Peter Izsak, Shira Guskin, Moshe Wasserblat Intel AI Lab

EMC2 Workshop @ NeurIPS 2019

SLIDE 2

Motivation

  • Named Entity Recognition (NER) is a widely used Information Extraction task in many industrial applications and use cases
  • Ramping up on a new domain can be difficult
    ▪ Lots of unlabeled data, little or no labeled data, and often not enough to train a model with good performance

Solution A
  • Hire a linguist or data scientist to tune/build a model
  • Hire annotators to label more data, or buy a similar dataset
  • Time/compute resource limitations

Solution B
  • Pre-trained Language Models such as BERT, GPT, and ELMo are great in low-resource scenarios
  • They require great compute and memory resources and suffer from high latency in inference
  • Deploying such models in production or on edge devices is a major issue


SLIDE 3

Enhancing a Compact Model

  • Approach:
  • Train a compact model (3M parameters) using a large pre-trained model as a teacher
  • Pre-trained word embeddings (non-shared embeddings)
  • Utilize labeled and unlabeled data:
  • Knowledge Distillation
  • Pseudo-labeling (see the sketch below)
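The pseudo-labeling step can be pictured with a short sketch. This is illustrative only, not the NLP Architect implementation: it assumes a fine-tuned Hugging Face token-classification teacher, and the checkpoint path and example sentence are hypothetical.

```python
# Sketch: pseudo-labeling unlabeled sentences with a fine-tuned teacher.
# "bert-finetuned-ner" is a hypothetical local checkpoint, not a published model.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

teacher_dir = "bert-finetuned-ner"
tokenizer = AutoTokenizer.from_pretrained(teacher_dir)
teacher = AutoModelForTokenClassification.from_pretrained(teacher_dir).eval()

@torch.no_grad()
def pseudo_label(pretokenized_sentences):
    """Return the teacher's soft targets and hard pseudo-labels per subword token."""
    batch = tokenizer(pretokenized_sentences, is_split_into_words=True,
                      padding=True, truncation=True, return_tensors="pt")
    logits = teacher(**batch).logits          # (batch, seq_len, num_labels)
    soft_targets = logits.softmax(dim=-1)     # fed to the distillation loss
    pseudo_labels = logits.argmax(dim=-1)     # fed to the task loss as labels
    return soft_targets, pseudo_labels

soft, hard = pseudo_label([["Intel", "is", "based", "in", "Santa", "Clara"]])
```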
SLIDE 4

[Diagram: Teacher Model and Compact Model. Labeled examples contribute annotated labels and unlabeled examples contribute the teacher's pseudo-labels to the Task Loss; the teacher's and compact model's soft targets are compared by a KL-Divergence Distillation Loss.]

๐‘€๐‘ข๐‘๐‘ก๐‘™ = แ‰Š CrossEntropy(เทœ ๐‘ง, ๐‘ง) ๐‘š๐‘๐‘๐‘“๐‘š๐‘“๐‘’ ๐‘“๐‘ฆ๐‘๐‘›๐‘ž๐‘š๐‘“ CrossEntropy(เทœ ๐‘ง, เทœ ๐‘ง๐‘ข๐‘“๐‘๐‘‘โ„Ž๐‘“๐‘ ) ๐‘ฃ๐‘œ๐‘š๐‘๐‘๐‘“๐‘š๐‘“๐‘’ ๐‘“๐‘ฆ๐‘๐‘›๐‘ž๐‘š๐‘“ ๐‘€๐‘’๐‘—๐‘ก๐‘ข๐‘—๐‘š๐‘š๐‘๐‘ข๐‘—๐‘๐‘œ = KL(๐‘š๐‘๐‘•๐‘—๐‘ข๐‘ก๐‘ข๐‘“๐‘๐‘‘โ„Ž๐‘“๐‘ ||๐‘š๐‘๐‘•๐‘—๐‘ข๐‘ก๐‘‘๐‘๐‘›๐‘ž๐‘๐‘‘๐‘ข) ๐‘€๐‘๐‘ก๐‘ก = ๐›ฝ โ‹… ๐‘€๐‘ข๐‘๐‘ก๐‘™ + ๐›พ โ‹… ๐‘€๐‘’๐‘—๐‘ก๐‘ข๐‘—๐‘š๐‘š๐‘๐‘ข๐‘—๐‘๐‘œ, ๐›ฝ + ๐›พ = 1.0

Model training setup

  • Integrated knowledge distillation and pseudo-labeling in the loss function

Models

  • Teacher – BERT-base/large (110M/340M params.)
  • Compact – LSTM-CNN with Softmax/CRF (3M params.)
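For concreteness, a minimal sketch of a compact LSTM-CNN tagger of this kind is given below. The layer sizes, the char-CNN details, and the plain softmax head in place of a CRF are assumptions for brevity; it is not the exact 3M-parameter NLP Architect model.

```python
import torch
import torch.nn as nn

class CompactTagger(nn.Module):
    """Word BiLSTM over [word embedding ; char-CNN feature] with a softmax head.
    A CRF layer could replace the linear classifier, as on the slide."""

    def __init__(self, word_vocab, char_vocab, num_labels,
                 word_dim=100, char_dim=30, char_filters=30, hidden=100):
        super().__init__()
        self.word_emb = nn.Embedding(word_vocab, word_dim, padding_idx=0)
        self.char_emb = nn.Embedding(char_vocab, char_dim, padding_idx=0)
        self.char_cnn = nn.Conv1d(char_dim, char_filters, kernel_size=3, padding=1)
        self.lstm = nn.LSTM(word_dim + char_filters, hidden,
                            batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, num_labels)

    def forward(self, word_ids, char_ids):
        # word_ids: (batch, seq_len); char_ids: (batch, seq_len, max_word_len)
        b, s, w = char_ids.shape
        chars = self.char_emb(char_ids).view(b * s, w, -1).transpose(1, 2)
        char_feats = self.char_cnn(chars).max(dim=-1).values.view(b, s, -1)
        features = torch.cat([self.word_emb(word_ids), char_feats], dim=-1)
        hidden_states, _ = self.lstm(features)
        return self.classifier(hidden_states)  # (batch, seq_len, num_labels)
```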

Low-resource Dataset Simulation

  • CoNLL 2003 (English) – PER/LOC/ORG/MISC
  • Generate random training sets with labeled/unlabeled examples (see the sketch below)
  • Train set size: 150/300/750/1500/3000
  • Report averaged F1 (20 experiments per train set size)
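A sketch of how such low-resource splits can be simulated is shown below; the use of the Hugging Face `datasets` loader and the function name are assumptions, not the authors' tooling.

```python
import random
from datasets import load_dataset  # Hugging Face datasets; an assumption, not the authors' tooling

def make_low_resource_split(sentences, tags, train_size, seed):
    """Sample `train_size` labeled sentences; treat the remainder as unlabeled."""
    rng = random.Random(seed)
    indices = list(range(len(sentences)))
    rng.shuffle(indices)
    labeled = [(sentences[i], tags[i]) for i in indices[:train_size]]
    unlabeled = [sentences[i] for i in indices[train_size:]]  # labels discarded
    return labeled, unlabeled

conll_train = load_dataset("conll2003")["train"]
sentences, tags = conll_train["tokens"], conll_train["ner_tags"]

# 20 random splits per training-set size, matching the sizes reported on the slide
splits = {size: [make_low_resource_split(sentences, tags, size, seed)
                 for seed in range(20)]
          for size in (150, 300, 750, 1500, 3000)}
```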

Training procedure

1. Fine-tune BERT with labeled data
2. Train the compact model using the modified loss

SLIDE 5

Compact model performance

[Charts: compact model F1 with BERT-base as teacher and with BERT-large as teacher]

Inference speed (on CPU)

Batch size               1          32           64            128
Speedup vs. BERT-base    3.3-4.3    28.6-33.7    40-45.2       49.9-55.6
Speedup vs. BERT-large   8.1-10.6   85.2-100.4   109.5-123.8   123.6-137.8
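A hedged sketch of how per-batch-size CPU speedups like these could be measured: only the timing helper is complete; `teacher`, `student`, and the batch-building helpers are hypothetical stand-ins.

```python
import time
import torch

@torch.no_grad()
def mean_latency(forward_fn, runs=50, warmup=5):
    """Average wall-clock seconds per forward pass on CPU."""
    for _ in range(warmup):
        forward_fn()
    start = time.perf_counter()
    for _ in range(runs):
        forward_fn()
    return (time.perf_counter() - start) / runs

# Hypothetical usage: `teacher`/`student` are the fine-tuned BERT and compact tagger,
# and teacher_inputs(b)/student_inputs(b) build CPU batches of size b.
for b in (1, 32, 64, 128):
    speedup = (mean_latency(lambda: teacher(**teacher_inputs(b))) /
               mean_latency(lambda: student(*student_inputs(b))))
    print(f"batch size {b}: {speedup:.1f}x speedup")
```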

SLIDE 6

Takeaways

  • Compact models perform as well as pre-trained LMs in low-resource scenarios, with superior inference speed and a compression rate of 36x-113x vs. BERT

  • Compact models are preferable to pre-trained LMs for deployment in such use cases
  • Many directions to explore:
  • Compact model topology – how small/simple can we make the model?
  • Other NLP tasks and pre-trained LMs
  • Other ways to utilize unlabeled data
  • Code available in Intel AI's NLP Architect open-source library

NervanaSystems/nlp-architect