TVM for Ads Ranking @ Facebook — Hao Lu, Ansha Yu, Yinghai Lu, Andrew Tulloch (PowerPoint presentation)


SLIDE 1

TVM for Ads Ranking @ Facebook

Hao Lu, Ansha Yu, Yinghai Lu, Andrew Tulloch

SLIDE 2

Ads Ranking at Facebook

[Diagram: incoming ads (ad 1, ad 2, ad 3, …, ad n) are split into batches (batch 1, batch 2) and scored by multiple models (model 1, model 2, model 3, …, model k) to produce predictions.]

SLIDE 3

Ads Ranking at Facebook: Production Requirements

  • Parallel execution across model evaluations
  • Each model runs on a single thread
  • For each model, multiple batches can be executing at the same time; in that case, weights are global and shared between threads, but activations are thread-local
  • Model weights are refreshed every few hours; therefore, activations need to be released at the end of each inference to avoid running out of memory
  • Batch size is dynamic
  • C++ only
  • Multiple CPU architectures: avx512, avx2

[Diagram: ads (ad 1, ad 2, ad 3) batched and scored by models (model 1, model 2), with two batches of model 2 in flight at once, producing predictions.]

SLIDE 4

Model Architecture

DLRM-style model (https://ai.facebook.com/blog/dlrm-an-advanced-open-source-deep-learning-recommendation-model/):

  • EMB: embedding lookup for sparse features
  • MLP: multilayer perceptron (a sequence of FC layers + activation functions)
  • Compiled with TVM
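As an illustrative sketch of this DLRM-style dataflow (embedding lookup with sum pooling for sparse ids, concatenated with dense features, fed through an MLP of FC + activation layers), assuming a toy embedding table and hand-written layers rather than the real model:

```python
# Toy embedding table; real tables are large and trained.
EMB = {0: [0.1, 0.2], 1: [0.3, 0.4], 2: [0.5, 0.6]}

def embed(ids):
    # Sum-pool the embedding rows for one sparse feature.
    pooled = [0.0, 0.0]
    for i in ids:
        pooled = [p + e for p, e in zip(pooled, EMB[i])]
    return pooled

def mlp(x, layers):
    # MLP as on the slide: a sequence of FC + activation (ReLU here).
    for w in layers:
        x = [max(0.0, sum(wi * xi for wi, xi in zip(row, x))) for row in w]
    return x

def predict(dense, sparse_ids, layers):
    # Concatenate dense features with pooled embeddings, then run the MLP.
    features = dense + embed(sparse_ids)
    return mlp(features, layers)
```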

SLIDE 5

Ads Ranking Models

Implementation

  • JIT (not AOT), because models are updated periodically
  • Graph runtime does not manage memory:
    • weights are shared between threads for the same model
    • activations are shared by instances of all graph runtimes
    • activations are released after each iteration to avoid OOM

Performance

  • Use MKL for FC, for simplicity
  • 5-10% speedup from operator fusion
  • Runtime overhead eats into the speedup

[Diagram: dense features + embeddings from Caffe2 feed one graph runtime per batch size (batch_size 1, batch_size 2, …, batch_size n), each producing predictions.]

SLIDE 6

What's Next

Relay VM

  • Handles dynamic shapes
  • JIT compilation
  • Dynamic memory allocation

Performance

  • Autotuning at scale
  • FBGEMM for fp16 and int8
  • Embedding lookup
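As background for the FBGEMM int8 item: int8 kernels consume quantized weights and activations. A minimal sketch of symmetric int8 quantization (scale taken from the maximum absolute value, values rounded and clamped to [-127, 127]); this is illustrative only, not FBGEMM's API:

```python
def quantize_int8(values):
    # Symmetric quantization: scale maps the largest magnitude to 127.
    scale = max(abs(v) for v in values) / 127.0 or 1.0  # guard all-zero input
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    # Approximate recovery of the original fp32 values.
    return [x * scale for x in q]
```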