CS 744: TPU (Shivaram Venkataraman, Fall 2019)


SLIDE 1

CS 744: TPU

Shivaram Venkataraman Fall 2019

SLIDE 2

Administrivia

Midterm 2, Dec 10th
– Papers from Dataflow Model to TPU
– Similar format, cheat sheet, etc.

Poster session Dec 13th
– Template
– Printing instructions
– Reimbursement

SLIDE 3

Serverless Computing Compute Accelerators Infiniband Networks Non-Volatile Memory

SLIDE 4

MOTIVATION

Capacity demands on datacenters
New workloads
Metrics
– Total cost of ownership (depends on price?)
– Power/operation
– Performance/operation

Goal: Improve cost-performance by 10x over GPUs

SLIDE 5

WORKLOAD

DNN: RankBrain
LSTM: subset of GNM Translate
CNNs: Inception, DeepMind AlphaGo

SLIDE 6

WORKLOAD: ML INFERENCE

Quantization → lower precision, lower energy use
8-bit integer multiplies (unlike training): 6X less energy and 6X less area
Need for predictable latency, not throughput, e.g., 7ms at the 99th percentile
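The quantization idea above can be made concrete with a minimal sketch: symmetric per-tensor 8-bit quantization, where the largest weight magnitude maps to 127. This is an illustration of the general technique, not the TPU's exact scheme.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric quantization: one scale for the whole tensor."""
    scale = np.abs(w).max() / 127.0          # largest magnitude maps to 127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from int8 codes."""
    return q.astype(np.float32) * scale

w = np.array([-1.5, -0.2, 0.0, 0.7, 1.5], dtype=np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
# Rounding error is bounded by half a quantization step (scale / 2).
assert np.max(np.abs(w - w_hat)) <= scale / 2 + 1e-6
```

Inference can tolerate this rounding error, and int8 multipliers are far cheaper in energy and area than float units, which is the point the slide makes.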

SLIDE 7

TPU DESIGN: CONTROL

SLIDE 8

COMPUTE

SLIDE 9

DATA

SLIDE 10

INSTRUCTIONS

CISC format (why?)
1. Read_Host_Memory
2. Read_Weights
3. MatrixMultiply/Convolve
4. Activate
5. Write_Host_Memory
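The five instructions above can be sketched as plain functions to show the data flow they imply; names follow the slide, but the buffers and DMA are simulated with dictionaries and NumPy arrays, purely as an assumed illustration.

```python
import numpy as np

# Hypothetical model of the TPU's memory spaces.
host_memory = {}       # host DRAM
unified_buffer = {}    # on-chip activation storage
weight_fifo = {}       # staged weights for the Matrix Unit

def read_host_memory(key):
    """DMA input activations from the host into the Unified Buffer."""
    unified_buffer[key] = host_memory[key]

def read_weights(key):
    """Stage weights for the Matrix Unit."""
    weight_fifo[key] = host_memory[key]

def matrix_multiply(x_key, w_key, out_key):
    """MatrixMultiply on the (here simulated) systolic array."""
    unified_buffer[out_key] = unified_buffer[x_key] @ weight_fifo[w_key]

def activate(key):
    """Apply a nonlinearity (ReLU in this sketch)."""
    unified_buffer[key] = np.maximum(unified_buffer[key], 0)

def write_host_memory(key):
    """DMA results back to the host."""
    host_memory[key] = unified_buffer[key]

# One inference step: the five instructions, in order.
host_memory["x"] = np.array([[1.0, -2.0]])
host_memory["w"] = np.array([[1.0], [1.0]])
read_host_memory("x")
read_weights("w")
matrix_multiply("x", "w", "y")
activate("y")
write_host_memory("y")
```

Each CISC instruction does a large amount of work (a whole matrix multiply, a whole DMA transfer), which keeps the instruction-issue rate low and the matrix unit busy.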

SLIDE 11

SYSTOLIC EXECUTION

Problem: Reading a large SRAM uses much more power than arithmetic!
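A systolic array addresses this by reusing each value many times once it is read. A minimal sketch of the weight-stationary idea, counting SRAM reads (the 2D pipelining of a real systolic array is abstracted away; only the reuse accounting is shown):

```python
import numpy as np

def systolic_matmul(x, w):
    """Weight-stationary matmul: each weight is read from SRAM once
    and held in its processing element, then reused for every input row."""
    n, k = x.shape
    k2, m = w.shape
    assert k == k2
    out = np.zeros((n, m))
    sram_reads = 0
    pe_weights = w.copy()        # load each weight into a PE exactly once
    sram_reads += w.size
    for i in range(n):
        row = x[i]               # one SRAM read per input element
        sram_reads += row.size
        for j in range(m):
            out[i, j] = np.dot(row, pe_weights[:, j])
    return out, sram_reads

x = np.arange(6, dtype=float).reshape(3, 2)
w = np.ones((2, 4))
out, reads = systolic_matmul(x, w)
assert np.allclose(out, x @ w)
# A naive loop would re-read each weight per output: 3*4*2 = 24 weight
# reads; here weights are read once (8) plus the inputs (6).
assert reads == w.size + x.size
```

Since arithmetic is cheap and SRAM reads are expensive, amortizing each read over many multiply-accumulates is exactly what makes the systolic design energy-efficient.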

SLIDE 12

ROOFLINE MODEL

[Roofline plot: y-axis TeraOps/sec vs. x-axis Operational Intensity (MAC Ops/weight byte)]
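The roofline bound those axes describe is just the minimum of two ceilings. A minimal sketch, assuming the TPU paper's headline numbers (92 TeraOps/s peak, 34 GB/s weight-memory bandwidth) as illustrative constants:

```python
# Roofline model: attainable performance is capped either by peak
# compute or by bandwidth * operational intensity, whichever is lower.
PEAK_TERAOPS = 92.0      # compute ceiling (8-bit ops)
BANDWIDTH_GBS = 34.0     # weight-memory bandwidth

def attainable_teraops(intensity):
    """intensity: MAC Ops per weight byte.
    bandwidth * intensity is in GigaOps/s, so divide by 1000."""
    return min(PEAK_TERAOPS, intensity * BANDWIDTH_GBS / 1000.0)

# Below the ridge point, the workload is memory-bound...
assert attainable_teraops(100) == 3.4
# ...above it, compute-bound.
assert attainable_teraops(5000) == 92.0
```

Workloads whose operational intensity falls under the slanted part of the roof (like the memory-bound MLPs and LSTMs) cannot reach the TPU's peak no matter how fast the matrix unit is.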

SLIDE 13

HASWELL ROOFLINE

[Haswell roofline plot: y-axis TeraOps/sec vs. x-axis Operational Intensity (MAC Ops/weight byte)]

SLIDE 14

COMPARISON WITH CPU, GPU

SLIDE 15

ENERGY PROPORTIONALITY

SLIDE 16

SELECTED LESSONS

  • Latency more important than throughput for inference
  • LSTMs and MLPs are more common than CNNs
  • Performance counters are helpful
  • Remember architecture history
SLIDE 17

SUMMARY

New workloads → new hardware requirements
Domain-specific design (understand workloads!)
– No features to improve the average case: no caches, branch prediction, out-of-order execution, etc.
– Simple design with MACs and a Unified Buffer gives efficiency

Drawbacks
– No sparse support, no training support (added in TPU v2, v3)
– Vendor specific?

SLIDE 18

DISCUSSION

https://forms.gle/zhH9eCbdjMnaRLRB8

SLIDE 19
SLIDE 20

How would TPUs impact serving frameworks like Clipper? Discuss what specific effects they could have on the architecture of distributed serving systems.