SLIDE 1

CS 744: TPU

Shivaram Venkataraman Fall 2020

Good morning!

SLIDE 2

Administrivia

Midterm 2, Dec 3rd
– Papers from SCOPE to TPU
– Similar format etc.
Presentations Dec 8, 10
– Sign up sheet
– Presentation template
– Trial run?

Next Tue: Fairness in ML, Course summary
Thu: Midterm 2 (Piazza)
Week after: project presentations
  • 4 min talks
  • 3 to 4 slides
  • Problem statement
  • Approach
  • Initial results
  • In-progress / final report: Dec 17th

SLIDE 3

MOTIVATION

Capacity demands on datacenters
New workloads
Metrics
– Power/operation
– Performance/operation
– Total cost of ownership
Goal: Improve cost-performance by 10x over GPUs

e.g., voice search: an ML model to convert speech to text
Latency (or tput?) → these workloads are latency sensitive
Total cost of ownership: cost to buy (build) + cost to operate

SLIDE 4

WORKLOAD

DNN: RankBrain
LSTM: subset of GNM Translate
CNNs: Inception, DeepMind AlphaGo

CNNs are only 5% of the workload; MLPs are 61%
Number of weights is not correlated with batch size and ops
CNNs have very high ops/byte
MLPs & LSTMs share the same ops/byte and batch size

SLIDE 5

WORKLOAD: ML INFERENCE

Quantization → lower precision, lower energy use
8-bit integer multiplies (unlike training): 6X less energy and 6X less area
Need for predictable latency, not throughput
e.g., 7ms at 99th percentile

Convert model weights from 32-bit float to 8-bit integer
Focus on inference: caches and branch prediction improve the average-case scenario only!
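
As a rough illustration of the 32-bit float to 8-bit integer conversion noted above, here is a minimal sketch of symmetric linear quantization; the scale choice and rounding scheme are assumptions, not the paper's exact recipe:

```python
import numpy as np

def quantize_int8(weights):
    """Symmetric linear quantization of fp32 weights to int8 (a sketch)."""
    scale = np.abs(weights).max() / 127.0       # map largest magnitude to 127
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate fp32 values, to inspect the quantization error."""
    return q.astype(np.float32) * scale

w = np.random.randn(256, 256).astype(np.float32)
q, s = quantize_int8(w)
print("max abs error:", np.abs(w - dequantize(q, s)).max())
```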

SLIDE 6

TPU DESIGN: CONTROL

PCIe interconnect for compatibility
↳ PCIe has limited bandwidth & latency
Instructions are issued / queued from the host
↳ instruction buffer
Simple: single thread!

SLIDE 7

COMPUTE

8-bit (values 0-255) and 16-bit operands
Matrix Multiply Unit
↳ fully connected layers, convolutions
24% of the chip area
MACs: Multiply + Accumulate → 8-bit integer or 16-bit range
Separate compute unit for Activation & Normalize/Pool
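
A minimal numpy sketch of what one pass through the MAC array computes: 8-bit integer inputs accumulated at higher precision. The 256 × 256 tile size and int32 accumulators are assumptions for this sketch (the annotation above only mentions 8-bit and 16-bit operands):

```python
import numpy as np

# 8-bit integer operands, as in the Matrix Multiply Unit's MACs.
a = np.random.randint(-128, 128, size=(256, 256), dtype=np.int8)
w = np.random.randint(-128, 128, size=(256, 256), dtype=np.int8)

# Each MAC multiplies 8-bit values (16-bit products) and accumulates at
# wider precision so partial sums don't overflow; int32 is an assumption.
acc = a.astype(np.int32) @ w.astype(np.int32)
print(acc.shape, acc.dtype)   # (256, 256) int32
```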

SLIDE 8

DATA

A X

=

B

8GB

in size

Models

can

=

=

fit here

.
  • OO

C)

Models (or weights )

are

stored

in

  • ff
  • chip
  • €7

DRAM arts)

  • "

a::m÷::'

DO

unified

buffer

  • I

t

d

(3)

pipeline fetching

  • f

weights

with

matrix

multiply

\

Intermediate

results

accumulated

and

then

stored

in

Unified Buffer
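
A toy sketch of point (3): double buffering so the next weight tile is "fetched" while the current tile is being multiplied. The tiling and function shape are invented purely to illustrate the overlap; real hardware overlaps a DRAM transfer with the matrix unit:

```python
import numpy as np

def matmul_pipelined(x_tiles, w_tiles):
    """Double-buffered matmul over tiles: prefetch the next weight tile,
    compute with the current one, then swap. A software sketch of the
    fetch/compute overlap, not a hardware model."""
    current = w_tiles[0]                      # prefetch the first tile
    results = []
    for i, x in enumerate(x_tiles):
        nxt = w_tiles[i + 1] if i + 1 < len(w_tiles) else None  # fetch ahead
        results.append(x @ current)           # multiply with current tile
        if nxt is not None:
            current = nxt                     # swap buffers
    return results

x_tiles = [np.random.randn(4, 8) for _ in range(3)]
w_tiles = [np.random.randn(8, 8) for _ in range(3)]
print([r.shape for r in matmul_pipelined(x_tiles, w_tiles)])
```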

SLIDE 9

INSTRUCTIONS

CISC format (why?)
1. Read_Host_Memory
2. Read_Weights
3. MatrixMultiply/Convolve
4. Activate
5. Write_Host_Memory

→ Specialized instruction set
CISC instructions encode operations that take many cycles to run
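
A hypothetical host-side sketch of how these five CISC instructions might be issued for one inference; `tpu` and every method on it are invented names for illustration, not a real driver API:

```python
def run_inference(tpu, inputs, weights):
    """Issue the five TPU CISC instructions for a single inference pass.

    Each call below maps to one instruction from the slide; the object and
    method names are hypothetical stand-ins for a host-side driver.
    """
    tpu.read_host_memory(inputs)     # 1. copy activations host -> Unified Buffer
    tpu.read_weights(weights)        # 2. stream weights DRAM -> matrix unit
    tpu.matrix_multiply()            # 3. many-cycle matmul / convolution
    tpu.activate()                   # 4. apply nonlinearity (ReLU etc.)
    return tpu.write_host_memory()   # 5. copy results back to the host
```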

SLIDE 10

SYSTOLIC EXECUTION

Problem: Reading a large SRAM uses much more power than arithmetic!

Typical CPU: caches (L1, L2, etc.) or registers hold the inputs to compute units
TPU: wave-like propagation of data through the array
Data reuse for every element
→ predictable execution, predictable performance
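
To make "wave-like propagation" concrete, here is a toy cycle-by-cycle simulation of a weight-stationary systolic array computing y = x @ W. The dataflow follows the standard textbook picture (activations skewed in from the left, partial sums flowing down); it is not a cycle-accurate model of the real MXU:

```python
import numpy as np

def systolic_matvec(W, x):
    """Toy weight-stationary systolic array computing y = x @ W.

    PE(i, j) pins W[i, j]; activations enter row i at cycle i and move
    right, partial sums move down, and column j's result drains out of
    the bottom at cycle n - 1 + j.
    """
    n = W.shape[0]
    a = np.zeros((n, n))       # activation register in each PE (flows right)
    p = np.zeros((n, n))       # partial-sum register in each PE (flows down)
    y = np.zeros(n)
    for t in range(2 * n):
        new_a = np.zeros((n, n))
        new_p = np.zeros((n, n))
        for i in range(n):
            for j in range(n):
                # Skewed injection: row i receives x[i] at cycle i.
                act = (x[i] if t == i else 0.0) if j == 0 else a[i, j - 1]
                psum = 0.0 if i == 0 else p[i - 1, j]
                new_a[i, j] = act                     # pass activation right
                new_p[i, j] = psum + act * W[i, j]    # multiply-accumulate
        for j in range(n):
            if t == n - 1 + j:                        # column j finishes now
                y[j] = new_p[n - 1, j]
        a, p = new_a, new_p
    return y

W = np.arange(9, dtype=float).reshape(3, 3)
x = np.array([1.0, 2.0, 3.0])
print(systolic_matvec(W, x))   # [24. 30. 36.], matches x @ W
```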

SLIDE 11

ROOFLINE MODEL

Axes: Operational Intensity (MAC Ops/weight byte) vs. TeraOps/sec

x-axis: operational intensity, i.e., the amount of compute per byte of data read
y-axis: operations / second
The blue roofline comes from the hardware spec: a slope part (memory bandwidth) and a flat part (peak compute)
Slope part: memory bound; flat part: compute bound
CNNs are compute intensive
LSTMs & MLPs are close to the peak perf of the hardware
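
The roofline itself is just the minimum of two bounds. A minimal sketch follows; the peak and bandwidth constants are roughly the TPU v1 figures from the paper (92 TeraOps/s, 34 GB/s), but treat them as illustrative placeholders:

```python
def roofline(intensity, peak_ops_per_s, mem_bw_bytes_per_s):
    """Attainable ops/s at a given operational intensity (ops per byte):
    the memory roof (bw * intensity) until it hits the compute roof."""
    return min(peak_ops_per_s, mem_bw_bytes_per_s * intensity)

PEAK = 92e12   # ops/s, flat part of the roofline (approx. TPU v1 peak)
BW = 34e9      # bytes/s, slope part (approx. TPU v1 DRAM bandwidth)

for oi in [1, 10, 100, 1000, 10000]:
    print(f"OI={oi:>5}: {roofline(oi, PEAK, BW) / 1e12:6.2f} TeraOps/s")
```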

SLIDE 12

HASWELL ROOFLINE

Axes: TeraOps/sec vs. Operational Intensity (MAC Ops/weight byte)

The CPU roofline's slope ends at ~10 ops/weight byte
The measured points sit far away from the roofline
Much lower TeraOps/second

SLIDE 13

COMPARISON WITH CPU, GPU

Off-chip memory accesses (vs. caches L1, L2, L3) draw much higher power
GPUs bring down power (~2x lower); the configured CPU and RAM are still used when idle
TPUs: not so much

SLIDE 14

SELECTED LESSONS

  • Latency more important than throughput for inference
  • LSTMs and MLPs are more common than CNNs
  • Performance counters are helpful
  • Remember architecture history

↳ performance counters also help improve compilers for DNN models

SLIDE 15

SUMMARY

New workloads → new hardware requirements
Domain specific design (understand workloads!)
– No features to improve the average case
– No caches, branch prediction, out-of-order execution etc.
– Simple design with MACs, Unified Buffer gives efficiency
Drawbacks
– No sparse support, no training support (TPU v2, v3)
– Vendor specific?

SLIDE 16

DISCUSSION

https://forms.gle/tss99VSCMeMjZx7P6

SLIDE 17

① Larger batches have higher tput but also higher tail latency
[plot: inferences/sec vs. batch size]
② TPU tputs are much higher
③ The TPU can run at a higher batch size while meeting the 7ms latency target
The GPU has higher IPS at the same batch size compared to the CPU (relates to avg. latency)
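
A toy model of point ①: one fixed cost per batch plus a per-item cost makes throughput rise with batch size while latency grows. The constants are invented for illustration, not measured TPU numbers:

```python
def batch_stats(batch_size, fixed_overhead_ms=2.0, per_item_ms=0.05):
    """Toy latency/throughput model: a fixed cost per batch plus a
    per-item cost. Constants are made up, not measured."""
    latency_ms = fixed_overhead_ms + per_item_ms * batch_size
    throughput = batch_size / (latency_ms / 1000.0)   # inferences/sec
    return latency_ms, throughput

for b in [1, 8, 64, 256]:
    lat, tput = batch_stats(b)
    print(f"batch={b:4d}  latency={lat:6.2f} ms  tput={tput:9.0f} inf/s")
# With these constants, batches beyond ~100 blow the 7ms latency target
# even though throughput keeps rising.
```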

SLIDE 18

How would TPUs impact serving frameworks like Clipper? Discuss what specific effects it could have on distributed serving systems architecture

TPUs have 8 GB, enough to share across many models
↳ but this might break containers in Clipper
Stragglers are less frequent
Batching (auto-batching) can be very helpful

SLIDE 19

NEXT STEPS

No class Thursday! Happy Thanksgiving!
Next week schedule:
Tue: Fairness in ML, Summary
Thu: Midterm 2

SLIDE 20

ENERGY PROPORTIONALITY