CS 744: TVM - Shivaram Venkataraman, Fall 2020

SLIDE 1

CS 744: TVM

Shivaram Venkataraman, Fall 2020

TVM = Tensor Virtual Machine (board note: uses LLVM for code generation)
SLIDE 2

ADMINISTRIVIA

  • Course project titles
  • Project proposal aka Introduction (10/16): Introduction, Related Work, Timeline (with eval plan); 2 page writeup
  • Midterm: Oct 22
  • Assignment
SLIDE 3

MACHINE LEARNING: STACK

Notes: distributed training; inference runs just the forward pass; there is interplay between inference and training; frameworks make distributed training easy; hardware sits at the bottom of the stack.
SLIDE 4

MOTIVATION: PERFORMANCE PORTABILITY

  • Compute primitives (matrix multiply, convolution): you want high performance across hardware backends
  • Dependence on vendor-specific libraries
  • ML models evolve fast: new operators and new combinations of operators are not available in existing vendor libraries

Notes: the input is a model file, e.g. from PyTorch.
SLIDE 5

TVM

Python code describes the ML model → TVM → binary file that runs on hardware
SLIDE 6

OPTIMIZATION: COMPUTATION GRAPHS

  • Operator fusion: 1-1 ("map") operators, sum reductions, and scaling can be fused into a single kernel
  • Data layout: row major, column major, blocked; how each tensor is represented in memory

Notes: the example is a 2-layer NN represented as a computation graph.
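As a sketch of why operator fusion helps (a toy pure-Python illustration, not TVM code; the function names here are made up): running two 1-1 ("map") operators separately traverses the data twice and materializes an intermediate buffer, while the fused kernel makes a single pass.

```python
# Toy illustration of operator fusion (not TVM code).
def add(a, b):
    return [x + y for x, y in zip(a, b)]

def relu(a):
    return [max(x, 0.0) for x in a]

def add_relu_unfused(a, b):
    # two passes over the data, one intermediate list materialized
    return relu(add(a, b))

def add_relu_fused(a, b):
    # one pass, no intermediate buffer: this is what fusing the two
    # 1-1 operators into a single kernel buys you
    return [max(x + y, 0.0) for x, y in zip(a, b)]
```

On real hardware the win comes from avoiding the round trip to memory for the intermediate tensor, not from Python-level overhead.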
SLIDE 7

TENSOR EXPRESSION LANGUAGE

  • Common arithmetic and math operations
  • Know the shape of the output and the data accessed
  • Each operator is expressed in the tensor expression language
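The idea can be mimicked in a few lines of plain Python (a toy stand-in, not TVM's actual API): the declaration names the output shape and an index-to-value rule, so a compiler can see exactly which input elements each output element reads.

```python
# Toy stand-in for a tensor-expression-style declaration (not TVM's API).
def compute(shape, fcompute):
    """Materialize a 1-D tensor from an index expression."""
    (n,) = shape
    return [fcompute(i) for i in range(n)]

A = [1.0, 2.0, 3.0]
B = [10.0, 20.0, 30.0]
# C[i] = A[i] + B[i]: output shape and accessed data are both explicit
C = compute((3,), lambda i: A[i] + B[i])
```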
SLIDE 8

CODE GENERATION

  • Nested parallelism: Halide-style expressions; OpenMP-like loops on CPUs; on GPUs, cooperating threads can share fetched data (e.g. shared-memory tiles As, Bs); nested "for i in ...: for j in ...:" loop iterations
  • Tensorization: what is the hardware instruction set (load, store, add, ...)? Allows you to register an operator intrinsic. Extensible!
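The nested-loop structure the scheduler emits can be sketched as a tiled matrix multiply (pure Python; the tile size of 2 is an arbitrary assumption): the outer tile loops are the natural places to parallelize, and the innermost body is what tensorization would replace with a hardware intrinsic.

```python
def matmul_tiled(A, B, n, tile=2):
    """C = A @ B for n x n lists-of-lists, with blocked (tiled) loops."""
    C = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, tile):              # tile loops: parallelize here
        for j0 in range(0, n, tile):
            for k0 in range(0, n, tile):
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, n)):
                        for k in range(k0, min(k0 + tile, n)):
                            # innermost body: candidate for a tensor intrinsic
                            C[i][j] += A[i][k] * B[k][j]
    return C
```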
SLIDE 9

LATENCY HIDING

  • What is the goal? Same as PyTorch etc.?
  • Overlap computation and communication
  • Schedule that utilizes both memory bandwidth and compute units
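A minimal software double-buffering sketch (pure Python with a thread pool; `load` and `compute` are made-up stand-ins for a memory fetch and for compute-unit work): while chunk i is being computed, the fetch of chunk i+1 is already in flight, which is the overlap the slide describes.

```python
import concurrent.futures
import time

def load(chunk):            # stand-in for a memory/DMA fetch
    time.sleep(0.01)
    return chunk

def compute(data):          # stand-in for compute-unit work
    return sum(data)

def pipelined(chunks):
    results = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(load, chunks[0])
        for i in range(len(chunks)):
            current = pending.result()
            if i + 1 < len(chunks):
                # prefetch the next chunk while we compute on this one
                pending = pool.submit(load, chunks[i + 1])
            results.append(compute(current))
    return results
```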
SLIDE 10

AUTOMATING OPTIMIZATION

  • Goal: create a specialized operator for the input shape and layout
  • Challenge: choose appropriate schedule optimizations (tiling size, loop unrolling, ...)
  • Automate the optimizer!

Notes: there are lots of different choices and lots of parameters/knobs to choose; use ML to decide which configurations to try.
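To see why the choice must be automated, count the configurations for even a tiny knob set (the specific knob values below are made-up examples):

```python
import itertools

# Hypothetical schedule knobs for a single operator.
tile_i    = [1, 2, 4, 8, 16, 32]
tile_j    = [1, 2, 4, 8, 16, 32]
unroll    = [0, 1, 2, 4]
vectorize = [False, True]

space = list(itertools.product(tile_i, tile_j, unroll, vectorize))
print(len(space))  # 6 * 6 * 4 * 2 = 288 configs, and that's one operator
```

Measuring every config on real hardware is infeasible, which motivates the learned cost model on the next slide.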
SLIDE 11

ML-BASED COST MODEL

Machine learning model design choices:
  • Speed: faster than the time it takes to actually evaluate a config
  • Quality: use a rank objective to predict the relative order of config runtimes

Gradient tree boosting model over features extracted from the generated code:
  • memory access count
  • reuse ratio of each memory buffer at each loop level
  • one-hot encoding of loop annotations

Note: measuring a config on real hardware can take seconds, so the model predicts from code features instead.
SLIDE 12

ML-BASED COST MODEL

Iteration:
  • Select a batch of candidates
  • Collect data (run candidates, measure runtimes)
  • Use the measurements as training data to update the model

How to select candidates? Parallel simulated annealing:
  • Start from a random config
  • Walk to a nearby config → successful if cost decreases, else reject

Notes: the model predicts performance for each candidate config, e.g. <C2, 20ms>, <C3, 8ms>; if the model says Cj is better than Ci, go try Cj on the cluster, otherwise generate another config; measured results become training data.
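The candidate walk can be sketched as follows (a toy version with a made-up convex cost function standing in for the learned cost model; TVM runs many such walks in parallel, and this sketch uses only the slide's accept-if-cost-decreases rule rather than full temperature-based annealing):

```python
import random

def neighbor(cfg):
    # perturb one knob by a small step (toy move)
    i = random.randrange(len(cfg))
    new = list(cfg)
    new[i] = max(1, new[i] + random.choice([-1, 1]))
    return tuple(new)

def walk(cost, start, steps=500):
    """Move to a nearby config if cost decreases, else reject."""
    cur, cur_cost = start, cost(start)
    for _ in range(steps):
        cand = neighbor(cur)
        cand_cost = cost(cand)
        if cand_cost < cur_cost:
            cur, cur_cost = cand, cand_cost
    return cur, cur_cost

# Stand-in for the learned cost model's predicted runtime:
predicted = lambda cfg: (cfg[0] - 4) ** 2 + (cfg[1] - 3) ** 2
random.seed(0)
best, best_cost = walk(predicted, (1, 1))
```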
SLIDE 13

DISTRIBUTED DEVICE POOL

  • Pool of devices to speed up profiling
  • RPC interface to run a trial on a device
  • Share device pools across multiple graphs
SLIDE 14

SUMMARY

  • TVM: compiler for ML inference models
  • Supports high performance for a range of models and hardware devices
  • Key ideas:
    • Graph-level optimizations (e.g. operator fusion)
    • Tensor expression language: code-gen, latency hiding, etc.
    • ML-based cost model for automation
SLIDE 15

DISCUSSION

https://forms.gle/WiVgJ3abGXXgfBN99
SLIDE 16

Consider that you are building an optimizer for Spark programs instead of ML inference. What would be some configuration knobs that you could similarly tune? What might be different from the TVM optimizer?

Notes:
  • Similar logic: latency hiding (overlap computation and communication); operator fusion → fusing map operations; data access patterns
  • Different/challenging: Spark operators are user-defined functions; partitioning (can you automate the number of partitions / co-partitioning?); cache configuration space; persistence is inserted manually
SLIDE 17

What is your takeaway from the following figure?

[Figure: handwritten annotations not recoverable]
SLIDE 18

NEXT STEPS

  • Next class: Ray
  • Course project: Oct 16 (introductions)
  • Midterm: Oct 22

Notes: latency hiding in Spark? rdd1's map tasks feed reduce tasks; rdd2's map tasks read Hadoop files directly (no communication); tasks wait on the shuffle.