
SLIDE 1

Offload Annotations: Bringing Heterogeneous Computing to Existing Libraries and Workloads

Gina Yuan, Shoumik Palkar, Deepak Narayanan, Matei Zaharia (Stanford University). USENIX ATC 2020 (July 15-17).

SLIDE 2

Background: Hardware Commoditization

[Image: NVIDIA GPU]

SLIDE 3

Background: CPUs vs. GPUs

CPUs: 4-way parallelism, 512 GB memory.
GPUs: 1000-way parallelism! But only 16 GB memory.

Data moves between them over PCI-e: costly data transfers!

SLIDE 4

Background: Data Science on the CPU

Popular Python data science libraries for the CPU.

SLIDE 5

Trend: Data Science on the GPU

NEW Python data science libraries for the GPU (cuDF, cuML, etc.). Lots of parallel data!

SLIDE 6

Trend: CPU Libraries vs. GPU Libraries

Sources: https://github.com/rapidsai/cudf, https://cupy.chainer.org/, https://pytorch.org/tutorials/beginner/blitz/tensor_tutorial.html, https://github.com/rapidsai/cuml

SLIDE 7

Trend: CPU Libraries vs. GPU Libraries

Are GPU libraries as straightforward to use as they seem?

SLIDE 8

Motivating Example (cuML)

SLIDE 9

Motivating Example (cuML)

Missing Functions

SLIDE 10

Motivating Example (cuML)

Missing Functions. Manual Data Transfers:

X_train = transfer(X_train, GPU)
Y_train = transfer(Y_train, GPU)
X_test = transfer(X_test, GPU)
result = transfer(result, CPU)

SLIDE 11

Motivating Example (cuML)

Missing Functions. Manual Data Transfers. Small GPU Memory:

X_train = transfer(X_train, GPU)
Y_train = transfer(Y_train, GPU)
for (i, j) in split(X_test):
    X_test[i, j] = transfer(X_test[i, j], GPU)
    result[i, j] = transfer(result[i, j], CPU)

SLIDE 12

Motivating Example (cuML)

Missing Functions. Manual Data Transfers. Small GPU Memory. Scheduling:

X_train = transfer(X_train, GPU)
Y_train = transfer(Y_train, GPU)
for (i, j) in split(X_test):
    X_test[i, j] = transfer(X_test[i, j], GPU)
    result[i, j] = transfer(result[i, j], CPU)

SLIDE 13

Solution: Offload Annotations

  • The annotator writes offload annotations (OAs) for CPU libraries.
  • An end user imports the annotated library instead of the CPU library.
  • Our runtime, Bach, automatically schedules data transfers and pages computation.

SLIDE 14

Goals

With less developer effort:

  • 1. Match handwritten GPU performance

SLIDE 15

Goals

With less developer effort:

  • 1. Match handwritten GPU performance
  • 2. Scale to data sizes larger than GPU memory

SLIDE 16

Goals

With less developer effort:

  • 1. Match handwritten GPU performance
  • 2. Scale to data sizes larger than GPU memory
  • 3. Beat CPU performance

SLIDE 17

Step 1: Annotator – Function Annotations

multiply = oa(func=torch.mul)(np.multiply)   # GPU library: PyTorch; CPU library: NumPy
sqrt = oa(func=torch.sqrt)(np.sqrt)
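The oa() call on this slide pairs a CPU function with its corresponding GPU function. A minimal sketch of that pairing, assuming a hypothetical `Annotated` wrapper and using lambdas as stand-ins for np.multiply / torch.mul (this is illustrative, not Bach's real API):

```python
class Annotated:
    """Wraps a CPU function together with its annotated GPU counterpart."""

    def __init__(self, cpu_func, gpu_func):
        self.cpu_func = cpu_func   # e.g. np.multiply
        self.gpu_func = gpu_func   # e.g. torch.mul, or None if missing

    def __call__(self, *args, device="cpu"):
        # Dispatch to the GPU implementation only when one was annotated.
        use_gpu = device == "gpu" and self.gpu_func is not None
        func = self.gpu_func if use_gpu else self.cpu_func
        return func(*args)

def oa(func=None):
    """Annotation factory: oa(func=gpu_impl)(cpu_impl)."""
    def wrap(cpu_func):
        return Annotated(cpu_func, func)
    return wrap

# Stand-ins for np.multiply and torch.mul:
multiply = oa(func=lambda a, b: a * b)(lambda a, b: a * b)
```

The key design point is that the end user still calls `multiply(...)` as before; the wrapper, not the user, decides which implementation runs.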

SLIDE 18

Step 1: Annotator – Function Annotations

multiply = oa(func=torch.mul)(np.multiply)   # corresponding CPU/GPU functions
sqrt = oa(func=torch.sqrt)(np.sqrt)

SLIDE 19

Step 1: Annotator – Function Annotations

# inputs / outputs
arg = (NdArrayType(),)
args = (NdArrayType(), NdArrayType())
ret = NdArrayType()

multiply = oa(args, ret, func=torch.mul)(np.multiply)
sqrt = oa(arg, ret, func=torch.sqrt)(np.sqrt)

SLIDE 20

Step 1: Annotator – Allocation Annotations

arg = (NdArrayType(),)
args = (NdArrayType(), NdArrayType())
ret = NdArrayType()

multiply = oa(args, ret, func=torch.mul)(np.multiply)
sqrt = oa(arg, ret, func=torch.sqrt)(np.sqrt)

ones = oa_alloc(ret, func=torch.ones)(np.ones)

Allocations only have a return type.

SLIDE 21

Step 1: Annotator – Allocation Annotations

ones = oa_alloc(ret, func=torch.ones)(np.ones)

What's in an "offload split type"?

SLIDE 22

Step 1: Annotator – Offload Split Type

Offloading API:

API                Description
device(value)      Which device the value is on.
to(value, device)  Transfers [value] to [device].

SLIDE 23

Step 1: Annotator – Offload Split Type

Offloading API:

API                Description
device(value)      Which device the value is on.
to(value, device)  Transfers [value] to [device].

Implementation for NdArrayType():

device(value):     ... if isinstance(value, torch.Tensor): ...
to(value, device): ... value.to(torch.device('cpu')).numpy() ...

SLIDE 24

Step 1: Annotator – Offload Split Type

Splitting API [Mozart SOSP '19] (optional):

API                       Description
size(value)               Number of elements in the value.
split(start, end, value)  Splits a value to enable paging.
merge(values)             Merges split values.

SLIDE 25

Step 1: Annotator – Offload Split Type

Splitting API [Mozart SOSP '19] (optional):

API                       Implementation for NdArrayType()
size(value)               return value.shape[-1]
split(start, end, value)  return value[start:end]
merge(values)             return np.concatenate(values)
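The three splitting methods can be sketched over plain Python lists (the slide's NdArrayType uses value.shape[-1], value[start:end], and np.concatenate instead; ListType here is an illustrative stand-in):

```python
class ListType:
    """Illustrative split type over plain lists."""

    def size(self, value):
        # Number of elements in the value.
        return len(value)

    def split(self, start, end, value):
        # Slice out one page of the value.
        return value[start:end]

    def merge(self, values):
        # Concatenate the per-page results back together.
        merged = []
        for v in values:
            merged.extend(v)
        return merged

ty = ListType()
data = list(range(10))
# Page the data into chunks of 4 elements, as the runtime would.
pages = [ty.split(i, i + 4, data) for i in range(0, ty.size(data), 4)]
```

Splitting and merging round-trip losslessly, which is what lets the runtime page values through limited GPU memory transparently.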

SLIDE 26

Step 1: Annotator – Offload Split Type

Offload split types: NdArrayType(), DataFrameType(), ModelType()

SLIDE 27

Step 2: End User

(Simple, somewhat dumb, Python program. End User ≠ Annotator.)

import numpy as np

# Allocate
a = np.ones(size, dtype='float64')
b = np.ones(size, dtype='float64')

# Compute
np.arcsin(a, out=a)
np.multiply(a, b, out=b)
np.sqrt(b, out=b)

SLIDE 28

Step 2: End User

Import the annotated library instead of the CPU library.

import bach.numpy as np

# Allocate
a = np.ones(size, dtype='float64')
b = np.ones(size, dtype='float64')

# Compute
np.arcsin(a, out=a)
np.multiply(a, b, out=b)
np.sqrt(b, out=b)

SLIDE 29

Step 2: End User

Explicitly materialize lazy values with the included evaluate() function.

import bach.numpy as np

# Allocate
a = np.ones(size, dtype='float64')
b = np.ones(size, dtype='float64')

# Compute
np.arcsin(a, out=a)
np.multiply(a, b, out=b)
np.sqrt(b, out=b)
np.evaluate()

SLIDE 30

Step 3: Runtime – Scheduling

Bach generates a lazy computation graph and does a topological sort.

[Diagram: a = np.ones(), b = np.ones(), np.arcsin(a), np.multiply(a,b), np.sqrt(b), numbered 1-5 in topological order; legend: Allocation / CPU / GPU]
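The scheduling step on this slide can be sketched with the standard library's topological sorter. The node names mirror the slide's example program; the graph construction is illustrative, not Bach's internal representation:

```python
from graphlib import TopologicalSorter

# Lazy computation graph: each recorded operation maps to the
# set of operations whose results it consumes.
graph = {
    "a = np.ones()": set(),
    "b = np.ones()": set(),
    "np.arcsin(a)": {"a = np.ones()"},
    "np.multiply(a,b)": {"np.arcsin(a)", "b = np.ones()"},
    "np.sqrt(b)": {"np.multiply(a,b)"},
}

# A valid execution order: every operation runs after its inputs exist.
order = list(TopologicalSorter(graph).static_order())
```

Because evaluation is lazy, the whole graph is known before anything executes, which is what makes the later device-assignment and transfer-insertion passes possible.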

SLIDE 31

Step 3: Runtime – Scheduling

Assign functions to the CPU/GPU based on whether a GPU library implementation is provided in the annotation.

[Diagram: same graph; each function node is placed on the GPU if its annotation provides a GPU implementation, otherwise on the CPU]
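The assignment rule on this slide is a one-liner. This sketch represents each graph node as a dict with a hypothetical `gpu_func` field (the GPU callable from the oa() annotation, or None when no GPU implementation exists):

```python
def assign_device(node):
    """Place a function on the GPU only if its annotation supplied
    a GPU library implementation."""
    return "gpu" if node.get("gpu_func") is not None else "cpu"

nodes = [
    {"name": "np.arcsin", "gpu_func": None},           # no GPU counterpart
    {"name": "np.multiply", "gpu_func": "torch.mul"},  # annotated
]
devices = [assign_device(n) for n in nodes]
```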

SLIDE 32

Step 3: Runtime – Scheduling

Assign allocations to the CPU/GPU so they are on the same device as the first function that uses the data.

[Diagram: allocation nodes a = np.ones() and b = np.ones() placed on the device of the first function that uses them]

SLIDE 33

Step 3: Runtime – Offloading API

Automatically transfer data using the offloading API.

[Diagram: transfer-to-GPU and transfer-to-CPU steps inserted between operations assigned to different devices]
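The transfer-insertion pass can be sketched as a walk over the scheduled operations: whenever the next operation's device differs from where the data currently lives, emit a to()-style transfer first. The (op, device) tuples are an illustrative stand-in for Bach's graph nodes:

```python
def insert_transfers(schedule):
    """schedule: list of (op_name, assigned_device) in execution order.
    Returns the execution steps with explicit transfers inserted."""
    steps = []
    current = None  # device the data currently lives on
    for op, device in schedule:
        if current is not None and device != current:
            # Data is on the wrong device: insert an automatic transfer.
            steps.append(("transfer", device))
        steps.append((op, device))
        current = device
    return steps

plan = insert_transfers([
    ("np.ones", "cpu"), ("np.arcsin", "cpu"),
    ("np.multiply", "gpu"), ("np.sqrt", "gpu"),
])
```

Note that consecutive same-device operations incur no transfer, which is why scheduling functions onto one device in runs matters for performance.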

SLIDE 34

Step 3: Runtime – Splitting API

Automatically page large datasets using the splitting API. (Data size = 2^28.)

[Diagram: inputs split into pieces, computed piece by piece on the GPU, results merged]
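The paging loop on this slide can be sketched in a few lines using the split/merge pattern from the splitting API. The page size and the doubling function are illustrative; a real page would be sized to available GPU memory:

```python
PAGE = 4  # elements per page; stand-in for available GPU memory

def paged_apply(func, value):
    """Split a too-large value into pages, apply func per page
    (as if on the GPU), and merge the partial results."""
    pieces = [value[i:i + PAGE] for i in range(0, len(value), PAGE)]
    results = [func(p) for p in pieces]            # each piece fits in memory
    return [x for part in results for x in part]   # merge()

doubled = paged_apply(lambda xs: [2 * x for x in xs], list(range(10)))
```

This only works for functions that distribute over splits, which is why the splitting API is optional and annotated per type.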

SLIDE 35

Step 3: Runtime – Scheduling Heuristics (optional)

Naive cost-benefit analysis between data transfer and computation cost.

[Diagram: GPU compute + data transfer vs. CPU compute at data size = 2^28]
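The cost-benefit analysis can be sketched with linear cost models: offload only when estimated GPU compute plus transfer time beats CPU compute. All per-element constants below are made up for illustration; they are chosen only so the decision flips between the deck's two example sizes:

```python
def choose_device(n_elements,
                  cpu_per_elem=1.0,       # CPU compute cost per element
                  gpu_per_elem=0.01,      # GPU compute cost per element
                  transfer_per_elem=0.5,  # PCI-e transfer cost per element
                  transfer_fixed=1000.0): # fixed transfer overhead
    """Naive cost estimators for the CPU and GPU execution plans."""
    cpu_cost = cpu_per_elem * n_elements
    gpu_cost = (gpu_per_elem * n_elements
                + transfer_per_elem * n_elements
                + transfer_fixed)
    return "gpu" if gpu_cost < cpu_cost else "cpu"
```

With these constants, small inputs stay on the CPU (the transfer overhead dominates) while large inputs go to the GPU, mirroring the 2^10 vs. 2^28 comparison on these slides.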

SLIDE 36

Step 3: Runtime – Scheduling Heuristics (optional)

Naive cost-benefit analysis between data transfer and computation cost.

[Diagram: GPU compute + data transfer vs. CPU compute at data size = 2^10]

SLIDE 37

Step 3: Runtime – Scheduling Heuristics (optional)

Naive implementations of cost estimators.

[Plot: estimated cost vs. data size for CPU compute and for GPU compute + data transfer; the CPU is cheaper at 2^10, the GPU at 2^28]

SLIDE 38

Evaluation

4 library integrations and 8 data science and ML workloads.

SLIDE 39

Integration Experience

~130 LOC per library, including the offloading/splitting APIs and function annotations.

SLIDE 40

Evaluation: Summary

Speedup: max 1200x, median 6.3x.

SLIDE 41

Evaluation: Summary

SLIDE 42

Evaluation: Summary

With less developer effort, Bach can:

  • 1. Match handwritten GPU performance

SLIDE 43

Evaluation: Summary

With less developer effort, Bach can:

  • 1. Match handwritten GPU performance
  • 2. Scale to data sizes larger than GPU memory

SLIDE 44

Evaluation: Summary

With less developer effort, Bach can:

  • 1. Match handwritten GPU performance
  • 2. Scale to data sizes larger than GPU memory
  • 3. Beat CPU performance

[Chart labels: 2.3x, 6.8x]

SLIDE 45

In-Depth Evaluation: Allocations

Crime Index saves time by eliminating the initial data transfer, while the allocation still fits in GPU memory.

[Chart labels: 4.6x, 1.1x]

SLIDE 46

In-Depth Evaluation: Heuristics

At smaller data sizes, TSVD schedules all computation on the CPU.

[Chart label: 11x]

SLIDE 47

In-Depth Evaluation: Splitting/Paging Datasets

[Motivating Example] The "fit" phase dominates the runtime until the "predict" phase can split/page data into the GPU.

[Chart labels: 6.2x, 2.0x]

SLIDE 48

Evaluation: Summary

[Chart: max speedup per workload: 1.7x, 6.9x, 0.81x, 5.7x, 1200x, 6.8x, 4.6x, 11x]

SLIDE 49

Evaluation: Summary

[Chart: max speedups per workload, repeated from the previous slide]

SLIDE 50

Conclusion

OAs enable heterogeneous GPU computing in existing libraries and workloads with little to no code modifications.

With less developer effort, Bach + OAs can:
  • Match handwritten GPU performance
  • Scale to data sizes larger than GPU memory
  • Beat CPU performance

github.com/stanford-futuredata/offload-annotations
gyuan@cs.stanford.edu