Supercomputing as a Service: Massively-Parallel Jobs on FaaS Platforms



slide-1
SLIDE 1

Supercomputing as a Service: Massively-Parallel Jobs on FaaS Platforms

Sadjad Fouladi Stanford University

slide-2
SLIDE 2

https://xkcd.com/303/

Compiling clang takes >2 hours.

slide-3
SLIDE 3

Editor: "My video's encoding!"

Compressing a 15-minute 4K video takes ~7.5 hours.

slide-4
SLIDE 4

Animator: "My animation's rendering!"

Rendering each frame of Monsters University took 29 hours.

slide-5
SLIDE 5

The Problem

Many of these pipelines take hours and hours to finish.

slide-6
SLIDE 6

The Question

Can we achieve interactive speeds in these applications?

slide-7
SLIDE 7

The Answer

Massive Parallelism*

* Well, probably.

slide-8
SLIDE 8

How to get thousands of threads?

  • The largest companies are able to operate massive datacenters that can support such levels of parallelism.
  • But, end users and developers are unable to scale their resource footprint to thousands of parallel threads on demand in an efficient and scalable manner.

slide-9
SLIDE 9

Classic Approach: VMs

  • Infrastructure-as-a-Service
  • Thousands of threads
  • Arbitrary Linux executables

👏 Minute-scale startup time (OS has to boot up, ...)
👏 High minimum cost

slide-10
SLIDE 10

Cloud function services have (as yet) unrealized power

  • AWS Lambda, Google Cloud Functions, IBM Cloud Functions, Azure Functions, etc.
  • Intended for event handlers and Web microservices, but...
  • Features:

✔ Thousands of threads
✔ Arbitrary Linux executables
✔ Sub-second startup
✔ Sub-second billing

3,600 threads for one second → 10¢
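
The 10¢ figure implies a price per thread-second, which makes it easy to estimate other job sizes. A minimal sketch, assuming cost scales linearly with thread-seconds; the rate is derived only from the number on this slide, not from any provider's price list, and the function name is illustrative:

```python
# Rate implied by the slide: 3,600 threads for one second cost about 10 cents.
# Assumption: this is derived from the slide only, not from a real price list.
CENTS_PER_THREAD_SECOND = 10 / 3600


def estimated_cost_cents(threads: int, seconds: float) -> float:
    """Estimate a job's cost, assuming it is linear in thread-seconds."""
    return threads * seconds * CENTS_PER_THREAD_SECOND


if __name__ == "__main__":
    # The slide's own example: 3,600 threads for one second.
    print(f"{estimated_cost_cents(3600, 1):.1f} cents")
    # Same number of thread-seconds, shaped differently: 1,000 threads for 3.6 s.
    print(f"{estimated_cost_cents(1000, 3.6):.1f} cents")
```

Under this linear assumption, a wide, short job costs the same as a narrow, long one with the same thread-seconds, which is why sub-second billing matters for bursty parallel work.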

slide-11
SLIDE 11

Supercomputing as a Service

Encoding

Compressing this video will take a long time. How do you want to execute this job?

[Cancel] [Remotely (~5 secs, 50¢)] [Locally (~5 hours)]

slide-12
SLIDE 12

Two projects that we did based on this promise:

  • ExCamera: Low-Latency Video Processing
  • gg: make -j1000 (and other jobs) on FaaS infrastructure


slide-13
SLIDE 13

ExCamera: Low-Latency Video Processing Using Thousands of Tiny Threads

Sadjad Fouladi, Riad S. Wahby, Brennan Shacklett, Karthikeyan Balasubramaniam, William Zeng, Rahul Bhalerao, Anirudh Sivaraman, George Porter, and Keith Winstein. "Encoding, Fast and Slow: Low-Latency Video Processing Using Thousands of Tiny Threads." In 14th USENIX Symposium on Networked Systems Design and Implementation (NSDIʼ17).

slide-14
SLIDE 14

What we currently have

  • People can make changes to a word-processing document
  • The changes are instantly visible to the others

slide-15
SLIDE 15

What we would like to have

  • People can interactively edit and transform a video
  • The changes are instantly visible to the others

for Video?

slide-16
SLIDE 16

"Apply this awesome filter to my video."

slide-17
SLIDE 17

"Look everywhere for this face in this movie."

slide-18
SLIDE 18

"Remake Star Wars Episode I without Jar Jar."

slide-19
SLIDE 19

Challenges in low-latency video processing

  • Low-latency video processing would need thousands of threads, running in parallel, with instant startup.
  • However, the finer-grained the parallelism, the worse the compression efficiency.

slide-20
SLIDE 20

First challenge: thousands of threads

  • We built mu, a library for designing and deploying general-purpose parallel computations on a commercial “cloud function” service.
  • The system starts up thousands of threads in seconds and manages inter-thread communication.
  • mu is open-source software: https://github.com/excamera/mu

[Figure: lambdas (λ), rendezvous server, local machine]

slide-21
SLIDE 21

Second challenge: parallelism hurts compression efficiency

  • Existing video codecs only expose a simple interface that's not suitable for massive parallelism.
  • We built a video codec in explicit state-passing style, intended for massive fine-grained parallelism.
  • Implemented in 11,500 lines of C++11 for Google's VP8 format.

decode(state, frame) → (state′, image)
encode(state, image) → interframe
rebase(state, image, interframe) → interframe′
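
To make the three signatures concrete, here is a toy sketch of explicit state passing. It is not the ExCamera codec: the "images" are short lists of integers, the state is simply the previously decoded image, and an interframe is a per-pixel difference. All types and helper names (`encode_chunk`, `decode_all`) are assumptions made for illustration. The point is only that, once state is an explicit argument, chunks can be encoded independently and then stitched together by rebasing the first frame of a chunk onto the final state of the previous one.

```python
from typing import List, Tuple

Image = List[int]       # toy "image": a flat list of pixel values
State = List[int]       # toy decoder state: the previously decoded image
Interframe = List[int]  # toy interframe: per-pixel difference from the state


def decode(state: State, frame: Interframe) -> Tuple[State, Image]:
    """decode(state, frame) -> (state', image)."""
    image = [s + d for s, d in zip(state, frame)]
    return list(image), image


def encode(state: State, image: Image) -> Interframe:
    """encode(state, image) -> interframe."""
    return [p - s for p, s in zip(image, state)]


def rebase(state: State, image: Image, interframe: Interframe) -> Interframe:
    """rebase(state, image, interframe) -> interframe'.

    Toy version: recompute the difference against the new state. This sketch
    ignores the old interframe; a real codec has more to reuse from it.
    """
    return encode(state, image)


def encode_chunk(start: State, images: List[Image]) -> Tuple[List[Interframe], State]:
    """Encode a run of images sequentially, returning frames and final state."""
    frames, state = [], start
    for image in images:
        frame = encode(state, image)
        state, _ = decode(state, frame)
        frames.append(frame)
    return frames, state


def decode_all(start: State, frames: List[Interframe]) -> List[Image]:
    """Decode a whole stream, threading the state through every frame."""
    images, state = [], start
    for frame in frames:
        state, image = decode(state, frame)
        images.append(image)
    return images


if __name__ == "__main__":
    video = [[1, 2], [2, 3], [5, 5], [6, 7]]
    blank = [0, 0]
    # The two chunks do not depend on each other, so they could run in parallel.
    first, end_of_first = encode_chunk(blank, video[:2])
    second, _ = encode_chunk(blank, video[2:])
    # Stitch: re-express the second chunk's first frame against the real state.
    second[0] = rebase(end_of_first, video[2], second[0])
    assert decode_all(blank, first + second) == video
    print("stitched stream decodes to the original video")
```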

slide-22
SLIDE 22

Encoding a 14.8-minute 4K Video @20dB

vpxenc Single-Threaded    453 mins
vpxenc Multi-Threaded     149 mins
YouTube (H.264)           37 mins
ExCamera                  2.6 mins

slide-23
SLIDE 23

ExCamera

  • Two major contributions:
      • Framework to run 5,000-way parallel jobs with IPC on a commercial “cloud function” service.
      • Purely functional video codec for massive fine-grained parallelism.
  • 56× faster than existing encoder, for <$6.

slide-24
SLIDE 24

gg: make -j1000 (and other jobs) on function-as-a-service infrastructure

Sadjad Fouladi, Dan Iter, Shuvo Chatterjee, Christos Kozyrakis, Matei Zaharia, Keith Winstein

slide-25
SLIDE 25

What is gg?

  • gg is a system for executing interdependent software workflows across thousands of short-lived “lambdas”.

[Figure: dependency graph for building the stripped hello binary. Nodes: string.h, stdio.h, hello.c, dirname.c, closeout.c; hello.i, dirname.i, closeout.i; hello.s, dirname.s, closeout.s; hello.o, dirname.o, closeout.o; libhello.a, libc; hello; hello (stripped)]

slide-26
SLIDE 26

"Thunk" abstraction

[Figure: the same dependency graph for the stripped hello binary, alongside the thunk for one of its nodes, dirname.s]

{
  "function": {
    "exe": "g++",
    "args": ["-S", "dirname.i", "-o",...],
    "hash": "A5BNh"
  },
  "infiles": [
    { "name": "dirname.i", "order": 1, "hash": "SoYcD" },
    { "name": "g++", "order": 0, "hash": "A5BNh" }
  ],
  "outfile": "dirname.s"
}

slide-27
SLIDE 27

"Thunk" abstraction

  • Thunk is an abstraction for representing a morsel of computation in terms of a function and its complete functional footprint.
  • Thunks can be forced anywhere, on the local machine, or on a remote VM, or inside a lambda function.

{
  "function": {
    "exe": "g++",
    "args": ["-S", "dirname.i", "-o",...],
    "hash": "AsBNh"
  },
  "infiles": [
    { "name": "dirname.i", "order": 1, "hash": "SoYcD" },
    { "name": "g++", "order": 0, "hash": "ts0sB" }
  ],
  "outfile": "dirname.s"
}
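
Because a thunk names every input by content hash, the thunk itself can be hashed to get a stable identity for "this exact computation". The sketch below is one possible way to build such a record, not gg's actual format or hashing scheme: the use of SHA-256, the truncated digests, the choice to reuse the executable's hash as the function hash, and the names `toy-cc`, `example.i`, and `example.s` are all assumptions for illustration.

```python
import hashlib
import json
from typing import Dict, List


def content_hash(data: bytes) -> str:
    """Short content hash (assumption: SHA-256, truncated for readability)."""
    return hashlib.sha256(data).hexdigest()[:8]


def make_thunk(exe: str, exe_bytes: bytes, args: List[str],
               infiles: Dict[str, bytes], outfile: str) -> dict:
    """Describe a computation by its function and the hashes of all its inputs."""
    exe_hash = content_hash(exe_bytes)
    entries = [{"name": exe, "order": 0, "hash": exe_hash}]
    for order, name in enumerate(sorted(infiles), start=1):
        entries.append({"name": name, "order": order,
                        "hash": content_hash(infiles[name])})
    return {
        "function": {"exe": exe, "args": args, "hash": exe_hash},
        "infiles": entries,
        "outfile": outfile,
    }


def thunk_hash(thunk: dict) -> str:
    """Identity of the whole thunk: hash of its canonical JSON encoding."""
    canonical = json.dumps(thunk, sort_keys=True).encode("utf-8")
    return content_hash(canonical)


if __name__ == "__main__":
    compiler = b"pretend these are the bytes of a compiler binary"
    args = ["-S", "example.i"]  # hypothetical arguments
    a = make_thunk("toy-cc", compiler, args, {"example.i": b"int x;"}, "example.s")
    b = make_thunk("toy-cc", compiler, args, {"example.i": b"int x;"}, "example.s")
    c = make_thunk("toy-cc", compiler, args, {"example.i": b"int y;"}, "example.s")
    print(thunk_hash(a) == thunk_hash(b))  # identical inputs, identical identity
    print(thunk_hash(a) == thunk_hash(c))  # changed input, different identity
```

An identity of this kind is what lets a result be looked up instead of recomputed, and lets the same thunk be forced on any machine that can fetch the inputs by hash.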

slide-28
SLIDE 28

Execution

  • Generating the dependency graph in terms of thunks:

gg-infer make

  • Forcing the thunk, recursively:

gg-force --jobs 1000 bin/clang
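
Recursive forcing can be pictured as a depth-first walk of the dependency graph: force every dependency first, then run the target, and never run anything twice. The sketch below is a sequential toy, not gg's scheduler; the graph edges are hypothetical (only the file names come from the slides), "running" a target just builds a string, and a real system would dispatch independent targets in parallel.

```python
from typing import Dict, List

# target -> dependencies. Hypothetical edges, chosen for illustration only.
DEPENDENCIES: Dict[str, List[str]] = {
    "hello.c": [],
    "stdio.h": [],
    "hello.i": ["hello.c", "stdio.h"],
    "hello.s": ["hello.i"],
    "hello.o": ["hello.s"],
    "hello": ["hello.o"],
}


def force(target: str, results: Dict[str, str], log: List[str]) -> str:
    """Force a target: force its dependencies first, then 'run' it once."""
    if target in results:  # already forced, reuse the result
        return results[target]
    inputs = [force(dep, results, log) for dep in DEPENDENCIES[target]]
    log.append(target)  # record the order in which targets actually run
    results[target] = f"{target}({', '.join(inputs)})"
    return results[target]


if __name__ == "__main__":
    results: Dict[str, str] = {}
    log: List[str] = []
    force("hello", results, log)
    print(" -> ".join(log))
    # Forcing again does no work, because every result is already cached.
    force("hello", results, log)
    print(len(log))
```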

slide-29
SLIDE 29

Compiling FFmpeg using gg

[Figure: timeline of gg workers compiling FFmpeg. Axes: time (s), 5 to 30, and worker #, 500 to 5000, with an inset covering workers 5080 to 5115. Legend: fetching the dependencies, executing the thunk, uploading the results. Annotations: preprocess, compile and assemble; archive, link and strip; job completed.]

slide-30
SLIDE 30

Evaluation

            single-core    gg (λ)
ffmpeg      9m 45s         35s
inkscape    33m 35s        1m 15s
llvm        1h 16m 18s     1m 11s
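
The speedups implied by these timings can be worked out directly. A small sketch using only the numbers reported on this slide; the helper names are illustrative:

```python
def to_seconds(hours: int = 0, minutes: int = 0, seconds: int = 0) -> int:
    """Convert an h/m/s duration to seconds."""
    return hours * 3600 + minutes * 60 + seconds


def speedup(single_core_s: int, gg_s: int) -> float:
    """Ratio of single-core build time to gg build time."""
    return single_core_s / gg_s


# (single-core, gg on lambdas), taken from the table on this slide.
BUILD_TIMES = {
    "ffmpeg": (to_seconds(minutes=9, seconds=45), to_seconds(seconds=35)),
    "inkscape": (to_seconds(minutes=33, seconds=35), to_seconds(minutes=1, seconds=15)),
    "llvm": (to_seconds(hours=1, minutes=16, seconds=18), to_seconds(minutes=1, seconds=11)),
}

if __name__ == "__main__":
    for name, (single_core_s, gg_s) in BUILD_TIMES.items():
        print(f"{name}: {speedup(single_core_s, gg_s):.1f}x")
```

This works out to roughly 17× for ffmpeg, 27× for inkscape, and 64× for llvm.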

slide-31
SLIDE 31

gg is open-source software

https://github.com/StanfordSNR/gg


slide-32
SLIDE 32

Takeaways

  • The future is granular, interactive and massively parallel.
  • Many applications can benefit from this "Laptop Extension" model.
  • Better platforms need to be built to support "bursty" massively-parallel jobs.

slide-33
SLIDE 33

JUST USE GG!


https://github.com/StanfordSNR/gg