Tiny functions for lots of things Keith Winstein joint work with: - - PowerPoint PPT Presentation



slide-1
SLIDE 1 Tiny functions for lots of things Keith Winstein joint work with: Francis Y. Yan, Sadjad Fouladi, John Emmons, Riad S. Wahby, Emre Orbay, Brennan Shacklett, William Zeng, Dan Iter, Shuvo Chatterjee, Catherine Wu, Daniel Reiter Horn, Ken Elkabany, Chris Lesniewski-Laas, Karthikeyan Vasuki Balasubramaniam, Rahul Bhalerao, George Porter, Anirudh Sivaraman (Stanford University, Saratoga High School, Dropbox, UC San Diego, MIT)
slide-2
SLIDE 2 Message of this talk

◮ A little “functional-ish” programming goes a long way. ◮ It’s worth refactoring megamodules (codecs, TCP, compilers, machine learning) using ideas from functional programming. ◮ Just the ability to name, save, and restore program states is powerful in its own right.

slide-3
SLIDE 3 Breaking megamodules into functions Lepton: JPEG recompression in a distributed filesystem ExCamera: Fast interactive video encoding Salsify: Videoconferencing with co-designed codec and transport protocol gg: IR for “laptop to lambda” jobs with 8,000-way parallelism
slide-4
SLIDE 4 Breaking megamodules into functions Lepton: JPEG recompression in a distributed filesystem ◮ “functional” JPEG codec for boundary-oblivious sharding ExCamera: Fast interactive video encoding ◮ “functional” video codec for fine-grained parallelism Salsify: Videoconferencing with co-designed codec and transport protocol ◮ “functional” codec to explore an execution path without committing gg: IR for “laptop to lambda” jobs with 8,000-way parallelism ◮ “functional” representation of practical parallel pipelines
slide-5
SLIDE 5 System 1: Lepton (distributed JPEG recompression) Daniel Reiter Horn, Ken Elkabany, Chris Lesniewski-Laas, and KW, The Design, Implementation, and Deployment of a System to Transparently Compress Hundreds of Petabytes of Image Files for a File-Storage Service, in NSDI 2017 (Community Award winner).
slide-6
SLIDE 6

Storage Overview at Dropbox

  • ¾ Media
  • Roughly an Exabyte in storage
  • Can we save backend space?
[Chart: share of storage (0% to 100%) by file type: Other, Videos, JPEGs]

slide-7
SLIDE 7

JPEG File

  • Header
  • 8x8 blocks of pixels

– DCT transformed into 64 coefs (lossless)

– Each divided by large quantizer (lossy)

– Serialized using Huffman code (lossless)

[Figure: 8x8 coefficient block with regions labeled 7x7, 1x7, 7x1, and DC. Image credit: wikimedia]

slide-8
SLIDE 8 Idea: save storage with transparent recompression ◮ Requirement: byte-for-byte reconstruction of original file ◮ Approach: improve bottom “lossless” layer only ◮ Replace DC-predicted Huffman code with an arithmetic code ◮ Use a probability model to predict “1” vs. “0”
slide-9
SLIDE 9 Prior work [Plot: decompression speed (Mbits/s) and compression savings (percent), annotated "Better"; points: MozJPEG (arithmetic), JPEGrescan (progressive), packjpg (global sort + big model + arithmetic)]
slide-10
SLIDE 10 Challenge: distributed filesystem with arbitrary chunk boundaries bytes 0..N-1 bytes N..2N-1 bytes 2N..end server #272 server #140 server #803
slide-11
SLIDE 11 Challenge: distributed filesystem with arbitrary chunk boundaries representing bytes 0..N-1 representing bytes N..2N-1 representing bytes 2N..end server #272 server #140 server #803 Lepton Lepton Lepton
slide-12
SLIDE 12 Challenge: distributed filesystem with arbitrary chunk boundaries representing bytes 0..N-1 representing bytes N..2N-1 representing bytes 2N..end server #272 server #140 server #803 Lepton Lepton Lepton bytes 0..N-1 bytes N..2N-1 bytes 2N..end
slide-13
SLIDE 13 Requirements for distributed compression ◮ Store and decode file in independent chunks ◮ Can start at any byte offset ◮ Achieve > 100 Mbps decoding speed per chunk ◮ Don’t lose data ◮ Immune to adversarial/pathological input files ◮ Every time program changed, qualify on a billion images ◮ Three compilers (with and without sanitizers) must match on all billion images
slide-14
SLIDE 14 Challenges ◮ Baseline JPEG is encoded as a stream of Huffman codewords with opaque state (DC prediction). ◮ encode(HuffmanTable, vector<Coefficient>) → vector<bit> ◮ How to encode chunk of original file, starting in midstream? ◮ Midstream = in the middle of a Huffman codeword ◮ Midstream = unknown DC (average) value
slide-15
SLIDE 15 When the client retrieves a chunk of a JPEG file, how does the fileserver re-encode that chunk from Lepton back to JPEG?
slide-16
SLIDE 16 Making the state of the JPEG encoder explicit ◮ Formulate JPEG encoder in explicit state-passing style ◮ Implement DC-predicted Huffman encoder that can resume from any byte boundary ◮ encode(HuffmanTable, vector<bit>, int dc, vector<Coefficient>) → vector<bit>
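The idea of resuming an encoder from explicit state can be illustrated with a toy sketch. This is not Lepton's code: `TOY_TABLE` is a hypothetical prefix code standing in for a real JPEG Huffman table, the "coefficients" are just DC values, and the argument order only mirrors the signature on the slide. The sketch shows that, once the pending bits and the DC predictor are passed in explicitly, a second call can pick up in the middle of a codeword and produce the same bits as a single pass.

```python
from typing import Dict, List

Bits = List[int]

# Hypothetical prefix code over DC differences; a stand-in for a real JPEG Huffman table.
TOY_TABLE: Dict[int, Bits] = {
    0: [0],
    1: [1, 0, 0],
    -1: [1, 0, 1],
    2: [1, 1, 0],
    -2: [1, 1, 1],
}


def encode(table: Dict[int, Bits], pending: Bits, dc: int, coefficients: List[int]) -> Bits:
    """Encode DC values as prefix-coded differences, starting from explicit state.

    pending: bits already emitted that do not yet fill a whole byte
    dc:      the previous DC value, used as the predictor
    """
    out = list(pending)
    for value in coefficients:
        out.extend(table[value - dc])
        dc = value
    return out


coefficients = [1, 2, 2, 0, -1, 1, 0, 0]

# Encode the whole stream in one call.
whole = encode(TOY_TABLE, [], 0, coefficients)

# Encode the first half, cut at the last full byte, and save the state.
split = 4
first = encode(TOY_TABLE, [], 0, coefficients[:split])
full_bytes = len(first) // 8 * 8
saved_pending, saved_dc = first[full_bytes:], coefficients[split - 1]

# Resume from the saved state; the first half of the input is not needed.
rest = encode(TOY_TABLE, saved_pending, saved_dc, coefficients[split:])

print(first[:full_bytes] + rest == whole)
```

In this example the byte boundary falls inside a codeword, which is the "midstream" case the slides describe: the saved state is two leftover bits plus the last DC value.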
slide-17
SLIDE 17 Results [Plot: decompression speed (Mbits/s) and compression savings (percent), annotated "Better"; points: MozJPEG (arithmetic), JPEGrescan (progressive), packjpg (global sort + big model + arithmetic)]
slide-18
SLIDE 18 Results [Plot: same axes and points as the previous slide, with Lepton added]
slide-19
SLIDE 19

Deployment

  • Lepton has encoded 150 billion files

– 203 PiB of JPEG files
– Saving 46 PiB
– So far…

  • Backfilling at > 6000 images per second
slide-20
SLIDE 20

Power Usage at 6,000 Encodes

[Plot: chassis power (kW), axis from 50 to 300, against time of day]

slide-21
SLIDE 21 Lepton concluding thoughts ◮ A little bit of functional programming can go a long way. ◮ Functional JPEG codec lets Lepton distribute decoding with arbitrary chunk boundaries and parallelize within each chunk.
slide-22
SLIDE 22 System 2: ExCamera (fine-grained parallel video processing) Sadjad Fouladi, Riad S. Wahby, Brennan Shacklett, Karthikeyan Vasuki Balasubramaniam, William Zeng, Rahul Bhalerao, Anirudh Sivaraman, George Porter, and KW, Encoding, Fast and Slow: Low-Latency Video Processing Using Thousands of Tiny Threads, in NSDI 2017. https://ex.camera
slide-23
SLIDE 23

What we currently have

  • People can make changes to a word-processing document
  • The changes are instantly visible for the others


slide-24
SLIDE 24

What we would like to have

  • People can interactively edit and transform a video
  • The changes are instantly visible for the others

for Video?

slide-25
SLIDE 25

"Apply this awesome filter to my video."

slide-26
SLIDE 26

"Look everywhere for this face in this movie."

slide-27
SLIDE 27

"Remake Star Wars Episode I without Jar Jar."

slide-28
SLIDE 28

The Problem: Currently, running such pipelines on videos takes hours and hours, even for a short video.

The Question: Can we achieve interactive collaborative video editing by using massive parallelism?

slide-29
SLIDE 29

The challenges

  • Low-latency video processing would need thousands of threads, running in parallel, with instant startup.
  • However, the finer-grained the parallelism, the worse the compression efficiency.

slide-30
SLIDE 30

Enter ExCamera

  • We made two contributions:

– Framework to run 5,000-way parallel jobs with IPC on a commercial “cloud function” service.

– Purely functional video codec for massive fine-grained parallelism.

  • We call the whole system ExCamera.

slide-31
SLIDE 31


slide-32
SLIDE 32

Now we have the threads, but...

  • With the existing encoders, the finer-grained the parallelism, the worse the compression efficiency.

slide-33
SLIDE 33

Video Codec

  • A piece of software or hardware that compresses and decompresses digital video.

[Diagram: Encoder → compressed bitstream → Decoder]

slide-34
SLIDE 34

How video compression works

  • Exploit the temporal redundancy in adjacent images.
  • Store the first image in its entirety: a key frame.
  • For other images, only store a "diff" with the previous images: an interframe.

In a 4K video @15Mbps, a key frame is ~1 MB, but an interframe is ~25 KB.

slide-35
SLIDE 35

Existing video codecs only expose a simple interface

encode([image, image, ..., image]) → keyframe + interframe[2:n]
decode(keyframe + interframe[2:n]) → [image, image, ..., image]

(keyframe + interframe[2:n] is the compressed video)

slide-36
SLIDE 36

Traditional parallel video encoding is limited

serial ↓
encode(i[1:200]) → keyframe1 + interframe[2:200]

parallel ↓
[thread 01] encode(i[1:10]) → kf1 + if[2:10]
[thread 02] encode(i[11:20]) → kf11 + if[12:20]
[thread 03] encode(i[21:30]) → kf21 + if[22:30]
[thread 20] encode(i[191:200]) → kf191 + if[192:200]

Each additional key frame: +1 MB

finer-grained parallelism ⇒ more key frames ⇒ worse compression efficiency
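A back-of-the-envelope calculation makes the cost of extra key frames concrete. It uses the approximate sizes quoted earlier in the talk (a key frame of about 1 MB and an interframe of about 25 KB for 4K video at 15 Mbps); treating 1 MB as 1000 KB and every frame as exactly that size are simplifying assumptions, and `encoded_size_kb` is a name made up for this sketch.

```python
KEY_FRAME_KB = 1000   # ~1 MB per key frame (approximate figure from the slides)
INTERFRAME_KB = 25    # ~25 KB per interframe (approximate figure from the slides)


def encoded_size_kb(total_frames: int, frames_per_chunk: int) -> int:
    """Estimated output size when every chunk must begin with its own key frame."""
    chunks = -(-total_frames // frames_per_chunk)  # ceiling division
    return chunks * KEY_FRAME_KB + (total_frames - chunks) * INTERFRAME_KB


serial = encoded_size_kb(200, 200)    # one thread, one key frame
parallel = encoded_size_kb(200, 10)   # 20 threads, 20 key frames

print(f"serial:   {serial} KB")
print(f"parallel: {parallel} KB")
print(f"ratio:    {parallel / serial:.1f}x")
```

Under these assumptions, splitting 200 frames across 20 threads makes the output roughly four times larger, which is the trade-off the slide summarizes.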

slide-37
SLIDE 37

We need a way to start encoding mid-stream

  • Starting to encode mid-stream requires access to intermediate computations.
  • Traditional video codecs do not expose this information.
  • We formulated this internal information and made it explicit: the “state”.

slide-38
SLIDE 38

The decoder is an automaton

[Diagram: state → (key frame) → state → (interframe) → state → (interframe) → state]

slide-39
SLIDE 39

The state consists of reference images and probability models

[Diagram: a frame takes the source state (reference images, prob tables) to the target state (reference images, prob tables′) and produces an output image]

slide-40
SLIDE 40

What we built: a video codec in explicit state-passing style

  • VP8 decoder with no inner state:

decode(state, frame) → (state′, image)

  • VP8 encoder: resume from specified state

encode(state, image) → interframe

  • Adapt a frame to a different source state

rebase(state, image, interframe) → interframe′
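The three signatures above can be made concrete with a toy codec. This is a sketch under heavy assumptions, not the ExCamera VP8 implementation: an "image" is a flat list of integers, the "state" is only the reference image (the real state also carries probability tables), and an "interframe" is a per-pixel difference. All type names here are illustrative.

```python
from typing import List, Tuple

Image = List[int]          # toy image: a flat list of pixel values
State = Tuple[int, ...]    # toy state: just the reference image
Interframe = List[int]     # toy interframe: per-pixel difference from the reference


def decode(state: State, frame: Interframe) -> Tuple[State, Image]:
    """Pure function: the decoder keeps no inner state between calls."""
    image = [ref + d for ref, d in zip(state, frame)]
    return tuple(image), image


def encode(state: State, image: Image) -> Interframe:
    """Encode `image` as an interframe relative to an explicitly supplied state."""
    return [px - ref for px, ref in zip(image, state)]


def rebase(state: State, image: Image, interframe: Interframe) -> Interframe:
    """Adapt a frame to a different source state.

    A real rebase would reuse the decisions already stored in `interframe`;
    this toy has none, so it ignores that argument and re-encodes the image.
    """
    return encode(state, image)


state_a: State = (10, 10, 10)
state_b: State = (12, 9, 10)
image = [11, 10, 13]

frame_a = encode(state_a, image)           # interframe valid from state_a
frame_b = rebase(state_b, image, frame_a)  # same picture, valid from state_b

print(decode(state_a, frame_a)[1] == image)
print(decode(state_b, frame_b)[1] == image)
```

Because every function takes the state as an argument and returns the new one, a caller can save a state, hand it to another worker, or decode the same frame twice without side effects.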


slide-41
SLIDE 41

Putting it all together: ExCamera

  • Divide the video into tiny chunks:

– [Parallel] encode tiny independent chunks.

– [Serial] rebase the chunks together and remove extra keyframes.

slide-42
SLIDE 42
1. [Parallel] Download a tiny chunk of raw video

[Diagram: threads 1 to 4, each holding one six-frame chunk (frames 1–6, 7–12, 13–18, 19–24)]

slide-43
SLIDE 43
2. [Parallel] vpxenc → keyframe, interframe[2:n]

Google's VP8 encoder:

encode(img[1:n]) → keyframe + interframe[2:n]

[Diagram: threads 1 to 4, each holding one six-frame chunk (frames 1–6, 7–12, 13–18, 19–24)]

slide-44
SLIDE 44
3. [Parallel] decode → state ↝ next thread

Our explicit-state style decoder:

decode(state, frame) → (state′, image)

[Diagram: threads 1 to 4, each holding one six-frame chunk (frames 1–6, 7–12, 13–18, 19–24)]

slide-45
SLIDE 45
4. [Parallel] last thread’s state ↝ encode

Our explicit-state style encoder:

encode(state, image) → interframe

[Diagram: threads 1 to 4, each holding one six-frame chunk (frames 1–6, 7–12, 13–18, 19–24)]

slide-46
SLIDE 46
5. [Serial] last thread’s state ↝ rebase → state ↝ next thread

Adapt a frame to a different source state:

rebase(state, image, interframe) → interframe′

[Diagram: threads 1 to 4, each holding one six-frame chunk (frames 1–6, 7–12, 13–18, 19–24)]

slide-47
SLIDE 47
5. [Serial] last thread’s state ↝ rebase → state ↝ next thread

Adapt a frame to a different source state:

rebase(state, image, interframe) → interframe′

[Diagram: threads 1 to 4, each holding one six-frame chunk (frames 1–6, 7–12, 13–18, 19–24)]

slide-48
SLIDE 48
6. [Parallel] Upload finished video

[Diagram: threads 1 to 4, each holding one six-frame chunk (frames 1–6, 7–12, 13–18, 19–24)]
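The data flow of steps 1 to 6 can be traced end to end with a toy lossy codec. Everything here is an assumption made for illustration: images are short integer lists, the state is only the reference image, the quantizer step `Q` stands in for real lossy coding, and the toy `rebase` simply re-encodes (the real one reuses work from the earlier encode, which is what keeps the serial step cheap). The sketch runs sequentially; comments mark which parts are parallel in the described system.

```python
from typing import List, Tuple

Q = 4  # toy quantizer step; it makes the codec lossy, so decoder states can diverge

Image = List[int]
Frame = List[int]
State = Tuple[int, ...]


def encode(state: State, image: Image) -> Frame:
    return [round((px - ref) / Q) for px, ref in zip(image, state)]


def decode(state: State, frame: Frame) -> Tuple[State, Image]:
    image = [ref + d * Q for ref, d in zip(state, frame)]
    return tuple(image), image


def rebase(state: State, image: Image, interframe: Frame) -> Frame:
    # Toy stand-in: ignores the old interframe and re-encodes against the new state.
    return encode(state, image)


def encode_chunk(images: List[Image]) -> Tuple[Image, List[Frame], State]:
    """Encode one chunk on its own (key frame first) and report its final state."""
    keyframe = list(images[0])  # toy key frame: the raw image
    state: State = tuple(keyframe)
    interframes = []
    for image in images[1:]:
        frame = encode(state, image)
        state, _ = decode(state, frame)
        interframes.append(frame)
    return keyframe, interframes, state


def excamera(chunks: List[List[Image]]) -> List[List[int]]:
    # Steps 1-3 (parallel in the real system): every chunk is encoded independently.
    encoded = [encode_chunk(chunk) for chunk in chunks]
    # Step 4 (parallel): re-encode each later chunk's first image as an interframe
    # against the previous chunk's final state, replacing its key frame.
    firsts = [encode(encoded[k - 1][2], chunks[k][0]) for k in range(1, len(chunks))]
    # Step 5 (serial): rebase every later chunk onto the true running state.
    keyframe, interframes, state = encoded[0]
    stream = [keyframe] + interframes
    for k in range(1, len(chunks)):
        old_frames = [firsts[k - 1]] + encoded[k][1]
        for image, old in zip(chunks[k], old_frames):
            frame = rebase(state, image, old)
            state, _ = decode(state, frame)
            stream.append(frame)
    return stream  # one key frame followed only by interframes


def decode_stream(stream: List[List[int]]) -> List[Image]:
    state: State = tuple(stream[0])
    images = [list(stream[0])]
    for frame in stream[1:]:
        state, image = decode(state, frame)
        images.append(image)
    return images


chunks = [[[10, 20], [13, 22], [19, 21]], [[30, 5], [28, 9], [27, 14]]]
stream = excamera(chunks)
decoded = decode_stream(stream)
print(len(stream), "frames, 1 key frame")
```

The output stream contains a single key frame, and each decoded pixel stays within half a quantizer step of the original, even though the second chunk was first encoded with its own key frame.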

slide-49
SLIDE 49

Wide range of different configurations

ExCamera[n, x]

n: number of frames in each chunk

slide-50
SLIDE 50

Wide range of different configurations

ExCamera[n, x]

x: number of chunks "rebased" together

slide-51
SLIDE 51

How well does it compress?

[Plot: quality (SSIM dB) and average bitrate (Mbit/s) for vpx (1 thread) and vpx (multithreaded)]

slide-52
SLIDE 52

How well does it compress?

[Plot: quality (SSIM dB) and average bitrate (Mbit/s) for ExCamera[6, 1], vpx (1 thread), and vpx (multithreaded)]

slide-53
SLIDE 53

How well does it compress?

[Plot: quality (SSIM dB) and average bitrate (Mbit/s) for ExCamera[6, 1], ExCamera[6, 16], and vpx (1 thread); annotation: ±3%]

slide-54
SLIDE 54

Encoding a 14.8-minute 4K Video @20dB:

vpxenc Single-Threaded   453 mins
vpxenc Multi-Threaded    149 mins
YouTube (H.264)          37 mins
ExCamera[6, 16]          2.6 mins

slide-55
SLIDE 55 ExCamera concluding thoughts ◮ Functional video codec lets ExCamera parallelize at fine granularity. ◮ Many interactive jobs call for similar approach: ◮ Image and video filters ◮ 3D artists ◮ Compilation and software testing ◮ Interactive machine learning ◮ Database queries ◮ Data visualization ◮ Genomics ◮ Search ◮ Distributed systems will need to treat application state as a first-class object. ◮ Every program soon: “do in 1 hour” or “do in 1 second for 9¢”
slide-56
SLIDE 56 System 3: Salsify (videoconferencing) Sadjad Fouladi, John Emmons, Emre Orbay, Catherine Wu, Riad S. Wahby, and KW, Salsify: low-latency network video through tighter integration between a video codec and a transport protocol, in NSDI 2018. https://snr.stanford.edu/salsify
slide-57
SLIDE 57

WebRTC (Chrome 65)

slide-58
SLIDE 58

Current systems do not react fast enough to network variations, end up congesting the network, causing stalls and glitches.

slide-59
SLIDE 59

Today's systems combine two (loosely-coupled) components

[Diagram: video codec, transport protocol]

slide-60
SLIDE 60

Two distinct modules, two separate control loops

[Diagram: video codec and transport protocol, each with its own control loop; labels: target bit rate, compressed frames, 24 frames/s, 300 packets/s]

slide-61
SLIDE 61

Transport tells us how big the next frame should be, but...

It’s challenging for any codec to choose the appropriate quality settings upfront to meet a target size; they tend to over-/undershoot the target.

slide-62
SLIDE 62

How to get an accurate frame out of an inaccurate codec

  • Trial and error: Encode with different quality settings, pick the one that fits.
  • Not possible with existing codecs.


slide-63
SLIDE 63

After encoding a frame, the encoder goes through a state transition that is impossible to undo

[Diagram: a sequence of frames, each advancing the encoder's state]

slide-64
SLIDE 64

There’s no way to undo an encoded frame in current codecs


encode(🏟,🏟,...) → frames...

The state is internal to the encoder—no way to save/restore the state.

slide-65
SLIDE 65

Functional video codec to the rescue

encode(state, 🏟) → state′, frame


Salsify’s functional video codec exposes the state that can be saved/restored.

slide-66
SLIDE 66

Order two, pick the one that fits!

  • Salsify’s functional video codec can explore different execution paths without committing to them.
  • For each frame, codec presents the transport with three options:

– A slightly-higher-quality version

– A slightly-lower-quality version

– Discarding the frame

[Diagram: better version 50 KB, worse version 10 KB]
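The "order two, pick the one that fits" idea can be sketched with a toy functional encoder. The assumptions are significant: images are integer lists, the state is only the reference image, `frame_size` is a made-up size model, the two quantizer values are arbitrary, and the selection policy (prefer the better version if it fits, else the worse one, else discard) is one plausible reading of the slide rather than Salsify's exact rule.

```python
from typing import List, Optional, Tuple

Image = List[int]
Frame = List[int]
State = Tuple[int, ...]


def encode(state: State, image: Image, quantizer: int) -> Tuple[State, Frame]:
    """Functional encoder: returns (state', frame) and mutates nothing."""
    frame = [round((px - ref) / quantizer) for px, ref in zip(image, state)]
    new_state = tuple(ref + d * quantizer for ref, d in zip(state, frame))
    return new_state, frame


def frame_size(frame: Frame) -> int:
    # Hypothetical size model: larger residuals cost more.
    return sum(abs(d) for d in frame)


def send_next(state: State, image: Image, budget: int,
              quantizers: Tuple[int, int] = (2, 8)) -> Tuple[State, Optional[Frame]]:
    """Encode two candidates from the same state and keep the best one that fits."""
    better = encode(state, image, quantizers[0])   # finer quantizer, larger frame
    worse = encode(state, image, quantizers[1])    # coarser quantizer, smaller frame
    for new_state, frame in (better, worse):
        if frame_size(frame) <= budget:
            return new_state, frame                # commit to this execution path
    return state, None                             # discard: the state is unchanged


start: State = (0, 0, 0)
picture = [16, 0, 32]

for budget in (30, 10, 3):
    new_state, frame = send_next(start, picture, budget)
    print(budget, frame, new_state)
```

Because `encode` returns a new state instead of updating one in place, the unused candidate costs nothing to abandon, and discarding a frame leaves the encoder exactly where it was.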

slide-67
SLIDE 67

Salsify’s architecture: Unified control loop

[Diagram: transport protocol & video codec in a single control loop]

slide-68
SLIDE 68

Codec → Transport: “Here’s two versions of the current frame.”

[Diagram: better version 50 KB, worse version 25 KB; target frame size 30 KB]

slide-69
SLIDE 69

Transport → Codec: “I picked option 2. Base the next frame on its exiting state.”

[Diagram: chosen version 25 KB; target frame size 30 KB]

slide-70
SLIDE 70

Codec → Transport: “Here’s two versions of the latest frame.”

[Diagram: better version 50 KB, worse version 25 KB; target frame size 55 KB]

slide-71
SLIDE 71

Transport → Codec: “I picked option 1. Base the next frame on its exiting state.”

[Diagram: chosen version 50 KB; target frame size 55 KB]

slide-72
SLIDE 72

Codec → Transport: “Here’s two versions of the latest frame.”

[Diagram: better and worse versions; sizes shown: 70 KB, 25 KB, 50 KB; target frame size 5 KB]

slide-73
SLIDE 73

Transport → Codec: “I cannot send any frames right now. Sorry, but discard them.”

[Diagram: target frame size 5 KB]

slide-74
SLIDE 74

Codec → Transport: “Fine. Here’s two versions of the latest frame.”

[Diagram: better version 45 KB, worse version 2 KB; target frame size 50 KB]

slide-75
SLIDE 75

Transport → Codec: “I picked option 1. Base the next frame on its exiting state.”

[Diagram: chosen version 45 KB; target frame size 50 KB]

slide-76
SLIDE 76

Goals for the measurement testbed

  • A system with reproducible input video and reproducible network traces that runs an unmodified version of the system-under-test.
  • Target QoE metrics: per-frame quality and delay.

slide-77
SLIDE 77

[Diagram: testbed components: barcoded video, video in/out (HDMI), HDMI to USB camera, emulated network, receiver, HDMI output]

slide-78
SLIDE 78

Sent Image Timestamp: T+0.000s
Received Image Timestamp: T+0.765s
Quality: 9.76 dB SSIM

slide-79
SLIDE 79

Evaluation results: Verizon LTE Trace

[Plot: video quality (SSIM dB) and video delay (95th percentile, ms), annotated "Better"; points: WebRTC (VP9-SVC), Skype, FaceTime, Hangouts, WebRTC]

slide-80
SLIDE 80

Evaluation results: Verizon LTE Trace

[Plot: video quality (SSIM dB) and video delay (95th percentile, ms); points: WebRTC (VP9-SVC), Skype, FaceTime, Hangouts, WebRTC, Status Quo (conventional transport and codec)]

slide-81
SLIDE 81

Evaluation results: Verizon LTE Trace

[Plot: video quality (SSIM dB) and video delay (95th percentile, ms); points: WebRTC (VP9-SVC), Skype, FaceTime, Hangouts, WebRTC, Status Quo (conventional transport and codec), Salsify (conventional codec)]

slide-82
SLIDE 82

Evaluation results: Verizon LTE Trace

[Plot: video quality (SSIM dB) and video delay (95th percentile, ms); points: Salsify, WebRTC (VP9-SVC), Skype, FaceTime, Hangouts, WebRTC, Status Quo (conventional transport and codec), Salsify (conventional codec)]

slide-83
SLIDE 83

Evaluation results: AT&T LTE Trace

[Plot: video quality (SSIM dB) and video delay (95th percentile, ms), annotated "Better"; points: WebRTC (VP9-SVC), Skype, FaceTime, Hangouts, Salsify, WebRTC]

slide-84
SLIDE 84

Evaluation results: T-Mobile UMTS Trace

[Plot: video quality (SSIM dB) and video delay (95th percentile, ms), annotated "Better"; points: WebRTC (VP9-SVC), Skype, FaceTime, Hangouts, Salsify, WebRTC]

slide-85
SLIDE 85

WebRTC (Chrome 65)

slide-86
SLIDE 86

Improvements to video codecs may have reached the point of diminishing returns, but changes to the architecture of video systems can still yield significant benefits.

slide-87
SLIDE 87 System 4: gg (laptop to lambda) ◮ Kalev Alpernas, Cormac Flanagan, Sadjad Fouladi, Leonid Ryzhyk, Mooly Sagiv, Thomas Schmitz, and KW, Secure serverless computing using dynamic information flow control, Proc. ACM Program. Lang. 2, OOPSLA, Article 118 (November 2018).
◮ Sadjad Fouladi, Francisco Romero, Dan Iter, Qian Li, Shuvo Chatterjee, Christos Kozyrakis, Matei Zaharia, and KW, From Laptop to Lambda: Outsourcing Everyday Jobs to Thousands of Transient Functional Containers, in USENIX ATC 2019.
slide-88
SLIDE 88 Cloud functions as a new computing substrate ◮ Rent 8,000 nodes in seconds (but some are flaky) ◮ Nodes can communicate directly at 600 Mbps (but some paths are flaky) ◮ Lots of jobs could take advantage of this substrate ◮ Big compilations (compiling Chromium takes 16 hours on one core) ◮ Software test suites (unit tests, integration tests) ◮ Ray-tracing (rendering one frame of a movie can take >12 hours) ◮ Video editing ◮ Parallel jobs on large videos
slide-89
SLIDE 89 The gg intermediate representation ◮ Types: values and thunks ◮ Components ◮ raw inputs (“V” value name or “T” thunk name) ◮ forced inputs (“T” thunk name) ◮ outputs (named byte vector, may be another thunk) ◮ execution spec (e.g., Unix command line) ◮ Addressing scheme ◮ “V” + hash of a byte vector ◮ or “T” + hash of a thunk’s canonical representation + “#” + name of an output ◮ Can express ◮ Recursive fibonacci ◮ Y combinator ◮ Various everyday jobs ◮ Alpernas et al. (OOPSLA 2018): “Enforcing IFC policies is easy”
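The addressing scheme above can be sketched in a few lines. The slide does not say which hash function or canonical serialization gg uses, so SHA-256 and sorted-key JSON are assumptions here, as are the dictionary field names, the helper names `value_name` and `thunk_name`, and the example compiler commands.

```python
import hashlib
import json
from typing import Dict, List, Union

Thunk = Dict[str, Union[List[str], str]]


def value_name(data: bytes) -> str:
    """Name a value: "V" + hash of the byte vector."""
    return "V" + hashlib.sha256(data).hexdigest()


def thunk_name(thunk: Thunk, output: str) -> str:
    """Name a thunk output: "T" + hash of a canonical representation + "#" + output name."""
    canonical = json.dumps(thunk, sort_keys=True).encode()
    return "T" + hashlib.sha256(canonical).hexdigest() + "#" + output


source = b"int main() { return 0; }\n"

# A thunk that compiles one source file; its input is named by content.
compile_thunk: Thunk = {
    "raw_inputs": [value_name(source)],
    "forced_inputs": [],
    "outputs": ["main.o"],
    "exec": ["cc", "-c", "main.c", "-o", "main.o"],
}

# A thunk that links the object file; it refers to the compile step by name,
# so it can be described before the compile step has ever run.
link_thunk: Thunk = {
    "raw_inputs": [],
    "forced_inputs": [thunk_name(compile_thunk, "main.o")],
    "outputs": ["a.out"],
    "exec": ["cc", "main.o", "-o", "a.out"],
}

print(value_name(source))
print(thunk_name(link_thunk, "a.out"))
```

Because names are derived from content, two identical thunks get the same name, which is what lets a system of this kind recognize work it has already done.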
slide-90
SLIDE 90 Compilation
slide-91
SLIDE 91 Demo
slide-92
SLIDE 92 Compiling inkscape (600 kLOC)
Tool                                       Time       Cost
single-core make                           32m 34s
“make -j48” on a local 48-core machine     01m 40s
icecc to a warm 48-core EC2 machine        06m 51s    $2.30/hr
icecc to a warm 384-core EC2 cluster       06m 57s    $18.40/hr
gg to AWS Lambda                           01m 27s    50 cents/run
slide-93
SLIDE 93 Compiling Chromium (24,000 kLOC)
Tool                                       Time
single-core make                           15h 58m 20s
“make -j48” on a local 48-core machine     38m 11s
icecc to a warm 48-core EC2 machine        46m 01s
icecc to a warm 384-core EC2 cluster       42m 18s
gg to AWS Lambda                           18m 55s
slide-94
SLIDE 94 Tiny functions for lots of things... ◮ A little “functional-ish” programming goes a long way. ◮ It’s worth refactoring megamodules (codecs, TCP, compilers, machine learning) using ideas from functional programming. ◮ The ability to name, save, and restore program states is powerful in its own right. ◮ Lepton: JPEG recompression ◮ ExCamera: video encoding with thousands of tiny tasks ◮ Salsify: real-time video with “functional” codec and transport ◮ gg: IR for “laptop to lambda” jobs with 8,000-way parallelism
[Background images: a schematic diagram of a general communication system (information source, transmitter, noise source, receiver, destination); the identity (Y F) = (F (Y F)) repeated]