Encoding, Fast and Slow: Low-Latency Video Processing Using - PowerPoint PPT Presentation

Encoding, Fast and Slow: Low-Latency Video Processing Using Thousands of Tiny Threads
Sadjad Fouladi¹, Riad S. Wahby¹, Brennan Shacklett¹, Karthikeyan Vasuki Balasubramaniam², William Zeng¹, Rahul Bhalerao², Anirudh Sivaraman³, George Porter², Keith Winstein¹
¹ Stanford University, ² UC San Diego, ³ MIT
https://ex.camera

Outline
• Vision & Goals
• mu: Supercomputing as a Service
• Fine-grained Parallel Video Encoding
• Evaluation
• Conclusion & Future Work

The challenges
• Low-latency video processing would need thousands of threads, running in parallel, with instant startup.
• However, the finer-grained the parallelism, the worse the compression efficiency.

Enter ExCamera
• We made two contributions:
  • Framework to run 5,000-way parallel jobs with IPC on a commercial “cloud function” service.
  • Purely functional video codec for massive fine-grained parallelism.
• We call the whole system ExCamera.

Where to find thousands of threads?
• IaaS services provide virtual machines (e.g. EC2, Azure, GCE):
  ✔ Thousands of threads
  ✔ Arbitrary Linux executables
  ✗ Minute-scale startup time (OS has to boot up, ...)
  ✗ High minimum cost (60 mins EC2, 10 mins GCE)
3,600 threads on EC2 for one second → >$20

Cloud function services have (as yet) unrealized power
• AWS Lambda, Google Cloud Functions
• Intended for event handlers and Web microservices, but...
• Features:
  ✔ Thousands of threads
  ✔ Arbitrary Linux executables
  ✔ Sub-second startup
  ✔ Sub-second billing
3,600 threads for one second → 10¢
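The gap between the two slides comes largely from billing granularity. The sketch below counts billed thread-hours only; it does not model prices. The function name `billed_thread_hours` is hypothetical, the 60-minute and 10-minute minimums are the ones quoted on the IaaS slide (treated here as billing increments, which is equivalent for a one-second job), and the 100 ms increment for the cloud function service is an assumption, since the slide only says "sub-second billing".

```python
def billed_thread_hours(threads, run_ms, increment_ms):
    """Thread-hours billed when each thread's runtime is rounded up
    to a whole number of billing increments."""
    increments = -(-run_ms // increment_ms)  # ceiling division on integers
    return threads * increments * increment_ms / 3_600_000

JOB = dict(threads=3600, run_ms=1000)  # 3,600 threads for one second

ec2 = billed_thread_hours(**JOB, increment_ms=60 * 60 * 1000)  # 60-minute minimum
gce = billed_thread_hours(**JOB, increment_ms=10 * 60 * 1000)  # 10-minute minimum
fn = billed_thread_hours(**JOB, increment_ms=100)              # assumed 100 ms increment

print(ec2, gce, fn)
```

Under these assumptions the same one-second job is billed as 3,600 thread-hours with a 60-minute minimum, 600 with a 10-minute minimum, and 1 with 100 ms increments.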

mu, supercomputing as a service
• We built mu, a library for designing and deploying general-purpose parallel computations on a commercial “cloud function” service.
• The system starts up thousands of threads in seconds and manages inter-thread communication.
• mu is open-source software: https://github.com/excamera/mu

Now we have the threads, but...
• With the existing encoders, the finer-grained the parallelism, the worse the compression efficiency.

Video Codec
• A piece of software or hardware that compresses and decompresses digital video.
[Figure: video → Encoder → compressed bitstream → Decoder → video]

How video compression works
• Exploit the temporal redundancy in adjacent images.
• Store the first image in its entirety: a key frame.
• For other images, only store a "diff" with the previous images: an interframe.
In a 4K video @15Mbps, a key frame is ~1 MB, but an interframe is ~25 KB.

Existing video codecs only expose a simple interface
encode(img[1:n]) → keyframe + interframe[2:n]   (compressed video)
decode(keyframe + interframe[2:n]) → img[1:n]

Traditional parallel video encoding is limited
serial ↓
  encode(i[1:200]) → keyframe 1 + interframe[2:200]
parallel ↓
  [thread 01] encode(i[1:10]) → kf 1 + if[2:10]
  [thread 02] encode(i[11:20]) → kf 11 + if[12:20]   +1 MB
  [thread 03] encode(i[21:30]) → kf 21 + if[22:30]   +1 MB
  ⋮
  [thread 20] encode(i[191:200]) → kf 191 + if[192:200]   +1 MB
finer-grained parallelism ⇒ more key frames ⇒ worse compression efficiency
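A rough size estimate makes the penalty concrete. This sketch reuses the approximate frame sizes quoted earlier for 4K video at 15 Mbps (~1 MB per key frame, ~25 KB per interframe) and the 200-frame example above; taking 1 MB as 1000 KB and the function name `encoded_size_kb` are assumptions made for illustration.

```python
KEY_FRAME_KB = 1000  # ~1 MB per key frame (approximate figure from the slides)
INTERFRAME_KB = 25   # ~25 KB per interframe (approximate figure from the slides)

def encoded_size_kb(num_frames, chunk_len):
    """Total size when every chunk of chunk_len frames starts with a key frame."""
    num_chunks = -(-num_frames // chunk_len)  # ceiling division
    return num_chunks * KEY_FRAME_KB + (num_frames - num_chunks) * INTERFRAME_KB

serial = encoded_size_kb(200, 200)   # one key frame, 199 interframes
parallel = encoded_size_kb(200, 10)  # 20 key frames, 180 interframes

print(serial, parallel, round(parallel / serial, 1))  # 5975 24500 4.1
```

With these figures, splitting 200 frames across 20 threads makes the output roughly four times larger than the serial encoding.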

We need a way to start encoding mid-stream
• Starting to encode mid-stream requires access to intermediate computations.
• Traditional video codecs do not expose this information.
• We formulated this internal information and made it explicit: the “state”.

The decoder is an automaton
[Figure: a chain of states; a key frame followed by interframes, each frame taking the decoder from one state to the next]

What we built: a video codec in explicit state-passing style
• VP8 decoder with no inner state:
  decode(state, frame) → (state′, image)
• VP8 encoder: resume from specified state
  encode(state, image) → interframe
• Adapt a frame to a different source state
  rebase(state, image, interframe) → interframe′
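The three signatures above can be illustrated with a toy codec. This is a minimal sketch, not VP8 and not the ExCamera implementation: images are flat lists of integers, the "state" is simply the last decoded image, frames are lossy because differences are quantized to a hypothetical step `Q`, and `encode_key` is a helper added for the example. In particular, this toy `rebase` just re-encodes against the new state and ignores the original interframe; the slides do not say how the real rebase uses it.

```python
Q = 4  # quantization step of the toy codec (hypothetical)

def quantize(x):
    return Q * round(x / Q)

def encode_key(image):
    """Key frame: the whole image, quantized, with no reference to any state."""
    return ("key", [quantize(p) for p in image])

def encode(state, image):
    """Interframe: quantized per-sample difference against the given state."""
    return ("inter", [quantize(p - s) for p, s in zip(image, state)])

def decode(state, frame):
    """Pure function: (state, frame) -> (new state, decoded image)."""
    kind, data = frame
    image = list(data) if kind == "key" else [s + d for s, d in zip(state, data)]
    return image, image  # in this toy, the new state is the decoded image

def rebase(state, image, interframe):
    """Adapt a frame to a different source state (toy: plain re-encode)."""
    return encode(state, image)

images = [[10, 20, 30], [12, 20, 33], [15, 22, 33]]

# Because decode is pure, decoding a stream is just threading the state through.
state, decoded = None, []
for i, image in enumerate(images):
    frame = encode_key(image) if i == 0 else encode(state, image)
    state, out = decode(state, frame)
    decoded.append(out)

# An interframe made against one state can be rebased onto another state.
other_state = [0, 0, 0]
original = encode(decoded[0], images[1])
rebased = rebase(other_state, images[1], original)
_, image_from_other = decode(other_state, rebased)
```

Since no decoder state is hidden, any worker holding a state value can resume decoding or encoding from that point.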

Putting it all together: ExCamera
• Divide the video into tiny chunks:
  • [Parallel] encode tiny independent chunks.
  • [Serial] rebase the chunks together and remove extra keyframes.

1. [Parallel] Download a tiny chunk of raw video
[Figure: threads 1 to 4, each holding one chunk of the frames numbered 1 to 24]

2. [Parallel] vpxenc → keyframe, interframe[2:n]
Google's VP8 encoder: encode(img[1:n]) → keyframe + interframe[2:n]

3. [Parallel] decode → state ↝ next thread
Our explicit-state style decoder: decode(state, frame) → (state′, image)

4. [Parallel] last thread’s state ↝ encode
Our explicit-state style encoder: encode(state, image) → interframe

5. [Serial] last thread’s state ↝ rebase → state ↝ next thread
Adapt a frame to a different source state: rebase(state, image, interframe) → interframe′

6. [Parallel] Upload finished video
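Steps 2 to 5 can be sketched end to end with a toy codec. Everything here is an assumption made for illustration: the codec is a quantized frame-difference scheme rather than VP8, `ThreadPoolExecutor` stands in for the cloud function workers, and the names `encode_key`, `encode_chunk` and `excamera` are hypothetical. The toy `rebase` re-encodes against the new state and ignores the interframe it is given. Download and upload (steps 1 and 6) are omitted.

```python
from concurrent.futures import ThreadPoolExecutor

Q = 4  # quantization step of the toy codec (hypothetical)

def quantize(x):
    return Q * round(x / Q)

def encode_key(image):
    return ("key", [quantize(p) for p in image])

def encode(state, image):
    return ("inter", [quantize(p - s) for p, s in zip(image, state)])

def decode(state, frame):
    kind, data = frame
    image = list(data) if kind == "key" else [s + d for s, d in zip(state, data)]
    return image, image  # (new state, decoded image)

def rebase(state, image, interframe):
    return encode(state, image)  # toy: nothing is reused from interframe

def encode_chunk(chunk):
    """Steps 2-3: key frame + interframes, then the chunk's final state."""
    frames, state = [], None
    for i, image in enumerate(chunk):
        frame = encode_key(image) if i == 0 else encode(state, image)
        state, _ = decode(state, frame)
        frames.append(frame)
    return frames, state

def excamera(chunks):
    with ThreadPoolExecutor() as pool:
        encoded = list(pool.map(encode_chunk, chunks))  # steps 2-3, in parallel
        # Step 4, in parallel: re-encode each chunk's first frame as an
        # interframe against the previous chunk's final state.
        heads = list(pool.map(
            lambda k: encode(encoded[k - 1][1], chunks[k][0]),
            range(1, len(chunks))))
    for k, head in enumerate(heads, start=1):
        encoded[k][0][0] = head  # the extra key frame is gone
    # Step 5, serial: rebase every later chunk onto the true running state.
    output, state = list(encoded[0][0]), encoded[0][1]
    for k in range(1, len(chunks)):
        for image, frame in zip(chunks[k], encoded[k][0]):
            frame = rebase(state, image, frame)
            state, _ = decode(state, frame)
            output.append(frame)
    return output

# 24 synthetic frames of 8 samples each, split into 4 chunks of 6 frames.
video = [[(3 * t + 7 * x) % 50 for x in range(8)] for t in range(24)]
chunks = [video[i:i + 6] for i in range(0, 24, 6)]
stream = excamera(chunks)
```

The resulting stream contains a single key frame followed by interframes, and it decodes serially from start to finish even though most of the encoding work ran in parallel.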

14.8-minute 4K Video @20dB
vpxenc Single-Threaded: 453 mins
vpxenc Multi-Threaded: 149 mins
YouTube (H.264): 37 mins
ExCamera[6, 16]: 2.6 mins

Takeaways
• Low-latency video processing
• Two major contributions:
  • Framework to run 5,000-way parallel jobs with IPC on a commercial “cloud function” service.
  • Purely functional video codec for massive fine-grained parallelism.
• 56× faster than existing encoder, for <$6.
https://ex.camera | excamera@cs.stanford.edu
