SLIDE 1

Confidential and Proprietary

GPU-BASED DEEP LEARNING IN CLOUD AND EMBEDDED SYSTEMS

FREDERICK SOO, CTO

April 4, 2016

SLIDE 2

Nauto is launching a connected camera for professional drivers

Drive more than most consumers

Exposed to passenger and driver liability

Driver quality unknown: a small number of very bad drivers

SLIDE 3

Massive shift in transportation due to synergistic technologies

Autonomous, Connected, Electric, Shared

90% reduction in accidents

$0.08 / mile

85% efficient drivetrain

50-70% utilization

Fleet optimization

SLIDE 4

Why use deep learning?

Good at visual tasks

Scalable

Deployable

Most important for Nauto

SLIDE 5

Small brains have a lot of functionality

Neurons          Power
1 million        1 mW
10 million       10 mW
100 million      100 mW
26 billion       20 W
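The neuron and power figures on this slide can be turned into a rough power budget per neuron. The sketch below assumes the counts and power figures pair up in the order they appear on the slide: 1 million neurons with 1 mW, and so on, with 26 billion neurons at 20 watts. It is back-of-the-envelope arithmetic, not a claim made in the talk.

```python
# Figures copied from the slide. Pairing neuron counts with power in the order
# they appear is an assumption about the original figure.
BRAINS = [
    ("1 million neurons", 1e6, 1e-3),      # 1 mW
    ("10 million neurons", 1e7, 10e-3),    # 10 mW
    ("100 million neurons", 1e8, 100e-3),  # 100 mW
    ("26 billion neurons", 26e9, 20.0),    # 20 W
]


def watts_per_neuron(neurons: float, watts: float) -> float:
    """Average power per neuron, in watts."""
    return watts / neurons


for label, neurons, watts in BRAINS:
    nanowatts = watts_per_neuron(neurons, watts) * 1e9  # convert W to nW
    print(f"{label}: {nanowatts:.2f} nW per neuron")
```

Under that pairing, every row comes out near 1 nW per neuron, with about 0.77 nW for the 26-billion-neuron row. That is one way to read the slide's point: a lot of functionality fits into a very small power budget.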

SLIDE 6

Required performance depends on use case
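One way to make "required performance depends on use case" concrete is to write the requirement down as thresholds. You can then pick the smallest model that meets them. In this sketch, the `Candidate` fields, the `pick_model` helper, and all numbers in the usage lines are hypothetical. In practice, F1 and latency would come from your own validation set and target device.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    name: str
    params_m: float    # model size, millions of parameters
    f1: float          # F1 measured on a validation set
    latency_ms: float  # forward-pass time measured on the target device


def pick_model(cands: Sequence[Candidate], min_f1: float,
               max_latency_ms: float) -> Optional[Candidate]:
    """Return the smallest candidate meeting both requirements, or None."""
    ok = [c for c in cands if c.f1 >= min_f1 and c.latency_ms <= max_latency_ms]
    return min(ok, key=lambda c: c.params_m, default=None)


# Made-up candidates for illustration only (not measurements from the talk).
cands = [
    Candidate("small", 1.0, 0.86, 12.0),
    Candidate("medium", 5.0, 0.89, 40.0),
    Candidate("large", 25.0, 0.90, 150.0),
]
print(pick_model(cands, min_f1=0.85, max_latency_ms=50.0).name)  # small
print(pick_model(cands, min_f1=0.88, max_latency_ms=50.0).name)  # medium
print(pick_model(cands, min_f1=0.90, max_latency_ms=50.0))       # None
```

A use case that tolerates a slightly lower F1 gets the small model. Tightening the F1 floor forces a larger one. The strictest combination shown here cannot be met by any candidate within the latency budget.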

SLIDE 7

Small changes in F1 with size

Large networks can be used in later stages of a cascade

Order-of-magnitude improvements in speed with basic exploration

Always worth measuring the performance/size tradeoff
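A cascade lets a large network run only on the inputs that a cheap network cannot rule out. F1 then measures what the cascade costs in accuracy. In the minimal sketch below, the two "models" are hypothetical scoring functions and the thresholds are illustrative. The talk does not specify how Nauto's cascade is built.

```python
from typing import Callable, List, Sequence, Tuple

# A "model" here is any function mapping an input to a score in [0, 1].
Model = Callable[[int], float]


def cascade_predict(items: Sequence[int], cheap: Model, expensive: Model,
                    cheap_thresh: float, final_thresh: float) -> Tuple[List[bool], int]:
    """Two-stage cascade: the small model screens every input, and only inputs
    it flags reach the large model. Returns predictions and large-model calls."""
    preds: List[bool] = []
    expensive_calls = 0
    for x in items:
        if cheap(x) < cheap_thresh:
            preds.append(False)  # rejected early; the large model never runs
            continue
        expensive_calls += 1
        preds.append(expensive(x) >= final_thresh)
    return preds, expensive_calls


def f1_score(y_true: Sequence[bool], y_pred: Sequence[bool]) -> float:
    """F1 = harmonic mean of precision and recall for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def small_net(x: int) -> float:
    return x / 10  # hypothetical cheap screening model


def large_net(x: int) -> float:
    return 1.0 if x >= 7 else 0.0  # hypothetical accurate, expensive model


items = list(range(10))
labels = [x >= 7 for x in items]  # toy ground truth
preds, calls = cascade_predict(items, small_net, large_net,
                               cheap_thresh=0.5, final_thresh=0.5)
print(f"F1={f1_score(labels, preds):.2f}, large-model calls={calls}/{len(items)}")
```

Raising `cheap_thresh` cuts large-model calls further but risks rejecting true positives early. Sweeping it while watching F1 is one way to measure the tradeoff.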

SLIDE 8

Test your chipsets: algorithm speed is important but not the entire story

[Chart: Nauto CNN forward pass time (msec) by embedded SoC, chipsets A to E]

Chipsets released in 2014, 2015 and 2016

Pricing varying from $25 to $60+

Varying degrees of HW/SW support
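Comparing chipsets comes down to timing the same forward pass on each board under the same conditions. Below is a minimal harness sketch. `fake_forward_pass` is a hypothetical stand-in. On a real embedded target you would time the call into the device's own inference runtime rather than pure Python.

```python
import statistics
import time
from typing import Callable, Dict


def benchmark_ms(fn: Callable[[], object], warmup: int = 5,
                 iters: int = 50) -> Dict[str, float]:
    """Call fn repeatedly and report latency statistics in milliseconds.

    Warm-up calls are discarded so one-time costs (lazy initialization,
    cold caches) do not skew the numbers."""
    if iters < 1:
        raise ValueError("iters must be >= 1")
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    return {
        "median_ms": statistics.median(samples),
        # Approximate 90th percentile (index rounded down).
        "p90_ms": samples[int(0.9 * (len(samples) - 1))],
        "max_ms": samples[-1],
    }


def fake_forward_pass() -> int:
    # Hypothetical stand-in for running the CNN once on the target device.
    return sum(i * i for i in range(10_000))


print(benchmark_ms(fake_forward_pass))
```

Reporting a median and a high percentile, rather than a single run, makes numbers from different boards easier to compare.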

SLIDE 9

Algorithm is not the bottleneck

Stage                      Time
Image processing           30 msec
Conversion to CNN space    30 msec
CNN forward pass           … msec
Other steps                15 msec
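Showing that the CNN is not the bottleneck requires timing every stage of the pipeline, not just the forward pass. The sketch below accumulates wall-clock time per named stage. The stage names mirror the slide. The `StageTimer` class and the list-based stage bodies are hypothetical stand-ins for real image processing and inference code.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator


class StageTimer:
    """Accumulates wall-clock time (ms) per named pipeline stage."""

    def __init__(self) -> None:
        self.totals_ms: Dict[str, float] = {}

    @contextmanager
    def stage(self, name: str) -> Iterator[None]:
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed = (time.perf_counter() - start) * 1000.0
            self.totals_ms[name] = self.totals_ms.get(name, 0.0) + elapsed

    def report(self) -> None:
        """Print stages from slowest to fastest with their share of the total."""
        total = sum(self.totals_ms.values())
        for name, ms in sorted(self.totals_ms.items(), key=lambda kv: kv[1], reverse=True):
            share = 100.0 * ms / total if total > 0 else 0.0
            print(f"{name:<26}{ms:9.2f} ms {share:6.1f}%")


timer = StageTimer()
frame = list(range(200_000))  # hypothetical stand-in for a raw camera frame
with timer.stage("Image processing"):
    frame = [v % 256 for v in frame]
with timer.stage("Conversion to CNN space"):
    tensor = [v / 255.0 - 0.5 for v in frame]
with timer.stage("CNN forward pass"):
    score = sum(tensor) / len(tensor)
with timer.stage("Other steps"):
    alert = score > 0.1
timer.report()
```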

SLIDE 10

Entire system must be optimized

                 Collect data   Label    Train    Deploy
Pre-GPU          years          months   months   months/years

SLIDE 11

Entire system must be optimized

                 Collect data   Label    Train    Deploy
Pre-GPU          years          months   months   months/years
Post-GPU         weeks          months   months   months/years

SLIDE 12

Entire system must be optimized

                 Collect data   Label    Train    Deploy
Pre-GPU          years          months   months   months/years
Post-GPU         weeks          months   months   months/years
Nauto prototype  days           weeks    weeks    weeks

SLIDE 13

Entire system must be optimized

                 Collect data   Label    Train    Deploy
Pre-GPU          years          months   months   months/years
Post-GPU         weeks          months   months   months/years
Nauto prototype  days           weeks    weeks    weeks
Nauto at-scale   ?              ?        ?        ?

SLIDE 14

Easy to think about optimization; hard to think about the system

Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.

Donald Knuth

SLIDE 15

Lessons

Embedded pipeline is as important as raw CNN performance

Match algorithm performance to the use case

Overall system performance (data acquisition, labeling, training) is where the big progress is to be made

SLIDE 16

The future is in distributed awareness

Real world search

SLIDE 17

Team

Ludmila Levkova, Nikhil Deshmukh, Joe Virzi, Jonathan Soo