Green Multicore, David Moloney, CTO, Movidius, 24 November 2011



slide-1
SLIDE 1

Green Multicore

David Moloney, CTO, Movidius 24 November 2011

slide-2
SLIDE 2
Overview

  • Fabless semiconductor company founded in 2005
– VC backed (completing C-round today @ 12:00)
– Focus on computational imaging and video
  • Uniquely positioned for this market with a software-programmable media processor with state-of-the-art GOPS/W performance
– Enables SW derivatives of the base silicon platform
– Current 65nm product in mass-production and expected to ship 1-3M qty in 2012
– Next gen 28nm product in design will deliver the power of a desktop GPU in an 8x8mm BGA @ 350mW

slide-3
SLIDE 3

Myriad of Applications

Mobile phones, Video/DSC Cameras, Camera Modules, Wireless Cameras, Computational Cameras, Consumer Electronics, Robotics, Medical, HPC, Aerospace, Automotive

slide-4
SLIDE 4

Technology - Platform Approach

[Diagram labels: Silicon Platform, Applications, Foundation Technology, Software Modules, Products. Example applications: Video Edit, 3D Video, 3D Capture, Anaglyph-3D]

slide-5
SLIDE 5

Mobile Video Processing Workload

slide-6
SLIDE 6

GPU FLOPS/W Trend

[Chart: GPU GFLOPS/W Historical Trend. Y-axis: GFLOPS/W, ticks 1 to 7. X-axis: Nvidia GPUs from G 100 through GTX 480]

GPU GFLOPS/W growing @ 1.4x per year

slide-7
SLIDE 7

Movidius SHAVE Processor

  • Unique proprietary architecture
– Tailored to streaming workloads and architected for outstanding OPS/mW/$ performance
  • Streaming Hybrid Architecture Vector Engine
– Hybrid of RISC, DSP, VLIW & GPU architectural features
– 128-bit vector arithmetic: 8/16/32-bit INT & fp16/fp32
  • Excellent graphics and matrix mathematics support
– HW texture unit for good graphics performance
– Predicated execution to eliminate branches
– Compiler-friendly architecture
– HW support for compressed data-structures (ex. matrices)
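The "predicated execution to eliminate branches" point can be illustrated with a minimal Python sketch. This is not SHAVE code: the function names and the 32-bit lane width are assumptions chosen for illustration. The comparison result becomes an all-ones or all-zeros mask, and that mask selects between two already-computed values, so no data-dependent branch is needed.

```python
# Illustrative sketch only: emulates predicated (branch-free) selection on
# 32-bit integer lanes. Names and lane width are assumptions, not SHAVE ISA.

MASK32 = 0xFFFFFFFF  # one 32-bit lane


def predicate_mask(condition: bool) -> int:
    """Return all-ones if the condition holds, else all-zeros."""
    return -int(condition) & MASK32


def select(mask: int, if_true: int, if_false: int) -> int:
    """Keep if_true where mask is all-ones, if_false where it is all-zeros."""
    return (mask & if_true) | (~mask & MASK32 & if_false)


def clamp_branchy(values, limit):
    """Reference version with a data-dependent branch per element."""
    out = []
    for v in values:
        if v > limit:
            out.append(limit)
        else:
            out.append(v)
    return out


def clamp_predicated(values, limit):
    """Same result; the comparison feeds a mask instead of a branch."""
    return [select(predicate_mask(v > limit), limit, v) for v in values]


pixels = [12, 300, 255, 256, 0]
assert clamp_predicated(pixels, 255) == clamp_branchy(pixels, 255)
```

On hardware with predication the mask would typically live in a condition register and apply to every lane of a vector at once; the list comprehension here only stands in for that.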

slide-8
SLIDE 8

Myriad Silicon Platform

[Block diagram: eight SHAVE engines (SVE0 to SVE7), each with a 128kB CMX tile, a TMU and an L1 cache, connected over a main bus (bus widths labelled 128, 64 and 32) to an L2 cache, a DDR interface and a stacked 16/64MB SDRAM die. A RISC core, a bridge and SW-controlled I/O multiplexing serve the peripherals: MEBI, NAL, SEBI, SDIO, SPI, LCD, Cam, USB2 OTG, I2C, I2S, UART, JTAG, TIM, GPS, TS, FLSH. Legend: Movidius IP]

50GFLOPS/W (IEEE 754 SP)

slide-9
SLIDE 9

65nm Myriad SoC

[Diagram: 65nm Myriad SoC memory hierarchy and SHAVE processor internals.
SHAVE processor: variable-length instructions; register files VRF 32x128, SRF 32x32, IRF 32x32; units PEU, BRU, LSU0, LSU1, IAU, SAU, VAU, CMU, plus IDC and DCU; TMU with 1kB cache; 1k L1.
Memory: 128kB SRAM tile per SHAVE; 128kB 2-way L2; 128-bit AXI SHAVE bus; Myriad DDR2 controller; 16/64MB SDRAM die.
Clock labels: 180MHz. Bandwidth labels: 1.5GB/Sec, 2.9GB/Sec, 5.8GB/Sec, 5.8GB/Sec, 8.6GB/Sec, 12.2GB/Sec, 17.3GB/Sec]
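Several of the bandwidth labels on this slide are consistent with a simple width-times-clock calculation. The sketch below is illustrative arithmetic only: the 128-bit width and 180MHz clock come from the slide, but the function name and the idea that a label corresponds to a whole number of 128-bit ports are assumptions, since the diagram itself does not survive.

```python
# Illustrative arithmetic: peak bus bandwidth = width in bytes x clock rate.
# The 128-bit width and 180MHz clock are slide labels; the port counts tried
# below are an assumption made to compare against the bandwidth labels.

def bus_bandwidth_gb_per_s(width_bits: int, clock_mhz: float, ports: int = 1) -> float:
    """Peak rate in GB/s (1 GB = 1e9 bytes), assuming one transfer per cycle."""
    bytes_per_cycle = width_bits / 8
    return ports * bytes_per_cycle * clock_mhz * 1e6 / 1e9


for ports in (1, 2, 3, 6):
    rate = bus_bandwidth_gb_per_s(128, 180, ports)
    print(f"{ports} x 128-bit @ 180MHz: {rate:.2f} GB/s")
```

One 128-bit port at 180MHz gives 2.88 GB/s, which rounds to the 2.9GB/Sec label; two, three and six ports give 5.76, 8.64 and 17.28 GB/s, close to the 5.8, 8.6 and 17.3GB/Sec labels. The 1.5 and 12.2GB/Sec labels do not fall out of this simple model.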

slide-10
SLIDE 10

Myriad GOPS/Watt (Arithmetic)

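[Chart: Myriad GOPS/W (arithmetic) by data type, covering the VAU, SAU and IAU arithmetic units (SHAVE unit labels: PEU, LSU0, LSU1, BRU, VAU, SAU, IAU, CMU). Axis: GOPS/W (arith), ticks 20 to 200. Data labels: int8 181, int16 91, int32 45, fp16 99, fp32 49. Per-unit operation-count labels as extracted: 32 16 8 16 8 8 4 2 8 4 4 2 1]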
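The GOPS/W figures on this chart roughly halve each time the element width doubles, which is what a fixed 128-bit vector width would predict. The sketch below is a hedged check of that reading: the 128-bit width comes from the SHAVE slide, while pairing the five data labels with the five data types in axis order is an assumption.

```python
# Illustrative check: lanes per 128-bit vector versus the GOPS/W data labels.
# Pairing each label with a data type in axis order is an assumption.

VECTOR_BITS = 128  # vector width stated on the SHAVE slide

ELEMENT_BITS = {"int8": 8, "int16": 16, "int32": 32, "fp16": 16, "fp32": 32}
GOPS_PER_WATT = {"int8": 181, "int16": 91, "int32": 45, "fp16": 99, "fp32": 49}

# Number of elements processed per vector operation for each data type.
lanes = {name: VECTOR_BITS // bits for name, bits in ELEMENT_BITS.items()}

for name in ELEMENT_BITS:
    per_lane = GOPS_PER_WATT[name] / lanes[name]
    print(f"{name:6s} lanes={lanes[name]:2d} GOPS/W={GOPS_PER_WATT[name]:3d} per lane={per_lane:.1f}")
```

Under that pairing the per-lane figure stays between about 11 and 12.5 GOPS/W for every type, so the efficiency difference between types is explained almost entirely by how many elements fit in one vector.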

slide-11
SLIDE 11

Myriad 65nm CMOS LP Die

[Die photo: eight SHAVE processors, eight CMX tiles, a RISC sub-system and an analog block. Package drawing: Myriad die with a 16MB stacked SDRAM die, with ball-map labels 1 to 15 and A1 to A15]

Author             Year  FLOPS/core  Cores  GFLOPS  W      GFLOPS/W
Movidius (Myriad)  2011  12          8      17.28   0.35   49.4
KAIST (1)          2011  -           -      5.8     0.28   21.1
Intel (2)          2007  -           80     1000    98.00  10.2
Adapteva (3)       2010  2           16     24.96   1.00   25.0
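The Myriad row of the comparison table can be reproduced from figures elsewhere in the deck. The sketch below is illustrative arithmetic, not a vendor formula: it assumes "FLOPS/core" means floating-point operations per cycle per core, and it takes the 180MHz clock from the 65nm SoC slide.

```python
# Illustrative arithmetic for the Myriad row of the comparison table.
# Assumptions: "FLOPS/core" is per cycle, and the clock is the 180MHz
# shown on the 65nm SoC slide. Function names are hypothetical.

def peak_gflops(flops_per_cycle_per_core: float, cores: int, clock_ghz: float) -> float:
    """Peak throughput in GFLOPS with every core busy on every cycle."""
    return flops_per_cycle_per_core * cores * clock_ghz


def gflops_per_watt(gflops: float, watts: float) -> float:
    """Energy efficiency as throughput divided by power."""
    return gflops / watts


myriad_gflops = peak_gflops(12, 8, 0.180)
myriad_efficiency = gflops_per_watt(myriad_gflops, 0.35)

print(f"Myriad peak: {myriad_gflops:.2f} GFLOPS")
print(f"Myriad efficiency: {myriad_efficiency:.1f} GFLOPS/W")
```

With these inputs the peak is 17.28 GFLOPS and the efficiency is about 49.4 GFLOPS/W, matching the table. These are peak figures, so sustained application throughput would be lower.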
slide-12
SLIDE 12

Now I’ve got a Green Compute Platform

What can I do with it?

slide-13
SLIDE 13

MA1135 - 3D Converter Box Application

[Diagram: image “stripes”; HDMI in, HDMI out]

slide-14
SLIDE 14

Myriad Example Applications

[Figure: several example applications, each drawn as a row of SHAVE 1 to SHAVE 8]

slide-15
SLIDE 15

Application Development

  • App
– X86 C-code
– Visual Studio
  • Partition
– Intel Parallel Studio
– Refactor design
– Data-layout
– Use of DMA to handle streams
  • Compile
– Movidius C-compiler (LLVM)
– SHAVE0 to SHAVE7
  • Profile
– Movidius Profiler
  • Optimize
– Movidius Assembly Optimizer
– Code transformations
– Loop unrolling
– Inline assembler
  • Power
– Data-layout
– DRAM access
– DMA for streams
– Run quickly and switch off to minimize leakage
– Optimise clock-rates for each SHAVE
– Power-off domains

slide-16
SLIDE 16

Lightfield Requirements

  • Replaces glass with SW
– CUDA implementation of Georgiev (Adobe) LF algorithm
– Very computationally expensive
– Interpolation is the key kernel
– GeForce GT120 at 130 GFLOPs and 50W (2.6 GFLOPs/W)
  • http://en.wikipedia.org/wiki/GeForce_100_Series
– GPU completes refocusing in 30ms (33.3fps)
– 4fps on Myriad 65nm

Lytro Raytrix
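The frame rates on this slide favour the GPU, but the energy spent per refocused frame tells a different story. The sketch below is a rough estimate under stated assumptions: the GPU is taken to draw its full 50W while refocusing, and Myriad is taken to draw the 0.35W listed in the comparison table on the die slide. Neither figure is a measured power for this workload.

```python
# Rough energy-per-frame estimate for lightfield refocusing.
# Assumptions: GPU draws its full 50W; Myriad draws the 0.35W from the
# comparison table. Both are stand-ins, not measurements of this workload.

def energy_per_frame_joules(power_watts: float, frames_per_second: float) -> float:
    """Energy spent on each frame at a steady power draw and frame rate."""
    return power_watts / frames_per_second


gpu_fps = 1 / 0.030          # 30ms per refocus, about 33.3fps
gpu_joules = energy_per_frame_joules(50.0, gpu_fps)
myriad_joules = energy_per_frame_joules(0.35, 4.0)
ratio = gpu_joules / myriad_joules

print(f"GPU:    {gpu_joules:.3f} J/frame")
print(f"Myriad: {myriad_joules:.4f} J/frame")
print(f"GPU uses about {ratio:.0f}x more energy per frame")
```

Under these assumptions the GPU spends about 1.5 J per frame and Myriad about 0.09 J per frame, roughly a 17x difference, even though the GPU finishes each frame about 8x sooner.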

slide-17
SLIDE 17

Performance Roadmap (Nvidia)

http://bit.ly/t6zo2j

slide-18
SLIDE 18

Fragrak 28nm Platform

[Block diagram: SHAVE engines arranged in four clusters (numbered up to 04, 08, 12 and 16), each cluster with an ICB and CMX tiles of 128kB or 256kB, joined by an XCB. A main bus (bus widths labelled 128 and 64) connects an L2 of 512kB, DDR3 LP and a stacked 256/512MB SDRAM die. A RISC core, a bridge and SW-controlled I/O multiplexing serve the peripherals: MEBI, NAL, SEBI, SDIO, SPI, LCD, MIPI DSI 2x, MIPI CSI 2x, USB2 OTG, I2C, I2S, UART, JTAG, TIM, GPS, TS, FLSH. Legend: Movidius IP]

450GFLOPS/W (IEEE 754 SP)

slide-19
SLIDE 19

GFLOPS/W in Context

GPU rate of increase: 1.4x per year. 7 years to hit 50GFLOPS/W!

[Chart: GFLOPS/W on a log axis from 0.10 to 1000.00 for Nvidia GeForce, Tesla and Fermi GPUs (GeForce G 100 through Fermi GTX 480), followed by Myriad (Movidius 65nm, 2011) and Myriad2 (Movidius 28nm, 2012). Data labels: 0.40, 2.02, 3.95, 4.99, 6.05, 6.19, 49.37, 438.86]
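The "7 years" figure follows from compound growth. The sketch below is a worked check, assuming the 6.19 data label is the best GPU on the chart; the function name is hypothetical.

```python
import math

# Worked check of the "7 years to hit 50GFLOPS/W" claim.
# Assumption: the 6.19 data label is the most efficient GPU on the chart.

def years_to_reach(current: float, target: float, growth_per_year: float) -> float:
    """Years of compound growth needed to go from current to target."""
    return math.log(target / current) / math.log(growth_per_year)


best_gpu_gflops_per_watt = 6.19
years = years_to_reach(best_gpu_gflops_per_watt, 50.0, 1.4)

print(f"Exact: {years:.2f} years, i.e. {math.ceil(years)} whole years")
```

At 1.4x per year, 6.19 GFLOPS/W passes 50 GFLOPS/W after a little over six years, so the seventh yearly step is the first to clear it.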

slide-20
SLIDE 20

Myriad of Cameras – 1 Platform

  • Standard camera
– All optical focusing: bulky lenses & autofocus for close-ups
– Wide aperture good for low-light but limits depth-of-field
– Scale and cost due to established manufacturing processes
  • Lightfield camera (Plenoptic = Lightfield)
– Post-capture refocusing in software (Lytro)
– Computationally expensive (GPU-based = cloud)
– Decouples aperture from Depth of Field (DoF)
  • Array Camera (Stereo is a 2x1 special case)
– Uses array of MxN completely focused cameras
– Composite & interpolate array of low-res cameras (Levoy)
– Individual camera control allows: HDR capture, fault-tolerance, slow-motion, power-saving etc.

slide-21
SLIDE 21

Movidius Computational Imaging

[Diagram labels: Silicon Platform, Applications, Foundation Technology, Software Modules, Products. Tiny 8x8mm Myriad BGA serving Conventional Cameras, 3D Stereo Cameras, Lightfield Cameras and Array Cameras]

slide-22
SLIDE 22

Summary

  • Movidius 65nm silicon platform
– Ground-breaking functionality in SW
– Enabled by ground-breaking GFLOPS/W
– Compact form-factor
– In mass-production today
– 10x better GFLOPS/W than GPU
  • Next generation 28nm SoC
– 9x perf/watt available in 2012
– 100x better GFLOPS/W than GPU

slide-23
SLIDE 23

Any questions?

The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement n°248481 (PEPPHER Project, www.peppher.eu)

slide-24
SLIDE 24

References

1) H-E. Kim, J-S. Yoon, K-D. Hwang, Y-J. Kim, J-S. Park, L-S. Kim, "A 275mW Heterogeneous Multimedia Processor for IC-Stacking on Si-Interposer", Proc. ISSCC 2011
2) S. Vangal, J. Howard, G. Ruhl, S. Dighe, H. Wilson, J. Tschanz, D. Finan, P. Iyer, A. Singh, T. Jacob, S. Jain, S. Venkataraman, Y. Hoskote and N. Borkar, "An 80-Tile 1.28TFLOPS Network-on-Chip in 65nm CMOS", Proc. ISSCC 2007, pp.5-7
3) A. Olofsson, R. Trogan, O. Raikhman, "A 25 GFLOPS/Watt Software Programmable Floating Point Accelerator", HPEC 2010, 15-16 Sep 2010
4) C.Y. Park, N.I. Cho, "A fast algorithm for the conversion of DCT coefficients to H.264 transform coefficients", ICIP 2005 Proceedings, pp.664-7