Green Multicore
David Moloney, CTO, Movidius 24 November 2011
Overview
– Fabless semiconductor company founded in 2005
– VC backed (completing C-round today @ 12:00)
– Focus on computational imaging and video
– Uniquely positioned for …
– … software-programmable media processor with state-of-the-art GOPS/W performance
– Enables SW derivatives of the base silicon platform
– Current 65nm product in mass production and expected to ship 1-3M qty in 2012
– Next-gen 28nm product in design will deliver the power of a desktop GPU in an 8x8mm BGA @ 350mW
Mobile phones Video/DSC Cameras Camera Modules Wireless Cameras Computational Cameras Consumer Electronics Robotics Medical HPC Aerospace Automotive
Silicon Platform Applications Foundation Technology Software Modules Products
Video Edit 3D Video 3D Capture Anaglyph-3D
[Chart: GFLOPS/W for NVIDIA GeForce GPUs, G 100 through GTX 480; vertical axis ticks 1 to 7]
GPU GFLOPS/W Historical Trend
GPU GFLOPS/W Growing @ 1.4x per Year
– Tailored to streaming workloads and architected for
– Hybrid of RISC, DSP, VLIW & GPU architectural features
– 128-bit vector arithmetic: 8/16/32-bit INT & fp16/fp32
– HW texture unit for good graphics performance
– Predicated execution to eliminate branches
– Compiler-friendly architecture
– HW support for compressed data-structures (e.g. matrices)
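Predicated execution can be pictured as computing both candidate results and selecting one with a 0/1 predicate instead of jumping. The Python sketch below illustrates only that idea; it does not model SHAVE instructions, and the function names are hypothetical.

```python
def clamp_with_branch(values, limit):
    """Reference version: one data-dependent branch per element."""
    out = []
    for v in values:
        if v > limit:
            out.append(limit)
        else:
            out.append(v)
    return out


def clamp_with_predicate(values, limit):
    """Branch-free loop body: a 0/1 predicate selects between two candidates."""
    out = []
    for v in values:
        p = int(v > limit)                   # predicate: 1 if the clamp applies, else 0
        out.append(p * limit + (1 - p) * v)  # select without an if/else
    return out


samples = [3, 250, 17, 300]
assert clamp_with_branch(samples, 255) == clamp_with_predicate(samples, 255)
```

Both versions return the same values; the second replaces the conditional jump with arithmetic on the predicate, which is the property a predicated instruction set exploits.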
[Block diagram: Myriad SoC]
– 50GFLOPS/W (IEEE 754 SP)
– Stacked 16/64MB SDRAM die; DDR; L2 cache
– 8 SHAVE processors (SVE0 to SVE7), each with a 128kB CMX tile, TMU and L1
– RISC, bridge and Main Bus; bus widths labelled 128, 64 and 32
– SW-controlled I/O multiplexing: MEBI, NAL, SEBI, SDIO, SPI, LCD, Cam, USB2 OTG, I2C, I2S, UART, JTAG, TIM, GPS, TS, FLSH
– Movidius IP
[Diagram: SHAVE processor and memory hierarchy]
– SHAVE processor: variable-length instructions; register files VRF 32x128, SRF 32x32, IRF 32x32; units PEU, BRU, LSU0, LSU1, VAU, SAU, IAU, CMU, IDC, DCU; TMU with 1kB cache
– 128kB SRAM tile per SHAVE; 1k L1
– 128-bit AXI SHAVE bus; 128kB 2-way L2; Myriad DDR2 controller; 16/64MB SDRAM die
– Clock: 180MHz
– Bandwidth labels: 1.5GB/Sec, 2.9GB/Sec, 5.8GB/Sec, 5.8GB/Sec, 8.6GB/Sec, 12.2GB/Sec, 17.3GB/Sec
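As a rough cross-check of the bandwidth labels, a 128-bit path clocked at 180MHz moves 16 bytes per cycle. The sketch below assumes peak bandwidth is simply width times clock; which link each label belongs to is not recoverable from the slide, so the mapping to the 2.9GB/Sec and 5.8GB/Sec figures is an assumption, and the function name is hypothetical.

```python
def bandwidth_gb_per_s(bus_width_bits, clock_mhz, transfers_per_cycle=1):
    """Peak bandwidth in GB/s, assuming one full-width transfer per cycle per port."""
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * clock_mhz * 1e6 * transfers_per_cycle / 1e9


single = bandwidth_gb_per_s(128, 180)                         # one 128-bit port
double = bandwidth_gb_per_s(128, 180, transfers_per_cycle=2)  # two ports (assumed)
print(f"{single:.2f} GB/s, {double:.2f} GB/s")
```

The results, 2.88 and 5.76 GB/s, are consistent with the 2.9GB/Sec and 5.8GB/Sec labels after rounding.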
[Diagram: SHAVE functional units: PEU, LSU0, LSU1, BRU, VAU, SAU, CMU, IAU]
[Chart: OP/W (arith) for VAU, SAU and IAU by data type; axis 20 to 200; data labels 32 16 8 16 8 8 4 2 8 4 4 2 1]

Myriad GOPS/W (arith)
int8    181
int16    91
int32    45
fp16     99
fp32     49
[Floorplan labels: 8 SHAVE processors, 8 CMX tiles, RISC sub-system, Analog]
Ref  Author              Year  FLOPS/core  Cores  GFLOPS  W      GFLOPS/W
     Movidius (Myriad)   2011  12          8      17.28   0.35   49.4
(1   KAIST               2011  -           -      5.8     0.28   21.1
(2   Intel               2007  -           80     1000    98.00  10.2
(3   Adapteva            2010  2           16     24.96   1.00   25.0

16MB Stacked SDRAM
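The Myriad row of the comparison can be reproduced from its own columns. The sketch below assumes the FLOPS/core column means floating-point operations per cycle per core and that the cores run at the 180MHz clock shown elsewhere in the deck; the function names are hypothetical.

```python
def peak_gflops(flops_per_cycle_per_core, cores, clock_ghz):
    """Peak GFLOPS = operations per cycle per core x cores x clock in GHz."""
    return flops_per_cycle_per_core * cores * clock_ghz


def gflops_per_watt(gflops, watts):
    """Energy efficiency as peak throughput divided by power."""
    return gflops / watts


myriad_gflops = peak_gflops(12, 8, 0.18)           # 180MHz clock is an assumption here
myriad_eff = gflops_per_watt(myriad_gflops, 0.35)  # 0.35W from the W column
print(f"{myriad_gflops:.2f} GFLOPS, {myriad_eff:.1f} GFLOPS/W")
```

Under those assumptions the result matches the 17.28 GFLOPS and 49.4 GFLOPS/W entries for Myriad.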
[Package diagram: Myriad die with 16MB SDRAM die; BGA ball map with columns 1 to 15 and balls A1 to A15]

What can I do with it?
image “stripes” HDMI in HDMI
[Diagram labels: SHAVE 1 through SHAVE 8, repeated across the figure]
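One way to read the image "stripes" label is that each frame is cut into horizontal bands, one per SHAVE. The slides do not give the actual partitioning scheme, so the following is only a sketch under that assumption, with a hypothetical helper name.

```python
def split_into_stripes(image_height, num_workers=8):
    """Return (start_row, end_row) pairs, end exclusive, one stripe per worker.

    Rows that do not divide evenly are handed out one at a time
    to the first few workers.
    """
    base, extra = divmod(image_height, num_workers)
    stripes = []
    start = 0
    for worker in range(num_workers):
        rows = base + (1 if worker < extra else 0)
        stripes.append((start, start + rows))
        start += rows
    return stripes


stripes = split_into_stripes(1080)  # a 1080-row frame across 8 workers
for worker, (first, last) in enumerate(stripes, start=1):
    print(f"SHAVE {worker}: rows {first}..{last - 1}")
```

For a 1080-row frame each of the 8 stripes is 135 rows, so every worker receives the same amount of data.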
App
– X86 C-code
– Visual Studio

Partition
– Intel Parallel Studio
– Refactor design
– Data-layout
– Use of DMA to handle streams

Compile
– Movidius C-compiler (LLVM)
– SHAVE0 to SHAVE7

Profile
– Movidius Profiler

Optimize
– Movidius Assembly Optimizer
– Code transformations
– Loop unrolling
– Inline assembler

Power
– Data-layout
– DRAM access
– DMA for streams
– Run quickly and switch off to minimize leakage
– Optimise clock-rates for each SHAVE
– Power-off domains
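The "run quickly and switch off" point can be illustrated with a simple energy model. The sketch below assumes a fixed supply voltage (so energy per cycle does not change with clock rate) and zero leakage once the domain is powered off; all numeric values are hypothetical and chosen only for illustration.

```python
def task_energy_joules(cycles, energy_per_cycle_j, leakage_w, clock_hz):
    """Energy for a fixed amount of work, powering off as soon as it finishes."""
    run_time_s = cycles / clock_hz
    dynamic_j = cycles * energy_per_cycle_j  # independent of clock at fixed voltage
    leakage_j = leakage_w * run_time_s       # leakage only accrues while powered
    return dynamic_j + leakage_j


CYCLES = 18_000_000        # hypothetical workload
ENERGY_PER_CYCLE_J = 1e-9  # hypothetical dynamic energy per cycle
LEAKAGE_W = 0.05           # hypothetical leakage power while the domain is on

fast = task_energy_joules(CYCLES, ENERGY_PER_CYCLE_J, LEAKAGE_W, 180e6)
slow = task_energy_joules(CYCLES, ENERGY_PER_CYCLE_J, LEAKAGE_W, 90e6)
print(f"fast: {fast * 1e3:.1f} mJ, slow: {slow * 1e3:.1f} mJ")
```

In this model the dynamic energy is identical in both cases, so the faster run wins purely by leaking for half as long. The model ignores voltage scaling, which is where per-SHAVE clock-rate tuning could change the balance.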
– CUDA implementation of Georgiev (Adobe) LF algorithm
– Very computationally expensive
– Interpolation key kernel
– GeForce GT120 at 130 GFLOPs and 50W (2.6GFLOPs/W)

– GPU completes refocusing in 30ms (33.3fps)
– 4fps on Myriad 65nm
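Frame rate alone hides the energy cost of each refocused frame. The sketch below converts the quoted figures into joules per frame, assuming the GPU draws its full 50W and Myriad draws the 0.35W listed in the comparison table for the whole frame time; neither power figure was measured for this workload, and the function name is hypothetical.

```python
def joules_per_frame(power_w, frames_per_second):
    """Energy per frame, assuming constant power draw while processing."""
    return power_w / frames_per_second


gpu_fps = 1000 / 30                          # 30ms per frame, about 33.3fps
gpu_j = joules_per_frame(50.0, gpu_fps)      # GT120 at 50W (assumed constant)
myriad_j = joules_per_frame(0.35, 4.0)       # Myriad 65nm at 0.35W (assumed)
print(f"GPU: {gpu_j:.3f} J/frame, Myriad: {myriad_j:.4f} J/frame")
print(f"ratio: {gpu_j / myriad_j:.1f}x")
```

Under these assumptions the GPU spends about 1.5J per frame and Myriad about 0.09J, so the slower device still uses far less energy per frame.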
Lytro Raytrix
http://bit.ly/t6zo2j
[Block diagram: 16-SHAVE SoC]
– 450GFLOPS/W (IEEE 754 SP)
– Stacked 256/512MB SDRAM die; DDR3 LP; L2 512kB
– 16 SHAVE processors in four groups of four, each group on an ICB, joined by an XCB; CMX tiles labelled 128kB and 256kB
– RISC, bridge and Main Bus; bus widths labelled 128 and 64
– SW-controlled I/O multiplexing: MEBI, NAL, SEBI, SDIO, SPI, LCD, MIPI DSI, MIPI CSI, USB2 OTG, I2C, I2S, UART, JTAG, TIM, GPS, TS, FLSH
– Movidius IP
GPU rate of increase 1.4x per Year
7 Years to hit 50GFLOPS/W!
[Chart: GFLOPS/W on a log scale (0.10 to 1000.00) for GeForce, Tesla and Fermi GPUs from the GeForce G 100 to the Fermi GTX 480, plus Myriad and Myriad2; data labels 0.40, 2.02, 3.95, 4.99, 6.05, 6.19, 49.37, 438.86]
Movidius 65nm 2011 Movidius 28nm 2012
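The "7 years" figure follows from compound growth at 1.4x per year. The sketch below assumes the starting point is 6.19 GFLOPS/W, taken here to be the best GPU value among the chart's data labels; that mapping is an assumption, and the function name is hypothetical.

```python
import math


def years_to_reach(start, target, growth_per_year):
    """Years n such that start * growth_per_year**n equals target."""
    return math.log(target / start) / math.log(growth_per_year)


years = years_to_reach(6.19, 50.0, 1.4)  # 6.19 GFLOPS/W start is assumed
whole_years = math.ceil(years)           # round up to full yearly steps
print(f"{years:.2f} years, i.e. {whole_years} yearly steps")
```

With that starting point the model gives a little over six years, which rounds up to the seven yearly steps quoted on the slide.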
– All optical focusing: bulky lenses & autofocus for close-ups
– Wide aperture good for low-light but limits depth-of-field
– Scale and cost due to established manufacturing processes

– Post-capture refocusing in software (Lytro)
– Computationally expensive (GPU-based = cloud)
– Decouples aperture from Depth of Field (DoF)

– Uses array of MxN completely focused cameras
– Composite & interpolate array of low-res cameras (Levoy)
– Individual camera control allows: HDR capture, fault-tolerance, slow-motion, power-saving etc.
Silicon Platform Applications Foundation Technology Software Modules Products
Tiny 8x8mm Myriad BGA Conventional Cameras 3D Stereo Cameras Lightfield Cameras Array Cameras
– Ground-breaking functionality in SW
– Enabled by ground-breaking GFLOPS/W
– Compact form-factor
– In mass-production today
– 10x better GFLOPS/W than GPU

– 9x perf/watt available in 2012
– 100x better GFLOPS/W than GPU
Any questions?
The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement n°248481 (PEPPHER Project, www.peppher.eu)
1) H-E. Kim, J-S. Yoon, K-D. Hwang, Y-J. Kim, J-S. Park, L-S. Kim, "A 275mW heterogeneous multimedia processor for IC-stacking on Si-interposer", Proc. ISSCC 2011
2) S. Vangal, J. Howard, G. Ruhl, S. Dighe, H. Wilson, J. Tschanz, D. Finan, P. Iyer, A. Singh, T. Jacob, S. Jain, S. Venkataraman, Y. Hoskote and N. Borkar, "An 80-Tile 1.28TFLOPS Network-
3) "A 25 GFLOPS/Watt Software Programmable Floating Point Accelerator", HPEC 2010, 15-16 Sep 2010
4) C.Y. Park, N.I. Cho, "A fast algorithm for the conversion of DCT coefficients to H.264 transform coefficients", ICIP 2005 Proceedings, pp. 664-7