GTC 2015 Session S5429: Creating Dense Mixed GPU and FPGA Systems


GTC 2015 Session S5429 Creating Dense Mixed GPU and FPGA Systems With Tegra K1s Using OpenCL & CUDA Lance Brown, Director - HPC ColoradoEngineering.com Lance.brown@coloradoengineering.com 719-641-7287 Cell 27 March 2015


SLIDE 1

GTC 2015 – Session S5429

Creating Dense Mixed GPU and FPGA Systems With Tegra K1s Using OpenCL & CUDA

Lance Brown, Director - HPC ColoradoEngineering.com Lance.brown@coloradoengineering.com 719-641-7287 Cell

27 March 2015 ColoradoEngineering.com - Public Release 1

SLIDE 2

We Can Solve Really Cool Problems Now

  • Heterogeneous computing is more than CPU + GPU
  • ARM processors changed the game
  • NVIDIA - GPU + ARM - CUDA
  • TI - DSP + ARM - OpenCL
  • Altera - FPGA + ARM - OpenCL
  • Scalable from handheld to Enterprise & HPC

SLIDE 3

Why Listen to CEI?

  • Been using FPGAs since 1985
  • Been solving massively parallel problems for over 30 years
  • We have designed, and are designing, multiple 24- and 32-layer boards featuring Altera FPGAs & NVIDIA GPUs
  • Early adopters of new technologies and experts at marrying existing technologies in new ways

SLIDE 4

Game Changer #1: Altera’s Hard Floating Point Unit IP & OpenCL

  • FPGAs have traditionally supported soft floating point
  • Altera introduced IEEE 754 Hard Floating Point with Arria 10
  • Arria 10 FPGAs are rated from 140 GigaFLOPS (GFLOPS) to 1.5 TeraFLOPS (TFLOPS)
  • Details at: https://www.altera.com/en_US/pdfs/literature/po/bg-floating-point-fpga.pdf

  • OpenCV & Suricata Implementations Using OpenCL
  • Partial Reconfiguration for Streamlined OpenCL Development
  • Next-generation Stratix 10 moves to Intel’s 14 nm Tri-Gate (FinFET) process; Arria 10 itself is built on TSMC’s 20 nm process

SLIDE 5

Game Changer #2: NVIDIA Makes Tegra K1 Available

  • GPU + ARM @ low power
  • Very important – camera interfaces galore
  • Can do significant processing at each edge node now
  • Jetson Kit – awesome eval kit & affordable
  • More importantly – chipset available through Arrow!
  • Details at: https://developer.nvidia.com/hardware-design-and-development

SLIDE 6

CEI’s Epiphany – Ultimate CV Platform: Altera Arria 10 & NVIDIA Tegra K1?

Altera Arria 10 (1500 GFLOPS) + NVIDIA Tegra K1 (326 GFLOPS)
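
As a rough illustration of what the pairing adds up to, the sketch below sums the two rated peaks quoted on this slide, for one Arria 10 with one TK1 and with the dual-TK1 arrangement the deck describes. The function name is hypothetical, and the result is only an upper bound: rated peaks from different devices are not reached simultaneously in practice.

```python
# Vendor-rated peak throughput quoted in the slides, in GFLOPS
ARRIA_10_PEAK = 1500
TEGRA_K1_PEAK = 326

def board_peak_gflops(num_fpgas, num_tegras):
    """Sum of rated peaks; an upper bound, not a sustained figure."""
    return num_fpgas * ARRIA_10_PEAK + num_tegras * TEGRA_K1_PEAK

one_plus_one = board_peak_gflops(1, 1)   # the pairing on this slide
dual_tk1 = board_peak_gflops(1, 2)       # one Arria 10 with two TK1s
print(one_plus_one, dual_tk1)
```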

SLIDE 7

First Union – Dual TK1s + Arria 10: HPC-A10-K1GPU

[Block diagram: HPC-A10 and HPC-A10-K1GPU boards. Labeled components:]

  • PCIe: x8 PCIe Gen3, PCIe switch, x4 PCIe Gen2 to each TK1-SOM, extra x4 PCIe Gen2
  • Memory: 2/4 GB Micron HMC; 2 × QDR II+ (144 Mb, 1334 MT/s)
  • Networking: 2 × QSFP+ (each 1 × 40 GbE or 4 × 10 GbE), GigE
  • Other I/O: USB Blaster, DisplayPort source and sink, 2 × USB 3.0, SMA, SMA CLK-IN, optional VITA 57 FMC HPC
  • K61 health monitoring
  • 2 × Tegra K1 System-on-Module (TK1-SOM): 16/32/64 GB eMMC, 2/4/8 Gbit DDR3, USB, GigE, HDMI

TK1-SOM (Tegra K1 System-on-Module), available stand-alone

  • 2 inches × 2 inches
  • 16/32/64 GB eMMC, 1/2/4 GB DDR3L
  • USB 2.0, GigE, HDMI
  • External power, x4 PCIe Gen2, clocks, I2C, JTAG, UART

SLIDE 8

HPC-A10-K1GPU Design Details

  • NVIDIA GPUDirect Support
  • TK1s are root nodes
  • TK1s can be field upgraded
  • 8 high-speed 10 GbE ports
  • CUDA on TK1
  • OpenCL on Arria 10
  • 2 GB/s to each TK1
  • HMC is 17X faster than DDR3
  • 12 to 25 Camera/Sensor I/Os
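
The 2 GB/s figure for each TK1 matches the theoretical rate of an x4 PCIe Gen2 link. A minimal sketch of that arithmetic follows, assuming the standard PCIe line parameters (Gen2: 5 GT/s per lane with 8b/10b encoding; Gen3: 8 GT/s with 128b/130b encoding) and ignoring packet and protocol overhead. The function and variable names are illustrative only.

```python
def pcie_gbytes_per_s(lanes, gtransfers_per_s, encoding_efficiency):
    """Theoretical one-direction bandwidth of a PCIe link in GB/s (decimal)."""
    usable_gbits_per_lane = gtransfers_per_s * encoding_efficiency
    return lanes * usable_gbits_per_lane / 8.0

# Gen2: 8b/10b encoding carries 8 payload bits in every 10 line bits
tk1_link = pcie_gbytes_per_s(lanes=4, gtransfers_per_s=5.0,
                             encoding_efficiency=8 / 10)

# Gen3: 128b/130b encoding, applied here to the board's x8 Gen3 interface
x8_gen3_link = pcie_gbytes_per_s(lanes=8, gtransfers_per_s=8.0,
                                 encoding_efficiency=128 / 130)

print(f"x4 Gen2 link per TK1: {tk1_link:.2f} GB/s")
print(f"x8 Gen3 link:         {x8_gen3_link:.2f} GB/s")
```

Real transfers land below these numbers once transaction-layer overhead is included, so they are best read as ceilings when budgeting sensors per Tegra.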

SLIDE 9

Single Node

  • 1 to 21 Cameras/Sensors
  • Makes dumb cameras smart
  • 10/40 GbE Sensors
  • OpenCL on FPGA
  • CUDA on Tegra

[Diagram: single node with 21 camera/sensor inputs (C); labeled interfaces: 2 × (4 × 10 GbE), DisplayPort, 2 × USB/GigE, FMC]

SLIDE 10

Tesla K80s + HPC-A10-K1GPU

[Diagram: HPC-A10-K1GPU node with camera/sensor inputs (C) on 2 × (4 × 10 GbE), DisplayPort, 2 × USB/GigE, and FMC, connected to four Tesla K80s via GPUDirect]

SLIDE 11

Sensor Gateway: Smart Host Bus Adapter (HBA)

[Diagram: a sensor cloud (radar, MRI, PET, camera, EW, etc.) and two Tesla K80 clusters, joined through gateway boards that each expose 40 GbE ports and an FMC]

SLIDE 12

Programming FPGAs with OpenCL

  • Easy to do now
  • https://youtu.be/o5WtYiY5Hao
  • Proficient in a day or two
  • CAPI support too
  • 95% to 99% as efficient as VHDL

SLIDE 13

EDGE Node Processing

  • Process on the EDGE using GRID
  • Distributed deep learning node
  • Low cost
  • 4G enabled
  • Fusion of Radar, EO, IO and Sound
  • Download apps from Google Play
  • Feedback to Tesla K80s via GRID
  • SmartCity Ready
  • Military Level Device Security Built-in

[Block diagram: edge node. Labeled components:]

  • NVIDIA Tegra K1/X1: computer vision, video compression
  • 4 × 5 MP camera
  • 24 GHz radar system with 4 × patch antenna: motion detection, camera queuing
  • 4 × directional mic
  • Comms (alerts, streaming video): 4G LTE, WiFi, Bluetooth, USB
  • Altera Cyclone V: appliance security

SLIDE 14

Distributed Aperture System Distributed Sensors

  • Large vehicle/Military ADAS
  • SA360 systems
  • Retrofit casino camera systems
  • Make any sensor system smart
  • Tegra K1/X1s are scalable
  • Mixture of CUDA & OpenCL

[Block diagram: distributed aperture system. Labeled components:]

  • 9 × NVIDIA Tegra X1, each with x4 ARM, CUDA/Linux, OpenCV, H.264/H.265, 64 GB eMMC, 8 GB DDR4, USB3 or GigE, HDMI, and an x4 Gen2 PCIe link (2 GB/s)
  • Altera Arria 10 SoC: x2 ARM, OpenCL, 4/8 GB HMC, QDR-II+ or QDR-IV
  • Removable SATA storage
  • 40/10 GbE ports
  • Main display GPU

SLIDE 15

Challenges: Hardware, Interconnects & Software

  • FPGA + GPU
      • CUDA, OpenCL or CUDA + OpenCL
      • Working with MDA & AFRL on solutions
  • Bandwidth
      • Tegra K1/X1 are x4 Gen2 PCIe, which limits the number and resolution of sensors attached to the Tegra
      • More processing has to be done on the Tegra, but that is okay since Tegras keep increasing in power every year
      • Gen3 PCIe would be awesome
      • PCIe backplane – using 40 GbE ports eliminates the PCIe bottleneck
  • Root Nodes
      • Tegra wants to be the root complex, so non-transparent switches need to be used
      • If Tegra could be an endpoint, a whole new world would open up
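
To make the bandwidth limit concrete, the sketch below estimates how many uncompressed camera streams fit on one x4 Gen2 PCIe link (2 GB/s, as quoted in the slides) versus one 40 GbE port at line rate. The sensor parameters are assumptions chosen only for illustration (5 MP, 2 bytes per pixel, 30 frames per second), the helper names are hypothetical, and protocol overhead is ignored.

```python
def stream_bytes_per_s(megapixels, bytes_per_pixel, fps):
    """Raw data rate of one uncompressed video stream, in bytes per second."""
    return int(megapixels * 1_000_000) * bytes_per_pixel * fps

def max_streams(link_bytes_per_s, per_stream_bytes_per_s):
    """Whole number of streams that fit in the link's theoretical bandwidth."""
    return link_bytes_per_s // per_stream_bytes_per_s

PCIE_X4_GEN2 = 2_000_000_000       # 2 GB/s, the figure quoted in the slides
FORTY_GBE = 40_000_000_000 // 8    # 40 Gbit/s line rate expressed in bytes/s

# Assumed sensor (not from the slides): 5 MP, 2 bytes/pixel, 30 frames/s
camera = stream_bytes_per_s(megapixels=5, bytes_per_pixel=2, fps=30)

print(camera / 1e6, "MB/s per camera")
print(max_streams(PCIE_X4_GEN2, camera), "cameras per x4 Gen2 PCIe link")
print(max_streams(FORTY_GBE, camera), "cameras per 40 GbE port")
```

Under these assumptions each camera needs 300 MB/s, so six fit on the Tegra's PCIe link while a single 40 GbE port could carry sixteen. Raising resolution or frame rate shrinks both counts proportionally.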

SLIDE 16

Future Architectures: Even Cooler Designs Possible

  • Altera
      • Arria 10 SoC
          • Eliminates need for x86 CPU to run OpenCL
          • Truly stand-alone appliances
          • 100 GbE interfaces
      • Stratix 10 and Stratix 10 SoC
          • >10 TFLOPS at 100 W
          • Details: https://www.altera.com/products/fpga/stratix-series/stratix-10/overview.html
  • NVIDIA Volta
      • Looking for NVLink intermingling with FPGAs
  • Virtual FPGAs + Virtual GPUs
      • Allow instant scaling and data protection

SLIDE 17

Summary

  • GPU + FPGA can solve amazing and fun problems
  • Tegra K1/X1 provide incredible capability at low cost, which reduces the size of FPGA needed
  • OpenCL and Hard Floating Point IP make the Altera FPGAs a great partner for NVIDIA GPUs
  • CEI is making scalable solutions to allow application developers to deploy from handheld to enterprise/HPC

SLIDE 18

Hardware & Software Capabilities

  • Enterprise & Embedded SW
  • Net Centric, SOA, web services, J2EE, SQL
  • C/C++
  • CUDA & OpenCL
  • Embedded real time code, RTOS, hardware drivers, Fault Detection / Fault Isolation, etc.

  • Simulations, APIs, and GUIs
  • Cognitive Software
  • Device Drivers
  • National Instruments LabVIEW
  • DO-178C
  • FPGA designs (VHDL/Verilog/Simulink)
  • RF Design

  • System / Subsystem Designs
  • 30+ complex board designs
  • 32 layer PCBs with blind and buried vias
  • High speed (100s of MHz to GHz)
  • Analog (RF & I/Q Receivers)
  • Digital (FPGAs, DSPs, general purpose)
  • ADC and DAC
  • Standard and custom IO (busses, fabrics, SerDes, etc.)
  • Ruggedization and thermal management
  • CSWaP
  • Serial I/O (e.g. PCIe, SerDes)
  • DO-254

SLIDE 19

For More Information

  • Standard Products and Custom Engineering Services

Call Us – 719-388-8582 Office
Email Us – lance.brown@coloradoengineering.com
Visit Us – Colorado Springs, CO (Sunny 300+ Days)
Browse Us – www.ColoradoEngineering.com