Green Flash Persistent Kernel: Real-Time, Low-Latency and High-Performance Computation on Pascal (PowerPoint PPT Presentation)



SLIDE 1

Project #671662 funded by European Commission under program H2020-EU.1.2.2 coordinated in H2020-FETHPC-2014

GTC 2017

Green Flash

Persistent Kernel: Real-Time, Low-Latency and High-Performance Computation on Pascal

Julien BERNARD

SLIDE 2

Green Flash

  • Public and private actors
    – Paris Observatory
    – University of Durham
    – Microgate
    – PLDA
  • Part of Horizon 2020: the EU Research and Innovation programme
  • 3-year project
  • €3.8 million budget
  • Involves about 30 people
  • Research axes
    – Real-time HPC with accelerators and smart interconnects
    – Energy-efficient platform based on FPGA
    – Real-Time Controller (RTC) prototype for the European Extremely Large Telescope Adaptive Optics (AO) system

SLIDE 3

Contributors

Maxime Lainé: software engineer
Denis Perret: FPGA expert
Arnaud Sevin: software lead
Damien Gratadour: project lead
Christophe Rouaud: PLDA project lead
Gaetan Dufourcq: QuickPlay expert

SLIDE 4

E-ELT: Adaptive Optics

  • Compensate the wavefront perturbations in real time
  • Using a wavefront sensor (WFS) to measure them
  • Using a deformable mirror (DM) to reshape the wavefront
  • Commands to the mirror must be computed in real time (~ms rate)

SLIDE 5

RTC concept for ELT AO

SLIDE 6

RTC concept for ELT AO

SLIDE 7

Real-Time Controller

Legacy architecture

  • e.g. the SPARTA architecture
    – DSP & CPU
    – VXS backplane

  Instrument | WFS | Meas. | DM | Com. | Freq (Hz) | Performance (GMAC/s)
  Sphere     |  1  | 2.6k  |  1 | 1.3k | 1.5k      | 5.2
  AOF        |  4  | 2.4k  |  1 | 1.2k | 1k        | 11.8

  (Diagram: sensor and active elements connected to the RTC through a switch)

SLIDE 8

Real-Time Controller

Cluster network architecture

  Instrument | WFS | Meas. | DM | Com. | Freq (Hz) | Performance (GMAC/s)
  Sphere     |  1  | 2.6k  |  1 | 1.3k | 1.5k      | 5.2
  AOF        |  4  | 2.4k  |  1 | 1.2k | 1k        | 11.8
  ELT        |  6  | 80k   |  3 | 15k  | 500       | 1.2k

  (Diagram: sensors 0-5 and active elements 0-2 connected to RTC nodes 0 to N-1 through a switch)
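As a sanity check on the performance column, the requirement roughly follows from one multiply-accumulate per (measurement, command) pair per frame. This accounting is an inference, not stated on the slide, but it lands close to the table's value for the smaller systems:

```python
def required_gmacs(n_meas, n_com, freq_hz):
    """Rough RTC throughput need: one MAC per (measurement, command)
    pair per frame, times the frame rate, expressed in GMAC/s."""
    return n_meas * n_com * freq_hz / 1e9

# Sphere: 2.6k measurements, 1.3k commands, at a 1.5 kHz frame rate
print(required_gmacs(2600, 1300, 1500))  # ~5.1 GMAC/s vs 5.2 in the table
```

The ELT row is two orders of magnitude beyond the legacy instruments, which is what motivates the cluster architecture on this slide.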

SLIDE 9

Legacy GPU programming

(Diagram: NIC receives over 10GbE into CPU RAM; data crosses PCIe between CPU RAM and GPU RAM)

  main {
      setup();
      while (run) {
          recv(...);
          cudaMemcpy(..., cudaMemcpyHostToDevice);
          computing_kernel<<<...>>>(...);
          cudaMemcpy(..., cudaMemcpyDeviceToHost);
          send(...);
      }
  }

SLIDE 10

Legacy GPU programming

cudaMemcpy() overhead times (5.12 MB in, 64 KB out)
Kernel-launch overhead times
In both cases: jitter of 20 to 30 µs (sometimes 40 µs)

SLIDE 11

Legacy GPU programming

This per-iteration overhead leaves too little time for the actual computation.
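A rough budget illustrates the problem. The 500 Hz frame rate comes from the ELT row earlier in the deck and the jitter from the previous slide; the PCIe bandwidth is an assumed, illustrative value:

```python
# Per-frame time budget at a 500 Hz frame rate (ELT case)
frame_budget_us = 1e6 / 500                   # 2000 µs per frame

# Legacy-loop overheads: jitter as measured on the previous slide,
# transfer time from an ASSUMED effective PCIe bandwidth of 12 GB/s
jitter_us = 30
pcie_bytes_per_s = 12e9
copy_in_us = 5.12e6 / pcie_bytes_per_s * 1e6  # 5.12 MB input transfer

remaining_us = frame_budget_us - copy_in_us - jitter_us
print(round(copy_in_us), round(remaining_us))
```

Under these assumptions the input copy alone consumes roughly a fifth of the frame budget before any computation starts, and the launch and copy jitter eats further into the deadline margin.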

SLIDE 12

Improvement

  • GPUDirect & I/O memory mapping
  • Persistent kernel

SLIDE 13

GPU direct & I/O Memory mapping

SLIDE 14

GPU direct & I/O Memory mapping

  • FPGA writes/reads directly to/from GPU memory
  • CPU is free for other kinds of computation

  (Diagram: the FPGA NIC carries a camera protocol handler, a DM protocol handler, a UDP offload engine and a latency-measurement block, each with its own DMA engine; pixel buffers and DM command buffers move over PCIe 3.0 directly between the FPGA and GPU RAM; the compute kernels read the pixel buffer and produce the DM commands, while the CPU app in host RAM only handles camera and FPGA control)

SLIDE 15

FPGA Development platform

Development process eased by using the QuickPlay tool from PLDA

SLIDE 16

FPGA Development platform

  • Single generic design / multiple target boards
    – ExpressK-US board (hosting a Xilinx Kintex UltraScale)
    – ExpressGX V board (hosting an Altera Stratix V)
    – µXlink board from Microgate (hosting an Altera Arria 10)

SLIDE 17

Persistent Kernel

SLIDE 18

Classic implementation

SLIDE 19

Persistent kernel implementation

SLIDE 20

GPUDirect, I/O memory mapping & persistent kernel

(Diagram: the FPGA NIC writes directly into GPU RAM over PCIe and signals the start of each DMA transfer, bypassing CPU RAM)

  main {
      setup();
      persistent_kernel<<<...>>>(...);
      ...
  }

  persistent_kernel(...) {
      while (run) {
          pollMemory(...);
          computation(...);
          startDMATransfer(...);
      }
  }
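The control flow above can be mimicked host-side. Below is a conceptual Python sketch in which a long-lived thread stands in for the persistent kernel and a shared dict stands in for the polled GPU memory; all names are illustrative and none come from the project's code:

```python
import threading

# CPU-side sketch of the persistent-kernel pattern: one long-lived worker
# polls a "mailbox" for new frames instead of paying a kernel-launch plus
# memcpy overhead on every frame.
mailbox = {"frame": None, "result": None, "run": True}
lock = threading.Lock()

def persistent_worker():
    while True:
        with lock:
            if not mailbox["run"]:
                return                      # shutdown requested
            frame, mailbox["frame"] = mailbox["frame"], None
        if frame is None:
            continue                        # nothing yet: keep polling,
                                            # as the GPU kernel spins on memory
        with lock:
            mailbox["result"] = sum(frame)  # stand-in for the compute kernels

worker = threading.Thread(target=persistent_worker)
worker.start()

with lock:
    mailbox["frame"] = [1, 2, 3]            # a new frame "arrives by DMA"

result = None
while result is None:                       # consumer waits for the answer
    with lock:
        result = mailbox["result"]
print(result)

with lock:
    mailbox["run"] = False                  # stop the persistent worker
worker.join()
```

The point of the pattern is that the worker is launched once at setup, so per-frame cost reduces to the poll and the computation itself.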

SLIDE 21

Pipelining I/O and compute

Test setup:
  FPGA    PLDA XPressG5
  GPU     Tesla C2070
  OS      Debian wheezy
  Camera  EVT HS-2000M, 10GbE network

(Plot: iteration time in µs, "No GPUDirect" vs "GPUDirect + persistent kernel")
SCAO Pyramid case: 240 × 240 pixels, encoded on 16 bits
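For scale, the payload per frame in this test case is modest (simple arithmetic on the slide's figures):

```python
# SCAO Pyramid case: 240 x 240 pixels, 16 bits (2 bytes) per pixel
frame_bytes = 240 * 240 * 2
print(frame_bytes, frame_bytes / 1024)  # 115200 bytes = 112.5 KiB per frame
```

At these sizes, fixed per-iteration overheads (launches, copies) dominate over raw bandwidth, which is why removing them pays off.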

SLIDE 22

Pipelining I/O and compute

SLIDE 23

DGX-1 benchmark

  • The FPGA is replaced by a CPU
  • Each node master receives frame data
  • Work is shared between all devices
  • The RTC master sends back the final result

  (Diagram: RTC master, node masters, and slaves)

SLIDE 24

Result 1/2: Time and jitter

(Histogram of frame computation time, in ms)

4-device case with 10,048 slopes × 15,000 commands
Average: 0.45 ms
Peak-to-peak jitter: 17 µs
Variation: 1.8 %
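Metrics like those above can be derived from raw per-frame timings. In the sketch below the definition of "variation" (half the peak-to-peak range over the mean) is an assumed reconstruction, and the sample data is synthetic, not the slide's measurements:

```python
def jitter_stats(samples_ms):
    """Average, peak-to-peak jitter, and variation, where variation is
    taken as half the peak-to-peak range over the mean (assumed definition)."""
    avg = sum(samples_ms) / len(samples_ms)
    p2p = max(samples_ms) - min(samples_ms)
    variation_pct = 100 * (p2p / 2) / avg
    return avg, p2p, variation_pct

# Synthetic per-frame timings in ms -- NOT the slide's measured data
avg, p2p, var_pct = jitter_stats([0.449, 0.450, 0.451, 0.452, 0.448])
print(round(avg, 3), round(p2p * 1000, 1), round(var_pct, 2))
```

For hard real-time AO control the peak-to-peak figure matters more than the average, since a single late frame degrades the correction.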

SLIDE 25

Result 2/2: Sync & Intercom time

Intercommunication time: average 15 µs, jitter 8.8 µs
Synchronization time: average 24 µs, jitter 12 µs

SLIDE 26

Conclusion & future work

  • Conclusion
    – Using GPUDirect and a persistent kernel allows efficient data delivery to the RTC
    – Lower jitter
    – Simpler execution stream
    – The QuickPlay tool from PLDA:
      • eased the FPGA development cycle
      • mixes communication protocols and data processing in the same streams
      • expandable ecosystem, with QuickStore / QuickAlliance
  • Future work
    – Test on an AO bench (with DM and WFS)
    – Use a multi-node architecture
    – Test with fp16

SLIDE 27


Thank you

Questions?
