Project #671662 funded by European Commission under program H2020-EU.1.2.2 coordinated in H2020-FETHPC-2014
GTC 2017
Green Flash
Persistent Kernel: Real-Time, Low-Latency and High-Performance Computation on Pascal
Julien BERNARD
– Paris Observatory
– University of Durham
– Microgate
– PLDA
and Innovation programme
– Real time HPC with accelerators and smart interconnects
– Energy efficient platform based on FPGA
– Real Time Controller (RTC) prototype for the European Extremely Large Telescope Adaptive Optics (AO) system
wavefront perturbations
measure them
reshape the wavefront
computed in real-time (~ms rate)
RTC
– DSP & CPU
– VXS backplane

Instrument   WFS   Meas.   DM   Com.   Freq (Hz)   Performance (GMAC/s)
Sphere       1     2.6k    1    1.3k   1.5k        5.2
AOF          4     2.4k    1    1.2k   1k          11.8

Diagram labels: Active elements, Sensor, Switch
Diagram labels: RTC Node 0, RTC Node ..., RTC Node N-1
Instrument   WFS   Meas.   DM   Com.   Freq (Hz)   Performance (GMAC/s)
Sphere       1     2.6k    1    1.3k   1.5k        5.2
AOF          4     2.4k    1    1.2k   1k          11.8
ELT          6     80k     3    15k    500         1.2k
Diagram labels: Sensor 0 to Sensor 5, Active elements 0 to Active elements 2, Switch
Diagram labels: GPU, GPU RAM, CPU, CPU RAM, PCIe, 10GbE, NIC

main {
    setup();
    while(run){
        recv(…);
        cudaMemcpy(…, HostToDevice);
        computing_kernel<<<>>>(…);
        cudaMemcpy(…, DeviceToHost);
        send(…);
    }
}
cudaMemcpy() overhead times (5.12 MB in, 64 KB out)
Kernel launch overhead times
Both cases: jitter of 20 to 30 µs (sometimes 40 µs)
Leaves too little time for computation
FPGA NIC
Host ram CPU app
Camera control FPGA control
GPU ram
GPU
Camera protocol handler DMA DMC protocol handler DMA
UDP Offload Engine
Pixels buffer DM com buffer
DMA start
PCI-e 3.
DMA
answers
Latency measurement DMA
measures
Pixels buffer
compute kernels
– ExpressK-US board
(hosting a Kintex UltraScale from Xilinx)
– ExpressGX V board
(hosting a Stratix V from Altera)
– μXlink board from Microgate
(hosting an Arria 10 FPGA from Altera)
Diagram labels: GPU, GPU RAM, CPU, CPU RAM, PCIe, 10GbE, FPGA NIC
start
main {
    setup();
    persistent_kernel<<<>>>(…);
    …
}

persistent_kernel(…){
    while(run){
        pollMemory(…);
        computation(...);
        startDMATransfer(…);
    }
}
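The control flow of the persistent kernel can be mimicked on a CPU to make the polling pattern concrete. The C++ sketch below is only an analogy: it is not CUDA and not the Green Flash implementation. A worker thread that is started once stands in for persistent_kernel, an atomic frame counter stands in for the memory watched by pollMemory, and publishing a result counter stands in for startDMATransfer. The names Mailbox, persistent_worker and run_demo, and the placeholder sum used as the computation, are assumptions made purely for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <functional>
#include <numeric>
#include <thread>
#include <vector>

// Shared state standing in for the buffers the FPGA and the GPU exchange (hypothetical layout).
struct Mailbox {
    std::vector<std::uint16_t> pixels;           // input frame written by the producer
    std::atomic<std::uint64_t> frame_ready{0};   // bumped after a frame has been written
    std::atomic<std::uint64_t> result_ready{0};  // bumped after a result has been written
    std::atomic<bool> run{true};                 // cleared to stop the worker
    std::uint64_t result = 0;                    // output of the placeholder computation
};

// Long-lived worker mirroring: while(run){ pollMemory; computation; startDMATransfer; }
void persistent_worker(Mailbox& mb) {
    std::uint64_t seen = 0;
    while (true) {
        // pollMemory stand-in: wait until the frame counter moves, or stop is requested.
        while (mb.frame_ready.load(std::memory_order_acquire) == seen) {
            if (!mb.run.load(std::memory_order_acquire)) return;
            std::this_thread::yield();
        }
        ++seen;
        // computation stand-in: a plain sum of the pixel values.
        mb.result = std::accumulate(mb.pixels.begin(), mb.pixels.end(), std::uint64_t{0});
        // startDMATransfer stand-in: publish the result for this frame.
        mb.result_ready.store(seen, std::memory_order_release);
    }
}

// Feeds frames one at a time to a worker that is launched exactly once.
std::vector<std::uint64_t> run_demo(const std::vector<std::vector<std::uint16_t>>& frames) {
    Mailbox mb;
    std::thread worker(persistent_worker, std::ref(mb));  // single "launch"
    std::vector<std::uint64_t> out;
    std::uint64_t n = 0;
    for (const auto& frame : frames) {
        mb.pixels = frame;                                      // write the frame
        mb.frame_ready.store(++n, std::memory_order_release);   // signal the worker
        while (mb.result_ready.load(std::memory_order_acquire) != n) {
            std::this_thread::yield();                          // wait for the result
        }
        out.push_back(mb.result);
    }
    mb.run.store(false, std::memory_order_release);  // ask the worker to exit
    worker.join();
    assert(out.size() == frames.size());
    return out;
}
```

Calling run_demo({{1, 2, 3}, {10, 20}}) returns the per-frame sums 6 and 30. The worker is started once and only polls afterwards, which mirrors how the persistent kernel avoids a kernel launch per frame.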
Plot axes: µs, iterations
Plot legend: No GPUDirect; GPUDirect + persistent kernel
SCAO Pyramid case: 240 x 240 pixels, encoded on 16 bits
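For scale, the payload per frame in the SCAO Pyramid case follows directly from the figures given. The helper below is a small sketch; frame_bytes is a hypothetical name, and it assumes tightly packed pixels with no protocol headers or padding, which the slide does not specify.

```cpp
#include <cassert>
#include <cstdint>

// Bytes needed for one sensor frame, assuming tightly packed pixels
// and no headers or padding (an assumption, not stated in the slides).
std::uint64_t frame_bytes(std::uint64_t width, std::uint64_t height,
                          std::uint64_t bits_per_pixel) {
    assert(bits_per_pixel % 8 == 0);  // this sketch only handles whole bytes
    return width * height * (bits_per_pixel / 8);
}
```

With the values from the slide, frame_bytes(240, 240, 16) gives 115,200 bytes, or 112.5 KiB per frame.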
CPU
receive frame data
all devices
final result
Node masters RTC Master Slaves
4 devices case with 10,048 slopes x 15,000 commands
Average: 0.45 ms
Jitter peak to peak: 17 µs
Variation: 1.8 %
Plot axis: Time in ms
Intercommunication time: Average: 15 µs, Jitter: 8.8 µs
Synchronize time: Average: 24 µs, Jitter: 12 µs
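A rough sense of what the 4 devices timing implies can be had from the problem size alone. The sketch below assumes things the slides do not state: that the 10,048 x 15,000 operand is a dense matrix stored in 32-bit floats, that it is split evenly across the 4 devices, and that each device reads its share once per 0.45 ms iteration. The function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Bytes of the matrix held by each device, assuming an even split
// (integer division truncates if the split is not exact).
std::uint64_t matrix_bytes_per_device(std::uint64_t slopes, std::uint64_t commands,
                                      std::uint64_t bytes_per_element,
                                      std::uint64_t devices) {
    assert(devices > 0);
    return slopes * commands * bytes_per_element / devices;
}

// Read bandwidth in GB/s needed to stream `bytes` once within `seconds`.
double required_gb_per_s(std::uint64_t bytes, double seconds) {
    assert(seconds > 0.0);
    return static_cast<double>(bytes) / seconds / 1e9;
}
```

Under those assumptions, matrix_bytes_per_device(10048, 15000, 4, 4) is 150,720,000 bytes, and required_gb_per_s of that amount over 0.45e-3 seconds is roughly 335 GB/s per device. Storing the matrix in fp16 (2 bytes per element) would halve both figures.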
– Using GPUDirect and a persistent kernel allows efficient data delivery to the RTC
– Lower jitter
– Simpler execution stream
– QuickPlay tool from PLDA
data processing into the same streams
QuickStore / QuickAlliance
– Test on AO bench (with DM and WFS)
– Use multi-node architecture
– Test with fp16