SLIDE 1

MODERATOR: DANNY PRICE, SCRIBE: RICHARD PRESTAGE

CASPER AND GPUS

  • Frameworks – MPI, heterogeneous “large” systems
  • Pipelines – hashpipe, psrdada, bifrost, htgs
  • Data transport – DPDK, libVMA, NTOP
  • Applications – correlators, beamformers, spectrometers, FRB
  • Hardware configurations and future hardware roadmaps
SLIDE 2

MODERATOR: DANNY PRICE, SCRIBE: RICHARD PRESTAGE

USEFUL LINKS

  • hashpipe - https://github.com/david-macmahon/hashpipe
  • psrdada - http://psrdada.sourceforge.net
  • bifrost - https://github.com/ledatelescope/bifrost
  • htgs - https://pages.nist.gov/HTGS/
  • DPDK - http://www.dpdk.org
  • libVMA - https://github.com/Mellanox/libvma
  • NTOP - https://github.com/ntop/PF_RING
SLIDE 3

APPLICATIONS

  • FRB searching (Dan) – building systems for GBT, Arecibo, FAST. Using Heimdall.
  • Building the whole FPGA/switch/GPU processing engine. Have they built the whole “ultimate CASPER backend”? Not yet. There is a SETI GPU, an FRB GPU, etc. Heimdall dedispersion is the hardest computational task, but overall still swamped by the number of candidates.
  • Beamformers – Max Planck beamformer on MeerKAT (commensal backend).
  • Packet capture and beamforming in bifrost.
  • DifX (reported by Jonathan Weintroub) used some aspect of MPI to move the existing DifX X-engine onto the GPU? [From discussions with Arash after the meeting: he did need to hand-port FFTW to cuFFT, and some aspects of the X-engine to CUDA kernels.]
  • Dan: use GPU correlators for ~2**8 antennas. Not needed for a small number of antennas (e.g. VLBI).

SLIDE 4

DATA TRANSPORT

  • DPDK, etc.: zero-copy operations that bypass kernel space. Data goes from the NIC to GPU memory, saving one hop. RDMA – direct to GPU with InfiniBand; RoCE = similar over Ethernet, a layer above RDMA. All still have to go through system memory.
  • DPDK – has to have an Intel NIC (or clone) – can get 80 Gb/sec into the GPU (2x 40 Gb NICs). [Edit: DPDK does support some Mellanox / Broadcom / Cisco / Chelsio chipsets.]
  • libVMA: equivalent with Mellanox NICs; 40 Gb/sec per NIC. Using SPEAD packets.
  • Would like a SPEAD reader using DPDK for psrdada, bifrost, etc.
  • Dan – the bottleneck into PCs is packets/sec, not bits/sec, so you want giant packets (jumbo = 9k packets).
  • NICs now support interrupt coalescing – the NIC waits for e.g. 10 packets before interrupting the CPU. Dave’s hashpipe uses this. Kernel tuning parameters are critical – need a CASPER memo for this. Danny – maybe one exists. Application code also needs to be bound to the correct processor: threads must be locked to the correct core (see the sketch after this list).
  • Dan: action item – a group to get together and identify memo(s) of “required reading” before attempting to develop HPC code: “How to do high-speed data transport”. Group to consist of: John Ford, Dave MacMahon, Danny Price.
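As a rough, minimal sketch of the core-binding and large-receive-buffer points above (plain C/C++ on Linux). The core number, UDP port, and buffer sizes are illustrative placeholders, not values from the discussion:

    // Sketch: bind a packet-capture thread to one core and enlarge the UDP
    // receive buffer so bursts survive until the thread is scheduled.
    // Core number, port and buffer sizes are placeholders.
    #ifndef _GNU_SOURCE
    #define _GNU_SOURCE
    #endif
    #include <pthread.h>
    #include <sched.h>
    #include <string.h>
    #include <arpa/inet.h>
    #include <sys/socket.h>

    static void *capture_thread(void *arg)
    {
        (void)arg;
        int sock = socket(AF_INET, SOCK_DGRAM, 0);
        int rcvbuf = 64 * 1024 * 1024;              /* 64 MB, illustrative */
        setsockopt(sock, SOL_SOCKET, SO_RCVBUF, &rcvbuf, sizeof(rcvbuf));

        struct sockaddr_in addr;
        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_port = htons(60000);               /* placeholder port */
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        bind(sock, (struct sockaddr *)&addr, sizeof(addr));

        char pkt[9000];                             /* jumbo-frame sized */
        for (;;) {
            if (recv(sock, pkt, sizeof(pkt), 0) < 0) break;
            /* ... copy into a ring buffer for the processing thread ... */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t tid;
        pthread_create(&tid, NULL, capture_thread, NULL);

        /* Lock the capture thread to core 2 (placeholder), ideally on the
           NUMA node / core that services the NIC's interrupts. */
        cpu_set_t cpus;
        CPU_ZERO(&cpus);
        CPU_SET(2, &cpus);
        pthread_setaffinity_np(tid, sizeof(cpus), &cpus);

        pthread_join(tid, NULL);
        return 0;
    }

A real system would combine this with NIC interrupt-coalescing settings (e.g. via ethtool -C) and the kernel-tuning parameters the proposed memo would cover.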

SLIDE 5

HOW TO DO HIGH-SPEED DATA TRANSPORT: A READING LIST FOR THE CURIOUS CASPERITE

  • Digital signal processing using stream high performance computing: a 512-input broadband correlator for radio astronomy, J. Kocz, L. J. Greenhill, B. R. Barsdell et al., arXiv:1401.8288
  • A Scalable Hybrid FPGA/GPU FX Correlator, J. Kocz, L. J. Greenhill, B. R. Barsdell et al., Journal of Astronomical Instrumentation, 2014
  • The Breakthrough Listen Search for Intelligent Life: A Wideband Data Recorder System for the Robert C. Byrd Green Bank Telescope, D. MacMahon, D. C. Price, M. Lebofsky et al., arXiv:1707.06024
  • An Efficient Real-time Data Pipeline for the CHIME Pathfinder Radio Telescope X-Engine, A. Recnik, K. Bandura, N. Denman et al., arXiv:1503.06189

SLIDE 6

HARDWARE CONFIGURATIONS

  • Danny: Breakthrough uses 4U servers from SuperMicro, dual Xeons, and captures raw voltages to disk. After observations, data are played back through NVIDIA 1080 gaming cards – one per node.
  • Typically BTL/GBT use one GPU per box; others use 2–4 GPUs per box. The CHIME correlator uses AMD, with code written in OpenCL.
  • Dan – NVIDIA is into supercomputing; AMD is selling chips to gamers. Can run OpenCL on NVIDIA.
  • CUDA gives you the cuFFT, cuBLAS, and Thrust libraries. Does AMD have equivalents? (A minimal cuFFT sketch follows this list.)
  • The number of PCI Express lanes the CPU can support is important. AMD CPU + NVIDIA GPU may be beneficial.
  • POWER8/9 have “BlueLink” connections. NICs which use BlueLink may be developed. IBM has shown a lot of dedication to giving the GPU as high-speed an interconnect as possible.
  • Vendors: very cheap 10/40 Gb transceivers from FiberStore (fs.com). They also sell 100 Gb switches.
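To illustrate the kind of library support mentioned above, a minimal cuFFT sketch (a batched 1-D FFT such as a spectrometer might run); the transform length and batch size are placeholders, not parameters from the discussion:

    // Sketch: a batched 1-D complex FFT with cuFFT. Sizes are illustrative.
    #include <cufft.h>
    #include <cuda_runtime.h>
    #include <cstdio>

    int main(void)
    {
        const int nchan  = 4096;   /* FFT length (placeholder) */
        const int nbatch = 1024;   /* spectra per call (placeholder) */

        cufftComplex *d_data;
        cudaMalloc(&d_data, sizeof(cufftComplex) * nchan * nbatch);

        /* One plan, reused for every block of input data. */
        cufftHandle plan;
        cufftPlan1d(&plan, nchan, CUFFT_C2C, nbatch);

        /* In a real pipeline d_data would be filled from the capture ring
           buffer; here the transform just runs in place. */
        cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);
        cudaDeviceSynchronize();

        cufftDestroy(plan);
        cudaFree(d_data);
        printf("FFT done\n");
        return 0;
    }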

SLIDE 7

PIPELINES

  • HTGS does the inverse of bifrost. Bifrost binds a thread to an operation; HTGS defines nodes in a graph, and the nodes are bound to CPU threads. The aim is to overlap data transport and computation, giving a hybrid, multicore pipeline. Uses an explicit graph representation throughout.
  • Hashpipe – originally developed for GUPPI (Paul D.), generalized by Dave MacMahon. Not as sophisticated as bifrost/HTGS. Provides support for metadata. Hashpipe does not support forking ring buffers. Simple and straightforward, well documented, CASPER tutorials available. (A sketch of the underlying ring-buffer pattern follows this list.)
  • PSRDADA is similar to hashpipe. Low level. Simple and conservative: use hashpipe or PSRDADA. Bifrost has been used in a single instrument.
  • HTGS is just starting to be prototyped at GB. Unique in using a graph representation – maintained through analysis and execution. Can also use multiple GPUs – formulate a sub-graph and encapsulate it into an execution pipeline graph, bound to a GPU.
  • Should put a link to Tim’s thesis on the CASPER website. Link to the paper is https://link.springer.com/article/10.1007/s11265-017-1262-6
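A greatly simplified sketch of the shared ring-buffer pattern these pipelines are built around (one capture/producer thread, one compute/consumer thread). This is generic C/C++ with pthreads, not the actual API of hashpipe, PSRDADA, or bifrost, and the block count and size are placeholders:

    // Sketch of a producer/consumer ring buffer, the pattern underlying
    // hashpipe / PSRDADA / bifrost (greatly simplified).
    #include <pthread.h>
    #include <stdio.h>

    #define NSLOT  4
    #define SLOTSZ (1 << 20)            /* 1 MB blocks, illustrative */

    static char ring[NSLOT][SLOTSZ];
    static int  head = 0, tail = 0, count = 0;
    static pthread_mutex_t lock      = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  not_full  = PTHREAD_COND_INITIALIZER;
    static pthread_cond_t  not_empty = PTHREAD_COND_INITIALIZER;

    /* Capture thread: fills blocks (stand-in for a packet-receive loop). */
    static void *producer(void *arg)
    {
        (void)arg;
        for (int i = 0; i < 16; i++) {
            pthread_mutex_lock(&lock);
            while (count == NSLOT) pthread_cond_wait(&not_full, &lock);
            ring[head][0] = (char)i;            /* "fill" the block */
            head = (head + 1) % NSLOT; count++;
            pthread_cond_signal(&not_empty);
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    /* Compute thread: drains blocks (stand-in for a GPU processing call). */
    static void *consumer(void *arg)
    {
        (void)arg;
        for (int i = 0; i < 16; i++) {
            pthread_mutex_lock(&lock);
            while (count == 0) pthread_cond_wait(&not_empty, &lock);
            printf("processing block %d\n", ring[tail][0]);
            tail = (tail + 1) % NSLOT; count--;
            pthread_cond_signal(&not_full);
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t p, c;
        pthread_create(&p, NULL, producer, NULL);
        pthread_create(&c, NULL, consumer, NULL);
        pthread_join(p, NULL);
        pthread_join(c, NULL);
        return 0;
    }

The real libraries add the pieces this sketch omits: shared-memory buffers that separate processes can attach to, status/metadata handling, and binding of each thread to a chosen core.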

SLIDE 8

GPU ROADMAP

  • Vega for AMD coming out next week; Volta for NVIDIA. Volta has tensor cores – 4x4 matrix multiplications, 16-bit inputs, 32-bit outputs (designed for AI training / inferencing).
  • CUDA 9 – formalized some of the threading models: can write CUDA kernels that work on a thread block (see the sketch below). No announcement on the GTX line, but a Volta GTX will probably be announced soon. Consumer cards will stick with DDR RAM. An SLI bridge can communicate between cards.
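A minimal sketch of that thread-block-level model using the CUDA 9 cooperative-groups API; the block-sum kernel and its sizes are illustrative placeholders, not something discussed in the session:

    // Sketch: a kernel that operates on an explicit thread-block group
    // (CUDA 9 cooperative groups). Array sizes are illustrative.
    #include <cooperative_groups.h>
    #include <cstdio>
    namespace cg = cooperative_groups;

    __global__ void block_sum(const float *in, float *out)
    {
        cg::thread_block block = cg::this_thread_block();
        __shared__ float scratch[256];

        scratch[block.thread_rank()] = in[blockIdx.x * blockDim.x + threadIdx.x];
        block.sync();                       /* whole-block barrier via the group */

        /* Simple tree reduction within the block. */
        for (int step = block.size() / 2; step > 0; step /= 2) {
            if (block.thread_rank() < step)
                scratch[block.thread_rank()] += scratch[block.thread_rank() + step];
            block.sync();
        }
        if (block.thread_rank() == 0)
            out[blockIdx.x] = scratch[0];
    }

    int main(void)
    {
        const int n = 1024, threads = 256, blocks = n / threads;
        float *d_in, *d_out;
        cudaMalloc(&d_in,  n * sizeof(float));
        cudaMalloc(&d_out, blocks * sizeof(float));
        cudaMemset(d_in, 0, n * sizeof(float));
        block_sum<<<blocks, threads>>>(d_in, d_out);
        cudaDeviceSynchronize();
        cudaFree(d_in); cudaFree(d_out);
        printf("done\n");
        return 0;
    }

Here block.sync() plays the role of __syncthreads(), but as part of an explicit group object that can also be partitioned into smaller tiles.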

SLIDE 10

FPGA ROADMAP

  • Latest generation is UltraScale+; some chips are in production. Lots more memory on chip: 10s of Mbytes -> Gbits. 26 Gbit links, 100 Gb Ethernet on eval boards.
  • $7k for a VCU118 eval board with a $20k chip on it. Not engineered for industrial applications. HBM (high-bandwidth memory): super-high-bandwidth DRAM that connects over the substrate.
  • FPGAs with high-speed ADCs/DACs on chip: 8x 3 Gsps ADCs/DACs. Not generally available yet; will be out at the end of the year.
  • Working on 7 nm chips – no date for availability yet. Dan: for performance/$, use the latest generation family, but a medium-size chip.
  • Can buy VCU118 boards in bulk. Power to the FPGA is throttled to 60 W (?). May be a problem for full utilization, but looks encouraging. Full investigation not complete.