SLIDE 1

MODERATOR: DANNY PRICE, SCRIBE: RICHARD PRESTAGE

CASPER AND GPUS

  • Frameworks – MPI, heterogeneous “large” systems
  • Pipelines – hashpipe, psrdada, bifrost, htgs
  • Data transport – DPDK, libVMA, NTOP
  • Applications – correlators, beamformers, spectrometers, FRB
  • Hardware configurations and future hardware roadmaps
SLIDE 2

MODERATOR: DANNY PRICE, SCRIBE: RICHARD PRESTAGE

USEFUL LINKS

  • hashpipe - https://github.com/david-macmahon/hashpipe
  • psrdada - http://psrdada.sourceforge.net
  • bifrost - https://github.com/ledatelescope/bifrost
  • htgs - https://pages.nist.gov/HTGS/
  • DPDK - http://www.dpdk.org
  • libVMA - https://github.com/Mellanox/libvma
  • NTOP - https://github.com/ntop/PF_RING
SLIDE 3

APPLICATIONS

  • FRB searching (Dan) – building systems for GBT, Arecibo, FAST. Using Heimdall.
  • Building the whole FPGA/switch/GPU processing engine. Have they built the whole “ultimate CASPER backend”? Not yet. There is a SETI GPU, an FRB GPU, etc. Heimdall dedispersion is the hardest computational task, but overall still swamped by the number of candidates.
  • Beamformers – Max Planck beamformer on MeerKAT (commensal backend).
  • Packet capture and beamforming in bifrost.
  • DifX (reported by Jonathan Weintroub) used some aspect of MPI to move the existing DifX X-engine onto the GPU? [From discussions with Arash after the meeting: he did need to hand-port FFTW to cuFFT, and some aspects of the X-engine to CUDA kernels.]
  • Dan: use GPU correlators for ~2**8 antennas. Not needed for a small number of antennas (e.g. VLBI).

SLIDE 4

DATA TRANSPORT

  • DPDK, etc.: zero-copy operations that bypass kernel space. Data goes from the NIC to GPU memory, saving one hop. RDMA – direct to GPU with InfiniBand; RoCE = similar over Ethernet, a layer above RDMA. All still have to go through system memory.
  • DPDK – has to have an Intel NIC (or clone) – can get 80 Gb/sec into the GPU (2x 40 Gb NICs). [Edit: DPDK does support some Mellanox / Broadcom / Cisco / Chelsio chipsets.]
  • libVMA: equivalent with Mellanox NICs; 40 Gb/sec per NIC. Using SPEAD packets.
  • Would like a SPEAD reader using DPDK for psrdada, bifrost, etc.
  • Dan – the bottleneck into PCs is packets/sec, not bits/sec, so you want giant packets (jumbo = 9k packets).
  • NICs now support interrupt coalescing – the NIC waits for e.g. 10 packets before interrupting the CPU. Dave’s hashpipe uses this. Kernel tuning parameters are critical – need a CASPER memo for this. Danny – maybe one exists. Application code also needs to be bound to the correct processor: threads must be locked to the correct core (see the sketch after this list).
  • Dan: action item – a group to get together and identify memo(s) of “required reading” before attempting to develop HPC code: “How to do high-speed data transport”. Group to consist of: John Ford, Dave MacMahon, Danny Price.
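As a rough, minimal sketch of the core-binding and large-receive-buffer points above (plain C/C++ on Linux). The core number, UDP port, and buffer sizes are illustrative placeholders, not values from the discussion:

    // Sketch: bind a packet-capture thread to one core and enlarge the UDP
    // receive buffer so bursts survive until the thread is scheduled.
    // Core number, port and buffer sizes are placeholders.
    #ifndef _GNU_SOURCE
    #define _GNU_SOURCE
    #endif
    #include <pthread.h>
    #include <sched.h>
    #include <string.h>
    #include <arpa/inet.h>
    #include <sys/socket.h>

    static void *capture_thread(void *arg)
    {
        (void)arg;
        int sock = socket(AF_INET, SOCK_DGRAM, 0);
        int rcvbuf = 64 * 1024 * 1024;              /* 64 MB, illustrative */
        setsockopt(sock, SOL_SOCKET, SO_RCVBUF, &rcvbuf, sizeof(rcvbuf));

        struct sockaddr_in addr;
        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_port = htons(60000);               /* placeholder port */
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        bind(sock, (struct sockaddr *)&addr, sizeof(addr));

        char pkt[9000];                             /* jumbo-frame sized */
        for (;;) {
            if (recv(sock, pkt, sizeof(pkt), 0) < 0) break;
            /* ... copy into a ring buffer for the processing thread ... */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t tid;
        pthread_create(&tid, NULL, capture_thread, NULL);

        /* Lock the capture thread to core 2 (placeholder), ideally on the
           NUMA node / core that services the NIC's interrupts. */
        cpu_set_t cpus;
        CPU_ZERO(&cpus);
        CPU_SET(2, &cpus);
        pthread_setaffinity_np(tid, sizeof(cpus), &cpus);

        pthread_join(tid, NULL);
        return 0;
    }

A real system would combine this with NIC interrupt-coalescing settings (e.g. via ethtool -C) and the kernel-tuning parameters the proposed memo would cover.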

SLIDE 5

HOW TO DO HIGH-SPEED DATA TRANSPORT: A READING LIST FOR THE CURIOUS CASPERITE

  • Digital signal processing using stream high performance computing: a 512-input broadband correlator for radio astronomy, J. Kocz, L. J. Greenhill, B. R. Barsdell et al., arXiv:1401.8288
  • A Scalable Hybrid FPGA/GPU FX Correlator, J. Kocz, L. J. Greenhill, B. R. Barsdell et al., Journal of Astronomical Instrumentation, 2014
  • The Breakthrough Listen Search for Intelligent Life: A Wideband Data Recorder System for the Robert C. Byrd Green Bank Telescope, D. MacMahon, D. C. Price, M. Lebofsky et al., arXiv:1707.06024
  • An Efficient Real-time Data Pipeline for the CHIME Pathfinder Radio Telescope X-Engine, A. Recnik, K. Bandura, N. Denman et al., arXiv:1503.06189

SLIDE 6

HARDWARE CONFIGURATIONS

  • Danny: Breakthrough uses 4U servers from SuperMicro, dual Xeons, and captures raw voltages to disk. After observations, data are played back through NVIDIA 1080 gaming cards – one per node.
  • Typically BTL/GBT use one GPU per box; others use 2–4 GPUs per box. The CHIME correlator uses AMD, with code written in OpenCL.
  • Dan – NVIDIA is into supercomputing; AMD is selling chips to gamers. Can run OpenCL on NVIDIA.
  • CUDA gives you the cuFFT, cuBLAS, and Thrust libraries. Does AMD have equivalents? (A minimal cuFFT sketch follows this list.)
  • The number of PCI Express lanes the CPU can support is important. AMD CPU + NVIDIA GPU may be beneficial.
  • POWER8/9 have “BlueLink” connections. NICs which use BlueLink may be developed. IBM has shown a lot of dedication to giving the GPU as high-speed an interconnect as possible.
  • Vendors: very cheap 10/40 Gb transceivers from FiberStore (fs.com). They also sell 100 Gb switches.
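To illustrate the kind of library support mentioned above, a minimal cuFFT sketch (a batched 1-D FFT such as a spectrometer might run); the transform length and batch size are placeholders, not parameters from the discussion:

    // Sketch: a batched 1-D complex FFT with cuFFT. Sizes are illustrative.
    #include <cufft.h>
    #include <cuda_runtime.h>
    #include <cstdio>

    int main(void)
    {
        const int nchan  = 4096;   /* FFT length (placeholder) */
        const int nbatch = 1024;   /* spectra per call (placeholder) */

        cufftComplex *d_data;
        cudaMalloc(&d_data, sizeof(cufftComplex) * nchan * nbatch);

        /* One plan, reused for every block of input data. */
        cufftHandle plan;
        cufftPlan1d(&plan, nchan, CUFFT_C2C, nbatch);

        /* In a real pipeline d_data would be filled from the capture ring
           buffer; here the transform just runs in place. */
        cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);
        cudaDeviceSynchronize();

        cufftDestroy(plan);
        cudaFree(d_data);
        printf("FFT done\n");
        return 0;
    }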

SLIDE 7

PIPELINES

  • HTGS does the inverse of bifrost. Bifrost binds a thread to an operation; HTGS defines nodes in a graph, and the nodes are bound to CPU threads. The aim is to overlap data transport and computation, giving a hybrid, multicore pipeline. Uses an explicit graph representation throughout.
  • Hashpipe – originally developed for GUPPI (Paul D.), generalized by Dave MacMahon. Not as sophisticated as bifrost/HTGS. Provides support for metadata. Hashpipe does not support forking ring buffers. Simple and straightforward, well documented, CASPER tutorials available. (A sketch of the underlying ring-buffer pattern follows this list.)
  • PSRDADA is similar to hashpipe. Low level. Simple and conservative: use hashpipe or PSRDADA. Bifrost has been used in a single instrument.
  • HTGS is just starting to be prototyped at GB. Unique in using a graph representation – maintained through analysis and execution. Can also use multiple GPUs – formulate a sub-graph and encapsulate it into an execution pipeline graph, bound to a GPU.
  • Should put a link to Tim’s thesis on the CASPER website. Link to the paper is https://link.springer.com/article/10.1007/s11265-017-1262-6
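A greatly simplified sketch of the shared ring-buffer pattern these pipelines are built around (one capture/producer thread, one compute/consumer thread). This is generic C/C++ with pthreads, not the actual API of hashpipe, PSRDADA, or bifrost, and the block count and size are placeholders:

    // Sketch of a producer/consumer ring buffer, the pattern underlying
    // hashpipe / PSRDADA / bifrost (greatly simplified).
    #include <pthread.h>
    #include <stdio.h>

    #define NSLOT  4
    #define SLOTSZ (1 << 20)            /* 1 MB blocks, illustrative */

    static char ring[NSLOT][SLOTSZ];
    static int  head = 0, tail = 0, count = 0;
    static pthread_mutex_t lock      = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  not_full  = PTHREAD_COND_INITIALIZER;
    static pthread_cond_t  not_empty = PTHREAD_COND_INITIALIZER;

    /* Capture thread: fills blocks (stand-in for a packet-receive loop). */
    static void *producer(void *arg)
    {
        (void)arg;
        for (int i = 0; i < 16; i++) {
            pthread_mutex_lock(&lock);
            while (count == NSLOT) pthread_cond_wait(&not_full, &lock);
            ring[head][0] = (char)i;            /* "fill" the block */
            head = (head + 1) % NSLOT; count++;
            pthread_cond_signal(&not_empty);
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    /* Compute thread: drains blocks (stand-in for a GPU processing call). */
    static void *consumer(void *arg)
    {
        (void)arg;
        for (int i = 0; i < 16; i++) {
            pthread_mutex_lock(&lock);
            while (count == 0) pthread_cond_wait(&not_empty, &lock);
            printf("processing block %d\n", ring[tail][0]);
            tail = (tail + 1) % NSLOT; count--;
            pthread_cond_signal(&not_full);
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t p, c;
        pthread_create(&p, NULL, producer, NULL);
        pthread_create(&c, NULL, consumer, NULL);
        pthread_join(p, NULL);
        pthread_join(c, NULL);
        return 0;
    }

The real libraries add the pieces this sketch omits: shared-memory buffers that separate processes can attach to, status/metadata handling, and binding of each thread to a chosen core.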

SLIDE 8

GPU ROADMAP

  • Vega for AMD coming out next week; Volta for NVIDIA. Volta has tensor cores – 4x4 matrix multiplications, 16-bit inputs, 32-bit outputs (designed for AI training / inferencing).
  • CUDA 9 – formalized some of the threading models: can write CUDA kernels that work on a thread block (see the sketch below). No announcement on the GTX line, but a Volta GTX will probably be announced soon. Consumer cards will stick with DDR RAM. An SLI bridge can communicate between cards.
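A minimal sketch of that thread-block-level model using the CUDA 9 cooperative-groups API; the block-sum kernel and its sizes are illustrative placeholders, not something discussed in the session:

    // Sketch: a kernel that operates on an explicit thread-block group
    // (CUDA 9 cooperative groups). Array sizes are illustrative.
    #include <cooperative_groups.h>
    #include <cstdio>
    namespace cg = cooperative_groups;

    __global__ void block_sum(const float *in, float *out)
    {
        cg::thread_block block = cg::this_thread_block();
        __shared__ float scratch[256];

        scratch[block.thread_rank()] = in[blockIdx.x * blockDim.x + threadIdx.x];
        block.sync();                       /* whole-block barrier via the group */

        /* Simple tree reduction within the block. */
        for (int step = block.size() / 2; step > 0; step /= 2) {
            if (block.thread_rank() < step)
                scratch[block.thread_rank()] += scratch[block.thread_rank() + step];
            block.sync();
        }
        if (block.thread_rank() == 0)
            out[blockIdx.x] = scratch[0];
    }

    int main(void)
    {
        const int n = 1024, threads = 256, blocks = n / threads;
        float *d_in, *d_out;
        cudaMalloc(&d_in,  n * sizeof(float));
        cudaMalloc(&d_out, blocks * sizeof(float));
        cudaMemset(d_in, 0, n * sizeof(float));
        block_sum<<<blocks, threads>>>(d_in, d_out);
        cudaDeviceSynchronize();
        cudaFree(d_in); cudaFree(d_out);
        printf("done\n");
        return 0;
    }

Here block.sync() plays the role of __syncthreads(), but as part of an explicit group object that can also be partitioned into smaller tiles.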

SLIDE 10

FPGA ROADMAP

  • Latest generation is UltraScale+; some chips are in production. Lots more memory on chip: 10s of Mbytes -> Gbits. 26 Gbit links, 100 Gb Ethernet on eval boards.
  • $7k for a VCU118 eval board with a $20k chip on it. Not engineered for industrial applications. HBM (high-bandwidth memory): super-high-bandwidth DRAM that connects over the substrate.
  • FPGAs with high-speed ADCs/DACs on chip: 8x 3 Gsps ADCs/DACs. Not generally available yet; will be out at the end of the year.
  • Working on 7 nm chips – no date for availability yet. Dan: for performance/$, use the latest generation family, but a medium-size chip.
  • Can buy VCU118 boards in bulk. Power to the FPGA is throttled to 60 W (?). May be a problem for full utilization, but looks encouraging. Full investigation not complete.