SLIDE 1

Helping the Discovery of New Galaxies on the World’s Largest Telescopes Using a Large GPU Cluster

Damien Gratadour1 and Hatem Ltaief2

1 LESIA, Observatoire de Paris and Université Paris Diderot, France

2Extreme Computing Research Center, KAUST, Saudi Arabia

NVIDIA GTC Conference at San Jose, CA March 26-29, 2018

  • D. Gratadour & H. Ltaief

MOAO on Large GPU Cluster 1 / 61

SLIDE 2

Outline

1. The European Extremely Large Telescope
2. Ubiquitous Taskification
3. Targeting Large-Scale GPU Supercomputer
4. Performance Results
5. Summary and Future Work

SLIDE 3

Acknowledgments

Students/Collaborators

Extreme Computing Research Center @ KAUST

  • A. Charara and D. Keyes

L’Observatoire de Paris, LESIA

  • R. Dembet, N. Doucet, E. Gendron, D. Gratadour, C. Morel,
  • A. Sevin and F. Vidal

Innovative Computing Laboratory @ UTK

  • PLASMA/MAGMA/PaRSEC teams

INRIA/INP Bordeaux, France

  • Runtime/HiePACS teams

SLIDE 4

Acknowledgments

Support/Funding

Funded partially by the French National Center for Scientific Research (CNRS, 2016)
Funded partially by the European Commission (Horizon 2020 program, FET-HPC grant #671662)
For core-hours:

  • Tsubame 3.0, Tokyo, Japan (Tokyo Institute of Technology)
  • Shaheen 2.0, Thuwal, KSA (KAUST Supercomputer Lab)

SLIDE 5

E-ELT

Outline

1. The European Extremely Large Telescope
2. Ubiquitous Taskification
3. Targeting Large-Scale GPU Supercomputer
4. Performance Results
5. Summary and Future Work

SLIDE 6

E-ELT

The effect of atmospheric turbulence

The sun observed with a compact camera.

  • Disturbs the trajectory of light rays (wavefront perturbations)
  • Reduces astronomical image quality

SLIDE 7

E-ELT

Adaptive optics (AO)

AO: a technique used to compensate in real time for wavefront perturbations, providing a significant improvement in resolution.
The moon observed with an 8m telescope (left: without AO; right: with AO).

SLIDE 8

E-ELT

How AO works

SLIDE 9

E-ELT

The Top 10 (present and future) Ground-based Telescopes

Rank  Name                                        Location      Diameter  Cost         Year
10    Large Synoptic Survey Telescope (LSST)      Chile         8.4m      450 million  2014
9     South African Large Telescope (SALT)        South Africa  9.2m      36 million   2005
8     Keck                                        USA           10m       100 million  1996
7     Gran Telescopio Canarias (GTC)              Spain         10.4m     130 million  2009
6     Arecibo Observatory                         Puerto Rico   305m      9.3 million  1963

SLIDE 10

E-ELT

The Top 10 (present and future) Ground-based Telescopes

Rank  Name                                        Location      Diameter  Cost         Year
10    Large Synoptic Survey Telescope (LSST)      Chile         8.4m      450 million  2014
9     South African Large Telescope (SALT)        South Africa  9.2m      36 million   2005
8     Keck                                        USA           10m       100 million  1996
7     Gran Telescopio Canarias (GTC)              Spain         10.4m     130 million  2009
6     Arecibo Observatory                         Puerto Rico   305m      9.3 million  1963
5     Atacama Large Millimeter Array (ALMA)       Chile         12m       1.4 billion  2013
4     Giant Magellan Telescope (GMT)              Chile         24.5m     2.2 billion  2024
3     Thirty Meter Telescope (TMT)                USA           30m       1.4 billion  2030
2     Square Kilometer Array (SKA)                Australia     90m       2 billion    2020
1     European Extremely Large Telescope (E-ELT)  Chile         39m       1.3 billion  2024

SLIDE 11

E-ELT

The Top 10 (present and future) Ground-based Telescopes


Consortium: a multi-nation initiative

Src: http://www.space.com/14075-10-biggest-telescopes-earth-comparison.html

SLIDE 12

E-ELT

The World’s Biggest Eye on The Sky

Credits: ESO (http://www.eso.org/public/teles-instr/e-elt/)

SLIDE 13

E-ELT

The World’s Biggest Eye on The Sky

Credits: ESO (http://www.eso.org/public/teles-instr/e-elt/) The largest optical/near-infrared telescope in the world. It will weigh about 2700 tons with a main mirror diameter of 39m. Location: Chile, South America.

SLIDE 14

E-ELT

Multi-object Adaptive Optics (MOAO)

Probably the most challenging embedded instrument in the E-ELT.

  • Observes the most distant galaxies in parallel, with high multiplex, to understand their evolution
  • Capable of exploiting a Field of View (FoV) of 7 to 10 arcminutes (1/4 of the full moon) with a resolution of a few tens of milliarcseconds (1/20,000 of the full moon)
  • Uses multiple guide stars (tomographic measurement of the turbulence) and multiple deformable mirrors (direction-specific compensation)
  • Extremely compute intensive at full scale

SLIDE 15

E-ELT

Multi-object Adaptive Optics (MOAO)

Credits: ESO

SLIDE 16

E-ELT

Multi-object Adaptive Optics (MOAO)

Creating artificial guide stars. Credits: A. Reeves, Durham University.

SLIDE 17

E-ELT

Multi-object Adaptive Optics (MOAO)

Probably the most challenging embedded instrument in the E-ELT. Extremely compute intensive at full scale.

  • Today: need to simulate the system to provide efficient observation forecasts (evaluate the possible science return, perform design studies)
  • Tomorrow (2024): need to drive the system in real time for routine operations

Core component of the system: the real-time controller, which computes commands for the deformable mirror actuators from the measurements of multiple wavefront sensors.
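The controller's core operation described above reduces to a matrix-vector product per frame. A minimal sketch, with toy sizes and random data (the deck's actual implementation is C/CUDA on DGX-1; names and dimensions here are illustrative assumptions):

```python
import numpy as np

# Illustrative sketch, NOT the deck's real-time code: deformable-mirror commands
# are obtained by applying the tomographic reconstructor R to the stacked
# wavefront-sensor measurement vector, once per ~1 ms frame.
rng = np.random.default_rng(42)
n_act, n_meas = 8, 32                      # toy sizes; E-ELT scale is ~5k actuators, ~100k measurements
R = rng.standard_normal((n_act, n_meas))   # reconstructor, provided by the supervisor
meas = rng.standard_normal(n_meas)         # one frame of sensor measurements

commands = R @ meas                        # one matrix-vector product per frame
assert commands.shape == (n_act,)
```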

SLIDE 18

E-ELT

AO real-time controller

Heterogeneous HPC facility

SLIDE 19

E-ELT

AO real-time controller

Heterogeneous HPC facility. Real-time box: 1 ms response time, from ~100k measurements to ~5k actuators (see J. Bernard's presentation @ GTC'17 for our implementation on DGX-1). Similar to real-time inference.

SLIDE 20

E-ELT

AO real-time controller

Heterogeneous HPC facility. Here we concentrate on the supervisor sub-system, a critical component for operations and observation forecasts.

SLIDE 21

E-ELT

AO supervisor sub-system

Cost-function optimization for parameter identification (Learn stage). Linear algebra for reconstructor matrix computation (Apply stage).

SLIDE 22

E-ELT

Learn process

Fitting a covariance matrix to a model including system and turbulence parameters, using a score function.

  • Levenberg-Marquardt algorithm for function optimization
  • Future development: rely on machine learning approaches

Example of a fitted turbulence profile.
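The slides name Levenberg-Marquardt but do not show the Learn code. A minimal single-parameter sketch of the algorithm on a toy model (the model, data, and damping schedule are assumptions for illustration, not the covariance fit itself):

```python
import numpy as np

# Minimal Levenberg-Marquardt: fit a in y = exp(a*x) to noiseless data.
# Damped step: delta = -(J^T J + lam*I)^{-1} J^T r, accepted only if it
# reduces the sum of squared residuals (lam interpolates Gauss-Newton and
# gradient descent).
x = np.linspace(0.0, 1.0, 20)
a_true = 0.5
y = np.exp(a_true * x)

a, lam = 0.0, 1e-3
for _ in range(50):
    r = np.exp(a * x) - y                       # residuals
    J = (x * np.exp(a * x))[:, None]            # Jacobian dr/da, shape (20, 1)
    step = np.linalg.solve(J.T @ J + lam * np.eye(1), -J.T @ r).item()
    if np.sum((np.exp((a + step) * x) - y) ** 2) < np.sum(r ** 2):
        a += step
        lam *= 0.5                              # accept: move toward Gauss-Newton
    else:
        lam *= 2.0                              # reject: more gradient-descent-like

assert abs(a - a_true) < 1e-6
```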

SLIDE 23

E-ELT

Learn process

Multi-GPU process, including matrix generation and the LM fit. Time to solution on a DGX-1 (P100) for a matrix size of 86k and 40 turbulence layers: 240 s (4 minutes). Weak and strong scaling for the Learn process.

SLIDE 24

E-ELT

Apply process: tomographic reconstructor

Compute the tomographic reconstructor matrix using the covariance matrix between the direction-specific truth sensor and the other sensors, and the inverse of the measurements covariance matrix:

R = Ctm · Cmm^-1

Factorizing and solving for R, with Cmm a 100k x 100k matrix, is extremely compute intensive. At the core of system operations (soft real time: should complete within seconds to update the real-time box). Also a critical component of the numerical simulation of the system behavior (observation forecasts for today's design studies).
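Since Cmm is a symmetric positive-definite covariance, "factorize and solve" means a Cholesky factorization (the DPOTRF step) followed by triangular solves, rather than an explicit inverse. A toy-sized sketch of that pipeline (not the deck's C/CUDA/KBLAS code; sizes and random data are assumptions):

```python
import numpy as np

# Illustrative Apply stage: solve R * Cmm = Ctm via Cholesky of the SPD
# measurement covariance Cmm, i.e. R = Ctm * Cmm^{-1} without forming Cmm^{-1}.
rng = np.random.default_rng(0)
n_meas, n_truth = 6, 3                         # toy sizes; the real Cmm is ~100k x 100k
A = rng.standard_normal((n_meas, n_meas))
Cmm = A @ A.T + n_meas * np.eye(n_meas)        # SPD measurement covariance
Ctm = rng.standard_normal((n_truth, n_meas))   # truth-sensor / measurement covariance

L = np.linalg.cholesky(Cmm)                    # Cmm = L L^T (the DPOTRF step)
# R Cmm = Ctm  <=>  Cmm R^T = Ctm^T, solved by two triangular solves (TRSM steps)
Y = np.linalg.solve(L, Ctm.T)
R = np.linalg.solve(L.T, Y).T

assert np.allclose(R @ Cmm, Ctm)               # the reconstructor satisfies R Cmm = Ctm
```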

SLIDE 25

E-ELT

AO simulation pipeline: observations forecast

The goal is to produce image-quality maps over the instrument's full FoV, depending on the evolution of the turbulence conditions over the night.

SLIDE 26

E-ELT

AO simulation pipeline: observations forecast

From system parameters and turbulence evolution timeline to image quality at the output of the system

SLIDE 27

HPC Tasking

Outline

1. The European Extremely Large Telescope
2. Ubiquitous Taskification
3. Targeting Large-Scale GPU Supercomputer
4. Performance Results
5. Summary and Future Work

SLIDE 28

HPC Tasking

Global Workflow Chart

(Workflow diagram: system parameters and turbulence parameters → matcov → Cmm, Ctm → ToR computation → R; BLAS steps → Ctt, Cee, Cvv → inter-sample → observing sequence.)

SLIDE 29

HPC Tasking

Blocked Algorithms: Fork-Join Paradigm

SLIDE 30

HPC Tasking

LAPACK: Blocked Algorithms

Principles: the Panel-Update sequence

  • Transformations are blocked/accumulated within the panel (Level-2 BLAS)
  • Transformations applied at once on the trailing submatrix (Level-3 BLAS)
  • Parallelism hidden inside the BLAS
  • Fork-join model: a broken model!

SLIDE 31

HPC Tasking

Tile Data Layout Format

LAPACK: column-major format PLASMA/CHAMELEON: tile format

SLIDE 32

HPC Tasking

PLASMA/CHAMELEON: Tile Algorithms

PLASMA ⇒ http://icl.cs.utk.edu/plasma/
CHAMELEON ⇒ https://gitlab.inria.fr/solverstack/chameleon.git

  • Break the bulk-synchronous programming model: parallelism is brought to the fore
  • May require the redesign of linear algebra algorithms and a tile data layout translation
  • Remove unnecessary synchronization points between Panel-Update sequences
  • DAG execution, where nodes represent tasks and edges define dependencies between them
  • Default dynamic runtime system environment: StarPU (but could use Quark, PaRSEC, OmpSs, OpenMP, etc.)
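The tile algorithm these libraries implement can be sketched sequentially; the four per-tile kernels below (POTRF/TRSM/SYRK/GEMM) are exactly the task types whose dependencies form the DAG a runtime such as StarPU schedules. This NumPy rendition is illustrative only (not the PLASMA/CHAMELEON code; tile sizes and the test matrix are made up):

```python
import numpy as np

# Right-looking tile Cholesky on an nt x nt grid of nb x nb tiles.
# Sequential here: only the data dependencies (the DAG edges) are shown,
# not the parallelism a runtime would extract from them.
nb, nt = 4, 3
n = nb * nt
rng = np.random.default_rng(0)
M = rng.standard_normal((n, n))
A0 = M @ M.T + n * np.eye(n)                   # SPD input matrix
T = [[A0[i*nb:(i+1)*nb, j*nb:(j+1)*nb].copy() for j in range(nt)]
     for i in range(nt)]

for k in range(nt):
    T[k][k] = np.linalg.cholesky(T[k][k])      # POTRF(k,k)
    for m in range(k + 1, nt):
        # TRSM(m,k): T[m][k] <- T[m][k] * L_kk^{-T}
        T[m][k] = np.linalg.solve(T[k][k], T[m][k].T).T
    for m in range(k + 1, nt):
        T[m][m] -= T[m][k] @ T[m][k].T         # SYRK(m,m)
        for j in range(k + 1, m):
            T[m][j] -= T[m][k] @ T[j][k].T     # GEMM(m,j)

L = np.zeros((n, n))                           # assemble the lower-triangular factor
for i in range(nt):
    for j in range(i + 1):
        L[i*nb:(i+1)*nb, j*nb:(j+1)*nb] = T[i][j] if j < i else np.tril(T[i][i])

assert np.allclose(L @ L.T, A0)                # the factorization is correct
```

Note how, e.g., GEMM(m,j) at step k depends only on TRSM(m,k) and TRSM(j,k), not on the whole panel: removing those artificial synchronization points is the point of the tile formulation.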

SLIDE 33

HPC Tasking

StarPU Runtime System 101

Provides:
  ⇒ Task scheduling
  ⇒ Memory management
  ⇒ Out-of-core execution

Supports:
  ⇒ SMP/multicore processors (x86, PPC, ...)
  ⇒ NVIDIA GPUs (e.g., multi-GPU)
  ⇒ Hybrid architectures
  ⇒ Shared and distributed memory

SLIDE 34

HPC Tasking

StarPU Runtime System: User Productivity!

starpu_insert_task(&cl_dpotrf,
    VALUE,    &uplo, sizeof(char),
    VALUE,    &n,    sizeof(int),
    INOUT,    Ahandle(k, k),
    VALUE,    &lda,  sizeof(int),
    OUTPUT,   &info, sizeof(int),
    CALLBACK, profiling ? cl_dpotrf_callback : NULL, NULL,
    0);

SLIDE 35

HPC Tasking

Global Workflow Chart

(Workflow diagram: system parameters and turbulence parameters → matcov → Cmm, Ctm → ToR computation → R; BLAS steps → Ctt, Cee, Cvv → inter-sample → observing sequence.)

SLIDE 36

HPC Tasking

Directed Acyclic Graph for MOAO

Figure: DAG representation of the overall MOAO framework execution for the calculation of three ToR (i.e., three iterations of the outer loop) and a single observing sequence (i.e., one iteration of the inner loop) on a two-by-two tile matrix.

SLIDE 37

HPC Tasking

Directed Acyclic Graph for MOAO

Figure: DAG representation of the overall MOAO framework execution for the calculation of three ToR (i.e., three iterations of the outer loop) and a single observing sequence (i.e., one iteration of the inner loop) on a two-by-two tile matrix.

SLIDE 38

HPC Tasking

Zooming in...

SLIDE 39

HPC Tasking

MOAO Software Release – https://github.com/ecrc/moao

A HIGH PERFORMANCE MULTI-OBJECT ADAPTIVE OPTICS FRAMEWORK FOR GROUND-BASED ASTRONOMY

The Multi-Object Adaptive Optics (MOAO) framework provides a comprehensive testbed for high performance computational astronomy. In particular, the European Extremely Large Telescope (E-ELT) is one of today's most challenging projects in ground-based astronomy and will make use of a MOAO instrument based on turbulence tomography. The MOAO framework uses a novel compute-intensive pseudo-analytical approach to achieve close to real-time data processing on manycore architectures. The scientific goal of the MOAO simulation package is to dimension future E-ELT instruments and to assess the qualitative performance of tomographic reconstruction of the atmospheric turbulence on real datasets.

DOWNLOAD THE SOFTWARE AT http://github.com/ecrc/moao

THE MULTI-OBJECT ADAPTIVE OPTICS TECHNIQUE

(Illustrations: single conjugate AO concept; open-loop tomography concept; observing the GOODS South cosmological field with MOAO.)

MOAO 0.1.0

  • Tomographic Reconstructor Computation
  • Dimensioning of Future Instruments
  • Real Datasets
  • Single and Double Precisions
  • Shared-Memory Systems
  • Task-based Programming Models
  • Dynamic Runtime Systems
  • Hardware Accelerators

CURRENT RESEARCH

  • Distributed-Memory Systems
  • Hierarchical Matrix Compression
  • Machine Learning for Atmospheric Turbulence
  • High Resolution Galaxy Map Generation
  • Extend to other ground-based telescope projects

PERFORMANCE RESULTS TOMOGRAPHIC RECONSTRUCTOR COMPUTATION – DOUBLE PRECISION

High res. map of the quality of turbulence compensation obtained with MOAO on a cosmological field

THE PSEUDO-ANALYTICAL APPROACH

(Workflow diagram: system parameters and turbulence parameters → matcov → Cmm, Ctm → ToR computation → R; BLAS steps → Ctt, Cee, Cvv → inter-sample → observing sequence.)

  • Compute the tomographic error:

Cee = Ctt - Ctm R^T - R Ctm^T + R Cmm R^T

  • Compute the equivalent phase map:

Cvv = D Cee D^T

  • Generate the point spread function image

Two-socket 18-core Intel HSW – 64-core Intel KNL – 8 NVIDIA GPU P100s (DGX-1)

  • Solve for the tomographic reconstructor R: R · Cmm = Ctm. This is one tomographic reconstructor every 25 seconds!
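The error formula above can be checked numerically: with the optimal R solving R Cmm = Ctm, the cross terms cancel and Cee collapses to Ctt - Ctm Cmm^-1 Ctm^T. A toy-sized verification (sizes and random covariances are illustrative assumptions):

```python
import numpy as np

# Verify: Cee = Ctt - Ctm R^T - R Ctm^T + R Cmm R^T reduces to
# Ctt - Ctm Cmm^{-1} Ctm^T when R = Ctm Cmm^{-1}.
rng = np.random.default_rng(3)
nm, ns = 10, 4                              # measurements, truth-sensor slopes (toy sizes)
A = rng.standard_normal((nm, nm))
Cmm = A @ A.T + nm * np.eye(nm)             # SPD measurement covariance
Ctm = rng.standard_normal((ns, nm))
B = rng.standard_normal((ns, ns))
Ctt = B @ B.T                               # truth covariance

R = Ctm @ np.linalg.inv(Cmm)                # the optimal reconstructor, R Cmm = Ctm
Cee = Ctt - Ctm @ R.T - R @ Ctm.T + R @ Cmm @ R.T

assert np.allclose(Cee, Ctt - Ctm @ np.linalg.inv(Cmm) @ Ctm.T)
```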

(Plots: TFlops/s and time (s) vs. matrix size from 10,000 to 110,000, comparing DGX-1 peak, DGX-1, KNL, and Haswell.)
SLIDE 40

Targeting Large-Scale GPU Supercomputer

Outline

1. The European Extremely Large Telescope
2. Ubiquitous Taskification
3. Targeting Large-Scale GPU Supercomputer
4. Performance Results
5. Summary and Future Work

SLIDE 41

Targeting Large-Scale GPU Supercomputer

Global Workflow Chart

(Workflow diagram: system parameters and turbulence parameters → matcov → Cmm, Ctm → ToR computation → R; BLAS steps → Ctt, Cee, Cvv → inter-sample → observing sequence.)

SLIDE 42

Targeting Large-Scale GPU Supercomputer

Leveraging ScaLAPACK

2D block-cyclic data distribution: process grid; logical view (matrix); local view (CPUs).
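The mapping the slide illustrates can be sketched in a few lines: tile (i, j) of the logical matrix is owned by process (i mod P, j mod Q) of a P x Q grid, so tiles cycle over processes in both dimensions and the load stays balanced. The grid and tile counts below are made-up toy values:

```python
from collections import Counter

# ScaLAPACK-style 2D block-cyclic ownership: tile (i, j) -> process (i % P, j % Q).
P, Q = 2, 3                 # process grid
nt = 6                      # matrix is nt x nt tiles
owner = {(i, j): (i % P, j % Q) for i in range(nt) for j in range(nt)}

loads = Counter(owner.values())
# Every process owns the same number of tiles when P and Q divide nt.
assert len(loads) == P * Q
assert all(count == nt * nt // (P * Q) for count in loads.values())
```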

SLIDE 43

Targeting Large-Scale GPU Supercomputer

New DPOTRF @ KBLAS – https://github.com/ecrc/kblas

GPU-resident using recursive formulation

SLIDE 44

Targeting Large-Scale GPU Supercomputer

Programming Model and Optimizations

StarPU as the master of ceremony: task-based + MPI + Pthreads + CUDA.

  • StarPU eager scheduler
  • GPU Direct support: cudaMemcpyPeer between GPUs
  • RDMA across GPU nodes

SLIDE 45

Performance Results

Outline

1. The European Extremely Large Telescope
2. Ubiquitous Taskification
3. Targeting Large-Scale GPU Supercomputer
4. Performance Results
5. Summary and Future Work

SLIDE 46

Performance Results

Environment Settings - DGX-1 w/ P100 and V100

Compute Node w/ P100 and V100

SLIDE 47

Performance Results

Environment Settings - DGX-1 w/ P100 and V100

Hardware and software description:

CPU hardware
  • Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz (40 Broadwell cores)
  • 512 GB of main memory

GPU hardware
  • DGX-1 with 8 NVIDIA P100 / V100 GPUs
  • 12 GB / 16 GB of memory per GPU
  • NVLink between GPUs
  • PCIe 16x Gen 3

Software
  • GCC compiler suite
  • CUDA 9.0
  • StarPU v1.2.3
  • Chameleon v1.0
  • Intel MKL 2018.1.163

SLIDE 48

Performance Results

Environment Settings - Shaheen 2.0

Compute Node x 6174

SLIDE 49

Performance Results

Environment Settings - Shaheen 2.0

Hardware and software description:

CPU system (22nm)
  • Cray XC40
  • Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.3GHz (32 Haswell cores)
  • 128 GB of DDR4 main memory
  • Aries interconnect

Software
  • GCC compiler suite
  • StarPU v1.2.3
  • Chameleon v1.0
  • Intel MKL 2018.1.163

SLIDE 50

Performance Results

Environment Settings - Tsubame 3.0

Compute Node x 540

SLIDE 51

Performance Results

Environment Settings - Tsubame 3.0

CPU hardware
  • Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.4GHz (28 Broadwell cores)
  • 256 GB of DDR4 main memory
  • Intel Omni-Path interconnect

GPU hardware
  • 4 NVIDIA Tesla P100 GPUs per node
  • 12 GB of memory per GPU
  • NVLink between GPUs
  • PCIe 16x Gen 3

Software
  • GCC compiler suite
  • CUDA 8.0
  • StarPU v1.2.3
  • Chameleon v1.0
  • Intel MKL 2018.1.163

SLIDE 52

Performance Results

E-ELT Application Apparatus

  • Telescope diameter: 39m
  • Number of measurements of the truth sensor: 10,240
  • Number of actuators: 5,120
  • Number of science channels (i.e., #galaxies): 16, 32, and 64
  • Performance of the ToR computation

SLIDE 53

Performance Results

Performance Evolution of the ToR Computations: GTC’17

(Plot: ToR time to solution in seconds, log scale, vs. matrix size from 10,000 to 100,000, for successive platforms: 16-core Intel SDB + 8 NVIDIA K20s (2012), 40-core Intel IVB + 8 NVIDIA K40s (2013), 36-core Intel HSW + 8 NVIDIA K80s (2014), 28-core Intel BDW (2016), 64-core Intel KNL (2016), and NVIDIA DGX-1 (2016).)

SLIDE 54

Performance Results

ToR Performance using GPU-resident DPOTRF@KBLAS

SLIDE 55

Performance Results

ToR Performance (TFlop/s) on Distr.-Mem. Systems

SLIDE 56

Performance Results

ToR Performance (TFlop/s) on Distr.-Mem. Systems

SLIDE 57

Performance Results

ToR Performance (TFlop/s) on Distr.-Mem. Systems

SLIDE 58

Performance Results

ToR Time to Solution on DGX1-P100 and DGX1-V100

SLIDE 59

Performance Results

ToR Time to Solution on Distr.-Mem. Systems

SLIDE 60

Summary and Future Work

Outline

1. The European Extremely Large Telescope
2. Ubiquitous Taskification
3. Targeting Large-Scale GPU Supercomputer
4. Performance Results
5. Summary and Future Work

SLIDE 61

Summary and Future Work

Summary

  • Adaptive optics requires massively parallel hardware accelerators
  • Efficient task-based programming model, pipelining the computational stages
  • Dense, tightly-connected GPU-based compute node (i.e., DGX-1) for real-time processing
  • Distributed-memory GPU-based compute nodes for large-scale simulations
  • MOAO effort for standardization (w/ O. Guyon from the Subaru telescope)
  • Machine learning for the control matrix (e.g., maximum likelihood)

SLIDE 62

Summary and Future Work

SLIDE 63

Summary and Future Work

Questions?
