SLIDE 1

Efficient Observations Forecast for the World’s Biggest Eye Using DGX-1

Damien Gratadour¹ and Hatem Ltaief²

¹ LESIA, Observatoire de Paris and Université Paris Diderot, France

² Extreme Computing Research Center, KAUST, Saudi Arabia

NVIDIA GTC at San Jose, CA May 8-11, 2017

D. Gratadour & H. Ltaief

MOAO Using DGX-1 1 / 48

SLIDE 2

Outline

1. The European Extremely Large Telescope
2. Leveraging Task-based Programming Model
3. Parallel Implementation
4. Performance Results
5. Conclusion

SLIDE 3

Acknowledgments

Students/Collaborators

- Extreme Computing Research Center @ KAUST: A. Charara and D. Keyes
- L’Observatoire de Paris, LESIA: R. Dembet, N. Doucet, E. Gendron, D. Gratadour, C. Morel, A. Sevin and F. Vidal
- Innovative Computing Laboratory @ UTK: PLASMA/MAGMA/PaRSEC Teams
- INRIA/INP Bordeaux, France: Runtime/HiePACS Teams

SLIDE 4

Acknowledgments

Support/Funding

- KAUST IT Research Computing support
- NVIDIA hardware donations (can we get more? ∗)
- Funded partially by the French National Center for Scientific Research (CNRS, 2016)
- Funded partially by the European Commission (Horizon 2020 program, FET-HPC grant #671662)

∗ For free

SLIDE 6

E-ELT

The effect of atmospheric turbulence

Figure: the Sun observed with a compact camera

Atmospheric turbulence:
- Disturbs the trajectory of light rays (wavefront perturbations)
- Reduces the quality of astronomical images

SLIDE 7

E-ELT

Adaptive optics (AO)

AO: a technique used to compensate for wavefront perturbations in real time, providing a significant improvement in resolution.

Figure: the Moon observed with an 8 m telescope (left: no AO, right: with AO)

SLIDE 8

E-ELT

How AO works

SLIDE 11

E-ELT

The Top 10 (present and future) Ground-based Telescopes

Rank  Name                                        Location      Diameter  Cost         Year
10    Large Synoptic Survey Telescope (LSST)      Chile         8.4 m     450 million  2014
9     South African Large Telescope (SALT)        South Africa  9.2 m     36 million   2005
8     Keck                                        USA           10 m      100 million  1996
7     Gran Telescopio Canarias (GTC)              Spain         10.4 m    130 million  2009
6     Arecibo Observatory                         Puerto Rico   305 m     9.3 million  1963
5     Atacama Large Millimeter Array (ALMA)       Chile         12 m      1.4 billion  2013
4     Giant Magellan Telescope (GMT)              Chile         24.5 m    2.2 billion  2024
3     Thirty Meter Telescope (TMT)                USA           30 m      1.4 billion  2030
2     Square Kilometer Array (SKA)                Australia     90 m      2 billion    2020
1     European Extremely Large Telescope (E-ELT)  Chile         39 m      1.3 billion  2024

Consortium: multiple nation initiatives

Src: http://www.space.com/14075-10-biggest-telescopes-earth-comparison.html

SLIDE 13

E-ELT

The World’s Biggest Eye on The Sky

Credits: ESO (http://www.eso.org/public/teles-instr/e-elt/)

- The largest optical/near-infrared telescope in the world.
- It will weigh about 2700 tons, with a main mirror diameter of 39 m.
- Location: Chile, South America.

SLIDE 14

E-ELT

Multi-object Adaptive Optics (MOAO)

- Probably the most challenging embedded instrument in the E-ELT.
- Observes the most distant galaxies in parallel, with a high multiplex, to understand their evolution.
- Capable of exploiting a Field of View (FoV) of 7 to 10 arcminutes (1/4 of the full moon) with a resolution of a few tens of milli-arcseconds (1/20,000 of the full moon).
- Uses multiple guide stars (tomographic measurement of the turbulence) and multiple deformable mirrors (direction-specific compensation).
- Extremely compute-intensive at full scale.

SLIDE 15

E-ELT

Multi-object Adaptive Optics (MOAO)

Credits: ESO

SLIDE 16

E-ELT

Multi-object Adaptive Optics (MOAO)

Figure: creating artificial guide stars. Credits: A. Reeves, Durham University

SLIDE 17

E-ELT

Multi-object Adaptive Optics (MOAO)

- Probably the most challenging embedded instrument in the E-ELT.
- Extremely compute-intensive at full scale.
- Today: need to simulate the system to provide an efficient observations forecast (evaluate the possible science return, perform design studies).
- Tomorrow (2024): need to drive the system in real time for routine operations.
- Core component of the system: the real-time controller, which computes commands for the deformable mirror actuators from the measurements of multiple wavefront sensors.
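
As a rough illustration of what such a controller does at each step, the sketch below assumes a linear control law in which a reconstructor matrix R maps the vector of wavefront-sensor measurements to actuator commands. The function name, sizes and numbers are hypothetical and chosen only to keep the example small. They are not taken from the talk.

```python
# Hypothetical sketch: one linear control step, commands = R * slopes.
# The matrix values and sizes are made up for illustration.

def apply_reconstructor(R, slopes):
    """Return the actuator command vector R @ slopes (pure-Python matvec)."""
    if any(len(row) != len(slopes) for row in R):
        raise ValueError("each row of R must match the number of measurements")
    return [sum(r_ij * s_j for r_ij, s_j in zip(row, slopes)) for row in R]


# Two actuators driven by three wavefront-sensor measurements (toy numbers).
R = [[0.5, 0.0, -0.5],
     [0.25, 0.5, 0.25]]
slopes = [2.0, 4.0, 6.0]
commands = apply_reconstructor(R, slopes)
print(commands)
```

At full scale the same operation is a large dense matrix-vector product, which is why it is usually delegated to an optimized BLAS routine rather than written by hand.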

SLIDE 19

E-ELT

AO real-time controller

- Heterogeneous HPC facility
- Real-time box: stay tuned for J. Bernard's presentation later in this session about our implementation on DGX-1

SLIDE 20

E-ELT

AO real-time controller

- Heterogeneous HPC facility
- Here we concentrate on the supervisor sub-system
- Critical component for operations and observations forecast

SLIDE 21

E-ELT

AO supervisor sub-system

- Cost function optimization for parameter identification (Learn stage)
- Linear algebra for reconstructor matrix computation (Apply stage)

SLIDE 22

E-ELT

Learn process

- Fitting a covariance matrix on a model including system and turbulence parameters, using a score function
- Levenberg-Marquardt algorithm for function optimization

Figure: example of fitted turbulence profile
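
To make the fitting step more concrete, here is a minimal Levenberg-Marquardt sketch on a toy problem. It does not fit a covariance matrix: as a stand-in, it fits the two parameters of a hypothetical model y = a * exp(-b * x) to noise-free synthetic data. The model, the starting point and the damping schedule (divide or multiply lambda by 10) are illustrative assumptions, not the settings used in the talk.

```python
import math

def model(a, b, x):
    # Toy stand-in for the parametric model being fitted (an assumption).
    return a * math.exp(-b * x)

def cost(a, b, xs, ys):
    # Sum of squared residuals, playing the role of the score function.
    return sum((model(a, b, x) - y) ** 2 for x, y in zip(xs, ys))

def levenberg_marquardt(xs, ys, a, b, lam=1e-3, max_iter=200, tol=1e-12):
    current = cost(a, b, xs, ys)
    for _ in range(max_iter):
        # Residuals and Jacobian columns (d/da, d/db) at the current point.
        r = [model(a, b, x) - y for x, y in zip(xs, ys)]
        ja = [math.exp(-b * x) for x in xs]
        jb = [-a * x * math.exp(-b * x) for x in xs]
        # Entries of J^T J and J^T r.
        h11 = sum(u * u for u in ja)
        h12 = sum(u * v for u, v in zip(ja, jb))
        h22 = sum(v * v for v in jb)
        g1 = sum(u * ri for u, ri in zip(ja, r))
        g2 = sum(v * ri for v, ri in zip(jb, r))
        # Damped normal equations: (J^T J + lam * diag(J^T J)) delta = -J^T r,
        # solved in closed form because the system is only 2 x 2.
        d11, d22 = h11 * (1.0 + lam), h22 * (1.0 + lam)
        det = d11 * d22 - h12 * h12
        da = (-g1 * d22 + g2 * h12) / det
        db = (-g2 * d11 + g1 * h12) / det
        if abs(da) + abs(db) < tol:
            break
        trial = cost(a + da, b + db, xs, ys)
        if trial < current:
            a, b, current = a + da, b + db, trial
            lam /= 10.0  # good step: behave more like Gauss-Newton
        else:
            lam *= 10.0  # bad step: behave more like gradient descent
    return a, b


# Synthetic data generated with a = 2.0 and b = 0.5 (made-up values).
xs = [0.5 * i for i in range(11)]
ys = [model(2.0, 0.5, x) for x in xs]
a_fit, b_fit = levenberg_marquardt(xs, ys, a=1.5, b=0.7)
print(round(a_fit, 6), round(b_fit, 6))
```

In the real Learn stage the unknowns would be system and turbulence parameters and each evaluation would regenerate a large covariance matrix, so the cost of a single iteration is dominated by matrix generation rather than by the small linear solve shown here.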

SLIDE 23

E-ELT

Learn process

- Multi-GPU process, including matrix generation and LM fit
- Time to solution on DGX-1 for a matrix size of 86k and 40 turbulence layers: 240 s (4 minutes)

Figure: weak and strong scaling for the learn process

SLIDE 24

E-ELT

Apply process: tomographic reconstructor

- Compute the tomographic reconstructor matrix using the covariance matrix between the direction-specific truth sensor and the other sensors, and the inverse of the measurements covariance matrix:

  $R' = C_{tm} \cdot C_{mm}^{-1}$

- Factorize and Solve for R′ with C_mm, a 100k x 100k matrix, is extremely compute-intensive
- At the core of system operations (soft real-time, should be achieved in seconds to update the real-time box)
- Also a critical component for the numerical simulation of the system behavior (observation forecast for today’s design studies)
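
The "factorize and solve" route can be sketched as follows: rather than forming the inverse of C_mm explicitly, factor it once with Cholesky and obtain each row of R′ with two triangular solves. This is a small pure-Python sketch that assumes C_mm is symmetric positive definite (as a covariance matrix normally is). The 2 x 2 values are made up, and the function names are hypothetical.

```python
import math

def cholesky(A):
    """Return lower-triangular L with A = L L^T (A symmetric positive definite)."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        d = A[j][j] - sum(L[j][k] ** 2 for k in range(j))
        L[j][j] = math.sqrt(d)
        for i in range(j + 1, n):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = (A[i][j] - s) / L[j][j]
    return L

def solve_spd(L, b):
    """Solve (L L^T) x = b by forward, then backward substitution."""
    n = len(L)
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

def reconstructor(C_tm, C_mm):
    """R' = C_tm * C_mm^{-1}, computed without forming the inverse."""
    L = cholesky(C_mm)
    # C_mm is symmetric, so each row r of R' satisfies C_mm r^T = (row of C_tm)^T.
    return [solve_spd(L, row) for row in C_tm]


C_mm = [[4.0, 2.0],
        [2.0, 3.0]]
C_tm = [[2.0, 1.0],
        [0.0, 1.0]]
R = reconstructor(C_tm, C_mm)
print(R)
```

For scale, a dense Cholesky factorization costs roughly n³/3 floating-point operations, which is about 3.3 × 10^14 for n = 100,000. That is why the full-size problem calls for tiled algorithms and accelerators rather than a loop like this one.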

SLIDE 25

E-ELT

AO simulation pipeline: observations forecast

The goal is to produce image-quality maps over the instrument’s full FoV as turbulence conditions evolve over the night.

SLIDE 26

E-ELT

AO simulation pipeline: observations forecast

From system parameters and turbulence evolution timeline to image quality at the output of the system

SLIDE 28

Task Model

One of The Possible Solutions for Exascale Computing

- Fine-granularity computation
- Asynchronous execution
- High concurrency
- Runtime system for separation of concerns
- Productivity with abstraction
- Performance/debugging tools
- Popular: available from OpenMP 3.0

SLIDE 29

Task Model

Task-based Applications 101

1. "Taskify" the application. This may require:
   - Implementing a new algorithm from scratch
   - Performing more flops at the end of the day
   - Increasing code size
2. Schedule the generated tasks. This may require a runtime system featuring:
   - Static/dynamic scheduling
   - Shared/distributed memory systems
   - NUMA awareness
   - Heterogeneous architectures (x86 + accelerators)
   - Homogeneous architectures (Intel TurboBoost)
   - Load balancing
   - Overlapping communication with computation
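
The two steps above can be sketched with nothing but the Python standard library: the application is expressed as named tasks plus explicit dependencies, and a very small dynamic scheduler submits each task to a thread pool as soon as its inputs are done. The function `run_task_graph` and the toy tasks are hypothetical. A production runtime additionally handles data movement, accelerators and load balancing, none of which is modeled here.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def run_task_graph(tasks, deps, workers=4):
    """Run `tasks` (name -> callable) respecting `deps` (name -> set of names).

    Returns the task names in the order in which they completed.
    """
    remaining = {name: set(deps.get(name, ())) for name in tasks}
    finished, running = [], {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while remaining or running:
            # Submit every task whose dependencies have all completed.
            ready = [name for name, waiting in remaining.items() if not waiting]
            for name in ready:
                del remaining[name]
                running[pool.submit(tasks[name])] = name
            if not running:
                raise ValueError("dependency cycle detected")
            done, _ = wait(list(running), return_when=FIRST_COMPLETED)
            for future in done:
                name = running.pop(future)
                future.result()  # re-raise any exception from the task
                finished.append(name)
                for waiting in remaining.values():
                    waiting.discard(name)
    return finished


results = {}

def gen_a():
    results["a"] = 2

def gen_b():
    results["b"] = 3

def multiply():
    results["ab"] = results["a"] * results["b"]

def report():
    results["msg"] = "ab = %d" % results["ab"]

tasks = {"gen_a": gen_a, "gen_b": gen_b, "multiply": multiply, "report": report}
deps = {"multiply": {"gen_a", "gen_b"}, "report": {"multiply"}}
order = run_task_graph(tasks, deps)
print(order, results["msg"])
```

Here `gen_a` and `gen_b` may run concurrently and finish in either order, while `multiply` and `report` always wait for their inputs. No global barrier is needed between the stages.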

SLIDE 31

Parallel Implementation

Blocked Algorithms: Fork-Join Paradigm

SLIDE 32

Parallel Implementation

LAPACK/MAGMA: Blocked Algorithms

Principles:
- Panel-Update sequence
- Transformations are blocked/accumulated within the panel (Level 2 BLAS)
- Transformations are applied at once on the trailing submatrix (Level 3 BLAS)
- Parallelism hidden inside the BLAS
- Fork-join model
- A broken model!

SLIDE 33

Parallel Implementation

Tile Data Layout Format

- LAPACK: column-major format
- PLASMA: tile format
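
The difference between the two layouts can be shown with a small conversion routine. This sketch assumes square nb x nb tiles, matrix dimensions that are exact multiples of nb, tiles ordered column by column, and column-major storage inside each tile. Those choices and the function name are illustrative assumptions, not a statement about any library's exact internal format.

```python
def colmajor_to_tiles(a, m, n, nb):
    """Split an m x n column-major flat list into contiguous nb x nb tiles.

    Tile (ti, tj) is stored at index ti + tj * (m // nb), and each tile is
    itself column-major.
    """
    if m % nb or n % nb:
        raise ValueError("this sketch requires m and n to be multiples of nb")
    mt, nt = m // nb, n // nb
    tiles = []
    for tj in range(nt):
        for ti in range(mt):
            tile = [a[(ti * nb + i) + (tj * nb + j) * m]
                    for j in range(nb) for i in range(nb)]
            tiles.append(tile)
    return tiles


# 4 x 4 matrix whose entry (i, j) is 10 * i + j, stored column-major.
a = [10 * i + j for j in range(4) for i in range(4)]
tiles = colmajor_to_tiles(a, m=4, n=4, nb=2)
for tile in tiles:
    print(tile)
```

After the conversion every tile occupies one contiguous block of memory, so a kernel working on a single tile touches a compact region instead of striding across whole columns of the matrix.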

SLIDE 34

Parallel Implementation

PLASMA/CHAMELEON: Tile Algorithms

- PLASMA: Parallel Linear Algebra for Scalable Multi-core Architectures ⇒ http://icl.cs.utk.edu/plasma/
- CHAMELEON: ⇒ https://gitlab.inria.fr/solverstack/chameleon.git
- Break the bulk synchronous programming model
- Parallelism is brought to the fore
- May require the redesign of linear algebra algorithms
- Tile data layout translation
- Remove unnecessary synchronization points between Panel-Update sequences
- DAG execution where nodes represent tasks and edges define dependencies between them
- Default dynamic runtime system environment: StarPU (but could use Quark, PaRSEC, OmpSs, OpenMP, etc.)
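
As an illustration of how a tile algorithm turns one factorization into many fine-grained tasks, the sketch below enumerates the kernel calls of a lower tile Cholesky factorization on an nt x nt grid of tiles. The kernel names follow the usual LAPACK/BLAS routine names (POTRF, TRSM, SYRK, GEMM). The function itself is a hypothetical helper that only lists the tasks, which are the nodes of the DAG, and performs no arithmetic.

```python
from collections import Counter

def tile_cholesky_tasks(nt):
    """List (kernel, output tile) pairs of a lower tile Cholesky factorization."""
    tasks = []
    for k in range(nt):
        tasks.append(("POTRF", (k, k)))         # factor the diagonal tile
        for i in range(k + 1, nt):
            tasks.append(("TRSM", (i, k)))      # triangular solve on the panel
        for i in range(k + 1, nt):
            tasks.append(("SYRK", (i, i)))      # update a trailing diagonal tile
            for j in range(k + 1, i):
                tasks.append(("GEMM", (i, j)))  # update a trailing off-diagonal tile
    return tasks


all_tasks = tile_cholesky_tasks(4)
counts = Counter(kernel for kernel, _ in all_tasks)
print(len(all_tasks), dict(counts))
```

The number of GEMM tasks grows cubically with nt, so most of the work ends up in many independent matrix-matrix products. That pool of independent tasks is what a dynamic runtime exploits once the synchronization points between panel and update steps are removed.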

SLIDE 35

Parallel Implementation

StarPU Runtime System

Runtime which provides:
  ⇒ Task scheduling
  ⇒ Memory management

Supports:
  ⇒ SMP/multicore processors (x86, PPC, ...)
  ⇒ NVIDIA GPUs (e.g. heterogeneous multi-GPU)
  ⇒ OpenCL devices
  ⇒ Cell processors (experimental)

SLIDE 36

Parallel Implementation

StarPU Runtime System

starpu_Insert_Task(&cl_dpotrf,
    VALUE,    &uplo, sizeof(char),
    VALUE,    &n,    sizeof(int),
    INOUT,    Ahandle(k, k),
    VALUE,    &lda,  sizeof(int),
    OUTPUT,   &info, sizeof(int),
    CALLBACK, profiling ? cl_dpotrf_callback : NULL, NULL,
    0);
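
The call above tags each piece of data with an access mode (for example INOUT for the tile being factored). A runtime can use such tags to derive task dependencies from the submission order alone. The Python sketch below mimics that idea and is not StarPU's implementation: the mode strings "R", "W" and "RW", the handle names and the function `infer_dependencies` are all hypothetical stand-ins.

```python
def infer_dependencies(tasks):
    """Derive dependencies from data access modes, in submission order.

    tasks: list of (name, [(handle, mode), ...]) with mode in {"R", "W", "RW"}.
    Returns {name: set of earlier task names it must wait for}.
    """
    last_writer = {}          # handle -> last task that wrote it
    readers_since_write = {}  # handle -> tasks that read it since that write
    deps = {}
    for name, accesses in tasks:
        wait_for = set()
        for handle, mode in accesses:
            if handle in last_writer:
                # Read-after-write or write-after-write on this handle.
                wait_for.add(last_writer[handle])
            if "W" in mode:
                # Write-after-read: wait for everyone still reading the old value.
                wait_for.update(readers_since_write.get(handle, ()))
        for handle, mode in accesses:
            if "W" in mode:
                last_writer[handle] = name
                readers_since_write[handle] = set()
            else:
                readers_since_write.setdefault(handle, set()).add(name)
        wait_for.discard(name)
        deps[name] = wait_for
    return deps


# Tile Cholesky on a 2 x 2 grid of tiles, submitted in sequential order.
submitted = [
    ("potrf(0,0)", [("A00", "RW")]),
    ("trsm(1,0)", [("A00", "R"), ("A10", "RW")]),
    ("syrk(1,1)", [("A10", "R"), ("A11", "RW")]),
    ("potrf(1,1)", [("A11", "RW")]),
]
dag = infer_dependencies(submitted)
for name, wait_for in dag.items():
    print(name, "waits for", sorted(wait_for))
```

The program is written as a plain sequential loop of task submissions, and the DAG comes out of the declared accesses. The programmer never lists the edges by hand.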

SLIDE 38

Performance Results

Global Workflow Chart

[Figure: global workflow chart. Labels: System Parameters; Turbulence Parameters; matcov; Cmm; Ctm; Ctt; Cee; Cvv; R; ToR; BLAS; Inter-sample; ToR computation; Observing sequence]

SLIDE 47

Performance Results

Performance Evolution of the ToR Computations

[Figure: time to solution in seconds (log scale, 0.1 to 10000) versus matrix size (10000 to 100000) for the ToR computation. Series: 16 cores Intel SDB (2012); 8 x NVIDIA K20s (2012); 40 cores Intel IVB (2013); 8 x NVIDIA K40s (2013); 36 cores Intel HSW (2014); 8 x NVIDIA K80s (2014); 28 cores Intel BDW (2016); 64 cores Intel KNL (2016); NVIDIA DGX-1 (2016); 8 nodes (16 Intel SDB + 8 K80s)]

SLIDE 49

Conclusion

Summary

- Computational challenge tackled: 25 seconds to compute the ToR with DGX-1.
- When part of the overall simulation pipeline, capable of producing hours of observation forecast in a few seconds.
- Most efficient real-time MOAO framework available for computational astronomy.
- Efficient task-based programming model.
- Pipelining computational stages.
- DGX-1: hardware of choice for remote scientific experiments (e.g., power limitations and system complexity).
- Software portability is key to exploring various hardware architectures.

SLIDE 50

Conclusion

Thank You!
