
slide-1
SLIDE 1

MAX-PLANCK-GESELLSCHAFT

ASYNCHRONICITY THE CHALLENGE OF FINE-GRAINED PARALLELISM

Luis Kornblueh September 29, 2016

Max-Planck-Institut für Meteorologie

slide-2
SLIDE 2

PERHAPS . . .

slide-3
SLIDE 3

LATEST HARDWARE DEPLOYMENT (courtesy of Miriam, 7a)

3

slide-4
SLIDE 4

SYSTEM CHARACTERISTICS

  • 24 nodes with Broadcom BCM2835 SoC (700 MHz ARM 1176JZF-S, VideoCore IV GPU)
  • Non-blocking fat-tree high-speed network, IEEE 802.3u (100BASE-TX), via USB 2.0 bus (aggregated 64.8 MB/s)
  • NFSv4 network filesystem, SLURM, GCC, mpich
  • Linux Debian jessie (kernel 4.4)

4


slide-8
SLIDE 8

SYSTEM CHARACTERISTICS

  • 24 nodes with Broadcom BCM2835 SoC (700 MHz ARM 1176JZF-S, VideoCore IV GPU)
  • Non-blocking fat-tree high-speed network, IEEE 802.3u (100BASE-TX), via USB 2.0 bus (aggregated 64.8 MB/s)
  • NFSv4 network filesystem, SLURM, GCC, mpich
  • Linux Debian jessie (kernel 4.4)

Successfully ran echam 4.6 T31L19 (CVS version 6.00, 2000-09-19 08:26:58; Git: da9d477; no code changes) using the full system.

4

slide-9
SLIDE 9

ENERGY CONSUMPTION: 100 W (courtesy of Miriam, 7a)

5

slide-10
SLIDE 10

SETTING THE STAGE

slide-14
SLIDE 14

WHAT IS DRIVING NEW DEVELOPMENTS?

Redefinition: the models we talk about consist of all components used in the workflow! The development of global circulation models in its current form has to change and respond to major challenges in hardware development. Example:

  • old node: 12 cores at 2.5 GHz
  • new node: 18 cores at 2.1 GHz

Consequence: more and more, fine-grained parallelism is required to achieve the performance necessary to answer the scientific questions posed.

7
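A quick back-of-the-envelope check of the example, under the simplifying (and in practice wrong) assumption that performance scales with cores times clock frequency:

```python
# Aggregate clock throughput of one node: cores x frequency (GHz).
# Illustrative only: real performance also depends on memory bandwidth,
# vectorization, and scaling efficiency, not just clock sums.
old_node = 12 * 2.5   # 30.0 "core-GHz"
new_node = 18 * 2.1   # 37.8 "core-GHz"

per_core_change = (2.1 - 2.5) / 2.5                  # -16% per-core clock
aggregate_change = (new_node - old_node) / old_node  # +26% aggregate

print(f"per core: {per_core_change:+.0%}, aggregate: {aggregate_change:+.0%}")
```

The aggregate only grows if a code can actually use the extra cores, which is exactly why finer-grained parallelism becomes mandatory.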

slide-15
SLIDE 15

OBJECTIVES

Key points are

  • to keep all critical hardware resources concurrently in use,
  • to minimize or hide the response time for remote access and service requests,
  • to reduce the share of parallel resources and task scheduling not used for computational work itself, and
  • to minimize resource access conflicts.

8


slide-19
SLIDE 19

ALGORITHMS

The solution framework consists of

  • a functional description of the processing algorithms, and
  • a directed acyclic graph (DAG) representation of the processing (to be used for optimization and parallelization).

9
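Such a DAG representation can be sketched with Python's standard `graphlib`; the operator names here are hypothetical, and each edge encodes which operator must finish before another may start:

```python
from graphlib import TopologicalSorter

# Hypothetical processing graph: each key is an operator, each value lists
# the operators whose output it needs (its predecessors in the DAG).
dag = {
    "read":  [],
    "op1":   ["read"],
    "op2":   ["read"],
    "op3":   ["op1", "op2"],
    "store": ["op3"],
}

# A scheduler derives a valid execution order (or parallel batches) from
# the DAG alone -- the functional description stays declarative.
order = list(TopologicalSorter(dag).static_order())
print(order)  # 'read' comes first, 'store' last
```

Because "op1" and "op2" have no edge between them, a parallel scheduler is free to run them concurrently.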

slide-20
SLIDE 20

PROCESSES COMPACTION

slide-21
SLIDE 21

COARSE-GRAINED ASYNCHRONOUS PROCESS

  • atmosphere
  • radiation
  • ocean
  • bio-geo-chemistry

[Diagram: time vs. number of cores; the components advance through time-integration phases separated by barriers]

11
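The barrier pattern in the diagram can be sketched with threads: all components step concurrently, and collecting their results acts as the barrier before the next integration step (the component names are from the slide; the `step` function is a hypothetical stand-in for advancing a component):

```python
from concurrent.futures import ThreadPoolExecutor

def step(component, t):
    # Stand-in for one time-integration step of a model component.
    return f"{component}@t{t}"

components = ["atmosphere", "radiation", "ocean", "bio-geo-chemistry"]
log = []
with ThreadPoolExecutor(max_workers=len(components)) as pool:
    for t in range(2):  # two time-integration steps
        # all components run concurrently within one step ...
        futures = [pool.submit(step, c, t) for c in components]
        # ... and waiting on every result is the barrier before step t+1
        log.extend(f.result() for f in futures)

print(log)
```

The coarse-grained cost is visible even in the sketch: the slowest component determines when the barrier opens, and the others idle until then.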

slide-22
SLIDE 22

HOW A VECTOR PIPELINING PROCESSING MODEL WORKS

[Diagram: node-thread space vs. time; data enters in slots 0 to 4, each slot passes through read, operators 1 to 3, and store, so different slots occupy different pipeline stages at the same time]

12
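A minimal sketch of the pipelining schedule the diagram depicts, assuming one clock tick per stage and the stage names shown (read, three operators, store):

```python
stages = ["read", "op1", "op2", "op3", "store"]
n_slots = 5  # slot 0 .. slot 4

# At clock tick t, slot s executes stage (t - s), if that stage exists:
# while slot 0 runs op2, slot 1 runs op1 and slot 2 reads -- once filled,
# the pipeline keeps every stage unit busy.
schedule = []
for t in range(len(stages) + n_slots - 1):  # ticks to fill and drain
    active = [(s, stages[t - s])
              for s in range(n_slots)
              if 0 <= t - s < len(stages)]
    schedule.append(active)

print(len(schedule))  # 9 ticks for 5 slots x 5 stages
print(schedule[0])    # only slot 0 is reading
print(schedule[4])    # all five slots busy at once
```

The weakness the next slide addresses: this schedule is fixed in advance, so a slow stage stalls every slot behind it.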

slide-23
SLIDE 23

MOVING TO A DAG BASED PROCESSING MODEL

[Diagram: node-thread space vs. time; each slot starts its chain of operators 1 to 3 when its data arrives and sends its result when done, independently of the other slots]

13
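For contrast with the lock-step pipeline, a data-driven sketch: each slot runs its operator chain as soon as its data arrives ("arrive") and emits its result ("send") when done, with hypothetical arrival times and operator costs:

```python
# Hypothetical operator chain with assumed per-operator costs (time units).
chain = [("op1", 2), ("op2", 1), ("op3", 3)]

# Hypothetical arrival times per slot -- deliberately out of order.
arrivals = {0: 0, 1: 5, 2: 1, 3: 7, 4: 2}

send_time = {}
for slot, t in arrivals.items():
    for _op, cost in chain:
        t += cost          # each slot advances independently, no barrier
    send_time[slot] = t

print(send_time)  # slot 2 sends at t=7, before slot 3 has even arrived
```

No slot ever waits for another; the DAG (here a simple chain per slot) is the only synchronization.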

slide-24
SLIDE 24

DAG BASED META-SCHEDULING

Cylc (Hilary Oliver, NIWA)

14

slide-25
SLIDE 25

FUTURE

slide-26
SLIDE 26

DEVELOPMENT ACTIVITIES

  • Development of a DAG-based worker/broker toolkit, with arithmetic operators as a first test, later adding cdo; Hermes, Florian Rathgeber and Tiago Quintino (ECMWF)
  • Refactoring of cdo by moving to C++ and disentangling command-line and operator handling
  • Develop an evaluation hierarchy for cdo operators

16
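A minimal sketch of the worker/broker idea (not the actual toolkit): a broker releases tasks from a hypothetical DAG once their dependencies are met, and workers pull them from a queue and report back; this follows the parallel-processing pattern from the `graphlib` documentation:

```python
import queue
import threading
from graphlib import TopologicalSorter

# Hypothetical task graph: key -> list of predecessor tasks.
dag = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}
ts = TopologicalSorter(dag)
ts.prepare()

todo, finished = queue.Queue(), queue.Queue()

def worker():
    while True:
        task = todo.get()
        if task is None:       # poison pill: shut down
            break
        finished.put(task)     # stand-in for running an operator

threads = [threading.Thread(target=worker) for _ in range(2)]
for th in threads:
    th.start()

order = []
while ts.is_active():          # broker loop
    for task in ts.get_ready():  # dependencies satisfied -> hand out
        todo.put(task)
    done = finished.get()      # wait for any worker to report back
    ts.done(done)              # unlocks the successors of `done`
    order.append(done)

for _ in threads:
    todo.put(None)
for th in threads:
    th.join()

print(sorted(order))           # every task ran exactly once
```

The broker owns the DAG state; workers stay stateless, which is what makes swapping arithmetic test operators for cdo operators straightforward.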


slide-29
SLIDE 29

WHAT NEXT?

  • Get a working prototype of post-processing tools and scheduling
  • Use meta-scheduling for applicable problems
  • Rethink the time operator splitting of the model physics to allow for a more functional, concurrently usable representation of processes, or resolve those explicitly . . .
  • Development and application of model-developer-friendly Domain Specific Languages (DSLs)

17


slide-33
SLIDE 33

ADDITIONAL CONSTRAINTS

slide-34
SLIDE 34

UNKNOWNS

There are two more aspects contributing to effective system usage: power consumption and the system's reliability.

The influence of these parameters on future developments is not in the primary scope of these considerations, but they are expected to have a strong impact on solutions.

19