SLIDE 1

I/O Scheduling Service for Multi-Application Clusters

Adrien Lebre

Adrien.Lebre@irisa.fr

Guillaume Huard, Yves Denneulin

{Adrien.Lebre,Guillaume.Huard,Yves.Denneulin}@imag.fr

Laboratoire ID-IMAG (UMR 5132), Grenoble, France. BULL - HPC, Échirolles, France.

Przemyslaw Sowa

sowa@icis.pcz.pl

Institute of Computer and Information Sciences Czestochowa University of Technology, Poland.


aIOLi - workshop Phenix - December 2006

SLIDE 2

Plan

Part 1 - Parallel Input/Output and Clusters
Part 2 - Controlling and Scheduling Multi-application I/O
Part 3 - aIOLi, an Input/Output Scheduler for HPC
Part 4 - Conclusion

SLIDE 3

Plan

Part 1 - Parallel Input/Output and Clusters
  1. Introduction: Context, Parallel I/O
  2. Parallel I/O and Clusters: Available Solutions
  3. Objectives
Part 2 - Controlling and Scheduling Multi-application I/O
Part 3 - aIOLi, an Input/Output Scheduler for HPC
Part 4 - Conclusion

SLIDE 4

Context

Environment

Clusters of SMPs
Linux
High Performance Computing
Intensive I/O applications

CPU-bound applications ⇒ I/O-bound applications
Remote hard drive I/O

Parallel I/O

Handling concurrent accesses to the same resource (a file)
Accesses differ in size and in offset
Example: a parallel matrix product

SLIDE 5

Parallel I/O - Example

Parallel matrix product

Specific parts to fetch according to the data distribution (columns/rows, BLOCK/BLOCK ...)

Matrices are stored "row by row" in files

[Figure: matrices A, B and C laid out row by row in files; the blocks each process must fetch from A and B are highlighted]

SLIDE 6

Parallel I/O - Example

Parallel matrix product

Specific parts to fetch according to the data distribution (columns/rows, BLOCK/BLOCK ...)

Matrices are stored "row by row" in files

[Figure: process P0 fetches its row-wise part of matrix A with 1 read(n) and its column-wise part of matrix B with n read(1)]

SLIDE 7

Parallel I/O - Example

Parallel matrix product

Specific parts to fetch according to the data distribution (columns/rows, BLOCK/BLOCK ...)

Matrices are stored "row by row" in files

[Figure: processes P0 and P1 each fetch their parts: 1 read(n) for a row-wise part, n read(1) for a column-wise part]
SLIDE 8

Parallel I/O - Example

Parallel matrix product

Specific parts to fetch according to the data distribution (columns/rows, BLOCK/BLOCK, ...)
File decomposition: lots of disjoint/contiguous requests arrive at the same time ⇒ a "lethal" behaviour for the I/O subsystem

Matrices are stored "row by row" in files

[Figure: taken together, processes P0..Pp issue n² read(n) and n² × n read(1) requests over the row-by-row files]
SLIDE 9

Parallel I/O - Example

No defined order between requests ⇒ many disk head movements

[Figure: timeline of the interleaved accesses from P0..Pp to matrices A and B, with a SEEK between consecutive requests]
SLIDE 10

Parallel I/O - Example

No defined order between requests ⇒ many disk head movements


A single HDD delivers its best performance when files are accessed in a sequential and contiguous way

SLIDE 11

Parallel I/O - Example

No defined order between requests ⇒ many disk head movements


"Random order" between requests: the bigger the number of requests, the bigger the potential number of seeks ⇒ performance degradation (bottlenecks). Similar behaviour for matrix B.
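To make the request explosion concrete, here is a minimal C sketch of the syscall pattern behind "1 read(n)" versus "n read(1)"; the file name and sizes are illustrative, not taken from the talk:

```c
/* Hedged sketch (not from the talk): the syscall pattern behind
 * "1 read(n)" versus "n read(1)" for an N x N matrix of doubles
 * stored row by row in a file. */
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

#define N 1024

int main(void)
{
    int fd = open("matrix.bin", O_RDONLY);   /* hypothetical file name */
    if (fd < 0)
        return 1;

    double *row = malloc(N * sizeof(double));
    double elem;

    /* Row-wise part: one contiguous request, 1 read(n). */
    pread(fd, row, N * sizeof(double), 0);

    /* Column-wise part: n read(1), each one full row apart,
     * i.e. n syscalls and up to n disk seeks for a single column. */
    for (int i = 0; i < N; i++)
        pread(fd, &elem, sizeof(double), (off_t)i * N * sizeof(double));

    free(row);
    close(fd);
    return 0;
}
```

Every iteration of the column loop is a separate syscall landing one full row away from the previous one: exactly the seek-heavy pattern described above.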

SLIDE 12

Parallel I/O and Clusters

Parallel I/O ⇒ bottlenecks

[Figure: compute nodes SMP 1..SMP n, each with processes P1..Pn, reach a single NFS I/O server through the interconnection network]

SLIDE 13

Parallel I/O and Clusters

Parallel I/O ⇒ bottlenecks


Hypothesis: the network has less impact than the I/O subsystem.

SLIDE 14

Parallel I/O Requirements

Performance constraints

Reduce the number of requests: decrease the overhead implied by the syscalls
Schedule requests: avoid expensive seeks and maximize large accesses
Exploit cache mechanisms: benefit from read-ahead strategies, ...

SLIDE 15

Parallel I/O Solutions (1/4)

[Figure: the same compute nodes now reach several I/O servers 1..p (storage nodes) through the interconnection network]

Solution 1: Parallel File Systems ⇒ Balance requests between several servers

SLIDE 16

Parallel I/O Solutions (1/4)

Parallel File Systems

Load balancing over several servers. Two types:

Designed for "parallel I/O": PIOUS, VESTA, ... (logical view / physical placement; abandoned over time)
More generic: PVFS, Parallel NFS, GPFS, Lustre (performance/coherency/fault tolerance, ...)

More or less complete, more or less efficient, more or less intrusive (dedicated APIs on the client side)
No "real" scheduling policy (most of them rely on low-level schedulers)
Performance depends on the striping policy of the file system
From a general point of view, they do not take application striping into account!

SLIDE 17

Parallel I/O Solutions (2/4)

Parallel I/O ⇒ bottlenecks


SLIDE 18

Parallel I/O Solutions (2/4)

Parallel I/O ⇒ bottlenecks


Parallel I/O library

Solution 2: Libraries ⇒ MPI I/O, the standard

SLIDE 19

Parallel I/O Solutions (3/4)

Libraries - MPI I/O [MPI2-97]

Definition of access patterns ⇒ reduce the number of requests (equivalent to "views" / access vectors, as with readv/writev)

[Figure: each process P0..Pp declares its own access pattern (view) over the row-by-row matrix files]
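As a rough illustration of such a view (a hedged sketch, not code from the talk; the column distribution is an assumption), a process can declare its strided pattern once and let the MPI I/O layer aggregate the accesses:

```c
/* Hedged sketch of an MPI I/O "view": each process declares its strided
 * column pattern once, so the library can coalesce what would otherwise
 * be n separate read(1) calls. Illustrative only. */
#include <mpi.h>

void read_own_column(const char *path, int n, double *col)
{
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, path, MPI_MODE_RDONLY,
                  MPI_INFO_NULL, &fh);

    /* One double every n doubles = one column of an n x n matrix. */
    MPI_Datatype column;
    MPI_Type_vector(n, 1, n, MPI_DOUBLE, &column);
    MPI_Type_commit(&column);

    /* The view shifts the file so this process "sees" only its column. */
    MPI_File_set_view(fh, (MPI_Offset)(rank * sizeof(double)), MPI_DOUBLE,
                      column, "native", MPI_INFO_NULL);

    /* Collective read: the library may reorder and merge the accesses. */
    MPI_File_read_all(fh, col, n, MPI_DOUBLE, MPI_STATUS_IGNORE);

    MPI_Type_free(&column);
    MPI_File_close(&fh);
}
```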

SLIDE 24

Parallel I/O Solutions (3/4)

Libraries - MPI I/O [MPI2-97]

Definition of access patterns ⇒ reduce the number of requests (equivalent to "views" / access vectors, as with readv/writev)
Coordination of processes to balance requests efficiently (aggregation, load balancing, but still no efficient order ...)

[Figure: the coordinated processes' patterns over matrix B are aggregated before being issued]
SLIDE 25

Parallel I/O Solutions (3/4)

Libraries - MPI I/O [MPI2-97]


Each access is sent in an independent, parallel way

[Figure: application side vs file system side; the P0..Pp accesses to matrix B reach the file system independently and in parallel, in no particular order, causing SEEKs]

SLIDE 30

Parallel I/O Solutions (3/4)

Libraries - MPI I/O [MPI2-97]


Use of complementary routines to define a particular order

Each access is sent in an independent, parallel way


SLIDE 31

Parallel I/O Solutions (3/4)

Libraries - MPI I/O [MPI2-97]

Definition of access patterns ⇒ reduce the number of requests (equivalent to "views" / access vectors, as with readv/writev)
Coordination of processes to balance requests efficiently (aggregation, load balancing, but still no efficient order ...)
From a global point of view, performance is improved, but:
Sophisticated API ⇒ development overhead / language bindings
No global coordination ⇒ impact on performance

SLIDE 32

Parallel I/O Solutions (3/4)

Libraries in multi-application environment

[Figure: two applications, each spread over several compute nodes and using ROMIO, share the same I/O servers]

SLIDE 33

Parallel I/O Solutions (3/4)

Libraries in multi-application environment

[Figure: two concurrent applications execute a cat-like operation on File 1 and File 2 through the same I/O server; their synchronous request/reply behaviour interleaves the accesses, with a SEEK at every switch between files]

No information about applications, only about files

SLIDE 40

Parallel I/O Solutions (3/4)

Libraries in multi-application environment


Switching from one file to the other implies a disk head movement

Libraries are not suited: a global synchronization is required

SLIDE 41

A New Approach

Objectives

Supply parallel I/O algorithms (scheduling / aggregating / overlapping accesses) ⇒ mono-application efficiency
Only through the use of the ubiquitous POSIX calls:

  • open/read/write/lseek/close ⇒ portability / simplicity

Address requests in a global manner ⇒ multi-application efficiency
Naive approach: processing all the requests from one application before serving another one; not suited for a cluster ⇒ tradeoff between "fairness" and performance
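In other words, an application keeps its ordinary POSIX I/O untouched. A minimal sketch (the path and sizes are hypothetical) of all the API surface a client ever needs:

```c
/* Hedged sketch: everything a client application needs, nothing more.
 * The path and sizes are made up; the point is that no dedicated API
 * is required, so scheduling can act transparently on the server. */
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    static char buf[128 * 1024];              /* one 128 KB access   */
    int fd = open("/nfs/data/part0.bin", O_RDONLY);
    if (fd < 0)
        return 1;

    lseek(fd, 128 * 1024, SEEK_SET);          /* this process's part */
    read(fd, buf, sizeof(buf));               /* plain POSIX read    */
    close(fd);
    return 0;
}
```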

SLIDE 42

Plan

Part 1 - Parallel Input/Output and Clusters
Part 2 - Controlling and Scheduling Multi-application I/O
  4. Scheduling in Multi-application Environment: General Algorithm
  5. Synchronous Behaviour
Part 3 - aIOLi, an Input/Output Scheduler for HPC
Part 4 - Conclusion

SLIDE 43

Multi-application Scheduling

[Figure: applications Appli 1..Appli n, spread over the compute nodes SMP 1..SMP n, all reach the storage nodes (or a single NFS server) through the interconnection network]

SLIDE 46

Multi-application Scheduling

An "online" problem: several requests from distinct applications are delivered to the file system.
Desired criterion: "efficiency" under "fairness" constraints ⇒ maximize the minimum instantaneous throughput of each application.
Algorithm: a "Multi-Level Feedback" (MLF) variant (quantum approach).

[Figure: three scheduling steps over queued requests from applications A1, A2 and A3, pre-processed by offset dependence; one element requires 5 time units to be processed. Each application carries a quantum q and an accumulated service time T: Step 1: q=10, T=15 / q=10, T=5 / q=10, T=20; Step 2: q=20, T=20 / q=20, T=25 / q=10, T=5; Step 3: q=40, T=30 / q=20, T=20.]

The growth of a quantum can be set per application (QoS)
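A minimal user-space sketch of the quantum mechanics above, assuming details the slides only hint at (doubling on full use, the exact bounds, the least-served tie-break):

```c
/* Minimal sketch (assumptions mine, not the aIOLi sources) of the
 * MLF-like quantum scheduling: serve the least-served application,
 * grow a fully used quantum, and keep it within fixed bounds. */
#include <stddef.h>

#define MIN_QUANTUM 10
#define MAX_QUANTUM 80

struct app {
    long quantum;  /* q: service slice granted per round           */
    long served;   /* T: time units of service received so far     */
    long pending;  /* time units of queued, offset-sorted requests */
};

/* Fairness rule: among applications with pending work, pick the one
 * with the least accumulated service, which maximizes the minimum
 * instantaneous throughput. */
static struct app *pick(struct app *apps, size_t n)
{
    struct app *best = NULL;
    for (size_t i = 0; i < n; i++)
        if (apps[i].pending > 0 && (!best || apps[i].served < best->served))
            best = &apps[i];
    return best;
}

/* One scheduling round: serve up to one quantum, then adapt it. */
static void schedule_round(struct app *apps, size_t n)
{
    struct app *a = pick(apps, n);
    if (!a)
        return;

    long slice = a->pending < a->quantum ? a->pending : a->quantum;
    a->pending -= slice;
    a->served  += slice;

    /* Quantum adaptation: double when fully used (steady streams get
     * larger slices), reset otherwise; always min < quantum < max. */
    a->quantum = (slice == a->quantum) ? a->quantum * 2 : MIN_QUANTUM;
    if (a->quantum > MAX_QUANTUM)
        a->quantum = MAX_QUANTUM;
}
```

In aIOLi the same idea is driven by real request queues and per-file offsets; the sketch only shows how a bounded, per-application quantum produces the q/T evolution of steps 1 to 3.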
SLIDE 50

Manage Synchronous Behaviour in an Efficient Way

[Figure: the same two cat-like applications, now with a coordination server granting each one a dedicated window on the I/O server]

Switching from one file to the other implies a disk head movement ⇒ serialize and define "dedicated" windows

SLIDE 57

Manage Synchronous Behaviour in an Efficient Way


Serializing accesses within dedicated windows decreases the number of seeks and exploits the read-ahead mechanism

SLIDE 58

Synchronous behaviour and Parallel I/O

Synchronous behaviour within parallel I/O

Due to the file system granularity, this is equivalent to n synchronous accesses sent in a parallel way
⇒ the quantum size is adapted according to the file access history, within min bound < quantum < max bound; without this adaptation, the synchronous behaviour would prevent the aggregation process from being exploited

[Figure: on the client side, P0..P3 each issue read(4MB); the aIOLi scheduling queue orders the pieces at the file system granularity while each process's quantum grows step by step (q=1, 2, 3, 4)]

SLIDE 64

Plan

Part 1 - Parallel Input/Output and Clusters
Part 2 - Controlling and Scheduling Multi-application I/O
Part 3 - aIOLi, an Input/Output Scheduler for HPC
  6. aIOLi - "Generic" Framework
  7. aIOLi - Evaluations: Multi-nodes, 2 applications, 10 applications
Part 4 - Conclusion

SLIDE 65

aIOLi - ”Generic” Framework

[Figure: the cluster again, compute nodes and I/O servers; the I/O servers are the available "central" point]

Objectives

A central point is required to apply our global strategy
A client/server model like PANDA (explicit centralization) is not scalable ⇒ exploit an already available "central" point
Major change: I/O systems become clients of our framework!

SLIDE 66

aIOLi - Framework

Implementation

”Virtual File System” (server side)

Optimize the VFS on the server side

[Figure: in kernel space on the server, the aIOLi layer plugs into the Virtual File System, between the syscall layer / NFS server (RPC, XDR) and the underlying file systems (ExtX, ReiserFS)]

SLIDE 67

aIOLi - Framework

Implementation

"Virtual File System" (server side)
NFS server (Version 3)

Optimize the NFS server

[Figure: same kernel stack, with aIOLi plugged into the NFS (v3) server path]

SLIDE 68

aIOLi - Framework

Implementation

"Virtual File System" (server side)
NFS server (Version 3)

Technical aspects

Linux kernel module - 3 functions to plug an I/O system

[Figure: the aIOLi system comprises request queue management, statistics and control, and a pool of schedulers (one scheduling instance plus one I/O controller per client); aIOLi client 1 (Virtual File System) and aIOLi client n (Network File System) connect the I/O systems, over the network, down to the block devices]
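The slide does not name the three plug functions, so the following interface is purely hypothetical: one plausible shape for a kernel-side contract letting an I/O system hand its requests to the aIOLi scheduler.

```c
/* Purely hypothetical sketch: the three functions are not named in the
 * talk, so every identifier here is invented for illustration. */
#include <linux/types.h>

struct aioli_request {
    unsigned long file_id;   /* which file the access targets       */
    loff_t offset;           /* starting offset of the access       */
    size_t len;              /* size of the access                  */
    int write;               /* 0 = read, 1 = write                 */
};

struct aioli_client_ops {
    /* register the I/O system with aIOLi (one scheduling instance
     * and one I/O controller are allocated for this client)        */
    int  (*attach)(const char *client_name);
    /* queue an incoming request and block until the scheduler
     * grants it its slot within the current quantum                */
    int  (*submit_and_wait)(struct aioli_request *req);
    /* report completion so statistics and quanta can be updated    */
    void (*complete)(struct aioli_request *req, ssize_t done);
};
```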

SLIDE 69

aIOLi - Evaluations

Platform

Grid5000: Sophia-Antipolis cluster (AMD 64, IDE HDD, Gigabit Ethernet)
1 to 96 nodes
1 dedicated NFS server
IOR benchmark (LLNL)

Experiments

1 parallel application: validate parallel I/O detection and transparent optimisations
2 applications: analyse the mutual impact and the interest of a global strategy
10 applications: evaluate a "real" case

SLIDE 70

Evaluations: One Parallel Application

4 GB file decomposition across 32 MPI instances; kernel 2.6.12, sophia cluster, NFS version 3, mpich 1.2.5, ROMIO

[Plot: completion time (s) and bandwidth (MB/s) versus file access granularity (8 KB to 4096 KB) for POSIX - 1, POSIX - 32, MPI IO - 32, POSIX + aIOLi - 32 and POSIX + aIOLi - 1; 32 processes deployed on 32 nodes, plus a 1-process "prefetch" run]

Observations

aIOLi provides significant improvements: 11 < T_POSIX / T_aIOLi < 50 and 3.5 < T_MPIIO / T_aIOLi < 6.5
Synchronous behaviours benefit from aIOLi
Mono-application OK! Next step: global coordination

SLIDE 76

Evaluations: Multi-application Mode - Case 1

Impact of a 4 GB decomposition (32 processes, 32 nodes) on a cat of 16 MB; kernel 2.6.15, sophia cluster, NFS version 3, mpich 1.2.5, ROMIO

[Plot: completion time (s, log scale) versus IOR file access granularity (8 KB to 512 KB) for cat-NFS, IOR-NFS, cat-aIOLi and IOR-aIOLi, with a "cat" granularity of 32 KB; left panel: read - POSIX, right panel: read - MPI I/O]

Observations

Impact: the 16 MB cat takes 100 s (Y axis: log scale)
The use of MPI I/O reduces the load on the server
aIOLi improves the performance of both applications

SLIDE 78

Evaluations: Multi-application Mode - Case 2

10 concurrent applications, 96 nodes, 6 GB
kernel 2.6.12, sophia cluster, NFS version 3, mpich 1.2.5, ROMIO

Completion time (seconds):

                                                         NFS               NFS+aIOLi
Applications                                        POSIX   MPI IO     POSIX   MPI IO
read decompos. - 2GB (32 nodes, granularity=128KB)    490      840       134      500
write decompos. - 2GB (32 nodes, granularity=128KB)   409      815       107      604
read decompos. - 256MB (16 nodes, granularity=8KB)    595      728       104      415
write decompos. - 128MB (8 nodes, granularity=64KB)    51      257      14.5      247
read sequential - 1GB (1 node, granularity=2MB)       558       59       143       54
write sequential - 512MB (1 node, granularity=2MB)    192       71        84     61.5
read sequential - 32MB (1 node, granularity=4KB)      531        9      48.5        3
write sequential - 32MB (1 node, granularity=4KB)     208        9        47        6
read sequential - 4MB (1 node, granularity=32KB)       57      1.5         6        1
write sequential - 4MB (1 node, granularity=32KB)      39        2        19        2
All (4 decompos. + 6 sequential, 6 GB)                595      840       143      604

SLIDE 84

Plan

Part 1 - Parallel Input/Output and Clusters
Part 2 - Controlling and Scheduling Multi-application I/O
Part 3 - aIOLi, an Input/Output Scheduler for HPC
Part 4 - Conclusion
  8. Conclusion
  9. Current and Future Works

SLIDE 85

Conclusion

Performance and I/O in a multi-application environment

Controlling and scheduling I/O requests in a global way is a multi-criteria problem (efficiency and fairness) ⇒ proposal of an MLF variant

aIOLi, an I/O scheduler for HPC

Generic framework to evaluate new I/O scheduling strategies
Implementation in kernel space: intrusive from a system point of view, but efficient
Code available under the GPL
Joint project since the end of 2005 with the ICIS Institute (Poland)

SLIDE 86

Current and Future Works

aIOLi, works in progress

Take data striping into account (on RAID devices and parallel file systems)
Collaboration around a parallel version of NFS to evaluate the interest of a higher-level I/O scheduler (Brazil/France/Poland)
Interconnection with the Lustre file system

Future

Control I/O requests at different points: a multi-level scheduler

  • on client side (compute node) / on server and hard drive side ⇒ cascade scheduling

Exploit the meta-node concept used by modern file systems to provide consistency as a central point to plug aIOLi

SLIDE 87

Questions ?

http://aioli.imag.fr

LIPS Project: BULL - INRIA - ID-IMAG Laboratory - ICIS Institute. Thanks!
