SLIDE 1

Outline: Background and Motivation; Decoupled I/O; Evaluation; Conclusion and Future Work

Decoupled I/O for Data-Intensive High Performance Computing

Chao Chen¹, Yong Chen¹, Kun Feng², Yanlong Yin², Hassan Eslami³, Rajeev Thakur⁴, Xian-He Sun², William D. Gropp³

¹ Department of Computer Science, Texas Tech University
² Department of Computer Science, Illinois Institute of Technology
³ Department of Computer Science, University of Illinois Urbana-Champaign
⁴ Mathematics and Computer Science Division, Argonne National Laboratory

Sep 12th, 2014

Yong Chen DISCL @ Texas Tech University

SLIDE 2

Scientific Computing and Workload

⋄ High performance computing is a strategic tool for scientific discovery and innovation

  • Climate Change: Community Earth System Model (CESM)
  • Astronomy: Supernova, Sloan Digital Sky Survey
  • etc.

⋄ HPC systems are used to simulate events and to analyze the output for insights

Figure 1: Climate modeling and analysis
Figure 2: Typical scientific workload

SLIDE 3

Big Data Problem

⋄ Many scientific simulations have become highly data intensive
⋄ Simulations demand finer resolution, both spatial and temporal

  • e.g., climate models: 250 km ⇒ 20 km; 6 hours ⇒ 30 minutes

⋄ The output volume reaches tens of terabytes in a single simulation, and the entire system deals with petabytes of data
⋄ The pressure on I/O system capability increases substantially

PI | Project | On-Line Data | Off-Line Data
Lamb, Don | FLASH: Buoyancy-Driven Turbulent Nuclear Burning | 75TB | 300TB
Fischer, Paul | Reactor Core Hydrodynamics | 2TB | 5TB
Dean, David | Computational Nuclear Structure | 4TB | 40TB
Baker, David | Computational Protein Structure | 1TB | 2TB
Worley, Patrick H. | Performance Evaluation and Analysis | 1TB | 1TB
Wolverton, Christopher | Kinetics and Thermodynamics of Metal and Complex Hydride Nanoparticles | 5TB | 100TB
Washington, Warren | Climate Science | 10TB | 345TB
Tsigelny, Igor | Parkinson's Disease | 2.5TB | 50TB
Tang, William | Plasma Microturbulence | 2TB | 10TB
Sugar, Robert | Lattice QCD | 1TB | 44TB
Siegel, Andrew | Thermal Striping in Sodium Cooled Reactors | 4TB | 8TB
Roux, Benoit | Gating Mechanisms of Membrane Proteins | 10TB | 10TB

Figure 3: Data volume of current simulations
Figure 4: Climate Model Evolution: FAR (1990), SAR (1996), TAR (2001), AR4 (2007)

SLIDE 4

Gap between Applications’ Demand and I/O System Capability

⋄ Gyrokinetic Toroidal Code (GTC)

  • Outputs particle data consisting of two 2D arrays, one for electrons and one for ions
  • Both arrays are distributed among all cores; particles can move across cores in a random manner as the simulation evolves

⋄ A production run at the scale of 16,384 cores

  • Each core outputs roughly two million particles, 260GB in total
  • Desires O(100MB/s) for efficient output

⋄ The average I/O throughput of Jaguar (now Titan) is around 4.7MB/s per node
⋄ Large and growing gap between the application's requirement and the system's capability

SLIDE 5

Decoupled I/O

A new way of moving computation close to the data to minimize data movement and address the I/O bottleneck
⋄ A runtime system design for our Decoupled Execution Paradigm (DEP)
⋄ Provides a set of interfaces for users to decouple their applications and map them onto different sets of nodes

Figure 5: Decoupled Execution Paradigm and System Architecture

SLIDE 6

Overview of Decoupled I/O

⋄ An extension to the MPI library that manages both compute nodes and data nodes in the DEP architecture
⋄ Internally splits them into a compute group and a data group, for normal application code and data-intensive operations respectively

Figure 6: Overview of Decoupled I/O (diagram labels: compute node, data node, storage node, improved MPI library, high-speed network, system network, PFS)

SLIDE 7

Overview of Decoupled I/O

Involves three improvements to the existing MPI library:
⋄ Decoupled I/O APIs
⋄ Improved MPI compiler (mpicc)
⋄ Improved MPI process manager (hydra)

User-implemented code (launched with mpirun):

    void func(…)
    main() {
        MPI_Init(…);
        MPI_Op myop;
        MPI_Op_create(func, myop);
        ….
        computation();
        MPI_File_decouple_xxx(in, out, myop);
        compute(out);
    }

Code after mpicc translation (compute-node and data-node branches, launched with mpirun):

    void func(…)
    main() {
        MPI_Init(…);
        MPI_Op myop;
        MPI_Op_create(func, myop);
        ….
        if (rank < n) {      /* compute process */
            computation();
            MPI_File_decouple_xxx(in, out, myop);
            compute(out);
        }
        if (rank >= n) {     /* data process */
            wait_for(request);
            processing();    // including I/O
            send_result();
        }
    }

Figure 7: Decoupled I/O at runtime

SLIDE 8

Decoupled I/O API

⋄ Abstracts each data-intensive operation as two phases: traditional I/O and data processing
⋄ Provides APIs that treat the two phases as an ensemble, using a distinct file handle type and a data operation (data_op) argument

Table 1: Decoupled I/O APIs

MPI_File_decouple_open(MPI_Decoupled_File fh, char *filename, MPI_Comm comm);
MPI_File_decouple_close(MPI_Decoupled_File fh, MPI_Comm comm);
MPI_File_decouple_read(MPI_Decoupled_File fh, void *buf, int count, MPI_Datatype data_type, MPI_Op data_op, MPI_Comm comm);
MPI_File_decouple_write(MPI_Decoupled_File fh, void *buf, int count, MPI_Datatype data_type, MPI_Op data_op, MPI_Comm comm);
MPI_File_decouple_set_view(MPI_Decoupled_File fh, MPI_Offset disp, MPI_Datatype etype, MPI_Datatype filetype, char *datarep, MPI_Info info, MPI_Comm comm);
MPI_File_decouple_seek(MPI_Decoupled_File fh, MPI_Offset offset, int whence, MPI_Comm comm);

SLIDE 9

Decoupled I/O API Example

Traditional Code

    int *buf;
    MPI_File_read(fh, buf, ...);
    for (i = 0; i < bufsize; i++) {
        sum += buf[i];
    }
    ...

Decoupled I/O Code

    /* define operation */
    int sum_op(int *buf, int bufsize) {
        int i, sum = 0;
        for (i = 0; i < bufsize; i++)
            sum += buf[i];
        return sum;
    }
    ....
    MPI_Op myop;
    MPI_Op_create(sum_op, myop);
    MPI_File_decouple_read(fh, sum, myop, ....);

    ...
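
The pattern in this example, registering an operation and letting it run where the data is read so that only the result comes back, can be mimicked in plain C without MPI. The sketch below is illustrative only: `data_op_fn`, `emulated_decoupled_read`, and the in-memory `file_contents` array are hypothetical stand-ins for the MPI_Op, the decoupled read call, and the file. None of them are part of the Decoupled I/O library.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a registered MPI_Op: reduces a buffer to one value. */
typedef long (*data_op_fn)(const int *buf, size_t count);

/* The offloaded operation from the slide: sum all elements. */
long sum_op(const int *buf, size_t count) {
    long sum = 0;
    for (size_t i = 0; i < count; i++)
        sum += buf[i];
    return sum;
}

/*
 * Emulates a decoupled read: the "data process" side applies `op` to the
 * data where it resides and hands back only the result, so the raw buffer
 * never travels to the caller. `file_contents` plays the role of the file.
 */
long emulated_decoupled_read(const int *file_contents, size_t count, data_op_fn op) {
    assert(op != NULL);
    return op(file_contents, count);
}
```

A caller would pass `sum_op` as the last argument and receive a single `long`, rather than reading the whole buffer and looping over it locally as the traditional code does.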

SLIDE 10

Process/Node management

⋄ Data nodes and compute nodes are peers at the same level, belonging to two groups
⋄ "mpirun -np n -dp m -f hostfile ./app" runs an application with n compute processes and m data processes
⋄ All processes belong to the MPI_COMM_WORLD communicator, each with a distinct rank
⋄ Each group has its own group communicator, MPI_COMM_LOCAL, as an intra-communicator
⋄ MPI_COMM_INTER is a group-to-group inter-communicator between the compute process group and the data process group
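
The two-group layout can be made concrete with a small helper that maps a world rank to its group and local rank. This is a sketch under one stated assumption: world ranks 0..n-1 are compute processes and n..n+m-1 are data processes. The slides only say the ranks are distinct, and the names `dep_group`, `dep_layout`, and `dep_layout_of` are hypothetical.

```c
#include <assert.h>

/* Which group a process belongs to. */
enum dep_group { DEP_COMPUTE = 0, DEP_DATA = 1 };

/* Hypothetical description of one process's place in the two-group layout. */
struct dep_layout {
    enum dep_group group;   /* compute or data group */
    int local_rank;         /* rank inside the group's intra-communicator */
    int local_size;         /* number of processes in the group */
    int remote_leader;      /* world rank of the other group's first process */
};

/*
 * Assumption: world ranks 0..n-1 are compute processes and n..n+m-1 are
 * data processes, matching "-np n -dp m" on the command line.
 */
struct dep_layout dep_layout_of(int world_rank, int n, int m) {
    assert(n > 0 && m > 0);
    assert(world_rank >= 0 && world_rank < n + m);

    struct dep_layout l;
    if (world_rank < n) {
        l.group = DEP_COMPUTE;
        l.local_rank = world_rank;
        l.local_size = n;
        l.remote_leader = n;      /* first data process */
    } else {
        l.group = DEP_DATA;
        l.local_rank = world_rank - n;
        l.local_size = m;
        l.remote_leader = 0;      /* first compute process */
    }
    return l;
}
```

With standard MPI, `group` could serve as the color for `MPI_Comm_split` and `remote_leader` as the remote leader passed to `MPI_Intercomm_create`. Whether the prototype builds MPI_COMM_LOCAL and MPI_COMM_INTER this way is not stated in the slides.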

SLIDE 11

Code Decoupling & Compiler Improvement

⋄ Each process identifies its type (compute process or data process) from its rank in MPI_COMM_WORLD and executes different code accordingly
⋄ Data process code is generated automatically by mpicc, guided by hints defined with the macros MPI_DECOUPLE_START and MPI_DECOUPLE_END
⋄ An MPI_Op defines each offloaded operation and has to be registered before MPI_DECOUPLE_START

SLIDE 12

Decoupled I/O Implementation and Prototyping

⋄ Built entirely on the MPI library
⋄ Gathers the tasks from the compute processes and scatters them to the data processes

Figure 8: Decoupled I/O prototype (diagram labels: compute processes, data processes, MPI runtime; MPI_Gather collecting tasks at each group's master process; MPI_Send(request) over MPI_COMM_INTER; MPI_Scatter of tasks; MPI_Gather and MPI_Scatter of results; MPI_Recv(results) over MPI_COMM_INTER)
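
When the number of gathered tasks is not a multiple of the number of data processes, a scatter step needs per-process counts and offsets. The sketch below computes them in the form that the standard `MPI_Scatterv` call expects. It is an assumption for illustration: the slides name only MPI_Scatter and do not say how the prototype partitions tasks, and `task_partition` is a hypothetical helper.

```c
#include <assert.h>

/*
 * Splits `n_tasks` gathered tasks across `m` data processes in contiguous
 * blocks whose sizes differ by at most one. counts[i] and displs[i] carry
 * the same meaning as the sendcounts and displs arguments of MPI_Scatterv.
 */
void task_partition(int n_tasks, int m, int *counts, int *displs) {
    assert(n_tasks >= 0 && m > 0);
    assert(counts != 0 && displs != 0);

    int base = n_tasks / m;       /* every data process gets at least this many */
    int extra = n_tasks % m;      /* the first `extra` processes get one more */
    int offset = 0;
    for (int i = 0; i < m; i++) {
        counts[i] = base + (i < extra ? 1 : 0);
        displs[i] = offset;
        offset += counts[i];
    }
}
```

For 10 tasks and 4 data processes this yields counts 3, 3, 2, 2 at offsets 0, 3, 6, 8.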

SLIDE 13

Platform and Setup

Platform:

  • Name: DISCFarm cluster (small), Hrothgar cluster (large)
  • Num. of nodes: DISCFarm: 16 nodes; Hrothgar: 640 nodes
  • CPU: DISCFarm: Xeon 2.6GHz, 8 cores; Hrothgar: Westmere 2.8GHz, 12 cores
  • Memory: DISCFarm: 4GB/node; Hrothgar: 24GB/node

Evaluated Operations:

  • Data assimilation (EnKF): reads the data, applies the EnKF algorithm (including 6 matrix multiplications, 1 matrix addition, and 1 matrix substitution), then writes data of almost the same size
  • Flow-routing: computes the direction in which fluids flow
  • Summation: calculates the total value of all specified data elements
  • Lookup: searches for and returns all elements that meet given criteria
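
Of the evaluated operations, Lookup is the simplest to show as a filter: only matching elements are returned, so less data has to leave the node that read it. The following is a minimal sketch, not the benchmark code used in the evaluation; `match_fn`, `lookup`, and `is_positive` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical predicate type: returns nonzero when an element matches. */
typedef int (*match_fn)(int value);

/*
 * Lookup: copies every element of `data` that satisfies `matches` into `out`
 * (which must have room for `count` elements) and returns how many matched.
 */
size_t lookup(const int *data, size_t count, match_fn matches, int *out) {
    assert(matches != NULL && out != NULL);
    size_t found = 0;
    for (size_t i = 0; i < count; i++) {
        if (matches(data[i]))
            out[found++] = data[i];
    }
    return found;
}

/* Example criterion: keep strictly positive values. */
int is_positive(int value) {
    return value > 0;
}
```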

SLIDE 14

Results and Analysis

⋄ Compared against Active Storage (AS)
⋄ 2 storage nodes, 4 data nodes, and 8 compute nodes for DEPIO; 12 compute nodes for AS
⋄ Around 13% improvement

Figure 9: Performance Comparison of Decoupled I/O and Active Storage I/O (Observed CPU usage on storage nodes: 1.3%). Chart: execution time (s) versus data size (4GB, 8GB, 16GB, 32GB); series: AS, DEPIO.

SLIDE 15

Results and Analysis (with resource contention)

⋄ The workload on storage nodes has a great impact on Active Storage performance
⋄ DEPIO maintains better performance than AS and is less affected by the workload on storage nodes

Figure 10: Performance of Decoupled I/O under Different CPU Usages on storage nodes. Chart: performance improvement against TS versus CPU usage of each storage node (21%, 43%, 61%, 83%); series: AS, DEPIO.

SLIDE 16

Results and Analysis

⋄ Up to 60 nodes in the Hrothgar cluster
⋄ Compared against traditional storage I/O (TS)
⋄ Observed 25% performance improvement

Figure 11: Emulation Performance of the Decoupled I/O. Chart: execution time (s) versus number of nodes (24, 36, 48, 60); series: DEPIO, TS.

SLIDE 17

Overhead of the Decoupled I/O

⋄ The primary overhead comes from communication (Gather, Scatter, etc.)
⋄ As the data size of each I/O request increases, this overhead is observed to decrease steadily

Figure 12: Overhead of the Decoupled I/O Operation. Chart: ratio (20% to 100%) versus data size of each I/O operation (4KB, 32KB, 128KB, 512KB, 2MB, 4MB); series: communication, data processing.

SLIDE 18

Conclusion and Future Work

⋄ Big data computing brings new opportunities but also poses big challenges
⋄ Dedicating data nodes to data-intensive operations can be helpful and critical for system performance
⋄ An initial investigation of a runtime system design that decouples a task into compute-intensive and data-intensive phases

  • Beneficial because of less resource contention and reduced system-wide data movement

⋄ Prototyping was conducted to evaluate the potential of Decoupled I/O
⋄ Plan to investigate the feasibility of integration with MapReduce and the in-memory computing model

SLIDE 19

Thank You

For more information, please visit: http://discl.cs.ttu.edu

This research is sponsored in part by the National Science Foundation under the grants CNS-1338078, CNS-1162540, CNS-1162488, and CNS-1161507.
