Runtime Optimization of Application Level Communication Patterns



SLIDE 1

HIPS 2007 Long Beach Edgar Gabriel

Runtime Optimization of Application Level Communication Patterns

Edgar Gabriel and Shuo Huang
Department of Computer Science, University of Houston
gabriel@cs.uh.edu

SLIDE 2

Motivation

Finite Difference code on a PC cluster using IB and GE interconnects.
Execution time for 200 iterations of the solver on 32 processes/processors.

[Figure: chart of execution time [sec] (axis ticks 5 to 30) for four configurations: 128x128x64 IB, 128x128x128 IB, 128x128x64 TCP, 128x128x128 TCP. Legend entries recoverable from the slide: fcfs, fcfs-pack, ordered, overlap.]

SLIDE 3

How to implement the required communication pattern efficiently?

  • Dependence on platform

– Some functionality is only supported (efficiently) on certain platforms or with certain network interconnects

  • Dependence on MPI library

– Does the MPI library support all available methods?
– Efficiency in overlapping communication and computation
– Quality of the support for user defined data-types

  • Dependence on application

– Problem size
– Ratio of communication to computation

SLIDE 4

  • Problem: How can an (average) user understand the myriad of implementation options and their impact on the performance of the application?
  • (Honest) Answer: no way

– Abstract interfaces for application level communication operations required (ADCL)
– Statistical tools required to detect correlations between parameters and application performance

SLIDE 5

ADCL - Adaptive Data and Communication Library

  • Goals:

– Provide abstract interfaces for often occurring application level communication patterns

  • Collective operations
  • Not covered by the MPI specification

– Provide a wide variety of implementation possibilities and decision routines which choose the fastest available implementation (at runtime)

  • Not replacing MPI, but add-on functionality

– Uses many features of MPI

SLIDE 6

ADCL terminology

ADCL object     Functionality
Request         Handle for tuple of <topology, vector, function-set>
Topology        Abstraction for a process topology
Vector          Abstraction for a multi-dimensional data object
Function-set    Set of functions providing the same functionality
                  • have to have the same attribute-set
Function        Implementation of a particular operation
                  • optionally including an attribute-set and values
Attribute-set   Group of attributes
Attribute       Abstraction for a characteristic of an implementation, represented by the set of its possible values

SLIDE 7

Code sample

ADCL_Vector vec;
ADCL_Topology topo;
ADCL_Request request;

/* Generate a 2-D process topology */
MPI_Cart_create ( comm, 2, cart_dims, periods, 0, &cart_comm );
ADCL_Topology_create ( cart_comm, &topo );

/* Register a 2D vector with ADCL */
ADCL_Vector_register ( ndims, vec_dims, HALO_WIDTH, MPI_DOUBLE, vector, &vec );

/* Match process topology, data item and function-set */
ADCL_Request_create ( vec, topo, ADCL_FNCTSET_NEIGHBORHOOD, &request );

for ( i=0; i<NIT; i++ ) {
    /* Main application loop */
    ADCL_Request_start ( request );
    …
}

SLIDE 8

Runtime selection logic: brute force search (I)

[Figure: implementations no. 1 to 7 are tested one after another; the fastest implementation is then used for the rest of the application.]

SLIDE 9

Runtime selection logic: brute force search (II)

  • Test each function of a given function set a given number of times

– Store the execution time for each execution per process

  • Filter the list of execution times in order to exclude outliers
  • Determine the avg. execution time per function i and process j
  • Determine the max. execution time for function i across all processes

– Requires communication (e.g. MPI_Allreduce)

$$ f_i^{\max} = \max_{j} \left( f_i^{j} \right), \quad j = 0, \ldots, \mathrm{nprocs} - 1 $$
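The filtering, averaging and maximum steps listed on this slide can be sketched in plain C as follows. This is a minimal sketch under stated assumptions, not ADCL's actual code: the helper names `filtered_average` and `max_over_processes` are hypothetical, and the outlier rule (discard samples larger than a chosen multiple of the smallest sample) is an assumption, since the slide does not say how outliers are detected. In an MPI program the final maximum would typically be obtained with `MPI_Allreduce` and the `MPI_MAX` operation; here the per-process averages are passed in as an array so that the example runs without MPI.

```c
#include <assert.h>
#include <stddef.h>

/* Average of the samples that are not treated as outliers.
 * Assumed rule (hypothetical, not from the slides): a sample is an
 * outlier if it exceeds `factor` times the smallest sample.
 * Execution times are assumed to be non-negative. */
double filtered_average(const double *times, size_t n, double factor)
{
    assert(times != NULL && n > 0 && factor >= 1.0);

    double min = times[0];
    for (size_t k = 1; k < n; k++)
        if (times[k] < min)
            min = times[k];
    assert(min >= 0.0);

    double sum = 0.0;
    size_t kept = 0;
    for (size_t k = 0; k < n; k++) {
        if (times[k] <= factor * min) {
            sum += times[k];
            kept++;
        }
    }
    /* The smallest sample always passes the filter, so kept >= 1. */
    return sum / (double)kept;
}

/* Largest per-process average for one function.
 * avg_per_proc[j] holds the average time measured on process j,
 * for j = 0 .. nprocs-1. */
double max_over_processes(const double *avg_per_proc, size_t nprocs)
{
    assert(avg_per_proc != NULL && nprocs > 0);

    double max = avg_per_proc[0];
    for (size_t j = 1; j < nprocs; j++)
        if (avg_per_proc[j] > max)
            max = avg_per_proc[j];
    return max;
}
```

Taking the maximum rather than the mean across processes reflects that the slowest process determines when a communication step is finished for the whole application.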

SLIDE 10

Runtime selection logic: brute force search (III)

  • Determine the function with the minimal max. execution time across all processes
  • Use this function for the rest of the application lifetime

$$ f_{winner} = \min_{i} \left( f_i^{\max} \right), \quad i = 0, \ldots, \mathrm{nfuncs} - 1 $$
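The winner selection on this slide amounts to an argmin over the per-function maxima. A minimal sketch in plain C, assuming the maxima have already been collected into an array; the function name `select_winner` and the tie-breaking rule (lowest index wins) are assumptions made for illustration and are not taken from ADCL.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the index of the function with the smallest maximum
 * execution time. fmax[i] is the max. execution time of function i
 * across all processes, for i = 0 .. nfuncs-1.
 * Ties are resolved in favour of the lowest index (an assumption). */
size_t select_winner(const double *fmax, size_t nfuncs)
{
    assert(fmax != NULL && nfuncs > 0);

    size_t winner = 0;
    for (size_t i = 1; i < nfuncs; i++)
        if (fmax[i] < fmax[winner])
            winner = i;
    return winner;
}
```

Because every process sees the same maxima after the reduction, every process would pick the same index without further communication.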

SLIDE 11

Runtime selection logic: performance hypothesis (I)

  • Assumptions:

– every implementation can be characterized by a set of attributes, which impact its performance, e.g. for neighborhood communication

  • Communication pattern/degree
  • Handling of non-contiguous data
  • Data transfer primitive
  • Overlapping communication and computation

– The fastest implementation will also have the optimal values for these attributes

SLIDE 12

Runtime selection logic: performance hypothesis (II)

  • Approach: determine the optimal value for an attribute by comparing the execution time of functions differing in only a single attribute

– E.g. if function c had the lowest execution time across all processes:

  • Hypothesis: value 3 optimal for attribute 1
  • Confidence value in this hypothesis: 1

[Figure: three functions, each described by its values for attributes 1 to 4, differing only in attribute 1]

Function     Attribute 1   Remaining attributes
Function a   1             X, Y, z
Function b   2             X, Y, z
Function c   3             X, Y, z
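One way to decide whether two functions qualify for such a comparison is to store each function's attribute values in a small array and check that exactly the attribute under test differs. The sketch below is illustrative only: the type `func_attrs`, the constant `NUM_ATTRS` and the function `differ_only_in` are hypothetical names, and encoding attribute values as integers is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NUM_ATTRS 4  /* the slide's example uses four attributes */

/* Hypothetical representation: one integer value per attribute. */
typedef struct {
    int attr[NUM_ATTRS];
} func_attrs;

/* True if a and b differ in attribute `which` and agree on all
 * other attributes. */
bool differ_only_in(const func_attrs *a, const func_attrs *b, size_t which)
{
    assert(a != NULL && b != NULL && which < NUM_ATTRS);

    for (size_t k = 0; k < NUM_ATTRS; k++) {
        bool same = (a->attr[k] == b->attr[k]);
        /* The tested attribute must differ; every other must match. */
        if (k == which ? same : !same)
            return false;
    }
    return true;
}
```

With this check, timing differences between the compared functions can be attributed to the single attribute that changed.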

SLIDE 13

Runtime selection logic: performance hypothesis (III)

  • Evaluate a different set of functions differing in one other attribute, e.g.

[Figure: three functions, each described by its values for attributes 1 to 4, differing only in attribute 1]

Function     Attribute 1   Remaining attributes
Function c   1             X+1, Y, z
Function d   2             X+1, Y, z
Function e   3             X+1, Y, z

– If this set of measurements leads to the same optimal value for attribute 1:

  • Increase confidence value for this hypothesis by 1

– Else decrease the confidence value by 1
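The confidence bookkeeping described on this and the previous slide can be sketched as a small state update. This is a minimal sketch, not ADCL's implementation: the `hypothesis` struct and `update_hypothesis` are hypothetical names, and the behaviour once the confidence has dropped back to 0 (the next observation starts a new hypothesis) is an assumption, since the slides do not cover that case.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothesis about one attribute: which value is believed to be
 * optimal, and how strongly the measurements support that belief. */
typedef struct {
    int best_value;  /* hypothesized optimal value */
    int confidence;  /* 0 means no hypothesis yet */
} hypothesis;

/* Update the hypothesis with the optimal value observed in the
 * latest set of measurements. */
void update_hypothesis(hypothesis *h, int observed_best)
{
    assert(h != NULL);

    if (h->confidence == 0) {
        /* First observation, or previous hypothesis fully eroded
         * (assumed behaviour): start a new hypothesis. */
        h->best_value = observed_best;
        h->confidence = 1;
    } else if (observed_best == h->best_value) {
        h->confidence += 1;  /* measurement confirms the hypothesis */
    } else {
        h->confidence -= 1;  /* measurement contradicts it */
    }
}
```

Starting from `{0, 0}`, two consecutive observations of value 3 leave the hypothesis at value 3 with confidence 2; a following observation of a different value reduces the confidence to 1.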

SLIDE 14

Runtime selection logic: performance hypothesis (IV)

  • If the confidence value for an attribute reaches a given threshold

– Remove all functions not having the required value for this attribute from the function-set

  • If the values for the attribute(s) do not converge towards a value, this algorithm leads to the brute force search
  • Advantage: potentially fewer functions have to be evaluated to determine the winner
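The pruning step can be sketched as an in-place filter over the function-set. The names `func_attrs`, `NUM_ATTRS` and `prune_function_set` are hypothetical, and attribute values are again assumed to be integers; a caller would invoke the function only once the confidence value for the attribute has reached its threshold, whose value the slides do not specify.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_ATTRS 4  /* assumed number of attributes per function */

/* Hypothetical representation: one integer value per attribute. */
typedef struct {
    int attr[NUM_ATTRS];
} func_attrs;

/* Keep only the functions whose attribute `which` has the value
 * `required`. The array is compacted in place, preserving order,
 * and the number of remaining functions is returned. */
size_t prune_function_set(func_attrs *funcs, size_t nfuncs,
                          size_t which, int required)
{
    assert(funcs != NULL || nfuncs == 0);
    assert(which < NUM_ATTRS);

    size_t kept = 0;
    for (size_t i = 0; i < nfuncs; i++) {
        if (funcs[i].attr[which] == required)
            funcs[kept++] = funcs[i];
    }
    return kept;
}
```

If no attribute ever reaches the threshold, nothing is pruned and all functions end up being measured, which corresponds to the brute force search.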

SLIDE 15

Name                   Comm. pattern   Handling of non-cont. data   Data transfer primitive
IsendIrecv_aao         aao             ddt                          MPI_Isend/Irecv/Waitall
IsendIrecv_pair        pair            ddt                          MPI_Isend/Irecv/Waitall
SendIrecv_aao          aao             ddt                          MPI_Send/Irecv/Waitall
SendIrecv_pair         pair            ddt                          MPI_Send/Irecv/Wait
IsendIrecv_aao_pack    aao             ddt                          MPI_Isend/Irecv/Waitall
IsendIrecv_pair_pack   pair            Pack/unpack                  MPI_Isend/Irecv/Waitall
SendIrecv_aao_pack     aao             ddt                          MPI_Send/Irecv/Waitall
SendIrecv_pair_pack    pair            Pack/unpack                  MPI_Send/Irecv/Wait
SendRecv_pair          pair            ddt                          MPI_Send/Recv
Sendrecv_pair          pair            ddt                          MPI_Send/Recv
SendRecv_pair_pack     pair            Pack/unpack                  MPI_Send/Recv
Sendrecv_pair_pack     pair            Pack/unpack                  MPI_Send/Recv
WinfencePut_aao        aao             ddt                          MPI_Put/MPI_Win_fence
WinfenceGet_aao        aao             ddt                          MPI_Get/MPI_Win_fence
PostStartPut_aao       aao             ddt                          MPI_Put/MPI_Win_post/start
PostStartGet_aao       aao             ddt                          MPI_Get/MPI_Win_post/start
WinfencePut_pair       pair            ddt                          MPI_Put/MPI_Win_fence
WinfenceGet_pair       pair            ddt                          MPI_Get/MPI_Win_fence
PostStartPut_pair      pair            ddt                          MPI_Put/MPI_Win_post/start
PostStartGet_pair      pair            ddt                          MPI_Get/MPI_Win_post/start

Currently available implementations for neighborhood communication

SLIDE 16

Performance results (I)

InfiniBand, 32 processes, small problem size

[Figure: chart of execution time [sec] (axis ticks 10.4 to 12.4) for IsendIrecv_aao, SendIrecv_aao, IsendIrecv_pair, SendRecv_pair, SendIrecv_pair, Sendrecv_pair, IsendIrecv_aao_pack, SendIrecv_aao_pack, IsendIrecv_pair_pack, SendRecv_pair_pack, SendIrecv_pair_pack, Sendrecv_pair_pack, brute, hypo]

SLIDE 17

Performance results (II)

InfiniBand, 32 processes, large problem size

[Figure: chart of execution time [sec] (axis ticks 72.5 to 77.5) for IsendIrecv_aao, SendIrecv_aao, IsendIrecv_pair, SendRecv_pair, SendIrecv_pair, Sendrecv_pair, IsendIrecv_aao_pack, SendIrecv_aao_pack, IsendIrecv_pair_pack, SendRecv_pair_pack, SendIrecv_pair_pack, Sendrecv_pair_pack, brute, hypo]

SLIDE 18

Performance results (III)

TCP over Fast Ethernet, 32 processes, small problem size

[Figure: chart of execution time [sec] (axis ticks 50 to 400) for IsendIrecv_aao, SendIrecv_aao, IsendIrecv_pair, SendRecv_pair, SendIrecv_pair, Sendrecv_pair, IsendIrecv_aao_pack, SendIrecv_aao_pack, IsendIrecv_pair_pack, SendRecv_pair_pack, SendIrecv_pair_pack, Sendrecv_pair_pack, brute, hypo]

SLIDE 19

Performance results (IV)

TCP over Fast Ethernet, 32 processes, large problem size

[Figure: chart of execution time [sec] (axis ticks 50 to 450) for IsendIrecv_aao, SendIrecv_aao, IsendIrecv_pair, SendRecv_pair, SendIrecv_pair, Sendrecv_pair, IsendIrecv_aao_pack, SendIrecv_aao_pack, IsendIrecv_pair_pack, SendRecv_pair_pack, SendIrecv_pair_pack, Sendrecv_pair_pack, brute, hypo]

SLIDE 20

Limitations of ADCL

  • Reproducibility of measurements, even on dedicated compute nodes, is a challenging topic

– Hyper-threading
– Processor frequency scaling

  • Network often shared between multiple jobs
  • Hierarchical networks

– Process placement by the batch scheduler

  • Performance hypothesis

– Attributes should not be correlated

  • User has to modify their code

– How much longer will we have to deal with MPI?

SLIDE 21

Advantages of ADCL

  • Provides close to optimal performance in many scenarios
  • Simplifies the development of parallel code for many applications
  • Simplifies the development of adaptive parallel code
  • Currently ongoing work:

– Improving (nearly) all components of ADCL

  • Data filtering
  • Increase parameter space and set of implementations
  • Experiment with other runtime selection algorithms

– Historic learning, Game theory, genetic algorithms

– Integration with a CFD solver in cooperation with Dr. Garbey