Self-Adaptive Architectures for Autonomic Computational Science

SLIDE 1

Self-Adaptive Architectures for Autonomic Computational Science

http://wiki.esi.ac.uk/Distributed_Programming_Abstractions

Shantenu Jha (1), Manish Parashar (2), Omer Rana (3)

(1) CCT and CS, LSU & e-Science Institute, Edinburgh
(2) Rutgers University & NSF Center for Autonomic Computing, USA
(3) Cardiff University & Welsh e-Science Centre, UK

  • Sep. 21 EGEE’09 Barcelona

Shantenu Jha (LSU and eSI) Grid Observatory

  • Sep. 21 EGEE’09 Barcelona

1 / 31

SLIDE 2

Outline

1. Background
2. Elements of ACS Conceptual Framework
3. Conceptual Architectures
  • Tuning of Application
  • Tuning by Application
4. Distributed Autonomic Applications
  • Abstractions for Distributed Systems and Applications
  • Ensemble Kalman Filters
  • Ensemble Kalman Filters – Mode II
5. Analysis

SLIDE 3

Background

Context

Grid infrastructures present unprecedented opportunities for computational science and engineering, with the potential for fundamental insights into complex phenomena.

  • Where can autonomics be of benefit? There are various existing investments in Grid computing infrastructure but limited uptake, due to (i) the complexity of developing applications and (ii) changes in the infrastructure. Some support for system adaptation is needed.
  • How can autonomics help computational science utilize distributed resources? The DPA theme takes an application-centric view; it is intended to initiate discussion about this theme, not to be prescriptive.
  • Autonomic Distributed Computational Applications: break free of the static (execution) model and enable dynamic execution of applications.
  • Autonomics + Abstractions: demonstrate effectiveness in scaling out across multiple sites (Grids).
  • Through empirical development, experience and analysis of applications on infrastructure, understand the role and advantages of autonomics.

SLIDE 4

Elements of ACS Conceptual Framework

ACS Framework Elements

Application-level Objective (AO): user-identified application requirement, e.g. increase throughput, reduce task failure, load balance, etc.

Mechanism: action used by the application or resource manager to achieve an AO. A mechanism is written $m: (\{m_i\}, \{m^e_i\}, \{m_o\}, \{m^e_o\})$, e.g. for file staging:

  • $\{m_i\}$ and $\{m_o\}$: file references before/after the staging process
  • $\{m^e_i\}$: input events that trigger the start of file staging
  • $\{m^e_o\}$: output events after file staging is completed

Strategy: consists of a collection of mechanisms, constructed manually or dynamically by an autonomic approach.
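The mechanism tuple and the notion of a strategy as a collection of mechanisms can be sketched as a small data structure. This is only an illustration, not the authors' implementation: the class names, the `fire`/`handle` methods, the event names and the `remote://`/`local://` reference scheme are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set


@dataclass
class Mechanism:
    """One mechanism m = ({m_i}, {m^e_i}, {m_o}, {m^e_o})."""
    name: str
    inputs: Set[str]                        # {m_i}: e.g. file references before staging
    input_events: Set[str]                  # {m^e_i}: events that trigger the mechanism
    action: Callable[[Set[str]], Set[str]]  # maps {m_i} to {m_o}
    output_events: Set[str]                 # {m^e_o}: events emitted on completion

    def fire(self, event: str):
        """Run the action if `event` is a trigger; return ({m_o}, {m^e_o})."""
        if event not in self.input_events:
            return None
        return self.action(self.inputs), self.output_events


@dataclass
class Strategy:
    """A strategy is a collection of mechanisms."""
    mechanisms: List[Mechanism] = field(default_factory=list)

    def handle(self, event: str):
        """Fire every mechanism triggered by `event`, following emitted events."""
        log, pending, seen = [], [event], set()
        while pending:
            ev = pending.pop(0)
            if ev in seen:
                continue
            seen.add(ev)
            for m in self.mechanisms:
                result = m.fire(ev)
                if result is not None:
                    outputs, events = result
                    log.append((m.name, sorted(outputs)))
                    pending.extend(sorted(events))
        return log


def stage(refs):
    # Hypothetical staging: rewrite each remote reference to a local one.
    return {r.replace("remote://", "local://") for r in refs}


staging = Mechanism("file_staging", {"remote://data/in.dat"},
                    {"job_submitted"}, stage, {"staging_done"})
launch = Mechanism("task_launch", {"local://data/in.dat"},
                   {"staging_done"}, lambda refs: refs, {"task_started"})
strategy = Strategy([staging, launch])
log = strategy.handle("job_submitted")
```

Here the output event of file staging is the input event of the next mechanism, which is one way a strategy could chain mechanisms together.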

SLIDE 5

Elements of ACS Conceptual Framework

Self-Adaptive Approaches

Two approaches:

  • (i) Top Down: overall system goals are achieved by modifying the interconnectivity or behaviour of system components, realized through a system manager.
  • (ii) Bottom Up: the local behaviour of system components is aggregated (without a centralized system manager) to generate some overall system behaviour.

The current focus is primarily on (i). However, (ii) may be used as a precursor, to dynamically form resource ensembles using clustering approaches.

Adaptation approaches (at different levels):

  • Modify Code
  • Modify Structure
  • Modify Application Parameters (based on previous executions, and driven by a set of external constraints)

SLIDE 6

Elements of ACS Conceptual Framework

Conceptual Architectures for ACS

  • Tuning of Applications: SAMR and Coupled-Fusion Simulations
  • Tuning by Applications: EnKF

SLIDE 7

Elements of ACS Conceptual Framework

Vectors: Understanding Distributed Applications

Vectors: axes representing application characteristics, the values of which help us understand:

  • the application requirements, and
  • the design and constraints of solutions and tools

Vector listing:

  • Executable Unit
  • Communication
  • Coordination
  • Execution Environment

SLIDE 8

Conceptual Architectures Tuning of Application

Tuning of application & resource manager parameters

Example: dynamic structured adaptive mesh refinement (SAMR) techniques on structured meshes/grids.

  • SAMR methods employ locally optimal approximations, leading to highly advantageous cost/accuracy ratios compared to numerical techniques based on static uniform discretization.
  • Computational resources are focused at runtime on regions with large local solution error.
  • The adaptive nature and inherent space-time heterogeneity of SAMR implementations ⇒ dynamic resource allocation, data distribution, load balancing, and runtime management.
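The idea of focusing resources on regions with large local error can be illustrated on a one-dimensional profile. The sketch below is not a SAMR implementation: it only shows the error-driven flagging step, and the error indicator (an absolute second difference), the threshold and the refinement ratio are assumptions chosen for the example.

```python
import math


def flag_cells(values, threshold):
    """Flag interior cells whose local error indicator exceeds threshold.

    The indicator (absolute second difference) is an assumed stand-in
    for a real local solution error estimate.
    """
    flags = [False] * len(values)
    for i in range(1, len(values) - 1):
        indicator = abs(values[i - 1] - 2.0 * values[i] + values[i + 1])
        flags[i] = indicator > threshold
    return flags


def workload(flags, refine_ratio=2):
    """Cell count after refinement: a flagged cell becomes refine_ratio cells."""
    return sum(refine_ratio if f else 1 for f in flags)


# A steep front near x = 0.5 on an otherwise flat profile.
n = 32
xs = [i / (n - 1) for i in range(n)]
u = [math.tanh(40.0 * (x - 0.5)) for x in xs]
flags = flag_cells(u, threshold=0.05)
refined = [i for i, f in enumerate(flags) if f]
```

Only the cells around the front are flagged, so the refined workload stays well below that of refining the whole grid uniformly. As the front moves, the flagged set and hence the workload distribution change, which is what drives the need for dynamic load balancing.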

SLIDE 9

Conceptual Architectures Tuning of Application

SAMR example

  • 3-D compressible turbulence (RM3D) SAMR simulation with 256 × 64 × 64 resolution.
  • The RM3D application serves as a representative of the class of simulations that exhibit significant dynamism and spatiotemporal heterogeneity.
  • Changes in RM3D application physics create dynamically varying simulation workloads; the load at each grid point is assumed to be uniform.
  • The peak total workload is about 8 times larger than the minimum total workload and over two times larger than the average total workload for this simulation.

From: Sumir Chandra’s PhD thesis

SLIDE 10

Conceptual Architectures Tuning of Application

Coupled Fusion Simulation

  • Workflow with coupled simulation codes, i.e., the edge turbulence particle-in-cell (PIC) code (GTC) and the microscopic MHD code (M3D), run simultaneously on separate HPC resources at supercomputing centers.
  • Data is streamed and processed en route, e.g. data from the PIC codes is filtered through "noise detection" processes before it can be coupled with the MHD code.
  • Efficient data streaming between live simulations, so that data arrives just in time: if it arrives too early, time and resources are wasted buffering the data; if it arrives too late, the application wastes resources waiting for the data to come in.
  • Opportunistic use of in-transit resources.
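The just-in-time constraint and the opportunistic use of in-transit resources can be illustrated with a small planning function. This is a sketch under simple assumptions (a known transfer time and a known in-transit processing time, both in seconds); the function name and the returned fields are hypothetical and do not come from the coupled fusion framework.

```python
def plan_delivery(now, needed_at, transfer_time, in_transit_work):
    """Decide how to use the slack before the consumer needs the data.

    slack > 0 means the data would arrive early and must be buffered;
    slack < 0 means the consumer would sit idle waiting for it.
    """
    slack = needed_at - (now + transfer_time)
    if slack < 0:
        # Already late: forward immediately, the consumer waits.
        return {"action": "forward", "late_by": -slack, "buffer_time": 0.0}
    if slack >= in_transit_work:
        # Enough slack to process en route (e.g. filtering) before delivery.
        return {"action": "process_in_transit", "late_by": 0.0,
                "buffer_time": slack - in_transit_work}
    # Some slack, but not enough for in-transit processing: buffer it.
    return {"action": "forward", "late_by": 0.0, "buffer_time": slack}


on_time = plan_delivery(now=0.0, needed_at=10.0, transfer_time=4.0, in_transit_work=5.0)
too_tight = plan_delivery(now=0.0, needed_at=10.0, transfer_time=4.0, in_transit_work=8.0)
late = plan_delivery(now=0.0, needed_at=10.0, transfer_time=12.0, in_transit_work=1.0)
```

An adaptive strategy would re-evaluate such a plan as transfer-time estimates change, trading buffering time against consumer idle time.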

SLIDE 11

Conceptual Architectures Tuning of Application

Coupled Fusion Application

| Vectors | Mechanisms |
|---|---|
| Coordination | Peer-2-Peer interaction |
| Communication | Data Streaming, Events |
| Execution Environment | Storage Selection (local/remote), Resource Selection/Management, Task migration, Checkpointing, Task execution (local/remote), Dynamic provisioning (provisioning of in-transit storage/processing nodes) |

Table: Tuning mechanisms in the Coupled Fusion Simulation application.

SLIDE 12

Conceptual Architectures Tuning of Application

Coupled Fusion Application

| Application Objective | Autonomic Strategy |
|---|---|
| Maintain latency-sensitive data delivery | Resource Management: adaptive data buffering (time, size), adaptive buffering strategy, adaptive data transmission & destination selection |
| Maximize data quality | Resource Management: opportunistic in-transit processing, adaptive in-transit buffering |
| Scientific Fidelity | Algorithmic Adaptivity: in-time data coupling, model correction using dynamic data, solver adaptations |

Table: Coupled fusion simulation application management using autonomic strategies.

SLIDE 13

Conceptual Architectures Tuning by Application

EnKF: Tuning by application

  • Resource reservation to achieve particular QoS criteria
  • Dynamic analysis of a data stream from a scientific instrument; may also involve analysis of video/audio feeds

SLIDE 14

Distributed Autonomic Applications Abstractions for Distributed Systems and Applications

Abstractions for Distributed Computing

SLIDE 15

Distributed Autonomic Applications Abstractions for Distributed Systems and Applications

Abstractions for Distributed Computing - 2

[Figure: SAGA-based Glide-In framework for an RE application. An RE-Manager uses SAGA CPR/Migol, SAGA File and SAGA Advert through the BigJob abstraction. On each of Resource 1 and Resource 2, a Replica-Agent (SAGA CPR/Migol, SAGA Advert) runs a big-job whose sub-jobs each execute replicas. SAGA implementation/adaptors: Migol, Globus.]

SLIDE 16

Distributed Autonomic Applications Ensemble Kalman Filters

EnKF

  • Ensemble Kalman filters (EnKF): recursive filters for handling large, noisy data; the data can be the results and parameters of ensembles of models.
  • The required run time varies between models, which impacts overall results: each model run needs to converge before the next stage can begin.
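To make the filter step concrete, here is a minimal sketch of one stochastic (perturbed-observation) EnKF analysis step, reduced to a scalar state that is observed directly. It is an assumed textbook-style simplification, not the application's solver; the function names and the numbers in the example are hypothetical. The second function captures the stage barrier: a stage takes as long as its slowest ensemble member.

```python
import random
import statistics


def enkf_analysis(forecast, observation, obs_var, rng):
    """Stochastic EnKF analysis step for a scalar state observed directly."""
    p = statistics.variance(forecast)   # sample forecast variance
    gain = p / (p + obs_var)            # scalar Kalman gain
    obs_std = obs_var ** 0.5
    analysis = []
    for x in forecast:
        # Each member assimilates its own perturbed copy of the observation.
        perturbed = observation + rng.gauss(0.0, obs_std)
        analysis.append(x + gain * (perturbed - x))
    return analysis


def stage_time(member_runtimes):
    """A stage ends only when its slowest ensemble member has finished."""
    return max(member_runtimes)


rng = random.Random(0)
forecast = [rng.gauss(0.0, 2.0) for _ in range(500)]
analysis = enkf_analysis(forecast, observation=3.0, obs_var=1.0, rng=rng)
```

The analysis ensemble is pulled toward the observation and its spread shrinks relative to the forecast. Because of the barrier in `stage_time`, a single slow or late-starting member delays the whole stage, which is why resource selection matters for this application.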

[Figure: EnKF pipeline over Stage 1, Stage 2 and Stage 3, with ensemble members 1 … n in each stage and Kalman filter (KF) steps.]

SLIDE 17

Distributed Autonomic Applications Ensemble Kalman Filters

EnKF ... 2

  • Launch jobs, corresponding to different EnKF stages, on multiple TeraGrid (TG) resources, choosing resources using the Batch Queue Predictor (BQP).
  • BQP: a tool available on a number of TG resources that allows users to make bounded predictions about the time a job of a given size and duration will spend in the queue.
  • Estimating: $T_c = t_{\mathrm{overhead}} + t_{\mathrm{wait}} + t_{\mathrm{run}}$
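The estimate $T_c = t_{\mathrm{overhead}} + t_{\mathrm{wait}} + t_{\mathrm{run}}$ can be used to compare candidate resources. The sketch below assumes the three terms are already available per resource (the wait term would come from a queue prediction); the resource names and numbers are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceEstimate:
    name: str
    t_overhead: float  # submission/staging overhead, minutes (assumed)
    t_wait: float      # predicted queue wait, minutes (assumed)
    t_run: float       # expected run time, minutes (assumed)

    @property
    def t_c(self):
        """Estimated time to completion: T_c = t_overhead + t_wait + t_run."""
        return self.t_overhead + self.t_wait + self.t_run


def pick_resource(estimates):
    """Choose the resource with the smallest estimated T_c."""
    return min(estimates, key=lambda e: e.t_c)


candidates = [
    ResourceEstimate("machine_a", t_overhead=2.0, t_wait=90.0, t_run=30.0),
    ResourceEstimate("machine_b", t_overhead=5.0, t_wait=20.0, t_run=45.0),
    ResourceEstimate("machine_c", t_overhead=3.0, t_wait=60.0, t_run=25.0),
]
best = pick_resource(candidates)
```

In this made-up example the machine with the slowest run time still wins, because its short predicted wait dominates the sum.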

SLIDE 18

Distributed Autonomic Applications Ensemble Kalman Filters

EnKF ... 3

Vectors characterising the EnKF application:

| Vectors | Mechanisms |
|---|---|
| Coordination | Centralized Data-store (SAGA), Pilot-Job (BigJob) abstraction |
| Communication | File staging, File indexing |
| Execution Environment | Centralized Scheduler, Resource Selection/Management, Task re-execution, Task migration, Storage management, File caching, File distribution, Checkpointing |

Table: Tuning Mechanisms in EnKF

SLIDE 19

Distributed Autonomic Applications Ensemble Kalman Filters

EnKF ... 4

| Application Objective | Autonomic Strategy |
|---|---|
| Load Balancing | 1. Adapt task mapping granularity based on system capabilities/state: File staging, File splitting/merging; Task rescheduling, Task migration; File distribution and caching, Storage Management |
| Load Balancing | 2. Resource Selection: Resource selection (using BQP), resource configuration update; Task rescheduling, Task migration; File distribution and caching; Storage Management |
| Scientific Fidelity | Algorithmic Adaptivity: Change solvers |

Table: EnKF application management through autonomic strategies

SLIDE 20

Distributed Autonomic Applications Ensemble Kalman Filters

BQP

  • BQP allows users to make bounded predictions of how much time a job of a given size, duration and location will spend in the queue.
  • The prediction is given with a degree of confidence (probability) and quantile (repeatability).
  • BQP is available on a number of TeraGrid resources.
  • Use BQP information to submit BigJobs with sizes and durations, and to locations, that are optimized to spend a minimal amount of time in the queue, and therefore give a smaller TTC.
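To illustrate what a quantile-based bound on queue wait looks like, the sketch below computes a plain nearest-rank empirical quantile over past waits and picks the job configuration with the smallest bound. This is a simplified stand-in and not BQP's actual prediction method; the function names, site names and wait times are hypothetical.

```python
import math


def wait_bound(past_waits, quantile):
    """Nearest-rank empirical quantile of past queue waits (minutes)."""
    if not past_waits:
        raise ValueError("need at least one past wait")
    if not 0.0 < quantile <= 1.0:
        raise ValueError("quantile must be in (0, 1]")
    ordered = sorted(past_waits)
    rank = math.ceil(quantile * len(ordered))
    return ordered[rank - 1]


def best_config(history, quantile):
    """Pick the (location, cores) configuration with the smallest bound."""
    return min(history, key=lambda cfg: wait_bound(history[cfg], quantile))


# Hypothetical past waits, keyed by (location, cores); not measured data.
history = {
    ("site_a", 64): [5, 10, 15, 20, 30, 60, 120, 600],
    ("site_b", 64): [20, 25, 30, 35, 40, 45, 50, 55],
    ("site_a", 16): [1, 2, 3, 5, 8, 13, 21, 34],
}
choice = best_config(history, quantile=0.75)
```

Note how the occasional very long wait at `site_a` with 64 cores pushes its bound above that of `site_b`, even though most of its waits are shorter: a bound at a given quantile rewards predictability, not just a low typical wait.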

SLIDE 21

Distributed Autonomic Applications Ensemble Kalman Filters

TeraGrid and BQP

SLIDE 22

Distributed Autonomic Applications Ensemble Kalman Filters

Performance

  • Concurrent use of machines to solve EnKF.
  • Some improvement through the use of a queue prediction technique, especially when using multiple machines concurrently.
  • But why does the use of BQP improve performance?

SLIDE 23

Distributed Autonomic Applications Ensemble Kalman Filters

Performance - 2

SLIDE 24

Distributed Autonomic Applications Ensemble Kalman Filters

Tuning by Application or Tuning of Application?

  • EnKF: example of the use of tuning by application
  • Two sides to every application
  • Determined by: (i) Application; (ii) Usage Model; (iii) Architecture

SLIDE 25

Distributed Autonomic Applications Ensemble Kalman Filters

Ensemble Kalman Filter – Mode II

  • Hybrid architectures, e.g. TeraGrid and Clouds, will be part of future infrastructure.
  • Many challenges: how will applications utilize such infrastructure?
  • Key issues: (i) coordination across heterogeneous architectures; (ii) performance and run-time tradeoffs; (iii) decomposing applications; (iv) which usage modes can take advantage?
  • Autonomics on hybrid architectures, with multiple objectives: (i) Acceleration, (ii) Conservation and (iii) Resilience.
  • Use of BQP on TeraGrid; "user space" evaluation on Clouds.

SLIDE 26

Distributed Autonomic Applications Ensemble Kalman Filters

Ensemble Kalman Filter – Mode II: Architecture

  • Workflow Manager: includes the Planner as well as monitors/managers; inserts task meta-data into Comet Space and notifies the autonomic scheduler.
  • Estimator: converts computational complexity into cost and/or time estimates.
  • Autonomic Scheduler: uses application hints to estimate relative complexities and provides an initial hybrid mix based on (i) Objectives, (ii) Policies and (iii) Constraints.
  • Grid/Cloud Agents: provision resources, configure workers as execution agents and assign tasks to these workers. Pull (Clouds); Pull-Push (TG).
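The interplay between the estimator and the autonomic scheduler can be sketched as follows. This is a toy greedy scheduler under stated assumptions, not the Comet-based implementation: the resource classes, their speeds and prices, the `estimate` and `schedule` functions and the deadline constraint are all hypothetical, and only the "acceleration" and "conservation" objectives are modelled.

```python
from dataclasses import dataclass


@dataclass
class ResourceClass:
    name: str
    speed: float          # complexity units per hour (assumed)
    cost_per_hour: float  # price per hour (assumed)
    busy_until: float = 0.0


def estimate(complexity, resource):
    """Estimator: turn relative complexity into (hours, cost) on one resource."""
    hours = complexity / resource.speed
    return hours, hours * resource.cost_per_hour


def schedule(complexities, resources, objective, deadline=None):
    """Greedy initial mix: largest tasks first, one resource choice per task."""
    plan = []
    for c in sorted(complexities, reverse=True):
        def finish(r, c=c):
            return r.busy_until + estimate(c, r)[0]

        if objective == "acceleration":
            # Minimize completion time, ignoring cost.
            chosen = min(resources, key=finish)
        elif objective == "conservation":
            # Cheapest resource that still meets the deadline constraint.
            in_time = [r for r in resources
                       if deadline is None or finish(r) <= deadline]
            chosen = min(in_time or resources,
                         key=lambda r: (estimate(c, r)[1], finish(r)))
        else:
            raise ValueError(f"unknown objective: {objective}")
        chosen.busy_until = finish(chosen)
        plan.append((c, chosen.name))
    return plan


def fresh():
    # Hypothetical resource classes; speeds and prices are made up.
    return [ResourceClass("tg", speed=4.0, cost_per_hour=0.0),
            ResourceClass("cloud", speed=2.0, cost_per_hour=1.0)]


fast = schedule([8, 4, 4, 2], fresh(), "acceleration")
cheap = schedule([8, 4, 4, 2], fresh(), "conservation", deadline=4.0)
```

With the same task list, the acceleration objective spreads work across both resource classes, while the conservation objective keeps work on the free resource until the deadline would be missed and only then spills onto the paid one.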

SLIDE 27

Distributed Autonomic Applications Ensemble Kalman Filters – Mode II

Ensemble Kalman Filter – Mode II: Task Distribution

SLIDE 28

Distributed Autonomic Applications Ensemble Kalman Filters – Mode II

EnKF Mode II: Results

SLIDE 29

Analysis

Dynamic Execution: Achieving Nirvana for Distributed Applications

  • Static (aka Death by Localization) Execution Model versus Dynamic Execution
  • Multi-Level: match resources to computational requirement: TG vs Cloud?
      • Take a workload; find a set of resources
      • Find a set of resources (first-to-run); tune workload
  • Reverse Scheduling: support for dynamic execution of applications is the underlying paradigm
      • Late binding to resource and optimal resource configuration (not just resource selection)
      • Unbalanced, irregular workload
  • Utilizing job and resource properties in this way would also lead to a complementary self-organizing architecture, i.e. one that is able to adapt application properties and job scheduling based on the characteristics of resources on offer.
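The benefit of late binding over a static execution model can be shown with a small simulation. The sketch assumes pilot jobs that become usable after different queue waits and tasks of known duration; the function names and numbers are illustrative only.

```python
import heapq


def dynamic_makespan(task_durations, pilot_start_times):
    """Late binding: each task goes to whichever pilot is free first."""
    free_at = list(pilot_start_times)  # a pilot is usable after its queue wait
    heapq.heapify(free_at)
    finish = 0.0
    for d in task_durations:
        start = heapq.heappop(free_at)
        end = start + d
        finish = max(finish, end)
        heapq.heappush(free_at, end)
    return finish


def static_makespan(task_durations, pilot_start_times):
    """Early binding: tasks are split round-robin before any pilot starts."""
    finish_at = list(pilot_start_times)
    for i, d in enumerate(task_durations):
        finish_at[i % len(finish_at)] += d
    return max(finish_at)


tasks = [2.0] * 6      # six equal tasks, durations in hours (made up)
pilots = [1.0, 10.0]   # one pilot starts early, the other waits in the queue
late_bound = dynamic_makespan(tasks, pilots)
early_bound = static_makespan(tasks, pilots)
```

With late binding the first-to-run pilot absorbs most of the work while the other is still queued, so the workload finishes sooner than when half of the tasks are pinned in advance to the slow-starting resource.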

SLIDE 30

Analysis

Conclusion

  • Two Self-Adaptive Architectures presented; the tuning engine is logically external to the application.
  • Architectures emphasize particular usage modes: same application, different architectures, based upon Usage Modes.
  • EnKF: example of Usage Mode and Programming Systems determining which Self-Adaptive Architecture is employed.
  • Dynamic Execution Model: EnKF (i) Coarse Grained: Forward and (ii) Fine Grained: Reverse.

SLIDE 31

Acknowledgment

Acknowledgements

  • Hyunjoo Kim (Rutgers) and Yaakoub el-Khamra (TACC & LSU)
  • SAGA Team
  • DPA Team: Dan Katz, Jon Weissman and Murray Cole
