CCDSC 2016, 10/4/2016: Equivalent platforms for unmodified applications

  1. Tiziano Passerini, Jaroslaw Slawinski, Umberto Villa, Sofia Guzzetti, Alessandro Veneziani, Vaidy Sunderam. Mathematics & Computer Science, Emory University, Atlanta, USA. CCDSC 2016, 10/4/2016.

  2. Equivalent platforms for unmodified applications. [Diagram: logical view of computing. Clusters/supercomputers: single OS, homogeneous CPUs, parallel, good network (RAM, IB, low latency, SMP), soft-preconditioned (threads, OpenXYZ, MPI). Virtualization / IaaS clouds, VOs, P2P: heterogeneous CPUs, distributed, intra/inter-net, 10 Gb/s Ethernet. User view: "I have my application; I need some CPU(s); do I care about comm/IO? Maybe." Look: a soft condition to have a resource like the above; feel: depends (on the coupling).]

  3. If different computational platforms may be used interchangeably... [Chart (not real data): turnaround time, in time or effort units, for a dev cluster, single node, supercomputer, and IaaS cloud, split into soft preconditioning, waiting for resources, and computation.]

  4. Dev environment: no soft conditioning. "Rented" resources: no up-front costs. [Chart (not real data): distribution of costs per execution, in virtual dollars, for a dev cluster, single node, supercomputer/VO, and IaaS cloud, split into amortized up-front costs, amortized administration, and computation & storage or energy.]

  5. Case study: LifeV-based hemodynamic simulation. CFD/FEM MPI-parallel code built on the LifeV library. Issues: process placement, turnaround, cost. Utility.

  6. FEM input mesh partitioned into 8 partitions (8 processes, Scotch partitioner). Logical topology graph vs. physical topology: how to match them? [Figure: affinity zones, CPU cores, and internode connections (1 Gb/s, 50 Gb/s, 100 Gb/s) across partitions 0-7.]

  7. M: data from the partitioner. D: data from benchmarks. I: inverted D. Round-robin and per-core: input-agnostic allocations (see the sketch below).
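A minimal placement sketch in Python, assuming (as the slide's labels suggest) that M holds inter-partition communication volumes from the partitioner and D holds benchmarked inter-core link costs; the cost model, the exhaustive search, and the toy data are illustrative assumptions, not the authors' allocation method.

    import itertools

    def comm_cost(placement, M, D):
        # Estimated cost of a placement: communication volume between processes
        # i and j (from M) weighted by the benchmarked cost of the link between
        # the cores they are placed on (from D).
        n = len(M)
        return sum(M[i][j] * D[placement[i]][placement[j]]
                   for i in range(n) for j in range(i + 1, n))

    def round_robin(n_procs, cores):
        # Input-agnostic placement: process p goes to core p mod len(cores).
        return [cores[p % len(cores)] for p in range(n_procs)]

    def best_placement(M, D, cores):
        # Exhaustive search; feasible for the 8-partition example on the slides.
        return min(itertools.permutations(cores, len(M)),
                   key=lambda perm: comm_cost(perm, M, D))

    # Toy data: heavy pairs (0,2) and (1,3); cores 0-1 and 2-3 share fast links.
    M = [[0, 1, 5, 1], [1, 0, 1, 5], [5, 1, 0, 1], [1, 5, 1, 0]]
    D = [[0, 1, 10, 10], [1, 0, 10, 10], [10, 10, 0, 1], [10, 10, 1, 0]]
    rr = round_robin(4, [0, 1, 2, 3])
    print(comm_cost(rr, M, D), comm_cost(best_placement(M, D, [0, 1, 2, 3]), M, D))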

  8. NP [figure-only slide; no further text is recoverable]

  9. [Figure: clinical workflow, steps 1-5.] Diagnosis; bypass or stent placement; cost vs. turnaround.

  10. Platforms: (1) Ellipse: university cluster, 256 nodes / 1k cores, 1 Gb/s Ethernet, SGE queue. (2) Puma: dev-environment cluster, 32 nodes / 128 cores, IB SDR, PBS queue. (3) Lonestar: XSEDE supercomputer, IB QDR, PBS queue. (4) Rockhopper: On-Demand HPC Cloud Service cluster (Penguin Computing), IB QDR, PBS queue. (5) Amazon EC2: 1-16 cc2.8xlarge nodes, 16 cores per node, 10 Gb/s Ethernet.

  11. Aneurysm simulation: about 1 million elements (FEM); computes pressure and velocity every 0.01 s. Same problem run with varying numbers of processes (strong scalability test), one MPI process per computing core in round-robin placement (see the scaling sketch below).
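For reference, a small sketch of the strong-scalability metrics this test implies (relative speedup and parallel efficiency for a fixed problem size); the timings below are placeholders, not the measurements reported in the talk.

    # Hedged sketch: strong-scaling metrics for a fixed-size problem.
    def strong_scaling(times):
        # times: {process count: wall-clock seconds}; baseline = smallest count.
        p0 = min(times)
        for p in sorted(times):
            speedup = times[p0] / times[p]
            efficiency = speedup * p0 / p
            print(f"{p:4d} procs  speedup {speedup:5.2f}  efficiency {efficiency:4.2f}")

    strong_scaling({16: 100.0, 32: 55.0, 64: 30.0, 128: 18.5})  # placeholder timings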

  12. A: fastest overall. B: supercomputer nodes are not the fastest. C: a single EC2 instance ≈ 16 processes on the supercomputer. D: fastest EC2 configuration. EC2 scalability...

  13. Avg is 4h 44m

  14. Puma and Lonestar: costs estimated from hardware/operational expenses, using typical figures reported in the literature. Ellipse: university pricing. Rockhopper: actual charges. EC2: as many cheap spot-request (bid-based) instances as possible, about 6 times cheaper than regular instances (see the cost sketch below).
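A rough per-execution cost sketch for the EC2 strategy described above; the hourly price, node counts, and the factor-of-6 spot discount are placeholders or approximations taken from the slide, not actual billing data.

    # Hedged sketch: per-execution EC2 cost with mixed spot and regular instances.
    def ec2_cost(nodes, hours, on_demand_price, spot_nodes, spot_discount=6.0):
        spot_nodes = min(spot_nodes, nodes)
        spot_price = on_demand_price / spot_discount   # "about 6 times cheaper"
        return hours * (spot_nodes * spot_price +
                        (nodes - spot_nodes) * on_demand_price)

    # e.g., 8 nodes for 4.7 hours, 6 of them obtained via spot requests
    print(round(ec2_cost(nodes=8, hours=4.7, on_demand_price=2.0, spot_nodes=6), 2))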

  15. Value of simulation results to the user over time. U: utility value (e.g., in $). Umax: the maximum value the user is willing to pay (importance of the task). T*: expected completion time. T0: latest completion time. |T* - T0|: delay tolerance.
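One utility curve consistent with these definitions is a piecewise-linear decay from Umax at T* to zero at T0; this particular shape (and the Python rendering) is an assumption, not necessarily the exact model used in the talk.

    # Hedged sketch: utility of results delivered at time t.
    def utility(t, u_max, t_star, t0):
        if t <= t_star:
            return u_max                              # on time: full value
        if t >= t0:
            return 0.0                                # too late: no value
        return u_max * (t0 - t) / (t0 - t_star)       # linear decay over |T* - T0|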

  16. Range of minimum prices per simulation across all architectures: $3.53 to $22.59; average $10.30.

  17. Low (3), high (1), and average (2) priority jobs. T* = 4.44 hrs; #3 = $10.31, #1 = $20.62. A: overall fastest execution. C: overall cheapest execution. D: fastest time for EC2 (see the selection sketch below).
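A hedged sketch of utility-driven platform selection, picking the platform that maximizes utility minus cost; the utility shape, T0, and all platform times and prices below are placeholders, not the measured data behind points A, C, and D.

    # Hedged sketch: choose the platform maximizing net value, utility(T) - cost.
    def utility(t, u_max, t_star, t0):
        # Same piecewise-linear assumption as in the earlier sketch.
        if t <= t_star:
            return u_max
        return max(0.0, u_max * (t0 - t) / (t0 - t_star))

    def select_platform(platforms, u_max, t_star, t0):
        # platforms: {name: (turnaround_hours, cost_dollars)}
        return max(platforms,
                   key=lambda p: utility(platforms[p][0], u_max, t_star, t0)
                                 - platforms[p][1])

    # Placeholder candidates (turnaround in hours, cost in dollars).
    candidates = {"Lonestar": (3.5, 14.0), "EC2": (4.0, 9.0), "Ellipse": (6.5, 4.0)}
    print(select_platform(candidates, u_max=20.62, t_star=4.44, t0=8.0))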

  18. Turnaround vs. cost tradeoffs vary considerably across platforms (and are multiplied by parameter sweeps). Some IaaS cloud resources offer superior capabilities compared to cluster/supercomputer nodes (large single instances vs. local clusters). Queue waiting time is not considered in this study, but it may significantly change selection decisions for time-critical computation (e.g., medical diagnosis).
