Urgent Computing, Sharing Grid Resources, and Elastic Computing



SLIDE 1

Urgent Computing, Sharing Grid Resources, and Elastic Computing

Pete Beckman

Argonne National Laboratory University of Chicago http://www.mcs.anl.gov/~beckman

SLIDE 2
SLIDE 3

SPRUCE Urgent Computing Flat Rock, North Carolina, 2006 http://www.mcs.anl.gov/~beckman

3 Argonne Nat’l Lab/U Chicago

SLIDE 4

SLIDE 5

SLIDE 6

Urgent Computing: I Need it Now!

  • Applications with dynamic data and result deadlines are being deployed
  • Late results are useless
      - Wildfire path prediction
      - Storm/Flood prediction
      - Influenza modeling
  • Some jobs need priority access

“Right-of-Way Token”

SLIDE 7

How can we get cycles?

  • Build supercomputers for the app
      - Pros: Resource is ALWAYS available
      - Cons: Incredibly costly (99% idle)
      - Example: Coast Guard rescue boats
  • Share public infrastructure
      - Pros: Low cost
      - Cons: Requires complex system for authorization, resource mgmt, and control
      - Examples: school buses for evacuation, cruise ships for temporary housing

SLIDE 8

Introducing SPRUCE

  • The Vision:
      - Build cohesive infrastructure that can provide urgent computing cycles
  • Technical Challenges:
      - Provide high degree of reliability
      - Elevated priority mechanisms
      - Resource selection, data movement
  • Social Challenges:
      - Who? When? What?
      - How will emergency use impact regular use?
      - Decision-making, workflow, and interpretation

SLIDE 9

Existing “Digital Right-of-Way” Emergency Phone System

  • GETS priority is invoked “call-by-call”
  • Calling cards are in widespread use and easily understood by the NS/EP User, simplifying GETS usage
  • GETS is a “ubiquitous” service in the Public Switched Telephone Network: if you can get a DIAL TONE, you can make a GETS call

[Diagram label: GETS User Organization]

SLIDE 10

SPRUCE Architecture Overview (1/3): Right-of-Way Tokens

[Diagram, steps 1-2. Labels: Event; Automated Trigger; Human Trigger; First Responder; Right-of-Way Token; SPRUCE Science Gateway]

SLIDE 11

SPRUCE Architecture Overview (2/3): Submitting Urgent Jobs

[Diagram, steps 3-5. Labels: User Team; Authentication; Choose a Resource; Urgent Computing Job Submission (Conventional Job Submission Parameters, Urgent Computing Parameters); SPRUCE Job Manager; Local Site Policies; Priority Job Queue; Supercomputer Resource]

SLIDE 12

SPRUCE Architecture Overview (3/3): Analyzing Urgent Jobs

[Diagram, steps 6-7. Labels: Supercomputer Resource; Domain Specialist Interpreter; Results; Decision Maker]

SLIDE 13

Student fun with AJAX…

SLIDE 14
SLIDE 15

Site-Local Response Policies: How will Urgent Computing be treated?

  • “Next-to-run” status for priority queue; wait for running jobs to complete
  • Force checkpoint of existing jobs; run urgent job
  • Suspend current job in memory (kill -STOP); run urgent job
  • Kill all jobs immediately; run urgent job
  • Provide differentiated CPU accounting
      - “jobs that can be killed because they maintain their own checkpoints will be charged 20% less”
  • Other incentives
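
The response options above could be modeled as a small policy table. This is a minimal sketch, not SPRUCE code: the `ResponsePolicy` names, the `charge` function, and the per-CPU-hour `rate` are hypothetical. The 20% discount is taken from the quoted incentive, and `signal.SIGSTOP` / `signal.SIGCONT` are the POSIX signals behind `kill -STOP` and `kill -CONT`.

```python
import os
import signal
from enum import Enum


class ResponsePolicy(Enum):
    """Hypothetical names for the site-local options listed on the slide."""
    NEXT_TO_RUN = "next-to-run"        # wait for running jobs to complete
    CHECKPOINT = "force-checkpoint"    # checkpoint existing jobs, then run
    SUSPEND = "suspend-in-memory"      # kill -STOP, then run
    KILL = "kill-immediately"          # kill all jobs, then run


def suspend(pid: int) -> None:
    """Suspend a process in memory; same effect as `kill -STOP <pid>` (POSIX only)."""
    os.kill(pid, signal.SIGSTOP)


def resume(pid: int) -> None:
    """Resume a suspended process; same effect as `kill -CONT <pid>` (POSIX only)."""
    os.kill(pid, signal.SIGCONT)


def charge(cpu_hours: float, rate: float, self_checkpointing: bool) -> float:
    """Differentiated accounting: killable, self-checkpointing jobs pay 20% less."""
    discount = 0.20 if self_checkpointing else 0.0
    return cpu_hours * rate * (1.0 - discount)
```

A site would pick one `ResponsePolicy` per resource; `charge` only illustrates how the accounting incentive might be applied after the fact.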

SLIDE 16

Emergency Preparedness Testing: “Warm Standby”

  • In an urgent computing situation, there is no time to port applications
      - Applications must be in “warm standby”
      - Verification and validation runs test readiness periodically (Inca)
      - Only verified apps participate in urgent computing
  • Grid-wide Information Catalog
      - Application was last tested & validated on <date>
      - Also provides key success/failure history logs
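
The warm-standby rule above amounts to a freshness check against the catalog. The sketch below assumes a hypothetical `CatalogEntry` record and an assumed 30-day maximum validation age; neither the field names nor the threshold come from SPRUCE or Inca, and the entries are made-up sample data.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class CatalogEntry:
    """Hypothetical record in a grid-wide information catalog."""
    app: str
    platform: str
    last_validated: date
    last_run_passed: bool


def in_warm_standby(entry: CatalogEntry, today: date, max_age_days: int = 30) -> bool:
    """An app qualifies only if its last validation passed and is recent enough.

    The 30-day default is an assumed threshold, not a SPRUCE setting.
    """
    age = today - entry.last_validated
    return entry.last_run_passed and age <= timedelta(days=max_age_days)


today = date(2006, 10, 1)
entries = [
    CatalogEntry("tornado", "site-a", date(2006, 9, 23), True),       # 8 days old
    CatalogEntry("city-airflow", "site-b", date(2006, 8, 17), True),  # 45 days old
    CatalogEntry("influenza", "site-a", date(2006, 9, 20), False),    # failed last run
]
eligible = [e.app for e in entries if in_warm_standby(e, today)]
print(eligible)  # prints ['tornado']
```

Only the recently validated, passing application remains eligible; stale or failing entries would need a fresh validation run before an urgent request could use them.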

SLIDE 17

Choosing a Resource: An Advisor

[Diagram labels: User Team; Urgent Computation Request; Deadline; Urgency Level; MDS4 Service; SPRUCE Data Advisor; Best HPC Resource]

  • Warm Standby Validation History
      - Columns: Platform, App., Validated, Reliability
      - Platforms: SDSC::Elimidata, NCSA::Cobalt, …
      - Apps and validation dates: City Airflow (45 days ago), Tornado (8 days ago), Influenza (30 days ago), City Airflow (14 days ago)
      - Reliability values: 59%, 78%, 98%, 95%
  • Site Policies
      - Columns: Platform, Policy
      - Platforms: NCSA::Cobalt, PSC::Rachel, SDSC::Elimidata, SDSC::Datastar, …
      - Policies: “Human-in-the-loop, immediate access, kill existing jobs, 15 min. turnaround”; “Automated, next job”; “Automated, immediate access, kill existing jobs, 10 min. turnaround”; “Normal priority, no SPRUCE support”
  • Live Job/Queue Data
      - Platforms: NCSA::Cobalt, PSC::Rachel, SDSC::Datastar, …
      - Values: Immediate; Immediate; Next Available Job (Policy Based); (5.3 hrs, 1024 nodes)
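
One way an advisor could combine validation history, site policy, and live queue data is to filter out resources that cannot meet the deadline and then rank the rest. The sketch below is an assumption about how such a ranking might work, not the SPRUCE advisor itself: `Candidate`, `pick_resource`, the 30-day validation limit, and all sample values are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    """One platform as seen by a hypothetical advisor; all values are made up."""
    platform: str
    reliability: float           # fraction of validation runs that succeeded
    days_since_validated: int    # from warm-standby history
    est_wait_minutes: float      # from site policy plus live queue data


def pick_resource(cands: List[Candidate], deadline_minutes: float,
                  run_minutes: float, max_validation_age: int = 30) -> Optional[Candidate]:
    """Return the most reliable candidate that can still meet the deadline."""
    feasible = [
        c for c in cands
        if c.days_since_validated <= max_validation_age
        and c.est_wait_minutes + run_minutes <= deadline_minutes
    ]
    # Prefer reliability first, then the shorter expected wait.
    return max(feasible, key=lambda c: (c.reliability, -c.est_wait_minutes), default=None)


candidates = [
    Candidate("site-a", 0.95, 14, 10.0),
    Candidate("site-b", 0.98, 30, 240.0),   # reliable, but the queue is long
    Candidate("site-c", 0.59, 45, 0.0),     # stale validation
]
best = pick_resource(candidates, deadline_minutes=120.0, run_minutes=60.0)
print(best.platform if best else "no feasible resource")  # prints site-a
```

Filtering before ranking keeps the deadline a hard constraint: a highly reliable platform with a long queue is never chosen over one that can actually finish in time.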

SLIDE 18

Deployment Status

  • Deployed and available:
      - UC/ANL
      - Purdue
      - TACC
      - SDSC
  • Very close:
      - Indiana
      - LSU
  • Ready to integrate LEAD into SPRUCE
      - First user-customer
      - Warm standby apps

SLIDE 19

What About “Capacity” Computing?

  • SPRUCE works well with “capability” computing:
      - Interface to small set of large resources
  • Imagine a larger set of smaller resources?
      - Condor management?
      - Real on-demand servers?
  • Amazon S3 & EC2

SLIDE 20

Amazon S3 & EC2

It’s a Web Services World

  • S3: Simple Storage Service
      - Cost: $0.20/GB transfer, $0.15/GB-month
  • EC2: Elastic Compute Cloud
      - Cost: $0.10/cpu-hr, $0.20/GB transfer
      - No cost for internal bandwidth
  • Cost is extraordinarily good
  • Commoditization is good!!
  • The real keys are reliability and dynamic behavior
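
To make the pricing concrete, the quoted 2006 rates can be plugged into a simple estimate. This is a rough sketch: the `estimate_cost` function and the example workload are hypothetical, and only the three rates come from the slide.

```python
# Prices quoted on the slide (2006), in US dollars.
EC2_PER_CPU_HOUR = 0.10
TRANSFER_PER_GB = 0.20
S3_PER_GB_MONTH = 0.15


def estimate_cost(cpus: int, hours: float, transfer_gb: float,
                  stored_gb: float = 0.0, months: float = 0.0) -> float:
    """Estimate one run: compute, external transfer, and optional S3 storage."""
    compute = cpus * hours * EC2_PER_CPU_HOUR
    transfer = transfer_gb * TRANSFER_PER_GB
    storage = stored_gb * months * S3_PER_GB_MONTH
    return compute + transfer + storage


# Hypothetical urgent run: 1000 CPUs for 2 hours, 50 GB moved in and out,
# and 100 GB of results kept in S3 for one month.
cost = estimate_cost(cpus=1000, hours=2, transfer_gb=50, stored_gb=100, months=1)
print(f"${cost:.2f}")  # prints $225.00
```

In this example compute accounts for $200 of the $225, so at these rates CPU time dominates unless the data volumes are very large.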

SLIDE 21
SLIDE 22

Imagine…

  • Other companies catching up…
  • Commoditization (like web email)
  • A standardized interface to web-service “request vm”
  • Dynamic capacity provides availability of 250K “node instances”
  • Urgent computing resources available immediately
  • Missing bisection bandwidth, but great for capacity computing

SLIDE 23

The Future

  • Web services interfaces to all the portal functions
  • Extended submission schema
  • Flexible tokens - aggregation, extension
  • Encode local site policies
  • Warm standby integration
  • Automated ‘advisor’
  • Data movement
  • Redundancy to avoid downtime of portal

SLIDE 24

Questions? Ready to Join?

spruce@ci.uchicago.edu
beckman@mcs.anl.gov
http://spruce.teragrid.org