

SLIDE 1

Mark Bartelt

Center for Advanced Computing Research
California Institute of Technology
mark@cacr.caltech.edu
http://www.cacr.caltech.edu/~mark

SLIDE 2

Grid Computing: Hype? Or Buzzword?

SLIDE 3

  • History
  • Current Status
  • Future Directions

SLIDE 4

History

  • PACI (Partnerships for Advanced Computational Infrastructure)
  • TCS (Terascale Computing System)

  • DTF (Distributed Terascale Facility)
  • ETF (Extended Terascale Facility)
SLIDE 5

PACI Program: NPACI (National Partnership for Advanced Computational Infrastructure)

  • San Diego Supercomputer Center (SDSC)

  • University of Texas
  • University of Michigan
  • Caltech
  • (others …)

SLIDE 6

PACI Program: The Alliance (National Computational Science Alliance)

  • National Center for Supercomputing Applications (NCSA)

  • Argonne National Laboratory
  • University of Wisconsin
  • Boston University
  • University of Tennessee, Knoxville
  • University of Kentucky
  • Caltech [recently]
  • (many, many others …)

SLIDE 7

TCS (Terascale Computing System)

  • At Pittsburgh Supercomputing Center
  • Funded in 2000
  • Fully deployed in 2001
  • 6 Tflop system (750 quad-processor Alpha nodes; see the quick check below)
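A quick back-of-the-envelope check on that 6 Tflop figure. The clock rate and flops-per-cycle numbers below are assumptions typical of the EV68-based TCS nodes, not values from the slide, so this is only an illustrative sketch:

    # Rough peak-performance check for the TCS (assumed figures, see note above).
    nodes = 750               # quad-processor Alpha nodes, from the slide
    cpus_per_node = 4
    clock_ghz = 1.0           # assumed EV68 clock rate
    flops_per_cycle = 2       # assumed floating-point ops per cycle per CPU
    peak_tflops = nodes * cpus_per_node * clock_ghz * flops_per_cycle / 1000
    print(f"peak ≈ {peak_tflops:.1f} Tflop/s")   # prints: peak ≈ 6.0 Tflop/s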

SLIDE 8

Distributed TeraScale Facility (DTF)

  • Proposal submitted April 2001
  • Three-year program
  • Four DTF partners:

– NCSA
– SDSC
– Argonne National Laboratory (ANL)
– Caltech

SLIDE 9

DTF TeraGrid

  • IA64-based Linux clusters at four sites
  • Myrinet for intra-cluster connections
  • High-bandwidth inter-site interconnect (10 Gbit between every pair of sites)

  • Lots of storage
  • Grid services based on Globus
SLIDE 10

Future TeraGrid Authentication Mechanism?

SLIDE 11

Goals (Measures of Success)

  • New Science

– Provide New Capabilities through:

  • Site capabilities that are more powerful than existing PACI resources
  • Combine site resources into a coordinated system

– To enable:

  • Existing PACI users to deepen their science
  • New users, with problems that are not feasible with today’s PACI resources and that require a grid

  • Build an Extensible Grid

– Design principles assume heterogeneity and > 4 sites

  • A Grid hierarchy similar to Internet hierarchy

– multiple grid types, with a small number of “tightly coupled” grids and a large number of “loosely coupled” ones

  • Can be grown, can be replicated, and multiple copies can be combined

– Formally documented design: protocols and specifications

  • “Implement this protocol” rather than “Install this magic software”
  • Leverage Global Grid Forum for technical input and dissemination
  • Provide a Pathway for Current Users

– Support evolutionary path

  • migration to Linux clusters; a simple “distributed machine room” model

– Provide examples, tools, training to exploit grid capabilities
– User support, user support, and user support

SLIDE 12

DTF TeraGrid: Goals

  • Free computational scientists from the “tyranny of distance”

  • Seed future cyberinfrastructure
SLIDE 13

The Arpanet (1969)

SLIDE 14

Arpanet (1971)

SLIDE 15

Arpanet (1986)

SLIDE 16

The Internet (1999)

SLIDE 17

So … What was planned?

  • IBM Linux clusters

– open source software and community

  • Intel/HP Itanium Processor Family™ nodes

– “McKinley” processors for commodity leverage

  • Very high-speed network backbone

– bandwidth for rich interaction and tight coupling

  • Large-scale storage systems

– hundreds of terabytes for secondary storage

  • Grid middleware

– Globus, data management, …

  • Next-generation applications

– breakthrough versions of today’s applications
– But also, reaching beyond “traditional” supercomputing

SLIDE 18

DTF Network Topology

  • Full N-way mesh
  • OC192 links between each pair of sites

SLIDE 19

The TeraGrid Backbone

SLIDE 20

So … What was planned?

[Diagram: pre-TeraGrid resources at the four DTF sites (NCSA: compute-intensive; ANL: visualization; Caltech: data collection analysis; SDSC: data-intensive), including the 1176p IBM SP Blue Horizon, a Sun E10K, a 1500p Origin, 1024p IA-32 and 320p IA-64 clusters, the 574p IA-32 Chiba City cluster, a 128p Origin with high-resolution display & VR facilities, a 256p HP X-Class, a 128p HP V2500, a 92p IA-32 cluster, HPSS and UniTree archival storage, and Myrinet interconnects at each site.]

SLIDE 21

Extended Terascale Facility (ETF)

  • Proposal submitted June 2002
  • New partner (PSC)
  • Revised network topology
  • Heterogeneity

– Alpha-based cluster at PSC
– Power4-based cluster at SDSC

SLIDE 22

ETF Network Topology

  • Major hubs in Los Angeles and Chicago
  • 40 Gbit (4 x OC192) connection between hubs
  • 3 x OC192 from each DTF site to nearest hub
  • Facilitates addition of new sites (see the sketch below)
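A small illustrative sketch of the link-count arithmetic behind this change; the helper functions and the rounded OC-192 rate are assumptions for illustration, not part of the TeraGrid design documents:

    # Why hub-and-spoke extends more easily than the original full mesh.
    OC192_GBITS = 10          # OC-192 is ~9.95 Gbit/s; rounded to 10 here

    def mesh_links(sites: int) -> int:
        """Full N-way mesh: one dedicated circuit per pair of sites."""
        return sites * (sites - 1) // 2

    def hub_links(sites: int, hubs: int = 2) -> int:
        """Hub-and-spoke: one uplink per site plus the inter-hub trunk."""
        return sites + hubs * (hubs - 1) // 2

    for n in (4, 5, 9):
        print(f"{n} sites: {mesh_links(n)} mesh circuits vs {hub_links(n)} hub circuits")
    # 4 sites: 6 mesh circuits vs 5 hub circuits
    # 9 sites: 36 mesh circuits vs 10 hub circuits

    # ETF capacities: 3 x OC-192 per site uplink, 4 x OC-192 between the hubs.
    print(f"site uplink ≈ {3 * OC192_GBITS} Gbit/s, inter-hub trunk ≈ {4 * OC192_GBITS} Gbit/s")

With hubs, adding a site means provisioning one new uplink to the nearest hub rather than a new circuit to every existing site.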
SLIDE 23

ETF TeraGrid

[Diagram: ETF TeraGrid configuration. Itanium2/Madison Linux clusters on Myrinet at SDSC, NCSA, Caltech, and Argonne (roughly 0.5 TF to 9.2 TF each), a 7.8 TF Power4 system on a Federation switch, an IA-32 Datawulf, and Sun data servers; the 6 TF Alpha EV68 and 1.1 TF Alpha EV7 systems at PSC on Quadrics; per-site disk from 20 TB to 300 TB on Fibre Channel; all attached to the Chicago & LA DTF core switch/routers.]

SLIDE 24

Nostradamus Speaks …

  • The technical challenges will be difficult.

SLIDE 25

Nostradamus Speaks …

  • The technical challenges will be difficult.
  • But the sociopolitical issues will be at least as challenging.

SLIDE 26

How does it all work?

  • Many “working groups”

– Networking
– Clusters
– Performance evaluation
– Etc., etc., etc. …

SLIDE 27

TeraGrid Management

Project Director: Rick Stevens (UC/ANL)
Chief Architect: Dan Reed (NCSA)
Executive Director / Project Manager: Charlie Catlett (UC/ANL)

Executive Committee: Fran Berman, SDSC (Chair); Ian Foster, UC/ANL; Paul Messina, CIT; Dan Reed, NCSA; Rick Stevens, UC/ANL; Charlie Catlett, ANL

Institutional Oversight Committee: Robert Conn, UCSD; Richard Herman, UIUC; Dan Meiron, CIT (Chair); Robert Zimmer, UC/ANL

External Advisory Committee: Are we enabling new science? Are we pioneering the future?

User Advisory Committee (currently being formed): Are we effectively supporting good science? Drawn from the NSF MRE projects, Internet-2 (McRobbie), the Alliance UAC (Sugar, Chair), and the NPACI UAC (Kupperman, Chair)

Technical Working Group: Are we creating an extensible cyberinfrastructure?

Policy oversight: NSF ACIR and NSF review panels

Site Coordination Committee (Site Leads): ANL: Evard; CIT: Bartelt; NCSA: Pennington; SDSC: Andrews; plus PSC and NCAR

Technical Coordination Committee (Project-wide Technical Area Leads):

– Clusters: Pennington (NCSA)
– Networking: Winkler (ANL)
– Grid Software: Kesselman (ISI), Butler (NCSA)
– Data: Baru (SDSC)
– Applications: Williams (Caltech)
– Visualization: Papka (ANL)
– Performance Eval: Brunett (Caltech)
– Operations: Sherwin (SDSC)
– User Services: Wilkins-Diehr (SDSC), Towns (NCSA)

(Chart layers: Policy Oversight, Objectives, Architecture, Implementation)

SLIDE 28

How does it all work?

  • Every working group includes people from all TeraGrid sites.

SLIDE 29

How does it all work?

  • Every working group includes people from all TeraGrid sites.
  • How the heck do you coordinate all these people?

SLIDE 30

TeraGrid Management

[TeraGrid management chart repeated; see SLIDE 27 for the full breakdown.]

SLIDE 31

How does it all work?

  • Every working group includes people from all TeraGrid sites.
  • How the heck do you coordinate all these people?
  • We all seem to spend half our lives in conference calls, and the other half replying to e-mail.

SLIDE 32

How does it all work?

  • Every working group includes people from all TeraGrid sites.
  • How the heck do you coordinate all these people?
  • We all seem to spend half our lives in conference calls, and the other half replying to e-mail.
  • The “herding cats” analogy is apt.
SLIDE 33

TeraGrid Management

[TeraGrid management chart repeated; see SLIDE 27 for the full breakdown.]

SLIDE 34

Commercial Break

SLIDE 35

Potholes on the Road to Production

  • Original timetable:

– Sep 2002: Initial delivery of phase-1 systems
– Mar 2003: Friendly users
– July 2003: Production

  • Hmm, what if there are unexpected problems?

SLIDE 36

Problem-free? Hah!

  • Hardware delivery behind schedule.

– This should have come as a surprise?

SLIDE 37

Problem-free? Hah!

  • Hardware delivery behind schedule.

– This should have come as a surprise?

  • Gack! Numerical errors!

– Bug in floating-point software assist code.

SLIDE 38

Problem-free? Hah!

  • Hardware delivery behind schedule.

– This should have come as a surprise?

  • Gack! Numerical errors!

– Bug in floating-point software assist code.
– Kernel bug: floating-point registers sometimes not being saved/restored properly on context switch.

SLIDE 39

Problem-free? Hah!

  • Hardware delivery behind schedule.

– This should have come as a surprise?

  • Gack! Numerical errors!

– Bug in floating-point software assist code.
– Kernel bug: floating-point registers sometimes not being saved/restored properly on context switch.
– SUPERB support from IBM + Intel on first problem, and from IBM + SuSE on second.

SLIDE 40

Problem-free? Hah!

  • Breaking news; this just in …

– Problem with Itanium2 processor (hit the trade press just yesterday)
– {Intel|IBM} working on a solution

SLIDE 41

Unsolved (or partially-solved) problems

  • Metascheduling
  • Coordinated advance reservation
  • On-demand computing
  • Very large datasets

– Even at 10 Gbit/second, 100 TBytes takes approximately one day to move (see the arithmetic below)
– Possible solutions: DataCutter and similar tools
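The arithmetic behind that one-day estimate, as a minimal sketch; it assumes a single 10 Gbit/s path that can be kept completely full, which is optimistic:

    # Time to move 100 TB over a fully-utilized 10 Gbit/s link.
    dataset_bytes = 100e12        # 100 TBytes
    link_bits_per_second = 10e9   # 10 Gbit/s
    seconds = dataset_bytes * 8 / link_bits_per_second
    print(f"{seconds:.0f} s ≈ {seconds / 3600:.1f} hours")   # 80000 s ≈ 22.2 hours

Real transfers would be slower still once protocol overhead and disk rates are included, which is part of why filtering or processing data near its source (the DataCutter approach) looks attractive.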

SLIDE 42

High-Bandwidth TeraGrid Transport Mechanism

SLIDE 43

So … Where are we?

  • Target production date has slipped three months (from beginning of July to beginning of October).

SLIDE 44

So … Where are we?

  • Target production date has slipped three months (from beginning of July to beginning of October).
  • Everybody is overworked and stressed out.

SLIDE 45

So … Where are we?

  • Target production date has slipped three months (from beginning of July to beginning of October).
  • Everybody is overworked and stressed out.
  • But … We’re having LOADS of fun!

SLIDE 46

Let us close with a prayer …

SLIDE 47

Let us close with a prayer …

Anointed with oil,
On troubled waters,
Oh heavenly Grid,
help us bear up Thy standard,
Our chevron flashing bright
across the gulf of compromise.

SLIDE 48

Remember: TeraGrid is …

SLIDE 49

Questions?

http://www.teragrid.org