Grid-3 and the Open Science Grid in the U.S.
Lothar A T Bauerdick, Fermilab
International Symposium on Grid Computing ISGC 2004
Academia Sinica, Taipei, Taiwan
July 27, 2004
Science drivers for U.S. Physics Grid Projects:
iVDGL, GriPhyN, and PPDG ("Trillium")
ATLAS & CMS experiments @ CERN LHC: 100s of Petabytes, 2007 - ?
High Energy & Nuclear Physics experiments: ~1 Petabyte (1000 TB), 1997 - present
LIGO (gravitational wave search): 100s of Terabytes, 2002 - present
Sloan Digital Sky Survey: 10s of Terabytes, 2001 - present
[Chart: projected data growth and community growth, 2001 through 2009]
Future Grid resources: massive CPU (PetaOps), large distributed datasets (>100 PB), global communities (1000s)
Sharing and federating vast Grid resources
Grid-enabled pulsar search (LIGO)
Goal: implement a production-level blind galactic-plane search for gravitational-wave pulsar signals
Run for 30 days on ~5-10x more resources than LIGO owns, using the grid (e.g., 10,000 CPUs for 1 month); millions of individual jobs
Planning by GriPhyN Chimera/Pegasus; execution by Condor DAGMan; file cataloging by Globus RLS; metadata by Globus MCS
Achieved: access to ~6,000 CPUs for 1 week, with ~5% utilization due to bottlenecks
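To illustrate the execution layer, here is a minimal sketch of the kind of DAG description that Condor DAGMan consumes; the node names and submit-file names are hypothetical placeholders, not the actual LIGO workflow (which Chimera/Pegasus would plan automatically):

    # search.dag -- hypothetical fragment; a real planned workflow has thousands of nodes
    # extract: stage and prepare one frequency band
    JOB extract  extract_band.sub
    # search: run the pulsar-search code on that band
    JOB search   search_band.sub
    # register: catalog the output files (e.g. into RLS)
    JOB register register_out.sub
    PARENT extract CHILD search
    PARENT search  CHILD register

Handing such a file to DAGMan with condor_submit_dag search.dag enforces the ordering and retries failed nodes.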
Galaxy cluster finding (SDSS): redshift analysis, weak-lensing effects
Using GriPhyN Chimera and Pegasus
Coarse-grained DAGs work fine (batch system); fine-grained DAGs have scaling issues (virtual data system)
LHC: energy-frontier, high-luminosity p-p collider at CERN
an order-of-magnitude step in energy and luminosity for particle physics
[Figure: constituent center-of-mass energy (GeV) vs. year of first physics for hadron colliders (SppS, Tevatron, LHC) and e+e- colliders (SPEAR, PETRA, LEP 1, LEP 2, at CERN, Fermilab, DESY, Stanford), annotated with discoveries: 1974 J/psi, 1975 tau, 1979 gluon, 1983 W/Z, 1989 three families, 1995 top quark]
LHC first to put “real”, multi-organizational, global Grids to work
large resources become available to experiments “opportunistically”
In 2003, U.S. science projects and Grid projects came together to build a multi-organizational infrastructure: Grid3
bringing together a virtual data grid laboratory, virtual data research, end-to-end HENP applications, the US LHC projects, and their testbeds and data challenges
with participants and contributors including RHIC, Tevatron, BaBar, U. Buffalo, BTeV, VDT, Korea, and CMS
A common Grid operating as a coherent, loosely coupled infrastructure
Applications running on Grid3 (Trillium, U.S. LHC) and benefiting: LHC (3), SDSS (2), LIGO (1), Biology (2), Computer Science (3)
25 universities, 4 national labs, 2800 CPUs (status as of July 26, 2004, 11:35pm CDT)
Example: the U.S. CMS Data Challenge simulation production has been running on Grid3 since Nov 2003; in the first quarter of 2004 it profited from at least 40% non-CMS resources
Tier2 facilities are logically grouped around their Tier1 regional center
20 - 40% of a Tier1? "1-2 FTE support": commodity CPU & disk, no hierarchical storage
Essential university role in the extended computing infrastructure, validated by 3 years of experience with proto-Tier2 sites
Specific functions for science collaborations: physics analysis, simulation, experiment software, support for smaller institutions
Official role in the Grid hierarchy (U.S.): sanctioned by MOU (ATLAS, CMS, LIGO), local P.I. with reporting responsibilities, selection by the collaboration via a careful process
Grid environment built from core Globus and Condor middleware, as delivered through the Virtual Data Toolkit (VDT)
GRAM, GridFTP, MDS, RLS, VDS, VOMS, ...
VDT sponsored through GriPhyN and iVDGL, with contributions from LCG
...equipped with VO and multi-VO security, monitoring, and operations services
...allowing federation with other Grids where possible, e.g. the CERN LHC Computing Grid (LCG):
U.S. ATLAS: GriPhyN Virtual Data System execution on LCG sites
U.S. CMS: storage element interoperability (SRM/dCache)
Simple approach:
Sites consist of a computing element (CE), a storage element (SE), and information and monitoring services
VO-level and multi-VO services: VO information services, operations (iGOC)
Minimal use of grid-wide systems: no centralized resource broker, replica/metadata catalogs, or command-line interface; these are to be provided by the individual VOs
Application driven: adapt applications to work with Grid3 services, prove applications on VO testbeds
The Grid3 environment consists of a "loosely coupled" set of services (see the sketch after this list)
Processing service
Globus GRAM, with a bridge from Condor-G for central submission; four separate queueing systems are supported
Data transfer services
GridFTP interfaces on all sites through gateway systems; files are transferred into processing sites, and results are transferred directly into the MSS GridFTP door; CMS has moved to SRM-based storage element functionality
VO management services
A central service is needed for authentication: VOMS
Monitoring services
System- and application-level monitoring allows status verification and diagnosis
Software distribution services
Lightweight, based on Pacman
Information services
To help applications and monitoring, based on MDS
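To make the processing and data-transfer services concrete, a minimal sketch of how a VO member might submit one job to a Grid3 site with Condor-G and pull the output back with GridFTP; the host names, file paths, and VO name below are hypothetical placeholders, not actual Grid3 endpoints:

    # obtain VO-signed credentials (VOMS extension added to the Globus proxy)
    voms-proxy-init -voms uscms        # hypothetical VO name

    # Condor-G submit description: route the job to a site's GRAM gatekeeper
    cat > grid3_job.sub <<'EOF'
    universe        = globus
    globusscheduler = gatekeeper.site.example.edu/jobmanager-condor
    executable      = run_simulation.sh
    output          = job.out
    error           = job.err
    log             = job.log
    queue
    EOF
    condor_submit grid3_job.sub

    # after the job finishes, move the result with GridFTP
    globus-url-copy \
        gsiftp://gridftp.site.example.edu/data/results/result.root \
        file:///localdisk/results/result.root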
The goal is to install and configure a site with minimal human intervention
Use the Pacman tool and distributed software "caches"
Installation registers the site with VO- and Grid3-level services and sets up accounts, application install areas, and working directories
A Grid3 site (compute element plus storage) is installed with:
    pacman -get iVDGL:Grid3
which pulls in the VDT, VO services, GIIS registration, information providers, the Grid3 schema, log management, and the $app and $tmp areas
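A rough sketch of what that install might look like from the administrator's side, assuming Pacman is already installed and in the PATH and assuming the usual VDT convention of a generated setup script (the install directory is a hypothetical choice):

    # hypothetical install location for the Grid3 software stack
    mkdir -p /opt/grid3 && cd /opt/grid3

    # fetch and install the Grid3 stack from the iVDGL Pacman cache (command from the slide)
    pacman -get iVDGL:Grid3

    # pick up the environment of the freshly installed stack
    source ./setup.sh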
"What are the services needed to enable application VOs?" "What do providers need in order to provide resources to their VOs?"
Lightweight-ness at the cost of centrally provided functionality; an example of this approach: the flexible VO security infrastructure
DOEGrids Certificate Authority; PPDG and iVDGL Registration Authorities, with VO or site sponsorship
Automated multi-VO authorization, using the EDG-developed VOMS
Each VO manages a service and its members
Each Grid3 site can generate and locally adjust its gridmap file with an authenticated query to each VO's VOMS service (see the sketch after this list)
VOs negotiate policies & priorities directly with the provider; VOs can run their own storage services
U.S. CMS sites run SRM/dCache storage services at the Tier-1 and Tier-2s
VOMS servers are run for iVDGL, US CMS, US ATLAS, LSC, SDSS, and BTeV; Grid3 sites build their gridmap files from them
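For illustration, the grid-mapfile a site ends up with is simply a list mapping certificate subjects to local accounts; the DNs and account names below are made up. A tool in the spirit of EDG's edg-mkgridmap (or the Grid3 equivalent) regenerates it from authenticated queries to the VOMS servers:

    # /etc/grid-security/grid-mapfile -- generated, then locally adjustable by the site
    "/DC=org/DC=doegrids/OU=People/CN=Jane Doe 123456"  uscms01
    "/DC=org/DC=doegrids/OU=People/CN=John Roe 654321"  usatlas02

    # periodic regeneration from the VO membership services (hedged example invocation)
    edg-mkgridmap --output /etc/grid-security/grid-mapfile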
[Diagram: MDS hierarchy -- site GRISes (e.g., Boston U, Chicago, ANL, BNL under the ATLAS GIIS; UFL, FNAL, RiceU, CalTech under the USCMS GIIS) register with their VO index services (6), which register with the top-level Grid3 Index Service (GIIS); sites publish Grid3 attributes such as location, data directory, applications directory, and temporary directory]
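As an illustration of how a client could pull these attributes, a hedged sketch of an MDS query: MDS2 is an LDAP service, so a standard LDAP client works; the host name is a placeholder and the exact base DN depends on the deployment (Globus also ships a grid-info-search wrapper for the same purpose):

    # query the Grid3 information index over LDAP (MDS2 commonly listens on port 2135)
    ldapsearch -x -H ldap://giis.grid3.example.org:2135 \
        -b "mds-vo-name=grid3,o=grid" "(objectclass=*)"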
[Diagram: Grid3 monitoring architecture -- producers (OS syscalls and /proc, GRIS, job manager, log files, system configuration) feed intermediaries (VO GIIS, GIIS, MonALISA, MonALISA repository, Site Catalog, Ganglia, ACDC JobDB), which serve consumers and client tools (MonALISA client, MDViewer, information-service clients, web reports)]
Grid Operations Center (iGOC): co-located with the Abilene NOC, hosted by Indiana University
Hosts/manages multi-VO services: top-level Ganglia collectors, MonALISA web server and archival service, top-level GIIS, VOMS servers for iVDGL, BTeV, and SDSS, the Site Catalog service, iVDGL Pacman caches
Trouble ticket systems: phone (24 hr), web- and email-based collection and reporting
Investigation and resolution of grid middleware problems
Weekly operations meetings for troubleshooting
Rule: install applications through grid jobs; encourage self-contained environments
11 applications:
LHC (ATLAS, CMS); astrophysics (LIGO/SDSS)
Biochemical: molecular X-ray diffraction; GADU/Gnare, which compares protein sequences
Computer science: adaptive data placement and scheduling algorithms; Grid Exerciser
Over 100 users authorized to run on Grid3
Managed to add new sites and new applications: U. Buffalo (Biology) Nov. 2003, Rice U. (CMS Heavy Ions) Feb. 2004
Grid2003 project milestone metrics (Nov 2003, demonstrated at SC2003):

Metric                                    Target     Achieved
Number of CPUs                            400        2762 (27 sites)
Number of users                           > 10       102 (16)
Number of applications                    > 4        10 (+CS)
Number of sites running concurrent apps   > 10       17
Peak number of concurrent jobs            1000       1100
Data transfer per day                     > 2-3 TB   4.4 TB (11.12.03)
~40,000 jobs/month sustained running
Strategy of interoperability and joint projects with LCG
Particular to Grid2003:
Common Virtual Data Toolkit (VDT) delivery and support team
Data movement and storage management: U.S. CMS demonstration between Grid2003 and CERN
Job execution: U.S. ATLAS Grid3 application submission to LCG sites using the Chimera Virtual Data System (VDS)
Merge of information attributes (GLUE Schema extensions) from Grid3 and LCG
Other collaborative efforts:
Virtual Organization Management Project (VOX) collaboration with the European Data Grid and the LCG Security Working Group
Contributions to and from the wider CMS and ATLAS software and computing deliverables
Presentations at and discussions with LCG committees
Participation in the High Energy Physics Joint Technical Board and the Global Grid Forum Particle and Nuclear Physics Applications Research Group
Grid3 was an evolutionary change in environment: the infrastructure was built on established testbed efforts from the participating experiments
However, it was a revolutionary change in scale: the available computing and storage resources increased by a factor of 4-5, and the amount of human interaction in the project increased
Architecture and operations lessons were captured in a "Lessons learned" document, e.g. scalability issues with head nodes, certificate infrastructure, etc.
Grid3 has been in continuous use since 10/03 and is now running ATLAS DC2; it undergoes adiabatic upgrades while operating (Grid3+); next step: OSG-0!
Iteratively build and extend Grid3 into a national infrastructure of shared resources, benefiting a broad set of scientific applications
➡ Open Science Grid: build the OSG by contributing our LHC resources into a coherent infrastructure to provide the initial federation
US LHC Tier-1 and Tier-2 centers: a significantly sized infrastructure!
Laboratories, universities, campus grids etc. participate in the OSG; examples at UW Madison, U. Florida, Fermilab, Purdue, and many others: significantly sized shared "Grid Farms" and storage facilities
Goal: develop the OSG such that it attracts and supports partnerships with and contributions from other sciences; build it generic enough, end-to-end, to benefit others
Joining our U.S. forces into the Open Science Grid consortium:
an inclusive collaboration between application scientists, technology providers, resource owners, ...
recognizing the magnitude of the task and the need for ongoing support and effort
Provide a lightweight framework for joint activities: coordinating activities that cannot be done within one project, achieving technical coherence and a strategic vision (roadmap), enabling communication and providing liaising
... and for reaching common decisions when necessary
Technical Groups propose, organize, and oversee activities, peer, and liaise: now Security and Storage Services; soon Support Centers, Policies
Activities are well-defined, scoped sets of contributing tasks, provided by the participants joining the activity
OSG Process Framework
[Diagram: Consortium Board (1), Technical Groups 0...n (small), activities 0...N (large), joint committees (0...N, small); participants -- research and Grid projects, VO organizations, researchers, sites, service providers, campuses, and labs -- provide resources, management, and project steering groups]
Facilities: laboratory and university resources (storage, processing, databases, ...)
Production Grid infrastructure: grid fabric, connectivity and middleware; system monitoring, information, software repositories, support centers, ...
Science communities: applications and analysis
Sustained, persistent production Grid infrastructure
Continuous need for a Grid Lab for development and integration; the U.S. CMS/ATLAS testbeds were instrumental in developing Grid3 and the OSG!
Important role for iVDGL to deliver and maintain this lab environment for U.S. Grids!
Develop OSG-0 by increasing functionality and scale:
services for persistent and temporary storage, data transfer services; policy, accounting, authorization, security, and configuration infrastructure
OSG-0 milestone planned for Feb 2005
[Diagram: the LHC tiered computing model -- CERN at the center; Tier-1 centers in Germany, the UK, France, Italy, NL, and the USA (Fermilab, Brookhaven); Tier-2 centers, labs, universities, and physics departments below; the Open Science Grid federates the U.S. Tier-2s and university sites]
Open Science Grid: Universities, Labs, LHC Tier-1, Tier-2, ...
Complex matrix of partnerships: opportunities!
NSF and DOE; universities and labs; physics and computing departments
Application sciences, computer sciences, information technologies; science and education
U.S., Europe, Asia, North and South America
U.S. LHC and Trillium Grid projects; middleware and infrastructure projects
CERN and U.S. HEP labs; LCG and the experiments; EGEE and OSG
~85 participants on the OSG stakeholder mail list; ~50 authors on the OSG CHEP abstract
Joint Steering Meetings (Trillium, U.S. LHC) to define the OSG program; OSG Blueprint meetings July 12-15 and Sept 7-8; OSG workshop hosted by U.S. ATLAS in Boston, Sept 9-10
The success of Grid3 has demonstrated that:
Grids bring real value to science experiments
the U.S. can use its Tier-1 & Tier-2 facilities in a common Grid environment
we can operate and use a multi-organization Grid with distributed ...
a Grid with >20 sites can run robustly and performantly for simple applications
the strategy to federate & share resources is feasible
We are making concrete workplans to:
operate the current infrastructure for the LHC data challenges
evolve to increased capability and performance for the start of 2005
start longer-term engineering of services and capabilities
continue interoperability and joint projects with the LCG
Acknowledging the excellent work of the Grid3 teams:
participation was at the level of 58 people: 8 worked full time, 10 worked half time, and 20 site administrators worked quarter time; total effort ~7 FTE-years (17 FTE for 5 months)
Special thanks for help on this talk with slides, graphics, ideas etc. to Ruth Pordes/Fermilab, Rob Gardner/U. Chicago, and the Grid3 Taskforce leads Ian Fisk/Fermilab, J. Rodriguez/U. Florida, Leigh Grundhoefer/U. Indiana
Rob Gardner Mike Wilde Ian Fisk Scott Koranda Rich Baker Jim Annis Peter Couvares Jorge Rodriguez Leigh Grundhoeffer Nickolai Kuropatine Xin Zhao John Hicks Ed May Alain Roy Brian Moe Fred Luerhing Iowna Sakrejda Yuri Smirnov Marco Mambelli Anzar Afaq Suresh Singh Carey Kireyev Alain DeSmet Jerry Gieraltowski Doug Olson Brian Tierney Saul Youssef Anne Heavey Terrence Martin Andrew Zahn Scott Gose Vijay Sekhri Dantong Yu Lawrence Sorrillo Yong Xia Rob Quick Michael Ernst Greg Graham Bobby Brown Bockjoo Kim Jens Voekler Ruth Pordes Matt Allen Yujun Wu Lisa Giacchetti Joe Kaiser Erik Paulson George Fekete Dan Engh Kihyeon Cho James Letts Tim Thomas John Weigand Iosif Legrand Mark Green Craig Prescott Nosa Olomu Ben Clifford Dan Bradley Timur Perelmutov Patrick McGuigan Shawn McKee Guarang Mehta