
The Pilot Way To Grid Resources Using GlideinWMS
Parag Mhashilkar, on behalf of Daniel C Bradley, Igor Sfiligoi, Parag Mhashilkar, Burt Holzman, Frank Würthwein, Sanjay Padhi
Affiliations: (1) Fermilab, Batavia, IL; (2) University of Wisconsin at Madison, WI; (3) University of California at San Diego, CA

Overview
- Grid Computing
- Pilot Workload Management (WMS) Paradigm
- Security Considerations
- GlideinWMS implementation of Pilot Paradigm
- Pseudo-interactive Monitoring using glideinWMS
- Scalability of glideinWMS
- Summary
- References
The pilot way to Grid resources using glideinWMS, March 31, 2009

Grid Computing
- Distributed computing paradigm spanning many administrative domains.
- Widely deployed in scientific communities with high computing demands
  - High Energy Physics (HEP)
  - Astrophysics communities
  - Weather Surveys
  - Biology
  - […]
- General purpose Grids used by the scientific communities
  - Open Science Grid (OSG)
  - Enabling Grids for E-sciencE (EGEE)
  - […]

Typical Grid Use Case
- Grid Site: An administrative domain
  - Administrators deploy Grid middleware with the following components:
    - Compute Element (CE) running a gatekeeper which executes jobs on behalf of users
    - Client tools on compute resources (or worker nodes) to talk to commonly used Grid services
    - Local Batch System (BS)
- From the user's perspective
  - Pros
    - Large pool of resources to satisfy their computing needs
  - Cons
    - Middleware problems in managing the job
    - Progress of the job is hidden from the user, which complicates monitoring
    - Heterogeneity of the resources over the Grid
- Need a meta WMS to manage Grid jobs
[Diagram labels: User, Computing jobs, Grid Site, CE running Gatekeeper, Batch System, Compute resource]

Pilot WMS Paradigm
- Pilot or just-in-time paradigm
  - Pilot factory submits pilot jobs to different grid sites
  - Pilots start running on the compute resources and fetch user jobs from the user job queue of the WMS
- Advantages of a pilot based WMS
  - Forms a virtual private pool of compute resources
  - Partially hides heterogeneity of grid sites from the user
  - Pilot jobs
    - Act as a wrapper and make sure that the environment is right for the user job to execute
    - If the environment is bad, the pilot exits, preventing the user job from starting and thus failing
[Diagram labels: User, Pilot WMS, Pilot jobs, Computing jobs, Grid Site, Gatekeeper, Batch System, Compute resource]
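The pilot behaviour described on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not glideinWMS code: `run_pilot`, `environment_ok`, and the in-memory `job_queue` are hypothetical names standing in for the real pilot, its validation checks, and the WMS user job queue.

```python
from collections import deque
from typing import Callable, Deque, List


def run_pilot(environment_ok: Callable[[], bool],
              job_queue: Deque[Callable[[], str]]) -> List[str]:
    """Validate the worker node, then pull and run user jobs until the queue is empty.

    Hypothetical sketch: a real pilot talks to a remote queue, not a local deque.
    """
    if not environment_ok():
        # Bad environment: exit before any user job starts, so none can fail here.
        return []
    results = []
    while job_queue:
        job = job_queue.popleft()  # fetch the next user job from the WMS queue
        results.append(job())      # run it inside the validated environment
    return results
```

Because the check runs before the first fetch, a pilot that lands on a broken node leaves the queue untouched and the jobs stay available for other pilots.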

Security Considerations
- Pilots are authenticated and authorized by the site gatekeeper.
- Concerns with pilot based WMS
  - User jobs do not traverse through the site gatekeeper:
    - Does not fit well with the Grid model of authenticating / authorizing / accounting of user jobs.
  - Since the pilot bootstraps the user job, both pilot and user job run under the same OS user
    - This allows a malicious user to compromise the pilot job infrastructure.
[Diagram labels: User, Pilot WMS, Pilot jobs, Computing jobs, Grid Site, Gatekeeper, Batch System, Compute resource]

Security Considerations
- Possible Solution
  - Deploy mini-gatekeepers on the worker nodes to authenticate/authorize user jobs.
  - OSG and EGEE sites deploy gLExec, which acts as a mini-gatekeeper on worker nodes to authenticate / authorize user jobs.
[Diagram labels: User, Pilot WMS, Pilot jobs, Computing jobs, Grid Site, Gatekeeper, Batch System, Compute resource]

Pseudo-interactive Monitoring in Pilot WMS
- Users need more info
  - When something goes wrong
  - In case of very long running jobs
- Information useful to the user
  - What processes are running (ps)
  - Peek at the log files (cat/tail)
  - What files have been created (ls)
  - Peek at the process stack (gdb bt)
  - Is the machine thrashing? (top)
- Above information can be obtained through batch jobs
- Pilot starts another job that acts as a monitoring job
[Diagram labels: User, Pilot WMS, Pilot jobs, Monitoring jobs, Computing jobs, Grid Site, Gatekeeper, Batch System, Compute resource]
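As a rough illustration of what such a monitoring job might run, the sketch below maps a few request kinds onto the commands named on the slide. The request names, the allowlist, and the helper functions are assumptions made for this example; they are not part of glideinWMS, and the `top -b -n 1` batch-mode flags assume a Linux procps `top`.

```python
import subprocess
from typing import List, Optional

# Hypothetical allowlist: request kind -> command from the slide (ps, ls, tail, top).
MONITOR_COMMANDS = {
    "processes": ["ps", "-ef"],
    "files": ["ls", "-l"],
    "log_tail": ["tail", "-n", "20"],
    "load": ["top", "-b", "-n", "1"],
}
# Only these kinds accept a path argument.
PATH_KINDS = {"files", "log_tail"}


def monitor_argv(kind: str, path: Optional[str] = None) -> List[str]:
    """Build the argument vector for one monitoring request, rejecting unknown kinds."""
    if kind not in MONITOR_COMMANDS:
        raise ValueError(f"unsupported monitoring request: {kind}")
    if path is not None and kind not in PATH_KINDS:
        raise ValueError(f"request {kind} does not take a path")
    argv = list(MONITOR_COMMANDS[kind])  # copy so the allowlist is never mutated
    if path is not None:
        argv.append(path)
    return argv


def run_monitor(kind: str, path: Optional[str] = None, timeout: float = 30.0) -> str:
    """Run one monitoring command next to the user job and return its standard output."""
    done = subprocess.run(monitor_argv(kind, path), capture_output=True,
                          text=True, timeout=timeout)
    return done.stdout
```

Restricting requests to a fixed allowlist keeps the monitoring job read-only, which matches the "peek" style of access the slide describes.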

glideinWMS Implementation of the Pilot Paradigm
- glideinWMS is based on Condor, with the VO Frontend and the Factory sending pilot jobs (i.e. glideins) to the grid sites.
- Condor as a user job WMS
- glideinWMS Factory
  - Glidein factories create and submit pilot jobs to the grid sites using Condor-G
  - Condor collector acts as a dashboard for message exchange
  - Factory receives orders from the VO frontend via the dashboard
- VO Frontend
  - VO frontend monitors the Condor WMS and regulates the number of pilot jobs sent by glidein factories via the dashboard
  - Frontend acts as a matchmaker for the glideins
- All network traffic is authenticated and integrity checked
- Supports pseudo-interactive monitoring out of the box
[Diagram labels: Negotiator, Collector, Schedd, Startd, VO Frontend, Dashboard, GFactory, Frontend, Condor-G, Pilot jobs, Computing jobs, Grid Site, Gatekeeper, Batch System, Compute resource]
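How a frontend might regulate the number of pilots can be sketched as a simple calculation. The slide does not state the actual policy, so the formula, the function name `glideins_to_request`, and all parameters below are assumptions chosen for illustration.

```python
def glideins_to_request(idle_jobs: int,
                        idle_glideins: int,
                        pending_glideins: int,
                        busy_glideins: int,
                        max_glideins: int) -> int:
    """Return how many additional glideins to ask the factory for (hypothetical policy).

    idle_jobs        -- user jobs waiting in the queue
    idle_glideins    -- glideins running but not yet claimed by a job
    pending_glideins -- glideins submitted to sites but not yet started
    busy_glideins    -- glideins currently running a user job
    max_glideins     -- cap on glideins the frontend is willing to keep in the system
    """
    # Jobs that neither an idle nor an already requested glidein can absorb.
    unmet = idle_jobs - idle_glideins - pending_glideins
    # Room left under the cap once every existing glidein is counted.
    headroom = max_glideins - busy_glideins - idle_glideins - pending_glideins
    return max(0, min(unmet, headroom))
```

For example, with 100 idle jobs, 10 idle glideins, 20 pending glideins, 50 busy glideins, and a cap of 1000, this sketch would request 70 more glideins.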

glideinWMS Implementation of the Pilot Paradigm
- glideinWMS is based on Condor, with the VO Frontend and the Factory sending pilot jobs (i.e. glideins) to the grid sites.
- Condor as a user job WMS
  - Condor collector acts as an information database
  - Condor startd manages the compute resource
  - Condor schedd acts as the job queue for user jobs
  - Startd and schedd advertise the resource and jobs respectively to the collector using Condor ClassAds
  - Condor negotiator acts as a matchmaker between compute resources and user jobs
- glideinWMS Factory
- VO Frontend
- All network traffic is authenticated and integrity checked
- Supports pseudo-interactive monitoring out of the box
[Diagram labels: Negotiator, Collector, Schedd, Startd, VO Frontend, Dashboard, GFactory, Frontend, Condor-G, Pilot jobs, Computing jobs, Grid Site, Gatekeeper, Batch System, Compute resource]
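The advertise-and-match idea can be illustrated with a toy matchmaker. This is a deliberately simplified sketch: real ClassAds are an expression language evaluated by Condor, whereas here each ad is a plain Python dict with a `requirements` callable, and the greedy first-fit loop in `negotiate` is an assumption, not the negotiator's actual algorithm.

```python
from typing import Dict, List, Tuple

Ad = Dict[str, object]  # simplified stand-in for a ClassAd


def negotiate(job_ads: List[Ad], resource_ads: List[Ad]) -> List[Tuple[str, str]]:
    """Pair each job with the first free resource where both sides' requirements hold."""
    matches = []
    free = list(resource_ads)  # resources not yet claimed in this cycle
    for job in job_ads:
        for resource in free:
            # A match needs agreement from both ads, as in two-sided matchmaking.
            if job["requirements"](resource) and resource["requirements"](job):
                matches.append((job["name"], resource["name"]))
                free.remove(resource)
                break
    return matches


# Usage: one job fits the 4000 MB slot, the other finds no resource large enough.
jobs = [
    {"name": "j1", "requirements": lambda r: r["memory_mb"] >= 2000},
    {"name": "j2", "requirements": lambda r: r["memory_mb"] >= 8000},
]
slots = [
    {"name": "slot1", "memory_mb": 4000, "requirements": lambda j: True},
    {"name": "slot2", "memory_mb": 1000, "requirements": lambda j: True},
]
matched = negotiate(jobs, slots)
```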

glideinWMS Implementation of the Pilot Paradigm
- glideinWMS is based on Condor, with the VO Frontend and the Factory sending pilot jobs (i.e. glideins) to the grid sites.
- Condor as a user job WMS
  - Condor collector acts as an information database
  - Condor startd manages the compute resource
  - Condor schedd acts as the job queue for user jobs
  - Startd and schedd advertise the resource and jobs respectively to the collector
  - Condor negotiator acts as a matchmaker between compute resources and user jobs
- glideinWMS Factory
  - Glidein factories create and submit pilot jobs to the grid sites using Condor-G
  - Condor collector acts as a dashboard for message exchange
  - Factory receives orders from the VO frontend via the dashboard
- VO Frontend
  - VO frontend monitors the Condor WMS and regulates the number of pilot jobs sent by glidein factories via the dashboard
  - Frontend acts as a matchmaker for the glideins
- All network traffic is authenticated and integrity checked
- Supports pseudo-interactive monitoring out of the box
[Diagram labels: Negotiator, Collector, Schedd, Startd, Dashboard, GFactory, Frontend, Condor-G, Pilot jobs, Computing jobs, Grid Site, Gatekeeper, Batch System, Compute resource]

Scalability of glideinWMS
- Centralized WMS are generally less scalable
- glideinWMS scalability issues found
  - The centralized user queue keeping track of thousands of running jobs is memory intensive.
  - The security handshake in establishing communication between different components can be expensive in case of high network latency
- glideinWMS addresses these scalability issues by
  - Deploying multiple instances of the user queue service to spread the load
  - Increasing the memory of the machine that hosts the schedd service
  - Deploying multiple slave collectors to reduce the impact of communication issues caused by high network latency
- The table below summarizes the scalability achieved with a deployment running 1 master collector, 70 slave collectors, and a system with 16 GB of memory hosting the schedd service.

Criteria                                                    Design goal   Achieved so far
Total number of user jobs in the queue at any given time    100k          200k
Number of glideins in the system at any given time          10k           ~26k
Number of running jobs per schedd at any given time         10k           ~23k
Grid sites handled                                          ~100          ~100
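Spreading load over several user queue instances can be pictured as choosing a schedd for each new submission. The slide does not say how submissions are distributed, so the least-loaded policy, the function name `pick_schedd`, and the per-schedd limit parameter below are assumptions for illustration only.

```python
from typing import Dict, Optional


def pick_schedd(running_jobs: Dict[str, int], limit: int) -> Optional[str]:
    """Return the schedd with the fewest running jobs, or None if all are at the limit.

    running_jobs -- hypothetical map of schedd name to its current running job count
    limit        -- maximum running jobs one schedd should handle
    """
    if not running_jobs:
        return None
    # Break ties by name so the choice is deterministic.
    name = min(running_jobs, key=lambda n: (running_jobs[n], n))
    if running_jobs[name] >= limit:
        return None
    return name
```

With a limit of 10k running jobs per schedd (the design goal in the table), a call such as `pick_schedd({"schedd1": 9000, "schedd2": 4000}, 10000)` would select `schedd2`.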

glideinWMS in CMS Operations
- Running more than 20k glideins at any given time
[Plot: CMS operations using glideinWMS at its seven archival storage sites]
[Plot: CMS operations at Tier1 site at Fermilab]
