

SLIDE 1

13 Mar 2013 John Hover

Large-scale Cloud-based clusters using Boxgrinder, Condor, Panda, and APF

John Hover OSG All-Hands Meeting 2013 Indianapolis, Indiana

SLIDE 2

Outline

– Rationale
  • In general...
  • OSG-specific
– Dependencies/Limitations
– Current Status
  • VMs with Boxgrinder
  • AutoPyFactory (APF) and Panda
  • Condor Scaling Work
  • EC2 Spot, Openstack
– Next Steps and Plans
– Discussion
– A Reminder

SLIDE 3

Rationale

Why Cloud interfaces rather than Globus?

– Common interface for end-user virtualization management, thus easy expansion to external cloud resources -- the same workflow extends to:
  • Local Openstack resources.
  • Commercial and academic cloud resources.
  • Future OSG and DOE site cloud resources.
– Includes all benefits of non-Cloud virtualization: customized OS environments for reliable opportunistic usage.
– Flexible facility management:
  • Reboot host nodes without draining queues.
  • Move running VMs to other hosts.
– Flexible VO usage:
  • Rapid prototyping and testing of platforms for experiments.
SLIDE 4

OSG Rationale

Why are we talking about this at an OSG meeting?

– OSG VOs are interested in cloud usage: local, remote, and commercial.
– The new OSG CE (HTCondor-based) could easily provide an interface to local or remote Cloud-based resources, while performing authentication/authorization.
– OSG itself may consider offering a central, transparent gateway to external cloud resources. (Mentioned in Ruth's talk regarding commercial partnerships for CPU and storage.)

This work addresses the ease, flexibility, and scalability of cloud-based clusters. This talk is a technical overview of an end-to-end modular approach.

SLIDE 5

Dependencies/Limitations

Inconsistent behavior, bugs, immature software:

– "shutdown -h" means destroy instance on EC2, but means shut off on OpenStack (leaving the instance to count against quota).
– When starting large numbers of VMs, sometimes a few enter ERROR state, requiring removal (Openstack).
– Boxgrinder requires patches for mixed libs, and for SL5/EC2.
– EC2 offers public IPs; Openstack nodes are often behind NAT.
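The ERROR-state cleanup can be scripted. A minimal sketch (my illustration, not from the talk -- in practice the (id, state) pairs would come from the EC2/nova API rather than a hardcoded list):

```python
# Hedged sketch: pick out instances stuck in a bad state after a large
# launch, so they can be deleted and stop counting against quota.
def instances_to_remove(instances, bad_states=("ERROR",)):
    """instances: iterable of (instance_id, state) pairs."""
    return [iid for iid, state in instances if state in bad_states]

servers = [("i-01", "ACTIVE"), ("i-02", "ERROR"),
           ("i-03", "BUILD"), ("i-04", "ERROR")]
print(instances_to_remove(servers))  # ['i-02', 'i-04']
```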

VO infrastructures are often not designed to be fully dynamic:

– E.g., the ATLAS workload system assumes static sites.
– Data management assumes persistent endpoints.
– Others? Any element that isn't made to be created, managed, and cleanly deleted programmatically.

SLIDE 6

VM Authoring

Programmatic Worker Node VM creation using Boxgrinder:

– http://boxgrinder.org/
– http://svn.usatlas.bnl.gov/svn/griddev/boxgrinder/

Notable features:

– Modular appliance inheritance. The wn-atlas definition inherits the wn-osg profile, which in turn inherits from base.
– Connects back to a static Condor schedd for jobs.
– BG creates images dynamically for kvm/libvirt, EC2, virtualbox, and vmware via 'platform plugins'.
– BG can upload built images automatically to Openstack (v3), EC2, libvirt, or a local directory via 'delivery plugins'.

Important for OSG: Easy to test on your workstation!

– OSG could provide pre-built VMs (would need contextualization), or
– OSG could provide extensible templates for VOs.

SLIDE 7

Boxgrinder Base Appliance

name: sl5-x86_64-base
os:
  name: sl
  version: 5
hardware:
  partitions:
    "/":
      size: 5
packages:
  - bind-utils
  - curl
  - ntp
  - openssh-clients
  - openssh-server
  - subversion
  - telnet
  - vim-enhanced
  - wget
  - yum
repos:
  - name: "sl58-x86_64-os"
    baseurl: "http://host/path/repo"
files:
  "/root/.ssh":
    - "authorized_keys"
  "/etc":
    - "ntp/step-tickers"
    - "ssh/sshd_config"
post:
  base:
    - "chown -R root:root /root/.ssh"
    - "chmod -R go-rwx /root/.ssh"
    - "chmod +x /etc/rc.local"
    - "/sbin/chkconfig sshd on"
    - "/sbin/chkconfig ntpd on"
SLIDE 8

Boxgrinder Child Appliance

name: sl5-x86_64-batch
appliances:
  - sl5-x86_64-base
packages:
  - condor
repos:
  - name: "htcondor-stable"
    baseurl: "http://research.cs.wisc.edu/htcondor/yum/stable/rhel5"
files:
  "/etc":
    - "condor/config.d/50cloud_condor.config"
    - "condor/password_file"
    - "init.d/condorconfig"
post:
  base:
    - "/usr/sbin/useradd slot1"
    - "/sbin/chkconfig condor on"
    - "/sbin/chkconfig condorconfig on"
SLIDE 9

Boxgrinder Child Appliance 2

name: sl5-x86_64-wn-osg
summary: OSG worker node client.
appliances:
  - sl5-x86_64-base
packages:
  - osg-ca-certs
  - osg-wn-client
  - yum-priorities
repos:
  - name: "osg-release-x86_64"
    baseurl: "http://dev.racf.bnl.gov/yum/snapshots/rhel5/osg-release-2012-07-10/x86_64"
  - name: "osg-epel-deps"
    baseurl: "http://dev.racf.bnl.gov/yum/grid/osg-epel-deps/rhel/5Client/x86_64"
files:
  "/etc":
    - "profile.d/osg.sh"
post:
  base:
    - "/sbin/chkconfig fetch-crl-boot on"
    - "/sbin/chkconfig fetch-crl-cron on"
SLIDE 10

SLIDE 11

WN Deployment Recipe

Build and upload VM:

svn co http://svn.usatlas.bnl.gov/svn/griddev/boxgrinder
<Add your condor_password file>
<Edit COLLECTOR_HOST to point to your collector>
boxgrinder-build -f boxgrinder/sl5-x86_64-wn-atlas.appl -p ec2 -d ami
boxgrinder-build -f boxgrinder/sl5-x86_64-wn-atlas.appl -p ec2 -d ami \
    --delivery-config region:us-west-2,bucket:racf-cloud-2

# ~/.boxgrinder/config
plugins:
  openstack:
    username: jhover
    password: XXXXXXXXX
    tenant: bnlcloud
    host: cldext03.usatlas.bnl.gov
    port: 9292
  s3:
    access_key: AKIAJRDFC4GBBZY72XHA
    secret_access_key: XXXXXXXXXXX
    bucket: racf-cloud-1
    account_number: 4159-7441-3739
    region: us-east-1
    snapshot: false
    overwrite: true
SLIDE 12

Elastic Cluster: Components

Static HTCondor central manager
– Standalone, used only for Cloud work.

AutoPyFactory (APF) configured with two queues
– One observes a Panda queue; when jobs are activated, it submits pilots to the local cluster Condor queue.
– Another observes the local Condor pool. When jobs are Idle, it submits WN VMs to IaaS (up to some limit). When WNs are Unclaimed, it shuts them down.

Worker Node VMs
– Generic Condor startds connect back to the local Condor cluster. All VMs are identical, don't need public IPs, and don't need to know about each other.
– CVMFS software access.

Panda site
– Associated with static BNL SE, LFC, etc.
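A minimal sketch of what the worker-node Condor config baked into the VM (the 50cloud_condor.config shipped in the image) might contain -- every hostname and knob value here is an assumption for illustration, not the actual BNL file:

```
# Hedged sketch: startd-only node that reports back to the static
# central manager; all names and values are illustrative.
CONDOR_HOST = condor.cloud.example.org      # assumed collector hostname
DAEMON_LIST = MASTER, STARTD
# NATed VMs are reachable via the Condor Connection Broker
CCB_ADDRESS = $(CONDOR_HOST)
# Pool-password auth, matching the password_file baked into the image
SEC_DEFAULT_AUTHENTICATION_METHODS = PASSWORD
SEC_PASSWORD_FILE = /etc/condor/password_file
ALLOW_WRITE = $(CONDOR_HOST)
START = TRUE
```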

SLIDE 13

SLIDE 14
#/etc/apf/queues.conf

[BNL_CLOUD]
wmsstatusplugin = Panda
wmsqueue = BNL_CLOUD
batchstatusplugin = Condor
batchsubmitplugin = CondorLocal
schedplugin = Activated
sched.activated.max_pilots_per_cycle = 80
sched.activated.max_pilots_pending = 100
batchsubmit.condorlocal.proxy = atlas-production
batchsubmit.condorlocal.executable = /usr/libexec/wrapper.sh

[BNL_CLOUD-ec2-spot]
wmsstatusplugin = CondorLocal
wmsqueue = BNL_CLOUD
batchstatusplugin = CondorEC2
batchsubmitplugin = CondorEC2
schedplugin = Ready,MaxPerCycle,MaxToRun
sched.maxpercycle.maximum = 100
sched.maxtorun.maximum = 5000
batchsubmit.condorec2.gridresource = https://ec2.amazonaws.com/
batchsubmit.condorec2.ami_id = ami-7a21bd13
batchsubmit.condorec2.instance_type = m1.xlarge
batchsubmit.condorec2.spot_price = 0.156
batchsubmit.condorec2.access_key_id = /home/apf/ec2-racf-cloud/access.key
batchsubmit.condorec2.secret_access_key = /home/apf/ec2-racf-cloud/secret.key

SLIDE 15

13 Nov 2012 John Hover

Elastic Cluster Components

The Condor scaling test used manually started EC2/Openstack VMs. Now we want APF to manage this:

2 AutoPyFactory (APF) Queues
– First (standard) observes a Panda queue and submits pilots to the local Condor pool.
– Second observes the local Condor pool; when jobs are Idle, it submits WN VMs to IaaS (up to some limit).

Worker Node VMs
– Condor startds join back to the local Condor cluster. VMs are identical, don't need public IPs, and don't need to know about each other.

Panda site (BNL_CLOUD)
– Associated with BNL SE, LFC, CVMFS-based releases.
– But no site-internal configuration (NFS, file transfer, etc).

SLIDE 16

VM Lifecycle Management

Current status:

– Automatic ramp-up working properly.
– Submits properly to EC2 and Openstack via separate APF queues.
– Passive draining when Panda queue work completes.
– Out-of-band shutdown and termination via a command line tool.
– Required configuration to allow the APF user to retire nodes (_condor_PASSWORD_FILE).

Next steps:

– Active ramp-down via retirement from within APF. This adds the tricky issue of "un-retirement" during alternation between ramp-up and ramp-down.
– APF issues: condor_off -peaceful -daemon startd -name <host>
– APF uses condor_q and condor_status to associate startds with VM jobs, adding startd status to VM job info and aggregate statistics.
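The condor_q/condor_status association step amounts to a join on the VM hostname. A toy sketch (my illustration; field names are hypothetical and the real ClassAds are much richer):

```python
# Hedged sketch: join condor_status machine ads to condor_q EC2 VM job
# ads on hostname, so each VM job can carry its startd's activity.
def associate(startds, vm_jobs):
    """startds: {hostname: activity}; vm_jobs: [(job_id, hostname)].
    Returns [(job_id, hostname, activity_or_None)]."""
    return [(jid, host, startds.get(host)) for jid, host in vm_jobs]

startds = {"vm-01": "Busy", "vm-02": "Idle"}
jobs = [("1.0", "vm-01"), ("2.0", "vm-02"), ("3.0", "vm-03")]
print(associate(startds, jobs))
# [('1.0', 'vm-01', 'Busy'), ('2.0', 'vm-02', 'Idle'), ('3.0', 'vm-03', None)]
```

A VM whose job maps to None (or to a long-Unclaimed startd) is a candidate for retirement.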

SLIDE 17

Ultimate Capabilities

APF's intrinsic queue/plugin architecture, and code in development, will allow:

– Multiple targets
  • E.g., EC2 us-east-1, us-west-1, us-west-2 all submitted equally (1/3 each).
– Cascading targets, e.g.:
  • Preferentially utilize free site clouds (e.g. local Openstack or other academic clouds).
  • Once those are full, submit to EC2 spot-priced nodes.
  • During particularly high demand, submit EC2 on-demand nodes.
  • Retire and terminate in reverse order.

The various pieces exist and have been tested, but final integration in APF is in progress.
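The cascading-target logic can be sketched as a greedy fill in preference order -- an illustration under assumed caps, not the actual APF plugin code:

```python
# Hedged sketch: cascade new VM requests across targets in preference
# order (free site cloud -> EC2 spot -> EC2 on-demand), respecting caps.
def cascade(demand, targets):
    """targets: [(name, running, cap)] in preference order.
    Returns {name: new_vms_to_submit}."""
    plan = {}
    for name, running, cap in targets:
        take = min(demand, max(cap - running, 0))
        if take:
            plan[name] = take
            demand -= take
    return plan

targets = [("bnl-openstack", 90, 100),   # nearly full free cloud
           ("ec2-spot", 0, 5000),
           ("ec2-ondemand", 0, 200)]
print(cascade(60, targets))  # {'bnl-openstack': 10, 'ec2-spot': 50}
```

Retirement would walk the same list in reverse, draining on-demand nodes first since they cost the most.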

SLIDE 18

11 March 2013 John Hover

Condor Scaling 1

RACF received a $50K grant from Amazon -- a great opportunity to test:

– Condor scaling to thousands of nodes over WAN.
– Empirically determining costs.

Naive approach:

– Single Condor host (schedd, collector, etc.)
– Single process for each daemon.
– Password authentication.
– Condor Connection Broker (CCB).

Result: maxed out at ~3000 nodes.

– Collector load causing timeouts of the schedd daemon.
– CCB overload?
– Network connections exceeding open file limits.
– Collector duty cycle -> .99.

SLIDE 19

Condor Scaling 2

Refined approach:

– Tune OS limits: 1M open files, 65K max processes.
– Split the schedd from (collector, negotiator, CCB).
– Run 20 collector processes; startds randomly choose one. Enable collector reporting: sub-collectors report to a non-public collector.
– Enable the shared port daemon on all nodes: multiplexes TCP connections, resulting in dozens of connections rather than thousands.
– Enable session auth, so that connections after the first bypass the password auth check.
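A hedged sketch of what these changes might look like in HTCondor config terms -- the hostnames, port range, and exact knob choices are my assumptions for illustration, not the actual BNL settings:

```
# Startds pick one of 20 sub-collectors at random.
COLLECTOR_HOST = cm.example.org:$RANDOM_INTEGER(10001,10020)
# Sub-collectors forward their ads to the main, non-public collector.
CONDOR_VIEW_HOST = cm.example.org:9618
# One shared port daemon per node multiplexes inbound TCP connections.
USE_SHARED_PORT = True
DAEMON_LIST = $(DAEMON_LIST), SHARED_PORT
# Reuse the security session established at match time, so later
# connections skip the full password authentication handshake.
SEC_ENABLE_MATCH_PASSWORD_AUTHENTICATION = True
```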

Result:

– Smooth operation up to 5000 startds, even with large bursts.
– No disruption of schedd operation on the other host.
– Collector duty cycle ~.35. Substantial headroom left: switching to 7-slot startds would get us to ~35000 slots, with marginal additional load.

SLIDE 20

Condor Scaling 3

Overall results:

– Ran ~5000 nodes for several weeks.
– Production simulation jobs, with stageout to BNL.
– Spent approximately $13K; only $750 was for data transfer.
– Moderate failure rate due to spot terminations.
– Actual spot price paid was very close to baseline, e.g. still less than $.01/hr for m1.small.
– No solid statistics on efficiency/cost yet, beyond a rough appearance of "competitive."

SLIDE 21

EC2 Spot Pricing and Condor

On-demand vs. Spot

– On-Demand: You pay the standard price. Never terminates.
– Spot: You declare a maximum price, and you pay the current, variable spot price. If/when the spot price exceeds your maximum, the instance is terminated without warning. Note: NOT like priceline.com, where you pay what you bid.
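The pricing model can be made concrete with a toy simulation -- my illustration, not from the talk, and real spot prices vary continuously rather than hourly:

```python
# Hedged sketch: you pay the *current* spot price, not your bid, and the
# instance dies as soon as the spot price passes your bid.
def run_hours(spot_prices, bid):
    """Hours survived and total paid, given a list of hourly spot prices."""
    hours, paid = 0, 0.0
    for price in spot_prices:
        if price > bid:          # spot exceeds bid: terminated, no warning
            break
        hours += 1
        paid += price            # pay the market price, not the bid
    return hours, round(paid, 3)

print(run_hours([0.05, 0.06, 0.20, 0.05], bid=0.156))  # (2, 0.11)
```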

Problems:

– Memory is provided in units of 1.7GB, less than the ATLAS standard.
– More memory than needed per "virtual core".
– NOTE: On our private Openstack, we created a 1-core, 2GB RAM instance type, avoiding this problem.

Condor now supports submission of spot-price instance jobs.

– It handles this by making a one-time spot request, then cancelling it when fulfilled.

SLIDE 22

EC2 Types

Type       Memory  VCores  "CUs"  CU/Core  $Spot/hr (typical)  $On-Demand/hr  Slots?
m1.small   1.7G    1       1      1        .007                .06            -
m1.medium  3.75G   1       2      2        .013                .12            1
m1.large   7.5G    2       4      2        .026                .24            3
m1.xlarge  15G     4       8      2        .052                .48            7

Issues/Observations:

– We currently bid 3 * <baseline>. Is this optimal?
– Spot is ~1/10th the cost of on-demand. Nodes are ~1/2 as powerful as our dedicated hardware. Based on estimates of Tier 1 costs, this is competitive.
– Amazon provides 1.7G memory per CU, not per "CPU". Insufficient for ATLAS work (tested).
– Do 7 slots on m1.xlarge perform economically?
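Using the spot prices and slot counts from the table, a quick back-of-the-envelope comparison (my arithmetic, not from the slides) shows why the 7-slot m1.xlarge is attractive:

```python
# Hedged sketch: spot cost per job slot per hour for each instance type,
# using the typical spot prices and slot counts from the table above.
def slot_hour_cost(spot_per_hr, slots):
    return round(spot_per_hr / slots, 4)

for name, spot, slots in [("m1.medium", 0.013, 1),
                          ("m1.large", 0.026, 3),
                          ("m1.xlarge", 0.052, 7)]:
    print(name, slot_hour_cost(spot, slots))
# m1.medium 0.013
# m1.large 0.0087
# m1.xlarge 0.0074
```

Whether the cheapest slot-hour is also the most productive depends on how 7 jobs contend for 8 CUs, which is exactly the open question above.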

SLIDE 23

EC2 Spot Considerations

Service and Pricing:

– Nodes are terminated without warning. (No signal.)
– Partial hours are not charged.

Therefore, VOs utilizing spot pricing need to consider:

– Shorter jobs. The simplest approach. ATLAS originally worked to ensure jobs were at least a couple of hours long, to avoid pilot flow congestion. Now we have the opposite need.
– Checkpointing. Some work in the Condor world provides the ability to checkpoint without linking to special libraries.
– Per-work-unit stageout (e.g. event server in HEP).

With sub-hour units of work, VOs could get significant free time!
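The "free time" claim follows from the billing rule above. A toy model (my illustration of the rule, not an official billing calculator):

```python
# Hedged sketch: whole hours are billed, but the partial final hour is
# free when the *provider* terminates a spot instance. A sub-hour work
# unit killed mid-hour can therefore cost nothing.
def billed_hours(runtime_minutes, spot_terminated):
    """Whole hours billed for one instance lifetime."""
    full, partial = divmod(runtime_minutes, 60)
    if partial and not spot_terminated:
        full += 1                 # user-side stop: partial hour rounds up
    return full                   # spot kill: partial final hour is free

print(billed_hours(55, spot_terminated=True))    # 0 -- free compute
print(billed_hours(125, spot_terminated=True))   # 2
print(billed_hours(125, spot_terminated=False))  # 3
```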

SLIDE 24

Programmatic Repeatability, Extensibility

The key feature of our work has been to make all our processes and configs general and public, so others can use them. Except for pilot submission (AutoPyFactory), we have used only standard, widely used technology (RHEL/SL, Condor, Boxgrinder, Openstack).

– Boxgrinder appliance definitions are published, and modular for re-use by OSG and/or other ATLAS groups.
– All source repositories are public and usable over the internet, e.g.:
  • Snapshots of external repositories, for consistent builds:
    – http://dev.racf.bnl.gov/yum/snapshots/
    – http://dev.racf.bnl.gov/yum/grid/osg-epel-deps/
  • Custom repo:
    – http://dev.racf.bnl.gov/yum/grid/testing
– Our Openstack host configuration Puppet manifests are published and will be made generic enough to be borrowed.

SLIDE 25

Conclusions

SLIDE 26

Acknowledgements

Jose Caballero: APF development
Xin Zhao: BNL Openstack deployment
Todd Miller, Todd Tannenbaum, Jaime Frey, Miron Livny: Condor scaling assistance
David Pellerin, Stephen Elliott, Thomson Nguy, Jaime Kinney, Dhanvi Kapila: Amazon EC2 Spot Team

SLIDE 27

Reminder

Tomorrow's Blueprint session:

– 11:00AM to 12:00.
– Discussion and input rather than a talk.
– Bring your ideas, questions, requests.
– What next-generation technology or approaches do you think OSG should consider or support?

SLIDE 28

Extra Slides

SLIDE 29

BNL Openstack Cloud

Openstack 4.0 (Essex)

– 1 Controller, 100 execute hosts (~300 2GB VMs), fairly recent hardware (3 years old), KVM virtualization w/ hardware support.
– Per-rack network partitioning (10Gb throughput shared).
– Provides EC2 (nova), S3 (swift), and an image service (glance).
– Essex adds the keystone identity/auth service and the Dashboard.
– Programmatically deployed, with configs publicly available.
– Fully automated compute-node installation/setup (Puppet).
– Enables 'tenants': partitions VMs into separate authentication groups, such that users cannot terminate (or see) each other's VMs. Three projects currently.
– Winning the platform war -- CERN is switching to OpenStack.
  • BNL sent 2 people to the Openstack conference; CERN attended.
SLIDE 30

BNL Openstack Layout

SLIDE 31

Next Steps/Plans

APF Development

– Complete the APF VM Lifecycle Management feature.
– Simplify/refactor Condor-related plugins to reduce repeated code. Fault tolerance.
– Run multi-target workflows. Determine whether more between-queue coordination is necessary in practice.

Controlled Performance/Efficiency/Cost Measurements

– Test m1.xlarge (4 "cores", 15GB RAM) with 4, 5, 6, and 7 slots.
– Measure "goodput" under various spot pricing schemes. Is 3*<baseline> sensible?
– Google Compute Engine?

Other concerns

– Return to refinement of VM images. Contextualization.

SLIDE 32

Panda

BNL_CLOUD

– Standard production Panda site.
– Configured to use wide-area stagein/out (SRM, LFC), so the same cluster can be extended transparently to Amazon or other public academic clouds.
– Steadily running ~200 prod jobs on auto-built VMs for months. Bursts to 5000.
– Very low job failure rate due to software.
– HC tests, auto-exclude enabled -- no problems so far.
– Performance actually better than the main BNL prod site (slide follows).
– Also ran a hybrid Openstack/EC2 cluster with no problems.