Magic Castle: Terraforming the Cloud for HPC, Félix-Antoine Fortin

slide-1
SLIDE 1

Magic Castle

Terraforming the Cloud for HPC

Félix-Antoine Fortin, FOSDEM20

slide-2
SLIDE 2

Why are there more wizards in Harry Potter than in Lord of the Rings?

slide-3
SLIDE 3

Context

slide-4
SLIDE 4

Canada Digital Research Infrastructure

slide-5
SLIDE 5

Education and Training in Compute Canada

  • Over 150 workshops / year
  • Most workshops use the HPC software environment
  • HPC clusters require an account
  • Account creation process can take a few days

Could we replicate the HPC environment for training?

slide-6
SLIDE 6

So what is the difference between HP and LotR?

slide-7
SLIDE 7

So what is the difference between HP and LotR?

Wizardry Schools

slide-8
SLIDE 8

Proposal

slide-9
SLIDE 9

HPC Wizard Tower by Simon Guilbault

slide-10
SLIDE 10

demo

slide-11
SLIDE 11

CC Wizard: Magic Castle Voice Assistant

slide-12
SLIDE 12

CC Wizard: Magic Castle Voice Assistant

slide-13
SLIDE 13

Magic Castle

Open source project that instantiates a Compute Canada cluster replica in any major cloud with Terraform and Puppet

  • Create instances
      ○ Management nodes
      ○ Login nodes
      ○ Compute nodes
  • Create volumes, network, network ACLs
  • Create certificates, DNS records, passwords
  • Configuration done via input parameters

https://github.com/computecanada/magic_castle

slide-14
SLIDE 14

Terraform

  • Tool for building, changing, and versioning infrastructure
  • Infrastructure is described using a high-level configuration syntax.
  • Create resources that can then be set up by a config management tool.

Puppet

  • Config management tool used for deploying, configuring and managing servers.
  • Define configurations for each host
  • Continuously check whether the required configuration is in place and is not altered

slide-15
SLIDE 15

Overview of a Magic Castle Release

Magic Castle provider* main.tf data.tf variables.tf output.tf infrastructure.tf cloud-init mgmt.yaml puppet.yaml provider.tf

*could be any in [aws, azure, gcp, openstack, ovh]

slide-16
SLIDE 16

Infrastructure

slide-17
SLIDE 17

Overview of a Magic Castle Release

Magic Castle provider* main.tf data.tf variables.tf output.tf infrastructure.tf cloud-init mgmt.yaml puppet.yaml provider.tf

*could be any in [aws, azure, gcp, openstack, ovh]

slide-18
SLIDE 18

Architecture

slide-19
SLIDE 19

Architecture - login nodes

slide-20
SLIDE 20

Architecture - management nodes

slide-21
SLIDE 21

Architecture - compute nodes

slide-22
SLIDE 22

Main Interface

slide-23
SLIDE 23

Overview of a Magic Castle Release

Magic Castle provider* main.tf data.tf variables.tf output.tf infrastructure.tf cloud-init mgmt.yaml puppet.yaml provider.tf

*could be any in [aws, azure, gcp, openstack, ovh]

slide-24
SLIDE 24

Magic Castle Terraform Main Module

4 sections

  1. Cloud provider selection
  2. Infrastructure customization
  3. Cloud provider specific inputs
  4. DNS configuration (optional)
slide-25
SLIDE 25

MC Module - 1. source

source = "./provider"

slide-26
SLIDE 26

MC Module - 2.1 Infrastructure customization

cluster_name = "fosdem"
domain       = "computecanada.dev"
image        = "CentOS-7-x64-2019-07"
nb_users     = 100
public_keys  = [file("~/.ssh/id.pub")]

slide-27
SLIDE 27

MC Module - 2.2 Instance definition

instances = {
  mgmt  = { type = "p4-6gb", count = 1 },
  login = { type = "p2-3gb", count = 1 },
  node  = { type = "p2-3gb", count = 1 }
}

slide-28
SLIDE 28

MC Module - 2.3 Storage definition

storage = {
  type         = "nfs"
  home_size    = 100
  project_size = 50
  scratch_size = 50
}
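
The fragments from sections 1 through 2.3 all sit inside one Terraform module block. A minimal sketch of how they might be assembled is shown here. It assumes the module is labelled `provider`, which is the label implied by the `module.provider.*` references on the DNS slide, and that `./provider` stands for one of the provider folders (aws, azure, gcp, openstack, ovh). It is not copied from a Magic Castle release, and provider-specific inputs (section 3) are left out.

```hcl
# Sketch only: assembled from the slide fragments, not a verified release file.
# The module label "provider" is an assumption based on the module.provider.*
# references used in the DNS section.
module "provider" {
  # 1. Cloud provider selection
  source = "./provider"

  # 2.1 Infrastructure customization
  cluster_name = "fosdem"
  domain       = "computecanada.dev"
  image        = "CentOS-7-x64-2019-07"
  nb_users     = 100
  public_keys  = [file("~/.ssh/id.pub")]

  # 2.2 Instance definition: one entry per node role
  instances = {
    mgmt  = { type = "p4-6gb", count = 1 },
    login = { type = "p2-3gb", count = 1 },
    node  = { type = "p2-3gb", count = 1 }
  }

  # 2.3 Storage definition
  storage = {
    type         = "nfs"
    home_size    = 100
    project_size = 50
    scratch_size = 50
  }
}
```

Keeping every user-facing setting in this one block is what lets the rest of the release stay untouched when the cluster size or the cloud provider changes.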

slide-29
SLIDE 29

MC Module - 3. Cloud Provider Specific Inputs

Examples:

  • OpenStack list of floating IPs
  • Google GPU attachment for compute nodes
  • AWS / Azure / Google Cloud region
slide-30
SLIDE 30

MC Module - 4. DNS Configuration (optional)

source          = "./dns/cloudflare"
name            = module.provider.cluster_name
domain          = module.provider.domain
email           = "you@example.com"
public_ip       = module.provider.ip
rsa_public_key  = module.provider.rsa_public_key
sudoer_username = module.provider.sudoer_username
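
These inputs belong to a second module block that consumes the outputs of the cluster module. The sketch below wraps the slide's attributes in such a block. The label `dns` is an assumption made for illustration; the attribute names and values are the ones shown on the slide.

```hcl
# Sketch only: the module label "dns" is hypothetical.
# Each module.provider.<name> expression reads an output of the cluster
# module, so Terraform knows to create the cluster before the DNS records.
module "dns" {
  source          = "./dns/cloudflare"
  name            = module.provider.cluster_name
  domain          = module.provider.domain
  email           = "you@example.com"
  public_ip       = module.provider.ip
  rsa_public_key  = module.provider.rsa_public_key
  sudoer_username = module.provider.sudoer_username
}
```

Because the dependency is expressed through references rather than an explicit ordering, leaving this block out should give a cluster without DNS records, which matches the "optional" label on the slide.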

slide-31
SLIDE 31

Apply Plan

$ terraform apply

Apply complete! Resources: 30 added, 0 changed, 0 destroyed.

Outputs:

admin_username = centos
guest_passwd = **redacted**
guest_usernames = user[01-10]
hostnames = [pirate.calculquebec.cloud, pirate1.calculquebec.cloud]
public_ip = [206.12.90.97]
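
Values listed under "Outputs" come from `output` blocks declared at the root of the configuration. The following is a hedged sketch of two such blocks. The mapping from output names to module attributes is an assumption: `module.provider.sudoer_username` and `module.provider.ip` are taken from the DNS slide, and the pairing with `admin_username` and `public_ip` is a guess, not the release's actual definition.

```hcl
# Sketch only: the pairing of output names with module attributes is assumed.
output "admin_username" {
  value = module.provider.sudoer_username
}

output "public_ip" {
  value = module.provider.ip
}
```

After a successful apply, `terraform output public_ip` prints a single value again without re-running the plan.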

slide-32
SLIDE 32

Challenges: Infrastructure as Code

  • Designing a main user interface that limits references to provider-specific implementations / APIs.
  • The Terraform configuration language tends to favor repetition over re-use of code.
  • Regrouping every component that is common amongst providers.

slide-33
SLIDE 33

Provisioning

slide-34
SLIDE 34

Overview of a Magic Castle Release

Magic Castle provider* main.tf data.tf variables.tf output.tf infrastructure.tf cloud-init mgmt.yaml puppet.yaml provider.tf

*could be any in [aws, azure, gcp, openstack, ovh]

slide-35
SLIDE 35

Bootstrap Puppet

  1. Inject data from TF
  2. Upgrade CentOS
  3. Install Puppet rpms
  4. Configure Puppet certificates
  5. Set up host configuration

slide-36
SLIDE 36

Provisioning with Puppet and Consul

login1  mgmt1  node1  node2  node3  node4  node5  node6

slide-37
SLIDE 37

Challenges: Provisioning

  • Every step of the provisioning needs to work without human intervention.
  • Once provisioned, the cluster needs to stay healthy on its own: users are not necessarily sys admins.
  • Provisioning both master and slave services without a proper syncing mechanism.

slide-38
SLIDE 38

Software

slide-39
SLIDE 39

Batteries Included

  • FreeIPA
      ○ Kerberos
      ○ BIND
      ○ 389 DS LDAP

  • NFS
  • Slurm
  • Globus Endpoint
  • JupyterHub with BatchSpawner
  • Compute Canada CVMFS
  • LMOD
slide-40
SLIDE 40

Compute Canada Software Stack - CVMFS

  • CernVM File System (CVMFS) provides a scalable, reliable and low-maintenance software distribution service;
  • Compute Canada CVMFS repo:
      ○ 600+ scientific applications
      ○ 4,000+ permutations of version/arch/toolchain
      ○ All compiled with EasyBuild

  • Available from anywhere
  • PEARC19 paper
slide-41
SLIDE 41

Key Takeaways

  1. Terraform can be used to build complex things and modules simplify that complexity.
  2. Magic Castle is a teaching and development meta-platform for HPC.

slide-42
SLIDE 42

Magic Castle Replicates a Compute Canada Cluster in 20 min.

slide-43
SLIDE 43

Questions?