Research in Production: Clouds Designed for Transition




SLIDE 1

Research in Production

Mike May, Technology Director

Clouds Designed for Transition

intelligent architectures and big data science

SLIDE 2

WHO I AM

  • Mike May, Technical Director

– 30+ cloud deployments

  • 4 production US Government clouds
  • 6 simultaneous research DARPA clouds

– Background

  • Cybersecurity
  • HPC Systems Engineering
  • Stacker since 2013

SLIDE 3

WHAT WE DO

  • Supporting 10 Active R&D Programs

– Mostly DARPA programs
– Design and deploy upstream IaaS

  • OpenStack
  • Mesos
  • Kubernetes
  • Burst support to public clouds

– Drastically diverse research goals

  • Data science and data analytics heavy

SLIDE 4

WHAT WE DEPLOY

SLIDE 5

BANANA(🍍) FOR SCALE

  • 1 DARPA Program Cluster (OpenStack and Mesos)

– 476 Raw CPU cores (952 Threads/vCPUs)
– 13TB RAM; never overprovisioned
– 2PB Raw Disk
– 14 GPU Nodes
– 48 Pascal GPUs

  • 172032 CUDA Cores
  • 576GB of VRAM

– Bare metal Mesos with burst to OpenStack
– GPU development VMs available in OpenStack
– GPU batch job support in Mesos
– 100% Open Source Tools
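The totals on this slide can be broken down into per-unit figures with simple division. The sketch below uses only the numbers quoted above; the function name and dictionary keys are made up for illustration, and the per-GPU values are derived arithmetic, not a statement of which GPU model was installed.

```python
def per_unit_figures():
    """Derive per-unit figures from the cluster totals quoted on the slide."""
    raw_cores = 476            # raw CPU cores
    threads = 952              # threads / vCPUs
    gpu_nodes = 14
    gpus = 48
    cuda_cores_total = 172032
    vram_total_gb = 576

    return {
        "threads_per_core": threads / raw_cores,         # 2.0
        "cuda_cores_per_gpu": cuda_cores_total / gpus,   # 3584.0
        "vram_per_gpu_gb": vram_total_gb / gpus,         # 12.0
        # Not a whole number, so the GPU nodes cannot all hold the same count.
        "gpus_per_node": gpus / gpu_nodes,
    }


if __name__ == "__main__":
    for key, value in per_unit_figures().items():
        print(f"{key}: {value:.2f}")
```

The non-integer GPUs-per-node figure suggests the 14 GPU nodes are not identically populated, which the slide does not detail.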

SLIDE 6

HOW IS IT USED

  • Seamless development to production experience

– Hardware is shifted between IaaS offerings as needed
– Development heavy start, production heavy transition

  • Simultaneous provisioned and batch job support
  • CI/CD processes automatically promote work to relevant cluster resources

– Provided to users automatically and they have full control
– Fire and forget methodology; fail fast

  • L2 isolation by default

SLIDE 7

SYSTEM LIMITS

  • CPU overprovisioning is never >8x (which is a lot)

– CPU NOPs are our enemy too
– We collect metadata on performance; metadata is collected on boxes that are currently not overprovisioned
– Per-program level
– Case-by-case

SLIDE 8

SYSTEM LIMITS (CONT.)

  • RAM is NEVER overprovisioned

– Problems bubbled up as bare metal OS issues

  • GPUs are (painfully) special

– In and out of batch processing pipelines
– Obvious but important: development and experiments change use case and needs
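The two hard limits from the SYSTEM LIMITS slides (CPU never above 8x, RAM never overprovisioned) can be written as a small per-host check. This is a minimal sketch, not the team's tooling: the `Host` fields and the `violations` helper are hypothetical names. In OpenStack Nova the comparable knobs are the `cpu_allocation_ratio` and `ram_allocation_ratio` options, though the slides do not say how the limits are actually enforced.

```python
from dataclasses import dataclass
from typing import List

MAX_CPU_RATIO = 8.0  # slide: CPU overprovisioning is never >8x
MAX_RAM_RATIO = 1.0  # slide: RAM is NEVER overprovisioned


@dataclass
class Host:
    """Hypothetical per-host inventory record."""
    name: str
    physical_threads: int
    allocated_vcpus: int
    physical_ram_gb: float
    allocated_ram_gb: float


def violations(host: Host) -> List[str]:
    """Return a human-readable message for each limit the host exceeds."""
    problems = []
    cpu_ratio = host.allocated_vcpus / host.physical_threads
    ram_ratio = host.allocated_ram_gb / host.physical_ram_gb
    if cpu_ratio > MAX_CPU_RATIO:
        problems.append(f"{host.name}: CPU ratio {cpu_ratio:.2f} exceeds {MAX_CPU_RATIO}")
    if ram_ratio > MAX_RAM_RATIO:
        problems.append(f"{host.name}: RAM ratio {ram_ratio:.2f} exceeds {MAX_RAM_RATIO}")
    return problems


if __name__ == "__main__":
    host = Host("node-01", physical_threads=32, allocated_vcpus=300,
                physical_ram_gb=512, allocated_ram_gb=400)
    print(violations(host))  # CPU ratio is 300/32 = 9.375, so one violation
```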

SLIDE 9

STARTING WITH A BASELINE

  • We needed a baseline that others could reproduce locally
  • Fuel was a great start because of the web interface that made the process much easier to ingest

SLIDE 10

CUSTOMIZE OFF OF BASELINE

  • Ansible supported all customizations applied after a baseline deployment
  • “Program-public” for all to see

SLIDE 11

CLOUD OPS

  • Cloud administration

– Configuration management

  • “Ansiblize” all the things
  • Idempotency is key

– Automation

  • “(Almost) any task I have done more than once is to be automated/scripted”

– Easier said than done
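"Idempotency is key" means a task can run any number of times and leave the system in the same state, reporting a change only when it actually made one. The sketch below illustrates that property in plain Python for a single file edit. `ensure_line` is a hypothetical helper written for this example, not an Ansible API, and the sysctl line in the usage is only a placeholder value.

```python
from pathlib import Path


def ensure_line(path: Path, line: str) -> bool:
    """Make sure `line` is present in the file.

    Returns True only if the file had to be changed, so a second run
    with the same arguments is a no-op that returns False.
    """
    existing = path.read_text().splitlines() if path.exists() else []
    if line in existing:
        return False  # desired state already holds; do nothing
    existing.append(line)
    path.write_text("\n".join(existing) + "\n")
    return True


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "example.conf"
        print(ensure_line(target, "net.ipv4.ip_forward = 1"))  # True: file changed
        print(ensure_line(target, "net.ipv4.ip_forward = 1"))  # False: already in place
```

Returning a changed/unchanged flag mirrors how configuration management tools report results, which makes repeated runs safe to schedule and easy to audit.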

SLIDE 12

SYSTEMS FROM CODE

  • All it takes to build an OpenStack base image

– GitLab
– Packer
– Ansible
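One way these three tools could fit together: a GitLab CI job runs `packer build` on a template that uses Packer's `openstack` builder to boot a source image, then the `ansible` provisioner to apply a playbook before the result is saved as a new image. The sketch below generates such a JSON template from Python. The slides do not show the actual template, so every value here (image names, flavor, SSH user, playbook path) is a placeholder assumption, and `packer_template` is a hypothetical helper.

```python
import json


def packer_template(image_name, source_image, flavor, playbook, ssh_username="ubuntu"):
    """Build a minimal Packer JSON template as a Python dict.

    All arguments are placeholders; OpenStack credentials are assumed
    to come from the usual OS_* environment variables.
    """
    return {
        "builders": [
            {
                "type": "openstack",
                "image_name": image_name,      # name of the image to create
                "source_image": source_image,  # ID of the image to start from
                "flavor": flavor,
                "ssh_username": ssh_username,
            }
        ],
        "provisioners": [
            {
                "type": "ansible",
                "playbook_file": playbook,     # customizations applied to the base
            }
        ],
    }


if __name__ == "__main__":
    template = packer_template(
        image_name="base-image-example",
        source_image="REPLACE-WITH-SOURCE-IMAGE-ID",
        flavor="m1.small",
        playbook="./playbook.yml",
    )
    print(json.dumps(template, indent=2))
```

Keeping the template and playbook in a GitLab repository means every image change goes through version control and review, in line with the "systems from code" theme.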

SLIDE 13

A LITTLE CODE

SLIDE 14

CLOUDS AS CODE

CODE FOR OUR CURRENT DEPLOYMENTS

Every management and service task is captured and reviewable by the entire team.

SLIDE 15

SELF-SERVICE PROXY WITH AUTHENTICATION


AUTHENTICATION USER DRIVEN

SLIDE 16

LESSONS LEARNED

  • Putting off automation reduces the chance you will ever do it
  • Monitoring is hard to do right but powerful for understanding users’ interactions with services

  • Ground truth / root cause EVERYTHING

– Issues, alerts, crashes, user reports

  • Researchers are biased (and so are admins and operators)
  • Evacuation must always be an option

– Resource planning

  • Document and train by default
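The evacuation lesson implies a resource planning check: for any single node, the rest of the cluster should have enough spare capacity to absorb its workload. Since RAM is never overprovisioned here, RAM is the natural hard constraint. The sketch below is a simplified illustration with made-up node names and numbers. It compares aggregate spare RAM only, which is a necessary condition but not a sufficient one, because individual VMs may still fail to fit on any single remaining node.

```python
from typing import Dict, Tuple


def can_evacuate(nodes: Dict[str, Tuple[float, float]]) -> Dict[str, bool]:
    """For each node, report whether the other nodes have enough spare RAM.

    `nodes` maps a node name to (total_ram_gb, used_ram_gb).
    """
    result = {}
    for name, (_total, used) in nodes.items():
        spare_elsewhere = sum(
            other_total - other_used
            for other, (other_total, other_used) in nodes.items()
            if other != name
        )
        result[name] = used <= spare_elsewhere
    return result


if __name__ == "__main__":
    # Hypothetical three-node cluster.
    cluster = {
        "node-a": (256.0, 100.0),
        "node-b": (256.0, 100.0),
        "node-c": (256.0, 240.0),
    }
    print(can_evacuate(cluster))
```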

SLIDE 17


THANK YOU!