Research in Production: Clouds Designed for Transition


  1. Mike May, Technical Director
     Research in Production: Clouds Designed for Transition
     Intelligent architectures and big data science

  2. WHO I AM
     • Mike May, Technical Director
       – 30+ cloud deployments
         • 4 production US Government clouds
         • 6 simultaneous DARPA research clouds
       – Background
         • Cybersecurity
         • HPC systems engineering
         • Stacker since 2013

  3. WHAT WE DO
     • Supporting 10 active R&D programs
       – Mostly DARPA programs
       – Design and deploy upstream IaaS
         • OpenStack
         • Mesos
         • Kubernetes
         • Burst support to public clouds
       – Drastically diverse research goals
         • Heavy on data science and data analytics

  4. WHAT WE DEPLOY

  5. BANANA (🍍) FOR SCALE
     • 1 DARPA program cluster (OpenStack and Mesos)
       – 476 raw CPU cores (952 threads/vCPUs)
       – 13 TB RAM, never overprovisioned
       – 2 PB raw disk
       – 14 GPU nodes
       – 48 Pascal GPUs
         • 172,032 CUDA cores (3,584 per GPU)
         • 576 GB of VRAM (12 GB per GPU)
       – Bare-metal Mesos with burst to OpenStack
       – GPU development VMs available in OpenStack
       – GPU batch-job support in Mesos
       – 100% open-source tools

  6. HOW IS IT USED
     • Seamless development-to-production experience
       – Hardware is shifted between IaaS offerings as needed
       – Development-heavy start, production-heavy transition
     • Simultaneous provisioned and batch-job support
     • CI/CD processes automatically promote work to the relevant cluster resources (a pipeline sketch follows this slide)
       – Provided to users automatically, and they retain full control
       – Fire-and-forget methodology; fail fast
     • L2 network isolation by default
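
A promotion flow like the one described above could look roughly like the following GitLab CI sketch. This is an illustration, not the deck's actual pipeline; the stage names, runner tags, and the deploy.yml playbook are hypothetical:

    # .gitlab-ci.yml -- hypothetical sketch of dev-to-production promotion
    stages:
      - build
      - deploy-dev
      - promote-prod

    build:
      stage: build
      script:
        - docker build -t "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHA" .
        - docker push "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHA"

    deploy-dev:
      stage: deploy-dev
      tags: [openstack-dev]     # hypothetical runner tag for the dev IaaS
      script:
        - ansible-playbook deploy.yml -e "env=dev image_tag=$CI_COMMIT_SHA"

    promote-prod:
      stage: promote-prod
      only: [master]            # work on master promotes automatically
      tags: [mesos-prod]        # hypothetical runner tag for production
      script:
        - ansible-playbook deploy.yml -e "env=prod image_tag=$CI_COMMIT_SHA"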

  7. SYSTEM LIMITS
     • CPU overprovisioning is never more than 8x, which is a lot (a Nova configuration sketch follows this slide)
       – CPU NOPs are our enemy too
       – We collect performance metadata, gathered on boxes that are not currently overprovisioned
       – Set at the per-program level
       – Decided case by case
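
In OpenStack Nova, that overcommit cap maps to a single scheduler setting. A minimal fragment, with the 8.0 value mirroring the limit stated above:

    # /etc/nova/nova.conf (fragment) -- only the relevant key is shown
    [DEFAULT]
    # Allow at most 8 vCPUs per physical core, matching the stated limit
    cpu_allocation_ratio = 8.0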

  8. SYSTEM LIMITS (CONT.)
     • RAM is NEVER overprovisioned (see the configuration sketch after this slide)
       – Problems bubbled up as bare-metal OS issues
     • GPUs are (painfully) special
       – Moved in and out of batch-processing pipelines
       – Obvious but important: development and experiments change use cases and needs
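
The no-RAM-overcommit rule and GPU passthrough also map to Nova configuration. A hedged sketch: the product IDs below are placeholders, and the exact option names and placement vary by OpenStack release:

    # /etc/nova/nova.conf (fragment) -- IDs below are placeholders
    [DEFAULT]
    # 1.0 means every guest gigabyte is backed by real RAM: no overcommit
    ram_allocation_ratio = 1.0

    [pci]
    # Pass NVIDIA GPUs through to guests (vendor 10de is NVIDIA;
    # the product_id here is a placeholder)
    passthrough_whitelist = { "vendor_id": "10de", "product_id": "15f8" }
    alias = { "vendor_id": "10de", "product_id": "15f8", "name": "gpu" }

A flavor can then request a GPU with the extra spec pci_passthrough:alias=gpu:1, which is one way to offer the GPU development VMs mentioned on slide 5.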

  9. STARTING WITH A BASELINE
     • We needed a baseline that others could reproduce locally
     • Fuel was a great start: its web interface made the process much easier to digest

  10. CUSTOMIZE OFF THE BASELINE
     • Ansible supported all customizations applied after a baseline deployment
     • "Program-public": everything is visible for all on the program to see

  11. CLOUD OPS
     • Cloud administration
       – Configuration management
         • "Ansiblize" all the things (a minimal playbook sketch follows this slide)
         • Idempotency is key
       – Automation
         • "(Almost) any task I have done more than once is to be automated/scripted"
           – Easier said than done
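
Idempotency here means tasks describe desired state rather than steps, so reruns are safe no-ops. A minimal hand-rolled sketch; the host group, package, and template names are hypothetical, not from the deck:

    # ntp.yml -- minimal idempotent sketch; names are hypothetical
    - hosts: cloud_controllers
      become: true
      tasks:
        - name: Ensure chrony is installed (no-op when already present)
          package:
            name: chrony
            state: present

        - name: Ensure chrony config matches the reviewed template
          template:
            src: chrony.conf.j2
            dest: /etc/chrony.conf
            mode: "0644"
          notify: restart chrony   # fires only when the file actually changes

      handlers:
        - name: restart chrony
          service:
            name: chrony
            state: restarted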

  12. SYSTEMS FROM CODE
     • All it takes to build an OpenStack base image:
       – GitLab
       – Packer
       – Ansible

  13. A LITTLE CODE
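
The original code on this slide is not captured in the transcript. A minimal sketch of what a GitLab-hosted Packer build with an Ansible provisioner might look like for an OpenStack base image; every name, flavor, and UUID below is a placeholder:

    {
      "builders": [
        {
          "type": "openstack",
          "source_image_name": "ubuntu-16.04-cloudimg",
          "image_name": "base-ubuntu-16.04-{{timestamp}}",
          "flavor": "m1.small",
          "ssh_username": "ubuntu",
          "networks": ["REPLACE-WITH-NETWORK-UUID"]
        }
      ],
      "provisioners": [
        {
          "type": "ansible",
          "playbook_file": "playbooks/base-image.yml"
        }
      ]
    }

With the template and playbook stored in GitLab, CI can run "packer build base-image.json" so every image is rebuilt from reviewed code rather than hand-tuned snapshots.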

  14. CLOUDS AS CODE
     • Code for our current deployments: every management and service task is captured and reviewable by the entire team.
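
One way to make "every task is reviewable" concrete is a single repository per cloud. This layout is hypothetical, purely for illustration:

    clouds-as-code/
    ├── inventories/     # one inventory per cluster (dev, prod, burst)
    ├── playbooks/       # deployment and day-2 operations tasks
    ├── roles/           # reusable, idempotent building blocks
    ├── packer/          # base-image templates
    └── .gitlab-ci.yml   # lint, test, and deploy pipelines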

  15. SELF-SERVICE PROXY WITH AUTHENTICATION
     • User-driven authentication
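
The deck does not include the proxy's configuration. One common pattern for this kind of authenticating front door is an nginx reverse proxy that delegates each request to an auth subrequest; in this sketch the hostnames, ports, and paths are placeholders:

    # nginx fragment -- hypothetical authenticating reverse proxy
    server {
        listen 443 ssl;
        server_name tools.example.org;              # placeholder hostname

        location / {
            auth_request /auth;                     # authenticate every request
            proxy_pass http://10.0.0.10:8080;       # placeholder backend service
        }

        location = /auth {
            internal;
            proxy_pass http://127.0.0.1:9000/validate;  # placeholder auth service
            proxy_pass_request_body off;
            proxy_set_header Content-Length "";
        }
    }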

  16. LESSONS LEARNED
     • Putting off automation reduces the chance you will ever do it
     • Monitoring is hard to do right but powerful for understanding users' interactions with services
     • Ground-truth / root-cause EVERYTHING
       – Issues, alerts, crashes, user reports
     • Researchers are biased (and so are admins and operators)
     • Evacuation must always be an option
       – Resource planning
     • Document and train by default

  17. THANK YOU!
