CERN's Virtual File System for Global-Scale Software Delivery


  1. CERN’s Virtual File System for Global-Scale Software Delivery Jakob Blomer for the CernVM-FS team CERN, EP-SFT MSST 2019, Santa Clara University

  2. Agenda: High Energy Physics Computing Model • Software Distribution Challenge • CernVM-FS: A Purpose-Built Software File System

  3. High Energy Physics Computing Model

  4. Accelerate & Collide

  5. Measure & Analyze • Billions of independent “events” • Each event subject to complex software processing ⇒ High-Throughput Computing

  6. Federated Computing Model • Physics and computing: international collaborations • “The Grid”: ≈ 160 data centers • In effect, a global batch system • Code moves to the data rather than vice versa

  7. Federated Computing Model (plus additional opportunistic resources, e.g. HPC machines and backfill slots) • Physics and computing: international collaborations • “The Grid”: ≈ 160 data centers • In effect, a global batch system • Code moves to the data rather than vice versa

  8. Software Distribution Challenge

  9. The Anatomy of a Scientific Software Stack • Stack layers (top to bottom): Analysis Code (0.1 MLOC, individual, changing) → Experiment Software Framework (4 MLOC) → High Energy Physics Libraries (5 MLOC) → Compiler, System Libraries, OS Kernel (20 MLOC, stable) • Key figures for LHC experiments: hundreds of (novice) developers • > 100 000 files per release • 1 TB / day of nightly builds • ∼ 100 000 machines world-wide • Daily production releases, remain available “eternally”

  10. Container Image Distribution • Containers are easier to create than to roll out at scale • Due to network congestion: long start-up times in large clusters • Impractical image cache management on worker nodes • Ideally: containers for isolation and orchestration, but not for distribution

  11. Shared Software Area on a General-Purpose DFS • Working set: ≈ 2 % to 10 % of all available files are requested at runtime • Median file size: < 4 kB • Software flash crowd effect (a “distributed DoS” on the shared software area): O(MHz) metadata request rate, O(kHz) file open rate

  12. Software vs. Data • POSIX interface vs. put, get, seek, streaming • File dependencies vs. independent files • O(kB) per file vs. O(GB) per file • Whole files vs. file chunks • Absolute paths vs. relocatable • WORM (“write-once-read-many”) • Billions of files • Versioned • Software is massive not in volume but in the number of objects and in metadata rates

  13. CernVM-FS: A Purpose-Built Software File System

  14. Design Objectives • Architecture: the read/write file system at the software publisher / master source is transformed into read-only, content-addressed objects (Merkle tree) and delivered to worker nodes via HTTP transport with caching & replication • Objectives: 1. World-wide scalability 2. Infrastructure compatibility 3. Application-level consistency 4. Efficient meta-data access

  15. Design Objectives (continued) • Several CDN options for the HTTP transport layer: Apache + Squids • Ceph/S3 • commercial CDN
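
  Where a site builds the caching layer from Apache + Squid, the forward proxy needs little more than an access rule and generous object caching. A minimal Squid configuration sketch, assuming a worker-node network in 10.0.0.0/8 (addresses, sizes, and paths are illustrative, not values from the talk):
      # /etc/squid/squid.conf (sketch): site forward proxy for CernVM-FS clients
      http_port 3128
      acl cvmfs_nodes src 10.0.0.0/8                # assumed worker-node network
      http_access allow cvmfs_nodes
      http_access deny all
      cache_mem 4096 MB                             # in-memory cache for hot objects
      maximum_object_size_in_memory 128 KB          # most CernVM-FS objects are small
      maximum_object_size 1024 MB
      cache_dir ufs /var/spool/squid 50000 16 256   # ~50 GB on-disk cache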

  16. Scale of Deployment • Distribution chain: source (stratum 0) → replicas (stratum 1) → site / edge caches • LHC infrastructure: > 1 billion files • ≈ 100 000 nodes • 5 replicas, 400 web caches

  17. High-Availability by Horizontal Scaling • Server side: stateless services • Worker nodes in a data center fetch over HTTP from a caching proxy, which fetches over HTTP from a web server • O(100) nodes per proxy server, O(10) data centers per web server

  18. High-Availability by Horizontal Scaling (continued) • Load balancing across multiple caching proxies within the data center

  19. High-Availability by Horizontal Scaling (continued) • Failover between the data center’s caching proxies

  20. High-Availability by Horizontal Scaling (continued) • Multiple mirror servers replace the single web server; clients are directed to the closest one via Geo-IP

  21. High-Availability by Horizontal Scaling (continued) • Failover between mirror servers

  22. High-Availability by Horizontal Scaling (continued) • Worker nodes with a pre-populated local cache
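
  The failover and Geo-IP behavior sketched in slides 17-22 is driven by the client configuration. A minimal sketch of /etc/cvmfs/default.local, assuming hypothetical proxy and mirror host names; '|' load-balances within a proxy group, ';' separates failover groups:
      # /etc/cvmfs/default.local (sketch)
      CVMFS_REPOSITORIES=atlas.cern.ch,alice.cern.ch
      # two load-balanced site proxies, falling back to a direct connection
      CVMFS_HTTP_PROXY="http://ca-proxy01.example.org:3128|http://ca-proxy02.example.org:3128;DIRECT"
      # ordered mirror (stratum 1) servers; @fqrn@ expands to the repository name
      CVMFS_SERVER_URL="http://stratum1-a.example.org/cvmfs/@fqrn@;http://stratum1-b.example.org/cvmfs/@fqrn@"
      CVMFS_USE_GEOAPI=yes    # let the servers order mirrors by Geo-IP proximity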

  23. Reading • To basic system utilities, CernVM-FS appears as a regular file system: a Fuse module below the OS kernel’s VFS, backed by an HTTP cache hierarchy and the CernVM-FS repository (HTTP or S3) • Cache levels: memory buffer (∼ 1 GB) → persistent cache (∼ 20 GB) → repository with all versions available (∼ 10 TB) • Fuse based, independent mount points, e.g. /cvmfs/atlas.cern.ch • High cache efficiency because the entire cluster is likely to use the same software
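
  To illustrate the cache levels above, a sketch of cache-related client settings and checks; the sizes and paths are examples chosen to match the slide, not required values:
      # /etc/cvmfs/default.local (sketch, continued)
      CVMFS_CACHE_BASE=/var/lib/cvmfs   # location of the persistent on-disk cache
      CVMFS_QUOTA_LIMIT=20000           # soft cache limit in MB (roughly the 20 GB above)

      [ ~ ]# cvmfs_config probe atlas.cern.ch    # mount the repository and check connectivity
      [ ~ ]# cvmfs_config stat -v atlas.cern.ch  # cache usage, hit rate, active proxy and host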

  24. Writing • A union file system provides a read/write interface on top of the read-only CernVM-FS mount; changes are collected in a staging area and published to the backend storage (file system or S3) • Publishing new content:
      [ ~ ]# cvmfs_server transaction containers.cern.ch
      [ ~ ]# cd /cvmfs/containers.cern.ch && tar xvf ubuntu1610.tar.gz
      [ ~ ]# cvmfs_server publish containers.cern.ch
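
  The commands above assume the repository already exists. A sketch of the surrounding server-side workflow on the publisher machine, assuming the cvmfs-server tools are installed (the owner account is an example):
      [ ~ ]# cvmfs_server mkfs -o root containers.cern.ch   # create the repository once
      [ ~ ]# cvmfs_server transaction containers.cern.ch    # open a writable transaction
      [ ~ ]# cvmfs_server abort -f containers.cern.ch       # discard a transaction instead of publishing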

  25. Use of Content-Addressable Storage • A repository path such as /cvmfs/alice.cern.ch/amd64-gcc6.0/4.2.0/ChangeLog is mapped, via compression and hashing, to a content-addressed object (e.g. 806fbb67373e9...) • Object store: compressed files and chunks, de-duplicated • File catalogs: directory structure, symlinks • content hashes of regular files • large files chunked with a rolling checksum • digitally signed • time to live • partitioned / Merkle hashes (possibility of sub-catalogs) • ⊕ Immutable files, trivial to check for corruption, versioning, efficient replication • − Compute-intensive, garbage collection required
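
  A back-of-the-envelope sketch of the content-addressing idea: compress an object, hash the result, and file it under a two-character fan-out directory. This is illustrative only, not the exact CernVM-FS on-disk format; the store path and input file are made up:
      store=/srv/cvmfs/alice.cern.ch/data        # hypothetical object store root
      f=./amd64-gcc6.0/4.2.0/ChangeLog           # hypothetical input file

      pigz --zlib -c "$f" > /tmp/obj             # zlib-compress the file
      h=$(sha1sum /tmp/obj | cut -d' ' -f1)      # content hash of the stored object
      mkdir -p "$store/${h:0:2}"                 # two-character fan-out directory
      mv /tmp/obj "$store/${h:0:2}/${h:2}"       # e.g. data/80/6fbb67373e9...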

  26. Partitioning of Meta-Data (example repository tree: certificates, aarch64, x86_64, gcc/v8.3, Python/v3.4) • Locality by software version • Locality by frequency of changes • Partitioning is up to the software librarian, steered through .cvmfscatalog magic marker files
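
  A sketch of how a software librarian might split off a nested catalog with such a marker file, following the transaction/publish pattern of slide 24; the repository name and paths are hypothetical:
      [ ~ ]# cvmfs_server transaction sw.example.org
      [ ~ ]# mkdir -p /cvmfs/sw.example.org/gcc/v8.3
      [ ~ ]# touch /cvmfs/sw.example.org/gcc/v8.3/.cvmfscatalog   # nested catalog for this subtree
      [ ~ ]# cvmfs_server publish sw.example.org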

  27. Deduplication and Compression • [Chart] 24 months of software releases for a single LHC experiment: number of file system entries (×10⁶) and volume [GB] at successive de-duplication and compression steps

  28. Site-local network traffic: CernVM-FS compared to NFS • [Charts] NFS server load before and after the switch; site Squid web cache load before and after the switch • Source: Ian Collier

  29. Latency sensitivity: CernVM-FS compared to AFS • Use case: starting the “stressHepix” standard benchmark • [Chart] Start-up overhead ∆t [min] and throughput [Mbit/s] versus round-trip time [ms], from LAN up to 150 ms, for AFS and CernVM-FS

  30. Principal Application Areas (★ = current focus of development) • ❶ Production software (example: /cvmfs/ligo.egi.eu): most mature use case; ★ fully unprivileged deployment of the Fuse module • ❷ Integration builds (example: /cvmfs/lhcbdev.cern.ch): high churn, requires regular garbage collection; ★ update propagation from minutes to seconds • ❸ Unpacked container images (example: /cvmfs/singularity.opensciencegrid.org): works out of the box with Singularity; CernVM-FS driver for Docker; ★ integration with containerd / kubernetes • ❹ Auxiliary data sets (example: /cvmfs/alice-ocdb.cern.ch): benefits from internal versioning; depending on volume, requires more planning for the CDN components
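
  For the unpacked container image use case, a usage sketch with Singularity executing a directory image straight from the repository; the exact image path under /cvmfs/singularity.opensciencegrid.org is hypothetical:
      [ ~ ]$ singularity exec \
                /cvmfs/singularity.opensciencegrid.org/library/ubuntu:20.04 \
                python3 analysis.py    # runs inside the unpacked image; files are fetched on demand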

  31. Summary • CernVM-FS: a special-purpose virtual file system that provides a global shared software area for many scientific collaborations • Content-addressed storage and asynchronous writing (publishing) are key to meta-data scalability • Current areas of development: fully unprivileged deployment • integration with the containerd/kubernetes image management engine • https://github.com/cvmfs/cvmfs

  32. Backup Slides
