

SLIDE 1

CERN’s Virtual File System for Global-Scale Software Delivery

Jakob Blomer, for the CernVM-FS team, CERN EP-SFT. MSST 2019, Santa Clara University

SLIDE 2

Agenda

  • High Energy Physics Computing Model
  • Software Distribution Challenge
  • CernVM-FS: A Purpose-Built Software File System

jblomer@cern.ch CernVM-FS / MSST’19 1 / 20

SLIDE 3

High Energy Physics Computing Model

SLIDE 4

Accelerate & Collide

jblomer@cern.ch CernVM-FS / MSST’19 2 / 20

SLIDE 5

Measure & Analyze

  • Billions of independent “events”
  • Each event subject to complex software processing

⇒ High-Throughput Computing

jblomer@cern.ch CernVM-FS / MSST’19 3 / 20

SLIDE 6

Federated Computing Model

  • Physics and computing: international collaborations
  • “The Grid”: ≈160 data centers
  • Approximately a global batch system
  • Code moves to the data rather than vice versa

jblomer@cern.ch CernVM-FS / MSST’19 4 / 20

SLIDE 7

Federated Computing Model

  • Physics and computing: international collaborations
  • “The Grid”: ≈160 data centers
  • Approximately a global batch system
  • Code moves to the data rather than vice versa
  • Additional opportunistic resources, e. g. HPC backfill slots

jblomer@cern.ch CernVM-FS / MSST’19 4 / 20

SLIDE 8

Software Distribution Challenge

SLIDE 9

The Anatomy of a Scientific Software Stack

Software stack, from stable (bottom) to changing (top):

  • Compiler, system libraries, OS kernel: 20 MLOC
  • High Energy Physics libraries: 5 MLOC
  • Experiment software framework: 4 MLOC
  • Individual analysis code: 0.1 MLOC

Key Figures for LHC Experiments

  • Hundreds of (novice) developers
  • > 100 000 files per release
  • 1 TB / day of nightly builds
  • ∼100 000 machines world-wide
  • Daily production releases, which remain available “eternally”

jblomer@cern.ch CernVM-FS / MSST’19 5 / 20

SLIDE 10

Container Image Distribution


  • Containers are easier to create than to roll out at scale
  • Due to network congestion: long startup times in large clusters
  • Impractical image cache management on worker nodes
  • Ideally: containers for isolation and orchestration, but not for distribution

jblomer@cern.ch CernVM-FS / MSST’19 6 / 20

SLIDE 11

Shared Software Area on General Purpose DFS

Working Set

  • ≈2 % to 10 % of all available files are requested at runtime
  • Median of file sizes: < 4 kB

Flash Crowd Effect

  • O(MHz) meta data request rate
  • O(kHz) file open rate

⇒ In effect, a distributed denial of service (DDoS) against the shared software area
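To put these rates into perspective (illustrative arithmetic, not taken from the slides): with roughly 100 000 worker nodes each resolving only a handful of paths per second, the aggregate meta-data request rate already reaches millions of operations per second, i.e. O(MHz); even if only a small fraction of those lookups turn into actual file opens, the open rate is still O(kHz).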

jblomer@cern.ch CernVM-FS / MSST’19 7 / 20

SLIDE 12

Software vs. Data

Software                           Data
POSIX interface                    put, get, seek, streaming
File dependencies                  Independent files
O(kB) per file                     O(GB) per file
Whole files                        File chunks
Absolute paths                     Relocatable
WORM (“write-once-read-many”)      Versioned
Billions of files

Software is massive not in volume but in number of objects and meta-data rates.

jblomer@cern.ch CernVM-FS / MSST’19 8 / 20

SLIDE 13

CernVM-FS: A Purpose-Built Software File System

SLIDE 14

Design Objectives

Diagram: on the software publisher (master source), a read/write file system is transformed into content-addressed objects (Merkle tree, e.g. a1240, 41ae3, . . . , 7e95b); these are delivered over HTTP with caching and replication, and worker nodes see a read-only file system.

  1. World-wide scalability
  2. Infrastructure compatibility
  3. Application-level consistency
  4. Efficient meta-data access

jblomer@cern.ch CernVM-FS / MSST’19 9 / 20

SLIDE 15

Design Objectives

Diagram: on the software publisher (master source), a read/write file system is transformed into content-addressed objects (Merkle tree, e.g. a1240, 41ae3, . . . , 7e95b); these are delivered over HTTP with caching and replication, and worker nodes see a read-only file system.

  1. World-wide scalability
  2. Infrastructure compatibility
  3. Application-level consistency
  4. Efficient meta-data access

Several CDN options:

  • Apache + Squids
  • Ceph/S3
  • Commercial CDN
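All of these options work because the transport is plain HTTP on immutable objects. As a quick illustration (hostname and repository name are examples; .cvmfspublished is the signed repository manifest):

# Fetch the repository manifest like any other web object
$ curl -sI http://cvmfs-stratum-one.cern.ch/cvmfs/cvmfs-config.cern.ch/.cvmfspublished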

jblomer@cern.ch CernVM-FS / MSST’19 9 / 20

SLIDE 16

Scale of Deployment

LHC infrastructure:

  • > 1 billion files
  • ≈100 000 nodes
  • 5 replicas, 400 web caches

Diagram: world map of the deployment, showing Source / Stratum 0, Replica / Stratum 1, and Site / Edge Cache locations

jblomer@cern.ch CernVM-FS / MSST’19 10 / 20

SLIDE 17

High-Availability by Horizontal Scaling

Server side: stateless services

Diagram: worker nodes in the data center fetch over HTTP from a caching proxy (O(100) nodes per server), which in turn fetches over HTTP from web servers (O(10) data centers per server)

jblomer@cern.ch CernVM-FS / MSST’19 11 / 20

SLIDE 18

High-Availability by Horizontal Scaling

Server side: stateless services

Diagram: same picture with load balancing: worker nodes spread their HTTP requests across several caching proxies (O(100) nodes per server) and web servers (O(10) data centers per server)

jblomer@cern.ch CernVM-FS / MSST’19 11 / 20

SLIDE 19

High-Availability by Horizontal Scaling

Server side: stateless services

Diagram: caching proxies (O(100) nodes per server) with HTTP failover between them, in front of web servers (O(10) data centers per server)

jblomer@cern.ch CernVM-FS / MSST’19 11 / 20

SLIDE 20

High-Availability by Horizontal Scaling

Server side: stateless services

Diagram: caching proxies (O(100) nodes per server) with HTTP failover, and mirror servers (O(10) data centers per server) selected via Geo-IP
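On the client side, this proxy and mirror chain is plain configuration. A minimal sketch with illustrative host names ('|' separates load-balanced proxies, ';' separates failover groups):

# /etc/cvmfs/default.local (illustrative values)
CVMFS_HTTP_PROXY="http://squid1.example.org:3128|http://squid2.example.org:3128;DIRECT"
# @fqrn@ expands to the repository name; second mirror host is made up
CVMFS_SERVER_URL="http://cvmfs-stratum-one.cern.ch/cvmfs/@fqrn@;http://cvmfs-s1.example.org/cvmfs/@fqrn@"
CVMFS_USE_GEOAPI=yes   # let the servers sort the mirror list by network distance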

jblomer@cern.ch CernVM-FS / MSST’19 11 / 20

SLIDE 21

High-Availability by Horizontal Scaling

Server side: stateless services

Diagram: caching proxies (O(100) nodes per server) and mirror servers (O(10) data centers per server), with HTTP failover

jblomer@cern.ch CernVM-FS / MSST’19 11 / 20

SLIDE 22

High-Availability by Horizontal Scaling

Server side: stateless services

Diagram: worker nodes with a pre-populated cache in the data center, caching proxies (O(100) nodes per server), and mirror servers (O(10) data centers per server)

jblomer@cern.ch CernVM-FS / MSST’19 11 / 20

SLIDE 23

Reading

Diagram: read path on a worker node: basic system utilities → OS kernel / Fuse → CernVM-FS, backed by an in-memory buffer (∼1 GB), a persistent local cache (∼20 GB), and the repository (HTTP or S3) behind a global HTTP cache hierarchy, with all versions available (∼10 TB)

  • Fuse based, independent mount points, e. g. /cvmfs/atlas.cern.ch
  • High cache efficiency because the entire cluster is likely to use the same software
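A minimal client-side session might look like this (sketch; assumes the cvmfs packages and the repository configuration are already installed):

$ sudo cvmfs_config setup              # set up autofs so /cvmfs/* mounts on demand
$ ls /cvmfs/atlas.cern.ch              # first access triggers the Fuse mount
$ cvmfs_config probe atlas.cern.ch     # verify the repository is reachable
$ cvmfs_config stat -v atlas.cern.ch   # cache size, hit rate, active proxy, ...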

jblomer@cern.ch CernVM-FS / MSST’19 12 / 20

SLIDE 24

Writing

Diagram: write path: a union file system combines the read-only CernVM-FS volume with a read/write staging area behind a read/write interface; published content goes to the backing file system or S3

Publishing new content:

[ ~ ]# cvmfs_server transaction containers.cern.ch
[ ~ ]# cd /cvmfs/containers.cern.ch && tar xvf ubuntu1610.tar.gz
[ ~ ]# cvmfs_server publish containers.cern.ch
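A transaction that should not go public can also be discarded instead of published (same repository name as above):

[ ~ ]# cvmfs_server abort containers.cern.ch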

jblomer@cern.ch CernVM-FS / MSST’19 13 / 20

SLIDE 25

Use of Content-Addressable Storage

Repository

Diagram: a repository tree such as /cvmfs/alice.cern.ch/amd64-gcc6.0/4.2.0/ChangeLog, . . . is mapped onto an object store plus file catalogs via compression and hashing (e.g. object 806fbb67373e9...)

⊕ Immutable files, trivial to check for corruption, versioning, efficient replication
− Compute-intensive, garbage collection required

Object Store

  • Compressed files and chunks
  • De-duplicated

File Catalog

  • Directory structure, symlinks
  • Content hashes of regular files
  • Large files: chunked with rolling checksum
  • Digitally signed
  • Time to live
  • Partitioned / Merkle hashes (possibility of sub-catalogs)
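As a rough illustration of the content-addressed layout (a sketch only, not the actual CernVM-FS object format): each file is compressed and stored under a path derived from its content hash, which is what gives de-duplication, trivial corruption checks, and cheap replication.

$ f=libExample.so                          # hypothetical input file
$ h=$(sha1sum "$f" | cut -d' ' -f1)        # content hash
$ mkdir -p objects/${h:0:2}
$ gzip -c "$f" > "objects/${h:0:2}/${h:2}" # compressed object, addressed by hash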

jblomer@cern.ch CernVM-FS / MSST’19 14 / 20

SLIDE 26

Partitioning of Meta-Data

Diagram: example repository tree partitioned into sub-catalogs, e.g. aarch64/, x86_64/ (gcc v8.3, Python v3.4), certificates/

  • Locality by software version
  • Locality by frequency of changes
  • Partitioning is up to the software librarian, steered through .cvmfscatalog magical marker files (see the sketch below)
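For illustration (repository name and paths are made up), a librarian nests catalogs simply by placing marker files inside a transaction:

$ touch /cvmfs/sw.example.org/x86_64/gcc-v8.3/.cvmfscatalog
$ touch /cvmfs/sw.example.org/x86_64/python-v3.4/.cvmfscatalog

A .cvmfsdirtab file at the repository root can automate the same thing with path patterns.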

jblomer@cern.ch CernVM-FS / MSST’19 15 / 20

SLIDE 27

Deduplication and Compression

Chart: 24 months of software releases for a single LHC experiment: number of file system entries (·10^6) and volume [GB] for all entries vs. regular files, with and without duplicates, and after compression

jblomer@cern.ch CernVM-FS / MSST’19 16 / 20

SLIDE 28

Site-local network traffic: CernVM-FS compared to NFS

Charts: NFS server traffic before and after the switch, and site Squid web cache traffic before and after the switch

Source: Ian Collier

jblomer@cern.ch CernVM-FS / MSST’19 17 / 20

SLIDE 29

Latency sensitivity: CernVM-FS compared to AFS

Use case: starting “stressHepix” standard benchmark

Chart: startup overhead ∆t [min] and throughput [Mbit/s] as a function of round trip time [ms] (from LAN up to 150 ms), comparing AFS and CernVM-FS

jblomer@cern.ch CernVM-FS / MSST’19 18 / 20

SLIDE 30

Principal Application Areas

❶ Production Software

   Example: /cvmfs/ligo.egi.eu
   Most mature use case
   ★ Fully unprivileged deployment of Fuse module

❷ Integration Builds

   Example: /cvmfs/lhcbdev.cern.ch
   High churn, requires regular garbage collection
   ★ Update propagation from minutes to seconds

❸ Unpacked Container Images

   Example: /cvmfs/singularity.opensciencegrid.org
   Works out of the box with Singularity; CernVM-FS driver for Docker
   ★ Integration with containerd / kubernetes

❹ Auxiliary data sets

   Example: /cvmfs/alice-ocdb.cern.ch
   Benefits from internal versioning
   Depending on volume, requires more planning for the CDN components

★ Current focus of developments

jblomer@cern.ch CernVM-FS / MSST’19 19 / 20

SLIDE 31

Summary

  • CernVM-FS: special-purpose virtual file system that provides a global shared software area for many scientific collaborations
  • Content-addressed storage and asynchronous writing (publishing) are key to meta-data scalability
  • Current areas of development:
    • Fully unprivileged deployment
    • Integration with the containerd/kubernetes image management engine

https://github.com/cvmfs/cvmfs

jblomer@cern.ch CernVM-FS / MSST’19 20 / 20

SLIDE 32

Backup Slides

SLIDE 33

Links

Source code: https://github.com/cvmfs/cvmfs
Downloads: https://cernvm.cern.ch/portal/filesystem/downloads
Documentation: https://cvmfs.readthedocs.org
Mailing list: cvmfs-talk@cern.ch
JIRA bug tracker: https://sft.its.cern.ch/jira/projects/CVM

SLIDE 34

Supported Platforms

  • A platforms:
    • EL 5–7 (soon: 8), AMD64
    • Ubuntu 16.04 and 18.04, AMD64
  • B platforms:
    • macOS, latest two versions
    • SLES 11–12
    • Fedora, latest two versions
    • Debian 8–9
    • Ubuntu 12.04 and 14.04
    • EL7 AArch64
    • IA32 architecture
  • Experimental: POWER, Raspberry Pi, RISC-V
  • Blue-sky idea: Windows based on ProjFS

Coverage is based on current needs; any platform with Fuse support should be straightforward to add.

SLIDE 35

CernVM-FS à la Carte

❶ HPC Client Deployment

  • Piz Daint: Europe’s fastest supercomputer
  • Runs LHC experiment jobs from the native Fuse client
  • Lower cache layer on /gpfs/..., upper cache layer in worker node memory (M. Gila, CSCS)

❷ Data Namespace: /cvmfs/*.osgstorage.org

Diagram: the file namespace is grafted into CernVM-FS while the data itself is read from XRootD storage at sites A, B, . . . , with secure POSIX access via HTTPS + client authorization

SLIDE 36

CernVM-FS à la Carte

❶ HPC Client Deployment

  • Piz Daint: Europe’s fastest supercomputer
  • Runs LHC experiment jobs from the native Fuse client
  • Lower cache layer on /gpfs/..., upper cache layer in worker node memory (M. Gila, CSCS)

❷ Data Namespace: /cvmfs/*.osgstorage.org

Diagram: the file namespace is grafted into CernVM-FS while the data itself is read from XRootD storage at sites A, B, . . . , with secure POSIX access via HTTPS + client authorization

Custom client cache: hierarchical cache + RAM disk plugin

SLIDE 37

CernVM-FS à la Carte

❶ HPC Client Deployment

  • Piz Daint: Europe’s fastest supercomputer
  • Runs LHC experiment jobs from the native Fuse client
  • Lower cache layer on /gpfs/..., upper cache layer in worker node memory (M. Gila, CSCS)

❷ Data Namespace: /cvmfs/*.osgstorage.org

Diagram: the file namespace is grafted into CernVM-FS while the data itself is read from XRootD storage at sites A, B, . . . , with secure POSIX access via HTTPS + client authorization

Custom client cache: hierarchical cache + RAM disk plugin; access to multiple PB of data

SLIDE 38

CernVM-FS à la Carte

❶ HPC Client Deployment

  • Piz Daint: Europe’s fastest supercomputer
  • Runs LHC experiment jobs from the native Fuse client
  • Lower cache layer on /gpfs/..., upper cache layer in worker node memory (M. Gila, CSCS)

❷ Data Namespace: /cvmfs/*.osgstorage.org

Diagram: the file namespace is grafted into CernVM-FS while the data itself is read from XRootD storage at sites A, B, . . . , with secure POSIX access via HTTPS + client authorization

Custom client cache: hierarchical cache + RAM disk plugin; access to multiple PB of data; authentication plugin + high-bandwidth CDN

SLIDE 39

Distributed Publishing

Diagram: publisher nodes (via ssh or a CI slave) talk to a gateway, which populates an S3 bucket (Ceph, AWS, . . . ) behind the HTTP CDN

Coordinating Multiple Publisher Nodes

  • Concurrent publisher nodes access storage through gateway services
  • Gateway services:
    • Provide an API for publishing
    • Issue leases for sub-paths
    • Receive change sets as sets of signed object packs
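With a gateway in place, a publisher node works much like a single-node setup, except that the transaction names a sub-path and thereby takes a lease only on that part of the namespace (sketch; repository and path are examples):

$ cvmfs_server transaction example.cern.ch/nightlies/build-1234
# ... write files under /cvmfs/example.cern.ch/nightlies/build-1234 ...
$ cvmfs_server publish example.cern.ch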

SLIDE 40

Notification Service

Fast distribution channel for the repository manifest: useful for CI pipelines, data QA

Diagram: the publisher announces new manifests to the notification service (cvmfs_swissknife notify -p ... over AMQP/HTTP); clients subscribe via WebSocket, configured with CVMFS_NOTIFICATION_SERVER=...

  • Optional service supporting a regular repository
  • Subscribe component integrated with the client, automatic reload on changes

→ CernVM-FS writing remains asynchronous but with fast response time in O(seconds)

SLIDE 41

Unpacked Container Images in CernVM-FS

CernVM-FS Container Integration

  • Goal: avoid network congestion by starting unpacked containers from CernVM-FS
  • Client / worker node: requires CernVM-FS plug-ins for
  • Docker (available)
  • containerd (in contact with upstream developers)
  • CernVM-FS repository: efficient publishing of containers

Container Publishing Service: add-on service on the publisher node that facilitates container conversion from a Docker registry

SLIDE 42

Custom Docker Graph Driver I

Diagram: on the host machine, the Docker client and daemon use a graphdriver plugin (via the plugin API) that embeds a CVMFS client and an S3 client; images come from the Docker registry over the internet, while unpacked layers live in a CVMFS repository (S3 / HTTP)

Regular image (AUFS): read-write layer on top of local read-only layers
Thin image: read-write layer on top of a thin image layer plus read-only layers on CVMFS

“Thin image”: an empty layer containing only a descriptor that points to the actual unpacked layers in /cvmfs

SLIDE 43

Custom Docker Graph Driver II

λ docker plugin install cvmfs/graphdriver
λ docker run cvmfs/thin_cernvm "echo Hello, World!"

See https://cvmfs.readthedocs.io/en/latest/cpt-graphdriver.html

SLIDE 44

Container Publishing Service: Workflow

Workflow: push Docker image → merge request to the wishlist → unpack and publish → push thin image

Wishlist: https://gitlab.cern.ch/unpacked/sync

version: 1
user: cvmfsunpacker
cvmfs_repo: 'unpacked.cern.ch'
output_format: >
  https://gitlab-registry.cern.ch/unpacked/sync/$(image)
input:
  - 'https://registry.hub.docker.com/library/fedora:latest'
  - 'https://registry.hub.docker.com/library/debian:stable'
  - 'https://registry.hub.docker.com/library/centos:latest'

/cvmfs/unpacked.cern.ch

# Singularity
/registry.hub.docker.com/fedora:latest -> \
    /cvmfs/unpacked.cern.ch/.flat/d0/d0932...
# Docker with thin image
/.layers/f0/1af7...
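As a usage sketch (the image path follows the layout above; the exact image and command are examples), Singularity can start such an unpacked image directly from /cvmfs:

$ singularity exec /cvmfs/unpacked.cern.ch/registry.hub.docker.com/fedora:latest cat /etc/os-release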

SLIDE 45

Enabling Feature for Container Publishing: Tarball Ingestion

Direct path for the common pattern of publishing tarball contents

Classic publish:
$ cvmfs_server transaction
$ tar -xf ubuntu.tar.gz
$ cvmfs_server publish

Tarball ingestion:
$ zcat ubuntu.tar.gz | cvmfs_server ingest -t -

Diagram: publish::SyncUnion implementations: Overlayfs, Aufs, and Tarball; each combines the CernVM-FS read-only view with a read/write scratch area

Uses libarchive: support for rpm, zip, etc. could be easily added

Performance example: Ubuntu 18.04 container, 4 GB in 250 k files: 56 s untar + 1 min publish vs. 74 s ingest
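Ingestion can also target a sub-directory of the repository rather than its root; a sketch based on the documented long options (flag names should be double-checked against the cvmfs_server help, repository name is an example):

$ cvmfs_server ingest --tar_file ubuntu.tar.gz --base_dir containers/ubuntu1804/ containers.cern.ch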

SLIDE 46

Directory Organization

Chart: fraction of files [%] vs. directory depth for Athena 17.0.1, CMSSW 4.2.4, and LCG Externals R60

Typical non-LHC software: majority of files in directory level ≤ 5

SLIDE 47

Cumulative File Size Distribution

Chart: cumulative file size distribution (percentile vs. file size [B], from 2^4 to 2^18) for ATLAS, LHCb, ALICE, CMS, UNIX, web server, and requested files

  • cf. Tanenbaum et al. 2006 for “Unix” and “Webserver”
SLIDE 48

LHC Data Flow

Source: Harvey et al.

SLIDE 49

Selected publications i

Blomer, J., Aguado-Sanchez, C., Buncic, P., and Harutyunyan, A. (2011). Distributing LHC application software and conditions databases using the CernVM file system. Journal of Physics: Conference Series, 331(042003).

Blomer, J., Buncic, P., Meusel, R., Ganis, G., Sfiligoi, I., and Thain, D. (2015). The evolution of global scale filesystems for scientific software distribution. Computing in Science and Engineering, 17(6):61–71.

Weitzel, D., Bockelman, B., Dykstra, D., Blomer, J., and Meusel, R. (2017). Accessing data federations with CVMFS. Journal of Physics: Conference Series, 898(062044).

SLIDE 50

Selected publications ii

Blomer, J., Ganis, G., Hardi, N., and Popescu, R. (2017). Delivering LHC Software to HPC Compute Elements with CernVM-FS, pages 724–730. Number 10524 in Lecture Notes in Computer Science. Springer.

Hardi, N., Blomer, J., Ganis, G., and Popescu, R. (2018). Making containers lazy with Docker and CernVM-FS. Journal of Physics: Conference Series, 1085.