
JAliEn: The New ALICE High-Performance and High-Scalability Grid Framework



1. JALIEN: THE NEW ALICE HIGH-PERFORMANCE AND HIGH-SCALABILITY GRID FRAMEWORK
A Large Ion Collider Experiment (ALICE)
Miguel.Martinez.Pedreira@cern.ch | Track 3: Distributed Computing | 12/07/2018

2. COMPUTING CHALLENGE
§ Factor of 10 increase in CPU and data usage from 2011 to 2017

3. LOOKING TO THE FUTURE
§ Yearly growth of the maintained resources (expected to continue)
§ + the O2 facility for synchronous and asynchronous data processing (60 PB and 100K cores at the beginning of Run 3)
§ The scalability of the software used in the past 10+ years is under question
§ Decision to rewrite the entire ALICE high-level Grid services stack

4. TECHNOLOGIES SET
§ Code and object cache
§ WebSocket: communication for ROOT and end clients
§ DB backends: file Catalogue, TaskQueue, TransferQueue
§ X.509: security/certificates model
§ Software and calibration data distribution
§ Data access protocol

5. CENTRAL SERVICES
[Diagram: clients talk to jCentral over WebSocketS + X.509 and over Java-serialized SSL + X.509; a JAliEn X.509 Certification Authority issues the certificates]
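A minimal sketch of a JSR-356 WebSocket client may help picture the WebSocketS channel to jCentral shown in the diagram. The endpoint URI, the command payload and the class name are assumptions for illustration, not the real JAliEn wire protocol, and a JSR-356 implementation (e.g. Tyrus) must be on the classpath at runtime.

    import java.net.URI;
    import javax.websocket.ClientEndpoint;
    import javax.websocket.ContainerProvider;
    import javax.websocket.OnMessage;
    import javax.websocket.OnOpen;
    import javax.websocket.Session;
    import javax.websocket.WebSocketContainer;

    @ClientEndpoint
    public class JCentralClient {

        @OnOpen
        public void onOpen(Session session) throws Exception {
            // In JAliEn the TLS layer would present the client's X.509
            // (token) certificate; here we just send a first command.
            session.getBasicRemote().sendText("{\"command\":\"whoami\"}");
        }

        @OnMessage
        public void onMessage(String reply) {
            System.out.println("jCentral replied: " + reply);
        }

        public static void main(String[] args) throws Exception {
            WebSocketContainer container = ContainerProvider.getWebSocketContainer();
            // Hypothetical endpoint; the real host and path belong to the deployment.
            container.connectToServer(JCentralClient.class,
                    URI.create("wss://jcentral.example.cern.ch:8097/websocket"));
            Thread.sleep(2000); // wait briefly for the reply in this demo
        }
    }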

6. EASY TO DEPLOY AND SCALE
§ Unique jar to deploy anywhere
§ Each Central Service (jCentral) instance has the full functionality
§ Hierarchical application configuration: files, defaults, database
§ Simplified dependencies: Java, Xrootd, deployed on CVMFS
§ Previous framework: perl + packages, xrootd, openssl, httpd, C-Perl bindings, swig, libxml2, zlib, ncurses, gsoap, classad, ...!
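A minimal sketch of the hierarchical configuration lookup mentioned above, assuming a simple precedence of database overrides over configuration files over built-in defaults; the class and method names are illustrative, not JAliEn's actual API.

    import java.util.Map;
    import java.util.Optional;
    import java.util.Properties;

    public class LayeredConfig {
        private final Properties defaults;          // shipped inside the jar
        private final Properties fileConfig;        // site/user configuration files
        private final Map<String, String> dbConfig; // central database overrides

        public LayeredConfig(Properties defaults, Properties fileConfig,
                             Map<String, String> dbConfig) {
            this.defaults = defaults;
            this.fileConfig = fileConfig;
            this.dbConfig = dbConfig;
        }

        /** Database wins over files, files win over built-in defaults. */
        public Optional<String> get(String key) {
            if (dbConfig.containsKey(key))
                return Optional.of(dbConfig.get(key));
            String fromFile = fileConfig.getProperty(key);
            if (fromFile != null)
                return Optional.of(fromFile);
            return Optional.ofNullable(defaults.getProperty(key));
        }
    }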

7. AUTHENTICATION AND AUTHORIZATION
§ Storage: keep the current model in ALICE (10+ years)
  § Signed envelopes created by the Central Services
  § Each envelope allows for a unique user-file-operation combination
  § Central Services and Storages decoupled
§ Client/server: new X.509 Token Certificates
  § Full-fledged X.509 provided by the JAliEn CA and created by the Central Services
  § Fine-grained capabilities assigned to each token, mapping the operations and file access allowed
  § E.g. a Pilot Token can only do job matching
§ Following closely the discussions and recommendations of the WLCG Authz WG
§ Full details of the security model in V. Yurchenko's poster
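A minimal sketch of the signed-envelope idea: the Central Services sign a user-file-operation tuple, and a storage element can verify it with only the public key, without contacting the catalogue. The envelope field layout and key handling below are assumptions for illustration; the real envelope format is JAliEn-specific.

    import java.nio.charset.StandardCharsets;
    import java.security.KeyPair;
    import java.security.KeyPairGenerator;
    import java.security.PrivateKey;
    import java.security.PublicKey;
    import java.security.Signature;
    import java.util.Base64;

    public class EnvelopeDemo {

        static String sign(PrivateKey key, String payload) throws Exception {
            Signature sig = Signature.getInstance("SHA256withRSA");
            sig.initSign(key);
            sig.update(payload.getBytes(StandardCharsets.UTF_8));
            return Base64.getEncoder().encodeToString(sig.sign());
        }

        static boolean verify(PublicKey key, String payload, String signature) throws Exception {
            Signature sig = Signature.getInstance("SHA256withRSA");
            sig.initVerify(key);
            sig.update(payload.getBytes(StandardCharsets.UTF_8));
            return sig.verify(Base64.getDecoder().decode(signature));
        }

        public static void main(String[] args) throws Exception {
            KeyPairGenerator gen = KeyPairGenerator.getInstance("RSA");
            gen.initialize(2048);
            KeyPair central = gen.generateKeyPair(); // stands in for the Central Services key

            // One envelope authorizes exactly one user-file-operation combination.
            String envelope = "user=msmith&lfn=/alice/sim/run3/file.root&op=read";
            String signature = sign(central.getPrivate(), envelope);

            // The storage element only needs the public key to check authorization.
            System.out.println("valid: " + verify(central.getPublic(), envelope, signature));
        }
    }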

8. PILOT IMPLEMENTATION
§ As discussed also in the WLCG Containers WG
§ Batch Queue → startup script with embedded pilot token
  § 1-slot queue → start one pilot per slot; full-node queue → start one pilot
§ JobAgent instance (potentially containerized), identity C=ch/O=AliEn/CN=JobAgent
  § Calls getJob() and receives [jobId, token, JDL]
§ JobWrapper instance started in an isolated environment*, identity C=ch/O=AliEn/CN=Jobs/CN=<owner>/OU=<owner>/OU=<jobId>
  § Runs the payload; all job and monitoring calls go over WebSocketS
§ * Can be a simple wrapper script or container/singularity
§ Details on container utilization in ALICE in M. Storetvedt's presentation
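A minimal sketch of the JobAgent → JobWrapper handoff described on the slide: the agent matches a job under its pilot-token identity, then launches the payload in an isolated process under the job-specific token identity. All types, the getJob() stub and the environment variable are placeholders for illustration, not the real JAliEn API.

    public class JobAgentSketch {

        /** What the agent receives from job matching: [jobId, token, JDL]. */
        record MatchedJob(long jobId, String token, String jdl) {}

        // Placeholder for the central job-matching call the agent performs
        // with its pilot-token identity (C=ch/O=AliEn/CN=JobAgent).
        static MatchedJob getJob() {
            return new MatchedJob(42L, "job-scoped-token", "Executable = \"sim.sh\";");
        }

        public static void main(String[] args) throws Exception {
            MatchedJob job = getJob();

            // The JobWrapper runs with the per-job token identity
            // (C=ch/O=AliEn/CN=Jobs/CN=<owner>/OU=<owner>/OU=<jobId>) in an
            // isolated environment; a container/singularity launch could be
            // substituted for the plain process below.
            ProcessBuilder wrapper = new ProcessBuilder("java", "JobWrapper");
            wrapper.environment().put("JALIEN_JOB_TOKEN", job.token()); // hypothetical variable
            wrapper.inheritIO();
            int rc = wrapper.start().waitFor();
            System.out.println("JobWrapper for job " + job.jobId() + " exited with " + rc);
        }
    }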

9. SCALABLE AND RELIABLE BACKENDS
§ Run 3 challenge: 50B entries, O(100K) ops/s
§ MySQL (today):
  § Master + slaves (Slave1, Slave2, ...)
  § Manual sharding: split the file hierarchy into tables
  § Single point of failure
  § Relies on good hardware for performance
  § Today: 15B entries, O(10K) ops/s, 6 TB on disk
§ Cassandra/ScyllaDB:
  § Cluster of peer nodes (N1, N2, N3, N4, ...)
  § Automatic sharding
  § No single point of failure, HA
  § Horizontal scaling, cheap hardware
  § Consistency
  § Paradigm change: SQL to noSQL
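A minimal sketch of reading one catalogue entry through the DataStax Java driver (3.x API), to illustrate the noSQL side of the paradigm change: any node can be contacted (no master) and the partitioner shards data automatically. The keyspace, table and column names are invented for illustration, not the real JAliEn schema.

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.ResultSet;
    import com.datastax.driver.core.Row;
    import com.datastax.driver.core.Session;

    public class CatalogueLookup {
        public static void main(String[] args) {
            try (Cluster cluster = Cluster.builder()
                    .addContactPoint("127.0.0.1") // any node; there is no master
                    .build();
                 Session session = cluster.connect("catalogue")) {

                // If the directory path is the partition key, the entries of
                // one directory land on the same replicas automatically.
                ResultSet rs = session.execute(
                    "SELECT size, owner FROM lfn WHERE path = ? AND name = ?",
                    "/alice/data/2018/", "file.root");

                Row row = rs.one();
                if (row != null)
                    System.out.println(row.getLong("size") + " bytes, owner "
                            + row.getString("owner"));
            }
        }
    }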

10. BENCHMARK RESULTS
§ Workload: mixed (10 read : 1 write), Gauss (5B, 2.5B, 10M)
§ Cassandra and ScyllaDB follow the same global architecture, but the internal implementation is very different:
  § Cassandra: Java (JVM); unaware of kernel/machine hardware; thread-based as a standard application, relies on the kernel for most of the resource management; several sync/lock points
  § ScyllaDB: C++; kernel tuning and hardware probes; splits into 1 DB core per CPU core, splits RAM among DB cores, bypasses the kernel network stack (no syscalls), complex memory management; fully async (polling)
§ Application and schema compatible with both backends
[Chart: throughput in K ops/second, 0-1400, comparing Cassandra read, Cassandra write, Scylla read and Scylla write]
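A minimal sketch of a 10:1 read/write mixed-workload generator like the one behind these results, measuring aggregate operations per second. The doRead/doWrite bodies are stubs; in the real benchmark they would issue CQL statements against the Cassandra or ScyllaDB cluster.

    import java.util.concurrent.ThreadLocalRandom;

    public class MixedWorkload {

        static void doRead()  { /* SELECT against the catalogue table */ }
        static void doWrite() { /* INSERT/UPDATE against the catalogue table */ }

        public static void main(String[] args) {
            final long total = 1_000_000;
            long start = System.nanoTime();

            for (long i = 0; i < total; i++) {
                // 10 reads for every write, chosen randomly to avoid patterns.
                if (ThreadLocalRandom.current().nextInt(11) < 10)
                    doRead();
                else
                    doWrite();
            }

            double seconds = (System.nanoTime() - start) / 1e9;
            System.out.printf("%.0f ops/s%n", total / seconds);
        }
    }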

11. SUMMARY
§ ALICE is looking forward to a major detector and software upgrade in Run 3
§ In addition to the standard 20-30% yearly growth, ALICE introduces the O2 facility for synchronous and asynchronous data processing
§ To cope with the increased capacity and complexity, we have decided to rewrite the top-level ALICE Grid services:
  § employing modern technologies
  § incorporating the best practices discussed in various WLCG WGs
§ The development is well under way and will be ready in time for Run 3
§ For the interested, the JAliEn code repository and support list:
  § https://gitlab.cern.ch/jalien/jalien
  § jalien-support@cern.ch
