SLIDE 1

JALIEN: THE NEW ALICE HIGH-PERFORMANCE AND HIGH-SCALABILITY GRID FRAMEWORK

A Large Ion Collider Experiment

Miguel.Martinez.Pedreira@cern.ch | Track 3: Distributed Computing | 12/07/2018

SLIDE 2

COMPUTING CHALLENGE

A Large Ion Collider Experiment 2 CHEP 2018 | JAliEn: ALICE Grid framework | Miguel Martinez Pedreira

§ Factor-of-10 increase in CPU and data usage from 2011 to 2017

SLIDE 3

LOOKING TO THE FUTURE


§ Yearly growth of the maintained resources (expected to continue)
§ + O2 facility for synchronous and asynchronous data processing (60 PB and 100K cores at the beginning of Run 3)
§ The scalability of the software used for the past 10+ years is in question
§ Decision: rewrite the entire ALICE high-level Grid services stack

SLIDE 4

TECHNOLOGIES SET


§ WebSocket: communication for ROOT and end clients
§ X.509: security/certificates model
§ DB backends: File Catalogue, TaskQueue, TransferQueue
§ Software and calibration data distribution
§ Code and object cache
§ Data access protocol

SLIDE 5

CENTRAL SERVICES


jCentral

§ WebSockets + X.509
§ Java-serialized SSL + X.509
§ X.509 Certification Authority

SLIDE 6

EASY TO DEPLOY AND SCALE

§ Unique jar to deploy anywhere
§ Each Central Service (jCentral) instance has the full functionality
§ Hierarchical application configuration

§ Files, defaults, database

§ Simplified dependencies

§ Java
§ Xrootd
§ Deployed on CVMFS
§ Previous framework: Perl + packages, xrootd, openssl, httpd, C-Perl bindings, SWIG, libxml2, zlib, ncurses, gSOAP, ClassAd, …
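The hierarchical lookup (files, then defaults, then database) described above can be sketched in a few lines of Java. The class and key names here are illustrative, not the actual JAliEn API: layers are queried in priority order and the first match wins.

```java
import java.util.*;

// Illustrative sketch of a hierarchical configuration lookup (hypothetical
// names, not the real JAliEn classes): each layer is queried in order and
// the first layer that defines a key wins.
public class HierarchicalConfig {
    private final List<Map<String, String>> layers = new ArrayList<>();

    // Layers are added in priority order: files first, then defaults, then DB.
    public void addLayer(Map<String, String> layer) {
        layers.add(layer);
    }

    public Optional<String> get(String key) {
        for (Map<String, String> layer : layers) {
            String v = layer.get(key);
            if (v != null) return Optional.of(v);
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        HierarchicalConfig cfg = new HierarchicalConfig();
        cfg.addLayer(Map.of("catalogue.host", "file-host"));          // from a config file
        cfg.addLayer(Map.of("catalogue.host", "default-host",
                            "catalogue.port", "3307"));               // built-in defaults
        System.out.println(cfg.get("catalogue.host").orElse("?"));    // file layer wins
        System.out.println(cfg.get("catalogue.port").orElse("?"));    // falls through to defaults
    }
}
```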


SLIDE 7

AUTHENTICATION AND AUTHORIZATION


§ Storage: keep the current model in ALICE (10+ years)
§ Signed envelopes created by the Central Services
§ Each envelope allows for a unique user-file-operation
§ Central Services and Storages decoupled
§ Client/server: new Token Certificates
§ Full-fledged X.509, provided by the JAliEn CA and created by the Central Services
§ Fine-grained capabilities assigned to each token

§ Map the allowed operations and file accesses
§ E.g. a Pilot Token can only do job matching
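A minimal sketch of the signed-envelope idea, assuming a plain SHA256-with-RSA signature over a user-file-operation string (the envelope fields and encoding here are illustrative, not the real JAliEn format): the Central Services sign the envelope, and a storage element that holds only the public key can verify it without contacting the catalogue.

```java
import java.nio.charset.StandardCharsets;
import java.security.*;
import java.util.Base64;

// Hedged sketch of a signed access envelope: one envelope authorizes one
// user-file-operation triple, and tampering with any field breaks the
// signature. The envelope string format is hypothetical.
public class SignedEnvelope {
    public static String sign(PrivateKey key, String envelope) throws GeneralSecurityException {
        Signature s = Signature.getInstance("SHA256withRSA");
        s.initSign(key);
        s.update(envelope.getBytes(StandardCharsets.UTF_8));
        return Base64.getEncoder().encodeToString(s.sign());
    }

    public static boolean verify(PublicKey key, String envelope, String sigB64) throws GeneralSecurityException {
        Signature s = Signature.getInstance("SHA256withRSA");
        s.initVerify(key);
        s.update(envelope.getBytes(StandardCharsets.UTF_8));
        return s.verify(Base64.getDecoder().decode(sigB64));
    }

    public static void main(String[] args) throws Exception {
        KeyPair kp = KeyPairGenerator.getInstance("RSA").generateKeyPair();
        // One envelope = one user-file-operation.
        String envelope = "user=alice&lfn=/alice/data/file.root&op=read";
        String sig = sign(kp.getPrivate(), envelope);
        System.out.println(verify(kp.getPublic(), envelope, sig));
        // Changing the operation invalidates the signature.
        System.out.println(verify(kp.getPublic(), envelope.replace("read", "write"), sig));
    }
}
```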


Following closely the discussions and recommendations of the WLCG Authz WG. Full details of the security model in:

  • V. Yurchenko’s poster
SLIDE 8

PILOT IMPLEMENTATION

Batch Queue → startup script with embedded pilot token (C=ch/O=AliEn/CN=JobAgent)

§ 1-slot queue → start one pilot per slot
§ Full-node queue → start one pilot

JobAgent instance, started in an isolated environment*:

§ getJob() → [jobId, token, JDL]
§ Per-job token: C=ch/O=AliEn/CN=Jobs/CN=<owner>/OU=<owner>/OU=<jobId>

JobWrapper instance, started in an isolated environment*, potentially containerized:

§ Runs the payload
§ All job and monitoring calls over WebSockets

* Can be a simple wrapper script or a container/Singularity, as also discussed in the WLCG Containers WG

Details on container utilization in ALICE in M. Storetvedt's presentation
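The pilot flow above can be simulated in a few lines of Java. The class names, the record fields, and the getJob() stand-in are hypothetical, not the real JAliEn classes; the record mirrors the getJob() reply of [jobId, token, JDL].

```java
import java.util.*;

// Toy simulation of the pilot flow: the JobAgent loops over getJob(),
// and each matched job is handed to a JobWrapper stand-in that would run
// the payload with the per-job token. All names are illustrative.
public class PilotSketch {
    record MatchedJob(long jobId, String token, String jdl) {}

    // Stand-in for the JobAgent's getJob() call to the central task queue.
    static Optional<MatchedJob> getJob(Deque<MatchedJob> taskQueue) {
        return Optional.ofNullable(taskQueue.poll());
    }

    // Stand-in for the JobWrapper: conceptually started in an isolated
    // environment (wrapper script or container) with the per-job token.
    static void runWrapper(MatchedJob job) {
        System.out.println("running job " + job.jobId() + " as " + job.token());
    }

    public static void main(String[] args) {
        Deque<MatchedJob> queue = new ArrayDeque<>(List.of(
            new MatchedJob(1001, "C=ch/O=AliEn/CN=Jobs/CN=alice/OU=alice/OU=1001", "Executable=\"sim.sh\";"),
            new MatchedJob(1002, "C=ch/O=AliEn/CN=Jobs/CN=alice/OU=alice/OU=1002", "Executable=\"reco.sh\";")));

        // The pilot loops: match a job, start a JobWrapper, repeat until empty.
        Optional<MatchedJob> job;
        while ((job = getJob(queue)).isPresent())
            runWrapper(job.get());
    }
}
```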

SLIDE 9

SCALABLE AND RELIABLE BACKENDS


MySQL (Master, Slave1, Slave2, …):

§ Manual sharding: split the file hierarchy into tables
§ Single point of failure
§ Relies on good hardware for performance
§ Today: 15B entries, O(10K) ops/s, 6 TB on disk
§ Run 3 challenge: 50B entries, O(100K) ops/s

Cassandra/ScyllaDB (N1, N2, N3, N4, …):

§ Automatic sharding
§ No single point of failure, HA
§ Horizontal scaling on cheap hardware
§ Consistency
§ Paradigm change: SQL to NoSQL
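The "automatic sharding" point can be illustrated with a toy hash ring in Java. This is a deliberate simplification (no virtual nodes, no replication) and not Cassandra's actual partitioner: each catalogue key hashes to a position on a ring and is owned by the next node clockwise, so no manual table splits and no single master are needed.

```java
import java.util.*;

// Toy consistent-hash ring: keys and nodes are hashed onto the same ring,
// and a key belongs to the first node at or after its hash position.
// Simplified sketch only; Cassandra's real partitioner differs.
public class HashRing {
    private final TreeMap<Integer, String> ring = new TreeMap<>();

    public void addNode(String node) {
        ring.put(node.hashCode() & 0x7fffffff, node);   // non-negative position
    }

    public String nodeFor(String key) {
        int h = key.hashCode() & 0x7fffffff;
        Map.Entry<Integer, String> e = ring.ceilingEntry(h);
        return (e != null ? e : ring.firstEntry()).getValue(); // wrap around the ring
    }

    public static void main(String[] args) {
        HashRing ring = new HashRing();
        for (String n : List.of("N1", "N2", "N3", "N4")) ring.addNode(n);
        // The same LFN always maps to the same node, with no central master.
        String lfn = "/alice/data/2018/LHC18b/file.root";
        System.out.println(ring.nodeFor(lfn).equals(ring.nodeFor(lfn)));
    }
}
```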

SLIDE 10

BENCHMARK RESULTS


[Chart: Cassandra and ScyllaDB read/write throughput, in K ops/second]

§ Cassandra/ScyllaDB follow the same global architecture
§ The internal implementation is very different

Cassandra:

§ Java (JVM)
§ Unaware of kernel/machine hardware
§ Thread-based as a standard application; relies on the kernel for most resource management
§ Several sync/lock points

ScyllaDB:

§ C++
§ Kernel tuning, hardware probes
§ Splits into 1 DB core per CPU core, splits RAM among DB cores, bypasses the kernel for networking (no syscalls), complex memory management
§ Fully async (polling)

§ Application and schema compatible with both backends

§ Benchmark workload: Mixed (10 read : 1 write), Gauss (5B, 2.5B, 10M)

SLIDE 11

SUMMARY


§ ALICE is looking forward to a major detector and software upgrade in Run 3
§ In addition to the standard 20-30% yearly growth, ALICE introduces the O2 facility for synchronous and asynchronous data processing
§ To cope with the increased capacity and complexity, we have decided to rewrite the top-level ALICE Grid services:

§ employing modern technologies
§ incorporating the best practices discussed in various WLCG WGs

§ The development is well under way and will be ready in time for Run 3
§ For the interested: the JAliEn code repository and support list:

§ https://gitlab.cern.ch/jalien/jalien
§ jalien-support@cern.ch