  1. Are objects the right level of abstraction to enable the convergence between HPC and Big Data at storage level? Pierre Matri*, Alexandru Costan ✝ , Gabriel Antoniu ◇ , Jesús Montes*, María S. Pérez *, Luc Bougé ◇ * Universidad Politécnica de Madrid, Madrid, Spain — ✝ INSA Rennes / IRISA, Rennes, France — ◇ Inria Rennes Bretagne-Atlantique, Rennes, France

  2. A Catalyst for Convergence: Data Science 1

  3. An Approach: The BigStorage H2020 Project. Can we build a converged storage system for HPC and Big Data?

  4. The BigStorage Consortium 3

  5. One concern 4

  6. HPC App 5

  7. HPC App (POSIX) File System 6

  8. HPC App (POSIX) File System

  9. Folder / file hierarchies, permissions, supports random reads and writes to files, atomic file renaming, multi-user protection.

  10. Supports random reads and writes to files 8

  11. Supports random reads and writes to files Objects 8

  12. HPC App Object Storage System 9

  13. HPC App Big Data App Object Storage System 10

  14. HPC App Big Data App Object Storage System K/V Store DB FS 10

  15. Big Data App HPC App K/V Store DB Object Storage System FS 10

  16. Big Data App HPC App K/V Store DB Object Storage System Object Storage System 10

  17. Big Data App HPC App K/V Store DB Converged Object Storage System 11

  18. A Big Data use case: MonALISA, the monitoring platform of the CERN LHC ALICE experiment.

  19. One problem… A scientific monitoring service, monitoring the ALICE experiment at the CERN LHC: - ingests events at a rate of up to 16 GB/s - produces more than 10⁹ data files per year - computes 35,000+ aggregates in real time. The current lock-based platform does not scale. …multiple requirements: - multi-object write synchronization support - atomic, lock-free writes - high-performance reads - horizontal scalability

  20. Why is write synchronization needed? Aggregate computation is a three-step operation: 1. read the current value remotely from storage (read(count)), 2. update it with the new data, 3. write the updated value remotely to storage (write(count, 6)). The aggregate update needs to be atomic (transactions). Also, adding new data to persistent storage and updating the related aggregates needs to be performed atomically as well. (Diagram: a client issuing read, write and sync calls against the object storage system.)
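
To make the race concrete, here is a minimal, self-contained Python sketch of the three-step read-update-write aggregate update losing increments under concurrency, contrasted with an atomic variant. The in-memory store, the lock, and all function names are illustrative stand-ins, not the MonALISA or Týr APIs.

```python
import sys
import threading

# A toy in-memory "object store" used only to illustrate the race; the lock
# below stands in for a storage-side transaction.
sys.setswitchinterval(1e-6)   # make thread interleavings more likely in the demo

store = {"count": 0}
store_lock = threading.Lock()

def read_update_write(increment):
    # Non-atomic aggregate update: two clients interleaving between the read
    # and the write can silently lose an update.
    current = store["count"]        # 1. read the current value from storage
    updated = current + increment   # 2. update it with the new data
    store["count"] = updated        # 3. write the updated value back

def atomic_update(increment):
    # Atomic variant: the whole read-update-write cycle is one indivisible step.
    with store_lock:
        store["count"] += increment

def run(update_fn, n_clients=8, n_ops=20_000):
    store["count"] = 0
    threads = [threading.Thread(target=lambda: [update_fn(1) for _ in range(n_ops)])
               for _ in range(n_clients)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return store["count"]

if __name__ == "__main__":
    expected = 8 * 20_000
    # The first line typically prints less than `expected` (lost updates);
    # the second always matches it.
    print("read-update-write:", run(read_update_write), "/", expected)
    print("atomic update:    ", run(atomic_update), "/", expected)
```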

  21. At which level to handle concurrency management? 15

  22. At the application level? Enables fine-grained synchronization (application knowledge)… but significantly complicates application design, and typically only guarantees isolation. At a middleware level (a synchronization layer between the application threads and the object storage system)? Eases application design… but has a performance cost (zero knowledge), and usually also only guarantees isolation. At the storage level (a transactional object storage system)? Also eases application design, offers better performance than middleware (storage knowledge), and may offer additional consistency guarantees.

  23. Aren’t existing transactional object stores enough? 17

  24. Not quite. Existing transactional systems typically only ensure the consistency of writes. In most current systems, reads are performed atomically only because objects are small enough to be located on a single server, i.e. records for database systems, values for key-value stores. Yet, for large objects, reads spanning multiple chunks should always return a consistent view.
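
As a minimal illustration of this torn-read problem, the sketch below splits one logical object into two chunks stored on different "servers" and lets a reader fetch them chunk by chunk while a non-transactional writer is halfway through an update. The chunk layout and function names are purely illustrative assumptions, not the API of any real store.

```python
# Two chunks of the same logical object, stored on different servers.
chunk_servers = {0: "old-A", 1: "old-B"}

def write_object(new_a, new_b):
    # Without a transaction, the two chunk writes are independent steps;
    # yielding between them models a writer that has updated only chunk 0.
    chunk_servers[0] = new_a
    yield "chunk 0 written"
    chunk_servers[1] = new_b
    yield "chunk 1 written"

def read_object():
    # A reader that fetches chunks one by one, with no consistency guarantee.
    return chunk_servers[0] + "/" + chunk_servers[1]

writer = write_object("new-A", "new-B")
next(writer)           # the writer has updated chunk 0 only
print(read_object())   # "new-A/old-B": a torn view mixing two versions
next(writer)
print(read_object())   # "new-A/new-B": consistent again
```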

  25. Týr transactional design. Týr internally maps all writes to transactions: - multi-chunk, and even multi-object, operations are processed in a serializable order - all chunk replicas are kept consistent. Týr uses a high-performance, sequentially consistent transaction chain algorithm: Warp [1]. [1] R. Escriva et al. – Warp: Lightweight Multi-Key Transactions for Key-Value Stores
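
As a rough sketch of what "all writes map to transactions" buys the client, here is a toy store in which a multi-object write is buffered and made visible in a single step, so no reader observes a partial update. It only illustrates the visibility guarantee under assumed names (ToyStore, ToyTransaction); it does not implement the Warp transaction-chain protocol that Týr actually uses.

```python
import threading

class ToyStore:
    def __init__(self):
        self._objects = {}
        self._commit_lock = threading.Lock()   # stands in for the real protocol

    def transaction(self):
        return ToyTransaction(self)

    def read(self, *names):
        # Reads see either none or all of a committed transaction's writes.
        with self._commit_lock:
            return {n: self._objects.get(n) for n in names}

class ToyTransaction:
    def __init__(self, store):
        self._store = store
        self._writes = {}

    def write(self, name, value):
        self._writes[name] = value             # buffered until commit

    def commit(self):
        with self._store._commit_lock:         # all writes applied as one step
            self._store._objects.update(self._writes)

store = ToyStore()
tx = store.transaction()
tx.write("events/2024-01", b"...")
tx.write("aggregates/count", 42)
tx.commit()
print(store.read("events/2024-01", "aggregates/count"))
```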

  26. Týr is alive! Fully implemented as a prototype in ~22,000 lines of C. Lock-free, queue-free, asynchronous design. Leveraging well-known technologies: - Google LevelDB [1] for node-local persistent storage - Google FlatBuffers [2] for message serialization - UDT [3] as the network transfer protocol. [1] http://leveldb.org/ [2] https://google.github.io/flatbuffers [3] http://udt.sourceforge.net/

  27. Týr evaluation with MonALISA. MonALISA data collection was re-implemented atop Týr and evaluated using real data. Týr was compared to other state-of-the-art, object-based storage systems: - RADOS / librados (Ceph) - Azure Storage Blobs (Microsoft) - BlobSeer (Inria). Experiments were run on the Microsoft Azure cloud, on up to 256 nodes, with a 3× replication factor for all systems.

  28. Synchronized write performance: evaluating transactional write performance. We add fine-grained, application-level, lock-based synchronization to Týr's competitors. The performance of Týr's competitors decreases due to this synchronization cost. Clear advantage of atomic operations over read-update-write aggregate updates. (Chart: average throughput in millions of ops/s for 25 to 500 concurrent writers, comparing Týr with atomic operations, Týr with read-update-write, and RADOS, BlobSeer and Azure Blobs with synchronization.)

  29. Read performance. We simulate MonALISA reads, varying the number of concurrent readers. Slightly lower performance than RADOS, but Týr offers read consistency guarantees. Týr's lightweight read protocol allows it to outperform BlobSeer and Azure Storage. (Chart: average throughput in millions of ops/s for 25 to 500 concurrent readers, comparing Týr, RADOS, BlobSeer and Azure Blobs.)

  30. The next step: Big Data and HPC applications over an RDB and a K/V store, layered atop a converged object storage system. Týr as a base layer for higher-level storage abstractions? Týr for HPC applications?

  31. Before that: A study of feasibility 25

  32. Current storage stack (diagram): HPC App / Big Data App → I/O library calls / BD framework calls → I/O Library / Big Data Framework → POSIX-like calls → HPC PFS / Big Data DFS.

  33. “Converged” storage stack (diagram): HPC App / Big Data App → I/O library calls / BD framework calls → I/O Library / Big Data Framework → POSIX-like calls → HPC Adapter / Big Data Adapter → object-based storage calls → Converged Object Storage System.

  34. Object-oriented primitives: - Object access: random object read, object size - Object manipulation: random object write, truncate - Object administration: create object, delete object - Namespace access: scan all objects. These operations are similar to those permitted by the POSIX-IO API on a single file. Directory-level operations have no object-based storage counterpart (due to the flat nature of these systems); they are few in number and can be emulated using the scan operation (far from optimal, but compensated by the gains of a flat namespace and simpler semantics); see the rewrite rules and the sketch below.

  35. Representative set of HPC/BD applications:
      | Platform    | Application         | Usage                | Total reads | Total writes | R/W ratio | Profile         |
      | HPC/MPI     | mpiBLAST            | Protein docking      | 27.7 GB     | 12.8 MB      | 2.1*10^3  | Read-intensive  |
      | HPC/MPI     | MOM                 | Oceanic model        | 19.5 GB     | 3.2 GB       | 6.01      | Read-intensive  |
      | HPC/MPI     | ECOHAM              | Sediment propagation | 0.4 GB      | 9.7 GB       | 4.2*10^-2 | Write-intensive |
      | HPC/MPI     | Ray Tracing         | Video processing     | 67.4 GB     | 71.2 GB      | 0.94      | Balanced        |
      | Cloud/Spark | Sort                | Text processing      | 5.8 GB      | 5.8 GB       | 1.00      | Balanced        |
      | Cloud/Spark | Connected Component | Graph processing     | 13.1 GB     | 71.2 MB      | 0.18      | Read-intensive  |
      | Cloud/Spark | Grep                | Text processing      | 55.8 GB     | 863.8 MB     | 64.52     | Read-intensive  |
      | Cloud/Spark | Decision Tree       | Machine learning     | 59.1 GB     | 4.7 GB       | 12.58     | Read-intensive  |
      | Cloud/Spark | Tokenizer           | Text processing      | 55.8 GB     | 235.7 GB     | 0.24      | Write-intensive |

  36. (Figure.)

  37. Operation rewriting:
      | Original operation | Rewritten operation |
      | create(/foo/bar)   | create(/foo__bar) |
      | open(/foo/bar)     | open(/foo__bar) |
      | read(fd)           | read(bd) |
      | write(fd)          | write(bd) |
      | mkdir(/foo)        | dropped operation |
      | opendir(/foo)      | scan(/), return all files matching /foo__* |
      | rmdir(/foo)        | scan(/), remove all files matching /foo__* |
      Directory-level operation counts:
      | Operation                      | Action              | Operation count |
      | mkdir                          | Create directory    | 43 |
      | rmdir                          | Remove directory    | 43 |
      | opendir (input data directory) | Open/List directory | 5 |
      | opendir (other directories)    | Open/List directory | 0 |
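
A small sketch of how the rewrites in the table above might be emulated on a flat object namespace, following the /foo/bar → /foo__bar convention and the scan-based opendir/rmdir. The dict-backed store and helper names are hypothetical, not the Týr or RADOS client APIs.

```python
flat_store = {}   # object name -> data, a stand-in for the flat object namespace

def flatten(path):
    # /foo/bar -> /foo__bar, as in the rewrite rules above
    return "/" + path.strip("/").replace("/", "__")

def create(path, data=b""):
    # mkdir has no counterpart and is simply dropped, per the table
    flat_store[flatten(path)] = data

def scan(prefix):
    # Namespace-access primitive: enumerate objects matching a prefix.
    return [name for name in flat_store if name.startswith(prefix)]

def opendir(path):
    # opendir(/foo) -> scan(/), return all objects matching /foo__*
    return scan(flatten(path) + "__")

def rmdir(path):
    # rmdir(/foo) -> scan(/), remove all objects matching /foo__*
    for name in opendir(path):
        del flat_store[name]

create("/input/part-0")
create("/input/part-1")
print(opendir("/input"))   # ['/input__part-0', '/input__part-1']
rmdir("/input")
print(opendir("/input"))   # []
```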

  38. (Figure.)

  39. Týr and RADOS vs. Lustre (HPC) and HDFS/CephFS (Big Data): - Grid’5000 experimental testbed, distributed over 11 sites in France and Luxembourg (parapluie cluster, Rennes) - nodes with 2 × 12-core 1.7 GHz AMD Opteron 6164 HE CPUs, 48 GB of RAM, and a 250 GB HDD - HPC applications: Lustre 2.9.0 and MPICH 3.2, on a 32-node cluster - Big Data applications: Spark 2.1.0, Hadoop / HDFS 2.7.3 and Ceph Kraken, on a 32-node cluster

  40. HPC applications 34

  41. BD applications 35

  42. HPC/BD applications 36

  43. Conclusions: - Týr is a novel high-performance object-based storage system providing built-in multi-object transactions - Object-based storage convergence is possible, leading to a significant performance improvement on both platforms (HPC and Cloud) - A completion time improvement of up to 25% for Big Data applications and 15% for HPC applications when using object-based storage

  44. Thank you!
