 
Are objects the right level of abstraction to enable the convergence between HPC and Big Data at storage level?
Pierre Matri*, Alexandru Costan✝, Gabriel Antoniu◇, Jesús Montes*, María S. Pérez*, Luc Bougé◇
* Universidad Politécnica de Madrid, Madrid, Spain
✝ INSA Rennes / IRISA, Rennes, France
◇ Inria Rennes Bretagne-Atlantique, Rennes, France
A Catalyst for Convergence: Data Science
An Approach: The BigStorage H2020 Project
Can we build a converged storage system for HPC and Big Data?
The BigStorage Consortium
One concern
[Diagram: HPC App, (POSIX) File System]
- folder / file hierarchies
- permissions
- Supports random reads and writes to files
- atomic file renaming
- multi-user protection
Supports random reads and writes to files
Objects
[Diagram: HPC App, Object Storage System]
[Diagram: HPC App, Big Data App, K/V Store, DB, FS, Object Storage System]
[Diagram: Big Data App, HPC App, K/V Store, DB, Converged Object Storage System]
A Big Data use-case: MonALISA, the monitoring platform of the CERN LHC ALICE Experiment
One problem…
A scientific monitoring service, monitoring the ALICE CERN LHC experiment:
- Ingests events at a rate of up to 16 GB/s
- Produces more than 10^9 data files per year
- Computes 35,000+ aggregates in real time
Current lock-based platform does not scale
…multiple requirements
- Multi-object write synchronization support
- Atomic, lock-free writes
- High-performance reads
- Horizontal scalability
Why is write synchronization needed?
[Diagram: Client and Object Storage System, with the labels read(count), 5, write(count,6), sync( ), ack]
Aggregate computation is a three-step operation:
1. Read current value remotely from storage
2. Update it with the new data
3. Write the updated value remotely to storage
Aggregate update needs to be atomic (transactions).
Also, adding new data to persistent storage and updating the related aggregates needs to be performed atomically as well.
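The three-step update can lose data when two clients interleave. The sketch below is only an illustration, assuming a toy in-memory store: `ToyStore`, `atomic_add`, and both demo functions are hypothetical names, not part of MonALISA or any real storage API. The lock only models the server applying an update as one indivisible step.

```python
import threading


class ToyStore:
    """In-memory stand-in for a remote object store (hypothetical API)."""

    def __init__(self):
        self._data = {}
        # Models server-side atomicity only; not a claim about any real system.
        self._lock = threading.Lock()

    def read(self, key):
        return self._data.get(key, 0)

    def write(self, key, value):
        self._data[key] = value

    def atomic_add(self, key, delta):
        # Read, update, and write happen as one indivisible step.
        with self._lock:
            self._data[key] = self._data.get(key, 0) + delta
            return self._data[key]


def lost_update_demo():
    """Two clients run read-update-write with an unlucky interleaving."""
    store = ToyStore()
    store.write("count", 5)
    a = store.read("count")      # client A reads 5
    b = store.read("count")      # client B reads 5 before A writes back
    store.write("count", a + 1)  # A writes 6
    store.write("count", b + 1)  # B also writes 6: A's increment is lost
    return store.read("count")


def atomic_update_demo():
    """The same two increments, each applied atomically by the store."""
    store = ToyStore()
    store.write("count", 5)
    store.atomic_add("count", 1)
    store.atomic_add("count", 1)
    return store.read("count")
```

With the interleaving shown, `lost_update_demo()` ends at 6 instead of 7, while `atomic_update_demo()` ends at 7.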
At which level to handle concurrency management?
At the application level?
- Enables fine-grained synchronization (app knowledge)…
- …but significantly complicates application design, and typically only guarantees isolation.
At a middleware level?
- Eases application design…
- …but has a performance cost (zero knowledge), and usually also only guarantees isolation.
At a storage level?
- Also eases application design,
- better performance than middleware (storage knowledge),
- and may offer additional consistency guarantees.
[Diagram labels: Thread 1, Thread 2, Thread 3; Synchronization layer; Object Storage System; Transactional Object Storage System]
Aren’t existing transactional object stores enough?
Not quite.
Existing transactional systems typically only ensure consistency of writes.
In most current systems, reads are performed atomically only because objects are small enough to be located on a single server, i.e.
- Records for database systems
- Values for Key-Value stores
Yet, for large objects, reads spanning multiple chunks should always return a consistent view.
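A toy illustration of why multi-chunk reads need a consistent view. This is a sketch under stated assumptions: it uses simple version switching as a stand-in and is not the WARP algorithm or any real system's protocol. `InPlaceObject`, `VersionedObject`, `stage`, and `commit` are hypothetical names.

```python
class InPlaceObject:
    """Chunks are overwritten one at a time, with no transaction."""

    def __init__(self, chunks):
        self.chunks = list(chunks)

    def write_chunk(self, index, data):
        self.chunks[index] = data

    def read_all(self):
        return list(self.chunks)


class VersionedObject:
    """Writes are staged as a new version and become visible all at once."""

    def __init__(self, chunks):
        self._versions = {0: list(chunks)}
        self._committed = 0

    def stage(self, updates):
        # Copy the committed chunks, apply the updates, keep them invisible.
        staged = list(self._versions[self._committed])
        for index, data in updates.items():
            staged[index] = data
        version = max(self._versions) + 1
        self._versions[version] = staged
        return version

    def commit(self, version):
        # A single switch makes every updated chunk visible together.
        self._committed = version

    def read_all(self):
        return list(self._versions[self._committed])


def torn_read_demo():
    obj = InPlaceObject(["a0", "b0"])
    obj.write_chunk(0, "a1")   # writer has updated only the first chunk
    seen = obj.read_all()      # reader sees a mix of old and new chunks
    obj.write_chunk(1, "b1")
    return seen


def snapshot_read_demo():
    obj = VersionedObject(["a0", "b0"])
    version = obj.stage({0: "a1", 1: "b1"})
    before = obj.read_all()    # still the old version, fully consistent
    obj.commit(version)
    after = obj.read_all()     # the new version, fully consistent
    return before, after
```

In the first demo the reader observes `["a1", "b0"]`, a state no writer ever intended. In the second, a read returns either the old chunks or the new ones, never a mix.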
Týr transactional design
Týr internally maps all writes to transactions:
- Multi-chunk, and even multi-object operations are processed in a serializable order
- Ensures that all chunk replicas are consistent
Týr uses a high-performance, sequentially-consistent transaction chain algorithm: WARP [1].
[1] R. Escriva et al., Warp: Lightweight Multi-Key Transactions for Key-Value Stores
Týr is alive!
Fully implemented as a prototype with ~22,000 lines of C.
Lock-free, queue-free, asynchronous design.
Leveraging well-known technologies:
- Google LevelDB [1] for node-local persistent storage,
- Google FlatBuffers [2] for message serialization,
- UDT [3] as network transfer protocol.
[1] http://leveldb.org/
[2] https://google.github.io/flatbuffers
[3] http://udt.sourceforge.net/
Týr evaluation with MonALISA
MonALISA data collection was re-implemented atop Týr, and evaluated using real data.
Týr was compared to other state-of-the-art, object-based storage systems:
- RADOS / librados (Ceph)
- Azure Storage Blobs (Microsoft)
- BlobSeer (Inria)
Experiments run on the Microsoft Azure cloud, up to 256 nodes.
3x replication factor for all systems.
Synchronized write performance: Evaluating transactional write performance
- We add fine-grained, application-level, lock-based synchronization to Týr competitors
- Performance of Týr competitors decreases due to the synchronization cost
- Clear advantage of Atomic operations over Read-Update-Write aggregate updates
[Figure: Avg. throughput (mil. ops / sec) vs. concurrent writers (up to 500). Series: Týr (Atomic operations), Týr (Read-Update-Write), RADOS (Synchronized), BlobSeer (Synchronized), Azure Blobs (Synchronized)]
Read performance
- We simulate MonALISA reads, varying the number of concurrent readers
- Slightly lower performance than RADOS, but offers read consistency guarantees
- Týr's lightweight read protocol allows it to outperform BlobSeer and Azure Storage
[Figure: Avg. throughput (mil. ops / sec) vs. concurrent readers (up to 500). Series: Týr, RADOS, BlobSeer, Azure Blobs]
The next step
- Týr as a base layer for higher-level storage abstractions?
- Týr for HPC applications?
[Diagram: Big Data App, HPC App, RDB, K/V Store, Converged Object Storage System]
Before that: A study of feasibility
Current storage stack (layers, top to bottom)
- HPC Apps / Big Data Apps
- I/O library / BD Framework calls
- I/O Library / Big Data Framework
- POSIX-like calls
- HPC PFS / Big Data DFS
“Converged” storage stack (layers, top to bottom)
- HPC Apps / Big Data Apps
- I/O library / BD Framework calls
- I/O Library / Big Data Framework
- POSIX-like calls
- HPC Adapter / Big Data Adapter
- Object-based storage calls
- Converged Object Storage System
Object-oriented primitives
- Object Access: random object read, object size
- Object Manipulation: random object write, truncate
- Object Administration: create object, delete object
- Namespace Access: scan all objects
These operations are similar to those permitted by the POSIX-IO API on a single file.
Directory-level operations do not have their object-based storage counterpart (flat nature of these kinds of systems):
- Low number of them
- Emulated using the scan operation (far from optimized, but compensated by the gains permitted by using a flat namespace and simpler semantics)
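The primitive set above could be sketched as a small in-memory class. This is a hypothetical illustration only: `FlatObjectStore` and its method signatures are assumptions made for the example and do not reproduce the API of Týr, RADOS, or any other system.

```python
class FlatObjectStore:
    """Toy flat-namespace object store exposing the listed primitives."""

    def __init__(self):
        self._objects = {}  # object key -> bytearray

    # Object administration
    def create(self, key):
        self._objects[key] = bytearray()

    def delete(self, key):
        del self._objects[key]

    # Object access
    def read(self, key, offset, length):
        # Random read; reading past the end returns only the bytes present.
        return bytes(self._objects[key][offset:offset + length])

    def size(self, key):
        return len(self._objects[key])

    # Object manipulation
    def write(self, key, offset, data):
        # Random write; any gap before the offset is zero-filled.
        buf = self._objects[key]
        end = offset + len(data)
        if end > len(buf):
            buf.extend(b"\x00" * (end - len(buf)))
        buf[offset:end] = data

    def truncate(self, key, size):
        buf = self._objects[key]
        if size < len(buf):
            del buf[size:]
        else:
            buf.extend(b"\x00" * (size - len(buf)))

    # Namespace access
    def scan(self):
        # The only namespace operation: list every object key.
        return sorted(self._objects)
```

There is no directory operation anywhere in the interface; everything that needs a hierarchy has to be built on `scan`.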
Representative set of HPC/BD applications
Platform | Application | Usage | Total reads | Total writes | R/W ratio | Profile
HPC/MPI | mpiBLAST | Protein docking | 27.7 GB | 12.8 MB | 2.1*10^3 | Read-intensive
HPC/MPI | MOM | Oceanic model | 19.5 GB | 3.2 GB | 6.01 | Read-intensive
HPC/MPI | ECOHAM | Sediment propagation | 0.4 GB | 9.7 GB | 4.2*10^-2 | Write-intensive
HPC/MPI | Ray Tracing | Video processing | 67.4 GB | 71.2 GB | 0.94 | Balanced
Cloud/Spark | Sort | Text processing | 5.8 GB | 5.8 GB | 1.00 | Balanced
Cloud/Spark | Connected Component | Graph processing | 13.1 GB | 71.2 MB | 0.18 | Read-intensive
Cloud/Spark | Grep | Text processing | 55.8 GB | 863.8 MB | 64.52 | Read-intensive
Cloud/Spark | Decision Tree | Machine learning | 59.1 GB | 4.7 GB | 12.58 | Read-intensive
Cloud/Spark | Tokenizer | Text processing | 55.8 GB | 235.7 GB | 0.24 | Write-intensive
Original operation | Rewritten operation
create(/foo/bar) | create(/foo__bar)
open(/foo/bar) | open(/foo__bar)
read(fd) | read(bd)
write(fd) | write(bd)
mkdir(/foo) | Dropped operation
opendir(/foo) | scan(/), return all files matching /foo__*
rmdir(/foo) | scan(/), remove all files matching /foo__*

Operation | Action | Operation count
mkdir | Create directory | 43
rmdir | Remove directory | 43
opendir (Input data directory) | Open/List directory | 5
opendir (other directories) | Open/List directory | 0
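The rewriting rules in the first table can be sketched as a thin adapter over a flat key space. This is a minimal illustration, not the adapter used in the study: `flatten` and `PosixAdapter` are hypothetical names, and a plain dictionary stands in for the object store.

```python
def flatten(path):
    """Map a hierarchical path to a flat key: /foo/bar -> /foo__bar."""
    return "/" + path.strip("/").replace("/", "__")


class PosixAdapter:
    """Toy adapter applying the rewriting rules over a flat namespace."""

    def __init__(self):
        self.objects = {}  # flat key -> bytes; stand-in for the object store

    def create(self, path):
        self.objects[flatten(path)] = b""

    def mkdir(self, path):
        # Dropped operation: a flat namespace has no directories to create.
        pass

    def scan(self):
        return sorted(self.objects)

    def opendir(self, path):
        # scan(/), then return all files matching /foo__*
        prefix = flatten(path) + "__"
        return [key for key in self.scan() if key.startswith(prefix)]

    def rmdir(self, path):
        # scan(/), then remove all files matching /foo__*
        for key in self.opendir(path):
            del self.objects[key]
```

One limitation of this simple mapping: a file name that already contains `__` becomes indistinguishable from a nested path, so a real adapter would need an escaping rule.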
Týr and RADOS vs Lustre (HPC), HDFS/CephFS (Big Data)
- Grid’5000 experimental testbed distributed over 11 sites in France and Luxembourg (parapluie cluster, Rennes)
- 2 x 12-core 1.7 GHz 6164 HE, 48 GB of RAM, and 250 GB HDD
- HPC applications: Lustre 2.9.0 and MPICH 3.2 [67], on a 32-node cluster
- Big data applications: Spark 2.1.0, Hadoop / HDFS 2.7.3 and Ceph Kraken on a 32-node cluster
HPC applications
BD applications
HPC/BD applications
Conclusions
- Týr is a novel high-performance object-based storage system providing built-in multi-object transactions
- Object-based storage convergence is possible, leading to a significant performance improvement on both platforms (HPC and Cloud)
- A completion time improvement of up to 25% for big data applications and 15% for HPC applications when using object-based storage
Thank you!