Performance and Scalability Evaluation of the Ceph Parallel File - PowerPoint PPT Presentation

Performance and Scalability Evaluation of the Ceph Parallel File System
Presented by Feiyi Wang
Co-authors: Mark Nelson (Inktank), Sarp Oral, Scotty Atchley, Sage Weil (Inktank), Bradley W. Settlemyer, Blake Caldwell, Jason Hill

Introduction
• Oak Ridge Leadership Computing Facility (OLCF)
  – Jaguar, served by Spider 1 (2008): 240 GB/s, 10 PB, serving more than 26,000 clients, with 192 OSSs and 1,344 OSTs
  – Titan, to be served by Spider 2 (2013): 1 TB/s, 32 PB (after RAID)
• Both Spider 1 and Spider 2 are used for scratch I/O; HPSS is used for archival storage.
• New technology evaluation: Ceph for HPC?
Managed by UT-Battelle for the U.S. Department of Energy

Ceph Overview
• Ceph is a distributed storage system designed for scalability, reliability, and performance.
• The system is based on a distributed object storage service called RADOS.
• Data objects are distributed across Object Storage Devices (OSDs) using CRUSH, a deterministic hashing function that allows flexible placement policies.
• CephFS builds a distributed, cache-coherent file system on top of RADOS.
• Ceph metadata servers store all metadata in RADOS objects; Ceph can adaptively adjust the distribution of the namespace across a pool of metadata servers.
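
To illustrate what "deterministic hashing" buys, here is a minimal sketch of hash-based placement. It is not the CRUSH algorithm: it uses rendezvous (highest-random-weight) hashing as a simpler stand-in, and the function name, OSD names, and object name are hypothetical. The point it shows is that any client can compute an object's location from its name and the OSD list alone, with no lookup table.

```python
import hashlib
from typing import List


def place_object(object_name: str, osds: List[str], replicas: int = 1) -> List[str]:
    """Pick `replicas` OSDs for an object with rendezvous hashing (a stand-in, not CRUSH)."""

    def score(osd: str) -> int:
        # Hash the (object, OSD) pair; the same inputs always give the same score.
        digest = hashlib.sha256(f"{object_name}:{osd}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    # Every client ranks the OSDs identically, so placement needs no central table.
    ranked = sorted(osds, key=score, reverse=True)
    return ranked[:replicas]


osds = [f"osd.{i}" for i in range(8)]  # hypothetical OSD names
print(place_object("object-0001", osds, replicas=2))
```

With this scheme, removing an OSD that does not hold a given object leaves that object's placement unchanged, which is the kind of stability a placement function needs when the cluster grows or shrinks.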

Ceph Architecture

Testbed Environment
• DDN SFA10K as the storage backend
• The SFA10K organizes disks into various RAID levels using two active-active RAID controllers; each RAID controller has two RAID processors, and each RAID processor has a dual-port IB QDR card.
• 200 SAS drives and 280 SATA drives in 10 disk enclosures.
• The storage rack is driven by 4 server hosts with IB QDR connections.

Test Methodology
• Our strategy is bottom-up. At each level of the I/O path, we first establish the expected theoretical performance, then the observed performance. After tuning, we establish the baseline performance at that level.
• Generally, we expect some performance loss as we move up the stack; the degree of loss indicates how well the system is engineered and balanced.
• Four key components:
  – Block devices
  – Local/back-end file system
  – Storage network
  – Parallel file system

Baseline Performance
• The IB QDR theoretical maximum is around 3.2 GB/s; in practice, we observed 3.0 GB/s. With 4 IB QDR connections, we are in line with DDN's theoretical maximum of 12 GB/s.
• Block level:
  – Each LUN is a RAID 6 (8+2) array: 8 data disks and 2 parity disks
  – Turning write-back cache on has a major impact on the SATA RAID group (288 MB/s vs. 955 MB/s) and a minor impact on the SAS RAID group (1.12 GB/s vs. 1.4 GB/s)
• Aggregate performance: we observe 11 GB/s for 28 SATA LUNs or 20 SAS LUNs
• We take 11 GB/s as the baseline performance number; the limitation comes from RAID controller performance.
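
The baseline figures above can be cross-checked with a few lines of arithmetic. This is only a sketch using the numbers quoted on the slides; the variable names are made up for illustration, and it assumes the lower figure in each write-back pair is the cache-off case.

```python
# Figures quoted on the slides.
IB_QDR_OBSERVED_GBS = 3.0   # observed throughput per IB QDR link
IB_LINKS = 4                # one link per server host
DISKS_PER_LUN = 8 + 2       # RAID 6: 8 data disks + 2 parity disks
SATA_LUNS, SAS_LUNS = 28, 20
BASELINE_GBS = 11.0         # observed aggregate block-level throughput

network_ceiling = IB_QDR_OBSERVED_GBS * IB_LINKS
sata_drives = SATA_LUNS * DISKS_PER_LUN
sas_drives = SAS_LUNS * DISKS_PER_LUN
loss_vs_network = 1 - BASELINE_GBS / network_ceiling

# Assumption: the lower number in each pair is with write-back cache off.
sata_cache_speedup = 955 / 288   # MB/s on vs. off
sas_cache_speedup = 1.4 / 1.12   # GB/s on vs. off

print(f"network ceiling:       {network_ceiling:.1f} GB/s")
print(f"SATA / SAS drives:     {sata_drives} / {sas_drives}")
print(f"baseline vs. network:  {loss_vs_network:.1%} below the ceiling")
print(f"write-back speedup:    SATA {sata_cache_speedup:.2f}x, SAS {sas_cache_speedup:.2f}x")
```

The LUN counts line up with the testbed: 28 SATA LUNs and 20 SAS LUNs at 10 disks each account for all 280 SATA and 200 SAS drives.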

RADOS Scaling (1)
4 servers, 4 clients, 4 MB I/O
We observed periods of high performance followed by periods of low performance or outright stalls, across different backend file systems.
Jim Schutt: "… unfortunate interaction between the number of OSDs/server, number of clients, TCP socket buffer autotuning, the policy throttler, and limits on the total memory used by TCP stack"
Panels: (1) TCP auto-tune enabled; (2) TCP auto-tune disabled

RADOS Scaling (2)
Panels: (a) Scaling OSDs per server; (b) Scaling OSD servers
(a) Through experimentation, we observed that the number of concurrent operations is critical to achieving high throughput. The graph shows 32 concurrent 4 MB objects in flight. All tests are performed with replication set to 1.
(b) 4 OSD servers, each with 11 OSDs. Perfect scaling would give us an aggregate read of 6616 MB/s and write of 5640 MB/s; we observe a loss of 13.6% and 16.0%, respectively.
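
The scaling loss in (b) can be turned back into throughput numbers. The sketch below assumes that "perfect scaling" means four times the single-server rate, which the slide implies but does not state; the names are illustrative only.

```python
SERVERS = 4
PERFECT_READ_MBS = 6616    # ideal aggregate read from the slide
PERFECT_WRITE_MBS = 5640   # ideal aggregate write from the slide
READ_LOSS, WRITE_LOSS = 0.136, 0.160


def observed(perfect_mbs: float, loss: float) -> float:
    """Aggregate throughput left after the reported scaling loss."""
    return perfect_mbs * (1 - loss)


# Assumption: perfect scaling = SERVERS x single-server throughput.
per_server_read = PERFECT_READ_MBS / SERVERS
per_server_write = PERFECT_WRITE_MBS / SERVERS
observed_read = observed(PERFECT_READ_MBS, READ_LOSS)
observed_write = observed(PERFECT_WRITE_MBS, WRITE_LOSS)

print(f"single server: read {per_server_read:.0f} MB/s, write {per_server_write:.0f} MB/s")
print(f"4 servers:     read {observed_read:.0f} MB/s, write {observed_write:.0f} MB/s")
```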

File System Level: A Different Story
Bottom line: although we obtained reasonable performance at the RADOS level, it did not translate into file-system-level performance at all.

Improving RADOS
• osd_op_threads: 7.3% and 9% improvement
• journal_aio: 11.5% and 16.3% improvement
• Other probed parameters: no tangible, repeatable impact
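
As a rough sketch of how the two tunables named above might be written into an `[osd]` section of a Ceph configuration file, the snippet below renders one with Python's `configparser`. The slides do not give the values that were used, so `8` and `true` are placeholders, not the settings from the study.

```python
import configparser
import io

config = configparser.ConfigParser()
config["osd"] = {
    "osd_op_threads": "8",   # placeholder value, not taken from the slides
    "journal_aio": "true",   # placeholder value, not taken from the slides
}

buffer = io.StringIO()
config.write(buffer)
rendered = buffer.getvalue()
print(rendered)
```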

Improving Ceph File System Performance
We observed a significant performance impact from client-side CRC32, more so on writes than on reads. Inktank has since implemented an SSE4-instruction-based CRC32 for Intel CPUs.
To improve IOR scaling performance:
(1) Increase the read-ahead cache on the client side
(2) Inktank investigated heavy lock contention during parallel compaction in the Linux memory manager, a bug in kernel 3.5
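
To give a sense of why a software checksum is costly on the data path, here is a minimal table-driven CRC in pure Python. It assumes the checksum in question is CRC-32C (the Castagnoli polynomial, which is what the SSE4.2 `crc32` instruction computes); the function names are illustrative and this is not Ceph's implementation. Every payload byte costs a table lookup, a shift, and two XORs, work that the hardware instruction does several bytes at a time.

```python
CRC32C_POLY_REFLECTED = 0x82F63B78  # Castagnoli polynomial, bit-reflected form


def _build_table():
    """Precompute the CRC of every possible byte value."""
    table = []
    for byte in range(256):
        crc = byte
        for _ in range(8):
            crc = (crc >> 1) ^ CRC32C_POLY_REFLECTED if crc & 1 else crc >> 1
        table.append(crc)
    return table


_TABLE = _build_table()


def crc32c(data: bytes, crc: int = 0) -> int:
    """Software CRC-32C; pass a previous result as `crc` to checksum in chunks."""
    crc ^= 0xFFFFFFFF
    for b in data:
        crc = _TABLE[(crc ^ b) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


# Standard check value for CRC-32C.
assert crc32c(b"123456789") == 0xE3069283
print(hex(crc32c(b"123456789")))
```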

Summary
• Ceph is still under rapid development, and our results show it: performance swings widely between versions. Compared to CephFS, RADOS is much more stable.
• Through tuning efforts, we were able to observe Ceph performing at about 70% of raw hardware capacity at the RADOS level and 62% at the file system level.
• Ceph performs "metadata + data" journaling, which may be fine for a host system with locally attached disks but hurts on SFA10K-like hardware, where block devices are exposed through SRP over IB.
