
SLIDE 1

An Exploration into Object Storage

Lance Evans, Office of the CTO
Raghu Chandrasekar, Storage R&D

SLIDE 2

Safe Harbor Statement

Dagstuhl Seminar #17202

This presentation may contain forward-looking statements that are based on our current expectations. Forward-looking statements may include statements about our financial guidance and expected operating results, our opportunities and future potential, our product development and new product introduction plans, our ability to expand and penetrate our addressable markets, and other statements that are not historical facts. These statements are only predictions, and actual results may materially vary from those projected. Please refer to Cray's documents filed with the SEC from time to time concerning factors that could affect the Company and these forward-looking statements.

SLIDE 3

Lexicon?


Storage, Ephemeral, Persistent, Consistent, Coherent, Resilient, Durable, Reliable, File, Object, Cache, Tier, Namespace, Secure, Attribute, Authenticate

SLIDE 4

Motivations

  • Transition from files to objects (wild west)
  • Convergence of analytics and HPC (which is which?)
  • Hardware evolution (SMR, NAND, PCM, TLAs)
  • Open-source movement (mostly commercial)
  • Dev cost and agility (10-yr gestation cycle)
  • Specialized frameworks (still require common infrastructure)
  • Flat namespaces (enable bazillion objects)
  • Scaled DBs (graph, schema-less, NewSQL)
  • Required scale (256-256k compute nodes, 64-64k storage devices)


SLIDE 5

HPC & Analytics Convergence

[Architecture diagram: up to 256k compute nodes with management, monitoring, and service infrastructure; user applications and an analytics framework with local caching run over a high-speed dragonfly fabric; storage nodes provide HPC file or object access with optional caching over flash and persistent memory via an RDMA transport, backed by scalable metadata services. The application object interface exposes POSIX files, HDF5 containers, K/V, Spark RDDs, and discover/query semantics; beneath it sits the storage object interface.]

SLIDE 6

SAROJA: Architecture


[Architecture diagram: HPC codes and analytics frameworks use POSIX, HDF5/NetCDF, and RESTful/S3/CDMI interfaces atop the SAROJA user-space library, which exposes a native put/get API over pluggable metadata, data, and control plugins.]

“An Exploration into Object Storage for Exascale Supercomputers”, Raghunath Raja Chandrasekar, Lance Evans, Robert Wespetal, Cray User Group Conference (CUG) 2017

SLIDE 7

The Client

  • libsaroja.so, Put/Get semantics, FUSE, and wrapfs (in-kernel)
  • Client-side intelligence
  • Pluggable backends: Metadata, data, control
  • Algorithmic data node selection
  • POSIX <-> KV/NoSQL metadata translation
  • Interface with consensus agents
  • Retain fidelity of structured data formats
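The client's two algorithmic roles above, POSIX-to-KV metadata translation and data-node selection, can be sketched in a few lines of Python. Everything here is illustrative: the function names, the pathname-as-key encoding, and the node list are assumptions, not the actual libsaroja API.

```python
import hashlib

DATA_NODES = ["nid0001", "nid0002", "nid0003", "nid0004"]  # assumed node IDs

store = {}  # stand-in for the pluggable NoSQL metadata backend


def create(path):
    # Pathname-as-key translation: the full POSIX path indexes the
    # attribute record in the KV store.
    store[path] = {"atime": 0, "mtime": 0, "size": 0, "xattrs": {}}


def select_data_node(path, shard):
    # Algorithmic data-node selection: a hash of (path, shard) picks the
    # server, so clients need no central placement table.
    digest = hashlib.sha1(f"{path}:{shard}".encode()).hexdigest()
    return DATA_NODES[int(digest, 16) % len(DATA_NODES)]


create("/mnt/bar/b/file123")
node = select_data_node("/mnt/bar/b/file123", shard=0)
```

Because placement is purely a function of the key, every client computes the same answer independently, which is what makes client-side intelligence viable at scale.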


SLIDE 8


“…performance degradation caused by FUSE can be completely imperceptible or as high as 83% even when optimized; and relative CPU utilization can increase by 31%...”

SLIDE 9

Metadata Services

  • Traditional HPC metadata services
    - Interface-dependent
    - Strongly consistent and normalized
    - HA-based fault tolerance
  • Desirable characteristics
    - The right Consistency/Availability/Partition-tolerance (CAP) balance
    - {Scalability and performance} vs. {strong consistency} tradeoff
    - API-agnostic: serves POSIX, objects, structured datasets, etc.
    - Storage-device conscious (NVMe, PMEM)
    - Analytics on the metadata


SLIDE 10

NoSQL an option?

  • Distributed hash tables for data placement
  • Fault tolerance with failure domains
  • Built-in consensus mechanisms
  • Log-structured merge trees to optimize KV storage
  • Cassandra for our initial proof of concept
    - Scaled to thousands of nodes, 10 PB of data, 1 trillion requests/day
    - In production use at 1500+ companies
    - Low-level APIs in C
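The first bullet, DHT-based data placement, can be sketched as a minimal consistent-hash ring of the kind Cassandra uses, with virtual nodes as tokens. This is an illustrative toy under assumed names, not Cassandra's actual implementation:

```python
import bisect
import hashlib


class HashRing:
    """Toy consistent-hash ring: nodes own arcs of a hashed token space."""

    def __init__(self, nodes, vnodes=8):
        self.ring = []  # sorted (token, node) pairs
        for node in nodes:
            for v in range(vnodes):  # virtual nodes smooth the distribution
                self.ring.append((self._hash(f"{node}#{v}"), node))
        self.ring.sort()
        self.tokens = [t for t, _ in self.ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def owner(self, key):
        # The first node clockwise from the key's token owns the key.
        i = bisect.bisect(self.tokens, self._hash(key)) % len(self.ring)
        return self.ring[i][1]


ring = HashRing(["cass1", "cass2", "cass3"])
node = ring.owner("partition-key-42")
```

Adding or removing a node remaps only the keys on its arcs, which is why DHTs tolerate membership changes gracefully.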


SLIDE 11

Metadata models for POSIX


Three metadata models for POSIX over a KV store, shown for a tree /mnt, /mnt/foo, /mnt/bar, /mnt/bar/a, /mnt/bar/b (inode #6789), /mnt/bar/b/file123 (inode #1234):

  • Pathname-as-key: KEY: /mnt/bar/b/file123 -> VAL: atime;mtime;size;xattrs
  • Two-tier collections: KEY: 6789 -> VAL: 1234; and KEY: 1234 -> VAL: atime;mtime;size;xattrs
  • IndexFS: KEY: (6789, part_key, hash(file123)) -> VAL: atime;mtime;size;xattrs;*data_obj
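The three encodings can be written out as plain key/value pairs. The dictionaries below mirror the slide's example (directory inode #6789, file inode #1234) and are purely illustrative:

```python
# 1. Pathname-as-key: full path -> attributes.
#    Simple lookups, but a directory rename rewrites every descendant key.
pathname_as_key = {
    "/mnt/bar/b/file123": "atime;mtime;size;xattrs",
}

# 2. Two-tier collections: directory inode -> child inodes,
#    then inode -> attributes. Renames touch only the parent's entry.
two_tier = {
    "6789": "1234;",                    # /mnt/bar/b lists child file123
    "1234": "atime;mtime;size;xattrs",  # file123's attribute record
}


# 3. IndexFS-style: (parent inode, partition key, hash(name)) -> attributes
#    plus a data-object pointer, enabling partitioned huge directories.
def indexfs_key(parent_ino, part_key, name):
    return (parent_ino, part_key, hash(name))


indexfs = {
    indexfs_key(6789, 0, "file123"): "atime;mtime;size;xattrs;*data_obj",
}
```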

SLIDE 12

Scaling with POSIX-over-NoSQL

[Chart: creates per second (y-axis, 20,000 to 180,000) vs. number of Cassandra servers (1, 2, 4, 8).]

File Creation rate (mdtest)

  • POSIX over SAROJA
  • 4480 MPI ranks
  • 2.2 million files total
  • TCP over GNI on XC40
  • Lots of cheating

Peak Lustre file creation rate (w/o DNE)

SLIDE 13

Data Path Services

  • Algorithmic mapping of object shards to distributed servers
  • Data resilience by means of replication or erasure-coding
  • On-the-fly recovery in the face of failures
  • Data movement between tiers within and outside the store
  • Efficient and granular use of underlying fabric/memory/storage
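The first two bullets, algorithmic shard-to-server mapping and resilience via replication, can be sketched as follows. Server names, the rack layout, and the replica count are assumptions for illustration, not SAROJA's actual placement policy:

```python
import hashlib

# (failure domain, server) pairs: two racks of two servers each (assumed).
SERVERS = [("rack0", "oss0"), ("rack0", "oss1"),
           ("rack1", "oss2"), ("rack1", "oss3")]


def place_shard(obj_id, shard, replicas=2):
    """Pick `replicas` servers for a shard, no two in the same rack."""
    digest = int(hashlib.sha1(f"{obj_id}:{shard}".encode()).hexdigest(), 16)
    chosen, racks = [], set()
    for i in range(len(SERVERS)):
        rack, srv = SERVERS[(digest + i) % len(SERVERS)]
        if rack not in racks:  # spread replicas across failure domains
            chosen.append(srv)
            racks.add(rack)
        if len(chosen) == replicas:
            break
    return chosen


replicas = place_shard("blob42", shard=0)
```

Erasure coding follows the same placement idea, but spreads k data plus m parity fragments instead of full copies.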


SLIDE 14

Ceph in the Data Path

  • Data plugin w/librados in C
  • Stripes of a BLOB/File/HDF5 container -> Ceph object
  • Static approach to track extents (currently), à la Lustre/DataWarp
  • Clients track servers based on ID+offset
  • Two-tier path: Replicated pmem tier and Erasure-coded flash tier
  • Ceph OSD backends: FileStore, BlueStore, PMStore, and others
  • Ceph Messengers
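The static, Lustre/DataWarp-style extent tracking above lets a client locate data from ID+offset alone. A minimal sketch of that idea, where the stripe size and the object-naming convention are assumptions rather than SAROJA's or Ceph's actual scheme:

```python
STRIPE_SIZE = 4 * 1024 * 1024  # assumed 4 MiB stripe per RADOS object


def rados_object_for(file_id, offset):
    """Map a byte offset in a BLOB/file/container to (object name,
    offset within that object), with no lookup table required."""
    stripe_index = offset // STRIPE_SIZE
    return f"{file_id}.{stripe_index:08x}", offset % STRIPE_SIZE


# Byte 9 MiB of "blob42" lands 1 MiB into the third stripe object.
obj, off = rados_object_for("blob42", 9 * 1024 * 1024)
```

A real data plugin would then read or write that object through librados; the point here is only that the mapping is computed, not stored.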


SLIDE 15

Co-design Concepts

  • NVMe flash
    - Intel SPDK: polling vs. interrupts
    - Controller memory buffers
  • Persistent memory
    - 64 B and 512 B atomicity guarantees, "sector tearing"
    - NVML suite of libraries
    - DAX + mmap()
  • Open-channel SSDs
    - LightNVM framework in the kernel
    - Storage data structures like LSM-trees become the FTL
  • Fabric concepts from MPI/SHMEM


ALL POSSIBLE FROM USER-SPACE!

SLIDE 16

Questions?

Lance Evans, lance@cray.com
Raghu Chandrasekar, raghu@cray.com