Campaign Storage - storage for tiers, space for everything



SLIDE 1

Campaign Storage - storage for tiers, space for everything

Peter Braam

Co-founder & CEO, Campaign Storage, LLC
2017-05
campaignstorage.com

SLIDE 2

Contents

  • Brief overview of the system
  • Creating and updating policy databases
  • Data management APIs

SLIDE 3

Thank you

The reviewers of our paper asked quite a few insightful questions. Thank you.

SLIDE 4

Campaign Storage

  • Invented at LANL
  • Being productized at Campaign Storage

SLIDE 5

The memory and storage hierarchy (figures as given on the slide):

Tiers: CPU or GPU packages (CPU cores), High Bandwidth Memory, NVRAM (e.g. XPoint, PCM, STT-RAM), RAM, FLASH, DISK, TAPE

  • BW cost, $ per (GB/s): $10 (CPU included!), $10, $200, $2K, $30K
  • Capacity cost, $/GB: $8, $0.3, $0.05, $0.01
  • Node BW: 1 TB/s, 100 GB/s, 20 GB/s, 5 GB/s
  • Cluster BW: 1 PB/s, 100 TB/s, 5 TB/s, 100 GB/s, 10's of GB/s
  • Software per tier: language level, language level, HDF5 / DAOS, DDN IME, Cray DataWarp, Parallel FS, Campaign Storage, Archive, Campaign

SLIDE 6

Campaign Storage - a new tier


Diagram contrasting the old world (Parallel File System → Archive) with the new world (Burst Buffer → Parallel File System → Campaign Storage → Archive / Cloud). Labels on the diagram: decreasing emphasis on the parallel file system; high BW and high $$$ but decreasing capacities toward the top of the stack; bandwidths of roughly TB/sec, 100 GB/sec and 10 GB/sec down the tiers; a large, reliable archive with Campaign Storage as the staging tier in front of it.

SLIDE 7

Architecture diagram: the customer infrastructure holds HPC Cluster A (a 20 PF simulation cluster) and HPC Cluster B (a 20 PF HPCD cluster), each with a 5 PB, 5 TB/s burst buffer, plus an HPCD & Viz cluster with HDFS and a Lustre FS at 1 TB/s. They connect through a file system interface to Campaign Storage: mover nodes for parallel staging & archiving, a metadata repository, and an object repository with search & data management.

Optional other tools:

  • Policy managers (e.g. Robinhood)
  • Workflow managers (e.g. iRODS)

SLIDE 8

Campaign Storage

It is …

  • A file system for staging and archiving
  • Built from low-cost HW but:

  • Industry standard object stores
  • Existing metadata stores

  • High integrity
  • High capacity, ultra scalable
  • Not the highest BW or lowest latency:

  • 10-100x higher than archives
  • 10x lower than PFS

It is not …

A general-purpose file system

  • Wait … these don't actually exist

Using object stores has problems

  • Data mover support takes effort
  • We will ease that pain

SLIDE 9

Implementation - modules


Core stack: OS with VFS and FUSE, MarFS, object storage, metadata FS

Data movers: HSM (Lustre / DMAPI), MPI, enterprise NAS, GridFTP

Management: analytics & search, migration, containers

SLIDE 10

Campaign Storage - deployment


Deployment diagram: Campaign Storage mover nodes, a metadata repository and an object repository, with search & data management alongside.

Object Repository

Disk object stores

  • Commercial & OSS

Archival object stores

  • Black Pearl

Full POSIX objects

  • Stored in metadata FS

Metadata Repository

Some nearly-POSIX distributed FS with EAs

  • Lustre / ZFS
  • GPFS

Move & Manage Nodes: 1 to 100s

  • Mount MarFS & other FS

Mover software

  • Software on mover node

Management

  • Search analytics in MarFS
  • 3rd party movement
  • Containers

SLIDE 11

Policy Databases

SLIDE 12

Traditional approach

  • A database with a record for each file
  • Found in HPSS, Robinhood, DMF, etc.
  • Used for understanding what is in the file system: which files are old, recent, or big, belong to a group, or reside on a device
  • Assists in automatic ("policy") or manual data management
  • Typically, histogram ranges are computed from search results

SLIDE 13

Challenges

  • Performance, for both ingest and queries: queries on a 100M-file database can take minutes
  • Scalability: significant RAM is required (e.g. 30% of the DB size), and handling more than 1B files is presently very difficult
  • The database is never 100% in sync with the file system
  • Adds load to premium storage

SLIDE 14

Approaches

  • Horizontally scaling key-value stores: LANL is exploring this
  • A variety of proprietary approaches, e.g. Komprise
  • Histogram analytics: maintaining aggregate data has its own challenges, e.g. how to measure the change in size of a file; very few changelogs record the old size

SLIDE 15

Analytics - subtree search

Every directory has a histogram recording properties of its subtree

  • Encodes, per property: how many files and bytes in the subtree have that property
  • Limited granularity, limited relational algebra
  • Store perhaps ~100,000 properties per directory

Examples:

  • Quota in subtree? User/group database for subtree?
  • What fileservers contain files?
  • Geospatial information in file?
  • (file type, size, access time) tuples
  • Allows limited relational algebra

Not a new idea; it can be added to ZFS & Lustre.
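The slides do not spell out the data structure, but as a minimal sketch of the idea (the bucket layout and all names below are assumptions for illustration, not the MarFS or ZFS implementation), a per-directory subtree histogram can be a fixed array of counters indexed by property, so a question like "how many files and bytes in this subtree fall in the oldest age class?" becomes an in-memory summation instead of a database scan:

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical bucket layout: (file type x size class x age class).
     * A real system would store on the order of 100,000 properties per directory. */
    #define N_TYPES 4   /* regular, dir, symlink, other   */
    #define N_SIZES 8   /* power-of-1024 size classes     */
    #define N_AGES  6   /* atime buckets, e.g. <1d .. >1y */

    struct bucket { uint64_t nfiles, nbytes; };

    /* One histogram per directory, covering its whole subtree. */
    struct subtree_histo {
        struct bucket b[N_TYPES][N_SIZES][N_AGES];
    };

    /* Answer "how many files/bytes in this subtree have age class >= age_min?"
     * from the in-memory histogram, with no file system scan. */
    struct bucket query_older_than(const struct subtree_histo *h, int age_min)
    {
        struct bucket r = {0, 0};
        for (int t = 0; t < N_TYPES; t++)
            for (int s = 0; s < N_SIZES; s++)
                for (int a = age_min; a < N_AGES; a++) {
                    r.nfiles += h->b[t][s][a].nfiles;
                    r.nbytes += h->b[t][s][a].nbytes;
                }
        return r;
    }

    int main(void)
    {
        struct subtree_histo root = {0};
        root.b[0][3][5].nfiles = 1200;           /* example: old, GB-sized files */
        root.b[0][3][5].nbytes = 1200ULL << 30;
        struct bucket old = query_older_than(&root, 5);
        printf("%llu files, %llu bytes in the oldest age class\n",
               (unsigned long long)old.nfiles, (unsigned long long)old.nbytes);
        return 0;
    }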

SLIDE 16


Include, e.g., a linked list of subdirectories and a database of parents for files with link count > 1.

SLIDE 17

Iterate over subdirectories


Diagram: a directory with its own Histo DB, whose subdirectories (Subdir 1, Subdir 2, Subdir 3) form a doubly linked list (prv / nxt pointers), each subdirectory carrying its own Histo DB.
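A sketch of the structure the drawing implies (the prv/nxt field names follow the figure; everything else is an assumption): each directory node carries its own histogram and links to its siblings, so a parent's subtree histogram can be rebuilt by walking the child list and summing:

    #include <stdint.h>
    #include <stddef.h>

    #define NBUCKETS 16          /* simplified; real histograms are much larger */

    struct histo { uint64_t count[NBUCKETS]; };

    /* A directory entry carrying its own histogram, linked to its siblings. */
    struct dir_node {
        struct histo     histo;      /* "Histo DB" for this subtree          */
        struct dir_node *prv, *nxt;  /* doubly linked list of subdirectories */
        struct dir_node *subdirs;    /* head of this directory's child list  */
    };

    void histo_add(struct histo *dst, const struct histo *src)
    {
        for (int i = 0; i < NBUCKETS; i++)
            dst->count[i] += src->count[i];
    }

    /* Rebuild a directory's histogram by iterating over its subdirectory list
     * and summing the children's histograms (file contributions omitted). */
    void recompute_from_children(struct dir_node *dir)
    {
        struct histo sum = {{0}};
        for (struct dir_node *d = dir->subdirs; d != NULL; d = d->nxt)
            histo_add(&sum, &d->histo);
        dir->histo = sum;            /* plus contributions of files in dir */
    }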

SLIDE 18

Key properties

  • Generate initially from a scan, then update with changelogs; mathematically prove that histo(changelog ∘ FS1) = histo_update(changelog) + histo(FS1) (a sketch of the update path follows this list)
  • Additive property: histograms can be added, either increasing a count or adding new bars; histo(dir) = sum of histo(subdirs) + contributions(files in dir)
  • This is the Merkle tree property: subtrees can be grafted with simple addition
  • Kept 100% consistent with snapshots
  • Space consumption on a par with a policy database, with 100K histogram buckets
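A minimal sketch of the changelog update path under the identity above (the record fields and bucket handling are assumptions; as a later slide notes, few real changelogs record the old size, which is exactly what this needs): each record contributes a signed delta, so replaying the changelog over histo(FS1) yields histo(changelog ∘ FS1) without a rescan:

    #include <stdint.h>

    #define NBUCKETS 16            /* simplified bucket space */

    struct histo { int64_t nfiles[NBUCKETS]; int64_t nbytes[NBUCKETS]; };

    /* Assumed changelog record: enough to compute the old and new bucket and size. */
    enum cl_op { CL_CREATE, CL_UNLINK, CL_RESIZE };

    struct chlog_rec {
        enum cl_op op;
        int     old_bucket, new_bucket;
        int64_t old_size,   new_size;
    };

    static void histo_apply(struct histo *h, const struct chlog_rec *r)
    {
        switch (r->op) {
        case CL_CREATE:
            h->nfiles[r->new_bucket] += 1;
            h->nbytes[r->new_bucket] += r->new_size;
            break;
        case CL_UNLINK:
            h->nfiles[r->old_bucket] -= 1;
            h->nbytes[r->old_bucket] -= r->old_size;
            break;
        case CL_RESIZE:
            h->nbytes[r->old_bucket] -= r->old_size;
            h->nbytes[r->new_bucket] += r->new_size;
            if (r->old_bucket != r->new_bucket) {
                h->nfiles[r->old_bucket] -= 1;
                h->nfiles[r->new_bucket] += 1;
            }
            break;
        }
    }

    /* Because every delta is additive, histo_update(changelog) can be accumulated
     * into a zero histogram and added to histo(FS1) later, or applied directly. */
    void histo_update(struct histo *h, const struct chlog_rec *log, int n)
    {
        for (int i = 0; i < n; i++)
            histo_apply(h, &log[i]);
    }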

SLIDE 19

Inserting subtrees


Diagram: inserting a subtree with its own Histo DB at /a/b adds that histogram to the Histo DBs along the path / → /a → /a/b; grafting is a simple sum of histograms (shown on the slide as "+ + + =").

SLIDE 20

Evaluation

  • A single histogram lookup may provide the overview that a policy search used to provide
  • But a histogram approach may have insufficient data for efficient general searches, and adapting histograms can be costly. How common is this?

SLIDE 21

Missing Storage APIs

SLIDE 22

Reflect on Storage Software

  • Since the 1980's, utilities have been added: "afs", "bfs", "cfs" … "zfs", each implementing a set of non-standardized features (file sets, data layout, ACLs)
  • ACLs and extended attributes became part of POSIX in the 2000's
  • Storage software almost always centers on batch data operations: caches do this inside the OS, utilities do this (rsync, zip), cloud software does this (Dropbox), and containers do this (Docker)

SLIDE 23

Lack of standardized APIs

  • Unnecessarily complicated software
  • Not portable; locked in to a platform

SLIDE 24

Example - data movement across many files

  • Objective: store batches of files
  • New concept: file level I/O vectorization
  • Includes server driven ordering
  • Packing small files into one object
  • Cache flushes


    struct copy_range {
        int    source_fd;
        int    dest_fd;
        off_t  source_offset;
        off_t  dest_offset;
        size_t length;
    };

    int copy_file_range(struct copy_range *r, unsigned int count, int flags);
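The call above is the proposed interface, not an existing syscall (Linux's copy_file_range(2) copies a single range between two descriptors). As a hedged illustration of the intended batch semantics only, the vector could be emulated in user space roughly as follows; in the real design the whole vector would be handed to the storage system so it can reorder the I/O, pack small files into one object, and drive cache flushes itself:

    #include <sys/types.h>
    #include <unistd.h>
    #include <stdlib.h>
    #include <errno.h>

    /* struct copy_range as proposed on the slide, repeated for self-containment. */
    struct copy_range {
        int    source_fd;
        int    dest_fd;
        off_t  source_offset;
        off_t  dest_offset;
        size_t length;
    };

    /* Naive user-space emulation of the proposed batched call (hypothetical name):
     * copies each range with pread/pwrite; short writes are treated as errors. */
    int copy_file_range_vec(struct copy_range *r, unsigned int count, int flags)
    {
        (void)flags;
        char *buf = malloc(1 << 20);              /* 1 MiB staging buffer */
        if (!buf)
            return -ENOMEM;

        for (unsigned int i = 0; i < count; i++) {
            size_t left = r[i].length;
            off_t  soff = r[i].source_offset;
            off_t  doff = r[i].dest_offset;
            while (left > 0) {
                size_t  chunk = left < (size_t)(1 << 20) ? left : (size_t)(1 << 20);
                ssize_t n = pread(r[i].source_fd, buf, chunk, soff);
                if (n <= 0) { free(buf); return n < 0 ? -errno : -EIO; }
                if (pwrite(r[i].dest_fd, buf, (size_t)n, doff) != n) {
                    free(buf);
                    return -errno;
                }
                soff += n;
                doff += n;
                left -= (size_t)n;
            }
        }
        free(buf);
        return 0;
    }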

SLIDE 25

Extending the API - alternatives

  • In some areas concepts must be defined: data layout; sub-sets and subtrees of file systems (very similar to "mount")
  • The DB world solved this problem with SQL as a domain-specific language
  • A file-level data management solution could build on asynchronous data and metadata APIs, batch / transaction boundaries, and intelligent processing (a header-style sketch follows this list)
  • This is possibly a better approach than adding more API calls; evidence is seen in SQFSCK
  • New problems will keep appearing, e.g. doing this in clusters
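To make "asynchronous metadata APIs with batch / transaction boundaries" concrete, here is a header-style sketch of what such an interface might look like; every type and function name below is hypothetical, invented for illustration, and not an existing library:

    #include <stddef.h>

    /* Hypothetical asynchronous, batched metadata API.  Operations queued between
     * md_batch_begin and md_batch_submit form one batch / transaction; the storage
     * system may reorder and bulk-ship them, and results are collected later. */

    typedef struct md_batch md_batch_t;               /* opaque batch handle */

    md_batch_t *md_batch_begin(const char *fs_subtree);   /* subtree, like "mount" */
    int  md_stat_async(md_batch_t *b, const char *path);  /* queue a stat          */
    int  md_setxattr_async(md_batch_t *b, const char *path,
                           const char *name, const void *value, size_t len);
    int  md_rename_async(md_batch_t *b, const char *from, const char *to);
    int  md_batch_submit(md_batch_t *b);               /* transaction boundary     */
    int  md_batch_wait(md_batch_t *b);                 /* collect results          */
    void md_batch_free(md_batch_t *b);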

SLIDE 26

Thank you

SLIDE 27

Metadata Movement

SLIDE 28

Batch metadata handling

A well-studied problem, but not easily productized. There are several sides to the problem:

  • 1. scale out the server side – data layout
  • 2. bulk communication

in many cases this utilizes replay of operations

  • 3. the tree structure requires linking subtrees and subsets

Conflicting demands between latency & throughput
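One way to picture the bulk-communication-via-replay point: the client serializes many metadata operations into fixed records, ships them in one message, and the server replays them in sequence order. The record layout below is invented for illustration (it is not a real wire format); batching this way favors throughput at the cost of per-operation latency, which is the conflict noted above:

    #include <stdint.h>

    /* Hypothetical wire record for bulk metadata shipping.  A client packs many
     * of these into one message; the server replays them in sequence order. */
    enum md_op { MD_MKDIR, MD_CREATE, MD_UNLINK, MD_RENAME, MD_SETATTR };

    struct md_replay_rec {
        uint64_t seq;            /* replay order / dependency tracking       */
        uint32_t op;             /* enum md_op                               */
        uint32_t uid, gid, mode;
        uint64_t parent_ino;     /* directory the operation applies to       */
        uint16_t name_len;
        uint16_t target_len;     /* second name for MD_RENAME, else 0        */
        /* followed by name bytes, then target bytes, in the message payload */
    };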

SLIDE 29

Role of containers

That different tiers perform data movement at similar granularity is fundamentally unlikely.

Containers are a must-have.

SLIDE 30

Example Container Functionality


Diagram of example container functionality: a ZFS pool with a base layer (Layer 1) ZFS file system; container layers built from ZFS snapshots and clones; serialized differentials and an analytics differential feeding container analytics; the figure distinguishes the application interface, the implementation, and the slower-tier interface.

SLIDE 31

Containers as distributed namespace

  • Requires being able to locate the container
  • Location database: a subtree resides on a node (a lookup sketch follows this list)
  • Performance will scale well as long as containers can be large enough
  • Fragmented vs. co-located metadata: local node performance x #nodes
  • Related to STT trees, but not identical; CMU published a series of papers on this
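As a minimal sketch of the location-database idea (the table contents and names are made up for illustration), locating a container is a longest-prefix match from a path to the node owning the deepest enclosing subtree:

    #include <stdio.h>
    #include <string.h>

    /* Hypothetical location table: each container is a subtree rooted at a
     * path prefix and resides entirely on one node. */
    struct location { const char *subtree; const char *node; };

    static const struct location loc_db[] = {
        { "/",                 "md-node-0" },
        { "/projects",         "md-node-1" },
        { "/projects/climate", "md-node-2" },
    };

    /* Longest-prefix match: the deepest subtree containing `path` wins.
     * (Simplified: no path-component boundary check.) */
    const char *locate_container(const char *path)
    {
        const char *best = NULL;
        size_t best_len = 0;
        for (size_t i = 0; i < sizeof loc_db / sizeof loc_db[0]; i++) {
            size_t n = strlen(loc_db[i].subtree);
            if (strncmp(path, loc_db[i].subtree, n) == 0 && n >= best_len) {
                best = loc_db[i].node;
                best_len = n;
            }
        }
        return best;
    }

    int main(void)
    {
        /* Prints md-node-2: the deepest enclosing container wins. */
        printf("%s\n", locate_container("/projects/climate/run42/output.h5"));
        return 0;
    }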

SLIDE 32

Other approaches / key unsolved problems

Other approaches:

  • Peer-to-peer metadata protocols: LANL scaled them to 1B file creates / sec (in an experiment)
  • Allowing conflicts

Key unsolved problems:

  • Distributed namespace consistency: an "epoch" approach tracking dependent updates should work
  • There is little understanding of fragmented vs. contiguous MD

SLIDE 33

Conclusions

  • Campaign Storage: a bulk data store and archive, with a focus on data movement
  • Massive data handling at the file level is important: Amazon introduced S3FS; Dropbox and Gdrive rule
  • Search and batch metadata movement are key ingredients
  • Richer APIs or a DSL could create a better ecosystem
  • campaignstorage.com

SLIDE 34

Thank you
