

Slide 1

DeltaFS Indexed Massive Directory
Software-Defined Storage for Fast Queries

PDSW-DISCS 2017

Qing Zheng, George Amvrosiadis, Saurabh Kadekodi, Michael Kuchnik, Chuck Cranor, Garth Gibson (Carnegie Mellon University)
Brad Settlemyer, Gary Grider, Fan Guo (Los Alamos National Laboratory, LANL)

Slide 2

Key features

  • 1. Requires no dedicated resources
  • 2. Almost no post-processing needed
  • 3. Low I/O overhead

Slide 3

Target workloads

  • 1. Data-intensive HPC simulations
  • 2. Not designed for indexing checkpoints
  • 3. Settings where I/O bandwidth is limited

Slide 4

Agenda

Part 1 – Motivation
Part 2 – In-situ indexing design
Part 3 – API, LANL VPIC integration
Conclusion

Slide 5

Existing HPC systems build indexes during post-processing

Queries are delayed until post-processing is done (5-20% of simulation time)

[Figure: the app writes raw output to temporary files on Lustre (1); a post-processing step builds the indexes (2); only then can queries run (3).]

Slide 6

Problem faced: the increasing time-to-science

Due to the growing gap between compute and I/O, and inefficient support for small data

[Figure: timeline from simulation start to query finish.]

Slide 7

In-transit processing indexes data while it is written to storage

But it needs separate resources (e.g. a MapReduce cluster) for sorting and indexing

[Figure: the app writes to temporary files on Lustre; a MapReduce-style indexing service processes the data in transit before queries run.]

Slide 8

In-situ indexing runs directly on app nodes, using app resources

No need for a separate indexing cluster

[Figure: the app and the indexing code share the compute nodes; data + index go straight to Lustre, ready for queries.]

Slide 9

Key idea: reuse storage write-back buffering and idle CPU cycles for in-situ indexing

Slide 10

Example app: LANL VPIC

Each VPIC process simulates millions of particles; each particle record is 40 bytes

Particles move across processes during a simulation

During the run: small random writes. After the run: highly selective queries
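
The slides do not spell out the 40-byte record. A hypothetical C layout that adds up to 40 bytes (the field names and choices are assumptions for illustration, not VPIC's actual output format):

    #include <stdint.h>

    /* Hypothetical 40-byte particle record; fields are illustrative
     * assumptions, not VPIC's real format. */
    struct particle {
        uint64_t id;           /*  8 bytes: global particle ID (the query key) */
        float    x, y, z;      /* 12 bytes: position                           */
        float    ux, uy, uz;   /* 12 bytes: momentum                           */
        float    weight;       /*  4 bytes: particle weight                    */
        uint32_t step;         /*  4 bytes: timestep of this record            */
    };                         /* total: 40 bytes                              */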

Slide 11

With file-per-process output, fetching a single particle trajectory costs TBs of I/O

[Figure: simulation processes each write one output file ('file-per-process'); a given particle's records (e.g. A) are scattered across the files, so one trajectory query is a TB-scale search.]

Slide 12

5,000x faster than baseline with DeltaFS in-situ indexing

[Figure: time for reading a single particle trajectory from a 10 TB, 48-billion-particle dataset, log-scale seconds; DeltaFS (w/ 1 CPU core) vs. baseline (full-system parallel scan w/ 3K CPU cores).]

Slide 13

Part II
System design: lightweight in-situ indexing

  • 1. Tiny memory footprint
  • 2. Zero write amplification
  • 3. No read-back

Slide 14

Resource-efficient indexing via log-structured I/O

Tiny memory footprint; full storage bandwidth utilization

[Figure: inside each app process, the app thread fills a write-back buffer; an indexing thread drains it, appending a data log and its index to Lustre.]
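
A minimal sketch of this write path, assuming a double buffer handed off to a background indexing thread (all names are hypothetical and error handling is elided; DeltaFS's real internals differ):

    #include <pthread.h>
    #include <stdio.h>
    #include <string.h>

    #define BUF_CAP (4u << 20)            /* 4 MB write-back buffer */

    struct buf { char data[BUF_CAP]; size_t used; };

    static struct buf bufs[2];
    static int active = 0;                /* buffer the app thread fills          */
    static int sealed = -1;               /* buffer awaiting indexing, -1 if none */
    static pthread_mutex_t mu = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  cv = PTHREAD_COND_INITIALIZER;

    /* Indexing thread: drain sealed buffers, appending data (and, in a
     * real system, index entries and filters) to per-process logs. */
    static void *indexer(void *arg) {
        FILE *data_log  = fopen("data.log",  "ab");
        FILE *index_log = fopen("index.log", "ab");
        (void)arg; (void)index_log;       /* index writing elided below */
        for (;;) {
            pthread_mutex_lock(&mu);
            while (sealed < 0) pthread_cond_wait(&cv, &mu);
            struct buf *b = &bufs[sealed];
            pthread_mutex_unlock(&mu);

            fwrite(b->data, 1, b->used, data_log);  /* append to the data log  */
            /* ...sort keys, emit index block + filter to index_log here...    */
            b->used = 0;

            pthread_mutex_lock(&mu);
            sealed = -1;                  /* hand the buffer back */
            pthread_mutex_unlock(&mu);
            pthread_cond_signal(&cv);
        }
        return NULL;
    }

    /* App thread: append a record; seal and swap buffers when full,
     * blocking only if the indexer still owns the other buffer. */
    void imd_append(const void *rec, size_t len) {
        struct buf *b = &bufs[active];
        if (b->used + len > BUF_CAP) {
            pthread_mutex_lock(&mu);
            while (sealed >= 0) pthread_cond_wait(&cv, &mu);
            sealed = active;              /* give the full buffer away */
            active ^= 1;
            pthread_mutex_unlock(&mu);
            pthread_cond_signal(&cv);
            b = &bufs[active];
        }
        memcpy(b->data + b->used, rec, len);
        b->used += len;
    }

The app would spawn indexer once with pthread_create at startup; indexing then overlaps with computation, which is what lets it run on idle CPU cycles.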

Slide 15

LSM-trees compact all the time, but we can't afford that

We must keep I/O overhead low, at 10-20% (a simulation alternates compute and I/O phases)

Compaction easily causes 1000% I/O overhead by re-reading and re-writing previously written data
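
A back-of-the-envelope way to see where a 1000% figure can come from (my arithmetic, not the paper's derivation): if compaction re-reads and re-writes each byte of a raw data volume D an average of r times over the run, the extra device traffic relative to the raw writes is

    \[
      \frac{\text{extra I/O}}{\text{raw writes}} = \frac{rD + rD}{D} = 2r
    \]

so r = 5 rounds of compaction already mean 1000% overhead, far beyond the 10-20% budget above.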

Slide 16

In-situ indexing via aggressive data partitioning

An all-to-all shuffle routes each particle's records (A-F) to a fixed partition, bounding the amount of data a query must read per timestep (see the sketch below)

[Figure: app processes #0, #1, #2, ... exchange records in an all-to-all shuffle during each I/O phase, so keys A-F end up clustered by partition.]
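
A minimal sketch of the shuffle's routing rule, assuming a simple hash partitioner (the hash choice and all names here are mine; DeltaFS's actual partitioner may differ). Every record of a given particle hashes to the same rank, so each trajectory lands fully clustered on one process:

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* FNV-1a: a small, well-known string hash; used only for illustration. */
    static uint32_t fnv1a(const char *key, size_t len) {
        uint32_t h = 2166136261u;
        for (size_t i = 0; i < len; i++) {
            h ^= (unsigned char)key[i];
            h *= 16777619u;
        }
        return h;
    }

    /* Pick the destination rank for one record of the named particle. */
    int shuffle_dest(const char *particle_name, int nranks) {
        return (int)(fnv1a(particle_name, strlen(particle_name)) % (uint32_t)nranks);
    }

During each I/O phase, a process forwards the record to shuffle_dest(name, nprocs), e.g. over RPC, instead of writing it locally.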

Slide 17

In-situ indexing as a file system library component

No dedicated cluster needed

[Figure: app data enters a shuffle sender, crosses the all-to-all shuffle, and arrives at a shuffle receiver, which fills a WriteBuffer; the buffer is drained into a Data Log (packed data blocks) and an Index Log (index blocks plus filters).]
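
A hypothetical C rendering of that on-storage layout (sizes and field names are assumptions made for illustration):

    #include <stdint.h>

    /* The data log is a sequence of packed data blocks; the index log
     * keeps, for each data block, its extent in the data log, its key
     * range, and a small filter so queries can skip blocks unread. */
    struct block_handle {
        uint64_t offset;      /* byte offset of the data block in the data log */
        uint32_t size;        /* length of the data block in bytes             */
    };

    struct index_entry {
        struct block_handle block;
        char    min_key[16]; /* smallest filename (key) in the block      */
        char    max_key[16]; /* largest filename (key) in the block       */
        uint8_t filter[64];  /* e.g. a Bloom filter over the block's keys */
    };

A query scans the small index log first, tests each key range and filter, and fetches only the data blocks that may contain the requested filename.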

Slide 18

Part III
Programming interface: Indexed Massive Directory (IMD)

In-situ indexing is keyed on filenames: mkdir("./particles", DELTAFS_IMD)

Slide 19

How to use an Indexed Massive Dir (IMD)

  • 1. Data searched together goes into a single IMD file
       e.g. one file for each particle
  • 2. Create as many IMD files as you want
       e.g. 1 trillion files for 1 trillion particles

Query your data by "open-read-close" (sketched below)
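
A sketch of both halves in C. Only mkdir("./particles", DELTAFS_IMD) is taken from the slides; the rest assumes the app is linked against the DeltaFS library, which serves these POSIX-style calls and defines DELTAFS_IMD. All other names are illustrative:

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/stat.h>
    #include <unistd.h>

    #ifndef DELTAFS_IMD
    #define DELTAFS_IMD 0755  /* placeholder so this sketch compiles stand-alone;
                                 the real flag is assumed to come from DeltaFS */
    #endif

    #define REC_SIZE 40       /* one 40-byte particle record per timestep */

    /* Write path: one IMD file per particle, appended to every timestep. */
    static void append_record(unsigned long long id, const char rec[REC_SIZE]) {
        char path[64];
        snprintf(path, sizeof(path), "./particles/%016llx", id);
        int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
        if (fd >= 0) { write(fd, rec, REC_SIZE); close(fd); }
    }

    /* Query path: plain open-read-close returns the whole trajectory. */
    static long read_trajectory(unsigned long long id, char *out, size_t cap) {
        char path[64];
        snprintf(path, sizeof(path), "./particles/%016llx", id);
        int fd = open(path, O_RDONLY);
        if (fd < 0) return -1;
        long n = (long)read(fd, out, cap);   /* all timesteps, clustered */
        close(fd);
        return n;
    }

    int main(void) {
        /* From the slides: mark the directory as an Indexed Massive Dir. */
        mkdir("./particles", DELTAFS_IMD);

        char rec[REC_SIZE] = {0};
        append_record(0x4c1fULL, rec);       /* simulation side */

        static char traj[1 << 20];
        long n = read_trajectory(0x4c1fULL, traj, sizeof(traj));  /* query side */
        printf("trajectory bytes: %ld\n", n);
        return 0;
    }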

Slide 20

VPIC using DeltaFS IMD

[Figure: simulation processes write one IMD file per VPIC particle ('file-per-particle', up to 1T files); the Indexed Massive Directory clusters each particle's records into data objects, with index objects alongside, so a trajectory query searches MBs instead of TBs.]

Slide 21

LANL Trinity experiments

32 cores per compute node; 1-99 compute nodes; 496 million to 48 billion particles

[Figure: VPIC-Baseline buffers writes and stores plain output on the SSD burst-buffer and HDD Lustre tiers; VPIC-DeltaFS adds DeltaFS indexing next to the buffer on the compute nodes, so queries run with no post-processing.]

Slide 22

Query time (sec, log scale) vs. simulation size: baseline (full-system parallel scan) vs. DeltaFS (w/ 1 CPU core)

  Nodes   Particles (millions)   DeltaFS speedup
  1       496                    245x
  2       992                    665x
  4       1,984                  532x
  8       3,968                  625x
  16      7,936                  992x
  33      16,368                 2221x
  66      32,736                 4049x
  99      49,104                 5112x

Slide 23

I/O time per dump (sec) vs. simulation size: DeltaFS's indexing overhead is high for tiny simulations but fades for bigger ones

  Nodes   Particles (millions)   DeltaFS I/O time vs. baseline
  1       496                    9.63x
  2       992                    4.78x
  4       1,984                  2.42x
  8       3,968                  1.56x
  16      7,936                  1.29x
  33      16,368                 1.13x
  66      32,736                 1.15x
  99      49,104                 1.13x

Slide 24

Conclusion

In-situ indexing for transparent, almost-free query acceleration: no dedicated nodes, no post-processing, ~15% I/O overhead

  • Indexed Massive Dir (~3% app memory, compaction-free, POSIX API)
  • Powered by Mercury RPC
  • DeltaFS is one of the Mochi micro-services

https://mercury-hpc.github.io/
https://press3.mcs.anl.gov/mochi/
https://github.com/pdlfs/deltafs
https://mercury-hpc.github.io/ https://press3.mcs.anl.gov/mochi/ https://github.com/pdlfs/deltafs