Understanding Data Motion in the Modern HPC Data Center



SLIDE 1

Understanding Data Motion in the Modern HPC Data Center

Glenn K. Lockwood, Shane Snyder, Suren Byna, Philip Carns, Nicholas J. Wright

SLIDE 2

Scientific computing is more than compute!

[Diagram: the NERSC data center, showing compute nodes (CN) and I/O nodes (ION) on a storage fabric, storage nodes (SN) on a center-wide fabric, data transfer nodes and science gateways (GW) on the center-wide network, a tape archive, and a router to the WAN (ESnet).]

SLIDE 3

Goal: Understand data motion everywhere

[Diagram: the same NERSC data center architecture, with data motion to be understood across compute nodes, I/O nodes, storage nodes, data transfer nodes, science gateways, tape, and the WAN (ESnet).]

SLIDE 4

Our simplified model for data motion


[Diagram: three domains (External Facilities, Storage Systems, Compute Systems) connected by five transfer vectors: Compute-External, Storage-External, Compute-Storage, Storage-Storage, and Compute-Compute.]
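In code terms, the model is simply a classification of endpoints into three domains and of transfers into the vectors connecting them. Below is a minimal sketch in Python; the Domain enum and classify_vector helper are illustrative names, not part of the original analysis:

```python
from enum import Enum

class Domain(Enum):
    COMPUTE = "compute"     # compute nodes and their node-local memory/tmpfs
    STORAGE = "storage"     # parallel file systems, project/home, tape archive
    EXTERNAL = "external"   # other facilities reachable over the WAN

def classify_vector(src: Domain, dst: Domain) -> str:
    """Name the transfer vector connecting two endpoint domains.

    Direction is ignored here so that, e.g., Storage-Compute and
    Compute-Storage map to the same of the five vectors on the slide.
    """
    pair = {src, dst}
    if pair == {Domain.COMPUTE, Domain.STORAGE}:
        return "Compute-Storage"
    if pair == {Domain.COMPUTE, Domain.EXTERNAL}:
        return "Compute-External"
    if pair == {Domain.STORAGE, Domain.EXTERNAL}:
        return "Storage-External"
    if pair == {Domain.STORAGE}:
        return "Storage-Storage"
    if pair == {Domain.COMPUTE}:
        return "Compute-Compute"
    raise ValueError(f"unclassified endpoints: {src}, {dst}")

# Example: a job writing a checkpoint from node memory to a file system
print(classify_vector(Domain.COMPUTE, Domain.STORAGE))  # Compute-Storage
```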

SLIDE 5

Mapping this model to NERSC


[Diagram: NERSC systems mapped onto the three domains: External Facilities, Storage Systems, and Compute Systems.]

SLIDE 6

Relevant logs kicking around at NERSC


[Diagram: the three-domain model annotated with where each log source captures transfers.]

  • Globus logs – no remote storage system info
  • Darshan – data volumes come with caveats (harvesting sketched below)
  • HPSS logs – some remote storage system info missing
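As one example of harvesting these sources, a Darshan log can be reduced to per-file byte counts by post-processing the text output of the darshan-parser utility. This is a sketch only; the whitespace-delimited column layout assumed below varies across Darshan versions and should be checked against your own logs:

```python
import subprocess
from collections import defaultdict

def darshan_bytes_by_file(logfile):
    """Sum POSIX bytes read/written per file from one Darshan log.

    Assumes `darshan-parser` emits whitespace-delimited counter lines of the
    form: <module> <rank> <record id> <counter> <value> <file> ...
    Adjust the column indices for your Darshan version.
    """
    text = subprocess.run(["darshan-parser", logfile],
                          capture_output=True, text=True, check=True).stdout
    totals = defaultdict(lambda: {"read": 0, "written": 0})
    for line in text.splitlines():
        if line.startswith("#") or not line.strip():
            continue
        fields = line.split()
        if len(fields) < 6 or fields[0] != "POSIX":
            continue
        counter, value, path = fields[3], fields[4], fields[5]
        if counter == "POSIX_BYTES_READ":
            totals[path]["read"] += int(value)
        elif counter == "POSIX_BYTES_WRITTEN":
            totals[path]["written"] += int(value)
    return totals
```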

SLIDE 7

Normalizing data transfer records


[Diagram: an example Compute-Storage transfer between a compute system and a storage system.]

  Parameter                                 Example
  Source site, host, storage system         NERSC, Cori, System Memory
  Destination site, host, storage system    NERSC, Cori, cscratch1 (Lustre)
  Time of transfer start and finish         June 4 @ 12:28 – June 4 @ 12:32
  Volume of data transferred                34,359,738,368 bytes
  Tool that logged transfer                 Darshan, POSIX I/O module
  Owner of data transferred                 uname=glock, uid=69615
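A minimal sketch of such a normalized record as a Python data structure; the field names are illustrative, and the example values are the ones from the table above:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class TransferRecord:
    """One normalized data transfer, regardless of which tool logged it."""
    src_site: str
    src_host: str
    src_storage: str
    dst_site: str
    dst_host: str
    dst_storage: str
    start: datetime
    end: datetime
    bytes_moved: int
    tool: str           # e.g. "darshan/posix", "globus", "hpss"
    owner_uid: int
    owner_uname: str

record = TransferRecord(
    src_site="NERSC", src_host="Cori", src_storage="System Memory",
    dst_site="NERSC", dst_host="Cori", dst_storage="cscratch1 (Lustre)",
    start=datetime(2019, 6, 4, 12, 28), end=datetime(2019, 6, 4, 12, 32),
    bytes_moved=34_359_738_368,
    tool="darshan/posix", owner_uid=69615, owner_uname="glock",
)
```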

SLIDE 8

What is possible with this approach?


May 1 – August 1, 2019

  • 194 million transfers
  • 78.6 PiB data moved

[Diagram: a graph of data motion among the Outside World, HPSS, Gateway DTNs, Cori, cscratch, home, project, the burst buffer, Science Gateways, and the archive; edges are binned from ≤ 8 TiB/day up to > 128 TiB/day.]

SLIDE 9

Visualizing data motion as a graph

  • Job I/O is the most voluminous
  • Home file system usage is the least voluminous
  • The burst buffer is read-heavy
  • Users prefer to access the archive directly from Cori rather than through the DTNs

[Diagram: the same data-motion graph, with edge weights binned from ≤ 8 TiB/day up to > 128 TiB/day.]
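A hedged sketch of how this graph view could be assembled from aggregated transfer volumes using networkx; the endpoint names and daily volumes below are placeholders, and the binning mirrors the legend above:

```python
import networkx as nx

# Daily-average volumes (bytes/day) between endpoints, aggregated from the
# normalized transfer records; these numbers are placeholders.
daily_volume = {
    ("Cori", "cscratch"): 300 * 2**40,       # falls in the > 128 TiB/day bin
    ("DTNs", "Outside World"): 20 * 2**40,   # falls in the > 16 TiB/day bin
    ("Cori", "archive"): 5 * 2**40,          # falls in the <= 8 TiB/day bin
}

def tib_per_day_bin(volume_bytes):
    """Map a daily volume onto the legend's bins (8, 16, 32, 64, 128 TiB/day)."""
    tib = volume_bytes / 2**40
    for threshold in (128, 64, 32, 16, 8):
        if tib > threshold:
            return f"> {threshold} TiB/day"
    return "<= 8 TiB/day"

g = nx.DiGraph()
for (src, dst), vol in daily_volume.items():
    g.add_edge(src, dst, volume=vol, bin=tib_per_day_bin(vol))

for src, dst, attrs in g.edges(data=True):
    print(f"{src} -> {dst}: {attrs['bin']}")
```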

SLIDE 10

Mapping this data to our model


[Diagram: the nodes of the data-motion graph mapped onto the model’s domains (External Facilities, Storage Systems, Compute Systems) and vectors (Storage-External, Compute-Storage, Storage-Storage).]

SLIDE 11

Adding up data moved along each vector

  • Job I/O is significant
  • Inter-tier I/O is significant
    – I/O outside of jobs ~ job write traffic
    – Fewer tiers, fewer tears
  • HPC I/O is not just checkpoint-restart!

[Bar chart: total data moved along each transfer vector (compute↔storage, storage↔storage, WAN↔storage); x-axis: Data Transferred (TiB), log scale from 512 GiB to 512 TiB.]

[Diagram: the three-domain model with the Storage-External, Compute-Storage, and Storage-Storage vectors highlighted.]
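Summing along each vector is a group-by over the normalized transfer records. Here is a sketch with pandas, assuming each record has already been tagged with its vector; the volumes are placeholders, not the measured values:

```python
import pandas as pd

# Illustrative normalized records; in practice these come from Darshan,
# Globus, and HPSS logs joined into one table.
records = pd.DataFrame([
    {"vector": "Compute-Storage", "direction": "write", "bytes": 400 * 2**40},
    {"vector": "Storage-Compute", "direction": "read",  "bytes": 250 * 2**40},
    {"vector": "Storage-Storage", "direction": "copy",  "bytes": 90 * 2**40},
    {"vector": "WAN-Storage",     "direction": "in",    "bytes": 30 * 2**40},
])

# Total TiB moved along each vector over the study period
per_vector_tib = records.groupby("vector")["bytes"].sum() / 2**40
print(per_vector_tib.sort_values(ascending=False).round(1))
```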

SLIDE 12

Examining non-job I/O patterns

[CDF plot: cumulative fraction of total transfers vs. size of transfer, 1 byte to 1 PiB on a log scale, with one curve each for Globus transfers, Darshan transfers, HPSS transfers, and files at rest.]

  • Hypothesis: non-job I/O is poorly formed
    – Job I/O: optimized
    – Others: fire-and-forget
  • Users transfer larger files than they store (good)
  • Archive transfers are largest (good)
  • WAN transfers are smaller than job I/O files (less good)


[Sizes annotated on the plot: 31 KiB, 720 KiB, 1,800 KiB, 47,000 KiB.]
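Cumulative distributions like the ones above can be drawn from per-transfer size lists with a few lines of numpy/matplotlib; the lognormal samples below are placeholders standing in for the real Darshan and Globus size lists:

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_cdf(sizes_bytes, label):
    """Plot the cumulative fraction of transfers at or below each size."""
    sizes = np.sort(np.asarray(sizes_bytes))
    frac = np.arange(1, len(sizes) + 1) / len(sizes)
    plt.step(sizes, frac, where="post", label=label)

# Placeholder samples standing in for the real per-transfer size lists
rng = np.random.default_rng(0)
plot_cdf(rng.lognormal(mean=10, sigma=3, size=10_000), "Darshan transfers")
plot_cdf(rng.lognormal(mean=14, sigma=3, size=1_000), "Globus transfers")

plt.xscale("log")
plt.xlabel("Size of transfer (bytes)")
plt.ylabel("Cumulative fraction of total transfers")
plt.legend()
plt.show()
```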

SLIDE 13

A few users accounted for most of the transfers

  • 1,562 unique users
  • Top 4 users = 66% of volume transferred
  • Users 5-8 = 5.8%
    – All used multiple transfer vectors
    – Henry is a storage-only user

[CDF plot: cumulative fraction of total volume transferred vs. size of transfer (1 MiB to 1 PiB, log scale), one curve per user: Amy/Darshan, Bob/Darshan, Carol/Darshan, Dan/Darshan, Eve/Darshan+Globus+HPSS, Frank/Darshan+Globus, Gail/Darshan+Globus, Henry/HPSS.]
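The per-user breakdown comes from grouping the same records by owner. A sketch (pandas, placeholder values) of computing each user's share of the total volume and the cumulative share of the top users:

```python
import pandas as pd

# Placeholder per-transfer records; "user" comes from the owner field of the
# normalized transfer records.
records = pd.DataFrame({
    "user":  ["Amy", "Amy", "Bob", "Carol", "Dan", "Eve", "Frank"],
    "bytes": [30e15, 10e15, 15e15, 8e15, 6e15, 2e15, 1e15],
})

by_user = records.groupby("user")["bytes"].sum().sort_values(ascending=False)
share = by_user / by_user.sum()
print(share.head(4).sum())   # fraction of total volume moved by the top 4 users
print(share.cumsum())        # cumulative share, most to least active
```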

SLIDE 14

Examining transfers along many dimensions

  • Break down transfers by r/w and file system
  • Top users are read-heavy
    – Rereading the same files
    – Targeting cscratch (Lustre)

[Stacked bar chart: bytes written and read (0 bytes to 768 TiB), broken down by user (Amy, Bob, Carol, Dan, other users) and by file system (tmpfs, burst buffer, cscratch, homes, project).]

SLIDE 15

Tracing workflows using users, volumes, and directions

  • Correlating reveals workflow coupling
    – S-S precedes C-S/S-C
    – 2:1 R/W ratio during the job
    – Data reduction of archived data
  • This was admittedly an exceptional case

[Time series (Apr 29 – Jul 29, 2019): daily Compute-Storage, Storage-Compute, and Storage-Storage traffic, 0–2 TiB per day, for the traced workflow.]
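Coupling like this becomes visible once one user's traffic is binned into daily totals per vector and the resulting series are laid side by side. A sketch of that resampling step with pandas; the column names (start, vector, bytes) are assumptions about the normalized record table:

```python
import pandas as pd

def daily_traffic_by_vector(records: pd.DataFrame) -> pd.DataFrame:
    """Pivot one user's transfer records into daily TiB moved per vector.

    `records` is assumed to have columns: start (datetime64), vector (str),
    bytes (int). The result has one row per day and one column per vector.
    """
    daily = (records
             .groupby(["vector", pd.Grouper(key="start", freq="1D")])["bytes"]
             .sum()
             .unstack("vector", fill_value=0))
    return daily / 2**40   # bytes -> TiB

# daily = daily_traffic_by_vector(one_users_records)
# daily.plot(subplots=True)   # stacked panels like the figure above
```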

SLIDE 16

Is this the full story?

Quantify the volume of transfers not captured

  • Compare volume transferred to system monitoring (storage systems)
  • Compare bytes in to bytes out (transfer nodes)

[Diagram: the data-motion graph of NERSC systems (Outside World, HPSS, Gateway DTNs, Cori, cscratch, home, project, burst buffer, Science Gateways, archive).]

SLIDE 17

Not every data transfer was captured

  • 100% of the true data volume should be captured by transfers
  • Lots of data is missing; why?

[Bar chart: percent of true data volume captured by transfers (20–100%) for cscratch, project, Burst Buffer, archive, and Outside World, split into In/Write and Out/Read.]

    – Darshan logs not generated; cp missing
    – Globus-HPSS adapter logs absent
    – Only Globus logged; rsync/bbcp absent

SLIDE 18

Identifying leaky transfer nodes

  • Incongruency (Δ) – see the sketch below
    – data in vs. data out
    – a figure of merit (FOM) for how “leaky” a node is
    – Δ = 0 means all bytes in = all bytes out
  • Cori: expect Δ >> 0 because jobs generate data
  • Science Gateways: Δ > 0 because ???

[Bar chart: incongruency for Cori (1.27), HPSS (0.137), Gateway DTNs (0.018), and Science Gateways (0.613).]
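The slide only constrains the metric (Δ = 0 when bytes in equal bytes out, larger for leakier nodes), so the following is one plausible formulation rather than the paper's actual definition; note that the measured values above exceed 1, so the real definition evidently normalizes differently:

```python
def incongruency(bytes_in: float, bytes_out: float) -> float:
    """A figure of merit for how 'leaky' a transfer node is.

    Illustrative formulation only: the absolute difference between bytes in
    and bytes out, normalized by the larger of the two, so that 0 means
    everything that came in went out (and vice versa).
    """
    if bytes_in == 0 and bytes_out == 0:
        return 0.0
    return abs(bytes_in - bytes_out) / max(bytes_in, bytes_out)

# e.g. a transfer node that forwarded almost everything it received:
print(incongruency(bytes_in=100e12, bytes_out=98e12))  # ~0.02
```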

SLIDE 19

Towards Total Knowledge of I/O


[Diagram: the data-motion graph of NERSC systems (Outside World, HPSS, Gateway DTNs, Cori, cscratch, home, project, burst buffer, Science Gateways, archive).]

  • New profiling tools to capture I/O from other transfer tools (bbcp, scp, etc.)
  • Better insight into what is happening inside Docker containers
  • More robust collection of job I/O data; cache-aware I/O data (LDMS)
  • Improve the analysis process to handle complex transfers

SLIDE 20

There’s more to HPC I/O than job I/O

  • Inter-tier I/O is too significant to ignore
    – need better monitoring of data transfer tools
    – users benefit from fewer tiers and strong connectivity between tiers
    – need to optimize non-job I/O patterns
  • Transfer-centric approaches yield new holistic insight into workflow I/O behavior
    – Possible to trace user workflows across a center
    – Humans in the loop motivate more sophisticated methods
SLIDE 21

This material is based upon work supported by the U.S. Department of Energy, Office of Science, under contracts DE-AC02-05CH11231 and DE-AC02-06CH11357. This research used resources and data generated from resources of the National Energy Research Scientific Computing Center, a DOE Office of Science User Facility supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231, and the Argonne Leadership Computing Facility, a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357.

We gratefully acknowledge the support of:

  • Damian Hazen (NERSC)
  • Kristy Kallback-Rose (NERSC)
  • Nick Balthaser (NERSC)
  • Lisa Gerhardt (NERSC)
  • Ravi Cheema (NERSC)
  • Jon Dugan (ESnet)
  • Eli Dart (ESnet)

We’re hiring!

SLIDE 22

A few users account for most of the transfers


[CDF plot: cumulative fraction of total volume transferred vs. size of transfer (1 MiB to 1 PiB, log scale), one curve per user: Amy/Darshan, Bob/Darshan, Carol/Darshan, Dan/Darshan, Eve/Darshan+Globus+HPSS, Frank/Darshan+Globus, Gail/Darshan+Globus, Henry/HPSS.]

[Histogram: number of users (10 to 350) against percent of total volume transferred (10% down to 0.0001%).]

SLIDE 23

Regularity of user I/O coupling

  • MUTC (one possible construction is sketched below)
    – how correlatable a user’s I/O is across all vectors
    – how easily we can guess what a user’s workflow is doing
  • Strongest correlation is only between job reads and job writes
  • “Excluding C-S/S-C” only shows workflows with storage-storage or storage-WAN activity

[Histogram: fraction of users (0.0–0.6) vs. mean user correlation coefficient (roughly -1.00 to 1.00), shown for all vectors and for excluding C-S/S-C vectors.]

1,123 users represented in “all vectors”; 486 users represented in “excluding C-S/S-C vectors”.
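A hedged sketch of one way such a per-user score could be constructed: correlate each pair of the user's daily traffic series (one per vector) and average the Pearson coefficients. This is an assumption about the metric, not its published definition:

```python
import itertools
import pandas as pd

def mean_user_correlation(daily: pd.DataFrame) -> float:
    """Average pairwise Pearson correlation between a user's traffic vectors.

    `daily` has one row per day and one column per vector (e.g. the output of
    the resampling sketch earlier); vectors that never saw traffic are dropped.
    """
    active = daily.loc[:, daily.sum() > 0]
    pairs = list(itertools.combinations(active.columns, 2))
    if not pairs:
        return float("nan")
    coeffs = [active[a].corr(active[b]) for a, b in pairs]
    return sum(coeffs) / len(coeffs)
```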