Comparative I/O Workload Characterization of Two Leadership Class Storage Clusters


SLIDE 1

ORNL is managed by UT-Battelle for the US Department of Energy

Comparative I/O Workload Characterization of Two Leadership Class Storage Clusters

Presented by Sarp Oral

Raghul Gunasekaran, Sarp Oral, Jason Hill, Ross Miller, Feiyi Wang, Dustin Leverman
Oak Ridge Leadership Computing Facility

SLIDE 2

Oak Ridge Leadership Computing Facility

q Design and operate compute and data resources for the most computationally challenging science problems. q Deliver science and transforming discoveries in materials, biology, climate, energy technologies, and basic sciences. q 250+ research organizations, university and industry participants. q Over 500+ active scientific users

SLIDE 3

Oak Ridge Leadership Computing Facility

  • Compute resources

– TITAN, primary compute platform, 18,688 compute clients
– EOS, Cray XC30 compute platform, 736 compute nodes
– Rhea, data analysis cluster, 512-node commodity cluster
– Everest, visualization cluster

  • Spider Storage System

– 32 PB, 1+ TB/s; data resource for OLCF computational needs
– Lustre parallel file system
– Center-wide shared storage resource for all OLCF resources
– Resilient to system failures, both internal to the storage system and in the attached computational resources

SLIDE 4

OLCF System Architecture

  • Enterprise storage: controllers and large racks of disks are connected via InfiniBand; 36 DataDirect SFA12K-40 controller pairs with 2 TB NL-SAS drives and 8 InfiniBand FDR connections per pair.
  • Storage nodes: run the parallel file system software and manage incoming file system traffic; 288 Dell servers with 64 GB of RAM each.
  • SION II network: provides connectivity between OLCF resources and primarily carries storage traffic; 1,600-port, 56 Gbit/s InfiniBand switch complex.
  • Lustre router nodes: run the parallel file system client software and forward I/O operations from HPC clients; 432 XK7 XIO nodes on Titan configured as Lustre routers.
  • Clients: Titan (XK7) and other OLCF resources.
  • Link speeds: XK7 Gemini 3D torus, 9.6 GB/s per direction; InfiniBand, 56 Gbit/s; Serial ATA, 6 Gbit/s.

Figure reference: S. Oral, et al., "OLCF's 1 TB/s, next-generation Lustre file system," in Proceedings of the Cray User Group Conference, 2013.

SLIDE 5

Spider 2 System

  • Deployed 2014
  • Max bandwidth: 1.4 TB/s read and 1.2 TB/s write
  • 36 DDN SFA12K couplets
  • Two namespaces: Atlas1 and Atlas2

– 18 couplets each, no shared hardware
– Purpose: load balancing and capacity management

  • Why a couplet?

– Failover configuration
– Bottleneck: ICL (Inter-Controller Link)

  • Designed for mixed random I/O workload

– Non-sequential read and write I/O patterns

Figure: Spider couplet setup: two SFA12K controllers connected by the ICL (Inter-Controller Link), with host ports to the OSSs and attached disk enclosures.
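A quick back-of-the-envelope check (my own arithmetic, not from the slides) relates the quoted peak read bandwidth to the couplet count: 1.4 TB/s spread over 36 couplets is roughly 40 GB/s per couplet, in line with the SFA12K-40 product designation.

    # Sketch: per-couplet share of the quoted aggregate read bandwidth.
    # Assumption: bandwidth is evenly spread across couplets (ignores imbalance).
    aggregate_read_tb_s = 1.4        # quoted peak read bandwidth, TB/s
    couplets = 36                    # DDN SFA12K couplets in Spider 2
    per_couplet_gb_s = aggregate_read_tb_s * 1000 / couplets
    print(f"~{per_couplet_gb_s:.1f} GB/s per couplet")   # ~38.9 GB/s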

SLIDE 6

Spider File System - Comparison

                     Spider 1         Spider 2
Years                2008 – 2014      2014 onwards
Bandwidth            240 GB/s         1+ TB/s
Capacity             10 PB            32 PB
RAID controller      DDN S2A9900      DDN SFA12KX
Disk type            SATA             Near-line SAS
Number of disks      13,440           20,160
Connectivity         IB DDR           IB FDR
Number of OSTs       1,344            2,016
Number of OSSs       192              288
Lustre version       1.8              2.5
Disk redundancy      RAID 6 (8+2)     RAID 6 (8+2)

SLIDE 7

Workload Comparison: Spider 1 vs. Spider 2

Primary Compute Platform: What changed?

  • 2.3 Petaflop Jaguar → 27 Petaflop Titan
  • CPU → CPU + GPU
  • Memory: 300 → 710 TeraBytes
  • 3D torus interconnect bandwidth: 3.2 GB/s → 10.4 GB/s
  • I/O router nodes: 192 → 440

What did not change?

  • # of compute clients: 18,688
  • Spider architecture (just scaled up)
SLIDE 8

Workload Data

  • Collected from the DDN RAID controllers using ddntool, a custom tool developed at ORNL
  • Periodic polling: read/write bandwidth and IOPS, request size and latency data.
  • Spider 1 data from 2010 (Jan – June); Spider 2 data from 2015 (April – August)

Characterization Metrics

  • I/O Access (Read vs Write)
  • Peak bandwidth utilization
  • I/O Bandwidth usage trends
  • Request size distribution
  • Service latency distribution

Workload Characterization

Figure: Data collection system: ddntool runs on a management server, polls the DDN SFA12KX controllers, and stores the results in a MySQL database.
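The slides only name ddntool and MySQL; as an illustration, here is a minimal Python sketch of this kind of periodic collector. The controller query function (query_controller) and the table layout are hypothetical stand-ins, and SQLite is used in place of MySQL so the sketch stays self-contained.

    import sqlite3, time

    POLL_INTERVAL_S = 2  # assumption: a polling period of a few seconds

    def query_controller(host):
        # Hypothetical stand-in for asking one DDN controller for its current
        # counters (ddntool's real query mechanism is not described in the slides).
        return {"read_mb_s": 0.0, "write_mb_s": 0.0, "iops": 0}

    def collect(controllers, db_path="workload.db"):
        db = sqlite3.connect(db_path)
        db.execute("""CREATE TABLE IF NOT EXISTS samples (
                          ts REAL, controller TEXT,
                          read_mb_s REAL, write_mb_s REAL, iops INTEGER)""")
        while True:
            now = time.time()
            for host in controllers:
                s = query_controller(host)
                db.execute("INSERT INTO samples VALUES (?, ?, ?, ?, ?)",
                           (now, host, s["read_mb_s"], s["write_mb_s"], s["iops"]))
            db.commit()
            time.sleep(POLL_INTERVAL_S)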

SLIDE 9

Read vs Write

Spider 1: ~60% of all I/O is write.
Spider 2: >75% of all I/O is write.

Figure: Write percentage per DDN couplet (1–36) and per DDN controller (1–48).

SLIDE 10

Peak Bandwidth Utilization

Spider 1
  • ~90% for read
  • Only 50% for write

Spider 2
  • ~80% for read
  • ~75% for write

Reasons for Spider 2's higher write utilization:
  • Larger request sizes
  • Write-back cache enabled

Figure: Maximum observed read and write bandwidth as a fraction of peak, per DDN controller (1–48) and per DDN couplet (1–36).

SLIDE 11

Spider 2 - Bandwidth Usage Trends

  • ~92% of the time, usage is less than 5 GB/s
  • This is expected
  • Most applications are compute-intensive
  • <5% of runtime is spent on I/O
  • Scientific applications' I/O is bursty

BURST BUFFER !!!!

~50% of our storage space is utilized on average, with
  • Data being purged periodically
  • Large file system idle time (<5 GB/s)

Figures: CDF of aggregate bandwidth (GB/s) and storage system usage (% of 32 PB) over a month.
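As an illustration of how a usage CDF like this is produced from the polled bandwidth samples, here is a small Python/NumPy sketch (my own example; the 92% / 5 GB/s figures above come from the slides, not from this code, and the sample series is synthetic).

    import numpy as np

    # bandwidth_gb_s: one aggregate-bandwidth sample per polling interval (GB/s).
    # A bursty-looking synthetic series stands in for real controller data.
    rng = np.random.default_rng(0)
    bandwidth_gb_s = np.where(rng.random(100_000) < 0.05,
                              rng.uniform(50, 250, 100_000),   # rare I/O bursts
                              rng.uniform(0, 5, 100_000))      # mostly near-idle

    # Empirical CDF: fraction of samples at or below each threshold.
    thresholds = np.array([5, 10, 50, 100, 150, 200, 250])
    for t in thresholds:
        print(f"P(bandwidth <= {t:>3} GB/s) = {(bandwidth_gb_s <= t).mean():.2f}")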

SLIDE 12

Request Size Distribution

  • Smallest measurable request size is 16 KB on Spider 1 and 4 KB on Spider 2
  • Large fraction of 512 KB requests on Spider 1
  • dm-multipath issue: breaks 1 MB requests into two 512 KB requests
  • deadline I/O request scheduler; migrated to the noop scheduler in 2011

Figures: PDF of request sizes (4 KB – 4 MB), read and write, for Spider 1 and Spider 2.
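For illustration, a minimal sketch of binning request sizes into the buckets used on this slide and normalizing to a PDF; the bucket edges are taken from the x-axis above, while the sample data is synthetic and requests larger than 4 MB are ignored.

    import numpy as np

    KB, MB = 1024, 1024 * 1024
    edges = np.array([4*KB, 8*KB, 16*KB, 32*KB, 64*KB, 128*KB, 512*KB, 1*MB, 2*MB, 4*MB])

    def request_size_pdf(request_sizes):
        # Fraction of requests falling at or below each bucket edge
        # (requests larger than the last edge are dropped in this sketch).
        idx = np.searchsorted(edges, request_sizes)
        counts = np.bincount(idx, minlength=len(edges))[:len(edges)]
        return counts / counts.sum()

    # Example with synthetic sizes as a placeholder for real controller data.
    sizes = np.random.default_rng(1).choice(edges, size=10_000)
    labels = ["4k", "8k", "16k", "32k", "64k", "128k", "512k", "1M", "2M", "4M"]
    print(dict(zip(labels, np.round(request_size_pdf(sizes), 2))))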

SLIDE 13

Request Service Latency Distribution

  • Service latency = queue time + disk I/O time
  • 90% of read requests and 80% of write requests are served in less than 16 ms
  • 16 ms is the smallest measurable latency unit on the DDN controllers

Figures: PDF and CDF of request service latency (16 ms – 1000 ms), read and write, for Spider 2.

SLIDE 14

Request Service Latency Distribution

  • Read-ahead cache disabled
– Mixed aggregate read workload is non-sequential
– Prefetching read blocks hurts performance (cache thrashing)

  • Write-back cache enabled (see the sketch after this list)
– ReACT (Real-time Adaptive Cache Technology)
– 1 MB data blocks are written directly to disk, with no caching on the peer controller
– <1 MB data blocks:
  • Cached and mirrored on both controllers
  • Grouped into a single large block write
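As an illustration of the write-back policy just described, here is a rough Python sketch of the decision logic. The class, method names, and buffering details are my own simplification of what the slide describes, not DDN's ReACT implementation.

    FULL_BLOCK = 1 * 1024 * 1024   # 1 MB

    class WritePath:
        # Toy model of the ReACT-style write-back decision described above.
        def __init__(self):
            self.small_write_buffer = []   # cached (and mirrored) sub-1 MB writes

        def submit(self, data: bytes):
            if len(data) >= FULL_BLOCK:
                self.write_to_disk(data)            # full blocks bypass the cache
            else:
                self.mirror_to_peer(data)           # small blocks are cached and mirrored
                self.small_write_buffer.append(data)
                buffered = sum(len(d) for d in self.small_write_buffer)
                if buffered >= FULL_BLOCK:          # group into one large block write
                    self.write_to_disk(b"".join(self.small_write_buffer))
                    self.small_write_buffer.clear()

        def write_to_disk(self, data: bytes):
            print(f"disk write: {len(data)} bytes")

        def mirror_to_peer(self, data: bytes):
            print(f"mirrored to peer controller cache: {len(data)} bytes")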

SLIDE 15

Conclusion

  • What is our next storage system (for Summit 100+petaflop) ?

– Simply scale up Spider 2? Not very likely!!!!
– But we will need a center-wide shared storage system like Spider
– Explore: burst buffer or an intermediate fast I/O cache layer

  • Expected I/O workload trends

– Increased write I/O
– Bursty, with identical or increased file system idle times
– Support for larger request sizes

  • Open Questions

– How does the next generation of compute platforms affect storage system design?
– Summit: 20+ → 100+ Petaflops, but scaling down from 18K to 4K compute nodes