SLIDE 1

HUF 2017 KEK site report
Share Our Experience

Koichi Murakami (KEK/CRC)
HUF 2017, KEK, Tsukuba, October 19, 2017

SLIDE 2

KEK: High Energy Accelerator Research Organization
Diversity in accelerator-based sciences

Basic science: pursuing the fundamental laws of nature
• T2K neutrino experiment, SuperKEKB and Belle II, COMET, J-PARC Hadron hall

Material science and its applications: pursuing the origin of function in materials
• Photon Factory (X-rays as a probe), J-PARC MLF (neutrons and muons as a probe)

Technical development and its applications (technology spillover)
• Superconducting accelerators, energy recovery linac, accelerator-based BNCT

SLIDE 3

SuperKEKB / Belle II

SuperKEKB/Belle II is a machine 40 times more powerful than the previous B-factory experiment, KEKB/Belle.

[Figure: projected integrated luminosity (ab^-1) toward the goal; assumptions: 9 months of operation per year, 20 days per month]

SLIDE 4

KEKCC 2016

System components
• Work / batch servers: Lenovo NextScale, 358 nodes, 10,024 cores, 55 TB memory
• GPFS disk / Belle II front-end disk: IBM ESS x 8 (10 PB), IBM ESS (600 TB)
• HSM cache disk: DDN SFA14K (3 PB)
• Tape: IBM TS3500 / TS1150, 70 PB (max)
• Grid EMI servers / Belle II front-end servers: Lenovo x3550 M5 (36 nodes + 5 nodes)
• Network / interconnect: IB 4xFDR, FW SRX3400, Nexus 7018 (10 GbE), SX6518 (40 GbE)

System resources
• CPU: 10,024 cores
  • Intel Xeon E5-2697v3 (2.6 GHz, 14 cores) x 2 per node, 358 nodes
  • 4 GB/core (8,000 cores) / 8 GB/core (2,000 cores) for application use
  • 236 kHS06 / site
• Disk: 10 PB (GPFS) + 3 PB (HSM cache)
• Interconnect: IB 4xFDR
• Tape: 70 PB (max capacity)
• HSM data: 11 PB, 220 M files, 5,200 tapes
• Total throughput: 100 GB/s (disk, GPFS), 50 GB/s (HSM, GHI)
• Job scheduler: Platform LSF v10.1

Facility tour on Friday

SLIDE 5

HSM system
• DDN SFA12K, TS3500, HPSS/GHI servers
• GPFS (GHI): 3 PB
• Total throughput: > 50 GB/s

SLIDE 6

Tape System

Tape library
• IBM TS3500 (13 racks)
• Max. capacity: 70 PB

Tape drives
• TS1150: 54 drives
• TS1140: 12 drives (for media conversion)

Tape media
• JD: 10 TB, 360 MB/s
• JC5: 7 TB, 300 MB/s (reformatted)
• JC4: 4 TB, 250 MB/s
• Reformatting was done in the background and was expected to take 10 months.
• Users (experiment groups) pay for the tape media they use.

[Photo: IBM TS3500 tape library]

SLIDE 7

GHI (GPFS + HPSS): The Best of Both Worlds

HPSS
• We have used HPSS as our HSM system for the last 15+ years.
• 1st layer: GPFS on DDN disk (3 PB); 2nd layer: IBM tape

GHI, GPFS + HPSS
• GPFS parallel file system as the staging area
• Perfect coherence with GPFS access (POSIX I/O)
• KEKCC has been a pioneer GHI customer (since 2012).
• Data access with high I/O performance and good usability:
  • the same access speed as GPFS once data is staged
  • no HPSS client API, no changes to user code
  • small-file aggregation helps tape performance for small data

SLIDE 8

New system configuration parameters

Component            Qty.
HPSS Core Server     1
HPSS Disk Mover      4
HPSS Tape Mover      3
Mover Storage        600 TB
Max. #Files          2 billion
GHI IOM              6
GHI Session Server   3

Software             Version
HPSS                 7.4.3 p2 efix1
GHI                  2.5.0 p1
GPFS                 4.2.0.1
OS (HPSS nodes)      RHEL 6.7
OS (GHI nodes)       RHEL 7.1

SLIDE 9

Data Processing Cycle

Raw data
• Experimental data from detectors, transferred to the storage system in real time
• 2 GB/s sustained for the Belle II experiment
• x5 that amount of simulation data
• Migrated to tape, processed into DSTs, then purged
• "Semi-cold" data (tens to hundreds of PB)
• Occasionally reprocessed

DST (Data Summary Tapes)
• "Hot" data (~ tens of PB)
• Data processing to produce physics data
• Data shared in various ways (Grid access)

Physics summary data
• Compact data sets for producing physics results (N-tuple data)

Requirements for the storage system
• High availability (considering the electricity cost of operating the accelerators)
• Scalability up to hundreds of PB
• Data-intensive processing with high I/O performance
• Hundreds of MB/s I/O for many concurrent accesses (N x 10k) from jobs
• Local jobs and Grid jobs (distributed analysis)
• Data portability to Grid services (POSIX access)

SLIDE 10

System improvements (1)

Separated GPFS clusters
• GPFS disk system (10 PB) and GHI GPFS system (3 PB)
• joined via GPFS remote cluster mount
• better stability and easier system management (maintenance, updates, ...)

COS supports mixed media types
• Different types of tape media (JB/JC/JD) can be mixed in a read/write COS.

Purge policy changed for small files
• The number of small files is huge, but they have little impact on disk space, so small files are not purged.
• Threshold raised from < 8 MB to < 40 MB or < 100 MB, depending on the file-size distribution of each file system (a rough way to survey this is sketched below).
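To make the threshold choice concrete, here is a minimal, hypothetical Python sketch (not a KEKCC tool) that surveys a directory tree and reports how many files and how much space fall under each candidate "do not purge" threshold; a real survey on GPFS at this scale would more likely use a policy scan than os.walk.

#!/usr/bin/env python3
"""Rough sketch: estimate how much disk space small files occupy,
to pick a 'do not purge' size threshold (hypothetical helper)."""
import os
import sys

# Candidate thresholds in bytes (8 MB, 40 MB, 100 MB as on the slide)
THRESHOLDS = [8 * 2**20, 40 * 2**20, 100 * 2**20]

def scan(root):
    counts = {t: 0 for t in THRESHOLDS}
    sizes = {t: 0 for t in THRESHOLDS}
    total_files, total_bytes = 0, 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                size = os.lstat(os.path.join(dirpath, name)).st_size
            except OSError:
                continue  # file disappeared or is unreadable
            total_files += 1
            total_bytes += size
            for t in THRESHOLDS:
                if size < t:
                    counts[t] += 1
                    sizes[t] += size
    return counts, sizes, total_files, total_bytes

if __name__ == "__main__":
    root = sys.argv[1] if len(sys.argv) > 1 else "."
    counts, sizes, nfiles, nbytes = scan(root)
    print(f"{nfiles} files, {nbytes / 2**40:.1f} TiB total under {root}")
    for t in THRESHOLDS:
        print(f"  < {t // 2**20:3d} MB : {counts[t]:>10d} files, "
              f"{100.0 * sizes[t] / max(nbytes, 1):5.1f} % of space")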

SLIDE 11

System improvements (2)

Improve GHI migration
• Old: list all files to be migrated, then migrate them in one go.
  • A single migration request for > 100k files overflows the HPSS queues and migration stalls.
• New: migrate in batches of 10k files with ghi_backup (see the sketch below).
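The batching idea can be sketched as follows. This is an illustrative Python fragment: the migration command invoked per batch is a hypothetical site wrapper, not the actual ghi_backup syntax, which is not shown on the slide.

#!/usr/bin/env python3
"""Illustrative sketch of 'migrate in batches of 10k files'.
MIGRATE_CMD is a hypothetical site wrapper, not the real ghi_backup syntax."""
import os
import subprocess
import sys
import tempfile

BATCH_SIZE = 10_000                                              # files per request
MIGRATE_CMD = ["/usr/local/bin/site_ghi_migrate", "--filelist"]  # hypothetical

def batches(paths, size):
    """Yield successive chunks of at most `size` paths."""
    for i in range(0, len(paths), size):
        yield paths[i:i + size]

def main(listfile):
    with open(listfile) as f:
        paths = [line.strip() for line in f if line.strip()]
    for n, chunk in enumerate(batches(paths, BATCH_SIZE), start=1):
        # One bounded request at a time, so a single huge request can
        # never flood the HPSS queues and stall migration.
        with tempfile.NamedTemporaryFile("w", suffix=".lst", delete=False) as tmp:
            tmp.write("\n".join(chunk) + "\n")
            name = tmp.name
        print(f"batch {n}: migrating {len(chunk)} files")
        subprocess.run(MIGRATE_CMD + [name], check=True)
        os.unlink(name)

if __name__ == "__main__":
    main(sys.argv[1])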

SLIDE 12

System migration works

HSM service on the old system
• 3 days of downtime for the system migration (backup of the current system, restore into the new one)
• The GPFS disk was kept mounted (read-only) for 2 weeks before the new system.
  • Only data already staged on disk was accessible.

System migration
• 8.5 PB of data, 170 M files, 5,000 tapes
• 3 days of work on Aug 15-17, 2016
  • Move physical tapes from the current to the new tape library
  • DB2 migration using QRep
  • GHI backup and restore
• Staging is necessary in the new system.
  • Admin staging for important data

Take checksums of tape data
• 6 months of work for the higher-priority data
• Read directly from tapes (tape-ordered; small files as htar'ed HPSS files)
• 200 MB/s on average, 4,000 volumes
• Checksum and timestamp stored in GPFS UDA (see the sketch below)
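A minimal sketch of the "checksum plus timestamp into GPFS UDA" step, assuming a plain user-namespace extended attribute is an acceptable stand-in for the UDA tooling actually used; the attribute names are invented for illustration, and the real KEKCC workflow reads the data tape-ordered via HPSS.

#!/usr/bin/env python3
"""Sketch: compute a file checksum and record it, with a timestamp, as an
extended attribute on the (GPFS) file.  Attribute names are made up."""
import hashlib
import os
import sys
import time

def checksum(path, algo="md5", bufsize=16 * 2**20):
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        while True:
            block = f.read(bufsize)
            if not block:
                break
            h.update(block)
    return h.hexdigest()

def record(path):
    digest = checksum(path)
    stamp = time.strftime("%Y-%m-%dT%H:%M:%S")
    # user.* extended attributes on Linux, used here as a stand-in for
    # the GPFS user-defined attribute (UDA) mechanism.
    os.setxattr(path, b"user.site.checksum", digest.encode())
    os.setxattr(path, b"user.site.checksum_time", stamp.encode())
    return digest

if __name__ == "__main__":
    for p in sys.argv[1:]:
        print(p, record(p))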

SLIDE 13

Operational issues (HPSS/GHI)

"Overload of staging requests"
• All data was in the purged state in the new system because of the system migration.
• We did not take system downtime for data staging.
  • Data staging during operation: both admin and user staging
• GHI staging priority (a toy sketch of the reversed priority follows this list)
  • Initially: user staging > admin staging (ghi_stage, tape-ordered)
  • Admin staging was not processed -> user staging piled up (a bad spiral).
  • The heavy staging load hit bugs; we identified some bad points and patches were applied.
• Thoughts on data migration
  • Enable D2D migration for staged data, late binding of D/T data?
  • Runtime conversion between GPFS and HPSS could help (3.0.0).
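The priority change is essentially a two-class queue; here is a toy sketch (not GHI internals) in which admin staging requests are always served ahead of user requests.

#!/usr/bin/env python3
"""Toy model of the staging-priority change: admin (ghi_stage) requests are
served before user requests, instead of the other way around."""
import heapq
import itertools

ADMIN, USER = 0, 1          # lower number = higher priority after the change

class StageQueue:
    def __init__(self):
        self._heap = []
        self._seq = itertools.count()   # FIFO order within the same class

    def submit(self, path, priority):
        heapq.heappush(self._heap, (priority, next(self._seq), path))

    def next_request(self):
        if not self._heap:
            return None
        _prio, _seq, path = heapq.heappop(self._heap)
        return path

if __name__ == "__main__":
    q = StageQueue()
    q.submit("/ghi/fs01/user1.dat", USER)
    q.submit("/ghi/fs01/admin_bulk_0001.dat", ADMIN)
    q.submit("/ghi/fs01/user2.dat", USER)
    # With ADMIN < USER, the bulk admin staging is drained first,
    # matching the changed priority (ghi_stage > user staging).
    print(q.next_request())   # -> /ghi/fs01/admin_bulk_0001.dat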

SLIDE 14

Review of staging performance

What is the bottleneck? Staging performance in a long-term view:
• Sep - Dec 2016
• staged files / min (hourly averaged, GHI)
• We can see:
  • spikes of admin staging (HPSS cache -> GHI, thousands of files / min)
  • continuous staging of important data for three months (Sep - Nov)
  • low staging performance in some periods (< 5 files / min, see the next slide)

[Figure: stage speed (files/min) for fs01, fs02, and their sum, Sep - Dec 2016]

[Figure: CPU usage (cores x days per month), Apr 2016 - Mar 2017, by group: Belle, Belle2, Grid, Had, T2K, CMB, ILC, Others]

SLIDE 15

Library Accessor performance

[Figure: tape mount rate (mounts/min) for TS1140, TS1150, and their sum over a few days in September]

• We have 54 TS1150 drives, but...
• Tape mounts are limited to about 4 tapes / min.
  • TS3500 library accessor spec: 15 sec per (un)mount -> 60 / 15 = 4 tapes / min
  • well consistent with the observation
• ~4 files / min of staging in the case of continuous requests spread over different tape media

SLIDE 16

Data localization in a small number of tapes

[Figure: number of readout requests vs. number of tapes being read; left: Sep - Dec 2016 (long term), right: a 3-day window, Nov 5-8 (short term)]

SLIDE 17

Data localization in a small number of tapes

When staging requests concentrate on a small number of tapes, we observed performance degradation in data staging.
An issue with the HPSS tape staging queue:
• HPSS supports tape-ordered recall, ...
  • given a list of staging requests, it optimizes them by tape order (see the sketch below)
  • but apparently not as real-time optimization?
• Could TOR/ROA in 7.5.1 improve this?
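To make the tape-ordered recall idea concrete, here is a toy Python sketch that groups pending stage requests by tape volume and sorts each group by position on tape, so every cartridge is mounted once and read front to back; the request fields are invented for illustration and do not reflect real HPSS metadata.

#!/usr/bin/env python3
"""Toy illustration of tape-ordered recall: group pending stage requests by
tape volume and sort by position on tape, so each tape is mounted once."""
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Request:
    path: str      # file to stage
    volume: str    # tape cartridge holding the file
    position: int  # ordinal position of the file on that tape

def tape_ordered(requests):
    """Return requests grouped per volume, each group sorted by position."""
    by_volume = defaultdict(list)
    for r in requests:
        by_volume[r.volume].append(r)
    plan = []
    for volume in sorted(by_volume):   # one mount per volume
        plan.append((volume, sorted(by_volume[volume], key=lambda r: r.position)))
    return plan

if __name__ == "__main__":
    reqs = [
        Request("/ghi/fs01/a.dat", "JD0042", 310),
        Request("/ghi/fs01/b.dat", "JD0042", 12),
        Request("/ghi/fs02/c.dat", "JD0911", 77),
    ]
    for volume, group in tape_ordered(reqs):
        print(volume, [r.path for r in group])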

SLIDE 18

Plan for improvement on staging

Optimization of data recall
• Reduce the tape mount frequency; use TOR / ROA
• Tool: ghi_stage (admin staging) with tape-ordered recall
• CR for ghi_stage priority: done
  • ghi_stage > user staging (initially: ghi_stage < user staging)

Plans
• Scenarios for bulk staging (see the sketch after this list)
  • Use case 1: users provide lists of files/directories to stage.
    • Gather user staging requests (polling) -> ghi_stage in the background
  • Use case 2: cooperation with the job scheduler (LSF)
    • Data prefetching before job dispatch; LSF provides a scheme for this purpose.
    • Once the data is staged, the job is dispatched.
• Quaid could help.
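Use case 1 could look roughly like the following daemon-style sketch: poll a drop directory for user-supplied file lists, merge them, and hand the batch to admin ghi_stage in the background. The directory layout, polling interval, and exact ghi_stage invocation are assumptions, not the planned implementation.

#!/usr/bin/env python3
"""Rough sketch of 'gather user staging requests, then ghi_stage in background'
(use case 1).  Paths and the ghi_stage arguments are assumptions."""
import glob
import os
import subprocess
import time

DROP_DIR = "/ghi/stage-requests"     # users drop file lists here (assumed)
POLL_SEC = 300                       # gather requests every 5 minutes

def gather():
    """Collect and deduplicate all requested paths, then remove the lists."""
    wanted = set()
    for listfile in glob.glob(os.path.join(DROP_DIR, "*.lst")):
        with open(listfile) as f:
            wanted.update(line.strip() for line in f if line.strip())
        os.unlink(listfile)
    return sorted(wanted)

def main():
    while True:
        paths = gather()
        if paths:
            # Hand the whole batch to admin staging in the background;
            # ghi_stage itself is expected to do the tape-ordered recall.
            subprocess.Popen(["ghi_stage"] + paths)
        time.sleep(POLL_SEC)

if __name__ == "__main__":
    main()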

SLIDE 19

Real-time monitoring

LSF RTM

SLIDE 20

HPSS / GHI monitoring

We monitor many parameters and status values of HPSS / GHI:
• a notification mechanism for detecting service outages
• server performance (ES / Kibana)
• syslog / ZABBIX

HPSS / GHI log information is not well defined.
• We have spent much time analyzing system troubles.
• CRs are continuously raised, ...

HPSS / GHI monitoring with ES / Kibana (planned; a feed sketch follows)
• # staging files / min, # tape mounts / min, # staging requests vs. # tapes in read
• IOM status (time of the longest IOM job; so far not per thread)
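As a rough idea of the planned ES / Kibana feed, here is a small sketch that posts one per-minute sample to Elasticsearch over its REST API; the endpoint, index, and field names are placeholders, and the counters would really come from HPSS/GHI logs.

#!/usr/bin/env python3
"""Sketch of shipping one monitoring sample (staged files per minute) to
Elasticsearch for a Kibana dashboard.  All names are placeholders."""
import datetime
import json
import urllib.request

ES_URL = "http://es.example.org:9200/hsm-monitor/_doc"   # hypothetical endpoint

def post_sample(staged_per_min, tape_mounts_per_min):
    doc = {
        "@timestamp": datetime.datetime.utcnow().isoformat() + "Z",
        "staged_files_per_min": staged_per_min,
        "tape_mounts_per_min": tape_mounts_per_min,
    }
    req = urllib.request.Request(
        ES_URL,
        data=json.dumps(doc).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

if __name__ == "__main__":
    # In reality these numbers would be parsed from GHI/HPSS logs or counters.
    print(post_sample(staged_per_min=42, tape_mounts_per_min=3))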

[Figure: IOM longest job time (sec) for hgi01-hgi04, Sep - Dec 2016]

SLIDE 21

Media error history: old vs. new system

[Figure: cumulative number of tape media errors vs. months since the start of operation (up to ~53 months), previous system vs. new system; the previous system's history spans about 4 years]

SLIDE 22

Media error investigation

Still under investigation: we do not have a final conclusion yet.

Operations for media errors:
• Try repack; read with hpss_stage.
• Recover data from the cache disk.
  • Most of the data was recovered.
  • The operational cost is not negligible.

Analysis of drive dump records at IBM labs (Japan, Tucson); media checks by Fujifilm
• Most of the read errors could not be reproduced.
• Drive firmware improves error correction.

A couple of problems were found in HPSS.
Problem 1: The last file cannot be read if a drive error happens during migration.
• Fix in FM (file mark) writing for the last file
• A local_mod will be applied soon.

SLIDE 23

HPSS fix on seeking during migration

Seeking to the location at 39,802 FMs:

            Seek time   Distance
Before fix  23 min      8,400 m
After fix   44 sec      420 m

Problem 2: Some tapes run a very long distance.

HPSS fix:
  Before:  for (i = 1 to 38) space 1024 FMs
  After:   space 38*1024 FMs

• Migration speed could degrade.
• A local_mod will be applied soon.

End-of-life alerts:
• TS1140: with respect to the amount of read/write
• TS1150: with respect to the running distance
• Media alerts have increased.

SLIDE 24

Summary

• The new KEKCC system was launched in September 2016.
  • Computing resources were increased based on the requirements of the experiments:
    CPU: 10k cores (x2.5), disk: 13 PB (x1.8), tape: 70 PB (x4.3)
• Storage requirements are very important for the coming experiments at KEK.
  • Large capacity, high speed, high scalability
• The tape system is still an important technology for us, from both the hardware and the software (HSM) points of view.
  • GHI is our HSM solution for large-scale data processing.
• Scalable data management is a challenge for the next 10 years.
  • The Belle II experiment will start in 2018.
  • Scale out the system toward the exascale.
  • Coherent data management across the data processing cycle (data taking, archiving, processing, preservation, ...)
  • Data migration is a potential concern.