HUF 2017 KEK Site Report: Share Our Experience
Koichi Murakami (KEK/CRC)
Oct/19/2017, HUF 2017, KEK, Tsukuba
KEK
High Energy Accelerator Research Organization
Diversity in accelerator-based sciences

- Basic science: pursuing the fundamental laws of nature
  - T2K neutrino experiment, SuperKEKB and Belle II, COMET, J-PARC Hadron hall
- Material science and its applications: pursuing the origin of function in materials
  - Photon Factory (X-rays as a probe), J-PARC MLF (neutrons and muons as probes)
- Technical development and its applications (technology spillover)
  - Superconducting accelerators, energy-recovery linac, accelerator-based BNCT
SuperKEKB/Belle II is a machine 40 times more powerful than the previous B-factory experiment, KEKB/Belle.
[Chart: integrated-luminosity projection toward the goal; assumptions: 9 months/year, 20 days/month of operation]
Work / Batch Servers (KEKCC 2016)

System resources:
- CPU: 10,024 cores
  - Intel Xeon E5-2697v3 (2.6 GHz, 14 cores) x 2 per node, 358 nodes (Lenovo NextScale), 55 TB memory
  - 4 GB/core (8,000 cores) / 8 GB/core (2,000 cores, for application use)
  - 236 kHS06 / site
- Disk: 10 PB (GPFS, IBM ESS x 8) + 3 PB (HSM cache, DDN SFA14K); 600 TB IBM ESS for the Belle II front-end disk
- Interconnect: IB 4xFDR
- Tape: IBM TS3500 / TS1150, 70 PB (max capacity)
- HSM data: 11 PB, 220 M files, 5,200 tapes
- Total throughput: 100 GB/s (disk, GPFS), 50 GB/s (HSM, GHI)
- Job scheduler: Platform LSF v10.1
- Network: 10 GbE / 40 GbE (Nexus 7018, SX6518), firewall SRX3400
- Grid EMI / Belle II front-end servers: Lenovo x3550 M5 (36 + 5 nodes)
Facility Tour on Friday
HSM hardware: DDN SFA12K, TS3500, HPSS/GHI servers
GPFS (GHI): 3 PB, total throughput > 50 GB/s
Tape Library
- IBM TS3500 (13 racks)
- Max capacity: 70 PB

Tape Drives
- TS1150: 54 drives
- TS1140: 12 drives (for media conversion)

Tape Media
- JD: 10 TB, 360 MB/s
- JC5: 7 TB, 300 MB/s (reformatted)
- JC4: 4 TB, 250 MB/s
- Reformatting was done in the background over 10 months (as expected).
- Users (experiment groups) pay for the tape media they use.
HPSS
- We have used HPSS as our HSM system for the last 15+ years.
- 1st layer: GPFS on DDN, 3 PB; 2nd layer: IBM tape

GHI (GPFS + HPSS Interface)
- GPFS parallel file system as the staging area
- Full coherence with GPFS access (POSIX I/O)
- KEKCC has been a pioneer GHI customer (since 2012).
- Data access with high I/O performance and good usability:
  - same access speed as GPFS once data is staged
  - no HPSS client API, no changes to user code
  - small-file aggregation helps tape performance for small data
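The "no HPSS client API" point can be illustrated with a minimal sketch: user code reads a GHI-managed file with ordinary POSIX I/O, the same call it would use for any GPFS file (the open simply blocks longer while a purged file is recalled from tape). The paths here are illustrative stand-ins, not real KEKCC paths.

```python
def read_experiment_file(path):
    """Read a GHI-managed file with plain POSIX I/O.

    No HPSS client API is involved: if the file is purged, GHI recalls
    it from tape transparently and this call just takes longer.
    """
    with open(path, "rb") as f:   # same call as for any GPFS file
        return f.read()

# For illustration, write and read back a local stand-in file.
with open("/tmp/ghi_demo.dat", "wb") as f:
    f.write(b"detector event block")

data = read_experiment_file("/tmp/ghi_demo.dat")
print(len(data))   # -> 20
```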
Component            Qty.
HPSS Core Server     1
HPSS Disk Mover      4
HPSS Tape Mover      3
Mover Storage        600 TB
                     2 billion
GHI IOM              6
GHI Session Server   3

Software versions
HPSS                 7.4.3 p2 efix1
GHI                  2.5.0 p1
GPFS                 4.2.0.1
OS (HPSS nodes)      RHEL 6.7
OS (GHI nodes)       RHEL 7.1
Raw data
- Experimental data from detectors, transferred to the storage system in real time
- 2 GB/s sustained for the Belle II experiment
- x5 that amount in simulation data
- Migrated to tape, processed into DST, then purged
- "Semi-cold" data (tens to hundreds of PB), sometimes reprocessed

DST (Data Summary Tapes)
- "Hot" data (~tens of PB)
- Data processing to produce physics data
- Shared in various ways (Grid access)

Physics summary data
- Handy data sets for producing physics results (N-tuple data)

Requirements for the storage system
- High availability (considering the electricity cost of operating the accelerator)
- Scalability up to hundreds of PB
- Data-intensive processing with high I/O performance
  - hundreds of MB/s I/O for many concurrent accesses (N x 10k) from jobs
  - local jobs and Grid jobs (distributed analysis)
- Data portability to Grid services (POSIX access)
Separated GPFS clusters
- GPFS disk system (10 PB) and GHI GPFS system (3 PB)
- using GPFS remote cluster mount
- improves stability and system management (maintenance, updates, ...)

COS supports mixed media types
- different types of tape media (JB/JC/JD) can be mixed in a read/write COS

Purge policy changed for small files
- the number of small files is huge, but they have little impact on disk space, so small files are not purged
- threshold raised from < 8 MB to < 40 MB / 100 MB, depending on the file-size distribution in each file system
Improve GHI migration
- Old: list all files to migrate, then migrate them in one go
  - a single migration request for >100k files overflows the HPSS queues and migration stalls
- New: migrate in batches of 10k files with ghi_backup
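The batching idea above can be sketched as follows: split a large file list into chunks of 10,000 and submit each chunk separately, so no single request overflows the HPSS queues. `run_ghi_backup` is a hypothetical stand-in for invoking ghi_backup on one batch.

```python
def chunk(files, size=10_000):
    """Yield successive batches of at most `size` files."""
    for i in range(0, len(files), size):
        yield files[i:i + size]

def migrate_in_batches(files, run_ghi_backup, size=10_000):
    """Submit the migration as many bounded requests instead of one huge one."""
    for batch in chunk(files, size):
        run_ghi_backup(batch)   # one bounded request per batch

# Illustration with a dummy runner that just records each batch.
submitted = []
migrate_in_batches([f"file{n}" for n in range(25_000)], submitted.append)
print([len(b) for b in submitted])   # -> [10000, 10000, 5000]
```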
HSM service on the old system
- 3 days of downtime for the system migration (backup of the current system / restore into the new one)
- GPFS disk kept mounted (read-only) for 2 weeks before the new system
  - only data staged on disk was accessible

System migration
- 8.5 PB of data, 170 M files, 5,000 tapes
- 3 days of work, Aug 15-17, 2016
  - physical tapes moved from the current to the new tape library
  - DB2 migration using QRep
  - GHI backup and restore
- Staging is necessary in the new system
  - admin staging for important data

Checksums taken for tape data
- 6 months of work for higher-priority data
- read directly from tapes (tape-ordered; small files as htar'ed HPSS files)
- 200 MB/s on average, 4,000 volumes
- checksum and timestamp stored in GPFS UDA
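A minimal sketch of the checksum-into-UDA step: compute a streamed checksum of the file and record it as a GPFS user-defined attribute. The attribute names (`user.checksum`, `user.cksum_time`) and the exact `mmchattr` invocation are assumptions for illustration, not the site's actual schema; the command is only built here, not executed.

```python
import hashlib
import time

def file_md5(path, blocksize=1 << 20):
    """MD5 of a file, read in 1 MiB blocks (streaming, tape-friendly)."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(blocksize), b""):
            h.update(block)
    return h.hexdigest()

def mmchattr_command(path, checksum, timestamp):
    """Build a (hypothetical) mmchattr call recording checksum + time as UDAs."""
    return ["mmchattr",
            "--set-attr", f"user.checksum={checksum}",
            "--set-attr", f"user.cksum_time={timestamp}",
            path]

# Illustration on a small local file.
with open("/tmp/uda_demo.dat", "wb") as f:
    f.write(b"hello tape")

cksum = file_md5("/tmp/uda_demo.dat")
cmd = mmchattr_command("/tmp/uda_demo.dat", cksum, int(time.time()))
print(cksum)
```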
"Overload request on staging"
- Due to the system migration, all data in the new system starts in the purged state.
- We did not take system downtime for data staging.
  - data staging during operation: both admin and user staging
- GHI staging priority
  - initially: user staging > admin staging (ghi_stage, tape-ordered)
  - admin staging was not processed -> user staging piled up (a bad spiral)
  - the heavy staging load hit bugs; some bad spots were identified and patches applied
- Thoughts on data migration
  - enable D2D migration for staged data, late binding of disk/tape data?
  - runtime conversion between GPFS and HPSS could help (3.0.0)
What is the bottleneck? Staging performance in the long-term view
- Sep - Dec 2016: staged files / min (hourly averaged, GHI)
- We can see:
  - spikes of admin staging (HPSS cache -> GHI, thousands of files/min)
  - continuous staging of important data for three months (Sep-Nov)
  - low staging performance in some periods (< 5/min, see next)

[Chart: "Stage Speed" - staged files/min for fs01, fs02, and their sum, Sep-Dec 2016]
[Chart: "CPU usage" - cores*days per month, Apr 2016 - Mar 2017, by group: Belle, Belle2, Grid, Had, T2K, CMB, ILC, Others]
[Chart: "Mount Speed" - tape mounts/min for TS1140, TS1150, and their sum over ~3 days]

Tape mounts / min
- We have 54 TS1150 drives, but...
- Tape mounts are limited to about 4 tapes/min.
  - TS3500 library accessor spec: 15 s per (un)mount -> 60 / 15 = 4 tapes/min
  - well consistent with the observation
- ~4 files/min staging in the case of continuous requests on different tape media
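The mount-limit arithmetic above is worth making explicit: with a single library accessor taking 15 s per (un)mount, the mount rate, not the 54 drives, bounds the staging rate when every requested file sits on a different tape.

```python
# Back-of-the-envelope check of the mount-limited staging rate,
# using the figure from the TS3500 accessor spec on the slide.
ACCESSOR_SECONDS_PER_MOUNT = 15      # 15 s per (un)mount

mounts_per_min = 60 // ACCESSOR_SECONDS_PER_MOUNT
print(mounts_per_min)                # -> 4 tapes/min

# If every staged file lives on a different tape, the file rate is
# bound by the mount rate, regardless of the number of drives.
files_per_min = mounts_per_min * 1   # one file recalled per mount
print(files_per_min)                 # -> ~4 files/min
```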
[Charts: "#Readout requests vs. #Tapes in read" - number of readout requests and number of tapes being read, over ~3.5 months and zoomed into a 3-day window]
When a large number of staging requests concentrates on a few tapes, we observed performance degradation in data staging.

A matter of the HPSS tape staging queue
- HPSS supports tape-ordered recall: given a list of staging requests, it optimizes by tape order
- but apparently no real-time optimization?
- could TOR/ROA in 7.5.1 improve this?
Optimization of data recall
- Reduce tape mount frequency; TOR / ROA
- tool: ghi_stage (admin staging) with tape-ordered recall
- CR for ghi_stage priority: done
  - ghi_stage > user staging (initially: ghi_stage < user staging)

Plans
- Scenarios for bulk staging
  - Use case 1: users give stage file/directory lists
    - gather user staging requests (polling) -> ghi_stage in the background
  - Use case 2: cooperation with the job scheduler (LSF)
    - data prefetching before job dispatch; LSF provides a scheme for this purpose
    - once data is staged, the job is dispatched
- Quaid could help
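Use case 1 above can be sketched as a small gather-and-submit loop: pending user requests collected over a polling window are deduplicated, optionally grouped by tape, and handed to one bulk recall. `submit_ghi_stage` is a hypothetical stand-in for invoking ghi_stage, and the tape-grouping key is likewise an assumption.

```python
def gather_and_submit(request_log, submit_ghi_stage, tape_of=None):
    """Drain pending stage requests and submit them as one bulk recall.

    request_log: list of requested paths (may contain duplicates)
    tape_of:     optional key mapping a path to its tape, so the bulk
                 request is grouped tape by tape (tape-ordered recall)
    """
    pending = sorted(set(request_log),               # dedupe repeats
                     key=tape_of or (lambda p: p))   # group by tape if known
    request_log.clear()                              # window is drained
    if pending:
        submit_ghi_stage(pending)                    # single bulk request
    return len(pending)

# Illustration: three users request overlapping files in one window.
log = ["/hsm/runA/f1", "/hsm/runB/f7", "/hsm/runA/f1", "/hsm/runA/f2"]
batches = []
n = gather_and_submit(log, batches.append)
print(n, batches[0])   # -> 3 ['/hsm/runA/f1', '/hsm/runA/f2', '/hsm/runB/f7']
```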
LSF RTM
We monitor many parameters and statuses of HPSS / GHI:
- notification mechanism for detecting service down
- server performance (ES/Kibana)
- syslog / ZABBIX

HPSS / GHI log information is not well defined.
- We spent much time analyzing system troubles.
- CRs are continuously raised, ...

HPSS / GHI monitoring with ES / Kibana (planned)
- # staging files/min, # tape mounts/min, # staging requests vs. # tapes in read
- IOM status (time of the longest IOM job; so far not per thread)

[Chart: "IOM Longest Job" - seconds, per GHI node (hgi01-hgi04), Sep-Dec 2016]
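One of the planned metrics above (# staging files per minute) can be derived from staging-completion timestamps before shipping to Elasticsearch; a minimal aggregation sketch, with illustrative timestamps and bucketing:

```python
from collections import Counter

def files_per_minute(epoch_seconds):
    """Bucket staging-completion timestamps into per-minute counts."""
    return Counter(t // 60 for t in epoch_seconds)

# Illustration: five completions spread across two minutes.
events = [100, 110, 119, 165, 170]   # seconds since some epoch
counts = files_per_minute(events)
print(sorted(counts.items()))        # -> [(1, 3), (2, 2)]
```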
[Chart: number of tape failure incidents per month since the start of operation (combined), comparing the previous system and the new system over ~4 years]
Under investigation: we have no final conclusion yet.

Operations for media errors:
- try repack, read with hpss_stage
- data recovery from cache disk: most of the data was recovered, but the operational cost is not negligible

Analysis of drive dump records at IBM labs (Japan, Tucson); media checks by Fujifilm
- Most read errors are not reproduced.
- Drive firmware improves error correction.

A couple of problems were found in HPSS.
Problem 1: the last file cannot be read if a drive error happens during migration.
- fix to FM writing for the last file
- a local_mod will be applied soon
Problem 2: some tapes run a very long distance.

Seek to 39,802 FMs:   Seek time   Distance
  Before fix          23 min      8,400 m
  After fix           44 sec      420 m

HPSS fix:
  Before:  for (i = 1 to 38) space 1024 FMs
  After:   space 38*1024 FMs

- Migration speed could otherwise degrade.
- A local_mod will be applied soon.

End-of-life alerts:
- TS1140: with respect to the amount of read/write
- TS1150: with respect to running distance
- Media alerts increased.
Summary
- The new KEKCC system was launched in September 2016.
  - Computing resources increased based on the experiments' requirements: CPU 10k cores (x2.5), disk 13 PB (x1.8), tape 70 PB (x4.3)
- Storage requirements are very important for the coming experiments at KEK: large capacity, high speed, high scalability.
- Tape is still an important technology for us, from both hardware and software (HSM) points of view.
  - GHI is our HSM solution for large-scale data processing.
- Scalable data management is the challenge for the next 10 years.
  - The Belle II experiment will start in 2018.
  - Scale the system out toward exascale.
  - Coherent data management across the data-processing cycle (data taking, archiving, processing, preservation, ...)
  - Data migration is a potential concern.