RIKEN Center for Computational Science
Computer simulations create the future
Operation of the K computer and the facility
Fumiyoshi Shoji (Division Director)
Operations and Computer Technologies Div.
RIKEN Center for Computational Science
An announcement of the K computer’s shutdown
- 2019/1/31
https://www.r-ccs.riken.jp/en/topics/20190131.html
Timeline (2006-2019):
- design & construction (K computer)
- design & construction (facility)
- early access
- official operation
- over 8 years!
I moved to RIKEN and joined the early phase of the K project
K computer and achievements
- The K computer:
- developed jointly by RIKEN and FUJITSU in a Japanese national project.
- designed for general-purpose computing.
- no accelerators
- high memory/interconnect bandwidth
- Achievements:
– TOP500 list: No.1 in Jun. and Nov. 2011 (#18 in the latest list).
- The world’s first supercomputer to achieve over 10 PF HPL performance.
– Graph500 list: No.1 in Jun. 2014 and Jul. 2015 – Nov. 2018.
– HPCG results: No.1 in Nov. 2016 – Nov. 2017 (#3 in the latest list).
– Gordon Bell prize: winner in 2011 and 2012.
– Other remarkable results for science and engineering
- See http://www.r-ccs.riken.jp/en/
System overview
[Diagram components]
- The K computer: compute nodes and I/O nodes, 6D mesh/torus network (Tofu)
- Local File System (LFS) (11PB)
- Global File System (GFS) (40PB)
- Global I/O network
- Control & Management network
- Frontend Servers
- Pre/Post Servers
- Management Servers
- Control Servers
- Users, Internet

# of CPUs: 82,944
Memory capacity: 1.27PiB
Facility overview (power supply)
[Diagram components]
- Power supply company; Substation 30MW
- Gas supply company; Power Generators: Gas Turbine Power Generator 5MW x 2
- active/standby
- Loads: K computer; Air handlers etc.; Storages & servers; Chillers
- Total power consumption: 14-16MW (11-12MW + 3-5MW)
Facility overview (Co-generation system)
The co-generation system achieves higher energy efficiency. On the other hand, because the power generators and chillers are tightly coupled, facility operation is much more complicated.
[Diagram: energy flow]
- Gas -> Gas turbine power generator (5MW x 2)
- Outputs: electricity; steam and heat (absorption chillers; for air conditioning, for cooling); unusable waste heat
- Shares shown: ~30%, ~45%, ~25%

Chiller type   Quantity   Cooling capability (USRt)   Cooling capability (MW)   Power consumption (kW)
Absorption     4          1,700                       5.98                      273
Centrifugal    2          1,400                       4.93                      901
Centrifugal    1          700                         2.46                      389
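The MW column of the chiller table follows from the USRt column by a unit conversion. The sketch below assumes the standard definition 1 USRt = 12,000 BTU/h, about 3.51685 kW. The function and constant names are illustrative and do not come from the slides, and results may differ from the slide values in the last digit because of rounding.

```python
# Convert cooling capability from US refrigeration tons (USRt) to MW.
# Assumption: 1 USRt = 12,000 BTU/h, which is about 3.51685 kW.
KW_PER_USRT = 3.51685


def usrt_to_mw(usrt: float) -> float:
    """Return the cooling capability in MW for a value given in USRt."""
    return usrt * KW_PER_USRT / 1000.0


if __name__ == "__main__":
    # USRt values taken from the chiller table.
    for usrt in (1700, 1400, 700):
        print(f"{usrt:>5} USRt = {usrt_to_mw(usrt):.2f} MW")
```
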
Statistics
- 2012/9/28 – 2019/2/3 (6 years and 4 months)
# of projects: 649
# of (real) users: 3,570
# of processed jobs: 3,491,472
Total used Node Hours: 3,389,123,489 (*)
(*) 73.5% over the 6 years and 4 months
Yearly availability & job filling rate

Fiscal year           Availability   Scheduled maintenance   Irregular system down   Job filling rate
FY2012 (Sep.-Mar.)    91.9%          6.3%                    1.8%                    61.2%
FY2013                94.7%          4.0%                    1.2%                    75.9%
FY2014                93.3%          3.2%                    3.5%                    75.6%
FY2015                94.0%          3.7%                    2.3%                    75.3%
FY2016                95.0%          3.8%                    1.2%                    78.9%
FY2017                93.9%          4.2%                    1.9%                    77.8%
FY2018 (Apr.-Jan.)    94.6%          3.7%                    1.3%                    76.5%
- Availability = (365d×24h − scheduled maintenance − irregular system down) / (365d×24h)
- Job filling rate = Node time used by job / Available time
- Availability has been kept at a high level (~95%).
- Irregular system down has been suppressed to less than 2% in the last 3 years.
- Considering the direct interconnection between nodes and the block-wise job allocation, the job filling rate is at a sufficiently high level.
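The two ratios defined on this slide can be sketched in a few lines. The function and parameter names below are illustrative, the example hours are hypothetical rather than measured values, and multiplying the available time by the node count to obtain node-hours is an assumption about units that the slide does not spell out.

```python
def availability(period_hours, scheduled_maintenance_hours, irregular_down_hours):
    """Fraction of the period in which the system was available."""
    available = period_hours - scheduled_maintenance_hours - irregular_down_hours
    return available / period_hours


def job_filling_rate(node_hours_used_by_jobs, available_hours, num_nodes):
    """Node time used by jobs divided by the node time that was available.

    Assumption: available node time = available hours x number of nodes.
    """
    return node_hours_used_by_jobs / (available_hours * num_nodes)


if __name__ == "__main__":
    # Hypothetical year: 4.0% scheduled maintenance, 1.2% irregular down.
    period = 365 * 24
    a = availability(period, 0.040 * period, 0.012 * period)
    print(f"availability = {a:.1%}")
```

Expressing both ratios against the same period makes the three availability-related bars of the chart (availability, scheduled maintenance, irregular system down) sum to 100%.
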
Irregular system down
[Figure: down time (hours) per fiscal year, FY2012 (Sep.-Mar.) to FY2018 (Apr.-Jan.), broken down by cause: LFS, GFS, job scheduler, MPI, misc]
- File system failures (GFS & LFS) are the dominant cause of irregular system down.
- Since FY2015, we have given priority to resuming service over investigating the cause of failures.
- Misc. in FY2018 includes a failure of the power supply facility caused by heavy rain and wind from a typhoon (8/20, 46.4h) and a power outage caused by lightning (6/8, 43.0h).
Improvements (PUE)
Power consumption & PUE (Power Usage Effectiveness)
[Figure: monthly share of power consumption (0%-100%) for computing resources, cooling (chillers, etc.) and cooling (air handlers, etc.), with PUE on a second axis (1.32-1.50), Sep-12 to Dec-18]
- Optimization of air cooling operation (2012-2013)
- Optimization of power generator and chillers (2018-)
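PUE is conventionally defined as total facility power divided by the power drawn by the computing equipment. The sketch below applies that definition to the three categories named in the chart legend. Treating those three categories as the whole facility load is an assumption, and the input values are hypothetical, not readings from the chart.

```python
def pue(computing_mw, cooling_chillers_mw, cooling_air_handlers_mw):
    """Power Usage Effectiveness: total facility power / computing power.

    Assumption: the facility load consists only of the three categories
    shown in the chart legend.
    """
    if computing_mw <= 0:
        raise ValueError("computing power must be positive")
    total_mw = computing_mw + cooling_chillers_mw + cooling_air_handlers_mw
    return total_mw / computing_mw


if __name__ == "__main__":
    # Hypothetical monthly averages in MW.
    print(f"PUE = {pue(12.0, 3.0, 1.8):.2f}")
```

A lower value is better: reducing either cooling term while the computing load stays the same moves PUE toward 1.0, which is what the two optimization items above aim at.
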
Improvements (Power capping)
- To avoid penalty when power consumption exceeds the upper limit
Typical power consumption history (4/14 9:00- 14:00)
[Figure: power consumption (MW, axis 2-20) over time for "Site total" and "K computer only": idle, then a full node job running (elapsed time ~ 2h30m), then idle; annotated "4MW up" and "4MW down"]
- Power consumption changes by a large amount very quickly: 14 -> 18MW and 18 -> 14MW.
- We are concerned about the impact on the power and cooling facilities.
- Preview process for large-scale jobs (more than 40% of the full system):
- A user who wants to run a large-scale job must first run a small version of it (10% of the full system) before the large-scale mode period.
- We evaluate the power consumption profile of that run, estimate the peak power consumption, and decide whether to admit the job.
- Prepare for large power consumption:
- If the estimated power consumption exceeds the limit, we also consider activating the 2nd power generator while the job is running.
- Safety valve:
- If power consumption nevertheless exceeds the limit, the monitoring system detects it and the job is killed automatically.
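The three steps above (preview, preparation, safety valve) can be sketched as a small decision routine. The slides do not say how the peak is estimated, so everything here is a hypothetical illustration: the linear scaling of the power rise with job size, the idea that the 2nd generator adds 5 MW of headroom, and all names and numbers are assumptions, not the operations team's actual method.

```python
from dataclasses import dataclass

PREVIEW_FRACTION = 0.10  # preview run uses 10% of the full system (from the slides)


@dataclass
class Decision:
    estimated_peak_mw: float
    action: str  # "admit", "admit_with_second_generator", or "reject"


def estimate_peak_mw(idle_mw, preview_samples_mw, target_fraction):
    """Extrapolate the preview run's peak rise above idle to the target size.

    Assumption (hypothetical): the rise scales linearly with the node count.
    """
    rise_mw = max(preview_samples_mw) - idle_mw
    return idle_mw + rise_mw * (target_fraction / PREVIEW_FRACTION)


def decide(idle_mw, preview_samples_mw, target_fraction, limit_mw, generator_mw=5.0):
    """Admit, admit with the 2nd generator running, or reject the job."""
    peak_mw = estimate_peak_mw(idle_mw, preview_samples_mw, target_fraction)
    if peak_mw <= limit_mw:
        return Decision(peak_mw, "admit")
    if peak_mw <= limit_mw + generator_mw:
        return Decision(peak_mw, "admit_with_second_generator")
    return Decision(peak_mw, "reject")


def safety_valve_should_kill(current_mw, limit_mw):
    """Safety valve: kill the job as soon as measured power exceeds the limit."""
    return current_mw > limit_mw


if __name__ == "__main__":
    # Hypothetical preview: site idles at 14 MW, preview run peaks at 14.4 MW.
    d = decide(14.0, [14.1, 14.4, 14.3], target_fraction=1.0, limit_mw=17.0)
    print(d.action, round(d.estimated_peak_mw, 1))
```

Keeping the safety valve separate from the admission decision reflects the slide's structure: the estimate can be wrong, so the runtime check does not depend on it.
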
Improvements (others)
- Active user support
- based on analysis of automatically collected job profiling data, the user support team can identify and approach users whose jobs have potential for performance improvement.
- “micro” queue
- a job queue for small jobs that fills spatial and temporal scheduling gaps.
- “Waiting for K”
- a command that provides the estimated waiting time from job submission to job start.
- “ksub”
- a command that allows submitting more jobs than the system limit.
- “K pre-post cloud”
- An OpenStack based pre-post environment for various user needs
- “R-CCS software center”
- An activity to support development and promotion of outstanding
software made in R-CCS.
- etc.
Towards operation/services of Post-K
- Increase the effective usage rate
- to increase the job filling rate by 10%, we should consider rational node allocation and “charge” rules
- to increase availability and decrease PUE, we have to improve the efficiency and quality of operation, including automation based on data analysis
- Improve service quality
- commit to building a software ecosystem
- collaborate with service providers
Numerous users/projects from various fields of science and engineering will come to Post-K. We are now discussing the operation of Post-K.