Operation of the K computer and the facility (Fumiyoshi Shoji)

SLIDE 1

RIKEN Center for Computational Science

Computer simulations create the future

Fumiyoshi Shoji (Division Director) Operations and Computer Technologies Div. RIKEN Center for Computational Science

Operation of the K computer and the facility

SLIDE 2

An announcement of the K computer’s shutdown

  • 2019/1/31: https://www.r-ccs.riken.jp/en/topics/20190131.html

[Timeline, 2006 to 2019: design & construction (facility); design & construction (K computer); early access; official operation]

  • Over 8 years!
  • I moved to RIKEN and joined the early phase of the K project.

SLIDE 3

K computer and achievements

  • The K computer:
      • developed through a collaboration between RIKEN and FUJITSU in a Japanese national project.
      • designed for general-purpose computing.
      • no accelerators
      • high memory and interconnect bandwidth
  • Achievements:
      – TOP500 list: No. 1 in Jun. and Nov. 2011 (#18 in the latest list)
          • The world’s first supercomputer to achieve over 10 PF HPL performance.
      – Graph500 list: No. 1 in Jun. 2014 and from Jul. 2015 to Nov. 2018
      – HPCG results: No. 1 from Nov. 2016 to Nov. 2017 (#3 in the latest list)
      – Gordon Bell Prize: winner in 2011 and 2012
      – Other remarkable results in science and engineering
          • See http://www.r-ccs.riken.jp/en/
SLIDE 4

System overview

[Diagram: system components]

  • The K computer: compute nodes and I/O nodes, 6D mesh/torus network (Tofu)
  • Local File System (LFS): 11PB
  • Global File System (GFS): 40PB
  • Servers: Frontend Servers, Pre/Post Servers, Management Servers, Control Servers
  • Networks: Global I/O network, Control & Management network
  • Users connect through the Internet

# of CPUs          82,944
Memory capacity    1.27PiB

SLIDE 5

Facility overview (power supply)

[Diagram: power supply paths]

  • Power supply company -> Substation (30MW)
  • Gas supply company -> Power generators: two gas turbine power generators (5MW each)
  • Loads: K computer, storages & servers, air handlers etc., chillers
  • Other diagram labels: active/standby; 11-12MW; 3-5MW
  • Total power consumption: 14-16MW

SLIDE 6

Facility overview (Co-generation system)

  • The co-generation system achieves higher energy efficiency.
  • On the other hand, because the power generators and chillers are tightly coupled, facility operation is much more complicated.

[Diagram: gas feeds the gas turbine power generators (5MW x 2). Output labels: electricity; steam and heat to the absorption chillers (for air conditioning, for cooling); unusable waste heat. Share labels: ~30%, ~45%, ~25%.]

Chiller type  Quantity  Cooling capability (USRt)  Cooling capability (MW)  Power consumption (kW)
Absorption    4         1,700                      5.98                     273
Centrifugal   2         1,400                      4.93                     901
Centrifugal   1         700                        2.46                     389
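
The two cooling-capability columns of the chiller table express the same quantity in different units. The following is a minimal sketch of that conversion, using the standard definition 1 USRt = 12,000 BTU/h, about 3.517 kW. The function name and the "cooling per unit of electricity" ratio are illustrative additions, not part of the slides; the ratio ignores the steam that drives the absorption chillers, so it is not a full efficiency figure.

```python
KW_PER_USRT = 3.517  # 1 US refrigeration ton = 12,000 BTU/h, about 3.517 kW

# (type, quantity, cooling capability in USRt, power consumption in kW), as listed on the slide
CHILLERS = [
    ("Absorption", 4, 1700, 273),
    ("Centrifugal", 2, 1400, 901),
    ("Centrifugal", 1, 700, 389),
]


def usrt_to_mw(usrt):
    """Convert a cooling capability from US refrigeration tons to megawatts."""
    return usrt * KW_PER_USRT / 1000.0


for kind, quantity, usrt, power_kw in CHILLERS:
    cooling_mw = usrt_to_mw(usrt)
    # Illustrative ratio only: cooling output per unit of electrical input.
    ratio = cooling_mw * 1000.0 / power_kw
    print(f"{kind:<12} x{quantity}: {cooling_mw:.2f} MW cooling, "
          f"{ratio:.1f} kW of cooling per kW of electricity")
```

The converted values agree with the MW column to within rounding of the conversion factor.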

SLIDE 7

Statistics

  • 2012/9/28 – 2019/2/3 (6 years and 4 months)

# of projects            649
# of (real) users        3,570
# of processed jobs      3,491,472
Total used node hours    3,389,123,489 (*)

(*) 73.5% over the 6 years and 4 months
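
The footnoted percentage can be roughly reproduced from the figures on this slide. This is a back-of-the-envelope sketch under two assumptions that the slides do not state: that the 82,944 CPUs in the system overview correspond to 82,944 nodes, and that every hour between the two dates counts as capacity. The function name is hypothetical.

```python
from datetime import date

NODES = 82_944                      # assumption: one node per CPU listed in the system overview
USED_NODE_HOURS = 3_389_123_489     # "Total used node hours" from the slide
START, END = date(2012, 9, 28), date(2019, 2, 3)


def usage_ratio(used_node_hours, nodes, start, end):
    """Fraction of the theoretical node hours between start and end that jobs consumed."""
    wall_hours = (end - start).days * 24
    return used_node_hours / (nodes * wall_hours)


ratio = usage_ratio(USED_NODE_HOURS, NODES, START, END)
print(f"{ratio:.1%}")
```

The result lands within a fraction of a percentage point of the 73.5% on the slide; the exact accounting behind the slide's figure is not given.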

SLIDE 8

Yearly availability & job filling rate

Fiscal year          Availability  Scheduled maintenance  Irregular system down  Job filling rate
FY2012 (Sep.-Mar.)   91.9%         6.3%                   1.8%                   61.2%
FY2013               94.7%         4.0%                   1.2%                   75.9%
FY2014               93.3%         3.2%                   3.5%                   75.6%
FY2015               94.0%         3.7%                   2.3%                   75.3%
FY2016               95.0%         3.8%                   1.2%                   78.9%
FY2017               93.9%         4.2%                   1.9%                   77.8%
FY2018 (Apr.-Jan.)   94.6%         3.7%                   1.3%                   76.5%

\[ \text{Availability} = \frac{365\,\text{d} \times 24\,\text{h} - \text{scheduled maintenance} - \text{irregular system down}}{365\,\text{d} \times 24\,\text{h}} \]

\[ \text{Job filling rate} = \frac{\text{node time used by jobs}}{\text{available time}} \]

  • Availability stays at a high level (~95%).
  • Irregular system down has been kept below 2% in the last 3 years.
  • Considering the direct interconnection between nodes and the block-wise job allocation, the job filling rate is sufficiently high.
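
Availability and job filling rate, as defined on this slide, translate directly into code. This is a minimal sketch; the function names are hypothetical, and the sample hours are derived from the FY2016 percentages for a 365-day year rather than taken from operational logs.

```python
HOURS_PER_YEAR = 365 * 24


def availability(scheduled_maintenance_h, irregular_down_h, period_h=HOURS_PER_YEAR):
    """Share of the period in which the system was in service."""
    return (period_h - scheduled_maintenance_h - irregular_down_h) / period_h


def job_filling_rate(node_time_used, available_node_time):
    """Share of the available node time that was consumed by jobs (same unit for both)."""
    return node_time_used / available_node_time


# Sample input: FY2016 shares from the slide (3.8% maintenance, 1.2% irregular down).
maintenance_h = 0.038 * HOURS_PER_YEAR
irregular_h = 0.012 * HOURS_PER_YEAR
print(f"availability: {availability(maintenance_h, irregular_h):.1%}")
```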

SLIDE 9

Irregular system down

[Chart: down time (hours, axis up to 350) per fiscal year, FY2012 (Sep.-Mar.) to FY2018 (Apr.-Jan.), broken down by cause: LFS, GFS, job scheduler, MPI, misc. FY2018 annotations: thunder 43.0h, typhoon 46.4h.]

  • File system failures (GFS & LFS) are the dominant cause of irregular system down.
  • Since FY2015, we have given priority to resuming service quickly over investigating the cause of failures.
  • Misc. in FY2018 includes a failure of the power supply facility caused by heavy rain and wind from a typhoon (8/20) and a power outage caused by lightning (6/8).

SLIDE 10

Improvements (PUE)

[Chart: Power consumption & PUE (Power Usage Effectiveness), monthly from Sep-12 to Dec-18. Stacked shares (0% to 100%) of computing resources, cooling (chillers, etc.) and cooling (air handlers, etc.); PUE(#49) on a second axis from 1.32 to 1.50.]

  • Optimization of air cooling operation (2012-2013)
  • Optimization of power generator and chillers (2018-)
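
For reference, PUE is commonly defined as total facility power divided by the power drawn by the computing equipment. The sketch below applies that definition to the three categories shown in the chart; the function name and the sample megawatt values are hypothetical and are not measurements from the facility.

```python
def pue(computing_mw, chillers_mw, air_handlers_mw):
    """Power Usage Effectiveness: total facility power / computing power."""
    if computing_mw <= 0:
        raise ValueError("computing power must be positive")
    total_mw = computing_mw + chillers_mw + air_handlers_mw
    return total_mw / computing_mw


# Hypothetical sample values in MW, chosen only to illustrate the calculation.
print(f"PUE = {pue(11.0, 3.0, 1.0):.2f}")
```

A PUE of 1.0 would mean that all power goes to computing; lowering the cooling terms moves the value toward 1.0, which is what the two optimizations listed above aim for.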
SLIDE 11

Improvements (Power capping)

  • Goal: avoid the penalty incurred when power consumption exceeds the upper limit.

[Chart: typical power consumption history (4/14, 9:00-14:00). Power consumption (MW, axis 2 to 20) for the site total and for the K computer only: idle, then a full-node job running (elapsed time ~2h30m), then idle. Annotations: 4MW up, 4MW down.]

  • Power consumption changes by a large amount very quickly: 14 -> 18MW and 18 -> 14MW.
  • We are concerned about the impact on the power and cooling facilities.
  • Preview process for large-scale jobs (more than 40% of the full system):
      • Users who want to run a large-scale job must first run a small version of it (10% of the full system) before the large-scale mode period.
      • We evaluate the power consumption profile of that job, estimate the peak power consumption, and decide whether to admit the job.
  • Preparing for large power consumption:
      • If the estimated power consumption exceeds the limit, we also consider activating the 2nd power generator while the job is running.
  • Safety valve:
      • If power consumption nevertheless exceeds the limit, the monitoring system detects it and the job is killed automatically.
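
The preview process, the second generator and the safety valve can be sketched as three small functions. The slides do not say how the peak is estimated, so this sketch assumes that the power added by a job scales linearly with the fraction of the system it uses, and that activating the 2nd generator effectively raises the limit by that generator's capacity. All names, the 17MW limit and the preview reading are hypothetical.

```python
from enum import Enum


class Decision(Enum):
    ADMIT = "admit"
    ADMIT_WITH_SECOND_GENERATOR = "admit and activate the 2nd generator"
    REJECT = "reject"


def estimate_peak_mw(baseline_mw, preview_peak_mw, preview_fraction, target_fraction):
    """Scale the power added by the preview run to the target job size.

    Assumption (not from the slides): added power grows linearly with node count.
    """
    added_mw = preview_peak_mw - baseline_mw
    return baseline_mw + added_mw * target_fraction / preview_fraction


def review(estimated_peak_mw, limit_mw, second_generator_mw):
    """Decide whether to admit a large-scale job based on its estimated peak."""
    if estimated_peak_mw <= limit_mw:
        return Decision.ADMIT
    # Assumption: the 2nd generator covers the excess up to its capacity.
    if estimated_peak_mw <= limit_mw + second_generator_mw:
        return Decision.ADMIT_WITH_SECOND_GENERATOR
    return Decision.REJECT


def should_kill(measured_mw, limit_mw):
    """Safety valve: signal that the running job must be killed."""
    return measured_mw > limit_mw


# Hypothetical preview: site idles at 14MW, a 10% run peaks at 14.4MW.
peak = estimate_peak_mw(14.0, 14.4, preview_fraction=0.10, target_fraction=1.0)
print(f"estimated peak: {peak:.1f} MW ->", review(peak, limit_mw=17.0, second_generator_mw=5.0).value)
```

With these sample inputs the estimate reproduces the 14 -> 18MW step shown in the chart, but a real estimate would need the measured profile of each job rather than a single linear factor.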

SLIDE 12

Improvements (others)

  • Active user support
      • Based on analysis of automatically collected job profiling data, the user support team can identify and approach users whose jobs have potential for performance improvement.
  • “micro” queue
      • A job queue for small jobs that fill spatial and temporal gaps in the schedule.
  • “Waiting for K”
      • A command that provides the estimated waiting time from submission to execution.
  • “ksub”
      • A command that allows users to submit more jobs than the system limit permits.
  • “K pre-post cloud”
      • An OpenStack-based pre/post-processing environment for various user needs.
  • “R-CCS software center”
      • An activity to support the development and promotion of outstanding software made in R-CCS.
  • etc.
SLIDE 13

Towards operation and services of Post-K

  • Increase the effective usage rate
      • To increase the job filling rate (+10%), we should consider rational node allocation and “charge” rules.
      • To increase availability and decrease PUE, we have to improve the efficiency and quality of operation, including automation based on data analysis.
  • Improve service quality
      • Commit to building a software ecosystem.
      • Collaborate with service providers.

Numerous users and projects from various fields of science and engineering will come to Post-K.
We are now discussing the operation of Post-K.

SLIDE 14

Thank you for your attention