ORNL is managed by UT-Battelle, LLC for the US Department of Energy
Birds-of-Feather: ECP Center and Application Monitoring - Working Group
Day: Thursday, February 6, 2020
Time: 1:30pm to 3:00pm CT
Room: Founders I
Recent Advances Are Disrupting The Status Quo…
[Figure labels: Homogeneous Processors; Homogeneous Memories]
Where Data-Driven Decisions Make A Difference…
- Improved feedback to Application Developers on how their jobs performed
(e.g. others with your application signature have used the XXXX library)
- Improved feedback to CS Researchers on how to improve the software environment
(e.g., link compiler and memory data)
- Improved Job Scheduler to decide what runs on our machines
- Improved feedback to Operations (e.g. Chiller Management)
- Improved feedback to Planners on the characteristics of our workload
(e.g., prefer 5% faster memory over 12% faster interconnect)
- Improved feedback to Vendors on how we use systems
(e.g., 22% of jobs use GPUs in an XXXXX manner)
- Security & better quantification of our outputs: better ways to identify applications (avoiding
inappropriate usage such as malware or Bitcoin mining) and to answer questions about science hours, utilization, and so on.
- Plus many, many other uses
But There Are Barriers to Overcome…
- A tremendous amount of data is currently collected within our
computer center, but it is artificially separated by knowledge domains
  – Sysadmin data
  – Data available from tools and performance counters
  – Resilience & health of system data
  – Workload characteristics & resource usage details for future procurements
  – And so on…
- The artificially separated domains form a high barrier to
understanding & a wealth of information is currently largely untapped
  – Currently users are not aware of most of the data & do not have adequate access
  – Data available from various environment tools is underutilized
  – System designers don’t have adequate access to how the systems are used
“It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts.”
– Sherlock Holmes (Sir Arthur Conan Doyle)
What We Have… SEPARATE STOVEPIPE SYSTEMS: Ops, Users, CS Research, Planning, Security
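As a rough illustration of what breaking down the stovepipes could look like, the sketch below merges records from three separately held sources into one view per job. Everything here is an assumption made for illustration: the source names, the field names, the values, and the idea that a shared `job_id` key exists across domains.

```python
# Hypothetical records: the sources, field names, and values are invented
# for illustration and do not come from any real center's data.
scheduler = [
    {"job_id": 101, "user": "alice", "nodes": 64},
    {"job_id": 102, "user": "bob", "nodes": 8},
]
counters = [
    {"job_id": 101, "peak_mem_gb": 410.0},
]
health = [
    {"job_id": 102, "node_failures": 1},
]


def merge_by_job(*sources):
    """Combine records from several sources into one dict per job_id."""
    merged = {}
    for source in sources:
        for record in source:
            # Create the per-job entry on first sight, then add this
            # source's fields to it.
            merged.setdefault(record["job_id"], {}).update(record)
    return merged


jobs = merge_by_job(scheduler, counters, health)
for job_id, fields in sorted(jobs.items()):
    print(job_id, fields)
```

A job that is missing from one source simply lacks those fields in the merged view, which makes gaps in coverage visible rather than hiding them.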
| Risk ID | WBS | Risk* | Adding Data Component To Support Solution | Mitigation (best) | Mitigation (worst) |
|---|---|---|---|---|---|
| 10000 | 2.2 | If Aurora has lower than anticipated aggregate memory capacity, then some projects may be unable to run challenge problem(s), which would make it impossible to meet KPP-1 or KPP-2. | Track memory utilization | $1.0M | $1.5M |
| 10001 | 2.2 | If Frontier has lower than anticipated aggregate memory capacity, then some projects may be unable to run challenge problem(s), which would make it impossible to meet KPP-1 or KPP-2. | Track memory utilization of apps | $1.0M | $1.5M |
| 10007 | 2.3.1 | If MPICH does not meet the other ECP subproject performance needs, for example, in interactions between CPU and GPU, directly between GPUs, in latency hiding, etc., then this will be a significant impact to the overall project as vendor implementations of MPI often rely heavily on MPICH. | Monitor MPI performance | $2.0M | $3.0M |
| 10008 | 2.3.1 | If OpenMP 4.5/5.0 does not meet the other ECP subproject needs, there will be a significant impact to the overall project as OpenMP is a widely used mechanism for achieving good node-level performance. | Track OpenMP utilization in applications & plugins to tools API | $1.6M | $2.4M |
| 10009 | 2.3.3 | If sparse linear solvers fail to perform well at scale and on multi-node architectures or otherwise don't meet ECP application needs, several ECP applications will be at risk of not being able to meet their KPPs. | Track Libraries utilization and how they scale | $1.6M | $2.4M |
| 10010 | 2.3.3 | If dense linear algebra kernels fail to perform well on multi-node architectures, several ECP applications will be at risk of not being able to meet their KPPs. | Track libraries implementation | $0.8M | $1.2M |
| 10012 | 2.3.5 | If ST products are perceived as, or in fact are, inferior or overly complex, AD performance could suffer and ST products will not be adopted. | Track utilization of ST products | | |
| 10016 | 2.2 | Aurora or Frontier HW or SW has defects (e.g. bugs). | Monitor error bugs and aggregate them until a threshold is met | | |
| 10018 | 2.2 | Language features used by applications perform poorly or are not fully supported on Frontier and/or Aurora. | Compiler tracking of language standard utilization and possibly performance | $0.6M | $0.9M |
| 10019 | 2.3.2 | If vendor software does not provide required functionality or performance, Applications and/or ST products may not perform as required. | Track Application Utilization | | |
| 10022 | 2.3.2 | If we do not have a Fortran compiler on Aurora that supports OpenMP target offload capabilities, then we will not be able to compile applications. | Track Fortran utilization | $1.0M | $1.5M |
| 10023 | 2.3.2 | If we do not have a Fortran compiler on Frontier that supports OpenMP target offload capabilities, then we will not be able to compile applications. | Track Fortran + OpenMP utilization | $1.0M | $1.5M |
| 10025 | 2.3 | If vendors produce new high-performance programming models for next generation architectures (e.g. HIP or SYCL) instead of ST-supported models, ST products or functionality may be underused. | Track SYCL, PM utilization | | |
| 10032 | 2.3 | If ST products do not function, meet performance targets, or support key system capabilities at full system size, then dependent AD and ST codes will not meet goals. Because there are no effective proxy systems, these issues are revealed late in the ECP project. | Track how ST products are used | | |
| 10046 | 2.4 | If the Facilities do not provide reliable, timely access to the systems for integration of ECP ST, ECP AD products, and/or resources in support of ECP efforts, then this will delay demonstration of KPPs. | Track utilization for information sharing with Facilities | | |

* Taken From ECP Risk Register v6.1
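The data component listed for Risk 10016 ("Monitor error bugs and aggregate them until a threshold is met") could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a description of any deployed tool: the `ErrorAggregator` class, the error signature strings, and the threshold value are all hypothetical.

```python
from collections import Counter


class ErrorAggregator:
    """Counts error reports per signature and flags a signature once,
    at the moment its count first reaches the threshold."""

    def __init__(self, threshold):
        if threshold < 1:
            raise ValueError("threshold must be at least 1")
        self.threshold = threshold
        self.counts = Counter()
        self.flagged = set()

    def record(self, signature):
        """Record one report; return True only when the threshold is first met."""
        self.counts[signature] += 1
        if self.counts[signature] >= self.threshold and signature not in self.flagged:
            self.flagged.add(signature)
            return True
        return False


# Hypothetical error signatures, invented for illustration.
agg = ErrorAggregator(threshold=3)
reports = ["gpu_ecc", "link_flap", "gpu_ecc", "gpu_ecc", "gpu_ecc"]
escalated = [sig for sig in reports if agg.record(sig)]
print(escalated)
```

Flagging each signature only once keeps a recurring defect from generating repeated escalations, while the running counts remain available for later reporting.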