NSF Workshop on the Future of High Performance Computing • Washington DC December 4, 2009
NSF Future of High Performance Computing
Bill Kramer
Why Sustained Performance is the Critical Focus
- Memory Wall
- The limitation on computation speed caused by the growing disparity between processor speed and memory latency and bandwidth
- From 1986 to 2000, processor speed increased at an annual rate of 55%, while memory speed improved by only 10% per year
- Issue
- Memory latency and bandwidth limitations within a processor make it difficult to achieve a major fraction of a chip's peak performance
- Latency and bandwidth limitations of the communication fabric make it difficult to scale science and engineering applications to large numbers of processors
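The memory-bandwidth limitation above can be illustrated with a simple roofline-style bound (my sketch, not from the slides): attainable performance is capped by the smaller of peak compute rate and memory bandwidth times arithmetic intensity. All numbers below are hypothetical, chosen only to show the effect.

```python
def attainable_gflops(peak_gflops, bandwidth_gbs, flops_per_byte):
    """Upper bound on sustained performance for a kernel with the
    given arithmetic intensity (flops per byte moved from memory)."""
    return min(peak_gflops, bandwidth_gbs * flops_per_byte)

# A hypothetical chip with 100 GF/s peak but only 25 GB/s of memory bandwidth:
peak, bw = 100.0, 25.0
for intensity in (0.25, 1.0, 4.0, 16.0):
    frac = attainable_gflops(peak, bw, intensity) / peak
    print(f"intensity {intensity:5.2f} flops/byte -> "
          f"{frac:.0%} of peak attainable")
```

Low-intensity (memory-bound) kernels reach only a small fraction of peak, which is why sustained rather than peak performance is the meaningful measure.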
[Charts: Relationship between Peak, Linpack, and Sustained Performance using SSP for NERSC systems, 1997-2008. One panel shows the ratio of Linpack to SSP over time; the other shows Peak (TF), Linpack (TF), and Normalized SSP (TF) in TF/s.]
Recommendation
- Adopt a longer term focus, rather than the three-to-five-year focus, which is really just the useful lifetime of a single system.
- Achieving and using an Exascale system, or the equivalent of tens of 100-Petascale systems, will span 15 years and a progression of resource deployments.
- NSF will be well served to create a 15 year funding program that combines the total cost of acquiring, supporting, and using the resources.
- This strategy should include creating a supporting facility infrastructure that allows technology refreshes to be quickly deployed and integrated with existing resources.
- To enable effective resource insertion, NSF should separate the selection of organizations that provision and support HPC resources from the selection of the resources themselves.
- The current NSF practice of issuing solicitations that tie an organization as service provider to a single system choice for each resource refresh leads to sub-optimization that can yield neither the most effective organization nor the best value technology.
- Focus on true application sustained performance.
- Using something like “Sustained System Performance” to determine the best value resource solutions will
enable NSF to have the most cost effective computing environments for the computational science communities.
- Use state-of-the-practice open, best value procurements that enable comparing technology choices on sustained performance while allowing vendors flexibility.
- NSF should take the lead in redefining the debate – away from simple metrics and TOP500 and towards
meaningful measures for science.
Recommendation
- NSF should follow the industry trend of concentrating its computational and data storage resources at a few locations that can then make long term investments amortized over a series of technology refreshes.
- These locations should be chosen on an organization's ability to manage large scale, early release systems; support an evolving computational science community; provide cost effective extreme scale infrastructure; and attract and engage world class computer science and computational science staff.
- The NSF should develop an appropriate balance of ‘production quality’ and
‘experimental’ resources.
- Production quality means systems from well known architectures (albeit possibly early delivery versions of new generations) with proven Performance, Effectiveness, Reliability, Consistency, and Usability for the primary mission of use by computational science.
- "Experimental" resources are those with the potential to be disruptive technologies leading to significant (~10x) performance and/or price performance improvements.
- These two types of systems have clearly different missions.
- A typical investment strategy might be 85% production/15% experimental.
- NSF should establish a "best practice" review of both US funded resources and international funding programs.
- NSF should invest in “performance based design” for all application areas.
Geographic Distribution of PRAC Leaders
Recommendation
- NSF should separate the provisioning of a national science network from middleware software and/or compute and storage resource provisioning.
- A national science network that serves extreme scale computational and data resources, major communities of computational and data scientists, and major observational and experimental resources needs a long term roadmap with consistent funding and a plan for technology insertion. A model for such a plan can be found in DOE's ESnet program, among others.
- NSF should likewise have a sustained program for distributed (aka cloud) middleware software creation and support.
- This support needs to be synchronized with the computational, data and networking
components of the NSF strategy, but needs to be an independent program component.
- NSF should support expanded development and evolution of extreme scale system
software aligned with the IESP roadmap.
- There are contract arrangements that can assure both high quality systems and services, and innovation and advanced technology, in whatever balance NSF needs.
- Performance and Rewards based contracts
- Deployment Project Management and ongoing operational assessments à la ITIL
- Example: an agreement with a 6 year base term, renewable for up to a total of 16 years
- Automatic as well as discretionary extensions that benefit both NSF and the providing organizations
ADDITIONAL SLIDES
A Generalized Sustained System Performance (SSP) Framework
- Is an effective and flexible way to evaluate systems
- Determine the Sustained System Performance for each phase of each system:
1. Establish a set of performance tests that reflect the intended work the system will do
- Can be any number of tests, as long as they have a common measure of performance
2. A test consists of a code and a problem set
3. Establish the amount of work (ops) the test needs to do for a fixed concurrency or a fixed problem set
4. Time each test execution, using wall-clock time
5. Determine the amount of work done per scalable unit (node, socket, core, task, thread, interface, etc.)
- Work = total operations / (total time × number of scalable units used for the test)
6. Composite the work per scalable unit across all tests
- Composite functions are based on circumstances and test selection criteria
- Can be weighted or not, as desired
7. Determine the SSP of a system at any time period by multiplying the composite work per scalable unit by the number of scalable units in the system
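The seven steps above can be sketched in a few lines of Python. The test names, measurements, and the choice of geometric mean as the composite function are all illustrative assumptions; the framework itself leaves the composite function open.

```python
import math

# Hypothetical per-test measurements: each test ran a fixed problem on
# `units` scalable units (e.g. nodes), doing `ops` total operations in
# `seconds` of wall-clock time. Names and numbers are invented.
tests = [
    {"name": "cfd", "ops": 4.8e15, "seconds": 1200.0, "units": 512},
    {"name": "md",  "ops": 2.1e15, "seconds":  900.0, "units": 512},
    {"name": "qcd", "ops": 9.6e15, "seconds": 1500.0, "units": 1024},
]

def per_unit_work(t):
    # Step 5: work rate per scalable unit = ops / time / units
    return t["ops"] / t["seconds"] / t["units"]

def composite(rates):
    # Step 6: one common (unweighted) choice is the geometric mean;
    # other, possibly weighted, composite functions are allowed.
    return math.exp(sum(math.log(r) for r in rates) / len(rates))

def ssp(tests, system_units):
    # Step 7: SSP = composite per-unit work * number of units in the system
    return composite([per_unit_work(t) for t in tests]) * system_units

print(f"SSP for a 10,000-unit system: {ssp(tests, 10_000):.3e} ops/s")
```

Because every test reports work per scalable unit in a common measure, the same composite can be applied to systems of very different sizes and architectures.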
Examples of Using the SSP Framework
- Test a system upon delivery, use to select a system, etc.
- Determine the Potency of the system: how well the system will perform the expected work over some time period
- Potency is the sum, over the specified time, of the products of each of the system's SSP values and the time period over which that SSP holds
- Different SSPs for different periods
- Different SSPs for different types of computation units (heterogeneous)
- Determine the Cost of systems
- Cost can be any resource units ($, Watts, space…) and with any complexity (Initial, TCO,…)
- Determine the Value of the system
- Value is the Potency divided by a cost function
- If needed, compare the value of different system alternatives or compare against expectations
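The Potency and Value definitions above reduce to a short computation. A minimal sketch, with invented phase data and an invented total cost:

```python
# Hypothetical sketch of Potency and Value as defined on this slide.
# SSP may differ across periods of a system's life (e.g. after an
# upgrade), so Potency sums SSP * duration over the phases; Value
# divides Potency by a cost function. All figures are invented.

phases = [
    # (SSP in TF/s, duration of that phase in years)
    (100.0, 2.0),   # initial delivery
    (160.0, 3.0),   # after a mid-life technology refresh
]

def potency(phases):
    # Sum over phases of SSP times the duration of that phase
    return sum(ssp * years for ssp, years in phases)

def value(phases, cost):
    # Cost can be in any resource unit ($, watts, floor space, ...)
    return potency(phases) / cost

print(f"Potency: {potency(phases)} TF-years")
print(f"Value:   {value(phases, 50.0):.1f} TF-years per $M (assumed $50M TCO)")
```

Computing Value for each candidate system against a common cost unit gives a direct basis for comparing alternatives or checking delivered systems against expectations.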