

Slide 1

Linux Kernel Co-Scheduling For Bulk Synchronous Parallel Applications

ROSS 2011 Tucson, AZ

Terry Jones

Oak Ridge National Laboratory

Slide 2

Outline

  • Motivation
  • Approach & Research
  • Design Attributes
  • Achieving Portable Performance
  • Measurements
  • Conclusion & Acknowledgements

Slide 3

We’re Experiencing an Architectural Renaissance

[Slide graphic: Increased Transistor Density and Increased Core Counts lead to Disruptive Technologies]


  • Factors driving the change:
  • Moore's Law -- the number of transistors per IC doubles roughly every 24 months.
  • No power headroom -- clock speed will not increase (and may decrease) because of power: Power ∝ Voltage² × Frequency; at fixed voltage, Power ∝ Frequency, and since sustaining a higher frequency requires a roughly proportional voltage, Power ∝ Voltage³.



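For readers who want the relations above spelled out, here is a short sketch of the standard dynamic-power argument; the capacitance constant C and the 15% example are illustrative, not figures from the talk.

```latex
% Dynamic CMOS power (C = switched capacitance):
P = C\,V^{2}\,f
% Sustaining a higher frequency needs a roughly proportional supply voltage,
% i.e. f \propto V, hence
P \propto V^{3} \quad\text{(equivalently } P \propto f^{3}\text{)}
% Worked example: reducing V and f by 15\%:
\frac{P'}{P} = 0.85^{3} \approx 0.61 \quad\text{(about a 39\% dynamic-power saving)}
```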

Slide 4

A Key Component of the Colony Project: Adaptive System Software For Improved Resiliency and Performance

Objectives

  • Provide technology to make portable scalability a reality.
  • Remove the prohibitive cost of full POSIX APIs and full-featured operating systems.
  • Enable easier leadership-class scaling for domain scientists by removing key system software barriers.

Approach

  • Automatic and adaptive load balancing plus fault tolerance.
  • High-performance peer-to-peer and overlay infrastructure.
  • Address issues with Linux to provide the familiarity and performance needed by domain scientists.

Impact

  • Full-featured environments allow a full range of programming and development tools, including debuggers, memory tools, and system monitoring tools that depend on separate threads or other POSIX APIs.
  • Automatic load balancing helps correct problems associated with long-running dynamic simulations.
  • Coordinated scheduling removes the negative impact of OS jitter from full-featured system software.

Collaborators

  • Terry Jones, Project PI; Laxmikant Kalé, UIUC PI; José Moreira, IBM PI

Challenges

  • Computational work often includes large amounts of state, which places additional demands on successful work-migration schemes.
  • For widespread acceptance by the Linux community, the effort to validate and incorporate HPC-originated advancements into the Linux kernel must be minimized.

Slide 5

Motivation – App Complexity

  • Linux
  • > Familiar
  • > Open Source
  • > Support for common system calls
  • > Support for daemons & threading packages
  • > Debugging strategies
  • > Asynchronous strategies
  • > Support for administrative monitoring
  • OS Scalability
  • > Eliminate OS scalability issues through parallel-aware scheduling

Don’t Limit Development Environment

Slide 6

The Need For Coordinated Scheduling

Bulk Synchronous Programs

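A bulk synchronous parallel application alternates a compute phase with a collective synchronization phase, so the slowest rank in each phase gates everyone else. A minimal MPI sketch of that pattern follows; compute_local_work is an illustrative placeholder, not a routine from the talk.

```c
/* Minimal bulk-synchronous loop: every rank must reach the collective
 * before any rank can proceed, so one delayed rank stalls them all. */
#include <mpi.h>

static void compute_local_work(double *local, int step)
{
    /* stand-in for the application's per-rank computation */
    *local = (double)step;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    double local = 0.0, global = 0.0;

    for (int step = 0; step < 1000; step++) {
        compute_local_work(&local, step);     /* compute phase            */
        MPI_Allreduce(&local, &global, 1,     /* synchronization phase:   */
                      MPI_DOUBLE, MPI_SUM,    /* completes only after the */
                      MPI_COMM_WORLD);        /* slowest rank arrives     */
    }

    MPI_Finalize();
    return 0;
}
```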

Slide 7

The Need For Coordinated Scheduling

  • Permit Full Linux Functionality
  • Eliminate Problematic OS Noise
  • Metaphor: Cars and Coordinated Traffic Lights

Slide 8

What About …

  • Core Specialization
  • Minimalist OS
  • Will Apps Always Be Bulk Synchronous?
  • Yeah, but it's Linux

Slide 9

  • The TAU team at the University of Oregon has reported a 23% to 32% increase in runtime for parallel applications running at 1,024 nodes with 1.6% operating system noise.

  • Ferreira, Bridges, and Brightwell confirmed that 1000 Hz, 25 µs noise interference (an amount measured on a large-scale commodity Linux cluster) can cause a 30% slowdown in application performance on ten thousand nodes.

[Timeline figure: per-core activity over time on Node 1 (cores a–d) and Node 2 (cores a–d), uncoordinated vs. with HPC Colony Technology coordinated scheduling]

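The slowdowns cited above follow from a simple observation: at each synchronization point every rank waits for the most-delayed rank, so independent per-node noise events add up as a maximum rather than an average. The self-contained toy simulation below illustrates that amplification; its parameters are illustrative placeholders, not a reproduction of the cited measurements.

```c
/* Toy model: each of N nodes does a fixed compute slice per iteration;
 * with probability p it is also hit by one noise event of length d.
 * The iteration finishes when the slowest node does, so even rare,
 * short interruptions stretch nearly every iteration once N is large,
 * far beyond the average per-node overhead. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const int    nodes   = 10000;
    const int    iters   = 1000;
    const double compute = 1000.0;  /* us of work per iteration (illustrative) */
    const double noise   = 25.0;    /* us per noise event (illustrative)       */
    const double prob    = 0.025;   /* chance a node is interrupted per iter   */

    double ideal = 0.0, actual = 0.0;
    srand(1);

    for (int it = 0; it < iters; it++) {
        double slowest = compute;
        for (int n = 0; n < nodes; n++) {
            double t = compute;
            if ((double)rand() / RAND_MAX < prob)
                t += noise;
            if (t > slowest)
                slowest = t;
        }
        ideal  += compute;
        actual += slowest;
    }

    printf("noise-free: %.0f us, with noise: %.0f us (%.1f%% slowdown)\n",
           ideal, actual, 100.0 * (actual - ideal) / ideal);
    return 0;
}
```

With these numbers the average per-node overhead is only about 0.06% of the compute time, yet the barrier-limited slowdown is roughly 2.5%, which is the amplification effect the slide's citations describe at scale.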

Slide 10

Goals

  • Portable Performance
  • Make OS Noise a non-issue for bulk-synchronous codes
  • Permit sysadmin best practices

Slide 11

Proof of Concept – Blue Gene/L

[Charts: Allreduce and GLOB times (log scale) versus core count (1024, 2048, 4096, 8192) for CNK, Colony with scheduler modifications (quiet and 30% noise), and Colony without modifications (quiet and 30% noise). Noise level: a serial task takes 30% longer.]

Slide 12

Approach

  • Introduces two new process flags and two new tunables:
  • > the total time of an epoch
  • > the percentage of each epoch given to the parallel application (the portion shown in blue in the co-scheduling figure)
  • Co-scheduling is dynamically turned on or off with a new system call
  • The tunables are adjusted through a second new system call (a hypothetical user-level sketch appears below, after the Salient Features list)

Salient Features

  • Utilizes a new clock synchronization scheme
  • Uses the existing fair round-robin scheduler within both epochs
  • Permits the flexibility needed by timeout-based and/or latency-sensitive applications
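
The slides describe the mechanism but do not name the system calls, so the following user-level sketch is purely hypothetical: the syscall numbers and the names coschedule_enable and coschedule_set_tunables are invented for illustration and stand in for whatever interface the actual kernel patch exposes.

```c
/* Hypothetical user-level wrappers for the co-scheduling interface
 * described on this slide.  The syscall numbers and names below are
 * invented for illustration; the real patch defines its own. */
#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>

#define __NR_coschedule_enable        400   /* hypothetical syscall number */
#define __NR_coschedule_set_tunables  401   /* hypothetical syscall number */

/* Turn coordinated scheduling on (1) or off (0) for the calling process. */
static long coschedule_enable(int on)
{
    return syscall(__NR_coschedule_enable, on);
}

/* Set the two tunables: total epoch length and the share of each epoch
 * reserved for the parallel application. */
static long coschedule_set_tunables(unsigned int epoch_usec,
                                    unsigned int app_percent)
{
    return syscall(__NR_coschedule_set_tunables, epoch_usec, app_percent);
}

int main(void)
{
    /* e.g. 10 ms epochs with 90% dedicated to the parallel application */
    coschedule_set_tunables(10000, 90);
    coschedule_enable(1);
    /* ... run the bulk-synchronous workload ... */
    coschedule_enable(0);
    return 0;
}
```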

Slide 13

Results

Slide 14

Results

Slide 15

…and in conclusion…

  • For Further Info
  • contact: Terry Jones, trj@ornl.gov
  • http://www.hpc-colony.org
  • http://charm.cs.uiuc.edu

  • Partnerships and Acknowledgements
  • Synchronized Clock work done by Terry Jones and Gregory Koenig
  • DOE Office of Science – major funding provided by FastOS 2
  • Colony Team

Slide 16

Extra Viewgraphs

Slide 17
Improved Clock Synchronization Algorithms


  • Achievement

Developed a new clock synchronization algorithm. The new algorithm is a high-precision design suitable for large leadership-class machines like Jaguar. Unlike most high-precision algorithms, which reach their precision in a post-mortem analysis after the application has completed, the new ORNL-developed algorithm rapidly provides precise results during runtime.



  • Relevance
  • To the Sponsor: makes more effective use of OLCF and ALCF systems possible.
  • To the Laboratory, Directorate, and Division missions: demonstrates capabilities in critical system software for leadership-class machines.
  • To the Computer Science research community: a high-precision global synchronized clock is of growing interest for system software needs including parallel analysis tools, file systems, and coordination strategies; demonstrates techniques for high precision coupled with a guaranteed answer at runtime.

Sponsor: DOE ASCR, FWP ERKJT17


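The slides do not spell out the algorithm itself, so the sketch below only illustrates the basic building block that most runtime clock-synchronization schemes share: a request/reply exchange whose four timestamps bound the remote clock's offset (an NTP-style estimate). It is not the ORNL algorithm described above, and the timestamps in main are made up for illustration.

```c
/* Basic offset estimation from one request/reply exchange:
 *   t1: local send time      t2: remote receive time
 *   t3: remote send time     t4: local receive time
 * Assuming roughly symmetric network delay,
 *   offset ~= ((t2 - t1) + (t3 - t4)) / 2
 *   rtt     =  (t4 - t1) - (t3 - t2)
 * and rtt/2 bounds the error of the estimate at runtime. */
#include <stdio.h>

struct exchange { double t1, t2, t3, t4; };  /* timestamps in seconds */

static double estimate_offset(const struct exchange *e, double *rtt)
{
    *rtt = (e->t4 - e->t1) - (e->t3 - e->t2);
    return ((e->t2 - e->t1) + (e->t3 - e->t4)) / 2.0;
}

int main(void)
{
    /* Illustrative timestamps: remote clock ~1.5 ms ahead, ~200 us round trip */
    struct exchange e = { 10.000000, 10.001600, 10.001650, 10.000250 };
    double rtt, off = estimate_offset(&e, &rtt);
    printf("estimated offset %.6f s, rtt %.6f s\n", off, rtt);
    return 0;
}
```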

Slide 18

Test Setup

  • Compute Nodes: 18,688 nodes (12 Opteron cores per node)
  • Commodity Network: InfiniBand switches (3000+ ports)
  • Gateway Nodes: 192 nodes (2 Opteron cores per node)
  • Storage Nodes: 192 nodes (8 Xeon cores per node)
  • Enterprise Storage: 48 controllers (DataDirect S2A9900)
  • Interconnect links (Jaguar XT5): SeaStar2+ 3D torus at 9.6 Gbit/s, SION InfiniBand at 16 Gbit/s, InfiniBand at 16 Gbit/s, Serial ATA at 3.0 Gbit/s

Slide 19

Test Setup (continued)

Ping-pong latency: ~5.0 µs