
Lecture 2: Performance Limitations & Performance Metrics


von Neumann's bottleneck

  • von Neumann machine

– One control unit that connects memory and processor
– The connection between processor and memory is a bottleneck

[Figure: Memory, Control Unit, and Processing Unit connected by an Instruction/Data Bus; the processor-memory bus is the bottleneck]


Non-von computer

  • Non-von machine

– P processors, Q memories, R control units, one network
– Can perform P·T instructions per second minus overhead, where T is the number of instructions per second of one processor

[Figure: P Processor-Memory pairs connected by the network]


Speedup

  • ts: time to execute the best serial algorithm on one processor
  • t(1): time to execute the parallel algorithm on one processor
  • tp = t(n): time to execute the parallel algorithm on n processors

S(n) = speedup(n) = t1 / tp
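
A minimal Python sketch of these definitions (the function names and the 30 s / 20 s timings are invented for illustration; the timings are reused from the exercises at the end):

```python
def speedup(t1, tn):
    """Speedup S(n) = t(1) / t(n)."""
    return t1 / tn

def efficiency(t1, tn, n):
    """Efficiency E(n) = S(n) / n; 1.0 means ideal linear speedup."""
    return speedup(t1, tn) / n

# Invented example: 30 s on one processor, 20 s on four.
print(speedup(30.0, 20.0))        # 1.5
print(efficiency(30.0, 20.0, 4))  # 0.375
```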


What limits the performance?

  • Available parallelism
  • Load balancing

– some processors do more work than others
– some work while others are idle (nothing to do)
– queuing (waiting) for an external resource

  • Extra work

– handling the parallelism
– communication


Amdahl's Law

The speed of a computer is limited by its serial part

  • Given that

– f is the serial fraction of the code
– f·ts is the time to compute the serial part of the program
– (1-f)·ts/n is the time to compute the parallel part on n processors (the two terms combine into the law below)
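
The slide stops one step short of the law itself; combining the two terms above gives (a standard derivation, not spelled out on the slide):

```latex
S(n) = \frac{t_s}{f\,t_s + (1-f)\,t_s/n}
     = \frac{1}{f + (1-f)/n}
     \le \frac{1}{f}
```

So the speedup saturates at 1/f: with a 5% serial part (f = 0.05) it can never exceed 20.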


Amdahl's Law - implications

[Figure: speedup S(n) versus number of processors n = 1..20, one curve per serial fraction f = 0, 0.05, 0.1, 0.2; the larger f, the sooner S(n) flattens out]


Gustafson-Barsis’ Law

The parallel fraction of the problem is scalable: it increases with the problem size

  • Observation: Amdahl's law assumes that (1-f) is independent of n, which in most cases it is not
  • New law: assume that (formula spelled out below)

– Parallelism can be used to increase the parallel part of the problem
– Each processor computes both a serial (s) and a parallel (p) part
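
The law itself is left implicit on the slide; with the serial and parallel shares normalized so that s + p = 1, its standard statement is (my reconstruction):

```latex
S(n) = \frac{s + p\,n}{s + p} = s + p\,n = n + (1-n)\,s
```

which grows almost linearly in n whenever the serial share s is small.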


Gustafson-Barsis’ Law - implications

[Figure: scaled speedup S(n) versus n = 1..20 with curves for f = 0 and f = 0.2; both grow nearly linearly with n]


The nature of Parallel Programs

  • Embarrassingly parallel

– speedup(p) = p
– Matrix addition, compilation of independent subroutines

  • Divide and Conquer

– speedup(p) ~ p/log2 p
– Binary tree: adding p numbers, merge sort

  • Communication-bound parallelism

– cost = latency + n/bandwidth - overlap (sketched in code below)
– May even be slower on more processors
– Matrix computations where whole structures must be communicated; a parallel program on a LAN
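
A minimal Python sketch of that cost model (all parameter values invented for illustration):

```python
def comm_cost(n_bytes, latency_s, bandwidth_bps, overlap_s=0.0):
    """Message transfer time: cost = latency + n/bandwidth - overlap."""
    return latency_s + n_bytes / bandwidth_bps - overlap_s

# Invented example: a 1 MB message, 50 us latency, 100 MB/s bandwidth.
t = comm_cost(1e6, 50e-6, 100e6)
print(f"{t * 1e3:.2f} ms")  # ~10.05 ms, dominated by the bandwidth term
```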


Some measures of performance

  • Number of floating point operations per second

– flop/s, Mflop/s, Gflop/s, Tflop/s

  • Speedup curves: x-axis #processors, y-axis speedup

– Speedup s = t1/t(p). In theory limited by the number of processors.

  • Scaled Speedup

– Increases the problem size linearly with the #processors

  • Other measures

– Time, problem size, processor usage
– Efficiency
– Scalability

  • Several users on the system

– Throughput (jobs/sec)


Benchmark Performance

  • Used to measure a system's capacity in different respects
  • For example floating-point speed, I/O speed, speedup for some core routines, ...
  • Benchmark suite: a collection of benchmarks, each with a special character
  • Synthetic benchmark: a small benchmark that imitates a real application with respect to data structures and number of statements


Classical Benchmarks

  • Whetstone (synthetic, numerical)
  • Dhrystone (synthetic, integer)
  • Linpack (Solves a 100x100 system, Mflop/s)
  • Gemm-based
  • Livermore Loops (a number of loops)
  • Perfect Club

www.netlib.org and www.netlib.org/benchmark/linpackjava


Parallel Benchmarks

  • Linpack (LU): performance numbers from solving systems of linear equations
  • NAS Kernels (7 FORTRAN routines, fluid dynamics)
  • Livermore Loops (FORTRAN code)
  • SLALOM (scientific computing; how much can be computed in one minute)


Benchmark using linear equation systems

  • The result reflects

– Performance when solving dense equation systems
– Arithmetic in full precision

  • Four different values

– Linpack benchmark for matrices of size 100
– TPP: solves systems of the order of 1000 (no restrictions on method or implementation)
– Theoretical peak performance
– HPC: highly parallel computing


LINPACK 100

  • Matrices of size 100
  • No changes to the FORTRAN code are allowed
  • High percentage of floating-point operations

– routines: (SD)GEFA & (SD)GESL
– LU with partial pivoting and backward substitution

  • Column-oriented algorithms

LINPACK 1000 (Towards Peak Performance)

  • Matrices of size 1000
  • Allowed to change and substitute algorithm and software

– Must use the same driver program and obtain the same result and accuracy
– 2n³/3 + O(2n²) operations (used in the sketch below)

  • Gives an upper limit on performance

– “the manufacturer guarantees that programs will not exceed this speed”
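
The reported rate follows directly from that fixed operation count; a small Python sketch (the 1.4 s timing is invented):

```python
def linpack_mflops(n, seconds):
    """Mflop/s from the LINPACK operation count 2n^3/3 + 2n^2."""
    flops = 2 * n**3 / 3 + 2 * n**2
    return flops / seconds / 1e6

# Hypothetical run: an n = 1000 system solved in 1.4 s.
print(f"{linpack_mflops(1000, 1.4):.0f} Mflop/s")  # ~478 Mflop/s
```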


Theoretical peak performance

  • Number of flops that can be computed during a specified time
  • Based on the cycle time of the machine (recomputed in the sketch below)
  • Example: Cray Y-MP/8, cycle time 6 ns

(2 operations / 1 cycle) * (1 cycle / 6 ns) = 333 Mflop/s

  • Example: POWER3-200, cycle time 5 ns

(2x2 operations / 1 cycle) * (1 cycle / 5 ns) = 800 Mflop/s
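
The same arithmetic as a tiny Python sketch:

```python
def peak_mflops(ops_per_cycle, cycle_time_ns):
    """Theoretical peak in Mflop/s: ops per cycle divided by cycle time (ops/ns -> Mflop/s)."""
    return ops_per_cycle / cycle_time_ns * 1e3

print(f"{peak_mflops(2, 6):.0f} Mflop/s")      # Cray Y-MP/8 CPU: ~333
print(f"{peak_mflops(2 * 2, 5):.0f} Mflop/s")  # POWER3-200: 800
```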


Highly Parallel Computing

  • The result reflects the problem size
  • Rules

– Solve systems of linear equations
– Allow the problem size to vary
– Use 2n³/3 + O(2n²) operations (independent of method)

  • Result (illustrated with made-up numbers below)

– Rmax: maximal measured performance in Gflop/s
– Nmax: the problem size at which Rmax is achieved
– N1/2: the size at which half of Rmax is achieved
– Rpeak: theoretical peak performance

www.netlib.org/benchmark/hpl/
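
How the four values relate, as a sketch over made-up measurements (none of these numbers come from a real run):

```python
# Invented (N, Gflop/s) pairs from an HPL-style sweep over problem sizes.
measured = [(1000, 1.2), (2000, 2.0), (4000, 2.6), (8000, 2.9), (16000, 3.0)]

r_max = max(rate for _, rate in measured)                     # Rmax
n_max = max(n for n, rate in measured if rate == r_max)       # Nmax
n_half = min(n for n, rate in measured if rate >= r_max / 2)  # N1/2

print(r_max, n_max, n_half)  # 3.0 16000 2000
```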


Top 500 Supercomputer sites

  • A list of the 500 most powerful computer systems
  • The computers are ranked by their LINPACK benchmark performance
  • The list shows Rmax, Nmax, N1/2, Rpeak, #processors

http://www.top500.org


NAS parallel benchmark

  • Imitates computations and data movements in computational fluid dynamics (CFD)

– Five parallel kernels & three simulated application benchmarks
– The problems are algorithmically specified

  • Three classes of problem (the main difference is the problem size)

– Sample code, Class A and Class B (in increasing size)

  • Result

– Time in seconds
– Compared to a Cray Y-MP/1


NAS, The kernel benchmarks

  • Embarrassingly parallel

– Performance without communication

  • Multigrid

– Structured “long-distance” communication

  • Conjugate gradient

– Unstructured “long-distance” communication, unstructured matrix-vector operations

  • 3D FFT

– “Long-distance” communication

  • Integer sort

– Integer computations and communication


NAS, Simulated kernels

  • Pseudo-applications without the “difficulties” present in real CFD

– LU solver
– Pentadiagonal solver
– Block-triangular solver


Twelve Ways to Fool the Masses

  • Quote 32-bit performance results and compare them with others' 64-bit results
  • Present inner kernel performance figures as the performance of the entire application
  • Quietly employ assembly code and compare your results with others' C or Fortran implementations
  • Scale up the problem size with the number of processors but fail to disclose this fact
  • Quote performance results linearly projected to a full system
  • Compare with an old code on an obsolete system
  • Compare your results against scalar, un-optimized code on a Cray


Twelve Ways to Fool the Masses (2)

  • Base Mflop operation counts on the parallel implementation, not on the best sequential algorithm
  • Quote performance in terms of processor utilization, parallel speedup, or Mflop/s per dollar (peak, not sustained)
  • Measure parallel run time on a dedicated system but measure conventional run times in a busy environment
  • If all of this fails, show pretty pictures and animated videos, and don't talk about performance

D. H. Bailey, “Twelve Ways to Fool the Masses When Giving Performance Results on Parallel Computers,” in Proceedings of Supercomputing '91, Nov. 1991, pp. 4-7. (available at, among other places, www.pdc.kth.se/training/twelve-ways.html)


Example of a performance graph

[Figure: factorization on IBM POWER3; Mflop/s versus N = 100..1000 for BC without data transformation, BC including time for data transformation, LAPACK DPOTRF, and LAPACK DPPTRF]


Example of a performance graph

[Figure: factorization on IBM POWER2; Mflop/s versus N = 100..1000 for BPC without data transformation, BPC including time for data transformation, LAPACK DPOTRF, and LAPACK DPPTRF]


Exercises

  • How much speedup can you get according to Amdahl's law as n becomes large? Apply the conclusion to a program with a 5% serial part! (checked numerically below)
  • In a test run on one CPU, a program took 30 s for problem size n. On four CPUs it took 20 s. How long does the same program need on 10 CPUs according to Amdahl's law? What is the speedup? How long, according to Gustafson-Barsis' law, for size 10n on 10 CPUs? What is the speedup?
  • A parallel computer consists of 16 CPUs, each with a peak performance of 800 MIPS. What is the performance (in MIPS) if, of the instructions to be executed, 10% are serial code, 20% are parallel over 2 CPUs, and 70% are fully parallelizable?
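
A quick numerical check of the first exercise (the limit only, not a full solution):

```python
def amdahl_speedup(f, n):
    """Amdahl's law: S(n) = 1 / (f + (1 - f) / n)."""
    return 1.0 / (f + (1.0 - f) / n)

# A 5% serial part caps the speedup at 1/f = 20, however many CPUs are used.
for n in (10, 100, 10_000):
    print(n, round(amdahl_speedup(0.05, n), 2))  # 6.9, 16.81, 19.96
```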


Exam 2004-01-16, problem 1.

  • What is parallel computing?
  • In the design of parallel programs the following concepts are important: data partitioning, granularity, and load balance. Explain these concepts!
  • Describe Flynn's taxonomy. Give examples of machines belonging to each class!