SLIDE 1

Outline

1.) N-Body Methods
2.) Dynamic Programming
3.) Sparse Linear Algebra
4.) Unstructured Grids
5.) Conclusion

SLIDE 2

N-Body Methods Overview

SLIDE 3

N-Body Methods

  • Particle simulation
    – Large number of simple entities
    – Each entity depends on every other entity
    – Regularly structured FP computations
  • Xeon Phi offers acceptable performance
    – Regularity leads to good auto-vectorization
    – Cache misses result in very high memory latency
      • L1 misses mostly lead to L2 misses
SLIDE 4

N-Body platform comparison

Source: [3]

SLIDE 5

Dynamic Programming Overview

SLIDE 6

Dynamic Programming

  • Problem is divided into smaller subproblems
    – Each smaller (and simpler) problem is solved first
    – Results are combined to solve larger problems
    – Often used for sequence alignment algorithms
  • Example: DNA sequence alignment
    – Xeon Phi 5110P vs. two GPU configurations (GeForce GTX 480, K20c)
      • Outperforms the GTX 480
      • Reaches 74.3% of the K20c's performance
    – Computation depends heavily on peak integer performance
      • Xeon Phi has ~57.3% of the K20c's peak integer performance
SLIDE 7

Sparse Linear Algebra Overview

SLIDE 8

Sparse Linear Algebra

  • Matrices with scattered non-zero entries
  • Linear system solvers, eigensolvers
  • Different matrix structures
    – Patterns can arise
    – Irregularities
      • Branch mispredictions
      • Vector registers contain zero entries
SLIDE 9

Case Study: Sparse Matrix-Vector Operations

  • Large sparse matrix A
  • Dense vector x
  • Multiply-add operations
  • Simple instructions, large quantity
  • Parallel execution
  • Access patterns

Image Source: [2]

SLIDE 10

Bottleneck Considerations

  • Bandwidth limits
    – Memory bandwidth (SIMD efficiency, vector register utilization)
    – Computation bandwidth (core utilization)
  • Memory latency

SLIDE 11

Algorithm Showcase

  • CSR / SELLPACK format
    – Column blocks
    – Finite-window sorting
  • Adaptive load balancing

Image Source: [1]

SLIDE 12

Evaluation: Effects of individual improvements

Source: [1]

SLIDE 13

Evaluation: Platform comparison

Source: [1]

SLIDE 14

Evaluation: Platform comparison

Matrices from the UFL Sparse Matrix Collection (/20); matrix 8 has many more non-zero entries than the rest.

[GFlop/s] Based on data from [2]

SLIDE 15

Conclusion: Sparse Linear Algebra on Xeon Phi

  • High potential in sparse linear algebra
    – Wide vector registers
    – High number of cores
  • Main problem points:
    – Irregular access patterns
    – Sparse data structures
    – High memory latency on L2 cache misses
  • Performance extremely dependent on the data
SLIDE 16

Unstructured Grids

Image Source: [4]

SLIDE 17

Image Source: [5]

SLIDE 18

Image Source: [6]

SLIDE 19

Unstructured Grids

  • Partitioning the grid (few connections between partitions)
  • Irregular access patterns
    => bad for auto-vectorization
  • Software prefetching of grid cells
  • Ideal prefetch amount:
    MM latency * MM bandwidth

SLIDE 20

Unstructured Grids

  • Data races cannot be determined statically
  • "Loop over edges and access data on edges" => data races

SLIDE 21

Airfoil Benchmark

  • 2D inviscid airfoil code
  • Considered bandwidth-bound
  • 2,800,000-cell mesh

Image Source: [7]

SLIDE 22

Competitors

[Chart: 2 × Xeon E5-2680 vs. Xeon Phi 5110P vs. Tesla K40, compared on price ($), number of cores, memory bandwidth, double-precision GFlops, and last-level cache]

SLIDE 23

Airfoil on Xeon Phi

  • At the highest level: Message Passing Interface (MPI)
  • At the lower level: OpenMP + vector intrinsics

SLIDE 24

Xeon Phi Benchmark graph

Image Source: [8]

SLIDE 25

Benchmark graph

Image Source: [8]

SLIDE 26

Unstructured Grids conclusion

  • Benchmark probably not fair (price difference)
  • Do manual vectorization: speedup of 1.7-1.82
  • Use MPI + OpenMP

SLIDE 27

Who did what

Philipp Bartels:

  • Architecture introduction: slides 18 and 1 to 8
  • Presentation team x2: slides 1 and 16 to 27

Eugen Seljutin:

  • Presentation team x2: slides 2 to 15

SLIDE 28

Credits

[1] Xing Liu et al.: Efficient Sparse Matrix-Vector Multiplication on x86-Based Many-Core Processors

[2] Erik Saule et al.: Performance Evaluation of Sparse Matrix Multiplication Kernels on Intel Xeon Phi

[3] Konstantinos Krommydas et al.: On the Characterization of OpenCL Dwarfs on Fixed and Reconfigurable Platforms

[4] http://en.wikipedia.org/wiki/Unstructured_grid

[5] http://view.eecs.berkeley.edu/wiki/Unstructured_Grids

[6] https://cfd.gmu.edu/~jcebral/gallery/vis04/index.html

SLIDE 29

Credits

[7] http://en.wikipedia.org/wiki/File:Airfoil_with_flow.png

[8] I. Z. Reguly et al.: Vectorizing Unstructured Mesh Computations for Many-core Architecture