Locality-Aware Laplacian Mesh Smoothing (Guillaume Aupy, Jeonghyung Park, Padma Raghavan) - PowerPoint PPT Presentation


SLIDE 1

Locality-Aware Laplacian Mesh Smoothing

Guillaume Aupy, Jeonghyung Park, Padma Raghavan

SLIDE 2

Laplacian Mesh Smoothing

Iterative process used to improve the quality of 2D meshes.

0. Choose an internal, non-visited vertex.
1. Move it to the barycenter of its neighbors.
2. Pick its lowest-quality non-visited vertex neighbor and GOTO 1; if there is none, GOTO 0.

GOAL: Mesh quality (edge-length ratio), measured as

Q = (1 / |triangles|) · Σ_{t ∈ triangles} min edge(t) / max edge(t)
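The smoothing loop above can be sketched in a few lines. This is a hypothetical illustration, not the authors' implementation: the mesh is assumed to be given as vertex coordinates plus an adjacency list, boundary vertices are fixed, and the per-vertex quality function is left as a parameter (the slide measures quality per triangle, so a vertex's quality would aggregate its incident triangles).

```python
import math

def edge_ratio_quality(tri, pos):
    """min-edge / max-edge length ratio of one triangle (the slide's metric)."""
    a, b, c = tri
    edges = [math.dist(pos[a], pos[b]),
             math.dist(pos[b], pos[c]),
             math.dist(pos[c], pos[a])]
    return min(edges) / max(edges)

def laplacian_smooth(pos, neighbors, internal, vertex_quality):
    """One sweep of steps 0-2: visit every internal vertex exactly once."""
    visited = set()
    for start in internal:                       # step 0
        if start in visited:
            continue
        v = start
        while v is not None:
            visited.add(v)
            nbrs = neighbors[v]
            # step 1: move v to the barycenter of its neighbors
            pos[v] = tuple(sum(pos[n][k] for n in nbrs) / len(nbrs)
                           for k in range(2))
            # step 2: continue with the lowest-quality unvisited neighbor,
            # or fall back to step 0 when none remains
            cand = [n for n in nbrs if n in internal and n not in visited]
            v = min(cand, key=vertex_quality) if cand else None
    return pos
```

Note that the chain in step 2 is what makes the memory access pattern data-dependent: each move is dictated by the current qualities, not by the storage order of the vertices.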

SLIDE 3

Data Locality

[Figure: high-level view of one socket of an Intel Westmere-EX processor.]

◮ Data for computation is stored in cache.

◮ If it is not: cache miss (additional cost).

Caches are governed by the Least Recently Used (LRU) replacement algorithm.
→ Measure for data locality: Reuse Distance.

Data Locality

Spatial: reuse within a cache line. Temporal: reuse of a node already in cache.
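The reuse distance of an access is the number of distinct elements touched since the previous access to the same element. A minimal, illustrative computation (quadratic time; real profilers maintain the LRU stack with a tree for near-linear time):

```python
# Minimal reuse-distance computation over an access trace (illustrative only).
def reuse_distances(trace):
    """For each access, the number of distinct elements touched since the
    previous access to the same element; None means a first (cold) access."""
    lru = []                      # LRU stack, most recently used element last
    out = []
    for x in trace:
        if x in lru:
            # depth below the top of the LRU stack = reuse distance
            out.append(len(lru) - 1 - lru.index(x))
            lru.remove(x)
        else:
            out.append(None)      # cold access: infinite reuse distance
        lru.append(x)
    return out
```

Under a fully associative LRU cache holding C elements, an access hits exactly when its reuse distance is below C, which is why reuse distance is the natural locality measure here.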

SLIDE 4

Data Locality in LMS

Hypothesis: Cache misses play an important role in the LMS algorithm.
→ [Strout+Hovland'04]: the data ordering of irregular HPC applications impacts their performance.

Orderings:

SLIDE 5

Data Locality in LMS

Quick check:

[Plots: reuse distance (×10⁵) versus index of access (×10⁵) for three orderings.]

Random ordering: exec. time 10.3 s
Original ordering: exec. time 7.6 s
BFS ordering: exec. time 6.59 s

SLIDE 6

This work

[Figure: reuse-distance profile of the LMS algorithm on a Carabiner mesh (reuse distance versus time step).]

SLIDE 7

This work


Conjecture:

The access pattern of LMS is controlled by the initial quality of each node in the mesh. A reordering based on the initial iteration should therefore work well.
SLIDE 8

Mesh reordering scheme RDR

◮ From a given node already ordered: sort all its unordered neighbors by increasing quality.

◮ Append them to the list of already ordered nodes.

◮ Mark the node processed; iterate from the unprocessed neighbor with the worst quality.
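The three steps above amount to a quality-guided depth-first traversal. A sketch under stated assumptions (the seed node and tie-breaking are not specified on the slide, so they are choices made here, not the authors'):

```python
# Sketch of the RDR reordering scheme; seed choice and ties are assumptions.
def rdr_order(neighbors, quality, seed=0):
    """Order nodes: from each processed node, append its unordered neighbors
    by increasing quality, then continue from the worst-quality one."""
    order, ordered = [seed], {seed}
    todo = [seed]                          # nodes whose neighbors still need work
    while todo:
        v = todo.pop()
        nbrs = sorted((n for n in neighbors[v] if n not in ordered),
                      key=quality)         # increasing quality: worst first
        order.extend(nbrs)
        ordered.update(nbrs)
        todo.extend(reversed(nbrs))        # so pop() yields the worst one first
    return order
```

The design intent, per the conjecture on the previous slide, is that this ordering mimics the order in which the smoothing chain of step 2 will actually visit the nodes.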

SLIDE 9

Evaluation

◮ Meshes are generated by Triangle [Shewchuk'02].
◮ LMS is done with Mesquite [Brewer et al.'03].

Comparisons are made with respect to:

◮ ORI: original ordering given by Mesquite
◮ BFS: breadth-first search ordering [Strout+Hovland'05]

SLIDE 10

Experimental Setup

Runs done on an Intel Westmere-EX: 4 eight-core processors (up to 32 concurrent threads).

Cache    Size   Latency (cycles per access)
L1 (P)   32K    4
L2 (P)   256K   10
L3 (S)   24M    38-170
Mem      ∞      175-290

SLIDE 11

Results

[Bar chart: execution time on one core (seconds) for meshes M1-M9 under the ORI, BFS and RDR orderings.]

SLIDE 12

Results

[Plot: mean speedup versus T_ORI(1) as the number of cores grows, for the ori, bfs and rdr orderings.]

SLIDE 13

SLIDE 14

Cache Performance

Using the PAPI software, we can measure cache performance. Below: cache-performance results on one core when the reorderings were applied. Better orderings will be characterized by better cache performance. Can we find better orderings (or show that we cannot)?

[Bar charts: miss rate (%) for meshes M1-M9 under ORI, BFS and RDR, one chart per cache level.]

SLIDE 15

First-Order approx.

By tracing all data accesses, we can measure the reuse distance of all accesses.

Assuming each node is 66 bytes¹, in a 24MB L3 cache, misses occur for all accesses with a RD greater than 372k (FOA).

¹Coordinates (two floats), connectivity (5/6 longs) and fixed/boundary state (integer).
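The first-order approximation is just capacity arithmetic: a cache holding C elements misses on every access whose reuse distance exceeds C. A quick sanity check of the footprint numbers (the slide's 372k figure reflects the paper's exact accounting, so the naive division below only lands in the same ballpark):

```python
# Back-of-the-envelope FOA threshold: a cache holding `capacity` elements
# misses on every access whose reuse distance exceeds `capacity`.
node_bytes = 66                        # per-node footprint, per the footnote
l3_bytes = 24 * 1024 * 1024            # 24 MB shared L3
capacity = l3_bytes // node_bytes      # elements that fit, to first order
print(capacity)                        # same order of magnitude as the slide's 372k
```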

SLIDE 16

First-Order approx.

Quantiles of the reuse distances per ordering, plus total #accesses per mesh:

mesh        Ordering   50%   75%    90%       100%        #accesses
carabiner   ORI        8     52     1,168     1,924,021   15,566,520
            BFS        1     11     99        1,923,989
            RDR        1     4      6         1,942
crake       ORI        8     43     642       1,767,468   14,226,264
            BFS        1     11     80        1,767,488
            RDR        1     4      6         3,903
dialog      ORI        7     39     306       1,819,234   14,614,336
            BFS        1     10     79        1,803,850
            RDR        1     5      11        6,198

SLIDE 17

SLIDE 18

FOA (II)

We know:

◮ L3 misses are due to external factors
◮ We can compute the application's reuse distances
◮ We have access to PAPI cache-miss counts

We can estimate the "real" number of data elements that fit in a cache: assuming there are n_X misses at level X, the n_X accesses with the largest reuse distances are the ones that missed.
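The estimation above can be sketched directly (an illustrative reading of the slide, not the authors' code): sort the measured reuse distances, discount cold first-touch accesses, and take the smallest reuse distance among the n_X largest as the effective capacity.

```python
# Sketch: if PAPI reports n_misses at some cache level, assume the accesses
# with the LARGEST reuse distances are the ones that missed; the effective
# capacity is then the smallest reuse distance among them.
def effective_capacity(reuse_dists, n_misses):
    finite = sorted(d for d in reuse_dists if d is not None)
    cold = len(reuse_dists) - len(finite)   # first-touch accesses always miss
    n = n_misses - cold
    if n <= 0:
        return None        # every miss is explained by cold accesses alone
    return finite[-n]      # smallest RD among the n largest reuse distances
```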

SLIDE 19

FOA (II)

Estimated max number of elements (×10³) that fit in each cache level:

mesh        Ordering   L1     L2     L3
carabiner   ORI        13.2   21.3   330
            BFS        10.2   21.2   1060
            RDR        1.6    1.88   1.94
crake       ORI        24.6   40.9   198
            BFS        18.3   39.2   986
            RDR        3.4    3.77   3.9
dialog      ORI        59     87.7   108
            BFS        53.2   89.3   157
            RDR        5.84   6.05   6.2

SLIDE 20

SLIDE 21

Reordering cost

[Bar chart: gain in execution time (%) of RDR over BFS and ORI, for 1, 2, 4, 8, 16, 24 and 32 cores.]

← Gain with scalability. The performance gain is

(T_algo(x) − T_RDR(x)) / T_algo(x),

for algo being either ORI or BFS and x being the number of cores. Reordering costs roughly one iteration of the algorithm: it adds one iteration and saves you between 10 and 40%. It is only worth it if you expect several iterations (> 3).

SLIDE 22

Conclusion

Reordering strategies are known to be an efficient way to improve data locality (and hence performance). Simple conjecture: each iteration of LMS follows roughly the same execution order.

◮ We propose a simple reordering strategy (RDR) based on this conjecture.
◮ We give an intuition that it may be hard to find better reordering strategies.