Locality-Aware Laplacian Mesh Smoothing
Guillaume Aupy, Jeonghyung Park, Padma Raghavan
Laplacian Mesh Smoothing
Iterative process used to improve the quality of 2D meshes.
0 Choose an internal non-visited vertex.
1 Move it to the barycenter of its neighbors.
2 Pick its lowest-quality non-visited neighbor, GOTO 1. If the set of non-visited neighbors is empty, GOTO 0.
GOAL: Mesh quality (edge-length ratio), measured as:
$$\text{quality} = \frac{1}{|\text{triangles}|} \sum_{t \in \text{triangles}} \frac{\text{min edge}(t)}{\text{max edge}(t)}$$
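A minimal sketch of the barycenter move (step 1) and of the quality measure, to make both concrete. The function names, the dictionary-based mesh layout and the toy mesh are assumptions made for illustration; this is not Mesquite's API, and the visit order of steps 0 and 2 is left out.

```python
import math

def edge_length_ratio(p, q, r):
    """Quality of one triangle: shortest edge length over longest edge length."""
    edges = [math.dist(p, q), math.dist(q, r), math.dist(r, p)]
    return min(edges) / max(edges)

def mesh_quality(coords, triangles):
    """Mean edge-length ratio over all triangles."""
    ratios = [edge_length_ratio(*(coords[v] for v in tri)) for tri in triangles]
    return sum(ratios) / len(ratios)

def smooth_vertex(coords, neighbors, v):
    """Step 1: move vertex v to the barycenter of its neighbors (in place)."""
    nbrs = neighbors[v]
    coords[v] = (sum(coords[u][0] for u in nbrs) / len(nbrs),
                 sum(coords[u][1] for u in nbrs) / len(nbrs))

# Hypothetical toy mesh: a square (vertices 0-3) with one internal vertex 4.
coords = {0: (0.0, 0.0), 1: (2.0, 0.0), 2: (2.0, 2.0), 3: (0.0, 2.0), 4: (0.5, 0.5)}
triangles = [(0, 1, 4), (1, 2, 4), (2, 3, 4), (3, 0, 4)]
neighbors = {4: [0, 1, 2, 3]}

before = mesh_quality(coords, triangles)
smooth_vertex(coords, neighbors, 4)
after = mesh_quality(coords, triangles)
```

On this toy mesh the internal vertex moves to the center of the square and the mean edge-length ratio increases.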
Data Locality
[Figure: high-level view of a socket of an Intel Westmere-EX processor]
◮ Data for computation is stored in cache.
◮ If it is not: cache miss (additional cost).
Caches are governed by the Least Recently Used (LRU) algorithm.
→ Measure for data: Reuse Distance
Spatial: reuse within a cache line.
Temporal: reuse of a node already in cache.
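A small sketch of how reuse distance can be computed from an access trace. It assumes the usual definition (the number of distinct elements accessed since the previous access to the same element) and uses a plain list as the LRU stack, which is quadratic in the worst case but short. The function name and the trace are hypothetical.

```python
def reuse_distances(trace):
    """For each access, return the number of distinct elements touched since
    the previous access to the same element (None for a first access)."""
    stack = []  # LRU stack: the most recently used element is last
    distances = []
    for x in trace:
        if x in stack:
            pos = stack.index(x)
            # Elements above x in the stack were accessed since x's last use.
            distances.append(len(stack) - 1 - pos)
            stack.pop(pos)
        else:
            distances.append(None)
        stack.append(x)
    return distances

# Hypothetical trace of node identifiers.
trace = ["a", "b", "c", "a", "b", "b"]
result = reuse_distances(trace)
```

Under this definition, a fully associative LRU cache holding C elements hits exactly when the reuse distance is smaller than C.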
Data Locality in LMS
Hypothesis: Cache misses play an important role in the performance of the LMS algorithm.
→ [Strout+Hovland 04] Data ordering in irregular HPC applications impacts performance.
Quick check (plots of reuse distance against index of access, one per ordering):
◮ Random ordering: exec. time 10.3s
◮ Original ordering: exec. time 7.6s
◮ BFS ordering: exec. time 6.59s
This work
[Figure: reuse distance profile of the LMS algorithm on a Carabiner mesh (reuse distance against time steps).]
Conjecture:
The access pattern of LMS can be controlled by the initial quality of each node in the mesh. A re-ordering based on the initial iteration should work well.
Mesh reordering scheme RDR
◮ From a given node already ordered: sort all its unordered neighbors by increasing quality.
◮ Append them to the list of already ordered nodes.
◮ Mark the node as processed. Iterate from the unprocessed neighbor with the worst quality.
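The three steps above can be sketched as follows. This is one possible reading, not the authors' implementation: per-node quality is assumed to be given as a dictionary, the mesh is assumed connected, and the fallback used when a node has no unprocessed neighbor (restart from the earliest ordered node that is still unprocessed) is an assumption, since the slide does not specify it. All names are hypothetical.

```python
def rdr_order(neighbors, quality, start):
    """Order nodes by following low-quality neighbors, starting from `start`."""
    order = [start]      # nodes in their new order
    placed = {start}     # nodes already appended to `order`
    processed = set()    # nodes whose neighbors have all been appended
    current = start
    while current is not None:
        # Sort the unordered neighbors by increasing quality and append them.
        fresh = sorted((u for u in neighbors[current] if u not in placed),
                       key=quality.__getitem__)
        order.extend(fresh)
        placed.update(fresh)
        processed.add(current)
        # Continue from the unprocessed neighbor with the worst quality.
        candidates = [u for u in neighbors[current] if u not in processed]
        if candidates:
            current = min(candidates, key=quality.__getitem__)
        else:
            # Assumed fallback: earliest ordered node not yet processed.
            current = next((u for u in order if u not in processed), None)
    return order

# Hypothetical 4-node mesh graph with per-node quality.
neighbors = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1, 3], 3: [1, 2]}
quality = {0: 0.9, 1: 0.7, 2: 0.4, 3: 0.2}
order = rdr_order(neighbors, quality, start=0)
```

Here node 2 is placed before node 1 because its quality is lower, and the traversal then continues from node 2.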
Evaluation
◮ Meshes are generated by Triangle [Shewchuk’02].
◮ LMS is done with Mesquite [Brewer et al.’03].
Comparisons are made with respect to:
◮ ORI: original ordering given by Mesquite
◮ BFS: breadth-first search ordering [Strout+Hovland’05]
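For reference, a generic sketch of a BFS ordering and of the renumbering it induces. This is the textbook traversal, assumed here for illustration; it is not necessarily the exact variant of [Strout+Hovland’05]. Function names and the toy graph are hypothetical.

```python
from collections import deque

def bfs_order(neighbors, start):
    """Return the nodes in breadth-first visit order from `start`."""
    order, seen, queue = [], {start}, deque([start])
    while queue:
        v = queue.popleft()
        order.append(v)
        for u in neighbors[v]:
            if u not in seen:
                seen.add(u)
                queue.append(u)
    return order

def renumber(neighbors, order):
    """Relabel nodes so that the i-th node of `order` gets identifier i."""
    new_id = {old: new for new, old in enumerate(order)}
    return {new_id[v]: sorted(new_id[u] for u in neighbors[v]) for v in order}

# Hypothetical path graph 0 - 3 - 1 - 2 with scattered identifiers.
neighbors = {0: [3], 3: [0, 1], 1: [3, 2], 2: [1]}
order = bfs_order(neighbors, 0)
relabeled = renumber(neighbors, order)
```

After renumbering, neighboring nodes have consecutive identifiers, so they sit next to each other in memory when nodes are stored by identifier.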
Experimental Setup
Runs done on an Intel Westmere-EX: 4 eight-core processors (up to 32 concurrent threads).
Cache    Size   Latency (cycles per access)
L1 (P)   32K    4
L2 (P)   256K   10
L3 (S)   24M    38-170
Mem      ∞      175-290
Results
[Figure: execution time on one core (seconds) for meshes M1 to M9, with the ORI, BFS and RDR orderings.]
[Figure: mean speedup versus T_ORI(1) against number of cores, for the ori, bfs and rdr orderings.]
Cache Performance
Using the PAPI software, we can measure cache performance.
Better orderings will be characterized by better cache performance. Can we find better orderings (or show that we cannot)?
[Figure, three panels: cache performance results on one core when reorderings were applied. Miss rate (%) for meshes M1 to M9, with the ORI, BFS and RDR orderings.]
First-Order approx.
By tracing all data accesses, we can measure the reuse distance of all accesses.
Assuming each node is 66 bytes¹, in a 24MB L3 cache, misses occur for all accesses with a RD greater than 372k (FOA).
¹ coordinates (two floats), connectivity (5/6 long) and fixed/boundary state (integer).
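A short sketch of the first-order approximation: divide the cache size by the node size to get a capacity in nodes, then count the accesses whose reuse distance exceeds it. The function names and the sample distances are hypothetical. The slide quotes 372k; the exact figure depends on how 24MB is converted to bytes (24 × 2^20 bytes gives about 381k nodes, 24 × 10^6 bytes about 364k), so the threshold is left as a parameter here.

```python
def capacity_in_nodes(cache_bytes, node_bytes):
    """Number of whole nodes that fit in the cache."""
    return cache_bytes // node_bytes

def foa_misses(distances, capacity):
    """Count accesses predicted to miss: reuse distance greater than capacity.
    First accesses (None) are cold misses and are not counted here."""
    return sum(1 for d in distances if d is not None and d > capacity)

binary_capacity = capacity_in_nodes(24 * 2**20, 66)   # 24 MB read as 24 * 2^20 bytes
decimal_capacity = capacity_in_nodes(24 * 10**6, 66)  # 24 MB read as 24 * 10^6 bytes

# Hypothetical reuse distances, checked against the slide's 372k threshold.
distances = [None, 1, 4, 500_000, 6, 1_900_000]
predicted = foa_misses(distances, 372_000)
```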
Reuse-distance quantiles and number of accesses, per mesh and ordering:
mesh       Ordering  50%  75%  90%    100%       #accesses
carabiner  ORI       8    52   1,168  1,924,021  15,566,520
           BFS       1    11   99     1,923,989
           RDR       1    4    6      1,942
crake      ORI       8    43   642    1,767,468  14,226,264
           BFS       1    11   80     1,767,488
           RDR       1    4    6      3,903
dialog     ORI       7    39   306    1,819,234  14,614,336
           BFS       1    10   79     1,803,850
           RDR       1    5    11     6,198
FOA (II)
We know:
◮ L3 misses are due to external factors
◮ We can compute the application reuse distance
◮ We have access to PAPI cache misses
We can estimate the “real” number of data elements that fit in a cache: assuming that there are n_X misses in cache LX, the n_X accesses with the largest reuse distance are the ones that missed.
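The estimate can be sketched as follows: sort the reuse distances in decreasing order, set aside the n_X largest as the misses, and take the largest remaining distance as the estimated capacity. Reading the estimate as "the largest reuse distance that still hit" is an assumption, as are the function name and the sample data; the input is assumed to contain only finite reuse distances (no first accesses).

```python
def estimate_capacity(distances, n_misses):
    """Largest reuse distance among the accesses assumed to have hit,
    given that the n_misses largest distances are the ones that missed."""
    ranked = sorted(distances, reverse=True)
    if n_misses >= len(ranked):
        return 0  # every access missed: nothing can be inferred to fit
    return ranked[n_misses]

# Hypothetical reuse distances and measured miss counts.
distances = [1, 4, 6, 1942, 3, 2, 50, 7]
with_two_misses = estimate_capacity(distances, 2)
with_no_miss = estimate_capacity(distances, 0)
```

With no measured miss, the estimate is simply the largest reuse distance in the trace.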
Estimated max number of elements (x10³):
mesh       Ordering  L1    L2    L3
carabiner  ORI       13.2  21.3  330
           BFS       10.2  21.2  1060
           RDR       1.6   1.88  1.94
crake      ORI       24.6  40.9  198
           BFS       18.3  39.2  986
           RDR       3.4   3.77  3.9
dialog     ORI       59    87.7  108
           BFS       53.2  89.3  157
           RDR       5.84  6.05  6.2
Reordering cost
[Figure: gain in execution time (%) against number of cores (1, 2, 4, 8, 16, 24, 32), with one bar labelled bfs and one labelled ori per core count.]