A Preliminary Investigation on Optimizing Charm++ for Homogeneous Multi-core Machines


  1. A Preliminary Investigation on Optimizing Charm++ for Homogeneous Multi-core Machines
     Chao Mei, 05/02/2008, The 6th Charm++ Workshop

  2. Motivation
     - Clusters are built from multicore chips
       - 4 cores/node on BG/P
       - 8 cores/node on Abe (2 Intel quad-core chips)
       - 16 cores/node on Ranger (4 AMD quad-core chips)
       - …
     - Charm++ has had a build for SMP nodes for many years
       - Not tuned
     - So, what are the issues in getting high performance?
     Chao Mei (chaomei2@uiuc.edu), Parallel Programming Lab, UIUC

  3. Start with a kNeighbor benchmark
     - A synthetic kNeighbor benchmark
       - Each element communicates with its neighbors in K-stride (wrap-around), and the neighbors then send back an acknowledgment
       - An iteration: all elements finish the above communication
     - Environment
       - An SMP node with 2 Xeon quad-cores, using only 7 cores
       - Ubuntu 7.04; gcc 4.2
       - Charm++ builds: net-linux-amd64-smp vs. net-linux-amd64
       - 1 element/core, K=3
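
The communication pattern on this slide can be sketched as a small helper. This is only an illustration: the slide does not define "K-stride" precisely, so the sketch assumes each element talks to the K elements on either side of it on a ring. The function name `k_neighbors` and its parameters are hypothetical and not taken from the benchmark source.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: indices that element `self` messages in one iteration.
// ASSUMPTION: "K-stride" means offsets -K..-1 and +1..+K on a ring of n elements.
std::vector<int> k_neighbors(int self, int n, int k) {
    std::vector<int> out;
    for (int d = 1; d <= k; ++d) {
        out.push_back(((self - d) % n + n) % n);  // left neighbor, wrapped around
        out.push_back((self + d) % n);            // right neighbor, wrapped around
    }
    return out;
}
```

Under that assumption, the configuration on the slide (7 elements, K=3) gives every element 6 neighbors, so each iteration involves 6 sends and 6 acknowledgments per element.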

  4. Performance at first glance
     [Plot: iteration time (ms) vs. msg size (bytes); series: Non-SMP, SMP]

  5. Outline
     - Examine the communication model in Charm++ between the Non-SMP and SMP layers
     - Describe current optimizations for SMP step by step
     - Talk about a different approach to utilizing multicore
     - Conclude with future work

  6. Communication model for the multicore

  7. Possible overheads in SMP version
     - Locks
       - Overusing locks to ensure correctness
       - Locks in message queues
       - …
     - False sharing
       - Some per-thread data structures are allocated together in an array: e.g., each element in “CmiState state[numThds]” belongs to a thread
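
To make the false-sharing point concrete, here is a minimal sketch of how a per-thread array can be laid out so that neighboring elements do not share a cache line. `PackedState` and `PaddedState` are hypothetical stand-ins, not the real `CmiState`, and the 64-byte line size is an assumption about the target hardware.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kCacheLine = 64;  // ASSUMPTION: 64-byte cache lines

// Hypothetical stand-in for a per-thread record (not the real CmiState).
// In an array, several of these land on the same cache line.
struct PackedState {
    long msgCount;
};

// Same field, but each array element is aligned to, and fills, a whole line,
// so a write by one thread does not invalidate a neighbor's line.
struct alignas(kCacheLine) PaddedState {
    long msgCount;
};

static_assert(sizeof(PaddedState) == kCacheLine,
              "each element should occupy exactly one cache line");
```

With `PackedState state[numThds]`, up to eight 8-byte counters share one line; with `PaddedState`, each thread's element sits on its own line at the cost of extra memory.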

  8. Reducing the usage of locks
     - Examining the source code revealed overuse of locks
     - Narrowed the sections enclosed by locks
     [Plot: iteration time (ms) vs. msg size (bytes); series: Non-SMP, SMP, SMP-Relaxed lock]
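
Narrowing a locked section can be illustrated with a generic example. This is not code from the Charm++ runtime: `Mailbox`, `send_wide`, and `send_narrow` are hypothetical names, and `std::mutex` stands in for whatever lock primitive the runtime uses.

```cpp
#include <cassert>
#include <deque>
#include <mutex>
#include <string>
#include <utility>

// Hypothetical shared queue guarded by a lock.
struct Mailbox {
    std::mutex lock;
    std::deque<std::string> q;
};

// Wide critical section: the payload is built while the lock is held.
void send_wide(Mailbox& mb, int id) {
    std::lock_guard<std::mutex> g(mb.lock);
    std::string msg = "msg-" + std::to_string(id);
    mb.q.push_back(std::move(msg));
}

// Narrow critical section: private work happens first, and the lock
// covers only the update of the shared queue.
void send_narrow(Mailbox& mb, int id) {
    std::string msg = "msg-" + std::to_string(id);  // no shared state touched
    std::lock_guard<std::mutex> g(mb.lock);
    mb.q.push_back(std::move(msg));
}
```

Both functions leave the queue in the same state; the second simply holds the lock for less time, which reduces how long other threads wait.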

  9. Overhead in message queues
     - A micro-benchmark to show the overhead in message queues
       - N producers, 1 consumer
       - Lock vs. memory fence + atomic operation (fetch-and-increment)
       - 1 queue vs. N queues
     - Setup
       1. Each producer produces 10K items per iteration
       2. One iteration: the consumer consumes all items
     [Plot: avg iter time (us) vs. number of producers (1 to 8); series: multiQ-fence, singleQ-fence+atomic op, singleQ-lock]
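
The "multiQ-fence" variant can be sketched as one single-producer/single-consumer ring per producer, with the consumer sweeping all of them. This is an assumption about the general shape, not the benchmark's actual code: `SpscQueue` and `drain_all` are hypothetical names, and C++11 acquire/release atomics stand in for the explicit memory fences named on the slide.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Lock-free ring for exactly one producer thread (push) and one consumer
// thread (pop). Ordering comes from release/acquire on the two indices.
template <typename T, std::size_t Cap>
class SpscQueue {
    T buf_[Cap];
    std::atomic<std::size_t> head_{0};  // next slot to read, written by consumer
    std::atomic<std::size_t> tail_{0};  // next slot to write, written by producer
public:
    bool push(const T& v) {
        std::size_t t = tail_.load(std::memory_order_relaxed);
        if (t - head_.load(std::memory_order_acquire) == Cap) return false;  // full
        buf_[t % Cap] = v;
        tail_.store(t + 1, std::memory_order_release);  // publish the item
        return true;
    }
    bool pop(T& out) {
        std::size_t h = head_.load(std::memory_order_relaxed);
        if (h == tail_.load(std::memory_order_acquire)) return false;  // empty
        out = buf_[h % Cap];
        head_.store(h + 1, std::memory_order_release);  // hand the slot back
        return true;
    }
};

// Consumer side of the multi-queue layout: one queue per producer, so
// producers never contend with each other on a shared tail index.
template <typename T, std::size_t Cap>
std::size_t drain_all(std::vector<SpscQueue<T, Cap>>& qs, std::vector<T>& sink) {
    std::size_t n = 0;
    T item;
    for (auto& q : qs) {
        while (q.pop(item)) {
            sink.push_back(item);
            ++n;
        }
    }
    return n;
}
```

A single shared queue would instead need either a lock or an atomic fetch-and-increment on the tail so that several producers can claim slots, which is the contention the other two series in the plot measure.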

  10. Applying multi Q + Fence
      - Less than 2% improvement
      - Much less contention compared with the micro-benchmark
      [Plot: iteration time (ms) vs. msg size (bytes); series: Non-SMP, SMP-Relaxed lock, SMP-Relaxed lock-multiQ-Fence]

  11. Big overhead in msg allocation
      - We noticed that:
        - We used our own default memory module
        - Every memory allocation is protected by a lock
      - The default module provides some useful functionality in the Charm++ system (the historical reason for not using other memory modules)
        - Memory footprint information, memory debugger
        - Isomalloc

  12. Switching to OS memory module
      [Plot: iteration time (ms) vs. msg size (bytes); series: Non-SMP, SMP-Relaxed lock-SingleQ-Fence, SMP-Reduced lock overhead]
      Thanks to recent updates, we don’t lose the aforementioned functionality ☺

  13. Identifying false sharing overhead
      - Another micro-benchmark
        - Each element repeatedly sends itself a message, reusing the same message each time (i.e., not allocating a new one)
        - Benchmark timing of 1000 iterations
      - Use the Intel VTune performance analysis tool
        - Focus on the cache misses caused by “Invalidate” in the MESI coherence protocol
      - Declaring variables with the “__thread” specifier makes them thread-private
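
A thread-private variable of the kind this slide describes might look like the sketch below. It uses `thread_local`, the standard C++11 spelling of the GCC `__thread` specifier; the names `sentCount`, `send_n`, and `count_in_new_thread` are hypothetical.

```cpp
#include <cassert>
#include <thread>

// Each thread gets its own copy of this counter, so no two threads ever
// write to the same variable (and therefore not to the same cache line).
thread_local long sentCount = 0;

// Hypothetical worker: bumps the calling thread's copy n times.
long send_n(int n) {
    for (int i = 0; i < n; ++i) ++sentCount;
    return sentCount;
}

// Runs send_n on a fresh thread; that thread starts from its own zero,
// regardless of what the calling thread has counted so far.
long count_in_new_thread(int n) {
    long result = 0;
    std::thread t([&] { result = send_n(n); });
    t.join();
    return result;
}
```

Compared with indexing into a shared `state[numThds]` array, the compiler and runtime place each thread's copy in that thread's own storage area, so no manual padding is needed to keep the copies apart.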

  14. Performance for the micro-benchmark
      - Parameters: 1 element/core, 7 cores
      - Before: 1.236 us per iteration
      - After: 0.913 us per iteration

  15. Adding the gains from removing false sharing
      - Around 1% improvement
      [Plot: iteration time (ms) vs. msg size (bytes); series: Non-SMP, SMP, SMP-Optimized]

  16. Rethinking communication model
      - POSIX shared memory layer
        - No threads; every core still runs a process
        - Inter-core message passing doesn’t go through the NIC, but through memory copy (inter-process communication)
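
The idea of passing a message between processes by memory copy can be sketched as follows. To stay self-contained, this sketch swaps in an anonymous shared mapping inherited across `fork()` rather than a named POSIX shared memory object (`shm_open`), which is what the slide refers to; the helper name `pass_between_processes` and the 4096-byte region size are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical helper: a child process copies `msg` into a shared region and
// the parent copies it back out, so the bytes never pass through a socket.
// Returns an empty string on failure or if msg does not fit.
std::string pass_between_processes(const std::string& msg) {
    const std::size_t cap = 4096;  // ASSUMPTION: fixed-size message region
    if (msg.size() >= cap) return "";
    void* mem = mmap(nullptr, cap, PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) return "";
    char* region = static_cast<char*>(mem);

    pid_t pid = fork();
    if (pid < 0) {
        munmap(mem, cap);
        return "";
    }
    if (pid == 0) {  // sender process: copy the message in, then exit
        std::memcpy(region, msg.c_str(), msg.size() + 1);
        _exit(0);
    }
    int status = 0;
    waitpid(pid, &status, 0);      // the child's write is complete after this
    std::string received(region);  // receiver process: copy the message out
    munmap(mem, cap);
    return received;
}
```

A real layer would keep the processes running and synchronize through flags or queues in the shared region instead of waiting for the sender to exit; the sketch only shows that the payload moves by two memory copies.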

  17. Performance comparison
      [Plot: iteration time (ms) vs. msg size (bytes); series: Non-SMP, SMP-Optimized, Posix Shared Memory]

  18. Future work
      - Other platforms
        - BG/P
      - Optimize the POSIX shared memory version
      - Effects on real applications
        - For NAMD, initial results show that SMP helps up to 24 nodes on Abe
      - Any other communication models
        - Adaptive one?
