SLIDE 1

A Preliminary Investigation on Optimizing Charm++ for Homogeneous Multi-core Machines

Chao Mei (chaomei2@uiuc.edu), Parallel Programming Lab, UIUC
05/02/2008, The 6th Charm++ Workshop

SLIDE 2

Motivation

Clusters are built from multicore chips

• 4 cores/node on BG/P
• 8 cores/node on Abe (2 Intel quad-core chips)
• 16 cores/node on Ranger (4 AMD quad-core chips)
• …

Charm++ has had a build version for SMP nodes for many years

Not tuned

So, what are the issues for getting high performance?

SLIDE 3

Start with a kNeighbor benchmark

A synthetic kNeighbor benchmark

• Each element communicates with its neighbors at stride K (with wrap-around), and the neighbors send back an acknowledgment (sketched in Charm++ below)
• An iteration: all elements finish the above communication

Environment

• An SMP node with 2 Xeon quad-cores; only 7 cores used
• Ubuntu 7.04; gcc 4.2
• Charm++: net-linux-amd64-smp vs. net-linux-amd64
• 1 element/core, K=3
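To make the setup concrete, here is a minimal Charm++ sketch of one benchmark element as a 1-D chare array. All names (KNeighbor, doStep, recvData, recvAck) and the readonly globals (numElements, K, msgSize) are illustrative assumptions, not the benchmark's actual source.

```cpp
// kneighbor.ci (interface sketch):
//   mainmodule kneighbor {
//     readonly int numElements;  readonly int K;  readonly int msgSize;
//     array [1D] KNeighbor {
//       entry KNeighbor();
//       entry void doStep();
//       entry void recvData(int sender, int len, char data[len]);
//       entry void recvAck();
//     };
//   };

#include "kneighbor.decl.h"

extern int numElements, K, msgSize;   // readonly variables set at startup

class KNeighbor : public CBase_KNeighbor {
  int acksLeft;                       // acks still outstanding this iteration
  char* payload;                      // reusable message buffer
public:
  KNeighbor() : acksLeft(0), payload(new char[msgSize]) {}
  KNeighbor(CkMigrateMessage*) {}

  void doStep() {                     // one iteration: message K neighbors
    acksLeft = K;
    for (int i = 1; i <= K; i++)
      thisProxy[(thisIndex + i) % numElements]
          .recvData(thisIndex, msgSize, payload);
  }
  void recvData(int sender, int len, char* data) {
    thisProxy[sender].recvAck();      // neighbor replies with an ack
  }
  void recvAck() {
    if (--acksLeft == 0) { /* all acks in; e.g., contribute() to a
                              reduction that triggers the next doStep() */ }
  }
};

#include "kneighbor.def.h"
```

An iteration completes once every element has collected its K acknowledgments; a full program would synchronize elements (e.g., with a reduction) before starting the next doStep().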

SLIDE 4

Performance at first glance

[Chart: iteration time (ms) vs. msg size (bytes), comparing Non-SMP and SMP]

SLIDE 5

Outline

• Examine the communication model in Charm++ for the non-SMP and SMP layers
• Describe the current optimizations for SMP step by step
• Discuss a different approach to utilizing multicore
• Conclude with future work

SLIDE 6

Communication model for the multicore

SLIDE 7

Possible overheads in SMP version

Locks

• Overusing locks to ensure correctness
• Locks in message queues
• …

False sharing

• Some per-thread data structures are allocated together in array form: e.g., each element of “CmiState state[numThds]” belongs to a different thread

SLIDE 8

Reducing the usage of locks

• By examining the source code, we found overuse of locks
• Narrowed the sections enclosed by locks (sketched below)
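A minimal sketch of the kind of lock-scope narrowing meant here; the queue and the prepareMessage helper are illustrative stand-ins, not actual Charm++ code.

```cpp
#include <mutex>
#include <queue>

void* prepareMessage(void* env);   // hypothetical thread-local work

std::mutex qLock;
std::queue<void*> msgQ;            // shared between worker threads

void enqueueBefore(void* env) {
  std::lock_guard<std::mutex> g(qLock);
  void* msg = prepareMessage(env); // thread-local work, needlessly locked
  msgQ.push(msg);
}

void enqueueAfter(void* env) {
  void* msg = prepareMessage(env); // moved outside the critical section
  std::lock_guard<std::mutex> g(qLock);
  msgQ.push(msg);                  // lock held only for the queue update
}
```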

[Chart: iteration time (ms) vs. msg size (bytes): Non-SMP, SMP, SMP-Relaxed lock]

SLIDE 9

Overhead in message queues

A micro-benchmark to show the overhead in message queues

• N producers, 1 consumer
• Lock vs. memory fence + atomic operation (fetch-and-increment); see the sketch below
• 1 queue vs. N queues

[Chart: average iteration time (us) vs. number of producers (1–8): multiQ-fence, singleQ-fence+atomic op, singleQ-lock]

1. Each producer produces 10K items per iteration
2. One iteration: the consumer consumes all items
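A minimal sketch of the fence + fetch-and-increment enqueue idea, assuming a fixed-capacity slot array; this is illustrative, not the actual Charm++ queue (overflow handling is omitted).

```cpp
#include <atomic>
#include <cstddef>

constexpr std::size_t QSIZE = 1 << 16;  // assumed fixed capacity

struct FetchAddQueue {
  std::atomic<void*> slots[QSIZE];
  std::atomic<std::size_t> tail{0};     // next free slot index

  FetchAddQueue() {
    for (auto& s : slots) s.store(nullptr, std::memory_order_relaxed);
  }

  // Producer: claim a slot with fetch-and-increment instead of a lock,
  // then publish the message with release ordering (the "fence").
  void push(void* msg) {
    std::size_t idx = tail.fetch_add(1, std::memory_order_relaxed) % QSIZE;
    slots[idx].store(msg, std::memory_order_release);
  }

  // Single consumer: acquire pairs with the producer's release, so the
  // message contents are visible once the slot is seen non-null.
  void* tryPop(std::size_t idx) {
    return slots[idx].exchange(nullptr, std::memory_order_acquire);
  }
};
```

The multi-queue variant gives each producer its own queue, removing contention on the shared tail counter; that is the multiQ + fence configuration applied on the next slide.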

SLIDE 10

Applying multi Q + Fence

• Less than 2% improvement
• Much less contention here than in the micro-benchmark

[Chart: iteration time (ms) vs. msg size (bytes): Non-SMP, SMP-Relaxed lock, SMP-Relaxed lock-multiQ-Fence]

SLIDE 11

Big overhead in msg allocation

We noticed that we were using our own default memory module:

• Every memory allocation is protected by a lock (sketched below)
• It provides some useful functionality in the Charm++ system (a historical reason for not using other memory modules):
  • memory footprint information
  • memory debugger
  • Isomalloc
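A minimal sketch of why such a module serializes allocation, assuming one global lock around malloc; this is illustrative, not Charm++'s actual memory module.

```cpp
#include <cstdlib>
#include <mutex>

static std::mutex memLock;   // one global lock for the whole module

// Every allocation from every worker thread takes the same lock, so
// allocation-heavy message code contends here.
void* lockedAlloc(std::size_t n) {
  std::lock_guard<std::mutex> g(memLock);
  // footprint accounting / debug bookkeeping would happen while locked
  return std::malloc(n);
}

void lockedFree(void* p) {
  std::lock_guard<std::mutex> g(memLock);
  std::free(p);
}
```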

SLIDE 12

Switching to OS memory module

Thanks to recent updates, we don’t lose the aforementioned functionality ☺

[Chart: iteration time (ms) vs. msg size (bytes): Non-SMP, SMP-Relaxed lock-SingleQ-Fence, SMP-Reduced lock overhead]

SLIDE 13

Identifying false sharing overhead

Another micro-benchmark

• Each element repeatedly sends itself a message, but each time the message is reused (i.e., no new message is allocated)
• Benchmark: timing of 1000 iterations

Use the Intel VTune performance analysis tool

• Focusing on the cache misses caused by “Invalidate” in the MESI coherence protocol

Declaring variables with the “__thread” specifier makes them thread-private (sketched below)
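A sketch of the fix, using an illustrative stand-in for the per-thread state (the struct and field names are assumptions; __thread is the GCC extension available in the gcc 4.2 used here).

```cpp
// Stand-in for per-thread runtime state; the real CmiState has more fields.
struct CmiStateSketch {
  void* localQueue;
  int   myRank;
};

// Before: states of different threads packed into one array can share a
// cache line, so one thread's write invalidates that line on every other
// core (the MESI "Invalidate" misses VTune reports).
CmiStateSketch state[8 /* numThds */];

// After: __thread gives each thread its own copy in thread-local storage,
// so writes never invalidate another thread's cached state.
__thread CmiStateSketch myState;
```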

SLIDE 14

Performance for the micro-benchmark

• Parameters: 1 element/core, 7 cores
• Before: 1.236 us per iteration
• After: 0.913 us per iteration

SLIDE 15

Adding the gains from removing false sharing

• Around 1% improvement

[Chart: iteration time (ms) vs. msg size (bytes): Non-SMP, SMP, SMP-Optimized]

SLIDE 16

Rethinking communication model

POSIX shared-memory layer

• No threads; every core still runs a process
• Inter-core message passing doesn’t go through the NIC, but through memory copy (inter-process communication); a sketch follows below
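A minimal sketch of the idea: two processes on the same node map one POSIX shared-memory region and pass a message by memcpy instead of going through the NIC. The segment name and synchronization details are illustrative assumptions, not the actual layer.

```cpp
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

// Map a named shared segment; one process creates it, the others attach.
// ("/charm_shm_seg" is a hypothetical name; link with -lrt on older glibc.)
void* mapSegment(size_t bytes, int create) {
  int fd = shm_open("/charm_shm_seg",
                    create ? (O_CREAT | O_RDWR) : O_RDWR, 0600);
  if (fd < 0) return NULL;
  if (create) ftruncate(fd, bytes);      // size the segment once
  void* p = mmap(NULL, bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
  close(fd);                             // the mapping stays valid
  return p == MAP_FAILED ? NULL : p;
}

// Sender side: copy the message into the shared region; the receiver on
// another core reads it after synchronizing (flag/fence omitted here).
void sendViaShm(void* seg, const void* msg, size_t len) {
  memcpy(seg, msg, len);
}
```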

SLIDE 17

Performance comparison

[Chart: iteration time (ms) vs. msg size (bytes): Non-SMP, SMP-Optimized, Posix Shared Memory]

SLIDE 18

Future work

• Other platforms: BG/P
• Optimize the POSIX shared-memory version
• Effects on real applications: for NAMD, initial results show that SMP helps up to 24 nodes on Abe
• Other communication models: an adaptive one?