Optimizing Collective Communication on Multicores. Rajesh Nishtala, Katherine Yelick, University of California, Berkeley (2009).


SLIDE 1

Optimizing Collective Communication on Multicores

Rajesh Nishtala¹  Katherine Yelick¹

¹ University of California, Berkeley

(2009)

1 / 57

SLIDE 2

Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors, John M. Mellor-Crummey, Michael L. Scott (1991)

2 / 57

SLIDE 3

PGAS Languages

◮ Focus on Partitioned Global Address Space languages

3 / 57

SLIDE 4

Partitioned Address Space

◮ One address space

Figure: threads T1, T2, ..., Tn over the partitioned address space

4 / 57

SLIDE 5

One Sided Communication

Figure: threads T1, T2, ..., Tn performing one-sided writes and reads into each other's partitions

5 / 57

SLIDE 6

PGAS Languages

◮ UPC, Unified Parallel C
◮ CAF, Co-array Fortran
◮ Titanium, a Java dialect

6 / 57

SLIDE 7

Context

◮ The gap between processors and memory systems is still enormous

7 / 57

SLIDE 8

http://images.bit-tech.net/content_images/2007/11/the_secrets_of_pc_memory_part_1/hei.png

8 / 57

SLIDE 9

◮ Today: processors don’t get faster, but we see more and more processors on a single chip

9 / 57

SLIDE 10

Processor          GHz    Cores (Threads)   Sockets
Intel Clovertown   2.66   8 (8)             2
AMD Barcelona      2.3    32 (32)           8
Sun Niagara 2      1.4    32 (256)          4

Table: Experimental Platforms

Nishtala, R., Yelick, K. Optimizing Collective Communication on Multicores

10 / 57

SLIDE 11

Sun Niagara 2

http://www.rz.rwth-aachen.de/aw/cms/rz/Themen/hochleistungsrechnen/rechnersysteme/beschreibung_der_hpc_systeme/ultrasparc_t2/rba/ultrasparc_t2_architectural_details/?lang=de

11 / 57

SLIDE 12

◮ The number of processors on a chip grows at an exponential pace


12 / 57

SLIDE 13

Intel Single-Chip Cloud Computer (48 Cores)

http://techresearch.intel.com/ProjectDetails.aspx?Id=1

13 / 57

SLIDE 14

◮ Communication in its most general form is the movement of data within cores, between cores, or within memory systems


14 / 57

SLIDE 15

Figure: four memory banks (RAM), each shared by four CPUs, all connected by a communication network

15 / 57

SLIDE 16

Collective Communication

◮ Communication-intensive problems often involve global communication

16 / 57

SLIDE 17

Broadcast

Figure: broadcast of a value from one thread to threads 1–4

17 / 57

SLIDE 18

Gather

Figure: gather of values from threads 1–4

18 / 57

SLIDE 19

◮ These operations are thought of as collective communication operations

19 / 57

SLIDE 20

Example: Sum of Vector Elements

1 2 3 4 5 6 7 8 9 10

20 / 57

SLIDE 21

Example: Sum of Vector Elements

◮ Create workers

Figure: vector 1..10 split among workers W1..W5, two elements each

21 / 57

SLIDE 22

Example: Sum of Vector Elements

◮ Every worker sums up its part of the vector

Figure: workers W1..W5 compute the partial sums 3, 7, 11, 15, 19

22 / 57

SLIDE 23

Example: Sum of Vector Elements

◮ The main thread gathers the partial results and sums them up

Figure: the partial sums 3, 7, 11, 15, 19 are gathered and summed to 55

23 / 57

SLIDE 24

Example: Sum of Vector Elements

Pseudocode (main thread):

double[] vector = read_vector();
Thread[] workers = spawn_workers();
start_workers(workers);
double result = calculate_result(workers);

24 / 57

SLIDE 25

Example: Sum of Vector Elements

Pseudocode (main thread):

double[] vector = read_vector();
Thread[] workers = spawn_workers();
start_workers(workers);
wait_until_everything_finished(workers);
double result = calculate_result(workers);

25 / 57

SLIDE 26

Barrier

◮ Synchronization method for a group of threads
◮ A thread can only continue its execution after every thread has called the barrier

26 / 57

SLIDE 27

Figure: the partial sums 3, 7, 11, 15, 19 are combined to 55 once all workers have reached the barrier

27 / 57

SLIDE 28

Collective Communication Operation

“... group of threads works together to perform a global communication operation ...”

28 / 57

SLIDE 29

Reduce

◮ Divide a problem into smaller subproblems
◮ Every thread contributes its part to the solution
◮ Example: Calculate the smallest entry of a vector

29 / 57

SLIDE 30

Flat vs. Tree

◮ For communication among threads, different topologies can be used

30 / 57

SLIDE 31

Flat Topology

◮ Example: we have a reduce operation
◮ In the end, the main thread Wmain has to wait for every worker thread W1, ..., W7

Figure: Wmain and workers W1..W7 in a flat topology

31 / 57

SLIDE 32

Figure: flat reduce, step 1 of 7 (Wmain has combined 1 partial result)

32 / 57

SLIDE 33

Figure: flat reduce, step 2 of 7 (Wmain has combined 2 partial results)

33 / 57

SLIDE 34

Figure: flat reduce, step 3 of 7 (Wmain has combined 3 partial results)

34 / 57

SLIDE 35

Figure: flat reduce, step 4 of 7 (Wmain has combined 4 partial results)

35 / 57

SLIDE 36

Figure: flat reduce, step 5 of 7 (Wmain has combined 5 partial results)

36 / 57

SLIDE 37

Figure: flat reduce, step 6 of 7 (Wmain has combined 6 partial results)

37 / 57

SLIDE 38

Figure: flat reduce, step 7 of 7 (Wmain has combined all 7 partial results)

38 / 57

SLIDE 39

Tree Topology

◮ Example: we have a reduce operation
◮ In the end, the main thread Wmain has to wait for every worker thread W1, ..., W7

Figure: Wmain and workers W1..W7 in a tree topology

39 / 57

SLIDE 40

Figure: tree reduce, step 1 of 3 (surviving partials: Wmain, W2, W4, W6)

40 / 57

SLIDE 41

Figure: tree reduce, step 2 of 3 (surviving partials: Wmain, W4)

41 / 57

SLIDE 42

Figure: tree reduce, step 3 of 3 (Wmain holds the result)

42 / 57

SLIDE 43

Analysis

Figure: Barrier Performance


43 / 57

SLIDE 44

Barrier Implementation

#define N 4
pthread_t threads[N];
volatile int ready[N];
volatile int go[N];

void barrier(int id) {
    if (id == 0) {
        // wait for each thread
        for (int i = 1; i < N; i++)
            while (ready[i] == 0);
        // reset the ready flags
        for (int i = 0; i < N; i++)
            ready[i] = 0;
        // signal each thread
        for (int i = 0; i < N; i++)
            go[i] = 1;
    } else {
        ready[id] = 1;
        // wait until thread is signalled
        while (go[id] == 0);
        go[id] = 0;
    }
}

44 / 57

SLIDE 45

Experiment: Barrier Implementation

45 / 57

SLIDE 46

◮ Strict synchronization: Data movement can only start after all threads have entered the collective and must be completed before the first thread exits the collective


46 / 57

SLIDE 47

Strict Synchronization

Figure: values v1..v8 and threads T1..T7 under strict synchronization

47 / 57

SLIDE 48

Loosening Synchronization Requirements

◮ Loose synchronization: Data movement can begin as soon as any thread has entered the collective and continue until the last thread leaves the collective


48 / 57

SLIDE 49

Loose Synchronization

Figure: values v1..v8 and threads T1..T7 under loose synchronization

49 / 57

SLIDE 50

Figure: loose synchronization, animation frame (three copies of v1..v8 in flight)

50 / 57

SLIDE 51

Figure: loose synchronization, animation frame (two copies of v1..v8 in flight)

51 / 57

SLIDE 52

Figure: loose synchronization, animation frame (one copy of v1..v8 in flight)

52 / 57

SLIDE 53

(Sun Niagara 2: 32 cores, 256 threads)


53 / 57

SLIDE 54

(Intel Clovertown: 8 cores, 8 threads)


54 / 57

SLIDE 55

(AMD Barcelona: 32 cores, 32 threads)


55 / 57

SLIDE 56

Summary

◮ Best strategy depends on the hardware and on the problem
◮ Using a library that can automatically adapt to a given situation can bring a great performance improvement, since hand tuning takes far too long

56 / 57

SLIDE 57

Words on the Paper

◮ Very high level
◮ Description of the problem without a concrete solution
◮ No implementation
◮ Plots aren’t always clear and precise

57 / 57