Transactional Memory Schedulers for Diverse Distributed Computing Environments
Costas Busch, Louisiana State University (joint work with Gokarna Sharma)
WTTM 2013

SLIDE 1

Transactional Memory Schedulers for Diverse Distributed Computing Environments

Costas Busch
Louisiana State University
(Joint work with Gokarna Sharma)

WTTM 2013

SLIDE 2

Multiprocessor Systems

  • Tightly-Coupled Systems
    – Multicore processors
    – Multilevel cache
  • Distributed Network Systems
    – Interconnection network
    – Asymmetric communication
  • Non-Uniform Memory Access (NUMA) Systems
    – Partially symmetric communication

SLIDE 3

Scheduling Transactions

Contention management determines:

  • when to start a transaction
  • when to retry after an abort
  • how to avoid conflicts
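The three decisions above can be sketched as a minimal contention manager in Python. This is only an illustration, not the scheduler from the talk: the class name, the karma-style conflict rule, and the backoff parameters are all assumptions.

```python
import random

class ContentionManager:
    """Minimal contention manager sketch: decides when a transaction
    starts, who wins a conflict, and how long an aborted transaction
    waits before retrying."""

    def __init__(self, max_backoff=64):
        self.max_backoff = max_backoff
        self.attempts = {}  # transaction id -> number of aborts so far

    def on_start(self, tx_id):
        self.attempts.setdefault(tx_id, 0)

    def resolve_conflict(self, tx_a, tx_b):
        # Favor the transaction that has aborted more often (a common
        # karma-style heuristic); the loser aborts and retries later.
        return tx_a if self.attempts[tx_a] >= self.attempts[tx_b] else tx_b

    def retry_delay(self, tx_id):
        # Randomized exponential backoff spreads out repeated conflicts.
        self.attempts[tx_id] += 1
        window = min(2 ** self.attempts[tx_id], self.max_backoff)
        return random.randint(0, window - 1)

cm = ContentionManager()
cm.on_start("T1"); cm.on_start("T2")
winner = cm.resolve_conflict("T1", "T2")
loser = "T2" if winner == "T1" else "T1"
delay = cm.retry_delay(loser)
assert winner in ("T1", "T2") and 0 <= delay < 64
```

Randomized backoff is one concrete answer to "when to retry after an abort"; the slides that follow replace it with more structured randomization (frames and windows).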

SLIDE 4

Efficiency Metrics

  • Makespan
    – Time to complete all transactions
  • Abort per commit ratio
    – Energy
  • Communication cost
    – Time and energy
    – Networked systems
  • Load balancing
    – Time and energy
    – NUMA and networked systems

SLIDE 5

Inspiration from Network Problems

  • Packet scheduling techniques
    – Help to schedule transactions in multicores
  • Mobile object tracking in sensor networks
    – Helps to schedule transactions in networked systems
  • Oblivious routing in networks
    – Helps to load-balance transaction schedules in NUMA systems

SLIDE 6

Presentation Outline

➢ 1. Tightly-Coupled Systems
  • 2. Distributed Networked Systems
  • 3. NUMA
  • 4. Future Directions

SLIDE 7

Scheduling in Tightly-Coupled Systems

One-shot scheduling problem:
  – M transactions, a single transaction per thread
  – s shared resources
  – Best bound proven to be achievable is O(s)

[Figure: threads 1, 2, 3, …, M, one transaction each; makespan of the schedule]

SLIDE 8

  • Problem complexity: directly related to vertex coloring
  • NP-hard to approximate an optimal vertex coloring
  • Can we do better under the limitations of the coloring reduction?

[Figure: two transactions conflicting on a shared resource]
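The coloring reduction can be made concrete with a small Python sketch (the instance and helper names are hypothetical): transactions that share a resource are adjacent in the conflict graph, and each color class of a greedy coloring becomes one execution step of the schedule.

```python
# Greedy coloring of the transaction conflict graph: transactions that
# share a resource conflict and must run in different steps; each color
# class is a set of transactions that can commit concurrently.

def conflict_graph(resources_of):
    """resources_of: dict tx -> set of shared resources it accesses."""
    txs = list(resources_of)
    adj = {t: set() for t in txs}
    for i, a in enumerate(txs):
        for b in txs[i + 1:]:
            if resources_of[a] & resources_of[b]:
                adj[a].add(b)
                adj[b].add(a)
    return adj

def greedy_schedule(adj):
    color = {}
    for t in sorted(adj):  # fixed order keeps the sketch deterministic
        used = {color[n] for n in adj[t] if n in color}
        c = 0
        while c in used:
            c += 1
        color[t] = c
    steps = {}
    for t, c in color.items():
        steps.setdefault(c, set()).add(t)
    return [steps[c] for c in sorted(steps)]

# Hypothetical one-shot instance: 4 transactions over resources x, y, z.
work = {"T1": {"x"}, "T2": {"x", "y"}, "T3": {"y"}, "T4": {"z"}}
schedule = greedy_schedule(conflict_graph(work))
# Each step contains only non-conflicting transactions.
```

Greedy coloring gives a valid schedule, but since optimal coloring is NP-hard to approximate, this path alone cannot give good makespan guarantees, which motivates the window-based approach on the next slides.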

SLIDE 9

Inspiration

Packet routing and job-shop scheduling in O(congestion + dilation) steps (1994)
  • F. T. Leighton, Bruce M. Maggs, Satish B. Rao

Congestion (C) = maximum edge utilization
Dilation (N) = maximum path length

SLIDE 10

Execution Window Model

  • An M × N window W
    – M threads with a sequence of N transactions per thread
    – a collection of N one-shot transaction sets

[Figure: M × N window of transactions, one row of N transactions per thread]

Analogy:
  Packet = thread
  Path length (N) = sequence of a thread's transactions
  Congestion (C) = conflicts of a thread's transactions

Makespan: O(C + N log(MN))

SLIDE 11

Intuition

Random delays help conflicting transactions shift inside the window. Initially each thread is low priority; after its random delay expires, the thread becomes high priority.

[Figure: window of width N stretched to N' by the random initial delays]

SLIDE 12

How it works: Frames

Each thread i picks a random delay of q_i frames, q_i ∈ [0, α_i − 1], where α_i = C_i / log(MN) and C = max_i C_i, 1 ≤ i ≤ M.

The window is divided into frames of size O(log(MN)): F_i1 is the first frame of thread i, where T_i1 executes; F_i2 is the second frame, where T_i2 executes; and so on up to F_iN.

Makespan = (C / log(MN) + number of frames) × frame size
         = (C / log(MN) + N) × O(log(MN))
         = O(C + N log(MN))
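The frame idea can be sketched in Python: each thread draws a random delay measured in frames, then runs one transaction per frame. The instance below is illustrative, and the real algorithm additionally resolves conflicts and priorities inside each frame.

```python
import math
import random

def frame_schedule(conflicts, M, N):
    """Assign each thread's j-th transaction to a frame, following the
    random-delay idea: thread i delays q_i in [0, alpha_i - 1] frames,
    then runs one transaction per frame.
    conflicts: dict thread -> C_i (conflicts of that thread's transactions).
    Returns the start frame of transaction T_ij for each (i, j)."""
    frame_log = max(1, math.ceil(math.log2(M * N)))
    start = {}
    for i in range(M):
        alpha = max(1, conflicts[i] // frame_log)
        q = random.randrange(alpha)          # random initial delay
        for j in range(N):
            start[(i, j)] = q + j            # frame of T_ij is q_i + j
    return start

# Hypothetical window: M = 4 threads, N = 3 transactions each.
starts = frame_schedule({0: 8, 1: 4, 2: 2, 3: 8}, M=4, N=3)
makespan_frames = max(starts.values()) + 1
# Frames used: at most C/log(MN) + N, each of size O(log(MN)),
# matching the O(C + N log(MN)) makespan bound above.
```

The random delay spreads the heavily conflicting threads across frames, so that within any single frame only a logarithmic amount of conflict remains with high probability.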

SLIDE 13

Challenges

  • Unit-length transactions
  • C may not be known
    – Try to guess it for each transaction
    – Use random priorities within a frame
  • N: what window size is good?
    – Dynamically try different window sizes

DISC 2010 - 24th International Symposium on Distributed Computing

SLIDE 14

Presentation Outline

  • 1. Tightly-Coupled Systems
➢ 2. Distributed Networked Systems
  • 3. NUMA
  • 4. Future Directions

SLIDE 15

Distributed Transactional Memory

  • Transactions run on network nodes
  • They ask for shared objects distributed over the network, for either read or write
  • They appear to execute atomically
  • The reads and writes on shared objects are supported through three operations:
    – Publish
    – Lookup
    – Move
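The semantics of the three operations can be sketched with a small bookkeeping class. This is an illustration only: the state is centralized here for clarity, whereas the protocols on the following slides maintain it distributedly through a directory hierarchy.

```python
class ObjectDirectory:
    """Sketch of the three DTM directory operations: Publish registers a
    new object at its creator; Lookup replicates a read-only copy to the
    requester; Move relocates the main copy and invalidates old copies."""

    def __init__(self):
        self.owner = {}     # object -> node holding the main copy
        self.readers = {}   # object -> nodes holding read-only copies

    def publish(self, obj, creator):
        self.owner[obj] = creator
        self.readers[obj] = set()

    def lookup(self, obj, node):
        self.readers[obj].add(node)      # read-only replication
        return self.owner[obj]

    def move(self, obj, node):
        self.readers[obj].clear()        # invalidate read-only copies
        prev = self.owner[obj]
        self.owner[obj] = node           # relocate the main copy
        return prev                      # previous owner, now invalidated

d = ObjectDirectory()
d.publish("xi", "u")
d.lookup("xi", "v")          # v gets a read-only copy of xi
prev = d.move("xi", "w")     # main copy relocates to w; copies invalidated
assert prev == "u" and d.owner["xi"] == "w" and not d.readers["xi"]
```

Note the asymmetry the next slides illustrate: Lookup replicates without changing ownership, while Move transfers ownership and invalidates every stale copy.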

SLIDE 16

Suppose the object ξ is at the owner node, and another node is a requesting node.

Suppose transactions are immobile and the objects are mobile.

[Figure: owner node holding ξ and a requesting node]

SLIDE 17

Lookup operation: replicates the object to the requesting node.

[Figure: the main copy of ξ remains at the owner; the requesting node receives a read-only copy]

SLIDE 18

Lookup operation: replicates the object to the requesting nodes.

[Figure: the main copy of ξ remains at the owner; multiple requesting nodes hold read-only copies]

SLIDE 19

Move operation: relocates the object explicitly to the requesting node.

[Figure: ξ moves to the requesting node; the copy at the previous owner is invalidated]

SLIDE 20

Move operation: relocates the object explicitly to the requesting node.

[Figure: the main copy of ξ now resides at the requesting node; previous copies are invalidated]

SLIDE 21

Related Work

  • Arrow [DISC'98]: stretch O(S_ST) = O(D); general networks; runs on a spanning tree
  • Relay [OPODIS'09]: stretch O(S_ST) = O(D); general networks; runs on a spanning tree
  • Combine [SSS'10]: stretch O(S_OT) = O(D); general networks; runs on an overlay tree
  • Ballistic [DISC'05]: stretch O(log D); constant-doubling-dimension networks; runs on a hierarchical directory with independent sets
  • Spiral [IPDPS'12]: stretch O(log^2 n · log D); general networks; runs on a hierarchical directory with sparse covers

➢ D is the diameter of the network
➢ S_* is the stretch of the tree used

SLIDE 22

Inspiration

Concurrent online tracking of mobile users (1991)
  • B. Awerbuch, D. Peleg

  • A distributed directory scheme to minimize the cost of moving objects
  • Total communication cost is proportional to the distances of positions of moving objects
  • Uses a hierarchical clustering of the network (sparse partitions)

SLIDE 23

Spiral approach: hierarchical clustering of the network graph.

SLIDE 24

Spiral approach: hierarchical clustering, with an alternative representation as a hierarchy tree with leader nodes.

SLIDE 25

At the lowest level (level 0), every node is a cluster. There are directories at each level cluster, with a downward pointer if the object locality is known.

SLIDE 26

A Publish operation

➢ Assume that the owner node is the creator of ξ and invokes the Publish operation
➢ Nodes know their parent in the hierarchy

SLIDE 27

Send the request to the leader.

SLIDE 28

Continue the up phase; set a downward pointer while going up.

SLIDE 29

Continue the up phase; set a downward pointer while going up.

SLIDE 30

Root node found; stop the up phase.

SLIDE 31

A successful Publish operation: the creator is now the predecessor node of ξ.

SLIDE 32

Supporting a Move operation

➢ Initially, nodes point downward to the object owner (predecessor node) due to the Publish operation
➢ Nodes know their parent in the hierarchy

SLIDE 33

Send the request to the leader node of the cluster upward in the hierarchy.

SLIDE 34

Continue the up phase until a downward pointer is found; set a downward path while going up.

SLIDE 35

Continue the up phase; set a downward path while going up.

SLIDE 36

Continue the up phase; set a downward path while going up.

SLIDE 37

Downward pointer found; start the down phase. Discard the old path while going down.

SLIDE 38

Continue the down phase; discard the old path while going down.

SLIDE 39

Continue the down phase; discard the old path while going down.

SLIDE 40

Predecessor reached; the object is moved from node to node.

A Lookup is similar, but without any change in the directory structure, and only a read-only copy of the object is sent.
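The up and down phases above can be sketched in Python over a simplified hierarchy. This is an assumption-laden illustration: the hierarchy here is a single tree, whereas Spiral actually visits O(log n) labeled parent leaders per level via spiral paths, and the node names are hypothetical.

```python
# Simplified Move over a directory hierarchy (a single tree here).
# Up phase: climb from the requester, setting downward pointers, until a
# node with an existing downward pointer is found.
# Down phase: follow and discard the old pointers to the predecessor.

def move(parent, down, requester):
    """parent: node -> its parent in the hierarchy (root maps to itself).
    down: node -> downward pointer left by the previous Publish/Move.
    Assumes a Publish already installed a path from the root.
    Returns the predecessor (previous owner) of the object."""
    child, x = requester, parent[requester]
    while x not in down:
        down[x] = child                  # set pointer while going up
        child, x = x, parent[x]
    old = down[x]
    down[x] = child                      # redirect toward the new owner
    while old in down:
        old = down.pop(old)              # discard path while going down
    return old                           # predecessor node

# Hypothetical 2-level hierarchy: leaves u, v under leaders a, b, root r.
parent = {"u": "a", "v": "b", "a": "r", "b": "r", "r": "r"}
down = {"a": "u", "r": "a"}              # pointers left by u's Publish
pred = move(parent, down, "v")
assert pred == "u" and down == {"b": "v", "r": "b"}
```

After the call, the downward path from the root leads to the new owner v, exactly as in the slide sequence: the old path to u has been consumed, and v's request will be queued behind u.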

SLIDE 41

Distributed Queue

[Figure: node u is both head and tail of the queue]

SLIDE 42

Distributed Queue

[Figure: node v's request is appended; u is the head, v is the tail]

SLIDE 43

Distributed Queue

[Figure: requests form the queue u → v → w; u is the head, w is the tail]

SLIDE 44

Distributed Queue

[Figure: u is served and dequeued; v becomes the head, w remains the tail]

SLIDE 45

Distributed Queue

[Figure: v is served and dequeued; w is both head and tail]

SLIDE 46

Spiral avoids deadlocks

Label all the parents in each level and visit them in the order of the labels.

[Figure: parent sets of clusters A and B at levels k−1, k, k+1; the labeled parents parent(A) and parent(B) are visited in label order on the way from the root]
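The labeling rule is an instance of ordered resource acquisition: if every request visits (locks) the shared parents in increasing label order, a circular wait, and hence a deadlock, cannot form. A Python sketch with hypothetical parent labels:

```python
import threading

# Spiral labels the parents and every request visits them in increasing
# label order; acquiring locks in one global order rules out deadlock.
labels = {"p1": 1, "p2": 2, "p3": 3, "p4": 4, "p5": 5}
locks = {p: threading.Lock() for p in labels}

def visit_parents(parent_set, work):
    # Sort by label: any two requests lock common parents in the same order.
    ordered = sorted(parent_set, key=labels.get)
    for p in ordered:
        locks[p].acquire()
    try:
        work()
    finally:
        for p in reversed(ordered):
            locks[p].release()

done = []
a = threading.Thread(target=visit_parents,
                     args=({"p2", "p4", "p5"}, lambda: done.append("A")))
b = threading.Thread(target=visit_parents,
                     args=({"p5", "p2", "p3"}, lambda: done.append("B")))
a.start(); b.start(); a.join(); b.join()
assert sorted(done) == ["A", "B"]   # both requests complete: no deadlock
```

Without the ordering, request A holding p5 while waiting for p2, and request B holding p2 while waiting for p5, could block each other forever.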

SLIDE 47

Spiral Hierarchy

  • (O(log n), O(log n))-sparse cover hierarchy constructed from O(log n) levels of hierarchical partitions
     At level 0, each node belongs to exactly one cluster
     At level h, all the nodes belong to one cluster with root r
     At each level 0 < i < h, each node belongs to exactly O(log n) clusters, which are labeled differently

The two O(log n) parameters bound the cluster overlaps and the cluster diameter stretch.

SLIDE 48

Spiral Hierarchy

  • How to find a predecessor node?
     Via spiral paths: for each leaf node u, visit the parent leaders of all the clusters that contain u, from level 0 up to the root level

The hierarchy guarantees:
(1) For any two nodes u, v, their spiral paths p(u) and p(v) meet at level min{h, log(dist(u,v)) + 2}
(2) length(p_i(u)) is at most O(2^i log^2 n)

[Figure: spiral paths p(u), p(v), p(w) climbing from leaf nodes u, v, w toward the root]

SLIDE 49

Downward Paths

[Figure: deformation of spiral paths after Moves; the downward path from the root follows p(u), then p(v), then p(w) as ownership changes]

SLIDE 50

Analysis: Lookup Stretch

If there is no Move, a Lookup request r from w finds a downward path to v at level log(dist(w,v)) + 2 = O(i).

When there are Moves, it can be shown that r finds a downward path to v at level k = O(i + log(log^2 n)).

C(r)/C*(r) = [O(2^k log^2 n) + O(2^k log n) + O(2^i log^2 n)] / 2^{i-1} = O(log^4 n)

[Figure: canonical path and spiral paths p(w), p(v) between levels i and k, with segment lengths O(2^i log^2 n), O(2^k log^2 n), O(2^k log n), and 2^i]

SLIDE 51

Analysis: Move Stretch

Assume a sequential execution R of l + 1 requests r0, r1, …, rl, where r0 is an initial Publish request and r1, …, rl are Moves.

C*(R) ≥ max_{1 ≤ k ≤ h} (S_k − 1) 2^{k−1}

C(R) ≤ Σ_{k=1}^{h} (S_k − 1) O(2^k log^2 n)

Thus,

C(R)/C*(R) ≤ Σ_{k=1}^{h} (S_k − 1) O(2^k log^2 n) / max_{1 ≤ k ≤ h} (S_k − 1) 2^{k−1}
           = O(log^2 n · h) · max_{1 ≤ k ≤ h} (S_k − 1) 2^{k−1} / max_{1 ≤ k ≤ h} (S_k − 1) 2^{k−1}
           = O(log^2 n · log D)

[Figure: requests r0, r1, …, rl at levels 1, 2, …, h of the hierarchy, issued by nodes u, v, y, w]

SLIDE 52

Presentation Outline

  • 1. Tightly-Coupled Systems
  • 2. Distributed Networked Systems
➢ 3. NUMA
  • 4. Future Directions

SLIDE 53

General routing: choose paths from sources to destinations.

Routing in DTM: the source node of the predecessor request in the total order is the destination of a successor request.

[Figure: source-destination pairs (u_1, v_1), (u_2, v_2), (u_3, v_3)]

SLIDE 54

Edge congestion C_edge: the maximum number of paths that use any edge.

Node congestion C_node: the maximum number of paths that use any node.

SLIDE 55

Stretch = length of chosen path / length of shortest path

[Figure: between u and v, the shortest path has length 8 and the chosen path has length 12; stretch = 12/8 = 1.5]
slide-56
SLIDE 56

Inspiration: Oblivious Routing

Each request path choice is independent

  • f other request path choices
SLIDE 57

Problem Statement

  • Given a d-dimensional mesh and a finite set of operations R = {r0, r1, …, rl} on an object ξ
  • Design a DTM algorithm that:
    – Minimizes the congestion C = max_e |{j : e ∈ q_j}| on any edge e
    – Minimizes the total communication cost A(R) = Σ_{j=1}^{m} |q_j| for all the operations

Limitation: congestion and stretch cannot be minimized simultaneously in arbitrary networks.

SLIDE 58

Multibend DTM

  • Focus on mesh networks (a general solution is impossible)
  • For a 2-dimensional mesh, MultiBend has both stretch and (edge) congestion O(log n)
  • For a d-dimensional mesh, MultiBend has stretch O(d log n) and congestion O(d^2 log n)

SLIDE 59

Type-1 Mesh Decomposition

[Figure: a 2-dimensional mesh]

SLIDE 60

Type-1 Mesh Decomposition

SLIDE 61

Type-1 Mesh Decomposition

SLIDE 62

Type-2 Mesh Decomposition

SLIDE 63

Type-2 Mesh Decomposition

SLIDE 64

Decomposition for a 2^3 × 2^3 2-dimensional mesh

[Figure: hierarchy levels (i, 1), (i, 2), (i+1, 1), (i+1, 2)]

SLIDE 65

MultiBend Hierarchy

  • Find a predecessor node via multi-bend paths for each leaf node u

[Figure: multi-bend paths p(u) and p(v) from leaf nodes u, v toward the root]

SLIDE 66

Load Balancing

  • Through a leader election procedure
    – Every time we access the leader of a sub-mesh, we replace it with another leader chosen uniformly at random among its nodes
  • The update cost is low in comparison to the cost of serving requests
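The rotation rule can be sketched in Python with a hypothetical sub-mesh: each access to the current leader also re-elects a new one uniformly at random, so over many accesses no single node remains a directory hotspot.

```python
import random
from collections import Counter

class SubMesh:
    """Sketch of the load-balancing rule: each access to a sub-mesh's
    leader re-elects a new leader uniformly at random among its nodes."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.leader = random.choice(self.nodes)

    def access_leader(self):
        current = self.leader
        # Re-elect uniformly at random (a small update cost per access).
        self.leader = random.choice(self.nodes)
        return current

random.seed(42)
mesh = SubMesh(range(16))
hits = Counter(mesh.access_leader() for _ in range(16000))
# With 16 nodes, each node serves roughly 1/16 of the requests.
assert max(hits.values()) < 2 * 16000 / 16
```

Since only the re-election message is extra work, the update cost stays small relative to the cost of routing and serving the requests themselves, as the slide states.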

SLIDE 67

Analysis of (Edge) Congestion

Assume M1 is a type-1 submesh with side length m_l, and let e be an edge between M1 and its enclosing type-2 submesh M2.

  • A sub-path uses edge e with probability 2/m_l
  • P': the set of paths from M1 to M2, or vice versa
  • C'(e): the congestion caused by P' on e
  • E[C'(e)] ≤ 2|P'|/m_l
  • B ≥ |P'|/out(M1)
  • out(M1) ≤ 4 m_l
  • C* ≥ B

==> E[C'(e)] ≤ 8C*

SLIDE 68

Presentation Outline

  • 1. Tightly-Coupled Systems
  • 2. Distributed Networked Systems
  • 3. NUMA
➢ 4. Future Directions

SLIDE 69

Future Directions

  • Distributed networked systems
    – Multiple objects: minimize time and communication cost
    – Fault tolerance
    – Dynamic networks
  • NUMA
    – Study other network architectures