DWS: Demand-aware Work-Stealing in Multi-programmed Multi-core Architectures

SLIDE 1

DWS: Demand-aware Work-Stealing in Multi-programmed Multi-core Architectures

  • Quan Chen, Long Zheng, Minyi Guo

Shanghai Jiao Tong University, China

  • PMAM 2014

SLIDE 2

Outline

  • Background
  • Problem & Motivation
  • Demand-aware Work-Stealing (DWS)
  • Evaluation
  • Conclusions

SLIDE 3

Background

  • Hardware: Multi-core/Many-core Architectures
  • Scenario: Multiple parallel programs

[Figure: co-running programs P1 … Pi … Pn]

SLIDE 4

Background: parallel programs

Traditional parallel programs

  • Hard to adjust the number of threads at runtime

Task-based parallel programs

  • Dynamic task scheduling

SLIDE 5

Work-sharing

Lock the central task pool when getting a task

[Figure: Workers 1-4 take tasks from a central task pool, locking and unlocking it for each access]
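As a rough illustration of work-sharing, the Python sketch below has every worker take tasks from one lock-protected pool. The names `CentralPool`, `get_task`, and `run_workers` are hypothetical and chosen only for this example; they do not come from any scheduler described in the slides.

```python
import threading
from collections import deque


class CentralPool:
    """A single task pool shared by all workers (hypothetical name)."""

    def __init__(self, tasks):
        self._tasks = deque(tasks)
        self._lock = threading.Lock()

    def get_task(self):
        # Every access locks the central pool, so all workers contend on one lock.
        with self._lock:
            return self._tasks.popleft() if self._tasks else None


def run_workers(pool, n_workers):
    """Start n_workers threads that drain the pool; return (worker, task) pairs."""
    done = []

    def worker(wid):
        while (task := pool.get_task()) is not None:
            done.append((wid, task))  # stand-in for actually executing the task

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return done
```

Each task is handed out exactly once, but every `get_task` call serializes on the same lock, which is the contention the central pool introduces.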

SLIDE 6

Work-stealing

[Figure: Threads 1-4, each with its own task pool; a pool is locked and unlocked when a task is taken from it]
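A minimal Python sketch of the work-stealing idea follows. It assumes the common arrangement in which each worker owns a double-ended task pool, takes its own newest task, and steals the oldest task of a randomly chosen victim. The `Worker` class and `next_task` method are hypothetical names, and the per-pool lock is a simplification rather than the lock-free deque that production schedulers often use.

```python
import random
import threading
from collections import deque


class Worker:
    """One worker with its own task pool (hypothetical sketch)."""

    def __init__(self, wid):
        self.wid = wid
        self.pool = deque()
        self.lock = threading.Lock()

    def next_task(self, workers):
        """Return a local task if any, else try one steal; None if the steal fails."""
        with self.lock:
            if self.pool:
                return self.pool.pop()  # owner takes the newest task
        victims = [w for w in workers if w is not self]
        if not victims:
            return None
        victim = random.choice(victims)  # pick a random victim to steal from
        with victim.lock:
            if victim.pool:
                return victim.pool.popleft()  # thief takes the oldest task
        return None
```

Unlike the central pool, a lock is only contended when a thief and an owner touch the same pool at the same time.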

SLIDE 7

Problem & Motivation

  • Aggressive feature of work-stealing
  • On a k-core computer, k threads/workers are launched

Existing solutions

  • Time-sharing - ABP yielding mechanism
  • Space-sharing - Equal-partitioning

SLIDE 8

Time-sharing

  • ABP yielding mechanism
  • If a thread fails to steal a task, it goes to sleep

[Figure: Thread 1, Thread 2, and Thread 3 sharing a core (C) and its cache; each thread is marked Active or Sleep]

SLIDE 9

Space-sharing

  • Equal-partitioning mechanism

If m programs co-run on a k-core computer, each program is allocated k/m cores.

[Figure: co-running programs P1 … Pi … Pm sharing the cores]
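The k/m rule can be sketched in a few lines of Python. The function name `equal_partition` is hypothetical, and the handling of a remainder (giving one extra core to the first few programs when k is not divisible by m) is an assumption; the slides only state the k/m split.

```python
def equal_partition(k, m):
    """Split core ids 0..k-1 into m contiguous groups of near-equal size."""
    if k <= 0 or m <= 0:
        raise ValueError("k and m must be positive")
    base, extra = divmod(k, m)
    partitions, start = [], 0
    for i in range(m):
        # Assumption: the first `extra` programs get one additional core.
        size = base + (1 if i < extra else 0)
        partitions.append(list(range(start, start + size)))
        start += size
    return partitions
```

For example, `equal_partition(8, 2)` gives each of two programs four cores, regardless of how many tasks either program actually has.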

SLIDE 10

Demand-aware Work-Stealing (DWS)

Start from equal-partitioning

Dynamically balance cores at runtime

  • If pi cannot fully utilize a core, it releases the core
  • If pi has too many tasks, it tries to obtain more cores

[Figure: Runtime architecture of DWS, with cores being released and obtained]

SLIDE 11

Stealing algorithm - (Release)

  • A worker decides whether to release its core by itself

If a worker fails to steal a new task too many times (the threshold T_SLEEP), it goes to sleep and releases its core
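The release rule can be sketched as a small counter kept by each worker. This is an illustrative Python sketch, not the authors' implementation: the class name `StealCounter` and its methods are hypothetical, and two details are assumptions, namely that failures are counted consecutively (a successful steal resets the count) and that the worker sleeps once the count reaches T_SLEEP.

```python
class StealCounter:
    """Tracks failed steal attempts for one worker (hypothetical sketch)."""

    def __init__(self, t_sleep):
        self.t_sleep = t_sleep
        self.failed_steals = 0
        self.asleep = False

    def record_steal(self, success):
        """Update the counter after one steal attempt; return True if asleep."""
        if success:
            self.failed_steals = 0  # assumption: a success resets the count
        else:
            self.failed_steals += 1
            if self.failed_steals >= self.t_sleep:
                self.asleep = True  # the worker sleeps, releasing its core
        return self.asleep

    def wake(self):
        """Wake the worker, for example when its program obtains the core again."""
        self.asleep = False
        self.failed_steals = 0
```

A small T_SLEEP releases cores quickly but risks putting a worker to sleep just before new tasks appear, while a large value keeps idle workers spinning longer.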

SLIDE 12

Coordinator - (Obtain)

  • The coordinator decides whether to obtain more cores
  • If a program has too many queued tasks, it should try to get some free cores

How many? Which?

C1: The more queued tasks a program has, the more cores it should obtain

C2: A program can take its allocated cores back

C3: A program cannot obtain busy cores

SLIDE 13

Coordinator - How Many?

  • C1: The more queued tasks a program has, the more cores it should obtain

  • Na: number of active workers
  • Nb: number of queued tasks
  • Nf: number of free cores
  • Nr: number of released cores
  • Nw: number of cores expected

How many:

SLIDE 14

Coordinator - Which?

  • Nw <= Nf: randomly select Nw free cores
  • Nf < Nw <= Nf+Nr (C2): select the Nf free cores plus (Nw-Nf) of its released cores
  • Nw > Nf+Nr (C3): select the Nf free cores plus its Nr released cores
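The three cases above can be written as a short selection routine. The Python sketch below is illustrative only: `select_cores` is a hypothetical name, cores are represented as plain ids, and taking the first (Nw-Nf) released cores in the middle case is an assumption, since the slides do not say which released cores are reclaimed first.

```python
import random


def select_cores(n_w, free, released, rng=random):
    """Pick cores for a program that expects n_w cores.

    free:     ids of currently free cores (Nf = len(free))
    released: ids of cores this program released earlier (Nr = len(released))
    """
    if n_w <= 0:
        return []
    n_f, n_r = len(free), len(released)
    if n_w <= n_f:
        # Enough free cores: randomly select Nw of them.
        return rng.sample(list(free), n_w)
    if n_w <= n_f + n_r:
        # C2: take all free cores plus (Nw - Nf) of its own released cores.
        return list(free) + list(released[: n_w - n_f])
    # C3: busy cores cannot be obtained, so stop at Nf + Nr cores.
    return list(free) + list(released)
```

In the last case the program receives fewer cores than it expects, because the remaining cores are busy running other programs.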

SLIDE 15

Evaluation platform

  • A dual-socket quad-core computer with Hyper-Threading Technology

Each socket is a Quad-Core Intel Xeon E5620

Hardware & Configuration          Size/Version
L1/L2 cache size (each core)      256 KB/1 MB
L3 cache size (each socket)       12 MB
Main memory size                  32 GB
Operating system                  Linux 2.6.32-38

SLIDE 16

Benchmarks

Calculate execution time:

SLIDE 17

Performance of DWS

DWS can significantly improve the performance of the benchmarks

SLIDE 18

Effectiveness of the coordinator

Without the coordinator, the performance of the benchmarks is degraded

SLIDE 19

Impact of T_SLEEP

We should choose T_SLEEP = k or 2k on a k-core computer

SLIDE 20

Contributions & conclusions

  • A modified work-stealing algorithm that enables a program to release its under-utilized cores.
  • A coordinator to manage the workers. It enables a program to grab and use the under-utilized cores released by other programs.
  • We have implemented DWS, which achieves a performance gain of up to 32.3% compared to traditional work-stealing schedulers.

SLIDE 21

Thanks! Questions?