A Novel Parallel Deadlock Detection Algorithm and Architecture

SLIDE 1

April, 2001 CODES 2001

¹,²Hardware/Software RTOS Group
¹Low Power Compiler Group
¹Assistant Professor, ¹,²Electrical and Computer Engineering
¹Adjunct Assistant Professor, ¹College of Computing

Georgia Institute of Technology, Atlanta, GA USA

A Novel Parallel Deadlock Detection Algorithm and Architecture

Pun H. Shiu², Yudong Tan², Vincent J. Mooney III¹

{ship, ydtan, mooney}@ece.gatech.edu
http://codesign.ece.gatech.edu
¹http://crest.ece.gatech.edu

SLIDE 2

Overall Outline

  • Motivation - Technology Trends
  • Background - Deadlock Detection
  • Parallel Algorithm
  • Parallel Architecture
  • Experimental Results
  • Conclusion

SLIDE 3

Motivation - Technology Trends

  • Many of today’s chip designs contain 2 processors, e.g., a DSP and a microcontroller
  • Future SoC designs are likely to include
      • 4-40 heterogeneous processors
      • 10-50 on-chip hardware resources
          • FFT, Viterbi filter, wireless communication
      • Multithreaded software which dynamically requests and uses the resources

SLIDE 4

  • Ideally, programmers of such future SoC designs would only write deadlock-free code
  • If not, we provide a way to detect deadlock very fast
  • User can write code to recover from deadlock

SoC Software

SLIDE 5

Deadlock Detection Unit (DDU)

  • Small & scalable parallel hardware unit
  • Multiple requestors & resources
  • In this paper, the only requestors are processors and the only resources are specialized hardware units like FFT

SLIDE 6

Overall Outline

  • Motivation - Technology Trends
  • Background - Deadlock Detection
  • Parallel Algorithm
  • Parallel Architecture
  • Experimental Results
  • Conclusion

SLIDE 7

Background: Deadlock Condition

  • Properties of Resources
      • Mutual Exclusion: Any resource can be held exclusively, making it unavailable to other processors.
      • Non-preemption: Any resource can be released only by the processor holding it.
  • Behavior of processors
      • Partial Allocation: a processor may hold some resources while it requests additional resources.
      • Blocked Wait: a processor must wait for unavailable resources to become available.

[Figure: processors P1, P2 and resources Q1, Q2]

SLIDE 8

Previous Algorithms’ Run Time

Generally the run time is O(m*n), where m is the number of processors and n is the number of resources.

  • Path Based, O(e), with e ≤ m*n, where e is the number of edges
  • Tree Based, O(m*n)
  • Matrix Based, O(m*n)
  • Message Passing Based, O(m*n)

SLIDE 9

Overall Outline

  • Motivation - Technology Trends
  • Background - Deadlock Detection
  • Parallel Algorithm
  • Parallel Architecture
  • Experimental Results
  • Conclusion

SLIDE 10

Example

[Figure: graph of processor and resource nodes connected by request and grant edges]

SLIDE 11

Example

[Figure: the example graph annotated with a simple path; source, link and sink nodes; source, link and sink edges]

SLIDE 12

Matrix Representation

  • Each row corresponds to a requestor (processor)
      • $p_i$ represents requestor (processor) i
  • Each column corresponds to a resource
      • $q_j$ represents resource j
  • Entries in the matrix
      • r ($r_{ij}$) represents a request
      • g ($g_{ij}$) represents a grant
      • 0 represents no action (neither request nor grant)
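As an illustration of this representation, the following Python sketch builds such a matrix from lists of grants and requests. The function name `build_matrix`, the cell values `R`, `G`, `NONE`, and the 0-based `(i, j)` index pairs are assumptions made for the example; the slides only name the symbols r, g and 0.

```python
# Hypothetical cell values: the slide only names the symbols r, g and 0.
R, G, NONE = "r", "g", 0


def build_matrix(num_processors, num_resources, grants, requests):
    """Return an m x n matrix: rows are processors p_i, columns are resources q_j.

    grants and requests are lists of (i, j) pairs with 0-based indices.
    """
    matrix = [[NONE] * num_resources for _ in range(num_processors)]
    for i, j in grants:
        matrix[i][j] = G
    for i, j in requests:
        if matrix[i][j] != NONE:
            # A cell holds one symbol: a processor cannot request what it already holds.
            raise ValueError(f"cell ({i}, {j}) is already set")
        matrix[i][j] = R
    return matrix


# Two processors, three resources:
# p1 holds q1 and requests q2; p2 holds q2 and q3 and requests q1.
example = build_matrix(
    2, 3,
    grants=[(0, 0), (1, 1), (1, 2)],
    requests=[(0, 1), (1, 0)],
)
```

Each row of `example` can then be read as one processor's view of the resources, and each column as one resource's view of the processors.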

SLIDE 13

Properties

  • Proposed Algorithm
  • Matrix Based
  • Modified Reduction Technique
  • Handling multiple requests and grants at the same time
  • Requires simple bit-wise boolean operations

SLIDE 14

SoC Example

P\Q       q1(IcP)  q2(PCI)  q3(WI)
p1(DSP)   g        r        0
p2(VSP)   r        g        g

SLIDE 15

Deadlock and Cycle Relation

P\Q       q1(IcP)  q2(PCI)  q3(WI)
p1(DSP)   g        r        0
p2(VSP)   r        g        g

[Figure: graph with nodes DSP, VSP, IcP, PCI, WI]

  • Deadlock ⇒ ∃ cycles
  • Cycles ⇒ ∃ Deadlock

(As shown in red)
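The cycle in this example can be checked in software with a generic depth-first search. The sketch below is not the algorithm proposed in this presentation; the edge directions (a request points from processor to resource, a grant points from resource to the processor holding it) and the name `has_cycle` are assumptions made for illustration.

```python
def has_cycle(edges):
    """Return True if the directed graph given as (src, dst) pairs contains a cycle."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)
        graph.setdefault(dst, [])

    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current DFS path, finished
    color = {node: WHITE for node in graph}

    def visit(node):
        color[node] = GRAY
        for nxt in graph[node]:
            if color[nxt] == GRAY:
                return True  # back edge: nxt is already on the current path
            if color[nxt] == WHITE and visit(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color[node] == WHITE and visit(node) for node in graph)


# Edges read off the SoC example matrix.
edges = [
    ("DSP", "PCI"),  # p1 requests q2
    ("IcP", "DSP"),  # q1 is granted to p1
    ("VSP", "IcP"),  # p2 requests q1
    ("PCI", "VSP"),  # q2 is granted to p2
    ("WI", "VSP"),   # q3 is granted to p2
]
deadlocked = has_cycle(edges)
```

Here the search finds the cycle DSP, PCI, VSP, IcP, DSP; the WI grant is not part of it.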

SLIDE 16

Matrix Representation

$$M = \begin{bmatrix} g & r & 0 \\ r & g & g \end{bmatrix}, \qquad r_c = 10, \quad g_c = 01$$

$$M_c = \begin{bmatrix} 01 & 10 & 00 \\ 10 & 01 & 01 \end{bmatrix}$$

$M_r$ holds the same bit pairs as $M_c$ (with $r_r$ and $g_r$ as the row-form codes), arranged for the row-wise operations.

SLIDE 17

Matrix Representation: calculation of $M_{rbo}$ and $XOR_{right}$

$M_{rbo}$ is the bitwise OR of the entries in each row; $XOR_{right}$ is the XOR of the two bits of each result.

$$M_r = \begin{bmatrix} 01 & 10 & 00 \\ 10 & 01 & 01 \end{bmatrix} \;\Rightarrow\; M_{rbo} = \begin{bmatrix} 11 \\ 11 \end{bmatrix}, \qquad XOR_{right} = \begin{bmatrix} 1 \oplus 1 \\ 1 \oplus 1 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$$

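SLIDE 18

Matrix Representation: calculation of $M_{cbo}$ and $XOR_{below}$

$M_{cbo}$ is the bitwise OR of the entries in each column; $XOR_{below}$ is the XOR of the two bits of each result.

$$M_c = \begin{bmatrix} 01 & 10 & 00 \\ 10 & 01 & 01 \end{bmatrix} \;\Rightarrow\; M_{cbo} = \begin{bmatrix} 11 & 11 & 01 \end{bmatrix}, \qquad XOR_{below} = \begin{bmatrix} 1 \oplus 1 & 1 \oplus 1 & 0 \oplus 1 \end{bmatrix} = \begin{bmatrix} 0 & 0 & 1 \end{bmatrix}$$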
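The row and column calculations can be reproduced with a few lines of Python. This is a sketch under the assumption that each entry is a two-bit code with r = 10 and g = 01, as the example values suggest; the names `xor_bits`, `m_rbo`, `m_cbo`, `xor_right` and `xor_below` simply mirror the slide labels and are otherwise hypothetical.

```python
from functools import reduce
from operator import or_

# Assumed two-bit codes: request = 10, grant = 01, no action = 00.
R, G = 0b10, 0b01

M = [
    [G, R, 0],  # p1: holds q1, requests q2
    [R, G, G],  # p2: requests q1, holds q2 and q3
]


def xor_bits(code):
    """XOR the request bit with the grant bit of a two-bit code."""
    return (code >> 1) ^ (code & 1)


# Bitwise OR across each row and down each column.
m_rbo = [reduce(or_, row, 0) for row in M]
m_cbo = [reduce(or_, col, 0) for col in zip(*M)]

# A result of 1 marks a row or column that contains only requests or only grants.
xor_right = [xor_bits(code) for code in m_rbo]
xor_below = [xor_bits(code) for code in m_cbo]
```

With these inputs `xor_below` flags only the third column, which holds a single grant and no request.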

SLIDE 19

Result of first iteration

$$XOR_{right} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \qquad XOR_{below} = \begin{bmatrix} 0 & 0 & 1 \end{bmatrix}$$

  • Based on result, we set all entries in column 3 to zero:

$$M = \begin{bmatrix} g & r & 0 \\ r & g & 0 \end{bmatrix}$$

SLIDE 20

Multiple Iterations

  • We continue iterating in this way until M no longer changes
  • When finished, if M is all zeros, we have no deadlock; otherwise, we do have deadlock
  • This algorithm requires at most 2*min(m,n) iterations
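The iteration described above can be sketched sequentially in Python. This is an illustrative software model, not the hardware design: the DDU evaluates all rows and columns in parallel, whereas this loop visits them one by one. The two-bit codes (r = 10, g = 01), the function names, and the way `steps` counts reductions are assumptions.

```python
R, G = 0b10, 0b01  # assumed two-bit codes: request = 10, grant = 01


def is_terminal(code):
    """True when a bitwise-OR result contains only requests or only grants."""
    return ((code >> 1) ^ (code & 1)) == 1


def detect_deadlock(matrix):
    """Reduce a copy of the matrix; return (deadlock_found, reduction_steps)."""
    m = [row[:] for row in matrix]
    steps = 0
    while True:
        row_or = [0] * len(m)
        col_or = [0] * (len(m[0]) if m else 0)
        for i, row in enumerate(m):
            for j, cell in enumerate(row):
                row_or[i] |= cell
                col_or[j] |= cell
        terminal_rows = {i for i, code in enumerate(row_or) if is_terminal(code)}
        terminal_cols = {j for j, code in enumerate(col_or) if is_terminal(code)}
        if not terminal_rows and not terminal_cols:
            break  # no more changes are possible
        # Set every entry in a terminal row or column to zero.
        for i, row in enumerate(m):
            for j in range(len(row)):
                if i in terminal_rows or j in terminal_cols:
                    row[j] = 0
        steps += 1
    deadlock = any(cell for row in m for cell in row)
    return deadlock, steps


cyclic = [[G, R, 0], [R, G, G]]   # the two-processor SoC example
acyclic = [[G, R, 0], [0, G, 0]]  # p2 holds q2 and waits for nothing
result_cyclic = detect_deadlock(cyclic)
result_acyclic = detect_deadlock(acyclic)
```

For the SoC example the loop clears column 3 in one step and then stops with a non-zero matrix, so it reports deadlock; the second matrix reduces to all zeros in two steps.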

SLIDE 21

Overall Outline

  • Motivation - Technology Trends
  • Background - Deadlock Detection
  • Parallel Algorithm
  • Parallel Architecture
  • Experimental Results
  • Conclusion

SLIDE 22

3 Processors/3 Resources: Architecture

SLIDE 23

Overall Outline

  • Motivation - Technology Trends
  • Background - Deadlock Detection
  • Parallel Algorithm
  • Parallel Architecture
  • Experimental Results
  • Conclusion

SLIDE 24

Experiments

  • Assumption
      • Software Cycle: 83.3 MHz processor
      • Hardware Cycle:
          • Synthesized from gate-level description
          • Clock as fast as critical path (e.g., 4.12 ns ⇒ 242 MHz clock)
          • Clock same as CPU clock: 83.3 MHz (12 ns cycle time)
  • Simulation
      • Previous Algorithm: PowerPC 750 runs .c in Seamless CVE
      • Proposed Algorithm: Synopsys VCS runs .v
  • ~100-1000 times faster
      • 99% run time reduction

SLIDE 25

Area and Delays of DDU

|P| Times |Q|   Lines of Verilog   Area (AMI 0.3u)   Delay/Step (ns)   Worst Case (# steps)   Worst Case, Custom Clk (ns)   Worst Case, 83.3 MHz (ns)
2x3             49                 186               0.91              2                      1.82                          24
5x5             73                 264               2.21              5                      11.05                         60
7x7             102                455               2.51              7                      17.57                         84
10x10           162                622               3.66              10                     36.6                          120
50x50           2682               14142             4.12              50                     206                           600
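The two worst-case columns appear to be the worst-case step count multiplied by the time per step: the synthesized delay per step for the custom clock, and the 12 ns CPU cycle for the 83.3 MHz clock. The short Python check below reproduces the figures under that reading, which is inferred from the numbers rather than stated on the slide; the variable and function names are hypothetical.

```python
# (size, worst-case steps, delay per step in ns), copied from the DDU table.
rows = [
    ("2x3", 2, 0.91),
    ("5x5", 5, 2.21),
    ("7x7", 7, 2.51),
    ("10x10", 10, 3.66),
    ("50x50", 50, 4.12),
]

CPU_CYCLE_NS = 12.0  # 83.3 MHz clock, one step per cycle (assumed)


def worst_case_ns(steps, ns_per_step):
    """Worst-case detection time as steps times time per step."""
    return round(steps * ns_per_step, 2)


custom_clock = {size: worst_case_ns(steps, delay) for size, steps, delay in rows}
cpu_clock = {size: worst_case_ns(steps, CPU_CYCLE_NS) for size, steps, _ in rows}
```

For the 50x50 unit this gives 206 ns at the custom clock and 600 ns at the CPU clock, matching the table.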

SLIDE 26

Hardware vs. Software Performance

[Figure: plot with axes labeled Number of Cycles and Number of Edges]

SLIDE 27

Example: Lookup Service

SLIDE 28

Example SoC Architecture

SLIDE 29

Event Sequence of the Example

Time   Event No.   Events
t1     e1          MPC750-1 requests FFT, MPEG; both are granted to MPC750-1 immediately.
t2     e2          MPC750-3 requests FFT, PCI; PCI is granted to MPC750-3 immediately.
t3     e3          MPC750-2 requests FFT, MPEG.
t4     e4          FFT is released by MPC750-1.
t5     e5          FFT is granted to MPC750-2.

SLIDE 30

Adjacency Matrices

SLIDE 31

Sequence of Events

SLIDE 32

Deadlock Detection Time and Total Execution Time

Method of Deadlock Detection   Detection Time ∆ (cycles)   t5 + ∆
Software                       16,038                      23,261
DDU                            2                           7,225

$$S_{overall} = \frac{23{,}261 - 7{,}225}{23{,}261} = 68.9\%$$

SLIDE 33

Conclusion

  • Deadlock Detection Unit
      • Very small area, even for 50x50
      • $O_{sw}(m*n)$ to $O_{hw}(\min(m,n))$ speedup
      • Linear scalability in min(m,n)
      • Handles simultaneous requests/grants
  • DDU can be used by multiprocessor SoC software code to detect deadlock quickly and then, for example, release resources to get out of deadlock

SLIDE 34

Future Work

  • Integrate DDU into an RTOS
      • Monitor DDU output
      • DDU API
      • Extend to handle multiple “blocked wait” threads on one CPU: RTOS on each processor aggregates requests which have the blocked wait property ⇒ each aggregate group is represented by a unique “processor” row
      • Try different recovery schemes
          • Perhaps some hardware assist in recovery