A System-on-a-Chip Lock Cache with Task Preemption Support - PowerPoint PPT Presentation



SLIDE 1

A System-on-a-Chip Lock Cache with Task Preemption Support

By Bilge S. Akgul, Jaehwan Lee and Vincent J. Mooney
Georgia Institute of Technology
School of Electrical and Computer Engineering

SLIDE 2

Outline

  • Introduction
  • Background
  • Lock Synchronization Problems
  • Our Methodology
  • Hardware and Software Designs
  • Experiments and Results
  • Conclusion

SLIDE 3

Introduction

  • Multi-processor shared memory SoC
  • Intertask/interprocess synchronization
  • Lock synchronization overheads
    • Lock delay, lock latency
    • Memory bandwidth consumption
  • Aim:
    • Reduce overheads
    • Improve Real-Time (RT) predictability

SLIDE 4

Background

  • Critical Section
    Code section where shared data between multiple execution units is accessed
    E.g., multiple readers and multiple writers
    A lock is necessary to guarantee the consistency of shared data (e.g., global variables)
  • Lock Delay
    Time between release and acquisition of a lock
  • Lock Latency
    Time to acquire a lock in the absence of contention
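The critical-section idea above can be sketched in C (our illustration, not from the slides): two writer threads update a shared counter, and a POSIX mutex serializes the read-modify-write so no increments are lost.

```c
#include <pthread.h>

#define ITERS 100000

static long shared_counter;                      /* shared data (e.g., a global) */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *writer(void *arg) {
    (void)arg;
    for (int i = 0; i < ITERS; i++) {
        pthread_mutex_lock(&lock);               /* enter critical section */
        shared_counter++;                        /* shared read-modify-write */
        pthread_mutex_unlock(&lock);             /* leave critical section */
    }
    return 0;
}

/* Run two concurrent writers; with the lock, the final count is exact. */
long run_two_writers(void) {
    pthread_t t1, t2;
    shared_counter = 0;
    pthread_create(&t1, 0, writer, 0);
    pthread_create(&t2, 0, writer, 0);
    pthread_join(t1, 0);
    pthread_join(t2, 0);
    return shared_counter;
}
```

Without the mutex, the two read-modify-write sequences could interleave and drop updates; the lock is what guarantees the consistency the slide refers to.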

SLIDE 5

Problems

  • Ensuring mutual exclusiveness
  • Communication bandwidth consumption
  • Eliminate busy-wait problems
    • Busy-wait: if the lock is busy, processors spin on the memory bus
  • Effective lock hand-off necessary
    • Fair
    • Predictive

SLIDE 6

Previous Work

  • Spin-lock alternatives (Anderson '90)
    Spin-on-read (spin on cache), delays in spin-loops
  • Queue-based software locks
    Array-based queuing (Anderson '90)
    MCS locks (Mellor-Crummey, Scott '91)
    LH and M locks (Landin, Hagersten, Magnusson '94)
  • Queue-based hardware locks
    QOLB (Kägi '99) – makes use of collocation
  • Cache-based locks (Ramachandran '96)
    Memory consistency model
    New cache design, extra cache states for locks
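To make the spin-on-read idea concrete, here is a small sketch using C11 atomics (our code; the cited works target multiprocessor hardware): a plain test-and-set lock issues a bus read-modify-write on every failed attempt, while spin-on-read (test-and-test-and-set) spins on a cached copy and only tries the atomic exchange when the lock looks free.

```c
#include <stdatomic.h>

static atomic_int the_lock = 0;   /* 0 = free, 1 = held */

/* Naive busy-wait: every failed attempt is an atomic bus transaction. */
void tas_lock(void) {
    while (atomic_exchange(&the_lock, 1) == 1)
        ;  /* spins on the memory bus */
}

/* Spin-on-read: spin on a plain load (hits in local cache while the
 * lock is held), and only issue the exchange when the lock looks free. */
void ttas_lock(void) {
    for (;;) {
        while (atomic_load(&the_lock) == 1)
            ;  /* local cached read, no bus traffic */
        if (atomic_exchange(&the_lock, 1) == 0)
            return;  /* won the lock */
    }
}

void spin_unlock(void) {
    atomic_store(&the_lock, 0);
}
```

The bandwidth difference is the point: under contention, `tas_lock` keeps the bus busy, whereas `ttas_lock` generates bus traffic only around releases.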

SLIDE 7

Methodology

  • Custom hardware unit: SoC Lock Cache
  • Utilize advantages of SoC design
  • Short Critical Sections covered in DATE '01
  • Critical Sections may be long or short
  • Support preemption of tasks when necessary
    • Hardware-interrupt triggered notification
  • Lock requests handled on a processor-by-processor basis
  • Separate the lock variables according to the critical section lengths

SLIDE 8

SoC Lock Cache Hardware Mechanism

[Figure: processors P1, P2, ..., PN connect through arbitration logic to the SoC Lock Cache and shared memory]

SLIDE 9

Methodology

  • Multiple application tasks
  • Atalanta-RTOS
  • Multi-processor set-up with MPC750s
  • SoCLC provides lock synchronization among processors

[Figure: system stack: application software (tasks) on the Atalanta-RTOS running on MPC750 processors on the software side, with the SoC Lock Cache as a hardware extension]

SLIDE 10

Hardware Simulation Set-up

  • Seamless CVE from Mentor Graphics
  • 4 MPC750s
  • SoC Lock Cache Unit (SoCLC)
  • Shared Memory
  • Interface Logic

SLIDE 11

Tasks Execution Time Improvement

In the case of long Critical Sections, non-preemptive synchronization causes inefficient CPU utilization among tasks.

[Figure: task timelines on Processor 1 and Processor 2. Without preemption, Task 2 busy-waits while Task 1 accesses the CS. With preemption support, an interrupt lets Task 3 preempt Task 2 and run until the CS is free, at the cost of a context switch and ISR overhead.]

SLIDE 12

Software

  • Assume 64 tasks
  • Each lock keeps a lock-wait table of 64-bit entries (one bit per task)
  • Expandable to > 64
  • Tables accessed by ISR

[Figure: Lock 1, Lock 2, Lock 3, Lock 4, ..., Lock n, each with its own 64-entry lock-wait table (bits 0-63)]
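A minimal sketch of such a per-lock wait table (our code and function names, not the paper's): one 64-bit word per lock, one bit per task; the ISR would set a task's bit when it blocks on the lock and, on release, pick a waiter to wake.

```c
#include <stdint.h>

#define NUM_LOCKS 4   /* illustrative; the slide allows Lock 1..n */

/* Bit t of lock_wait[k] set => task t is waiting on lock k. */
static uint64_t lock_wait[NUM_LOCKS];

void mark_waiting(int lock_id, int task) {
    lock_wait[lock_id] |= (uint64_t)1 << task;
}

void clear_waiting(int lock_id, int task) {
    lock_wait[lock_id] &= ~((uint64_t)1 << task);
}

/* Pick the lowest-numbered waiting task, or -1 if none.
 * (A real RTOS could scan in priority order instead.) */
int pick_waiter(int lock_id) {
    uint64_t w = lock_wait[lock_id];
    if (w == 0) return -1;
    int t = 0;
    while (!(w & 1)) { w >>= 1; t++; }
    return t;
}
```

Expanding beyond 64 tasks, as the slide mentions, would just mean widening each entry to an array of words.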

SLIDE 13

Software

[Figure: long-CS lock flow across task1-task4 on PE1 and PE2. A task calls Lock_longCS, which reads the lock. If free, the task executes the long CS while holding the lock, then calls UnLock; the release triggers an interrupt, and the ISR/interrupt handler wakes a waiting task. If the acquire fails, the task is removed from the ready table and a context switch brings in a new task.]
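The flow above can be sketched roughly in C with stubbed RTOS calls (all names and details are our hypothetical simplifications of the slide's diagram, not the actual Atalanta-RTOS API):

```c
#include <stdbool.h>

static bool lock_free = true;     /* the long-CS lock, simplified */
static int  blocked_count = 0;    /* tasks removed from the ready table */

/* --- stubbed RTOS primitives (hypothetical) --- */
static void remove_from_ready_table(void)   { blocked_count++; }
static void context_switch_to_new_task(void){ /* scheduler runs here */ }
static void raise_release_interrupt(void) {
    /* ISR would consult the lock-wait table and wake one waiter */
    if (blocked_count > 0) blocked_count--;
}

static int cs_runs = 0;
static void sample_cs(void) { cs_runs++; }   /* a stand-in long CS */

/* Returns true if the caller acquired the lock and ran the CS;
 * false if it blocked and yielded the processor instead. */
bool lock_longCS_and_run(void (*critical_section)(void)) {
    if (lock_free) {                     /* Read_lock: is it free? */
        lock_free = false;               /* acquire */
        critical_section();              /* execute long CS */
        lock_free = true;                /* UnLock */
        raise_release_interrupt();       /* notify waiters via ISR */
        return true;
    }
    remove_from_ready_table();           /* block instead of busy-waiting */
    context_switch_to_new_task();
    return false;
}
```

The key contrast with slide 5 is the failure path: a failed acquire costs a context switch rather than a busy-wait on the bus, which is what makes long critical sections tolerable.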

SLIDE 14

Experiments

  • With Atalanta RTOS
  • With 4 MPC750s
  • Database Example application (run with 40 tasks)

Database Application (database object flow)

[Figure: client and server address spaces, each with local memory, exchanging shared data through shared memory]

SLIDE 15

Experiments

Example Database Application Transactions

Observed performance improvement with the Lock Cache Unit:

  • 100% speedup in lock delay
  • 32% speedup in lock latency
  • 27% speedup in total execution time

SLIDE 16

Experiments

Long CS lock results (Atalanta RTOS, 40 tasks, 4 PEs):

                              Without SoCLC   With SoCLC   Speedup
  Lock Delay (clk cycles)            47,264       23,590     2.00x
  Lock Latency (clk cycles)           1,200          908     1.32x
  Exe. Time (clk cycles)              36.9M          29M     1.27x
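As a quick sanity check (ours, not from the slides), the speedup column follows from dividing the cycle count without the SoCLC by the count with it:

```c
/* speedup = cycles without SoCLC / cycles with SoCLC
 * e.g., lock delay: 47,264 / 23,590 is approximately 2.00x */
double speedup(double cycles_without_soclc, double cycles_with_soclc) {
    return cycles_without_soclc / cycles_with_soclc;
}
```

The same formula reproduces the 1.32x latency and 1.27x execution-time figures from the table above.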

SLIDE 17

Experiments

Small CS lock results (Atalanta RTOS, 40 tasks, 4 PEs):

                              Without SoCLC   With SoCLC   Speedup
  Lock Delay (clk cycles)             8,936          102     87.6x
  Lock Latency (clk cycles)             884           32       27x

SLIDE 18

Synthesis of SoCLC

Total area (gates) by number of short-CS locks (S) and long-CS locks (L); T = total number of locks (T = S + L):

    S     L     T   Total Area (gates)
   16    16    32    2,734
   16    32    48    3,586
   16    64    80    5,288
   16   128   144    9,027
   32    16    48    3,454
   32    32    64    4,306
   32    64    96    6,008
   32   128   160    9,747
   64    16    80    4,881
   64    32    96    5,733
   64    64   128    7,435
   64   128   192   11,174
  128    16   144    8,163
  128    32   160    9,015
  128    64   192   10,717
  128   128   256   14,456

  • TSMC 0.25 micron technology (Synopsys Behavioral Compiler)
slide-19
SLIDE 19

Conclusion Conclusion

  • A hardware mechanism for multi

A hardware mechanism for multi-

  • processor SoC

processor SoC Lock Synchronization: SoC Lock Cache Lock Synchronization: SoC Lock Cache

  • Reduction in lock latency, lock delay

Reduction in lock latency, lock delay

  • 27% overall speedup in an example database

27% overall speedup in an example database application application

  • Support

Support both both long Critical Sections and short long Critical Sections and short Critical Sections Critical Sections

  • Allow context

Allow context-

  • switching of tasks instead of busy

switching of tasks instead of busy-

  • waiting

waiting