An Asymmetric Multi-core Architecture for Accelerating Critical Sections - PowerPoint PPT Presentation



SLIDE 1

An Asymmetric Multi-core Architecture for Accelerating Critical Sections

  • M. Aater Suleman

Advisor: Yale Patt

HPS Research Group The University of Texas at Austin

SLIDE 2

Acknowledgements

Moinuddin Qureshi (IBM Research, HPS)
Onur Mutlu (Microsoft Research, HPS)
Eric Sprangle (Intel, HPS)
Anwar Rohillah (Intel)
Anwar Ghuloum (Intel)
Doug Carmean (Intel)

SLIDE 3

The Asymmetric Chip Multiprocessor (ACMP)

  • Provide one large core and many small cores
  • Accelerate the serial part on the large core
  • Execute the parallel part on the small cores for high throughput

[Figure: three equal-area chip organizations.
  ACMP Approach: one large core plus twelve Niagara-like small cores.
  “Niagara” Approach: sixteen Niagara-like small cores.
  “Tile-Large” Approach: four large cores.]

SLIDE 4

The 8-Puzzle Problem

[Figure: a sequence of 8-puzzle board configurations (tiles 1–8).]

SLIDE 5

The 8-Puzzle Problem

[Figure: 8-puzzle boards feeding the parallel solver below.]

    while (problem not solved)
        SubProblem = PriorityQ.remove()        ← critical section
        Solve(SubProblem)
        if (solved) break
        NewSubProblems = Partition(SubProblem)
        PriorityQ.insert(NewSubProblems)       ← critical section
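The loop above can be sketched as runnable Python (a toy stand-in for the slide's pseudocode; `solve` and `partition` are illustrative placeholders, not an actual 8-puzzle solver):

```python
import heapq
import threading

pq = []                       # priority queue of sub-problems
pq_lock = threading.Lock()    # guards every queue operation (the critical sections)
solved = threading.Event()
solution = []

def worker(solve, partition):
    while not solved.is_set():
        with pq_lock:                         # critical section: queue remove
            if not pq:
                return
            sub = heapq.heappop(pq)
        if solve(sub):                        # parallel part: expensive search
            solution.append(sub)
            solved.set()
            return
        children = partition(sub)
        with pq_lock:                         # critical section: queue insert
            for child in children:
                heapq.heappush(pq, child)
```

Several threads can run `worker` concurrently; every queue operation funnels through `pq_lock`, which is exactly the serialization the following slides analyze.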

SLIDE 6

Contention for Critical Sections

[Figure: execution timelines t1–t7 for Threads 1–4, before and after acceleration. When critical sections execute 2x faster, threads spend less time idle waiting on the contended critical section and total execution time shrinks. Legend: Critical Section / Parallel / Idle.]
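A back-of-the-envelope model (my own sketch, not from the slides) shows why contention matters: the parallel work splits N ways, but each thread's pass through the critical section serializes, so total time is bounded below both by the per-thread work and by the serialized critical-section time:

```python
def exec_time(n_threads, parallel_work, cs_work, cs_speedup=1.0):
    """Lower-bound execution time: parallel work splits N ways,
    but the critical section runs once per thread, serialized."""
    cs = cs_work / cs_speedup
    per_thread = parallel_work / n_threads + cs
    serialized_cs = n_threads * cs
    return max(per_thread, serialized_cs)
```

With few threads the parallel part dominates and accelerating the critical section barely helps; with many threads the serialized critical sections become the bottleneck and a 2x faster critical section cuts total time substantially.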

SLIDE 7

MySQL Database

    LOCK_open.Acquire()
    foreach (table locked by thread)
        table.lock.release()
        table.file.release()
        if (table.temporary)
            table.close()
    LOCK_open.Release()

SLIDE 8

Conventional ACMP

[Figure: four small cores P1–P4 on an on-chip interconnect; P2 reaches the code EnterCS(); PriorityQ.insert(…); LeaveCS(). Shading marks the core executing the critical section.]

  • 1. P2 encounters a critical section
  • 2. Sends a request for the lock
  • 3. Acquires the lock
  • 4. Executes the critical section
  • 5. Releases the lock

SLIDE 9

Accelerated Critical Sections (ACS)

[Figure: large core P1 and small cores P2–P4 on an on-chip interconnect, with a Critical Section Request Buffer (CSRB) attached to P1; P2 reaches the code EnterCS(); PriorityQ.insert(…); LeaveCS(). Shading marks the core executing the critical section.]

  • 1. P2 encounters a critical section
  • 2. P2 sends a CSCALL request to the CSRB
  • 3. P1 executes the critical section
  • 4. P1 sends the CSDONE signal

SLIDE 10

Architecture Overview

  • ISA extensions
      – CSCALL LOCK_ADDR, TARGET_PC
      – CSRET LOCK_ADDR
  • Compiler/library inserts CSCALL/CSRET
  • On a CSCALL, the small core:
      – Sends a CSCALL request to the large core
          • Arguments: lock address, target PC, stack pointer, core ID
      – Stalls and waits for CSDONE
  • Large core:
      – Critical Section Request Buffer (CSRB)
      – Executes the critical section and sends CSDONE to the requesting core
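A software analogue of this mechanism (my sketch, not the hardware design: threads stand in for cores and a queue for the CSRB) makes the CSCALL/CSDONE handshake concrete. "Small cores" ship a critical section to a dedicated "large core" thread and stall until CSDONE, so shared data is only ever touched by one thread:

```python
import queue
import threading

csrb = queue.Queue()          # stand-in for the Critical Section Request Buffer

def large_core():
    """Stand-in for the large core: drains the CSRB, runs each shipped
    critical section, and signals CSDONE back to the requester."""
    while True:
        req = csrb.get()
        if req is None:                   # shutdown sentinel
            return
        fn, args, done = req
        fn(*args)                         # execute the critical section
        done.set()                        # CSDONE: wake the requesting "small core"

def cscall(fn, *args):
    """Small-core side of CSCALL: ship the work, stall until CSDONE."""
    done = threading.Event()
    csrb.put((fn, args, done))
    done.wait()
```

Because only the `large_core` thread ever executes the shipped functions, the shared data they touch needs no lock and stays hot in one place, mirroring the cache-locality argument on the next slides.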

SLIDE 11

“False” Serialization

  • Independent critical sections protect disjoint data
  • Conventional systems can execute independent critical sections
    concurrently, but ACS can artificially serialize their execution
  • Selective Acceleration of Critical Sections (SEL)
      – Augment the CSRB with saturating counters that track false serialization

[Figure: a CSRB holding CSCALL(A), CSCALL(A), CSCALL(B), with per-lock saturating counters for locks A and B.]
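The SEL idea can be illustrated with a toy policy (a hypothetical sketch; the slides do not give the exact update rules or threshold): keep one saturating counter per lock, bump it whenever a CSCALL finds requests for a *different* lock already queued, and stop accelerating a lock whose counter saturates:

```python
SAT_MAX = 7          # hypothetical saturation threshold
counters = {}        # lock address -> saturating counter

def on_cscall(lock, queued_locks):
    """Return True to accelerate this critical section on the large core,
    False to fall back to conventional locking on the small core."""
    c = counters.get(lock, 0)
    if any(other != lock for other in queued_locks):
        c = min(SAT_MAX, c + 1)   # queued behind an independent critical section
    else:
        c = max(0, c - 1)         # no evidence of false serialization
    counters[lock] = c
    return c < SAT_MAX
```

A lock that repeatedly waits behind unrelated critical sections saturates its counter and reverts to local execution; when the evidence decays, it becomes eligible for acceleration again.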

SLIDE 12

Performance Trade-offs in ACS

  • Fewer concurrent threads
      – As the number of cores increases:
          • The marginal loss in parallel performance decreases
          • With more threads, contention for critical sections increases,
            which makes their acceleration more beneficial
  • Overhead of CSCALL/CSDONE
      – Offset by fewer cache misses for the lock variable
  • Cache misses for private data
      – Offset by fewer misses for shared data; misses decrease overall
        when shared data outweighs private data
      – The large core tolerates cache-miss latencies better than small cores

SLIDE 13

Experimental Methodology

  • Configurations
      – One large core is the size of 4 small cores
      – At chip area equal to N small cores:
          • Symmetric CMP (SCMP): N small cores, conventional locking
          • Asymmetric CMP (ACMP): 1 large core, N – 4 small cores, conventional locking
          • ACS: 1 large core, N – 4 small cores, (N – 4)-entry CSRB
  • Workloads
      – 12 critical-section-intensive applications from various domains
      – 7 use coarse-grain locks and 5 use fine-grain locks
  • Simulation parameters
      – x86 cycle-accurate processor simulator
      – Large core: similar to Pentium-M with 2-way SMT; 2 GHz, out-of-order, 128-entry window, 4-wide, 12-stage
      – Small core: similar to Pentium, 2 GHz, in-order, 2-wide, 5-stage
      – Private 32 KB L1, private 256 KB L2, 8 MB shared L3
      – On-chip interconnect: bidirectional ring

SLIDE 14

Workloads with Coarse-Grain Locks

Equal-area comparison; number of threads = best number of threads.

[Chart: speedup at chip area = 16 small cores. SCMP = 16 small cores; ACMP/ACS = 1 large and 12 small cores.]

[Chart: speedup at chip area = 32 small cores. SCMP = 32 small cores; ACMP/ACS = 1 large and 28 small cores.]

SLIDE 15

Workloads with Fine-Grain Locks

[Charts: speedup at chip area = 16 small cores and at chip area = 32 small cores.]

SLIDE 16

Equal-Area Comparisons

Number of threads = number of cores.

[Charts: speedup over a single small core vs. chip area (in small cores).]

SLIDE 17

ACS on Symmetric CMP

SLIDE 18

Conclusion

  • ACS reduces average execution time by:
      – 34% compared to an equal-area SCMP
      – 23% compared to an equal-area ACMP
  • ACS improves the scalability of 7 of the 12 workloads
  • Future work will examine resource allocation in ACS in the presence of multiple applications