An Asymmetric Multi-core Architecture for Accelerating Critical Sections - PowerPoint PPT Presentation



SLIDE 1

An Asymmetric Multi-core Architecture for Accelerating Critical Sections

  • M. Aater Suleman

Advisor: Yale Patt

HPS Research Group The University of Texas at Austin

SLIDE 2

Acknowledgements

Moinuddin Qureshi (IBM Research, HPS)
Onur Mutlu (Microsoft Research, HPS)
Eric Sprangle (Intel, HPS)
Anwar Rohillah (Intel)
Anwar Ghuloum (Intel)
Doug Carmean (Intel)

SLIDE 3

The Asymmetric Chip Multiprocessor (ACMP)

  • Provide one large core and many small cores
  • Accelerate the serial part on the large core
  • Execute the parallel part on the small cores for high throughput

[Figure: three equal-area chip organizations.
  ACMP Approach: one large core plus twelve Niagara-like small cores.
  “Niagara” Approach: sixteen Niagara-like small cores.
  “Tile-Large” Approach: four large cores.]

SLIDE 4

The 8-Puzzle Problem

[Figure: a sequence of 8-puzzle board configurations (tiles 1–8).]

SLIDE 5

The 8-Puzzle Problem

[Figure: 8-puzzle boards feeding the parallel solver below.]

    while (problem not solved)
        SubProblem = PriorityQ.remove()        ← critical section
        Solve(SubProblem)
        if (solved) break
        NewSubProblems = Partition(SubProblem)
        PriorityQ.insert(NewSubProblems)       ← critical section
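The loop above can be sketched as runnable Python (a toy stand-in for the slide's pseudocode; `solve` and `partition` are illustrative placeholders, not an actual 8-puzzle solver):

```python
import heapq
import threading

pq = []                       # priority queue of sub-problems
pq_lock = threading.Lock()    # guards every queue operation (the critical sections)
solved = threading.Event()
solution = []

def worker(solve, partition):
    while not solved.is_set():
        with pq_lock:                         # critical section: queue remove
            if not pq:
                return
            sub = heapq.heappop(pq)
        if solve(sub):                        # parallel part: expensive search
            solution.append(sub)
            solved.set()
            return
        children = partition(sub)
        with pq_lock:                         # critical section: queue insert
            for child in children:
                heapq.heappush(pq, child)
```

Several threads can run `worker` concurrently; every queue operation funnels through `pq_lock`, which is exactly the serialization the following slides analyze.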

SLIDE 6

Contention for Critical Sections

[Figure: execution timelines t1–t7 for Threads 1–4, before and after acceleration. When critical sections execute 2x faster, threads spend less time idle waiting on the contended critical section and total execution time shrinks. Legend: Critical Section / Parallel / Idle.]
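A back-of-the-envelope model (my own sketch, not from the slides) shows why contention matters: the parallel work splits N ways, but each thread's pass through the critical section serializes, so total time is bounded below both by the per-thread work and by the serialized critical-section time:

```python
def exec_time(n_threads, parallel_work, cs_work, cs_speedup=1.0):
    """Lower-bound execution time: parallel work splits N ways,
    but the critical section runs once per thread, serialized."""
    cs = cs_work / cs_speedup
    per_thread = parallel_work / n_threads + cs
    serialized_cs = n_threads * cs
    return max(per_thread, serialized_cs)
```

With few threads the parallel part dominates and accelerating the critical section barely helps; with many threads the serialized critical sections become the bottleneck and a 2x faster critical section cuts total time substantially.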

SLIDE 7

MySQL Database

    LOCK_open.Acquire()
    foreach (table locked by thread)
        table.lock.release()
        table.file.release()
        if (table.temporary)
            table.close()
    LOCK_open.Release()

SLIDE 8

Conventional ACMP

[Figure: four small cores P1–P4 on an on-chip interconnect; P2 reaches the code EnterCS(); PriorityQ.insert(…); LeaveCS(). Shading marks the core executing the critical section.]

  • 1. P2 encounters a critical section
  • 2. Sends a request for the lock
  • 3. Acquires the lock
  • 4. Executes the critical section
  • 5. Releases the lock

SLIDE 9

Accelerated Critical Sections (ACS)

[Figure: large core P1 and small cores P2–P4 on an on-chip interconnect, with a Critical Section Request Buffer (CSRB) attached to P1; P2 reaches the code EnterCS(); PriorityQ.insert(…); LeaveCS(). Shading marks the core executing the critical section.]

  • 1. P2 encounters a critical section
  • 2. P2 sends a CSCALL request to the CSRB
  • 3. P1 executes the critical section
  • 4. P1 sends the CSDONE signal

SLIDE 10

Architecture Overview

  • ISA extensions
      – CSCALL LOCK_ADDR, TARGET_PC
      – CSRET LOCK_ADDR
  • Compiler/library inserts CSCALL/CSRET
  • On a CSCALL, the small core:
      – Sends a CSCALL request to the large core
          • Arguments: lock address, target PC, stack pointer, core ID
      – Stalls and waits for CSDONE
  • Large core:
      – Critical Section Request Buffer (CSRB)
      – Executes the critical section and sends CSDONE to the requesting core
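A software analogue of this mechanism (my sketch, not the hardware design: threads stand in for cores and a queue for the CSRB) makes the CSCALL/CSDONE handshake concrete. "Small cores" ship a critical section to a dedicated "large core" thread and stall until CSDONE, so shared data is only ever touched by one thread:

```python
import queue
import threading

csrb = queue.Queue()          # stand-in for the Critical Section Request Buffer

def large_core():
    """Stand-in for the large core: drains the CSRB, runs each shipped
    critical section, and signals CSDONE back to the requester."""
    while True:
        req = csrb.get()
        if req is None:                   # shutdown sentinel
            return
        fn, args, done = req
        fn(*args)                         # execute the critical section
        done.set()                        # CSDONE: wake the requesting "small core"

def cscall(fn, *args):
    """Small-core side of CSCALL: ship the work, stall until CSDONE."""
    done = threading.Event()
    csrb.put((fn, args, done))
    done.wait()
```

Because only the `large_core` thread ever executes the shipped functions, the shared data they touch needs no lock and stays hot in one place, mirroring the cache-locality argument on the next slides.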

SLIDE 11

“False” Serialization

  • Independent critical sections protect disjoint data
  • Conventional systems can execute independent critical sections
    concurrently, but ACS can artificially serialize their execution
  • Selective Acceleration of Critical Sections (SEL)
      – Augment the CSRB with saturating counters that track false serialization

[Figure: a CSRB holding CSCALL(A), CSCALL(A), CSCALL(B), with per-lock saturating counters for locks A and B.]
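The SEL idea can be illustrated with a toy policy (a hypothetical sketch; the slides do not give the exact update rules or threshold): keep one saturating counter per lock, bump it whenever a CSCALL finds requests for a *different* lock already queued, and stop accelerating a lock whose counter saturates:

```python
SAT_MAX = 7          # hypothetical saturation threshold
counters = {}        # lock address -> saturating counter

def on_cscall(lock, queued_locks):
    """Return True to accelerate this critical section on the large core,
    False to fall back to conventional locking on the small core."""
    c = counters.get(lock, 0)
    if any(other != lock for other in queued_locks):
        c = min(SAT_MAX, c + 1)   # queued behind an independent critical section
    else:
        c = max(0, c - 1)         # no evidence of false serialization
    counters[lock] = c
    return c < SAT_MAX
```

A lock that repeatedly waits behind unrelated critical sections saturates its counter and reverts to local execution; when the evidence decays, it becomes eligible for acceleration again.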

SLIDE 12

Performance Trade-offs in ACS

  • Fewer concurrent threads
      – As the number of cores increases:
          • The marginal loss in parallel performance decreases
          • With more threads, contention for critical sections increases,
            which makes their acceleration more beneficial
  • Overhead of CSCALL/CSDONE
      – Offset by fewer cache misses for the lock variable
  • Cache misses for private data
      – Offset by fewer misses for shared data; misses decrease overall
        when shared data outweighs private data
      – The large core tolerates cache-miss latencies better than small cores

SLIDE 13

Experimental Methodology

  • Configurations
      – One large core is the size of 4 small cores
      – At chip area equal to N small cores:
          • Symmetric CMP (SCMP): N small cores, conventional locking
          • Asymmetric CMP (ACMP): 1 large core, N – 4 small cores, conventional locking
          • ACS: 1 large core, N – 4 small cores, (N – 4)-entry CSRB
  • Workloads
      – 12 critical-section-intensive applications from various domains
      – 7 use coarse-grain locks and 5 use fine-grain locks
  • Simulation parameters
      – x86 cycle-accurate processor simulator
      – Large core: similar to Pentium-M with 2-way SMT; 2 GHz, out-of-order, 128-entry window, 4-wide, 12-stage
      – Small core: similar to Pentium, 2 GHz, in-order, 2-wide, 5-stage
      – Private 32 KB L1, private 256 KB L2, 8 MB shared L3
      – On-chip interconnect: bidirectional ring

SLIDE 14

Workloads with Coarse-Grain Locks

Equal-area comparison; number of threads = best number of threads.

[Chart: speedup at chip area = 16 small cores. SCMP = 16 small cores; ACMP/ACS = 1 large and 12 small cores.]

[Chart: speedup at chip area = 32 small cores. SCMP = 32 small cores; ACMP/ACS = 1 large and 28 small cores.]

SLIDE 15

Workloads with Fine-Grain Locks

[Charts: speedup at chip area = 16 small cores and at chip area = 32 small cores.]

SLIDE 16

Equal-Area Comparisons

Number of threads = number of cores.

[Charts: speedup over a single small core vs. chip area (in small cores).]

SLIDE 17

ACS on Symmetric CMP

SLIDE 18

Conclusion

  • ACS reduces average execution time by:
      – 34% compared to an equal-area SCMP
      – 23% compared to an equal-area ACMP
  • ACS improves the scalability of 7 of the 12 workloads
  • Future work will examine resource allocation in ACS in the presence of multiple applications