

SLIDE 1

Multi-core Design

Virendra Singh

Associate Professor Computer Architecture and Dependable Systems Lab Department of Electrical Engineering Indian Institute of Technology Bombay

http://www.ee.iitb.ac.in/~viren/ E-mail: viren@ee.iitb.ac.in

EE-739: Processor Design

Lecture 37 (16 April 2013)

SLIDE 2

OS Code vs. User Code

  • Operating systems are usually huge programs that can overwhelm the cache and TLB due to code and data size.
  • Operating systems may impact branch prediction performance, because of frequent branches and infrequent loops.
  • OS execution is often brief and intermittent, invoked by interrupts, exceptions, or system calls, and can cause the replacement of useful cache, TLB and branch prediction state for little or no benefit.
  • The OS may perform explicit cache/TLB invalidation, and other operations not common in user-mode code.

16 Apr 2013 EE-739@IITB 2

SLIDE 3

SPECInt Workload Execution Cycle Breakdown

  • Percentage of execution cycles for OS kernel instructions:
    – During program startup: 18%, mostly due to data TLB misses.
    – Steady state: 5%, still dominated by TLB misses.

SLIDE 4

Breakdown of Kernel Time for SPECInt95

SLIDE 5

SPECInt95 Dynamic Instruction Mix

  • Percentage of dynamic instructions in the SPECInt workload by instruction type.
  • The percentages in parentheses for memory operations represent the proportion of loads and stores that are to physical addresses.
  • A percentage breakdown of branch instructions is also included.
  • For conditional branches, the number in parentheses represents the percentage of conditional branches that are taken.

SLIDE 6

SPECInt95 Total Miss Rates & Distribution of Misses

  • The miss categories are percentages of all user and kernel misses.
  • Bold entries signify kernel-induced interference.
  • User–kernel conflicts are misses in which the user thread conflicted with some type of kernel activity (the kernel executing on behalf of this user thread, some other user thread, a kernel thread, or an interrupt).

SLIDE 7

Metrics for SPECInt95 with and without the Operating System for both SMT and Superscalar

  • The maximum issue for integer programs is 6 instructions on the 8-wide SMT, because there are only 6 integer units.

SLIDE 8

SMT processor: both threads can run concurrently

[Diagram: one SMT pipeline (BTB and I-TLB, Decoder, Trace Cache, Rename/Alloc, uop queues, Schedulers, Integer and Floating Point units, L1 D-Cache, D-TLB, uCode ROM, BTB, L2 Cache and Control Bus). Thread 1 uses the floating-point path while Thread 2 performs an integer operation.]

SLIDE 9

But: Can’t simultaneously use the same functional unit

[Diagram: the same SMT pipeline; Thread 1 and Thread 2 both target the single Integer unit, marked IMPOSSIBLE.]

This scenario is impossible with SMT on a single core (assuming a single integer unit).

SLIDE 10

SMT not a “true” parallel processor

  • Enables better threading (e.g. up to 30%)
  • OS and applications perceive each simultaneous thread as a separate "virtual processor"
  • The chip has only a single copy of each resource
  • Compare to multi-core: each core has its own copy of resources
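The "virtual processor" view above is directly visible from software. A minimal sketch (Python; `os.cpu_count` reports logical processors, i.e. hardware threads, and may return `None` on exotic platforms, hence the fallback):

```python
import os

# The OS schedules onto "virtual processors": one per hardware thread.
# On a dual-core chip with 2-way SMT this reports 4 logical CPUs, even
# though many pipeline resources are shared within each core.
logical_cpus = os.cpu_count() or 1  # fall back to 1 if undetectable
print(f"OS sees {logical_cpus} virtual processors")
```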

SLIDE 11

Multi-core: threads can run on separate cores

[Diagram: two complete cores side by side, each with its own BTB and I-TLB, Decoder, Trace Cache, Rename/Alloc, uop queues, Schedulers, Integer and Floating Point units, L1 D-Cache, D-TLB, uCode ROM, and L2 Cache and Control Bus. Thread 1 runs on one core and Thread 2 on the other.]

SLIDE 12

Multi-core: threads can run on separate cores

[Diagram: the same dual-core layout, now running Thread 3 and Thread 4 on separate cores.]

SLIDE 13

Combining Multi-core and SMT

  • Cores can be SMT-enabled (or not)
  • The different combinations:
    – Single-core, non-SMT: standard uniprocessor
    – Single-core, with SMT
    – Multi-core, non-SMT
    – Multi-core, with SMT
  • The number of SMT threads: 2, 4, or sometimes 8 simultaneous threads
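The combinations above multiply: the number of threads the chip can run at once is cores times SMT ways per core. A trivial sketch (the helper name is invented for illustration):

```python
def hardware_threads(cores: int, smt_ways: int) -> int:
    """Threads the chip can run simultaneously: cores x SMT ways per core."""
    return cores * smt_ways

print(hardware_threads(1, 1))  # single-core, non-SMT (uniprocessor): 1
print(hardware_threads(1, 2))  # single-core with 2-way SMT: 2
print(hardware_threads(2, 1))  # dual-core, non-SMT: 2
print(hardware_threads(2, 2))  # dual-core with 2-way SMT: 4
```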

SLIDE 14

SMT Dual-core: all four threads can run concurrently

[Diagram: two SMT cores; Threads 1 and 2 share one core's pipeline while Threads 3 and 4 share the other's, so all four run concurrently.]

SLIDE 15

Comparison: Multi-core vs SMT

  • Multi-core:
    – Since there are several cores, each is smaller and not as powerful (but also easier to design and manufacture)
    – However, great with thread-level parallelism
  • SMT:
    – Can have one large and fast superscalar core
    – Great performance on a single thread
    – Mostly still only exploits instruction-level parallelism

SLIDE 16

IPC Performance of SMT and CMP

SPEC95 simulations [Eggers et al.]:
– CMP2: 2 processors, 4-issue superscalar, 2*(1,4)
– CMP4: 4 processors, 2-issue superscalar, 4*(1,2)
– SMT: 8-threaded, 8-issue superscalar, 1*(8,8)

SLIDE 17

The memory hierarchy

  • If simultaneous multithreading only: all caches shared
  • Multi-core chips:
    – L1 caches private
    – L2 caches private in some architectures and shared in others
  • Memory is always shared

SLIDE 18

Private vs shared caches

  • Advantages of private:
    – They are closer to the core, so faster access
    – Reduces contention
  • Advantages of shared:
    – Threads on different cores can share the same cache data
    – More cache space available if a single (or a few) high-performance thread runs on the system

SLIDE 19

The cache coherence problem

  • Since we have private caches: how do we keep the data consistent across caches?
  • Each core should perceive the memory as a monolithic array, shared by all the cores

SLIDE 20

The cache coherence problem

Suppose variable x initially contains 15213

[Diagram: multi-core chip with Cores 1–4, each with one or more levels of cache; main memory holds x = 15213.]

SLIDE 21

The cache coherence problem

Core 1 reads x

[Diagram: Core 1's cache now holds x = 15213; main memory x = 15213.]

SLIDE 22

The cache coherence problem

Core 2 reads x

[Diagram: Core 1's and Core 2's caches each hold x = 15213; main memory x = 15213.]

SLIDE 23

The cache coherence problem

Core 1 writes to x, setting it to 21660

[Diagram, assuming write-through caches: Core 1's cache holds x = 21660 and main memory is updated to 21660, but Core 2's cache still holds x = 15213.]

SLIDE 24

The cache coherence problem

Core 2 attempts to read x… gets a stale copy

[Diagram: Core 2 hits in its own cache and reads the stale x = 15213, even though main memory holds x = 21660.]
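The stale-read scenario on the last few slides can be reproduced with a toy simulation. This is a hypothetical sketch (class names invented) assuming write-through caches and no coherence protocol:

```python
class Cache:
    """Per-core write-through cache with NO coherence support."""

    def __init__(self, memory):
        self.lines = {}          # address -> cached value
        self.memory = memory     # shared main memory

    def read(self, addr):
        if addr not in self.lines:           # miss: fill from memory
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]              # hit: local copy, maybe stale

    def write(self, addr, value):
        self.lines[addr] = value             # write-through: update local
        self.memory[addr] = value            # copy and memory, but NOT
                                             # other cores' caches

memory = {"x": 15213}
core1, core2 = Cache(memory), Cache(memory)

core1.read("x")            # Core 1 reads x -> 15213 cached
core2.read("x")            # Core 2 reads x -> 15213 cached
core1.write("x", 21660)    # Core 1 writes x; memory is updated...
print(core2.read("x"))     # ...but Core 2 still sees the stale 15213
```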

SLIDE 25

Solutions for cache coherence

  • This is a general problem with multiprocessors, not limited just to multi-core
  • There exist many solution algorithms, coherence protocols, etc.
  • A simple solution: invalidation-based protocol with snooping

SLIDE 26

Inter-core bus

[Diagram: Cores 1–4, each with one or more levels of cache, connected by an inter-core bus; main memory below.]

SLIDE 27

Invalidation protocol with snooping

  • Invalidation: if a core writes to a data item, all other copies of this data item in other caches are invalidated
  • Snooping: all cores continuously "snoop" (monitor) the bus connecting the cores
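The two mechanisms above can be sketched as a toy simulation (hypothetical `Bus` and `SnoopingCache` classes, again assuming write-through caches): on a write, the writer broadcasts on the bus, and every other snooping cache drops its copy, so the earlier stale read can no longer happen.

```python
class Bus:
    """Inter-core bus that every cache snoops."""

    def __init__(self):
        self.caches = []

    def broadcast_invalidate(self, addr, writer):
        for cache in self.caches:
            if cache is not writer:
                cache.lines.pop(addr, None)  # invalidate other copies

class SnoopingCache:
    """Write-through cache using an invalidation protocol with snooping."""

    def __init__(self, memory, bus):
        self.lines = {}
        self.memory = memory
        self.bus = bus
        bus.caches.append(self)              # register as a snooper

    def read(self, addr):
        if addr not in self.lines:           # miss: fill from memory
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]

    def write(self, addr, value):
        self.bus.broadcast_invalidate(addr, self)  # invalidate sharers
        self.lines[addr] = value                   # then write through
        self.memory[addr] = value

memory, bus = {"x": 15213}, Bus()
core1 = SnoopingCache(memory, bus)
core2 = SnoopingCache(memory, bus)

core1.read("x"); core2.read("x")   # both caches hold x = 15213
core1.write("x", 21660)            # Core 2's copy is invalidated
print(core2.read("x"))             # miss -> reload from memory -> 21660
```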

SLIDE 28

The cache coherence problem

Revisited: Cores 1 and 2 have both read x

[Diagram: Core 1's and Core 2's caches each hold x = 15213; main memory x = 15213.]

SLIDE 29

The cache coherence problem

Core 1 writes to x, setting it to 21660

[Diagram, assuming write-through caches: Core 1 writes x = 21660 and sends an invalidation request on the inter-core bus; Core 2's copy of x is INVALIDATED; main memory is updated to 21660.]

SLIDE 30

The cache coherence problem

After invalidation:

[Diagram: only Core 1's cache holds x = 21660; Core 2's copy is gone; main memory x = 21660.]

SLIDE 31

The cache coherence problem

Core 2 reads x. Cache misses, and loads the new copy.

[Diagram: Core 2 misses and reloads x = 21660 from memory; both caches now hold the new value.]

SLIDE 32

Alternative to invalidate protocol: update protocol

Core 1 writes x=21660:

[Diagram, assuming write-through caches: Core 1 writes x = 21660 and broadcasts the updated value on the inter-core bus; Core 2's copy is UPDATED in place to 21660; main memory is updated.]

SLIDE 33

Which do you think is better? Invalidation or update?

SLIDE 34

Invalidation vs update

  • Multiple writes to the same location:
    – invalidation: only the first time
    – update: must broadcast each write (which includes the new variable value)
  • Invalidation generally performs better: it generates less bus traffic
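The traffic argument can be made concrete with a back-of-the-envelope sketch. It is deliberately idealized (invented helper names): it assumes a run of writes by one core with no intervening reads by other cores, so invalidation pays for one bus message while update pays for one per write.

```python
def invalidation_messages(num_writes: int) -> int:
    """First write invalidates the sharers; later writes hit a line
    that no other cache holds, so they stay off the bus."""
    return 1 if num_writes > 0 else 0

def update_messages(num_writes: int) -> int:
    """Every write must broadcast the new value to the sharers."""
    return num_writes

# 100 consecutive writes to one location by a single core:
print(invalidation_messages(100))  # 1 bus message
print(update_messages(100))        # 100 bus messages
```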
