  1. Multi-core Design
     Virendra Singh, Associate Professor
     Computer Architecture and Dependable Systems Lab, Department of Electrical Engineering, Indian Institute of Technology Bombay
     http://www.ee.iitb.ac.in/~viren/  E-mail: viren@ee.iitb.ac.in
     EE-739: Processor Design, Lecture 37 (16 April 2013)

  2. OS Code vs. User Code
     • Operating systems are usually huge programs that can overwhelm the cache and TLB due to code and data size.
     • Operating systems may impact branch prediction performance because of frequent branches and infrequent loops.
     • OS execution is often brief and intermittent, invoked by interrupts, exceptions, or system calls, and can cause the replacement of useful cache, TLB, and branch prediction state for little or no benefit.
     • The OS may perform explicit cache/TLB invalidation and other operations not common in user-mode code.

  3. SPECInt Workload Execution Cycle Breakdown
     • Percentage of execution cycles spent in OS kernel instructions:
       – During program startup: 18%, mostly due to data TLB misses.
       – Steady state: 5%, still dominated by TLB misses.

  4. Breakdown of kernel time for SPECInt95
     [Figure: breakdown of kernel time for SPECInt95.]

  5. SPECInt95 Dynamic Instruction Mix
     • Percentage of dynamic instructions in the SPECInt workload by instruction type.
     • The percentages in parentheses for memory operations represent the proportion of loads and stores that are to physical addresses.
     • A percentage breakdown of branch instructions is also included.
     • For conditional branches, the number in parentheses represents the percentage of conditional branches that are taken.

  6. SPECInt95 Total Miss Rates and Distribution of Misses
     • The miss categories are percentages of all user and kernel misses.
     • Bold entries signify kernel-induced interference.
     • User-kernel conflicts are misses in which the user thread conflicted with some type of kernel activity (the kernel executing on behalf of this user thread, some other user thread, a kernel thread, or an interrupt).

  7. Metrics for SPECInt95 with and without the Operating System, for both SMT and Superscalar
     • The maximum issue for integer programs is 6 instructions on the 8-wide SMT, because there are only 6 integer units.

  8. SMT processor: both threads can run concurrently
     [Figure: pipeline of a single SMT core (trace cache, uCode ROM, decoder, rename/alloc, uop queues, schedulers, integer and floating-point units, L1 D-cache/D-TLB, L2 cache and control, bus, BTB and I-TLB). Thread 1 occupies the floating-point unit while Thread 2 performs an integer operation.]
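
A minimal C++ sketch of the workload split shown in this diagram (illustrative only, not from the slides): one thread doing floating-point work and one doing integer work, which an SMT core can issue to its separate floating-point and integer units in the same cycles. The function names and loop bounds are arbitrary.

    #include <thread>
    #include <cstdio>

    // Thread 1: floating-point work; Thread 2: integer work.
    // On an SMT core the two instruction streams can occupy the FP and
    // integer units concurrently; the hardware interleaves them.

    double fp_work() {
        double s = 0.0;
        for (int i = 1; i <= 1000000; ++i) s += 1.0 / i;    // floating-point ops
        return s;
    }

    long int_work() {
        long s = 0;
        for (long i = 1; i <= 1000000; ++i) s += i * 3 + 1; // integer ops
        return s;
    }

    int main() {
        double d = 0; long l = 0;
        std::thread t1([&] { d = fp_work(); });   // Thread 1: floating point
        std::thread t2([&] { l = int_work(); });  // Thread 2: integer operation
        t1.join(); t2.join();
        std::printf("fp=%f int=%ld\n", d, l);
    }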

  9. But: threads can’t simultaneously use the same functional unit
     [Figure: the same SMT pipeline, with Thread 1 and Thread 2 both targeting the single integer unit. This scenario is impossible with SMT on a single core (assuming a single integer unit).]

  10. SMT not a “true” parallel processor
     • Enables better threading (e.g., throughput gains of up to 30%)
     • OS and applications perceive each simultaneous thread as a separate “virtual processor”
     • The chip has only a single copy of each resource
     • Compare to multi-core: each core has its own copy of resources
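
A quick way to observe the “virtual processor” view from software: a minimal C++ sketch (assuming a C++11 compiler) that asks the standard library how many hardware threads the OS exposes. On an SMT machine this is typically cores × SMT threads per core, and the call may legitimately return 0 if the count cannot be determined.

    #include <iostream>
    #include <thread>

    int main() {
        // std::thread::hardware_concurrency() reports the number of hardware
        // threads (logical processors) the OS exposes. On a 4-core CPU with
        // 2-way SMT it typically returns 8, even though there are only 4 cores.
        unsigned n = std::thread::hardware_concurrency();
        std::cout << "logical processors visible to this program: " << n << "\n";
    }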

  11. Multi-core: threads can run on separate cores
     [Figure: two complete cores side by side, each with its own trace cache, decoder, rename/alloc, uop queues, schedulers, integer and floating-point units, L1 D-cache/D-TLB, and L2 cache. Thread 1 runs on the first core and Thread 2 on the second.]

  12. Multi-core: threads can run on separate cores
     [Figure: the same dual-core diagram, now with Thread 3 on the first core and Thread 4 on the second.]

  13. Combining Multi-core and SMT
     • Cores can be SMT-enabled (or not)
     • The different combinations:
       – Single-core, non-SMT: standard uniprocessor
       – Single-core, with SMT
       – Multi-core, non-SMT
       – Multi-core, with SMT
     • The number of SMT threads: 2, 4, or sometimes 8 simultaneous threads

  14. SMT dual-core: all four threads can run concurrently
     [Figure: two SMT cores with Threads 1–4 distributed across them, two per core, so all four threads run at once.]

  15. Comparison: Multi-core vs SMT
     • Multi-core:
       – Since there are several cores, each is smaller and not as powerful (but also easier to design and manufacture)
       – However, great with thread-level parallelism
     • SMT:
       – Can have one large and fast superscalar core
       – Great performance on a single thread
       – Mostly still only exploits instruction-level parallelism

  16. IPC Performance of SMT and CMP
     SPEC95 simulations [Eggers et al.]:
     • CMP2: 2 processors, 4-issue superscalar, 2*(1,4)
     • CMP4: 4 processors, 2-issue superscalar, 4*(1,2)
     • SMT: 8-threaded, 8-issue superscalar, 1*(8,8)

  17. The memory hierarchy
     • If simultaneous multithreading only:
       – all caches shared
     • Multi-core chips:
       – L1 caches private
       – L2 caches private in some architectures and shared in others
     • Memory is always shared
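
One way to see which caches are private and which are shared on a particular machine: a C++ sketch assuming Linux, which exposes each CPU's cache topology under /sys/devices/system/cpu/; the exact files and their availability vary by kernel and platform.

    #include <fstream>
    #include <iostream>
    #include <string>

    // Print the level, type, and sharing set of each cache of CPU 0.
    // A shared_cpu_list containing only CPU 0 (or 0 plus its SMT sibling)
    // indicates a private cache; a longer list indicates a shared one.
    int main() {
        for (int idx = 0; ; ++idx) {
            std::string base = "/sys/devices/system/cpu/cpu0/cache/index" +
                               std::to_string(idx) + "/";
            std::ifstream level(base + "level"), type(base + "type"),
                          shared(base + "shared_cpu_list");
            if (!level) break;                    // no more cache indices
            std::string lvl, typ, cpus;
            std::getline(level, lvl);
            std::getline(type, typ);
            std::getline(shared, cpus);
            std::cout << "L" << lvl << " " << typ
                      << " shared by CPUs: " << cpus << "\n";
        }
    }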

  18. Private vs shared caches
     • Advantages of private:
       – They are closer to the core, so access is faster
       – Reduces contention
     • Advantages of shared:
       – Threads on different cores can share the same cache data
       – More cache space is available if a single (or a few) high-performance thread runs on the system

  19. The cache coherence problem
     • Since we have private caches: how do we keep the data consistent across caches?
     • Each core should perceive the memory as a monolithic array, shared by all the cores

  20. The cache coherence problem: suppose variable x initially contains 15213
     [Figure: four cores, each with one or more levels of private cache, on a multi-core chip; main memory holds x = 15213, and no cache holds x yet.]

  21. The cache coherence problem: Core 1 reads x
     [Figure: Core 1's cache now holds x = 15213; main memory still holds x = 15213.]

  22. The cache coherence problem: Core 2 reads x
     [Figure: Core 1's and Core 2's caches each hold x = 15213; main memory holds x = 15213.]

  23. The cache coherence problem: Core 1 writes to x, setting it to 21660
     [Figure: Core 1's cache holds x = 21660; Core 2's cache still holds x = 15213; main memory holds x = 21660, assuming write-through caches.]

  24. The cache coherence problem: Core 2 attempts to read x… and gets a stale copy
     [Figure: Core 1's cache holds x = 21660 and main memory holds x = 21660, but Core 2 reads x = 15213 from its own cache.]
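
The slides' scenario written as a small C++ program (an illustrative sketch; the variable name and values follow the slides). On real multi-core hardware the coherence protocol guarantees that the reading core eventually observes 21660 rather than its stale 15213; std::atomic is used here so the compiler cannot keep x in a register, while the cross-core propagation itself is the hardware's job.

    #include <atomic>
    #include <cstdio>
    #include <thread>

    std::atomic<int> x{15213};          // x initially contains 15213

    int main() {
        // "Core 1": writes x, setting it to 21660.
        std::thread writer([] { x.store(21660); });

        // "Core 2": keeps reading x. Hardware cache coherence (plus the
        // atomic, which stops the compiler from caching x in a register)
        // guarantees this loop terminates: the stale 15213 in the reader's
        // cache is invalidated or updated when the writer stores.
        std::thread reader([] {
            while (x.load() == 15213) { /* still seeing the old value */ }
            std::printf("reader now sees x = %d\n", x.load());
        });

        writer.join();
        reader.join();
    }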

  25. Solutions for cache coherence
     • This is a general problem with multiprocessors, not limited just to multi-core
     • There exist many solution algorithms, coherence protocols, etc.
     • A simple solution: invalidation-based protocol with snooping

  26. Inter-core bus
     [Figure: the four cores' caches and main memory are connected by a shared inter-core bus on the multi-core chip.]

  27. Invalidation protocol with snooping
     • Invalidation: if a core writes to a data item, all other copies of this data item in other caches are invalidated
     • Snooping: all cores continuously “snoop” (monitor) the bus connecting the cores
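
A toy model of the idea as a C++ sketch (a simplified valid/invalid scheme invented here for illustration, not a full protocol such as MSI or MESI): a single shared word, write-through caches as in the earlier slides, and a bus broadcast on every write that the other caches snoop and use to invalidate their copies.

    #include <array>
    #include <cstdio>

    constexpr int kCores = 4;

    enum class State { Invalid, Valid };

    struct CacheLine {
        State state = State::Invalid;
        int value = 0;
    };

    struct Chip {
        std::array<CacheLine, kCores> cache;   // one private copy of x per core
        int memory = 15213;                    // main memory copy of x

        int read(int core) {
            CacheLine &c = cache[core];
            if (c.state == State::Invalid) {   // miss: fetch from main memory
                c.value = memory;
                c.state = State::Valid;
            }
            return c.value;
        }

        void write(int core, int value) {
            for (int i = 0; i < kCores; ++i)   // broadcast on the bus: every
                if (i != core)                 // other cache snoops the write
                    cache[i].state = State::Invalid;   // and invalidates
            cache[core] = {State::Valid, value};
            memory = value;                    // write-through, as on the slides
        }
    };

    int main() {
        Chip chip;
        chip.read(0);             // core 1 reads x = 15213
        chip.read(1);             // core 2 reads x = 15213
        chip.write(0, 21660);     // core 1 writes x; core 2's copy invalidated
        std::printf("core 2 reads x = %d\n", chip.read(1));   // prints 21660
    }

Running this replays the earlier slides: after core 1 writes 21660, core 2's copy is invalid, so its next read misses and fetches the up-to-date value from memory instead of returning the stale 15213.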
