

SLIDE 1

Multi-core Design

Virendra Singh

Associate Professor Computer Architecture and Dependable Systems Lab Department of Electrical Engineering Indian Institute of Technology Bombay

http://www.ee.iitb.ac.in/~viren/ E-mail: viren@ee.iitb.ac.in

EE-739: Processor Design

Lecture 37 (16 April 2013)

SLIDE 2

OS Code vs. User Code

  • Operating systems are usually huge programs that can overwhelm the cache and TLB due to code and data size.
  • Operating systems may impact branch prediction performance, because of frequent branches and infrequent loops.
  • OS execution is often brief and intermittent, invoked by interrupts, exceptions, or system calls, and can cause the replacement of useful cache, TLB and branch prediction state for little or no benefit.
  • The OS may perform explicit cache/TLB invalidation, and other operations not common in user-mode code.

16 Apr 2013 EE-739@IITB 2

SLIDE 3

SPECInt Workload Execution Cycle Breakdown

  • Percentage of execution cycles for OS kernel instructions:
    – During program startup: 18%, mostly due to data TLB misses.
    – Steady state: 5%, still dominated by TLB misses.

SLIDE 4

Breakdown of Kernel Time for SPECInt95

SLIDE 5

SPECInt95 Dynamic Instruction Mix

  • Percentage of dynamic instructions in the SPECInt workload by instruction type.
  • The percentages in parentheses for memory operations represent the proportion of loads and stores that are to physical addresses.
  • A percentage breakdown of branch instructions is also included.
  • For conditional branches, the number in parentheses represents the percentage of conditional branches that are taken.

SLIDE 6

SPECInt95 Total Miss Rates & Distribution of Misses

  • The miss categories are percentages of all user and kernel misses.
  • Bold entries signify kernel-induced interference.
  • User–kernel conflicts are misses in which the user thread conflicted with some type of kernel activity (the kernel executing on behalf of this user thread, some other user thread, a kernel thread, or an interrupt).

SLIDE 7

Metrics for SPECInt95 with and without the Operating System for both SMT and Superscalar

  • The maximum issue for integer programs is 6 instructions on the 8-wide SMT, because there are only 6 integer units.

SLIDE 8

SMT processor: both threads can run concurrently

[Diagram: one SMT pipeline (BTB and I-TLB, Decoder, Trace Cache, Rename/Alloc, uop queues, Schedulers, Integer and Floating Point units, L1 D-Cache, D-TLB, uCode ROM, BTB, L2 Cache and Control Bus). Thread 1 uses the floating-point path while Thread 2 performs an integer operation.]

SLIDE 9

But: Can’t simultaneously use the same functional unit

[Diagram: the same SMT pipeline; Thread 1 and Thread 2 both target the single Integer unit, marked IMPOSSIBLE.]

This scenario is impossible with SMT on a single core (assuming a single integer unit).

SLIDE 10

SMT not a “true” parallel processor

  • Enables better threading (e.g. up to 30%)
  • OS and applications perceive each simultaneous thread as a separate "virtual processor"
  • The chip has only a single copy of each resource
  • Compare to multi-core: each core has its own copy of resources
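The "virtual processor" view above is directly visible from software. A minimal sketch (Python; `os.cpu_count` reports logical processors, i.e. hardware threads, and may return `None` on exotic platforms, hence the fallback):

```python
import os

# The OS schedules onto "virtual processors": one per hardware thread.
# On a dual-core chip with 2-way SMT this reports 4 logical CPUs, even
# though many pipeline resources are shared within each core.
logical_cpus = os.cpu_count() or 1  # fall back to 1 if undetectable
print(f"OS sees {logical_cpus} virtual processors")
```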

SLIDE 11

Multi-core: threads can run on separate cores

[Diagram: two complete cores side by side, each with its own BTB and I-TLB, Decoder, Trace Cache, Rename/Alloc, uop queues, Schedulers, Integer and Floating Point units, L1 D-Cache, D-TLB, uCode ROM, and L2 Cache and Control Bus. Thread 1 runs on one core and Thread 2 on the other.]

SLIDE 12

Multi-core: threads can run on separate cores

[Diagram: the same dual-core layout, now running Thread 3 and Thread 4 on separate cores.]

SLIDE 13

Combining Multi-core and SMT

  • Cores can be SMT-enabled (or not)
  • The different combinations:
    – Single-core, non-SMT: standard uniprocessor
    – Single-core, with SMT
    – Multi-core, non-SMT
    – Multi-core, with SMT
  • The number of SMT threads: 2, 4, or sometimes 8 simultaneous threads
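The combinations above multiply: the number of threads the chip can run at once is cores times SMT ways per core. A trivial sketch (the helper name is invented for illustration):

```python
def hardware_threads(cores: int, smt_ways: int) -> int:
    """Threads the chip can run simultaneously: cores x SMT ways per core."""
    return cores * smt_ways

print(hardware_threads(1, 1))  # single-core, non-SMT (uniprocessor): 1
print(hardware_threads(1, 2))  # single-core with 2-way SMT: 2
print(hardware_threads(2, 1))  # dual-core, non-SMT: 2
print(hardware_threads(2, 2))  # dual-core with 2-way SMT: 4
```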

SLIDE 14

SMT Dual-core: all four threads can run concurrently

[Diagram: two SMT cores; Threads 1 and 2 share one core's pipeline while Threads 3 and 4 share the other's, so all four run concurrently.]

SLIDE 15

Comparison: Multi-core vs SMT

  • Multi-core:
    – Since there are several cores, each is smaller and not as powerful (but also easier to design and manufacture)
    – However, great with thread-level parallelism
  • SMT:
    – Can have one large and fast superscalar core
    – Great performance on a single thread
    – Mostly still only exploits instruction-level parallelism

SLIDE 16

IPC Performance of SMT and CMP

SPEC95 simulations [Eggers et al.]:
– CMP2: 2 processors, 4-issue superscalar, 2*(1,4)
– CMP4: 4 processors, 2-issue superscalar, 4*(1,2)
– SMT: 8-threaded, 8-issue superscalar, 1*(8,8)

SLIDE 17

The memory hierarchy

  • If simultaneous multithreading only: all caches shared
  • Multi-core chips:
    – L1 caches private
    – L2 caches private in some architectures and shared in others
  • Memory is always shared

SLIDE 18

Private vs shared caches

  • Advantages of private:
    – They are closer to the core, so faster access
    – Reduces contention
  • Advantages of shared:
    – Threads on different cores can share the same cache data
    – More cache space available if a single (or a few) high-performance thread runs on the system

SLIDE 19

The cache coherence problem

  • Since we have private caches: how do we keep the data consistent across caches?
  • Each core should perceive the memory as a monolithic array, shared by all the cores

SLIDE 20

The cache coherence problem

Suppose variable x initially contains 15213

[Diagram: multi-core chip with Cores 1–4, each with one or more levels of cache; main memory holds x = 15213.]

SLIDE 21

The cache coherence problem

Core 1 reads x

[Diagram: Core 1's cache now holds x = 15213; main memory x = 15213.]

SLIDE 22

The cache coherence problem

Core 2 reads x

[Diagram: Core 1's and Core 2's caches each hold x = 15213; main memory x = 15213.]

SLIDE 23

The cache coherence problem

Core 1 writes to x, setting it to 21660

[Diagram, assuming write-through caches: Core 1's cache holds x = 21660 and main memory is updated to 21660, but Core 2's cache still holds x = 15213.]

SLIDE 24

The cache coherence problem

Core 2 attempts to read x… gets a stale copy

[Diagram: Core 2 hits in its own cache and reads the stale x = 15213, even though main memory holds x = 21660.]
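The stale-read scenario on the last few slides can be reproduced with a toy simulation. This is a hypothetical sketch (class names invented) assuming write-through caches and no coherence protocol:

```python
class Cache:
    """Per-core write-through cache with NO coherence support."""

    def __init__(self, memory):
        self.lines = {}          # address -> cached value
        self.memory = memory     # shared main memory

    def read(self, addr):
        if addr not in self.lines:           # miss: fill from memory
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]              # hit: local copy, maybe stale

    def write(self, addr, value):
        self.lines[addr] = value             # write-through: update local
        self.memory[addr] = value            # copy and memory, but NOT
                                             # other cores' caches

memory = {"x": 15213}
core1, core2 = Cache(memory), Cache(memory)

core1.read("x")            # Core 1 reads x -> 15213 cached
core2.read("x")            # Core 2 reads x -> 15213 cached
core1.write("x", 21660)    # Core 1 writes x; memory is updated...
print(core2.read("x"))     # ...but Core 2 still sees the stale 15213
```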

SLIDE 25

Solutions for cache coherence

  • This is a general problem with multiprocessors, not limited just to multi-core
  • There exist many solution algorithms, coherence protocols, etc.
  • A simple solution: invalidation-based protocol with snooping

SLIDE 26

Inter-core bus

[Diagram: Cores 1–4, each with one or more levels of cache, connected by an inter-core bus; main memory below.]

SLIDE 27

Invalidation protocol with snooping

  • Invalidation: if a core writes to a data item, all other copies of this data item in other caches are invalidated
  • Snooping: all cores continuously "snoop" (monitor) the bus connecting the cores
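The two mechanisms above can be sketched as a toy simulation (hypothetical `Bus` and `SnoopingCache` classes, again assuming write-through caches): on a write, the writer broadcasts on the bus, and every other snooping cache drops its copy, so the earlier stale read can no longer happen.

```python
class Bus:
    """Inter-core bus that every cache snoops."""

    def __init__(self):
        self.caches = []

    def broadcast_invalidate(self, addr, writer):
        for cache in self.caches:
            if cache is not writer:
                cache.lines.pop(addr, None)  # invalidate other copies

class SnoopingCache:
    """Write-through cache using an invalidation protocol with snooping."""

    def __init__(self, memory, bus):
        self.lines = {}
        self.memory = memory
        self.bus = bus
        bus.caches.append(self)              # register as a snooper

    def read(self, addr):
        if addr not in self.lines:           # miss: fill from memory
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]

    def write(self, addr, value):
        self.bus.broadcast_invalidate(addr, self)  # invalidate sharers
        self.lines[addr] = value                   # then write through
        self.memory[addr] = value

memory, bus = {"x": 15213}, Bus()
core1 = SnoopingCache(memory, bus)
core2 = SnoopingCache(memory, bus)

core1.read("x"); core2.read("x")   # both caches hold x = 15213
core1.write("x", 21660)            # Core 2's copy is invalidated
print(core2.read("x"))             # miss -> reload from memory -> 21660
```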

SLIDE 28

The cache coherence problem

Revisited: Cores 1 and 2 have both read x

[Diagram: Core 1's and Core 2's caches each hold x = 15213; main memory x = 15213.]

SLIDE 29

The cache coherence problem

Core 1 writes to x, setting it to 21660

[Diagram, assuming write-through caches: Core 1 writes x = 21660 and sends an invalidation request on the inter-core bus; Core 2's copy of x is INVALIDATED; main memory is updated to 21660.]

SLIDE 30

The cache coherence problem

After invalidation:

[Diagram: only Core 1's cache holds x = 21660; Core 2's copy is gone; main memory x = 21660.]

SLIDE 31

The cache coherence problem

Core 2 reads x. Cache misses, and loads the new copy.

[Diagram: Core 2 misses and reloads x = 21660 from memory; both caches now hold the new value.]

SLIDE 32

Alternative to invalidate protocol: update protocol

Core 1 writes x=21660:

[Diagram, assuming write-through caches: Core 1 writes x = 21660 and broadcasts the updated value on the inter-core bus; Core 2's copy is UPDATED in place to 21660; main memory is updated.]

SLIDE 33

Which do you think is better? Invalidation or update?

SLIDE 34

Invalidation vs update

  • Multiple writes to the same location:
    – invalidation: only the first time
    – update: must broadcast each write (which includes the new variable value)
  • Invalidation generally performs better: it generates less bus traffic
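The traffic argument can be made concrete with a back-of-the-envelope sketch. It is deliberately idealized (invented helper names): it assumes a run of writes by one core with no intervening reads by other cores, so invalidation pays for one bus message while update pays for one per write.

```python
def invalidation_messages(num_writes: int) -> int:
    """First write invalidates the sharers; later writes hit a line
    that no other cache holds, so they stay off the bus."""
    return 1 if num_writes > 0 else 0

def update_messages(num_writes: int) -> int:
    """Every write must broadcast the new value to the sharers."""
    return num_writes

# 100 consecutive writes to one location by a single core:
print(invalidation_messages(100))  # 1 bus message
print(update_messages(100))        # 100 bus messages
```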
