
SLIDE 1

Shared Symmetric Memory Systems

Computer Architecture

J. Daniel García Sánchez (coordinator)

David Expósito Singh Francisco Javier García Blas

ARCOS Group Computer Science and Engineering Department University Carlos III of Madrid

cbed

– Computer Architecture – ARCOS Group – http://www.arcos.inf.uc3m.es 1/37

SLIDE 2

Shared Symmetric Memory Systems Introduction to multiprocessor architectures

1. Introduction to multiprocessor architectures
2. Centralized shared memory architectures
3. Cache coherence alternatives
4. Snooping protocols
5. Performance in SMPs
6. Conclusion


SLIDE 3

Shared Symmetric Memory Systems Introduction to multiprocessor architectures

Increasing importance of multiprocessors

There is a decrease in silicon and energy efficiency as more ILP is exploited.

Cost of silicon and energy grows faster than performance.

Increasing interest in high performance servers.

Cloud computing, software as a service, . . .

Data intensive applications growth.

Huge amounts of data on the Internet. Big data analytics.


SLIDE 4

Shared Symmetric Memory Systems Introduction to multiprocessor architectures

TLP: Thread level parallelism

TLP implies the existence of multiple program counters.

Assumes MIMD.

Generalized use of TLP outside scientific computing is relatively recent.

New applications: embedded applications, desktop, high-end servers.


SLIDE 5

Shared Symmetric Memory Systems Introduction to multiprocessor architectures

Multiprocessors

A multiprocessor is a computer consisting of tightly coupled processors with:

Coordination and use typically controlled by a single operating system.

Memory sharing through a single shared address space.

Software models:

Parallel processing: Coupled set of cooperating threads.

Request processing: Independent process execution originated by users.

Multiprogramming: Independent execution of multiple applications.


SLIDE 6

Shared Symmetric Memory Systems Introduction to multiprocessor architectures

Most common approach:

From 2 to tens of processors. Shared memory.

Implies shared memory. Does not necessarily imply a single physical memory.

Alternatives:

CMP (Chip Multi Processors) or multi-core. Multiple chips.

Each one may (or may not) be multi-core.

Multicomputer: Loosely coupled processors not sharing memory.

Used in large scale scientific computing.


SLIDE 7

Shared Symmetric Memory Systems Introduction to multiprocessor architectures

Maximizing exploitation of multiprocessors:

With n processors, at least n processes or threads are needed.

Threads identification:

Explicitly identified by the programmer.
Created by the operating system from requests.
Loop iterations generated by a parallel compiler (e.g. OpenMP).

Threads are identified at a high level by the programmer or by system software, and each thread should have enough instructions to execute.


SLIDE 8

Shared Symmetric Memory Systems Introduction to multiprocessor architectures

Multiprocessors and shared memory

SMP: Symmetric Multi-Processor

Centralized shared memory.
Processors share a single centralized memory to which all have equal access time.
All multi-cores are SMP.
UMA: Uniform Memory Access. Memory latency is uniform.

DSM: Distributed Shared Memory

Memory is distributed across processors.
Needed when the number of processors is high.
NUMA: Non-Uniform Memory Access. Memory latency depends on data location.

Communication through access to global variables.


SLIDE 9

Shared Symmetric Memory Systems Introduction to multiprocessor architectures

SMP: Symmetric Multi Processor

[Figure: four processors (P1 to P4), each with a private cache, connected to a shared cache and main memory.]


SLIDE 10

Shared Symmetric Memory Systems Introduction to multiprocessor architectures

DSM: Distributed Shared Memory

[Figure: four nodes (P1 to P4), each with its own memory and I/O, connected by an interconnection network.]


SLIDE 11

Shared Symmetric Memory Systems Centralized shared memory architectures


SLIDE 12

Shared Symmetric Memory Systems Centralized shared memory architectures

SMP and memory hierarchy

Why use centralized memory?

Large multi-level caches reduce the bandwidth demand on main memory.

Evolution:

1. Single-core with memory on a shared bus.
2. Memory connected through a separate bus used only for memory.


SLIDE 13

Shared Symmetric Memory Systems Centralized shared memory architectures

Cache memory

Kinds of data in cache memory:

Private data: Data used by a single processor.

Shared data: Data used by multiple processors.

Problem with shared data:

A datum may be replicated in multiple caches.

Contention is decreased: each processor accesses its local copy.

If two processors modify their copies . . .

Cache coherence?


SLIDE 14

Shared Symmetric Memory Systems Centralized shared memory architectures

Cache coherence

Thread 1

lw   $t0, dirx
addi $t0, $t0, 1
sw   $t0, dirx

Thread 2

lw   $t0, dirx

The word at dirx initially holds 1. Assuming write through.

Process  Instruction        P1 Cache     P2 Cache     Main memory
T1       Initially          Not present  Not present  1
T1       lw $t0, dirx       1            Not present  1
T1       addi $t0, $t0, 1   1            Not present  1
T2       lw $t0, dirx       1            1            1
T1       sw $t0, dirx       2            1            2

After the final store, P2's cache still holds the stale value 1.
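The interleaving above can be reproduced with a small Python sketch. It is only a toy model under stated assumptions: the `Cache` class, the `memory` dictionary and the string key "dirx" are illustrative names, and the caches are write-through with no coherence protocol at all.

```python
# Toy model: two private write-through caches with NO coherence protocol.
# Cache, memory and the key "dirx" are illustrative names (assumptions).

memory = {"dirx": 1}          # main memory: dirx initially holds 1


class Cache:
    """A private cache that never hears about writes made by other caches."""

    def __init__(self):
        self.lines = {}

    def load(self, addr):
        if addr not in self.lines:          # miss: fetch from main memory
            self.lines[addr] = memory[addr]
        return self.lines[addr]             # hit: may return a stale value

    def store(self, addr, value):
        self.lines[addr] = value            # update the local copy
        memory[addr] = value                # write through to main memory


p1, p2 = Cache(), Cache()

t0_p1 = p1.load("dirx")       # T1: lw   $t0, dirx
t0_p1 += 1                    # T1: addi $t0, $t0, 1
t0_p2 = p2.load("dirx")       # T2: lw   $t0, dirx
p1.store("dirx", t0_p1)       # T1: sw   $t0, dirx

stale = p2.load("dirx")       # P2 hits in its own cache and misses the update
print(p1.lines["dirx"], stale, memory["dirx"])   # prints: 2 1 2
```

Because nothing tells P2 that the block changed, its later reads keep hitting on the old copy. That is the incoherence a hardware protocol has to remove.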


SLIDE 15

Shared Symmetric Memory Systems Centralized shared memory architectures

Cache incoherence

Why does incoherence happen?

State duality:

Global state → Main memory.
Local state → Private cache.

A memory system is coherent if any read from a location returns the most recent value that has been written to that location.

Two aspects:

Coherence: Which value does a read return?
Consistency: When does a read get the written value?


SLIDE 16

Shared Symmetric Memory Systems Centralized shared memory architectures

Conditions for coherence

Program order preservation

A read by processor P from location X that follows a write by P to X, with no intermediate writes to X by any other processor Q, always returns the value written by P.

Coherent view of memory:

A read by processor P from location X that follows a write by another processor Q to X returns the written value if both operations are sufficiently separated in time and there are no intermediate writes to X.

Writes serialization:

Two writes on the same memory location by two different processors are seen in the same order by all the processors.


SLIDE 17

Shared Symmetric Memory Systems Centralized shared memory architectures

Memory consistency

Defines at which point in time a process reading values will see a written value.

Coherence and consistency are complementary:

Coherence: Behavior of reads and writes on a single memory location.
Consistency: Behavior of reads and writes with respect to accesses to other memory locations.

There are different memory consistency models.

We will have a specific lecture on this problem.


SLIDE 18

Shared Symmetric Memory Systems Cache coherence alternatives


SLIDE 19

Shared Symmetric Memory Systems Cache coherence alternatives

Coherent multiprocessors

A coherent multiprocessor offers:

Shared data migration.

A datum may be moved to a local cache and used transparently.
Decreases remote data access latency and bandwidth demand on shared memory.

Replication of shared data that is read simultaneously.

A copy of the datum is made in the local cache.
Decreases access latency and read contention.

Both properties are critical for performance.

Solution: Hardware protocol for keeping cache coherence.


SLIDE 20

Shared Symmetric Memory Systems Cache coherence alternatives

Kinds of cache coherence protocols

Directory based:

Sharing state is kept in a directory.
SMP: Centralized directory in memory or in the LLC (last-level cache).
DSM: To avoid bottlenecks, a distributed directory is used (more complex).

Snooping:

Each cache keeps the sharing state of each block that it stores.
Caches are accessible through a broadcast medium (bus).
All caches monitor the broadcast medium to determine whether they have a copy of the block.
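The difference between the two families can be pictured with a minimal Python sketch. It is purely illustrative: the number of caches, the block address `0x40`, and both function names are assumptions, and real directories also track block state, not only the set of sharers.

```python
# Hypothetical comparison: which caches are contacted to invalidate a block.
N_CACHES = 8                   # assumed system size

# Directory: one entry per block listing which caches hold a copy.
directory = {0x40: {1, 5}}     # block 0x40 is cached by caches 1 and 5


def directory_invalidate(block, writer):
    """Contact only the sharers recorded in the directory entry."""
    targets = directory.get(block, set()) - {writer}
    directory[block] = {writer}            # the writer becomes the only holder
    return sorted(targets)


def snooping_invalidate(block, writer):
    """Broadcast on the bus: every other cache must check its own tags."""
    return [c for c in range(N_CACHES) if c != writer]


dir_targets = directory_invalidate(0x40, writer=1)
snoop_targets = snooping_invalidate(0x40, writer=1)
print(dir_targets)             # prints: [5]
print(len(snoop_targets))      # prints: 7
```

The directory contacts only the one other sharer, at the cost of maintaining the directory entry. Snooping needs no such structure but makes every cache observe every request.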


SLIDE 21

Shared Symmetric Memory Systems Snooping protocols


SLIDE 22

Shared Symmetric Memory Systems Snooping protocols

Coherence maintenance

Write invalidation:

Guarantees that a processor has exclusive access to a block before performing a write.
Invalidates any copies that other processors might have.

Write update (write broadcast):

Broadcasts every write to all caches so that they update their copy of the block.
Uses more bandwidth.

Most common strategy ⇒ Invalidation.
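The bandwidth argument can be illustrated with a rough Python sketch that counts bus messages for a trace of accesses to a single block. The counting model is a deliberate simplification and an assumption of this sketch: one message per miss, one per invalidation, and one per broadcast update.

```python
def bus_messages(ops, policy):
    """Count bus messages for a trace of accesses to ONE shared block.

    ops    : list of (cpu, "R" or "W")
    policy : "invalidate" or "update"
    Simplified model (assumption): 1 message per miss, 1 per invalidation,
    1 per broadcast update.
    """
    holders = set()                      # caches holding a valid copy
    messages = 0
    for cpu, op in ops:
        if cpu not in holders:           # read or write miss: fetch the block
            messages += 1
            holders.add(cpu)
        if op == "W" and holders - {cpu}:
            messages += 1                # other copies exist: tell them
            if policy == "invalidate":
                holders = {cpu}          # the other copies are dropped
            # under "update" the other caches keep a valid, refreshed copy
    return messages


# P0 and P1 read the block, P0 writes it four times, then P1 reads it again.
trace = [(0, "R"), (1, "R")] + [(0, "W")] * 4 + [(1, "R")]
inv = bus_messages(trace, "invalidate")
upd = bus_messages(trace, "update")
print(inv, upd)                          # prints: 4 6
```

With invalidation, only the first of the repeated writes uses the bus and the reader pays one extra miss afterwards. With update, every write is broadcast.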


SLIDE 23

Shared Symmetric Memory Systems Snooping protocols

Memory bus use

Invalidation.

Processor acquires the bus and broadcasts the address to be invalidated.
All processors snoop the bus.
Each processor checks whether the broadcast address is in its cache and, if so, invalidates it.

There cannot be two simultaneous writes:

Exclusive use of bus serializes writes.

Cache misses:

Write through:

Memory contains the last performed write.

Write back:

If a processor has a modified copy, it supplies the block in response to the other processor's cache miss.


SLIDE 24

Shared Symmetric Memory Systems Snooping protocols

Implementation

Invalidation:

Takes advantage of the valid bit (V) associated with each block.

Writes:

Need to know whether other cached copies exist.

If there are no other copies, the write does not need to be broadcast.

A sharing bit (S) is associated with each block. When there is a write:

A bus invalidation is generated.
The block transitions from shared state to exclusive state.
No further invalidations need to be sent.

When there is a cache miss in another processor:

Transition from exclusive state to shared state.


SLIDE 25

Shared Symmetric Memory Systems Snooping protocols

Basic protocol

Based on a state machine for each cache block:

State changes generated by:

Processor requests.
Bus requests.

Actions:

State transitions.
Actions on the bus.

Simple approach with three states:

M: Block has been modified.
S: Block is shared.
I: Block has been invalidated.


SLIDE 26

Shared Symmetric Memory Systems Snooping protocols

Actions generated by processor

Request     State   Action       Description
Read hit    S → S   Hit          Read data from local cache.
Read hit    M → M   Hit          Read data from local cache.
Read miss   I → S   Miss         Broadcast read miss on bus.
Read miss   S → S   Replacement  Address conflict miss. Broadcast read miss on bus.
Read miss   M → S   Replacement  Address conflict miss. Write block and broadcast read miss.
Write hit   M → M   Hit          Write data in local cache.
Write hit   S → M   Coherence    Bus invalidation.
Write miss  I → M   Miss         Broadcast write miss on bus.
Write miss  S → M   Replacement  Address conflict miss. Broadcast write miss on bus.
Write miss  M → M   Replacement  Address conflict miss. Write block and broadcast write miss.
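The processor-side table can be encoded as a lookup table, as in the Python sketch below. The dictionary layout, the function name `processor_request` and the action strings are assumptions made for illustration. Only the transitions come from the table.

```python
# (current state, request, hit?) -> (next state, bus action or None)
# Encodes the processor-side table above; names and strings are illustrative.
PROCESSOR_SIDE = {
    ("S", "read",  True):  ("S", None),
    ("M", "read",  True):  ("M", None),
    ("I", "read",  False): ("S", "read miss"),
    ("S", "read",  False): ("S", "read miss"),
    ("M", "read",  False): ("S", "write block + read miss"),
    ("M", "write", True):  ("M", None),
    ("S", "write", True):  ("M", "invalidate"),
    ("I", "write", False): ("M", "write miss"),
    ("S", "write", False): ("M", "write miss"),
    ("M", "write", False): ("M", "write block + write miss"),
}


def processor_request(state, request, hit):
    """Return (next_state, bus_action) for a request from the local processor."""
    return PROCESSOR_SIDE[(state, request, hit)]


# One cache line: read miss from I, then a write hit, then a read hit.
state = "I"
log = []
for request, hit in [("read", False), ("write", True), ("read", True)]:
    state, action = processor_request(state, request, hit)
    log.append((state, action))
print(log)
```

Here the line goes I → S with a read miss on the bus, then S → M with a bus invalidation, and stays in M on the final read hit without using the bus.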


SLIDE 27

Shared Symmetric Memory Systems Snooping protocols

MSI Protocol: Processor actions

[Figure: MSI state diagram for processor requests. States: Invalid (I), Shared (S), Modified (M). Transition labels: Read hit; Read miss / Bus: Read miss; Read miss / Write block, Bus: Read miss; Write hit / Bus: Invalidation; Write hit; Write miss / Bus: Write miss; Write miss.]


SLIDE 28

Shared Symmetric Memory Systems Snooping protocols

Actions generated by bus

Request     State   Action     Description
Read miss   S → S   –          Shared memory serves miss.
Read miss   M → S   Coherence  Attempt to share data. Place block on bus.
Invalidate  S → I   Coherence  Attempt to write a shared block. Invalidate block.
Write miss  S → I   Coherence  Attempt to write a shared block. Invalidate block.
Write miss  M → I   Coherence  Attempt to write a block that is exclusive elsewhere. Write back cache block.
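The snooping side can be sketched the same way in Python. The names `SNOOP_SIDE` and `snoop` are illustrative. As an assumption of this sketch, any combination the table does not list (for instance a request seen by a line in state I) leaves the line unchanged.

```python
# (current state, request seen on the bus) -> (next state, action or None)
# Encodes the bus-side table above; names and strings are illustrative.
SNOOP_SIDE = {
    ("S", "read miss"):  ("S", None),
    ("M", "read miss"):  ("S", "place block on bus"),
    ("S", "invalidate"): ("I", None),
    ("S", "write miss"): ("I", None),
    ("M", "write miss"): ("I", "write back block"),
}


def snoop(state, bus_request):
    """Return (next_state, action) for a request observed on the bus.

    Assumption: combinations missing from the table do not affect the line.
    """
    return SNOOP_SIDE.get((state, bus_request), (state, None))


# Cache A holds a block in M. Cache B first reads it, then writes it.
a_state = "M"
a_state, first_action = snoop(a_state, "read miss")     # B's read miss
after_read = a_state
a_state, second_action = snoop(a_state, "invalidate")   # B's write hit in S
print(after_read, first_action, a_state, second_action)
```

Cache A supplies the block and drops to S when B reads, then moves to I when B's write puts an invalidation on the bus.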


SLIDE 29

Shared Symmetric Memory Systems Snooping protocols

MSI Protocol: Bus actions

[Figure: MSI state diagram for bus requests. States: Invalid (I), Shared (S), Modified (M). Transition labels: Read miss; Read miss / Write block, Abort memory access; Invalidate; Write miss; Write miss / Write block, Abort memory access.]


SLIDE 30

Shared Symmetric Memory Systems Snooping protocols

MSI protocol complexities

Protocol assumes that operations are atomic.

Example: It is assumed that a miss can be detected, the bus acquired, and the response received as a single action without interruption.

If operations are not atomic:

A deadlock or a data race may occur.

Solution:

The processor sending an invalidation keeps bus ownership until the invalidation has reached all other processors.


SLIDE 31

Shared Symmetric Memory Systems Snooping protocols

Extensions to MSI

MESI:

Adds exclusive state (E), signaling that a block resides in a single cache and is not modified.
Writing to an E block does not generate invalidations.

MESIF:

Adds forward state (F): an alternative to S that signals which node must answer each request.
Used by Intel Core i7.

MOESI:

Adds owned state (O), signaling that the copy in memory is out of date.
Avoids memory writes.
Used by AMD Opteron.
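The benefit of the E state can be seen by counting bus transactions for a common pattern: read a block, then write it. The Python sketch below is a simplified model with an assumed function name, and it covers only MSI and MESI.

```python
def bus_transactions(protocol, other_sharers):
    """Bus transactions for 'read X, then write X' starting from state I.

    protocol      : "MSI" or "MESI"
    other_sharers : True if some other cache already holds X
    Simplified model (assumption): one transaction per miss or invalidation.
    """
    transactions = 1                         # the read miss always uses the bus
    if protocol == "MESI" and not other_sharers:
        state = "E"                          # sole clean copy
    else:
        state = "S"
    if state == "S":
        transactions += 1                    # S -> M needs a bus invalidation
    # From E, the write moves the line to M without any bus traffic.
    return transactions


msi_private = bus_transactions("MSI", other_sharers=False)
mesi_private = bus_transactions("MESI", other_sharers=False)
mesi_shared = bus_transactions("MESI", other_sharers=True)
print(msi_private, mesi_private, mesi_shared)    # prints: 2 1 2
```

For data that only one processor touches, MESI saves the invalidation that MSI would broadcast. For data that is actually shared, the two behave the same in this model.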


SLIDE 32

Shared Symmetric Memory Systems Performance in SMPs


SLIDE 33

Shared Symmetric Memory Systems Performance in SMPs

Performance

The use of cache coherence policies has an impact on miss rate. Coherence misses emerge:

True sharing misses:

A processor writes to a shared block and invalidates the other copies.
A different processor then reads the word that was written.

False sharing misses:

A processor writes to a shared block and invalidates the other copies.
A different processor then reads a different word from the same block.
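The distinction can be made concrete with a small Python classifier over an access trace. It is a sketch under assumptions: the block size of four words, the function names, and the rule used (a miss after an invalidation is "true sharing" only if the word now accessed was written by another processor since the invalidation) are choices made for this illustration, with an invalidation protocol and unbounded caches.

```python
WORDS_PER_BLOCK = 4        # assumed block size, in words


def block_of(word_addr):
    return word_addr // WORDS_PER_BLOCK


def classify_misses(trace):
    """trace: list of (cpu, "R" or "W", word_addr). Returns (cpu, word, kind)."""
    valid = set()          # (cpu, block) pairs holding a valid copy
    written_since = {}     # (cpu, block) -> words others wrote since invalidation
    misses = []
    for cpu, op, word in trace:
        blk = block_of(word)
        key = (cpu, blk)
        if key not in valid:
            if key in written_since:
                touched = written_since.pop(key)
                kind = "true sharing" if word in touched else "false sharing"
            else:
                kind = "cold"
            misses.append((cpu, word, kind))
            valid.add(key)
        if op == "W":
            for other in list(valid):        # invalidate the other copies
                if other[1] == blk and other[0] != cpu:
                    valid.discard(other)
                    written_since[other] = set()
            for other in written_since:      # remember which word changed
                if other[1] == blk and other[0] != cpu:
                    written_since[other].add(word)
    return misses


trace = [
    (0, "R", 0), (1, "R", 1),    # both caches load block 0
    (0, "W", 0),                 # P0 writes word 0 and invalidates P1's copy
    (1, "R", 1),                 # P1 reads word 1: false sharing miss
    (1, "W", 1),                 # P1 writes word 1 and invalidates P0's copy
    (0, "R", 1),                 # P0 reads word 1: true sharing miss
]
kinds = [kind for _, _, kind in classify_misses(trace)]
print(kinds)
```

In the trace, P1's second read misses only because word 1 happens to live in the same block as word 0, while P0's last read misses because it needs the value P1 actually wrote.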


SLIDE 34

Shared Symmetric Memory Systems Conclusion


SLIDE 35

Shared Symmetric Memory Systems Conclusion

Summary

A multiprocessor is a computer with multiple tightly coupled processors with shared coordination, use, and memory.

Multiprocessors are classified into SMP (symmetric multiprocessors) and DSM (distributed shared memory).

Two aspects to consider in the memory hierarchy: coherence and consistency.

Two alternatives in cache coherence: directory and snooping.

Snooping protocols do not require a centralized element, but they generate more bus traffic.


SLIDE 36

Shared Symmetric Memory Systems Conclusion

References

Computer Architecture: A Quantitative Approach, 5th Ed. Hennessy and Patterson.

Sections: 5.1, 5.2, 5.3.

Recommended exercises: 5.1, 5.2, 5.3, 5.4, 5.5, 5.6.

