SLIDE 1 Slides for Lecture 8
ENCM 501: Principles of Computer Architecture Winter 2014 Term Steve Norman, PhD, PEng
Electrical & Computer Engineering Schulich School of Engineering University of Calgary
4 February, 2014
SLIDE 2
Previous Lecture
◮ conditional branches in various ISAs
◮ introduction to memory systems
◮ review of SRAM and DRAM
SLIDE 3
Today’s Lecture
◮ more about DRAM
◮ introduction to caches
Related reading in Hennessy & Patterson: Sections B.1–B.2
SLIDE 4
The “1T” DRAM (Dynamic RAM) cell
[Schematic: a 1T DRAM cell. WORDLINE gates an access transistor that connects the storage capacitor node Q to BITLINE.]
The bit is stored as a voltage on a capacitor. A relatively high voltage at Q is a 1, and a relatively low voltage at Q is a 0. When the stored bit is a 1, charge is slowly leaking from node Q to ground. In a DRAM array, each row of cells must periodically be read and written back to strengthen the voltages in cells with stored 1’s—this is called refresh. DRAM gets the name dynamic from the continuing activity needed to keep the stored data valid.
SLIDE 5
Writing to a DRAM cell
[Schematic: the same 1T DRAM cell, with BITLINE driven to a logic level and WORDLINE turned on to write node Q.]
Set BITLINE to the appropriate voltage for a 1 or a 0. Turn on WORDLINE. Q will take on the appropriate voltage.
SLIDE 6
Reading from a DRAM cell
[Schematic: the same 1T DRAM cell, with BITLINE pre-charged and WORDLINE turned on to read node Q.]
Pre-charge BITLINE and some nearby electrically similar reference wire to the same voltage, somewhere between logic 0 and logic 1. Turn on WORDLINE. The cell will create a voltage difference between BITLINE and the reference wire, such that the difference can be reliably measured by a sense amplifier. Reading a DRAM cell destroys the data in the cell. After a read, the data must be written back.
SLIDE 7
A 4 × 4 DRAM array
A circuit schematic is shown on the next slide. There is no good commercial reason to build such a tiny DRAM array, but nevertheless the schematic can be used to partially explain how DRAM works. In a read operation, half of the bitlines get used to capture bit values from DRAM cells, and the other half are used as reference wires. This technique is called folded bitlines. The schematic does not show the physical layout of folded bitlines. The block labeled [THIS IS COMPLICATED!] has a lot to do! In there we need bitline drivers, sense amplifiers, refresh logic, and more . . .
SLIDE 8
[Schematic: a 4 × 4 DRAM array. An address decoder driven by inputs A0 and A1 selects one of wordlines WL0 through WL3. Sixteen DRAM cells sit on bitlines BL0 through BL3 and their folded complements, all connected to the block labeled [THIS IS COMPLICATED!], which exchanges data bits D3 through D0 under its CTRL inputs.]
SLIDE 9
DRAM arrays have long latencies compared to SRAM arrays. Why?
1. DRAM arrays typically have much larger capacities than SRAM arrays, so the ratio of cell dimensions to bitline length is much worse for DRAM arrays.
2. A passive capacitor (DRAM) is less effective at changing bitline voltages than an active pair of inverters (SRAM) is.
3. Today, SRAM circuits are usually on the same chip as processor cores, while DRAMs are off-chip, connected to processor chips by wires that may be as long as tens of millimeters.
4. DRAM circuits have to dedicate some time to refresh, but SRAM circuits don't.
SLIDE 10
A 4 GB DRAM SO-DIMM
(Image source: Wikipedia—see http://en.wikipedia.org/wiki/File:4GB_DDR3_SO-DIMM.jpg for details.)
SLIDE 11
A 4 GB DRAM SO-DIMM, continued
SO-DIMM: small outline dual inline memory module. The module in the image appears to have eight 4 Gb DRAM chips (but might have sixteen 2 Gb DRAM chips, with eight on each side of the module). 64 of the 204 connectors are for data. The rest are for address bits, control, power and ground, etc. The module shown can receive or send data at a rate of up to 10667 MB/s—64-bit transfers at a rate of 1333 million transfers per second. How long would it take two such modules—working in parallel—to transfer 64 bytes to a DRAM controller?
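A back-of-the-envelope check of that question, as a small C program. The 1333 million transfers per second figure is from the slide; treating two modules working in parallel as one combined 16-byte-wide channel is an assumption made for this sketch.

#include <stdio.h>

int main(void) {
    const double transfers_per_sec = 1333e6;  /* per the slide: 1333 MT/s */
    const int bytes_per_transfer = 16;        /* assumed: 2 modules x 8-byte bus */
    const int total_bytes = 64;

    int transfers = total_bytes / bytes_per_transfer;  /* 64 / 16 = 4 */
    double ns = transfers / transfers_per_sec * 1e9;   /* about 3 ns */
    printf("%d transfers, about %.1f ns\n", transfers, ns);
    return 0;
}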
SLIDE 12
Why is DRAM bandwidth good when latency is so bad? A partial answer . . .
The internal arrangement of a typical 4 Gb DRAM chip might be four DRAM arrays—called banks—of 1 Gb each. The dimensions of a bank would then be 2^15 rows × 2^15 columns. So access to a single row accesses 32 Kb of data! It pays for the DRAM controller to do writes and reads of chunks of data much larger than 4 or 8 bytes. But because the data bus width of a DIMM is only 8 bytes, these big transfers have to be serialized into multi-transfer bursts.
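To illustrate the burst arithmetic, here is a minimal C sketch, assuming the hypothetical 4 Gb chip organization described above and the 8-byte DIMM data bus:

#include <stdio.h>

int main(void) {
    const long row_bits = 1L << 15;  /* 2^15 columns: one row holds 32 Kb */
    const long bus_bytes = 8;        /* 64-bit DIMM data bus */

    long row_bytes = row_bits / 8;   /* 4096 bytes per row */
    printf("one row:       %ld bytes -> %ld bus transfers\n",
           row_bytes, row_bytes / bus_bytes);      /* 512 transfers */
    printf("64-byte chunk: %ld bus transfers\n",
           64 / bus_bytes);                        /* a burst of 8 */
    return 0;
}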
SLIDE 13
Quantifying Cache Performance
Rather than starting with a multi-core system with multiple levels of caches and complex interactions between caches and TLBs, let’s start with a simple system:
◮ one core
◮ no virtual memory
◮ only one level of caches
◮ constant processor clock frequency
This is shown on the next slide . . .
SLIDE 14
No VM, only one level of cache . . .
[Block diagram: the CORE connects to an L1 I-CACHE and an L1 D-CACHE; both caches connect through a DRAM CONTROLLER to the DRAM MODULES.]
We’ll measure time in processor clock cycles.
SLIDE 15
Hits and misses
The purpose of a cache memory is to provide a fast mirror of a small portion of the contents of a much larger memory one level farther away from the processor core. (In our simple system, the next level is just DRAM.) A cache hit occurs when a memory access can be handled by a cache without any delay waiting for help from the next level.
A cache miss, then, is a memory access that is not a hit. L1 I-caches and D-caches are generally designed to keep the core running at full speed, in the ideal, happy, but unlikely circumstance that all memory accesses hit in these caches.
SLIDE 16
Miss rates, miss penalties, memory access time (1)
Here is the definition of miss rate:

miss rate = (number of misses) / (total number of cache accesses)

Miss rate is program-dependent and also depends on the design of a cache. What kinds of programs have low miss rates? What aspects of cache design lead to low miss rates?

The miss penalty is defined as the average number of clock cycles a processor must stall in response to a miss. Even for the simple system we're considering, it's an average, not a constant property of the cache and DRAM hardware. Let's write down some reasons why the length of a stall might vary from one miss to the next.
SLIDE 17
Miss rates, miss penalties, memory access time (2)
Hit time can be defined as the length of time needed to complete a memory access in the case of a cache hit. It is likely to be 1 processor clock cycle in the case of an L1 cache.

We can now define average memory access time (AMAT) as

AMAT = hit time + miss rate × miss penalty

Suppose hit time is 1 cycle and miss penalty is 100 cycles. What is AMAT if the miss rate is 0? 1%? 5%? 50%?

We'll return to this kind of analysis to quantify the overall impact of miss rates and miss penalties on program running times, but first we'll look qualitatively at design options for caches.
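A tiny C program to check those AMAT questions, using the hit time and miss penalty values given above:

#include <stdio.h>

int main(void) {
    const double hit_time = 1.0;        /* cycles */
    const double miss_penalty = 100.0;  /* cycles */
    const double miss_rates[] = { 0.0, 0.01, 0.05, 0.50 };

    for (int i = 0; i < 4; i++) {
        double amat = hit_time + miss_rates[i] * miss_penalty;
        printf("miss rate %4.1f%% -> AMAT = %5.1f cycles\n",
               miss_rates[i] * 100.0, amat);
    }
    return 0;  /* prints 1.0, 2.0, 6.0, and 51.0 cycles */
}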
SLIDE 18
We’ll start with a very specific cache design
Most textbooks on computer architecture discuss cache design options in one of two ways:

◮ lengthy exploration of most of the available options, followed by some specific examples of cache designs—this is what is done in Section B.1 of our textbook;
◮ presentation of structures that are too simple to work very well, followed by presentation of more complex structures that perform better.

Instead of either of those approaches, let's start with a structure that would be fairly effective for an L1 cache in 2014, then consider the costs and benefits of changing that structure.
SLIDE 19
A 2-way set-associative 32 KB data cache
Address width is 32 bits. Reads and writes are supported for the following data sizes: byte, 16-bit halfword, 32-bit word, 64-bit doubleword.

[Diagram: 256 sets (set 0 through set 255), each with one block in way 0 and one block in way 1; an 8-to-256 decoder uses the 8-bit index to select one set.]

Key:
◮ block status: 1 valid bit and 1 dirty bit per block
◮ tag: one 18-bit stored tag per block
◮ data: one 64-byte (512-bit) data block
◮ set status: 1 LRU bit per set
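As a concrete, purely illustrative picture of that organization, here is a minimal C sketch of the bookkeeping in this cache; the struct and field names are invented for illustration:

#include <stdint.h>

#define NUM_SETS   256
#define NUM_WAYS   2
#define BLOCK_SIZE 64              /* bytes per data block */

struct block {
    unsigned valid : 1;            /* 1 if the block holds live data */
    unsigned dirty : 1;            /* 1 if the block differs from DRAM */
    uint32_t tag;                  /* 18 significant bits */
    uint8_t  data[BLOCK_SIZE];
};

struct set {
    unsigned lru : 1;              /* which way was least recently used */
    struct block way[NUM_WAYS];
};

struct cache {
    struct set sets[NUM_SETS];     /* selected by the 8-bit index */
};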
SLIDE 20
Cache capacity
How is cache capacity defined? Why exactly is the capacity of our example data cache 32 KB?
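One conventional answer, sketched as arithmetic (capacity normally counts only the data blocks, not tags or status bits):

capacity = 2 ways × 256 sets × 64 bytes/block = 32768 bytes = 32 KB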
SLIDE 21
Wires between the core and the example cache
Let’s look at all of the kinds of communication that can happen between the core and the cache. What unidirectional wires must there be to communicate information from the core to the cache? What unidirectional wires must there be to communicate information from the cache to the core? What kind of wires must be bidirectional?
SLIDE 22
Hit detection in the example data cache (1)
Detecting a hit is done the same way for reads and writes. The memory address is split into pieces as follows:
bits 31–14: search tag (18 bits)
bits 13–6: index (8 bits)
bits 5–0: block offset (6 bits)

For example, the address 0x1001abc4 would be split into search tag, index, and block offset as

0001 0000 0000 0001 10 | 10 1011 11 | 00 0100

(ENCM 369 split the piece called block offset into two parts, a block offset and a byte offset. It's actually simpler to work with only a block offset, as given in the example above.)
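The same split can be expressed with shifts and masks. Here is a small C sketch using the field widths above; the variable names are invented for illustration:

#include <stdint.h>
#include <stdio.h>

#define OFFSET_BITS 6   /* 64-byte blocks */
#define INDEX_BITS  8   /* 256 sets */

int main(void) {
    uint32_t addr   = 0x1001abc4u;
    uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);
    uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
    uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);

    printf("tag = 0x%05x, index = %u, offset = %u\n", tag, index, offset);
    return 0;  /* prints tag = 0x04006, index = 175, offset = 4 */
}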
SLIDE 23
Hit detection in the example data cache (2)
For our example address, the search tag, index, and block offset are

0001 0000 0000 0001 10 | 10 1011 11 | 00 0100

The index 1010 1111 (binary) = 175 (decimal) selects set 175; all the other sets will be ignored. The search tag will be compared against both stored tags in set 175—if one of them matches the search tag and has a valid bit of 1, the cache access is a hit. If not, it's a miss.
SLIDE 24
Completion of a read hit in the example cache
This only happens if the result of hit detection was a hit. Let's suppose that the data size requested for a read at address 0x1001abc4 is word. What updates happen in the cache, and what information goes back to the core?
SLIDE 25
Completion of a write hit in the example cache
As with a read, this only happens if the result of hit detection was a hit. To handle writes, this particular cache uses a write-back strategy, in which write hits update the cache and do not cause updates in the next level of the memory hierarchy. Let's suppose that the data size requested for a write at address 0x1001abc4 is word. What updates happen in the cache, and what information goes back to the core?
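One plausible sketch of the write-back bookkeeping in C, reusing the struct cache and the field-extraction constants from the earlier sketches; this illustrates the idea, not the actual hardware logic:

#include <stdint.h>
#include <string.h>

/* Assumes struct cache, struct set, struct block, NUM_WAYS, and the
   OFFSET_BITS/INDEX_BITS widths defined in the earlier sketches. */
int write_word_hit(struct cache *c, uint32_t addr, uint32_t value) {
    uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);
    uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
    uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);

    struct set *s = &c->sets[index];
    for (int w = 0; w < NUM_WAYS; w++) {
        struct block *b = &s->way[w];
        if (b->valid && b->tag == tag) {
            memcpy(&b->data[offset], &value, sizeof value); /* cache only */
            b->dirty = 1;      /* DRAM is now stale until eviction */
            s->lru = 1 - w;    /* mark the other way least recently used */
            return 1;          /* write hit completed */
        }
    }
    return 0;  /* miss: handled by the miss path, not shown here */
}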
SLIDE 26
Completion of a read miss in the example cache
A read miss means that the data the core was seeking is not in the cache. Again suppose that the data size requested for a read at address 0x1001abc4 is word. What will happen to get the core the data it needs? What updates are needed to make sure that the cache will behave efficiently and correctly in the future? Important: Make sure you know why it is absolutely not good enough to copy only one 32-bit word from DRAM to the cache!
SLIDE 27
Completion of a write miss in the example cache
A write miss means that the memory contents the core wanted to update are not currently reflected in the cache. Again suppose that the data size requested for a write at address 0x1001abc4 is word. What will happen to complete the write? What updates are needed to make sure that the cache will behave efficiently and correctly in the future?
SLIDE 28
Storage cells in the example cache
Data blocks are implemented with SRAM cells. The design tradeoffs for the cell design relate to speed, chip area, and energy use per read or write. Tags and status bits might be SRAM cells or might be CAM (“content addressable memory”) cells. We’ll look at that choice in the next lecture.
SLIDE 29
Upcoming Topics
◮ More about cache design and cache performance.
Related reading in Hennessy & Patterson: Sections B.1–B.3