Caches
Samira Khan March 21, 2017
Agenda
  Logistics
  Review from last lecture
    Out-of-order execution
    Data flow model
    Superscalar processor
  Caches

Final Exam
  Combined final exam 7-10PM on Tuesday, 9 May 2017
  Any
younger instructions into functional (execution) units
unit
[Figure: pipeline with stages F, D, E, R, W. The E stage takes a different number of cycles depending on the operation: integer add, integer mul, FP mul, cache miss.]
(with respect to execution in the previous design)?
IMUL R3 ← R1, R2
ADD R3 ← R3, R1
ADD R1 ← R6, R7
IMUL R5 ← R6, R8
ADD R7 ← R9, R9

LD R3 ← R1 (0)
ADD R3 ← R3, R1
ADD R1 ← R6, R7
IMUL R5 ← R6, R8
ADD R7 ← R9, R9
IMUL R3 ← R1, R2
ADD R3 ← R3, R1
ADD R1 ← R6, R7
IMUL R5 ← R6, R8
ADD R7 ← R3, R5
[Figure: two pipeline timing diagrams (stages F, D, E, R, W) for this sequence, one showing STALL cycles and the other showing WAIT cycles.]
Tomasulo
Units,”
introduction,” MICRO 1985.
microarchitecture,” MICRO 1985.
processors
IBM z196, Oracle UltraSPARC T4, ARM Cortex A15
Politics Behind Intel's Landmark Chips by Robert
memory). Two key properties:
instructions
When is a value interpreted as an instruction?
executed in control flow order
in data flow order
Which model is more natural to you as a programmer?
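Sequential:
v <= a + b;
w <= b * 2;
x <= v - w;
y <= v + w;
z <= x * y;

Dataflow:
[Figure: dataflow graph of the same computation, with inputs a and b, operator nodes (+, *2, *), and output z.]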
nodes
inputs are ready
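The firing rule (a node executes once its inputs are ready) can be sketched in C for the five-statement example above. This is only an illustration: the `Node` struct, the slot numbering, and the repeated-sweep scheduler are assumptions made for the example, not something the lecture specifies.

```c
/* Hypothetical encoding of the example graph: slots A and B hold the
   inputs; slots V, W, X, Y, Z hold the computed values. */
enum { A, B, V, W, X, Y, Z, NSLOTS };

typedef struct {
    char op;        /* '+', '-', '*', or 'd' (double: operand * 2) */
    int src1, src2; /* operand slots; src2 is unused by 'd' */
    int dst;        /* result slot */
    int fired;      /* set once the node has executed */
} Node;

/* Evaluate the graph by firing any node whose inputs are ready.
   Returns the value of z. */
int run_dataflow(int a, int b) {
    int value[NSLOTS] = {0};
    int ready[NSLOTS] = {0};
    /* Nodes are listed in reverse program order on purpose: the firing
       rule, not the listing order, decides when each one executes. */
    Node nodes[] = {
        {'*', X, Y, Z, 0},  /* z <= x * y */
        {'+', V, W, Y, 0},  /* y <= v + w */
        {'-', V, W, X, 0},  /* x <= v - w */
        {'d', B, B, W, 0},  /* w <= b * 2 */
        {'+', A, B, V, 0},  /* v <= a + b */
    };
    int n = (int)(sizeof nodes / sizeof nodes[0]);
    int fired_count = 0;

    value[A] = a; ready[A] = 1;
    value[B] = b; ready[B] = 1;

    while (fired_count < n) {
        for (int i = 0; i < n; i++) {
            Node *nd = &nodes[i];
            if (nd->fired || !ready[nd->src1] || !ready[nd->src2])
                continue; /* already done, or still waiting on an input */
            int l = value[nd->src1], r = value[nd->src2];
            switch (nd->op) {
            case '+': value[nd->dst] = l + r; break;
            case '-': value[nd->dst] = l - r; break;
            case '*': value[nd->dst] = l * r; break;
            case 'd': value[nd->dst] = l * 2; break;
            }
            ready[nd->dst] = 1;
            nd->fired = 1;
            fired_count++;
        }
    }
    return value[Z];
}
```

Calling `run_dataflow(3, 4)` gives v = 7, w = 8, x = -1, y = 15, and so z = -15, the same result the sequential listing produces, even though the nodes are stored in the opposite order.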
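[Figure: dataflow graph with inputs A and B and output ANSWER. Legend: copy node (initially Z=X, then Z=Y), subtract node (Z=X-Y), XOR, AND, =0? test, +1, and T/F nodes with a control input c.]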
Graph annotations: val = a ^ b; val != 0; val &= val - 1; dist = 0; dist++;
int hamming_distance(unsigned a, unsigned b) {
    int dist = 0;
    unsigned val = a ^ b;

    // Count the number of bits set
    while (val != 0) {
        // A bit is set, so increment the count and clear the bit
        dist++;
        val &= val - 1;
    }

    // Return the number of differing bits
    return dist;
}
advice to other researchers
his life time
semantics?)
graph of a piece of the program
instructions
window?
F D E M W
  F D E M W
    F D E M W
      F D E M W
Each instruction still takes 5 cycles, but instructions now complete every cycle: CPI → 1
F D E M W
F D E M W
  F D E M W
  F D E M W
    F D E M W
    F D E M W
Each instruction still takes 5 cycles, but two instructions now complete every cycle: CPI → 0.5
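The two CPI limits follow from a simple cycle count. As a rough sketch, assuming an ideal pipeline with no stalls or hazards (the helper names `ideal_cycles` and `ideal_cpi` are made up for this example): with pipeline depth d and issue width w, n instructions finish in d + ceil(n / w) - 1 cycles.

```c
/* Cycles needed for n instructions on an ideal, stall-free pipeline of the
   given depth that starts `width` instructions per cycle.
   Assumption: no hazards, so a new group enters every cycle. */
long ideal_cycles(long n, int depth, int width) {
    long groups = (n + width - 1) / width; /* ceil(n / width) */
    return depth + groups - 1;
}

/* Average cycles per instruction over the whole run. */
double ideal_cpi(long n, int depth, int width) {
    return (double)ideal_cycles(n, depth, width) / (double)n;
}
```

For the four-instruction scalar case this gives 5 + 4 - 1 = 8 cycles, and for six instructions issued two at a time 5 + 3 - 1 = 7 cycles. As n grows, the pipeline fill time is amortized and `ideal_cpi` approaches 1 for width 1 and 0.5 for width 2.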
executed, and committed per cycle
somewhat breaks in the flow
[Figure: multi-core chip layout with four cores (CORE 0 to CORE 3), one L2 cache per core (L2 CACHE 0 to L2 CACHE 3), a shared L3 cache, a DRAM interface, a DRAM memory controller, and DRAM banks.]
parallel)
technology
storage of 1 or 0
[Figure: memory cell with a row enable line and a _bitline.]
“cell”
[Figure: memory cell with a row select line, a bitline, and a _bitline.]
the levels are farther from the processor) and ensure most of the data the processor needs is kept in the fast(er) level(s)
[Figure: two memory levels. Fast, small level: move what you use here (faster per byte). Big but slow level: back up everything here (cheaper per byte).]
With good locality of reference, memory appears as fast as the small, fast level and as large as the big, slow level.
bandwidth
[Figure: memory hierarchy: CPU with register file (RF), cache, main memory (DRAM), hard disk.]
will do the same thing again soon
again regularly
something similar/related (in space)
people
many times and all within a small window of time
time
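As a small illustration of locality, here is a hedged C sketch. The function names and the matrix size are made up for the example. Both functions compute the same sum: the row-major walk touches adjacent addresses one after another (spatial locality) and reuses `sum` and the loop counters on every iteration (temporal locality), while the column-major walk jumps a full row between consecutive accesses and so has much weaker spatial locality.

```c
#include <stddef.h>

#define N 256 /* hypothetical matrix dimension */

/* Fill the matrix with a simple known pattern: m[i][j] = i + j. */
void fill_sample(int m[N][N]) {
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            m[i][j] = (int)(i + j);
}

/* Row-major walk: consecutive accesses are adjacent in memory. */
long sum_row_major(int m[N][N]) {
    long sum = 0;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            sum += m[i][j];
    return sum;
}

/* Column-major walk: consecutive accesses are N ints apart. */
long sum_col_major(int m[N][N]) {
    long sum = 0;
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            sum += m[i][j];
    return sum;
}
```

The two functions always return the same value; any difference shows up only in running time, and how large it is depends on the machine's cache sizes and line length.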
memory (called cache)
Computers, 1965.
memory of, say, one million words in such a way that in practical cases the effective access time is nearer that of the fast memory than that of the slow memory.”
managed fast memory
time
[Figure: memory hierarchy: CPU with register file (RF), Level 1 cache, Level 2 cache, main memory (DRAM).]
levels
SRAM in lieu of a cache)
levels, transparently to the programmer
++ programmer’s life is easier
“correct” program! (What if you want a “fast” program?)
IEEE Trans. On Electronic Computers, 1965.
accumulates to itself words that come from a slower main memory, and keeps them available for subsequent use without it being necessary for the penalty of main memory access to be incurred again.”
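To make the quoted idea concrete, here is a minimal sketch of a small fast memory sitting in front of a slower one. Everything in it is an assumption chosen for illustration: a direct-mapped organization, one word per line, 8 lines, and the hypothetical names `TinyCache`, `cache_init`, and `cache_read`. The text above does not prescribe any particular organization.

```c
#include <stdint.h>
#include <string.h>

#define NUM_LINES 8 /* hypothetical cache size: 8 one-word lines */

typedef struct {
    int valid[NUM_LINES];     /* does this line hold a word yet? */
    uint32_t tag[NUM_LINES];  /* which address the line holds */
    uint32_t data[NUM_LINES]; /* the cached word */
    long hits, misses;
} TinyCache;

void cache_init(TinyCache *c) { memset(c, 0, sizeof *c); }

/* Read one word. On a miss, fetch it from main memory (`mem`) and keep it
   so that a later read of the same address avoids the slow access. */
uint32_t cache_read(TinyCache *c, const uint32_t *mem, uint32_t addr) {
    uint32_t index = addr % NUM_LINES; /* line this address maps to */
    uint32_t tag = addr / NUM_LINES;   /* distinguishes addresses sharing a line */
    if (c->valid[index] && c->tag[index] == tag) {
        c->hits++;
        return c->data[index];
    }
    c->misses++;
    c->valid[index] = 1;
    c->tag[index] = tag;
    c->data[index] = mem[addr];
    return c->data[index];
}
```

Reading address 5 twice gives one miss followed by one hit. Reading address 13 in between maps to the same line (13 % 8 == 5) and evicts it, so the next read of address 5 misses again.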
Register File: 32 words, sub-nsec
L1 cache: ~32 KB, ~nsec
L2 cache: 512 KB ~ 1MB, many nsec
L3 cache: .....
Main memory (DRAM): GB, ~100 nsec
Disk: 100 GB, ~10 msec

Memory Abstraction
Register file: manual/compiler register spilling
Caches: automatic HW cache management
Main memory and disk: automatic demand paging
perceived access time T_i is longer than t_i
T_i = h_i·t_i + m_i·(t_i + T_{i+1})
T_i = t_i + m_i·T_{i+1}
T_i = t_i + m_i·T_{i+1}
need, prefetching: anticipate what you will need)
if m1=0.1, m2=0.1: T1=7.6, T2=36
if m1=0.01, m2=0.01: T1=4.2, T2=19.8
if m1=0.05, m2=0.01: T1=5.00, T2=19.8
if m1=0.01, m2=0.50: T1=5.08, T2=108
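The recurrence relating each level's perceived access time to the next level's can be evaluated directly, and the sketch below reproduces the four cases listed above. The latencies it uses (4 cycles for level 1, 18 cycles for level 2, 180 cycles for main memory) are not stated in this text. They were back-solved from the listed results and should be treated as assumptions, as should the helper names.

```c
/* Assumed latencies in cycles, back-solved from the listed results
   (not given in the text). */
#define T1_HIT 4.0   /* t_1: level-1 access time */
#define T2_HIT 18.0  /* t_2: level-2 access time */
#define T3_MEM 180.0 /* T_3: main memory access time */

/* Perceived access time of level i: T_i = t_i + m_i * T_{i+1}. */
double perceived_time(double t_i, double m_i, double T_next) {
    return t_i + m_i * T_next;
}

/* T_2 for a given level-2 miss rate. */
double level2_time(double m2) {
    return perceived_time(T2_HIT, m2, T3_MEM);
}

/* T_1 for given level-1 and level-2 miss rates. */
double level1_time(double m1, double m2) {
    return perceived_time(T1_HIT, m1, level2_time(m2));
}
```

With these assumed latencies, `level2_time(0.5)` is 18 + 0.5 × 180 = 108 and `level1_time(0.01, 0.5)` is 4 + 0.01 × 108 = 5.08, matching the last listed case. The cases also show that T1 stays close to t_1 as long as m1 is small, even when the level-2 miss rate is poor.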