Advanced Caching

Today: quiz 5 recap, quiz 6 recap, advanced caching. Hand a bunch of stuff back.

Speeding up Memory

ET = IC * CPI * CT
CPI = noMemCPI * noMem% + memCPI * mem%
memCPI = hit% * hitTime + miss% * missTime
Conflict miss: data is evicted by another piece of data that mapped to the same "set" (or cache line, in a direct-mapped cache).
Capacity miss: the program touches more data than the cache can hold.
Prefetching: predict which data the processor will need, and proactively "prefetch" it before the program asks for it.
Stride prefetcher: compute delta = thisAddress - lastAddress; if it's consistent across accesses, start fetching thisAddress + delta.
for (i = 0; i < 100; i++) { sum += data[i]; }
Larger cache lines fetch bigger chunks of memory per miss.
For a fixed cache size, larger lines mean a smaller number of lines.
If accesses don't land near each other in memory (i.e., no spatial locality), this will hurt performance.
for (i = 0; i < 1000000; i++) { sum += data[i]; }
for (i = 0; i < 1000000; i++) { sum += data[i]; /* "load data[i+16] into $zero" */ }
Conflict misses: repeated accesses to addresses that map to the same cache line (the same set). It takes N + 1 such addresses, if N is the associativity.
while (1) {
    // Stride of 4096 with a 4 KB cache: every access maps to the same set.
    for (i = 0; i < 1024 * 4096; i += 4096) { sum += data[i]; }
}
The OS hands out memory in large chunks (maybe 128 MB).
Thread stacks placed at regular power-of-two addresses end up mapping to the same parts of the cache.
Stack Thread 0: 0x100000; Stack Thread 1: 0x200000; Stack Thread 2: 0x300000; Stack Thread 3: 0x400000
Compulsory misses: the first access to each piece of data by the program.
Conflict misses: limited associativity forces data mapping to the same set to miss frequently.
Capacity misses: misses that would occur even in an equivalently-sized fully-associative cache.
To separate them, simulate an equivalently-sized fully-associative cache: that eliminates the conflict misses, and what you have left (beyond the compulsory misses) are the capacity misses.
Misses in the L1 are serviced by the L2.
A program performs well when the cache can hold its working set.
Restructure the loop so it makes better use of the memory hierarchy.
[Figure: loop blocking. Running each pass over the whole array at once causes many misses; running all passes consecutively over each cache-sized piece causes few misses.]
Cache optimization in the real world: Core 2 Duo vs. AMD Opteron (via simulation)

AMD Opteron: .00346 miss rate on SPEC2000; Intel Core 2 Duo: .00366 miss rate on SPEC2000 (from Mark Hill's SPEC data).

Intel gets the same performance for less capacity because they have better SRAM technology: they can build an 8-way associative L1. AMD seems not to be able to.
[Worked example: a direct-mapped cache, addresses split into tag | index; each line holds a valid bit, tag, and data. First touches are compulsory misses, repeats hit; when a block with tag 0x300000 maps to the index already holding tag 0x800000, it evicts it, producing a conflict miss.]
[Worked example: the same sequence with multi-word cache lines, addresses split into tag | index | offset. The first touch to each line is a compulsory miss and neighboring words hit; one access still produces a conflict miss between tags 0x300000 and 0x800000, but the remaining accesses all hit.]
[Worked example: a larger/associative configuration (tags 0x1000000 and 0x600000), addresses split into tag | index | offset. After the compulsory misses, the previously conflicting blocks coexist and every remaining access hits.]
[Figure: motivation for virtual memory. A 64 KB physical memory (0x00000-0x10000) holds a heap and a stack growing toward each other; a malloc(0x20000) cannot be satisfied. With virtual memory, each program sees its own virtual address space (0x00000-0x10000) mapped onto physical memory, and the virtual space can be far larger than physical memory (0x400000 = 4 MB, or even 0xF000000 = 240 MB) with the overflow backed by disk (GBs).]
Virtual memory provides:
– address translation
– a way to specify memory + caching behavior
– demand paging

But a page fault to disk costs on the order of 10 ms, so disk is not really the additional level of the memory hierarchy it is billed to be.
Paging – memory divided into fixed-sized pages
each page has a base physical address
Segmentation – memory is divided into variable-length segments
each segment has a base physical address + length
Virtual Address Space: up to 2^64 - 1. Physical Address Space: up to 2^40 - 1 (or whatever). The stack lives somewhere in the virtual space; we need to keep track of this mapping…
[Figure: page table translation. The virtual address splits into a virtual page number and a page offset. A page table register points to the page table; the virtual page number indexes the table, and each entry holds a valid bit plus a physical page number. The physical page number is concatenated with the page offset to form the physical address.]

Hit/miss is determined solely by the valid bit (i.e., no tag).
The table often includes information about protection and cache-ability.
Two issues; somewhat orthogonal
1 KB, 4 KB (very common), 32 KB, 1 MB, 4 MB …
[Figure: a hierarchical (two-level) page table. A processor register holds the root of the current page table. The virtual address splits into p1 (10-bit L1 index), p2 (10-bit L2 index), and offset. p1 indexes the Level 1 page table to find a Level 2 page table; p2 indexes that to find a data page. Pages may live in primary or secondary memory, and a PTE may mark a nonexistent page.]
Adapted from Arvind and Krste’s MIT Course 6.823 Fall 05