Understanding CPU Caches
Ulrich Drepper
Introduction
Discrepancy between CPU and main memory speed
- Intel's figures for the Pentium M nowadays:
– ~240 cycles to access main memory
- The gap is widening
- Faster memory is too expensive
The Solution for Now
CPU caches: additional set(s) of memory added between CPU and main memory
- Designed to not change the programs' semantics
- Controlled by the CPU/chipset
- Can have multiple layers with different speed (i.e., cost) and size
What Does It Look Like
[Diagram: the memory hierarchy. Execution Unit (≤1 cycle) → 1st Level Data Cache and 1st Level Instruction Cache (~3 cycles each) → 2nd Level Cache (~14 cycles) → System Bus → 3rd Level Cache → Main Memory (~240 cycles)]
Cache Usage Factors
Numerous factors decide cache performance:
- Cache size
- Cacheline handling
– associativity
- Replacement strategy
- Automatic prefetching
Cache Addressing
An address (32/64 bits) is split into fields: the low bits select the byte within the cacheline (cacheline size), the next H bits select the hash bucket, and the remaining M bits form the tag. Each bucket holds N entries (N-way associativity); addresses with the same bucket bits alias into the same set.
Observing the Effects
Test program to observe the effects:
- Walks a singly linked list
– Sequential in memory
– Randomly distributed
- Writes to list elements
struct l {
    struct l *n;
    long pad[NPAD];
};
[Figure: Sequential Access (NPAD=0); x axis: Working Set Size (Bytes), y axis: Cycles / List Element]
[Figure: Sequential List Access; one curve per element size (Size=8, 64, 128, 256); x axis: Working Set Size (Bytes), y axis: Cycles / List Element]
[Figure: Sequential vs Random Access (NPAD=0); curves: Sequential, Random; x axis: Working Set Size (Bytes), y axis: Cycles / List Element]
[Figure: Sequential Access (NPAD=1); curves: Follow, Inc, Addnext0; x axis: Working Set Size (Bytes), y axis: Cycles / List Element]
Optimizing for Caches I
- Use memory sequentially
– For data, use arrays instead of lists
– For instructions, avoid indirect calls
- Choose data structures that are as small as possible
- Prefetch memory
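The prefetch advice can be sketched with the `__builtin_prefetch` builtin available in GCC and Clang; the prefetch distance below is an assumed tuning value, not a figure from the slides:

```c
#include <stddef.h>

/* How many elements ahead to prefetch -- an assumption; the right
 * distance depends on the machine and must be tuned. */
#define PREFETCH_DIST 8

/* Sum an array sequentially, hinting the cache to fetch a line that
 * will be needed a few iterations from now. */
long sum_with_prefetch(const long *a, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; ++i) {
        if (i + PREFETCH_DIST < n)
            /* args: address, rw (0 = read), temporal locality (0..3) */
            __builtin_prefetch(&a[i + PREFETCH_DIST], 0, 3);
        sum += a[i];
    }
    return sum;
}
```

Hardware prefetchers already handle simple sequential passes well; explicit prefetching pays off mainly for access patterns the hardware cannot predict.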
[Figure: Sequential Access w/ vs w/out L3; curves for four machines: P4/64/16k/1M-128b, P4/64/16k/1M-256b, P4/32/?/512k/2M-128b, P4/32/?/512k/2M-256b; x axis: Working Set Size (Bytes), y axis: Cycles / List Element]
More Fun: Multithreading
- 1. CPU cores #1 and #3 read from a memory location; the relevant L2 and L1 caches contain the data
- 2. CPU core #2 writes to the memory location
a) Notify the L1 of core #1 that its content is obsolete
b) Notify the L2 and L1 of the second processor that their content is obsolete
[Diagram: two dual-core processors sharing a bus to Main Memory; each processor has one shared L2 cache, and each core (CPU Core #1 to #4) has its own L1]
More Fun: Multithreading
- 3. Core #4 writes to the memory location
a) Wait for core #2's cache content to land in main memory
b) Notify core #2's L1 and L2 that their content is obsolete
[Figure: Sequential Increment, 128-Byte Elements; curves: Nthreads=1, 2, 4; x axis: Working Set Size (Bytes), y axis: Cycles / List Element (log scale)]
Optimizing for Caches II
Cacheline ping-pong is deadly for performance
- If possible, always write from the same CPU
- Use per-CPU memory; pin each thread to a specific CPU
- Avoid placing often independently read and written values in the same cacheline