

SLIDE 1

Understanding CPU Caches

Ulrich Drepper

SLIDE 2

Introduction

Discrepancy between CPU and main memory speed

  • Intel lists for the Pentium M nowadays:
    – ~240 cycles to access main memory
  • The gap is widening
  • Faster memory is too expensive
SLIDE 3

The Solution for Now

CPU caches: additional set(s) of memory placed between the CPU and main memory

  • Designed not to change the programs' semantics
  • Controlled by the CPU/chipset
  • Can have multiple layers with different speeds (i.e., costs) and sizes

SLIDE 4

What Does It Look Like?

[Diagram: the memory hierarchy, from the execution unit through the caches to main memory over the system bus]

    Execution Unit                 ≤1 cycle
    1st Level Data Cache           ~3 cycles
    1st Level Instruction Cache    ~3 cycles
    2nd Level Cache                ~14 cycles
    3rd Level Cache
    Main Memory (via System Bus)   ~240 cycles

SLIDE 5

Cache Usage Factors

Numerous factors decide cache performance:

  • Cache size
  • Cacheline handling
    – associativity
  • Replacement strategy
  • Automatic prefetching
SLIDE 6

Cache Addressing

[Diagram: cache addressing. An address (32/64 bits) is split into three fields: the low bits select the byte within the cacheline (cacheline size), the next H bits select the hash bucket, and the remaining M bits form the tag that distinguishes aliases. Each bucket holds N entries (N-way buckets); addresses whose bucket bits match are aliases and compete for those N slots.]
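A minimal sketch of this decomposition in C, assuming 64-byte cachelines and 8192 buckets (both illustrative values, not taken from the slide):

    #include <stdint.h>
    #include <stdio.h>

    #define LINE_BITS   6    /* log2(64-byte cacheline); assumption */
    #define BUCKET_BITS 13   /* log2(8192 hash buckets); assumption */

    int main(void)
    {
        uintptr_t addr = 0x12345678u;      /* arbitrary example address */

        uintptr_t offset = addr & ((1u << LINE_BITS) - 1);
        uintptr_t bucket = (addr >> LINE_BITS) & ((1u << BUCKET_BITS) - 1);
        uintptr_t tag    = addr >> (LINE_BITS + BUCKET_BITS);

        printf("offset=%#lx bucket=%#lx tag=%#lx\n",
               (unsigned long)offset, (unsigned long)bucket,
               (unsigned long)tag);
        return 0;
    }

Two addresses with equal bucket bits but different tags are aliases: they compete for the bucket's N ways, and an (N+1)-th alias evicts one of the others.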

SLIDE 7

Observing the Effects

Test program to observe the effects:

  • Walks a singly linked list
    – Sequential in memory
    – Randomly distributed
  • Writes to the list elements

    struct l {
        struct l *n;
        long pad[NPAD];
    };
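A minimal sketch of the measurement loop over the struct above (a reconstruction, not Drepper's exact harness). Each load produces the address of the next element, so the CPU cannot overlap the accesses, and the time per iteration directly reflects the latency of wherever the element currently resides:

    /* Follow the chain for 'iters' steps.  Each load is data-
       dependent on the previous one (pointer chasing), so the
       accesses cannot be reordered or skipped. */
    static struct l *walk(struct l *head, long iters)
    {
        struct l *p = head;
        while (iters-- > 0)
            p = p->n;          /* one dependent load per element  */
        return p;              /* keeps the loop from being dead code */
    }

Timing this loop with the CPU's cycle counter and dividing by the number of elements visited gives the "Cycles / List Element" metric used on the following plots.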

SLIDE 8

[Plot: Sequential Access (NPAD=0). X axis: Working Set Size (Bytes), 2^10 to 2^28; Y axis: Cycles / List Element, ~4 to ~9.5.]

SLIDE 9

[Plot: Sequential List Access for element sizes 8, 64, 128, and 256 bytes. X axis: Working Set Size (Bytes), 2^10 to 2^28; Y axis: Cycles / List Element, up to ~325.]

SLIDE 10

[Plot: Sequential vs Random Access (NPAD=0). Series: Sequential, Random. X axis: Working Set Size (Bytes), 2^10 to 2^28; Y axis: Cycles / List Element, up to ~500.]

SLIDE 11

[Plot: Sequential Access (NPAD=1). Series: Follow, Inc, Addnext0. X axis: Working Set Size (Bytes), 2^10 to 2^28; Y axis: Cycles / List Element, ~3 to ~30.]

SLIDE 12

Optimizing for Caches I

  • Use memory sequentially
    – For data, use arrays instead of lists
    – For instructions, avoid indirect calls
  • Choose data structures as small as possible
  • Prefetch memory (see the sketch below)
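A minimal sketch of the last point using GCC/Clang's __builtin_prefetch; the look-ahead distance of 8 elements is an illustrative assumption that would need tuning for a real machine:

    #include <stddef.h>

    /* Scale an array, hinting the CPU to fetch ahead so the data
       is already in cache when the loop reaches it. */
    void scale(double *a, size_t n, double f)
    {
        for (size_t i = 0; i < n; i++) {
            if (i + 8 < n)
                __builtin_prefetch(&a[i + 8]);  /* non-binding hint */
            a[i] *= f;
        }
    }

On modern CPUs the hardware prefetcher usually handles such a strictly sequential pattern by itself; explicit prefetching pays off mainly for irregular but predictable access patterns.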
SLIDE 13

Sequential Access w/ vs w/out L3

[Plot: series P4/64/16k/1M-128b, P4/64/16k/1M-256b, P4/32/?/512k/2M-128b, P4/32/?/512k/2M-256b. X axis: Working Set Size (Bytes), 2^10 to 2^28; Y axis: Cycles / List Element, up to ~500.]

SLIDE 14

More Fun: Multithreading

  1. CPU cores #1 and #3 read from a memory location; the L2 caches and the relevant L1 caches contain the data
  2. CPU core #2 writes to the memory location
     a) Notify core #1's L1 that its content is obsolete
     b) Notify the second processor's L2 and L1 that their content is obsolete

[Diagram: two dual-core processors attached to main memory; cores #1 and #2 share one L2, cores #3 and #4 share the other, and each core has its own L1.]

SLIDE 15

More Fun: Multithreading

  3. Core #4 writes to the memory location
     a) Wait for core #2's cache content to land in main memory
     b) Notify core #2's L1 and L2 that the content is obsolete

[Diagram: same two-processor topology as on the previous slide.]
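The cost of this invalidation traffic is easy to provoke. A hypothetical demo (not from the slides): two threads write to neighboring longs that share one cacheline, so every write invalidates the other core's copy:

    #include <pthread.h>
    #include <stdio.h>

    static volatile long slots[16];  /* slots[0] and [1] share a line */

    static void *bump(void *arg)
    {
        volatile long *slot = arg;
        for (long i = 0; i < 100000000L; i++)
            (*slot)++;   /* write -> invalidate the peer's cacheline */
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, bump, (void *)&slots[0]);
        pthread_create(&t2, NULL, bump, (void *)&slots[1]);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("%ld %ld\n", slots[0], slots[1]);
        return 0;
    }

Pointing the second thread at slots[8] instead (64 bytes away, assuming 64-byte cachelines) puts the counters on separate lines and makes the ping-pong disappear.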

SLIDE 16

[Plot: Sequential Increment, 128-byte elements, with Nthreads = 1, 2, 4. X axis: Working Set Size (Bytes), 2^10 to 2^28; Y axis: Cycles / List Element, 1 to 1000 (log scale).]

SLIDE 17

Optimizing for Caches II

Cacheline ping-pong is deadly for performance:

  • If possible, always write on the same CPU
  • Use per-CPU memory; lock threads to specific CPUs
  • Avoid placing often independently read and written data in the same cacheline (see the sketch below)
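A minimal sketch of the last point, assuming 64-byte cachelines (query the real line size if it matters), using C11 alignas to give each per-thread counter its own line:

    #include <stdalign.h>

    #define CACHELINE 64   /* assumed line size */

    struct counter {
        alignas(CACHELINE) long value;  /* one counter per cacheline */
    };

    /* One slot per thread: independent writes now touch different
       cachelines, so the cores stop invalidating each other. */
    static struct counter counts[4];

The padding trades memory for speed: each struct grows to a full cacheline, which is exactly what prevents two threads' writes from colliding on one line.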

SLIDE 18

Questions?