

SLIDE 1

Understanding CPU Caches

Ulrich Drepper

SLIDE 2

Introduction

Discrepancy between CPU and main memory speed

  • Intel lists for the Pentium M nowadays:
    – ~240 cycles to access main memory
  • The gap is widening
  • Faster memory is too expensive
SLIDE 3

The Solution for Now

CPU caches: additional set(s) of memory placed between the CPU and main memory

  • Designed not to change the programs' semantics
  • Controlled by the CPU/chipset
  • Can have multiple layers with different speeds (i.e., costs) and sizes

SLIDE 4

What Does It Look Like?

[Diagram: the memory hierarchy, from the execution unit through the caches to main memory over the system bus]

    Execution Unit                 ≤1 cycle
    1st Level Data Cache           ~3 cycles
    1st Level Instruction Cache    ~3 cycles
    2nd Level Cache                ~14 cycles
    3rd Level Cache
    Main Memory (via System Bus)   ~240 cycles

SLIDE 5

Cache Usage Factors

Numerous factors decide cache performance:

  • Cache size
  • Cacheline handling
    – associativity
  • Replacement strategy
  • Automatic prefetching
SLIDE 6

Cache Addressing

[Diagram: cache addressing. An address (32/64 bits) is split into three fields: the low bits select the byte within the cacheline (cacheline size), the next H bits select the hash bucket, and the remaining M bits form the tag that distinguishes aliases. Each bucket holds N entries (N-way buckets); addresses whose bucket bits match are aliases and compete for those N slots.]
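A minimal sketch of this decomposition in C, assuming 64-byte cachelines and 8192 buckets (both illustrative values, not taken from the slide):

    #include <stdint.h>
    #include <stdio.h>

    #define LINE_BITS   6    /* log2(64-byte cacheline); assumption */
    #define BUCKET_BITS 13   /* log2(8192 hash buckets); assumption */

    int main(void)
    {
        uintptr_t addr = 0x12345678u;      /* arbitrary example address */

        uintptr_t offset = addr & ((1u << LINE_BITS) - 1);
        uintptr_t bucket = (addr >> LINE_BITS) & ((1u << BUCKET_BITS) - 1);
        uintptr_t tag    = addr >> (LINE_BITS + BUCKET_BITS);

        printf("offset=%#lx bucket=%#lx tag=%#lx\n",
               (unsigned long)offset, (unsigned long)bucket,
               (unsigned long)tag);
        return 0;
    }

Two addresses with equal bucket bits but different tags are aliases: they compete for the bucket's N ways, and an (N+1)-th alias evicts one of the others.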

SLIDE 7

Observing the Effects

Test program to observe the effects:

  • Walks a singly linked list
    – Sequential in memory
    – Randomly distributed
  • Writes to the list elements

    struct l {
        struct l *n;
        long pad[NPAD];
    };
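A minimal sketch of the measurement loop over the struct above (a reconstruction, not Drepper's exact harness). Each load produces the address of the next element, so the CPU cannot overlap the accesses, and the time per iteration directly reflects the latency of wherever the element currently resides:

    /* Follow the chain for 'iters' steps.  Each load is data-
       dependent on the previous one (pointer chasing), so the
       accesses cannot be reordered or skipped. */
    static struct l *walk(struct l *head, long iters)
    {
        struct l *p = head;
        while (iters-- > 0)
            p = p->n;          /* one dependent load per element  */
        return p;              /* keeps the loop from being dead code */
    }

Timing this loop with the CPU's cycle counter and dividing by the number of elements visited gives the "Cycles / List Element" metric used on the following plots.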

SLIDE 8

[Plot: Sequential Access (NPAD=0). X axis: Working Set Size (Bytes), 2^10 to 2^28; Y axis: Cycles / List Element, ~4 to ~9.5.]

SLIDE 9

[Plot: Sequential List Access for element sizes 8, 64, 128, and 256 bytes. X axis: Working Set Size (Bytes), 2^10 to 2^28; Y axis: Cycles / List Element, up to ~325.]

SLIDE 10

[Plot: Sequential vs Random Access (NPAD=0). Series: Sequential, Random. X axis: Working Set Size (Bytes), 2^10 to 2^28; Y axis: Cycles / List Element, up to ~500.]

SLIDE 11

[Plot: Sequential Access (NPAD=1). Series: Follow, Inc, Addnext0. X axis: Working Set Size (Bytes), 2^10 to 2^28; Y axis: Cycles / List Element, ~3 to ~30.]

SLIDE 12

Optimizing for Caches I

  • Use memory sequentially
    – For data, use arrays instead of lists
    – For instructions, avoid indirect calls
  • Choose data structures as small as possible
  • Prefetch memory (see the sketch below)
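A minimal sketch of the last point using GCC/Clang's __builtin_prefetch; the look-ahead distance of 8 elements is an illustrative assumption that would need tuning for a real machine:

    #include <stddef.h>

    /* Scale an array, hinting the CPU to fetch ahead so the data
       is already in cache when the loop reaches it. */
    void scale(double *a, size_t n, double f)
    {
        for (size_t i = 0; i < n; i++) {
            if (i + 8 < n)
                __builtin_prefetch(&a[i + 8]);  /* non-binding hint */
            a[i] *= f;
        }
    }

On modern CPUs the hardware prefetcher usually handles such a strictly sequential pattern by itself; explicit prefetching pays off mainly for irregular but predictable access patterns.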
SLIDE 13

Sequential Access w/ vs w/out L3

[Plot: series P4/64/16k/1M-128b, P4/64/16k/1M-256b, P4/32/?/512k/2M-128b, P4/32/?/512k/2M-256b. X axis: Working Set Size (Bytes), 2^10 to 2^28; Y axis: Cycles / List Element, up to ~500.]

SLIDE 14

More Fun: Multithreading

  1. CPU cores #1 and #3 read from a memory location; the L2 caches and the relevant L1 caches contain the data
  2. CPU core #2 writes to the memory location
     a) Notify core #1's L1 that its content is obsolete
     b) Notify the second processor's L2 and L1 that their content is obsolete

[Diagram: two dual-core processors attached to main memory; cores #1 and #2 share one L2, cores #3 and #4 share the other, and each core has its own L1.]

SLIDE 15

More Fun: Multithreading

  3. Core #4 writes to the memory location
     a) Wait for core #2's cache content to land in main memory
     b) Notify core #2's L1 and L2 that the content is obsolete

[Diagram: same two-processor topology as on the previous slide.]
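The cost of this invalidation traffic is easy to provoke. A hypothetical demo (not from the slides): two threads write to neighboring longs that share one cacheline, so every write invalidates the other core's copy:

    #include <pthread.h>
    #include <stdio.h>

    static volatile long slots[16];  /* slots[0] and [1] share a line */

    static void *bump(void *arg)
    {
        volatile long *slot = arg;
        for (long i = 0; i < 100000000L; i++)
            (*slot)++;   /* write -> invalidate the peer's cacheline */
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, bump, (void *)&slots[0]);
        pthread_create(&t2, NULL, bump, (void *)&slots[1]);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("%ld %ld\n", slots[0], slots[1]);
        return 0;
    }

Pointing the second thread at slots[8] instead (64 bytes away, assuming 64-byte cachelines) puts the counters on separate lines and makes the ping-pong disappear.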

SLIDE 16

[Plot: Sequential Increment, 128-byte elements, with Nthreads = 1, 2, 4. X axis: Working Set Size (Bytes), 2^10 to 2^28; Y axis: Cycles / List Element, 1 to 1000 (log scale).]

SLIDE 17

Optimizing for Caches II

Cacheline ping-pong is deadly for performance:

  • If possible, always write on the same CPU
  • Use per-CPU memory; lock threads to specific CPUs
  • Avoid placing often independently read and written data in the same cacheline (see the sketch below)
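A minimal sketch of the last point, assuming 64-byte cachelines (query the real line size if it matters), using C11 alignas to give each per-thread counter its own line:

    #include <stdalign.h>

    #define CACHELINE 64   /* assumed line size */

    struct counter {
        alignas(CACHELINE) long value;  /* one counter per cacheline */
    };

    /* One slot per thread: independent writes now touch different
       cachelines, so the cores stop invalidating each other. */
    static struct counter counts[4];

The padding trades memory for speed: each struct grows to a full cacheline, which is exactly what prevents two threads' writes from colliding on one line.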

SLIDE 18

Questions?