SLIDE 1

Lecture 2: Processor Design, Single-Processor Performance

G63.2011.002/G22.2945.001 · September 14, 2010

Intro Basics Assembly Memory Pipelines

SLIDE 2

Outline

  • Intro
  • The Basic Subsystems
  • Machine Language
  • The Memory Hierarchy
  • Pipelines

SLIDE 3

Admin Bits

  • Lec. 1 slides posted
  • New here? Welcome! Please send in survey info (see lec. 1 slides) via email.
  • PASI
  • Please subscribe to mailing list
  • Near end of class: 5-min, 3-question ‘concept check’

SLIDE 4

Outline

  • Intro
  • The Basic Subsystems
  • Machine Language
  • The Memory Hierarchy
  • Pipelines

SLIDE 5

Introduction

Goal for Today

High Performance Computing: Discuss the actual computer end of this . . . and its influence on performance.

SLIDE 6

What’s in a computer?

Processor: Intel Q6600 Core2 Quad, 2.4 GHz
Die (2×): 143 mm², 2 × 2 cores, 582,000,000 transistors, ∼100 W

SLIDE 9
What’s in a computer?

Memory

SLIDE 11

Outline

  • Intro
  • The Basic Subsystems
  • Machine Language
  • The Memory Hierarchy
  • Pipelines

SLIDE 12

A Basic Processor

[Block diagram: Internal Bus connecting Register File, Flags, Data ALU, Address ALU, Control Unit (with PC), and Memory Interface; instruction fetch path; external Data Bus and Address Bus]

(loosely based on Intel 8086)


Bonus Question: What’s a bus?

SLIDE 14

How all of this fits together

Everything synchronizes to the Clock. Control Unit (“CU”): The brains of the operation. Everything connects to it.

Bus entries/exits are gated and (potentially) buffered. CU controls gates, tells other units about ‘what’ and ‘how’:

  • What operation?
  • Which register?
  • Which addressing mode?


SLIDE 15

What is. . . an ALU?

Arithmetic Logic Unit. One or two operands A, B. Operation selector (Op):

  • (Integer) Addition, Subtraction
  • (Logical) And, Or, Not
  • (Bitwise) Shifts (equivalent to multiplication by a power of two)
  • (Integer) Multiplication, Division

Specialized ALUs:

  • Floating Point Unit (FPU)
  • Address ALU

Operates on binary representations of numbers. Negative numbers are represented by two’s complement.

[ALU diagram: operand inputs A and B, operation selector Op, result output R]

SLIDE 16

What is. . . a Register File?

Registers are On-Chip Memory

  • Directly usable as operands in Machine Language
  • Often “general-purpose”
  • Sometimes special-purpose: Floating point, Indexing, Accumulator
  • Small: x86-64 has 16 × 64-bit GPRs
  • Very fast (near-zero latency)

[Register file diagram: %r0 %r1 %r2 %r3 %r4 %r5 %r6 %r7]

SLIDE 17

How does computer memory work?

One (reading) memory transaction (simplified), between Processor and Memory:

[Timing diagram: clock CLK, read/write select R/W̄, address lines A0..15, data lines D0..15]


Observation: Access (and addressing) happens in bus-width-size “chunks”.

SLIDE 24

What is. . . a Memory Interface?

The Memory Interface gets and stores binary words in off-chip memory. Smallest granularity: the bus width. It tells outside memory

  • “where” through address bus
  • “what” through data bus

Computer main memory is “Dynamic RAM” (DRAM): Slow, but small and cheap.

SLIDE 25

Outline

  • Intro
  • The Basic Subsystems
  • Machine Language
  • The Memory Hierarchy
  • Pipelines

SLIDE 26

A Very Simple Program

    int a = 5;
    int b = 17;
    int z = a * b;

compiles to:

     4:  c7 45 f4 05 00 00 00    movl $0x5,-0xc(%rbp)
     b:  c7 45 f8 11 00 00 00    movl $0x11,-0x8(%rbp)
    12:  8b 45 f4                mov  -0xc(%rbp),%eax
    15:  0f af 45 f8             imul -0x8(%rbp),%eax
    19:  89 45 fc                mov  %eax,-0x4(%rbp)
    1c:  8b 45 fc                mov  -0x4(%rbp),%eax

Things to know:

  • Addressing modes (Immediate, Register, Base plus Offset)
  • Hexadecimal notation (0x. . . )
  • “AT&T Form” (we’ll use this): <opcode><size> <source>, <dest>

SLIDE 27

Another Look

[Block diagram as before: Internal Bus connecting Register File, Flags, Data ALU, Address ALU, Control Unit (with PC), and Memory Interface; instruction fetch path; external Data Bus and Address Bus]


     4:  c7 45 f4 05 00 00 00    movl $0x5,-0xc(%rbp)
     b:  c7 45 f8 11 00 00 00    movl $0x11,-0x8(%rbp)
    12:  8b 45 f4                mov  -0xc(%rbp),%eax
    15:  0f af 45 f8             imul -0x8(%rbp),%eax
    19:  89 45 fc                mov  %eax,-0x4(%rbp)
    1c:  8b 45 fc                mov  -0x4(%rbp),%eax

SLIDE 29

A Very Simple Program: Intel Form

     4:  c7 45 f4 05 00 00 00    mov  DWORD PTR [rbp-0xc],0x5
     b:  c7 45 f8 11 00 00 00    mov  DWORD PTR [rbp-0x8],0x11
    12:  8b 45 f4                mov  eax,DWORD PTR [rbp-0xc]
    15:  0f af 45 f8             imul eax,DWORD PTR [rbp-0x8]
    19:  89 45 fc                mov  DWORD PTR [rbp-0x4],eax
    1c:  8b 45 fc                mov  eax,DWORD PTR [rbp-0x4]

  • “Intel Form” (you might see this on the net): <opcode> <sized dest>, <sized source>
  • Goal: Reading comprehension.
  • Don’t understand an opcode? Google “<opcode> intel instruction”.

SLIDE 30

Machine Language Loops

    int main()
    {
      int y = 0, i;
      for (i = 0; y < 10; ++i)
        y += i;
      return y;
    }

     0:  55                      push   %rbp
     1:  48 89 e5                mov    %rsp,%rbp
     4:  c7 45 f8 00 00 00 00    movl   $0x0,-0x8(%rbp)
     b:  c7 45 fc 00 00 00 00    movl   $0x0,-0x4(%rbp)
    12:  eb 0a                   jmp    1e <main+0x1e>
    14:  8b 45 fc                mov    -0x4(%rbp),%eax
    17:  01 45 f8                add    %eax,-0x8(%rbp)
    1a:  83 45 fc 01             addl   $0x1,-0x4(%rbp)
    1e:  83 7d f8 09             cmpl   $0x9,-0x8(%rbp)
    22:  7e f0                   jle    14 <main+0x14>
    24:  8b 45 f8                mov    -0x8(%rbp),%eax
    27:  c9                      leaveq
    28:  c3                      retq

Things to know:

  • Condition Codes (Flags): Zero, Sign, Carry, etc.
  • Call Stack: Stack frame, stack pointer, base pointer
  • ABI: Calling conventions


Want to make those yourself? Write myprogram.c, then:

    $ cc -c myprogram.c
    $ objdump --disassemble myprogram.o

SLIDE 32

We know how a computer works!


All of this can be built in about 4000 transistors (e.g. MOS 6502 in Apple II, Commodore 64, Atari 2600).

So what exactly is Intel doing with the other 581,996,000 transistors?

Answer: Make things go faster!

Goal now: Understand sources of slowness, and how they get addressed. Remember: High Performance Computing.

SLIDE 35

The High-Performance Mindset

Writing High-Performance Codes

Mindset: What is going to be the limiting factor?

  • ALU?
  • Memory?
  • Communication? (if multi-machine)

Benchmark the assumed limiting factor right away.

Evaluate

  • Know your peak throughputs (roughly)
  • Are you getting close?
  • Are you tracking the right limiting factor?

SLIDE 36

Outline

  • Intro
  • The Basic Subsystems
  • Machine Language
  • The Memory Hierarchy
  • Pipelines

SLIDE 37

Source of Slowness: Memory


→ Memory has long latency, but can have large bandwidth. Size of die vs. distance to memory: big! Dynamic RAM: long intrinsic latency! Idea: Put a look-up table of recently-used data onto the chip. → “Cache”

SLIDE 39

The Memory Hierarchy


Hierarchy of increasingly bigger, slower memories:

    Registers                      1 kB,     1 cycle
    L1 Cache                      10 kB,    10 cycles
    L2 Cache                       1 MB,   100 cycles
    DRAM                           1 GB,  1000 cycles
    Virtual Memory (hard drive)    1 TB,   1 M cycles

How might data locality factor into this? What is a working set?

SLIDE 41

Cache: Actual Implementation

Demands on cache implementation:

  • Fast, small, cheap, low-power
  • Fine-grained
  • High “hit”-rate (few “misses”)

[Diagram: main memory (index → data: 0 → xyz, 1 → pdq, 2 → abc, 3 → rgf) next to a cache holding (data, tag) pairs: (abc, 2), (xyz, 1)]

Problem: Goals at odds with each other: access matching logic is expensive!
Solution 1: More data per unit of access matching logic → larger “cache lines”.
Solution 2: Simpler/less access matching logic → less than full “associativity”.
Other choices: Eviction strategy, size.

SLIDE 42

Cache: Associativity

[Diagrams: direct-mapped — each memory location maps to exactly one cache location; 2-way set associative — each memory location maps to either of two cache locations]

[Plot: miss rate (10^-6 to 0.1) vs. cache size (1K to 1M, and Inf) for direct-mapped, 2-way, 4-way, 8-way, and fully associative caches]


Miss rate versus cache size on the Integer portion of SPEC CPU2000 [Cantin, Hill 2003]

SLIDE 44

Cache Example: Intel Q6600/Core2 Quad

    --- L1 data cache ---
    fully associative cache     = false
    threads sharing this cache  = 0x0 (0)
    processor cores on this die = 0x3 (3)
    system coherency line size  = 0x3f (63)
    ways of associativity       = 0x7 (7)
    number of sets - 1 (s)      = 63

    --- L1 instruction cache ---
    fully associative cache     = false
    threads sharing this cache  = 0x0 (0)
    processor cores on this die = 0x3 (3)
    system coherency line size  = 0x3f (63)
    ways of associativity       = 0x7 (7)
    number of sets - 1 (s)      = 63

    --- L2 unified cache ---
    fully associative cache     = false
    threads sharing this cache  = 0x1 (1)
    processor cores on this die = 0x3 (3)
    system coherency line size  = 0x3f (63)
    ways of associativity       = 0xf (15)
    number of sets - 1 (s)      = 4095

More than you care to know about your CPU: http://www.etallen.com/cpuid.html

SLIDE 45

Measuring the Cache I

    void go(unsigned count, unsigned stride)
    {
      const unsigned arr_size = 64 * 1024 * 1024;
      int *ary = (int *) malloc(sizeof(int) * arr_size);
      for (unsigned it = 0; it < count; ++it)
      {
        for (unsigned i = 0; i < arr_size; i += stride)
          ary[i] *= 17;
      }
      free(ary);
    }

[Plot: time (0.02–0.16 s) vs. stride (2^0 to 2^10)]


SLIDE 47

Measuring the Cache II

    void go(unsigned array_size, unsigned steps)
    {
      int *ary = (int *) malloc(sizeof(int) * array_size);
      unsigned asm1 = array_size - 1;
      for (unsigned i = 0; i < steps; ++i)
        ary[(i*16) & asm1]++;
      free(ary);
    }

[Plot: effective bandwidth (1–6 GB/s) vs. array size (2^12 to 2^26 bytes)]


SLIDE 49

Measuring the Cache III

    void go(unsigned array_size, unsigned stride, unsigned steps)
    {
      char *ary = (char *) malloc(sizeof(int) * array_size);
      unsigned p = 0;
      for (unsigned i = 0; i < steps; ++i)
      {
        ary[p]++;
        p += stride;
        if (p >= array_size)
          p = 0;
      }
      free(ary);
    }

[Heat map: stride (100–600 bytes) vs. array size (5–20 MB)]


SLIDE 51

Programming for the Cache


How can we rearrange programs to be cache-friendly? Examples:

  • Large vectors x, a, b: compute x ← x + 3a − 5b.
  • Matrix-Matrix Multiplication (→ Homework 1, posted)

SLIDE 53

Outline

  • Intro
  • The Basic Subsystems
  • Machine Language
  • The Memory Hierarchy
  • Pipelines

SLIDE 54

Source of Slowness: Sequential Operation

    IF   Instruction Fetch
    ID   Instruction Decode
    EX   Execution
    MEM  Memory Read/Write
    WB   Result Writeback

SLIDE 55

Solution: Pipelining

SLIDE 56

Pipelining

[Pipelined datapath diagram] (MIPS, 110,000 transistors)

SLIDE 57

Issues with Pipelines

Pipelines generally help performance, but not always. Possible issues:

  • Stalls
  • Dependent Instructions
  • Branches (+Prediction)
  • Self-Modifying Code

“Solution”: Bubbling, extra circuitry

[Diagram: waiting instructions entering a four-stage pipeline (Stage 1: Fetch, Stage 2: Decode, Stage 3: Execute, Stage 4: Write-back), with completed instructions emerging over clock cycles 1–9]

SLIDE 58

Intel Q6600 Pipeline

[Diagram: Q6600 pipeline — Instruction Fetch; 32-byte pre-decode/fetch buffer; 18-entry instruction queue; microcode plus one complex and three simple decoders (4 µops/cycle each path); register alias table and allocator; 96-entry reorder buffer (ROB); 32-entry reservation station; ports 0–5 feeding ALUs, SSE units, 128-bit FMUL/FDIV and FADD, branch unit, and store-address/store-data/load-address units; memory ordering buffer (MOB) and memory interface with 128-bit loads/stores; retirement register file (program-visible state); up to 6 instructions in, 4 µops retired per cycle]


New concept: Instruction-level parallelism (“Superscalar”)

SLIDE 60

Programming for the Pipeline

How to upset a processor pipeline:

    for (int i = 0; i < 1000; ++i)
      for (int j = 0; j < 1000; ++j)
      {
        if (j % 2 == 0)
          do_something(i, j);
      }

. . . why is this bad?

SLIDE 61

A Puzzle

    int steps = 256 * 1024 * 1024;
    int[] a = new int[2];

    // Loop 1
    for (int i = 0; i < steps; i++) { a[0]++; a[0]++; }

    // Loop 2
    for (int i = 0; i < steps; i++) { a[0]++; a[1]++; }

Which is faster? . . . and why?

SLIDE 62

Two useful Strategies

Loop unrolling:

    for (int i = 0; i < 1000; ++i)
      do_something(i);

    // unrolled by two (trip count must be even):
    for (int i = 0; i < 1000; i += 2)
    {
      do_something(i);
      do_something(i+1);
    }

Software pipelining:

    for (int i = 0; i < 1000; ++i)
    {
      do_a(i);
      do_b(i);
    }

    // software-pipelined:
    for (int i = 0; i < 1000; i += 2)
    {
      do_a(i); do_a(i+1);
      do_b(i); do_b(i+1);
    }

SLIDE 63

SIMD

Control Units are large and expensive. Functional Units are simple and cheap. → Increase the Function/Control ratio: control several functional units with one control unit. All execute the same operation.

[SIMD concept diagram: a single instruction pool driving several processing units (PU) operating on a shared data pool]

GCC vector extensions:

    typedef int v4si __attribute__((vector_size(16)));
    v4si a, b, c;
    c = a + b;  // +, -, *, /, unary minus, ^, |, &, ~, %

Will revisit for OpenCL, GPUs.

SLIDE 64

About HW1

  • Open-ended! Want: coherent thought about caches, memory access ordering, latencies, bandwidths, etc. See Berkeley CS267 (lec. 3, . . . ) for more hints.
  • Also: Introduction to the machinery. (Clusters, running on them, SSH, git, forge) Why so much machinery?
  • Linux lab machines: WWH 229/230

SLIDE 65

Rearranging Matrix-Matrix Multiplication

Matrix Multiplication: C_ij = Σ_k A_ik B_kj

[Diagram: matrices A, B, C]


SLIDE 71

Questions?

?

SLIDE 72

Image Credits

  • Q6600 Wikimedia Commons
  • Mainboard: Wikimedia Commons
  • DIMM: sxc.hu/gobran11
  • Q6600 back: Wikimedia Commons
  • Core 2 die: Intel Corp. / lesliewong.us
  • Basic cache: Wikipedia
  • Cache associativity: based on Wikipedia
  • Cache associativity vs miss rate: Wikipedia

  • Cache Measurements: Igor Ostrovsky
  • Pipeline stuff: Wikipedia
  • Bubbly Pipeline: Wikipedia
  • Q6600 Pipeline: Wikipedia
  • SIMD concept picture: Wikipedia
