ECE232: Hardware Organization and Design, Lecture 28: More Virtual Memory (PowerPoint presentation)


SLIDE 1

Adapted from Computer Organization and Design, Patterson & Hennessy, UCB

ECE232: Hardware Organization and Design

Lecture 28: More Virtual Memory

SLIDE 2

ECE232: More Virtual Memory 2

Overview

  • Virtual memory is used to protect applications from each other
  • Portions of an application are located both in main memory and on disk
  • Need to speed up access for virtual memory
  • Idea: use a small cache to store translations for frequently used pages

SLIDE 3

How to Translate Fast?

  • Problem: virtual memory requires two memory accesses!
    • one to translate the virtual address into a physical address (page table lookup); the page table itself resides in physical memory
    • one to transfer the actual data (hopefully a cache hit)
  • VM hierarchy only, or a cache-memory-disk hierarchy
  • Why not create a cache of virtual-to-physical address translations to make translation fast? (smaller is faster)
  • For historical reasons, such a “page table cache” is called a Translation Lookaside Buffer, or TLB
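As a rough illustration of why the second memory access matters, here is a back-of-the-envelope effective-access-time calculation. The latencies and hit rate below are illustrative assumptions, not figures from the slides:

```python
# Illustrative effective-access-time calculation (all numbers are assumptions).
MEM_LATENCY = 100    # cycles per access to physical memory (assumed)
TLB_LATENCY = 1      # cycles for a TLB lookup (assumed)
TLB_HIT_RATE = 0.99  # TLBs typically hit the vast majority of accesses

# Without a TLB: every access first reads the page table in memory,
# then reads the data itself -> two memory accesses.
no_tlb = 2 * MEM_LATENCY

# With a TLB: a hit replaces the page-table read with a fast lookup;
# a miss still pays for the page-table read plus the data access.
with_tlb = (TLB_LATENCY
            + TLB_HIT_RATE * MEM_LATENCY
            + (1 - TLB_HIT_RATE) * 2 * MEM_LATENCY)

print(f"without TLB: {no_tlb} cycles per access")
print(f"with TLB:    {with_tlb:.0f} cycles per access")
```

Even with these crude numbers, the TLB roughly halves the average access cost because the translation almost never touches memory.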

SLIDE 4

Translation-Lookaside Buffer (TLB)

[Diagram: TLB entries map virtual page numbers to physical pages 0 through N-1 in main memory]

  • H. Stone, “High Performance Computer Architecture,” AW 1993
SLIDE 5

TLB and Page Table

SLIDE 6

Translation Look-Aside Buffers

  • TLB is usually small, typically 32-512 entries
  • Like any other cache, the TLB can be fully associative, set associative, or direct mapped

[Diagram: the processor sends a virtual address to the TLB; a hit yields the physical address for the cache, a miss consults the page table; a page fault or protection violation invokes the OS fault handler; data is returned from the cache, main memory, or disk]
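The three organizations differ in how many TLB entries a given virtual page number (VPN) may occupy. A minimal sketch, assuming a hypothetical 64-entry TLB:

```python
# Which TLB entries a VPN may occupy under each organization,
# for a hypothetical 64-entry TLB (sizes are assumptions for illustration).
TLB_ENTRIES = 64

def direct_mapped_slot(vpn):
    # Exactly one candidate entry, selected by the low VPN bits.
    return vpn % TLB_ENTRIES

def set_associative_slots(vpn, ways=4):
    # The VPN selects one set; any of the `ways` entries in it may be used.
    sets = TLB_ENTRIES // ways
    first = (vpn % sets) * ways
    return list(range(first, first + ways))

def fully_associative_slots(vpn):
    # Any entry may hold the translation; hardware compares all tags in parallel.
    return list(range(TLB_ENTRIES))

print(direct_mapped_slot(0x12345))            # one candidate entry
print(set_associative_slots(0x12345))         # four candidate entries
print(len(fully_associative_slots(0x12345)))  # all 64 entries
```

More associativity reduces conflicts between pages that share index bits, at the cost of comparing more tags per lookup.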

SLIDE 7

Steps in Memory Access - Example

[Diagram: steps in a memory access]

  • The CPU issues a virtual address to the TLB
  • TLB hit: the physical address goes to the cache; TLB miss: the page table in main memory supplies the translation
  • Page fault: the OS fault handler brings the page in from disk
  • Cache hit: data is returned; cache miss: the block is fetched from main memory
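The flow above can be sketched as a tiny software model; the structures and names below are illustrative, not from the slides:

```python
# Simplified model of the memory-access flow: TLB -> page table -> cache/memory.
# All structures and names are illustrative, not from the slides.
PAGE_SIZE = 4096

tlb = {}         # virtual page number -> physical page number (small, fast)
page_table = {}  # full mapping, lives in main memory; absent = page on disk

def translate(vaddr):
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    if vpn in tlb:                       # TLB hit: no page-table access needed
        ppn = tlb[vpn]
    elif page_table.get(vpn) is not None:
        ppn = page_table[vpn]            # TLB miss: walk the page table in memory
        tlb[vpn] = ppn                   # fill the TLB for next time
    else:
        # Page fault: the OS would fetch the page from disk and update the
        # page table; here we just signal it.
        raise RuntimeError("page fault: OS fault handler runs")
    return ppn * PAGE_SIZE + offset      # physical address, sent to the cache

page_table[5] = 42
print(hex(translate(5 * PAGE_SIZE + 0x10)))  # TLB miss, then entry is filled
print(hex(translate(5 * PAGE_SIZE + 0x20)))  # now a TLB hit
```

Note that only the offset passes through unchanged; the page number is what gets translated.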

SLIDE 8

DECStation 3100 / MIPS R2000 address translation and cache access:

  • Virtual address (32 bits): 20-bit virtual page number (bits 31-12) + 12-bit page offset (bits 11-0)
  • TLB: 64 entries, fully associative; each entry holds valid and dirty bits, a 20-bit tag (virtual page number), and a 20-bit physical page number
  • Physical address: 20-bit physical page number + 12-bit page offset
  • Cache: 16K entries, direct mapped; the physical address splits into a 16-bit tag, a 14-bit cache index, and a 2-bit byte offset; a valid bit and tag comparison signal a cache hit, delivering 32-bit data
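The field widths above can be checked with a little bit-slicing; the example addresses below are arbitrary, chosen only to exercise each field:

```python
# Splitting addresses into the DECStation 3100 / MIPS R2000 fields described
# above: 32-bit addresses, 4 KB pages, a 16K-entry direct-mapped cache of
# 4-byte words. Example addresses are arbitrary.

def va_fields(va):
    return {"vpn": va >> 12,             # 20-bit virtual page number (bits 31-12)
            "page_offset": va & 0xFFF}   # 12-bit page offset (bits 11-0)

def pa_fields(pa):
    return {"tag": pa >> 16,               # 16-bit cache tag
            "index": (pa >> 2) & 0x3FFF,   # 14-bit cache index (16K entries)
            "byte_offset": pa & 0x3}       # 2-bit byte offset (4-byte words)

va = 0xDEADBEEF
print(va_fields(va))

# A physical address is a 20-bit physical page number plus the unchanged offset:
ppn = 0x12345
pa = (ppn << 12) | (va & 0xFFF)
print(pa_fields(pa))
```

The widths add up in both views: 20 + 12 = 32 for translation, and 16 + 14 + 2 = 32 for the cache lookup.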

SLIDE 9

Real Stuff: Pentium Pro Memory Hierarchy

  • Address size: 32 bits (VA, PA)
  • VM page size: 4 KB
  • TLB organization: separate i- and d-TLBs (i-TLB: 32 entries, d-TLB: 64 entries); 4-way set associative; LRU approximated; hardware handles misses
  • L1 cache: 8 KB, separate i,d; 4-way set associative; LRU approximated; 32-byte block; write back
  • L2 cache: 256 or 512 KB
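One consequence of these entry counts is the TLB “reach”: how much memory the translations resident in the TLB cover at once. The calculation below uses the entry counts and 4 KB page size from this slide; the notion of reach itself is standard background, not slide material:

```python
# TLB "reach" for the Pentium Pro TLBs: entries * page size.
PAGE_SIZE = 4 * 1024          # 4 KB pages (from the slide)

i_tlb_reach = 32 * PAGE_SIZE  # 32-entry i-TLB
d_tlb_reach = 64 * PAGE_SIZE  # 64-entry d-TLB

print(f"i-TLB reach: {i_tlb_reach // 1024} KB")  # 128 KB
print(f"d-TLB reach: {d_tlb_reach // 1024} KB")  # 256 KB
```

A working set larger than the reach forces TLB misses even when the pages are all resident in memory, which is one motivation for the larger 2/4 MB pages on the later processors below.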

SLIDE 10

Intel “Nehalem” quad-core processor

  • Each core has private 32-KB instruction and 32-KB data caches and a 512-KB L2 cache; the four cores share an 8-MB L3 cache
  • Each core also has a two-level TLB
  • 13.5 × 19.6 mm die; 731 million transistors; two 128-bit memory channels

SLIDE 11

Comparing Intel’s Nehalem to AMD’s Opteron X4:

  • Virtual address: 48 bits (both)
  • Physical address: Nehalem 44 bits; Opteron 48 bits
  • Page sizes: 4KB and 2/4MB (both)
  • L1 TLB (per core): Nehalem has a 128-entry L1 I-TLB and a 64-entry L1 D-TLB, both 4-way with LRU replacement; Opteron has 48-entry L1 I- and D-TLBs, both fully associative with LRU replacement
  • L2 TLB (per core): Nehalem has a single 512-entry L2 TLB, 4-way with LRU replacement; Opteron has 512-entry L2 I- and D-TLBs, both 4-way with round-robin LRU
  • TLB misses: handled in hardware (both)

SLIDE 12

Further Comparison

  • L1 caches (per core):
    • Nehalem: L1 I-cache 32KB, 64-byte blocks, 4-way, approx LRU, hit time n/a; L1 D-cache 32KB, 64-byte blocks, 8-way, approx LRU, write-back/allocate, hit time n/a
    • Opteron X4: L1 I-cache 32KB, 64-byte blocks, 2-way, LRU, hit time 3 cycles; L1 D-cache 32KB, 64-byte blocks, 2-way, LRU, write-back/allocate, hit time 9 cycles
  • L2 unified cache (per core): Nehalem 256KB, 64-byte blocks, 8-way, approx LRU, write-back/allocate, hit time n/a; Opteron 512KB, 64-byte blocks, 16-way, approx LRU, write-back/allocate, hit time n/a
  • L3 unified cache (shared): Nehalem 8MB, 64-byte blocks, 16-way, write-back/allocate, hit time n/a; Opteron 2MB, 64-byte blocks, 32-way, write-back/allocate, hit time 32 cycles

SLIDE 13

Summary

  • Virtual memory allows the appearance of a main memory that is larger than what is physically present
  • Virtual memory can be shared by multiple applications
  • The page table indicates how to translate from virtual to physical addresses
  • The TLB speeds up access to virtual memory
    • generally set associative or fully associative
    • much smaller than main memory
  • Next time: putting it all together (cache, TLB, virtual memory)