SLIDE 1

Write Reference Characteristics and SCM-based Memory Management

Hyokyung Bahn
NVRAMOS 2011, April 19, 2011
EWHA WOMANS UNIVERSITY

SLIDE 2

Storage Class Memory (SCM)

  • SCM Characteristics
    – Nonvolatile, byte-addressable
    – e.g., PCM (Phase Change Memory), FeRAM, STT-RAM (MRAM)
  • SCM Perspectives
    – Widely deployed in data centers by 2012
    – Promising replacement for HDD by 2020
      • No more than 3-5x the cost of HDD (< $1/GB in 2012)
      • < 1 usec access time
      • > 10^5 read ops. per second
      • > 100 MB/sec
      • 10x lower power than HDD

(IBM Almaden Research Center, USENIX FAST Tutorial, 2009)

SLIDE 3

Why does DRAM main memory need to change?

  • Multi-core systems, more concurrency, and larger working sets: an enormous need for increased memory
    – e.g., 4GB addressable by 32-bit processors vs. 16EB by 64-bit processors (1E = 10^18)
  • Density (cost/bit): scaling DRAM to smaller technology nodes is a challenge
  • Power consumption: main memory accounts for 40% of total system energy

[Figure: power consumption in watts (200-1200) of CPU, DRAM memory, motherboard, disk, fan, and NIC. Source: Intel Labs, 2008]

SLIDE 4

Phase Change Memory (PCM)

                             DRAM (DDR3, 1.35V)   PCM (High-Speed PCM '10)
  Non-Volatile               No                   Yes
  Density                    1X                   2X ~ 4X
  Power     Read (J/GB)      0.7                  1
  (Energy)  Write (J/GB)     1.1                  6
            Static (mW/GB)   100                  1

SLIDE 5

PCM Challenges

                             DRAM (DDR3, 1.35V)   PCM (High-Speed PCM '10)
  Non-Volatile               No                   Yes
  Density                    1X                   2X ~ 4X
  Power     Read (J/GB)      0.7                  1
            Write (J/GB)     1.1                  6
            Idle (mW/GB)     100                  1
  Latency   Read             1X                   1X ~ 2X
            Write            1X                   7X ~ 8X
  Endurance **               10^15                10^7 ~ 10^8

** SRAM 10^15, STT-RAM 10^15, FeRAM 10^12, SLC Flash 10^5, MLC Flash 10^4

SLIDE 6

Memory & Storage Architectures

  • L1 I-cache / D-cache: SRAM
  • L2 cache: SRAM or STT-RAM
  • Main memory: DRAM or PCM
  • Secondary storage: HDD or Flash SSD
  • STT-RAM, PCM, Flash SSD: write is slower than read

SLIDE 7

Estimating Future Writes

  • 1. Find a good estimator for future write references
    – Issue i. Should read and write history be considered together, or write history alone?
    – Issue ii. Which is better: temporal locality or frequency-based estimation?
  • 2. Store pages likely to be re-written on DRAM.
  • 3. Compare the resulting estimators (sketched in code after this list):
    – Temporal locality, write history only: rank by write recency
    – Temporal locality, total (read + write) history: rank by (read + write) recency
    – Frequency, write history only: rank by write frequency
    – Frequency, total (read + write) history: rank by (read + write) frequency
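As a concrete illustration, here is a minimal sketch of the four estimators, assuming a trace of (page, op) tuples with op in {"r", "w"}; all names and the toy trace are illustrative, not from the slides:

```python
# Hypothetical sketch of the four estimators: pages ranked highest are
# the ones predicted most likely to be re-written, hence kept in DRAM.
from collections import Counter

def rank_by_recency(trace, writes_only):
    """Temporal locality: most recently referenced page ranks first."""
    last_seen = {}
    for t, (page, op) in enumerate(trace):
        if op == "w" or not writes_only:
            last_seen[page] = t
    return sorted(last_seen, key=last_seen.get, reverse=True)

def rank_by_frequency(trace, writes_only):
    """Frequency: most often referenced page ranks first."""
    counts = Counter(page for page, op in trace
                     if op == "w" or not writes_only)
    return [page for page, _ in counts.most_common()]

trace = [("A", "r"), ("B", "w"), ("A", "w"), ("C", "r"),
         ("B", "w"), ("A", "r"), ("C", "w")]
print(rank_by_recency(trace, writes_only=True))     # ['C', 'B', 'A']
print(rank_by_recency(trace, writes_only=False))    # ['C', 'A', 'B']
print(rank_by_frequency(trace, writes_only=True))   # ['B', 'A', 'C']
print(rank_by_frequency(trace, writes_only=False))  # ['A', 'B', 'C']
```
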
SLIDE 8

Virtual Memory Traces Used

  Workload     Contents               Footprint (KB)  Data reads : writes  Total accesses  Instr. read  Data read   Data write
  xmms         MP3 player             8,052           1 : 7.79             1,169,310       65,413       125,653     978,244
  gqview       Image viewer           7,428           1 : 2.01             611,142         93,653       172,044     345,445
  shotwell     Photo management S/W   88,228          1 : 1.04             15,090,070      528,549      7,124,101   7,437,420
  gnuplot      Graphing utility       21,132          1 : 1.10             220,240         47,551       82,110      90,579
  firefox      Web browser            101,520         1.88 : 1             12,648,471      2,392,952    6,690,045   3,565,474
  freecell     Game                   10,084          5.26 : 1             490,700         114,750      315,906     60,044
  gedit        Word processor         14,460          7.16 : 1             1,736,440       652,154      951,450     132,836
  kghostview   PDF file viewer        17,388          10.26 : 1            1,548,820       373,260      1,062,008   103,552

SLIDE 9

Temporal Locality

  • Using both read & write history estimates future writes better within the top 10 rankings.
  • Beyond the top rankings, using write history alone may be a better estimator of future writes.
  • Overall, both estimators show similar results.

SLIDE 10

Temporal Locality

  • Temporal locality for relatively write-intensive workloads is rather irregular (ranking inversion).
  • Temporal locality alone may not be sufficient to estimate the likelihood of future writes.

SLIDE 11

Why is the temporal locality of writes irregular?

  • Maybe due to the write-back operation of cache memory
    – Page references observed at the VM layer contain only cache-missed ones
    – In case of a read:
      • cache-missed requests are directly propagated to the VM layer
      → even though temporal locality becomes weaker, it is not damaged seriously
    – In case of a write:
      • cache-missed requests are not propagated directly to the VM layer, but just written to the cache memory
      • requests are delivered to the VM layer only after being evicted from cache memory
      • the time a write request arrives ≠ the time the request is delivered to the VM layer

[Figure: a cache-missed read request A goes straight to main memory, while a write request A is absorbed by the cache and reaches main memory later as the eviction of page B]
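A toy write-back cache (an assumed sketch, not the slide's exact model) makes the timing gap concrete: reads that miss reach memory immediately, while a write reaches memory only when its page is evicted:

```python
from collections import OrderedDict

class WriteBackCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()       # page -> dirty flag, in LRU order

    def access(self, time, page, op, memory_log):
        if page in self.lines:           # hit: nothing reaches memory
            self.lines.move_to_end(page)
            self.lines[page] = self.lines[page] or (op == "w")
            return
        if op == "r":                    # read miss propagates immediately
            memory_log.append((time, page, "read"))
        if len(self.lines) >= self.capacity:
            victim, dirty = self.lines.popitem(last=False)
            if dirty:                    # the write reaches memory only now
                memory_log.append((time, victim, "write-back"))
        self.lines[page] = (op == "w")

log = []
cache = WriteBackCache(capacity=2)
for t, (page, op) in enumerate([("A", "w"), ("B", "r"), ("C", "r")]):
    cache.access(t, page, op, log)
print(log)   # A's write at t=0 reaches memory only at t=2, as a write-back
```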

SLIDE 12

Frequency

  • Write frequency alone is more effective than frequency counted over both reads and writes.

SLIDE 13

Frequency

  • Write frequency alone is more effective than frequency counted over both reads and writes.

SLIDE 14

Temporal Locality vs. Frequency

  • Frequency is more effective than temporal locality in most cases.
  • However, at least the most recent reference history must be considered.

SLIDE 15

Temporal Locality vs. Frequency

  • Frequency is more effective than temporal locality in most cases.
  • However, at least the most recent reference history must be considered.

SLIDE 16

Memory Architecture

The write latency & endurance problems of PCM → use a small amount of DRAM along with PCM. Two architectures are possible:

DRAM as a last-level cache in front of PCM main memory
  • DRAM cache miss → PCM access
  • The DRAM cache is hidden from the OS
  • H/W implementation: fully associative placement is difficult
  • Collisions may degrade space efficiency

Hybrid main memory (DRAM and PCM in a single physical address space)
  • Address translation through the page table
  • DRAM can be managed by the OS
  • Fully associative placement is possible
  • Limited reference information (e.g., a reference bit)

SLIDE 17

Comparison of Cache Replacement Problems in Each Layer

                              Cache Memory             Virtual Memory System   File I/O Buffer Cache
  Who manages hits?           H/W                      H/W                     OS
  Who manages misses?         H/W                      OS                      OS
  Representative algorithms   Random / LRU             CLOCK                   LRU
  Replacement manager         H/W                      OS                      OS
  How to implement?           H/W implementation       S/W implementation      S/W implementation
                              (logical timestamp or    supported by H/W
                              bit shifting for each    (reference bit)
                              reference in a set)

[Figure: an LRU list with MRU/LRU positions, and a CLOCK hand sweeping reference bits (R:0 / R:1) to select a victim]
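For reference, a minimal sketch of the CLOCK policy named in the table (illustrative code; in this simplified variant, newly loaded pages start with R=0):

```python
# CLOCK: a circular buffer of frames with one reference bit per page.
# On a fault the hand gives pages with R=1 a second chance (clearing
# the bit) and evicts the first page it finds with R=0.
class Clock:
    def __init__(self, nframes):
        self.frames = [None] * nframes
        self.refbit = [0] * nframes
        self.hand = 0

    def access(self, page):
        if page in self.frames:                  # hit: set reference bit
            self.refbit[self.frames.index(page)] = 1
            return None
        while self.refbit[self.hand]:            # second chance
            self.refbit[self.hand] = 0
            self.hand = (self.hand + 1) % len(self.frames)
        victim = self.frames[self.hand]          # R=0: evict
        self.frames[self.hand] = page
        self.hand = (self.hand + 1) % len(self.frames)
        return victim

clock = Clock(3)
for p in ["A", "B", "C", "A", "D"]:
    if (victim := clock.access(p)) is not None:
        print(f"fault on {p}: evicted {victim}")  # fault on D: evicted B
```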

SLIDE 18

CLOCK-DWF (CLOCK with Dirty bits and Write Frequency)

  • Allocate read-intensive pages to PCM, write-intensive pages to DRAM.
  • Read path: (1) the CPU reads page A; (2) on a read page fault, the page is fetched from storage (HDD/Flash) into PCM via the page table; (3) when PCM is full, the CLOCK algorithm selects a PCM victim.

SLIDE 19

CLOCK-DWF (CLOCK with Dirty bits and Write Frequency)

  • Allocate read-intensive pages to PCM, write-intensive pages to DRAM.
  • Write path: (1) the CPU writes page A; (2) a write operation on a PCM-resident page raises (2)' an intentional write page fault (a minor fault), and the page migrates to DRAM; (3) when DRAM is full, CLOCK-DWF selects a DRAM victim to demote to PCM; (4) when PCM is full, CLOCK selects a PCM victim.
  • CLOCK-DWF generates an intentional page fault (minor fault) on writes to PCM-resident pages.
  • DRAM: dirty pages only; PCM: clean & dirty pages.
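The write path can be sketched as follows (a simplified model with illustrative names; plain list order stands in for the CLOCK hands):

```python
# A write to a PCM-resident page raises a minor fault and migrates the
# page to DRAM; a full DRAM demotes a victim to PCM; a full PCM evicts.
def handle_write(page, dram, pcm, dram_capacity, pcm_capacity):
    if page in dram:
        return                               # write hits DRAM directly
    if page in pcm:                          # (2)' intentional minor fault
        pcm.remove(page)
    if len(dram) >= dram_capacity:           # (3) DRAM full: demote victim
        victim = dram.pop(0)                 # stand-in for CLOCK-DWF choice
        if len(pcm) >= pcm_capacity:         # (4) PCM full: evict to storage
            pcm.pop(0)                       # stand-in for CLOCK choice
        pcm.append(victim)
    dram.append(page)                        # dirty page now lives in DRAM

dram, pcm = ["X"], ["A", "B"]
handle_write("A", dram, pcm, dram_capacity=1, pcm_capacity=2)
print(dram, pcm)   # ['A'] ['B', 'X']: A migrated to DRAM, X demoted to PCM
```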

SLIDE 20

CLOCK-DWF (CLOCK with Dirty bits and Write Frequency)

  • Temporal locality is captured by the dirty bit within the present rotation.
  • Frequency is captured by the frequency count and the overlooked-rotation count across rotations (1 rotation, 2 rotations, ..., n rotations).
  • The frequency count does not indicate the real write frequency but the reset count of the dirty bit; this filters out correlated references.

[Figure: pages A, B, C tracked across successive CLOCK rotations with their dirty bits, frequency counts, and overlooked-rotation counts]

SLIDE 21

CLOCK-DWF (CLOCK with Dirty bits and Write Frequency)

  • Each page in DRAM has a dirty bit, a frequency count, and an overlooked-rotation count.
    – Dirty bit: set to 1 when a write operation occurs; reset to 0 by CLOCK-DWF
    – Frequency count: incremented when the dirty bit becomes zero
    – Overlooked-rotation count: keeps track of how many times the page has been overlooked

Victim selection:

    if dirty_bit(page) is 0
        if frequency(page) > Threshold and overlooked_rotation(page) < Expiration
            overlooked_rotation(page)++;
        else
            set dirty_bit(page) to 1 and evict it
        end if
    else  /* dirty_bit(page) is 1 */
        dirty_bit(page) = 0;
        frequency(page)++;
        overlooked_rotation(page) = 0;
    end if
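A runnable rendering of this victim-selection loop (the Threshold and Expiration values are illustrative; the Page fields mirror the per-page state above):

```python
from dataclasses import dataclass

THRESHOLD = 2    # frequency needed to count as "hot" (illustrative)
EXPIRATION = 3   # rotations a hot page may be overlooked (illustrative)

@dataclass
class Page:
    name: str
    dirty: bool = False
    frequency: int = 0
    overlooked: int = 0

def select_victim(pages, hand):
    """Advance the CLOCK hand over DRAM pages until a victim is chosen."""
    while True:
        page = pages[hand]
        if not page.dirty:
            if page.frequency > THRESHOLD and page.overlooked < EXPIRATION:
                page.overlooked += 1           # hot page: overlook it
            else:
                page.dirty = True              # mirrors "set dirty_bit to 1
                return page, (hand + 1) % len(pages)   # and evict it"
        else:                                  # dirty: one more rotation
            page.dirty = False
            page.frequency += 1
            page.overlooked = 0
        hand = (hand + 1) % len(pages)

pages = [Page("A", dirty=True), Page("B", frequency=5), Page("C")]
victim, hand = select_victim(pages, 0)
print(victim.name)   # "C": A was dirty, B is hot (5 > 2), so C is evicted
```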

SLIDE 22

Parameter Setting

hot_page_threshold
  • Determines the number of writes required for a page to be considered a hot page.
  • Updated as a running average over the write-frequency counts of pages p:

    hot_page_threshold ← { hot_page_threshold × (SIZE_DRAM − 1) + frequency(p) } / SIZE_DRAM
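In code this is a simple running average; the assumption here (not stated explicitly on the slide) is that the update is applied with the frequency count of each page p as it leaves DRAM:

```python
SIZE_DRAM = 1024   # number of DRAM page frames (illustrative)

def update_threshold(threshold, freq_p):
    # hot_page_threshold <- {threshold*(SIZE_DRAM-1) + frequency(p)} / SIZE_DRAM
    return (threshold * (SIZE_DRAM - 1) + freq_p) / SIZE_DRAM

threshold = 4.0
for freq in [2, 3, 8, 1]:          # frequency counts of successive pages p
    threshold = update_threshold(threshold, freq)
print(round(threshold, 4))         # drifts slowly toward recent counts
```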

long-term frequency period
  • Number of rotations for which a hot page can be overlooked despite not being re-written.
  • When the memory size becomes large:
    – the optimal value becomes small
    – performance is less sensitive to it

[Figure: normalized PCM write count (0.70-1.00) vs. long-term frequency period (1-32) for the gqview and xmms traces, with DRAM-size series from 5% to 80%]

SLIDE 23

Experimental Setup

Baseline configuration
  • Page size: 4KB
  • Processor: 4 cores, each running at 2.66GHz
  • L1 I-cache & D-cache: 32KB, 64-byte lines, 8-way set associative
  • L2 cache: 6MB, 64-byte lines, 24-way set associative
  • Main memory: 4GB, 8 ranks of 8 banks each
  • Hard disk drive: 5ms average access time

                         DRAM              PCM
  Read / Write Latency   50 / 50 ns        50 or 100 / 350 ns
  Read / Write Energy    0.1 / 0.1 nJ/bit  0.2 / 1.0 nJ/bit
  Static Power           1 W/GB            0.1 W/GB
  Endurance              N/A               10^7

SLIDE 24

CLOCK-DWF vs. CLOCK

PCM write count
  • x-axis: DRAM size as a percentage of the maximum write memory usage of the workload
  • y-axis: PCM writes of CLOCK-DWF normalized to those of CLOCK

[Figure: normalized PCM write counts of CLOCK-DWF vs. CLOCK over DRAM sizes of 20-100% for (a) gqview, (b) gnuplot, (c) xmms, (d) shotwell, (e) firefox, (f) freecell, (g) gedit, (h) kghostview]

SLIDE 25

CLOCK-DWF vs. DRAM Cache

PCM write count
  • DRAM cache: 16-way set associative, LRU
  • x-axis: DRAM size relative to the total memory footprint
  • y-axis: PCM writes normalized to those of the DRAM cache

[Figure: normalized PCM write counts of CLOCK-DWF vs. the DRAM cache over DRAM sizes of 20-100% for (a) gqview, (b) gnuplot, (c) xmms, (d) shotwell, (e) firefox, (f) freecell, (g) gedit, (h) kghostview]

SLIDE 26

PCM Lifetime

  • The 8 workloads are executed sequentially and repeatedly until the write limit of PCM is reached.
  • DRAM cache → CLOCK-DWF at 30% memory size: lifetime extended from 4.7 to 6.7 years.
  • CLOCK → CLOCK-DWF at 40~80% memory size: lifetime extended by 5.8%.

[Figure: normalized PCM lifetime of the DRAM cache, CLOCK-DWF, and CLOCK over memory sizes of 1-100%]

SLIDE 27

CLOCK-DWF vs. Conventional System

Average memory access time
  • x-axis: DRAM size of CLOCK-DWF
  • y-axis: performance normalized to the conventional (DRAM-only) system

Performance degradation
  • Case (a), read_access_time_PCM = read_access_time_DRAM: smaller than 10%
  • Case (b), read_access_time_PCM = 2 × read_access_time_DRAM: read-intensive 74.6%, write-intensive 31.8%

[Figure: average memory access time normalized to the conventional system vs. DRAM size (1-100%) for the eight workloads, under cases (a) and (b)]

SLIDE 28

CLOCK-DWF vs. Conventional System

Total elapsed time
  • CLOCK-DWF: DRAM:PCM = 1:9
  • Conventional system: DRAM only
  • x-axis: memory size (%)
  • y-axis: performance normalized to the conventional system

Performance degradation
  • Less than 8%, due to the large page fault overhead.

[Figure: total elapsed time normalized to the conventional system vs. memory size (1-100%) for the eight workloads, under (a) read_access_time_PCM = read_access_time_DRAM and (b) read_access_time_PCM = 2 × read_access_time_DRAM]

SLIDE 29

CLOCK-DWF vs. Conventional System

Power consumption

                         DRAM              PCM
  Read / Write Energy    0.1 / 0.1 nJ/bit  0.2 / 1.0 nJ/bit
  Static Power           1 W/GB            0.1 W/GB

  • Power savings become larger as the memory size increases.
  • Static power accounts for a large portion of total power.

[Figure: power consumption normalized to the conventional system vs. memory size (1-100%)]

SLIDE 30

Summary

                            CLOCK-DWF             CLOCK                 DRAM Cache
  Memory architecture       DRAM + PCM            DRAM + PCM            DRAM cache,
                            main memory           main memory           PCM main memory
  DRAM usage                write                 write                 read / write
  DRAM replacement policy   CLOCK-DWF             CLOCK                 LRU (16-way
                            (fully associative)   (fully associative)   set associative)
  Temporal locality         O                     O                     O
  Frequency                 O                     X                     X
  Write counts on PCM       0.65~0.24             0.76~0.57             1

SLIDE 31

Access Information

If you want to cite this material, please use the contact information below.

  • http://home.ewha.ac.kr/~bahn
  • bahn@ewha.ac.kr