Speeding Up Crossbar Resistive Memory by Exploiting In-memory Data Patterns - PowerPoint PPT Presentation


SLIDE 1

Speeding Up Crossbar Resistive Memory by Exploiting In-memory Data Patterns

Wen Wen, Lei Zhao, Youtao Zhang, Jun Yang. March 12, 2018

SLIDE 2

Executive Summary

  • Problems: performance and reliability of write operations
  • The large sneak currents and IR drop issues in crossbar ReRAM
  • Proposed solutions: speeding up RESET operations based on data patterns
  • Profiling the number of bitline LRS cells by exploiting the intrinsic in-memory processing capability of crossbar ReRAM

  • Data compression and row address dependent layout to reduce bitline LRS cells
  • Contributions
  • Correlation between the RESET latency and the number of LRS cells on selected bitlines
  • A novel profiling technique to dynamically track the bitline data patterns
  • Results
  • Performance: 20.5% improvement over baseline, 14.2% over the state-of-the-art
  • Dynamic energy: 15.7% less than baseline, 7.6% less than state-of-the-art

SLIDE 3

Outline

  • Background and Motivation
  • Design Details
  • Low Overhead Runtime Profiling
  • Reduce Bitline LRS Cells
  • Evaluation
  • Methodologies
  • Experimental Results
  • Conclusion

SLIDE 4

Outline

  • Background and Motivation
  • Design Details
  • Low Overhead Runtime Profiling
  • Reduce Bitline LRS Cells
  • Evaluation
  • Methodologies
  • Experimental Results
  • Conclusion

SLIDE 5

ReRAM Cell

ReRAM cell structure and two resistance states

[Figure: cell stack with Top Metal Layer, Metal Oxide, and Bottom Metal Layer; labels for Oxygen Ion and Oxygen Vacancy. High Resistance State (HRS, logic "0") vs. Low Resistance State (LRS, logic "1").]

SET: "High" to "Low" / RESET: "Low" to "High"

SLIDE 6

ReRAM Crossbar

ReRAM array structures (crossbar structures):

[Figure: three array structures. "1T1R": wordline, bitline, sourceline, and ReRAM cell with an access transistor. "1D1R": wordline, bitline, and ReRAM cell in series with a diode selector. "0T1R": wordline, bitline, and ReRAM cell only.]

✓ Smallest 4F² planar cell size, low fabrication cost, and better scalability.
✗ Sneak currents and IR drop.

SLIDE 7

Sneak Currents in Crossbar ReRAM

  • Diode selectors help but cannot eliminate sneak currents
  • Sneak currents lead to a serious IR drop issue
  • They hurt energy efficiency, performance, and write reliability
  • The slower RESET operation is the performance bottleneck
  • SET takes a shorter time than RESET [Xu et al'HPCA15, Zhang et al'DATE16]

[Figure: a crossbar with the selected cell biased at V and half-selected cells at 1/2V; sneak currents flow through the half-selected cells.]

SLIDE 8

How does IR drop affect RESET latency?

  • RESET latency is highly sensitive to the voltage delivered to the selected cell, with an inverse exponential correlation:

    t × e^(k·Vd) = C

  • t: RESET switching time; Vd: voltage drop across (i.e., voltage delivered to) the selected cell
  • C and k are experimentally fitted constants
  • An IR drop that takes 0.4V away from the cell results in a 10x RESET latency increase [Govoreanu et al'IEDM11] (a worked example follows below)
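To make the exponential model concrete, here is a minimal Python sketch. The constants C and k are illustrative values chosen only so that losing 0.4V of cell voltage costs roughly 10x latency, matching the Govoreanu et al. figure quoted above; they are not fitting constants from the paper.

    import math

    # Illustrative constants only (not from the paper): pick k so that a 0.4 V loss
    # in cell voltage gives ~10x latency, and C so that t = 10 ns at the full 3 V.
    k = math.log(10) / 0.4            # ~5.76 per volt
    C = 10e-9 * math.exp(k * 3.0)

    def reset_latency(v_cell):
        """RESET switching time t = C * exp(-k * Vd) for a cell seeing voltage v_cell."""
        return C * math.exp(-k * v_cell)

    print(reset_latency(3.0))   # ~1.0e-08 s: no IR drop
    print(reset_latency(2.6))   # ~1.0e-07 s: 0.4 V lost to IR drop -> ~10x slower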

SLIDE 9

Intuitive Thoughts

Facts:
1. During RESET, half-selected cells behave as resistive devices.
2. Under the same voltage stress, a half-selected cell in LRS draws a larger sneak current than one in HRS.
3. RESET operations conservatively use the worst-case access latency of all cells in ReRAM arrays.

Observations:
1. The number of half-selected cells in LRS affects RESET latency.
2. Dynamically profiling and tracking runtime data patterns may avoid using the worst-case RESET latency for all cells.

TODO List:
1. Explore the correlation between RESET latency and the number of bitline LRS cells.
2. Need a runtime profiler to dynamically track bitline data patterns.
3. Need to reduce the number of LRS cells on bitlines.

SLIDE 10

RESET latency vs. # of LRS cells

  • The more LRS cells there are on the bitline, the larger the IR drop the sneak current brings, and the longer the RESET operation takes.

[Figure: RESET latency (ns, axis 50-250) and voltage drop (V, axis 2.7-3.0) plotted against the bitline LRS cell percentage (12.5%-100%); latency rises and voltage drop falls as the LRS percentage increases.]

SLIDE 11

RESET latency vs. # of LRS cells

  • This impact diminishes as the selected row gets closer to the write driver (a rough illustration follows below).
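A deliberately crude first-order sketch of why row position matters, modeling the bitline as a simple chain of wire segments between the write driver and the selected row. The paper uses a full HSPICE crossbar model; only the 2.82 Ohm per-segment value comes from the backup slide on HSPICE parameters, the rest is hypothetical.

    R_WIRE = 2.82   # ohms of bitline wire between adjacent cells (see backup Slide 35)

    def wire_drop_at_row(row, cell_currents):
        """Rough estimate of the IR drop accumulated on the bitline wire between the
        write driver (node 0 here) and the cell at `row`. cell_currents[j] is the
        current drawn at node j; the segment entering node j carries everything drawn
        at nodes j and beyond. More LRS half-selected cells and a longer distance to
        the driver both increase the drop. (In the slides' layout, higher row
        addresses sit closer to the write drivers, hence their lower latency.)"""
        return sum(R_WIRE * sum(cell_currents[j:]) for j in range(1, row + 1))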

SLIDE 12

Outline

  • Background and Motivation
  • Design Details
  • Low Overhead Runtime Profiling
  • Reduce Bitline LRS Cells
  • Evaluation
  • Methodologies
  • Experimental Results
  • Conclusion

SLIDE 13

Low Overhead Runtime Profiling (1)

[Figure: 64 mats (Mat0-Mat63), each with its own wordline decoders and a shared ADC & comparator for profiling; 64B cache lines a0 and a1 are striped across the mats of one bitline-sharing-set; each mat produces a 3-bit value, and the update is W-Flag = MAX{64 3-bit values}.]

  • The worst-case bitline data pattern within one bitline-sharing-set determines the optimal RESET latency (a small sketch of the W-Flag update follows below)
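A minimal sketch of the W-Flag update, assuming (as in the figure) one 3-bit counter per mat and 64 mats per bitline-sharing-set; the function name is hypothetical.

    def update_w_flag(mat_counters):
        """mat_counters: the 64 3-bit values (0-7) reported by the profiling circuits,
        one per mat in a bitline-sharing-set. The worst (largest) LRS level across the
        set dictates the safe RESET latency, so W-Flag is simply their maximum."""
        assert len(mat_counters) == 64 and all(0 <= c <= 7 for c in mat_counters)
        return max(mat_counters)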

SLIDE 14

Low Overhead Runtime Profiling (1)

[Figure, repeated from Slide 13: per-mat profiling units with a shared ADC & comparator; W-Flag = MAX{64 3-bit values} over the bitline-sharing-set.]

  • The worst-case bitline data pattern within one bitline-sharing-set determines the optimal RESET latency

[Figure: profiling read path. Vread is applied through a transmission-gate mux; each cell contributes I1 = Vread/Ron (LRS) or I2 = Vread/Roff (HRS), and the bitline accumulates I = I1 + I2 as a dot-product operation. Sample-and-hold (S/H) circuits feed the shared ADC and comparators, which produce the 3-bit counter value; other columns are profiled in turn.]
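The current accumulation itself is the crossbar's intrinsic dot-product behavior: every cell biased at Vread adds Vread/R to the shared bitline. A small sketch, with hypothetical Ron/Roff/Vread values rather than the paper's device parameters:

    R_ON, R_OFF = 10e3, 1e6    # hypothetical LRS / HRS resistances (ohms)
    V_READ = 0.3               # hypothetical profiling read voltage (V)

    def bitline_current(cell_states):
        """Aggregated bitline current for a column of cells (1 = LRS, 0 = HRS):
        effectively a dot product of the read-voltage vector with the conductances."""
        return sum(V_READ / (R_ON if s else R_OFF) for s in cell_states)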

SLIDE 15

Low Overhead Runtime Profiling (2)

  • Aggregated currents are first converted into digital counter values that represent LRS cell percentages.

[Figure: current to 3-bit value mapping of LRS percentage. Bitline current levels of 0.089, 0.507, 0.704, 0.867, 1.0326, 1.2345, 1.5043, 1.9166, and 2.7122 mA span LRS percentages from 12.5% to 100%; the bands between levels, with safeguarding areas, map to counter values 000 through 111.]

  • W-Flag is updated by comparing all counters in one bitline-sharing-set to decide the worst-case bitline.

[Figure, repeated from Slide 14: Vread, transmission-gate mux, dot-product current accumulation (I1 = Vread/Ron, I2 = Vread/Roff, I = I1 + I2), S/H circuits, shared ADC and comparators producing the 3-bit counter value 000-111.]
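A sketch of the current-to-counter quantization, using the eight upper current levels from the figure as thresholds; treating them directly as comparator thresholds (and rounding up into the next range) is our simplification of the slide's safeguarding areas.

    # Current levels (mA) at LRS percentages 12.5% .. 100%, taken from the figure.
    LEVELS_MA = [0.507, 0.704, 0.867, 1.0326, 1.2345, 1.5043, 1.9166, 2.7122]

    def current_to_counter(i_ma):
        """Map an aggregated bitline current to a 3-bit counter value 0..7 (000..111).
        Rounding up to the next level keeps the LRS estimate on the safe side."""
        for code, level in enumerate(LEVELS_MA):
            if i_ma <= level:
                return code
        return 7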

SLIDE 16

Row Address Impact

  • So far we have discussed how to profile bitline data patterns, but we have not yet exploited the impact of the row address on RESET latency!

[Figure: a mat with write drivers & sense amplifiers at the bottom and wordline decoders at the side; rows 0-63 form Row Address Group #0, rows 64-127 form Group #1, ..., rows 448-511 form Group #7, with RESET latency decreasing toward the write drivers.]

  • Rows with different addresses are mapped to 8 groups with different worst-case RESET latencies.

SLIDE 17

A Summary of the Profiling Technique

  • Finding the worst-case bitline: 3-bit W-Flag
  • Recording the percentage of LRS cells in the worst-case bitline
  • Periodically detecting it in each mat

  • Tracking the worst case: 6-bit W-Cnt
  • W-Cnt is cleared when W-Flag is updated
  • W-Cnt is incremented for each write
  • W-Cnt overflow triggers an increment of W-Flag
  • RESET latency optimization
  • W-Flag, W-Cnt, and the row address are used to determine tWR for RESET (a sketch of the tracking logic follows below)
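Putting the summary together, a minimal sketch of the per-set tracking state; the class and method names are hypothetical, and the pessimistic initial W-Flag is our assumption.

    class SetProfile:
        """Per bitline-sharing-set state: 3-bit W-Flag (worst-case LRS level) and
        6-bit W-Cnt (writes since the last profiling pass)."""
        def __init__(self):
            self.w_flag = 7        # assume the worst until the first profiling pass
            self.w_cnt = 0

        def on_profile(self, new_flag):
            self.w_flag = new_flag    # periodic profiling refreshes the level
            self.w_cnt = 0            # and clears the write counter

        def on_write(self):
            self.w_cnt += 1
            if self.w_cnt == 64:                        # 6-bit counter overflow
                self.w_flag = min(self.w_flag + 1, 7)   # pattern may have worsened
                self.w_cnt = 0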
SLIDE 18

Determine RESET Timing (ns)

  • RESET latency depends on the bitline data pattern (W-Flag, W-Cnt) and the row address

RESET timing table (ns), rows indexed by W-Flag (LRS cell ratio), columns by row address group 0-7:

  W-Flag | Group 0 | 1     | 2     | 3     | 4     | 5     | 6    | 7
  111    | 202.4   | 197.7 | 184.9 | 165.9 | 142.3 | 117.2 | 92.4 | 69.1
  110    | 202.4   | 197.7 | 184.9 | 165.9 | 142.3 | 117.2 | 92.4 | 69.1
  101    | 199.0   | 194.0 | 181.8 | 162.9 | 139.8 | 115.0 | 90.5 | 68.0
  100    | 189.0   | 184.3 | 172.6 | 154.8 | 132.9 | 109.0 | 85.8 | 65.5
  011    | 173.8   | 169.7 | 158.5 | 142.0 | 121.9 | 99.8  | 80.2 | 63.4
  010    | 154.6   | 150.9 | 140.9 | 126.0 | 107.9 | 90.3  | 74.7 | 60.9
  001    | 132.9   | 129.3 | 120.9 | 107.9 | 93.9  | 81.3  | 69.2 | 58.8
  000    | 109.7   | 106.9 | 99.7  | 90.8  | 81.8  | 73.2  | 64.5 | 56.4

  • RESET latency conservatively uses the upper-limit number, i.e., the worst case of the next LRS cell ratio range (a lookup sketch follows below)
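A lookup sketch of how tWR could be picked from the table, assuming 512 rows split into eight 64-row address groups as on Slide 16; the latency values are copied from the table above.

    # RESET latency (ns), indexed by [W-Flag][row address group], from the table above.
    RESET_NS = [
        [109.7, 106.9,  99.7,  90.8,  81.8,  73.2, 64.5, 56.4],  # W-Flag 000
        [132.9, 129.3, 120.9, 107.9,  93.9,  81.3, 69.2, 58.8],  # 001
        [154.6, 150.9, 140.9, 126.0, 107.9,  90.3, 74.7, 60.9],  # 010
        [173.8, 169.7, 158.5, 142.0, 121.9,  99.8, 80.2, 63.4],  # 011
        [189.0, 184.3, 172.6, 154.8, 132.9, 109.0, 85.8, 65.5],  # 100
        [199.0, 194.0, 181.8, 162.9, 139.8, 115.0, 90.5, 68.0],  # 101
        [202.4, 197.7, 184.9, 165.9, 142.3, 117.2, 92.4, 69.1],  # 110
        [202.4, 197.7, 184.9, 165.9, 142.3, 117.2, 92.4, 69.1],  # 111
    ]

    def reset_timing_ns(w_flag, row):
        group = row // 64           # 8 groups of 64 rows in a 512-row mat (Slide 16)
        return RESET_NS[w_flag][group]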

SLIDE 19

Determine RESET Timing (ns)

  • RESET latency depends on the bitline data pattern (W-Flag, W-Cnt) and the row address

[Same RESET timing table as Slide 18.]

  • Example annotation: the LRS cell ratio is 62.5%~75%, but does not exceed 87.5% before W-Cnt overflows
  • RESET latency conservatively uses the upper-limit number, i.e., the worst case of the next LRS cell ratio range

SLIDE 20

Outline

  • Background and Motivation
  • Design Details
  • Low Overhead Runtime Profiling
  • Reduce Bitline LRS Cells
  • Evaluation
  • Methodologies
  • Experimental Results
  • Conclusion

SLIDE 21

Reduce Bitline LRS Cells

  • Naïve data compression cannot help since:
  • The RESET latency depends on the worst case of all 512 bitlines
  • Naïve data compression may not help the worst-case bitline
  • Row-address biased data layout (sketched below)
  • Evenly distribute extra 0s after compression

[Figure: compressed data layout after shifting in a ReRAM mat, across bitline-sharing-sets. An original data block, its compressed form before shifting, and the compressed form after an n-bit, row-address-dependent shift are shown together with the per-bitline LRS cell counts; shifting evens out the LRS cells across bitlines.]
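A minimal sketch of the row-address biased layout, assuming a simple rotation of the zero-padded compressed line by a row-dependent amount; the helper name and the 8-way shift granularity are illustrative, not the paper's exact encoding.

    def place_compressed_line(compressed_bits, line_size_bits, row):
        """Pad a compressed cache line with 0s, then rotate it by a row-dependent
        amount so that the padded 0s (and the dense compressed payload) land on
        different bitlines for different rows, evening out LRS cells per bitline."""
        assert len(compressed_bits) <= line_size_bits
        line = compressed_bits + [0] * (line_size_bits - len(compressed_bits))
        shift = (row % 8) * (line_size_bits // 8)   # hypothetical 8-way biasing
        return line[-shift:] + line[:-shift] if shift else line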

SLIDE 22

Outline

  • Background and Motivation
  • Design Details
  • Low Overhead Runtime Profiling
  • Reduce Bitline LRS Cells
  • Evaluation
  • Methodologies
  • Experimental Results
  • Conclusion

SLIDE 23
  • Chip-multiprocessor system simulator
  • CPU, multi-level cache and ReRAM memory
  • ReRAM
  • Performance/energy numbers from HSPICE and NVSim
  • Memory: 8GB, 1 channel, 2 ranks, 8 chips/rank, 2Gb x8 ReRAM Chip, 8 banks/chip, 1024 mats/bank
  • ReRAM Timing: Read: 18ns@1.5V, SET: 10ns@3V, RESET: dynamically determined @-3V, 88µA
  • Workloads
  • SPEC2006, BioBench and PARSEC
  • WPKI/RPKI: High, Medium, Low
  • Compared schemes
  • BL: Conventional ReRAM crossbar design with DSGB (double-sided ground biasing)
  • RA: The state-of-the-art row address awareness technique [Zhang et al’ DATE’16]
  • LRS: Naïve data pattern profiling technique
  • CMP: LRS + data compression + row-address biased shifting technique
  • ALL: Proposed design with all enhancements

Experimental Methodologies

SLIDE 24

Outline

  • Background and Motivation
  • Design Details
  • Low Overhead Runtime Profiling
  • Reduce Bitline LRS Cells
  • Evaluation
  • Methodologies
  • Experimental Results
  • Conclusion

SLIDE 25

Memory Write Latency

[Figure: normalized memory write latency (lower is better) for BL, RA, LRS, CMP, and ALL across benchmarks (ferret, fasta, gems, zeus, gcc, cactus, perl, freq, gobmk, fluid, mean), grouped by High/Medium/Low memory intensity; annotations of 63% and 54%.]

SLIDE 26

Memory Read Latency

[Figure: normalized memory read latency (lower is better) for BL, RA, LRS, CMP, and ALL across the same benchmarks, grouped by High/Medium/Low memory intensity; annotations of 38% and 28%.]

SLIDE 27

System Performance

  • More performance improvement on high memory intensity benchmarks

[Figure: normalized cycles per instruction (CPI) comparison (the lower, the better) for BL, RA, LRS, CMP, and ALL across the benchmarks, grouped by High/Medium/Low memory intensity; annotations of 21% and 14%.]

SLIDE 28

Memory Dynamic Energy and EDP

  • 15.7% and 7.6% dynamic energy reduction, and 31.9% and 19.5% EDP improvement over baseline and state-of-the-art, respectively.

[Figure: normalized dynamic energy (broken down into write, read, and profiling) and normalized EDP for BL, RA, LRS, CMP, and ALL across the benchmarks, grouped by High/Medium/Low memory intensity.]

SLIDE 29

Outline

  • Background and Motivation
  • Design Details
  • Low Overhead Runtime Profiling
  • Reduce Bitline LRS Cells
  • Evaluation
  • Methodologies
  • Experimental Results
  • Conclusion

SLIDE 30

Conclusion

  • Problems: performance and reliability of write operations
  • The large sneak currents and IR drop issues in crossbar ReRAM
  • Proposed solutions: speeding up RESET operations based on data patterns
  • Profiling the number of bitline LRS cells by exploiting the intrinsic in-memory processing capability of crossbar ReRAM

  • Data compression and row address dependent layout to reduce bitline LRS cells
  • Contributions
  • Correlation between the RESET latency and the number of LRS cells on selected bitlines
  • A novel profiling technique to dynamically track the bitline data patterns
  • Results
  • Performance: 20.5% improvement over baseline, 14.2% over the state-of-the-art
  • Dynamic energy: 15.7% less than baseline, 7.6% less than state-of-the-art

SLIDE 31

Thank you!

SLIDE 32

Speeding Up Crossbar Resistive Memory by Exploiting In-memory Data Patterns

Wen Wen, Lei Zhao, Youtao Zhang, Jun Yang. March 12, 2018

SLIDE 33

Backup Slides

SLIDE 34

Prior Art

  • Double-sided ground biasing and multi-phase RESET [Xu et al'HPCA15]
  • Row address dependent logical regions [Zhang et al'DATE16]
  • No studies on the impact of bitline data patterns

  • Using the ReRAM crossbar to implement analog dot-product calculations [Bojnordi et al'HPCA16, Chi et al'ISCA16, Shafiee et al'ISCA16, Song et al'HPCA17]

  • Few studies on exploiting this feature to accelerate memory access
  • Correlation between voltage drop and data pattern in ReRAM crossbar [Liang et al'TED2010]

  • Correlation between Read latency and bitline data pattern [Zhang et al’JETCAS15]
  • Correlation between RESET latency and the number of RESET bits [Xu et al’HPCA15]
  • No observations on correlation between RESET latency and bitline data pattern

Prior-art categories: performance of the RESET operation; current accumulation feature of the ReRAM crossbar; data patterns in the ReRAM crossbar

SLIDE 35

RESET latency vs. # of LRS cells

  • Build and simulate a crossbar ReRAM HSPICE model
  • Key parameters are summarized below

  Metric | Description                             | Value
  A      | Mat size: A wordlines × A bitlines      | 512 × 512
  n      | Number of bits to read/write            | 8
  Rwire  | Wire resistance between adjacent cells  | 2.82 Ω
  Kr     | Nonlinearity of the selector            | 200
  Vw     | Full-selected voltage during write      | 3V

  • Voltage biasing scheme: DSGB

SLIDE 36

Overhead Analysis
  • Profiling
  • Power consumption/area estimation by HSPICE and NVSim at 32nm
  • Profiling overhead is small: 3.7x read energy, negligible area overhead
  • Profiling overhead of one bank is summarized as below
  • Counters storage and RESET adjustment
  • 3-bit W-Flag and 6-bit W-Cnt for each bitline-sharing-set
  • 288KB storage overhead of all flags for an 8GB memory (see the back-calculation below)
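As a back-calculation of the 288KB figure, assuming one 9-bit entry (3-bit W-Flag + 6-bit W-Cnt) per 32KB of data, i.e., one entry per 512×512-bit mat's worth of memory; the granularity is our assumption, but it reproduces the number exactly.

    MEM_BYTES  = 8 * 2**30          # 8 GB main memory
    SET_BYTES  = 512 * 512 // 8     # 32 KB of data per profiling entry (assumed)
    ENTRY_BITS = 3 + 6              # W-Flag + W-Cnt

    entries = MEM_BYTES // SET_BYTES            # 262,144 entries
    print(entries * ENTRY_BITS / 8 / 1024)      # -> 288.0 (KB), matching the slide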

  Comp.       | Params                              | Spec.               | Power/Energy                                             | Area (mm²)
  ADC         | Sampling speed, resolution, number  | 1.28GS/s, 8-bit, 8  | 24.48mW                                                  | 0.012
  S+H         | Number                              | 8 × 64              | 5uW                                                      | 0.00002
  ReRAM Array | Mat number, mat size                | 1024, 512 × 512     | Profiling: 267.178pJ; Read: 72.842pJ; Leakage: 255.233mW | 2.078

SLIDE 37

Sensitivity Study

  • Trivial improvement from doubling the number of ADC units
  • Only 1.1% performance improvement observed

[Figure: sensitivity to the number of ADC units (8 vs. 16) and to mat size (256x256 vs. 512x512); normalized CPI and normalized dynamic energy for BL, RA, LRS, CMP, and ALL; an annotation of 14%.]

  • The proposed scheme is slightly worse (by only 1.6%) than RA for a 256 x 256 mat size
  • Profiling overhead is independent of mat size and has a larger impact on smaller mats