Near-Threshold Computing: How Close Should We Get?



SLIDE 1

Near-Threshold Computing: How Close Should We Get?

Alaa R. Alameldeen Intel Labs Workshop on Near-Threshold Computing June 14, 2014

SLIDE 2

2 Workshop on Near-Threshold Computing ---- June 14, 2014

Overview

  • High-level talk summarizing my architectural perspective on near-threshold computing
  • Near-threshold computing has gained popularity recently
    – Mainly due to the quest for energy efficiency
  • Is it really justified?
    + Reduces static and dynamic power
    – Reduces frequency, adds reliability overhead
  • The case for selective near-threshold computing
    – Use it, but not everywhere
  • Case Studies: VS-ECC and Mixed-Cell Cache Designs
SLIDE 3

Why Near-threshold Computing?

  • Near-threshold computing has gained popularity recently. Why?
    – Mainly: energy efficiency
    – Running lots of cores within a fixed power budget
    – Avoiding/delaying “dark silicon”
    – Spanning market segments from ultra-mobile to supercomputing
  • Theory:
    – Dynamic power falls quadratically with operating voltage
    – Static power falls exponentially with operating voltage
    – The lower the voltage we run at, the less power we consume
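The quadratic claim for dynamic power can be illustrated with the standard switching-power relation P = activity x C x V^2 x f. This is a minimal sketch, not taken from the slides: the function name, the capacitance value, and the operating points are illustrative assumptions.

```python
def dynamic_power(c_eff, vdd, freq, activity=1.0):
    """Switching power in watts: activity * C_eff * Vdd^2 * f.

    c_eff (farads), vdd (volts), and freq (hertz) are illustrative inputs,
    not measurements from the talk.
    """
    return activity * c_eff * vdd ** 2 * freq


# Hypothetical operating points: same capacitance and frequency, half the voltage.
nominal = dynamic_power(c_eff=1e-9, vdd=1.0, freq=2e9)
half_voltage = dynamic_power(c_eff=1e-9, vdd=0.5, freq=2e9)

# Halving the voltage at a fixed frequency cuts dynamic power to one quarter.
print(half_voltage / nominal)
```

If frequency also drops with voltage, the power reduction is larger still, but so is the runtime.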

SLIDE 4

But Obviously, It Is Not Free…

  • Latency cost: lower voltage leads to lower frequency
    – Cores run slower, taking longer to run programs
    – Energy = Power x Time, so lower power doesn’t always translate to lower energy
  • Reliability cost: individual transistors and storage elements begin to fail due to smaller margins
    – Whole structures may fail
    – Lots of redundancy or other fault-tolerance mechanisms needed (i.e., more area, power, complexity)

SLIDE 5

Latency Cost

  • A lower voltage drives lower frequency
  • To the first order, at low voltages, $V \propto f$
  • Iron Law of processor performance:

$$\text{Program Runtime} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Time}}{\text{Cycle}}$$

  • Lower frequency increases Time/Cycle, and therefore increases program runtime
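The Iron Law translates directly into a few lines of code. This is a minimal sketch; the function name and the instruction count, CPI, and frequencies are hypothetical values chosen only to show the effect of frequency on runtime.

```python
def program_runtime(instructions, cycles_per_instruction, freq_hz):
    """Iron Law: runtime = instructions * (cycles / instruction) * (time / cycle)."""
    time_per_cycle = 1.0 / freq_hz
    return instructions * cycles_per_instruction * time_per_cycle


# Hypothetical program: one billion instructions at a CPI of 1.
fast = program_runtime(1e9, 1.0, freq_hz=2e9)
slow = program_runtime(1e9, 1.0, freq_hz=1e9)

# Halving the frequency doubles Time/Cycle and therefore doubles the runtime.
print(fast, slow)
```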

SLIDE 6

Latency Impact on Energy Efficiency

  • A program that runs longer consumes more energy

    Energy = Power x Time
    Program Energy = Average Power x Program Runtime

  • Even if average power is lower, it’s possible energy will be higher
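A worked example makes the last point concrete. The power and runtime figures below are invented for illustration and are not measurements from the talk; the function name is likewise hypothetical.

```python
def program_energy(avg_power_w, runtime_s):
    """Program Energy = Average Power x Program Runtime (joules)."""
    return avg_power_w * runtime_s


# Hypothetical nominal-voltage run: 10 W for 0.5 s.
nominal_j = program_energy(10.0, 0.5)

# Hypothetical low-voltage run: power drops to 3 W, but the program takes 2 s.
low_voltage_j = program_energy(3.0, 2.0)

# Power fell by 70%, yet total energy went up because runtime grew 4x.
print(nominal_j, low_voltage_j)
```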

SLIDE 7

And There is Also User Experience…

  • Not too many users will be happy with slower execution
  • Mobile users like longer battery life, but they absolutely hate long wait times
    – Especially if the system is idle most of the time
    – Response time really matters when the system is active
  • If voltage is too low, there is a significant impact on user experience
SLIDE 8

Reliability Cost

  • Getting too close to threshold significantly increases failures for individual transistors and storage elements
  • Getting too close to the tail of the distribution
SLIDE 9

Example: SRAM Bit and 64B Failures

[Figure: probability (log scale, 1.E-08 to 1.E+00) vs. Vcc (0.4 to 0.6); series: pBitFail, P(e=1), P(e=2), P(e=3), P(e=4)]
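One way to relate the per-bit curve to the per-line curves is a binomial model. This is a sketch under two assumptions that the slide does not state: bit failures are independent, and P(e=k) means exactly k failing bits in a 64B (512-bit) line. The function name and the sample pBitFail value are illustrative.

```python
from math import comb

LINE_BITS = 64 * 8  # a 64B cache line


def p_errors(k, p_bit_fail, n_bits=LINE_BITS):
    """Probability of exactly k failing bits among n_bits independent bits."""
    return comb(n_bits, k) * p_bit_fail ** k * (1.0 - p_bit_fail) ** (n_bits - k)


# Illustrative per-bit failure probability (not read off the plot).
for k in range(1, 5):
    print(k, p_errors(k, 1e-4))
```

Under this model, each additional failing bit in a line is much less likely than the previous one as long as pBitFail is small.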

SLIDE 10

Cost of Lower Reliability

  • We need to make sure the whole chip works even if individual components fail
    – That is, we need to build reliable systems from unreliable components
  • To improve reliability, we either increase redundancy or add other fault tolerance mechanisms
    – More power, area, $ cost

SLIDE 11

Simple Answer: TMR

  • Basically, include three copies of everything, use majority vote
  • Extremely high cost
    – More than 3x area increase
    – More than 3x power increase
  • But even that might not be sufficient
    – Large structures may always fail, having three copies won’t help
    – Need to do at transistor/cell level
    – Majority voting gets really expensive at that level
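The majority vote at the heart of TMR (triple modular redundancy) can be sketched as a bitwise operation over three copies of a value. The function name and the bit patterns are illustrative.

```python
def majority_vote(a, b, c):
    """Bitwise majority of three copies: each output bit matches at least two inputs."""
    return (a & b) | (a & c) | (b & c)


good = 0b10110010
one_bad_copy = good ^ 0b00001000  # a single bit flipped in one copy

# One faulty copy is outvoted by the other two.
print(majority_vote(good, one_bad_copy, good) == good)

# If two copies fail in the same bit position, the vote returns the wrong bit.
print(majority_vote(good, one_bad_copy, one_bad_copy) == good)
```

The second case is why three copies of a large structure do not help when the structure almost always contains a failure.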

SLIDE 12

Another Answer: Error-Correcting Codes

  • Applies only to storage or state elements
  • At the single-bit level, degenerates to TMR, but:
  • Mostly area-efficient if amortized across more bits
    – A small number of bits needed to detect/correct errors in large state elements
  • But latency-inefficient
    – Error correction requirements increase with larger blocks
    – SECDED on a 64B cache line may take a single cycle, but 4EC5ED might use ~15 cycles
  • For logic elements, RAZOR-style circuits needed to reduce overhead
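The area argument can be checked with the Hamming bound for single-error correction: r check bits suffice for m data bits when 2^r >= m + r + 1, and SECDED adds one overall parity bit. This is a sketch of that standard bound, not of any specific code used in the talk; the function names are illustrative.

```python
def sec_check_bits(data_bits):
    """Smallest r with 2**r >= data_bits + r + 1 (Hamming single-error correction)."""
    r = 0
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r


def secded_check_bits(data_bits):
    """SECDED = Hamming SEC plus one overall parity bit."""
    return sec_check_bits(data_bits) + 1


for bits in (1, 64, 512):
    print(bits, sec_check_bits(bits), secded_check_bits(bits))
```

For a single data bit, SEC needs 2 check bits, so 3 stored bits in total, which matches the storage cost of TMR. For a 512-bit line, SECDED needs 11 check bits, roughly 2% overhead.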
SLIDE 13

This Seems Too Hard…

  • So why not relax our reliability requirements instead?
SLIDE 14

Approximate Computing to the Rescue

  • If reliability is not absolutely required, then we can take a best-effort approach
  • In other words
    – If something works correctly, great
    – If it doesn’t, the incorrect outcome might be good enough
  • Background:
    – Some applications don’t care for 100% accurate computations
    – Example: Individual pixels on a large screen
    – We could take advantage by using NTC for them

SLIDE 15

But It Sounds Too Good To Be True…

  • In reality, too many applications care about reliability
  • And even applications that could tolerate errors need some code to be reliable
    – A pixel error on a bitmap is no big deal, but a pixel error in a compressed image (e.g., JPEG) causes too much noise
    – In a long sequence of computations, early computations need accuracy while later ones can tolerate errors
  • Too much overhead to allow NTC selectively
    – Definitely needs programmer input
    – Could lead to too fine-grain control of reliability

SLIDE 16

My Architectural Perspective

  • Near-threshold computing is great if power savings outweigh latency and reliability cost
  • But in many cases, cost is too great
  • So we shouldn’t give up on NTC, but only use it in places where it helps
  • Or alternatively, we shouldn’t get too close to threshold to the point where costs outweigh benefits
  • Selective NTC requires architectural support
SLIDE 17

Case Study: Mixed-Cell Cache Design

  • Optimize only part of cache for low (or near-threshold) voltage, using more reliable (bigger) cells
  • Rest of cache uses normal cells
  • During normal mode, all cache is active
  • At low voltage, could only turn on reliable part
  • Causes significant performance drawbacks
SLIDE 18

Speedup of Multi-Core over Single Core

[Figure: bar chart of speedup over 1-core (y-axis 0.5 to 3) for 2-core and 4-core configurations across SPEC benchmarks from 400.perlbench to 483.xalancbmk, plus Gmean]

Compared to 1P, 2P is 31% better, 4P is 37% better

SLIDE 19

4P has Much Better Performance than 1P, But…

  • Design is TDP-limited
    – To activate 4 cores, need to run at Vmin
    – Without separate power supplies, only robust cache lines will be active
    – 4P is where we really need the extra cache capacity for performance
  • Mixed caches include robust cells that could run at low voltage, and regular cells that only work at high voltage
  • Our Mixed-Cell Architecture:
    – All cache lines are active at Vmin
    – Architectural changes to ensure error-free execution

SLIDE 20

Mixed-Cell Cache Design

  • Each cache set has two robust ways
  • Modified data only stored in robust ways
  • Clean data protected by parity
SLIDE 21

Mixed-Cell Architectural Changes

  • Change cache insertion/replacement policy to allocate modified data only to robust ways
  • What to do for writes to a clean line?
    – Writeback (MC_WB): Convert dirty line to clean by writing back its data to the next cache level (all the way to memory)
    – Swap (MC_SWP): Swap newly-written line with the LRU robust line, and write back the data for victim line to next cache level
    – Duplication (MC_DUP): Duplicate modified line to another non-robust line by victimizing line in its partner way

SLIDE 22

Changes to Cache Insertion/Replacement Policies

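[Flowchart: handling of a cache miss, branching on miss type (Read or Write) and, for writes, on victim type (Robust or Non-Robust)]

  • Read miss: choose victim from non-robust lines, then allocate new line in victim’s place
  • Write miss: choose victim from all lines in set, then check the victim type
    – Boxes on the Robust and Non-Robust branches: write back victim’s data; allocate new line in victim’s place; choose Victim_2 from robust lines; write back Victim_2’s data; copy Victim_2 to victim’s place; allocate new line in Victim_2’s place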
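The miss-handling flow can be sketched in code under one reading of the flowchart. The assignment of steps to branches is an assumption: a robust victim is written back and replaced in place, while a non-robust victim is overwritten by a cleaned copy of Victim_2, whose robust way then receives the new line. `MixedCellSet`, `Line`, the LRU victim choice, and the way counts are all hypothetical. Writes to an already-resident clean line (MC_WB, MC_SWP, MC_DUP) are not modeled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Line:
    tag: Optional[int] = None
    dirty: bool = False
    last_use: int = 0


class MixedCellSet:
    """One cache set; ways [0, robust_ways) are robust, the rest are non-robust."""

    def __init__(self, ways=8, robust_ways=2):
        self.lines = [Line() for _ in range(ways)]
        self.robust_ways = robust_ways
        self.clock = 0
        self.writebacks = []  # tags written back to the next cache level

    def _lru(self, indices):
        return min(indices, key=lambda i: self.lines[i].last_use)

    def _fill(self, way, tag, dirty):
        self.clock += 1
        self.lines[way] = Line(tag, dirty, self.clock)

    def _writeback_if_dirty(self, way):
        line = self.lines[way]
        if line.dirty:
            self.writebacks.append(line.tag)
            line.dirty = False

    def read_miss(self, tag):
        # Read data arrives clean, so it can be placed in a non-robust way.
        victim = self._lru(range(self.robust_ways, len(self.lines)))
        self._fill(victim, tag, dirty=False)
        return victim

    def write_miss(self, tag):
        victim = self._lru(range(len(self.lines)))
        if victim < self.robust_ways:
            # Assumed "Robust" branch: write back the victim, allocate in its place.
            self._writeback_if_dirty(victim)
            self._fill(victim, tag, dirty=True)
            return victim
        # Assumed "Non-Robust" branch: clean Victim_2, move it to the victim's
        # place, and put the new (modified) line in Victim_2's robust way.
        victim2 = self._lru(range(self.robust_ways))
        self._writeback_if_dirty(victim2)
        self.lines[victim] = self.lines[victim2]
        self._fill(victim2, tag, dirty=True)
        return victim2
```

Whatever the exact branch order on the slide, the invariant the policy must preserve is the one stated for the design: modified data lives only in robust ways.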

SLIDE 23

Cache Vmin for Mixed-Cell Caches

New MC_DUP and MC_SWP mechanisms are very close to building the cache with only robust cells (but much larger cache capacity)

[Figure: probability of failure (log scale, 1.E-09 to 1.E+00, where 1 = 100%) vs. Vmin (0.55 V to 0.75 V); series: BASE, ROBUST, MC_DISABLE, MC_DUP/SWP]

SLIDE 24

Evaluation

  • Used CMP$im
  • Cache configuration based on current Intel mainline cores
  • Compared our mechanisms to baseline and prior MC proposals
    – ROBUST: Cache only uses robust cells, much smaller capacity iso-area
    – MC_Disable: Only 1/4 of cache is operational at Vmin
  • Used 4-program mixes from SPEC workloads
SLIDE 25

Multi-core (4P) Performance

[Figure: speedup vs. MC_DISABLE (y-axis 0.8 to 1.4) for MC_WB, MC_SWP, and MC_DUP across mixes 1 to 20, plus Gmean]

Geomean 17% speedup for MC_SWP over MC_DISABLE

SLIDE 26

Mixed-Cell Cache Summary

  • Philosophy: Only part of cache is reliable enough to operate at near-threshold
  • A multi-core system (at Vmin) needs larger cache capacity
  • Our mixed-cell architecture preserves cache capacity at Vmin
    – Improves performance
    – Reduces dynamic energy
  • Could be extended to other parts of the memory hierarchy, and newer memory technologies

SLIDE 27

Case Study: VS-ECC

  • Large caches and memories limit voltage scaling
    – Many cells fail at low voltages
    – Need to account for weakest cell
  • Error-Correcting Codes (ECC) allow lower voltages by recovering from (multiple) failures
  • Uniform ECC increases latency, power & area
  • Our Proposal: Variable-Strength ECC (VS-ECC)
    – Better performance, power and area vs. uniform ECC
    – Allocates ECC budget to lines that need it
    – Online testing identifies lines needing more protection

SLIDE 28

VS-ECC Motivation

  • Most cache lines have 0-1 failures if we don’t get too close to threshold
  • But some lines (especially for large caches) have more failures

[Figure: probability (log scale, 1.E-18 to 1.E+00) vs. Vcc (0.4 to 0.8) for 64B lines; series: pBitFail, P(e=1), P(e=2), P(e=3), P(e=4)]

SLIDE 29

VS-ECC Motivation

  • Need a strong ECC code to protect worst lines
  • Uniform ECC for all lines is expensive AND unnecessary

[Figure: probability (log scale, 1.E-08 to 1.E+00) vs. Vcc (0.4 to 0.6) for 64B lines; series: pBitFail, P(e=1), P(e=2), P(e=3), P(e=4)]

SLIDE 30

Prior Low Voltage Solutions

  • Uniform-Strength Error Correction Codes
    – SECDED (Single Error Correction, Double Error Detection)
    – DECTED (Double Error Correction, Triple Error Detection)
    – Two-dimensional ECC: Kim et al., MICRO 07
    – Multi-bit segmented ECC (MS-ECC): Chishti et al., MICRO 09
  • Architectural solutions for persistent failures
    – Word Disable: Wilkerson et al., ISCA 08; Roberts et al., DSD 07
    – Bit Fix: Wilkerson et al., ISCA 08
  • Circuit Solutions: Larger cells, alternative cell designs
  • All use the same level of protection for all cache lines

SLIDE 31

Variable-Strength ECC (VS-ECC)

  • Key idea: Provide strong ECC protection only for lines that need it
    – But still provide single-error correction for soft errors

  • VS-ECC achieves lower voltage at minimum cost
  • Three variations are explored
  • Need to identify which lines need stronger protection
SLIDE 32

Design 1: VS-ECC-Fixed

  • Fixed number of regular and extended ECC lines
  • Regular lines protected by SECDED
  • Extended ECC lines use 4-bit correction

[Figure: diagram labeled “SECDED ECC bits”]

SLIDE 33

Design 2: VS-ECC-Disable

  • Add a disable bit to each line
  • Lines with 3 or more errors are disabled
  • Lines with zero errors use SECDED; lines with 1-2 errors use 4-bit correction

[Figure: diagram labeled “SECDED ECC bits”]
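The per-line rule for Design 2 can be written as a small classifier. This is a sketch; the function name and the returned labels are illustrative, with "4EC5ED" standing in for the 4-bit correction code named in the simulated configurations.

```python
def vs_ecc_disable_protection(persistent_errors):
    """Map a line's persistent-error count to its protection under VS-ECC-Disable."""
    if persistent_errors < 0:
        raise ValueError("error count cannot be negative")
    if persistent_errors == 0:
        return "SECDED"    # regular line
    if persistent_errors <= 2:
        return "4EC5ED"    # extended line with 4-bit correction
    return "DISABLED"      # 3 or more errors: set the disable bit


for errors in range(5):
    print(errors, vs_ecc_disable_protection(errors))
```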

SLIDE 34

Cache Characterization

  • We need to classify cache lines based on their number of failures
  • Manufacturing-time testing is expensive and needs non-volatile on-die storage for the fault map
  • Proposal: Online testing on the first transition to low voltage

SLIDE 35

Online Testing at Low Voltage

  • Cache is still functional during testing, but with reduced capacity
  • Divide cache into a working part (protected by 4-bit ECC) and a part under test, then switch roles
  • Use standard testing patterns, store error locations in tag
  • Note: Not all VS-ECC designs require the same testing accuracy
    – Optimizing test time is an opportunity for future work

SLIDE 36

Simulated Configurations

  • Baseline
    – 2MB 16-way L2 (12 cycles), SECDED ECC to recover from non-persistent errors (1 cycle)
  • Uniform-strength ECC
    – DECTED: 1 cycle, corrects one persistent error per line
    – 4EC5ED: 15 cycles, corrects up to three persistent errors per line
    – MS-ECC: 64-bit segments, 4 corrections/segment, corrects up to three persistent errors per segment, cache becomes 1MB 8-way
  • Variable-strength ECC
    – VS-ECC-Fixed: 12 lines with SECDED (1 cycle), 4 with 4EC5ED (15 cycles)
    – VS-ECC-Disable: VS-ECC-Fixed + disable lines with ≥ 3 errors

SLIDE 37

Results: Reliability

  • VS-ECC has similar voltage scaling to 4EC5ED
  • VS-ECC-Disable achieves lowest voltage

[Figure: cache failure probability (log scale, 1.E-15 to 1.E+00) vs. supply voltage (0.4 V to 0.8 V); legend: 2MB SECDED, DECTED, 4EC5ED, VS-ECC-Fixed, MS-ECC, VS-ECC-Disable]

Vmin set at 1/1000 cache failure probability
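The 1/1000 criterion can be sketched with a simple model for a uniform-strength code. The assumptions are not from the slides: bit failures are independent, a line fails when it has more persistent errors than the code can absorb, and the cache fails if any line fails. The function names and the sample per-bit failure probability are illustrative. The correctable counts (1 for DECTED, 3 for 4EC5ED) follow the simulated configurations, which leave one correction for non-persistent errors.

```python
from math import comb, expm1, log1p

NUM_LINES = 2 * 1024 * 1024 // 64  # 2MB cache with 64B lines


def p_line_fails(p_bit_fail, correctable, line_bits=512):
    """Probability that a line has more than `correctable` failing bits."""
    return sum(
        comb(line_bits, k) * p_bit_fail ** k * (1.0 - p_bit_fail) ** (line_bits - k)
        for k in range(correctable + 1, line_bits + 1)
    )


def p_cache_fails(p_line_fail, num_lines=NUM_LINES):
    """1 - (1 - p)^N, computed with log1p/expm1 to stay accurate for tiny p."""
    if p_line_fail >= 1.0:
        return 1.0
    return -expm1(num_lines * log1p(-p_line_fail))


def meets_vmin_target(p_cache_fail, target=1e-3):
    """Vmin is the lowest voltage at which the cache failure probability meets the target."""
    return p_cache_fail <= target


# Illustrative per-bit failure probability at some candidate voltage.
p_bit = 1e-4
dected = p_cache_fails(p_line_fails(p_bit, correctable=1))
four_ec = p_cache_fails(p_line_fails(p_bit, correctable=3))
print(dected, four_ec, meets_vmin_target(four_ec))
```

Finding Vmin would mean sweeping this check over a measured pBitFail-versus-voltage curve, which the sketch does not include.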

SLIDE 38

Results: Performance at Low Voltage

  • Similar IPC to baseline, better than uniform ECC

[Figure: normalized IPC (y-axis 0.82 to 1.02) for 2MB Base, VS-ECC-Dis, 4EC5ED, and MS-ECC across workload categories DH, FSPEC, ISPEC, GM, MM, OFF, PROD, SERV, WS, KERN, plus GMEAN]

SLIDE 39

VS-ECC Summary

  • Near-threshold computing needs strong ECC capability in large caches
  • Uniform ECC techniques are expensive (performance, power, area)
  • Variable-Strength ECC provides strong protection only to lines that need it
  • VS-ECC + Line Disable is the most cost-effective mechanism
  • But it really needs practical online testing mechanisms

SLIDE 40

Key Messages

  • Near-threshold computing: Sometimes benefits outweigh costs, and some other times they don’t
  • It’s better to use near-threshold computing selectively rather than for everything
  • Alternatively, we should not get too close to threshold: go only as far as benefits outweigh costs

SLIDE 41

Acknowledgments

  • Samira Khan (Intel/CMU)
  • Chris Wilkerson (Intel)
  • Ilya Wagner (Intel)
  • Zeshan Chishti (Intel)
  • Jaydeep Kulkarni (Intel)
  • Wei Wu (Intel)
  • Shih-Lien Lu (Intel)
  • Daniel Jiménez (Texas A&M)
  • Nam S. Kim (Wisconsin)
  • Hamid Ghasemi (Wisconsin)
