Modeling the Implications of DRAM Failures and Protection Techniques on Datacenter TCO




SLIDE 1

Modeling the Implications of DRAM Failures and Protection Techniques on Datacenter TCO

Panagiota Nikolaou¹, Yiannakis Sazeides¹, Lorena Ndreu¹, Marios Kleanthous²

¹University of Cyprus, ²MAP S.Platis

MICRO 48, Waikiki, Hawaii, December 5th 2015

P. Nikolaou

SLIDE 2

Today’s Datacenters

  • More than 510,000 datacenters worldwide [Emerson, 2011]
  • More than 285 million sq ft of datacenter space [Emerson, 2011]
  • Large-scale datacenters: more than 10,000 commodity servers
  • Cost: many millions of dollars per month

SLIDE 3

Datacenter Cost

  • Other Capex Expenses: 49%
  • Other Opex Expenses: 20%
  • DRAM Opex & Capex Expenses: 31%

[Analysis using COST‐ET tool, D. Hardy 2013]

SLIDE 4

DRAM Protection Cost

  • DRAM Protection Opex & Capex: 8%
  • DRAM Opex & Capex Expenses: 23%
  • Other Capex & Opex Cost: 69%

[Analysis using COST‐ET tool, D. Hardy 2013]

[Figure labels: Data, ECC]
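The split on this slide can be checked against slide 3: the 8% protection share and the 23% remaining DRAM share add up to the 31% DRAM total shown there. A minimal Python sketch of that arithmetic follows. It uses only the percentages from the slides; the variable names are illustrative and not part of the AMPRA or COST‐ET tools.

```python
# Percentages read off slides 3 and 4 (shares of total datacenter cost).
dram_protection_pct = 8   # DRAM Protection Opex & Capex
dram_other_pct = 23       # remaining DRAM Opex & Capex Expenses
other_pct = 69            # Other Capex & Opex Cost

# The two DRAM shares should reproduce the 31% DRAM total from slide 3.
dram_total_pct = dram_protection_pct + dram_other_pct
assert dram_total_pct + other_pct == 100

# Fraction of the DRAM cost that is spent on protection.
protection_share_of_dram = dram_protection_pct / dram_total_pct

print(f"DRAM total: {dram_total_pct}% of datacenter cost")
print(f"Protection share of DRAM cost: {protection_share_of_dram:.1%}")
```

Under these numbers, roughly a quarter of the DRAM cost is attributable to protection.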

SLIDE 5

Do we need DRAM protection?

  • Google Failure Study [Barroso, 2009]
  • DRAM large field studies [V. Sridharan 2012, 2013]
  • DRAM FITs per server [Borucki, IRPS 2008], [K. Lim ISCA 2009], [Daniel Bowers, “Server Trends”]

DRAM protection is essential!

SLIDE 6

DRAM protection choices

Technique     Cost   Reliability   Performance
ChipkillDC    +      +++           +
ChipkillSC    ++     ++            ++
SECDED        +++    +             +++

SLIDE 7

DRAM protection selection

[Figure labels: Application, AMPRA*]

*Analyzer of Memory Protection and Failures Implications on TCO (AMPRA tool), site: http://www2.cs.ucy.ac.cy/carch/xi/ampra_tco.php

SLIDE 8

Our Proposition & our Contribution

AMPRA tool

  • Considers: Reliability, Performance, Application Characteristics, Power, Error protection techniques
  • Models: Availability/MTTF Model, DRAM SDC Model, Thermal Model, DIMM Cost Model, Energy Model, DIMM FIT Model, Server Performance Model, TCO Model
  • Result: Best DRAM Protection Technique

SLIDE 9

Related work

  • [Y. Luo DSN 2014] proposes and analyzes the cost of a heterogeneous memory protection scheme. Differences in this work:
    – Performance and power implications of memory protection techniques
    – Co‐located services
    – Datacenter cost
  • No other related work considers such a variety of parameters

SLIDE 10

Outline

  • Proposed Framework (AMPRA tool)
  • Use Case
  • Experimental Framework
  • Results
  • Conclusions

SLIDE 11

Proposed framework (AMPRA tool)

[Figure: block diagram of the AMPRA framework]

  • Model blocks: DRAM FIT Model, Availability/MTTF Model, DRAM SDC Derating Model, Thermal Model, Server Performance Model, Energy Model, DIMM Cost Model, TCO Model
  • Labeled quantities: DIMM FITS_CE, DIMM FITS_DUE, DIMM FITS_NDE, DIMM Derated FITS_NDE, DIMM FITS_SDC, NDE Derating Factor, Performance Degradation (PD), Server Energy, DIMM cost, Component Temperature, Component MTTF for replacement, Total extra servers, System Reliability (MTTF_SDC), TCO
  • Labeled parameters: ECC technique, Reference ECC technique, FITs per mode (transient, permanent, physical location), #devices/DIMM, Device width/size, DRAM Grade Factor, DRAM brand, DRAM technology, DRAM frequency, Server configuration (#cores, Interleaving type, #channels, DIMMs/channel), System configuration, #threads for Online Service, #threads for Offline Service, Utilization Profile per day for the online service, Average Utilization, Component Reference Temperature, Non DRAM component Reference MTTF, HW and SW repair options and their MTTR, Model for proactive replacement, Maintenance model for replacement on faulty components, DC configuration, #DC servers, Target Reliability, Published Data
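The framework links FIT rates to MTTF values. As a rough illustration of that link (not the AMPRA implementation), the sketch below converts a FIT rate to an MTTF using the standard definition of one FIT as one failure per 10^9 device-hours. It assumes a constant failure rate and independent DIMMs whose FITs add up. The function names and the example FIT value are hypothetical.

```python
HOURS_PER_BILLION = 1e9      # 1 FIT = 1 failure per 10^9 device-hours
HOURS_PER_YEAR = 24 * 365


def fit_to_mttf_hours(fit: float) -> float:
    """Convert a FIT rate to MTTF in hours, assuming a constant failure rate."""
    if fit <= 0:
        raise ValueError("FIT must be positive")
    return HOURS_PER_BILLION / fit


def server_fit(dimm_fit: float, dimms_per_server: int) -> float:
    """Sum per-DIMM FITs, assuming independent DIMMs and that any DIMM failure counts."""
    return dimm_fit * dimms_per_server


# Hypothetical inputs, chosen only to make the arithmetic easy to follow.
example_dimm_fit = 500.0
example_dimms_per_server = 2

total_fit = server_fit(example_dimm_fit, example_dimms_per_server)
mttf_hours = fit_to_mttf_hours(total_fit)
print(f"Server DRAM FIT: {total_fit:.0f}")
print(f"MTTF: {mttf_hours:.0f} hours ({mttf_hours / HOURS_PER_YEAR:.1f} years)")
```

In the framework, separate FIT figures (FITS_CE, FITS_DUE, FITS_NDE) feed different models; the same conversion would apply to each of them under the stated assumptions.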

SLIDE 12

Outline

  • Proposed Framework (AMPRA tool)
  • Use Case
  • Experimental Framework
  • Results
  • Conclusions

SLIDE 13

Use Case

Bandwidth vs. Latency vs. Reliability vs. Power

  • Chipkill with Dual Channel Implementation (ChipkillDC)
  • Chipkill with Single Channel Implementation (ChipkillSC)
  • 16 ECC bits for 128 data bits: 144‐bit codeword

[Figure: memory controller and DIMMs with Data and ECC devices; a 64B block made of eight 8B chunks (numbered 1 to 8); 72b transfers]
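The codeword figures on this slide imply a fixed number of codewords per 64B block. A small Python sketch of that arithmetic, using only the sizes stated on the slide (the constant names are illustrative):

```python
DATA_BITS_PER_CODEWORD = 128   # from the slide
ECC_BITS_PER_CODEWORD = 16     # from the slide
BLOCK_BYTES = 64               # 64B block from the figure

codeword_bits = DATA_BITS_PER_CODEWORD + ECC_BITS_PER_CODEWORD
block_data_bits = BLOCK_BYTES * 8

# Number of codewords needed to cover the data bits of one block.
codewords_per_block = block_data_bits // DATA_BITS_PER_CODEWORD

# Total bits moved per block once the ECC bits are included.
bits_per_block_with_ecc = codewords_per_block * codeword_bits

# ECC overhead relative to the data bits.
ecc_overhead = ECC_BITS_PER_CODEWORD / DATA_BITS_PER_CODEWORD

print(f"Codeword size: {codeword_bits} bits")
print(f"Codewords per 64B block: {codewords_per_block}")
print(f"Bits per block including ECC: {bits_per_block_with_ecc}")
print(f"ECC overhead: {ecc_overhead:.1%}")
```

The 576 bits per block equal eight 72b transfers, which is consistent with the 72b labels in the figure.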

SLIDE 14

FIT model

  • ChipkillDC:
    – Detects all the errors in 2 devices
    – Corrects all the errors in 1 device
  • ChipkillSC:
    – Cannot detect all the errors in 2 devices
    – Corrects all the errors in 1 device

ChipkillDC can provide better reliability than ChipkillSC
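The detection and correction rules above can be written as a small lookup. The Python sketch below is only a restatement of the bullets, not the FIT model itself. The outcome names are assumptions that borrow the CE, DUE and NDE labels from the framework diagram, read here as corrected, detected but uncorrected, and not detected. The slide does not say what happens with more than two faulty devices, so that case is reported as not stated.

```python
from enum import Enum


class Outcome(Enum):
    NO_ERROR = "no error"
    CORRECTED = "corrected (CE)"
    DETECTED = "detected, not corrected (DUE)"
    MAYBE_UNDETECTED = "may go undetected (NDE)"
    NOT_STATED = "not stated on the slide"


def outcome(scheme: str, faulty_devices: int) -> Outcome:
    """Map a scheme and a count of faulty devices to the outcome named on the slide."""
    if scheme not in ("ChipkillDC", "ChipkillSC"):
        raise ValueError(f"unknown scheme: {scheme}")
    if faulty_devices < 0:
        raise ValueError("faulty_devices must be non-negative")
    if faulty_devices == 0:
        return Outcome.NO_ERROR
    if faulty_devices == 1:
        # Both schemes correct all the errors in 1 device.
        return Outcome.CORRECTED
    if faulty_devices == 2:
        # Only ChipkillDC is stated to detect all the errors in 2 devices.
        if scheme == "ChipkillDC":
            return Outcome.DETECTED
        return Outcome.MAYBE_UNDETECTED
    return Outcome.NOT_STATED


for name in ("ChipkillDC", "ChipkillSC"):
    for n in (1, 2):
        print(f"{name}, {n} faulty device(s): {outcome(name, n).value}")
```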

SLIDE 15

Performance and Power model

How it works (ChipkillDC): Read

  • Requires accessing two DIMMs
  • Codeword in a single burst
  • Short latency
  • Low bandwidth
  • High power consumption

[Figure: memory controller reading a 64B block (eight 8B chunks) from two DIMMs, each with Data and ECC devices; 72b transfers]

SLIDE 16

Performance and Power model

How it works (ChipkillSC): Read

  • Requires accessing one DIMM
  • Codeword in two bursts
  • Long latency
  • High bandwidth
  • Less power consumption

[Figure: memory controller reading a 64B block (eight 8B chunks) from one DIMM with Data and ECC devices; 144 bits delivered as 72b transfers]
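The "single burst" versus "two bursts" difference follows from how many 72b-wide DIMMs deliver the 144-bit codeword in parallel. The Python sketch below works through that count. It assumes each accessed DIMM contributes 72 bits per transfer step, as the 72b labels in the figures suggest; the function name is hypothetical and actual DDR burst timing is not modeled.

```python
CODEWORD_BITS = 144    # 128 data bits + 16 ECC bits
DIMM_WIDTH_BITS = 72   # assumed width delivered by one DIMM per transfer step


def transfers_per_codeword(dimms_accessed: int) -> int:
    """Transfer steps needed to move one codeword when DIMMs are read in parallel."""
    if dimms_accessed < 1:
        raise ValueError("at least one DIMM must be accessed")
    bits_per_step = dimms_accessed * DIMM_WIDTH_BITS
    # Ceiling division without floating point.
    return -(-CODEWORD_BITS // bits_per_step)


chipkill_dc_steps = transfers_per_codeword(dimms_accessed=2)  # two DIMMs
chipkill_sc_steps = transfers_per_codeword(dimms_accessed=1)  # one DIMM

print(f"ChipkillDC: {chipkill_dc_steps} step(s) per codeword")
print(f"ChipkillSC: {chipkill_sc_steps} step(s) per codeword")
```

ChipkillDC gets each codeword in one step but occupies two DIMMs per access, which matches the slides' pairing of short latency with low bandwidth and high power; ChipkillSC needs two steps but leaves the other DIMM free.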

SLIDE 17

Design Space

  • Application characteristics
    – Memory intensive, compute intensive, etc.
    – Co‐running applications

              ChipkillDC                              ChipkillSC
Reliability   Detects all the errors in 2 devices     Cannot detect all the errors in 2 devices
              Corrects all the errors in 1 device     Corrects all the errors in 1 device
Bandwidth     Access two DIMMs                        Access one DIMM
Latency       Codeword in one burst                   Codeword in two bursts
Power         Access two DIMMs                        Access one DIMM

What happens with the Cost?

SLIDE 18

Online and Offline Services

  • Online Services: High QoS requirements
  • Offline Services: Do not have QoS constraints
  • Co‐location: Improve server utilization and reduce TCO

[Figure: four cores (Core 0 to Core 3) sharing a memory controller and DRAM; Online Service on Core 0 and Core 1, Offline Service on Core 2 and Core 3]

SLIDE 19

Outline

  • Proposed Framework (AMPRA tool)
  • Use Case
  • Experimental Framework
  • Results
  • Conclusions

SLIDE 20

Experimental Framework

  • FIT Model: Analytical Models
  • Performance, Power and Thermal Model: ChipkillDC in Lockstep Mode, ChipkillSC in Advance Mode
  • Server Configuration: Intel Xeon E5‐5620, 4 cores per CPU, 2 channels per CPU, 1 DIMM per channel
  • Workloads:
    1. Web Search (QoS requirements)
    2. MapReduce:
       a. 500MB (CPU intensive)
       b. 49000MB (memory intensive)
  • DIMM Cost: Public Data
  • TCO Model: Extension of the COST‐ET Tool [D. Hardy 2013]
  • DC Configuration: 50,000 server modules, DC depreciation of 15 years
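To give a feel for how a 15-year depreciation period and 50,000 server modules enter a monthly cost figure, here is a minimal Python sketch. It assumes straight-line depreciation, which may differ from what COST‐ET actually does, and the capital cost is a made-up number chosen for round results. Only the depreciation period and the server count come from the slide.

```python
DC_DEPRECIATION_YEARS = 15   # from the slide
SERVER_MODULES = 50_000      # from the slide


def straight_line_monthly(capex_usd: float, depreciation_years: float) -> float:
    """Monthly cost of a capital expense under straight-line depreciation (an assumption)."""
    if depreciation_years <= 0:
        raise ValueError("depreciation_years must be positive")
    return capex_usd / (depreciation_years * 12)


# Hypothetical facility cost, not taken from the talk or the paper.
hypothetical_dc_capex_usd = 180_000_000

monthly_dc_cost = straight_line_monthly(hypothetical_dc_capex_usd, DC_DEPRECIATION_YEARS)
monthly_cost_per_server = monthly_dc_cost / SERVER_MODULES

print(f"Monthly facility depreciation: ${monthly_dc_cost:,.0f}")
print(f"Per server module: ${monthly_cost_per_server:.2f} per month")
```

A full TCO model would add server, DRAM, energy and maintenance terms on top of this single line item.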

SLIDE 21

DRAM Protection Implications on Performance

[Figure legend: WS: Web Search; MR500: MapReduce 500MB; MR49000: MapReduce 49000MB]

SLIDE 22

DRAM Protection Implications on Power

[Figure legend: WS: Web Search; MR500: MapReduce 500MB; MR49000: MapReduce 49000MB]

SLIDE 23

DRAM Protection Implications on Cost

  • Underlines the importance of understanding the usage and characteristics of all the services to be run in a DC before making memory protection design choices
  • Highlights the need for the proposed framework

[Figure legend: WS: Web Search; MR500: MapReduce 500MB; MR49000: MapReduce 49000MB]

SLIDE 24

Usage

  • Datacenter designers: Select processor and protection technique
  • Researchers: Investigate the implications of new ideas related to DRAM failures and DRAM protection techniques
  • Service providers: Find how to charge for running offline services and to make up for the increase in TCO due to co‐location

SLIDE 25

More in the paper

  • Detailed explanation of each model
  • DRAM grades and how they affect TCO
  • Results for other protection techniques (SECDED)
  • Power and performance results for more applications

SLIDE 26

Conclusions

  • DRAM is one of the dominant cost contributors in a DC
  • Different protection techniques have different TCO implications
  • The framework encapsulates all the parameters and tries to determine the cost‐effective protection technique for a DC
  • The results highlight the need for the framework
    – Without this framework, it is not straightforward to decide which DRAM protection technique is best for a DC setup

SLIDE 27

Future Work

  • Evaluate TCO for more online and offline services
  • Explore the cost‐benefits of new ECC schemes
  • Validation of the framework by using detailed logs from a real DC

SLIDE 28

AMPRA tool download site: http://www2.cs.ucy.ac.cy/carch/xi/ampra_tco.php