CS 744: DATACENTER AS A COMPUTER
Shivaram Venkataraman, Fall 2020


SLIDE 1

CS 744: DATACENTER AS A COMPUTER

Shivaram Venkataraman Fall 2020


SLIDE 2

ANNOUNCEMENTS

  • Assignments
    • Assignment zero is due!
    • Form groups for Assignment 1 on Piazza (out Thursday)
  • Class format
    • Review
    • Lecture
    • Discussion

SLIDE 3

Course stack, from applications down to hardware:

  • Applications: Machine Learning, SQL, Streaming, Graph
  • Computational Engines
  • Resource Management
  • Scalable Storage Systems
  • Datacenter Architecture (Hardware)

SLIDE 4

OUTLINE

  • Hardware Trends
  • Datacenter design
  • WSC workloads
  • Discussion
SLIDE 5

Why is One Machine Not Enough?

  • Limited parallelism: not enough resources on one machine
  • Cost of a single large machine can be high
  • Redundancy: failures are easier to manage across multiple machines
  • Data volumes are high: a single machine is too slow

SLIDE 6

What’s in a Machine?

Interconnected compute and storage

Newer hardware:
  • GPUs, FPGAs
  • RDMA, NVLink

[Diagram: processor connected to DRAM over the memory bus, to SSD/HDD over SATA and PCIe v4, and to the network over Ethernet]

SLIDE 7

Scale Up: Make More Powerful Machines

Moore’s law
  – Stated 52 years ago by Intel founder Gordon Moore
  – Number of transistors on a microchip doubles every 2 years
  – Today “closer to 2.5 years” (Intel CEO Brian Krzanich)
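The doubling arithmetic above is easy to check with a short sketch (the function name and starting transistor count are illustrative, not from the slide):

```python
# Hypothetical illustration of Moore's law: transistor count doubling
# every `period` years (2 classically, ~2.5 per the Krzanich quote).
def transistor_growth(start_count, years, period=2.0):
    """Projected transistor count after `years` of doubling every `period` years."""
    return start_count * 2 ** (years / period)

# Starting from 1 billion transistors, a decade of 2-year doubling
# gives a 32x increase; at 2.5-year doubling it is only 16x.
print(transistor_growth(1e9, 10, period=2.0))  # 3.2e10
print(transistor_growth(1e9, 10, period=2.5))  # 1.6e10
```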
SLIDE 8

Dennard Scaling is the Problem

Suggested that power requirements are proportional to the area of a transistor
  – Both voltage and current are proportional to transistor length
  – Stated in 1974 by Robert H. Dennard (DRAM inventor)
Broken since 2005

“Adapting to Thrive in a New Economy of Memory Abundance,” Bresniker et al

[Sketch: one large core vs. a 32-core chip]

SLIDE 9

Dennard Scaling is the Problem

Per-core performance is stalled; the number of cores is increasing

“Adapting to Thrive in a New Economy of Memory Abundance,” Bresniker et al


SLIDE 10

MEMORY TRENDS

[Chart (log scale): DRAM capacity and bandwidth over time; memory bandwidth is roughly 10-15 GB/s per core]
SLIDE 11

MEMORY TAKEAWAY

Growing ~15% per year

Data access from memory is getting more expensive!
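The takeaway follows from compound-growth arithmetic; a minimal sketch (the 15%/year figure is the slide's, the 10-year horizon is illustrative):

```python
def compound_growth(rate, years):
    """Growth factor after `years` at `rate` per year (e.g. 0.15 = 15%)."""
    return (1 + rate) ** years

# At ~15%/year, the quantity roughly quadruples in a decade; anything
# growing more slowly (e.g. bandwidth relative to capacity) falls
# steadily behind, which is why memory access gets relatively pricier.
print(round(compound_growth(0.15, 10), 2))  # 4.05
```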

SLIDE 12

HDD CAPACITY

[Chart: HDD capacity over time; data from Backblaze, a backup provider]

SLIDE 13

HDD BANDWIDTH

Disk bandwidth is not growing

Read bandwidth remains roughly 100-200 MB/s.
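One consequence of flat bandwidth with growing capacity: full-disk scans keep getting slower. A quick sketch (drive size and rate are illustrative picks within the slide's range):

```python
def scan_time_hours(capacity_gb, bandwidth_mb_s):
    """Hours needed to read an entire disk sequentially."""
    return capacity_gb * 1000 / bandwidth_mb_s / 3600  # GB -> MB -> s -> h

# A 10 TB drive at 150 MB/s takes ~18.5 hours to read once through.
print(round(scan_time_hours(10_000, 150), 1))  # 18.5
```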
SLIDE 14

SSDs

Performance:
  – Reads: 25 µs latency
  – Writes: 200 µs latency
  – Erase: 1.5 ms
Steady state, when the SSD is full:
  – One erase every 64 or 128 reads (depending on page size)
Lifetime: 100,000 to 1 million writes per page

Notes: HDD latency is ~10 ms; deleting and overwriting data is expensive on SSDs.
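The latency numbers above support a back-of-the-envelope steady-state read cost (a sketch only; real SSD firmware amortizes erases in the background, so this is an upper-bound intuition):

```python
READ_US, ERASE_US = 25.0, 1500.0  # per-op latencies from the slide

def steady_state_read_us(ops_per_erase):
    """Average per-read latency when one 1.5 ms erase is amortized
    over `ops_per_erase` reads (64 or 128 depending on page size)."""
    return READ_US + ERASE_US / ops_per_erase

print(round(steady_state_read_us(64), 1))   # 48.4 -- roughly 2x the raw read
print(round(steady_state_read_us(128), 1))  # 36.7
```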
SLIDE 15

SSD VS HDD COST

[Chart: SSD vs HDD cost per GB over time]

SLIDE 16

Ethernet Bandwidth

1995, 1998, 2002, 2017

Growing 33-40% per year! (Compare: disk bandwidth stuck near ~100 MB/s)

SLIDE 17

AMAZON EC2 (2019)

[Table: Amazon EC2 instance types (2019), including instances with local flash drives]
SLIDE 18

TRENDS SUMMARY

  • CPU speed per core is flat
  • Memory bandwidth growing slower than capacity
  • SSD, NVMe replacing HDDs
  • Ethernet bandwidth growing

What are the limitations of a single machine?

SLIDE 19

DATACENTER ARCHITECTURE

Within a server: memory bus, SATA, PCIe; between servers: Ethernet

[Diagram: servers grouped into racks with top-of-rack switches; racks connected by higher-level switches]
SLIDE 20

STORAGE HIERARCHY (DC AS A COMPUTER v2)

[Figure: storage hierarchy from “The Datacenter as a Computer” (v2): local DRAM, SSD, and HDD, then rack-level and datacenter-level storage; bandwidth falls from GB/s locally to ~100 MB/s across the network]
SLIDE 21

Warehouse-Scale Computers

  • Single organization
  • Homogeneity (to some extent)
  • Cost efficiency at scale
    – Multiplexing across applications and services
    – Rent it out!
  • Many concerns: infrastructure, networking, storage, software, power/energy, failure/recovery, …


SLIDE 22

SOFTWARE IMPLICATIONS

  • Workload diversity
  • Reliability despite component failures
  • Single organization
  • Storage hierarchy
SLIDE 23

WORKLOAD: Partition-Aggregate

Top-level Aggregator → Mid-level Aggregators → Workers

Notes: low latency is the goal; the index is sharded across workers, and partial results are aggregated up the tree.
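The aggregation tree above can be sketched in a few lines (all names and shard contents below are made up for illustration):

```python
def worker(shard, query):
    # Each worker searches only its own shard of the index.
    return [doc for doc in shard if query in doc]

def mid_level_aggregator(shards, query):
    # Merge partial results from the workers this aggregator owns.
    results = []
    for shard in shards:
        results.extend(worker(shard, query))
    return results

def top_level_aggregator(shard_groups, query):
    # Merge across mid-level aggregators to form the final answer.
    results = []
    for shards in shard_groups:
        results.extend(mid_level_aggregator(shards, query))
    return results

index = [[["cat facts", "dog facts"]], [["cat memes"], ["dog memes"]]]
print(top_level_aggregator(index, "cat"))  # ['cat facts', 'cat memes']
```

In a real system each level is a fan-out of RPCs, so the request's latency is set by the slowest worker on the critical path.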

SLIDE 24

WORKLOAD: SCHOLAR SIMILARITY

Map Stage → Reduce Stage

[Diagram: MapReduce-style job: records mapped in parallel, shuffled by key, then reduced to compute scholar similarity]
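A tiny map/reduce sketch of the job structure above (the actual similarity computation is not shown on the slide, so shared coauthors here are a made-up stand-in signal):

```python
from collections import defaultdict

def map_stage(records):
    # Emit (key, value) pairs; here: key by coauthor.
    for scholar, coauthor in records:
        yield coauthor, scholar

def reduce_stage(pairs):
    # Shuffle: group values by key, then reduce each group.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {k: sorted(v) for k, v in groups.items()}

records = [("alice", "carol"), ("bob", "carol"), ("alice", "dave")]
print(reduce_stage(map_stage(records)))
# {'carol': ['alice', 'bob'], 'dave': ['alice']}
```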
SLIDE 25

VIDEO ENCODING

Notes: encoding is compute-intensive; videos are split into fragments that are encoded in parallel (e.g., at YouTube).
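The split-and-encode-in-parallel pattern can be sketched as follows (`encode` is a stand-in for a real codec invocation, e.g. shelling out to an encoder):

```python
from concurrent.futures import ThreadPoolExecutor

def encode(fragment):
    # Placeholder for a real, CPU-heavy codec call.
    return fragment.upper()

def encode_video(fragments, workers=4):
    # Fragments are independent, so they can be encoded in parallel
    # and reassembled in order afterwards; map() preserves input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(encode, fragments))

print(encode_video(["frag-a", "frag-b", "frag-c"]))
# ['FRAG-A', 'FRAG-B', 'FRAG-C']
```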

SLIDE 26

MACHINE LEARNING


SLIDE 27

DISCUSSION

https://forms.gle/CrrrhCPYHerwXNEt5

SLIDE 28

Discussion

Scale-up vs Scale-out

Scale up:
  • If your app has no parallelism, heavy communication, or a small dataset, scale-out is overkill

Scale out:
  • Fault tolerance
  • Pay as you go
SLIDE 29

DISCUSSION

Microsoft Word vs. online document editor like Google Docs

Word:
  • Yearly release cycle; patches between releases
  • Compatibility with the local machine and hardware

Docs:
  • Collaboration (consistency is a challenge)
  • Monthly (or continuous) online updates
  • Access it from anywhere
  • Server-side redundancy → durable storage; 99.99% uptime

SLIDE 30

DISCUSSION

Even if 99% of servers work well, parallelism makes tail latencies worse: with a wide fan-out, only a tiny fraction of machines need to be slow for most requests to see a slowdown.
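The effect above follows from simple independence arithmetic (assuming independent slowdowns, which real clusters only approximate):

```python
def prob_any_slow(n_servers, p_fast=0.99):
    """Probability that at least one of n parallel servers is slow,
    when each is fast independently with probability p_fast."""
    return 1 - p_fast ** n_servers

# With 100-way fan-out, a 1% per-server slowdown rate means ~63% of
# requests hit at least one slow server -- the tail dominates.
print(round(prob_any_slow(1), 2))    # 0.01
print(round(prob_any_slow(100), 2))  # 0.63
```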
SLIDE 31

NEXT STEPS

Next class: Storage Systems
Assignment 1 out Thursday. Submit groups before that!
Waitlist