CS 754 Advanced Distributed Systems - PowerPoint PPT Presentation



SLIDE 1

CS 754 Advanced Distributed Systems: Introduction to Data Centers

SLIDE 2

Data Center Overview

Why DC? Economy of scale (amortize capital and maintenance cost). Machines->Racks->Cluster

SLIDE 3

Design Metrics

  • Performance (requests per second)
  • Cost (capital and operating) (requests per dollar)
  • Power (requests per watt)
SLIDE 4

DC Node Design

Option 1: SMP (Symmetric Multiprocessor). A shared-memory multiprocessor: a set of CPUs, each with its own cache, sharing the main memory over a single bus.

+ High performance per node
− Expensive
SLIDE 5

DC Node Design

Option 2: Commodity nodes. Using off-the-shelf components.

+ Equal performance to SMP at scale
+ Lower cost
− Fails more often
SLIDE 6

SMP vs Commodity

Execution time = CPU time + communication time. Assume a local access takes 100 ns and a remote access takes 100 μs. Communication time = #operations × [100 ns × (1/#nodes) + 100 μs × (1 − 1/#nodes)]

[Figure: local access vs. remote access]
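A minimal sketch (Python; not from the slides) of how the model above behaves as the cluster grows: with more nodes, most accesses become remote and the 100 μs latency dominates.

```python
# Communication-time model from the slide: as the node count grows, more
# accesses are remote, so the 100 us remote latency dominates the 100 ns
# local latency.

LOCAL_NS = 100          # local access latency (100 ns)
REMOTE_NS = 100_000     # remote access latency (100 us, expressed in ns)

def comm_time_ns(num_ops, num_nodes):
    """#operations * [local * 1/n + remote * (1 - 1/n)]."""
    local_fraction = 1.0 / num_nodes
    return num_ops * (LOCAL_NS * local_fraction +
                      REMOTE_NS * (1.0 - local_fraction))

for nodes in (1, 2, 8, 64):
    ms = comm_time_ns(num_ops=1_000_000, num_nodes=nodes) / 1e6
    print(f"{nodes:>3} nodes: {ms:9.1f} ms of communication per 1M operations")
```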

SLIDE 7


SLIDE 8
SLIDE 9

DC Node Design

Option 3: Wimpy nodes. Using low-end CPUs (e.g., ARM processors).

+ Lower cost
+ Lower energy
− Hard to use efficiently

SLIDE 10

DC Node Design

Wimpy design disadvantages

  • Amdahl's law bounds the speed-up:

Task execution time is T = (1 − p)T + pT, where p is the fraction of the code that can run in parallel (0 ≤ p ≤ 1). After parallelization on s cores: T' = (1 − p)T + (p/s)T. Speed-up = T/T' = 1/((1 − p) + p/s). As s → ∞, the speed-up approaches 1/(1 − p).

  • Higher number of threads --> higher serialization/communication cost
  • Harder to program --> higher software cost
  • Higher networking cost
  • Lower utilization

For I/O-intensive workloads (e.g., Google's workloads), commodity machines are a better choice.
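A small illustration (Python; not from the slides) of the Amdahl bound above: even with very many cores, the serial fraction caps the speed-up, which is why a large number of wimpy cores is hard to use efficiently.

```python
# Amdahl's law: speed-up = 1 / ((1 - p) + p / s).
# The serial fraction (1 - p) caps the speed-up no matter how many cores.

def speedup(p, s):
    """Speed-up of a task with parallel fraction p running on s cores."""
    return 1.0 / ((1.0 - p) + p / s)

for p in (0.5, 0.9, 0.99):
    print(f"p = {p:4}: "
          f"s=16 -> {speedup(p, 16):5.2f}x, "
          f"s=1024 -> {speedup(p, 1024):6.2f}x, "
          f"limit -> {1.0 / (1.0 - p):6.2f}x")
```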

SLIDE 11

Storage Design

Design paradigms:

  • NAS: network attached storage, dedicated storage appliance
  • Distributed storage: aggregate storage space from nodes in cluster.

Design dimensions:

  • Reliability: replication or erasure coding (RS coding); compared in the sketch after this list
  • Reduce cost by using cheap disks: they fail more but we will replicate anyway
  • Consistency: varies depending on application
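A rough comparison (Python; the 3-replica and RS(10, 4) parameters are illustrative assumptions, not from the course) of the storage overhead of the two reliability options above:

```python
# Storage overhead of the two reliability options: n-way replication vs.
# Reed-Solomon erasure coding. The parameters below are assumptions.

def replication_overhead(replicas):
    """Raw bytes stored per byte of user data with n-way replication."""
    return replicas

def erasure_overhead(data_blocks, parity_blocks):
    """Raw bytes stored per byte of user data with RS(data, parity)."""
    return (data_blocks + parity_blocks) / data_blocks

print(f"3-way replication: {replication_overhead(3):.1f}x raw storage, "
      f"survives the loss of 2 copies")
print(f"RS(10, 4) coding:  {erasure_overhead(10, 4):.1f}x raw storage, "
      f"survives the loss of any 4 blocks")
```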
SLIDE 12

Storage Design

Option 1: Network attached storage (NAS): a dedicated storage appliance.

+ Simpler deployment
+ Control and management (QoS)
+ Lower network overhead (appliance replication)

SLIDE 13

Storage Design

Option 2: Distributed storage: aggregate storage space from the nodes in the cluster. Reduce cost by using cheap disks: they fail more, but we replicate anyway.

+ Lower cost
+ Higher availability
+ Higher performance
+ Higher data locality
− Higher network overhead
− Lower component reliability
SLIDE 14

Storage Design

NAS:
  + Simpler deployment
  + Control and management (QoS)
  + Lower network overhead (appliance replication)

Distributed storage (GFS):
  + Lower cost
  + Higher availability
  + Higher performance + data locality (at different levels and technologies)
  − Higher write network overhead
SLIDE 15

Network Design

Challenge: build a high-speed, scalable network at lower cost. Optimization tricks:

  • Reduce core bandwidth: a 5:1 oversubscription ratio is common (see the sketch below)
  • Multiple networks (SAN, supercomputer example)
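A quick sketch (Python; the 40-server rack and 1 Gbps server links are illustrative assumptions) of what a 5:1 oversubscription ratio means for per-server bandwidth once traffic leaves the rack:

```python
# Oversubscription: the uplink out of a rack carries less bandwidth than the
# sum of the server links below it. Rack size and link speed are assumptions.

SERVERS_PER_RACK = 40
SERVER_LINK_GBPS = 1.0
OVERSUBSCRIPTION = 5.0    # the 5:1 ratio from the slide

edge_bandwidth = SERVERS_PER_RACK * SERVER_LINK_GBPS    # 40 Gbps into the ToR switch
uplink_bandwidth = edge_bandwidth / OVERSUBSCRIPTION    # 8 Gbps toward the core

print(f"Within the rack:  {SERVER_LINK_GBPS:.2f} Gbps per server")
print(f"Leaving the rack: {uplink_bandwidth / SERVERS_PER_RACK:.2f} Gbps per server "
      f"(all servers active)")
```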

SLIDE 16
SLIDE 17

DC Design Implications

Software using DC needs to be aware of the storage hierarchy

Jeff Dean

SLIDE 18

Example

Data location    Latency   Throughput
RAM              100 ns    20 GBps
Hard disk        10 ms     80 MBps
Network - rack   70 µs     128 MBps (1 Gbps)
Network - DC     500 µs    25 MBps (subscription ratio of 5:1)

[Chart: latency and bandwidth for local RAM, local disk, rack RAM, rack disk, DC RAM, DC disk]
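A rough estimate (Python; the 1 MB block size is an illustrative assumption) of how long a read takes at each level of the table above, which is why software must know where its data lives:

```python
# Time to fetch a block = latency + size / throughput, using the table above.
# The 1 MB block size is an assumption for illustration.

TIERS = {                       # (latency in seconds, throughput in bytes/s)
    "RAM":            (100e-9, 20e9),
    "Hard disk":      (10e-3,  80e6),
    "Network - rack": (70e-6,  128e6),
    "Network - DC":   (500e-6, 25e6),
}

BLOCK_BYTES = 1 * 1024 * 1024   # 1 MB

for tier, (latency, bandwidth) in TIERS.items():
    total_ms = (latency + BLOCK_BYTES / bandwidth) * 1000
    print(f"{tier:<15} ~{total_ms:7.2f} ms per 1 MB read")
```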

SLIDE 19

Example

Jeff Dean

SLIDE 20

Example

Jeff Dean

SLIDE 21

DC Design Implications

  • Software using DC needs to be aware of the network and storage hierarchy
  • Software fault tolerance is necessary
  • Programming frameworks to hide complexity

Technology changes:

  • Much more memory
  • New disks: Shingled, Kinetic, PCIeNV
  • SSD, NVM
  • SDN networks
  • Programmable NIC and switches
  • Faster network
SLIDE 22

Large Scale Services

Two categories:

  • Online, e.g., e-commerce, instant messaging
      • Low latency
      • Highly available
      • Mostly read operations
  • Offline: batch processing, e.g., data processing
      • Compute and I/O intensive
      • Throughput centric
SLIDE 23

Model

SLIDE 24

Load Manager

  • DNS-based
      • May take hours to adapt
      • Not available to small clusters
  • Appliance or switch (L4)
  • Smart client (L7)

Load balancing techniques (the first two are sketched after this list):

  • Round robin
  • Least number of connections
  • Response time
  • Source IP hash
  • SDN based
  • Chained failover
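A minimal sketch (Python; the class names and pick()/done() interface are illustrative, not from the course) of the first two policies, round robin and least number of connections:

```python
import itertools

# Two of the load-balancing policies above, in their simplest form.
# Backend names and the pick()/done() interface are assumptions.

class RoundRobin:
    def __init__(self, backends):
        self._cycle = itertools.cycle(backends)

    def pick(self):
        # Hand out backends in a fixed rotation, ignoring their current load.
        return next(self._cycle)

class LeastConnections:
    def __init__(self, backends):
        self.active = {b: 0 for b in backends}

    def pick(self):
        # Send the request to the backend with the fewest open connections.
        backend = min(self.active, key=self.active.get)
        self.active[backend] += 1
        return backend

    def done(self, backend):
        self.active[backend] -= 1

rr = RoundRobin(["node-1", "node-2", "node-3"])
print([rr.pick() for _ in range(5)])    # node-1, node-2, node-3, node-1, node-2
```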
SLIDE 25

High Availability

Metric (uptime): percent of time the system is available to answer client requests.

--| Fail |--- Recover ------|------------- available ------------------| Fail |--- Recover --|------------

Uptime = (MTBF − MTTR) / MTBF
MTBF: mean time between failures
MTTR: mean time to repair
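A worked example (Python; the MTBF and MTTR values are illustrative assumptions) of the uptime formula, showing that halving MTTR buys as much uptime as doubling MTBF:

```python
# Uptime = (MTBF - MTTR) / MTBF, from the slide. All values are made up.

def uptime(mtbf_hours, mttr_hours):
    return (mtbf_hours - mttr_hours) / mtbf_hours

# A node that fails roughly every 1000 hours and takes 1 hour to repair:
print(f"{uptime(1000, 1.0):.4%}")    # 99.9000%

# Halving MTTR gives the same uptime as doubling MTBF:
print(f"{uptime(1000, 0.5):.4%}")    # 99.9500%
print(f"{uptime(2000, 1.0):.4%}")    # 99.9500%
```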

SLIDE 26

High Availability

Uptime = (MTBF − MTTR) / MTBF
Brewer's recommendation: do your best effort on MTBF, but focus on reducing MTTR. Why?

  • MTBF needs weeks of testing to measure.
  • MTTR is easier to improve: easier to debug and measure.

Problem with uptime: not all seconds are equal (idle vs. peak time).

SLIDE 27

High Availability

Yield = queries completed / queries offered
Harvest = data available / complete data
DQ principle: data per query (D) × queries per second (Q) → constant
The underlying limitation is data movement (seeks, I/O bandwidth, etc.). Good for:

  • Comparing systems
  • Deciding on upgrades
  • Measuring the effect of failures
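A small sketch (Python; the node count and capacity units are illustrative assumptions) of the DQ principle above under a single node failure, and the yield-vs-harvest trade-off it forces:

```python
# DQ: data per query (D) x queries per second (Q) is roughly constant for a
# cluster. Losing 1 of 10 nodes removes ~10% of DQ capacity, which must show
# up as lost yield or lost harvest. The numbers below are assumptions.

NODES = 10
DQ_PER_NODE = 1000            # arbitrary units of data-movement capacity

total_dq = NODES * DQ_PER_NODE
remaining = (NODES - 1) * DQ_PER_NODE
fraction = remaining / total_dq

print(f"DQ capacity after losing one node: {fraction:.0%}")
# Option 1: answer every query (Q unchanged) with partial results.
print(f"  keep yield at 100% -> harvest drops to {fraction:.0%}")
# Option 2: keep answers complete (D unchanged) and turn some queries away.
print(f"  keep harvest at 100% -> yield drops to {fraction:.0%}")
```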
SLIDE 28

Graceful Degradation

Degrade service under overload (instead of complete system failure). Overload will happen: single-event bursts, a peak-to-average ratio of 6:1, failures. Techniques:

  • Limit D (partial results) and maintain Q
  • Limit Q (by admission control; sketched after this list) and maintain D
  • QoS, cost-based
  • Priorities
  • Reduce data quality (freshness)
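A minimal sketch (Python; the in-flight limit and interface are illustrative assumptions) of the admission-control technique above: shed excess queries (lower yield) so the admitted ones still get complete data (full harvest).

```python
# Admission control: under overload, reject excess queries instead of queueing
# them, so admitted queries still see complete data. The limit is an assumption.

class AdmissionController:
    def __init__(self, max_inflight):
        self.max_inflight = max_inflight
        self.inflight = 0

    def try_admit(self):
        if self.inflight >= self.max_inflight:
            return False              # shed load: fail fast rather than queue
        self.inflight += 1
        return True

    def finish(self):
        self.inflight -= 1

ac = AdmissionController(max_inflight=3)
print([ac.try_admit() for _ in range(5)])    # [True, True, True, False, False]
```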
SLIDE 29

Evolution

Perfect software is hard, costly, and takes a long time. Aim for software that handles failures well (high MTBF, low MTTR, no cascading failures). Other bugs are less critical: memory leaks, slowness, etc. (try throwing more hardware at them). Reasoning: upgrades are controlled failures; do them off-peak. Strategies (all have the same DQ loss over time; compared in the sketch below):

  • Fast reboot of all cluster nodes. Easier (jump between versions), risky (could be buggy), downtime.
  • Rolling upgrade: 5% at a time. More complex (two versions run at the same time), slow.

  • Big Flip: jump from one version to the other half-a-cluster at a time.

Rolling upgrade is the most popular.
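A back-of-the-envelope comparison (Python; the 2 node-hours of total upgrade work is an illustrative assumption) showing why the three strategies have the same DQ loss: the capacity taken offline times the time it stays offline is constant.

```python
# Each strategy trades how much capacity is offline against how long the
# upgrade takes; the product (capacity x time) is the same DQ loss.
# The 2 node-hours of total upgrade work is an assumption.

TOTAL_NODE_HOURS = 2.0            # work needed to upgrade the whole cluster

strategies = {
    "Fast reboot":     1.00,      # 100% of the cluster offline at once
    "Big flip":        0.50,      # half the cluster offline at a time
    "Rolling upgrade": 0.05,      # 5% of the cluster offline at a time
}

for name, fraction_down in strategies.items():
    duration = TOTAL_NODE_HOURS / fraction_down    # hours the upgrade takes
    dq_loss = fraction_down * duration             # capacity-hours lost
    print(f"{name:<16} {fraction_down:4.0%} down for {duration:5.1f} h "
          f"-> DQ loss {dq_loss:.1f} capacity-hours")
```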

SLIDE 30

Replication vs. Partitioning

Replication → higher harvest. Partitioning → higher yield.
E.g., a two-node cluster where one node fails:
  • Replication: 100% harvest, 50% yield (but replication needs more DQ for writes)
  • Partitioning: 50% harvest, 100% yield
Same DQ value (lower by 50%).
Since capacity is not an issue (capacity is cheap), use replication: better harvest, affects yield only under heavy load, easier to manage, scales, easier disaster recovery.