Towards General-Purpose Resource Management in Shared Cloud Services



SLIDE 1

Towards General-Purpose Resource Management in Shared Cloud Services

Jonathan Mace, Brown University
Peter Bodik, MSR Redmond
Rodrigo Fonseca, Brown University
Madanlal Musuvathi, MSR Redmond

SLIDE 2

Shared-tenant cloud services

Processes service requests from multiple clients

✓ Great for cost and efficiency

✘ Performance is a challenge

  • Aggressive tenants and system maintenance tasks
  • Resource starvation and bottlenecks
  • Degraded performance, violated SLOs, system outages

SLIDE 3

Shared-tenant cloud services

Ideally: manage resources to provide end-to-end guarantees and isolation

Challenge: OS/hypervisor mechanisms insufficient

✘ Shared threads & processes
✘ Application-level resource bottlenecks (locks, queues)
✘ Resources across multiple processes and machines

Today: lack of guarantees, isolation; some ad-hoc solutions

SLIDE 4

This paper

  • 5 design principles for resource policies in shared-tenant systems
  • Retro – prototype for principled resource management
  • Preliminary demonstration of Retro in HDFS

SLIDE 5

Hadoop Distributed File System (HDFS)

[Diagram: HDFS NameNode (filesystem metadata) and three HDFS DataNodes (replicated block storage)]

SLIDE 6

Hadoop Distributed File System (HDFS)

[Diagram: HDFS NameNode (filesystem metadata) and three HDFS DataNodes (replicated block storage)]

SLIDE 7

SLIDE 8

SLIDE 9

[Diagram: HDFS NameNode and three HDFS DataNodes]

SLIDE 10

[Diagram: HDFS NameNode and three HDFS DataNodes]

SLIDE 11

[Diagram: HDFS NameNode and three HDFS DataNodes]

SLIDE 12

Principle 1: Consider all resources and request types

  • Fine-grained resources within processes
  • Resources shared between processes (disk, network)
  • Many different API calls
  • Bottlenecks can crop up in many places

hardware resources: disk, network, cpu, …
software resources: locks, queues, …
data structures: transaction logs, shared batches, …
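One way to picture Principle 1 is a single ledger that accounts for every kind of resource, hardware or software, against the tenant that used it. The sketch below is illustrative only: `ResourceKind`, `UsageLedger`, and the tenant names are hypothetical and do not come from the slides or from Retro.

```java
import java.util.EnumMap;
import java.util.HashMap;
import java.util.Map;

// Hypothetical taxonomy spanning the three groups listed above:
// hardware resources, software resources, and shared data structures.
enum ResourceKind { DISK, NETWORK, CPU, LOCK, QUEUE, TRANSACTION_LOG }

// Accumulates usage per tenant and per resource kind. Units are up to the
// caller (bytes, nanoseconds held, ...). Not thread-safe: a sketch only.
class UsageLedger {
    private final Map<String, EnumMap<ResourceKind, Long>> usage = new HashMap<>();

    // Add `amount` to the running total for this tenant and resource kind.
    void record(String tenant, ResourceKind kind, long amount) {
        usage.computeIfAbsent(tenant, t -> new EnumMap<>(ResourceKind.class))
             .merge(kind, amount, Long::sum);
    }

    // Total recorded so far; 0 if the tenant never touched the resource.
    long total(String tenant, ResourceKind kind) {
        EnumMap<ResourceKind, Long> perKind = usage.get(tenant);
        return perKind == null ? 0L : perKind.getOrDefault(kind, 0L);
    }
}
```

Treating a lock or a queue as just another `ResourceKind` is the point of the sketch: a bottleneck in any of them shows up in the same accounting.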

SLIDE 13

[Diagram: HDFS NameNode and three HDFS DataNodes]

SLIDE 14

[Diagram: HDFS NameNode and three HDFS DataNodes]

SLIDE 15

[Diagram: HDFS NameNode and three HDFS DataNodes]

SLIDE 16

[Diagram: HDFS NameNode and three HDFS DataNodes]

SLIDE 17

Principle 2: Distinguish between tenants

  • Tenants might send different types of requests
  • Tenants might be utilizing different machines
  • If a policy is efficient, it should be able to target the cause of contention

e.g., if a tenant is causing contention, throttle; otherwise leave the tenant alone
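The "throttle only the cause of contention" idea can be sketched as a function over per-tenant usage of one contended resource. This is a minimal illustration under assumptions: `ContentionPolicy`, `tenantsToThrottle`, and the `maxShare` threshold are hypothetical names, and a share-of-total test is only one possible way to decide who is causing contention.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ContentionPolicy {
    // Returns the tenants whose share of the contended resource exceeds
    // maxShare (a fraction between 0 and 1), in alphabetical order.
    // Everyone else is left alone.
    static List<String> tenantsToThrottle(Map<String, Long> usageByTenant, double maxShare) {
        long total = 0;
        for (long used : usageByTenant.values()) {
            total += used;
        }
        List<String> offenders = new ArrayList<>();
        if (total == 0) {
            return offenders; // no usage, so no contention to attribute
        }
        // TreeMap gives a deterministic iteration order for the result.
        for (Map.Entry<String, Long> e : new TreeMap<>(usageByTenant).entrySet()) {
            if ((double) e.getValue() / total > maxShare) {
                offenders.add(e.getKey());
            }
        }
        return offenders;
    }
}
```

With usage of 90, 5, and 5 units and a threshold of 0.5, only the first tenant is selected; the two light tenants are not throttled.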

SLIDE 18

[Diagram: HDFS NameNode and three HDFS DataNodes]

SLIDE 19

[Diagram: HDFS NameNode and three HDFS DataNodes; label: Admission Control]

SLIDE 20

[Diagram: HDFS NameNode and three HDFS DataNodes; label: Admission Control]

while (!Thread.currentThread().isInterrupted()) {
    sendPacket();
}

SLIDE 21

Principle 5: Schedule early, schedule often

[Diagram: HDFS NameNode and three HDFS DataNodes; label: Admission Control]

while (!Thread.currentThread().isInterrupted()) {
    rate_limit();
    sendPacket();
}
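The slide does not say how rate_limit() is implemented. A token bucket is one common choice, sketched below under that assumption; `TokenBucket` and its parameters are hypothetical, and the clock is passed in explicitly so the behavior is easy to check.

```java
// Hypothetical token-bucket limiter: a sketch of what a rate_limit() call
// inside the send loop might consult. Not taken from Retro or HDFS.
class TokenBucket {
    private final double ratePerSecond; // tokens added per second
    private final double capacity;      // maximum burst size
    private double tokens;
    private long lastNanos;

    TokenBucket(double ratePerSecond, double capacity, long nowNanos) {
        this.ratePerSecond = ratePerSecond;
        this.capacity = capacity;
        this.tokens = capacity; // start with a full bucket
        this.lastNanos = nowNanos;
    }

    // Refill according to elapsed time, then try to take one token.
    // Returns true if the caller may proceed (e.g. send one packet).
    synchronized boolean tryAcquire(long nowNanos) {
        double elapsedSeconds = (nowNanos - lastNanos) / 1e9;
        tokens = Math.min(capacity, tokens + elapsedSeconds * ratePerSecond);
        lastNanos = nowNanos;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;
    }
}
```

In the loop above, rate_limit() could wait until `tryAcquire(System.nanoTime())` returns true. Because the check sits inside the loop, the limit is re-evaluated on every packet instead of once at admission, which is the sense of "schedule early, schedule often".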

SLIDE 22

Resource Management Design Principles

  1. Consider all request types and all resources
  2. Distinguish between tenants
  3. Treat foreground and background tasks uniformly
  4. Estimate resource usage at runtime
  5. Schedule early, schedule often

Retro – prototype for principled resource management in shared-tenant systems
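Principle 4 (estimate resource usage at runtime) can be illustrated with a running estimate that is updated from observed costs instead of a fixed, pre-configured cost per request type. The sketch below uses an exponentially weighted moving average; this is an assumed technique chosen for illustration, and `CostEstimator` and `alpha` are hypothetical names.

```java
// Hypothetical runtime estimator: exponentially weighted moving average
// of the observed cost of an operation (for example, time spent holding a lock).
class CostEstimator {
    private final double alpha;   // weight of the newest observation, in (0, 1]
    private double estimate;
    private boolean seeded = false;

    CostEstimator(double alpha) {
        this.alpha = alpha;
    }

    // Fold one observed cost into the estimate; the first observation seeds it.
    void observe(double cost) {
        estimate = seeded ? alpha * cost + (1 - alpha) * estimate : cost;
        seeded = true;
    }

    // Current estimate; 0.0 until the first observation arrives.
    double estimate() {
        return estimate;
    }
}
```

A larger `alpha` reacts faster to a change in workload; a smaller one smooths out noise.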

SLIDE 23

Retro: end-to-end tracing

[Diagram label: Tenants]

SLIDE 24

Retro: end-to-end tracing

[Diagram label: Tenants]

SLIDE 25

Retro: application-level resource interception

[Diagram label: Tenants]
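The slides give only the titles for tracing and interception, so the following is a guess at the general shape, not Retro's actual mechanism. It assumes the tenant identity travels with the executing thread (`TenantContext`, hypothetical) and that access to a shared resource, here a lock, goes through a wrapper (`TrackedLock`, hypothetical) that attributes each use to the current tenant.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical per-thread tenant tag; a request handler would set it on entry
// and pass it along whenever work moves to another thread or machine.
class TenantContext {
    private static final ThreadLocal<String> CURRENT =
            ThreadLocal.withInitial(() -> "unknown");

    static void set(String tenant) { CURRENT.set(tenant); }
    static String get() { return CURRENT.get(); }
}

// Wraps a lock so that every critical section is counted against the
// tenant on whose behalf the current thread is running.
class TrackedLock {
    private final ReentrantLock lock = new ReentrantLock();
    private final Map<String, AtomicLong> acquisitions = new ConcurrentHashMap<>();

    void runLocked(Runnable criticalSection) {
        lock.lock();
        try {
            acquisitions.computeIfAbsent(TenantContext.get(), t -> new AtomicLong())
                        .incrementAndGet();
            criticalSection.run();
        } finally {
            lock.unlock(); // always release, even if the section throws
        }
    }

    // Number of times the given tenant has entered the critical section.
    long acquisitionsBy(String tenant) {
        AtomicLong count = acquisitions.get(tenant);
        return count == null ? 0L : count.get();
    }
}
```

The same wrapper pattern could record hold time or queueing delay instead of a simple count; the count keeps the sketch short.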

SLIDE 26

Retro: aggregation and centralized reporting

[Diagram label: Tenants]

SLIDE 27

Retro: application-level enforcement

[Diagram label: Tenants]

SLIDE 28

Retro: distributed scheduling

[Diagram label: Tenants]

SLIDE 29

Retro: distributed scheduling

[Diagram label: Tenants]

SLIDE 30

Early Results

HDFS NNBench benchmark

[Figure: normalized throughput and normalized latency for Open, Read, Create, Rename, and Delete; series: HDFS, HDFS w/ Retro]

0.01% to 2% average overhead on end-to-end latency, throughput

SLIDE 31

[Diagram: HDFS NameNode and three HDFS DataNodes]

SLIDE 32

[Diagram: HDFS NameNode and three HDFS DataNodes]

SLIDE 33

Retrospective

Thus far:

  • Per-tenant identification
  • Resource measurements
  • Schedule enforcement

Next steps:

  • Abstractions for writing simplified high-level policies
  • Low-level enforcement mechanisms
  • Policies to monitor system, find bottlenecks, provide guarantees