SOFT CONTAINER: TOWARDS 100% RESOURCE UTILIZATION - ACCELA ZHAO, LAYNE PENG - PowerPoint PPT Presentation

SLIDE 1

SOFT CONTAINER

TOWARDS 100% RESOURCE UTILIZATION

ACCELA ZHAO, LAYNE PENG
SLIDE 2

WHO ARE THOSE GUYS …

Accela Zhao, Technologist at EMC OCTO, active Openstack community contributor, experienced in cloud scheduling and container technologies. Mail: accela.zhao@emc.com

Layne Peng, Principal Technologist at EMC OCTO, experienced cloud architect, one of the earliest contributors to Cloud Foundry in China, owner of 9 patents and a book author. Mail: layne.peng@emc.com Twitter: @layne_peng

SLIDE 3

WHAT IS RESOURCE UTILIZATION?

This is what we buy vs. this is what we use: the gap is $$$ wasted

SLIDE 4

ENERGY AND RESOURCE UTILIZATION

Energy-related costs are 42% of the total (including buying new machines). An idle server consumes as much as 70% of the energy of one running at full speed.

Low resource utilization is energy inefficient Waste energy, waste money

Real-world resource utilization is usually low: around 20% or less

SLIDE 5

A CLOSER LOOK AT CLOUD

The key advantage: cloud consolidation. Fewer machines, more apps. Energy-efficient, saves money, and improves resource utilization.

SLIDE 6
RESOURCE UTILIZATION ON CLOUD

  • Scheduling - choose the best resource placement when the app starts
    – Examples: Green Cloud, Paragon, and the schedulers in Openstack, Kubernetes, Mesos, …
  • Migration - continuously optimize the resource placement while the app is running
    – Examples: Openstack Watcher, VMware DRS
  • Soft Container - elastic, dynamically adjusting resource constraints in response to co-located apps
    – Related: Google Heracles

SLIDE 7

RESOURCE UTILIZATION ON CLOUD

  • Scheduler - manages resource utilization at app kick-off
  • Migration - manages resource utilization across hosts while the app is running
  • Soft Container - manages resource utilization at fine granularity inside the host
SLIDE 8

RESOURCE UTILIZATION ON CLOUD

A battle between putting more apps on each host and guaranteeing app SLAs. The key problem: resource interference.

SLIDE 9
THE KEY PROBLEM: RESOURCE INTERFERENCE

  • What is resource interference?
    – Apps co-located on one host share resources like CPU, cache, memory, …
    – They interfere with each other, resulting in poor performance compared to running standalone
    – Resource interference makes SLAs unenforceable
  • Related readings
    – Google Heracles: an analysis of resource interference
    – Paragon: resource interference-aware scheduling
    – Bubble-up: measuring resource interference

SLIDE 10

RESOURCE INTERFERENCE: WHAT DOES IT LOOK LIKE?

MySQL running standalone vs. co-located with a CPU- and disk-hungry task

SLIDE 11
RESOURCE INTERFERENCE: HOW TO MEASURE?

  • Bubble-up
    – The setup
      • Run the app co-located with resource benchmarks; each benchmark stresses one type of resource
    – App-tolerated resource interference
      • Slowly increase the benchmark stress until the app fails its SLA
      • The critical point shows how much resource interference the app can tolerate
    – App-caused resource interference
      • Run the app at what its SLA requires
      • The stress it puts on each type of resource is the app’s caused resource interference
  • Where to use it?
    – Better resource utilization management
    – Scheduling, Migration, Soft Container, …
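The tolerated-interference probe above can be sketched as a tiny loop. This is a toy illustration, not the actual Bubble-up tooling: the `run_benchmark` and `measure_latency` hooks are hypothetical callbacks a harness would supply.

```python
def bubble_up(run_benchmark, measure_latency, sla_ms, step=1, max_stress=100):
    """Bubble-up style probe (sketch): raise the co-located benchmark's
    stress level until the app misses its SLA. The last passing level is
    the interference the app can tolerate on that resource."""
    tolerated = 0
    for level in range(0, max_stress + 1, step):
        run_benchmark(level)             # stress one resource at this level
        if measure_latency() > sla_ms:   # app failed its SLA: critical point
            break
        tolerated = level
    return tolerated
```

Running one such sweep per resource type (CPU, cache, disk, …) yields the app's interference-tolerance profile used by schedulers like Paragon.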

SLIDE 12

RESOURCE INTERFERENCE: HOW TO MEASURE?

MySQL standalone running, vs co-located with CPU stress, vs disk stress. In my case, MySQL is much more sensitive to CPU interference.

SLIDE 13
INTRODUCING SOFT CONTAINER

  • Motivations
    – Increase resource utilization by co-locating more apps
      • E.g. a business service is critical but may not use all resources on the host. Add low-priority hadoop batch tasks to fill what is left.
    – Respond to the dynamic nature of time-varying workloads
      • E.g. the business service may become more idle at lunch time; hadoop tasks can then expand their resource bubble and utilize the leftover.
    – Guarantee the SLA of critical apps
      • E.g. when the business service suddenly requires more resources for processing, hadoop tasks shrink instantly to give resources back.
  • Challenges
    – Resource control and isolation of interference
    – Responding to dynamic workload change

SLIDE 14
INTRODUCING SOFT CONTAINER

  • What does “Soft” mean?
    – Varying a container’s resources based upon its neighbors and SLAs (the container becomes elastic)
    – “Expanding” (bubbling up) resources when idle resources exist
    – Shrinking resources on a specific container when another, critical app demands more resources

(Figure: the container resource bubble varies over time)
SLIDE 15

THE FEEDBACK CONTROL LOOP

(Diagram: Watcher, Controller, and Limiter form the Soft Container control loop around the containers)

SLIDE 16

RESOURCES TO LIMIT

  • CPU

– Core – Time Quota – …

  • Memory

– Size – Bandwidth – …

  • Disk I/O

– IOPS – Throughput – …

SLIDE 17

RESOURCES TO LIMIT - MISSING

  • CPU

– Core – Time Quota – …

  • Cache

– LLC – …

  • Memory

– Size – Bandwidth* – …

  • GPU

– …

  • Device*

– …

  • Network

– Ulimit – Bandwidth – …

  • Disk I/O

– IOPS – Throughput – …

As of kernel 3.6, most of this support can be found in the community…

SLIDE 18

ISOLATING THE RESOURCES - NAMESPACE

/proc/<pid>/ns:

  • lrwxrwxrwx 1 root root 0 Jun 21 18:38 ipc -> ipc:[4026532509]
  • lrwxrwxrwx 1 root root 0 Jun 21 18:38 mnt -> mnt:[4026532507]
  • lrwxrwxrwx 1 root root 0 Jun 16 18:24 net -> net:[4026532512]
  • lrwxrwxrwx 1 root root 0 Jun 21 18:38 pid -> pid:[4026532510]
  • lrwxrwxrwx 1 root root 0 Jun 21 18:38 user -> user:[4026531837]
  • lrwxrwxrwx 1 root root 0 Jun 21 18:38 uts -> uts:[4026532508]
  • clone(): create a new process attached to a new namespace
  • unshare(): create a new namespace and attach it to an existing process
  • setns(): move a process into an existing namespace
  • security namespace
  • security keys namespace
  • device namespace
  • time namespace

We are still waiting…
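The `/proc/<pid>/ns` listing on this slide can be reproduced programmatically by reading the namespace symlinks. A minimal sketch; the `proc_root` parameter is my addition so the function can be exercised outside a real `/proc`:

```python
import os

def list_namespaces(pid="self", proc_root="/proc"):
    """Read the namespace links under /proc/<pid>/ns (Linux), returning a
    dict like {'pid': 'pid:[4026532510]', 'net': 'net:[4026532512]', ...}.
    Two processes share a namespace iff their links resolve to the same
    inode-tagged target."""
    ns_dir = os.path.join(proc_root, str(pid), "ns")
    return {name: os.readlink(os.path.join(ns_dir, name))
            for name in sorted(os.listdir(ns_dir))}
```

Comparing the output for two pids is a quick way to check whether, say, two containers really live in separate net namespaces.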

SLIDE 19

LIMITING THE RESOURCE - CGROUP

Concepts: Task, Control Group & Hierarchy, Subsystem – Control options:

  • blkio
  • cpu
  • cpuacct
  • cpuset
  • devices
  • freezer
  • memory
  • net_cls
  • net_prio
  • ns

Usage: create a cgroup under a subsystem, then change the limit, e.g.:

# echo 524288000 > /sys/fs/cgroup/memory/foo/memory.limit_in_bytes
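The same write can be done from code. A minimal sketch of the `echo` above, assuming the cgroup v1 layout; `cgroup_root` is injectable (my addition) so the helper can be tested without a real cgroup mount:

```python
import os

def set_cgroup_limit(value, rel_path, cgroup_root="/sys/fs/cgroup"):
    """Write a cgroup (v1) control file, e.g.
    set_cgroup_limit(524288000, "memory/foo/memory.limit_in_bytes")
    is equivalent to the slide's echo command. Requires the group
    directory (e.g. memory/foo) to already exist."""
    target = os.path.join(cgroup_root, rel_path)
    with open(target, "w") as f:
        f.write(str(value))
```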
SLIDE 20

MISSING - NETWORK

Isolation does not mean resource control.

Suppose two containers on one machine share 100Gbps of bandwidth in total.

(Chart: one container consuming 80 of the 100Gbps)

SLIDE 21

MISSING - NETWORK

Isolation does not mean resource control.

Suppose two containers on one machine share 100Gbps of bandwidth in total.

(Chart: the green container’s consumption grows from 80 to 95 of the 100Gbps)

If the GREEN container consumes the majority of the bandwidth, it may have a negative impact on the BLUE one… How can we prevent this from happening?

SLIDE 22

MISSING - NETWORK

Community attempts: based on Traffic Control (tc).

The nightmare of PaaS providers…


SLIDE 24

MISSING - GPU

Nvidia’s efforts:

  • a. GPUs exposed as separate normal devices in /dev
    Ref: https://github.com/NVIDIA/nvidia-docker/wiki/GPU-isolation
  • b. devices cgroup: Allow/Deny/List access (R, W, M)
SLIDE 25

MISSING - GPU

Nvidia’s efforts are usable, but insufficient…

  • 1. Launch multiple jobs in parallel, each using a subset of the available GPUs
  • 2. How about sharing a GPU between jobs with proper isolation? Can we share a GPU like we share a CPU?
SLIDE 26

MISSING - CACHE

Intel’s efforts:

Cache Monitoring Technology (CMT)
  • Lets an OS or VMM assign a software-defined ID, the Resource Monitoring ID (RMID), to each application or VM scheduled to run on a core
  • Monitors cache occupancy on a per-RMID basis
  • Lets an OS or VMM read LLC occupancy for a given RMID at any time

Cache Allocation Technology (CAT)
  • The ability to enumerate the CAT capability and the associated LLC allocation support via CPUID
  • Interfaces for the OS/hypervisor to group applications into classes of service (CLOS) and indicate the amount of last-level cache available to each CLOS; these interfaces are based on MSRs (Model-Specific Registers)

Code and Data Prioritization (CDP)
  • An extension to CAT
  • A new CPUID feature flag within the CAT sub-leaves at CPUID.0x10.[ResID=1]:ECX[bit 2] indicates support
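Whether a host exposes these RDT features can be checked from `/proc/cpuinfo` flags. A sketch, assuming the flag names (`cqm`, `cat_l3`, `cdp_l3`) that recent Linux kernels export; older kernels may not show them even on capable CPUs:

```python
def rdt_features(cpuinfo_text):
    """Scan /proc/cpuinfo content for the RDT feature flags discussed
    above and return the human-readable names of those present."""
    wanted = {"cqm": "CMT", "cat_l3": "CAT (L3)", "cdp_l3": "CDP (L3)"}
    flags = set()
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):                 # per-core flags line
            flags.update(line.split(":", 1)[1].split())
    return {label for flag, label in wanted.items() if flag in flags}
```

Typical use: `rdt_features(open("/proc/cpuinfo").read())` on the target host.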
SLIDE 27

MISSING – MEMORY BANDWIDTH

Monitor: Memory Bandwidth Monitoring (MBM)
  • Mechanisms in hardware to monitor cache occupancy and bandwidth statistics, as applicable to a given product generation, on a per-software-ID basis
  • Mechanisms for the OS or hypervisor to read back the collected metrics, such as L3 occupancy or memory bandwidth, for a given software ID at any point during runtime

Control

Ref: Memory Bandwidth Management for Efficient Performance Isolation in Multi-core Platform: http://pertsserver.cs.uiuc.edu/~mcaccamo/papers/private/IEEE_TC_journal_submitted_C.pdf
Code: https://github.com/heechul/memguard
SLIDE 29
WATCH THE WORKLOAD CHANGE

  • Latencies
    – App request latency – Disk IO await – Network response time
  • Queue length
    – CPU load average – Disk request queue size – Network queue length
  • Utilization
    – CPU util rate – Disk util rate – Network util rate
  • Bandwidth
    – DRAM bandwidth – CPU bandwidth – Disk bandwidth
  • Request count
    – App request count – Disk IOPS / req/s – Network IOPS / req/s
  • Granularity
    – Global level – Per-container level
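One of the simplest Watcher metrics above, CPU utilization, is computed from deltas between two samples of the aggregate `cpu` line in `/proc/stat`. A sketch of that calculation (a real Watcher would also pull per-container stats from cgroup accounting files):

```python
def cpu_util(prev, curr):
    """Utilization between two /proc/stat 'cpu' samples. Each sample is
    the list of jiffy counters from the aggregate line: user, nice,
    system, idle, iowait, irq, softirq, ... Idle time is idle + iowait."""
    idle = (curr[3] + curr[4]) - (prev[3] + prev[4])  # idle + iowait delta
    total = sum(curr) - sum(prev)                     # all jiffies elapsed
    return 1.0 - idle / total if total else 0.0
```

Note the slide's later point that counting iowait as "busy" instead would make a disk-bound host look close to saturation.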

SLIDE 30

THE FEEDBACK CONTROL LOOP

(Diagram: the Watcher / Controller / Limiter loop around the containers)

SLIDE 31

THE FEEDBACK CONTROL LOOP

(Diagram: the Watcher / Controller / Limiter loop around the containers)

Immediate response

SLIDE 32

THE FEEDBACK CONTROL LOOP

(Diagram: the Watcher / Controller / Limiter loop around the containers)

Immediate response. But how do we immediately resize the containers?

SLIDE 33

HOW WE LOOK AT RESIZE?

a. Create a new container
b. Live migrate the contents to the new container:
   1. Transfer the existing data to the new container
   2. Transfer the live, in-flight data to the new container
c. Stop the old container
d. Start the new container
e. Route the traffic to the new container
SLIDE 34

IN CONTAINER’S WORLD…

Process 9527 /usr/sbin/httpd

Before - Control Groups (cgroup):

  • CPU time: 20
  • System memory: 1G
  • Disk bandwidth: 2000
  • Network bandwidth: 100Mbps

After - Control Groups (cgroup):

  • CPU time: 70
  • System memory: 5G
  • Disk bandwidth: 8000
  • Network bandwidth: 1Gbps

a. Mount the process into a new cgroup, or change the values of its current cgroup
b. Done!
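Step (a) above, attaching a running process to another cgroup so it immediately inherits the new limits, is just a write to that group's `cgroup.procs` file. A sketch against the cgroup v1 layout; the injectable `cgroup_root` is my addition for testability:

```python
import os

def move_to_cgroup(pid, group_dir, cgroup_root="/sys/fs/cgroup"):
    """Attach an existing process (e.g. pid 9527, httpd) to another cgroup
    by appending its pid to that group's cgroup.procs. The process keeps
    running; no stop/start or data migration is needed."""
    procs = os.path.join(cgroup_root, group_dir, "cgroup.procs")
    with open(procs, "a") as f:
        f.write(f"{pid}\n")
```

This is why resize is "immediate" in the container world, in contrast to the create/migrate/stop/start sequence on the previous slide.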

SLIDE 35


We need to take a fresh look at resource management from the container’s perspective.

SLIDE 36

SOFT CONTAINER: IMPLEMENTATION

Controller
  • Algorithm “expand” – Algorithm “pin_idle” – Algorithm plugin N

Watcher
  • CPU plugin – Disk plugin – Watcher plugin N

Limiter
  • RunC plugin – Docker plugin – Limiter plugin N

Metrics Store
  • CPU statistics – Disk – more…

Container Repo
  • RunC plugin – Docker plugin – Container type N
  • Containers auto-discovery
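The pictured components compose into one control loop. The sketch below is my assumption of how the plugins fit together (watchers collect metrics, a controller algorithm decides, limiters enforce); it is not the project's actual API:

```python
class SoftContainerLoop:
    """Minimal skeleton of the Watcher -> Controller -> Limiter loop."""

    def __init__(self, watchers, algorithm, limiters):
        self.watchers = watchers    # e.g. CPU plugin, Disk plugin
        self.algorithm = algorithm  # e.g. "expand" or "pin_idle"
        self.limiters = limiters    # e.g. RunC plugin, Docker plugin

    def tick(self):
        metrics = {}
        for watch in self.watchers:        # Watcher: collect container stats
            metrics.update(watch())
        actions = self.algorithm(metrics)  # Controller: compute new limits
        for limit in self.limiters:        # Limiter: enforce (e.g. cgroups)
            for container, value in actions.items():
                limit(container, value)
        return actions
```

Example: a watcher reports MySQL at 90% CPU, the algorithm shrinks the batch container's CPU share, and the limiter writes the new cgroup value, one iteration of the resource bubble adjusting.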
SLIDE 37
SOFT CONTAINER: CURRENT STATUS

  • Early version
  • Supports RunC and Docker containers
  • A few effective controller algorithms
  • Able to expand with more plugins

Completely runnable!

SLIDE 38

Demo Time :-)

SLIDE 39

BENCHMARK RESULTS: BEFORE

If uncontrolled, the MySQL workload suffers severe interference from the co-located low-priority task

SLIDE 40

BENCHMARK RESULTS: BEFORE

The CPU utilization is far from saturated while the workload varies over time (although in my case, disk IO is highly utilized)

SLIDE 41

BENCHMARK RESULTS: SOFT CONTAINER

With Soft Container (green line), latency impact is controlled. (We can improve the algorithm to cope better with peak workload)

SLIDE 42

BENCHMARK RESULTS: SOFT CONTAINER

Soft Container helps improve CPU utilization by co-locating new tasks with MySQL

SLIDE 43

BENCHMARK RESULTS: SOFT CONTAINER

CPU utilization looks close to saturation, after adding in iowait time

SLIDE 44
HOW DOES SOFT CONTAINER DO THIS?

  • Soft Container monitors app resource needs and overall resource utilization in realtime
  • Soft Container issues resource controls in realtime, to guard app SLAs and balance resource utilization

SLIDE 45

BENCHMARK RESULTS: SOFT CONTAINER

How the resource bubble floats under the control of Soft Container. (The vibration thresholds are made very sensitive to workload change)

SLIDE 46

Q&A