How to Tame your VM: an Automated Control System for Virtualized Services



SLIDE 1

How to Tame your VM:

an Automated Control System for Virtualized Services

Akkarit Sangpetch (asangpet@andrew.cmu.edu)
Andrew Turner (andrewtu@andrew.cmu.edu)
Hyong Kim (kim@ece.cmu.edu)
Department of Electrical and Computer Engineering, Carnegie Mellon University
LISA 2010, November 11, 2010

SLIDE 2

Virtualized Services

  • Virtualized Infrastructure (vSphere, Hyper-V, ESX, KVM)
    – Easy to deploy
    – Easy to migrate
    – Easy to re-use virtual machines
  • Multi-tier services are configured and deployed as virtual machines


[Diagram: VMs from several multi-tier services (Web1-3, App1-3, DB1-3) spread across three hypervisor hosts.]

SLIDE 3

Problem: What about performance?

  • VMs share physical resources (CPU time / disk / network / memory bandwidth)
  • Need to maintain service quality
    – Response time / latency
  • Existing infrastructure provides a mismatched interface between the service’s performance goal (latency) and the configurable parameters (resource shares)
    – Admins need to fully understand the services before they are able to tune the system


[Diagram: the same multi-tier VMs spread across three hypervisor hosts as on Slide 2.]

SLIDE 4

Existing Solutions – Managing Service Performance

  • Resource Provisioning
    – Admins define resource shares / reservations / limits
    – The “correct” values depend on hardware / applications / workloads
  • Load Balancing
    – Migrate VMs to balance host utilization
    – Frees up capacity on each host (low utilization ~ better performance)


[Diagram: load balancing migrates a VM from Host 1 to Host 2; a share table assigns vm1/vm2/vm3/vm4 weights of 50/10/30/10.]

SLIDE 5

Automated Resource Control

  • Dynamically adjust resource shares at run-time
  • Input = required service response time
    – “I want to serve pages from vm1 in less than 4 seconds”
  • The system calculates the number of shares required to maintain the service level
    – Adapts to changes in workload
    – Admins do not have to guess the number of shares


[Diagram: the control system takes each VM’s target and measured response and rebalances shares, e.g. from 25/25/25/25 to 50/10/10/30.]

SLIDE 6

Control System

[Diagram: a central Controller and Modeler coordinate a Sensor/Actuator pair attached to the VMs on each KVM host.
Sensor: records service response time. Actuator: adjusts host CPU shares.
Modeler: calculates model parameters. Controller: calculates the number of shares required.]
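To make the loop concrete, here is a minimal Python sketch of the sensor → modeler → controller → actuator cycle shown in the diagram. The object interfaces are hypothetical placeholders, not code from the talk; only the 5-second step matches the later plots.

```python
# Minimal sketch of the control cycle; all object interfaces are
# illustrative placeholders, not the authors' implementation.
import time

def control_loop(sensors, modeler, controller, actuators, targets, period=5.0):
    """Run one control iteration every `period` seconds (plots use 5 s steps)."""
    while True:
        # Sensor: record per-VM service response times
        readings = {vm: s.response_time_ms() for vm, s in sensors.items()}
        # Modeler: estimate the model parameters <a0, b0, a1, b1>
        params = modeler.fit(readings)
        # Controller: compute the minimal shares expected to meet each target
        shares = controller.shares_for(params, targets)
        # Actuator: apply the allocation via each host's CPU cgroup
        for vm, n in shares.items():
            actuators[vm].set_cpu_shares(n)
        time.sleep(period)
```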

SLIDE 7

Sensor Unit

[Diagram: request packets to the server VM are mirrored at the host bridge (xtables TEE) into a sensor VM, where tshark feeds an analyzer that reports records such as
{service: blog, tier: web, vm: www1, response time: 250 ms}, {service: blog, tier: db, vm: db1, response time: 2.5 ms}
to the modeler.]

  • Records the service response time
    – Starts when the server sees a “request” packet
    – Stops when the server sends a “response” packet
  • Based on libpcap / tshark – needs to recognize the application’s header (a sketch follows below)

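One way the sensor’s measurement could be implemented on top of tshark is sketched below, matching HTTP requests to responses per TCP stream. The bridge interface name (br0) and the plain-HTTP display filter are assumptions about the testbed; as the bullets note, a real sensor must recognize each application’s own request/response headers.

```python
# Sketch: derive per-request response time from mirrored traffic with tshark.
# Interface name and HTTP-only filter are assumptions about the testbed.
import subprocess

TSHARK_CMD = [
    "tshark", "-l", "-i", "br0",
    "-Y", "http.request or http.response",
    "-T", "fields",
    "-e", "frame.time_epoch", "-e", "tcp.stream",
    "-e", "http.request.method", "-e", "http.response.code",
]

def response_times():
    """Yield (tcp_stream, latency_ms), clocked from request seen to response sent."""
    pending = {}  # tcp.stream -> timestamp of the "request" packet
    with subprocess.Popen(TSHARK_CMD, stdout=subprocess.PIPE, text=True) as proc:
        for line in proc.stdout:
            ts, stream, method, code = (line.rstrip("\n").split("\t") + [""] * 4)[:4]
            if method:                        # "request" packet starts the clock
                pending[stream] = float(ts)
            elif code and stream in pending:  # "response" packet stops it
                yield stream, (float(ts) - pending.pop(stream)) * 1000.0
```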

SLIDE 8

Actuator Unit

  • Adjusts the number of shares based on the controller’s allocation
  • We use Linux/KVM as our hypervisor
    – cgroup cpu.shares controls CFS scheduling shares (see the sketch after the diagram)
  • Similar mechanisms exist on other platforms (Xen weight / ESX shares)

[Diagram: the actuator agent receives an allocation such as {server1: 80, server2: 20} and applies it to the Server 1 and Server 2 VMs through the hypervisor’s CPU cgroup.]
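A minimal sketch of the actuator’s write path, assuming a cgroup-v1 hierarchy in which each VM’s process lives in a group named after the VM; the path layout is hypothetical, since the slide only names cgroup cpu.shares as the mechanism.

```python
# Sketch: apply the controller's allocation through cgroup-v1 cpu.shares,
# which weights the CFS scheduler.  The per-VM group layout is assumed.
import pathlib

CGROUP_ROOT = pathlib.Path("/sys/fs/cgroup/cpu")

def set_cpu_shares(allocation):
    """allocation: relative CFS weights per VM, e.g. {"server1": 80, "server2": 20}."""
    for vm, shares in allocation.items():
        (CGROUP_ROOT / vm / "cpu.shares").write_text(str(shares))

# The allocation from the diagram above:
set_cpu_shares({"server1": 80, "server2": 20})
```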

SLIDE 9

Modeler

[Diagram: a client request flows through the web server (response time T_web) to the database (response time T_DB); the database VM is allocated CPU shares S_cpu.]

  • The modeler calculates the expected service response time based on the shares allocated to each VM:
    – T_web = f(T_DB)
    – T_DB = f(S_cpu)
  • We use an intuitive model for our two-tier services in the experiment (a fitting sketch follows)
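As an illustration, both model stages can be fitted by least squares: the power law from Slide 12 becomes linear in log-log space, and the web-tier relation from Slide 14 is an ordinary linear fit. The helper names and the use of numpy are assumptions, not the authors’ code.

```python
# Sketch: least-squares fits for the two-tier model (hypothetical helpers).
#   T_DB  = a0 * S_cpu ** b0   -> linear fit in log-log space
#   T_web = a1 * T_DB + b1     -> ordinary linear fit
import numpy as np

def fit_db_model(s_cpu, t_db):
    """Fit the power law relating DB CPU shares to DB response time."""
    b0, log_a0 = np.polyfit(np.log(s_cpu), np.log(t_db), 1)
    return np.exp(log_a0), b0                     # (a0, b0)

def fit_web_model(t_db, t_web):
    """Fit the linear relation between DB and web response times."""
    a1, b1 = np.polyfit(t_db, t_web, 1)
    return a1, b1

def predict_web(s_cpu, a0, b0, a1, b1):
    """Expected web response time for a given DB share allocation."""
    return a1 * (a0 * s_cpu ** b0) + b1
```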

SLIDE 10

Resource Shares Effect

[Diagram: test setup across Host 1 and Host 2 with a web server VM (WordPress), a SQL server VM (MySQL), and a cruncher VM (100% CPU), plus NAS storage and test clients.]

SLIDE 11

SQL Server CPU Allocation Effect

[Plots: two panels, “Web Response” and “Database Response” — web response time (ms) and database response time (ms) over 5 s time steps as the number of DB CPU shares (out of 1000) varies.]


SLIDE 12

Database Response vs CPU


[Plot: database response time (ms) vs. number of CPU shares allocated to the SQL VM (out of 1000).]

T_DB = a0 · (S_cpu)^b0

SLIDE 13

Database Response vs CPU (controlled)


[Plot: database response time (ms) vs. number of CPU shares allocated to the SQL server VM (out of 1000), under control.]

SLIDE 14

HTTP vs Database response

[Plot: web response time (ms) vs. database response time (ms).]


T_web = a1 · T_DB + b1

SLIDE 15

Controllers

  • The modeler uses the readings to estimate the service parameters <a0, b0, a1, b1>
    – T_DB = a0 · (S_cpu)^b0
    – T_web = a1 · T_DB + b1
  • The controller finds the minimal S_cpu such that T_web < the specified response time (see the sketch below)
    – Long-term control: uses a moving average to find model parameters and shares
    – Short-term control: uses the last-period reading to find model parameters
  • Short-term control avoids excessive response-time violations while waiting for the long-term model to settle
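A sketch of the controller’s share computation, inverting the fitted model to find the smallest allocation that still meets the target. The function name, the feasibility fallback, and the b0 < 0 assumption (more shares → lower latency, as the measured curves suggest) are illustrative.

```python
# Sketch: invert the fitted model to find the minimal CPU share keeping
# T_web under the target.  Assumes b0 < 0; names are illustrative.
def min_cpu_shares(target_web_ms, a0, b0, a1, b1, max_shares=1000):
    t_db_max = (target_web_ms - b1) / a1      # largest T_DB the target tolerates
    if t_db_max <= 0:
        return max_shares                     # target infeasible: allocate everything
    s_cpu = (t_db_max / a0) ** (1.0 / b0)     # invert T_DB = a0 * S_cpu ** b0
    return min(max(int(round(s_cpu)), 1), max_shares)
```

Long-term control would call this with moving-average parameter estimates; short-term control with the last-period fit.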


SLIDE 16

System Evaluations

[Diagram: evaluation testbed with three hosts, NAS storage, and test clients; two web-server/DB-server VM pairs instrumented with sensors and an actuator.]

SLIDE 17

Static Workload

[Plots: two panels of service response time (ms) over time for web1 and web2 against their targets (web1-target, web2-target) under static load.]

SLIDE 18

Dynamic Workload


[Plots: two panels of service response time (ms) over time for web1 and web2 against their targets under dynamic load (including the web1 load phase), with 50-period moving averages of web1 and web2.]

SLIDE 19

Deviation from Target

                            ─ Without Control ─            ─ With Control ─
                            Mean Dev. (ms)  Violation      Mean Dev. (ms)  Violation
Static Load   Instance 1    540             0%             109 (-80%)      9%
              Instance 2    1043            100%           282 (-73%)      26%
Dynamic Load  Instance 1    276             7%             182 (-34%)      9%
              Instance 2    354             27%            336 (-5%)       12%

(Mean Dev. = mean deviation from the target; Violation = target violation period.)


Our control system helps track the target service response time

SLIDE 20

Further Improvements

  • Enhance Service Model
    – Service dependencies
      • Cache / proxy / load-balancer effects
    – Non-CPU resources (disk / network / memory)
  • Robust Control System
    – Still susceptible to temporal system effects (garbage collection / cron jobs)
    – Need to determine the feasibility of the target performance


SLIDE 21

Conclusions

  • Automated Resource Control system
    – Maintains the target service response time
    – Allows admins to express allocation in terms of a performance target
  • Managing service performance can be automated with sufficient domain knowledge
    – Need a basic framework to describe the performance model
    – Leave the details to the machines (parameter fitting / optimization / resource monitoring)


SLIDE 22

Sample Response & Share values


[Plots: sample p95 web response time (tWeb) against the target, and the corresponding CPU share values, over time.]