 
How to Tame your VM: an Automated Control System for Virtualized Services
Akkarit Sangpetch (asangpet@andrew.cmu.edu), Andrew Turner (andrewtu@andrew.cmu.edu), Hyong Kim (kim@ece.cmu.edu)
Department of Electrical and Computer Engineering, Carnegie Mellon University
LISA 2010 - November 11, 2010
Virtualized Services
• Virtualized infrastructure (vSphere, Hyper-V, ESX, KVM)
– Easy to deploy
– Easy to migrate
– Easy to re-use virtual machines
• Multi-tier services are configured and deployed as virtual machines
[Figure: web, app, and database VMs (Web1-3, App1-3, DB1-3) running on three hypervisors]
Problem: What about performance?
• VMs share physical resources (CPU time / disk / network / memory bandwidth)
• Need to maintain service quality
– Response time / latency
• Existing infrastructure provides a mismatched interface between the service’s performance goal (latency) and the configurable parameters (resource shares)
– Admins need to fully understand the services before they are able to tune the system
[Figure: web, app, and database VMs (Web1-3, App1-3, DB1-3) running on three hypervisors]
Existing Solutions – Managing Service Performance
• Resource Provisioning
– Admins define resource shares / reservations / limits
– The “correct” values depend on hardware / applications / workloads
• Load Balancing
– Migrate VMs to balance host utilization
– Free up capacity for each host (low utilization ~ better performance)
[Figure: a share table (vm1 50, vm2 10, vm3 10, vm4 30) and VMs vm1-vm5 placed on Host 1 and Host 2]
Automated Resource Control
• Dynamically adjust resource shares during run-time
• Input = required service response time
– I want to serve pages from vm1 in less than 4 seconds
• The system calculates the number of shares required to maintain the service level
– Adapt to changes in workload
– Admins do not have to guess the number of shares
[Figure: a control system takes each VM's target and measured response (vm1-vm4); share tables show vm1-vm4 at 25 each, and vm1 50, vm2 10, vm3 10, vm4 30]
Control System
[Figure: three KVM hosts, each running VMs with a sensor and an actuator, connected to a modeler and a controller. Labels: Sensor: record service response time; Modeler: calculate model parameters; Controller: calculate # of shares required; Actuator: adjust host CPU shares]
Sensor unit
• Record service response time
• Starts when the server sees a “request” packet
• Until the server sends a “response” packet
• Based on libpcap / tshark – need to recognize the application’s header
[Figure: request packets reach the server VM through the host bridge; xtables TEE also delivers them to a sensor VM running tshark and an analyzer, which reports to the modeler. Example records: {service: blog, tier: web, vm: www1, response time: 250 ms}, {service: blog, tier: db, vm: db1, response time: 2.5 ms}]
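The measurement rule on this slide (start at the request packet, stop at the response packet) can be sketched as below. This is a minimal illustration, not the authors' sensor: the event tuples, the connection ids, and the function name `response_times_ms` are hypothetical, and the packet classification that tshark would perform is assumed to have happened already.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical event format: (timestamp in seconds, connection id, kind),
# where kind is "request" or "response" as classified by the capture tool.
Event = Tuple[float, str, str]


def response_times_ms(events: Iterable[Event]) -> List[Tuple[str, float]]:
    """Pair each connection's first request packet with its next response packet."""
    pending: Dict[str, float] = {}  # connection id -> time the request was seen
    results: List[Tuple[str, float]] = []
    for ts, conn, kind in sorted(events):
        if kind == "request":
            # Keep the earliest request timestamp if the request spans several packets.
            pending.setdefault(conn, ts)
        elif kind == "response" and conn in pending:
            # The timer stops at the first response packet for this connection.
            results.append((conn, (ts - pending.pop(conn)) * 1000.0))
    return results
```

Responses with no matching request are ignored, which keeps the sketch robust when a capture starts mid-connection.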
Actuator Unit
• Adjust number of shares based on the controller’s allocation
• We use Linux/KVM as our hypervisor
• Cgroup cpu.shares controls CFS scheduling shares
• Similar mechanisms exist on other platforms (Xen weight / ESX shares)
[Figure: the actuator agent receives the allocation {server1: 80, server2: 20} and sets the CPU cgroup shares of the Server 1 and Server 2 VMs on the hypervisor (labels show 50 and 80 for Server 1, 50 and 20 for Server 2)]
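A minimal sketch of an actuator step, assuming each VM has its own cgroup directory containing a `cpu.shares` file (the cgroup v1 CPU controller interface). The directory layout, the `cgroup_root` argument, and the function name `apply_allocation` are assumptions for illustration; the slides do not show how the actuator agent is implemented.

```python
from pathlib import Path
from typing import Dict


def apply_allocation(allocation: Dict[str, int], cgroup_root: Path) -> None:
    """Write each VM's share count to <cgroup_root>/<vm>/cpu.shares.

    The one-directory-per-VM layout is an assumption; adapt the path to
    wherever the hypervisor places its VM cgroups.
    """
    for vm, shares in allocation.items():
        if shares <= 0:
            raise ValueError(f"shares for {vm} must be positive, got {shares}")
        (cgroup_root / vm / "cpu.shares").write_text(f"{shares}\n")
```

With the allocation from the slide, a call would look like `apply_allocation({"server1": 80, "server2": 20}, root)`, where `root` points at the parent cgroup directory on the host.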
Modeler
• The modeler calculates the expected service response time based on the shares allocated to each VM
• We use an intuitive model for our 2-tier services in the experiment
– $T_{web} = f(T_{DB})$
– $T_{DB} = f(S_{cpu})$
[Figure: client, web server, and database in a chain, annotated with $T_{web}$, $T_{DB}$, and $S_{cpu}$]
Resource Shares Effect
[Figure: test setup with test clients, a web server VM (Wordpress), a SQL server VM (mysql), and a cruncher VM (100% cpu) on Host 1 and Host 2, with a NAS]
SQL Server CPU Allocation effect
[Figure: two time-series panels, “Web Response” and “Database Response”. x-axis: time step (5s). y-axes: web response time (ms) or database response time (ms), and # of DB CPU shares (out of 1000). Series: thttp-msec, tdb-msec, cpu]
Database Response vs CPU
[Figure: database response time (ms, 0 to 20) against # of CPU shares allocated to SQL VM (out of 1000, shown from 0 to 500)]
$T_{DB} = a_0 (S_{cpu})^{b_0}$
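One way to estimate $a_0$ and $b_0$ from readings is a least-squares line in log-log space, since $\log T_{DB} = \log a_0 + b_0 \log S_{cpu}$. The slides do not say which fitting method the modeler uses, so the method here is an assumption, and the function name `fit_power_law` is hypothetical.

```python
import math
from typing import Sequence, Tuple


def fit_power_law(shares: Sequence[float], t_db_ms: Sequence[float]) -> Tuple[float, float]:
    """Return (a0, b0) for T_DB = a0 * S_cpu**b0 via least squares on logs.

    All inputs must be positive, and at least two distinct share values
    are needed for the slope to be defined.
    """
    xs = [math.log(s) for s in shares]
    ys = [math.log(t) for t in t_db_ms]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b0 = sxy / sxx                        # slope in log-log space
    a0 = math.exp(mean_y - b0 * mean_x)   # intercept mapped back from log space
    return a0, b0
```

A negative fitted $b_0$ corresponds to response time falling as the share count grows.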
Database Response vs CPU (controlled)
[Figure: database response time (ms, 0 to 30) against # of CPU shares allocated to SQL server VM (out of 1000)]
HTTP vs Database response
[Figure: web response (ms, 0 to 800) against database response (ms, 0 to 20)]
$T_{web} = a_1 T_{DB} + b_1$
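The linear relation between web and database response time can be estimated with ordinary least squares. This is a sketch under the assumption that paired readings of both tiers are available for each period; the function name `fit_linear` is hypothetical and the slides do not name the fitting method.

```python
from typing import Sequence, Tuple


def fit_linear(t_db_ms: Sequence[float], t_web_ms: Sequence[float]) -> Tuple[float, float]:
    """Return (a1, b1) for T_web = a1 * T_DB + b1 via ordinary least squares."""
    n = len(t_db_ms)
    mean_x = sum(t_db_ms) / n
    mean_y = sum(t_web_ms) / n
    sxx = sum((x - mean_x) ** 2 for x in t_db_ms)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(t_db_ms, t_web_ms))
    a1 = sxy / sxx              # web milliseconds added per database millisecond
    b1 = mean_y - a1 * mean_x   # web response time not explained by the database tier
    return a1, b1
```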
Controllers
• Model uses readings to estimate the service parameters $\langle a_0, b_0, a_1, b_1 \rangle$
– $T_{DB} = a_0 (S_{cpu})^{b_0}$
– $T_{web} = a_1 T_{DB} + b_1$
• Controller finds the minimal $S_{cpu}$ such that $T_{web}$ < specified response time
– Long-term control: uses moving average to find model parameters & shares
– Short-term control: uses the last-period reading to find model parameters
• Avoid excessive response time violation while waiting for the long-term model to settle
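Given fitted parameters, the two model equations can be inverted to find the smallest share count that meets a target: $T_{web} < T_{target}$ requires $T_{DB} < (T_{target} - b_1)/a_1$, and with $b_0 < 0$ this means $S_{cpu} > \left((T_{target} - b_1)/(a_0 a_1)\right)^{1/b_0}$. The sketch below assumes $a_0 > 0$, $a_1 > 0$, $b_0 < 0$, integer shares, and a cap of 1000 shares as on the plot axes; the function name `min_shares` and the choice to return `None` for an unreachable target are illustrative assumptions, not the authors' controller.

```python
import math
from typing import Optional


def min_shares(a0: float, b0: float, a1: float, b1: float,
               target_ms: float, max_shares: int = 1000) -> Optional[int]:
    """Smallest integer S_cpu with a1 * a0 * S_cpu**b0 + b1 < target_ms, or None."""
    if a0 <= 0 or a1 <= 0 or b0 >= 0:
        raise ValueError("sketch assumes a0 > 0, a1 > 0 and b0 < 0")
    t_db_max = (target_ms - b1) / a1   # database time allowed by the web target
    if t_db_max <= 0:
        return None                    # target is below the fixed web overhead b1
    boundary = (t_db_max / a0) ** (1.0 / b0)  # shares at which T_web equals the target
    shares = max(1, math.floor(boundary) + 1) # step past the boundary for a strict "<"
    return shares if shares <= max_shares else None
```

Returning `None` makes an infeasible target explicit, so a caller can fall back to the maximum allocation or report the problem instead of silently missing the target.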
System Evaluations
[Figure: test clients drive two services (Web server 1 VM with DB Server 1 VM, Web server 2 VM with DB server 2 VM) placed on Host 1, Host 2, and Host 3 with a NAS; sensors and an actuator run on the hosts]
Static Workload
[Figure: two panels of service response time (ms, 0 to 8000) over time steps 1 to 191. Series: web1, web1-target, web2, web2-target]
Dynamic Workload
[Figure: two panels (one labeled “Web1 Load”) of service response time (ms, 0 to 7000) over time steps 1 to 1601. Series: web1, web1-target, web2, web2-target, and 50-period moving averages of web1 and web2]
Deviation from Target
                          Without Control                         With Control
Load     Instance         Mean Deviation  Target Violation        Mean Deviation  Target Violation
                          (ms)            Period                  (ms)            Period
Static   Instance 1       540             0%                      109 (-80%)      9%
Static   Instance 2       1043            100%                    282 (-73%)      26%
Dynamic  Instance 1       276             7%                      182 (-34%)      9%
Dynamic  Instance 2       354             27%                     336 (-5%)       12%
Our control system helps track the target service response time
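The slides do not define the two metrics in this table. One plausible reading, used in the sketch below, is that mean deviation is the average absolute difference between measured response time and the target, and that the violation period is the fraction of samples above the target. Both definitions and the function name `deviation_metrics` are assumptions.

```python
from typing import Sequence, Tuple


def deviation_metrics(responses_ms: Sequence[float], target_ms: float) -> Tuple[float, float]:
    """Return (mean absolute deviation in ms, fraction of samples above target)."""
    n = len(responses_ms)
    mean_dev = sum(abs(r - target_ms) for r in responses_ms) / n
    violation = sum(1 for r in responses_ms if r > target_ms) / n
    return mean_dev, violation
```

Under this reading a run can show a large mean deviation with a 0% violation period when responses sit well below the target.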
Further Improvements
• Enhance Service Model
– Service dependencies
• Cache / proxy / load balancer effects
– Non-CPU resources (disk / network / memory)
• Robust Control System
– Still susceptible to temporal system effects (garbage collections / cron jobs)
– Need to determine feasibility of the target performance
Conclusions
• Automated Resource Control system
– Maintains the target system’s response time
– Allows admins to express allocation in terms of performance targets
• Managing service performance can be automated with sufficient domain knowledge
– Need a basic framework to describe the performance model
– Leave the details to the machines (parameter fitting / optimization / resource monitoring)
Sample Response & Share values
[Figure: two panels over time steps 1 to 351. Top: tWeb p95 against target (0 to 700). Bottom: shares (0 to 500)]