  1. Improving Scalability of Xen: the 3,000 domains experiment Wei Liu <wei.liu2@citrix.com>

  2. Xen: the gears of the cloud
     ● large user base: estimated at more than 10 million individual users
     ● powers the largest clouds in production
     ● not just for servers

  3. Xen: Open Source
     ● GPLv2 with DCO (like Linux)
     ● diverse contributor community (source: Mike Day, http://code.ncultra.org)

  4. Xen architecture: PV guests
     [diagram: Dom0 holds the PV backends and hardware drivers; each DomU runs PV frontends; all domains run on Xen, which runs on the hardware]

  5. Xen architecture: PV protocol
     [diagram: a shared ring connects frontend and backend; the frontend produces requests and consumes responses, the backend consumes requests and produces responses, and an event channel is used for notification]
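
     A minimal sketch in C of the kind of shared ring this protocol uses. It is illustrative only, assuming a simplified single-page ring with separate request and response arrays; the real interface lives in Xen's public ring.h headers, and the names here (shared_ring, frontend_send, backend_poll) are made up for the example.

        /* Simplified PV split-driver ring: the frontend produces requests
         * and consumes responses, the backend does the opposite; an event
         * channel notification tells the other end there is work to do.   */
        #include <stdint.h>

        #define RING_SIZE   32                    /* power of two          */
        #define RING_IDX(i) ((i) & (RING_SIZE - 1))

        struct request  { uint64_t id; /* operation-specific fields */ };
        struct response { uint64_t id; int16_t status; };

        /* One shared page mapped by both frontend and backend. */
        struct shared_ring {
            uint32_t req_prod;                    /* written by frontend   */
            uint32_t rsp_prod;                    /* written by backend    */
            struct request  req[RING_SIZE];
            struct response rsp[RING_SIZE];
        };

        /* Frontend: queue a request, then kick the event channel. */
        static void frontend_send(struct shared_ring *ring,
                                  const struct request *r,
                                  void (*notify_backend)(void))
        {
            ring->req[RING_IDX(ring->req_prod)] = *r;
            __sync_synchronize();                 /* publish the data first */
            ring->req_prod++;
            notify_backend();
        }

        /* Backend: consume every request not yet seen (req_cons is the
         * backend's private consumer index). */
        static void backend_poll(struct shared_ring *ring, uint32_t *req_cons,
                                 void (*handle)(const struct request *))
        {
            while (*req_cons != ring->req_prod) {
                handle(&ring->req[RING_IDX(*req_cons)]);
                (*req_cons)++;
            }
        }

     Responses travel the same way in the opposite direction, with the frontend keeping its own private response consumer index.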

  6. Xen architecture: driver domains
     [diagram: Dom0 runs the toolstack; a disk driver domain runs the disk driver and BlockBack; a network driver domain runs the network driver and NetBack; the DomU's BlockFront and NetFront connect to them over Xen]

  7. Xen architecture: HVM guests
     [diagram: Dom0 holds the PV backends and hardware drivers; IO emulation for one HVM DomU is done by QEMU in Dom0, and for another by QEMU in a stub domain; HVM DomUs can also use PV frontends]

  8. Xen architecture: PVHVM guests
     [diagram: Dom0 holds the PV backends and hardware drivers; PVHVM DomUs talk to them directly through PV frontends; all run on Xen over the hardware]

  9. Xen scalability: current status
     Xen 4.2:
     ● up to 5TB host memory (64-bit)
     ● up to 4095 host CPUs (64-bit)
     ● up to 512 VCPUs per PV VM
     ● up to 256 VCPUs per HVM VM
     ● event channels:
       ○ 1024 for 32-bit domains
       ○ 4096 for 64-bit domains

  10. Xen scalability: current status
     Typical PV / PVHVM DomU:
     ● 256MB to 240GB of RAM
     ● 1 to 16 virtual CPUs
     ● at least 4 inter-domain event channels:
       ○ xenstore
       ○ console
       ○ virtual network interface (vif)
       ○ virtual block device (vbd)

  11. Xen scalability: current status
     From a backend domain's (Dom0 / driver domain) point of view:
     ● IPIs, PIRQs and VIRQs: their count scales with the number of CPUs and devices; a typical Dom0 uses 20 to ~200 event channels for these
     ● that leaves fewer than 1024 supported guests for a 64-bit backend domain, and even fewer for a 32-bit one (rough arithmetic below)
     ● 1K still sounds like a lot, right?
       ○ enough for the normal use case
       ○ not ideal for OpenMirage (OCaml on Xen) and other similar projects
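
     The "fewer than 1024 guests" figure is just the remaining channels divided by the per-guest minimum. A rough check in C using the slide's own numbers (the 200 fixed channels are the upper end of the 20 to ~200 estimate):

        /* Rough upper bound on guests per 64-bit backend domain under the
         * current (2-level) ABI, using numbers from the slides. */
        #include <stdio.h>

        int main(void)
        {
            int total     = 4096;  /* 64-bit event channel limit          */
            int fixed     = 200;   /* IPIs, PIRQs, VIRQs (upper estimate) */
            int per_guest = 4;     /* xenstore, console, vif, vbd         */

            printf("max guests ~ %d\n", (total - fixed) / per_guest); /* 974 */
            return 0;
        }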

  12. Start of the story
     ● effort to run 1,000 DomUs (modified Mini-OS) on a single host *
     ● want more? How about 3,000 DomUs?
       ○ definitely hits the event channel limit
       ○ toolstack limits
       ○ backend limits
       ○ open-ended question: is it practical to do so?
     * http://lists.xen.org/archives/html/xen-users/2012-12/msg00069.html

  13. Toolstack limit
     xenconsoled and cxenstored both use select(2):
     ● xenconsoled: not very critical and can be restarted
     ● cxenstored: critical to Xen and cannot be shut down without losing information
     ● oxenstored: uses libev, so it has no such problem
     Fix: switch from select(2) to poll(2), and implement poll(2) for Mini-OS (see the sketch below).
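
     select(2) tracks descriptors through fd_set, a fixed-size bitmap capped at FD_SETSIZE (typically 1024), so a daemon built on it cannot watch thousands of console or xenstore connections; poll(2) takes an arbitrarily long array instead. A minimal sketch of a poll-based loop, illustrative only and not the actual xenconsoled or cxenstored code (handle_ready and event_loop are made-up names):

        #include <poll.h>
        #include <stdio.h>

        static void handle_ready(int fd)         /* placeholder handler    */
        {
            printf("fd %d ready\n", fd);
        }

        static void event_loop(const int *fds, int count)
        {
            struct pollfd pfd[count];            /* no FD_SETSIZE cap here */

            for (int i = 0; i < count; i++) {
                pfd[i].fd = fds[i];
                pfd[i].events = POLLIN;
            }

            for (;;) {
                int n = poll(pfd, count, -1);    /* block until activity   */
                if (n < 0)
                    break;
                for (int i = 0; i < count && n > 0; i++) {
                    if (pfd[i].revents & POLLIN) {
                        handle_ready(pfd[i].fd);
                        n--;
                    }
                }
            }
        }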

  14. Event channel limit
     Identified as a key feature for the 4.3 release. Two designs have come up so far:
     ● 3-level event channel ABI
     ● FIFO event channel ABI

  15. 3-level ABI
     Motivation: aimed at the 4.3 timeframe
     ● an extension of the default 2-level ABI, hence the name
     ● started in Dec 2012
     ● V5 draft posted in Mar 2013
     ● almost ready

  16. Default (2-level) ABI
     [diagram: a per-CPU upcall pending flag and a one-word per-CPU selector; each set selector bit points at one word of the shared pending bitmap]
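
     Each set bit in the per-CPU selector word marks one word of the shared bitmap that may have pending events, which is why the limit is the word size squared (32 * 32 = 1024, 64 * 64 = 4096). A simplified sketch of the lookup in C; it is illustrative, and the real guest code also honours the mask bits and clears selector bits atomically:

        #include <stdint.h>

        #define BITS_PER_WORD 64
        typedef uint64_t word_t;

        /* Walk the selector, then each bitmap word it points at, and call
         * handle() with every pending event channel port. */
        static void scan_pending(word_t selector,
                                 const word_t bitmap[BITS_PER_WORD],
                                 void (*handle)(unsigned port))
        {
            while (selector) {
                unsigned w = __builtin_ctzll(selector);  /* pending word   */
                selector &= selector - 1;                /* clear that bit */

                word_t pending = bitmap[w];
                while (pending) {
                    unsigned b = __builtin_ctzll(pending);
                    pending &= pending - 1;
                    handle(w * BITS_PER_WORD + b);       /* port number    */
                }
            }
        }

     The 3-level ABI on the next slide puts one more selector level in front of this walk, cubing the limit.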

  17. 3-level ABI
     [diagram: a per-CPU upcall pending flag, a per-CPU first-level selector and a per-CPU second-level selector, indexing into the shared pending bitmap]

  18. 3-level ABI
     Number of event channels:
     ● 32K for 32-bit guests
     ● 256K for 64-bit guests
     Memory footprint:
     ● 2 bits per event (pending and mask)
     ● 2 / 16 pages for 32 / 64-bit guests
     ● NR_VCPUS pages for the control structures
     Limited to Dom0 and driver domains.
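
     The counts and page figures follow directly from cubing the word size and keeping two bits per event; a quick check of the arithmetic, assuming 4KB pages:

        /* Sanity check of the 3-level ABI figures (assuming 4KB pages). */
        #include <stdio.h>

        int main(void)
        {
            long events32 = 32L * 32 * 32;      /*  32,768 = 32K           */
            long events64 = 64L * 64 * 64;      /* 262,144 = 256K          */

            /* pending + mask bitmaps: 2 bits per event */
            long bytes32 = 2 * events32 / 8;    /*  8 KB                   */
            long bytes64 = 2 * events64 / 8;    /* 64 KB                   */

            printf("32-bit: %ld events, %ld pages\n", events32, bytes32 / 4096);
            printf("64-bit: %ld events, %ld pages\n", events64, bytes64 / 4096);
            return 0;
        }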

  19. 3-level ABI
     ● Pros
       ○ general concepts and race conditions are fairly well understood and tested
       ○ envisioned for Dom0 and driver domains only, small memory footprint
     ● Cons
       ○ lack of priority (inherited from the 2-level design)

  20. FIFO ABI
     Motivation: designed from the ground up, with extra features
     ● design posted in Feb 2013
     ● first prototype posted in Mar 2013
     ● under development, close at hand

  21. FIFO ABI
     [diagram: a shared event array of 32-bit event words; a per-CPU control structure with a selector for picking up the event queue; empty and non-empty queues shown as linked lists of events (only the LINK field is shown)]
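
     A sketch in C of the structures the design describes: a shared array of 32-bit event words, each carrying flag bits plus a LINK field that chains the event onto a queue, and a per-VCPU control block whose selector says which queues are non-empty. The field names and bit positions here are illustrative, not necessarily the final ABI.

        #include <stdint.h>

        typedef uint32_t event_word_t;  /* one per event, in a shared array */

        /* flag bits (positions illustrative) */
        #define EVT_PENDING   (1u << 31)     /* event is pending            */
        #define EVT_MASKED    (1u << 30)     /* delivery is masked          */
        #define EVT_LINKED    (1u << 29)     /* already linked on a queue   */

        /* LINK: index of the next event on the same queue.  A 17-bit LINK
         * field is what bounds the design to 2^17 = 128K events.           */
        #define EVT_LINK_BITS 17
        #define EVT_LINK_MASK ((1u << EVT_LINK_BITS) - 1)

        #define NUM_QUEUES 16                /* one queue per priority      */

        /* Per-VCPU control structure. */
        struct fifo_control_block {
            uint32_t ready;              /* bit N set => queue N non-empty  */
            uint32_t head[NUM_QUEUES];   /* first event word on each queue  */
        };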

  22. FIFO ABI
     Number of event channels:
     ● 128K (2^17) by design
     Memory footprint:
     ● one 32-bit word per event
     ● up to 128 pages per guest
     ● NR_VCPUS pages for the control structures
     The toolstack is used to limit the maximum number of event channels a DomU can have.
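
     The 128-page figure is the same kind of arithmetic as before, assuming 4KB pages:

        /* FIFO ABI event array footprint (assuming 4KB pages). */
        #include <stdio.h>

        int main(void)
        {
            long events = 1L << 17;                      /* 131,072        */
            long bytes  = events * 4;                    /* 4 bytes each   */
            printf("event array: %ld KB = %ld pages\n",
                   bytes / 1024, bytes / 4096);          /* 512 KB, 128    */
            return 0;
        }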

  23. FIFO ABI
     ● Pros
       ○ event priority
       ○ FIFO ordering
     ● Cons
       ○ relatively large memory footprint

  24. Community decision
     ● the scalability issue is not as urgent as we thought
       ○ only OpenMirage has expressed interest in extra event channels
     ● delayed until the 4.4 release
       ○ better to maintain one more ABI than two
       ○ measure both and take one
     ● leaves time to test both designs
       ○ event handling is complex by nature

  25. Back to the story: the 3,000 DomUs experiment

  26. 3,000 Mini-OS
     Hardware spec:
     ● 2 sockets, 4 cores, 16 threads
     ● 24GB RAM
     Software config:
     ● Dom0: 16 VCPUs
     ● Dom0: 4GB RAM
     ● Mini-OS: 1 VCPU
     ● Mini-OS: 4MB RAM
     ● Mini-OS: 2 event channels
     (DEMO)

  27. 3,000 Linux
     Hydra monster hardware spec:
     ● 8 sockets, 80 cores, 160 threads
     ● 512GB RAM
     Software config:
     ● Dom0: 4 VCPUs (pinned)
     ● Dom0: 32GB RAM
     ● DomU: 1 VCPU
     ● DomU: 64MB RAM
     ● DomU: 3 event channels (2 + 1 vif)

  28. Observation
     Domain creation time:
     ● < 500 domains: acceptable
     ● > 800 domains: slow
     ● it took hours to create 3,000 DomUs

  29. Observation
     Backend bottleneck:
     ● network bridge limit in Linux
     ● PV backend driver buffer starvation
     ● I/O speed not acceptable
     ● Linux with 4GB RAM can only allocate ~45k event channels due to memory limitations

  30. Observation
     CPU starvation:
     ● density too high: 1 PCPU vs ~20 VCPUs
     ● backend domain starvation
     ● should dedicate PCPUs to critical service domains

  31. Summary
     Thousands of domains: doable but not very practical at the moment
     ● hypervisor and toolstack
       ○ speed up creation
     ● hardware bottleneck
       ○ VCPU density
       ○ network / disk I/O
     ● Linux PV backend drivers
       ○ buffer size
       ○ processing model

  32. Beyond?
     A possible practical way to run thousands of domains: disaggregation. Offload services to dedicated domains and trust the Xen scheduler.

  33. Happy hacking and have fun! Q&A

  34. Acknowledgement
     Pictures used in the slides:
     ● thumbs up: http://primary3.tv/blog/uncategorized/cal-state-university-northridge-thumbs-up/
     ● hydra: http://www.pantheon.org/areas/gallery/mythology/europe/greek_people/hydra.html
