Stratus: Clouds with Microarchitectural Resource Management
Kaveh Razavi and Animesh Trivedi
Once Upon a Time in the Cloud
Large: 4 cores, 16 GB
Medium: 2 cores, 8 GB
Small: 1 core, 4 GB
[Figure: two tenants issue allocate(small) and allocate(medium) to a cloud provider; both VMs land on one host with CPU, Shared L3 Cache, and DRAM]
What could possibly go wrong here when two tenants share the L3 cache?
Problems with (Unsupervised) Sharing
[Figure: CPU with a Shared L3 Cache]
(a) Performance: dCat (EuroSys’18): 57.6% improvement for Redis with noisy neighbours
(b) Security: NetCAT (S&P’20): detect activity of another tenant over the network
Cong Xu, et al. DCat: dynamic cache management for efficient, performance-sensitive infrastructure-as-a-service. ACM EuroSys 2018.
Kurth, M.; Gras, B.; Andriesse, D.; Giuffrida, C.; Bos, H.; and Razavi, K. NetCAT: Practical Cache Attacks from the Network. In S&P, 2020.
Are these two examples L3 Cache specific?
New Classes of Microarchitectural Resources
Resource: Microarchitectural Resources
CPU: Caches, TLBs, Hyperthreads, ALUs
Smart NICs: Caches (memory, requests, connection), TLBs, RMT pipelines, DMA engines
NVM Storage: Blocks, pages, internal r/w ports, programmable cores, SRAM
GPUs: Memories, caches, execution units
In-Network Switches: SRAM and TCAM memories, Match-action Unit processors, ALUs
And more: TPUs, FPGAs, near-memory compute elements …
What is happening
- Diverse microarchitectural resources are here to stay
- They have security and performance ramifications
- The root cause is unsupervised sharing (or lack of isolation)
Key challenge
How to manage microarchitectural resources in a principled manner?
Stratus: Clouds with Principled Microarchitectural Resource Management
Key property: isolation
- A cloud resource allocation framework
- Security and performance requirements are the two sides of isolation
- Captures and reasons about microarchitectural resource isolation
Q1: How to capture isolation?
Q2: How to reason about isolation?
Q3: How to charge for isolation?
Capturing Isolation: A Declarative Interface
Isolation captured as constraints on resource allocations
handle = ISOLATE (resource, scale, quantity);
- resource: hard (discrete allocation: LLC slots, ALUs, TLBs) or soft (contended in time: DRAM bandwidth)
- scale: extent of isolation requested {0,1}
- quantity: number of resources for which this constraint must be satisfied
ATTACH (handle1, handle2, ...);
Multiple constraints can be “attached” (AND) by their handles
(also labeled grouped constraints -- see the paper)
ALLOCATE cloud_resource, .. where constraints
Pass multiple microarchitectural constraints during cloud resource allocations
E.g., mitigating NetCAT:
ALLOCATE small where
  h1 = ISOLATE (CPU.LLC, 1.0, 64), // the first 64 lines
  h2 = ISOLATE (NIC.*, *, ...), // all NIC-level uarch
  ATTACH (h1, h2); // both for small VM allocations
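The ISOLATE / ATTACH / ALLOCATE interface above could be modelled roughly as follows. This is a minimal Python sketch, not the Stratus implementation: the names `Constraint`, `Request`, `isolate`, `attach`, and `allocate` are hypothetical, and `None` is used here as an assumed stand-in for the slide's wildcard scale (`*`) and elided quantity (`...`), which the slides do not spell out.

```python
from dataclasses import dataclass
from itertools import count
from typing import Optional, Tuple

_next_handle = count(1)  # hypothetical handle generator


@dataclass(frozen=True)
class Constraint:
    handle: int
    resource: str              # e.g. "CPU.LLC" or "NIC.*"
    scale: Optional[float]     # extent of isolation; None stands in for the wildcard "*"
    quantity: Optional[int]    # number of resource units; None means left unspecified


@dataclass(frozen=True)
class Request:
    cloud_resource: str                    # e.g. "small"
    constraints: Tuple[Constraint, ...]    # all must hold (AND)


def isolate(resource: str, scale: Optional[float], quantity: Optional[int]) -> Constraint:
    """Declare one isolation constraint and return it with a fresh handle."""
    if scale is not None and not 0.0 <= scale <= 1.0:
        raise ValueError("scale must lie between 0 and 1")
    return Constraint(next(_next_handle), resource, scale, quantity)


def attach(*constraints: Constraint) -> Tuple[Constraint, ...]:
    """Group constraints so that all of them must be satisfied together."""
    return tuple(constraints)


def allocate(cloud_resource: str, constraints: Tuple[Constraint, ...]) -> Request:
    """Build an allocation request that carries its isolation constraints."""
    return Request(cloud_resource, constraints)


# The NetCAT mitigation example from the slide, expressed with this sketch.
h1 = isolate("CPU.LLC", 1.0, 64)    # the first 64 lines
h2 = isolate("NIC.*", None, None)   # all NIC-level uarch (wildcard scale, quantity elided)
request = allocate("small", attach(h1, h2))
```

The request only declares what the tenant wants; deciding whether and where it can be satisfied is left to the provider.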
Building a Reasoning Framework
Cloud Knowledge Base (CKB)
- Similar in spirit to the SKB in Barrelfish OS (SOSP’09)
- Structured representation of knowledge in one place
- Can be queried
Resources
- Topology, numbers, types
- Datasheet information
Online measurements
- Utilization, spare
- Occupancy
[Figure: CKB, tenant’s constraints (CNF format), allocation strategy]
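One way to picture querying a knowledge base with constraints in CNF is the sketch below. It is an illustration under assumptions rather than the CKB design: the fact names (`CPU.LLC.free_ways`, `NIC.free_queues`), the host names, and the helpers `satisfies` and `candidate_hosts` are all hypothetical. A CNF formula is modelled as a list of clauses (AND), each clause being a list of literals (OR), and each literal asks for a minimum number of free units of one resource.

```python
from typing import Dict, List, Tuple

Literal = Tuple[str, int]   # (fact name, minimum free units required)
Clause = List[Literal]      # OR of literals
CNF = List[Clause]          # AND of clauses

# Hypothetical knowledge base: per-host facts merged from datasheets
# and online measurements.
ckb: Dict[str, Dict[str, int]] = {
    "host-a": {"CPU.LLC.free_ways": 2, "NIC.free_queues": 0},
    "host-b": {"CPU.LLC.free_ways": 8, "NIC.free_queues": 4},
}


def satisfies(facts: Dict[str, int], cnf: CNF) -> bool:
    """A host satisfies the formula if every clause has at least one true literal."""
    return all(
        any(facts.get(name, 0) >= need for name, need in clause)
        for clause in cnf
    )


def candidate_hosts(kb: Dict[str, Dict[str, int]], cnf: CNF) -> List[str]:
    """Query the knowledge base for hosts that can honour the constraints."""
    return sorted(host for host, facts in kb.items() if satisfies(facts, cnf))


# "At least 4 free LLC ways AND at least 1 free NIC queue."
tenant_cnf: CNF = [[("CPU.LLC.free_ways", 4)], [("NIC.free_queues", 1)]]
print(candidate_hosts(ckb, tenant_cnf))  # ['host-b']
```

An allocation strategy would then pick among the candidate hosts; how that choice is made is not covered by this sketch.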
Charging for Isolation: Isolation Credits
Isolation spectrum
Low isolation
+ High utilization
+ Better efficiency
- Low perf/security
High isolation
+ Low interference
+ Better perf/security
- Low utilization
Isolation credit: a currency to capture this tradeoff
- Encourages tenants to only specify relevant constraints
- Encourages providers to innovate in better isolation mechanisms
[Figure: a tenant who knows its constraints sends them to the Stratus cloud and is quoted 42 credits; a tenant with a budget sends credit budget = 42 and receives constraints]
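The two directions in the figure (constraints in, credits out; budget in, constraints out) can be sketched as below. The slides do not define how credits are priced, so everything here is an assumption made for illustration: the `UNIT_PRICE` table, the linear cost `price * scale * quantity`, and the greedy `fit_budget` helper are hypothetical.

```python
from typing import Dict, List, Tuple

Constraint = Tuple[str, float, int]   # (resource, scale, quantity)

# Hypothetical per-unit prices in isolation credits.
UNIT_PRICE: Dict[str, float] = {"CPU.LLC": 0.5, "DRAM.bandwidth": 2.0}


def cost(constraint: Constraint) -> float:
    """Assumed linear pricing: unit price times scale times quantity."""
    resource, scale, quantity = constraint
    return UNIT_PRICE[resource] * scale * quantity


def quote(constraints: List[Constraint]) -> float:
    """Tenant who knows its constraints: return the total credits charged."""
    return sum(cost(c) for c in constraints)


def fit_budget(wishlist: List[Constraint], budget: float) -> List[Constraint]:
    """Tenant with a budget: keep constraints, in priority order, while affordable."""
    chosen: List[Constraint] = []
    spent = 0.0
    for constraint in wishlist:
        price = cost(constraint)
        if spent + price <= budget:
            chosen.append(constraint)
            spent += price
    return chosen


wishlist = [
    ("CPU.LLC", 1.0, 64),         # 32 credits under the assumed prices
    ("DRAM.bandwidth", 0.5, 10),  # 10 credits
    ("DRAM.bandwidth", 1.0, 4),   # 8 credits
]
print(quote(wishlist))            # 50.0
print(fit_budget(wishlist, 42))   # the first two constraints, costing exactly 42
```

Stronger isolation costs more under any such scheme, which is what nudges tenants toward specifying only the constraints they actually need.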
Summary: Stratus
Managing microarchitectural resources in a principled manner
Three key ideas:
1. A declarative interface
2. A Cloud Knowledge Base (CKB)
3. Isolation Credits
Challenges and Discussion Points
Enforcing isolation
- Mechanisms
- Policies
Right constraints
- Profile guided
- Security libraries
Scalability
- CKB
- O(1-10ms)