Stratus: Clouds with Microarchitectural Resource Management

SLIDE 1

Stratus: Clouds with Microarchitectural Resource Management

Kaveh Razavi and Animesh Trivedi

SLIDE 2

Once Upon a Time in the Cloud

  • Large: 4 cores, 16 GB
  • Medium: 2 cores, 8 GB
  • Small: 1 core, 4 GB

SLIDE 3

Once Upon a Time in the Cloud

  • Large: 4 cores, 16 GB
  • Medium: 2 cores, 8 GB
  • Small: 1 core, 4 GB

[Diagram: a tenant sends allocate(small) to the cloud provider; the host is drawn with CPU and DRAM]

SLIDE 5

Once Upon a Time in the Cloud

  • Large: 4 cores, 16 GB
  • Medium: 2 cores, 8 GB
  • Small: 1 core, 4 GB

[Diagram: two tenants send allocate(small) and allocate(medium) to the cloud provider; the host is drawn with CPU and DRAM]

SLIDE 6

Once Upon a Time in the Cloud

  • Large: 4 cores, 16 GB
  • Medium: 2 cores, 8 GB
  • Small: 1 core, 4 GB

[Diagram: the requests allocate(small) and allocate(medium) reach the cloud provider; the host is drawn with CPU, Shared L3 Cache, and DRAM]

What could possibly go wrong here when two tenants share the L3 cache?

SLIDE 7

Problems with (Unsupervised) Sharing

[Diagram: CPU with Shared L3 Cache]

(a) Performance: dCat (EuroSys’18): 57.6% improvement for Redis with noisy neighbours
(b) Security: NetCAT (S&P’20): detect activity of another tenant over the network

Cong Xu, et al. DCat: dynamic cache management for efficient, performance-sensitive infrastructure-as-a-service. ACM EuroSys 2018.

SLIDE 9

Problems with (Unsupervised) Sharing

[Diagram: CPU with Shared L3 Cache]

(a) Performance: dCat (EuroSys’18): 57.6% improvement for Redis with noisy neighbours
(b) Security: NetCAT (S&P’20): detect activity of another tenant over the network

Are these two examples L3 cache specific?

Cong Xu, et al. DCat: dynamic cache management for efficient, performance-sensitive infrastructure-as-a-service. ACM EuroSys 2018.
M. Kurth, B. Gras, D. Andriesse, C. Giuffrida, H. Bos, and K. Razavi. NetCAT: Practical Cache Attacks from the Network. S&P 2020.

SLIDE 10

New Classes of Microarchitectural Resources

Resource: microarchitectural resources

  • CPU: caches, TLBs, hyperthreads, ALUs
  • Smart NICs: caches (memory, requests, connection), TLBs, RMT pipelines, DMA engines
  • NVM storage: blocks, pages, internal r/w ports, programmable cores, SRAM
  • GPUs: memories, caches, execution units
  • In-network switches: SRAM and TCAM memories, match-action unit processors, ALUs

And more: TPUs, FPGAs, near-memory compute elements …

SLIDE 11

What is happening

  • Diverse microarchitectural resources are here to stay
  • They have security and performance ramifications
  • The root cause is unsupervised sharing (or lack of isolation)

Key challenge

How to manage microarchitectural resources in a principled manner?

SLIDE 12

Stratus: Clouds with Principled Microarchitectural Resource Management

Key property: isolation

SLIDE 13

Stratus: Clouds with Principled Microarchitectural Resource Management

  • A cloud resource allocation framework
  • Security and performance requirements are the two sides of isolation
  • Captures and reasons about microarchitectural resource isolation

SLIDE 14

Stratus: Clouds with Principled Microarchitectural Resource Management

  • Q1: How to capture isolation?
  • Q2: How to reason about isolation?
  • Q3: How to charge for isolation?

SLIDE 16

Capturing Isolation: A Declarative Interface

Isolation captured as constraints on resource allocations

handle = ISOLATE (resource, scale, quantity);

SLIDE 17

Capturing Isolation: A Declarative Interface

Isolation captured as constraints on resource allocations

handle = ISOLATE (resource, scale, quantity);

  • resource: hard (discrete allocation: LLC slots, ALUs, TLBs) or soft (contended in time: DRAM bandwidth)
  • scale: extent of isolation requested {0,1}
  • quantity: number of resources for which this constraint must be satisfied
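
The three ISOLATE arguments can be pictured as a small record behind an opaque handle. The following is a minimal sketch, assuming Python as the host language; the `Constraint` class, the `isolate` and `lookup` functions, and the integer handle scheme are hypothetical illustrations rather than the authors' implementation, and the scale is assumed here to lie between 0 and 1.

```python
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class Constraint:
    """One isolation constraint (hypothetical model of an ISOLATE call)."""
    resource: str   # e.g. "CPU.LLC"; naming follows the slide's later example
    scale: float    # extent of isolation requested; assumed to be in [0, 1]
    quantity: int   # number of resource units the constraint must cover


_next_handle = count(1)   # hypothetical handle generator
_constraints = {}         # handle -> Constraint


def isolate(resource: str, scale: float, quantity: int) -> int:
    """Record a constraint and return an opaque handle for it."""
    if not 0.0 <= scale <= 1.0:
        raise ValueError("scale must be between 0 and 1")
    if quantity <= 0:
        raise ValueError("quantity must be positive")
    handle = next(_next_handle)
    _constraints[handle] = Constraint(resource, scale, quantity)
    return handle


def lookup(handle: int) -> Constraint:
    """Return the constraint a handle refers to."""
    return _constraints[handle]
```

Returning a handle instead of the record itself keeps the call declarative: the tenant states what it wants, and the provider is free to decide later how the constraint is represented or enforced.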

SLIDE 18

Capturing Isolation: A Declarative Interface

handle = ISOLATE (resource, scale, quantity);
ATTACH (handle1, handle2, ...);

Multiple constraints can be “attached” (AND) by their handles

SLIDE 19

Capturing Isolation: A Declarative Interface

handle = ISOLATE (resource, scale, quantity);
ATTACH (handle1, handle2, ...);
ALLOCATE cloud_resource, .. where constraints

Pass multiple microarchitectural constraints during cloud resource allocations
(also labeled grouped constraints; see the paper)

SLIDE 20

Capturing Isolation: A Declarative Interface

handle = ISOLATE (resource, scale, quantity);
ATTACH (handle1, handle2, ...);
ALLOCATE cloud_resource, .. where constraints

E.g., mitigating NetCAT:

ALLOCATE small where
    h1 = ISOLATE (CPU.LLC, 1.0, 64),  // the first 64 lines
    h2 = ISOLATE (NIC.*, *, ...),     // all NIC-level uarch
    ATTACH (h1, h2);                  // both for small VM allocations
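
The NetCAT request can also be written out as plain data, which makes the AND-grouping explicit. This is a sketch only, assuming Python; the `Isolate` and `Request` classes, the `attach` method, and `netcat_mitigation` are hypothetical names, the constraint objects stand in for handles, and because the slide elides the third argument of the NIC constraint, the sketch passes `None` to mean "unspecified" instead of guessing a value.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union

WILDCARD = "*"  # stands in for the slide's "*" argument


@dataclass(frozen=True)
class Isolate:
    """Hypothetical stand-in for one ISOLATE (resource, scale, quantity) call."""
    resource: str
    scale: Union[float, str]
    quantity: Optional[int]   # None: left unspecified, as on the slide


@dataclass
class Request:
    """Hypothetical stand-in for ALLOCATE cloud_resource where constraints."""
    cloud_resource: str
    groups: List[List[Isolate]] = field(default_factory=list)

    def attach(self, *constraints: Isolate) -> None:
        """Group constraints so that all of them must hold together (AND)."""
        self.groups.append(list(constraints))


def netcat_mitigation() -> Request:
    req = Request("small")
    h1 = Isolate("CPU.LLC", 1.0, 64)        # the first 64 lines
    h2 = Isolate("NIC.*", WILDCARD, None)   # all NIC-level uarch
    req.attach(h1, h2)                      # both for small VM allocations
    return req
```

Calling `netcat_mitigation()` yields a request for a `small` instance with one attached group holding both constraints.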

SLIDE 21

Stratus: Clouds with Principled Microarchitectural Resource Management

  • Q1: How to capture isolation?
  • Q2: How to reason about isolation?
  • Q3: How to charge for isolation?

SLIDE 22

Building a Reasoning Framework

CKB: Cloud Knowledge Base

Resources

  • Topology, numbers, types
  • Datasheet information

Online measurements

  • Utilization, spare
  • Occupancy

  • Similar in spirit to the SKB in Barrelfish OS (SOSP’09)
  • Structured representation of knowledge in one place
  • Can be queried

SLIDE 23

Building a Reasoning Framework

CKB: Cloud Knowledge Base

Resources

  • Topology, numbers, types
  • Datasheet information

Online measurements

  • Utilization, spare
  • Occupancy

Tenant’s constraints (CNF format)

Allocation strategy
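
To make the CNF idea concrete, the sketch below checks a tenant's constraints against per-host facts of the kind a knowledge base might hold. It assumes Python; the fact names (`llc_ways_free`, `dram_bw_spare`, `smart_nic`), the host names, and the first-match placement rule are all hypothetical, and the slides do not say how the CKB is actually queried.

```python
from typing import Callable, Dict, List, Optional

Host = Dict[str, float]              # facts the knowledge base holds about one host
Predicate = Callable[[Host], bool]   # one testable condition on a host
Clause = List[Predicate]             # predicates joined by OR
CNF = List[Clause]                   # clauses joined by AND


def satisfies(host: Host, cnf: CNF) -> bool:
    """A host satisfies a CNF if every clause has at least one true predicate."""
    return all(any(pred(host) for pred in clause) for clause in cnf)


def pick_host(ckb: Dict[str, Host], cnf: CNF) -> Optional[str]:
    """Return the first host (in name order) whose facts satisfy the CNF."""
    for name in sorted(ckb):
        if satisfies(ckb[name], cnf):
            return name
    return None


# Hypothetical knowledge base: static topology plus online measurements.
ckb = {
    "host-a": {"llc_ways_free": 2, "dram_bw_spare": 0.10, "smart_nic": 1},
    "host-b": {"llc_ways_free": 8, "dram_bw_spare": 0.60, "smart_nic": 0},
}

# (at least 4 free LLC ways) AND (spare DRAM bandwidth >= 50% OR a smart NIC)
constraints = [
    [lambda h: h["llc_ways_free"] >= 4],
    [lambda h: h["dram_bw_spare"] >= 0.5, lambda h: h["smart_nic"] == 1],
]
```

With these made-up facts, `pick_host(ckb, constraints)` selects `host-b`, since `host-a` fails the first clause. A real allocation strategy would presumably rank candidates rather than take the first match.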

SLIDE 24

Stratus: Clouds with Principled Microarchitectural Resource Management

  • Q1: How to capture isolation?
  • Q2: How to reason about isolation?
  • Q3: How to charge for isolation?

SLIDE 25

Charging for Isolation: Isolation Credits

Isolation spectrum

Low isolation
  + High utilization
  + Better efficiency
  - Low perf/security

High isolation
  + Low interference
  + Better perf/security
  - Low utilization

SLIDE 26

Charging for Isolation: Isolation Credits

Isolation spectrum

Low isolation
  + High utilization
  + Better efficiency
  - Low perf/security

High isolation
  + Low interference
  + Better perf/security
  - Low utilization

Isolation credit: a currency to capture this tradeoff

  • Encourages tenants to only specify relevant constraints
  • Encourages providers to innovate in better isolation mechanisms

SLIDE 27

Charging for Isolation: Isolation Credits

Isolation spectrum

[Diagram: a tenant who knows its constraints sends them to the Stratus cloud and is quoted 42 credits; a tenant with a budget sends credit budget = 42 and gets constraints back]

Low isolation
  + High utilization
  + Better efficiency
  - Low perf/security

High isolation
  + Low interference
  + Better perf/security
  - Low utilization
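
The two tenant roles on this slide can be illustrated with a toy pricing model. Everything in the sketch is an assumption made for illustration: the per-unit prices, the `price = unit price × scale × quantity` rule, and the function names `quote` and `fit_budget` do not come from the slides, which leave the pricing scheme unspecified.

```python
from typing import Dict, List, Tuple

# Hypothetical per-unit prices in credits; a provider would set the real ones.
UNIT_PRICE: Dict[str, float] = {
    "CPU.LLC": 0.5,
    "CPU.TLB": 0.25,
    "DRAM.bandwidth": 10.0,
}

Constraint = Tuple[str, float, int]   # (resource, scale, quantity)


def price(constraint: Constraint) -> float:
    """Assumed pricing rule: unit price times scale times quantity."""
    resource, scale, quantity = constraint
    return UNIT_PRICE[resource] * scale * quantity


def quote(constraints: List[Constraint]) -> float:
    """Tenant who knows its constraints: how many credits do they cost?"""
    return sum(price(c) for c in constraints)


def fit_budget(wishlist: List[Constraint], budget: float) -> List[Constraint]:
    """Tenant with a budget: keep constraints, in priority order, while affordable."""
    chosen, spent = [], 0.0
    for c in wishlist:
        cost = price(c)
        if spent + cost <= budget:
            chosen.append(c)
            spent += cost
    return chosen


wishlist = [("CPU.LLC", 1.0, 64), ("DRAM.bandwidth", 1.0, 1), ("CPU.TLB", 1.0, 8)]
```

Under these made-up prices the full wishlist costs 44 credits, so a tenant with a budget of 42 keeps the first two constraints and drops the TLB one. The greedy pass is only one possible policy; it is used here because it keeps the example short.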

SLIDE 28

Summary: Stratus

Managing microarchitectural resources in a principled manner

Three key ideas:

  1. A declarative interface
  2. A Cloud Knowledge Base (CKB)
  3. Isolation Credits

SLIDE 29

Challenges and Discussion Points

Enforcing isolation

  • Mechanisms
  • Policies

Right constraints

  • Profile guided
  • Security libraries

Scalability

  • CKB
  • O(1-10ms)

  • Is microarchitectural resource management really worth it?
  • More efforts on mechanisms or better policies?
  • Can we have better hardware support from vendors?
  • What are we missing from a cloud-operation point of view?