The Case for the Vector Operating System



slide-1
SLIDE 1

The Case for the Vector Operating System

Vijay Vasudevan, David G. Andersen, Michael Kaminsky Carnegie Mellon University and Intel Labs

slide-2
SLIDE 2

A webserver

Req1: accept(...) stat(...) open(f1) fcntl(...) fcntl(...) ...

Req2: accept(...) stat(f2) open(f2) fcntl(...) fcntl(...) ...

slide-4
SLIDE 4

A scalable, parallel webserver

Req1: accept(...) stat(...) open(f1) fcntl(...) fcntl(...) ...

Req2: accept(...) stat(f2) open(f2) fcntl(...) fcntl(...) ...

Req3: accept(...) stat(f3) open(f3) fcntl(...) fcntl(...) ...

slide-5
SLIDE 5

A scalable, parallel webserver

Req1: accept(...) stat(f1) open(f1) ...

Req2: accept(...) stat(f2) open(f2) ...

Req3: accept(...) stat(f3) open(f3) ...

slide-11
SLIDE 11

A scalable, parallel webserver

accept(...) stat(f1) open(f1)
accept(...) stat(f2) open(f2)
accept(...) stat(f3) open(f3)

{ context switch alloc() copy(f1) path_resolve(f1) acl_check(f1) h = hash(f1) lookup(h) read(f1) dealloc() context switch }

{ context switch alloc() copy(f2) path_resolve(f2) acl_check(f2) h = hash(f2) lookup(h) read(f2) dealloc() context switch }

{ context switch alloc() copy(f3) path_resolve(f3) acl_check(f3) h = hash(f3) lookup(h) read(f3) dealloc() context switch }

vec_accept(...) vec_stat([f1,f2,f3])

slide-12
SLIDE 12

A scalable, parallel webserver

accept(...) stat(f1) open(f1)
accept(...) stat(f2) open(f2)
accept(...) stat(f3) open(f3)

vec_accept(...) vec_stat([f1,f2,f3]) vec_open([f1,f2,f3])

{ context switch vec_alloc() vec_copy([f1,f2,f3]) vec_path_resolve([f1,f2,f3]) acl_check([f1,f2,f3]) h = hash([f1,f2,f3]) lookup(h) vec_read([f1,f2,f3]) dealloc() context switch }

slide-14
SLIDE 14

A vectored webserver

vec_accept(...) vec_stat([f1,f2,f3]) vec_open([f1,f2,f3])

{ context switch vec_alloc() vec_copy([f1,f2,f3]) vec_path_resolve([f1,f2,f3]) acl_check([f1,f2,f3]) h = hash([f1,f2,f3]) lookup(h) vec_read([f1,f2,f3]) dealloc() context switch }

Eliminate N-1 context switches
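The saving can be illustrated with a minimal sketch, assuming a toy model in which every system call costs exactly two context switches (one into the kernel, one back out). `FakeKernel`, `stat_one`, and `demo` are hypothetical names introduced here, and `vec_stat` is emulated in user space with `os.stat`; it is not an existing system call.

```python
import os
import tempfile


class FakeKernel:
    """Toy model: counts one switch into the kernel and one back per call."""

    def __init__(self):
        self.context_switches = 0

    def _enter(self):
        self.context_switches += 1

    def _exit(self):
        self.context_switches += 1

    def stat_one(self, path):
        # Scalar call: pays the entry/exit cost for a single path.
        self._enter()
        result = os.stat(path)
        self._exit()
        return result

    def vec_stat(self, paths):
        # Hypothetical vectored call: one entry/exit for the whole batch.
        self._enter()
        results = [os.stat(p) for p in paths]
        self._exit()
        return results


def demo(n=3):
    """Return (scalar switches, vectored switches) for n files."""
    with tempfile.TemporaryDirectory() as d:
        paths = []
        for i in range(n):
            p = os.path.join(d, f"f{i + 1}")
            with open(p, "w") as fh:
                fh.write("x")
            paths.append(p)

        scalar = FakeKernel()
        for p in paths:
            scalar.stat_one(p)

        vectored = FakeKernel()
        vectored.vec_stat(paths)
        return scalar.context_switches, vectored.context_switches
```

Under this model, `demo(3)` counts 6 switches for the scalar path and 2 for the vectored path, so N-1 = 2 kernel round trips are saved.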

slide-15
SLIDE 15

A vectored webserver

vec_accept(...) vec_stat([f1,f2,f3]) vec_open([f1,f2,f3])

{ context switch vec_alloc() vec_copy([f1,f2,f3]) vec_path_resolve([f1,f2,f3]) acl_check([f1,f2,f3]) h = hash([f1,f2,f3]) lookup(h) vec_read([f1,f2,f3]) dealloc() context switch }

Reduce path resolutions
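One way a vectored call might reduce path resolutions is to resolve each shared parent directory only once. The sketch below assumes that interpretation; `resolve_dir` is a hypothetical stand-in that only counts how often a directory walk would happen, and the example paths are invented.

```python
import posixpath


def resolve_dir(directory, counter):
    """Stand-in for an expensive directory walk; only counts invocations."""
    counter["resolutions"] += 1
    return ("dir-handle", directory)


def scalar_resolve(paths):
    """Resolve the parent directory separately for every path."""
    counter = {"resolutions": 0}
    handles = [
        (resolve_dir(posixpath.dirname(p), counter), posixpath.basename(p))
        for p in paths
    ]
    return handles, counter["resolutions"]


def vec_path_resolve(paths):
    """Resolve each distinct parent directory once and reuse the result."""
    counter = {"resolutions": 0}
    cache = {}
    handles = []
    for p in paths:
        directory = posixpath.dirname(p)
        if directory not in cache:
            cache[directory] = resolve_dir(directory, counter)
        handles.append((cache[directory], posixpath.basename(p)))
    return handles, counter["resolutions"]
```

For three files of which two share a directory, the scalar version performs three resolutions and the vectored version two, with identical results.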

slide-16
SLIDE 16

A vectored webserver

vec_accept(...) vec_stat([f1,f2,f3]) vec_open([f1,f2,f3])

{ context switch vec_alloc() vec_copy([f1,f2,f3]) vec_path_resolve([f1,f2,f3]) acl_check([f1,f2,f3]) h = hash([f1,f2,f3]) lookup(h) vec_read([f1,f2,f3]) dealloc() context switch }

Use SSE to hash filenames
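The idea of hashing several filenames in lockstep can be sketched in plain Python, with a list standing in for SIMD lanes. This is only an emulation: a real implementation would use SSE instructions. The hash function (32-bit FNV-1a) is an assumption chosen for brevity, since the slides do not say which hash is used.

```python
FNV_OFFSET = 0x811C9DC5
FNV_PRIME = 0x01000193
MASK32 = 0xFFFFFFFF


def fnv1a(data: bytes) -> int:
    """Scalar 32-bit FNV-1a, used as the reference."""
    h = FNV_OFFSET
    for byte in data:
        h = ((h ^ byte) * FNV_PRIME) & MASK32
    return h


def vec_hash(names):
    """Hash all names in lockstep: one step updates every lane."""
    lanes = [FNV_OFFSET] * len(names)
    longest = max((len(n) for n in names), default=0)
    for i in range(longest):
        # One iteration corresponds to one SIMD step across all lanes;
        # lanes whose name has ended are masked off and keep their value.
        lanes = [
            ((h ^ n[i]) * FNV_PRIME) & MASK32 if i < len(n) else h
            for h, n in zip(lanes, names)
        ]
    return lanes
```

The lockstep version returns the same hashes as the scalar function applied to each name in turn; names of different lengths are handled by leaving finished lanes unchanged.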

slide-17
SLIDE 17

A vectored webserver

vec_accept(...) vec_stat([f1,f2,f3]) vec_open([f1,f2,f3])

{ context switch vec_alloc() vec_copy([f1,f2,f3]) vec_path_resolve([f1,f2,f3]) acl_check([f1,f2,f3]) h = hash([f1,f2,f3]) lookup(h) vec_read([f1,f2,f3]) dealloc() context switch }

Search dentry list once
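A minimal sketch of the single-pass search, assuming a plain Python list of `(name, inode)` pairs as a stand-in for the kernel's directory-entry structures. Both function names and the entry layout are hypothetical.

```python
def scalar_lookup(dentries, names):
    """One scan of the entry list per requested name."""
    visited = 0
    found = {}
    for name in names:
        for entry_name, inode in dentries:
            visited += 1
            if entry_name == name:
                found[name] = inode
                break
    return found, visited


def vec_lookup(dentries, names):
    """A single scan that matches every requested name as it goes."""
    wanted = set(names)
    visited = 0
    found = {}
    for entry_name, inode in dentries:
        if not wanted:
            break  # every requested name has been found
        visited += 1
        if entry_name in wanted:
            found[entry_name] = inode
            wanted.discard(entry_name)
    return found, visited
```

With five entries and three requested names, the scalar version visits 11 entries in this example and the single pass visits 5, while both return the same mapping.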

slide-19
SLIDE 19

VOS core ideas

Known: Batching syscalls improves throughput

๏ Amortizes per-execution cost
๏ Applies regardless of similarity of batched work

“SIMD” vectorization improves efficiency

๏ Eliminates redundant instructions in parallel execution
๏ Frees up resources to allow more work to be done
๏ Enables algorithmic optimizations

One concrete example: mprotect
One difficult challenge: managing divergence
One possible implementation path

slide-26
SLIDE 26

Speeding up memory protection

vec_mprotect techniques:

๏ Amortize context switches (async batching)
๏ Optimized data structure allocation (sorting)
๏ Eliminate TLB flush per individual call

[Figure: bar chart comparing mprotect and vec_mprotect; axis: page protections/sec, ticks at 375000, 750000, 1125000, 1500000; chart annotations: 30%, 170%]

3x performance improvement

Data courtesy of Iulian Moraru
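The three techniques can be sketched together under a simplified model. The assumptions are loud ones: requests are `(start, length, prot)` tuples, ranges do not overlap, sorting is used here to merge contiguous ranges (the slides only say "sorting"), and each scalar call is charged one TLB flush. `scalar_mprotect` and this `vec_mprotect` are hypothetical user-space models, not the measured implementation.

```python
PAGE = 4096


def scalar_mprotect(requests):
    """One call, and in this model one TLB flush, per request."""
    return {"calls": len(requests), "tlb_flushes": len(requests)}


def vec_mprotect(requests):
    """Model a batched call: sort, merge contiguous ranges, flush once."""
    merged = []  # entries are (start, end, prot)
    # Sorting by address places neighbouring ranges next to each other.
    for start, length, prot in sorted(requests):
        end = start + length
        if merged and merged[-1][2] == prot and merged[-1][1] == start:
            # Contiguous range with the same protection: extend the previous one.
            merged[-1] = (merged[-1][0], end, prot)
        else:
            merged.append((start, end, prot))
    # A single flush after all ranges are updated, instead of one per call.
    return {"calls": 1, "tlb_flushes": 1, "ranges": merged}
```

In this model, four single-page requests, three of them contiguous with the same protection, collapse into two ranges handled by one call and one flush.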

slide-30
SLIDE 30

One difficult challenge

Handling convergence and divergence

vec_open([f1,f2,f3])

Path for f1: context switch vec_alloc() vec_copy([f1,f2,f3]) vec_path_resolve([f1]) acl_check([f1]) h = hash([f1,f2,f3]) lookup(h[0]) read([f1]) dealloc() context switch

Path for f2, f3: vec_path_resolve([f2,f3]) acl_check([f2,f3]) lookup(h[1..2]) read([f2,f3])

fork() ? join() ? messages?

Is this worth joining for?
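Divergence and re-convergence can be sketched as a split of the vector into sub-vectors followed by a join that restores the original order. The cause of divergence is an assumption here (a cache hit versus a miss); `slow_path` and the string results are hypothetical placeholders, and this `vec_open` is a user-space model only.

```python
def slow_path(paths):
    """Stand-in for the extra work a cache miss requires (hypothetical)."""
    return [f"opened:{p}" for p in paths]


def vec_open(paths, cache):
    """Split where control flow diverges, process each group, then rejoin."""
    hit_idx = [i for i, p in enumerate(paths) if p in cache]
    miss_idx = [i for i, p in enumerate(paths) if p not in cache]

    results = [None] * len(paths)
    # First sub-vector: entries that take the fast path.
    for i in hit_idx:
        results[i] = cache[paths[i]]
    # Second sub-vector: the diverged entries are still handled as one group.
    opened = slow_path([paths[i] for i in miss_idx])
    for i, value in zip(miss_idx, opened):
        results[i] = value
    # Join: results are back in the caller's original order.
    return results, [hit_idx, miss_idx]
```

With only `f1` cached, the vector `[f1, f2, f3]` splits into `[f1]` and `[f2, f3]`, mirroring the split shown on the slide. Whether rejoining afterwards is worth its cost is the open question the slide raises.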

slide-32
SLIDE 32

OS as staged event system

Ideal interface for vectorization

๏ Use message passing as underlying primitive

Diagram stages: on packet, is_new_connection, accept, process

Programming interface? Event abstraction is nice
Who vectorizes? Static analysis, compiler
OS or App developer?
Efficiency vs. Latency
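A staged event system lends itself to vectorization because each stage can drain its message queue as a batch. The sketch below is one possible shape, not the authors' design: the `Stage` class, the handler signatures, the `syn` flag on packets, and the wiring between the four stages named on the slide are all assumptions.

```python
from collections import deque


class Stage:
    """A stage owns a queue of messages and handles them a batch at a time."""

    def __init__(self, handler, max_batch=8):
        self.handler = handler  # takes a list of messages, returns [(next_stage, msg)]
        self.max_batch = max_batch
        self.queue = deque()
        self.invocations = 0

    def run_once(self, stages):
        if not self.queue:
            return False
        count = min(self.max_batch, len(self.queue))
        batch = [self.queue.popleft() for _ in range(count)]
        self.invocations += 1  # one handler invocation per batch, not per message
        for next_stage, msg in self.handler(batch):
            if next_stage is not None:
                stages[next_stage].queue.append(msg)
        return True


def on_packet(batch):
    return [("is_new_connection", pkt) for pkt in batch]


def is_new_connection(batch):
    # Messages diverge here: each one is routed to one of two stages.
    return [("accept" if pkt["syn"] else "process", pkt) for pkt in batch]


def sink(batch):
    return [(None, pkt) for pkt in batch]


def run(packets):
    stages = {
        "on_packet": Stage(on_packet),
        "is_new_connection": Stage(is_new_connection),
        "accept": Stage(sink),
        "process": Stage(sink),
    }
    stages["on_packet"].queue.extend(packets)
    # Keep scheduling stages until no stage has pending messages.
    while any(s.run_once(stages) for s in list(stages.values())):
        pass
    return {name: s.invocations for name, s in stages.items()}
```

With four packets queued at once, each stage's handler runs once on a batch rather than once per packet. A larger `max_batch` trades latency for efficiency, which is the tension the slide lists last.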

slide-34
SLIDE 34

Summary of VOS

Vectors fundamentally improve efficiency by

๏ Collecting similar requests
๏ Eliminating redundant work
๏ Remaining parallel when code diverges

Challenges

๏ Programming vector abstractions
๏ Identifying what to coalesce and how to diverge

Don’t let embarrassingly parallel become embarrassingly wasteful

slide-35
SLIDE 35

Related ideas

Community   Idea                                             Reason
HPC         Multicollective I/O, readx/writex, group open    I/O coalescing, reduced synch
Databases   Work sharing, query optimization                 Reuse “results”, better I/O sched
OS          FlexSC, Cassyopia, Cosy                          Batching (all) system calls