The Case for the Vector Operating System


  1. The Case for the Vector Operating System. Vijay Vasudevan, David G. Andersen, Michael Kaminsky. Carnegie Mellon University and Intel Labs.

  2.–3. A webserver: each request is handled with the same syscall sequence (shown serially, then side by side).
     Req1: accept(...), stat(...), open(f1), fcntl(...), fcntl(...), ...
     Req2: accept(...), stat(f2), open(f2), fcntl(...), fcntl(...), ...

  4. A scalable, parallel webserver: three requests in flight at once, each running the same sequence.
     Req1: accept(...), stat(...), open(f1), fcntl(...), fcntl(...), ...
     Req2: accept(...), stat(f2), open(f2), fcntl(...), fcntl(...), ...
     Req3: accept(...), stat(f3), open(f3), fcntl(...), fcntl(...), ...

  5.–7. A scalable, parallel webserver (builds): three identical sequences running side by side.
     Req1: accept(...), stat(f1), open(f1), ...
     Req2: accept(...), stat(f2), open(f2), ...
     Req3: accept(...), stat(f3), open(f3), ...
     A minimal sketch of the per-request sequence follows.
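To make the repeated per-request work concrete, here is a minimal sketch in C (an assumption, not code from the talk; the file-serving logic and names are illustrative) of the syscall sequence a conventional worker repeats for every request:

    /* Minimal sketch (assumption, not from the talk): the syscall sequence a
     * conventional server repeats for every request. Each call crosses into
     * the kernel separately, and stat() and open() each resolve the path
     * again. Error handling is omitted for brevity. */
    #include <sys/socket.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <unistd.h>

    void serve_one(int listen_fd, const char *path) {
        int conn = accept(listen_fd, NULL, NULL); /* kernel crossing #1 */
        struct stat st;
        stat(path, &st);                          /* crossing #2: path walk, ACL check */
        int fd = open(path, O_RDONLY);            /* crossing #3: the same path walk again */
        fcntl(conn, F_SETFD, FD_CLOEXEC);         /* crossing #4: per-connection setup */

        char buf[4096];
        ssize_t n;
        while ((n = read(fd, buf, sizeof buf)) > 0)  /* more crossings per block */
            write(conn, buf, (size_t)n);

        close(fd);
        close(conn);
    }

Running N such requests in parallel repeats every one of these crossings N times; the vectored calls introduced next pay those costs once per batch.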

  8. A scalable, parallel webserver: the three accept(...) calls collapse into a single vec_accept(...); stat(f1), stat(f2), stat(f3) and open(f1), open(f2), open(f3) remain separate.

  9. A scalable, parallel webserver: the three stat calls likewise collapse into vec_stat([f1,f2,f3]); open(f1), open(f2), open(f3) remain separate. A sketch of what such an interface could look like follows.
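The slides do not give concrete signatures for the vec_ calls; the declarations below are a hypothetical sketch of what such a batched interface could look like (vec_stat and vec_open are not real Linux syscalls, and the parameter layout is an assumption):

    /* Hypothetical interface sketch: vec_stat and vec_open are NOT real Linux
     * syscalls. The names and signatures are assumptions illustrating the idea
     * of one kernel crossing servicing N similar requests. */
    #include <stddef.h>
    #include <sys/stat.h>

    /* Stat n paths in one call. results[i] and errs[i] correspond to paths[i];
     * errs[i] is 0 on success or an errno value. Returns 0 if the batch was
     * submitted, -1 otherwise. */
    int vec_stat(const char *const paths[], struct stat results[],
                 int errs[], size_t n);

    /* Open n paths with the same flags in one call; fds[i] receives the
     * descriptor for paths[i], or -errno on failure. */
    int vec_open(const char *const paths[], int flags, int fds[], size_t n);

A caller would pass [f1, f2, f3] and get all three results back from a single crossing, which is what the following slides exploit.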

  10. A scalable, parallel webserver: zooming into a single call on f1, the work behind it is { context switch, alloc(), copy(f1), path_resolve(f1), acl_check(f1), h = hash(f1), lookup(h), read(f1), dealloc(), context switch }.

  11. A scalable, parallel webserver: all three calls expand the same way, so that sequence runs three times in parallel, once each for f1, f2, and f3: { context switch, alloc(), copy(fN), path_resolve(fN), acl_check(fN), h = hash(fN), lookup(h), read(fN), dealloc(), context switch }.

  12. A scalable, parallel webserver: with vec_open([f1,f2,f3]), the three expansions merge into one: { context switch, vec_alloc(), vec_copy([f1,f2,f3]), vec_path_resolve([f1,f2,f3]), acl_check([f1,f2,f3]), h = hash([f1,f2,f3]), lookup(h), vec_read([f1,f2,f3]), dealloc(), context switch }.

  13. A vectored webserver: vec_accept(...), vec_stat([f1,f2,f3]), vec_open([f1,f2,f3]) { context switch, vec_alloc(), vec_copy([f1,f2,f3]), vec_path_resolve([f1,f2,f3]), acl_check([f1,f2,f3]), h = hash([f1,f2,f3]), lookup(h), vec_read([f1,f2,f3]), dealloc(), context switch }.

  14.–17. A vectored webserver (annotated builds): on the merged vec_open path,
     ๏ Eliminate N-1 context switches
     ๏ Reduce path resolutions
     ๏ Use SSE to hash filenames
     ๏ Search dentry list once
     A sketch of the single-scan lookup idea follows.
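As one illustration of the "Search dentry list once" annotation, here is a conceptual sketch (an assumption, not actual Linux dcache code) of satisfying a whole batch of hash lookups with a single traversal of a collision chain:

    /* Conceptual sketch (assumption, not Linux dcache code): satisfy N hash
     * lookups with one traversal of a collision chain instead of N traversals. */
    #include <stddef.h>
    #include <stdint.h>

    struct dentry {
        uint64_t hash;
        struct dentry *next;
        /* name, inode, ... */
    };

    /* out[i] gets the entry whose hash equals hashes[i], or NULL if absent. */
    void vec_lookup(const struct dentry *head, const uint64_t hashes[],
                    const struct dentry *out[], size_t n) {
        for (size_t i = 0; i < n; i++)
            out[i] = NULL;
        for (const struct dentry *d = head; d != NULL; d = d->next)  /* one scan */
            for (size_t i = 0; i < n; i++)
                if (out[i] == NULL && d->hash == hashes[i])
                    out[i] = d;
    }

The per-element comparisons remain, but the traversal itself is paid once for the batch instead of once per call.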

  18.–19. VOS core ideas
     Known: Batching syscalls improves throughput
     ๏ Amortizes per-execution cost
     ๏ Applies regardless of similarity of batched work
     "SIMD" vectorization improves efficiency
     ๏ Eliminates redundant instructions in parallel execution
     ๏ Frees up resources to allow more work to be done
     ๏ Enables algorithmic optimizations
     One concrete example: mprotect. One difficult challenge: managing divergence. One possible implementation path.

  20.–26. Speeding up memory protection (builds)
     [Bar chart: page protections per second, mprotect vs. vec_mprotect; vec_mprotect shows a 3x performance improvement. Data courtesy of Iulian Moraru.]
     vec_mprotect techniques:
     ๏ Amortize context switches (async batching)
     ๏ Optimized data structure allocation (sorting)
     ๏ Eliminate TLB flush per individual call
     (The chart's brace annotations attribute roughly 30% of the gain to async batching and roughly 170% to the latter two techniques.)
     A user-space approximation of the batching-and-coalescing idea is sketched below.
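vec_mprotect is not available in stock Linux; the sketch below (an assumption, using only the standard mprotect call) imitates part of the idea in user space by sorting the regions and merging adjacent ones, so fewer calls, and with them fewer crossings and TLB flushes, are issued:

    /* User-space approximation (assumption): there is no vec_mprotect syscall
     * in stock Linux. Sorting the regions and coalescing adjacent ones reduces
     * the number of mprotect calls, and with them the per-call crossings and
     * TLB flushes. Regions must be page-aligned. */
    #include <stdint.h>
    #include <stdlib.h>
    #include <stddef.h>
    #include <sys/mman.h>

    struct region { void *addr; size_t len; };

    static int cmp_region(const void *a, const void *b) {
        uintptr_t pa = (uintptr_t)((const struct region *)a)->addr;
        uintptr_t pb = (uintptr_t)((const struct region *)b)->addr;
        return (pa > pb) - (pa < pb);
    }

    /* Apply `prot` to every region, merging contiguous ranges into one call. */
    int protect_batch(struct region *regions, size_t n, int prot) {
        qsort(regions, n, sizeof *regions, cmp_region);
        for (size_t i = 0; i < n; ) {
            char *start = regions[i].addr;
            size_t len = regions[i].len;
            while (i + 1 < n && (char *)regions[i + 1].addr == start + len)
                len += regions[++i].len;        /* coalesce adjacent ranges */
            if (mprotect(start, len, prot) != 0)
                return -1;
            i++;
        }
        return 0;
    }

The in-kernel vec_mprotect described on the slide goes further: it also batches the crossings themselves and flushes the TLB once for the whole batch.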

  27.–30. One difficult challenge: handling convergence and divergence. Inside vec_open([f1,f2,f3]) the batch can split and rejoin:
     { context switch, vec_alloc(), vec_copy([f1,f2,f3]),
       vec_path_resolve([f1]) | vec_path_resolve([f2,f3]),
       acl_check([f1]) | acl_check([f2,f3]),
       h = hash([f1,f2,f3]),
       lookup(h[0]) | lookup(h[1..2]),
       read([f1]) | read([f2,f3]),
       dealloc(), context switch }
     How should the split and rejoin be expressed: fork()? join()? messages? And is a brief convergence point (such as the shared hash step) worth joining for? A grouping sketch follows.
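One way to picture handling divergence: partition the batch by the path each element takes and keep handling each partition as a smaller vector. The sketch below is a conceptual illustration only (the request structure and handler names are assumptions):

    /* Conceptual sketch (assumption): when a batch diverges, split it by the
     * path each element takes and keep handling each group as a smaller
     * vector, instead of dropping back to one-at-a-time execution. */
    #include <stdbool.h>
    #include <stddef.h>

    #define MAX_BATCH 64

    struct req { int id; bool in_cache; };

    /* Hypothetical batched handlers for the two paths. */
    static void handle_cached_batch(struct req *r[], size_t n)   { (void)r; (void)n; }
    static void handle_uncached_batch(struct req *r[], size_t n) { (void)r; (void)n; }

    void dispatch(struct req reqs[], size_t n) {
        struct req *cached[MAX_BATCH], *uncached[MAX_BATCH];
        size_t nc = 0, nu = 0;
        if (n > MAX_BATCH)
            n = MAX_BATCH;
        for (size_t i = 0; i < n; i++) {            /* diverge: split by path */
            if (reqs[i].in_cache) cached[nc++] = &reqs[i];
            else                  uncached[nu++] = &reqs[i];
        }
        if (nc) handle_cached_batch(cached, nc);    /* each group stays a vector */
        if (nu) handle_uncached_batch(uncached, nu);
        /* Whether to converge (join) the groups afterwards depends on how much
         * shared work remains -- the "is this worth joining for?" question. */
    }

Fork/join and message passing, raised on the slide, are two candidate mechanisms for expressing exactly this split and the optional rejoin.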

  31.–32. OS as staged event system: an ideal interface for vectorization.
     ๏ Use message passing as underlying primitive
     [Diagram: on packet, is_new_connection?, accept, process]
     Open questions: Programming interface? (The event abstraction is nice.) Who vectorizes: static analysis, compiler, OS, or app developer? Efficiency vs. latency. A minimal stage sketch follows.
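The staged, message-passing structure is a natural place to form vectors: a stage can drain its queue and group similar messages before acting on them. Below is a minimal single-threaded sketch (all names and types are assumptions, not an interface from the talk):

    /* Minimal sketch (assumption): a stage drains its message queue and groups
     * similar events so the batch can be handed downstream as one vectored
     * operation. Single-threaded ring buffer, purely illustrative. */
    #include <stddef.h>

    #define QSIZE 256

    enum ev_type { EV_NEW_CONNECTION, EV_PACKET };

    struct event { enum ev_type type; int fd; };

    struct stage {
        struct event q[QSIZE];
        size_t head, tail;                    /* ring buffer indices */
    };

    static int stage_pop(struct stage *s, struct event *out) {
        if (s->head == s->tail)
            return 0;                         /* queue empty */
        *out = s->q[s->head];
        s->head = (s->head + 1) % QSIZE;
        return 1;
    }

    void run_stage(struct stage *s) {
        int new_conns[QSIZE];
        size_t n = 0;
        struct event ev;
        while (stage_pop(s, &ev)) {
            if (ev.type == EV_NEW_CONNECTION)
                new_conns[n++] = ev.fd;       /* group similar events */
            /* EV_PACKET events would be forwarded to the processing stage */
        }
        if (n > 0) {
            /* the whole group of new connections could now be handled by one
             * batched (vec_accept-style) call instead of one at a time */
            (void)new_conns;
        }
    }

Whether this batching is done by the application, a compiler, or the OS itself is exactly the "who vectorizes?" question the slide leaves open.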

  33.–34. Summary of VOS: don't let embarrassingly parallel become embarrassingly wasteful.
     Vectors fundamentally improve efficiency by
     ๏ Collecting similar requests
     ๏ Eliminating redundant work
     ๏ Remaining parallel when code diverges
     Challenges
     ๏ Programming vector abstractions
     ๏ Identifying what to coalesce and how to diverge
