SLIDE 1 The Scalable Commutativity Rule: Designing Scalable Software for Multicore Processors
Austin T. Clements
Nickolai Zeldovich, Robert Morris, Eddie Kohler†
MIT CSAIL and †Harvard
SLIDE 2
Timeline, 2008 to 2014:
- Corey (OSDI '08)
- Linux scalability (OSDI '10)
- Bonsai VM (ASPLOS '12)
- RadixVM (EuroSys '13)
Current approach to scalable software development
SLIDE 3
Workload
Current approach to scalable software development
SLIDE 4
Workload → Plot scalability
Current approach to scalable software development
SLIDE 5
Workload → Plot scalability → Differential profile
x()
Current approach to scalable software development
SLIDE 6
Workload → Plot scalability → Differential profile → Fix top bottleneck
x() +++
Current approach to scalable software development
SLIDE 8 Successful in practice because it focuses developer effort
Disadvantages:
- New workloads expose new bottlenecks
- More cores expose new bottlenecks
- The real bottlenecks may be in the interface design
Current approach to scalable software development
SLIDE 10
creat("x") creat("y") creat("z")
Interface scalability example
SLIDE 11
creat("x") creat("y") creat("z")
stdin stdout stderr
Interface scalability example
SLIDE 13
The scalable commutativity rule: Whenever interface operations commute, they can be implemented in a way that scales.
Approach: Interface-driven scalability
SLIDE 14
Whenever interface operations commute, they can be implemented in a way that scales. The scalable commutativity rule ?
creat with lowest FD Commutes Scalable implementation exists
Approach: Interface-driven scalability
SLIDE 15
Whenever interface operations commute, they can be implemented in a way that scales. The scalable commutativity rule ?
creat with lowest FD Commutes Scalable implementation exists creat → 3 creat → 4
Approach: Interface-driven scalability
SLIDE 16
Whenever interface operations commute, they can be implemented in a way that scales. The scalable commutativity rule
creat with lowest FD Commutes Scalable implementation exists
✗
Approach: Interface-driven scalability
SLIDE 17
Whenever interface operations commute, they can be implemented in a way that scales. The scalable commutativity rule
creat with lowest FD Commutes Scalable implementation exists
✗ ?
creat with any FD creat → 42 creat → 17
Approach: Interface-driven scalability
SLIDE 18
The scalable commutativity rule: Whenever interface operations commute, they can be implemented in a way that scales.
Interface | Commutes | Scalable implementation exists
creat with lowest FD | ✗ |
creat with any FD | ✓ | ✓ (by the rule)
Approach: Interface-driven scalability
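The lowest-FD versus any-FD contrast can be sketched with a toy model. The two table classes below are hypothetical illustrations, not Linux or sv6 code, and the check only compares return values across call orders, which is a simplification of the commutativity definition used later in the talk.

```python
from itertools import permutations


class LowestFdTable:
    """Hypothetical FD table with POSIX-style lowest-available allocation."""

    def __init__(self):
        self.used = {0, 1, 2}  # stdin, stdout, stderr

    def creat(self, caller):
        fd = 0
        while fd in self.used:  # every caller scans the same shared set
            fd += 1
        self.used.add(fd)
        return fd


class AnyFdTable:
    """Hypothetical FD table where each caller allocates from a private range."""

    STRIDE = 1000  # assumed size of each caller's range

    def __init__(self):
        self.next_fd = {}

    def creat(self, caller):
        fd = self.next_fd.get(caller, caller * self.STRIDE + 3)
        self.next_fd[caller] = fd + 1  # touches only this caller's entry
        return fd


def outcomes(table_cls, callers):
    """Run one creat per caller in every order; collect the per-caller results."""
    seen = set()
    for order in permutations(callers):
        table = table_cls()
        got = {caller: table.creat(caller) for caller in order}
        seen.add(tuple(sorted(got.items())))
    return seen
```

With callers 0 and 1, `outcomes(LowestFdTable, [0, 1])` contains two distinct results because the order decides who gets FD 3, while `outcomes(AnyFdTable, [0, 1])` contains one, so only the any-FD interface lets the calls commute.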
SLIDE 19
The rule enables reasoning about scalability throughout the software design process
- Design: Guides design of scalable interfaces
- Implement: Sets a clear implementation target
- Test: Systematic, workload-independent scalability testing
Advantages of interface-driven scalability
SLIDE 20 The scalable commutativity rule
- Formalization of the rule and proof of its correctness
- State-dependent, interface-based commutativity
Commuter: An automated scalability testing tool
sv6: A scalable POSIX-like kernel
Contributions
SLIDE 21 Defining the rule
- Definition of scalability
- Intuition
- Formalization
Applying the rule
Outline
SLIDE 22 [Plot: normalized throughput (y-axis, 5 to 40) vs. cores (x-axis, 1 to 48); series: gmake, Exim]
A scalability bottleneck
SLIDE 23 [Plot: normalized throughput (y-axis, 5 to 40) vs. cores (x-axis, 1 to 48); series: gmake, Exim]
One contended cache line
A single contended cache line can wreck scalability
A scalability bottleneck
SLIDE 24 [Plot: cycles to read (y-axis, 5k to 25k), x-axis 1 to 80; 1 writer + N readers]
Cost of a contended cache line
SLIDE 26 ✗ ✗ ✗ Core X Core Y W R
R
✓ ✓
✓
What scales on today's multicores?
SLIDE 27 ✗ ✗ ✗ Core X Core Y W R
R
✓ ✓
✓ ✓
What scales on today's multicores?
SLIDE 28 ✗ ✗ ✗ Core X Core Y W R
R
✓ ✓
✓ ✗
What scales on today's multicores?
SLIDE 29 ✗ ✗ ✗ Core X Core Y W R
R
✓ ✓
✓ We say two or more operations are scalable if they are conflict-free.
What scales on today's multicores?
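A minimal sketch of the conflict-free test, assuming the usual reading of the table above: a cache line conflicts when one core writes it and a different core reads or writes it. The function name and the trace format are hypothetical.

```python
from collections import defaultdict


def conflicting_lines(accesses):
    """Return the cache lines written by one core and touched by another.

    accesses: iterable of (core, cache_line, mode), mode is 'R' or 'W'.
    An empty result means the traced operations are conflict-free.
    """
    readers = defaultdict(set)
    writers = defaultdict(set)
    for core, line, mode in accesses:
        (writers if mode == "W" else readers)[line].add(core)

    conflicts = set()
    for line, writing_cores in writers.items():
        # Read-only sharing is fine; a write plus any other core is not.
        if len(writing_cores | readers[line]) > 1:
            conflicts.add(line)
    return conflicts


# Shared counter: both cores write the same line.
shared = [("X", "counter", "W"), ("Y", "counter", "W")]
# Per-core counters plus a read-only shared line.
per_core = [("X", "counter_x", "W"), ("Y", "counter_y", "W"),
            ("X", "config", "R"), ("Y", "config", "R")]
```

Here `conflicting_lines(shared)` reports the `counter` line, while `conflicting_lines(per_core)` is empty, matching the ✓ for read-read sharing and for accesses to different lines.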
SLIDE 30
Whenever interface operations commute, they can be implemented in a way that scales.
Operations commute ⇒ results independent of order ⇒ communication is unnecessary ⇒ without communication, no conflicts
The intuition behind the rule
SLIDE 31
Y SI-commutes in X || Y ≔ ∀ Y' ∈ reorderings(Y), Z: X || Y || Z ∈ 𝒯 ⇔ X || Y' || Z ∈ 𝒯.
Y SIM-commutes in X || Y ≔ ∀ P ∈ prefixes(reorderings(Y)): P SI-commutes in X || P.
An implementation m is a step function: state ⨯ inv ↦ state ⨯ resp.
Given a specification 𝒯, a history X || Y in which Y SIM-commutes, and a reference implementation M that can generate X || Y, ∃ an implementation m of 𝒯 whose steps in Y are conflict-free.
Proof by simulation construction.
Formalizing the rule
SLIDE 32
Y SI-commutes in X || Y ≔ ∀ Y' ∈ reorderings(Y), Z: X || Y || Z ∈ 𝒯 ⇔ X || Y' || Z ∈ 𝒯.
Y SIM-commutes in X || Y ≔ ∀ P ∈ prefixes(reorderings(Y)): P SI-commutes in X || P.
An implementation m is a step function: state ⨯ inv ↦ state ⨯ resp.
Given a specification 𝒯, a history X || Y in which Y SIM-commutes, and a reference implementation M that can generate X || Y, ∃ an implementation m of 𝒯 whose steps in Y are conflict-free.
Proof by simulation construction.
Commutativity is sensitive to operations, arguments, and state
Formalizing the rule
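The two definitions can be exercised by brute force on a toy specification. The sketch below is an illustration only: the get/set register spec, the thread labels, and the bound on the length of Z are all assumptions, and reorderings are taken to preserve each thread's own order.

```python
from itertools import permutations, product

INVS = [("get",), ("set", 1), ("set", 2)]  # hypothetical invocation alphabet
RESPS = [None, 0, 1, 2]                    # hypothetical response alphabet


def step(state, inv):
    """Toy register spec as a step function: state x inv -> state x resp."""
    if inv[0] == "set":
        return inv[1], None
    return state, state


def in_spec(history, init=0):
    """A history of (thread, inv, resp) is valid if replay gives those responses."""
    state = init
    for _thread, inv, resp in history:
        state, expected = step(state, inv)
        if resp != expected:
            return False
    return True


def reorderings(y):
    """Orderings of y that keep each thread's own actions in order."""
    out = []
    for perm in permutations(range(len(y))):
        pos = {idx: k for k, idx in enumerate(perm)}
        keeps_thread_order = all(
            pos[i] < pos[j]
            for i in range(len(y)) for j in range(i + 1, len(y))
            if y[i][0] == y[j][0]
        )
        seq = [y[i] for i in perm]
        if keeps_thread_order and seq not in out:
            out.append(seq)
    return out


def futures(depth):
    """Every candidate Z of length <= depth (a bound; the definition has none)."""
    actions = [("z", inv, resp) for inv in INVS for resp in RESPS]
    for n in range(depth + 1):
        for z in product(actions, repeat=n):
            yield list(z)


def si_commutes(x, y, depth=2):
    zs = list(futures(depth))
    return all(in_spec(x + y + z) == in_spec(x + y2 + z)
               for y2 in reorderings(y) for z in zs)


def sim_commutes(x, y, depth=2):
    return all(si_commutes(x, y2[:n], depth)
               for y2 in reorderings(y) for n in range(len(y2) + 1))
```

For Y = [set(1) on thread 0, set(2) on thread 1, set(2) on thread 0], every reordering ends with the register at 2, so `si_commutes` holds, but the prefix [set(1), set(2)] is order-sensitive, so `sim_commutes` fails. This is the case the monotonicity ("M") condition is there to exclude.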
SLIDE 33
Commutes Scalable implementation exists P1: creat P1: creat
✗
Example of using the rule
SLIDE 34
Commutes Scalable implementation exists P1: creat P1: creat
✗
P1: creat("/tmp/x") P2: creat("/etc/y")
Example of using the rule
SLIDE 35
Commutes Scalable implementation exists P1: creat P1: creat
✗
P1: creat("/tmp/x") P2: creat("/etc/y")
✓ ✓ (Linux)
Example of using the rule
SLIDE 36
Commutes Scalable implementation exists P1: creat P1: creat
✗
P1: creat("/tmp/x") P2: creat("/etc/y")
✓ ✓ (Linux)
P1: creat("/x") P2: creat("/y")
Example of using the rule
SLIDE 37
Commutes Scalable implementation exists P1: creat P1: creat
✗
P1: creat("/tmp/x") P2: creat("/etc/y")
✓ ✓ (Linux)
P1: creat("/x") P2: creat("/y")
✓ ✓
Example of using the rule
SLIDE 38
Commutes Scalable implementation exists P1: creat P1: creat
✗
P1: creat("/tmp/x") P2: creat("/etc/y")
✓ ✓ (Linux)
P1: creat("/x") P2: creat("/y")
✓ ✓
P1: creat("x", O_EXCL) P2: creat("x", O_EXCL)
Example of using the rule
SLIDE 39
Operations | Commutes | Scalable implementation exists
P1: creat, P1: creat | ✗ |
P1: creat("/tmp/x"), P2: creat("/etc/y") | ✓ | ✓ (Linux)
P1: creat("/x"), P2: creat("/y") | ✓ | ✓
P1: creat("x", O_EXCL), P2: creat("x", O_EXCL), same CWD | ✗ |
P1: creat("x", O_EXCL), P2: creat("x", O_EXCL), different CWD | ✓ | ✓
Example of using the rule
SLIDE 40
Interface specification (e.g., POSIX) + Implementation (e.g., Linux) → Commuter → All scalability bottlenecks
Applying the rule to real systems
SLIDE 41
SymInode = tstruct(data = tlist(SymByte), nlink = SymInt)
SymIMap = tdict(SymInt, SymInode)
SymFilename = tuninterpreted('Filename')
SymDir = tdict(SymFilename, SymInt)

class POSIX:
    def __init__(self):
        self.fname_to_inum = SymDir.any()
        self.inodes = SymIMap.any()

    @symargs(src=SymFilename, dst=SymFilename)
    def rename(self, src, dst):
        if src not in self.fname_to_inum:
            return (-1, errno.ENOENT)
        if src == dst:
            return 0
        if dst in self.fname_to_inum:
            self.inodes[self.fname_to_inum[dst]].nlink -= 1
        self.fname_to_inum[dst] = self.fname_to_inum[src]
        del self.fname_to_inum[src]
        return 0
Symbolic model
Input: Symbolic model
SLIDE 42 rename(a, b) and rename(c, d) commute if:
- Both source files exist and all names are different
- Neither source file exists
- a xor c exists, and it is not the other rename's destination
- Both calls are self-renames
- One call is a self-rename of an existing file and a != c
- a & c are hard links to the same inode, a != c, and b == d
Symbolic model → Analyzer → Commutativity conditions
Commutativity conditions
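Some of the listed conditions can be spot-checked on concrete states. The sketch below is a plain-Python analogue of the symbolic rename model, not Commuter's analyzer: `ConcreteFS` and `renames_commute` are hypothetical names, and comparing final states for equality is a stricter test than the indistinguishability the formal definition asks for.

```python
import errno
from copy import deepcopy


class ConcreteFS:
    """Concrete stand-in for the symbolic model: names -> inums, inum -> nlink."""

    def __init__(self, fname_to_inum, nlink):
        self.fname_to_inum = dict(fname_to_inum)
        self.nlink = dict(nlink)

    def rename(self, src, dst):
        if src not in self.fname_to_inum:
            return (-1, errno.ENOENT)
        if src == dst:
            return 0
        if dst in self.fname_to_inum:
            # Replacing dst drops one link to the inode it used to name.
            self.nlink[self.fname_to_inum[dst]] -= 1
        self.fname_to_inum[dst] = self.fname_to_inum[src]
        del self.fname_to_inum[src]
        return 0


def renames_commute(fs, op_a, op_b):
    """Run both orders from the same start; compare results and final state."""
    ab, ba = deepcopy(fs), deepcopy(fs)
    a_first, b_second = ab.rename(*op_a), ab.rename(*op_b)
    b_first, a_second = ba.rename(*op_b), ba.rename(*op_a)
    same_results = (a_first, b_second) == (a_second, b_first)
    return same_results and vars(ab) == vars(ba)


two_files = ConcreteFS({"f0": 1, "f2": 2}, {1: 1, 2: 1})
hard_links = ConcreteFS({"a": 1, "c": 1}, {1: 2})
```

`renames_commute(two_files, ("f0", "f1"), ("f2", "f3"))` holds (both sources exist, all names differ), as does `renames_commute(hard_links, ("a", "b"), ("c", "b"))` (hard links to one inode, same destination). A chain such as `("f0", "f1")` and `("f1", "f2")` does not commute, because the second call's source exists only after the first runs.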
SLIDE 43 Symbolic model → Analyzer → Commutativity conditions → Testgen → Test cases
rename(a, b) and rename(c, d) commute if:
- Both source files exist and all names are different
- Neither source file exists
- a xor c exists, and it is not the other rename's destination
- Both calls are self-renames
- One call is a self-rename of an existing file and a != c
- a & c are hard links to the same inode, a != c, and b == d
void setup() {
  close(creat("f0", 0666));
  close(creat("f2", 0666));
}
void test_opA() { rename("f0", "f1"); }
void test_opB() { rename("f2", "f3"); }
Test cases
SLIDE 44 Symbolic model → Analyzer → Commutativity conditions → Testgen → Test cases → Mtrace/QEMU (running Linux) → Conflicting cache lines
void setup() {
  close(creat("f0", 0666));
  close(creat("f2", 0666));
}
void test_opA() { rename("f0", "f1"); }
void test_opB() { rename("f2", "f3"); }
test_opA test_opB
Output: Conflicting cache lines
SLIDE 45
Does the rule help build scalable systems?
Evaluation
SLIDE 46 (Linux 3.8, ramfs)
[Matrix of system call pairs; both axes: link, unlink, rename, stat, fstat, lseek, close, pipe, read, write, pread, pwrite, mmap, munmap, mprotect, memread, memwrite. Cell shading ranges from "All tests conflict-free" to "All tests conflicted".]
13,664 total test cases
68% are conflict-free
Many are "corner cases," many are not.
Commuter finds non-scalable cases in Linux
SLIDE 47 (Linux 3.8, ramfs)
[Matrix of system call pairs; both axes: link, unlink, rename, stat, fstat, lseek, close, pipe, read, write, pread, pwrite, mmap, munmap, mprotect, memread, memwrite. Cell shading ranges from "All tests conflict-free" to "All tests conflicted".]
13,664 total test cases
68% are conflict-free
Many are "corner cases," many are not.
- Directory-wide locking
- File descriptor reference counts
- Address space-wide locking
Commuter finds non-scalable cases in Linux
SLIDE 48
- POSIX-like operating system
- File system and virtual memory system follow commutativity rule
- Implementation using standard parallel programming techniques, but guided by Commuter
sv6: A scalable OS
SLIDE 49
[Matrix of system call pairs; both axes: link, unlink, rename, stat, fstat, lseek, close, pipe, read, write, pread, pwrite, mmap, munmap, mprotect, memread, memwrite. Cell shading ranges from "All tests conflict-free" to "All tests conflicted".]
Zero cache lines shared
13,664 total test cases
99% are conflict-free
Remaining 1% are mostly "idempotent updates"
Commutative operations can be made to scale
SLIDE 50
[Matrix of system call pairs; both axes: link, unlink, rename, stat, fstat, lseek, close, pipe, read, write, pread, pwrite, mmap, munmap, mprotect, memread, memwrite. Cell shading ranges from "All tests conflict-free" to "All tests conflicted".]
Zero cache lines shared
13,664 total test cases
99% are conflict-free
Remaining 1% are mostly "idempotent updates"
- Two pwrites of same data to same offset
- Two lseeks of same FD to the same offset
Commutative operations can be made to scale
SLIDE 51
- Lowest FD versus any FD
- stat versus xstat
- Unordered sockets
- Delayed munmap
- fork+exec versus posix_spawn
Refining POSIX with the rule
SLIDE 52 qmail-like multithreaded mail server
Non-commutative APIs: lowest FD, ordered sockets, fork+exec
[Plot: total emails/sec (y-axis, 10k to 70k) vs. # cores (x-axis, 1 to 80)]
Commutative operations matter to app scalability
SLIDE 53 qmail-like multithreaded mail server
Non-commutative APIs: lowest FD, ordered sockets, fork+exec
[Plot: total emails/sec (y-axis, 10k to 70k) vs. # cores (x-axis, 1 to 80)]
Commutative APIs: any FD, unordered sockets, posix_spawn
Commutative operations matter to app scalability
SLIDE 54 Commutativity and concurrency
- [Bernstein '81]
- [Weihl '88]
- [Steele '90]
- [Rinard '97]
- [Shapiro '11]
Laws of Order [Attiya '11]
Disjoint-access parallelism [Israeli '94]
Scalable locks [MCS '91]
Scalable reference counting [Ellen '07, Corbet '10]
Related work
SLIDE 55
Whenever interface operations commute, they can be implemented in a way that scales.
Check it out at http://pdos.csail.mit.edu/commuter
Design Implement Test
Conclusion