An Analysis of Linux Scalability to Many Cores, Silas Boyd-Wickizer et al. (PowerPoint PPT presentation)



SLIDE 1

An Analysis of Linux Scalability to Many Cores

Silas Boyd-Wickizer, Austin T. Clements, Yandong Mao, Aleksey Pesterev,
M. Frans Kaashoek, Robert Morris, and Nickolai Zeldovich

MIT CSAIL

SLIDE 2

What is scalability?

  • Application does N times as much work on N cores as it could on 1 core
  • Scalability may be limited by Amdahl's Law:
  • Locks, shared data structures, ...
  • Shared hardware (DRAM, NIC, ...)
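Amdahl's Law makes the limit concrete: if a fraction p of the work parallelizes perfectly, n cores give a speedup of at most 1 / ((1 - p) + p / n). A quick illustration (not from the slides; the function name is ours):

```c
/* Amdahl's Law: best-case speedup on n cores when a fraction p of the
 * work is perfectly parallel and the rest (1 - p) stays serial. */
double amdahl_speedup(double p, int n)
{
    return 1.0 / ((1.0 - p) + p / (double)n);
}
```

Even 1% serial work matters: amdahl_speedup(0.99, 48) is roughly 32.7, well short of 48.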
SLIDE 3

Why look at the OS kernel?

  • Many applications spend time in the kernel
  • E.g. on a uniprocessor, the Exim mail server spends 70% of its time in the kernel
  • These applications should scale with more cores
  • If the OS kernel doesn't scale, apps won't scale
SLIDE 4

Speculation about kernel scalability

  • Several kernel scalability studies indicate existing kernels don't scale well
  • Speculation that fixing them is hard
  • New OS kernel designs:
  • Corey, Barrelfish, fos, Tessellation, …
  • How serious are the scaling problems?
  • How hard is it to fix them?
  • Hard to answer in general, but we shed some light on the answer by analyzing Linux scalability

SLIDE 5

Analyzing scalability of Linux

  • Use an off-the-shelf 48-core x86 machine
  • Run a recent version of Linux
  • Used a lot, competitive baseline scalability
  • Scale a set of applications
  • Parallel implementation
  • System intensive
SLIDE 6

Contributions

  • Analysis of Linux scalability for 7 real apps.
  • Stock Linux limits scalability
  • Analysis of bottlenecks
  • Fixes: 3002 lines of code, 16 patches
  • Most fixes improve scalability of multiple apps.
  • Remaining bottlenecks in HW or app
  • Result: no kernel problems up to 48 cores
SLIDE 7

Method

  • Run application
  • Use in-memory file system to avoid disk bottleneck
  • Find bottlenecks
  • Fix bottlenecks, re-run application
  • Stop when a non-trivial application fix is required, or when bottlenecked by shared hardware (e.g. DRAM)

SLIDE 13

Off-the-shelf 48-core server

  • 6 cores x 8 chips, AMD

[Diagram: eight 6-core chips, each with its own locally attached DRAM]

SLIDE 14

Poor scaling on stock Linux kernel

[Graph: speedup at 4-48 cores for Exim, memcached, Apache, PostgreSQL, gmake, Psearchy, and Metis, ranging from near-perfect to terrible scaling]

Y-axis: (throughput with 48 cores) / (throughput with one core)

SLIDE 15

Exim on stock Linux: collapse

[Graph: throughput (messages/second, peaking near 12000) and kernel CPU time (milliseconds/message) vs. cores, 1-48; throughput collapses past about 40 cores as kernel CPU time climbs]

SLIDE 18

Oprofile shows an obvious problem

40 cores: 10000 msg/sec

  samples  %        app name  symbol name
  2616     7.3522   vmlinux   radix_tree_lookup_slot
  2329     6.5456   vmlinux   unmap_vmas
  2197     6.1746   vmlinux   filemap_fault
  1488     4.1820   vmlinux   __do_fault
  1348     3.7885   vmlinux   copy_page_c
  1182     3.3220   vmlinux   unlock_page
  966      2.7149   vmlinux   page_fault

48 cores: 4000 msg/sec

  samples  %        app name  symbol name
  13515    34.8657  vmlinux   lookup_mnt
  2002     5.1647   vmlinux   radix_tree_lookup_slot
  1661     4.2850   vmlinux   filemap_fault
  1497     3.8619   vmlinux   unmap_vmas
  1026     2.6469   vmlinux   __do_fault
  914      2.3579   vmlinux   atomic_dec
  896      2.3115   vmlinux   unlock_page

SLIDE 21

Bottleneck: reading mount table

  • sys_open eventually calls:

    struct vfsmount *lookup_mnt(struct path *path)
    {
        struct vfsmount *mnt;
        spin_lock(&vfsmount_lock);
        mnt = hash_get(mnts, path);
        spin_unlock(&vfsmount_lock);
        return mnt;
    }

Critical section is short. Why does it cause a scalability bottleneck?

  • spin_lock and spin_unlock use many more cycles than the critical section

SLIDE 25

Linux spin lock implementation

    struct spinlock_t {
        int current_ticket;
        int next_ticket;
    };

    void spin_lock(spinlock_t *lock)
    {
        t = atomic_inc(lock->next_ticket);  /* Allocate a ticket: 120 – 420 cycles */
        while (t != lock->current_ticket)
            ;                               /* Spin until it's my turn */
    }

    void spin_unlock(spinlock_t *lock)
    {
        lock->current_ticket++;             /* Update the ticket value */
    }

SLIDE 34

Scalability collapse caused by non-scalable locks [Anderson 90]

(Same ticket spin lock as on the previous slide.)

  • On unlock, the current_ticket update must reach every spinning core's cache: 500 – 4000 cycles!!
  • Previous lock holder notifies next lock holder after sending out N/2 replies

SLIDE 48

Bottleneck: reading mount table

  • sys_open eventually calls lookup_mnt (shown earlier), which serializes every lookup on vfsmount_lock
  • Well known problem, many solutions
  • Use scalable locks [MCS 91]
  • Use message passing [Baumann 09]
  • Avoid locks in the common case
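For context on the first alternative: an MCS-style scalable lock queues each waiter on its own node, so every core spins on its own cache line and a release touches only the next waiter instead of broadcasting to all spinners. A minimal sketch in C11 atomics (illustrative names; this is not the kernel's implementation):

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Each waiting core supplies its own node and spins only on it. */
struct mcs_node {
    _Atomic(struct mcs_node *) next;
    atomic_bool locked;
};

struct mcs_lock {
    _Atomic(struct mcs_node *) tail;   /* last node in the waiter queue */
};

void mcs_acquire(struct mcs_lock *l, struct mcs_node *me)
{
    atomic_store(&me->next, (struct mcs_node *)NULL);
    atomic_store(&me->locked, true);
    struct mcs_node *prev = atomic_exchange(&l->tail, me);
    if (prev != NULL) {                 /* queue was non-empty: enqueue and wait */
        atomic_store(&prev->next, me);
        while (atomic_load(&me->locked))
            ;                           /* spin on our own cache line only */
    }
}

void mcs_release(struct mcs_lock *l, struct mcs_node *me)
{
    struct mcs_node *next = atomic_load(&me->next);
    if (next == NULL) {
        struct mcs_node *expected = me;
        if (atomic_compare_exchange_strong(&l->tail, &expected,
                                           (struct mcs_node *)NULL))
            return;                     /* no waiter: lock is free */
        while ((next = atomic_load(&me->next)) == NULL)
            ;                           /* a waiter is mid-enqueue */
    }
    atomic_store(&next->locked, false); /* hand off: one remote line touched */
}
```

The handoff writes one flag in one waiter's node, which is why contention cost stays flat as the number of waiters grows.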
SLIDE 49

Solution: per-core mount caches

  • Observation: mount table is rarely modified
  • Common case: cores access per-core tables
  • Modify mount table: invalidate per-core tables

    struct vfsmount *lookup_mnt(struct path *path)
    {
        struct vfsmount *mnt;
        if ((mnt = hash_get(percore_mnts[cpu()], path)))
            return mnt;
        spin_lock(&vfsmount_lock);
        mnt = hash_get(mnts, path);
        spin_unlock(&vfsmount_lock);
        hash_put(percore_mnts[cpu()], path, mnt);
        return mnt;
    }
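One common way to implement the "invalidate per-core tables" step is a global generation number: readers trust a per-core entry only while the generation is unchanged, and writers bump it. A single-entry, user-space sketch of that pattern (all names, the generation scheme, and the elided slow-path lock are our assumptions, not the kernel code):

```c
#define NCORES 4

/* Bumped on every modification of the shared table; a mismatch
 * invalidates every per-core cached entry at once. */
static int global_gen = 1;
static const char *table_value = "old";     /* stand-in for the shared table */

struct percore_entry {
    int gen;                 /* generation at which this entry was cached */
    const char *value;
};
static struct percore_entry cache[NCORES];  /* one entry per core */

const char *lookup(int cpu)
{
    struct percore_entry *e = &cache[cpu];
    if (e->gen == global_gen)
        return e->value;     /* common case: hit, no shared-lock traffic */
    /* slow path: consult the shared table (the real code takes a lock) */
    e->value = table_value;
    e->gen = global_gen;
    return e->value;
}

void modify_table(const char *v)
{
    table_value = v;
    global_gen++;            /* one write invalidates all per-core entries */
}
```

Since the mount table is rarely modified, the single shared write on the modify path is cheap, while the read path stays entirely core-local.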
SLIDE 53

Per-core lookup: scalability is better

[Graph: throughput (messages/second, up to about 14000) vs. cores, 1-48; per-core lookup keeps scaling where stock Linux collapses]

SLIDE 55

No obvious bottlenecks

32 cores: 10041 msg/sec

  samples  %       app name  symbol name
  3319     5.4462  vmlinux   radix_tree_lookup_slot
  3119     5.2462  vmlinux   unmap_vmas
  1966     3.3069  vmlinux   filemap_fault
  1950     3.2800  vmlinux   page_fault
  1627     2.7367  vmlinux   unlock_page
  1626     2.7350  vmlinux   clear_page_c
  1578     2.6542  vmlinux   kmem_cache_free

48 cores: 11705 msg/sec

  samples  %       app name  symbol name
  4207     5.3145  vmlinux   radix_tree_lookup_slot
  4191     5.2943  vmlinux   unmap_vmas
  2632     3.3249  vmlinux   page_fault
  2525     3.1897  vmlinux   filemap_fault
  2210     2.7918  vmlinux   clear_page_c
  2131     2.6920  vmlinux   kmem_cache_free
  2000     2.5265  vmlinux   dput

  • Functions execute more slowly on 48 cores
  • dput is causing other functions to slow down
SLIDE 59

Bottleneck: reference counting

  • Ref count indicates if kernel can free object
  • File name cache (dentry), physical pages, ...

    void dput(struct dentry *dentry)
    {
        if (!atomic_dec_and_test(&dentry->ref))
            return;
        dentry_free(dentry);
    }

A single atomic instruction limits scalability?!

  • Reading the reference count is slow
  • Reading the reference count delays memory operations from other cores

SLIDE 62

Reading reference count is slow

    struct dentry {
        …
        int ref;
        …
    };

    void dput(struct dentry *dentry)
    {
        if (!atomic_dec_and_test(&dentry->ref))  /* 120 – 4000 cycles depending on congestion */
            return;
        dentry_free(dentry);
    }

SLIDE 65

Reading the reference count delays memory operations from other cores

(Same dput code as above.)

  • The atomic decrement takes a hardware cache line lock
  • Contention on a reference count congests the interconnect

SLIDE 73

Solution: sloppy counters

  • Observation: kernel rarely needs true value of ref count
  • Each core holds a few "spare" references

[Animation: a shared dentry with a sloppy counter, plus per-core counters on Core 0 and Core 1; as cores acquire references they draw on local spares and the sloppy counter steps from 1 up to 4 without touching the shared dentry; on "rm /tmp/foo" the cores return their spares, the count drops back to 1 and then to zero, and the dentry can be freed]

SLIDE 88

Properties of sloppy counters

  • Simple to start using:
  • Change data structure
  • atomic_inc → sloppy_inc
  • Scale well: no cache misses in common case
  • Memory usage: O(N) space
  • Related to: SNZI [Ellen 07] and distributed counters [Appavoo 07]
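The mechanism above can be sketched in a few lines: each core keeps a local stash of spare references and touches the shared count only to refill or flush a whole batch, so the common-case increment and decrement are core-local. This is an illustrative single-threaded model (BATCH, the flush threshold, and all names are our assumptions, not the paper's kernel patch, and the shared-count updates would be atomic in real code):

```c
#define NCORES 4
#define BATCH  8   /* spares acquired from / returned to the shared count at once */

static long shared_count;            /* references handed out to cores */
static long local_spare[NCORES];     /* unused spare references per core */

void sloppy_inc(int cpu)             /* acquire one reference */
{
    if (local_spare[cpu] > 0) {
        local_spare[cpu]--;          /* common case: consume a spare, no sharing */
    } else {
        shared_count += BATCH;       /* refill: one shared update per BATCH incs */
        local_spare[cpu] = BATCH - 1;
    }
}

void sloppy_dec(int cpu)             /* release one reference */
{
    local_spare[cpu]++;              /* common case: return it to the local stash */
    if (local_spare[cpu] >= 2 * BATCH) {
        shared_count -= BATCH;       /* flush a batch back to the shared count */
        local_spare[cpu] -= BATCH;
    }
}

long sloppy_true_value(void)         /* rare: true count = shared minus all spares */
{
    long v = shared_count;
    for (int i = 0; i < NCORES; i++)
        v -= local_spare[i];
    return v;
}
```

Computing the true value has to visit every core's stash, which is exactly why it only pays off when the kernel rarely needs it (e.g. only when deciding whether an object can be freed).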

SLIDE 89

Sloppy counters: more scalability

[Graph: throughput (messages/second, up to about 14000) vs. cores, 1-48, for sloppy counters, per-core lookup, and stock Linux; sloppy counters scale best]

SLIDE 91

Summary of changes

  • 3002 lines of changes to the kernel
  • 60 lines of changes to the applications

[Table: which fixes apply to which of the 7 applications (Exim, memcached, Apache, PostgreSQL, gmake, Psearchy, Metis). Fixes: mount tables, open file table, sloppy counters, super pages, DMA buffer allocation, network stack false sharing, parallel accept, inode allocation, lock-free dentry lookup, and application modifications. Sloppy counters help all 7 applications; most other fixes help more than one.]

SLIDE 92

Handful of known techniques [Cantrill 08]

  • Lock-free algorithms
  • Per-core data structures
  • Fine-grained locking
  • Cache-alignment
  • Sloppy counters
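The cache-alignment technique in this list is the standard cure for false sharing (as in the network-stack fix above): pad each core's slot out to a full cache line so updates by different cores never contend for the same line. A sketch, using a GCC-style attribute and an assumed 64-byte line:

```c
#define CACHE_LINE 64

/* One statistics slot per core. Padding plus alignment guarantee that
 * stats[i] and stats[i+1] occupy different cache lines, so concurrent
 * updates by different cores don't invalidate each other's lines. */
struct percore_stat {
    long packets;
    char pad[CACHE_LINE - sizeof(long)];
} __attribute__((aligned(CACHE_LINE)));

static struct percore_stat stats[48];

static inline void count_packet(int cpu)
{
    stats[cpu].packets++;   /* touches only this core's cache line */
}
```

Without the padding, several cores' counters would share one 64-byte line and every increment would bounce that line across the interconnect.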
SLIDE 93

Better scaling with our modifications

[Graph: speedup at 4-48 cores, stock vs. patched kernel, for Exim, memcached, Apache, PostgreSQL, gmake, Psearchy, Metis]

Y-axis: (throughput with 48 cores) / (throughput with one core)

  • Most of the scalability is due to the Linux community's efforts

SLIDE 94

Current bottlenecks

  Application  Bottleneck
  memcached    HW: transmit queues on NIC
  Apache       HW: receive queues on NIC
  Exim         App: contention on spool directories
  gmake        App: serial stages and stragglers
  PostgreSQL   App: spin lock
  Psearchy     HW: cache capacity
  Metis        HW: DRAM throughput

  • Kernel code is not the bottleneck
  • Further kernel changes might help apps or HW
SLIDE 95

Limitations

  • Results limited to 48 cores and a small set of applications
  • Looming problems:
  • fork/virtual memory book-keeping
  • Page allocator
  • File system
  • Concurrent modifications to address space
  • In-memory FS instead of disk
  • 48-core AMD machine ≠ single 48-core chip
SLIDE 96

Related work

  • Linux and Solaris scalability studies [Yan 09,10] [Veal 07] [Tseng 07] [Jia 08] ...
  • Scalable multiprocessor Unix variants
  • Flash, IBM, SGI, Sun, …
  • 100s of CPUs
  • Linux scalability improvements
  • RCU, NUMA awareness, …
  • Our contribution:
  • In-depth analysis of kernel intensive applications
SLIDE 97

Conclusion

  • Linux has scalability problems
  • They are easy to fix or avoid up to 48 cores

http://pdos.csail.mit.edu/mosbench
