April 6, 2016 ASPLOS 2016 Atlanta, Georgia.
Yaron Weinsberg
IBM Research
Idit Keidar
Technion
Hagar Porat
Technion
Eran Harpaz
Technion
Noam Shalev
Technion
Technology scaling
Many core is here
Machines with a thousand cores
Technology scaling
Nano-scale phenomena
Hardware reliability decreases [Radetzki et al., 2013]
Faults more likely
A strategy for overcoming Core Surprise Removal (CSR)
Implementation in the Linux kernel.
Use Hardware Transactional Memory to cope with
Provide a proof of concept on a real system.
Chip Multi-Processor System
Fault-prone cores
Reliable Failure Detection Unit (FDU) [Weis et al., 2012]
Fail-Stop Model
[Figure: chip multi-processor diagram. Four core pairs, each drawn as Core, Core, L1, L1, L2 Cache, plus an L3 Cache and Reliable Shared Memory.]
Flag as faulty
Reset interrupt affinities
Migrate tasklets, work-queues
Update kernel services
Terminate the running process
Migrate processes.
These recovery steps are OS dependent.
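The recovery steps listed above can be pictured with a small user-space model. This is a minimal sketch, not the kernel implementation: `struct core`, `struct machine`, `pick_target`, and `recover_core` are hypothetical names, the choice of the first live core as the migration target is an assumption, and interrupts, tasklets, and work-queues are reduced to plain counters.

```c
#include <assert.h>
#include <stdbool.h>

#define NCORES 4

/* Hypothetical per-core bookkeeping; real kernel state is far richer. */
struct core {
    bool faulty;      /* set once the core is reported as failed */
    int  irqs;        /* interrupts whose affinity points at this core */
    int  tasklets;    /* pending tasklets */
    int  work_items;  /* pending work-queue items */
    int  tasks;       /* runnable processes queued on this core */
    bool running;     /* a process was executing when the core stopped */
};

struct machine {
    struct core cores[NCORES];
    int terminated;   /* processes killed because their core died */
};

/* Assumption: migrate to the first live core other than `failed`. */
static int pick_target(const struct machine *m, int failed)
{
    for (int i = 0; i < NCORES; i++)
        if (i != failed && !m->cores[i].faulty)
            return i;
    return -1;
}

/* Apply the listed steps in order; returns the target core, or -1. */
int recover_core(struct machine *m, int failed)
{
    assert(failed >= 0 && failed < NCORES);
    struct core *dead = &m->cores[failed];

    dead->faulty = true;                      /* flag as faulty */
    int target = pick_target(m, failed);
    if (target < 0)
        return -1;                            /* no live core remains */
    struct core *live = &m->cores[target];

    live->irqs += dead->irqs;                 /* reset interrupt affinities */
    dead->irqs = 0;
    live->tasklets += dead->tasklets;         /* migrate tasklets */
    dead->tasklets = 0;
    live->work_items += dead->work_items;     /* migrate work-queues */
    dead->work_items = 0;
    /* "Update kernel services" has no counterpart in this toy model. */
    if (dead->running) {                      /* terminate the running process */
        dead->running = false;
        m->terminated++;
    }
    live->tasks += dead->tasks;               /* migrate processes */
    dead->tasks = 0;
    return target;
}
```

Calling `recover_core(&m, 2)` on a machine whose core 2 holds queued work moves all of that work to core 0 and counts the process that was running on core 2 as terminated.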
[Figure: recovery procedure diagram, built up over several slides. Labels: FDU Triggered, Mark Faulty, Reset Interrupts, Queue Work, Queue Tasklets, Recovery Workqueue, Tasklet Queue, Close Task, Migrate Workqueues, Migrate Processes, Update Services, Migrate Tasklets, Recovery Ops, Verify Visibility, Inform FDU, Ack, Resume.]
Use tasklets and work-queues to execute the recovery process
In a cascading failure case:
Mark as faulty
Reset Interrupts
Queue Work Close Task Migrate Workqueues Update Kernel Services Migrate Tasks
𝑫𝑬
Queue Tasklets Verify Visibility Inform FDU Execute Tasklets
𝑫𝑬
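One way to read the cascading failure case is that recovery work queued on a core is itself migrated if that core fails before finishing. The following is a toy sketch under that assumption, not the kernel code: `pending`, `start_recovery`, `run_queued`, `outstanding`, and the figure of three deferred operations per recovery are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define NCORES 4
#define OPS_PER_RECOVERY 3   /* hypothetical number of deferred recovery ops */

struct core {
    bool faulty;
    int  pending[NCORES];    /* pending[i]: queued recovery ops for failed core i */
};

static struct core cores[NCORES];

static int first_live(void)
{
    for (int i = 0; i < NCORES; i++)
        if (!cores[i].faulty)
            return i;
    return -1;
}

/* Core `failed` stopped: a live core inherits whatever the dead core still
 * had queued, then queues the ops for the dead core itself. */
int start_recovery(int failed)
{
    assert(failed >= 0 && failed < NCORES);
    cores[failed].faulty = true;
    int target = first_live();
    if (target < 0)
        return -1;
    for (int i = 0; i < NCORES; i++) {
        cores[target].pending[i] += cores[failed].pending[i];
        cores[failed].pending[i] = 0;
    }
    cores[target].pending[failed] += OPS_PER_RECOVERY;
    return target;
}

/* Execute up to `budget` queued ops on core `id`; returns how many ran. */
int run_queued(int id, int budget)
{
    int done = 0;
    for (int i = 0; i < NCORES; i++)
        while (cores[id].pending[i] > 0 && done < budget) {
            cores[id].pending[i]--;
            done++;
        }
    return done;
}

/* Recovery ops still waiting on live cores. */
int outstanding(void)
{
    int total = 0;
    for (int c = 0; c < NCORES; c++)
        if (!cores[c].faulty)
            for (int i = 0; i < NCORES; i++)
                total += cores[c].pending[i];
    return total;
}
```

If core 3 fails and core 0 then fails after running only one of the queued operations, `start_recovery(0)` hands the two remaining operations for core 3, plus the three new ones for core 0, to core 1.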
Designed to integrate into commodity operating systems
No overhead when the system is correct
Tolerates cascading failures
Scalable
Modified QEMU
Run different workloads
Recovery validation
[Figure: Core#0, Core#1, Core#2, Core#3]
[Figure: pie charts per workload. Legend: Successful Recovery, Scheduler Locks, FS/MM Locks, Other Locks. Slice labels as extracted:]
401.bzip2 x4: 70%, 8%, 5%, 17%
410.bwaves x4: 88%, 4%, 6%
K-means x8: 86%, 8%, 6%
K-means x16: 88%, 8%, 4%
429.mcf x4: 70%, 5%, 10%, 15%
Postmark x4: 68%, 10%, 12%, 10%
[Figure: Workload Properties. CPU time split into User, System, IOWait, Idle for Postmark, 429.mcf, K-means, 410.bwaves, 401.bzip2. Bar labels as extracted: 5% 22% 99% 99% 99% 21% 14% 1% 1% 1% 45% 19%]
A strategy for overcoming Core Surprise Removal (CSR)
Implementation in the Linux kernel.
Use Hardware Transactional Memory to cope with
Provide a proof of concept on a real system.
Replace scheduler locks with lock elision code
TSX is a best-effort HTM
Energy Saving   Performance Gain   Commit Rate   Workload
4%                                               Idle
1%              0%                 99.9%         16-threads
3%              3%                 99.9%         32-threads
2%              4%                 99.8%         64-threads
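Replacing a lock with lock elision code can be sketched in user space as follows. This is a generic illustration of the technique, not the kernel patch from the talk: `struct elided_lock`, `elided_acquire`, `elided_release`, and the retry budget are hypothetical. The RTM intrinsics (`_xbegin`, `_xend`, `_xabort`) are only compiled when the compiler defines `__RTM__` (for example with `-mrtm`) and then require a CPU with TSX; otherwise the code is an ordinary spinlock.

```c
#include <assert.h>
#include <stdatomic.h>
#ifdef __RTM__
#include <immintrin.h>        /* _xbegin, _xend, _xabort */
#endif

#define ELISION_RETRIES 3     /* hypothetical retry budget */

struct elided_lock {
    atomic_int held;          /* 0 = free, 1 = held */
};

void elided_acquire(struct elided_lock *l)
{
#ifdef __RTM__
    for (int i = 0; i < ELISION_RETRIES; i++) {
        if (_xbegin() == _XBEGIN_STARTED) {
            /* Reading the lock word adds it to the read set, so a thread
             * that really takes the lock aborts this transaction. */
            if (atomic_load_explicit(&l->held, memory_order_relaxed) == 0)
                return;       /* run the critical section transactionally */
            _xabort(0xff);    /* lock is taken: abandon this attempt */
        }
        /* After an abort, _xbegin returns a status other than
         * _XBEGIN_STARTED and the loop retries. TSX is best effort,
         * so a fallback path is always needed. */
    }
#endif
    /* Fallback: actually take the lock (plain spinlock). */
    int expected = 0;
    while (!atomic_compare_exchange_weak(&l->held, &expected, 1))
        expected = 0;
}

void elided_release(struct elided_lock *l)
{
#ifdef __RTM__
    if (atomic_load_explicit(&l->held, memory_order_relaxed) == 0) {
        _xend();              /* lock was elided: commit the transaction */
        return;
    }
#endif
    atomic_store(&l->held, 0);   /* lock was really held: release it */
}
```

A critical section is wrapped as `elided_acquire(&l); ...; elided_release(&l);`. On the transactional path the lock word is never written, so nothing is left held if the section does not complete.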
interrupts_disable();                            // unresponsive
if (fault_injection() == smp_processor_id())
    while (TRUE)
        ;                                        // "stops" executing
Crash simulation on a real system
Failure is detected
Core #13 has no tasks
Tasks migrated to core #0
Load is balanced
Initial correct state, cloud setting
After a crash, original kernel
Real Time: 7:58
Real Time: 8:00
Initial correct state, cloud setting
After a crash, CSR on host
Real Time: 8:00
Real Time: 7:58