SLIDE 1

Online Teaching

• Lectures are delivered live over Zoom at class time.
  - Also recorded for offline viewing after class.
  - Time to become a Zoom master :)
• Project demos will be done live in class (preferred if possible) or through prerecorded videos to be played in class.
  - Prepare a prerecorded video as backup even if you plan a live demo.
  - Send your video and slides to Ruixuan by 11am on demo day.
  - Use YouTube or WU Box.
  - Feel free to adapt your projects.
• Your feedback is welcome!
  - We will work together to optimize online learning.

SLIDE 2

Coming Up

• Demo II: 3/31 (next Tuesday)
  - 10 min per team.
  - Email Ruixuan your slides and videos by 11am.
  - Gearing up for the final demo.
• Critique #3: 4/7
  - S. Xi, M. Xu, C. Lu, L. Phan, C. Gill, O. Sokolsky and I. Lee, "Real-Time Multi-Core Virtual Machine Scheduling in Xen," ACM International Conference on Embedded Software (EMSOFT'14), October 2014.

SLIDE 3

Parallel Real-Time Systems for Latency-Critical Applications

Chenyang Lu

CSE 520S

SLIDE 4

Cyber-Physical Systems (CPS)

NSF Cyber-Physical Systems Program Solicitation: CPS are built from, and depend upon, the seamless integration of computational algorithms and physical components. Since the application interacts with the physical world, its computation must be completed under a time constraint.

[Figure: Real-Time Hybrid Simulation (RTHS) across the cyber-physical boundary, at the Robert L. and Terry L. Bowen Large Scale Structures Laboratory, Purdue University]

SLIDE 5

Cyber-Physical Systems (CPS)

[Figure: the same RTHS example, highlighting the cyber-physical boundary]

SLIDE 6

Interactive Cloud Services (ICS)

Need to respond within 100 ms for users to find the service responsive.*

* Jeff Dean et al. (Google), "The tail at scale," Communications of the ACM 56.2 (2013).

[Figure: web-search pipeline — query → doc index search → 2nd-phase ranking → snippet generator → response]

SLIDE 7

Interactive Cloud Services (ICS)

Need to respond within 100 ms for users to find the service responsive.* E.g., web search, online gaming, stock trading, etc.

* Jeff Dean et al. (Google), "The tail at scale," Communications of the ACM 56.2 (2013).

SLIDE 8

Real-Time Systems

The performance of these systems depends not only on their functional behavior but also on their temporal behavior. Real-time performance means either: (1) providing a hard guarantee that jobs meet their deadlines (e.g., CPS), or (2) optimizing latency-related objectives for jobs (e.g., ICS).

[Figure: multiple jobs (Job 1, Job 2, Job 3) scheduled on the cores of a single multi-core machine]

SLIDE 9

New Generation of Real-Time Systems

Characteristics:
• New classes of applications with complex functionality
• Increasing computational demand of each application
• Consolidation of multiple applications onto a shared platform
• Rapid increase in the number of cores per chip

Demand: leverage parallelism within the applications to improve real-time performance and system efficiency.

[Figure: jobs running on the cores of a single multi-core machine]

SLIDE 10

Parallelism Improves RTHS Accuracy

An RTHS simulates a nine-story building with a first-story damper.
• Previously, sequential processing power limited the simulation rate to 575 Hz.
• Parallel execution now allows a rate of 3000 Hz.

SLIDE 11

Parallelism Improves RTHS Accuracy

An RTHS simulates a nine-story building with a first-story damper.
• Previously, sequential processing power limited the simulation rate to 575 Hz.
• Parallel execution now allows a rate of 3000 Hz.
• Reduction in error for both acceleration and displacement.
• Parallelism increases accuracy via faster actuation and sensing.

[Plot: normalized error (%) over time (sec), sequential (575 Hz) vs. parallel (3000 Hz)]

SLIDE 12

State of the Art

• Real-time systems
  - Schedule multiple sequential jobs on a single core
  - Schedule multiple sequential jobs on multiple cores
• Parallel runtime systems
  - Schedule a single parallel job
  - Schedule multiple parallel jobs to optimize fairness or throughput
• New: parallel real-time systems for latency-critical applications

SLIDE 13

Challenges for Parallel Real-Time Systems

Develop provably good and practically efficient real-time systems for parallel applications.

Theory: How to provide real-time performance for multiple parallel jobs?
Systems: How to build parallel real-time systems that are efficient and scalable?

SLIDE 14

Parallel Job – Directed Acyclic Graph (DAG)

The DAG model naturally captures programs generated by parallel languages such as Cilk Plus, Threading Building Blocks, and OpenMP.
• Node: sequential computation
• Edge: dependence between nodes
• Work Ci: execution time on one core

[Figure: example DAG with Ci = 18, Li = 9]

SLIDE 15

Parallel Job – Directed Acyclic Graph (DAG)

The DAG model naturally captures programs generated by parallel languages such as Cilk Plus, Threading Building Blocks, and OpenMP.
• Node: sequential computation
• Edge: dependence between nodes
• Work Ci: execution time on one core
• Span (critical-path length) Li: execution time on ∞ cores

[Figure: example DAG with Ci = 18, Li = 9]
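To make the two metrics concrete, here is a minimal sketch (my own illustration, not from the slides; the DAG layout and node times are hypothetical) that computes work as the sum of node execution times and span as the longest path through the DAG:

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

// Minimal DAG: nodes[i].time is the node's sequential execution time and
// nodes[i].preds lists its predecessors. Nodes are assumed to be indexed
// in topological order (every edge goes from lower to higher index).
struct Node {
    double time;
    std::vector<int> preds;
};

// Work C_i: total execution time on one core (sum over all nodes).
double work(const std::vector<Node>& dag) {
    double c = 0;
    for (const Node& n : dag) c += n.time;
    return c;
}

// Span L_i: critical-path length, i.e., execution time on infinitely many cores.
double span(const std::vector<Node>& dag) {
    std::vector<double> finish(dag.size(), 0);  // earliest finish time per node
    double l = 0;
    for (size_t i = 0; i < dag.size(); ++i) {
        double start = 0;
        for (int p : dag[i].preds) start = std::max(start, finish[p]);
        finish[i] = start + dag[i].time;
        l = std::max(l, finish[i]);
    }
    return l;
}

int main() {
    // Hypothetical fork-join DAG: source -> three parallel nodes -> sink.
    std::vector<Node> dag = {
        {2, {}}, {5, {0}}, {7, {0}}, {5, {0}}, {2, {1, 2, 3}},
    };
    std::printf("work C = %g, span L = %g\n", work(dag), span(dag));  // C = 21, L = 11
    return 0;
}
```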

SLIDE 16

Parallel Real-Time Task Model

A task periodically releases DAG jobs with deadlines.

[Figure: Task 1 releases Job 1 and Job 2, each with Di = 12; deadline Di = period]

SLIDE 17

Parallel Real-Time Task Model

A task periodically releases DAG jobs with deadlines.

[Figure: Task 1 releases Job 1 and Job 2 (Di = 12); each task is characterized by its deadline Di = period, worst-case work Ci, and worst-case span Li]

SLIDE 18

Parallel Real-Time Task Model

A task periodically releases DAG jobs with deadlines. Multiple tasks are scheduled on a multi-core system. Goal of the system: guarantee that all tasks meet all their deadlines.

[Figure: Task 1 (Di = 12) and Task 2 (Di = 9) each release a sequence of DAG jobs on the multi-core system]
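To make the release pattern concrete (the indices and notation here are mine, not from the slides): under this implicit-deadline model, a task with deadline D_i equal to its period releases its k-th job at time k·D_i, and that job must complete by the next release:

```latex
r_{i,k} = k \cdot D_i, \qquad d_{i,k} = r_{i,k} + D_i = (k+1) \cdot D_i
% Example with D_i = 12 (as in the figure): releases at t = 0, 12, 24, ...;
% the job released at t = 12 must finish by t = 24.
```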

SLIDE 19

Federated Scheduling

For parallel tasks, federated scheduling (FS) has the best known bound in terms of schedulability. FS assigns ni dedicated cores to each parallel task, where ni is the minimum number of cores the task needs to meet its deadline:

$n_i = \left\lceil \frac{C_i - L_i}{D_i - L_i} \right\rceil$

(deadline Di = period, worst-case work Ci, worst-case span Li)

[Figure: tasks mapped onto disjoint sets of dedicated cores]
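As a minimal sketch of this assignment rule (my own illustration, not the authors' implementation; the task parameters are hypothetical, and tasks with C ≤ D are treated simplistically here, whereas the full algorithm handles low-utilization tasks separately on shared cores):

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// A parallel real-time task: worst-case work C, worst-case span L,
// and deadline D (implicit deadline, D = period).
struct Task {
    double C, L, D;
};

// Federated scheduling: minimum number of dedicated cores for a task,
// n_i = ceil((C_i - L_i) / (D_i - L_i)). Requires L < D; otherwise the
// task cannot meet its deadline even on infinitely many cores.
int coresNeeded(const Task& t) {
    if (t.C <= t.D) return 1;  // simplification: one core already suffices
    return static_cast<int>(std::ceil((t.C - t.L) / (t.D - t.L)));
}

int main() {
    std::vector<Task> tasks = {{18, 9, 12}, {30, 5, 10}};  // hypothetical tasks
    int m = 16;  // cores available
    int total = 0;
    for (const Task& t : tasks) {
        int n = coresNeeded(t);  // 3 and 5 cores for these two tasks
        total += n;
        std::printf("task (C=%g, L=%g, D=%g): %d dedicated cores\n", t.C, t.L, t.D, n);
    }
    std::printf("schedulable on %d cores: %s\n", m, total <= m ? "yes" : "no");
    return 0;
}
```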

SLIDE 20

Empirical Comparison

FS platform:
• Middleware platform providing the FS service in Linux
• Works with the GNU OpenMP runtime system
• Runs OpenMP programs with minimal modification
Compared against our Global Earliest Deadline First (GEDF) platform.

Experimental setup:
• Linux kernel 3.10.5 with LITMUS^RT patch
• 16-core machine with 2 Intel Xeon E5-2687W processors
• GCC version 4.6.3 with OpenMP
• Each data point has 100 task sets
• Each task is randomly generated with parallel for-loops

slide-21
SLIDE 21

0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0.2 0.4 0.6 0.8 Normalized System Utilization GEDF FS

Empirical Comparison

21

  • Linux kernel 3.10.5 with

LITMUSRT patch

  • 16-core machine with 2 Intel

Xeon E5-2687W processors

  • GCC version 4.6.3. with OpenMP
  • Each data point has 100 task sets
  • Each task is randomly generated

with parallel for-loops Harder to schedule Better performance

normalized system utilization

Fraction of Task Sets Missing Deadlines

= Ci Di

i

m

m: #cores
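A hypothetical worked example of this normalization (the numbers are mine): four tasks, each with utilization C_i/D_i = 2 (i.e., each needing two cores' worth of work per period), on an m = 16-core machine:

```latex
U = \frac{\sum_i C_i / D_i}{m} = \frac{4 \times 2}{16} = 0.5
```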

SLIDE 22

Empirical Comparison

[Plot: fraction of task sets missing deadlines vs. normalized system utilization (0.1 to 1.0) for GEDF and FS; 52% of task sets become schedulable under FS]

Experimental setup:
• Linux kernel 3.10.5 with LITMUS^RT patch
• 16-core machine with 2 Intel Xeon E5-2687W processors
• GCC version 4.6.3 with OpenMP
• Each data point has 100 task sets
• Each task is randomly generated with parallel for-loops

SLIDE 23

Summary of Federated Scheduling

For parallel real-time systems that must guarantee meeting deadlines, federated scheduling has:
• the best known theoretical bound in terms of schedulability
• better empirical performance compared to GEDF
RTHS has used the FS platform to improve system performance.

[Figure: tasks mapped to dedicated cores]

The End?

SLIDE 24

Issue with the Classic System Model

To guarantee that all tasks meet all deadlines in all cases, the classic system model uses the worst-case work for analysis. But the worst-case work is significantly larger than the average work, so the average system utilization is very low in practice.

[Figure: a job whose work is 10 ms in most cases but 100 ms in very rare cases; provisioning three cores for the 100 ms worst case leaves them mostly idle]

SLIDE 25

Mixed-Criticality in Cars

Features with different criticality levels:
• Safety-critical features
• Infotainment features

[Figure: display system with car navigation and infotainment]

SLIDE 26

Toy Example of MC System

• High-criticality task: deadline 40 ms; worst-case work 100 ms (very rare cases), most-case work 10 ms.
• Low-criticality task: deadline 40 ms; most-case work 80 ms.

[Figure: core requirements under the worst-case vs. most-case workloads]

SLIDE 27

Most-Case vs. Worst-Case Scenarios

Single-criticality systems need to model the worst-case scenario.

[Figure: five-core schedules for the very rare worst case (100 ms work) vs. the common case (10 ms work), alongside the low-criticality task's 80 ms work and the 40 ms deadlines]

SLIDE 28

MC Model Improves Resource Efficiency

Mixed-criticality systems provide different levels of real-time guarantees:
• Most cases: guarantee that both high- and low-criticality tasks meet deadlines.
• Very rare cases (overrun): only guarantee that high-criticality tasks meet deadlines.

[Figure: schedules in the common case vs. after an overrun of the 10 ms typical work toward the 100 ms worst case]

SLIDE 29

MCFS Algorithm at a High Level

For each parallel task, calculate and assign:
(1) dedicated cores in the typical-state

[Figure: typical-state (most cases) assignment of m cores among two high-criticality tasks and one low-criticality task]

SLIDE 30

MCFS Algorithm at a High Level

For each parallel task, calculate and assign:
(1) dedicated cores in the typical-state
(2) dedicated cores in the critical-state

[Figure: core assignments in the typical-state (most cases) vs. the critical-state (rare case), where high-criticality tasks receive more cores]

SLIDE 31

MCFS Algorithm at a High Level

For each parallel task, calculate and assign:
(0) a virtual deadline
(1) dedicated cores in the typical-state
(2) dedicated cores in the critical-state
If a job has not completed by its virtual deadline, it transitions to the critical-state.

[Figure: typical-state and critical-state core assignments, with the virtual deadline marking the transition]

SLIDE 32

MCFS Algorithm at a High Level

For each parallel task, calculate and assign:
(0) a virtual deadline
(1) dedicated cores in the typical-state
(2) dedicated cores in the critical-state
If a job has not completed by its virtual deadline, it transitions to the critical-state. MCFS jointly assigns virtual deadlines and cores to maximize utilization while guaranteeing task deadlines.

[Figure: typical-state and critical-state core assignments, with the virtual deadline marking the transition]
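A minimal sketch of the mode-transition logic described above (my own illustration; the struct fields, names, and numbers are hypothetical, not the MCFS implementation):

```cpp
#include <cstdio>

// Per-task MCFS parameters: a virtual deadline plus two core budgets,
// one for each system state. (Field names are hypothetical.)
struct McfsTask {
    double virtualDeadline;  // relative to job release
    int typicalCores;        // dedicated cores in the typical-state
    int criticalCores;       // dedicated cores in the critical-state (>= typicalCores)
};

enum class State { Typical, Critical };

// Called as time advances for the task's current job. If the job is still
// running past its virtual deadline, the task transitions to the
// critical-state and is granted its larger core assignment.
int coresGranted(const McfsTask& t, double elapsed, bool jobDone, State& s) {
    if (s == State::Typical && !jobDone && elapsed >= t.virtualDeadline)
        s = State::Critical;
    return s == State::Critical ? t.criticalCores : t.typicalCores;
}

int main() {
    McfsTask t{8.0, 2, 5};  // hypothetical: virtual deadline 8 ms, 2 vs. 5 cores
    State s = State::Typical;
    std::printf("at 5 ms: %d cores\n", coresGranted(t, 5.0, false, s));  // 2 (typical)
    std::printf("at 9 ms: %d cores\n", coresGranted(t, 9.0, false, s));  // 5 (critical)
    return 0;
}
```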

SLIDE 33

MCFS Implementation

In the typical-state, MCFS assigns dedicated cores to all tasks.

[Figure: MCFS middleware between Linux and per-task OpenMP runtimes; high-criticality (HC) and low-criticality (LC) threads run on dedicated cores]

SLIDE 34

MCFS Implementation

In the critical-state, MCFS increases the cores assigned to high-criticality tasks.

[Figure: additional HC threads take over cores previously assigned to LC threads]

SLIDE 35

MCFS Implementation

Additional HC threads are put to sleep, at higher priority, on cores running LC threads, so they can take over immediately upon a transition to the critical-state.

[Figure: sleeping HC threads placed on cores currently running LC threads]

SLIDE 36

Empirical Evaluations

[Plot: results for four settings; y-axis 0%–100%]

Experimental setup:
• Linux with RT_PREEMPT patch version 4.1.7-rt8
• 16-core machine with 2 Intel Xeon E5-2687W processors
• GCC version 4.6.3 with OpenMP
• Each data point has 100 task sets
• Each task is randomly generated with parallel for-loops

SLIDE 37

Empirical Evaluations

[Plot: results for four settings; y-axis 0%–100%]

Experimental setup:
• Linux with RT_PREEMPT patch version 4.1.7-rt8
• 16-core machine with 2 Intel Xeon E5-2687W processors
• GCC version 4.6.3 with OpenMP
• Each data point has 100 task sets
• Each task is randomly generated with parallel for-loops

SLIDE 38

Issue with the Analysis of Parallel Jobs

Centralized greedy scheduler:
• Worker threads get work (nodes) from a centralized queue.

Implicit assumption of parallel real-time scheduling theory: when a thread (core) is allowed to work on a job, it must be able to find the available nodes immediately (within bounded time). The centralized queue becomes a bottleneck for the scalability of large-scale systems.

[Figure: worker threads pulling nodes from a single centralized queue]

SLIDE 39

Issue with the Analysis of Parallel Jobs

Centralized greedy scheduler:
• Threads get work (nodes) from a centralized queue.
Randomized work-stealing:
• Threads usually get work locally.
• If a thread's local queue is empty, it steals randomly from another queue.

Comparison:
• Centralized greedy: predictable (bounded worst-case), but does not scale well.
• Randomized work-stealing: scalable (good scalability), but unbounded worst-case.

[Figure: centralized queue vs. per-thread local queues with random stealing]
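A minimal sketch of a randomized work-stealing worker loop (illustrative only; the types and names are hypothetical, and a production runtime such as Cilk Plus uses lock-free deques rather than a mutex per queue):

```cpp
#include <cstdio>
#include <deque>
#include <functional>
#include <mutex>
#include <optional>
#include <random>
#include <vector>

using Node = std::function<void()>;  // one DAG node = a unit of sequential work

// Per-worker deque, mutex-protected for simplicity. The owner pushes and
// pops at the bottom (newest work); thieves steal from the top (oldest work).
struct WorkerQueue {
    std::deque<Node> q;
    std::mutex m;

    void push(Node n) { std::lock_guard<std::mutex> g(m); q.push_back(std::move(n)); }
    std::optional<Node> popLocal() {
        std::lock_guard<std::mutex> g(m);
        if (q.empty()) return std::nullopt;
        Node n = std::move(q.back()); q.pop_back(); return n;
    }
    std::optional<Node> steal() {
        std::lock_guard<std::mutex> g(m);
        if (q.empty()) return std::nullopt;
        Node n = std::move(q.front()); q.pop_front(); return n;
    }
};

// One scheduling step for worker `self`: prefer local work; if the local
// queue is empty, pick a random victim and try to steal. A failed attempt
// finds nothing, which is why the worst case is unbounded.
bool runOneNode(std::vector<WorkerQueue>& workers, size_t self, std::mt19937& rng) {
    if (auto n = workers[self].popLocal()) { (*n)(); return true; }
    std::uniform_int_distribution<size_t> pick(0, workers.size() - 1);
    size_t victim = pick(rng);
    if (victim != self)
        if (auto n = workers[victim].steal()) { (*n)(); return true; }
    return false;
}

int main() {
    std::vector<WorkerQueue> workers(2);
    std::mt19937 rng(42);
    workers[1].push([] { std::puts("node executed"); });
    for (int attempt = 0; attempt < 10; ++attempt)
        if (runOneNode(workers, 0, rng)) break;  // worker 0 eventually steals from worker 1
    return 0;
}
```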

SLIDE 40

Empirical Comparisons

Randomized work-stealing for large-scale soft real-time systems?
• FS implementations (with scheduling overheads incorporated):
  - FSCG with the centralized greedy scheduler in GNU OpenMP
  - FSWS with randomized work-stealing in GNU Cilk Plus

Experimental setup:
• Linux with RT_PREEMPT patch version r14
• 32-core machine with 4 Intel Xeon E5-4620 processors
• GCC 5.1 with OpenMP and Cilk Plus
• Each data point is one task set
• Each task is randomly generated using the benchmark program Heat

SLIDE 41

Empirical Comparisons

Randomized work-stealing for large-scale soft real-time systems?

[Plot: deadline miss ratio (0.1–1.0) vs. percentage of utilization (20%–83%); the legend labels the work-stealing and centralized-greedy platforms RTWS and RTCG; higher load to the right, lower curves mean better performance]

FSCG and FSWS:
• Same computation
• Same resources
• Only difference: internal scheduling of parallel tasks

SLIDE 42

Empirical Comparisons

Randomized work-stealing for large-scale soft real-time systems?

[Plot: deadline miss ratio vs. percentage of utilization for the two platforms; lower curves mean better performance]

FSCG and FSWS:
• Same computation
• Same resources
• Only difference: internal scheduling of parallel tasks

SLIDE 43

Empirical Comparisons

Randomized work-stealing for large-scale soft real-time systems?

[Plot: deadline miss ratio vs. percentage of utilization for the two platforms]

The benefit of scalability in work-stealing dominates the increased variation in parallel execution times.

FSCG and FSWS:
• Same computation
• Same resources
• Only difference: internal scheduling of parallel tasks

SLIDE 44

Outline

• Contributions
• System Guaranteed to Meet Deadlines for Parallel Jobs in CPS
• System Optimized to Meet Target Latency for ICS
• Future Work

SLIDE 45

System for Interactive Cloud Services

Online system: we do not know when jobs arrive. Objective: optimize latency-related objectives for the service, e.g., average latency or maximum latency.

SLIDE 46

System for Interactive Cloud Services

Online system: we do not know when jobs arrive. Objective: maximize the number of jobs that meet a target latency T.

[Figure: web-search pipeline — query → aggregators → doc index search → 2nd-phase ranking → snippet generator]

SLIDE 47

Workload Distribution Has a Long Tail

Bing search workload:
• Large jobs must run in parallel to meet the target latency.
• Should we always run large jobs at full parallelism?

[Plot: distribution of job sequential execution time (work, ms), with the target latency marked in the tail]

SLIDE 48

Parallelize Large Jobs According to Load

Tail-Control strategy: when load is low, run all jobs in parallel; when load is high, run large jobs sequentially.

Latency = processing time + waiting time.
• At low load, processing time dominates latency.
• At high load, waiting time dominates latency.

[Figure: two three-core schedules against the target latency, one missing 0 requests and one missing 1 request]
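A minimal sketch of this decision rule (my own illustration; the load threshold, the large-job threshold, and the names are hypothetical, not the published tail-control algorithm, which derives its thresholds from the measured work distribution and the target latency):

```cpp
#include <cstddef>

// Tail-control decision sketch: choose how many cores a job may use,
// based on the current system load and the job's expected work.
// Threshold values here are hypothetical placeholders.
struct TailControl {
    double highLoad;      // load above which large jobs are serialized
    double largeWorkMs;   // expected work above which a job counts as "large"
    std::size_t maxCores; // full parallelism

    std::size_t coresFor(double currentLoad, double expectedWorkMs) const {
        // Low load: processing time dominates latency, so parallelize everything.
        if (currentLoad < highLoad) return maxCores;
        // High load: waiting time dominates; run large jobs sequentially so
        // they do not monopolize cores and inflate everyone's waiting time.
        return (expectedWorkMs >= largeWorkMs) ? 1 : maxCores;
    }
};

int main() {
    TailControl tc{0.7, 50.0, 16};          // hypothetical parameters
    std::size_t a = tc.coresFor(0.3, 80.0); // low load, large job  -> 16 cores
    std::size_t b = tc.coresFor(0.9, 80.0); // high load, large job -> 1 core
    std::size_t c = tc.coresFor(0.9, 5.0);  // high load, small job -> 16 cores
    return (a == 16 && b == 1 && c == 16) ? 0 : 1;
}
```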

SLIDE 49

The Inner Workings of Tail-Control

We implement the tail-control algorithm in the runtime system of Intel Threading Building Blocks and evaluate it on a Bing search workload.

SLIDE 50

The Inner Workings of Tail-Control

We implement the tail-control algorithm in the runtime system of Intel Threading Building Blocks and evaluate it on a Bing search workload.

[Figure: default work-stealing behavior relative to the target latency]

SLIDE 51

The Inner Workings of Tail-Control

We implement the tail-control algorithm in the runtime system of Intel Threading Building Blocks and evaluate it on a Bing search workload.

[Figure: default work-stealing behavior relative to the target latency]

SLIDE 52

The Inner Workings of Tail-Control

We implement the tail-control algorithm in the runtime system of Intel Threading Building Blocks and evaluate it on a Bing search workload.

[Figure: default work-stealing behavior relative to the target latency]

SLIDE 53

Conclusion

Exploit the untapped efficiency in parallel computing platforms to drastically improve the real-time performance of applications.
• System Guaranteed to Meet Deadlines for CPS
  - Develop provably good schedulers for parallel applications
  - Incorporate real-time scheduling into the parallel runtime system
  - Improve system efficiency by dealing with uncertainty in jobs
  - Address the system scalability issue due to internal scheduling
• System Optimized to Meet Target Latency for ICS
  - Design and implement a strategy to optimize real-time performance

SLIDE 54

References

• J. Li, S. Dinh, K. Kieselbach, K. Agrawal, C. Gill and C. Lu, "Randomized Work Stealing for Large Scale Soft Real-Time Systems," IEEE Real-Time Systems Symposium (RTSS'16).
• J. Li, D. Ferry, S. Ahuja, K. Agrawal, C. Gill and C. Lu, "Mixed-Criticality Federated Scheduling for Parallel Real-Time Tasks," RTAS'16. Outstanding Paper Award.
• J. Li, Y. He, S. Elnikety, K. S. McKinley, K. Agrawal, A. Lee and C. Lu, "Work Stealing for Interactive Services to Meet Target Latency," ACM Symposium on Principles and Practice of Parallel Programming (PPoPP'16).
• J. Li, Z. Luo, D. Ferry, K. Agrawal, C. Gill and C. Lu, "Global EDF Scheduling for Parallel Real-Time Tasks," Real-Time Systems (RTS), 51(4): 395-439, July 2015.
• J. Li, K. Agrawal, C. D. Gill and C. Lu, "Federated Scheduling for Stochastic Parallel Real-time Tasks," IEEE International Conference on Embedded and Real-Time Computing Systems and Applications (RTCSA'14).
• J. Li, J.-J. Chen, K. Agrawal, C. Lu, C. D. Gill and A. Saifullah, "Analysis of Federated and Global Scheduling for Parallel Real-Time Tasks," Euromicro Conference on Real-Time Systems (ECRTS'14).
• A. Saifullah, D. Ferry, J. Li, K. Agrawal, C. Lu and C. D. Gill, "Parallel Real-Time Scheduling of DAGs," IEEE Transactions on Parallel and Distributed Systems (TPDS), 2014.
• J. Li, K. Agrawal, C. Lu and C. D. Gill, "Analysis of Global EDF for Parallel Tasks," Euromicro Conference on Real-Time Systems (ECRTS'13). Outstanding Paper Award.
• A. Saifullah, J. Li, K. Agrawal, C. Lu and C. D. Gill, "Multi-core Real-Time Scheduling for Generalized Parallel Task Models," Real-Time Systems (RTS), 49(4): 404-435, July 2013.
• D. Ferry, J. Li, M. Mahadevan, K. Agrawal, C. D. Gill and C. Lu, "A Real-Time Scheduling Service for Parallel Tasks," IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS'13).