IBM Research
12/3/2009
An Empirical Study of High Availability in Stream Processing Systems
Yu Gu, Zhe Zhang, Fan Ye, Hao Yang, Minkyong Kim, Hui Lei, Zhen Liu

Stream Processing Model
software operators (PEs)
[Figure sequence: elements 1 to 5 flowing from upstream node U to downstream node D, followed by a checkpoint]
Elements 1 and 2 have been processed and checkpointed.
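The interaction between upstream node U, downstream node D, and the checkpoint can be sketched roughly as below. This is a minimal illustration, not the system's implementation: it assumes that U retains sent elements until D reports them as checkpointed and that a standby replays the retained elements after restoring the last checkpoint. The class names `Upstream` and `Downstream`, the running-sum operator state, and the use of the element value as its sequence number are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Upstream:
    """Hypothetical upstream node U: keeps sent elements until acknowledged."""
    buffer: list = field(default_factory=list)

    def send(self, element):
        # Retain a copy so the element can be replayed after a failure.
        self.buffer.append(element)
        return element

    def ack_checkpoint(self, last_seq):
        # Assumption: elements covered by a checkpoint are no longer needed.
        self.buffer = [e for e in self.buffer if e > last_seq]

    def replay(self):
        return list(self.buffer)


@dataclass
class Downstream:
    """Hypothetical downstream node D with a running sum as operator state."""
    total: int = 0
    last_seq: int = 0  # element values double as sequence numbers here

    def process(self, element):
        self.total += element
        self.last_seq = element

    def checkpoint(self):
        return {"total": self.total, "last_seq": self.last_seq}

    @classmethod
    def restore(cls, snapshot):
        return cls(total=snapshot["total"], last_seq=snapshot["last_seq"])


def run_demo():
    u, d = Upstream(), Downstream()
    for e in (1, 2):
        d.process(u.send(e))
    snapshot = d.checkpoint()            # 1 and 2 processed and checkpointed
    u.ack_checkpoint(snapshot["last_seq"])
    for e in (3, 4):
        d.process(u.send(e))
    # D fails here: everything after the checkpoint is lost on D.
    standby = Downstream.restore(snapshot)
    for e in u.replay():                 # U replays the retained elements
        standby.process(e)
    return standby.total, standby.last_seq, u.replay()
```

In this sketch the standby ends up with the same state D had before failing, because the only lost work is the set of elements U still holds.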
[Figure sequence: sub job 1 and sub job 2 shown with Site 1 and Site 2, including a snapshot of the whole sub job]
Components: REC, CM, FM, JMN
– manage HA protection for distributed jobs
– manage job deployment
– manage checkpoint tasks according to assigned checkpoint mechanism
– monitor other nodes and initiate recovery
– take data from upstream, execute processing tasks, and send results to downstream
– A distributed job consists of multiple subjobs, each of which can choose its own specific HA mechanism (AS, PS)
– The system coordinates the deployment and protection of subjobs among all machines
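The idea that each subjob picks its own HA mechanism can be illustrated with a small data-structure sketch. It assumes that AS and PS stand for active standby and passive standby; the names `HAMode`, `SubJob`, and `protection_plan`, the machine names, and the wording of the plan entries are hypothetical and not taken from the system described in the slides.

```python
from dataclasses import dataclass
from enum import Enum


class HAMode(Enum):
    AS = "active standby"    # assumed expansion of AS
    PS = "passive standby"   # assumed expansion of PS


@dataclass(frozen=True)
class SubJob:
    name: str
    ha_mode: HAMode
    primary: str   # machine running the subjob
    backup: str    # machine protecting the subjob


def protection_plan(subjobs):
    """Return a per-subjob description of what the backup machine does."""
    plan = {}
    for sj in subjobs:
        if sj.primary == sj.backup:
            raise ValueError(f"{sj.name}: backup must differ from primary")
        if sj.ha_mode is HAMode.AS:
            action = "run replica"          # backup processes in parallel
        else:
            action = "store checkpoints"    # backup only holds saved state
        plan[sj.name] = f"{action} on {sj.backup}"
    return plan


job = [
    SubJob("subjob1", HAMode.AS, primary="m1", backup="m2"),
    SubJob("subjob2", HAMode.PS, primary="m2", backup="m3"),
]
plan = protection_plan(job)
```

Keeping the mode on the subjob rather than on the job lets one job mix both mechanisms, which is what the bullet above describes.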
3000 elements/second
checkpoint interval = 500 ms
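As a rough worked example, and assuming the 3000 elements/second rate and the 500 ms checkpoint interval quoted on the slides describe the same experiment (the slides do not state this explicitly), the number of elements arriving between two consecutive checkpoints can be computed as follows. The function name `elements_per_interval` is hypothetical.

```python
def elements_per_interval(rate_per_second: int, interval_ms: int) -> int:
    """Elements that arrive during one checkpoint interval."""
    return rate_per_second * interval_ms // 1000


between_checkpoints = elements_per_interval(3000, 500)
```

Under that assumption the result is 1500 elements per interval, which gives a sense of how much input accumulates between checkpoints.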
1. “Fault tolerance in the Borealis distributed stream processing system” (SIGMOD ’05)
– A variant of AS
– Achieves a flexible trade-off between availability and consistency by introducing the concept of tentative data
2. “Fast and reliable stream processing over wide area networks” (ICDE ’07)
– A variant of AS
– Most expensive variant; the upstream sends to all downstream replicas
– No switch required when a failure occurs
3. “A cooperative, self-configuring high-availability solution for stream processing” (ICDE ’07)
– A variant of PS
– Novel checkpoint scheduling and backup assignment
– Balances recovery load over multiple servers
4. “Borealis-R: a replication-transparent stream processing system for wide-area monitoring applications” (SIGMOD ’08)
– A variant of AS
– Same technique as in [2]
– Novel mechanism that allows replicas to execute without coordination but still produce consistent results
5. “Towards automatic fault recovery in System-S” (ICAC ’07)
– Checkpoint state
– Recovery of JMN, not jobs
6. “Failure recovery in cooperative data streaming analysis” (ARES ’07)
– How to select a backup site on demand, not a recovery technique
7. “Online failure forecast for fault-tolerant data stream processing” (ICDE ’08)
– Prediction of potential failures, a monitoring technique
– Leverages various system metrics (system productivity, available CPU, etc.) to predict failures before they occur
8. “High-availability algorithms for distributed stream processing” (ICDE ’05)
– Valuable summaries of basic tradeoffs
– PS variant has large overhead
– Evaluation mainly based on simulations