Recursive Restarts for HA: We have crash-only components – now what?

Automatic Failure-Path Inference
George Candea, Mauricio Delgado, Michael Chen, Fang Sun, Armando Fox, Pedram Keyani
Stanford University
1/14/2003

Recursive Restarts for HA
• We have crash-only components – now what?
• Reduce recovery time by doing partial restarts: attempt recovery of a minimal subset of components
• What if restart ineffective? recover progressively larger subsets
• Chase fault through successive boundaries
• Demonstrated 4x improvement in recovery time on Mercury (stateless, crash-proof satellite ground station)
• How do we navigate the fault boundaries? ...

Fault Dependency Graph
• Use a graph that depicts how faults propagate in the system (f-map)
• Challenges:
  • Problem-determination literature assumes graph is magically available
  • Internet systems evolve rapidly → hard to keep sys and graph in sync
  • Many failures result from idiosyncratic system/environment interactions, which can't be guessed just by looking at the app
• Desired process properties:
  • don't use explicit model
  • application generic/independent
  • automatic
  • dynamic

Automatic Failure-Path Inference
• Look at what people do: train by placing themselves in unexpected situations; self-managing systems should do the same → introspection
1. Staging phase (active/invasive):
  1. inject faults
  2. observe system's reaction
  3. add inferred propagation paths to global failure propagation map
2. Production phase (passive/orthogonal):
  • observe system's reaction to "naturally occurring" faults
  • augment failure propagation map
[Diagram: Staging and Production phases, with labels "deploy", "minor fixes, reconfigs", and "major upgrades"]

Staging Phase Algorithm
1. Bring system up (infrastructure and application)
2. Each deployment of a component → inspect its interface and infer possible application-visible faults; place potential faults in a global fault list
3. Add environment-related faults (e.g., network partitions, disk I/O faults, out-of-memory)
4. Iterate through list of (component C, method M, fault F) and schedule fault F to be raised by C on invocation of M
5. Generate workload externally to exercise system
6. As components fail, build f-map = directed graph of edges (u, v) indicating that a fault in component u propagated and caused component v to fail (if v handles fault, then no edge)
7. Save f-map and fault list to stable storage, restart app, continue with the next (C, M, F) triplet
• Injection ends when entire list of faults has been exhausted
• Multi-point injections (truly independent faults are seldom in reality):
  1. Take cross product of list of faults with itself and obtain (C1, M1, F1, C2, M2, F2)
  2. Eliminate tuples that have C1=C2
  3. Iterate through list and inject faults
  4. Add previously unseen paths to f-map

Internet Systems / J2EE
• Large scale + HA requirements
• Heterogeneous, individually packaged components (web servers, application servers, databases, etc.)
• Rapid and perpetual evolution → impossible to build and maintain consistent model (key difference from other mission-critical apps)
• Workload = large numbers of relatively short tasks, rather than long-running operations
• Clients are web browsers talking HTTP
• J2EE enterprise apps = collection of reusable Java modules
  • JSPs / servlets invoke EJBs, which invoke other EJBs, ...
  • EJB = Java component that complies to a certain interface and provides a service
  • Deployment descriptor (XML file) conveys run-time characteristics and dependencies; used in deploying the application
• App srv = operating system for Internet applications (instantiates app components in containers, provides runtime system services, integrates with web server to make app web-accessible)
• We use JBoss (open-source J2EE app srv) = microkernel with components held together through JMX
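The staging-phase steps and the multi-point tuple construction above can be sketched in a few lines. This is only an illustrative simulation, not the RR-JBoss implementation: the component names, method names, the `CALLERS` and `HANDLED` tables, and all function names are assumptions made up for the example, and a real run would inject faults into a live application server rather than walk a lookup table.

```python
from itertools import product

# Hypothetical toy system (not PetStore's real structure): CALLERS[u] lists the
# components that invoke u, so an unhandled fault raised in u surfaces in them.
CALLERS = {
    "CatalogEJB": ["ShoppingCartEJB"],
    "ShoppingCartEJB": ["MainServlet"],
    "MainServlet": [],
}
# Hypothetical: (component, fault) pairs that the component handles itself.
HANDLED = {("MainServlet", "IOException")}


def inject(component, method, fault):
    """Simulate raising `fault` in `component`; return the f-map edges observed.

    `method` is only carried along here; a real injector would trigger the
    fault on invocation of that method.
    """
    edges, frontier = set(), [component]
    while frontier:
        u = frontier.pop()
        for v in CALLERS[u]:
            if (v, fault) in HANDLED or (u, v) in edges:
                continue  # v handles the fault (no edge), or edge already seen
            edges.add((u, v))
            frontier.append(v)
    return edges


def staging_phase(fault_list):
    """Single-point injection: one (C, M, F) triplet at a time."""
    f_map = set()
    for c, m, f in fault_list:
        f_map |= inject(c, m, f)
        # A real run would save the f-map and fault list, then restart the app.
    return f_map


def multi_point_tuples(fault_list):
    """Cross product of the fault list with itself, dropping tuples with C1 == C2."""
    return [a + b for a, b in product(fault_list, repeat=2) if a[0] != b[0]]


# Hypothetical global fault list of (component, method, fault) triplets.
FAULTS = [
    ("CatalogEJB", "getItem", "IOException"),
    ("ShoppingCartEJB", "addItem", "IOException"),
    ("ShoppingCartEJB", "addItem", "OutOfMemoryError"),
]
```

Calling `staging_phase(FAULTS)` yields the f-map as a set of (u, v) edges, and `multi_point_tuples(FAULTS)` yields the six-element tuples that a multi-point pass would iterate over. Keeping the f-map as a plain edge set makes "add previously unseen paths" a set union.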

Modifications (JBoss → RR-JBoss)
1. Include 2 new JMX services for injection and monitoring: FaultInjector and FailureMonitor
2. Add hook: whenever a new EJB is deployed, FaultInjector is invoked, to reflect EJB interface and populate list w/ exceptions
3. Modify generic EJB container to provide method for scheduling a fault
4. Modify EJB container's log interceptor to capture stack trace when exception is thrown, parse it, and inform FailureMonitor

Experiments
• PetStore 1.1.2
  • freely available J2EE "tutorial application" from Sun
  • simulates e-commerce site w/ user accounts, profiles, payments, merchandise catalog, shopping cart, purchases, etc.
• Derive vanilla f-map from deployment descriptors
• Chose to inject Java exceptions = high level, JVM-visible faults
  • low-level bit flips → nondeterministic behavior
  • most manifest low-level problems turn into Java exceptions
• Two types of exceptions:
  • "expected": declared in bean interfaces
  • "environmental": resulting from runtime issues (OutOfMemoryError, StackOverflowError, IOException, RemoteException, SQLException)

Comparing f-maps
• Are our f-maps at least as good as those obtained by other means? If yes, are they better?
[Figures: f-maps labeled "Automatic FPI" and "Deployment Descriptors"]
• Missing edges:
  • AccountEJB → OrderEJB: maintained reference, but never used it
  • CatalogEJB → ShoppingClientCtlEJB: didn't even have reference
  • EStoreDB → web tier: only exercised at DB population time
• Additional nodes + edges:
  • HttpJspBase, MainServlet, 6 JSPs: higher resolution, dissected web tier

Fault-Specific f-maps
• Zoom in on dependencies resulting from a specific fault or class of faults
• Targeted recovery when we know the fault that occurred
• f-map obtained by injecting exclusively app-declared exceptions
  • reflects what happens when we isolate it from the environment
  • Much simpler (thus more useful) f-map
  • some components missing (ProfileManagerEJB, OrderEJB, InventoryEJB) so no propagation through them

Discussion
• AFPI required no application knowledge
• No performance overhead (we're faster, but that's noise: 94.8 sec vanilla JBoss vs. 93.0 sec RR-JBoss, with 5.8 std. dev.)
• Deployment descriptors can be incorrect; even if correct, will capture paths that might manifest, not only the ones that do manifest
• Use a true call graph tool? PetStore has 233 Java files w/ 11 KLOC; descriptors are 16 files with 1.5 Klines of XML
• Call graph:
  • might manifest vs. do manifest
  • misses paths that are not due to calls (e.g., memory-gobbling thread)
  • static call graph → need to regenerate every time you change app
  • requires access to source code

Summary
• Automatic Failure-Propagation Inference:
  + automatically and dynamically generates f-maps with no performance overhead
  + no application knowledge required
  + finds dependencies that other analyses might miss, omits dependencies that don't manifest
  + accommodates app evolution
  + obtain high-resolution per-fault-type graphs
  - staging phase may take a long time
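The f-map comparison and the fault-specific view described above reduce to set operations once an f-map is stored as a set of directed edges. The sketch below assumes that representation. Two of the "missing" edges are taken from the slides; every other edge, the fault labels, and the function names are hypothetical illustration data, not the measured PetStore maps.

```python
def compare_f_maps(inferred, reference):
    """Report how an inferred f-map differs from a reference f-map."""
    inferred, reference = set(inferred), set(reference)

    def nodes(edges):
        return {n for edge in edges for n in edge}

    return {
        "missing_edges": reference - inferred,      # in reference only
        "additional_edges": inferred - reference,   # found only by injection
        "additional_nodes": nodes(inferred) - nodes(reference),
    }


def fault_specific(labeled_f_map, faults):
    """Keep only the edges observed under at least one of the given faults."""
    wanted = set(faults)
    return {edge for edge, seen in labeled_f_map.items() if seen & wanted}


# Reference f-map derived from deployment descriptors. The first two edges
# appear on the slides; the third is a made-up edge shared by both maps.
DESCRIPTOR_MAP = {
    ("AccountEJB", "OrderEJB"),
    ("CatalogEJB", "ShoppingClientCtlEJB"),
    ("InventoryEJB", "ShoppingCartEJB"),
}
# Hypothetical inferred f-map, with each edge labeled by the faults that
# produced it during injection.
LABELED_INFERRED = {
    ("InventoryEJB", "ShoppingCartEJB"): {"SQLException"},
    ("ShoppingCartEJB", "MainServlet"): {"OutOfMemoryError", "CreateException"},
}

diff = compare_f_maps(LABELED_INFERRED.keys(), DESCRIPTOR_MAP)
app_declared_view = fault_specific(LABELED_INFERRED, {"CreateException"})
```

Labeling each edge with the faults that produced it is one possible way to obtain per-fault-type graphs from a single staging run; the slides do not say how RR-JBoss stores them.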

Future Work
• Make RR-JBoss crash-only
• Separate J2EE services into separate components
• Include J2EE services in f-maps
• More complex apps: ECperf (alternately Trade-2, TPC-W, Nile)
• Automatic recursive restarts based on f-maps

More…
http://RR.stanford.edu
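As a rough illustration of how an f-map might drive recursive restarts, the sketch below restarts the component where the failure was observed and, if that does not cure it, widens the restart group one hop at a time. The slides list this as future work and do not specify a policy, so everything here is an assumption: the rule of growing the group toward components whose faults can propagate into it, the `restart_and_check` callback, and the component names are all hypothetical.

```python
def restart_groups(f_map, failed):
    """Yield progressively larger restart sets, starting at the failed component.

    Assumed policy: each step adds every component u with an f-map edge (u, v)
    into the current group, i.e. the possible sources of the observed fault.
    """
    group = {failed}
    yield set(group)
    while True:
        grown = group | {u for (u, v) in f_map if v in group}
        if grown == group:
            return  # nothing left to add
        group = grown
        yield set(group)


def recursive_restart(f_map, failed, restart_and_check):
    """Restart ever-larger groups until `restart_and_check(group)` reports a cure.

    Returns the group that cured the failure, or None if no group did.
    """
    for group in restart_groups(f_map, failed):
        if restart_and_check(group):
            return group
    return None


# Hypothetical f-map: a fault in CatalogEJB propagates to ShoppingCartEJB,
# and a fault in ShoppingCartEJB propagates to MainServlet.
F_MAP = {("CatalogEJB", "ShoppingCartEJB"), ("ShoppingCartEJB", "MainServlet")}

# Pretend the real fault sits in ShoppingCartEJB, so only a restart that
# includes it is effective.
cured_by = recursive_restart(F_MAP, "MainServlet", lambda g: "ShoppingCartEJB" in g)
```

Here the first attempt restarts only `MainServlet`; the second widens the group to include `ShoppingCartEJB`, which is where the stand-in check reports success.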
