LinuxConf Europe 2016 - How to keep application up 24x7

“How to Keep Critical Applications up and running 24x7”

Linda Wang Red Hat, Inc. October 6, 2016

Background

  • The computer industry has been evolving
  • Decades of improvement
  • Various OSes have claimed to achieve zero down time for their users through a variety of individual mechanisms:
      • System monitoring
      • Predictive Self Healing
  • Without in-depth analysis of the fundamental causes of down time, do these features really help?

Today

  • Open Source community
  • Ease of access to source
  • Linux: a lot of research and development at research institutes
  • Opens doors and paths to different approaches and allows experimentation

  • Advanced Kernel development

How to Achieve 24x7 Uptime

  • Analyze the reasons behind down time
  • Planned vs. Unplanned
  • Unplanned down time is what we want to avoid proactively
  • Predictable vs. Unpredictable

How to achieve 24x7 Uptime

  • Reasons behind Down Times
  • Two types of Down-Time: unplanned vs. planned
  • Unplanned: predictable, unpredictable

| | Unpredictable/Unplanned | Predictable/Planned | Proactive Planning |
|---|---|---|---|
| Application Crash | | | |
| Operating System Panic | | | |
| Hardware Failure | | | |

24x7 Uptime

  • Reasons behind Down Times
  • Two types of Down-Time: unplanned vs. planned
  • Unplanned: predictable, unpredictable

| | Unpredictable/Unplanned | Predictable/Planned | Proactive Planning |
|---|---|---|---|
| Application Crash | Diagnostics (gdb); auto restart (systemd unit file) | | |
| Operating System Panic | Diagnostic tool (kdump/crash); auto restart (NMI timeout) | | |
| Hardware Failure | Error detection (HERM) | | |

24x7 Uptime

  • Reasons behind Down Times
  • Two types of Down-Time: unplanned vs. planned
  • Unplanned: predictable, unpredictable

| | Unpredictable/Unplanned | Predictable/Planned | Proactive Planning |
|---|---|---|---|
| Application Crash | Diagnostics (gdb); auto restart (systemd unit file) | Security updates | |
| Operating System Panic | Diagnostic tool (kdump/crash); auto restart (NMI timeout) | | |
| Hardware Failure | Error detection (HERM) | | |

24x7 Uptime

  • Reasons behind Down Times
  • Two types of Down Time: unplanned vs. planned
  • Unplanned: predictable, unpredictable

| | Unpredictable/Unplanned | Predictable/Planned | Proactive Planning |
|---|---|---|---|
| Application Crash | Diagnostics (gdb); auto restart (systemd unit file) | Security updates | |
| Operating System Panic | Diagnostic tool (kdump/crash); auto restart (NMI timeout) | Kernel security, bugfix updates | |
| Hardware Failure | Error detection (HERM) | | |

24x7 Uptime

  • Reasons behind Down Times
  • Two types of Down Time: unplanned vs. planned
  • Unplanned: predictable, unpredictable

| | Unpredictable/Unplanned | Predictable/Planned | Proactive Planning |
|---|---|---|---|
| Application Crash | Diagnostics (gdb); auto restart (systemd unit file) | Security updates | |
| Operating System Panic | Diagnostic tool (kdump/crash); auto restart (NMI timeout) | Kernel security, bugfix updates | |
| Hardware Failure | Error detection (HERM) | Hardware replacement | |

24x7 Uptime

  • Reasons behind Down Times
  • Two types of Down Time: unplanned vs. planned
  • Unplanned: predictable, unpredictable

| | Unpredictable/Unplanned | Predictable/Planned | Proactive Planning |
|---|---|---|---|
| Application Crash | Diagnostics (gdb); auto restart (systemd unit file) | Security updates | Live patching security fixes (systemtap) |
| Operating System Panic | Diagnostic tool (kdump/crash); auto restart (NMI timeout) | Kernel security, bugfix updates | |
| Hardware Failure | Error detection (HERM) | Hardware replacement | |

24x7 Uptime

  • Reasons behind Down Times
  • Two types of Down Time: unplanned vs. planned
  • Unplanned: predictable, unpredictable

| | Unpredictable/Unplanned | Predictable/Planned | Proactive Planning |
|---|---|---|---|
| Application Crash | Diagnostics (gdb); auto restart (systemd unit file) | Security updates | Live patching security fixes (systemtap) |
| Operating System Panic | Diagnostic tool (kdump/crash); auto restart (NMI timeout) | Kernel security, bugfix updates | Live patching known kernel issues (kpatch) |
| Hardware Failure | Error detection (HERM) | Hardware replacement | |

24x7 Uptime

  • Reasons behind Down Times
  • Two types of Down Time: unplanned vs. planned
  • Unplanned: predictable, unpredictable

| | Unplanned Down Time | Planned Down Time | Proactive Planning |
|---|---|---|---|
| Application Crash | Diagnostics (gdb); auto restart (systemd unit file) | Security updates | Live patching security fixes (systemtap) |
| Operating System Panic | Diagnostic tool (kdump/crash); auto restart (NMI timeout) | Kernel security, bugfix updates | Live patching known kernel issues (kpatch) |
| Hardware Failure | Error detection (HERM) | Hardware replacement | Checkpoint/Restore (criu) |
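
The "auto restart" entry for application crashes refers to systemd, which can restart a failed service when its unit file sets a directive such as `Restart=on-failure`. The Python sketch below is not systemd and not the speaker's demo; it is a small stand-in that illustrates the same idea of restarting a crashed process within a bounded restart budget. The names `run_with_restart`, `max_restarts`, and `delay_s` are hypothetical and chosen only for this illustration.

```python
import subprocess
import sys
import time


def run_with_restart(cmd, max_restarts=3, delay_s=0.0):
    """Run cmd and restart it after every non-zero exit.

    Returns (last_exit_code, restarts_used). The loop ends when the
    command exits cleanly or the restart budget is exhausted.
    This is an illustrative stand-in, not how systemd is implemented.
    """
    restarts = 0
    while True:
        exit_code = subprocess.call(cmd)
        if exit_code == 0 or restarts >= max_restarts:
            return exit_code, restarts
        restarts += 1
        time.sleep(delay_s)  # back off before the next attempt


if __name__ == "__main__":
    # Hypothetical "application" that always crashes with exit status 1.
    failing = [sys.executable, "-c", "import sys; sys.exit(1)"]
    print(run_with_restart(failing, max_restarts=2))
```

A bounded budget matters because an application that crashes on every start should eventually be reported as failed instead of being restarted forever.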

Prepare for DownTime Scenarios

  • Preventive Measures
  • For security fixes and known issues, to avoid crashes:
      • Live Patches, for both kernel and userspace
  • To avoid down time due to Hardware Failure or Regular Maintenance:
      • Containerize critical applications, and use Live Migration to move them to alternative systems while the original systems undergo maintenance
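
The migration step can be pictured as three stages: checkpoint the application, copy the checkpoint images to the alternative system, and restore there. The sketch below only builds the command lines for those stages and does not execute them. It assumes CRIU's `dump` and `restore` subcommands with the `--tree`, `--images-dir`, and `--restore-detached` options, and assumes `rsync` and `ssh` for the copy and remote steps; the function name, PID, directory, and host name are hypothetical. A real migration also needs root privileges, matching file system paths on the target, and care with open network connections, none of which is handled here.

```python
def migration_commands(pid, image_dir, target_host):
    """Build (not run) the commands for a checkpoint -> copy -> restore move.

    pid, image_dir, and target_host are illustrative placeholders.
    """
    # 1. Checkpoint the process tree rooted at pid into image_dir.
    dump = ["criu", "dump", "--tree", str(pid), "--images-dir", image_dir]
    # 2. Copy the checkpoint images to the same path on the target host.
    copy = ["rsync", "-a", image_dir + "/", f"{target_host}:{image_dir}/"]
    # 3. Restore on the target and detach so the ssh session can end.
    restore = ["ssh", target_host,
               "criu", "restore", "--images-dir", image_dir,
               "--restore-detached"]
    return [dump, copy, restore]


if __name__ == "__main__":
    for step in migration_commands(1234, "/tmp/ckpt", "standby.example"):
        print(" ".join(step))
```

Keeping the stages as separate command lists makes it easy to log or review each one before running it on a production host.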

Kernel Live Patching Enhancements

  • Demo

User Space Live Patching

  • Demo

Container Migration

  • Demo

For more information...

  • Kernel Live Patching:
      ■ http://rhelblog.redhat.com/?s=live+patching
      ■ questions: kpatch@redhat.com
  • Checkpoint Restore/Live Migration:
      ■ http://rhelblog.redhat.com/?s=criu
      ■ questions: criu@redhat.com

Thank you!