CRAY SV1 SuperCluster Resiliency

SLIDE 1

CRAY SV1 SuperCluster Resiliency

Mike Wolf

I/O development

SGI

41st Cray User Group Conference, Minneapolis, Minnesota

SLIDE 2

Resiliency Goals

Maintain cluster operations after a panic
• Ring Resiliency
• Auto-Recovery
• Failover

SLIDE 3

SuperCluster Resiliency

Ring Resiliency
• Operating System resets client chip
• Checkxxx commands resetting client chip
• Proxy locking
• Dring monitor

SLIDE 4

SuperCluster Resiliency

Auto-Recovery
• Foundation / Monitoring
• User exits in checkxxx commands
• Recovery
• Notification
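The monitoring, user exit, recovery, and notification stages listed above can be pictured as a simple pipeline. The following is a minimal sketch only: the function `run_check` and all of its parameters are hypothetical names chosen for illustration, and the slides do not describe how the actual checkxxx commands implement their user exits.

```python
from typing import Callable, List


def run_check(
    check: Callable[[], bool],
    user_exits: List[Callable[[], None]],
    recover: Callable[[], bool],
    notify: Callable[[str], None],
) -> str:
    """Hypothetical monitor pass: check, user exits, recovery, notification."""
    if check():
        return "healthy"          # monitoring found nothing wrong
    for exit_hook in user_exits:  # site-supplied hooks run before recovery
        exit_hook()
    outcome = "recovered" if recover() else "failed"
    notify(outcome)               # report the result to the operator
    return outcome
```

A site would pass its own callables for each stage; for example, a `notify` that appends to a log and a `recover` that returns True once the node responds again.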

SLIDE 5

SuperCluster Resiliency

Failover
• NFS
• UDB
• DCE/DFS
• BDS

SLIDE 6

Resiliency Example 1

SV1 SuperCluster Basic Building Block

[Diagram: Mainframe 1, Mainframe 2, Mainframe 3, Mainframe 4; SWS, MPN (x4), FCN; GigaRing, Ethernet]

SLIDE 7

Resiliency Example 1

Mainframe 1 panics

[Diagram: Mainframe 1, Mainframe 2, Mainframe 3, Mainframe 4; SWS, MPN (x4), FCN; GigaRing, Ethernet]

SLIDE 8

Resiliency Example 1

Mainframe 2 has packet backup

[Diagram: Mainframe 1, Mainframe 2, Mainframe 3, Mainframe 4; SWS, MPN (x4), FCN; GigaRing, Ethernet]

SLIDE 9

Resiliency Example 1

Mainframe 2 hangs

[Diagram: Mainframe 1, Mainframe 2, Mainframe 3, Mainframe 4; SWS, MPN (x4), FCN; GigaRing, Ethernet]

SLIDE 10

Resiliency Example 1

Mainframes 3 and 4 hang

[Diagram: Mainframe 1, Mainframe 2, Mainframe 3, Mainframe 4; SWS, MPN (x4), FCN; GigaRing, Ethernet]
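The cascade in Resiliency Example 1 (one panic, then packet backup, then hangs spreading around the ring) can be illustrated with a toy model. This is a sketch under loud assumptions, not a description of the GigaRing protocol: the one-directional forwarding, the `CAPACITY` buffer limit, and the names `Node`, `step`, and `simulate` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    state: str = "up"  # "up", "panicked", or "hung"
    queued: int = 0    # packets this node could not deliver


CAPACITY = 2  # hypothetical buffer limit; a full buffer hangs the node


def step(ring):
    """Advance one tick: every 'up' node sends one packet to ring[i - 1]."""
    snapshot = [n.state for n in ring]  # decide from start-of-tick states
    for i, node in enumerate(ring):
        if snapshot[i] != "up":
            continue
        if snapshot[i - 1] == "up":
            continue                    # neighbor accepted the packet
        node.queued += 1                # neighbor is down: packet backs up
        if node.queued >= CAPACITY:
            node.state = "hung"


def simulate(ticks):
    """Mainframe 1 panics at time zero; return the ring states per tick."""
    ring = [Node(f"Mainframe {k}") for k in range(1, 5)]
    ring[0].state = "panicked"
    history = []
    for _ in range(ticks):
        step(ring)
        history.append([n.state for n in ring])
    return history
```

In this model `simulate(6)` ends with Mainframe 1 panicked and Mainframes 2, 3, and 4 hung, because nothing ever clears the stalled node from the ring.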

SLIDE 11

Resiliency Example 2

Mainframe 1 panics

[Diagram: Mainframe 1, Mainframe 2, Mainframe 3, Mainframe 4; SWS, MPN (x4), FCN; GigaRing, Ethernet]

SLIDE 12

Resiliency Example 2

SWS stabilizes ring

[Diagram: Mainframe 1, Mainframe 2, Mainframe 3, Mainframe 4; SWS, MPN (x4), FCN; GigaRing, Ethernet]

SLIDE 13

Resiliency Example 2

Mainframe 1 is back in service

[Diagram: Mainframe 1, Mainframe 2, Mainframe 3, Mainframe 4; SWS, MPN (x4), FCN; GigaRing, Ethernet]
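The contrast between the two examples can be sketched with a small simulation in which the ring is stabilized before traffic backs up. Everything here is an assumption made for illustration: the slides do not say how the SWS stabilizes the ring, so the "bypassed" state, the `capacity` limit, and the function `run` with its `stabilize_at` and `restore_at` parameters are hypothetical.

```python
def run(ticks, stabilize_at=None, restore_at=None, capacity=2):
    """Toy ring: Mainframe 1 panics at time zero; return final node states."""
    names = [f"Mainframe {k}" for k in range(1, 5)]
    state = {n: "up" for n in names}
    queued = dict.fromkeys(names, 0)
    state[names[0]] = "panicked"
    for tick in range(1, ticks + 1):
        if tick == stabilize_at:
            state[names[0]] = "bypassed"  # assumed effect of stabilizing
        if tick == restore_at:
            state[names[0]] = "up"        # node is back in service
        snapshot = dict(state)            # decide from start-of-tick states
        for i, name in enumerate(names):
            if snapshot[name] != "up":
                continue
            neighbor = names[i - 1]       # each node forwards to names[i - 1]
            if snapshot[neighbor] in ("panicked", "hung"):
                queued[name] += 1         # packet backs up
                if queued[name] >= capacity:
                    state[name] = "hung"
            else:
                queued[name] = 0          # backlog drains
    return state
```

With no intervention, `run(6)` leaves the other three mainframes hung, as in Example 1. With `run(6, stabilize_at=1, restore_at=4)` every node finishes in the "up" state, which mirrors the sequence in Example 2.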