SLIDE 1

HIGH AVAILABILITY AND DISASTER RECOVERY FOR IMDG

VLADIMIR KOMAROV, MIKHAIL GORELOV
SBERBANK OF RUSSIA

SLIDE 2

ABOUT SPEAKERS

Vladimir Komarov

Enterprise IT Architect vikomarov@sberbank.ru

Mikhail Gorelov

Operations expert & manager magorelov@sberbank.ru

Vladimir has been with Sberbank since 2010. He implemented the operational data store (ODS) and the retail risk data mart as parts of the enterprise data warehouse. In 2015 he ran the evaluation of 10+ distributed in-memory platforms for transaction processing. He is now responsible for the grid-based core banking infrastructure architecture, including high availability and disaster recovery.

Mikhail has been with Sberbank since 2012. He is responsible for building the infrastructure landscape for major mission-critical applications such as core banking and cards processing, including the new grid-based banking platform. He now acts as both expert and project manager in the "18+" core banking transformation program.

SLIDE 3

ABOUT SBERBANK

The largest bank in the Russian Federation

  • 16K+ offices in Russia, 11 time zones
  • 110M+ retail clients
  • 1M+ corporate clients
  • 90K+ ATMs & POS terminals
  • 50M+ active web & mobile banking users
SLIDE 4

OUR GOALS

Availability = (Total time − Downtime) / Total time × 100 %

Availability | Yearly downtime
99 %         | 3d 15:39:29.5
99.9 %       | 8:45:57.0
99.99 %      | 0:52:35.7
99.999 %     | 0:05:15.6
99.9999 %    | 0:00:31.6

Target for 2018.
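The downtime column follows directly from the formula above. A minimal Java sketch that reproduces the table, assuming the mean Gregorian year of 365.2425 days (this assumption is mine; it simply matches the figures on the slide):

```java
public class YearlyDowntime {
    public static void main(String[] args) {
        double secondsPerYear = 365.2425 * 24 * 3600;   // mean Gregorian year
        double[] targets = {99.0, 99.9, 99.99, 99.999, 99.9999};

        for (double availability : targets) {
            // Downtime = Total time × (1 − Availability)
            double downtime = secondsPerYear * (1 - availability / 100);

            long days    = (long) downtime / 86400;
            long hours   = (long) downtime % 86400 / 3600;
            long minutes = (long) downtime % 3600 / 60;
            double secs  = downtime % 60;

            System.out.printf("%-9s %dd %02d:%02d:%04.1f%n",
                availability + " %", days, hours, minutes, secs);
        }
    }
}
```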

SLIDE 5

OUR METHODS

  • additional control and checking tools;
  • monitoring improvement:
    • new metrics design;
    • new visualizations;
  • continuous testing:
    • operational acceptance tests;
    • performance tests;
    • 45+ scenarios of destructive testing;
  • keeping the incident response plan up-to-date.

SLIDE 6

THREATS AND FACILITIES

Threats:

  • Datacenter loss
  • DC interconnect failure
  • Application bugs, admin errors
  • User data corruption
  • HW/OS/JVM failures

Facilities:

  • On-disk data persistence
  • Data redundancy
  • Distributed cluster
  • Data snapshots
  • Point-in-time recovery
  • Health self-check
  • Data replication

SLIDE 7

THE LEGACY GRID-ENABLED ARCHITECTURE

  • Application servers: compute
  • In-memory data grid: caching & temporary storage
  • Relational DBMS: persistence & compute

Strengths:

  • Robust and stable persistence layer
  • The grid does not have to be highly available

Weaknesses:

  • The write performance is limited by the database
  • The persistence layer is not horizontally scalable
  • Data needs to be converted from the object representation to the relational model
  • Database and grid can become inconsistent if data is changed directly in the database
  • The database requires high-end hardware
SLIDE 8

SBERBANK CORE BANKING PLATFORM ARCHITECTURE

  • Application servers: compute
  • In-memory data grid: compute & data persistence

Opportunities:

  • Fully horizontally scalable architecture on commodity hardware
  • The data is stored as objects, no conversion required

Challenges:

  • The grid has to persist the data
  • The grid has to be fault tolerant
  • The only instance of the data
SLIDE 9

SERVICE CONTINUITY THREATS

Continuity threats:

  • Errors
    • Local failures:
      • hardware/OS/JVM failures;
      • network failures;
      • data corruption due to user/admin action.
    • Disasters:
      • cluster breakdown due to application errors and/or admin action;
      • datacenter loss;
      • datacenter interconnect loss.
  • Service jobs
    • Cluster topology change
    • Software update:
      • firmware/OS/JVM upgrade;
      • platform upgrade;
      • application upgrade.

  • The above tree does not consider security issues
  • Application and user issues cannot be solved at platform level
  • Let’s focus on system issues!
SLIDE 10

THE CONCEPT OF SERVICE PROVIDER INTERFACE (SPI)

Application → API → GridGain IMDG → SPI → Custom service implementation

API vs. SPI:

               | API                       | SPI
Defined by     | Platform                  | Platform
Implemented by | Platform                  | System software (custom code)
Called by      | Application (custom code) | Platform

Sberbank implements GridGain SPI:

  • TopologyValidator
  • AffinityFunction
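A minimal sketch of how SPI implementations are plugged into a cache configuration in GridGain / Apache Ignite. The inline validator (at least 4 nodes required for writes) and the stock RendezvousAffinityFunction are placeholders standing in for Sberbank's own implementations:

```java
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.affinity.rendezvous.RendezvousAffinityFunction;
import org.apache.ignite.configuration.CacheConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;

public class SpiWiring {
    public static void main(String[] args) {
        CacheConfiguration<Long, byte[]> cacheCfg = new CacheConfiguration<>("accounts");

        // TopologyValidator SPI: the platform calls it on every topology change;
        // returning false switches the cache into read-only mode.
        cacheCfg.setTopologyValidator(nodes -> nodes.size() >= 4);

        // AffinityFunction SPI: maps keys to partitions and partitions to nodes
        // (here the stock rendezvous function with 1024 partitions).
        cacheCfg.setAffinity(new RendezvousAffinityFunction(false, 1024));

        IgniteConfiguration cfg = new IgniteConfiguration().setCacheConfiguration(cacheCfg);
        Ignition.start(cfg);
    }
}
```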

SLIDE 11

THE CONCEPT OF AFFINITY

Data/compute grid with data area 1 (e.g. clients) and data area 2 (e.g. accounting).

  • nodeFilter: the property of the cache that defines the set of nodes where the cache's data can reside.

AffinityFunction SPI:

  • partition(): the fast, simple and deterministic function (usually a division remainder) mapping an object to a partition (chunk);
  • assignPartitions(): the function distributing partitions (chunks) across the nodes.
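A hedged sketch of the AffinityFunction SPI with a division-remainder partition() and a simple round-robin assignPartitions(); an illustration only, not Sberbank's cell-aware implementation:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

import org.apache.ignite.cache.affinity.AffinityFunction;
import org.apache.ignite.cache.affinity.AffinityFunctionContext;
import org.apache.ignite.cluster.ClusterNode;

public class SimpleAffinityFunction implements AffinityFunction {
    private static final int PARTS = 1024;

    @Override public int partitions() {
        return PARTS;
    }

    /** Fast, deterministic object-to-partition mapping (division remainder). */
    @Override public int partition(Object key) {
        return Math.floorMod(key.hashCode(), PARTS);
    }

    /** Distributes every partition (primary + backups) across the current topology. */
    @Override public List<List<ClusterNode>> assignPartitions(AffinityFunctionContext ctx) {
        List<ClusterNode> nodes = ctx.currentTopologySnapshot();
        int copies = Math.min(ctx.backups() + 1, nodes.size());

        List<List<ClusterNode>> assignment = new ArrayList<>(PARTS);
        for (int part = 0; part < PARTS; part++) {
            List<ClusterNode> owners = new ArrayList<>(copies);
            for (int copy = 0; copy < copies; copy++)
                owners.add(nodes.get((part + copy) % nodes.size()));   // round-robin placement
            assignment.add(owners);
        }
        return assignment;
    }

    @Override public void reset() { /* stateless */ }

    @Override public void removeNode(UUID nodeId) { /* assignment is recomputed on the next call */ }
}
```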

SLIDE 12

THE CONCEPT OF CELL; NEW AFFINITY FUNCTION

(Diagram: partition copies laid out across the nodes of one cell, split between Datacenter 1 and Datacenter 2.)

Sberbank's affinity implementation:

  • The grid is distributed across 2 datacenters.
  • Data connectivity is limited to 8 nodes (a cell).
  • Every partition has the master copy and 3 backups.
  • Each datacenter has 2 copies of a partition.
  • Both datacenters are active.

Broken node: more nodes in the cluster → faster recovery.
Semi-broken node: more linked nodes → stronger performance impact.
Find a balance!
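The copy-placement rules above can be approximated with the stock rendezvous affinity plus an affinity backup filter. A hedged sketch follows; the node attribute name "dc" and the cache name are assumptions, and the 8-node cell constraint of Sberbank's real AffinityFunction is not reproduced here:

```java
import java.util.List;
import java.util.Objects;

import org.apache.ignite.cache.affinity.rendezvous.RendezvousAffinityFunction;
import org.apache.ignite.cluster.ClusterNode;
import org.apache.ignite.configuration.CacheConfiguration;
import org.apache.ignite.lang.IgniteBiPredicate;

public class DcAwareCacheConfig {
    /** One primary + 3 backups = 4 copies per partition, at most 2 copies per datacenter. */
    public static CacheConfiguration<Long, byte[]> accountsCache() {
        CacheConfiguration<Long, byte[]> cfg = new CacheConfiguration<>("accounts");
        cfg.setBackups(3);

        RendezvousAffinityFunction aff = new RendezvousAffinityFunction(false, 1024);

        // Accept a candidate owner only while its datacenter (node attribute "dc",
        // assumed to be set at node start-up) holds fewer than 2 copies of the partition.
        IgniteBiPredicate<ClusterNode, List<ClusterNode>> twoCopiesPerDc = (candidate, alreadyChosen) ->
            alreadyChosen.stream()
                .filter(n -> Objects.equals(n.attribute("dc"), candidate.attribute("dc")))
                .count() < 2;
        aff.setAffinityBackupFilter(twoCopiesPerDc);

        cfg.setAffinity(aff);
        return cfg;
    }
}
```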

SLIDE 13

SBERBANK CORE BANKING INFRASTRUCTURE

  • Nodes of a cell reside in different racks.
  • A Clos network provides stable high-speed connectivity.
  • A doubled datacenter interconnect reduces split-brain probability.
  • Every server contains NVMe flash and HDDs.

SLIDE 14

LET’S SPEAK ABOUT NETWORK FRAGMENTATION…

Scenarios (diagrams of DC1 and DC2):

  • Regular operation
  • Datacenter loss
  • DC interconnect loss
  • Fragmentation type 1
  • Fragmentation type 2
  • Fragmentation type 3

SLIDE 15

HOW DOES GRIDGAIN RECOVER A BROKEN CLUSTER?

  1. End (commit or rollback) all the active transactions.
  2. Choose a new cluster coordinator.
  3. Call TopologyValidator.validate():
    • true (default): continue normal operation;
    • false: go to read-only mode.

SLIDE 16

LET’S OVERRIDE DEFAULT TOPOLOGY VALIDATOR!

Check if:

  • the previous topology was valid;
  • either new nodes appear or not more than N nodes were lost.

Then check:

  1. there are nodes from DC1: All / Partial / None
  2. there are nodes from DC2: All / Partial / None
  3. the data is integral (no partition loss happened): Yes / No

DC1     | DC2     | Data integral | Decision
All     | Partial | yes           | RW
All     | Partial | no            |
All     | None    | yes           | RW
All     | None    | no            |
Partial | All     | yes           | AW

Decisions possible:

  • RW (read-write): continue normal operation
  • AW (admin wait): freeze the cluster and wait for admin interaction
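A hedged Java skeleton of such an overridden validator. The "dc" node attribute and the expected per-datacenter node count are assumptions, and the data-integrity (partition loss) and previous-topology checks from the list above are left out:

```java
import java.util.Collection;

import org.apache.ignite.cluster.ClusterNode;
import org.apache.ignite.configuration.TopologyValidator;

/** Skeleton of a datacenter-aware topology validator; not Sberbank's actual decision table. */
public class DcTopologyValidator implements TopologyValidator {
    /** Expected number of data nodes per datacenter (assumed to be known at configuration time). */
    private final int expectedPerDc;

    public DcTopologyValidator(int expectedPerDc) {
        this.expectedPerDc = expectedPerDc;
    }

    @Override public boolean validate(Collection<ClusterNode> nodes) {
        // Each node is assumed to be started with a "dc" attribute naming its datacenter.
        long dc1 = nodes.stream().filter(n -> "DC1".equals(n.attribute("dc"))).count();

        // Example decision from the table: DC1 is complete -> RW (true),
        // whether DC2 is partial or lost.
        if (dc1 == expectedPerDc)
            return true;

        // Everything else in this skeleton: AW -> return false, which freezes writes
        // (read-only mode) until the administrator intervenes.
        return false;
    }
}
```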

SLIDE 17

DECISION AUTOMATION USING QUORUM NODE

(Diagram: Datacenter 1, Datacenter 2 and a quorum datacenter hosting the quorum node; each fragment is marked RW or STOP depending on whether it still sees the quorum node.)
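A minimal sketch of how the quorum decision can be automated inside a topology validator; the "role=quorum" node attribute is an assumption used only for illustration:

```java
import java.util.Collection;

import org.apache.ignite.cluster.ClusterNode;
import org.apache.ignite.configuration.TopologyValidator;

/** A fragment keeps accepting writes only while it still sees the quorum node. */
public class QuorumTopologyValidator implements TopologyValidator {
    @Override public boolean validate(Collection<ClusterNode> nodes) {
        // The node hosted in the quorum datacenter is assumed to be started
        // with the attribute role=quorum.
        return nodes.stream().anyMatch(n -> "quorum".equals(n.attribute("role")));
    }
}
```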

SLIDE 18

LOCAL FILE STORE (LFS)

  • Trx processing → paged memory (RAM): sync. write
  • Trx processing → write-ahead log (HDD): sync. write
  • Paged memory → paged disk storage (files, HDD): async. write
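In current Apache Ignite / GridGain releases this write path corresponds to native persistence with a write-ahead log. A minimal configuration sketch, assuming the Ignite 2.x API:

```java
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.DataRegionConfiguration;
import org.apache.ignite.configuration.DataStorageConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.configuration.WALMode;

public class LfsLikePersistence {
    public static void main(String[] args) {
        DataStorageConfiguration storageCfg = new DataStorageConfiguration();

        // Paged memory backed by page files on disk ("paged disk storage" on the slide);
        // dirty pages are flushed asynchronously at checkpoints.
        storageCfg.setDefaultDataRegionConfiguration(
            new DataRegionConfiguration()
                .setName("default")
                .setPersistenceEnabled(true));

        // The write-ahead log is written synchronously with the transaction.
        storageCfg.setWalMode(WALMode.FSYNC);

        IgniteConfiguration cfg = new IgniteConfiguration().setDataStorageConfiguration(storageCfg);

        // Clusters with persistence start inactive and must be activated explicitly.
        Ignition.start(cfg).cluster().active(true);
    }
}
```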

SLIDE 19

BACKUP SUBSYSTEM

Current:

  • Snapshot to local disk (full/incremental/differential)
  • Snapshot catalog inside the data grid
  • Copying to NAS using NFS
  • Restoring on an arbitrary grid topology

Future:

  • Point-in-time recovery using snapshot and WAL
  • External backup catalog in a relational DBMS
  • Copying to SDS using S3/SWIFT
  • ...and more!
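GridGain's snapshot subsystem described here is part of its commercial plugin and its API is not shown on the slides. As a rough open-source analogue, Apache Ignite 2.9+ exposes a full-snapshot call; the configuration file and snapshot name below are made up for the example:

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;

public class FullSnapshot {
    public static void main(String[] args) {
        // Assumes a running persistent cluster (see the persistence sketch above).
        Ignite ignite = Ignition.start("node-config.xml");

        // Full cluster snapshot, written to the local snapshot directory of every node.
        ignite.snapshot().createSnapshot("backup-2018-01-01").get();
    }
}
```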
SLIDE 20

THANK YOU!

Vladimir Komarov <vikomarov@sberbank.ru>
Mikhail Gorelov <magorelov@sberbank.ru>