How to Test the Ability of Large-Scale, Distributed Software Systems to Cope with Failures



slide-1
SLIDE 1

How to Test the Ability of Large-Scale, Distributed Software Systems to Cope with Failures

Pavel Lipsky
 Dell Technologies
 06/04/2019

slide-2
SLIDE 2

2

Who am I?

Pavel Lipsky

  • Before 2005: Building scalable web sites
  • From 2005 to 2014: Test automation and DevOps
  • From 2014: Performance and reliability of large-scale, distributed systems

https://github.com/leapsky

slide-3
SLIDE 3

3

Agenda

  • What is Fault Injection?
  • Test Object
  • Stories & Demos - https://github.com/leapsky
  • Tools & Frameworks
slide-4
SLIDE 4

Story 1

Memcached

slide-5
SLIDE 5

5

Fetching Data from Memcached

[Diagram: Application, Memcached, and Database, connected by arrows numbered 1 to 5]

slide-6
SLIDE 6

6

Changing Data in Memcached

[Diagram: Application, Database, and Memcached, connected by arrows numbered 1 to 5]

slide-7
SLIDE 7

7

Types of Software Testing

  • Functional testing
  • Load testing
  • Usability testing
  • Security testing
  • Fault Injection

slide-8
SLIDE 8

Story 2

slide-9
SLIDE 9

9

Payments for Goods with Payment Cards Issued by Russian Banks

slide-10
SLIDE 10

10

New IT Platform

  • Horizontal scaling
  • Using open-source software
  • Affordable low-end hardware
  • Reliability
  • Storing data in RAM
slide-11
SLIDE 11

11

GridGain Enterprise

  • SQL support
  • Quick access to objects by key
  • In-memory computing
  • Persistent Data Store
  • Strong consistency
  • Failure resistance
  • Horizontal scalability
slide-12
SLIDE 12

12

Forcing a System to Fail

“Without explicitly forcing a system to fail, it is unreasonable to have any confidence it will operate correctly in failure modes.”

Caitie McCaffrey (Backend Brat & Distributed Systems Diva), The Verification of a Distributed System

slide-13
SLIDE 13

Story 3

Lost Updates

slide-14
SLIDE 14

14

Example of Fund Transfer

  1. read(A)
  2. A := A - 50
  3. write(A)
  4. read(B)
  5. B := B + 50
  6. write(B)
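The six steps above can be sketched in plain Java against an in-memory map. This is only an illustration: the `FundTransfer` class, the `accounts` map, and the starting balances are hypothetical and do not come from the talk's demo code.

```java
import java.util.HashMap;
import java.util.Map;

/** Fund transfer written as the six steps from the slide, on an in-memory map. */
class FundTransfer {
    // Hypothetical in-memory "database": account name -> balance in dollars.
    static final Map<String, Integer> accounts = new HashMap<>();

    static void transfer(String a, String b, int amount) {
        int balanceA = accounts.get(a);   // 1. read(A)
        balanceA = balanceA - amount;     // 2. A := A - amount
        accounts.put(a, balanceA);        // 3. write(A)
        int balanceB = accounts.get(b);   // 4. read(B)
        balanceB = balanceB + amount;     // 5. B := B + amount
        accounts.put(b, balanceB);        // 6. write(B)
    }

    public static void main(String[] args) {
        accounts.put("A", 100);           // illustrative starting balances
        accounts.put("B", 100);
        transfer("A", "B", 50);
        System.out.println("A=" + accounts.get("A") + " B=" + accounts.get("B"));
    }
}
```

If the process fails between step 3 and step 6, the amount has left A but never reaches B, which is the kind of failure mode fault injection is meant to expose.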

slide-15
SLIDE 15

15

Fund Transfers Between Bank Accounts

[Diagram: bank accounts numbered 1 to 10 with balances of $100; transfer amounts shown include $50, $20, $3, $18, and $95]

slide-16
SLIDE 16

Demo Time

Lost Updates

slide-17
SLIDE 17

17

Lost Updates

Time steps: T1, T2, T3, T4, T5, …

Task 1:
  T1  read(A)
  T2  A := A - 50
  T3  write(A)

Task 2 (runs concurrently):
  read(A)
  A := A - 50
  write(A)
  …

A := $50
Expected value of A is $50
Real value of A is $0
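A lost update can be reproduced deterministically by replaying one bad interleaving by hand. The sketch below is a minimal illustration, not the demo from the talk: the `LostUpdateDemo` class, the starting balance of 100, and the two withdrawals of 50 are assumptions chosen for clarity and are not the figures shown on the slide.

```java
/** Deterministic replay of a lost update: both tasks read before either one writes. */
class LostUpdateDemo {
    static int a; // shared balance of account A

    static int run() {
        a = 100;                        // illustrative starting balance (assumption)
        int task1Local = a;             // Task 1: read(A) -> 100
        int task2Local = a;             // Task 2: read(A) -> 100, stale once Task 1 writes
        task1Local = task1Local - 50;   // Task 1: A := A - 50
        task2Local = task2Local - 50;   // Task 2: A := A - 50
        a = task1Local;                 // Task 1: write(A) -> 50
        a = task2Local;                 // Task 2: write(A) -> 50, overwriting Task 1's update
        return a;
    }

    public static void main(String[] args) {
        // Two withdrawals of 50 from 100 should leave 0; one update is lost.
        System.out.println("expected 0, actual " + run());
    }
}
```

Running it prints `expected 0, actual 50`: the second write is based on a stale read, so one withdrawal disappears.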

slide-18
SLIDE 18

Story 4

ACID

slide-19
SLIDE 19

19

ACID Properties

  • Atomicity
  • Consistency
  • Isolation
  • Durability

  1. read(A)
  2. A := A - 50
  3. write(A)
  4. read(B)
  5. B := B + 50
  6. write(B)

slide-20
SLIDE 20

20

Isolation Levels and the ANSI/ISO SQL Standard

Isolation Level     Dirty Read   Non-Repeatable Read   Phantom Read
READ UNCOMMITTED    Permitted    Permitted             Permitted
READ COMMITTED      --           Permitted             Permitted
REPEATABLE READ     --           --                    Permitted
SERIALIZABLE        --           --                    --

slide-21
SLIDE 21

21

READ_COMMITTED

Time steps: T1, T2, T3, T4, T5, …

Transaction 1:
  T1  read(A)
  T2  A := A - 50
  T3  write(A)
  T4  commit

Transaction 2 (runs concurrently):
  read(A)
  A := A + 50
  write(A)
  commit

A := $50
Expected value of A is $50
Real value of A is $100

slide-22
SLIDE 22

22

Apache Ignite Concurrency Modes and Isolation Levels

Isolation Levels

  • READ_COMMITTED
  • REPEATABLE_READ
  • SERIALIZABLE

Concurrency Modes

  • PESSIMISTIC
  • OPTIMISTIC
slide-23
SLIDE 23

23

Apache Ignite Documentation: Concurrency Modes and Isolation Levels

PESSIMISTIC REPEATABLE_READ - Entry lock is acquired and data is fetched from the primary node on the first read or write access and stored in the local transactional map. All consecutive access to the same data is local and will return the last read or updated transaction value. This means no other concurrent transactions can make changes to the locked data, and you are getting Repeatable Reads for your transaction.

OPTIMISTIC SERIALIZABLE - Stores an entry version upon first read access. Ignite will fail a transaction at the commit stage if the Ignite engine detects that at least one of the entries used as part of the initiated transaction has been modified.
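The optimistic scheme described above (remember an entry's version on first read, then validate at commit) can be illustrated with a toy, single-threaded model. This is a sketch of the general technique only, not Ignite's implementation: `OptimisticTxSketch`, `Versioned`, `Tx`, and the `store` map are all hypothetical names, and the code is not thread-safe.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy optimistic transaction: remember entry versions on first read, validate at commit. */
class OptimisticTxSketch {
    static final class Versioned {
        final int value;
        final long version;
        Versioned(int value, long version) { this.value = value; this.version = version; }
    }

    // Hypothetical single-node store; not Ignite's data structures.
    static final Map<String, Versioned> store = new HashMap<>();

    static final class Tx {
        private final Map<String, Long> readVersions = new HashMap<>();
        private final Map<String, Integer> writes = new HashMap<>();

        int read(String key) {
            if (writes.containsKey(key)) return writes.get(key); // read your own write
            Versioned v = store.get(key);
            readVersions.putIfAbsent(key, v.version);            // version seen on first read
            return v.value;
        }

        void write(String key, int value) { writes.put(key, value); }

        /** Returns false (and applies nothing) if any entry read was modified since. */
        boolean commit() {
            for (Map.Entry<String, Long> e : readVersions.entrySet()) {
                if (store.get(e.getKey()).version != e.getValue()) return false;
            }
            for (Map.Entry<String, Integer> e : writes.entrySet()) {
                Versioned old = store.get(e.getKey());
                long next = (old == null) ? 1 : old.version + 1;
                store.put(e.getKey(), new Versioned(e.getValue(), next));
            }
            return true;
        }
    }

    public static void main(String[] args) {
        store.put("A", new Versioned(100, 1));
        Tx t1 = new Tx();
        Tx t2 = new Tx();
        t1.write("A", t1.read("A") - 50);   // both transactions read version 1
        t2.write("A", t2.read("A") - 50);
        System.out.println("t1 commit: " + t1.commit());
        System.out.println("t2 commit: " + t2.commit());
        System.out.println("A = " + store.get("A").value);
    }
}
```

Here the first commit succeeds and the second is rejected, so the conflicting update fails visibly instead of being silently lost; a caller would typically retry the rejected transaction.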

slide-24
SLIDE 24

Demo Time

Transactions

slide-25
SLIDE 25

25

.txStart(CONCURRENCY_MODE, ISOLATION_LEVEL)

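import org.apache.ignite.transactions.Transaction;
import static org.apache.ignite.transactions.TransactionConcurrency.OPTIMISTIC;
import static org.apache.ignite.transactions.TransactionIsolation.SERIALIZABLE;

// Start a transaction with the chosen concurrency mode and isolation level.
try (Transaction tx = ignite.transactions().txStart(OPTIMISTIC, SERIALIZABLE)) {
    Account fromAccount = cache.get(fromAccountId);
    Account toAccount = cache.get(toAccountId);
    ...
    tx.commit();
}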


slide-26
SLIDE 26

Story 5

Testing Under Load

slide-27
SLIDE 27

27

Performance Testing Tools

slide-28
SLIDE 28

Demo Time

slide-29
SLIDE 29

29

What cache mode to choose?

[Diagram: PARTITIONED and REPLICATED cache modes, showing how primary and backup copies of entries 1 to 4 are placed across nodes]

slide-30
SLIDE 30

30

.txStart(CONCURRENCY_MODE, ISOLATION_LEVEL)

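import org.apache.ignite.cache.CacheAtomicityMode;
import org.apache.ignite.cache.CacheMode;
import org.apache.ignite.configuration.CacheConfiguration;

CacheConfiguration<Integer, Account> cfg = new CacheConfiguration<>(CACHE_NAME);
cfg.setAtomicityMode(CacheAtomicityMode.TRANSACTIONAL); // enable transactions on this cache
cfg.setCacheMode(CacheMode.PARTITIONED);                // split the data set across nodes
cfg.setBackups(2);                                      // two backup copies per partition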
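To see why the number of backups matters for node failures, here is a toy model of a partitioned cache in plain Java. It is a sketch under stated assumptions: the round-robin placement in `owners` is hypothetical and is not Ignite's affinity function, the class and method names are invented, and rebalancing after a failure is ignored.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model of a partitioned cache: each partition has one primary and N backup copies. */
class BackupSketch {
    /** Nodes holding partition p: primary on node p % nodes, backups on the following nodes. */
    static List<Integer> owners(int partition, int nodes, int backups) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i <= backups; i++) {
            result.add((partition + i) % nodes);   // assumes backups < nodes
        }
        return result;
    }

    /** True if every partition still has at least one copy on a live node. */
    static boolean allPartitionsAvailable(int partitions, int nodes, int backups,
                                          Set<Integer> failed) {
        for (int p = 0; p < partitions; p++) {
            boolean alive = false;
            for (int node : owners(p, nodes, backups)) {
                if (!failed.contains(node)) alive = true;
            }
            if (!alive) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        Set<Integer> twoDown = new HashSet<>(List.of(0, 1));
        Set<Integer> threeDown = new HashSet<>(List.of(0, 1, 2));
        // 8 partitions on 4 nodes with 2 backups: three copies of every partition.
        System.out.println(allPartitionsAvailable(8, 4, 2, twoDown));
        System.out.println(allPartitionsAvailable(8, 4, 2, threeDown));
    }
}
```

With two backups every partition lives on three distinct nodes, so losing any two nodes leaves at least one copy (the first line prints `true`), while losing three nodes can wipe out a partition (the second prints `false`).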


slide-31
SLIDE 31

Demo Time

slide-32
SLIDE 32

32

Jepsen Test

lein run test \
  --test bank \
  --time-limit 60 \
  --concurrency 5 \
  --nodes-file nodes \
  --username root \
  --password root \
  --cache-mode PARTITIONED \
  --cache-atomicity-mode TRANSACTIONAL \
  --cache-write-sync-mode FULL_SYNC \
  --read-from-backup YES \
  --transaction-concurrency PESSIMISTIC \
  --transaction-isolation REPEATABLE_READ \
  --backups 2 \
  --pds true \
  --version 2.7.0 \
  --os debian \
  --nemesis kill-node
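A bank-style test moves money between accounts while faults are injected and then checks that the total across all accounts never changes. The sketch below illustrates that invariant check in plain Java. It is not Jepsen code (Jepsen is written in Clojure); the `BankInvariantCheck` class, the snapshot format, and the sample balances are assumptions made for illustration.

```java
import java.util.List;
import java.util.Map;

/** Bank-style check: every read of all balances must add up to the same total. */
class BankInvariantCheck {
    /** Returns the index of the first snapshot whose sum differs from expectedTotal, or -1. */
    static int firstViolation(List<Map<Integer, Long>> snapshots, long expectedTotal) {
        for (int i = 0; i < snapshots.size(); i++) {
            long sum = 0;
            for (long balance : snapshots.get(i).values()) {
                sum += balance;
            }
            if (sum != expectedTotal) return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        // Hypothetical reads of three accounts (account id -> balance).
        List<Map<Integer, Long>> reads = List.of(
                Map.of(1, 100L, 2, 100L, 3, 100L),   // initial state, total 300
                Map.of(1, 50L, 2, 150L, 3, 100L),    // after a correct transfer, total 300
                Map.of(1, 50L, 2, 100L, 3, 100L));   // a credit was lost, total 250
        System.out.println("first violation at read " + firstViolation(reads, 300L));
    }
}
```

The example prints `first violation at read 2`, pointing at the snapshot where a transfer's credit went missing.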
slide-33
SLIDE 33

Story 6

Disruptive Scenarios

slide-34
SLIDE 34

34

Node failure

  • Application crash
  • OS crash
  • Hardware crash
  • JVM crash

slide-35
SLIDE 35

35

Disruptive Scenarios

  • Hardware
  • Network
  • Application
  • Other scenarios
slide-36
SLIDE 36

36

Disruptive Scenarios: Hardware

[Diagram: primary and backup partitions 1 to 8 spread across nodes in Data Center #1 and Data Center #2]

slide-37
SLIDE 37

37

Disruptive Scenarios: Hardware

[Diagram: primary and backup partitions 1 to 8 spread across nodes in Data Center #1 and Data Center #2]

slide-38
SLIDE 38

38

Disruptive Scenarios: Network

  • iptables
  • NetEm emulates:
      • network delays with different distribution functions
      • packet loss
      • packet duplication
      • reordering of packets
      • packet corruption
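Each NetEm fault type in the list above corresponds to a `tc ... netem` option. The sketch below only builds and prints example command lines; it does not run them. The `NetemCommands` class, the device name `eth0`, and the specific percentages and delays are illustrative assumptions, and applying such commands requires root privileges.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Builds example `tc ... netem` command lines for the fault types listed on the slide. */
class NetemCommands {
    static String netem(String device, String options) {
        return "tc qdisc add dev " + device + " root netem " + options;
    }

    /** Fault type -> example command; all numeric values are illustrative. */
    static Map<String, String> examples(String device) {
        Map<String, String> cmds = new LinkedHashMap<>();
        cmds.put("delay", netem(device, "delay 100ms 20ms distribution normal"));
        cmds.put("loss", netem(device, "loss 5%"));
        cmds.put("duplication", netem(device, "duplicate 1%"));
        cmds.put("reordering", netem(device, "delay 10ms reorder 25% 50%"));
        cmds.put("corruption", netem(device, "corrupt 0.1%"));
        return cmds;
    }

    public static void main(String[] args) {
        // "eth0" is a placeholder interface name.
        examples("eth0").forEach((fault, cmd) -> System.out.println(fault + ": " + cmd));
    }
}
```

A rule added this way can be removed again with `tc qdisc del dev eth0 root`, which matters when a test has to restore the network before the next scenario.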
slide-39
SLIDE 39

39

Disruptive Scenarios: Network

[Diagram: partitions 1 to 8 spread across nodes in Data Center #1 and Data Center #2]

slide-40
SLIDE 40

40

Disruptive Scenarios: Application

slide-41
SLIDE 41

41

Disruptive Scenarios: Application

  • Presentation Layer (UI)
  • Integration Layer (Kafka & ZeroMQ)
  • Business Modules
  • Data Storage & Computing (GridGain)
  • Logging, Access Granting

slide-42
SLIDE 42

42

Disruptive Scenarios: Other Scenarios

slide-43
SLIDE 43

43

Tools to start using Fault Injection

Code examples
  • https://github.com/leapsky/FaultInjectionExamples

Frameworks
  • Jepsen - https://github.com/jepsen-io/jepsen
  • Chaos Monkey - https://github.com/Netflix/SimianArmy/wiki/Chaos-Monkey

Linux Utilities
  • NetEm (tc) - https://wiki.linuxfoundation.org/networking/netem
  • stress-ng - https://manned.org/stress-ng/fd34c972
  • Iperf - https://iperf.fr
  • kill -9
  • iptables

Load testing tools
  • JMeter - https://jmeter.apache.org

Configuration Management
  • Ansible - https://docs.ansible.com
  • Puppet - https://puppet.com

slide-44
SLIDE 44

44

Lessons Learned

  • Fault Injection is the art of explicitly forcing a system to fail to make sure that it will operate correctly in failure modes.
  • No risk - no test!
  • Test results must be clear and unambiguous.
  • The closer your test environments match your production environments, the more accurate your testing will be.

slide-45
SLIDE 45

45

Thank you! Questions?

Pavel Lipsky

pavel.lipsky@gmail.com
https://github.com/jepsen-io/jepsen/tree/master/ignite
https://github.com/leapsky/