Making a Lion Bulletproof: SRE in Banking

SLIDE 1

Making a Lion Bulletproof: SRE in Banking

Robin van Zijll & Janna Brummel (@jannabrummel)
QCon NY, June 26, 2019

SLIDE 2

ING is a global financial organization, active in 41 countries.

This talk is about the retail bank of NL with…

  • 9 million debit cards
  • 8 million retail customers
  • 7 million ATM transactions/month

SLIDE 3

Mobile banking is used by 4.5 million customers. Together, they log in 6 million times a day (100+ TPS).

SLIDE 4

Availability figures 2018, prime time (06:30 AM – 01:00 AM)

  • Internet banking: 99.77% uptime, 0.22% downtime
  • Mobile banking: 99.87% uptime, 0.13% downtime
  • Regulator target: 99.88%

SLIDE 5

Logins per second for Mobile Banking

[Chart: logins per second (y-axis, 20 to 140) by time of day (x-axis, 4-hour ticks starting at 00:00)]

SLIDE 6

Availability figures 2018, 24 hours a day

  • Internet banking: 99.63% uptime, 0.37% downtime
  • Mobile banking: 99.78% uptime, 0.22% downtime
  • Customer expectation: 99.999%
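The availability percentages on these slides are easier to compare as hours of downtime. The sketch below does that conversion, assuming a 365-day year and, for prime time, the 18.5-hour daily window from 06:30 AM to 01:00 AM. The function name `downtime_hours` is our own and is not taken from the talk.

```python
PRIME_TIME_HOURS = 18.5  # 06:30 AM to 01:00 AM, as stated on the prime-time slide


def downtime_hours(availability_pct, hours_per_day=24.0, days=365):
    """Hours of downtime per year implied by an availability percentage."""
    window_hours = hours_per_day * days
    return (100.0 - availability_pct) / 100.0 * window_hours


# Figures quoted on the slides (24 hours a day unless stated otherwise).
for label, pct, hours in [
    ("Mobile banking, 24h", 99.78, 24.0),
    ("Customer expectation, 24h", 99.999, 24.0),
    ("Regulator target, prime time", 99.88, PRIME_TIME_HOURS),
]:
    print(f"{label}: {downtime_hours(pct, hours):.3f} hours/year")
```

Under these assumptions, 99.78% around the clock corresponds to roughly 19 hours of downtime a year, while a 99.999% expectation leaves a little over 5 minutes.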

SLIDE 7

SR-what?

Site Reliability Engineering is “what happens when you ask a software engineer to design an operations function” – Ben Treynor (Google)

SLIDE 8

People

SLIDE 9

At ING we are organized in tribes with (Biz)DevOps squads responsible for build and run

[Diagram: a tribe with a tribe lead, product owners and several squads, shown next to other tribes]

Our SRE team is a ‘horizontal’ squad, part of a productivity engineering tribe. We support 1700 engineers across 340 squads.

SLIDE 10

Our SRE team

  • 7 engineers (4 dev, 3 ops), 2 more joining soon
  • 1 product owner
  • 1 chapter lead

Mostly with engineering and on-call experience in ING product engineering.

SLIDE 11

When we hire SREs, we look for someone who’s

  • Passionate about reliability, problems, DevOps and open source
  • OK with failure
  • Insensitive to hierarchy
  • Willing to teach and advise engineers about reliability
  • Experienced in on-call duties and 1+ language(s) in our stack
  • Still excited to work with us after meeting half our team and having heard realistic job expectations

SLIDE 12

Process

SLIDE 13

Why and how did we start with SRE?

We used to have a small team of ops engineers on call for online channels.

These engineers were the ones up at night, but they could not structurally improve service reliability because of our DevOps model.

SRE pilot was started and supported:

  • Team was transformed and given a new purpose
  • Decided on SRE model, way of working and roadmap
  • Experiences and proposal were presented to senior management

After knowledge transfer of old tasks, SRE was launched :)

SLIDE 14

For SRE, we generally see 3 organizational models

  • Product engineering + SRE: service ownership is shared between product engineering (PE) and SRE
  • SREs embedded in product engineering: SREs are distributed and embedded in PE teams, service ownership is shared
  • SRE tribe next to product engineering: service ownership is with PE, SRE consults and creates tools (our model)
SLIDE 15

What do we do as SREs?

Service Reliability Hierarchy, from O’Reilly’s Site Reliability Engineering (2016), top to bottom:

  • Product
  • Development
  • Capacity Planning
  • Testing + Release Procedures
  • Postmortem/RCA
  • Incident Response
  • Monitoring

Curious to learn more about…

  • Learning from failure? Check out Jason’s and Ryan’s talk
  • Chaos engineering and graceful degradation? Check out Lorne’s talk
  • High impact outlier system failures? Check out Laura’s talk

SLIDE 16

What do we do as SREs?

We spend 80% of our time on engineering

  • We deliver the Reliability Toolkit: a white-box monitoring and alerting stack
  • We work on a secure container platform with a service mesh in public cloud

We spread SRE love and best practices

  • We reach out to engineers to consult and get feedback
  • We educate on reliability topics

What we don’t do

  • On-call for product engineering
  • Work on SRE-topics already covered by other teams in our organization
SLIDE 17

We do outreach and we educate on SRE topics

We educate engineers

  • Engineering onboarding
  • Prometheus workshops

We facilitate knowledge sharing

  • Cross-domain SRE guild
  • SRE demo sessions open to all
  • Guidance via chat and intranet
  • Prometheus user community
  • Conference report out

We reach out to engineers

  • Feedback loop for products
  • We are reliability advocates
SLIDE 18

When we demo, we sometimes block the hallway

SLIDE 19

We use these principles in our way of working

  • We work with industry standards
  • We work with open source products and practices
  • We automate toil wherever and whenever we can

SLIDE 20

Technology

SLIDE 21

Why did we develop the Reliability Toolkit?

  • Mean time to repair is too long: we waste time finding incident owners
  • Lack of insight into application health for teams
  • High level of technology diversity makes implementing monitoring difficult

SLIDE 22

How does the Reliability Toolkit work?

[Diagram] Components shown:

  • Applications
  • Prometheus
  • Alertmanager
  • Model Builder
  • Grafana
  • Notifications via e-mail, SMS (MessageBird) and ChatOps (Mattermost)
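One way such a stack can cut the time spent finding incident owners is to carry ownership on the alert itself and pick notification channels from it. The sketch below illustrates that idea only; the `team` label, the `OWNERS` table and the channel names are all hypothetical, and nothing here describes how the Reliability Toolkit is actually configured.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    name: str
    labels: dict = field(default_factory=dict)


# Hypothetical ownership table: value of the `team` label -> channels to notify.
OWNERS = {
    "payments": ["chat:#payments-alerts", "sms:payments-on-call"],
    "login": ["chat:#login-alerts"],
}

# Hypothetical catch-all for alerts that carry no known owner.
CATCH_ALL = ["chat:#unowned-alerts"]


def route(alert, owners=OWNERS, catch_all=CATCH_ALL):
    """Return the channels to notify, based on the alert's `team` label."""
    team = alert.labels.get("team")
    return owners.get(team, catch_all)


print(route(Alert("HighErrorRate", {"team": "login"})))
print(route(Alert("DiskAlmostFull")))
```

The catch-all keeps unlabelled alerts visible instead of dropping them, which also shows which teams still need to add ownership labels.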

SLIDE 23

How do we provision the Reliability Toolkit?

SRE team:

  • Together with a team we create a joint config
  • We maintain and update binaries
  • We deliver the Reliability Toolkit on 5 instances over 3 environments, we remain responsible
  • We deliver client libraries so metrics can be scraped from servers
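To make "metrics can be scraped" concrete: a Prometheus server periodically fetches a plain-text page from each application, and a client library's job is to keep counters and render that page. The sketch below is a standard-library-only illustration of a counter rendered in the Prometheus text exposition format. It is not ING's client library, real applications would normally use an official Prometheus client library instead, and the metric and label names are made up.

```python
class Counter:
    """A metric that only goes up, tracked per combination of label values."""

    def __init__(self, name, help_text, label_names=()):
        self.name = name
        self.help_text = help_text
        self.label_names = tuple(label_names)
        self._values = {}

    def inc(self, amount=1.0, **labels):
        if amount < 0:
            raise ValueError("counters can only increase")
        key = tuple(labels[n] for n in self.label_names)
        self._values[key] = self._values.get(key, 0.0) + amount

    def render(self):
        """Render HELP, TYPE and sample lines in the text exposition format.

        Escaping of special characters in label values is omitted for brevity.
        """
        lines = [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} counter",
        ]
        for key, value in sorted(self._values.items()):
            pairs = ",".join(f'{n}="{v}"' for n, v in zip(self.label_names, key))
            labels = "{" + pairs + "}" if pairs else ""
            lines.append(f"{self.name}{labels} {value}")
        return "\n".join(lines) + "\n"


# Hypothetical metric for a login service.
logins = Counter("logins_total", "Number of login attempts.", ["channel", "outcome"])
logins.inc(channel="mobile", outcome="success")
logins.inc(channel="mobile", outcome="success")
logins.inc(channel="web", outcome="failure")
print(logins.render(), end="")
```

In a real service this text would be served over HTTP on a metrics endpoint that Prometheus is configured to scrape.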

SLIDE 24

Before, teams would own and use a full pipeline…

Pipeline: version control → combine configurations → build → publish → deploy = Reliability Toolkit

[Diagram labels: two parts of the pipeline are marked "done by DevOps team"]

SLIDE 25

…now they only own and update config

Pipeline: version control → combine configurations → build → deploy = Reliability Toolkit

[Diagram label: one part of the pipeline is marked "done by DevOps team"]
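The "combine configurations" step can be pictured as merging small per-team fragments into one monitoring config. The sketch below assumes a hypothetical fragment shape (`team` plus a list of `jobs`) and produces a structure loosely modelled on Prometheus `scrape_configs` entries (`job_name`, `static_configs`, `targets`, `labels`). How ING's pipeline actually combines configs is not described in the slides.

```python
import json


def combine_configs(team_configs):
    """Merge per-team fragments into a single scrape configuration.

    Each fragment is assumed to look like:
        {"team": "login", "jobs": [{"job_name": "...", "targets": ["host:port"]}]}
    """
    owner_of_job = {}
    scrape_configs = []
    for fragment in team_configs:
        team = fragment["team"]
        for job in fragment["jobs"]:
            name = job["job_name"]
            if name in owner_of_job:
                raise ValueError(
                    f"job {name!r} is defined by both {owner_of_job[name]!r} and {team!r}"
                )
            owner_of_job[name] = team
            scrape_configs.append({
                "job_name": name,
                # Attach the owning team as a label on every scraped series.
                "static_configs": [
                    {"targets": list(job["targets"]), "labels": {"team": team}}
                ],
            })
    return {"scrape_configs": scrape_configs}


# Hypothetical fragments from two teams.
fragments = [
    {"team": "login", "jobs": [{"job_name": "login-api", "targets": ["login-1:9100"]}]},
    {"team": "payments", "jobs": [{"job_name": "payments-api", "targets": ["pay-1:9100", "pay-2:9100"]}]},
]
print(json.dumps(combine_configs(fragments), indent=2))
```

Rejecting duplicate job names at merge time means a team's mistake fails the pipeline early instead of silently overriding another team's monitoring.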

SLIDE 26

Increasing and improving usage of Reliability Toolkit

  • Include client libraries in engineering frameworks
  • Ensure a good feedback loop: in person or in tooling
  • Educate others during onboarding and workshops
  • Template team dashboards and make other dashboards accessible to all

SLIDE 27

And now Reliability Toolkit usage has been increasing

SLIDE 28

We made onboarding and using our Reliability Toolkit easy, but our 70 onboarded teams still need to ensure that Prometheus can scrape metrics.

How can we reach all 340 teams?

SLIDE 29

Let’s try a service mesh!

Curious? Check the Software Defined Infrastructure track

SLIDE 30

Why use service mesh to improve reliability?

  • Service mesh helps us to get new/updated functionality to applications fast
  • We can improve observability for all: metrics, logs, distributed tracing and resilience patterns based on incident learnings that work out of the box
  • We can introduce/expand A/B testing, canary releasing and staged rollouts
  • Engineers only need to worry about security at application level: immutable containers, zero trust network and security policies for free, taking away risk documentation work
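Canary releasing, mentioned in the list above, comes down to sending a small, stable share of traffic to the new version. In a service mesh this is done by the proxies through configuration rather than in application code, so the sketch below only illustrates the underlying idea. The function name `pick_version` and the use of a customer ID as the routing key are assumptions.

```python
import hashlib


def pick_version(request_key, canary_percent):
    """Deterministically assign a request key to 'canary' or 'stable'.

    The same key always lands on the same version, so one customer does not
    flip between versions from request to request.
    """
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # bucket in 0..99
    return "canary" if bucket < canary_percent else "stable"


# Staged rollout: raise the percentage step by step and watch the share move.
keys = [f"customer-{i}" for i in range(10_000)]
for percent in (1, 10, 50, 100):
    share = sum(pick_version(k, percent) == "canary" for k in keys) / len(keys)
    print(f"target {percent}% -> observed {share:.1%}")
```

Hashing the key instead of drawing a random number per request keeps assignments sticky, which makes incidents during a rollout easier to attribute to one version.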

SLIDE 31

What are we working on next?

  • Scaling in our Reliability Toolkit stack for efficient use of resources, scaling up the number of teams using our stack
  • Expanding our role as reliability advocates
  • Completing PoC with service mesh
SLIDE 32

Takeaways

  • Hire SREs from your product engineering domain
  • Never compromise on mindset in SREs
  • Start with a pilot if you are not sure if SRE works for you
  • Pick an SRE model that works well for your organization
  • Try to get senior management support and understanding
  • Invest in SRE outreach and education
  • Focus on scalability and ease-of-use in your tooling
  • Don’t be afraid of redesign if it makes users happier
SLIDE 33

Questions?

Icons used are all from flaticon.com