LESSONS LEARNED FROM REVIEWING 150 INFRASTRUCTURES_ JON TOPPER


slide-1
SLIDE 1
slide-2
SLIDE 2

LESSONS LEARNED FROM REVIEWING 150 INFRASTRUCTURES_

JON TOPPER | @jtopper | he/him/his

slide-3
SLIDE 3

$ whoami

Founder/CEO/CTO, The Scale Factory
Working in hosting/infrastructure for 20 years
Infrastructure / AWS / DevOps

slide-4
SLIDE 4

@jtopper

slide-5
SLIDE 5

REVIEWS RUN_

[Chart: reviews run over time, Mar-2018 to Jan-2020; y-axis ticks at 45, 90, 135, 180]

slide-6
SLIDE 6

TODAY’S AGENDA_

  • What is Well-Architected?
  • What is a Well-Architected Review?
  • Common Review Findings

slide-7
SLIDE 7

WHAT IS WELL-ARCHITECTED?_

slide-8
SLIDE 8

WELL ARCHITECTED ORIGINS_

  • Catalogue of emergent good practices
  • Observed by AWS Field Solutions Architects
  • Codified and shared
  • Platform agnostic*

slide-9
SLIDE 9
  • White Papers
  • Review Tool

slide-10
SLIDE 10

  • Performance Efficiency
  • Cost Optimisation
  • Operational Excellence
  • Reliability
  • Security

slide-11
SLIDE 11

Lenses

  • IoT (Internet of Things)
  • High Performance Computing
  • Serverless Applications

slide-12
SLIDE 12

USING WELL-ARCHITECTED_

  • Gap analysis / planning
  • Teaching
  • Team alignment

slide-13
SLIDE 13

WHAT IS A WELL-ARCHITECTED REVIEW?_

slide-14
SLIDE 14

WELL ARCHITECTED REVIEW_

  • Foundational questions
  • Up to 4 hours
  • Qualitative

slide-15
SLIDE 15

Number of questions per pillar, by lens:

Lens                         Operational  Security  Reliability  Performance  Cost          Total
                             Excellence                          Efficiency   Optimisation
Well Architected Core        9            11        9            8            9             46
Serverless Applications      2            3         2            1            1             9
High Performance Computing   4            3         3            4            2             16
IoT (Internet of Things)     4            11        6            10           4             35
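
The four totals on this slide are the row sums of the per-pillar question counts, which is quick to confirm. A minimal Python sketch, assuming the twenty counts are grouped five per lens in the order they appear on the slide (the names `QUESTION_COUNTS`, `STATED_TOTALS` and `lens_totals` are illustrative and not from the talk):

```python
from typing import Dict, List

# Per-pillar question counts for each lens, in the order shown on the slide.
# ASSUMPTION: the twenty numbers group as five consecutive counts per lens.
QUESTION_COUNTS: Dict[str, List[int]] = {
    "Well Architected Core": [9, 11, 9, 8, 9],
    "Serverless Applications": [2, 3, 2, 1, 1],
    "High Performance Computing": [4, 3, 3, 4, 2],
    "IoT (Internet of Things)": [4, 11, 6, 10, 4],
}

# Totals as stated on the slide.
STATED_TOTALS: Dict[str, int] = {
    "Well Architected Core": 46,
    "Serverless Applications": 9,
    "High Performance Computing": 16,
    "IoT (Internet of Things)": 35,
}


def lens_totals(counts: Dict[str, List[int]]) -> Dict[str, int]:
    """Sum the per-pillar question counts for each lens."""
    return {lens: sum(per_pillar) for lens, per_pillar in counts.items()}


if __name__ == "__main__":
    for lens, total in lens_totals(QUESTION_COUNTS).items():
        print(f"{lens}: {total} questions (slide says {STATED_TOTALS[lens]})")
```

Under that grouping every computed total matches the figure on the slide.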

slide-16
SLIDE 16

QUESTION OPS 1_

  • Evaluate external customer needs
  • Evaluate internal customer needs
  • Evaluate compliance requirements
  • Evaluate threat landscape
  • Evaluate tradeoffs
  • Manage benefits and risks
  • None of these

How do you determine what your priorities are?

slide-17
SLIDE 17

QUESTION OPS 1_

  • Evaluate external customer needs
  • Evaluate internal customer needs
  • Evaluate compliance requirements
  • Evaluate threat landscape
  • Evaluate tradeoffs
  • Manage benefits and risks
  • None of these

How do you determine what your priorities are?

NI WA CI WA WA NI NI

slide-18
SLIDE 18

QUESTION OPS 1_

  • Evaluate external customer needs
  • Evaluate internal customer needs
  • Evaluate compliance requirements
  • Evaluate threat landscape
  • Evaluate tradeoffs
  • Manage benefits and risks
  • None of these

How do you determine what your priorities are?

NI WA CI WA WA NI NI

High Risk

slide-19
SLIDE 19

QUESTION OPS 1_

  • Evaluate external customer needs
  • Evaluate internal customer needs
  • Evaluate compliance requirements
  • Evaluate threat landscape
  • Evaluate tradeoffs
  • Manage benefits and risks
  • None of these

How do you determine what your priorities are?

NI WA CI WA WA NI NI

Medium Risk

slide-20
SLIDE 20

QUESTION OPS 1_

  • Evaluate external customer needs
  • Evaluate internal customer needs
  • Evaluate compliance requirements
  • Evaluate threat landscape
  • Evaluate tradeoffs
  • Manage benefits and risks
  • None of these

How do you determine what your priorities are?

NI WA CI WA WA NI NI

Medium Risk

slide-21
SLIDE 21

QUESTION OPS 1_

  • Evaluate external customer needs
  • Evaluate internal customer needs
  • Evaluate compliance requirements
  • Evaluate threat landscape
  • Evaluate tradeoffs
  • Manage benefits and risks
  • None of these

How do you determine what your priorities are?

NI WA CI WA WA NI NI

Well Architected
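
The build-up across these slides suggests that each choice carries a tag and that the set of selected choices produces a rating. The slides do not spell out the rule or what NI, WA and CI stand for, so the Python sketch below is only one hypothetical reading: a missing CI-tagged choice gives High Risk, a missing WA-tagged choice gives Medium Risk, and NI-tagged choices do not affect the rating. Pairing the tags with the choices in slide order is also an assumption, and `rate_answer` is an illustrative name, not part of any AWS tool.

```python
from typing import Dict, Set

NONE_OF_THESE = "None of these"

# ASSUMPTION: tags are paired with choices in the order both appear on the slide.
OPS1_TAGS: Dict[str, str] = {
    "Evaluate external customer needs": "NI",
    "Evaluate internal customer needs": "WA",
    "Evaluate compliance requirements": "CI",
    "Evaluate threat landscape": "WA",
    "Evaluate tradeoffs": "WA",
    "Manage benefits and risks": "NI",
    NONE_OF_THESE: "NI",
}


def rate_answer(tags: Dict[str, str], selected: Set[str]) -> str:
    """Rate one question under the hypothetical rule described above."""
    if NONE_OF_THESE in selected:
        return "High Risk"
    # Tags of every real choice that was left unselected.
    missing_tags = [
        tag
        for choice, tag in tags.items()
        if choice != NONE_OF_THESE and choice not in selected
    ]
    if "CI" in missing_tags:
        return "High Risk"
    if "WA" in missing_tags:
        return "Medium Risk"
    return "Well Architected"


if __name__ == "__main__":
    print(rate_answer(OPS1_TAGS, set()))
    print(rate_answer(OPS1_TAGS, {"Evaluate compliance requirements"}))
```

Keeping the tags in a plain mapping means the same function could be tried against any other question by swapping in that question's choices and tags.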

slide-22
SLIDE 22

COMMON REVIEW FINDINGS_

slide-23
SLIDE 23

THE GOOD_

slide-24
SLIDE 24

QUESTION OPS 1_

  • Evaluate external customer needs (93%)
  • Evaluate internal customer needs (87%)
  • Evaluate compliance requirements (90%)
  • Evaluate threat landscape (85%)
  • Evaluate tradeoffs (89%)
  • Manage benefits and risks (89%)
  • None of these (0%)

How do you determine what your priorities are?

NI WA CI WA WA NI NI

Well Architected 77%

WA Rank: 1

slide-25
SLIDE 25

QUESTION PERF 3_

  • Understand storage characteristics and requirements (84%)
  • Evaluate available configuration options (78%)
  • Make decisions based on access patterns and metrics (73%)
  • None of these (5%)

How do you select your storage solution?

WA CI NI NI

Well Architected 70%

WA Rank: 2

slide-26
SLIDE 26

QUESTION REL 5_

  • Deploy changes in a planned manner (83%)
  • Deploy changes with automation (67%)
  • None of these (6%)

How do you implement change?

WA CI NI

Well Architected 63%

WA Rank: 3
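
The WA Rank on these slides appears to order questions by the share of reviews rated Well Architected. A small Python sketch of that ordering, using only the three percentages quoted for OPS 1, PERF 3 and REL 5; reading the rank as a descending sort on that share is an assumption, and `rank_by_share` is an illustrative name:

```python
from typing import Dict

# Share of reviews rated Well Architected, as quoted on the slides (percent).
WELL_ARCHITECTED_SHARE: Dict[str, int] = {
    "REL 5": 63,
    "OPS 1": 77,
    "PERF 3": 70,
}


def rank_by_share(shares: Dict[str, int]) -> Dict[str, int]:
    """Return a 1-based rank per question, highest share first."""
    ordered = sorted(shares.items(), key=lambda item: item[1], reverse=True)
    return {question: position for position, (question, _) in enumerate(ordered, start=1)}


if __name__ == "__main__":
    for question, rank in rank_by_share(WELL_ARCHITECTED_SHARE).items():
        print(f"{question}: WA Rank {rank}")
```

With these three inputs the sort reproduces the ranks shown on the slides (OPS 1 first, PERF 3 second, REL 5 third). Ties are not handled specially; equal shares keep their input order.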

slide-27
SLIDE 27

THE BAD_

slide-28
SLIDE 28

QUESTION REL 9_

  • Define recovery objectives for downtime and data loss (33%)
  • Use defined recovery strategies to meet the recovery objectives (33%)
  • Test disaster recovery implementation to validate the implementation (25%)
  • Manage configuration drift on all changes (39%)
  • Automate recovery (16%)
  • None of these (31%)

How do you plan for disaster recovery?

WA CI NI WA WA NI

High Risk 79%

(87%)

HRI Rank: 1

slide-29
SLIDE 29

QUESTION SEC 11_

  • Identify key personnel and external resources (51%)
  • Identify tooling (27%)
  • Develop incident response plans (39%)
  • Automate containment capability (0%)
  • Identify forensic capabilities (11%)
  • Pre-provision access (27%)
  • Pre-deploy tools (10%)
  • Run game days (3%)
  • None of these (35%)

How do you respond to a [security] incident?

WA CI NI WA WA NI NI NI NI

High Risk 75%

(93%)

HRI Rank: 2

slide-30
SLIDE 30

QUESTION SEC 8_

  • Define data classification requirements (61%)
  • Define data protection controls (39%)
  • Implement data identification (17%)
  • Automate identification and classification (4%)
  • Identify the types of data (59%)
  • None of these (23%)

How do you classify your data?

WA CI WA WA NI NI

High Risk 75%

(88%)

HRI Rank: 3

slide-31
SLIDE 31

QUESTION COST 9_

  • Establish a cost optimisation function (34%)
  • Develop a workload review process (26%)
  • Review and implement services in an unplanned way (84%)
  • Review and analyse this workload regularly (43%)
  • Keep up to date with new service releases (63%)
  • None of these (1%)

How do you evaluate new services?

WA CI WA NI NI NI

High Risk 71%

(79%)

HRI Rank: 4

slide-32
SLIDE 32

QUESTION REL 8_

  • Use playbooks for unanticipated failures (25%)
  • Conduct root cause analysis and share results (73%)
  • Inject failures to test resiliency (6%)
  • Conduct game days regularly (0%)
  • None of these (16%)

How do you test resilience?

WA CI WA NI NI

High Risk 67%

(92%)

HRI Rank: 5

slide-33
SLIDE 33

THE NOTABLE_

slide-34
SLIDE 34

QUESTION OPS 3_

  • Use version control (90%)
  • Test and validate changes (87%)
  • Use config management systems (78%)
  • Use build/deploy systems (82%)
  • Perform patch management (37%)
  • Share design standards (57%)
  • Implement practices to improve code quality (83%)
  • Use multiple environments (81%)
  • Make frequent, small, reversible changes (63%)
  • Fully automate integration and deployment (52%)
  • None of these (3%)

How do you reduce defects, ease remediation, and improve flow into production?

NI WA WA NI NI NI NI NI NI NI CI

Well Architected 14%

WA Rank: 23

slide-35
SLIDE 35

QUESTION OPS 6_

  • Identify key performance indicators (53%)
  • Define workload metrics (62%)
  • Collect and analyse workload metrics (72%)
  • Establish workload metric baselines (51%)
  • Learn expected patterns of activity for workload (54%)
  • Alert when workload outcomes are at risk (40%)
  • Alert when workload anomalies are detected (34%)
  • Validate the achievement of outcomes and the effectiveness of KPIs and metrics (37%)
  • None of these (14%)

How do you understand the health of your workload?

WA WA NI NI NI NI NI CI WA

Well Architected 46%

WA Rank: 21

slide-36
SLIDE 36

QUESTION SEC 2_

  • Define human access requirements (70%)
  • Grant least privileges (58%)
  • Allocate unique credentials per person (90%)
  • Manage credentials based on lifecycle (70%)
  • Automate credential management (13%)
  • Grant access through roles or federation (62%)
  • None of these (3%)

How do you control human access?

WA CI WA WA NI NI NI

High Risk 47%

(88%)

HRI Rank: 20

slide-37
SLIDE 37

QUESTION SEC 3_

  • Define programmatic access requirements (40%)
  • Grant least privileges (70%)
  • Automate credential management (24%)
  • Allocate unique credentials per component (68%)
  • Grant access through roles or federation (58%)
  • Implement dynamic authentication (22%)
  • None of these (13%)

How do you control programmatic access?

WA CI WA NI NI NI NI

High Risk 57%

(89%)

HRI Rank: 15

slide-38
SLIDE 38

MAJOR THEMES_

slide-39
SLIDE 39

TEAMS ARE OK AT CHOOSING CORRECT SERVICES_

  • Database choices match workload
  • Storage choices match workload
  • Compute choices sometimes not right-sized

slide-40
SLIDE 40

TEAMS ARE OK AT MAKING SOFTWARE CHANGES_

  • Automation tools are being used
  • Full CD remains out of reach
  • Change batch sizes need to be smaller

slide-41
SLIDE 41

https://services.google.com/fh/files/misc/state-of-devops-2019.pdf

slide-42
SLIDE 42

TEAMS ARE BAD AT THINKING ABOUT FAILURE MODES_

  • Not considering business requirements
  • No risk analysis of failure modes
  • Poor documentation
  • Almost no attempt to rehearse outages

slide-43
SLIDE 43

@jtopper

slide-44
SLIDE 44

TEAMS ARE BAD AT MONITORING FOR FAILURE MODES_

  • Monitoring happening
  • Data not used for much
  • Tracing almost non-existent

slide-45
SLIDE 45

TEAMS NEED TO DO BETTER AT SECURITY_

  • Poor hygiene around patching
  • Limited data classification
  • Mediocre human access control
  • Bad programmatic access control
  • Low adoption of security monitoring tools

slide-46
SLIDE 46

TOP BREACH CAUSES_

  • Using components with known vulnerabilities
  • Security misconfiguration
  • Injection
  • Weak auth / session management
  • Missing function access control

https://snyk.io/blog/owasp-top-10-breaches/

slide-47
SLIDE 47

EVERYONE IS BETTER AT BUILDING PLATFORMS THAN THEY ARE AT SECURING OR RUNNING THEM_

slide-48
SLIDE 48

WHAT NEXT?_

Read the white papers:

https://aws.amazon.com/architecture/well-architected/

Run your own review(s):

https://aws.amazon.com/well-architected-tool/

Consider engaging an AWS Well-Architected partner:

https://scalefactory.com/services/well-architected/ (funding available)

slide-49
SLIDE 49

KEEP IN TOUCH_

http://www.scalefactory.com/
https://github.com/scalefactory
@scalefactory
jon@scalefactory.com