From Development to Production: Many Uses of Serverless Observability (PowerPoint presentation)



SLIDE 1

From Development to Production: Many Uses of Serverless Observability

EMRAH SAMDAN | SEPTEMBER 9, 2019

Community Day 2019 Sponsors

SLIDE 2

@emrahsamdan

Who am I?

  • Developer for 6+ years
  • Product manager for 2 years
  • VP of Product for Thundra
  • Organizing committee member, ServerlessDays İstanbul (October 3rd!)

SLIDE 3

Agenda

  • Let’s define serverless (yes, once again!)
  • Is observability a buzzword or a real thing?
  • Observability challenges in serverless
  • Observability-Driven Development
  • How and when to test serverless applications
  • What to check when monitoring a serverless stack
  • Troubleshooting serverless applications

SLIDE 4

What’s serverless?

Serverless computing is a cloud-computing execution model in which the cloud provider runs the server and dynamically manages the allocation of machine resources. Pricing is based on the actual amount of resources consumed by an application, rather than on pre-purchased units of capacity.

Wikipedia: https://en.wikipedia.org/wiki/Serverless_computing

SLIDE 5

What is serverless?

An operational construct? Things that run perfectly without me having to manage them. Are Stripe and Auth0 serverless?

SLIDE 6

SLIDE 7

What’s serverless?

Utility computing: I only pay for what I use.

SLIDE 8

What’s serverless?

A doctrine, a thought model helping you deliver faster and put your focus on the value you provide to your customers.

Ben Kehoe, Paul Johnston

SLIDE 9

I agree with you all. But! All the upsides can turn into downsides when you don’t pay attention to what’s really happening in your serverless applications.

SLIDE 10

Shared Responsibility Model

The cloud vendor handles scalability and reliability. But performance and security are still ON US.

SLIDE 11

Serverless Observability

  • Serverless is full of hidden traps that can undermine its promise.
  • It can be very costly.
  • It can perform really poorly.
  • You need to check what’s going on.

SLIDE 12

What’s observability?

https://medium.com/@copyconstruct/monitoring-and-observability-8417d1952e1c

SLIDE 13

Are we ready for unknown unknowns?

  • Known knowns: things that we are aware of and understand. Follow the metric charts.
  • Known unknowns: things that we are aware of but don’t understand at a glance. Yes, there is a peak over there. Let me dig into the traces and logs.
  • Unknown knowns: things that we would understand but are not aware of. I would have fixed this if I had had that metric chart :(
  • Unknown unknowns: things that we are neither aware of nor understand. Things are kaputt (broken), and with what I have, I have no freaking idea why.

SLIDE 14

The pillars

  • Traces
  • Metrics
  • Logs
  • Visualization
  • Machine Learning and Insights
  • Alerts

SLIDE 15

Observability Challenges in Serverless

  • No access to the underlying infrastructure.
  • You either take whatever the cloud vendor provides or accept that there will be an overhead.
  • Overhead?
    ○ Gathering intelligence (should be acceptable).
    ○ Transporting it to where it is needed (you only have the invocation lifetime to get the data out).
  • Everything is event-driven and distributed.

SLIDE 16

Fix the Charizard. If you can!

SLIDE 17

Observability-Driven Development

SLIDE 18

Observability-Driven Development

  • Recall the unknown unknowns.
  • What do you need when you have to troubleshoot an unknown unknown?
  • Can an observability tool know your unknowns?
  • If you don’t know what you need to know, what can you do?


SLIDE 19

ASK

Your observability tool should give you automatic answers. But it should also let you ask smart questions of your own.

SLIDE 20

Observability-Driven Development

  • Not a replacement for test-driven development.
  • Think about the answers you can give to any type of question.
  • If there are questions your tool can’t answer yet, request that feature from it.
  • Structured logging and manual instrumentation are the key.

Retrieved from: https://dzone.com/articles/what-is-structured-logging
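
To make the structured logging point concrete, here is a minimal Python sketch. It assumes a function that writes one JSON object per log line to stdout, where a log collector could pick it up. The `JsonFormatter` and `make_logger` names and the `fields` convention are hypothetical and not part of any particular tool. The sketch uses only the standard `logging` and `json` modules.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Merge structured fields passed via extra={"fields": {...}}.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, sort_keys=True)


def make_logger(name: str, stream=sys.stdout) -> logging.Logger:
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]  # avoid stacking handlers on warm starts
    logger.propagate = False
    return logger


logger = make_logger("checkout")
logger.info("payment processed",
            extra={"fields": {"order_id": "o-123", "amount_cents": 4200, "duration_ms": 87}})
```

Every line is a JSON object with named fields such as `order_id`. You can later filter and aggregate on those fields, which lets you ask new questions without adding new log statements.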

SLIDE 21

Observability-Driven Development (Cons)

  • Observability coverage?
  • Hard to get used to.
  • You can’t sample anything.

SLIDE 22

TESTING

SLIDE 23

Testing challenges in Serverless

  • Local testing is a pain.
    ○ How do you mock the cloud resources? Is it even correct to mock them?
    ○ How should you test a chain of many invocations?
    ○ How do you integrate it with CI/CD tools?
  • Integration testing with real resources is still the best approach, but again: how?
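
One common answer to "how do you mock the cloud resources?" is dependency injection: pass the resource client into the handler so a test can swap in a mock. Below is a minimal sketch of a hypothetical handler that writes to a DynamoDB-style table through a `put_item(Item=...)` call. A real Lambda entry point only receives `(event, context)`, so the extra `table` parameter is an assumption made for testability.

```python
from unittest.mock import MagicMock


def handler(event, context, table):
    """Hypothetical handler that stores an order; the table client is injected."""
    order = {"id": event["order_id"], "total": event["total"]}
    table.put_item(Item=order)
    return {"statusCode": 201, "body": order["id"]}


def test_handler_writes_order():
    fake_table = MagicMock()
    response = handler({"order_id": "o-1", "total": 42}, None, table=fake_table)
    # The mock only proves the call shape, not that a real table accepts it.
    fake_table.put_item.assert_called_once_with(Item={"id": "o-1", "total": 42})
    assert response == {"statusCode": 201, "body": "o-1"}


test_handler_writes_order()
```

A passing test like this says nothing about permissions, item size limits, or how the real service behaves. That gap is why integration testing with real resources remains necessary.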

SLIDE 24

Integration testing

  • Serverless != Functions
  • Test your business logic against the real resources.
  • See how your messages are transformed along the flow.
  • Async events can cause problems that you would never guess.

SLIDE 25

Integration Testing (Cons)

  • You are still dealing only with known knowns.
  • Some resources are not pay-per-use.
    ○ Setting up a test environment. Still?
  • Not tested with production data.

SLIDE 26

Chaos Testing Serverless Applications

  • Serverless is a great fit for chaos engineering because it is
    ○ Distributed.
    ○ Full of possible failures in an async environment.
    ○ Event-driven (so, poison events).
    ○ Governed by roles and permissions so granular that access can slip away.

SLIDE 27

Chaos testing on serverless: what?

  • What would happen if an inner Lambda function starts to respond slowly?
  • Are you sure that you have properly tuned your timeouts?
  • Test by injecting latency.
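
Here is a minimal Python sketch of latency injection. It follows the spirit of the chaos injection libraries linked on the "How?" slide but does not use their actual API. A hypothetical `inject_latency` decorator delays a handler for a configurable fraction of invocations, so you can watch whether the calling function's timeouts hold. The delay and rate values are illustrative.

```python
import functools
import random
import time


def inject_latency(delay_seconds: float, rate: float = 1.0, rng=random.random):
    """Hypothetical decorator: sleep before the handler runs for a fraction `rate` of calls."""
    def decorator(handler):
        @functools.wraps(handler)
        def wrapper(event, context):
            if rng() < rate:  # random.random() is in [0, 1), so rate=1.0 always injects
                time.sleep(delay_seconds)
            return handler(event, context)
        return wrapper
    return decorator


@inject_latency(delay_seconds=0.2, rate=1.0)
def inner_handler(event, context):
    return {"ok": True}


start = time.monotonic()
result = inner_handler({}, None)
elapsed = time.monotonic() - start
print(f"inner call took {elapsed:.2f}s")
```

Calling the decorated inner function from its real caller shows how the caller reacts: it may time out gracefully, retry, or hang until its own timeout.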

SLIDE 28

Chaos testing on serverless: what?

  • What if we lose the connection to Redis?
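
A test can rehearse a lost Redis connection without touching real infrastructure by swapping the client for a fake whose connection can be cut. Below is a minimal sketch of a read-through cache with a database fallback. `FlakyCache`, `load_profile` and `load_user_from_db` are hypothetical. Real clients raise their own exception types, so in practice the `except` clause would need to match those.

```python
class FlakyCache:
    """Hypothetical stand-in for a Redis client whose connection can be cut."""

    def __init__(self):
        self.data = {}
        self.connected = True

    def get(self, key):
        if not self.connected:
            raise ConnectionError("cache unreachable")
        return self.data.get(key)


def load_user_from_db(user_id):
    """Hypothetical slower source of truth."""
    return {"id": user_id, "source": "db"}


def load_profile(user_id, cache, load_from_db=load_user_from_db):
    """Read-through lookup that degrades to the database when the cache is down."""
    try:
        cached = cache.get(user_id)
        if cached is not None:
            return cached
    except ConnectionError:
        # In production, also emit a metric or log line so the outage is visible.
        pass
    return load_from_db(user_id)


cache = FlakyCache()
cache.data["u1"] = {"id": "u1", "source": "cache"}
print(load_profile("u1", cache))  # served from the cache
cache.connected = False           # chaos: drop the connection
print(load_profile("u1", cache))  # falls back to the database
```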

SLIDE 29

Chaos testing on serverless. How?

  • https://github.com/adhorn/aws-lambda-chaos-injection
  • https://github.com/gunnargrosch/

Adrian Hornsby, Gunnar Grosch

SLIDE 30

MONITORING

How large should my screen be to see the charts for thousands of functions?

SLIDE 31

How to discover an anomaly in serverless?

SLIDE 32

SLIDE 33

Serverless is more than functions, and so is monitoring.

  • Issues can stay local before you notice them.
  • It is slow. Why?
    ○ An API slowdown?
    ○ Throttling on some resource?
    ○ A bad coding practice?
  • Invocation counts go crazy.
    ○ A seasonal peak?
    ○ A successful product?
    ○ Retry storms?
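
The talk does not prescribe an anomaly detection method. One simple illustration is a rolling z-score, which flags an invocation count that sits far outside the recent window. Below is a minimal sketch; `find_anomalies`, the window size, and the threshold are hypothetical choices.

```python
from statistics import mean, pstdev


def find_anomalies(counts, window=5, threshold=3.0):
    """Flag indices whose z-score against the preceding `window` points exceeds `threshold`."""
    anomalies = []
    for i in range(window, len(counts)):
        history = counts[i - window:i]
        mu, sigma = mean(history), pstdev(history)
        if sigma == 0:
            # Perfectly flat history: any change at all is notable.
            if counts[i] != mu:
                anomalies.append(i)
            continue
        if abs(counts[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


per_minute = [100, 104, 98, 101, 99, 102, 100, 950, 103]
print(find_anomalies(per_minute))  # → [7]
```

After a spike, the spike itself inflates the next window's mean and deviation. Production tools use sturdier baselines, such as seasonality-aware models or medians, partly for this reason. The sketch also cannot tell you whether the cause is a seasonal peak, a successful product, or a retry storm; that still takes traces and logs.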

SLIDE 34

SLIDE 35

Monitoring Latency

  • Abnormal latency is mostly not related to the function code.
    ○ Idly waiting for a third-party API.
    ○ A throttled resource.
  • Set aggregated alerts
    ○ Alert on transaction duration.
    ○ Alert on function duration.
    ○ Alert on operation duration.
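
Below is a minimal sketch of aggregated latency alerts at the three levels listed above. It assumes durations for the current window have already been collected per transaction, function, and operation. It alerts on a nearest-rank 95th percentile rather than on single invocations. The level names, thresholds, and the choice of p95 are hypothetical.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile (p in 0..100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def latency_alerts(durations_ms, thresholds_ms, p=95):
    """Return (level, observed) pairs whose p-th percentile exceeds the level's threshold."""
    alerts = []
    for level, samples in durations_ms.items():
        if not samples:
            continue
        observed = percentile(samples, p)
        if observed > thresholds_ms[level]:
            alerts.append((level, observed))
    return alerts


window = {
    "transaction": [320, 410, 380, 2900, 350],
    "function": [120, 130, 125, 140, 118],
    "operation:payments-api": [80, 95, 2500, 2600, 90],
}
limits = {"transaction": 1000, "function": 500, "operation:payments-api": 300}
print(latency_alerts(window, limits))
# → [('transaction', 2900), ('operation:payments-api', 2600)]
```

Here the function-level alert stays quiet while the transaction and the payments API operation fire. This matches the slide's point that abnormal latency often lives outside the function code. With only five samples, p95 equals the maximum; real windows would be larger.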

SLIDE 36

Storm of retries and errors

  • When your code fails for some reason, the invocation may be retried several times.
    ○ Sync events: you should control the retries yourself.
    ○ Async events: different retry mechanisms apply.
    ○ Stream-based events: risk of losing data.
  • Does retrying solve the problem?
  • Check
    ○ Iterator age
    ○ Number of retries
    ○ Number of errors
    ○ Memory usage
    ○ Cold starts
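
The checklist above can be turned into an automated check. Below is a minimal sketch that assumes the per-function metrics for one window are already available. The `FunctionMetrics` fields and all thresholds are hypothetical and would need tuning per function.

```python
from dataclasses import dataclass


@dataclass
class FunctionMetrics:
    """One window of metrics for a single function (field names are illustrative)."""
    iterator_age_ms: float
    retries: int
    errors: int
    invocations: int
    max_memory_used_mb: float
    memory_size_mb: float
    cold_starts: int


def health_warnings(m: FunctionMetrics,
                    max_iterator_age_ms: float = 60_000,
                    max_retry_rate: float = 0.05,
                    max_error_rate: float = 0.05,
                    max_memory_ratio: float = 0.9,
                    max_cold_start_rate: float = 0.1) -> list:
    """Compare a metrics window against hypothetical thresholds and list what looks wrong."""
    warnings = []
    calls = max(m.invocations, 1)  # avoid division by zero on idle windows
    if m.iterator_age_ms > max_iterator_age_ms:
        warnings.append("iterator age")  # stream consumer is falling behind
    if m.retries / calls > max_retry_rate:
        warnings.append("retries")
    if m.errors / calls > max_error_rate:
        warnings.append("errors")
    if m.max_memory_used_mb / m.memory_size_mb > max_memory_ratio:
        warnings.append("memory usage")
    if m.cold_starts / calls > max_cold_start_rate:
        warnings.append("cold starts")
    return warnings


window = FunctionMetrics(iterator_age_ms=240_000, retries=180, errors=200,
                         invocations=1_000, max_memory_used_mb=250,
                         memory_size_mb=256, cold_starts=12)
print(health_warnings(window))  # → ['iterator age', 'retries', 'errors', 'memory usage']
```

Several warnings firing together, such as a growing iterator age with high retry and error rates, is the typical signature of a retry storm on a stream.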

SLIDE 37

TROUBLESHOOTING

Bad things happen in serverless, too. Now, it’s time to battle!

SLIDE 38

Failure modes of serverless

  • https://github.com/adhorn/aws-lambda-chaos-injection
  • https://github.com/gunnargrosch/

SLIDE 39

Failure modes of serverless

  • Badly tuned memory
  • Timeout
  • Error in code or in a managed resource

SLIDE 40

Consequences

  • Downtime
  • Huge bills
  • Unhappy customers

SLIDE 41

Challenges of Troubleshooting

  • How do you trace distributed async events when traces, metrics and logs are not aggregated?
  • How do you trace requests to external resources?
  • How do you trace the async distributed events?

SLIDE 42

Distributed Tracing

  • Trace the distributed transactions: chains of multiple invocations.
  • Understand what is wrong at a glance.
  • But?! What if I have a bad coding practice in the code?
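
The core idea behind tracing a chain of invocations is to carry one trace ID through every hop and tag each span with it. Below is a minimal sketch that uses synchronous in-process calls as stand-ins for Lambda-to-Lambda invocations. The `traced_handler` decorator and the `trace_id` event field are hypothetical. With queues or topics, the ID would travel in the message payload or its attributes instead. Real tracing tools do this propagation automatically and also record parent/child relationships between spans.

```python
import functools
import time
import uuid

SPANS = []  # in-memory stand-in for a tracing backend


def traced_handler(name):
    """Hypothetical decorator: reuse the caller's trace ID (or start a trace) and record a span."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(event, context=None):
            trace_id = event.get("trace_id") or str(uuid.uuid4())
            event = {**event, "trace_id": trace_id}  # make the ID visible to the handler
            start = time.monotonic()
            try:
                return fn(event, context)
            finally:
                SPANS.append({
                    "trace_id": trace_id,
                    "function": name,
                    "duration_ms": (time.monotonic() - start) * 1000,
                })
        return wrapper
    return decorator


@traced_handler("validate-order")
def validate_order(event, context):
    # Forward the trace ID with the message sent to the next function.
    return process_payment({"order_id": event["order_id"], "trace_id": event["trace_id"]}, context)


@traced_handler("process-payment")
def process_payment(event, context):
    return {"paid": event["order_id"]}


validate_order({"order_id": "o-7"})
for span in SPANS:
    print(span["trace_id"], span["function"], round(span["duration_ms"], 2))
```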

SLIDE 43

Local Tracing

  • Instrument the code itself and check it against code quality.
  • Good for
    ○ Discovering bad coding practices.
    ○ Seeing the values of local variables in the code.
    ○ Debugging the code without breakpoints.
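
Local tracing goes one level deeper than distributed tracing: it records what happens inside a single invocation. Below is a minimal sketch built around a hypothetical `trace_locally` decorator that records each call's arguments, result, and duration. The sample `cart_total` function shows how such a trace exposes a bad practice, one price lookup per item, without setting breakpoints. The sketch does not record calls that raise; an exception-capturing variant would be needed for that.

```python
import functools
import time

CALLS = []  # in-memory stand-in for where method-level traces would be sent


def trace_locally(fn):
    """Hypothetical decorator: record arguments, return value and duration of each call."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        CALLS.append({
            "method": fn.__name__,
            "args": args,
            "kwargs": kwargs,
            "result": result,
            "duration_ms": round((time.perf_counter() - start) * 1000, 3),
        })
        return result
    return wrapper


@trace_locally
def fetch_price(sku):
    return {"A1": 10, "B2": 25}[sku]


@trace_locally
def cart_total(skus):
    # Bad practice made visible: one lookup per item instead of one batched call.
    return sum(fetch_price(s) for s in skus)


cart_total(["A1", "B2", "A1"])
for call in CALLS:
    print(call["method"], call["args"], call["result"])
```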

SLIDE 44

Actionable Alerts in Serverless

  • Alert on code errors with
    ○ The stack trace
    ○ The line of code that caused it
    ○ The values of local variables
  • Alert on latencies and timeout errors
    ○ Slow API communication
    ○ Slow DB interactions caused by bad queries
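
An error alert becomes actionable when it carries the stack trace, the failing line, and the local variables at that point. Below is a minimal Python sketch that builds the alert from a caught exception. `build_error_alert`, the payload keys, and the sample `apply_discount` function are hypothetical. In practice, scrub secrets and personal data from the captured locals before sending them anywhere.

```python
import traceback


def build_error_alert(exc: BaseException) -> dict:
    """Collect the stack trace, the failing line, and that frame's local variables."""
    tb = exc.__traceback__
    while tb.tb_next is not None:  # walk to the innermost frame, where the error was raised
        tb = tb.tb_next
    frame = tb.tb_frame
    return {
        "error": f"{type(exc).__name__}: {exc}",
        "stacktrace": "".join(traceback.format_exception(type(exc), exc, exc.__traceback__)),
        "function": frame.f_code.co_name,
        "line": tb.tb_lineno,
        "locals": {name: repr(value) for name, value in frame.f_locals.items()},
    }


def apply_discount(order):
    rate = order["discount_pct"] / 100
    return order["total"] * (1 - rate) / order["quantity"]


try:
    apply_discount({"total": 50, "discount_pct": 10, "quantity": 0})
except ZeroDivisionError as exc:
    alert = build_error_alert(exc)
    print(alert["error"], alert["function"], alert["line"])
    print(alert["locals"])
```

The alert names the function and line. It also shows that `order` had `quantity` 0, which is usually enough to start fixing the bug without reproducing it.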

SLIDE 45

How to respond to the issues on serverless

  • The issue may not be in your code.
  • Check the third parties.
  • Check other metrics
    ○ Iterator age of streams
    ○ Throttles on resources
  • Have some runbooks
    ○ Exponential backoff for APIs, alternative APIs
    ○ Healthy on-call structures
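
The backoff runbook item can be sketched as a retry helper. It retries the primary API with exponential backoff and full jitter, then makes a single attempt against an alternative API. The names, attempt counts, and delays below are hypothetical. In a Lambda function, every second spent backing off counts against the function's timeout, so keep the total budget well below it.

```python
import random
import time


def call_with_backoff(primary, fallback=None, max_attempts=4,
                      base_delay=0.1, max_delay=2.0, sleep=time.sleep):
    """Retry `primary` with capped exponential backoff and full jitter, then try `fallback` once."""
    last_error = None
    for attempt in range(max_attempts):
        try:
            return primary()
        except Exception as error:  # narrow this to the client's transient errors in real code
            last_error = error
            if attempt < max_attempts - 1:
                # Full jitter: wait a random time up to the capped exponential delay.
                sleep(random.uniform(0, min(max_delay, base_delay * 2 ** attempt)))
    if fallback is not None:
        return fallback()
    raise last_error


attempts = []


def flaky_rates_api():
    """Hypothetical primary API that keeps timing out."""
    attempts.append(1)
    raise TimeoutError("rates API timed out")


def backup_rates_api():
    """Hypothetical alternative API."""
    return {"USD_EUR": 0.9}


result = call_with_backoff(flaky_rates_api, backup_rates_api, sleep=lambda s: None)
print(result, len(attempts))
```

Jitter spreads retries from many concurrent invocations over time. Without it, they tend to hit the struggling API in synchronized waves and feed a retry storm.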

SLIDE 46

Key Takeaways

  • Serverless observability is not an after-production issue.
  • Observability with all three pillars aggregated is life-saving.
  • Automation is king! But get yourself ready to ask questions with ODD.
  • Sadly, no testing scenario is sufficient in serverless. Step into chaos engineering before your engineering runs into chaos!
  • Change the way you monitor your system. Look beyond functions and discover local bottlenecks with an architectural view!
  • Serverless transaction = a chain of invocations commuting between resources and APIs. Full tracing required!
  • Make your alerts actionable and start keeping runbooks for the issues that you can predict.

SLIDE 47

Thank you! Danke schön! (Thank you very much!)