Delivering Intelligence from space Crop Forecasting Pipeline - - PowerPoint PPT Presentation

SLIDE 1

Delivering Intelligence from space

SLIDE 2

Crop Forecasting Plane & Ship Tracking Pipeline Monitoring Market Intelligence Gathering

SLIDE 3

Distributed Data Engineering - Lessons

SLIDE 4

Distributed Data Engineering - Lessons

  • 1. Metrics
SLIDE 5

Distributed Data Engineering - Lessons

  • 1. Metrics
  • 2. Logging
SLIDE 6

Distributed Data Engineering - Lessons

  • 1. Metrics
  • 2. Logging
  • 3. Frameworks
SLIDE 7

Distributed Data Engineering - Lessons

  • 1. Metrics
  • 2. Logging
  • 3. Frameworks
  • 4. Serverless ETL
SLIDE 8

  • 1. Metrics were lacking
SLIDE 9

Before

user_id  num_downloads  num_uploads
4        10             1
7        6              3

SLIDE 10

Before

user_id  num_downloads  num_uploads
4        11 (+1)        1
7        6              3

Server

SLIDE 11

downloads

date                 user_id  image_size
2017-10-01 14:40:32  4        1365
2017-10-02 11:01:11  4        650

SLIDE 12

downloads

date                 user_id  image_size
2017-10-01 14:40:32  4        1365
2017-10-02 11:01:11  4        650
2017-10-02 11:06:00  5        9001
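A minimal sketch of why the event table is the more flexible shape: the old per-user counter can always be derived from the raw rows, but not the other way round. The rows are copied from the slide; `count_downloads` is a hypothetical helper, not something from the talk.

```python
from collections import Counter

# Event rows as shown on the slide: (date, user_id, image_size)
downloads = [
    ("2017-10-01 14:40:32", 4, 1365),
    ("2017-10-02 11:01:11", 4, 650),
    ("2017-10-02 11:06:00", 5, 9001),
]

def count_downloads(events):
    """Derive the per-user download counter from raw event rows."""
    return Counter(user_id for _date, user_id, _size in events)

counts = count_downloads(downloads)
print(counts[4], counts[5])
```

Other aggregates (total bytes per user, downloads per day) fall out of the same rows without a schema change.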

SLIDE 13

Database

Source of Truth

SLIDE 14

Database

Source of Truth

  • Migration headaches
  • Manage connections
  • Performance
SLIDE 15

Database

Source of Truth

Logging

{
  "message": "Downloaded img",
  "userId": "1234",
  "imgId": "1d3x5",
  "service": "download-server",
  "time": "1509385330"
}

SLIDE 16

Server logs

  • Debugging
  • Support

Two Kinds of Logs

[Wed Oct 11 14:32:12 2000] [info] [client 127.0.0.1] image 1d3x5 downloaded by userId 1234

SLIDE 17

Server logs

  • Debugging
  • Support

Two Kinds of Logs

Metric logs

  • Dashboards
  • Analytics

[Wed Oct 11 14:32:12 2000] [info] [client 127.0.0.1] image 1d3x5 downloaded by userId 1234

{
  "message": "Downloaded img",
  "userId": "1234",
  "imgId": "1d3x5",
  "service": "download-server",
  "time": "1509385330"
}

SLIDE 18

Metric logs

  • Dashboards
  • Analytics

{
  "message": "Downloaded img",
  "userId": "1234",
  "imgId": "1d3x5",
  "service": "download-server",
  "time": "1509385330"
}
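One way a service might emit a metric log in this shape, as a sketch only. The field names follow the slide; the `metric_log` helper and its signature are assumptions for illustration.

```python
import json
import time

def metric_log(message, service, **fields):
    """Build one metric log line as a JSON string (hypothetical helper)."""
    record = {
        "message": message,
        "service": service,
        # Epoch seconds as a string, matching the slide's "time" field.
        "time": str(int(time.time())),
    }
    record.update(fields)
    return json.dumps(record)

line = metric_log("Downloaded img", "download-server",
                  userId="1234", imgId="1d3x5")
print(line)
```

Because every line is valid JSON, dashboards and analytics jobs can parse it without the regexes a free-text server log would need.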

SLIDE 19

Centralize

SLIDE 20

Metric Collector

import observatory

bs = observatory.Tracker()
bs.track('search_made', {
    'query': event.query,
    'n_results': len(resp['data']),
    'user_id': user_item.id
})

SLIDE 21

Metric Collector

import observatory

bs = observatory.Tracker()
bs.track('search_made', {
    'query': event.query,
    'n_results': len(resp['data']),
    'user_id': user_item.id
})

Lambda REST API Enrich / Conform

SLIDE 22

Metric Collector

import observatory

bs = observatory.Tracker()
bs.track('search_made', {
    'query': event.query,
    'n_results': len(resp['data']),
    'user_id': user_item.id
})

Lambda REST API Redshift S3

SLIDE 23

After

Lambda REST Redshift S3

  • Centralized metrics
  • Log enrichment
  • Persistent store
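A rough sketch of what the "Enrich / Conform" step behind the REST API could look like as a Lambda-style handler. The required keys, the added fields, and the function names are all assumptions; the slides do not show the collector's internals, and the write to S3 / Redshift is left out.

```python
import json
import time
import uuid

# Assumed minimal schema for an incoming tracked event.
REQUIRED = ("event_name", "properties")

def enrich(raw, received_at=None):
    """Conform one raw event to the assumed schema and add server-side fields."""
    missing = [key for key in REQUIRED if key not in raw]
    if missing:
        raise ValueError("missing fields: " + ", ".join(missing))
    return {
        "event_id": str(uuid.uuid4()),
        "event_name": raw["event_name"],
        "properties": raw["properties"],
        "received_at": int(time.time() if received_at is None else received_at),
    }

def handler(event, context):
    """Entry point for an API Gateway proxy request (body is a JSON string)."""
    try:
        record = enrich(json.loads(event["body"]))
    except (ValueError, KeyError, TypeError) as exc:
        return {"statusCode": 400, "body": json.dumps({"error": str(exc)})}
    # Persisting `record` to S3 / Redshift is intentionally omitted here.
    return {"statusCode": 200, "body": json.dumps(record)}
```

Rejecting malformed events at the collector keeps the persistent store uniform, which is the point of centralizing.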
SLIDE 24

  • 2. Debugging is painful
SLIDE 25

Before

EC2 → Filesystem
EC2 → Filesystem
Lambda → CloudWatch

SLIDE 26

Centralize

SLIDE 27

EC2 EC2 Lambda

Stream

SLIDE 28

CloudWatch

EC2 EC2 Lambda

SLIDE 29

CloudWatch

EC2 EC2 Lambda Consumer

SLIDE 30

CloudWatch

EC2 EC2 Lambda Consumer

ES

S3

SAAS

SLIDE 31

After

  • Elasticsearch
  • Search by UUID
SLIDE 32

But!

What does the full flow of a request look like?

SLIDE 33

Correlation ID

UUID

SLIDE 34

Correlation ID

  • Create for any external call

External Request → Service A

CID

SLIDE 35

Correlation ID

  • CID passed everywhere

External Request → Service A → Service B → Service C

CID
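A small sketch of the two rules on these slides: create a CID when an external request arrives, then pass it to every downstream service. The header name `X-Correlation-ID`, the in-memory `LOG` list, and the three service functions are stand-ins invented for illustration.

```python
import uuid

CID_HEADER = "X-Correlation-ID"  # assumed header name, not from the talk
LOG = []                         # stand-in for the centralized log stream

def ensure_cid(headers):
    """Return a copy of headers that carries a CID, creating one if absent."""
    out = dict(headers)
    if not out.get(CID_HEADER):
        out[CID_HEADER] = str(uuid.uuid4())
    return out

def service_a(incoming_headers):
    # Edge service: the external request has no CID yet, so one is created.
    headers = ensure_cid(incoming_headers)
    LOG.append(("service-a", headers[CID_HEADER]))
    return service_b(headers)

def service_b(headers):
    # Internal services reuse the CID they were given.
    LOG.append(("service-b", headers[CID_HEADER]))
    return service_c(headers)

def service_c(headers):
    LOG.append(("service-c", headers[CID_HEADER]))
    return headers[CID_HEADER]

cid = service_a({})
print(cid, LOG)
```

Every log entry for the request now shares one value, so the full flow can be reassembled later.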

SLIDE 36

Correlation ID

In ES → filter by CID
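For illustration, a filter like this could be expressed as an Elasticsearch query body. The field names `cid` and `time` are assumptions (the slides do not show the index mapping), and the `term` filter presumes the CID is indexed as an exact-match keyword field.

```python
import json

def cid_filter(cid, field="cid", time_field="time"):
    """Build a query body that returns all log entries for one CID, oldest first."""
    return {
        "query": {"bool": {"filter": [{"term": {field: cid}}]}},
        "sort": [{time_field: "asc"}],
    }

print(json.dumps(cid_filter("1d3x5"), indent=2))
```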

SLIDE 37

  • 3. Building services is slow
SLIDE 38

Before

  • Online console
  • Zip file deployment
  • Doesn’t scale
SLIDE 39

Infrastructure as Code

SLIDE 40

Infrastructure as Code

  • Serverless Framework
  • Template -> service
  • Rapid deployment
SLIDE 41

# serverless.yml
service: users

provider:
  name: aws
  runtime: nodejs6.10
  stage: dev

functions:
  usersCreate:
    handler: users.create
    events:
      - http: post users/create

resources:
  Resources:
    usersTable:
      Type: AWS::DynamoDB::Table
      Properties:
        TableName: usersTable
        AttributeDefinitions:
          - AttributeName: email
            AttributeType: S

SLIDE 42

# serverless.yml
service: users

provider:
  name: aws
  runtime: nodejs6.10
  stage: dev

functions:
  usersCreate:
    handler: users.create
    events:
      - http: post users/create

resources:
  Resources:
    usersTable:
      Type: AWS::DynamoDB::Table
      Properties:
        TableName: usersTable
        AttributeDefinitions:
          - AttributeName: email
            AttributeType: S

Internet gateway Lambda handler Dynamodb

S3

Microservice

SLIDE 43

# serverless.yml
service: users

provider:
  name: aws
  runtime: nodejs6.10
  stage: dev
  environment:
    STAGE: ${opt:stage, self:provider.stage}

functions:
  usersCreate:
    handler: users.create
    events:
      - http: post users/create

resources:
  Resources:
    usersTable:
      Type: AWS::DynamoDB::Table
      Properties:
        TableName: usersTable
        AttributeDefinitions:
          - AttributeName: email
            AttributeType: S

SLIDE 44

# serverless.yml
service: users

provider:
  name: aws
  runtime: nodejs6.10
  stage: dev
  environment:
    STAGE: ${opt:stage, self:provider.stage}

functions:
  usersCreate:
    handler: users.create
    events:
      - http: post users/create

resources:
  Resources:
    usersTable:
      Type: AWS::DynamoDB::Table
      Properties:
        TableName: usersTable
        AttributeDefinitions:
          - AttributeName: email
            AttributeType: S

  • Service info as ENV vars
SLIDE 45

# serverless.yml
service: users

provider:
  name: aws
  runtime: nodejs6.10
  stage: dev
  environment:
    STAGE: ${opt:stage, self:provider.stage}

functions:
  usersCreate:
    handler: users.create
    events:
      - http: post users/create

resources:
  Resources:
    usersTable:
      Type: AWS::DynamoDB::Table
      Properties:
        TableName: usersTable
        AttributeDefinitions:
          - AttributeName: email
            AttributeType: S

  • Service info as ENV vars
  • Inject in logs

import observatory

bs = observatory.Tracker()
bs.track('search_made', {
    'query': event.query,
    'n_results': len(resp['data']),
    'user_id': user_item.id
})
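How a tracker might inject service info from environment variables into each event, as a sketch under assumptions. `STAGE` comes from the config on the slide and `AWS_LAMBDA_FUNCTION_NAME` is set by the Lambda runtime; the `service_info` and `enrich` helpers are hypothetical and are not the `observatory` library's API.

```python
import json
import os

def service_info(environ=None):
    """Collect service metadata from environment variables."""
    environ = os.environ if environ is None else environ
    return {
        "stage": environ.get("STAGE", "unknown"),
        "function": environ.get("AWS_LAMBDA_FUNCTION_NAME", "unknown"),
    }

def enrich(event_name, properties, environ=None):
    """Attach the service metadata to one tracked event."""
    record = {"event": event_name, "properties": properties}
    record.update(service_info(environ))
    return record

# A fake environment keeps the example runnable outside Lambda.
fake_env = {"STAGE": "dev", "AWS_LAMBDA_FUNCTION_NAME": "users-dev-usersCreate"}
print(json.dumps(enrich("search_made", {"n_results": 3}, fake_env)))
```

The calling code stays as short as the `bs.track(...)` call above; the enrichment happens inside the tracker.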

SLIDE 46

After

  • Rapid dev
  • Source controlled
  • Log enrichment
SLIDE 47

  • 4. Server time == $$$
SLIDE 48

Before

  • Bursty
  • ETL servers idle
SLIDE 49

Transient Resources

  • Pipeline:

    ○ Spin up EC2
    ○ Terminate

EC2

ETL

EC2 DB

SLIDE 50

FAAS Resources

  • Pipeline:

    ○ Discretize work
    ○ Lambda fleet
    ○ Inherently transient

DB Listener Worker Worker Worker
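The listener / worker pattern can be sketched locally: split the job into small chunks and hand each chunk to an independent worker. Here a thread pool stands in for the Lambda fleet, and the doubling transform is a placeholder; none of these names come from the talk.

```python
from concurrent.futures import ThreadPoolExecutor

def discretize(items, chunk_size):
    """Split the work into independent chunks of at most chunk_size items."""
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]

def worker(chunk):
    """Stand-in for one Lambda invocation: transform a single chunk."""
    return [row * 2 for row in chunk]

def run_pipeline(rows, chunk_size=3, max_workers=4):
    """Listener role: fan chunks out to workers and gather results in order."""
    chunks = discretize(rows, chunk_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(worker, chunks))
    return [row for chunk in results for row in chunk]

print(run_pipeline(list(range(7))))
```

Since each chunk is self-contained, workers need no shared state and nothing stays running between bursts.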

SLIDE 51

After

  • Faster
  • Cheaper
  • Highly scalable ETL
SLIDE 52

Distributed Data Engineering - Lessons

  • 1. Metrics
  • 2. Logging
  • 3. Frameworks
  • 4. Serverless ETL
SLIDE 53

Skywatch