Building Scalable Real-Time Data Pipeline "Data Fridge" (Vicente Valls Rios)


slide-1
SLIDE 1

Vicente Valls Rios

Software Engineering Manager

Building Scalable Real-Time Data Pipeline “Data Fridge”

slide-2
SLIDE 2

www.deliveryhero.com

!!!

2 million orders delivered in one day (July 2019)

slide-3
SLIDE 3

Data Volume in Numbers

10M Order-Related Events Per Day

+ User clicks + Logistics data + Restaurant availability + Menu items + Customer data
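For a rough sense of scale, the 10M order-related events per day can be converted into an average rate. The sketch below only does that arithmetic and assumes events are spread evenly over the day, which real traffic is unlikely to do, so peak rates will be higher.

```python
# Average event rate implied by 10M order-related events per day.
# Assumption: events are spread evenly over 24 hours (peaks are not modelled).
EVENTS_PER_DAY = 10_000_000
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_events_per_second(events_per_day: int) -> float:
    """Convert a daily event count into an average per-second rate."""
    return events_per_day / SECONDS_PER_DAY


rate = average_events_per_second(EVENTS_PER_DAY)
print(f"{rate:.1f} events/s on average")  # prints "115.7 events/s on average"
```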

slide-4
SLIDE 4

House of Brands & Global Services

  • Logistics
  • Search
  • Recommendation
  • ...

slide-5
SLIDE 5

Challenges

slide-6
SLIDE 6

Data Producers: ...

Data Consumers: Search, Recommendation, Logistics, ...

slide-7
SLIDE 7

Data Producers: ...

Data Fridge

Data Consumers: Search, Recommendation, Logistics, ...

slide-8
SLIDE 8

Mission

  • Unify the data structure across all entities
  • Provide different data consumption types:
      ○ Near real-time
      ○ Low-latency
      ○ High-latency (analytics)

  • Become a data producer for ML applications
slide-9
SLIDE 9

Architecture

slide-10
SLIDE 10

Architecture

Data Fridge components: Ingestion API, Streaming, Events subscription, Low-latency API, Long-term storage

Input: any Data Producer

Outputs:

  • Events subscription: any Data Consumer
  • Low-latency API: any Data Consumer
  • Long-term storage: any Analytics

slide-11
SLIDE 11

Ingestion API

slide-12
SLIDE 12

Ingestion API

  • Verifies quality of data using complex validations
  • Single entrypoint
  • Batch import
  • AWS Lambda for event processing
  • IP Whitelisting
  • JWT authentication

AWS API Gateway, AWS Lambda
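
A minimal sketch of what a validating, batch-capable ingestion handler might look like, written as a Lambda-style function behind an API Gateway proxy request. The field names in `REQUIRED_FIELDS`, the response shape, and the status codes are assumptions for illustration; the slides do not show the actual schema or validations. IP whitelisting and JWT authentication are assumed to happen before this function is called and are not shown.

```python
import json
from typing import Any, List

# Hypothetical required fields; the real Data Fridge schema is not shown in the slides.
REQUIRED_FIELDS = {"event_id": str, "entity": str, "timestamp": str, "payload": dict}


def validate_event(event: Any) -> List[str]:
    """Return a list of validation errors (empty when the event is valid)."""
    if not isinstance(event, dict):
        return ["event must be a JSON object"]
    errors = []
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in event:
            errors.append(f"missing field: {name}")
        elif not isinstance(event[name], expected_type):
            errors.append(f"wrong type for field: {name}")
    return errors


def handler(request: dict, context: object = None) -> dict:
    """Lambda-style handler for a proxy request whose body is a batch of events."""
    try:
        events = json.loads(request.get("body") or "[]")
    except json.JSONDecodeError:
        return {"statusCode": 400, "body": json.dumps({"error": "invalid JSON"})}
    if not isinstance(events, list):
        events = [events]  # accept a single event as a batch of one
    accepted, rejected = [], []
    for index, event in enumerate(events):
        errors = validate_event(event)
        if errors:
            rejected.append({"index": index, "errors": errors})
        else:
            accepted.append(event)
    # A real handler would forward `accepted` to the streaming layer here.
    status = 200 if not rejected else 422  # status choice is an assumption
    body = {"accepted": len(accepted), "rejected": rejected}
    return {"statusCode": status, "body": json.dumps(body)}
```

Reporting rejected events by index lets a producer resend only the failing part of a batch instead of the whole request.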

slide-13
SLIDE 13

Canary Deployment

Alias PROD: v1 (90%), v2 (10%)

  • Custom solution
  • 2 versions under the same alias
  • Metrics monitoring
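
The slide describes a custom solution but does not show how traffic is split, so the following is only an illustrative sketch of weighted routing between two versions, using the 90% / 10% split from the slide. The function name and the use of a seeded random generator are assumptions made for the demo.

```python
import random
from collections import Counter

# Weights taken from the slide: 90% of traffic to v1, 10% to v2.
WEIGHTS = {"v1": 0.9, "v2": 0.1}


def pick_version(weights: dict, rng: random.Random) -> str:
    """Pick a version with probability proportional to its weight."""
    versions = list(weights)
    return rng.choices(versions, weights=[weights[v] for v in versions], k=1)[0]


rng = random.Random(42)  # seeded only to make the demo repeatable
counts = Counter(pick_version(WEIGHTS, rng) for _ in range(10_000))
print(counts)  # roughly nine requests in ten land on v1
```

Comparing error and latency metrics per version while the canary receives its small share is what makes it safe to shift the weights further or roll back.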
slide-14
SLIDE 14

Streaming

slide-15
SLIDE 15

Streaming

  • Kinesis stream preserves messages up to 7 days
  • Ability to replay data
  • Ability to scale up/down

Diagram: any Data Producer, Ingestion API, AWS Kinesis, Lambda, AWS SNS, any Data Consumer

slide-16
SLIDE 16

SNS instead of Kinesis?

  • We do not care about the order of events.
  • SNS supports message filtering.
  • Kinesis requires scaling up shards as the number of consumers grows.
  • Scaling up Kinesis is harder and more expensive than scaling SNS.
  • SNS with HTTP/SQS subscriptions provides both push and pull data consumption.

slide-17
SLIDE 17

Streaming

  • 2 types of subscriptions
  • Filtering based on event attributes

AWS SNS -> HTTPS (PUSH) -> any Data Consumer
AWS SNS -> AWS SQS (PULL) -> any Data Consumer
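
To make attribute-based filtering concrete, here is a simplified sketch of exact-match filtering: a subscription receives a message only if every key in its filter policy is present in the message attributes with one of the allowed values. The subscription names, attribute names, and values are hypothetical, and the sketch covers exact string matching only, not the other operators SNS filter policies support.

```python
from typing import Dict, List


def matches(filter_policy: Dict[str, List[str]], attributes: Dict[str, str]) -> bool:
    """Simplified exact-match filtering: every policy key must be present in the
    message attributes, and its value must be one of the allowed values."""
    return all(
        key in attributes and attributes[key] in allowed
        for key, allowed in filter_policy.items()
    )


# Hypothetical subscriptions: names, attributes and values are made up.
subscriptions = {
    "search-https": {"entity": ["restaurant", "menu_item"]},
    "logistics-sqs": {"entity": ["order"], "country": ["DE", "SE"]},
}

message_attributes = {"entity": "order", "country": "DE"}
receivers = [
    name for name, policy in subscriptions.items()
    if matches(policy, message_attributes)
]
print(receivers)  # prints "['logistics-sqs']"
```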

slide-18
SLIDE 18

Stream Aggregation

AWS Lambda
Order Event - AWS DynamoDB
Order Event: AWS SNS
Order Status Event: AWS SNS/SQS
Order + Order Status Event: AWS SNS
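
The diagram suggests that order events and order status events are combined into a single "Order + Order Status" event. The sketch below shows one way such a join could work: each incoming event is stored by order ID, and a combined record is emitted once both halves are present, regardless of arrival order. The class name, event fields, and the in-memory dict (standing in for a key-value table such as DynamoDB) are assumptions, not the speaker's implementation.

```python
from typing import Dict, Optional


class OrderAggregator:
    """Joins order events with order status events by order ID.

    The in-memory dict stands in for a key-value table (DynamoDB in the slide).
    """

    def __init__(self) -> None:
        self._store: Dict[str, dict] = {}

    def handle(self, event: dict) -> Optional[dict]:
        """Store the event; return the combined record once both halves are known."""
        record = self._store.setdefault(event["order_id"], {})
        record[event["type"]] = event["data"]  # type is "order" or "order_status"
        if "order" in record and "order_status" in record:
            return {"order_id": event["order_id"], **record}
        return None


aggregator = OrderAggregator()
# The status event arrives first here; the join does not depend on ordering.
first = aggregator.handle(
    {"type": "order_status", "order_id": "o-1", "data": {"status": "PICKED_UP"}}
)
second = aggregator.handle(
    {"type": "order", "order_id": "o-1", "data": {"items": 2}}
)
print(first)   # prints "None"
print(second)  # combined record for order o-1
```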

slide-19
SLIDE 19

Low-latency API

slide-20
SLIDE 20

Low-latency API

  • Dead Letter Queue
  • On-demand scaling

AWS Lambda, AWS SQS DLQ, AWS DynamoDB, AWS API Gateway
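
The dead letter queue idea can be sketched in a few lines: a message that keeps failing is retried a limited number of times and then set aside instead of blocking the rest of the queue. This is an in-memory illustration only; `MAX_RECEIVE_COUNT`, `drain`, and `write_to_table` are hypothetical names, and in SQS the equivalent behaviour is configured through a redrive policy rather than written by hand.

```python
from collections import deque
from typing import Callable, Deque, List, Tuple

MAX_RECEIVE_COUNT = 3  # hypothetical retry limit before a message is dead-lettered


def drain(queue: Deque[Tuple[dict, int]], process: Callable[[dict], None]) -> List[dict]:
    """Process every message; after MAX_RECEIVE_COUNT failures, move it to the DLQ."""
    dead_letters: List[dict] = []
    while queue:
        message, attempts = queue.popleft()
        try:
            process(message)
        except Exception:
            attempts += 1
            if attempts >= MAX_RECEIVE_COUNT:
                dead_letters.append(message)
            else:
                queue.append((message, attempts))  # retry later
    return dead_letters


stored = {}


def write_to_table(message: dict) -> None:
    """Stand-in for a write to the low-latency store; rejects messages without an id."""
    if "id" not in message:
        raise ValueError("missing id")
    stored[message["id"]] = message


queue = deque([({"id": "a"}, 0), ({"name": "broken"}, 0), ({"id": "b"}, 0)])
dlq = drain(queue, write_to_table)
print(sorted(stored), dlq)  # prints "['a', 'b'] [{'name': 'broken'}]"
```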

slide-21
SLIDE 21

Analytics

slide-22
SLIDE 22

Analytics

  • BigQuery as OLAP database
  • Data quality visualization
  • BigQuery is a scalable DWH solution

AWS Lambda, AWS SQS DLQ, Google BigQuery

slide-23
SLIDE 23

Any Producer -> Ingestion API -> Streaming -> Any Consumer

  • First use case: Near real-time
  • Second use case: Low-latency API
  • Third use case: Analytics

slide-24
SLIDE 24

Tech Stack

slide-25
SLIDE 25

Challenges

slide-26
SLIDE 26

Challenges

  • SLAs (latency, durability, etc.) for some cloud services

  • Data Quality & Data documentation
  • GDPR
  • Automating new pipeline creation
  • Automating SNS subscriptions / BigQuery Access
slide-27
SLIDE 27

www.deliveryhero.com

We Are Hiring!