Building an open source data lake at scale in the cloud

SLIDE 1

Building an open source data lake at scale in the cloud

Adrian Woodhead, Principal Engineer

SLIDE 2

Expedia Group Proprietary and Confidential

Agenda

  • Background
  • Data Lake foundation: data + metadata
  • High Availability and Disaster Recovery
  • Data federation
  • Event-based data processing

SLIDE 4

Data Lake journey

  • “Traditional” RDBMS data warehouse
  • Introduced on-premises Hadoop + Hive cluster
      • RDBMS SQL replaced by SQL from Hive
      • Slow at busy times
      • Painful upgrade path (software and hardware)
  • Migration to “Cloud” as primary data lake
SLIDE 5

Cloud Data Lake Foundation

SLIDE 6

Cloud Data Lake High Availability

SLIDE 7

Cloud Data Lake Redundancy

SLIDE 8

Redundancy by replication

  • Data and Metadata
  • Co-ordinated
  • Data consistency during replication
  • No partial reads
  • Completeness more important than latency

SLIDE 9

Circus Train – Hive dataset replicator

  • https://github.com/HotelsDotCom/circus-train/
  • Metadata only available after data
  • Supports HDFS, S3, GCS etc.
  • Standard “distcp” and optimised copiers
  • Plugin architecture – Notifications, Copiers, Metadata transformations

  • Selective data replication – custom filters, “Hive Diff”
  • https://github.com/HotelsDotCom/shunting-yard
  • Event-driven Circus Train
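
The "metadata only available after data" ordering can be illustrated with a minimal sketch. This is not Circus Train's implementation: `replicate_table`, the dictionary standing in for the replica metastore, and the local folders standing in for HDFS/S3 are all assumptions made for illustration.

```python
import shutil
import tempfile
import uuid
from pathlib import Path


def replicate_table(source_dir, replica_root, replica_catalog, table):
    """Copy a table's files to a fresh folder, then register it in the replica catalog."""
    source_dir = Path(source_dir)
    # A new folder per run means an earlier replica is never modified in place.
    target_dir = Path(replica_root) / table / uuid.uuid4().hex

    # Step 1: copy the data. Nothing points at target_dir yet, so no reader can see it.
    shutil.copytree(source_dir, target_dir)

    # Step 2: check completeness before publishing (file names and sizes only).
    expected = sorted((p.name, p.stat().st_size) for p in source_dir.iterdir())
    copied = sorted((p.name, p.stat().st_size) for p in target_dir.iterdir())
    if copied != expected:
        raise RuntimeError(f"incomplete copy of {table}; metadata left unchanged")

    # Step 3: only now update the metadata, so queries switch to a complete copy.
    replica_catalog[table] = str(target_dir)
    return target_dir


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "source" / "bookings"
        source.mkdir(parents=True)
        (source / "part-0000").write_text("row1\nrow2\n")

        catalog = {}  # hypothetical stand-in for the replica Hive metastore
        replicate_table(source, Path(tmp) / "replica", catalog, "bookings")
        print(catalog["bookings"])
```

Because the catalog entry changes in a single final step, a failed or partial copy leaves the previous replica location in place, which matches the "no partial reads" and "completeness more important than latency" goals.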

SLIDE 10

Data Lake Silos

SLIDE 11

Data Lake Silo Solutions

  • Move back to a single data lake
      • Scalability issues
      • Increased “blast radius”
  • Replicate shared data sets between data lakes
      • Cost of maintaining replication jobs
      • Increased file storage costs
      • Increased network transfer costs

SLIDE 12

Federated Cloud Data Lake

  • https://github.com/HotelsDotCom/waggle-dance/
  • Waggle Dance – a Hive Thrift metastore proxy
  • Configure it with “downstream” Hive metastores
  • Configure S3 bucket access permissions
  • Set “hive.metastore.uris” to Waggle Dance server
  • Use as you would Hive metastore in any client app
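
The proxy idea can be sketched as a routing function that picks a downstream metastore for each database name. This is only an illustration, assuming a prefix-per-metastore naming scheme; the `Metastore` class, the `resolve` function, and the URIs are hypothetical and are not Waggle Dance's code or configuration format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metastore:
    name: str
    uri: str     # Thrift URI of the downstream Hive metastore (hypothetical values below)
    prefix: str  # database-name prefix; "" marks the primary metastore


def resolve(database, metastores):
    """Return (metastore, database name as the downstream metastore knows it)."""
    # Try longer prefixes first so overlapping prefixes resolve deterministically.
    for ms in sorted(metastores, key=lambda m: len(m.prefix), reverse=True):
        if ms.prefix and database.startswith(ms.prefix):
            return ms, database[len(ms.prefix):]
    # No prefix matched: fall back to the primary metastore.
    primary = next(m for m in metastores if m.prefix == "")
    return primary, database


if __name__ == "__main__":
    federation = [
        Metastore("primary", "thrift://primary.example.com:9083", ""),
        Metastore("sales", "thrift://sales.example.com:9083", "sales_"),
    ]
    print(resolve("sales_bookings", federation))
    print(resolve("default", federation))
```

A client only ever sees one endpoint (the value of "hive.metastore.uris"); the choice of downstream metastore happens behind it.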

SLIDE 13

Waggle Dance Overview

SLIDE 14

Multi-Region Federated Cloud Data Lake

[Diagram labels: Federate, Replicate; regions US_WEST_2 and US_EAST_1]

SLIDE 15

Federated Cloud Data Lake Best Practices

  • Expose read-only endpoints to “external” users
  • Separate critical path infrastructure
  • Federate data for access within a region
  • Replicate data for access in a different region

SLIDE 16

Federated Cloud Data Lake Alternative

  • Presto – distributed SQL query engine for big data
  • Federate Hive, MySQL, PostgreSQL and many others
  • https://github.com/prestodb/presto or https://github.com/prestosql/presto?

SLIDE 17

Apiary - Cloud Data Lake Components

  • https://github.com/ExpediaGroup/apiary
  • Various components for a federated cloud data lake
  • Docker images for all services
  • Terraform deployment scripts
  • Ranger for authorization
  • Various optional extensions

SLIDE 18

Apiary – Metadata Events

  • https://github.com/ExpediaGroup/apiary-extensions/tree/master/apiary-metastore-events

  • Events for table/partition CRUD operations
  • Hive MetaStoreEventListener implementations
      • Kafka
      • AWS SNS
  • Enable downstream data processing use cases
      • ETL, Governance, Lineage etc.
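
A downstream consumer of such events might look like the sketch below. The JSON field names (`eventType`, `dbName`, `tableName`) and the event type string are assumptions for illustration, not the actual Apiary event schema, and the message is built locally instead of being read from Kafka or SNS.

```python
import json


def dispatch(raw_message, handlers):
    """Decode one metastore event and pass it to the handler for its type, if any."""
    event = json.loads(raw_message)
    handler = handlers.get(event["eventType"])
    # Event types nobody subscribed to are ignored rather than treated as errors.
    return handler(event) if handler else None


def schedule_etl(event):
    # Placeholder for real work such as starting a downstream job.
    return f"run ETL for {event['dbName']}.{event['tableName']}"


if __name__ == "__main__":
    # Hypothetical payload; a real consumer would receive this from a topic or queue.
    message = json.dumps(
        {"eventType": "ADD_PARTITION", "dbName": "sales", "tableName": "bookings"}
    )
    print(dispatch(message, {"ADD_PARTITION": schedule_etl}))
```

Keeping the handlers in a plain mapping lets ETL, governance, and lineage consumers subscribe to different event types without knowing about each other.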

SLIDE 19

Problem – rewriting data at scale

  • Changes to existing data
  • Read isolation for long running queries
  • Always create new folders for updates
  • Repoint Hive data locations
  • How to expire “orphaned data”?
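
The "new folder, then repoint" pattern can be sketched as follows. This is a minimal illustration under stated assumptions: a dictionary stands in for the Hive metastore's partition locations, local folders stand in for cloud storage, and `rewrite_partition` is a hypothetical helper.

```python
import tempfile
import uuid
from pathlib import Path


def rewrite_partition(table_root, catalog, partition, rows):
    """Write rows to a brand-new folder, repoint the partition, return the old location."""
    # Never overwrite in place: a long-running query may still be reading the old files.
    new_dir = Path(table_root) / partition / f"snapshot-{uuid.uuid4().hex}"
    new_dir.mkdir(parents=True)
    (new_dir / "part-0000").write_text("\n".join(rows) + "\n")

    old_location = catalog.get(partition)
    catalog[partition] = str(new_dir)  # the repoint is the only step readers observe
    return old_location  # now unreferenced ("orphaned"), but still intact on disk


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        catalog = {}  # hypothetical stand-in for Hive partition locations
        rewrite_partition(root, catalog, "date=2020-01-01", ["a", "b"])
        orphan = rewrite_partition(root, catalog, "date=2020-01-01", ["a", "b", "c"])
        print("orphaned:", orphan)
```

The returned path is exactly the "orphaned data" the last bullet asks about: nothing references it any more, yet it cannot be deleted immediately without breaking in-flight readers.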

SLIDE 20

Beekeeper – orphaned data cleanup

  • https://github.com/ExpediaGroup/beekeeper/
  • Hive table parameter: beekeeper.remove.unreferenced.data=true

  • Apiary event listener
  • Detects data re-writes
  • Schedules old data for deletion in future
  • Periodically performs the data deletions
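
The detect, schedule, delete cycle above can be sketched as a small queue. This is not Beekeeper's implementation: `CleanupQueue`, its in-memory list, and the three-day delay are assumptions chosen for illustration, and local folders stand in for cloud storage.

```python
import shutil
import tempfile
from datetime import datetime, timedelta
from pathlib import Path


class CleanupQueue:
    """Keeps paths that became unreferenced and deletes them once a delay has passed."""

    def __init__(self, delay):
        self.delay = delay
        self.pending = []  # list of (delete_after, path)

    def on_location_changed(self, old_path, new_path, now):
        # A changed location means the old folder was replaced by a re-write.
        if old_path and old_path != new_path:
            self.pending.append((now + self.delay, old_path))

    def run_once(self, now):
        """Delete every path whose delay has expired; return the deleted paths."""
        due = [path for delete_after, path in self.pending if delete_after <= now]
        self.pending = [(t, p) for t, p in self.pending if t > now]
        for path in due:
            shutil.rmtree(path, ignore_errors=True)
        return due


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        old = Path(tmp) / "snapshot-old"
        old.mkdir()
        queue = CleanupQueue(delay=timedelta(days=3))  # hypothetical retention period
        start = datetime(2020, 1, 1)
        queue.on_location_changed(str(old), str(Path(tmp) / "snapshot-new"), start)
        print(queue.run_once(start + timedelta(days=1)))  # too early, nothing deleted
        print(queue.run_once(start + timedelta(days=4)))  # delay passed, old path deleted
```

The delay is what gives long-running queries time to finish against the old folder before it disappears.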

SLIDE 21

Consistent CRUD alternatives

  • http://hive.apache.org/ - Hive 3.1.x with ACID
  • https://iceberg.incubator.apache.org/ - Iceberg
  • https://delta.io/ - Delta Lake
  • https://hudi.apache.org/ - Hudi

SLIDE 22

Don’t forget to test

  • https://github.com/klarna/HiveRunner/ - Hive SQL unit tests
  • https://github.com/HotelsDotCom/mutant-swarm/ - Code coverage for HiveRunner
  • https://github.com/HotelsDotCom/beeju - Unit tests for Thrift Hive metastore service and HiveServer2

SLIDE 23

Where to next?

  • Hybrid cloud
      • Best of both worlds but increased complexity
  • Multi-cloud
      • Best of breed but increased complexity
  • Docker + Kubernetes
      • Reduce vendor lock-in
      • Massive scale without too much effort
      • Minimal changes for on-prem/EKS/GKE/AKS etc.
SLIDE 24

Open Source Data Lake Components

  • Hive Federation
      • https://github.com/HotelsDotCom/waggle-dance
  • Hive Replication
      • https://github.com/HotelsDotCom/circus-train
      • https://github.com/ExpediaGroup/shunting-yard
  • Cloud Data Lake
      • https://github.com/ExpediaGroup/apiary
  • Hive Cleanup
      • https://github.com/ExpediaGroup/beekeeper