Building an open source data lake at scale in the cloud


  1. Building an open source data lake at scale in the cloud
     Adrian Woodhead, Principal Engineer

  2. Agenda
     • Background
     • Data Lake foundation: data + metadata
     • High Availability and Disaster Recovery
     • Data federation
     • Event-based data processing

     Expedia Group Proprietary and Confidential

  4. Data Lake journey
     • “traditional” RDBMS Data Warehouse
     • Introduced on-premises Hadoop + Hive cluster
     • RDBMS SQL replaced by SQL from Hive
     • Slow at busy times
     • Painful upgrade path (software and hardware)
     • Migration to “Cloud” as primary data lake

  5. Cloud Data Lake Foundation

  6. Cloud Data Lake High Availability

  7. Cloud Data Lake Redundancy

  8. Redundancy by replication
     • Data and Metadata
     • Co-ordinated
     • Data consistency during replication
     • No partial reads
     • Completeness more important than latency

  9. Circus Train – Hive dataset replicator
     • https://github.com/HotelsDotCom/circus-train/
     • Metadata only available after data
     • Supports HDFS, S3, GCS etc.
     • Standard “distcp” and optimised copiers
     • Plugin architecture – Notifications, Copiers, Metadata transformations
     • Selective data replication – custom filters, “Hive Diff”
     • https://github.com/HotelsDotCom/shunting-yard
     • Event-driven Circus Train
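
The "metadata only available after data" and "no partial reads" points can be illustrated with a minimal sketch. It is not Circus Train's implementation: the `replicate_table` function, the dictionary standing in for the replica Hive metastore, and the local folders standing in for cloud storage are all assumptions made for illustration.

```python
import shutil
import tempfile
import uuid
from pathlib import Path


def replicate_table(source_dir: Path, replica_root: Path,
                    replica_metastore: dict, table: str) -> Path:
    """Copy all data first, then publish the new location in the replica metadata."""
    # A fresh folder per run leaves readers of the previous snapshot undisturbed.
    target_dir = replica_root / table / f"snapshot-{uuid.uuid4().hex}"
    shutil.copytree(source_dir, target_dir)  # step 1: data
    # Step 2: metadata. Only now can readers resolve the table to the new folder,
    # so a reader going through the metastore never sees a half-copied snapshot.
    replica_metastore[table] = target_dir
    return target_dir


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    source = root / "source" / "bookings"  # hypothetical table folder
    source.mkdir(parents=True)
    (source / "part-0000").write_text("a,1\n")
    (source / "part-0001").write_text("b,2\n")

    metastore = {}  # hypothetical stand-in for the replica Hive metastore
    replicate_table(source, root / "replica", metastore, "bookings")
    print(sorted(p.name for p in metastore["bookings"].iterdir()))
    # prints ['part-0000', 'part-0001']
```

If the copy fails part-way, the metadata still points at the previous complete snapshot, which matches the preference for completeness over latency.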

  10. Data Lake Silos

  11. Data Lake Silo Solutions
      • Move back to a single data lake
         • Scalability issues
         • Increased “blast radius”
      • Replicate shared data sets between data lakes
         • Cost of maintaining replication jobs
         • Increased file storage costs
         • Increased network transfer costs

  12. Federated Cloud Data Lake
      • https://github.com/HotelsDotCom/waggle-dance/
      • Waggle Dance – a Hive Thrift metastore proxy
      • Configure it with “downstream” Hive metastores
      • Configure S3 bucket access permissions
      • Set “hive.metastore.uris” to Waggle Dance server
      • Use as you would Hive metastore in any client app
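
The idea of a metastore proxy in front of several "downstream" metastores can be sketched as follows. This is a toy model, not Waggle Dance's code: it assumes routing by a database-name prefix, models each metastore as a plain dictionary, and uses made-up bucket and database names.

```python
from typing import Dict, Tuple

# A "metastore" here is just a mapping of (database, table) -> data location.
Metastore = Dict[Tuple[str, str], str]


class FederatingProxy:
    """Routes metadata lookups to one of several downstream metastores."""

    def __init__(self, primary: Metastore, federated: Dict[str, Metastore]):
        self.primary = primary
        self.federated = federated  # database-name prefix -> downstream metastore

    def table_location(self, database: str, table: str) -> str:
        for prefix, metastore in self.federated.items():
            if database.startswith(prefix):
                # Strip the prefix before asking the downstream metastore.
                return metastore[(database[len(prefix):], table)]
        # No prefix matched, so the lookup goes to the primary metastore.
        return self.primary[(database, table)]


# Hypothetical data lakes "a" (primary) and "b" (federated under prefix "b_").
proxy = FederatingProxy(
    primary={("sales", "bookings"): "s3://lake-a/sales/bookings"},
    federated={"b_": {("web", "clicks"): "s3://lake-b/web/clicks"}},
)
print(proxy.table_location("sales", "bookings"))  # prints s3://lake-a/sales/bookings
print(proxy.table_location("b_web", "clicks"))    # prints s3://lake-b/web/clicks
```

Only metadata passes through the proxy; the client still reads the data files directly, which is why the slide lists bucket access permissions as a separate configuration step.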

  13. Waggle Dance Overview

  14. Multi-Region Federated Cloud Data Lake
      (Diagram labels: Federate, Replicate; regions US_WEST_2 and US_EAST_1)

  15. Federated Cloud Data Lake Best Practices
      • Expose read-only endpoints to “external” users
      • Separate critical path infrastructure
      • Federate data for access within a region
      • Replicate data for access in a different region

  16. Federated Cloud Data Lake Alternative
      • Presto – distributed SQL query engine for big data
      • Federate Hive, MySQL, PostgreSQL and many others
      • https://github.com/prestodb/presto
        OR
      • https://github.com/prestosql/presto ?

  17. Apiary - Cloud Data Lake Components
      • https://github.com/ExpediaGroup/apiary
      • Various components for a federated cloud data lake
      • Docker images for all services
      • Terraform deployment scripts
      • Ranger for authorization
      • Various optional extensions

  18. Apiary – Metadata Events
      • https://github.com/ExpediaGroup/apiary-extensions/tree/master/apiary-metastore-events
      • Events for tables/partitions CRUD operations
      • Hive MetaStoreEventListener implementations
         • Kafka
         • AWS SNS
      • Enable downstream data processing use cases
         • ETL, Governance, Lineage etc
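
A downstream consumer of such metadata events might look roughly like the sketch below. The JSON field names (`eventType`, `table`) and the event type strings are hypothetical and do not reflect the actual Apiary message schema; reading from Kafka or SNS is replaced by an in-memory list of messages.

```python
import json
from typing import Callable, Dict, Iterable, List

Handler = Callable[[dict], None]


def dispatch(messages: Iterable[str], handlers: Dict[str, Handler]) -> int:
    """Decode each JSON message and pass it to the handler for its event type."""
    handled = 0
    for raw in messages:
        event = json.loads(raw)
        handler = handlers.get(event["eventType"])
        if handler is not None:  # event types without a handler are skipped
            handler(event)
            handled += 1
    return handled


# Hypothetical lineage use case: record which tables gained or lost data.
lineage: List[str] = []
handlers = {
    "ADD_PARTITION": lambda e: lineage.append(f"new partition in {e['table']}"),
    "DROP_TABLE": lambda e: lineage.append(f"dropped {e['table']}"),
}
messages = [
    json.dumps({"eventType": "ADD_PARTITION", "table": "sales.bookings"}),
    json.dumps({"eventType": "ALTER_TABLE", "table": "sales.bookings"}),
]
print(dispatch(messages, handlers), lineage)
# prints 1 ['new partition in sales.bookings']
```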

  19. Problem – rewriting data at scale
      • Changes to existing data
      • Read isolation for long running queries
      • Always create new folders for updates
      • Repoint Hive data locations
      • How to expire “orphaned data”?

  20. Beekeeper – orphaned data cleanup
      • https://github.com/ExpediaGroup/beekeeper/
      • Hive table parameter: beekeeper.remove.unreferenced.data=true
      • Apiary event listener
      • Detects data re-writes
      • Schedules old data for deletion in future
      • Periodically performs the data deletions
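
The detect, schedule and delete cycle can be sketched in a few lines. This is an illustrative model rather than Beekeeper's implementation: the `OrphanedDataCleaner` class, its method names and the injected `delete` callable are assumptions, and only the table parameter name comes from the slide.

```python
from datetime import datetime, timedelta
from typing import Callable, Dict, List

OPT_IN = "beekeeper.remove.unreferenced.data"  # table parameter named on the slide


class OrphanedDataCleaner:
    def __init__(self, delay: timedelta, delete: Callable[[str], None]):
        self.delay = delay
        self.delete = delete                     # e.g. a function that removes a folder
        self.pending: Dict[str, datetime] = {}   # old path -> earliest deletion time

    def on_location_change(self, table_params: Dict[str, str],
                           old_path: str, new_path: str, now: datetime) -> None:
        """Called per metadata event; schedules the old path if the table opted in."""
        if table_params.get(OPT_IN) == "true" and old_path != new_path:
            self.pending[old_path] = now + self.delay

    def run_cleanup(self, now: datetime) -> List[str]:
        """Called periodically; deletes every path whose delay has elapsed."""
        due = [path for path, when in self.pending.items() if when <= now]
        for path in due:
            self.delete(path)
            del self.pending[path]
        return due


deleted: List[str] = []
cleaner = OrphanedDataCleaner(delay=timedelta(days=3), delete=deleted.append)
start = datetime(2020, 1, 1)
cleaner.on_location_change({OPT_IN: "true"}, "s3://lake/t/v1", "s3://lake/t/v2", start)
print(cleaner.run_cleanup(start + timedelta(days=1)))  # prints []
print(cleaner.run_cleanup(start + timedelta(days=4)))  # prints ['s3://lake/t/v1']
```

The delay between scheduling and deletion is what gives long-running queries time to finish reading the old folder.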

  21. Consistent CRUD alternatives
      • http://hive.apache.org/ - Hive 3.1.x with ACID
      • https://iceberg.incubator.apache.org/ - Iceberg
      • https://delta.io/ - Delta Lake
      • https://hudi.apache.org/ - Hudi

  22. Don’t forget to test
      • https://github.com/klarna/HiveRunner/ - Hive SQL unit tests
      • https://github.com/HotelsDotCom/mutant-swarm/ - Code coverage for HiveRunner
      • https://github.com/HotelsDotCom/beeju - Unit tests for Thrift Hive metastore service and HiveServer2

  23. Where to next?
      • Hybrid cloud
         • best of both worlds but increased complexity
      • Multi-cloud
         • best of breed but increased complexity
      • Docker + Kubernetes
         • Reduce vendor lock-in
         • Massive scale without too much effort
         • Minimal changes for on-prem/EKS/GKE/AKS etc

  24. Open Source Data Lake Components
      • Hive Replication
         • https://github.com/HotelsDotCom/circus-train
         • https://github.com/ExpediaGroup/shunting-yard
      • Hive Federation
         • https://github.com/HotelsDotCom/waggle-dance
      • Hive Cleanup
         • https://github.com/ExpediaGroup/beekeeper
      • Cloud Data Lake
         • https://github.com/ExpediaGroup/apiary
