Apache Hadoop Ingestion & Dispersal Framework
Danny Chen (dannyc@uber.com), Omkar Joshi (omkar@uber.com), Eric Sayle (esayle@uber.com)
Uber Hadoop Platform Team
Strata NY 2018, September 12, 2018
Agenda
○ Mission
○ Ingestion & dispersal framework
○ High Level Architecture
○ Abstractions and Building Blocks
Mission: Build products to support reliable, scalable, easy-to-use, compliant, and efficient data transfer (both ingestion & dispersal) as well as data storage leveraging the Hadoop ecosystem, regardless of data & data store location.
Open sourced in September 2018: https://github.com/uber/marmaray
Blog post: https://eng.uber.com/marmaray-hadoop-ingestion-open-source/
[Diagram: source traffic flows through Marmaray ingestion pipelines into the Hadoop Data Lake; Marmaray dispersal pipelines move data back out to serving systems such as Schemaless and to analytical processing.]
High Level Architecture:
○ Input Storage System → Source Connector
○ Work Unit Calculator
○ Metadata Manager (checkpoint store)
○ Chain of converters (Converter 1 → Converter 2)
○ Sink Connector → Output Storage System
○ Error Tables
○ Schema Service
○ Datafeed Config Store
○ M3 Monitoring & Alerting System
Schema Service:
○ Reader / Decoder: gets schema by name & version from the Schema Service; Binary Data → Generic Record
○ Writer / Encoder: gets schema from the Schema Service; Generic Record → Binary Data
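The reader/writer flow above can be sketched in a few lines. This is a toy illustration, assuming a comma-separated encoding and hypothetical names (`SchemaService`, `decode`, `encode`); it is not the Marmaray API.

```python
class SchemaService:
    """Toy schema registry: look up a schema (an ordered list of field
    names here) by (name, version). Names are illustrative only."""
    def __init__(self):
        self.schemas = {}

    def register(self, name, version, fields):
        self.schemas[(name, version)] = fields

    def get_schema(self, name, version):
        return self.schemas[(name, version)]

def decode(schema, binary):
    """Reader / decoder: binary data -> generic record (a dict),
    pairing values with the schema's fields in order."""
    values = binary.decode("utf-8").split(",")
    return dict(zip(schema, values))

def encode(schema, record):
    """Writer / encoder: generic record -> binary data."""
    return ",".join(record[f] for f in schema).encode("utf-8")

# Usage: register a schema, then round-trip one record through it.
svc = SchemaService()
svc.register("rider", 1, ["id", "city"])
schema = svc.get_schema("rider", 1)
rec = decode(schema, b"42,NYC")
```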
Metadata Manager (checkpoint store):
○ init() called on job start: loads an in-memory copy from persistent storage (ex. HDFS)
○ set(key, value) and get(key) -> value: called 0 or more times by different Job DAG components
○ persist() called after job finish: writes the in-memory copy back to persistent storage
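The init / set / get / persist lifecycle above can be sketched as follows. This is a minimal file-backed illustration (class and file names are hypothetical), not the Marmaray implementation:

```python
import json
import os
import tempfile

class InMemoryMetadataManager:
    """Illustrative checkpoint store: load on job start, mutate in
    memory during the job, persist after the job finishes."""

    def __init__(self, path):
        self.path = path
        self.kv = {}

    def init(self):
        # Called on job start: load the previous checkpoint if one exists.
        if os.path.exists(self.path):
            with open(self.path) as f:
                self.kv = json.load(f)

    def set(self, key, value):
        self.kv[key] = value

    def get(self, key):
        return self.kv.get(key)

    def persist(self):
        # Called after job finish: write-then-rename so a crash mid-write
        # never corrupts the previous checkpoint.
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(self.kv, f)
        os.replace(tmp, self.path)

# Usage: one run checkpoints an offset; the next run picks it up.
ckpt = os.path.join(tempfile.mkdtemp(), "ckpt.json")
mm = InMemoryMetadataManager(ckpt)
mm.init()
mm.set("kafka.topic1.offset", 42)
mm.persist()
```

The atomic rename mirrors why a persistent store like HDFS is used: the next job run must see either the old checkpoint or the new one, never a partial write.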
Fork Operator: why is it needed?
○ Splits input records into schema-conforming records and error records
○ Avoids re-reading input records (or, in Spark, re-executing input transformations)
○ A Fork Function tags each record as success or failure (r1 S/F, r2 S/F, ..., rx S/F)
○ A success filter function and a failure filter function then split the tagged records
○ Tagged records are persisted using Spark’s disk/memory persistence level
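The tag-once, filter-twice pattern above can be sketched with plain lists. A minimal illustration (function names and the toy schema check are assumptions, and real Marmaray operates on persisted Spark RDDs, not lists):

```python
def fork(records, schema_check):
    """Sketch of a fork operator: tag each record exactly once as
    success/failure, then run two filters over the tagged collection.
    In Spark the tagged RDD is persisted so both filters reuse one
    pass over the input instead of re-executing upstream transforms."""
    tagged = [(r, schema_check(r)) for r in records]   # fork function: tag once
    successes = [r for r, ok in tagged if ok]          # success filter function
    failures = [r for r, ok in tagged if not ok]       # failure filter -> error table
    return successes, failures

# Usage: a record "conforms" to the toy schema if it carries an "id" field.
good, bad = fork([{"id": 1}, {"name": "x"}], lambda r: "id" in r)
```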
[Diagram: a data lake of GenericRecords at the center, connected to Kafka, Hive, S3, and a new source (Cassandra); each system needs only one connector to reach all the others, e.g. one Kafka topic feeding Hive Table 1 and Hive Table 2.]
Job Dag Actions (run after the JobDAG completes):
○ Report metrics for monitoring
○ Register table in Hive
JobManager:
○ Runs multiple JobDAGs in a single Spark job (e.g. ingestion for 300+ topics)
○ Locking (to prevent concurrent runs)
[Diagram: Job Mgr 1 runs one Spark job ingesting kafka-topic 1 (JobDAG 1) through kafka-topic N (JobDAG N), with a waiting queue for pending JobDAGs.]
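Running many JobDAGs inside one job with a bounded waiting queue can be sketched with a thread pool. This is an illustration of the scheduling idea only (names are hypothetical; Marmaray itself multiplexes JobDAGs inside a single Spark job):

```python
from concurrent.futures import ThreadPoolExecutor

def run_job_dags(job_dags, max_concurrent=4):
    """Sketch of a JobManager: submit every JobDAG to a bounded pool.
    DAGs beyond max_concurrent sit in the executor's queue, playing the
    role of the 'waiting Q for JobDAGs' in the diagram above."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        futures = {name: pool.submit(dag) for name, dag in job_dags.items()}
        for name, fut in futures.items():
            results[name] = fut.result()  # collect each DAG's outcome
    return results

# Usage: each "JobDAG" here is just a callable standing in for one topic's ingestion.
out = run_job_dags({f"topic-{i}": (lambda i=i: i * 2) for i in range(3)},
                   max_concurrent=2)
```

Packing hundreds of topic DAGs into one job amortizes the fixed cost of launching a Spark application per topic.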
Completeness checking:
○ Comparing individual records is possible for very small datasets, but won’t work for billions of records; very expensive!!
○ Instead, create time-based buckets, say for every 2 min or 10 min, and count records at source and sink during every run
○ Source (Kafka) and sink (Hive) are each divided into 10 min buckets and compared up to the latest bucket
○ Does this give a 100% guarantee? No, but w.h.p. it is close to it.
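The bucket-counting check above can be sketched directly: bucket each side's record timestamps, then diff the per-bucket counts. Function names and the epoch-seconds representation are assumptions for illustration:

```python
from collections import Counter

def bucket_counts(timestamps, bucket_minutes=10):
    """Count records per time bucket (timestamps in epoch seconds)."""
    width = bucket_minutes * 60
    return Counter(ts // width for ts in timestamps)

def find_incomplete_buckets(source_ts, sink_ts, bucket_minutes=10):
    """Sketch of the completeness check: compare per-bucket counts at
    the source (e.g. Kafka) and sink (e.g. Hive) and report buckets
    whose counts disagree. Illustrative only, not the Marmaray API."""
    src = bucket_counts(source_ts, bucket_minutes)
    snk = bucket_counts(sink_ts, bucket_minutes)
    return {b: (src[b], snk[b])
            for b in src.keys() | snk.keys()
            if src[b] != snk[b]}

# Usage: one source record (ts=1250, i.e. the third 10-min bucket)
# never reached the sink.
missing = find_incomplete_buckets([10, 650, 1250], [10, 650])
```

Matching counts do not prove the same records arrived (hence no 100% guarantee), but a mismatch pinpoints a small time window to investigate instead of a full-table diff.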
Error tables (Kafka → Marmaray → Hoodie (Hive)):
○ Src Converter: Input Record (IR) → Input Success Record (ISR) or Input Error Record (IER)
○ Sink Converter: ISR → Output Records (OR) or Output Error Record (OER)
○ IER and OER records are written to the Error Table
[Diagram: Kafka topic1 stored as date partitions (2014, 2015, ..., 2018/06). Older partitions hold stitched parquet files (~4GB, ~400 files per partition); the latest date partition holds non-stitched parquet files (~40MB, ~20-40K files per partition).]
Handling updates without Hoodie:
○ Need to scan the entire partition to find out if a record is present or not
○ Re-writing the entire partition for a handful of updates
Hoodie was built to solve this:
[Diagram: the same Kafka topic with a .hoodie metadata directory holding commits (ts1.commit, ts2.commit, ts3.commit). Data files are versioned by commit timestamp: f1_ts1.parquet, f2_ts1.parquet, f3_ts1.parquet, f4_ts1.parquet, f5_ts2.parquet, f6_ts2.parquet, f7_ts2.parquet, f1_ts3.parquet, f8_ts3.parquet. An update to f1 at ts3 writes f1_ts3.parquet instead of re-writing the partition.]
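The commit-versioned layout above implies a reader-side resolution step: for each file id, read only the newest committed version. A minimal sketch assuming the `f<id>_ts<n>.parquet` naming from the diagram; this is not Hudi's actual API:

```python
def latest_file_versions(files, committed):
    """For each file id, keep only the copy written by the newest
    committed timestamp. Files from uncommitted (in-flight or failed)
    writes are ignored, which is what makes commits atomic to readers."""
    latest = {}
    for name in files:
        base = name.rsplit(".", 1)[0]      # strip ".parquet"
        fid, ts = base.split("_")          # "f1_ts3" -> ("f1", "ts3")
        if ts in committed and (fid not in latest or ts > latest[fid][0]):
            latest[fid] = (ts, name)
    return sorted(name for _, name in latest.values())

# Usage: f1 was updated at ts3, so its ts1 copy is superseded.
view = latest_file_versions(
    ["f1_ts1.parquet", "f2_ts1.parquet", "f1_ts3.parquet"],
    committed={"ts1", "ts2", "ts3"},
)
```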
common:
  hadoop:
    fs.defaultFS: "hdfs://namenode/"
hoodie:
  table_name: "mydb.table1"
  base_path: "/path/to/my.db/table1"
  metrics_prefix: "marmaray"
  enable_metrics: true
  parallelism: 64
kafka:
  conn:
    bootstrap.servers: "kafkanode1:9092,kafkanode2:9092"
    fetch.wait.max.ms: 1000
    socket.receive.buffer.bytes: 5242880
    fetch.message.max.bytes: 20971520
    auto.commit.enable: false
    fetch.min.bytes: 5242880
  source:
    topic_name: "topic1"
    max_messages: 1024
    read_parallelism: 64
error_table:
  enabled: true
  dest_path: "/path/to/my.db/table1/.error"
  date_partitioned: true
containers
Apache Gobblin: https://gobblin.readthedocs.io/en/latest/
Related talks at Strata NY 2018:
○ Your 5 billion rides are arriving now: Scaling Apache Spark for data pipelines and intelligent systems at Uber (Wed 11:20am)
○ Hudi: Unifying storage and serving for batch and near-real-time analytics (Wed 5:25pm)
Positions available: Seattle, Palo Alto & San Francisco. Email: hadoop-platform-jobs@uber.com
Follow our Facebook page: www.facebook.com/uberopensource