Amundsen: A Data Discovery Platform from Lyft
April 17th 2019
Jin Hyuk Chang | @jinhyukchang | Engineer, Lyft
Tao Feng | @feng-tao | Engineer, Lyft

Agenda
• Data at Lyft
• Challenges with Data Discovery
• Data Discovery at Lyft
• Demo
• Architecture
• Summary

Data platform users General Analysts Data Scientists Data Modelers Product Engineers Experimenters Managers Managers Data Platform 3

Core Infra high level architecture Custom apps 4

Data Discovery 5

Hi! I am a n00b Data Scientist!
• My first project is to analyze and predict Data Council attendance
• Where is the data?
• What does it mean?

Status quo • Option 1: Phone a friend! • Option 2: Github search 7

Understand the context
• What does this field mean?
  ‒ Does attendance data include employees?
  ‒ Does it include revenue?
• Let me dig in and understand

Explore
SELECT * FROM default.my_table WHERE ds='2018-01-01' LIMIT 100;

Exploring with SELECT * is EVIL 1. Lack of productivity for data scientists 2. Increased load on the databases 10

Data Scientists spend up to 1/3 of their time in Data Discovery...
• Data discovery
  ‒ Lack of understanding of what data exists, where it lives, who owns it, who uses it, and how to request access.

Audience for data discovery 12

Data Discovery - User personas General Analysts Data Scientists Data Modelers Product Engineers Experimenters Managers Managers Data Platform 13

3 Data Scientist personas
Power user
● All info in their head
● Get interrupted a lot due to questions
Noob user
● Lost
● Ask “power users” a lot of questions
Manager
● Dependencies landing on time
● Communicating with stakeholders

Data Discovery answers 3 kinds of questions
Search based
● Where is the table/dashboard for X? What does it contain?
● Does this analysis already exist?
Lineage based
● I am changing a data model, who are the owner and most common users?
● This table’s delivery was delayed today, I want to notify everyone downstream.
Network based
● I want to follow a power user in my team.
● I want to bookmark tables of interest and get a feed of data delay, schema change, incidents.

Meet Amundsen First person to discover the South Pole - Norwegian explorer, Roald Amundsen 16

Landing page optimized for search

Search results ranked on relevance and query activity

How does search work? 19

Relevance - search for “apple” on Google Low relevance High relevance 20

Popularity - search for “apple” on Google Low popularity High popularity 21

Striking the balance
Relevance
● Names, Descriptions, Tags, [owners, frequent users]
Popularity
● Querying activity
● Dashboarding
● Different weights for automated vs adhoc querying
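One way to picture the balance between relevance and popularity is a toy ranking function. The sketch below is not Amundsen's actual scoring; the field names, the match weights (3/2/1), the log damping, and the 1.0 vs 0.1 weights for ad hoc vs automated queries are all assumptions chosen for illustration.

```python
import math
from dataclasses import dataclass, field


@dataclass
class TableDoc:
    name: str
    description: str
    tags: list = field(default_factory=list)
    adhoc_queries: int = 0       # queries typed by people (assumed signal)
    automated_queries: int = 0   # queries issued by scheduled jobs (assumed signal)


def relevance(doc, term):
    """Text match score: name matches count more than tags or descriptions."""
    term = term.lower()
    score = 0.0
    if term in doc.name.lower():
        score += 3.0
    if any(term == tag.lower() for tag in doc.tags):
        score += 2.0
    if term in doc.description.lower():
        score += 1.0
    return score


def popularity(doc, adhoc_weight=1.0, automated_weight=0.1):
    """Usage score: automated queries are down-weighted, then log-damped."""
    weighted = adhoc_weight * doc.adhoc_queries + automated_weight * doc.automated_queries
    return math.log1p(weighted)


def rank(docs, term):
    """Keep only matching docs and order them by relevance boosted by popularity."""
    matches = [d for d in docs if relevance(d, term) > 0]
    return sorted(matches, key=lambda d: relevance(d, term) * (1 + popularity(d)), reverse=True)
```

With these assumed weights, a table that people query by hand outranks an equally well-matching table that is only touched by scheduled jobs.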

Back to mocks... 23

Search results ranked on relevance and query activity

Detailed description and metadata about data resources

Data Preview within the tool

Computed stats about column metadata Disclaimer: these stats are arbitrary.

Built-in user feedback

Demo 29

Open source in mind
• Pluggable code in each microservice via Python entry points, etc.
• Pluggable API endpoints via Blueprint
• Build your ingestion pipeline like Lego bricks
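Python entry points name a plugin as a `module:attribute` string. The sketch below shows how such a string can be resolved with the standard library; `load_plugin` is a hypothetical helper for illustration, not Amundsen's own loading code.

```python
import importlib


def load_plugin(spec):
    """Resolve a 'package.module:attribute' string to the object it names."""
    module_name, _, attr = spec.partition(":")
    if not module_name or not attr:
        raise ValueError(f"expected 'module:attribute', got {spec!r}")
    module = importlib.import_module(module_name)
    return getattr(module, attr)


# Usage: the plugin is chosen by configuration, not by an import in the code.
serializer = load_plugin("json:dumps")
```

Because the implementation is selected by a string, a deployment could swap in its own class through configuration without patching the service.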

Amundsen’s architecture 31

[Architecture diagram. Components: Frontend Service; Other Microservices (ML Feature Service, Security Service); Metadata Service (Neo4j); Search Service (Elasticsearch); Databuilder Crawler; Metadata Sources (Hive, Redshift, Postgres, Presto, Github Source File, ...)]

1. Frontend Service 33

Amundsen table detail page

2. Metadata Service 36

2. Metadata Service
• A thin proxy layer to interact with the graph database
  ‒ Currently Neo4j is the default option for the graph backend engine
  ‒ Work with the community to support Apache Atlas
• Support REST API for other services pushing / pulling metadata directly
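The "thin proxy layer" idea can be sketched as an abstract interface with interchangeable backends. Everything below is hypothetical: the class and method names are invented for illustration, and an in-memory dictionary stands in for a real graph store such as Neo4j or Apache Atlas.

```python
from abc import ABC, abstractmethod


class MetadataProxy(ABC):
    """Backend-agnostic interface the service codes against (hypothetical)."""

    @abstractmethod
    def put_table(self, key, description):
        """Store or update the description of a table."""

    @abstractmethod
    def get_table(self, key):
        """Return the stored table record, or None if unknown."""


class InMemoryProxy(MetadataProxy):
    """Stand-in backend; a graph-database proxy would implement the same methods."""

    def __init__(self):
        self._tables = {}

    def put_table(self, key, description):
        self._tables[key] = {"key": key, "description": description}

    def get_table(self, key):
        return self._tables.get(key)


class MetadataService:
    """Service logic depends only on the proxy interface, not on a backend."""

    def __init__(self, proxy):
        self._proxy = proxy

    def describe(self, key):
        table = self._proxy.get_table(key)
        if table is None:
            raise KeyError(key)
        return table["description"]
```

Keeping the service behind an interface like this is what would let a second backend be added without touching the callers.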

Trade Off #1 Why choose Graph database 39

Why Graph database?

Trade Off #2 Why not propagate the metadata back to source 42

Why not propagate the metadata back to source

3. Search Service 46

3. Search Service
• A thin proxy layer to interact with the search backend
  ‒ Currently it supports Elasticsearch as the search backend.
• Support different search patterns
  ‒ Normal Search: match records based on relevancy
  ‒ Category Search: match records first based on data type, then relevancy
  ‒ Wildcard Search
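The three search patterns can be illustrated with a small dispatcher over an in-memory list. This is only a sketch: the real service delegates to Elasticsearch, and the `category: term` query syntax, the record layout, and the `search` function are assumptions made for the example.

```python
from fnmatch import fnmatchcase


def search(query, tables):
    """Dispatch a query string to category, wildcard, or normal search."""
    query = query.strip()
    if ":" in query:
        # Category search, e.g. "column: is_line_ride" (assumed syntax).
        category, _, term = query.partition(":")
        category, term = category.strip(), term.strip()
        if category == "column":
            return [t["name"] for t in tables if term in t["columns"]]
        if category == "table":
            return [t["name"] for t in tables if term == t["name"]]
        raise ValueError(f"unknown category: {category!r}")
    if "*" in query:
        # Wildcard search on table names, e.g. "event_*".
        return [t["name"] for t in tables if fnmatchcase(t["name"], query)]
    # Normal search: plain substring match on name or description.
    return [t["name"] for t in tables if query in t["name"] or query in t["description"]]
```

A real backend would also rank the matches; this sketch only shows how the three patterns select different records.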

Challenge #1 How to make the search result more relevant? 49

How to make the search result more relevant?
• Define a search quality metric
  ‒ Click-Through-Rate (CTR) over top 5 results
• Search behaviour instrumentation is key
• Couple of improvements:
  ‒ Boost the exact table ranking
  ‒ Support wildcard search (e.g. event_*)
  ‒ Support category search (e.g. column: is_line_ride)
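A metric like CTR over the top 5 results could be computed from instrumented search events roughly as follows. The slides do not give the exact definition used at Lyft, so this sketch assumes one: the share of searches whose click landed on a result ranked 1 to 5. The event shape (`clicked_rank`, 1-based, `None` when nothing was clicked) is also hypothetical.

```python
def ctr_at_k(search_events, k=5):
    """Fraction of searches with a click on one of the top-k results."""
    if not search_events:
        return 0.0
    hits = sum(
        1
        for event in search_events
        if event["clicked_rank"] is not None and event["clicked_rank"] <= k
    )
    return hits / len(search_events)


# Usage: four logged searches, two of them clicked within the top 5.
events = [
    {"query": "rides", "clicked_rank": 1},
    {"query": "event_*", "clicked_rank": 3},
    {"query": "drivers", "clicked_rank": 7},
    {"query": "coupons", "clicked_rank": None},
]
top5_ctr = ctr_at_k(events)
```

Tracking this number before and after a ranking change is one way to tell whether the change helped.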

4. Data Builder 51

[Architecture diagram. Components: Frontend Service; Other Microservices (ML Feature Service, Other Services); Metadata Service (Neo4j); Search Service (Elasticsearch); Databuilder Crawler; Metadata Sources (Hive, Redshift, Postgres, Presto, Github Source File, ...)]

Challenge #1 Various forms of metadata 53

Metadata Sources @ Lyft 54

Metadata - Challenges
• No Standardization: No single data model that fits all data resources
  ‒ A data resource could be a table, an Airflow DAG or a dashboard
• Different Extraction: Each data set's metadata is stored and fetched differently
  ‒ Hive Table: Stored in Hive metastore
  ‒ RDBMS (postgres etc): Fetched through DBAPI interface
  ‒ Github source code: Fetched through git hook
  ‒ Mode dashboard: Fetched through Mode API
  ‒ …
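One common answer to "different extraction per source" is a shared extractor interface that normalizes every source into one record type. The sketch below is an assumption-laden illustration, not the Databuilder API: `ResourceRecord`, `Extractor`, both extractor classes, and `run_job` are hypothetical names, and the source data is stubbed in memory instead of coming from a metastore or a dashboard API.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceRecord:
    kind: str          # e.g. "table" or "dashboard"
    name: str
    description: str


class Extractor(ABC):
    @abstractmethod
    def extract(self):
        """Yield ResourceRecord objects from one metadata source."""


class MetastoreRowExtractor(Extractor):
    """Takes (schema, table, comment) rows, as a metastore query might return."""

    def __init__(self, rows):
        self._rows = rows

    def extract(self):
        for schema, table, comment in self._rows:
            yield ResourceRecord("table", f"{schema}.{table}", comment or "")


class DashboardPayloadExtractor(Extractor):
    """Takes a list of dicts, as a dashboard API might return."""

    def __init__(self, payload):
        self._payload = payload

    def extract(self):
        for item in self._payload:
            yield ResourceRecord("dashboard", item["title"], item.get("summary", ""))


def run_job(extractors):
    """Run every extractor and collect the normalized records."""
    records = []
    for extractor in extractors:
        records.extend(extractor.extract())
    return records
```

Each new source then only needs its own extractor; everything downstream sees the same record shape.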

Challenge #2 Pull model vs Push model 56

Pull model vs. Push model
Pull Model
● Periodically update the index by pulling from the system (e.g. database) via crawlers.
[Diagram: Scheduler, Crawler, Database, Data graph]
Push Model
● The system (e.g. database) pushes metadata to a message bus which downstream subscribes to.
[Diagram: Database, Message queue, Data graph]

Pull model vs. push model
Pull Model
● Onus of integration lies on data graph
● No interface to prescribe, hard to maintain crawlers
[Diagram: Scheduler, Crawler, Database, Data graph]
Preferred if
● Waiting for indexing is ok
● Working with “strapped” teams
● There’s already an interface
Push Model
● Onus of integration lies on database
● Message format serves as the interface
● Allows for near-real time indexing
[Diagram: Database, Message queue, Data graph]
Preferred if
● Near-real time indexing is important
● Clean interface doesn’t exist
● Other tools like Wherehows are moving towards Push Model
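The difference between the two models can be shown with a toy example. This is a sketch under simplifying assumptions: `Database`, `DataGraph`, `crawl`, and `consume` are hypothetical names, and Python's in-process `queue.Queue` stands in for a real message bus.

```python
import queue


class DataGraph:
    """The index that the discovery tool serves from."""

    def __init__(self):
        self.index = {}

    def upsert(self, table, schema):
        self.index[table] = schema


class Database:
    """A source system; it publishes changes only if a bus is configured."""

    def __init__(self, bus=None):
        self.tables = {}
        self._bus = bus

    def create_table(self, name, schema):
        self.tables[name] = schema
        if self._bus is not None:
            # Push model: the message format is the interface.
            self._bus.put({"table": name, "schema": schema})


def crawl(database, graph):
    """Pull model: a scheduler calls this periodically to re-read the source."""
    for name, schema in database.tables.items():
        graph.upsert(name, schema)


def consume(bus, graph):
    """Push model: apply every message currently waiting on the bus."""
    while True:
        try:
            message = bus.get_nowait()
        except queue.Empty:
            break
        graph.upsert(message["table"], message["schema"])
```

In the pull case the graph stays stale until the next crawl; in the push case the change is available as soon as the message is consumed, at the cost of the database owner having to publish it.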

4. Databuilder

Databuilder in action

How are we building data? Databuilder

How is databuilder orchestrated? Amundsen uses Apache Airflow to orchestrate Databuilder jobs

What’s next? 64

Amundsen seems to be more useful than what we thought
• Tremendous success at Lyft
  ‒ Used by Data Scientists, Engineers, PMs, Ops, even Cust. Service!
• Many organizations have similar problems
  ‒ Collaborating with ING, WeWork and more
  ‒ We plan to announce open source soon

Impact - Amundsen at Lyft Generally Available (GA) release Beta release (internal) Alpha release 66

Summary 67

Adding more kinds of data resources
Resources: Dashboards, Data sets, People, Streams, Schemas, Workflows
Phases: Phase 1 (Complete), Phase 2 (In development), Phase 3 (In Scoping)

Summary • Data Discovery adds 30+% more productivity to Data Scientists • Metadata is key to the next wave of big data applications • Amundsen - Lyft’s metadata and data discovery platform • Blog post with more details: go.lyft.com/datadiscoveryblog 69
