Amundsen: A Data Discovery Platform from Lyft April 17th 2019 Jin - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Amundsen: A Data Discovery Platform from Lyft

April 17th 2019
Jin Hyuk Chang | @jinhyukchang | Engineer, Lyft
Tao Feng | @feng-tao | Engineer, Lyft

slide-2
SLIDE 2

Agenda

  • Data at Lyft
  • Challenges with Data Discovery
  • Data Discovery at Lyft
  • Demo
  • Architecture
  • Summary

2

slide-3
SLIDE 3

Data platform users

Users of the Data Platform:

  • Data Modelers
  • Analysts
  • Data Scientists
  • General Managers
  • Engineers
  • Experimenters
  • Product Managers

slide-4
SLIDE 4

4

Core Infra high level architecture

Custom apps

slide-5
SLIDE 5

Data Discovery

5

slide-6
SLIDE 6

Hi! I am a n00b Data Scientist!

  • My first project is to analyze and predict Data Council attendance
  • Where is the data?
  • What does it mean?

slide-7
SLIDE 7

Status quo

  • Option 1: Phone a friend!
  • Option 2: Github search

slide-8
SLIDE 8

Understand the context

  • What does this field mean?
    ‒ Does attendance data include employees?
    ‒ Does it include revenue?
  • Let me dig in and understand

slide-9
SLIDE 9

Explore

SELECT * FROM default.my_table WHERE ds='2018-01-01' LIMIT 100;

slide-10
SLIDE 10

Exploring with SELECT * is EVIL

1. Lack of productivity for data scientists
2. Increased load on the databases

slide-11
SLIDE 11

Data Scientists spend up to one third of their time on data discovery...

  • Data discovery
    ‒ Lack of understanding of what data exists, where it lives, who owns it, who uses it, and how to request access.

slide-12
SLIDE 12

Audience for data discovery

12

slide-13
SLIDE 13

Data Discovery - User personas

Users of the Data Platform:

  • Data Modelers
  • Analysts
  • Data Scientists
  • General Managers
  • Engineers
  • Experimenters
  • Product Managers

slide-14
SLIDE 14

3 Data Scientist personas

Power user
  • All info in their head
  • Get interrupted a lot due to questions

Noob user
  • Lost
  • Ask “power users” a lot of questions

Manager
  • Dependencies landing on time
  • Communicating with stakeholders

slide-15
SLIDE 15

Data Discovery answers 3 kinds of questions

Search based
  • Where is the table/dashboard for X? What does it contain?
  • Does this analysis already exist?

Lineage based
  • I am changing a data model, who are the owners and most common users?
  • This table’s delivery was delayed today, I want to notify everyone downstream.

Network based
  • I want to follow a power user in my team.
  • I want to bookmark tables of interest and get a feed of data delays, schema changes, incidents.

slide-16
SLIDE 16

Meet Amundsen

16

First person to discover the South Pole - Norwegian explorer, Roald Amundsen

slide-17
SLIDE 17

Landing page optimized for search

slide-18
SLIDE 18

Search results ranked on relevance and query activity

slide-19
SLIDE 19

How does search work?

19

slide-20
SLIDE 20

Relevance - search for “apple” on Google

20

Low relevance High relevance

slide-21
SLIDE 21

Popularity - search for “apple” on Google

21

Low popularity High popularity

slide-22
SLIDE 22

Striking the balance

Relevance
  • Names, Descriptions, Tags, [owners, frequent users]

Popularity
  • Querying activity
  • Dashboarding
  • Different weights for automated vs ad hoc querying
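
One way to picture the balance between relevance and popularity is a single blended score. The sketch below is illustrative only: the weights, the log damping, the product form, and every name in it are assumptions, since the slides do not state the formula Amundsen actually uses.

```python
import math
from dataclasses import dataclass


@dataclass
class TableStats:
    name: str
    relevance: float        # text-match score from the search backend (assumed 0..1)
    adhoc_queries: int      # queries issued by people
    automated_queries: int  # queries issued by scheduled jobs


# Hypothetical weights: ad hoc querying counts more than automated querying.
ADHOC_WEIGHT = 1.0
AUTOMATED_WEIGHT = 0.1


def popularity(stats: TableStats) -> float:
    """Weighted query count, log-damped so heavily used tables do not swamp relevance."""
    weighted = (ADHOC_WEIGHT * stats.adhoc_queries
                + AUTOMATED_WEIGHT * stats.automated_queries)
    return math.log1p(weighted)


def score(stats: TableStats) -> float:
    """Blend relevance and popularity; the product form is an illustrative choice."""
    return stats.relevance * (1.0 + popularity(stats))


def rank(candidates):
    """Return candidates ordered from highest to lowest blended score."""
    return sorted(candidates, key=score, reverse=True)
```

With these assumed weights, a slightly less relevant table that many people query by hand outranks a closely matching table that nobody uses.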

slide-23
SLIDE 23

Back to mocks...

23

slide-24
SLIDE 24

Search results ranked on relevance and query activity

slide-25
SLIDE 25

Detailed description and metadata about data resources

slide-26
SLIDE 26

Data Preview within the tool

slide-27
SLIDE 27

Computed stats about column metadata

Disclaimer: these stats are arbitrary.

slide-28
SLIDE 28

Built-in user feedback

slide-29
SLIDE 29

Demo

29

slide-30
SLIDE 30

Open source in mind

  • Pluggable code in each microservice, e.g. via Python entry points
  • Pluggable API endpoint via Blueprint
  • Build your ingestion pipeline like a Lego brick
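
The pluggability idea above can be sketched with the "module:attribute" notation that Python entry points use. This is a simplified stand-in, not Amundsen's actual loading code: `load_plugin` and `CONFIG` are hypothetical names, and a standard-library class plays the role of the plugged-in implementation.

```python
import importlib


def load_plugin(spec: str):
    """Resolve a 'package.module:attribute' string to the object it names."""
    module_name, _, attr = spec.partition(":")
    if not module_name or not attr:
        raise ValueError(f"expected 'module:attribute', got {spec!r}")
    module = importlib.import_module(module_name)
    return getattr(module, attr)


# Hypothetical configuration: swap the implementation by changing one string.
CONFIG = {"json_encoder": "json:JSONEncoder"}

encoder_cls = load_plugin(CONFIG["json_encoder"])
encoder = encoder_cls(sort_keys=True)
```

The service code only ever sees the object returned by `load_plugin`, so a deployment can point the configuration string at its own class without patching the service.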
slide-31
SLIDE 31

Amundsen’s architecture

31

slide-32
SLIDE 32

[Architecture diagram]

  • Metadata Sources: Postgres, Hive, Redshift, ..., Presto, Github Source File
  • Databuilder Crawler
  • Neo4j, Elasticsearch
  • Metadata Service, Search Service, Frontend Service
  • Other Microservices: ML Feature Service, Security Service

slide-33
SLIDE 33
  • 1. Frontend Service

33

slide-34
SLIDE 34

[Architecture diagram, as on slide 32]

slide-35
SLIDE 35

Amundsen table detail page

slide-36
SLIDE 36
  • 2. Metadata Service

36

slide-37
SLIDE 37

[Architecture diagram, as on slide 32]

slide-38
SLIDE 38


  • 2. Metadata Service
  • A thin proxy layer to interact with the graph database
    ‒ Currently Neo4j is the default option for the graph backend engine
    ‒ Working with the community to support Apache Atlas
  • Supports a REST API for other services pushing / pulling metadata directly
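
A thin proxy layer of this kind can be sketched as an interface with swappable backends. All class and method names below are hypothetical, and the in-memory backend only stands in for a graph database such as Neo4j; this is not the Metadata Service's real API.

```python
from abc import ABC, abstractmethod
from typing import Dict


class MetadataProxy(ABC):
    """Interface the service codes against; backends are swappable."""

    @abstractmethod
    def get_table_description(self, table_key: str) -> str: ...

    @abstractmethod
    def put_table_description(self, table_key: str, description: str) -> None: ...


class InMemoryProxy(MetadataProxy):
    """Stand-in backend for illustration; a real one would talk to a graph database."""

    def __init__(self) -> None:
        self._descriptions: Dict[str, str] = {}

    def get_table_description(self, table_key: str) -> str:
        return self._descriptions.get(table_key, "")

    def put_table_description(self, table_key: str, description: str) -> None:
        self._descriptions[table_key] = description


class MetadataService:
    """Thin layer: validates input, then delegates to the configured proxy."""

    def __init__(self, proxy: MetadataProxy) -> None:
        self._proxy = proxy

    def update_description(self, table_key: str, description: str) -> None:
        if not table_key:
            raise ValueError("table_key must not be empty")
        self._proxy.put_table_description(table_key, description.strip())

    def describe(self, table_key: str) -> str:
        return self._proxy.get_table_description(table_key)
```

Because the service depends only on the `MetadataProxy` interface, adding a second backend means writing one new subclass rather than touching the service itself.
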
slide-39
SLIDE 39

Trade Off #1 Why choose Graph database

39

slide-40
SLIDE 40

Why Graph database?

slide-41
SLIDE 41

Why Graph database?

slide-42
SLIDE 42

Trade Off #2 Why not propagate the metadata back to source

42

slide-43
SLIDE 43

Why not propagate the metadata back to source

43

slide-44
SLIDE 44

Why not propagate the metadata back to source

44


slide-45
SLIDE 45

Why not propagate the metadata back to source

45

slide-46
SLIDE 46
  • 3. Search Service

46

slide-47
SLIDE 47

[Architecture diagram, as on slide 32]

slide-48
SLIDE 48

  • 3. Search Service
  • A thin proxy layer to interact with the search backend
    ‒ Currently it supports Elasticsearch as the search backend.
  • Support different search patterns
    ‒ Normal Search: match records based on relevancy
    ‒ Category Search: match records first based on data type, then relevancy
    ‒ Wildcard Search
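
The three search patterns can be illustrated with a toy dispatcher over an in-memory index. Everything here is an assumption made for the example: the index contents, the "category: term" syntax handling, and the substring match that stands in for relevancy scoring. The real service delegates to Elasticsearch.

```python
from fnmatch import fnmatch

# Hypothetical index: table name -> column names.
INDEX = {
    "event_ride_started": ["ride_id", "is_line_ride"],
    "event_ride_ended": ["ride_id", "fare"],
    "dim_drivers": ["driver_id", "rating"],
}


def search(query: str):
    """Dispatch on query shape: 'category: term', wildcard, or plain term."""
    query = query.strip()
    if ":" in query:  # category search, e.g. "column: is_line_ride"
        category, term = (part.strip() for part in query.split(":", 1))
        if category == "column":
            return sorted(t for t, cols in INDEX.items() if term in cols)
        raise ValueError(f"unsupported category: {category}")
    if "*" in query:  # wildcard search, e.g. "event_*"
        return sorted(t for t in INDEX if fnmatch(t, query))
    # Normal search: a plain substring match stands in for relevancy scoring.
    return sorted(t for t in INDEX if query in t)
```

For example, `search("event_*")` returns both event tables, while `search("column: is_line_ride")` returns only the table that has that column.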

slide-49
SLIDE 49

Challenge #1 How to make the search result more relevant?

49

slide-50
SLIDE 50

How to make the search result more relevant?

  • Define a search quality metric
    ‒ Click-Through-Rate (CTR) over top 5 results
  • Search behaviour instrumentation is key
  • A couple of improvements:
    ‒ Boost the exact table ranking
    ‒ Support wildcard search (e.g. event_*)
    ‒ Support category search (e.g. column: is_line_ride)
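
A CTR-over-top-5 metric could be computed from instrumented search events roughly as follows. The event shape and the exact definition (share of searches that end in a click on one of the first five results) are assumptions; the slides name the metric but do not define it precisely.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class SearchEvent:
    query: str
    clicked_rank: Optional[int]  # 1-based position of the clicked result, None if no click


def ctr_at_k(events: Iterable[SearchEvent], k: int = 5) -> float:
    """Fraction of searches that ended in a click on one of the top-k results."""
    events = list(events)
    if not events:
        return 0.0
    hits = sum(1 for e in events
               if e.clicked_rank is not None and e.clicked_rank <= k)
    return hits / len(events)
```

Tracking this number before and after a ranking change, such as boosting exact table-name matches, gives a simple way to tell whether the change helped.
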

slide-51
SLIDE 51
  • 4. Databuilder

51

slide-52
SLIDE 52

[Architecture diagram, as on slide 32, with "Other Services" shown in place of "Security Service"]

slide-53
SLIDE 53

Challenge #1 Various forms of metadata

53

slide-54
SLIDE 54

54

Metadata Sources @ Lyft

slide-55
SLIDE 55

Metadata - Challenges

  • No Standardization: No single data model fits all data resources
    ‒ A data resource could be a table, an Airflow DAG or a dashboard
  • Different Extraction: Each data set's metadata is stored and fetched differently
    ‒ Hive Table: Stored in Hive metastore
    ‒ RDBMS (Postgres etc.): Fetched through DBAPI interface
    ‒ Github source code: Fetched through git hook
    ‒ Mode dashboard: Fetched through Mode API
    ‒ …
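
One common way to handle sources that each store and expose metadata differently is to hide every source behind an extractor that emits one shared record type. The sketch below shows that pattern under stated assumptions: the class names and record fields are hypothetical, the two extractors read from plain Python objects rather than a metastore or a dashboard API, and none of this is Databuilder's actual interface.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict, Iterator, List


@dataclass
class ResourceRecord:
    """Common shape every extractor emits, whatever the source looks like."""
    resource_type: str  # e.g. "table", "dashboard"
    key: str
    description: str


class Extractor(ABC):
    @abstractmethod
    def extract(self) -> Iterator[ResourceRecord]: ...


class DictTableExtractor(Extractor):
    """Stand-in for a metastore or DBAPI extractor; reads from a plain dict."""

    def __init__(self, tables: Dict[str, str]) -> None:
        self._tables = tables

    def extract(self) -> Iterator[ResourceRecord]:
        for name, description in self._tables.items():
            yield ResourceRecord("table", name, description)


class ListDashboardExtractor(Extractor):
    """Stand-in for an extractor backed by a dashboard tool's API."""

    def __init__(self, titles: List[str]) -> None:
        self._titles = titles

    def extract(self) -> Iterator[ResourceRecord]:
        for title in self._titles:
            yield ResourceRecord("dashboard", title, "")


def collect(extractors: List[Extractor]) -> List[ResourceRecord]:
    """Downstream code sees one record type regardless of the source."""
    return [record for ex in extractors for record in ex.extract()]
```

Supporting a new kind of source then means adding one extractor class; the code that loads records into the graph and the search index stays unchanged.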

slide-56
SLIDE 56

Challenge #2 Pull model vs Push model

56

slide-57
SLIDE 57

Pull model vs. Push model

Pull Model
  • Periodically update the index by pulling from the system (e.g. database) via crawlers.
  • Diagram: Scheduler, Crawler, Database, Data graph

Push Model
  • The system (e.g. database) pushes metadata to a message bus that downstream consumers subscribe to.
  • Diagram: Database, Message queue, Data graph
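
The difference between the two models can be shown with a toy example in which plain dicts stand in for the source database and the data graph, and a standard-library queue stands in for the message bus. All names are hypothetical; this is a sketch of the idea, not of Amundsen's ingestion code.

```python
import queue

# Hypothetical source system and data graph, both plain dicts for illustration.
database = {"rides": "v1"}
data_graph = {}


def crawl_once():
    """Pull model: a scheduler would call this periodically to copy metadata."""
    for table, schema_version in database.items():
        data_graph[table] = schema_version


message_queue = queue.Queue()


def publish_change(table, schema_version):
    """Push model: the source system emits a message when its metadata changes."""
    database[table] = schema_version
    message_queue.put({"table": table, "schema_version": schema_version})


def consume_all():
    """Push model: the data graph side applies every pending message."""
    while not message_queue.empty():
        message = message_queue.get()
        data_graph[message["table"]] = message["schema_version"]
```

In the pull case the data graph is only as fresh as the last crawl. In the push case it is updated as soon as messages are consumed, but only for changes the source system actually publishes.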

slide-58
SLIDE 58

Pull model vs. push model

Pull Model
  • Onus of integration lies on data graph
  • No interface to prescribe, hard to maintain crawlers
  • Diagram: Scheduler, Crawler, Database, Data graph

Push Model
  • Onus of integration lies on database
  • Message format serves as the interface
  • Allows for near-real time indexing
  • Diagram: Database, Message queue, Data graph

slide-59
SLIDE 59

Pull model vs. push model

Pull Model
  • Onus of integration lies on data graph
  • No interface to prescribe, hard to maintain crawlers
  • Diagram: Crawler, Database, Data graph
  • Preferred if
    ‒ Waiting for indexing is ok
    ‒ Working with “strapped” teams
    ‒ There’s already an interface

Push Model
  • Onus of integration lies on database
  • Message format serves as the interface
  • Allows for near-real time indexing
  • Diagram: Database, Message queue, Data graph
  • Preferred if
    ‒ Near-real time indexing is important
    ‒ Clean interface doesn’t exist
    ‒ Other tools like Wherehows are moving towards Push Model
slide-60
SLIDE 60
  • 4. Databuilder
slide-61
SLIDE 61

Databuilder in action

slide-62
SLIDE 62

How are we building data? Databuilder

slide-63
SLIDE 63

How is databuilder orchestrated?

Amundsen uses Apache Airflow to orchestrate Databuilder jobs

slide-64
SLIDE 64

What’s next?

64

slide-65
SLIDE 65

Amundsen seems to be more useful than we thought

  • Tremendous success at Lyft
    ‒ Used by Data Scientists, Engineers, PMs, Ops, even Customer Service!
  • Many organizations have similar problems
    ‒ Collaborating with ING, WeWork and more
    ‒ We plan to announce open source soon

slide-66
SLIDE 66

Impact - Amundsen at Lyft

66

Beta release (internal) Generally Available (GA) release Alpha release

slide-67
SLIDE 67

Summary

67

slide-68
SLIDE 68

Adding more kinds of data resources

  • Resource kinds shown: People, Dashboards, Data sets, Streams, Schemas, Workflows
  • Phases shown: Phase 1 (Complete), Phase 2 (In development), Phase 3 (In Scoping)

slide-69
SLIDE 69

Summary

  • Data Discovery adds 30+% more productivity to Data Scientists
  • Metadata is key to the next wave of big data applications
  • Amundsen - Lyft’s metadata and data discovery platform
  • Blog post with more details: go.lyft.com/datadiscoveryblog

69

slide-70
SLIDE 70

Jin Hyuk Chang | @jinhyukchang
Tao Feng | @feng-tao
Slides at go.lyft.com/amundsen_datacouncil_2019
Blog post at go.lyft.com/datadiscoveryblog

Icons under Creative Commons License from https://thenounproject.com/

70

slide-71
SLIDE 71

Backup

71