Ground: A Data Context Service, by Joe Hellerstein, Vikram Sreekanti, et al.


slide-1
SLIDE 1

ground

Ground: A Data Context Service

Joe Hellerstein, Vikram Sreekanti, Joey Gonzalez, et al. CIDR 2017 https://github.com/ground-context/ground

slide-2
SLIDE 2

Open Source Big Data Community Health

Long-term Data Management

Data Analysis Data Wrangling

F A I L

slide-3
SLIDE 3

What was the big data revolution really all about?

slide-4
SLIDE 4

Database

slide-5
SLIDE 5

A DECOUPLED STACK

Ingest/PubSub Workflow Scheduler Storage Dataflow Engine Query Optimizer API / Query Language

Big Data

slide-6
SLIDE 6

A DECOUPLED STACK

Ingest/PubSub Workflow Scheduler Storage Dataflow Engine Query Optimizer API / Query Language

SQL

GP ORCA

The Good: Agility

slide-7
SLIDE 7

A DECOUPLED STACK

SQL

GP ORCA

The Bad: Dis-integration.

slide-8
SLIDE 8

CRISIS: HOW DO WE SHARE INFORMATION?

slide-9
SLIDE 9

WHAT IS METADATA?

slide-10
SLIDE 10

WHAT IS METADATA?

  • Data about data
  • This used to be so simple!
  • But … schema on use
  • One of many changes

slide-11
SLIDE 11

OPPORTUNITY: A BIGGER CONTEXT

Don’t just fill a metadata-sized hole in the big data stack.

Lay the groundwork for rich data context.

slide-12
SLIDE 12

WHAT IS DATA CONTEXT?

All the information surrounding the use of data.

slide-13
SLIDE 13

The ABCs of Data Context

  • Application Context: Views, models, code
  • Behavioral Context: Data lineage & usage
  • Change Over Time: Version histories

Generated by—and useful to—many applications and components.

slide-14
SLIDE 14

Janet

I bet social media content can predict which customers might cancel their accounts!

ground

Hey Janet! We already paid for a full Gnip feed from Twitter; you can find it here. By the way: Sue used the following related table and script.

slide-15
SLIDE 15

Janet

I bet social media content can predict which customers might cancel their accounts!

ground

Hey Janet! This looks like Twitter JSON. Many people use this script to turn it into a table. Be careful: When people store outputs from this script, the following fields are often flagged by IT as PII. BTW, have you tried the sentiment analysis package?

slide-16
SLIDE 16

share

Sue

ground

Janet

It looks true! Tweets predict churn!

slide-17
SLIDE 17

TweetId  Text           Sentiment
47       “sad!”         negative
53       “awesome!”     positive
57       “go packers!”  neutral
64       “fleek!”       positive

TweetId  Text           neg  pos  neut
47       “sad!”         1
53       “awesome!”          1
57       “go packers!”            1
64       “fleek!”            1
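
The second table reads as a one-hot encoding of the Sentiment column in the first. A minimal Python sketch of that step, assuming the label set is fixed to negative/positive/neutral; the `one_hot` helper and `LABEL_TO_COLUMN` mapping are illustrative names, not taken from the talk, and zeros stand in for the blank cells on the slide:

```python
# Illustrative sketch only: the helper and mapping names are assumptions.
# Column names neg/pos/neut and the four tweets come from the slide.
LABEL_TO_COLUMN = {"negative": "neg", "positive": "pos", "neutral": "neut"}


def one_hot(rows):
    """Return one dict per tweet with a 1 in the column matching its label.

    A label that is not in LABEL_TO_COLUMN leaves every indicator at 0,
    so an upstream change of label vocabulary silently empties the features.
    """
    encoded = []
    for tweet_id, text, sentiment in rows:
        row = {"TweetId": tweet_id, "Text": text, "neg": 0, "pos": 0, "neut": 0}
        column = LABEL_TO_COLUMN.get(sentiment)
        if column is not None:
            row[column] = 1
        encoded.append(row)
    return encoded


tweets = [
    (47, "sad!", "negative"),
    (53, "awesome!", "positive"),
    (57, "go packers!", "neutral"),
    (64, "fleek!", "positive"),
]
features = one_hot(tweets)
```

Feeding the same helper a label such as "sadness" yields a row of zeros, which is the failure mode the next slide illustrates.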

ground

Sue

I wonder if Janet’s sentiment analysis will help with my discount targeting pipeline.

slide-18
SLIDE 18

TweetId  Text           neg  pos  neut
47       “sad!”
53       “awesome!”
57       “go packers!”
64       “fleek!”

TweetId  Text           Sentiment
47       “sad!”         sadness
53       “awesome!”     elation
57       “go packers!”  sports
64       “fleek!”       trendy

Sue

Uh oh, prediction accuracy metrics are down!

Time passes…

ground

FYI: Janet’s wrangling script changed!

Sue

Oh dear. I’d better call a meeting to introduce better governance on the sentiment labeler.

[Chart: Prediction Accuracy (axis 25 to 100) over time, 1/1/2017 00:00 to 1/2/2017 12:00]

VERSION HISTORY
12/31/2016 00:00 -800
hash: 6dda491064bcce14f558bf83867b8c247027c423
user: will

slide-19
SLIDE 19

WHAT DID CONTEXT ENABLE?

Figuring out which changes introduced the error.

VERSION HISTORY

Determining who made the change to help us resolve the issue.

user: will

Fueling our model accuracy monitor.

[Chart: model accuracy over time, 1/1/2017 00:00 to 1/2/2017 00:00]

Self-service catalog, wrangling and analytics.
Collective governance of data.
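
One way to picture "figuring out which changes introduced the error" is to join the version history with the accuracy monitor. The sketch below is an assumption-laden illustration: the accuracy samples, the threshold, and the earlier version record are made up, and only the 12/31/2016 entry (hash and user) comes from the slide.

```python
from datetime import datetime


def first_drop(samples, threshold):
    """Return the time of the first accuracy sample below `threshold`, or None."""
    for when, accuracy in sorted(samples):
        if accuracy < threshold:
            return when
    return None


def latest_version_before(versions, when):
    """Return the most recent version committed at or before `when`, or None."""
    candidates = [v for v in versions if v["time"] <= when]
    return max(candidates, key=lambda v: v["time"]) if candidates else None


versions = [
    # Made-up earlier version, included only so there is something to compare.
    {"time": datetime(2016, 12, 1), "hash": "0" * 40, "user": "example-user"},
    # Version shown on the slide.
    {
        "time": datetime(2016, 12, 31),
        "hash": "6dda491064bcce14f558bf83867b8c247027c423",
        "user": "will",
    },
]

# Made-up accuracy samples; the slide shows a chart but no readable values.
samples = [
    (datetime(2017, 1, 1, 0, 0), 90.0),
    (datetime(2017, 1, 1, 18, 0), 40.0),
]

drop_time = first_drop(samples, threshold=75.0)
suspect = latest_version_before(versions, drop_time)
```

Here `suspect["user"]` names who to ask about the change; a real monitor would also need lineage to confirm that the changed script actually feeds the model.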

slide-20
SLIDE 20

THE BIG CONTEXT

Where are the interesting technical challenges? All over! Our goal is not to solve all these challenges. It’s to provide an environment to enable solutions.

slide-21
SLIDE 21

ABOVEGROUND API TO APPLICATIONS

  • Parsing & Featurization
  • Catalog & Discovery
  • Wrangling
  • Analytics & Vis
  • Reference Data
  • Data Quality
  • Time Travel
  • Model Serving

COMMON GROUND

METAMODEL

UNDERGROUND API TO SERVICES

  • Scavenging and Ingestion
  • Search & Query
  • Scheduling & Workflow
  • Versioned Storage
  • ID & Auth

ground

slide-22
SLIDE 22

ABOVEGROUND API TO APPLICATIONS

  • Parsing & Featurization
  • Catalog & Discovery
  • Wrangling
  • Analytics & Vis
  • Reference Data
  • Data Quality
  • Time Machine
  • Model Serving

COMMON GROUND

METAMODEL

COMMON GROUND CONTEXT MODEL

UNDERGROUND API TO SERVICES

  • Scavenging and Ingestion
  • Search & Query
  • Scheduling & Workflow
  • Versioned Storage
  • ID & Auth

Pachyderm Chronos

slide-23
SLIDE 23

DESIGN REQUIREMENTS

  • Model-agnostic
  • Immutable
  • Scalable
  • Politically Neutral
slide-24
SLIDE 24

Postel’s Law

Be conservative in what you do, be liberal in what you accept from others

slide-25
SLIDE 25

COMMON GROUND

The metamodel

  • A. Model Graphs

slide-26
SLIDE 26

[Figure: two example model graphs.
RELATIONAL SCHEMA: Schema 1 contains Table 1 (Column 1 to Column c) through Table t (Column 1 to Column d), with a foreign key edge.
JSON DOCUMENT: a Root node, an Object 2 node, member nodes (k1: string, k2: number, k11: string, k12), and array elements 1 to 3.]
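
As a rough illustration of a model graph, a parsed JSON document can be flattened into nodes and parent-to-child edges. This is a sketch under assumptions, not Ground's metamodel API: `json_to_graph`, the integer node ids, and the label format are hypothetical, and leaf types are reported with Python type names (`str`, `int`) rather than the slide's `string` and `number`.

```python
import itertools


def json_to_graph(document):
    """Flatten a parsed JSON value into (nodes, edges).

    nodes maps node id -> label; edges is a list of (parent id, child id).
    """
    nodes, edges = {}, []
    counter = itertools.count()

    def visit(value, label):
        node_id = next(counter)
        if isinstance(value, dict):
            nodes[node_id] = label
            for key, child in value.items():
                edges.append((node_id, visit(child, f"member {key}")))
        elif isinstance(value, list):
            nodes[node_id] = label
            for index, child in enumerate(value, start=1):
                edges.append((node_id, visit(child, f"element {index}")))
        else:
            # Leaf: record the Python type name next to the label.
            nodes[node_id] = f"{label}: {type(value).__name__}"
        return node_id

    visit(document, "Root")
    return nodes, edges


nodes, edges = json_to_graph({"k1": "a", "k2": [1, 2, 3]})
```

A relational schema fits the same shape: schema, table, and column nodes joined by containment edges, plus an extra edge for each foreign key.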

slide-27
SLIDE 27

COMMON GROUND

The versioning model

  • A. Model Graphs
  • B. Version Graphs

slide-28
SLIDE 28

COMMON GROUND

The versioning model

  • A. Model Graphs
  • B. Version Graphs
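
A version graph can be pictured as an append-only DAG in which each version points at its parents. The sketch below is one possible shape, assuming content-derived ids; the `VersionGraph` class and the choice of SHA-1 are illustrative assumptions, not a description of Ground's implementation.

```python
import hashlib


class VersionGraph:
    """Append-only DAG of versions; existing versions are never modified."""

    def __init__(self):
        self._parents = {}  # version id -> tuple of parent ids

    def commit(self, content, parents=()):
        """Record a new version and return its content-derived id."""
        for parent in parents:
            if parent not in self._parents:
                raise KeyError(f"unknown parent version: {parent}")
        digest = hashlib.sha1()
        for parent in parents:
            digest.update(parent.encode())
        digest.update(content.encode())
        version_id = digest.hexdigest()
        # setdefault keeps the first record if the same version is committed twice.
        self._parents.setdefault(version_id, tuple(parents))
        return version_id

    def ancestors(self, version_id):
        """Return every version reachable by following parent edges."""
        seen, stack = set(), list(self._parents[version_id])
        while stack:
            current = stack.pop()
            if current not in seen:
                seen.add(current)
                stack.extend(self._parents[current])
        return seen


graph = VersionGraph()
v1 = graph.commit("wrangling script, first draft")
v2 = graph.commit("wrangling script, new sentiment labels", parents=[v1])
```

Because nothing is overwritten, `graph.ancestors(v2)` can always recover the state that preceded a suspicious change.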
slide-29
SLIDE 29

COMMON GROUND

The usage model

  • A. Model Graphs
  • B. Version Graphs
  • C. Lineage Graphs
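
Lineage graphs record which artifacts were derived from which. A minimal sketch of an impact query over such a graph, assuming lineage is stored as (source, derived) pairs; the `downstream` helper and the artifact names are hypothetical, loosely following the Janet and Sue example:

```python
from collections import defaultdict, deque


def downstream(lineage_edges, source):
    """Return everything derived, directly or transitively, from `source`."""
    children = defaultdict(list)
    for parent, child in lineage_edges:
        children[parent].append(child)

    impacted, queue = set(), deque([source])
    while queue:
        for child in children[queue.popleft()]:
            if child not in impacted:
                impacted.add(child)
                queue.append(child)
    return impacted


# Hypothetical artifact names for illustration only.
lineage = [
    ("twitter_feed", "wrangling_script_output"),
    ("wrangling_script_output", "sentiment_features"),
    ("sentiment_features", "churn_model"),
    ("sentiment_features", "discount_targeting"),
]
impacted = downstream(lineage, "wrangling_script_output")
```

With this in place, a change to the wrangling script can be traced forward to every model and pipeline that consumes its output.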
slide-30
SLIDE 30

SCALABLE, IMMUTABLE BACKEND

Longstanding open problem

Workloads?

  • Graph queries for metamodel traversal
  • Log analysis queries for usage

Room for improvement

  • Goal: compete with in-memory performance (“the McSherry baseline”)

Figure 8: Dwell time analysis. Figure 9: Impact analysis. Figure 10: PostgreSQL transitive closure variants.
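
For a sense of what a relational transitive closure query looks like, the sketch below runs a recursive common table expression from Python. It uses SQLite from the standard library as a stand-in; the talk's figures refer to PostgreSQL, and the `edge` table, its columns, and the sample rows are assumptions made for illustration.

```python
import sqlite3

# Recursive CTE: start from the direct successors of a node, then keep
# following edges. UNION (not UNION ALL) removes duplicates, so the
# recursion also terminates on cyclic graphs.
CLOSURE_QUERY = """
WITH RECURSIVE reach(node) AS (
    SELECT dst FROM edge WHERE src = ?
    UNION
    SELECT edge.dst FROM edge JOIN reach ON edge.src = reach.node
)
SELECT node FROM reach ORDER BY node
"""


def reachable_from(conn, start):
    """Return the sorted list of nodes reachable from `start`."""
    return [row[0] for row in conn.execute(CLOSURE_QUERY, (start,))]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edge (src TEXT, dst TEXT)")
conn.executemany(
    "INSERT INTO edge VALUES (?, ?)",
    [("a", "b"), ("b", "c"), ("c", "d"), ("x", "y")],
)
closure = reachable_from(conn, "a")
```

An in-memory traversal of the same edges is the kind of baseline such queries would be measured against.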

slide-31
SLIDE 31

NEUTRALITY

Reminder: There will be k competing solutions for:

  • Data wrangling
  • Data cataloging
  • Schema extraction
  • Feature extraction
  • Social network analysis
  • Etc.
  • This will consolidate somewhat, but only over time

Goal: foster the ecosystem

slide-32
SLIDE 32

NEUTRALITY

YOU

slide-33
SLIDE 33

MANY OPEN RESEARCH QUESTIONS

Underground

  • Workloads
  • Common Ground representations
  • No-overwrite versioned DB
  • Time travel queries: point and trend
  • Graph queries + log analysis
  • Consistency

Aboveground

  • Content extraction
  • Analytic user exhaust
  • Socio-technical networks
  • Collective governance
  • Reproducibility
  • Lifecycle of systems that learn

slide-34
SLIDE 34

CURRENT STATUS

Alpha Release

  • Integrated with LinkedIn Gobblin, Kafka, Hive Metastore, GitHub
  • All components have Docker images on DockerHub

  • We’d love feedback!

www.ground-context.org

ground