Ground: A Data Context Service
Joe Hellerstein, Vikram Sreekanti, Joey Gonzalez, et al. CIDR 2017 https://github.com/ground-context/ground
Open Source Big Data Community Health
Long-term Data Management
Data Analysis
Data Wrangling
What was the big data revolution really all about?
Database
A DECOUPLED STACK
Ingest/PubSub, Workflow Scheduler, Storage, Dataflow Engine, Query Optimizer, API / Query Language
Big Data
SQL
GP ORCA
The Good: Agility
The Bad: Dis-integration.
CRISIS: HOW DO WE SHARE INFORMATION?
WHAT IS METADATA?
Lay the groundwork for rich data context.
OPPORTUNITY: A BIGGER CONTEXT
Don’t just fill a metadata-sized hole in the big data stack.
WHAT IS DATA CONTEXT?
All the information surrounding the use of data.
The ABCs of Data Context
Application Context: Views, models, code
Behavioral Context: Data lineage & usage
Change Over Time: Version histories
Generated by—and useful to—many applications and components.
Janet
I bet social media content can predict which customers might cancel their accounts!
Hey Janet! We already paid for a full Gnip feed from Twitter. You can find it here.
By the way: Sue used the following related table and script.
Janet
I bet social media content can predict which customers might cancel their accounts!
Hey Janet! This looks like Twitter JSON. Many people use this script to turn it into a table.
Be careful: When people store outputs from this script, the following fields are often flagged by IT as PII.
BTW, have you tried the sentiment analysis package?
share
Sue
Janet
It looks true! Tweets predict churn!
TweetId  Text           Sentiment
47       “sad!”         negative
53       “awesome!”     positive
57       “go packers!”  neutral
64       “fleek!”       positive

TweetId  Text           neg  pos  neut
47       “sad!”         1
53       “awesome!”          1
57       “go packers!”            1
64       “fleek!”            1
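The step from the first table to the second can be sketched as a small pivot from one label column to indicator columns. This is a minimal illustration, not Janet's actual script: the function name `one_hot` and the mapping `LABEL_TO_COLUMN` are assumptions made for the example.

```python
# Hypothetical sketch of a wrangling step like the one on the slide:
# pivot a categorical Sentiment column into neg/pos/neut indicator columns.
COLUMNS = ("neg", "pos", "neut")

# Assumed mapping from labeler output to indicator column names.
LABEL_TO_COLUMN = {"negative": "neg", "positive": "pos", "neutral": "neut"}


def one_hot(rows):
    """Return one wide row per tweet; unknown labels leave every indicator empty."""
    wide_rows = []
    for row in rows:
        wide = {"TweetId": row["TweetId"], "Text": row["Text"]}
        for column in COLUMNS:
            wide[column] = None  # empty cell by default
        column = LABEL_TO_COLUMN.get(row["Sentiment"])
        if column is not None:
            wide[column] = 1
        wide_rows.append(wide)
    return wide_rows


tweets = [
    {"TweetId": 47, "Text": "sad!", "Sentiment": "negative"},
    {"TweetId": 53, "Text": "awesome!", "Sentiment": "positive"},
    {"TweetId": 57, "Text": "go packers!", "Sentiment": "neutral"},
    {"TweetId": 64, "Text": "fleek!", "Sentiment": "positive"},
]
wide_rows = one_hot(tweets)
```

Note that a label outside the assumed mapping, such as "sadness", silently produces a row with all three indicators empty. A downstream model would see no error, only degraded input.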
Sue
I wonder if Janet’s sentiment analysis will help with my discount targeting pipeline.
TweetId  Text           neg  pos  neut
47       “sad!”
53       “awesome!”
57       “go packers!”
64       “fleek!”

TweetId  Text           Sentiment
47       “sad!”         sadness
53       “awesome!”     elation
57       “go packers!”  sports
64       “fleek!”       trendy
Sue
Uh oh, prediction accuracy metrics are down!
Time passes…
Oh dear. I’d better call a meeting to introduce better governance on the sentiment labeler.
FYI: Janet’s wrangling script changed!
[Chart: Prediction Accuracy, y-axis 25 to 100, x-axis 1/1/2017 00:00 to 1/2/17 12:00]
VERSION HISTORY
12/31/2016 00:00 -800
hash: 6dda491064bcce14f558bf83867b8c247027c423
user: will
WHAT DID CONTEXT ENABLE?
Figuring out which changes introduced the error.
VERSION HISTORY
Determining who made the change to help us resolve the issue.
user: will
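One way to picture the first two points is a lookup that joins an accuracy time series against a version history. The sketch below is an assumption-laden illustration, not Ground's API: the function names, the earlier history entry, the accuracy readings, and the threshold are all hypothetical. Only the 12/31/2016 entry (hash and user) comes from the slide, and its timezone offset is ignored here.

```python
from datetime import datetime


def first_drop(series, threshold):
    """Return the timestamp of the first reading below threshold, or None."""
    for timestamp, value in series:
        if value < threshold:
            return timestamp
    return None


def last_change_before(history, when):
    """Return the most recent version recorded at or before `when`, or None."""
    candidates = [version for version in history if version["time"] <= when]
    if not candidates:
        return None
    return max(candidates, key=lambda version: version["time"])


history = [
    # Hypothetical earlier entry, added only so the lookup has a choice to make.
    {"time": datetime(2016, 12, 1, 0, 0), "hash": "<hypothetical>", "user": "<hypothetical>"},
    # Entry shown on the slide (naive datetime; the slide's offset is dropped).
    {"time": datetime(2016, 12, 31, 0, 0),
     "hash": "6dda491064bcce14f558bf83867b8c247027c423", "user": "will"},
]

# Hypothetical accuracy readings (percent).
accuracy = [
    (datetime(2016, 12, 30, 12, 0), 90.0),
    (datetime(2017, 1, 1, 0, 0), 55.0),
    (datetime(2017, 1, 1, 18, 0), 52.0),
]

drop = first_drop(accuracy, threshold=75.0)
culprit = last_change_before(history, drop)
```

Here `culprit` identifies both the suspect change (by hash) and the person to contact (by user), which is the kind of question version context is meant to answer.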
Fueling our model accuracy monitor.
[Chart: accuracy monitor, y-axis 25 to 100, x-axis 1/1/2017 00:00 to 1/2/17 00:00]
Self-service catalog, wrangling and analytics. Collective governance of data.
THE BIG CONTEXT
Where are the interesting technical challenges? All over! Our goal is not to solve all these challenges. It’s to provide an environment to enable solutions.
ABOVEGROUND API TO APPLICATIONS
UNDERGROUND API TO SERVICES
METAMODEL
COMMON GROUND
Aboveground applications: Parsing & Featurization, Catalog & Discovery, Wrangling, Analytics & Vis, Reference Data, Data Quality, Time Travel, Model Serving
Underground services: Scavenging and Ingestion, Search & Query, Scheduling & Workflow, Versioned Storage, ID & Auth
COMMON GROUND CONTEXT MODEL
Pachyderm Chronos
Parsing & Featurization, Catalog & Discovery, Wrangling, Analytics & Vis, Reference Data, Data Quality, Time Machine, Model Serving
ABOVEGROUND API TO APPLICATIONS
UNDERGROUND API TO SERVICES
METAMODEL
COMMON GROUND
DESIGN REQUIREMENTS
Postel’s Law
Be conservative in what you do, be liberal in what you accept from others
A: Model Graphs
COMMON GROUND
The metamodel
JSON DOCUMENT
[Figure: a JSON document drawn as a graph. Node labels: Root; Object 2; members k1, k1: string, k2, k2: number, k11: string, k12; two arrays with elements 1 to 3.]
RELATIONAL SCHEMA
[Figure: a relational schema drawn as a graph. Node labels: Schema 1; Table 1 with Column 1 to Column c; Table t with Column 1 to Column d; one edge labeled foreign key.]
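To make the model-graph idea concrete, the sketch below flattens a JSON value into nodes and parent-to-child edges, using labels in the style of the figure ("Root", "member k1", "element 1"). It is an illustration under stated assumptions: the function `json_to_graph`, the node-label format, and the sample document are hypothetical, and Ground's own metamodel representation may differ.

```python
from itertools import count


def kind_of(value):
    """Name the JSON type of a Python value."""
    if isinstance(value, dict):
        return "object"
    if isinstance(value, list):
        return "array"
    if isinstance(value, bool):  # check bool before int: bool is an int subclass
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    if isinstance(value, str):
        return "string"
    return "null"


def json_to_graph(document):
    """Return (nodes, edges): nodes maps id -> label, edges are (parent, child)."""
    nodes = {}
    edges = []
    ids = count()

    def visit(label, value, parent):
        node_id = next(ids)
        nodes[node_id] = f"{label}: {kind_of(value)}"
        if parent is not None:
            edges.append((parent, node_id))
        if isinstance(value, dict):
            for key, child in value.items():
                visit(f"member {key}", child, node_id)
        elif isinstance(value, list):
            for index, child in enumerate(value, start=1):
                visit(f"element {index}", child, node_id)

    visit("Root", document, None)
    return nodes, edges


# Hypothetical sample document, not the one in the figure.
nodes, edges = json_to_graph({"k1": "a", "k2": [1, 2, 3]})
```

The same node-and-edge shape could hold a relational schema (schema, table, and column nodes, plus a foreign key edge), which is the point of using one graph model for both.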
COMMON GROUND
The versioning model
COMMON GROUND
The usage model
SCALABLE, IMMUTABLE BACKEND
Longstanding open problem
Workloads?
Room for improvement
(“the McSherry baseline”)
Figure 8: Dwell time analysis.
Figure 9: Impact analysis.
Figure 10: PostgreSQL transitive closure variants.
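The captions name impact analysis and transitive closure without showing the queries. One plausible reading is: given lineage edges, find everything downstream of a changed item. The sketch below does that with a breadth-first traversal in memory. It is not the benchmarked PostgreSQL implementation, and the function name and lineage entries are hypothetical, loosely modeled on the Janet and Sue story.

```python
from collections import deque


def impacted(lineage, start):
    """Return every item reachable downstream of `start`, excluding `start`."""
    seen = set()
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for downstream in lineage.get(current, ()):
            if downstream != start and downstream not in seen:
                seen.add(downstream)
                queue.append(downstream)
    return seen


# Hypothetical lineage: each key feeds the items in its list.
lineage = {
    "sentiment_labeler": ["tweet_sentiment"],
    "tweet_sentiment": ["wrangled_tweets"],
    "wrangled_tweets": ["churn_model", "discount_targeting"],
}

affected = impacted(lineage, "sentiment_labeler")
```

With this sample lineage, a change to the labeler reaches both the churn model and the discount targeting pipeline, even though neither reads the labeler's output directly.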
NEUTRALITY
Reminder: There will be k competing solutions for:
Goal: foster the ecosystem
NEUTRALITY
YOU
MANY OPEN RESEARCH QUESTIONS
Underground
representations
and trend Graph queries + log analysis
Aboveground
learn
CURRENT STATUS
Alpha Release
Kafka, Hive Metastore, GitHub
images on DockerHub
www.ground-context.org