
SLIDE 1

Feature Store: the missing data layer in ML pipelines? [1]

Spotify ML Guild Fika
Kim Hammar
kim@logicalclocks.com
February 26, 2019

[1] Kim Hammar and Jim Dowling. Feature Store: the missing data layer in ML pipelines? https://www.logicalclocks.com/feature-store/. 2018.

Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 1 / 29

SLIDE 2

Data → Model ϕ(x) → Predictions

SLIDE 3

Data → Model ϕ(x) → Predictions

[Figure: the model surrounded by Distributed Training, Data Validation, Feature Engineering, Data Collection, Hardware Management, HyperParameter Tuning, Model Serving, Pipeline Management, A/B Testing, and Monitoring] [2]

[2] Image inspired by Sculley et al. (Google), Hidden Technical Debt in Machine Learning Systems.

SLIDE 4

Outline

1 Hopsworks: Quick background of the platform
2 What is a Feature Store
3 Why You Need a Feature Store, Things to Consider:
  • How to encourage feature reusage?
  • How to store large-scale datasets for deep learning?
  • How to serve features for inference?
4 How to Build a Feature Store (Hopsworks Feature Store Case Study)
5 Demo

SLIDE 5

[Figure: the Hopsworks platform. HopsML stages: Data Ingestion, Data Prep, Feature Store, Training, Serving, with Orchestration. Interfaces: REST API, Kafka, TF Serving. Resources: CPUs, GPUs. Built on HopsYARN (fork of YARN) and HopsFS (fork of HDFS).]

SLIDE 9

$$
\begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix},\;
\begin{bmatrix} x_{1,1} & \cdots & x_{1,n} \\ \vdots & \ddots & \vdots \\ x_{n,1} & \cdots & x_{n,n} \end{bmatrix}
\;\longrightarrow\; \phi(x) \;\longrightarrow\; \hat{y}
$$

¯\_(ツ)_/¯

“Data is the hardest part of ML and the most important piece to get right. Modelers spend most of their time selecting and transforming features at training time and then building the pipelines to deliver those features to production models.”

  • Uber [3]

[3] Jeremy Hermann and Mike Del Balso. Scaling Machine Learning at Uber with Michelangelo. https://eng.uber.com/scaling-michelangelo/. 2018.

SLIDE 10

$$
\begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix},\;
\begin{bmatrix} x_{1,1} & \cdots & x_{1,n} \\ \vdots & \ddots & \vdots \\ x_{n,1} & \cdots & x_{n,n} \end{bmatrix}
\;\longrightarrow\; \phi(x) \;\longrightarrow\; \hat{y}
$$

Feature Store

“Data is the hardest part of ML and the most important piece to get right. Modelers spend most of their time selecting and transforming features at training time and then building the pipelines to deliver those features to production models.”

  • Uber [4]

[4] Jeremy Hermann and Mike Del Balso. Scaling Machine Learning at Uber with Michelangelo. https://eng.uber.com/scaling-michelangelo/. 2018.

SLIDE 11

Disentangle ML Pipelines with a Feature Store

[Figure: Raw/Structured Data → Feature Engineering → Feature Store → Training → Models]

A feature store is a central vault for storing documented, curated, and access-controlled features.

The feature store is the interface between data engineering and data model development.

SLIDE 12

[Figure: Dataset 1, Dataset 2, ..., Dataset n feed Feature Engineering, which feeds several models (drawn as a neural network, a decision tree, and an X/Y plot).]

SLIDE 17

[Figure: Dataset 1, Dataset 2, ..., Dataset n feed a shared Feature Store, labeled with Backfilling, Analysis, Versioning, and Documentation, which feeds several models (drawn as a neural network, a decision tree, and an X/Y plot).]

SLIDE 18

What is a Feature?

A feature is a measurable property of some data sample.

A feature could be:
  • An aggregate value (min, max, mean, sum)
  • A raw value (a pixel, a word from a piece of text)
  • A value from a database table (the age of a customer)
  • A derived representation, e.g. an embedding or a cluster

Features are the fuel for AI systems:

[Figure: Features $[x_1, \dots, x_n]^\top$ → Model $\theta$ → Prediction $\hat{y}$ → Loss $L(y, \hat{y})$ → Gradient $\nabla_\theta L(y, \hat{y})$]
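As a small illustration of the first kind of feature listed above, the sketch below computes the four aggregates (min, max, mean, sum) over groups of raw values. The attendance numbers and team names are made up for the example, and the helper name `aggregate_features` is hypothetical; a real pipeline would typically run such aggregations in Spark or SQL.

```python
from statistics import mean


def aggregate_features(values):
    """Compute the aggregate features min, max, mean, and sum for one group of raw values."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "sum": sum(values),
    }


# Hypothetical raw data: match attendance per team
attendance = {"team_a": [100, 200, 300], "team_b": [50, 150]}

# One feature row per team, keyed by team name
features = {team: aggregate_features(values) for team, values in attendance.items()}
```

Each resulting row is the kind of record a feature engineering job could publish to a feature store.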

SLIDE 19

[Figure: a text feature pipeline without a feature store. Raw text → lower-case & remove noise → tokenization → lemmatization → words.txt → group by post → words_post.csv → word2vec, TF-IDF, LDA, ontology-matching → annotation with weak supervision → Model]
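A few of the steps in this pipeline can be sketched in plain Python: lower-casing, noise removal, tokenization, and TF-IDF. This is a minimal illustration using only the standard library, not the code behind the slide; lemmatization, word2vec, and LDA are left out, and the example posts are invented.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text, replace non-letter noise with spaces, and split on whitespace."""
    cleaned = re.sub(r"[^a-z\s]", " ", text.lower())
    return cleaned.split()


def tf_idf(posts):
    """Return one {word: score} dict per post.

    tf  = count of the word in the post / number of tokens in the post
    idf = log(number of posts / number of posts containing the word)
    """
    tokenized = [tokenize(post) for post in posts]
    n_posts = len(tokenized)
    # Document frequency: in how many posts does each word appear?
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            word: (count / len(tokens)) * math.log(n_posts / doc_freq[word])
            for word, count in counts.items()
        })
    return scores


# Hypothetical input posts
posts = ["Great concert tonight!!!", "Great football match, great crowd."]
features = tf_idf(posts)
```

A word that occurs in every post (here "great") gets a score of zero, which is the usual TF-IDF behaviour for uninformative terms.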

SLIDE 20

[Figure: the same pipeline with a feature store. Raw text → Feature Store (TF-IDF, word2vec, LDA, weak annotation, normalization) → Model]

SLIDE 21

How to Encourage Feature Reusage?

SLIDE 22

Feature Marketplace

[Figure: Publish Features, Search Features, Download Features]

SLIDE 23

Feature Store API Service

    from hops import featurestore

    features_df = featurestore.get_features([
        "average_attendance",
        "average_player_age"
    ])

Figure: Feature Store API Service. ML pipelines include features through the Feature Store API Service; the diagram also shows Feature Metadata (Feature Relationships, Feature Groups) and Shared Storage.

SLIDE 24

How to Store Datasets for Deep Learning?

SLIDE 25

How to Store Datasets for Deep Learning?

  • Should be framework agnostic
  • Needs to be able to store tensor datasets
  • Should support sharding for distributed training
  • Advanced features: row-predicate filtering, SQL interface, columnar selection
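To make three of these requirements concrete, here is a minimal in-memory sketch of sharding, row-predicate filtering, and columnar selection. The function names `shard` and `select` and the toy dataset are assumptions for illustration; a real dataset format would push these operations down to the storage layer instead of filtering Python lists.

```python
def shard(rows, num_workers, worker_index):
    """Return the rows assigned to one worker using round-robin sharding."""
    return [row for i, row in enumerate(rows) if i % num_workers == worker_index]


def select(rows, columns, predicate=lambda row: True):
    """Keep only the rows matching the predicate, and only the requested columns."""
    return [{c: row[c] for c in columns} for row in rows if predicate(row)]


# Hypothetical dataset with three columns
dataset = [{"id": i, "age": 20 + i, "label": i % 2} for i in range(6)]

# Sharding: worker 0 of 2 sees every second row
worker_0 = shard(dataset, num_workers=2, worker_index=0)

# Row-predicate filtering plus columnar selection
over_22 = select(dataset, ["id", "age"], predicate=lambda r: r["age"] > 22)
```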

SLIDE 26

How to Store Datasets for Deep Learning?

  • HDF5
  • TFRecords

SLIDE 27

How to Store Datasets for Deep Learning?

  • Petastorm is a dataset format designed for deep learning
  • Petastorm stores data as Parquet files with extra metadata to handle multi-dimensional tensors
  • Petastorm contains readers for popular machine learning frameworks such as SparkML, TensorFlow, and PyTorch

  • Petastorm [5]
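The idea of storing tensors in a flat format "with extra metadata" can be illustrated without Petastorm itself. The sketch below is not Petastorm's actual encoding or API; it is a standard-library toy that flattens a nested list to bytes and keeps the shape alongside, so the tensor can be rebuilt on read. All function names are hypothetical.

```python
import struct


def _flatten(nested):
    """Yield the scalar values of a nested list in row-major order."""
    for item in nested:
        if isinstance(item, list):
            yield from _flatten(item)
        else:
            yield item


def encode_tensor(nested, shape):
    """Store a tensor as flat little-endian doubles plus its shape as metadata."""
    flat = list(_flatten(nested))
    return {"shape": tuple(shape), "data": struct.pack(f"<{len(flat)}d", *flat)}


def decode_tensor(record):
    """Rebuild the nested list from the flat bytes using the stored shape."""
    count = len(record["data"]) // 8  # 8 bytes per double
    flat = list(struct.unpack(f"<{count}d", record["data"]))

    def build(values, shape):
        if len(shape) == 1:
            return values
        step = len(values) // shape[0]
        return [build(values[i * step:(i + 1) * step], shape[1:]) for i in range(shape[0])]

    return build(flat, record["shape"])


# A hypothetical 2 x 2 x 2 tensor
image = [[[0.0, 0.5], [1.0, 0.25]], [[0.75, 0.1], [0.2, 0.3]]]
record = encode_tensor(image, shape=(2, 2, 2))
restored = decode_tensor(record)
```

Without the stored shape, the flat bytes alone could not distinguish a 2 x 2 x 2 tensor from a 4 x 2 or 8 x 1 one, which is why a columnar format needs this extra metadata.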

[5] Robbie Gruener, Owen Cheng, and Yevgeni Litvin. Introducing Petastorm: Uber ATG’s Data Access Library for Deep Learning. https://eng.uber.com/petastorm/. 2018.

SLIDE 28

How to Serve Features for Inference?

SLIDE 29

Delivering Features for Training and Serving is Different

  • Serving can require real-time features
  • Ideally we want consistency between real-time features and batch features used for training
  • Complex engineering problem

[Figure: Inference Request, Feature Store, Real-Time Features, Model, Prediction]
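One common way to approach the consistency point above is to define each feature once and call that definition from both the batch path and the real-time path. The sketch below shows the pattern with invented customer data; the function names are hypothetical and this is not how Hopsworks implements serving.

```python
from datetime import date


def customer_features(birth_year, purchases, today):
    """Single feature definition shared by the batch path and the real-time path."""
    return {
        "age": today.year - birth_year,
        "avg_purchase": sum(purchases) / len(purchases) if purchases else 0.0,
    }


def batch_features(customers, today):
    """Batch path: compute features for every stored customer (used for training)."""
    return {
        customer_id: customer_features(c["birth_year"], c["purchases"], today)
        for customer_id, c in customers.items()
    }


def online_features(request, today):
    """Real-time path: compute features for a single inference request (used for serving)."""
    return customer_features(request["birth_year"], request["purchases"], today)


today = date(2019, 2, 26)
customers = {"c1": {"birth_year": 1990, "purchases": [10.0, 30.0]}}

training_row = batch_features(customers, today)["c1"]
serving_row = online_features({"birth_year": 1990, "purchases": [10.0, 30.0]}, today)
```

Because both paths call the same function, the same input yields identical features at training and serving time.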

SLIDE 30

How to Implement a (batch) Feature Store?

SLIDE 31

The Components of a Feature Store

  • The Storage Layer: For storing feature data in the feature store
  • The Metadata Layer: For storing feature metadata (versioning, feature analysis, documentation, jobs)
  • The Feature Engineering Jobs: For computing features
  • The Feature Registry: A user interface to share and discover features
  • The Feature Store API: For writing/reading to/from the feature store

[Figure: Feature Storage, Feature Metadata, Jobs, Feature Registry, API]
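The relationship between these components can be sketched with a toy in-memory class: one dict plays the storage layer, one the metadata layer, a keyword search stands in for the registry, and two methods form the API. The class `InMemoryFeatureStore` is hypothetical; only the method name `insert_into_featuregroup` is borrowed from the API shown elsewhere in these slides, and its signature here is an assumption.

```python
class InMemoryFeatureStore:
    """Toy feature store: a storage layer, a metadata layer, and a read/write API."""

    def __init__(self):
        self._storage = {}   # (group name, version) -> list of row dicts
        self._metadata = {}  # (group name, version) -> description and feature names

    def insert_into_featuregroup(self, rows, name, version=1, description=""):
        """API (write): store the rows and record metadata about the feature group."""
        rows = list(rows)
        self._storage[(name, version)] = rows
        self._metadata[(name, version)] = {
            "description": description,
            "features": sorted(rows[0].keys()) if rows else [],
        }

    def search(self, keyword):
        """Registry: find feature groups whose description mentions the keyword."""
        return [
            key for key, meta in self._metadata.items()
            if keyword.lower() in meta["description"].lower()
        ]

    def get_featuregroup(self, name, version=1):
        """API (read): return all rows of one feature group version."""
        return self._storage[(name, version)]


# A feature engineering job would produce rows like this one (made-up values)
store = InMemoryFeatureStore()
store.insert_into_featuregroup(
    [{"team_id": 1, "average_attendance": 4200.0}],
    "teams_features",
    description="Attendance aggregates per team",
)
found = store.search("attendance")
rows = store.get_featuregroup("teams_features")
```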

SLIDE 32

Feature Storage

[Figure: Feature Computation reads Raw/Structured Data from the Data Lake and writes Feature Group 1 to Feature Group 4 into project_featurestore.db; the Hive Metastore is linked through foreign keys.]

SLIDE 33

Feature Metadata

[Figure: Feature Computation reads Raw/Structured Data from the Data Lake and writes Feature Group 1 to Feature Group 4 into project_featurestore.db; the Hive Metastore and the Featurestore Metadata are linked through foreign keys.]

SLIDE 34

Feature Registry and API

[Figure: Feature Computation reads Raw/Structured Data from the Data Lake and writes Feature Group 1 to Feature Group 4 into project_featurestore.db; the Hive Metastore and the Featurestore Metadata are linked through foreign keys. On top sit the Hopsworks Feature registry (UI), the REST API, and the Program APIs.]

SLIDE 35

Demo-Setting

[Figure: Raw/Structured Data (Data Lake) → Feature Computation → Feature Store (Curated Features) → Model]

SLIDE 36

Summary

  • Machine learning comes with a high technical cost
  • Machine learning pipelines need proper data management
  • A feature store is a place to store curated and documented features
  • The feature store serves as an interface between feature engineering and model development; it can help disentangle complex ML pipelines
  • Hopsworks [6] provides the world’s first open-source feature store

@hopshadoop www.hops.io
@logicalclocks www.logicalclocks.com

We are open source [7]:
https://github.com/logicalclocks/hopsworks
https://github.com/hopshadoop/hops

[6] Jim Dowling. Introducing Hopsworks. https://www.logicalclocks.com/introducing-hopsworks/. 2018.

[7] Thanks to Logical Clocks Team: Jim Dowling, Seif Haridi, Theo Kakantousis, Fabio Buso, Gautier Berthou, Ermias Gebremeskel, Mahmoud Ismail, Salman Niazi, Antonios Kouzoupis, Robin Andersson, and Alex Ormenisan.

SLIDE 37

References

  • Hopsworks’ feature store [8] (the only open-source one!)
  • Uber’s feature store [9]
  • Airbnb’s feature store [10]
  • Comcast’s feature store [11]
  • GO-JEK’s feature store [12]
  • HopsML [13]
  • Hopsworks [14]

[8] Kim Hammar and Jim Dowling. Feature Store: the missing data layer in ML pipelines? https://www.logicalclocks.com/feature-store/. 2018.

[9] Li Erran Li et al. “Scaling Machine Learning as a Service”. In: Proceedings of The 3rd International Conference on Predictive Applications and APIs. Ed. by Claire Hardgrove et al. Vol. 67. Proceedings of Machine Learning Research. Microsoft NERD, Boston, USA: PMLR, 2017, pp. 14–29. URL: http://proceedings.mlr.press/v67/li17a.html.

[10] Nikhil Simha and Varant Zanoyan. Zipline: Airbnb’s Machine Learning Data Management Platform. https://databricks.com/session/zipline-airbnbs-machine-learning-data-management-platform. 2018.

[11] Nabeel Sarwar. Operationalizing Machine Learning—Managing Provenance from Raw Data to Predictions. https://databricks.com/session/operationalizing-machine-learning-managing-provenance-from-raw-data-to-predictions. 2018.

[12] Willem Pienaar. Building a Feature Platform to Scale Machine Learning | DataEngConf BCN ’18. https://www.youtube.com/watch?v=0iCXY6VnpCc. 2018.

[13] Logical Clocks AB. HopsML: Python-First ML Pipelines. https://hops.readthedocs.io/en/latest/hopsml/hopsML.html. 2018.

[14] Jim Dowling. Introducing Hopsworks. https://www.logicalclocks.com/introducing-hopsworks/. 2018.

SLIDE 38

Backup Slides

SLIDE 39

Modeling Data in the Feature Store

A feature group is a logical grouping of features
  • Typically from the same input dataset and computed with the same job

A training dataset is a set of features suitable for a prediction task
  • Features in a training dataset are often from several feature groups
  • E.g. features on customers, features on user activities, etc.

[Figure: Training Datasets d, Feature groups g, Features f. Feature group g1 holds f1 to f5, g2 holds f6 to f10, and g3 holds f11 and f12; training datasets d1 and d2 each draw features from several feature groups.]
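The relationship between feature groups and training datasets can be sketched as a join on a shared key. The example below is a minimal illustration with made-up customer data; the function `join_feature_groups` and the group names are hypothetical, and a real feature store would perform this join in Spark or SQL.

```python
def join_feature_groups(groups, key, features):
    """Build training rows by inner-joining feature groups on a shared key."""
    # Only keys present in every feature group survive an inner join
    shared_keys = set.intersection(
        *({row[key] for row in rows} for rows in groups.values())
    )
    merged = {k: {} for k in shared_keys}
    for rows in groups.values():
        for row in rows:
            if row[key] in merged:
                merged[row[key]].update(row)
    # Keep the join key plus the requested features, in a stable order
    return [{f: merged[k][f] for f in [key] + features} for k in sorted(merged)]


# Two hypothetical feature groups, each computed by its own job
groups = {
    "customers": [{"customer_id": 1, "age": 30}, {"customer_id": 2, "age": 41}],
    "activity": [{"customer_id": 1, "clicks": 12}, {"customer_id": 3, "clicks": 5}],
}

# A training dataset that draws one feature from each group
training_dataset = join_feature_groups(groups, "customer_id", ["age", "clicks"])
```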

SLIDE 40

Hopsworks Feature Store API Service

[Figure: Client Interface (SQL, hops-util) → Feature Store Service (Query Planner, Feature Store Metadata, Feature Store Data in Hive on HopsFS) → Output (Dataframe With Features)]
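What a query planner of this kind might do can be sketched as follows: look up which feature group owns each requested feature in the metadata, then emit a SQL join over those groups. This is an assumption about the general technique, not the Hopsworks implementation; `plan_query`, the feature group names, and the join key are hypothetical, while the two feature names come from the API example in these slides.

```python
def plan_query(features, metadata, join_key):
    """Map requested features to their feature groups and emit a SQL join."""
    owners = {}
    for feature in features:
        matches = [group for group, columns in metadata.items() if feature in columns]
        if len(matches) != 1:
            raise ValueError(f"feature {feature!r} found in {len(matches)} feature groups")
        owners[feature] = matches[0]

    # Unique feature groups, in order of first use
    groups = list(dict.fromkeys(owners.values()))
    select = ", ".join(f"{owners[f]}.{f}" for f in features)
    sql = f"SELECT {select} FROM {groups[0]}"
    for group in groups[1:]:
        sql += f" JOIN {group} ON {groups[0]}.{join_key} = {group}.{join_key}"
    return sql


# Hypothetical metadata: feature group name -> column names
metadata = {
    "teams_features": ["team_id", "average_attendance"],
    "players_features": ["team_id", "average_player_age"],
}

sql = plan_query(["average_attendance", "average_player_age"], metadata, "team_id")
```

The caller only names features; finding the tables and writing the join is left to the planner, which is what lets the client interface stay as short as a single `get_features` call.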

SLIDE 41

Training Pipeline in HopsML

1 Create job/notebook to compute features and publish to the feature store
2 Create job/notebook to read features/labels and save to a training dataset
3 Read the training dataset into your model for training

[Figure: Raw data (HopsFS Data Lake) → Feature computation → Feature store features (Hive Feature store) → Basis features → Training dataset → Model, with hops-util and hops-util-py used to write to and read from the feature store]

SLIDE 42

Hopsworks Feature Store API

Reading from the Feature Store:

    from hops import featurestore

    features_df = featurestore.get_features([
        "average_attendance",
        "average_player_age"
    ])

Writing to the Feature Store:

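    from hops import featurestore
    from pyspark.sql.functions import col

    # `spark` is the active SparkSession; `filename` points at the raw Parquet data
    raw_data = spark.read.parquet(filename)
    # Square every column (assumes all columns are numeric); `**` is the power operator, `^` would be XOR
    pol_features = raw_data.select([(col(c) ** 2).alias(c) for c in raw_data.columns])
    featurestore.insert_into_featuregroup(pol_features, "pol_featuregroup")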
