Petastorm: A Light-Weight Approach to Building ML Pipelines @Uber - PowerPoint PPT Presentation



SLIDE 1

Petastorm

Petastorm: A Light-Weight Approach to Building ML Pipelines @Uber Yevgeni Litvin (yevgeni@uber.com), Uber ATG

SLIDE 2

Deep learning for self-driving vehicles

  • Complex upstream APIs: Raw AV data API, Maps API, Labels API.
  • Cluster processing.
  • Huge row sizes (multi-MB).
  • Huge datasets (tens+ TB).
  • Learning curve.
  • Many datasets and formats: HDF5, TFRecords, PNG files.

SLIDE 3

Consolidating datasets

Research engineers (typically) don’t do data extraction; they train directly from the well-known dataset (backed by the Raw AV data API, Labels API, and Maps API).

SLIDE 4

Uber ATG Mission

Introduce self-driving technology to the Uber network in order to make transporting people and goods safer, more efficient, and more affordable around the world.

SLIDE 5

About myself...

Yevgeni Litvin. Works on the data platform and onboard integration.

SLIDE 6

Our talk today

  • Enabling the “One Dataset” approach
  • File formats
  • Petastorm as an enabling tool

SLIDE 7

One dataset

One dataset used by multiple research projects.

  • Easy to compare models.
  • Easy to reproduce training.
  • Faster research engineer ramp-up.
  • Managed by the ML infra team.
  • Superset of the data a single project may require.
  • No model-specific preprocessing.
  • Efficient data access.
  • TF/PyTorch/other framework native access.
SLIDE 8

Apache Parquet

  • Efficient column-subset reads.
  • Atomic read unit: one column from a row group (a chunk).
  • Random access to a row group.
  • Natively supported by Apache Spark, Hive, and other big-data tools.
  • No tensor support.
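The read pattern above can be illustrated with a toy columnar store. This is a sketch of the idea, not Parquet internals: each row group holds one contiguous chunk per column, so a reader can fetch a column subset from a single row group without touching the rest of the file.

```python
# Toy columnar layout: a list of row groups, each a dict of column -> chunk.
row_groups = [
    {'id': [0, 1, 2], 'label': ['car', 'car', 'bike'], 'image': ['...', '...', '...']},
    {'id': [3, 4, 5], 'label': ['car', 'ped', 'car'], 'image': ['...', '...', '...']},
]

def read_columns(row_group_index, columns):
    """Atomic read unit: one column chunk from one row group."""
    group = row_groups[row_group_index]
    return {name: group[name] for name in columns}

# Random access to row group 1, reading only two of the three columns.
subset = read_columns(1, ['id', 'label'])
```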

SLIDE 9

Petastorm

  • Scalable
  • Native TensorFlow, PyTorch
  • Shuffling
  • Sharding
  • Queries, indexing
  • Parquet partitions
  • Local caching
  • N-grams (windowing)

SLIDE 10

Research engineer experience

Before: Data extraction (query upstream systems, ETL at scale), Train, Evaluate, Deploy.
After: Train, Evaluate, Deploy.

SLIDE 11

Apache Parquet as a dataframe with tensors

Two integration alternatives:

  • Petastorm store: Apache Parquet extended with nd-arrays and scalars (e.g. images, lidar point clouds).
  • Non-Petastorm, Apache Parquet store: train from existing organization Parquet stores (native types, no tensors).

SLIDE 12

Extra schema information

FrameSchema = Unischema('FrameSchema', [
    UnischemaField('timestamp', np.int32, (), ScalarCodec(IntegerType()), nullable=False),
    UnischemaField('front_cam', np.uint8, (1200, 1920, 3), CompressedImageCodec('png'), nullable=False),
    UnischemaField('label_box', np.uint8, (None, 2, 2), NdarrayCodec(), nullable=False),
])

  • Stored with the Parquet store.
  • Defines the tensor serialization format.
  • Runtime type validation.
  • Needed for wiring natively into the TensorFlow graph.
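The runtime validation a schema like this enables can be sketched in a few lines. This is a simplified stand-in, not petastorm's actual Unischema implementation (which also carries codecs and a Spark schema mapping); field names and shapes are taken from the slide, and `None` in a shape means "any length along that axis".

```python
import numpy as np

# (name, dtype, shape) triples mirroring the FrameSchema fields above.
fields = [
    ('timestamp', np.int32, ()),
    ('front_cam', np.uint8, (1200, 1920, 3)),
    ('label_box', np.uint8, (None, 2, 2)),
]

def validate_row(row):
    """Check each field's rank and fixed dimensions before encoding."""
    for name, dtype, shape in fields:
        value = np.asarray(row[name], dtype=dtype)
        if len(value.shape) != len(shape):
            raise TypeError(f'{name}: expected rank {len(shape)}, got {value.shape}')
        for actual, expected in zip(value.shape, shape):
            if expected is not None and actual != expected:
                raise TypeError(f'{name}: bad shape {value.shape}')

row = {
    'timestamp': 1234,
    'front_cam': np.zeros((1200, 1920, 3), dtype=np.uint8),
    'label_box': np.zeros((7, 2, 2), dtype=np.uint8),  # 7 boxes: OK, first axis is None
}
validate_row(row)  # passes
```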

SLIDE 13

Generating a dataset

def row_generator(x):
    return {'timestamp': ...,
            'front_cam': np.asarray(...),
            'label_box': np.asarray(...)}

with materialize_dataset(spark, output_url, FrameSchema, rowgroup_size_mb):
    rows_rdd = sc.parallelize(range(rows_count)) \
        .map(row_generator) \
        .map(lambda x: dict_to_spark_row(FrameSchema, x))

    spark.createDataFrame(rows_rdd, FrameSchema.as_spark_schema()) \
        .write \
        .parquet(output_url)

  • 1. Configure the row-group size in Spark and write Petastorm metadata at the end.
  • 2. Encode tensors and convert to a Spark Row.
  • 3. Build the Spark schema from the Unischema.

SLIDE 14

Python

with make_reader('hdfs:///tmp/hello_world_dataset') as reader:
    for sample in reader:
        print(sample.id)
        plt.imshow(sample.image1)
[Out 0] 0

# Reading from a non-Petastorm dataset (only native Apache Parquet types)
with make_batch_reader('hdfs:///tmp/hel...') as reader:
    for sample in reader:
        print(sample.id)
[Out 1] [0, 1, 2, 3, 4, 5]

SLIDE 15

TensorFlow

# tf tensors
with make_reader('hdfs:///tmp/dataset') as reader:
    data = tf_tensors(reader)
    predictions = my_model(data.image1, data.image2)

Connects a Petastorm Reader object into the TF graph. Substitute make_batch_reader to read a non-Petastorm dataset.

SLIDE 16

PyTorch

from petastorm.pytorch import DataLoader

with DataLoader(make_reader(dataset_url)) as train_loader:
    sample = next(iter(train_loader))
    print(sample['id'])

SLIDE 17

Real example

with make_reader('hdfs:///path/to/uber/atg/dataset',
                 schema_fields=[AVSchema.lidar_xyz]) as reader:
    sample = next(reader)
    plt.plot(sample.returns_xyz[:, 0], sample.returns_xyz[:, 1], '.')

SLIDE 18

Reader architecture

  • Uses Apache Arrow.
  • Reading workers (threads or processes).
  • Row groups filtered and shuffled.
  • Outputs rows as np.array, tf.tensor, or tf.data.Dataset.

SLIDE 19

Petastorm row predicate

User-defined row filter (e.g. keep only rows where object_type is 'car'), with filtering optimizations:

    in_lambda(['object_type'], lambda object_type: object_type == 'car')
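The behavior of such a predicate can be sketched with a toy version. This mimics the shape of petastorm's in_lambda but is a simplified stand-in: the reader evaluates the user lambda over the named fields and keeps only the matching rows.

```python
# Toy petastorm-style row predicate: bind a lambda to a subset of fields.
def in_lambda(fields, predicate):
    def apply(row):
        return predicate(*(row[f] for f in fields))
    return apply

rows = [
    {'object_type': 'car', 'id': 0},
    {'object_type': 'pedestrian', 'id': 1},
    {'object_type': 'car', 'id': 2},
]

keep_cars = in_lambda(['object_type'], lambda object_type: object_type == 'car')
filtered = [row for row in rows if keep_cars(row)]
```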

SLIDE 20

Transform

User-defined row update, run on a thread/process pool. For example, converting ragged lists of lists into a tensor field:

    def modify_row(row):
        row['list_of_lists_as_tensor'] = \
            foo_to_tensor(row['list_of_lists'])
        del row['list_of_lists']
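Running such a transform on a pool can be sketched with the standard library. This is an illustrative stand-in, not petastorm's internals: the flattening below takes the place of the real tensor conversion (foo_to_tensor above), and the field names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def modify_row(row):
    """User transform: replace a ragged list-of-lists field with a flat one."""
    row = dict(row)
    row['as_tensor'] = [x for sublist in row['list_of_lists'] for x in sublist]
    del row['list_of_lists']
    return row

rows = [{'list_of_lists': [[1, 2, 3], [4], [5, 6]]},
        {'list_of_lists': [[1], [4, 5, 6]]}]

# Apply the transform concurrently across rows, preserving row order.
with ThreadPoolExecutor(max_workers=4) as pool:
    transformed = list(pool.map(modify_row, rows))
```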

SLIDE 21

Local cache

Useful over slow/expensive links. Disk or in-memory cache.

    make_reader(..., cache_type='local-disk')
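The idea behind the local cache can be sketched as a get-or-fetch layer: the first read of a piece of data pays the remote fetch, and repeated reads are served locally. This toy uses an in-memory dict; the 'local-disk' option applies the same idea with on-disk storage. All names here are illustrative.

```python
fetch_count = 0

def fetch_remote(row_group):
    """Stand-in for an expensive read over a slow network link."""
    global fetch_count
    fetch_count += 1
    return f'data-for-{row_group}'

cache = {}

def read(row_group):
    # Serve from the local cache when possible; fetch and fill it otherwise.
    if row_group not in cache:
        cache[row_group] = fetch_remote(row_group)
    return cache[row_group]

first = read(0)
second = read(0)  # served from cache; no second remote fetch
```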

SLIDE 22

Sharding

Distributed training Quick experimentation

make_reader(..., cur_shard=3, shard_count=10)
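What cur_shard/shard_count accomplish can be sketched as dealing row groups to workers so each reads a disjoint subset. The round-robin assignment below is illustrative; petastorm's exact assignment scheme is an implementation detail.

```python
def shard_row_groups(num_row_groups, cur_shard, shard_count):
    """Assign this worker the row groups whose index falls in its shard."""
    return [i for i in range(num_row_groups) if i % shard_count == cur_shard]

# Worker 3 of 10 over a 10-row-group dataset reads only row group 3.
mine = shard_row_groups(num_row_groups=10, cur_shard=3, shard_count=10)

# Across all shards, every row group is read exactly once.
all_shards = [shard_row_groups(25, s, 10) for s in range(10)]
```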

SLIDE 23

NGrams (windowing)

Sorted datasets. Efficient IO/decoding. Cons: RAM-wasteful shuffling.

A sequence t=0..t=4 yields the windows (t=0, t=1, t=2), (t=1, t=2, t=3), (t=2, t=3, t=4).
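The windowing on this slide is a sliding window over a sorted (e.g. by timestamp) sequence. A minimal sketch of the idea, using the slide's length-3 windows over t=0..t=4:

```python
def ngrams(rows, length):
    """All consecutive windows of the given length over a sorted sequence."""
    return [rows[i:i + length] for i in range(len(rows) - length + 1)]

windows = ngrams([0, 1, 2, 3, 4], length=3)
```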

SLIDE 24

Conclusion

Petastorm was developed to support the “One Dataset” workflow. It uses Apache Parquet as the store format:

  • Tensor support.
  • Provides the set of tools needed for deep-learning training/evaluation.
  • Can also read from the organization data warehouse (non-Petastorm, native Parquet types).

(Still lots of work left to be done… we are hiring!)

SLIDE 25

Thank you! Github: https://github.com/uber/petastorm yevgeni@uber.com