Petastorm
Petastorm: A Light-Weight Approach to Building ML Pipelines @Uber Yevgeni Litvin (yevgeni@uber.com), Uber ATG
Deep learning for self-driving vehicles: the data challenges
- Complex upstream APIs: raw AV data API, Maps API, Labels API
- Cluster processing
- Steep learning curve
- Many datasets, huge data volumes
- Many file formats: HDF5, TFRecords, PNG files
Research engineers (typically) don't do data extraction: they train directly from a well-known dataset built from the upstream APIs (raw AV data, Labels, Maps).
Uber ATG Mission
Introduce self-driving technology to the Uber network in order to make transporting people and goods safer, more efficient, and more affordable around the world.
Yevgeni Litvin: works on the data platform and onboard integration.
Enabling the “One Dataset” approach: file formats, and Petastorm as an enabling tool.
One dataset used by multiple research projects; each project reads only the fields it requires.
Apache Parquet:
- Efficient column-subset reads; atomic read unit: one column from a row group (a chunk)
- Random access to a row group
- Natively supported by Apache Spark, Hive, and other big-data tools
- No tensor support
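The layout can be pictured as independently readable chunks keyed by (row group, column), so a column-subset read touches only the chunks for the requested columns. A toy pure-Python model of that access pattern (illustrative only, not real Parquet I/O; the field names reuse the schema shown later):

```python
# Toy model of Parquet layout (illustrative only, not real Parquet I/O):
# each (row_group, column) pair is an independently readable chunk.
store = {
    (0, 'timestamp'): [0, 1, 2],
    (0, 'front_cam'): ['img0', 'img1', 'img2'],
    (1, 'timestamp'): [3, 4, 5],
    (1, 'front_cam'): ['img3', 'img4', 'img5'],
}

def read_columns(store, columns):
    # A column-subset read touches only the requested column chunks.
    return [chunk for chunk in store if chunk[1] in columns]

print(read_columns(store, {'timestamp'}))  # only 2 of the 4 chunks are touched
```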
Petastorm features:
- Scalable
- Native TensorFlow and PyTorch support
- Shuffling
- Sharding
- Queries, indexing, Parquet partitions
- Local caching
- N-grams (windowing)
Before: Data extraction (query upstream systems, ETL at scale) → Train → Evaluate → Deploy
After: Train → Evaluate → Deploy
Two types of stores:
- Petastorm Apache Parquet store: Apache Parquet as a dataframe with tensors; nd-arrays and scalars (e.g. images, lidar point clouds)
- Non-Petastorm Apache Parquet store: train from existing org Parquet stores (native types, no tensors)
FrameSchema = Unischema('FrameSchema', [
    UnischemaField('timestamp', np.int32, (), ScalarCodec(IntegerType()), nullable=False),
    UnischemaField('front_cam', np.uint8, (1200, 1920, 3), CompressedImageCodec('png'), nullable=False),
    UnischemaField('label_box', np.uint8, (None, 2, 2), NdarrayCodec(), nullable=False),
])
Unischema:
- Stored with the Parquet store
- Defines tensor serialization format
- Runtime type validation
- Needed for wiring natively into a TensorFlow graph
def row_generator(x):
    return {'timestamp': ...,
            'front_cam': np.asarray(...),
            'label_box': np.asarray(...)}

with materialize_dataset(spark, output_url, FrameSchema, rowgroup_size_mb):
    rows_rdd = sc.parallelize(range(rows_count)) \
        .map(row_generator) \
        .map(lambda x: dict_to_spark_row(FrameSchema, x))  # convert to a Spark Row

    spark.createDataFrame(rows_rdd, FrameSchema.as_spark_schema()) \
        .write \
        .parquet(output_url)

materialize_dataset configures the row-group size in Spark and writes the Petastorm metadata at the end; dict_to_spark_row converts a Unischema dict into a Spark Row.
# Reading a Petastorm dataset
with make_reader('hdfs:///tmp/hello_world_dataset') as reader:
    for sample in reader:
        print(sample.id)
        plt.imshow(sample.image1)
[Out 0] 0

# Reading from a non-Petastorm dataset (only native Apache Parquet types)
with make_batch_reader('hdfs:///tmp/hel...') as reader:
    for sample in reader:
        print(sample.id)
[Out 1] [0, 1, 2, 3, 4, 5]
# tf tensors
with make_reader('hdfs:///tmp/dataset') as reader:
    data = tf_tensors(reader)
    predictions = my_model(data.image1, data.image2)

Connects a Petastorm Reader object into the TF graph; substitute make_batch_reader to read a non-Petastorm dataset.
from petastorm.pytorch import DataLoader

with DataLoader(make_reader(dataset_url)) as train_loader:
    sample = next(iter(train_loader))
    print(sample['id'])
with make_reader('hdfs:///path/to/uber/atg/dataset',
                 schema_fields=[AVSchema.lidar_xyz]) as reader:
    sample = next(reader)
    plt.plot(sample.returns_xyz[:, 0], sample.returns_xyz[:, 1], '.')
Under the hood:
- Uses Apache Arrow
- Reading workers (threads or processes)
- Row groups filtered and shuffled
- Output rows as np.array, tf.tensor, or tf.data.Dataset
User-defined row filter (e.g. keep only rows where object_type is 'car'); applied as an optimization before decoding:

in_lambda(['object_type'], lambda object_type: object_type == 'car')
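Petastorm's in_lambda builds such a predicate from field names and a function, and rows failing it are dropped before decoding. A minimal pure-Python sketch of the filtering semantics (this in_lambda and the sample rows are stand-ins for illustration, not the Petastorm API):

```python
# Stand-in for Petastorm's in_lambda: keep rows whose named field
# satisfies the predicate (illustrative only).
def in_lambda(fields, predicate):
    field = fields[0]
    return lambda row: predicate(row[field])

# Hypothetical rows, mimicking the slide's car/pedestrian/bicycle example.
rows = [
    {'object_type': 'car', 'id': 0},
    {'object_type': 'pedestrian', 'id': 1},
    {'object_type': 'car', 'id': 2},
    {'object_type': 'bicycle', 'id': 3},
]

keep = in_lambda(['object_type'], lambda object_type: object_type == 'car')
filtered = [r for r in rows if keep(r)]
print([r['id'] for r in filtered])  # -> [0, 2]
```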
User-defined row update, run on the thread/process pool (e.g. converting ragged lists of lists such as [[1, 2, 3], [4], [5, 6]] into tensors):

def modify_row(row):
    row['list_of_lists_as_tensor'] = \
        foo_to_tensor(row['list_of_lists'])
    del row['list_of_lists']
    return row
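In Petastorm the row-update function is passed to the reader via a TransformSpec so it runs on the worker pool. The per-row effect can be sketched in plain Python; to_tensor below is a hypothetical stand-in for the real list-of-lists conversion:

```python
import numpy as np

def to_tensor(list_of_lists):
    # Hypothetical stand-in: zero-pad ragged lists into a dense array.
    width = max(len(inner) for inner in list_of_lists)
    out = np.zeros((len(list_of_lists), width), dtype=np.int64)
    for i, inner in enumerate(list_of_lists):
        out[i, :len(inner)] = inner
    return out

def modify_row(row):
    # Replace the ragged field with its dense tensor form.
    row['list_of_lists_as_tensor'] = to_tensor(row['list_of_lists'])
    del row['list_of_lists']
    return row

row = modify_row({'list_of_lists': [[1, 2, 3], [4], [5, 6]]})
print(row['list_of_lists_as_tensor'])
# In Petastorm: make_reader(url, transform_spec=TransformSpec(modify_row))
```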
Local caching, for slow/expensive links (in-memory or local-disk cache):

make_reader(..., cache_type='local-disk')
Sharding, for distributed training and quick experimentation:

make_reader(..., cur_shard=3, shard_count=10)
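In distributed training, a common pattern is to derive cur_shard and shard_count from the trainer's rank and world size (e.g. Horovod's hvd.rank() and hvd.size()). How shards partition the row groups can be sketched with a round-robin split; this is an illustrative scheme, not necessarily Petastorm's exact assignment:

```python
def shard_row_groups(num_row_groups, cur_shard, shard_count):
    # Round-robin assignment of row-group indices to shards (illustrative).
    return [i for i in range(num_row_groups) if i % shard_count == cur_shard]

# 10 row groups split across 4 workers: shards are disjoint and cover all groups.
for shard in range(4):
    print(shard, shard_row_groups(10, shard, 4))
```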
N-grams (windowing) over sorted datasets: efficient IO and decoding. Cons: RAM-wasteful shuffling.

(figure: a sequence t=0..t=4 yields the sliding windows t=0..t=2, t=1..t=3, t=2..t=4)
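The windowing above can be sketched as a sliding window over consecutive timestamps (a pure-Python illustration, not the Petastorm NGram API):

```python
def sliding_windows(timestamps, length):
    # All consecutive windows of `length` samples (illustrative).
    return [timestamps[i:i + length] for i in range(len(timestamps) - length + 1)]

print(sliding_windows([0, 1, 2, 3, 4], 3))  # -> [[0, 1, 2], [1, 2, 3], [2, 3, 4]]
```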
Summary: Petastorm was developed to support the “One Dataset” workflow. It uses Apache Parquet as the store format and can also read from an organization's data warehouse (non-Petastorm, native Parquet types). There is still lots of work left to be done… we are hiring!
Thank you! Github: https://github.com/uber/petastorm yevgeni@uber.com