Petastorm: A Light-Weight Approach to Building ML Pipelines @Uber (presentation transcript)


  1. Petastorm: A Light-Weight Approach to Building ML Pipelines @Uber. Yevgeni Litvin (yevgeni@uber.com), Uber ATG.

  2. Deep learning for self-driving vehicles. Upstream APIs (raw AV data, maps, labels) feed cluster processing that produces TFRecords, HDF5, and PNG files. Pain points:
     - Complex upstream APIs
     - Huge row sizes (multiple MBs)
     - Huge datasets (tens+ TB)
     - Learning curve
     - Many datasets

  3. Consolidating datasets. Research engineers (typically) don't do data extraction; they train directly from a well-known dataset built from the upstream APIs (raw AV data, maps, labels).

  4. Uber ATG mission: introduce self-driving technology to the Uber network in order to make transporting people and goods safer, more efficient, and more affordable around the world.

  5. About myself: Yevgeni Litvin. I work on the data platform and on onboard integration of models.

  6. Our talk today:
     - Enabling the "One Dataset" approach
     - File formats
     - Petastorm as an enabling tool

  7. One dataset: one dataset used by multiple research projects.
     - Easy to compare models; easy to reproduce training.
     - Faster research-engineer ramp-up.
     - ML infra-team management.
     - Superset of the data a single project may require.
     - No model-specific preprocessing.
     - Efficient data access.
     - TF/PyTorch/other framework native access.

  8. Apache Parquet:
     - Efficient column-subset reads.
     - Atomic read unit: one column from a row group (a chunk).
     - Random access to a row group.
     - Natively supported by Apache Spark, Hive, and other big-data tools.
     - No tensor support.
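
As background, a minimal sketch of a column-subset read straight from Parquet with pyarrow (not on the slide; the path and column names reuse the hello-world example from later slides and are illustrative):

      import pyarrow.parquet as pq

      # Only the requested columns are read; Parquet's columnar layout
      # means the other columns are never fetched from storage.
      table = pq.read_table('hdfs:///tmp/hello_world_dataset',
                            columns=['id', 'timestamp'])
      df = table.to_pandas()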

  9. Petastorm:
     - Scalable
     - Native TensorFlow, PyTorch
     - Shuffling
     - Sharding
     - Queries, indexing
     - Parquet partitions
     - Local caching
     - N-grams (windowing)

  10. Research engineer experience.
      Before: Data extraction (query upstream systems, ETL at scale) -> Train -> Evaluate -> Deploy.
      After: Train -> Evaluate -> Deploy.

  11. Two integration alternatives:
      - Apache Parquet as a dataframe with tensors: a Petastorm store holding nd-arrays and scalars (e.g. images, lidar point clouds).
      - Train from existing organizational Parquet stores (native types, no tensors; non-Petastorm). (Diagram: two Parquet stores side by side; the plain store holds native-type rows such as 'Hedgehog', 'Fog', 'Horse'.)

  12. Extra schema information. The Unischema is stored with the Parquet store, defines the tensor serialization format, provides runtime type validation, and is needed for wiring natively into the TensorFlow graph.

      import numpy as np
      from pyspark.sql.types import IntegerType
      from petastorm.codecs import ScalarCodec, CompressedImageCodec, NdarrayCodec
      from petastorm.unischema import Unischema, UnischemaField

      FrameSchema = Unischema('FrameSchema', [
          UnischemaField('timestamp', np.int32, (), ScalarCodec(IntegerType()), nullable=False),
          UnischemaField('front_cam', np.uint8, (1200, 1920, 3), CompressedImageCodec('png'), nullable=False),
          UnischemaField('label_box', np.uint8, (None, 2, 2), NdarrayCodec(), nullable=False),
      ])

  13. Generating a dataset. materialize_dataset configures the Spark row-group size and writes Petastorm metadata at the end; rows are encoded with dict_to_spark_row, and the Spark schema is derived from the Unischema.

      from petastorm.etl.dataset_metadata import materialize_dataset
      from petastorm.unischema import dict_to_spark_row

      def row_generator(x):
          # Encode tensors; keys must match the Unischema fields.
          return {'timestamp': ...,
                  'front_cam': np.asarray(...),
                  'label_box': np.asarray(...)}

      # 1. Configure row-group size and write Petastorm metadata at the end.
      with materialize_dataset(spark, output_url, FrameSchema, rowgroup_size_mb):
          # 2. Encode tensors and convert each dict to a Spark Row.
          rows_rdd = sc.parallelize(range(rows_count)) \
              .map(row_generator) \
              .map(lambda x: dict_to_spark_row(FrameSchema, x))

          # 3. Spark schema derived from the Unischema.
          spark.createDataFrame(rows_rdd, FrameSchema.as_spark_schema()) \
              .write \
              .parquet(output_url)

  14. Python.

      from petastorm import make_reader, make_batch_reader

      with make_reader('hdfs:///tmp/hello_world_dataset') as reader:
          for sample in reader:
              print(sample.id)
              plt.imshow(sample.image1)
      # [Out 0] 0

      # Reading from a non-Petastorm dataset (only native Apache Parquet types).
      with make_batch_reader('hdfs:///tmp/hel...') as reader:
          for sample in reader:
              print(sample.id)
      # [Out 1] [0, 1, 2, 3, 4, 5]

  15. TensorFlow. Substitute make_batch_reader to read a non-Petastorm dataset.

      from petastorm import make_reader
      from petastorm.tf_utils import tf_tensors

      with make_reader('hdfs:///tmp/dataset') as reader:
          # Connect the Petastorm Reader object into the TF graph as tf tensors.
          data = tf_tensors(reader)
          predictions = my_model(data.image1, data.image2)
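
Slide 18 also lists tf.data.Dataset output; a minimal sketch using petastorm.tf_utils.make_petastorm_dataset (TF1-style iterator; the dataset URL is a placeholder):

      from petastorm import make_reader
      from petastorm.tf_utils import make_petastorm_dataset

      with make_reader('hdfs:///tmp/dataset') as reader:
          dataset = make_petastorm_dataset(reader)  # a tf.data.Dataset of named tuples
          iterator = dataset.make_one_shot_iterator()
          sample = iterator.get_next()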

  16. PyTorch.

      from petastorm import make_reader
      from petastorm.pytorch import DataLoader

      with DataLoader(make_reader(dataset_url)) as train_loader:
          sample = next(iter(train_loader))
          print(sample['id'])

  17. Real example.

      with make_reader('hdfs:///path/to/uber/atg/dataset',
                       schema_fields=[AVSchema.lidar_xyz]) as reader:
          sample = next(reader)
          plt.plot(sample.returns_xyz[:, 0], sample.returns_xyz[:, 1], '.')

  18. Reader architecture:
      - Uses Apache Arrow.
      - Reading workers (threads or processes).
      - Row groups are filtered and shuffled.
      - Outputs rows as np.array, tf.tensor, or tf.data.Dataset.
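
The worker pool is configurable through make_reader's public API; a sketch (parameter values are illustrative):

      from petastorm import make_reader

      # reader_pool_type selects the workers: 'thread' (default), 'process',
      # or 'dummy' (run in the main thread, handy for debugging).
      with make_reader('hdfs:///tmp/hello_world_dataset',
                       reader_pool_type='process',
                       workers_count=10,
                       shuffle_row_groups=True) as reader:
          sample = next(reader)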

  19. Petastorm row predicate: a user-defined row filter, with reader-side optimizations. (Diagram: a table of rows in which only rows with object_type 'car' pass; 'pedestrian' and 'bicycle' rows are filtered out.)

      in_lambda(['object_type'],
                lambda object_type: object_type == 'car')
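
A usage sketch wiring the predicate into a reader (the dataset URL and the 'object_type' column are hypothetical):

      from petastorm import make_reader
      from petastorm.predicates import in_lambda

      # Keep only rows whose object_type equals 'car'.
      predicate = in_lambda(['object_type'],
                            lambda object_type: object_type == 'car')
      with make_reader('hdfs:///tmp/dataset', predicate=predicate) as reader:
          for sample in reader:
              ...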

  20. Transform: a user-defined row update, executed on the reader's thread/process pool. (Diagram: list-of-lists values, e.g. [[1,2,3], [4], [5,6]], packed into fixed-size tensor rows.)

      def modify_row(row):
          # Replace a list-of-lists field with its tensor encoding.
          row['list_of_lists_as_tensor'] = foo_to_tensor(row['list_of_lists'])
          del row['list_of_lists']
          return row
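
A sketch of attaching the transform to a reader via TransformSpec (the URL is a placeholder; modify_row is the function above):

      from petastorm import make_reader, TransformSpec

      with make_reader('hdfs:///tmp/dataset',
                       transform_spec=TransformSpec(modify_row)) as reader:
          sample = next(reader)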

  21. Local cache: useful over slow/expensive links. An in-memory cache is also available.

      make_reader(..., cache_type='local-disk')
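
A fuller sketch of the disk cache, assuming make_reader's cache_location, cache_size_limit, and cache_row_size_estimate parameters; the sizes are illustrative:

      from petastorm import make_reader

      with make_reader('hdfs:///tmp/dataset',
                       cache_type='local-disk',
                       cache_location='/tmp/petastorm_cache',
                       cache_size_limit=10 * 2**30,       # ~10 GB of local disk
                       cache_row_size_estimate=4 * 2**20  # ~4 MB per row
                       ) as reader:
          sample = next(reader)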

  22. Sharding: for distributed training, or quick experimentation on a fraction of the data.

      make_reader(..., cur_shard=3, shard_count=10)
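
In distributed training each worker typically reads its own shard; a sketch using Horovod's rank/size (Horovod is my example, not named on the slide):

      import horovod.tensorflow as hvd
      from petastorm import make_reader

      hvd.init()
      # Each worker sees a disjoint subset of row groups.
      with make_reader('hdfs:///tmp/dataset',
                       cur_shard=hvd.rank(),
                       shard_count=hvd.size()) as reader:
          sample = next(reader)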

  23. NGrams (windowing): for sorted datasets; efficient IO/decoding. Cons: RAM-wasteful shuffling. (Diagram: sliding windows over consecutive timestamps, e.g. [t=0, t=1, t=2], [t=1, t=2, t=3], [t=2, t=3, t=4].)
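
A sketch of requesting windows with petastorm.ngram.NGram (HelloWorldSchema and its fields are hypothetical; the dataset must be sorted by the timestamp field):

      from petastorm import make_reader
      from petastorm.ngram import NGram

      # A window of 3 consecutive rows: offsets 0, 1, 2 from each anchor row.
      ngram = NGram(fields={0: [HelloWorldSchema.id, HelloWorldSchema.image1],
                            1: [HelloWorldSchema.id, HelloWorldSchema.image1],
                            2: [HelloWorldSchema.id, HelloWorldSchema.image1]},
                    delta_threshold=1,
                    timestamp_field=HelloWorldSchema.id)

      with make_reader('hdfs:///tmp/hello_world_dataset',
                       schema_fields=ngram) as reader:
          window = next(reader)  # dict mapping offset -> decoded row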

  24. Conclusion. Petastorm was developed to support the "One Dataset" workflow. It uses Apache Parquet as the store format:
      - Adds tensor support.
      - Provides the set of tools needed for deep-learning training/evaluation.
      - Can also read from an organization's data warehouse (non-Petastorm, native Parquet types).
      There is still lots of work left to be done... we are hiring!

  25. GitHub: https://github.com/uber/petastorm. Thank you! yevgeni@uber.com
