Petastorm
Petastorm: A Light-Weight Approach to Building ML Pipelines @Uber Yevgeni Litvin (yevgeni@uber.com), Uber ATG
Deep learning for self-driving vehicles: the data challenges
- Complex upstream APIs: raw AV data API, Maps API, Labels API
- Cluster processing
- Steep learning curve
- Many datasets, huge data volumes
- Many file formats: HDF5, TFRecords, PNG files
Research engineers (typically) don't do data extraction: they train directly from a well-known dataset built from the upstream APIs (raw AV data, Labels, Maps).
Uber ATG Mission
Introduce self-driving technology to the Uber network in order to make transporting people and goods safer, more efficient, and more affordable around the world.
Yevgeni Litvin: works on the data platform and onboard integration.
Enabling the “One Dataset” approach: file formats, and Petastorm as an enabling tool.
One dataset used by multiple research projects; each project reads only the fields it requires.
Apache Parquet:
- Efficient column-subset reads; atomic read unit: one column from a row group (a chunk)
- Random access to a row group
- Natively supported by Apache Spark, Hive, and other big-data tools
- No tensor support
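The layout can be pictured as independently readable chunks keyed by (row group, column), so a column-subset read touches only the chunks for the requested columns. A toy pure-Python model of that access pattern (illustrative only, not real Parquet I/O; the field names reuse the schema shown later):

```python
# Toy model of Parquet layout (illustrative only, not real Parquet I/O):
# each (row_group, column) pair is an independently readable chunk.
store = {
    (0, 'timestamp'): [0, 1, 2],
    (0, 'front_cam'): ['img0', 'img1', 'img2'],
    (1, 'timestamp'): [3, 4, 5],
    (1, 'front_cam'): ['img3', 'img4', 'img5'],
}

def read_columns(store, columns):
    # A column-subset read touches only the requested column chunks.
    return [chunk for chunk in store if chunk[1] in columns]

print(read_columns(store, {'timestamp'}))  # only 2 of the 4 chunks are touched
```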
Petastorm features:
- Scalable
- Native TensorFlow and PyTorch support
- Shuffling
- Sharding
- Queries, indexing, Parquet partitions
- Local caching
- N-grams (windowing)
Before: Data extraction (query upstream systems, ETL at scale) → Train → Evaluate → Deploy
After: Train → Evaluate → Deploy
Two types of stores:
- Petastorm Apache Parquet store: Apache Parquet as a dataframe with tensors; nd-arrays and scalars (e.g. images, lidar point clouds)
- Non-Petastorm Apache Parquet store: train from existing org Parquet stores (native types, no tensors)
FrameSchema = Unischema('FrameSchema', [
    UnischemaField('timestamp', np.int32, (), ScalarCodec(IntegerType()), nullable=False),
    UnischemaField('front_cam', np.uint8, (1200, 1920, 3), CompressedImageCodec('png'), nullable=False),
    UnischemaField('label_box', np.uint8, (None, 2, 2), NdarrayCodec(), nullable=False),
])
Unischema:
- Stored with the Parquet store
- Defines tensor serialization format
- Runtime type validation
- Needed for wiring natively into a TensorFlow graph
def row_generator(x):
    return {'timestamp': ...,
            'front_cam': np.asarray(...),
            'label_box': np.asarray(...)}

with materialize_dataset(spark, output_url, FrameSchema, rowgroup_size_mb):
    rows_rdd = sc.parallelize(range(rows_count)) \
        .map(row_generator) \
        .map(lambda x: dict_to_spark_row(FrameSchema, x))  # convert to a Spark Row

    spark.createDataFrame(rows_rdd, FrameSchema.as_spark_schema()) \
        .write \
        .parquet(output_url)

materialize_dataset configures the row-group size in Spark and writes the Petastorm metadata at the end; dict_to_spark_row converts a Unischema dict into a Spark Row.
# Reading a Petastorm dataset
with make_reader('hdfs:///tmp/hello_world_dataset') as reader:
    for sample in reader:
        print(sample.id)
        plt.imshow(sample.image1)
[Out 0] 0

# Reading from a non-Petastorm dataset (only native Apache Parquet types)
with make_batch_reader('hdfs:///tmp/hel...') as reader:
    for sample in reader:
        print(sample.id)
[Out 1] [0, 1, 2, 3, 4, 5]
# tf tensors
with make_reader('hdfs:///tmp/dataset') as reader:
    data = tf_tensors(reader)
    predictions = my_model(data.image1, data.image2)

Connects a Petastorm Reader object into the TF graph; substitute make_batch_reader to read a non-Petastorm dataset.
from petastorm.pytorch import DataLoader

with DataLoader(make_reader(dataset_url)) as train_loader:
    sample = next(iter(train_loader))
    print(sample['id'])
with make_reader('hdfs:///path/to/uber/atg/dataset',
                 schema_fields=[AVSchema.lidar_xyz]) as reader:
    sample = next(reader)
    plt.plot(sample.returns_xyz[:, 0], sample.returns_xyz[:, 1], '.')
Under the hood:
- Uses Apache Arrow
- Reading workers (threads or processes)
- Row groups filtered and shuffled
- Output rows as np.array, tf.tensor, or tf.data.Dataset
User-defined row filter (e.g. keep only rows where object_type is 'car'); applied as an optimization before decoding:

in_lambda(['object_type'], lambda object_type: object_type == 'car')
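Petastorm's in_lambda builds such a predicate from field names and a function, and rows failing it are dropped before decoding. A minimal pure-Python sketch of the filtering semantics (this in_lambda and the sample rows are stand-ins for illustration, not the Petastorm API):

```python
# Stand-in for Petastorm's in_lambda: keep rows whose named field
# satisfies the predicate (illustrative only).
def in_lambda(fields, predicate):
    field = fields[0]
    return lambda row: predicate(row[field])

# Hypothetical rows, mimicking the slide's car/pedestrian/bicycle example.
rows = [
    {'object_type': 'car', 'id': 0},
    {'object_type': 'pedestrian', 'id': 1},
    {'object_type': 'car', 'id': 2},
    {'object_type': 'bicycle', 'id': 3},
]

keep = in_lambda(['object_type'], lambda object_type: object_type == 'car')
filtered = [r for r in rows if keep(r)]
print([r['id'] for r in filtered])  # -> [0, 2]
```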
User-defined row update, run on the thread/process pool (e.g. converting ragged lists of lists such as [[1, 2, 3], [4], [5, 6]] into tensors):

def modify_row(row):
    row['list_of_lists_as_tensor'] = \
        foo_to_tensor(row['list_of_lists'])
    del row['list_of_lists']
    return row
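In Petastorm the row-update function is passed to the reader via a TransformSpec so it runs on the worker pool. The per-row effect can be sketched in plain Python; to_tensor below is a hypothetical stand-in for the real list-of-lists conversion:

```python
import numpy as np

def to_tensor(list_of_lists):
    # Hypothetical stand-in: zero-pad ragged lists into a dense array.
    width = max(len(inner) for inner in list_of_lists)
    out = np.zeros((len(list_of_lists), width), dtype=np.int64)
    for i, inner in enumerate(list_of_lists):
        out[i, :len(inner)] = inner
    return out

def modify_row(row):
    # Replace the ragged field with its dense tensor form.
    row['list_of_lists_as_tensor'] = to_tensor(row['list_of_lists'])
    del row['list_of_lists']
    return row

row = modify_row({'list_of_lists': [[1, 2, 3], [4], [5, 6]]})
print(row['list_of_lists_as_tensor'])
# In Petastorm: make_reader(url, transform_spec=TransformSpec(modify_row))
```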
Local caching, for slow/expensive links (in-memory or local-disk cache):

make_reader(..., cache_type='local-disk')
Sharding, for distributed training and quick experimentation:

make_reader(..., cur_shard=3, shard_count=10)
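In distributed training, a common pattern is to derive cur_shard and shard_count from the trainer's rank and world size (e.g. Horovod's hvd.rank() and hvd.size()). How shards partition the row groups can be sketched with a round-robin split; this is an illustrative scheme, not necessarily Petastorm's exact assignment:

```python
def shard_row_groups(num_row_groups, cur_shard, shard_count):
    # Round-robin assignment of row-group indices to shards (illustrative).
    return [i for i in range(num_row_groups) if i % shard_count == cur_shard]

# 10 row groups split across 4 workers: shards are disjoint and cover all groups.
for shard in range(4):
    print(shard, shard_row_groups(10, shard, 4))
```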
N-grams (windowing) over sorted datasets: efficient IO and decoding. Cons: RAM-wasteful shuffling.

(figure: a sequence t=0..t=4 yields the sliding windows t=0..t=2, t=1..t=3, t=2..t=4)
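The windowing above can be sketched as a sliding window over consecutive timestamps (a pure-Python illustration, not the Petastorm NGram API):

```python
def sliding_windows(timestamps, length):
    # All consecutive windows of `length` samples (illustrative).
    return [timestamps[i:i + length] for i in range(len(timestamps) - length + 1)]

print(sliding_windows([0, 1, 2, 3, 4], 3))  # -> [[0, 1, 2], [1, 2, 3], [2, 3, 4]]
```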
Summary: Petastorm was developed to support the “One Dataset” workflow. It uses Apache Parquet as the store format and can also read from an organization's data warehouse (non-Petastorm, native Parquet types). There is still lots of work left to be done… we are hiring!
Thank you! Github: https://github.com/uber/petastorm yevgeni@uber.com