Performance and Insights on File Formats 2.0, Luca Menichetti, Vag Motesnitsalis (PowerPoint presentation)


SLIDE 1

Performance and Insights on File Formats 2.0

Luca Menichetti, Vag Motesnitsalis

SLIDE 2

Design and Expectations

2 Use Cases:

 Exhaustive (operation using all values of a record)
 Selective (operation using limited values of a record)

5 Data Formats:

 CSV, Parquet, serialized RDD objects, JSON, Apache Avro

The tests gave insights on specific advantages and disadvantages of each format, as well as on their time and space performance.
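The two use cases can be made concrete with a toy sketch. This is a hedged illustration, not the deck's actual jobs (which were Scala/Spark): the records and field names below are invented. The selective job touches a single field per record, which is exactly the access pattern a columnar format can exploit.

```python
# Invented monitoring-style records; the real schemas are not shown in the deck.
records = [
    {"site": "CERN", "jobs": 10 + i, "cpu_hours": 5.0 * i}
    for i in range(100)
]

# Exhaustive (UC1-style): every field of every record is used.
exhaustive = sum(
    r["cpu_hours"] for r in records if r["site"] == "CERN" and r["jobs"] > 0
)

# Selective (UC2-style): only one field per record is needed, so a
# columnar format could avoid reading the other columns entirely.
selective = sum(r["jobs"] for r in records)

print(exhaustive, selective)
```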


SLIDE 3

Experiment descriptions

 For the “exhaustive” use case (UC1) we used EOS logs “processed” data. The current default data format is CSV.

 For the “selective” use case (UC2) we used experiment Job Monitoring data from the Dashboard. The current default data format is JSON.

 For each use case, all formats were generated a priori (from the default format) and then the tests were executed.

 Technology: Spark (Scala) with the SparkSQL library.

 No tests were performed with compression.
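The "generated a priori from the default format" step amounts to a format conversion. A minimal Python sketch of the idea (the real conversions were done with Spark/Scala, and the CSV content below is an invented stand-in for the EOS-log default format):

```python
import csv
import json

# Invented stand-in for the default-format input; real EOS logs differ.
csv_text = "ts,path,bytes\n1,/eos/file-a,100\n2,/eos/file-b,200\n"

# Parse the default CSV, then re-emit each record as one JSON object per line.
rows = list(csv.DictReader(csv_text.splitlines()))
json_lines = "\n".join(json.dumps(r) for r in rows)
print(json_lines)
```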


SLIDE 4

Formats

 CSV – text files, comma-separated values, one record per line
 JSON – text files, JavaScript objects, one per line
 Serialized RDD Objects (SRO) – Spark datasets serialized to text files
 Avro – serialization format with binary encoding
 Parquet – columnar format with binary encoding
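The two text formats above differ mainly in where field names live: once in a CSV header versus repeated in every JSON object. A hedged Python sketch (records invented) showing the size gap this produces:

```python
import csv
import io
import json

# Invented records; the real EOS/Job Monitoring schemas are not shown here.
records = [
    {"user": "u%03d" % i, "bytes_read": 1024 * i, "status": "ok"}
    for i in range(1000)
]

# CSV: field names appear once, in the header line.
csv_buf = io.StringIO()
writer = csv.DictWriter(csv_buf, fieldnames=["user", "bytes_read", "status"])
writer.writeheader()
writer.writerows(records)

# Line-delimited JSON: field names are repeated in every record.
json_buf = "\n".join(json.dumps(r) for r in records)

# JSON pays for the repeated keys, which foreshadows the roughly 2x
# CSV-vs-JSON size gap measured on the space-requirements slide.
print(len(csv_buf.getvalue()), len(json_buf))
```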


SLIDE 5

Space Requirements (in GB)

Format     UC1, EOS logs (GB)   UC2, Job Monitoring (GB)
CSV        12.7 (default)       6.2
JSON       23.4                 13.4 (default)
SRO        19.9                 8.1
Avro       11.7                 6.6
Parquet    6.3                  2.1


SLIDE 6

Spark executions

for i in {1..50}; do
  for format in CSV JSON SRO Avro Parquet; do
    for UC in UC1 UC2; do
      spark-submit \
        --num-executors 2 \
        --executor-cores 2 \
        --executor-memory 2G \
        --class ch.cern.awg.Test$UC$format \
        formats-analyses.jar input-$UC-$format > output-$UC-$format-$i
    done
  done
done

We took the times from all (UC, format) jobs and calculated an average for each type of execution (discarding outliers).

Times include reading and computation (test jobs don't write any file; they just print the result to stdout).
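The deck does not say which rule was used to discard outliers before averaging. A minimal Python sketch of one plausible approach (the original analysis used Scala/Spark; the median-absolute-deviation filter and the run times below are assumptions for illustration):

```python
from statistics import mean, median

def average_without_outliers(times, k=3.0):
    """Average run times after discarding outliers.

    A run counts as an outlier when it lies more than k times the
    median absolute deviation (MAD) away from the median time.
    """
    med = median(times)
    mad = median(abs(t - med) for t in times)
    if mad == 0:  # all runs (essentially) identical: nothing to discard
        return mean(times)
    kept = [t for t in times if abs(t - med) <= k * mad]
    return mean(kept)

# Hypothetical run times (seconds) for one (UC, format) pair;
# the last run hit a slow executor and is dropped as an outlier.
runs = [80.1, 79.8, 81.0, 80.5, 250.3]
print(average_without_outliers(runs))
```

A MAD-based filter is used here rather than a mean/standard-deviation cutoff because a single extreme value inflates the standard deviation enough to hide itself in small samples.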


SLIDE 7

Times: UC1 "Exhaustive"

Format     Size (GB)   AVG time (s)   MIN time (s)
CSV        12.7        80.3           65.41
JSON       23.4        245.9          205
SRO        19.9        155.8          131.6
Avro       11.7        80.7           64.5
Parquet    6.3         108.4          78.5


SLIDE 8

Times: UC2 "Selective"

Format     Size (GB)   AVG time (s)   MIN time (s)
CSV        6.2         42             35.1
JSON       13.4        109.5          83.3
SRO        8.1         74.6           63.6
Avro       6.6         52.6           40.4
Parquet    2.1         23.7           18.4


SLIDE 9

Time Comparison between UC1 and UC2

[Bar chart: average execution times in seconds per format (CSV, JSON, SRO, Avro, Parquet), AVG UC1 vs AVG UC2.]


SLIDE 10

Space and Time Performance Gain/Loss

[compared to current default format]

                             CSV     JSON    SRO     Avro    Parquet
Space UC1 [EOS logs]         =       +84%    +56%    -8%     -51%
Time UC1                     =       +215%   +93%    =       +35%
Space UC2 [Job Monitoring]   -54%    =       -40%    -51%    -84%
Time UC2                     -64%    =       -35%    -54%    -79%

("=" marks the current default format, or a result on par with it.)


SLIDE 11

Pros and Cons

CSV
  Pros: Always supported and easy to use. Efficient.
  Cons: No schema changes allowed. No type definitions. No declaration control.

JSON
  Pros: Encoded in plain text (easy to use). Schema changes allowed.
  Cons: Inefficient. High space consumption. No declaration control.

Serialized RDD Objects
  Pros: Declaration control. Falls between CSV and JSON for space and time. Good for storing aggregated results.
  Cons: Spark only. No compression. Schema changes allowed, but must be implemented manually.

Avro
  Pros: Schema changes allowed. Efficiency comparable to CSV. Compression definition included in the schema.
  Cons: Space consumption similar to CSV (not really a negative). Needs a plugin (we found an incompatibility between our Spark version and the avro library; we had to fix and recompile it).

Parquet
  Pros: Low space consumption (RLE). Extremely efficient for “selective” use cases, with good performance in other cases too.
  Cons: Needs a plugin. Slow to generate.


SLIDE 12

Data Formats - Overview

                            CSV      JSON                     SRO           Avro                 Parquet
Supports schema changes     NO       YES                      YES           YES                  YES
Primitive/complex types     NO       YES (general numerics)   YES           YES                  YES
Declaration control         NO       NO                       YES           YES                  YES
Supports compression        YES      YES                      NO            YES                  YES
Storage consumption         Medium   High                     Medium/High   Medium               Low (RLE)
Supported by                All      All (parsed from text)   Spark only    All (needs plugin)   All (needs plugin)
Can print a sample snippet  YES      YES                      NO            YES (avro-tools)     NO (unofficial tools only)


SLIDE 13

Conclusions

There is no “ultimate” file format but…

 Avro shows promising results for exhaustive use cases, with performance comparable to CSV.

 Parquet shows extremely good results for selective use cases, together with very low space consumption.

 JSON is good for storing directly (without any additional effort) data coming from web-like services that might change their format in the future, but it is too inefficient and consumes too much space.

 CSV is still quite efficient in time and space, but the schema is frozen and validation is left up to the user.

 Serialized Spark RDDs are a good solution for storing Scala objects that need to be reused soon (such as aggregated results to plot, or intermediate results saved for future computation), but they are not advisable as a final format, since they are not a general-purpose format.


SLIDE 14

Thank You


SLIDE 15

Spark UC1 executions


[Bar chart: UC1 execution times in seconds (scale up to 500) per format (csv, json, sro, avro, parquet), for three Spark configurations (EM, NE, EC) = (executor-memory, num-executors, executor-cores): (2G, 4, 2), (2G, 2, 2), (2G, 2, 1).]

SLIDE 16

Spark UC2 executions


[Bar chart: UC2 execution times in seconds (scale up to 250) per format (csv, json, sro, avro, parquet), for three Spark configurations (EM, NE, EC) = (executor-memory, num-executors, executor-cores): (2G, 4, 2), (2G, 2, 2), (2G, 2, 1).]