Performance and Insights on File Formats 2.0
Luca Menichetti, Vag Motesnitsalis

Design and Expectations
2 Use Cases:
- Exhaustive (operation using all values of a record)
- Selective (operation using limited values of a record)
5 Formats: CSV, JSON, SRO, Avro, Parquet
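The two use cases can be pictured with a small sketch. It is only an illustration: the record fields (`user`, `bytes_read`, and so on) are hypothetical and do not come from the slides. It contrasts an operation that uses every value of a record with one that uses a single field, and shows why a column-oriented layout (the kind Parquet uses) touches fewer values in the selective case.

```python
# Hypothetical records; field names are illustrative, not taken from the slides.
records = [
    {"user": "alice", "bytes_read": 120, "bytes_written": 30, "duration_s": 4},
    {"user": "bob", "bytes_read": 300, "bytes_written": 10, "duration_s": 9},
    {"user": "carol", "bytes_read": 50, "bytes_written": 75, "duration_s": 2},
]


def exhaustive(rows):
    """Exhaustive use case: every value of every record is used."""
    return [
        f"{r['user']}:{r['bytes_read']}:{r['bytes_written']}:{r['duration_s']}"
        for r in rows
    ]


def selective(rows):
    """Selective use case: only one field of every record is used."""
    return sum(r["bytes_read"] for r in rows)


def to_columns(rows):
    """Pivot row-oriented records into a column-oriented layout."""
    return {key: [r[key] for r in rows] for key in rows[0]}


columns = to_columns(records)

# With a columnar layout the selective operation needs a single column.
selective_from_columns = sum(columns["bytes_read"])

# Number of values that must be read in each layout for the selective case.
values_touched_row_layout = sum(len(r) for r in records)  # whole records
values_touched_col_layout = len(columns["bytes_read"])    # one column only
```

In the row layout the selective operation still walks through whole records, whereas the columnar layout reads a quarter of the values here; for the exhaustive operation both layouts read everything.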
Dataset size by format (GB)

Format     UC1     UC2
CSV        12.7    6.2
JSON       23.4    13.4
SRO        19.9    8.1
Avro       11.7    6.6
Parquet    6.3     2.1

Current defaults: CSV for EOS (UC1), JSON for Job Monitoring (UC2).
    # Run every (format, use case) combination 50 times, one Spark job per run.
    for i in {1..50}; do
      for format in CSV JSON SRO Avro Parquet; do
        for UC in UC1 UC2; do
          spark-submit formats-analyses.jar "input-$UC-$format" > "output-$UC-$format-$i"
        done
      done
    done
UC1: execution time (seconds)

Format     Size (GB)   AVG      MIN
CSV        12.7        80.3     65.41
JSON       23.4        245.9    205
SRO        19.9        155.8    131.6
Avro       11.7        80.7     64.5
Parquet    6.3         108.4    78.5
UC2: execution time (seconds)

Format     Size (GB)   AVG      MIN
CSV        6.2         42       35.1
JSON       13.4        109.5    83.3
SRO        8.1         74.6     63.6
Avro       6.6         52.6     40.4
Parquet    2.1         23.7     18.4
[Figure: average execution time in seconds per format (CSV, JSON, SRO, Avro, Parquet), AVG UC1 vs AVG UC2]
Comparison with the current default format

                              Default   CSV     JSON     SRO     Avro    Parquet
Space UC1 [EOS logs]          CSV       =       +84 %    +56 %   [...]   [...]
Time performance UC1          CSV       =       +215 %   +93 %   =       +35 %
Space UC2 [Job Monitoring]    JSON      [...]   =        [...]   [...]   [...]
Time performance UC2          JSON      [...]   =        [...]   [...]   [...]
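A relative comparison of this kind can be recomputed from the UC1 sizes and average times reported on the earlier slides. The sketch below assumes the percentages are simple relative differences against the default format (CSV for UC1); the slides do not state how they were derived, so the recomputed values may differ from the slide figures by a few points.

```python
# Figures transcribed from the slides (UC1, current default format: CSV).
size_gb = {"CSV": 12.7, "JSON": 23.4, "SRO": 19.9, "Avro": 11.7, "Parquet": 6.3}
avg_time_s = {"CSV": 80.3, "JSON": 245.9, "SRO": 155.8, "Avro": 80.7, "Parquet": 108.4}


def relative_change(values, baseline):
    """Percentage difference of each format against the baseline format.

    Assumption: plain relative difference, 100 * (value - base) / base.
    """
    base = values[baseline]
    return {fmt: 100.0 * (v - base) / base for fmt, v in values.items()}


space_vs_csv = relative_change(size_gb, "CSV")
time_vs_csv = relative_change(avg_time_s, "CSV")

for fmt in size_gb:
    print(f"{fmt:8s} space {space_vs_csv[fmt]:+7.1f} %   time {time_vs_csv[fmt]:+7.1f} %")
```

Swapping in the UC2 figures and `"JSON"` as the baseline gives the corresponding comparison for the Job Monitoring use case.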
Pros and Cons

CSV
  Pros: Always supported and easy to use. Efficient.
  Cons: No schema change allowed. No type definitions. No declaration control.

JSON
  Pros: Encoded in plain text (easy to use). Schema changes allowed.
  Cons: No declaration control.

Serialized RDD Objects (SRO)
  Pros: Declaration control. A choice "between" CSV and JSON (for space and time). Good for storing aggregate results.
  Cons: Spark only. No compression. Schema changes allowed, but they must be implemented manually.

Avro
  Pros: Schema changes allowed. Efficiency comparable to CSV. Compression definition included in the schema.
  Cons: Space consumption similar to CSV (not really a negative). Needs a plugin (we found an incompatibility with [...] and recompile it).

Parquet
  Pros: Low space consumption (RLE). Extremely efficient for "selective" use cases, with good performance in other cases too.
  Cons: Needs a plugin. Slow to generate.
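The "RLE" noted for Parquet refers to run-length encoding. A minimal sketch of the idea follows; it is a simplified illustration, not Parquet's actual on-disk encoding (which combines run-length encoding with bit-packing and dictionary encoding), and the sample column is hypothetical.

```python
from itertools import groupby


def rle_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(values)]


def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original sequence."""
    return [value for value, count in pairs for _ in range(count)]


# Hypothetical column with long runs of repeated values, e.g. a status field.
column = ["OK"] * 6 + ["ERROR"] * 2 + ["OK"] * 4
encoded = rle_encode(column)

# 12 values are stored as 3 (value, run_length) pairs, and decoding is lossless.
print(encoded)
print(rle_decode(encoded) == column)
```

Columns with long runs of identical values compress well this way, which is consistent with the low storage consumption reported for Parquet.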
Feature                                    CSV     JSON                           SRO           Avro                    Parquet
Support change of schema                   NO      YES                            YES           YES                     YES
Primitive/complex types                    [...]   [...] general numeric)         YES           YES                     YES
Declaration control                        NO      NO                             YES           YES                     YES
Support compression                        YES     YES                            NO            YES                     YES
Storage consumption                        Medium  High                           Medium/High   Medium                  Low (RLE)
Supported by which technologies?           All     All (to be parsed from text)   Spark only    All (needs plugin)      All (needs plugin)
Possibility to print a snippet as sample   YES     YES                            NO            YES (with avro tools)   NO (yes with unofficial tools)
Avro shows promising results for exhaustive use cases, with
Parquet shows extremely good results for selective use cases and really
JSON is good to store directly (without any additional effort) data coming
CSV is still quite efficient in time and space, but the schema is frozen and
Serialized Spark RDD is a good solution to store Scala objects that need
[Figure: UC1 results per format (parquet, sro, avro, json, csv) for three configurations (EM,NE,EC): (2G, 4, 2), (2G, 2, 2), (2G, 2, 1); axis range 50 to 500]
[Figure: UC2 results per format (parquet, sro, avro, json, csv) for three configurations (EM,NE,EC): (2G, 4, 2), (2G, 2, 2), (2G, 2, 1); axis range 50 to 250]