Performance and Insights on File Formats 2.0, Luca Menichetti, Vag Motesnitsalis (PowerPoint presentation)


SLIDE 1

Performance and Insights on File Formats 2.0

Luca Menichetti, Vag Motesnitsalis

SLIDE 2

Design and Expectations

2 Use Cases:

 Exhaustive (operation using all values of a record)
 Selective (operation using limited values of a record)

5 Data Formats:

 CSV, Parquet, serialized RDD objects, JSON, Apache Avro

The tests gave insights on specific advantages and disadvantages of each format, as well as on their time and space performance.
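The two use cases can be made concrete with a toy sketch. This is a hedged illustration, not the deck's actual jobs (which were Scala/Spark): the records and field names below are invented. The selective job touches a single field per record, which is exactly the access pattern a columnar format can exploit.

```python
# Invented monitoring-style records; the real schemas are not shown in the deck.
records = [
    {"site": "CERN", "jobs": 10 + i, "cpu_hours": 5.0 * i}
    for i in range(100)
]

# Exhaustive (UC1-style): every field of every record is used.
exhaustive = sum(
    r["cpu_hours"] for r in records if r["site"] == "CERN" and r["jobs"] > 0
)

# Selective (UC2-style): only one field per record is needed, so a
# columnar format could avoid reading the other columns entirely.
selective = sum(r["jobs"] for r in records)

print(exhaustive, selective)
```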


SLIDE 3

Experiment descriptions

 For the “exhaustive” use case (UC1) we used EOS logs “processed” data. The current default data format is CSV.

 For the “selective” use case (UC2) we used experiment Job Monitoring data from the Dashboard. The current default data format is JSON.

 For each use case, all formats were generated a priori (from the default format) and then the tests were executed.

 Technology: Spark (Scala) with the SparkSQL library.

 No tests were performed with compression.
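The "generated a priori from the default format" step amounts to a format conversion. A minimal Python sketch of the idea (the real conversions were done with Spark/Scala, and the CSV content below is an invented stand-in for the EOS-log default format):

```python
import csv
import json

# Invented stand-in for the default-format input; real EOS logs differ.
csv_text = "ts,path,bytes\n1,/eos/file-a,100\n2,/eos/file-b,200\n"

# Parse the default CSV, then re-emit each record as one JSON object per line.
rows = list(csv.DictReader(csv_text.splitlines()))
json_lines = "\n".join(json.dumps(r) for r in rows)
print(json_lines)
```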


SLIDE 4

Formats

 CSV – text files, comma-separated values, one record per line
 JSON – text files, JavaScript objects, one per line
 Serialized RDD Objects (SRO) – Spark datasets serialized to text files
 Avro – serialization format with binary encoding
 Parquet – columnar format with binary encoding
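The two text formats above differ mainly in where field names live: once in a CSV header versus repeated in every JSON object. A hedged Python sketch (records invented) showing the size gap this produces:

```python
import csv
import io
import json

# Invented records; the real EOS/Job Monitoring schemas are not shown here.
records = [
    {"user": "u%03d" % i, "bytes_read": 1024 * i, "status": "ok"}
    for i in range(1000)
]

# CSV: field names appear once, in the header line.
csv_buf = io.StringIO()
writer = csv.DictWriter(csv_buf, fieldnames=["user", "bytes_read", "status"])
writer.writeheader()
writer.writerows(records)

# Line-delimited JSON: field names are repeated in every record.
json_buf = "\n".join(json.dumps(r) for r in records)

# JSON pays for the repeated keys, which foreshadows the roughly 2x
# CSV-vs-JSON size gap measured on the space-requirements slide.
print(len(csv_buf.getvalue()), len(json_buf))
```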


SLIDE 5

Space Requirements (in GB)

Format     UC1, EOS logs (GB)   UC2, Job Monitoring (GB)
CSV        12.7 (default)       6.2
JSON       23.4                 13.4 (default)
SRO        19.9                 8.1
Avro       11.7                 6.6
Parquet    6.3                  2.1


SLIDE 6

Spark executions

for i in {1..50}; do
  for format in CSV JSON SRO Avro Parquet; do
    for UC in UC1 UC2; do
      spark-submit \
        --num-executors 2 \
        --executor-cores 2 \
        --executor-memory 2G \
        --class ch.cern.awg.Test$UC$format \
        formats-analyses.jar input-$UC-$format > output-$UC-$format-$i
    done
  done
done

We took the times from all (UC, format) jobs and calculated an average for each type of execution (discarding outliers).

Times include reading and computation (test jobs don't write any file; they just print the result to stdout).
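The deck does not say which rule was used to discard outliers before averaging. A minimal Python sketch of one plausible approach (the original analysis used Scala/Spark; the median-absolute-deviation filter and the run times below are assumptions for illustration):

```python
from statistics import mean, median

def average_without_outliers(times, k=3.0):
    """Average run times after discarding outliers.

    A run counts as an outlier when it lies more than k times the
    median absolute deviation (MAD) away from the median time.
    """
    med = median(times)
    mad = median(abs(t - med) for t in times)
    if mad == 0:  # all runs (essentially) identical: nothing to discard
        return mean(times)
    kept = [t for t in times if abs(t - med) <= k * mad]
    return mean(kept)

# Hypothetical run times (seconds) for one (UC, format) pair;
# the last run hit a slow executor and is dropped as an outlier.
runs = [80.1, 79.8, 81.0, 80.5, 250.3]
print(average_without_outliers(runs))
```

A MAD-based filter is used here rather than a mean/standard-deviation cutoff because a single extreme value inflates the standard deviation enough to hide itself in small samples.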


SLIDE 7

Times: UC1 "Exhaustive"

Format     Size (GB)   AVG time (s)   MIN time (s)
CSV        12.7        80.3           65.41
JSON       23.4        245.9          205
SRO        19.9        155.8          131.6
Avro       11.7        80.7           64.5
Parquet    6.3         108.4          78.5


SLIDE 8

Times: UC2 "Selective"

Format     Size (GB)   AVG time (s)   MIN time (s)
CSV        6.2         42             35.1
JSON       13.4        109.5          83.3
SRO        8.1         74.6           63.6
Avro       6.6         52.6           40.4
Parquet    2.1         23.7           18.4


SLIDE 9

Time Comparison between UC1 and UC2

[Bar chart: average execution times in seconds per format (CSV, JSON, SRO, Avro, Parquet), AVG UC1 vs AVG UC2.]


SLIDE 10

Space and Time Performance Gain/Loss

[compared to current default format]

                             CSV     JSON    SRO     Avro    Parquet
Space UC1 [EOS logs]         =       +84%    +56%    -8%     -51%
Time UC1                     =       +215%   +93%    =       +35%
Space UC2 [Job Monitoring]   -54%    =       -40%    -51%    -84%
Time UC2                     -64%    =       -35%    -54%    -79%

("=" marks the current default format, or a result on par with it.)


SLIDE 11

Pros and Cons

CSV
  Pros: Always supported and easy to use. Efficient.
  Cons: No schema changes allowed. No type definitions. No declaration control.

JSON
  Pros: Encoded in plain text (easy to use). Schema changes allowed.
  Cons: Inefficient. High space consumption. No declaration control.

Serialized RDD Objects
  Pros: Declaration control. Falls between CSV and JSON for space and time. Good for storing aggregated results.
  Cons: Spark only. No compression. Schema changes allowed, but must be implemented manually.

Avro
  Pros: Schema changes allowed. Efficiency comparable to CSV. Compression definition included in the schema.
  Cons: Space consumption similar to CSV (not really a negative). Needs a plugin (we found an incompatibility between our Spark version and the avro library; we had to fix and recompile it).

Parquet
  Pros: Low space consumption (RLE). Extremely efficient for “selective” use cases, with good performance in other cases too.
  Cons: Needs a plugin. Slow to generate.


SLIDE 12

Data Formats - Overview

                            CSV      JSON                     SRO           Avro                 Parquet
Supports schema changes     NO       YES                      YES           YES                  YES
Primitive/complex types     NO       YES (general numerics)   YES           YES                  YES
Declaration control         NO       NO                       YES           YES                  YES
Supports compression        YES      YES                      NO            YES                  YES
Storage consumption         Medium   High                     Medium/High   Medium               Low (RLE)
Supported by                All      All (parsed from text)   Spark only    All (needs plugin)   All (needs plugin)
Can print a sample snippet  YES      YES                      NO            YES (avro-tools)     NO (unofficial tools only)


SLIDE 13

Conclusions

There is no “ultimate” file format but…

 Avro shows promising results for exhaustive use cases, with performance comparable to CSV.

 Parquet shows extremely good results for selective use cases, together with very low space consumption.

 JSON is good for storing directly (without any additional effort) data coming from web-like services that might change their format in the future, but it is too inefficient and consumes too much space.

 CSV is still quite efficient in time and space, but the schema is frozen and validation is left up to the user.

 Serialized Spark RDDs are a good solution for storing Scala objects that need to be reused soon (such as aggregated results to plot, or intermediate results saved for future computation), but they are not advisable as a final format, since they are not a general-purpose format.


SLIDE 14

Thank You


SLIDE 15

Spark UC1 executions


[Bar chart: UC1 execution times in seconds (scale up to 500) per format (csv, json, sro, avro, parquet), for three Spark configurations (EM, NE, EC) = (executor-memory, num-executors, executor-cores): (2G, 4, 2), (2G, 2, 2), (2G, 2, 1).]

SLIDE 16

Spark UC2 executions


[Bar chart: UC2 execution times in seconds (scale up to 250) per format (csv, json, sro, avro, parquet), for three Spark configurations (EM, NE, EC) = (executor-memory, num-executors, executor-cores): (2G, 4, 2), (2G, 2, 2), (2G, 2, 1).]