LZ4, BulkIO, and offset removal performance

  1. LZ4, BulkIO, and offset removal performance
     Jim Pivarski, Princeton University – DIANA
     October 11, 2017

  4. Motivation for this study
     Three updates to ROOT I/O are aimed at speeding up or reducing file size for end-user analysis:
     - new compression algorithm: LZ4 (speed)
     - reading TBasket data directly into arrays: BulkIO (speed)
     - removing offset data from TBranches that have a counter (size)
     Focus on CMS NanoAOD in particular because
     - it is aimed at end-users (1–2 kB/event)
     - it is broadly intended for 30–50% of analyses (not an individual user’s ntuple)
     Also including studies of LHCb (thanks, Oksana!). No ATLAS files because I can’t generate new ones or TTree::CopyTree old ones.

  5. Parameters of the NanoAOD studies
     - AWS instance with a fast SSD disk (i2.xlarge).
     - No resource contention because I paid for exclusive access.
     - “Writing” means a TTree::CopyTree with new TFile compression.
     - “Reading” means filling a class made by MakeClass.
     - “BulkIO” means filling arrays through GetEntriesSerialized.
     - Always reading from warmed cache.
     - Five repeated trials; standard deviations are small compared to trends.
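The measurement protocol above (warm up first, then five timed trials, then compare the spread to the trend) can be sketched in plain Python. This is only an illustration: `time_trials` and `example_task` are hypothetical names, the workload is a stand-in for the ROOT reads and writes that the study timed, and an untimed warm-up call only approximates a warmed disk cache.

```python
import statistics
import time


def time_trials(task, n_trials=5, warmup=1):
    """Run `task` untimed `warmup` times, then time `n_trials` runs.

    Returns (mean_seconds, stdev_seconds). The warm-up runs stand in for
    the "warmed cache" condition; they do not control the OS page cache.
    """
    for _ in range(warmup):
        task()
    samples = []
    for _ in range(n_trials):
        start = time.perf_counter()
        task()
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples), statistics.stdev(samples)


def example_task():
    # Hypothetical stand-in workload; the study timed ROOT I/O instead.
    return sum(i * i for i in range(100_000))


mean_s, stdev_s = time_trials(example_task)
print(f"{mean_s:.4f} s +/- {stdev_s:.4f} s over 5 trials")
```

Reporting the standard deviation next to the mean is what makes it possible to say whether a difference between two configurations is larger than the trial-to-trial noise.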

  6. LZ4 doesn’t compress as well as ZLIB, LZMA

  7. ... same for LHCb

  8. But it’s faster: levels 1–3 are as fast as writing uncompressed

  9. ... same for LHCb

  10. More importantly: reading is as fast as uncompressed

  11. And BulkIO reading is super-fast: serious penalty for LZMA

  12. Speed vs. size trade-offs
      (plot panels: write speed vs. size; read speed vs. size; BulkIO speed vs. size)
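A rough way to see a speed vs. size trade-off without ROOT is to compress the same buffer with several codecs and record the output size and elapsed time for each. The sketch below uses only Python's standard library, so it covers ZLIB and LZMA but not LZ4, which would need a third-party binding. The payload is synthetic and hypothetical (not NanoAOD or LHCb data), and the chosen levels are arbitrary, so the numbers it prints say nothing about the results in the plots.

```python
import lzma
import random
import time
import zlib

rng = random.Random(12345)
# Hypothetical payload: 200,000 bytes drawn from 8 symbols, so it is
# compressible. This is not physics data.
payload = bytes(rng.choice(b"abcdefgh") for _ in range(200_000))


def measure(name, compress):
    """Return (name, compressed size in bytes, seconds to compress)."""
    start = time.perf_counter()
    blob = compress(payload)
    elapsed = time.perf_counter() - start
    return name, len(blob), elapsed


results = [
    measure("zlib level 1", lambda b: zlib.compress(b, 1)),
    measure("zlib level 9", lambda b: zlib.compress(b, 9)),
    measure("lzma preset 6", lambda b: lzma.compress(b, preset=6)),
]

for name, size, elapsed in results:
    ratio = len(payload) / size
    print(f"{name:14s} {size:8d} bytes  ratio {ratio:5.2f}  {elapsed:.4f} s")
```

Plotting elapsed time against compressed size for each row gives one point per codec and level, which is the shape of comparison the three panels make for writing, reading, and BulkIO.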

  13. Removing unnecessary offsets
      TBranches for variable-sized data contain offsets indicating where each entry starts.
      - This is unnecessary for branches with counters (e.g. "Muon.pt[nMuons]/F").
      - A fix is in progress (PR #1003) to optionally not write these offsets.
      - May also write counts, instead of offsets, since repeated values might be more compressible.
      My study pre-dated (and inspired) this PR; I constructed a copy of NanoAOD without offsets by putting all muon data into a flat TTree, all jet data into a flat TTree, etc.
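Why the offsets are redundant when a counter exists: the offsets are just the running sum of the per-event counts, so either one can be rebuilt from the other. The sketch below illustrates this with plain Python lists. The values are made up, the function names are hypothetical, and the offsets here are element indices into a flattened array, not ROOT's on-disk byte offsets.

```python
from itertools import accumulate


def offsets_from_counts(counts):
    """Start index of each entry, plus one final end index."""
    return [0, *accumulate(counts)]


def counts_from_offsets(offsets):
    """Invert offsets_from_counts: differences of neighbouring offsets."""
    return [stop - start for start, stop in zip(offsets, offsets[1:])]


# Hypothetical counter branch (muons per event) and flattened muon pT values.
n_muons = [2, 0, 3, 1]
muon_pt = [41.2, 17.5, 55.0, 23.1, 9.8, 30.4]

offsets = offsets_from_counts(n_muons)
# Slice the flat array back into one list per event using the offsets.
per_event = [muon_pt[a:b] for a, b in zip(offsets, offsets[1:])]

print(offsets)    # [0, 2, 2, 5, 6]
print(per_event)  # [[41.2, 17.5], [], [55.0, 23.1, 9.8], [30.4]]
```

Since the counter branch already has to be stored, writing the offsets as well stores the same information twice.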

  14. After compression, this saves 8–18%

  15. And it closes the LZ4/LZMA gap to a factor of 1.5×

  17. Do offsets vs. counts matter? Yes for LZ4.
      Synthetic test: I generated Poisson-random counts and integrated them to make offsets, then compressed both with ZLIB and LZ4.
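A minimal re-creation of the ZLIB half of this synthetic test is sketched below, using only Python's standard library (LZ4 is not in the standard library, so that half is left out). The Poisson mean of 3, the 50,000 entries, the 32-bit little-endian packing, the ZLIB level, and the seed are all assumptions chosen for illustration; the slides do not state the parameters actually used.

```python
import math
import random
import struct
import zlib
from itertools import accumulate


def poisson(rng, mean):
    """Draw one Poisson sample (Knuth's multiplication method, small means)."""
    limit = math.exp(-mean)
    k, product = 0, rng.random()
    while product > limit:
        k += 1
        product *= rng.random()
    return k


def pack_int32(values):
    """Serialize integers as little-endian 32-bit values."""
    return struct.pack(f"<{len(values)}i", *values)


rng = random.Random(2017)                      # assumed seed, for repeatability
counts = [poisson(rng, 3.0) for _ in range(50_000)]
offsets = list(accumulate(counts))             # "integrate" counts into offsets

raw_size = len(pack_int32(counts))             # same raw size for both arrays
counts_size = len(zlib.compress(pack_int32(counts), 6))
offsets_size = len(zlib.compress(pack_int32(offsets), 6))

print(f"raw:     {raw_size} bytes")
print(f"counts:  {counts_size} bytes after ZLIB")
print(f"offsets: {offsets_size} bytes after ZLIB")
```

The counts repeat a handful of small values, while the offsets increase steadily and rarely repeat, which is why a dictionary-based compressor has more to work with in the counts.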

  18. Conclusions
      - LZ4 is as fast as uncompressed data for traditional GetEntry jobs.
      - BulkIO is an order of magnitude faster than GetEntry, especially with LZ4.
      - Unnecessary offsets add ~10% to file size; may be removed.
      - Counts compress better than offsets, especially for LZ4.
