compressing coldbox data

Compressing Coldbox Data Ivan K. Furic, Remington Gerras University of Florida



  1. Compressing Coldbox Data Ivan K. Furic, Remington Gerras University of Florida

  2. ProtoDUNE-SP TDR: • Lossless compression factor = 4 • Implies a reduction from 12 bits to 3 bits per ADC readout • The rest of this talk quotes average bits per ADC readout rather than compression factors • Hence, keep in mind: • “3 bits” = TDR spec (compression factor 4) • “4 bits” = compression factor 3 • “6 bits” = compression factor 2

  3. How well does a generic algorithm work? • ROOT’s native compression for 10 events, 1536 channels • 10k ADC readouts per channel per event, 2 bytes per ADC readout • Compressed: avg 5.73 bits per ADC readout [effective compression factor 2.1, half of the TDR spec]
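
The conversion between compression factor and average bits per ADC readout used on these slides is simply a ratio against the 12-bit raw word. A minimal sketch of that arithmetic follows; the function names are illustrative and not taken from the talk.

```python
RAW_BITS = 12  # raw ADC word width quoted on the slides


def bits_per_readout(compression_factor: float) -> float:
    """Average bits per ADC readout implied by a compression factor."""
    return RAW_BITS / compression_factor


def compression_factor(avg_bits: float) -> float:
    """Compression factor implied by an average number of bits per readout."""
    return RAW_BITS / avg_bits


for factor in (4, 3, 2):
    print(f"factor {factor} -> {bits_per_readout(factor):.0f} bits per ADC readout")
print(f"5.73 bits -> factor {compression_factor(5.73):.1f}")
```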

  4. Using “gzip -9” explicitly • Store data for a single channel in a file, compress • Performance depends on how the bits are packed in the file • Convention in figures below: 12 bits = 3 nibbles: H,M,L
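
To illustrate why the byte layout matters, here is a rough sketch that compresses one synthetic channel in two different packings. Everything in it is an assumption made for illustration: the waveform is Gaussian noise around a made-up pedestal, the two packings are hypothetical and not necessarily the ones shown in the talk's figures, and `zlib.compress` at level 9 stands in for `gzip -9` (same DEFLATE algorithm, without the gzip file header).

```python
import random
import struct
import zlib


def synthetic_waveform(n=10_000, pedestal=2000, rms=7.0, seed=1):
    """Hypothetical stand-in for one channel: Gaussian noise, clipped to 12 bits."""
    rng = random.Random(seed)
    return [max(0, min(4095, round(rng.gauss(pedestal, rms)))) for _ in range(n)]


def pack_words16(adc):
    """Packing A: each 12-bit value stored in a 16-bit little-endian word."""
    return struct.pack(f"<{len(adc)}H", *adc)


def pack_nibble_planes(adc):
    """Packing B: all H nibbles first, then all M, then all L, two per byte."""
    nibbles = [(v >> shift) & 0xF for shift in (8, 4, 0) for v in adc]
    if len(nibbles) % 2:
        nibbles.append(0)  # pad to a whole number of bytes
    return bytes((nibbles[i] << 4) | nibbles[i + 1] for i in range(0, len(nibbles), 2))


def compressed_bits_per_readout(packed, n_readouts):
    """Average compressed size in bits per ADC readout (DEFLATE level 9)."""
    return 8 * len(zlib.compress(packed, 9)) / n_readouts


adc = synthetic_waveform()
for name, packer in (("16-bit words", pack_words16), ("nibble planes", pack_nibble_planes)):
    bits = compressed_bits_per_readout(packer(adc), len(adc))
    print(f"{name}: {bits:.2f} bits per ADC readout")
```

The numbers printed depend entirely on the synthetic input, so they should not be compared with the measurements quoted in the talk.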

  5. What RMS will compress into 3 bits? • Consider the “ideal” case for compression: a uniform distribution of values • A uniform distribution across D consecutive discrete values has an RMS of $\sigma = D/\sqrt{12}$, so $D = \sigma\sqrt{12}$ is the width of the flat distribution needed for a given $\sigma$ • To encode D discrete values, one requires $\log_2 D$ bits: $N_{\text{bits}} = \log_2 D = \log_2(\sigma\sqrt{12}) = \log_2\sigma + \tfrac{1}{2}\log_2 12 = \log_2\sigma + 1.8$ • In order to encode into 3 bits of data, the RMS of the distribution can’t be more than 2.3 ADC counts • Observed pedestal RMS’s are 6-8 ADC counts • Encoding raw values will not provide desired compression
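
The flat-distribution estimate on this slide can be checked numerically. The sketch below evaluates the relation between RMS and bits in both directions; the function names are illustrative only.

```python
import math


def bits_for_uniform_rms(sigma: float) -> float:
    """Bits needed to encode a flat distribution with RMS sigma: log2(sigma * sqrt(12))."""
    return math.log2(sigma * math.sqrt(12))


def max_rms_for_bits(n_bits: float) -> float:
    """Largest RMS of a flat distribution that fits in n_bits: 2**n_bits / sqrt(12)."""
    return 2 ** n_bits / math.sqrt(12)


print(f"max RMS for 3 bits: {max_rms_for_bits(3):.2f} ADC counts")
for sigma in (6, 8):
    print(f"RMS {sigma} -> {bits_for_uniform_rms(sigma):.2f} bits")
```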

  6. Information Theory limits on compression • For a stochastic noiseless source emitting a set of symbols with frequencies p_i, the number of bits per symbol is the (Shannon) entropy: $H = -\sum_i p_i \log_2 p_i$ • Shannon, Claude E. (July–October 1948). "A Mathematical Theory of Communication". Bell System Technical Journal. 27 (3): 379–423.
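
As a small illustration of the entropy bound, the following sketch estimates the Shannon entropy of a sequence from its observed symbol frequencies. The helper name is hypothetical; it is not code from the talk.

```python
import math
from collections import Counter


def shannon_entropy_bits(symbols) -> float:
    """Entropy in bits per symbol, using observed frequencies as the p_i."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Four equally likely symbols need 2 bits each; a constant stream needs none.
print(shannon_entropy_bits([0, 1, 2, 3]))
print(shannon_entropy_bits([7, 7, 7, 7]))
```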

  7. Gaussian distributed discrete random values • Huffman compression achieves Shannon entropy level of performance • Need RMS of 2 bins to compress into 3 bits • RMS of 4 bins should compress into 4 bits • RMS’s of 6-8 bins should compress into 4.6-5.0 bits
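
The bits-versus-RMS figures quoted for Gaussian noise can be reproduced approximately by computing the entropy of a Gaussian binned into integer ADC counts. This is a sketch under that assumption (unit-width bins centred on integers, zero mean); for comparison it also evaluates the continuous approximation $\log_2(\sigma\sqrt{2\pi e})$, which gives about 3.05 bits at $\sigma = 2$.

```python
import math


def gaussian_cdf(x: float, sigma: float) -> float:
    """Cumulative distribution of a zero-mean Gaussian with width sigma."""
    return 0.5 * (1.0 + math.erf(x / (sigma * math.sqrt(2.0))))


def binned_gaussian_entropy_bits(sigma: float) -> float:
    """Shannon entropy of a Gaussian integrated over unit-width integer bins."""
    half_range = math.ceil(10 * sigma)  # tails beyond this contribute negligibly
    entropy = 0.0
    for k in range(-half_range, half_range + 1):
        p = gaussian_cdf(k + 0.5, sigma) - gaussian_cdf(k - 0.5, sigma)
        if p > 0.0:
            entropy -= p * math.log2(p)
    return entropy


def continuous_approximation_bits(sigma: float) -> float:
    """Differential entropy of a Gaussian, in bits, for unit bin width."""
    return math.log2(sigma * math.sqrt(2 * math.pi * math.e))


for sigma in (2, 4, 6, 8):
    print(f"RMS {sigma}: binned {binned_gaussian_entropy_bits(sigma):.2f} bits, "
          f"approximation {continuous_approximation_bits(sigma):.2f} bits")
```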

  8. Variable Distributions, Run #1287 • Consider three variables as targets to encode using a compression algorithm: • Raw ADC counts: $X_n$ • Difference wrt previous count: $X_n - X_{n-1}$ • Difference wrt linear prediction (based on previous two counts): $X_n - 2X_{n-1} + X_{n-2}$
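
The two derived target variables are simple to compute from a waveform. A minimal sketch, with illustrative function names, assuming the waveform is a plain list of ADC counts:

```python
def first_difference(x):
    """X_n - X_{n-1}: difference with respect to the previous count."""
    return [x[n] - x[n - 1] for n in range(1, len(x))]


def linear_prediction_residual(x):
    """X_n - 2*X_{n-1} + X_{n-2}: difference with respect to a straight-line
    extrapolation from the previous two counts."""
    return [x[n] - 2 * x[n - 1] + x[n - 2] for n in range(2, len(x))]


ramp = [10, 12, 14, 16]  # a perfectly linear waveform
print(first_difference(ramp))            # constant slope
print(linear_prediction_residual(ramp))  # linear prediction is exact here
```

On a perfectly linear ramp the linear-prediction residual vanishes, which is the sense in which this variable removes slow drifts.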

  9. Variable Distribution RMS’s [figure panels: Linear prediction, Difference, Raw ADC Counts]

  10. Truncated Huffman compression • Raw ADC counts: tree encodes values seen in event • For target variables, expect most values are in the range [-16,16] • Huffman-encode only this window • RAW + target: have additional (13-14 bit) Huffman code for “value outside range”, followed by full 12-bit value • 25 bit penalty for each value outside the window (escape code plus 12-bit value) • Compression performance will therefore be worse than the Shannon entropy
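
The truncated scheme can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the target variable is the first difference, every window value gets a floor count of 1 so that it always has a code, the first sample of a waveform is sent through the escape path, and bits are kept as a Python string for readability. All names (`ESC`, `WINDOW`, the function names) are hypothetical.

```python
import heapq
from collections import Counter
from itertools import count

ESC = "ESC"              # hypothetical symbol for "value outside range"
WINDOW = range(-16, 17)  # target values that get their own Huffman code


def huffman_code(freqs):
    """Return {symbol: bitstring} for a frequency table with at least two symbols."""
    tiebreak = count()  # unique counter so the heap never compares dicts
    heap = [(f, next(tiebreak), {s: ""}) for s, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f0, _, c0 = heapq.heappop(heap)
        f1, _, c1 = heapq.heappop(heap)
        merged = {s: "0" + b for s, b in c0.items()}
        merged.update({s: "1" + b for s, b in c1.items()})
        heapq.heappush(heap, (f0 + f1, next(tiebreak), merged))
    return heap[0][2]


def build_truncated_code(targets):
    """Huffman code over the window plus one escape symbol."""
    freqs = Counter({s: 1 for s in WINDOW})  # floor of 1 per window value
    freqs[ESC] = 1
    for t in targets:
        freqs[t if t in WINDOW else ESC] += 1
    return huffman_code(freqs)


def encode_first_differences(raw, code):
    """Encode a waveform; out-of-window samples cost escape code + 12 raw bits."""
    out, prev = [], None
    for r in raw:
        t = None if prev is None else r - prev
        if t is not None and t in WINDOW:
            out.append(code[t])
        else:
            out.append(code[ESC] + format(r, "012b"))
        prev = r
    return "".join(out)


def decode_first_differences(bits, code):
    """Invert encode_first_differences."""
    inverse = {b: s for s, b in code.items()}
    raw, i, word = [], 0, ""
    while i < len(bits):
        word += bits[i]
        i += 1
        if word in inverse:
            sym = inverse[word]
            if sym == ESC:
                raw.append(int(bits[i:i + 12], 2))
                i += 12
            else:
                raw.append(raw[-1] + sym)
            word = ""
    return raw


raw = [2000, 2003, 1998, 2001, 2100, 2098, 2099]
diffs = [b - a for a, b in zip(raw, raw[1:])]
code = build_truncated_code(diffs)
bits = encode_first_differences(raw, code)
print(len(bits) / len(raw), "bits per readout on this toy waveform")
print(decode_first_differences(bits, code) == raw)
```

The toy waveform is far too short to say anything about real performance; it only demonstrates that the escape mechanism round-trips.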

  11. Performance on Run #1287 [figure panels: Encode Raw Values, Encode Differences, Encode wrt Linear Prediction; distributions of avg bits per ADC word observed per channel, per event] • Green = Shannon entropy • Blue = Channel+Event specific Huffman Trees • Red = Use one (random) Huffman Tree for all data • Raw data requires lots of custom Huffman Trees • Encoding diff wrt linear prediction works best (avg less than 4 bits per ADC word)

  12. Performance Loss For Generic Trees [figure panels: Encode Differences, Encode wrt Linear Prediction] • For two target variables, lose fraction of a bit in performance • Linear predictor loss is better contained, i.e. performance more predictable

  13. Raw ADC Value Correlation Factors • Reproduced correlations observed by Tom in run 973 • Data in run 1287 appears to be much less correlated

  14. What’s different between the two runs? [figure panels: Run #973, Run #1287; axis: Raw ADC Channel-Channel Correlation Factor] • Run 1287 has no correlation factors greater than ~10% • Run 973 has a significant tail in the RMS distribution • Possibly due to slow noise in the electronics?
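
The slides do not spell out how the channel-channel correlation factor is defined. Assuming it is the usual Pearson correlation coefficient between two channels' waveforms in the same event, it could be computed as in this sketch (the function name is illustrative):

```python
import math


def correlation_factor(a, b) -> float:
    """Pearson correlation coefficient between two equal-length waveforms."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


channel_a = [2000, 2002, 2004, 2006]
print(correlation_factor(channel_a, [1000, 1001, 1002, 1003]))  # moves together
print(correlation_factor(channel_a, [1003, 1002, 1001, 1000]))  # moves oppositely
```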

  15. Example: Anti-correlation from slow noise • Waveform for first event, channels 1199 vs 1216 • Causes significant increase in RMS, almost 100% uncorrelated

  16. Comparison of variable RMS’s per channel: • Run 973 overall behavior of target variables is “better” than 1287 • Expect run 973 to compress better than run 1287

  17. Compression performance on run 1287 vs 973 • Encoding Difference wrt previous ADC count

  18. Compression Performance, run 1287 vs 973, cont’d • Encoding difference wrt Linear Prediction

  19. Estimated Event Size • ProtoDUNE-SP TDR spec is to compress 230.4 MB of TDC data into 57.6 MB • Run compression test on 10 events, for both runs, record # bits used • Run 1287 conveniently reads out 1536 channels, 1/10th of full protoDUNE-SP • Run 973 has 2304 channels reading out, scale numbers by 1536/2304

      Run Number      Difference,    Difference,    Linear Prediction,   Linear Prediction,   Size wrt
                      Custom Trees   Single Tree    Custom Trees         Single Tree          TDR Spec
      1287            72.5 MB        73.4 MB        71.5 MB              72.2 MB              +25%
      0973 (scaled)   70.3 MB        71.1 MB        70.3 MB              70.4 MB              +22%

      • 25% larger event size than required by TDR spec • ADC readout encoded on avg in 3.75 bits (TDR spec is 3) • Compression factor 3.20 (TDR spec is 4)
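
The event-size figures follow from the numbers quoted earlier in the talk. The sketch below assumes a full event is 15360 channels (10 times the 1536 channels of run 1287) with 10k readouts of 12 bits each, and uses decimal megabytes; the function names are illustrative.

```python
CHANNELS = 10 * 1536   # assumption: full detector = 10 x the channels of run 1287
READOUTS = 10_000      # ADC readouts per channel per event
RAW_BITS = 12          # bits per raw ADC readout
TDR_FACTOR = 4         # compression factor required by the TDR


def raw_event_size_mb() -> float:
    """Uncompressed event size in MB (1 MB = 1e6 bytes)."""
    return CHANNELS * READOUTS * RAW_BITS / 8 / 1e6


def avg_bits(compressed_mb: float) -> float:
    """Average bits per ADC readout for a given compressed event size."""
    return compressed_mb / raw_event_size_mb() * RAW_BITS


def excess_over_spec(compressed_mb: float) -> float:
    """Fractional excess of the compressed size over the TDR target size."""
    return compressed_mb / (raw_event_size_mb() / TDR_FACTOR) - 1.0


print(f"raw event: {raw_event_size_mb():.1f} MB, "
      f"TDR target: {raw_event_size_mb() / TDR_FACTOR:.1f} MB")
for label, size_mb in (("run 1287", 72.2), ("run 973 (scaled)", 70.4)):
    print(f"{label}: {avg_bits(size_mb):.2f} bits per readout, "
          f"{100 * excess_over_spec(size_mb):+.0f}% wrt TDR spec")
```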

  20. Conclusions, so far • Evaluated compression performance on coldbox data • Found two good candidate variables for encoding • Evaluated encoding with “truncated” Huffman compression • Found approach to be generic and robust • ~1% penalty for sub-optimal encoding tree, even across events • Expect similar performance for hard-coded common tree for all channels, all events (simplifies firmware implementation) • No performance loss in presence of “slow” noise • Estimate compressed event size to be 25% larger than TDR spec • No significant channel noise cross-correlation observed (in run #1287) • Likely not much to gain from combining information across channels • Found promising correlations with ADC counts earlier in the stream (further reduce avg RMS by 10%, i.e. 5% better compression)

  21. Plans • Check cross-channel correlation between encoding variables • Re-check gzip performance on larger sample of events • Attempt to utilize information from earlier in the stream to further shrink target variable RMS • Choose single, hardcoded compression tree • Optimize decompression algorithm for speed, report performance • Study per-event compression performance on larger sample (e.g. entire run 1287) • Try “gzip -9” on compressed output • Any other tests? • Report back with final findings, document
