
Data-loading (for ML applications) using TDFs, by Stefan Wunsch



  1. Data-loading (for ML applications) using TDFs
     Stefan Wunsch (stefan.wunsch@cern.ch), 2018-02-22

  2. Motivation
     ◮ Most data analysis in high-level HEP analyses happens in the Python domain (analysis-group frameworks built on top of flat ntuples).
     ◮ This is even more extreme for ML applications: most frameworks are usable only from Python (Keras, xgboost, most of TensorFlow, PyTorch, ...).
     ◮ What data loading often looks like (for ML applications) in HEP:
         ...
         >>> x = root_pandas.read_root("file.root", "tree").as_matrix()
         >>> print(x.shape)
         (number_of_entries, number_of_branches)
         >>> model.fit(x, ...)
         ...
     ◮ Most efficient solution today: root_numpy (used by root_pandas).
     ◮ But ROOT has the means to do this more efficiently.

  3. Random slide from an MVA-based analysis

  4. Feature request
     ◮ Support taking data from ROOT files and putting it into memory (as fast as possible).
     ◮ Memory layout of the output: contiguous, interpretable as n-dimensional arrays.
     ◮ Make the data accessible from Python, with the memory interpreted as a numpy array.
     Interface proposal using TDataFrame:
         >>> tdf = ROOT.Experimental.TDataFrame("tree", "file.root")
         >>> tdf = tdf.Filter("var1>0").Define("new_var", "var1*var2")
         >>> x = tdf.AsMatrix(["var1", "var2", "new_var"])
         >>> print(x.shape)
         (number_of_entries, 3)
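The layout requirement above can be illustrated without ROOT. The following is a minimal sketch in plain Python and is not the proposed AsMatrix implementation: the helper `as_matrix`, the toy `columns` dictionary, and the row-major layout are all assumptions made for illustration. It applies the same selection (var1 > 0) and derived column (var1*var2) as the proposal, packs the result into one contiguous buffer of 32-bit floats, and then reinterprets that buffer as a 2-dimensional view without copying.

```python
from array import array

# Toy stand-in for the branches of a flat ntuple (hypothetical values).
columns = {
    "var1": [1.0, -2.0, 3.0, 4.0],
    "var2": [10.0, 20.0, 30.0, 40.0],
}


def as_matrix(columns, names):
    """Hypothetical helper, not a ROOT API: select, derive, pack contiguously."""
    flat = array("f")  # one contiguous block of 32-bit floats
    n_rows = 0
    for var1, var2 in zip(columns["var1"], columns["var2"]):
        if not var1 > 0:  # mimics Filter("var1>0")
            continue
        # mimics Define("new_var", "var1*var2")
        row = {"var1": var1, "var2": var2, "new_var": var1 * var2}
        flat.extend(row[name] for name in names)  # row-major packing
        n_rows += 1
    # Reinterpret the same memory as an (n_rows, n_columns) view, no copy.
    # Assumes at least one entry passes the selection.
    view = memoryview(flat).cast("B").cast("f", shape=[n_rows, len(names)])
    return flat, view


flat, x = as_matrix(columns, ["var1", "var2", "new_var"])
print(x.shape)     # (3, 3)
print(x.tolist())  # rows that passed the selection, with the derived column
```

Because the view exposes the buffer protocol together with its shape, a numpy array could wrap the same memory without a copy, which is the kind of interpretation the feature request asks for.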

  5. Advantages compared to the root_numpy approach
     ◮ Useful set of TDF features directly usable:
       ◮ Efficient selection of data (Filter)
       ◮ Definition of new variables (Define)
       ◮ Other fancy operations (ForEach)
       ◮ ...
     ◮ Size of input files not limited by memory
     ◮ Makes use of implicit multi-threading → speedup by a factor of N (ideally)

  6. First benchmarks (1)
     Loading 709MB of data from disk to memory. Array of random floats with shape (50000000, 4).
     [Plot: elapsed time in seconds (axis from 7 to 12) versus number of threads (1 to 4), comparing root_numpy and TDataFrame.]
     Measured on a machine with 2 physical (4 logical) cores.

  7. First benchmarks (2)
     Performance subject to input data size and number of threads.
     [Plot: elapsed time in seconds (axis from 10 to 50) versus size of data (axis labelled "Size of data in MB", ticks from 0.7 to 2.8), with curves for TDF with 1, 2, 3 and 4 threads and for root_numpy.]
     Measured on a machine with 2 physical (4 logical) cores.

  8. First benchmarks (3)
     Loading 2.8GB of data from disk to memory.
     [Plot: elapsed time in seconds (axis from 20 to 60) versus number of threads (axis from 0 to 20).]
     Measured on a machine with 24 physical (48 logical) cores.

  9. What is missing to do this properly?
     ◮ Proposal for a matching interface in C++ (container for the returned data?)
     ◮ Proper PyROOT handling of numpy arrays:
       ◮ Input argument handling: interpreted as float*, shape information is lost
       ◮ Return value handling: not supported (?)
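The lost-shape problem noted above can be reproduced without PyROOT. This is a minimal sketch using only the standard library, with ctypes standing in for the C++ side; that stand-in is an assumption for illustration and not how PyROOT actually passes arrays. Once a buffer is seen as a bare float*, neither its length nor its shape can be recovered from the pointer, so the caller has to carry the shape alongside it. The helper `element` is hypothetical.

```python
import ctypes
from array import array

rows, cols = 2, 3
flat = array("f", [1, 2, 3, 4, 5, 6])  # row-major storage of a 2x3 matrix

# Expose the same memory to ctypes and reduce it to a bare float*.
c_buf = (ctypes.c_float * len(flat)).from_buffer(flat)
ptr = ctypes.cast(c_buf, ctypes.POINTER(ctypes.c_float))

# A bare pointer carries no length, let alone a shape.
try:
    len(ptr)
    pointer_has_length = True
except TypeError:
    pointer_has_length = False


def element(ptr, shape, i, j):
    """Hypothetical helper: index a row-major buffer using an external shape."""
    return ptr[i * shape[1] + j]


# The shape has to be supplied separately to address element (1, 2).
value = element(ptr, (rows, cols), 1, 2)
print(pointer_has_length, value)
```

A container for the returned data, as raised in the first bullet, would be one way to keep the pointer and its shape together.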
