Data-loading (for ML applications) using TDFs Stefan Wunsch - - PowerPoint PPT Presentation



SLIDE 1

Data-loading (for ML applications) using TDFs

Stefan Wunsch stefan.wunsch@cern.ch 2018-02-22

SLIDE 2

Motivation

◮ Most of the data analysis in high-level HEP analyses happens in the Python domain (frameworks of analysis groups on top of flat ntuples).

◮ Even more extreme for ML applications: Most frameworks are only usable from Python (Keras, xgboost, most of TensorFlow, PyTorch, ...)

◮ What data-loading often looks like (for ML applications) in HEP:

    ...
    >>> x = root_pandas.read_root("file.root", "tree").as_matrix()
    >>> print(x.shape)
    (number_of_entries, number_of_branches)
    >>> model.fit(x, ...)
    ...

◮ Most efficient solution today: root_numpy (used by root_pandas)

◮ But ROOT has the means to do this more efficiently.

SLIDE 3

Random slide from an MVA-based analysis

SLIDE 4

Feature request

◮ Support taking data from ROOT files and putting it into memory (as fast as possible)

◮ Memory layout of the output: Contiguous, interpretable as n-dimensional arrays

◮ Make the data accessible from Python, interpretation of memory as numpy array
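The requested memory layout can be illustrated without ROOT. The following is a minimal sketch, assuming a row-major buffer of 32-bit floats (the slides do not state the element type); the buffer here is a hypothetical stand-in built with Python's `array` module rather than anything read from a ROOT file.

```python
import array

import numpy as np

n_entries, n_columns = 4, 3

# Hypothetical stand-in for data read into memory: one flat, contiguous
# buffer of 32-bit floats stored row by row (entry after entry).
flat = array.array("f", range(n_entries * n_columns))

# Interpret the same memory as a 2-D numpy array. No data is copied.
x = np.frombuffer(flat, dtype=np.float32).reshape(n_entries, n_columns)
print(x.shape)

# Writing to the buffer is visible through the array, which shows that
# both names refer to the same memory.
flat[0] = 42.0
print(x[0, 0])
```

Because the array is only a view, the cost of handing the data to Python is independent of its size; the work lies entirely in filling the buffer.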

Interface proposal using TDataFrame:

    >>> tdf = ROOT.Experimental.TDataFrame("tree", "file.root")
    >>> tdf = tdf.Filter("var1>0").Define("new_var", "var1*var2")
    >>> x = tdf.AsMatrix(["var1", "var2", "new_var"])
    >>> print(x.shape)
    (number_of_entries, 3)
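To make the intended semantics of the proposal concrete, here is a rough NumPy-only emulation of the Filter, Define and AsMatrix steps. It does not use ROOT: the `as_matrix` helper, the in-memory `columns` dictionary, and the choice of float32 output are all assumptions made for illustration, not the proposed implementation.

```python
import numpy as np


def as_matrix(columns, names):
    """Stack the named 1-D columns into one contiguous (n_entries, n_columns) array.

    Hypothetical helper that mimics the shape of the proposed AsMatrix result.
    """
    stacked = np.stack([columns[name] for name in names], axis=1)
    return np.ascontiguousarray(stacked, dtype=np.float32)


# Hypothetical stand-in for the branches of a flat ntuple.
columns = {
    "var1": np.array([1.0, -2.0, 3.0, 0.5]),
    "var2": np.array([2.0, 5.0, 4.0, 8.0]),
}

# Equivalent of Filter("var1>0"): keep only the entries passing the selection.
mask = columns["var1"] > 0
selected = {name: values[mask] for name, values in columns.items()}

# Equivalent of Define("new_var", "var1*var2").
selected["new_var"] = selected["var1"] * selected["var2"]

x = as_matrix(selected, ["var1", "var2", "new_var"])
print(x.shape)
```

Note that the number of rows is the number of entries passing the filter, so the output size is only known after the selection has run.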

SLIDE 5

Advantages compared to root_numpy approach

◮ Useful set of TDF features directly usable

  ◮ Efficient selection of data (Filter)
  ◮ Define new variables (Define)
  ◮ Other fancy operations (ForEach)
  ◮ ...

◮ Size of input files not limited by memory

◮ Make use of implicit multi-threading

→ Gain of a factor of N in speedup (ideally)
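The "ideally" in the speedup claim can be quantified with Amdahl's law, which is not part of the slides and is brought in here only as an illustration. The sketch below assumes that some fraction of the loading work parallelises perfectly and the rest stays serial; the fraction 0.5 is a made-up value, not a measurement.

```python
def amdahl_speedup(n_threads, parallel_fraction):
    """Speedup predicted by Amdahl's law for a given parallelisable fraction."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_threads)


# Fully parallel workload: the ideal factor-of-N speedup.
ideal = amdahl_speedup(4, 1.0)

# Hypothetical workload where only half of the time is parallelisable.
limited = amdahl_speedup(4, 0.5)

print(ideal, limited)
```

Any serial part of the loading (for example, a final merge into one contiguous array) would pull the achievable speedup below N.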

SLIDE 6

First benchmarks (1)

[Figure: elapsed time in seconds (axis ticks 7 to 12) versus number of threads (1 to 4), comparing root_numpy and TDataFrame.]

Loading 709MB of data from disk to memory. Array of random floats with shape (50000000, 4).

Measured on a machine with 2 physical (4 logical) cores.
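As a back-of-the-envelope check of the array size quoted above, the in-memory footprint can be computed from the shape. This sketch assumes 32-bit floats; the slides only say "random floats", so the element type is a guess, and the in-memory size need not equal the on-disk file size.

```python
import numpy as np

shape = (50_000_000, 4)

# Assumed element type: 32-bit float (not stated in the slides).
itemsize = np.dtype(np.float32).itemsize

# Total bytes needed to hold the array contiguously in memory.
nbytes = shape[0] * shape[1] * itemsize
print(nbytes, "bytes")
print(nbytes / 1e6, "MB")
```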

SLIDE 7

First benchmarks (2)

[Figure: elapsed time in seconds (axis ticks 10 to 50) versus size of data in MB (axis ticks 0.7 to 2.8), with curves for TDF with 1, 2, 3 and 4 threads and for root_numpy.]

Performance subject to input data size and number of threads.

Measured on a machine with 2 physical (4 logical) cores.

SLIDE 8

First benchmarks (3)

[Figure: elapsed time in seconds (axis ticks 20 to 60) versus number of threads (axis ticks 5 to 20).]

Loading 2.8GB of data from disk to memory.

Measured on a machine with 24 physical (48 logical) cores.

SLIDE 9

What is missing to do this properly?

◮ Proposal for a matching interface in C++ (Container for returned data?)

◮ Proper PyROOT handling of numpy arrays

  ◮ Input argument handling: Interpreted as float*, shape information is lost
  ◮ Return value handling: Not supported (?)
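The loss of shape information at a `float*` boundary can be reproduced in plain Python. The sketch below uses `ctypes` as a stand-in for the Python/C++ boundary; it is not how PyROOT works internally, only an illustration of why the shape has to travel alongside the pointer.

```python
import ctypes

import numpy as np

x = np.arange(6, dtype=np.float32).reshape(2, 3)

# Viewing the array as a bare float* keeps the address but drops the shape:
# the pointer can be indexed, yet it does not know how many elements follow.
ptr = x.ctypes.data_as(ctypes.POINTER(ctypes.c_float))
first = ptr[0]

# To get an n-dimensional array back, the shape must be supplied separately.
restored = np.ctypeslib.as_array(ptr, shape=x.shape)

print(first, restored.shape, np.shares_memory(x, restored))
```

The restored array is a view on the original memory, so a matching C++ interface would need to return both a pointer and the shape (or a container carrying both) for the data to be usable as a numpy array without copying.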