Data-loading (for ML applications) using TDFs Stefan Wunsch



Data-loading (for ML applications) using TDFs
Stefan Wunsch
stefan.wunsch@cern.ch
2018-02-22

Motivation
◮ Most of the data analysis in high-level HEP analyses happens in the Python domain (frameworks of analysis groups on top of flat ntuples).
◮ Even more extreme for ML applications: most frameworks are only usable from Python (Keras, xgboost, most of TensorFlow, PyTorch, ...).
◮ What data-loading often looks like (for ML applications) in HEP:
    ...
    >>> x = root_pandas.read_root("file.root", "tree").as_matrix()
    >>> print(x.shape)
    (number_of_entries, number_of_branches)
    >>> model.fit(x, ...)
    ...
◮ Most efficient solution today: root_numpy (used by root_pandas)
◮ But ROOT has the means to do this more efficiently.

Random slide from an MVA-based analysis

Feature request
◮ Support taking data from ROOT files and putting it into memory (as fast as possible)
◮ Memory layout of the output: contiguous, interpretable as n-dimensional arrays
◮ Make the data accessible from Python, with the memory interpreted as a numpy array
Interface proposal using TDataFrame:
    >>> tdf = ROOT.Experimental.TDataFrame("tree", "file.root")
    >>> tdf = tdf.Filter("var1>0").Define("new_var", "var1*var2")
    >>> x = tdf.AsMatrix(["var1", "var2", "new_var"])
    >>> print(x.shape)
    (number_of_entries, 3)

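
The requested semantics (select entries, define a new column, return one contiguous block that can be read as a 2-dimensional array) can be sketched without ROOT. The following is only a pure-Python stand-in using the standard library: the `entries` list is a hypothetical replacement for the tree, and `memoryview` plays the role that a numpy array would play in the proposal. It is not the proposed `AsMatrix` implementation.

```python
from array import array

# Hypothetical stand-in for a flat ntuple: one dict per entry.
entries = [
    {"var1": 1.0, "var2": 2.0},
    {"var1": -3.0, "var2": 4.0},
    {"var1": 0.5, "var2": 8.0},
]
columns = ["var1", "var2", "new_var"]

# Stand-in for Filter("var1>0") followed by Define("new_var", "var1*var2").
selected = [
    dict(e, new_var=e["var1"] * e["var2"])
    for e in entries
    if e["var1"] > 0
]

# One contiguous, row-major buffer of 32-bit floats.
flat = array("f", (row[c] for row in selected for c in columns))

# Reinterpret the same memory as a 2-dimensional array without copying.
x = memoryview(flat).cast("B").cast("f", shape=[len(selected), len(columns)])

print(x.shape)     # (2, 3)
print(x.tolist())  # [[1.0, 2.0, 2.0], [0.5, 8.0, 4.0]]
```

The point of the sketch is the last step: the output is a single buffer plus shape information, so no per-entry Python objects need to be kept around once the buffer is filled.
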
Advantages compared to the root_numpy approach
◮ Useful set of TDF features directly usable
    ◮ Efficient selection of data (Filter)
    ◮ Define new variables (Define)
    ◮ Other fancy operations (ForEach)
    ◮ ...
◮ Size of input files not limited by memory
◮ Make use of implicit multi-threading → Gain of a factor of N in speedup (ideally)

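
One way to picture the multi-threaded case is to split the entry range into N chunks, with each thread writing into its own disjoint slice of a pre-allocated output buffer. The sketch below illustrates only that partitioning idea with the standard library. `read_entry` is a hypothetical stand-in for reading one entry from disk, and nothing here is taken from the TDF implementation. Pure-Python threads also do not give the ideal factor-N speedup because of the interpreter lock; the example shows the data layout, not the performance.

```python
from array import array
from concurrent.futures import ThreadPoolExecutor

N_ENTRIES = 1000
N_COLUMNS = 4


def read_entry(i):
    """Hypothetical stand-in for reading entry i from disk."""
    return [float(i), float(2 * i), float(3 * i), float(4 * i)]


def fill_range(out, start, stop):
    """Write entries [start, stop) into their own slice of the shared buffer."""
    for i in range(start, stop):
        out[i * N_COLUMNS:(i + 1) * N_COLUMNS] = array("f", read_entry(i))


def load_parallel(n_threads):
    """Fill one contiguous float buffer using n_threads disjoint entry ranges."""
    out = array("f", [0.0]) * (N_ENTRIES * N_COLUMNS)
    bounds = [N_ENTRIES * t // n_threads for t in range(n_threads + 1)]
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        futures = [
            pool.submit(fill_range, out, lo, hi)
            for lo, hi in zip(bounds[:-1], bounds[1:])
        ]
        for future in futures:
            future.result()  # re-raise any exception from a worker
    return out


data = load_parallel(4)
```

Because every chunk writes to a fixed, non-overlapping slice, the result does not depend on the number of threads or on the order in which they finish.
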
First benchmarks (1)
Loading 709MB of data from disk to memory. Array of random floats with shape (50000000, 4).
[Figure: elapsed time in seconds (y-axis, 7 to 12) versus number of threads (x-axis, 1 to 4) for root_numpy and TDataFrame.]
Measured on a machine with 2 physical (4 logical) cores.

First benchmarks (2)
Performance subject to input data size and number of threads.
[Figure: elapsed time in seconds (y-axis, 10 to 50) versus size of data (x-axis, ticks 0.7 to 2.8, labeled MB) for TDF with 1, 2, 3 and 4 threads and for root_numpy.]
Measured on a machine with 2 physical (4 logical) cores.

First benchmarks (3)
Loading 2.8GB of data from disk to memory.
[Figure: elapsed time in seconds (y-axis, 20 to 60) versus number of threads (x-axis, 0 to 20).]
Measured on a machine with 24 physical (48 logical) cores.

What is missing to do this properly?
◮ Proposal for a matching interface in C++ (Container for returned data?)
◮ Proper PyROOT handling of numpy arrays
    ◮ Input argument handling: Interpreted as float*, shape information is lost
    ◮ Return value handling: Not supported (?)
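
Why losing the shape matters can be shown with a bare `float*` in `ctypes`. This is a minimal illustration with the standard library only, not a description of how PyROOT passes arguments. The helper `element` is hypothetical: once a buffer is reduced to a pointer, the caller has to carry the shape separately, and the same memory reads differently under a different shape.

```python
import ctypes

rows, cols = 2, 3

# Contiguous buffer of six 32-bit floats, logically a 2 x 3 array.
buf = (ctypes.c_float * (rows * cols))(0.0, 1.0, 2.0, 3.0, 4.0, 5.0)

# Reducing the buffer to a float* drops the length and the shape.
ptr = ctypes.cast(buf, ctypes.POINTER(ctypes.c_float))


def element(pointer, shape, r, c):
    """Hypothetical helper: row-major lookup through a bare float*."""
    n_rows, n_cols = shape
    if not (0 <= r < n_rows and 0 <= c < n_cols):
        raise IndexError("index out of range for the given shape")
    return pointer[r * n_cols + c]


as_2x3 = element(ptr, (2, 3), 1, 2)  # last element of the 2 x 3 view
as_3x2 = element(ptr, (3, 2), 1, 0)  # same memory read as 3 x 2
```
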
