Experience with RDataFrame: Spotlight on interactive/exploratory use



slide-1
SLIDE 1

Experience with RDataFrame

Spotlight on interactive/exploratory use

Oliver Lantwin

[oliver.lantwin@cern.ch]

December 6, 2018

slide-2
SLIDE 2

Intro

slide-3
SLIDE 3

Who am I?

› Physicist (PhD student) working on the SHiP experiment and its software
› Future experiment, so work includes:
    › Physics studies using toy simulation
    › Full simulation using Geant4 and co.
    › TGeo
    › Reconstruction, digitisation
    › Analysis
    › Small dedicated software for test beams
  → Plenty of places to play around with new ROOT features without breaking anything
› Personally, I have used both C++ and Python, as well as the scientific Python packages (including pandas), extensively

Oliver Lantwin (Imperial College London) December 6, 2018 Intro


slide-4
SLIDE 4

Why am I giving this talk?

› Been playing around with RDataFrame a lot recently:
    › Dedicated small physics studies
    › Procrastination
› In contact with Enrico, Danilo & co. with bug reports, suggestions etc.:
    › RVec<T>::operator== etc. return RVec<int> instead of RVec<bool>? → Fixed
    › SaveGraph should return the dot spec by default → Fixed (demo to follow)
    › Automatic histogram naming (for the lazy physicist)
    › Redefinition of columns causes a segmentation violation → Fixed (now a useful error message)


slide-5
SLIDE 5

Killer feature: Toy MC

slide-6
SLIDE 6

TGenPhaseSpace sans RDF

    TLorentzVector K_L{P_K*TMath::Cos(angle), 0, P_K*TMath::Sin(angle),
                       TMath::Sqrt(P_K*P_K + m_K*m_K)};
    TGenPhaseSpace event;
    double masses[] = {m_pi, m_mu, m_numu};
    event.SetDecay(K_L, 3, masses);

    auto f = TFile::Open("K_L_phasespace.root", "recreate");
    double weight;
    TLorentzVector pi, l, nul;
    TTree tree("phasespace", "K_L phasespace");
    tree.Branch("pi", &pi);
    tree.Branch("l", &l);
    tree.Branch("nul", &nul);
    tree.Branch("w", &weight);

    for (auto&& i : ROOT::MakeSeq(1000000)) {
        weight = event.Generate();
        pi = *event.GetDecay(0);
        l = *event.GetDecay(1);
        nul = *event.GetDecay(2);
        tree.Fill();
    }
    tree.Write();
    f->Close();


slide-7
SLIDE 7

With RDF

    TLorentzVector K_L{P_K*TMath::Cos(angle), 0, P_K*TMath::Sin(angle),
                       TMath::Sqrt(P_K*P_K + m_K*m_K)};
    TGenPhaseSpace event;
    double masses[] = {m_pi, m_mu, m_numu};
    event.SetDecay(K_L, 3, masses);

    ROOT::RDataFrame df(100000);
    auto phasespace = df
        .Define("weight", "event.Generate()")
        .Define("pi", "*event.GetDecay(0)")
        .Define("l", "*event.GetDecay(1)")
        .Define("nu", "*event.GetDecay(2)");
    phasespace.Snapshot("phasespace", "K_L_phasespace.root");

Pro:
› Much shorter
› More expressive
› No TTree boilerplate


slide-8
SLIDE 8

With RDF


Con: Reliant on side effects and implementation details:
› Not obvious to me that Defines always happen in order
› Not parallelisable: the shared event object is problematic (would need an explicit lambda for each Define with slot ...
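The parallelisation concern can be sketched outside ROOT. The following pure-Python illustration assumes a pool of worker slots, each owning its own random generator as a stand-in for one TGenPhaseSpace per slot. The names make_slot_generators and generate_in_slot are invented for this sketch and are not ROOT API.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def make_slot_generators(n_slots, seed=1234):
    """One independent generator per slot (stand-in for one TGenPhaseSpace per slot)."""
    return [random.Random(seed + slot) for slot in range(n_slots)]


def generate_in_slot(generators, slot, n_events):
    """Generate n_events toy weights using only the state owned by `slot`."""
    rng = generators[slot]
    return [rng.random() for _ in range(n_events)]


n_slots = 4
generators = make_slot_generators(n_slots)
with ThreadPoolExecutor(max_workers=n_slots) as pool:
    futures = [
        pool.submit(generate_in_slot, generators, slot, 1000)
        for slot in range(n_slots)
    ]
    weights_per_slot = [future.result() for future in futures]
```

Because no generator is shared between slots, the result of each slot does not depend on how the threads are scheduled, which is the property a single shared event object cannot offer.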


slide-9
SLIDE 9

Maybe future RDF? (better suggestions welcome!)

    TLorentzVector K_L{P_K*TMath::Cos(angle), 0, P_K*TMath::Sin(angle),
                       TMath::Sqrt(P_K*P_K + m_K*m_K)};
    TGenPhaseSpace event;
    double masses[] = {m_pi, m_mu, m_numu};
    event.SetDecay(K_L, 3, masses);

    ROOT::RDataFrame df(100000);
    // mutable: Generate() modifies the captured copy of event
    auto genFunc = [event]() mutable {
        auto weight = event.Generate();
        auto pi = *event.GetDecay(0);
        auto l = *event.GetDecay(1);
        auto nu = *event.GetDecay(2);
        return std::make_tuple(weight, pi, l, nu);
    };
    // MultiDefine is the interface proposed on this slide, not an existing RDataFrame method
    auto phasespace = df
        .MultiDefine({"weight", "pi", "l", "nu"}, genFunc);
    phasespace.Snapshot("phasespace", "K_L_phasespace.root");

› (Nearly) purely functional
› Easily parallelisable using an array of TGenPhaseSpace and (Multi)DefineSlot
› Similar to ROOT-9766
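To make the proposal concrete without relying on any ROOT interface, here is a minimal pure-Python sketch of the idea: one function call per event returns a tuple, and the tuple is unpacked into several named columns. The helper multi_define, the toy function toy_decay and the dict-per-event layout are all assumptions made for illustration.

```python
def multi_define(rows, names, func):
    """Add several columns at once from one call of `func` per row.

    `rows` is a list of dicts (one per event); `func` returns a tuple with
    one value per entry of `names`. All names here are hypothetical.
    """
    out = []
    for row in rows:
        values = func(row)
        if len(values) != len(names):
            raise ValueError("func must return one value per column name")
        new_row = dict(row)  # leave the input rows untouched
        new_row.update(zip(names, values))
        out.append(new_row)
    return out


def toy_decay(row):
    # Toy stand-in for a generator call: split the parent energy in two parts.
    energy = row["E"]
    return (1.0, 0.25 * energy, 0.75 * energy)


events = multi_define([{"E": 4.0}, {"E": 8.0}], ["weight", "E1", "E2"], toy_decay)
```

All values of one event come from the same call, so nothing depends on the order in which separate column definitions are evaluated.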


slide-10
SLIDE 10

Toy MC in RDF

› Trivial to adapt to most simple generation routines (inspired by Enrico’s Pythia8 example)
› Can Snapshot and continue analysing right away (e.g. in a notebook): no need to context switch, and can easily change toy MC parameters etc. and rerun the entire chain!
› Code above is a real example recently used for an estimation


slide-11
SLIDE 11

Nasty nested loops

slide-12
SLIDE 12

SHiP simulation data model

TTree of TClonesArrays of C++ classes with scalar member variables
› TClonesArray members are split, but types are not correctly identified → ROOT-9674

    ******************************************************************************
    *Tree    :cbmsim    : /cbmroot                                               *
    *Entries :      100 : Total =         4199856 bytes  File  Size =    1868961 *
    *        :          : Tree compression factor =   2.20                       *
    ******************************************************************************
    *Br    0 :MCTrack   : Int_t cbmroot.Stack.MCTrack_                           *
    *Entries :      100 : Total  Size=      11377 bytes  File Size  =        463 *
    *Baskets :        1 : Basket Size=      32000 bytes  Compression=   1.91     *
    *............................................................................*
    *Br    1 :MCTrack.fUniqueID : UInt_t fUniqueID[cbmroot.Stack.MCTrack_]       *
    *Entries :      100 : Total  Size=      27425 bytes  File Size  =        548 *
    *Baskets :        1 : Basket Size=      32000 bytes  Compression=  48.79     *
    *............................................................................*
    ...

[Figure: TBrowser screenshot]


slide-13
SLIDE 13

Nasty Nested Loops

This data model is easy to work with using a looping approach:

    for event in tree:
        for p in particles:
            do_stuff()

But pandas, uproot and RDataFrame struggle with it. Been brainstorming with Enrico, Jim etc. on how this could be improved in general (see also https://github.com/bluehood/NNLOops, https://github.com/scikit-hep/awkward-array)
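One common way to make such nested data columnar is to store all particles in one flat list plus the offsets at which each event starts. The sketch below is pure Python and only illustrates that layout; the function names flatten and unflatten are invented here and are not the API of any of the linked projects.

```python
def flatten(events):
    """Turn a list of per-event lists into (content, offsets)."""
    content, offsets = [], [0]
    for particles in events:
        content.extend(particles)
        offsets.append(len(content))
    return content, offsets


def unflatten(content, offsets):
    """Rebuild the per-event lists from (content, offsets)."""
    return [content[start:stop] for start, stop in zip(offsets[:-1], offsets[1:])]


# Toy input: PDG codes of the particles in three events (the second is empty).
pdg_per_event = [[13, 211], [], [-13, 13, 22]]
content, offsets = flatten(pdg_per_event)

# One flat pass over all particles, then restore the event structure.
is_muon = [abs(code) == 13 for code in content]
is_muon_per_event = unflatten(is_muon, offsets)
```

The per-particle operation runs over a single flat list, and the event boundaries are only needed again when the result is regrouped.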


slide-14
SLIDE 14

Example: PyROOT

    from math import hypot
    from ROOT import TH2D

    h = TH2D("h", ";P[GeV/c];p_{T}[GeV/c]", 100, 0, 100, 100, 0, 3)
    for track in tracks:
        # abs() applies to the PDG code, so both mu- (13) and mu+ (-13) pass
        if abs(track.GetPdgCode()) == 13:
            pt = hypot(track.GetPx(), track.GetPy())
            p = hypot(track.GetPz(), pt)
            h.Fill(p, pt)

Pro: Easy to understand
Con: Pure Python loops


slide-15
SLIDE 15

TTree::Draw()?

    tree->Draw("sqrt("
               "MCTrack.GetPx()*MCTrack.GetPx()"
               "+MCTrack.GetPy()*MCTrack.GetPy()"
               ")"
               ":sqrt("
               "MCTrack.GetPx()*MCTrack.GetPx()"
               "+MCTrack.GetPy()*MCTrack.GetPy()"
               "+MCTrack.GetPz()*MCTrack.GetPz())"
               ">>h(100, 0, 100, 100, 0, 3)",
               "abs(MCTrack.GetPdgCode())==13",
               "colz");

Pro: Faster, still fairly obvious to ROOT users what happens
Con: Not true C++, very fragile, doesn’t work for more complicated examples


slide-16
SLIDE 16

Contortions with RDF

    auto h = df
        .Define("Tracks", clones_converter<ShipMCTrack>, {"MCTrack"})
        .Define("Muons", [](const RVec<ShipMCTrack> &ts) {
                return Filter(ts, [](const ShipMCTrack &t) {
                    return abs(t.GetPdgCode()) == 13;
                });
            }, {"Tracks"})
        .Define("Muons_pt", [](const RVec<ShipMCTrack> &muons) {
                return Map(muons, [](const ShipMCTrack &muon) {
                    return sqrt(muon.GetPx()*muon.GetPx() +
                                muon.GetPy()*muon.GetPy());
                });
            }, {"Muons"})
        .Define("Muons_P", [](const RVec<ShipMCTrack> &muons) {
                return Map(muons, [](const ShipMCTrack &muon) {
                    return sqrt(muon.GetPx()*muon.GetPx() +
                                muon.GetPy()*muon.GetPy() +
                                muon.GetPz()*muon.GetPz());
                });
            }, {"Muons"})
        .Histo2D({"h", ";P[GeV/c];p_{T}[GeV/c]", 100, 0, 100, 100, 0, 3},
                 "Muons_P", "Muons_pt");

Pro: Fast, safe
Con: A lot of repetition and boilerplate; colleagues less happy with C++ will complain


slide-17
SLIDE 17

Why this way?

› Operating on RVec<ShipMCTrack> allows us to use Filters, Maps and track[isMuon] niceties, but causes trouble when we want to use getter methods etc.
› Alternative approach:
    › Use the split feature to get scalar properties as RVecs directly
    › Use RVec columnar calculations


slide-18
SLIDE 18

Second try

Caveat: Does not yet work

    auto h = df
        .Define("isMuon", "abs(MCTrack.fPdgCode)==13")
        .Define("pt", "sqrt(MCTrack.fPx*MCTrack.fPx+MCTrack.fPy*MCTrack.fPy)")
        .Define("P", "sqrt(pt*pt+MCTrack.fPz*MCTrack.fPz)")
        .Define("muon_P", "P[isMuon]")
        .Define("muon_pt", "pt[isMuon]")
        .Histo2D({"h", ";P[GeV/c];p_{T}[GeV/c]", 100, 0, 100, 100, 0, 3},
                 "muon_P", "muon_pt");

Pro: Much less verbose
Con: Would have to read split TClonesArray branches into RVec and ignore class member visibility (ROOT seems to do this in many cases already)
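The columnar logic of this second try can be checked on toy numbers without ROOT. The sketch below uses plain Python lists in place of RVecs and a hypothetical helper select to play the role of the P[isMuon] mask syntax; the track values are made up.

```python
from math import hypot


def select(values, mask):
    """Keep the entries of `values` whose mask entry is true (like P[isMuon])."""
    if len(values) != len(mask):
        raise ValueError("values and mask must have the same length")
    return [value for value, keep in zip(values, mask) if keep]


# One toy event: per-track columns, as split branches would provide them.
pdg = [13, 211, -13]
px = [3.0, 1.0, 0.0]
py = [4.0, 0.0, 5.0]
pz = [12.0, 2.0, 12.0]

is_muon = [abs(code) == 13 for code in pdg]
pt = [hypot(x, y) for x, y in zip(px, py)]
p = [hypot(t, z) for t, z in zip(pt, pz)]

muon_pt = select(pt, is_muon)
muon_p = select(p, is_muon)
```

Every step works on whole columns of one event, and the mask is applied only at the end, so the pion in the middle is dropped from both muon_pt and muon_p consistently.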


slide-19
SLIDE 19

Nasty nested loops: Open questions

› Filtering on e.g. particles? Many cases where we don’t want to discard the entire event
› Exploding collections, i.e. redefining what an event is (could help with the above)
› I frequently need to think about two definitions of an event:
    › 1 proton on target, or
    › 1 muon reaching SHiP
› Passing methods into containers is very complicated (limit of my template-fu)
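As a rough illustration of what "exploding" a collection could mean, the pure-Python sketch below turns one row per proton on target into one row per muon, keeping the index of the source event. The function explode, the column names and the toy values are assumptions for this sketch, not an existing RDataFrame feature.

```python
def explode(events, column):
    """One output row per element of `column`, remembering the source event."""
    rows = []
    for event_index, event in enumerate(events):
        for value in event[column]:
            row = {key: val for key, val in event.items() if key != column}
            row["event"] = event_index
            row[column] = value
            rows.append(row)
    return rows


# Toy events: one proton on target each, with the momenta of the muons
# that reach the detector (possibly none).
events = [
    {"weight": 1.0, "muon_p": [10.0, 25.0]},
    {"weight": 0.5, "muon_p": []},
    {"weight": 2.0, "muon_p": [40.0]},
]
muons = explode(events, "muon_p")
```

After the explosion an "event" is one muon, so an ordinary row filter acts on particles; events without muons simply produce no rows.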


slide-20
SLIDE 20

Misc thoughts and conclusion

slide-21
SLIDE 21

Define and Hist

See myself using patterns like:

    df.Define("muP", "P[isMuon]").Histo1D("muP")

a lot (note: P is using nice RVec filtering syntax). Compare to pandas:

    df[df.isMuon].P.hist()  # Can you see how this is actually not the same?

TTree::Draw:

    tree->Draw("P", "isMuon")

I understand the concerns about parsing Histo?D arguments, but I’m not satisfied with the current solution either.


slide-24
SLIDE 24

SaveGraph

[Figure: two panels, "Setup" and "Result"]


slide-25
SLIDE 25

Notebook quality of life improvements I

Half of exploratory notebooks look like this for me:

    TCanvas c13;
    h13->Draw();
    c13.Draw();

› Canvas paradigm is powerful for complex plots etc., but 99+% of the time when exploring data this power is not needed.
› Wish: %jsroot inline? %%jsroot canvas? Altair+RGraphics?


slide-26
SLIDE 26

Notebook quality of life improvements II

› I’ve used root --notebook once, to prepare this presentation; I usually use JupyterLab (with Python+JupyROOT and the ROOT C++ kernel)
› A lot of users who want to use ROOT in notebooks will probably already have Jupyter set up
› Keep finding myself having to re-symlink the ROOT kernel (not entirely ROOT’s fault) → Install kernels system-wide?


slide-27
SLIDE 27

Performance

› Fast enough out of the box for most users (much faster than Python)
› IMT and ideas about a PROOF successor are exciting
› Being able to handle larger-than-memory data without changing a line of code (cf. hackish pandas solutions) is vital!

Good performance → quick feedback/experimentation loop! Very important for interactive use! Laziness helps (but it needs to be obvious when things actually run, so that the user is not surprised).

› Wild idea: Progress indicator for event loop?
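The two wishes above (laziness that is visible to the user, and a progress indicator) can be sketched with a toy lazy event loop. The class LazyFrame and its methods book_sum and get are invented for this illustration and do not mirror any RDataFrame interface; the progress callback is likewise an assumption about how such a feature might look.

```python
class LazyFrame:
    """Toy lazy event loop: actions are booked first and run in one pass."""

    def __init__(self, n_events):
        self.n_events = n_events
        self.booked = []      # (name, function of the event number)
        self.results = None   # filled by the first event loop
        self.loops_run = 0

    def book_sum(self, name, func):
        """Book a sum over events; nothing is computed yet."""
        self.booked.append((name, func))
        self.results = None   # booking after a run invalidates cached results
        return self

    def get(self, name, progress=None, every=1000):
        """Return a result, triggering the event loop only if needed."""
        if self.results is None:
            self._run(progress, every)
        return self.results[name]

    def _run(self, progress, every):
        self.loops_run += 1
        sums = {name: 0 for name, _ in self.booked}
        for i in range(self.n_events):
            for name, func in self.booked:
                sums[name] += func(i)
            if progress is not None and (i + 1) % every == 0:
                progress(i + 1, self.n_events)
        self.results = sums


reports = []
df = LazyFrame(3000)
df.book_sum("n", lambda i: 1).book_sum("sum_i", lambda i: i)
total = df.get("sum_i", progress=lambda done, n: reports.append(done))
count = df.get("n")  # served from the same event loop, no second pass
```

The callback fires only while the loop actually runs, so it doubles as a signal of when the lazy computation is triggered.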


slide-28
SLIDE 28

PR

› Most people I talk to at CERN haven’t heard of RDF/RVec & co.
› Most people I tell about it react very positively and are open to trying it
› The strong points of RDF aren’t HEP-specific!
› pandas isn’t perfect either!


slide-29
SLIDE 29

Conclusion

› It’s fun to play with RDF and push its envelope when the team is so supportive and open to suggestions!
› It might take some template magic and complicated lambdas, but I have yet to find something I could not get RDF to do (mostly thanks to the full access to the ROOT C++ objects)

A lot of criticism in this talk, but only because RDF is already pleasant enough that I end up using it a lot and would like it to be even better!
