experience with rdataframe
play

Experience with RDataFrame Spotlight on interactive/exploratory use - PowerPoint PPT Presentation

Experience with RDataFrame Spotlight on interactive/exploratory use Oliver Lantwin [ oliver.lantwin@cern.ch ] December 6, 2018 Intro Who am I? Physicist (PhD student) working on the SHiP experiment, and its software Future experiment,


  1. Experience with RDataFrame Spotlight on interactive/exploratory use Oliver Lantwin [ oliver.lantwin@cern.ch ] December 6, 2018

  2. Intro

  3. Who am I? › Physicist (PhD student) working on the SHiP experiment, and its software › Future experiment, so work includes: › Physics studies using toy simulation › Full simulation using Geant4 and co › TGeo › Reconstruction, Digitisation › Analysis › small dedicated software for test beams → Plenty of places to play around with new ROOT features without breaking anything › Personally, I have used both C++ and Python , as well as the scientifjc python packages extensively (including pandas ) Oliver Lantwin (Imperial College London) December 6, 2018 Intro 2

  4. Why am I giving this talk? › Been playing around with RDataFrame a lot recently: › Dedicated small physics studies › Procrastination › In contact with Enrico, Danilo &co. with bug reports, suggestions &c.: › RVec<T>::operator== etc. return RVec<int> instead of RVec<bool> ? → Fixed › SaveGraph should return the dot spec by default → Fixed (demo to follow) › Automatic histogram naming (for the lazy physicist) › redefjnition of columns causes seg violation → Fixed (now useful error message) Oliver Lantwin (Imperial College London) December 6, 2018 Intro 3

  5. Killer feature: Toy MC

  6. TGenPhaseSpace sans RDF weight = event.Generate(); Killer feature: Toy MC December 6, 2018 Oliver Lantwin (Imperial College London) f->Close(); tree.Write(); } tree.Fill(); nul = *event.GetDecay(2); l = *event.GetDecay(1); pi = *event.GetDecay(0); for ( auto && i: ROOT::MakeSeq(1000000)){ TLorentzVector K_L{P_K*TMath::Cos(angle), 0, P_K*TMath::Sin(angle), TMath::Sqrt(P_K*P_K+m_K*m_K)}; tree.Branch(”w”, &weight); tree.Branch(”nul”, &nul); tree.Branch(”l”, &l); tree.Branch(”pi”, &pi); TTree tree(”phasespace”, ”K_L phasespace”); TLorentzVector pi, l, nul; double weight; auto f = TFile::Open(”K_L_phasespace.root”, ”recreate”); double masses[] = {m_pi, m_mu, m_numu}; TGenPhaseSpace event; 4 event.SetDecay(K_L, 3, masses);

  7. With RDF phasespace.Snapshot(”phasespace”, ”K_L_phasespace.root”); Killer feature: Toy MC December 6, 2018 Oliver Lantwin (Imperial College London) › no TTree boilerplate › more expressive › Much shorter Pro: .Define(”nu”, ”*event.GetDecay(2)”); TLorentzVector K_L{P_K*TMath::Cos(angle), 0, P_K*TMath::Sin(angle), TMath::Sqrt(P_K*P_K+m_K*m_K)}; .Define(”l”, ”*event.GetDecay(1)”) .Define(”pi”, ”*event.GetDecay(0)”) .Define(”weight”, ”event.Generate()”) auto phasespace = df ROOT::RDataFrame df(100000); double masses[] = {m_pi, m_mu, m_numu}; TGenPhaseSpace event; 5 event.SetDecay(K_L, 3, masses);

  8. With RDF phasespace.Snapshot(”phasespace”, ”K_L_phasespace.root”); Killer feature: Toy MC December 6, 2018 Oliver Lantwin (Imperial College London) with slot › Not parallelisable ( event object) problematic (would need explicit lambda for each Define › Not obvious to me that Define s always happen in order Reliant on side effects and implementation details: Con: .Define(”nu”, ”*event.GetDecay(2)”); TLorentzVector K_L{P_K*TMath::Cos(angle), 0, P_K*TMath::Sin(angle), TMath::Sqrt(P_K*P_K+m_K*m_K)}; .Define(”l”, ”*event.GetDecay(1)”) .Define(”pi”, ”*event.GetDecay(0)”) .Define(”weight”, ”event.Generate()”) auto phasespace = df ROOT::RDataFrame df(100000); double masses[] = {m_pi, m_mu, m_numu}; TGenPhaseSpace event; 6 event.SetDecay(K_L, 3, masses);

  9. Maybe future RDF ? (better suggestions welcome!) } Killer feature: Toy MC December 6, 2018 Oliver Lantwin (Imperial College London) › similar to ROOT-9766 › easily parallelisable using array of TGenPhaseSpace and (Multi)DefineSlot . › (nearly) purely functional phasespace.Snapshot(”phasespace”, ”K_L_phasespace.root”); .MultiDefine({”weight”, ”pi”, ”l”, ”nu”}, genFunc); auto phasespace = df return make_tuple(weight, pi, l, nu) TLorentzVector K_L{P_K*TMath::Cos(angle), 0, P_K*TMath::Sin(angle), TMath::Sqrt(P_K*P_K+m_K*m_K)}; auto nu = *event.GetDecay(1); auto l = *event.GetDecay(1); auto pi = *event.GetDecay(0); auto weight = event.Generate(); auto genFunc = [event](){ ROOT::RDataFrame df(100000); double masses[] = {m_pi, m_mu, m_numu}; TGenPhaseSpace event; 7 event.SetDecay(K_L, 3, masses);

  10. Toy MC in RDF › Trivial to adapt to most simple generation routines (inspired by Enrico’s Pythia8 example) › Can Snapshot and continue analysing right away (e.g. in notebook), no need to context switch and can easily change toy MC parameters etc. and rerun entire chain! › Code above is a real example recently used for an estimation Oliver Lantwin (Imperial College London) December 6, 2018 Killer feature: Toy MC 8

  11. Nasty nested loops

  12. SHiP simulation data model File Size 32000 bytes Compression= 1.91 * *............................................................................* *Br 1 :MCTrack.fUniqueID : UInt_t fUniqueID[cbmroot.Stack.MCTrack_] * *Entries : 100 : Total Size= 27425 bytes = *Baskets : 548 * *Baskets : 1 : Basket Size= 32000 bytes Compression= 48.79 * *............................................................................* ... TBrowser Oliver Lantwin (Imperial College London) December 6, 2018 Nasty nested loops 1 : Basket Size= 463 * TTree of TClonesArray s of C++ classes with scalar member variables * ROOT-9674 ****************************************************************************** *Tree :cbmsim : /cbmroot * *Entries : 100 : Total = 4199856 bytes File Size = 1868961 * : = : Tree compression factor = 2.20 * ****************************************************************************** *Br 0 :MCTrack : Int_t cbmroot.Stack.MCTrack_ * *Entries : 100 : Total Size= 11377 bytes File Size 9 › TClonesArray member are split, but types are not correctly identifjed →

  13. Nasty Nested Loops This data model is easy to work with using a looping approach: for event in tree: for p in particles: do_stuff() But pandas , uproot and RDataFrame struggle with it. Been brainstorming with Enrico, Jim etc. on how this could be improved in general (see also https://github.com/bluehood/NNLOops, https://github.com/scikit-hep/awkward-array) Oliver Lantwin (Imperial College London) December 6, 2018 Nasty nested loops 10

  14. Example: PyROOT h = TH2D(”h”, ”;P[GeV/c];p_{T}[GeV/c]”, 100, 0, 100, 100, 0, 3) for track in tracks: if abs(track.GetPdgCode() == 13): pt = hypot(track.GetPx(), track.GetPy()) h.Fill(p, pT) Pro: Easy to understand Con: Pure python loops Oliver Lantwin (Imperial College London) December 6, 2018 Nasty nested loops 11 p = hypot(track.GetPz(), pt)

  15. TTree::Draw() ? ”>>h(100, 0, 100, 100, 0, 3)”, Nasty nested loops December 6, 2018 Oliver Lantwin (Imperial College London) Con: Not true C++ , very fragile, doesn’t work for more complicated examples Pro: Faster, still fairly obvious to ROOT users what happens ”abs(MCTrack.GetPdgCode())==13”, ”colz”) ”+MCTrack.GetPz()*MCTrack.GetPz())” tree->Draw(”sqrt(” ”+MCTrack.GetPy()*MCTrack.GetPy()” ”MCTrack.GetPx()*MCTrack.GetPx()” ”:sqrt(” ”)” ”+MCTrack.GetPy()*MCTrack.GetPy()” ”MCTrack.GetPx()*MCTrack.GetPx()” 12

  16. Contortions with RDF return Map(muons, [](const ShipMCTrack& muon){ Nasty nested loops December 6, 2018 Oliver Lantwin (Imperial College London) Con: A lot of repetition, boiler plate; colleagues less happy with C++ will complain Pro: Fast , Safe .Histo2D({”h”, ”;P[GeV/c];p_{T}[GeV/c]”, 100, 0, 100, 100, 0, 3},”Muons_P”,”Muons_pt”); },{”Muons”}) return sqrt( muon.GetPx()*muon.GetPx()+ muon.GetPy()*muon.GetPy()+ muon.GetPz()*muon.GetPz()); }); .Define(”Muons_P”, [](const RVec<ShipMCTrack> &muons){ auto h = df },{”Muons”}) return sqrt( muon.GetPx()*muon.GetPx()+ muon.GetPy()*muon.GetPy()); }); return Map(muons, [](const ShipMCTrack& muon){ .Define(”Muons_pt”, [](const RVec<ShipMCTrack> &muons){ }, {”Tracks”}) return abs(t.GetPdgCode())==13; }); return Filter(ts, [](const ShipMCTrack& t){ .Define(”Muons”, [](const RVec<ShipMCTrack> & ts){ .Define(”Tracks”, clones_converter<ShipMCTrack>, {”MCTrack”}) 13

  17. Why this way? › Operating on RVec<ShipMCTrack> allows us to use Filter s, Map s and track[isMuon] niceties, but causes trouble when we want to use getter methods etc. › Alternative approach: › Use split feature to get scalar properties as RVec s directly › Use RVec columnar calculations Oliver Lantwin (Imperial College London) December 6, 2018 Nasty nested loops 14

  18. Second try Caveat: Does not yet work auto h = df.Define(”isMuon”, ”abs(MCTrack.fPdgCode)==13”) .Define(”pt”, ”sqrt(MCTrack.fPx*MCTrack.fPx+MCTrack.fPy+MCTrack.fPy)”) .Define(”P”, ”sqrt(pt*pt+MCTrack.fPz+MCTrack.fPz)”) .Define(”muon_P”, ”P[isMuon]”) .Define(”muon_pt”, ”pt[isMuon]”) .Histo2D({”h”, ”;P[GeV/c];p_{T}[GeV/c]”, 100, 0, 100, 100, 0, 3}, ”muon_P”,”muon_pt”); Pro: Much less verbose Con: Would have to read split TClonesArray branches into RVec , ignore class member visibility ( ROOT seems to do this in many cases already) Oliver Lantwin (Imperial College London) December 6, 2018 Nasty nested loops 15

Download Presentation
Download Policy: The content available on the website is offered to you 'AS IS' for your personal information and use only. It cannot be commercialized, licensed, or distributed on other websites without prior consent from the author. To download a presentation, simply click this link. If you encounter any difficulties during the download process, it's possible that the publisher has removed the file from their server.

Recommend


More recommend