scikit-learn to TMVA: XML converter tool

SLIDE 1

scikit-learn to TMVA: XML converter tool

Yuriy Ilchenko (U. of Texas), Nazim Huseynov (JINR)
IML LHC Machine Learning WG Meeting
Feb 03, 2015

SLIDE 2

History

  • ttbar production with non-prompt leptons - major background for a few ttH channels
  • Idea is to use an MVA - boosted decision tree (BDT) - to separate prompt from non-prompt leptons
  • Employ TMVA from ROOT
  • List of input variables - object level only
    • pt, eta, sigd0PV, z0SinTheta, etcone20/pt, ptcone20/pt
  • Compare BDT performance against the standard analysis cuts
    • ROC-curve (BDT) vs a point (cuts)
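
Comparing a ROC curve (BDT) with a single point (cuts) can be illustrated with a minimal sketch. The function names `roc_points` and `working_point` and all numbers below are invented for illustration; they are not taken from the analysis or from TMVA. A classifier score yields one (signal efficiency, background rejection) pair per threshold, while a fixed cut selection yields exactly one pair.

```python
def roc_points(signal_scores, background_scores):
    """Return one (signal efficiency, background rejection) pair per threshold."""
    thresholds = sorted(set(signal_scores) | set(background_scores))
    points = []
    for t in thresholds:
        # An event is kept if its score is at or above the threshold.
        sig_eff = sum(s >= t for s in signal_scores) / len(signal_scores)
        bkg_rej = sum(b < t for b in background_scores) / len(background_scores)
        points.append((sig_eff, bkg_rej))
    return points


def working_point(signal_pass, background_pass):
    """Cut-based selection: a single point computed from pass/fail flags."""
    sig_eff = sum(signal_pass) / len(signal_pass)
    bkg_rej = 1.0 - sum(background_pass) / len(background_pass)
    return sig_eff, bkg_rej


# Toy numbers, invented for illustration only.
sig = [0.9, 0.8, 0.7, 0.4]
bkg = [0.6, 0.3, 0.2, 0.1]
curve = roc_points(sig, bkg)
cuts = working_point([True, True, True, False], [False, True, False, False])
```

If the cuts point lies below the curve at the same signal efficiency, the BDT rejects more background than the cuts at that efficiency.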

SLIDE 3

TMVA - electrons

[Figure: background rejection versus signal efficiency, MVA method: BDT. Panels: 10% sample and 33% sample, each with a zoomed-in view; the "cuts" point is marked.]

SLIDE 4

TMVA - muons

[Figure: background rejection versus signal efficiency, MVA method: BDT. Panel: 10% sample with a zoomed-in view; the "cuts" point is marked.]

33% sample:

-- <ERROR> BDT : YOUR tree has only 1 Node... kind of a funny *tree*. I cannot boost such a thing... if after 1 step the error rate is == 0.5

No results :( Decided to try an alternative MVA library.

SLIDE 5

scikit-learn

  • “sklearn” - popular open-source library for data analysis written in Python
  • Implements all major models - decision trees, neural networks, etc.
  • Supported by an international community of developers

Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011.

www.scikit-learn.org

SLIDE 6

sklearn - electrons

[Figure: panels for the 10% sample and the 33% sample, each with a zoomed-in view; the "cuts" point is marked.]

SLIDE 7

sklearn - muons

[Figure: panels for the 10% sample and the 33% sample, each with a zoomed-in view; the "cuts" point is marked.]

SLIDE 8

sklearn to TMVA

  • Problem: no sklearn available in ATLAS software
  • Solution: convert a classifier trained with scikit-learn to the xml format readable by TMVA Reader
  • Perk: apply the BDT in ATLAS independently of scikit-learn

[Diagram: scikit-learn for training, TMVA for testing, connected by the skTMVA converter]

SLIDE 9

skTMVA converter

  • skTMVA - sklearn to TMVA converter
    • part of the koza4ok package, which also contains a ROC-curve calculation and some other tools
    • written in Python
    • @GitHub - https://github.com/yuraic/koza4ok
  • What’s supported?
    • BDT binary classification
    • AdaBoost, Gradient Boosting
    • xml format only

SLIDE 10

skTMVA in action

  • Getting the converter

git clone https://github.com/yuraic/koza4ok.git

  • Set up the repository

source setup_koza4ok.sh

  • And in your Python code, pass the converter:
    • the scikit-learn model
    • the TMVA input variables and their types (variable order matters!)
    • the output TMVA xml file

SLIDE 11

skTMVA in practice

  • In the koza4ok/example folder
    • Training - no input data is required, data is generated on the fly
      • bdt_sklearn_to_tmva_AdaBoost.py
      • bdt_sklearn_to_tmva_Grad.py
    • Testing and validation - draw ROC curves with TMVA and scikit-learn and overlay them
      • validate_sklearn_to_tmva.py

Two files are created when running the examples:

  • bdt_sklearn_to_tmva_example.pkl
    • stores the scikit-learn model
  • bdt_sklearn_to_tmva_example.xml
    • the converted TMVA xml file

SLIDE 12

Summary

  • skTMVA - scikit-learn to TMVA converter
  • supports BDT binary classification - AdaBoost, Gradient Boosting
  • saves to xml file
  • comes with examples and validation code
  • web: https://github.com/yuraic/koza4ok

Plans

  • Convert scikit-learn model to a standalone C++ file
  • Contact us
    • Yuriy Ilchenko (core development) - ilchenko@physics.utexas.edu
    • Nazim Huseynov (validation, testing) - nguseynov@jinr.ru

SLIDE 13

Backup

SLIDE 14

Decision Tree in scikit-learn and TMVA

[Figure labels: scikit-learn Decision Tree; apply skTMVA converter]

  • TMVA variable description is in the back-up slides (or google)
  • sklearn tree structure is described at http://scikit-learn.org/dev/auto_examples/tree/unveil_tree_structure.html
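
To make the mapping between the two tree representations concrete, here is a minimal sketch; it is not the skTMVA code. It assumes the parallel-array layout that scikit-learn exposes on a fitted tree (`children_left`, `children_right`, `feature`, `threshold`, `value`), imitated here with plain lists so that no library is needed. The flattened `value` entries `[background, signal]`, the `IVar=-1` placeholder for leaves, the `purity_limit` argument standing in for NodePurityLimit, and the function name are all assumptions made for illustration.

```python
TREE_LEAF = -1  # scikit-learn stores -1 as the child index of a leaf


def sklearn_arrays_to_nodes(tree, purity_limit=0.5):
    """Turn scikit-learn-style parallel arrays into nested TMVA-style node dicts."""

    def build(i, pos):
        n_bkg, n_sig = tree["value"][i]  # assumed class order: background, signal
        node = {"pos": pos, "purity": n_sig / (n_sig + n_bkg), "children": []}
        if tree["children_left"][i] == TREE_LEAF:
            # Terminal node: signal (+1) or background (-1), decided by purity.
            node.update(IVar=-1, Cut=0.0, cType=1,
                        nType=1 if node["purity"] > purity_limit else -1)
        else:
            # scikit-learn sends x[feature] <= threshold to the left child,
            # i.e. "variable > cut goes right", which is cType=1 in these slides.
            node.update(IVar=tree["feature"][i], Cut=tree["threshold"][i],
                        cType=1, nType=0)
            node["children"] = [build(tree["children_left"][i], "l"),
                                build(tree["children_right"][i], "r")]
        return node

    return build(0, "s")


# Toy tree with a single split; all numbers are invented for illustration.
toy = {
    "children_left": [1, -1, -1],
    "children_right": [2, -1, -1],
    "feature": [1, -2, -2],
    "threshold": [34.1, -2.0, -2.0],
    "value": [[9, 11], [8, 2], [1, 9]],
}
root = sklearn_arrays_to_nodes(toy)
```

Because both libraries store a binary tree of (variable, cut) splits, the conversion is mostly a change of bookkeeping: flat arrays indexed by node number on one side, nested nodes on the other.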

SLIDE 15

TMVA minimal weights xml

Example: a single tree encoded in a TMVA xml file. <GeneralInfo> and <Options> are removed, since they don’t affect the BDT score.

Annotated parts of the file:

  • Variables section - describes the variables and maps each variable to a VarIndex
  • Tree structure - a bunch of included (nested) nodes
  • Tree number
  • Tree weight (AdaBoost)
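
The XML shown on the original slide is not reproduced in this text, so the following is only a rough sketch of what such a minimal file could look like, built with Python's standard `xml.etree.ElementTree`. The element and attribute names that do not appear in these slides (`MethodSetup`, `Weights`, `NTrees`, `boostWeight`, `itree`, `Expression`, `Label`, `Type`, and so on) are assumptions based on the usual TMVA weight-file layout and should be checked against a file written by TMVA itself; the result is not guaranteed to be accepted by TMVA Reader as is.

```python
import xml.etree.ElementTree as ET


def minimal_weights_xml(variables, boost_weight=1.0):
    """Build a one-tree, TMVA-style weights document as a string (illustrative)."""
    root = ET.Element("MethodSetup", Method="BDT::BDT")

    # Variables section: the position in the list becomes VarIndex.
    var_sec = ET.SubElement(root, "Variables", NVar=str(len(variables)))
    for index, (name, vtype) in enumerate(variables):
        ET.SubElement(var_sec, "Variable", VarIndex=str(index), Expression=name,
                      Label=name, Title=name, Internal=name, Type=vtype,
                      Min="0", Max="0")

    # One tree: itree is the tree number, boostWeight the AdaBoost tree weight.
    weights = ET.SubElement(root, "Weights", NTrees="1")
    tree = ET.SubElement(weights, "BinaryTree", type="DecisionTree",
                         boostWeight=repr(boost_weight), itree="0")

    # Nested nodes: a root split on variable 0 with two terminal children.
    common = dict(NCoef="0", res="0", rms="0", cType="1")
    top = ET.SubElement(tree, "Node", pos="s", IVar="0",
                        Cut="3.4095886230468750e+01", purity="0.5",
                        nType="0", **common)
    ET.SubElement(top, "Node", pos="l", IVar="-1", Cut="0", purity="0.1",
                  nType="-1", **common)
    ET.SubElement(top, "Node", pos="r", IVar="-1", Cut="0", purity="0.9",
                  nType="1", **common)
    return ET.tostring(root, encoding="unicode")


xml_text = minimal_weights_xml([("pt", "F"), ("eta", "F")])
```

The order of the `variables` list fixes the VarIndex values, which is why variable order matters when the file is later read back.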

SLIDE 16

TMVA BDT xml parameters

  • Variables section
    • variable Min, Max values show no effect on the output BDT score
  • BinaryTree section - node parameters
    • IVar="0" - refers to a variable defined by VarIndex in the Variables section
    • pos="s" - root node, pos="l" - left, pos="r" - right
    • Cut="3.4095886230468750e+01" - node cut value
    • nType - node type; compared against NodePurityLimit, which is set in the TMVA BDT config parameters
      • nType="-1" - terminal background node
      • nType="1" - terminal signal node
      • nType="0" - intermediate node
    • cType - cut type
      • cType="0" - if node variable > cut value, then go left; otherwise go right
      • cType="1" - if node variable > cut value, then go right; otherwise go left
    • purity - S/(S+B); S - number of signal events, B - number of background events
    • res="…" and rms="…" - regression predictions (used in Gradient Boosting)
    • NCoef="0" - always zero; related to Fisher coefficients, not sure what they are for
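
The node parameters above are enough to walk a single tree by hand. The sketch below follows the cType and nType rules exactly as stated on this slide; it is not TMVA's implementation, and the XML fragment, the `classify` function and the event values are made up for illustration (the `itree` and `boostWeight` attribute names are assumptions).

```python
import xml.etree.ElementTree as ET

# Hypothetical two-level tree; only the attributes needed for traversal are kept.
TREE_XML = """
<BinaryTree itree="0" boostWeight="1.0">
  <Node pos="s" IVar="0" Cut="34.1" cType="1" nType="0">
    <Node pos="l" IVar="-1" Cut="0" cType="1" nType="-1"/>
    <Node pos="r" IVar="1" Cut="0.5" cType="0" nType="0">
      <Node pos="l" IVar="-1" Cut="0" cType="1" nType="1"/>
      <Node pos="r" IVar="-1" Cut="0" cType="1" nType="-1"/>
    </Node>
  </Node>
</BinaryTree>
"""


def classify(node, event):
    """Descend from the root to a terminal node and return its nType (+1 or -1)."""
    while node.get("nType") == "0":  # intermediate node: keep descending
        above = event[int(node.get("IVar"))] > float(node.get("Cut"))
        # cType="1": above the cut goes right; cType="0": above the cut goes left.
        go_right = above if node.get("cType") == "1" else not above
        wanted = "r" if go_right else "l"
        node = next(child for child in node if child.get("pos") == wanted)
    return int(node.get("nType"))


root_node = ET.fromstring(TREE_XML.strip()).find("Node")
low_pt = classify(root_node, [10.0, 0.9])    # fails the first cut
high_pt = classify(root_node, [50.0, 0.9])   # passes the first cut, then goes left
high_pt2 = classify(root_node, [50.0, 0.2])  # passes the first cut, then goes right
```

The event is indexed by IVar, which is another way of seeing why the variable order given to the converter has to match the order used at training time.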
