tttt BDT
Nick Amin
September 29, 2018

Overview
⚫ Last time, showed cut-based analysis with latest data and lumi of (35.87+41.53+35.53=) 112.9 fb-1, getting around 2.84σ expected significance
⚫ Repeat with updated BDT (previously, had a 19-variable TMVA BDT trained with 2016 samples)
⚫ Explore xgboost instead of TMVA, and in any case, retrain the TMVA BDT with 2016+2017 samples for more statistics
⚫ Find a binning scheme/formula (rather than trying random partitions and picking the best one)
2
⚫ 19 variables on the right, extracted from 2016+2017 MC for the analysis
⚫ All numbers in these slides should be consistent, and are associated with a luminosity projection to 132 fb-1
⚫ I checked that the discriminator shape for signal is essentially the same for OS and SS events, so include signal OS events to double the statistics
⚫ Retrain TMVA BDT with the configuration below (found from the hyperparameter scan last time)
3
feature_names = [
    "nbtags", "njets", "met", "ptl2", "nlb40", "ntb40", "nleps", "htb",
    "q1", "ptj1", "ptj6", "ptj7", "ml1j1", "dphil1l2", "maxmjoverpt",
    "ptl1", "detal1l2", "ptj8", "ptl3",
]
method = factory.BookMethod(loader, ROOT.TMVA.Types.kBDT, "BDT", ":".join([
    "!H", "!V", "NTrees=500", "nEventsMin=150", "MaxDepth=5",
    "BoostType=AdaBoost", "AdaBoostBeta=0.25", "SeparationType=GiniIndex",
    "nCuts=20", "PruneMethod=NoPruning",
]))
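For context, a minimal sketch (assuming a standard PyROOT TMVA workflow) of how the factory and loader objects referenced above might be set up; the file and tree names here are placeholders, not the actual analysis inputs:

import ROOT

output_file = ROOT.TFile("tmva_output.root", "RECREATE")
factory = ROOT.TMVA.Factory("TMVAClassification", output_file,
                            "!V:!Silent:AnalysisType=Classification")
loader = ROOT.TMVA.DataLoader("dataset")
for name in feature_names:
    loader.AddVariable(name, "F")
# sig_tree and bkg_tree are placeholder TTrees containing the 19 features
loader.AddSignalTree(sig_tree, 1.0)
loader.AddBackgroundTree(bkg_tree, 1.0)
loader.PrepareTrainingAndTestTree(ROOT.TCut(""), "SplitMode=Random:NormMode=NumEvents")
# ... BookMethod call from above ...
factory.TrainAllMethods()
factory.TestAllMethods()
factory.EvaluateAllMethods()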
⚫ Preprocessing for stability: train with weights of 1, and throw away a small (sub-percent) fraction of events that have large relative weights, mainly from x+gamma
⚫ Tried to use the BayesianOptimization package to get optimal hyperparameters, which samples regions of hyperparameter space for which the "information gained" is maximized
⚫ As long as you get the number of trees and the subsampling fraction right, the rest don't matter/matter very little
⚫ Also naively tried Condor (pick random points and submit ~4-5k trainings)
⚫ To avoid picking an overtrained hyperparameter set, rather than picking exactly the best point, I used representative values for the parameters on the right (definitions documented here) and made the numbers more round
⚫ Key points here are the parameters that actually affect the shape of the discriminator output
4
num_trees = 500
param = {}
param['objective'] = 'binary:logistic'
param['eta'] = 0.07
param['max_depth'] = 5
param['silent'] = 1
param['nthread'] = 15
param['eval_metric'] = "auc"
param['subsample'] = 0.6
param['alpha'] = 8.0
param['gamma'] = 2.0
param['lambda'] = 1.0
param['min_child_weight'] = 1.0
param['colsample_bytree'] = 1.0
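A minimal sketch of how a parameter dictionary like the one above could be used to train and evaluate an xgboost model; X_train, y_train, w_train, X_test, y_test are placeholder arrays for the 19 features, 0/1 labels, and per-event weights:

import xgboost as xgb
from sklearn.metrics import roc_auc_score

dtrain = xgb.DMatrix(X_train, label=y_train, weight=w_train, feature_names=feature_names)
dtest = xgb.DMatrix(X_test, label=y_test, feature_names=feature_names)

# 'param' and 'num_trees' are the settings listed above
bst = xgb.train(param, dtrain, num_boost_round=num_trees, evals=[(dtest, "test")])
scores = bst.predict(dtest)  # discriminator values in [0, 1]
print("test AUC:", roc_auc_score(y_test, scores))

And a rough sketch of the kind of scan the BayesianOptimization package allows; the bounds and the choice of which parameters to float are illustrative:

from bayes_opt import BayesianOptimization

def objective(eta, max_depth, subsample):
    # train with the proposed hyperparameters and return the test AUC
    p = dict(param, eta=eta, max_depth=int(max_depth), subsample=subsample)
    booster = xgb.train(p, dtrain, num_boost_round=num_trees)
    return roc_auc_score(y_test, booster.predict(dtest))

optimizer = BayesianOptimization(f=objective,
                                 pbounds={"eta": (0.01, 0.3), "max_depth": (3, 8), "subsample": (0.3, 1.0)})
optimizer.maximize(init_points=5, n_iter=30)
print(optimizer.max)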
⚫ Signal and bkg shapes agree between train and test sets; no overtraining observed
⚫ Top right shows the AUC of xgboost is ~1.2% higher than TMVA
⚫ Bottom right shows maximal s/sqrt(s+b) (single cut) is 1.83 for xgboost, but 1.75 for TMVA (5% higher for xgboost)
5
⚫ Ran HiggsCombine 10-50k times, using a simplified card/nuisance structure
⚫ Backgrounds are grouped into 5 processes, ending with "Rares", as shown in the plot on the right
⚫ Two versions of the expected significance are computed:
⚫ "no MC stat": 5 background processes + 1 signal process + 0 nuisances
⚫ "MC stat": 5 background processes + 1 signal process + (Nbins * (5+1)) uncorrelated nuisances representing MC statistical uncertainty in each bin
⚫ The "MC stat" version is preferred to avoid low MC statistic bins/fluctuations, though the difference in the two values is only a few percent because this analysis is statistically limited
⚫ I’m showing s/sqrt(s+b) as the metric for each bin in the ratio panels, but I found that for a low number of bins (e.g., 2-3) it is not indicative of the expected significance from combine
⚫ The approximation below usually agrees with combine within ~2% (again, for 2-3 bins, so not useful in the right plot)
σ = sqrt(2(s + b) ln(1 + s/b) − 2s)
TMVA output mapped from [-1,1] to [0,1]
6
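A small sketch of how this approximation could be evaluated per bin; combining the bins in quadrature is an assumption on my part about how the per-bin values would be summed, and s_bins/b_bins are placeholder arrays of expected yields:

import numpy as np

def asimov_significance(s, b):
    # per-bin approximation: sqrt(2*(s+b)*ln(1 + s/b) - 2*s)
    s, b = np.asarray(s, dtype=float), np.asarray(b, dtype=float)
    return np.sqrt(2.0 * ((s + b) * np.log(1.0 + s / b) - s))

def combined_significance(s_bins, b_bins):
    # add the per-bin significances in quadrature (assumption, not what combine does internally)
    return np.sqrt(np.sum(asimov_significance(s_bins, b_bins) ** 2))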
⚫ Note: these discriminator plots require the actual baseline selection (HT>300, MET>50, Nb/Njets≥2, lepton pT>25,20)
⚫ These are used to decide where to start binning on the left edge of the discriminator
⚫ Afterwards, create 20 equal-width bins for TMVA and xgboost and calculate the expected significance without MC stat and with MC stat
Panels: xgboost and stretched TMVA from [0.15,1] to [0,1]; significances: 2.63477, 2.59117 and 2.60103, 2.44803
7
⚫ Run combine many times with a random number of bins (between 10-20) and a random binning
⚫ Pick random numbers, take the cumulative sum, and squeeze to [0,1] to obtain a "random binning"
⚫ Apply a simple metric to avoid weird-looking distributions (e.g., right)
⚫ Left plot shows significance (no MC stat) vs significance (MC stat) — on average, "sig no stat" is ~1.8% higher than "sig stat"
⚫ Middle plot has 1D distributions of "sig stat" for xgboost and TMVA
⚫ The last bin has much higher s/sqrt(s+b) than any other bin, so it dominates the result and is clearly correlated with the output of combine
8
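A minimal sketch of the random-binning procedure described above; the minimum-width requirement is my guess at the kind of metric used to reject weird-looking binnings:

import numpy as np

def random_binning(nbins, min_width=0.02):
    # pick random widths, take the cumulative sum, and squeeze the edges into [0, 1]
    while True:
        edges = np.concatenate([[0.0], np.cumsum(np.random.rand(nbins))])
        edges /= edges[-1]
        if np.min(np.diff(edges)) > min_width:  # crude sanity check on bin widths
            return edges

nbins = np.random.randint(10, 21)  # random number of bins between 10 and 20
edges = random_binning(nbins)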
⚫ Plot expected significance for TMVA (left) and xgboost (right) — the legend is more useful than the histograms though
⚫ For each bin count, display the mean significance and also the mean of the highest 10% of significances
9
⚫ Now plot the difference between expected significance without MC statistics nuisances and with, as a function of the number of bins
⚫ For TMVA, the difference decreases a little bit going from 10 to 19 bins
⚫ For xgboost, the difference increases going from 10 to 19 bins
⚫ I would expect fewer bins to mean a smaller effect of MC statistics, along the lines of what xgboost shows
10
⚫ From an earlier slide, signal is compressed at disc=1 for xgboost. Naively try to reshape it to look like TMVA by matching the relative signal counts in each bin
⚫ Take the equally-spaced bins in the xgboost discriminator (x-axis) and make them match TMVA (y-axis) — this bins more finely where the signal is bunched up
⚫ Green dots are calculated by matching integrals; blue is a linear interpolation that we can apply
⚫ Two approaches: remap the individual discriminator values, or remap the bin edges
⚫ Note that orange is very sigmoid-like…
11
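A minimal sketch of the integral-matching idea from this slide, assuming xgb_sig and tmva_sig are placeholder arrays of signal discriminator values (with TMVA already mapped to [0,1]); the exact procedure on the slide may differ in details:

import numpy as np

def match_integrals(xgb_sig, tmva_sig, nbins=20):
    # for each equally spaced xgboost bin edge, find the TMVA value containing the
    # same cumulative signal fraction (i.e. match the signal integrals)
    edges_xgb = np.linspace(0.0, 1.0, nbins + 1)
    fracs = np.array([np.mean(xgb_sig <= edge) for edge in edges_xgb])
    edges_tmva = np.quantile(tmva_sig, fracs)
    return edges_xgb, edges_tmva

def remap(scores, edges_xgb, edges_tmva):
    # linear interpolation between the matched points, applied to individual values
    return np.interp(scores, edges_xgb, edges_tmva)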
⚫ Both approaches are equivalent, so take the first one (remap individual discriminator values) for now
⚫ The updated distribution on the left looks like TMVA’s now, with an expected significance (including stats) of 2.77σ — ~7% higher than TMVA from slide 7.
⚫ The right plot is remade from slide 5 with an added green curve for the reshaped xgboost values — the shape is the same between green and blue now, with green having an obviously higher peak
xgboost, 20 bins; significances: 2.80443, 2.76839
12
⚫ The TMVA manual (https://arxiv.org/pdf/physics/0703039.pdf) has a note in section 8.2.2 about transforming the output of a likelihood estimator, where signal and background can become squished at 0 and 1, making it hard to bin
⚫ It uses an inverse sigmoid (close to what we saw earlier) to un-squish the edges
⚫ This is not done by default for AdaBoost in TMVA as far as I can tell, so the difference in distributions I see could be a product of the different loss functions/training procedure (backup slide gives evidence for this)
13
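For reference, a transformation of the kind described there (an inverse sigmoid/logit that spreads out values squished near 0 and 1) might look like the sketch below; the exact functional form and the scale parameter in the TMVA manual may differ:

import numpy as np

def unsquish(y, tau=15.0, eps=1e-12):
    # inverse-sigmoid transform: stretches values piled up near 0 and 1
    y = np.clip(y, eps, 1.0 - eps)  # protect against exactly 0 or 1
    return -np.log(1.0 / y - 1.0) / tau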
⚫ Run combine some more with the exact xgboost → TMVA mapping
⚫ Flippable with slide 8
⚫ Clear improvement in xgboost over TMVA on average in the bottom middle plot (distribution of expected significances with MC stat)
14
⚫ Plot expected significance for TMVA (left) and xgboost (right)
⚫ The significance values are fairly stable across a number of bins ranging from 10 to 24
⚫ Summary plot of significance vs nbins on the right: not much of an increase for TMVA after ~13 bins, and ~18 bins for xgboost
15
⚫ Now plot the difference between expected significance without MC statistics nuisances and with, as a function of the number of bins
⚫ Now, for both TMVA and xgboost, the difference decreases a little bit going from 10 to 24 bins
16
⚫ Currently, the reshaping requires the TMVA discriminant shape as an input, which is not ideal if we just want a simple formula/prescription for making xgboost output give good results
⚫ Parameterize the mapping between equally-spaced bin thresholds (x-axis) and good bin thresholds for xgboost (y-axis) as a sigmoid
⚫ A plain sigmoid doesn’t pass exactly through (0,0) and (1,1), so add a correction term to guarantee that it does
⚫ Larger k bins the extremes of the xgboost distribution more finely. When k=0, σ(x)=x and we don’t modify the bin thresholds at all.
⚫ Now we want to find the best k for a given number of bins
σ(x) = 1/(1 + e^(−k(x − 1/2))) + (2x − 1)/(1 + e^(k/2))
17
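A quick numerical check of this mapping (my own sketch; the test values are arbitrary), confirming the endpoint and k=0 behavior stated above:

import numpy as np

def sigmoid_map(x, k):
    # sigmoid plus a linear correction term so that sigma(0)=0 and sigma(1)=1 exactly
    return 1.0 / (1.0 + np.exp(-k * (x - 0.5))) + 2.0 * (x - 0.5) / (1.0 + np.exp(k / 2.0))

x = np.linspace(0.0, 1.0, 11)
print(sigmoid_map(x, k=8.3))  # edges bunch up near 0 and 1 for large k
print(sigmoid_map(x, k=0.0))  # identical to x when k=0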
⚫ Run combine scan with "sigmoid-binning" for the xgboost discriminator, parameterized by number of bins and k
⚫ Higher values of k lead to higher expected significance, but this isn’t necessary (left)
⚫ Don’t need many bins to get high expected significance (if k is chosen appropriately)
18
Significances from the scan: 2.78758, 2.73789
⚫ Plot the median value of k for a given number of bins, and fit a line to the best performers (σ > 2.70) to get the table on the right
⚫ For higher nbins, lower k is favored, which means we don’t split up the signal region as finely as one might have thought
⚫ Pick a conservative number of bins (13) to get k=8.3, compute the bin edges, and we get a significance of 2.74 in the bottom right, only ~1% lower than the 20-bin case from slide 12
19
nbins   k
10      9.0
11      8.8
12      8.5
13      8.3
14      8.1
15      7.9
16      7.7
17      7.4
18      7.2
19      7.0
20      6.8
21      6.5
22      6.3
23      6.1
24      5.9
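A minimal sketch of the straight-line fit behind this table (here fit to a few of the tabulated points for illustration; in the slides the fit is to the best-performing scan points):

import numpy as np

nbins_pts = np.array([10, 12, 14, 16, 18, 20, 22, 24])
k_pts = np.array([9.0, 8.5, 8.1, 7.7, 7.2, 6.8, 6.3, 5.9])

slope, intercept = np.polyfit(nbins_pts, k_pts, 1)  # degree-1 polynomial fit
print(intercept, slope)  # roughly 11.2 and -0.22, the parameterization used on the next slide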
⚫ Now the BDT workflow is simple — no templates or classes — though making the function in the first place required lots of string parsing because of the xgboost version from CMSSW
⚫ Simple function to get sigmoid binning after modifying two empirically measured parameters
⚫ Caveat for now: have to do bins[1] = 0.5*(bins[1]+bins[2]) afterwards, since sometimes the first bin has low background statistics, so we move it closer to the second bin
⚫ This is an artifact of the sigmoid scaling being symmetric. We probably don’t want to chop up the lower discriminant values too much.
20
import numpy as np

def get_sigmoid_binning(nbins):
    x = np.linspace(0., 1., nbins + 1)   # equally spaced thresholds in [0, 1]
    k = 11.2 - 0.22 * nbins              # rederive after re-training
    # sigmoid plus linear correction so that bins[0] = 0 and bins[-1] = 1
    bins = 1.0 / (1 + np.exp(-k * (x - 0.5))) + 2.0 * (x - 0.5) / (1 + np.exp(k / 2))
    return bins
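For example, applying the caveat above for the 13-bin case (illustrative; 'scores' is a placeholder array of xgboost discriminator values):

bins = get_sigmoid_binning(13)
bins[1] = 0.5 * (bins[1] + bins[2])  # nudge the first internal edge toward the second
counts, _ = np.histogram(scores, bins=bins)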
⚫ Now put everything together to get the final significances
⚫ Example 13-bin discriminator (+1 bin for CRZ = 14 bins total) on the right
⚫ Repeat for 10-20 bins (excluding CRZ) and plot significances for 2016, 2017, 2018 vs nbins (left) and combined (right)
⚫ There’s a lot of variance from point to point, and the combined significance gain is <2% for BDT wrt cut-based
⚫ Though, at first glance, nuisances look fine in the cards, as in the conclusions from before, but 2% vs 20% is a big difference… investigating this.
21
⚫ With simple BDT optimization, xgboost shows a few % improvement in expected significance over TMVA
⚫ BDT in general showed ~20% gain in significance over cut-based
⚫ We have a more sensible procedure for choosing the binning (no more random cut values)
⚫ Ultimately, BDT-based significances aren’t significantly better than cut-based — this is something I’m looking into
22
23
⚫ Optimal stretching function doesn’t need to be a sigmoid
⚫ Tried y(x) below, which is a linear combination of a function that passes through (0,0) and (1,1) and a regular sigmoid
⚫ More complicated than just a sigmoid, and it slightly deviates from (0,0) and (1,1) (needs to be fixed by hand afterwards)
y1(x) = (1/2) (1 + k1(2x − 1)/(k1 − |2x − 1| + 1))
y2(x) = 1/(1 + e^(−k2(x − 1/2)))
y(x) = f1 y1(x) + (1 − f1) y2(x), with k1 ∈ [−1.7, −1), k2 ∈ [7, 14], f1 ∈ [−0.5, 0.5]
24
⚫ Try the simple combine prescription (5+1 processes) with the cut-based SRs
⚫ There are 17 bins here (CRW + 16 SR bins); no CRZ here
⚫ BDT gets ~2.73, which is ~20% better than cut-based
25
⚫ Try adaboost and gradient boosting from the scikit-learn package with default parameters (toning down the number of trees to save time)
⚫ The shape difference is a consequence of the loss function/algorithm, since I see the same in TMVA vs xgboost
⚫ Even though the AUC is the same in both cases below, a maximum likelihood fit will prefer a signal shape like the one on the left
Panels: Adaboost and Gradient boost
26
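A minimal sketch of the scikit-learn comparison described above, with default parameters apart from a reduced number of trees; X_train, y_train, X_test, y_test are placeholder arrays:

from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

ada = AdaBoostClassifier(n_estimators=100).fit(X_train, y_train)
grad = GradientBoostingClassifier(n_estimators=100).fit(X_train, y_train)

# compare discriminator shapes and AUCs on the test set
for name, clf in [("Adaboost", ada), ("Gradient boost", grad)]:
    scores = clf.predict_proba(X_test)[:, 1]
    print(name, roc_auc_score(y_test, scores))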