
SLIDE 1: tttt BDT

Nick Amin, September 29, 2018

SLIDE 2: Overview

⚫ Last time, showed the cut-based analysis with the latest data and a lumi of (35.87+41.53+35.53 =) 112.9 fb⁻¹, getting around 2.84σ expected significance
⚫ Repeat with an updated BDT (previously, had a 19-variable TMVA BDT trained with 2016 samples)
⚫ Explore xgboost instead of TMVA, and in any case, retrain the TMVA BDT with 2016+2017 samples for more statistics
  • One intermediate goal is to come up with a sane binning scheme/formula (rather than trying random partitions and picking the best one)

SLIDE 3: Input details, TMVA

⚫ 19 variables on the right, extracted from 2016+2017 MC
  • Looser baseline for more stats: Njets≥2, Nb≥1, HT≥250, MET≥30, lepton pT≥15
  • No CRZ included, because we will separate that into its own bin for the full analysis
⚫ All numbers in these slides should be consistent, and associated with a luminosity of 35.9+41.5 = 77.4 fb⁻¹
  • Multiply significances by 1.2 to project to 112.9 fb⁻¹, or by 1.3 to project to 132 fb⁻¹ (see the cross-check after the configuration below)
⚫ I checked that the discriminator shape for signal is essentially the same for OS and SS events, so include signal OS events to double the statistics
  • ~400k unweighted signal and background events in total
⚫ Retrain the TMVA BDT with the configuration below (found from the hyperparameter scan last time)
  • Key points: 500 trees with a depth of 5, using the AdaBoost algorithm

feature_names = [
    "nbtags", "njets", "met", "ptl2", "nlb40", "ntb40", "nleps",
    "htb", "q1", "ptj1", "ptj6", "ptj7", "ml1j1", "dphil1l2",
    "maxmjoverpt", "ptl1", "detal1l2", "ptj8", "ptl3",
]

method = factory.BookMethod(loader, ROOT.TMVA.Types.kBDT, "BDT", ":".join([
    "!H",
    "!V",
    "NTrees=500",
    "nEventsMin=150",
    "MaxDepth=5",
    "BoostType=AdaBoost",
    "AdaBoostBeta=0.25",
    "SeparationType=GiniIndex",
    "nCuts=20",
    "PruneMethod=NoPruning",
]))
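As a quick cross-check of the projection factors above (a minimal sketch, assuming significance scales as √L for a statistically limited analysis):

import math

lumi_now = 35.9 + 41.5              # fb^-1, 2016+2017
print(math.sqrt(112.9 / lumi_now))  # ~1.21, the "multiply by 1.2" factor
print(math.sqrt(132.0 / lumi_now))  # ~1.31, the "multiply by 1.3" factor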

SLIDE 4: xgboost

⚫ Preprocessing
  • Use absolute values of the weights in training, for reasons of stability
  • When re-weighting signal and background to have average weights of 1, throw away a small (sub-percent) fraction of events that have large relative weights, from X+gamma mainly
⚫ Tried to use the BayesianOptimization package to get optimal hyperparameters
  • This attempts to iteratively find the best point by exploring regions for which the "information gained" is maximized
  • Turns out once you get the learning rate (eta), the number of trees, and the subsampling fraction right, the rest don't matter/matter very little
⚫ Also naively tried Condor (pick random points and submit ~4-5k trainings)
  • Same story here
⚫ To avoid picking an overtrained hyperparameter set, rather than picking exactly the best point, I used representative values for the parameters on the right (definitions documented here) and made the numbers rounder
⚫ Key points here
  • 500 trees, depth of 5: same as TMVA
  • Gradient boosting algorithm instead of AdaBoost: this can actually affect the shape of the discriminator output

num_trees = 500
param = {}  # settings passed to xgboost
param['objective'] = 'binary:logistic'
param['eta'] = 0.07
param['max_depth'] = 5
param['silent'] = 1
param['nthread'] = 15
param['eval_metric'] = "auc"
param['subsample'] = 0.6
param['alpha'] = 8.0
param['gamma'] = 2.0
param['lambda'] = 1.0
param['min_child_weight'] = 1.0
param['colsample_bytree'] = 1.0
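For context, a minimal sketch of how these parameters would be fed to xgboost (the array names and the train/test split are assumptions, not from the slides):

import xgboost as xgb

# X_*: 19-column feature arrays; y_*: 1 for signal, 0 for background;
# w_*: absolute per-event weights, as described under "Preprocessing"
dtrain = xgb.DMatrix(X_train, label=y_train, weight=w_train)
dtest = xgb.DMatrix(X_test, label=y_test, weight=w_test)
bst = xgb.train(param, dtrain, num_boost_round=num_trees,
                evals=[(dtrain, "train"), (dtest, "test")])
scores = bst.predict(dtest)  # discriminator values in [0,1] from binary:logistic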

SLIDE 5: Training results

⚫ Bottom left plot shows the discriminator shapes for signal/bkg in the train/test sets
  • A Kolmogorov-Smirnov test shows good consistency: no overtraining observed (see the sketch below)
⚫ Top right shows the AUC of xgboost is ~1.2% higher than TMVA
⚫ Bottom right shows the maximal s/sqrt(s+b) (single cut) is 1.83 for xgboost, but 1.75 for TMVA (5% higher for xgboost)
  • The shape is qualitatively different, however
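The train/test consistency check is in the spirit of a two-sample Kolmogorov-Smirnov test; a minimal sketch using scipy (the array names are assumptions):

from scipy.stats import ks_2samp

# disc_train_sig / disc_test_sig: discriminator values for signal events
# in the training and test samples (likewise for background)
stat, pvalue = ks_2samp(disc_train_sig, disc_test_sig)
print(pvalue)  # a high p-value shows no evidence of overtraining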

SLIDE 6: Significance metrics

⚫ Ran HiggsCombine 10-50k times, using a simplified card/nuisance structure
  • Group fakes/flips into "Others", and rares/ttxx/tttx/xg into "Rares", as shown in the plot on the right
  • Then compute two versions of the expected significance
  • Significance without MC stats: 5 background processes + 1 signal process + 0 nuisances
  • Significance with MC stats: 5 background processes + 1 signal process + (Nbins * (5+1)) uncorrelated nuisances representing the MC statistical uncertainty in each bin
  • Use the latter for optimization/ranking to hopefully avoid low-MC-statistics bins/fluctuations, though the difference between the two values is only a few percent because this analysis is statistically limited
⚫ I'm showing s/sqrt(s+b) as the metric for each bin in the ratio panels, but I found that for a low number of bins (e.g., 2-3), it is not indicative of the expected significance from combine. However, the higher-order likelihood approximation below usually agrees with combine within ~2% (again, for 2-3 bins, so not useful in the right plot):

  σ = √(2(s + b)·ln(1 + s/b) − 2s)

[Plot: TMVA output mapped from [-1,1] to [0,1]]

Note, these discriminator plots require the actual baseline selection (HT>300, MET>50, Nb/Njets≥2, lepton pT>25,20)
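Both metrics as code, for concreteness (a minimal sketch; s and b are per-bin yields, with b assumed nonzero):

import numpy as np

def naive_significance(s, b):
    # per-bin metric shown in the ratio panels
    return s / np.sqrt(s + b)

def asimov_significance(s, b):
    # higher-order likelihood approximation quoted above; usually agrees
    # with combine within ~2% for 2-3 bins
    return np.sqrt(2.0 * (s + b) * np.log(1.0 + s / b) - 2.0 * s)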

SLIDE 7: Exp. σ (out-of-the-box)

⚫ Here we can see the shape difference between TMVA and xgboost, though both get very similar AUC and s/sqrt(s+b)
⚫ Note that I scaled the TMVA plot from the previous slide from [0.15,1] to [0,1] to avoid empty bins, because the TMVA output doesn't cover the full [-1,1] range initially
  • This is one source of slight ambiguity for binning, since you can't just equally partition [-1,1]: you have to decide where to start binning on the left
⚫ Afterwards, create 20 equal-width bins for TMVA and xgboost and calculate the expected significance without MC stat and with MC stat
  • TMVA is ~6% higher than xgboost even though the s, b, and AUC metrics indicate xgboost should be winning…
  • Presumably, combine likes several moderately high s/sqrt(s+b) bins (TMVA) rather than one really high one (xgboost)
  • AUC doesn't care about the squished signal on the right, but a fit probably does
⚫ As a quick comparison (in backup), I ran this procedure on the cut-based SR binning (18 bins) and get ~2.25σ

[Plots: xgboost, and TMVA stretched from [0.15,1] to [0,1]; expected significances (without MC stat, with MC stat): 2.63477, 2.59117 for TMVA and 2.60103, 2.44803 for xgboost]

SLIDE 8: Run combine a lot

⚫ Run combine a few thousand times for the TMVA and xgboost discriminators with a random number of bins (between 10 and 20) and a random binning
  • Get a set of flat- or gaussian-distributed random numbers (50-50 chance) and take the cumulative sum, squeezed to [0,1], to obtain a "random binning" (see the sketch at the end of this slide)
  • Reject a binning scheme if there is an empty bin (or one with <0.05 s+b events)
  • Additionally, compute s/sqrt(s+b) and make sure >~80% of the bins are increasing in this metric, to avoid weird-looking distributions (e.g., right)
⚫ Left plot shows significance (no MC stat) vs significance (with MC stat): on average, "sig no stat" is ~1.8% higher than "sig stat"
⚫ Middle plot has 1D distributions of "sig stat" for xgboost and TMVA
  • The difference here is quite striking: TMVA is better than xgboost and fairly stable
⚫ Right plot shows the maximum s/sqrt(s+b) across all bins against the significance
  • The narrow orange line at the top left contains cases where the last xgboost bin has a higher s/sqrt(s+b) than any other bin, so it dominates the result and is clearly correlated with the output of combine
  • TMVA has a lower maximum than xgboost on average, but obtains a better significance
  • This is along the lines of the suspicion on the previous slide about squishing the signal
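A minimal sketch of the random-binning generator described above (the rejection steps are applied separately; the function and argument names are mine):

import numpy as np

def random_binning(nbins, rng=np.random.default_rng()):
    # 50-50 chance of flat- or gaussian-distributed step sizes
    if rng.uniform() < 0.5:
        steps = rng.uniform(size=nbins)
    else:
        steps = np.abs(rng.normal(size=nbins))
    # cumulative sum squeezed into [0,1]; the first edge is pinned at 0
    edges = np.cumsum(steps)
    return np.concatenate(([0.0], edges / edges[-1]))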

SLIDE 9: Dependence on bin count

⚫ Plot the expected significance for TMVA (left) and xgboost (right); the legend is more useful than the histograms, though
⚫ For each bin count, display the mean significance and also the mean of the highest 10% of significances
  • TMVA only has a ~1% gain going from the lowest bin count to the highest
  • xgboost has a 5-8% gain

SLIDE 10: Effect of MC stats

⚫ Now plot the difference between the expected significance without MC statistics nuisances and with, as a function of the number of bins
⚫ For TMVA, the difference decreases a little bit going from 10 to 19 bins
⚫ For xgboost, the difference increases going from 10 to 19
⚫ I would expect fewer bins to mean a smaller effect of MC statistics, along the lines of what xgboost shows

SLIDE 11: Reshaping xgboost output

⚫ From an earlier slide, the signal is compressed at disc=1 for xgboost. Naively try to reshape it to look like TMVA by matching the relative signal counts in each bin
⚫ Take the equally-spaced bins in the xgboost discriminator (x-axis) and make them match TMVA (y-axis): this bins more finely where the signal is bunched up
⚫ Green dots are calculated by matching integrals (sketched in code at the end of this slide); blue is a linear interpolation that we can apply
⚫ Two approaches
  • Convert the xgboost discriminator value on an event-by-event basis (blue)
  • Re-space the bins (orange, which is the inverse of blue)
⚫ Note that orange is very sigmoid-like…
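A minimal sketch of the integral-matching construction (the grid granularity and all names are assumptions; the TMVA output is taken to be already mapped to [0,1]):

import numpy as np

def build_remap(xgb_disc, tmva_disc, weights, n=200):
    # cumulative signal fraction below each grid point, for both discriminators
    grid = np.linspace(0.0, 1.0, n)
    cdf_xgb = np.array([weights[xgb_disc <= g].sum() for g in grid]) / weights.sum()
    cdf_tmva = np.array([weights[tmva_disc <= g].sum() for g in grid]) / weights.sum()
    # map an xgboost value to the TMVA value at the same cumulative signal fraction
    return lambda x: np.interp(np.interp(x, grid, cdf_xgb), cdf_tmva, grid)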

SLIDE 12: Reshaping xgboost output

⚫ Both approaches are equivalent, so take the first one (remap individual discriminator values) for now
  • Explicitly, take each xgboost discriminant and apply the blue function from the previous slide
⚫ The updated distribution on the left now looks like TMVA's, with an expected significance (including MC stats) of 2.77σ, ~7% higher than TMVA from slide 7
⚫ The right plot is remade from slide 5 with an added green curve for the reshaped xgboost values; the shape is now the same between green and blue, with green having an obviously higher peak

[Plot: reshaped xgboost, 20 bins; expected significance (without MC stat, with MC stat): 2.80443, 2.76839]

SLIDE 13: Quick note on TMVA transformation

⚫ The TMVA manual (https://arxiv.org/pdf/physics/0703039.pdf) has a note in section 8.2.2 about transforming the output of a likelihood estimator, where signal and background can become squished at 0 and 1, making it hard to bin
  • Is this what is meant by "is inconvenient for use in subsequent analysis steps"?
⚫ It uses an inverse sigmoid (close to what we saw earlier) to un-squish the edges
⚫ This is not done by default for AdaBoost in TMVA as far as I can tell, so the difference in distributions I see could be a product of the different loss functions/training procedures (a backup slide gives evidence for this)
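For reference, a generic logit-style un-squishing along those lines (a sketch: the tau scale parameter and the edge clipping are my choices, not necessarily TMVA's exact implementation):

import numpy as np

def unsquish(y, tau=1.0, eps=1e-6):
    # inverse sigmoid: spreads out scores piled up near 0 and 1
    y = np.clip(y, eps, 1.0 - eps)  # avoid log(0) at the edges
    return -np.log(1.0 / y - 1.0) / tau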

SLIDE 14: Run combine some more

⚫ Run combine some more with the exact xgboost -> TMVA mapping
⚫ Flippable with slide 8
⚫ Clear improvement of xgboost over TMVA on average in the bottom middle plot (the distribution of expected significances with MC stat)

SLIDE 15: Dependence on bin count

⚫ Plot the expected significance for TMVA (left) and xgboost (right)
⚫ The significance values are fairly stable across a number of bins ranging from 10 to 24
⚫ Summary plot of significance vs nbins on the right
  • Comparing green (xgb) and blue (TMVA), not much of an increase for TMVA after ~13 bins, and ~18 bins for xgboost

SLIDE 16: Effect of MC stats

⚫ Now plot the difference between the expected significance without MC statistics nuisances and with, as a function of the number of bins
⚫ Now, for both TMVA and xgboost, the difference decreases a little bit going from 10 to 24 bins
  • I'm not sure why it doesn't increase

SLIDE 17: Revisit the reshaping

⚫ Currently, the reshaping requires the TMVA discriminant shape as an input, which is not ideal if we just want a simple formula/prescription for making the xgboost output give good results
⚫ Parameterize the mapping between equally-spaced bin thresholds (x-axis) and good bin thresholds for xgboost (y-axis) as a sigmoid:

  σ(x) = 1/(1 + e^(−k(x − 1/2))) + (2x − 1)/(1 + e^(k/2))

  • Actually, not quite a sigmoid, since a sigmoid doesn't pass through (0,0) and (1,1), so we add a linear correction term (the second term above) to guarantee that it does
  • Sharpness is controlled by the parameter k. When k is large (red), we bin more finely in the left and right extremes of the xgboost distribution. When k=0, σ(x)=x and we don't modify the bin thresholds at all.
⚫ Now we want to find the best k for a given number of bins
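A quick numerical check that the linear correction pins the endpoints; this matches the get_sigmoid_binning code on slide 20:

import numpy as np

def sigma(x, k):
    # sigmoid plus a linear correction so that sigma(0)=0 and sigma(1)=1
    return 1.0/(1 + np.exp(-k*(x - 0.5))) + 2.0*(x - 0.5)/(1 + np.exp(k/2))

for k in (0.0, 8.3, 14.0):
    print(k, sigma(0.0, k), sigma(1.0, k))  # 0.0 and 1.0 (up to float rounding) for any k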

SLIDE 18: Run combine some more

⚫ Run a combine scan with "sigmoid binning" for the xgboost discriminator, parameterized by the number of bins and k
⚫ Higher values of k lead to higher expected significance, but this isn't necessary (left)
⚫ Don't need many bins to get a high expected significance (if k is chosen appropriately)

SLIDE 19: Formula for k

⚫ Plot the median value of k for a given number of bins, and fit a line to the best performers (σ > 2.70) to get the table on the right
⚫ For higher nbins, a lower k is favored, which means we don't split up the signal on the right as much -> not affected by MC stat issues as much as we would have thought
⚫ Pick a conservative number of bins (13) to get k=8.3, compute the bin edges, and we get a significance of 2.74 in the bottom right, only ~1% lower than the 20-bin case from slide 12

[Plot: expected significance (without MC stat, with MC stat): 2.78758, 2.73789]

nbins  k
10     9.0
11     8.8
12     8.5
13     8.3
14     8.1
15     7.9
16     7.7
17     7.4
18     7.2
19     7.0
20     6.8
21     6.5
22     6.3
23     6.1
24     5.9

SLIDE 20: Analysis workflow

⚫ Now the BDT workflow is
  • Train xgboost
  • Run simple limits scanning over (k, nbins) in order to get the "sigmoid bin thresholds"
  • Find the optimal k given nbins
  • Run the full analysis scanning over nbins of, say, 12…18 and pick the best one
  • Prediction is ~4-5x faster than TMVA because the discriminator function is stupidly simple (no templates, no classes), though making the function in the first place required lots of string parsing because of the xgboost version from CMSSW
⚫ Simple function (below) to get the sigmoid binning after modifying two empirically measured parameters
⚫ Caveat for now: have to do bins[1] = 0.5*(bins[1]+bins[2]) afterwards, since sometimes the first bin has low background statistics, so we move its edge closer to the second bin's
⚫ This is an artifact of the sigmoid scaling being symmetric. We probably don't want to chop up the lower discriminant values too much.

import numpy as np

def get_sigmoid_binning(nbins):
    x = np.linspace(0., 1., nbins + 1)
    k = 11.2 - 0.22 * nbins  # rederive after re-training
    bins = 1.0/(1 + np.exp(-k*(x - 0.5))) + 2.0*(x - 0.5)/(1 + np.exp(k/2))
    return bins
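Usage, with the first-bin caveat from the bullets above applied (a sketch):

bins = get_sigmoid_binning(13)       # 14 edges -> 13 bins
bins[1] = 0.5 * (bins[1] + bins[2])  # widen the low-stats first bin by moving its upper edge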

SLIDE 21: Full limits vs nbins

⚫ Run the full looper with all systematics/nominal selections to get the final significances
⚫ Example 13-bin discriminator (+1 bin for CRZ = 14 bins total) on the right
⚫ Repeat for 10-20 bins (excluding CRZ) and plot the significances for 2016, 2017, 2018 vs nbins (left) and combined (right)
⚫ There's a lot of variance from point to point, and the combined significance gain is <2% for the BDT wrt cut-based
⚫ Though, at first glance, the nuisances look fine in the cards
  • The nuisances themselves could severely weaken the simpler conclusions from before, but 2% vs 20% is a big difference… investigating this

SLIDE 22: Summary

⚫ With simple BDT optimization, xgboost shows a few % improvement in expected significance over TMVA
⚫ The BDT in general showed a ~20% gain in significance over cut-based (in the simplified combine setup)
⚫ We have a more sensible procedure for choosing the binning (no more random cut values)
⚫ Ultimately, the BDT-based significances in the full analysis aren't significantly better than cut-based; this is something I'm looking into

SLIDE 23: Backup

SLIDE 24: Other functions

⚫ The optimal stretching function doesn't need to be a sigmoid
  • Just s-like and passing through (0,0) and (1,1)
⚫ Tried the y(x) below, which is a linear combination of a function that passes through (0,0) and (1,1) and a regular sigmoid
⚫ More complicated than just a sigmoid, and it slightly deviates from (0,0) and (1,1) (needs to be fixed by hand afterwards)

  y1(x) = (1/2)·(1 + k1(2x − 1)/(k1 − |2x − 1| + 1))
  y2(x) = 1/(1 + e^(−k2(x − 1/2)))
  y(x) = f1·y1(x) + (1 − f1)·y2(x)

  with k1 ∈ [−1.7, −1), k2 ∈ [7, 14], f1 ∈ [−0.5, 0.5]
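A sketch of this family in code; note the absolute value in y1 is reconstructed here from the requirement that y1 pass exactly through (0,0) and (1,1) for k1 < −1:

import numpy as np

def y1(x, k1):
    # tunable s-curve; hits (0,0) and (1,1) exactly for k1 < -1
    u = 2.0 * x - 1.0
    return 0.5 * (1.0 + k1 * u / (k1 + 1.0 - np.abs(u)))

def y2(x, k2):
    # plain sigmoid; only approximately hits the endpoints
    return 1.0 / (1.0 + np.exp(-k2 * (x - 0.5)))

def y(x, k1=-1.5, k2=10.0, f1=0.2):
    return f1 * y1(x, k1) + (1.0 - f1) * y2(x, k2)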

SLIDE 25: Simple combine with cut-based SRs

⚫ Try the simple combine prescription (5+1 processes) with the cut-based SRs
  • no nuisances: 2.28648
  • MC stat nuisances: 2.24895
⚫ There are 17 bins here (CRW + 16 SR bins); no CRZ here
⚫ The BDT gets ~2.73, which is ~20% better than cut-based

SLIDE 26: scikit-learn adaboost vs gradient boost

⚫ Try AdaBoost and gradient boosting from the scikit-learn package with default parameters (toning down the number of trees to save time)
⚫ The shape difference is a consequence of the loss function/algorithm, since I see the same in TMVA vs xgboost
⚫ Even though the AUC is the same in both cases below, a maximum likelihood fit will prefer a signal shape like the one on the left

[Plots: AdaBoost (left), gradient boost (right)]
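A minimal sketch of the comparison (scikit-learn defaults with a reduced tree count; the arrays X, y are assumptions):

from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier

# X: feature matrix, y: 0/1 labels, as elsewhere in these slides
ada = AdaBoostClassifier(n_estimators=100).fit(X, y)
gbt = GradientBoostingClassifier(n_estimators=100).fit(X, y)

# compare the discriminator shapes for the two algorithms
p_ada = ada.predict_proba(X)[:, 1]
p_gbt = gbt.predict_proba(X)[:, 1]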