

SLIDE 1

EML Update

Nick Amin July 10, 2018

SLIDE 2: Overview

⚫ Last update (SNT), also presented at the ML workshop
⚫ Feedback from the ML talk:
  • There were some concerns I might be taking advantage of b,c→e (or just being unfair in general) if my network is learning isolation
    • Addressed in the next few slides
  • Consider tracks for full electron ID, if that's the desired direction
    • Claimed before that raw track information underperformed wrt the 21-variable MVA (BDT-based)
    • Try to replicate the MVA via a DNN rather than a BDT, and repeat the check using raw track information with the latest architecture, changes, etc. to confirm
  • Could also do e vs γ
    • This has already been done?
    • Barrel only, and 32x32 crystals
    • For each crystal, they consider energies and pulse profiles → CNNs, LSTMs
⚫ Note: the rest of the slides here consider the barrel only


SLIDE 3: Network performance vs flavor

⚫ Is the network using isolation? (29x15 window used, vs σiηiη's 5x5)
⚫ In DY, virtually all events are "truth-matched" to not be in the b,c→e category, so we can separate out the two to see how important isolation would be
⚫ Comparing the AUCs in the legends, performance against the b,c→e background is actually worse than against the unmatched background in all pT bins except pT>45
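As an illustration (not from the slides), a minimal sketch of how the per-category AUCs could be computed, assuming numpy arrays sig and bkg of classifier scores and a boolean mask is_bce flagging b,c→e truth-matched background events:

import numpy as np
from sklearn.metrics import roc_auc_score

def auc_vs_category(sig_scores, bkg_scores, bkg_mask):
    # AUC of signal against only the background subset selected by bkg_mask
    scores = np.concatenate([sig_scores, bkg_scores[bkg_mask]])
    labels = np.concatenate([np.ones(len(sig_scores)),
                             np.zeros(bkg_mask.sum())])
    return roc_auc_score(labels, scores)

# auc_all       = auc_vs_category(sig, bkg, np.ones(len(bkg), dtype=bool))
# auc_unmatched = auc_vs_category(sig, bkg, ~is_bce)
# auc_bce       = auc_vs_category(sig, bkg, is_bce)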

[Figure: ROC curves per pT bin for signal vs all, signal vs unmatched, and signal vs b,c→e backgrounds]

SLIDE 4: Training/testing in tt̅

⚫ Switch from DY to tt̅
⚫ From the pictures below, the network should have no trouble distinguishing isolated/non-isolated electron candidates, since we're using a large 29x15 window around the seed

[Figure: average images for signal, b/c, and unmatched background, plus bkg and sig examples for 20<pT<25]

SLIDE 5: Training/testing in tt̅

⚫ σiηiη (calculated from 5x5 crystals) does slightly better than in DY
⚫ The CNN shows a large improvement

[Figure: ROC curves for signal vs all, signal vs unmatched, and signal vs b,c→e; shape BDT vs σiηiη; shape BDT vs CNN; CNN vs σiηiη]

SLIDE 6: Training/testing in DY with SC

⚫ Now, as an exercise for just this slide and the next, we try training/testing with electron images made only from supercluster cells/energies (implementation in backup)
  • This should make it so isolation can't be learned by the CNN, as isolation quantities do not consider deposits belonging to the supercluster
⚫ Both signal and background images become sparser

[Figure: bkg and sig examples for 20<pT<25, and average bkg/sig images, for all cells vs SC only]

SLIDE 7: Training/testing in DY with SC

⚫ Using SC-only cells degrades the performance wrt the original implementation ("all cells" in 29x15 around the seed) → ~half of the gain is lost
⚫ Below, show the CNN, σiηiη, and a 6-variable BDT trained on shower shape variables

[Figure: ROC curves for CNN vs 6-var BDT, 6-var BDT vs σiηiη, and CNN vs σiηiη, for all cells and SC only]

For a background efficiency of 20%, signal efficiency increase of the CNN wrt the BDT:

              all cells   SC only
  pT<15          25%        11%
  15<pT<25       10%         5%
  25<pT<45       10%         6%
  pT>45           9%         5%
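As a sketch of how such numbers can be extracted (assumed per-event score arrays; not the author's actual code), the signal efficiency at a fixed 20% background efficiency is just the signal fraction above the background's 80% score quantile:

import numpy as np

def sig_eff_at_bkg_eff(sig_scores, bkg_scores, bkg_eff=0.20):
    # Score threshold that keeps a fraction bkg_eff of the background
    cut = np.quantile(bkg_scores, 1.0 - bkg_eff)
    # Fraction of signal surviving the same threshold
    return float(np.mean(sig_scores > cut))

# Increase of the CNN wrt the BDT at 20% background efficiency:
# gain = sig_eff_at_bkg_eff(sig_cnn, bkg_cnn) - sig_eff_at_bkg_eff(sig_bdt, bkg_bdt)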

SLIDE 8: Tracks

SLIDE 9: Replicating MVA with DNN

⚫ First, see whether feeding the 21 MVA inputs into a DNN achieves performance comparable to the (BDT-trained) MVA
  • With a 100k-parameter network of purely fully-connected layers, achieve similar AUC to the BDT
  • Train with 2M samples, test on 1M
  • Tried a 400k-parameter network and saw the same results, but have not yet tried a network smaller than 100k parameters
  • Log scale amplifies the slightly worse ROC curves for the DNN, but performance is ~identical (compare AUC values in the legend)

[Figure: ROC curves for BDT and DNN, on linear and log scales]

DNN architecture (parameter counts in brackets):
InputLayer (21)
→ Dense(128), Dropout(0.1), LeakyReLU [2816]
→ Dense(256), Dropout(0.1), LeakyReLU [33024]
→ Dense(128), Dropout(0.1), LeakyReLU [32896]
→ Dense(128), LeakyReLU [16512]
→ Dense(64), LeakyReLU [8256]
→ Dense(32), LeakyReLU [2080]
→ Dense(2) [66]
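A minimal Keras sketch of this network; the layer sizes follow the diagram above, while the softmax output, loss, and optimizer are assumptions not shown on the slide:

from tensorflow.keras.layers import Input, Dense, Dropout, LeakyReLU
from tensorflow.keras.models import Sequential

model = Sequential([
    Input(shape=(21,)),                    # the 21 MVA input variables
    Dense(128), Dropout(0.1), LeakyReLU(),
    Dense(256), Dropout(0.1), LeakyReLU(),
    Dense(128), Dropout(0.1), LeakyReLU(),
    Dense(128), LeakyReLU(),
    Dense(64), LeakyReLU(),
    Dense(32), LeakyReLU(),
    Dense(2, activation="softmax"),        # signal / background
])
model.compile(optimizer="adam", loss="categorical_crossentropy")
# model.count_params() is ~96k, consistent with the ~100k quoted above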

SLIDE 10: Track information

⚫ Now try to feed "lower-level" track information into the network
⚫ 19 variables:
  • 5 pre-computed position/momentum triplets (R, η, φ)
  • supercluster η, φ, E
  • charge
⚫ Subtract the supercluster η, φ from the η, φ components of the triplets

Reference: https://github.com/cms-sw/cmssw/blob/0b70aea1b7723a6dfd453d9d015b670d0f735256/DataFormats/EgammaCandidates/interface/GsfElectron.h#L279-L283

math::XYZPointF  positionAtVtx ;   // the track PCA to the beam spot
math::XYZPointF  positionAtCalo ;  // the track PCA to the supercluster position
math::XYZVectorF momentumAtVtx ;   // the track momentum at the PCA to the beam spot
math::XYZVectorF momentumAtCalo ;  // the track momentum extrapolated at the supercluster position from the innermost track state
math::XYZVectorF momentumOut ;     // the track momentum extrapolated at the seed cluster position from the outermost track state
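A minimal sketch (not the author's code) of assembling the 19 inputs from these five vectors plus the supercluster quantities; the (x,y,z)→(R,η,φ) conversion and the η/φ subtraction follow the bullets above, and all names are illustrative:

import numpy as np

def to_r_eta_phi(x, y, z):
    r = np.sqrt(x*x + y*y + z*z)
    theta = np.arccos(z / r)
    eta = -np.log(np.tan(theta / 2.0))
    phi = np.arctan2(y, x)
    return r, eta, phi

def delta_phi(a, b):
    # Wrap the phi difference into (-pi, pi]
    return (a - b + np.pi) % (2.0 * np.pi) - np.pi

def track_features(vectors_xyz, sc_eta, sc_phi, sc_e, charge):
    # vectors_xyz: the five position/momentum (x,y,z) triplets above
    feats = []
    for (x, y, z) in vectors_xyz:
        r, eta, phi = to_r_eta_phi(x, y, z)
        feats += [r, eta - sc_eta, delta_phi(phi, sc_phi)]
    return np.array(feats + [sc_eta, sc_phi, sc_e, charge])  # 5*3 + 4 = 19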

SLIDE 11: Track information

⚫ With the previous positions/momenta, we can compute the red track variables
  • Coupled with the shape information from the CNN, this should account for nearly all of the performance of the MVA


MVA variable ranking, from most to least important:
1. ESC / PCA momentum
2. 1/E - 1/p
3. σiηiη
4. Δφ(track, SC)
5. Brem. fraction
6. Δη(track, SC)
7. SC η width
8. SC φ width
9. E3x3 / ESC
10. σiφiφ
11. ESC / calo momentum
12. N CTF hits
13. GSF χ²
14. N GSF hits
15. Epreshower / ESC
16. H/E
17. Δη(calo track, seed)
18. SC circularity
19. CTF χ²
20. missing inner hits
21. conversion vertex probability
SLIDE 12: Track variable distributions

⚫ After reweighting background to signal, split by charge

[Figure: distributions of R, Δη(·, SC), Δφ(·, SC) for vtx pos., calo pos., vtx mom., calo mom., outer mom.]
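The slides do not show the reweighting itself; a minimal histogram-ratio sketch, assuming the background is reweighted to the signal in a single variable (e.g. pT, an assumption, since the slide does not name it):

import numpy as np

def weights_to_match_signal(bkg_var, sig_var, bins=50):
    # Common binning over both samples
    edges = np.histogram_bin_edges(np.concatenate([sig_var, bkg_var]), bins=bins)
    sig_h, _ = np.histogram(sig_var, bins=edges, density=True)
    bkg_h, _ = np.histogram(bkg_var, bins=edges, density=True)
    # Per-bin signal/background ratio, zero where the background is empty
    ratio = np.divide(sig_h, bkg_h, out=np.zeros_like(sig_h), where=bkg_h > 0)
    # Assign each background event the weight of its bin
    idx = np.clip(np.digitize(bkg_var, edges) - 1, 0, len(ratio) - 1)
    return ratio[idx]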
SLIDE 13: Training & Results

Two-prong architecture (parameter counts in brackets):

CNN prong:
InputLayer (15,29,1)
→ Conv2D 3x3, LeakyReLU → (15,29,32) [320]
→ MaxPooling2D → (7,14,32)
→ Conv2D 3x3, LeakyReLU → (7,14,64) [18496]
→ MaxPooling2D → (3,7,64)
→ Conv2D 3x3, Dropout(0.2), LeakyReLU → (3,7,16) [9232]
→ Flatten → (336)

Track prong:
InputLayer (19)
→ Dense(128), Dropout(0.1), LeakyReLU [2560]
→ Dense(256), Dropout(0.1), LeakyReLU [33024]
→ Dense(512), Dropout(0.1), LeakyReLU [131584]
→ Dense(256), Dropout(0.1), LeakyReLU [131328]
→ Dense(128), Dropout(0.1), LeakyReLU [32896]
→ Dense(64), LeakyReLU [8256]

Merged:
Concatenate → (400)
→ Dense(150), Dropout(0.3), LeakyReLU [60150]
→ Dense(50), Dropout(0.1), LeakyReLU [7550]
→ Dense(15), Dropout(0.1), LeakyReLU [765]
→ Dense(2) [32]

⚫ Using a two-prong network (inputs: the 15x29 image and the 19 track variables)
  • ~440k parameters
  • No batch normalization, as I found it to be unstable
⚫ At 10% bkg efficiency, the signal efficiency for the "NN" is ~2-8% worse than the full MVA
⚫ Can the remaining/lower-ranked variables make up this difference?
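A minimal Keras functional-API sketch of the two-prong network, with layer sizes following the diagram above; padding, pooling strides, and the output activation are assumptions:

from tensorflow.keras.layers import (Concatenate, Conv2D, Dense, Dropout,
                                     Flatten, Input, LeakyReLU, MaxPooling2D)
from tensorflow.keras.models import Model

# CNN prong: 15x29 crystal image
img_in = Input(shape=(15, 29, 1))
x = LeakyReLU()(Conv2D(32, 3, padding="same")(img_in))  # (15,29,32)
x = MaxPooling2D(2)(x)                                  # (7,14,32)
x = LeakyReLU()(Conv2D(64, 3, padding="same")(x))       # (7,14,64)
x = MaxPooling2D(2)(x)                                  # (3,7,64)
x = LeakyReLU()(Conv2D(16, 3, padding="same")(x))       # (3,7,16)
x = Flatten()(Dropout(0.2)(x))                          # 336

# Dense prong: 19 track variables
trk_in = Input(shape=(19,))
t = trk_in
for units, drop in [(128, 0.1), (256, 0.1), (512, 0.1), (256, 0.1), (128, 0.1)]:
    t = LeakyReLU()(Dropout(drop)(Dense(units)(t)))
t = LeakyReLU()(Dense(64)(t))

# Merge the two prongs and classify
h = Concatenate()([t, x])                               # 400
for units, drop in [(150, 0.3), (50, 0.1), (15, 0.1)]:
    h = LeakyReLU()(Dropout(drop)(Dense(units)(h)))
out = Dense(2, activation="softmax")(h)

model = Model(inputs=[img_in, trk_in], outputs=out)     # ~440k parameters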

SLIDE 14: Appending remaining variables

[Architecture: same two-prong network as Slide 13, with the dense-branch input extended from 19 track variables to 19 track variables + 9 MVA variables]

⚫ Now take the network from the previous slide, append 9 variables from the MVA (not covered by the CNN shape information or the 19 track variables), and retrain to see the effect of the "lower-ranked" variables
  • AUC improves noticeably over the previous slide (flippable): together, these lower-ranked variables are not negligible
  • Still, after including this information, the network is not matching the performance of the MVA, so it is not fully utilizing the 19 raw track variables that we are feeding in (?)
⚫ Note, another way of viewing this network/training configuration…
  • Same as the 21-variable MVA/BDT, except the 6 shape variables are replaced by a CNN on raw 29x15 crystals, and the 6 high-level track variables are replaced by 19 raw track variables… and the performance is slightly worse?

SLIDE 15: Next steps

⚫ Try photons
⚫ Eventually get back to endcap training


SLIDE 16: Backup

SLIDE 17: All cells vs SC reference

⚫ Implementation of "SC-only" cells for reference, as there could be a subtlety with the

"hits and fractions" for the supercluster

  • For "all cells" it’s easy — just look at all the rechits with energies per crystal
  • For "SC only", need to get a list of the rechit IDs and energy fractions, then reference

this to the full list of energies per crystal to get energies associated to the supercluster


// Get all EB rechits and store them in an (ieta,iphi) -> energy map
std::map<std::pair<int,int>, float> ietaiphi_to_energy;
auto rechits = lazyToolnoZS->getEcalEBRecHitCollection();
for (EBRecHitCollection::const_iterator it = rechits->begin(); it != rechits->end(); ++it) {
    int hit_ieta = EBDetId(it->detid()).ieta();
    int hit_iphi = EBDetId(it->detid()).iphi();
    float energy = it->energy();
    ietaiphi_to_energy[{hit_ieta, hit_iphi}] = energy;
}

// Get all hits & fractions for the supercluster.
// For each hit, look up the cell energy from the previous map and multiply
// by the fraction. This is then the energy that goes into an ieta/iphi cell.
auto supercluster = pat_ele->superCluster();
std::vector<std::pair<DetId,float> > hfSC = supercluster->hitsAndFractions();
for (std::vector<std::pair<DetId,float> >::const_iterator it = hfSC.begin(); it != hfSC.end(); ++it) {
    DetId id = (*it).first;
    if (!(id.subdetId() == EcalBarrel)) continue;
    int ieta = EBDetId(id).ieta();
    int iphi = EBDetId(id).iphi();
    float rawenergy = ietaiphi_to_energy[{ieta, iphi}];
    float frac = (*it).second;
    float energy = rawenergy * frac;
    rhs_e.push_back(energy);
    rhs_iphi.push_back(iphi);
    rhs_ieta.push_back(ieta);
}

SLIDE 18: 5x5 CNN

⚫ How does an SC-only 5x5 CNN compare to the BDT?
  • 36k params, 2M training examples, 1M test
⚫ The CNN is worse than the 6-variable shape BDT

Architecture (parameter counts in brackets, where shown on the slide):
InputLayer (5,5,1)
→ Conv2D 3x3, LeakyReLU → (5,5,32) [320]
→ MaxPooling2D → (2,2,32)
→ Conv2D 3x3, LeakyReLU → (2,2,64) [18496]
→ MaxPooling2D → (1,1,64)
→ Conv2D 3x3, Dropout(0.2), LeakyReLU → (1,1,16) [9232]
→ Flatten → (16)
→ Dense(150), Dropout(0.3), LeakyReLU
→ Dense(50), Dropout(0.1), LeakyReLU [7550]
→ Dense(15), Dropout(0.1), LeakyReLU [765]
→ Dense(2) [32]
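For completeness, a minimal Keras sketch of this 5x5 CNN under the same assumptions as the earlier sketches (same-padding convolutions, softmax output):

from tensorflow.keras.layers import (Conv2D, Dense, Dropout, Flatten, Input,
                                     LeakyReLU, MaxPooling2D)
from tensorflow.keras.models import Sequential

model = Sequential([
    Input(shape=(5, 5, 1)),                    # SC-only 5x5 image
    Conv2D(32, 3, padding="same"), LeakyReLU(),
    MaxPooling2D(2),                           # (2,2,32)
    Conv2D(64, 3, padding="same"), LeakyReLU(),
    MaxPooling2D(2),                           # (1,1,64)
    Conv2D(16, 3, padding="same"), Dropout(0.2), LeakyReLU(),
    Flatten(),                                 # 16
    Dense(150), Dropout(0.3), LeakyReLU(),
    Dense(50), Dropout(0.1), LeakyReLU(),
    Dense(15), Dropout(0.1), LeakyReLU(),
    Dense(2, activation="softmax"),
])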