Physics, Big Data Analysis and Philosophy (and all the rest) - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Physics, Big Data Analysis and Philosophy (and all the rest)

Wolfgang Rhode

slide-2
SLIDE 2

... eritis sicut Deus ...

slide-3
SLIDE 3

Plato’s Cave Analogy

slide-4
SLIDE 4

Are true conclusions possible?

  • Within the world of logic? Yes
  • Within the world of mathematics? Yes
  • Within the world of observations? No
      • Accuracy of observations
      • Conclusion from the observed effect to its cause
  • Within the world of teleology? No
      • To reach a goal? The program running on us → Big Data Analysis

slide-5
SLIDE 5

Consequence

  • Ancient and medieval philosophy:
      • Rationally invented systems based on logic and mathematics
      • However: these systems are inappropriate for describing nature.
  • Galileo Galilei:
      • Introduction of the experiment as a means to understand nature
      • The book of nature is written in the language of mathematics

slide-6
SLIDE 6

How to justify conclusions from experiments to the world of theories?

  • Success of classical physics (Newton … Einstein)
      • Btw.: Classical physics is deterministic, with no probabilistic elements
  • Aggravating: Success of Thermodynamics and Quantum Mechanics (etc.)
      • Probability-based, in very different ways.
  • Epistemology: Various attempts to answer the question, but no success … until ...

slide-7
SLIDE 7

Karl Popper: Logic of scientific discovery (1934)

  • Theories are (somehow) more or less rationally invented
  • Scientific theories allow an experimental test
      • Disagreement: rejection & search for a better theory
      • Agreement: acceptance for the moment & search for a more decisive experiment
  • Critical Rationalism
      • Logic-based theory
      • Is it necessary to reject a theory just because of one non-fitting experiment?

slide-8
SLIDE 8

Plato’s Cave Revisited

slide-9
SLIDE 9

Plato’s Cave Revisited

slide-10
SLIDE 10

Plato’s Cave Revisited

slide-11
SLIDE 11

Plato’s Cave Revisited

slide-12
SLIDE 12

The physical Problem: “Cave equation”

g = measured numbers
b = background
A = kernel = transfer function = detector properties
f = wanted function
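The equation itself did not survive on this slide. The same four symbols reappear in the later unfolding example, where the relation is written as an integral equation, so it is presumably the form meant here as well. It is repeated below only as a reading aid; the integration variable E (the true energy) and the observable y are taken from that later slide.

```latex
g(y) = \int A(E, y)\, f(E)\, \mathrm{d}E + b(y)
```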

slide-13
SLIDE 13

Computer Science View

Scientific Question

Experiment; Sensors, Data Acquisition; Data Reduction, Data Pre-Processing; Analysis; Evaluation; Regulation, Improvements; Data

Trigger; Streams; Signal Identification; Concept Shift; Storage; Inverse Problems; Evaluation; Complex Programming Models; Monte Carlo Simulation

slide-14
SLIDE 14

Monte Carlo: Virtual Reality

  • Physics … all that we know ...
      • Energy Spectra
      • Directional Distributions
      • Sky Maps
  • Data: … all we can measure ...
      • Charges
      • Times
      • Locations

slide-15
SLIDE 15

Data Analysis: Monte Carlo Description

(Goal: Measurement of the neutrino energy spectrum)

Monte Carlo simulation of:

  • Signal: Neutrinos
      • close to correct spatial and energy distribution
      • Neutrino interactions leading to charged particles (e.g. muons)
      • Muon interactions (path, range, deposited energy...)
      • Cherenkov light (production by charged particles, propagation in ice)
      • Light detection and measurement (photomultiplier, read-out electronics)
  • Physical Background: Cosmic ray interactions in the atmosphere
      • → (....) recorded light in the detector, correct simulation
  • Technical Background: correct simulation
      • Radioactivity of the ice or the detector itself → light
      • Photomultiplier noise (“signal without external reason”)
      • Artefacts of the read-out electronics → fake signal
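The first step of such a chain, drawing primary energies from an assumed spectrum, can be sketched in a few lines. The sketch below uses inverse-transform sampling from a power law dN/dE ∝ E^(-gamma). The power-law shape, the spectral index, the energy range and the function names are illustrative assumptions, not values from the talk.

```python
import random


def sample_power_law(u, e_min, e_max, gamma):
    """Map a uniform number u in [0, 1] to an energy following dN/dE ~ E^-gamma.

    Inverse-transform sampling: the cumulative distribution is inverted
    analytically. The power-law spectrum is an assumption for illustration.
    """
    if gamma == 1.0:
        # Special case: the integral of 1/E is logarithmic.
        return e_min * (e_max / e_min) ** u
    a = e_min ** (1.0 - gamma)
    b = e_max ** (1.0 - gamma)
    return (a + u * (b - a)) ** (1.0 / (1.0 - gamma))


def simulate_energies(n, e_min=1e2, e_max=1e6, gamma=2.0, seed=42):
    """Draw n energies (arbitrary units) with a reproducible random seed."""
    rng = random.Random(seed)
    return [sample_power_law(rng.random(), e_min, e_max, gamma) for _ in range(n)]


if __name__ == "__main__":
    energies = simulate_energies(5)
    print(energies)
```

A full detector simulation would pass each sampled energy on to interaction, propagation and read-out models; only the sampling step is shown here.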

slide-16
SLIDE 16

  • Generators
  • Lepton Propagator
  • Photon Propagator
  • Detector: Electronics, Trigger
  • Photonics, PPC, clsim, NuGen, CORSIKA

  • IceCube - Simulations

Atmospheric Muons (Corsika)

  • 1 Year Run Time of the Detector
  • One Configuration
slide-17
SLIDE 17

Computer Science View

Scientific Question

Experiment; Sensors, Data Acquisition; Data Reduction, Data Pre-Processing; Analysis; Evaluation; Regulation, Improvements; Data

Trigger; Streams; Signal Identification; Concept Shift; Storage; Inverse Problems; Evaluation; Complex Programming Models; Monte Carlo Simulation

slide-18
SLIDE 18

TRIGGER: Which Data might we want?

  • Keep as much of the signal as possible.
  • Discard as much of the background as possible.
  • Decide within ~1 ms
  • → Write data to disk if
      • N detectors within a
      • volume V and a
      • time interval T have seen a signal.
  • Hardware (FPGA) and software realization possible.
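The trigger condition above (N detectors within a volume V and a time interval T) can be sketched as a simple multiplicity check. This is a minimal software illustration, not the experiment's trigger: the volume is approximated by a sphere around the earliest hit of each candidate group, and the `Hit` class, the parameter names and the units are assumptions.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    x: float  # detector position in metres (assumed units)
    y: float
    z: float
    t: float  # hit time in nanoseconds (assumed units)


def multiplicity_trigger(hits, n_min, radius, window):
    """Return True if at least n_min hits fall within `window` after a seed hit
    and within `radius` of that seed hit's position."""
    ordered = sorted(hits, key=lambda h: h.t)
    for i, seed in enumerate(ordered):
        count = 0
        for other in ordered[i:]:
            if other.t - seed.t > window:
                break  # hits are time-ordered, so later ones are too late as well
            distance = math.dist((seed.x, seed.y, seed.z), (other.x, other.y, other.z))
            if distance <= radius:
                count += 1
        if count >= n_min:
            return True
    return False


if __name__ == "__main__":
    hits = [Hit(0, 0, 0, 0), Hit(10, 0, 0, 50), Hit(0, 10, 0, 90), Hit(500, 0, 0, 95)]
    print(multiplicity_trigger(hits, n_min=3, radius=50.0, window=100.0))
```

The nested loop is quadratic in the worst case; a hardware realization would typically use fixed coincidence windows instead, which is outside the scope of this sketch.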

slide-19
SLIDE 19

First Analysis Steps: Which Events might we want?

  • Still: Keep as much of the signal as possible.
  • Still: Discard as much of the background as possible.
  • Decide within ~ minutes to hours (computing farm)
  • Can the event be reconstructed at all?
  • Is the result physical? (IceCube: Movement with the velocity of light? Upward?)
  • Do different algorithms lead to compatible results?

slide-20
SLIDE 20

The Problem:

g = measured numbers
b = background
A = kernel = transfer function = detector properties
f = wanted function

slide-21
SLIDE 21

Signal – Background Separation

slide-22
SLIDE 22

How to obtain a clean signal data set?

slide-23
SLIDE 23

[Figure labels: N, S; CC interaction, µ-neutrino, atmospheric µ; badly reconstructed µ, well reconstructed µ]

  • Track length > 200 m
  • Track interruption < 400 m
  • Zenith angle > 86 deg
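The three cut values on this slide can be read as a straight pre-selection applied to every reconstructed event. A minimal sketch follows, with the thresholds taken from the slide; the `Event` class and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Event:
    track_length_m: float  # reconstructed track length
    track_gap_m: float     # "track interruption" along the reconstructed track
    zenith_deg: float      # reconstructed zenith angle


def passes_precuts(event):
    """Apply the three straight cuts listed on the slide."""
    return (
        event.track_length_m > 200.0
        and event.track_gap_m < 400.0
        and event.zenith_deg > 86.0
    )


if __name__ == "__main__":
    events = [Event(350.0, 120.0, 110.0), Event(150.0, 50.0, 95.0), Event(500.0, 80.0, 40.0)]
    selected = [ev for ev in events if passes_precuts(ev)]
    print(len(selected), "of", len(events), "events kept")
```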

slide-24
SLIDE 24

  • High-dimensional data set: many redundant, irrelevant, noisy variables
  • Maybe data and Monte Carlo do not fit at all in certain variables
  • “Many dimensions” complicate and maybe prevent the separation
  • → Selection of a small number of appropriate variables

slide-25
SLIDE 25

Variable Selection

  • Data / Monte Carlo comparison: remove ill variables from the analysis
      • Calculations correct
      • Approximations OK
      • Procedures applicable to data and Monte Carlo

slide-26
SLIDE 26

Variable Selection

  • Data / Monte Carlo comparison: remove ill variables from the analysis
  • Remove redundant and meaningless variables

slide-27
SLIDE 27

Variable Selection

  • Data / Monte Carlo comparison: remove ill variables from the analysis
  • Remove redundant and meaningless variables
  • MRMR Feature Selection

slide-28
SLIDE 28

Jaccard Index:

$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$

Kuncheva’s Index:

$I_C(A, B) = \frac{r n - k^2}{k (n - k)}, \quad |A| = |B| = k, \quad r = |A \cap B|$

Stability of the MRMR Selection:

MRMR: Minimum Redundancy Maximum Relevance
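Both stability measures compare two selected variable sets A and B, for example the outcomes of two MRMR runs on different subsamples. A minimal sketch of the two indices follows. For Kuncheva's index, n is taken to be the total number of candidate variables (the standard definition, which is not spelled out on the slide); the function names are illustrative.

```python
def jaccard_index(a, b):
    """|A ∩ B| / |A ∪ B| for two collections of selected variables."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty selections are treated as identical
    return len(a & b) / len(a | b)


def kuncheva_index(a, b, n):
    """(r*n - k^2) / (k*(n - k)) with k = |A| = |B| and r = |A ∩ B|.

    n is the total number of candidate variables (assumed meaning of n).
    """
    a, b = set(a), set(b)
    k = len(a)
    if len(b) != k:
        raise ValueError("both selections must have the same size k")
    if not 0 < k < n:
        raise ValueError("need 0 < k < n")
    r = len(a & b)
    return (r * n - k * k) / (k * (n - k))


if __name__ == "__main__":
    run_1 = {"n_hits", "track_length", "zenith"}
    run_2 = {"track_length", "zenith", "n_direct"}
    print(jaccard_index(run_1, run_2))
    print(kuncheva_index(run_1, run_2, n=10))
```

Unlike the Jaccard index, Kuncheva's index corrects for the overlap expected by chance, so two random selections of size k score around zero.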

slide-29
SLIDE 29

Variable Selection

  • Data / Monte Carlo comparison: remove ill variables from the analysis
  • Remove redundant and meaningless variables
  • MRMR Feature Selection

Dimensions: M ∼ 2000 → M ∼ 120 (after steps 1 & 2) → M = 30 (after step 3)

slide-30
SLIDE 30

  • Random Forest or other data mining method
  • Searched for is a model recognizing structures appropriate for signal-background separation from Monte Carlo
  • Important: Monte Carlo carries labels for signal and background
  • Example here: Random Forest
  • Prediction: estimation between 0 and 1 for the “signalness” of a single event
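The "signalness" of a single event can be pictured as the fraction of trees in the forest that vote for signal. The toy sketch below uses three hand-written decision stumps in place of trained trees; the thresholds, the variable names and the averaging rule are illustrative assumptions, not the analysis described in the talk.

```python
def signalness(trees, event):
    """Fraction of trees voting 'signal' (1) for one event: a score in [0, 1]."""
    if not trees:
        raise ValueError("forest must contain at least one tree")
    votes = [tree(event) for tree in trees]
    return sum(votes) / len(votes)


# Hypothetical stand-ins for trained trees: each returns 1 (signal) or 0 (background).
toy_forest = [
    lambda ev: 1 if ev["zenith_deg"] > 90.0 else 0,
    lambda ev: 1 if ev["track_length_m"] > 300.0 else 0,
    lambda ev: 1 if ev["n_direct_photons"] > 10 else 0,
]

if __name__ == "__main__":
    event = {"zenith_deg": 120.0, "track_length_m": 250.0, "n_direct_photons": 15}
    print(signalness(toy_forest, event))
```

In a real analysis the trees would be trained on labelled Monte Carlo events, and a cut on the score would then trade signal efficiency against purity.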

slide-31
SLIDE 31

Random Forest

Attributes at knot Attributes for the complete RF Signal/Background -Relation at training

slide-32
SLIDE 32

Signal Selection


slide-33
SLIDE 33

Test of “Stability” and “Overtraining”

Error Estimation: Cross Validation
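Cross validation estimates the error by repeatedly training on one part of the labelled sample and testing on the rest. Below is a minimal k-fold sketch using only the standard library. The `fit_and_score` callback is a hypothetical placeholder for "train the classifier on the training indices and return its score on the test indices".

```python
import random
import statistics


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train, test) index lists for k disjoint folds of a shuffled sample."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def cross_validate(n_samples, k, fit_and_score, seed=0):
    """Return mean and standard deviation of the per-fold scores."""
    scores = [fit_and_score(train, test) for train, test in k_fold_indices(n_samples, k, seed)]
    return statistics.mean(scores), statistics.stdev(scores)


if __name__ == "__main__":
    # Dummy scorer for demonstration only: the fraction of events held out per fold.
    mean, spread = cross_validate(100, 5, lambda train, test: len(test) / 100)
    print(mean, spread)
```

The spread of the per-fold scores gives a handle on stability, and a training score much better than the test score indicates overtraining.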

slide-34
SLIDE 34

[Figure: data distribution, complete range]

  • 99.6 ± 0.2 % purity of muon-neutrino events
  • 99.9999 % background rejected
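The two quality figures quoted on this slide follow from simple event counts. A minimal sketch of the definitions is given below; the function names are illustrative and the counts in the usage example are invented for demonstration, not the numbers behind the slide.

```python
def purity(n_signal_selected, n_background_selected):
    """Fraction of selected events that are true signal."""
    total = n_signal_selected + n_background_selected
    if total == 0:
        raise ValueError("no events selected")
    return n_signal_selected / total


def background_rejection(n_background_selected, n_background_total):
    """Fraction of all background events removed by the selection."""
    if n_background_total == 0:
        raise ValueError("no background events in the sample")
    return 1.0 - n_background_selected / n_background_total


if __name__ == "__main__":
    # Invented counts, chosen only to illustrate the arithmetic.
    print(purity(996, 4))
    print(background_rejection(1, 1_000_000))
```

On Monte Carlo the true labels are known, so both numbers can be computed directly; their uncertainties can be taken from the spread over cross-validation folds.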

slide-35
SLIDE 35

A new way of doing data science

  • Classical:
      • Simple Signature
      • → Physical Interpretation: Information loss
      • → Continue as Human (Physicist ;)
      • → Compare to a predictive Theory
  • New:
      • Multidimensional Signature
      • → Probability Cloud
      • → Find & Calculate & Add Relevant Attributes
      • → Remove: Irrelevant Attributes / Information Noise
      • → Unfolding / Inverse Problem:
          • → Collapse of the Probability Cloud into a Physical Distribution
      • → Compare to a structural Theory
          • → Limit Ranges of Physically Relevant Parameters

http://cerncourier.com/cws/article/cern/27925

slide-36
SLIDE 36

Phase 2: Unfolding

slide-37
SLIDE 37

Unfolding …

slide-38
SLIDE 38

The problem, to be filled with life:

g = measured numbers
b = background, small
A = kernel = transfer function = detector properties
f = wanted physical distribution

slide-39
SLIDE 39

Change of paradigm!

  • Separation:
      • Monte Carlo describes the data as perfectly as possible.
  • Unfolding:
      • Monte Carlo description of the detector is “perfect”
      • However, the searched-for physical input is unknown!
  • Completely unknown?
      • … good guessing allowed (a good start value saves CPU time)
      • ... prove that the unfolding algorithm works independently of the Monte Carlo

slide-40
SLIDE 40

Are there variables correlated with the searched-for energy?

  • Number of hit detectors...
  • ... and many others

slide-41
SLIDE 41

  • Number of hit detectors...
  • ... and many others

Are there variables correlated with the searched-for energy?

slide-42
SLIDE 42

Inverse Problems: ill-posed, bad condition number

Translate to the matrix equation:

g = Af + b

f = A^{-1}(g - b)

Damp elements of A, leading to oscillating results:

Regularization ↔ Addition of Assumptions
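The effect of regularization can be illustrated on a tiny matrix problem. The sketch below uses Tikhonov regularization, i.e. it solves (AᵀA + τI) f = Aᵀ(g − b). This specific scheme is chosen here only for illustration; the talk does not state which regularization is used. The added assumption is that solutions with a small norm are preferred, and τ = 0 reproduces the plain inversion f = A⁻¹(g − b). All function names are illustrative.

```python
def transpose(m):
    return [list(row) for row in zip(*m)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def matvec(a, v):
    return [sum(x * y for x, y in zip(row, v)) for row in a]


def solve(m, rhs):
    """Solve m x = rhs by Gaussian elimination with partial pivoting."""
    n = len(m)
    aug = [list(row) + [rhs[i]] for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular to working precision")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x


def unfold_tikhonov(A, g, b, tau):
    """Estimate f from g = A f + b with regularization strength tau >= 0."""
    At = transpose(A)
    AtA = matmul(At, A)
    n = len(AtA)
    lhs = [[AtA[i][j] + (tau if i == j else 0.0) for j in range(n)] for i in range(n)]
    rhs = matvec(At, [gi - bi for gi, bi in zip(g, b)])
    return solve(lhs, rhs)


if __name__ == "__main__":
    A = [[0.8, 0.2], [0.2, 0.8]]  # toy detector response: 20 % bin migration
    g = [91.0, 61.0]              # measured counts
    b = [1.0, 1.0]                # background
    print(unfold_tikhonov(A, g, b, tau=0.0))
    print(unfold_tikhonov(A, g, b, tau=1.0))
```

With τ = 0 the toy example recovers the input spectrum; increasing τ damps the solution, which suppresses oscillations at the price of a bias toward the assumption built into the regularization term.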

slide-43
SLIDE 43

Example: Unfolding of the Muon Neutrino Energy Spectrum

$g(y) = \int A(E, y)\, f(E)\, dE + b(y)$

Background well understood

Reconstructed Track length
Energy Estimator
Number of direct Photons

  • Unfolding with three input variables
  • Plus up to seven control variables

Requirement after Unfolding: MC and DATA agree in all Variables

slide-44
SLIDE 44

1. Problem solution in the Monte Carlo world:

  • Many parameter combinations to be evaluated:
      • Number of elements in the input & output vectors,
      • regularization strength ...

slide-45
SLIDE 45

2. Optimize the Regularization

  • → Growing fluctuations
  • → Growing correlations

slide-46
SLIDE 46

3. Test the Stability of the Result

(Pull-Mode)

[Figure: pull test; pull means and sigma, with annotated values 0.24σ and 0.25σ]
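A pull test repeats the unfolding on many pseudo-experiments with known truth and checks whether the normalized deviations scatter as expected. The sketch below shows the bookkeeping for one bin; the function names and the toy numbers are illustrative assumptions.

```python
import statistics


def pulls(estimates, truth, uncertainties):
    """Pull of each pseudo-experiment: (estimate - truth) / estimated uncertainty."""
    return [(est - truth) / sigma for est, sigma in zip(estimates, uncertainties)]


def pull_summary(estimates, truth, uncertainties):
    """Mean and width of the pull distribution.

    For an unbiased result with correctly estimated uncertainties the mean
    should be compatible with 0 and the width close to 1.
    """
    values = pulls(estimates, truth, uncertainties)
    return statistics.mean(values), statistics.stdev(values)


if __name__ == "__main__":
    # Toy numbers for one bin of the unfolded spectrum, for demonstration only.
    estimates = [11.0, 9.0, 10.0, 12.0, 8.0]
    sigmas = [1.0] * 5
    print(pull_summary(estimates, truth=10.0, uncertainties=sigmas))
```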

slide-47
SLIDE 47

4. Proof of the result

Before unfolding: MC and data do not (and should not) agree.
Thereafter: Data/MC agreement for the used input parameters is trivial,

  • but what about the reweighted control variables?
slide-48
SLIDE 48

5. Acceptance correction:

Event numbers
Detector area = geometric area × reconstruction probability
Flux = Events / (Area × Time)
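The two relations on this slide translate directly into code. A minimal sketch follows; the function names, the units and the numbers in the usage example are assumptions chosen for illustration.

```python
def effective_area(geometric_area_m2, reconstruction_probability):
    """Detector area = geometric area x reconstruction probability."""
    if not 0.0 <= reconstruction_probability <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    return geometric_area_m2 * reconstruction_probability


def flux(n_events, area_m2, live_time_s):
    """Flux = events / (area x time), here in events per m^2 per second."""
    if area_m2 <= 0.0 or live_time_s <= 0.0:
        raise ValueError("area and time must be positive")
    return n_events / (area_m2 * live_time_s)


if __name__ == "__main__":
    # Invented example values, not taken from the talk.
    area = effective_area(1.0e6, 0.5)
    print(flux(1000, area, 1.0e7))
```

In a binned spectrum the reconstruction probability depends on energy, so the correction is applied bin by bin.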

slide-49
SLIDE 49

6. Physical Signal (Muon Neutrino Flux)

Preliminary

slide-50
SLIDE 50

Computer Science View

Scientific Question

Experiment; Sensors, Data Acquisition; Data Reduction, Data Pre-Processing; Analysis; Evaluation; Regulation, Improvements; Data

Trigger; Streams; Signal Identification; Concept Shift; Storage; Inverse Problems; Evaluation; Complex Programming Models; Monte Carlo Simulation

slide-51
SLIDE 51

Monte Carlo: Virtual Reality

  • Structures of Theories:
      • Monte Carlo Parameter Selection
  • Experimental Physics:
      • Energy Spectra
      • Directional Distributions
  • Data: Charges, Times, Locations
  • Translation: Monte Carlo Simulation
      • Extremely resource-demanding
      • → Code/System Optimization (GPU)
      • → Abort needless Branches
      • → Active Learning

slide-52
SLIDE 52

Signal Selection

  • Real-Time Analysis in Data Streams
      • Fast Trigger (FPGA)
      • Feature Generation
      • Feature Extraction
      • Data Mining: Random Forests & Co.

slide-53
SLIDE 53

Inverse Problems – ill-posed Problems

$g(y) = \int A(E, y)\, f(E)\, dE + b(y)$

g = Af + b
f = A^{-1}(g - b)

Preliminary

slide-54
SLIDE 54

A new way of doing data science

  • Classical:
      • Simple Signature
      • → Physical Interpretation
      • → Continue as Human (Physicist ;)
      • → Compare to a predictive Theory
  • New:
      • Multidimensional Signature
      • → Probability Cloud
      • → Find & Calculate & Add Relevant Attributes
      • → Remove: Irrelevant Attributes / Information Noise
      • → Unfolding / Inverse Problem:
          • → Collapse of the Probability Cloud into a Physical Distribution
      • → Compare to a structural Theory
          • → Limit Ranges of Physically Relevant Parameters

http://cerncourier.com/cws/article/cern/27925

slide-55
SLIDE 55

Automated discovery of hypotheses and their validations?

  • Physics Data → Multi-dimensional Probability Clouds
  • Physics Theories → Multi-parameter Ranges of Possibilities
  • Data Analysis → Limit the Parameter Range
  • Theorists → Simplify your Theory
  • Computer Science
      • → Resource-Aware (fast, cheap) Hypothesis Testing
      • → Towards Automated Testing of the Hypothesis Space

slide-56
SLIDE 56

Why we need new methods for machine learning:

  • Typical financial needs of present experiments:
      • Hardware ~ Computing ~ 100 million $ | €
  • Next generation:
      • Sensitivity 10 × higher → 100 × more data
      • Increasing theory parameter space to test
      • Constant or decreasing financial resources
  • Consequence:
      • Replace learning physicists by learning machines
      • Distributed computing
      • Minimize data transfer, storage, memory, CPU

slide-57
SLIDE 57

Challenge of distributed computational power and data

  • Distributed Data:
      • Big Data Volume:
          • Intelligent Access,
          • Intelligent Transfer,
          • Information-Conserving Data Reduction
  • Distributed Computing:
      • Platform-independent code
      • Accepted and well-tested Methods
      • Optimized in every respect

slide-58
SLIDE 58

„Probabilistic Rationalism“

  • g(y), f(x) 2dim? → World of language & logic
  • g(y), f(x) analytic functions? → World of classical physics
  • g(y), f(x) probability distributions? → World of big data analysis

slide-59
SLIDE 59

Conclusion

Big Data Analysis leads to new Perspectives:

  • Physics: A New Way of Understanding Data Analysis
  • Computer Science: A New Integral View on Problem Solution
  • Philosophy: A New Perspective in Epistemology

slide-60
SLIDE 60