Physics, Big Data Analysis and Philosophy (and all the rest)
Wolfgang Rhode
... eritis sicut Deus ...
Plato's Cave Analogy
Are true conclusions possible?
´ Within the world of logic? Yes
´ Within the world of mathematics? Yes
´ Within the world of observations? No
´ Accuracy of observations
´ Conclusion from the observed effect to its cause
´ Within the world of teleology? No
´ To reach a goal? The program running on us → Big Data Analysis
´ Ancient and medieval philosophy:
´ Rationally invented systems based on logic and mathematics
´ However: these systems are inappropriate for describing nature.
´ Galileo Galilei:
´ Introduction of the experiment as a means of understanding nature
´ The book of nature is written in the language of mathematics
´ Success of classical physics (Newton … Einstein)
´ Btw.: Classical physics is deterministic, with no probabilistic elements
´ Complication: success of thermodynamics and quantum mechanics (etc.)
´ Both are probability-based, in very different ways.
´ Epistemology: various attempts to answer the question, but no success … until ...
´ Theories are (somehow) more or less rationally invented
´ Scientific theories allow an experimental test
´ Disagreement: rejection & search for a better theory
´ Agreement: acceptance for the moment & search for a more decisive experiment
´ Critical Rationalism
´ Logic-based theory
´ Is it necessary to reject a theory, just because of one non-fitting experiment?
g = measured numbers
b = background
A = kernel = transfer function = detector properties
f = wanted function
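The symbols above belong to a folding relation of the form g(y) = ∫ A(E, y) f(E) dE + b(y). As a minimal sketch, assuming the relation is discretized into bins, the measured numbers follow from a matrix-vector product plus background. The function name `fold` and all numbers below are invented for illustration.

```python
def fold(A, f, b):
    """Return g with g[i] = sum_j A[i][j] * f[j] + b[i]."""
    return [sum(a_ij * f_j for a_ij, f_j in zip(row, f)) + b_i
            for row, b_i in zip(A, b)]

# Hypothetical 3-bin detector: each true bin smears into its neighbours.
A = [[0.8, 0.1, 0.0],
     [0.2, 0.8, 0.2],
     [0.0, 0.1, 0.8]]
f = [100.0, 50.0, 10.0]  # wanted function (true spectrum), invented values
b = [1.0, 1.0, 1.0]      # background per measured bin, invented values

g = fold(A, f, b)        # measured numbers predicted by the model
```

The analysis problem discussed later runs in the opposite direction: g is measured, A is taken from simulation, and f has to be inferred.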
Computer Science View
Scientific Question
Experiment; Sensors, Data Acquisition; Data Reduction, Data Pre-Processing; Analysis; Evaluation; Regulation, Improvements; Data
Trigger, Streams, Signal Identification, Concept Shift, Storage, Inverse Problems, Evaluation, Complex Programming Models, Monte Carlo Simulation
´ Physics … all that we know ...
´ Energy Spectra
´ Directional Distributions
´ Sky Maps
´ Data: … all we can measure ...
´ Charges
´ Times
´ Locations
(Goal: Measurement of the neutrino energy spectrum)
´ Monte Carlo Simulation of
´ Signal: Neutrinos
´ Close to correct spatial and energy distribution
´ Neutrino interactions leading to charged particles (i.e. muons)
´ Muon interactions (path, range, deposited energy ...)
´ Cherenkov light (production by charged particles, propagation in ice)
´ Light detection and measurement (photomultiplier, readout electronics)
´ Physical Background: Cosmic Ray interaction in the Atmosphere
´ → (...) recorded light in the detector, correct simulation
´ Technical Background: correct simulation
´ Radioactivity of the ice or the detector itself → light
´ Photomultiplier noise ("signal without external reason")
´ Artefacts of the readout electronics → fake signal
[Figure: simulation chain. Recoverable labels: Generators; Propagator; Photon propagator (Photonics, PPC, clsim); Detector; Electronics; Trigger]
Atmospheric Muons (Corsika)
´ Keep as much of the signal as possible.
´ Discard as much of the background as possible.
´ Decide within ~1 ms
´ → Write data to disk if N detectors within a volume V and a time interval T have seen a signal.
´ Hardware (FPGA) and software realization possible.
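The trigger condition above (N detectors within a volume V and a time interval T) can be sketched as a simple coincidence check. This is a minimal illustration, not the experiment's trigger logic: the function name `trigger`, the hit format `(t, x, y, z)`, and the use of a sphere around a seed hit as a stand-in for the volume V are all assumptions.

```python
import math

def trigger(hits, n_min, radius, window):
    """Return True if some seed hit has at least n_min hits (itself included)
    that follow within `window` seconds and lie within `radius` metres of it.

    hits: list of (t, x, y, z) tuples. The sphere around the seed hit is an
    assumption standing in for the volume V.
    """
    hits = sorted(hits)  # order by time
    for i, (t0, x0, y0, z0) in enumerate(hits):
        count = 0
        for t, x, y, z in hits[i:]:
            if t - t0 > window:
                break  # later hits are outside the time interval
            if math.dist((x, y, z), (x0, y0, z0)) <= radius:
                count += 1
        if count >= n_min:
            return True
    return False

# Invented example: three neighbouring hits within microseconds, one stray hit.
hits = [(0.0, 0.0, 0.0, 0.0),
        (1e-6, 0.0, 0.0, 17.0),
        (2e-6, 0.0, 0.0, 34.0),
        (5e-3, 500.0, 0.0, 0.0)]
fires = trigger(hits, n_min=3, radius=50.0, window=5e-6)
```

A software version like this is easy to test; an FPGA realization would implement the same counting logic in hardware to meet the ~1 ms budget.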
´ Still: Keep as much of the signal as possible.
´ Still: Discard as much of the background as possible.
´ Decide within ~ minutes to hours (computing farm)
´ Can the event be reconstructed at all?
´ Is the result physical? (IceCube: Movement with the velocity of light? Upward?)
´ Do different algorithms lead to compatible results?
g = measured numbers
b = background
A = kernel = transfer function = detector properties
f = wanted function
[Figure labels: µ-neutrino (CC interaction), atmospheric µ; badly reconstructed µ, well reconstructed µ. Cuts: track length > 200 m, track interruption < 400 m, zenith angle > 86 deg]
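The three thresholds named in the figure can be read as quality cuts on reconstructed events. A minimal sketch, assuming the cuts are applied jointly; the function name `passes_cuts` and the dictionary keys are hypothetical, and only the threshold values come from the slide.

```python
def passes_cuts(event):
    """Apply the three cuts from the slide to one reconstructed event.

    The keys are hypothetical names for reconstructed quantities.
    """
    return (event["track_length_m"] > 200.0
            and event["track_interruption_m"] < 400.0
            and event["zenith_deg"] > 86.0)

# Invented events for illustration.
events = [
    {"track_length_m": 350.0, "track_interruption_m": 120.0, "zenith_deg": 110.0},
    {"track_length_m": 150.0, "track_interruption_m": 120.0, "zenith_deg": 110.0},
    {"track_length_m": 350.0, "track_interruption_m": 120.0, "zenith_deg": 40.0},
]
selected = [e for e in events if passes_cuts(e)]
```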
§ High-dimensional dataset: many redundant, irrelevant, and noisy variables
§ Maybe data and Monte Carlo do not [...] certain variables
§ Many dimensions complicate and maybe prevent the separation → Selection of appropriate variables
§ Variable Selection
§ Data/Monte Carlo comparison: [...] variables from the analysis
§ [...] calculations [...] procedures applicable to data [...] Monte Carlo
§ Remove redundant and meaningless variables
§ Feature Selection
Jaccard Index: Kuncheva‘s Index:
Stability of the MRMR Selection:
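The formulas for the two indices did not survive extraction. As a hedged sketch using the standard definitions (not necessarily the exact form shown on the slide): for two selected feature sets A and B, the Jaccard index is |A ∩ B| / |A ∪ B|, and Kuncheva's index for two sets of equal size k drawn from n features is (r·n − k²) / (k·(n − k)) with r = |A ∩ B|. The function names are illustrative.

```python
def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| of two non-empty feature sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def kuncheva(a, b, n_total):
    """Kuncheva's consistency index for two feature sets of equal size k,
    selected from n_total features; corrects the overlap for chance."""
    a, b = set(a), set(b)
    k = len(a)
    if len(b) != k or not 0 < k < n_total:
        raise ValueError("sets must have equal size k with 0 < k < n_total")
    r = len(a & b)
    return (r * n_total - k * k) / (k * (n_total - k))

# Invented example: two selections of 4 features out of 10.
run_1 = {1, 2, 3, 4}
run_2 = {3, 4, 5, 6}
j = jaccard(run_1, run_2)
kc = kuncheva(run_1, run_2, n_total=10)
```

Both indices equal 1 for identical selections; Kuncheva's index is close to 0 when the overlap is what random selection would give.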
§ Variable Selection (Data/Monte Carlo comparison, removal of redundant and meaningless variables, Feature Selection)
Dimensions: M ∼ 2000, M = 30, M ∼ 120 (step labels: 1 & 2, 3)
§ Random Forest (or other data mining method)
§ Searched is a model recognizing structures appropriate for signal/background separation from Monte Carlo
§ Important: Monte Carlo carries labels for signal and background
§ Example here: Random Forest prediction of the "signalness" of a single event
Random Forest
Attributes at node, Attributes for the complete RF, Signal/Background ratio at training
§ Test of "stability" and "overtraining"
Error Estimation: Cross Validation
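A k-fold cross validation can be sketched as follows: the labeled sample is split into k folds, each fold serves once as test set while the model is trained on the rest, and the spread of the k scores gives the error estimate. This is a minimal sketch with a toy threshold classifier standing in for the Random Forest; all function names and numbers are illustrative assumptions.

```python
import statistics

def k_fold_indices(n_samples, k):
    """Yield (train, test) index lists for k interleaved folds."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test

def cross_validate(samples, labels, k, fit, score):
    """Return mean and standard deviation of the score over k folds."""
    scores = []
    for train, test in k_fold_indices(len(samples), k):
        model = fit([samples[i] for i in train], [labels[i] for i in train])
        scores.append(score(model,
                            [samples[i] for i in test],
                            [labels[i] for i in test]))
    return statistics.mean(scores), statistics.stdev(scores)

def fit_threshold(xs, ys):
    """Toy model: threshold halfway between the two class means."""
    lo = [x for x, y in zip(xs, ys) if y == 0]
    hi = [x for x, y in zip(xs, ys) if y == 1]
    return (statistics.mean(lo) + statistics.mean(hi)) / 2

def accuracy(threshold, xs, ys):
    """Fraction of samples on the correct side of the threshold."""
    hits = sum((x > threshold) == bool(y) for x, y in zip(xs, ys))
    return hits / len(xs)

# Invented, cleanly separable sample: label 1 = signal, 0 = background.
samples = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
mean_acc, std_acc = cross_validate(samples, labels, 4, fit_threshold, accuracy)
```

Quoting a result as mean ± spread over the folds is the pattern behind a statement such as a purity with an uncertainty.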
Complete Range
§ 99.6 ± 0.2 % purity of muon-neutrino events
§ 99.9999 % background rejected
Data
´ Classical:
´ Simple Signature
´ → Physical Interpretation: Information loss
´ → Continue as Human (Physicist ;)
´ → Compare to a predictive Theory
´ New:
´ Multidimensional Signature
´ → Probability Cloud
´ → Find & Calculate & Add Relevant Attributes
´ → Remove: Irrelevant Attributes / Information Noise
´ → Unfolding / Inverse Problem:
´ → Collapse of the Probability Cloud into a Physical Distribution
´ → Compare to a structural Theory
´ → Limit Ranges of Physically Relevant Parameters
http://cerncourier.com/cws/article/cern/27925
g = measured numbers
b = background (small)
A = kernel = transfer function = detector properties
f = wanted physical distribution
´ Separation:
´ Monte Carlo describes the data as perfectly as possible.
´ Unfolding:
´ Monte Carlo description of the detector is "perfect"
´ However, the searched-for physical input is unknown!
´ Completely unknown?
´ … good guessing allowed (a good start value saves CPU time)
´ ... prove that the unfolding algorithm works independently of the Monte Carlo
Are there variables correlated with the searched-for energy?
´ Number of hit detectors ...
´ ... and many others
Translate to the matrix equation:
Damp the elements of A that lead to oscillating results:
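In discretized form the folding integral becomes a matrix equation, and one common way to damp the oscillation-prone components is Tikhonov regularization. Whether the talk used exactly this form is an assumption; the symbols τ (regularization strength) and L (a smoothing operator, for example a discrete second derivative) are introduced here for illustration only.

```latex
\begin{align}
g_i &= \sum_j A_{ij}\, f_j + b_i,
  \qquad \text{i.e.} \quad \mathbf{g} = A\,\mathbf{f} + \mathbf{b} \\
\hat{\mathbf{f}} &= \arg\min_{\mathbf{f}} \;
  \lVert A\,\mathbf{f} + \mathbf{b} - \mathbf{g} \rVert^2
  + \tau \,\lVert L\,\mathbf{f} \rVert^2
\end{align}
```

For τ = 0 this reduces to plain least-squares inversion, where small singular values of A amplify statistical fluctuations in g into oscillations in f. A larger τ suppresses them at the cost of some bias.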
Background well understood
Observables: reconstructed track length, energy estimator, number of direct photons
§ Unfolding with three input variables
§ Plus up to [...]ven control variables
Requirement after unfolding: MC and data agree in all variables
´ Many parameter combinations to be evaluated:
´ Number of elements in the input & output vectors,
´ regularization strength ...
´ → Growing fluctuations
´ → Growing correlations
(Pull mode; figure label: 0.25σ)
Before unfolding: MC and data do not (and should not) agree.
Afterwards: data/MC agreement for the input parameters used is trivial.
Event numbers
Detector area = geometric area × reconstruction probability
Flux = events / (area × time)
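The two relations above translate directly into code. A minimal sketch; the function names and all numbers are invented for illustration, and real analyses would evaluate these per energy bin.

```python
def detector_area(geometric_area_m2, reconstruction_probability):
    """Detector area = geometric area x reconstruction probability."""
    return geometric_area_m2 * reconstruction_probability

def flux(n_events, area_m2, time_s):
    """Flux = events / (area x time), in events per m^2 per s."""
    return n_events / (area_m2 * time_s)

# Invented example values.
area = detector_area(1.0e6, 0.5)   # 1 km^2 geometric area, 50 % reconstructed
phi = flux(1000, area, 1.0e7)      # 1000 events in 1e7 s of live time
```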
´ Structures of Theories:
´ Monte Carlo Parameter Selection
´ Experimental Physics:
´ Energy Spectra
´ Directional Distributions
´ Data: Charges, Times, Locations
´ Translation: Monte Carlo Simulation
´ Extremely resource-demanding
´ → Code/System Optimization (GPU)
´ → Abort needless branches
´ → Active Learning
´ Real-Time Analysis in Data Streams
´ Fast Trigger (FPGA)
´ Feature Generation
´ Feature Extraction
´ Data Mining: Random Forests & Co.
g(y) = \int A(E, y)\, f(E)\, dE + b(y)
´ Physics Data → Multi-dimensional Probability Clouds
´ Physics Theories → Multi-parameter Ranges of Possibilities
´ Data Analysis → Limit the Parameter Range
´ Theorists → Simplify your Theory
´ Computer Science
´ → Resource-Aware (fast, cheap) Hypothesis Testing
´ → Towards Automated Testing of the Hypothesis Space
´ Typical financial needs of present experiments:
´ Hardware ~ Computing ~ 100 million $ | €
´ Next generation:
´ Sensitivity 10 × higher → 100 × more data
´ Increasing theory parameter space to test
´ Constant or decreasing financial resources
´ Consequence:
´ Replace learning physicists by learning machines
´ Distributed computing
´ Minimize data transfer, storage, memory, CPU
´ Distributed Data:
´ Big Data Volume:
´ Intelligent Access,
´ Intelligent Transfer,
´ Information-Conserving Data Reduction
´ Distributed Computing:
´ Platform-independent code
´ Accepted and well-tested methods
´ Optimized in every respect
´ g(y), f(x) 2dim? → World of language & logic
´ g(y), f(x) analytic functions? → World of classical physics
´ g(y), f(x) probability distributions? → World of big data analysis
Big Data Analysis leads to new perspectives:
Physics: a new way of understanding data analysis
Computer Science: a new integral view on problem solution
Philosophy: a new perspective in epistemology