SLIDE 1

laboratory

Gerstner

“Discover how to discover best”

How Computers Discover How Computers Discover

A Mini-Review of Algorithmic Meta-Discovery

Filip Železný
ČVUT Prague, School of Electrical Engineering, Dept. of Cybernetics
The Gerstner Laboratory for Intelligent Decision Making and Control

© Filip Železný 2005, Czech Technical University (ČVUT) in Prague / School of Electrical Engineering / Dept. of Cybernetics / The Gerstner Lab (1/17)

SLIDE 2

Introduction

:: Traditional scientific discovery: a human forms a hypothesis explaining observations of some natural phenomenon.

:: Computer-based scientific discovery: the hypothesis is formed by a machine, usually employing machine learning algorithms.


SLIDE 3

Automated Discovery

:: Computer programs constructing hypotheses from data
− Machine Learning
− Data Mining
− Knowledge Discovery in Databases

:: Highlight: the Robot-Scientist project (UK)
− Robot develops predicate-logic hypotheses in functional genomics
− Designs optimal experiments to validate hypotheses
− Realizes the experiments physically
− King et al., Nature vol. 427, 2004


SLIDE 4

Meta-Discovery

:: Viewing computer-based scientific discovery itself as an empirical phenomenon.

:: Inferring hypotheses about it.


SLIDE 5

Phase Transitions

:: Originally: runtime statistics of problem-solving algorithms on randomly generated problem instances. Example: propositional-logic SATisfiability (SAT).

[Figure 1: The NP-complete logic satisfiability problem, solved by Davis-Putnam search. Two panels over the #clauses/#variables ratio (roughly 1 to 10): the fraction of soluble vs. insoluble instances, and the average number of backtracks, which peaks sharply (hundreds of thousands) at the soluble/insoluble transition.]

:: ← Underconstrained (many solutions) vs. → Overconstrained (small search space). The hardest problems lie at the transition between the two.
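The experiment behind Figure 1 can be re-run in miniature. Below is a minimal sketch (pure Python; all names are illustrative, not from the slides): a naive DPLL/Davis-Putnam-style backtracking solver run on random 3-SAT instances at several clause-to-variable ratios. With enough samples, the backtrack count tends to peak near the critical ratio of roughly 4.3.

```python
import random

def random_3sat(n_vars, n_clauses, rng):
    """Random 3-SAT: each clause has 3 distinct variables, each negated with prob 1/2."""
    return [tuple(v if rng.random() < 0.5 else -v
                  for v in rng.sample(range(1, n_vars + 1), 3))
            for _ in range(n_clauses)]

def dpll(clauses, assignment, stats):
    # Simplify under the current partial assignment:
    # drop satisfied clauses, remove falsified literals.
    simplified = []
    for clause in clauses:
        live, satisfied = [], False
        for lit in clause:
            val = assignment.get(abs(lit))
            if val is None:
                live.append(lit)
            elif (lit > 0) == val:
                satisfied = True
                break
        if satisfied:
            continue
        if not live:                      # empty clause: conflict, backtrack
            stats["backtracks"] += 1
            return False
        simplified.append(live)
    if not simplified:
        return True                       # all clauses satisfied
    # Branch on a variable from the shortest clause (approximates unit propagation).
    var = abs(min(simplified, key=len)[0])
    for value in (True, False):
        assignment[var] = value
        if dpll(simplified, assignment, stats):
            return True
        del assignment[var]
    return False

def solve(clauses):
    stats = {"backtracks": 0}
    sat = dpll(list(clauses), {}, stats)
    return sat, stats["backtracks"]

rng = random.Random(0)
n = 25
for ratio in (2.0, 4.3, 8.0):  # under-constrained, near-critical, over-constrained
    runs = [solve(random_3sat(n, int(ratio * n), rng))[1] for _ in range(10)]
    print(f"clauses/vars = {ratio}: avg backtracks = {sum(runs) / len(runs):.0f}")
```

Underconstrained instances are satisfied almost immediately; strongly overconstrained ones are refuted cheaply; the expensive instances cluster at the transition.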


SLIDE 6

Phase Transitions in Learning?

:: “Inductive Logic Programming” (ILP): first-order logic representation of data and hypotheses.

:: Example: biochemistry. Predicting mutagenic activity from compound structure.

:: Example hypothesis:
active(A) ← atm(A, B, c, 10, C) ∧ atm(A, D, c, 10, C) ∧ bond(A, B, D, 1)

:: Verifying the rule for given examples (chemical compounds) ≡ a SAT problem.

:: Empirical studies (Serra et al, IJCAI 01; Botta et al, JMLR 4:2003): ILP systems tend to generate hypotheses in the phase-transition region.
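To see why verifying such a rule is a constraint-satisfaction (SAT-like) problem, here is a toy coverage check in Python. The `atm`/`bond` facts and the compound `m1` are invented for illustration; only the rule's shape comes from the slide. Checking coverage means searching for a substitution for the variables B, D, C that satisfies all three body literals simultaneously.

```python
from itertools import product

# Hypothetical ground facts for one compound "m1":
# atm(Mol, AtomId, Element, AtomType, Charge) and bond(Mol, Atom1, Atom2, BondType)
atoms = [("m1", "a1", "c", 10, -0.12),
         ("m1", "a2", "c", 10, -0.12),
         ("m1", "a3", "o", 40, -0.38)]
bonds = {("m1", "a1", "a2", 1)}

def rule_covers(mol):
    """active(A) <- atm(A,B,c,10,C), atm(A,D,c,10,C), bond(A,B,D,1):
    True iff some binding of B, D, C satisfies every body literal."""
    for (ma, b, eb, tb, cb), (md, d, ed, td, cd) in product(atoms, repeat=2):
        if (ma == md == mol and eb == ed == "c" and tb == td == 10
                and cb == cd and (mol, b, d, 1) in bonds):
            return True
    return False

print(rule_covers("m1"))
```

Each candidate binding is one cell of the (here quadratic) product; for rules with more literals this search grows combinatorially, which is exactly where the phase-transition behaviour of SAT enters.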


SLIDE 7

Heavy-Tailed Runtime Distributions

:: What goes on in the PT region? Model the runtime distributions.

:: P(not achieving a solution in time t):
− normal: decays exponentially with t
− heavy-tailed: decays by a power law (may have infinite moments, e.g. the mean)

[Figure: empirical survival function 1 − F(x) vs. # backtracks ~ CPU time, both on log scales (x from about 1e+03 to 2e+05, 1 − F(x) from 1e−06 to 1); a heavy tail appears as a near-straight line.]
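The distinction can be simulated directly. A sketch (pure Python; parameters chosen for illustration): compare the empirical survival function P(T > t) of an exponential (light-tailed) sample against a Pareto sample with shape α = 0.9 < 1, whose mean is infinite.

```python
import random

rng = random.Random(0)
N = 100_000
light = [rng.expovariate(1.0) for _ in range(N)]    # exponential decay
heavy = [rng.paretovariate(0.9) for _ in range(N)]  # power law, alpha = 0.9 < 1

def survival(sample, t):
    """Empirical P(T > t)."""
    return sum(x > t for x in sample) / len(sample)

for t in (2, 8, 32):
    print(f"t={t:>2}:  exponential {survival(light, t):.2e}   pareto {survival(heavy, t):.2e}")
```

The exponential survival collapses within a few multiples of its mean, while the Pareto tail still holds several percent of the mass at t = 32: with non-negligible probability, a single run takes essentially forever.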


SLIDE 8

Heavy-Tailed Runtime Distributions

:: HT distributions: a “statistical curiosity” of the early 20th century:
− V. Pareto: income distributions
− B. Mandelbrot: fractal phenomena in nature

:: Empirical finding (Gomes et al, Jr Autom Reas 2001): important combinatorial problems/algorithms exhibit heavy-tailed RTDs. Surveyed: randomized algorithms and/or random problem instances.

:: In hypothesis learning (Zelezny et al, ILP 2002): heavy-tailed RTDs manifest themselves in ILP.
− Not only a consequence of the involved hypothesis checking (= SAT)
− HT RTDs also in terms of the # of hypotheses searched


SLIDE 9

Restarted Randomized Search

:: HT RTDs have intriguing consequences.
− f(t)∆t / (1 − F(t)) ≈ probability of finding a solution in the next ∆t, given none found up to time t.
− For heavy tails, this decreases with t.
− The longer you search, the lower your chances...

:: So it makes sense to restart the search every now and then?!

:: Indeed:
− Non-restarted search runtime cdf F(t): infinite mean, but F(γ) > 0 for some γ > 0.
− Search restarted whenever the cut-off time γ is reached. Probability of success within N restarts: F_γ(N) = 1 − (1 − F(γ))^N, i.e. exponential decay of the failure probability ⇒ finite mean.
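A quick simulation of the claim (names and parameters chosen here; the heavy-tailed runtime is modelled as Pareto with α = 0.9, so the non-restarted mean is infinite): restarting whenever the cut-off γ is reached makes the failure probability (1 − F(γ))^N and bounds the expected total runtime by γ/F(γ).

```python
import random

rng = random.Random(1)
ALPHA = 0.9     # Pareto shape < 1: the mean runtime of a single run is infinite
CUTOFF = 4.0    # restart cut-off gamma

def single_run():
    """One randomized search; its runtime is heavy-tailed."""
    return rng.paretovariate(ALPHA)

def restarted_runtime(cutoff):
    """Total time under the restart strategy: abandon any run exceeding `cutoff`."""
    total = 0.0
    while True:
        t = single_run()
        if t <= cutoff:
            return total + t       # this run found the solution in time
        total += cutoff            # give up and restart

F_gamma = 1 - CUTOFF ** -ALPHA     # F(gamma) for Pareto(alpha), support [1, inf)
trials = [restarted_runtime(CUTOFF) for _ in range(50_000)]
mean = sum(trials) / len(trials)
print(f"F(gamma) = {F_gamma:.3f}, expected restarts = {1 / F_gamma:.2f}")
print(f"empirical mean total runtime = {mean:.2f} (bound: gamma/F(gamma) = {CUTOFF / F_gamma:.2f})")
```

Without restarts, the running sample mean of `single_run()` keeps drifting upward and never stabilizes; with restarts, it converges to a finite value.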


SLIDE 10

Restarted Randomized Search in ILP

:: Expected runtime of an ILP algorithm with restart cut-off time γ to find a hypothesis of a given quality. Log scales; orders-of-magnitude performance gains.

[Figure: expected cost (log scale, 1 to 200,000) as a function of the restart cutoff (log scale, 1 to 65,536) and the required score (1 to 20).]

:: Large empirical study (Zelezny et al, ILP 2004):
− 100-200 Condor-cluster PCs (UW Madison)
− SGI Altix supercomputer (CTU Prague)


SLIDE 11

Occam’s razor: Empirical Assessment

:: William of Ockham, 14th-century English logician: “Entities should not be multiplied beyond necessity.”

:: Traditional machine learning interpretation: “If several hypotheses explain the data with roughly the same accuracy, keep the simplest.”

:: Reasons:

1. Evident: ease of human interpretation
2. Postulated: predictive ability (theory does not give a clue)

:: Thanks to automated discovery, Reason 2 can be empirically tested.


SLIDE 12

Occam’s razor: Empirical Assessment

:: Some seminal empirical studies (Holte, Mach Learn 1993) apparently support the simplicity bias, but there is a misinterpretation here.

:: The detrimental effect on predictive accuracy is due to
− too many hypotheses tested,
− rather than too complex hypotheses tested.
The relation between hypothesis-space size and average hypothesis complexity is only incidental.

:: Domingos (Data Mining & Knowl Disc 1999) reviews empirical evidence against Reason 2 for Occam’s razor. Successes of:
− ensemble learning (combining numerous complex hypotheses)
− support vector machines (transforming data into high-dimensional spaces)
− excessive search leading to simple yet inaccurate hypotheses (Quinlan et al, IJCAI 1995)
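The ensemble point is easy to illustrate numerically. A toy sketch (assuming independent errors, which real ensembles only approximate): eleven hypotheses that are each right 70% of the time, combined by majority vote, predict correctly far more often than any single one, even though the combined model is strictly more complex.

```python
import random

rng = random.Random(42)
P_SINGLE = 0.7   # accuracy of each individual hypothesis
K = 11           # ensemble size (odd, so a majority always exists)
TRIALS = 20_000

def majority_is_correct():
    """One test example: each of K independent hypotheses votes; majority wins."""
    correct_votes = sum(rng.random() < P_SINGLE for _ in range(K))
    return correct_votes > K // 2

ensemble_acc = sum(majority_is_correct() for _ in range(TRIALS)) / TRIALS
print(f"single hypothesis accuracy: {P_SINGLE:.2f}")
print(f"majority-of-{K} accuracy:   {ensemble_acc:.3f}")
```

This is the Condorcet-jury effect behind bagging-style variance reduction: it rewards diverse (even individually overcomplex) hypotheses rather than the single simplest one.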


SLIDE 13

Computerized Meta-Learning

:: So far: computers discovering, and humans inferring meta-hypotheses about how they discover.

:: Now shifting to meta-learning: “Learn how to learn best”.


SLIDE 14

Meta-Learning Achievements

:: Traditional approaches: see the Mach Learn special issue on Meta-Learning, 54:2004. Examples:
− Meta-hypothesize: which learning algorithm is best for the given data?
− Predict a range of parameters (e.g. kernel width for SVMs) given meta-data.

:: Unorthodox approaches (Maloberti & Sebag, Mach Learn 55(2):2004):
− Detect the position of the problem w.r.t. the phase-transition region
− Use it to determine the best learning algorithm

:: Other: Bensusan (ECML 1998) meta-learns how much pruning should be used.
− Pruning ≈ simplifying hypotheses at some sacrifice in accuracy
− Occam's-razor motivated (title: “God does not always shave with Occam's razor”)


SLIDE 15

Speculations

:: Given that meta-learning is useful, would meta-meta-learning be?

:: And meta-...-meta-learning, n levels deep?

[Figure: a tower of nested frames, each repeating the formula s = ½gt² together with the caption “Simple rules work best.”, over and over at every meta-level, ... etc.]

What if n is infinite? (Much like Lisp/Prolog meta-interpreter towers.)


SLIDE 16

Speculations

:: Recent research on links between machine learning and the philosophy of science (e.g. a dedicated ECML 2001 workshop):
− mostly aimed at improving machine learning
− note: Vapnik (The Nature of Statistical Learning Theory, Springer 2001) also translates Popper's falsifiability thesis into learning theory.

:: But inversely: do computerized meta-discovery lessons apply to scientific inquiry in general? E.g.:
− Should scientists randomize and restart hypothesis forming?
− Should scientists combine diverse hypotheses to draw a conclusion?
− Should they devise overly complex hypotheses to generate the variance needed for good ensembles?
− . . .


SLIDE 17

“Discover how to discover best”

* THE END *
