  1. Software development in AppStat
B. Kégl / AppStat
AppStat: Applied Statistics and Machine Learning
AppStat: Apprentissage Automatique et Statistique Appliquée
Balázs Kégl, Linear Accelerator Laboratory, CNRS / University of Paris Sud
Service Informatique, Nov 30, 2010

  2. Overview
• Introduction: me, the team, collaborations
• Scientific projects → software
  • discriminative learning → boosting → multiboost.org
  • inference, Monte-Carlo integration → adaptive MCMC → integration into ROOT (saved for next time)

  3. Scientific path
• Hungary: 1989–94, M.Eng. Computer Science, BUTE; 1994–95, research assistant, BUTE
• Canada: 1995–99, Ph.D. Computer Science, Concordia U; 2000, postdoc, Queen's U; 2001–06, assistant professor, U of Montreal
• France: 2006–, research scientist (CR1), CNRS / U Paris Sud
• Research interests: machine learning, pattern recognition, signal processing, applied statistics
• Applications: image and music processing, bioinformatics, software engineering, grid control, experimental physics

  4. The team
• B. Kégl (team leader, 2006–): boosting, MCMC, Auger
• R. Busa-Fekete (postdoc, 2008–): boosting, optimization, SysBio
• R. Bardenet (Ph.D. student, 2009–): MCMC, optimization, Auger
• D. Benbouzid (Ph.D. student, 2010–): boosting, JEM-EUSO
• F-D. Collin (software engineer, from 01/12/2010): multiboost.org, MCMC in ROOT, system integration
• D. Garcia (postdoc, from 01/01/2011): generative models, Auger / JEM-EUSO, tutoring

  5. Collaborations
[Diagram: AppStat (LAL) at the center, linked on the computer-science side to Telecom ParisTech (LTCI), LRI (TAO), and the Hungarian Academy of Sciences (boosting, MCMC, optimization), and on the experimental-science side to Auger and JEM-EUSO, with future links to ILC, LSST, etc. (trigger, reconstruction, hypothesis testing); the legend distinguishes existing from future links.]

  6. Funding
• ANR "jeune chercheur" MetaModel: 2007–2010, 150K€
• ANR "COSINUS" Siminole: 2010–2014, 1043K€ (658K€ at LAL)
• MRM Grille Paris Sud: 2010–2012, 60K€ (31K€ at LAL)

  7. Siminole within ANR COSINUS
• COSINUS = Conception and Simulation
• Theme 1: simulation and supercomputing
• Theme 2: conception and optimization
• Theme 3: large-scale data storage and processing
• Siminole: principal theme Theme 2, secondary theme Theme 1

  8. Siminole within ANR COSINUS
• Simulation: the third pillar of scientific discovery
• Improving simulation:
  • algorithmic development inside the simulator
  • implementation on high-end computing devices
  • our approach: control the number of calls to the simulator

  9. Siminole within ANR COSINUS
• Optimization: simulate from f(x), find max_x f(x)
• Inference: simulate from p(x | θ), find p(θ | x)
• Discriminative learning: simulate from p(x, θ), find θ = f(x)

  10. Discriminative learning → boosting → multiboost.org
• Discriminative learning (classification): infer f(x): R^d → {1, ..., K} from a database D = {(x_1, y_1), ..., (x_n, y_n)}
• Boosting, AdaBoost: one of the state-of-the-art classification algorithms
• multiboost.org: our implementation
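MultiBoost itself is a C++ package; purely as an illustration of the AdaBoost idea the slide refers to, here is a minimal Python sketch with decision-stump weak learners (all names are ours, not the multiboost.org API):

```python
import numpy as np

def fit_stump(X, y, w):
    """Exhaustively pick the (feature, threshold, polarity) stump
    minimizing the weighted error under example weights w."""
    best = (np.inf, 0, 0.0, 1)
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for pol in (1, -1):
                pred = np.where(pol * (X[:, j] - thr) >= 0, 1, -1)
                err = w[pred != y].sum()
                if err < best[0]:
                    best = (err, j, thr, pol)
    return best

def adaboost(X, y, T=20):
    """AdaBoost with decision stumps; labels y are in {-1, +1}."""
    n = len(y)
    w = np.full(n, 1.0 / n)                      # uniform initial weights
    ensemble = []
    for _ in range(T):
        err, j, thr, pol = fit_stump(X, y, w)
        err = max(err, 1e-12)                    # avoid division by zero
        alpha = 0.5 * np.log((1 - err) / err)    # stump coefficient
        pred = np.where(pol * (X[:, j] - thr) >= 0, 1, -1)
        w *= np.exp(-alpha * y * pred)           # upweight the mistakes
        w /= w.sum()
        ensemble.append((alpha, j, thr, pol))
    return ensemble

def predict(ensemble, X):
    """Sign of the weighted vote of the stumps."""
    f = sum(a * np.where(p * (X[:, j] - t) >= 0, 1, -1)
            for a, j, t, p in ensemble)
    return np.sign(f)
```

The reweighting step is the essential mechanism: each round concentrates the weight on examples the current ensemble still misclassifies.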

  11. Machine learning at the crossroads
[Diagram: machine learning at the intersection of artificial intelligence, probability theory, statistics, optimization, cognitive science, signal processing, neuroscience, and information theory]

  12. Machine learning
• From a statistical point of view:
  • non-parametric fitting, capacity/complexity control
  • large dimensionality
  • large data sets, computational issues
  • mostly classification (categorization, discrimination)

  13. Discriminative learning
• observation vector: x ∈ R^d
• class label: y ∈ {−1, 1} (binary classification) or y ∈ {1, ..., K} (multi-class classification)
• classifier: g: R^d → {−1, 1}
• discriminant function: f: R^d → [−1, 1], with g(x) = 1 if f(x) ≥ 0 and g(x) = −1 if f(x) < 0
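As a tiny illustration (ours, not from the slides), thresholding a discriminant function at 0 turns it into the classifier g:

```python
import numpy as np

def classify(f, X):
    """Turn a discriminant function f: R^d -> [-1, 1] into a
    classifier g: R^d -> {-1, +1} by thresholding at 0."""
    scores = np.asarray([f(x) for x in X])
    return np.where(scores >= 0.0, 1, -1)

# example: a linear discriminant in d = 2, squashed into [-1, 1]
f = lambda x: np.tanh(x[0] - x[1])
print(classify(f, [(2.0, 1.0), (0.0, 3.0)]))   # [ 1 -1]
```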

  14. Discriminative learning
• Inductive learning
  • training sample: D_n = {(x_1, y_1), ..., (x_n, y_n)}
  • function set: F
  • learning algorithm: ALGO: (R^d × {−1, 1})^n → F, with ALGO(D_n) → f
  • goal: small generalization error P(f(X) ≠ Y)
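The generalization error P(f(X) ≠ Y) is not observable directly, but it can be estimated on a held-out sample the learning algorithm never saw; a minimal sketch (illustrative names, not from the slides):

```python
import numpy as np

def empirical_error(f, X, y):
    """Monte-Carlo estimate of P(f(X) != Y): the fraction of a
    held-out sample that the learned discriminant misclassifies."""
    preds = np.array([np.sign(f(x)) for x in X])
    return float(np.mean(preds != y))

# illustrative: a fixed discriminant evaluated on 3 held-out points
f = lambda x: x[0] - 1.0
X_test = [np.array([2.0]), np.array([0.0]), np.array([3.0])]
y_test = np.array([1, -1, -1])
print(empirical_error(f, X_test, y_test))   # 0.3333333333333333
```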

  15. [Figure: data for a two-class classification problem in the (x1, x2) plane]

  16. [Figure: 2-D Gaussian fit for class 1]

  17. [Figure: 2-D Gaussian fit for class 2]

  18. Classification
• Terminology:
  • conditional densities: p(x | Y = 1), p(x | Y = −1)
  • prior probabilities: p(Y = 1), p(Y = −1)
  • posterior probabilities: p(Y = 1 | x), p(Y = −1 | x)
• Bayes' theorem: p(Y = 1 | x) = p(x | Y = 1) p(Y = 1) / p(x) ∝ p(x | Y = 1) p(Y = 1)
• Decision: g(x) = 1 if p(x | Y = 1) p(Y = 1) / (p(x | Y = −1) p(Y = −1)) > 1, and g(x) = −1 otherwise
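The surrounding slides fit one Gaussian per class and plug the fits into this likelihood-ratio rule. A sketch of that generative pipeline (our illustrative code, using maximum-likelihood Gaussian fits and log-densities for numerical stability):

```python
import numpy as np

def fit_gaussian(X):
    """Maximum-likelihood fit of a multivariate Gaussian."""
    return X.mean(axis=0), np.cov(X, rowvar=False)

def log_gauss(x, mu, cov):
    """Log-density of a multivariate Gaussian at point x."""
    d = len(mu)
    diff = x - mu
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (d * np.log(2 * np.pi) + logdet
                   + diff @ np.linalg.solve(cov, diff))

def bayes_classify(x, params_pos, params_neg, prior_pos=0.5):
    """The slide's decision rule in log space:
    g(x) = 1 iff p(x|Y=1) p(Y=1) > p(x|Y=-1) p(Y=-1)."""
    lp = log_gauss(x, *params_pos) + np.log(prior_pos)
    ln = log_gauss(x, *params_neg) + np.log(1 - prior_pos)
    return 1 if lp > ln else -1
```

On data that really is Gaussian per class this rule works well; the Two Moons slides that follow show how it fails when the class-conditional densities are not Gaussian.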

  19. [Figure: discriminant function with Gaussian fits]

  20. [Figure: 'Two Moons' data for a two-class classification problem]

  21. [Figure: 2-D Gaussian fit for class 1 (Two Moons data)]

  22. [Figure: 2-D Gaussian fit for class 2 (Two Moons data)]

  23. [Figure: discriminant function with Gaussian fits (Two Moons data)]

  24. [Figure: 2-D Parzen fit for class 1, h = 0.12]

  25. [Figure: 2-D Parzen fit for class 2, h = 0.12]

  26. [Figure: discriminant function with Parzen fits, h = 0.12]

  27. [Figure: 2-D Parzen fit for class 1, h = 0.02]

  28. [Figure: 2-D Parzen fit for class 2, h = 0.02]

  29. [Figure: discriminant function with Parzen fits, h = 0.02]

  30. [Figure: 2-D Parzen fit for class 1, h = 3]

  31. [Figure: 2-D Parzen fit for class 2, h = 3]

  32. [Figure: discriminant function with Parzen fits, h = 3]

  33. [Figure: training and test error rates for Parzen fits with different bandwidths; error rate from 0 to 0.20 plotted against h from 0 to 0.8]
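The Parzen-window experiments above (h = 0.02, 0.12, 3) can be reproduced in outline: a Gaussian-kernel density estimate per class, plugged into the same likelihood-ratio rule. An illustrative sketch (our code; equal priors assumed, as for balanced data):

```python
import numpy as np

def parzen_density(x, X, h):
    """Parzen-window (kernel) density estimate at point x: the average
    of Gaussian kernels of bandwidth h centered on the sample points."""
    d = X.shape[1]
    sq = np.sum((X - x) ** 2, axis=1)
    return np.mean(np.exp(-sq / (2 * h * h))) / (2 * np.pi * h * h) ** (d / 2)

def parzen_classify(x, X_pos, X_neg, h):
    """Plug the two class-conditional Parzen fits into the Bayes
    decision rule (equal priors)."""
    return 1 if parzen_density(x, X_pos, h) >= parzen_density(x, X_neg, h) else -1

def error_rate(X, y, X_pos, X_neg, h):
    """Misclassification rate of the Parzen classifier on (X, y)."""
    preds = np.array([parzen_classify(x, X_pos, X_neg, h) for x in X])
    return float(np.mean(preds != y))
```

Sweeping h and comparing training against held-out error reproduces the trade-off in the figure: a very small h typically drives the training error down while the test error rises (overfitting), while a very large h over-smooths both fits (underfitting).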

  34. Non-parametric fitting
• Capacity control, regularization:
  • trade-off between approximation error and estimation error
  • complexity grows with data size
  • no need to correctly guess the function class
