  1. Master Recherche IAC: Apprentissage Statistique, Optimisation & Applications. Anne Auger, Balázs Kégl, Michèle Sebag. TAO, Nov. 28th, 2012

  2. Contents
     WHO
     ◮ Anne Auger, optimization (TAO, LRI)
     ◮ Balázs Kégl, machine learning (TAO, LAL)
     ◮ Michèle Sebag, machine learning (TAO, LRI)
     WHAT
     1. Neural Nets
     2. Stochastic Optimization
     3. Reinforcement Learning
     4. Ensemble Learning
     WHERE: http://tao.lri.fr/tiki-index.php?page=Courses

  3. Exam
     Final: same as for TC2:
     ◮ Questions
     ◮ Problems
     Volunteers
     ◮ Some pointers are in the slides
     ◮ A volunteer reads the material, writes one page, sends it.
     Tutorials/Videolectures
     ◮ http://www.iro.umontreal.ca/~bengioy/talks/icml2012-YB-tutorial.pdf
     ◮ Part 1: slides 1-56; Part 2: slides 79-133
     ◮ Group 1 prepares Part 1, Group 2 prepares Part 2
     ◮ Course of Dec. 12th: Group 1 presents Part 1, Group 2 asks questions; Group 2 presents Part 2, Group 1 asks questions.

  4. Questionnaire
     Admin: Ouassim Ait El Hara
     Debriefing
     ◮ What is clear/unclear
     ◮ Pre-requisites
     ◮ Work organization

  5. This course: Bio-inspired algorithms; Classical Neural Nets (History, Structure, Applications)

  6. Bio-inspired algorithms
     Facts
     ◮ 10^11 neurons
     ◮ 10^4 connections per neuron
     ◮ Firing time: ~10^-3 second (vs ~10^-10 second for computers)

  7. Bio-inspired algorithms, 2
     Human beings are the best!
     ◮ How do we do it?
     ◮ What matters is not the number of neurons, as one could have thought in the 80s and 90s...
     ◮ Massive parallelism?
     ◮ Innate skills? (= anything we can't yet explain)
     ◮ Is it the training process?

  8. Beware of bio-inspiration
     ◮ Misleading inspirations (imitating birds to build flying machines)
     ◮ Limitations of the state of the art
     ◮ Difficult for a machine ≠ difficult for a human

  9. Synaptic plasticity (Hebb, 1949)
     Conjecture: When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A's efficiency, as one of the cells firing B, is increased.
     Learning rule: Cells that fire together, wire together. If two neurons are simultaneously excited, their connection weight increases.
     Remark: this is unsupervised learning.
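
The rule on this slide amounts to a one-line weight update. A minimal sketch in Python, where hebbian_update, x_pre, x_post and the rate eta are hypothetical names chosen for illustration:

```python
import numpy as np

def hebbian_update(w, x_pre, x_post, eta=0.01):
    """Hebb's rule: strengthen w[j, i] when pre-synaptic unit i and
    post-synaptic unit j are active together (no target signal is used)."""
    return w + eta * np.outer(x_post, x_pre)

# Toy usage: two units that fire together see their connection grow.
w = np.zeros((2, 2))
for _ in range(10):
    x = np.array([1.0, 0.0])   # pre-synaptic activity
    y = np.array([1.0, 0.0])   # post-synaptic activity
    w = hebbian_update(w, x, y)
print(w)  # only w[0, 0] has increased
```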

  10. This course: Bio-inspired algorithms; Classical Neural Nets (History, Structure, Applications)

  11. History of artificial neural nets (ANN)
     1. Unsupervised NNs and logical neurons
     2. Supervised NNs: the Perceptron and Adaline algorithms
     3. The NN winter: theoretical limitations
     4. Multi-layer perceptrons.

  12. History

  13. Thresholded neurons (McCulloch and Pitts, 1943)
     Ingredients
     ◮ Inputs (dendrites) x_i
     ◮ Weights w_i
     ◮ Threshold θ
     ◮ Output: 1 iff Σ_i w_i x_i > θ
     Remarks
     ◮ Neurons → Logics → Reasoning → Intelligence
     ◮ Logical NNs can represent any boolean function
     ◮ No differentiability.
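
To make the "logical neurons" remark concrete, here is a minimal sketch of a thresholded unit computing boolean functions; mcculloch_pitts and the AND/OR wirings are illustrative choices, not code from the course:

```python
def mcculloch_pitts(inputs, weights, threshold):
    """Thresholded neuron: output 1 iff the weighted sum of inputs exceeds the threshold."""
    return 1 if sum(w * x for w, x in zip(weights, inputs)) > threshold else 0

# Logical neurons: boolean functions represented as thresholded units.
AND = lambda a, b: mcculloch_pitts([a, b], [1, 1], 1.5)
OR = lambda a, b: mcculloch_pitts([a, b], [1, 1], 0.5)
print([AND(a, b) for a in (0, 1) for b in (0, 1)])  # [0, 0, 0, 1]
print([OR(a, b) for a in (0, 1) for b in (0, 1)])   # [0, 1, 1, 1]
```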

  14. Perceptron (Rosenblatt, 1958)
     ŷ = sign(Σ_i w_i x_i − θ)
     Folding the threshold into the weights:
     x = (x_1, ..., x_d) ↦ (x_1, ..., x_d, 1)
     w = (w_1, ..., w_d) ↦ (w_1, ..., w_d, −θ)
     so that ŷ = sign(⟨w, x⟩)

  15. Learning a Perceptron
     Given E = {(x_i, y_i), x_i ∈ ℝ^d, y_i ∈ {1, −1}, i = 1...n}
     For i = 1...n, do
     ◮ If no mistake, do nothing
       (no mistake ⇔ ⟨w, x_i⟩ has the same sign as y_i ⇔ y_i ⟨w, x_i⟩ > 0)
     ◮ If mistake, w ← w + y_i x_i
     Enforcing algorithmic stability:
     w_{t+1} ← w_t + α_t y_ℓ x_ℓ, where α_t decreases to 0 faster than 1/t.
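
A minimal sketch of this update rule, assuming numpy and the augmented representation of slide 14; perceptron_train, the epoch cap and the stopping criterion are illustrative choices:

```python
import numpy as np

def perceptron_train(X, y, epochs=100):
    """Perceptron rule on augmented inputs: the bias is folded in as a constant 1 feature."""
    Xa = np.hstack([X, np.ones((X.shape[0], 1))])  # x -> (x_1, ..., x_d, 1)
    w = np.zeros(Xa.shape[1])                      # w -> (w_1, ..., w_d, -theta)
    for _ in range(epochs):
        mistakes = 0
        for xi, yi in zip(Xa, y):
            if yi * (w @ xi) <= 0:   # mistake: y <w, x> is not > 0
                w += yi * xi         # update: w <- w + y x
                mistakes += 1
        if mistakes == 0:            # on separable data, stop once every point is correct
            break
    return w
```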

  16.-18. Convergence: upper bounding the number of mistakes
     Assumptions:
     ◮ x_i belongs to B(ℝ^d, C), i.e. ||x_i|| < C
     ◮ E is separable, i.e. there exists a solution w* with ||w*|| = 1 such that ∀ i = 1...n, y_i ⟨w*, x_i⟩ > δ > 0
     Then the perceptron makes at most (C/δ)^2 mistakes.

  19. Bounding the number of misclassifications
     Proof. Upon the k-th misclassification, on some example x_i:
     w_{k+1} = w_k + y_i x_i
     ⟨w_{k+1}, w*⟩ = ⟨w_k, w*⟩ + y_i ⟨x_i, w*⟩ ≥ ⟨w_k, w*⟩ + δ ≥ ⟨w_{k−1}, w*⟩ + 2δ ≥ ... ≥ k δ
     In the meanwhile, since y_i ⟨w_k, x_i⟩ ≤ 0 on a mistake:
     ||w_{k+1}||^2 = ||w_k + y_i x_i||^2 ≤ ||w_k||^2 + C^2 ≤ ... ≤ k C^2
     Since ⟨w_{k+1}, w*⟩ ≤ ||w_{k+1}||, it follows that k δ ≤ √k C, hence k ≤ (C/δ)^2.
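
The bound can also be checked empirically. A sketch under the assumption that the data is generated around a known unit-norm w* with margin δ (all names, sizes and constants here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, delta = 5, 500, 0.2

# Hypothetical separable dataset built around a known unit-norm w*.
w_star = rng.normal(size=d)
w_star /= np.linalg.norm(w_star)
X = rng.uniform(-1.0, 1.0, size=(n, d))
keep = np.abs(X @ w_star) > delta          # enforce a margin of at least delta
X = X[keep]
y = np.sign(X @ w_star)
C = np.max(np.linalg.norm(X, axis=1))      # radius of the ball containing the x_i

# Run the perceptron until no mistakes remain, counting every update.
w = np.zeros(d)
mistakes, changed = 0, True
while changed:
    changed = False
    for xi, yi in zip(X, y):
        if yi * (w @ xi) <= 0:
            w += yi * xi
            mistakes += 1
            changed = True

print(mistakes, "<= (C/delta)^2 =", (C / delta) ** 2)
```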

  20. Going farther...
     Remark (linear programming): find w and δ maximizing δ subject to ∀ i = 1...n, y_i ⟨w, x_i⟩ > δ.
     This paves the way to Support Vector Machines...

  21. Adaline (Widrow, 1960): Adaptive Linear Element
     Given E = {(x_i, y_i), x_i ∈ ℝ^d, y_i ∈ ℝ, i = 1...n}
     Learning: minimization of a quadratic error
     w* = argmin_w { Err(w) = Σ_i (y_i − ⟨w, x_i⟩)^2 }
     Gradient algorithm: w_t = w_{t−1} − α_t ∇Err(w_{t−1})
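
A gradient-descent sketch of the quadratic-error minimization above, assuming numpy; adaline_fit, the learning rate and the epoch count are illustrative:

```python
import numpy as np

def adaline_fit(X, y, lr=0.001, epochs=2000):
    """Minimize Err(w) = sum_i (y_i - <w, x_i>)^2 by batch gradient descent."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        residuals = y - X @ w          # y_i - <w, x_i> for every example
        grad = -2.0 * X.T @ residuals  # gradient of the quadratic error
        w = w - lr * grad              # descent step: w <- w - alpha * grad(Err)
    return w

# Toy usage: recover a known linear target from noiseless data.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5])
print(adaline_fit(X, y))               # close to [1, -2, 0.5]
```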

  22. The NN winter: the limitations of linear hypotheses (Minsky and Papert, 1969). The XOR problem.

  23. Multi-Layer Perceptrons (Rumelhart and McClelland, 1986)
     Issues
     ◮ Several layers, non-linear separation: addresses the XOR problem
     ◮ A differentiable activation function:
     output(x) = 1 / (1 + exp(−⟨w, x⟩))

  24. The sigmoid function
     ◮ σ(t) = 1 / (1 + exp(−a·t)), a > 0
     ◮ approximates the step function (binary decision)
     ◮ approximately linear close to 0
     ◮ steepest increase close to 0
     ◮ σ'(t) = a σ(t)(1 − σ(t))
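
The derivative stated on this slide follows from a direct computation:

```latex
\sigma(t) = \frac{1}{1 + e^{-a t}}
\quad\Longrightarrow\quad
\sigma'(t) = \frac{a\,e^{-a t}}{\left(1 + e^{-a t}\right)^{2}}
           = a \cdot \frac{1}{1 + e^{-a t}} \cdot \frac{e^{-a t}}{1 + e^{-a t}}
           = a\,\sigma(t)\bigl(1 - \sigma(t)\bigr).
```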

  25. Back-propagation algorithm (Rumelhart and McClelland, 1986; Le Cun, 1986)
     ◮ Given a training sample (x, y) drawn uniformly at random
     ◮ Set the d input units of the network to x_1 ... x_d
     ◮ Compute iteratively the output of each neuron up to the final layer: output ŷ
     ◮ Compare ŷ and y: Err(w) = (ŷ − y)^2
     ◮ Modify the NN weights of the last layer based on the gradient value
     ◮ Looking at the previous layer: we know what we would have liked to get as output; infer what we would have liked to get as input, i.e. as output of the previous layer. And back-propagate...
     ◮ The errors on each layer i are used to modify the weights that compute the output of layer i from its input.

  26. Back-propagation of the gradient: notations
     Input: x = (x_1, ..., x_d)
     From the input to the first hidden layer:
     z_j^(1) = Σ_k w_jk x_k,   x_j^(1) = f(z_j^(1))
     From layer i to layer i+1:
     z_j^(i+1) = Σ_k w_jk^(i) x_k^(i),   x_j^(i+1) = f(z_j^(i+1))
     (f: e.g. the sigmoid)

  27. Back-propagation of the gradient
     Input: (x, y), x ∈ ℝ^d, y ∈ {−1, 1}
     Phase 1: propagate the information forward
     ◮ For layer i = 1...ℓ, for every neuron j on layer i:
       z_j^(i) = Σ_k w_{j,k}^(i) x_k^(i−1),   x_j^(i) = f(z_j^(i))
     Phase 2: compare the target output y to the obtained output x_1^(ℓ)
     (NB: for simplicity one assumes here a single output unit, i.e. the label is a scalar value.)
     ◮ Error: the difference between ŷ = x_1^(ℓ) and y. Define
       e_output = f'(z_1^(ℓ)) [ŷ − y]
       where f'(t) is the (scalar) derivative of f at point t.

  28. Back-propagation of the gradient
     Phase 3: back-propagate the errors
     e_j^(i−1) = f'(z_j^(i−1)) Σ_k w_{kj}^(i) e_k^(i)
     Phase 4: update the weights on all layers
     Δw_{ij}^(k) = −α e_i^(k) x_j^(k−1)
     where α < 1 is the learning rate (the minus sign gives a descent step, since e is defined from [ŷ − y]).
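
Phases 1-4 can be collected into a single stochastic update. A minimal numpy sketch, assuming fully connected layers with sigmoid activations and no bias terms; backprop_step and the weight-list representation W are illustrative choices, not the course's code:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, y, W, alpha=0.1):
    """One stochastic gradient step on Err = ||y_hat - y||^2 / 2.

    W is a list of weight matrices, W[i] mapping layer i to layer i + 1."""
    # Phase 1: forward pass, storing pre-activations z and activations a.
    a, zs = [np.asarray(x, dtype=float)], []
    for Wi in W:
        zs.append(Wi @ a[-1])
        a.append(sigmoid(zs[-1]))
    y_hat = a[-1]

    # Phase 2: error on the output layer, e = f'(z) * (y_hat - y).
    e = sigmoid(zs[-1]) * (1.0 - sigmoid(zs[-1])) * (y_hat - y)

    # Phases 3-4: back-propagate the errors, then take a descent step.
    for i in reversed(range(len(W))):
        grad = np.outer(e, a[i])                 # dErr/dW[i], entry (j, k) = e_j * x_k
        if i > 0:                                # error of the previous layer (old weights)
            e = sigmoid(zs[i - 1]) * (1.0 - sigmoid(zs[i - 1])) * (W[i].T @ e)
        W[i] -= alpha * grad                     # descent direction: minus the gradient
    return y_hat

# Toy usage: a 3-4-1 network on one sample.
rng = np.random.default_rng(0)
W = [rng.normal(size=(4, 3)), rng.normal(size=(1, 4))]
print(backprop_step(np.array([0.2, -0.1, 0.5]), np.array([1.0]), W))
```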

  29. This course: Bio-inspired algorithms; Classical Neural Nets (History, Structure, Applications)

  30. Neural nets
     Ingredients
     ◮ An activation function
     ◮ A connection topology, i.e. a directed graph: feedforward (≡ DAG, directed acyclic graph) or recurrent
     ◮ A (scalar, real-valued) weight on each connection
     Activation functions (of the weighted input z)
     ◮ thresholded: 0 if z < threshold, 1 otherwise
     ◮ linear: z
     ◮ sigmoid: 1 / (1 + e^{-z})
     ◮ radius-based: e^{-z^2/σ^2}
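
The activation functions listed above, written as vectorized Python callables (the function names and default parameters are illustrative):

```python
import numpy as np

def thresholded(z, theta=0.0):
    """0 below the threshold, 1 above."""
    return np.where(z < theta, 0.0, 1.0)

def linear(z):
    return z

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def radius_based(z, sigma=1.0):
    """Gaussian-shaped (RBF-style) activation."""
    return np.exp(-z**2 / sigma**2)
```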

  31. Neural nets
     Ingredients: as above (activation function, connection topology, a weight on each connection).
     Feedforward NN. (C) David MacKay, Cambridge Univ. Press

  32. Neural nets
     Ingredients: as above.
     Recurrent NN
     ◮ Propagate until stabilisation
     ◮ Back-propagation does not apply
     ◮ Memory of the recurrent NN: the values of the hidden neurons. Beware that this memory fades exponentially fast
     ◮ Suited to dynamic data (audio, video)

  33. Structure / Connection graph / Topology
     Prior knowledge
     ◮ Invariance under translation, rotation, ... (an operator op)
     ◮ → complete E: consider the examples (op(x_i), y_i)
     ◮ or use weight sharing: convolutional networks (100,000 weights → 2,600 parameters)
     Details: http://yann.lecun.com/exdb/lenet/
     Demos: http://deeplearning.net/tutorial/lenet.html
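
To see what weight sharing buys, here is a minimal 1-D sketch; the parameter counts in the comment are toy numbers chosen for illustration, not the LeNet figures quoted above:

```python
import numpy as np

# Fully connected: mapping a 32x32 input to 100 units costs 32*32*100 = 102,400 weights.
# Convolutional with weight sharing: one 5x5 kernel reused at every position costs only 25.
def conv1d_shared(x, kernel):
    """1-D convolution: the same (shared) kernel slides over the whole input."""
    k = len(kernel)
    return np.array([x[i:i + k] @ kernel for i in range(len(x) - k + 1)])

x = np.arange(10.0)
print(conv1d_shared(x, np.array([1.0, 0.0, -1.0])))   # a tiny local edge detector
```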

  34. Hubel & Wiesel, 1968: the visual cortex of the cat
     ◮ Cells are arranged in such a way that each cell observes a fraction of the visual field (its receptive field),
     ◮ and the union of these receptive fields covers the whole visual field.
     Characteristics
     ◮ Simple cells check the presence of a pattern
     ◮ More complex cells, with a larger receptive field, detect the presence of a pattern up to translation/rotation

  35. Sparse connectivity
     ◮ Reduces the number of weights
     ◮ Layer m: detects local patterns
     ◮ Layer m+1: non-linear aggregation over a more global field
