Learning and Meta-learning
  1. Learning and Meta-learning
     • computation
       – making predictions
       – choosing actions
       – acquiring episodes
       – statistics
     • algorithm
       – gradient ascent (e.g. of the likelihood)
       – correlation
       – Kalman filtering
     • implementation
       – Hebbian synaptic plasticity
       – neuromodulation

  2. Types of Learning
     • supervised (v | u): inputs u and desired or target outputs v both provided, e.g. prediction → outcome
     • reinforcement (max r | u): input u and scalar evaluation r, often with a temporal credit assignment problem
     • unsupervised or self-supervised (u): learn structure from statistics
     These are closely related: supervised learning acquires P[v | u]; unsupervised learning acquires P[v, u].

  3. Hebb
     Famously suggested: if cell A consistently contributes to the activity of cell B, then the synapse from A to B should be strengthened.
     • strong element of causality
     • what about weakening (LTD)?
     • multiple timescales – STP to protein synthesis
     • multiple biochemical mechanisms
     • systems:
       – hippocampus – multiple sub-areas
       – neocortex – layer and area differences
       – cerebellum – LTD is the norm

  4. Neural Rules
     [Figure: field potential amplitude (mV) versus time (min). 100 Hz stimulation drives LTP to a potentiated level; subsequent 2 Hz stimulation drives LTD to a depressed, partially depotentiated level, shown relative to the control level.]

  5. Stability and Competition
     Hebbian learning involves positive feedback. Control by:
     • LTD – usually not enough (covariance versus correlation)
     • saturation – prevent synaptic weights from getting too big (or too small); triviality beckons
     • competition
       – spike-time dependent learning rules
       – normalization over pre-synaptic or post-synaptic arbors
         • subtractive: decrease all synapses by the same amount, whether large or small
         • multiplicative: decrease large synapses by more than small synapses

  6. Preamble
     Linear firing rate model:
     \tau_r \frac{dv}{dt} = -v + \mathbf{w} \cdot \mathbf{u} = -v + \sum_{b=1}^{N_u} w_b u_b
     Assume that τ_r is small compared with the rate of change of the weights; then v = w · u during plasticity. Then have
     \tau_w \frac{d\mathbf{w}}{dt} = f(v, \mathbf{u}, \mathbf{w})
     Supervised rules use targets to specify v – neural basis in ACh?
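A minimal numerical sketch of this firing-rate model (assuming NumPy; the time constants, input rates and weights are made-up illustrative values), showing that when τ_r is small the output quickly relaxes to v = w · u:

```python
import numpy as np

tau_r = 0.01          # s, fast compared with weight changes (assumed)
dt = 0.001            # s, integration step (assumed)
u = np.array([1.0, 0.5, 0.2, 0.8, 0.3])   # fixed input rates (made up)
w = np.array([0.2, 0.1, 0.4, 0.3, 0.5])   # fixed weights (made up)

v = 0.0
for _ in range(1000):                      # 1 s of simulated time
    v += dt / tau_r * (-v + w @ u)         # tau_r dv/dt = -v + w.u

print(v, w @ u)                            # v has relaxed to the steady state w.u
```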

  7. The Basic Hebb Rule
     \tau_w \frac{d\mathbf{w}}{dt} = v\,\mathbf{u}
     Averaging over the input statistics gives
     \tau_w \frac{d\mathbf{w}}{dt} = \langle v\,\mathbf{u} \rangle = \langle \mathbf{u}\mathbf{u} \rangle \cdot \mathbf{w} = Q \cdot \mathbf{w}
     where Q is the input correlation matrix. Positive feedback instability:
     \tau_w \frac{d|\mathbf{w}|^2}{dt} = 2\,\mathbf{w} \cdot \tau_w \frac{d\mathbf{w}}{dt} = 2 v^2
     Also have the discretised version \mathbf{w} \to \mathbf{w} + \frac{T}{\tau_w} Q \cdot \mathbf{w}, integrating over time, presenting patterns for T seconds.
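A small simulation sketch of the basic Hebb rule (the input-mixing matrix, time constants and learning schedule are assumptions for illustration), showing the positive-feedback instability: |w| grows without bound.

```python
import numpy as np

rng = np.random.default_rng(0)
tau_w, dt, N_u = 100.0, 0.1, 10

A = rng.normal(size=(N_u, N_u)) / np.sqrt(N_u)   # assumed input mixing
def sample_u():
    return A @ rng.normal(size=N_u)              # zero-mean correlated inputs, Q = A A^T

w = rng.normal(scale=0.1, size=N_u)
norm_start = np.linalg.norm(w)
for _ in range(5000):
    u = sample_u()
    v = w @ u                       # linear neuron, v = w.u
    w += dt / tau_w * v * u         # tau_w dw/dt = v u  (basic Hebb)

print(norm_start, np.linalg.norm(w))   # |w| has grown enormously: positive-feedback instability
```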

  8. Covariance Rule
     Since LTD really exists, contra Hebb:
     \tau_w \frac{d\mathbf{w}}{dt} = (v - \theta_v)\,\mathbf{u} \qquad \text{or} \qquad \tau_w \frac{d\mathbf{w}}{dt} = (\mathbf{u} - \boldsymbol{\theta}_u)\,v
     If \theta_v = \langle v \rangle or \boldsymbol{\theta}_u = \langle \mathbf{u} \rangle then
     \tau_w \frac{d\mathbf{w}}{dt} = C \cdot \mathbf{w}
     where C = \langle (\mathbf{u} - \langle \mathbf{u} \rangle)(\mathbf{u} - \langle \mathbf{u} \rangle) \rangle is the input covariance matrix. Still unstable:
     \tau_w \frac{d|\mathbf{w}|^2}{dt} = 2 v (v - \langle v \rangle)
     which averages to the (positive) covariance of v.
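A sketch of the covariance rule in the same spirit (all parameters assumed; θ_v is tracked here as a running average of v, which is one simple way to approximate ⟨v⟩ online). The norm of w still diverges, as the slide argues:

```python
import numpy as np

rng = np.random.default_rng(1)
tau_w, dt, N_u = 100.0, 0.1, 10

A = rng.normal(size=(N_u, N_u)) / np.sqrt(N_u)
mean_u = rng.uniform(0.5, 1.5, size=N_u)         # non-zero mean inputs (assumed)
def sample_u():
    return mean_u + A @ rng.normal(size=N_u)

w = rng.normal(scale=0.1, size=N_u)
theta_v = 0.0                                    # running estimate of <v>
for _ in range(5000):
    u = sample_u()
    v = w @ u
    theta_v += 0.01 * (v - theta_v)              # average of v on a timescale faster than tau_w
    w += dt / tau_w * (v - theta_v) * u          # tau_w dw/dt = (v - theta_v) u

print(np.linalg.norm(w))    # still diverges: the covariance rule is also unstable
```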

  9. BCM Rule
     Odd to have LTD with v = 0 or u = 0. Evidence for
     \tau_w \frac{d\mathbf{w}}{dt} = v\,\mathbf{u}\,(v - \theta_v)
     [Figure: weight change per unit u as a function of v – negative (LTD) for 0 < v < \theta_v, positive (LTP) for v > \theta_v.]
     If \theta_v slides to match a high power of v,
     \tau_\theta \frac{d\theta_v}{dt} = v^2 - \theta_v
     with a fast \tau_\theta, then get competition between synapses – intrinsic stabilization.
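A rough sketch of the BCM rule with its sliding threshold (the input patterns, time constants and the non-negativity clip are illustrative assumptions). The point is that v and w stay bounded, in contrast with the runaway of the plain Hebb rule:

```python
import numpy as np

rng = np.random.default_rng(2)
tau_w, tau_theta, dt, N_u = 200.0, 10.0, 0.1, 10    # tau_theta << tau_w: fast sliding threshold

u_patterns = rng.uniform(0.0, 1.0, size=(50, N_u))  # fixed set of positive input patterns (assumed)
w = rng.uniform(0.0, 0.1, size=N_u)
theta_v = 1.0

for _ in range(50000):
    u = u_patterns[rng.integers(50)]
    v = w @ u
    w += dt / tau_w * v * u * (v - theta_v)          # tau_w dw/dt = v u (v - theta_v)
    theta_v += dt / tau_theta * (v**2 - theta_v)     # tau_theta dtheta_v/dt = v^2 - theta_v
    np.clip(w, 0.0, None, out=w)                     # keep weights non-negative (added assumption)

print(theta_v, np.linalg.norm(w))   # both stay bounded: the sliding threshold stops runaway growth
```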

  10. Subtractive Normalisation
     Could normalise |\mathbf{w}|^2 or \sum_b w_b = \mathbf{n} \cdot \mathbf{w}, with \mathbf{n} = (1, 1, \ldots, 1).
     For subtractive normalisation of \mathbf{n} \cdot \mathbf{w}:
     \tau_w \frac{d\mathbf{w}}{dt} = v\,\mathbf{u} - \frac{v\,(\mathbf{n} \cdot \mathbf{u})}{N_u}\,\mathbf{n}
     with dynamic subtraction, since
     \tau_w \frac{d(\mathbf{n} \cdot \mathbf{w})}{dt} = v\,\mathbf{n} \cdot \mathbf{u} \left( 1 - \frac{\mathbf{n} \cdot \mathbf{n}}{N_u} \right) = 0
     as \mathbf{n} \cdot \mathbf{n} = N_u. Strongly competitive – typically all the weights bar one go to 0. Therefore use an upper saturating limit.
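A sketch of subtractive normalisation (the inputs, bounds and learning parameters are assumed). It checks that the rule itself conserves Σ_b w_b and that, once saturation is added, the dynamics are strongly competitive:

```python
import numpy as np

rng = np.random.default_rng(3)
tau_w, dt, N_u = 100.0, 0.1, 10
n = np.ones(N_u)

A = rng.normal(size=(N_u, N_u)) / np.sqrt(N_u)      # assumed input mixing
def sample_u():
    return A @ rng.normal(size=N_u)                 # zero-mean correlated inputs

w = rng.uniform(0.2, 0.4, size=N_u)
print("initial sum:", n @ w)
for _ in range(20000):
    u = sample_u()
    v = w @ u
    w += dt / tau_w * (v * u - v * (n @ u) / N_u * n)   # Hebb minus the subtractive term
    np.clip(w, 0.0, 1.0, out=w)                         # saturating limits (as on the slide)

print("final sum:  ", n @ w)   # the rule conserves the sum exactly; only clipping perturbs it
print(w)                       # strongly competitive: weights segregate toward the bounds
```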

  11. The Oja Rule
     A multiplicative way to ensure |\mathbf{w}|^2 is constant:
     \tau_w \frac{d\mathbf{w}}{dt} = v\,\mathbf{u} - \alpha v^2 \mathbf{w}
     gives
     \tau_w \frac{d|\mathbf{w}|^2}{dt} = 2 v^2 (1 - \alpha |\mathbf{w}|^2)
     so |\mathbf{w}|^2 \to 1/\alpha. Dynamic normalisation – could also enforce normalisation all the time.
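A sketch of the Oja rule (assumed correlation structure and parameters), verifying numerically that |w|² approaches 1/α and that w lines up with the principal eigenvector e_1 of Q:

```python
import numpy as np

rng = np.random.default_rng(4)
tau_w, dt, N_u, alpha = 100.0, 0.1, 10, 1.0

A = rng.normal(size=(N_u, N_u)) / np.sqrt(N_u)
Q = A @ A.T                                     # input correlation matrix (assumed)
def sample_u():
    return A @ rng.normal(size=N_u)

w = rng.normal(scale=0.1, size=N_u)
for _ in range(50000):
    u = sample_u()
    v = w @ u
    w += dt / tau_w * (v * u - alpha * v**2 * w)    # Oja: tau_w dw/dt = v u - alpha v^2 w

e1 = np.linalg.eigh(Q)[1][:, -1]                    # principal eigenvector of Q
print(np.linalg.norm(w)**2)                         # fluctuates around 1/alpha
print(abs(w @ e1) / np.linalg.norm(w))              # close to 1: w aligns with e1
```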

  12. Timing-Based Rules
     [Figure: A – EPSP amplitude (% of control) versus time (min) for pairings at +10 ms, ±100 ms and −10 ms; B – percent potentiation versus t_post − t_pre (ms). Slice cortical pyramidal cells; Xenopus retinotectal system.]
     • window of 50 ms
     • gets Hebbian causality right
     • rate description:
       \tau_w \frac{d\mathbf{w}}{dt} = \int_0^{\infty} d\tau \, \big( H(\tau)\, v(t)\, \mathbf{u}(t - \tau) + H(-\tau)\, v(t - \tau)\, \mathbf{u}(t) \big)
     • spike-based description necessary if an input spike can have a measurable impact on an output spike
     • critical factor is the overall integral – net LTD with ‘local’ LTP
     • partially self-stabilizing
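A toy spike-pairing sketch with exponential STDP windows (all amplitudes and time constants are assumed values). The depression integral is chosen slightly larger than the potentiation integral, so the rule is net LTD with ‘local’ LTP, as the slide emphasises:

```python
import numpy as np

tau_plus, tau_minus = 20.0, 20.0        # ms, window widths (assumed)
A_plus, A_minus = 0.005, 0.00525        # LTD integral slightly larger than LTP integral (assumed)

def delta_w(dt_ms):
    """Weight change for a single pre/post pair, dt = t_post - t_pre."""
    if dt_ms > 0:
        return A_plus * np.exp(-dt_ms / tau_plus)     # causal pairing: potentiation
    else:
        return -A_minus * np.exp(dt_ms / tau_minus)   # anti-causal pairing: depression

for dt_ms in (-50, -10, 10, 50):
    print(dt_ms, delta_w(dt_ms))

# Overall integral of the window: negative here, i.e. net LTD with 'local' LTP.
print(A_plus * tau_plus - A_minus * tau_minus)
```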

  13. Timing-Based Rules
     Gutig et al.; van Rossum et al.:
     \Delta w_i = \begin{cases} -\lambda f_-(w_i) K(\Delta t) & \text{if } \Delta t \le 0 \\ \lambda f_+(w_i) K(\Delta t) & \text{if } \Delta t > 0 \end{cases}
     K(\Delta t) = e^{-|\Delta t|/\tau} \qquad f_+(w) = (1 - w)^\mu \qquad f_-(w) = \alpha w^\mu
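A direct transcription of this weight-dependent update into code (the values of λ, α, μ and τ are illustrative, and the clipping of w to [0, 1] is an added assumption):

```python
import numpy as np

lam, alpha, mu, tau = 0.01, 1.05, 0.05, 20.0   # mu -> 0: additive-like; mu = 1: multiplicative

def stdp_update(w, dt_ms):
    """Return the updated weight (kept in [0, 1]) for a pair with dt = t_post - t_pre (ms)."""
    K = np.exp(-abs(dt_ms) / tau)
    if dt_ms > 0:
        w = w + lam * (1.0 - w)**mu * K        # potentiation, f_+(w) = (1 - w)^mu
    else:
        w = w - lam * alpha * w**mu * K        # depression, f_-(w) = alpha w^mu
    return min(max(w, 0.0), 1.0)

w = 0.5
for dt_ms in (10, -10, 10, -10):
    w = stdp_update(w, dt_ms)
    print(dt_ms, w)
```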

  14. FP Analysis
     How can we predict the weight distribution?
     \frac{1}{\rho_{in}} \frac{\partial P(w, t)}{\partial t} = -p_p P(w, t) - p_d P(w, t) + p_p P(w - w_p, t) + p_d P(w + w_d, t)
     Taylor-expanding about P(w, t) leads to a Fokker-Planck equation. Need to work out p_d and p_p; assume steady firing.
     Depression: p_d = t_{window} / t_{isi}
     Potentiation (input affects output): p_p = \int_0^{t_w} P(\delta t)\, d\delta t
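A back-of-the-envelope sketch of the two jump probabilities under the extra assumption of Poisson output spiking (so the inter-spike-interval density is exponential); the rate and window length are made-up values:

```python
import numpy as np

r = 10.0                 # Hz, assumed steady output rate
t_window = 0.02          # s, assumed STDP window t_w
t_isi = 1.0 / r          # mean inter-spike interval

p_d = t_window / t_isi                    # depression: window length / mean ISI
p_p = 1.0 - np.exp(-r * t_window)         # potentiation: integral of the exponential ISI density over [0, t_w]

print(p_d, p_p)
```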

  15. Single Postsynaptic Neuron
     Basic Hebb rule:
     \tau_w \frac{d\mathbf{w}}{dt} = Q \cdot \mathbf{w}
     Analyse using an eigendecomposition of Q:
     Q \cdot \mathbf{e}_\mu = \lambda_\mu \mathbf{e}_\mu \qquad \lambda_1 \ge \lambda_2 \ge \ldots
     Since Q is symmetric and positive (semi-)definite:
     • complete set of real orthonormal eigenvectors
     • with non-negative eigenvalues
     • whose growth is decoupled
     Write
     \mathbf{w}(t) = \sum_{\mu=1}^{N_u} c_\mu(t) \mathbf{e}_\mu
     then
     c_\mu(t) = c_\mu(0) \exp\!\left( \frac{\lambda_\mu t}{\tau_w} \right)
     and \mathbf{w}(t) \to \alpha(t)\, \mathbf{e}_1 as t \to \infty.
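A numerical check of this eigendecomposition argument (the correlation matrix, horizon T and τ_w are assumed): integrating the averaged rule in closed form mode by mode, w(T) ends up aligned with e_1:

```python
import numpy as np

rng = np.random.default_rng(5)
N_u, tau_w, T = 8, 100.0, 1000.0

A = rng.normal(size=(N_u, N_u)) / np.sqrt(N_u)
Q = A @ A.T                                   # symmetric, positive semi-definite correlation matrix
lams, evecs = np.linalg.eigh(Q)               # eigenvalues in ascending order
e1 = evecs[:, -1]                             # principal eigenvector

w0 = rng.normal(size=N_u)
coeffs = evecs.T @ w0                         # c_mu(0)
w_T = evecs @ (coeffs * np.exp(lams * T / tau_w))   # c_mu(T) = c_mu(0) exp(lambda_mu T / tau_w)

print(abs(w_T @ e1) / np.linalg.norm(w_T))    # close to 1: w(T) is dominated by e1
```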

  16. Constraints
     \alpha(t) = \exp(\lambda_1 t / \tau_w) \to \infty.
     • Oja makes \mathbf{w}(t) \to \mathbf{e}_1 / \sqrt{\alpha}
     • saturation can disturb the outcome
     [Figure: A, B – weight trajectories in the (w_1, w_2) plane, with both weights confined to [0, 1].]
     • subtractive constraint:
       \tau_w \dot{\mathbf{w}} = Q \cdot \mathbf{w} - \frac{(\mathbf{w} \cdot Q \cdot \mathbf{n})}{N_u}\,\mathbf{n}
     Sometimes \mathbf{e}_1 \propto \mathbf{n} – so its growth is stunted; and \mathbf{e}_\mu \cdot \mathbf{n} = 0 for \mu \ne 1, so
     \mathbf{w}(t) = (\mathbf{w}(0) \cdot \mathbf{e}_1)\,\mathbf{e}_1 + \sum_{\mu=2}^{N_u} \exp\!\left( \frac{\lambda_\mu t}{\tau_w} \right) (\mathbf{w}(0) \cdot \mathbf{e}_\mu)\,\mathbf{e}_\mu
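A sketch of the subtractively constrained dynamics (the correlation matrix is constructed by hand so that e_1 ∝ n; eigenvalues and parameters are assumed). It confirms that n · w is conserved and that w instead grows along e_2:

```python
import numpy as np

rng = np.random.default_rng(6)
N_u, tau_w, dt = 6, 100.0, 0.1
n = np.ones(N_u)

# Hand-built correlation matrix whose principal eigenvector is proportional to n.
basis = np.linalg.qr(np.column_stack([n, rng.normal(size=(N_u, N_u - 1))]))[0]
lams = np.array([3.0, 2.0, 1.5, 1.0, 0.5, 0.2])      # lambda_1 belongs to the n-direction
Q = basis @ np.diag(lams) @ basis.T

w = rng.normal(scale=0.1, size=N_u)
nw0 = n @ w
for _ in range(10000):
    w += dt / tau_w * (Q @ w - (w @ Q @ n) / N_u * n)   # subtractively constrained Hebb

e2 = basis[:, 1]
print(nw0, n @ w)                             # n.w is conserved: growth along n is stunted
print(abs(w @ e2) / np.linalg.norm(w))        # close to 1: e2 dominates instead
```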

  17. Translation Invariance
     Particularly important case for development has
     Q_{bb'} = Q(b - b') \qquad \langle u_b \rangle = \langle u \rangle
     Write \mathbf{n} = (1, \ldots, 1) and J = \mathbf{n}\mathbf{n}^T, then Q' = Q - N \langle u \rangle^2 J
     1. \mathbf{e}_\mu \cdot \mathbf{n} = 0: AC modes are unaffected
     2. \mathbf{e}_\mu \cdot \mathbf{n} \ne 0: DC modes are affected
     3. Q has discrete sines and cosines as eigenvectors
     4. the Fourier spectrum of Q gives the eigenvalues
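A small check of points 3 and 4 with an assumed circular (wrap-around) correlation profile: the eigenvalues of the translation-invariant Q coincide with the Fourier spectrum of Q(b):

```python
import numpy as np

N = 16
b = np.arange(N)
q = np.exp(-np.minimum(b, N - b)**2 / (2 * 3.0**2))    # assumed circular Gaussian correlation profile
Q = np.array([[q[(i - j) % N] for j in range(N)] for i in range(N)])   # Q_{bb'} = Q(b - b')

eigvals = np.sort(np.linalg.eigvalsh(Q))
fourier = np.sort(np.fft.fft(q).real)                  # Fourier spectrum of the profile q
print(np.allclose(eigvals, fourier, atol=1e-10))       # True: eigenvalues = Fourier spectrum
```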

  18. PCA
     What is the significance of \mathbf{e}_1?
     [Figure: A, B, C – input distributions and the weight vector plotted in the (u_1, w_1)–(u_2, w_2) plane.]
     • optimal linear reconstruction: minimise E(\mathbf{w}, \mathbf{g}) = \langle |\mathbf{u} - \mathbf{g} v|^2 \rangle
     • information maximisation: I[v, \mathbf{u}] = H[v] - H[v|\mathbf{u}] under a linear model
     • assume \langle \mathbf{u} \rangle = \mathbf{0}, or use C instead of Q

  19. Linear Reconstruction
     E(\mathbf{w}, \mathbf{g}) = \langle |\mathbf{u} - \mathbf{g} v|^2 \rangle = K - 2\,\mathbf{w} \cdot Q \cdot \mathbf{g} + |\mathbf{g}|^2\, \mathbf{w} \cdot Q \cdot \mathbf{w}
     quadratic in \mathbf{w} with minimum at
     \mathbf{w}^* = \frac{\mathbf{g}}{|\mathbf{g}|^2}
     making E(\mathbf{w}^*, \mathbf{g}) = K - \frac{\mathbf{g} \cdot Q \cdot \mathbf{g}}{|\mathbf{g}|^2}.
     Look for a solution with \mathbf{g} = \sum_k (\mathbf{e}_k \cdot \mathbf{g})\,\mathbf{e}_k and |\mathbf{g}|^2 = 1:
     E(\mathbf{w}^*, \mathbf{g}) = K - \sum_{k=1}^{N} (\mathbf{e}_k \cdot \mathbf{g})^2 \lambda_k
     The minimum clearly has \mathbf{e}_1 \cdot \mathbf{g} = 1 and \mathbf{e}_2 \cdot \mathbf{g} = \mathbf{e}_3 \cdot \mathbf{g} = \ldots = 0.
     Therefore \mathbf{g} and \mathbf{w} both point along the principal component.
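A numerical confirmation of this argument (the correlation matrix is assumed, and K is taken as ⟨|u|²⟩ = trace(Q) for zero-mean Gaussian u): evaluating E(w*, g) with g set to each unit eigenvector shows the reconstruction error is smallest for g = e_1:

```python
import numpy as np

rng = np.random.default_rng(7)
N_u = 6
A = rng.normal(size=(N_u, N_u)) / np.sqrt(N_u)
Q = A @ A.T                                   # assumed input correlation matrix
lams, evecs = np.linalg.eigh(Q)               # eigenvalues ascending

def E(w, g):
    # E(w, g) = K - 2 w.Q.g + |g|^2 w.Q.w  with  K = <|u|^2> = trace(Q)
    return np.trace(Q) - 2 * w @ Q @ g + (g @ g) * (w @ Q @ w)

for k in range(N_u):
    g = evecs[:, k]               # unit eigenvector e_k
    w_star = g / (g @ g)          # optimal w for this g
    print(lams[k], E(w_star, g))  # E = trace(Q) - lambda_k: smallest for the largest eigenvalue
```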

  20. Infomax (Linsker)
     \mathrm{argmax}_{\mathbf{w}}\; I[v, \mathbf{u}] = H[v] - H[v|\mathbf{u}]
     Very general unsupervised learning suggestion:
     • H[v|\mathbf{u}] is not quite well defined unless v = \mathbf{w} \cdot \mathbf{u} + \eta with noise \eta (a deterministic v makes the conditional entropy arbitrarily negative)
     • H[v] = \frac{1}{2} \log 2\pi e \sigma^2 for a Gaussian
     If P[\mathbf{u}] \sim \mathcal{N}[\mathbf{0}, Q] then v \sim \mathcal{N}[0, \mathbf{w} \cdot Q \cdot \mathbf{w} + \upsilon^2], so maximise \mathbf{w} \cdot Q \cdot \mathbf{w} subject to |\mathbf{w}|^2 = 1 (note the normalisation).
     Same problem as above: implies that \mathbf{w} \propto \mathbf{e}_1.
     If non-Gaussian, only maximising an upper bound on I[v, \mathbf{u}].

  21. Ocular Dominance
     [Figure: retina–thalamus–cortex wiring diagram – cortical output v(a), left/right thalamic inputs u(b), feedforward weights W(a;b), competitive cortical interactions A(a;b); inset shows an ocularity map alternating between L and R.]
     • retina-thalamus-cortex
     • OD develops around eye-opening
     • interaction with refinement of topography
     • interaction with orientation
     • interaction with ipsi/contra-innervation
     • effect of manipulations to input

  22. Start Simple
     Consider one input from each eye:
     v = w_R u_R + w_L u_L
     Then
     Q = \langle \mathbf{u}\mathbf{u} \rangle = \begin{pmatrix} q_S & q_D \\ q_D & q_S \end{pmatrix}
     has
     \mathbf{e}_1 = (1, 1)/\sqrt{2},\; \lambda_1 = q_S + q_D \qquad \mathbf{e}_2 = (1, -1)/\sqrt{2},\; \lambda_2 = q_S - q_D
     so if w_+ = w_R + w_L and w_- = w_R - w_L, then
     \tau_w \frac{dw_+}{dt} = (q_S + q_D)\, w_+ \qquad \tau_w \frac{dw_-}{dt} = (q_S - q_D)\, w_-
     Since q_D \ge 0, w_+ dominates – so use subtractive normalisation:
     \tau_w \frac{dw_+}{dt} = 0 \qquad \tau_w \frac{dw_-}{dt} = (q_S - q_D)\, w_-
     so w_- \to \pm\omega and one eye dominates.
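A toy simulation of this two-input model (q_S, q_D, the saturation bound and the nearly-binocular initial weights are assumed values): with subtractive normalisation w_+ is held fixed while w_- grows until one eye dominates.

```python
import numpy as np

q_S, q_D = 1.0, 0.4                 # same-eye and between-eye correlations (assumed)
Q = np.array([[q_S, q_D],
              [q_D, q_S]])
tau_w, dt, bound = 100.0, 0.1, 1.0
n = np.ones(2)

rng = np.random.default_rng(8)
w = np.array([0.5, 0.5]) + rng.normal(scale=0.01, size=2)   # nearly binocular start
for _ in range(50000):
    dw = Q @ w - (n @ Q @ w) / 2 * n        # Hebb with subtractive normalisation
    w += dt / tau_w * dw
    np.clip(w, 0.0, bound, out=w)           # saturating limit

print(w)                          # one weight -> bound, the other -> 0: one eye dominates
print(w[0] + w[1], w[0] - w[1])   # w_+ held fixed by the normalisation; w_- grows to the limit
```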

  23. Orientation Selectivity
     Model is exactly the same – input correlations come from ON/OFF cells:
     [Figure: C – the difference correlation Q^-(b) versus b; D – its Fourier transform \tilde{Q}^-(\tilde{b}).]
     Now the dominant mode of Q^- has spatial structure; a centre-surround version is also possible, but is usually dominated because of non-linear effects.
