1 Real Neural Learning Artificial Neuron Model Model network as a - PDF document

Neural Networks • Analogy to biological neural systems, the most robust learning systems we know. • Attempt to understand natural biological systems CS 391L: Machine Learning through computational modeling. • Massive parallelism allows for computational Neural Networks efficiency. • Help understand “distributed” nature of neural representations (rather than “localist” representation) that allow robustness and graceful degradation. Raymond J. Mooney • Intelligent behavior as an “emergent” property of large number of simple units rather than from University of Texas at Austin explicitly encoded symbolic rules and algorithms. 1 2 Neural Speed Constraints Neural Network Learning • Neurons have a “switching time” on the order of a • Learning approach based on modeling few milliseconds, compared to nanoseconds for adaptation in biological neural systems. current computing hardware. • However, neural systems can perform complex • Perceptron: Initial algorithm for learning cognitive tasks (vision, speech understanding) in simple neural networks (single layer) tenths of a second. developed in the 1950’s. • Only time for performing 100 serial steps in this time frame, compared to orders of magnitude • Backpropagation: More complex algorithm more for current computers. for learning multi-layer neural networks • Must be exploiting “massive parallelism.” • Human brain has about 10 11 neurons with an developed in the 1980’s. average of 10 4 connections each. 3 4 Real Neurons Neural Communication • Electrical potential across cell membrane exhibits spikes • Cell structures called action potentials. – Cell body • Spike originates in cell body, travels down – Dendrites axon, and causes synaptic terminals to release neurotransmitters. – Axon • Chemical diffuses across synapse to – Synaptic terminals dendrites of other neurons. • Neurotransmitters can be excititory or inhibitory. • If net input of neurotransmitters to a neuron from other neurons is excititory and exceeds some threshold, it fires an action potential. 5 6 1

Real Neural Learning Artificial Neuron Model • Model network as a graph with cells as nodes and synaptic • Synapses change size and strength with connections as weighted edges from node i to node j , w ji experience. 1 • Hebbian learning: When two connected • Model net input to cell as w 12 w 16 w 15 neurons are firing at the same time, the w 13 w 14 = net w o j ji i strength of the synapse between them ∑ i 2 3 4 5 6 increases. • Cell output is: o j net < T 0 if • “Neurons that fire together, wire together.” = j j o ≥ j net T 1 1 if i j ( T j is threshold for unit j ) 0 T j net j 7 8 Neural Computation Perceptron Training • McCollough and Pitts (1943) showed how such model • Assume supervised training examples neurons could compute logical functions and be used to giving the desired output for a unit given a construct finite-state machines. set of known input activations. • Can be used to simulate logic gates: – AND: Let all w ji be T j / n, where n is the number of inputs. • Learn synaptic weights so that unit – OR: Let all w ji be T j produces the correct output for each – NOT: Let threshold be 0, single input with a negative weight. example. • Can build arbitrary logic circuits, sequential machines, and computers with such gates. • Perceptron uses iterative update algorithm • Given negated inputs, two layer network can compute any to learn a correct set of weights. boolean function using a two level AND-OR network. 9 10 Perceptron Learning Rule Perceptron Learning Algorithm • Update weights by: • Iteratively update weights until convergence. w = w + η t − o o ( ) ji ji j j i Initialize weights to random values where η is the “learning rate” Until outputs of all training examples are correct For each training pair, E , do: t j is the teacher specified output for unit j . Compute current output o j for E given its inputs • Equivalent to rules: Compare current output to target value, t j , for E Update synaptic weights and threshold using learning rule – If output is correct do nothing. • Each execution of the outer loop is typically – If output is high, lower weights on active inputs called an epoch . – If output is low, increase weights on active inputs • Also adjust threshold to compensate: = − η − T T t o ( ) j j j j 11 12 2

Perceptron as a Linear Separator Concept Perceptron Cannot Learn • Since perceptron uses linear threshold function, it is • Cannot learn exclusive-or, or parity searching for a linear separator that discriminates the function in general. classes. o 3 o 3 w o + w o > T 12 2 13 3 1 1 + – ?? ?? w T > − + o o 12 1 3 w 2 w 13 13 Or hyperplane in – + n -dimensional space o 2 0 o 2 1 13 14 Perceptron Convergence Perceptron Limits and Cycling Theorems • Perceptron convergence theorem : If the data is • System obviously cannot learn concepts it linearly separable and therefore a set of weights cannot represent. exist that are consistent with the data, then the • Minksy and Papert (1969) wrote a book Perceptron algorithm will eventually converge to a consistent set of weights. analyzing the perceptron and demonstrating • Perceptron cycling theorem : If the data is not many functions it could not learn. linearly separable, the Perceptron algorithm will • These results discouraged further research eventually repeat a set of weights and threshold at on neural nets; and symbolic AI became the the end of some epoch and therefore enter an infinite loop. dominate paradigm. – By checking for repeated weights+threshold, one can guarantee termination with either a positive or negative result. 15 16 Perceptron as Hill Climbing Perceptron Performance • Linear threshold functions are restrictive (high bias) but • The hypothesis space being search is a set of weights and a still reasonably expressive; more general than: threshold. – Pure conjunctive • Objective is to minimize classification error on the training set. – Pure disjunctive • Perceptron effectively does hill-climbing (gradient descent) in – M-of-N (at least M of a specified set of N features must be this space, changing the weights a small amount at each point present) to decrease training set error. • In practice, converges fairly quickly for linearly separable data. • For a single model neuron, the space is well behaved with a • Can effectively use even incompletely converged results single minima. when only a few outliers are misclassified. • Experimentally, Perceptron does quite well on many training benchmark data sets. error weights 17 18 0 3

Multi-Layer Networks Hill-Climbing in Multi-Layer Nets • Multi-layer networks can represent arbitrary functions, but • Since “greed is good” perhaps hill-climbing can be used to an effective learning algorithm for such networks was learn multi-layer networks in practice although its thought to be difficult. theoretical limits are clear. • A typical multi-layer network consists of an input, hidden • However, to do gradient descent, we need the output of a and output layer, each fully connected to the next, with unit to be a differentiable function of its input and weights. activation feeding forward. • Standard linear threshold function is not differentiable at output the threshold. activation hidden o i input • The weights determine the function computed. Given an 1 arbitrary number of hidden units, any boolean function can be computed with a single hidden layer. 0 T j net j 19 20 Differentiable Output Function Gradient Descent • Need non-linear output function to move beyond linear • Define objective to minimize error: functions. = ∑∑ − E W t o 2 ( ) ( ) kd kd – A multi-layer linear network is still linear. d ∈ D k ∈ K • Standard solution is to use the non-linear, differentiable where D is the set of training examples, K is the set of sigmoidal “logistic” function: output units, t kd and o kd are, respectively, the teacher and current output for unit k for example d . 1 • The derivative of a sigmoid unit with respect to net input is: o = 1 ∂ o j − net j T − + ( ) e j j = o − o 1 ( 1 ) ∂ j j net j 0 T j net j • Learning rule to change weights to minimize error is: ∂ E ∆ = − η w Can also use tanh or Gaussian output function ji ∂ w ji 21 22 Backpropagation Learning Rule Error Backpropagation • First calculate error of output units and use this to • Each weight changed by: ∆ w = ηδ o change the top layer of weights. ji j i Current output: o j =0.2 δ = − − o o t o j Correct output: t j = 1.0 ( 1 )( ) if is an output unit j j j j j Error δ j = o j (1– o j )( t j – o j ) output δ = − δ o o w j ( 1 ) if is a hidden unit 0.2(1–0.2)(1–0.2)=0.128 j j j k kj ∑ k Update weights into j where η is a constant called the learning rate ∆ w = ηδ o hidden ji j i t j is the correct teacher output for unit j δ j is the error measure for unit j input 23 24 4

1 Real Neural Learning Artificial Neuron Model Model network as a - PDF document

Neural Networks Analogy to biological neural systems, the most robust learning systems we know. Attempt to understand natural biological systems CS 391L: Machine Learning through computational modeling. Massive parallelism allows

Learning on Humanoid Robots Vadym Gryshchuk 19.11.2018 Outline Motivation Background

Unsupervised clustering with growing self-organizing neural network A comparison with non-neural

Pattern Analysis and Machine Intelligence Lecture Notes on Clustering (II) 2013-2014 Davide

Self-Organizing Maps Kyle Thayer Organizing Marbles Self-Organizing Maps Algorithm

Hebbian Learning Algorithms for Training Convolutional Neural Networks Gabriele Lagani Computer

Distributed Computation of Feature-Detectors for Medical Image Processing on GPGPU and Cell

Clustering I Hamid R. Rabiee Jafar Muhammadi, Nima Pourdamghani Spring 2015

NANOMATERIALS DISCOVERY Michael Fernandez | OCE-Postdoctoral Fellow September 2016

Descriptive Statistics Chapter 3 1 Summarizing Data With lots of playtesting, there is a

I can interpret my answer in terms of the question. National 5 WB 26th Feb Statistics 3, 5, 5, 8,

Markov process In the definition of a Markov process we stated that the next state only

Behavioural Preorders on Stochastic Systems - Logical, Topological, and Computational Aspects

Non-parametric duration modelling for speech synthesis with a joint model of acoustics and

Bio-PEPAd: Integrating exponential and deterministic delays Jane Hillston. LFCS and CSBE,

Urban Computing Dr. Mitra Baratchi October 5, 2020 Leiden Institute of Advanced Computer Science

On the Total Variation Distance of SMPs Giorgio Bacci, Giovanni Bacci, Kim G. Larsen, Radu

Invariant, super and quasi-martingale functions of a Markov process Lucian Beznea Simion Stoilow

Stochastic processes arising from non commutative symmetries Final conference of MADACA Domaine

Survivability Analysis of a Computer System under an Advanced Persistent Threat Attack guez

Generative Models I Ian Goodfellow, Sta ff Research Scientist, Google Brain MILA Deep Learning

On the discretization of Feynman-Kac semi-groups. Application to rare events sampling and Di ff

Stochastic Perrons Method in Linear and Nonlinear Problems Mihai S rbu, The University of

Generative and discriminative classification techniques Machine Learning and Category

Contextual Multi-armed Bandit Algorithm for Semiparametric Reward Model Gi-Soo Kim, Myunghee Cho