 
CSCE 478/878 Lecture 4: Artificial Neural Networks
Stephen D. Scott
(Adapted from Tom Mitchell’s slides)
September 24, 2008

Outline
• Threshold units: Perceptron, Winnow
• Gradient descent/exponentiated gradient
• Multilayer networks
• Backpropagation
• Advanced topics
• Support Vector Machines

Connectionist Models
Consider humans:
• Total number of neurons ≈ $10^{10}$
• Neuron switching time ≈ $10^{-3}$ second (vs. $10^{-10}$ second for computers)
• Connections per neuron ≈ $10^4$ – $10^5$
• Scene recognition time ≈ 0.1 second
• At $10^{-3}$ second per step, 0.1 second allows only about 100 sequential inference steps, which doesn’t seem like enough ⇒ much parallel computation
Properties of artificial neural nets (ANNs):
• Many neuron-like threshold switching units
• Many weighted interconnections among units
• Highly parallel, distributed process
• Emphasis on tuning weights automatically
Strong differences between ANNs for ML and ANNs for biological modeling

When to Consider Neural Networks
• Input is high-dimensional discrete- or real-valued (e.g. raw sensor input)
• Output is discrete- or real-valued
• Output is a vector of values
• Possibly noisy data
• Form of target function is unknown
• Human readability of result is unimportant
• Long training times acceptable
Examples:
• Speech phoneme recognition [Waibel]
• Image classification [Kanade, Baluja, Rowley]
• Financial prediction

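The Perceptron & Winnow
[Figure: a threshold unit with inputs $x_1, \ldots, x_n$ weighted by $w_1, \ldots, w_n$, a constant input $x_0 = 1$ weighted by $w_0$, and a summation node computing $\sum_{i=0}^{n} w_i x_i$; the output is $o = 1$ if $\sum_{i=0}^{n} w_i x_i > 0$ and $-1$ otherwise]
$$o(x_1, \ldots, x_n) = \begin{cases} +1 & \text{if } w_0 + w_1 x_1 + \cdots + w_n x_n > 0 \\ -1 & \text{otherwise} \end{cases}$$
(sometimes use 0 instead of −1)
Sometimes we’ll use simpler vector notation:
$$o(\vec{x}) = \begin{cases} +1 & \text{if } \vec{w} \cdot \vec{x} > 0 \\ -1 & \text{otherwise} \end{cases}$$
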
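The threshold rule above can be sketched in a few lines of Python. This is a minimal illustration, not code from the lecture: the function name `perceptron_output` and the example weights are assumptions chosen for the demonstration.

```python
def perceptron_output(w, x):
    """Thresholded output o(x) of a single unit.

    w: weights [w0, w1, ..., wn]; w0 multiplies the constant input x0 = 1.
    x: inputs [x1, ..., xn].
    """
    s = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
    return 1 if s > 0 else -1


# Illustrative weights only (hypothetical, not taken from the slides).
print(perceptron_output([-1.0, 2.0, 0.5], [1.0, 0.0]))  # weighted sum is 1.0
print(perceptron_output([-1.0, 2.0, 0.5], [0.0, 1.0]))  # weighted sum is -0.5
```

Note that a weighted sum of exactly 0 falls in the "otherwise" branch, matching the strict inequality in the definition.
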
Decision Surface of Perceptron/Winnow
[Figure: two panels, (a) and (b), each plotting + and − examples in the $(x_1, x_2)$ plane]
Represents some useful functions
• What weights represent $g(x_1, x_2) = \mathrm{AND}(x_1, x_2)$?
But some functions not representable
• I.e. those not linearly separable
• Therefore, we’ll want networks of neurons

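One possible answer to the AND question can be checked numerically. The sketch below assumes inputs in {0, 1} and outputs in {−1, +1}; the weights $w_0 = -1.5$, $w_1 = w_2 = 1$ are one choice among many, and the names `perceptron_output` and `AND_WEIGHTS` are hypothetical.

```python
def perceptron_output(w, x):
    """Return +1 if w0 + w1*x1 + ... + wn*xn > 0, else -1."""
    s = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
    return 1 if s > 0 else -1


# Assumed answer: only (1, 1) pushes the weighted sum above zero.
AND_WEIGHTS = [-1.5, 1.0, 1.0]

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
outputs = [perceptron_output(AND_WEIGHTS, x) for x in inputs]
for x, o in zip(inputs, outputs):
    print(x, o)
```
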
Perceptron Training Rule
$$w_i \leftarrow w_i + \Delta w_i^{add}, \quad \text{where } \Delta w_i^{add} = \eta \, (t - o) \, x_i$$
and
• $t = c(\vec{x})$ is target value
• $o$ is perceptron output
• $\eta$ is small constant (e.g. 0.1) called learning rate
I.e. if $(t - o) > 0$ then increase $w_i$ w.r.t. $x_i$, else decrease
Can prove rule will converge if training data is linearly separable and $\eta$ sufficiently small

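A minimal sketch of the training rule follows, assuming targets in {−1, +1}, zero initial weights, and repeated passes over the data until a pass makes no mistakes. The function name `train_perceptron`, the `max_epochs` cap, and the AND training set are assumptions for illustration.

```python
def train_perceptron(examples, eta=0.1, max_epochs=100):
    """Apply w_i <- w_i + eta * (t - o) * x_i until an epoch has no mistakes.

    examples: list of (x, t) pairs with x = [x1, ..., xn] and t in {-1, +1}.
    Returns weights [w0, w1, ..., wn]; w0 multiplies the constant input x0 = 1.
    """
    n = len(examples[0][0])
    w = [0.0] * (n + 1)
    for _ in range(max_epochs):
        mistakes = 0
        for x, t in examples:
            xs = [1.0] + list(x)  # prepend x0 = 1
            o = 1 if sum(wi * xi for wi, xi in zip(w, xs)) > 0 else -1
            if o != t:
                mistakes += 1
                for i in range(n + 1):
                    w[i] += eta * (t - o) * xs[i]
        if mistakes == 0:
            break
    return w


def predict(w, x):
    """Thresholded output for inputs x using trained weights w."""
    return 1 if w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)) > 0 else -1


# AND with targets in {-1, +1}: a linearly separable training set.
and_data = [([0, 0], -1), ([0, 1], -1), ([1, 0], -1), ([1, 1], 1)]
w = train_perceptron(and_data)
print([predict(w, x) for x, _ in and_data])
```

When $o = t$ the update term is zero, so the explicit `if o != t` check only avoids redundant work.
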
Winnow Training Rule
$$w_i \leftarrow w_i \cdot \Delta w_i^{mult}, \quad \text{where } \Delta w_i^{mult} = \alpha^{(t - o) x_i}$$
and $\alpha > 1$
I.e. use multiplicative updates vs. additive updates
Problem: Sometimes negative weights are required
• Maintain two weight vectors $\vec{w}^+$ and $\vec{w}^-$ and replace $\vec{w} \cdot \vec{x}$ with $(\vec{w}^+ - \vec{w}^-) \cdot \vec{x}$
• Update $\vec{w}^+$ and $\vec{w}^-$ independently as above, using $\Delta w_i^+ = \alpha^{(t - o) x_i}$ and $\Delta w_i^- = 1 / \Delta w_i^+$
Can also guarantee convergence

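The two-vector multiplicative update can be sketched as a single step. This assumes targets and outputs in {−1, +1} and all weights initialized to 1; the function names `winnow_output` and `winnow_update` and the choice $\alpha = 2$ are illustrative assumptions.

```python
def winnow_output(w_pos, w_neg, x):
    """Threshold (w+ - w-) . x at zero, returning +1 or -1."""
    s = sum((p - q) * xi for p, q, xi in zip(w_pos, w_neg, x))
    return 1 if s > 0 else -1


def winnow_update(w_pos, w_neg, x, t, alpha=2.0):
    """One multiplicative step: w+_i *= alpha^((t-o) x_i), w-_i /= the same factor."""
    o = winnow_output(w_pos, w_neg, x)
    factors = [alpha ** ((t - o) * xi) for xi in x]
    new_pos = [p * f for p, f in zip(w_pos, factors)]
    new_neg = [q / f for q, f in zip(w_neg, factors)]
    return new_pos, new_neg


# Both vectors start at 1, so the initial output on any input is -1.
w_pos, w_neg = [1.0, 1.0], [1.0, 1.0]
w_pos, w_neg = winnow_update(w_pos, w_neg, [1, 0], t=1)
print(w_pos, w_neg)
```

Only the weight on the active attribute ($x_i = 1$) changes; the weight on the inactive attribute is multiplied by $\alpha^0 = 1$.
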
Perceptron vs. Winnow
Winnow works well when most attributes irrelevant, i.e. when optimal weight vector $\vec{w}^*$ is sparse (many 0 entries)
E.g. let examples $\vec{x} \in \{0, 1\}^n$ be labeled by a $k$-disjunction over $n$ attributes, $k \ll n$
• Remaining $n - k$ are irrelevant
• E.g. $c(x_1, \ldots, x_{150}) = x_5 \vee x_9 \vee \neg x_{12}$, $n = 150$, $k = 3$
• For disjunctions, number of prediction mistakes (in on-line model) is $O(k \log n)$ for Winnow and (in worst case) $\Omega(kn)$ for Perceptron
• So in worst case, need exponentially fewer updates for learning with Winnow than Perceptron
Bound is only for disjunctions, but improvement for learning with irrelevant attributes is often true
When $\vec{w}^*$ not sparse, sometimes Perceptron better
Also, have proofs for agnostic error bounds for both algorithms

Gradient Descent and Exponentiated Gradient
• Useful when linear separability impossible but still want to minimize training error
• Consider simpler linear unit, where $o = w_0 + w_1 x_1 + \cdots + w_n x_n$ (i.e. no threshold)
• For the moment, assume that we update weights after seeing each example $\vec{x}_d$
• For each example, want to compromise between correctiveness and conservativeness
– Correctiveness: Tendency to improve on $\vec{x}_d$ (reduce error)
– Conservativeness: Tendency to keep $\vec{w}_{d+1}$ close to $\vec{w}_d$ (minimize distance)
• Use cost function that measures both:
$$U(\vec{w}) = \mathrm{dist}\left(\vec{w}_{d+1}, \vec{w}_d\right) + \eta \; \mathrm{error}\Big(t_d, \underbrace{\vec{w}_{d+1} \cdot \vec{x}_d}_{\text{curr ex, new wts}}\Big)$$

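Gradient Descent and Exponentiated Gradient (cont’d)
[Figure: surface plot of the error $E[\vec{w}]$ as a function of the weights $w_0$ and $w_1$]
$$\frac{\partial U}{\partial \vec{w}} = \left[ \frac{\partial U}{\partial w_0}, \frac{\partial U}{\partial w_1}, \cdots, \frac{\partial U}{\partial w_n} \right]$$
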
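Gradient Descent
$$U(\vec{w}) = \underbrace{\left\| \vec{w}_{d+1} - \vec{w}_d \right\|_2^2}_{\text{conserv.}} + \underbrace{\eta}_{\text{coef.}} \underbrace{\left( t_d - \vec{w}_{d+1} \cdot \vec{x}_d \right)^2}_{\text{corrective}}$$
$$= \sum_{i=1}^{n} \left( w_{i,d+1} - w_{i,d} \right)^2 + \eta \left( t_d - \sum_{i=1}^{n} w_{i,d+1} \, x_{i,d} \right)^2$$
Take gradient w.r.t. $\vec{w}_{d+1}$ and set to $\vec{0}$:
$$0 = 2 \left( w_{i,d+1} - w_{i,d} \right) - 2 \eta \left( t_d - \sum_{i=1}^{n} w_{i,d+1} \, x_{i,d} \right) x_{i,d}$$
Approximate with
$$0 = 2 \left( w_{i,d+1} - w_{i,d} \right) - 2 \eta \left( t_d - \sum_{i=1}^{n} w_{i,d} \, x_{i,d} \right) x_{i,d},$$
which yields
$$w_{i,d+1} = w_{i,d} + \underbrace{\eta \left( t_d - o_d \right) x_{i,d}}_{\Delta w_{i,d}^{add}}$$
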
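The resulting additive update can be sketched for a single example as follows. This is an illustration under the assumption of an unthresholded linear unit $o_d = \vec{w}_d \cdot \vec{x}_d$; the function name `gd_update` and the numbers are hypothetical. If a bias weight is wanted, a constant 1 can be included in `x`.

```python
def gd_update(w, x, t, eta):
    """One gradient descent step: w_i <- w_i + eta * (t - o) * x_i.

    o is the unthresholded output w . x computed with the current weights.
    """
    o = sum(wi * xi for wi, xi in zip(w, x))
    return [wi + eta * (t - o) * xi for wi, xi in zip(w, x)]


# Illustrative numbers: o = 0, so each weight moves by eta * t * x_i.
w_next = gd_update([0.0, 0.0], [1.0, 2.0], t=1.0, eta=0.5)
print(w_next)
```
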
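Exponentiated Gradient
Conserv. portion uses unnormalized relative entropy:
$$U(\vec{w}) = \underbrace{\sum_{i=1}^{n} \left( w_{i,d} - w_{i,d+1} + w_{i,d+1} \ln \frac{w_{i,d+1}}{w_{i,d}} \right)}_{\text{conserv.}} + \underbrace{\eta}_{\text{coef.}} \underbrace{\left( t_d - \vec{w}_{d+1} \cdot \vec{x}_d \right)^2}_{\text{corrective}}$$
Take gradient w.r.t. $\vec{w}_{d+1}$ and set to $\vec{0}$:
$$0 = \ln \frac{w_{i,d+1}}{w_{i,d}} - 2 \eta \left( t_d - \sum_{i=1}^{n} w_{i,d+1} \, x_{i,d} \right) x_{i,d}$$
Approximate with
$$0 = \ln \frac{w_{i,d+1}}{w_{i,d}} - 2 \eta \left( t_d - \sum_{i=1}^{n} w_{i,d} \, x_{i,d} \right) x_{i,d},$$
which yields (for $\eta = \ln \alpha / 2$)
$$w_{i,d+1} = w_{i,d} \exp\left( 2 \eta \left( t_d - o_d \right) x_{i,d} \right) = w_{i,d} \underbrace{\alpha^{\left( t_d - o_d \right) x_{i,d}}}_{\Delta w_{i,d}^{mult}}$$
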
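The multiplicative update and its equivalence to the $\alpha$ form can be checked numerically. The sketch assumes an unthresholded output $o_d = \vec{w}_d \cdot \vec{x}_d$ and positive weights; the function name `eg_update` and the numbers are hypothetical.

```python
import math


def eg_update(w, x, t, eta):
    """One EG step: w_i <- w_i * exp(2 * eta * (t - o) * x_i)."""
    o = sum(wi * xi for wi, xi in zip(w, x))
    return [wi * math.exp(2.0 * eta * (t - o) * xi) for wi, xi in zip(w, x)]


# Illustrative numbers: o = 1 and t = 2, so (t - o) = 1.
w, x, t, eta = [1.0, 1.0], [1.0, 0.0], 2.0, 0.5
w_next = eg_update(w, x, t, eta)

# Same step written with alpha = exp(2 * eta), i.e. eta = ln(alpha) / 2.
alpha = math.exp(2.0 * eta)
o = sum(wi * xi for wi, xi in zip(w, x))
w_alpha = [wi * alpha ** ((t - o) * xi) for wi, xi in zip(w, x)]
print(w_next, w_alpha)
```
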
Implementation Approaches
• Can use rules on previous slides on an example-by-example basis, sometimes called incremental, stochastic, or on-line GD/EG
– Has a tendency to “jump around” more in searching, which helps avoid getting trapped in local minima
• Alternatively, can use standard or batch GD/EG, in which the classifier is evaluated over all training examples, summing the error, and then updates are made
– I.e. sum up $\Delta w_i$ for all examples, but don’t update $w_i$ until summation complete (p. 93, Table 4.1)
– This is an inherent averaging process and tends to give better estimate of the gradient

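The difference between the two approaches can be sketched for the additive (GD) rule with a linear unit. The function names `incremental_epoch` and `batch_epoch` and the toy data are assumptions for illustration; the same structure would apply to EG with multiplicative updates.

```python
def dot(w, x):
    """Unthresholded linear output w . x."""
    return sum(wi * xi for wi, xi in zip(w, x))


def incremental_epoch(w, examples, eta):
    """Update the weights immediately after each example."""
    w = list(w)
    for x, t in examples:
        o = dot(w, x)  # uses weights already changed by earlier examples
        w = [wi + eta * (t - o) * xi for wi, xi in zip(w, x)]
    return w


def batch_epoch(w, examples, eta):
    """Sum the updates over all examples, then apply them once."""
    delta = [0.0] * len(w)
    for x, t in examples:
        o = dot(w, x)  # weights stay fixed during the whole pass
        delta = [di + eta * (t - o) * xi for di, xi in zip(delta, x)]
    return [wi + di for wi, di in zip(w, delta)]


# Toy data: one attribute, two examples with different targets.
data = [([1.0], 1.0), ([1.0], 3.0)]
w_inc = incremental_epoch([0.0], data, eta=0.5)
w_bat = batch_epoch([0.0], data, eta=0.5)
print(w_inc, w_bat)
```

The two passes end at different weights because the incremental version evaluates the second example with weights already moved by the first.
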
Remarks
• Perceptron and Winnow update weights based on thresholded output, while GD and EG use unthresholded outputs
• P/W converge in finite number of steps to perfect hypothesis if data linearly separable; GD/EG work on non-linearly separable data, but only converge asymptotically (to weights with minimum squared error)
• As with P vs. W, EG tends to work better than GD when many attributes are irrelevant
– Allows the addition of attributes that are nonlinear combinations of original ones, to work around linear sep. problem (perhaps get linear separability in new, higher-dimensional space)
– E.g. if two attributes are $x_1$ and $x_2$, use as EG inputs $\vec{x} = \left[ x_1, x_2, x_1 x_2, x_1^2, x_2^2 \right]$
• Also, both have provable agnostic results

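The attribute expansion can be sketched as below, together with a check that XOR on {0, 1} inputs becomes linearly separable in the expanded space. The function name `expand`, the hand-picked weights, and the bias of −0.5 are assumptions for illustration, not learned by EG.

```python
def expand(x1, x2):
    """Map (x1, x2) to [x1, x2, x1*x2, x1^2, x2^2]."""
    return [x1, x2, x1 * x2, x1 ** 2, x2 ** 2]


# Hand-picked separator in the expanded space: x1 + x2 - 2*x1*x2 - 0.5 > 0.
BIAS = -0.5
WEIGHTS = [1.0, 1.0, -2.0, 0.0, 0.0]


def classify(x1, x2):
    """Return 1 if the expanded point lies on the positive side, else 0."""
    s = BIAS + sum(wi * zi for wi, zi in zip(WEIGHTS, expand(x1, x2)))
    return 1 if s > 0 else 0


for point in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(point, classify(*point))
```
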
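Handling Nonlinearly Separable Data
The XOR Problem
[Figure: the points A: (0,0) and D: (1,1) (neg) and B: (0,1) and C: (1,0) (pos) in the $(x_1, x_2)$ plane, with the two lines $g_1(\vec{x}) = 0$ and $g_2(\vec{x}) = 0$ and the $> 0$ and $< 0$ side of each marked]
• Can’t represent with a single linear separator, but can with intersection of two:
$$g_1(\vec{x}) = 1 \cdot x_1 + 1 \cdot x_2 - 1/2$$
$$g_2(\vec{x}) = 1 \cdot x_1 + 1 \cdot x_2 - 3/2$$
$$\text{pos} = \left\{ \vec{x} \in \Re^{\ell} : g_1(\vec{x}) > 0 \text{ AND } g_2(\vec{x}) < 0 \right\}$$
$$\text{neg} = \left\{ \vec{x} \in \Re^{\ell} : g_1(\vec{x}), g_2(\vec{x}) < 0 \text{ OR } g_1(\vec{x}), g_2(\vec{x}) > 0 \right\}$$
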
The XOR Problem (cont’d)
• Let $y_i = \begin{cases} 0 & \text{if } g_i(\vec{x}) < 0 \\ 1 & \text{otherwise} \end{cases}$

Class   (x1, x2)     g1(x)   y1   g2(x)   y2
pos     B: (0, 1)     1/2    1    −1/2    0
pos     C: (1, 0)     1/2    1    −1/2    0
neg     A: (0, 0)    −1/2    0    −3/2    0
neg     D: (1, 1)     3/2    1     1/2    1

• Now feed $y_1, y_2$ into:
$$g(\vec{y}) = 1 \cdot y_1 - 2 \cdot y_2 - 1/2$$
[Figure: the remapped points in the $(y_1, y_2)$ plane. A maps to (0,0) and D to (1,1), both neg, on the $g(\vec{y}) < 0$ side; B and C both map to (1,0), pos, on the $g(\vec{y}) > 0$ side]

The XOR Problem (cont’d)
• In other words, we remapped all vectors $\vec{x}$ to $\vec{y}$ such that the classes are linearly separable in the new vector space
[Figure: network with an input layer ($x_1$, $x_2$), a hidden layer of two units, and one output unit. Hidden unit 3 computes $\sum_i w_{3i} x_i$ with $w_{30} = -1/2$, $w_{31} = 1$, $w_{32} = 1$ and outputs $y_1$; hidden unit 4 computes $\sum_i w_{4i} x_i$ with $w_{40} = -3/2$, $w_{41} = 1$, $w_{42} = 1$ and outputs $y_2$; output unit 5 computes $\sum_i w_{5i} y_i$ with $w_{50} = -1/2$, $w_{53} = 1$, $w_{54} = -2$]
• This is a two-layer perceptron or two-layer feedforward neural network
• Each neuron outputs 1 if its weighted sum exceeds its threshold, 0 otherwise

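The two-layer network can be sketched directly from the weights given for $g_1$, $g_2$, and $g$. The function names `step` and `xor_net` are hypothetical; the sketch treats each $w_{j0}$ as a bias added to the weighted sum and outputs 1 when the total is strictly positive.

```python
def step(s):
    """Output 1 if the weighted sum (including bias) is positive, else 0."""
    return 1 if s > 0 else 0


# Weights from the slides: w_j0 is the bias of unit j.
W30, W31, W32 = -0.5, 1.0, 1.0   # hidden unit 3, computes y1 from g1
W40, W41, W42 = -1.5, 1.0, 1.0   # hidden unit 4, computes y2 from g2
W50, W53, W54 = -0.5, 1.0, -2.0  # output unit 5, computes g(y)


def xor_net(x1, x2):
    """Forward pass: inputs -> hidden outputs (y1, y2) -> class (1 = pos)."""
    y1 = step(W30 + W31 * x1 + W32 * x2)
    y2 = step(W40 + W41 * x1 + W42 * x2)
    return step(W50 + W53 * y1 + W54 * y2)


for name, (x1, x2) in [("A", (0, 0)), ("B", (0, 1)), ("C", (1, 0)), ("D", (1, 1))]:
    print(name, (x1, x2), xor_net(x1, x2))
```
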
Generally Handling Nonlinearly Separable Data
• By adding up to 2 hidden layers of perceptrons, can represent any union of intersections of halfspaces
[Figure: a plane divided by several lines into regions labeled pos and neg]
• Problem: The above is still defined linearly