Artificial Neural Networks and Deep Learning


Christian Borgelt, Dept. of Mathematics / Dept. of Computer Sciences, Paris Lodron University of Salzburg

Textbooks: This lecture follows the first parts of these books fairly closely.


Threshold Logic Units: Limitations

The biimplication problem x1 ↔ x2: There is no separating line.

x1 x2 | y
 0  0 | 1
 1  0 | 0
 0  1 | 0
 1  1 | 1

Formal proof by reductio ad absurdum:

since (0, 0) ↦ 1:  0 ≥ θ,          (1)
since (1, 0) ↦ 0:  w1 < θ,         (2)
since (0, 1) ↦ 0:  w2 < θ,         (3)
since (1, 1) ↦ 1:  w1 + w2 ≥ θ.    (4)

(2) and (3): w1 + w2 < 2θ. With (4): 2θ > θ, that is, θ > 0. Contradiction to (1).

(Figure: visualization of 3-dimensional Boolean functions, and a threshold logic unit for (x1 ∧ x2) ∨ (x1 ∧ x3) ∨ (x2 ∧ x3).)

Linear Separability

Definition: Two sets of points in a Euclidean space are called linearly separable iff there exists at least one point, line, plane or hyperplane (depending on the dimension of the Euclidean space), such that all points of the one set lie on one side and all points of the other set lie on the other side of this point, line, plane or hyperplane (or on it). That is, the point sets can be separated by a linear decision function. Formally: Two sets X, Y ⊂ ℝ^m are linearly separable iff w ∈ ℝ^m and θ ∈ ℝ exist such that

∀x ∈ X: w⊤x < θ   and   ∀y ∈ Y: w⊤y ≥ θ.

• Boolean functions define two point sets, namely the set of points that are mapped to the function value 0 and the set of points that are mapped to 1.
⇒ The term "linearly separable" can be transferred to Boolean functions.
• As we have seen, conjunction and implication are linearly separable (as are disjunction, NAND, NOR etc.).
• The biimplication is not linearly separable (and neither is the exclusive or (XOR)).

Definition: A set of points in a Euclidean space is called convex if it is non-empty and connected (that is, if it is a region) and for every pair of points in it every point on the straight line segment connecting the points of the pair is also in the set.

Definition: The convex hull of a set of points X in a Euclidean space is the smallest convex set of points that contains X. Alternatively, the convex hull of a set of points X is the intersection of all convex sets that contain X.

Theorem: Two sets of points in a Euclidean space are linearly separable if and only if their convex hulls are disjoint (that is, have no point in common).

• For the biimplication problem, the convex hulls are the diagonal line segments.
• They share their intersection point and are thus not disjoint.
• Therefore the biimplication is not linearly separable.
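The (non-)separability of such small Boolean functions can also be checked mechanically. The following is a minimal sketch (not from the lecture), assuming that for 2-input functions it suffices to search integer weights and thresholds in a small grid:

```python
import itertools

def linearly_separable(truth_table, grid=range(-3, 4)):
    """Brute-force search for integer weights w and a threshold theta
    such that w.x >= theta exactly for the inputs mapped to 1."""
    n = len(next(iter(truth_table)))          # number of inputs
    for *w, theta in itertools.product(grid, repeat=n + 1):
        if all((sum(wi * xi for wi, xi in zip(w, x)) >= theta) == bool(y)
               for x, y in truth_table.items()):
            return w, theta
    return None

conj  = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 1}   # x1 AND x2
biimp = {(0, 0): 1, (0, 1): 0, (1, 0): 0, (1, 1): 1}   # x1 <-> x2

print(linearly_separable(conj))    # a separating (w, theta), e.g. ([1, 1], 2)
print(linearly_separable(biimp))   # None: not linearly separable
```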

Threshold Logic Units: Limitations

Total number and number of linearly separable Boolean functions (On-Line Encyclopedia of Integer Sequences, oeis.org, A001146 and A000609):

inputs |          Boolean functions | linearly separable functions
     1 |                          4 |                4
     2 |                         16 |               14
     3 |                        256 |              104
     4 |                     65,536 |            1,882
     5 |              4,294,967,296 |           94,572
     6 | 18,446,744,073,709,551,616 |       15,028,134
     n |                    2^(2^n) | no general formula known

• For many inputs a threshold logic unit can compute almost no functions.
• Networks of threshold logic units are needed to overcome the limitations.

Networks of Threshold Logic Units

Solving the biimplication problem with a network.

Idea: logical decomposition   x1 ↔ x2 ≡ (x1 → x2) ∧ (x2 → x1)

(Figure: a first unit with the weights −2 (from x1) and +2 (from x2) and threshold −1 computes y1 = x1 → x2; a second unit with the weights +2 (from x1) and −2 (from x2) and threshold −1 computes y2 = x2 → x1; an output unit with the weights 2 and 2 and threshold 3 computes y = y1 ∧ y2 = x1 ↔ x2.)

Solving the biimplication problem: geometric interpretation

• The first layer computes new Boolean coordinates for the points.
• After the coordinate transformation the problem is linearly separable.

Representing Arbitrary Boolean Functions

Algorithm: Let y = f(x1, ..., xn) be a Boolean function of n variables.

(i) Represent the given function f(x1, ..., xn) in disjunctive normal form. That is, determine D_f = C1 ∨ ... ∨ Cm, where all Cj are conjunctions of n literals, that is, Cj = l_j1 ∧ ... ∧ l_jn with l_ji = x_i (positive literal) or l_ji = ¬x_i (negative literal).

(ii) Create a neuron for each conjunction Cj of the disjunctive normal form (having n inputs, one input for each variable), where

w_ji = +2 if l_ji = x_i,  w_ji = −2 if l_ji = ¬x_i,   and   θ_j = n − 1 + (1/2) ∑_{i=1}^n w_ji.

(iii) Create an output neuron (having m inputs, one input for each neuron that was created in step (ii)), where

w_(n+1)k = 2, k = 1, ..., m,   and   θ_(n+1) = 1.

Remark: The weights are set to ±2 instead of ±1 in order to ensure integer thresholds.
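The construction translates directly into code. A minimal sketch (not from the slides) that builds and evaluates such a network, here applied to the ternary example function of the next slides:

```python
from itertools import product

def dnf_network(n, true_rows):
    """One threshold logic unit per conjunction (weights +/-2 as in
    step (ii)), plus an output unit computing their disjunction."""
    layer1 = []
    for row in true_rows:                      # one neuron per conjunction
        w = [2 if bit else -2 for bit in row]  # +2 for x_i, -2 for not x_i
        theta = n - 1 + sum(w) / 2             # theta_j = n-1 + (1/2) sum w_ji
        layer1.append((w, theta))
    output = ([2] * len(true_rows), 1)         # disjunction neuron
    return layer1, output

def evaluate(network, x):
    layer1, (w_out, theta_out) = network
    hidden = [int(sum(wi * xi for wi, xi in zip(w, x)) >= th)
              for w, th in layer1]
    return int(sum(wi * hi for wi, hi in zip(w_out, hidden)) >= theta_out)

# y = 1 exactly for the inputs (1,0,0), (0,1,1) and (1,1,1):
net = dnf_network(3, [(1, 0, 0), (0, 1, 1), (1, 1, 1)])
for x in product((0, 1), repeat=3):
    print(x, evaluate(net, x))
```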

Representing Arbitrary Boolean Functions

Example: ternary Boolean function:

x1 x2 x3 | y | Cj
 0  0  0 | 0 |
 1  0  0 | 1 | x1 ∧ ¬x2 ∧ ¬x3
 0  1  0 | 0 |
 1  1  0 | 0 |
 0  0  1 | 0 |
 1  0  1 | 0 |
 0  1  1 | 1 | ¬x1 ∧ x2 ∧ x3
 1  1  1 | 1 | x1 ∧ x2 ∧ x3

D_f = C1 ∨ C2 ∨ C3 with C1 = x1 ∧ ¬x2 ∧ ¬x3, C2 = ¬x1 ∧ x2 ∧ x3, C3 = x1 ∧ x2 ∧ x3.

One conjunction for each row where the output y is 1, with literals according to the input values.

Resulting network of threshold logic units:

• First layer (conjunctions): the neuron for C1 receives the weights (+2, −2, −2) and the threshold 1, the neuron for C2 the weights (−2, +2, +2) and the threshold 3, and the neuron for C3 the weights (+2, +2, +2) and the threshold 5.
• Second layer (disjunction): the output neuron receives the weight 2 from each conjunction neuron and has the threshold 1; it computes D_f = C1 ∨ C2 ∨ C3.

Reminder: Convex Hull Theorem

Theorem: Two sets of points in a Euclidean space are linearly separable if and only if their convex hulls are disjoint (that is, have no point in common).

Example function on the preceding slide:

y = f(x1, x2, x3) = (x1 ∧ ¬x2 ∧ ¬x3) ∨ (¬x1 ∧ x2 ∧ x3) ∨ (x1 ∧ x2 ∧ x3)

(Figure: convex hull of the points with y = 0 and convex hull of the points with y = 1.)

• The convex hulls of the two point sets are not disjoint (red: intersection).
• Therefore the function y = f(x1, x2, x3) is not linearly separable.

Training Threshold Logic Units

• Geometric interpretation provides a way to construct threshold logic units with 2 and 3 inputs, but:
  ◦ it is not an automatic method (human visualization is needed),
  ◦ it is not feasible for more than 3 inputs.
• General idea of automatic training:
  ◦ Start with random values for weights and threshold.
  ◦ Determine the error of the output for a set of training patterns.
  ◦ The error is a function of the weights and the threshold: e = e(w1, ..., wn, θ).
  ◦ Adapt weights and threshold so that the error becomes smaller.
  ◦ Iterate the adaptation until the error vanishes.

Single-input threshold logic unit for the negation ¬x:

x | y
0 | 1
1 | 0

(Figure: output error as a function of weight and threshold; error for x = 0, error for x = 1, sum of errors.)

• The error function cannot be used directly, because it consists of plateaus.
• Solution: If the computed output is wrong, take into account how far the weighted sum is from the threshold (that is, consider "how wrong" the relation of weighted sum and threshold is).

(Figure: modified output error as a function of weight and threshold; error for x = 0, error for x = 1, sum of errors.)

Schemata of the resulting directions of parameter changes (changes for x = 0, changes for x = 1, sum of changes):

• Start at a random point.
• Iteratively adapt the parameters according to the direction corresponding to the current point.
• Stop if the error vanishes.

Training Threshold Logic Units: Delta Rule

Formal training rule: Let x = (x1, ..., xn)⊤ be an input vector of a threshold logic unit, o the desired output for this input vector and y the actual output of the threshold logic unit. If y ≠ o, then the threshold θ and the weight vector w = (w1, ..., wn)⊤ are adapted as follows in order to reduce the error:

θ(new) = θ(old) + Δθ   with   Δθ = −η(o − y),
∀i ∈ {1, ..., n}: wi(new) = wi(old) + Δwi   with   Δwi = η(o − y)xi,

where η is a parameter that is called learning rate. It determines the severity of the weight changes. This procedure is called Delta Rule or Widrow–Hoff Procedure [Widrow and Hoff 1960].

• Online training: adapt the parameters after each training pattern.
• Batch training: adapt the parameters only at the end of each epoch, that is, after a traversal of all training patterns.

procedure online training (var w, var θ, L, η);
var y, e;                          (* output, sum of errors *)
begin
  repeat                           (* training loop *)
    e := 0;                        (* initialize the error sum *)
    for all (x, o) ∈ L do begin    (* traverse the patterns *)
      if (w⊤x ≥ θ) then y := 1;    (* compute the output *)
                   else y := 0;    (* of the threshold logic unit *)
      if (y ≠ o) then begin        (* if the output is wrong *)
        θ := θ − η(o − y);         (* adapt the threshold *)
        w := w + η(o − y)x;        (* and the weights *)
        e := e + |o − y|;          (* sum the errors *)
      end;
    end;
  until (e ≤ 0);                   (* repeat until the error vanishes *)
end;

procedure batch training (var w, var θ, L, η);
var y, e, θc, wc;                  (* output, sum of errors, sums of changes *)
begin
  repeat                           (* training loop *)
    e := 0; θc := 0; wc := 0;      (* initializations *)
    for all (x, o) ∈ L do begin    (* traverse the patterns *)
      if (w⊤x ≥ θ) then y := 1;    (* compute the output *)
                   else y := 0;    (* of the threshold logic unit *)
      if (y ≠ o) then begin        (* if the output is wrong *)
        θc := θc − η(o − y);       (* sum the changes of the *)
        wc := wc + η(o − y)x;      (* threshold and the weights *)
        e := e + |o − y|;          (* sum the errors *)
      end;
    end;
    θ := θ + θc;                   (* adapt the threshold *)
    w := w + wc;                   (* and the weights *)
  until (e ≤ 0);                   (* repeat until the error vanishes *)
end;

Training Threshold Logic Units: Online

Online training of a threshold logic unit for the negation ¬x (learning rate η = 1, initial values θ = 1.5, w = 2):

epoch | x | o | xw−θ | y | e | Δθ | Δw |   θ |  w
      |   |   |      |   |   |    |    | 1.5 |  2
  1   | 0 | 1 | −1.5 | 0 | 1 | −1 |  0 | 0.5 |  2
      | 1 | 0 |  1.5 | 1 | 1 |  1 | −1 | 1.5 |  1
  2   | 0 | 1 | −1.5 | 0 | 1 | −1 |  0 | 0.5 |  1
      | 1 | 0 |  0.5 | 1 | 1 |  1 | −1 | 1.5 |  0
  3   | 0 | 1 | −1.5 | 0 | 1 | −1 |  0 | 0.5 |  0
      | 1 | 0 | −0.5 | 0 | 0 |  0 |  0 | 0.5 |  0
  4   | 0 | 1 | −0.5 | 0 | 1 | −1 |  0 | −0.5 |  0
      | 1 | 0 |  0.5 | 1 | 1 |  1 | −1 | 0.5 | −1
  5   | 0 | 1 | −0.5 | 0 | 1 | −1 |  0 | −0.5 | −1
      | 1 | 0 | −0.5 | 0 | 0 |  0 |  0 | −0.5 | −1
  6   | 0 | 1 |  0.5 | 1 | 0 |  0 |  0 | −0.5 | −1
      | 1 | 0 | −0.5 | 0 | 0 |  0 |  0 | −0.5 | −1
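The two procedures translate directly into Python. A sketch (not from the lecture) that reproduces the online-training table above:

```python
def online_training(w, theta, patterns, eta):
    """Delta rule, online variant: adapt after every pattern.
    Terminates only if the patterns are linearly separable."""
    while True:
        error = 0
        for x, o in patterns:
            y = int(sum(wi * xi for wi, xi in zip(w, x)) >= theta)
            if y != o:
                theta -= eta * (o - y)                       # adapt threshold
                w = [wi + eta * (o - y) * xi for wi, xi in zip(w, x)]
                error += abs(o - y)
        if error == 0:
            return w, theta

def batch_training(w, theta, patterns, eta):
    """Delta rule, batch variant: collect the changes, apply per epoch."""
    while True:
        error, dtheta, dw = 0, 0.0, [0.0] * len(w)
        for x, o in patterns:
            y = int(sum(wi * xi for wi, xi in zip(w, x)) >= theta)
            if y != o:
                dtheta -= eta * (o - y)
                dw = [di + eta * (o - y) * xi for di, xi in zip(dw, x)]
                error += abs(o - y)
        theta += dtheta
        w = [wi + dwi for wi, dwi in zip(w, dw)]
        if error == 0:
            return w, theta

# Negation example from the table above: start at theta = 1.5, w = 2.
print(online_training([2.0], 1.5, [((0,), 1), ((1,), 0)], eta=1.0))
# -> ([-1.0], -0.5)
```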

Training Threshold Logic Units: Batch

Batch training of a threshold logic unit for the negation ¬x (learning rate η = 1, initial values θ = 1.5, w = 2; the changes Δθ and Δw are summed over each epoch and applied only at its end, so xw−θ is computed with the parameters from the start of the epoch):

epoch | x | o | xw−θ | y | e | Δθ | Δw |   θ |  w
      |   |   |      |   |   |    |    | 1.5 |  2
  1   | 0 | 1 | −1.5 | 0 | 1 | −1 |  0 |     |
      | 1 | 0 |  0.5 | 1 | 1 |  1 | −1 | 1.5 |  1
  2   | 0 | 1 | −1.5 | 0 | 1 | −1 |  0 |     |
      | 1 | 0 | −0.5 | 0 | 0 |  0 |  0 | 0.5 |  1
  3   | 0 | 1 | −0.5 | 0 | 1 | −1 |  0 |     |
      | 1 | 0 |  0.5 | 1 | 1 |  1 | −1 | 0.5 |  0
  4   | 0 | 1 | −0.5 | 0 | 1 | −1 |  0 |     |
      | 1 | 0 | −0.5 | 0 | 0 |  0 |  0 | −0.5 |  0
  5   | 0 | 1 |  0.5 | 1 | 0 |  0 |  0 |     |
      | 1 | 0 |  0.5 | 1 | 1 |  1 | −1 | 0.5 | −1
  6   | 0 | 1 | −0.5 | 0 | 1 | −1 |  0 |     |
      | 1 | 0 | −1.5 | 0 | 0 |  0 |  0 | −0.5 | −1
  7   | 0 | 1 |  0.5 | 1 | 0 |  0 |  0 |     |
      | 1 | 0 | −0.5 | 0 | 0 |  0 |  0 | −0.5 | −1

(Figure: example training procedures, online and batch, as descent paths on the modified error surface; the resulting threshold logic unit computes ¬x with w = −1 and θ = −0.5.)

Training Threshold Logic Units: Conjunction

Threshold logic unit with two inputs for the conjunction y = x1 ∧ x2:

x1 x2 | y
 0  0 | 0
 0  1 | 0
 1  0 | 0
 1  1 | 1

Online training (η = 1, initial values θ = w1 = w2 = 0):

epoch | x1 x2 | o | x⊤w−θ | y | e | Δθ Δw1 Δw2 | θ w1 w2
      |       |   |       |   |   |            | 0  0  0
  1   | 0  0  | 0 |   0   | 1 | 1 |  1   0   0 | 1  0  0
      | 0  1  | 0 |  −1   | 0 | 0 |  0   0   0 | 1  0  0
      | 1  0  | 0 |  −1   | 0 | 0 |  0   0   0 | 1  0  0
      | 1  1  | 1 |  −1   | 0 | 1 | −1   1   1 | 0  1  1
  2   | 0  0  | 0 |   0   | 1 | 1 |  1   0   0 | 1  1  1
      | 0  1  | 0 |   0   | 1 | 1 |  1   0  −1 | 2  1  0
      | 1  0  | 0 |  −1   | 0 | 0 |  0   0   0 | 2  1  0
      | 1  1  | 1 |  −1   | 0 | 1 | −1   1   1 | 1  2  1
  3   | 0  0  | 0 |  −1   | 0 | 0 |  0   0   0 | 1  2  1
      | 0  1  | 0 |   0   | 1 | 1 |  1   0  −1 | 2  2  0
      | 1  0  | 0 |   0   | 1 | 1 |  1  −1   0 | 3  1  0
      | 1  1  | 1 |  −2   | 0 | 1 | −1   1   1 | 2  2  1
  4   | 0  0  | 0 |  −2   | 0 | 0 |  0   0   0 | 2  2  1
      | 0  1  | 0 |  −1   | 0 | 0 |  0   0   0 | 2  2  1
      | 1  0  | 0 |   0   | 1 | 1 |  1  −1   0 | 3  1  1
      | 1  1  | 1 |  −1   | 0 | 1 | −1   1   1 | 2  2  2
  5   | 0  0  | 0 |  −2   | 0 | 0 |  0   0   0 | 2  2  2
      | 0  1  | 0 |   0   | 1 | 1 |  1   0  −1 | 3  2  1
      | 1  0  | 0 |  −1   | 0 | 0 |  0   0   0 | 3  2  1
      | 1  1  | 1 |   0   | 1 | 0 |  0   0   0 | 3  2  1
  6   | 0  0  | 0 |  −3   | 0 | 0 |  0   0   0 | 3  2  1
      | 0  1  | 0 |  −2   | 0 | 0 |  0   0   0 | 3  2  1
      | 1  0  | 0 |  −1   | 0 | 0 |  0   0   0 | 3  2  1
      | 1  1  | 1 |   0   | 1 | 0 |  0   0   0 | 3  2  1

The resulting threshold logic unit computes the conjunction with θ = 3, w1 = 2, w2 = 1. (Figure: geometric interpretation; the found line separates (1, 1) from the other points.)
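Running the online-training sketch from above on the conjunction reproduces the final row of the table (assuming the same pattern order and η = 1):

```python
patterns = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
print(online_training([0.0, 0.0], 0.0, patterns, eta=1.0))
# -> ([2.0, 1.0], 3.0), i.e. w1 = 2, w2 = 1, theta = 3
```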

Training Threshold Logic Units: Biimplication

Online training of a threshold logic unit for the biimplication x1 ↔ x2 (η = 1, initial values θ = w1 = w2 = 0):

epoch | x1 x2 | o | x⊤w−θ | y | e | Δθ Δw1 Δw2 | θ w1 w2
      |       |   |       |   |   |            | 0  0  0
  1   | 0  0  | 1 |   0   | 1 | 0 |  0   0   0 | 0  0  0
      | 0  1  | 0 |   0   | 1 | 1 |  1   0  −1 | 1  0 −1
      | 1  0  | 0 |  −1   | 0 | 0 |  0   0   0 | 1  0 −1
      | 1  1  | 1 |  −2   | 0 | 1 | −1   1   1 | 0  1  0
  2   | 0  0  | 1 |   0   | 1 | 0 |  0   0   0 | 0  1  0
      | 0  1  | 0 |   0   | 1 | 1 |  1   0  −1 | 1  1 −1
      | 1  0  | 0 |   0   | 1 | 1 |  1  −1   0 | 2  0 −1
      | 1  1  | 1 |  −3   | 0 | 1 | −1   1   1 | 1  1  0
  3   | 0  0  | 1 |  −1   | 0 | 1 | −1   0   0 | 0  1  0
      | 0  1  | 0 |   0   | 1 | 1 |  1   0  −1 | 1  1 −1
      | 1  0  | 0 |   0   | 1 | 1 |  1  −1   0 | 2  0 −1
      | 1  1  | 1 |  −3   | 0 | 1 | −1   1   1 | 1  1  0
  4   | 0  0  | 1 |  −1   | 0 | 1 | −1   0   0 | 0  1  0
      | 0  1  | 0 |   0   | 1 | 1 |  1   0  −1 | 1  1 −1
      | 1  0  | 0 |   0   | 1 | 1 |  1  −1   0 | 2  0 −1
      | 1  1  | 1 |  −3   | 0 | 1 | −1   1   1 | 1  1  0

Epochs 3 and 4 are identical: the same non-solving parameters are computed over and over, so training does not terminate.

Training Threshold Logic Units: Convergence

Convergence theorem: Let L = {(x1, o1), ..., (xm, om)} be a set of training patterns, each consisting of an input vector xi ∈ ℝⁿ and a desired output oi ∈ {0, 1}. Furthermore, let L0 = {(x, o) ∈ L | o = 0} and L1 = {(x, o) ∈ L | o = 1}. If L0 and L1 are linearly separable, that is, if w ∈ ℝⁿ and θ ∈ ℝ exist such that

∀(x, 0) ∈ L0: w⊤x < θ   and   ∀(x, 1) ∈ L1: w⊤x ≥ θ,

then online as well as batch training terminate.

• The algorithms terminate only when the error vanishes.
• Therefore the resulting threshold and weights must solve the problem.
• For not linearly separable problems the algorithms do not terminate (oscillation, repeated computation of the same non-solving w and θ).

Training Threshold Logic Units: Delta Rule

Turning the threshold value into a weight: the threshold is treated as the weight of an additional input that is fixed to 1, since

∑_{i=1}^n wi xi ≥ θ   ⟺   ∑_{i=1}^n wi xi − θ ≥ 0.

Formal training rule (with the threshold turned into a weight): Let x = (x0 = 1, x1, ..., xn)⊤ be an (extended) input vector of a threshold logic unit, o the desired output for this input vector and y the actual output of the threshold logic unit. If y ≠ o, then the (extended) weight vector w = (w0 = −θ, w1, ..., wn)⊤ is adapted as follows in order to reduce the error:

∀i ∈ {0, ..., n}: wi(new) = wi(old) + Δwi   with   Δwi = η(o − y)xi,

where η is a parameter that is called learning rate. It determines the severity of the weight changes. This procedure is called Delta Rule or Widrow–Hoff Procedure [Widrow and Hoff 1960].

• Note that with extended input and weight vectors, there is only one update rule (no distinction of threshold and weights).
• Note also that the (extended) input vector may be x = (x0 = −1, x1, ..., xn)⊤ with the corresponding (extended) weight vector w = (w0 = +θ, w1, ..., wn)⊤.

Training Networks of Threshold Logic Units

• Single threshold logic units have strong limitations: they can only compute linearly separable functions.
• Networks of threshold logic units can compute arbitrary Boolean functions.
• Training single threshold logic units with the delta rule is easy and fast and guaranteed to find a solution if one exists.
• Networks of threshold logic units cannot be trained with the delta rule, because
  ◦ there are no desired values for the neurons of the first layer(s), and
  ◦ the problem can usually be solved with several different functions computed by the neurons of the first layer(s) (non-unique solution).
• When this situation became clear, neural networks were first seen as a "research dead end".

General (Artificial) Neural Networks

Basic graph theoretic notions:

A (directed) graph is a pair G = (V, E) consisting of a (finite) set V of vertices or nodes and a (finite) set E ⊆ V × V of edges. We call an edge e = (u, v) ∈ E directed from vertex u to vertex v.

Let G = (V, E) be a (directed) graph and u ∈ V a vertex. Then the vertices of the set

pred(u) = {v ∈ V | (v, u) ∈ E}

are called the predecessors of the vertex u and the vertices of the set

succ(u) = {v ∈ V | (u, v) ∈ E}

are called the successors of the vertex u.

General definition of a neural network:

An (artificial) neural network is a (directed) graph G = (U, C) whose vertices u ∈ U are called neurons or units and whose edges c ∈ C are called connections. The set U of vertices is partitioned into

• the set U_in of input neurons,
• the set U_out of output neurons, and
• the set U_hidden of hidden neurons.

It is

U = U_in ∪ U_out ∪ U_hidden,
U_in ≠ ∅,  U_out ≠ ∅,  U_hidden ∩ (U_in ∪ U_out) = ∅.

General Neural Networks

Each connection (v, u) ∈ C possesses a weight w_uv and each neuron u ∈ U possesses three (real-valued) state variables:

• the network input net_u,
• the activation act_u, and
• the output out_u.

Each input neuron u ∈ U_in also possesses a fourth (real-valued) state variable,

• the external input ext_u.

Furthermore, each neuron u ∈ U possesses three functions:

• the network input function f_net^(u): ℝ^(2|pred(u)| + κ1(u)) → ℝ,
• the activation function f_act^(u): ℝ^(κ2(u)) → ℝ, and
• the output function f_out^(u): ℝ → ℝ,

which are used to compute the values of the state variables.

Types of (artificial) neural networks:

• If the graph of a neural network is acyclic, it is called a feed-forward network.
• If the graph of a neural network contains cycles (backward connections), it is called a recurrent network.

Representation of the connection weights as a matrix:

       u1       u2      ...  ur
u1  ( w_u1u1   w_u1u2   ...  w_u1ur )
u2  ( w_u2u1   w_u2u2   ...  w_u2ur )
 .       .        .             .
ur  ( w_uru1   w_uru2   ...  w_urur )

General Neural Networks: Example

A simple recurrent neural network: the inputs x1 and x2 feed the neurons u1 and u2, the output y is read from the neuron u3; the connections are u1 → u3 with weight −2, u2 → u3 with weight 3, u3 → u1 with weight 4, and u1 → u2 with weight 1.

Weight matrix of this network:

       u1  u2  u3
u1  (   0   0   4 )
u2  (   1   0   0 )
u3  (  −2   3   0 )

Structure of a Generalized Neuron

A generalized neuron is a simple numeric processor.

(Figure: the inputs in_uv1 = out_v1, ..., in_uvn = out_vn enter with the weights w_uv1, ..., w_uvn; the functions f_net^(u), f_act^(u) and f_out^(u), with the parameters σ1, ..., σl and θ1, ..., θk, compute net_u, act_u and out_u in sequence; the external input ext_u is shown as an additional input.)

General Neural Networks: Example

Updating the activations of the neurons, with

f_net^(u)(w_u, in_u) = ∑_{v ∈ pred(u)} w_uv in_uv = ∑_{v ∈ pred(u)} w_uv out_v,

f_act^(u)(net_u, θ) = 1 if net_u ≥ θ, 0 otherwise,   and   f_out^(u)(act_u) = act_u,

where every neuron has the threshold 1 and the external inputs are x1 = 1 and x2 = 0.

• Order in which the neurons are updated: u3, u1, u2, u3, u1, u2, u3, ...

             u1 u2 u3
input phase   1  0  0
work phase    1  0  0   net_u3 = −2 < 1
              0  0  0   net_u1 =  0 < 1
              0  0  0   net_u2 =  0 < 1
              0  0  0   net_u3 =  0 < 1
              0  0  0   net_u1 =  0 < 1

• Input phase: activations/outputs in the initial state.
• Work phase: the activation/output of the neuron updated next is computed from the outputs of the other neurons and the weights/threshold.
• A stable state with a unique output is reached.

• Order in which the neurons are updated: u3, u2, u1, u3, u2, u1, u3, ...

             u1 u2 u3
input phase   1  0  0
work phase    1  0  0   net_u3 = −2 < 1
              1  1  0   net_u2 =  1 ≥ 1
              0  1  0   net_u1 =  0 < 1
              0  1  1   net_u3 =  3 ≥ 1
              0  0  1   net_u2 =  0 < 1
              1  0  1   net_u1 =  4 ≥ 1
              1  0  0   net_u3 = −2 < 1

• No stable state is reached (oscillation of output).

General Neural Networks: Training

Definition of learning tasks for a neural network: A fixed learning task L_fixed for a neural network with

• n input neurons U_in = {u1, ..., un} and
• m output neurons U_out = {v1, ..., vm}

is a set of training patterns l = (ı^(l), o^(l)), each consisting of

• an input vector ı^(l) = (ext_u1^(l), ..., ext_un^(l)) and
• an output vector o^(l) = (o_v1^(l), ..., o_vm^(l)).

A fixed learning task is solved, if for all training patterns l ∈ L_fixed the neural network computes from the external inputs contained in the input vector ı^(l) of a training pattern l the outputs contained in the corresponding output vector o^(l).
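The two update schedules above can be replayed with a few lines of Python. A minimal sketch (not from the slides):

```python
# Weight W[u][v] is the weight of the connection from v to u;
# every neuron has the threshold 1.
W = {"u1": {"u3": 4}, "u2": {"u1": 1}, "u3": {"u1": -2, "u2": 3}}

def work_phase(order, act, steps):
    act = dict(act)                       # input phase: initial state
    states = [tuple(act.values())]
    for u in (order * steps)[:steps]:     # cyclic update schedule
        net = sum(w * act[v] for v, w in W[u].items())
        act[u] = int(net >= 1)            # step activation, threshold 1
        states.append(tuple(act.values()))
    return states

start = {"u1": 1, "u2": 0, "u3": 0}       # external inputs x1 = 1, x2 = 0
print(work_phase(["u3", "u1", "u2"], start, 6))  # reaches stable (0, 0, 0)
print(work_phase(["u3", "u2", "u1"], start, 7))  # oscillates
```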

General Neural Networks: Training

Solving a fixed learning task: error definition

• Measure how well a neural network solves a given fixed learning task.
• Compute differences between desired and actual outputs.
• Do not sum the differences directly in order to avoid errors canceling each other.
• The square has favorable properties for deriving the adaptation rules.

e = ∑_{l ∈ L_fixed} e^(l) = ∑_{v ∈ U_out} e_v = ∑_{l ∈ L_fixed} ∑_{v ∈ U_out} e_v^(l),   where   e_v^(l) = (o_v^(l) − out_v^(l))².

Definition of learning tasks for a neural network: A free learning task L_free for a neural network with

• n input neurons U_in = {u1, ..., un}

is a set of training patterns l = (ı^(l)), each consisting of

• an input vector ı^(l) = (ext_u1^(l), ..., ext_un^(l)).

Properties:

• There is no desired output for the training patterns.
• Outputs can be chosen freely by the training method.
• Solution idea: Similar inputs should lead to similar outputs. (clustering of input vectors)

General Neural Networks: Preprocessing

Normalization of the input vectors:

• Compute expected value and (corrected) variance for each input:

μ_k = (1/|L|) ∑_{l ∈ L} ext_uk^(l)   and   σ_k² = (1/(|L|−1)) ∑_{l ∈ L} (ext_uk^(l) − μ_k)².

• Normalize the input vectors to expected value 0 and standard deviation 1:

ext_uk^(l)(new) = (ext_uk^(l)(old) − μ_k) / √(σ_k²)

• Such a normalization avoids unit and scaling problems. It is also known as z-scaling or z-score standardization/normalization.
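A minimal sketch of this z-scaling (not from the slides), with each column of X holding one input dimension:

```python
import numpy as np

def z_score(X):
    """Normalize each input dimension (column) to mean 0 and standard
    deviation 1, using the corrected (n-1) variance as on the slide."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0, ddof=1)   # ddof=1 -> corrected variance
    return (X - mu) / sigma

X = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])
print(z_score(X))                   # columns now comparable in scale
```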

Multi-layer Perceptrons (MLPs)

An r-layer perceptron is a neural network with a graph G = (U, C) that satisfies the following conditions:

(i) U_in ∩ U_out = ∅,

(ii) U_hidden = U_hidden^(1) ∪ ... ∪ U_hidden^(r−2), with U_hidden^(i) ∩ U_hidden^(j) = ∅ for all 1 ≤ i < j ≤ r−2,

(iii) C ⊆ (U_in × U_hidden^(1)) ∪ (⋃_{i=1}^{r−3} U_hidden^(i) × U_hidden^(i+1)) ∪ (U_hidden^(r−2) × U_out)

or, if there are no hidden neurons (r = 2, U_hidden = ∅), C ⊆ U_in × U_out.

• Feed-forward network with strictly layered structure.

(Figure: general structure of a multi-layer perceptron with the inputs x1, ..., xn, the hidden layers U_hidden^(1), ..., U_hidden^(r−2) and the outputs y1, ..., ym.)

• The network input function of each hidden neuron and of each output neuron is the weighted sum of its inputs, that is,

∀u ∈ U_hidden ∪ U_out: f_net^(u)(w_u, in_u) = w_u⊤ in_u = ∑_{v ∈ pred(u)} w_uv out_v.

• The activation function of each hidden neuron is a so-called sigmoid function, that is, a monotonically increasing function f: ℝ → [0, 1] with lim_{x→−∞} f(x) = 0 and lim_{x→∞} f(x) = 1.

• The activation function of each output neuron is either also a sigmoid function or a linear function, that is, f_act(net, θ) = α net − θ.

Sigmoid Activation Functions

step function:
f_act(net, θ) = 1 if net ≥ θ, 0 otherwise.

semi-linear function:
f_act(net, θ) = 1 if net > θ + 1/2, 0 if net < θ − 1/2, (net − θ) + 1/2 otherwise.

sine until saturation:
f_act(net, θ) = 1 if net > θ + π/2, 0 if net < θ − π/2, (sin(net − θ) + 1)/2 otherwise.

logistic function:
f_act(net, θ) = 1 / (1 + e^(−(net−θ)))

Only the step function is a neurobiologically plausible activation function.
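The four sigmoid functions in code, as a small sketch (not from the slides):

```python
import numpy as np

def step(net, theta):
    return np.where(net >= theta, 1.0, 0.0)

def semi_linear(net, theta):
    # ramps linearly from 0 to 1 on [theta - 1/2, theta + 1/2]
    return np.clip(net - theta + 0.5, 0.0, 1.0)

def sine_saturation(net, theta):
    return np.where(net > theta + np.pi / 2, 1.0,
           np.where(net < theta - np.pi / 2, 0.0,
                    (np.sin(net - theta) + 1) / 2))

def logistic(net, theta):
    return 1.0 / (1.0 + np.exp(-(net - theta)))

net = np.linspace(-4, 4, 9)
for f in (step, semi_linear, sine_saturation, logistic):
    print(f.__name__, np.round(f(net, 0.0), 2))
```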

Sigmoid Activation Functions

• All sigmoid functions on the previous slide are unipolar, that is, they range from 0 to 1.
• Sometimes bipolar sigmoid functions are used (ranging from −1 to +1), like the hyperbolic tangent (tangens hyperbolicus):

f_act(net, θ) = tanh(net − θ)
             = (e^(net−θ) − e^(−(net−θ))) / (e^(net−θ) + e^(−(net−θ)))
             = (1 − e^(−2(net−θ))) / (1 + e^(−2(net−θ)))
             = 2 / (1 + e^(−2(net−θ))) − 1

Multi-layer Perceptrons: Weight Matrices

Let U1 = {v1, ..., vm} and U2 = {u1, ..., un} be the neurons of two consecutive layers of a multi-layer perceptron. Their connection weights are represented by an n × m matrix

     ( w_u1v1  w_u1v2  ...  w_u1vm )
W =  ( w_u2v1  w_u2v2  ...  w_u2vm )
     (    .       .            .   )
     ( w_unv1  w_unv2  ...  w_unvm ),

where w_uivj = 0 if there is no connection from neuron vj to neuron ui.

Advantage: The computation of the network input can be written as

net_U2 = W · in_U2 = W · out_U1,

where net_U2 = (net_u1, ..., net_un)⊤ and in_U2 = out_U1 = (out_v1, ..., out_vm)⊤.

Multi-layer Perceptrons: Biimplication

Solving the biimplication problem with a multi-layer perceptron:

W1 = ( −2   2 )        W2 = ( 2  2 )
     (  2  −2 ) ,

(Figure: inputs x1, x2; two hidden neurons with the thresholds −1; one output neuron with the threshold 3.)

Note the additional input neurons compared to the TLU solution.

Multi-layer Perceptrons: Fredkin Gate

s  | 0 0 0 0 1 1 1 1
x1 | 0 0 1 1 0 0 1 1
x2 | 0 1 0 1 0 1 0 1
y1 | 0 0 1 1 0 1 0 1
y2 | 0 1 0 1 0 0 1 1

(Figure: the Fredkin gate as a box with the inputs s, x1, x2 and the outputs s, y1, y2; for s = 0 the inputs a, b are passed straight through, for s = 1 they are swapped.)
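Returning to the biimplication MLP above: with the weight matrices the forward computation is just two matrix-vector products. A sketch (not from the slides), assuming step activations with the thresholds −1 (hidden) and 3 (output) shown in the figure:

```python
import numpy as np

W1 = np.array([[-2.0, 2.0], [2.0, -2.0]])
W2 = np.array([[2.0, 2.0]])
th1, th2 = np.array([-1.0, -1.0]), np.array([3.0])

def forward(x):
    h = (W1 @ x >= th1).astype(float)     # net_U2 = W1 . out_U1
    return (W2 @ h >= th2).astype(int)    # output layer

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, forward(np.array(x, dtype=float))[0])   # computes x1 <-> x2
```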

Multi-layer Perceptrons: Fredkin Gate

• The Fredkin gate (after Edward Fredkin, ∗1934), or controlled swap gate (CSWAP), is a computational circuit that is used in conservative logic and reversible computing.
• Conservative logic is a model of computation that explicitly reflects the physical properties of computation, like the reversibility of the dynamical laws and the conservation of certain quantities (e.g. energy) [Fredkin and Toffoli 1982].
• The Fredkin gate is reversible in the sense that the inputs can be computed as functions of the outputs in the same way in which the outputs can be computed as functions of the inputs (no information loss, no entropy gain).
• The Fredkin gate is universal in the sense that all Boolean functions can be computed using only Fredkin gates.
• Note that both outputs, y1 and y2, are not linearly separable, because the convex hull of the points mapped to 0 and the convex hull of the points mapped to 1 share the point in the center of the cube.

Reminder: Convex Hull Theorem

Theorem: Two sets of points in a Euclidean space are linearly separable if and only if their convex hulls are disjoint (that is, have no point in common).

Both outputs y1 and y2 of a Fredkin gate are not linearly separable.

(Figure: convex hulls of the points with y1 = 0 and y1 = 1, and of the points with y2 = 0 and y2 = 1.)

Excursion: Billiard Ball Computer

• A billiard-ball computer (a.k.a. conservative logic circuit) is an idealized model of a reversible mechanical computer that simulates the computations of circuits by the movements of spherical billiard balls in a frictionless environment [Fredkin and Toffoli 1982].

(Figure: a switch, which controls the output location of an input ball by another ball; it is equivalent to an AND gate in its second/lower output.)

Excursion: The (Negated) Fredkin Gate

• Only reflectors (black) and inputs/outputs (color) are shown.
• The meaning of the switch variable is negated compared to the value table on the previous slide: 0 switches the inputs, 1 passes them through.

Excursion: The (Negated) Fredkin Gate

• All possible paths of billiard balls through this gate are shown. Inputs/outputs are in color; red: switch, green: x1/y1, blue: x2/y2.
• Billiard balls are shown in places where their trajectories may change direction.
• This Fredkin gate implementation contains four switches (shown in light blue). The other parts serve the purpose to equate the travel times of the balls through the gate.
• Switch is zero (no ball enters at the red arrow): the blue and green balls switch to the other output.
• If a ball is entered at the switch input, it is passed through. Since there are no other inputs, the other outputs remain 0 (no ball in, no ball out).

Excursion: The (Negated) Fredkin Gate

• A ball is entered at the switch input and passed through. A ball entered at the first input x1 is also passed through to output y1. (Note the symmetry of the trajectories.)
• A ball is entered at the switch input and passed through. A ball entered at the second input x2 is also passed through to output y2. (Note the symmetry of the trajectories.)
• A ball is entered at the switch input and passed through. Balls entered at both inputs x1 and x2 are also passed through to the outputs y1 and y2. (Note the symmetry of the trajectories.)

Multi-layer Perceptrons: Fredkin Gate

A multi-layer perceptron that computes the Fredkin gate (inputs x1, s, x2; four hidden neurons; outputs y1, y2):

     (  2  −2   0 )
W1 = (  2   2   0 )        W2 = ( 2  0  2  0 )
     (  0   2   2 )             ( 0  2  0  2 )
     (  0  −2   2 ) ,

The hidden neurons have the thresholds 1, 3, 3, 1 and compute x1 ∧ ¬s, x1 ∧ s, s ∧ x2 and ¬s ∧ x2; the output neurons have the threshold 1 and compute the disjunctions y1 = (x1 ∧ ¬s) ∨ (s ∧ x2) and y2 = (x1 ∧ s) ∨ (¬s ∧ x2).
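A quick check (a sketch, not from the slides) that these weight matrices indeed realize the Fredkin gate truth table:

```python
import numpy as np
from itertools import product

W1 = np.array([[2, -2, 0], [2, 2, 0], [0, 2, 2], [0, -2, 2]])
t1 = np.array([1, 3, 3, 1])                  # hidden thresholds
W2 = np.array([[2, 0, 2, 0], [0, 2, 0, 2]])
t2 = np.array([1, 1])                        # output thresholds

for x1, s, x2 in product((0, 1), repeat=3):
    h = (W1 @ np.array([x1, s, x2]) >= t1).astype(int)
    y1, y2 = (W2 @ h >= t2).astype(int)
    print(f"s={s} x1={x1} x2={x2} -> y1={y1} y2={y2}")
```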

Why Non-linear Activation Functions?

With weight matrices we have for two consecutive layers U1 and U2

net_U2 = W · in_U2 = W · out_U1.

If the activation functions are linear, that is,

f_act(net, θ) = α net − θ,

the activations of the neurons in the layer U2 can be computed as

act_U2 = D_act · net_U2 − θ,

where

• act_U2 = (act_u1, ..., act_un)⊤ is the activation vector,
• D_act is an n × n diagonal matrix of the factors α_ui, i = 1, ..., n, and
• θ = (θ_u1, ..., θ_un)⊤ is a bias vector.

If the output function is also linear, it is analogously

out_U2 = D_out · act_U2 − ξ,

where

• out_U2 = (out_u1, ..., out_un)⊤ is the output vector,
• D_out is again an n × n diagonal matrix of factors, and
• ξ = (ξ_u1, ..., ξ_un)⊤ is a bias vector.

Combining these computations we get

out_U2 = D_out · (D_act · (W · out_U1) − θ) − ξ

and thus

out_U2 = A12 · out_U1 + b12

with an n × m matrix A12 and an n-dimensional vector b12.

Therefore we have

out_U2 = A12 · out_U1 + b12   and   out_U3 = A23 · out_U2 + b23

for the computations of two consecutive layers U2 and U3. These two computations can be combined into

out_U3 = A13 · out_U1 + b13,

where A13 = A23 · A12 and b13 = A23 · b12 + b23.

Result: With linear activation and output functions any multi-layer perceptron can be reduced to a two-layer perceptron.

Multi-layer Perceptrons: Function Approximation

• Up to now: representing and learning Boolean functions f: {0, 1}ⁿ → {0, 1}.
• Now: representing and learning real-valued functions f: ℝⁿ → ℝ.

General idea of function approximation:

• Approximate a given function by a step function.
• Construct a neural network that computes the step function.

(Figure: a function approximated by a step function with the levels y0, ..., y4 on the intervals given by the cut points x1, ..., x4.)
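The collapse of stacked linear layers into a single affine map can be checked numerically. A tiny sketch (not from the slides, with assumed random shapes):

```python
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(2, 4))
b1, b2 = rng.normal(size=4), rng.normal(size=2)

x = rng.normal(size=3)
two_layers = W2 @ (W1 @ x + b1) + b2           # two linear layers
A13, b13 = W2 @ W1, W2 @ b1 + b2               # collapsed single layer
print(np.allclose(two_layers, A13 @ x + b13))  # True
```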

Multi-layer Perceptrons: Function Approximation

(Figure: a neural network that computes the step function shown on the preceding slide. The first hidden layer contains one neuron per cut point xi, with weight 1 and threshold xi; the second hidden layer contains one neuron per step, which receives the weights +2 and −2 from the neurons of its left and right cut point and has the threshold 1; the output neuron receives the step heights y1, y2, y3 as weights.)

According to the input value only one step is active at any time. The output neuron has the identity as its activation and output functions.

Theorem: Any Riemann-integrable function can be approximated with arbitrary accuracy by a four-layer perceptron.

• But: The error is measured as the area between the functions.
• A more sophisticated mathematical examination allows a stronger assertion: With a three-layer perceptron any continuous function can be approximated with arbitrary accuracy (error: maximum function value difference).

Multi-layer Perceptrons as Universal Approximators

Universal Approximation Theorem [Hornik 1991]:

Let ϕ(·) be a continuous, bounded and non-constant function, let X denote an arbitrary compact subset of ℝ^m, and let C(X) denote the space of continuous functions on X. Given any function f ∈ C(X) and ε > 0, there exist an integer N, real constants v_i, θ_i ∈ ℝ and real vectors w_i ∈ ℝ^m, i = 1, ..., N, such that we may define

F(x) = ∑_{i=1}^N v_i ϕ(w_i⊤ x − θ_i)

as an approximate realization of the function f (which is independent of ϕ). That is,

|F(x) − f(x)| < ε

for all x ∈ X. In other words, functions of the form F(x) are dense in C(X).

Note that it is not the shape of the activation function, but the layered structure of the feed-forward network that renders multi-layer perceptrons universal approximators.

Multi-layer Perceptrons: Function Approximation

• By using relative step heights Δy1, ..., Δy4 instead of the absolute levels y0, ..., y4, one layer can be saved: each step neuron then feeds the output neuron directly.

(Figure: the step function with absolute heights y0, ..., y4 and, next to it, with relative heights 1·Δy1, ..., 1·Δy4.)
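A sketch (not from the slides) of the "relative step heights" construction: one step neuron per cut point x_i, whose output weight is the height difference Δy_i, feeding an identity output neuron:

```python
import numpy as np

def step_approximation(f, a, b, n):
    """Approximate f on [a, b] by a three-layer step network with n steps."""
    xs = np.linspace(a, b, n + 1)          # cut points
    ys = f(xs)                             # function values at the cuts
    dy = np.diff(ys)                       # relative step heights
    def net(x):
        # one step neuron per interior cut point (weight 1, threshold x_i)
        hidden = (np.asarray(x)[..., None] >= xs[1:]).astype(float)
        return ys[0] + hidden @ dy         # identity output neuron
    return net

net = step_approximation(np.sin, 0.0, np.pi, 50)
x = np.linspace(0, np.pi, 5)
print(np.round(net(x), 3), np.round(np.sin(x), 3))   # close agreement
```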

Multi-layer Perceptrons: Function Approximation

(Figure: a neural network that computes the step function with relative step heights: one hidden neuron per cut point xi, with weight 1 and threshold xi; the output neuron has the weights Δy1, ..., Δy4 and the identity as its activation and output functions.)

• By using semi-linear activation functions the approximation can be improved: the approximating function becomes piecewise linear instead of piecewise constant.

(Figure: a neural network that computes this piecewise linear function: one hidden neuron per cut point, with weight 1/Δx and threshold θi = xi/Δx, where Δx = x_(i+1) − x_i; the output neuron has the weights Δy1, ..., Δy4 and the identity as its activation and output functions.)

Mathematical Background: Regression

Regression: Method of Least Squares

Regression is also known as the Method of Least Squares (Carl Friedrich Gauß), or as Ordinary Least Squares, abbreviated OLS.

Given:
• a data set ((x1, y1), ..., (xn, yn)) of n data tuples (one or more input values and one output value) and
• a hypothesis about the functional relationship between response and predictor values, e.g. Y = f(X) = a + bX + ε.

Desired:
• a parameterization of the conjectured function that minimizes the sum of squared errors ("best fit").

Depending on
• the hypothesis about the functional relationship and
• the number of arguments of the conjectured function
different types of regression are distinguished.

Reminder: Function Optimization

Task: Find values x = (x1, ..., xm) such that f(x) = f(x1, ..., xm) is optimal.

Often feasible approach:
• A necessary condition for a (local) optimum (maximum or minimum) is that the partial derivatives w.r.t. the parameters vanish (Pierre de Fermat, 1607–1665).
• Therefore: (Try to) solve the equation system that results from setting all partial derivatives w.r.t. the parameters equal to zero.

Example task: Minimize f(x, y) = x² + y² + xy − 4x − 5y.

Solution procedure:
1. Take the partial derivatives of the objective function and set them to zero:
   ∂f/∂x = 2x + y − 4 = 0,   ∂f/∂y = 2y + x − 5 = 0.
2. Solve the resulting (here: linear) equation system: x = 1, y = 2.

Mathematical Background: Linear Regression

Training neural networks is closely related to regression.

Given:
• a data set ((x1, y1), ..., (xn, yn)) of n data tuples and
• a hypothesis about the functional relationship, e.g. y = g(x) = a + bx.

Approach: Minimize the sum of squared errors, that is,

F(a, b) = ∑_{i=1}^n (g(xi) − yi)² = ∑_{i=1}^n (a + bxi − yi)².

Necessary conditions for a minimum (a.k.a. Fermat's theorem, after Pierre de Fermat, 1607–1665):

∂F/∂a = ∑_{i=1}^n 2(a + bxi − yi) = 0   and   ∂F/∂b = ∑_{i=1}^n 2(a + bxi − yi) xi = 0.

Linear Regression: Example of Error Functional

(Figure: a very simple data set (4 points) to which a line is to be fitted, and the error functional for linear regression, F(a, b) = ∑_{i=1}^4 (a + bxi − yi)², shown as the same function in two different views.)
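The small optimization example can be solved directly. A sketch (not from the slides) using numpy:

```python
import numpy as np

# Solve  2x + y = 4,  x + 2y = 5  (from setting the gradient to zero).
A = np.array([[2.0, 1.0], [1.0, 2.0]])
b = np.array([4.0, 5.0])
print(np.linalg.solve(A, b))   # -> [1. 2.], i.e. x = 1, y = 2
```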

Mathematical Background: Linear Regression

Result of the necessary conditions: a system of so-called normal equations, that is,

na + (∑_{i=1}^n xi) b = ∑_{i=1}^n yi,
(∑_{i=1}^n xi) a + (∑_{i=1}^n xi²) b = ∑_{i=1}^n xi yi.

• Two linear equations for two unknowns a and b.
• The system can be solved with standard methods from linear algebra.
• The solution is unique unless all x-values are identical.
• The resulting line is called a regression line.

Linear Regression: Example

x | 1 2 3 4 5 6 7 8
y | 1 3 2 3 4 3 5 6

Assumption: y = a + bx.

Normal equations:

8a + 36b = 27,
36a + 204b = 146.

Solution: y = 3/4 + (7/12)x.

(Figure: the data points and the resulting regression line.)

Mathematical Background: Polynomial Regression

Generalization to polynomials:

y = p(x) = a0 + a1 x + ... + am x^m

Approach: Minimize the sum of squared errors, that is,

F(a0, a1, ..., am) = ∑_{i=1}^n (p(xi) − yi)² = ∑_{i=1}^n (a0 + a1 xi + ... + am xi^m − yi)².

Necessary conditions for a minimum: All partial derivatives vanish, that is,

∂F/∂a0 = 0,  ∂F/∂a1 = 0,  ...,  ∂F/∂am = 0.

System of normal equations for polynomials:

na0 + (∑ xi) a1 + ... + (∑ xi^m) am = ∑ yi
(∑ xi) a0 + (∑ xi²) a1 + ... + (∑ xi^(m+1)) am = ∑ xi yi
   ...
(∑ xi^m) a0 + (∑ xi^(m+1)) a1 + ... + (∑ xi^(2m)) am = ∑ xi^m yi,

where all sums run over i = 1, ..., n.

• m + 1 linear equations for m + 1 unknowns a0, ..., am.
• The system can be solved with standard methods from linear algebra.
• The solution is unique unless the points lie exactly on a polynomial of lower degree.
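The linear example above can be checked numerically. A sketch (not from the slides) that sets up and solves the normal equations:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([1, 3, 2, 3, 4, 3, 5, 6], dtype=float)

# Normal equations for y = a + bx.
A = np.array([[len(x), x.sum()], [x.sum(), (x**2).sum()]])
r = np.array([y.sum(), (x * y).sum()])
a, b = np.linalg.solve(A, r)
print(a, b)   # -> 0.75  0.5833..., i.e. y = 3/4 + (7/12)x
```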

Mathematical Background: Multivariate Linear Regression

Generalization to more than one argument:

z = f(x, y) = a + bx + cy

Approach: Minimize the sum of squared errors, that is,

F(a, b, c) = ∑_{i=1}^n (f(xi, yi) − zi)² = ∑_{i=1}^n (a + bxi + cyi − zi)².

Necessary conditions for a minimum: All partial derivatives vanish, that is,

∂F/∂a = ∑_{i=1}^n 2(a + bxi + cyi − zi) = 0,
∂F/∂b = ∑_{i=1}^n 2(a + bxi + cyi − zi) xi = 0,
∂F/∂c = ∑_{i=1}^n 2(a + bxi + cyi − zi) yi = 0.

System of normal equations for several arguments:

na + (∑ xi) b + (∑ yi) c = ∑ zi
(∑ xi) a + (∑ xi²) b + (∑ xi yi) c = ∑ zi xi
(∑ yi) a + (∑ xi yi) b + (∑ yi²) c = ∑ zi yi

• 3 linear equations for 3 unknowns a, b, and c.
• The system can be solved with standard methods from linear algebra.
• The solution is unique unless all data points lie on a straight line.

Multivariate Linear Regression

General multivariate linear case:

y = f(x1, ..., xm) = a0 + ∑_{k=1}^m ak xk

Approach: Minimize the sum of squared errors, that is,

F(a) = (Xa − y)⊤(Xa − y),

where

    ( 1  x11 ... xm1 )        ( y1 )        ( a0 )
X = (       ...      ) ,  y = ( .. ) ,  a = ( .. )
    ( 1  x1n ... xmn )        ( yn )        ( am )

(the leading ones capture the constant a0).

Necessary condition for a minimum:

∇_a F(a) = ∇_a (Xa − y)⊤(Xa − y) = 0

(∇ is a differential operator called "nabla" or "del").

• ∇_a F(a) may easily be computed by remembering that the differential operator

∇_a = (∂/∂a0, ..., ∂/∂am)⊤

behaves formally like a vector that is "multiplied" to the sum of squared errors.
• Alternatively, one may write out the differentiation component-wise.

With the former method we obtain for the derivative:

∇_a F(a) = ∇_a ((Xa − y)⊤(Xa − y))
         = (∇_a (Xa − y))⊤ (Xa − y) + ((Xa − y)⊤ (∇_a (Xa − y)))⊤
         = (∇_a (Xa − y))⊤ (Xa − y) + (∇_a (Xa − y))⊤ (Xa − y)
         = 2X⊤(Xa − y)

Multivariate Linear Regression

Necessary condition for a minimum therefore:

∇_a F(a) = ∇_a (Xa − y)⊤(Xa − y) = 2X⊤Xa − 2X⊤y = 0

As a consequence we obtain the system of normal equations:

X⊤X a = X⊤y

This system has a solution unless X⊤X is singular. If it is regular, we have

a = (X⊤X)⁻¹ X⊤ y.

(X⊤X)⁻¹ X⊤ is called the (Moore–Penrose) pseudoinverse of the matrix X.

With the matrix-vector representation of the regression problem an extension to multivariate polynomial regression is straightforward: simply add the desired products of powers (monomials) to the matrix X.

Mathematical Background: Logistic Function

Logistic function:

y = f(x) = Y / (1 + e^(−a(x−x0)))

Special case Y = a = 1, x0 = 0:

y = f(x) = 1 / (1 + e^(−x))

Application areas of the logistic function:

• Can be used to describe saturation processes (growth processes with finite capacity/finite resources Y). Derivation e.g. from a Bernoulli differential equation f′(x) = k · f(x) · (Y − f(x)) (yields a = kY).
• Can be used to describe a linear classifier (especially for two-class problems, considered later).

Mathematical Background: Logistic Function

Example: two-dimensional logistic functions

y = f(x) = 1 / (1 + exp(−(x1 + x2 − 4))) = 1 / (1 + exp(−((1, 1)(x1, x2)⊤ − 4)))

and

y = f(x) = 1 / (1 + exp(−(2x1 + x2 − 6))) = 1 / (1 + exp(−((2, 1)(x1, x2)⊤ − 6)))

(Figures: surface and contour plots of both functions.)

The "contour lines" of the logistic function are parallel lines/hyperplanes.
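The pseudoinverse solution in code, as a sketch (not from the slides) on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 4, size=(30, 2))
y = 1.0 + 2.0 * x[:, 0] - 0.5 * x[:, 1] + rng.normal(0, 0.1, 30)

X = np.column_stack([np.ones(len(x)), x])    # leading ones capture a0
a = np.linalg.solve(X.T @ X, X.T @ y)        # a = (X^T X)^-1 X^T y
print(a)                                     # ~ [1.0, 2.0, -0.5]
```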

Mathematical Background: Logistic Regression

Generalization of regression to non-polynomial functions.

Simple example: y = a x^b.

Idea: Find a transformation to the linear/polynomial case.

Transformation for the above example: ln y = ln a + b · ln x, that is, apply linear regression to the transformed data y′ = ln y and x′ = ln x.

Special case: logistic function:

y = Y / (1 + e^(−(a⊤x + a0)))   ⟺   (Y − y)/y = e^(−(a⊤x + a0))   ⟺   ln(y / (Y − y)) = a⊤x + a0.

Result: Apply the so-called logit transform

z = ln(y / (Y − y)) = a⊤x + a0.

Logistic Regression: Example

Data points:

x | 1   2   3   4   5
y | 0.4 1.0 3.0 5.0 5.6

Apply the logit transform

z = ln(y / (Y − y)),   Y = 6.

Transformed data points (for linear regression):

x | 1     2     3    4    5
z | −2.64 −1.61 0.00 1.61 2.64

The resulting regression line and therefore the desired function are

z ≈ 1.3775x − 4.133   and   y ≈ 6 / (1 + e^(−(1.3775x − 4.133))) ≈ 6 / (1 + e^(−1.3775(x−3))).

Attention: Note that the error is minimized only in the transformed space! Therefore the function in the original space may not be optimal!

(Figures: the transformed data with the regression line, and the original data with the fitted logistic curve.)

The logistic regression function can be computed by a single neuron with

• network input function f_net(x) ≡ wx with w ≈ 1.3775,
• activation function f_act(net, θ) ≡ (1 + e^(−(net−θ)))⁻¹ with θ ≈ 4.133, and
• output function f_out(act) ≡ 6 act.
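The whole example in code, as a sketch (not from the slides):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0.4, 1.0, 3.0, 5.0, 5.6])
Y = 6.0

z = np.log(y / (Y - y))                      # logit transform
X = np.column_stack([np.ones_like(x), x])
a0, a1 = np.linalg.solve(X.T @ X, X.T @ z)   # normal equations
print(a1, a0)                                # ~ 1.3775, -4.133

y_fit = Y / (1 + np.exp(-(a1 * x + a0)))     # back to the original space
print(np.round(y_fit, 2))
```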

Multivariate Logistic Regression: Example

• Example data were drawn from a logistic function and noise was added. Reconstructing the logistic function can be reduced to a multivariate linear regression by applying a logit transform to the y-values of the data points.
• The black "contour lines" show the resulting logistic function. (The gray "contour lines" show the ideal logistic function.) Is the deviation from the ideal logistic function (gray) caused by the added noise?
• Attention: Note that the error is minimized only in the transformed space! Therefore the function in the original space may not be optimal!

Logistic Regression: Optimization in Original Space

Approach analogous to linear/polynomial regression.

Given: a data set D = {(x1, y1), ..., (xn, yn)} with n data points, yi ∈ (0, 1).

Simplification: Use x*_i = (1, xi1, ..., xim)⊤ and a = (a0, a1, ..., am)⊤. (By the leading 1 in x*_i the constant a0 is captured.)

Minimize the sum of squared errors/deviations:

F(a) = ∑_{i=1}^n (yi − 1/(1 + e^(−a⊤x*_i)))² = min.

Necessary condition for a minimum: The gradient of the objective function F(a) w.r.t. a vanishes: ∇_a F(a) = 0.

Problem: The resulting equation system is not linear.

Solution possibilities:
• Gradient descent on the objective function F(a). (considered in the following)
• Root search on the gradient ∇_a F(a). (e.g. Newton–Raphson algorithm)

Reminder: Gradient Methods for Optimization

The gradient (symbol ∇, "nabla") is a differential operator that turns a scalar function into a vector field.

Illustration of the gradient of a real-valued function z = f(x, y) at a point (x0, y0): it is

∇z|(x0, y0) = (∂z/∂x|x0, ∂z/∂y|y0).

The gradient at a point shows the direction of the steepest ascent of the function at this point; its length describes the steepness of the ascent.

Principle of gradient methods: Starting at a (possibly randomly chosen) initial point, make (small) steps in (or against) the direction of the gradient of the objective function at the current point, until a maximum (or a minimum) has been reached.

Gradient Methods: Cookbook Recipe

Idea: Starting from a randomly chosen point in the search space, make small steps in the search space, always in the direction of the steepest ascent (or descent) of the function to optimize, until a (local) maximum (or minimum) is reached.

1. Choose a (random) starting point u^(0) = (u1^(0), ..., un^(0))⊤.

2. Compute the gradient of the objective function f at the current point u^(i):

∇_u f(u)|u^(i) = (∂f/∂u1|u1^(i), ..., ∂f/∂un|un^(i))⊤

3. Make a small step in the direction (or against the direction) of the gradient:

u^(i+1) = u^(i) ± η ∇_u f(u)|u^(i),   + : gradient ascent,  − : gradient descent.

η is a step width parameter ("learning rate" in artificial neural networks).

4. Repeat steps 2 and 3, until some termination criterion is satisfied (e.g., a certain number of steps has been executed, the current gradient is small).

Gradient Descent: Simple Example

Example function: f(u) = (5/6)u⁴ − 7u³ + (115/6)u² − 18u + 6.

Gradient descent with initial value u0 = 0.2 and step width/learning rate η = 0.01:

 i | ui    | f(ui) | f′(ui)  | Δui
 0 | 0.200 | 3.112 | −11.147 | 0.111
 1 | 0.311 | 2.050 |  −7.999 | 0.080
 2 | 0.391 | 1.491 |  −6.015 | 0.060
 3 | 0.451 | 1.171 |  −4.667 | 0.047
 4 | 0.498 | 0.976 |  −3.704 | 0.037
 5 | 0.535 | 0.852 |  −2.990 | 0.030
 6 | 0.565 | 0.771 |  −2.444 | 0.024
 7 | 0.589 | 0.716 |  −2.019 | 0.020
 8 | 0.610 | 0.679 |  −1.681 | 0.017
 9 | 0.626 | 0.653 |  −1.409 | 0.014
10 | 0.640 | 0.635 |         |

Due to a proper step width/learning rate, the minimum is approached fairly quickly.

Gradient descent with initial value u0 = 0.2 and step width/learning rate η = 0.001:

 i | ui    | f(ui) | f′(ui)  | Δui
 0 | 0.200 | 3.112 | −11.147 | 0.011
 1 | 0.211 | 2.990 | −10.811 | 0.011
 2 | 0.222 | 2.874 | −10.490 | 0.010
 3 | 0.232 | 2.766 | −10.182 | 0.010
 4 | 0.243 | 2.664 |  −9.888 | 0.010
 5 | 0.253 | 2.568 |  −9.606 | 0.010
 6 | 0.262 | 2.477 |  −9.335 | 0.009
 7 | 0.271 | 2.391 |  −9.075 | 0.009
 8 | 0.281 | 2.309 |  −8.825 | 0.009
 9 | 0.289 | 2.233 |  −8.585 | 0.009
10 | 0.298 | 2.160 |         |

Due to the small step width/learning rate, the minimum is approached very slowly.

Gradient descent with initial value u0 = 1.5 and step width/learning rate η = 0.25:

 i | ui    | f(ui) | f′(ui)  | Δui
 0 | 1.500 | 2.719 |  3.500  | −0.875
 1 | 0.625 | 0.655 | −1.431  |  0.358
 2 | 0.983 | 0.955 |  2.554  | −0.639
 3 | 0.344 | 1.801 | −7.157  |  1.789
 4 | 2.134 | 4.127 |  0.567  | −0.142
 5 | 1.992 | 3.989 |  1.380  | −0.345
 6 | 1.647 | 3.203 |  3.063  | −0.766
 7 | 0.881 | 0.734 |  1.753  | −0.438
 8 | 0.443 | 1.211 | −4.851  |  1.213
 9 | 1.656 | 3.231 |  3.029  | −0.757
10 | 0.898 | 0.766 |         |

Due to the large step width/learning rate, the iterations lead to oscillations.
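The three runs can be reproduced with a direct transcription (a sketch, not from the slides):

```python
def f(u):
    return 5/6 * u**4 - 7 * u**3 + 115/6 * u**2 - 18 * u + 6

def df(u):                        # derivative f'(u)
    return 10/3 * u**3 - 21 * u**2 + 115/3 * u - 18

def gradient_descent(u, eta, steps=10):
    for i in range(steps):
        print(f"{i:2d}  u={u:.3f}  f(u)={f(u):.3f}  f'(u)={df(u):.3f}")
        u -= eta * df(u)          # step against the gradient
    return u

gradient_descent(0.2, eta=0.01)   # approaches the minimum quickly
gradient_descent(0.2, eta=0.001)  # very slow
gradient_descent(1.5, eta=0.25)   # oscillates
```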

27. Gradient Descent: Simple Example

Example function: $f(u) = \frac{5}{6}u^4 - 7u^3 + \frac{115}{6}u^2 - 18u + 6$

Gradient descent with initial value $u_0 = 2.6$ and step width/learning rate $\eta = 0.05$:

     i    u_i     f(u_i)   f'(u_i)   Δu_i
     0    2.600   3.816    -1.707    0.085
     1    2.685   3.660    -1.947    0.097
     2    2.783   3.461    -2.116    0.106
     3    2.888   3.233    -2.153    0.108
     4    2.996   3.008    -2.009    0.100
     5    3.097   2.820    -1.688    0.084
     6    3.181   2.695    -1.263    0.063
     7    3.244   2.628    -0.845    0.042
     8    3.286   2.599    -0.515    0.026
     9    3.312   2.589    -0.293    0.015
    10    3.327   2.585

Proper step width, but due to the starting point, only a local minimum is found (the global optimum is missed).

Logistic Regression: Gradient Descent

With the abbreviation $f(z) = \frac{1}{1+e^{-z}}$ for the logistic function it is
$\nabla_{\vec{a}} F(\vec{a}) = \nabla_{\vec{a}} \sum_{i=1}^n \big( y_i - f(\vec{a}^\top\vec{x}^*_i) \big)^2 = -2 \sum_{i=1}^n \big( y_i - f(\vec{a}^\top\vec{x}^*_i) \big) \cdot f'(\vec{a}^\top\vec{x}^*_i) \cdot \vec{x}^*_i.$

Derivative of the logistic function: (cf. Bernoulli differential equation)
$f'(z) = \frac{d}{dz}(1+e^{-z})^{-1} = -(1+e^{-z})^{-2}\big(-e^{-z}\big) = \frac{1}{1+e^{-z}}\Big(1 - \frac{1}{1+e^{-z}}\Big) = f(z)\cdot\big(1 - f(z)\big)$

[Figure: the logistic function $y = f(z)$ rising from 0 to 1 with inflection at $z = 0$, $y = \frac12$ and slope $\frac14$ there.]

Logistic Regression: Gradient Descent

Given: data set $D = \{(\vec{x}_1, y_1), \ldots, (\vec{x}_n, y_n)\}$ with $n$ data points, $y_i \in (0,1)$.
Simplification: Use $\vec{x}^*_i = (1, x_{i1}, \ldots, x_{im})^\top$ and $\vec{a} = (a_0, a_1, \ldots, a_m)^\top$.

Gradient descent on the objective function $F(\vec{a})$:
• Choose as the initial point $\vec{a}_0$ the result of a logit transform and a linear regression (or merely a linear regression).
• Update of the parameters $\vec{a}$:
$\vec{a}_{t+1} = \vec{a}_t - \frac{\eta}{2}\,\nabla_{\vec{a}}F(\vec{a})\big|_{\vec{a}_t} = \vec{a}_t + \eta \sum_{i=1}^n \big( y_i - f(\vec{a}_t^\top\vec{x}^*_i) \big) \cdot f(\vec{a}_t^\top\vec{x}^*_i) \cdot \big(1 - f(\vec{a}_t^\top\vec{x}^*_i)\big) \cdot \vec{x}^*_i$,
where η is a step width parameter to be chosen by a user (e.g. η = 0.05) (in the area of artificial neural networks also called "learning rate").
• Repeat the update step until convergence, e.g. until $\|\vec{a}_{t+1} - \vec{a}_t\| < \tau$ with a chosen threshold τ (e.g. $\tau = 10^{-6}$).

Multivariate Logistic Regression: Example

• Black "contour line": logit transform and linear regression.
• Green "contour line": gradient descent on the error function in the original space.
(For simplicity and clarity only the "contour lines" for y = 0.5 (inflection lines) are shown.)
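The update rule for gradient descent on $F(\vec{a})$ translates directly into code. A minimal sketch (NumPy assumed; Xs holds the extended input vectors as rows, and all names are illustrative):

    import numpy as np

    def logistic(z):
        return 1.0 / (1.0 + np.exp(-z))

    def sse_descent_step(a, Xs, y, eta=0.05):
        """One update a_{t+1} = a_t + eta * sum_i (y_i - f_i) * f_i * (1 - f_i) * x*_i."""
        f = logistic(Xs @ a)
        return a + eta * (Xs.T @ ((y - f) * f * (1.0 - f)))

Calling this repeatedly, starting from the logit/linear-regression solution, and stopping once the parameter change falls below a threshold implements the procedure described above.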

28. Logistic Classification: Two Classes

Logistic function with (maximum) $Y = 1$:
$y = f(x) = \frac{1}{1 + e^{-a(x - x_0)}}$

Interpret the logistic function as the probability of one class.

[Figure: logistic curve over x, centered at $x_0$, with the slope governed by a; two-class data over $(x_1, x_2)$ with the fitted probability surface.]

• The conditional class probability is a logistic function:
$P(C = c_1 \mid \vec{X} = \vec{x}) = p_1(\vec{x}) = p(\vec{x}; \vec{a}) = \frac{1}{1 + e^{-\vec{a}^\top\vec{x}^*}}$
with $\vec{x}^* = (1, x_1, \ldots, x_m)^\top$ and $\vec{a} = (a_0, a_1, \ldots, a_m)^\top$.

• With only two classes the conditional probability of the other class is:
$P(C = c_0 \mid \vec{X} = \vec{x}) = p_0(\vec{x}) = 1 - p(\vec{x}; \vec{a}).$

• Classification rule: $C = c_1$ if $p(\vec{x}; \vec{a}) \ge \theta$, and $C = c_0$ if $p(\vec{x}; \vec{a}) < \theta$; here $\theta = 0.5$.

Logistic Classification

• The classes are separated at the "contour line" $p(\vec{x}; \vec{a}) = \theta = 0.5$ (inflection line).
(The class boundary is linear, therefore linear classification.)
• Via the classification threshold θ, which need not be θ = 0.5, misclassification costs may be incorporated.

Logistic Classification: Example

• In finance (e.g., when assessing the credit worthiness of businesses) logistic classification is often applied in discrete spaces that are spanned e.g. by binary attributes and expert assessments (e.g. assessments of the range of products, market share, growth etc.).
• In such a case multiple businesses may fall onto the same grid point.
• Then probabilities may be estimated from observed credit defaults:
$p_{\text{default}}(\vec{x}) = \frac{\#\text{defaults}(\vec{x}) + \gamma}{\#\text{loans}(\vec{x}) + 2\gamma}$
(γ: Laplace correction, e.g. $\gamma \in \{\frac12, 1\}$)
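The classification rule can be stated as a small function. A sketch under the conventions above (extended input vectors and a parameter vector a; the names are illustrative):

    import numpy as np

    def classify(X, a, theta=0.5):
        """Return class 1 where p(x; a) >= theta, else class 0 (rows of X are inputs)."""
        Xs = np.column_stack([np.ones(len(X)), X])   # extended inputs (1, x1, ..., xm)
        p1 = 1.0 / (1.0 + np.exp(-(Xs @ a)))         # P(C = c1 | x)
        return (p1 >= theta).astype(int)

Raising or lowering theta trades one kind of misclassification against the other, which is how misclassification costs can be incorporated.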

29. Logistic Classification: Example

• More frequent is the case in which at least some attributes are metric and for each data point a class, but no class probability, is available.
• If we assign class 0: $c_0 \,\hat{=}\, y = 0$ and class 1: $c_1 \,\hat{=}\, y = 1$, the logit transform is not applicable.

Logistic Classification: Example

• Black "contour line": logit transform and linear regression.
• Green "contour line": gradient descent on the error function in the original space.
(For simplicity and clarity only the "contour lines" for y = 0.5 (inflection lines) are shown.)

Logistic Classification: Example

• The logit transform becomes applicable by mapping the classes to
$c_1 \,\hat{=}\, y = \ln\frac{1-\epsilon}{\epsilon}$ and $c_0 \,\hat{=}\, y = \ln\frac{\epsilon}{1-\epsilon} = -\ln\frac{1-\epsilon}{\epsilon}.$
• The value of $\epsilon \in (0, \frac12)$ is irrelevant (i.e., the result is independent of ε and equivalent to a linear regression with $c_0 \,\hat{=}\, y = 0$ and $c_1 \,\hat{=}\, y = 1$).

Logistic Classification: Example

• Logit transform and linear regression often yield suboptimal results: depending on the distribution of the data points relative to a(n optimal) separating hyperplane, the computed separating hyperplane can be shifted and/or rotated.
• This can lead to (unnecessary) misclassifications!

30. Logistic Classification: Example

• Black "contour line": logit transform and linear regression.
• Green "contour line": gradient descent on the error function in the original space.
(For simplicity and clarity only the "contour lines" for y = 0.5 (inflection lines) are shown.)

Logistic Classification: Maximum Likelihood Approach

A likelihood function describes the probability of observed data depending on the parameters $\vec{a}$ of the (conjectured) data generating process.

Here: logistic function to describe the class probabilities:
class $y = 1$ occurs with probability $p_1(\vec{x}) = f(\vec{a}^\top\vec{x}^*)$,
class $y = 0$ occurs with probability $p_0(\vec{x}) = 1 - f(\vec{a}^\top\vec{x}^*)$,
with $f(z) = \frac{1}{1+e^{-z}}$, $\vec{x}^* = (1, x_1, \ldots, x_m)^\top$ and $\vec{a} = (a_0, a_1, \ldots, a_m)^\top$.

Likelihood function for the data set $D = \{(\vec{x}_1, y_1), \ldots, (\vec{x}_n, y_n)\}$ with $y_i \in \{0,1\}$:
$L(\vec{a}) = \prod_{i=1}^n p_1(\vec{x}_i)^{y_i} \cdot p_0(\vec{x}_i)^{1-y_i} = \prod_{i=1}^n f(\vec{a}^\top\vec{x}^*_i)^{y_i} \cdot \big(1 - f(\vec{a}^\top\vec{x}^*_i)\big)^{1-y_i}$

Maximum Likelihood Approach: Find the set of parameters $\vec{a}$ which renders the occurrence of the (observed) data most likely.

Logistic Classification: Maximum Likelihood Approach

Simplification by taking the logarithm: log-likelihood function
$\ln L(\vec{a}) = \sum_{i=1}^n \Big[ y_i \ln f(\vec{a}^\top\vec{x}^*_i) + (1-y_i)\ln\big(1 - f(\vec{a}^\top\vec{x}^*_i)\big) \Big]$
$= \sum_{i=1}^n \Big[ y_i \ln\frac{1}{1+e^{-\vec{a}^\top\vec{x}^*_i}} + (1-y_i)\ln\frac{e^{-\vec{a}^\top\vec{x}^*_i}}{1+e^{-\vec{a}^\top\vec{x}^*_i}} \Big]$
$= \sum_{i=1}^n \Big[ (y_i - 1)\,\vec{a}^\top\vec{x}^*_i - \ln\big(1 + e^{-\vec{a}^\top\vec{x}^*_i}\big) \Big]$

Necessary condition for a maximum:
The gradient of the objective function $\ln L(\vec{a})$ w.r.t. $\vec{a}$ vanishes: $\nabla_{\vec{a}} \ln L(\vec{a}) \stackrel{!}{=} \vec{0}$.

Problem: The resulting equation system is not linear.

Solution possibilities:
• Gradient ascent on the objective function $\ln L(\vec{a})$. (considered in the following)
• Root search on the gradient $\nabla_{\vec{a}} \ln L(\vec{a})$. (e.g. Newton–Raphson algorithm)

Logistic Classification: Gradient Ascent

Gradient of the log-likelihood function: (with $f(z) = \frac{1}{1+e^{-z}}$)
$\nabla_{\vec{a}} \ln L(\vec{a}) = \nabla_{\vec{a}} \sum_{i=1}^n \Big[ (y_i-1)\,\vec{a}^\top\vec{x}^*_i - \ln\big(1+e^{-\vec{a}^\top\vec{x}^*_i}\big) \Big]$
$= \sum_{i=1}^n \Big[ (y_i-1)\,\vec{x}^*_i + \frac{e^{-\vec{a}^\top\vec{x}^*_i}}{1+e^{-\vec{a}^\top\vec{x}^*_i}}\,\vec{x}^*_i \Big]$
$= \sum_{i=1}^n \Big[ (y_i-1) + 1 - f(\vec{a}^\top\vec{x}^*_i) \Big]\,\vec{x}^*_i$
$= \sum_{i=1}^n \big( y_i - f(\vec{a}^\top\vec{x}^*_i) \big)\,\vec{x}^*_i$

As a comparison: gradient of the sum of squared errors / deviations:
$\nabla_{\vec{a}} F(\vec{a}) = -2\sum_{i=1}^n \big( y_i - f(\vec{a}^\top\vec{x}^*_i) \big) \cdot f(\vec{a}^\top\vec{x}^*_i) \cdot \big(1 - f(\vec{a}^\top\vec{x}^*_i)\big) \cdot \vec{x}^*_i$
(additional factor: derivative of the logistic function)
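The log-likelihood itself is cheap to evaluate, which is useful for monitoring the gradient ascent described next. A hedged sketch (the clipping is added only to guard the logarithms; the names are illustrative):

    import numpy as np

    def log_likelihood(a, Xs, y, eps=1e-12):
        """ln L(a) = sum_i [ y_i ln f(a^T x*_i) + (1 - y_i) ln(1 - f(a^T x*_i)) ]."""
        p = 1.0 / (1.0 + np.exp(-(Xs @ a)))
        p = np.clip(p, eps, 1.0 - eps)    # avoid log(0) for extreme parameter values
        return float(np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))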

31. Logistic Classification: Gradient Ascent

Given: data set $D = \{(\vec{x}_1, y_1), \ldots, (\vec{x}_n, y_n)\}$ with $n$ data points, $y_i \in \{0,1\}$.
Simplification: Use $\vec{x}^*_i = (1, x_{i1}, \ldots, x_{im})^\top$ and $\vec{a} = (a_0, a_1, \ldots, a_m)^\top$.

Gradient ascent on the objective function $\ln L(\vec{a})$:
• Choose as the initial point $\vec{a}_0$ the result of a logit transform and a linear regression (or merely a linear regression).
• Update of the parameters $\vec{a}$:
$\vec{a}_{t+1} = \vec{a}_t + \eta\,\nabla_{\vec{a}}\ln L(\vec{a})\big|_{\vec{a}_t} = \vec{a}_t + \eta\sum_{i=1}^n \big( y_i - f(\vec{a}_t^\top\vec{x}^*_i) \big)\,\vec{x}^*_i$,
where η is a step width parameter to be chosen by a user (e.g. η = 0.01).
(Comparison with gradient descent: the factor $f(\vec{a}_t^\top\vec{x}^*_i)\,\big(1 - f(\vec{a}_t^\top\vec{x}^*_i)\big)$ is missing.)
• Repeat the update step until convergence, e.g. until $\|\vec{a}_{t+1} - \vec{a}_t\| < \tau$ with a chosen threshold τ (e.g. $\tau = 10^{-6}$).

Logistic Classification: Example

• Black "contour line": logit transform and linear regression.
• Green "contour line": gradient descent on the error function in the original space.
• Magenta "contour line": gradient ascent on the log-likelihood function.
(For simplicity and clarity only the "contour lines" for y = 0.5 (inflection lines) are shown.)

Logistic Classification: No Gap Between Classes

• If there is no (clear) gap between the classes, a logit transform and subsequent linear regression yield (unnecessary) misclassifications even more often.
• In such a case the alternative methods are clearly preferable!

Logistic Classification: No Gap Between Classes

• Black "contour line": logit transform and linear regression.
• Green "contour line": gradient descent on the error function in the original space.
• Magenta "contour line": gradient ascent on the log-likelihood function.
(For simplicity and clarity only the "contour lines" for y = 0.5 (inflection lines) are shown.)
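Putting the update and the termination criterion together gives a complete training loop. A minimal sketch (using a zero initial point instead of the linear-regression start; all names and defaults are illustrative):

    import numpy as np

    def fit_logistic_ml(Xs, y, eta=0.01, tau=1e-6, max_steps=100000):
        a = np.zeros(Xs.shape[1])   # simpler alternative to the logit/linear-regression start
        for _ in range(max_steps):
            p = 1.0 / (1.0 + np.exp(-(Xs @ a)))
            a_new = a + eta * (Xs.T @ (y - p))   # note: no factor f * (1 - f) here
            if np.linalg.norm(a_new - a) < tau:  # convergence test ||a_{t+1} - a_t|| < tau
                return a_new
            a = a_new
        return a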

32. Logistic Classification: Overlapping Classes

• Even more problematic is the situation if the classes overlap (i.e., there is no perfect separating line/hyperplane).
• In such a case even the other methods cannot avoid misclassifications. (There is no way to be better than the pure or Bayes error.)

Logistic Classification: Overlapping Classes

• Black "contour line": logit transform and linear regression.
• Green "contour line": gradient descent on the error function in the original space.
• Magenta "contour line": gradient ascent on the log-likelihood function.
(For simplicity and clarity only the "contour lines" for y = 0.5 (inflection lines) are shown.)

Training Multi-layer Perceptrons

Training Multi-layer Perceptrons: Gradient Descent

• Problem of logistic regression: Works only for two-layer perceptrons.
• More general approach: gradient descent.
• Necessary condition: differentiable activation and output functions.

The gradient (symbol $\nabla$, "nabla") is a differential operator that turns a scalar function into a vector field.

[Figure: gradient of a real-valued function $z = f(x,y)$ at a point $(x_0, y_0)$.]

It is $\nabla z|_{(x_0,y_0)} = \big( \frac{\partial z}{\partial x}\big|_{x_0}, \frac{\partial z}{\partial y}\big|_{y_0} \big)$.

The gradient at a point shows the direction of the steepest ascent of the function at this point; its length describes the steepness of the ascent.

33. Gradient Descent: Formal Approach

General Idea: Approach the minimum of the error function in small steps.

Error function:
$e = \sum_{l \in L_{\text{fixed}}} e^{(l)} = \sum_{v \in U_{\text{out}}} e_v = \sum_{l \in L_{\text{fixed}}} \sum_{v \in U_{\text{out}}} e_v^{(l)}$

Form the gradient to determine the direction of the step (here and in the following: extended weight vector $\vec{w}_u = (-\theta_u, w_{up_1}, \ldots, w_{up_n})$):
$\nabla_{\vec{w}_u} e = \frac{\partial e}{\partial \vec{w}_u} = \Big( -\frac{\partial e}{\partial \theta_u}, \frac{\partial e}{\partial w_{up_1}}, \ldots, \frac{\partial e}{\partial w_{up_n}} \Big)$

Exploit the sum over the training patterns:
$\nabla_{\vec{w}_u} e = \frac{\partial e}{\partial \vec{w}_u} = \frac{\partial}{\partial \vec{w}_u} \sum_{l \in L_{\text{fixed}}} e^{(l)} = \sum_{l \in L_{\text{fixed}}} \frac{\partial e^{(l)}}{\partial \vec{w}_u}$

Gradient Descent: Formal Approach

The single pattern error depends on the weights only through the network input:
$\nabla_{\vec{w}_u} e^{(l)} = \frac{\partial e^{(l)}}{\partial \vec{w}_u} = \frac{\partial e^{(l)}}{\partial \text{net}_u^{(l)}} \, \frac{\partial \text{net}_u^{(l)}}{\partial \vec{w}_u}.$

Since $\text{net}_u^{(l)} = \vec{w}_u^\top \vec{\text{in}}_u^{(l)}$ (note: extended input vector $\vec{\text{in}}_u^{(l)} = (1, \text{in}_{p_1 u}^{(l)}, \ldots, \text{in}_{p_n u}^{(l)})$ and weight vector $\vec{w}_u = (-\theta, w_{p_1 u}, \ldots, w_{p_n u})$), we have for the second factor
$\frac{\partial \text{net}_u^{(l)}}{\partial \vec{w}_u} = \vec{\text{in}}_u^{(l)}.$

For the first factor we consider the error $e^{(l)}$ for the training pattern $l = (\vec{\imath}^{(l)}, \vec{o}^{(l)})$:
$e^{(l)} = \sum_{v \in U_{\text{out}}} e_v^{(l)} = \sum_{v \in U_{\text{out}}} \big( o_v^{(l)} - \text{out}_v^{(l)} \big)^2,$
that is, the sum of the errors over all output neurons.

Gradient Descent: Formal Approach

Therefore we have
$\frac{\partial e^{(l)}}{\partial \text{net}_u^{(l)}} = \frac{\partial}{\partial \text{net}_u^{(l)}} \sum_{v \in U_{\text{out}}} \big( o_v^{(l)} - \text{out}_v^{(l)} \big)^2 = \sum_{v \in U_{\text{out}}} \frac{\partial \big( o_v^{(l)} - \text{out}_v^{(l)} \big)^2}{\partial \text{net}_u^{(l)}}.$

Distinguish two cases:
• The neuron u is an output neuron.
• The neuron u is a hidden neuron.

Gradient Descent: Formal Approach

In the first case, since only the actual output $\text{out}_u^{(l)}$ of the output neuron u itself depends on its network input $\text{net}_u^{(l)}$, it is
$\frac{\partial e^{(l)}}{\partial \text{net}_u^{(l)}} = -2\big( o_u^{(l)} - \text{out}_u^{(l)} \big)\,\frac{\partial \text{out}_u^{(l)}}{\partial \text{net}_u^{(l)}},$
which also introduces the abbreviation
$\forall u \in U_{\text{out}}: \quad \delta_u^{(l)} = \big( o_u^{(l)} - \text{out}_u^{(l)} \big)\,\frac{\partial \text{out}_u^{(l)}}{\partial \text{net}_u^{(l)}}$
for the important term appearing here. Therefore we have for the gradient
$\forall u \in U_{\text{out}}: \quad \nabla_{\vec{w}_u} e^{(l)} = \frac{\partial e^{(l)}}{\partial \vec{w}_u} = -2\big( o_u^{(l)} - \text{out}_u^{(l)} \big)\,\frac{\partial \text{out}_u^{(l)}}{\partial \text{net}_u^{(l)}}\,\vec{\text{in}}_u^{(l)},$
and thus for the weight change
$\forall u \in U_{\text{out}}: \quad \Delta\vec{w}_u^{(l)} = -\frac{\eta}{2}\,\nabla_{\vec{w}_u} e^{(l)} = \eta\,\delta_u^{(l)}\,\vec{\text{in}}_u^{(l)}.$

34. Gradient Descent: Formal Approach

Exact formulae depend on the choice of the activation and the output function, since it is
$\text{out}_u^{(l)} = f_{\text{out}}\big(\text{act}_u^{(l)}\big) = f_{\text{out}}\big(f_{\text{act}}\big(\text{net}_u^{(l)}\big)\big).$

Consider the special case in which
• the output function is the identity, and
• the activation function is logistic, that is, $f_{\text{act}}(x) = \frac{1}{1+e^{-x}}$.

The first assumption yields
$\frac{\partial \text{out}_u^{(l)}}{\partial \text{net}_u^{(l)}} = \frac{\partial \text{act}_u^{(l)}}{\partial \text{net}_u^{(l)}} = f'_{\text{act}}\big(\text{net}_u^{(l)}\big).$

Gradient Descent: Formal Approach

For a logistic activation function we have
$f'_{\text{act}}(x) = \frac{d}{dx}(1+e^{-x})^{-1} = -(1+e^{-x})^{-2}\big(-e^{-x}\big) = \frac{1}{1+e^{-x}}\Big(1-\frac{1}{1+e^{-x}}\Big) = f_{\text{act}}(x)\,\big(1-f_{\text{act}}(x)\big),$
and therefore
$f'_{\text{act}}\big(\text{net}_u^{(l)}\big) = f_{\text{act}}\big(\text{net}_u^{(l)}\big)\,\Big(1 - f_{\text{act}}\big(\text{net}_u^{(l)}\big)\Big) = \text{out}_u^{(l)}\big(1 - \text{out}_u^{(l)}\big).$

The resulting weight change is therefore
$\Delta\vec{w}_u^{(l)} = \eta\,\big(o_u^{(l)} - \text{out}_u^{(l)}\big)\,\text{out}_u^{(l)}\big(1-\text{out}_u^{(l)}\big)\,\vec{\text{in}}_u^{(l)},$
which makes the computations very simple. (A short code sketch of this rule follows at the end of this part.)

Gradient Descent: Formal Approach

[Figure: left, the logistic activation function $f_{\text{act}}(\text{net}_u^{(l)})$; right, its derivative $f'_{\text{act}}(\text{net}_u^{(l)}) = f_{\text{act}}(\text{net}_u^{(l)})\,(1-f_{\text{act}}(\text{net}_u^{(l)}))$.]

• If a logistic activation function is used (shown on the left), the weight changes are proportional to $\lambda_u^{(l)} = \text{out}_u^{(l)}\big(1 - \text{out}_u^{(l)}\big)$ (shown on the right; see the preceding slide).
• Weight changes are largest, and thus the training speed highest, in the vicinity of $\text{net}_u^{(l)} = 0$. Far away from $\text{net}_u^{(l)} = 0$, the gradient becomes (very) small ("saturation regions") and thus training (very) slow.

Error Backpropagation

Consider now: The neuron u is a hidden neuron, that is, $u \in U_k$, $0 < k < r-1$.

The output $\text{out}_v^{(l)}$ of an output neuron v depends on the network input $\text{net}_u^{(l)}$ only indirectly through the successor neurons
$\text{succ}(u) = \{ s \in U \mid (u,s) \in C \} = \{s_1, \ldots, s_m\} \subseteq U_{k+1},$
namely through their network inputs $\text{net}_s^{(l)}$.

We apply the chain rule to obtain
$\delta_u^{(l)} = \sum_{v \in U_{\text{out}}} \sum_{s \in \text{succ}(u)} \big(o_v^{(l)} - \text{out}_v^{(l)}\big)\,\frac{\partial \text{out}_v^{(l)}}{\partial \text{net}_s^{(l)}}\,\frac{\partial \text{net}_s^{(l)}}{\partial \text{net}_u^{(l)}}.$

Exchanging the sums yields
$\delta_u^{(l)} = \sum_{s \in \text{succ}(u)} \Big( \sum_{v \in U_{\text{out}}} \big(o_v^{(l)} - \text{out}_v^{(l)}\big)\,\frac{\partial \text{out}_v^{(l)}}{\partial \text{net}_s^{(l)}} \Big)\,\frac{\partial \text{net}_s^{(l)}}{\partial \text{net}_u^{(l)}} = \sum_{s \in \text{succ}(u)} \delta_s^{(l)}\,\frac{\partial \text{net}_s^{(l)}}{\partial \text{net}_u^{(l)}}.$
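For an output neuron with logistic activation and identity output the resulting delta rule fits in a few lines. A sketch for a single neuron and a single training pattern (the names are illustrative):

    import numpy as np

    def output_neuron_update(w, in_vec, o, eta=0.2):
        """Return (delta, new weights); in_vec is the extended input (leading 1 for the bias)."""
        out = 1.0 / (1.0 + np.exp(-(w @ in_vec)))   # net input and logistic activation
        delta = (o - out) * out * (1.0 - out)       # delta = (o - out) * out * (1 - out)
        return delta, w + eta * delta * in_vec      # Delta w = eta * delta * in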

35. Error Backpropagation

Consider the network input
$\text{net}_s^{(l)} = \vec{w}_s^\top \vec{\text{in}}_s^{(l)} = \Big( \sum_{p \in \text{pred}(s)} w_{sp}\,\text{out}_p^{(l)} \Big) - \theta_s,$
where one element of $\vec{\text{in}}_s^{(l)}$ is the output $\text{out}_u^{(l)}$ of the neuron u. Therefore it is
$\frac{\partial \text{net}_s^{(l)}}{\partial \text{net}_u^{(l)}} = \Big( \sum_{p \in \text{pred}(s)} w_{sp}\,\frac{\partial \text{out}_p^{(l)}}{\partial \text{net}_u^{(l)}} \Big) - \frac{\partial \theta_s}{\partial \text{net}_u^{(l)}} = w_{su}\,\frac{\partial \text{out}_u^{(l)}}{\partial \text{net}_u^{(l)}}.$

The result is the recursive equation
$\delta_u^{(l)} = \Big( \sum_{s \in \text{succ}(u)} \delta_s^{(l)}\,w_{su} \Big)\,\frac{\partial \text{out}_u^{(l)}}{\partial \text{net}_u^{(l)}}.$

This recursive equation defines error backpropagation, because it propagates the error back by one layer.

Error Backpropagation

The resulting formula for the weight change (for training pattern l) is
$\Delta\vec{w}_u^{(l)} = -\frac{\eta}{2}\,\nabla_{\vec{w}_u}e^{(l)} = \eta\,\delta_u^{(l)}\,\vec{\text{in}}_u^{(l)} = \eta\,\Big( \sum_{s \in \text{succ}(u)} \delta_s^{(l)}\,w_{su} \Big)\,\frac{\partial \text{out}_u^{(l)}}{\partial \text{net}_u^{(l)}}\,\vec{\text{in}}_u^{(l)}.$

Consider again the special case in which the output function is the identity and the activation function is logistic. The resulting formula for the weight change is then
$\Delta\vec{w}_u^{(l)} = \eta\,\Big( \sum_{s \in \text{succ}(u)} \delta_s^{(l)}\,w_{su} \Big)\,\text{out}_u^{(l)}\big(1-\text{out}_u^{(l)}\big)\,\vec{\text{in}}_u^{(l)}.$

Error Backpropagation: Cookbook Recipe

[Figure: multi-layer perceptron with inputs $x_1, \ldots, x_n$ and outputs $y_1, \ldots, y_m$; logistic activation function, implicit bias value.]

Forward propagation:
$\forall u \in U_{\text{in}}: \quad \text{out}_u^{(l)} = \text{ext}_u^{(l)}$
$\forall u \in U_{\text{hidden}} \cup U_{\text{out}}: \quad \text{out}_u^{(l)} = \Big( 1 + \exp\Big( -\sum_{p \in \text{pred}(u)} w_{up}\,\text{out}_p^{(l)} \Big) \Big)^{-1}$

Backward propagation:
$\forall u \in U_{\text{out}}: \quad \delta_u^{(l)} = \big( o_u^{(l)} - \text{out}_u^{(l)} \big)\,\lambda_u^{(l)}$
$\forall u \in U_{\text{hidden}}: \quad \delta_u^{(l)} = \Big( \sum_{s \in \text{succ}(u)} \delta_s^{(l)}\,w_{su} \Big)\,\lambda_u^{(l)}$

Activation derivative (error factor): $\lambda_u^{(l)} = \text{out}_u^{(l)}\big(1-\text{out}_u^{(l)}\big)$
Weight change: $\Delta w_{up}^{(l)} = \eta\,\delta_u^{(l)}\,\text{out}_p^{(l)}$

(A compact transcription of this recipe into code follows at the end of this part.)

Stochastic Gradient Descent

• True gradient descent requires batch training, that is, computing
$\Delta\vec{w}_u = -\frac{\eta}{2}\,\nabla_{\vec{w}_u}e = \sum_{l \in L} -\frac{\eta}{2}\,\nabla_{\vec{w}_u}e^{(l)} = \sum_{l \in L} \Delta\vec{w}_u^{(l)}.$
• Since this can be slow, because the parameters are updated only once per epoch, online training (update after each training pattern) is often employed.
• Online training is a special case of stochastic gradient descent. Generally, stochastic gradient descent means:
  ◦ The function to optimize is composed of several subfunctions, for example, one subfunction per training pattern (as is the case here).
  ◦ A (partial) gradient is computed from a (random) subsample of these subfunctions and directly used to update the parameters. Such (random) subsamples are also referred to as mini-batches. (Online training works with mini-batches of size 1.)
  ◦ With randomly chosen subsamples, the gradient descent is stochastic.
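The cookbook recipe can be transcribed compactly for one hidden layer. This sketch performs one online step of error backpropagation; the weight matrices carry the bias as an extra first column matching the leading 1 of the extended input vectors, and the biimplication data at the end is only an illustration:

    import numpy as np

    def logistic(z):
        return 1.0 / (1.0 + np.exp(-z))

    def train_pattern(W1, W2, x, o, eta=0.5):
        in1 = np.concatenate(([1.0], x))               # extended external input
        h = logistic(W1 @ in1)                         # forward: hidden outputs
        in2 = np.concatenate(([1.0], h))
        out = logistic(W2 @ in2)                       # forward: network outputs
        d_out = (o - out) * out * (1.0 - out)          # backward: output deltas
        d_hid = (W2[:, 1:].T @ d_out) * h * (1.0 - h)  # backward: hidden deltas
        W2 += eta * np.outer(d_out, in2)               # weight change: eta * delta * out_p
        W1 += eta * np.outer(d_hid, in1)
        return np.sum((o - out) ** 2)                  # pattern error

    # e.g. the biimplication x1 <-> x2 (not linearly separable, needs the hidden layer):
    rng = np.random.default_rng(1)
    W1, W2 = rng.normal(0, 1, (2, 3)), rng.normal(0, 1, (1, 3))
    data = [([0, 0], [1]), ([0, 1], [0]), ([1, 0], [0]), ([1, 1], [1])]
    for epoch in range(10000):                         # online training
        for x, o in data:
            train_pattern(W1, W2, np.array(x, float), np.array(o, float))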

36. Gradient Descent: Examples

Gradient descent training for the negation ¬x: a single neuron with weight w and threshold θ, trained on the patterns x = 0 → y = 1 and x = 1 → y = 0.

Online training / batch training (parameters reported every 20 epochs):

     epoch   θ (online)  w (online)  error    θ (batch)  w (batch)  error
       0      3.00        3.50       1.307     3.00       3.50      1.295
      20      3.77        2.19       0.986     3.76       2.20      0.985
      40      3.71        1.81       0.970     3.70       1.82      0.970
      60      3.50        1.53       0.958     3.48       1.53      0.957
      80      3.15        1.24       0.937     3.11       1.25      0.934
     100      2.57        0.88       0.890     2.49       0.88      0.880
     120      1.48        0.25       0.725     1.27       0.22      0.676
     140     -0.06       -0.98       0.331    -0.21      -1.04      0.292
     160     -0.80       -2.07       0.149    -0.86      -2.08      0.140
     180     -1.19       -2.74       0.087    -1.21      -2.74      0.084
     200     -1.44       -3.20       0.059    -1.45      -3.19      0.058
     220     -1.62       -3.54       0.044    -1.63      -3.53      0.044

[Figure: error surfaces over (θ, w): error for x = 0, error for x = 1, and the sum of errors.]

Note: the error for x = 0 and x = 1 is effectively the squared logistic activation function!

Here there is hardly any difference between online and batch training, because the training data set is so small (only two sample cases).

Gradient Descent: Examples

[Figure: visualization of the gradient descent trajectories for the negation ¬x in the (θ, w) plane and on the error surface, for online and batch training.]

• Training is obviously successful.
• The error cannot vanish completely due to the properties of the logistic function.
• Attention: Distinguish between w and θ as parameters of the error function, and their concrete values in each step of the training process.

(Stochastic) Gradient Descent: Variants

In the following the distinction of online and batch training is disregarded. In principle, all methods can be applied with full or stochastic gradient descent.

Weight update rule:
$w_{t+1} = w_t + \Delta w_t \quad (t = 1, 2, 3, \ldots)$

Standard Backpropagation:
$\Delta w_t = -\frac{\eta}{2}\,\nabla_w e|_{w_t} = -\frac{\eta}{2}\,(\nabla_w e)(w_t)$

Manhattan Training:
$\Delta w_t = -\eta\,\text{sgn}\big(\nabla_w e|_{w_t}\big)$
Fixed step width; only the sign of the gradient (direction) is evaluated.
Advantage: The learning speed does not depend on the size of the gradient, so there is no slowing down in flat regions or close to a minimum.
Disadvantage: Parameter values are constrained to a fixed grid.
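The variants that follow differ only in how $\Delta w_t$ is computed from the gradient. For reference, the two rules above as element-wise NumPy sketches (grad is the current gradient; the defaults are illustrative):

    import numpy as np

    def backprop_step(w, grad, eta=0.2):
        return w - 0.5 * eta * grad        # standard backpropagation

    def manhattan_step(w, grad, eta=0.01):
        return w - eta * np.sign(grad)     # fixed step width, sign of the gradient only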

37. (Stochastic) Gradient Descent: Variants

Momentum Term: [Polyak 1964]
$\Delta w_t = -\frac{\eta}{2}\,\nabla_w e|_{w_t} + \alpha\,\Delta w_{t-1}$
Part of the previous change is added, which may lead to accelerated training ($\alpha \in [0.5, 0.999]$).

Nesterov's Accelerated Gradient (NAG): [Nesterov 1983]
Analogous to the introduction of a momentum term, but: apply the momentum step to the parameters also before computing the gradient.
$\Delta w_t = -\frac{\eta}{2}\,\nabla_w e|_{w_t + \alpha\Delta w_{t-1}} + \alpha\,\Delta w_{t-1}$
Idea: The momentum term does not depend on the current gradient, so one can get a better adaptation direction by applying the momentum term also before computing the gradient, that is, by computing the gradient at $w_t + \alpha\Delta w_{t-1}$.

(Stochastic) Gradient Descent: Variants

Momentum Term: [Polyak 1964]
$\Delta w_t = -\frac{\eta}{2}\,m_t$ with $m_0 = 0$; $m_t = \alpha m_{t-1} + \nabla_w e|_{w_t}$; $\alpha \in [0.5, 0.999]$
Alternative formulation of the momentum term approach, using an auxiliary variable to accumulate the gradients (instead of the weight changes).

Nesterov's Accelerated Gradient (NAG): [Nesterov 1983]
$\Delta w_t = -\frac{\eta}{2}\,\big( \alpha m_t + \nabla_w e|_{w_t} \big)$ with $m_0 = 0$; $m_t = \alpha m_{t-1} + \nabla_w e|_{w_t}$
Alternative formulation of Nesterov's Accelerated Gradient. (Remark: Not exactly equivalent to the formulation above, but leads to analogous training behavior and is easier to implement.)

(Stochastic) Gradient Descent: Variants

Self-adaptive Error Backpropagation (SuperSAB): [Tollenaere 1990]
$\eta_t^{(w)} = \begin{cases} c^- \cdot \eta_{t-1}^{(w)} & \text{if } \nabla_w e|_{w_t} \cdot \nabla_w e|_{w_{t-1}} < 0, \\ c^+ \cdot \eta_{t-1}^{(w)} & \text{if } \nabla_w e|_{w_t} \cdot \nabla_w e|_{w_{t-1}} > 0 \ \wedge\ \nabla_w e|_{w_{t-1}} \cdot \nabla_w e|_{w_{t-2}} \ge 0, \\ \eta_{t-1}^{(w)} & \text{otherwise.} \end{cases}$

Resilient Error Backpropagation (RProp): [Riedmiller and Braun 1992]
$\Delta w_t = \begin{cases} c^- \cdot \Delta w_{t-1} & \text{if } \nabla_w e|_{w_t} \cdot \nabla_w e|_{w_{t-1}} < 0, \\ c^+ \cdot \Delta w_{t-1} & \text{if } \nabla_w e|_{w_t} \cdot \nabla_w e|_{w_{t-1}} > 0 \ \wedge\ \nabla_w e|_{w_{t-1}} \cdot \nabla_w e|_{w_{t-2}} \ge 0, \\ \Delta w_{t-1} & \text{otherwise.} \end{cases}$

Typical values: $c^- \in [0.5, 0.7]$ and $c^+ \in [1.05, 1.2]$.
Recommended: Apply only with (full) batch training; online training can be unstable.

(Stochastic) Gradient Descent: Variants

Quick Propagation (QuickProp): [Fahlman 1988]
Approximate the error function (locally) with a parabola. Compute the apex of the parabola and "jump" directly to it. The weight update rule can be derived from similar triangles:
$\Delta w_t = \frac{\nabla_w e|_{w_t}}{\nabla_w e|_{w_{t-1}} - \nabla_w e|_{w_t}} \cdot \Delta w_{t-1}.$
Recommended: Apply only with (full) batch training; online training can be unstable.

(Stochastic) Gradient Descent: Variants

AdaGrad (adaptive subgradient descent): [Duchi et al. 2011]
$\Delta w_t = -\frac{\eta}{\sqrt{v_t + \epsilon}}\,\nabla_w e|_{w_t}$ with $v_0 = 0$; $v_t = v_{t-1} + (\nabla_w e|_{w_t})^2$; $\epsilon = 10^{-6}$; $\eta = 0.01$
Idea: Normalize the gradient by the raw variance of all preceding gradients. Slow down learning along dimensions that have already changed significantly, speed up learning along dimensions that have changed only slightly.
Advantage: "Stabilizes" the network's representation of common features.
Disadvantage: Learning becomes slower over time and finally stops completely.

RMSProp (root mean squared gradients): [Tieleman and Hinton 2012]
$\Delta w_t = -\frac{\eta}{\sqrt{v_t + \epsilon}}\,\nabla_w e|_{w_t}$ with $v_0 = 0$; $v_t = \beta v_{t-1} + (1-\beta)(\nabla_w e|_{w_t})^2$; $\beta = 0.999$; $\epsilon = 10^{-6}$
Idea: Normalize the gradient by the raw variance of some previous gradients. Use "exponential forgetting" to avoid having to store previous gradients.

(Stochastic) Gradient Descent: Variants

AdaDelta (adaptive subgradient descent over windows): [Zeiler 2012]
$\Delta w_t = -\frac{\sqrt{u_t + \epsilon}}{\sqrt{v_t + \epsilon}}\,\nabla_w e|_{w_t}$ with
$u_0 = 0$; $u_t = \alpha u_{t-1} + (1-\alpha)(\Delta w_{t-1})^2$;
$v_0 = 0$; $v_t = \beta v_{t-1} + (1-\beta)(\nabla_w e|_{w_t})^2$;
$\alpha = \beta = 0.95$; $\epsilon = 10^{-6}$
The first step is effectively identical to Manhattan training; the following steps are analogous to a "normalized" momentum term approach.
Idea: In gradient descent the parameter and its change do not have the same "units"; this is also the case for AdaGrad and RMSProp. In the Newton–Raphson method, however, due to the use of second derivative information (Hessian matrix), the units of a parameter and its change match. With some simplifying assumptions, the above update rule can be derived.
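As a flavor of how these rules look in code, here are the momentum term and RMSProp as element-wise sketches (state is carried between calls; the defaults follow the slide values and are otherwise illustrative):

    import numpy as np

    def momentum_step(w, grad, delta_prev, eta=0.2, alpha=0.9):
        """Delta w_t = -(eta/2) * grad + alpha * Delta w_{t-1}."""
        delta = -0.5 * eta * grad + alpha * delta_prev
        return w + delta, delta

    def rmsprop_step(w, grad, v, eta=0.001, beta=0.999, eps=1e-6):
        """Normalize the gradient by a moving raw variance of previous gradients."""
        v = beta * v + (1.0 - beta) * grad**2
        return w - eta * grad / np.sqrt(v + eps), v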

38. (Stochastic) Gradient Descent: Variants

Adam (adaptive moment estimation): [Kingma and Ba 2015]
$\Delta w_t = -\frac{\eta}{\sqrt{v_t + \epsilon}}\,m_t$ with
$m_0 = 0$; $m_t = \alpha m_{t-1} + (1-\alpha)(\nabla_w e|_{w_t})$; $\alpha = 0.9$;
$v_0 = 0$; $v_t = \beta v_{t-1} + (1-\beta)(\nabla_w e|_{w_t})^2$; $\beta = 0.999$

This method is called adaptive moment estimation, because $m_t$ is an estimate of the (recent) first moment (mean) of the gradient, and $v_t$ an estimate of the (recent) raw second moment (raw variance) of the gradient.

Adam with Bias Correction: [Kingma and Ba 2015]
$\Delta w_t = -\frac{\eta}{\sqrt{\hat{v}_t + \epsilon}}\,\hat{m}_t$ with $\hat{m}_t = m_t / (1-\alpha^t)$ and $\hat{v}_t = v_t / (1-\beta^t)$ ($m_t$ and $v_t$ as above)

Idea: In the first steps of the update procedure, since $m_0 = v_0 = 0$, several preceding gradients are implicitly set to zero. As a consequence, the estimates are biased towards zero (they are too small). The above modification of $m_t$ and $v_t$ corrects for this initial bias.

(Stochastic) Gradient Descent: Variants

NAdam (Adam with Nesterov acceleration): [Dozat 2016]
$\Delta w_t = -\frac{\eta}{\sqrt{v_t + \epsilon}}\,\big( \alpha m_t + (1-\alpha)(\nabla_w e|_{w_t}) \big)$ with $m_t$ and $v_t$ as for Adam

Since the exponential decay of the gradient is similar to a momentum term, the idea presents itself to apply Nesterov's accelerated gradient (cf. the alternative formulation of Nesterov's accelerated gradient).

NAdam with Bias Correction: [Dozat 2016]
$\Delta w_t = -\frac{\eta}{\sqrt{\hat{v}_t + \epsilon}}\,\Big( \alpha\,\hat{m}_t + \frac{1-\alpha}{1-\alpha^t}\,(\nabla_w e|_{w_t}) \Big)$ with $\hat{m}_t$ and $\hat{v}_t$ as for Adam with bias correction

Idea: As for Adam, the bias correction compensates for the estimates being biased towards zero in the first steps of the update procedure.
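Adam with bias correction transcribes the formulas directly. A sketch (t starts at 1; the placement of epsilon follows the formulation above; the names are illustrative):

    import numpy as np

    def adam_step(w, grad, m, v, t, eta=0.001, alpha=0.9, beta=0.999, eps=1e-8):
        m = alpha * m + (1.0 - alpha) * grad     # first-moment estimate (mean)
        v = beta * v + (1.0 - beta) * grad**2    # raw second-moment estimate
        m_hat = m / (1.0 - alpha**t)             # bias corrections for m_0 = v_0 = 0
        v_hat = v / (1.0 - beta**t)
        return w - eta * m_hat / np.sqrt(v_hat + eps), m, v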

39. (Stochastic) Gradient Descent: Variants

AdaMax (Adam with $l_\infty$/maximum norm): [Kingma and Ba 2015]
$\Delta w_t = -\frac{\eta}{\sqrt{v_t + \epsilon}}\,m_t$ with
$m_0 = 0$; $m_t = \alpha m_{t-1} + (1-\alpha)(\nabla_w e|_{w_t})$; $\alpha = 0.9$;
$v_0 = 0$; $v_t = \max\big( \beta v_{t-1}, (\nabla_w e|_{w_t})^2 \big)$; $\beta = 0.999$

$m_t$ is an estimate of the (recent) first moment (mean) of the gradient (as in Adam). However, while standard Adam uses the $l_2$ norm to estimate the (raw) variance, AdaMax uses the $l_\infty$ or maximum norm to do so.

AdaMax with Bias Correction: [Kingma and Ba 2015]
$\Delta w_t = -\frac{\eta}{\sqrt{v_t + \epsilon}}\,\hat{m}_t$ with $\hat{m}_t = m_t / (1-\alpha^t)$ ($m_t$ as above); $v_t$ does not need a bias correction.

Idea: In the first steps of the update procedure, since $m_0 = 0$, several preceding gradients are implicitly set to zero. As a consequence, the estimate is biased towards zero (it is too small). The above modification of $m_t$ corrects for this initial bias.

Gradient Descent: Examples

Gradient descent training for the negation ¬x: a single neuron with weight w and threshold θ, trained on the patterns x = 0 → y = 1 and x = 1 → y = 0.

[Figure: error surfaces over (θ, w): error for x = 0, error for x = 1, and the sum of errors.]

Note: the error for x = 0 and x = 1 is effectively the squared logistic activation function!

40. Gradient Descent: Examples

Gradient descent training for the negation ¬x, without and with a momentum term:

     epoch    θ        w       error   |  epoch    θ        w       error
       0     3.00     3.50     1.295   |    0     3.00     3.50     1.295
      20     3.76     2.20     0.985   |   10     3.80     2.19     0.984
      40     3.70     1.82     0.970   |   20     3.75     1.84     0.971
      60     3.48     1.53     0.957   |   30     3.56     1.58     0.960
      80     3.11     1.25     0.934   |   40     3.26     1.33     0.943
     100     2.49     0.88     0.880   |   50     2.79     1.04     0.910
     120     1.27     0.22     0.676   |   60     1.99     0.60     0.814
     140    -0.21    -1.04     0.292   |   70     0.54    -0.25     0.497
     160    -0.86    -2.08     0.140   |   80    -0.53    -1.51     0.211
     180    -1.21    -2.74     0.084   |   90    -1.02    -2.36     0.113
     200    -1.45    -3.19     0.058   |  100    -1.31    -2.92     0.073
     220    -1.63    -3.53     0.044   |  110    -1.52    -3.31     0.053
                                       |  120    -1.67    -3.61     0.041
     without momentum term                with momentum term (α = 0.9)

Using a momentum term leads to a considerable acceleration (here ≈ factor 2).

Gradient Descent: Examples

[Figure: gradient descent trajectories in the (θ, w) plane and on the error surface, without and with a momentum term.]

• Dots show the position every 20 epochs (without momentum term) or every 10 epochs (with momentum term).
• Learning with a momentum term (α = 0.9) is about twice as fast.
• Attention: The momentum factor α must be strictly less than 1. If it is 1 or greater, the training process can "explode", that is, the parameter changes become larger and larger.

Gradient Descent: Examples

Example function: $f(x) = \frac{5}{6}x^4 - 7x^3 + \frac{115}{6}x^2 - 18x + 6$

Gradient descent with initial value 0.2, learning rate 0.001, and momentum term α = 0.9:

     i    x_i     f(x_i)   f'(x_i)   Δx_i
     0    0.200   3.112   -11.147    0.011
     1    0.211   2.990   -10.811    0.021
     2    0.232   2.771   -10.196    0.029
     3    0.261   2.488    -9.368    0.035
     4    0.296   2.173    -8.397    0.040
     5    0.337   1.856    -7.348    0.044
     6    0.380   1.559    -6.277    0.046
     7    0.426   1.298    -5.228    0.046
     8    0.472   1.079    -4.235    0.046
     9    0.518   0.907    -3.319    0.045
    10    0.562   0.777

A momentum term can compensate (to some degree) a learning rate that is too small.

Gradient Descent: Examples

Gradient descent with initial value 1.5, initial learning rate 0.25, and self-adapting learning rate ($c^+ = 1.2$, $c^- = 0.5$):

     i    x_i     f(x_i)   f'(x_i)   Δx_i
     0    1.500   2.719     3.500   -1.050
     1    0.450   1.178    -4.699    0.705
     2    1.155   1.476     3.396   -0.509
     3    0.645   0.629    -1.110    0.083
     4    0.729   0.587     0.072   -0.005
     5    0.723   0.587     0.001    0.000
     6    0.723   0.587     0.000    0.000
     7    0.723   0.587     0.000    0.000
     8    0.723   0.587     0.000    0.000
     9    0.723   0.587     0.000    0.000
    10    0.723   0.587

An adaptive learning rate can compensate for an improper initial value.

Other Extensions

Flat Spot Elimination: Increase the derivative of the logistic function by a constant:
$\lambda_u^{(l)} = \text{out}_u^{(l)}\big(1-\text{out}_u^{(l)}\big) + \zeta$
• Eliminates slow learning in the saturation region of the logistic function (ζ ≈ 0.1).
• Counteracts the decay of the error signals over the layers (to be discussed later).

Weight Decay:
$\Delta w_t = -\frac{\eta}{2}\,\nabla_w e|_{w_t} - \xi\,w_t$
• Helps to improve the robustness of the training results ($\xi \le 10^{-3}$).
• Can be derived from an extended error function penalizing large weights:
$e^* = e + \frac{\xi}{2} \sum_{u \in U_{\text{out}} \cup U_{\text{hidden}}} \Big( \theta_u^2 + \sum_{p \in \text{pred}(u)} w_{up}^2 \Big).$

Other Extensions

• Reminder: z-score normalization of the (external) input vectors (neuron u):
$z_u^{(l)} = \frac{x_u^{(l)} - \mu_u}{\sqrt{\sigma_u^2}}$ with $\mu_u = \frac{1}{|L|}\sum_{l \in L} x_u^{(l)}$ and $\sigma_u^2 = \frac{1}{|L|-1}\sum_{l \in L} \big( x_u^{(l)} - \mu_u \big)^2$.

• Idea of batch normalization: [Ioffe and Szegedy 2015]
  ◦ Apply such a normalization after each hidden layer, but compute $\mu_u^{(B_i)}$ and $\sigma_u^{2\,(B_i)}$ from each mini-batch $B_i$ (instead of all patterns L).
  ◦ In addition, apply a trainable scale $\gamma_u$ and shift $\beta_u$: $z'_u = \gamma_u \cdot z_u + \beta_u$.
  ◦ Use the aggregated mean $\mu_u$ and variance $\sigma_u^2$ for the final network:
$\mu_u = \frac{1}{m}\sum_{i=1}^m \mu_u^{(B_i)}$, $\sigma_u^2 = \frac{1}{m}\sum_{i=1}^m \sigma_u^{2\,(B_i)}$ (m: number of mini-batches).
  ◦ Transformation in the final network (with trained $\gamma_u$ and $\beta_u$):
$z_u = \frac{x_u - \mu_u}{\sqrt{\sigma_u^2 + \epsilon}}$, $z'_u = \gamma_u \cdot z_u + \beta_u$ (ε: to prevent division by zero).
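The batch-normalization transformation for one layer and one mini-batch is a few lines. A sketch (Z is a batch-size × neurons matrix of layer inputs; gamma and beta are the trainable per-neuron scale and shift; eps guards against division by zero):

    import numpy as np

    def batch_norm(Z, gamma, beta, eps=1e-5):
        mu = Z.mean(axis=0)                    # per-neuron mini-batch mean
        var = Z.var(axis=0)                    # per-neuron mini-batch variance
        Z_hat = (Z - mu) / np.sqrt(var + eps)  # z-score normalization per mini-batch
        return gamma * Z_hat + beta            # trainable scale and shift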

41. Objective Functions for Classification

• Up to now: sum of squared errors (to be minimized)
$e = \sum_{l \in L_{\text{fixed}}} e^{(l)} = \sum_{v \in U_{\text{out}}} e_v = \sum_{l \in L_{\text{fixed}}} \sum_{v \in U_{\text{out}}} \big( o_v^{(l)} - \text{out}_v^{(l)} \big)^2$
• This is an appropriate objective function for regression problems, that is, for problems with a numeric target.
• For classification a maximum likelihood approach is usually more appropriate.
• A maximum (log-)likelihood approach is immediately possible for two classes, for example, as for logistic regression (to be maximized):
$z = \sum_{l \in L_{\text{fixed}}} \ln\Big( \big(\text{out}^{(l)}\big)^{o^{(l)}} \cdot \big(1-\text{out}^{(l)}\big)^{1-o^{(l)}} \Big) = \sum_{l \in L_{\text{fixed}}} \Big[ o^{(l)} \ln \text{out}^{(l)} + \big(1-o^{(l)}\big) \ln\big(1-\text{out}^{(l)}\big) \Big]$
(Attention: For two classes only one output neuron is needed: 0 – one class, 1 – other class.)

Objective Functions for Classification

What to do if there are more than two classes?
• Standard approach: 1-in-n encoding (aka 1-hot encoding).
  ◦ As many output neurons as there are classes: one for each class (that is, output vectors are of the form $\vec{o}^{(l)} = (0, \ldots, 0, 1, 0, \ldots, 0)$).
  ◦ Each neuron distinguishes the class assigned to it from all other classes (all other classes are combined into a pseudo-class).
• With 1-in-n encoding one may use as an objective function simply the sum of the log-likelihood over all output neurons:
$z = \sum_{l \in L_{\text{fixed}}} \sum_{v \in U_{\text{out}}} \ln\Big( \big(\text{out}_v^{(l)}\big)^{o_v^{(l)}} \cdot \big(1-\text{out}_v^{(l)}\big)^{1-o_v^{(l)}} \Big) = \sum_{l \in L_{\text{fixed}}} \sum_{v \in U_{\text{out}}} \Big[ o_v^{(l)} \ln \text{out}_v^{(l)} + \big(1-o_v^{(l)}\big) \ln\big(1-\text{out}_v^{(l)}\big) \Big]$
• Disadvantage: Does not ensure a probability distribution over the classes.

Softmax Function as Output Function

• Objective: Output values that can be interpreted as class probabilities.
• First idea: Simply normalize the output values to sum 1:
$\text{out}_v^{(l)} = \frac{\text{act}_v^{(l)}}{\sum_{u \in U_{\text{out}}} \text{act}_u^{(l)}}$
• Problem: Activation values may be negative (e.g., if $f_{\text{act}}(x) = \tanh(x)$).
• Solution: Use the so-called softmax function (actually a soft argmax):
$\text{out}_v^{(l)} = \frac{\exp\big(\text{act}_v^{(l)}\big)}{\sum_{u \in U_{\text{out}}} \exp\big(\text{act}_u^{(l)}\big)}$
• Closely related to multinomial log-linear/logistic classification.

Multinomial Log-linear/Logistic Classification

• Given: classification problem on $\mathbb{R}^m$ with k classes $c_0, \ldots, c_{k-1}$.
• Select one class ($c_0$) as a reference or base class and assume
$\forall i \in \{1, \ldots, k-1\}: \quad \ln\frac{P(C = c_i \mid \vec{X} = \vec{x})}{P(C = c_0 \mid \vec{X} = \vec{x})} = a_{i0} + \vec{a}_i^\top\vec{x}$
or $P(C = c_i \mid \vec{X} = \vec{x}) = P(C = c_0 \mid \vec{X} = \vec{x}) \cdot e^{a_{i0} + \vec{a}_i^\top\vec{x}}$.
• Summing these equations and exploiting $\sum_{i=0}^{k-1} P(C = c_i \mid \vec{X} = \vec{x}) = 1$ yields
$1 - P(C = c_0 \mid \vec{X} = \vec{x}) = P(C = c_0 \mid \vec{X} = \vec{x}) \cdot \sum_{i=1}^{k-1} e^{a_{i0} + \vec{a}_i^\top\vec{x}}$
and therefore
$P(C = c_0 \mid \vec{X} = \vec{x}) = \frac{1}{1 + \sum_{j=1}^{k-1} e^{a_{j0} + \vec{a}_j^\top\vec{x}}}$
as well as
$P(C = c_i \mid \vec{X} = \vec{x}) = \frac{e^{a_{i0} + \vec{a}_i^\top\vec{x}}}{1 + \sum_{j=1}^{k-1} e^{a_{j0} + \vec{a}_j^\top\vec{x}}}$ ($c_0$: reference class).
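The softmax output function in code, with the usual shift for numerical stability (subtracting the maximum changes neither the ratios of the exponentials nor the result):

    import numpy as np

    def softmax(act):
        """out_v = exp(act_v) / sum_u exp(act_u) over the output-layer activations."""
        e = np.exp(act - np.max(act))   # stability shift; cancels in the quotient
        return e / e.sum()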

42. Multinomial Log-linear/Logistic Classification

• This leads to the softmax function if we assume that
  ◦ the reference class $c_0$ is represented by a function $f_0(\vec{x}) = 0$ (since $e^0 = 1$) and
  ◦ all other classes by linear functions $f_i(\vec{x}) = a_{i0} + \vec{a}_i^\top\vec{x}$, $i = 1, \ldots, k-1$.
• This would be the case with a 1-in-(n−1) encoding: Represent the classes by k−1 values, with the reference class encoded by all zeros and each of the remaining k−1 classes encoded by setting a corresponding value to 1.
• Special case k = 2 (standard logistic classification):
$P(C = c_0 \mid \vec{X} = \vec{x}) = \frac{1}{1 + e^{a_{10} + \vec{a}_1^\top\vec{x}}} = \frac{1}{1 + e^{-(-a_{10} - \vec{a}_1^\top\vec{x})}}$
$P(C = c_1 \mid \vec{X} = \vec{x}) = \frac{e^{a_{10} + \vec{a}_1^\top\vec{x}}}{1 + e^{a_{10} + \vec{a}_1^\top\vec{x}}} = \frac{1}{1 + e^{-(a_{10} + \vec{a}_1^\top\vec{x})}}$
• The softmax function abandons the special role of one class as a reference and treats all classes equally (heuristics; "overdetermined" classification).

Objective Functions for Classification

What to do if there are more than two classes?
• When using a softmax function to determine class probabilities, we can (should) use cross entropy to define an objective function.
• General idea: Measure the difference between probability distributions.

Let $p_1$ and $p_2$ be two strictly positive probability distributions on the same space $\mathcal{E}$ of events. Then
$I_{\text{KLdiv}}(p_1, p_2, \mathcal{E}) = \sum_{E \in \mathcal{E}} p_1(E)\,\log_2 \frac{p_1(E)}{p_2(E)}$
is called the Kullback-Leibler information divergence of $p_1$ and $p_2$.
• The Kullback-Leibler information divergence is non-negative.
• It is zero if and only if $p_1 \equiv p_2$.

Objective Functions for Classification

• The Shannon entropy of a probability distribution p on a space $\mathcal{E}$ of events is
$H(p, \mathcal{E}) = -\sum_{E \in \mathcal{E}} p(E)\,\log_2 p(E).$
• Let $p_1$ and $p_2$ be two strictly positive probability distributions on the same space $\mathcal{E}$ of events. Then
$H_{\text{cross}}(p_1, p_2, \mathcal{E}) = H(p_1, \mathcal{E}) + I_{\text{KLdiv}}(p_1, p_2, \mathcal{E}) = -\sum_{E \in \mathcal{E}} p_1(E)\,\log_2 p_2(E)$
is called the cross entropy of $p_1$ and $p_2$.
• Applied to multi-layer perceptrons (output layer distribution):
$z = \sum_{l \in L_{\text{fixed}}} H_{\text{cross}}\big( \vec{o}^{(l)}, \vec{\text{out}}^{(l)}, U_{\text{out}} \big) = -\sum_{l \in L_{\text{fixed}}} \sum_{v \in U_{\text{out}}} o_v^{(l)}\,\log_2 \text{out}_v^{(l)}$
Attention: this requires the desired and computed outputs to be normalized to sum 1; it is to be minimized and is sometimes divided by $|L_{\text{fixed}}|$ (average cross entropy).

Parameter Initialization

• Bias values are usually initialized to zero, i.e., $\theta_u = 0$.
• Do not initialize all weights to zero, because this makes training impossible. (The gradient will be the same for all weights.)
• Uniform Weight Initialization: Draw weights from a uniform distribution on $[-\epsilon, \epsilon]$, with e.g. $\epsilon \in \{1, \frac{1}{\sqrt{p}}, \frac{1}{p}\}$, where p is the number of predecessor neurons.
• Xavier Weight Initialization: [Glorot and Bengio 2010]
Draw weights from a uniform distribution on $[-\epsilon, \epsilon]$, with $\epsilon = \sqrt{6 / (p+s)}$, where s is the number of neurons in the respective layer; or draw weights from a normal distribution with $\mu = 0$ and variance $\frac{1}{p}$.
• He et al./Kaiming Weight Initialization: [He et al. 2015]
Draw weights from a normal distribution with $\mu = 0$ and variance $\frac{2}{p}$.
• Basic idea of Xavier and He et al./Kaiming initialization: Try to equalize the variance of the (initial) neuron activations across layers.
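The initialization rules can be sketched directly; p is the number of predecessor neurons (fan-in) and s the number of neurons in the layer, as above (the matrix layout with one weight row per neuron is an assumption of this sketch):

    import numpy as np
    rng = np.random.default_rng()

    def xavier_uniform(p, s):
        eps = np.sqrt(6.0 / (p + s))                 # epsilon = sqrt(6 / (p + s))
        return rng.uniform(-eps, eps, size=(s, p))   # one weight row per neuron

    def he_normal(p, s):
        return rng.normal(0.0, np.sqrt(2.0 / p), size=(s, p))  # variance 2/p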

43. Reminder: The Iris Data

[pictures not available in online version]

• Collected by Edgar Anderson on the Gaspé Peninsula (Canada).
• First analyzed by Ronald Aylmer Fisher (famous statistician).
• 150 cases in total, 50 cases per Iris flower type.
• Measurements of sepal length and width and petal length and width (in cm).
• Most famous data set in pattern recognition and data analysis.

Reminder: The Iris Data

• Training a multi-layer perceptron for the Iris data (with sum of squared errors):
  ◦ 4 input neurons (sepal length and width, petal length and width)
  ◦ 3 hidden neurons (2 hidden neurons also work)
  ◦ 3 output neurons (one output neuron per class)
• Different training methods with different learning rates, 100 independent runs.

  44. AdaGrad, RMSProp, AdaDelta Adam, NAdam, AdaMax AdaGrad η = 0 . 02, online RMSProp η = 0 . 002, online AdaDelta α = β = 0 . 95, online Adam η = 0 . 002, online NAdam η = 0 . 002, online AdaMax η = 0 . 02, online error error error error error error 180 180 180 180 180 180 160 160 160 160 160 160 140 140 140 140 140 140 120 120 120 120 120 120 100 100 100 100 100 100 80 80 80 80 80 80 60 60 60 60 60 60 40 40 40 40 40 40 20 20 20 20 20 20 0 0 0 0 0 0 0 5 10 15 20 epoch 25 0 5 10 15 20 epoch 25 0 5 10 15 20 epoch 25 0 5 10 15 20 epoch 25 0 5 10 15 20 epoch 25 0 5 10 15 20 epoch 25 AdaGrad η = 0 . 2, online RMSProp η = 0 . 02, online AdaDelta α = 0 . 9, β = 0 . 999, online Adam η = 0 . 02, online NAdam η = 0 . 02, online AdaMax η = 0 . 2, online error error error error error error 180 180 180 180 180 180 160 160 160 160 160 160 140 140 140 140 140 140 120 120 120 120 120 120 100 100 100 100 100 100 80 80 80 80 80 80 60 60 60 60 60 60 40 40 40 40 40 40 20 20 20 20 20 20 0 0 0 0 0 0 epoch epoch epoch epoch epoch epoch 0 5 10 15 20 25 0 5 10 15 20 25 0 5 10 15 20 25 0 5 10 15 20 25 0 5 10 15 20 25 0 5 10 15 20 25 Christian Borgelt Artificial Neural Networks and Deep Learning 193 Christian Borgelt Artificial Neural Networks and Deep Learning 194 Number of Hidden Neurons Number of Hidden Neurons • Note that the approximation theorem only states that there exists Principle of training data/validation data approach: a number of hidden neurons and weight vectors � v and � w i and thresholds θ i , but not how they are to be chosen for a given ε of approximation accuracy. • Underfitting: If the number of neurons in the hidden layer is too small, the multi-layer perceptron may not be able to capture the structure of the relationship • For a single hidden layer the following rule of thumb is popular: between inputs and outputs precisely enough due to a lack of parameters. number of hidden neurons = (number of inputs + number of outputs) / 2 • Overfitting: With a larger number of hidden neurons a multi-layer perceptron • Better, though computationally expensive approach: may adapt not only to the regular dependence between inputs and outputs, but also to the accidental specifics (errors and deviations) of the training data set. ◦ Randomly split the given data into two subsets of (about) equal size, the training data and the validation data . • Overfitting will usually lead to the effect that the error a multi-layer perceptron ◦ Train multi-layer perceptrons with different numbers of hidden neurons yields on the validation data will be (possibly considerably) greater than the error on the training data and evaluate them on the validation data. it yields on the training data. ◦ Repeat the random split of the data and training/evaluation many times The reason is that the validation data set is likely distorted in a different fashion and average the results over the same number of hidden neurons. than the training data, since the errors and deviations are random. Choose the number of hidden neurons with the best average error. • Minimizing the error on the validation data by properly choosing ◦ Train a final multi-layer perceptron on the whole data set. the number of hidden neurons prevents both under- and overfitting. Christian Borgelt Artificial Neural Networks and Deep Learning 195 Christian Borgelt Artificial Neural Networks and Deep Learning 196

  45. Number of Hidden Neurons: Avoid Overfitting Number of Hidden Neurons: Cross Validation • Objective: select the model that best fits the data, • The described method of iteratively splitting the data into taking the model complexity into account . training and validation data may be referred to as cross validation . The more complex the model, the better it usually fits the data. • However, this term is more often used for the following specific procedure: ◦ The given data set is split into n parts or subsets (also called folds) y 7 of about equal size (so-called n-fold cross validation ). black line: 6 regression line ◦ If the output is nominal (also sometimes called symbolic or categorical), 5 (2 free parameters) this split is done in such a way that the relative frequency 4 of the output values in the subsets/folds represent as well as possible 3 blue curve: the relative frequencies of these values in the data set as a whole. 2 7th order regression polynomial This is also called stratification 1 (8 free parameters) (derived from the Latin stratum : layer, level, tier). x 0 0 1 2 3 4 5 6 7 8 ◦ Out of these n data subsets (or folds) n pairs of training and validation data set are formed by using one fold as a validation data set • The blue curve fits the data points perfectly, but it is not a good model . while the remaining n − 1 folds are combined into a training data set. Christian Borgelt Artificial Neural Networks and Deep Learning 197 Christian Borgelt Artificial Neural Networks and Deep Learning 198 Number of Hidden Neurons: Cross Validation Avoiding Overfitting: Alternatives • The advantage of the cross validation method is that one random split of the data • An alternative way to prevent overfitting is the following approach: yields n different pairs of training and validation data set. ◦ During training the performance of the multi-layer perceptron is evaluated after each epoch (or every few epochs) on a validation data set . • An obvious disadvantage is that (except for n = 2) the size of the training and the test data set are considerably different, ◦ While the error on the training data set should always decrease with each which makes the results on the validation data statistically less reliable. epoch, the error on the validation data set should, after decreasing initially as well, increase again as soon as overfitting sets in. • It is therefore only recommended for sufficiently large data sets or sufficiently small n , so that the validation data sets are of sufficient size. ◦ At this moment training is terminated and either the current state or (if avail- able) the state of the multi-layer perceptron, for which the error on the vali- • Repeating the split (either with n = 2 or greater n ) has the advantage dation data reached a minimum, is reported as the training result. that one obtains many more training and validation data sets, leading to more reliable statistics (here: for the number of hidden neurons). • Furthermore a stopping criterion may be derived from the shape of the error curve on the training data over the training epochs, or the network is trained only for a • The described approaches fall into the category of resampling methods . fixed, relatively small number of epochs (also known as early stopping ). • Other well-known statistical resampling methods are • Disadvantage: these methods stop the training of a complex network early enough, bootstrap , jackknife , subsampling and permutation test . 
rather than adjust the complexity of the network to the “correct” level. Christian Borgelt Artificial Neural Networks and Deep Learning 199 Christian Borgelt Artificial Neural Networks and Deep Learning 200
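A minimal sketch of the n-fold construction referenced above (index-based, without stratification; the names are illustrative):

    import numpy as np

    def cross_validation_folds(n_samples, n_folds, seed=0):
        idx = np.random.default_rng(seed).permutation(n_samples)
        folds = np.array_split(idx, n_folds)
        for i in range(n_folds):         # each fold serves once as the validation set
            train = np.concatenate([f for j, f in enumerate(folds) if j != i])
            yield train, folds[i]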

46. Sensitivity Analysis

Problem of Multi-Layer Perceptrons:
• The knowledge that is learned by a neural network is encoded in matrices/vectors of real-valued numbers and is therefore often difficult to understand or to extract.
• Geometric interpretation or other forms of intuitive understanding are possible only for very simple networks, but fail for complex practical problems.
• Thus a neural network is often effectively a black box that computes its output, though in a mathematically precise way, in a fashion that is very difficult to understand and to interpret.

Idea of Sensitivity Analysis:
• Try to find out to which inputs the output(s) react(s) most sensitively.
• This may also give hints which inputs are not needed and may be discarded.

47. Sensitivity Analysis

Question: How important are different inputs to the network?
Idea: Determine the change of the output relative to a change of the input.
$\forall u \in U_{\text{in}}: \quad s(u) = \frac{1}{|L_{\text{fixed}}|} \sum_{l \in L_{\text{fixed}}} \sum_{v \in U_{\text{out}}} \Big| \frac{\partial \text{out}_v^{(l)}}{\partial \text{ext}_u^{(l)}} \Big|$

Formal derivation: Apply the chain rule.
$\frac{\partial \text{out}_v}{\partial \text{ext}_u} = \frac{\partial \text{out}_v}{\partial \text{out}_u}\,\frac{\partial \text{out}_u}{\partial \text{ext}_u} = \frac{\partial \text{out}_v}{\partial \text{net}_v}\,\frac{\partial \text{net}_v}{\partial \text{out}_u}\,\frac{\partial \text{out}_u}{\partial \text{ext}_u}$

Simplification: Assume that the output function is the identity: $\frac{\partial \text{out}_u}{\partial \text{ext}_u} = 1$.

Sensitivity Analysis

For the second factor we get the general result:
$\frac{\partial \text{net}_v}{\partial \text{out}_u} = \frac{\partial}{\partial \text{out}_u} \sum_{p \in \text{pred}(v)} w_{vp}\,\text{out}_p = \sum_{p \in \text{pred}(v)} w_{vp}\,\frac{\partial \text{out}_p}{\partial \text{out}_u}.$

This leads to the recursion formula
$\frac{\partial \text{out}_v}{\partial \text{out}_u} = \frac{\partial \text{out}_v}{\partial \text{net}_v}\,\frac{\partial \text{net}_v}{\partial \text{out}_u} = \frac{\partial \text{out}_v}{\partial \text{net}_v} \sum_{p \in \text{pred}(v)} w_{vp}\,\frac{\partial \text{out}_p}{\partial \text{out}_u}.$

However, for the first hidden layer we get
$\frac{\partial \text{net}_v}{\partial \text{out}_u} = w_{vu}$, therefore $\frac{\partial \text{out}_v}{\partial \text{out}_u} = \frac{\partial \text{out}_v}{\partial \text{net}_v}\,w_{vu}.$
This formula marks the start of the recursion.

Sensitivity Analysis

Consider as usual the special case in which the output function is the identity and the activation function is logistic. The recursion formula is in this case
$\frac{\partial \text{out}_v}{\partial \text{out}_u} = \text{out}_v\,(1-\text{out}_v) \sum_{p \in \text{pred}(v)} w_{vp}\,\frac{\partial \text{out}_p}{\partial \text{out}_u}$
and the anchor of the recursion is
$\frac{\partial \text{out}_v}{\partial \text{out}_u} = \text{out}_v\,(1-\text{out}_v)\,w_{vu}.$

Sensitivity Analysis

Attention: Use weight decay to stabilize the training results!
$\Delta w_t = -\frac{\eta}{2}\,\nabla_w e|_{w_t} - \xi\,w_t$
• Without weight decay the results of a sensitivity analysis can depend (strongly) on the (random) initialization of the network.

Example: Iris data, 1 hidden layer, 3 hidden neurons, 1000 epochs; sensitivities from four independent runs each:

     attribute       ξ = 0                                ξ = 0.0001
     sepal length    0.0216  0.0399  0.0232  0.0515       0.0367  0.0325  0.0351  0.0395
     sepal width     0.0423  0.0341  0.0460  0.0447       0.0385  0.0376  0.0421  0.0425
     petal length    0.1789  0.2569  0.1974  0.2805       0.2048  0.1928  0.1838  0.1861
     petal width     0.2017  0.1356  0.2198  0.1325       0.2020  0.1962  0.1750  0.1743

Alternative: Weight normalization (neuron weights are normalized to sum 1).
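For a network with one hidden layer the recursion collapses into two Jacobians. A sketch for logistic hidden and output units with identity output function (W1 and W2 carry the bias in column 0; the names are illustrative):

    import numpy as np

    def logistic(z):
        return 1.0 / (1.0 + np.exp(-z))

    def sensitivity(W1, W2, X):
        """s(u): mean over patterns of sum_v |d out_v / d ext_u|."""
        s = np.zeros(X.shape[1])
        for x in X:
            h = logistic(W1 @ np.concatenate(([1.0], x)))
            out = logistic(W2 @ np.concatenate(([1.0], h)))
            J_hid = (h * (1.0 - h))[:, None] * W1[:, 1:]      # recursion anchor: d h / d x
            J_out = (out * (1.0 - out))[:, None] * W2[:, 1:]  # d out / d h
            s += np.abs(J_out @ J_hid).sum(axis=0)            # chain rule, summed over outputs
        return s / len(X)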

48. Application: Recognition of Handwritten Digits

[picture not available in online version]
• Acquisition, binarization, location of the ZIP code on the envelope and segmentation into individual digits were performed by Postal Service contractors.
• Further segmentation was performed by the authors of [LeCun et al. 1990].
• Segmentation of totally unconstrained digit strings is a difficult problem.
• Several ambiguous characters are the result of missegmentation (especially broken 5s, see the picture above).

[picture not available in online version]
• Segmented digits vary in size, but are typically around 40 to 60 pixels.
• The size of the digits is normalized to 16 × 16 pixels with a simple linear transformation (preserving the aspect ratio).
• As a result of the transformation, the resulting image is not binary, but has multiple gray levels, which are scaled to fall into [−1, +1].
• The remainder of the recognition is performed entirely by an MLP.

• Objective: train an MLP to recognize the digits.
• Training: error backpropagation / gradient descent.
• 7291 handwritten and 2549 printed digits in the training set; 2007 handwritten and 700 printed digits (different fonts) in the test set.
• Both training and test set contain multiple examples that are ambiguous, unclassifiable, or even misclassified.
• Picture above: normalized digits from the test set.

• Input: 16 × 16 pixel images of the normalized digits (256 input neurons).
• Output: 10 neurons, that is, one neuron per class. If an input image belongs to class i (shows digit i), the output neuron u_i should produce a value of +1; all other output neurons should produce a value of −1.
• Challenge: all connections are adaptive, although heavily constrained (in contrast to earlier work, which used manually chosen first layers).
• Problem: for a fully connected network with multiple hidden layers the number of parameters is excessive (risk of overfitting).
• Solution: instead of a fully connected network, use a locally connected one.

49. Application: Recognition of Handwritten Digits

[picture not available in online version: input grid with a small "receptive field" moved over the image]
• The network used has four hidden layers plus input and output layer.
• Each neuron of the (first) hidden layer is connected to a small number of input neurons that refer to a contiguous region of the input image (left).
• Local connections in all but the last layer:
  ◦ Two-dimensional layout of the input and hidden layers (neuron grids).
  ◦ Connections to a neuron in the next layer from a group of neighboring neurons in the preceding layer (small sub-grid, local receptive field).
• Connection weights are shared; the same network is evaluated at different locations. The input field is "moved" step by step over the whole image (right).
• Equivalent to a convolution with a small size kernel ("receptive field").

[picture not available in online version]
• Idea: local connections can implement feature detectors.
• In addition: analogous connections are forced to have identical weights to allow for features appearing in different parts of the input (weight sharing).
• Equivalent to a convolution with a small size kernel, followed by an application of the activation function (see the sketch below).
• Note: Weight sharing considerably reduces the number of free parameters.
• The outputs of a set of neurons with shared weights constitute a feature map.
• In practice, multiple feature maps, extracting different features, are needed.
• The idea of local, convolutional feature maps can be applied to subsequent hidden layers as well to obtain more complex and abstract features.
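A minimal sketch of the "moving receptive field" idea: one shared kernel evaluated at every image position, followed by the activation function. The 5 × 5 kernel size and the tanh activation are assumptions for illustration; the slides only state that the kernel is small.

```python
import numpy as np

def feature_map(image, kernel, act=np.tanh):
    """Local receptive fields with shared weights: the same small
    kernel is evaluated at every position of the input image, which
    is equivalent to a convolution followed by the activation."""
    h, w = image.shape
    k = kernel.shape[0]
    out = np.empty((h - k + 1, w - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + k, j:j + k] * kernel)
    return act(out)

# hypothetical 16x16 input image and a 5x5 kernel
img = np.random.default_rng(1).uniform(-1, 1, size=(16, 16))
fmap = feature_map(img, np.ones((5, 5)) / 25.0)
print(fmap.shape)   # (12, 12)
```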

50. Application: Recognition of Handwritten Digits

[picture not available in online version]
• The input image is enlarged from 16 × 16 to 28 × 28 pixels to avoid boundary problems.
• In the first hidden layer H1, four feature maps are computed. All neurons in the same feature map share the same weight vector.
• The second hidden layer H2 averages over 2 × 2 blocks of the first hidden layer (a sketch follows below).
• 12 feature maps in H3; H4 averages over 2 × 2 blocks of H3.
[table: the fixed connection scheme between the 4 feature maps of H2 and the 12 feature maps of H3; each H3 map receives input from only a subset of the H2 maps]
• Higher-level features require less precise coding of their location; therefore each additional layer performs local averaging and subsampling.
• The loss of spatial resolution is partially compensated by an increase in the number of different features.
• Layers H1 and H3 are shared-weight feature extractors; layers H2 and H4 are averaging/subsampling layers.
• The output layer is fully connected to the fourth hidden layer H4.
• In total: 4635 neurons, 98,442 connections, 2578 independent parameters.

• After 30 training epochs, the error rate on the training set was 1.1% (MSE: 0.017), and 3.4% on the test set (MSE: 0.024). Compare human error: 2.5%.
• All classification errors occurred on handwritten digits.
• Substitutions (misclassified) versus rejections (unclassified):
  ◦ Reject an input if the difference in output between the two neurons with the highest outputs is less than a threshold.
  ◦ Best result on the test set: 5.7% rejections for 1% substitutions.
  ◦ On handwritten data alone: 9% rejections for 1% substitutions, or 6% rejections for 2% substitutions.
• Atypical input (see below) is also classified correctly.
[picture not available in online version]

Demonstration Software: xmlp/wmlp
Demonstration of multi-layer perceptron training:
• Visualization of the training process
• Biimplication and exclusive or, two continuous functions
• http://www.borgelt.net/mlpd.html
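A minimal sketch of the averaging/subsampling step performed by H2 and H4, assuming plain 2 × 2 block averaging without trainable parameters.

```python
import numpy as np

def average_pool_2x2(fmap):
    """Averaging/subsampling layer: each unit averages a 2x2 block
    of the previous feature map, halving the spatial resolution."""
    h, w = fmap.shape
    return fmap[:h // 2 * 2, :w // 2 * 2].reshape(
        h // 2, 2, w // 2, 2).mean(axis=(1, 3))

print(average_pool_2x2(np.arange(16.0).reshape(4, 4)))
```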

51. Multi-Layer Perceptron Software: mlp/mlpgui

Software for training general multi-layer perceptrons:
• Command line version written in C, fast training
• Graphical user interface in Java, easy to use
• http://www.borgelt.net/mlp.html and http://www.borgelt.net/mlpgui.html

Deep Learning (Multi-Layer Perceptrons with Many Layers)

Deep Learning: Motivation
• Reminder: Universal Approximation Theorem — any continuous function on an arbitrary compact subspace of R^n can be approximated arbitrarily well with a three-layer perceptron.
• This theorem is often cited as (allegedly!) meaning that
  ◦ we can confine ourselves to multi-layer perceptrons with only one hidden layer,
  ◦ there is no real need to look at multi-layer perceptrons with more hidden layers.
• However: The theorem says nothing about the number of hidden neurons that may be needed to achieve a desired approximation accuracy.
• Depending on the function to approximate, a very large number of neurons may be necessary.
• Allowing for more hidden layers may enable us to achieve the same approximation quality with a significantly lower number of neurons.

• A very simple and commonly used example is the n-bit parity function: the output is 1 if an even number of inputs is 1; the output is 0 otherwise.
• It can easily be represented by a multi-layer perceptron with only one hidden layer. (See the reminder of the TLU algorithm on the next slide; adapt to MLPs.)
• However, this solution has 2^(n−1) hidden neurons: the disjunctive normal form is a disjunction of 2^(n−1) conjunctions, which represent the 2^(n−1) input combinations with an even number of set bits.
• The number of hidden neurons grows exponentially with the number of inputs.
• However, if more hidden layers are admissible, linear growth is possible (see the sketch below):
  ◦ Start with a biimplication of two inputs.
  ◦ Continue with a chain of exclusive ors, each of which adds another input.
  ◦ Such a network needs n + 3(n − 1) = 4n − 3 neurons in total (n input neurons, 3(n − 1) − 1 hidden neurons, 1 output neuron).
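The linear-growth construction can be checked directly. The sketch below is an illustration, not the slides' exact network: it builds XOR from threshold logic units with the ±2 weight convention used earlier and chains one biimplication with n − 2 exclusive ors (the biimplication is computed as the negated XOR for brevity).

```python
from itertools import product

def tlu(weights, theta, xs):
    """Threshold logic unit: output 1 iff the weighted sum reaches theta."""
    return int(sum(w * x for w, x in zip(weights, xs)) >= theta)

def xor(a, b):
    """XOR as a two-layer threshold network: (a AND NOT b) OR (NOT a AND b)."""
    h1 = tlu((2, -2), 1, (a, b))   # a AND NOT b
    h2 = tlu((-2, 2), 1, (a, b))   # NOT a AND b
    return tlu((2, 2), 1, (h1, h2))

def even_parity(bits):
    """Chain: one biimplication followed by n-2 XORs; the network
    size grows linearly with the number of inputs."""
    y = 1 - xor(bits[0], bits[1])          # biimplication of x1, x2
    for b in bits[2:]:
        y = xor(y, b)                      # each XOR adds one input
    return y

for bits in product((0, 1), repeat=4):
    assert even_parity(bits) == (sum(bits) % 2 == 0)
print("parity chain verified for n = 4")
```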

52. Reminder: Representing Arbitrary Boolean Functions

Algorithm: Let y = f(x_1, ..., x_n) be a Boolean function of n variables.
(i) Represent the given function f(x_1, ..., x_n) in disjunctive normal form. That is, determine D_f = C_1 ∨ ... ∨ C_m, where all C_j are conjunctions of n literals, that is, C_j = l_{j1} ∧ ... ∧ l_{jn} with l_{ji} = x_i (positive literal) or l_{ji} = ¬x_i (negative literal).
(ii) Create a neuron for each conjunction C_j of the disjunctive normal form (having n inputs, one input for each variable), where
\[ w_{ji} = \begin{cases} +2 & \text{if } l_{ji} = x_i, \\ -2 & \text{if } l_{ji} = \neg x_i, \end{cases} \qquad \text{and} \qquad \theta_j = n - 1 + \frac{1}{2} \sum_{i=1}^n w_{ji}. \]
(iii) Create an output neuron (having m inputs, one input for each neuron that was created in step (ii)), where
\[ w_{(n+1)k} = 2, \quad k = 1, \ldots, m, \qquad \text{and} \qquad \theta_{n+1} = 1. \]
Remark: weights are set to ±2 instead of ±1 in order to ensure integer thresholds. (A direct implementation of these steps is sketched below.)

Deep Learning: n-bit Parity Function
[figure: inputs x_1, ..., x_n feeding a chain of one biimplication and n − 2 exclusive or sub-networks]
• Implementation of the n-bit parity function with a chain of one biimplication and n − 2 exclusive or sub-networks.
• Note that the structure is not strictly layered, but could be completed. If this is done, the number of neurons increases to n(n + 1) − 1.

Deep Learning: Motivation
• The situation is similar to that w.r.t. linearly separable functions:
  ◦ Only few n-ary Boolean functions are linearly separable (see the next slide).
  ◦ Only few n-ary Boolean functions need few hidden neurons in a single layer.
• An n-ary Boolean function has k_0 input combinations with an output of 0 and k_1 input combinations with an output of 1 (clearly k_0 + k_1 = 2^n).
• We need at most 2 min{k_0, k_1} hidden neurons if we choose disjunctive normal form for k_1 ≤ k_0 and conjunctive normal form for k_1 > k_0.
• As there are \( \binom{2^n}{k_0} \) possible functions with k_0 input combinations mapped to 0 and k_1 mapped to 1, many functions require a substantial number of hidden neurons.
• Although the number of neurons may be reduced with minimization methods like the Quine–McCluskey algorithm [Quine 1952, 1955; McCluskey 1956], the fundamental problem remains.

Reminder: Limitations of Threshold Logic Units
Total number and number of linearly separable Boolean functions
(On-Line Encyclopedia of Integer Sequences, oeis.org, A001146 and A000609):

inputs | Boolean functions          | linearly separable functions
1      | 4                          | 4
2      | 16                         | 14
3      | 256                        | 104
4      | 65,536                     | 1,882
5      | 4,294,967,296              | 94,572
6      | 18,446,744,073,709,551,616 | 15,028,134
n      | 2^(2^n)                    | no general formula known

• For many inputs a threshold logic unit can compute almost no functions.
• Networks of threshold logic units are needed to overcome the limitations.
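A direct implementation of steps (i)–(iii), illustrated on the 3-bit even parity function, which indeed needs 2^(n−1) = 4 hidden neurons.

```python
from itertools import product

def dnf_network(truth_table):
    """Build the hidden layer of the two-layer threshold network:
    one hidden TLU per input combination mapped to 1, with weights
    +/-2 and threshold theta_j = n - 1 + 0.5 * sum(w_ji)."""
    n = len(next(iter(truth_table)))
    hidden = []
    for xs, y in truth_table.items():
        if y == 1:
            w = [2 if x == 1 else -2 for x in xs]
            hidden.append((w, n - 1 + 0.5 * sum(w)))
    return hidden

def evaluate(hidden, xs):
    outs = [int(sum(wi * x for wi, x in zip(w, xs)) >= th)
            for w, th in hidden]
    return int(sum(2 * o for o in outs) >= 1)   # output neuron: theta = 1

tt = {xs: int(sum(xs) % 2 == 0) for xs in product((0, 1), repeat=3)}
net = dnf_network(tt)
assert all(evaluate(net, xs) == y for xs, y in tt.items())
print(len(net), "hidden neurons")   # 4
```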

53. Deep Learning: Motivation

• In practice the problem is mitigated considerably by the simple fact that training data sets are necessarily limited in size.
• Complete training data for an n-ary Boolean function has 2^n training examples. Data sets for practical problems usually contain much fewer sample cases.
• This leads to many input configurations for which no desired output is given; the freedom to assign outputs to them allows for a simpler representation.
• Nevertheless, using more than one hidden layer promises in many cases to reduce the number of needed neurons.
• This is the focus of the area of deep learning. (Depth means the length of the longest path in the network graph.)
• For multi-layer perceptrons (longest path: number of hidden layers plus one), deep learning starts with more than one hidden layer.
• For 10 or more hidden layers, one sometimes speaks of very deep learning.

Deep Learning: Main Problems
• Deep learning multi-layer perceptrons suffer from two main problems: overfitting and vanishing gradient.
• Overfitting results mainly from the increased number of adaptable parameters.
• Weight decay prevents large weights and thus an overly precise adaptation.
• Sparsity constraints help to avoid overfitting:
  ◦ there is only a restricted number of neurons in the hidden layers, or
  ◦ only few of the neurons in the hidden layers should be active (on average).
  This may be achieved by adding a regularization term to the error function (which compares the observed number of active neurons with the desired number and pushes the adaptations into a direction that tries to match these numbers).
• Furthermore, a training method called dropout training may be applied: some units are randomly omitted from the input/hidden layers during training.

Deep Learning: Dropout
• Desired characteristic: robustness against neuron failures.
• Approach during training:
  ◦ Use only p% of the neurons (e.g. p = 50: half of the neurons).
  ◦ Choose the dropout neurons randomly.
• Approach during execution:
  ◦ Use all of the neurons.
  ◦ Multiply all weights by p%.
• Result of this approach (see the sketch after the backpropagation reminder below):
  ◦ More robust representation.
  ◦ Better generalization.

Reminder: Cookbook Recipe for Error Backpropagation
[figure: a fully connected network with inputs x_1, ..., x_n and outputs y_1, ..., y_m]
forward propagation (logistic activation function, implicit bias value):
\[ \forall u \in U_{\text{in}}: \quad \text{out}_u^{(l)} = \text{ext}_u^{(l)}, \qquad \forall u \in U_{\text{hidden}} \cup U_{\text{out}}: \quad \text{out}_u^{(l)} = \Big( 1 + \exp\Big( -\sum_{p \in \text{pred}(u)} w_{up}\, \text{out}_p^{(l)} \Big) \Big)^{-1}. \]
backward propagation (error factor):
\[ \forall u \in U_{\text{hidden}}: \quad \delta_u^{(l)} = \Big( \sum_{s \in \text{succ}(u)} \delta_s^{(l)} w_{su} \Big) \lambda_u^{(l)}, \qquad \forall u \in U_{\text{out}}: \quad \delta_u^{(l)} = \big( o_u^{(l)} - \text{out}_u^{(l)} \big) \lambda_u^{(l)}. \]
activation derivative and weight change:
\[ \lambda_u^{(l)} = \text{out}_u^{(l)} \big( 1 - \text{out}_u^{(l)} \big), \qquad \Delta w_{up}^{(l)} = \eta\, \delta_u^{(l)}\, \text{out}_p^{(l)}. \]
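A minimal dropout sketch under the stated scheme, with the simplifying assumptions that dropout is applied to the hidden layers only, activations are logistic, and thresholds are omitted: during training a random fraction p of the hidden neurons is used, during execution all neurons are used and the affected weights are scaled by p.

```python
import numpy as np

rng = np.random.default_rng(0)

def forward_train(weights, x, p=0.5):
    """Forward pass with dropout: in each hidden layer only a random
    fraction p of the neurons is kept; the rest output 0."""
    out = x
    for l, W in enumerate(weights):
        out = 1.0 / (1.0 + np.exp(-(W @ out)))   # logistic activation
        if l < len(weights) - 1:                 # hidden layers only
            out = out * (rng.random(out.shape) < p)
    return out

def forward_execute(weights, x, p=0.5):
    """Execution: all neurons are used; weights fed by hidden layers
    are scaled by p to compensate for the larger number of units."""
    out = x
    for l, W in enumerate(weights):
        W_eff = W if l == 0 else p * W
        out = 1.0 / (1.0 + np.exp(-(W_eff @ out)))
    return out
```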

54. Deep Learning: Vanishing Gradient

logistic activation function and its derivative:
\[ f_{\text{act}}(\text{net}_u^{(l)}) = \frac{1}{1 + e^{-\text{net}_u^{(l)}}}, \qquad f'_{\text{act}}(\text{net}_u^{(l)}) = f_{\text{act}}(\text{net}_u^{(l)}) \cdot \big( 1 - f_{\text{act}}(\text{net}_u^{(l)}) \big). \]
[plots of the logistic function (left) and its derivative (right)]
• If a logistic activation function is used (shown on the left), the weight changes are proportional to \( \lambda_u^{(l)} = \text{out}_u^{(l)} (1 - \text{out}_u^{(l)}) \) (shown on the right; see the backpropagation recipe).
• This factor is also propagated back, but cannot be larger than 1/4 (see the right plot).
⇒ The gradient tends to vanish if many layers are backpropagated through. Learning in the early hidden layers can become very slow [Hochreiter 1991].

• In principle, a small gradient may be counteracted by a large weight:
\[ \delta_u^{(l)} = \Big( \sum_{s \in \text{succ}(u)} \delta_s^{(l)} w_{su} \Big) \lambda_u^{(l)}. \]
• However, a large weight, since it also enters the activation function, usually drives the activation function into its saturation regions.
• Thus (the absolute value of) the derivative factor is usually the smaller, the larger (the absolute value of) the weights.
• Furthermore, the connection weights are commonly initialized to random values in the range from −1 to +1.
⇒ Initial training steps are particularly affected: both the gradient as well as the weights provide a factor less than 1.
• Theoretically, there can be exceptions to this description. However, they are rare in practice and one usually observes a vanishing gradient.

Alternative way to understand the vanishing gradient effect:
• The logistic activation function is a contracting function: for any two arguments x and y, x ≠ y, we have |f_act(x) − f_act(y)| < |x − y|. (Obvious from the fact that its derivative is always < 1, actually ≤ 1/4.)
• If several logistic functions are chained, these contractions combine and yield an even stronger contraction of the input range.
• As a consequence, a rather large change of the input values will produce only a rather small change in the output values, the more so, the more logistic functions are chained together.
• Therefore the function that maps the inputs of a multi-layer perceptron to its outputs usually becomes the flatter, the more layers the multi-layer perceptron has.
• Consequently the gradient in the first hidden layer (where the inputs are processed) becomes the smaller.

Deep Learning: Different Activation Functions
rectified maximum/ramp function and softplus function (note the scale!):
\[ f_{\text{act}}(\text{net}, \theta) = \max\{0, \text{net} - \theta\}, \qquad f_{\text{act}}(\text{net}, \theta) = \ln\big( 1 + e^{\text{net} - \theta} \big). \]
[plots of the ramp and softplus functions]
• The vanishing gradient problem may be battled with other activation functions.
• These activation functions yield so-called rectified linear units (ReLUs).
• Rectified maximum/ramp function: advantages are simple computation, a simple derivative, and zeros that simplify learning; disadvantages are no learning for net ≤ θ, a rather inelegant shape, and that it is not continuously differentiable.
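A small numeric illustration of why the factor matters: chaining ten logistic derivative factors shrinks the gradient by about six orders of magnitude even at the most favorable point net = 0, while the ramp function's derivative is exactly 1 for active units.

```python
import numpy as np

def logistic_deriv(net):
    f = 1.0 / (1.0 + np.exp(-net))
    return f * (1.0 - f)                     # at most 1/4 (at net = 0)

def ramp_deriv(net, theta=0.0):
    return np.where(net >= theta, 1.0, 0.0)  # step function: 0 or 1

# product of the derivative factors over 10 chained layers
print(logistic_deriv(0.0) ** 10)   # 0.25**10 ~ 9.5e-07: vanishing gradient
print(ramp_deriv(1.0) ** 10)       # 1.0: no shrinkage for active units
```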

55. Reminder: Training a TLU for the Negation

[figure: a threshold logic unit with a single input x, weight w and threshold θ, computing the negation ¬x: x = 0 ↦ y = 1, x = 1 ↦ y = 0]
[plots: the modified output error as a function of weight and threshold — error for x = 0, error for x = 1, and the sum of errors]
• A rectified maximum/ramp function yields these errors directly.

Deep Learning: Different Activation Functions
• Softplus function: continuously differentiable, but computationally much more complex.
• Alternatives (sketched below):
  ◦ Leaky ReLU: f(x) = x if x > 0, and f(x) = νx otherwise (adds learning for net ≤ θ; ν ≈ 0.01).
  ◦ Noisy ReLU: f(x) = max{0, x + N(0, σ(x))} (adds Gaussian noise).

derivative of the ramp function:
\[ f'_{\text{act}}(\text{net}, \theta) = \begin{cases} 1 & \text{if net} \geq \theta, \\ 0 & \text{otherwise}, \end{cases} \]
derivative of the softplus function:
\[ f'_{\text{act}}(\text{net}, \theta) = \frac{1}{1 + e^{-(\text{net} - \theta)}}. \]
• The derivative of the rectified maximum/ramp function is the step function.
• The derivative of the softplus function is the logistic function.
• The factor resulting from the derivative of the activation function can be much larger than for the logistic activation function.
⇒ Larger gradient in the early layers of the network; training is (much) faster.
• What also helps: faster hardware, implementations on GPUs etc.

Deep Learning: Auto-Encoders
• Reminder: Networks of threshold logic units cannot be trained, because
  ◦ there are no desired values for the neurons of the first layer(s),
  ◦ the problem can usually be solved with several different functions computed by the neurons of the first layer(s) (non-unique solution).
• Although multi-layer perceptrons, even with many hidden layers, can be trained, the same problems still affect the effectiveness and efficiency of training.
• Alternative: Build the network layer by layer; train only the newly added layer in each step. Popular: build the network as stacked auto-encoders.
• An auto-encoder is a 3-layer perceptron that maps its inputs to approximations of these inputs.
  ◦ The hidden layer forms an encoder into some form of internal representation.
  ◦ The output layer forms a decoder that (approximately) reconstructs the inputs.
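Minimal sketches of the two ReLU alternatives; the noise level σ is fixed here for simplicity, whereas the formula above allows it to depend on x.

```python
import numpy as np

def leaky_relu(x, nu=0.01):
    """Leaky ReLU: a small slope nu below zero keeps learning alive
    where the plain ramp function has derivative 0."""
    return np.where(x > 0, x, nu * x)

def noisy_relu(x, sigma=0.1, rng=np.random.default_rng(0)):
    """Noisy ReLU: Gaussian noise is added before rectification."""
    return np.maximum(0.0, x + rng.normal(0.0, sigma, np.shape(x)))

x = np.array([-2.0, -0.5, 0.5, 2.0])
print(leaky_relu(x))   # [-0.02, -0.005, 0.5, 2.0]
print(noisy_relu(x))
```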

56. Deep Learning: Auto-Encoders

[figure: an auto-encoder/decoder (left), of which only the encoder part (right) is later used; the x_i are the given inputs, the x̂_i are the reconstructed inputs, and the f_i are the constructed features]
• The reconstruction error is \( e = \sum_{i=1}^n (\hat{x}_i - x_i)^2 \). Training is conducted with error backpropagation.

• Rationale of training an auto-encoder: the hidden layer is expected to construct features.
• Hope: The features capture the information contained in the input in a compressed form (encoder), so that the input can be well reconstructed from it (decoder).
• Note: This implicitly assumes that features that are well suited to represent the inputs in a compressed way are also useful to predict some desired output. Experience shows that this assumption is often justified.
• Main problems:
  ◦ how many units should be chosen for the hidden layer?
  ◦ how should this layer be treated during training?
• If there are as many (or even more) hidden units as there are inputs, it is likely that the network will merely pass its inputs through to the output layer.
• The most straightforward solution are sparse auto-encoders: there should be (considerably) fewer hidden neurons than there are inputs.

Deep Learning: Sparse and Denoising Auto-Encoders
• Few hidden neurons force the auto-encoder to learn relevant features (since it is not possible to simply pass through the inputs).
• How many hidden neurons are a good choice? Note that cross-validation is not necessarily a good approach!
• Alternative to few hidden neurons: a sparse activation scheme — the number of active neurons in the hidden layer is restricted to a small number. This may be enforced either by adding a regularization term to the error function that punishes a larger number of active hidden neurons, or by explicitly deactivating all but a few neurons with the highest activations.
• A third approach is to add noise (that is, random variations) to the input (but not to the copy of the input used to evaluate the reconstruction error): denoising auto-encoders. (The auto-encoder to be trained is expected to map the input with noise to (a copy of) the input without noise, thus preventing simple passing through.)

Deep Learning: Stacked Auto-Encoders
• A popular way to initialize/pre-train a stacked auto-encoder is a greedy layer-wise training approach (see the sketch below).
• In the first step, a (sparse) auto-encoder is trained for the raw input features. The hidden layer constructs primary features useful for reconstruction.
• A primary feature data set is obtained by propagating the raw input features up to the hidden layer and recording the hidden neuron activations.
• In the second step, a (sparse) auto-encoder is trained for the obtained feature data set. The hidden layer constructs secondary features useful for reconstruction.
• A secondary feature data set is obtained by propagating the primary features up to the hidden layer and recording the hidden neuron activations.
• This process is repeated as many times as hidden layers are desired.
• Finally the encoder parts of the trained auto-encoders are stacked, and the resulting network is fine-tuned with error backpropagation.
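A toy auto-encoder sketch trained by gradient descent on the reconstruction error. A linear encoder and decoder are used for brevity (the networks on the slides use logistic units), and the learning rate and epoch count are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_autoencoder(X, n_hidden, eta=0.01, epochs=200):
    """Gradient descent on e = sum_i (x_hat_i - x_i)^2 for a linear
    encoder W_enc and decoder W_dec; returns the encoder weights."""
    n = X.shape[1]
    W_enc = rng.normal(scale=0.1, size=(n_hidden, n))
    W_dec = rng.normal(scale=0.1, size=(n, n_hidden))
    for _ in range(epochs):
        for x in X:
            f = W_enc @ x                       # features (encoder)
            err = W_dec @ f - x                 # reconstruction error
            g_dec = np.outer(err, f)            # gradient w.r.t. W_dec
            g_enc = np.outer(W_dec.T @ err, x)  # gradient w.r.t. W_enc
            W_dec -= eta * g_dec
            W_enc -= eta * g_enc
    return W_enc

X = rng.normal(size=(50, 6))
primary = X @ train_autoencoder(X, n_hidden=3).T  # primary feature data set
print(primary.shape)                              # (50, 3)
```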

57. Deep Learning: Stacked Auto-Encoders

Layer-wise training of auto-encoders:
[figure: inputs x_1, ..., x_6; primary features a_1, ..., a_4 with their reconstructions; secondary features b_1, b_2, b_3 with their reconstructions; outputs y_1, y_2, y_3; the encoder parts stacked into one network]
1. Train an auto-encoder for the raw input; this yields a primary feature set.
2. Train an auto-encoder for the primary features; this yields a secondary feature set.
3. A classifier/predictor for the output is trained from the secondary feature set.
4. The resulting networks are stacked.

Deep Learning: Convolutional Neural Networks (CNNs)
• Multi-layer perceptrons with several hidden layers built in the way described have been applied very successfully for handwritten digit recognition.
• In such an application it is assumed, though, that the handwriting has already been preprocessed in order to separate the digits (or, in other cases, the letters) from each other.
• However, one would like to use similar networks also for more general applications, for example, recognizing whole lines of handwriting or analyzing photos to identify their parts as sky, landscape, house, pavement, tree, human being etc.
• For such applications it is advantageous that the features constructed in hidden layers are not localized to a specific part of the image.
• Special form of deep learning multi-layer perceptron: convolutional neural network.
• Inspired by the human retina, where sensory neurons have a receptive field, that is, a limited region in which they respond to a (visual) stimulus.

Reminder: Receptive Field, Convolution
[picture not available in online version]
• Each neuron of the (first) hidden layer is connected to a small number of input neurons that refer to a contiguous region of the input image (left).
• Connection weights are shared; the same network is evaluated at different locations. The input field is "moved" step by step over the whole image (right).
• Equivalent to a convolution with a small size kernel ("receptive field").

Reminder: Recognition of Handwritten Digits
[picture not available in online version]
• Idea: local connections can implement feature detectors.
• In addition: analogous connections are forced to have identical weights to allow for features appearing in different parts of the input (weight sharing).
• Equivalent to a convolution with a small size kernel, followed by an application of the activation function.

58. Reminder: Recognition of Handwritten Digits

[picture not available in online version]
• Network with four hidden layers implementing two convolution and two averaging stages.
• Developed in 1990, so training took several days, despite its simplicity.
• Nowadays networks working on much larger images and having many more layers are possible.

ImageNet Large Scale Visual Recognition Challenge (ILSVRC)
[picture not available in online version]
• ImageNet is an image database organized according to WordNet nouns.
• WordNet is a lexical database [of English]: nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept; it contains > 100,000 synsets, of which > 80,000 are nouns.
• The ImageNet database contains hundreds and thousands of images for each noun/synset.

• The ILSVRC uses a subset of the ImageNet database (1.2 million images, 1,000 categories).
• It evaluates algorithms for object detection and image classification.
• Yearly challenges/competitions with corresponding workshops since 2010.
[picture not available in online version]
• Structure of the AlexNet deep neural network (winner 2012) [Krizhevsky 2012].
• 60 million parameters, 650,000 neurons, efficient GPU implementation.

Classification error (top 5) of the ILSVRC winners:

year | winning network | top-5 error
2010 |                 | 28.2%
2011 |                 | 25.8%
2012 | AlexNet         | 15.3%
2013 |                 | 11.7%
2014 | VGG             | 7.3%
2014 | GoogLeNet       | 6.7%
2015 | ResNet          | 3.6%
2016 |                 | 3%
2017 |                 | 2.25%

• Hardly possible 10 years ago; rapid improvement.
• Often used: network ensembles.
• Very deep learning: ResNet (2015) had > 150 layers.

59. German Traffic Sign Recognition Benchmark (GTSRB)

[pictures not available in online version]
• Competition at the Int. Joint Conference on Neural Networks (IJCNN) 2011 [Stallkamp et al. 2012]; benchmark.ini.rub.de/?section=gtsrb
• Single-image, multi-class classification problem (each image shows exactly one traffic sign, out of many classes).
• More than 40 classes, more than 50,000 images.
• Physical traffic sign instances are unique within the data set (that is, each real-world traffic sign occurs only once).
• The winning entry (a committee of CNNs) outperforms humans!
  error rates: winner (convolutional neural networks): 0.54%; runner-up (not a neural network): 1.12%; human recognition: 1.16%.

Inceptionism: Visualization of Training Results
[pictures not available in online version]
• One way to visualize what goes on in a neural network is to "turn the network upside down" and ask it to enhance an input image in such a way as to elicit a particular interpretation.
• Impose a prior constraint that the image should have similar statistics to natural images, such as neighboring pixels needing to be correlated.
• Result: neural networks that are trained to discriminate between different kinds of images contain the information needed to generate images.

Deep Learning: Board Game Go
[picture not available in online version]
• Board with 19 × 19 grid lines; stones are placed on line crossings.
• Objective: surround territory/crossings (and enemy stones: capture).
• Special "ko" rules for certain situations prevent infinite retaliation.
• A player can pass his/her turn (usually disadvantageous).
• The game ends after both players subsequently pass a turn.
• The winner is determined by counting (controlled area plus captured stones).

Complexity (b: number of moves per ply/turn, d: length of the game in plies/turns, b^d: number of possible sequences; source: Wikipedia):

board game | b     | d     | b^d
Chess      | ≈ 35  | ≈ 80  | ≈ 10^123
Go         | ≈ 250 | ≈ 150 | ≈ 10^359

Deep Learning: AlphaGo
• AlphaGo is a computer program developed by Alphabet Inc.'s Google DeepMind to play the board game Go.
• It uses a combination of machine learning and tree search techniques, combined with extensive training, both from human and computer play.
• AlphaGo uses Monte Carlo tree search, guided by a "value network" and a "policy network", both of which are implemented using deep neural networks.
• A limited amount of game-specific feature detection is applied to the input before it is sent to the neural networks.
• The neural networks were bootstrapped from human gameplay experience. Later AlphaGo was set up to play against other instances of itself, using reinforcement learning to improve its play.

60. Deep Learning: AlphaGo

Performance of different AlphaGo configurations (Elo rating):

version  | threads | CPUs | GPUs | Elo
Async.   | 1       | 48   | 8    | 2203
Async.   | 2       | 48   | 8    | 2393
Async.   | 4       | 48   | 8    | 2564
Async.   | 8       | 48   | 8    | 2665
Async.   | 16      | 48   | 8    | 2778
Async.   | 32      | 48   | 8    | 2867
Async.   | 40      | 48   | 1    | 2181
Async.   | 40      | 48   | 2    | 2738
Async.   | 40      | 48   | 4    | 2850
Async.   | 40      | 48   | 8    | 2890
Distrib. | 12      | 428  | 64   | 2937
Distrib. | 24      | 764  | 112  | 3079
Distrib. | 40      | 1202 | 176  | 3140
Distrib. | 64      | 1920 | 280  | 3168

• Match against Fan Hui (Elo 3016 on 01-01-2016, #512 on the world ranking list, best European player at the time of the match): AlphaGo wins 5 : 0.
• Match against Lee Sedol (Elo 3546 on 01-01-2016, #1 on the world ranking list 2007–2011, #4 at the time of the match): AlphaGo wins 4 : 1.
(Elo ratings: www.goratings.org)

Deep Learning: AlphaGo vs Lee Sedol, Game 1
[picture not available in online version]
First game of the match between AlphaGo and Lee Sedol.
Lee Sedol: black pieces (first move), AlphaGo: white pieces; AlphaGo won.

Radial Basis Function Networks

A radial basis function network (RBFN) is a neural network with a graph G = (U, C) that satisfies the following conditions:
(i) U_in ∩ U_out = ∅,
(ii) C = (U_in × U_hidden) ∪ C′, with C′ ⊆ (U_hidden × U_out).

The network input function of each hidden neuron is a distance function of the input vector and the weight vector, that is,
\[ \forall u \in U_{\text{hidden}}: \quad f_{\text{net}}^{(u)}(\vec{w}_u, \vec{\text{in}}_u) = d(\vec{w}_u, \vec{\text{in}}_u), \]
where \( d: \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}_0^+ \) is a function satisfying, for all \( \vec{x}, \vec{y}, \vec{z} \in \mathbb{R}^n \):
(i) d(x, y) = 0 ⇔ x = y,
(ii) d(x, y) = d(y, x) (symmetry),
(iii) d(x, z) ≤ d(x, y) + d(y, z) (triangle inequality).

61. Distance Functions

Illustration of distance functions: Minkowski family
\[ d_k(\vec{x}, \vec{y}) = \Big( \sum_{i=1}^n |x_i - y_i|^k \Big)^{\frac{1}{k}} \]
Well-known special cases from this family are (see the sketch below):
k = 1: Manhattan or city block distance,
k = 2: Euclidean distance (the only isotropic distance),
k → ∞: maximum distance, that is, \( d_\infty(\vec{x}, \vec{y}) = \max_{i=1}^n |x_i - y_i| \).
[figure: unit circles for k = 1, k = 2 and k → ∞]

Radial Basis Function Networks (continued)

The network input function of the output neurons is the weighted sum of their inputs, that is,
\[ \forall u \in U_{\text{out}}: \quad f_{\text{net}}^{(u)}(\vec{w}_u, \vec{\text{in}}_u) = \vec{w}_u^\top \vec{\text{in}}_u = \sum_{v \in \text{pred}(u)} w_{uv}\, \text{out}_v. \]
The activation function of each hidden neuron is a so-called radial function, that is, a monotonically decreasing function
\[ f: \mathbb{R}_0^+ \to [0, 1] \quad \text{with} \quad f(0) = 1 \quad \text{and} \quad \lim_{x \to \infty} f(x) = 0. \]
The activation function of each output neuron is a linear function, namely
\[ f_{\text{act}}^{(u)}(\text{net}_u, \theta_u) = \text{net}_u - \theta_u. \]
(The linear activation function is important for the initialization.)

Radial Activation Functions

rectangle function:
\[ f_{\text{act}}(\text{net}, \sigma) = \begin{cases} 0 & \text{if net} > \sigma, \\ 1 & \text{otherwise}, \end{cases} \]
triangle function:
\[ f_{\text{act}}(\text{net}, \sigma) = \begin{cases} 0 & \text{if net} > \sigma, \\ 1 - \frac{\text{net}}{\sigma} & \text{otherwise}, \end{cases} \]
cosine until zero:
\[ f_{\text{act}}(\text{net}, \sigma) = \begin{cases} 0 & \text{if net} > 2\sigma, \\ \frac{1}{2}\Big( \cos\big( \frac{\pi}{2\sigma}\, \text{net} \big) + 1 \Big) & \text{otherwise}, \end{cases} \]
Gaussian function:
\[ f_{\text{act}}(\text{net}, \sigma) = e^{-\frac{\text{net}^2}{2\sigma^2}}. \]
[plots of the four radial functions; the Gaussian function takes the values e^{−1/2} at net = σ and e^{−2} at net = 2σ]

Radial Basis Function Networks: Examples

Radial basis function networks for the conjunction x_1 ∧ x_2:
[figure: a network with one hidden neuron with center (1, 1) and radius 1/2, output weight 1 and output threshold 0, with its circular decision region around (1, 1); and an alternative network with center (0, 0), radius 6/5, output weight −1 and output threshold −1, whose decision region excludes the neighborhood of (0, 0)]
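The Minkowski family in code, with the three special cases.

```python
import numpy as np

def minkowski(x, y, k):
    """Minkowski distance d_k; k = 1 Manhattan, k = 2 Euclidean."""
    return np.sum(np.abs(x - y) ** k) ** (1.0 / k)

x, y = np.array([1.0, 2.0]), np.array([4.0, 6.0])
print(minkowski(x, y, 1))      # 7.0 (city block)
print(minkowski(x, y, 2))      # 5.0 (Euclidean)
print(np.max(np.abs(x - y)))   # 4.0 (maximum distance, k -> infinity)
```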

62. Radial Basis Function Networks: Examples

Radial basis function networks for the biimplication x_1 ↔ x_2:
Idea: logical decomposition  x_1 ↔ x_2 ≡ (x_1 ∧ x_2) ∨ ¬(x_1 ∨ x_2)
truth table: (0, 0) ↦ 1, (1, 0) ↦ 0, (0, 1) ↦ 0, (1, 1) ↦ 1
[figure: a network with two hidden neurons with centers (1, 1) and (0, 0), both with radius 1/2, output weights 1 and 1, and output threshold 0; plot of the two circular decision regions]

Radial Basis Function Networks: Function Approximation
[figure: a function y(x) given at the points x_1, ..., x_4 with values y_1, ..., y_4, approximated by a step function that is a weighted sum 1·y_1, ..., 1·y_4 of rectangular pulses]
• Approximation of a function by rectangular pulses, each of which can be represented by a neuron of a radial basis function network.

[figure: the same function approximated by a weighted sum of triangular pulses]
• Approximation of a function by triangular pulses, each of which can be represented by a neuron of a radial basis function network.

[figure: a radial basis function network with one input x, four hidden neurons with centers x_1, ..., x_4 and radii σ, and output weights y_1, ..., y_4 (output threshold 0)]
• A radial basis function network that computes the step function shown above or the piecewise linear function, depending on the activation function, with
\[ \sigma = \frac{1}{2}\Delta x = \frac{1}{2}(x_{i+1} - x_i). \]
A sketch of this construction follows below.
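A sketch of the pulse-based approximation; the sample points and values are made up for illustration. With the triangle activation this reproduces the piecewise linear interpolation, with the rectangle activation the step function.

```python
import numpy as np

def rbf_approx(x, centers, heights, sigma, phi):
    """Weighted sum of radial pulses: one hidden neuron per center,
    output weights equal to the sampled function values."""
    return sum(y * phi(np.abs(x - c), sigma)
               for c, y in zip(centers, heights))

triangle = lambda net, s: np.where(net > s, 0.0, 1.0 - net / s)

centers = np.array([1.0, 2.0, 3.0, 4.0])   # x_1 ... x_4
heights = np.array([1.0, 3.0, 2.0, 4.0])   # y_1 ... y_4 (illustrative)
sigma = 0.5 * (centers[1] - centers[0])    # sigma = (x_{i+1} - x_i) / 2
xs = np.linspace(0.5, 4.5, 9)
print(rbf_approx(xs, centers, heights, sigma, triangle))
```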

63. Radial Basis Function Networks: Function Approximation

Radial basis function network for a sum of three Gaussian functions:
[figure: a network with one input x and three hidden neurons (Gaussian activation, radius σ = 1) feeding one output neuron with weights w_1 = 1, w_2 = 3 and w_3 = −2; plot of the resulting function for x ∈ [0, 8]]
• Approximation of a function by Gaussian functions with radius σ = 1; the output weights are w_1 = 1, w_2 = 3 and w_3 = −2.
• The weights of the connections from the input neuron to the hidden neurons determine the locations of the Gaussian functions.
• The weights of the connections from the hidden neurons to the output neuron determine the height/direction (upward or downward) of the Gaussian functions.

Training Radial Basis Function Networks

Radial Basis Function Networks: Initialization

Let \( L_{\text{fixed}} = \{l_1, \ldots, l_m\} \) be a fixed learning task, consisting of m training patterns \( l = (\vec{\imath}^{(l)}, \vec{o}^{(l)}) \).

Simple radial basis function network: one hidden neuron v_k, k = 1, ..., m, for each training pattern:
\[ \forall k \in \{1, \ldots, m\}: \quad \vec{w}_{v_k} = \vec{\imath}^{(l_k)}. \]
If the activation function is the Gaussian function, the radii σ_k are chosen heuristically as
\[ \forall k \in \{1, \ldots, m\}: \quad \sigma_k = \frac{d_{\max}}{\sqrt{2m}}, \quad \text{where} \quad d_{\max} = \max_{l_j, l_k \in L_{\text{fixed}}} d\big( \vec{\imath}^{(l_j)}, \vec{\imath}^{(l_k)} \big). \]
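The initialization of a simple radial basis function network in code; applied to the four biimplication patterns it yields exactly the radius 1/2 used in the earlier examples.

```python
import numpy as np

def init_simple_rbfn(X):
    """Simple RBFN initialization: one hidden neuron per training
    pattern (centers = input vectors) and Gaussian radii chosen
    heuristically as sigma_k = d_max / sqrt(2 m)."""
    m = len(X)
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    sigmas = np.full(m, d.max() / np.sqrt(2 * m))
    return X.copy(), sigmas

X = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.]])  # biimplication
centers, sigmas = init_simple_rbfn(X)
print(sigmas)   # d_max = sqrt(2), m = 4  ->  sigma_k = 0.5 for all k
```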

64. Radial Basis Function Networks: Initialization

Initializing the connections from the hidden to the output neurons: for every output neuron u and every training pattern l we require
\[ \sum_{k=1}^m w_{u v_k}\, \text{out}_{v_k}^{(l)} - \theta_u = o_u^{(l)}, \qquad \text{or abbreviated} \qquad \mathbf{A} \cdot \vec{w}_u = \vec{o}_u, \]
where \( \vec{o}_u = (o_u^{(l_1)}, \ldots, o_u^{(l_m)})^\top \) is the vector of desired outputs, θ_u = 0, and
\[ \mathbf{A} = \begin{pmatrix} \text{out}_{v_1}^{(l_1)} & \text{out}_{v_2}^{(l_1)} & \cdots & \text{out}_{v_m}^{(l_1)} \\ \text{out}_{v_1}^{(l_2)} & \text{out}_{v_2}^{(l_2)} & \cdots & \text{out}_{v_m}^{(l_2)} \\ \vdots & \vdots & & \vdots \\ \text{out}_{v_1}^{(l_m)} & \text{out}_{v_2}^{(l_m)} & \cdots & \text{out}_{v_m}^{(l_m)} \end{pmatrix}. \]
This is a linear equation system that can be solved by inverting the matrix A:
\[ \vec{w}_u = \mathbf{A}^{-1} \cdot \vec{o}_u. \]

RBFN Initialization: Example

Simple radial basis function network for the biimplication x_1 ↔ x_2:
[figure: four hidden neurons with centers (0, 0), (1, 0), (0, 1), (1, 1), radius 1/2, output weights w_1, ..., w_4 and output threshold 0]
truth table: (0, 0) ↦ 1, (1, 0) ↦ 0, (0, 1) ↦ 0, (1, 1) ↦ 1

\[ \mathbf{A} = \begin{pmatrix} 1 & e^{-2} & e^{-2} & e^{-4} \\ e^{-2} & 1 & e^{-4} & e^{-2} \\ e^{-2} & e^{-4} & 1 & e^{-2} \\ e^{-4} & e^{-2} & e^{-2} & 1 \end{pmatrix} \qquad \mathbf{A}^{-1} = \frac{1}{D} \begin{pmatrix} a & b & b & c \\ b & a & c & b \\ b & c & a & b \\ c & b & b & a \end{pmatrix} \]
where
D = 1 − 4e^{−4} + 6e^{−8} − 4e^{−12} + e^{−16} ≈ 0.9287,
a = 1 − 2e^{−4} + e^{−8} ≈ 0.9637,
b = −e^{−2} + 2e^{−6} − e^{−10} ≈ −0.1304,
c = e^{−4} − 2e^{−8} + e^{−12} ≈ 0.0177.
\[ \vec{w}_u = \mathbf{A}^{-1} \cdot \vec{o}_u = \frac{1}{D} \begin{pmatrix} a + c \\ 2b \\ 2b \\ a + c \end{pmatrix} \approx \begin{pmatrix} 1.0567 \\ -0.2809 \\ -0.2809 \\ 1.0567 \end{pmatrix} \]
[figure: plots of a single basis function, all basis functions, and the network output]
• The initialization already leads to a perfect solution of the learning task.
• Subsequent training is not necessary.
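The example can be reproduced numerically; a sketch using NumPy's matrix inverse rather than the closed form.

```python
import numpy as np

def gaussian(net, sigma=0.5):
    return np.exp(-net ** 2 / (2 * sigma ** 2))

X = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.]])  # training inputs
o = np.array([1., 0., 0., 1.])                          # desired outputs

# A[j, k]: activation of hidden neuron k (center X[k]) for pattern j
A = gaussian(np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)))
w = np.linalg.inv(A) @ o        # solve A w = o exactly (theta = 0)
print(np.round(w, 4))           # [ 1.0567 -0.2809 -0.2809  1.0567]
print(np.round(A @ w, 4))       # [1. 0. 0. 1.]: perfect solution
```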

65. Radial Basis Function Networks: Initialization

Normal radial basis function networks: select a subset of k training patterns as centers. Then the system A · w_u = o_u has the matrix
\[ \mathbf{A} = \begin{pmatrix} 1 & \text{out}_{v_1}^{(l_1)} & \cdots & \text{out}_{v_k}^{(l_1)} \\ 1 & \text{out}_{v_1}^{(l_2)} & \cdots & \text{out}_{v_k}^{(l_2)} \\ \vdots & \vdots & & \vdots \\ 1 & \text{out}_{v_1}^{(l_m)} & \cdots & \text{out}_{v_k}^{(l_m)} \end{pmatrix}. \]
Compute the (Moore–Penrose) pseudo-inverse:
\[ \mathbf{A}^+ = (\mathbf{A}^\top \mathbf{A})^{-1} \mathbf{A}^\top. \]
The weights can then be computed by
\[ \vec{w}_u = \mathbf{A}^+ \cdot \vec{o}_u = (\mathbf{A}^\top \mathbf{A})^{-1} \mathbf{A}^\top \cdot \vec{o}_u. \]

RBFN Initialization: Example

Normal radial basis function network for the biimplication x_1 ↔ x_2. Select two training patterns:
• l_1 = (ı⃗^(l_1), o⃗^(l_1)) = ((0, 0), (1))
• l_4 = (ı⃗^(l_4), o⃗^(l_4)) = ((1, 1), (1))
[figure: two hidden neurons with centers (0, 0) and (1, 1), radius 1/2, output weights w_1, w_2 and output threshold θ]

\[ \mathbf{A} = \begin{pmatrix} 1 & 1 & e^{-4} \\ 1 & e^{-2} & e^{-2} \\ 1 & e^{-2} & e^{-2} \\ 1 & e^{-4} & 1 \end{pmatrix} \qquad \mathbf{A}^+ = (\mathbf{A}^\top \mathbf{A})^{-1} \mathbf{A}^\top = \begin{pmatrix} a & b & b & a \\ c & d & d & e \\ e & d & d & c \end{pmatrix} \]
where
a ≈ −0.1810, b ≈ 0.6810, c ≈ 1.1781, d ≈ −0.6688, e ≈ 0.1594.
Resulting weights:
\[ \vec{w}_u = \begin{pmatrix} -\theta \\ w_1 \\ w_2 \end{pmatrix} = \mathbf{A}^+ \cdot \vec{o}_u \approx \begin{pmatrix} -0.3620 \\ 1.3375 \\ 1.3375 \end{pmatrix}. \]
[figure: plots of the basis functions centered at (0, 0) and (1, 1), and the network output]
• The initialization already leads to a perfect solution of the learning task.
• This is an accident, because the linear equation system is not over-determined, due to linearly dependent equations.
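The same computation for the normal radial basis function network with the two selected centers; the first column of ones accounts for the bias/threshold term −θ.

```python
import numpy as np

e = np.exp
# two centers (0,0) and (1,1), Gaussian activation with sigma = 1/2
A = np.array([[1.0, 1.0,     e(-4.0)],
              [1.0, e(-2.0), e(-2.0)],
              [1.0, e(-2.0), e(-2.0)],
              [1.0, e(-4.0), 1.0    ]])
o = np.array([1.0, 0.0, 0.0, 1.0])

A_plus = np.linalg.inv(A.T @ A) @ A.T   # Moore-Penrose pseudo-inverse
w = A_plus @ o                          # (-theta, w_1, w_2)
print(np.round(w, 4))                   # approx [-0.362, 1.3375, 1.3375]
```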

66. Radial Basis Function Networks: Initialization

How to choose the radial basis function centers?
• Use all data points as centers for the radial basis functions.
  ◦ Advantages: Only the radius and the output weights need to be determined; the desired output values can be achieved exactly (unless there are inconsistencies).
  ◦ Disadvantage: Often much too many radial basis functions; computing the weights to the output neuron via a pseudo-inverse can become infeasible.
• Use a random subset of data points as centers for the radial basis functions.
  ◦ Advantages: Fast; only the radius and the output weights need to be determined.
  ◦ Disadvantage: Performance depends heavily on the choice of data points.
• Use the result of clustering as centers for the radial basis functions, e.g.
  ◦ c-means clustering (on the next slides),
  ◦ learning vector quantization (to be discussed later).

RBFN Initialization: c-Means Clustering
• Choose a number c of clusters to be found (user input). (From visual inspection, a good number can be difficult to determine in general.)
• Initialize the cluster centers randomly, for instance by randomly selecting c data points. (Alternative methods include e.g. Latin hypercube sampling.)
• Data point assignment: Assign each data point to the cluster center that is closest to it (that is, closer than any other cluster center).
• Cluster center update: Compute new cluster centers as the mean vectors of the assigned data points. (Intuitively: the center of gravity if each data point has unit weight.)
• Repeat these two steps (data point assignment and cluster center update) until the cluster centers do not change anymore. (A compact sketch follows below.)
It can be shown that this scheme must converge, that is, the update of the cluster centers cannot go on forever.

c-Means Clustering: Example
[figures: the data set to cluster and the initial position of the cluster centers (c = 3 randomly selected data points); dots represent cluster centers]

Delaunay Triangulations and Voronoi Diagrams
[figures]
• Left: Delaunay triangulation (the circle through the corners of a triangle does not contain another point).
• Right: Voronoi diagram / tesselation (midperpendiculars of the Delaunay triangulation: boundaries of the regions of points that are closest to the enclosed cluster center (Voronoi cells)).
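A compact c-means sketch following the two alternating steps (with the old center kept if a cluster happens to become empty); the example data are three well-separated blobs.

```python
import numpy as np

def c_means(X, c, rng=np.random.default_rng(0)):
    """c-means clustering: alternate data point assignment and
    cluster center update until the centers no longer change."""
    centers = X[rng.choice(len(X), size=c, replace=False)]
    while True:
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)           # closest center per point
        new = np.array([X[labels == j].mean(axis=0)
                        if np.any(labels == j) else centers[j]
                        for j in range(c)]) # mean of assigned points
        if np.allclose(new, centers):
            return centers, labels
        centers = new

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(m, 0.3, size=(20, 2)) for m in (0.0, 2.0, 4.0)])
centers, labels = c_means(X, c=3)
print(np.round(centers, 2))
```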

67. Delaunay Triangulations and Voronoi Diagrams

[figures]
• Delaunay triangulation: the simple triangle (shown in gray on the left).
• Voronoi diagram: midperpendiculars of the triangle's edges (shown in blue on the left, in gray on the right).

c-Means Clustering: Example
[figures: the sequence of data point assignments and cluster center updates until convergence]

Radial Basis Function Networks: Training

Training radial basis function networks: The derivation of the update rules is analogous to that of multi-layer perceptrons.

Weights from the hidden to the output neurons. Gradient:
\[ \nabla_{\vec{w}_u} e^{(l)} = \frac{\partial e^{(l)}}{\partial \vec{w}_u} = -2\, \big( o_u^{(l)} - \text{out}_u^{(l)} \big)\, \vec{\text{in}}_u^{(l)}, \]
weight update rule (see the sketch below):
\[ \Delta \vec{w}_u^{(l)} = -\frac{\eta_3}{2} \nabla_{\vec{w}_u} e^{(l)} = \eta_3\, \big( o_u^{(l)} - \text{out}_u^{(l)} \big)\, \vec{\text{in}}_u^{(l)}. \]
Typical learning rate: η_3 ≈ 0.001. (Two more learning rates are needed for the center coordinates and the radii.)

Center coordinates (weights from the input to the hidden neurons). Gradient:
\[ \nabla_{\vec{w}_v} e^{(l)} = \frac{\partial e^{(l)}}{\partial \vec{w}_v} = -2 \sum_{s \in \text{succ}(v)} \big( o_s^{(l)} - \text{out}_s^{(l)} \big)\, w_{sv}\, \frac{\partial\, \text{out}_v^{(l)}}{\partial\, \text{net}_v^{(l)}} \frac{\partial\, \text{net}_v^{(l)}}{\partial \vec{w}_v}, \]
weight update rule:
\[ \Delta \vec{w}_v^{(l)} = -\frac{\eta_1}{2} \nabla_{\vec{w}_v} e^{(l)} = \eta_1 \sum_{s \in \text{succ}(v)} \big( o_s^{(l)} - \text{out}_s^{(l)} \big)\, w_{sv}\, \frac{\partial\, \text{out}_v^{(l)}}{\partial\, \text{net}_v^{(l)}} \frac{\partial\, \text{net}_v^{(l)}}{\partial \vec{w}_v}. \]
Typical learning rate: η_1 ≈ 0.02.
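A one-step sketch of the update rule for the output weights; the center and radius updates follow the same pattern with their own learning rates.

```python
import numpy as np

def update_output_weights(w_u, in_u, o_u, out_u, eta3=0.001):
    """Gradient step for the weights from hidden to output neurons:
    delta w_u = eta3 * (o_u - out_u) * in_u (linear output neuron)."""
    return w_u + eta3 * (o_u - out_u) * np.asarray(in_u)
```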

68. Radial Basis Function Networks: Training

Center coordinates, special case: Euclidean distance.
\[ \frac{\partial\, \text{net}_v^{(l)}}{\partial \vec{w}_v} = \Big( \sum_{i=1}^n \big( w_{v p_i} - \text{out}_{p_i}^{(l)} \big)^2 \Big)^{-\frac{1}{2}} \big( \vec{w}_v - \vec{\text{in}}_v^{(l)} \big). \]
Special case: Gaussian activation function.
\[ \frac{\partial\, \text{out}_v^{(l)}}{\partial\, \text{net}_v^{(l)}} = \frac{\partial f_{\text{act}}(\text{net}_v^{(l)}, \sigma_v)}{\partial\, \text{net}_v^{(l)}} = \frac{\partial}{\partial\, \text{net}_v^{(l)}}\, e^{-\frac{(\text{net}_v^{(l)})^2}{2\sigma_v^2}} = -\frac{\text{net}_v^{(l)}}{\sigma_v^2}\, e^{-\frac{(\text{net}_v^{(l)})^2}{2\sigma_v^2}}. \]

Radii of the radial basis functions. Gradient:
\[ \frac{\partial e^{(l)}}{\partial \sigma_v} = -2 \sum_{s \in \text{succ}(v)} \big( o_s^{(l)} - \text{out}_s^{(l)} \big)\, w_{sv}\, \frac{\partial\, \text{out}_v^{(l)}}{\partial \sigma_v}, \]
weight update rule:
\[ \Delta \sigma_v^{(l)} = -\frac{\eta_2}{2} \frac{\partial e^{(l)}}{\partial \sigma_v} = \eta_2 \sum_{s \in \text{succ}(v)} \big( o_s^{(l)} - \text{out}_s^{(l)} \big)\, w_{sv}\, \frac{\partial\, \text{out}_v^{(l)}}{\partial \sigma_v}. \]
Typical learning rate: η_2 ≈ 0.01.
Special case: Gaussian activation function.
\[ \frac{\partial\, \text{out}_v^{(l)}}{\partial \sigma_v} = \frac{\partial}{\partial \sigma_v}\, e^{-\frac{(\text{net}_v^{(l)})^2}{2\sigma_v^2}} = \frac{(\text{net}_v^{(l)})^2}{\sigma_v^3}\, e^{-\frac{(\text{net}_v^{(l)})^2}{2\sigma_v^2}}. \]
(The distance function is irrelevant for the radius update, since it only enters the network input function.)

Radial Basis Function Networks: Generalization

Generalization of the distance function. Idea: use an anisotropic (direction dependent) distance function, for example the Mahalanobis distance:
\[ d(\vec{x}, \vec{y}) = \sqrt{(\vec{x} - \vec{y})^\top \boldsymbol{\Sigma}^{-1} (\vec{x} - \vec{y})}. \]
Example: biimplication.
[figure: a network with a single hidden neuron with center (1/2, 1/2), radius 2/3 and covariance matrix
\[ \boldsymbol{\Sigma} = \begin{pmatrix} 9 & 8 \\ 8 & 9 \end{pmatrix}, \]
output weight 1 and output threshold 0; the elliptical decision region covers (0, 0) and (1, 1) but excludes (1, 0) and (0, 1)]
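The Mahalanobis distance in code, evaluated for the biimplication example: the corners of equal class end up close to the center (1/2, 1/2) (distance ≈ 0.17), the corners of different class end up far (≈ 0.71), so a radius of 2/3 separates them.

```python
import numpy as np

def mahalanobis(x, y, Sigma):
    """Anisotropic (direction dependent) distance:
    d(x, y) = sqrt((x - y)^T Sigma^{-1} (x - y))."""
    d = x - y
    return float(np.sqrt(d @ np.linalg.inv(Sigma) @ d))

Sigma = np.array([[9.0, 8.0], [8.0, 9.0]])   # from the example above
center = np.array([0.5, 0.5])
for p in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(p, round(mahalanobis(np.array(p, float), center, Sigma), 3))
```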

69. Application: Recognition of Handwritten Digits

[picture not available in online version]
• Images of 20,000 handwritten digits (2,000 per class), split into a training and a test data set of 10,000 samples each (1,000 per class).
• Represented in a normalized fashion as 16 × 16 gray values in {0, ..., 255}.
• The data was originally used in the StatLog project [Michie et al. 1994].

• Comparison of various classifiers:
  ◦ Nearest Neighbor (1NN)
  ◦ Learning Vector Quantization (LVQ)
  ◦ Decision Tree (C4.5)
  ◦ Radial Basis Function Network (RBF)
  ◦ Multi-Layer Perceptron (MLP)
  ◦ Support Vector Machine (SVM)
• Distinction by the number of RBF training phases:
  ◦ 1 phase: find the output connection weights, e.g. with the pseudo-inverse.
  ◦ 2 phase: find the RBF centers, e.g. with clustering, plus 1 phase.
  ◦ 3 phase: 2 phase plus error backpropagation training.
• Initialization of the radial basis function centers:
  ◦ Random choice of data points
  ◦ c-means clustering
  ◦ Learning vector quantization
  ◦ Decision tree (one RBF center per leaf)

[picture not available in online version]
• The 60 cluster centers (6 per class) resulting from c-means clustering. (Clustering was conducted with c = 6 for each class separately.)
• Initial cluster centers were selected randomly from the training data.
• The weights of the connections to the output neuron were computed with the pseudo-inverse method.

[picture not available in online version]
• The 60 cluster centers (6 per class) after training the radial basis function network with error backpropagation.
• Differences between the initial and the trained centers of the radial basis functions appear to be fairly small, but ...

70. Application: Recognition of Handwritten Digits

[picture not available in online version]
• Distance matrices showing the Euclidean distances of the 60 radial basis function centers before and after training.
• Centers are sorted by class/digit: the first 6 rows/columns refer to digit 0, the next 6 rows/columns to digit 1, etc.
• Distances are encoded as gray values: darker means smaller distance.
• Before training (left): many distances between centers of different classes/digits are small (e.g. 2-3, 3-8, 3-9, 5-8, 5-9), which increases the chance of misclassifications.
• After training (right): only very few small distances between centers of different classes/digits; basically all small distances are between centers of the same class/digit.

Classification results:

Classifier                         | Accuracy
Nearest Neighbor (1NN)             | 97.68%
Learning Vector Quantization (LVQ) | 96.99%
Decision Tree (C4.5)               | 91.12%
2-Phase-RBF (data points)          | 95.24%
2-Phase-RBF (c-means)              | 96.94%
2-Phase-RBF (LVQ)                  | 95.86%
2-Phase-RBF (C4.5)                 | 92.72%
3-Phase-RBF (data points)          | 97.23%
3-Phase-RBF (c-means)              | 98.06%
3-Phase-RBF (LVQ)                  | 98.49%
3-Phase-RBF (C4.5)                 | 94.83%
Support Vector Machine (SVM)       | 98.76%
Multi-Layer Perceptron (MLP)       | 97.59%

Model sizes: LVQ: 200 vectors (20 per class); C4.5: 505 leaves; c-means: 60 centers (6 per class); SVM: 10 classifiers, ≈ 4200 vectors; MLP: 1 hidden layer with 200 neurons.
• Results are medians of three training/test runs.
• Error backpropagation improves the RBF results.

71. Learning Vector Quantization

• Up to now: fixed learning tasks
  ◦ The data consists of input/output pairs.
  ◦ The objective is to produce the desired output for a given input.
  ◦ This allows us to describe training as error minimization.
• Now: free learning tasks
  ◦ The data consists only of input values/vectors.
  ◦ The objective is to produce similar output for similar input (clustering).
• Learning vector quantization:
  ◦ Find a suitable quantization (many-to-few mapping, often to a finite set) of the input space, e.g. a tesselation of a Euclidean space.
  ◦ Training adapts the coordinates of so-called reference or codebook vectors, each of which defines a region in the input space.

Reminder: Delaunay Triangulations and Voronoi Diagrams
[figures]
• Dots represent vectors that are used for quantizing the area.
• Left: Delaunay triangulation (the circle through the corners of a triangle does not contain another point).
• Right: Voronoi diagram / tesselation (midperpendiculars of the Delaunay triangulation: boundaries of the regions of points that are closest to the enclosed cluster center (Voronoi cells)).

Learning Vector Quantization Networks

A learning vector quantization network (LVQ) is a neural network with a graph G = (U, C) that satisfies the following conditions:
(i) U_in ∩ U_out = ∅, U_hidden = ∅,
(ii) C = U_in × U_out.

The network input function of each output neuron is a distance function of the input vector and the weight vector, that is,
\[ \forall u \in U_{\text{out}}: \quad f_{\text{net}}^{(u)}(\vec{w}_u, \vec{\text{in}}_u) = d(\vec{w}_u, \vec{\text{in}}_u), \]
where \( d: \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}_0^+ \) is a function satisfying, for all \( \vec{x}, \vec{y}, \vec{z} \in \mathbb{R}^n \):
(i) d(x, y) = 0 ⇔ x = y,
(ii) d(x, y) = d(y, x) (symmetry),
(iii) d(x, z) ≤ d(x, y) + d(y, z) (triangle inequality).

Finding clusters in a given set of data points:
[figure]
• Data points are represented by empty circles (◦).
• Cluster centers are represented by full circles (•).

72. Reminder: Distance Functions

Illustration of distance functions: Minkowski family
\[ d_k(\vec{x}, \vec{y}) = \Big( \sum_{i=1}^n |x_i - y_i|^k \Big)^{\frac{1}{k}} \]
Well-known special cases from this family are:
k = 1: Manhattan or city block distance,
k = 2: Euclidean distance,
k → ∞: maximum distance, that is, \( d_\infty(\vec{x}, \vec{y}) = \max_{i=1}^n |x_i - y_i| \).
[figure: unit circles for k = 1, k = 2 and k → ∞]

Learning Vector Quantization

The activation function of each output neuron is a so-called radial function, that is, a monotonically decreasing function
\[ f: \mathbb{R}_0^+ \to [0, \infty] \quad \text{with} \quad f(0) = 1 \quad \text{and} \quad \lim_{x \to \infty} f(x) = 0. \]
Sometimes the range of values is restricted to the interval [0, 1]. However, due to the special output function this restriction is irrelevant.

The output function of each output neuron is not a simple function of the activation of the neuron. Rather it takes into account the activations of all output neurons:
\[ f_{\text{out}}^{(u)}(\text{act}_u) = \begin{cases} 1 & \text{if } \text{act}_u = \max_{v \in U_{\text{out}}} \text{act}_v, \\ 0 & \text{otherwise}. \end{cases} \]
If more than one unit has the maximal activation, one is selected at random to have an output of 1; all others are set to output 0: winner-takes-all principle.

Radial Activation Functions
[reminder: plots and formulas of the rectangle, triangle, cosine-until-zero and Gaussian functions, as given earlier for radial basis function networks]

Learning Vector Quantization

Adaptation of reference vectors / codebook vectors (see the sketch below):
• For each training pattern find the closest reference vector.
• Adapt only this reference vector (the winner neuron).
• For classified data the class may be taken into account: each reference vector is assigned to a class.

Attraction rule (data point and reference vector have the same class):
\[ \vec{r}^{\,\text{(new)}} = \vec{r}^{\,\text{(old)}} + \eta\, (\vec{x} - \vec{r}^{\,\text{(old)}}), \]
Repulsion rule (data point and reference vector have different classes):
\[ \vec{r}^{\,\text{(new)}} = \vec{r}^{\,\text{(old)}} - \eta\, (\vec{x} - \vec{r}^{\,\text{(old)}}). \]
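A minimal sketch of one online LVQ step with the attraction/repulsion rules (Euclidean distance assumed).

```python
import numpy as np

def lvq_step(refs, classes, x, c, eta=0.4):
    """One LVQ update: find the closest reference vector (winner)
    and attract it if its class matches the data point's class c,
    otherwise repel it."""
    k = np.argmin(np.linalg.norm(refs - x, axis=1))   # winner neuron
    sign = 1.0 if classes[k] == c else -1.0
    refs[k] = refs[k] + sign * eta * (x - refs[k])
    return refs

refs = np.array([[0.0, 0.0], [1.0, 1.0]])
classes = [0, 1]
refs = lvq_step(refs, classes, np.array([0.2, 0.0]), c=1)
print(refs)   # winner (0, 0) has the wrong class and is repelled
```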

73. Learning Vector Quantization

Adaptation of reference vectors / codebook vectors
• x: data point, r_i: reference vector
• η = 0.4 (learning rate)
(Figure: the winner reference vector is shifted by ηd toward or away from the data point x.)

Learning Vector Quantization: Example

Adaptation of reference vectors / codebook vectors
• Left: online training with learning rate η = 0.1.
• Right: batch training with learning rate η = 0.05.

Learning Vector Quantization: Learning Rate Decay

Problem: a fixed learning rate can lead to oscillations.

Solution: time-dependent learning rate
η(t) = η₀ α^t, 0 < α < 1, or η(t) = η₀ t^κ, κ < 0.

Learning Vector Quantization: Classified Data

Improved update rule for classified data
• Idea: Update not only the one reference vector that is closest to the data point (the winner neuron), but update the two closest reference vectors.
• Let x be the currently processed data point and c its class. Let r_j and r_k be the two closest reference vectors and z_j and z_k their classes.
• Reference vectors are updated only if z_j ≠ z_k and either c = z_j or c = z_k. (Without loss of generality we assume c = z_j.)

The update rules for the two closest reference vectors are:
r_j(new) = r_j(old) + η (x − r_j(old)) and
r_k(new) = r_k(old) − η (x − r_k(old)),
while all other reference vectors remain unchanged.
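A minimal sketch of one online LVQ step combining the attraction/repulsion rules with a decaying learning rate. The reference vectors, class labels and the decay parameters are made-up illustrations; only the update formulas come from the slides.

```python
import numpy as np

def lvq_step(refs, ref_classes, x, c, eta):
    """One online LVQ update: move the winner toward x if the classes
    match (attraction rule), away from x otherwise (repulsion rule)."""
    winner = np.argmin(np.linalg.norm(refs - x, axis=1))
    sign = 1.0 if ref_classes[winner] == c else -1.0
    refs[winner] += sign * eta * (x - refs[winner])
    return refs

# Time-dependent learning rate eta(t) = eta0 * alpha**t with 0 < alpha < 1.
eta0, alpha = 0.4, 0.99
refs = np.array([[0.0, 0.0], [1.0, 1.0]])
ref_classes = np.array([0, 1])
data = [(np.array([0.2, 0.1]), 0), (np.array([0.9, 0.8]), 0)]
for t, (x, c) in enumerate(data):
    refs = lvq_step(refs, ref_classes, x, c, eta0 * alpha ** t)
print(refs)
```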

74. Learning Vector Quantization: Window Rule

• It was observed in practical tests that standard learning vector quantization may drive the reference vectors further and further apart.
• To counteract this undesired behavior a window rule was introduced: update only if the data point x is close to the classification boundary.
• "Close to the boundary" is made formally precise by requiring
min( d(x, r_j)/d(x, r_k), d(x, r_k)/d(x, r_j) ) > θ, where θ = (1 − ξ)/(1 + ξ).
ξ is a parameter that has to be specified by a user.
• Intuitively, ξ describes the "width" of the window around the classification boundary, in which the data point has to lie in order to lead to an update.
• Using it prevents divergence, because the update ceases for a data point once the classification boundary has been moved far enough away.

Soft Learning Vector Quantization

• Idea: Use soft assignments instead of winner-takes-all (approach described here: [Seo and Obermayer 2003]).
• Assumption: Given data was sampled from a mixture of normal distributions. Each reference vector describes one normal distribution.
• Closely related to clustering by estimating a mixture of Gaussians.
◦ (Crisp or hard) learning vector quantization can be seen as an "online version" of c-means clustering.
◦ Soft learning vector quantization can be seen as an "online version" of estimating a mixture of Gaussians (that is, of normal distributions).
(In the following: brief review of the Expectation Maximization (EM) algorithm for estimating a mixture of Gaussians.)
• Hardening soft learning vector quantization (by letting the "radii" of the Gaussians go to zero, see below) yields a version of (crisp or hard) learning vector quantization that works well without a window rule.

Expectation Maximization: Mixture of Gaussians

• Assumption: Data was generated by sampling a set of normal distributions. (The probability density is a mixture of Gaussian distributions.)
• Formally: We assume that the probability density can be described as
f_X(x; C) = Σ_{y=1}^c f_{X,Y}(x, y; C) = Σ_{y=1}^c p_Y(y; C) · f_{X|Y}(x|y; C),
where
C is the set of cluster parameters,
X is a random vector that has the data space as its domain,
Y is a random variable that has the cluster indices as possible values (i.e., dom(X) = ℝ^m and dom(Y) = {1, …, c}),
p_Y(y; C) is the probability that a data point belongs to (is generated by) the y-th component of the mixture,
f_{X|Y}(x|y; C) is the conditional probability density function of a data point given the cluster (specified by the cluster index y).

Expectation Maximization

• Basic idea: Do a maximum likelihood estimation of the cluster parameters.
• Problem: The likelihood function,
L(X; C) = Π_{j=1}^n f_X(x_j; C) = Π_{j=1}^n Σ_{y=1}^c p_Y(y; C) · f_{X|Y}(x_j|y; C),
is difficult to optimize, even if one takes the natural logarithm (cf. the maximum likelihood estimation of the parameters of a normal distribution), because
ln L(X; C) = Σ_{j=1}^n ln Σ_{y=1}^c p_Y(y; C) · f_{X|Y}(x_j|y; C)
contains the natural logarithms of complex sums.
• Approach: Assume that there are "hidden" variables Y_j stating the clusters that generated the data points x_j, so that the sums reduce to one term.
• Problem: Since the Y_j are hidden, we do not know their values.
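The window-rule test is a one-line condition; here is a small sketch of it under the assumption of Euclidean distances (note that the ratio is undefined if the data point coincides with a reference vector, a case the sketch does not guard against):

```python
import numpy as np

def in_window(x, r_j, r_k, xi):
    """Window rule: update only if x lies close to the classification
    boundary between the two closest reference vectors r_j and r_k."""
    d_j = np.linalg.norm(x - r_j)
    d_k = np.linalg.norm(x - r_k)
    theta = (1.0 - xi) / (1.0 + xi)
    return min(d_j / d_k, d_k / d_j) > theta

print(in_window(np.array([0.5, 0.0]), np.array([0.0, 0.0]),
                np.array([1.0, 0.0]), xi=0.2))  # True: x lies mid-way
```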

75. Expectation Maximization

• Formally: Maximize the likelihood of the "completed" data set (X, y), where y = (y₁, …, yₙ) combines the values of the variables Y_j. That is,
L(X, y; C) = Π_{j=1}^n f_{X_j,Y_j}(x_j, y_j; C) = Π_{j=1}^n p_{Y_j}(y_j; C) · f_{X_j|Y_j}(x_j|y_j; C).
• Problem: Since the Y_j are hidden, the values y_j are unknown (and thus the factors p_{Y_j}(y_j; C) cannot be computed).
• Approach to find a solution nevertheless:
◦ See the Y_j as random variables (the values y_j are not fixed) and consider a probability distribution over the possible values.
◦ As a consequence L(X, y; C) becomes a random variable, even for a fixed data set X and fixed cluster parameters C.
◦ Try to maximize the expected value of L(X, y; C) or ln L(X, y; C) (hence the name expectation maximization).

Expectation Maximization

• Formally: Find the cluster parameters as
Ĉ = argmax_C E([ln] L(X, y; C) | X; C),
that is, maximize the expected likelihood
E(L(X, y; C) | X; C) = Σ_{y∈{1,…,c}ⁿ} p_{Y|X}(y|X; C) · Π_{j=1}^n f_{X_j,Y_j}(x_j, y_j; C)
or, alternatively, maximize the expected log-likelihood
E(ln L(X, y; C) | X; C) = Σ_{y∈{1,…,c}ⁿ} p_{Y|X}(y|X; C) · Σ_{j=1}^n ln f_{X_j,Y_j}(x_j, y_j; C).
• Unfortunately, these functionals are still difficult to optimize directly.
• Solution: Use the equation as an iterative scheme, fixing C in some terms (iteratively compute better approximations, similar to Heron's algorithm).

Excursion: Heron's Algorithm

• Task: Find the square root of a given number x, i.e., find y = √x.
• Approach: Rewrite the defining equation y² = x as follows:
y² = x ⇔ 2y² = y² + x ⇔ y = (y² + x)/(2y) = ½ (y + x/y).
• Use the resulting equation as an iteration formula, i.e., compute the sequence
y_{k+1} = ½ (y_k + x/y_k) with y₀ = 1.
• It can be shown that 0 ≤ y_k − √x ≤ y_{k−1} − y_k for k ≥ 2. Therefore this iteration formula provides increasingly better approximations of the square root of x and thus is a safe and simple way to compute it.
Example, x = 2: y₀ = 1, y₁ = 1.5, y₂ ≈ 1.41667, y₃ ≈ 1.414216, y₄ ≈ 1.414213.
• Heron's algorithm converges very quickly and is often used in pocket calculators and microprocessors to implement the square root.

Expectation Maximization

• Iterative scheme for expectation maximization: Choose some initial set C₀ of cluster parameters and then compute
C_{k+1} = argmax_C E(ln L(X, y; C) | X; C_k)
= argmax_C Σ_{y∈{1,…,c}ⁿ} p_{Y|X}(y|X; C_k) Σ_{j=1}^n ln f_{X_j,Y_j}(x_j, y_j; C)
= argmax_C Σ_{y∈{1,…,c}ⁿ} (Π_{l=1}^n p_{Y_l|X_l}(y_l|x_l; C_k)) Σ_{j=1}^n ln f_{X_j,Y_j}(x_j, y_j; C)
= argmax_C Σ_{i=1}^c Σ_{j=1}^n p_{Y_j|X_j}(i|x_j; C_k) · ln f_{X_j,Y_j}(x_j, i; C).
• It can be shown that each EM iteration increases the likelihood of the data and that the algorithm converges to a local maximum of the likelihood function (i.e., EM is a safe way to maximize the likelihood function).
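Heron's iteration is short enough to state directly in code; the step count of 5 is an arbitrary choice for the example:

```python
def heron_sqrt(x, steps=5):
    """Heron's algorithm: iterate y <- (y + x/y) / 2, starting from y = 1."""
    y = 1.0
    for _ in range(steps):
        y = 0.5 * (y + x / y)
    return y

print(heron_sqrt(2.0))  # ~1.414214 after only a few iterations
```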

76. Expectation Maximization

Justification of the last step on the previous slide:

Σ_{y∈{1,…,c}ⁿ} (Π_{l=1}^n p_{Y_l|X_l}(y_l|x_l; C_k)) Σ_{j=1}^n ln f_{X_j,Y_j}(x_j, y_j; C)
= Σ_{y₁=1}^c ··· Σ_{yₙ=1}^c (Π_{l=1}^n p_{Y_l|X_l}(y_l|x_l; C_k)) Σ_{j=1}^n Σ_{i=1}^c δ_{i,y_j} ln f_{X_j,Y_j}(x_j, i; C)
= Σ_{i=1}^c Σ_{j=1}^n ln f_{X_j,Y_j}(x_j, i; C) Σ_{y₁=1}^c ··· Σ_{yₙ=1}^c δ_{i,y_j} Π_{l=1}^n p_{Y_l|X_l}(y_l|x_l; C_k)
= Σ_{i=1}^c Σ_{j=1}^n p_{Y_j|X_j}(i|x_j; C_k) · ln f_{X_j,Y_j}(x_j, i; C),
since the remaining sums over the y_l, l ≠ j, factorize:
Σ_{y₁=1}^c ··· Σ_{y_{j−1}=1}^c Σ_{y_{j+1}=1}^c ··· Σ_{yₙ=1}^c Π_{l=1,l≠j}^n p_{Y_l|X_l}(y_l|x_l; C_k) = Π_{l=1,l≠j}^n Σ_{y_l=1}^c p_{Y_l|X_l}(y_l|x_l; C_k) = Π_{l=1,l≠j}^n 1 = 1.

Expectation Maximization

• The probabilities p_{Y_j|X_j}(i|x_j; C_k) are computed as
p_{Y_j|X_j}(i|x_j; C_k) = f_{X_j,Y_j}(x_j, i; C_k) / f_{X_j}(x_j; C_k)
= f_{X_j|Y_j}(x_j|i; C_k) · p_{Y_j}(i; C_k) / Σ_{l=1}^c f_{X_j|Y_j}(x_j|l; C_k) · p_{Y_j}(l; C_k),
that is, as the relative probability densities of the different clusters (as specified by the cluster parameters) at the location of the data points x_j.
• The p_{Y_j|X_j}(i|x_j; C_k) are the posterior probabilities of the clusters given the data point x_j and a set of cluster parameters C_k.
• They can be seen as case weights of a "completed" data set:
◦ Split each data point x_j into c data points (x_j, i), i = 1, …, c.
◦ Distribute the unit weight of the data point x_j according to the above probabilities, i.e., assign to (x_j, i) the weight p_{Y_j|X_j}(i|x_j; C_k), i = 1, …, c.

Expectation Maximization: Cookbook Recipe

Core iteration formula:
C_{k+1} = argmax_C Σ_{i=1}^c Σ_{j=1}^n p_{Y_j|X_j}(i|x_j; C_k) · ln f_{X_j,Y_j}(x_j, i; C).

Expectation step:
• For all data points x_j: Compute for each normal distribution the probability p_{Y_j|X_j}(i|x_j; C_k) that the data point was generated from it (ratio of probability densities at the location of the data point) → "weight" of the data point for the estimation.

Maximization step:
• For all normal distributions: Estimate the parameters by standard maximum likelihood estimation using the probabilities ("weights") assigned to the data points w.r.t. the distribution in the expectation step.

Expectation Maximization: Mixture of Gaussians

Expectation step: Use Bayes' rule to compute
p_{C|X}(i|x; C) = p_C(i; c_i) · f_{X|C}(x|i; c_i) / f_X(x; C) = p_C(i; c_i) · f_{X|C}(x|i; c_i) / Σ_{k=1}^c p_C(k; c_k) · f_{X|C}(x|k; c_k)
→ "weight" of the data point x for the estimation.

Maximization step: Use maximum likelihood estimation to compute
ρ_i^(t+1) = (1/n) Σ_{j=1}^n p_{C|X_j}(i|x_j; C^(t)),
μ_i^(t+1) = Σ_{j=1}^n p_{C|X_j}(i|x_j; C^(t)) · x_j / Σ_{j=1}^n p_{C|X_j}(i|x_j; C^(t)),
Σ_i^(t+1) = Σ_{j=1}^n p_{C|X_j}(i|x_j; C^(t)) · (x_j − μ_i^(t+1))(x_j − μ_i^(t+1))ᵀ / Σ_{j=1}^n p_{C|X_j}(i|x_j; C^(t)).

Iterate until convergence (checked, e.g., by change of mean vector).
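The cookbook recipe maps directly onto a few lines of numpy. The following is a minimal sketch of one EM iteration for a mixture of Gaussians; the synthetic data and the initial parameters are arbitrary, and none of the countermeasures against degenerate covariance matrices (discussed below) are included.

```python
import numpy as np

def gauss(x, mu, cov):
    """Multivariate normal density at x."""
    d = x - mu
    norm = 1.0 / np.sqrt((2.0 * np.pi) ** len(mu) * np.linalg.det(cov))
    return norm * np.exp(-0.5 * d @ np.linalg.inv(cov) @ d)

def em_step(X, priors, mus, covs):
    """One EM iteration: expectation step computes the posterior
    probabilities ('case weights'); maximization step re-estimates the
    parameters by weighted maximum likelihood."""
    n, c = len(X), len(priors)
    w = np.array([[priors[i] * gauss(x, mus[i], covs[i]) for i in range(c)]
                  for x in X])
    w /= w.sum(axis=1, keepdims=True)        # normalize to posteriors
    for i in range(c):
        wi = w[:, i]
        priors[i] = wi.mean()                # prior rho_i
        mus[i] = (wi[:, None] * X).sum(axis=0) / wi.sum()
        D = X - mus[i]
        covs[i] = (wi[:, None, None]
                   * np.einsum('nj,nk->njk', D, D)).sum(axis=0) / wi.sum()
    return priors, mus, covs

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])
priors = np.array([0.5, 0.5])
mus = np.array([[0.0, 0.0], [1.0, 1.0]])
covs = np.array([np.eye(2), np.eye(2)])
for _ in range(20):
    priors, mus, covs = em_step(X, priors, mus, covs)
print(mus)   # the means move toward the two true cluster centers
```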

77. Expectation Maximization: Technical Problems

• If a fully general mixture of Gaussian distributions is used, the likelihood function is truly optimized if
◦ all normal distributions except one are contracted to single data points and
◦ the remaining normal distribution is the maximum likelihood estimate for the remaining data points.
• This undesired result is rare, because the algorithm gets stuck in a local optimum.
• Nevertheless it is recommended to take countermeasures, which consist mainly in reducing the degrees of freedom, like:
◦ Fix the determinants of the covariance matrices to equal values.
◦ Use a diagonal instead of a general covariance matrix.
◦ Use an isotropic variance instead of a covariance matrix.
◦ Fix the prior probabilities of the clusters to equal values.

Soft Learning Vector Quantization

Idea: Use soft assignments instead of winner-takes-all (approach described here: [Seo and Obermayer 2003]).

Assumption: Given data was sampled from a mixture of normal distributions. Each reference vector describes one normal distribution.

Objective: Maximize the log-likelihood ratio of the data, that is, maximize
ln L_ratio = Σ_{j=1}^n ln Σ_{r∈R(c_j)} exp(−(x_j − r)ᵀ(x_j − r) / (2σ²))
− Σ_{j=1}^n ln Σ_{r∈Q(c_j)} exp(−(x_j − r)ᵀ(x_j − r) / (2σ²)).

Here σ is a parameter specifying the "size" of each normal distribution.
R(c) is the set of reference vectors assigned to class c and Q(c) its complement.

Intuitively: at each data point the probability density for its class should be as large as possible while the density for all other classes should be as small as possible.

Soft Learning Vector Quantization

Update rule derived from a maximum log-likelihood approach:
r_i(new) = r_i(old) + η · u⁺_ij · (x_j − r_i(old)), if c_j = z_i,
r_i(new) = r_i(old) − η · u⁻_ij · (x_j − r_i(old)), if c_j ≠ z_i,
where z_i is the class associated with the reference vector r_i and
u⁺_ij = exp(−(x_j − r_i(old))ᵀ(x_j − r_i(old)) / (2σ²)) / Σ_{r∈R(c_j)} exp(−(x_j − r(old))ᵀ(x_j − r(old)) / (2σ²)),
u⁻_ij = exp(−(x_j − r_i(old))ᵀ(x_j − r_i(old)) / (2σ²)) / Σ_{r∈Q(c_j)} exp(−(x_j − r(old))ᵀ(x_j − r(old)) / (2σ²)).
R(c) is the set of reference vectors assigned to class c and Q(c) its complement.

Hard Learning Vector Quantization

Idea: Derive a scheme with hard assignments from the soft version.
Approach: Let the size parameter σ of the Gaussian function go to zero.
The resulting update rule has the same form, but with
u⁺_ij = 1 if r_i = argmin_{r∈R(c_j)} d(x_j, r), 0 otherwise (r_i is the closest vector of the same class),
u⁻_ij = 1 if r_i = argmin_{r∈Q(c_j)} d(x_j, r), 0 otherwise (r_i is the closest vector of a different class).
This update rule is stable without a window rule restricting the update.
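A compact sketch of one soft LVQ step in the spirit of the update rule above. The reference vectors, labels and parameter values are illustrative; the sketch assumes that both the class of x and its complement are represented in the codebook.

```python
import numpy as np

def soft_lvq_step(refs, ref_classes, x, c, eta, sigma):
    """One soft LVQ update: every reference vector is attracted (same
    class, weight u+) or repelled (different class, weight u-)
    according to its soft assignment."""
    g = np.exp(-np.sum((refs - x) ** 2, axis=1) / (2.0 * sigma ** 2))
    same = ref_classes == c                  # which vectors share x's class
    u = np.where(same, g / g[same].sum(), g / g[~same].sum())
    sign = np.where(same, 1.0, -1.0)
    refs += eta * (sign * u)[:, None] * (x - refs)
    return refs

refs = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, 1.0]])
classes = np.array([0, 0, 1])
refs = soft_lvq_step(refs, classes, np.array([0.2, 0.1]), c=0,
                     eta=0.1, sigma=0.5)
print(refs)
```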

78. Learning Vector Quantization: Extensions

• Frequency Sensitive Competitive Learning
◦ The distance to a reference vector is modified according to the number of data points that are assigned to this reference vector.
• Fuzzy Learning Vector Quantization
◦ Exploits the close relationship to fuzzy clustering.
◦ Can be seen as an online version of fuzzy clustering.
◦ Leads to faster clustering.
• Size and Shape Parameters
◦ Associate each reference vector with a cluster radius. Update this radius depending on how close the data points are.
◦ Associate each reference vector with a covariance matrix. Update this matrix depending on the distribution of the data points.

Demonstration Software: xlvq/wlvq

Demonstration of learning vector quantization:
• Visualization of the training process
• Arbitrary data sets, but training only in two dimensions
• http://www.borgelt.net/lvqd.html

Self-Organizing Maps

A self-organizing map or Kohonen feature map is a neural network with a graph G = (U, C) that satisfies the following conditions:
(i) U_hidden = ∅, U_in ∩ U_out = ∅,
(ii) C = U_in × U_out.

The network input function of each output neuron is a distance function of input and weight vector. The activation function of each output neuron is a radial function, that is, a monotonically decreasing function
f: ℝ₀⁺ → [0, 1] with f(0) = 1 and lim_{x→∞} f(x) = 0.

The output function of each output neuron is the identity. The output is often discretized according to the "winner takes all" principle.

On the output neurons a neighborhood relationship is defined:
d_neurons: U_out × U_out → ℝ₀⁺.

79. Self-Organizing Maps: Neighborhood

Neighborhood of the output neurons: neurons form a grid (quadratic grid or hexagonal grid).
• Thin black lines: indicate nearest neighbors of a neuron.
• Thick gray lines: indicate regions assigned to a neuron for visualization.
• Usually two-dimensional grids are used to be able to draw the map easily.

Topology Preserving Mapping

Images of points close to each other in the original space should be close to each other in the image space.

Example: Robinson projection of the surface of a sphere (maps from 3 dimensions to 2 dimensions).
• Robinson projection is/was frequently used for world maps.
• The topology is preserved, although distances, angles and areas may be distorted.

Self-Organizing Maps: Topology Preserving Mapping

• Neuron space/grid: usually a 2-dimensional quadratic or hexagonal grid.
• Input/data space: usually high-dimensional (here: only 3-dimensional); blue: reference vectors.
• Connections may be drawn between vectors corresponding to adjacent neurons.

Self-Organizing Maps: Neighborhood

Find a topology preserving mapping by respecting the neighborhood.

Reference vector update rule:
r_u(new) = r_u(old) + η(t) · f_nb(d_neurons(u, u*), ρ(t)) · (x − r_u(old)),
• u* is the winner neuron (reference vector closest to the data point).
• The neighborhood function f_nb is a radial function.

Time dependent learning rate:
η(t) = η₀ α_η^t, 0 < α_η < 1, or η(t) = η₀ t^{κ_η}, κ_η < 0.
Time dependent neighborhood radius:
ρ(t) = ρ₀ α_ρ^t, 0 < α_ρ < 1, or ρ(t) = ρ₀ t^{κ_ρ}, κ_ρ < 0.

80. Self-Organizing Maps: Neighborhood

The neighborhood size is reduced over time (here: step function).

Note that a neighborhood function that is not a step function has a "soft" border and thus allows for a smooth reduction of the neighborhood size (larger changes of the reference vectors are restricted more and more to the close neighborhood).

Self-Organizing Maps: Training Procedure

• Initialize the weight vectors of the neurons of the self-organizing map, that is, place initial reference vectors in the input/data space.
• This may be done by randomly selecting training examples (provided there are fewer neurons than training examples, which is the usual case) or by sampling from some probability distribution on the data space.
• For the actual training, repeat the following steps (a code sketch follows below):
◦ Choose a training sample / data point (traverse the data points, possibly shuffling after each epoch).
◦ Find the winner neuron with the distance function in the data space, that is, find the neuron with the closest reference vector.
◦ Compute the time dependent radius and learning rate and adapt the corresponding neighbors of the winner neuron (the severity of the weight changes depends on the neighborhood and the learning rate).

Self-Organizing Maps: Examples

Unfolding of a two-dimensional self-organizing map (data space).
• Self-organizing map with 10 × 10 neurons (quadratic grid) that is trained with random points chosen uniformly from the square [−1, 1] × [−1, 1].
• Initialization with random reference vectors chosen uniformly from [−0.5, 0.5] × [−0.5, 0.5].
• Gaussian neighborhood function
f_nb(d_neurons(u, u*), ρ(t)) = exp(−d²_neurons(u, u*) / (2ρ(t)²)).
• Time-dependent neighborhood radius ρ(t) = 2.5 · t^(−0.1).
• Time-dependent learning rate η(t) = 0.6 · t^(−0.1).
• The next slides show the SOM state after 10, 20, 40, 80 and 160 training steps. In each training step one training sample is processed. Shading of the neuron grid shows neuron activations for (−0.5, −0.5).
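A minimal sketch of the training procedure, using the 10 × 10 grid and the decay schedules from the example above; the squared grid distance inside the Gaussian neighborhood is a simplification of the general d_neurons.

```python
import numpy as np

rng = np.random.default_rng(0)
grid = 10
# Grid positions of the neurons and their reference vectors in data space.
coords = np.array([(i, j) for i in range(grid) for j in range(grid)], float)
refs = rng.uniform(-0.5, 0.5, size=(grid * grid, 2))

for t in range(1, 161):                          # 160 training steps
    x = rng.uniform(-1.0, 1.0, size=2)           # sample from [-1,1] x [-1,1]
    winner = np.argmin(np.linalg.norm(refs - x, axis=1))
    rho = 2.5 * t ** -0.1                        # neighborhood radius rho(t)
    eta = 0.6 * t ** -0.1                        # learning rate eta(t)
    d2 = np.sum((coords - coords[winner]) ** 2, axis=1)
    f_nb = np.exp(-d2 / (2.0 * rho ** 2))        # Gaussian neighborhood
    refs += eta * f_nb[:, None] * (x - refs)     # update rule from above

print(refs.mean(axis=0))                         # the map unfolds over time
```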

81. Self-Organizing Maps: Examples

Unfolding of a two-dimensional self-organizing map (neuron grid).

Example: Unfolding of a two-dimensional self-organizing map. Training a self-organizing map may fail if
• the (initial) learning rate is chosen too small or
• the (initial) neighborhood radius is chosen too small.

Self-Organizing Maps: Examples

(a), (b), (c): Self-organizing maps that have been trained with random points from (a) a rotation parabola, (b) a simple cubic function, (c) the surface of a sphere.
• Since the data points come from a two-dimensional subspace, training works well.
• In this case original space and image space have different dimensionality. (In the previous example they were both two-dimensional.)
• Self-organizing maps can be used for dimensionality reduction (in a quantized fashion, but interpolation may be used for smoothing).

Demonstration Software: xsom/wsom

Demonstration of self-organizing map training:
• Visualization of the training process
• Two-dimensional areas and three-dimensional surfaces
• http://www.borgelt.net/somd.html

82. Application: The “Neural” Phonetic Typewriter

Create a Phonotopic Map of Finnish [Kohonen 1988]
• The recorded microphone signal is converted into a spectral representation grouped into 15 channels.
• The 15-dimensional input space is mapped to a hexagonal SOM grid.
(Picture not available in online version.)

Application: World Poverty Map

Organize Countries based on Poverty Indicators [Kaski et al. 1997]
• Data consists of World Bank statistics of countries in 1992.
• 39 indicators describing various quality-of-life factors were used, such as state of health, nutrition, educational services etc.
(Pictures not available in online version.)

Application: World Poverty Map

Organize Countries based on Poverty Indicators [Kaski et al. 1997]
• Map of the world with countries colored by their poverty type.
• Countries in gray were not considered (possibly insufficient data).
(Picture not available in online version.)

Application: Classifying News Messages

Classify News Messages on Neural Networks [Kaski et al. 1998]
(Picture not available in online version.)

83. Application: Organize Music Collections

Organize Music Collections (Sound and Tags) [Mörchen et al. 2005/2007]
• Uses an Emergent Self-organizing Map to organize songs (no fixed shape, larger number of neurons than data points).
• Creates semantic features with regression and feature selection from 140 different short-term features (short time windows with stationary sound), 284 long-term features (e.g. temporal statistics etc.) and user-assigned tags.
(Pictures not available in online version.)

Self-Organizing Maps: Summary

• Dimensionality Reduction
◦ Topology preserving mapping in a quantized fashion, but interpolation may be used for smoothing.
◦ Cell coloring according to activation for few data points in order to "compare" these data points (i.e., their relative location in the data space).
◦ Allows one to visualize (with loss, of course) high-dimensional data.
• Similarity Search
◦ Train a self-organizing map on high-dimensional data.
◦ Find the cell / neuron to which a query data point is assigned.
◦ Similar data points are assigned to the same or neighboring cells.
◦ Useful for organizing text, image or music collections etc.

Hopfield Networks and Boltzmann Machines

Hopfield Networks

A Hopfield network is a neural network with a graph G = (U, C) that satisfies the following conditions:
(i) U_hidden = ∅, U_in = U_out = U,
(ii) C = U × U − {(u, u) | u ∈ U}.
• In a Hopfield network all neurons are input as well as output neurons.
• There are no hidden neurons.
• Each neuron receives input from all other neurons.
• A neuron is not connected to itself.
The connection weights are symmetric, that is,
∀u, v ∈ U, u ≠ v: w_uv = w_vu.

84. Hopfield Networks

The network input function of each neuron is the weighted sum of the outputs of all other neurons, that is,
∀u ∈ U: net_u = Σ_{v∈U−{u}} w_uv out_v.
The activation function of each neuron is a threshold function, that is,
∀u ∈ U: act_u = 1 if net_u ≥ θ_u, −1 otherwise.
The output function of each neuron is the identity, that is,
∀u ∈ U: out_u = act_u.

Hopfield Networks

Alternative activation function:
∀u ∈ U: act_u = 1 if net_u > θ_u, −1 if net_u < θ_u, and unchanged (old activation act_u) if net_u = θ_u.
This activation function has advantages w.r.t. the physical interpretation of a Hopfield network.

General weight matrix of a Hopfield network (symmetric, zero diagonal):
W =
( 0          w_{u1u2}  …  w_{u1un} )
( w_{u1u2}   0         …  w_{u2un} )
( …                        …       )
( w_{u1un}   w_{u2un}  …  0        )

Hopfield Networks: Examples

Very simple Hopfield network:
W =
( 0 1 )
( 1 0 )
(two neurons u₁ and u₂, both thresholds 0)

Parallel update of neuron activations:
• input phase: act = (−1, 1)
• work phase: the state alternates between (1, −1) and (−1, 1).
• The computations oscillate, no stable state is reached.
• Output depends on when the computations are terminated.

Hopfield Networks: Examples

The behavior of a Hopfield network can depend on the update order.
• Computations can oscillate if neurons are updated in parallel.
• Computations always converge if neurons are updated sequentially.
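The two update regimes are easy to reproduce in code. This sketch uses the two-neuron example network above and shows the oscillation under synchronous updates and the convergence under an asynchronous update:

```python
import numpy as np

W = np.array([[0.0, 1.0], [1.0, 0.0]])    # the simple two-neuron example
theta = np.zeros(2)

def parallel_step(act):
    """Synchronous update: all neurons switch at once (may oscillate)."""
    return np.where(W @ act >= theta, 1.0, -1.0)

def sequential_step(act, u):
    """Asynchronous update of a single neuron u (always converges)."""
    act = act.copy()
    act[u] = 1.0 if W[u] @ act >= theta[u] else -1.0
    return act

act = np.array([-1.0, 1.0])
print(parallel_step(act))                 # (1, -1): the state oscillates
print(sequential_step(act, 0))            # (1, 1): a stable state
```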

85. Hopfield Networks: Examples

Sequential update of neuron activations
• input phase: act = (−1, 1)
• work phase: update order u₁, u₂, u₁, u₂, … (left, reaching (1, 1)) or u₂, u₁, u₂, u₁, … (right, reaching (−1, −1)).
• Regardless of the update order a stable state is reached.
• However, which state is reached depends on the update order.

Hopfield Networks: Examples

Simplified representation of a Hopfield network:
W =
( 0 1 2 )
( 1 0 1 )
( 2 1 0 )
(three neurons u₁, u₂, u₃, all thresholds 0)
• Symmetric connections between neurons are combined.
• Inputs and outputs are not explicitly represented.

Hopfield Networks: State Graph

Graph of activation states and transitions (for the Hopfield network shown on the preceding slide):
• "+"/"−" encode the neuron activations: "+" means +1 and "−" means −1.
• Labels on arrows indicate the neurons whose updates (activation changes) lead to the corresponding state transitions.
• States shown in gray: stable states, cannot be left again.
• States shown in white: unstable states, may be left again.
• Such a state graph captures all imaginable update orders.

Hopfield Networks: Convergence

Convergence theorem: If the activations of the neurons of a Hopfield network are updated sequentially (asynchronously), then a stable state is reached in a finite number of steps.

If the neurons are traversed cyclically in an arbitrary, but fixed order, at most n · 2ⁿ steps (updates of individual neurons) are needed, where n is the number of neurons of the Hopfield network.

The proof is carried out with the help of an energy function. The energy function of a Hopfield network with n neurons u₁, …, uₙ is defined as
E = −½ actᵀ W act + θᵀ act = −½ Σ_{u,v∈U, u≠v} w_uv act_u act_v + Σ_{u∈U} θ_u act_u.

86. Hopfield Networks: Convergence

Consider the energy change resulting from an update that changes an activation:
ΔE = E(new) − E(old)
= (−Σ_{v∈U−{u}} w_uv act_u(new) act_v + θ_u act_u(new)) − (−Σ_{v∈U−{u}} w_uv act_u(old) act_v + θ_u act_u(old))
= (act_u(old) − act_u(new)) · (Σ_{v∈U−{u}} w_uv act_v − θ_u)
= (act_u(old) − act_u(new)) · (net_u − θ_u).

• If net_u < θ_u, then the second factor is less than 0. Also, act_u(new) = −1 and act_u(old) = 1, therefore the first factor is greater than 0. Result: ΔE < 0.
• If net_u ≥ θ_u, then the second factor is greater than or equal to 0. Also, act_u(new) = 1 and act_u(old) = −1, therefore the first factor is less than 0. Result: ΔE ≤ 0.

Hopfield Networks: Convergence

It takes at most n · 2ⁿ update steps to reach convergence:
• Provided that the neurons are updated in an arbitrary, but fixed order, since this guarantees that the neurons are traversed cyclically, and therefore each neuron is updated every n steps.
• If in a traversal of all n neurons no activation changes: a stable state has been reached.
• If in a traversal of all n neurons at least one activation changes: the previous state cannot be reached again, because
◦ either the new state has a smaller energy than the old (no way back: updates cannot increase the network energy)
◦ or the number of +1 activations has increased (no way back: equal energy is possible only for net_u ≥ θ_u).
• The number of possible states of the Hopfield network is 2ⁿ, at least one of which must be rendered unreachable in each traversal of the n neurons.

Hopfield Networks: Examples

Arrange the states in the state graph according to their energy.
Energy function for the example Hopfield network:
E = −act_{u₁} act_{u₂} − 2 act_{u₁} act_{u₃} − act_{u₂} act_{u₃}.
(Figure: state graph with states arranged by energy level; the stable states +++ and −−− lie at the lowest energy.)

Hopfield Networks: Examples

The state graph need not be symmetric.
(Figure: a second example network, with thresholds −1, whose state graph, arranged by energy, is not symmetric.)
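The energy function is a one-liner, which makes it easy to check which states of the three-neuron example are the minima:

```python
import numpy as np

def energy(act, W, theta):
    """Energy of a Hopfield network state."""
    return -0.5 * act @ W @ act + theta @ act

# Example network from the slides: w12 = 1, w13 = 2, w23 = 1, thresholds 0.
W = np.array([[0, 1, 2], [1, 0, 1], [2, 1, 0]], float)
theta = np.zeros(3)
for s in [(1, 1, 1), (-1, -1, -1), (1, -1, 1), (1, 1, -1)]:
    print(s, energy(np.array(s, float), W, theta))
# (1,1,1) and (-1,-1,-1) have the lowest energy -4: the stable states.
```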

87. Hopfield Networks: Physical Interpretation

Physical interpretation: Magnetism

A Hopfield network can be seen as a (microscopic) model of magnetism (so-called Ising model, [Ising 1925]).

physical ↔ neural:
atom ↔ neuron
magnetic moment (spin) ↔ activation state
strength of outer magnetic field ↔ threshold value
magnetic coupling of the atoms ↔ connection weights
Hamilton operator of the magnetic field ↔ energy function

Hopfield Networks: Associative Memory

Idea: Use stable states to store patterns.
First: Store only one pattern x = (act_{u₁}, …, act_{uₙ})ᵀ ∈ {−1, 1}ⁿ, n ≥ 2, that is, find weights so that the pattern is a stable state.

Necessary and sufficient condition:
S(W x − θ) = x,
where S: ℝⁿ → {−1, 1}ⁿ, x ↦ y with y_i = 1 if x_i ≥ 0, and −1 otherwise.

Hopfield Networks: Associative Memory

If θ = 0, an appropriate matrix W can easily be found, because it suffices to ensure
W x = c x with c ∈ ℝ⁺.
Algebraically: Find a matrix W that has a positive eigenvalue w.r.t. x.
Choose
W = x xᵀ − E,
where x xᵀ is the so-called outer product of x with itself and E is the identity matrix. With this matrix we have
W x = (x xᵀ) x − E x = x (xᵀ x) − x = n x − x = (n − 1) x,
since xᵀ x = |x|² = n and vector/matrix multiplication is associative.

Hopfield Networks: Associative Memory

Hebbian learning rule [Hebb 1949]
Written in individual weights the computation of the weight matrix reads:
w_uv = 0 if u = v; 1 if u ≠ v and act_u = act_v; −1 otherwise.
• Originally derived from a biological analogy.
• Strengthen the connection between neurons that are active at the same time.
Note that this learning rule also stores the complement of the pattern:
with W x = (n − 1) x it is also W(−x) = (n − 1)(−x).

88. Hopfield Networks: Associative Memory

Storing several patterns: Choose
W x_j = (Σ_{i=1}^m W_i) x_j = (Σ_{i=1}^m x_i x_iᵀ − m E) x_j = Σ_{i=1}^m x_i (x_iᵀ x_j) − m x_j.
If the patterns are orthogonal, we have
x_iᵀ x_j = 0 if i ≠ j, and n if i = j,
and therefore
W x_j = (n − m) x_j.

Hopfield Networks: Associative Memory

Storing several patterns
• Result: As long as m < n, x_j is a stable state of the Hopfield network.
• Note that the complements of the patterns are also stored: with W x_j = (n − m) x_j it is also W(−x_j) = (n − m)(−x_j).
• But: The capacity is very small compared to the number of possible states (2ⁿ), since at most m = n − 1 orthogonal patterns can be stored (so that n − m > 0).
• Furthermore, the requirement that the patterns must be orthogonal is a strong limitation of the usefulness of this result.

Hopfield Networks: Associative Memory

Non-orthogonal patterns:
W x_j = (n − m) x_j + Σ_{i=1, i≠j}^m x_i (x_iᵀ x_j)   ["disturbance term"]
• The "disturbance term" need not make it impossible to store the patterns.
• The states corresponding to the patterns x_j may still be stable, if the "disturbance term" is sufficiently small.
• For this term to be sufficiently small, the patterns must be "almost" orthogonal.
• The larger the number of patterns to be stored (that is, the smaller n − m), the smaller the "disturbance term" must be.
• The theoretically possible maximal capacity of a Hopfield network (that is, m = n − 1) is hardly ever reached in practice.

Associative Memory: Example

Example: Store patterns x₁ = (+1, +1, −1, −1)ᵀ and x₂ = (−1, +1, −1, +1)ᵀ.
W = W₁ + W₂ = x₁ x₁ᵀ + x₂ x₂ᵀ − 2E,
where
W₁ =
( 0  1 −1 −1 )
( 1  0 −1 −1 )
(−1 −1  0  1 )
(−1 −1  1  0 ),
W₂ =
( 0 −1  1 −1 )
(−1  0 −1  1 )
( 1 −1  0 −1 )
(−1  1 −1  0 ).
The full weight matrix is:
W =
( 0  0  0 −2 )
( 0  0 −2  0 )
( 0 −2  0  0 )
(−2  0  0  0 ).
Therefore it is
W x₁ = (+2, +2, −2, −2)ᵀ and W x₂ = (−2, +2, −2, +2)ᵀ.
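A small sketch of the Hebbian storage rule and the recall by asynchronous updates. It reproduces the 4 × 4 example matrix above; the fixed number of update sweeps is a simplification (the convergence theorem guarantees a stable state is reached).

```python
import numpy as np

def store(patterns):
    """Hebbian rule: W = sum of outer products minus m times the identity."""
    n = patterns.shape[1]
    return sum(np.outer(x, x) for x in patterns) - len(patterns) * np.eye(n)

def recall(W, x, sweeps=10):
    """Asynchronous updates (thresholds 0) until stable or sweeps run out."""
    x = x.copy()
    for _ in range(sweeps):
        for u in range(len(x)):
            x[u] = 1.0 if W[u] @ x >= 0 else -1.0
    return x

pats = np.array([[1, 1, -1, -1], [-1, 1, -1, 1]], float)
W = store(pats)          # reproduces the 4x4 example matrix from the slides
print(recall(W, np.array([1.0, 1.0, -1.0, 1.0])))  # noisy input -> pattern x2
```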

89. Associative Memory: Examples

Example: Storing bit maps of numbers
• Left: Bit maps stored in a Hopfield network.
• Right: Reconstruction of a pattern from a random input.

Hopfield Networks: Associative Memory

Training a Hopfield network with the Delta rule

Necessary condition for pattern x being a stable state:
s(0 + w_{u1u2} act_{u2}^(p) + … + w_{u1un} act_{un}^(p) − θ_{u1}) = act_{u1}^(p),
s(w_{u2u1} act_{u1}^(p) + 0 + … + w_{u2un} act_{un}^(p) − θ_{u2}) = act_{u2}^(p),
…
s(w_{unu1} act_{u1}^(p) + w_{unu2} act_{u2}^(p) + … + 0 − θ_{un}) = act_{un}^(p),
with the standard threshold function s(x) = 1 if x ≥ 0, −1 otherwise.

Hopfield Networks: Associative Memory

Training a Hopfield network with the Delta rule

Turn the weight matrix into a weight vector:
w = (w_{u1u2}, w_{u1u3}, …, w_{u1un}, w_{u2u3}, …, w_{u2un}, …, w_{u(n−1)un}, −θ_{u1}, −θ_{u2}, …, −θ_{un}).

Construct input vectors for a threshold logic unit, e.g.
z₂ = (act_{u1}^(p), 0, …, 0, act_{u3}^(p), …, act_{un}^(p), … 0, 1, 0, …, 0)
(the blocks of zeros have n − 2 entries each; the 1 sits at the position of −θ_{u2}).
Apply Delta rule training / Widrow–Hoff procedure until convergence.

Demonstration Software: xhfn/whfn

Demonstration of Hopfield networks as associative memory:
• Visualization of the association/recognition process
• Two-dimensional networks of arbitrary size
• http://www.borgelt.net/hfnd.html

90. Hopfield Networks: Solving Optimization Problems

Use energy minimization to solve optimization problems.

General procedure:
• Transform the function to optimize into a function to minimize.
• Transform the function into the form of an energy function of a Hopfield network.
• Read the weights and threshold values from the energy function.
• Construct the corresponding Hopfield network.
• Initialize the Hopfield network randomly and update until convergence.
• Read the solution from the stable state reached.
• Repeat several times and use the best solution found.

Hopfield Networks: Activation Transformation

A Hopfield network may be defined either with activations −1 and 1 or with activations 0 and 1. The networks can be transformed into each other.

From act_u ∈ {−1, 1} to act_u ∈ {0, 1}:
w⁰_uv = 2 w⁻_uv and θ⁰_u = θ⁻_u + Σ_{v∈U−{u}} w⁻_uv.
From act_u ∈ {0, 1} to act_u ∈ {−1, 1}:
w⁻_uv = ½ w⁰_uv and θ⁻_u = θ⁰_u − ½ Σ_{v∈U−{u}} w⁰_uv.

Hopfield Networks: Solving Optimization Problems

Combination lemma: Let two Hopfield networks on the same set U of neurons with weights w^(i)_uv, threshold values θ^(i)_u and energy functions
E_i = −½ Σ_{u∈U} Σ_{v∈U−{u}} w^(i)_uv act_u act_v + Σ_{u∈U} θ^(i)_u act_u,
i = 1, 2, be given. Furthermore let a, b ∈ ℝ⁺. Then E = aE₁ + bE₂ is the energy function of the Hopfield network on the neurons in U that has the weights w_uv = a w^(1)_uv + b w^(2)_uv and the threshold values θ_u = a θ^(1)_u + b θ^(2)_u.
Proof: Just do the computations.

Idea: Additional conditions can be formalized separately and incorporated later. (One energy function per condition, then apply the combination lemma.)

Hopfield Networks: Solving Optimization Problems

Example: Traveling salesman problem

Idea: Represent a tour by a matrix. An element m_ij of the matrix is 1 if the j-th city is visited in the i-th step and 0 otherwise. Each matrix element will be represented by a neuron.
(Example: the matrix with rows (1, 0, 0, 0), (0, 0, 1, 0), (0, 0, 0, 1), (0, 1, 0, 0) encodes the tour 1 → 3 → 4 → 2.)
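The activation transformation above is a pure bookkeeping exercise; a small sketch (the example weights and thresholds are arbitrary, and the zero diagonal means the row sums run over v ≠ u as required):

```python
import numpy as np

def to_binary(W_m, theta_m):
    """Transform a {-1,1}-network into an equivalent {0,1}-network."""
    W0 = 2.0 * W_m
    theta0 = theta_m + W_m.sum(axis=1)    # sum over v != u (diagonal is 0)
    return W0, theta0

def to_bipolar(W0, theta0):
    """Inverse transformation, from {0,1} back to {-1,1}."""
    W_m = 0.5 * W0
    theta_m = theta0 - 0.5 * W0.sum(axis=1)
    return W_m, theta_m

W = np.array([[0.0, 1.0], [1.0, 0.0]])
theta = np.array([0.5, -0.5])
print(to_bipolar(*to_binary(W, theta)))   # round trip recovers W and theta
```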

91. Hopfield Networks: Solving Optimization Problems

Minimization of the tour length:
E₁ = Σ_{j₁=1}^n Σ_{j₂=1}^n Σ_{i=1}^n d_{j₁j₂} · m_{ij₁} · m_{(i mod n)+1, j₂}.
A double summation over the steps (index i) is needed:
E₁ = Σ_{(i₁,j₁)∈{1,…,n}²} Σ_{(i₂,j₂)∈{1,…,n}²} d_{j₁j₂} · δ_{(i₁ mod n)+1, i₂} · m_{i₁j₁} · m_{i₂j₂},
where δ_ab = 1 if a = b, and 0 otherwise.
Symmetric version of the energy function:
E₁ = −½ Σ_{(i₁,j₁)∈{1,…,n}²} Σ_{(i₂,j₂)∈{1,…,n}²} (−d_{j₁j₂}) · (δ_{(i₁ mod n)+1, i₂} + δ_{i₁, (i₂ mod n)+1}) · m_{i₁j₁} · m_{i₂j₂}.

Hopfield Networks: Solving Optimization Problems

Additional conditions that have to be satisfied:
• Each city is visited on exactly one step of the tour:
∀j ∈ {1, …, n}: Σ_{i=1}^n m_ij = 1,
that is, each column of the matrix contains exactly one 1.
• On each step of the tour exactly one city is visited:
∀i ∈ {1, …, n}: Σ_{j=1}^n m_ij = 1,
that is, each row of the matrix contains exactly one 1.
These conditions are incorporated by finding additional functions to optimize.

Hopfield Networks: Solving Optimization Problems

Formalization of the first condition as a minimization problem:
E₂* = Σ_{j=1}^n (Σ_{i=1}^n m_ij − 1)²
= Σ_{j=1}^n ((Σ_{i=1}^n m_ij)² − 2 Σ_{i=1}^n m_ij + 1)
= Σ_{j=1}^n (Σ_{i₁=1}^n m_{i₁j})(Σ_{i₂=1}^n m_{i₂j}) − 2 Σ_{j=1}^n Σ_{i=1}^n m_ij + n.
A double summation over the cities (index j) is needed:
E₂ = Σ_{(i₁,j₁)∈{1,…,n}²} Σ_{(i₂,j₂)∈{1,…,n}²} δ_{j₁j₂} · m_{i₁j₁} · m_{i₂j₂} − 2 Σ_{(i,j)∈{1,…,n}²} m_ij.

Resulting energy function:
E₂ = −½ Σ_{(i₁,j₁)} Σ_{(i₂,j₂)} (−2δ_{j₁j₂}) · m_{i₁j₁} · m_{i₂j₂} + Σ_{(i,j)} (−2) m_ij.

The second additional condition is handled in a completely analogous way:
E₃ = −½ Σ_{(i₁,j₁)} Σ_{(i₂,j₂)} (−2δ_{i₁i₂}) · m_{i₁j₁} · m_{i₂j₂} + Σ_{(i,j)} (−2) m_ij.

Combining the energy functions:
E = aE₁ + bE₂ + cE₃, where b/a = c/a > 2 max_{(j₁,j₂)} d_{j₁j₂}.

92. Hopfield Networks: Solving Optimization Problems

From the resulting energy function we can read the weights
w_{(i₁,j₁)(i₂,j₂)} = −a d_{j₁j₂} · (δ_{(i₁ mod n)+1, i₂} + δ_{i₁, (i₂ mod n)+1}) − 2b δ_{j₁j₂} − 2c δ_{i₁i₂}
(the three terms stem from E₁, E₂ and E₃, respectively)
and the threshold values:
θ_{(i,j)} = 0·a − 2b − 2c = −2(b + c).

Problem: Random initialization and update until convergence do not always lead to a matrix that represents a tour, let alone an optimal one.

Hopfield Networks: Reasons for Failure

A Hopfield network only rarely finds a tour, let alone an optimal one.
• One of the main problems is that the Hopfield network is unable to switch from a found tour to another with a lower total length.
• The reason is that transforming a matrix that represents a tour into another matrix that represents a different tour requires that at least four neurons (matrix elements) change their activations.
• However, each of these changes, if carried out individually, violates at least one of the constraints and thus increases the energy.
• Only all four changes together can result in a smaller energy, but they cannot be executed together due to the asynchronous update.
• Therefore the normal activation updates can never change an already found tour into another, even if this requires only a marginal change of the tour.

Hopfield Networks: Local Optima

• Results can be somewhat improved if instead of discrete Hopfield networks (activations in {−1, 1} (or {0, 1})) one uses continuous Hopfield networks (activations in [−1, 1] (or [0, 1])). However, the fundamental problem is not solved in this way.
• More generally, the reason for the difficulties that are encountered if an optimization problem is to be solved with a Hopfield network is: the update procedure may get stuck in a local optimum.
• The problem of local optima occurs also with many other optimization methods, for example, gradient descent, hill climbing, alternating optimization etc.
• Ideas to overcome this difficulty for other optimization methods may be transferred to Hopfield networks.
• One such method, which is very popular, is simulated annealing.

Simulated Annealing

May be seen as an extension of random or gradient descent that tries to avoid getting stuck.
Idea: transitions from higher to lower (local) minima should be more probable than vice versa. [Metropolis et al. 1953; Kirkpatrick et al. 1983]

Principle of Simulated Annealing:
• Random variants of the current solution (candidate) are created.
• Better solution (candidates) are always accepted.
• Worse solution (candidates) are accepted with a probability that depends on
◦ the quality difference between the new and the old solution (candidate) and
◦ a temperature parameter that is decreased with time.
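For illustration, a sketch that reads the TSP network off the weight formula above. The concrete choice b = c = 2.1 · a · max d is my own assumption, picked only to satisfy the stated constraint on the coefficients; the quadruple loop is deliberately naive.

```python
import numpy as np

def tsp_weights(d, a=1.0):
    """Build the Hopfield weights and thresholds for a TSP instance with
    distance matrix d, following the combined energy E = a E1 + b E2 + c E3."""
    n = len(d)
    b = c = 2.1 * a * d.max()        # assumed choice satisfying b/a = c/a > 2 max d
    delta = np.eye(n)
    W = np.zeros((n, n, n, n))       # indexed by (step i1, city j1, step i2, city j2)
    for i1 in range(n):
        for j1 in range(n):
            for i2 in range(n):
                for j2 in range(n):
                    succ = delta[(i1 + 1) % n, i2] + delta[i1, (i2 + 1) % n]
                    W[i1, j1, i2, j2] = (-a * d[j1, j2] * succ
                                         - 2.0 * b * delta[j1, j2]
                                         - 2.0 * c * delta[i1, i2])
    for i in range(n):
        for j in range(n):
            W[i, j, i, j] = 0.0      # no self-connections in a Hopfield net
    theta = -2.0 * (b + c) * np.ones((n, n))
    return W, theta

d = np.array([[0, 1, 2], [1, 0, 1], [2, 1, 0]], float)
W, theta = tsp_weights(d)
print(W.shape, theta[0, 0])
```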

93. Simulated Annealing

• Motivation:
◦ Physical minimization of the energy (more precisely: atom lattice energy) if a heated piece of metal is cooled slowly.
◦ This process is called annealing.
◦ It serves the purpose to make the metal easier to work or to machine by relieving tensions and correcting lattice malformations.
• Alternative Motivation:
◦ A ball rolls around on an (irregularly) curved surface; minimization of the potential energy of the ball.
◦ In the beginning the ball is endowed with a certain kinetic energy, which enables it to roll up some slopes of the surface.
◦ In the course of time, friction reduces the kinetic energy of the ball, so that it finally comes to a rest in a valley of the surface.

Simulated Annealing: Procedure

1. Choose a (random) starting point s₀ ∈ Ω (Ω is the search space).
2. Choose a point s′ ∈ Ω "in the vicinity" of s_i (for example, by a small random variation of s_i).
3. Set s_{i+1} = s′ if f(s′) ≥ f(s_i); otherwise set s_{i+1} = s′ with probability p = e^(−Δf/(kT)) and s_{i+1} = s_i with probability 1 − p, where
Δf = f(s_i) − f(s′) is the quality difference of the solution (candidates),
k = Δf_max is (an estimation of) the range of quality values,
T is a temperature parameter (that is (slowly) decreased over time).
4. Repeat steps 2 and 3 until some termination criterion is fulfilled.

• For (very) small T the method approaches a pure random descent.
• Attention: There is no guarantee that the global optimum is found!

Hopfield Networks: Simulated Annealing

Applying simulated annealing to Hopfield networks is very simple:
• All neuron activations are initialized randomly.
• The neurons of the Hopfield network are traversed repeatedly (for example, in some random order).
• For each neuron, it is determined whether an activation change leads to a reduction of the network energy or not.
• An activation change that reduces the network energy is always accepted (in the normal update process, only such changes occur).
• However, if an activation change increases the network energy, it is accepted with a certain probability (see the preceding slide).
• Note that in this case we have simply Δf = ΔE = |net_u − θ_u|.

Hopfield Networks: Summary

• Hopfield networks are restricted recurrent neural networks (full pairwise connections, symmetric connection weights).
• Synchronous update of the neurons may lead to oscillations, but asynchronous update is guaranteed to reach a stable state (asynchronous updates either reduce the energy of the network or increase the number of +1 activations).
• Hopfield networks can be used as associative memory, that is, as memory that is addressed by its contents, by using the stable states to store desired patterns.
• Hopfield networks can be used to solve optimization problems, if the function to optimize can be reformulated as an energy function (stable states are (local) minima of the energy function).
• Approaches like simulated annealing may be needed to prevent that the update gets stuck in a local optimum.
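The four-step procedure above translates into a short generic routine. The toy objective, the neighbor function and the geometric cooling schedule are illustrative choices, not prescribed by the slides:

```python
import math, random

def simulated_annealing(f, s0, neighbor, k, T0=1.0, cooling=0.99, steps=1000):
    """Maximize f: accept better candidates always, worse candidates with
    probability exp(-delta_f / (k*T)), lowering the temperature T over time."""
    s, T = s0, T0
    for _ in range(steps):
        s_new = neighbor(s)
        delta = f(s) - f(s_new)            # quality difference delta_f
        if delta <= 0 or random.random() < math.exp(-delta / (k * T)):
            s = s_new
        T *= cooling                       # slowly decrease the temperature
    return s

# Toy usage: maximize f(x) = -(x - 3)^2 over the reals.
best = simulated_annealing(lambda x: -(x - 3) ** 2, 0.0,
                           lambda x: x + random.uniform(-0.5, 0.5), k=1.0)
print(best)  # close to 3
```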

94. Boltzmann Machines

• Boltzmann machines are closely related to Hopfield networks.
• They differ from Hopfield networks mainly in how the neurons are updated.
• They also rely on the fact that one can define an energy function that assigns a numeric value (an energy) to each state of the network.
• With the help of this energy function a probability distribution over the states of the network is defined based on the Boltzmann distribution (also known as Gibbs distribution) of statistical mechanics, namely
P(s) = (1/c) e^(−E(s)/(kT)),
where
s describes the (discrete) state of the system,
c is a normalization constant,
E is the function that yields the energy of a state s,
T is the thermodynamic temperature of the system,
k is Boltzmann's constant (k ≈ 1.38 · 10⁻²³ J/K).

Boltzmann Machines

• For Boltzmann machines the product kT is often replaced by merely T, combining the temperature and Boltzmann's constant into a single parameter.
• The state s consists of the vector act of the neuron activations.
• The energy function of a Boltzmann machine is
E(act) = −½ actᵀ W act + θᵀ act,
where W: matrix of connection weights; θ: vector of threshold values.
• Consider the energy change resulting from the change of a single neuron u:
ΔE_u = E_{act_u=0} − E_{act_u=1} = Σ_{v∈U−{u}} w_uv act_v − θ_u.
• Writing the energies in terms of the Boltzmann distribution yields
ΔE_u = −kT ln(P(act_u = 0)) − (−kT ln(P(act_u = 1))).

Boltzmann Machines

• Rewrite as
ΔE_u / (kT) = ln(P(act_u = 1)) − ln(P(act_u = 0)) = ln(P(act_u = 1)) − ln(1 − P(act_u = 1))
(since obviously P(act_u = 0) + P(act_u = 1) = 1).
• Solving this equation for P(act_u = 1) finally yields
P(act_u = 1) = 1 / (1 + e^(−ΔE_u/(kT))).
• That is: the probability of a neuron being active is a logistic function of the (scaled) energy difference between its active and inactive state.
• Since the energy difference is closely related to the network input, namely as
ΔE_u = Σ_{v∈U−{u}} w_uv act_v − θ_u = net_u − θ_u,
this formula suggests a stochastic update procedure for the network.

Boltzmann Machines: Update Procedure

• A neuron u is chosen (randomly); its network input, from it the energy difference ΔE_u and finally the probability of the neuron having activation 1 are computed. The neuron is set to activation 1 with this probability and to 0 otherwise.
• This update is repeated many times for randomly chosen neurons.
• Simulated annealing is carried out by slowly lowering the temperature T.
• This update process is a Markov Chain Monte Carlo (MCMC) procedure.
• After sufficiently many steps, the probability that the network is in a specific activation state depends only on the energy of that state. It is independent of the initial activation state the process was started with.
• This final situation is also referred to as thermal equilibrium.
• Therefore: Boltzmann machines are representations of and sampling mechanisms for the Boltzmann distributions defined by their weights and threshold values.
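The stochastic update procedure is a single Gibbs-sampling step; a minimal sketch with an arbitrary two-neuron network (weights, thresholds and the step count are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

def boltzmann_step(act, W, theta, T):
    """Update one randomly chosen neuron: it becomes active with the
    logistic probability derived from the energy difference."""
    u = rng.integers(len(act))
    delta_E = W[u] @ act - theta[u]          # net input minus threshold
    p = 1.0 / (1.0 + np.exp(-delta_E / T))
    act[u] = 1.0 if rng.random() < p else 0.0
    return act

W = np.array([[0.0, 2.0], [2.0, 0.0]])
theta = np.array([1.0, 1.0])
act = np.array([0.0, 1.0])
for _ in range(1000):                        # approach thermal equilibrium
    act = boltzmann_step(act, W, theta, T=1.0)
print(act)
```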

95. Boltzmann Machines: Training

• Idea of Training: Develop a training procedure with which the probability distribution represented by a Boltzmann machine via its energy function can be adapted to a given sample of data points, in order to obtain a probabilistic model of the data.
• Natural Approach to Training:
◦ Choose a measure for the difference between two probability distributions.
◦ Carry out a gradient descent in order to minimize this difference measure.
• Well-known measure: Kullback–Leibler information divergence. For two probability distributions p₁ and p₂ defined over the same sample space Ω:
KL(p₁, p₂) = Σ_{ω∈Ω} p₁(ω) ln( p₁(ω) / p₂(ω) ).
Applied to Boltzmann machines: p₁ refers to the data sample, p₂ to the visible neurons of the Boltzmann machine.

Boltzmann Machines: Training

• Objective of Training: Adapt the connection weights and threshold values in such a way that the true distribution underlying a given data sample is approximated well by the probability distribution represented by the Boltzmann machine on its visible neurons.
• This objective can only be achieved sufficiently well if the data points are actually a sample from a Boltzmann distribution. (Otherwise the model cannot, in principle, be made to fit the sample data well.)
• In order to mitigate this restriction to Boltzmann distributions, a deviation from the structure of Hopfield networks is introduced.
• The neurons are divided into
◦ visible neurons, which receive the data points as input, and
◦ hidden neurons, which are not fixed by the data points.
(Reminder: Hopfield networks have only visible neurons.)

Boltzmann Machines: Training

• In each training step the Boltzmann machine is run twice (two "phases").
• "Positive Phase": The visible neurons are fixed to a randomly chosen data point; only the hidden neurons are updated until thermal equilibrium is reached.
• "Negative Phase": All units are updated until thermal equilibrium is reached.
• In the two phases statistics about individual neurons and pairs of neurons (both visible and hidden) being activated (simultaneously) are collected.
• Then an update is performed according to the following two equations:
Δw_uv = (1/η) (p⁺_uv − p⁻_uv) and Δθ_u = −(1/η) (p⁺_u − p⁻_u),
where
p_u is the probability that neuron u is active,
p_uv is the probability that neurons u and v are both active simultaneously,
and + and − as upper indices indicate the phase referred to.
(All probabilities are estimated from observed relative frequencies.)

Boltzmann Machines: Training

• Intuitive explanation of the update rule:
◦ If a neuron is more often active when a data sample is presented than when the network is allowed to run freely, the probability of the neuron being active is too low, so the threshold should be reduced.
◦ If neurons are more often active together when a data sample is presented than when the network is allowed to run freely, the connection weight between them should be increased, so that they become more likely to be active together.
• This training method is very similar to the Hebbian learning rule. Derived from a biological analogy it says: connections between two neurons that are synchronously active are strengthened ("cells that fire together, wire together").
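A rough sketch of one training step under several simplifying assumptions: T = 1, a fixed number of Gibbs-sampling sweeps in place of true thermal equilibrium, and a generic step size where the slides write the factor as 1/η. Neurons beyond the length of a data point act as hidden units; all names here are my own.

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_stats(W, theta, clamp=None, burn=200, draws=200):
    """Estimate p_u and p_uv by Gibbs sampling at T = 1. If clamp maps
    neuron indices to values, those (visible) neurons stay fixed."""
    n = len(theta)
    act = rng.integers(0, 2, n).astype(float)
    clamp = clamp or {}
    for u, v in clamp.items():
        act[u] = v
    free = [u for u in range(n) if u not in clamp]
    p_u, p_uv = np.zeros(n), np.zeros((n, n))
    for t in range(burn + draws):
        for u in free:
            p = 1.0 / (1.0 + np.exp(-(W[u] @ act - theta[u])))
            act[u] = 1.0 if rng.random() < p else 0.0
        if t >= burn:                       # collect statistics
            p_u += act
            p_uv += np.outer(act, act)
    return p_u / draws, p_uv / draws

def train_step(W, theta, data, step=0.1):
    """One training step: clamped 'positive' phase minus free-running
    'negative' phase, applied to weights and thresholds."""
    x = data[rng.integers(len(data))]
    pos_u, pos_uv = sample_stats(W, theta,
                                 clamp={i: v for i, v in enumerate(x)})
    neg_u, neg_uv = sample_stats(W, theta)
    W += step * (pos_uv - neg_uv)
    np.fill_diagonal(W, 0.0)                # keep: no self-connections
    theta -= step * (pos_u - neg_u)
    return W, theta
```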
