Maths Knowledge Overview - for Part 1, COMP24111

Tingting Mu

tingtingmu@manchester.ac.uk
School of Computer Science
University of Manchester
Manchester M13 9PL, UK

1. Linear Algebra Basics

1.1 Basic Concepts and Notations

A matrix is a rectangular array of numbers arranged in rows and columns. By $X \in \mathbb{R}^{m \times n}$, we denote a matrix $X$ with $m$ rows and $n$ columns of real-valued numbers. The notation $X = [x_{ij}]$ (or $X = [x_{i,j}]$) indicates that the element of $X$ at its $i$-th row and $j$-th column is denoted by $x_{ij}$ (or $x_{i,j}$):

$$X = \begin{bmatrix} x_{11} & x_{12} & x_{13} & \cdots & x_{1n} \\ x_{21} & x_{22} & x_{23} & \cdots & x_{2n} \\ x_{31} & x_{32} & x_{33} & \cdots & x_{3n} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ x_{m1} & x_{m2} & x_{m3} & \cdots & x_{mn} \end{bmatrix}. \tag{1}$$

For instance,

$$A = \begin{bmatrix} 1.2 & 4 & -0.4 \\ 3 & 0 & 1 \end{bmatrix} \tag{2}$$

is a 2 × 3 matrix containing two rows and three columns. Given a matrix $X$, the notation $X_{:,i}$ is usually used to denote its $i$-th column. Its $i$-th row can be denoted by $X_{i,:}$. Its element at the $i$-th row and $j$-th column, which is referred to as the $ij$-th element, can be denoted by $X_{ij}$.

A row vector is a matrix with one row. By $x = [x_1, x_2, \ldots, x_n]$, we denote a row vector of dimension $n$. For instance, the 2nd row of the matrix $A$ in Eq. (2) is $A_{2,:} = [3, 0, 1]$.

A column vector is a matrix with one column. By

$$x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix},$$

we denote a column vector of dimension $n$. For instance, the 3rd column of the matrix $A$ in Eq. (2) is

$$A_{:,3} = \begin{bmatrix} -0.4 \\ 1 \end{bmatrix}.$$
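The row and column notation can be tried out directly. The following is a minimal sketch in plain Python, assuming the matrix is stored as a list of rows; the helper names `row`, `column` and `shape` are hypothetical and are introduced only for this illustration. The helpers take 1-based indices to match the notation of the text.

```python
# Matrix A from Eq. (2), stored as a list of rows (a representation assumed for this sketch).
A = [[1.2, 4.0, -0.4],
     [3.0, 0.0, 1.0]]

def shape(X):
    """Return (number of rows, number of columns) of X."""
    return (len(X), len(X[0]))

def row(X, i):
    """Return the i-th row X_{i,:}, with i counted from 1 as in the text."""
    return list(X[i - 1])

def column(X, j):
    """Return the j-th column X_{:,j}, with j counted from 1 as in the text."""
    return [r[j - 1] for r in X]

second_row = row(A, 2)        # A_{2,:}
third_column = column(A, 3)   # A_{:,3}
element_12 = A[0][1]          # A_{12}; Python itself indexes from 0

print(shape(A), second_row, third_column, element_12)
```

Note the off-by-one between the mathematical indices, which start at 1, and Python's list indices, which start at 0.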


The $i$-th element of a vector $x$, which can be either a row or column vector, is denoted by $x_i$.

A matrix with the same number of rows and columns is called a square matrix. A square matrix with ones on the diagonal and zeros everywhere else is called the identity matrix, typically denoted by $I$:

$$I = \begin{bmatrix} 1 & 0 & 0 & \cdots & 0 \\ 0 & 1 & 0 & \cdots & 0 \\ 0 & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & 1 \end{bmatrix}. \tag{3}$$

An identity matrix of size $n$ is denoted by $I_n \in \mathbb{R}^{n \times n}$. A matrix with all the non-diagonal elements equal to 0 is called a diagonal matrix, typically denoted by $D = \mathrm{diag}([d_1, d_2, \ldots, d_n])$:

$$D = \begin{bmatrix} d_1 & 0 & 0 & \cdots & 0 \\ 0 & d_2 & 0 & \cdots & 0 \\ 0 & 0 & d_3 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & d_n \end{bmatrix}. \tag{4}$$

Clearly, $I = \mathrm{diag}([1, 1, \ldots, 1])$. A diagonal matrix formed from the $n$-dimensional vector $x$ is $\mathrm{diag}(x)$, written as

$$\mathrm{diag}(x) = \begin{bmatrix} x_1 & 0 & 0 & \cdots & 0 \\ 0 & x_2 & 0 & \cdots & 0 \\ 0 & 0 & x_3 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & x_n \end{bmatrix}. \tag{5}$$

1.2 Matrix Operations

A summary of some frequently used matrix operations is provided below.

  • The transpose of a matrix $X$, denoted by $X^T$, is formed by "flipping" the rows and columns: $(X^T)_{ij} = X_{ji}$. For instance,

$$\begin{bmatrix} 1 & 0 & 0 & -7 \\ -2 & 4 & 1 & 0 \end{bmatrix}^T = \begin{bmatrix} 1 & -2 \\ 0 & 4 \\ 0 & 1 \\ -7 & 0 \end{bmatrix}. \tag{6}$$

It has the property $(X^T)^T = X$.

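  • The sum operation is applied to two matrices of the same size. Given two $m \times n$ matrices $X$ and $Y$, their sum is calculated entrywise such that $(X + Y)_{ij} = X_{ij} + Y_{ij}$. For instance,

$$\begin{bmatrix} 1 & 0 & 0 & -7 \\ -2 & 4 & 1 & 0 \end{bmatrix} + \begin{bmatrix} 0 & 0 & 0 & 1 \\ 1 & 2 & 1 & 1 \end{bmatrix} = \begin{bmatrix} 1+0 & 0+0 & 0+0 & -7+1 \\ -2+1 & 4+2 & 1+1 & 0+1 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 & -6 \\ -1 & 6 & 2 & 1 \end{bmatrix}. \tag{7}$$

It has the property $(X + Y)^T = X^T + Y^T$.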

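  • The product of a number (also called a scalar) and a matrix is referred to as scalar multiplication. Given a scalar $c$ and a matrix $X$, their scalar multiplication is computed by multiplying every entry of $X$ by $c$ such that $(cX)_{ij} = c(X)_{ij}$. For instance,

$$2 \begin{bmatrix} 1 & 0 & 0 & -7 \\ -2 & 4 & 1 & 0 \end{bmatrix} = \begin{bmatrix} 2 \times 1 & 2 \times 0 & 2 \times 0 & 2 \times (-7) \\ 2 \times (-2) & 2 \times 4 & 2 \times 1 & 2 \times 0 \end{bmatrix} = \begin{bmatrix} 2 & 0 & 0 & -14 \\ -4 & 8 & 2 & 0 \end{bmatrix}. \tag{8}$$

It has the property $(cX)^T = cX^T$.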

  • The multiplication operation is defined over two matrices where the number of columns of the left matrix has to be the same as the number of rows of the right matrix. Given an $m \times n$ matrix $X$ and an $n \times p$ matrix $Y$, their multiplication is denoted by $XY$, where

$$(XY)_{ij} = \sum_{k=1}^{n} X_{ik} Y_{kj}. \tag{9}$$

An illustrative example of calculating the multiplication of a 4 × 2 matrix $A = [a_{i,j}]$ and a 2 × 3 matrix $B = [b_{i,j}]$ is shown in Figure 1.

[Figure 1: An illustration of calculating matrix multiplication, highlighting the entries $(AB)_{1,2} = a_{1,1}b_{1,2} + a_{1,2}b_{2,2}$ and $(AB)_{3,3} = a_{3,1}b_{1,3} + a_{3,2}b_{2,3}$. The figure is adapted from the Wikipedia page on matrix multiplication.]

Given matrices $A \in \mathbb{R}^{m \times n}$, $B \in \mathbb{R}^{n \times p}$, $C \in \mathbb{R}^{n \times p}$ and $D \in \mathbb{R}^{p \times q}$, some properties of matrix multiplication are:

$$A(B + C) = AB + AC, \tag{10}$$
$$(B + C)D = BD + CD, \tag{11}$$
$$(AB)D = A(BD), \tag{12}$$
$$(AB)^T = B^T A^T. \tag{13}$$

  • The trace operation is defined for a square matrix $X \in \mathbb{R}^{n \times n}$, denoted by $\mathrm{tr}(X)$. It is the sum of all the diagonal elements in the matrix, given by

$$\mathrm{tr}(X) = \sum_{i=1}^{n} X_{ii}. \tag{14}$$

Given two square matrices $X$ and $Y$ of size $n$, and two matrices $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{n \times m}$, some properties of the trace are:

$$\mathrm{tr}(X) = \mathrm{tr}(X^T), \tag{15}$$
$$\mathrm{tr}(X + Y) = \mathrm{tr}(X) + \mathrm{tr}(Y), \tag{16}$$
$$\mathrm{tr}(AB) = \mathrm{tr}(BA). \tag{17}$$

  • The inverse of a square matrix $X$ of size $n$ is denoted by $X^{-1}$, which is the unique matrix such that

$$XX^{-1} = X^{-1}X = I. \tag{18}$$

Non-square matrices do not have inverses by definition. For some square matrices, the inverse may not exist. We say that $X$ is invertible (or non-singular) if $X^{-1}$ exists, and non-invertible (or singular) otherwise. Given two invertible square matrices $X$ and $Y$ of the same size, some properties of the inverse are:

$$(X^{-1})^{-1} = X, \tag{19}$$
$$(X^{-1})^T = (X^T)^{-1}, \tag{20}$$
$$(XY)^{-1} = Y^{-1}X^{-1}. \tag{21}$$

  • Given two $n$-dimensional column vectors $x$ and $y$, the quantity $x^T y$ is called the inner product (or dot product) of the two vectors, which is a real number computed by

$$x^T y = [x_1, x_2, \ldots, x_n] \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = \sum_{i=1}^{n} x_i y_i. \tag{22}$$

  • A norm of a vector $x$ is informally a measure of the "length" of the vector, and is usually denoted by $\|x\|$. Assuming $x$ is an $n$-dimensional column vector, the commonly used Euclidean norm (also called the $l_2$-norm) is given by

$$\|x\|_2 = \sqrt{\sum_{i=1}^{n} x_i^2} = \sqrt{x^T x}. \tag{23}$$

Another example of a norm is the $l_1$-norm, given by

$$\|x\|_1 = \sum_{i=1}^{n} |x_i|. \tag{24}$$

  • A norm can also be defined for a matrix. For example, the Frobenius norm of an $m \times n$ matrix $X$ is given by

$$\|X\|_F = \sqrt{\sum_{i=1}^{m} \sum_{j=1}^{n} X_{ij}^2} = \sqrt{\mathrm{tr}(X^T X)} = \sqrt{\mathrm{tr}(XX^T)}. \tag{25}$$

1.3 Symmetric Matrices

A square matrix $X \in \mathbb{R}^{n \times n}$ is symmetric if $X = X^T$. For instance, the following 4 × 4 matrix is symmetric:

$$\begin{bmatrix} 1 & 0 & 0 & -7 \\ 0 & 4 & 3 & 0 \\ 0 & 3 & 2 & 1 \\ -7 & 0 & 1 & -1.6 \end{bmatrix}. \tag{26}$$

Given an arbitrary square matrix $X \in \mathbb{R}^{n \times n}$, the matrix $X + X^T$ is symmetric.
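The operations of Section 1.2 and the symmetry property above can be checked numerically. The following is a minimal sketch in plain Python, assuming matrices are stored as lists of rows; all function names are hypothetical helpers written for this illustration, and the matrix `B` is an arbitrary choice that does not come from the notes.

```python
import math

def transpose(X):
    """(X^T)_ij = X_ji."""
    return [list(col) for col in zip(*X)]

def add(X, Y):
    """Entrywise sum of two matrices of the same size."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

def matmul(X, Y):
    """(XY)_ij = sum_k X_ik * Y_kj; requires cols(X) == rows(Y)."""
    if len(X[0]) != len(Y):
        raise ValueError("inner dimensions do not match")
    Yt = transpose(Y)
    return [[sum(a * b for a, b in zip(r, c)) for c in Yt] for r in X]

def trace(X):
    """Sum of the diagonal elements of a square matrix."""
    return sum(X[i][i] for i in range(len(X)))

def frobenius_norm(X):
    """Square root of the sum of all squared entries."""
    return math.sqrt(sum(v * v for r in X for v in r))

# The two 2 x 4 matrices from the sum example, with zero entries written out.
X = [[1, 0, 0, -7], [-2, 4, 1, 0]]
Y = [[0, 0, 0, 1], [1, 2, 1, 1]]
# An arbitrary 4 x 2 matrix chosen for this sketch (not from the notes).
B = [[1, 2], [0, 1], [3, 0], [1, 1]]

sum_XY = add(X, Y)
XB = matmul(X, B)                    # 2 x 2
BX = matmul(B, X)                    # 4 x 4
S = add(XB, transpose(XB))           # M + M^T for the square matrix M = XB

print(sum_XY)
print(transpose(XB) == matmul(transpose(B), transpose(X)))  # (XB)^T = B^T X^T
print(trace(XB) == trace(BX))                               # tr(XB) = tr(BX)
print(S == transpose(S))                                    # M + M^T is symmetric
```

Because all entries here are integers, the equality checks are exact; with floating-point data a tolerance-based comparison such as `math.isclose` would be the safer choice.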

2. Calculus Basics

2.1 Derivative and Differentiation Rules

Given a function of a real variable $f(x): \mathbb{R} \to \mathbb{R}$, its derivative $f'(x)$ (or $\frac{df}{dx}$ in Leibniz's notation) measures the rate at which the function value changes with respect to the change of the input variable $x$, where

$$f'(x) = \frac{df}{dx} = \lim_{\Delta x \to 0} \frac{f(x + \Delta x) - f(x)}{\Delta x}. \tag{27}$$

As a trivial case, the numerator in Eq. (27) is always zero for a constant function, so the derivative of a constant function is zero. The tangent line to the graph of a function $f(x)$ at a chosen input value is the straight line that "just touches" the function curve at that point. The slope of the tangent line is equal to the derivative of the function at the chosen value (see Figure 2 for an example).

The process of finding a derivative is called differentiation. Here is a summary of rules for computing the derivative of a function in calculus, referred to as differentiation rules.

  • Linearity: For any functions $f(x)$ and $g(x)$ and any real numbers $a$ and $b$, the derivative of the function $h(x) = af(x) + bg(x)$ with respect to $x$ is

$$h'(x) = af'(x) + bg'(x). \tag{28}$$

Its special cases include the constant factor rule $(af)' = af'$, the sum rule $(f + g)' = f' + g'$, and the subtraction rule $(f - g)' = f' - g'$.

  • Product rule: For any functions $f(x)$ and $g(x)$, the derivative of the function $h(x) = f(x)g(x)$ with respect to $x$ is

$$h'(x) = f'(x)g(x) + f(x)g'(x). \tag{29}$$

[Figure 2: Geometric illustration of the derivative of a single-variable function. The plot shows the graph of the function $f(x)$ and its tangent line at $x_1$, whose slope is $f'(x_1)$.]

  • Chain rule: For any functions $f(x)$ and $g(x)$, the derivative of the function $h(x) = f(g(x))$ with respect to $x$ is

$$h'(x) = f'(g(x))\,g'(x). \tag{30}$$

  • Inverse function rule: If the function $f(x)$ has an inverse function $g(x)$, which means that $g(f(x)) = x$ and $f(g(y)) = y$, the derivative of $g(x)$ with respect to $x$ is

$$g'(x) = \frac{1}{f'(g(x))}, \tag{31}$$

provided that $f'(g(x)) \neq 0$.

  • Quotient rule: For any function $g(x) \neq 0$ and for any function $f(x)$, the derivative of the function $h(x) = \frac{f(x)}{g(x)}$ with respect to $x$ is

$$h'(x) = \frac{f'(x)g(x) - f(x)g'(x)}{(g(x))^2}. \tag{32}$$

Its special case is the reciprocal rule, where the derivative of the function $g(x) = \frac{1}{f(x)}$ with respect to $x$ is $g'(x) = -\frac{f'(x)}{(f(x))^2}$.
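The differentiation rules can be sanity-checked numerically. The following is a minimal sketch, not a proof: it approximates derivatives with a central difference (a symmetric variant of the one-sided quotient in Eq. (27), swapped in here because it is more accurate for a fixed step). The test functions, the point `x0` and the step size `h` are all assumptions made for this illustration.

```python
import math

def derivative(func, x, h=1e-6):
    """Central-difference approximation of func'(x); h is an assumed step size."""
    return (func(x + h) - func(x - h)) / (2 * h)

# Test functions chosen for this sketch, with their known derivatives.
f, df = math.sin, math.cos            # f(x) = sin(x),   f'(x) = cos(x)
g = lambda x: x ** 2 + 1              # g(x) = x^2 + 1 (never zero)
dg = lambda x: 2 * x                  # g'(x) = 2x

x0 = 0.7  # arbitrary evaluation point

# Product rule, Eq. (29)
product_numeric = derivative(lambda x: f(x) * g(x), x0)
product_rule = df(x0) * g(x0) + f(x0) * dg(x0)

# Chain rule, Eq. (30)
chain_numeric = derivative(lambda x: f(g(x)), x0)
chain_rule = df(g(x0)) * dg(x0)

# Quotient rule, Eq. (32)
quotient_numeric = derivative(lambda x: f(x) / g(x), x0)
quotient_rule = (df(x0) * g(x0) - f(x0) * dg(x0)) / g(x0) ** 2

for name, numeric, exact in [("product", product_numeric, product_rule),
                             ("chain", chain_numeric, chain_rule),
                             ("quotient", quotient_numeric, quotient_rule)]:
    print(f"{name}: numeric={numeric:.6f} rule={exact:.6f}")
```

The two columns should agree to several decimal places; the small remaining gap comes from the finite step size and floating-point rounding.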

Utilising the differentiation rules, most derivative computations can eventually be based on the computation of derivatives of some common functions. Table 1 provides an incomplete list showing some frequently used single-variable functions and their derivatives.

Functions    Derivatives      Functions    Derivatives
$x^r$        $r x^{r-1}$      $e^x$        $e^x$
$\ln(x)$     $\frac{1}{x}$    $a^x$        $a^x \ln(a)$
$\sin(x)$    $\cos(x)$        $\cos(x)$    $-\sin(x)$

Table 1: Some frequently used single-variable functions and their derivatives.

2.2 Partial Derivative and Gradient

Given a function of multiple real variables $f(x_1, x_2, \ldots, x_n)$, its partial derivative $f'_{x_i}$ (or denoted by $\frac{\partial f}{\partial x_i}$), where $i = 1, 2, \ldots, n$, is its derivative with respect to one of those variables, with the others held constant. For instance, given a function $f(x, y, z) = x^2 + 3xy + z + 1$, we have $\frac{\partial f}{\partial x} = 2x + 3y$, $\frac{\partial f}{\partial y} = 3x$ and $\frac{\partial f}{\partial z} = 1$.
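The phrase "with the others held constant" translates directly into code: perturb one argument and leave the rest untouched. The following is a minimal sketch for the example function above, using a central-difference approximation; the helper `partial`, the evaluation point and the step size are assumptions made for this illustration.

```python
def f(x, y, z):
    """The example function f(x, y, z) = x^2 + 3xy + z + 1."""
    return x ** 2 + 3 * x * y + z + 1

def partial(func, args, i, h=1e-6):
    """Central-difference estimate of the partial derivative of func with
    respect to its i-th argument (0-based), keeping the other arguments fixed."""
    up = list(args)
    down = list(args)
    up[i] += h
    down[i] -= h
    return (func(*up) - func(*down)) / (2 * h)

point = (1.0, 2.0, 3.0)  # arbitrary evaluation point chosen for this sketch
x, y, z = point

numeric = [partial(f, point, i) for i in range(3)]
analytic = [2 * x + 3 * y, 3 * x, 1.0]  # the partial derivatives stated in the text

print(numeric)
print(analytic)
```

At the chosen point the analytic values are 8, 3 and 1, and the numerical estimates should match them up to rounding error.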

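The gradient is a multi-variable generalisation of the derivative, which is defined on a function of multiple variables $f(x_1, x_2, \ldots, x_n)$. The multi-variable function can be viewed as a function $f(x): \mathbb{R}^n \to \mathbb{R}$ taking the vector $x = [x_1, x_2, \ldots, x_n]$ as the input. Its gradient is denoted by $\nabla_x f$ and is defined from the partial derivatives:

$$\nabla_x f = \left[ \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_n} \right]. \tag{33}$$

It can be seen that a derivative is a scalar-valued function, while a gradient is a vector-valued function. For instance, the gradient of the function $f(x, y, z) = x^2 + 3xy + z + 1$ is $[2x + 3y, 3x, 1]$.

If a function $f(X): \mathbb{R}^{m \times n} \to \mathbb{R}$ takes an $m \times n$ matrix $X = [x_{ij}]$ as the input, the gradient of $f$ with respect to the matrix $X$ is defined as the matrix of partial derivatives, given as

$$\nabla_X f = \begin{bmatrix} \frac{\partial f}{\partial x_{11}} & \frac{\partial f}{\partial x_{12}} & \cdots & \frac{\partial f}{\partial x_{1n}} \\ \frac{\partial f}{\partial x_{21}} & \frac{\partial f}{\partial x_{22}} & \cdots & \frac{\partial f}{\partial x_{2n}} \\ \frac{\partial f}{\partial x_{31}} & \frac{\partial f}{\partial x_{32}} & \cdots & \frac{\partial f}{\partial x_{3n}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f}{\partial x_{m1}} & \frac{\partial f}{\partial x_{m2}} & \cdots & \frac{\partial f}{\partial x_{mn}} \end{bmatrix}. \tag{34}$$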
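As a small worked instance of Eq. (34), consider the sum of squared entries of $X$. This function is chosen here purely for illustration and does not appear in the notes; it equals the squared Frobenius norm of Eq. (25). Each entry $x_{ij}$ appears in exactly one term, so every partial derivative can be read off directly.

```latex
% Illustrative function (an assumption for this example): f(X) = ||X||_F^2
\[
f(X) = \sum_{i=1}^{m} \sum_{j=1}^{n} x_{ij}^{2},
\qquad
\frac{\partial f}{\partial x_{ij}} = 2\,x_{ij}
\quad \Longrightarrow \quad
\nabla_X f = 2X .
\]
```

The gradient has the same shape as $X$, which is the point of the definition: one partial derivative per entry of the input matrix.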

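3. Linear and Quadratic Functions

Let $w \in \mathbb{R}^n$ denote a known $n$-dimensional column vector. For an input column vector $x \in \mathbb{R}^n$, the following function

$$f(x) = \sum_{i=1}^{n} w_i x_i = w^T x \tag{35}$$

is a linear function of $x$. The partial derivative of this function with respect to $x_i$ is

$$\frac{\partial f(x)}{\partial x_i} = \frac{\partial}{\partial x_i} \left( \sum_{j=1}^{n} w_j x_j \right) = w_i, \quad \text{for } i = 1, 2, \ldots, n. \tag{36}$$

The gradient of $f(x)$ with respect to the input column vector $x$ is

$$\nabla_x f(x) = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix} = \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_n \end{bmatrix} = w. \tag{37}$$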


Note that the function $f(x)$ can also be written as $f(x) = x^T w$, and its gradient with respect to $x$ is $w$.

Let $A = [a_{ij}]$ denote an $n \times n$ square matrix. For an input column vector $x \in \mathbb{R}^n$, the following function

$$f(x) = \sum_{i=1}^{n} \sum_{j=1}^{n} a_{ij} x_i x_j = x^T A x \tag{38}$$

is a quadratic function of $x$. To compute the partial derivative of this function with respect to an element $x_k$ in the input vector ($k = 1, 2, \ldots, n$), we consider separately the terms that contain $x_k$ and $x_k^2$, and the terms that do not contain $x_k$. This gives

\begin{align}
\frac{\partial f(x)}{\partial x_k} &= \frac{\partial}{\partial x_k} \left( \sum_{i=1}^{n} \sum_{j=1}^{n} a_{ij} x_i x_j \right) \notag \\
&= \frac{\partial}{\partial x_k} \left( a_{kk} x_k^2 + \sum_{i \neq k} a_{ik} x_i x_k + \sum_{j \neq k} a_{kj} x_k x_j + \sum_{i \neq k} \sum_{j \neq k} a_{ij} x_i x_j \right) \notag \\
&= 2 a_{kk} x_k + \sum_{i \neq k} a_{ik} x_i + \sum_{j \neq k} a_{kj} x_j \notag \\
&= \sum_{i=1}^{n} a_{ik} x_i + \sum_{j=1}^{n} a_{kj} x_j \tag{39} \\
&= A_{:,k}^T x + A_{k,:} x. \tag{40}
\end{align}

The gradient of $f(x)$ with respect to $x$ is

$$\nabla_x f(x) = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix} = \begin{bmatrix} A_{:,1}^T x + A_{1,:} x \\ A_{:,2}^T x + A_{2,:} x \\ \vdots \\ A_{:,n}^T x + A_{n,:} x \end{bmatrix} = \begin{bmatrix} A_{:,1}^T x \\ A_{:,2}^T x \\ \vdots \\ A_{:,n}^T x \end{bmatrix} + \begin{bmatrix} A_{1,:} x \\ A_{2,:} x \\ \vdots \\ A_{n,:} x \end{bmatrix} = A^T x + Ax. \tag{41}$$

A special case of the quadratic function is $f(x) = x^T x$, where $A$ is an identity matrix. Its gradient with respect to $x$ is therefore $\nabla_x f(x) = I^T x + Ix = 2x$.
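The gradient formulas for the linear and quadratic functions can be compared against a numerical gradient. The following is a minimal sketch under stated assumptions: vectors are plain Python lists, the gradient is approximated by central differences, and the values of `w`, `A` and `x` are arbitrary numbers picked for this illustration (with `A` deliberately non-symmetric, so that $A^T x + Ax$ differs from $2Ax$). All function names are hypothetical helpers.

```python
def linear(w, x):
    """f(x) = w^T x, Eq. (35)."""
    return sum(wi * xi for wi, xi in zip(w, x))

def quadratic(A, x):
    """f(x) = x^T A x, Eq. (38)."""
    n = len(x)
    return sum(A[i][j] * x[i] * x[j] for i in range(n) for j in range(n))

def quadratic_gradient(A, x):
    """Closed form A^T x + A x, Eq. (41), computed entry by entry as in Eq. (39)."""
    n = len(x)
    return [sum(A[i][k] * x[i] for i in range(n)) +
            sum(A[k][j] * x[j] for j in range(n))
            for k in range(n)]

def numerical_gradient(func, x, h=1e-6):
    """Central-difference estimate of the gradient of func at x."""
    grad = []
    for k in range(len(x)):
        up = list(x)
        down = list(x)
        up[k] += h
        down[k] -= h
        grad.append((func(up) - func(down)) / (2 * h))
    return grad

# Arbitrary example values chosen for this sketch (not from the notes).
w = [2.0, -1.0, 0.5]
A = [[1.0, 2.0, 0.0],
     [-1.0, 3.0, 4.0],
     [0.5, 0.0, -2.0]]
x = [0.3, -1.2, 2.0]

grad_linear = numerical_gradient(lambda v: linear(w, v), x)
grad_quadratic = numerical_gradient(lambda v: quadratic(A, v), x)
closed_form = quadratic_gradient(A, x)

print(grad_linear, w)                # should agree: gradient of w^T x is w
print(grad_quadratic, closed_form)   # should agree: gradient of x^T A x is A^T x + A x
```

Setting `A` to the identity matrix in this sketch reproduces the special case $\nabla_x (x^T x) = 2x$.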

4. General Form of Optimisation

A mathematical optimisation problem has the following general form:

$$\min \; O(x_1, x_2, \ldots, x_n) \tag{42}$$

subject to

$$f_1(x_1, x_2, \ldots, x_n) \leq 0, \tag{43}$$
$$f_2(x_1, x_2, \ldots, x_n) \leq 0, \tag{44}$$
$$\vdots$$
$$f_m(x_1, x_2, \ldots, x_n) \leq 0. \tag{45}$$

The real-valued function $O(x_1, x_2, \ldots, x_n)$ that takes $n$ real-valued variables as the input is called the optimisation objective function. The real-valued functions $f_i(x_1, x_2, \ldots, x_n)$ ($i = 1, 2, \ldots, m$), which appear in the conditions $f_i \leq 0$, are called the constraint functions. They restrict the sets from which the input variables are allowed to choose their values. Storing the $n$ input variables in a vector such as $x = [x_1, x_2, \ldots, x_n]$, the above problem can be written as

$$\min \; O(x) \tag{46}$$
$$\text{subject to } f_i(x) \leq 0, \quad i = 1, 2, \ldots, m. \tag{47}$$

The above notation can also be simplified as

$$\min_{f_i(x) \leq 0,\; i = 1, 2, \ldots, m} O(x).$$

If all the input variables of the objective function $O(x_1, x_2, \ldots, x_n)$ are allowed to be chosen from the set of all real numbers such that $x_i \in \mathbb{R}$ for $i = 1, 2, \ldots, n$ (or equivalently, $x \in \mathbb{R}^n$ for $O(x)$), an unconstrained optimisation problem is to be solved, simply written as $\min O(x_1, x_2, \ldots, x_n)$, or $\min O(x)$.

We look at the example of finding the minimum of the function $(x + 1)^2 \sin(y)$, where the input $x$ is allowed to be chosen from the set of real numbers between 0 and 3 (expressed as $x \in [0, 3]$ or $0 \leq x \leq 3$), while the input $y$ is allowed to be chosen from the set of real numbers between 0 and 5 (expressed as $y \in [0, 5]$ or $0 \leq y \leq 5$). The objective function is $O(x, y) = (x + 1)^2 \sin(y)$. The inputs $x$ and $y$ are restricted to the two sets $[0, 3]$ and $[0, 5]$, which can be converted to four constraint functions: $-x \leq 0$, $x - 3 \leq 0$, $-y \leq 0$, $y - 5 \leq 0$. Following the general form of representing an optimisation problem, the above example can be expressed as

$$\min_{\substack{-x \leq 0 \\ x - 3 \leq 0 \\ -y \leq 0 \\ y - 5 \leq 0}} (x + 1)^2 \sin(y).$$

If we instead maximise the function $(x + 1)^2 \sin(y)$, the problem can be expressed as the minimisation of its negative:

$$\min_{\substack{-x \leq 0 \\ x - 3 \leq 0 \\ -y \leq 0 \\ y - 5 \leq 0}} -(x + 1)^2 \sin(y).$$
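The example problem is small enough to solve by brute force. The following is a minimal sketch that evaluates the objective on a regular grid over the intervals $x \in [0, 3]$ and $y \in [0, 5]$ stated in the example and keeps the smallest value. Grid search is used here only because it is easy to follow; it is not a method proposed in the notes, and the helper `grid_minimise` and the grid resolution are assumptions of this illustration.

```python
import math

def objective(x, y):
    """O(x, y) = (x + 1)^2 sin(y)."""
    return (x + 1) ** 2 * math.sin(y)

def grid_minimise(func, x_range, y_range, steps=300):
    """Evaluate func on a (steps + 1) x (steps + 1) grid covering the box
    x_range x y_range and return (smallest value, x, y)."""
    (x_lo, x_hi), (y_lo, y_hi) = x_range, y_range
    best = None
    for i in range(steps + 1):
        x = x_lo + (x_hi - x_lo) * i / steps
        for j in range(steps + 1):
            y = y_lo + (y_hi - y_lo) * j / steps
            value = func(x, y)
            if best is None or value < best[0]:
                best = (value, x, y)
    return best

# Minimisation over the box 0 <= x <= 3, 0 <= y <= 5.
min_value, min_x, min_y = grid_minimise(objective, (0.0, 3.0), (0.0, 5.0))

# Maximisation, written as minimisation of the negated objective.
neg_value, max_x, max_y = grid_minimise(lambda x, y: -objective(x, y),
                                        (0.0, 3.0), (0.0, 5.0))
max_value = -neg_value

print(min_value, min_x, min_y)
print(max_value, max_x, max_y)
```

On this box $(x + 1)^2$ ranges over $[1, 16]$ and $\sin(y)$ reaches both $-1$ (at $y = 3\pi/2$) and $1$ (at $y = \pi/2$), so the exact minimum is $-16$ and the exact maximum is $16$, both at $x = 3$. The grid search should land close to these values, within the accuracy allowed by the grid spacing.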