Machine Learning - MT 2017
2. Mathematical Basics
Christoph Haase
University of Oxford
October 11, 2017
About this lecture
◮ No Machine Learning without rigorous mathematics
◮ This should be the most boring lecture
◮ Serves as a reference for notation used throughout the course
◮ If there are any holes, make sure to fill them sooner rather than later
◮ Attempt Problem Sheet 0 to see where you stand
1
◮ Linear algebra
◮ Calculus
◮ Probability theory
2
◮ Scalar: single number r ∈ R
◮ Vector: array of numbers x = (x1, . . . , xD) ∈ RD of dimension D
◮ Matrix: two-dimensional array A ∈ Rm×n written as A = (ai,j) with
  m rows and n columns, 1 ≤ i ≤ m and 1 ≤ j ≤ n
◮ vector x is an RD×1 matrix
◮ Ai,j denotes ai,j
◮ Ai,: denotes the i-th row
◮ A:,i denotes the i-th column
◮ AT is the transpose of A such that (AT)i,j = Aj,i
◮ A is symmetric if A = AT
◮ A ∈ Rn×n is diagonal if Ai,j = 0 for all i ≠ j
◮ In is the n × n diagonal matrix s.t. (In)i,i = 1
3
◮ Addition: C = A + B s.t. Ci,j = Ai,j + Bi,j with A, B, C ∈ Rm×n
  ◮ associative: A + (B + C) = (A + B) + C
  ◮ commutative: A + B = B + A
◮ Scalar multiplication: B = r · A s.t. Bi,j = r · Ai,j
◮ Multiplication: C = A · B s.t. Ci,j = Σ1≤k≤n Ai,k · Bk,j
  with A ∈ Rm×n, B ∈ Rn×p, C ∈ Rm×p
  ◮ associative: A · (B · C) = (A · B) · C
  ◮ not commutative in general: A · B ≠ B · A
  ◮ distributive wrt. addition: A · (B + C) = A · B + A · C
  ◮ (A · B)T = BT · AT
◮ v and w are orthogonal if vT · w = 0 (see the NumPy sketch below)
4
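The identities above are easy to sanity-check numerically. The sketch below is a minimal illustration (assuming NumPy is available; it is not part of the original slides):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(2, 3))
B = rng.normal(size=(3, 4))
C = rng.normal(size=(3, 4))
E = rng.normal(size=(4, 5))

# Addition is elementwise, associative and commutative
assert np.allclose(B + C, C + B)

# Multiplication: (2x3) @ (3x4) gives a 2x4 matrix
print((A @ B).shape)                       # (2, 4)

# Associativity, distributivity and the transpose rule
assert np.allclose(A @ (B @ E), (A @ B) @ E)
assert np.allclose(A @ (B + C), A @ B + A @ C)
assert np.allclose((A @ B).T, B.T @ A.T)

# Square matrices generally do not commute
F, G = rng.normal(size=(3, 3)), rng.normal(size=(3, 3))
print(np.allclose(F @ G, G @ F))           # False (almost surely)

# Orthogonality: v^T w = 0
v, w = np.array([1.0, 1.0, 0.0]), np.array([1.0, -1.0, 5.0])
print(v @ w)                               # 0.0
```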
◮ v ∈ Rn with v ≠ 0 is an eigenvector of A ∈ Rn×n with eigenvalue λ ∈ R if
  A · v = λ · v
◮ A is positive (negative) definite if all eigenvalues are strictly greater
  (smaller) than zero
◮ Determinant of A ∈ Rn×n with eigenvalues λ1, . . . , λn is
  det(A) = λ1 · λ2 · · · λn
◮ v(1), . . . , v(n) ∈ RD are linearly independent if there are no
  r1, . . . , rn ∈ R, not all zero, such that r1 · v(1) + · · · + rn · v(n) = 0
◮ A ∈ Rn×n is invertible if there is A−1 ∈ Rn×n s.t. A · A−1 = A−1 · A = In
◮ Note that:
  ◮ A is invertible if rows of A are linearly independent
  ◮ equivalently if det(A) ≠ 0
  ◮ If A invertible then A · x = b has solution x = A−1 · b (sketch below)
5
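A minimal NumPy sketch (my own illustration, not from the slides) that checks the eigenvalue relation, the determinant as a product of eigenvalues, and the solution of A · x = b:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])          # symmetric, positive definite

eigvals, eigvecs = np.linalg.eig(A)
print(eigvals)                      # both strictly positive

# A . v = lambda . v for every eigenpair (eigenvectors are the columns)
for lam, v in zip(eigvals, eigvecs.T):
    assert np.allclose(A @ v, lam * v)

# det(A) equals the product of the eigenvalues
assert np.isclose(np.linalg.det(A), np.prod(eigvals))

# A is invertible (det != 0), so A.x = b has the solution x = A^-1 . b
b = np.array([1.0, 2.0])
x = np.linalg.solve(A, b)           # preferred over forming the inverse explicitly
assert np.allclose(x, np.linalg.inv(A) @ b)
assert np.allclose(A @ x, b)
```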
◮ The Lp norm of v = (v1, . . . , vD) ∈ RD is given by
  ‖v‖p = ( Σ1≤i≤D |vi|^p )^(1/p)
◮ Properties of Lp (which actually hold for any norm):
  ◮ ‖v‖p = 0 implies v = 0
  ◮ ‖v + w‖p ≤ ‖v‖p + ‖w‖p
  ◮ ‖r · v‖p = |r| · ‖v‖p for all r ∈ R
◮ Popular norms (see the NumPy sketch below):
  ◮ Manhattan norm L1
  ◮ Euclidean norm L2
  ◮ Maximum norm L∞, where ‖v‖∞ = max1≤i≤D |vi|
◮ Vectors v, w ∈ RD are orthonormal if v and w are orthogonal and
  ‖v‖2 = ‖w‖2 = 1
6
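The popular norms above map directly onto np.linalg.norm; a small sketch (assuming NumPy) checking the values and the norm properties:

```python
import numpy as np

v = np.array([3.0, -4.0, 12.0])

# L1 (Manhattan), L2 (Euclidean) and L-infinity (maximum) norms
l1 = np.linalg.norm(v, ord=1)         # sum of absolute values  = 19
l2 = np.linalg.norm(v, ord=2)         # sqrt(9 + 16 + 144)      = 13
linf = np.linalg.norm(v, ord=np.inf)  # max absolute value      = 12

assert np.isclose(l1, np.sum(np.abs(v)))
assert np.isclose(l2, np.sqrt(np.sum(v ** 2)))
assert np.isclose(linf, np.max(np.abs(v)))

# Norm properties: triangle inequality and absolute homogeneity
w = np.array([1.0, 2.0, -2.0])
assert np.linalg.norm(v + w) <= np.linalg.norm(v) + np.linalg.norm(w)
assert np.isclose(np.linalg.norm(-2.5 * v), 2.5 * np.linalg.norm(v))
```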
◮ First derivative:
  f′(x) = lim h→0 (f(x + h) − f(x)) / h
◮ f′(x∗) = 0 means that x∗ is a critical or stationary point
  ◮ Can be a local minimum, a local maximum, or a saddle point
  ◮ Global minima are local minima x∗ with smallest f(x∗)
  ◮ Second derivative test to (partially) decide nature of a critical point
    (sketch below)
◮ Differentiation rules:
  ◮ Sum rule: (f + g)′ = f′ + g′
  ◮ Product rule: (f · g)′ = f′ · g + f · g′
  ◮ Chain rule: if f = h(g) then f′ = h′(g) · g′
7
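A finite-difference sketch of the derivative and the second derivative test; the function f below is my own illustrative choice, not from the slides:

```python
# f(x) = x^3 - 3x has critical points at x = -1 (local max) and x = 1 (local min)
f = lambda x: x ** 3 - 3 * x

def derivative(f, x, h=1e-5):
    # central finite-difference approximation of f'(x)
    return (f(x + h) - f(x - h)) / (2 * h)

def second_derivative(f, x, h=1e-4):
    return (f(x + h) - 2 * f(x) + f(x - h)) / h ** 2

for x_star in (-1.0, 1.0):
    print(x_star,
          derivative(f, x_star),         # approximately 0: critical point
          second_derivative(f, x_star))  # -6 -> local maximum, +6 -> local minimum
```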
◮ Partial derivative of f(x1, . . . , xm) in direction xi at a = (a1, . . . , am):
  ∂f/∂xi (a) = lim h→0 (f(a1, . . . , ai + h, . . . , am) − f(a1, . . . , am)) / h
◮ Gradient (assuming f is differentiable everywhere):
  ∇xf(a) = (∂f/∂x1 (a), . . . , ∂f/∂xm (a))T
◮ a is a critical point if ∇xf(a) = 0 (see the numerical sketch below)
◮ f given as f = (f1, . . . , fn) with fi : Rm → R
◮ Jacobian J of f is an n × m matrix such that Ji,j = ∂fi/∂xj
8
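Partial derivatives can be approximated coordinate-wise with the same limit definition; a small numerical-gradient sketch (the function is my own illustrative choice, not from the slides):

```python
import numpy as np

def f(x):
    # f(x1, x2) = sin(x1) + x1 * x2^2
    return np.sin(x[0]) + x[0] * x[1] ** 2

def numerical_gradient(f, a, h=1e-6):
    # approximate each partial derivative by a forward difference
    grad = np.zeros_like(a)
    for i in range(a.size):
        e = np.zeros_like(a)
        e[i] = h
        grad[i] = (f(a + e) - f(a)) / h
    return grad

a = np.array([1.0, 2.0])
print(numerical_gradient(f, a))
# exact gradient: (cos(x1) + x2^2, 2*x1*x2) = (cos(1) + 4, 4)
print(np.array([np.cos(a[0]) + a[1] ** 2, 2 * a[0] * a[1]]))
```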
◮ Hessian is the square matrix consisting of all second-order derivatives:
  H(f)i,j = ∂²f / ∂xi ∂xj
◮ Symmetric wherever the second-order derivatives are continuous
◮ If H(f)(a) is positive (negative) definite then the critical point a is a local
  minimum (maximum)
◮ Second derivative test may be inconclusive (sketch below)
9
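A sketch of the second derivative test via the eigenvalues of the Hessian, using the quadratic f(x) = xT A x with symmetric A, whose Hessian is 2A and whose critical point is 0; the matrix is my own illustrative choice:

```python
import numpy as np

A = np.array([[2.0, 0.5],
              [0.5, 1.0]])
H = 2 * A                               # Hessian of f(x) = x^T A x

# Second derivative test via the signs of the Hessian's eigenvalues
eigvals = np.linalg.eigvalsh(H)         # eigvalsh: eigenvalues of a symmetric matrix
if np.all(eigvals > 0):
    print("positive definite -> local minimum at the critical point")
elif np.all(eigvals < 0):
    print("negative definite -> local maximum at the critical point")
elif np.any(eigvals > 0) and np.any(eigvals < 0):
    print("indefinite -> saddle point")
else:
    print("test inconclusive (some eigenvalue is zero)")
```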
◮ Chain rule for vector-valued functions: for y = g(x) with g : Rm → Rn and
  z = f(y) with f : Rn → R,
  ∇xz = (∂y/∂x)T · ∇yz
  where ∂y/∂x is the n × m Jacobian of g (numeric check below)
10
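A numeric spot-check of this identity; the functions g and f below are my own illustrative choices:

```python
import numpy as np

# y = g(x) = (x1^2, x1*x2),  z = f(y) = y1 + 3*y2,  so z = x1^2 + 3*x1*x2
def grad_z_wrt_x(x):
    jacobian_g = np.array([[2 * x[0], 0.0],
                           [x[1], x[0]]])      # dy/dx
    grad_y = np.array([1.0, 3.0])              # grad_y z
    return jacobian_g.T @ grad_y               # chain rule: (dy/dx)^T . grad_y z

x = np.array([1.0, 2.0])
print(grad_z_wrt_x(x))                              # (2*x1 + 3*x2, 3*x1) = (8, 3)
print(np.array([2 * x[0] + 3 * x[1], 3 * x[0]]))    # direct gradient of z
```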
◮ Consists of sample space S and a probability function p : P(S) → [0, 1]
◮ Fulfills axioms of probability:
  ◮ p(∅) = 0 and p(S) = 1
  ◮ For mutually exclusive events A1, A2, . . .
    p(A1 ∪ A2 ∪ · · · ) = Σ∞i=1 p(Ai)
◮ p(S \ A) = 1 − p(A)
◮ If A ⊆ B then p(A) ≤ p(B)
◮ p(A ∪ B) = p(A) + p(B) − p(A ∩ B)
11
◮ Given events A, B with p(B) > 0, the conditional probability of A given B is
  p(A | B) = p(A ∩ B) / p(B)
◮ p(A) is the prior, and p(A | B) is the posterior probability of A
◮ Law of total probability: Given a partition A1, . . . , An of S with p(Ai) > 0,
  p(B) = Σni=1 p(B | Ai) · p(Ai)
◮ Bayes’ rule:
  p(A | B) = p(B | A) · p(A) / p(B)
  (see the worked sketch below)
12
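A worked example of the law of total probability and Bayes' rule; all numbers below are made up purely for illustration:

```python
# Illustrative numbers only (not from the slides): a test for a rare condition
# with prior p(A) = 0.01, sensitivity p(B|A) = 0.95 and false-positive rate
# p(B|not A) = 0.05, where B is the event "test is positive".
p_A = 0.01
p_B_given_A = 0.95
p_B_given_notA = 0.05

# Law of total probability: p(B) = p(B|A) p(A) + p(B|not A) p(not A)
p_B = p_B_given_A * p_A + p_B_given_notA * (1 - p_A)

# Bayes' rule: posterior p(A|B) = p(B|A) p(A) / p(B)
p_A_given_B = p_B_given_A * p_A / p_B
print(p_A_given_B)   # roughly 0.16: the posterior is far below the sensitivity
```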
◮ Function from sample space to some numeric domain (usually R)
◮ p(X = x) denotes probability of the event {s ∈ S : X(s) = x}
◮ Write X ∼ p(x) to specify the probability distribution of X
◮ Discrete if there are a1, a2, . . . such that p(X = aj for some j) = 1
  ◮ Probability mass function (PMF) pX given by pX(x) = p(X = x), giving the
    probability that X takes the value x
  ◮ Cumulative distribution function (CDF) maps x to p(X ≤ x)
◮ Continuous if CDF is differentiable
  ◮ Probability density function (PDF) p(x) is the derivative of the CDF, giving
    p(a ≤ X ≤ b) = ∫a≤x≤b p(x) dx
13
◮ Natural generalisation to vectors of random variables, giving a joint
  probability distribution p(X1, . . . , Xn)
◮ Marginal probability distribution: Given p(X, Y ), obtain p(X) via
  p(X = x) = Σy p(X = x, Y = y)
◮ Conditional probabilities: Assuming p(X = x) > 0,
  p(Y = y | X = x) = p(X = x, Y = y) / p(X = x)
◮ Chain rule of conditional probability:
  p(X1, . . . , Xn) = p(X1) · Πni=2 p(Xi | X1, . . . , Xi−1)
  (see the NumPy sketch below)
14
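A small NumPy sketch computing marginals and a conditional distribution from an illustrative, made-up joint PMF:

```python
import numpy as np

# Joint PMF p(X, Y) on a 2 x 3 grid (illustrative values; entries sum to 1)
joint = np.array([[0.10, 0.20, 0.10],
                  [0.25, 0.05, 0.30]])
assert np.isclose(joint.sum(), 1.0)

# Marginals: sum out the other variable
p_X = joint.sum(axis=1)     # p(X = x) = sum_y p(X = x, Y = y)
p_Y = joint.sum(axis=0)

# Conditional: p(Y | X = 0) = p(X = 0, Y) / p(X = 0)
p_Y_given_X0 = joint[0] / p_X[0]
print(p_X, p_Y, p_Y_given_X0, sep="\n")
assert np.isclose(p_Y_given_X0.sum(), 1.0)
```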
◮ EX∼p[f(x)] = Σx p(x) · f(x) (for discrete r.v.’s)
◮ EX∼p[f(x)] = ∫ p(x) · f(x) dx (for continuous r.v.’s)
◮ Linearity of expectation:
  E[a · f(x) + b · g(x)] = a · E[f(x)] + b · E[g(x)]
◮ Variance captures how much the values of a probability distribution vary on
  average around the expected value:
  Var(f(x)) = E[(f(x) − E[f(x)])^2]
◮ Standard deviation is the square root of the variance
◮ Covariance generalises variance to two r.v.’s:
  Cov(f(x), g(y)) = E[(f(x) − E[f(x)]) · (g(y) − E[g(y)])]
◮ Covariance matrix Σ generalises covariance to multiple r.v.’s xi:
  Σi,j = Cov(xi, xj)
  (see the NumPy sketch below)
15
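A sampling-based sketch (assuming NumPy) estimating expectation, variance, standard deviation, and the covariance matrix; the two correlated variables are my own illustrative construction:

```python
import numpy as np

rng = np.random.default_rng(0)

# Samples from two correlated random variables
x = rng.normal(loc=1.0, scale=2.0, size=100_000)
y = 0.5 * x + rng.normal(size=100_000)

print(x.mean())          # sample estimate of E[X]        (close to 1)
print(x.var())           # sample estimate of Var(X)      (close to 4)
print(x.std())           # standard deviation = sqrt(Var) (close to 2)

# 2 x 2 covariance matrix: diagonal = variances, off-diagonal = Cov(X, Y)
Sigma = np.cov(x, y)
print(Sigma)             # Cov(X, Y) here is close to 0.5 * Var(X) = 2
```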
◮ Bernoulli:
  ◮ Parameter: φ ∈ [0, 1]
  ◮ PMF: p(X = 1) = φ, p(X = 0) = 1 − φ
  ◮ E[X] = φ; Var(X) = φ · (1 − φ)
◮ Binomial distribution:
  ◮ Parameters: φ ∈ [0, 1], n ∈ N \ {0}
  ◮ PMF: p(X = k) = (n choose k) · φ^k · (1 − φ)^(n−k)
  ◮ E[X] = n · φ; Var(X) = n · φ · (1 − φ)
  (see the SciPy sketch below)
16
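Both distributions are available in scipy.stats; a short sketch (assuming SciPy is installed) checking the PMFs, means, and variances:

```python
import numpy as np
from scipy.stats import bernoulli, binom

phi, n = 0.3, 10

# Bernoulli: E[X] = phi, Var(X) = phi * (1 - phi)
print(bernoulli.pmf([0, 1], phi))              # [0.7, 0.3]
print(bernoulli.mean(phi), bernoulli.var(phi))

# Binomial: number of successes in n independent Bernoulli(phi) trials
k = np.arange(n + 1)
pmf = binom.pmf(k, n, phi)
assert np.isclose(pmf.sum(), 1.0)
print(binom.mean(n, phi), binom.var(n, phi))   # n*phi = 3.0, n*phi*(1-phi) = 2.1
```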
◮ Normal distribution:
  ◮ Parameters: µ ∈ R, σ² > 0
  ◮ PDF: p(x) = 1/√(2πσ²) · exp(−(x − µ)²/(2σ²))
  (see the SciPy sketch below)
17
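A sketch comparing the PDF formula above with scipy.stats.norm (illustrative parameter values):

```python
import numpy as np
from scipy.stats import norm

mu, sigma = 1.0, 2.0
x = np.linspace(-5, 7, 7)

# PDF written out explicitly ...
pdf_manual = np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)
# ... agrees with scipy's implementation (scale is the standard deviation)
assert np.allclose(pdf_manual, norm.pdf(x, loc=mu, scale=sigma))

# About 95% of the probability mass lies within two standard deviations of mu
print(norm.cdf(mu + 2 * sigma, mu, sigma) - norm.cdf(mu - 2 * sigma, mu, sigma))
```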
◮ Multivariate normal distribution:
  ◮ Parameters: dimension k, mean µ ∈ Rk, covariance Σ ∈ Rk×k positive semi-definite
  ◮ PDF (for invertible Σ):
    p(x) = 1/√((2π)^k · det(Σ)) · exp(−½ · (x − µ)T · Σ−1 · (x − µ))
  (see the SciPy sketch below)
18
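The same comparison for the multivariate case, with an illustrative mean and covariance (Σ chosen positive definite so that the PDF exists):

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])          # symmetric, positive definite

x = np.array([0.5, 0.5])

# PDF written out from the formula above
k = mu.size
quad = (x - mu) @ np.linalg.inv(Sigma) @ (x - mu)
pdf_manual = np.exp(-0.5 * quad) / np.sqrt((2 * np.pi) ** k * np.linalg.det(Sigma))

assert np.isclose(pdf_manual, multivariate_normal.pdf(x, mean=mu, cov=Sigma))
print(pdf_manual)
```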
◮ Laplace distribution:
  ◮ Parameters: µ ∈ R, γ > 0
  ◮ PDF: p(x) = 1/(2γ) · exp(−|x − µ|/γ)
19
◮ Supervised Machine Learning: Linear regression
20