Machine Learning - MT 2017: 2. Mathematical Basics (Christoph Haase)


SLIDE 1

Machine Learning - MT 2017
2. Mathematical Basics

Christoph Haase, University of Oxford
October 11, 2017

SLIDE 2

About this lecture

◮ No Machine Learning without rigorous mathematics
◮ This should be the most boring lecture
◮ Serves as a reference for the notation used throughout the course
◮ If there are any holes, make sure to fill them sooner rather than later
◮ Attempt Problem Sheet 0 to see where you stand

SLIDE 3

Outline

Today's lecture

◮ Linear algebra
◮ Calculus
◮ Probability theory


SLIDE 5

Linear algebra

We will mostly work in real vector spaces:

◮ Scalar: a single number r ∈ R
◮ Vector: an array of numbers x = (x_1, . . . , x_D) ∈ R^D of dimension D
◮ Matrix: a two-dimensional array A ∈ R^{m×n}, written as

$$A = \begin{pmatrix} a_{1,1} & a_{1,2} & \cdots & a_{1,n} \\ a_{2,1} & a_{2,2} & \cdots & a_{2,n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m,1} & a_{m,2} & \cdots & a_{m,n} \end{pmatrix}$$

◮ A vector x is an R^{D×1} matrix
◮ A_{i,j} denotes a_{i,j}
◮ A_{i,:} denotes the i-th row, A_{:,i} the i-th column
◮ A^T is the transpose of A, defined by (A^T)_{i,j} = A_{j,i}; A is symmetric if A = A^T
◮ A ∈ R^{n×n} is diagonal if A_{i,j} = 0 for all i ≠ j
◮ I_n is the n × n diagonal matrix with (I_n)_{i,i} = 1
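
These objects map directly onto numpy arrays; a minimal sketch (numpy and the concrete values are illustrative assumptions, not part of the slides):

```python
import numpy as np

r = 3.5                                  # scalar: a single number in R
x = np.array([1.0, 2.0, 3.0])            # vector in R^3
A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])          # matrix in R^{2x3}

print(A.T)                               # transpose: (A^T)_{i,j} = A_{j,i}
print(A[0, :], A[:, 0])                  # first row A_{1,:} and first column A_{:,1}
print(np.eye(3))                         # identity matrix I_3
B = A.T @ A                              # A^T A is always symmetric
print(np.allclose(B, B.T))               # True
```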


SLIDE 8

Operations on matrices

◮ Addition: C = A + B with C_{i,j} = A_{i,j} + B_{i,j}, where A, B, C ∈ R^{m×n}
  ◮ associative: A + (B + C) = (A + B) + C
  ◮ commutative: A + B = B + A
◮ Scalar multiplication: B = r · A with B_{i,j} = r · A_{i,j}
◮ Multiplication: C = A · B with

$$C_{i,j} = \sum_{1 \le k \le n} A_{i,k} \cdot B_{k,j}, \qquad A \in \mathbb{R}^{m \times n},\ B \in \mathbb{R}^{n \times p},\ C \in \mathbb{R}^{m \times p}$$

  ◮ associative: A · (B · C) = (A · B) · C
  ◮ not commutative in general: A · B ≠ B · A
  ◮ distributive w.r.t. addition: A · (B + C) = A · B + A · C
  ◮ (A · B)^T = B^T · A^T
  ◮ v and w are orthogonal if v^T · w = 0
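
These algebraic laws are easy to sanity-check numerically; a small numpy sketch (the random matrices are an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 5))
C = rng.standard_normal((5, 2))

# Associativity of matrix multiplication: A(BC) = (AB)C
print(np.allclose(A @ (B @ C), (A @ B) @ C))      # True

# Transpose of a product reverses the order: (AB)^T = B^T A^T
print(np.allclose((A @ B).T, B.T @ A.T))          # True

# Non-commutativity: even for square matrices, AB != BA in general
S, T = rng.standard_normal((3, 3)), rng.standard_normal((3, 3))
print(np.allclose(S @ T, T @ S))                  # False (almost surely)
```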


SLIDE 13

Eigenvectors, eigenvalues, determinant, linear independence, inverses

◮ v ∈ R^n (v ≠ 0) is an eigenvector of A ∈ R^{n×n} with eigenvalue λ ∈ R if

$$A \cdot v = \lambda \cdot v$$

◮ A is positive (negative) definite if all eigenvalues are strictly greater (smaller) than zero
◮ The determinant of A ∈ R^{n×n} with eigenvalues λ_1, . . . , λ_n is

$$\det(A) = \lambda_1 \cdot \lambda_2 \cdots \lambda_n$$

◮ v^{(1)}, . . . , v^{(n)} ∈ R^D are linearly independent if there are no r_1, . . . , r_n ∈ R, not all zero, such that

$$\sum_{1 \le i \le n} r_i \cdot v^{(i)} = 0$$

◮ A ∈ R^{n×n} is invertible if there is A^{−1} ∈ R^{n×n} s.t.

$$A \cdot A^{-1} = A^{-1} \cdot A = I_n$$

◮ Note that:
  ◮ A is invertible if the rows of A are linearly independent
  ◮ equivalently, if det(A) ≠ 0
  ◮ if A is invertible, then A · x = b has the unique solution x = A^{−1} · b
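
All of these quantities are available through numpy.linalg; a minimal sketch (the specific matrix is made up for illustration):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])               # symmetric, so eigenvalues are real

eigvals, eigvecs = np.linalg.eig(A)
print(eigvals)                           # eigenvalues 3 and 1: both > 0, so A is positive definite

# det(A) equals the product of the eigenvalues
print(np.isclose(np.linalg.det(A), np.prod(eigvals)))   # True

# Since det(A) != 0, A is invertible and Ax = b has a unique solution
b = np.array([1.0, 0.0])
x = np.linalg.solve(A, b)                # preferable to forming A^{-1} explicitly
print(np.allclose(A @ x, b))             # True
```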


SLIDE 17

Vector norms

Vector norms allow us to talk about the length of vectors.

◮ The L_p norm of v = (v_1, . . . , v_D) ∈ R^D is given by

$$\|v\|_p = \left( \sum_{1 \le i \le D} |v_i|^p \right)^{1/p}$$

◮ Properties of L_p (which actually hold for any norm):
  ◮ ‖v‖_p = 0 implies v = 0
  ◮ ‖v + w‖_p ≤ ‖v‖_p + ‖w‖_p
  ◮ ‖r · v‖_p = |r| · ‖v‖_p for all r ∈ R
◮ Popular norms:
  ◮ Manhattan norm L_1
  ◮ Euclidean norm L_2
  ◮ Maximum norm L_∞, where ‖v‖_∞ = max_{1≤i≤D} |v_i|
◮ Vectors v, w ∈ R^D are orthonormal if v and w are orthogonal and ‖v‖_2 = ‖w‖_2 = 1
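
numpy.linalg.norm covers all three popular norms via its ord parameter; a quick sketch:

```python
import numpy as np

v = np.array([3.0, -4.0])

print(np.linalg.norm(v, ord=1))       # L1 (Manhattan):  |3| + |-4| = 7
print(np.linalg.norm(v, ord=2))       # L2 (Euclidean):  sqrt(9 + 16) = 5
print(np.linalg.norm(v, ord=np.inf))  # L_inf (maximum): max(3, 4) = 4

# Triangle inequality: ||v + w|| <= ||v|| + ||w||
w = np.array([1.0, 2.0])
print(np.linalg.norm(v + w) <= np.linalg.norm(v) + np.linalg.norm(w))  # True
```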


SLIDE 19

Calculus

Functions of one variable f : R → R

◮ First derivative:

$$f'(x) = \frac{d}{dx} f(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}$$

◮ f′(x∗) = 0 means that x∗ is a critical or stationary point
  ◮ can be a local minimum, a local maximum, or a saddle point
  ◮ global minima are local minima x∗ with smallest f(x∗)
  ◮ second derivative test to (partially) decide the nature of a critical point
◮ Differentiation rules:

$$\frac{d}{dx} x^n = n \cdot x^{n-1} \qquad \frac{d}{dx} a^x = a^x \cdot \ln(a) \qquad \frac{d}{dx} \log_a(x) = \frac{1}{x \cdot \ln(a)}$$

$$(f + g)' = f' + g' \qquad (f \cdot g)' = f' \cdot g + f \cdot g'$$

◮ Chain rule: if f = h(g) then f′ = h′(g) · g′
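
The limit definition can be checked numerically with finite differences; a small sketch (the test functions are arbitrary choices, not from the slides):

```python
import math

def derivative(f, x, h=1e-6):
    """Central finite-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

f = lambda x: x**3                      # f(x) = x^3, so f'(x) = 3x^2
print(derivative(f, 2.0))               # ~12.0

# Chain rule check: f = h(g) with h(u) = exp(u), g(x) = x^2,
# so f'(x) = exp(x^2) * 2x
fc = lambda x: math.exp(x**2)
print(derivative(fc, 1.0), math.exp(1.0) * 2)   # both ~5.4366
```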


SLIDE 22

Calculus

Functions of multiple variables f : R^m → R

◮ Partial derivative of f(x_1, . . . , x_m) in direction x_i at a = (a_1, . . . , a_m):

$$\frac{\partial}{\partial x_i} f(a) = \lim_{h \to 0} \frac{f(a_1, \ldots, a_i + h, \ldots, a_m) - f(a_1, \ldots, a_i, \ldots, a_m)}{h}$$

◮ Gradient (assuming f is differentiable everywhere):

$$\nabla_x f = \left( \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_m} \right) \quad \text{s.t.} \quad \nabla_x f(a) = \left( \frac{\partial f}{\partial x_1}(a), \ldots, \frac{\partial f}{\partial x_m}(a) \right)$$

  ◮ points in the direction of steepest ascent
  ◮ a is a critical point if ∇_x f(a) = 0

Functions of multiple variables to vectors f : R^m → R^n:

◮ f is given as f = (f_1, . . . , f_n) with f_i : R^m → R
◮ The Jacobian J of f is an n × m matrix such that

$$J_{i,j} = \frac{\partial f_i}{\partial x_j}$$
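
A numerical gradient can be built directly from the partial-derivative definition; a sketch (the function choice is illustrative only):

```python
import numpy as np

def gradient(f, a, h=1e-6):
    """Finite-difference gradient of f: R^m -> R at point a."""
    a = np.asarray(a, dtype=float)
    g = np.zeros_like(a)
    for i in range(a.size):
        e = np.zeros_like(a)
        e[i] = h                          # perturb only coordinate i
        g[i] = (f(a + e) - f(a - e)) / (2 * h)
    return g

f = lambda x: x[0]**2 + 3 * x[0] * x[1]   # analytically: grad f = (2x + 3y, 3x)
print(gradient(f, [1.0, 2.0]))            # ~[8. 3.]
```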


SLIDE 25

Calculus

Second-order derivatives of f : R^m → R:

◮ The Hessian is the square matrix consisting of all second-order derivatives:

$$H(f)(x)_{i,j} = \frac{\partial^2}{\partial x_i \partial x_j} f(x)$$

◮ symmetric wherever the second-order derivatives are continuous
◮ if H(f)(a) is positive (negative) definite, then the critical point a is a local minimum (maximum)
◮ the second derivative test may be inconclusive

Useful differentiation rules:

$$\nabla_x (c^T x) = c \qquad \nabla_x (x^T A x) = A x + A^T x \ \ (= 2Ax \text{ for symmetric } A)$$

$$\nabla_x (f + g) = \nabla_x f + \nabla_x g \qquad \nabla_x (f \cdot g) = f \cdot \nabla_x g + g \cdot \nabla_x f$$

See http://en.wikipedia.org/wiki/Matrix_calculus for many more useful rules, and use them!
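
The rule ∇_x(x^T A x) = Ax + A^T x is easy to verify numerically; a sketch reusing the finite-difference gradient idea from the previous slide (the random A and x are illustrative):

```python
import numpy as np

def gradient(f, a, h=1e-6):
    """Finite-difference gradient of f: R^m -> R at point a."""
    a = np.asarray(a, dtype=float)
    g = np.zeros_like(a)
    for i in range(a.size):
        e = np.zeros_like(a)
        e[i] = h
        g[i] = (f(a + e) - f(a - e)) / (2 * h)
    return g

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 3))           # not symmetric in general
x = rng.standard_normal(3)

quad = lambda v: v @ A @ v                # f(x) = x^T A x
print(np.allclose(gradient(quad, x), A @ x + A.T @ x))   # True
```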


SLIDE 27

Chain rule in higher dimensions

Let y = g(x), z = f(y) for x ∈ R^m and y ∈ R^n:

$$\frac{\partial z}{\partial x_i} = \sum_j \frac{\partial z}{\partial y_j} \cdot \frac{\partial y_j}{\partial x_i} \qquad \nabla_x z = J_g^T \cdot \nabla_y z = \frac{\partial y}{\partial x} \cdot \nabla_y z$$

Example

Let g(x, y) = (x², y), f(s, t) = (s + t)², and z = f(g(x, y)). Then

$$\frac{\partial z}{\partial x} = \frac{\partial z}{\partial s} \cdot \frac{\partial s}{\partial x} + \frac{\partial z}{\partial t} \cdot \frac{\partial t}{\partial x} = 2(x^2 + y) \cdot 2x + 2(x^2 + y) \cdot 0 = 4x(x^2 + y)$$

$$J_g^T = \begin{pmatrix} 2x & 0 \\ 0 & 1 \end{pmatrix} \qquad \nabla_y z = \left( 2(x^2 + y),\ 2(x^2 + y) \right)$$

$$\nabla_x z = \left( 4x(x^2 + y),\ 2(x^2 + y) \right)$$
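
The worked example can be replayed numerically; a sketch checking ∇_x z = J_g^T · ∇_y z at one concrete (arbitrarily chosen) point:

```python
import numpy as np

x0, y0 = 2.0, 3.0                        # arbitrary evaluation point
s, t = x0**2, y0                         # y = g(x) = (x^2, y)

# Analytic pieces from the slide
grad_y_z = np.array([2 * (s + t), 2 * (s + t)])   # gradient of f(s,t) = (s+t)^2
J_g_T = np.array([[2 * x0, 0.0],
                  [0.0,    1.0]])                  # transposed Jacobian of g

print(J_g_T @ grad_y_z)                            # chain rule: [56. 14.]
print(4 * x0 * (x0**2 + y0), 2 * (x0**2 + y0))     # direct formulas: 56.0 14.0
```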


SLIDE 29

Probability theory

Probability space:

◮ Consists of a sample space S and a probability function p : P(S) → [0, 1] assigning a probability to every event
◮ Fulfills the axioms of probability:
  ◮ p(∅) = 0 and p(S) = 1
  ◮ for mutually exclusive events A_1, A_2, . . .

$$p\left( \bigcup_{i=1}^{\infty} A_i \right) = \sum_{i=1}^{\infty} p(A_i)$$

Trivial properties:

◮ p(Ā) = 1 − p(A)
◮ if A ⊆ B then p(A) ≤ p(B)
◮ p(A ∪ B) = p(A) + p(B) − p(A ∩ B)

SLIDE 30

Probability theory

Conditional probability:

◮ Given events A, B with p(B) > 0, the conditional probability of A given B is

$$p(A \mid B) = \frac{p(A \cap B)}{p(B)}$$

◮ p(A) is the prior, and p(A|B) is the posterior probability of A
◮ Law of total probability: given a partition A_1, . . . , A_n of S with p(A_i) > 0,

$$p(B) = \sum_{i=1}^{n} p(B \mid A_i) \cdot p(A_i)$$

◮ Bayes' rule:

$$p(A \mid B) = \frac{p(B \mid A) \cdot p(A)}{p(B)}$$
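
A classic application of Bayes' rule, with the law of total probability in the denominator; the diagnostic-test numbers below are made up for illustration, not from the slides:

```python
p_disease = 0.01                 # prior p(A)
p_pos_given_disease = 0.95       # p(B | A)
p_pos_given_healthy = 0.05       # p(B | not A)

# Law of total probability: p(B) = p(B|A)p(A) + p(B|~A)p(~A)
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Bayes' rule: p(A|B) = p(B|A)p(A) / p(B)
posterior = p_pos_given_disease * p_disease / p_pos
print(posterior)                 # ~0.161: still unlikely despite a positive test
```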


SLIDE 33

Probability theory

Random variable (r.v.):

◮ A function from the sample space to some numeric domain (usually R)
◮ p(X = x) denotes the probability of the event {s ∈ S : X(s) = x}
◮ Write X ∼ p(x) to specify the probability distribution of X

Discrete random variables:

◮ X is discrete if there are a_1, a_2, . . . such that p(X = a_j for some j) = 1
◮ The probability mass function (PMF) p_X, given by p_X(x) = p(X = x), gives the distribution of X
◮ The cumulative distribution function (CDF) maps x to p(X ≤ x)

Continuous random variables:

◮ X is continuous if its CDF is differentiable
◮ The probability density function (PDF) p(x) is the derivative of the CDF, giving the distribution of X
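
PMF versus CDF for a small discrete r.v.; a numpy sketch with an arbitrary three-valued distribution:

```python
import numpy as np

values = np.array([0, 1, 2])            # possible values a_1, a_2, a_3
pmf = np.array([0.2, 0.5, 0.3])         # p(X = a_j); must sum to 1

cdf = np.cumsum(pmf)                    # CDF at a_j: p(X <= a_j)
print(cdf)                              # [0.2 0.7 1. ]

# p(X <= 1) read off the CDF
print(cdf[values == 1][0])              # 0.7
```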

SLIDE 34

Probability theory

Joint probability distributions:

◮ Natural generalisation to vectors of random variables, giving joint probability distributions, e.g., p(X = x, Y = y)
◮ Marginal probability distribution: given p(X, Y), obtain p(X) via

$$p(X = x) = \sum_{y} p(X = x, Y = y) \quad \text{resp.} \quad p(x) = \int p(x, y) \, dy$$

◮ Conditional probabilities: assuming p(X = x) > 0,

$$p(Y = y \mid X = x) = \frac{p(Y = y, X = x)}{p(X = x)}$$

◮ Chain rule of conditional probability:

$$p(X^{(1)}, \ldots, X^{(n)}) = p(X^{(1)}) \cdot \prod_{i=2}^{n} p(X^{(i)} \mid X^{(1)}, \ldots, X^{(i-1)})$$
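
Marginalisation and conditioning on a small joint table; the 2×2 joint distribution below is made up for illustration:

```python
import numpy as np

# Joint p(X = x, Y = y) as a table: rows index x, columns index y
joint = np.array([[0.10, 0.30],
                  [0.20, 0.40]])

p_x = joint.sum(axis=1)                 # marginal p(X): sum out y
p_y = joint.sum(axis=0)                 # marginal p(Y): sum out x
print(p_x, p_y)                         # [0.4 0.6] [0.3 0.7]

# Conditional p(Y | X = 0): renormalise the x = 0 row
print(joint[0] / p_x[0])                # [0.25 0.75]
```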


SLIDE 36

Probability theory

Expected value of a random variable w.r.t. f:

◮ E_{X∼p}[f(x)] = Σ_x p(x) · f(x) (for discrete r.v.'s)
◮ E_{X∼p}[f(x)] = ∫ p(x) · f(x) dx (for continuous r.v.'s)
◮ Linearity of expectation:

$$E_X[\alpha \cdot f(x) + \beta \cdot g(x)] = \alpha \cdot E_X[f(x)] + \beta \cdot E_X[g(x)]$$

Properties of random variables:

◮ Variance captures how much the values of a probability distribution vary on average when randomly drawn:

$$\mathrm{Var}(f(x)) = E\left[ (f(x) - E[f(x)])^2 \right]$$

◮ Standard deviation is the square root of the variance:

$$\mathrm{SD}(f(x)) = \sqrt{\mathrm{Var}(f(x))}$$

◮ Covariance generalises variance to two r.v.'s:

$$\mathrm{Cov}(f(x), g(y)) = E\left[ (f(x) - E[f(x)]) \cdot (g(y) - E[g(y)]) \right]$$

◮ The covariance matrix Σ generalises covariance to multiple r.v.'s x_i:

$$\Sigma_{i,j} = \mathrm{Cov}(f_i(x_i), f_j(x_j))$$
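
These definitions line up with numpy's estimators on sampled data; a quick sketch (the sampled variables are an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.standard_normal(100_000)            # samples of X
y = 2 * x + rng.standard_normal(100_000)    # a variable correlated with X

print(x.mean())                             # ~0: estimate of E[X]
print(x.var())                              # ~1: Var(X) = E[(X - E[X])^2]
print(x.std())                              # ~1: SD is the square root of the variance
print(np.cov(x, y))                         # 2x2 covariance matrix Sigma
```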


SLIDE 38

Probability theory

Well-known discrete probability distributions:

◮ Bernoulli distribution:
  ◮ parameter: φ ∈ [0, 1]
  ◮ PMF: p(X = 1) = φ, p(X = 0) = 1 − φ
  ◮ E[X] = φ; Var(X) = φ · (1 − φ)
◮ Binomial distribution:
  ◮ parameters: φ ∈ [0, 1], n ∈ N \ {0}
  ◮ PMF:

$$p(X = k) = \binom{n}{k} \cdot \varphi^k \cdot (1 - \varphi)^{n-k}$$

  ◮ E[X] = n · φ; Var(X) = n · φ · (1 − φ)
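
The binomial PMF written out directly; a sketch using only the Python standard library (the parameter values are arbitrary):

```python
from math import comb

def binomial_pmf(k: int, n: int, phi: float) -> float:
    """p(X = k) for X ~ Binomial(n, phi)."""
    return comb(n, k) * phi**k * (1 - phi)**(n - k)

n, phi = 10, 0.3
pmf = [binomial_pmf(k, n, phi) for k in range(n + 1)]

print(sum(pmf))                                   # 1.0: a valid PMF
print(sum(k * p for k, p in enumerate(pmf)))      # 3.0 = n * phi, matching E[X]
```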

SLIDE 39

Probability theory

Well-known continuous probability distributions:

◮ Normal distribution:
  ◮ parameters: µ, σ²
  ◮ PDF:

$$N(x; \mu, \sigma^2) = \sqrt{\frac{1}{2\pi\sigma^2}} \exp\left( -\frac{1}{2\sigma^2} (x - \mu)^2 \right)$$

  ◮ E[X] = µ; Var(X) = σ²

[Plot: normal PDFs with µ = 0 and µ = 2]
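
The density formula can be sanity-checked by numerical integration on a grid; a minimal numpy sketch:

```python
import numpy as np

def normal_pdf(x, mu, sigma2):
    """N(x; mu, sigma^2) as defined on the slide."""
    return np.sqrt(1.0 / (2 * np.pi * sigma2)) * np.exp(-(x - mu) ** 2 / (2 * sigma2))

xs = np.linspace(-10.0, 10.0, 100_001)
dx = xs[1] - xs[0]
ys = normal_pdf(xs, mu=0.0, sigma2=1.0)

print((ys * dx).sum())              # ~1.0: the density integrates to one
print((xs * ys * dx).sum())         # ~0.0: E[X] = mu
print((xs**2 * ys * dx).sum())      # ~1.0: E[X^2] = sigma^2 when mu = 0
```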

SLIDE 40

Probability theory

◮ Multivariate normal distribution:
  ◮ parameters: k, µ ∈ R^k, Σ ∈ R^{k×k} positive definite (so that Σ^{−1} exists)
  ◮ PDF:

$$N(x; \mu, \Sigma) = \sqrt{\frac{1}{(2\pi)^k \det(\Sigma)}} \exp\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right)$$

  ◮ E[X] = µ; Var(X) = Σ
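
numpy can both evaluate this density and sample from it; a brief sketch with an arbitrarily chosen µ and Σ:

```python
import numpy as np

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])            # symmetric positive definite

def mvn_pdf(x, mu, Sigma):
    """Multivariate normal density as defined on the slide."""
    k = mu.size
    diff = x - mu
    norm_const = np.sqrt(1.0 / ((2 * np.pi) ** k * np.linalg.det(Sigma)))
    return norm_const * np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff))

print(mvn_pdf(np.array([0.0, 1.0]), mu, Sigma))   # density at the mean

# Empirical check of E[X] = mu and Var(X) = Sigma
rng = np.random.default_rng(0)
samples = rng.multivariate_normal(mu, Sigma, size=100_000)
print(samples.mean(axis=0))               # ~mu
print(np.cov(samples.T))                  # ~Sigma
```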

SLIDE 41

Probability theory

Well-known continuous probability distributions:

◮ Laplace distribution:
  ◮ parameters: µ, γ > 0
  ◮ PDF:

$$\mathrm{Lap}(x; \mu, \gamma) = \frac{1}{2\gamma} \exp\left( -\frac{|x - \mu|}{\gamma} \right)$$

  ◮ E[X] = µ; Var(X) = 2γ²
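
A quick empirical check of the Laplace moments; numpy's scale parameter plays the role of γ (the parameter values are illustrative):

```python
import numpy as np

mu, gamma = 0.0, 1.5
rng = np.random.default_rng(7)
samples = rng.laplace(loc=mu, scale=gamma, size=1_000_000)

print(samples.mean())                   # ~0.0 = mu
print(samples.var())                    # ~4.5 = 2 * gamma^2
```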

SLIDE 42

Next Time

◮ Supervised Machine Learning: Linear regression