Mathematics for Machine Learning
2019 CS420 Machine Learning, Lecture 1A (Home Reading Materials)
Weinan Zhang, Shanghai Jiao Tong University
http://wnzhang.net
http://wnzhang.net/teaching/cs420/index.html
Areas of mathematics essential to machine learning
- Machine learning is part of both statistics and computer science
- Probability and statistical inference: probability spaces, random variables, parameter estimation
- Linear algebra: compact representation of linear transformations on data
- Optimization theory: fitting model parameters by minimizing an objective
Probability spaces
- A probability space is a random process or experiment with three components:
  - Ω, the sample space: the set of all possible outcomes O
  - F, the event space: the set of possible events E, where each event is a set of outcomes
  - P, the probability function, which assigns a real number between 0 and 1 (the probability of O or E) to each outcome and event in its domain
- The probability of an event is the sum of the probabilities of the outcomes it contains: p(E) = Σ p(O), where the sum runs over O ∈ E
- Discrete probability space: |Ω| is finite
- Continuous probability space: |Ω| is infinite
Example of a discrete probability space: flip a fair coin three times
- Ω = { HHH, HHT, HTH, HTT, THH, THT, TTH, TTT }
- All outcomes are equally likely: p( HHH ) = p( HHT ) = … = p( TTH ) = p( TTT ) = 1/8
- Event "exactly two of the three flips are heads": p( HHT ) + p( HTH ) + p( THH ) = 3/8
- Event "the first two flips are tails": p( TTH ) + p( TTT ) = 1/4
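A minimal Python sketch (assumed example code, not from the slides) that enumerates this probability space and checks the event probabilities above:

    from itertools import product

    omega = [''.join(flips) for flips in product('HT', repeat=3)]  # the 8 outcomes
    p = {outcome: 1 / 8 for outcome in omega}                      # fair coin

    two_heads = [o for o in omega if o.count('H') == 2]            # HHT, HTH, THH
    print(sum(p[o] for o in two_heads))                            # 0.375 = 3/8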
Example of a continuous probability space: height of a randomly chosen American man
- Ω = the set of possible heights; an outcome is some single value in the range 2 feet to 8 feet
- Example event: the chosen man is taller than 5.5 feet tall
- The probability of outcomes is described by a continuous function, p( O ), the probability density
[Figures: a discrete distribution (example: sum of two fair dice) and a continuous distribution (example: waiting time between eruptions of Old Faithful, in minutes); horizontal axis O, vertical axis p( O )]
Random variables
- A random variable X is a function that associates a number x with each outcome O of a process, i.e. with each point of a probability space
- Example: X = height of a randomly chosen American man; each draw of a single man yields a value of X
- Multiple random processes can be considered jointly (whether in parallel or in sequence); each combination of outcomes has a probability, and each process is represented by its own random variable
- Example: two processes represented by random variables X and Y. The probability that process X has outcome x and process Y has outcome y is the joint probability p( X = x, Y = y )
Example: survey of vehicles, X = vehicle type, Y = manufacturer region
- Joint probability: the probability that both variables take particular values at once, e.g.
  p( X = minivan, Y = European ) = 0.1481
- Marginal probability: the distribution of one variable alone, obtained by summing the joint distribution over the other variable, e.g.
  p( X = minivan ) = 0.0741 + 0.1111 + 0.1481 = 0.3333
- Conditional probability: the distribution of one variable when another variable takes a certain value, e.g.
  p( Y = European | X = minivan ) = 0.1481 / ( 0.0741 + 0.1111 + 0.1481 ) = 0.4443
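A minimal Python sketch (assumed code; only the minivan row of the joint table appears in the slides, its three entries taken to be the joint probabilities with American, Asian, and European manufacturers):

    joint = {('minivan', 'American'): 0.0741,
             ('minivan', 'Asian'):    0.1111,
             ('minivan', 'European'): 0.1481}

    p_minivan = sum(joint.values())                         # marginal: 0.3333
    p_euro_given_minivan = joint[('minivan', 'European')] / p_minivan
    print(round(p_minivan, 4), round(p_euro_given_minivan, 4))   # 0.3333 0.4443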
Visualizing probability in two dimensions
- Joint, marginal, and conditional probability can be pictured geometrically, where areas represent relative probabilities
[Figure: regions whose areas represent the relative probabilities of joint, marginal, and conditional events]
Independence
- Random variables X and Y are independent if the outcome of one has no influence on the probabilities of the other: p( X = x, Y = y ) = p( X = x ) p( Y = y )
- Example: drawing cards of a given suit from a deck, if the card from each draw is replaced, the draws are independent
- Counterexample: drawing from the deck, if the card from each draw is not replaced, the draws are dependent
Bayes rule
- Bayes rule allows computing conditional probabilities for one variable when conditional probabilities for another variable are known:
  p( B | A ) = p( A | B ) p( B ) / p( A )
- Example: The weatherman is forecasting rain for tomorrow. When it actually rains, the weatherman has correctly forecast rain 90% of the time. When it doesn't rain, he has incorrectly forecast rain 10% of the time. [In this classic example it rains only about 5 days per year, so p( rain ) ≈ 0.014.] What is the probability it will rain tomorrow?
- We want the probability of rain tomorrow, given a forecast for rain by the weatherman. The answer can be determined from Bayes rule:
  p( rain | forecast ) = p( forecast | rain ) p( rain ) / p( forecast )
                       = ( 0.9 × 0.014 ) / ( 0.9 × 0.014 + 0.1 × 0.986 ) ≈ 0.111
- So even when the weatherman predicts rain, it rains only about 11% of the time, which is much higher than the average (base) rate but still unlikely
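A minimal Python sketch (assumed code) checking the Bayes-rule arithmetic, using the 5-rainy-days-per-year base rate assumed above:

    p_rain = 5 / 365                 # base rate of rain, ~0.014
    p_fc_rain = 0.90                 # p( forecast | rain )
    p_fc_dry = 0.10                  # p( forecast | no rain )

    p_fc = p_fc_rain * p_rain + p_fc_dry * (1 - p_rain)
    print(p_fc_rain * p_rain / p_fc)     # ~0.111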
Expected value and variance
- The distribution of a random variable X gives the probability that X takes on the various values of x; summary statistics can be defined on X
- Expected value (mean): the "average" value of X, taking into account the probability of the various values x:
  μ = E[X] = Σ_x x p(x)
- Variance: the average squared distance of X from the mean μ, taking into account the probability of the various values x:
  σ² = Var(X) = E[(X − μ)²] = Σ_x (x − μ)² p(x)
- The standard deviation σ is the square root of the variance
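A minimal Python sketch (assumed code) computing the mean and variance of the sum of two fair dice, the discrete distribution pictured earlier:

    from itertools import product

    sums = [a + b for a, b in product(range(1, 7), repeat=2)]   # all 36 rolls
    p = 1 / len(sums)                                           # equally likely

    mean = sum(x * p for x in sums)                   # E[X] = 7.0
    var = sum((x - mean) ** 2 * p for x in sums)      # Var(X) = 35/6 ~= 5.83
    print(mean, var)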
Covariance and correlation
- Covariance measures how two random variables move together: high positive (negative) covariance means they tend to move in the same (opposite) directions at the same time:
  Cov(X, Y) = E[ (X − μ_X)(Y − μ_Y) ]
- Correlation is covariance normalized by the standard deviations of the two variables, so it always lies in [−1, 1]:
  ρ(X, Y) = Cov(X, Y) / ( σ_X σ_Y )
- Intuition: https://www.zhihu.com/question/20099757
[Figure: scatter plots and their correlation values: linear dependence with noise, linear dependence without noise, various nonlinear dependencies]
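A minimal NumPy sketch (assumed code) estimating covariance and correlation from samples of a noisy linear dependence:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=1000)
    y = 2 * x + rng.normal(size=1000)     # linear dependence with noise

    print(np.cov(x, y)[0, 1])             # covariance ~ 2
    print(np.corrcoef(x, y)[0, 1])        # correlation ~ 0.9, strong and positive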
Maximum likelihood estimation (MLE)
- Setting: i.i.d. random variables X1, …, Xn with corresponding observations x1, …, xn, drawn from a model with unknown parameter θ
- Learning means fitting the parameters of the model to the data
- MLE picks the value of the unknown parameter θ that makes the observed data most likely:
  θ_MLE = argmax_θ p( x1, …, xn | θ )
Maximum a posteriori estimation (MAP)
- In the Bayesian view, θ is itself a random variable, and we specify a prior distribution p(θ)
- MAP picks the θ that maximizes the posterior distribution:
  θ_MAP = argmax_θ p( θ | x1, …, xn ) = argmax_θ p( x1, …, xn | θ ) p( θ )
- Example: estimate the head probability of a coin, the unknown parameter θ. Assume that θ satisfies a normal distribution prior centered at 0.5
- How to choose the prior p(θ)? It encodes beliefs held before seeing the data, acts as a regularizer, and helps reduce the overfitting
- Example data: HHTTHHHHHT (7 heads, 3 tails)
  - MLE estimates θ = 0.7; MAP with the normal prior estimates θ = 0.558, pulled toward 0.5
- With 10× the data, 70 heads and 30 tails, the MAP estimate moves closer to the MLE
- With 700 heads and 300 tails, MAP and MLE nearly coincide: the prior matters less and less as the data grows
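A minimal Python sketch (assumed code): MLE vs. MAP by grid search. The N(0.5, 0.1) prior is an assumption, chosen because it reproduces the slide's MAP estimate of about 0.558 for 7 heads and 3 tails:

    import numpy as np

    def map_estimate(heads, tails, sigma=0.1):
        theta = np.linspace(1e-6, 1 - 1e-6, 100001)
        log_lik = heads * np.log(theta) + tails * np.log(1 - theta)
        log_prior = -(theta - 0.5) ** 2 / (2 * sigma ** 2)   # N(0.5, sigma^2)
        return theta[np.argmax(log_lik + log_prior)]

    print(map_estimate(7, 3))        # ~0.558 (the MLE is 0.7)
    print(map_estimate(700, 300))    # ~0.70: the prior washes out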
Unbiased estimators
- An estimator is unbiased if the expected value of the estimate is the same as the true value of the parameter
- Example: for i.i.d. samples X1, …, Xn with mean μ and variance σ², the sample mean (1/n) Σ Xi is unbiased: its expectation is μ
- Is the sample variance (1/n) Σ (Xi − X̄)² unbiased? No: its expectation is ((n − 1)/n) σ²; dividing by n − 1 instead of n gives an unbiased estimator of the variance σ²
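A minimal NumPy sketch (assumed code) showing the bias of the divide-by-n sample variance and the effect of Bessel's correction (divide by n − 1):

    import numpy as np

    rng = np.random.default_rng(0)
    n, trials, sigma2 = 5, 200000, 4.0
    samples = rng.normal(0.0, np.sqrt(sigma2), size=(trials, n))

    print(samples.var(axis=1, ddof=0).mean())   # ~3.2 = (n-1)/n * sigma2, biased
    print(samples.var(axis=1, ddof=1).mean())   # ~4.0 = sigma2, unbiased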
Linear algebra for machine learning
- The most common data format in machine learning is a 2D array, where rows are samples and columns are attributes
- View each sample as a vector of attributes, and the whole array as a matrix
- Notation: the superscript ^T means "transpose"
Vectors and matrices
- A vector x in R^n is an ordered list of n numbers; geometrically, it has magnitude and direction
- Its magnitude is measured by a norm, i.e. any function ‖·‖ that satisfies: nonnegativity, definiteness (‖x‖ = 0 iff x = 0), homogeneity (‖tx‖ = |t| ‖x‖), and the triangle inequality
- A matrix A in R^{m×n} is a 2D array of numbers; e.g. a_24 is the element in the second row, fourth column of A
- A vector is a special case of a matrix, where one of the matrix dimensions is 1
- Transpose: (A^T)_ij = A_ji; addition is element-wise; multiplication by a scalar: (cA)_ij = c A_ij
- Matrix multiplication C = AB, C_ij = Σ_k A_ik B_kj: the column dimension of the first matrix must match the row dimension of the following matrix, so (m×n)(n×p) gives m×p
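A minimal NumPy sketch (assumed code) of the shape rule for matrix multiplication:

    import numpy as np

    A = np.arange(6).reshape(2, 3)      # 2 x 3
    B = np.arange(12).reshape(3, 4)     # 3 x 4: rows match A's columns
    C = A @ B                           # 2 x 4
    print(C.shape, A.T.shape)           # (2, 4) (3, 2): transpose swaps dims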
Orthogonality
- The inner product of two vectors can be written as x^T y = ‖x‖ ‖y‖ cos θ, where θ is the angle between x and y
- If x^T y = 0, then x and y are orthogonal; a set of vectors is orthogonal if its elements are pairwise orthogonal, and orthonormal if in addition each vector has unit norm
[Figures: vectors x, y with angle θ between them; vector addition x + y]
- A square matrix U is orthogonal if its columns are orthonormal, i.e. U^T U = U U^T = I
- Multiplication by orthogonal matrices preserves geometric structure: lengths (‖Ux‖ = ‖x‖) and angles ((Ux)^T (Uy) = x^T y)
- If U is tall (m > n) and has orthonormal columns, then U^T U = I, but U U^T ≠ I
Matrix norms and eigendecomposition
- A matrix norm is any function ‖·‖ on matrices that satisfies: nonnegativity, definiteness, homogeneity, and the triangle inequality
- An example is: the Frobenius norm ‖A‖_F = sqrt(Σ_ij A_ij²); the norm induced by a vector norm, ‖A‖ = max_{x≠0} ‖Ax‖ / ‖x‖, is called operator norm
- For a square matrix A, we say that a nonzero vector x is an eigenvector of A corresponding to eigenvalue λ if Ax = λx
- Stacking the eigenvectors as the columns of P and putting the eigenvalues in a diagonal matrix Λ with the eigenvalues on its diagonal gives AP = PΛ
- If P is invertible, then A = P Λ P^(-1), and A is diagonalizable
Singular value decomposition (SVD)
- Any matrix A in R^{m×n} has an SVD as follows: A = U Σ V^T
- U and V are orthogonal matrices
- Σ is a (rectangular) diagonal matrix with the singular values of A on its diagonal, sorted σ1 ≥ σ2 ≥ … ≥ 0
- Dimensions: U is an m×m matrix, Σ is an m×n matrix, V is an n×n matrix
- The singular values of A are the square roots of the nonzero eigenvalues of A^T A
- If A is symmetric, its singular values are the absolute values of the eigenvalues of A
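A minimal NumPy sketch (assumed code) verifying the relation between the singular values of A and the eigenvalues of A^T A:

    import numpy as np

    A = np.random.default_rng(0).normal(size=(5, 3))

    s = np.linalg.svd(A, compute_uv=False)             # singular values
    eig = np.sort(np.linalg.eigvalsh(A.T @ A))[::-1]   # eigenvalues of A^T A
    print(np.allclose(s, np.sqrt(eig)))                # True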
[Figure: rank-k SVD approximations of a 390×390 image for k = 10, 20, 50; a small number of singular values already reconstructs the image well]
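A minimal NumPy sketch (assumed code) of such a rank-k approximation by truncating the SVD; note a real image compresses far better than the random matrix used here:

    import numpy as np

    def rank_k_approx(A, k):
        U, s, Vt = np.linalg.svd(A, full_matrices=False)
        return U[:, :k] * s[:k] @ Vt[:k, :]     # keep the top-k singular triples

    A = np.random.default_rng(0).normal(size=(390, 390))
    for k in (10, 20, 50):
        err = np.linalg.norm(A - rank_k_approx(A, k)) / np.linalg.norm(A)
        print(k, round(err, 3))                 # relative error falls as k grows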
Positive (semi-)definite matrices
- A symmetric matrix A is positive semi-definite (PSD) if x^T A x ≥ 0 for all x
- A is positive definite (PD) if x^T A x > 0 for all nonzero x; positive definiteness is a strictly stronger property than positive semi-definiteness
- Notation: A ⪰ 0 if A is PSD, A ≻ 0 if A is PD
- A symmetric matrix is PSD if and only if all of its eigenvalues are nonnegative
- The eigendecomposition of a symmetric PSD matrix is equivalent to its singular value decomposition
- Every symmetric PSD matrix A has a square root: a symmetric PSD matrix B such that BB = A
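A minimal NumPy sketch (assumed code) checking the eigenvalue test and constructing the square root from the eigendecomposition:

    import numpy as np

    M = np.random.default_rng(0).normal(size=(4, 4))
    A = M.T @ M                                # M^T M is always symmetric PSD

    w, P = np.linalg.eigh(A)
    print(np.all(w >= -1e-12))                 # True: eigenvalues nonnegative

    B = P @ np.diag(np.sqrt(np.clip(w, 0, None))) @ P.T   # symmetric PSD root
    print(np.allclose(B @ B, A))               # True: BB = A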
Optimization
- A general optimization problem is: minimize f(x) subject to constraints: g_i(x) ≤ 0, i = 1, …, m and h_j(x) = 0, j = 1, …, p
- Training a machine learning model usually means solving such a problem with iterative optimization algorithms
- For general (nonconvex) objectives, algorithms can get stuck in local minima
[Figure: a nonconvex function with its global minima and local minima marked]
Convex sets and convex functions
- A set C is convex if for any x, y in C and any θ in [0, 1], θx + (1 − θ)y is in C
- Examples of convex sets:
  - Norm ball { x : ‖x − x_c‖ ≤ r }, for given radius r
  - Affine set { x : Ax = b }, given A, b
  - Polyhedron { x : Ax ≤ b }, where the inequality ≤ is interpreted component-wise
- Convexity is preserved by common operations: if C1 and C2 are convex, then C1 ∩ C2 is convex; if f is affine and C is convex, then the image f(C) is convex
- A function f is convex if for any x, y in dom f and θ in [0, 1]:
  f( θx + (1 − θ)y ) ≤ θ f(x) + (1 − θ) f(y)
- If the inequality is strict whenever x ≠ y and θ in (0, 1), then f is strictly convex; strict convexity implies convexity
[Figure: strict convexity vs. convexity]
- Examples of convex functions:
  - The quadratic function f(x) = (1/2) x^T Q x + b^T x + c is convex if Q is positive semidefinite (PSD), since f(y) − f(x) − ∇f(x)^T (y − x) = (1/2)(y − x)^T Q (y − x) ≥ 0
  - f(x) = ‖x‖ is convex for any norm
- First-order condition: a differentiable f is convex if and only if for all x, y:
  f(y) ≥ f(x) + ∇f(x)^T (y − x)
- For a convex function, any local minimizer is a global minimizer
- For a differentiable convex f, x* is a global minimizer of f(x) if and only if ∇f(x*) = 0
Gradient descent
- Update rule: x^(k+1) = x^(k) − t_k ∇f(x^(k))
- The step size t_k matters: too large, and the iterates can diverge; too small, and convergence is very slow
- Backtracking line search (parameters 0 < α ≤ 1/2, 0 < β < 1): at each iteration, start with t = 1 and multiply t by β until
  f( x − t ∇f(x) ) ≤ f(x) − α t ‖∇f(x)‖²
- For convex f with Lipschitz-continuous gradient, gradient descent with a suitable fixed step size satisfies f(x^(k)) − f* = O(1/k); to reach accuracy f(x^(k)) − f* ≤ 𝜗, we need O(1/𝜗) iterations
- Gradient descent with backtracking line search has the same order convergence rate
- If f is also strongly convex, gradient descent with backtracking line search satisfies f(x^(k)) − f* ≤ c γ^k with 0 < γ < 1; to reach accuracy 𝜗, we need only O(log(1/𝜗)) iterations
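A minimal Python sketch (assumed code) of gradient descent with backtracking line search on a convex quadratic:

    import numpy as np

    def gd_backtracking(f, grad, x, alpha=0.3, beta=0.8, iters=100):
        for _ in range(iters):
            g = grad(x)
            t = 1.0
            while f(x - t * g) > f(x) - alpha * t * (g @ g):   # backtrack
                t *= beta
            x = x - t * g
        return x

    Q = np.array([[3.0, 0.5], [0.5, 1.0]])          # PD, so f is convex
    f = lambda x: 0.5 * x @ Q @ x
    grad = lambda x: Q @ x
    print(gd_backtracking(f, grad, np.array([5.0, -3.0])))   # ~[0, 0]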
Newton's method
- Update rule: x^(k+1) = x^(k) − [∇²f(x^(k))]^(-1) ∇f(x^(k))
- Under standard assumptions, Newton's method is guaranteed quadratic convergence after some number of steps k
- But forming and inverting the Hessian is usually very expensive in high dimensions
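A minimal NumPy sketch (assumed example): on a convex quadratic the Hessian is constant, so a single Newton step lands exactly on the minimizer:

    import numpy as np

    Q = np.array([[3.0, 0.5], [0.5, 1.0]])
    b = np.array([1.0, -2.0])
    grad = lambda x: Q @ x + b          # gradient of 0.5 x^T Q x + b^T x
    hess = lambda x: Q                  # constant Hessian

    x = np.array([5.0, -3.0])
    x = x - np.linalg.solve(hess(x), grad(x))     # one Newton step
    print(np.allclose(grad(x), 0))                # True: at the minimizer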
Lagrange duality
- For the problem minimize f(x) subject to g_i(x) ≤ 0 (multipliers u ≥ 0) and h_j(x) = 0 (multipliers v), the Lagrange dual function
  g(u, v) = min_x [ f(x) + Σ_i u_i g_i(x) + Σ_j v_j h_j(x) ]
  gives a lower bound on f* for any u ≥ 0 and v
- Weak duality, g* ≤ f*, therefore always holds (even when the primal problem is not convex)
- In some cases g* = f*, which is called strong duality
- Slater's condition: if the primal problem is convex and there exists at least one strictly feasible x, i.e. g_i(x) < 0 for all i and h_j(x) = 0 for all j, then strong duality holds
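A tiny worked example (assumed, not from the slides): minimize x² subject to x ≥ 1, i.e. g(x) = 1 − x ≤ 0, with optimum f* = 1 at x = 1. The Lagrangian is L(x, u) = x² + u(1 − x); minimizing over x gives the dual g(u) = u − u²/4, a lower bound on f* for every u ≥ 0:

    import numpy as np

    u = np.linspace(0, 4, 401)
    g = u - u ** 2 / 4
    print(g.max())      # 1.0 at u = 2: strong duality (x = 2 is strictly feasible)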
References
- 590 - Introduction to Machine Learning, lecture slides: …ture_slides/02_math_essentials.pdf
- …book.pdf
- …fall09/lectures/optimization/slides.pdf