SLIDE 1

CS 559: Machine Learning Fundamentals and Applications 2nd Set of Notes

Instructor: Philippos Mordohai
Webpage: www.cs.stevens.edu/~mordohai
E-mail: Philippos.Mordohai@stevens.edu
Office: Lieb 215

SLIDE 2

Overview

  • Introduction to Graphical Models
  • Belief Networks
  • Linear Algebra Review
    – See links on class webpage
    – Email me if you need additional resources

SLIDE 3

Example: Disease Testing

  • Suppose you have tested positive for a disease; what is the probability that you actually have the disease?
  • It depends on the accuracy and sensitivity of the test, and on the background (prior) probability of the disease

SLIDE 4

Example: Disease Testing (cont.)

  • Let P(Test = + | Disease = true) = 0.95; then the false negative rate is P(Test = − | Disease = true) = 5%
  • Let P(Test = + | Disease = false) = 0.05 (the false positive rate is also 5%)
  • Suppose the disease is rare: P(Disease = true) = 0.01

$$P(D{=}\mathrm{true} \mid T{=}{+}) = \frac{P(T{=}{+} \mid D{=}\mathrm{true})\, P(D{=}\mathrm{true})}{P(T{=}{+} \mid D{=}\mathrm{true})\, P(D{=}\mathrm{true}) + P(T{=}{+} \mid D{=}\mathrm{false})\, P(D{=}\mathrm{false})} = \frac{0.95 \cdot 0.01}{0.95 \cdot 0.01 + 0.05 \cdot 0.99} \approx 0.161$$
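The same computation as a quick numerical check (a minimal Python sketch, not part of the original slides):

```python
# Bayes' rule for the disease-testing example above.
p_pos_given_d = 0.95   # P(Test=+ | Disease=true)
p_pos_given_nd = 0.05  # P(Test=+ | Disease=false), the false positive rate
p_d = 0.01             # P(Disease=true), the prior

posterior = (p_pos_given_d * p_d) / (
    p_pos_given_d * p_d + p_pos_given_nd * (1 - p_d))
print(posterior)  # ~0.161
```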

SLIDE 5

Example: Disease Testing (cont.)

  • The probability of having the disease given that you tested positive is just 16%
    – Seems too low, but ...
  • Of 100 people, we expect only 1 to have the disease, and that person will probably test positive
  • But we also expect about 5% of the others (about 5 people in total) to test positive by accident
  • So of the 6 people who test positive, we only expect 1 of them to actually have the disease; and indeed 1/6 is approximately 0.16

SLIDE 6

Monty Hall Problem

  • You're given the choice of three doors: behind one door is a car; behind the others, goats
  • You pick a door, say No. 1
  • The host, who knows what's behind the doors, opens another door, say No. 3, which has a goat
  • Do you want to pick door No. 2 instead?

Slides by Jingrui He (CMU), 2007

SLIDE 7

[Figure: the three scenarios. Depending on where the car is, the host must reveal Goat B, must reveal Goat A, or reveals Goat A or Goat B.]

SLIDE 8

Monty Hall Problem: Bayes Rule

  • C_i: the car is behind door i, i = 1, 2, 3
  • H_ij: the host opens door j after you pick door i

$$P(C_i) = \frac{1}{3}$$

$$P(H_{ij} \mid C_k) = \begin{cases} 0, & j = i \text{ or } j = k \\ 1/2, & i = k,\ j \neq i \\ 1, & i \neq k,\ j \neq i,\ j \neq k \end{cases}$$

SLIDE 9

Monty Hall Problem: Bayes Rule cont.

  • WLOG, let i = 1, j = 3

$$P(C_1 \mid H_{13}) = \frac{P(H_{13} \mid C_1)\, P(C_1)}{P(H_{13})}$$

$$P(H_{13} \mid C_1)\, P(C_1) = \frac{1}{2} \cdot \frac{1}{3} = \frac{1}{6}$$

SLIDE 10

Monty Hall Problem: Bayes Rule cont.

$$P(H_{13}) = P(H_{13}, C_1) + P(H_{13}, C_2) + P(H_{13}, C_3) = P(H_{13} \mid C_1) P(C_1) + P(H_{13} \mid C_2) P(C_2) = \frac{1}{6} + \frac{1}{3} = \frac{1}{2}$$

$$P(C_1 \mid H_{13}) = \frac{1/6}{1/2} = \frac{1}{3}$$

SLIDE 11

Monty Hall Problem: Bayes Rule cont.

$$P(C_1 \mid H_{13}) = \frac{1/6}{1/2} = \frac{1}{3}, \qquad P(C_2 \mid H_{13}) = 1 - P(C_1 \mid H_{13}) = \frac{2}{3}$$

You should switch!
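The 1/3 vs. 2/3 split is easy to confirm empirically; below is a small simulation sketch (not from the original slides):

```python
import random

def monty_hall(switch, trials=100_000):
    """Estimate the win probability when sticking or switching."""
    wins = 0
    for _ in range(trials):
        car = random.randrange(3)    # door hiding the car
        pick = random.randrange(3)   # contestant's initial pick
        # Host opens a door that is neither the pick nor the car
        host = random.choice([d for d in range(3) if d != pick and d != car])
        if switch:
            pick = next(d for d in range(3) if d != pick and d != host)
        wins += (pick == car)
    return wins / trials

print(monty_hall(switch=False))  # ~1/3
print(monty_hall(switch=True))   # ~2/3
```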

SLIDE 12

Introduction to Graphical Models

Barber Ch. 2

SLIDE 13

Graphical Models

  • GMs are graph-based representations of various factorization assumptions of distributions
    – These factorizations are typically equivalent to independence statements amongst (sets of) variables in the distribution
  • Directed graphs model conditional distributions (e.g. Belief Networks)
  • Undirected graphs represent relationships between variables (e.g. neighboring pixels in an image)

SLIDE 14

Definition

  • A graph G consists of nodes (also called vertices) and edges (also called links) between the nodes
  • Edges may be directed (they have an arrow in a single direction) or undirected
    – Edges can also have associated weights
  • A graph with all edges directed is called a directed graph, and one with all edges undirected is called an undirected graph

SLIDE 15

More Definitions

  • A path A → B from node A to node B is a sequence of nodes that connects A to B
  • A cycle is a directed path that starts and returns to the same node
  • Directed Acyclic Graph (DAG): a graph G with directed edges (arrows on each link) between the nodes, such that, following paths from one node to another along the direction of each edge, no path will revisit a node

SLIDE 16

More Definitions

  • The parents of x4 are pa(x4) = {x1, x2, x3}
  • The children of x4 are ch(x4) = {x5, x6}
  • Graphs can be encoded using the edge list L = {(1,8), (1,4), (2,4), …} or the adjacency matrix, as in the sketch below
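For illustration, a minimal sketch of both encodings (the 1-based edge list follows the slide; the node count of 8 is an assumption):

```python
# Edge list and adjacency-matrix encodings of a small directed graph.
n = 8                             # number of nodes (assumed)
edges = [(1, 8), (1, 4), (2, 4)]  # edge list, 1-indexed as on the slide

# adjacency[i][j] == 1 iff there is an edge from node i+1 to node j+1
adjacency = [[0] * n for _ in range(n)]
for i, j in edges:
    adjacency[i - 1][j - 1] = 1
```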

SLIDE 17

Belief Networks

Barber Ch. 3

SLIDE 18

Belief Networks (Bayesian Networks)

  • A belief network is a directed acyclic graph in which each node has associated the conditional probability of the node given its parents
  • The joint distribution is obtained by taking the product of the conditional probabilities:

$$p(x_1, \dots, x_n) = \prod_{i=1}^{n} p(x_i \mid \mathrm{pa}(x_i))$$

SLIDE 19

Alarm Example

  • Sally's burglar Alarm is sounding. Has she been Burgled, or was the alarm triggered by an Earthquake? She turns the car Radio on for news of earthquakes.
  • Choosing an ordering:
    – Without loss of generality, we can write
      p(A,R,E,B) = p(A|R,E,B) p(R,E,B)
                 = p(A|R,E,B) p(R|E,B) p(E,B)
                 = p(A|R,E,B) p(R|E,B) p(E|B) p(B)

SLIDE 20

Alarm Example

  • Assumptions:
    – The alarm is not directly influenced by any report on the radio: p(A|R,E,B) = p(A|E,B)
    – The radio broadcast is not directly influenced by the burglar variable: p(R|E,B) = p(R|E)
    – Burglaries don't directly `cause' earthquakes: p(E|B) = p(E)
  • Therefore
    p(A,R,E,B) = p(A|E,B) p(R|E) p(E) p(B)

SLIDE 21

Alarm Example

The remaining data are p(B = 1) = 0.01 and p(E = 1) = 0.000001

SLIDE 22

Alarm Example: Inference

  • Initial evidence: the alarm is sounding

SLIDE 23

Alarm Example: Inference

  • Additional evidence: the radio broadcasts an earthquake warning
    – A similar calculation gives p(B = 1 | A = 1, R = 1) ≈ 0.01
    – Initially, because the alarm sounds, Sally thinks that she's been burgled. However, this probability drops dramatically when she hears that there has been an earthquake.
    – The earthquake `explains away' to an extent the fact that the alarm is ringing
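The probability tables p(A|B,E) and p(R|E) did not survive extraction, so the sketch below uses illustrative CPT values (only p(B=1) and p(E=1) come from the slides) to reproduce the explaining-away pattern by brute-force enumeration:

```python
import itertools

p_b = 0.01      # p(B=1), from the slides
p_e = 0.000001  # p(E=1), from the slides
p_r = {0: 0.0, 1: 1.0}                # p(R=1|E): assumed values
p_a = {(0, 0): 0.001, (0, 1): 0.99,   # p(A=1|B,E), keyed by (B,E): assumed
       (1, 0): 0.99,  (1, 1): 0.99}

def joint(a, r, e, b):
    """p(A,R,E,B) = p(A|E,B) p(R|E) p(E) p(B) for this belief network."""
    pa = p_a[(b, e)] if a else 1 - p_a[(b, e)]
    pr = p_r[e] if r else 1 - p_r[e]
    return pa * pr * (p_e if e else 1 - p_e) * (p_b if b else 1 - p_b)

def posterior_b(evidence):
    """p(B=1 | evidence) by summing the joint over the hidden variables."""
    num = den = 0.0
    for a, r, e, b in itertools.product([0, 1], repeat=4):
        world = {'A': a, 'R': r, 'E': e, 'B': b}
        if all(world[k] == v for k, v in evidence.items()):
            den += joint(a, r, e, b)
            num += joint(a, r, e, b) * b
    return num / den

print(posterior_b({'A': 1}))          # alarm only: burglary very likely
print(posterior_b({'A': 1, 'R': 1}))  # alarm + radio: ~0.01, explained away
```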

SLIDE 24

Wet Grass Example

  • One morning Tracey leaves her house and realizes that her grass is wet. Is it due to overnight rain, or did she forget to turn off the sprinkler last night? Next she notices that the grass of her neighbor, Jack, is also wet. This explains away to some extent the possibility that her sprinkler was left on, and she concludes that it has probably been raining.
  • Define:
    – R ∈ {0, 1}: R = 1 means that it has been raining, and 0 otherwise
    – S ∈ {0, 1}: S = 1 means that Tracey has forgotten to turn off the sprinkler, and 0 otherwise
    – J ∈ {0, 1}: J = 1 means that Jack's grass is wet, and 0 otherwise
    – T ∈ {0, 1}: T = 1 means that Tracey's grass is wet, and 0 otherwise

SLIDE 25

Wet Grass Example

  • The number of values that need to be specified in general scales exponentially with the number of variables in the model
    – This is impractical in general and motivates simplifications
  • Conditional independence assumptions:
    p(T|J,R,S) = p(T|R,S)
    p(J|R,S) = p(J|R)
    p(R|S) = p(R)

SLIDE 26

Wet Grass Example

  • Original equation:
    p(T,J,R,S) = p(T|J,R,S) p(J,R,S)
               = p(T|J,R,S) p(J|R,S) p(R,S)
               = p(T|J,R,S) p(J|R,S) p(R|S) p(S)
  • Becomes:
    p(T,J,R,S) = p(T|R,S) p(J|R) p(R) p(S)

SLIDE 27

Wet Grass Example

  • p(R = 1) = 0.2 and p(S = 1) = 0.1
  • p(J = 1|R = 1) = 1, p(J = 1|R = 0) = 0.2 (sometimes Jack's grass is wet due to unknown effects other than rain)
  • p(T = 1|R = 1, S = 0) = 1, p(T = 1|R = 1, S = 1) = 1, p(T = 1|R = 0, S = 1) = 0.9 (there's a small chance that even though the sprinkler was left on, it didn't wet the grass noticeably)
  • p(T = 1|R = 0, S = 0) = 0
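With every number above specified, the posterior on the sprinkler can be computed by the same brute-force enumeration as in the alarm sketch (a minimal sketch, not from the original slides):

```python
import itertools

p_r, p_s = 0.2, 0.1               # p(R=1), p(S=1)
p_j = {1: 1.0, 0: 0.2}            # p(J=1|R)
p_t = {(1, 0): 1.0, (1, 1): 1.0,  # p(T=1|R,S), keyed by (R,S)
       (0, 1): 0.9, (0, 0): 0.0}

def joint(t, j, r, s):
    """p(T,J,R,S) = p(T|R,S) p(J|R) p(R) p(S)."""
    pt = p_t[(r, s)] if t else 1 - p_t[(r, s)]
    pj = p_j[r] if j else 1 - p_j[r]
    return pt * pj * (p_r if r else 1 - p_r) * (p_s if s else 1 - p_s)

def posterior_s(evidence):
    """p(S=1 | evidence) by summing out the remaining variables."""
    num = den = 0.0
    for t, j, r, s in itertools.product([0, 1], repeat=4):
        world = {'T': t, 'J': j, 'R': r, 'S': s}
        if all(world[k] == v for k, v in evidence.items()):
            den += joint(t, j, r, s)
            num += joint(t, j, r, s) * s
    return num / den

print(posterior_s({'T': 1}))          # ~0.34: sprinkler fairly likely
print(posterior_s({'T': 1, 'J': 1}))  # ~0.16: Jack's wet grass explains it away
```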

SLIDE 28

Wet Grass Example

  • Note that $\sum_{J} p(J \mid R)\, p(R) = p(R)$

SLIDE 29

Wet Grass Example

SLIDE 30

Independence in Belief Networks

  • In (a), (b) and (c), A and B are conditionally independent given C
  • In (d), A and B are conditionally dependent given C

SLIDE 31

Independence in Belief Networks

  • In (a), (b) and (c), A and B are marginally dependent
  • In (d), A and B are marginally independent

SLIDE 32

Intro to Linear Algebra

Slides by Olga Sorkine (ETH Zurich)

SLIDE 33

Vector space

  • Informal definition:
    – V ≠ ∅ (a non-empty set of vectors)
    – v, w ∈ V ⇒ v + w ∈ V (closed under addition)
    – v ∈ V, α a scalar ⇒ αv ∈ V (closed under multiplication by a scalar)
  • The formal definition includes axioms about associativity and distributivity of the + and · operators
  • 0 ∈ V always!

SLIDE 34

Subspace - example

  • Let l be a 2D line through the origin
  • L = {p − O | p ∈ l} is a linear subspace of R²
SLIDE 35

Subspace - example

  • Let π be a plane through the origin in 3D
  • V = {p − O | p ∈ π} is a linear subspace of R³

SLIDE 36

Linear independence

  • The vectors {v1, v2, …, vk} are a linearly independent set if: α1 v1 + α2 v2 + … + αk vk = 0 ⇔ αi = 0 for all i
  • It means that none of the vectors can be obtained as a linear combination of the others

SLIDE 37

Linear independence - example

  • Parallel vectors are always linearly dependent: v = 2.4w ⇒ v + (−2.4)w = 0
  • Orthogonal vectors are always linearly independent

SLIDE 38

Basis of V

  • {v1, v2, …, vn} are linearly independent
  • {v1, v2, …, vn} span the whole vector space V: V = {α1 v1 + α2 v2 + … + αn vn | αi scalars}
  • Any vector in V is a unique linear combination of the basis
  • The number of basis vectors is called the dimension of V

SLIDE 39

Basis - example

  • The standard basis of R³: three orthogonal unit vectors x, y, z (sometimes called i, j, k or e1, e2, e3)

SLIDE 40

Basis – another example

  • Grayscale N×M images:
    – Each pixel has a value between 0 (black) and 1 (white)
    – The image can be interpreted as a vector in R^(NM)

SLIDE 41

The “standard” basis (4×4)

SLIDE 42

Linear combinations of the basis

[Figure: three basis images combined with weights 1, 2/3, and 1/3 to form a new image.]

SLIDE 43

Matrix representation

  • Let {v1, v2, …, vn} be a basis of V
  • Every v ∈ V has a unique representation v = α1 v1 + α2 v2 + … + αn vn
  • Denote v by the column vector of its coefficients:

$$v = \begin{pmatrix} \alpha_1 \\ \vdots \\ \alpha_n \end{pmatrix}$$

  • The basis vectors are therefore denoted:

$$v_1 = \begin{pmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{pmatrix},\; v_2 = \begin{pmatrix} 0 \\ 1 \\ \vdots \\ 0 \end{pmatrix},\; \dots,\; v_n = \begin{pmatrix} 0 \\ 0 \\ \vdots \\ 1 \end{pmatrix}$$

SLIDE 44

Linear operators

  • A : V → W is called a linear operator if:
    – A(v + w) = A(v) + A(w)
    – A(αv) = α A(v)
  • In particular, A(0) = 0
  • Linear operators we know:
    – Scaling
    – Rotation, reflection
    – Translation is not linear: it moves the origin
SLIDE 45

Linear operators - illustration

  • Rotation is a linear operator:

[Figure: vectors v, w, and v + w, together with the rotated R(v + w).]

SLIDE 46

Linear operators - illustration

  • Rotation is a linear operator:

[Figure: rotating v and w individually gives R(v) and R(w); their sum coincides with the rotated sum.]

R(v + w) = R(v) + R(w)

SLIDE 47

Matrix operations

  • Addition, subtraction, scalar multiplication: simple…
  • Multiplication of a matrix by a column vector:

$$A\,b = \begin{pmatrix} a_{11} & \cdots & a_{1n} \\ \vdots & & \vdots \\ a_{m1} & \cdots & a_{mn} \end{pmatrix} \begin{pmatrix} b_1 \\ \vdots \\ b_n \end{pmatrix} = \begin{pmatrix} \mathrm{row}_1 \cdot b \\ \vdots \\ \mathrm{row}_m \cdot b \end{pmatrix} = \begin{pmatrix} \sum_i a_{1i} b_i \\ \vdots \\ \sum_i a_{mi} b_i \end{pmatrix}$$

SLIDE 48

Matrix by vector multiplication

  • Sometimes a better way to look at it:
    – Ab is a linear combination of A’s columns!

$$A\,b = \begin{pmatrix} | & | & & | \\ a_1 & a_2 & \cdots & a_n \\ | & | & & | \end{pmatrix} \begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_n \end{pmatrix} = b_1 a_1 + b_2 a_2 + \cdots + b_n a_n$$
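A quick numerical check of the two views (a NumPy sketch; the matrix entries are made up for illustration):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
b = np.array([10.0, 20.0])

# Row view: each output entry is a dot product of a row of A with b
print(A @ b)

# Column view: Ab is a linear combination of A's columns weighted by b
print(b[0] * A[:, 0] + b[1] * A[:, 1])  # same result
```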

SLIDE 49

Matrix operations

  • Transposition: make the rows into the columns
  • (AB)ᵀ = BᵀAᵀ

$$\begin{pmatrix} a_{11} & \cdots & a_{1n} \\ \vdots & & \vdots \\ a_{m1} & \cdots & a_{mn} \end{pmatrix}^{T} = \begin{pmatrix} a_{11} & \cdots & a_{m1} \\ \vdots & & \vdots \\ a_{1n} & \cdots & a_{mn} \end{pmatrix}$$

SLIDE 50

Matrix properties

  • Matrix A (n×n) is non-singular if there exists B such that AB = BA = I
  • B = A⁻¹ is called the inverse of A
  • A is non-singular ⇔ det A ≠ 0
  • If A is non-singular then the equation Ax = b has one unique solution for each b
  • A is non-singular ⇔ the rows of A are linearly independent (and so are the columns)

SLIDE 51

Orthogonal matrices

  • Matrix A (n×n) is orthogonal if A⁻¹ = Aᵀ
  • It follows that AAᵀ = AᵀA = I
  • The rows of A are orthonormal vectors!

Proof: $I = A^T A$ means $(A^T A)_{ij} = v_i^T v_j = \delta_{ij}$, so $\langle v_i, v_i \rangle = 1 \Rightarrow \lVert v_i \rVert = 1$, and $\langle v_i, v_j \rangle = 0$ for $i \neq j$
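A rotation matrix makes a handy numerical check of these properties (a sketch; the angle is arbitrary):

```python
import numpy as np

theta = 0.7  # arbitrary rotation angle
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

print(np.allclose(R.T @ R, np.eye(2)))     # True: rows/columns orthonormal
print(np.allclose(np.linalg.inv(R), R.T))  # True: the inverse is the transpose
```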

SLIDE 52

The Trace

  • The trace of a square matrix, denoted by tr(A), is the sum of the diagonal elements: $\mathrm{tr}(A) = \sum_i a_{ii}$

SLIDE 53

The Determinant

  • For a square matrix A, the determinant is denoted by |A| or det(A)

SLIDE 54

The Determinant

  • |A| = |Aᵀ|
  • |AB| = |A| |B|
  • |A| = 0 if and only if A is singular
    – Otherwise, |A⁻¹| = 1/|A|

SLIDE 55

The Covariance Matrix (Interlude)

SLIDE 56

Covariance

  • Covariance is a numerical measure that shows how much two random variables change together
  • Positive covariance: if one increases, the other is likely to increase
  • Negative covariance: if one increases, the other is likely to decrease
  • More precisely: the covariance is a measure of the linear dependence between the two variables

SLIDE 57

Covariance Example

Relationships between the returns of different stocks

[Figure: two scatter plots of stock returns. Scatter plot I: Stock A return vs. Stock B return; Scatter plot II: Stock C return vs. Stock D return.]

SLIDE 58

Correlation Coefficient

  • One may be tempted to conclude that if the covariance is larger, the relationship between two variables is stronger (in the sense that they have a stronger linear relationship)
  • The correlation coefficient is defined as:

$$\mathrm{Corr}(Y_{ij}, Y_{ik}) = \frac{\mathrm{Cov}(Y_{ij}, Y_{ik})}{\sqrt{\mathrm{Var}(Y_{ij})\,\mathrm{Var}(Y_{ik})}}$$
SLIDE 59

Correlation Coefficient

  • The correlation coefficient, unlike covariance, is a measure of dependence that is free of the scales of measurement of Yij and Yik
  • By definition, correlation must take values between −1 and 1
  • A correlation of 1 or −1 is obtained when there is a perfect linear relationship between the two variables

SLIDE 60

Covariance Matrix

  • For the vector of repeated measures, Yi = (Yi1, Yi2, ..., Yin), we define the covariance matrix Cov(Yi), whose (j, k) entry is Cov(Yij, Yik)
  • It is a symmetric, square matrix
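A small sketch with synthetic data (the data are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
y1 = rng.normal(size=1000)
y2 = y1 + 0.5 * rng.normal(size=1000)  # noisy copy: positively correlated
y3 = rng.normal(size=1000)             # independent of the others

Y = np.stack([y1, y2, y3])
print(np.cov(Y))       # symmetric 3x3 covariance matrix
print(np.corrcoef(Y))  # scale-free correlations, entries in [-1, 1]
```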

SLIDE 61

Variance and Confidence Intervals

  • Single Gaussian (normal) random variable

$$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}} = N(\mu, \sigma^2)$$

SLIDE 62

Multivariate Normal Density

  – The multivariate normal density in d dimensions is:

$$P(\mathbf{x}) = \frac{1}{(2\pi)^{d/2}\, |\Sigma|^{1/2}} \exp\!\left( -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^t \, \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right)$$

  where:
    x = (x1, x2, …, xd)ᵗ
    μ = (μ1, μ2, …, μd)ᵗ is the mean vector
    Σ is the d×d covariance matrix
    |Σ| and Σ⁻¹ are the determinant and inverse, respectively

P(x) is larger for smaller exponents!
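The density formula translates directly into code (a sketch; the mean and covariance values are illustrative):

```python
import numpy as np

def mvn_pdf(x, mu, sigma):
    """Multivariate normal density, straight from the formula above."""
    d = len(mu)
    diff = x - mu
    norm = 1.0 / ((2 * np.pi) ** (d / 2) * np.linalg.det(sigma) ** 0.5)
    return norm * np.exp(-0.5 * diff @ np.linalg.inv(sigma) @ diff)

mu = np.array([0.0, 0.0])
sigma = np.array([[2.0, 1.2],
                  [1.2, 1.0]])
print(mvn_pdf(np.array([0.5, -0.5]), mu, sigma))
```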

SLIDE 63

Confidence Intervals: Multi-Variate Case

  • Same concept: how large is the area that contains X% of samples drawn from the distribution
  • Confidence intervals are ellipsoids for the normal distribution

SLIDE 64

Confidence Intervals: Multi-Variate Case

  • Increasing X% increases the size of the ellipsoids, but not their orientation and aspect ratio

SLIDE 65

The Multi-Variate Normal Density

  • Σ is positive semi-definite (xᵗΣx ≥ 0)
    – If xᵗΣx = 0 for non-zero x then det(Σ) = 0. This case is not interesting: p(x) is not defined
  • The feature vector is a constant (has zero variance)
  • Two or more features are linearly dependent
  • So we will assume Σ is positive definite (xᵗΣx > 0)
  • If Σ is positive definite then so is Σ⁻¹

O. Veksler

SLIDE 66

Confidence Intervals: Multi-Variate Case

  • The covariance matrix determines the shape

SLIDE 67

Confidence Intervals: Multi-Variate Case

  • Case I: Σ = σ²I
  • All variables are uncorrelated and have equal variance
  • Confidence intervals are circles

SLIDE 68

Confidence Intervals: Multi-Variate Case

  • Case II: Σ diagonal, with unequal elements
  • All variables are uncorrelated but have different variances
  • Confidence intervals are axis-aligned ellipsoids

SLIDE 69

Confidence Intervals: Multi-Variate Case

  • Case III: Σ arbitrary
  • Variables may be correlated and have different variances
  • Confidence intervals are arbitrary ellipsoids
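The three cases can be checked by sampling (a sketch; the covariance values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
mu = np.zeros(2)
cases = {
    "I: sigma^2 I": np.array([[1.0, 0.0], [0.0, 1.0]]),
    "II: diagonal": np.array([[3.0, 0.0], [0.0, 0.5]]),
    "III: arbitrary": np.array([[2.0, 1.2], [1.2, 1.0]]),
}
for name, sigma in cases.items():
    x = rng.multivariate_normal(mu, sigma, size=5000)
    print(name, "\n", np.cov(x.T))  # sample covariance close to sigma
```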

SLIDE 70

Eigen-interlude

Based on D. Barber’s slides

SLIDE 71

Eigenvalues and Eigenvectors

  • For an n×n square matrix A, e is an eigenvector with eigenvalue λ if Ae = λe
  • Equivalently, (A − λI)e = 0
  • If (A − λI) is invertible, the only solution is e = 0 (trivial)

SLIDE 72

Eigenvalues and Eigenvectors

(A − λI)e = 0

  • For non-trivial solutions: det(A − λI) = 0
  • This equation is called the “characteristic polynomial”
  • Solutions are not unique
    – If e is an eigenvector, αe is also an eigenvector

SLIDE 73

Simple Example

  • For a 2×2 matrix:

$$\det(A - \lambda I) = \begin{vmatrix} a_{11} - \lambda & a_{12} \\ a_{21} & a_{22} - \lambda \end{vmatrix} = (a_{11} - \lambda)(a_{22} - \lambda) - a_{12} a_{21} = 0$$

$$0 = a_{11} a_{22} - a_{12} a_{21} - \lambda (a_{11} + a_{22}) + \lambda^2$$

  • Example: $A = \begin{pmatrix} 1 & 2 \\ 2 & 4 \end{pmatrix}$

SLIDE 74

$$0 = a_{11} a_{22} - a_{12} a_{21} - \lambda (a_{11} + a_{22}) + \lambda^2 = 1 \cdot 4 - 2 \cdot 2 - \lambda (1 + 4) + \lambda^2 = \lambda^2 - 5\lambda$$

The solutions are λ = 0 and λ = 5. The eigenvector for the first eigenvalue, λ = 0, satisfies:

$$A\mathbf{x} = \lambda \mathbf{x},\;\; (A - \lambda I)\mathbf{x} = 0: \quad \begin{pmatrix} 1 & 2 \\ 2 & 4 \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$$

One solution for both equations is x = 2, y = −1

SLIDE 75

For the other eigenvalue, λ = 5:

$$\begin{pmatrix} 1 - 5 & 2 \\ 2 & 4 - 5 \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} -4 & 2 \\ 2 & -1 \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} -4x + 2y \\ 2x - y \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$$

  • −4x + 2y = 0 and 2x − y = 0, so x = 1, y = 2
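NumPy confirms the hand computation (a quick check):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [2.0, 4.0]])
vals, vecs = np.linalg.eigh(A)  # eigh: suitable since A is symmetric
print(vals)        # [0. 5.]
print(vecs[:, 0])  # proportional to (2, -1), up to sign
print(vecs[:, 1])  # proportional to (1, 2), up to sign
```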
SLIDE 76

Properties

  • The product of the eigenvalues = |A|
  • The sum of the eigenvalues = trace(A)
  • The eigenvectors are pairwise orthogonal (for a symmetric matrix)
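The first two identities are easy to verify numerically (a sketch; the matrix is made up):

```python
import numpy as np

A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
vals = np.linalg.eigvalsh(A)  # eigenvalues of the symmetric matrix A

print(np.isclose(vals.prod(), np.linalg.det(A)))  # product = |A|
print(np.isclose(vals.sum(), np.trace(A)))        # sum = trace(A)
```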

SLIDE 77

Spectral Decomposition

  • A symmetric matrix has real eigenvalues
  • A real symmetric matrix can be written as $A = E \Lambda E^T = \sum_i \lambda_i e_i e_i^T$, where Λ is the diagonal matrix of eigenvalues and the columns of E are the eigenvectors

SLIDE 78

Back to the Covariance Matrix

SLIDE 79

Geometric Interpretation

  • Start from N(0, I) and construct a multivariate distribution with the desired covariance matrix, as in the sketch below
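One way to realize this construction, using the eigendecomposition Σ = EΛEᵀ to provide the scaling and rotation (a sketch; Σ and μ are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
sigma = np.array([[2.0, 1.2],
                  [1.2, 1.0]])
mu = np.array([3.0, -1.0])

vals, E = np.linalg.eigh(sigma)  # sigma = E diag(vals) E^T
z = rng.normal(size=(5000, 2))   # start from N(0, I) samples
# Anisotropic scaling by sqrt(eigenvalues), rotation by E, then translation
x = mu + (z * np.sqrt(vals)) @ E.T

print(np.cov(x.T))     # close to sigma
print(x.mean(axis=0))  # close to mu
```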

[Figure: translation, rotation, anisotropic scaling.]

SLIDE 80

Eigenvectors of the Covariance Matrix

  • New basis aligned with the ellipsoids
  • Major axis ↔ eigenvector with the largest eigenvalue

SLIDE 81

2D Examples

O. Veksler

SLIDE 82

2D Examples

O. Veksler