

slide-1
SLIDE 1

Dimensionality Reduction Lecture 23

David Sontag New York University

Slides adapted from Carlos Guestrin and Luke Zettlemoyer

slide-2
SLIDE 2

Dimensionality reduction

  • Input data may have thousands or millions of dimensions!

    – e.g., text data has ???, images have ???

  • Dimensionality reduction: represent data with fewer dimensions

    – easier learning – fewer parameters
    – visualization – show high dimensional data in 2D
    – discover “intrinsic dimensionality” of data

      • high dimensional data that is truly lower dimensional
      • noise reduction
slide-3
SLIDE 3

!"#$%&"'%()$*+,-"'%

.&&+#/-"'%0(*1-1(21//)'3"#1-$456(4"$&('%(

1(4'7$)(*"#$%&"'%14(&/1,$

831#/4$&0

Slide from Yi Zhang

n = 2 k = 1 n = 3 k = 2

slide-4
SLIDE 4

Example (from Bishop)

  • Suppose we have a dataset of digits (“3”) perturbed in various ways:

  • What operations did I perform? What is the data’s intrinsic dimensionality?

  • Here the underlying manifold is nonlinear
slide-5
SLIDE 5

Lower dimensional projections

  • Obtain a new feature vector by transforming the original features x1 … xn

  • New features are linear combinations of the old ones
  • Reduces dimension when k < n
  • This is typically done in an unsupervised setting

    – just X, but no Y

z_1 = w_0^(1) + Σ_i w_i^(1) x_i,  …,  z_k = w_0^(k) + Σ_i w_i^(k) x_i

In general this will not be invertible – cannot go from z back to x

slide-6
SLIDE 6

Which projection is better?

From notes by Andrew Ng

slide-7
SLIDE 7

Reminder: Vector Projections

  • Basic definitions:

    – A·B = |A||B| cos θ

  • Assume |B| = 1 (unit vector)

    – A·B = |A| cos θ
    – So, the dot product is the length of the projection!

slide-8
SLIDE 8

Using a new basis for the data

  • Project a point into a (lower dimensional) space:

    – point: x = (x1, …, xn)
    – select a basis – a set of unit (length 1) basis vectors (u1, …, uk)

      • we consider an orthonormal basis:

        – uj•uj = 1, and uj•ul = 0 for j ≠ l

    – select a center – x̄, defines the offset of the space
    – best coordinates in the lower dimensional space are defined by dot-products: (z1, …, zk), where z_j^i = (x^i − x̄)•uj
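A minimal numeric sketch of this projection (my own illustration, assuming numpy; the point, center, and basis below are made-up values):

```python
# Minimal sketch (not from the slides): project a point onto an orthonormal basis.
import numpy as np

x = np.array([3.0, 2.0, 1.0])          # a point in R^3 (made-up numbers)
x_bar = np.array([1.0, 1.0, 1.0])      # chosen center (e.g., the data mean)
U = np.array([[1.0, 0.0, 0.0],         # rows are unit-length basis vectors u1, u2
              [0.0, 1.0, 0.0]])

z = U @ (x - x_bar)                    # z_j = (x - x_bar) . u_j
print(z)                               # coordinates of x in the 2-D subspace
```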

slide-9
SLIDE 9

Maximize variance of projection

Let x(i) be the ith data point minus the mean. Choose a unit-length u (||u|| = 1) to maximize the variance of the projection:

(1/m) Σ_{i=1}^m (x(i)ᵀ u)² = (1/m) Σ_{i=1}^m uᵀ x(i) x(i)ᵀ u = uᵀ [ (1/m) Σ_{i=1}^m x(i) x(i)ᵀ ] u = uᵀ Σ u

where Σ is the covariance matrix. Using the method of Lagrange multipliers, one can show that the solution is given by the principal eigenvector of the covariance matrix! (shown on board)
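A small numerical check of this claim (my own illustration, assuming numpy): the unit vector found by brute-force search over directions matches the principal eigenvector of the covariance matrix, up to sign.

```python
# Numerical check (illustration only): the unit vector maximizing projected
# variance coincides with the principal eigenvector of the covariance matrix.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])  # correlated 2-D data
X = X - X.mean(axis=0)                 # x(i) = data point minus the mean

Sigma = X.T @ X / len(X)               # covariance matrix (1/m) sum_i x(i) x(i)^T
eigvals, eigvecs = np.linalg.eigh(Sigma)
u_eig = eigvecs[:, np.argmax(eigvals)] # principal eigenvector

# Brute-force search over unit vectors u = (cos t, sin t).
thetas = np.linspace(0, np.pi, 10000)
dirs = np.stack([np.cos(thetas), np.sin(thetas)], axis=1)
var = ((X @ dirs.T) ** 2).mean(axis=0) # (1/m) sum_i (x(i)^T u)^2 for each direction
u_best = dirs[np.argmax(var)]

print(u_eig, u_best)                   # same direction, up to sign
```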

slide-10
SLIDE 10

Basic PCA algorithm

  • Start from the m by n data matrix X
  • Recenter: subtract the mean from each row of X

    – Xc ← X – X̄

  • Compute the covariance matrix:

    – Σ ← (1/m) Xcᵀ Xc

  • Find the eigenvectors and eigenvalues of Σ
  • Principal components: the k eigenvectors with the highest eigenvalues

[Pearson 1901, Hotelling, 1933]
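A minimal sketch of the algorithm above in numpy (the function name and data are my own; for large n one would use SVD instead, as the later slides note):

```python
# Minimal sketch of the PCA algorithm above (numpy; names are my own).
import numpy as np

def pca(X, k):
    """Return the top-k principal components and the projected data."""
    Xc = X - X.mean(axis=0)                    # recenter: subtract the mean
    Sigma = Xc.T @ Xc / X.shape[0]             # covariance matrix (n x n)
    eigvals, eigvecs = np.linalg.eigh(Sigma)   # eigendecomposition (ascending order)
    order = np.argsort(eigvals)[::-1][:k]      # indices of the k largest eigenvalues
    U = eigvecs[:, order]                      # principal components as columns
    Z = Xc @ U                                 # coordinates in the new basis
    return U, Z

X = np.random.default_rng(1).normal(size=(100, 5))
U, Z = pca(X, k=2)
print(U.shape, Z.shape)                        # (5, 2) (100, 2)
```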

slide-11
SLIDE 11

PCA example

(Figure panels: Data, Projection, Reconstruction.)

slide-12
SLIDE 12

Dimensionality reduction with PCA

In high-dimensional problems, data usually lies near a linear subspace, as noise introduces small variability. Only keep the data projections onto principal components with large eigenvalues. We can ignore the components of lesser significance. You might lose some information, but if the eigenvalues are small, you don’t lose much.

(Bar chart: variance (%) captured by each of PC1 through PC10.)

Slide from Aarti Singh

Percentage of total variance captured by dimension zj, for j = 1 to 10:

var(zj) = (1/m) Σ_{i=1}^m (z_j^i)² = (1/m) Σ_{i=1}^m (x^i · uj)² = λj

and the fraction of the total variance captured is λj / Σ_{l=1}^n λl.
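A small sketch (with made-up eigenvalues) of computing the fraction of total variance captured by each component, λj / Σl λl:

```python
# Sketch: fraction of total variance captured by each principal component.
import numpy as np

eigvals = np.array([4.0, 2.0, 1.0, 0.5, 0.5])    # made-up eigenvalues, sorted descending
fraction = eigvals / eigvals.sum()               # lambda_j / sum_l lambda_l
print(np.round(100 * fraction, 1))               # percent of variance per component
print(np.round(100 * np.cumsum(fraction), 1))    # cumulative percent
```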

slide-13
SLIDE 13

Eigenfaces [Turk, Pentland ’91]

  • Input images:

Principal components:

slide-14
SLIDE 14

Eigenfaces reconstruction

  • Each image corresponds to adding together (weighted versions of) the principal components:

slide-15
SLIDE 15

Scaling up

  • The covariance matrix can be really big!

    – Σ is n by n
    – 10,000 features can be common!
    – finding eigenvectors is very slow…

  • Use singular value decomposition (SVD)

    – finds k eigenvectors
    – great implementations available, e.g., Matlab svd

slide-16
SLIDE 16

SVD

  • Write X = Z S Uᵀ

    – X ← data matrix, one row per datapoint
    – S ← singular value matrix, a diagonal matrix with entries σi

      • Relationship between the singular values of X and the eigenvalues of Σ: λi = σi²/m

    – Z ← weight matrix, one row per datapoint

      • Z times S gives the coordinates of xi in the eigenspace

    – Uᵀ ← singular vector matrix

      • In our setting, each row is an eigenvector uj
slide-17
SLIDE 17

PCA using SVD algorithm

  • Start from the m by n data matrix X
  • Recenter: subtract the mean from each row of X

    – Xc ← X – X̄

  • Call an SVD algorithm on Xc – ask for k singular vectors
  • Principal components: the k singular vectors with the highest singular values (rows of Uᵀ)

    – Coefficients: project each point onto the new vectors
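A minimal numpy sketch of PCA via SVD, following the steps above and the λi = σi²/m relationship from the previous slide (variable names are my own):

```python
# Minimal sketch of PCA via SVD (numpy; variable names are my own).
import numpy as np

def pca_svd(X, k):
    Xc = X - X.mean(axis=0)                 # recenter
    Z, s, Ut = np.linalg.svd(Xc, full_matrices=False)   # X = Z S U^T
    U = Ut[:k]                              # rows of U^T = principal components
    coords = Xc @ U.T                       # project each point onto the new vectors
    eigvals = s[:k] ** 2 / X.shape[0]       # lambda_i = sigma_i^2 / m
    return U, coords, eigvals

X = np.random.default_rng(2).normal(size=(200, 10))
U, coords, eigvals = pca_svd(X, k=3)
print(U.shape, coords.shape, eigvals)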

slide-18
SLIDE 18

Non-linear methods

  • Linear

    – Factor Analysis
    – Independent Component Analysis (ICA)

  • Nonlinear

    – ISOMAP
    – Local Linear Embedding (LLE)

Slide from Aarti Singh

slide-19
SLIDE 19

Isomap

Goal: use the geodesic distance between points (with respect to the manifold).

  • Estimate the manifold using a graph. The distance between points is given by the length of the shortest path.
  • Embed onto a 2D plane so that Euclidean distance approximates the graph distance.

[Tenenbaum, Silva, Langford. Science 2000]

slide-20
SLIDE 20

Isomap

Table 1. The Isomap algorithm takes as input the distances dX(i,j) between all pairs i,j from N data points in the high-dimensional input space X, measured either in the standard Euclidean metric (as in Fig. 1A) or in some domain-specific metric (as in Fig. 1B). The algorithm outputs coordinate vectors yi in a d-dimensional Euclidean space Y that (according to Eq. 1) best represent the intrinsic geometry of the data. The only free parameter (ε or K) appears in Step 1.

Step 1: Construct neighborhood graph. Define the graph G over all data points by connecting points i and j if [as measured by dX(i,j)] they are closer than ε (ε-Isomap), or if i is one of the K nearest neighbors of j (K-Isomap). Set edge lengths equal to dX(i,j).

Step 2: Compute shortest paths. Initialize dG(i,j) = dX(i,j) if i,j are linked by an edge; dG(i,j) = ∞ otherwise. Then for each value of k = 1, 2, . . ., N in turn, replace all entries dG(i,j) by min{dG(i,j), dG(i,k) + dG(k,j)}. The matrix of final values DG = {dG(i,j)} will contain the shortest path distances between all pairs of points in G (16, 19).

Step 3: Construct d-dimensional embedding. Let λp be the p-th eigenvalue (in decreasing order) of the matrix τ(DG) (17), and v_p^i be the i-th component of the p-th eigenvector. Then set the p-th component of the d-dimensional coordinate vector yi equal to √λp · v_p^i.
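A rough sketch of the K-Isomap variant of these steps, assuming numpy, scipy, and scikit-learn are available (an illustration, not the authors' reference implementation):

```python
# Rough sketch of the K-Isomap steps above (numpy/scipy/sklearn; an illustration).
import numpy as np
from scipy.sparse.csgraph import shortest_path
from sklearn.neighbors import kneighbors_graph

def isomap(X, n_neighbors=10, d=2):
    # Step 1: neighborhood graph with edge lengths d_X(i, j).
    G = kneighbors_graph(X, n_neighbors, mode='distance')
    # Step 2: shortest-path (geodesic) distances D_G between all pairs.
    D = shortest_path(G, method='D', directed=False)
    # Step 3: classical MDS on D (double-centered -D^2/2, i.e. the tau operator).
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J
    eigvals, eigvecs = np.linalg.eigh(B)
    idx = np.argsort(eigvals)[::-1][:d]
    # p-th component of y_i = sqrt(lambda_p) * v_p^i
    return eigvecs[:, idx] * np.sqrt(np.maximum(eigvals[idx], 0))

Y = isomap(np.random.default_rng(3).normal(size=(100, 3)))
print(Y.shape)   # (100, 2)
```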

slide-21
SLIDE 21

Isomap

[Tenenbaum, Silva, Langford. Science 2000]

slide-22
SLIDE 22

Isomap

[Tenenbaum, Silva, Langford. Science 2000]

slide-23
SLIDE 23

Isomap

(Plots: residual variance vs. number of dimensions, for face images and Swiss roll data, comparing PCA and Isomap.)

slide-24
SLIDE 24

What you need to know

  • Dimensionality reduction

    – why and when it’s important

  • Principal component analysis

    – minimizing reconstruction error
    – relationship to the covariance matrix and eigenvectors
    – using SVD

  • Non-linear dimensionality reduction
slide-25
SLIDE 25

Graphical models

Sunita Sarawagi IIT Bombay http://www.cse.iitb.ac.in/~sunita

1

slide-26
SLIDE 26

Probabilistic modeling

Given: several variables x1, . . . , xn, where n is large.
Task: build a joint distribution function Pr(x1, . . . , xn).
Goal: answer several kinds of projection queries on the distribution.

Basic premise

◮ The explicit joint distribution is dauntingly large
◮ Queries are simple marginals (sum or max) over the joint distribution.

2

slide-27
SLIDE 27

Examples of Joint Distributions So far

Naive Bayes: P(x1, . . . , xd | y), where d is large. Assume conditional independence.
Multivariate Gaussian.
Recurrent Neural Networks for sequence labeling and prediction.

3

slide-28
SLIDE 28

Example

Variables are attributes of people:

Age: 10 ranges; Income: 7 scales; Experience: 7 scales; Degree: 3 scales; Location: 30 places

An explicit joint distribution over all columns is not tractable: the number of combinations is 10 × 7 × 7 × 3 × 30 = 44100.

Queries: estimate the fraction of people with

◮ Income > 200K and Degree = ”Bachelors”,
◮ Income < 200K, Degree = ”PhD” and experience > 10 years.
◮ Many, many more.

slide-29
SLIDE 29

Alternatives to an explicit joint distribution

Assume all columns are independent of each other: a bad assumption.
Use data to detect pairs of highly correlated columns and estimate their pairwise frequencies.

◮ Many highly correlated pairs: income and age, income and experience, age and experience are not independent
◮ Ad hoc methods of combining these into a single estimate

Go beyond pairwise correlations: conditional independencies

◮ income ⊥⊥ age does not hold, but income ⊥⊥ age | experience does
◮ experience ⊥⊥ degree does not hold, but experience ⊥⊥ degree | income does

Graphical models make explicit an efficient joint distribution from these independencies

5

slide-30
SLIDE 30

Graphical models

Model the joint distribution over several variables as a product of smaller factors that is

1 Intuitive to represent and visualize

    ◮ Graph: represents the structure of dependencies
    ◮ Potentials over subsets: quantify the dependencies

2 Efficient to query

    ◮ given values of any subset of variables, reason about the probability distribution of the others
    ◮ many efficient exact and approximate inference algorithms

Graphical models = graph theory + probability theory.

6

slide-31
SLIDE 31

Graphical models in use

Roots in statistical physics for modeling interacting atoms in gases and solids [~1900]
Early usage in genetics for modeling properties of species [~1920]
AI: expert systems (~1970s-80s)
Now many new applications:

◮ Error correcting codes: Turbo codes, impressive success story (1990s)
◮ Robotics and vision: image denoising, robot navigation
◮ Text mining: information extraction, duplicate elimination, hypertext classification, help systems
◮ Bio-informatics: secondary structure prediction, gene discovery
◮ Data mining: probabilistic classification and clustering

slide-32
SLIDE 32

Part I: Outline

1 Representation
    Directed graphical models: Bayesian networks

2 Inference
    Queries
    Exact inference on chains

3 Constructing a graphical model
    Graph Structure
    Parameters in Potentials

4 References

8

slide-33
SLIDE 33

Representation

Structure of a graphical model: Graph + Potential

Graph

Nodes: variables x = x1, . . . , xn

    ◮ Continuous: sensor temperatures, income
    ◮ Discrete: Degree (one of Bachelors, Masters, PhD), levels of age, labels of words

Edges: direct interaction

    ◮ Directed edges: Bayesian networks
    ◮ Undirected edges: Markov Random Fields

(Figures: an example directed graph and an example undirected graph.)

9
slide-34
SLIDE 34

Representation

Potentials: ψc(xc)

Scores for the assignment of values to subsets c of directly interacting variables. Which subsets? What do the potentials mean?

    ◮ Different for directed and undirected graphs

Probability

Factorizes as a product of potentials: Pr(x = x1, . . . , xn) ∝ ∏_S ψS(xS)

10

slide-35
SLIDE 35

Directed graphical models: Bayesian networks

Graph G: directed acyclic

    ◮ Parents of a node: Pa(xi) = set of nodes in G pointing to xi

Potentials: defined at each node in terms of its parents: ψi(xi, Pa(xi)) = Pr(xi | Pa(xi))

Probability distribution: Pr(x1, . . . , xn) = ∏_{i=1}^n Pr(xi | Pa(xi))

11

slide-36
SLIDE 36

Example of a directed graph

ψ1(L) = Pr(L):
    NY: 0.2, CA: 0.3, London: 0.1, Other: 0.4

ψ2(A) = Pr(A):
    20–30: 0.3, 30–45: 0.4, >45: 0.3
    Or, a Gaussian distribution with (µ, σ) = (35, 10)

ψ3(E, A) = Pr(E|A):
                E: 0–10   10–15   >15
    A = 20–30:     0.9     0.1
    A = 30–45:     0.4     0.5    0.1
    A = >45:       0.1     0.1    0.8

ψ4(I, E, D) = Pr(I|D, E): a 3-dimensional table, or a histogram approximation.

Probability distribution:

Pr(x = L, D, I, A, E) = Pr(L) Pr(D) Pr(A) Pr(E|A) Pr(I|D, E)

12
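A small illustration of evaluating the factorized joint. The Pr(L), Pr(A), and Pr(E|A) entries below come from the slide; Pr(D) and Pr(I|D, E) are not given, so the numbers used for them are made up purely for illustration:

```python
# Illustration: Pr(L, D, A, E, I) = Pr(L) Pr(D) Pr(A) Pr(E|A) Pr(I|D, E).
# Pr(L), Pr(A), and one row of Pr(E|A) are from the slide; Pr(D) and Pr(I|D, E) are made up.
P_L = {'NY': 0.2, 'CA': 0.3, 'London': 0.1, 'Other': 0.4}
P_A = {'20-30': 0.3, '30-45': 0.4, '>45': 0.3}
P_D = {'Bachelors': 0.5, 'Masters': 0.3, 'PhD': 0.2}             # made up
P_E_given_A = {'30-45': {'0-10': 0.4, '10-15': 0.5, '>15': 0.1}}  # one row from the slide
P_I_given_DE = {('PhD', '10-15'): {'High': 0.6, 'Low': 0.4}}      # made up

def joint(L, D, A, E, I):
    return (P_L[L] * P_D[D] * P_A[A]
            * P_E_given_A[A][E] * P_I_given_DE[(D, E)][I])

print(joint('NY', 'PhD', '30-45', '10-15', 'High'))
# 0.2 * 0.2 * 0.4 * 0.5 * 0.6 = 0.0048
```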

slide-37
SLIDE 37

Conditional Independencies

Given three sets of variables X, Y, Z, the set X is conditionally independent of Y given Z (X ⊥⊥ Y | Z) iff Pr(X | Y, Z) = Pr(X | Z).

Local conditional independencies in a BN: for each xi, xi ⊥⊥ ND(xi) | Pa(xi), where ND(xi) denotes the non-descendants of xi. For the example network:

    L ⊥⊥ E, D, A, I
    A ⊥⊥ L, D
    E ⊥⊥ L, D | A
    I ⊥⊥ A | E, D

13
slide-38
SLIDE 38

CIs and Factorization

Theorem

Local CI =⇒ Factorization

Proof.

x1, x2, . . . , xn topologically ordered (parents before children).
Local CI: Pr(xi | x1, . . . , xi−1) = Pr(xi | Pa(xi))
Chain rule: Pr(x1, . . . , xn) = ∏_i Pr(xi | x1, . . . , xi−1) = ∏_i Pr(xi | Pa(xi))

14

slide-39
SLIDE 39

Popular Bayesian networks

Hidden Markov Models: speech recognition, information extraction

    ◮ State variables: discrete (phoneme, entity tag)
    ◮ Observation variables: continuous (speech waveform) or discrete (word)

Kalman Filters: state variables are continuous

    ◮ Discussed later

Topic models for text data

    1 Principled mechanism to categorize multi-labeled text documents while incorporating priors in a flexible generative framework
    2 Application: news tracking

QMR (Quick Medical Reference) system

PRMs: Probabilistic relational networks

15

slide-40
SLIDE 40

Part I: Outline

1 Representation
    Directed graphical models: Bayesian networks

2 Inference
    Queries
    Exact inference on chains

3 Constructing a graphical model
    Graph Structure
    Parameters in Potentials

4 References

16

slide-41
SLIDE 41

Inference queries

1 Marginal probability queries over a small subset of variables:

    ◮ Find Pr(Income = ’High’ & Degree = ’PhD’)
    ◮ Find Pr(pixel y9 = 1)

    Pr(x1) = Σ_{x2,...,xn} Pr(x1, . . . , xn) = Σ_{x2=1}^m · · · Σ_{xn=1}^m Pr(x1, . . . , xn)

    Brute force requires O(m^(n−1)) time.

2 Most likely labels of the remaining variables (MAP queries):

    ◮ Find the most likely entity labels of all words in a sentence
    ◮ Find the likely temperature at sensors in a room

    x∗ = argmax_{x1...xn} Pr(x1, . . . , xn)

17

slide-42
SLIDE 42

Exact inference on chains

Given,

    ◮ Graph
    ◮ Potentials: ψi(yi, yi+1)
    ◮ Pr(y1, . . . , yn) = ∏_i ψi(yi, yi+1), and Pr(y1)

Find Pr(yi) for any i, say Pr(y5 = 1)

    ◮ Exact method: Pr(y5 = 1) = Σ_{y1,...,y4} Pr(y1, . . . , y4, 1) requires an exponential number of summations.
    ◮ A more efficient alternative...

slide-43
SLIDE 43

Exact inference on chains

Pr(y5 = 1) = Σ_{y1,...,y4} Pr(y1, . . . , y4, 1)
           = Σ_{y1} Σ_{y2} Σ_{y3} Σ_{y4} ψ1(y1, y2) ψ2(y2, y3) ψ3(y3, y4) ψ4(y4, 1)
           = Σ_{y1} Σ_{y2} ψ1(y1, y2) Σ_{y3} ψ2(y2, y3) Σ_{y4} ψ3(y3, y4) ψ4(y4, 1)
           = Σ_{y1} Σ_{y2} ψ1(y1, y2) Σ_{y3} ψ2(y2, y3) B3(y3)
           = Σ_{y1} Σ_{y2} ψ1(y1, y2) B2(y2)
           = Σ_{y1} B1(y1)

An alternative view: a flow of beliefs Bi(·) from node i + 1 to node i.

19
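A minimal sketch of this right-to-left elimination (my own code, assuming numpy and made-up potential tables); the brute-force sum over all assignments gives the same number:

```python
# Minimal sketch of the elimination above: B3(y3) = sum_y4 psi3(y3,y4) psi4(y4,1), then B2, B1.
import numpy as np

m = 3
rng = np.random.default_rng(4)
psi = [rng.random((m, m)) for _ in range(4)]   # psi[i] is psi_{i+1}(y_{i+1}, y_{i+2})

def marginal_y5(value):
    B = psi[3][:, value]                       # psi4(y4, value), a vector over y4
    for i in (2, 1, 0):                        # sum out y4, then y3, then y2
        B = psi[i] @ B                         # B_i(y_i) = sum_{y_{i+1}} psi_i(y_i, y_{i+1}) B_{i+1}(y_{i+1})
    return B.sum()                             # finally sum out y1

# Brute-force check over all m^4 assignments of y1..y4.
brute = sum(psi[0][a, b] * psi[1][b, c] * psi[2][c, d] * psi[3][d, 0]
            for a in range(m) for b in range(m) for c in range(m) for d in range(m))
print(marginal_y5(0), brute)                   # the two numbers agree
```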

slide-44
SLIDE 44

Adding evidence

Given fixed values of a subset of variables xe (evidence), find:

1 Marginal probability queries over a small subset of variables:

    ◮ Find Pr(Income = ’High’ | Degree = ’PhD’)

    Pr(x1) = Σ_{x2,...,xm} Pr(x1, . . . , xn | xe)

2 Most likely labels of the remaining variables (MAP queries):

    ◮ Find the likely temperature at sensors in a room given readings from a subset of them

    x∗ = argmax_{x1...xm} Pr(x1, . . . , xn | xe)

Easy to add evidence: just change the potentials.

20

slide-45
SLIDE 45

Case study: HMMs for Information Extraction

21

slide-46
SLIDE 46

Inference in HMMs

Given,

    ◮ Graph
    ◮ Potentials: Pr(yi | yi−1), Pr(xi | yi)
    ◮ Evidence variables: x = x1 . . . xn = o1 . . . on

Find the most likely values of the hidden state variables y = y1 . . . yn: argmax_y Pr(y | x = o)

Define ψi(yi−1, yi) = Pr(yi | yi−1) Pr(xi = oi | yi). The reduced graph is a single chain of y nodes.

    ◮ Algorithm same as earlier, just replace “Sum” with “Max”

This is the well-known Viterbi algorithm.

22

slide-47
SLIDE 47

The Viterbi algorithm

Let observations xt take one of k possible values, and states yt take one of m possible values.

Given n observations: o1, . . . , on
Given potentials: Pr(yt | yt−1) = P(y | y′) (table with m² values), Pr(xt | yt) = P(x | y) (table with mk values), Pr(y1) = P(y), the start probabilities (table with m values).

Find max_y Pr(y | x = o):

    Bn[y] = 1 for y ∈ [1, . . . , m]
    for t = n . . . 2 do
        ψ(y, y′) = P(y | y′) P(xt = ot | y)
        Bt−1[y′] = max_{y=1}^m ψ(y, y′) Bt[y]
    end for
    Return max_y B1[y] P(y) P(x1 = o1 | y)

Time taken: O(nm²)

23
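A minimal numpy sketch of this recursion (my own code; it returns the max probability only, so recovering the argmax sequence would need extra bookkeeping):

```python
# Minimal sketch of the backward max-product recursion above (numpy).
import numpy as np

def viterbi_max(P_trans, P_emit, P_start, obs):
    """P_trans[y_prev, y] = P(y | y'), P_emit[y, x] = P(x | y), P_start[y] = P(y)."""
    m = len(P_start)
    B = np.ones(m)                                   # B_n[y] = 1
    for t in range(len(obs) - 1, 0, -1):             # t = n ... 2
        psi = P_trans * P_emit[:, obs[t]]            # psi(y, y'): rows indexed by y', columns by y
        B = (psi * B).max(axis=1)                    # B_{t-1}[y'] = max_y psi(y, y') B_t[y]
    return (B * P_start * P_emit[:, obs[0]]).max()   # max_y B_1[y] P(y) P(x_1 = o_1 | y)

# Example call, using the tables from the next slide's numerical example:
P_trans = np.array([[0.9, 0.1], [0.2, 0.8]])
P_emit = np.array([[0.7, 0.3], [0.6, 0.4]])
P_start = np.array([0.5, 0.5])
print(viterbi_max(P_trans, P_emit, P_start, obs=[0, 0, 0]))   # 0.138915
```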

slide-48
SLIDE 48

Numerical Example

P(y | y′):
    y′ = 0:  P(y = 0 | y′) = 0.9,  P(y = 1 | y′) = 0.1
    y′ = 1:  P(y = 0 | y′) = 0.2,  P(y = 1 | y′) = 0.8

P(x | y):
    y = 0:  P(x = 0 | y) = 0.7,  P(x = 1 | y) = 0.3
    y = 1:  P(x = 0 | y) = 0.6,  P(x = 1 | y) = 0.4

P(y = 1) = 0.5

Observation: [x0, x1, x2] = [0, 0, 0]

24
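A brute-force check of this example (my own code): enumerate all 2³ state sequences and score Pr(y, x = [0, 0, 0]) with the tables above.

```python
# Brute-force check: enumerate all 2^3 state sequences and score Pr(y, x = [0, 0, 0]).
from itertools import product

P_trans = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}   # P(y | y')
P_emit  = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.6, 1: 0.4}}   # P(x | y)
P_start = {0: 0.5, 1: 0.5}                              # P(y = 1) = 0.5
obs = [0, 0, 0]

def score(ys):
    p = P_start[ys[0]] * P_emit[ys[0]][obs[0]]
    for t in range(1, len(obs)):
        p *= P_trans[ys[t - 1]][ys[t]] * P_emit[ys[t]][obs[t]]
    return p

best = max(product([0, 1], repeat=3), key=score)
print(best, score(best))   # the all-zeros sequence wins here, with probability ~0.1389
```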

slide-49
SLIDE 49

Contrast with RNNs for the same task

25

slide-50
SLIDE 50

Part I: Outline

1 Representation
    Directed graphical models: Bayesian networks

2 Inference
    Queries
    Exact inference on chains

3 Constructing a graphical model
    Graph Structure
    Parameters in Potentials

4 References

26

slide-51
SLIDE 51

Graph Structure

1 Manual: designed by a domain expert

    ◮ Used in applications where the dependency structure is well understood
    ◮ Examples: QMR systems, Kalman filters, vision (grids), HMMs for speech recognition and IE

2 Learned from examples

    ◮ NP-hard to find the optimal structure
    ◮ Widely researched, mostly posed as a branch and bound search problem
    ◮ Useful in dynamic situations

slide-52
SLIDE 52

Parameters in Potentials

1 Manual: provided by a domain expert

    ◮ Used in infrequently constructed graphs, e.g., QMR systems
    ◮ Also where potentials are an easy function of the attributes of the connected nodes, e.g., vision networks

2 Learned from examples

    ◮ More popular, since it is difficult for humans to assign numeric values
    ◮ Many variants of parameterizing potentials:

        1 Table potentials: each entry a parameter, e.g., HMMs
        2 Potentials as a combination of shared parameters and data attributes, e.g., CRFs

28

slide-53
SLIDE 53

Training for BN

Given a sample D = {x1, . . . , xN} of data generated from a distribution P(x) represented by a graphical model with known structure G, learn the potentials ψC(xC).

LL(θ, D) = Σ_{i=1}^N log Pr(x^i | θ) = Σ_{i=1}^N log ∏_j Pr(x^i_j | x^i_{Pa(j)}, θ) = Σ_i Σ_j log Pr(x^i_j | x^i_{Pa(j)}, θ)

Like a normal classification task.

29

slide-54
SLIDE 54

BN: Learning Table Potentials

Assume all variables are discrete, and the parameters of each node are different: θ = [θ1, . . . , θn].

Pr(xj | Pa(j), θj) = a table of real values giving the probability of each value of xj for each combination of values of the parents. If each variable takes m possible values and has k parents, then each Pr(xj | Pa(j), θj) requires m^k (m − 1) free parameters in θj.

Maximum likelihood estimation of the parameters:

θ^j_{v,u1,...,uk} = Pr(xj = v | pa(xj) = u1, . . . , uk)    (1)
                  = Σ_{i=1}^N [[x^i_j = v, x^i_{Pa(j)} = (u1, . . . , uk)]] / Σ_{i=1}^N [[x^i_{Pa(j)} = (u1, . . . , uk)]]    (2)

30
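A small sketch of the counting estimate (2) for a node with a single parent (my own code; the binary data below is made up):

```python
# Small sketch of the counting estimate (2) for a node with one parent (made-up data).
from collections import Counter

data = [                      # each row: (parent value u, child value v)
    (0, 0), (0, 0), (0, 1),
    (1, 1), (1, 1), (1, 0), (1, 1),
]
pair_counts = Counter(data)                  # N_{u,v}
parent_counts = Counter(u for u, _ in data)  # N_u

def theta(v, u):
    """MLE of Pr(x_j = v | parent = u)."""
    return pair_counts[(u, v)] / parent_counts[u]

print(theta(1, 0), theta(1, 1))              # 1/3 and 3/4
```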

slide-55
SLIDE 55

HMM parameters

Three types of potentials:

1 Transition probabilities:

    Pr(yt = v | yt−1 = u) = (number of transitions from u to v) / (total transitions out of state u)

2 Emission probabilities, the probability of emitting symbol v from state u:

    Pr(xt = v | yt = u) = (number of times v is generated from u) / (number of transitions out of u)

31

slide-56
SLIDE 56

Example: HMM parameter learning

D = (N = 3, n = 4):

    (y1, x1) (y2, x2) (y3, x3) (y4, x4)
    (1, A)   (1, B)   (2, A)   (3, C)
    (2, B)   (1, A)   (3, A)   (3, D)
    (1, B)   (1, B)   (2, C)   (3, D)

P(y):  y = 1: 2/3,  y = 2: 1/3,  y = 3: –

P(y | y′):
    y′ = 1:  y = 1: 2/5,  y = 2: 2/5,  y = 3: 1/5
    y′ = 2:  y = 1: 1/3,  y = 2: –,    y = 3: 2/3
    y′ = 3:  y = 1: –,    y = 2: –,    y = 3: 1

P(x | y):
    y = 1:  x = A: 2/5,  x = B: 3/5,  x = C: –,  x = D: –
    y = 2:
    y = 3:

32
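A short sketch (my own code) that recomputes the filled-in entries of the tables above by counting over the three training sequences:

```python
# Sketch: recompute the filled-in table entries above by counting (exact fractions).
from collections import Counter
from fractions import Fraction

D = [
    [(1, 'A'), (1, 'B'), (2, 'A'), (3, 'C')],
    [(2, 'B'), (1, 'A'), (3, 'A'), (3, 'D')],
    [(1, 'B'), (1, 'B'), (2, 'C'), (3, 'D')],
]

start = Counter(seq[0][0] for seq in D)                                   # y1 counts
trans = Counter((a[0], b[0]) for seq in D for a, b in zip(seq, seq[1:]))  # (u -> v) counts
emit = Counter((y, x) for seq in D for y, x in seq)                       # (state, symbol) counts
state = Counter(y for seq in D for y, _ in seq)                           # state occurrence counts

print(Fraction(start[1], len(D)))                                      # P(y1 = 1)      -> 2/3
print(Fraction(trans[(1, 2)], sum(trans[(1, v)] for v in (1, 2, 3))))  # P(y=2 | y'=1)  -> 2/5
print(Fraction(emit[(1, 'B')], state[1]))                              # P(x=B | y=1)   -> 3/5
```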

slide-57
SLIDE 57

Part I: Outline

1 Representation
    Directed graphical models: Bayesian networks

2 Inference
    Queries
    Exact inference on chains

3 Constructing a graphical model
    Graph Structure
    Parameters in Potentials

4 References

33

slide-58
SLIDE 58

More on graphical models

Koller and Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.
Wainwright's article in Foundations and Trends in Machine Learning, 2009.
Kevin Murphy's brief online introduction (http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html)
M. I. Jordan. Graphical models. Statistical Science (Special Issue on Bayesian Statistics), 19, 140-155, 2004. (http://www.cs.berkeley.edu/~jordan/papers/statsci.ps.gz)
Other textbooks:

    ◮ R. G. Cowell, A. P. Dawid, S. L. Lauritzen and D. J. Spiegelhalter. ”Probabilistic Networks and Expert Systems”. Springer-Verlag, 1999.
    ◮ J. Pearl. ”Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference”. Morgan Kaufmann, 1988.
    ◮ S. L. Lauritzen. ”Graphical Models”. Oxford Science Publications.
    ◮ F. V. Jensen. ”Bayesian Networks and Decision Graphs”. Springer, 2001.

34