Dimensionality Reduction, Lecture 23
David Sontag New York University
Slides adapted from Carlos Guestrin and Luke Zettlemoyer
Dimensionality reduction: input data may have thousands or millions of dimensions! e.g., text data has ...
Slide from Yi Zhang
In general the mapping will not be invertible: we cannot go from z back to x
From notes by Andrew Ng
z_i^j = (x_i − x̄) · u_j

(1/m) Σ_{i=1}^{m} (x^{(i)T} u)^2 = (1/m) Σ_{i=1}^{m} u^T x^{(i)} x^{(i)T} u = u^T ( (1/m) Σ_{i=1}^{m} x^{(i)} x^{(i)T} ) u
Let x^{(i)} be the i-th data point minus the mean. Choose a unit-length u (||u|| = 1) to maximize the quantity above. Using the method of Lagrange multipliers, one can show that the solution is given by the principal eigenvector of the covariance matrix (shown on board):

Covariance matrix: Σ = (1/m) Σ_{i=1}^{m} x^{(i)} x^{(i)T} = (1/m) X_c^T X_c, where X_c is the mean-centered data matrix.
In high-dimensional problems, data usually lies near a linear subspace, since noise introduces only small variability. Keep only the data projections onto principal components with large eigenvalues, and ignore the components of lesser significance. You might lose some information, but if the discarded eigenvalues are small, you do not lose much.
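A minimal NumPy sketch of this recipe (illustrative code, not from the original slides; the function and variable names are my own):

```python
import numpy as np

def pca(X, k):
    """Project the rows of X (m x d) onto the top-k principal components."""
    Xc = X - X.mean(axis=0)                    # subtract the mean from each data point
    Sigma = Xc.T @ Xc / X.shape[0]             # covariance matrix (d x d)
    eigvals, eigvecs = np.linalg.eigh(Sigma)   # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]      # keep the k largest eigenvalues
    U = eigvecs[:, order]                      # principal components u_1, ..., u_k
    Z = Xc @ U                                 # projections z_i = (x_i - mean) . u_j
    return Z, U, eigvals[order]
```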
[Scree plot: percentage of variance captured by each principal component, PC1–PC10]
Slide from Aarti Singh
Percentage of total variance captured by dimension zj for j=1 to 10:
var(z_j) = (1/m) Σ_{i=1}^{m} (z_i^j)^2 = (1/m) Σ_{i=1}^{m} (x_i · u_j)^2 = λ_j

Fraction of total variance captured by dimension j: λ_j / Σ_{l=1}^{n} λ_l
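A quick, self-contained numerical check of this identity (a sketch on made-up data; names are illustrative): the empirical variance of each projected coordinate equals the corresponding eigenvalue, and dividing by the sum of all eigenvalues gives the fraction of variance captured.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10)) @ rng.normal(size=(10, 10))  # toy data, m = 500, d = 10
Xc = X - X.mean(axis=0)
Sigma = Xc.T @ Xc / X.shape[0]
lam, U = np.linalg.eigh(Sigma)
lam, U = lam[::-1], U[:, ::-1]                 # sort eigenvalues in decreasing order
Z = Xc @ U                                     # z_i^j = (x_i - mean) . u_j
print(np.allclose((Z ** 2).mean(axis=0), lam)) # True: var(z_j) == lambda_j
print(lam / lam.sum())                         # fraction of total variance per dimension
```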
Principal components:
Dimensionality reduction methods (slide from Aarti Singh): Principal Component Analysis (PCA), Factor Analysis, Independent Component Analysis (ICA), ISOMAP, Local Linear Embedding (LLE).
Goal: use geodesic distance between points (with respect to the manifold). Estimate the manifold using a graph over the data points; the geodesic distance between points is given by the length of the shortest path in the graph. Embed onto a 2D plane so that Euclidean distance approximates the graph distance.
[Tenenbaum, Silva, Langford. Science 2000]
Table 1. The Isomap algorithm takes as input the distances dX(i, j) between all pairs i, j of N data points in the high-dimensional input space X, measured either in the standard Euclidean metric (as in Fig. 1A) or in some domain-specific metric (as in Fig. 1B). The algorithm outputs coordinate vectors yi in a d-dimensional Euclidean space Y that (according to Eq. 1) best represent the intrinsic geometry of the data.

Step 1: Construct neighborhood graph. Define the graph G over all data points by connecting points i and j if [as measured by dX(i, j)] they are closer than ε (ε-Isomap), or if i is one of the K nearest neighbors of j (K-Isomap). Set edge lengths equal to dX(i, j).

Step 2: Compute shortest paths. Initialize dG(i, j) = dX(i, j) if i, j are linked by an edge; dG(i, j) = ∞ otherwise. Then for each value of k = 1, 2, . . . , N in turn, replace all entries dG(i, j) by min{dG(i, j), dG(i, k) + dG(k, j)}. The matrix of final values DG = {dG(i, j)} will contain the shortest path distances between all pairs of points in G (16, 19).

Step 3: Construct d-dimensional embedding. Let λp be the p-th eigenvalue (in decreasing order) of the matrix τ(DG) (17), and v_p^i be the i-th component of the p-th eigenvector. Then set the p-th component of the d-dimensional coordinate vector yi equal to √λp · v_p^i.
[Tenenbaum, Silva, Langford. Science 2000]
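A compact sketch of the three steps in Table 1 (an illustration assuming NumPy and SciPy, using the K-nearest-neighbor variant; not the authors' reference implementation):

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path

def isomap(X, n_neighbors=5, n_components=2):
    """Minimal K-Isomap sketch: neighborhood graph -> shortest paths -> classical MDS."""
    N = X.shape[0]
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # d_X(i, j)
    # Step 1: connect each point to its K nearest neighbors, edge length d_X(i, j).
    G = np.full((N, N), np.inf)
    for i in range(N):
        nbrs = np.argsort(D[i])[1:n_neighbors + 1]   # skip the point itself
        G[i, nbrs] = D[i, nbrs]
    # Step 2: graph distances d_G(i, j) = all-pairs shortest paths in G.
    DG = shortest_path(G, method="D", directed=False)
    # Step 3: classical MDS on DG (double-center the squared distances).
    H = np.eye(N) - np.ones((N, N)) / N
    tau = -0.5 * H @ (DG ** 2) @ H
    lam, V = np.linalg.eigh(tau)
    top = np.argsort(lam)[::-1][:n_components]
    # p-th coordinate of y_i is sqrt(lambda_p) * v_p[i]
    return V[:, top] * np.sqrt(np.maximum(lam[top], 0.0))

# Example usage: Y = isomap(points, n_neighbors=8, n_components=2)
```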
[Plot: residual variance vs. number of dimensions for face images and Swiss roll data, comparing PCA and Isomap]
Graphical models
Sunita Sarawagi IIT Bombay http://www.cse.iitb.ac.in/~sunita
1
Probabilistic modeling
Given: several variables x1, . . . , xn; n is large.
Task: build a joint distribution Pr(x1, . . . , xn).
Goal: answer several kinds of projection queries on the distribution.
Basic premise
◮ Explicit joint distribution is dauntingly large
◮ Queries are simple marginals (sum or max) over the joint distribution
2
Examples of joint distributions so far
Naive Bayes: P(x1, . . . , xd | y), d is large; assumes conditional independence.
Multivariate Gaussian.
Recurrent neural networks for sequence labeling and prediction.
3
Example
Variables are attributes of people: Age (10 ranges), Income (7 scales), Experience (7 scales), Degree (3 scales), Location (30 places). An explicit joint distribution over all columns is not tractable: the number of combinations is 10 × 7 × 7 × 3 × 30 = 44,100. Queries: estimate the fraction of people with
◮ Income > 200K and Degree = "Bachelors",
◮ Income < 200K, Degree = "PhD" and experience > 10 years,
◮ Many, many more.
4
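A small sketch of the basic premise (random made-up probabilities, hypothetical value indices): the explicit joint over these five attributes needs 10 x 7 x 7 x 3 x 30 = 44,100 cells, and a query is just a sum over part of that table.

```python
import numpy as np

# Explicit joint over (Age, Income, Experience, Degree, Location).
rng = np.random.default_rng(0)
joint = rng.random((10, 7, 7, 3, 30))
joint /= joint.sum()                # normalize into a probability table (44,100 cells)

# Query: fraction with Income in its top scale (index 6, hypothetical) and
# Degree = "Bachelors" (index 0, hypothetical); marginalize out the rest.
p = joint[:, 6, :, 0, :].sum()
print(p)
```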
Alternatives to an explicit joint distribution
Assume all columns are independent of each other: a bad assumption.
Use data to detect highly correlated column pairs and estimate their pairwise frequencies
◮ Many highly correlated pairs: income and age, income and experience, age and experience
◮ Ad hoc methods of combining these into a single estimate
Go beyond pairwise correlations: conditional independencies
◮ income not ⊥⊥ age, but income ⊥⊥ age | experience
◮ experience not ⊥⊥ degree, but experience ⊥⊥ degree | income
Graphical models exploit these independencies to build an efficient, explicit joint distribution
5
Graphical models
Model joint distribution over several variables as a product of smaller factors that is
1 Intuitive to represent and visualize
◮ Graph: represents the structure of dependencies
◮ Potentials over subsets: quantify the dependencies
2 Efficient to query
◮ Given values of any variable subset, reason about the probability distribution of the others
◮ Many efficient exact and approximate inference algorithms
Graphical models = graph theory + probability theory.
6
Graphical models in use
Roots in statistical physics for modeling interacting atoms in gases and solids [~1900]
Early usage in genetics for modeling properties of species [~1920]
AI: expert systems (1970s–80s)
Now many new applications:
◮ Error-correcting codes: turbo codes, an impressive success story (1990s)
◮ Robotics and vision: image denoising, robot navigation
◮ Text mining: information extraction, duplicate elimination, hypertext classification, help systems
◮ Bio-informatics: secondary structure prediction, gene discovery
◮ Data mining: probabilistic classification and clustering
7
Part I: Outline
1 Representation: Directed graphical models (Bayesian networks)
2 Inference: Queries; Exact inference on chains
3 Constructing a graphical model: Graph structure; Parameters in potentials
4 References
8
Representation
Structure of a graphical model: Graph + Potential
Graph
Nodes: variables x = x1, . . . xn
◮ Continuous: sensor temperatures, income
◮ Discrete: Degree (one of Bachelors, Masters, PhD), levels of age, labels of words
Edges: direct interaction
◮ Directed edges: Bayesian networks
◮ Undirected edges: Markov random fields
Directed
Representation
Potentials: ψc(xc)
Scores for assignment of values to subsets c of directly interacting variables. Which subsets? What do the potentials mean?
◮ Different for directed and undirected graphs
Probability
Factorizes as a product of potentials: Pr(x = x1, . . . , xn) ∝ ∏_c ψc(xc)
10
Directed graphical models: Bayesian networks
Graph G: directed acyclic
◮ Parents of a node: Pa(xi) = set of nodes in G pointing to xi
Potentials: defined at each node in terms of its parents: ψi(xi, Pa(xi)) = Pr(xi | Pa(xi))
Probability distribution: Pr(x1, . . . , xn) = ∏_{i=1}^{n} Pr(xi | Pa(xi))
11
Example of a directed graph
ψ1(L) = Pr(L):
  NY: 0.2, CA: 0.3, London: 0.1, Other: 0.4
ψ2(A) = Pr(A):
  20–30: 0.3, 30–45: 0.4, >45: 0.3 (or as a continuous variable, e.g. (µ, σ) = (35, 10))
ψ3(E, A) = Pr(E | A):
  A = 20–30:  E 0–10: 0.9,  10–15: 0.1,  >15: 0.0
  A = 30–45:  E 0–10: 0.4,  10–15: 0.5,  >15: 0.1
  A = >45:    E 0–10: 0.1,  10–15: 0.1,  >15: 0.8
ψ4(I, E, D) = Pr(I | D, E): a 3-dimensional table, or a histogram approximation.
Probability distribution
Pr(L, D, A, E, I) = Pr(L) Pr(D) Pr(A) Pr(E | A) Pr(I | D, E)
12
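A sketch of this factorization in code. The tables for Pr(L), Pr(A), and Pr(E | A) are copied from the slide above; Pr(D) and the 3-dimensional table Pr(I | D, E) are not given there, so made-up placeholders are used.

```python
# Conditional probability tables for the example network.
P_L = {"NY": 0.2, "CA": 0.3, "London": 0.1, "Other": 0.4}       # Pr(L), from the slide
P_A = {"20-30": 0.3, "30-45": 0.4, ">45": 0.3}                  # Pr(A), from the slide
P_E_given_A = {                                                 # Pr(E | A), from the slide
    "20-30": {"0-10": 0.9, "10-15": 0.1, ">15": 0.0},
    "30-45": {"0-10": 0.4, "10-15": 0.5, ">15": 0.1},
    ">45":   {"0-10": 0.1, "10-15": 0.1, ">15": 0.8},
}
P_D = {"Bachelors": 0.5, "Masters": 0.3, "PhD": 0.2}            # made-up placeholder

def P_I_given_DE(i, d, e):
    return 0.25              # made-up placeholder; a real model would use a 3-d table

def joint(l, d, a, e, i):
    """Pr(L, D, A, E, I) = Pr(L) Pr(D) Pr(A) Pr(E | A) Pr(I | D, E)."""
    return P_L[l] * P_D[d] * P_A[a] * P_E_given_A[a][e] * P_I_given_DE(i, d, e)

print(joint("NY", "PhD", "30-45", "10-15", "High"))   # one entry of the joint
```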
Conditional Independencies
Given three sets of variables X, Y, Z, X is conditionally independent of Y given Z (X ⊥⊥ Y | Z) iff Pr(X | Y, Z) = Pr(X | Z).
Local conditional independencies in a BN: for each xi, xi ⊥⊥ ND(xi) | Pa(xi), where ND(xi) denotes the non-descendants of xi.
In the example: L ⊥⊥ E, D, A, I;  A ⊥⊥ L, D;  E ⊥⊥ L, D | A;  I ⊥⊥ A | E, D
CIs and Factorization
Theorem
Local CI =⇒ Factorization
Proof.
Let x1, x2, . . . , xn be topologically ordered (parents before children).
Local CI: Pr(xi | x1, . . . , xi−1) = Pr(xi | Pa(xi))
Chain rule: Pr(x1, . . . , xn) = ∏_i Pr(xi | x1, . . . , xi−1) = ∏_i Pr(xi | Pa(xi))
14
Popular Bayesian networks
Hidden Markov Models: speech recognition, information extraction
◮ Observation variables: continuous (speech waveform) or discrete (word)
Kalman Filters: State variables: continuous
◮ Discussed later
Topic models for text data
1 Principled mechanism to categorize multi-labeled text documents while incorporating priors in a flexible generative framework
2 Application: news tracking
QMR (Quick Medical Reference) system
PRMs: Probabilistic Relational Models
15
Part I: Outline
1 Representation: Directed graphical models (Bayesian networks)
2 Inference: Queries; Exact inference on chains
3 Constructing a graphical model: Graph structure; Parameters in potentials
4 References
16
Inference queries
1
Marginal probability queries over a small subset of variables:
◮ Find Pr(Income = 'High' & Degree = 'PhD')
◮ Find Pr(pixel y9 = 1)
Pr(x1) = Σ_{x2=1}^{m} · · · Σ_{xn=1}^{m} Pr(x1, . . . , xn)
Brute force requires O(m^{n−1}) time.
2
Most likely labels of remaining variables: (MAP queries)
◮ Find most likely entity labels of all words in a sentence
◮ Find likely temperature at sensors in a room
x* = argmax_{x1,...,xn} Pr(x1, . . . , xn)
17
Exact inference on chains
Given:
◮ Graph
◮ Potentials ψi(yi, yi+1)
◮ Pr(y1, . . . , yn) = ∏_i ψi(yi, yi+1), and Pr(y1)
Find Pr(yi) for any i, say Pr(y5 = 1)
◮ Exact method: Pr(y5 = 1) = Σ_{y1,...,y4} Pr(y1, . . . , y4, 1) requires an exponential number of summations.
◮ A more efficient alternative...
18
Exact inference on chains
Pr(y5 = 1) = Σ_{y1,...,y4} Pr(y1, . . . , y4, 1)
= Σ_{y1,...,y4} ψ1(y1, y2) ψ2(y2, y3) ψ3(y3, y4) ψ4(y4, 1)
= Σ_{y1,y2} ψ1(y1, y2) Σ_{y3} ψ2(y2, y3) Σ_{y4} ψ3(y3, y4) ψ4(y4, 1)
= Σ_{y1,y2} ψ1(y1, y2) Σ_{y3} ψ2(y2, y3) B3(y3)
= Σ_{y1,y2} ψ1(y1, y2) B2(y2)
= Σ_{y1} B1(y1)
An alternative view: flow of beliefs Bi(·) from node i + 1 to node i
19
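A sketch of this message-passing computation for a binary chain (the potentials below are random toy tables; the point is that the cost is linear in the chain length rather than exponential):

```python
import numpy as np

def chain_marginal(psis, last_value):
    """Sum over y_1..y_{n-1} of prod_i psi_i(y_i, y_{i+1}) with y_n clamped.

    psis = [psi_1, ..., psi_{n-1}], each an (m x m) array with
    psi_i[a, b] = psi_i(y_i = a, y_{i+1} = b).
    """
    B = psis[-1][:, last_value]       # clamp y_n: a belief over y_{n-1}
    for psi in reversed(psis[:-1]):   # pass beliefs from the end of the chain to the front
        B = psi @ B                   # B_i(y_i) = sum_{y_{i+1}} psi_i(y_i, y_{i+1}) B(y_{i+1})
    return B.sum()                    # finally sum out y_1

rng = np.random.default_rng(0)
psis = [rng.random((2, 2)) for _ in range(4)]   # psi_1..psi_4 for a 5-node chain
print(chain_marginal(psis, last_value=1))       # proportional to Pr(y_5 = 1), up to normalization
```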
Adding evidence
Given fixed values of a subset of variables xe (evidence), find the following:
1
Marginal probability queries over a small subset of variables:
◮ Find Pr(Income = 'High' | Degree = 'PhD')
Pr(x1 | xe) = Σ_{x2,...,xn} Pr(x1, . . . , xn | xe)
2
Most likely labels of remaining variables: (MAP queries)
◮ Find likely temperature at sensors in a room given readings from a subset of them
x* = argmax_{x1,...,xn} Pr(x1, . . . , xn | xe)
Easy to add evidence: just change the potentials.
20
Case study: HMMs for Information Extraction
21
Inference in HMMs
Given,
◮ Graph
◮ Evidence variables: x = x1 . . . xn = o1 . . . on.
Find the most likely values of the hidden state variables y = y1 . . . yn: argmax_y Pr(y | x = o). Define ψi(yi−1, yi) = Pr(yi | yi−1) Pr(xi = oi | yi). The reduced graph is then a single chain over the y nodes.
This is the well-known Viterbi algorithm
22
The Viterbi algorithm
Let observations xt take one of k possible values and states yt take one of m possible values.
Given n observations o1, . . . , on.
Given potentials: Pr(yt | yt−1) = P(y | y′) (table with m² values), Pr(xt | yt) = P(x | y) (table with mk values), Pr(y1) = P(y), the start probabilities (table with m values).
Find max_y Pr(y | x = o).

Bn[y] = 1 for all y ∈ {1, . . . , m}
for t = n . . . 2 do
  ψ(y, y′) = P(y | y′) P(xt = ot | y)
  Bt−1[y′] = max_{y=1..m} ψ(y, y′) Bt[y]
end for
Return max_y B1[y] P(y) P(x1 = o1 | y)

Time taken: O(nm²)
23
Numerical Example
P(y | y′):
  y′ = 0:  P(y = 0 | y′) = 0.9,  P(y = 1 | y′) = 0.1
  y′ = 1:  P(y = 0 | y′) = 0.2,  P(y = 1 | y′) = 0.8
P(x | y):
  y = 0:  P(x = 0 | y) = 0.7,  P(x = 1 | y) = 0.3
  y = 1:  P(x = 0 | y) = 0.6,  P(x = 1 | y) = 0.4
P(y = 1) = 0.5
Observation [x0, x1, x2] = [0, 0, 0]
24
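A sketch of the Viterbi recursion from the previous slide applied to this numerical example (array layout is my own; the tables are the ones on the slide, with the start distribution taken to be [0.5, 0.5] from P(y = 1) = 0.5):

```python
import numpy as np

def viterbi_max(P_trans, P_emit, P_start, obs):
    """max_y Pr(y, x = obs) via the backward recursion on the previous slide.

    P_trans[u, v] = P(y_t = v | y_{t-1} = u); P_emit[u, k] = P(x_t = k | y_t = u).
    """
    n, m = len(obs), len(P_start)
    B = np.ones(m)                                   # B_n[y] = 1
    for t in range(n - 1, 0, -1):                    # t = n ... 2
        psi = P_trans * P_emit[:, obs[t]][None, :]   # psi(y, y'), stored as [y', y]
        B = (psi * B[None, :]).max(axis=1)           # B_{t-1}[y'] = max_y psi(y, y') B_t[y]
    return (B * P_start * P_emit[:, obs[0]]).max()   # max_y B_1[y] P(y) P(x_1 = o_1 | y)

P_trans = np.array([[0.9, 0.1],    # row y' = 0: P(y = 0 | y'), P(y = 1 | y')
                    [0.2, 0.8]])   # row y' = 1
P_emit  = np.array([[0.7, 0.3],    # row y = 0: P(x = 0 | y), P(x = 1 | y)
                    [0.6, 0.4]])   # row y = 1
P_start = np.array([0.5, 0.5])     # P(y = 1) = 0.5
print(viterbi_max(P_trans, P_emit, P_start, obs=[0, 0, 0]))  # ~0.1389, best path [0, 0, 0]
```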
Contrast with RNNs for the same task
25
Part I: Outline
1 Representation: Directed graphical models (Bayesian networks)
2 Inference: Queries; Exact inference on chains
3 Constructing a graphical model: Graph structure; Parameters in potentials
4 References
26
Graph Structure
1 Manual: designed by a domain expert
◮ Used in applications where the dependency structure is well understood
◮ Examples: QMR systems, Kalman filters, vision (grids), HMMs for speech recognition and IE
2 Learned from examples
◮ NP-hard to find the optimal structure
◮ Widely researched, mostly posed as a branch-and-bound search problem
◮ Useful in dynamic situations
27
Parameters in Potentials
1 Manual: provided by a domain expert
◮ Used in infrequently constructed graphs, for example QMR systems
◮ Also where potentials are an easy function of the attributes of the connected nodes, for example vision networks
2 Learned from examples
◮ More popular, since it is difficult for humans to assign numeric values
◮ Many variants of parameterizing potentials:
  1 Table potentials: each entry is a parameter, for example HMMs
  2 Potentials as a combination of shared parameters and data attributes, for example CRFs
28
Training for BN
Given a sample D = {x1, . . . , xN} of data generated from a distribution P(x) represented by a graphical model with known structure G, learn the potentials ψC(xC).

LL(θ, D) = Σ_{i=1}^{N} log Pr(x^i | θ) = Σ_{i=1}^{N} log ∏_j Pr(x^i_j | x^i_{Pa(j)}, θ) = Σ_{i=1}^{N} Σ_j log Pr(x^i_j | x^i_{Pa(j)}, θ)

Like a normal classification task.
29
BN: Learning Table Potentials
Assume all variables are discrete and the parameters of each node are separate: θ = [θ1, . . . , θn].
Pr(xj | Pa(j), θj) = a table of real values giving the probability of each value of xj for each combination of values of the parents.
If each variable takes m possible values and has k parents, then each Pr(xj | Pa(j), θj) requires m^k (m − 1) free parameters in θj.
Maximum likelihood estimation of the parameters:
θj_{v, u1,...,uk} = Pr(xj = v | pa(xj) = u1, . . . , uk) = Σ_{i=1}^{N} [[x^i_j = v, x^i_{Pa(j)} = (u1, . . . , uk)]] / Σ_{i=1}^{N} [[x^i_{Pa(j)} = (u1, . . . , uk)]]
30
HMM parameters
Three types of potentials:
1 Transition probabilities: Pr(yt = v | yt−1 = u) = (number of transitions from u to v) / (total number of transitions out of state u)
Example:
2 Emission probabilities, i.e. the probability of emitting symbol v from state u: Pr(xt = v | yt = u) = (number of times v is generated from state u) / (number of times state u occurs)
31
Example: HMM parameter learning
D (N = 3 sequences, n = 4):
  (y1, x1) (y2, x2) (y3, x3) (y4, x4)
  (1, A)   (1, B)   (2, A)   (3, C)
  (2, B)   (1, A)   (3, A)   (3, D)
  (1, B)   (1, B)   (2, C)   (3, D)

P(y):  y = 1: 2/3,  y = 2: 1/3,  y = 3: 0

P(y | y′):
  y′ = 1:  y = 1: 2/5,  y = 2: 2/5,  y = 3: 1/5
  y′ = 2:  y = 1: 1/3,              y = 3: 2/3
  y′ = 3:                            y = 3: 1

P(x | y):
  y = 1:  A: 2/5,  B: 3/5
  y = 2:  A: 1/3,  B: 1/3,  C: 1/3
  y = 3:  A: 1/4,  C: 1/4,  D: 1/2
32
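A sketch that recovers these tables from the three training sequences by counting, following the formulas two slides back (the encodings and variable names are my own):

```python
from collections import Counter

# Training data from the slide: three (state, symbol) sequences of length 4.
D = [[(1, "A"), (1, "B"), (2, "A"), (3, "C")],
     [(2, "B"), (1, "A"), (3, "A"), (3, "D")],
     [(1, "B"), (1, "B"), (2, "C"), (3, "D")]]

start, trans, emit, state_count = Counter(), Counter(), Counter(), Counter()
for seq in D:
    start[seq[0][0]] += 1                       # which state each sequence starts in
    for y, x in seq:
        emit[(y, x)] += 1                       # symbol x generated from state y
        state_count[y] += 1                     # how often state y occurs
    for (y_prev, _), (y, _) in zip(seq, seq[1:]):
        trans[(y_prev, y)] += 1                 # transition y_prev -> y

trans_out = Counter()                           # total transitions out of each state
for (u, v), c in trans.items():
    trans_out[u] += c

P_start = {y: c / len(D) for y, c in start.items()}
P_trans = {(u, v): c / trans_out[u] for (u, v), c in trans.items()}
P_emit  = {(y, x): c / state_count[y] for (y, x), c in emit.items()}
print(P_start)   # {1: 2/3, 2: 1/3}
print(P_trans)   # e.g. P(1->1) = 2/5, P(1->2) = 2/5, P(1->3) = 1/5, P(2->3) = 2/3, P(3->3) = 1
print(P_emit)    # e.g. P(A|1) = 2/5, P(B|1) = 3/5
```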
Part I: Outline
1 Representation: Directed graphical models (Bayesian networks)
2 Inference: Queries; Exact inference on chains
3 Constructing a graphical model: Graph structure; Parameters in potentials
4 References
33
More on graphical models
Koller and Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.
Wainwright's article in Foundations and Trends in Machine Learning, 2009.
Kevin Murphy's brief online introduction: http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html
M. I. Jordan. Graphical models. Statistical Science (Special Issue on Bayesian Statistics), 19:140–155, 2004. http://www.cs.berkeley.edu/~jordan/papers/statsci.ps.gz
Other text books:
◮ R. G. Cowell, A. P. Dawid, S. L. Lauritzen and D. J. Spiegelhalter. Probabilistic Networks and Expert Systems. Springer-Verlag, 1999.
◮ J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.
◮ S. L. Lauritzen. Graphical Models. Oxford Science Publications.
34