Learning Arbitrary Statistical Mixtures of Discrete Distributions
Jian Li (Tsinghua), Yuval Rabani (HUJI), Leonard J. Schulman (Caltech), Chaitanya Swamy (Waterloo)
STOC'15. Contact: lijian83@mail.tsinghua.edu.cn
Outline: Problem Definition, Related Work, Our Results, The Coin Problem, Higher Dimension, Conclusion
Mixture of discrete distributions

Δ_n = { x ∈ ℝ^n_+ : ||x||_1 = 1 }
So each point in Δ_n is a prob. distr. over [n]
θ is a prob. distr. over Δ_n (unknown to us)
Goal: learn θ, i.e., output θ̂ whose transportation distance to θ (in the L1 metric) is at most ε:
Tran_1(θ, θ̂) ≤ ε
A K-snapshot sample (K = snapshot#):

Take a sample point x ~ θ (x ∈ Δ_n); we don't get to observe x directly
Take K i.i.d. samples i_1, i_2, …, i_K from x; we observe i_1, i_2, …, i_K, called a K-snapshot sample
Question:
How large does the snapshot# K need to be in order to learn θ?
How many K-snapshot samples do we need to learn θ?
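The sampling model is easy to simulate. Below is a minimal sketch with a synthetic 2-spike mixture of our own choosing (the spikes, weights, n, and K are illustrative, not from the talk):

```python
# Sampling model: draw a hidden distribution x ~ theta, then observe
# K i.i.d. letters from x. We never see x itself, only the snapshot.
import numpy as np

rng = np.random.default_rng(0)
n, K = 6, 5
# theta: a 2-spike mixture over the simplex Delta_n (two hidden distributions)
spikes = np.array([[.5, .3, .1, .1, 0, 0], [0, 0, .1, .1, .3, .5]])
weights = np.array([0.7, 0.3])

def k_snapshot():
    x = spikes[rng.choice(2, p=weights)]   # hidden x ~ theta (never observed)
    return rng.choice(n, size=K, p=x)      # observed K-snapshot i_1 .. i_K

print(k_snapshot())   # e.g. [0 1 0 3 0]
```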
Previous work

Mixture of Gaussians: a large body of work
- Only need 1-snapshot samples; the goal is to learn the parameters
- K-snapshot samples (K > 1) are necessary for mixtures of discrete distributions

Topic Models: θ is a mixture of topics (each topic is a distribution over words)
- How a document is generated: sample a topic x ~ θ (x ∈ Δ_n), then use x to generate a document of size K (a document is a K-snapshot sample)
- Various assumptions:
  LSI, separability [Papadimitriou, Raghavan, Tamaki, Vempala '00]
  LDA [Blei, Ng, Jordan '03]
  Anchor words [Arora, Ge, Moitra '12] (snapshot# = 2)
  Linearly independent topics [Anandkumar, Foster, Hsu, Kakade, Liu '12] (snapshot# = O(1))
  Several others

Collaborative Filtering
- L1 condition number [Kleinberg, Sandler '08]
Transportation distance
Also known as earth mover distance, Rubinstein distance, Wasserstein distance.

Tran_1(P, Q): distance between two probability distributions P and Q.
If we want to turn P into Q, the metric is the cost of the cheapest transportation plan.
E.g., in the discrete case, it is the value of the following LP:

Tran_1(P, Q) = min Σ_{i,j} T_{ij} · d(i, j)
s.t. Σ_j T_{ij} = P_i (for all i), Σ_i T_{ij} = Q_j (for all j), T ≥ 0
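For intuition, here is a minimal sketch (ours, not the paper's code) that solves exactly the small LP above with scipy:

```python
# Discrete transportation distance Tran_1(P, Q) as a linear program.
import numpy as np
from scipy.optimize import linprog

def tran1(P, Q, D):
    """P, Q: prob. vectors of length m; D[i, j] = d(i, j) ground distances."""
    m = len(P)
    c = D.reshape(-1)                      # objective: sum_ij T_ij * d(i, j)
    A_eq = np.zeros((2 * m, m * m))
    for i in range(m):
        A_eq[i, i * m:(i + 1) * m] = 1.0   # row sums:    sum_j T_ij = P_i
        A_eq[m + i, i::m] = 1.0            # column sums: sum_i T_ij = Q_j
    b_eq = np.concatenate([P, Q])
    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    return res.fun

# Two coins at biases 0.2 and 0.8 vs. one coin at 0.5 (points on [0,1]):
pts = np.array([0.2, 0.5, 0.8])
D = np.abs(pts[:, None] - pts[None, :])
print(tran1(np.array([0.5, 0.0, 0.5]), np.array([0.0, 1.0, 0.0]), D))  # 0.3
```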
The Coin problem: 1 dimension

A mixture θ defined over [0,1] (each point is the bias of a coin)
If the mixture θ is a k-spike distribution (k different coins), a lower bound is known.
Lower bound [Rabani, Schulman, Swamy '14]: to guarantee Tran_1(θ, θ̂) ≤ O(1/k),
(1) (2k−1)-snapshots are necessary, and
(2) exp(Ω(k)) many (2k−1)-snapshot samples are needed.

Our Result:
A nearly matching upper bound: (k/ε)^{O(k)} · log(1/δ) (2k−1)-snapshot samples suffice (w.p. 1 − δ).
The Coin problem: 1 dimension, arbitrary mixtures

A mixture θ over [0,1]; θ is arbitrary (may even be continuous)
Lower bound [Rabani, Schulman, Swamy '14]: still applies (restated):
guaranteeing Tran_1(θ, θ̂) ≤ O(1/K) requires exp(Ω(K)) K-snapshot samples.

Our Result:
A nearly matching upper bound: using exp(O(K)) K-snapshot samples, we can recover θ̂
s.t. Tran_1(θ, θ̂) ≤ O(1/K).
This is a tight tradeoff between K and the transportation distance.
Higher Dimension

A mixture θ over Δ_n
Assumption: θ is a k-spike distribution (think of k as very small, k << n)
Our result: using poly(n) 1- and 2-snapshot samples and (k/ε)^{O(k²)} (2k−1)-snapshot samples,
we can obtain a mixture θ̂ s.t. Tran_1(θ, θ̂) ≤ ε.
Why the L1 distance? (L1 is harder than L2.)

For p, q ∈ Δ_n: d_TV(p, q) = ½ ||p − q||_1, so L1 is the natural metric for distributions.
E.g., (1/n, …, 1/n, 1/n, …, 1/n) and (0, …, 0, 2/n, …, 2/n) are two very different
distributions, but their L2 distance is small (1/√n).
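A quick numeric check of this example (illustration only), with n = 10000:

```python
import numpy as np

n = 10000
p = np.full(n, 1.0 / n)                                           # uniform on [n]
q = np.concatenate([np.zeros(n // 2), np.full(n // 2, 2.0 / n)])  # uniform on second half

print(np.abs(p - q).sum() / 2)   # total variation: 0.5 (far apart in L1)
print(np.linalg.norm(p - q))     # L2 distance: 0.01 = 1/sqrt(n) (tiny)
```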
Higher Dimension, arbitrary mixtures

A mixture θ over Δ_n
Assumption: θ is an arbitrary distribution supported on a k-dim slice of Δ_n (again, think k << n)
Our result: using poly(n) 1- and 2-snapshot samples, and (k/ε)^{O(k)} K-snapshot samples
(with K = poly(k, 1/ε)), we can obtain θ̂ s.t. Tran_1(θ, θ̂) ≤ ε.
[Figure: a 2-dim slice of the simplex Δ_4, with vertices (1,0,0,0), (0,1,0,0), (0,0,1,0), (0,0,0,1)]
The Coin Problem
A mixture θ of coins (possibly even continuous). Consider a K-snapshot sample:
Pr[j heads out of K] = E_{x~θ}[ (K choose j) x^j (1 − x)^{K−j} ], the Bernstein moments of θ.
Using samples, we can estimate these quantities.

A simple but useful lemma:
If |E_θ[f] − E_θ̂[f]| ≤ ε for every 1-Lipschitz f (i.e., |f(x) − f(y)| ≤ ||x − y||), then Tran(θ, θ̂) ≤ ε.
Pf based on the dual formulation (Kantorovich & Rubinstein).
To estimate E_θ[f] for a 1-Lip f, approximate f by a degree-K polynomial in the Bernstein basis,
g = Σ_j c_j · (K choose j) x^j (1 − x)^{K−j}, with error ||f − g||_∞ ≤ μ and coefficients |c_j| ≤ C.
Then E_θ[g] is a C-bounded combination of the Bernstein moments, so estimating it to accuracy ε
requires poly(C/ε) samples.

What C and μ can we achieve?
WELL KNOWN in approximation theory (e.g., Rivlin '03): the Bernstein polynomial approximation
(take c_j = f(j/K)) achieves μ = O(1/√K) with C = O(1).
So, with poly(K) K-snapshot samples, Tran = O(1/√K).
Can we do better? Jackson's theorem: every 1-Lip f admits a degree-K polynomial approximation
with μ = O(1/K), built from Chebyshev polynomials. But after a change of basis into the
Bernstein basis, the coefficients can be as large as C = exp(O(K)).
So, with exp(K) K-snapshot samples, Tran = O(1/K).
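An illustration of the coefficient blow-up (ours, not the paper's; we change into the power basis rather than the Bernstein basis, but the same exponential growth occurs): a degree-K Chebyshev interpolant of the 1-Lipschitz function |x − 1/2| has error roughly 1/K, yet its coefficients grow exponentially in K.

```python
import numpy as np
from numpy.polynomial import chebyshev as C

f = lambda t: np.abs(t) / 2      # |x - 1/2| on [0,1] <-> |t|/2 on [-1,1]
grid = np.linspace(-1, 1, 2001)
for K in (8, 16, 32):
    t = np.cos(np.pi * (np.arange(K + 1) + 0.5) / (K + 1))   # Chebyshev nodes
    cheb = C.chebfit(t, f(t), K)        # degree-K Chebyshev interpolant
    power = C.cheb2poly(cheb)           # change of basis -> power basis
    err = np.max(np.abs(C.chebval(grid, cheb) - f(grid)))
    print(K, err, np.max(np.abs(power)))  # error ~ 1/K; max coefficient grows exponentially
```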
Higher Dimension: the algorithm
A mixture θ over Δ_n; θ is a k-spike distribution over a k-dim slice A of Δ_n (k << n).

Outline:
Step 1: Reduce the learning problem from n dimensions to k dimensions
(we don't want the snapshot# to depend on n)
Step 2: Learn the projected mixture in the k-dim subspace
(requires Tran_2 ≤ ε; the snapshot# depends only on k and ε)
Step 3: Project back to Δ_n
Step 1: From n-dim to k-dim

Existing approach: apply SVD/PCA/eigendecomposition to the 2nd-moment matrix, then take
the subspace spanned by the first few eigenvectors.
Does NOT work!
Reason: we want Tran_1(θ, θ̂) ≤ ε (the L1 metric), and L1 is not rotationally invariant.
So it may happen that, in the subspace, the error is acceptable in some directions but
far too large in some other directions.
Implication: in the reduced k-dim learning problem, we would have to be very accurate in
some directions, which is only possible by making the snapshot# depend on n.
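A short illustration (ours) of the non-invariance: a rotation that preserves L2 norms can stretch L1 norms by a factor of √n.

```python
import numpy as np

n = 10000
e1 = np.zeros(n); e1[0] = 1.0
diag = np.ones(n) / np.sqrt(n)   # image of e1 under some rotation
print(np.linalg.norm(e1), np.linalg.norm(diag))   # L2: 1.0, 1.0 (unchanged)
print(np.abs(e1).sum(), np.abs(diag).sum())       # L1: 1.0, 100.0 = sqrt(n)
```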
Step 1: From n-dim to k-dim. What we do:
Find a k'-dim (k' < k) subspace B in which the L1 ball is almost spherical, and such that
the supporting slice A is close to B in the L1 metric.
Step 1: From n-dim to k-dim (sketch)

Preprocess the alphabet (by deleting and splitting letters), then take the first few
(normalized) principal axes of a John ellipsoid; details in the appendix slides below.
Step 2: Learn the projected mixture in the k-dim subspace (sketch)
(1) Project onto a net of 1-dim directions
(2) Learn the 1-d projections (each is an instance of the coin problem)
(3) Assemble the 1-d projections using an LP

This is similar to a question in geometric tomography. The analysis uses Fourier
decomposition and a multidimensional version of Jackson's theorem.
Conclusion
Algorithms for learning mixtures of discrete distributions
No assumptions (on independence, condition number, etc.); worst-case analysis
Tradeoff among snapshot#, transportation distance (Tran), and #samples

Appendix: Transportation distance
Def: Tran(P, Q) = min_T Σ_{x,y} T(x, y) · ||x − y||, where T is a transportation from P to Q
(i.e., Σ_y T(x, y) = P(x), Σ_x T(x, y) = Q(y), T ≥ 0).

The dual formulation (Kantorovich & Rubinstein):
Tran(P, Q) = sup { E_P[f] − E_Q[f] : f 1-Lipschitz, i.e., |f(x) − f(y)| ≤ ||x − y|| }
If P and Q are finitely supported discrete distributions, the above is simply LP duality:
the primal is the transportation LP, the dual optimizes over 1-Lipschitz potentials f.

A simple but useful lemma:
If |E_θ[f] − E_θ̂[f]| ≤ ε for every 1-Lipschitz f, then Tran(θ, θ̂) ≤ ε.
Pf sketch: the bound holds for any 1-Lip function f, so the lemma follows from the dual formulation.
Step 1: From n-dim to k-dim (details)

1. Preprocess the alphabet (by deleting and splitting letters)
2. Consider the resulting moment estimates and the associated polytope (C only depends on k and ε)
3. Compute the John ellipsoid and its axes
4. Take the first few (normalized) principal axes
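The John ellipsoid computation in step 3 is only sketched in the talk. As a rough, generic illustration (Khachiyan's minimum-volume-enclosing-ellipsoid algorithm on an explicit point set, not the paper's construction), one can compute such an ellipsoid and read off its principal axes:

```python
import numpy as np

def mvee(points, tol=1e-4):
    """points: (m, d) array. Returns (center c, matrix A) with
    {x : (x - c)^T A (x - c) <= 1} approximately the min-volume enclosing ellipsoid."""
    m, d = points.shape
    Q = np.vstack([points.T, np.ones(m)])   # lift to homogeneous coordinates
    u = np.full(m, 1.0 / m)                 # weights over the points
    while True:
        X = (Q * u) @ Q.T
        M = np.einsum('ij,ji->i', Q.T, np.linalg.solve(X, Q))  # q_i^T X^{-1} q_i
        j = int(np.argmax(M))
        step = (M[j] - d - 1.0) / ((d + 1.0) * (M[j] - 1.0))
        new_u = (1.0 - step) * u
        new_u[j] += step
        if np.linalg.norm(new_u - u) < tol:
            u = new_u
            break
        u = new_u
    c = points.T @ u
    A = np.linalg.inv((points.T * u) @ points - np.outer(c, c)) / d
    return c, A

pts = np.random.default_rng(2).normal(size=(200, 3))
c, A = mvee(pts)
eigvals, eigvecs = np.linalg.eigh(A)
axes = eigvecs / np.sqrt(eigvals)   # columns: principal axes, scaled by semi-axis length
print(axes.shape)
```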
Step 2: Learn the projected mixture in the k-dim subspace B

For a K-snapshot sample 𝐭 = (t_1, …, t_K) with t_j ∈ [n], let v(𝐭) = (1/K) Σ_{l=1..K} C_{t_l},
where C_i is the image of letter i in B. Suppose we take N samples 𝐭_1, …, 𝐭_N; the learnt
projected measure is the empirical measure over v(𝐭_1), …, v(𝐭_N) (a sum of delta functions,
each with weight 1/N).
[Figure: the points C_1, C_2, …, C_n in the subspace]
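A minimal sketch of this empirical projected measure (the array C of letter images and all sizes are hypothetical, for illustration):

```python
# One point v(t) per K-snapshot; the learnt measure puts weight 1/N on each.
import numpy as np

def empirical_projected_measure(snapshots, C):
    """snapshots: (N, K) int array of letters in [0, n); C: (n, kdim) letter images.
    Returns the (N, kdim) support points of the empirical measure (weights 1/N)."""
    return C[snapshots].mean(axis=1)   # v(t) = (1/K) sum_l C[t_l]

rng = np.random.default_rng(3)
snapshots = rng.integers(0, 50, size=(1000, 20))  # N=1000 snapshots, K=20, n=50
C = rng.normal(size=(50, 2))                      # hypothetical 2-dim letter images
print(empirical_projected_measure(snapshots, C).shape)  # (1000, 2)
```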