Large-Scale Machine Learning
- I. Scalability issues
Jean-Philippe Vert jean-philippe.vert@{mines-paristech,curie,ens}.fr
Outline

1 Introduction
2 Standard machine learning
  Dimension reduction: PCA
  Clustering: k-means
  Regression: OLS, ridge regression
  Classification: kNN, logistic regression, SVM
[Figure: supervised machine learning overview. Source: https://www.linkedin.com/pulse/supervised-machine-learning-pega-decisioning-solution-nizam-muhammad]
Goals:
1 Review a few standard ML techniques
2 Introduce a few ideas and techniques to scale them to modern, big data
Unsupervised learning: dimension reduction, clustering, density estimation, feature learning
Supervised learning: regression, classification, structured output classification
Unsupervised learning:
  Dimension reduction: PCA
  Clustering: k-means
  Density estimation
  Feature learning
Supervised learning:
  Regression: OLS, ridge regression
  Classification: kNN, logistic regression, SVM
  Structured output classification
PCA: the i-th principal direction maximizes the variance of the projected data among directions orthogonal to the previous ones:

u_i = argmax_{‖u‖=1, u⊥{u_1,…,u_{i−1}}} (1/n) Σ_{j=1}^{n} (u⊤x_j)²
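A minimal R sketch of this formulation, assuming centered data: the successive maximizers u_1, u_2, … are the eigenvectors of the sample covariance matrix, so PCA reduces to an eigendecomposition (illustrated here on the Iris data used below).

# Principal directions as eigenvectors of the sample covariance matrix
# (equivalent, up to signs and the 1/n vs 1/(n-1) divisor, to princomp).
data(iris)
X <- scale(log(iris[, 1:4]), center = TRUE, scale = FALSE)  # center the data
e <- eigen(cov(X))       # eigenvectors sorted by decreasing explained variance
U <- e$vectors           # columns are the principal directions u_1, u_2, ...
scores <- X %*% U        # coordinates of each sample on the principal axes
head(scores[, 1:2])      # PC1 and PC2, as in the plot below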
[Figure: Iris data projected on the first two principal components (PC1, PC2), colored by species (setosa, versicolor, virginica).]
> data(iris)
> head(iris, 3)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
> m <- princomp(log(iris[,1:4]))
[Figure: Iris data in the (PC1, PC2) plane.]

[Figure: k-means clustering of the Iris data with k = 5 in the (PC1, PC2) plane, points colored by cluster.]
k-means alternates two steps to minimize Σ_{i=1}^{n} ‖x_i − µ_{C_i}‖² over assignments C and centers µ (see the R sketch below):

1 Assignment step: fix µ, optimize C:
  ∀i = 1, …, n,  C_i ← argmin_{c∈{1,…,k}} ‖x_i − µ_c‖²
2 Update step: fix C, optimize µ:
  ∀i = 1, …, k,  µ_i ← (1/|C_i|) Σ_{j : C_j = i} x_j
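A compact R sketch of these two steps (Lloyd's algorithm); a minimal version that assumes no cluster ever becomes empty:

kmeans_lloyd <- function(X, k, iters = 20) {
  X <- as.matrix(X)
  mu <- X[sample(nrow(X), k), , drop = FALSE]        # random initial centers
  for (it in seq_len(iters)) {
    # assignment step: send each point to its nearest center
    d <- as.matrix(dist(rbind(mu, X)))[-(1:k), 1:k]  # n x k point-center distances
    C <- max.col(-d)                                 # index of the closest center
    # update step: move each center to the mean of its cluster
    for (c in seq_len(k))
      mu[c, ] <- colMeans(X[C == c, , drop = FALSE])
  }
  list(cluster = C, centers = mu)
}
# e.g.: kmeans_lloyd(log(iris[, 1:4]), k = 3)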
[Figure: Iris data in the (PC1, PC2) plane.]
> irisCluster <- kmeans(log(iris[, 1:4]), 3, nstart = 20)
> table(irisCluster$cluster, iris$Species)

    setosa versicolor virginica
  1      0         48         4
  2     50          0         0
  3      0          2        46
[Figures: k-means clustering of the Iris data for k = 2, 3, 4, 5, shown in the (PC1, PC2) plane.]
[Figure: a one-dimensional dataset (x, y) and its ordinary least squares fit.]
Ridge regression shrinks the coefficients by penalizing their squared norm:

β̂_ridge = argmin_β RSS(β) + λ Σ_{i=1}^{d} β_i²

where RSS(β) = Σ_{i=1}^{n} (y_i − β⊤x_i)² is the residual sum of squares.
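Since the objective is quadratic, the minimizer has a closed form, β̂ = (X⊤X + λI)⁻¹X⊤y (assuming centered data, no intercept). A minimal R sketch on simulated data:

ridge <- function(X, y, lambda) {
  X <- as.matrix(X)
  # closed-form ridge solution: (X'X + lambda I)^{-1} X'y
  solve(t(X) %*% X + lambda * diag(ncol(X)), t(X) %*% y)
}
set.seed(1)
X <- matrix(rnorm(100 * 5), 100, 5)        # simulated design
y <- X %*% c(1, -1, 0, 0, 2) + rnorm(100)  # simulated response
beta_hat <- ridge(X, y, lambda = 0.1)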
More generally, many supervised methods solve a penalized empirical risk minimization problem:

min_{β∈R^p} { R(β) + λ Ω(β) }

where R(β) = (1/n) Σ_{i=1}^{n} ℓ(y_i, β⊤x_i) is the empirical risk, Ω(β) a penalty (e.g., Ω(β) = Σ_{i=1}^{p} β_i²), and λ ≥ 0 a regularization parameter.
The choice of λ governs the trade-off between fitting the data and controlling model complexity, i.e., the bias-variance trade-off illustrated below.
[Figure from Hastie et al., 2001.]
[Figure: training and test error as a function of model complexity; low complexity gives high bias / low variance, high complexity gives low bias / high variance.]
K-fold cross-validation:
1 Randomly divide the training set (of size n) into K (almost) equal portions
2 For each portion, fit the model with different parameters on the other K − 1 portions and assess its performance on the held-out portion
3 Average performance over the K groups, and take the parameter with the best average performance (sketched in R below)
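A minimal R sketch of this procedure, here selecting the ridge parameter λ with squared error as the performance measure (reusing the ridge() helper sketched earlier):

cv_ridge <- function(X, y, lambdas, K = 5) {
  folds <- sample(rep(1:K, length.out = nrow(X)))  # random K-fold split
  avg_err <- sapply(lambdas, function(lambda)
    mean(sapply(1:K, function(k) {
      beta <- ridge(X[folds != k, ], y[folds != k], lambda)  # fit on K-1 folds
      mean((y[folds == k] - X[folds == k, ] %*% beta)^2)     # held-out error
    })))
  lambdas[which.min(avg_err)]  # parameter with the best average performance
}
# e.g.: cv_ridge(X, y, lambdas = 10^seq(-3, 3), K = 5)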
A classifier f̂_n is consistent if its risk converges, as n → +∞, to the Bayes risk, the best achievable error over all measurable functions:

lim_{n→+∞} P(f̂_n(X) ≠ Y) = R* = inf_{f measurable} P(f(X) ≠ Y)
Empirical risk minimization with the 0/1 loss and ℓ2 regularization:

min_{β∈R^p} (1/n) Σ_{i=1}^{n} 1(y_i β⊤x_i < 0) + λ‖β‖²

The problem is non-smooth, and typically NP-hard to solve
The regularization has no effect, since the 0/1 loss is invariant by scaling of β
In fact, no function achieves the minimum when λ > 0 (why?)
[Figure: the logistic function σ(u) = 1/(1 + e^{−u}) and σ(−u).]
Regularized logistic regression:

min_{β∈R^p} J(β) = (1/n) Σ_{i=1}^{n} log(1 + e^{−y_i β⊤x_i}) + λ‖β‖²
J is smooth and convex, with gradient and Hessian:

∇_β J(β) = −(1/n) Σ_{i=1}^{n} y_i x_i / (1 + e^{y_i β⊤x_i}) + 2λβ

∇²_β J(β) = (1/n) Σ_{i=1}^{n} [e^{y_i β⊤x_i} / (1 + e^{y_i β⊤x_i})²] x_i x_i⊤ + 2λI
There is no closed-form minimizer of J(β), but since J is smooth and convex it can be minimized numerically, e.g., by Newton iterations (a small R sketch follows):

β_new ← β_old − [∇²_β J(β_old)]⁻¹ ∇_β J(β_old)
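A minimal R sketch of this Newton iteration, implementing the gradient and Hessian above (labels y_i ∈ {−1, +1}; a fixed number of iterations stands in for a proper stopping rule):

newton_logreg <- function(X, y, lambda, iters = 10) {
  X <- as.matrix(X)
  n <- nrow(X); p <- ncol(X)
  beta <- rep(0, p)
  for (it in seq_len(iters)) {
    m <- as.vector(y * (X %*% beta))          # margins y_i beta'x_i
    g <- -t(X) %*% (y / (1 + exp(m))) / n + 2 * lambda * beta  # gradient
    w <- 1 / ((1 + exp(m)) * (1 + exp(-m)))   # = e^m / (1 + e^m)^2, stably
    H <- t(X * w) %*% X / n + 2 * lambda * diag(p)             # Hessian
    beta <- as.vector(beta - solve(H, g))     # Newton step
  }
  beta
}
# e.g.: X <- matrix(rnorm(400), 200, 2); y <- sign(X[, 1] + 0.5 * rnorm(200))
#       beta <- newton_logreg(X, y, lambda = 0.1)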
ϕ convex means we need to solve a convex optimization problem; a "good" ϕ may be one which allows for fast optimization
Most ϕ lead to consistent estimators (see next slides); some may be more efficient
The target is the Bayes risk R* = inf_{f measurable} R(f). In practice, we minimize the empirical ϕ-risk:

R̂_ϕ(f) = (1/n) Σ_{i=1}^{n} ϕ(y_i f(x_i))
Write R*_ϕ = inf_{g measurable} R_ϕ(g) and R* = inf_{g measurable} R(g). Then minimizing the ϕ-risk is a valid proxy for classification: if the excess ϕ-risk R_ϕ(f) − R*_ϕ is small, then the excess risk R(f) − R* is small.
Proof sketch, writing η(x) = P(Y = 1 | X = x):

R_ϕ(f | X = x) = E[ϕ(Y f(X)) | X = x] = η(x) ϕ(f(x)) + (1 − η(x)) ϕ(−f(x))
R_ϕ(−f | X = x) = E[ϕ(−Y f(X)) | X = x] = η(x) ϕ(−f(x)) + (1 − η(x)) ϕ(f(x))

Therefore:

R_ϕ(f | X = x) − R_ϕ(−f | X = x) = [2η(x) − 1] × [ϕ(f(x)) − ϕ(−f(x))]

This must be a.s. ≤ 0 because R_ϕ(f) ≤ R_ϕ(−f), which implies:
  if η(x) > 1/2, ϕ(f(x)) ≤ ϕ(−f(x)) ⟹ f(x) ≥ 0
  if η(x) < 1/2, ϕ(f(x)) ≥ ϕ(−f(x)) ⟹ f(x) ≤ 0

These inequalities are in fact strict thanks to the assumptions we made on ϕ (left as an exercise).
For ridge regression, min_{β∈R^p} (1/n) Σ_{i=1}^{n} (y_i − β⊤x_i)² + λ‖β‖², the solution can be written β = X⊤α, where α solves the dual problem

max_{α∈R^n} 2α⊤Y − α⊤(XX⊤ + λnI)α

which involves the data only through the n × n matrix XX⊤.
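A quick numerical check in R that the dual solution recovers the primal one (the λnI scaling in the dual is my reconstruction of the garbled formula, so treat it as an assumption):

set.seed(2)
n <- 50; p <- 3; lambda <- 0.5
X <- matrix(rnorm(n * p), n, p)
Y <- rnorm(n)
# primal ridge solution: (X'X + lambda*n*I)^{-1} X'Y
beta_primal <- solve(t(X) %*% X + lambda * n * diag(p), t(X) %*% Y)
# dual solution: alpha maximizes 2*a'Y - a'(XX' + lambda*n*I)a
alpha <- solve(X %*% t(X) + lambda * n * diag(n), Y)
beta_dual <- t(X) %*% alpha
max(abs(beta_primal - beta_dual))  # ~ 1e-15: the two coincide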
[Figure: a one-dimensional dataset (x, y).]
For x = (x1, x2) ∈ R², consider the mapping Φ: R² → R³, Φ(x) = (x1², √2 x1x2, x2²). For any x, x′ ∈ R²:

Φ(x)⊤Φ(x′) = x1²x1′² + 2 x1x2 x1′x2′ + x2²x2′² = (x1x1′ + x2x2′)² = (x⊤x′)²
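A three-line R check of this identity on random points, with Φ written out explicitly:

Phi <- function(x) c(x[1]^2, sqrt(2) * x[1] * x[2], x[2]^2)  # explicit map to R^3
x <- rnorm(2); xp <- rnorm(2)
sum(Phi(x) * Phi(xp)) - sum(x * xp)^2   # ~ 0: <Phi(x), Phi(x')> = (x'x)^2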
The solution of the regularized problem can be searched in the span of the mapped training points:

β = Σ_{i=1}^{n} α_i Φ(x_i)
Equivalently, we learn a function of the form f(x) = β⊤Φ(x) = Σ_{i=1}^{n} α_i K(x_i, x) by fitting a linear model over β ∈ R^D in the feature space, where K(x, x′) = Φ(x)⊤Φ(x′).
Indeed, for β = Σ_{i=1}^{n} α_i Φ(x_i) we have β⊤Φ(x_j) = Σ_{i=1}^{n} α_i K(x_i, x_j) and ‖β‖² = α⊤Kα, so the problem becomes an optimization over α ∈ R^n that involves the data only through the n × n kernel matrix K.
[Figures: regularized fits of the one-dimensional dataset for decreasing regularization, λ = 1000, 100, 10, 1, 0.1, 0.01, 10⁻³, 10⁻⁴, 10⁻⁵, 10⁻⁶, 10⁻⁷: large λ over-smooths the data (underfitting), small λ gives an increasingly wiggly fit (overfitting).]
[Figure: the fit with λ = 1.]
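A minimal kernel ridge regression sketch in R that can produce fits like the figures above; the Gaussian RBF kernel, its bandwidth, and the toy data are my choices for illustration, not taken from the slides:

rbf <- function(A, B, sigma = 1) {
  # Gaussian kernel matrix between the rows of A and the rows of B
  d2 <- outer(rowSums(A^2), rowSums(B^2), "+") - 2 * A %*% t(B)
  exp(-d2 / (2 * sigma^2))
}
krr_fit <- function(x, y, lambda, sigma = 1) {
  X <- as.matrix(x)
  K <- rbf(X, X, sigma)
  alpha <- solve(K + lambda * nrow(X) * diag(nrow(X)), y)  # dual coefficients
  function(xnew) as.vector(rbf(as.matrix(xnew), X, sigma) %*% alpha)
}
# toy 1-d data and a fit, as in the figures:
x <- runif(50, 4, 10); y <- sin(x) + 0.2 * rnorm(50)
f <- krr_fit(x, y, lambda = 1, sigma = 1)
grid <- seq(4, 10, length.out = 200)
plot(x, y); lines(grid, f(grid))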
References

D. Achlioptas. Database-friendly random projections: Johnson-Lindenstrauss with binary coins. Journal of Computer and System Sciences, 66(4):671-687, 2003. URL http://dx.doi.org/10.1016/S0022-0000(03)00025-4.
N. Ailon and B. Chazelle. The fast Johnson-Lindenstrauss transform and approximate nearest neighbors. SIAM Journal on Computing, 39(1):302-322, 2009. URL http://dx.doi.org/10.1137/060673096.
B. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory, pages 144-152, New York, NY, USA, 1992. ACM Press. URL http://www.clopinet.com/isabelle/Papers/colt92.ps.Z.
L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems 20, pages 161-168. Curran Associates, Inc., 2008. URL http://papers.nips.cc/paper/3323-the-tradeoffs-of-large-scale-learning.pdf.
A. Z. Broder. On the resemblance and containment of documents. In Proceedings of the Compression and Complexity of Sequences, pages 21-29, 1997. doi: 10.1109/SEQUEN.1997.666900. URL http://dx.doi.org/10.1109/SEQUEN.1997.666900.
forest problems. SIAM J. Comput., 24(2):296–317, apr 1995. doi: 10.1137/S0097539793242618. URL http://dx.doi.org/10.1137/S0097539793242618.
T. Hastie, R. Tibshirani, and J. Friedman. The elements of statistical learning: data mining, inference, and prediction. Springer, 2001.
W. B. Johnson and J. Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. In Conference in Modern Analysis and Probability, volume 26 of Contemporary Mathematics, pages 189-206. American Mathematical Society, 1984. URL http://dx.doi.org/10.1090/conm/026/737400.
S. le Cessie and J. C. van Houwelingen. Ridge estimators in logistic regression. Applied Statistics, 41(1):191-201, 1992. URL http://www.jstor.org/stable/2347628.
P. Li, A. B. Owen, and C.-H. Zhang. One permutation hashing. In Advances in Neural Information Processing Systems 25, pages 3113-3121. Curran Associates, Inc., 2012. URL http://papers.nips.cc/paper/4778-one-permutation-hashing.pdf.
A. Rahimi and B. Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems 20, pages 1177-1184. Curran Associates, Inc., 2008. URL http://papers.nips.cc/paper/3182-random-features-for-large-scale-kernel-machines.pdf.
Q. Shi, J. Petterson, G. Dror, J. Langford, A. Smola, and S. V. N. Vishwanathan. Hash kernels for structured data. Journal of Machine Learning Research, 10:2615-2637, 2009.
C. J. Stone. Consistent nonparametric regression. The Annals of Statistics, 5(4), 1977. URL http://links.jstor.org/sici?sici=0090-5364%28197707%295%3A4%3C595%3ACNR%3E2.0.CO%3B2-O.