Large-Scale Machine Learning
Jean-Philippe Vert jean-philippe.vert@{mines-paristech,curie,ens}.fr
Outline
1. Introduction
2. Standard machine learning
   - Dimension reduction: PCA
   - Clustering: k-means
   - Regression: ridge regression
(Figure: illustration of a supervised machine learning workflow; source: https://www.linkedin.com/pulse/supervised-machine-learning-pega-decisioning-solution-nizam-muhammad)
1. Review a few standard ML techniques
2. Introduce a few ideas and techniques to scale them to modern, big datasets
Unsupervised learning: dimension reduction (e.g. PCA), clustering (e.g. k-means), density estimation, feature learning.
Supervised learning: regression (e.g. OLS, ridge regression), classification (e.g. kNN, logistic regression, SVM), structured output classification.
PCA: the i-th principal direction solves

    u_i ∈ argmax_{‖u‖=1, u⊥{u_1,…,u_{i−1}}}  (1/n) Σ_{j=1}^n ( x_j^⊤u − (1/n) Σ_{l=1}^n x_l^⊤u )²,

i.e., it maximizes the empirical variance of the projected data among unit-norm directions orthogonal to the previous ones. Equivalently, u_1, …, u_k are the top k eigenvectors of the empirical covariance matrix of the data.
(Figure: the Iris dataset projected on the first two principal components, PC1 and PC2, colored by species.)

> data(iris)
> head(iris, 3)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
> m <- princomp(log(iris[,1:4]))
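As a complement (not part of the original slides), the same principal directions can be obtained by an explicit eigendecomposition of the empirical covariance matrix, which connects the princomp call to the formula above; a minimal R sketch:

# Principal directions as eigenvectors of the empirical covariance matrix.
# Matches princomp(log(iris[, 1:4])) up to the sign of each direction.
X <- as.matrix(log(iris[, 1:4]))              # n x d data matrix
Xc <- scale(X, center = TRUE, scale = FALSE)  # center each column
S <- crossprod(Xc) / nrow(Xc)                 # empirical covariance (d x d)
eig <- eigen(S)
U <- eig$vectors[, 1:2]                       # top-2 principal directions
scores <- Xc %*% U                            # coordinates on PC1 and PC2
head(scores, 3)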
(Figures: the Iris data in the PC1/PC2 plane, and the result of k-means clustering with k = 5.)
k-means seeks cluster assignments C ∈ {1,…,k}^n and centroids µ_1,…,µ_k minimizing Σ_{i=1}^n ‖x_i − µ_{C_i}‖². Lloyd's algorithm alternates two steps:

Assignment step (fix µ, optimize C):  ∀i = 1,…,n,  C_i ← argmin_{c∈{1,…,k}} ‖x_i − µ_c‖²
Update step (fix C, optimize µ):  ∀c = 1,…,k,  µ_c ← (1/|C_c|) Σ_{j : C_j = c} x_j,  where C_c is the set of points assigned to cluster c.
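A minimal R sketch of these two steps (illustrative only; in practice one uses the built-in kmeans() shown below, and this naive version does not handle empty clusters):

# Naive Lloyd's algorithm: alternate assignment and centroid update.
lloyd <- function(X, k, iters = 20) {
  mu <- X[sample(nrow(X), k), , drop = FALSE]        # random initial centers
  C <- rep(1, nrow(X))
  for (it in 1:iters) {
    # Assignment step: distance of every point to every center (n x k)
    D <- as.matrix(dist(rbind(mu, X)))[-(1:k), 1:k]
    C <- max.col(-D)                                  # argmin over centers
    # Update step: each center becomes the mean of its assigned points
    for (cl in 1:k) mu[cl, ] <- colMeans(X[C == cl, , drop = FALSE])
  }
  list(cluster = C, centers = mu)
}
res <- lloyd(as.matrix(log(iris[, 1:4])), k = 3)
table(res$cluster, iris$Species)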
(Figure: the Iris data in the PC1/PC2 plane.)

> irisCluster <- kmeans(log(iris[, 1:4]), 3, nstart = 20)
> table(irisCluster$cluster, iris$Species)

    setosa versicolor virginica
  1      0         48         4
  2     50          0         0
  3      0          2        46
(Figures: k-means clustering of the Iris data shown in the PC1/PC2 plane for k = 2, 3, 4 and 5.)
(Figures: a one-dimensional regression dataset, y plotted against x.)
Ridge regression shrinks the least-squares solution by penalizing the squared norm of the coefficients:

    min_β  RSS(β) + λ Σ_{i=1}^d β_i²,   where  RSS(β) = Σ_{i=1}^n (y_i − β^⊤x_i)².
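For illustration (not from the slides), the penalized problem above has a closed-form solution that a short R sketch can compute directly; the data and lambda value below are arbitrary:

# Closed-form ridge solution: beta = (X'X + lambda I)^{-1} X'y
ridge <- function(X, y, lambda) {
  solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y))
}
# Toy usage: predict petal width from the other iris measurements
X <- as.matrix(iris[, 1:3]); y <- iris[, 4]
ridge(X, y, lambda = 1)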
The Bayes risk is R* = inf over measurable f of P(f(X) ≠ Y), and a learning rule f̂_n is consistent if P(f̂_n(X) ≠ Y) → R* as n → +∞.
Ridge regression in matrix form: minimize over β ∈ R^p

    J(β) = (1/n) ‖Y − Xβ‖² + λ ‖β‖²,

and setting the gradient ∇_β J(β) to zero gives β = (X^⊤X + λn I)^{−1} X^⊤Y. Equivalently, the solution can be written β = X^⊤α, where α ∈ R^n maximizes 2α^⊤Y − α^⊤(XX^⊤ + λn I)α, i.e. α = (XX^⊤ + λn I)^{−1} Y. This dual formulation only involves the n × n matrix XX^⊤ of inner products between data points.
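As a sanity check (not from the slides), a short R sketch can verify that the primal and dual expressions above give the same coefficient vector on synthetic data; the sizes and lambda are arbitrary:

# Primal vs dual ridge solutions agree
set.seed(1)
n <- 50; d <- 3; lambda <- 0.1
X <- matrix(rnorm(n * d), n, d); Y <- rnorm(n)
beta_primal <- solve(crossprod(X) + lambda * n * diag(d), crossprod(X, Y))
alpha <- solve(tcrossprod(X) + lambda * n * diag(n), Y)  # uses only the n x n matrix XX'
beta_dual <- t(X) %*% alpha
max(abs(beta_primal - beta_dual))  # ~ 0: both formulations coincide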
(Figure: a one-dimensional dataset, y plotted against x.)
Nonlinear regression via a feature map: map each point x to Φ(x) ∈ R^D (for instance, Φ maps a point (x1, x2) ∈ R² to a vector of nonlinear monomials of its coordinates) and fit a linear model in the new representation:

    min_{β∈R^D}  (1/n) Σ_{i=1}^n (y_i − β^⊤Φ(x_i))² + λ ‖β‖².

The solution lies in the span of the mapped data, β = Σ_{i=1}^n α_i Φ(x_i), so the fitted function can be written f(x) = Σ_{i=1}^n α_i K(x_i, x) with K(x, x') = Φ(x)^⊤Φ(x'). Plugging β = Σ_{i=1}^n α_i Φ(x_i) into the objective, we obtain an equivalent problem over α ∈ R^n that only involves the kernel matrix K = [K(x_i, x_j)]:

    min_{α∈R^n}  (1/n) ‖Y − Kα‖² + λ α^⊤Kα,

solved by α = (K + λn I)^{−1} Y.
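For a concrete (illustrative, not from the slides) instance of this dual formulation, here is a kernel ridge regression sketch in R with a Gaussian RBF kernel; the bandwidth sigma, lambda and the toy data are arbitrary choices:

# Kernel ridge regression with an RBF kernel: alpha = (K + lambda*n*I)^{-1} y
rbf_kernel <- function(A, B, sigma = 1) {
  D2 <- outer(rowSums(A^2), rowSums(B^2), "+") - 2 * A %*% t(B)  # squared distances
  exp(-D2 / (2 * sigma^2))
}
set.seed(1)
x <- matrix(seq(4, 10, length.out = 50), ncol = 1)
y <- sin(x[, 1]) + rnorm(50, sd = 0.2)                 # toy nonlinear data
K <- rbf_kernel(x, x)
lambda <- 0.1
alpha <- solve(K + lambda * nrow(x) * diag(nrow(x)), y)
f_hat <- function(xnew) rbf_kernel(matrix(xnew, ncol = 1), x) %*% alpha
f_hat(c(5, 7, 9))                                      # predictions at new points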
Effect of the regularization parameter: we fit

    min_{β∈R^d}  (1/n) Σ_{i=1}^n (y_i − β^⊤Φ(x_i))² + λ ‖β‖²

on the one-dimensional toy dataset for a range of values of λ.

(Figures: the fitted curve on the y-versus-x data for λ = 1000, 100, 10, 1, 0.1, 0.01, 0.001, 1e-4, 1e-5, 1e-6 and 1e-7; large values of λ give a very smooth, almost flat fit, while very small values give a wiggly fit that follows the noise. The final figure shows the fit for λ = 1.)
- Subsample the data and run a standard method
- Split the data and run on several machines (depends on the algorithm)
Two sources of error:
- approximation error (how well we approximate the true function)
- estimation error (how well we estimate the parameters)
More formally, let R[f] denote the risk and define:
- the Bayes risk  R* = inf over f : R^d → Y of R[f],
- the best risk within the class F:  R_F = min_{f∈F} R[f],
- the empirical risk minimizer  f̂ = argmin_{f∈F} R_n[f],  where  R_n[f] = (1/n) Σ_{i=1}^n ℓ(f(x_i), y_i).

The excess risk then decomposes as

    R[f̂] − R* = (R_F − R*) + (R[f̂] − R_F),

i.e. approximation error plus estimation error.
In practice, F = {f_β : β ∈ B ⊂ R^d} is a parametric family and we minimize the empirical risk over the parameters:

    min_{β∈B⊂R^d}  R_n[f_β] = (1/n) Σ_{i=1}^n ℓ(f_β(x_i), y_i).
Asymptotic comparison of gradient descent (GD), second-order gradient descent (2GD) and stochastic gradient descent (SGD) [Bottou and Bousquet, 2008]:

    Algorithm   Cost of one    Iterations         Time to reach            Time to reach
                iteration      to reach ρ         accuracy ρ               E ≤ c(E_app + ε)
    GD          O(nd)          O(κ log(1/ρ))      O(ndκ log(1/ρ))          O(d²κ/ε^{1/α} · log²(1/ε))
    2GD         O(d(d+n))      O(log log(1/ρ))    O(d(d+n) log log(1/ρ))   O(d²/ε^{1/α} · log(1/ε) log log(1/ε))
    SGD         O(d)           νκ²/ρ + o(1/ρ)     O(dνκ²/ρ)                O(dνκ²/ε)

For SGD, the optimization speed is catastrophic, but the learning speed (time to reach a given excess risk) is the best, and it is independent of α.
Source: https://bigdata2013.sciencesconf.org/conference/bigdata2013/pages/bottou.pdf
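To make the SGD row concrete, here is a minimal R sketch (illustrative only; the step-size schedule, lambda and epoch count are arbitrary choices, not from the slides) of stochastic gradient descent on the ridge-penalized least-squares objective used earlier:

# SGD on f(beta) = mean of 0.5*(x_i'beta - y_i)^2 + 0.5*lambda*||beta||^2
sgd_ridge <- function(X, y, lambda = 0.01, epochs = 5, eta0 = 0.1) {
  n <- nrow(X); d <- ncol(X)
  beta <- rep(0, d); t <- 0
  for (e in 1:epochs) {
    for (i in sample(n)) {                      # one pass over shuffled data
      t <- t + 1
      eta <- eta0 / (1 + eta0 * lambda * t)     # decreasing step size
      grad <- (sum(X[i, ] * beta) - y[i]) * X[i, ] + lambda * beta
      beta <- beta - eta * grad                 # O(d) cost per update
    }
  }
  beta
}
beta_sgd <- sgd_ridge(scale(as.matrix(iris[, 1:3])), iris[, 4])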
(Figures: histograms of ‖x‖/√d for random vectors in dimension d = 1, 10 and 100; as the dimension grows, the normalized norm concentrates sharply around a constant value.)
Random projections: draw a matrix R ∈ R^{k×d} with i.i.d. N(0,1) entries and set f(u) = Ru/√k. For each row r_j of R, E[(r_j^⊤u)²] = ‖u‖², so that

    E ‖f(u)‖² = (1/k) Σ_{j=1}^k E[(r_j^⊤u)²] = ‖u‖²,

and ‖f(u)‖² concentrates around ‖u‖² when k grows.
A Chernoff-type argument makes this quantitative: for 0 < ε < 1/2, P( |‖f(u)‖² − ‖u‖²| > ε‖u‖² ) ≤ 2 e^{−(ε²−ε³)k/4}, which is the concentration bound behind the Johnson-Lindenstrauss lemma.
In practice, cheaper random matrices work as well:
- "Add or subtract" [Achlioptas, 2003]: R_ij = +1 with probability 1/2 and −1 with probability 1/2; only 1 bit per entry (size ≈ 1.25 GB).
- Fast Johnson-Lindenstrauss transform [Ailon and Chazelle, 2009]: R = PHD, where D is a random diagonal sign matrix, H a Hadamard matrix and P a sparse random matrix, so f(x) can be computed in O(d log d).
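A small R sketch (illustrative; the dimensions and number of points are arbitrary) comparing a Gaussian projection with the ±1 "add or subtract" projection, and checking that pairwise distances are roughly preserved:

# Project n points from dimension d down to k with two random matrices
set.seed(1)
d <- 10000; k <- 200; n <- 100
X <- matrix(rnorm(n * d), n, d)
R_gauss <- matrix(rnorm(k * d), k, d)
R_sign  <- matrix(sample(c(-1, 1), k * d, replace = TRUE), k, d)
Y_gauss <- X %*% t(R_gauss) / sqrt(k)          # n x k projected data
Y_sign  <- X %*% t(R_sign)  / sqrt(k)
# Ratios of projected to original pairwise distances stay close to 1
summary(as.vector(dist(Y_gauss)) / as.vector(dist(X)))
summary(as.vector(dist(Y_sign))  / as.vector(dist(X)))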
(Diagram: data in R^d, a kernel feature map Φ into a high-dimensional space R^D, and a Johnson-Lindenstrauss random projection down to R^k; can we get the benefit of both directly with random features?)
Random Fourier features [Rahimi and Recht, 2008]. For a translation-invariant kernel K(x, x') = k(x − x') (e.g. the Gaussian kernel e^{−‖x−x'‖²/(2σ²)}), Bochner's theorem provides a probability measure µ on R^d such that

    K(x, x') = E_{ω∼µ, b∼U([0,2π])} [ 2 cos(ω^⊤x + b) cos(ω^⊤x' + b) ].

Sampling ω_1, …, ω_D ∼ µ(R^d) and b_1, …, b_D ∼ U([0, 2π]) and defining the random features

    Φ_i(x) = √(2/D) cos(ω_i^⊤x + b_i),   i = 1, …, D,

gives Φ(x)^⊤Φ(x') ≈ K(x, x'), so this construction approximates any t.i. (translation-invariant) kernel with an explicit, low-dimensional feature map. Examples of kernel / spectral density pairs: the Gaussian kernel with a Gaussian density, the Laplacian kernel e^{−‖x−x'‖_1} with the Cauchy density Π_{i=1}^d 1/(π(1+ω_i²)), and the Cauchy kernel Π_{i=1}^d 2/(1+(x_i−x'_i)²) with a density proportional to e^{−‖ω‖_1}.
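An illustrative R sketch (not from the slides) of random Fourier features for the Gaussian kernel, checking that the inner products of the random features approximate the exact kernel matrix; D, sigma and the data are arbitrary choices:

# Random Fourier features for K(x, x') = exp(-||x - x'||^2 / (2 sigma^2)),
# whose spectral density is N(0, I / sigma^2)
rff <- function(X, D = 500, sigma = 1) {
  d <- ncol(X)
  omega <- matrix(rnorm(D * d, sd = 1 / sigma), D, d)
  b <- runif(D, 0, 2 * pi)
  sqrt(2 / D) * cos(X %*% t(omega) + matrix(b, nrow(X), D, byrow = TRUE))
}
set.seed(1)
X <- matrix(rnorm(20 * 5), 20, 5)
Z <- rff(X, D = 2000)
K_approx <- Z %*% t(Z)                      # Phi(x)' Phi(x')
K_exact  <- exp(-as.matrix(dist(X))^2 / 2)  # exact RBF kernel (sigma = 1)
max(abs(K_approx - K_exact))                # small for large D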
Nearest-neighbor search: given a query q and a large database S of items (documents, images, videos, ...), retrieve argmin_{x∈S} ‖q − x‖.
1. Tree approaches
   - Recursively partition the data: divide and conquer
   - Expected query time: O(log n)
   - Many variants: KD-tree, Ball tree, PCA-tree, Vantage Point tree
   - Shown to perform very well on relatively low-dimensional data
2. Hashing approaches
   - Each image in the database is represented as a code
   - Significant reduction in storage
   - Expected query time: O(1) or O(n)
   - Compact codes preferred
(Figures: space-partitioning trees for nearest-neighbor search, each node splitting the points into left and right children: the VP-tree splits at the median distance to a vantage point, the Ball tree stores the vectors per node, the PCA tree splits along the top eigenvector, and the Random-Projection tree splits along a random direction.)
(Figure: binary hashing: database points x1, …, x5 are mapped by hash functions h1, …, hm to short binary codes such as 010, 100, 111, 001, 110.)

1. Choose a set of binary hashing functions to design a binary code
2. Index the database = compute codes for all points
3. Querying: compute the code of the query, and retrieve the points whose codes match (or are close to) it
(Image source: https://en.wikipedia.org/wiki/Hash_function)
(Figures: the locality-sensitive hashing idea: points that are close are likely to receive the same hash value, distant points are unlikely to; with random hyperplane hashing, each bit is determined by which side of a random hyperplane the point falls on, i.e. the sign of a random projection.)
(Figures: the LSH index: L hash tables, each addressed by an m-bit code, whose buckets store pointers to the database items; a query is looked up in every table and the retrieved candidates are merged.)

- Large m increases precision but decreases recall
- Large L increases recall but also storage
- Optimization is possible to minimize run-time for a given application
LSH for Euclidean distances uses projections onto random directions: hash functions of the form h_k(x) = ⌊(w_k^⊤x + b_k)/r⌋, where the d entries of w_k are drawn i.i.d. from a p-stable distribution. The Gaussian N(0, 1) is 2-stable (Euclidean distance); the Cauchy distribution, with density dx/(π(1+x²)), is 1-stable (ℓ1 distance).
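A minimal R sketch of such hash functions (illustrative; the bucket width r, the number of hash functions m and the data are arbitrary choices):

# m hash values per point: floor((w_k' x + b_k) / r) with Gaussian (2-stable) w_k
lsh_code <- function(X, W, b, r) {
  floor((X %*% W + matrix(b, nrow(X), ncol(W), byrow = TRUE)) / r)
}
set.seed(1)
d <- 20; m <- 8; r <- 1
W <- matrix(rnorm(d * m), d, m)                 # Gaussian entries: 2-stable
b <- runif(m, 0, r)
X <- matrix(rnorm(1000 * d), 1000, d)
codes <- lsh_code(X, W, b, r)
q <- X[1, , drop = FALSE] + 0.01 * rnorm(d)     # a query very close to X[1, ]
sum(lsh_code(q, W, b, r) == codes[1, ])         # most of the m hash values collide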
Two refinements of minwise hashing:
- b-bit minwise hashing [Li and König]: keep only the b lowest bits of each of the k minhash values, giving very compact codes (an explicit feature representation of dimension 2^b k).
- One-permutation hashing [Li et al., 2012]: use a single permutation, and keep the smallest index in each consecutive block of size k.

(Figure: one-permutation hashing illustrated on three permuted sets π(S1), π(S2), π(S3) over 15 elements split into 4 consecutive blocks.)

Note: this shows in particular that the resemblance (Jaccard similarity) is positive definite.
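For context, a minimal R sketch of plain minwise hashing (illustrative; the universe size, set sizes and k are arbitrary), estimating the resemblance of two sets as the fraction of random permutations on which their minima agree:

# Minhash signature: for each permutation, the smallest permuted label in the set
minhash <- function(S, perms) apply(perms, 2, function(p) min(p[S]))
set.seed(1)
d <- 100; k <- 200
perms <- replicate(k, sample(d))                 # k random permutations of 1..d
S1 <- sample(d, 30)
S2 <- unique(c(sample(S1, 20), sample(d, 10)))   # a set overlapping S1
sig1 <- minhash(S1, perms); sig2 <- minhash(S2, perms)
mean(sig1 == sig2)                               # ~ Jaccard similarity of S1 and S2
length(intersect(S1, S2)) / length(union(S1, S2))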
Feature hashing (the hashing trick). Let h : [1, d] → [1, k] be a hash function. For x ∈ R^d (or {0, 1}^d), define Φ(x) ∈ R^k by

    Φ_i(x) = Σ_{j : h(j) = i} x_j,   for i = 1, …, k,

i.e., accumulate the coordinates j of x for which h(j) is the same. Repeat L times and concatenate if needed, to limit the effect of collisions.

- No memory needed for projections (vs. LSH)
- No need for a dictionary (just a hash function that can hash anything)
- Sparsity preserving
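A minimal R sketch of this construction (illustrative; d, k and the input are arbitrary, and a random assignment plays the role of the hash function h):

# Sum the coordinates of x that hash to the same bucket
feature_hash <- function(x, h, k) {
  phi <- numeric(k)
  for (j in seq_along(x)) phi[h[j]] <- phi[h[j]] + x[j]
  phi
}
set.seed(1)
d <- 10000; k <- 256
h <- sample(k, d, replace = TRUE)        # plays the role of h: [1, d] -> [1, k]
x <- numeric(d); x[sample(d, 20)] <- 1   # a sparse binary input vector
phi <- feature_hash(x, h, k)
sum(phi != 0)                            # at most 20 non-zeros: sparsity preserved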
In summary, the main ideas for scaling up:
- Optimization by SGD
- Random projections, sketching
References

D. Achlioptas. Database-friendly random projections: Johnson-Lindenstrauss with binary coins. Journal of Computer and System Sciences, 2003. URL http://dx.doi.org/10.1016/S0022-0000(03)00025-4.

N. Ailon and B. Chazelle. The fast Johnson-Lindenstrauss transform and approximate nearest neighbors. SIAM Journal on Computing, 2009. URL http://dx.doi.org/10.1137/060673096.

B. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the 5th Annual Workshop on Computational Learning Theory, pages 144–152, New York, NY, USA, 1992. ACM Press. URL http://www.clopinet.com/isabelle/Papers/colt92.ps.Z.

L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems 20, pages 161–168. Curran Associates, Inc., 2008. URL http://papers.nips.cc/paper/3323-the-tradeoffs-of-large-scale-learning.pdf.

A. Z. Broder. On the resemblance and containment of documents. In Proceedings of Compression and Complexity of Sequences, pages 21–29, 1997. doi: 10.1109/SEQUEN.1997.666900. URL http://dx.doi.org/10.1109/SEQUEN.1997.666900.

forest problems. SIAM J. Comput., 24(2):296–317, April 1995. doi: 10.1137/S0097539793242618. URL http://dx.doi.org/10.1137/S0097539793242618.

W. B. Johnson and J. Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26:189–206, 1984. URL http://dx.doi.org/10.1090/conm/026/737400.

41(1):191–201, 1992. URL http://www.jstor.org/stable/2347628.

P. Li, A. Owen, and C.-H. Zhang. One permutation hashing. In Advances in Neural Information Processing Systems 25, pages 3113–3121. Curran Associates, Inc., 2012. URL http://papers.nips.cc/paper/4778-one-permutation-hashing.pdf.

A. Rahimi and B. Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems 20, pages 1177–1184. Curran Associates, Inc., 2008. URL http://papers.nips.cc/paper/3182-random-features-for-large-scale-kernel-machines.pdf.

Q. Shi, J. Petterson, G. Dror, J. Langford, A. Smola, and S. V. N. Vishwanathan. Hash kernels for structured data. Journal of Machine Learning Research, 10:2615–2637, 2009.

C. J. Stone. Consistent nonparametric regression. The Annals of Statistics, 5(4):595–620, 1977. URL http://links.jstor.org/sici?sici=0090-5364%28197707%295%3A4%3C595%3ACNR%3E2.0.CO%3B2-O.