CS425: Algorithms for Web Scale Data
Most of the slides are from the Mining of Massive Datasets book. These slides have been modified for CS425. The original slides can be accessed at: www.mmds.org
The Netflix Prize

- Training data
  - 100 million ratings: 480,000 users, 17,700 movies
- Test data
  - Last few ratings of each user are withheld (shown as "?" in the test set)
  - Evaluation criterion: Root Mean Square Error (RMSE):

    RMSE = √( (1/|R|) Σ_{(i,x)∈R} (r̂_xi − r_xi)² )

    where r̂_xi is the predicted rating and r_xi is the true rating of user x on item i
- Competition
  - $1 million grand prize for a 10% improvement over Netflix's system

[Figure: the users × movies utility matrix (480,000 users × 17,700 movies). The training data set contains the known ratings; the test data set holds out entries, marked "?".]
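As a concrete illustration of the evaluation criterion, a minimal Python sketch of RMSE over a handful of (predicted, true) rating pairs; the numbers are made up:

import numpy as np

r_hat  = np.array([3.8, 2.1, 4.5])   # predicted ratings (illustrative)
r_true = np.array([4.0, 2.0, 5.0])   # withheld true ratings (illustrative)
rmse = np.sqrt(np.mean((r_hat - r_true) ** 2))
print(rmse)  # about 0.32 for these made-up numbers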
BellKor Recommender System

- The winner of the Netflix Challenge!
- Multi-scale modeling of the data; combines:
  - Global effects
  - Factorization
  - Collaborative filtering
- Global:
  - Baseline estimation: Joe will rate The Sixth Sense 4 stars
- Local neighborhood (CF/NN):
  - Correcting the baseline with the ratings of similar movies refines the estimate: Joe will rate The Sixth Sense 3.8 stars
Collaborative Filtering

- Earliest and most popular collaborative filtering method
- Derive unknown ratings from those of "similar" movies (item-item variant)
- Define a similarity measure s_ij of items i and j
- Select the k nearest neighbors and compute the rating as a similarity-weighted average:

  r̂_xi = ( Σ_{j∈N(i;x)} s_ij · r_xj ) / ( Σ_{j∈N(i;x)} s_ij )

  s_ij … similarity of items i and j
  r_xj … rating of user x on item j
  N(i;x) … set of items similar to item i that were rated by x
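A small sketch of this weighted average; the similarity values and neighbor ratings below are made-up numbers, purely for illustration:

import numpy as np

# Hypothetical data: similarities s_ij between item i and its neighbors
# j in N(i;x), and user x's ratings r_xj of those same neighbors.
s = np.array([0.9, 0.6, 0.3])   # s_ij
r = np.array([4.0, 3.0, 5.0])   # r_xj

r_hat_xi = (s * r).sum() / s.sum()   # similarity-weighted average
print(r_hat_xi)  # ~3.83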
- In practice we get better estimates if we model deviations from a baseline:

  r̂_xi = b_xi + ( Σ_{j∈N(i;x)} s_ij · (r_xj − b_xj) ) / ( Σ_{j∈N(i;x)} s_ij )

  where b_xi = μ + b_x + b_i is the baseline estimate for r_xi, with
  μ = overall mean rating
  b_x = rating deviation of user x = (avg. rating of user x) − μ
  b_i = (avg. rating of movie i) − μ

- Problems/Issues:
  1) Similarity measures are "arbitrary"
  2) Pairwise similarities neglect interdependencies among users
  3) Taking a weighted average can be restricting
- Solution: instead of s_ij, use weights w_ij that we estimate directly from the data
Idea: Interpolation Weights

- Use a weighted sum rather than a weighted average:

  r̂_xi = b_xi + Σ_{j∈N(i;x)} w_ij · (r_xj − b_xj)

- A few notes:
  - N(i;x) … set of movies rated by user x that are similar to movie i
  - w_ij is an interpolation weight (some real number); the weights need not sum to 1
  - w_ij models the interaction between pairs of movies (it does not depend on user x)
- How to set w_ij? Remember, the error metric is

  (1/|R|) Σ_{(i,x)∈R} (r̂_xi − r_xi)²

  or, equivalently, the sum of squared errors (SSE)
Recommendations via Optimization

- Goal: make good recommendations
  - Quantify goodness using RMSE: lower RMSE ⇒ better recommendations
  - We want to make good recommendations on items the user has not yet seen. We can't really do this directly!
  - Instead: build a system that works well on the known (user, item) ratings, and hope it will also predict the unknown ratings well
- Idea: let's set the values w so that they work well on the known (user, item) ratings
- How to find such values w?
  - Define an objective function and solve the optimization problem
  - Find the w_ij that minimize the SSE on the training data:

    J(w) = Σ_{x,i} ( [ b_xi + Σ_{j∈N(i;x)} w_ij (r_xj − b_xj) ] − r_xi )²

    (the bracketed term is the predicted rating; r_xi is the true rating)
- Think of w as a vector of numbers
- A simple way to minimize a function f(x):
  - Compute the gradient ∇f(x)
  - Take a small step in the direction opposite to the gradient: x_new = x_old − α ∇f(x_old)
  - Repeat until convergence
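A minimal Python sketch of this recipe; the toy function f, its derivative, the step size α, and the tolerance are illustrative choices, not values from the lecture:

# Gradient descent on a toy function f(x) = (x - 3)^2.
def grad_descent(df, x, alpha=0.1, eps=1e-8):
    while True:
        x_new = x - alpha * df(x)   # x_new = x_old - alpha * f'(x_old)
        if abs(x_new - x) < eps:    # stop when the updates become tiny
            return x_new
        x = x_new

x_min = grad_descent(lambda x: 2 * (x - 3), x=0.0)
print(x_min)  # converges near 3.0, the minimizer of f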
Example: Formulation

- Assume we have a dataset with a single user x and items 0, 1, and 2. We are given all ratings, and we want to compute the weights w01, w02, and w12.
- Rating estimate:

  r̂_xi = b_xi + Σ_{j∈N(i;x)} w_ij (r_xj − b_xj)

- The training dataset already has the correct r_xi values. We will use the estimation formula to compute the unknown weights w01, w02, and w12.
- Optimization problem: compute the w_ij values that minimize

  Σ_{(i,x)∈R} ( r̂_xi − r_xi )²

- Plugging in the estimation formula:

  minimize J(w) = [ b_x0 + w01 (r_x1 − b_x1) + w02 (r_x2 − b_x2) − r_x0 ]²
                + [ b_x1 + w01 (r_x0 − b_x0) + w12 (r_x2 − b_x2) − r_x1 ]²
                + [ b_x2 + w02 (r_x0 − b_x0) + w12 (r_x1 − b_x1) − r_x2 ]²
Example: Algorithm

- Initialize the unknown variables to some starting guess:

  w_old = (w01, w02, w12)

- Iterate:

  while |w_new − w_old| > ε:
      w_old = w_new
      w_new = w_old − η · ∇J(w_old)

  η is the learning rate (a parameter). How do we compute ∇J(w_old)?
Example: Gradient-Based Update

J(w) = [ b_x0 + w01 (r_x1 − b_x1) + w02 (r_x2 − b_x2) − r_x0 ]²
     + [ b_x1 + w01 (r_x0 − b_x0) + w12 (r_x2 − b_x2) − r_x1 ]²
     + [ b_x2 + w02 (r_x0 − b_x0) + w12 (r_x1 − b_x1) − r_x2 ]²

The gradient collects the partial derivatives with respect to each weight:

  ∇J(w) = ( ∂J(w)/∂w01, ∂J(w)/∂w02, ∂J(w)/∂w12 )

Gradient-based update:

  ( w01, w02, w12 )_new = ( w01, w02, w12 )_old − η · ( ∂J(w)/∂w01, ∂J(w)/∂w02, ∂J(w)/∂w12 )

Each partial derivative is evaluated at w_old.
Example: Computing Partial Derivatives

With J(w) as above, only the first two squared terms contain w01, so:

  ∂J(w)/∂w01 = 2 [ b_x0 + w01 (r_x1 − b_x1) + w02 (r_x2 − b_x2) − r_x0 ] · (r_x1 − b_x1)
             + 2 [ b_x1 + w01 (r_x0 − b_x0) + w12 (r_x2 − b_x2) − r_x1 ] · (r_x0 − b_x0)

Reminder: d/dx (ax + b)² = 2 (ax + b) · a

Evaluate each partial derivative at w_old to compute the gradient direction.
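A compact Python sketch of this whole worked example; the baselines b and ratings r below are made-up numbers for the single-user, three-item setting, not values from the lecture:

import numpy as np

# Made-up data: baselines b_xi and true ratings r_xi for items 0, 1, 2.
b = np.array([3.5, 3.2, 3.8])
r = np.array([4.0, 3.0, 4.5])
d = r - b                       # deviations (r_xj - b_xj)

def errors(w):
    """The three residuals inside the squares of J(w)."""
    w01, w02, w12 = w
    e0 = b[0] + w01 * d[1] + w02 * d[2] - r[0]
    e1 = b[1] + w01 * d[0] + w12 * d[2] - r[1]
    e2 = b[2] + w02 * d[0] + w12 * d[1] - r[2]
    return e0, e1, e2

def J(w):
    e0, e1, e2 = errors(w)
    return e0**2 + e1**2 + e2**2

def grad_J(w):
    # Each partial derivative: 2 * (residual) * (coefficient of the weight).
    e0, e1, e2 = errors(w)
    return np.array([2*e0*d[1] + 2*e1*d[0],    # dJ/dw01
                     2*e0*d[2] + 2*e2*d[0],    # dJ/dw02
                     2*e1*d[2] + 2*e2*d[1]])   # dJ/dw12

w = np.zeros(3)        # initial guess for (w01, w02, w12)
eta = 0.05             # learning rate (illustrative)
for _ in range(1000):  # fixed step count instead of an |w_new - w_old| test
    w = w - eta * grad_J(w)
print(w, J(w))         # J(w) should be (near) its minimum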
- We have the optimization problem; now what?
- Gradient descent: iterate until convergence, updating w ← w − η ∇_w J:

  while |w_new − w_old| > ε:
      w_old = w_new
      w_new = w_old − η · ∇J(w_old)

  η … learning rate

- The entries of the gradient are

  ∂J(w)/∂w_ij = 2 Σ_{x,i} ( [ b_xi + Σ_{k∈N(i;x)} w_ik (r_xk − b_xk) ] − r_xi ) · (r_xj − b_xj)   for j ∈ N(i;x), ∀i, ∀x

  and ∂J(w)/∂w_ij = 0 otherwise

- Note: to compute ∂J(w)/∂w_ij we fix movie i, go over all ratings r_xi, and for every movie j ∈ N(i;x) compute the partial derivative

  where J(w) = Σ_{x,i} ( b_xi + Σ_{j∈N(i;x)} w_ij (r_xj − b_xj) − r_xi )²
- So far: r̂_xi = b_xi + Σ_{j∈N(i;x)} w_ij (r_xj − b_xj)
  - The weights w_ij are learned from data, with no use of an arbitrary similarity measure (w_ij ≠ s_ij)
  - They explicitly account for interrelationships among the neighboring movies
- Next: the latent factor model

Performance of various methods (RMSE on the Netflix data):
  Global average: 1.1296
  User average: 1.0651
  Movie average: 1.0533
  Netflix: 0.9514
  Basic collaborative filtering: 0.94
  CF + biases + learned weights: 0.91
  Grand Prize target: 0.8563
[Figure: movies placed in a two-dimensional latent space. Horizontal axis: geared towards females ↔ geared towards males; vertical axis: serious ↔ funny. Movies shown: The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility.]
Latent Factor Models

- "SVD" on the Netflix data: R ≈ Q · P^T
- For now let's assume we can approximate the rating matrix R as a product of "thin" matrices Q and P^T: we want the reconstruction error to be small on the known ratings, and we don't care about the values on the missing ones

[Figure: the items × users rating matrix R factored as Q (items × factors) times P^T (factors × users).]

- This mirrors the SVD decomposition A = U Σ V^T
Ratings as Products of Factors

- How to estimate a missing rating of user x for item i? Take the dot product of the item's and the user's factor vectors:

  r̂_xi = q_i · p_x = Σ_f q_if · p_xf

  q_i = row i of Q
  p_x = column x of P^T

[Figure: the missing entry "?" of R is filled in by multiplying row q_i of Q with column p_x of P^T; in the running example the estimate comes out to 2.4.]
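As a quick numpy sketch of this dot-product prediction; the small factor matrices Q and P below are arbitrary illustrative values, not the ones from the slides:

import numpy as np

# Illustrative factors: 5 items x 2 factors, 4 users x 2 factors.
Q = np.array([[1.1, 0.2],
              [0.3, 2.1],
              [2.1, 0.5],
              [0.7, 1.4],
              [1.0, 0.9]])
P = np.array([[1.2, 0.4],
              [0.3, 1.5],
              [2.0, 0.1],
              [0.8, 0.8]])

R_hat = Q @ P.T                 # full matrix of predicted ratings
i, x = 2, 1                     # item i, user x
r_hat_xi = Q[i] @ P[x]          # r̂_xi = q_i · p_x
assert np.isclose(r_hat_xi, R_hat[i, x])
print(r_hat_xi)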
[Figure repeated: the same movies plotted against the latent dimensions, now labeled Factor 1 (geared towards females ↔ males) and Factor 2 (serious ↔ funny).]
- FYI, SVD: A = U Σ V^T
  - A: input data matrix (m × n); U: left singular vectors; Σ: diagonal matrix of singular values; V: right singular vectors
- So in our case, "SVD" on the Netflix data gives R ≈ Q · P^T with A = R, Q = U, and P^T = Σ V^T
- We already know that SVD gives the minimum reconstruction error (SSE):

  min_{U,Σ,V} Σ_{ij∈A} ( A_ij − [U Σ V^T]_ij )²

- Note two things:
  - SSE and RMSE are monotonically related: RMSE = (1/c) √SSE. Great news: SVD is minimizing RMSE!
  - Complication: the SVD error term sums over all entries (no rating is interpreted as a zero rating). But our R has missing entries!
- SVD isn't defined when entries are missing!
- Instead, use specialized methods to find P and Q directly:

  min_{P,Q} Σ_{(i,x)∈R} ( r_xi − q_i · p_x )²

  Note: the sum runs only over the known ratings, and we no longer require the columns of P and Q to be orthogonal or unit length
[Figure repeated: R ≈ Q · P^T (items × factors times factors × users), with r̂_xi = q_i · p_x.]
General Concept: Overfitting

Almost-linear data is fit with a linear function and with a polynomial. The polynomial fits the training data perfectly, while the linear model has some error on the training set. Yet the linear model is expected to perform better on test data, because it filters out the noise instead of memorizing it.

Image source: Wikipedia
Back to Our Problem

- Our goal is to find P and Q such that:

  min_{P,Q} Σ_{(i,x)∈R} ( r_xi − q_i · p_x )²

[Figure repeated: R ≈ Q · P^T.]
- We want to minimize SSE for the unseen test data
- Idea: minimize SSE on the training data
  - We would like many factors k to capture all the signals
  - But SSE on the test data begins to rise beyond some k!
- This is a classical example of overfitting: with too much freedom (too many free parameters) the model starts fitting the noise
  - It fits the training data too well and thus stops generalizing to unseen test data
- To solve overfitting we introduce regularization:
  - Allow a rich model where there is sufficient data
  - Shrink aggressively where data are scarce

  min_{P,Q} Σ_{(x,i)∈training} ( r_xi − q_i p_x )² + λ1 Σ_x ‖p_x‖² + λ2 Σ_i ‖q_i‖²
            └── "error" ──┘                         └──── "length" ────┘

  λ1, λ2 … user-set regularization parameters

- Note: we do not care about the "raw" value of the objective function; we care about the P, Q that achieve the minimum of the objective
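A small sketch of evaluating this regularized objective; the matrix shapes, λ values, and the `ratings` list of (x, i, r_xi) triples are all hypothetical:

import numpy as np

def objective(P, Q, ratings, lam1=0.1, lam2=0.1):
    # "error" term: squared error over the known training ratings only.
    error = sum((r - Q[i] @ P[x]) ** 2 for x, i, r in ratings)
    # "length" term: squared norms of all user and item factor vectors.
    length = lam1 * (P ** 2).sum() + lam2 * (Q ** 2).sum()
    return error + length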
Regularization

- What happens if user x has rated hundreds of movies?
  - The "error" term dominates and we get a rich model; noise is less of an issue because we have lots of data
- What happens if user x has rated only a few movies?
  - The "length" term for p_x has more effect and we get a simple model
- The same argument applies to items

(The objective is the regularized SSE above, with λ1, λ2 the user-set regularization parameters.)
min_{factors} "error" + λ · "length":

  min_{P,Q} Σ_{training} ( r_xi − q_i p_x )² + λ [ Σ_x ‖p_x‖² + Σ_i ‖q_i‖² ]

[Figure series (four snapshots): the movies plotted in the latent factor space (Factor 1: geared towards females ↔ males; Factor 2: serious ↔ funny) as the regularized objective is minimized, illustrating how regularization shapes the factor estimates of movies such as The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, and Sense and Sensibility.]
- We want to find the matrices P and Q that minimize the regularized objective
- Gradient descent:
  - Initialize P and Q (e.g. using SVD, pretending the missing ratings are 0s)
  - Do gradient descent: P ← P − η·∇P, Q ← Q − η·∇Q
- How to compute the gradient? Compute the gradient of every element independently:

  ∇Q = [∇q_if], where ∇q_if = Σ_{x,i} −2 ( r_xi − q_i · p_x ) p_xf + 2 λ2 q_if

  (q_if is entry f of the factor vector q_i of item i)
Example

Assume we want 3 factors per user and item:

  p_x = (p_x0, p_x1, p_x2),  q_i = (q_i0, q_i1, q_i2)

Rewrite the objective as:

  min Σ_{x,i} ( r_xi − (q_i0 p_x0 + q_i1 p_x1 + q_i2 p_x2) )²
      + λ1 Σ_x ( p_x0² + p_x1² + p_x2² )
      + λ2 Σ_i ( q_i0² + q_i1² + q_i2² )
Example (continued)

Compute the gradient for the variable q_i0:

  ∇q_i0 = Σ_{x,i} −2 ( r_xi − (q_i0 p_x0 + q_i1 p_x1 + q_i2 p_x2) ) p_x0 + 2 λ2 q_i0

Do the same for every free variable.
Gradient Descent - Computation Cost

- How many free variables do we have?
  - (# of users + # of items) · (# of factors)
- Which ratings do we process to compute ∇q_if?
  - All ratings for item i
- Which ratings do we process to compute ∇p_xf?
  - All ratings for user x
- What is the complexity of one iteration?
  - O(# of ratings · # of factors)
Stochastic Gradient Descent

- Gradient Descent (GD): update all free variables in one step; needs to process all ratings.

  GD: Q ← Q − η [ Σ_{r_xi} ∇Q(r_xi) ]

- Stochastic Gradient Descent (SGD): update the free variables associated with a single rating in one step.

  SGD: Q ← Q − μ ∇Q(r_xi)

  - Needs many more steps to converge
  - But each step is much faster
  - In practice: SGD is much faster than GD
Stochastic Gradient Descent

The full gradients are:

  ∇q_if = Σ_{x,i} −2 ( r_xi − q_i · p_x ) p_xf + 2 λ2 q_if
  ∇p_xf = Σ_{x,i} −2 ( r_xi − q_i · p_x ) q_if + 2 λ1 p_xf

Which free variables are associated with a single rating r_xi? Exactly the entries of

  p_x = (p_x0, p_x1, …, p_xk) and q_i = (q_i0, q_i1, …, q_ik)
Stochastic Gradient Descent

- Initialize P and Q (e.g. using SVD, pretending the missing ratings are 0s)
- Then iterate over the ratings (multiple times if necessary) and update the factors. For each r_xi:

  ε_xi = r_xi − q_i · p_x                  (derivative of the "error")
  q_i ← q_i + η1 ( ε_xi p_x − λ2 q_i )     (update equation)
  p_x ← p_x + η2 ( ε_xi q_i − λ1 p_x )     (update equation)

  η1, η2 … learning rates. Note: these are vector operations; the constant factors of 2 from the gradients are absorbed into the learning rates.

- 2 for loops:
  - Until convergence:
    - For each r_xi: apply the updates above (a code sketch follows)
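A self-contained sketch of this SGD loop, assuming random synthetic data; the sizes, λ values, and learning rates below are illustrative choices:

import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, k = 50, 40, 3
# Hypothetical training triples (x, i, r_xi); real code would load them.
ratings = [(rng.integers(n_users), rng.integers(n_items),
            float(rng.integers(1, 6))) for _ in range(500)]

P = 0.1 * rng.standard_normal((n_users, k))   # user factors p_x
Q = 0.1 * rng.standard_normal((n_items, k))   # item factors q_i
eta1 = eta2 = 0.02                            # learning rates
lam1 = lam2 = 0.1                             # regularization

for epoch in range(20):                       # "until convergence" loop
    for x, i, r in ratings:                   # one SGD step per rating
        qi, px = Q[i].copy(), P[x].copy()     # use the old values in both updates
        eps = r - qi @ px                     # error for this single rating
        Q[i] += eta1 * (eps * px - lam2 * qi)
        P[x] += eta2 * (eps * qi - lam1 * px)
    sse = sum((r - Q[i] @ P[x]) ** 2 for x, i, r in ratings)
    # sse should (noisily) decrease from epoch to epoch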
- Convergence of GD vs. SGD:

[Plot: value of the objective function vs. iteration/step. GD improves the value at every step. SGD improves the value too, but in a "noisy" way. GD takes fewer steps to converge, but each step takes much longer to compute. In practice, SGD is much faster!]

(Koren, Bell, Volinsky, IEEE Computer, 2009)
Extending the Latent Factor Model to Include Biases

  r̂_xi = μ + b_x + b_i + q_i · p_x
         └ baseline predictor ┘ └ user-movie interaction ┘

- μ = overall mean rating
- b_x = bias of user x
- b_i = bias of movie i
- Baseline predictor:
  - Separates users and movies
  - Benefits from insights into users' behavior
  - Among the main practical contributions of the competition
- User-movie interaction:
  - Characterizes the matching between users and movies
  - Attracts most research in the field
  - Benefits from algorithmic and mathematical innovations
- We have expectations on the rating by user x of movie i, even before seeing how x relates to movies like i:
  - Rating scale of user x
  - Values of other ratings the user gave recently (day-specific mood, anchoring, multi-user accounts)
  - (Recent) popularity of movie i
  - Selection bias; related to the number of ratings the user gave on the same day ("frequency")
Putting It All Together

  r̂_xi = μ + b_x + b_i + q_i · p_x
  (overall mean rating + bias for user x + bias for movie i + user-movie interaction)

- Example:
  - Mean rating: μ = 3.7
  - You are a critical reviewer: your ratings are 1 star lower than the mean: b_x = −1
  - Star Wars gets a mean rating 0.5 higher than an average movie: b_i = +0.5
  - Predicted baseline rating for you on Star Wars: 3.7 − 1 + 0.5 = 3.2
Fitting the New Model

- Solve:

  min_{P,Q,b} Σ_{(x,i)∈R} ( r_xi − (μ + b_x + b_i + q_i · p_x) )²
              + λ1 Σ_i ‖q_i‖² + λ2 Σ_x ‖p_x‖² + λ3 Σ_x b_x² + λ4 Σ_i b_i²

  (goodness of fit + regularization; the λs are selected via grid search on a validation set)

- Use stochastic gradient descent to find the parameters
  - Note: both the biases b_x, b_i and the interactions q_i, p_x are treated as parameters (we estimate them)
[Plot: RMSE (0.885-0.92) vs. millions of parameters (1-1000) for CF (no time bias), basic latent factors, and latent factors w/ biases.]

Scoreboard update (RMSE): Latent factors: 0.90; Latent factors + biases: 0.89. (For reference: Netflix: 0.9514, Collaborative filtering++: 0.91, Grand Prize target: 0.8563.)
Temporal Effects

- Sudden rise in the average movie rating (early 2004):
  - Improvements in the Netflix GUI; the meaning of a rating changed
- Movie age: average rating changes with the age of the movie. Candidate explanations:
  - Users prefer new movies without any particular reason
  - Older movies are just inherently better than newer ones

(Y. Koren, Collaborative filtering with temporal dynamics, KDD '09)
- Original model: r̂_xi = μ + b_x + b_i + q_i · p_x
- Add time dependence to the biases:

  r̂_xi = μ + b_x(t) + b_i(t) + q_i · p_x

  (1) Parameterize the time dependence by linear trends
  (2) Or bin the dates: each bin corresponds to 10 consecutive weeks

- Add temporal dependence to the factors:
  - p_x(t) … user preference vector on day t
[Plot: RMSE (0.875-0.92) vs. millions of parameters (1-10000) for CF (no time bias), basic latent factors, CF (time bias), latent factors w/ biases, + linear time factors, + per-day user biases, + CF.]

Scoreboard update (RMSE): Latent factors + biases + time: 0.876. (Still above the Grand Prize target of 0.8563.)
The Last 30 Days

- A June 26th submission triggers the 30-day "last call" period
- Ensemble team formed
  - Several of the other leading teams join forces and combine their models
- BellKor
  - Keeps making small improvements, now in direct competition with Ensemble
- Strategy
  - Submissions are limited to 1 a day, so each team must choose carefully what to reveal on the leaderboard
- 24 hours before the deadline…
  - Ensemble posts a score that is slightly better than BellKor's
- Frantic last 24 hours for both teams
- Final submissions
  - BellKor submits a little before the deadline; Ensemble submits right at the deadline
Acknowledgements and Further Reading

- Some slides and plots are borrowed from other presentations on the Netflix Prize
- Further reading:
  - Y. Koren, Collaborative filtering with temporal dynamics, KDD '09