Mining Streams
Computational Tools for Data Science 02807, E 2018
Paul Fischer
Institut for Matematik og Computer Science Danmarks Tekniske Universitet
Autumn 2018
02807 Computational Tools for Data Science, Lecture 9 © 2018 P. Fischer
◮ What is clustering
◮ Hierarchical clustering
◮ The k-means algorithm
◮ The DBSCAN algorithm (not in the book)
◮ Evaluating clusterings
◮ People with similar interests in social media.
◮ People with similar taste for movies in a streaming provider.
◮ Detecting similarities in medical tests.
◮ Detection of groups in statistical data.
n−1
◮ Initialisation: Each data point is a cluster by itself, i.e., Ci = {xi} and ci = xi.
◮ Merging: Find clusters Ci and Cj where dist(ci, cj) is minimal (breaking ties, e.g., randomly).
◮ Stop the process when some criterion is satisfied, e.g., a certain number of clusters is reached.
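The initialisation, merging, and stopping steps can be sketched in Python as follows. This is a minimal illustration, not the lecture's implementation. It assumes Euclidean distance, that each representative ci is the centroid (component-wise mean) of its cluster, that ties are broken by index order rather than randomly, and that the stop criterion is a target number of clusters. The names `agglomerate`, `centroid`, and `target_k` are hypothetical.

```python
import math
from itertools import combinations


def centroid(points):
    """Component-wise mean of a non-empty list of points (assumed representative c_i)."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def agglomerate(points, target_k):
    """Merge the two clusters with the closest centroids until target_k clusters remain.

    Assumes target_k >= 1 and Euclidean distance.
    """
    clusters = [[p] for p in points]        # Initialisation: C_i = {x_i}
    centroids = [tuple(p) for p in points]  # ... and c_i = x_i
    while len(clusters) > target_k:
        # Merging: pick the index pair (i, j), i < j, with minimal centroid distance.
        # min() returns the first minimal pair, so ties are broken by index order.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: math.dist(centroids[ij[0]], centroids[ij[1]]),
        )
        merged = clusters[i] + clusters[j]
        # j > i, so deleting position j leaves position i valid.
        del clusters[j], centroids[j]
        clusters[i] = merged
        centroids[i] = centroid(merged)
    return clusters


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
result = agglomerate(points, target_k=2)
```

Each round scans all cluster pairs, so this sketch favours readability over speed and is only suitable for small data sets.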
|Ci|+|Cj|
◮ A number of clusters has been specified beforehand. When only this number is left, the
◮ The density of the cluster resulting from a merger is bad. The density is the average distance
◮ See more in the book.
◮ Iterate steps 1 and 2 until no (or only very small) changes occur.
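The iteration can be illustrated with a short Python sketch. Steps 1 and 2 themselves are not preserved in this text, so the sketch assumes the usual k-means steps: (1) assign every point to its nearest centre, (2) move every centre to the mean of the points assigned to it. The function name `k_means` and the parameters `tol` and `max_iter` are hypothetical, and Euclidean distance is an assumption.

```python
import math


def mean(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, centres, tol=1e-9, max_iter=100):
    """Repeat assignment and update until the centres move by at most tol."""
    centres = [tuple(c) for c in centres]
    groups = [[] for _ in centres]
    for _ in range(max_iter):
        # Step 1 (assumed): assign each point to its nearest centre.
        groups = [[] for _ in centres]
        for p in points:
            nearest = min(range(len(centres)), key=lambda i: math.dist(p, centres[i]))
            groups[nearest].append(p)
        # Step 2 (assumed): move each centre to the mean of its group.
        # A centre with an empty group stays where it is.
        new_centres = [mean(g) if g else c for g, c in zip(groups, centres)]
        shift = max(math.dist(a, b) for a, b in zip(centres, new_centres))
        centres = new_centres
        if shift <= tol:  # no (or only very small) changes: stop
            break
    return centres, groups


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centres, groups = k_means(points, centres=[(0.0, 0.0), (10.0, 0.0)])
```

The result depends on the initial centres; how they are chosen is left open here.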
◮ One defines the concept of (density) reachable for the data points.
◮ The algorithm uses two parameters: ε > 0, the neighborhood radius, and m ∈ ℕ+, the minimum number of points required within distance ε of a core point.
◮ The algorithm classifies points as core (centrally in a cluster), rim (at the edge of a cluster), or noise (in no cluster).
◮ The number of clusters is not fixed beforehand; it is implicitly controlled by ε and m.
◮ A point x is core if there are at least m points (incl. x) within distance ε, i.e., |{z ∈ S : dist(x, z) ≤ ε}| ≥ m.
◮ A point z is directly reachable from x if dist(x, z) ≤ ε and x is core.
◮ A point z is reachable from x if there are points x1, x2, …, xk, such that x = x1, z = xk, and xi+1 is directly reachable from xi for i = 1, …, k − 1.
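The definitions above can be made concrete with a few small Python functions. This is a sketch under stated assumptions: Euclidean distance, points given as tuples, and the standard DBSCAN chain condition that each point in the chain is directly reachable from its predecessor. The function names are hypothetical.

```python
import math


def neigh(x, points, eps):
    """All points within distance eps of x (x itself included if it is in points)."""
    return [z for z in points if math.dist(x, z) <= eps]


def is_core(x, points, eps, m):
    """x is core if at least m points (incl. x) lie within distance eps."""
    return len(neigh(x, points, eps)) >= m


def directly_reachable(z, x, points, eps, m):
    """z is directly reachable from x if x is core and dist(x, z) <= eps."""
    return is_core(x, points, eps, m) and math.dist(x, z) <= eps


def reachable(z, x, points, eps, m):
    """Search for a chain x = x_1, ..., x_k = z of directly reachable steps."""
    seen, frontier = {x}, [x]
    while frontier:
        cur = frontier.pop()
        if not is_core(cur, points, eps, m):
            continue  # only core points can extend the chain
        for nxt in neigh(cur, points, eps):
            if nxt == z:
                return True
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return False


points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (10.0, 0.0)]
eps, m = 1.0, 3
```

With these values, (1, 0) and (2, 0) are core while (0, 0) and (3, 0) are not. The relation is therefore not symmetric: (0, 0) is directly reachable from (1, 0), but not the other way round.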
DBSCAN(S, ε, m):
  Mark all xi ∈ S as unvisited;
  for i = 0, …, n − 1 do
    if xi is unvisited then
      N ← neigh(xi, ε);
      if |N| < m then
        Mark xi as noise;
      else
        C ← ∅;
        Mark xi as core;
        expand(xi, N, C, ε, m);
      end
    end
  end

neigh(x, ε):
  return all points z with dist(x, z) ≤ ε

expand(x, N, C, ε, m):
  C ← C ⊎ {x};
  for z ∈ N do
    if z is not visited then
      Mark z as visited;
      N′ ← neigh(z, ε);
      if |N′| ≥ m then
        N ← N ⊎ N′;
      end
    end
    if z is not in any cluster then
      C ← C ⊎ {z};
      if |neigh(z, ε)| ≥ m then
        Mark z as core;
      else
        Mark z as rim;
      end
    end
  end
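A possible Python rendering of the pseudocode above is sketched below. It is an illustration under assumptions, not the lecture's reference code: distances are Euclidean, points are addressed by index, clusters are represented by integer ids instead of sets, and the conditions with a missing left-hand side are read as "z has at least m neighbours". The extra parameters of `expand` (state lists) are hypothetical additions needed to make the code self-contained.

```python
import math


def neigh(i, points, eps):
    """Indices of all points within distance eps of points[i] (i included)."""
    return [j for j, z in enumerate(points) if math.dist(points[i], z) <= eps]


def expand(x, nbrs, cid, points, eps, m, visited, label, cluster):
    """Grow cluster cid from core point x, following the pseudocode's expand."""
    cluster[x] = cid
    queue = list(nbrs)  # plays the role of N, which grows during the loop
    k = 0
    while k < len(queue):
        z = queue[k]
        k += 1
        z_nbrs = neigh(z, points, eps)
        z_is_core = len(z_nbrs) >= m
        if not visited[z]:
            visited[z] = True
            if z_is_core:
                queue.extend(z_nbrs)  # N <- N ⊎ N'
        if cluster[z] is None:
            cluster[z] = cid
            label[z] = "core" if z_is_core else "rim"


def dbscan(points, eps, m):
    """Return (label, cluster): per-point 'core'/'rim'/'noise' and cluster id or None."""
    n = len(points)
    visited = [False] * n
    label = [None] * n
    cluster = [None] * n
    next_id = 0
    for i in range(n):
        if visited[i]:
            continue
        nbrs = neigh(i, points, eps)
        if len(nbrs) < m:
            label[i] = "noise"  # may later be relabelled as rim by expand
        else:
            label[i] = "core"
            expand(i, nbrs, next_id, points, eps, m, visited, label, cluster)
            next_id += 1
    return label, cluster


points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (10.0, 0.0)]
label, cluster = dbscan(points, eps=1.0, m=3)
```

In this run the first point is initially marked as noise and later relabelled as rim when the cluster around the second point is expanded. The isolated point at (10, 0) stays noise. Each call to `neigh` scans all points, so the sketch takes quadratic time.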
k−1
1 |Ci|