
6. Learning Partitions of a Set

Also known as clustering! Usually, we partition sets into subsets with elements that are somewhat similar (and since similarity is often task dependent, different partitions of the same set are possible and often needed). In contrast to classification tasks, partitioning does not use given classes, it creates its own classes (although there might be some constraints on what is allowed as a class; unsupervised learning).

As such, learning partitions is not just about later classifying additional examples, it is also about discovering what should be classified together!


How to use set partitions?

} to create classes and then classify examples
} to find outliers in data sets
} to establish what feature values interesting classes should have
} to find events that over time occur sufficiently often
} to find “plays” for a group of agents to help them achieve their goals
} ...


Known methods to learn partitions:

} k-means clustering and many improvements/enhancements of it (like x-means)
} PAM (partitioning around medoids)
} sequential leader clustering
} hierarchical clustering methods
} conceptual clustering methods
} fuzzy clustering (allows an example to be in several clusters with different membership values)
} ...


Comments:

} All clustering methods have parameters (in addition to the similarity measure) that substantially influence what partitioning of the set is created ⇒ there is quite some literature on how to compare partitionings.
} But similarity is the key parameter and depends on what the clustering is aimed to achieve.
} Often we use a distance measure instead of a similarity measure (which means we change maximizing to minimizing).


6.1 k-means clustering: General idea

See essentially every textbook on Machine Learning. The basic idea is to use a given similarity (or distance) measure and a given number k to create k clusters out of the given examples (data points) by putting examples that are similar to each other into the same cluster. Since clusters need to have a center point, we start with k randomly selected center points, create clusters, compute the best center point for each cluster (the centroids) and then repeat the clustering with the new center points. This whole process is repeated either for a fixed number of iterations, until the centroids no longer change, or until the improvement in the quality of the partitioning falls below a threshold. Also, we usually do several runs using different initial center points.


Learning phase: Representing and storing the knowledge

The clusters are represented by their centroids, and these k elements (which are described by their values for the features that we have in the examples) are the stored knowledge.


Learning phase: What or whom to learn from

As for so many learning methods, we learn from examples that are vectors of values for features:
ex_1: (val_11, ..., val_1n)
...
ex_s: (val_s1, ..., val_sn)
with val_ij ∈ feat_j, where feat_j is the set of possible values for feature j. This forms the set Ex. Additionally, we need to be able to define a similarity function
sim: (feat_1 × ... × feat_n)² → ℝ (real numbers)
and sim is provided/selected from the outside.


Learning phase: Learning method

The basic algorithm is as follows (using a convergence parameter ε):

Randomly select k elements {m_1, ..., m_k} from Ex
Quality_new := 0
Repeat
    For i = 1 to k do C_i := {}
    For all ex ∈ Ex do
        assign ex to the C_i with sim(m_i, ex) maximal
    For i = 1 to k do
        m_i := 1/|C_i| · Σ_{ex ∈ C_i} ex
    Quality_old := Quality_new
    Quality_new := Σ_{i=1..k} Σ_{ex ∈ C_i} sim(m_i, ex)
until Quality_new − Quality_old < ε
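
A minimal Python sketch of this loop, assuming a distance measure dist instead of sim (so assignment and quality are minimized rather than maximized) and taking the initial center points as an explicit parameter init instead of selecting them randomly; the function and parameter names are ours, not part of the slides:

import numpy as np

def kmeans(examples, k, dist, init, eps=1e-9):
    # alternate between assigning examples to the nearest center and
    # recomputing centroids, until quality changes by less than eps
    X = np.asarray(examples, dtype=float)
    centers = np.asarray(init, dtype=float)
    quality_new = np.inf
    while True:
        # np.argmin breaks ties in favor of the smaller index,
        # matching the tie-breaking rule used in the example below
        labels = np.array([np.argmin([dist(x, m) for m in centers]) for x in X])
        # new centroids: per-feature mean of each cluster's members
        for i in range(k):
            if (labels == i).any():
                centers[i] = X[labels == i].mean(axis=0)
        quality_old = quality_new
        quality_new = sum(dist(x, centers[i]) for x, i in zip(X, labels))
        if quality_old - quality_new < eps:
            return labels, centers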


Learning phase: Learning method (cont.)

A key component of the algorithm is the computation of the new m_i's (“means”, centroids). As one of their names suggests, they are supposed to represent the means of the members of a cluster and are computed by determining, for each individual feature, the mean value of the examples in the cluster.


Application phase: How to detect applicable knowledge

By finding the nearest centroid (with regard to sim) to a given example to classify.


Application phase: How to apply knowledge

Assign the example to classify to the cluster represented by the centroid to which it is nearest.


Application phase: Detect/deal with misleading knowledge

Again, this is not part of the method. The user is responsible for re-learning if unsatisfied with the current results.


General questions: Generalize/detect similarities?

Obviously, the similarity function sim is responsible for this. While there are some general candidates for sim, often we need specialized functions that already incorporate knowledge from the particular application. In general, similarity functions should fulfill
Reflexivity: sim(x, x) = 1
Symmetry: sim(x, y) = sim(y, x)


General questions: Generalize/detect similarities? (cont.)

Among the general candidates for similarities for two vectors x = (x_1, ..., x_n) and y = (y_1, ..., y_n) are
Euclidean distance: sqrt(Σ_{i=1..n} (x_i − y_i)²) (the smaller the better)
Weighted Euclidean distance: sqrt(Σ_{i=1..n} w_i·(x_i − y_i)²) (with w_i the weight for feature i; the smaller the better, again)
Manhattan distance: Σ_{i=1..n} |x_i − y_i| (the smaller the better, again)
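
These three measures translate directly into code; a minimal Python sketch (the function names are ours, not from the slides):

import math

def euclidean(x, y):
    # sqrt(Σ_i (x_i − y_i)²); the smaller the better
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def weighted_euclidean(x, y, w):
    # sqrt(Σ_i w_i·(x_i − y_i)²) with one weight w_i per feature
    return math.sqrt(sum(wi * (xi - yi) ** 2 for xi, yi, wi in zip(x, y, w)))

def manhattan(x, y):
    # Σ_i |x_i − y_i|
    return sum(abs(xi - yi) for xi, yi in zip(x, y))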


General questions: Dealing with knowledge from other sources

Very indirectly, by influencing parameters like the similarity function (or parameters in it). Example: The left clustering below is the result of using the standard Euclidean distance. The right clustering can be generated using a weighted Euclidean distance with weights w_1 = 0 and w_2 = 1.


[Figure: the same points plotted over features x_1 and x_2, clustered in the two ways described above.]

(Conceptual) Example

We have two features (with real numbers as values), use the Manhattan distance as similarity measure (ties are broken by assigning to the cluster with the smaller index) and set k = 2. We have the following set Ex:
ex_1: (1,1); ex_2: (2,1); ex_3: (2,2); ex_4: (2,3); ex_5: (3,3)
Select m_1, m_2: (1,1), (3,3)
C_1 = {(1,1), (2,1), (2,2)}; C_2 = {(2,3), (3,3)}
New m_i's: m_1 = (1.67, 1.33); m_2 = (2.5, 3)
No change in the C_i's, therefore no change in the m_i's and in Quality. Finished.


(Conceptual) Example

New run: Select m_1, m_2: (2,3), (2,1)
C_1 = {(2,3), (3,3), (2,2)}; C_2 = {(2,1), (1,1)}
New m_i's: m_1 = (2.33, 2.67); m_2 = (1.5, 1)
No change in the C_i's, therefore no change in the m_i's and in Quality. Finished.

Note where (2,2) ends up in each run!
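
Both runs can be reproduced with the kmeans sketch from above (together with the manhattan function from the similarity-measure sketch):

Ex = [(1, 1), (2, 1), (2, 2), (2, 3), (3, 3)]
labels1, _ = kmeans(Ex, 2, manhattan, init=[(1, 1), (3, 3)])
labels2, _ = kmeans(Ex, 2, manhattan, init=[(2, 3), (2, 1)])
print(labels1)  # [0 0 0 1 1]: (2,2) joins (1,1) and (2,1)
print(labels2)  # [1 1 0 0 0]: (2,2) joins (2,3) and (3,3)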


Pros and cons

✚ efficient and simple to implement
− often has problems with non-numerical features
− choosing k is not always easy
− finds a local optimum, not a global one
− has problems with noisy data and with outliers


6.2 Sequential Leader Clustering: General idea

See Hartigan, J.A.: Clustering Algorithms, John Wiley & Sons, 1975. Instead of having to determine the right k, use a similarity threshold that is required to be met in order to join a cluster. Otherwise, similar to k-means.


Learning phase: Representing and storing the knowledge

Because of the way the clusters are constructed, each of them can be represented by the example that founded it (its leader) and the threshold thresh_min for the similarity measure.


Learning phase: What or whom to learn from

As for k-means, we learn from examples that are vectors of values for features:
ex_1: (val_11, ..., val_1n)
...
ex_s: (val_s1, ..., val_sn)
with val_ij ∈ feat_j, where feat_j is the set of possible values for feature j. This forms the set Ex. Also, as for k-means, we need to be able to define a similarity function sim: (feat_1 × ... × feat_n)² → ℝ, and sim is provided/selected from the outside.


Learning phase: Learning method

The basic algorithm is as follows:

C_1 := {ex_1}; m_1 := ex_1; count := 1
For each ex ∈ {ex_2, ..., ex_s} do
    found := false
    For i = 1 to count do
        if not(found) and sim(m_i, ex) ≥ thresh_min then
            C_i := C_i ∪ {ex}
            found := true
    if not(found) then
        C_{count+1} := {ex}
        m_{count+1} := ex
        count := count + 1
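
A minimal Python sketch of this loop (function and parameter names are ours); the explicit break makes precise that an example joins the first sufficiently similar cluster:

def leader_clustering(examples, thresh_min, sim):
    # every example joins the first cluster whose leader m_i is similar
    # enough; otherwise it founds a new cluster with itself as leader
    leaders = [examples[0]]
    clusters = [[examples[0]]]
    for ex in examples[1:]:
        for leader, cluster in zip(leaders, clusters):
            if sim(leader, ex) >= thresh_min:
                cluster.append(ex)
                break
        else:  # no leader was similar enough
            leaders.append(ex)
            clusters.append([ex])
    return leaders, clusters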


Application phase: How to detect applicable knowledge

Compute the similarity of a new example to all m_i's. The clusters where the similarity is above the threshold are applicable. It is possible that no cluster is applicable!


Application phase: How to apply knowledge

Choose among the applicable clusters the one where the similarity to m_i is largest. Note that in most cases there should be only one applicable cluster, but overlaps are possible because the m_i's are chosen based on the sequence of the training examples.


Application phase: Detect/deal with misleading knowledge

While bad clusters within a partition need to be detected by the user and resolved (by, for example, a re-run of the learning method with a different sequence of the examples), examples that do not fit into any cluster can be turned into a new cluster immediately.


General questions: Generalize/detect similarities?

As for k-means, the similarity/distance measure is a key component of the method and explicitly given.


General questions: Dealing with knowledge from other sources

The method does not include any special way to incorporate other knowledge, except for using it to choose the key parameters (similarity measure and threshold) well.


(Conceptual) Example

Again, we have two features (with real numbers as values), use the Manhattan distance as similarity measure (ties are broken by assigning to the cluster with the smaller index) and set thresh_min = 1. We have the following set Ex:
ex_1: (1,1); ex_2: (2,1); ex_3: (2,3); ex_4: (3,3); ex_5: (5,5)
We get the following derivation:
C_1 = {ex_1}, m_1 = ex_1
C_1 = {ex_1, ex_2}
C_2 = {ex_3}, m_2 = ex_3


(Conceptual) Example (cont.)

C_2 = {ex_3, ex_4}
C_3 = {ex_5}, m_3 = ex_5
As an exercise, use k-means on this with k = 2 and k = 3! Also, use sequential leader clustering on the k-means example.
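
This derivation can be reproduced with the leader_clustering sketch from above; since the example measures distance while the sketch expects similarity, we negate the Manhattan distance (so “distance ≤ 1” becomes “similarity ≥ −1”):

Ex = [(1, 1), (2, 1), (2, 3), (3, 3), (5, 5)]
sim = lambda x, y: -sum(abs(a - b) for a, b in zip(x, y))
leaders, clusters = leader_clustering(Ex, thresh_min=-1, sim=sim)
print(clusters)  # [[(1,1), (2,1)], [(2,3), (3,3)], [(5,5)]]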


Pros and cons

✚ also efficient and simple to implement
✚ no pre-determined number of clusters
✚ deals rather well with outliers (by giving each of them its own cluster)
− can have problems with non-numerical features
− choosing a good threshold value is key
− is dependent on the sequence of examples
− the m_i's might not be in the center of their clusters


6.3 Hierarchical clustering with COBWEB: General idea

Fisher, Douglas (1987): Knowledge acquisition via incremental conceptual clustering, Machine Learning 2(2), pp. 139–172. The following slides are based on slides by Michael M. Richter. Create clusters based on conditional probabilities. Use a cluster tree to search for clusters with high predictiveness and high predictability.


Hierarchical clustering with COBWEB: General idea (cont.)

Possible reactions to a new example at a node in such a tree are
} putting it into one of the successor nodes
} creating a new successor node for it
} putting it into a new cluster that merges two of the successor nodes
} putting it into a new cluster created by splitting up one of the successor nodes
The decision is based on a quality criterion.


Learning phase: Representing and storing the knowledge

The clusters are represented as a tree where a node represents a cluster C_k and contains a probability table providing P(feat_i = val_ij | C_k) (predictability) and P(C_k | feat_i = val_ij) (predictiveness) for each feature and each feature value.


Learning phase: What or whom to learn from

As before, we learn from examples of the form
ex_1: (val_11, ..., val_1n)
...
ex_s: (val_s1, ..., val_sn)
with val_ij ∈ feat_j, where feat_j is the set of possible values for feature j and |feat_j| < ∞.


Learning phase: Learning method

Learning is realized as a continuous process, with each new example ex being evaluated and recursively integrated into the cluster tree. At node N we perform the following procedure cobweb(N, ex):

if N is a leaf then
    M := copy(N)
    add M to successors of N
    insert ex into table(M)
    insert ex into table(N)
else
    insert ex into table(N)


Learning phase: Learning method (cont.)

    for each successor node C of N do
        compute the CU value for inserting ex into table(C)
    set P to the successor with the best CU value; W := CU value of P
    set R to the successor with the second-best CU value
    set X to the CU value of adding ex as a new successor
    set Y to the CU value of merging P and R (and adding ex)
    set Z to the CU value of splitting P
    if W = max{W, X, Y, Z} then
        cobweb(P, ex)
    else if X = max{W, X, Y, Z} then
        S := copy(N); add S to successors of N
        insert ex into table(S)


Learning phase: Learning method (cont.)

    else if Y = max{W, X, Y, Z} then
        S := copy(N); add S to successors of N
        delete P, R as successors of N
        add P, R as successors of S
        update table(S)
        cobweb(S, ex)
    else if Z = max{W, X, Y, Z} then
        delete P as successor of N
        add all successors of P to N
        cobweb(N, ex)


Learning phase: Learning method (cont.)

In this algorithm, the key components are the insertion of an example into the table of a node and the computation of CU (Category Utility). For CU we want to combine predictability and predictiveness, which leads to
Σ_k Σ_i Σ_j P(feat_i = val_ij) · P(C_k | feat_i = val_ij) · P(feat_i = val_ij | C_k)
which can be rearranged (using Bayes' theorem) to
Σ_k P(C_k) Σ_i Σ_j P(feat_i = val_ij | C_k)²
Here Σ_i Σ_j P(feat_i = val_ij | C_k)² is the expected number of features for which the feature value can be correctly predicted when knowing the cluster. Obviously, we want to maximize this!


Learning phase: Learning method (cont.)

But the above formula favors clusters with only one element in them! Therefore we choose as CU:
(Σ_k P(C_k) Σ_i Σ_j P(feat_i = val_ij | C_k)²) / K
where K is the number of clusters in the tree. We compute the table of a node by simply counting the number of examples observed (with the appropriate feature values) and computing the probabilities based on these counts.
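
A minimal Python sketch of this CU computation over a flat list of clusters (the tree bookkeeping of the full algorithm is omitted, and representing clusters as lists of feature tuples is our assumption):

from collections import Counter

def category_utility(clusters):
    # CU = (Σ_k P(C_k) · Σ_i Σ_j P(feat_i = val_ij | C_k)²) / K
    total = sum(len(c) for c in clusters)
    score = 0.0
    for cluster in clusters:
        p_ck = len(cluster) / total
        # expected number of correctly predicted feature values given C_k
        predicted = 0.0
        for i in range(len(cluster[0])):
            counts = Counter(ex[i] for ex in cluster)
            predicted += sum((n / len(cluster)) ** 2 for n in counts.values())
        score += p_ck * predicted
    return score / len(clusters)  # divide by K to avoid favoring singletons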


Application phase: How to detect applicable knowledge

Since there is only one cluster tree, there is nothing to detect.


Application phase: How to apply knowledge

Cluster trees can be used either to find the cluster that a complete example fits into, or to predict feature values for incomplete examples. The latter is achieved by going through the tree until a cluster is reached where more information is needed in order to go further, and then selecting from the table of that cluster's node the most probable values for the missing features. The former is best done by inserting the new example into the tree (and seeing where it ends up).


Application phase: Detect/deal with misleading knowledge

Regarding finding a cluster for a complete example: what inserting produces is what you get with this method. If the user does not like the resulting cluster, either many more examples are needed or different methods should be applied. Regarding the prediction of wrong feature values: when the correct values become known, just insert the example, or check whether the prediction changes with the additional information (if some feature values are still unknown).


General questions: Generalize/detect similarities?

This method purely uses statistics/probabilities. No similarity measures or generalizations are intended.


General questions: Dealing with knowledge from other sources

This method is designed as stand-alone.


(Conceptual) Example

Showing all the different operations on the cluster tree requires relatively large sets of examples, too large to demonstrate here. An implementation of the COBWEB method can be found at: https://github.com/cmaclell/concept_formation
But the following is an example of a final cluster tree.


(Conceptual) Example (cont.)

We have 3 features: Size (small, big), Color (black, white) and Form (round, triangle, square), and we have the example stream:
1: (small, black, round), 2: (big, black, triangle), 3: (big, white, triangle), 4: (small, white, square)
This leads to the following tree (in the tables, F_i abbreviates feat_i and v_ij abbreviates val_ij):


The tree (each node labeled with the examples in its cluster):

{1,2,3,4}
├─ {1}
├─ {2,3}
│   ├─ {2}
│   └─ {3}
└─ {4}

Each node's table lists, per feature value, P(F_i = v_ij | C_k) / P(C_k | F_i = v_ij):

{1,2,3,4}: Sz: s 0.5/1, b 0.5/1; Co: b 0.5/1, w 0.5/1; Fo: r 0.25/1, t 0.5/1, s 0.25/1
{1}: Sz: s 1/0.5; Co: b 1/0.5; Fo: r 1/1
{2,3}: Sz: b 1/1; Co: b 0.5/0.5, w 0.5/0.5; Fo: t 1/1
{4}: Sz: s 1/0.5; Co: w 1/0.5; Fo: s 1/1
{2}: Sz: b 1/0.5; Co: b 1/0.5; Fo: t 1/0.5
{3}: Sz: b 1/0.5; Co: w 1/0.5; Fo: t 1/0.5
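
For instance, the top-level partition {1}, {2,3}, {4} of this example stream can be scored with the category_utility sketch from the learning-method slides:

clusters = [
    [("small", "black", "round")],
    [("big", "black", "triangle"), ("big", "white", "triangle")],
    [("small", "white", "square")],
]
print(category_utility(clusters))  # (0.25·3 + 0.5·2.5 + 0.25·3) / 3 ≈ 0.917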

Pros and cons

✚ creates good hierarchical clusters
✚ can be used incrementally
✚ acceptable run-time behavior
✚ allows for correction of early mistakes (due to a bad sequence of data)
− requires finite feature spaces
− can overfit if there is noise in the data
− does not allow the use of application-specific knowledge (like similarity measures)
− the CU definition might still not be good enough
