

SLIDE 1

Machine Learning: Joy of Data

Sarath Chandar, University of Montreal

SLIDE 2

Regression

SLIDE 3

Predict my house’s price!

SLIDE 4

Price Prediction

Notation:
  m = number of training examples
  x’s = “input” variable / features
  y’s = “output” variable / “target” variable

Size in feet² (x)    Price ($) in 1000’s (y)
2104                 460
1416                 232
1534                 315
852                  178
…                    …

Training set of housing prices (Portland, OR)

SLIDE 5

[Diagram: Training Set → Learning Algorithm → hypothesis h; Size of house → h → Estimated price]

Linear Regression

SLIDE 6

Linear Regression

Training Set:

Size in feet² (x)    Price ($) in 1000’s (y)
2104                 460
1416                 232
1534                 315
852                  178
…                    …

Hypothesis: $h_\theta(x) = \theta_0 + \theta_1 x$

$\theta_i$’s: parameters. How to choose the $\theta_i$’s?

SLIDE 7

Linear Regression

[Plots: the line $h_\theta(x) = \theta_0 + \theta_1 x$ for different choices of the parameters $\theta_0$ and $\theta_1$.]

SLIDE 8

Linear Regression

[Scatter plot of the training examples in the (x, y) plane with a fitted line.]

Idea: Choose $\theta_0, \theta_1$ so that $h_\theta(x)$ is close to $y$ for our training examples $(x, y)$.

SLIDE 9

Linear Regression

Hypothesis: $h_\theta(x) = \theta_0 + \theta_1 x$

Parameters: $\theta_0, \theta_1$

Cost Function: $J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$

Goal: $\min_{\theta_0, \theta_1} J(\theta_0, \theta_1)$
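To make the cost concrete, here is a minimal sketch (in Python with NumPy, my assumption, since the slides name no language) that evaluates $J(\theta_0, \theta_1)$ on the four housing examples from the table above.

```python
import numpy as np

# Training set from the slides: size in feet^2 (x), price in $1000's (y).
x = np.array([2104.0, 1416.0, 1534.0, 852.0])
y = np.array([460.0, 232.0, 315.0, 178.0])

def cost(theta0, theta1):
    """Squared-error cost: J = (1 / 2m) * sum((h(x_i) - y_i)^2)."""
    m = len(x)
    h = theta0 + theta1 * x          # hypothesis on every training example
    return np.sum((h - y) ** 2) / (2 * m)

# Hypothetical guess: no offset, $200 per extra square foot.
print(cost(0.0, 0.2))
```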

SLIDE 10

Linear Regression

SLIDE 11

Gradient Descent

Have some function $J(\theta_0, \theta_1)$. Want $\min_{\theta_0, \theta_1} J(\theta_0, \theta_1)$.

Outline:

  • Start with some $\theta_0, \theta_1$
  • Keep changing $\theta_0, \theta_1$ to reduce $J(\theta_0, \theta_1)$

until we hopefully end up at a minimum.

SLIDE 12

Gradient Descent

[Surface plot of the cost $J(\theta_0, \theta_1)$ over the $(\theta_0, \theta_1)$ plane.]

SLIDE 13

Gradient Descent

[Another view of the surface plot of $J(\theta_0, \theta_1)$.]

SLIDE 14

Gradient Descent

Gradient descent algorithm; repeat until convergence:

$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$   (simultaneously for $j = 0$ and $j = 1$)

SLIDE 15

Gradient Descent

If α is too small, gradient descent can be slow. If α is too large, gradient descent can overshoot the minimum; it may fail to converge, or even diverge.
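A tiny illustration of these two failure modes (my example, not from the slides): gradient descent on the one-dimensional cost $J(\theta) = \theta^2$, whose gradient is $2\theta$, with three different step sizes.

```python
def descend(alpha, theta=1.0, steps=20):
    """Run gradient descent on J(theta) = theta^2 (gradient: 2 * theta)."""
    for _ in range(steps):
        theta -= alpha * 2 * theta
    return theta

print(descend(0.01))   # too small: after 20 steps theta is still far from 0
print(descend(0.1))    # reasonable: theta shrinks steadily toward 0
print(descend(1.5))    # too large: |theta| grows every step (diverges)
```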

SLIDE 16

Gradient Descent

Gradient descent can converge to a local minimum, even with the learning rate α fixed: as we approach a local minimum, the gradient shrinks, so gradient descent automatically takes smaller steps. There is no need to decrease α over time.

SLIDE 17

Gradient Descent

Gradient descent algorithm for linear regression; repeat until convergence:

$\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)$

$\theta_1 := \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)}$

Update $\theta_0$ and $\theta_1$ simultaneously.
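A minimal batch-gradient-descent sketch for this update rule, reusing the housing data from slide 4 (the learning rate, iteration count, and the rescaling of x to thousands of square feet are my assumptions, added so a fixed α converges):

```python
import numpy as np

# Housing data from the slides; x rescaled to 1000s of square feet.
x = np.array([2.104, 1.416, 1.534, 0.852])
y = np.array([460.0, 232.0, 315.0, 178.0])
m = len(x)

theta0, theta1 = 0.0, 0.0
alpha = 0.1

for _ in range(2000):
    h = theta0 + theta1 * x              # current predictions
    grad0 = np.sum(h - y) / m            # dJ/dtheta0
    grad1 = np.sum((h - y) * x) / m      # dJ/dtheta1
    # Simultaneous update: both gradients use the same old parameters.
    theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1

print(theta0, theta1)   # intercept ($1000s) and slope ($1000s per 1000 ft^2)
```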

SLIDE 18–26

$h_\theta(x)$: for fixed $\theta_0, \theta_1$, this is a function of $x$. $J(\theta_0, \theta_1)$: a function of the parameters $\theta_0, \theta_1$.

[Animation across slides 18–26: the hypothesis $h_\theta(x)$ plotted against the data, side by side with contour plots of $J(\theta_0, \theta_1)$, as the parameters change.]

SLIDE 27

Classification

SLIDE 28

Classification: Definition

  • Given a collection of records (training set)
    – Each record contains a set of attributes; one of the attributes is the class.
  • Find a model for the class attribute as a function of the values of the other attributes.
  • Goal: previously unseen records should be assigned a class as accurately as possible.
    – A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
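A concrete sketch of this workflow (scikit-learn is my choice; the slides name no library), encoding the ten training records from the table on the next slide:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Records from the slide's table, encoded numerically:
# columns = [Attrib1 (Yes=1/No=0), Attrib2 (Small=0/Medium=1/Large=2), Attrib3 ($K)]
X = np.array([[1, 2, 125], [0, 1, 100], [0, 0, 70], [1, 1, 120], [0, 2, 95],
              [0, 1, 60], [1, 2, 220], [0, 0, 85], [0, 1, 75], [0, 0, 90]])
y = np.array([0, 0, 0, 0, 1, 0, 0, 1, 0, 1])   # Class labels (No=0, Yes=1)

# Divide the data into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)

# Induction: learn a model from the training set.
model = DecisionTreeClassifier().fit(X_train, y_train)

# Deduction: apply the model to unseen records and measure accuracy.
print(model.score(X_test, y_test))
```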

SLIDE 29

Illustrating Classification Task

[Diagram: Training Set → Learning algorithm → Learn Model (Induction) → Model → Apply Model (Deduction) → Test Set]

Training Set:

Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set:

Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

SLIDE 30

Examples of Classification Task

  • Predicting tumor cells as benign or malignant
  • Classifying credit card transactions as legitimate or fraudulent
  • Classifying secondary structures of proteins as alpha-helix, beta-sheet, or random coil
  • Categorizing news stories as finance, weather, entertainment, sports, etc.

SLIDE 31

Clustering

SLIDE 32

Cluster Analysis

  • Finding groups of objects such that the objects in a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups.

Inter-cluster distances are maximized; intra-cluster distances are minimized.

SLIDE 33

Applications of Cluster Analysis

  • Understanding
    – Group related documents for browsing, group genes and proteins that have similar functionality, or group stocks with similar price fluctuations.
  • Summarization
    – Reduce the size of large data sets.

Discovered clusters of stocks and their industry groups:

1. Applied-Matl-DOWN, Bay-Network-Down, 3-COM-DOWN, Cabletron-Sys-DOWN, CISCO-DOWN, HP-DOWN, DSC-Comm-DOWN, INTEL-DOWN, LSI-Logic-DOWN, Micron-Tech-DOWN, Texas-Inst-Down, Tellabs-Inc-Down, Natl-Semiconduct-DOWN, Oracl-DOWN, SGI-DOWN, Sun-DOWN → Technology1-DOWN
2. Apple-Comp-DOWN, Autodesk-DOWN, DEC-DOWN, ADV-Micro-Device-DOWN, Andrew-Corp-DOWN, Computer-Assoc-DOWN, Circuit-City-DOWN, Compaq-DOWN, EMC-Corp-DOWN, Gen-Inst-DOWN, Motorola-DOWN, Microsoft-DOWN, Scientific-Atl-DOWN → Technology2-DOWN
3. Fannie-Mae-DOWN, Fed-Home-Loan-DOWN, MBNA-Corp-DOWN, Morgan-Stanley-DOWN → Financial-DOWN
4. Baker-Hughes-UP, Dresser-Inds-UP, Halliburton-HLD-UP, Louisiana-Land-UP, Phillips-Petro-UP, Unocal-UP, Schlumberger-UP → Oil-UP

SLIDE 34

Notion of a Cluster can be Ambiguous

How many clusters? [The same set of points can plausibly be grouped into two, four, or six clusters.]

SLIDE 35

Types of Clustering

  • A clustering is a set of clusters.
  • Important distinction between hierarchical and partitional sets of clusters.
  • Partitional Clustering
    – A division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset.
  • Hierarchical Clustering
    – A set of nested clusters organized as a hierarchical tree.
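A minimal sketch of both styles in Python (SciPy is my choice of library; the point coordinates are hypothetical): hierarchical clustering builds a tree of nested merges, and cutting that tree yields a partitional clustering in which each point belongs to exactly one cluster.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Four toy 2-D points (hypothetical; the slides' p1..p4 have no coordinates).
points = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [1.1, 0.9]])

# Hierarchical: build the tree of nested merges (the dendrogram).
tree = linkage(points, method='single')   # nearest-neighbor merging
print(tree)                               # each row records one merge

# Partitional: cut the tree into a fixed number of non-overlapping clusters.
labels = fcluster(tree, t=2, criterion='maxclust')
print(labels)                             # one cluster id per point
```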

SLIDE 36

Partitional Clustering

[Left: the original points. Right: a partitional clustering of the same points.]

SLIDE 37

Hierarchical Clustering

[Figures over points p1–p4: a traditional hierarchical clustering with its traditional dendrogram, and a non-traditional hierarchical clustering with its non-traditional dendrogram.]

SLIDE 38

Types of Clusters: Well-Separated

  • Well-Separated Clusters:

– A cluster is a set of points such that any point in a cluster is closer (or more similar) to every other point in the cluster than to any point not in the cluster.

3 well-separated clusters

SLIDE 39

Types of Clusters: Center-Based

  • Center-based
    – A cluster is a set of objects such that an object in a cluster is closer (more similar) to the “center” of its own cluster than to the center of any other cluster.
    – The center of a cluster is often a centroid, the average of all the points in the cluster, or a medoid, the most “representative” point of the cluster.

4 center-based clusters

SLIDE 40

Types of Clusters: Contiguity-Based

  • Contiguous Cluster (Nearest neighbor or Transitive)
    – A cluster is a set of points such that a point in a cluster is closer (or more similar) to one or more other points in the cluster than to any point not in the cluster.

8 contiguous clusters

SLIDE 41

Types of Clusters: Density-Based

  • Density-based
    – A cluster is a dense region of points, separated from other regions of high density by regions of low density.
    – Used when the clusters are irregular or intertwined, and when noise and outliers are present.

6 density-based clusters
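As a concrete density-based method, a minimal DBSCAN sketch (DBSCAN and scikit-learn are my choices; the slides name no specific algorithm):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense blobs plus one far-away outlier (hypothetical data).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (20, 2)),
               rng.normal(3, 0.1, (20, 2)),
               [[10.0, 10.0]]])

# Points with enough close neighbors form dense regions (clusters);
# the isolated point gets the label -1, i.e. noise.
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
print(labels)
```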

SLIDE 42

Types of Clusters: Conceptual Clusters

  • Shared Property or Conceptual Clusters
    – Finds clusters that share some common property or represent a particular concept.

2 overlapping circles

SLIDE 43

Clustering Algorithms

  • K-means and its variants
  • Density-based clustering
  • Hierarchical clustering
SLIDE 44

K-means clustering

  • Partitional clustering approach.
  • Each cluster is associated with a centroid (center point)
  • Each point is assigned to the cluster with the closest centroid
  • Number of clusters, K, must be specified
  • The basic algorithm is very simple
SLIDE 45

K-means Clustering: Details

  • Initial centroids are often chosen randomly.
    – Clusters produced vary from one run to another.
  • The centroid is (typically) the mean of the points in the cluster.
  • ‘Closeness’ is measured by Euclidean distance, cosine similarity, correlation, etc.
  • K-means will converge for the common similarity measures mentioned above.
  • Most of the convergence happens in the first few iterations.
    – Often the stopping condition is changed to ‘until relatively few points change clusters’.
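Putting the bullets above together, a minimal NumPy sketch of the basic algorithm (random initial centroids, Euclidean closeness, centroids recomputed as means; the toy data and iteration cap are my assumptions):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Basic K-means: random initial centroids, Euclidean distance,
    centroids recomputed as cluster means."""
    rng = np.random.default_rng(seed)
    # Pick k distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assign each point to the cluster with the closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned points
        # (keep the old centroid if a cluster ends up empty).
        new_centroids = np.array([X[labels == j].mean(axis=0)
                                  if np.any(labels == j) else centroids[j]
                                  for j in range(k)])
        if np.allclose(new_centroids, centroids):   # converged
            break
        centroids = new_centroids
    return labels, centroids

# Toy data: two blobs in 2-D.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.2, (30, 2)), rng.normal(2, 0.2, (30, 2))])
labels, centroids = kmeans(X, k=2)
print(centroids)
```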

SLIDE 46

Two different K-means Clusterings

[Three scatter plots of the same points: the original points, an optimal clustering, and a sub-optimal clustering found by K-means.]

SLIDE 47

An Example

SLIDE 48–53

[Scatter plots, one slide per iteration (1–6): K-means cluster assignments on a 2-D point set, converging over the iterations.]

SLIDE 54

Evaluating K-means Clusters

  • The most common measure is the Sum of Squared Error (SSE):

$$\mathrm{SSE} = \sum_{i=1}^{K} \sum_{x \in C_i} \mathrm{dist}^2(m_i, x)$$

    – For each point, the error is the distance to the nearest cluster center.
    – To get the SSE, we square these errors and sum them.
    – $x$ is a data point in cluster $C_i$ and $m_i$ is the representative point for cluster $C_i$.
    – It can be shown that $m_i$ corresponds to the center (mean) of the cluster.
    – Given two clusterings, we can choose the one with the smallest error.
    – One easy way to reduce the SSE is to increase K, the number of clusters.
  • A good clustering with a smaller K can have a lower SSE than a poor clustering with a higher K.
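A short sketch of the SSE computation (the four points, labels, and centroids are hypothetical):

```python
import numpy as np

def sse(X, labels, centroids):
    """SSE: sum over clusters i, sum over x in C_i, of dist(m_i, x)^2."""
    diffs = X - centroids[labels]        # vector from each point's centroid
    return float(np.sum(diffs ** 2))

# Tiny example: two clusters of two points each.
X = np.array([[0.0, 0.0], [0.0, 2.0], [5.0, 0.0], [5.0, 2.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[0.0, 1.0], [5.0, 1.0]])
print(sse(X, labels, centroids))         # 4 points, each at distance 1 -> 4.0
```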

SLIDE 55

Importance of Choosing Initial Centroids

[Scatter plots, iterations 1–5: starting from a different random initialization, K-means converges to a different clustering of the same points.]

SLIDE 56

Limitations of K-means

  • K-means has problems when clusters have differing
    – Sizes
    – Densities
    – Non-globular shapes
  • K-means has problems when the data contains outliers.
SLIDE 57

Limitations of K-means: Differing Sizes

[Left: original points. Right: K-means (3 clusters).]

SLIDE 58

Limitations of K-means: Differing Density

[Left: original points. Right: K-means (3 clusters).]

SLIDE 59

Limitations of K-means: Non-globular Shapes

[Left: original points. Right: K-means (2 clusters).]

SLIDE 60

Overcoming K-means Limitations

[Left: original points. Right: K-means clusters.]

One solution is to use many clusters: find parts of the natural clusters, then put them back together.

SLIDE 61

Overcoming K-means Limitations

[Left: original points. Right: K-means clusters.]

SLIDE 62

Overcoming K-means Limitations

[Left: original points. Right: K-means clusters.]

SLIDE 63

Acknowledgements

Most of the slides are borrowed from “Introduction to Data Mining” by Tan, Steinbach, and Kumar, and from Andrew Ng’s Machine Learning course.

SLIDE 64

Some Standard Text Books

  • Machine Learning by Tom Mitchell.
  • Introduction to Data Mining by Tan, Steinbach, and Kumar.
  • Pattern Recognition and Machine Learning by Christopher Bishop.
  • Machine Learning: A Probabilistic Perspective by Kevin Murphy.
  • Deep Learning by Goodfellow, Bengio, and Courville.
SLIDE 65

Thank You!

Slides will be available online: http://sarathchandar.in/mlw.html