Data Mining
Practical Machine Learning Tools and Techniques
Slides for Chapter 2 of Data Mining by I. H. Witten, E. Frank and M. A. Hall
Input: Concepts, instances, attributes
Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2)
♦ Classification, association, clustering, numeric prediction
♦ Relations, flat files, recursion
♦ Nominal, ordinal, interval, ratio
♦ ARFF, attributes, missing values, getting to know data
♦ Concepts: kinds of things that can be learned
♦ Instances: the individual, independent examples of a concept
♦ Attributes: measuring aspects of an instance
♦ Classification learning:
predicting a discrete class
♦ Association learning:
detecting associations between features
♦ Clustering:
grouping similar instances into clusters
♦ Numeric prediction:
predicting a numeric quantity
♦ Classification learning is supervised: the scheme is provided with the actual outcome
♦ Can predict any attribute’s value, not just the class, and more than one attribute’s value at a time
♦ Hence: far more association rules than classification rules
♦ Thus: constraints are necessary
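Two common constraints are minimum coverage and minimum accuracy. A minimal sketch of how one candidate rule could be scored against a toy weather table (the table and the `rule_stats` helper are illustrative, assuming the usual definitions: coverage = number of instances matching the antecedent, accuracy = fraction of those that also match the consequent):

```python
# Toy weather data (illustrative subset, one dict per instance)
weather = [
    {"outlook": "sunny", "temperature": "hot", "humidity": "high", "windy": False, "play": "no"},
    {"outlook": "sunny", "temperature": "hot", "humidity": "high", "windy": True, "play": "no"},
    {"outlook": "overcast", "temperature": "hot", "humidity": "high", "windy": False, "play": "yes"},
    {"outlook": "rainy", "temperature": "mild", "humidity": "high", "windy": False, "play": "yes"},
]

def rule_stats(data, antecedent, consequent):
    """Return (coverage, accuracy) of the rule antecedent -> consequent."""
    covered = [r for r in data if all(r[a] == v for a, v in antecedent.items())]
    correct = [r for r in covered if all(r[a] == v for a, v in consequent.items())]
    coverage = len(covered)
    accuracy = len(correct) / coverage if coverage else 0.0
    return coverage, accuracy

# Candidate rule: if outlook = sunny then play = no
print(rule_stats(weather, {"outlook": "sunny"}, {"play": "no"}))  # (2, 1.0)
```

A rule miner would enumerate many such candidates and keep only those above the coverage and accuracy thresholds.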
♦ Clustering: the class of an example is not known
      Sepal length  Sepal width  Petal length  Petal width  Type
1     5.1           3.5          1.4           0.2          Iris setosa
2     4.9           3.0          1.4           0.2          Iris setosa
…     …             …            …             …            …
51    7.0           3.2          4.7           1.4          Iris versicolor
52    6.4           3.2          4.5           1.5          Iris versicolor
…     …             …            …             …            …
101   6.3           3.3          6.0           2.5          Iris virginica
102   5.8           2.7          5.1           1.9          Iris virginica
…     …             …            …             …            …
♦ Numeric prediction is supervised: the scheme is being provided with the target value
Outlook    Temperature  Humidity  Windy  Play-time
Sunny      Hot          High      False  5
Sunny      Hot          High      True   0
Overcast   Hot          High      False  55
Rainy      Mild         Normal    False  40
…          …            …         …      …
[Family tree: Peter (M) = Peggy (F), children Steven (M), Graham (M), Pam (F);
Grace (F) = Ray (M), children Ian (M), Pippa (F), Brian (M);
Pam (F) = Ian (M), children Anna (F), Nikki (F)]
Name    Gender  Parent1  Parent2
Peter   Male    ?        ?
Peggy   Female  ?        ?
Steven  Male    Peter    Peggy
Graham  Male    Peter    Peggy
Pam     Female  Peter    Peggy
Ian     Male    Grace    Ray
Pippa   Female  Grace    Ray
Brian   Male    Grace    Ray
Anna    Female  Pam      Ian
Nikki   Female  Pam      Ian
First person  Second person  Sister of?
Peter         Peggy          No
Peter         Steven         No
…             …              …
Steven        Peter          No
Steven        Graham         No
Steven        Pam            Yes
…             …              …
Ian           Pippa          Yes
…             …              …
Anna          Nikki          Yes
…             …              …
Nikki         Anna           Yes

First person  Second person  Sister of?
Steven        Pam            Yes
Graham        Pam            Yes
Ian           Pippa          Yes
Brian         Pippa          Yes
Anna          Nikki          Yes
Nikki         Anna           Yes
All the rest                 No
Closed-world assumption
        First person                      Second person                   Sister
Name    Gender  Parent1  Parent2   Name    Gender  Parent1  Parent2      of?
Steven  Male    Peter    Peggy     Pam     Female  Peter    Peggy        Yes
Graham  Male    Peter    Peggy     Pam     Female  Peter    Peggy        Yes
Ian     Male    Grace    Ray       Pippa   Female  Grace    Ray          Yes
Brian   Male    Grace    Ray       Pippa   Female  Grace    Ray          Yes
Anna    Female  Pam      Ian       Nikki   Female  Pam      Ian          Yes
Nikki   Female  Pam      Ian       Anna    Female  Pam      Ian          Yes
All the rest                                                             No
If second person’s gender = female
and first person’s parent = second person’s parent
then sister-of = yes
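The rule above can be sketched in code over the flat family table, with the closed-world assumption applied by declaring every pair the rule does not derive to be "no" (names are taken from the family-tree example; the `people` dict and `sister_of` helper are illustrative):

```python
# name -> (gender, parent1, parent2); "?" marks unknown parents
people = {
    "Peter": ("male", "?", "?"), "Peggy": ("female", "?", "?"),
    "Steven": ("male", "Peter", "Peggy"), "Graham": ("male", "Peter", "Peggy"),
    "Pam": ("female", "Peter", "Peggy"),
    "Ian": ("male", "Grace", "Ray"), "Pippa": ("female", "Grace", "Ray"),
    "Brian": ("male", "Grace", "Ray"),
    "Anna": ("female", "Pam", "Ian"), "Nikki": ("female", "Pam", "Ian"),
}

def sister_of(first, second):
    """The learned rule: second person is female and both share known parents."""
    gender2, p1, p2 = people[second]
    _, q1, q2 = people[first]
    return (first != second and gender2 == "female"
            and p1 != "?" and (p1, p2) == (q1, q2))

# Closed-world assumption: anything not in this list is "no"
pairs = [(a, b) for a in people for b in people if sister_of(a, b)]
```

Enumerating `pairs` yields exactly the six "yes" rows of the compact table: (Steven, Pam), (Graham, Pam), (Ian, Pippa), (Brian, Pippa), (Anna, Nikki), (Nikki, Anna).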
♦ Several relations are joined together to make one
♦ Example: concept of nuclear-family
♦ Possible issue: denormalization may produce spurious regularities that reflect the database structure (e.g. “supplier” predicts “supplier address”)
        First person                      Second person
Name    Gender  Parent1  Parent2   Name    Gender  Parent1  Parent2   Ancestor?
Peter   Male    ?        ?         Steven  Male    Peter    Peggy     Yes
Peter   Male    ?        ?         Pam     Female  Peter    Peggy     Yes
Peter   Male    ?        ?         Anna    Female  Pam      Ian       Yes
Peter   Male    ?        ?         Nikki   Female  Pam      Ian       Yes
Pam     Female  Peter    Peggy     Nikki   Female  Pam      Ian       Yes
Grace   Female  ?        ?         Ian     Male    Grace    Ray       Yes
Other positive examples here                                          Yes
All the rest                                                          No
♦ Appropriate techniques are known as inductive logic programming (e.g. Quinlan’s FOIL)
♦ Problems: (a) noise and (b) computational complexity
If person1 is a parent of person2
then person1 is an ancestor of person2

If person1 is a parent of person2
and person2 is an ancestor of person3
then person1 is an ancestor of person3
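A minimal sketch of the recursive definition above, using a parent relation drawn from the family-tree example (the `parents` dict and `is_ancestor` helper are illustrative):

```python
# child -> list of parents, from the family-tree example
parents = {
    "Steven": ["Peter", "Peggy"], "Graham": ["Peter", "Peggy"],
    "Pam": ["Peter", "Peggy"], "Ian": ["Grace", "Ray"],
    "Pippa": ["Grace", "Ray"], "Brian": ["Grace", "Ray"],
    "Anna": ["Pam", "Ian"], "Nikki": ["Pam", "Ian"],
}

def is_ancestor(p1, p3):
    """p1 is an ancestor of p3 if p1 is a parent of p3, or p1 is an
    ancestor of one of p3's parents (equivalent to the recursive rule)."""
    ps = parents.get(p3, [])
    return p1 in ps or any(is_ancestor(p1, p2) for p2 in ps)

print(is_ancestor("Peter", "Nikki"))  # True (Peter -> Pam -> Nikki)
```

This is exactly the kind of definition that cannot be expressed over a single flat table, but falls out naturally once the relation itself is available.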
♦ All instances are described by the same attributes
♦ One or more instances within an example may be responsible for the example’s classification
♦ e.g. drug activity prediction
♦ Possible solution: “irrelevant value” flag
♦ Nominal, ordinal, interval and ratio
♦ Values themselves serve only as labels or names
♦ Nominal comes from the Latin word for name
♦ Values: “sunny”, “overcast”, and “rainy”
♦ Values: “hot” > “mild” > “cool”
♦ Zero point is not defined!
♦ Distance between an object and itself is zero
♦ All mathematical operations are allowed
♦ Answer depends on scientific knowledge (e.g. Fahrenheit knew no lower limit to temperature)
♦ But: “enumerated” and “discrete” imply order
♦ But: “continuous” implies mathematical continuity
♦ Dimensional considerations
(i.e. expressions must be dimensionally correct)
♦ Circular orderings
(e.g. degrees in compass)
♦ Partial orderings
(e.g. generalization/specialization relations)
♦ Differences: styles of record keeping, conventions, time periods, data aggregation, primary keys, errors
♦ Data must be assembled, integrated, cleaned up
♦ “Data warehouse”: consistent point of access
%
% ARFF file for weather data with some numeric features
%
@relation weather

@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {true, false}
@attribute play? {yes, no}

@data
sunny, 85, 85, false, no
sunny, 80, 90, true, no
...
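A minimal sketch of how such a file could be read in code (hand-rolled for illustration; a real application would use an existing library such as liac-arff or SciPy's `scipy.io.arff`):

```python
def parse_arff(text):
    """Very small ARFF reader: returns (attribute names, data rows)."""
    attributes, data = [], []
    in_data = False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        low = line.lower()
        if low.startswith("@attribute"):
            # "@attribute outlook {sunny, ...}" -> keep the name only
            attributes.append(line.split(None, 2)[1])
        elif low.startswith("@data"):
            in_data = True
        elif in_data:
            data.append([v.strip() for v in line.split(",")])
    return attributes, data

sample = """\
% ARFF file for weather data
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute play? {yes, no}
@data
sunny, 85, no
"""
attrs, rows = parse_arff(sample)
print(attrs)  # ['outlook', 'temperature', 'play?']
print(rows)   # [['sunny', '85', 'no']]
```

This sketch ignores quoting, sparse instances, and relational attributes; it is only meant to show the structure of the format.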
♦ String attributes: similar to nominal attributes, but the list of values is not pre-specified
♦ Date attributes: use the ISO-8601 combined date and time format
@attribute description string
@attribute today date
♦ The value of a relational attribute is a separate set of instances
♦ A nested attribute block gives the structure of the referenced instances
@attribute bag relational
  @attribute outlook { sunny, overcast, rainy }
  @attribute temperature numeric
  @attribute humidity numeric
  @attribute windy { true, false }
@end bag
%
% Multiple instance ARFF file for the weather data
%
@relation weather

@attribute bag_ID { 1, 2, 3, 4, 5, 6, 7 }
@attribute bag relational
  @attribute outlook {sunny, overcast, rainy}
  @attribute temperature numeric
  @attribute humidity numeric
  @attribute windy {true, false}
@end bag
@attribute play? {yes, no}

@data
1, "sunny, 85, 85, false\nsunny, 80, 90, true", no
2, "overcast, 83, 86, false\nrainy, 70, 96, false", yes
...
♦ E.g.: word counts in a text categorization problem
0, 26, 0, 0, 0, 0, 63, 0, 0, 0, “class A”
0, 0, 0, 42, 0, 0, 0, 0, 0, 0, “class B”

In sparse format:

{1 26, 6 63, 10 “class A”}
{3 42, 10 “class B”}
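A small sketch of the conversion: the sparse form lists only the non-zero values as "index value" pairs, with 0-based indices (the `to_sparse` helper is illustrative):

```python
def to_sparse(values):
    """Render a dense instance in sparse ARFF style: only non-zero entries."""
    inner = ", ".join(f"{i} {v}" for i, v in enumerate(values) if v != 0)
    return "{" + inner + "}"

row = [0, 26, 0, 0, 0, 0, 63, 0, 0, 0, '"class A"']
print(to_sparse(row))  # {1 26, 6 63, 10 "class A"}
```

For word counts in text categorization, where almost every entry is zero, this representation can shrink the file dramatically.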
♦ Numeric attributes are interpreted as ordinal scales if less-than and greater-than comparisons are used, and as ratio scales if distance calculations are performed (normalization/standardization may be required)
♦ Instance-based schemes define distance between nominal values (0 if values are equal, 1 otherwise)
♦ Attribute “age” can be treated as ordinal (e.g. “young” < “pre-presbyopic” < “presbyopic”)
If age = young and astigmatic = no and tear production rate = normal
then recommendation = soft

If age = pre-presbyopic and astigmatic = no and tear production rate = normal
then recommendation = soft

If age ≤ pre-presbyopic and astigmatic = no and tear production rate = normal
then recommendation = soft
♦ Types: unknown, unrecorded, irrelevant
♦ Reasons: malfunctioning equipment, changes in experimental design, collation of different datasets, measurement not possible
♦ A missing value may have significance in itself
♦ Most schemes assume that is not the case: “missing” may need to be coded as an additional value
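A minimal sketch of that coding step, assuming missing entries arrive as `None`: each one is replaced by an explicit extra nominal value so that a scheme with no special missing-value handling still sees a legal value.

```python
# Humidity readings with two missing entries (None)
raw = ["high", None, "normal", "high", None]

# Code "missing" as an additional nominal value
coded = [v if v is not None else "missing" for v in raw]
print(coded)  # ['high', 'missing', 'normal', 'high', 'missing']
```

Whether this is appropriate depends on the first point above: if missingness itself carries information, the extra value lets the scheme exploit it; if not, it may mislead.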
♦ Reason: data has typically not been collected for the purpose of mining
♦ Result: errors and omissions that don’t affect the original purpose of the data (e.g. age of customer)
♦ Typographical errors in nominal attributes ⇒ values need to be checked for consistency
♦ Nominal attributes: histograms
(Distribution consistent with background knowledge?)
♦ Numeric attributes: graphs
(Any obvious outliers?)
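Both checks can be sketched in a few lines, assuming a nominal column as a list of strings and a numeric column as a list of numbers (the sample data and the two-standard-deviations outlier rule are illustrative):

```python
from collections import Counter
import statistics

# Nominal attribute: a histogram of value counts
outlook = ["sunny", "sunny", "overcast", "rainy", "rainy", "rainy", "overcast"]
print(Counter(outlook))  # is the distribution plausible?

# Numeric attribute: a crude outlier check (720 is a likely typo for 72.0)
temps = [83, 70, 68, 64, 69, 75, 720]
mean, sd = statistics.mean(temps), statistics.stdev(temps)
outliers = [t for t in temps if abs(t - mean) > 2 * sd]
print(outliers)
```

Even checks this simple often surface the record-keeping errors and typos described in the slides before any mining is attempted.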