
Data Mining. Machine Intelligence, Thomas D. Nielsen, September 2008 (PowerPoint presentation).



  1. Data Mining. Machine Intelligence. Thomas D. Nielsen, September 2008.

  2.-4. What is Data Mining? (Introduction: section title, shown over three animation build steps.)

  5.-8. What is Data Mining? Data Mining in practice.
     [Diagram, built up over four steps: real-life data is preprocessed and fed to an off-the-shelf algorithm, which is adapted to the problem; the result is evaluated and the cycle iterates. The algorithm side relies on general algorithmic methods, the data side on data/domain-specific operations.]

  9. What is Data Mining? An overview.
     [Diagram: labeled data plus supervised learning leads to classification (predictive modeling); unlabeled data plus unsupervised learning leads to clustering (descriptive modeling) and to rule mining / association analysis.]

  10. Classification: A high-level view.
     [Diagram: an input enters a box labeled "Classifier", which outputs the decision Spam: yes/no.]

  11. Classification: A high-level view.
     [Diagram: attribute inputs SubAllCap (yes/no), TrustSend (yes/no), InvRet (yes/no), Body'adult' (yes/no), Body'zambia' (yes/no) feed into the Classifier, which outputs Spam: yes/no.]

  12. Classification: A high-level view.
     [Diagram: inputs Cell-1, Cell-2, Cell-3, ..., Cell-324, each with values 1..64, feed into the Classifier, which outputs Symbol: A..Z, 0..9.]

  13. Classification: Labeled Data.
     Rows are instances (cases, examples); columns are attributes (features, predictor variables) plus the class variable (target variable).

     SubAllCap  TrustSend  InvRet  ...  B'zambia' | Spam
     y          n          n       ...  n         | y
     n          n          n       ...  n         | n
     n          y          n       ...  n         | y
     n          n          n       ...  n         | n
     ...

     Cell-1  Cell-2  Cell-3  ...  Cell-324 | Symbol
     1       1       4       ...  12       | B
     1       1       1       ...  3        | 1
     34      37      43      ...  22       | Z
     1       1       1       ...  7        | 0
     ...

     (In principle, any attribute can become the designated class variable.)

  14. Classification: Classification in general.
     Attributes: variables A_1, A_2, ..., A_n (discrete or continuous).
     Class variable: variable C, always discrete: states(C) = {c_1, ..., c_l} (the set of class labels).
     A (complete-data) classifier is a mapping C: states(A_1, ..., A_n) → states(C).
     A classifier able to handle incomplete data provides mappings C: states(A_{i_1}, ..., A_{i_k}) → states(C) for subsets {A_{i_1}, ..., A_{i_k}} of {A_1, ..., A_n}.
     A classifier partitions the attribute-value space (also: instance space) into subsets labeled with class labels.
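Viewed as code, a complete-data classifier is nothing more than a function from an attribute assignment to one of the class labels. A minimal Python sketch using the spam attributes from the slides above; the decision rule itself is invented for illustration:

```python
def classify_spam(instance):
    """Map a point in attribute-value space to a class label in states(C)."""
    # A hand-written stand-in for a learned classifier C.
    if instance["Body'zambia'"] == "y" and instance["TrustSend"] == "n":
        return "y"   # spam
    return "n"       # not spam

print(classify_spam({"SubAllCap": "y", "TrustSend": "n",
                     "InvRet": "n", "Body'zambia'": "y"}))  # y
```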

  15. Classification: Iris dataset.
     Measurements of petal width/length and sepal width/length for 150 flowers of 3 different species of Iris. First reported in: Fisher, R.A., "The use of multiple measurements in taxonomic problems", Annals of Eugenics 7 (1936).

     SL   SW   PL   PW  | Species
     5.1  3.5  1.4  0.2 | Setosa
     4.9  3.0  1.4  0.2 | Setosa
     6.3  2.9  6.0  2.1 | Virginica
     6.3  2.5  4.9  1.5 | Versicolor
     ...

     (SL/SW: sepal length/width; PL/PW: petal length/width. Species is the class variable.)
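This dataset ships with scikit-learn, so it is easy to inspect directly (assuming scikit-learn is installed; its feature columns happen to be in the SL, SW, PL, PW order used on the slide):

```python
from sklearn.datasets import load_iris

iris = load_iris()
print(iris.data.shape)       # (150, 4)
print(iris.target_names)     # ['setosa' 'versicolor' 'virginica']
print(iris.data[0])          # [5.1 3.5 1.4 0.2] -- the slide's first row
```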

  16. Classification: Labeled data in instance space.
     [Scatter plot of the labeled Iris data in instance space.]

  17. Classification: Labeled data in instance space.
     [Same plot with the partition defined by a classifier overlaid: regions labeled Virginica, Versicolor, and Setosa.]

  18. Classification: Decision Regions.
     [Three panels: axis-parallel linear decision regions, e.g. Decision Trees; piecewise linear, e.g. Naive Bayes; nonlinear, e.g. Neural Network.]

  19. Classification: Classifiers differ in ...
     - the model space: the types of partitions and their representation;
     - how they compute the class label corresponding to a point in instance space (the actual classification task);
     - how they are learned from data.
     Some important types of classifiers: decision trees, the Naive Bayes classifier, other probabilistic classifiers (TAN, ...), neural networks, k-nearest neighbors.

  20. Decision Trees: Example.
     Attributes: height ∈ [0, 2.5], sex ∈ {m, f}. Class labels: {tall, short}.
     [Left: partition of the instance space. Right: its representation by a decision tree. The root tests sex; on the f branch, height h < 1.7 gives short and h ≥ 1.7 gives tall; on the m branch, h < 1.8 gives short and h ≥ 1.8 gives tall. A runnable rendering follows below.]
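The example tree is small enough to write out as nested tests; this is just a transcription of the tree described above:

```python
def classify(sex, height):
    """The slide's tree as nested tests: root on sex, then a height threshold."""
    if sex == "f":
        return "short" if height < 1.7 else "tall"
    else:  # sex == "m"
        return "short" if height < 1.8 else "tall"

print(classify("f", 1.65))  # short
print(classify("m", 1.85))  # tall
```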

  21. Decision Trees.
     A decision tree is a tree
     - whose internal nodes are labeled with attributes,
     - whose leaves are labeled with class labels,
     - whose edges going out from a node labeled with attribute A are labeled with subsets of states(A), such that all labels combined form a partition of states(A).
     Possible partitions:
     - states(A) = R: ]−∞, 2.3[, [2.3, ∞[ or ]−∞, 1.9[, [1.9, 3.5[, [3.5, ∞[
     - states(A) = {a, b, c}: {a}, {b}, {c} or {a, b}, {c}

  22. Decision Trees: Decision tree classification.
     Each point in the instance space is sorted into a leaf by the decision tree; it is then classified according to the class label at that leaf.
     [Figure: the tree from slide 20. The instance [m, 1.85] follows the m branch, where 1.85 ≥ 1.8 leads to the leaf tall, so C([m, 1.85]) = tall.]
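A generic tree walk makes the "sorted into a leaf" step concrete. The nested-tuple encoding below is one possible representation, not the lecture's; the tree is the one from the example slide:

```python
# An internal node is an (attribute, branches) pair; a leaf is a label string.
tree = ("sex", {
    "f": ("height<1.7", {True: "short", False: "tall"}),
    "m": ("height<1.8", {True: "short", False: "tall"}),
})

def sort_into_leaf(node, instance):
    """Follow the edges matching the instance until a leaf is reached."""
    if isinstance(node, str):                  # leaf: the class label
        return node
    attribute, branches = node
    if "<" in attribute:                       # numeric test, e.g. height<1.8
        name, threshold = attribute.split("<")
        return sort_into_leaf(branches[instance[name] < float(threshold)], instance)
    return sort_into_leaf(branches[instance[attribute]], instance)

print(sort_into_leaf(tree, {"sex": "m", "height": 1.85}))  # tall
```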

  23. Decision Trees: Learning a decision tree.
     In general, we look for a small decision tree with minimal classification error over the data set ⟨(a_1, c_1), (a_2, c_2), ..., (a_n, c_n)⟩.
     [Figure: two trees over binary attributes A and B with leaves c_1, c_2. The "bad" tree tests A at the root and then B on both branches; the "good" tree separates the same instances with a single test.]
     Note: if the data is noise-free, i.e. there are no instances (a_i, c_i), (a_j, c_j) with a_i = a_j and c_i ≠ c_j, then there always exists a decision tree with zero classification error.

  24. Decision Trees: The ID3 algorithm.
     [Figure, first construction step: a root test on attribute A with edges t and f; one branch already ends in the leaf yes, the other is an open node X.]

  25. Decision Trees: The ID3 algorithm.
     [Figure, next construction step: the open node has been replaced by a test on attribute B, leaving two open branches.]
     Top-down construction of the decision tree. For an "open" node X:
     - Let D(X) be the instances that can reach X.
     - If all instances agree on the class c, then label X with c and make it a leaf.
     - Otherwise, find the best attribute A and partition of states(A), replace X with A, and make an outgoing edge from A for each member of the partition.
     (A code sketch follows below.)
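A compact recursive reading of this procedure for discrete attributes, sketched in Python; it is one plausible rendering, not the lecture's own implementation. The score function anticipates the expected-entropy score of the next slides, and id3 expects instances as (attribute-dict, label) pairs:

```python
import math
from collections import Counter

def entropy(labels):
    """H = -sum_c p(c) log2 p(c) over the class labels in a data subset."""
    n = len(labels)
    return -sum(k / n * math.log2(k / n) for k in Counter(labels).values())

def score(instances, attribute):
    """Negative expected entropy of splitting on attribute (higher is better)."""
    groups = {}
    for inst, c in instances:
        groups.setdefault(inst[attribute], []).append(c)
    n = len(instances)
    return -sum(len(g) / n * entropy(g) for g in groups.values())

def id3(instances, attributes):
    """instances: list of (attribute-dict, class-label) pairs, i.e. D(X).
    Returns a class label (leaf) or an (attribute, {value: subtree}) node."""
    labels = [c for _, c in instances]
    if len(set(labels)) == 1:           # all instances agree on the class
        return labels[0]
    if not attributes:                  # no attribute left: majority label
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: score(instances, a))
    rest = [a for a in attributes if a != best]
    return (best, {v: id3([(i, c) for i, c in instances if i[best] == v], rest)
                   for v in {i[best] for i, _ in instances}})

# Tiny check on four instances over binary attributes A and B:
data = [({"A": "t", "B": "t"}, "c1"), ({"A": "t", "B": "f"}, "c2"),
        ({"A": "f", "B": "t"}, "c2"), ({"A": "f", "B": "f"}, "c2")]
print(id3(data, ["A", "B"]))  # e.g. ('A', {'t': ('B', {'t': 'c1', 'f': 'c2'}), 'f': 'c2'})
```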

  26. Decision Trees: Notes.
     The exact algorithm is formulated as a recursive procedure. One can modify it by providing weaker termination conditions (necessary for noisy data):
     - If <some other termination condition applies>, turn X into a leaf with <most appropriate class label>.

  27. Decision Trees: Scoring new partitions.
     [Figure: a tree fragment with a node B whose t edge ends in the leaf c_1 and whose f edge ends in an open node X with associated data D(X).]

  28. Decision Trees: Scoring new partitions.
     For each candidate attribute A with partition a_1, a_2, a_3 of states(A):
     [Figure: replacing the open node X by a test on A splits D(X) into D(X_1), D(X_2), D(X_3) along the edges a_1, a_2, a_3.]
     Let p_i(c) be the relative frequency of class label c in D(a_i). The measure of uniformity of the class-label distribution in D(X_i) is the entropy
        H_{X_i} := − Σ_{c ∈ states(C)} p_i(c) · log₂ p_i(c).
     The score of the new partition is the negative expected entropy:
        Score(A, a_1, a_2, a_3) := − Σ_{i=1}^{3} (|D(X_i)| / |D(X)|) · H_{X_i}.
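To make the formulas concrete, here is a small numeric check in Python on an invented D(X) of eight instances split three ways; only the middle part is impure:

```python
import math
from collections import Counter

def H(labels):
    """Entropy -sum_c p_i(c) log2 p_i(c) of the labels in one D(X_i)."""
    n = len(labels)
    return -sum(k / n * math.log2(k / n) for k in Counter(labels).values())

# D(X) split into D(X_1), D(X_2), D(X_3) by a three-valued attribute:
parts = [["y", "y", "y"], ["y", "n"], ["n", "n", "n"]]
n = sum(len(p) for p in parts)
score = -sum(len(p) / n * H(p) for p in parts)
print(score)  # -0.25: only D(X_2) is impure (H = 1 bit, weight 2/8)
```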

  29. Decision Trees: Searching for partitions.
     When trying attribute A, look for the partition of states(A) with the highest score. In practice one can try all choices for A, but not all partitions of states(A). Therefore:
     - For states(A) = R: only consider partitions of the form ]−∞, r[, [r, ∞[, and pick the threshold r with minimal expected entropy. Example:
          A: 1  3  4  6  10  12  17  18  22  25
          C: y  y  y  n  n   y   y   y   n   n
     - For states(A) = {a_1, ..., a_k}: only consider the partition {a_1}, ..., {a_k}.
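A direct implementation of this threshold search, sketched in Python on the slide's example data (the helper names are mine):

```python
import math
from collections import Counter

def H(labels):
    """Entropy of a class-label multiset."""
    n = len(labels)
    return -sum(k / n * math.log2(k / n) for k in Counter(labels).values())

def best_threshold(values, labels):
    """Try each cut ]-inf, r[ / [r, inf[ between consecutive distinct values
    and return (expected entropy, r) for the best cut."""
    pairs = sorted(zip(values, labels))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                          # no cut between equal values
        r = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [c for v, c in pairs if v < r]
        right = [c for v, c in pairs if v >= r]
        exp_H = (len(left) * H(left) + len(right) * H(right)) / len(pairs)
        if best is None or exp_H < best[0]:
            best = (exp_H, r)
    return best

# The slide's example:
A = [1, 3, 4, 6, 10, 12, 17, 18, 22, 25]
C = ["y", "y", "y", "n", "n", "y", "y", "y", "n", "n"]
print(best_threshold(A, C))  # (0.649..., 20.0): best cut lies between 18 and 22
```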

  30. Decision Trees: Decision boundaries revisited.
     [Figure: decision-tree decision boundaries in the instance space.]

  31. Decision Trees: Attributes with many values.
     The expected-entropy measure favors attributes with many values: for example, an attribute Date (with the possible dates as states) will have a very low expected entropy but is unable to generalize!
     One approach for avoiding this problem is to select attributes based on the gain ratio:
        GainRatio(D, A) = Score(D, A) / H_A, where H_A = − Σ_{a ∈ states(A)} p(a) · log₂ p(a)
     and p(a) is the relative frequency of A = a in D.
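The slide divides the split score by H_A; in Quinlan's standard formulation the numerator is the information gain (the entropy of C minus the expected entropy after the split), and that reading is used in the numeric sketch below. All data values are invented:

```python
import math
from collections import Counter

def H(values):
    """Entropy -sum_v p(v) log2 p(v) of a multiset of values."""
    n = len(values)
    return -sum(k / n * math.log2(k / n) for k in Counter(values).values())

# Eight instances; Date is unique per instance, so splitting on it yields
# pure singleton groups and an expected entropy of 0 -- a 'perfect' score.
dates = [f"2008-09-{i:02d}" for i in range(1, 9)]        # invented values
classes = ["y", "y", "y", "y", "n", "n", "n", "n"]

gain_date = H(classes) - 0.0      # information gain of Date: 1 bit
print(gain_date / H(dates))       # H_Date = log2(8) = 3, so ratio ~ 0.33

# A binary attribute separating the classes equally well gets the same
# 1-bit gain, but H_A = 1 bit, so its gain ratio is 1.0: it now wins.
binary = ["a", "a", "a", "a", "b", "b", "b", "b"]
print((H(classes) - 0.0) / H(binary))                    # 1.0
```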
