Problems for Multivariate Data Analysis



  1. Riffle: an R Package for Nonmetric Clustering
     Geoffrey B. Matthews and Robin A. Matthews
     Western Washington University, Bellingham, WA, USA

     Problems for Multivariate Data Analysis
     • Censored data.
       – Tied ranks and reduced variance when “< 5” ⇒ “5”.
       – Systematic bias when omitted.
     • Missing data.
       – Omit entire row when one variable column is missing?
     • Noisy, “useless” parameters.
       – Measured anyway.
       – Can be unrelated to major patterns.
     • Dissimilar data types
       – Chemical
         ∗ pH, alkalinity
       – Physical
         ∗ temperature, percent canopy cover, sediment size, land use classes
       – Biological
         ∗ chlorophyll, sex (male, female, juvenile)
         ∗ rare species (counts 1-2)
         ∗ common species (counts 10,000-100,000)

     Riffle
     Matthews & Hearne, IEEE PAMI, 1991
     A clustering algorithm:
     • group similar points into clusters.
     A nonmetric algorithm:
     • uses only order statistics for continuous data
     • can handle both continuous and categorical data together
     Uses variables independently:
     • ignores scattered missing values
     • uses incommensurable variables without normalizing
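The censored-data problem noted above (“< 5” recorded as “5”) can be illustrated with a small R sketch. The concentrations are made-up values and the detection limit of 5 is taken from the slide's example; nothing here comes from the Riffle package itself. It also shows why order statistics need no normalizing: ranks are unchanged by a monotone rescaling such as a logarithm.

```r
# Hypothetical concentrations; three of them lie below a detection limit of 5.
conc <- c(2.1, 3.7, 4.4, 6.0, 9.5, 12.3)

# Censoring: every value reported as "< 5" is replaced by 5.
censored <- pmax(conc, 5)

# Ranks of the true values are all distinct; the censored values tie.
true_ranks     <- rank(conc)
censored_ranks <- rank(censored)   # tied values share their average rank

# The ties reduce the variance of the ranks.
var(true_ranks)
var(censored_ranks)

# Ranks depend only on order, so a monotone transform leaves them unchanged.
all.equal(rank(log(conc)), true_ranks)
```

Dropping the three censored values instead of substituting 5 would avoid the ties, but it would leave only the larger measurements, which is the systematic bias the slide mentions.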

  2. Proportional Reduction in Error
     • Measuring Predictability for Categorical Variables

              red  green  blue  Errors
       A        5      8     2       7
       B        2      3     9       5
       C        8      1     0       1
       Totals  15     12    11      13

     Errors predicting (red, green, blue) a priori: 12 + 11 = 23
     Errors predicting (red, green, blue) given (A, B, C): 7 + 5 + 1 = 13
     Proportional reduction in error: (23 − 13)/23 = 10/23
     • More meaningful and robust than, e.g., χ²

     Proportional Reduction in Error
     Independent variables:

              red  green  blue  Errors
       A        6      3     9       9
       B        4      2     6       6
       C        2      1     3       3
       Totals  12      6    18      18

     Minimum: 0/18, 0% reduction

     Perfectly predictable variables:

              red  green  blue  Errors
       A       12      0     0       0
       B        0      0    18       0
       C        0      6     0       0
       Totals  12      6    18       0

     Maximum: 18/18, 100% reduction

     Clustering with categorical variables
     • Assign clusters to maximize predictability over other variables.
     [Figure: table assigning points 1 to 11 to clusters A, B, and C, alongside cluster-by-category count tables for Variables 1, 2, and 3]

     Handling ordered variables
     • Cuts adjusted to maximize predictability of clusters
     [Figure: two scatter plots of clusters A, B, and C on x and y axes (0 to 12), each with cluster-by-interval count tables for x and y]

       First plot:
            x                 y
       A  20  10   0     A   0  10  20
       B   0  10  10     B  20   0   0
       C   0   0  10     C   0  10   0
       20/40             30/40
       ...
       Second plot:
            x                 y
       A  30   0   0     A   0   0  30
       B   0  20   0     B  20   0   0
       C   0   0  10     C   0  10   0
       30/30             30/30
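The arithmetic in the proportional-reduction-in-error example above can be checked with a short R sketch. The helper name `pre` is hypothetical and is not taken from the Riffle package; it assumes rows are the predictor categories (A, B, C) and columns are the outcomes (red, green, blue), and that each prediction is simply the most frequent outcome.

```r
# Proportional reduction in error for a table of counts.
# `pre` is a hypothetical helper name, not a Riffle function.
pre <- function(tab) {
  # Errors made by always guessing the most common outcome overall.
  prior_errors <- sum(tab) - max(colSums(tab))
  # Errors made by guessing the most common outcome within each row.
  given_errors <- sum(rowSums(tab) - apply(tab, 1, max))
  (prior_errors - given_errors) / prior_errors
}

make_table <- function(values) {
  matrix(values, nrow = 3, byrow = TRUE,
         dimnames = list(c("A", "B", "C"), c("red", "green", "blue")))
}

# The three tables from the slides.
example     <- make_table(c(5, 8, 2,   2, 3, 9,   8, 1, 0))
independent <- make_table(c(6, 3, 9,   4, 2, 6,   2, 1, 3))
perfect     <- make_table(c(12, 0, 0,  0, 0, 18,  0, 6, 0))

pre(example)      # (23 - 13) / 23
pre(independent)  # (18 - 18) / 18
pre(perfect)      # (18 - 0) / 18
```

The score is undefined when every observation falls in a single outcome column (zero a priori errors), so a real implementation would need to guard that case.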

  3. Cutting Gaussian variables
     • Generate independent Gaussians from cluster statistics µ_i, σ_i
     • Cut where max likelihood changes from one to another.
     [Figure: scatter plot of clusters A, B, and C on x and y axes (0 to 12), with Gaussian density curves along each axis]

            x                 y
       A  30   0   0     A   0   1  29
       B   1  19   0     B  20   0   0
       C   0   0  10     C   0  10   0
       28/29             30/31

     Alternative handling of Gaussian variables (EM)
     • Assign to most likely group, instead of max predictability.
     • Not used in Riffle.

     Essential Algorithm

       variables <- quantile.cuts(data)
       clusters <- seed.clusters(variables)
       score <- reduction.in.error(variables, clusters)
       while (improving(score)) {
         variables <- best.cuts(variables, clusters)
         clusters <- best.clusters(variables, clusters)
         score <- reduction.in.error(variables, clusters)
       }
       return (clusters, variables)

     Getting things started
     To find initial cuts for variables:
     • Use quantiles for cut points.
     • Use quantiles for µ_i, overall σ for σ_i.
     To find initial clusters, given cut variables:
     • Select one point randomly as seed.
     • Find other seeds by selecting points as different as possible.
     • Assign each seed to a different cluster.
     • Assign all other points to cluster of most similar seed.
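The “getting things started” steps above can be sketched in R as follows. This is only an illustration under stated assumptions, not the Riffle implementation: the function names `quantile_cut`, `mismatches`, and `seed_clusters` are hypothetical, and “as different as possible” is assumed here to mean the largest number of variables whose bins differ, which the slides do not specify.

```r
# Cut a continuous variable into k ordered bins at its quantiles.
quantile_cut <- function(x, k = 3) {
  breaks <- unique(quantile(x, probs = seq(0, 1, length.out = k + 1),
                            na.rm = TRUE))
  cut(x, breaks = breaks, include.lowest = TRUE, labels = FALSE)
}

# Assumed dissimilarity: the number of variables on which two points fall
# in different bins, skipping variables where either value is missing.
mismatches <- function(a, b) sum(a != b, na.rm = TRUE)

seed_clusters <- function(binned, k = 3) {
  n <- nrow(binned)
  # Dissimilarity from point i to each current seed.
  to_seeds <- function(i, seeds) {
    sapply(seeds, function(s) mismatches(binned[i, ], binned[s, ]))
  }
  seeds <- sample.int(n, 1)                  # first seed chosen at random
  while (length(seeds) < k) {
    nearest <- sapply(seq_len(n), function(i) min(to_seeds(i, seeds)))
    nearest[seeds] <- -1                     # never pick the same seed twice
    seeds <- c(seeds, which.max(nearest))    # point least like any seed
  }
  # Assign every point to the cluster of its most similar seed.
  clusters <- sapply(seq_len(n), function(i) which.min(to_seeds(i, seeds)))
  clusters[seeds] <- seq_along(seeds)        # each seed keeps its own cluster
  clusters
}

# Usage on made-up data with two well-separated groups.
set.seed(1)
data <- data.frame(x = c(rnorm(20, 2), rnorm(20, 10)),
                   y = c(rnorm(20, 10), rnorm(20, 2)))
binned <- sapply(data, quantile_cut, k = 3)
clusters <- seed_clusters(binned, k = 2)
table(clusters)
```

These initial clusters would then be the starting point for the alternating `best.cuts` / `best.clusters` loop in the slide's pseudocode.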

  4. Embellishments
     • Each variable is dealt with independently.
     • Each variable has a score (predictability vs. cluster).
     • Use score to eliminate variables, or rank them in importance.
     • We use this to handle the curse of dimensionality and find a small set of critical variables.
     • Data reduction

     Data Exploration vs. Confirmation
     • Clustering in general is exploratory.
     • Clustering data with known groups:
       – correlation between clusters and groups measures significance.
       – identifies important variables as the ones with high predictability.
       – determine not only significance of effect, but also which variables are affected the most.
       – we have used this to chart seasonal effects.

     Conclusion
     • We have used Riffle successfully for over 10 years for ecological and toxicological data analysis.
     • Riffle can cluster using incommensurate variables.
     • Riffle handles censored data and missing data with few assumptions.
     • Riffle can reduce complexity in highly multivariate datasets.
     • R package available 2006.
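The idea of scoring each variable against the clusters and ranking variables by that score can be sketched in R as below. The data are made up, `pre` is a hypothetical helper (proportional reduction in error, with clusters as rows and variable categories as columns), and the Riffle package may compute its scores differently.

```r
# Proportional reduction in error; `pre` is a hypothetical helper name.
pre <- function(tab) {
  prior_errors <- sum(tab) - max(colSums(tab))
  given_errors <- sum(rowSums(tab) - apply(tab, 1, max))
  (prior_errors - given_errors) / prior_errors
}

set.seed(42)
cluster <- rep(c("A", "B", "C"), each = 20)

# One variable that tracks the clusters exactly and one that is pure noise.
vars <- data.frame(
  informative = rep(c("low", "mid", "high"), each = 20),
  noisy       = sample(c("low", "mid", "high"), 60, replace = TRUE)
)
vars$informative[c(3, 27)] <- NA   # scattered missing values

# Score each variable independently; table() drops the missing entries,
# so a missing value in one variable does not discard the whole row.
scores <- sapply(vars, function(v) pre(table(cluster, v)))
sort(scores, decreasing = TRUE)
```

Variables with scores near zero are candidates for elimination, which is how the score supports the data reduction described above.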
