SLIDE 1

HPC Workload Characterization Using Feature Selection and Clustering

Jiwoo Bang, Chungyong Kim, Kesheng Wu, Alexander Sim, Suren Byna, Hyeonsang Eom

Distributed Computing Systems Laboratory, Department of Computer Science and Engineering, Seoul National University, Korea

SLIDE 2

Table of Contents

▪ Background
▪ Data Preprocessing
▪ Feature Selection for Dimension Reduction
▪ Application of Clustering Model
▪ Performance Evaluation
▪ Cluster Characterization
▪ Conclusion

SLIDES 3-4

High Performance Computing (HPC) Systems

▪ Applications running on HPC systems demand efficient storage management and high-performance computation
▪ Tunable parameters are provided for higher performance
  ▪ Number of compute nodes, stripe count, stripe size, ...

[Figure: example job configuration: 8 compute nodes, stripe count of 4, burst buffer in use]

SLIDES 5-7

Drawbacks of Deploying an HPC Environment

▪ Users are not familiar with the tunable parameters
  ▪ They use the default configurations the system provides, or the maximum available resources
▪ Some HPC applications do not meet their I/O demands
  ▪ I/O characteristics differ for each application
  ▪ I/O performance differs depending on the HPC system

[Figure: Cori default stripe size: 1 MB; Cori maximum stripe count: 248]

Understanding the different I/O demands of HPC applications is important

SLIDES 8-9

Used Dataset

▪ Real-world user log data from Oct. 2017 to Jan. 2018
  ▪ Four months of Darshan log data in total
▪ The Darshan I/O profiling tool captures the I/O behavior of applications run on the Cori system
  ▪ Darshan interacts with the Slurm workload manager and the Lustre monitoring tool
▪ A parser is used to extract meaningful information from the Darshan logs
  ▪ A total of 78 features are obtained from the parser
▪ I/O throughput (writeRateTotal) is the target variable
  ▪ HPC applications are categorized based on their I/O behavior

SLIDES 10-14

Data Preprocessing

▪ User logs with less than 1 GB of I/O are dropped
  ▪ They cannot capture the relationship between the features and the target variable
▪ Negative values are all set to zero
▪ Features with zero variance are eliminated
  ▪ Features with a constant value are not meaningful at all
▪ Features that are highly correlated with other features are eliminated
  ▪ The correlation threshold is set to 0.8
  ▪ This reduces redundancy among the feature selection results
▪ The feature data is normalized to the range 0 to 1
  ▪ The features then have the same scale and weight when processed by the feature selection methods (a pandas sketch of these steps follows below)
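These steps map onto a fairly standard pandas/scikit-learn routine. Below is a minimal sketch, assuming the parsed logs sit in a DataFrame with a bytesTotal column for the total I/O volume and writeRateTotal as the target; bytesTotal is a placeholder name for illustration, not necessarily the actual Darshan counter.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

def preprocess(df: pd.DataFrame, target: str = "writeRateTotal") -> pd.DataFrame:
    # 1. Drop logs with less than 1 GB of total I/O
    #    ("bytesTotal" is a placeholder column name).
    df = df[df["bytesTotal"] >= 2**30].copy()

    # 2. Set all negative values to zero.
    num = df.select_dtypes(include=np.number).columns
    df[num] = df[num].clip(lower=0)

    # 3. Eliminate zero-variance (constant) features, keeping the target.
    features = [c for c in num if c != target and df[c].nunique() > 1]

    # 4. Drop one feature from every pair whose absolute pairwise
    #    correlation exceeds the 0.8 threshold.
    corr = df[features].corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    drop = [c for c in upper.columns if (upper[c] > 0.8).any()]
    features = [c for c in features if c not in drop]

    # 5. Min-max normalize the remaining features to [0, 1].
    df[features] = MinMaxScaler().fit_transform(df[features])
    return df[features + [target]]
```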

SLIDE 15

Data Preprocessing

▪ Top 20 most frequently executed programs after the preprocessing step

A total of 62,946 records from 353 different applications

SLIDES 16-19

Feature Selection for Dimension Reduction

▪ Feature selection methods
  ▪ Mutual Information regression
  ▪ F Regression
  ▪ Decision Tree
  ▪ Extra Tree
  ▪ Min-max Mutual Information (the new feature selection method)
▪ For Min-max mutual information, the preprocessing step that removes highly correlated features is not applied
  ▪ Min-max mutual information itself selects features that are less correlated with each other
  ▪ The feature with the highest correlation with writeRateTotal is selected first, and the process is then repeated, favoring features with low correlation to those already selected (see the sketch below)
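The slides do not spell out the scoring rule, so the following is a hedged sketch of one way such a greedy min-max selection could be implemented, using scikit-learn's mutual_info_regression for relevance to the target and absolute Pearson correlation for redundancy; the paper's exact criterion may differ.

```python
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

def minmax_mi_select(X: pd.DataFrame, y: pd.Series, k: int = 10) -> list[str]:
    """Greedy selection: maximize relevance to y, minimize redundancy."""
    relevance = pd.Series(
        mutual_info_regression(X, y, random_state=0), index=X.columns
    )
    corr = X.corr().abs()

    selected = [relevance.idxmax()]  # start with the most relevant feature
    while len(selected) < min(k, X.shape[1]):
        remaining = [c for c in X.columns if c not in selected]
        # Penalize each candidate by its maximum correlation with the
        # already selected features (this exact scoring is an assumption).
        scores = {
            c: relevance[c] - corr.loc[c, selected].max() for c in remaining
        }
        selected.append(max(scores, key=scores.get))
    return selected
```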

SLIDE 20

Feature Selection for Dimension Reduction

▪ Analysis of feature selection results

SLIDES 21-23

Application of Clustering Model

▪ Clustering models
  ▪ KMeans clustering
  ▪ Gaussian Mixture Model
  ▪ Ward linkage clustering
▪ Cluster validity metrics
  ▪ Davies-Bouldin index (DBI)
  ▪ Silhouette score
  ▪ Combined score

For DBI, lower values mean better cluster quality; for the Silhouette and Combined scores, higher values mean better quality (see the scikit-learn example below)
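Both validity metrics are available directly in scikit-learn, as sketched below on synthetic stand-in data. The combined score is not defined on the slide, so the combination shown (silhouette minus DBI) is purely illustrative.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)  # stand-in data
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

dbi = davies_bouldin_score(X, labels)  # lower is better
sil = silhouette_score(X, labels)      # higher is better, in [-1, 1]

# The paper's combined score is not specified here; one plausible
# illustration is to reward high silhouette and low DBI together.
combined = sil - dbi
print(f"DBI={dbi:.3f}  silhouette={sil:.3f}  combined={combined:.3f}")
```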

SLIDES 24-28

Performance Evaluation

▪ Selecting the best clustering method
  ▪ The features selected by Min-max mutual information are used
    ▪ It is the most suitable feature selection method for our dataset, in which every feature is considerably correlated with the others
  ▪ The number of clusters varies from 3 to 20

KMeans and Ward linkage show high cluster performance; the performance is highest when the number of clusters is 3 (a sketch of this sweep follows)
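A minimal sketch of such a sweep, again on synthetic stand-in data in place of the log features: it scores the three clustering models for k = 3 to 20 with the silhouette score and keeps the best combination. The actual evaluation also uses DBI and the combined score.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)  # stand-in data

def models(k):
    # The three clustering models compared on the slides.
    yield "kmeans", KMeans(n_clusters=k, n_init=10, random_state=0)
    yield "gmm", GaussianMixture(n_components=k, random_state=0)
    yield "ward", AgglomerativeClustering(n_clusters=k, linkage="ward")

best = max(
    ((name, k, silhouette_score(X, m.fit_predict(X)))
     for k in range(3, 21) for name, m in models(k)),
    key=lambda t: t[2],
)
print(f"best: {best[0]} with k={best[1]} (silhouette={best[2]:.3f})")
```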

SLIDES 29-33

Performance Evaluation

▪ Feature selection methods comparison
  ▪ The impact of the five feature selection methods on the KMeans clustering results is evaluated
    ▪ Mutual information, F-regression, Decision tree, Extra tree, and Min-max mutual information

The clustering result using the features selected by Min-max mutual information shows the highest cluster performance (see the sketch below)
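A sketch of how such a comparison could be wired up, assuming each feature selection method has already produced its list of selected column names; the feature_sets mapping is hypothetical and the score shown is the silhouette, one of the three metrics used.

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def score_feature_sets(df, feature_sets, k=3):
    """Cluster with KMeans on each candidate feature subset and score it.

    `feature_sets` maps a method name to the columns it selected; how
    those lists are produced is up to each feature selection method.
    """
    results = {}
    for name, cols in feature_sets.items():
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(df[cols])
        results[name] = silhouette_score(df[cols], labels)
    return results
```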

SLIDES 34-37

Cluster Characterization

▪ Clustering configuration used for characterization
  ▪ Min-max mutual information feature selection
  ▪ KMeans (or Ward linkage) clustering algorithm
  ▪ Clustering with 3 clusters scores highest (a characterization sketch follows)
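A minimal sketch of how the clusters can then be characterized, assuming df holds the normalized features: fit KMeans with three clusters and inspect the per-cluster feature means, which is the kind of summary the following slides present.

```python
import pandas as pd
from sklearn.cluster import KMeans

def characterize(df: pd.DataFrame, k: int = 3) -> pd.DataFrame:
    """Attach cluster labels and summarize each cluster by its feature means."""
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(df)
    return df.assign(cluster=labels).groupby("cluster").mean()
```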

SLIDE 38

Cluster Characterization

Cluster 1
  • Workloads with read/write operations smaller than 1 MB, mostly on stdio units
  • Average I/O throughput is only a few MB/s
SLIDE 39

Cluster Characterization

Cluster 2
  • Workloads with read/write operations larger than 1 MB
  • Many I/O operations during the processing time
  • Likely to use an 8 MB stripe size on average, 8 times the default size
  • Use a relatively small number of cores
SLIDE 40

Cluster Characterization

Cluster 3
  • Workloads use more than 70,000 MPI ranks on average
  • Use 62 times more processors on average
  • Issue a large number of I/O requests
SLIDE 41

Conclusion

▪ Summary
  ▪ We extracted the features most highly related to I/O performance
  ▪ We implemented a new feature selection method, Min-max mutual information, to obtain meaningful information from real HPC workload data
  ▪ We clustered the HPC applications and evaluated the results with cluster quality scores
  ▪ We identified meaningful clusters from the large set of application logs
▪ Future work
  ▪ We aim to give the applications in each cluster detailed guidance for improving their performance