Optimised Framework based on Rough Set Theory for Big Data



SLIDE 1

Recent Trends in Knowledge Compilation – Dagstuhl Seminar 17381
Marie Sklodowska-Curie Actions - Individual Fellowship

Optimised Framework based on Rough Set Theory for Big Data Pre-processing in Certain and Imprecise Contexts

Presented by

  • Dr. Zaineb Chelly Dagdia

SLIDE 6

Introduction Background RoSTBiDFramework Conclusion

Outline

1. Introduction
2. Background
3. RoSTBiDFramework
4. Conclusion

  • Dr. Zaineb Chelly

RoSTBiDFramework 1/24

SLIDE 9

The MSC project

Proposal title: “Optimised Framework based on Rough Set Theory for Big Data Pre-processing in Certain and Imprecise Contexts”

  ✎ Duration in months: 24
  ✎ Panel: Information Science and Engineering (ENG)
  ✎ Descriptor: Machine learning, statistical data processing and applications
  ✔ Project started on 1 March 2017

SLIDE 13

Partner organisations

  ✓ Host: Aberystwyth University, UK
  ✓ Partner organisations:
      ➺ University of Birmingham, UK
      ➺ University of Paris 13, France
      ➺ University of Granada, Spain
      ➺ *Non-academic partner, France

SLIDE 14

Outline

1. Introduction
2. Background
     • Big Data
     • Rough Set Theory
3. RoSTBiDFramework
4. Conclusion

SLIDE 16

Specification

“Datasets which could not be captured, managed, and processed by general computers within an acceptable scope.” [Apache Hadoop, 2010]

➜ Bigger data requires different approaches:
  ✔ Techniques
  ✔ Tools
  ✔ Architecture

SLIDE 20

Distributed processing: Apache Spark

  ✍ Apache Spark is a cluster computing technology designed for fast computation.
  ✍ It builds on the MapReduce model.
  ✍ It performs in-memory cluster computing, which increases the processing speed of an application.
  ✍ It is based on Resilient Distributed Datasets (RDDs), which support in-memory computation.

SLIDE 22

MapReduce

☞ MapReduce divides the workload into multiple independent tasks and schedules them across cluster nodes.

  • Data are distributed to all the nodes of the cluster as they are loaded in.
  • Data are split into chunks, which are managed by different nodes in the cluster.

➠ Even though the file chunks are distributed across several machines, they form a single namespace.
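
The map and reduce phases described on this slide can be sketched on a single machine. This is only an illustration, assuming a word-count task; `split_into_chunks`, `map_task` and `reduce_task` are hypothetical names, and on a real cluster the per-chunk tasks would be scheduled on different nodes rather than run in a loop.

```python
from collections import Counter
from functools import reduce


def split_into_chunks(records, n_chunks):
    """Split the input into roughly equal chunks, one per (hypothetical) node."""
    size = max(1, -(-len(records) // n_chunks))  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]


def map_task(chunk):
    """Independent task: count the words of one chunk only."""
    counts = Counter()
    for line in chunk:
        counts.update(line.split())
    return counts


def reduce_task(left, right):
    """Merge two partial results into one."""
    return left + right


def map_reduce(records, n_chunks=3):
    chunks = split_into_chunks(records, n_chunks)
    # On a cluster, each map_task call would run on a different node.
    partials = [map_task(chunk) for chunk in chunks]
    return reduce(reduce_task, partials, Counter())


lines = ["big data", "rough set theory", "big data pre-processing", "rough set"]
word_counts = map_reduce(lines)
```

Because every `map_task` call sees only its own chunk, the tasks share no state, which is what makes them schedulable across nodes.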

SLIDE 24

Rough Set Theory: Basic Concepts

  • The indiscernibility relation
  • The lower approximation
  • The upper approximation
  • The boundary region
  • The positive region
  • The dependency of attributes
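
The concepts listed on this slide can be made concrete on a toy decision table. The sketch below uses the standard rough set definitions; the table `TABLE`, its attributes `a`, `b` and decision `d`, and all function names are made up for illustration and do not come from the presentation.

```python
from collections import defaultdict

# Toy decision table (hypothetical): object -> attribute values.
TABLE = {
    "o1": {"a": 0, "b": 0, "d": "no"},
    "o2": {"a": 0, "b": 1, "d": "yes"},
    "o3": {"a": 1, "b": 0, "d": "yes"},
    "o4": {"a": 1, "b": 0, "d": "no"},
    "o5": {"a": 1, "b": 1, "d": "yes"},
}


def ind(table, attrs):
    """Equivalence classes of the indiscernibility relation IND(attrs)."""
    classes = defaultdict(set)
    for obj, row in table.items():
        classes[tuple(row[a] for a in attrs)].add(obj)
    return list(classes.values())


def lower(table, attrs, target):
    """Lower approximation: objects that certainly belong to target."""
    return set().union(*[c for c in ind(table, attrs) if c <= target])


def upper(table, attrs, target):
    """Upper approximation: objects that possibly belong to target."""
    return set().union(*[c for c in ind(table, attrs) if c & target])


def boundary(table, attrs, target):
    """Boundary region: objects that cannot be classified with certainty."""
    return upper(table, attrs, target) - lower(table, attrs, target)


def positive_region(table, cond, dec):
    """Union of the lower approximations of every decision class."""
    pos = set()
    for decision_class in ind(table, dec):
        pos |= lower(table, cond, decision_class)
    return pos


def dependency(table, cond, dec):
    """Dependency degree: share of objects in the positive region."""
    return len(positive_region(table, cond, dec)) / len(table)


yes = {obj for obj, row in TABLE.items() if row["d"] == "yes"}
```

Here `o3` and `o4` are indiscernible on `a` and `b` but carry different decisions, so they fall into the boundary region and the dependency of `d` on `{a, b}` is 3/5.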

SLIDE 28

RST for Feature Selection

1. Calculate the indiscernibility relation (IND) of the classes;
2. Generate all the possible combinations of features;
3. For each combination:
     • Calculate the IND;
     • Calculate the lower approximation;
     • Calculate the positive region;
     • Calculate the dependency;
4. Select the reduct(s) where:
     • The feature set contains a minimal number of features;
     • The dependency (DEP) of the feature set equals the DEP of the data set (all the features).
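
The four steps above can be sketched as a brute-force search. This is a minimal illustration, not the presenter's implementation: the data in `ROWS`, the feature names and the function names are hypothetical, and the positive region is computed through the equivalent test that every row of an indiscernibility class carries the same class label.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical decision table: rows of (feature a, feature b, feature c, class label).
FEATURES = ["a", "b", "c"]
ROWS = [
    (0, 0, 0, "no"),
    (0, 1, 0, "yes"),
    (1, 0, 1, "yes"),
    (1, 1, 1, "no"),
]


def dependency(rows, subset):
    """Share of rows whose IND class under `subset` maps to a single label."""
    idx = [FEATURES.index(f) for f in subset]
    labels = defaultdict(set)
    sizes = defaultdict(int)
    for row in rows:
        key = tuple(row[i] for i in idx)
        labels[key].add(row[-1])
        sizes[key] += 1
    positive = sum(n for key, n in sizes.items() if len(labels[key]) == 1)
    return positive / len(rows)


def reducts(rows, features):
    """Exhaustive search: smallest subsets matching the full-set dependency."""
    full = dependency(rows, features)
    for size in range(1, len(features) + 1):
        found = [s for s in combinations(features, size)
                 if dependency(rows, s) == full]
        if found:
            return found  # every subset of this minimal size qualifies
    return []


result = reducts(ROWS, FEATURES)
```

In this toy table no single feature determines the class, while `("a", "b")` and `("b", "c")` both reach the dependency of the full feature set, so the search returns those two subsets.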

SLIDE 29

Outline

1. Introduction
2. Background
3. RoSTBiDFramework
     • Challenges
     • Research methodology
     • Proposed solution
4. Conclusion

SLIDE 32

Current state

It has become difficult to quickly acquire the most useful information from the huge amount of data at hand.

➽ It is necessary to perform data (pre-)processing as a first step!

SLIDE 34

State-of-the-art

Sequential and MapReduce-based dimensionality reduction techniques:

  • Involve the user for parameterisation;
  • Are not able to deal with the veracity aspect;
  • Are not able to deal with the computational requirements of the data.

Aim ⇒ To fill these major research gaps with an optimised framework for big data pre-processing in certain and imprecise contexts.

SLIDE 37

Research methodology

1. Performing RST feature selection in parallel, using only the information contained within the data.
2. Dealing with the data veracity aspect.
3. Avoiding prohibitive computations.

SLIDE 45

Motivation: Rough Set Theory

Problem definition: how to find the reduct?

1. RST-based reduct search is exhaustive (it generates all possible feature subsets);
2. It retrieves those with a maximum rough set dependency degree.

Limitations:

1. Expensive solution to the problem: the total number of feature subsets to generate is $\sum_{i=1}^{N} \binom{N}{i} = 2^N - 1$;
2. Only practical for very simple data sets;
3. Hardware constraints do not allow us to store a high number of entries.

↝ The solution to these issues is to “make a new formulation of Rough Set Theory to the Big Data Context”.
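
To give a feel for the size of this search space, the short check below sums the binomial coefficients directly and compares them with the closed form. The helper name `subset_count` is illustrative; N = 10 000 is the feature count of the benchmark used in the experiments.

```python
from math import comb


def subset_count(n_features):
    """Number of non-empty feature subsets, summed size by size."""
    return sum(comb(n_features, i) for i in range(1, n_features + 1))


# Sanity check of the closed form 2^N - 1 for small N.
assert all(subset_count(n) == 2**n - 1 for n in range(1, 20))

# Number of decimal digits of 2^N - 1 for N = 10 000 features.
digits_for_10000 = len(str(2**10000 - 1))
```

With 10 000 features the number of candidate subsets has 3011 decimal digits, so enumerating them is out of reach on any hardware.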

SLIDE 48

The approach: Sp-RST

A Distributed Rough Set Theory based Algorithm for an Efficient Big Data Pre-processing under the Spark Framework

Sp-RST:

1. Is based on RST for feature selection;
2. Follows a distributed implementation design;
3. Uses Spark for in-memory processing;
4. Deals with the certain Big Data context.
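
The slides do not spell out how Sp-RST distributes the search, so the following is only a sketch under a stated assumption: the feature set is split into partitions, each partition is searched independently, and the per-partition selections are united. The data in `ROWS` and the names `local_reduct` and `partitioned_selection` are hypothetical, and plain Python stands in for Spark.

```python
from collections import defaultdict
from itertools import combinations

ROWS = [  # hypothetical data: four features (columns 0..3), then the class label
    (0, 0, 0, 0, "no"),
    (0, 1, 0, 0, "no"),
    (1, 0, 1, 0, "no"),
    (1, 1, 1, 0, "yes"),
]


def dependency(rows, cols):
    """Share of rows whose values on `cols` determine the class label."""
    labels, sizes = defaultdict(set), defaultdict(int)
    for row in rows:
        key = tuple(row[c] for c in cols)
        labels[key].add(row[-1])
        sizes[key] += 1
    return sum(n for k, n in sizes.items() if len(labels[k]) == 1) / len(rows)


def local_reduct(rows, cols):
    """First smallest subset of `cols` that keeps the dependency of `cols`."""
    target = dependency(rows, cols)
    for size in range(1, len(cols) + 1):
        for subset in combinations(cols, size):
            if dependency(rows, subset) == target:
                return list(subset)
    return list(cols)


def partitioned_selection(rows, n_features, n_partitions):
    """Assumed scheme: search each feature partition, then unite the results."""
    cols = list(range(n_features))
    partitions = [cols[i::n_partitions] for i in range(n_partitions)]
    # Each call is independent, so a cluster could run them in parallel.
    selected = [local_reduct(rows, part) for part in partitions]
    return sorted(c for part in selected for c in part)


selected = partitioned_selection(ROWS, n_features=4, n_partitions=2)
```

In this toy case two partitions of two features each require at most 2 × 3 subset evaluations instead of 15, and the united selection keeps the dependency of the full feature set. That equality is a property of this example, not a guarantee of the scheme.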

SLIDE 49

Experimental Setup

Benchmark, experimental plan, testbed and tools:

  • Amazon Commerce Big Data: 1500 data items, 10 000 features and 50 distinct classes.
  • Experiments are performed on the Grid5000 testbed, a French national large-scale and versatile platform.
  • Dual 8-core Intel Xeon E5-2630v3 CPUs and 128 GB of memory are used.
  • Sp-RST is implemented in Scala 2.11 within the Spark 2.1.1 framework.
  • The Random Forest classifier is used for testing.

SLIDE 52

Some Results

  • Stability of feature selection: the number of features selected is very similar in all cases (less than 100 features difference), but varies significantly with the number of partitions used. Our method is very reliable in identifying features for removal.
  • Classification error with and without Sp-RST: Sp-RST introduces no significant information loss, as results are at least comparable.
  • Execution times with and without Sp-RST: the overall execution time decreases as the number of partitions increases.

SLIDE 53

Outline

1. Introduction
2. Background
3. RoSTBiDFramework
4. Conclusion

SLIDE 55

Open Problems and Collaborations

There is a trade-off between the number of partitions and the number of nodes used. If only a few nodes are available, it may be advisable to use a larger number of partitions to reduce execution times, while the number of partitions becomes less important if a high degree of parallelization can be offered.

  • How can we deal with larger partitions while satisfying the low execution time (complexity) constraint?
  • Any other splitting idea or distributed system to deal with larger partitions?
  • How can we parallelize the algorithm while satisfying the ‘no’ information loss constraint?
  • Any other Big Data applications?

SLIDE 56

Thank you for your attention

Questions?