  1. Next-generation ETL Framework to address the challenges posed by Big Data. Syed Muhammad Fawad Ali

  2. Agenda 1. Introduction 2. Motivation 3. Extendable ETL Framework 4. Conclusion 5. Q&A

  3. 1. INTRODUCTION

  4. Background
     Growth of Data: The volume of produced, collected, and stockpiled digital data has been continuously growing exponentially. The useful data size expected by 2020 is 16 trillion GB.
     Dealing with Big Data: Gains great benefits in science and business. Requires a great scientific contribution to deal with it.
     Challenges with Big Data: Acquisition, Processing, Loading, Analyzing.

  5. Focus - the ETL aspect of Big Data
     TRADITIONAL ETL FRAMEWORKS: Designed for creating a traditional Data Warehouse (DW) and for efficiently supporting lightweight computations on smaller data sets.
     BIG DATA REQUIREMENTS: Big Data demands new and advanced computations, e.g., for data cleansing or data visualization.

  6. 2. Motivation

  7. Study on existing ETL frameworks
     Focus of the Study: We carried out an intensive study [1] on the existing methods for designing, implementing, and optimizing ETL workflows. We analyzed several techniques w.r.t. their pros, cons, and challenges in the context of metrics such as:
     ● support for a variety of data
     ● support for quality metrics
     ● support for ETL activities as user-defined functions
     ● autonomous behavior
     [1] S. M. F. Ali and R. Wrembel. From conceptual design to performance optimization of ETL workflows: current state of research and open problems. The VLDB Journal, pages 1–25, 2017.

  8. Summary of the Study
     Variety of Data: Limited support for semi-structured and unstructured data, despite the exponential growth of the variety of data formats, especially unstructured and raw data. There is a need to extend the support for processing unstructured data along with other data formats (e.g., video, audio, binary).
     Efficient Execution of WFs: Regardless of today's Big Data needs, there is a lack of emphasis on the issues of efficient, reliable, and improved execution of an ETL workflow, and no support for user-defined functions. Techniques based on task parallelism, data parallelism, and a combination of both are required for traditional ETL operators as well as user-defined functions.
     Monitoring & Recommendation: Extensive input is required from ETL developers during the design and implementation phases of the DWH and ETL development life cycle, which can be error prone, time consuming, and inefficient. There is a need for an ETL framework that provides recommendations on: (1) an efficient ETL workflow design and (2) how and when to improve the performance of an ETL workflow without conceding other quality metrics.

  9. “The consequence of the aforementioned observations is that designing and optimizing ETL workflows for Big Data is much more difficult than for traditional data, and support for it is much needed at this point in time.”

  10. 3. The Extendable ETL Framework

  11. Architecture of the ETL Workflow
      A three-layered architecture:
      1. Bottom layer - the WF designer
      2. Top layer - a distributed framework
      3. Middle layer:
         ▷ UDF Component
         ▷ Recommender
         ▷ Cost Model Library
         ▷ Monitoring Agent
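To make the layering concrete, here is a minimal sketch of how the three layers could hand a workflow to one another: the bottom-layer designer produces a workflow definition, the middle layer decides how to improve it, and the top layer executes it on a distributed engine. All class and method names (WorkflowDesigner, MiddleLayer, DistributedEngine) are illustrative assumptions, not part of the presented framework.

```python
# A minimal sketch (illustrative only) of the three-layer split: the bottom
# layer produces a workflow definition, the middle layer decides how to improve
# it, and the top layer executes it on a distributed engine. All class and
# method names here are assumptions, not part of the presented framework.
from typing import List

Workflow = List[str]  # a workflow modeled as an ordered list of activity names


class WorkflowDesigner:
    """Bottom layer: where the ETL developer designs the workflow."""

    def design(self) -> Workflow:
        return ["extract", "cleanse_udf", "load"]


class MiddleLayer:
    """Middle layer: stands in for the UDF component, Recommender, Cost Model, and Monitoring Agent."""

    def optimize(self, wf: Workflow) -> Workflow:
        # Placeholder: a real middle layer would consult past execution
        # metadata and cost models before rewriting the workflow.
        return list(wf)


class DistributedEngine:
    """Top layer: a distributed execution framework (simulated here)."""

    def execute(self, wf: Workflow) -> None:
        for step in wf:
            print(f"running {step} on the cluster (simulated)")


if __name__ == "__main__":
    wf = WorkflowDesigner().design()
    DistributedEngine().execute(MiddleLayer().optimize(wf))
```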

  12. A UDFs Component
      The idea behind introducing a UDFs component is to assist the ETL developer in writing a parallelizable UDF by separating parallelization concerns from the code.
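As a rough illustration of separating parallelization concerns from UDF code, the sketch below lets the developer write a plain row-level function while a hypothetical framework-side skeleton decides whether to run it sequentially or with data parallelism. The names (clean_email, parallel_map_skeleton) and the multiprocessing-based backend are assumptions, not the framework's actual API.

```python
# Sketch of the UDFs-component idea: the ETL developer writes a plain,
# row-level function; a (hypothetical) framework-side skeleton chooses how to
# parallelize it, so no parallelism code leaks into the UDF itself.
from multiprocessing import Pool


def clean_email(row: dict) -> dict:
    """Developer-written UDF: pure row-level logic, no parallelism concerns."""
    row = dict(row)
    row["email"] = row.get("email", "").strip().lower()
    return row


def parallel_map_skeleton(udf, rows, workers: int = 4):
    """Framework-side skeleton: applies a UDF with data parallelism.

    The same call could fall back to a sequential loop, or be translated to a
    distributed engine, without any change to the UDF.
    """
    if workers <= 1:
        return [udf(r) for r in rows]
    with Pool(processes=workers) as pool:
        return pool.map(udf, rows)


if __name__ == "__main__":
    data = [{"email": "  Alice@Example.COM "}, {"email": "BOB@example.com"}]
    print(parallel_map_skeleton(clean_email, data, workers=2))
```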

  13. A Recommender
      ▷ The Recommender includes an extendable set of machine learning algorithms to optimize a given ETL workflow, based on metadata collected during past ETL executions.
      ▷ The metadata may be collected with the help of the Monitoring Agent.
      ▷ An ETL developer may be able to experiment with alternative algorithms to optimize ETL workflows (e.g., the Dependency Graph approach or Scheduling Strategies).
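The slide leaves the concrete algorithms open; the following sketch only illustrates the shape of such a component: a registry of pluggable workflow-optimization strategies and a selection rule driven by metadata from past executions. The strategy names and the rows-per-second threshold are illustrative assumptions, not part of the proposal.

```python
# An illustrative sketch of a pluggable Recommender: a registry of workflow
# optimization strategies plus a selection rule driven by metadata from past
# executions. The strategy names and the threshold below are assumptions.
from typing import Callable, Dict, List

Workflow = List[str]                        # an ordered list of activity names
Strategy = Callable[[Workflow], Workflow]


def reorder_filters_first(wf: Workflow) -> Workflow:
    """Toy strategy: push 'filter' activities earlier to shrink data volume."""
    return sorted(wf, key=lambda step: 0 if step.startswith("filter") else 1)


def keep_as_is(wf: Workflow) -> Workflow:
    return list(wf)


class Recommender:
    def __init__(self) -> None:
        self.strategies: Dict[str, Strategy] = {
            "dependency_graph_reorder": reorder_filters_first,
            "keep_as_is": keep_as_is,
        }

    def recommend(self, wf: Workflow, past_stats: Dict[str, float]) -> Workflow:
        # Toy selection rule: if past runs were slow, try a reordering
        # strategy; otherwise leave the workflow unchanged.
        name = ("dependency_graph_reorder"
                if past_stats.get("rows_per_second", 0.0) < 1000.0
                else "keep_as_is")
        return self.strategies[name](wf)


if __name__ == "__main__":
    wf = ["join_customers", "filter_invalid", "aggregate_sales", "load_dw"]
    print(Recommender().recommend(wf, {"rows_per_second": 350.0}))
```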

  14. A Cost Model
      ▷ The library of cost models may include models for:
        ○ monetary cost,
        ○ performance cost, and
        ○ both cost and execution performance.
      ▷ The Recommender may choose the appropriate cost model from the library of cost models to make optimal decisions, based on the ETL developer's input and the Monitoring Agent.
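A possible reading of this slide is sketched below: one small model per cost dimension (monetary, execution performance, and a weighted combination), from which a Recommender could pick the cheaper execution plan. The formulas, fields, and the 0.5 default weight are placeholder assumptions for illustration only.

```python
# Sketch of a cost-model library: one model per cost dimension described on
# the slide (monetary, performance, combined). The formulas and weights are
# placeholder assumptions, not the framework's actual cost models.
from dataclasses import dataclass


@dataclass
class WorkflowStats:
    estimated_runtime_s: float   # predicted execution time of the workflow
    nodes: int                   # cluster nodes the plan would use
    price_per_node_hour: float   # monetary price of one node for one hour


def monetary_cost(s: WorkflowStats) -> float:
    return s.nodes * s.price_per_node_hour * s.estimated_runtime_s / 3600.0


def performance_cost(s: WorkflowStats) -> float:
    return s.estimated_runtime_s


def combined_cost(s: WorkflowStats, money_weight: float = 0.5) -> float:
    # Weighted blend of both dimensions; the weight would come from the
    # ETL developer's input in the envisioned framework.
    return money_weight * monetary_cost(s) + (1 - money_weight) * performance_cost(s)


if __name__ == "__main__":
    plan_a = WorkflowStats(estimated_runtime_s=1800, nodes=4, price_per_node_hour=0.50)
    plan_b = WorkflowStats(estimated_runtime_s=900, nodes=16, price_per_node_hour=0.50)
    cheaper = min((plan_a, plan_b), key=combined_cost)
    print("preferred plan:", cheaper)
```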

  15. A Monitoring Agent
      The Monitoring Agent allows to:
      ○ monitor ETL workflow executions
        ■ # input rows, # output rows, execution time of each step, number of rows processed per second
      ○ report errors
        ■ task or workflow failures and the possible reasons
      ○ schedule executions
        ■ execution time of ETL workflows and creating a dependency chart for ETL tasks and workflows
      ○ gather various performance statistics
        ■ execution time of each ETL activity w.r.t. rows processed per second, execution time of the entire ETL workflow w.r.t. rows processed per second, memory consumption by each ETL activity
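The sketch below records the kind of per-activity statistics listed on this slide (input rows, output rows, execution time, rows per second) around each ETL activity call. The MonitoringAgent and ActivityStats names are assumptions chosen for illustration.

```python
# Sketch of how a Monitoring Agent could collect the per-activity statistics
# named on the slide. Class and field names are illustrative assumptions.
import time
from dataclasses import dataclass, field
from typing import List


@dataclass
class ActivityStats:
    name: str
    input_rows: int
    output_rows: int
    seconds: float

    @property
    def rows_per_second(self) -> float:
        return self.output_rows / self.seconds if self.seconds > 0 else 0.0


@dataclass
class MonitoringAgent:
    log: List[ActivityStats] = field(default_factory=list)

    def run_activity(self, name, activity, rows):
        """Execute one ETL activity and record its runtime statistics."""
        start = time.perf_counter()
        out = activity(rows)
        elapsed = time.perf_counter() - start
        self.log.append(ActivityStats(name, len(rows), len(out), elapsed))
        return out


if __name__ == "__main__":
    agent = MonitoringAgent()
    rows = list(range(100_000))
    evens = agent.run_activity("filter_even", lambda rs: [r for r in rs if r % 2 == 0], rows)
    for s in agent.log:
        print(s.name, s.input_rows, s.output_rows, f"{s.rows_per_second:,.0f} rows/s")
```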

  16. A Monitoring Agent
      ▷ Information collected by the Monitoring Agent is stored in an ETL framework repository to be utilized by the Recommender and the Cost Model to make recommendations to the ETL developer and to generate optimal ETL workflows.
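One way such a repository could look is sketched here: execution metadata written by the Monitoring Agent into a small SQLite store and aggregated back for the Recommender and Cost Model. SQLite and the table layout are assumptions; the talk does not prescribe a storage technology.

```python
# Sketch of the ETL framework repository mentioned on this slide: execution
# metadata written by the Monitoring Agent and read back by the Recommender /
# Cost Model. SQLite and the schema are assumptions made for illustration.
import sqlite3


def open_repository(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS activity_runs (
            workflow TEXT, activity TEXT,
            input_rows INTEGER, output_rows INTEGER, seconds REAL
        )""")
    return conn


def record_run(conn, workflow, activity, input_rows, output_rows, seconds):
    """Called by the Monitoring Agent after each activity execution."""
    conn.execute("INSERT INTO activity_runs VALUES (?, ?, ?, ?, ?)",
                 (workflow, activity, input_rows, output_rows, seconds))
    conn.commit()


def average_rows_per_second(conn, workflow: str) -> float:
    """The kind of aggregate a Recommender could query before optimizing."""
    row = conn.execute(
        "SELECT SUM(output_rows) / SUM(seconds) FROM activity_runs WHERE workflow = ?",
        (workflow,)).fetchone()
    return row[0] or 0.0


if __name__ == "__main__":
    repo = open_repository()
    record_run(repo, "daily_sales", "filter_invalid", 100_000, 80_000, 12.5)
    record_run(repo, "daily_sales", "load_dw", 80_000, 80_000, 40.0)
    print(f"{average_rows_per_second(repo, 'daily_sales'):.0f} rows/s on average")
```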

  17. 4. Conclusion

  18. We believe that the proposed ETL framework is a step forward towards a fully automated ETL framework that helps ETL developers optimize ETL tasks and the overall ETL workflow for Big Data, with the help of the recommendations, WF monitoring, and UDFs provided by the tool.
      Currently we are working on the first steps towards building the complete ETL framework:
      ▷ a UDFs Component - to provide a library of reusable parallel algorithmic skeletons for the ETL developer, and
      ▷ a Cost Model - to generate the most efficient execution plan for an ETL workflow.

  19. Thanks! Any questions? You can contact me at: fawadali.ali@gmail.com

  20. References
      1. S. M. F. Ali and R. Wrembel. From conceptual design to performance optimization of ETL workflows: current state of research and open problems. The VLDB Journal, pages 1–25, 2017.
      2. S. K. Bansal. Towards a semantic extract-transform-load (ETL) framework for big data integration. In Proceedings of the International Congress on Big Data, pages 522–529. IEEE, 2014.
      3. J. Duggan, A. J. Elmore, M. Stonebraker, M. Balazinska, B. Howe, J. Kepner, S. Madden, D. Maier, T. Mattson, and S. Zdonik. The BigDAWG Polystore System. SIGMOD Record, pages 11–16, 2015.
      4. T. Ibaraki, T. Hasegawa, K. Teranaka, and J. Iwase. The multiple choice knapsack problem. Journal of the Operations Research Society of Japan, pages 59–94, 1978.
      5. A. Iosup, S. Ostermann, M. N. Yigitbasi, R. Prodan, T. Fahringer, and D. Epema. Performance analysis of cloud computing services for many-tasks scientific computing. IEEE Transactions on Parallel and Distributed Systems, pages 931–945, 2011.
      6. K. R. Jackson, L. Ramakrishnan, K. Muriki, S. Canon, S. Cholia, J. Shalf, H. J. Wasserman, and N. J. Wright. Performance analysis of high performance computing applications on the Amazon Web Services cloud. In International Conference on Cloud Computing Technology and Science, pages 159–168. IEEE, 2010.
      7. A. Karagiannis, P. Vassiliadis, and A. Simitsis. Scheduling strategies for efficient ETL execution. Information Systems, pages 927–945, 2013.
      8. M. Marjani, F. Nasaruddin, A. Gani, A. Karim, I. A. T. Hashem, A. Siddiqa, and I. Yaqoob. Big IoT data analytics: Architecture, opportunities, and open research challenges. IEEE Access, pages 5247–5261, 2017.
      9. B. Martinho and M. Y. Santos. An architecture for data warehousing in big data environments. In Proceedings of Research and Practical Issues of Enterprise Information Systems, pages 237–250. Springer, 2016.
      10. A. Simitsis, P. Vassiliadis, and T. Sellis. State-space optimization of ETL workflows. IEEE Transactions on Knowledge and Data Engineering (TKDE), pages 1404–1419, 2005.
      11. A. Simitsis, K. Wilkinson, U. Dayal, and M. Castellanos. Optimizing ETL workflows for fault-tolerance. In Proceedings of the IEEE International Conference on Data Engineering (ICDE), 2010.
      12. I. Terrizzano, P. Schwarz, M. Roth, and J. E. Colino. Data Wrangling: The Challenging Journey from the Wild to the Lake. In Proceedings of the Conference on Innovative Data Systems Research (CIDR), 2015.
      13. V. Viana, D. De Oliveira, and M. Mattoso. Towards a cost model for scheduling scientific workflows activities in cloud environments. In Proceedings of the IEEE World Congress on Services, pages 216–219, 2011.
