Automating User-Centered Design of Data-Intensive Processes

  1. Automating User-Centered Design of Data-Intensive Processes
     Research Project Report (RPR)
     Vasileios Theodorou, 26-05-2015
     Host University, Home University
     Coadvisor: Dr. Maik Thiele
     Supervisors: Prof. Alberto Abelló, Prof. Wolfgang Lehner
     RPR - Doctoral Colloquium, IT4BI-DC (eBISS 2015)

  2. Example - Two Alternative Flows
     Conceptual model of flow: “Details about suppliers in Europe sorted on revenue”
     • ETL Flow A
     • ETL Flow B

  3. Measures from experiments

     Characteristic     Measure                        ETL Flow A          ETL Flow B
     Performance        Process cycle time             10.4 sec            18.9 sec
     Performance        Throughput                     52,906 tuples/sec   29,179 tuples/sec
     Data quality       % of correct tuples            91.5%               100%
     Data quality       % of non-null tuples           90.3%               95.2%
     Understandability  # of precedence dependencies   20                  40
     Manageability      Length of longest path         9 steps             23 steps

     EXECUTION
     • TPC-H with s.f.=1
     • Executed on Pentaho Data Integration (Kettle)
     • Data quality improved; Performance, Understandability and Manageability reduced
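
To make the trade-off between the two flows concrete, the measures on this slide can be turned into relative changes from Flow A to Flow B. This is only an illustrative sketch: the dictionary keys and the `relative_change` helper are names invented here, not part of the presented work.

```python
# Measures reported on the slide for ETL Flow A and ETL Flow B (TPC-H, s.f.=1).
# The key names are hypothetical labels chosen for this sketch.
measures = {
    "process_cycle_time_sec": (10.4, 18.9),
    "throughput_tuples_per_sec": (52906, 29179),
    "correct_tuples_pct": (91.5, 100.0),
    "non_null_tuples_pct": (90.3, 95.2),
    "precedence_dependencies": (20, 40),
    "longest_path_steps": (9, 23),
}


def relative_change(a, b):
    """Relative change from value a (Flow A) to value b (Flow B), in percent."""
    return (b - a) / a * 100.0


changes = {name: relative_change(a, b) for name, (a, b) in measures.items()}

for name, pct in changes.items():
    print(f"{name}: {pct:+.1f}%")
```

Whether a positive change is an improvement depends on the measure: a higher percentage of correct tuples is better, while a longer cycle time or a longer longest path is worse.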

  4. Agenda
     APPROACH
     • Conceptual model reflecting user requirements
     • User requirements-driven flow redesign
     • Automatic “quality” pattern integration
     • Configurable testing
     CHALLENGES AND DISCUSSION
     • Relate patterns to utility
     • Assess pattern significance, model accuracy & completeness
     • Future plan

  5. ETL Quality Attributes
     Paper: Quality Measures for ETL Processes (DaWaK ’14)
     TRADE-OFFS
     • It’s not only about performance!
     • Improving some quality attributes can affect others positively or negatively

  6. ETL Quality Attributes
     Paper: Quality Measures for ETL Processes (DaWaK ’14)
     CONTRIBUTION
     • Define a set of ETL process quality characteristics AND the relationships between them
     • Provide quantitative measures for each characteristic, backed by literature!
     METHODOLOGY
     • SLR for quality attributes specific to data-intensive processes
     • Collection from literature of (proven) metrics for monitoring and quantitatively evaluating ETL processes
     INVITED JOURNAL EXTENSION
     • Special Issue of Journal CCPE 2015 (under minor revision)
     • Introduce and apply goal modeling “stepping” on defined models
     • Showcase evaluation of use case ETLs using proposed measures

  7. User requirements driving flow redesign
     Paper: A Framework for User-Centered Declarative ETL (DOLAP ’14)
     [Diagram labels: Business User, requirements, IT, ETL Process, DB1, DB2, DW]
     TRADITIONAL APPROACH PROBLEMS
     • Expensive process
     • Hard to map requirements to implementation
     • IT optimizes only for performance
     • Need for more dynamicity (Big Data, data scope…)
     INSPIRATION
     • Model-driven approach
     • ETL process as a business process
     • Agile BI, Self-service BI
     APPROACH
     • User at the center of the iterative process
     • Functional and non-functional requirements are analyzed at the same time using automatic pattern management

  8. User requirements driving flow redesign
     Paper: A Framework for User-Centered Declarative ETL (DOLAP ’14)
     • High-level representation for Business Users
     • Translation to low-level models for IT and vice versa

  9. Automated Process Redesign (POIESIS)
     Demo Paper: POIESIS: a Tool for Quality-aware ETL Process Redesign (EDBT ’15)
     AUTOMATIC GENERATION OF ALTERNATIVE PHYSICAL ETL FLOWS
     • Alternative designs: same functionality (constant data schemata), different flow components - permutations
     • Policies and patterns
     • Measures estimation for evaluation

  10. Logical Modeling & FCPs
      Demo Paper: POIESIS: a Tool for Quality-aware ETL Process Redesign (EDBT ’15)
      LOGICAL MODELLING OF ETL FLOWS
      • Each operator is a node in a DAG structure
      • Flow Component Patterns represented in the same logical model
      • Each (combination of) pattern application(s) produces a new ETL flow
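
As a minimal sketch of the logical model described on this slide, an ETL flow can be held as a DAG whose nodes are operators. The `EtlFlow` class, its method names and the operator names below are assumptions made for illustration; they are not the POIESIS data structures. The two structural counts it exposes (number of edges and length of the longest path) mirror the kind of structural measures reported earlier in the deck.

```python
from collections import defaultdict


class EtlFlow:
    """Hypothetical DAG container: each ETL operator is a node."""

    def __init__(self):
        self.nodes = set()
        self.successors = defaultdict(list)

    def add_edge(self, src, dst):
        """Add a precedence dependency: src must run before dst."""
        self.nodes.update((src, dst))
        self.successors[src].append(dst)

    def edge_count(self):
        """Number of precedence dependencies (edges) in the flow."""
        return sum(len(targets) for targets in self.successors.values())

    def longest_path_length(self):
        """Number of operators on the longest path (assumes the graph is acyclic)."""
        memo = {}

        def depth(node):
            if node not in memo:
                following = self.successors.get(node, [])
                memo[node] = 1 + max((depth(n) for n in following), default=0)
            return memo[node]

        return max((depth(n) for n in self.nodes), default=0)


# Illustrative flow for "suppliers in Europe sorted on revenue";
# the operator names are invented for this example.
flow = EtlFlow()
flow.add_edge("extract_supplier", "filter_europe")
flow.add_edge("filter_europe", "join_nation")
flow.add_edge("extract_nation", "join_nation")
flow.add_edge("join_nation", "sort_revenue")
flow.add_edge("sort_revenue", "load")

print(flow.edge_count(), flow.longest_path_length())
```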

  11. Flow Component Patterns (FCPs)
      Demo Paper: POIESIS: a Tool for Quality-aware ETL Process Redesign (EDBT ’15)
      Application Point:
      • Edge
      • Node
      • Complete Graph
      Application Properties:
      • Applicability based on rules → Pruning
      • Fitness based on heuristics → Optimization
      Component Types: Atomic ETL Step, Sequence Component, Flow Component, Crossflow Component
      FCP example: Crosscheck Data Sources
      [Diagram: Crossflow Component with Sequence Component SC1 (E: Extract from Alternative Data Source; P: Project Attributes of Interest), Atomic ETL Step A1 (J: Join on Specific Keys) and Sequence Component SC2 (C: Compare Attributes; P: Project out Added Attributes)]
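
The idea of applying a pattern at an edge, with rule-based applicability used for pruning, can be sketched as follows. Everything here is hypothetical: the functions, the `filter_null_revenue` pattern, the operator names and the per-edge schemata are invented for illustration, and the fitness heuristics mentioned on the slide are not modelled.

```python
def apply_on_edge(edges, edge, new_op):
    """Return a new edge list with new_op interposed on the given edge."""
    if edge not in edges:
        raise ValueError(f"{edge} is not an edge of the flow")
    src, dst = edge
    rewired = [e for e in edges if e != edge]
    return rewired + [(src, new_op), (new_op, dst)]


def is_applicable(edge, schema_on_edge, required_attrs):
    """Applicability rule (assumed): the pattern needs certain attributes on the edge."""
    return required_attrs <= schema_on_edge.get(edge, set())


def generate_alternatives(edges, schema_on_edge, new_op, required_attrs):
    """One alternative flow per applicable edge; inapplicable edges are pruned."""
    return [
        apply_on_edge(edges, e, new_op)
        for e in edges
        if is_applicable(e, schema_on_edge, required_attrs)
    ]


# Illustrative flow and per-edge schemata (invented for this sketch).
flow = [("extract", "derive_revenue"), ("derive_revenue", "sort"), ("sort", "load")]
schemata = {
    ("extract", "derive_revenue"): {"s_name", "s_nationkey"},
    ("derive_revenue", "sort"): {"s_name", "s_nationkey", "revenue"},
    ("sort", "load"): {"s_name", "s_nationkey", "revenue"},
}

alternatives = generate_alternatives(flow, schemata, "filter_null_revenue", {"revenue"})
for alt in alternatives:
    print(alt)
```

The first edge is pruned because `revenue` does not exist there yet, so only two alternative flows are produced, and the original flow is left unchanged.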

  12. Example Visualization
      Demo Paper: POIESIS: a Tool for Quality-aware ETL Process Redesign (EDBT ’15)
      MULTIDIMENSIONAL ANALYSIS
      • Pareto frontier
      • Each point represents an ETL flow
      • Metrics (compound and detailed) compared to initial flow
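
A Pareto frontier over alternative flows can be computed with a simple dominance check. The sketch below assumes every measure is normalised so that higher is better; the flow names and scores are made-up placeholders, not results from the tool.

```python
def dominates(a, b):
    """True if score vector a is at least as good as b everywhere and better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_frontier(flows):
    """Keep only the flows that no other flow dominates."""
    return {
        name: scores
        for name, scores in flows.items()
        if not any(
            dominates(other, scores)
            for other_name, other in flows.items()
            if other_name != name
        )
    }


# Hypothetical (performance, data quality) scores, higher is better.
candidate_flows = {
    "initial": (1.0, 0.90),
    "alt1": (0.6, 1.00),
    "alt2": (0.5, 0.95),
    "alt3": (0.9, 0.92),
}

frontier = pareto_frontier(candidate_flows)
print(sorted(frontier))
```

Here `alt2` drops out because `alt1` is at least as good on both dimensions; the remaining flows each represent a different trade-off.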

  13. Quality-aware testing
      Paper: Bijoux: Data Generator for Evaluating ETL Process Quality (DOLAP ’14)
      APPROACH
      • An automatic, semantic-aware framework for generating testing workloads for evaluating quality of ETL processes
      • Using a taxonomy of ETL operations and their semantics, create synthetic datasets to test flows
      • Configurable properties (e.g., selectivity, distribution) to emphasize characteristics of specific flow parts
      INVITED JOURNAL EXTENSION
      • Information Systems, Elsevier 2015 (under review)
      • Highlight workflow perspective and analyze properties like flow coverage
      • Propose architecture and showcase updated implementation that scales
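
To illustrate what a configurable selectivity means for generated test data, the sketch below produces rows so that a chosen fraction satisfies a filter predicate. It is not the Bijoux algorithm: the `generate_rows` function, the `revenue > threshold` predicate and the column names are assumptions for this example only.

```python
import random


def generate_rows(n, selectivity, threshold=100, seed=0):
    """Generate n rows so that round(n * selectivity) satisfy revenue > threshold."""
    rng = random.Random(seed)
    n_pass = round(n * selectivity)
    rows = []
    for i in range(n):
        if i < n_pass:
            # Row is built to pass the (hypothetical) filter predicate.
            revenue = rng.randint(threshold + 1, threshold * 10)
        else:
            # Row is built to be rejected by the filter.
            revenue = rng.randint(0, threshold)
        rows.append({"supplier_id": i, "revenue": revenue})
    rng.shuffle(rows)  # avoid passing rows clustering at the start
    return rows


rows = generate_rows(1000, selectivity=0.25)
passing = sum(1 for r in rows if r["revenue"] > 100)
print(passing / len(rows))
```

Fixing the seed keeps the dataset reproducible, which helps when the same workload is replayed against several alternative flows.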

  14. Execution on the Cloud
      ELASTICITY FOR RESPONSIVENESS
      • Hundreds of flows executed very fast
      • Load balancing based on pre-evaluation
      OPEN RESEARCH QUESTIONS
      • Do instances share state? Common input data?
      • Can results be generalized for platform-dependent executions?
      [Diagram: Master Web app (Load Balancer, Pre-evaluator, Policy Manager, Monitor, Measures Collector) exchanging JSON (nd, edg, m1, m2) with EC2 instances 1 to n, each running a Slave Web app and Pentaho DI (Kettle)]
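
The slide does not say which balancing policy is used. As one possible reading of "load balancing based on pre-evaluation", the sketch below assigns flows to instances greedily, longest estimated flow first, always picking the least-loaded instance. The `assign_flows` function, the flow names and the cost estimates are all assumptions.

```python
import heapq


def assign_flows(estimated_cost, n_instances):
    """Greedy assignment: most expensive flow first, to the least-loaded instance."""
    heap = [(0.0, i) for i in range(n_instances)]  # (current load, instance id)
    heapq.heapify(heap)
    assignment = {i: [] for i in range(n_instances)}
    ordered = sorted(estimated_cost.items(), key=lambda kv: kv[1], reverse=True)
    for flow_name, cost in ordered:
        load, instance = heapq.heappop(heap)
        assignment[instance].append(flow_name)
        heapq.heappush(heap, (load + cost, instance))
    loads = {instance: load for load, instance in heap}
    return assignment, loads


# Hypothetical pre-evaluated cycle times (seconds) per alternative flow.
estimates = {"f1": 18.9, "f2": 10.4, "f3": 9.0, "f4": 3.0}
assignment, loads = assign_flows(estimates, n_instances=2)
print(assignment)
print(loads)
```

A greedy heuristic like this does not guarantee an optimal split, but it is cheap enough to run for hundreds of flows before dispatching them.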

  15. Decomposition to Structural Patterns
      QUALITY EVALUATION OF ETL FLOWS
      • Different design choices → large number of alternative ETL flows
      • Need for fine-grained cost models
      • Repository of patterns to increase reusability of models
      [Figure: ETL flow graph (operators O1-O13, edges e1-e13) decomposed into structural patterns P1-P4]
      PATTERN-BASED DECOMPOSITION OF ETL FLOWS
      • Classify structural patterns & identify on each flow
      • Derive utility as a function of the patterns that each flow contains
      • Adaptive model: Knowledge Base enrichment, flow evaluation improvement
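
The slide leaves the form of the utility function open. Purely as an illustration, the sketch below assumes the simplest case, an additive model in which each occurrence of a pattern contributes a fixed weight. The `flow_utility` function and the weights for P1-P3 are invented; a real model might need interaction terms for pattern combinations.

```python
from collections import Counter


def flow_utility(patterns_in_flow, pattern_weights):
    """Additive utility (assumed form): sum of weight * occurrences per pattern."""
    counts = Counter(patterns_in_flow)
    return sum(pattern_weights.get(p, 0.0) * c for p, c in counts.items())


# Hypothetical weights, e.g. as they might be learned from a knowledge base
# of measured pattern applications.
weights = {"P1": 0.5, "P2": -0.25, "P3": 1.0}

# Patterns identified on one flow (P2 occurs twice).
utility = flow_utility(["P1", "P2", "P2", "P3"], weights)
print(utility)
```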

  16. Challenges
      RELATE STRUCTURAL PATTERNS TO QUALITY MEASURES
      • When and where is a quality pattern worth considering?
      • Knowledge Base including pattern applications – detailed (measured) quality tradeoffs
      • Also rules about pattern combinations
      MODEL-THEORETIC PROPERTIES
      • Accuracy, completeness
      • How to evaluate significance of models?

  17. Future Plan
      BPM ’16, ER ’16, EDBT ’16
      JOURNALS
      • DSS ’16: Using statistical methods to examine model-theoretic properties of ETL utility characteristics
      • IJDWM ’16: ETL utility characteristics modelling and results from empirical study
