
PDI Sizing Overview and Case Study
Steve Szabo, Pentaho Lead Solution Engineer, Hitachi Vantara



  1. PDI Sizing Overview and Case Study
     Steve Szabo, Pentaho Lead Solution Engineer, Hitachi Vantara

  2. Introduction and Agenda
     • Introduction
       – What is PDI Sizing, and Why Do We Need to be Concerned About it?
     • Agenda
       – Brief Anatomy of PDI
       – Example Sizing Problem
       – Review Test Cases
       – Example Sizing Solution
       – Review Major Constraints, Bottlenecks, and Best Practices
     • Next Steps
       – Recommendations
       – Resources

  3. Example Sizing Problem
     • Retail business has daily data that must be readied for next-day analysis
       – Data volume fluctuates daily
       – 8-hour delivery window
       – 10 TB per day peak, 5 TB per day average
     • Peak = 10 TB
     • 10 TB per day ≈ 400 GB per hour (10 TB spread over 24 hours is about 417 GB per hour)
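The conversion from a daily volume to an hourly rate in the problem statement can be sketched as follows. This is a minimal illustration, not part of the original deck: the function name `required_gb_per_hour` and the use of decimal units (1 TB = 1000 GB) are assumptions, and how the 8-hour delivery window should be applied to the requirement is also an assumption.

```python
GB_PER_TB = 1000  # assumption: decimal units (1 TB = 1000 GB)


def required_gb_per_hour(volume_tb: float, window_hours: float) -> float:
    """Average throughput in GB/hour needed to process volume_tb within window_hours."""
    if window_hours <= 0:
        raise ValueError("window_hours must be positive")
    return volume_tb * GB_PER_TB / window_hours


# Peak and average daily volumes from the slide, spread over a full day.
peak_24h = required_gb_per_hour(10, 24)
avg_24h = required_gb_per_hour(5, 24)

# Assumption: if the whole peak volume had to fit inside the 8-hour window.
peak_8h = required_gb_per_hour(10, 8)

print(f"Peak over 24 h:    {peak_24h:.1f} GB/hour")
print(f"Average over 24 h: {avg_24h:.1f} GB/hour")
print(f"Peak over 8 h:     {peak_8h:.1f} GB/hour")
```

Spreading the peak over 24 hours gives roughly the 400 GB per hour quoted on the slide. Compressing the same volume into 8 hours would triple the required rate, so it is worth settling which window applies before sizing.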

  4. Sizing Disclaimers
     • Past performance is not a guarantee of future results
     • The best practice is to run throughput tests with fully representative data and transformation profiles on actual equipment
     • Sizing should accommodate data growth and operational margins
     • The results here represent throughput with a single transformation type under controlled conditions
       – Increasing the variety of transformations may result in lower performance
     • See Pentaho Best Practices for performance tuning

  5. What is PDI Sizing?
     Determining the number of nodes and cores needed to process data within time constraints
     (Diagram labels: Data Sources, Data Targets)

  6. PDI Sizing Variables and Constraints
     (Diagram labels: Customer Demands, Jobs, Transformations, Inputs, Available Memory, Available CPU (cores), Available ops/sec, Input bandwidth, Output bandwidth, Number of Nodes, Time)

  7. Pentaho Data Integration Sizing Factors
     • Available Time
       – Processing time required
       – Turn-around time requirements
     • Amount of Data
       – Interdependencies and lag
     • Available Resources
       – Computing power: cores, memory, storage, network
       – Number of nodes
     • Complexity of Transformations

  8. PDI Anatomy
     • Platform
       – CPU, Cores, Memory
       – JVM
     • Jobs
       – Orchestration
     • Transformation attributes
       – Threads
       – Connections
       – Steps
         • Blocking Steps
         • Expensive Steps
       – Rowset buffers
       – Step copies
       – Multiple streams

  9. PDI Sizing – Hardware Constraints
     • Enterprise-grade supported processors (not end-of-life)
     • 8-core processors
     • 32 GB or more of available RAM
       – 24 GB or more per JVM
     • High-speed network connections (1 Gb/sec to 10 Gb/sec)
     • Low number of network hops
       – Co-located nodes on the same segment
     • Cluster Configurations
       – Carte Clustering
       – Hadoop MapReduce
       – AWS Auto Scaling
       – Spark Clusters
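To relate the network speeds listed above to throughput figures quoted in GB per hour, a rough link ceiling can be computed. This is a sketch under stated assumptions, not something shown in the deck: the function name `link_ceiling_gb_per_hour` is hypothetical, decimal units are assumed, and protocol overhead is ignored, so real ceilings would be lower.

```python
def link_ceiling_gb_per_hour(link_gbit_per_sec: float) -> float:
    """Theoretical maximum GB/hour over a link, ignoring protocol overhead."""
    bytes_per_sec = link_gbit_per_sec * 1e9 / 8  # bits to bytes
    return bytes_per_sec * 3600 / 1e9            # bytes/sec to GB/hour


for speed in (1, 10):
    ceiling = link_ceiling_gb_per_hour(speed)
    print(f"{speed:>2} Gb/sec link: at most {ceiling:.0f} GB/hour")
```

By this estimate a 1 Gb/sec link tops out near 450 GB per hour before overhead, which is close to the roughly 400 GB per hour peak in the example sizing problem. That suggests the network deserves a check alongside CPU when sizing.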

  10. Example 1: High I/O, Moderate CPU use-case
      • Single 8-core node with a single type of transformation
      • Throughput Peak:
        – 229 GB per Hour
        – 916 GB per 4 Hours
        – 5.5 TB per 24 Hours

  11. Summarized Results – High I/O, Moderate CPU use-case
      Number of Concurrent Transformations   Hourly Throughput   Daily Throughput   Notes
      1                                      121.9 GB per Hour   2.9 TB per Day     Medium (15-steps)
      2                                      228.6 GB per Hour   5.4 TB per Day     This represents an 88% increase in throughput
      3                                      229.2 GB per Hour   5.5 TB per Day     This is less than 1% more throughput

  12. Example 2: High I/O, High CPU use-case
      • Single 8-core node with a single type of transformation
      • Throughput Peak:
        – 113 GB per Hour
        – 452 GB per 4 Hours
        – 2.7 TB per 24 Hours

  13. Summarized Results – High I/O, High CPU use-case
      Number of Concurrent Transformations   Hourly Throughput   Daily Throughput   Notes
      1                                      88.8 GB per Hour    2.1 TB per Day     Medium (15-steps)
      2                                      112.6 GB per Hour   2.7 TB per Day     This represents a 27% increase in throughput
      3                                      113.4 GB per Hour   2.7 TB per Day     This is less than 1% more throughput
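The percentage notes in the two results tables can be reproduced from the hourly throughput figures. The sketch below is illustrative only: the function names and the 1% cut-off used to call a node "saturated" are assumptions chosen to match the wording of the notes, not a rule stated in the deck.

```python
def marginal_gains(throughputs):
    """Percent gain in throughput for each added concurrent transformation."""
    return [(b - a) / a * 100 for a, b in zip(throughputs, throughputs[1:])]


def saturation_point(throughputs, min_gain_pct=1.0):
    """Concurrency level beyond which one more transformation adds < min_gain_pct."""
    for level, gain in enumerate(marginal_gains(throughputs), start=1):
        if gain < min_gain_pct:
            return level
    return len(throughputs)


# Hourly throughput (GB per hour) for 1, 2, and 3 concurrent transformations.
moderate_cpu = [121.9, 228.6, 229.2]
high_cpu = [88.8, 112.6, 113.4]

for name, series in (("Moderate CPU", moderate_cpu), ("High CPU", high_cpu)):
    gains = ", ".join(f"{g:.1f}%" for g in marginal_gains(series))
    print(f"{name}: gains {gains}; saturates at {saturation_point(series)} concurrent")
```

In both cases the second transformation is where the gain is, and the third adds under 1%, so under this cut-off the single 8-core node is treated as saturated at two concurrent transformations.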

  14. Example Sizing Solution
      • Capacity Requirement: 10 TB per Day
      • Solution: two 8-core nodes, each at 5.5 TB / Day / 8-core
      • Contingency: additional nodes, each at 5.5 TB / Day / 8-core
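The node count in the solution follows from dividing the daily requirement by the per-node peak and rounding up. A minimal sketch, assuming throughput scales linearly across nodes (which the deck's disclaimers suggest verifying with tests on actual equipment); the function name `nodes_needed` is hypothetical.

```python
import math


def nodes_needed(required_tb_per_day: float, node_tb_per_day: float) -> int:
    """Smallest whole number of nodes whose combined peak meets the requirement."""
    if node_tb_per_day <= 0:
        raise ValueError("node_tb_per_day must be positive")
    return math.ceil(required_tb_per_day / node_tb_per_day)


# Per-node daily peaks from the two examples (single 8-core node).
moderate_cpu_nodes = nodes_needed(10, 5.5)
high_cpu_nodes = nodes_needed(10, 2.7)

print(f"Moderate-CPU profile: {moderate_cpu_nodes} nodes for 10 TB per day")
print(f"High-CPU profile:     {high_cpu_nodes} nodes for 10 TB per day")
```

With the moderate-CPU figure this reproduces the two-node solution. The same requirement against the high-CPU figure needs four nodes, which shows how strongly the transformation profile drives the answer.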

  15. PDI Sizing – Use-Case Variation Considerations
      • Data formats: JSON, XML, CSV, Binary
      • Transformation sizes: small, medium, large (low-CPU, high-CPU)
      • Step copies: 1, 2, 4, 8, 16
      • Field sizes: 10B, 100B, 1K, 4K, 10K
      • Row sizes: 1K, 4K, 16K, and Rowset buffer
      • Step types: Regex, Javascript
      • Aggregations: Sort, Join, Analytic (Sum, Standard Deviation)
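One way to see how quickly these variations multiply is to enumerate them as a test matrix. The sketch below builds a full cross-product of the listed values. Treating them as a full factorial design is an assumption made for illustration (the deck does not say every combination was tested), and the dictionary keys are hypothetical names.

```python
from itertools import product

# Values taken from the slide; the key names are illustrative.
dimensions = {
    "data_format": ["JSON", "XML", "CSV", "Binary"],
    "transformation_size": ["small", "medium", "large"],
    "step_copies": [1, 2, 4, 8, 16],
    "field_size": ["10B", "100B", "1K", "4K", "10K"],
    "row_size": ["1K", "4K", "16K"],
    "step_type": ["Regex", "Javascript"],
    "aggregation": ["Sort", "Join", "Analytic"],
}


def build_test_matrix(dims):
    """Return one dict per combination of the given dimension values."""
    names = list(dims)
    return [dict(zip(names, combo)) for combo in product(*dims.values())]


test_matrix = build_test_matrix(dimensions)
print(f"{len(test_matrix)} combinations")
print("First case:", test_matrix[0])
```

A full cross-product of even these short lists runs to thousands of cases, so in practice a subset that matches the real workload is usually the more realistic target.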

  16. PDI Sizing – Best Practices
      • Performance Tuning
        – Optimize input and output structures (e.g. pre-sort data)
        – Identify slowest steps
        – Use optimal steps, such as Bulk Loading
      • Scaling up
        – Number of step copies
        – Number of instances
        – Clustering transformations
      • Perform load test to determine peak throughput
      • Monitoring and alerting

  17. Considerations and Recommendations
      • Capacity planning
        – CPU, Storage, Network
        – Allow for operating margins
          • 20% to 50%
        – Allow for System overhead
          • 10%
      • Data forecasting
        – Monthly and Annual growth
        – Ad-hoc projects
        – Data Growth
        – Additional Transformations
      • System Maintenance
        – Allow for maintenance, scheduled and MTBF
          • Backups, Upgrades, Backlog Recovery
        – Redundancy, Failover
        – Ongoing system and performance monitoring
      • Data Cycles
        – Near Real-time streaming versus daily analytical batches
      • System optimization
        – See Pentaho Best Practices
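The operating margin and system overhead figures above can be folded into a node count. How the two percentages combine is not specified in the deck. The sketch below assumes they are applied multiplicatively as reductions to each node's measured peak, and the function name `nodes_with_margins` is hypothetical.

```python
import math


def nodes_with_margins(required_tb_per_day: float,
                       node_peak_tb_per_day: float,
                       operating_margin: float,
                       system_overhead: float = 0.10) -> int:
    """Node count after derating each node's peak by overhead and operating margin."""
    # Assumption: both reductions apply multiplicatively to the measured peak.
    usable = node_peak_tb_per_day * (1 - system_overhead) * (1 - operating_margin)
    if usable <= 0:
        raise ValueError("margins leave no usable capacity")
    return math.ceil(required_tb_per_day / usable)


# 10 TB per day requirement against the 5.5 TB per day moderate-CPU node peak.
low_margin_nodes = nodes_with_margins(10, 5.5, operating_margin=0.20)
high_margin_nodes = nodes_with_margins(10, 5.5, operating_margin=0.50)

print(f"20% operating margin: {low_margin_nodes} nodes")
print(f"50% operating margin: {high_margin_nodes} nodes")
```

Under these assumptions the two-node baseline grows to three nodes at a 20% margin and five at 50%, before any allowance for data growth.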

  18. PDI Sizing – support.pentaho.com
      • Best Practices
      • Product Documentation
      • Enterprise Support
      • Pentaho Services
