PDI Sizing Overview and Case Study



SLIDE 1

PDI Sizing Overview and Case Study

Steve Szabo Pentaho Lead Solution Engineer, Hitachi Vantara

SLIDE 2

Introduction and Agenda

  • Introduction

– What is PDI Sizing, and Why Do We Need to be Concerned About it?

  • Agenda

– Brief Anatomy of PDI
– Example Sizing Problem
– Review Test Cases
– Example Sizing Solution
– Review Major Constraints, Bottlenecks, and Best Practices

  • Next Steps

– Recommendations
– Resources


SLIDE 3

Example Sizing Problem

  • Retail business has daily data that must be readied for next-day analysis

– Data volume fluctuates daily
– 8-hour delivery window
– 10 TB per day peak, 5 TB per day average

Peak = 10 TB

10 TB per Day ≈ 400 GB per Hour (about 417 GB per hour when averaged over 24 hours)
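The conversion from daily volume to hourly throughput can be sketched as follows. This is a minimal illustration, not part of the original deck: the function name is hypothetical, 1 TB is taken as 1000 GB, and applying the 8-hour delivery window to the full peak volume is an assumption (the slide's 400 GB per hour figure corresponds to a 24-hour basis).

```python
def required_hourly_throughput_gb(daily_volume_tb: float, window_hours: float) -> float:
    """Return the GB per hour needed to process daily_volume_tb within window_hours.

    Assumes decimal units (1 TB = 1000 GB); the deck does not state which
    convention it uses.
    """
    if window_hours <= 0:
        raise ValueError("window_hours must be positive")
    return daily_volume_tb * 1000.0 / window_hours


# Peak day spread over a full 24 hours (the basis the slide appears to use).
peak_over_24h = required_hourly_throughput_gb(10, 24)

# Peak day squeezed into the 8-hour delivery window (an assumed reading).
peak_over_8h = required_hourly_throughput_gb(10, 8)

# Average day spread over 24 hours.
average_over_24h = required_hourly_throughput_gb(5, 24)

print(f"Peak, 24 h basis:    {peak_over_24h:.0f} GB/hour")
print(f"Peak, 8 h window:    {peak_over_8h:.0f} GB/hour")
print(f"Average, 24 h basis: {average_over_24h:.0f} GB/hour")
```

If the whole peak volume really must move through the 8-hour window, the hourly requirement is three times the 24-hour figure, so it is worth settling which basis applies before sizing.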

SLIDE 4

Sizing Disclaimers

  • Past performance is not a guarantee of future results
  • The best practice is to run throughput tests with fully representative data and transformation profiles on actual equipment
  • Sizing should accommodate data growth and operational margins
  • The results here represent throughput with a single transformation type under controlled conditions

– Increasing the variety of transformations may result in lower performance

  • See Pentaho Best Practices for performance tuning
SLIDE 5

What is PDI Sizing?

Determining the number of nodes and cores needed to process data within time constraints

[Diagram labels: Data Sources, Data Targets]

SLIDE 6

PDI Sizing Variables and Constraints

  • Time
  • Available Input bandwidth
  • Available Output bandwidth
  • CPU (cores) ops/sec
  • Available Memory
  • Number of Nodes
  • Transformations
  • Jobs
  • Inputs
  • Customer Demands

SLIDE 7

Pentaho Data Integration Sizing Factors

  • Available Time

– Processing time required
– Turn-around time requirements

  • Amount of Data

– Interdependencies and lag

  • Available Resources

– Computing power: cores, memory, storage, network
– Number of nodes

  • Complexity of Transformations
SLIDE 8

PDI Anatomy

  • Platform

– CPU, Cores, Memory
– JVM

  • Jobs

– Orchestration

  • Transformation attributes

– Threads
– Connections
– Steps

    • Blocking Steps
    • Expensive Steps

– Rowset buffers
– Step copies
– Multiple streams

SLIDE 9

PDI Sizing – Hardware Constraints

  • Enterprise-grade, supported processors (not end-of-life)
  • 8-core processors
  • 32 GB or more of available RAM – 24 GB+ per JVM
  • High-speed network connections (1 Gb/sec – 10 Gb/sec)
  • Low number of network hops – co-located nodes on the same segment
  • Cluster Configurations

– Carte Clustering
– Hadoop MapReduce
– AWS Auto Scaling
– Spark Clusters

SLIDE 10

Example 1: High I/O, Moderate CPU use-case**

  • Single 8-core node with a single type of transformation
  • Throughput Peak:

– 229 GB per Hour
– 916 GB per 4-Hours
– 5.5 TB per 24-Hours

SLIDE 11

Summarized Results – High I/O, Moderate CPU use-case**

Number of Concurrent Transformations | Hourly Throughput | Daily Throughput | Notes
1 | 121.9 GB per Hour | 2.9 TB per Day | Medium (15 steps)
2 | 228.6 GB per Hour | 5.4 TB per Day | This represents an 88% increase in throughput
3 | 229.2 GB per Hour | 5.5 TB per Day | This is less than 1% more throughput
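The percentage notes in the Example 1 results can be reproduced from the hourly figures. The sketch below is illustrative only; the helper name `percent_gain` is hypothetical, and the throughput values are copied from the results above.

```python
def percent_gain(before: float, after: float) -> float:
    """Return the percentage increase going from `before` to `after`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (after - before) / before * 100.0


# Hourly throughput in GB per hour, keyed by number of concurrent transformations
# (values taken from the Example 1 results).
hourly_gb = {1: 121.9, 2: 228.6, 3: 229.2}

# Gain from adding each extra concurrent transformation.
gains = {n: percent_gain(hourly_gb[n - 1], hourly_gb[n]) for n in (2, 3)}

for n, gain in gains.items():
    print(f"{n - 1} -> {n} concurrent transformations: {gain:.1f}% more throughput")
```

The second transformation nearly doubles throughput, while the third adds a fraction of a percent, which suggests the node is saturated at two concurrent transformations for this workload.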

SLIDE 12

Example 2: High I/O, High CPU use-case**

  • Single 8-core node with a single type of transformation
  • Throughput Peak:

– 113 GB per Hour
– 452 GB per 4-Hours
– 2.7 TB per 24-Hours

SLIDE 13

Summarized Results – High I/O, High CPU use-case**

Number of Concurrent Transformations | Hourly Throughput | Daily Throughput | Notes
1 | 88.8 GB per Hour | 2.1 TB per Day | Medium (15 steps)
2 | 112.6 GB per Hour | 2.7 TB per Day | This represents a 27% increase in throughput
3 | 113.4 GB per Hour | 2.7 TB per Day | This is less than 1% more throughput
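One way to read results like these is to look for the concurrency level beyond which extra transformations stop paying off. The sketch below is a hypothetical helper, not something from the deck; in particular, the 5% cut-off for a "worthwhile" gain is an assumption chosen for illustration.

```python
def saturation_point(throughput_by_concurrency: dict, min_gain_pct: float = 5.0) -> int:
    """Return the lowest concurrency level after which one more concurrent
    transformation adds less than min_gain_pct percent throughput.

    If every step still yields at least min_gain_pct, the highest tested
    level is returned (saturation was not reached in the test range).
    """
    levels = sorted(throughput_by_concurrency)
    for current, following in zip(levels, levels[1:]):
        before = throughput_by_concurrency[current]
        after = throughput_by_concurrency[following]
        gain_pct = (after - before) / before * 100.0
        if gain_pct < min_gain_pct:
            return current
    return levels[-1]


# Hourly throughput in GB per hour from the two example test cases.
moderate_cpu = {1: 121.9, 2: 228.6, 3: 229.2}
high_cpu = {1: 88.8, 2: 112.6, 3: 113.4}

print(saturation_point(moderate_cpu))
print(saturation_point(high_cpu))
```

Both test cases level off at two concurrent transformations on the single 8-core node, although the high-CPU case gains far less from the second one.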

SLIDE 14

Example Sizing Solution

Capacity: 5.5 TB / Day / 8-core
Requirement: 10 TB per Day
Solution: 5.5 TB / Day / 8-core + 5.5 TB / Day / 8-core (Two 8-core nodes)
Contingency: 5.5 TB / Day / 8-core (Additional nodes)
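The node count in this solution follows from dividing the daily requirement by the measured per-node capacity and rounding up. The sketch below assumes throughput scales linearly with the number of nodes, which the deck does not guarantee; the function and parameter names are hypothetical.

```python
import math


def nodes_required(requirement_tb_per_day: float,
                   node_capacity_tb_per_day: float,
                   contingency_nodes: int = 0) -> int:
    """Return the number of nodes needed to meet a daily requirement.

    Assumes each node contributes its full measured capacity (linear scaling)
    and adds any contingency nodes on top of the rounded-up base count.
    """
    if requirement_tb_per_day < 0 or node_capacity_tb_per_day <= 0:
        raise ValueError("requirement must be >= 0 and node capacity must be > 0")
    base_nodes = math.ceil(requirement_tb_per_day / node_capacity_tb_per_day)
    return base_nodes + contingency_nodes


# High I/O, moderate CPU profile: 5.5 TB per day per 8-core node.
print(nodes_required(10, 5.5))                       # 2
print(nodes_required(10, 5.5, contingency_nodes=1))  # 3

# High I/O, high CPU profile: 2.7 TB per day per 8-core node.
print(nodes_required(10, 2.7))                       # 4
```

The same 10 TB requirement needs twice as many nodes under the high-CPU profile, which is why the transformation profile used in the load test matters as much as the data volume.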

SLIDE 15

PDI Sizing – Use-Case Variation Considerations

  • Data formats: JSON, XML, CSV, Binary
  • Transformation sizes: small, medium, large (low-CPU, high-CPU)
  • Step copies: 1, 2, 4, 8, 16
  • Field sizes: 10B, 100B, 1K, 4K, 10K
  • Row sizes: 1K, 4K, 16K, and rowset buffer
  • Step types: Regex, JavaScript
  • Aggregations: Sort, Join, Analytic (Sum, Standard Deviation)
SLIDE 16

PDI Sizing – Best Practices

  • Performance Tuning

– Optimize input and output structures (e.g., pre-sort data)
– Identify slowest steps
– Use optimal steps, such as Bulk Loading

  • Scaling up

– Number of step copies
– Number of instances
– Clustering transformations

  • Perform load tests to determine peak throughput
  • Monitoring and alerting
SLIDE 17

Considerations and Recommendations

  • Capacity planning – CPU, Storage, Network

– Allow for operating margins

    • 20% to 50%

– Allow for system overhead

    • 10%

  • Data Cycles

– Near real-time streaming versus daily analytical batches

  • System Maintenance

– Allow for scheduled maintenance and failures (MTBF)

    • Backups, Upgrades, Backlog Recovery

– Redundancy, Failover
– Ongoing system and performance monitoring

  • Data forecasting – Monthly and Annual growth

– Ad-hoc projects
– Data Growth
– Additional Transformations

  • System optimization – See Pentaho Best Practices
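The margin and overhead figures above can be folded into a capacity target along the following lines. This is a sketch under stated assumptions: the deck gives the percentages but not how to combine them, so multiplying the factors together, compounding growth annually, and the function name itself are all illustrative choices.

```python
def provisioned_capacity_tb(peak_daily_tb: float,
                            operating_margin: float = 0.20,
                            system_overhead: float = 0.10,
                            annual_growth: float = 0.0,
                            years: int = 0) -> float:
    """Return a daily capacity target in TB after growth, margin, and overhead.

    Assumed model: forecast growth compounds annually, then the operating
    margin and system overhead are applied multiplicatively.
    """
    if peak_daily_tb < 0:
        raise ValueError("peak_daily_tb must be non-negative")
    forecast_peak = peak_daily_tb * (1 + annual_growth) ** years
    return forecast_peak * (1 + operating_margin) * (1 + system_overhead)


# 10 TB per day peak with the low and high ends of the operating margin range.
low_margin = provisioned_capacity_tb(10, operating_margin=0.20)
high_margin = provisioned_capacity_tb(10, operating_margin=0.50)

# Same peak with a hypothetical 10% annual data growth over two years.
with_growth = provisioned_capacity_tb(10, operating_margin=0.20,
                                      annual_growth=0.10, years=2)

print(f"20% margin:            {low_margin:.1f} TB/day")
print(f"50% margin:            {high_margin:.1f} TB/day")
print(f"20% margin, 2y growth: {with_growth:.1f} TB/day")
```

Under this model the 10 TB peak becomes a target of roughly 13 to 17 TB per day before any growth forecast, which is more than two 8-core nodes at 5.5 TB per day each can deliver.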

SLIDE 18

PDI Sizing – support.pentaho.com

  • Best Practices
  • Product Documentation
  • Enterprise Support
  • Pentaho Services
SLIDE 19