SLIDE 1

Processing Big Data with Pentaho

Rakesh Saha Pentaho Senior Product Manager, Hitachi Vantara

SLIDE 2

Agenda

Pentaho's Latest and Upcoming Features for Processing Big Data – Batch or Real-time

  • Process big data visually in a future-proof way
    – Demo
  • Combine stream data processing with batch
    – Demo

SLIDE 3

Big Data Processing is HARD

"Through 2018, 70% of Hadoop deployments will not meet cost savings and revenue generation objectives due to skills and integration challenges." – GARTNER¹

1. New Skills Necessary
2. High Effort and Risk
3. Continuous Change

1) Gartner Analyst, Nick Heudecker; infoworld.com, Sept 2015

SLIDE 4

Big Data Integration and Analytics Workflow with Pentaho

Big Data Challenges

  • Processing semi-/unstructured data
  • Blending big data with traditional data
  • Maintaining security and governance of data
  • Processing streaming data in real time and historically
  • Enabling and operationalizing data science

[Workflow diagram: sensors, message queues (Kafka, JMS, MQTT), and big or small data feed Pentaho Data Integration into a data lake and analytic database; results flow out to Pentaho Analyzer, Pentaho Reporting, machine learning (R, Python) with a stream feedback loop, and embedded LOB applications.]

SLIDE 5

Process Big Data Visually in a Future-Proof Way

SLIDE 6

Visual Big Data Processing with Pentaho

  • What: Visually ingest and process Big Data at enterprise scale
  • What's special: Visually develop once and execute on any engine with the Adaptive Execution Layer (AEL)
  • Why:
    – Difficult to find qualified developers
    – Difficult to keep up with new technologies
  • Available since Pentaho 7.1
SLIDE 7

Adaptive Execution of Big Data

Build Once, Execute on Any Engine

Challenge: With rapidly changing big data technology, coding on various engines can be time-consuming or impossible with existing resources.

Solution: Future-proof data integration and analytics development in a drag-and-drop visual development environment, eliminating the need for specialized coding and API knowledge. Seamlessly switch between execution engines to fit data volume and transformation complexity.

SLIDE 8

Adaptive Execution for Spark

Process Big Data Faster on Spark Without Any Coding

Challenge: Finding the talent and time to work with Spark and newer big data technologies.

Solution: More easily develop big data applications in PDI using adaptive execution to ingest, process, and blend data from a range of big data sources and scale on Spark clusters.

SLIDE 9

Upcoming Enhanced Adaptive Execution Layer

  • Simplified setup
    – Fewer steps to set up
    – Easy to configure fail-over and load-balancing
  • Development productivity
    – Robust transformation error and status reporting
    – Customization of Spark jobs
  • Robust enterprise security
    – Client-to-AEL connection can be secured
    – End-to-end Kerberos impersonation from client tool to cluster

[Architecture diagram: the PDI client connects to the AEL-Spark daemon on the Hadoop cluster's edge nodes; the AEL-Spark engine (Spark driver) dispatches work to Spark executors on the processing nodes, backed by Hadoop/Spark-compatible storage such as HDFS, Azure Storage, or Amazon S3.]

SLIDE 10

Upcoming Big Data File Format Handling

Big Data platforms introduced various data formats to improve performance, compression, and interoperability.

What:

  • Visual handling of data files in Big Data formats, specifically Parquet and Avro
    – Reading and writing files with specific steps
    – Native execution in Spark via AEL

Why:

  • Ease of development of Big Data processing
  • Performance improvement by avoiding intermediate formats (a sketch of direct Parquet writing follows below)
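To make the format handling concrete, here is a minimal Java sketch, outside PDI, of writing Parquet through the Apache Parquet Avro bindings. This is not Pentaho's implementation; the schema, field names, and output path are illustrative assumptions.

    import org.apache.avro.Schema;
    import org.apache.avro.SchemaBuilder;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.avro.AvroParquetWriter;
    import org.apache.parquet.hadoop.ParquetWriter;
    import org.apache.parquet.hadoop.metadata.CompressionCodecName;

    public class ParquetWriteSketch {
        public static void main(String[] args) throws Exception {
            // Declare the record layout once in Avro; Parquet derives its schema from it.
            Schema schema = SchemaBuilder.record("WebLogEvent").fields()
                .requiredString("sessionId")
                .requiredLong("timestamp")
                .requiredString("url")
                .endRecord();

            // Write directly in the columnar format -- no intermediate text/CSV stage.
            try (ParquetWriter<GenericRecord> writer =
                     AvroParquetWriter.<GenericRecord>builder(new Path("weblog.parquet"))
                         .withSchema(schema)
                         .withCompressionCodec(CompressionCodecName.SNAPPY)
                         .build()) {
                GenericRecord rec = new GenericData.Record(schema);
                rec.put("sessionId", "abc-123");            // illustrative values only
                rec.put("timestamp", System.currentTimeMillis());
                rec.put("url", "/checkout");
                writer.write(rec);
            }
        }
    }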

SLIDE 11

Demonstration

SLIDE 12

Retail Web Log Data Processing with Pentaho

  • Runs within Spoon via the Pentaho engine during development, then uses a Spark cluster for production
  • Uses lookups, sort, Parquet file in/out, and other steps to test parallel and serial processing within the Spark cluster (a rough Spark equivalent is sketched below)
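For orientation, here is a hedged Java sketch of what an equivalent hand-coded Spark pipeline might look like; in the demo the same logic is built visually in PDI, with AEL handling the Spark execution. Paths and column names are assumptions.

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class WebLogBatchSketch {
        public static void main(String[] args) {
            // local[*] mirrors in-Spoon development; a cluster master is used in production.
            SparkSession spark = SparkSession.builder()
                .appName("retail-weblog")
                .master("local[*]")
                .getOrCreate();

            Dataset<Row> logs = spark.read().parquet("hdfs:///retail/weblogs");      // Parquet input
            Dataset<Row> products = spark.read().parquet("hdfs:///retail/products"); // lookup source

            Dataset<Row> enriched = logs
                .join(products, "productId")      // the visual lookup step
                .sort("sessionId", "timestamp");  // the visual sort step

            enriched.write().mode("overwrite").parquet("hdfs:///retail/weblogs_enriched"); // Parquet output
            spark.stop();
        }
    }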

SLIDE 13

Combine Stream Processing with Batch Processing

SLIDE 14

What is Stream Data Processing? And Why?

  • Batch data processing is useful, but sometimes businesses need to obtain crucial insights faster and act on them
  • Many use cases must consider data two or more times: on the wire, and then subsequently as historical data
  • Get crucial time-sensitive insights
    – React to customer interactions on a website or mobile app
    – Predict risk of equipment breakdown before it happens

The former POV, "secure data in the DW, then OLAP ASAP afterward," gives way to the current POV: "analyze on the wire, write behind."

SLIDE 15

NEW Stream Data Processing with Pentaho

  • Visually ingest and produce data from/to Kafka using NEW steps
  • Process micro-batch chunks of data using either a time-based or a message size-based window
  • Switch processing engines between Spark (Streaming) and native Kettle
  • Harden stream processing libraries and steps to process data from traditional message queues
  • Benefits:
    – Lower the bar to build streaming applications
    – Enable combining batch and stream data processing

SLIDE 16

How to Process Stream Data in Pentaho

  • Steps for Kafka ingestion and publishing
    – Kafka Consumer
    – Kafka Producer
  • Steps for stream processing
    – Get Records from Stream
  • Ingest and process a continuous stream of data in near real time in the parent transformation
  • Process micro-batches of stream data in a separate child transformation (the same pattern is sketched in plain Java below)
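A minimal plain-Java sketch of the parent/child pattern described above, using the standard Kafka consumer API: poll continuously, close a micro-batch when a time window or a size window fills, and hand it off (in PDI, to the child transformation). The topic name and window thresholds are assumptions.

    import java.time.Duration;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class MicroBatchSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");   // illustrative broker
            props.put("group.id", "store-events-demo");
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("store-events"));
                List<String> batch = new ArrayList<>();
                long windowStart = System.currentTimeMillis();

                // Parent loop: ingest the continuous stream in near real time.
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    for (ConsumerRecord<String, String> r : records) {
                        batch.add(r.value());
                    }
                    // Close the micro-batch on a time-based OR message size-based window.
                    boolean timeUp = System.currentTimeMillis() - windowStart >= 5_000;
                    boolean sizeUp = batch.size() >= 1_000;
                    if ((timeUp || sizeUp) && !batch.isEmpty()) {
                        processBatch(batch); // in PDI: the separate child transformation
                        batch.clear();
                        windowStart = System.currentTimeMillis();
                    }
                }
            }
        }

        private static void processBatch(List<String> batch) {
            System.out.println("processing " + batch.size() + " events");
        }
    }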

SLIDE 17

Combined Data Processing Using Spark & Pentaho

[Architecture diagram, Ingest → Process → Publish → Reporting: data sources (web clickstream and other logs, traditional DB/DW and NoSQL datastores, traditional message buses, IoT data) feed a Kafka cluster and a PDI data collector; real-time and batch data processors (Spark, Hadoop MR, HDFS) run on the Hadoop/Spark cluster; a PDI data publisher delivers results to Kafka clusters, analytical databases, microservices, and Pentaho Analytics.]

  • Ingest: PDI collects data from sources including Kafka clusters
  • Process: PDI can process streaming data using Spark and Spark Streaming or the Kettle engine in a completely visual way (see the sketch below)
  • Publish: PDI can retrieve processed or blended data from Hadoop/Spark and publish it to Kafka clusters or external databases
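To ground the middle of this flow, here is a hedged Java sketch of Spark Structured Streaming ingesting from Kafka and landing results as Parquet in cluster storage. This is the plain Spark API, not the PDI implementation; the broker address, topic, and paths are illustrative.

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.streaming.StreamingQuery;

    public class StreamToStoreSketch {
        public static void main(String[] args) throws Exception {
            SparkSession spark = SparkSession.builder()
                .appName("kafka-to-parquet")
                .getOrCreate();

            // Ingest: subscribe to the Kafka topic as an unbounded table.
            Dataset<Row> events = spark.readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", "broker1:9092")
                .option("subscribe", "clickstream")
                .load()
                .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "timestamp");

            // Process/Publish: continuously append results as Parquet in the data store.
            StreamingQuery query = events.writeStream()
                .format("parquet")
                .option("path", "hdfs:///lake/clickstream")
                .option("checkpointLocation", "hdfs:///lake/_chk/clickstream")
                .start();

            query.awaitTermination();
        }
    }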

SLIDE 18

Demonstration

SLIDE 19

Retail Store Event Processing

  • Can be run within Spoon via the Pentaho engine or within the AEL-Spark engine
  • Utilizes Kafka in/out, Parquet out, and other steps to demonstrate stream data ingestion, window processing, and much more…

SLIDE 20

Availability and Roadmap

SLIDE 21

Availability

  • Adaptive Execution Layer (AEL) and AEL-Spark available in Pentaho 7.1
    – Secure Spark integration, high availability, and security of AEL are EE only
    – Supported Hadoop distros: Cloudera CDH in Pentaho 7.1; Cloudera CDH and Hortonworks HDP in Pentaho 8.0
  • Kafka steps and stream data processing available in Pentaho 8.0
    – Kafka from Cloudera and Hortonworks to be supported

SLIDE 22

Roadmap

  • Extending AEL to support other Spark distros and other data processing engines
  • Advanced stream processing with other real-time messaging protocols and windowing mechanisms

  • Enabling Big Data-driven machine learning on batch or stream data
  • Integration with the broader Hitachi Vantara portfolio
SLIDE 23

SUMMARY: Visual Future-Proof Big Data Processing with Pentaho

Visually build stream data processing pipelines for different streaming engines

  • Configure stream data processing logic
  • Execute logic in multiple stream processing engines without rework
  • Connect to streaming data sources

NEW in Pentaho:
  ✓ Native Streaming in PDI
  ✓ Spark Streaming via AEL
  ✓ Kafka Connectivity

Leverage the power of Adaptive Execution to future-proof data processing pipelines

  • Configure logic without coding
  • Switch processing engines without rework
  • Handle Big Data formats more efficiently

NEW in Pentaho:
  ✓ Adaptive Execution Layer
  ✓ Visual Spark via AEL
  ✓ Native Big Data Format Handling

SLIDE 24

Next Steps

Want to learn more?

  • Meet-the-Experts:
    – Anthony DeShazor
    – Luke Nazarro
    – Carlo Russo
  • Recommended Breakout Sessions:
    – Jonathan Jarvis: Understanding Parallelism with PDI and Adaptive Execution with Spark
    – Mark Burnette: Understanding the Big Data Technology Ecosystem
