Processing Big Data with Pentaho
Rakesh Saha, Pentaho Senior Product Manager, Hitachi Vantara

Agenda
Pentaho’s Latest and Upcoming Features for Processing Big Data: Batch or Real-time
• Process big data visually in a future-proof way (demo)
• Combine stream data processing with batch (demo)

Big Data Processing is HARD
1. New Skills Necessary
2. High Effort and Risk
3. Continuous Change
"Through 2018, 70% of Hadoop deployments will not meet cost savings and revenue generation objectives due to skills and integration challenges." – GARTNER 1)
1) Gartner Analyst, Nick Heudecker; infoworld.com, Sept 2015

Big Data Integration and Analytics Workflow with Pentaho
Big Data Challenges
• Processing semi/un/structured data
• Blending big data with traditional data
• Maintaining security, governance of data
• Processing streaming data in real time and historically
• Enabling and operationalizing data science
[Diagram: MSG Queue (Kafka, JMS, MQTT), Sensor, and LOB Applications feed a Stream and a Data Lake of big or small data; Pentaho Data Integration connects these to Machine Learning (R, Python), an Analytic Database, Pentaho Analyzer, Pentaho Reporting, and Embedded analytics, with a Feedback Loop]

Process Big Data Visually in a Future Proof Way

Visual Big Data Processing with Pentaho
• What: Visually ingest and process Big Data at enterprise scale
• What's special: Visually develop once and execute on any engine with the Adaptive Execution Layer (AEL)
• Why
– Difficult to find qualified developers
– Difficult to keep up with new technologies
• Available since Pentaho 7.1

Adaptive Execution of Big Data: Build Once, Execute on Any Engine
[Slide graphic labels: PDI, Pentaho Kettle]
Challenge: With rapidly changing big data technology, coding on various engines can be time-consuming or impossible with existing resources
Solution: Future-proof data integration and analytics development in a drag-and-drop visual development environment, eliminating the need for specialized coding and API knowledge. Seamlessly switch between execution engines to fit data volume and transformation complexity
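To make "build once, execute on any engine" concrete, here is a minimal Python sketch of the general idea, not of Pentaho's implementation. The pipeline is declared once as a list of row-level steps, and the engine that runs it is chosen at execution time. All names (`PIPELINE`, `run_local`, `run_parallel`, `execute`) are hypothetical, and the thread pool merely stands in for a distributed engine such as Spark.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List

Row = dict
Step = Callable[[Row], Row]


# The pipeline logic is defined once, independent of any engine.
def add_total(row: Row) -> Row:
    return {**row, "total": row["price"] * row["qty"]}


def tag_large(row: Row) -> Row:
    return {**row, "large": row["total"] >= 100}


PIPELINE: List[Step] = [add_total, tag_large]


def apply_steps(row: Row, steps: List[Step]) -> Row:
    """Run every step of the pipeline on a single row, in order."""
    for step in steps:
        row = step(row)
    return row


# Hypothetical engine 1: serial, in-process execution.
def run_local(rows: Iterable[Row], steps: List[Step]) -> List[Row]:
    return [apply_steps(r, steps) for r in rows]


# Hypothetical engine 2: a thread pool standing in for a cluster engine.
def run_parallel(rows: Iterable[Row], steps: List[Step], workers: int = 4) -> List[Row]:
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so both engines return the same list.
        return list(pool.map(lambda r: apply_steps(r, steps), rows))


ENGINES = {"local": run_local, "parallel": run_parallel}


def execute(rows: Iterable[Row], engine: str = "local") -> List[Row]:
    """Run the unchanged pipeline on whichever engine is selected."""
    return ENGINES[engine](rows, PIPELINE)
```

Switching engines is then a one-argument change, for example `execute(rows, engine="parallel")`, while the step definitions stay untouched.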

Adaptive Execution for Spark: Process Big Data Faster on Spark Without Any Coding
[Slide graphic labels: PDI, Pentaho Kettle]
Challenge: Finding the talent and time to work with Spark and newer big data technologies
Solution: More easily develop big data applications in PDI using adaptive execution to ingest, process and blend data from a range of big data sources and scale on Spark clusters

Upcoming Enhanced Adaptive Execution Layer
• Simplified Setup
– Fewer steps to set up
– Easy to configure fail-over, load-balancing
• Development productivity
– Robust transformation error and status reporting
– Customization of Spark jobs
• Robust Enterprise Security
– Client to AEL connection can be secured
– End-2-end Kerberos impersonation from client tool to cluster
[Diagram: PDI Client connects to the AEL-Spark Daemon (Edge Nodes); inside the HADOOP CLUSTER, Spark/Hadoop Processing Nodes run the AEL-Spark Engine (Spark Driver) and Spark Executors; the Hadoop/Spark Compatible Storage Cluster includes HDFS, Azure Storage, Amazon S3, Etc…]

Upcoming Big Data File Format Handling
Big Data platforms introduced various data formats to improve performance, compression, and interoperability
What:
• Visual handling of data files with Big Data formats Parquet and Avro
– Reading and writing files with specific steps
– Natively execute in Spark via AEL
Why:
• Ease of development of Big Data processing
• Performance improvement due to avoidance of intermediate formats

Demonstration

Retail Web Log Data Processing with Pentaho
• Run within Spoon via Pentaho during development, and then use a Spark cluster for production
• Uses lookups, sort, Parquet file in/out, and other steps to test parallel and serial processing within the Spark cluster

Combine Stream Processing with Batch Processing

What is Stream Data Processing? And Why?
• Batch data processing is useful, but sometimes businesses need to obtain crucial insights faster and act on them
• Many use cases must consider data 2+ times: on the wire, and then subsequently as historical data
• Get crucial time-sensitive insights
– React to customer interactions on a website or mobile app
– Predict risk of equipment breakdown before it happens
Former POV “secure data in DW, then OLAP ASAP afterward” gives way to Current POV “analyze on the wire, write behind”

NEW Stream Data Processing with Pentaho
• Visually ingest and produce data from/to Kafka using NEW steps
• Process micro-batch chunks of data using either a time-based or a message size-based window
• Switch processing engines between Spark (Streaming) or Native Kettle
• Harden stream processing libraries and steps to process data from traditional message queues
• Benefits:
– Lower the bar to build streaming applications
– Enable combining batch and stream data processing
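The windowing idea can be illustrated with a short Python sketch. This shows the general micro-batching technique only, not how the Pentaho steps are implemented. It assumes that a "message size-based" window means a maximum number of messages per batch, and the names `micro_batches`, `max_count`, and `max_seconds` are hypothetical. To stay deterministic, the sketch uses timestamps carried by the messages; a real stream processor would typically also close a window on a timer, even when no new message arrives.

```python
from typing import Iterable, Iterator, List, Optional, Tuple

Message = Tuple[float, str]  # (arrival time in seconds, payload)


def micro_batches(
    messages: Iterable[Message], max_count: int, max_seconds: float
) -> Iterator[List[str]]:
    """Group a message stream into micro-batches.

    A batch is emitted when it holds max_count messages, or when a message
    arrives max_seconds or more after the first message of the open batch.
    """
    batch: List[str] = []
    window_start: Optional[float] = None
    for ts, payload in messages:
        if window_start is None:
            window_start = ts
        # Time-based close: the incoming message falls outside the window.
        if batch and ts - window_start >= max_seconds:
            yield batch
            batch, window_start = [], ts
        batch.append(payload)
        # Count-based close: the batch is full.
        if len(batch) >= max_count:
            yield batch
            batch, window_start = [], None
    # Flush whatever is left when the stream ends.
    if batch:
        yield batch
```

For example, `list(micro_batches(stream, max_count=3, max_seconds=2.0))` yields lists of at most three payloads, each spanning less than two seconds of arrival time.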

How to Process Stream Data in Pentaho
• Steps for Kafka ingestion and publish
– Kafka Consumer
– Kafka Producer
• Steps for stream processing
– Get records from stream
• Ingest and process continuous stream of data in near real-time in parent transformation
• Process micro-batch of stream data in separate child transformation
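The parent/child split can be sketched in plain Python. This is an analogy under stated assumptions, not PDI code: an ordinary list stands in for the Kafka topic read by the Kafka Consumer step, another list stands in for the topic written by the Kafka Producer step, and the function names `parent_transformation` and `child_transformation` are hypothetical. Batches are cut by a fixed record count to keep the example short.

```python
from itertools import islice
from typing import Callable, Dict, Iterable, List

Record = dict


def child_transformation(batch: List[Record]) -> List[Record]:
    """Stand-in for the child transformation: total one micro-batch per store."""
    totals: Dict[str, float] = {}
    for rec in batch:
        totals[rec["store"]] = totals.get(rec["store"], 0.0) + rec["amount"]
    return [{"store": s, "total": t} for s, t in sorted(totals.items())]


def parent_transformation(
    source: Iterable[Record],
    child: Callable[[List[Record]], List[Record]],
    sink: List[Record],
    batch_size: int,
) -> None:
    """Stand-in for the parent: read the stream, hand each micro-batch to the
    child, and publish the child's output to the sink."""
    stream = iter(source)
    while True:
        batch = list(islice(stream, batch_size))
        if not batch:
            break  # a real stream would keep waiting instead of ending
        sink.extend(child(batch))
```

The parent never inspects record contents; it only controls batching and delivery, so the per-batch logic can change without touching the ingestion loop.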

Combined Data Processing Using Spark & Pentaho
• Ingest: PDI collects data from sources including Kafka
• Process: PDI can process streaming data using Spark and Spark Streaming or Kettle engine in a completely visual way
• Publish: PDI can retrieve processed or blended data from Hadoop/Spark and publish to Kafka clusters or external databases
[Diagram: DATA SOURCES (Micro Services, IoT Data, Web Clickstream and Other Logs, Traditional DB/DW, NoSQL Datastores, Kafka Clusters, Traditional Message Bus) feed a Data Collector (Pentaho DI); the HADOOP/SPARK CLUSTER contains a Kafka Cluster, RT Data Processors, Batch Data Processors, Hadoop MR, and an HDFS Data Store, with Pentaho DI driving processing; a Data Publisher (Pentaho DI) delivers to Analytical Databases, Pentaho Analytics, and Reporting]

Demonstration

Retail Store Event Processing
• Can be run within Spoon via Pentaho or within the AEL-Spark engine
• Uses Kafka in/out, Parquet out, and other steps to demonstrate stream data ingestion, window processing and much more…

Availability and Roadmap

Availability
• Adaptive Execution Layer (AEL) and Spark-AEL available in Pentaho 7.1
– Secure Spark integration, high availability, and security of AEL are EE only
– Supported Hadoop distros: Cloudera CDH in Pentaho 7.1; Cloudera CDH and Hortonworks HDP in Pentaho 8.0
• Kafka steps and stream data processing available in Pentaho 8.0
– Kafka from Cloudera and Hortonworks to be supported

Roadmap
• Extending AEL to support other Spark distros and other data processing engines
• Advanced stream processing with other real-time messaging protocols and windowing mechanisms
• Enabling Big Data driven machine learning on batch or stream data
• Integrating with the broader Hitachi Vantara portfolio

SUMMARY: Visual Future-Proof Big Data Processing with Pentaho
Leverage the power of Adaptive Execution to future-proof data processing pipelines
• Configure logic without coding
• Switch processing engines without rework
• Handle Big Data formats more efficiently
NEW in Pentaho
✓ Adaptive Execution Layer
✓ Visual Spark via AEL
✓ Native Big Data Format Handling
Visually build stream data processing pipelines for different streaming engines
• Configure stream data processing logic
• Execute logic in multiple stream processing engines without rework
• Connect to streaming data sources
NEW in Pentaho
✓ Native Streaming in PDI
✓ Spark Streaming via AEL
✓ Kafka Connectivity

Next Steps
Want to learn more?
• Meet-the-Experts:
– Anthony DeShazor
– Luke Nazarro
– Carlo Russo
• Recommended Breakout Sessions:
– Jonathan Jarvis: Understanding Parallelism with PDI and Adaptive Execution with Spark
– Mark Burnette: Understanding the Big Data Technology Ecosystem
