SLIDE 1

Big Data Technology Ecosystem

Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara

SLIDE 2

Agenda

  • End-to-End Data Delivery Platform
  • Ecosystem of Data Technologies
  • Mapping an End-to-End Solution
  • Case Studies
  • Pentaho Key Capabilities
  • Summary
  • Q&A
SLIDE 3

End-to-End Data Delivery Platform

Ingest → Process → Report → Publish

  • Data Agnostic
  • Metadata Driven Ingestion
  • Data Orchestration
  • Native Hadoop Integration
  • Scale Up & Scale Out
  • Blend Unstructured Data
  • Streamlined Data Refinery
  • Data Virtualization
  • Machine Learning
  • Production Reporting
  • Custom Dashboards
  • Self-Service Dashboards
  • Interactive Analysis
  • Embedded Analytics
SLIDE 4

Delivering Insight

Ingest → Process → Report → Publish

Consumers: Data Analysts, Data Scientists, Data Engineers

  • Production Reporting
  • Custom Dashboards
  • Interactive Analysis
  • Self-Service Dashboards
  • Data Integration & Orchestration

SLIDE 5

Big Data Ecosystem

  • Analytical Databases
  • SQL on Hadoop
  • Relational Database
  • NoSQL Database
  • Message Streaming
  • HDFS MapReduce
  • Distributed Search
  • Event Stream Processing (ESP)
  • Complex Event Processing (CEP)

SLIDE 6

Data Source Attributes

  • Volume (Data Size): Small · Medium · Large
  • Variety (Data Type): Structured · Semi-Structured · Unstructured
  • Velocity (Processing): Batch · Micro-Batch · RT Streaming
  • Latency (Reporting): Scheduled · Prompted · Interactive

Technologies rated on these attributes: Analytical Databases, SQL on Hadoop, Relational Database, NoSQL Database, Message Streaming, Distributed Search, Event Stream Processing (ESP), Complex Event Processing (CEP), HDFS MapReduce.

SLIDE 7

Relational Database
Microsoft SQL Server, Oracle, MySQL, PostgreSQL, IBM DB2

Volume (Data Size): Small · Medium · Large
Operational databases for OLTP apps that require high transaction loads and user concurrency. They can "scale up" to larger data volumes but lack the ability to easily "scale out" for large-scale data processing.

Variety (Data Type): Structured · Semi-Structured · Unstructured
Structured schema of tables containing rows and columns of data, emphasizing integrity and consistency over speed and scale. Structured data is accessed with the SQL query language.

Velocity (Processing): Batch · Micro-Batch · RT Streaming
Rigid schemas with batch-oriented ingestion and SQL query processing are not designed for continuous streaming data.

Latency (Reporting): Scheduled · Prompted · Interactive
Optimized for frequent small CRUD queries (create, read, update, delete), not for analytic or interactive query workloads on large data.

Good Fit · Not Optimal · Not Recommended · Core Competency
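The OLTP strengths described above can be sketched with Python's built-in sqlite3 module, standing in for the RDBMSs named on the slide. This is a minimal sketch; the table and values are hypothetical.

```python
import sqlite3

# In-memory SQLite as a stand-in for an OLTP relational database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100.0), (2, 50.0)])

# An ACID transfer: both updates commit together or not at all.
with conn:  # the context manager commits on success, rolls back on error
    conn.execute("UPDATE accounts SET balance = balance - 25 WHERE id = 1")
    conn.execute("UPDATE accounts SET balance = balance + 25 WHERE id = 2")

balances = dict(conn.execute("SELECT id, balance FROM accounts"))
print(balances)  # {1: 75.0, 2: 75.0}
```

The transaction wrapper is the point: integrity and consistency over speed and scale, exactly the trade-off the slide names.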

SLIDE 8

Analytical Database (Columnar, In-Memory, MPP, OLAP)
Teradata, Oracle Exadata, IBM Netezza, EMC Greenplum, Vertica

Volume (Data Size): Small · Medium · Large
Data warehouse/data mart databases that support BI and advanced analytics workloads. MPP architecture gives the ability to "scale out" to large data volumes, at a financial cost.

Variety (Data Type): Structured · Semi-Structured · Unstructured
Structured schema of tables containing rows and columns of data, offering improved speed and scalability over an RDBMS but still limited to structured data.

Velocity (Processing): Batch · Micro-Batch · RT Streaming
Rigid schemas with batch-oriented SQL queries are not designed for streaming applications.

Latency (Reporting): Scheduled · Prompted · Interactive
All four types (Columnar, In-Memory, MPP, OLAP) are designed for improved query performance on analytic or interactive query workloads over large data.

Good Fit · Not Optimal · Not Recommended · Core Competency
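The columnar idea behind several of these engines can be illustrated in a few lines of plain Python. This is a layout sketch with hypothetical data, not a database implementation.

```python
# Contrast between row-oriented and column-oriented layouts, the idea
# behind the columnar analytical databases named on the slide.
rows = [
    {"region": "east", "sales": 100.0},
    {"region": "west", "sales": 200.0},
    {"region": "east", "sales": 50.0},
]

# Row store: an aggregate must touch every field of every row.
total_row_store = sum(r["sales"] for r in rows)

# Column store: each column lives contiguously, so a scan for one
# aggregate reads only the bytes that the query actually needs.
columns = {
    "region": [r["region"] for r in rows],
    "sales":  [r["sales"] for r in rows],
}
total_col_store = sum(columns["sales"])

assert total_row_store == total_col_store == 350.0
```

Both layouts give the same answer; the column layout just reads far less data per analytic query, which is where the speed advantage over a row-oriented RDBMS comes from.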

SLIDE 9

NoSQL Database
MongoDB, HBase, Cassandra, MarkLogic, Couchbase

Volume (Data Size): Small · Medium · Large
Good for web applications: less web app code to write, debug, and maintain. They "scale out" via horizontal scaling with auto-sharding of data to support millions of web app users, compromising on consistency (ACID transactions) in favor of scale and up-time.

Variety (Data Type): Structured · Semi-Structured · Unstructured
Hierarchical, key-value, or document design to capture all types of data in a single location.

Velocity (Processing): Batch · Micro-Batch · RT Streaming
Schema-less design allows rapid or continuous ingest at scale. A good storage option for the high-throughput, low-latency requirements of streaming applications that need real-time views of data; seen as a key component of the Lambda architecture.

Latency (Reporting): Scheduled · Prompted · Interactive
Low-level query languages, scarce skills, and lack of SQL support make NoSQL less appealing for reporting and analysis.

Good Fit · Not Optimal · Not Recommended · Core Competency
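The auto-sharding and schema-less points above can be sketched in plain Python: a hash of the key routes each document to one of N shards. The shard count, keys, and documents are hypothetical; real stores add replication and rebalancing on top of this idea.

```python
# Toy auto-sharding: hash(key) picks the shard, so adding shards is how
# the store "scales out" horizontally.
N_SHARDS = 4
shards = [dict() for _ in range(N_SHARDS)]

def put(key, document):
    shards[hash(key) % N_SHARDS][key] = document

def get(key):
    return shards[hash(key) % N_SHARDS].get(key)

# Schema-less: documents in the same collection need not share fields.
put("user:1", {"name": "Ada"})
put("user:2", {"name": "Grace", "tags": ["admin"]})

assert get("user:1") == {"name": "Ada"}
assert get("user:2")["tags"] == ["admin"]
```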

SLIDE 10

HDFS MapReduce
Cloudera, Hortonworks, MapR, Pivotal, Amazon EMR, Hitachi HSP, Microsoft HDInsight

Volume (Data Size): Small · Medium · Large
The Hadoop Distributed File System distributes and replicates file blocks horizontally across multiple commodity data nodes. MapReduce programming takes the compute to the data for batch processing of large data volumes.

Variety (Data Type): Structured · Semi-Structured · Unstructured
The file system is schema-less, allowing easy storage of any file type in multiple Hadoop file formats.

Velocity (Processing): Batch · Micro-Batch · RT Streaming
HDFS and MapReduce are designed for distributing batch-processing workloads over large datasets, not for micro-batch or streaming use cases.

Latency (Reporting): Scheduled · Prompted · Interactive
MapReduce on HDFS lacks SQL support, and report queries are slow, making it less appealing for reporting and analysis.

Good Fit · Not Optimal · Not Recommended · Core Competency
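The MapReduce programming model described above is easiest to see with the classic word-count pattern. This single-process sketch (with hypothetical input lines) shows the map and reduce phases that a Hadoop cluster would run in parallel across data nodes.

```python
from collections import defaultdict
from itertools import chain

lines = ["big data", "big clusters"]  # hypothetical input records

def map_phase(line):
    # Map: emit a (word, 1) pair for every word; runs where the data lives.
    for word in line.split():
        yield word, 1

def reduce_phase(pairs):
    # Reduce: sum the counts per key (the shuffle/group step is implicit
    # in how this loop accumulates by key).
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

result = reduce_phase(chain.from_iterable(map_phase(l) for l in lines))
print(result)  # {'big': 2, 'data': 1, 'clusters': 1}
```

On a real cluster the map tasks run on the nodes holding each HDFS block and only the intermediate pairs move over the network, which is what "taking compute to the data" means.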

SLIDE 11

SQL on Hadoop (Batch-oriented, Interactive, and In-Memory)
Apache Hive, Apache Drill/Phoenix, Hortonworks Hive on Tez, Cloudera Impala, Pivotal HAWQ, Spark SQL

Volume (Data Size): Small · Medium · Large
SQL queries run against a metadata layer (HCatalog) in Hadoop. The queries are converted to MapReduce, Apache Tez, Impala MPP, or Spark jobs and run against different storage formats such as HDFS and HBase.

Variety (Data Type): Structured · Semi-Structured · Unstructured
SQL was designed for structured data, but Hadoop files may contain nested, variable, or schema-less data. A SQL-on-Hadoop engine must be able to translate all these forms of data into flat relational data and optimize the queries (Impala/Drill).

Velocity (Processing): Batch · Micro-Batch · RT Streaming
SQL-on-Hadoop engines require smart, advanced workload managers for multi-user workloads; they are designed for query processing, not stream processing.

Latency (Reporting): Scheduled · Prompted · Interactive
Good for ad-hoc reporting, iterative OLAP, and data mining in single-user and multi-user modes. For multi-user queries, Impala is on average 16.4x faster than Hive-on-Tez and 7.6x faster than Spark SQL with Tungsten, with an average response time of 12.8s compared to 1.6 minutes or more.

Good Fit · Not Optimal · Not Recommended · Core Competency
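The query-conversion step described above can be sketched by hand: here is roughly what a SQL-on-Hadoop engine does when it compiles "SELECT region, SUM(sales) FROM t GROUP BY region" into map and reduce stages. The table contents are hypothetical and the real engines add planners, optimizers, and distributed execution.

```python
from collections import defaultdict

table = [  # hypothetical rows of (region, sales)
    ("east", 100.0),
    ("west", 200.0),
    ("east", 50.0),
]

def map_stage(row):
    region, sales = row
    return region, sales            # emit (GROUP BY key, aggregated value)

def reduce_stage(pairs):
    groups = defaultdict(float)
    for key, value in pairs:
        groups[key] += value        # SUM() accumulates per group key
    return dict(groups)

result = reduce_stage(map_stage(r) for r in table)
print(result)  # {'east': 150.0, 'west': 200.0}
```

Engines like Impala skip MapReduce entirely and run this plan on long-lived MPP daemons, which is where the interactive-latency advantage the slide cites comes from.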

SLIDE 12

Distributed Search
Elasticsearch, Solr (based on Apache Lucene), Amazon CloudSearch

Volume (Data Size): Small · Medium · Large
Search engines have to deal with systems of millions of documents and are designed for index and search query processing at scale, with clustering and a distributed architecture.

Variety (Data Type): Structured · Semi-Structured · Unstructured
Sources include XML, CSV, RDBMS, Word, PDF, ActiveMQ, AWS SQS, DynamoDB (Amazon NoSQL), file systems, Git, JDBC, JMS, Kafka, LDAP, MongoDB, Neo4j, RabbitMQ, Redis, and Twitter.

Velocity (Processing): Batch · Micro-Batch · RT Streaming
Elasticsearch scales to very large clusters with near real-time search. Real-time web applications demand search results in near real time as users generate new content. There is some contention when handling concurrent search and index requests.

Latency (Reporting): Scheduled · Prompted · Interactive
Both use a key-value pair query language. Solr is more oriented toward text search, while Elasticsearch is often used for more advanced querying, filtering, and grouping. Good for interactive search queries but not for interactive analytical reporting.

Good Fit · Not Optimal · Not Recommended · Core Competency
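The data structure at the heart of these Lucene-based engines is the inverted index, which this toy sketch builds over two hypothetical documents. Real engines add tokenization, relevance scoring, and sharding of the index across the cluster.

```python
from collections import defaultdict

docs = {  # hypothetical document store: id -> text
    1: "distributed search at scale",
    2: "near real-time search",
}

# Inverted index: term -> set of document ids containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

def search(*terms):
    """AND query: a fast set intersection over each term's postings."""
    postings = [index[t] for t in terms]
    return set.intersection(*postings) if postings else set()

assert search("search") == {1, 2}
assert search("search", "scale") == {1}
```

Indexing pays the cost up front so that queries become cheap set operations, which is why these systems answer interactive search queries quickly even at large document counts.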

SLIDE 13

Message Streaming
Kafka, JMS, AMQP

Volume (Data Size): Small · Medium · Large
Kafka is an excellent low-latency messaging platform that brokers massive message streams for parallel ingestion into Hadoop.

Variety (Data Type): Structured · Semi-Structured · Unstructured
Handles data sources such as the Internet of Things, sensors, clickstreams, and transactional systems.

Velocity (Processing): Batch · Micro-Batch · RT Streaming
Real-time streaming with high throughput for both publishing and subscribing, and constant performance even with many terabytes of stored messages. Designed for streaming; batch size can be configured to broker micro-batches of messages.

Latency (Reporting): Scheduled · Prompted · Interactive
Stream topics need to be processed by additional technology, such as PDI, ESP, CEP, or query processing engines, before reporting.

Good Fit · Not Optimal · Not Recommended · Core Competency
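The brokering model described above can be sketched as a log-structured topic in plain Python, in the spirit of Kafka: producers append to a topic's log, and each consumer tracks its own offset so many consumers can read the same stream independently. Topic and message contents are hypothetical; a real broker adds partitioning, replication, and retention.

```python
from collections import defaultdict

topics = defaultdict(list)  # topic name -> append-only message log

def publish(topic, message):
    topics[topic].append(message)

def consume(topic, offset):
    """Return (new_messages, next_offset) for a consumer at `offset`."""
    log = topics[topic]
    return log[offset:], len(log)

publish("clicks", {"page": "/home"})
publish("clicks", {"page": "/pricing"})

batch, offset = consume("clicks", 0)
assert len(batch) == 2 and offset == 2
batch, offset = consume("clicks", offset)   # nothing new yet
assert batch == []
```

Because messages stay in the log rather than being deleted on delivery, a Hadoop ingestion job and a real-time dashboard can each consume the same topic at their own pace.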

SLIDE 14

Event Stream Processing (ESP)
Apache Storm

Volume (Data Size): Small · Medium · Large
Apache Storm is a distributed "event-at-a-time" stream processing system for processing large volumes in parallel with sub-second latency.

Variety (Data Type): Structured · Semi-Structured · Unstructured
Storm applications process one incoming event at a time as tuples of data; a tuple can contain objects of any type, from sources such as the Internet of Things, sensors, and transactional systems.

Velocity (Processing): Batch · Micro-Batch · RT Streaming
Storm is extremely fast, able to process over a million messages per second per node. It compromises on fault tolerance by offering "at least once" semantics in favor of speed.

Latency (Reporting): Scheduled · Prompted · Interactive
ESP provides the most recently processed data for all types of reporting. Example ESP use case: stock market tickers showing stock performance with a green up arrow or red down arrow in real time.

Good Fit · Not Optimal · Not Recommended · Core Competency
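The slide's ticker use case can be sketched event-at-a-time, Storm-style: each tick is handled the instant it arrives, with only minimal state kept between events. The ticker symbol and prices are hypothetical.

```python
last_price = {}  # minimal state: last seen price per symbol

def on_tick(symbol, price):
    """Process one event immediately and emit an up/down arrow."""
    prev = last_price.get(symbol)
    last_price[symbol] = price
    if prev is None:
        return symbol, "·"           # first tick: no direction yet
    return symbol, "↑" if price > prev else "↓"

stream = [("ACME", 10.0), ("ACME", 10.5), ("ACME", 10.2)]
arrows = [on_tick(s, p) for s, p in stream]
print(arrows)  # [('ACME', '·'), ('ACME', '↑'), ('ACME', '↓')]
```

Note there is no batching anywhere: latency is one function call per event, which is what makes sub-second dashboards feasible.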

SLIDE 15

Complex Event Processing (CEP)
Spark, Flink

Volume (Data Size): Small · Medium · Large
Spark and Flink are distributed "micro-batch" stream processing engines for processing large volumes of high-velocity data in parallel with a few seconds of latency.

Variety (Data Type): Structured · Semi-Structured · Unstructured
Complex event processing for the Internet of Things, sensors, and transactional systems. An aggregation-oriented CEP solution focuses on executing online algorithms in response to event data entering the system; a detection-oriented CEP solution focuses on detecting combinations of events, called event patterns or situations.

Velocity (Processing): Batch · Micro-Batch · RT Streaming
Micro-batch processing engines with a few seconds of latency: not as fast as Storm, but with better fault tolerance, guaranteeing "exactly once" semantics for stateful computations. Great for machine learning computations.

Latency (Reporting): Scheduled · Prompted · Interactive
CEP provides the most recently processed data for all types of reporting. Example CEP use case: a user sets up a stock market alert saying "let me know if GOOG stock goes up by 10% and stays up for 3 hours or more."

Good Fit · Not Optimal · Not Recommended · Core Competency
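The detection-oriented CEP example above ("up 10% and stays up") can be sketched as a pattern detector over micro-batches. For brevity the 3-hour window is shrunk to 3 consecutive micro-batches; the baseline, threshold, and prices are hypothetical.

```python
def detect_sustained_rise(prices, baseline, pct=0.10, window=3):
    """Fire when price holds at least `pct` above baseline for `window`
    consecutive micro-batches (the event pattern, not a single event)."""
    threshold = baseline * (1 + pct)
    streak = 0
    for p in prices:                 # one price per micro-batch
        streak = streak + 1 if p >= threshold else 0
        if streak >= window:
            return True
    return False

assert detect_sustained_rise([100, 111, 112, 113], baseline=100) is True
assert detect_sustained_rise([100, 111, 105, 112], baseline=100) is False
```

The second case shows why this is CEP rather than ESP: a single 10% spike is not enough, because the pattern is a *combination* of events over time.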

SLIDE 16

Big Data Ecosystem

  • Analytical Databases
  • SQL on Hadoop
  • Relational Database
  • NoSQL Database
  • Message Streaming
  • HDFS MapReduce
  • Distributed Search
  • Event Stream Processing (ESP)
  • Complex Event Processing (CEP)

SLIDE 17

Mapping A Solution

SLIDE 18

Matrix for Analytics Performance (MAP)

Technologies: Relational Database · Analytical Database · NoSQL Database · Hadoop File System (HDFS MR) · SQL on Hadoop · Distributed Search · Message Streaming · Event Stream Processing (ESP) · Complex Event Processing (CEP)

Attributes:

  • Volume (Data Size): Small · Medium · Large
  • Variety (Data Type): Structured · Semi-Structured · Unstructured
  • Velocity (Processing): Batch · Micro-Batch · RT Streaming
  • Latency (Reporting): Scheduled · Prompted · Interactive

Good Fit · Not Optimal · Not Recommended · Core Competency
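In code, the MAP idea might be sketched as a lookup table keyed by attribute, filtered by a workload's requirements. The ratings below are an illustrative, simplified reading of the preceding slides (one strongest-fit value per axis), not the deck's actual color-coded matrix.

```python
# Hypothetical, simplified strongest-fit per axis for four technologies.
MAP = {
    "Relational Database": {"volume": "small", "variety": "structured",
                            "velocity": "batch", "latency": "scheduled"},
    "Analytical Database": {"volume": "large", "variety": "structured",
                            "velocity": "batch", "latency": "interactive"},
    "NoSQL Database":      {"volume": "large", "variety": "semi-structured",
                            "velocity": "streaming", "latency": "scheduled"},
    "HDFS MapReduce":      {"volume": "large", "variety": "unstructured",
                            "velocity": "batch", "latency": "scheduled"},
}

def candidates(**requirements):
    """Return technologies whose strongest fit matches every requirement."""
    return [tech for tech, fit in MAP.items()
            if all(fit.get(k) == v for k, v in requirements.items())]

print(candidates(volume="large", velocity="batch"))
# ['Analytical Database', 'HDFS MapReduce']
```

A real mapping exercise would keep a graded rating per cell (Good Fit / Not Optimal / Not Recommended / Core Competency) rather than a single value, but the lookup shape is the same.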

SLIDE 19

PDI

[Architecture diagram] Pentaho Data Integration connects every stage: Big Data Sources and Traditional Data → Hadoop / Data Lake → Analytic Datasets → Data Warehouse and Data Marts → Line of Business Analytics.

Big Data Projects: Extranet Deployments · Embedded Analytics · On-Demand Data Mart · Self-Service Analytics · Centralized Analytics at Scale

SLIDE 20

A Single Flow

Data Prep · Data Engineering · Analytics

Ingestion → Processing → Blending → Data Delivery → Data Discovery / Analysis → Analysis & Dashboards

Administration · Security · Lifecycle Management · Data Provenance · Dynamic Data Pipeline Monitoring · Automation

SLIDE 21

Key Takeaways

  • Data architecture modernization involves many technologies
  • Understand the ecosystem of data technologies
  • Map an end-to-end solution
  • Apply Pentaho key capabilities
SLIDE 22