Big Data Technology Ecosystem
Mark Burnette, Director Sales Engineering, Pentaho / Hitachi Vantara
Agenda
- End-to-End Data Delivery Platform
- Ecosystem of Data Technologies
- Mapping an End-to-End Solution
- Case Studies
- Pentaho Key Capabilities
- Summary
- Q&A
End-to-End Data Delivery Platform
Ingest | Process | Report | Publish
- Data Agnostic
- Metadata Driven Ingestion
- Data Orchestration
- Native Hadoop Integration
- Scale Up & Scale Out
- Blend Unstructured Data
- Streamlined Data Refinery
- Data Virtualization
- Machine Learning
- Production Reporting
- Custom Dashboards
- Self-Service Dashboards
- Interactive Analysis
- Embedded Analytics
Delivering Insight
Ingest | Process | Report | Publish
- Consumers: Data Analysts, Data Scientists, Data Engineers
- Capabilities: Production Reporting, Custom Dashboards, Interactive Analysis, Self-Service Dashboards, Data Integration & Orchestration
Big Data Ecosystem
Nine technologies in a 3×3 grid:
- Relational Database
- Analytical Database
- NoSQL Database
- HDFS MapReduce
- SQL on Hadoop
- Distributed Search
- Message Streaming
- Event Stream Processing (ESP)
- Complex Event Processing (CEP)
Data Source Attributes
- Volume (Data Size): Small | Medium | Large
- Variety (Data Type): Structured | Semi-Structured | Unstructured
- Velocity (Processing): Batch | Micro-Batch | Real-Time Streaming
- Latency (Reporting): Scheduled | Prompted | Interactive
Each of the nine technologies is rated on these four attributes in the slides that follow.
Relational Database
MSFT SQL Server, Oracle, MySQL, PostgreSQL, IBM DB2
- Volume (Data Size): Operational databases for OLTP apps that require high transaction loads and user concurrency. Can "scale up" to larger data volumes but lack the ability to easily "scale out" for large data processing.
- Variety (Data Type): Structured schema of tables containing rows and columns of data, emphasizing integrity and consistency over speed and scale. Structured data is accessed with the SQL query language.
- Velocity (Processing): Rigid schemas with batch-oriented ingestion and SQL query processing are not designed for continuous streaming data.
- Latency (Reporting): Optimized for frequent small CRUD queries (create, read, update, delete), not for analytic or interactive query workloads on large data.
Rating scale: Good Fit | Not Optimal | Not Recommended | Core Competency
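The OLTP profile above (frequent small CRUD transactions against a structured schema) can be sketched with Python's standard-library sqlite3 module standing in for the server RDBMSs listed; the `orders` table and its columns are illustrative, not from the deck:

```python
import sqlite3

# In-memory database stands in for an OLTP server (illustrative schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT, qty INTEGER)")

# Create
conn.execute("INSERT INTO orders (item, qty) VALUES (?, ?)", ("widget", 3))
conn.commit()  # ACID: the insert is durable and consistent once committed

# Read
row = conn.execute("SELECT item, qty FROM orders WHERE id = 1").fetchone()
print(row)  # ('widget', 3)

# Update and Delete round out CRUD
conn.execute("UPDATE orders SET qty = 5 WHERE id = 1")
conn.execute("DELETE FROM orders WHERE id = 1")
conn.commit()
```

Each statement touches one row by primary key, which is exactly the workload shape these systems are optimized for, and why large analytic scans sit poorly on them.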
Analytical Database
Columnar, In-Memory, MPP, OLAP: Teradata, Oracle Exadata, IBM Netezza, EMC Greenplum, Vertica
- Volume (Data Size): Data warehouse/mart databases that support BI and advanced analytics workloads. MPP architecture gives the ability to "scale out" to large data volumes, at a financial cost.
- Variety (Data Type): Structured schema of tables containing rows and columns of data, offering improved speed and scalability over an RDBMS but still limited to structured data.
- Velocity (Processing): Rigid schemas with batch-oriented SQL queries are not designed for streaming applications.
- Latency (Reporting): All four types (Columnar, In-Memory, MPP, OLAP) are designed for improved query performance on analytic and interactive query workloads over large data.
NoSQL Database
MongoDB, HBase, Cassandra, MarkLogic, Couchbase
- Volume (Data Size): Good for web applications: less web-app code to write, debug, and maintain. Scale out via horizontal scaling with auto-sharding of data to support millions of web-app users. Compromise on consistency (ACID transactions) in favor of scale and uptime.
- Variety (Data Type): Hierarchical, key-value, or document design captures all types of data in a single location.
- Velocity (Processing): Schema-less design allows rapid or continuous ingest at scale. A good storage option for the high-throughput, low-latency requirements of streaming applications that need real-time views of data; seen as a key component of the Lambda architecture.
- Latency (Reporting): Low-level query languages, scarce skills, and lack of SQL support make NoSQL less appealing for reporting and analysis.
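A minimal sketch of the auto-sharding idea mentioned above, assuming a simple hash-mod routing scheme (real stores use more elaborate schemes such as consistent hashing or range sharding; all names here are hypothetical):

```python
import hashlib

# Hypothetical hash-based auto-sharding: route each document key to one
# of N shards. This is the mechanism NoSQL stores use to "scale out".
NUM_SHARDS = 4
shards = [{} for _ in range(NUM_SHARDS)]

def shard_for(key: str) -> int:
    # Stable hash so the same key always routes to the same shard.
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def put(key, document):
    shards[shard_for(key)][key] = document  # schema-less: any dict shape

def get(key):
    return shards[shard_for(key)].get(key)

put("user:42", {"name": "Ada", "carts": [{"sku": "A1"}]})  # nested document
print(get("user:42")["name"])  # Ada
```

Because routing depends only on the key, reads and writes for different keys spread across shards with no coordination, which is where the scale comes from and why cross-shard ACID transactions are hard.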
HDFS MapReduce
Cloudera, Hortonworks, MapR, Pivotal, Amazon EMR, Hitachi HSP, MSFT HDInsight
- Volume (Data Size): The Hadoop Distributed File System distributes and replicates file blocks horizontally across multiple commodity data nodes. MapReduce programming takes the compute to the data for batch processing of large data volumes.
- Variety (Data Type): The file system is schema-less, allowing easy storage of any file type in multiple Hadoop file formats.
- Velocity (Processing): HDFS and MapReduce are designed for distributing batch-processing workloads over large datasets, not for micro-batch or streaming use cases.
- Latency (Reporting): MapReduce on HDFS lacks SQL support, and report queries are slow, making it less appealing for reporting and analysis.
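The MapReduce model described above can be sketched in plain Python; this is the classic word-count shape (map, shuffle, reduce), not Hadoop's actual API:

```python
from collections import defaultdict
from itertools import chain

# Minimal word-count sketch of the MapReduce model: mappers emit
# (key, value) pairs, a shuffle groups pairs by key, reducers aggregate.
def mapper(line):
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    return key, sum(values)

lines = ["big data big compute", "data lake"]
pairs = chain.from_iterable(mapper(line) for line in lines)
counts = dict(reducer(k, v) for k, v in shuffle(pairs).items())
print(counts)  # {'big': 2, 'data': 2, 'compute': 1, 'lake': 1}
```

In a real cluster each function runs on many nodes in parallel, with HDFS supplying the input splits; "taking compute to the data" means the mappers run on the nodes that already hold the blocks.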
SQL on Hadoop
Batch-oriented, interactive, and in-memory: Apache Hive, Apache Drill/Phoenix, Hortonworks Hive on Tez, Cloudera Impala, Pivotal HAWQ, Spark SQL
- Volume (Data Size): SQL queries run against a metadata layer (HCatalog) in Hadoop. The queries are converted to MapReduce, Apache Tez, Impala MPP, or Spark jobs and run over different storage formats such as HDFS and HBase.
- Variety (Data Type): SQL was designed for structured data, while Hadoop files may contain nested, variable, or schema-less data. A SQL-on-Hadoop engine must translate all these forms of data into flat relational data and optimize the queries (Impala/Drill).
- Velocity (Processing): SQL-on-Hadoop engines require smart, advanced workload managers for multi-user workloads; they are designed for query processing, not stream processing.
- Latency (Reporting): Supports ad hoc reporting, iterative OLAP, and data mining in single-user and multi-user modes. For multi-user queries, Impala is on average 16.4x faster than Hive-on-Tez and 7.6x faster than Spark SQL with Tungsten, with an average response time of 12.8 s compared to 1.6 minutes or more.
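The "translate nested data into flat relational data" point can be illustrated with standard-library sqlite3 standing in for a SQL-on-Hadoop engine; the records, table, and column names are hypothetical:

```python
import sqlite3

# Nested, schema-less records like those commonly stored in Hadoop files.
records = [
    {"user": "a", "events": [{"type": "click"}, {"type": "view"}]},
    {"user": "b", "events": [{"type": "click"}]},
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, type TEXT)")

# Flatten: emit one relational row per nested event.
for rec in records:
    for ev in rec["events"]:
        conn.execute("INSERT INTO events VALUES (?, ?)", (rec["user"], ev["type"]))

# Ordinary SQL now works over the flattened data.
rows = conn.execute(
    "SELECT type, COUNT(*) FROM events GROUP BY type ORDER BY type"
).fetchall()
print(rows)  # [('click', 2), ('view', 1)]
```

Engines like Drill and Impala do this translation on the fly at query time, over files rather than an in-memory table, which is precisely the hard part they optimize.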
Distributed Search
Elasticsearch, Solr (based on Apache Lucene), Amazon CloudSearch
- Volume (Data Size): Search engines deal with large systems containing millions of documents and are designed for index and search query processing at scale using clustered, distributed architectures.
- Variety (Data Type): XML, CSV, RDBMS, Word, PDF, ActiveMQ, AWS SQS, DynamoDB (Amazon NoSQL), file system, Git, JDBC, JMS, Kafka, LDAP, MongoDB, Neo4j, RabbitMQ, Redis, and Twitter.
- Velocity (Processing): Elasticsearch scales to very large clusters with near-real-time search. Real-time web applications demand search results in near real time as users generate new content. There is some contention when handling concurrent search and index requests.
- Latency (Reporting): Both use a key-value-pair query language. Solr is oriented more toward text search, while Elasticsearch is often used for more advanced querying, filtering, and grouping. Good for interactive search queries but not for interactive analytical reporting.
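A toy sketch of the inverted index underlying Lucene-based engines such as Solr and Elasticsearch (the data structure, not their actual APIs; document texts are made up):

```python
from collections import defaultdict

# Inverted index: map each term to the set of document IDs containing it.
docs = {
    1: "distributed search at scale",
    2: "near real time search",
}

index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

def search(*terms):
    # AND query: documents containing every requested term.
    results = [index.get(t, set()) for t in terms]
    return set.intersection(*results) if results else set()

print(sorted(search("search")))           # [1, 2]
print(sorted(search("search", "scale")))  # [1]
```

Because lookups go term-first rather than document-first, query cost scales with the number of matching terms, not the corpus size, and the index shards naturally across a cluster by document.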
Message Streaming
Kafka, JMS, AMQP
- Volume (Data Size): Kafka is an excellent low-latency messaging platform that brokers massive message streams for parallel ingestion into Hadoop.
- Variety (Data Type): Data sources such as the Internet of Things, sensors, clickstreams, and transactional systems.
- Velocity (Processing): Real-time streaming with high throughput for both publishing and subscribing, with constant performance even with many terabytes of stored messages. Designed for streaming; batch size can be configured to broker micro-batches of messages.
- Latency (Reporting): Stream topics must be processed by additional technology (e.g., PDI, ESP, CEP, or a query processing engine) before reporting.
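The publish/subscribe and micro-batch brokering pattern above can be sketched with a plain in-process queue; this mimics the shape of the pattern only and is not Kafka's actual API (all names are illustrative):

```python
from queue import Queue

# Toy broker: producers publish to a topic queue; a consumer drains
# messages in configurable micro-batches (Kafka-like pattern, not Kafka).
topic = Queue()

def publish(message):
    topic.put(message)

def poll(batch_size):
    # Drain up to batch_size messages; return fewer if the topic runs dry.
    batch = []
    while len(batch) < batch_size and not topic.empty():
        batch.append(topic.get())
    return batch

for i in range(5):
    publish({"sensor": "s1", "reading": i})

print(len(poll(batch_size=3)))  # 3 (first micro-batch)
print(len(poll(batch_size=3)))  # 2 (the remainder)
```

Decoupling producers from consumers through the queue is what lets ingestion and downstream processing scale and fail independently.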
Event Stream Processing (ESP)
Apache Storm
- Volume (Data Size): Apache Storm is a distributed "event-at-a-time" stream processing system for processing large volumes in parallel with sub-second latency.
- Variety (Data Type): Storm applications process one incoming event at a time as tuples of data; a tuple may contain objects of any type, such as data from the Internet of Things, sensors, and transactional systems.
- Velocity (Processing): Storm is extremely fast, able to process over a million messages per second per node. It compromises on fault tolerance by offering "at-least-once" semantics in favor of speed.
- Latency (Reporting): ESP provides the most recently processed data for all types of reporting. Example ESP use case: stock market tickers showing stock performance with a green up arrow or red down arrow in real time.
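The "event-at-a-time" model can be sketched as a generator that updates state per tuple, in the spirit of the stock-ticker use case above (Storm-like semantics only, not Storm's API; names are illustrative):

```python
# Event-at-a-time sketch: each incoming tuple is processed immediately,
# and per-symbol state is updated one event at a time.
def ticker(events):
    last = {}
    for symbol, price in events:       # one (symbol, price) tuple at a time
        prev = last.get(symbol)
        if prev is not None:
            yield symbol, "up" if price > prev else "down"
        last[symbol] = price

stream = [("GOOG", 100.0), ("GOOG", 101.5), ("GOOG", 99.0)]
print(list(ticker(stream)))  # [('GOOG', 'up'), ('GOOG', 'down')]
```

Nothing waits for a batch boundary: each event yields its output as soon as it arrives, which is why this style achieves sub-second latency.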
Complex Event Processing (CEP)
Spark, Flink
- Volume (Data Size): Spark and Flink are distributed "micro-batch" stream processing engines for processing large volumes of high-velocity data in parallel with a few seconds of latency.
- Variety (Data Type): Complex event processing for the Internet of Things, sensors, and transactional systems. An aggregation-oriented CEP solution focuses on executing online algorithms in response to event data entering the system; a detection-oriented CEP solution focuses on detecting combinations of events, called event patterns or situations.
- Velocity (Processing): Micro-batch processing with a few seconds of latency. Not as fast as Storm, but with better fault tolerance, guaranteeing "exactly-once" semantics for stateful computations. Great for machine learning computations.
- Latency (Reporting): CEP provides the most recently processed data for all types of reporting. Example CEP use case: a user sets up a stock market alert saying "let me know if GOOG stock goes up by 10% and stays up for 3 hours or more."
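The alert in the example above is a detection-oriented pattern; here is a minimal sketch, assuming timestamps in hours and entirely illustrative names (no real CEP engine API is used):

```python
# Detection-oriented CEP sketch: fire when the price rises >= 10% above a
# baseline and stays there for a given duration.
def detect_sustained_rise(ticks, baseline, pct=0.10, hold_hours=3):
    threshold = baseline * (1 + pct)
    rise_start = None
    for t, price in ticks:  # (hour, price) pairs in time order
        if price >= threshold:
            if rise_start is None:
                rise_start = t      # streak begins
            if t - rise_start >= hold_hours:
                return True         # pattern matched: raise the alert
        else:
            rise_start = None       # streak broken, reset
    return False

ticks = [(0, 100), (1, 112), (2, 113), (3, 111), (4, 115)]
print(detect_sustained_rise(ticks, baseline=100))  # True
```

The key CEP ingredient is the state carried across events (`rise_start`): the pattern is defined over a combination of events in time, not any single event, which is what engines guarantee "exactly-once" semantics for.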
Mapping A Solution
Technologies: Relational Database, Analytical Database, NoSQL Database, Hadoop File System (HDFS MapReduce), SQL on Hadoop, Distributed Search, Message Streaming, Event Stream Processing (ESP), Complex Event Processing (CEP)
- Volume (Data Size): Small | Medium | Large
- Variety (Data Type): Structured | Semi-Structured | Unstructured
- Velocity (Processing): Batch | Micro-Batch | Real-Time Streaming
- Latency (Reporting): Scheduled | Prompted | Interactive
Matrix for Analytics Performance (MAP)
Ratings: Good Fit | Not Optimal | Not Recommended | Core Competency
PDI
Pentaho Data Integration moves data from big data sources and traditional data into the Hadoop data lake, analytic datasets, and the data warehouse and data marts that feed line-of-business analytics.
Deployment patterns: extranet deployments, embedded analytics, on-demand data marts, self-service analytics, centralized analytics at scale.
Big Data Projects
A Single Flow
- Stages: Data Prep, Data Engineering, Analytics
- Steps: Ingestion, Processing, Blending, Data Delivery, Data Discovery/Analysis, Analysis & Dashboards
- Platform services: Administration, Security, Lifecycle Management, Data Provenance, Dynamic Data Pipeline Monitoring, Automation
Key Takeaways
- Data architecture modernization involves many technologies
- Understanding the ecosystem of data technologies
- Mapping an end-to-end solution
- Pentaho key capabilities