Big Data Technology Ecosystem
Mark Burnette, Director Sales Engineering, Pentaho / Hitachi Vantara
Agenda
- End-to-End Data Delivery Platform
- Ecosystem of Data Technologies
- Mapping an End-to-End Solution
- Case Studies
- Pentaho Key Capabilities
- Summary
- Q&A
End-to-End Data Delivery Platform
Ingest | Process | Report | Publish
- Data Agnostic
- Metadata Driven Ingestion
- Data Orchestration
- Native Hadoop Integration
- Scale Up & Scale Out
- Blend Unstructured Data
- Streamlined Data Refinery
- Data Virtualization
- Machine Learning
- Production Reporting
- Custom Dashboards
- Self-Service Dashboards
- Interactive Analysis
- Embedded Analytics
Delivering Insight
Ingest | Process | Report | Publish
- Consumers: Data Analysts, Data Scientists, Data Engineers
- Capabilities: Production Reporting, Custom Dashboards, Interactive Analysis, Self-Service Dashboards, Data Integration & Orchestration
Big Data Ecosystem
Nine technologies in a 3×3 grid:
- Relational Database
- Analytical Database
- NoSQL Database
- HDFS MapReduce
- SQL on Hadoop
- Distributed Search
- Message Streaming
- Event Stream Processing (ESP)
- Complex Event Processing (CEP)
Data Source Attributes
- Volume (Data Size): Small | Medium | Large
- Variety (Data Type): Structured | Semi-Structured | Unstructured
- Velocity (Processing): Batch | Micro-Batch | Real-Time Streaming
- Latency (Reporting): Scheduled | Prompted | Interactive
Each of the nine technologies is rated on these four attributes in the slides that follow.
Relational Database
MSFT SQL Server, Oracle, MySQL, PostgreSQL, IBM DB2
- Volume (Data Size): Operational databases for OLTP apps that require high transaction loads and user concurrency. Can "scale up" to larger data volumes but lack the ability to easily "scale out" for large data processing.
- Variety (Data Type): Structured schema of tables containing rows and columns of data, emphasizing integrity and consistency over speed and scale. Structured data is accessed with the SQL query language.
- Velocity (Processing): Rigid schemas with batch-oriented ingestion and SQL query processing are not designed for continuous streaming data.
- Latency (Reporting): Optimized for frequent small CRUD queries (create, read, update, delete), not for analytic or interactive query workloads on large data.
Rating scale: Good Fit | Not Optimal | Not Recommended | Core Competency
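The OLTP profile above (frequent small CRUD transactions against a structured schema) can be sketched with Python's standard-library sqlite3 module standing in for the server RDBMSs listed; the `orders` table and its columns are illustrative, not from the deck:

```python
import sqlite3

# In-memory database stands in for an OLTP server (illustrative schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT, qty INTEGER)")

# Create
conn.execute("INSERT INTO orders (item, qty) VALUES (?, ?)", ("widget", 3))
conn.commit()  # ACID: the insert is durable and consistent once committed

# Read
row = conn.execute("SELECT item, qty FROM orders WHERE id = 1").fetchone()
print(row)  # ('widget', 3)

# Update and Delete round out CRUD
conn.execute("UPDATE orders SET qty = 5 WHERE id = 1")
conn.execute("DELETE FROM orders WHERE id = 1")
conn.commit()
```

Each statement touches one row by primary key, which is exactly the workload shape these systems are optimized for, and why large analytic scans sit poorly on them.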
Analytical Database
Columnar, In-Memory, MPP, OLAP: Teradata, Oracle Exadata, IBM Netezza, EMC Greenplum, Vertica
- Volume (Data Size): Data warehouse/mart databases that support BI and advanced analytics workloads. MPP architecture gives the ability to "scale out" to large data volumes, at a financial cost.
- Variety (Data Type): Structured schema of tables containing rows and columns of data, offering improved speed and scalability over an RDBMS but still limited to structured data.
- Velocity (Processing): Rigid schemas with batch-oriented SQL queries are not designed for streaming applications.
- Latency (Reporting): All four types (Columnar, In-Memory, MPP, OLAP) are designed for improved query performance on analytic and interactive query workloads over large data.
NoSQL Database
MongoDB, HBase, Cassandra, MarkLogic, Couchbase
- Volume (Data Size): Good for web applications: less web-app code to write, debug, and maintain. Scale out via horizontal scaling with auto-sharding of data to support millions of web-app users. Compromise on consistency (ACID transactions) in favor of scale and uptime.
- Variety (Data Type): Hierarchical, key-value, or document design captures all types of data in a single location.
- Velocity (Processing): Schema-less design allows rapid or continuous ingest at scale. A good storage option for the high-throughput, low-latency requirements of streaming applications that need real-time views of data; seen as a key component of the Lambda architecture.
- Latency (Reporting): Low-level query languages, scarce skills, and lack of SQL support make NoSQL less appealing for reporting and analysis.
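A minimal sketch of the auto-sharding idea mentioned above, assuming a simple hash-mod routing scheme (real stores use more elaborate schemes such as consistent hashing or range sharding; all names here are hypothetical):

```python
import hashlib

# Hypothetical hash-based auto-sharding: route each document key to one
# of N shards. This is the mechanism NoSQL stores use to "scale out".
NUM_SHARDS = 4
shards = [{} for _ in range(NUM_SHARDS)]

def shard_for(key: str) -> int:
    # Stable hash so the same key always routes to the same shard.
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def put(key, document):
    shards[shard_for(key)][key] = document  # schema-less: any dict shape

def get(key):
    return shards[shard_for(key)].get(key)

put("user:42", {"name": "Ada", "carts": [{"sku": "A1"}]})  # nested document
print(get("user:42")["name"])  # Ada
```

Because routing depends only on the key, reads and writes for different keys spread across shards with no coordination, which is where the scale comes from and why cross-shard ACID transactions are hard.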
HDFS MapReduce
Cloudera, Hortonworks, MapR, Pivotal, Amazon EMR, Hitachi HSP, MSFT HDInsight
- Volume (Data Size): The Hadoop Distributed File System distributes and replicates file blocks horizontally across multiple commodity data nodes. MapReduce programming takes the compute to the data for batch processing of large data volumes.
- Variety (Data Type): The file system is schema-less, allowing easy storage of any file type in multiple Hadoop file formats.
- Velocity (Processing): HDFS and MapReduce are designed for distributing batch-processing workloads over large datasets, not for micro-batch or streaming use cases.
- Latency (Reporting): MapReduce on HDFS lacks SQL support, and report queries are slow, making it less appealing for reporting and analysis.
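The MapReduce model described above can be sketched in plain Python; this is the classic word-count shape (map, shuffle, reduce), not Hadoop's actual API:

```python
from collections import defaultdict
from itertools import chain

# Minimal word-count sketch of the MapReduce model: mappers emit
# (key, value) pairs, a shuffle groups pairs by key, reducers aggregate.
def mapper(line):
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    return key, sum(values)

lines = ["big data big compute", "data lake"]
pairs = chain.from_iterable(mapper(line) for line in lines)
counts = dict(reducer(k, v) for k, v in shuffle(pairs).items())
print(counts)  # {'big': 2, 'data': 2, 'compute': 1, 'lake': 1}
```

In a real cluster each function runs on many nodes in parallel, with HDFS supplying the input splits; "taking compute to the data" means the mappers run on the nodes that already hold the blocks.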
SQL on Hadoop
Batch-oriented, interactive, and in-memory: Apache Hive, Apache Drill/Phoenix, Hortonworks Hive on Tez, Cloudera Impala, Pivotal HAWQ, Spark SQL
- Volume (Data Size): SQL queries run against a metadata layer (HCatalog) in Hadoop. The queries are converted to MapReduce, Apache Tez, Impala MPP, or Spark jobs and run over different storage formats such as HDFS and HBase.
- Variety (Data Type): SQL was designed for structured data, while Hadoop files may contain nested, variable, or schema-less data. A SQL-on-Hadoop engine must translate all these forms of data into flat relational data and optimize the queries (Impala/Drill).
- Velocity (Processing): SQL-on-Hadoop engines require smart, advanced workload managers for multi-user workloads; they are designed for query processing, not stream processing.
- Latency (Reporting): Supports ad hoc reporting, iterative OLAP, and data mining in single-user and multi-user modes. For multi-user queries, Impala is on average 16.4x faster than Hive-on-Tez and 7.6x faster than Spark SQL with Tungsten, with an average response time of 12.8 s compared to 1.6 minutes or more.
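The "translate nested data into flat relational data" point can be illustrated with standard-library sqlite3 standing in for a SQL-on-Hadoop engine; the records, table, and column names are hypothetical:

```python
import sqlite3

# Nested, schema-less records like those commonly stored in Hadoop files.
records = [
    {"user": "a", "events": [{"type": "click"}, {"type": "view"}]},
    {"user": "b", "events": [{"type": "click"}]},
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, type TEXT)")

# Flatten: emit one relational row per nested event.
for rec in records:
    for ev in rec["events"]:
        conn.execute("INSERT INTO events VALUES (?, ?)", (rec["user"], ev["type"]))

# Ordinary SQL now works over the flattened data.
rows = conn.execute(
    "SELECT type, COUNT(*) FROM events GROUP BY type ORDER BY type"
).fetchall()
print(rows)  # [('click', 2), ('view', 1)]
```

Engines like Drill and Impala do this translation on the fly at query time, over files rather than an in-memory table, which is precisely the hard part they optimize.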
Distributed Search
Elasticsearch, Solr (based on Apache Lucene), Amazon CloudSearch
- Volume (Data Size): Search engines deal with large systems containing millions of documents and are designed for index and search query processing at scale using clustered, distributed architectures.
- Variety (Data Type): XML, CSV, RDBMS, Word, PDF, ActiveMQ, AWS SQS, DynamoDB (Amazon NoSQL), file system, Git, JDBC, JMS, Kafka, LDAP, MongoDB, Neo4j, RabbitMQ, Redis, and Twitter.
- Velocity (Processing): Elasticsearch scales to very large clusters with near-real-time search. Real-time web applications demand search results in near real time as users generate new content. There is some contention when handling concurrent search and index requests.
- Latency (Reporting): Both use a key-value-pair query language. Solr is oriented more toward text search, while Elasticsearch is often used for more advanced querying, filtering, and grouping. Good for interactive search queries but not for interactive analytical reporting.
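A toy sketch of the inverted index underlying Lucene-based engines such as Solr and Elasticsearch (the data structure, not their actual APIs; document texts are made up):

```python
from collections import defaultdict

# Inverted index: map each term to the set of document IDs containing it.
docs = {
    1: "distributed search at scale",
    2: "near real time search",
}

index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

def search(*terms):
    # AND query: documents containing every requested term.
    results = [index.get(t, set()) for t in terms]
    return set.intersection(*results) if results else set()

print(sorted(search("search")))           # [1, 2]
print(sorted(search("search", "scale")))  # [1]
```

Because lookups go term-first rather than document-first, query cost scales with the number of matching terms, not the corpus size, and the index shards naturally across a cluster by document.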
Message Streaming
Kafka, JMS, AMQP
- Volume (Data Size): Kafka is an excellent low-latency messaging platform that brokers massive message streams for parallel ingestion into Hadoop.
- Variety (Data Type): Data sources such as the Internet of Things, sensors, clickstreams, and transactional systems.
- Velocity (Processing): Real-time streaming with high throughput for both publishing and subscribing, with constant performance even with many terabytes of stored messages. Designed for streaming; batch size can be configured to broker micro-batches of messages.
- Latency (Reporting): Stream topics must be processed by additional technology (e.g., PDI, ESP, CEP, or a query processing engine) before reporting.
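The publish/subscribe and micro-batch brokering pattern above can be sketched with a plain in-process queue; this mimics the shape of the pattern only and is not Kafka's actual API (all names are illustrative):

```python
from queue import Queue

# Toy broker: producers publish to a topic queue; a consumer drains
# messages in configurable micro-batches (Kafka-like pattern, not Kafka).
topic = Queue()

def publish(message):
    topic.put(message)

def poll(batch_size):
    # Drain up to batch_size messages; return fewer if the topic runs dry.
    batch = []
    while len(batch) < batch_size and not topic.empty():
        batch.append(topic.get())
    return batch

for i in range(5):
    publish({"sensor": "s1", "reading": i})

print(len(poll(batch_size=3)))  # 3 (first micro-batch)
print(len(poll(batch_size=3)))  # 2 (the remainder)
```

Decoupling producers from consumers through the queue is what lets ingestion and downstream processing scale and fail independently.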
Event Stream Processing (ESP)
Apache Storm
- Volume (Data Size): Apache Storm is a distributed "event-at-a-time" stream processing system for processing large volumes in parallel with sub-second latency.
- Variety (Data Type): Storm applications process one incoming event at a time as tuples of data; a tuple may contain objects of any type, such as data from the Internet of Things, sensors, and transactional systems.
- Velocity (Processing): Storm is extremely fast, able to process over a million messages per second per node. It compromises on fault tolerance by offering "at-least-once" semantics in favor of speed.
- Latency (Reporting): ESP provides the most recently processed data for all types of reporting. Example ESP use case: stock market tickers showing stock performance with a green up arrow or red down arrow in real time.
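The "event-at-a-time" model can be sketched as a generator that updates state per tuple, in the spirit of the stock-ticker use case above (Storm-like semantics only, not Storm's API; names are illustrative):

```python
# Event-at-a-time sketch: each incoming tuple is processed immediately,
# and per-symbol state is updated one event at a time.
def ticker(events):
    last = {}
    for symbol, price in events:       # one (symbol, price) tuple at a time
        prev = last.get(symbol)
        if prev is not None:
            yield symbol, "up" if price > prev else "down"
        last[symbol] = price

stream = [("GOOG", 100.0), ("GOOG", 101.5), ("GOOG", 99.0)]
print(list(ticker(stream)))  # [('GOOG', 'up'), ('GOOG', 'down')]
```

Nothing waits for a batch boundary: each event yields its output as soon as it arrives, which is why this style achieves sub-second latency.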
Complex Event Processing (CEP)
Spark, Flink
- Volume (Data Size): Spark and Flink are distributed "micro-batch" stream processing engines for processing large volumes of high-velocity data in parallel with a few seconds of latency.
- Variety (Data Type): Complex event processing for the Internet of Things, sensors, and transactional systems. An aggregation-oriented CEP solution focuses on executing online algorithms in response to event data entering the system; a detection-oriented CEP solution focuses on detecting combinations of events, called event patterns or situations.
- Velocity (Processing): Micro-batch processing with a few seconds of latency. Not as fast as Storm, but with better fault tolerance, guaranteeing "exactly-once" semantics for stateful computations. Great for machine learning computations.
- Latency (Reporting): CEP provides the most recently processed data for all types of reporting. Example CEP use case: a user sets up a stock market alert saying "let me know if GOOG stock goes up by 10% and stays up for 3 hours or more."
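The alert in the example above is a detection-oriented pattern; here is a minimal sketch, assuming timestamps in hours and entirely illustrative names (no real CEP engine API is used):

```python
# Detection-oriented CEP sketch: fire when the price rises >= 10% above a
# baseline and stays there for a given duration.
def detect_sustained_rise(ticks, baseline, pct=0.10, hold_hours=3):
    threshold = baseline * (1 + pct)
    rise_start = None
    for t, price in ticks:  # (hour, price) pairs in time order
        if price >= threshold:
            if rise_start is None:
                rise_start = t      # streak begins
            if t - rise_start >= hold_hours:
                return True         # pattern matched: raise the alert
        else:
            rise_start = None       # streak broken, reset
    return False

ticks = [(0, 100), (1, 112), (2, 113), (3, 111), (4, 115)]
print(detect_sustained_rise(ticks, baseline=100))  # True
```

The key CEP ingredient is the state carried across events (`rise_start`): the pattern is defined over a combination of events in time, not any single event, which is what engines guarantee "exactly-once" semantics for.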
Mapping A Solution
Technologies: Relational Database, Analytical Database, NoSQL Database, Hadoop File System (HDFS MapReduce), SQL on Hadoop, Distributed Search, Message Streaming, Event Stream Processing (ESP), Complex Event Processing (CEP)
- Volume (Data Size): Small | Medium | Large
- Variety (Data Type): Structured | Semi-Structured | Unstructured
- Velocity (Processing): Batch | Micro-Batch | Real-Time Streaming
- Latency (Reporting): Scheduled | Prompted | Interactive
Matrix for Analytics Performance (MAP)
Ratings: Good Fit | Not Optimal | Not Recommended | Core Competency
PDI
Pentaho Data Integration moves data from big data sources and traditional data into the Hadoop data lake, analytic datasets, and the data warehouse and data marts that feed line-of-business analytics.
Deployment patterns: extranet deployments, embedded analytics, on-demand data marts, self-service analytics, centralized analytics at scale.
Big Data Projects
A Single Flow
- Stages: Data Prep, Data Engineering, Analytics
- Steps: Ingestion, Processing, Blending, Data Delivery, Data Discovery/Analysis, Analysis & Dashboards
- Platform services: Administration, Security, Lifecycle Management, Data Provenance, Dynamic Data Pipeline Monitoring, Automation
Key Takeaways
- Data architecture modernization involves many technologies
- Understanding the ecosystem of data technologies
- Mapping an end-to-end solution
- Pentaho key capabilities