Big Data: Where It Started, Where It Is, Where We Are Going


Nov 07, 2022


SLIDE 1

Big Data: Where It Started, Where It Is, Where We Are Going

Oliver Nielsen Pentaho Director – Services Solutions, Hitachi Vantara

SLIDE 2

Agenda

Big Data / Hadoop History

Throw out all pre-conceived ideas and concepts
The Dawn of Big Data
The Pivot to think bigger, broader, and deliver outcomes!
The landscape of tomorrow

SLIDE 3

The Space Shuttle and the Chariot

Arguably the world's most advanced transportation system
The solid rocket boosters were built in Utah by Thiokol. The engineers wanted them to be bigger! But…
Train tracks -> testing facility -> 4 feet 8.5 inches wide
The tracks were built by engineers from England who had built tramways, so they used the same gauge. That gauge came from the jigs that were used to build wagons.

SLIDE 4

The Space Shuttle and the Chariot

That gauge came from the jigs that were used to build wagons.
The wagon wheels were made to be a standard size so that on long-distance trips they could use the same ruts in the roads.
Those wagons were based on the standard axle sizes from Roman chariots.
Roman chariots were built to accommodate 2 horses pulling that chariot!
So, the space shuttle rocket boosters were built not to the engineers' specifications but to fit railroad tracks whose gauge is based on the width of two horses' behinds!

This story is actually UNTRUE. But… the moral of the story is still the same.

SLIDE 5

The Space Shuttle and the Chariot

Sometimes you must throw out everything you know and start with a blank canvas

SLIDE 6

The Dawn of Hadoop

2003 – Google File System paper (Ghemawat, Gobioff, and Leung at Google)

– Write-once file system
– Break everything into chunks (64 MB at the time)
– Spread chunks across different servers (data nodes)
– Only made to benefit large file sizes!
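The chunking idea above can be sketched in a few lines. This is a minimal illustration, not how GFS or HDFS is actually implemented: the function name `plan_chunks`, the node names, and the round-robin placement are all assumptions made for the example, and replication is left out entirely.

```python
from itertools import cycle

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the chunk size quoted on the slide


def plan_chunks(file_size, servers, chunk_size=CHUNK_SIZE):
    """Return a list of (chunk_index, offset, length, server) tuples.

    Round-robin placement is an illustrative assumption; a real system
    also replicates chunks and balances load across servers.
    """
    placements = []
    server_cycle = cycle(servers)
    offset = 0
    index = 0
    while offset < file_size:
        length = min(chunk_size, file_size - offset)
        placements.append((index, offset, length, next(server_cycle)))
        offset += length
        index += 1
    return placements


if __name__ == "__main__":
    # A hypothetical 200 MB file spread over three hypothetical data nodes.
    for chunk in plan_chunks(200 * 1024 * 1024, ["node-a", "node-b", "node-c"]):
        print(chunk)
```

A tiny file still produces one chunk entry of its own, which hints at why the design favors large files: the per-chunk bookkeeping is the same no matter how little data the chunk holds.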

2004 – MapReduce

– Simplified data processing across a cluster of servers
– Parallel, distributed algorithms on data
– Map – Filtering, sorting, and business rules into key, value pairs
– Reduce – Aggregating data by key
– Uses the Split-Apply-Combine strategy for analysis of data

2005 – Doug Cutting and Mike Cafarella created the first package and named it Hadoop
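The map and reduce roles described above can be sketched as a single-process word count. This is an illustration of the programming model only, assuming plain Python functions with hypothetical names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`); it does not use Hadoop, and a real job would run the map and reduce calls on different servers.

```python
from collections import defaultdict


def map_phase(record):
    """Map: turn one input record into (key, value) pairs."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: aggregate all values that share a key."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle, and reduce in sequence inside one process."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(run_job(["big data", "Big Hadoop"]))
```

Because each `map_phase` call sees one record and each `reduce_phase` call sees one key, both phases can be spread across a cluster without the functions themselves changing.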

SLIDE 7

The Dawn of Hadoop

2006 – Hadoop 0.1.0 released

– Hadoop sorts 1.8 TB on 188 nodes in 47.9 hours
– Yahoo deploys a 300-server Hadoop cluster in May
– Yahoo deploys a 600-server Hadoop cluster in October
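To put the 2006 sort figure in perspective, the throughput it implies can be worked out directly. The sketch below assumes decimal terabytes (1 TB = 10^12 bytes) and an even spread of work across nodes; both are assumptions, and the helper name `throughput` is made up for the example. Under those assumptions the cluster-wide rate comes to roughly 10 MB/s.

```python
TOTAL_BYTES = 1.8e12            # 1.8 TB, assuming decimal terabytes
NODES = 188                     # node count from the slide
ELAPSED_SECONDS = 47.9 * 3600   # 47.9 hours from the slide


def throughput(total_bytes, seconds, nodes):
    """Return (cluster bytes/s, per-node bytes/s), assuming an even spread."""
    cluster = total_bytes / seconds
    return cluster, cluster / nodes


if __name__ == "__main__":
    cluster_bps, node_bps = throughput(TOTAL_BYTES, ELAPSED_SECONDS, NODES)
    print(f"cluster:  {cluster_bps / 1e6:.1f} MB/s")
    print(f"per node: {node_bps / 1e3:.1f} KB/s")
```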

2007 – First Adoptions

– Yahoo has two 1,000-node clusters by April
– By June, 3 companies are “Powered by Hadoop”
– HBase introduced in June
– Pig, built by Yahoo, introduced in October

2008 – Growth

– 20 companies now “Powered by Hadoop”
– Limited to MapReduce in Java and Pig scripting (beta). Java developers cheer!

SLIDE 8

The Middle Ages of Hadoop and Pentaho

SLIDE 9

Light Bulb Moments, Science Projects, Frustrations

2012 – Big Data Is Hard!

– YARN takes over cluster resource management from MapReduce
– Vendors like Pentaho find traction in removing pain points!
– SQL on Hadoop (Hive, Impala, others)
– It's all SLOW!
– Hadoop Summits gain popularity – 500 people attend
– 8 different file systems now! – HDFS, GlusterFS, Quantcast, Ceph, etc.

2014

– Focus on SQL performance – Storm, Spark, Tez
– Hadoop Summit – 3,200 attendees in San Jose
– Hadoop in the Cloud – AWS, Azure, Google Cloud / BigQuery
– Continued “all in with Big Data” outlook by Pentaho!
– Data Science – R, Python, Scala

SLIDE 10

So Many Choices, So Little Time

Big Data Is Still Hard!

– File formats, compression algorithms, data ingest, data output for analytics!
– All these things have to be considered!

FOCUS ON OUTCOMES!!!

– Do not waste time on science projects
– Find something that meets the 3 V’s

Volume
Velocity
Variety
The 4th “V” – Vision – You must have a forward-looking vision and an outcome you want to achieve! Without that you have no business working with Big Data solutions right now.

SLIDE 11

Hadoop and Hitachi Vantara – What’s Next?

SLIDE 12

The Future Is Bright

Pentaho Adaptive Execution Layer

– Remove logic from the execution engine
– Start with Spark! No Scala code, no Python code.
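The idea of keeping logic separate from the execution engine can be sketched as follows. This is not Pentaho's AEL API: the `Step`, `EagerEngine`, and `StreamingEngine` names are hypothetical, and the two toy engines simply stand in for interchangeable back ends such as a local runner and Spark.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Step:
    """One engine-neutral pipeline step (hypothetical structure)."""
    kind: str      # "filter" or "map"
    fn: Callable


class EagerEngine:
    """Toy engine: materializes a full list after every step."""

    def run(self, steps, rows):
        data = list(rows)
        for step in steps:
            if step.kind == "filter":
                data = [row for row in data if step.fn(row)]
            elif step.kind == "map":
                data = [step.fn(row) for row in data]
            else:
                raise ValueError(f"unknown step kind: {step.kind}")
        return data


class StreamingEngine:
    """Toy engine: chains lazy iterators and pulls rows through once."""

    def run(self, steps, rows):
        stream = iter(rows)
        for step in steps:
            if step.kind == "filter":
                stream = filter(step.fn, stream)
            elif step.kind == "map":
                stream = map(step.fn, stream)
            else:
                raise ValueError(f"unknown step kind: {step.kind}")
        return list(stream)


if __name__ == "__main__":
    pipeline = [Step("filter", lambda r: r > 1), Step("map", lambda r: r * 10)]
    for engine in (EagerEngine(), StreamingEngine()):
        print(type(engine).__name__, engine.run(pipeline, [1, 2, 3]))
```

The pipeline definition never mentions an engine, so swapping the engine requires no change to the logic; that is the property the slide describes as future-proofing.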

Future-proof your investment with AEL

– What’s coming next?
– Flink?

Formerly Stratosphere
A processing framework for distributed, high-performing, always-available, and accurate data streaming applications

– Apex?

Apex is a Hadoop YARN native platform that unifies stream and batch processing. It processes big data in-motion in a way that is highly scalable, highly performant, fault tolerant, stateful, secure, distributed, and easily operable.

Has a high-level API that may be able to be leveraged by Pentaho/PDI

SLIDE 13

What’s Next?

Calcite

– Calcite is a framework for writing data management systems. It converts queries, represented in relational algebra, into an efficient executable form using pluggable query transformation rules. It includes a SQL parser and a JDBC driver. Calcite does not store data or have a preferred execution engine. Data formats, execution algorithms, planning rules, operator types, metadata, and cost model are added at runtime as plugins.
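The notion of pluggable transformation rules over relational algebra can be sketched with a toy planner. This is not Calcite's API (Calcite is a Java library with its own rule and planner classes); the `Scan` and `Filter` nodes, the two rules, and the `optimize` function below are all hypothetical and exist only to show rules being supplied from outside the planner.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple, Union


@dataclass(frozen=True)
class Scan:
    """Leaf node: read every row of a table."""
    table: str


@dataclass(frozen=True)
class Filter:
    """Keep rows of the child that satisfy all predicates."""
    predicates: Tuple[str, ...]
    child: "Node"


Node = Union[Scan, Filter]
Rule = Callable[[Node], Optional[Node]]


def merge_filters(node):
    """Rule: Filter(Filter(x)) becomes one Filter with both predicate lists."""
    if isinstance(node, Filter) and isinstance(node.child, Filter):
        return Filter(node.child.predicates + node.predicates, node.child.child)
    return None


def drop_empty_filter(node):
    """Rule: a Filter with no predicates is replaced by its child."""
    if isinstance(node, Filter) and not node.predicates:
        return node.child
    return None


def optimize(node, rules):
    """Rewrite bottom-up, re-running the rules until none of them fires."""
    if isinstance(node, Filter):
        node = Filter(node.predicates, optimize(node.child, rules))
    for rule in rules:
        rewritten = rule(node)
        if rewritten is not None:
            return optimize(rewritten, rules)
    return node


if __name__ == "__main__":
    plan = Filter(("b > 2",), Filter(("a = 1",), Scan("t")))
    print(optimize(plan, [merge_filters, drop_empty_filter]))
```

The planner knows nothing about any particular rule; it only receives a list of them, which mirrors the slide's point that planning rules are added as plugins.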

Beam

– A simple, flexible, and powerful system for distributed data processing at any scale. Beam provides a unified programming model, a software development kit to define and construct data processing pipelines, and runners to execute Beam pipelines in several runtime engines, like Apache Spark, Apache Flink, or Google Cloud Dataflow. Many of the proposals from Beam have been integrated into Spark 2.0.

SLIDE 14

New Platforms

Current Trends are leaning towards Cloud-based Hadoop Deployments

– Easier to scale
– Easier to manage
– Easier to tune
– Specialized distributions for different workloads (analytic queries, streaming, IoT)

Who Do We Work With Already?

– Google Cloud Platform
– Azure
– AWS

Under consideration by Hitachi Vantara

– Cloudera Altus
– Snowflake

SLIDE 15

Hitachi Vantara Will Lead the Way!

As new technologies and Apache projects come through the ecosystem, Hitachi Vantara will evaluate which technologies make sense to function as a new Adaptive Execution Engine, or as a plug-in, or integrate with an API.

2061: IO/Europa/Ganymede

SLIDE 16