Big Data: Where It Started, Where It Is, Where We Are Going


SLIDE 1

Big Data: Where It Started, Where It Is, Where We Are Going

Oliver Nielsen Pentaho Director – Services Solutions, Hitachi Vantara

SLIDE 2

Agenda

Big Data / Hadoop History

  • Throw out all pre-conceived ideas and concepts
  • The Dawn of Big Data
  • The Pivot to think bigger, broader, and deliver outcomes!
  • The landscape of tomorrow
SLIDE 3

The Space Shuttle and the Chariot

  • Arguably the world's most advanced transportation system
  • The rocket boosters were built in Utah by Thiokol. The engineers wanted them to be bigger! But…
  • Train tracks -> testing facility -> 4 feet 8.5 inches wide
  • The tracks were built by engineers from England who had built tramways, so they used the same gauge.
  • That gauge came from the jigs that were used to build wagons.
SLIDE 4

The Space Shuttle and the Chariot

  • That gauge came from the jigs that were used to build wagons.
  • The wagon wheels were made to be a standard size so that on long-distance trips they could use the same ruts in the roads.
  • Those wagons were based on the standard axle sizes from Roman chariots
  • Roman chariots were built to accommodate 2 horses pulling that chariot!
  • So, the space shuttle rocket boosters were not made to engineering specifications due to railroad tracks that are based on the width of two horses' behinds!

  • This story is actually UNTRUE. But… the moral of the story is still the same.
SLIDE 5

The Space Shuttle and the Chariot

Sometimes you must throw out everything you know and start with a blank canvas

SLIDE 6

The Dawn of Hadoop

  • 2003 – Google File System paper published by Google (Ghemawat, Gobioff, and Leung)

– Write-once file system
– Break everything into chunks (64 MB at the time)
– Spread chunks across different servers (data nodes)
– Only made to benefit large file sizes!

  • 2004 – MapReduce

– Simplified data processing across a cluster of servers
– Parallel, distributed algorithms on data
– Map – filtering, sorting, and business rules applied to produce key-value pairs
– Reduce – aggregating data by key
– Uses the split-apply-combine strategy for analysis of data

  • 2005 – Doug Cutting and Mike Cafarella created the first package and named it Hadoop
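The chunking and Map/Reduce ideas above can be sketched in a few lines of plain Python. This is a minimal single-process illustration, not Hadoop or Google code: the function names (`split_into_chunks`, `map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the tiny chunk size are assumptions made for the example, and the "chunks" here are small lists of words rather than 64 MB blocks on separate servers.

```python
from collections import defaultdict


def split_into_chunks(items, chunk_size):
    """Break the input into fixed-size chunks (the file-system idea, in miniature)."""
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


def map_phase(chunk):
    """Map: emit a (key, value) pair for every word in one chunk."""
    return [(word.lower(), 1) for word in chunk]


def shuffle(mapped_chunks):
    """Group all emitted values by key, as happens between map and reduce."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: aggregate the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(text, chunk_size=4):
    chunks = split_into_chunks(text.split(), chunk_size)
    # On a cluster each chunk would be mapped on a different server;
    # here the chunks are simply processed one after another.
    mapped = [map_phase(chunk) for chunk in chunks]
    return reduce_phase(shuffle(mapped))


if __name__ == "__main__":
    print(word_count("big data is hard big data is still hard"))
```

Because each chunk is mapped independently, the map step is the part that parallelizes; only the grouping by key needs to see output from every chunk.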
SLIDE 7

The Dawn of Hadoop

  • 2006 – Hadoop 0.1.0 released

– Hadoop sorts 1.8 TB on 188 nodes in 47.9 hours
– Yahoo deploys a 300-server Hadoop cluster in May
– Yahoo deploys a 600-server Hadoop cluster in October

  • 2007 – First Adoptions

– Yahoo has two 1,000-node clusters by April
– By June, 3 companies are “Powered by Hadoop”
– HBase introduced in June
– Pig, built by Yahoo, introduced in October

  • 2008 – Growth

– 20 companies now “Powered by Hadoop”
– Limited to MapReduce in Java and Pig scripting (beta). Java developers cheer!

SLIDE 8

The Middle Ages of Hadoop and Pentaho

SLIDE 9

Light Bulb Moments, Science Projects, Frustrations

  • 2012 – Big Data Is Hard!

– YARN replaces MapReduce’s built-in resource management
– Vendors like Pentaho find traction in removing pain points!
– SQL on Hadoop (Hive, Impala, others)
– It’s all SLOW!
– Hadoop Summits gain popularity – 500 people attend
– 8 different file systems now! – HDFS, GlusterFS, Quantcast, Ceph, etc.

  • 2014 –

– Focus on SQL performance – Storm, Spark, Tez
– Hadoop Summit – 3,200 attendees in San Jose
– Hadoop in the cloud – AWS, Azure, Google Cloud / BigQuery
– Continued “all in with Big Data” outlook by Pentaho!
– Data science – R, Python, Scala

SLIDE 10

So Many Choices, So Little Time

  • Big Data Is Still Hard!

– File formats, compression algorithms, data ingest, data output for analytics!
– All these things have to be considered!

  • FOCUS ON OUTCOMES!!!

– Do not waste time on science projects
– Find something that meets the 3 V’s:

  • Volume
  • Velocity
  • Variety
  • The 4th “V” – Vision: you must have a forward-looking vision and an outcome you want to achieve! Without that, you have no business working with Big Data solutions right now.

SLIDE 11

Hadoop and Hitachi Vantara – What’s Next?

SLIDE 12

The Future Is Bright

  • Pentaho Adaptive Execution Layer (AEL)

– Separate transformation logic from the execution engine
– Start with Spark! No Scala code, no Python code.

  • Future-proof your investment with AEL

– What’s coming next?
– Flink?

  • Formerly Stratosphere
  • Processing framework for distributed, high-performing, always-available, and accurate data streaming applications

– Apex?

  • Apex is a Hadoop YARN native platform that unifies stream and batch processing. It processes big data in motion in a way that is highly scalable, highly performant, fault tolerant, stateful, secure, distributed, and easily operable.
  • Has a high-level API that may be able to be leveraged by Pentaho/PDI
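The idea behind an adaptive execution layer, keeping the pipeline logic apart from the engine that runs it, can be illustrated with a small Python sketch. This is not Pentaho's AEL API: `PIPELINE`, `LocalEngine`, and `ThreadedEngine` are hypothetical names invented for the example, and a thread pool merely stands in for a real engine such as Spark.

```python
from concurrent.futures import ThreadPoolExecutor

# Engine-independent pipeline definition: an ordered list of (step kind, function).
# Nothing in it says where or how the steps will run.
PIPELINE = [
    ("filter", lambda row: row["amount"] > 0),
    ("map", lambda row: {**row, "amount_cents": round(row["amount"] * 100)}),
]


class LocalEngine:
    """Hypothetical engine: runs every step sequentially in this process."""

    def run(self, pipeline, rows):
        rows = list(rows)
        for kind, fn in pipeline:
            if kind == "filter":
                rows = [row for row in rows if fn(row)]
            elif kind == "map":
                rows = [fn(row) for row in rows]
            else:
                raise ValueError(f"unknown step kind: {kind}")
        return rows


class ThreadedEngine:
    """Hypothetical engine: runs each step across a pool of worker threads."""

    def __init__(self, workers=4):
        self.workers = workers

    def run(self, pipeline, rows):
        rows = list(rows)
        with ThreadPoolExecutor(max_workers=self.workers) as pool:
            for kind, fn in pipeline:
                if kind not in ("filter", "map"):
                    raise ValueError(f"unknown step kind: {kind}")
                # pool.map returns results in input order.
                results = list(pool.map(fn, rows))
                if kind == "filter":
                    rows = [row for row, keep in zip(rows, results) if keep]
                else:
                    rows = results
        return rows


if __name__ == "__main__":
    data = [{"amount": 12.5}, {"amount": -3.0}, {"amount": 7.25}]
    for engine in (LocalEngine(), ThreadedEngine()):
        print(type(engine).__name__, engine.run(PIPELINE, data))
```

Swapping the engine changes how the work is executed but not the pipeline definition, which is the property the "future-proof" bullet relies on.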
SLIDE 13

What’s Next?

  • Calcite

– Calcite is a framework for writing data management systems. It converts queries, represented in relational algebra, into an efficient executable form using pluggable query transformation rules. It also provides a SQL parser and a JDBC driver. Calcite does not store data or have a preferred execution engine. Data formats, execution algorithms, planning rules, operator types, metadata, and cost model are added at runtime as plugins.

  • Beam

– A simple, flexible, and powerful system for distributed data processing at any scale. Beam provides a unified programming model, a software development kit to define and construct data processing pipelines, and runners to execute Beam pipelines in several runtime engines, like Apache Spark, Apache Flink, or Google Cloud Dataflow. Many of the proposals from Beam have been integrated into Spark 2.0.
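To make the Calcite description above more concrete, here is a toy Python sketch of a query held as a relational-algebra tree and rewritten by pluggable transformation rules. It does not use Calcite or mirror its API: the node classes (`Scan`, `Filter`, `Project`), the two rules, and the `optimize` and `execute` helpers are all assumptions made for illustration, and the in-memory "table" is made-up sample data.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class Scan:
    table: str


@dataclass(frozen=True)
class Filter:
    predicate: Callable[[dict], bool]
    child: object


@dataclass(frozen=True)
class Project:
    columns: Tuple[str, ...]
    child: object


def merge_filters(node):
    """Rule: Filter(Filter(x)) becomes a single Filter with both predicates."""
    if isinstance(node, Filter) and isinstance(node.child, Filter):
        outer, inner = node.predicate, node.child.predicate
        return Filter(lambda row: inner(row) and outer(row), node.child.child)
    return None


def collapse_projects(node):
    """Rule: Project(Project(x)) becomes one Project when the outer columns are a subset."""
    if (isinstance(node, Project) and isinstance(node.child, Project)
            and set(node.columns) <= set(node.child.columns)):
        return Project(node.columns, node.child.child)
    return None


def optimize(node, rules):
    """Rewrite children first, then apply rules at this node until none fires."""
    if isinstance(node, Filter):
        node = Filter(node.predicate, optimize(node.child, rules))
    elif isinstance(node, Project):
        node = Project(node.columns, optimize(node.child, rules))
    changed = True
    while changed:
        changed = False
        for rule in rules:
            rewritten = rule(node)
            if rewritten is not None:
                node, changed = rewritten, True
    return node


def execute(node, tables):
    """Evaluate a plan against in-memory tables (lists of dicts)."""
    if isinstance(node, Scan):
        return list(tables[node.table])
    rows = execute(node.child, tables)
    if isinstance(node, Filter):
        return [row for row in rows if node.predicate(row)]
    if isinstance(node, Project):
        return [{col: row[col] for col in node.columns} for row in rows]
    raise TypeError(f"unknown plan node: {node!r}")


def depth(node):
    return 1 if isinstance(node, Scan) else 1 + depth(node.child)


if __name__ == "__main__":
    tables = {"people": [
        {"name": "Ada", "age": 36, "country": "DK"},
        {"name": "Bo", "age": 12, "country": "DK"},
        {"name": "Cy", "age": 40, "country": "US"},
    ]}
    plan = Project(("name",), Project(("name", "age"),
           Filter(lambda r: r["age"] >= 18,
           Filter(lambda r: r["country"] == "DK", Scan("people")))))
    better = optimize(plan, [merge_filters, collapse_projects])
    print(depth(plan), depth(better))
    print(execute(better, tables))
```

The rules are ordinary functions passed in as a list, so adding a new rewrite means adding a function rather than changing the planner, which is the sense in which the rules are pluggable in this sketch.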

SLIDE 14

New Platforms

  • Current trends are leaning towards cloud-based Hadoop deployments

– Easier to scale
– Easier to manage
– Easier to tune
– Specialized distributions for different workloads (analytic queries, streaming, IoT)

  • Who do we work with already?

– Google Cloud Platform
– Azure
– AWS

  • Under consideration by Hitachi Vantara

– Cloudera Altus
– Snowflake

SLIDE 15

Hitachi Vantara Will Lead the Way!

  • As new technologies and Apache projects come through the ecosystem, Hitachi Vantara will evaluate which technologies make sense to function as a new Adaptive Execution Engine, or as a plug-in, or integrate with an API.

  • 2061: IO/Europa/Ganymede
SLIDE 16