Building Scalable Big Data Pipelines NOSQL SEARCH ROADSHOW ZURICH


Building Scalable Big Data Pipelines NOSQL SEARCH ROADSHOW ZURICH Christian Gügi, Solution Architect 19.09.2013 AGENDA Opportunities & Challenges Integrating Hadoop Lambda Architecture Lambda in Practice Recommendations


slide-1
SLIDE 1

Building Scalable Big Data Pipelines

Christian Gügi, Solution Architect

19.09.2013

NOSQL SEARCH ROADSHOW ZURICH

slide-2
SLIDE 2

AGENDA

  • Opportunities & Challenges
  • Integrating Hadoop
  • Lambda Architecture
  • Lambda in Practice
  • Recommendations
slide-3
SLIDE 3

ABOUT ME

  • Solution Architect @ YMC
  • Founder and organizer Swiss Big Data User Group
  • http://www.bigdata-usergroup.ch/
  • Contact
  • christian.guegi@ymc.ch
  • http://about.me/cguegi
  • @chrisgugi
slide-4
SLIDE 4

ABOUT YMC

  • Founded in 2001
  • Based in Kreuzlingen, Switzerland
  • Big Data Analytics, Web Solutions and Mobile Applications

  • 24 experts
  • Consulting, creation, engineering
slide-5
SLIDE 5

OPPORTUNITIES & CHALLENGES

slide-6
SLIDE 6

BIG DATA – WHAT IS THE BIG DEAL?

  • A. New sources and types from inside & outside organisations
  • “Internet of things”, sensors, RFID, intelligent devices, etc.
  • Unstructured information – documents, web logs, email, social media, etc.
  • Trusted 3rd party sources – industry providers & aggregators, government “Open Data”, weather, etc.
  • B. Technology innovations to exploit new world of data
  • Low cost storage and processing power (cloud, on-premise & hybrid)
  • New software patterns to handle speed & volume, structured and unstructured (in-memory computation, Hadoop, MapReduce, etc.)
  • Revolution in user experience, analytics, recommendations

slide-7
SLIDE 7

BIG DATA – CHALLENGES

  • Align business strategy
  • Data Management
  • Privacy protection
  • Lack of skilled and experienced people
  • Volume
  • Velocity
  • Variety
  • Veracity

Character of data

Overwhelming landscape & integration

Available talent

Organisational issues

slide-8
SLIDE 8

INTEGRATING HADOOP

slide-9
SLIDE 9

TYPICAL RDBMS SCENARIO

Diagram: Data Sources (RDBMS, RDBMS, NFS, Others), ETL, Data Systems (DWH), Apps (BI, Web, Mobile)

slide-10
SLIDE 10

BIG DATA SCENARIO

Diagram: Data Sources (RDBMS, RDBMS, NFS, Logs, Social Media, Sensors), Data Systems (DWH, Hadoop), Apps (BI, Web, Mobile)

1) Recommendations, etc.

slide-11
SLIDE 11

HADOOP ECOSYSTEM

slide-12
SLIDE 12

LAMBDA ARCHITECTURE

slide-13
SLIDE 13

LAMBDA ARCHITECTURE

  • Credits: Nathan Marz
  • Former Engineer at Twitter
  • Storm, Cascalog, ElephantDB

http://www.manning.com/marz/

slide-14
SLIDE 14

DESIGN PRINCIPLES

Lambda Architecture

  • Human fault-tolerance
  • Data immutability
  • Re-computation
slide-15
SLIDE 15

HUMAN FAULT-TOLERANCE

Lambda Architecture

  • Design for human error
  • Bugs in code
  • Accidental data loss
  • Data corruption
  • Protect good data, so you can always fix what went wrong

slide-16
SLIDE 16

DATA IMMUTABILITY

Lambda Architecture

  • Store data in its rawest form
  • Create and read but no update
  • No data can be lost
  • To fix the system just delete bad data
  • Can always revert to a true state
slide-17
SLIDE 17

DATA IMMUTABILITY

Lambda Architecture

Capturing change traditionally (mutability)

Before:
Name    Location
Alice   Zurich
Bob     Lucerne
Tom     Bern

After:
Name    Location
Alice   Basel
Bob     Lucerne
Tom     Bern

Capturing change (immutability)

Before:
Name    Location    Time
Alice   Zurich      2009/03/29
Bob     Lucerne     2012/04/12
Tom     Bern        2010/04/09

After:
Name    Location    Time
Alice   Zurich      2009/03/29
Bob     Lucerne     2012/04/12
Tom     Bern        2010/04/09
Alice   Basel       2013/08/20
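
The Alice example on this slide can be sketched as an append-only log of facts. This is a minimal illustration, not the speaker's implementation: the names `LocationFact`, `record` and `current_locations` are hypothetical, and an in-memory list stands in for the real master data store.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: a fact cannot be changed once created
class LocationFact:
    name: str
    location: str
    date: str  # "YYYY/MM/DD", so string order equals chronological order


# Master data: facts are only ever appended, never updated.
facts = [
    LocationFact("Alice", "Zurich", "2009/03/29"),
    LocationFact("Bob", "Lucerne", "2012/04/12"),
    LocationFact("Tom", "Bern", "2010/04/09"),
]


def record(log, fact):
    """Return a new log with the fact appended; existing facts stay untouched."""
    return log + [fact]


def current_locations(log):
    """Derive the latest known location per person from the full history."""
    latest = {}
    for fact in sorted(log, key=lambda f: f.date):
        latest[fact.name] = fact.location  # later facts win
    return latest


facts_after_move = record(facts, LocationFact("Alice", "Basel", "2013/08/20"))
print(current_locations(facts_after_move))
```

The "current" table of the mutable approach becomes a derived result, while Alice's Zurich record stays available in the history.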

slide-18
SLIDE 18

RE-COMPUTATION

Lambda Architecture

  • Always able to re-compute from historical data
  • Basis for all data systems
  • query = function(all data)

All Data → Pre-computed views → Query
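
A small sketch of "query = function(all data)" with a pre-computed view in between. The event records, `compute_view` and `query` are hypothetical names chosen for illustration; a real batch layer would run this as a Hadoop job rather than in memory.

```python
from collections import Counter

# Hypothetical raw events: the "all data" set, kept in its raw form.
all_data = [
    {"campaign": "spring", "event": "click"},
    {"campaign": "spring", "event": "view"},
    {"campaign": "summer", "event": "click"},
    {"campaign": "spring", "event": "click"},
    {"campaign": "summer", "event": "clck"},  # a corrupted record
]


def compute_view(events):
    """Batch function: rebuild the pre-computed view from scratch."""
    return Counter(e["campaign"] for e in events if e["event"] == "click")


def query(view, campaign):
    """Queries read the small pre-computed view, not the raw data."""
    return view.get(campaign, 0)


view = compute_view(all_data)
print(query(view, "spring"))

# Fixing a mistake: drop the bad data and re-compute the whole view.
valid_events = {"click", "view"}
clean_data = [e for e in all_data if e["event"] in valid_events]
view = compute_view(clean_data)
```

Because the view is always a pure function of the stored data, a bug in `compute_view` is repaired by fixing the function and running it again.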

slide-19
SLIDE 19

LAYERS

Lambda Architecture

http://www.ymc.ch/en/lambda-architecture-part-1

slide-20
SLIDE 20

Lambda in Practice

slide-21
SLIDE 21

ONLINE MARKETING

  • Tracking and analytics solution
  • Improve customer targeting and segmentation

  • Various reports
  • Real-time not required
slide-22
SLIDE 22

OVERVIEW

Diagram labels: AdServer, Flume, log, Campaign Database, Sqoop, csv, FTP, Up- & Download, fs -put, HDFS, HBase, Hive, Pig, Impala, Aggregated Data, Web, DWH, BI apps, Oozie, ZooKeeper, Cloudera Manager

slide-23
SLIDE 23

DATA PIPELINE

Diagram labels: AdServer, Flume, log, Campaign Database, Sqoop, csv, FTP, fs -put, HDFS, M/R, Avro, Extracting, Transformation, Loading, Tracking, Profiles, Bulk Importer, DWH

slide-24
SLIDE 24

ADVANTAGES

  • Extensible – easily add speed layer later on
  • Complements existing DWH/BI system
  • ETL phases are decoupled
  • Reliable
      • Infrastructure
      • Each step can be replayed
  • Scalable
      • Storage
      • Processing
  • Highly available
  • Ad-hoc analysis right from the beginning
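
The points "ETL phases are decoupled" and "Each step can be replayed" can be sketched as stages that communicate only through persisted files. This is an assumption-laden toy: the stage functions, file names and log format are hypothetical, JSON files in a temporary directory stand in for Avro files on HDFS, and a dict stands in for the DWH.

```python
import json
import tempfile
from pathlib import Path


def extract(raw_lines, out_path):
    """Stage 1: parse raw log lines into records and persist them."""
    records = []
    for line in raw_lines:
        user, action = line.split(",")
        records.append({"user": user, "action": action})
    out_path.write_text(json.dumps(records))


def transform(in_path, out_path):
    """Stage 2: read the stage-1 output and count actions per user."""
    counts = {}
    for rec in json.loads(in_path.read_text()):
        counts[rec["user"]] = counts.get(rec["user"], 0) + 1
    out_path.write_text(json.dumps(counts))


def load(in_path, warehouse):
    """Stage 3: replace the warehouse contents with the aggregates."""
    warehouse.clear()
    warehouse.update(json.loads(in_path.read_text()))


workdir = Path(tempfile.mkdtemp())
extracted = workdir / "extracted.json"
aggregated = workdir / "aggregated.json"
warehouse = {}

extract(["u1,click", "u2,view", "u1,view"], extracted)
transform(extracted, aggregated)
load(aggregated, warehouse)

# Replay: re-running a stage only needs its persisted input,
# so earlier stages do not have to run again.
transform(extracted, aggregated)
load(aggregated, warehouse)
print(warehouse)
```

Each stage overwrites its output instead of appending to it, so running a stage twice gives the same result as running it once.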
slide-25
SLIDE 25

RECOMMENDATIONS

slide-26
SLIDE 26

RECOMMENDATIONS

  • Not a fixed, one-size-fits-all approach
  • Adapt to your needs/requirements
  • Hadoop complements existing systems
  • How real-time do I need to be?
  • Immutability and pre-computation are just good ideas!
  • Store information in the rawest format possible
  • Use a serialization framework (Avro, Thrift, Protocol Buffers)

slide-27
SLIDE 27

THANK YOU!

slide-28
SLIDE 28

YMC AG Sonnenstrasse 4 CH-8280 Kreuzlingen Switzerland

Photo Credits:
Slide 05: Success opportunity achieve by Stephen McCulloch
Slide 08: Matrix by Gamaliel Espinoza Macedo
Slide 12: Layers by Katelyn Leblanc
Slide 20: Mining For Information by JD Hancock
Slide 27: Warning Question by longzijun

@chrisgugi

CONTACT US

christian.guegi@ymc.ch

  • Tel. +41 (0)71 508 24 76

www.ymc.ch