Big Data: Compete by Asking Bigger Questions



SLIDE 1

A World of Data

  • “Gizillions” of mobile transactions
  • The “Thingsternet”
  • Living online

Big Data: Compete by asking bigger questions

SLIDE 2

$ $$$...

SLIDE 3

???

SLIDE 4

SLA

SLIDE 5
SLIDE 6

Yaaaay – Hadoop to Save the Daaaay!!

  • But it’s not always easy to tame an elephant…
SLIDE 7
SLIDE 8

[Diagram] CUSTOMERS → WEB CLIENT → WEB SHOP BACKEND → WEB SHOP DATABASE (~100GB of product and customer transaction data)

Introducing “DataCo”

“We don’t really have a big data problem…”

SLIDE 9

> 6 months?

[Diagram] CUSTOMERS → WEB CLIENT → WEB SHOP BACKEND → WEB SHOP DATABASE, now with Mobile App Data, Web App Click Stream Data, and IT/Ops and InfoSec Data alongside the Product and Customer Transaction Data

Introducing “DataCo”

SLIDE 10

Active Archive / Self-Serve Ad-hoc BI

  • Top sold products last 6, 12, and 18 months?

[Diagram] SQL queries via Hive and Impala over HDFS

SLIDE 11

Using Sqoop to Ingest Data from MySQL

  • Sqoop is a bi-directional structured data ingest tool
  • Simple UI in Hue, more commonly used from the shell

$ sqoop import-all-tables -m 12 --connect jdbc:mysql://my.sql.host:3306/retail_db \
    --username=dataco_dba --password=yow!2014 \
    --compression-codec=snappy --as-avrodatafile \
    --warehouse-dir=/user/hive/warehouse

$ sqoop import -m 12 --connect jdbc:mysql://my.sql.host:3306/retail_db \
    --username=dataco_dba --password=yow!2014 \
    --table my_cool_table --hive-import --as-parquetfile
SLIDE 12

Create Tables in Hive

  • Hive is a batch query tool, but also the keeper of table structures
  • Remember: structure is stored separately from the data

hive> CREATE EXTERNAL TABLE products
    > ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
    > STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
    > OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
    > LOCATION 'hdfs:///user/hive/warehouse/products'
    > TBLPROPERTIES ('avro.schema.url'='hdfs://namenode_dataco/user/examples/products.avsc');
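The "structure stored separately from data" point can be sketched in plain Python: the raw records exist once, and any number of independent "table definitions" can be applied over them at read time. The file contents, schemas, and column names below are invented for illustration, not from the demo dataset:

```python
import csv
import io

# The "data": raw delimited records, written once and never modified.
raw = "1,Stan Smith Shoes,59.99\n2,Trail Runner,89.50\n"

# Two independent "table structures" over the same raw bytes.
products_schema = ["id", "name", "price"]
narrow_schema = ["id", "name"]  # a second "table" that simply ignores the price column

def read_table(raw_data, schema):
    """Apply a schema at read time, the way Hive does for external tables."""
    rows = []
    for record in csv.reader(io.StringIO(raw_data)):
        rows.append(dict(zip(schema, record)))
    return rows

full = read_table(raw, products_schema)
narrow = read_table(raw, narrow_schema)

print(full[0]["price"])   # the full schema sees the price column
print(narrow[0])          # the narrow schema never materializes it
```

Dropping either "table" here would leave `raw` untouched, which mirrors why `CREATE EXTERNAL TABLE` is the natural fit: the data outlives any one structure.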

SLIDE 13

Use Impala via Hue to Query

SLIDE 14

$ $$$...

SLIDE 15

Correlate Multi-type Data Sets

  • Top viewed products last 6, 12, and 18 months?

[Diagram] SQL queries via Hive and Impala over HDFS, with Flume for ingest

SLIDE 16

Ingest Data Using Flume

  • Pub/sub ingest framework
  • Flexible multi-level (mini-transformation) pipeline

[Diagram] FLUME AGENT: FLUME SOURCE → (optional logic) → FLUME SINK. Sources take continuously generated events, e.g. syslog or tweets; sinks deliver to another Flume agent, HDFS, HBase, Solr, or other destination.
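The agent pipeline above can be sketched as a toy source → optional logic → sink chain. The names and the transformation below are illustrative only, not Flume's actual API:

```python
def source():
    """Continuously generated events -- here, a fixed batch of syslog-ish lines."""
    yield from ["INFO app started", "WARN disk 80%", "INFO request ok"]

def interceptor(event):
    """Optional per-event logic inside the agent (a mini-transformation)."""
    return event.lower()

sink = []  # stands in for HDFS, HBase, Solr, or a downstream agent

# The agent loop: pull from the source, transform, push to the sink.
for event in source():
    sink.append(interceptor(event))

print(sink)
```

A real agent adds a channel (buffer) between source and sink so the two sides can run at different rates, which is what the batchSize/batchDuration settings shown later tune.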

SLIDE 17

Create Hive Tables over Log Data

  • New use case, new data
  • Create new tables over semi-structured log data

CREATE EXTERNAL TABLE intermediate_access_logs (
  ip STRING, date STRING, method STRING, url STRING, http_version STRING,
  code1 STRING, code2 STRING, dash STRING, user_agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) - - \\[([^\\]]*)\\] \"([^\ ]*) ([^\ ]*) ([^\ ]*)\" (\\d*) (\\d*) \"([^\"]*)\" \"([^\"]*)\"",
  "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s"
)
LOCATION '/user/hive/warehouse/original_access_logs';

CREATE EXTERNAL TABLE tokenized_access_logs (
  ip STRING, date STRING, method STRING, url STRING, http_version STRING,
  code1 STRING, code2 STRING, dash STRING, user_agent STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/hive/warehouse/tokenized_access_logs';

ADD JAR /opt/cloudera/parcels/CDH/lib/hive/lib/hive-contrib.jar;

INSERT OVERWRITE TABLE tokenized_access_logs SELECT * FROM intermediate_access_logs;

exit;
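The RegexSerDe pattern above is essentially the Apache combined log format, one capture group per Hive column. The same parse can be sketched in Python with a simplified (assumed) regex and a made-up log line:

```python
import re

# Simplified combined-log-format pattern; group names mirror the Hive columns.
LOG_RE = re.compile(
    r'(?P<ip>\S+) - - \[(?P<date>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<http_version>\S+)" '
    r'(?P<code1>\d+) (?P<code2>\d+)'
)

# A made-up access-log line of the shape the table expects.
line = '10.0.0.1 - - [14/Jun/2014:10:30:13 -0400] "GET /product/123 HTTP/1.1" 200 512'

m = LOG_RE.match(line)
fields = m.groupdict()
print(fields["url"], fields["code1"])  # /product/123 200
```

This is exactly what the `intermediate_access_logs` → `tokenized_access_logs` step does at scale: regex-parse once, then persist the tokenized columns so later queries skip the expensive parse.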

SLIDE 18

Use Impala and Hue to Query

[Slide visual: query results with rank labels 1–9; one product is Missing!!!]

SLIDE 19

$ $$$...

SLIDE 20

!!!

SLIDE 21

Multi-Use-Case Data Hub

  • Why are sales dropping over the last 3 days?

[Diagram] Flume → HDFS, with Solr serving search queries

SLIDE 22

Create your Index

  • Create an empty Solr index configuration directory
  • Edit the Solr schema file to have the fields you want to search over

$ solrctl --zk <ALL YOUR ZK IPs>/solr instancedir --generate live_logs_dir

…
<field name="_version_" type="long" indexed="true" stored="true" multiValued="false" />
<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<field name="ip" type="text_general" indexed="true" stored="true"/>
<field name="request_date" type="date" indexed="true" stored="true"/>
…
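What `indexed` vs. `stored` buys you can be sketched with a toy collection. Field names follow the schema above; the documents and the dict-based "index" are invented for illustration:

```python
# Toy collection: "indexed" fields feed a lookup structure; "stored" fields
# are kept verbatim so they can be returned with each hit.
docs = [
    {"id": "1", "ip": "10.0.0.1", "request_date": "2014-06-14"},
    {"id": "2", "ip": "10.0.0.2", "request_date": "2014-06-15"},
]

index = {}   # term -> doc ids, playing the role of the indexed "ip" field
stored = {}  # doc id -> stored fields
for doc in docs:
    index.setdefault(doc["ip"], set()).add(doc["id"])
    stored[doc["id"]] = doc

def search(ip):
    """Look up by the indexed field, return stored fields of the hits."""
    return [stored[i] for i in sorted(index.get(ip, set()))]

hits = search("10.0.0.2")
print(hits[0]["request_date"])  # 2014-06-15
```

A field that is indexed but not stored can be searched on but not returned, and vice versa, which is why the schema declares both attributes per field.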

SLIDE 23

Create your Index cont.

  • Upload your configuration for a collection to ZooKeeper
  • Tell Solr to start serving up a collection and start indexing data for it

$ solrctl --zk <ALL YOUR ZK IPs>/solr instancedir --create live_logs ./live_logs_dir
$ solrctl --zk <ALL YOUR ZK IPs>/solr collection --create live_logs -s 4

SLIDE 24

Flume and Morphline Pipeline

SLIDE 25

Flume with Morphlines Configured

  • Configure Flume to use your Morphlines and post parsed data to Solr

…
# Describe solrSink
agent1.sinks.solrSink.type = org.apache.flume.sink.solr.morphline.MorphlineSolrSink
agent1.sinks.solrSink.channel = memoryChannel
agent1.sinks.solrSink.batchSize = 1000
agent1.sinks.solrSink.batchDurationMillis = 1000
agent1.sinks.solrSink.morphlineFile = /opt/examples/flume/conf/morphline.conf
agent1.sinks.solrSink.morphlineId = morphline
agent1.sinks.solrSink.threadCount = 1
…

SLIDE 26

Dynamic Search UI in Hue

SLIDE 27

Shared Storage!!

SLIDE 28
SLIDE 29

How Do We Improve Healthcare?

Challenges
  • Only 3 days’ worth of monitoring data capacity
  • No ability to correlate large research data sets
  • No ability to study environmental impact ad hoc

Solution
  • 50GB of monitor data per week
  • 2TB capacity
  • Sqoop, Solr, Impala, HDFS

Benefits
  • Ad-hoc and faster insight
  • Reduced asthma-related ICU visits
  • Total license fees < 3 processor licenses for EDW

SLIDE 30

How Do We Feed The World?

Global Warming Changes Conditions

How do we improve quality and resistance of crops and seeds in a variety of global and rapidly changing environments?

SLIDE 31

How Do We Feed The World?

Challenges
  • Time to market for each new product: 5-10 years
  • 1,000+ scientists working in silos
  • Data processing bottlenecks slow development

Solution
  • PB-scale
  • HBase, HDFS, Solr, MapReduce, Sqoop, Impala, …

Benefits
  • Streamlined processes
  • Time to results reduced from years to months!!!

SLIDE 32

Challenges
  • 100-200 B events/month
  • Real-time multi-type event correlation complex
  • No way to do ad-hoc game analytics

Solution
  • ~20 nodes
  • 256GB RAM servers
  • Flume, Solr, Impala, HDFS

Benefits
  • Ad-hoc insight in feature trends
  • Significant TTR reduction
  • ROI realized in the 1st week

SLIDE 33

Learn More?

  • Stop by the Cloudera booth today! 
  • Play on your own: cloudera.com/live
  • Get training: http://cloudera.com/content/cloudera/en/training.html

  • Join the Community: cdh-user@cloudera.org
  • Connect with me: @EvaAndreasson
SLIDE 34

Hope You Enjoyed This Talk!

Don’t forget to VOTE!!!

SLIDE 35
SLIDE 36

Bonus Track…

SLIDE 37

My Advice for the Road…

SLIDE 38

Try Something Simple First…

SLIDE 39

Decide what to Cook!

SLIDE 40

Collect All Ingredients

SLIDE 41

Use the Right Tool for the Right Task

SLIDE 42

Prepare All Ingredients

SLIDE 43

Don’t Forget the Importance of Visualization!

SLIDE 44
SLIDE 45

Challenges
  • Tons of information locked away in medical records & scientific studies
  • Different sources & systems can’t “talk” to each other

Solution
  • Integration & storage of multi-structured experimental data
  • Data access & exploration via Impala, R, HBase, Solr, Hive

Benefits
  • Faster, cheaper genome sequencing
  • Searchable index of variant call data for biologists to explore
SLIDE 46

Using Sqoop to Ingest Data from MySQL

  • View your imported “tables”
  • View all Avro files constituting a table

$ hadoop fs -ls /user/hive/warehouse/
$ hadoop fs -ls /user/hive/warehouse/mytablename/

SLIDE 47

Hadoop - A New Approach to Data Management

  • Schema on Read
  • Distributed Storage
  • Distributed Processing
  • Active Archive
  • Cost-Efficient Offload
  • Flexible Analytics

SLIDE 48

Hadoop: Storage & Batch Processing

The Birth of the Data Lake

SLIDE 49
A Rapidly Growing Ecosystem

  • 2006: Core Hadoop
  • 2007: Core Hadoop
  • 2008: + HBase, ZooKeeper, Mahout
  • 2009: + Pig, Hive
  • 2010: + Flume, Avro, Sqoop
  • 2011: + Bigtop, Oozie
  • 2012: + Hue, Impala, Parquet
  • 2013: + Solr, Sentry
  • 2014: + Spark, Kafka

SLIDE 50

The Rise of an Enterprise Data Hub

Applications

SLIDE 51

HDFS

2005-2007 – Hadoop

MapReduce

SLIDE 52

HDFS

2008 – HBase, ZooKeeper, Mahout

MapReduce, HBase, ZooKeeper, Mahout

SLIDE 53

HDFS

2009 – Hive, Pig

MapReduce, HBase, ZooKeeper, Mahout, Hive, Pig

SLIDE 54

HDFS

2010 – Flume, Sqoop, Avro

MapReduce, HBase, ZooKeeper, Mahout, Hive, Pig, Flume, DB, Avro

SLIDE 55

HDFS

2011 – Oozie, Hue

MapReduce, HBase, ZooKeeper, Mahout, Hive, Pig, Flume, DB, Avro, Oozie, Hue

SLIDE 56

HDFS

2012 – YARN, Impala, Parquet

MapReduce, HBase, ZooKeeper, Mahout, Hive, Pig, Flume, DB, Avro, Oozie, Hue, Parquet, Impala, YARN

SLIDE 57

HDFS

2013 – Solr, Sentry

MapReduce, HBase, ZooKeeper, Mahout, Hive, Pig, Flume, DB, Avro, Oozie, Hue, Parquet, Impala, Solr, YARN, Sentry

SLIDE 58

HDFS

2014 – Spark, Kafka

MapReduce, HBase, ZooKeeper, Mahout, Hive, Pig, Flume, DB, Avro, Oozie, Hue, Parquet, Impala, Solr, YARN, Sentry, Spark, Kafka

SLIDE 59

The Hadoop Ecosystem – Explained!

  • Distributed File System (Scalable Storage)
  • Batch Processing
  • Interactive SQL
  • Proc. Oriented Query
  • SQL
  • Event-based data ingest
  • KeyValue Store
  • Free-Text Search
  • Real Time Processing
  • Machine Learning
  • Process Mgmt
  • Workflow Mgmt
  • GUI
  • Resource Management and Scheduling
  • Access Control
  • DB

SLIDE 60

Common Use Cases

  • Threat detection
  • Active archive / accessible global knowledge base
  • Data accuracy
  • Streamlined cross-data type aggregation
  • Richer customer profiling / ecommerce experience
  • Interactive market segmenting / customer identification

  • Expedited data modeling
  • ….
SLIDE 61

The Right Tool For the Right Task

Tool    | Workload    | Use Case                                                  | Result Ordering
Hive    | Batch       | SQL, analytics & joins                                    | Structured
Pig     | Batch       | Proc. oriented SQL, analytics & joins                     | Structured
Impala  | Interactive | SQL, analytics & joins                                    | Structured
Solr    | Interactive | Fuzzy, phonetic, polygon, geo-spatial                     | Relevance-based
HBase   | Real Time   | Random key-lookups over sparsely populated columnar data  | Scan-order
Spark   | NRT         | Advanced analytics & ML                                   | Sorted

SLIDE 62

When to use what?

  • Real Time Query (e.g. Impala)
      • I want to do BI reports or interactive analytical aggregations but not wait hours for the response
  • Batch Query (e.g. Pig, Hive)
      • I have nightly batch query jobs as part of a workflow
  • Real Time Search (e.g. SolrCloud)
      • I have unstructured data I want to run free-text search over
      • My SQL queries are getting more and more complex as they need to contain 15+ “like” conditions
  • Real time key lookups (e.g. HBase)
      • I want random access to sparsely populated table-like data
      • I want to compare user profiles or behavior in real time
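The "15+ like conditions" smell can be made concrete: a chain of SQL LIKEs is a full scan with substring tests per row, while a search engine consults a prebuilt term index. The rows and tokenizer below are made up for illustration:

```python
rows = ["blue suede shoes", "red running shoes", "blue jacket"]

# SQL-style: scan every row, apply every LIKE-style substring test.
def like_scan(rows, terms):
    return [r for r in rows if all(t in r for t in terms)]

# Search-style: build an inverted index once, then intersect posting sets.
index = {}
for i, r in enumerate(rows):
    for token in r.split():
        index.setdefault(token, set()).add(i)

def index_lookup(terms):
    ids = set.intersection(*(index.get(t, set()) for t in terms))
    return [rows[i] for i in sorted(ids)]

print(like_scan(rows, ["blue", "shoes"]))  # ['blue suede shoes']
print(index_lookup(["blue", "shoes"]))     # same answer, without scanning every row
```

The scan's cost grows with rows × terms; the index lookup touches only the posting sets for the query terms, which is why piles of LIKE clauses are a hint to move that workload to Solr.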
SLIDE 63

When to use what?

  • Spark
      • I want to implement analytics algorithms over my data, and my data sets fit into memory
      • I have real time streaming data I want to analyze in real time
  • MapReduce
      • I want to do fail-safe large ETL processing workloads
      • My data does not fit into memory and I want to batch process it with my custom logic – no real time needs
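The "custom batch logic over data that doesn't fit in memory" model is the map → shuffle → reduce pattern. The canonical word count, sketched in-process (in a real job each phase runs distributed over HDFS blocks):

```python
from collections import defaultdict

records = ["big data", "big elephant", "data lake"]

# Map: emit (key, value) pairs for each input record.
mapped = [(word, 1) for rec in records for word in rec.split()]

# Shuffle: group all values by key (the framework does this between phases).
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce: fold each group's values into a final result per key.
counts = {key: sum(values) for key, values in groups.items()}
print(counts["big"], counts["data"])  # 2 2
```

Because each phase only ever streams over its inputs, none of the three steps needs the whole data set in memory at once, which is the property the bullet above is pointing at.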

SLIDE 64