Big Data: Compete by Asking Bigger Questions



SLIDE 1

A World of Data

  • “Gizillions” of mobile transactions
  • The “Thingsternet”
  • Living online

Big Data: Compete by asking bigger questions

SLIDE 2

$ $$$...

SLIDE 3

???

SLIDE 4

SLA

SLIDE 5
SLIDE 6

Yaaaay – Hadoop to Save the Daaaay!!

  • But it’s not always easy to tame an elephant…
SLIDE 7
SLIDE 8

[Diagram] CUSTOMERS → WEB CLIENT → WEB SHOP BACKEND → WEB SHOP DATABASE (~100GB of product and customer transaction data)

Introducing “DataCo”

“We don’t really have a big data problem…”

SLIDE 9

> 6 months?

[Diagram] CUSTOMERS → WEB CLIENT → WEB SHOP BACKEND → WEB SHOP DATABASE, now with Mobile App Data, Web App Click Stream Data, and IT/Ops and InfoSec Data alongside the Product and Customer Transaction Data

Introducing “DataCo”

SLIDE 10

Active Archive / Self-Serve Ad-hoc BI

  • Top sold products last 6, 12, and 18 months?

[Diagram] SQL queries via Hive and Impala over HDFS

SLIDE 11

Using Sqoop to Ingest Data from MySQL

  • Sqoop is a bi-directional structured data ingest tool
  • Simple UI in Hue, more commonly used from the shell

$ sqoop import-all-tables -m 12 --connect jdbc:mysql://my.sql.host:3306/retail_db \
    --username=dataco_dba --password=yow!2014 \
    --compression-codec=snappy --as-avrodatafile \
    --warehouse-dir=/user/hive/warehouse

$ sqoop import -m 12 --connect jdbc:mysql://my.sql.host:3306/retail_db \
    --username=dataco_dba --password=yow!2014 \
    --table my_cool_table --hive-import --as-parquetfile
SLIDE 12

Create Tables in Hive

  • Hive is a batch query tool, but also the keeper of table structures
  • Remember: structure is stored separately from the data

hive> CREATE EXTERNAL TABLE products
    > ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
    > STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
    > OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
    > LOCATION 'hdfs:///user/hive/warehouse/products'
    > TBLPROPERTIES ('avro.schema.url'='hdfs://namenode_dataco/user/examples/products.avsc');
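The "structure stored separately from data" point can be sketched in plain Python: the raw records exist once, and any number of independent "table definitions" can be applied over them at read time. The file contents, schemas, and column names below are invented for illustration, not from the demo dataset:

```python
import csv
import io

# The "data": raw delimited records, written once and never modified.
raw = "1,Stan Smith Shoes,59.99\n2,Trail Runner,89.50\n"

# Two independent "table structures" over the same raw bytes.
products_schema = ["id", "name", "price"]
narrow_schema = ["id", "name"]  # a second "table" that simply ignores the price column

def read_table(raw_data, schema):
    """Apply a schema at read time, the way Hive does for external tables."""
    rows = []
    for record in csv.reader(io.StringIO(raw_data)):
        rows.append(dict(zip(schema, record)))
    return rows

full = read_table(raw, products_schema)
narrow = read_table(raw, narrow_schema)

print(full[0]["price"])   # the full schema sees the price column
print(narrow[0])          # the narrow schema never materializes it
```

Dropping either "table" here would leave `raw` untouched, which mirrors why `CREATE EXTERNAL TABLE` is the natural fit: the data outlives any one structure.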

SLIDE 13

Use Impala via Hue to Query

SLIDE 14

$ $$$...

SLIDE 15

Correlate Multi-type Data Sets

  • Top viewed products last 6, 12, and 18 months?

[Diagram] SQL queries via Hive and Impala over HDFS, with Flume for ingest

SLIDE 16

Ingest Data Using Flume

  • Pub/sub ingest framework
  • Flexible multi-level (mini-transformation) pipeline

[Diagram] FLUME AGENT: FLUME SOURCE → (optional logic) → FLUME SINK. Sources take continuously generated events, e.g. syslog or tweets; sinks deliver to another Flume agent, HDFS, HBase, Solr, or other destination.
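The agent pipeline above can be sketched as a toy source → optional logic → sink chain. The names and the transformation below are illustrative only, not Flume's actual API:

```python
def source():
    """Continuously generated events -- here, a fixed batch of syslog-ish lines."""
    yield from ["INFO app started", "WARN disk 80%", "INFO request ok"]

def interceptor(event):
    """Optional per-event logic inside the agent (a mini-transformation)."""
    return event.lower()

sink = []  # stands in for HDFS, HBase, Solr, or a downstream agent

# The agent loop: pull from the source, transform, push to the sink.
for event in source():
    sink.append(interceptor(event))

print(sink)
```

A real agent adds a channel (buffer) between source and sink so the two sides can run at different rates, which is what the batchSize/batchDuration settings shown later tune.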

SLIDE 17

Create Hive Tables over Log Data

  • New use case, new data
  • Create new tables over semi-structured log data

CREATE EXTERNAL TABLE intermediate_access_logs (
  ip STRING, date STRING, method STRING, url STRING, http_version STRING,
  code1 STRING, code2 STRING, dash STRING, user_agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) - - \\[([^\\]]*)\\] \"([^\ ]*) ([^\ ]*) ([^\ ]*)\" (\\d*) (\\d*) \"([^\"]*)\" \"([^\"]*)\"",
  "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s"
)
LOCATION '/user/hive/warehouse/original_access_logs';

CREATE EXTERNAL TABLE tokenized_access_logs (
  ip STRING, date STRING, method STRING, url STRING, http_version STRING,
  code1 STRING, code2 STRING, dash STRING, user_agent STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/hive/warehouse/tokenized_access_logs';

ADD JAR /opt/cloudera/parcels/CDH/lib/hive/lib/hive-contrib.jar;

INSERT OVERWRITE TABLE tokenized_access_logs SELECT * FROM intermediate_access_logs;

exit;
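The RegexSerDe pattern above is essentially the Apache combined log format, one capture group per Hive column. The same parse can be sketched in Python with a simplified (assumed) regex and a made-up log line:

```python
import re

# Simplified combined-log-format pattern; group names mirror the Hive columns.
LOG_RE = re.compile(
    r'(?P<ip>\S+) - - \[(?P<date>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<http_version>\S+)" '
    r'(?P<code1>\d+) (?P<code2>\d+)'
)

# A made-up access-log line of the shape the table expects.
line = '10.0.0.1 - - [14/Jun/2014:10:30:13 -0400] "GET /product/123 HTTP/1.1" 200 512'

m = LOG_RE.match(line)
fields = m.groupdict()
print(fields["url"], fields["code1"])  # /product/123 200
```

This is exactly what the `intermediate_access_logs` → `tokenized_access_logs` step does at scale: regex-parse once, then persist the tokenized columns so later queries skip the expensive parse.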

SLIDE 18

Use Impala and Hue to Query

[Slide visual: query results with rank labels 1–9; one product is Missing!!!]

SLIDE 19

$ $$$...

SLIDE 20

!!!

SLIDE 21

Multi-Use-Case Data Hub

  • Why are sales dropping over the last 3 days?

[Diagram] Flume → HDFS, with Solr serving search queries

SLIDE 22

Create your Index

  • Create an empty Solr index configuration directory
  • Edit the Solr schema file to have the fields you want to search over

$ solrctl --zk <ALL YOUR ZK IPs>/solr instancedir --generate live_logs_dir

…
<field name="_version_" type="long" indexed="true" stored="true" multiValued="false" />
<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<field name="ip" type="text_general" indexed="true" stored="true"/>
<field name="request_date" type="date" indexed="true" stored="true"/>
…
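What `indexed` vs. `stored` buys you can be sketched with a toy collection. Field names follow the schema above; the documents and the dict-based "index" are invented for illustration:

```python
# Toy collection: "indexed" fields feed a lookup structure; "stored" fields
# are kept verbatim so they can be returned with each hit.
docs = [
    {"id": "1", "ip": "10.0.0.1", "request_date": "2014-06-14"},
    {"id": "2", "ip": "10.0.0.2", "request_date": "2014-06-15"},
]

index = {}   # term -> doc ids, playing the role of the indexed "ip" field
stored = {}  # doc id -> stored fields
for doc in docs:
    index.setdefault(doc["ip"], set()).add(doc["id"])
    stored[doc["id"]] = doc

def search(ip):
    """Look up by the indexed field, return stored fields of the hits."""
    return [stored[i] for i in sorted(index.get(ip, set()))]

hits = search("10.0.0.2")
print(hits[0]["request_date"])  # 2014-06-15
```

A field that is indexed but not stored can be searched on but not returned, and vice versa, which is why the schema declares both attributes per field.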

SLIDE 23

Create your Index cont.

  • Upload your configuration for a collection to ZooKeeper
  • Tell Solr to start serving up a collection and start indexing data for it

$ solrctl --zk <ALL YOUR ZK IPs>/solr instancedir --create live_logs ./live_logs_dir
$ solrctl --zk <ALL YOUR ZK IPs>/solr collection --create live_logs -s 4

SLIDE 24

Flume and Morphline Pipeline

SLIDE 25

Flume with Morphlines Configured

  • Configure Flume to use your Morphlines and post parsed data to Solr

…
# Describe solrSink
agent1.sinks.solrSink.type = org.apache.flume.sink.solr.morphline.MorphlineSolrSink
agent1.sinks.solrSink.channel = memoryChannel
agent1.sinks.solrSink.batchSize = 1000
agent1.sinks.solrSink.batchDurationMillis = 1000
agent1.sinks.solrSink.morphlineFile = /opt/examples/flume/conf/morphline.conf
agent1.sinks.solrSink.morphlineId = morphline
agent1.sinks.solrSink.threadCount = 1
…

SLIDE 26

Dynamic Search UI in Hue

SLIDE 27

Shared Storage!!

SLIDE 28
SLIDE 29

How Do We Improve Healthcare?

Challenges
  • Only 3 days’ worth of monitoring data capacity
  • No ability to correlate large research data sets
  • No ability to study environmental impact ad hoc

Solution
  • 50GB of monitor data per week
  • 2TB capacity
  • Sqoop, Solr, Impala, HDFS

Benefits
  • Ad-hoc and faster insight
  • Reduced asthma-related ICU visits
  • Total license fees < 3 processor licenses for EDW

SLIDE 30

How Do We Feed The World?

Global Warming Changes Conditions

How do we improve quality and resistance of crops and seeds in a variety of global and rapidly changing environments?

SLIDE 31

How Do We Feed The World?

Challenges
  • Time to market for each new product: 5-10 years
  • 1,000+ scientists working in silos
  • Data processing bottlenecks slow development

Solution
  • PB-scale
  • HBase, HDFS, Solr, MapReduce, Sqoop, Impala, …

Benefits
  • Streamlined processes
  • Time to results reduced from years to months!!!

SLIDE 32

Challenges
  • 100-200 B events/month
  • Real-time multi-type event correlation complex
  • No way to do ad-hoc game analytics

Solution
  • ~20 nodes
  • 256GB RAM servers
  • Flume, Solr, Impala, HDFS

Benefits
  • Ad-hoc insight in feature trends
  • Significant TTR reduction
  • ROI realized in the 1st week

SLIDE 33

Learn More?

  • Stop by the Cloudera booth today! 
  • Play on your own: cloudera.com/live
  • Get training: http://cloudera.com/content/cloudera/en/training.html

  • Join the Community: cdh-user@cloudera.org
  • Connect with me: @EvaAndreasson
SLIDE 34

Hope You Enjoyed This Talk!

Don’t forget to VOTE!!!

SLIDE 35
SLIDE 36

Bonus Track…

SLIDE 37

My Advice for the Road…

SLIDE 38

Try Something Simple First…

SLIDE 39

Decide what to Cook!

SLIDE 40

Collect All Ingredients

SLIDE 41

Use the Right Tool for the Right Task

SLIDE 42

Prepare All Ingredients

SLIDE 43

Don’t Forget the Importance of Visualization!

SLIDE 44
SLIDE 45

Challenges
  • Tons of information locked away in medical records & scientific studies
  • Different sources & systems can’t “talk” to each other

Solution
  • Integration & storage of multi-structured experimental data
  • Data access & exploration via Impala, R, HBase, Solr, Hive

Benefits
  • Faster, cheaper genome sequencing
  • Searchable index of variant call data for biologists to explore
SLIDE 46

Using Sqoop to Ingest Data from MySQL

  • View your imported “tables”
  • View all Avro files constituting a table

$ hadoop fs -ls /user/hive/warehouse/
$ hadoop fs -ls /user/hive/warehouse/mytablename/

SLIDE 47

Hadoop - A New Approach to Data Management

  • Schema on Read
  • Distributed Storage
  • Distributed Processing
  • Active Archive
  • Cost-Efficient Offload
  • Flexible Analytics

SLIDE 48

Hadoop: Storage & Batch Processing

The Birth of the Data Lake

SLIDE 49
A Rapidly Growing Ecosystem

  • 2006: Core Hadoop
  • 2007: Core Hadoop
  • 2008: + HBase, ZooKeeper, Mahout
  • 2009: + Pig, Hive
  • 2010: + Flume, Avro, Sqoop
  • 2011: + Bigtop, Oozie
  • 2012: + Hue, Impala, Parquet
  • 2013: + Solr, Sentry
  • 2014: + Spark, Kafka

SLIDE 50

The Rise of an Enterprise Data Hub

Applications

SLIDE 51

HDFS

2005-2007 – Hadoop

MapReduce

SLIDE 52

HDFS

2008 – HBase, ZooKeeper, Mahout

MapReduce, HBase, ZooKeeper, Mahout

SLIDE 53

HDFS

2009 – Hive, Pig

MapReduce, HBase, ZooKeeper, Mahout, Hive, Pig

SLIDE 54

HDFS

2010 – Flume, Sqoop, Avro

MapReduce, HBase, ZooKeeper, Mahout, Hive, Pig, Flume, DB, Avro

SLIDE 55

HDFS

2011 – Oozie, Hue

MapReduce, HBase, ZooKeeper, Mahout, Hive, Pig, Flume, DB, Avro, Oozie, Hue

SLIDE 56

HDFS

2012 – YARN, Impala, Parquet

MapReduce, HBase, ZooKeeper, Mahout, Hive, Pig, Flume, DB, Avro, Oozie, Hue, Parquet, Impala, YARN

SLIDE 57

HDFS

2013 – Solr, Sentry

MapReduce, HBase, ZooKeeper, Mahout, Hive, Pig, Flume, DB, Avro, Oozie, Hue, Parquet, Impala, Solr, YARN, Sentry

SLIDE 58

HDFS

2014 – Spark, Kafka

MapReduce, HBase, ZooKeeper, Mahout, Hive, Pig, Flume, DB, Avro, Oozie, Hue, Parquet, Impala, Solr, YARN, Sentry, Spark, Kafka

SLIDE 59

The Hadoop Ecosystem – Explained!

  • Distributed File System (Scalable Storage)
  • Batch Processing
  • Interactive SQL
  • Proc. Oriented Query
  • SQL
  • Event-based data ingest
  • KeyValue Store
  • Free-Text Search
  • Real Time Processing
  • Machine Learning
  • Process Mgmt
  • Workflow Mgmt
  • GUI
  • Resource Management and Scheduling
  • Access Control
  • DB

SLIDE 60

Common Use Cases

  • Threat detection
  • Active archive / accessible global knowledge base
  • Data accuracy
  • Streamlined cross-data type aggregation
  • Richer customer profiling / ecommerce experience
  • Interactive market segmenting / customer identification

  • Expedited data modeling
  • ….
SLIDE 61

The Right Tool For the Right Task

Tool    | Workload    | Use Case                                                  | Result Ordering
Hive    | Batch       | SQL, analytics & joins                                    | Structured
Pig     | Batch       | Proc. oriented SQL, analytics & joins                     | Structured
Impala  | Interactive | SQL, analytics & joins                                    | Structured
Solr    | Interactive | Fuzzy, phonetic, polygon, geo-spatial                     | Relevance-based
HBase   | Real Time   | Random key-lookups over sparsely populated columnar data  | Scan-order
Spark   | NRT         | Advanced analytics & ML                                   | Sorted

SLIDE 62

When to use what?

  • Real Time Query (e.g. Impala)
      • I want to do BI reports or interactive analytical aggregations but not wait hours for the response
  • Batch Query (e.g. Pig, Hive)
      • I have nightly batch query jobs as part of a workflow
  • Real Time Search (e.g. SolrCloud)
      • I have unstructured data I want to run free-text search over
      • My SQL queries are getting more and more complex as they need to contain 15+ “like” conditions
  • Real time key lookups (e.g. HBase)
      • I want random access to sparsely populated table-like data
      • I want to compare user profiles or behavior in real time
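The "15+ like conditions" smell can be made concrete: a chain of SQL LIKEs is a full scan with substring tests per row, while a search engine consults a prebuilt term index. The rows and tokenizer below are made up for illustration:

```python
rows = ["blue suede shoes", "red running shoes", "blue jacket"]

# SQL-style: scan every row, apply every LIKE-style substring test.
def like_scan(rows, terms):
    return [r for r in rows if all(t in r for t in terms)]

# Search-style: build an inverted index once, then intersect posting sets.
index = {}
for i, r in enumerate(rows):
    for token in r.split():
        index.setdefault(token, set()).add(i)

def index_lookup(terms):
    ids = set.intersection(*(index.get(t, set()) for t in terms))
    return [rows[i] for i in sorted(ids)]

print(like_scan(rows, ["blue", "shoes"]))  # ['blue suede shoes']
print(index_lookup(["blue", "shoes"]))     # same answer, without scanning every row
```

The scan's cost grows with rows × terms; the index lookup touches only the posting sets for the query terms, which is why piles of LIKE clauses are a hint to move that workload to Solr.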
SLIDE 63

When to use what?

  • Spark
      • I want to implement analytics algorithms over my data, and my data sets fit into memory
      • I have real time streaming data I want to analyze in real time
  • MapReduce
      • I want to do fail-safe large ETL processing workloads
      • My data does not fit into memory and I want to batch process it with my custom logic – no real time needs
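The "custom batch logic over data that doesn't fit in memory" model is the map → shuffle → reduce pattern. The canonical word count, sketched in-process (in a real job each phase runs distributed over HDFS blocks):

```python
from collections import defaultdict

records = ["big data", "big elephant", "data lake"]

# Map: emit (key, value) pairs for each input record.
mapped = [(word, 1) for rec in records for word in rec.split()]

# Shuffle: group all values by key (the framework does this between phases).
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce: fold each group's values into a final result per key.
counts = {key: sum(values) for key, values in groups.items()}
print(counts["big"], counts["data"])  # 2 2
```

Because each phase only ever streams over its inputs, none of the three steps needs the whole data set in memory at once, which is the property the bullet above is pointing at.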

SLIDE 64