

SLIDE 1

CC 2.0 by William Brawley | http://flic.kr/p/7PdUP3

SLIDE 2

Agenda

  • Why Hadoop and HBase?
  • Social Media Monitoring
  • Prospective Search and Coprocessors
  • Challenges & Lessons Learned
  • Resources to get started

August 31, 2012

SLIDE 3

About Sentric

  • Spin-off of MeMo News AG, the leading provider of Social Media Monitoring & Analytics in Switzerland
  • Big Data expert, focused on Hadoop, HBase and Solr
  • Objective: transforming data into insights

SLIDE 4

CC 2.0 by Editor B | http://flic.kr/p/bcU5aD1

SLIDE 5

Social Media Monitoring Process

Why Hadoop and HBase?

  • Information Gathering
  • Information Processing
  • Analysis & Interpretation
  • Insight Presentation

SLIDE 6

Requirements

SMM (Social Media Monitoring)

  • Cost-effective
  • Highly scalable
  • Real-time (RT) alerting
  • Analytical capabilities
  • Reliable
  • Freshness

SLIDE 7

Hadoop

  • HDFS + MapReduce
  • Based on Google Papers
  • Distributed Storage and Computation Framework
  • Affordable Hardware, Free Software
  • Significant Adoption

SLIDE 8

HBase

  • Non-Relational, Distributed Database
  • Column-Oriented
  • Multi-Dimensional
  • High Availability
  • High Performance
  • Built on top of HDFS as storage layer
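The "Column-Oriented" and "Multi-Dimensional" bullets describe HBase's logical data model, which is commonly explained as a sorted, nested map: row key, then column family, then column qualifier, then timestamp, then value. The Python sketch below only imitates that model in memory. `ToyTable` and its methods are made-up names for illustration, not HBase APIs, and the sketch ignores storage, regions and distribution entirely.

```python
class ToyTable:
    """In-memory imitation of HBase's logical data model (not an HBase API)."""

    def __init__(self, families):
        # Column families are fixed when the table is defined.
        self.families = set(families)
        # row key -> (family, qualifier) -> {timestamp: value}
        self.rows = {}

    def put(self, row, family, qualifier, value, ts):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        cells = self.rows.setdefault(row, {}).setdefault((family, qualifier), {})
        cells[ts] = value  # each timestamp is one version of the cell

    def get(self, row, family, qualifier, max_versions=1):
        """Return up to max_versions (timestamp, value) pairs, newest first."""
        cells = self.rows.get(row, {}).get((family, qualifier), {})
        return sorted(cells.items(), reverse=True)[:max_versions]

    def scan(self, start_row, stop_row):
        """Return row keys in sorted order; start inclusive, stop exclusive."""
        return [key for key in sorted(self.rows) if start_row <= key < stop_row]
```

Writing the same cell twice with different timestamps keeps both versions, and `get` returns the newest one unless more versions are requested. `scan` shows why the sort order of row keys matters: a key range maps to one contiguous slice of the table.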

SLIDE 9

Technology Stack

  • HBase / HDFS: Storage
  • Hadoop, Mahout: Analytics
  • Solr: Search
  • HBase RowLog: Event mechanism (MQ)
  • Prospective search: Real-time alerting

SLIDE 10

CC 2.0 by nolifebeforecoffee | http://flic.kr/p/c1UTf

SLIDE 11

Overview

Social Media Monitoring

[Diagram: Search Agents, Downloaded Articles, "match?", Output: RT Alerts, Reports, Web-UI]

Icons by http://dryicons.com

SLIDE 12

Solution Architecture

[Diagram: n News Agents, REST, HBase, Coprocessor, MySQL, Solr, Web-UI, RT Alerts]

SLIDE 13

Prospective Search with Coprocessors

[Diagram: Put operations, HRegionServer, HRegion, Processing, Prospective Search, RT Alerts]
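Prospective search inverts the usual search flow: the queries (the "search agents") are stored, and every incoming document is matched against them as it is written. The slides place this matching in an HBase coprocessor that sees Put operations on the region server. The Python sketch below imitates only that control flow. `SearchAgent`, `ToyRegion` and `post_put` are hypothetical names, real coprocessors are Java classes loaded into the region server, and the "all terms must occur" matching rule is a simplifying assumption rather than the authors' method.

```python
import re
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SearchAgent:
    """A stored query; here simply a set of terms that must all occur."""
    name: str
    terms: frozenset


def tokenize(text):
    """Lower-case the text and split it into a set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


@dataclass
class ToyRegion:
    """Stands in for a region that runs a hook after each write."""
    agents: list
    store: dict = field(default_factory=dict)
    alerts: list = field(default_factory=list)

    def put(self, row_key, text):
        self.store[row_key] = text
        self.post_put(row_key, text)  # hook runs on the write path

    def post_put(self, row_key, text):
        tokens = tokenize(text)
        for agent in self.agents:
            # Subset test: every term of the agent occurs in the document.
            if agent.terms <= tokens:
                self.alerts.append((agent.name, row_key))
```

Because the hook runs inside the write path, its cost is paid on every Put. In this sketch the cost grows linearly with the number of stored agents, which is one reason to keep such hooks cheap.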

SLIDE 14

Key Figures

  • Monthly growth
  • Index: 200 GB
  • 50 million docs/month
  • HBase: 600 GB
  • Raw data, metadata and extracted data
  • A few thousand MapReduce jobs/month
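A rough back-of-the-envelope reading of these figures, as a sketch under several assumptions that the slide does not confirm: that the index and HBase sizes both describe monthly growth for the 50 million documents, that GB means decimal gigabytes, that the 600 GB is measured before HDFS replication, and that the cluster uses the HDFS default replication factor of 3.

```python
GB = 10**9  # decimal gigabytes (assumption)
TB = 10**12

docs_per_month = 50_000_000
index_growth_bytes = 200 * GB   # assumed to be monthly growth
hbase_growth_bytes = 600 * GB   # assumed to be monthly growth, pre-replication
hdfs_replication = 3            # HDFS default; the deck does not state the value used


def bytes_per_doc(total_bytes, docs):
    """Average number of bytes stored per document."""
    return total_bytes / docs


index_per_doc = bytes_per_doc(index_growth_bytes, docs_per_month)
hbase_per_doc = bytes_per_doc(hbase_growth_bytes, docs_per_month)
yearly_raw_tb = 12 * hbase_growth_bytes * hdfs_replication / TB

print(f"index: {index_per_doc / 1000:.1f} kB per document")
print(f"HBase: {hbase_per_doc / 1000:.1f} kB per document")
print(f"HBase on disk after one year: {yearly_raw_tb:.1f} TB")
```

Under these assumptions the averages come to about 4 kB of index and 12 kB of HBase data per document, and roughly 21.6 TB of replicated HBase data per year before compression.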

SLIDE 15

CC 2.0 by saebaryo | http://flic.kr/p/5T4t5L

SLIDE 16

1 Benchmarks - workloads
2 Supervision
3 Keys and shards – Schema design /LG
4 Timestamps, the 4th dimension
5 Short ColumnFamily names
6 File handles, OS
7 JVM Tuning, GC !!!
8 Scaling region servers, data locality!
9 Automatic vs manual splits, compaction
10 Do not expect HBase to be rock solid in production
11 Forget "fire brigade" emergency actions, it takes some time
12 Use HBase for an appropriate use case
13 Tune and tweak – it's not a project – it's a process
14 You need DevOps in production
15 Huge know-how curve, you need to know the whole ecosystem
16 Use a distribution: it's packaged, tested and supports migration, enterprise grade
17 Virtualization, hardware
18 Don't struggle too much, there is a good community
19 Share your knowledge
20 It's early stage, many tools around, a few still missing

Challenges & Lessons Learned

SLIDE 17

Challenges

  • Everyone is still learning
  • Some issues only appear at scale
  • At scale, nothing works as advertised
  • Production cluster configuration
  • Hardware issues
  • Tuning cluster configuration to our workloads
  • HBase stability
  • Monitoring health of HBase

SLIDE 18

Lessons - General

  • Do not rely on HBase as frontend storage layer. It's not going to be rock solid
  • Don't struggle too much, there is a good community
  • Share your knowledge
  • It's early stage, many tools around, a few still missing

SLIDE 19

Lessons - Planning

  • Use HBase for an appropriate use case
  • Use a distribution: it's packaged, tested and supports migration, enterprise grade
  • Benchmarks – know your workloads & query patterns
      • YCSB
  • Schema & Key Design
      • What's queried together should be stored together
  • Scaling region servers, data locality!
  • Virtualization vs. Real Hardware
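"What's queried together should be stored together" follows from how HBase orders data: rows are sorted by the bytes of their row key, so rows that share a key prefix sit next to each other and can be read with one range scan. The Python sketch below builds a composite key for that purpose. The layout (agent id, reversed timestamp, document id), the field widths and the function names are hypothetical and not taken from the deck.

```python
import struct

MAX_TS = 2**63 - 1  # largest signed 64-bit value


def make_row_key(agent_id: int, ts_millis: int, doc_id: int) -> bytes:
    """Build agent_id (4 bytes) | reversed timestamp (8 bytes) | doc_id (8 bytes).

    Big-endian unsigned fields compare bytewise in numeric order, and
    MAX_TS - ts_millis makes newer entries sort before older ones.
    """
    return struct.pack(">IQQ", agent_id, MAX_TS - ts_millis, doc_id)


def prefix_range(agent_id: int):
    """Return start (inclusive) and stop (exclusive) keys covering one agent.

    Assumes agent_id is below 2**32 - 1 so that agent_id + 1 still fits.
    """
    return struct.pack(">I", agent_id), struct.pack(">I", agent_id + 1)


def doc_id_of(row_key: bytes) -> int:
    """Extract the document id from a composite row key."""
    return struct.unpack(">IQQ", row_key)[2]
```

With this layout, all matches of one agent form a contiguous key range with the most recent match first. A different dominant query, for example "all agents that matched one document", would call for a different key order, which is why the workload has to be known before the schema is fixed.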

SLIDE 20

Lessons - Performance Tuning

  • Number of column families (CF) < 10
  • Compaction + flushing are I/O intensive
  • Short ColumnFamily names
  • HFile index size occupying RAM (storefileindexSize)
  • OS file handles
      • ulimit -n 32768
  • JVM Tuning, GC !!!
      • HMaster 1024 MB
      • RegionServer 8192 MB
      • -XX:+UseConcMarkSweepGC
      • -XX:+CMSIncrementalMode
  • Automatic vs. manual splits
  • Be careful with expensive operations in coprocessors
  • Play with all the configurations and benchmark for tuning
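The file-handle item on this slide recommends an open-file limit of 32768. As a minimal sketch, assuming a Unix-like host where Python's standard `resource` module is available, the limit can be checked as shown below. `nofile_limit_ok` is a hypothetical helper name. It reports the limit of the Python process itself, so it is only meaningful when run as the user and in the environment that starts the HBase daemons.

```python
import resource

RECOMMENDED_NOFILE = 32768  # value given on the slide


def nofile_limit_ok(required=RECOMMENDED_NOFILE):
    """Return (ok, soft, hard) for the open-file limit of this process."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # RLIM_INFINITY means "no limit" and always satisfies the requirement.
    ok = soft == resource.RLIM_INFINITY or soft >= required
    return ok, soft, hard


if __name__ == "__main__":
    ok, soft, hard = nofile_limit_ok()
    print(f"soft={soft} hard={hard} ok={ok}")
```

A check like this fits into a pre-start script or a monitoring probe, so that a limit that was silently reset is noticed before the region server runs out of file handles.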

SLIDE 21

Lessons - Operation

  • Monitoring/Operational tooling is most important
  • Forget "emergency actions", it takes some time
  • Tune and tweak – it's not a project – it's a process
  • You need DevOps in production
  • Huge know-how curve, you need to know the whole ecosystem
      • Hadoop, HDFS, MapReduce

SLIDE 22

Resources to get started

  • http://hbase.apache.org/book.html
  • http://www.sentric.ch/blog/best-practice-why-monitoring-hbase-is-important
  • http://www.sentric.ch/blog/hadoop-overview-of-top-3-distributions
  • http://www.sentric.ch/blog/hadoop-best-practice-cluster-checklist
  • http://outerthought.org/blog/465-ot.html

SLIDE 23

Thank you!

Questions?

Christian Gügi, christian.guegi@sentric.ch
Jean-Pierre König, jean-pierre.koenig@sentric.ch

NoSQL Roadshow Basel

August 31, 2012

SLIDE 24

Cluster

Masters

SLIDE 25

Cluster

Worker