Building the Brickhouse - Jerome Banks



SLIDE 1

Confidential

Building the Brickhouse Jerome Banks

SLIDE 2

Overview of Brickhouse

  • Custom UDFs for Hadoop Hive
  • Provides “missing pieces”
  • Tools to build scalable/robust data pipelines
  • Increases data engineers’ productivity
  • Supports MR Design Patterns/Best Practices
SLIDE 3

History

Spring 2012 - Maxwell Project - New Klout Score

  • Needed to generate large number of features
  • Lots of exploratory Data Science needed
  • Needed to move to production fairly quickly
  • Legacy score was traditional Hadoop Mappers/Reducers
    ○ Hard to develop
    ○ Hard to re-use code

SLIDE 4

History

Spring 2012 - Maxwell Project - New Klout Score

Solution !!! Implement in Hive !!!

  • Proven technology
  • Able to prototype quickly
  • Semantics fairly well-understood

But !!!

  • Functionality was missing in Hive
  • Naive queries were inefficient
    ○ Generated too many MR steps
    ○ Multiple passes of the same data
    ○ Attempting to “sort the world”

SLIDE 5

History

Spring 2012 - Maxwell Project - New Klout Score

Warehouse developed internally at Klout

  • Maxwell Score
  • Klout for Business
  • Topic Thunder

Early 2013 - Open-sourced as “Brickhouse”

  • Spread to other Hadoop Hive shops
  • Expanded functionality and code quality
  • 2014 - Sponsorship by Tagged
SLIDE 6

Functionality across broad areas

  • collect
  • json
  • sketch_set
  • distributed_cache
  • hbase
  • timeseries
  • bloom
  • hll
SLIDE 7

Array/Map operations

  • collect
  • collect_max
  • cast_array
  • map_key_values
  • map_filter_keys
  • join_array
  • map_union
  • truncate_array
SLIDE 8

collect

Opposite of UDTF
Avoid “self-join” anti-pattern

Self-join version:

    select a.id, a.value as a_val, b.value as b_val
    from ( select * from mytable where type='A' ) a
    join ( select * from mytable where type='B' ) b
    on ( a.id = b.id );

With collect:

    select col_map['A'] as a_val, col_map['B'] as b_val
    from ( select id, collect( type, value ) as col_map
           from mytable
           group by id ) cm;
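To make the semantics concrete, here is a minimal Python sketch of what the collect query computes: a single pass that groups rows by id and builds a type-to-value map, which is then read by key. It illustrates the idea only and is not Brickhouse's implementation; the sample rows and the helper name `collect_by_id` are hypothetical.

```python
from collections import defaultdict


def collect_by_id(rows):
    """Group (id, type, value) rows into {id: {type: value}} in one pass."""
    col_maps = defaultdict(dict)
    for row_id, row_type, value in rows:
        # Assumption: if a (id, type) pair repeats, the last value wins.
        col_maps[row_id][row_type] = value
    return dict(col_maps)


# Hypothetical sample of `mytable` rows: (id, type, value)
rows = [(1, "A", 10), (1, "B", 20), (2, "A", 30)]

col_maps = collect_by_id(rows)

# Equivalent of: select col_map['A'] as a_val, col_map['B'] as b_val
pivoted = {row_id: (m.get("A"), m.get("B")) for row_id, m in col_maps.items()}
```

One behavioral difference worth noting: an inner self-join drops ids that lack either type, whereas the map lookup keeps them and yields a missing value for the absent key.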

SLIDE 9

collect_max

Similar to collect, but returns a map with only the top N values (20 in this example)

    select ks_uid, combined_score
    from maxwell_score
    order by combined_score desc
    limit 20;

    select collect_max( ks_uid, combined_score, 20 )
    from maxwell_score;
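The reason a top-N aggregate can avoid a global sort is that it only ever needs to remember N entries. A rough Python sketch of that idea, using a bounded min-heap, follows. It is an illustration under assumptions (each key appears once, ties are broken by key), not the Brickhouse code, and the sample scores are made up.

```python
import heapq


def collect_max(pairs, n=20):
    """Return a {key: value} map holding the n largest values seen."""
    heap = []  # min-heap of (value, key); heap[0] is the smallest kept entry
    for key, value in pairs:
        if len(heap) < n:
            heapq.heappush(heap, (value, key))
        elif (value, key) > heap[0]:
            # New entry beats the current minimum: swap it in.
            heapq.heapreplace(heap, (value, key))
    # Largest first; dicts preserve insertion order.
    return {key: value for value, key in sorted(heap, reverse=True)}


# Hypothetical (ks_uid, combined_score) rows
scores = [("u1", 55.0), ("u2", 91.5), ("u3", 12.0), ("u4", 78.3)]
top2 = collect_max(scores, n=2)
```

Memory stays proportional to n regardless of how many rows stream through, which is what lets such an aggregate run inside ordinary group-by processing.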

SLIDE 10

to_json,from_json

Serialize to and from JSON
Avoid ugly, error-prone string concats
Guaranteed valid JSON output

    select concat("{\"kscore\":", kscore, ",\"moving_avg\":", avg,
                  ",\"start_date\":", start, ",\"end_date\":", end, "}")
    from mytable;

    select to_json( named_struct("kscore", kscore, "moving_avg", avg,
                                 "start_date", start, "end_date", end) )
    from mytable;
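The hazard with hand-concatenated JSON is easy to reproduce outside Hive. The Python sketch below mirrors the concat query with made-up row values: the date strings end up unquoted, so the result does not parse, while a real serializer produces valid output. Python's `json` module stands in for `to_json` here purely for illustration.

```python
import json

# Hypothetical row values
row = {
    "kscore": 63.2,
    "moving_avg": 61.9,
    "start_date": "2014-03-01",
    "end_date": "2014-03-23",
}

# Mirrors the concat() query: string values are pasted in without quotes.
concatenated = (
    '{"kscore":' + str(row["kscore"])
    + ',"moving_avg":' + str(row["moving_avg"])
    + ',"start_date":' + row["start_date"]
    + ',"end_date":' + row["end_date"]
    + "}"
)


def is_valid_json(text):
    """Return True if `text` parses as JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


# A serializer handles quoting and escaping for every value type.
serialized = json.dumps(row)
```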

SLIDE 11

to_json,from_json

Serialize to and from JSON
Parse arbitrarily complex schema

    create view parse_json as
    select ks_uid,
           from_json( json,
                      named_struct("kscore", 0.0,
                                   "moving_avg", array(0.0),
                                   "start_date", "",
                                   "end_date", "") )
    from json_table;
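In the query above, the second argument to from_json acts as a template: its values are placeholders whose types describe the expected schema. A small Python sketch of template-driven parsing is shown below. How Brickhouse treats missing or extra fields is not stated on the slide, so the choices here (missing fields become None, extra fields are dropped) are assumptions, as are the function names.

```python
import json


def coerce(value, template):
    """Recursively shape `value` to match the types in `template`."""
    if value is None:
        return None  # assumption: missing or null fields stay null
    if isinstance(template, dict):
        # Struct: keep only the fields named in the template.
        return {name: coerce(value.get(name), sub) for name, sub in template.items()}
    if isinstance(template, list):
        # Array: the first template element gives the element type.
        return [coerce(item, template[0]) for item in value]
    # Primitive: cast to the template value's type (float, str, ...).
    return type(template)(value)


def from_json(text, template):
    """Parse JSON text and shape it according to a template value."""
    return coerce(json.loads(text), template)


template = {"kscore": 0.0, "moving_avg": [0.0], "start_date": "", "end_date": ""}
text = '{"kscore": 63, "moving_avg": [61, 62.5], "start_date": "2014-03-01", "extra": true}'
parsed = from_json(text, template)
```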

SLIDE 12

sketch_set

KMV (K-minimum values) sketch implementation
Estimate the number of uniques in large sets with a fixed amount of space.

    select count(distinct ks_uid) as reach
    from actor_action
    where some_condition() = true;

    select estimated_reach( sketch_set(ks_uid) )
    from actor_action
    where some_condition() = true;

SLIDE 13

sketch_set

Easy to do set unions.
Can aggregate incremental results.

    insert overwrite table daily_sketch partition(dt='20140323')
    select sketch_set(ks_uid) ss
    from actor_action;

    select estimated_reach( union_sketch(ss) )
    from daily_sketch
    where dt >= days_add(today(), -30);
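Incremental aggregation works because KMV sketches merge losslessly: the k smallest hashes of a union are always found among the k smallest hashes of each part. The Python sketch below demonstrates that property with a deliberately simple, non-streaming builder. The hash function, k, and the sample data are assumptions for illustration, not Brickhouse internals.

```python
import hashlib

K = 256  # illustrative sketch size


def hash01(item):
    """Map an item to a pseudo-uniform float in [0, 1)."""
    digest = hashlib.md5(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def sketch_set(items, k=K):
    """Simple builder: the k smallest distinct hashes, sorted."""
    return sorted({hash01(item) for item in items})[:k]


def union_sketch(sketches, k=K):
    """Merge sketches by pooling their hashes and keeping the k smallest."""
    merged = set()
    for sketch in sketches:
        merged.update(sketch)
    return sorted(merged)[:k]


# Two hypothetical daily partitions with overlapping users
day1 = [f"user{i}" for i in range(0, 3000)]
day2 = [f"user{i}" for i in range(2000, 5000)]

daily = [sketch_set(day1), sketch_set(day2)]
merged = union_sketch(daily)

# Merging the daily sketches gives the same sketch as rescanning all raw rows.
rescanned = sketch_set(day1 + day2)
```

This is why a 30-day reach can be computed from 30 small stored sketches instead of rescanning 30 days of raw events.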

SLIDE 14

distributed_map

Uses distributed-cache to access values in-memory.
Avoids join/resort of large datasets.

    select bt.ks_uid, bt.my_value
    from big_table bt
    join ( select * from celeb where is_celeb=true ) cb
    on ( bt.ks_uid = cb.ks_uid );

    add file 'celeb_map';
    select *
    from big_table bt
    where distributed_map( ks_uid, 'celeb_map' ) is not null;
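The underlying pattern is a map-side lookup: each worker loads the small dataset into memory once and filters the large dataset with a hash lookup, so the large table never has to be shuffled or sorted for a join. A minimal Python sketch of that pattern follows. The tab-separated file layout and the helper names are assumptions made for the example, not a description of how Brickhouse reads the cached file.

```python
import io


def load_map(fileobj):
    """Load 'key<TAB>value' lines into an in-memory dict (assumed format)."""
    lookup = {}
    for line in fileobj:
        key, _, value = line.rstrip("\n").partition("\t")
        if key:
            lookup[key] = value
    return lookup


def map_side_filter(big_rows, lookup):
    """Yield rows whose key is present in the lookup map; no sort or shuffle."""
    for ks_uid, my_value in big_rows:
        if lookup.get(ks_uid) is not None:
            yield ks_uid, my_value


# Stand-in for the small cached file
celeb_file = io.StringIO("u2\ttrue\nu7\ttrue\n")
celeb_map = load_map(celeb_file)

# Stand-in for the large table, streamed row by row
big_table = [("u1", 5), ("u2", 9), ("u7", 3), ("u9", 1)]
matches = list(map_side_filter(big_table, celeb_map))
```

The approach only pays off when the lookup data is small enough to fit comfortably in each worker's memory.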

SLIDE 15

Future Roadmap

  • Continued support/maintenance/cleanup
  • More streaming UDFs
    ○ Top K
    ○ Representative sample
  • More “Big Data Science-ey” UDFs
    ○ Machine Learning
    ○ Bag-O’-Words UDFs
    ○ Text Analysis, NLP

  • Ideas ??? Contributions ???
SLIDE 16

Thank you!

http://github.com/klout/brickhouse
http://brickhouseconfessions.wordpress.com