
Building the Brickhouse - Jerome Banks



  1. Building the Brickhouse Jerome Banks Confidential

  2. Overview of Brickhouse
     ● Custom UDFs for Hadoop Hive
     ● Provides “missing pieces”
     ● Tools to build scalable/robust data pipelines
     ● Increases data engineers’ productivity
     ● Supports MR design patterns/best practices

  3. History
     Spring 2012 - Maxwell Project - New Klout Score
     ● Needed to generate a large number of features
     ● Lots of exploratory data science needed
     ● Needed to move to production fairly quickly
     ● Legacy score was traditional Hadoop Mappers/Reducers
       ○ Hard to develop
       ○ Hard to re-use code

  4. History
     Spring 2012 - Maxwell Project - New Klout Score
     Solution !!! Implement in Hive !!!
     ● Proven technology
     ● Able to prototype quickly
     ● Semantics fairly well-understood
     But !!!
     ● Functionality was missing in Hive
     ● Naive queries were inefficient
       ○ Generated too many MR steps
       ○ Multiple passes of the same data
       ○ Attempting to “sort the world”

  5. History
     Spring 2012 - Maxwell Project - New Klout Score
     Warehouse developed internally at Klout
     ● Maxwell Score
     ● Klout for Business
     ● Topic Thunder
     Early 2013 - Open-sourced as “Brickhouse”
     ● Spread to other Hadoop Hive shops
     ● Expanded functionality and code quality
     ● 2014 - Sponsorship by Tagged

  6. Functionality across broad areas
     ● collect
     ● json
     ● sketch_set
     ● distributed_cache
     ● hbase
     ● timeseries
     ● bloom
     ● hll

  7. Array/Map operations
     ● collect
     ● collect_max
     ● cast_array
     ● map_key_values
     ● map_filter_keys
     ● join_array
     ● map_union
     ● truncate_array

  8. collect
     Opposite of UDTF
     Avoid “self-join” anti-pattern

     Self-join:

         select a.id,
                a.value as a_val,
                b.value as b_val
         from
           ( select * from mytable
             where type='A' ) a
         join
           ( select * from mytable
             where type='B' ) b
         on ( a.id = b.id );

     With collect:

         select col_map['A'] as a_val,
                col_map['B'] as b_val
         from (
           select id,
                  collect( type, value ) as col_map
           from mytable
           group by id ) cm;
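To make the semantics of the collect query concrete, here is a minimal Python sketch of the same idea: one pass over the rows builds a type-to-value map per id, so no self-join is needed. It only illustrates the grouping logic and is not Brickhouse's implementation; the function name `collect_by_id` and the sample rows are hypothetical.

```python
from collections import defaultdict

def collect_by_id(rows):
    """Group (id, type, value) rows into {id: {type: value}} in a single pass."""
    col_maps = defaultdict(dict)
    for row_id, row_type, value in rows:
        col_maps[row_id][row_type] = value
    return dict(col_maps)

# Hypothetical sample data standing in for mytable.
rows = [(1, "A", 10), (1, "B", 20), (2, "A", 30), (2, "B", 40)]
col_maps = collect_by_id(rows)

# Equivalent of selecting col_map['A'] and col_map['B'] for each id.
pivoted = [(m.get("A"), m.get("B")) for m in col_maps.values()]
```

The table is read once, whereas the self-join reads it twice and then shuffles both sides on id.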

  9. collect_max
     Similar to collect, but returns a map with the top 20 values

     Sort and limit:

         select ks_uid,
                combined_score
         from maxwell_score
         order by combined_score desc
         limit 20;

     With collect_max:

         select collect_max(
                  ks_uid,
                  combined_score,
                  20 )
         from maxwell_score;
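The reason a top-N aggregate can avoid a global sort can be sketched in Python with a bounded min-heap: only the N best entries seen so far are kept in memory. This is an illustration of the technique under the assumption that keys are unique, not Brickhouse's code; the local function `collect_max` and the sample scores are hypothetical.

```python
import heapq

def collect_max(pairs, n):
    """Keep the n highest-scoring (key, score) pairs in one pass; return {key: score}."""
    heap = []  # min-heap of (score, key); heap[0] is the weakest retained entry
    for key, score in pairs:
        if len(heap) < n:
            heapq.heappush(heap, (score, key))
        elif (score, key) > heap[0]:
            # The new entry beats the weakest retained one, so swap them.
            heapq.heapreplace(heap, (score, key))
    return {key: score for score, key in heap}

# Hypothetical sample data standing in for maxwell_score.
scores = [("u1", 55.0), ("u2", 91.5), ("u3", 72.3), ("u4", 12.0), ("u5", 88.8)]
top3 = collect_max(scores, 3)
```

Memory stays proportional to n regardless of how many rows are scanned, which is what lets the aggregate avoid sorting the whole table.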

  10. to_json, from_json
     Serialize to and from JSON
     Avoid ugly, error-prone string concats
     Guaranteed valid JSON output

     String concatenation:

         select concat("{\"kscore\":", kscore,
                       ",\"moving_avg\":", avg,
                       ",\"start_date\":", start,
                       ",\"end_date\":", end, "}")
         from mytable;

     With to_json:

         select to_json(
                  named_struct("kscore", kscore,
                               "moving_avg", avg,
                               "start_date", start,
                               "end_date", end) )
         from mytable;
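Why hand-built concatenation is error-prone can be shown with a small Python analogue. It assumes the date columns hold strings (the slide does not state their types) and uses hypothetical sample values; `json.dumps` stands in for a structure-aware serializer such as to_json.

```python
import json

# Hypothetical row; the date columns are assumed to be strings.
row = {"kscore": 61.2, "moving_avg": 60.7,
       "start_date": "2014-03-01", "end_date": "2014-03-23"}

# Mirrors the concat query: the date values are spliced in without quotes.
concatenated = (
    '{"kscore":' + str(row["kscore"])
    + ',"moving_avg":' + str(row["moving_avg"])
    + ',"start_date":' + row["start_date"]
    + ',"end_date":' + row["end_date"] + "}"
)

def is_valid_json(text):
    """Return True if text parses as JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

# A serializer that knows the structure quotes and escapes values itself.
serialized = json.dumps(row)
```

Here `concatenated` fails to parse because the dates are unquoted, while `serialized` round-trips back to the original row.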

  11. to_json, from_json
     Serialize to and from JSON
     Parse arbitrarily complex schema

         create view parse_json as
         select ks_uid,
                from_json( json,
                  named_struct(
                    "kscore", 0.0,
                    "moving_avg", array(0.0),
                    "start_date", "",
                    "end_date", "") )
         from json_table;
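The query above passes a template value whose shape describes the expected schema. A rough Python sketch of template-driven parsing follows. How Brickhouse treats missing or extra fields is not stated on the slide, so the behaviour here (missing fields become None, extra fields are dropped) is an assumption, and `conform` and this local `from_json` are hypothetical names.

```python
import json

def conform(value, template):
    """Coerce a parsed JSON value to the shape and types of template."""
    if value is None:
        return None  # assumption: absent fields come back as None
    if isinstance(template, dict):
        # Keep only the fields named in the template.
        return {key: conform(value.get(key), sub) for key, sub in template.items()}
    if isinstance(template, list):
        # The first element of the template describes every array element.
        return [conform(element, template[0]) for element in value]
    return type(template)(value)  # scalar: cast to the template's type

def from_json(text, template):
    """Parse text and shape the result like template."""
    return conform(json.loads(text), template)

template = {"kscore": 0.0, "moving_avg": [0.0], "start_date": "", "end_date": ""}
parsed = from_json(
    '{"kscore": 61, "moving_avg": [60, 60.5], '
    '"start_date": "2014-03-01", "end_date": "2014-03-23", "extra": 1}',
    template,
)
```

The integer 61 comes back as a float because the template's kscore is 0.0, which mirrors how the named_struct in the query fixes the column types.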

  12. sketch_set
     KMV (K-min value) sketch implementation
     Estimate the number of uniques in large sets with a fixed amount of space.

     Exact count:

         select count(distinct ks_uid)
         from actor_action
         where some_condition() = true;

     With sketch_set:

         select estimated_reach(
                  sketch_set(ks_uid)) as reach
         from actor_action
         where some_condition() = true;
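The KMV idea itself is compact enough to sketch in Python: hash every item, keep only the k smallest distinct hashes, and estimate cardinality from how small the k-th hash is. This is a generic illustration, not Brickhouse's implementation; the MD5-based `hash64`, the default k of 1024, and the local function names are assumptions made for the example.

```python
import hashlib
import heapq

HASH_SPACE = 2 ** 64  # hash64 returns integers in [0, HASH_SPACE)

def hash64(item):
    """Map an item to a pseudo-uniform 64-bit integer (MD5 is an illustrative choice)."""
    digest = hashlib.md5(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def sketch_set(items, k=1024):
    """Return the k smallest distinct hashes of items, sorted ascending."""
    heap = []         # max-heap via negation: -heap[0] is the largest retained hash
    retained = set()  # retained hashes, used to skip duplicates
    for item in items:
        h = hash64(item)
        if h in retained:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -h)
            retained.add(h)
        elif h < -heap[0]:
            evicted = -heapq.heapreplace(heap, -h)
            retained.discard(evicted)
            retained.add(h)
    return sorted(retained)

def estimated_reach(sketch, k=1024):
    """Estimate the number of distinct items summarized by a KMV sketch."""
    if len(sketch) < k:
        return len(sketch)  # fewer than k distinct items: the count is exact
    kth_smallest = sketch[k - 1]
    return int((k - 1) * HASH_SPACE / kth_smallest)

# Hypothetical usage: 100,000 distinct user ids, each seen twice.
users = [f"user{i}" for i in range(100000)]
reach = estimated_reach(sketch_set(users + users))
```

Memory is bounded by k no matter how many rows are scanned. The relative error of this estimator shrinks roughly as one over the square root of k, so a larger k trades space for accuracy.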

  13. sketch_set
     Easy to do set unions.
     Can aggregate incremental results.

     Store a daily sketch:

         insert overwrite table daily_sketch
         partition(dt='20140323')
         select sketch_set(ks_uid) ss
         from actor_action;

     Union the last 30 days:

         select estimated_reach(
                  union_sketch(ss))
         from daily_sketch
         where dt>=days_add(today(),-30 );
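The property that makes incremental aggregation work is that merging KMV sketches gives exactly the sketch of the combined set. A minimal Python sketch of that property follows; it uses a simple non-streaming construction for brevity, and the hash choice, the constant K, and the sample data are assumptions rather than Brickhouse's actual code.

```python
import hashlib

K = 256               # sketch size (illustrative)
HASH_SPACE = 2 ** 64  # hash64 returns integers in [0, HASH_SPACE)

def hash64(item):
    """Map an item to a pseudo-uniform 64-bit integer (MD5 is an illustrative choice)."""
    digest = hashlib.md5(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def sketch_set(items):
    """K smallest distinct hashes, sorted (non-streaming construction for brevity)."""
    return sorted({hash64(item) for item in items})[:K]

def union_sketch(sketches):
    """Merge sketches: pool their hashes and keep the K smallest."""
    merged = set()
    for sketch in sketches:
        merged.update(sketch)
    return sorted(merged)[:K]

def estimated_reach(sketch):
    """Estimate the number of distinct items summarized by a KMV sketch."""
    if len(sketch) < K:
        return len(sketch)
    return int((K - 1) * HASH_SPACE / sketch[K - 1])

# Hypothetical daily user sets with overlap between days.
days = [
    [f"u{i}" for i in range(0, 1000)],
    [f"u{i}" for i in range(500, 1500)],
    [f"u{i}" for i in range(1000, 2000)],
]
daily_sketches = [sketch_set(day) for day in days]
merged = union_sketch(daily_sketches)
direct = sketch_set(set().union(*days))
```

`merged` and `direct` are identical, so the stored daily sketches can be combined later without rescanning the raw events.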

  14. distributed_map
     Uses distributed-cache to access values in-memory.
     Avoids join/resort of large datasets.

     Join:

         select bt.ks_uid,
                bt.my_value
         from big_table bt
         join
           ( select *
             from celeb
             where is_celeb=true ) cb
         on ( bt.ks_uid = cb.ks_uid );

     With distributed_map:

         add file 'celeb_map';

         select *
         from big_table bt
         where distributed_map(
                 ks_uid, 'celeb_map')
               is not null;
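The in-memory lookup pattern can be sketched in Python: the small table is loaded into a dictionary once, and the large table is streamed past it without sorting or shuffling. The tab-separated file layout, the helper names, and the sample rows are all hypothetical; the slide does not describe the format of celeb_map.

```python
def load_map(lines):
    """Parse 'key<TAB>value' lines into a dict (file format is an assumption)."""
    mapping = {}
    for line in lines:
        key, value = line.rstrip("\n").split("\t", 1)
        mapping[int(key)] = value
    return mapping

def distributed_map(key, mapping):
    """Look up key in the in-memory map; returns None when absent."""
    return mapping.get(key)

def filter_big_table(rows, mapping):
    """Stream rows, keeping those whose first column is present in mapping."""
    for row in rows:
        if distributed_map(row[0], mapping) is not None:
            yield row

# Hypothetical contents of the small lookup file and the large table.
celeb_map = load_map(["101\tcelebrity_a", "205\tcelebrity_b"])
big_table = [(101, "x"), (102, "y"), (205, "z"), (300, "w")]
matches = list(filter_big_table(big_table, celeb_map))
```

Each row of the large table costs one hash lookup, so the approach fits when the lookup table is small enough to hold in memory on every task.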

  15. Future Roadmap
     ● Continued support/maintenance/cleanup
     ● More streaming UDFs
       ○ Top K
       ○ Representative sample
     ● More “Big Data Science-ey” UDFs
       ○ Machine Learning
       ○ Bag-O’-Words UDFs
       ○ Text Analysis, NLP
     ● Ideas ??? Contributions ???

  16. Thank you!
     http://github.com/klout/brickhouse
     http://brickhouseconfessions.wordpress.com
