Building Multi-Model Big Data Platform for Real Estate Analytics - PowerPoint PPT Presentation

Building Multi-Model Big Data Platform for Real Estate Analytics
Karthik Karuppaiya
ApacheCon Big Data - 05/18/2017

Agenda
1. Company Overview
2. Overview of Ten-X Datasets
3. Ten-X Data Platform Architecture
4. Data-pipeline Implementation Overview
5. Platform Layers
6. Lessons Learned
7. What’s next?

Ten-X At a Glance

Who am I?
Karthik Karuppaiya
Sr. Engineering Manager, Data and Analytics
@karthikkrk
https://www.linkedin.com/in/karthikkrk/
kkaruppaiya@ten-x.com
WE ARE HIRING!

CUSTOMER CENTRICITY: BUILDING A 360 DEGREE VIEW
Bringing multiple data sets together to truly understand the customer
• BEHAVIORAL DATA: What are they doing on Ten-X? Where are they doing it?
• COMMUNICATIONS: What are all the interactions they have with us online/offline/phone?
• TRANSACTIONAL DATA: What are they buying or selling on Ten-X?
• PREFERENCES / DEMOGRAPHICS: How does this person prefer to work with us?
• RELATIONSHIP DATA: Who are their relationships with other entities and people?
• 3RD PARTY ACTIVITY DATA: What has this person done (e.g. bought/sold/own) outside of Ten-X?

Data Sets
Data Set | Typical Data Format
Behavioral Data (Instrumentation) | Files/API
Transactional Data (OLTP) | Kafka/RDBMS
Communications (AdWords/Marketo/SalesForce) | API
Preferences/Demographics (OLTP) | RDBMS
Relationship Data (OLTP/Third Party) | RDBMS, Files, API
3rd Party Activity Data (MLS, REIS, etc.) | Files, API, RDBMS
Documents and Objects (Pictures, Spreadsheets, PDFs, etc.) | Binary Files, etc.

Data Platform Challenges
• Support mass storage of both structured and unstructured data
• Support both batch and real-time streaming data
• Support Machine Learning across multiple data sets
• Enable discovery and modeling of complex relationships
• Join and resolve duplicates across a very large number of datasets

Platform Design Goals
• Private Data Center
• All Open Source Tools
• Extremely easy for Business teams to analyze data
• Multi-tenant: one platform for all lines of business and all teams
• Easily scalable
• Keep it Simple, Stupid

The Technologies That Power Our Data Platform!

Platform Design
(Architecture diagram. Components shown: Hue, Tableau, Spring Boot, Oozie, Ambari, Ambari Metrics, Atlas, Ranger, TinkerPop, Gremlin, Spark, Pig, Hive, Tez, MapReduce, JanusGraph, YARN, Ansible, ELK, Elasticsearch, Cassandra, HDFS, HBase)

Data Pipeline Architecture
Raw Layer -> Clean Layer -> Derived Layer -> Analysis Layer

Raw Layer
• All the data lands here.
• Data is never exposed from the Raw Layer.
• Data is mostly stored in its original format (mostly text).

Clean Layer
• All the cleansing rules are applied to the data.
    • Example: Standardize the Gender data to Male/Female/Unknown for all the records
    • Example: Standardize the Updated_Timestamp column on all the tables
• Data is optimized for storage and querying.
    • Mostly use ORC file format.
    • Create External Hive Tables for all the datasets.
• Platform users typically have access to the data in the Clean Layer for exploratory purposes.
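The two cleansing examples above can be sketched in plain Python. This is only an illustration of the idea, not the pipeline's actual code: the field names, the spelling map, the list of timestamp layouts, and the choice to treat timestamps as UTC are all assumptions.

```python
from datetime import datetime, timezone

# Hypothetical mapping from raw spellings to the standard values.
_GENDER_MAP = {"m": "Male", "male": "Male", "f": "Female", "female": "Female"}

# Hypothetical timestamp layouts that might appear in raw tables.
_TS_FORMATS = ("%Y-%m-%d %H:%M:%S", "%m/%d/%Y %H:%M", "%Y-%m-%dT%H:%M:%S")


def standardize_gender(raw):
    """Map a raw gender value to Male, Female, or Unknown."""
    if raw is None:
        return "Unknown"
    return _GENDER_MAP.get(str(raw).strip().lower(), "Unknown")


def standardize_timestamp(raw):
    """Return an ISO 8601 string (assumed UTC), or None if unparseable."""
    for fmt in _TS_FORMATS:
        try:
            parsed = datetime.strptime(raw.strip(), fmt)
        except (ValueError, AttributeError):
            # Wrong layout, or raw is not a string: try the next format.
            continue
        return parsed.replace(tzinfo=timezone.utc).isoformat()
    return None


def clean_record(record):
    """Apply both cleansing rules to one row, leaving other fields as-is."""
    cleaned = dict(record)
    cleaned["gender"] = standardize_gender(record.get("gender"))
    cleaned["updated_timestamp"] = standardize_timestamp(
        record.get("updated_timestamp")
    )
    return cleaned
```

Unrecognized values fall back to Unknown (or None for timestamps) rather than raising, so a single bad row does not stop a batch.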

Derived Layer
• Data is de-normalized for faster analytics queries.
• Helps with cluster resource usage, so the same joins are not run repeatedly.
• Multiple sources are joined together for a unified view of a customer.
    • Example: Join Omniture data with the User Profile data to get a complete view
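As a rough illustration of the de-normalization described above, the sketch below left-joins clickstream events to user profiles once, so later queries read the widened rows instead of repeating the join. On the platform this would more likely be a Hive or Spark job; the plain-Python form, the `user_id` key, and the `profile_` prefix are assumptions made for the example.

```python
def denormalize(events, profiles):
    """Left-join clickstream events to user profiles on user_id."""
    # Index profiles once so each event lookup is a dictionary access.
    profiles_by_id = {p["user_id"]: p for p in profiles}
    rows = []
    for event in events:
        profile = profiles_by_id.get(event["user_id"], {})
        row = dict(event)
        for key, value in profile.items():
            if key != "user_id":
                # Prefix profile columns to avoid clashing with event columns.
                row["profile_" + key] = value
        rows.append(row)
    return rows
```

Events without a matching profile are kept with no profile columns, which mirrors a left outer join.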

Analysis Layer
• This is the consumable layer for APIs, BI Dashboards and Reports.
• All the aggregations are performed in this layer.
• The relationships and entity resolutions happen in this layer.
• Data in the Analysis Layer is served from the appropriate store, based on the need:
    • JanusGraph: For data that is optimized for a graph data model
    • Cassandra/HBase: For key-value type data sets
    • Elasticsearch: For fast searches
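The slides do not say how entity resolution is implemented. One common approach, shown here purely as a sketch, is to treat records that share an identifier as the same entity and merge them with a union-find structure. The matching keys (`email`, `phone`) and the normalization are assumptions.

```python
def resolve_entities(records, keys=("email", "phone")):
    """Group record indexes that share any identifying value."""
    parent = list(range(len(records)))

    def find(i):
        # Follow parent links to the cluster root, halving paths as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the smaller index as the root for stable output.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    first_seen = {}
    for idx, record in enumerate(records):
        for key in keys:
            value = record.get(key)
            if not value:
                continue
            token = (key, str(value).strip().lower())
            if token in first_seen:
                union(idx, first_seen[token])
            else:
                first_seen[token] = idx

    clusters = {}
    for idx in range(len(records)):
        clusters.setdefault(find(idx), []).append(idx)
    return sorted(clusters.values())
```

Matches are transitive: if record A shares an email with B, and B shares a phone number with C, all three land in one cluster. That is the kind of relationship a graph store such as JanusGraph is suited to serving.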

API Layer
• Create read-only APIs that serve the data to the rest of the organization
• Use Mesos and Docker for a scalable API layer
• Use Spring Boot for faster API development
• Also consumes directly from Kafka for real-time needs
• Publishes feedback information to Kafka that goes through the pipeline again

Lessons Learned
Monitor NameNode checkpointing
• Checkpointing is the process that takes an fsimage and the edit log and compacts them into a new fsimage.
• Checkpointing with NameNode HA is the recommended setup.
• Without NameNode HA, the Secondary NameNode does the checkpointing.
• When checkpointing fails to happen, the disks fill up and the NameNode crashes.
• There is potential to lose data if the edits file gets corrupted for any reason. Some bugs in older versions of HDFS can cause the edits file to get corrupted.
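One simple way to monitor checkpointing, sketched below under stated assumptions, is to check how old the newest fsimage file is in the NameNode metadata directory and alert when it exceeds a threshold. The sketch assumes checkpoint files are named `fsimage_<transaction id>` and live in a `current` directory passed in by the caller. The six-hour threshold is an arbitrary example value, not a recommendation from the talk.

```python
import os
import re
import time

# Matches fsimage_<txid> but not the accompanying .md5 files.
_FSIMAGE_RE = re.compile(r"^fsimage_\d+$")


def latest_checkpoint_age(current_dir, now=None):
    """Seconds since the newest fsimage file was written, or None if absent."""
    now = time.time() if now is None else now
    mtimes = [
        os.path.getmtime(os.path.join(current_dir, name))
        for name in os.listdir(current_dir)
        if _FSIMAGE_RE.match(name)
    ]
    if not mtimes:
        return None
    return now - max(mtimes)


def checkpoint_is_stale(current_dir, max_age_seconds=6 * 3600, now=None):
    """True if there is no checkpoint or the newest one is too old."""
    age = latest_checkpoint_age(current_dir, now)
    return age is None or age > max_age_seconds
```

A scheduled job could call `checkpoint_is_stale` on the metadata directory and raise an alert before the edit log grows enough to fill the disk.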

Lessons Learned…
Take backups of all the critical metadata
• Hive Metastore (MySQL/Postgres)
• NameNode fsimage/edits files
• Ambari DB (MySQL/Postgres)
• Oozie DB (MySQL/Postgres)
• Ranger DB (MySQL/Postgres)
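A backup job for the metadata listed above might assemble commands along these lines. This sketch only builds the command lines and does not run them. It covers the MySQL case with `mysqldump` and fetches the latest fsimage with `hdfs dfsadmin -fetchImage`. The database names and the backup directory are hypothetical, and a Postgres deployment would need `pg_dump` instead.

```python
def backup_commands(backup_dir, mysql_databases=("hive", "ambari", "oozie", "ranger")):
    """Build (but do not run) the commands for one metadata backup pass."""
    commands = []
    for db in mysql_databases:
        # --single-transaction takes a consistent snapshot of InnoDB tables.
        commands.append([
            "mysqldump",
            "--single-transaction",
            "--result-file={}/{}.sql".format(backup_dir, db),
            db,
        ])
    # Download the most recent fsimage from the NameNode to the backup dir.
    commands.append(["hdfs", "dfsadmin", "-fetchImage", backup_dir])
    return commands
```

Each entry is an argument list suitable for `subprocess.run(cmd, check=True)`, with credentials supplied separately (for example through a MySQL option file).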

Lessons Learned…
Monitor the logs regularly and set alerts for runtime errors.
• Helps identify jobs that fail due to runtime errors, so we can optimize them; this also helps with cluster resource usage.
Make sure the YARN queues are defined correctly and policies are set appropriately.
• Make sure one bad query does not affect the rest of the cluster
• Group teams together
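The log-alerting idea can be sketched as a small scan that counts ERROR lines per job and flags jobs above a threshold. The log layout (`<date> <time> <LEVEL> [<job>] <message>`) and the threshold are assumptions for illustration. A real deployment on this platform would more likely express the same rule in its ELK stack.

```python
import re
from collections import Counter

# Hypothetical layout: "2017-05-18 10:30:00 ERROR [job_name] message"
_LINE_RE = re.compile(r"^\S+ \S+ (?P<level>[A-Z]+) \[(?P<job>[^\]]+)\] (?P<msg>.*)$")


def error_counts(lines):
    """Count ERROR-level lines per job, ignoring lines that do not parse."""
    counts = Counter()
    for line in lines:
        match = _LINE_RE.match(line.strip())
        if match and match.group("level") == "ERROR":
            counts[match.group("job")] += 1
    return counts


def jobs_to_alert(lines, threshold=3):
    """Return the sorted names of jobs with at least `threshold` errors."""
    counts = error_counts(lines)
    return sorted(job for job, n in counts.items() if n >= threshold)
```

Jobs that repeatedly fail at runtime show up here first, which makes them candidates for optimization before they keep consuming cluster resources.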

Monitoring and Alerting

Data Governance & MDM - Atlas
• The most important questions asked by platform users:
    • What is the source of this data?
    • What exactly does this column mean?
    • Who is responsible for populating this data set?
• Hive Metastore -> Apache Atlas

Data Exploration - Zeppelin
• A web-based notebook that enables interactive data analytics.
• Lets people use their choice of language.
• Easy to create charts and graphs

Thank you!
Q & A
@karthikkrk
https://www.linkedin.com/in/karthikkrk/
kkaruppaiya@ten-x.com
WE ARE HIRING!
