Building Multi-Model Big Data Platform for Real Estate Analytics - - PowerPoint PPT Presentation
Karthik Karuppaiya
ApacheCon Big Data - 05/18/2017
Agenda
1. Company Overview
2. Overview of Ten-X Datasets
3. Ten-X Data Platform Architecture
4. Data-pipeline Implementation Overview
5. Platform Layers
6. Lessons Learned
7. What’s next?
Ten-X At a Glance
Who am I?
Karthik Karuppaiya - Sr. Engineering Manager, Data and Analytics
@karthikkrk
https://www.linkedin.com/in/karthikkrk/
kkaruppaiya@ten-x.com
WE ARE HIRING!
CUSTOMER CENTRICITY: BUILDING A 360 DEGREE VIEW
Bringing multiple data sets together to truly understand the customer
- BEHAVIORAL DATA: What are they doing on Ten-X? Where are they doing it?
- COMMUNICATIONS: What are all the interactions they have with us (online/offline/phone)?
- PREFERENCES / DEMOGRAPHICS: How does this person prefer to work with us?
- 3RD PARTY ACTIVITY DATA: What has this person done (e.g. bought/sold/owned) outside of Ten-X?
- RELATIONSHIP DATA: Who are their relationships with other entities and people?
- TRANSACTIONAL DATA: What are they buying or selling on Ten-X?
Data Sets
| Data Set | Typical Data Format |
| --- | --- |
| Behavioral Data (Instrumentation) | Files/API |
| Transactional Data (OLTP) | Kafka/RDBMS |
| Communications (AdWords/Marketo/SalesForce) | API |
| Preferences/Demographics (OLTP) | RDBMS |
| Relationship Data (OLTP/Third Party) | RDBMS, Files, API |
| 3rd Party Activity Data (MLS, REIS, etc.) | Files, API, RDBMS |
| Documents and Objects (Pictures, PDFs, etc.) | Binary Files, Spreadsheets, etc. |
Data Platform Challenges
- Support mass storage of both structured and unstructured data
- Support both batch and real-time streaming data
- Support Machine Learning across multiple data sets
- Enable discovery and modeling of complex relationships
- Ability to join and resolve duplicates across a massive number of datasets.
Platform Design Goals
- Private Data Center
- All Open Source Tools
- Extremely easy for Business teams to analyze data
- Multi-tenant – one platform for all lines of business and all teams
- Easily scalable
- Keep it Simple, Stupid
The Technologies That Power Our Data Platform!
Platform Design
Technologies shown in the diagram: Ambari, Ambari Metrics, ELK, Ansible, Ranger, Spring Boot, Hue, Tableau, YARN, Pig, Hive, Spark, HDFS, HBase, JanusGraph, Gremlin, TinkerPop, Cassandra, Elasticsearch, Atlas, Tez, MapReduce, Oozie
Data Pipeline Architecture
Raw Layer -> Clean Layer -> Derived Layer -> Analysis Layer
Raw Layer
- All the data lands here.
- Data is never exposed from the Raw Layer.
- Data is stored mostly in its original format (typically text).
Clean Layer
- All the cleansing rules are applied to the data.
- Example: Standardize the Gender data to Male/Female/Unknown for all the records
- Example: Standardize the Updated_Timestamp column on all the tables
- Data is optimized for storage and querying.
- Mostly use ORC file format.
- Create External Hive Tables for all the datasets.
- Platform users typically have access to the data in the Clean Layer for exploratory purposes.
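The two cleansing examples above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the rules actually used on the platform: the raw-value mapping, the accepted timestamp layouts, the choice to treat inputs as UTC, and the names `standardize_gender`, `standardize_timestamp`, and `clean_record` are all hypothetical.

```python
from datetime import datetime, timezone

# Hypothetical mapping of raw values to the standard labels.
_GENDER_MAP = {
    "m": "Male", "male": "Male",
    "f": "Female", "female": "Female",
}

# Hypothetical timestamp layouts found across source tables.
_TIMESTAMP_FORMATS = (
    "%Y-%m-%d %H:%M:%S",
    "%Y-%m-%dT%H:%M:%S",
    "%m/%d/%Y %H:%M",
)


def standardize_gender(raw):
    """Map a raw gender value to Male, Female, or Unknown."""
    if raw is None:
        return "Unknown"
    return _GENDER_MAP.get(str(raw).strip().lower(), "Unknown")


def standardize_timestamp(raw):
    """Return an ISO 8601 string, or None when no known layout matches.

    Assumption: source timestamps carry no zone and are treated as UTC.
    """
    if raw is None:
        return None
    text = str(raw).strip()
    for fmt in _TIMESTAMP_FORMATS:
        try:
            parsed = datetime.strptime(text, fmt)
        except ValueError:
            continue
        return parsed.replace(tzinfo=timezone.utc).isoformat()
    return None


def clean_record(record):
    """Apply both cleansing rules to a copy of one record."""
    cleaned = dict(record)
    cleaned["gender"] = standardize_gender(record.get("gender"))
    cleaned["updated_timestamp"] = standardize_timestamp(
        record.get("updated_timestamp")
    )
    return cleaned
```

Values that match no rule become `Unknown` or `None` rather than raising, so one bad row does not stop a batch; whether that is the right policy depends on the data set.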
Derived Layer
- Data is de-normalized for faster analytics queries.
- Reduces cluster resource usage, because the same joins are not run repeatedly.
- Multiple sources are joined together for a unified view of a customer.
- Example: Join Omniture data with the User Profile data to get a complete view
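The join in the example above could look roughly like the sketch below, which runs the join once and keeps the de-normalized rows for later queries. It uses plain Python lists of dicts rather than the platform's actual engine, and the field names (`user_id`, `page`, `segment`) and the function name `denormalize_events` are hypothetical.

```python
def denormalize_events(events, profiles):
    """Left-join each event with its user's profile on user_id."""
    profiles_by_id = {p["user_id"]: p for p in profiles}
    joined = []
    for event in events:
        row = dict(event)
        profile = profiles_by_id.get(event["user_id"], {})
        # Prefix profile columns so they cannot collide with event columns.
        for key, value in profile.items():
            if key != "user_id":
                row["profile_" + key] = value
        joined.append(row)
    return joined


# Hypothetical sample data: web analytics events and user profiles.
events = [
    {"user_id": 1, "page": "/listing/42"},
    {"user_id": 2, "page": "/search"},
]
profiles = [{"user_id": 1, "segment": "investor"}]

derived_rows = denormalize_events(events, profiles)
```

Events without a matching profile are kept with no profile columns, which mirrors a left join.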
Analysis Layer
- This is the consumable Layer for APIs, BI Dashboards and Reports.
- All the aggregations are performed in this layer.
- Relationship discovery and entity resolution happen in this layer.
- Data in the Analysis Layer is served from the appropriate store, based on the need:
- JanusGraph: for data that fits a graph data model
- Cassandra/HBase: for key-value data sets
- Elasticsearch: for fast searches
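As an illustration of the entity resolution mentioned above, the sketch below groups records that share a normalized email address or phone number, using a union-find structure. The deck does not describe the matching rules actually used; the match keys, the record fields (`id`, `email`, `phone`), and the function name `resolve_entities` are assumptions.

```python
def resolve_entities(records):
    """Group record ids that share a normalized email or phone number."""
    parent = {r["id"]: r["id"] for r in records}

    def find(x):
        # Follow parent links to the root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the smaller id as the representative of the group.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    first_seen = {}  # (field, normalized value) -> first record id seen
    for r in records:
        keys = []
        if r.get("email"):
            keys.append(("email", r["email"].strip().lower()))
        if r.get("phone"):
            digits = "".join(ch for ch in r["phone"] if ch.isdigit())
            if digits:
                keys.append(("phone", digits))
        for key in keys:
            if key in first_seen:
                union(r["id"], first_seen[key])
            else:
                first_seen[key] = r["id"]

    groups = {}
    for r in records:
        groups.setdefault(find(r["id"]), []).append(r["id"])
    return sorted(sorted(group) for group in groups.values())
```

Matching is transitive: if record 1 shares an email with record 2, and record 2 shares a phone number with record 3, all three land in one group.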
API Layer
- Create read-only APIs that serve the data to the rest of the organization
- Use Mesos and Docker for a scalable API layer
- Use Spring Boot for faster API development
- APIs also consume directly from Kafka for real-time needs
- APIs publish feedback information to Kafka, which goes through the pipeline again
Lessons Learned
Monitor NameNode Checkpointing
- Checkpointing is a process that takes an fsimage and edit log and compacts them into a new fsimage.
- Checkpointing with NameNode HA is the recommended approach.
- Without NameNode HA, the Secondary NameNode does the checkpointing.
- When checkpointing fails to happen, the disks fill up and the NameNode crashes.
- There is potential to lose data if the edits file gets corrupted for any reason. Some bugs in older versions of HDFS might cause the edits file to get corrupted.
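One way to monitor checkpointing is to watch the age of the newest fsimage file in the NameNode metadata directory. The sketch below is a minimal example of that idea, not the monitoring setup used on the platform: it assumes the check runs on a host that can read the directory, the six-hour threshold is arbitrary, and the function names are hypothetical.

```python
import os
import re
import time

# fsimage files are named fsimage_<transaction id>; .md5 sidecar files are skipped.
_FSIMAGE_NAME = re.compile(r"^fsimage_\d+$")


def latest_checkpoint_age(current_dir, now=None):
    """Return seconds since the newest fsimage was written, or None if none exist."""
    now = time.time() if now is None else now
    mtimes = [
        os.path.getmtime(os.path.join(current_dir, name))
        for name in os.listdir(current_dir)
        if _FSIMAGE_NAME.match(name)
    ]
    if not mtimes:
        return None
    return now - max(mtimes)


def checkpoint_is_stale(current_dir, max_age_seconds=6 * 3600, now=None):
    """True when no fsimage exists or the newest one is older than the threshold."""
    age = latest_checkpoint_age(current_dir, now=now)
    return age is None or age > max_age_seconds
```

A scheduled job could call `checkpoint_is_stale` on the `current` subdirectory of the NameNode metadata directory and raise an alert when it returns `True`.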
Lessons Learned…
Take backups of all the critical metadata
- Hive Metastore (MySQL/Postgres)
- NameNode fsimage/edits files
- Ambari DB (MySQL/Postgres)
- Oozie DB (MySQL/Postgres)
- Ranger DB (MySQL/Postgres)
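A backup run over the stores listed above might be planned as in the sketch below. It only builds the command lines and executes nothing. The database names, the backup directory, and the function name `metadata_backup_commands` are hypothetical, and which databases live in MySQL versus Postgres depends on the installation.

```python
def metadata_backup_commands(backup_dir, mysql_databases=(), postgres_databases=()):
    """Build argv lists for one metadata backup run; nothing is executed here."""
    commands = []
    for name in mysql_databases:
        target = f"{backup_dir}/{name}.mysql.sql"
        # --single-transaction takes a consistent snapshot of InnoDB tables.
        commands.append(
            ["mysqldump", "--single-transaction", f"--result-file={target}", name]
        )
    for name in postgres_databases:
        target = f"{backup_dir}/{name}.pg.sql"
        commands.append(["pg_dump", f"--file={target}", name])
    # Download the most recent fsimage from the NameNode into backup_dir.
    commands.append(["hdfs", "dfsadmin", "-fetchImage", backup_dir])
    return commands


# Hypothetical layout: Hive and Oozie in MySQL, Ambari and Ranger in Postgres.
plan = metadata_backup_commands(
    "/backups/2017-05-18",
    mysql_databases=["hive", "oozie"],
    postgres_databases=["ambari", "ranger"],
)
```

Each entry in `plan` can be passed to `subprocess.run(..., check=True)` by a scheduler, so that a failed dump fails the run visibly.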
Lessons Learned…
Monitor the logs regularly and set alerts for runtime errors.
- Helps identify jobs that fail due to runtime errors so we can optimize them, which helps with cluster resource usage.
Make sure the YARN queues are properly defined and policies are set appropriately.
- Make sure one bad query does not affect the rest of the cluster.
- Group teams together.
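Alerting on runtime errors can start as simply as counting ERROR lines per job and flagging jobs that cross a threshold. The sketch below assumes a hypothetical log line layout (`<date> <time> <LEVEL> <job_name>: <message>`); real job logs will differ, and the threshold and function names are illustrative only.

```python
import re
from collections import Counter

# Hypothetical log line layout: "<date> <time> <LEVEL> <job_name>: <message>"
_LOG_LINE = re.compile(
    r"^\S+ \S+ (?P<level>[A-Z]+) (?P<job>[\w.-]+): (?P<message>.*)$"
)


def count_errors_by_job(lines):
    """Count ERROR lines per job name; lines that do not parse are ignored."""
    counts = Counter()
    for line in lines:
        match = _LOG_LINE.match(line.strip())
        if match and match.group("level") == "ERROR":
            counts[match.group("job")] += 1
    return counts


def jobs_to_alert(lines, threshold=3):
    """Return the sorted names of jobs with at least `threshold` ERROR lines."""
    counts = count_errors_by_job(lines)
    return sorted(job for job, n in counts.items() if n >= threshold)
```

Jobs returned by `jobs_to_alert` are the candidates for optimization, since repeated failures waste cluster resources on reruns.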
Monitoring and Alerting
Data Governance & MDM - Atlas
- The most important questions asked by platform users:
- What is the source of this data?
- What does this column exactly mean?
- Who is responsible for populating this data set?
- Hive Metastore -> Apache Atlas
Data Exploration - Zeppelin
- A web-based notebook that enables interactive data analytics.
- Lets people use their language of choice.
- Easy to create Charts and Graphs