Extending the Enterprise Data Warehouse with Hadoop




slide-1
SLIDE 1

Extending the Enterprise Data Warehouse with Hadoop Robert Lancaster Nov 7, 2012

slide-2
SLIDE 2
  • Robert Lancaster
  • Solutions Architect, Hotel Supply Team
  • rlancaster@orbitz.com
  • @rob1lancaster
  • Organizer of Chicago Machine Learning Study Group
  • Co-organizer of Chicago Big Data.

page 2

Who I Am

slide-3
SLIDE 3

page 3

  • Launched in 2001
  • Over 160 million bookings

slide-4
SLIDE 4

page 4

Some History…

slide-5
SLIDE 5
  • The Machine Learning team is formed to improve site performance, for example by improving hotel search results.
  • This required access to large volumes of behavioral data for analysis.
  • Fortunately, the required data was already being collected as session data in web analytics logs.

page 5

In 2009…

slide-6
SLIDE 6
  • The only archive of the required data went back about two weeks.

page 6

The Problem…

Transactional data (e.g. bookings) and aggregated non-transactional data

Data Warehouse

Non-transactional Data (e.g. searches)

slide-7
SLIDE 7

page 7

Hadoop Provided a Solution…

Data Warehouse

Detailed non-transactional data (what every user sees, clicks, etc.)

Hadoop

Transactional data (e.g. bookings) and aggregated non-transactional data

slide-8
SLIDE 8
  • Distributed file system and parallel processing platform.
  • Open source Apache project created by Doug Cutting.
  • Modeled on papers published by Google on the Google File System and MapReduce.
  • Intended to run on a cluster of relatively inexpensive machines (aka commodity hardware).

  • Bring processing to the data.

page 8

What is Hadoop?
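
To make the MapReduce model above more concrete, here is a minimal single-process sketch in plain Python. It does not use Hadoop itself; it only imitates the map, shuffle, and reduce phases. The log layout (`user_id<TAB>destination`) and every function name are assumptions made up for illustration, not anything taken from the talk.

```python
from __future__ import annotations

from collections import defaultdict
from typing import Callable, Iterable, Iterator


def map_search(record: str) -> Iterator[tuple[str, int]]:
    """Map step: emit (destination, 1) for one search log line.

    The "user_id<TAB>destination" layout is a hypothetical format.
    """
    fields = record.rstrip("\n").split("\t")
    if len(fields) == 2 and fields[1]:
        yield fields[1], 1


def reduce_counts(key: str, values: Iterable[int]) -> tuple[str, int]:
    """Reduce step: sum all counts emitted for one key."""
    return key, sum(values)


def run_mapreduce(
    records: Iterable[str],
    mapper: Callable[[str], Iterator[tuple[str, int]]],
    reducer: Callable[[str, Iterable[int]], tuple[str, int]],
) -> dict[str, int]:
    """Run map, then group by key (the shuffle), then reduce each group."""
    grouped: dict[str, list[int]] = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            grouped[key].append(value)
    return dict(reducer(key, values) for key, values in sorted(grouped.items()))


# Tiny in-memory stand-in for a log file stored in HDFS.
SAMPLE_LOG = ["u1\tChicago", "u2\tLondon", "u3\tChicago", "malformed line"]
search_counts = run_mapreduce(SAMPLE_LOG, map_search, reduce_counts)
print(search_counts)
```

On a real cluster the framework runs many mapper and reducer instances in parallel, close to where the data blocks are stored; the grouping done here in one dictionary corresponds to the shuffle between the two phases.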

slide-9
SLIDE 9

page 9

The Hadoop Ecosystem

[Diagram components: Hadoop Distributed File System; Hive; Pig; HBase; Sqoop & Flume; ZooKeeper & Oozie; MapReduce]

slide-10
SLIDE 10

page 10

Deploying Hadoop Enabled Multiple Applications…

[Chart: two series, Queries and Searches; y-axis from 0% to 100%, x-axis from 1 to 20; labeled values 2.78%, 34.30%, 31.87%, 71.67%]

slide-11
SLIDE 11
page 11

And Useful Analyses…

slide-12
SLIDE 12
  • Most of these efforts are driven by development teams.
  • The challenge now is unlocking the value of this data for non-technical users.
  • Support for Hadoop in traditional BI/reporting tools is still meager.

page 12

But Brought New Challenges…

slide-13
SLIDE 13

page 13

BI Vendors Are Working on Hadoop Integration

Both big (relatively)…

slide-14
SLIDE 14

page 14

And small…

slide-15
SLIDE 15
  • Big Data team is formed under Business Intelligence team at Orbitz Worldwide.
  • Allows the Big Data team to work more closely with the data warehouse and BI teams.

  • Reflects the importance of big data to the future of the company.
  • Our production cluster has grown 40-fold since it was launched.

page 15

In 2011 & 2012

slide-16
SLIDE 16

“We strongly believe that Hadoop is the nucleus of the next-generation cloud EDW…” “…but that promise is still three to five years from fruition.”* *James Kobielus, Forrester Research, “Hadoop, Is It Soup Yet?”

page 16

A View Shared Beyond Orbitz…

slide-17
SLIDE 17
  • Extraction and transformation of data for loading into the data warehouse – “ETL”.

  • Off-loading of analysis from the data warehouse.

page 17

Two Primary Ways We Use Hadoop to Complement the EDW

slide-18
SLIDE 18

Proposed Processing

page 18

ETL Example

Raw logs → Hadoop → Dimensional model

slide-19
SLIDE 19

Previous Processing in Data Warehouse

page 19

ETL Example: Click Data Processing

[Diagram: Web Servers → Logs → ETL → DW → Data Cleansing (stored procedure) → DW]

Several hours of processing; ~20% of original data size

slide-20
SLIDE 20
  • Moving to Hadoop:
      • Removed load from the data warehouse.
      • Made it easier to add attributes for processing.
      • Allowed processing to be run more frequently.

page 20

ETL Example: Click Data Processing

[Diagram: Web Servers → Logs → HDFS → Data Cleansing (MapReduce) → DW]

Processing in Hadoop
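
As a rough illustration of what a cleansing step like the one above might do, the sketch below filters raw click-log lines in Python. The slides do not describe the actual rules or the log format, so the four-field tab-separated layout, the bot heuristic, and all names here are assumptions for illustration only.

```python
from __future__ import annotations

from typing import Iterable, Iterator, Optional

# Hypothetical layout: timestamp, session_id, url, user_agent (tab-separated).
EXPECTED_FIELDS = 4
# Assumed heuristic for automated traffic; real rules would be more careful.
BOT_MARKERS = ("bot", "crawler", "spider")


def cleanse_line(line: str) -> Optional[str]:
    """Return a cleaned record, or None if the line should be dropped."""
    fields = [field.strip() for field in line.rstrip("\n").split("\t")]
    if len(fields) != EXPECTED_FIELDS:
        return None  # truncated or otherwise malformed line
    timestamp, session_id, url, user_agent = fields
    if not timestamp or not session_id or not url:
        return None  # required attribute missing
    if any(marker in user_agent.lower() for marker in BOT_MARKERS):
        return None  # likely automated traffic
    # Keep only the attributes the warehouse load needs.
    return "\t".join((timestamp, session_id, url))


def cleanse(lines: Iterable[str]) -> Iterator[str]:
    """Map-only step: each input line yields zero or one output line."""
    for line in lines:
        cleaned = cleanse_line(line)
        if cleaned is not None:
            yield cleaned


RAW_LINES = [
    "2012-11-07T10:00:00\ts1\t/hotels/chicago\tMozilla/5.0",
    "2012-11-07T10:00:01\ts2\t/hotels/chicago\tGooglebot/2.1",
    "truncated\tline",
    "2012-11-07T10:00:02\t\t/hotels/london\tMozilla/5.0",
]
cleaned_lines = list(cleanse(RAW_LINES))
print(cleaned_lines)
```

Because each line is handled independently, a function like this fits a map-only job, and dropping fields and rows at this stage is one way the output could end up much smaller than the raw logs. The slides do not say whether the original job was written in Java MapReduce or run through Hadoop Streaming.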

slide-21
SLIDE 21
  • Facilitated analysis that allows for more personalized ad content.
  • Allowed the marketing team to analyze over a year's worth of search data.

  • Provided analysis that was difficult to perform in the data warehouse.

page 21

Analysis Example: Geo-Targeting Ads

slide-22
SLIDE 22

page 22

Example Processing Pipeline for Web Analytics Data

slide-23
SLIDE 23

page 23

Example Use Case: Selection Errors

slide-24
SLIDE 24

page 24

Use Case – Selection Errors: Introduction

  • Multiple points of entry.
  • Multiple paths through site.
  • Goal: tie events together to form a picture of customer behavior.
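
One common way to tie events together is to group them by a session identifier and order them by time. The sketch below shows that idea in Python; the `Event` fields, the action names, and the assumption that a session ID is available on every event are hypothetical and not taken from the slides.

```python
from __future__ import annotations

from collections import defaultdict
from typing import Iterable, NamedTuple


class Event(NamedTuple):
    """One logged user action (hypothetical schema)."""
    session_id: str
    timestamp: int  # e.g. seconds since the start of the day
    action: str


def build_sessions(events: Iterable[Event]) -> dict[str, list[str]]:
    """Group events by session and return each session's ordered path."""
    by_session: dict[str, list[Event]] = defaultdict(list)
    for event in events:
        by_session[event.session_id].append(event)
    return {
        session_id: [e.action for e in sorted(session_events, key=lambda e: e.timestamp)]
        for session_id, session_events in by_session.items()
    }


# Events arrive out of order and from different entry points.
EVENTS = [
    Event("s1", 30, "select_hotel"),
    Event("s2", 5, "email_landing"),
    Event("s1", 10, "search"),
    Event("s1", 20, "view_results"),
    Event("s2", 15, "select_hotel"),
]
paths = build_sessions(EVENTS)
print(paths)
```

In a MapReduce job the same grouping would typically be done by emitting the session ID as the key, so that all events for one session reach the same reducer.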

slide-25
SLIDE 25

page 25

Use Case – Selection Errors: Processing

slide-26
SLIDE 26

page 26

Use Case – Selection Errors: Visualization

slide-27
SLIDE 27

page 27

Example Use Case: Beta Data

slide-28
SLIDE 28

page 28

Use Case – Beta Data: Introduction

  • Hotel Sort Optimization
      • Compare A vs. B
  • Web Analytics Data
      • What user saw.
      • How user behaved.
  • Server Log Data
      • Sorting behavior used.
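
A comparison like this means joining the two sources: the server log says which sort variant a session received, and the web analytics data says how the user behaved. The Python sketch below joins them on a session ID and computes a booking rate per variant. The session-ID join key, the boolean "booked" outcome, and all names are assumptions for illustration; the slides do not state which metric was actually compared.

```python
from __future__ import annotations

from collections import defaultdict


def booking_rate_by_variant(
    variant_by_session: dict[str, str],
    booked_by_session: dict[str, bool],
) -> dict[str, float]:
    """Join server-log variants with analytics outcomes on session ID."""
    totals: dict[str, int] = defaultdict(int)
    bookings: dict[str, int] = defaultdict(int)
    for session_id, variant in variant_by_session.items():
        if session_id not in booked_by_session:
            continue  # no matching analytics record, so skip the session
        totals[variant] += 1
        bookings[variant] += int(booked_by_session[session_id])
    return {variant: bookings[variant] / totals[variant] for variant in totals}


# Hypothetical server log data: which sort each session was given.
SERVER_LOG = {"s1": "A", "s2": "B", "s3": "A", "s4": "B", "s5": "A", "s6": "B"}
# Hypothetical web analytics data: whether the session ended in a booking.
WEB_ANALYTICS = {"s1": True, "s2": False, "s3": False, "s4": True, "s6": True}

rates = booking_rate_by_variant(SERVER_LOG, WEB_ANALYTICS)
print(rates)
```

With real traffic the difference between variants would also need a significance test before drawing conclusions; this sketch only covers the join and the aggregation.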
slide-29
SLIDE 29

page 29

Use Case – Beta Data Processing

slide-30
SLIDE 30

page 30

Use Case – Beta Data: Visualization

slide-31
SLIDE 31

page 31

Example Use Case: RCDC

slide-32
SLIDE 32
  • Understand and improve cache behavior.
  • Improve “coverage”
  • Traditionally search 1 page of hotels at a time.
  • Get “just enough” information to present to consumers.
  • Increase amount of availability information we have when consumer performs a search.

  • Data needed to support needs beyond reporting.

page 32

Use Case – RCDC: Introduction

slide-33
SLIDE 33

page 33

Use Case – RCDC: Processing

slide-34
SLIDE 34

page 34

Use Case – RCDC: Visualization

slide-35
SLIDE 35
  • Hadoop market is still immature, but growing quickly. Better tools are on the way.
  • Look beyond the usual (enterprise) suspects. Many of the most interesting companies in the big data space are small startups.
  • Hadoop won’t replace your EDW, but any organization with a large EDW should at least be exploring Hadoop as a complement to their BI infrastructure.

page 35

Conclusions

slide-36
SLIDE 36
  • Work closely with your existing data management teams.
  • Your idea of what constitutes “big data” might quickly diverge from theirs.
  • The flip-side to this is that Hadoop can be an excellent tool to off-load resource-consuming jobs from your data warehouse.

page 36

Conclusions

slide-37
SLIDE 37

Thank you! Questions?

page 37