Large scale data processing pipelines at trivago: a use case
2016-11-15, Sevilla, Spain Clemens Valiente
Clemens Valiente
Email: clemens.valiente@trivago.com
de.linkedin.com/in/clemensvaliente
Senior Data Engineer, trivago Düsseldorf
Originally a mathematician
Studied at Uni Erlangen
At trivago for 5 years
Price information collected from the various booking websites and shown to our visitors also gives us a thorough overview of trends and developments in hotel prices. This information is used by our Content Marketing & Communication Department (CMC) to write stories and articles.
Java Software Engineering, Business Intelligence, CMC
Price dimensions
180 days in advance
years
Restrictions
European visitors
minutes
website and arrival date per day
price per key wins
Size of data
billion prices in those five years
pipeline in early 2015 on average around 100 million prices per day were written to BI
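The restriction that one price per key wins can be read as a deduplication step. The sketch below is only an illustration: the slide text is truncated, so the key fields, the record layout and the rule deciding which price wins (here: the lowest) are all assumptions, and every name is hypothetical.

```python
from typing import Callable, Dict, Iterable, NamedTuple, Tuple


class Price(NamedTuple):
    # All field names are hypothetical; the slides do not show the record layout.
    hotel_id: int
    website: str
    arrival_date: str  # ISO date of the stay
    log_date: str      # ISO date on which the price was shown
    price_cents: int


Key = Tuple[int, str, str, str]


def dedupe(prices: Iterable[Price],
           wins: Callable[[Price, Price], bool]) -> Dict[Key, Price]:
    """Keep one price per key; wins(new, old) decides whether new replaces old."""
    best: Dict[Key, Price] = {}
    for p in prices:
        key = (p.hotel_id, p.website, p.arrival_date, p.log_date)
        if key not in best or wins(p, best[key]):
            best[key] = p
    return best


def lowest_wins(new: Price, old: Price) -> bool:
    # Assumed rule: the cheaper price wins.
    return new.price_cents < old.price_cents


sample = [
    Price(1, "site-a", "2016-12-24", "2016-11-15", 12000),
    Price(1, "site-a", "2016-12-24", "2016-11-15", 11000),
    Price(1, "site-b", "2016-12-24", "2016-11-15", 13000),
]
result = dedupe(sample, lowest_wins)
```

Passing the rule as a function keeps the sketch neutral: a "latest price wins" policy would only need a different `wins` callable.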
San Francisco, Düsseldorf, Hong Kong
Cluster specifications
used
per machine)
Data Size (price log)
collected so far
Data processing
data in 10 minute intervals
stage in Hive runs in 30 minutes with 5 days of CPU time spent
>100 GB of result tables usually done within a few seconds
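The two figures given for the Hive stage imply a rough degree of parallelism. As a back-of-the-envelope check, assuming the 5 days of CPU time and the 30 minutes of runtime describe the same job run:

```python
from datetime import timedelta

cpu_time = timedelta(days=5)       # total CPU time reported on the slide
wall_time = timedelta(minutes=30)  # elapsed time reported on the slide

# Average number of cores kept busy = CPU time / wall-clock time.
avg_busy_cores = cpu_time / wall_time
print(avg_busy_cores)  # 240.0
```

This is an average over the whole run; the slides do not say how many cores the cluster has in total.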
Uses for price information
hotel market
detection
marketing
development and delivering price alerts to website visitors
Other data sources and usage
app
performance analysis, product tests, invoice generation etc.
Status quo
logic runs on and through the Kafka-Hadoop pipeline
metrics delivered by Hadoop
not do their job without Hadoop data
Kafka Connect
Message format: CSV, Protobuf / Avro
Stream processing: Kafka Streams, Streaming SQL
Kylin / HBase
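The message format item lists CSV next to Protobuf / Avro. One commonly cited difference is that CSV fields are positional and untyped, while schema-based formats give every field a name and a type. The sketch below illustrates that idea with the Python standard library only; it does not use an Avro or Protobuf library, and the schema and its field names are hypothetical.

```python
import csv
import io
import json

# Hypothetical schema written in Avro's JSON notation; field names are illustrative.
PRICE_SCHEMA = {
    "type": "record",
    "name": "Price",
    "fields": [
        {"name": "hotel_id", "type": "long"},
        {"name": "website", "type": "string"},
        {"name": "arrival_date", "type": "string"},
        {"name": "price_cents", "type": "long"},
    ],
}

# Map the schema's primitive type names to Python constructors.
CASTS = {"long": int, "string": str}


def parse_csv_line(line: str, schema: dict) -> dict:
    """Turn one positional CSV line into a named, typed record."""
    values = next(csv.reader(io.StringIO(line)))
    fields = schema["fields"]
    if len(values) != len(fields):
        raise ValueError(f"expected {len(fields)} columns, got {len(values)}")
    return {f["name"]: CASTS[f["type"]](v) for f, v in zip(fields, values)}


record = parse_csv_line("42,site-a,2016-12-24,11000", PRICE_SCHEMA)
print(json.dumps(record))
```

With plain CSV, a producer that adds or reorders a column silently shifts every field for the consumer; with an explicit schema the mismatch is at least detectable, as the column-count check shows.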
Streams local state
* https://www.confluent.io/blog/unifying-stream-processing-and-interactive-queries-in-apache-kafka/
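The linked Confluent post describes querying a stream processor's local state directly rather than exporting results to an external store first. The idea can be illustrated in plain Python; this is not the Kafka Streams API (which is a Java library), and the class, method names and record layout are made up for the example.

```python
from typing import Dict, Optional, Tuple

Record = Tuple[str, int]  # (key, price in cents); the layout is hypothetical


class MinPriceStore:
    """In-memory stand-in for a stream processor's local state store."""

    def __init__(self) -> None:
        self._state: Dict[str, int] = {}

    def process(self, record: Record) -> None:
        # Update the running minimum for the record's key.
        key, price = record
        current = self._state.get(key)
        if current is None or price < current:
            self._state[key] = price

    def get(self, key: str) -> Optional[int]:
        # "Interactive query": read the current state without a separate database.
        return self._state.get(key)


store = MinPriceStore()
for rec in [("hotel-1", 12000), ("hotel-2", 9000), ("hotel-1", 11000)]:
    store.process(rec)
print(store.get("hotel-1"))  # 11000
```

In a real deployment the state would be partitioned across instances and backed by a changelog for recovery; the sketch leaves both out.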
Mastering Hadoop
messages correctly
and how to use them to solve problems
denormalised Hive tables in Parquet format and nested data types
Using Hadoop
to users (Impala / Hive JDBC with visualisation tools)
write good code, strict guidelines and code review
Jenkins deploys Git repository with Oozie definitions and Hive scripts to HDFS
Bad parts
coordinators in XML, not through the Hue interface
in Hive & Impala
& Hue: Failed queries are not always closed properly
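The deployment step mentioned above (Jenkins copying a Git checkout with Oozie definitions and Hive scripts to HDFS) could be sketched as follows. The slides do not show the actual job, so the paths and the function name are hypothetical; the sketch only builds the `hdfs dfs` commands and prints them instead of running them.

```python
import shlex
from pathlib import PurePosixPath
from typing import List


def build_deploy_commands(checkout_dir: str, hdfs_target: str) -> List[List[str]]:
    """Return the hdfs CLI calls that copy a checked-out repository to HDFS."""
    target = PurePosixPath(hdfs_target)
    return [
        # Create the target directory (and parents) if it is missing.
        ["hdfs", "dfs", "-mkdir", "-p", str(target)],
        # Copy the checkout, overwriting files from the previous deployment.
        ["hdfs", "dfs", "-put", "-f", checkout_dir, str(target)],
    ]


# Hypothetical paths, for illustration only.
commands = build_deploy_commands("./workflows", "/apps/pipelines")
for cmd in commands:
    print(shlex.join(cmd))
```

A CI job could pass each command to `subprocess.run(cmd, check=True)` so that a failed copy fails the build.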
Questions and comments?
Thanks to Jan Filipiak for his brainpower behind most projects, giving me the opportunity to present them