Large scale data processing pipelines at trivago: a use case

Large scale data processing pipelines at trivago: a use case. 2016-11-15, Sevilla, Spain. Clemens Valiente, Senior Data Engineer, trivago Düsseldorf. Originally a mathematician; studied at Uni Erlangen; at trivago for 5 years.


  1. Large scale data processing pipelines at trivago: a use case. 2016-11-15, Sevilla, Spain. Clemens Valiente

  2. Clemens Valiente, Senior Data Engineer, trivago Düsseldorf. Originally a mathematician. Studied at Uni Erlangen. At trivago for 5 years. Email: clemens.valiente@trivago.com. LinkedIn: de.linkedin.com/in/clemensvaliente

  3. Data driven PR and External Communication: The price information that we collect from the various booking websites and show to our visitors also gives us a thorough overview of trends and developments in hotel prices. Our Content Marketing & Communication Department (CMC) then uses this knowledge to write stories and articles.

  9. The past: Data pipeline 2010 – 2015. Diagram labels: Java Software Engineering, Business Intelligence, CMC.

  12. The past: Data pipeline 2010 – 2015: Facts & Figures
      Price dimensions
      - Around one million hotels
      - 250 booking websites
      - Travellers search for up to 180 days in advance
      - Data collected over five years
      Restrictions
      - Only single night stays
      - Only prices from European visitors
      - Prices cached up to 30 minutes
      - One price per hotel, website and arrival date per day
      - “Insert ignore”: The first price per key wins
      Size of data
      - We collected a total of 56 billion prices in those five years
      - Towards the end of this pipeline in early 2015, on average around 100 million prices per day were written to BI
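
The “insert ignore” restriction above can be illustrated with a minimal Python sketch. It is not the original implementation: the in-memory dictionary, the `insert_ignore` helper and the example IDs are hypothetical stand-ins for the BI table, and the key layout (hotel, website, arrival date, day of logging) is an assumption read off the slide's wording.

```python
from datetime import date

# Hypothetical in-memory stand-in for the BI price table.
# Assumed key: (hotel_id, website_id, arrival_date, log_date); value: price.
PriceKey = tuple[int, int, date, date]


def insert_ignore(table: dict[PriceKey, float], key: PriceKey, price: float) -> bool:
    """Store the price only if the key is new; return True if it was stored."""
    if key in table:
        return False  # a price for this key already exists, so this one is dropped
    table[key] = price
    return True


table: dict[PriceKey, float] = {}
key = (42, 7, date(2015, 3, 1), date(2015, 1, 15))  # made-up example IDs and dates

insert_ignore(table, key, 99.0)  # first price seen that day for this key: stored
insert_ignore(table, key, 89.0)  # later price for the same key: ignored

# Upper bound on distinct keys per day implied by the slide's dimensions:
# one million hotels x 250 websites x 180 arrival dates.
max_keys_per_day = 1_000_000 * 250 * 180
```

Under these assumptions the table keeps 99.0 even though a lower price arrived later, which is the information loss the restriction implies. The upper bound works out to 45 billion possible keys per day, so the roughly 100 million prices written per day covered only a small fraction of that space.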

  18. Refactoring the pipeline: Requirements
      • Scales with an arbitrary amount of data (future proof)
      • Reliable and resilient
      • Low performance impact on Java backend
      • Long term storage of raw input data
      • Fast processing of filtered and aggregated data
      • Open source
      • We want to log everything:
        • more prices: length of stay, room type, breakfast info, room category, domain
        • with more information: net & gross price, city tax, resort fee, affiliate fee, VAT

  21. Present data pipeline 2016 – ingestion. Diagram labels: San Francisco, Düsseldorf, Hong Kong.

  25. Present data pipeline 2016 – processing. Diagram labels: Camus, CMC.

  28. Present data pipeline 2016 – facts & figures
      Cluster specifications
      - 51 machines
      - 1.7 PB disc space, 60% used
      - 3.6 TB memory in Yarn
      - 1440 VCores (24-32 Cores per machine)
      Data Size (price log)
      - 2.6 trillion messages collected so far
      - 7 billion messages/day
      - 160 TB of data
      Data processing
      - Camus: 30 mappers writing data in 10 minute intervals
      - First aggregation/filtering stage in Hive runs in 30 minutes with 5 days of CPU time spent
      - Impala Queries across >100 GB of result tables usually done within a few seconds
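
To put the figures above in proportion, here is a rough back-of-envelope calculation in Python. It uses only the numbers quoted on the slide. The assumptions are that terabytes are decimal, that the 160 TB covers all 2.6 trillion messages, and that the daily rate is constant. The results are therefore order-of-magnitude estimates, not measurements.

```python
# Figures quoted on the slide.
MESSAGES_PER_DAY = 7e9
TOTAL_MESSAGES = 2.6e12
TOTAL_BYTES = 160e12            # assumes decimal terabytes
CPU_DAYS_PER_HIVE_RUN = 5
HIVE_WALL_CLOCK_MINUTES = 30
TOTAL_VCORES = 1440

# Average ingestion rate, assuming messages are spread evenly over the day.
messages_per_second = MESSAGES_PER_DAY / (24 * 60 * 60)

# Average stored size per message, assuming 160 TB covers all 2.6 trillion messages.
bytes_per_message = TOTAL_BYTES / TOTAL_MESSAGES

# How many days of today's volume the collected total corresponds to.
days_at_current_rate = TOTAL_MESSAGES / MESSAGES_PER_DAY

# Average number of cores kept busy by the first Hive stage:
# 5 days of CPU time squeezed into 30 minutes of wall-clock time.
cpu_minutes = CPU_DAYS_PER_HIVE_RUN * 24 * 60
avg_busy_cores = cpu_minutes / HIVE_WALL_CLOCK_MINUTES
share_of_cluster = avg_busy_cores / TOTAL_VCORES

print(f"{messages_per_second:,.0f} messages/s")
print(f"{bytes_per_message:.1f} bytes/message")
print(f"{days_at_current_rate:.0f} days at the current rate")
print(f"{avg_busy_cores:.0f} cores busy on average ({share_of_cluster:.0%} of the VCores)")
```

Under these assumptions the pipeline ingests about 81,000 messages per second at roughly 60 bytes each. The first Hive stage keeps about 240 cores busy on average, about a sixth of the cluster's 1440 VCores.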

  29. Present data pipeline 2016 – results after one and a half years in production
      • Very reliable, barely any downtime or service interruptions of the system
      • Java team is very happy – less load on their system
      • BI team is very happy – more data, more resources to process it
      • CMC team is very happy
        • Faster results
        • Better quality of results due to more data
        • More detailed results
      • => Shorter research phase, more and better stories
      • => Fewer requests & less workload for BI

  32. Present data pipeline 2016 – use cases & status quo
      Uses for price information
      - Monitoring price parity in hotel market
      - Anomaly and fraud detection
      - Price feed for online marketing
      - Display of price development and delivering price alerts to website visitors
      Other data sources and usage
      - Clicklog information from our website and mobile app
      - Used for marketing performance analysis, product tests, invoice generation etc
      Status quo
      - Our entire BI business logic runs on and through the kafka – hadoop pipeline
      - Almost all departments rely on data, insights and metrics delivered by hadoop
      - Most of the company could not do their job without hadoop data

  35. Future data pipeline 2016/2017. Diagram labels: message format (CSV, Protobuf / Avro); Camus; stream processing (Kafka Streams, Streaming SQL); CMC.

  36. Future data pipeline 2016/2017. Diagram labels: message format (CSV, Protobuf / Avro); Kafka Connect or Gobblin; stream processing (Kafka Streams, Streaming SQL); CMC.

  38. Future data pipeline 2016/2017. Diagram labels: message format (CSV, Protobuf / Avro); Kafka Connect or Gobblin; Kylin / HBase; stream processing (Kafka Streams, Streaming SQL); CMC.

  39. Future data pipeline 2016/2017. Diagram labels: message format (CSV, Protobuf / Avro); stream processing (Kafka Streams, Streaming SQL); CMC.

  40. Future data pipeline 2016/2017. Diagram labels: CMC, Streams local state. * https://www.confluent.io/blog/unifying-stream-processing-and-interactive-queries-in-apache-kafka/

  42. Key challenges and learnings
      Mastering hadoop
      - Finding your log files
      - Interpreting error messages correctly
      - Understanding settings and how to use them to solve problems
      - Store data in wide, denormalised Hive tables in parquet format and nested data types
      Using hadoop
      - Offer easy hadoop access to users (Impala / Hive JDBC with visualisation tools)
      - Educate users on how to write good code, strict guidelines and code review
      - Deployment process: jenkins deploys git repository with oozie definitions and hive scripts to hdfs
