PostgreSQL as a Big Data Platform Chris Travers May 10, 2019

Introduction Our Environments Conclusions PostgreSQL as a Big Data Platform Chris Travers May 10, 2019

About Adjust
Adjust provides mobile advertisement attribution and analytics on mobile application events. On average we track advertisement performance for 10 applications for each smartphone in use. We focus on fairness and transparency.

About Me
• Long-time software developer
• New contributor to PostgreSQL
• Working with PostgreSQL since 1999
• Head of PostgreSQL team at Adjust GmbH

My Team
• Research and development
• Database environments supporting diverse products
• PB-scale deployments

Assumptions in Relational Model
• Optimized for mathematical manipulations via set theory
• Real implementations fall short (bags vs. sets, etc.)
• Data modelled as tuples (ideally short), where each tuple element is assumed to be atomic
• Think accounting or order management software

Traditional business intelligence (relational)
• Same approach mathematically as OLTP
• Typically data is historical, meaning that it can be inserted periodically
• Sales over time per country is a standard example
• PostgreSQL struggles a bit with BI due to table structure

Industry Trends
• Wider variety of source data for analysis (Variety)
• Real-time analytics on streams of events (Velocity)
• More data than can be managed gracefully on one machine (Volume)
Collectively these are known as "Big Data" and might include analysis of published research, Facebook posts, internet-of-things events, or MMO game data.

What Big Data Is/Is Not
Big Data Is Not:
• A set of products
• A set of technologies
• About following recipes
Big Data Is:
• About V3 problems (Volume, Velocity, Variety)
• A set of techniques
• About attention to detail
At Adjust we apply big data techniques to large, high-velocity data sets using vanilla Postgres and a lot of our own custom software.

Introducing our KPI Service Pipeline
• Environment approaching 1PB
• Delivers near-real-time analytics on user behavior
• 100k-300k requests per second
• Delivers to dashboard and external API users
• Different pieces have different availability considerations

Big Data Characteristics
• High volume and velocity
• High availability requirements for ingestion
• Distributed data warehouse queries
• Data has very large clusters of values, making ordinary sharding difficult
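To illustrate why large clusters of values make ordinary sharding difficult, here is a minimal sketch in Python. The key names, row counts, and shard count are invented for illustration and do not come from Adjust's system. It only shows that plain hash sharding keeps every row of one heavy key on a single shard, so that shard carries far more load than the others.

```python
import zlib
from collections import Counter


def shard_for(key, n_shards):
    """Plain hash sharding: the same key always maps to the same shard."""
    return zlib.crc32(key.encode("utf-8")) % n_shards


def shard_load(rows_per_key, n_shards):
    """Sum the number of rows that land on each shard."""
    load = Counter({shard: 0 for shard in range(n_shards)})
    for key, rows in rows_per_key.items():
        load[shard_for(key, n_shards)] += rows
    return load


# Hypothetical workload: one very large key and many small ones.
rows_per_key = {"big_app": 1_000_000}
rows_per_key.update({f"small_app_{i}": 1_000 for i in range(100)})

load = shard_load(rows_per_key, n_shards=4)
heavy_shard = shard_for("big_app", 4)
print(f"shard {heavy_shard} holds {load[heavy_shard]} of {sum(load.values())} rows")
```

Adding more shards does not help here, because the heavy key cannot be split by a hash of the key alone.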

Engineering Approach
• Pipeline of data
• Highly redundant initial processing nodes
• Modestly redundant customer-facing shards
• Data moves through a pipeline

Architecture
• Initial processing systems log their results
• MapReduce to customer-facing shard databases
• MapReduce again in delivering data to client
• Covered in "PostgreSQL At 20TB and Beyond"
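The MapReduce step when delivering data to a client can be sketched roughly as follows. This is an illustration of the general technique only, not Adjust's implementation: the shards are plain in-memory lists standing in for PostgreSQL shard databases, and the row layout (key, value) is an assumption.

```python
from collections import Counter


def map_shard(rows):
    """Map step: aggregate (key, value) rows within a single shard."""
    partial = Counter()
    for key, value in rows:
        partial[key] += value
    return partial


def reduce_partials(partials):
    """Reduce step: merge the per-shard partial aggregates into one result."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


# Hypothetical shards, each holding (country, installs) rows.
shards = [
    [("DE", 5), ("US", 3), ("DE", 2)],
    [("US", 4), ("FR", 1)],
]
result = reduce_partials(map_shard(rows) for rows in shards)
print(dict(result))
```

Each shard only ships its partial aggregate, so the amount of data crossing the network depends on the number of distinct keys rather than the number of raw events.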

PostgreSQL Challenges
• PostgreSQL FDW is too latency-sensitive to use between datacenters
• Multiple inheritance, used for some advanced features, makes data schema changes difficult
• Our shards' WAL traffic is measured in TB/day

Introducing Bagger
• Elasticsearch replacement
• High-velocity ingestion (1M+ data points/sec)
• Very high volume (10PB)
• Free-form data (JSON documents)
• Retention for limited time

Big Data Characteristics
• Very high velocity (up to 1M items per second ingestion)
• Very high volume (10PB of data, capped by quantity currently)
• Could include all kinds of new data at any time, so must handle a variety of semi-structured data quite gracefully

Engineering Approach
• Optimize for bulk storage and linear writes
• Use PostgreSQL JSONB and similar indexes
• Data partitioned by hour and dropped when disks are nearly full
• Client-side sharding, so dedicated client
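The "partition by hour, drop when disks are nearly full" policy might look something like the sketch below. The partition naming scheme, the 90% high-water mark, and the size figures are all assumptions made for illustration; the slides do not specify them.

```python
from datetime import datetime, timezone


def partition_name(ts, prefix="events"):
    """Name of the hourly partition for a timezone-aware timestamp (UTC)."""
    return f"{prefix}_{ts.astimezone(timezone.utc):%Y%m%d_%H}"


def partitions_to_drop(sizes, capacity, high_water=0.9):
    """Pick the oldest partitions to drop until usage is within the limit.

    sizes maps partition name to bytes used. Names sort chronologically
    because the timestamp part is fixed-width.
    """
    used = sum(sizes.values())
    limit = capacity * high_water
    drop = []
    for name in sorted(sizes):
        if used <= limit:
            break
        drop.append(name)
        used -= sizes[name]
    return drop


# Hypothetical sizes in arbitrary units, on a disk with capacity 100.
sizes = {
    "events_20190510_00": 40,
    "events_20190510_01": 40,
    "events_20190510_02": 30,
}
print(partitions_to_drop(sizes, capacity=100))
```

Dropping a whole partition is a cheap metadata operation compared with deleting rows, which fits the goal of optimizing for bulk storage and linear writes.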

Architecture
• Data arrives by Kafka, partitioned for the databases
• Data partitioned by query pattern and hour
• Partitions tracked on master databases
• Client written in Perl, which queries appropriate partitions and concatenates data
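A rough sketch of what such a client does: work out which hourly partitions a time range touches, look up where they live, query each one, and concatenate the results. The real client is written in Perl; Python is used here only for illustration. The `catalog` dictionary stands in for the master databases, and `fetch` is a hypothetical callback that would run the query against one partition.

```python
from datetime import datetime, timedelta, timezone


def hours_in_range(start, end):
    """Yield each hour bucket that overlaps the half-open range [start, end)."""
    current = start.replace(minute=0, second=0, microsecond=0)
    while current < end:
        yield current
        current += timedelta(hours=1)


def query(catalog, fetch, start, end):
    """Query every partition registered for the range and concatenate rows."""
    results = []
    for hour in hours_in_range(start, end):
        for location in catalog.get(hour, []):
            results.extend(fetch(location))
    return results


# Hypothetical catalog: hour bucket -> list of partition locations.
h10 = datetime(2019, 5, 10, 10, tzinfo=timezone.utc)
h11 = datetime(2019, 5, 10, 11, tzinfo=timezone.utc)
catalog = {h10: ["db1/events_10"], h11: ["db2/events_11"]}
data = {"db1/events_10": [{"id": 1}], "db2/events_11": [{"id": 2}, {"id": 3}]}

rows = query(catalog, data.__getitem__, h10 + timedelta(minutes=30), h11 + timedelta(minutes=5))
print(rows)
```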

PostgreSQL Challenges
• Marshalling JSONB can be expensive
• System catalogs on ZFS on spinning disk are slow
• Requires significant custom C code in triggers to keep the system fast
• Exception handling in PostgreSQL has been a source of bugs in the past

Introducing Audience Builder
• Retargeting platform (describe use case)
• Only 12TB, but expected to grow
• Non-typical query and access patterns
• Feature requests that could push this into PB range

Big Data Characteristics
• High enough velocity that saturation of NVMe storage is a concern
• Queries touch very large amounts of data
• Expect these issues to become far worse

Engineering Approach
• Separation of storage and query
• Settled on Parquet as storage format
• Columnar data storage is useful, but most software does not support our access patterns well
• Evaluated a few alternatives to PostgreSQL here
• Prioritized predictability and extensibility over peak performance

Architecture
• Parquet files on CephFS
• PostgreSQL as query engine only
• Wrote Parquet FDW for PostgreSQL
• With tuning and optimization, as fast as native files
• Data arrives via Kafka and is written to Parquet files, and these are registered with the database
• Pluggable storage might be of interest here
https://github.com/zilder/parquet_fdw

PostgreSQL Challenges
• Hundreds of thousands of tables
• Performance requires tables to be physically sorted
• Heavy reliance on streaming APIs (COPY)
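The slides do not say why physical sorting matters for performance. One common reason is range pruning: when data is sorted on a key, each file covers a contiguous key range, so a query can skip every file whose range cannot match. The sketch below illustrates that idea only; the file list, key ranges, and function name are hypothetical and unrelated to the actual Parquet FDW code.

```python
from bisect import bisect_left, bisect_right


def files_for_range(files, lo, hi):
    """Return the paths of files whose key range intersects [lo, hi].

    files is a list of (min_key, max_key, path) tuples, sorted by key and
    non-overlapping, which is what physically sorted data provides.
    """
    min_keys = [f[0] for f in files]
    max_keys = [f[1] for f in files]
    first = bisect_left(max_keys, lo)   # first file with max_key >= lo
    last = bisect_right(min_keys, hi)   # one past last file with min_key <= hi
    return [path for _, _, path in files[first:last]]


# Hypothetical file registry.
files = [(0, 99, "part_a"), (100, 199, "part_b"), (200, 299, "part_c")]
print(files_for_range(files, 150, 250))
```

Without sorting, every file could contain matching keys, and each query would have to open all of them.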

Key Takeaways
• Big Data is about technique, not technology
• PostgreSQL is quite capable in this area
• Careful attention to requirements is important
• Every big data system is different

Major Open Source Software We Use
• Apache Kafka
• PostgreSQL
• Redis
• Go
• Apache Flink
• CephFS
• Apache Spark
• Gentoo Linux

Thank You
Thanks for coming. Any questions?
chris.travers@adjust.com
