Data at Marfeel Addressing complexity at scale with the latest - PowerPoint PPT Presentation

Data at Marfeel Addressing complexity at scale with the latest technologies Alessandro Pregnolato Head of Data

What does Marfeel do? Optimize. Engage. Monetize. → A lightning fast, responsive mobile web → A sophisticated monetization layer …& more to come

700M Visits ➢ 1.75Bn Page views ➢ Tonnes of data, each month. 4 Bn Ad requests ➢ 20Tb of logs data ➢ GA Java App Insight SSPs Mongo Prometheus DFP Other Sources Logs Elastic Search Kibana

Analytics

Monitoring

Some great achievements so far... A single source of truth ● Monitoring & alerts on most KPIs ● A data-driven culture (to some extent) ● “We don’t know much about our tenants” “Not enough flexibility” “I cannot join collections, nor cross them with other data” “I cannot count articles published per day” “A big proportion of this data is not being used” “I’d like to create my own visualizations and dashboards” “We could segment by tenants’ attributes such as “We cannot look at yearly trends because there’s no historical” “I need different granularities” vertical, content type (news/evergreen), “I’d like to cross this with content and tenant data” “I cannot perform complex operations (weighted averages, keywords/tags, topics (ML), wordcount, images, “Cannot compare tenants or YoY” running totals, etc.)” video, etc.” “The tools are dictating the events modeling” “Cannot export data” “We could create audiences” “Activation and QBR reports are very limited”

What’s a Data-driven culture?

Five building blocks of a data-driven “Having clean, high-quality data, from a central source, and culture with clear metadata, is ineffective if staff can’t access it” Carl Anderson, Michael Li 1. Single source of truth 2. Data dictionary 3. Broad data access 4. Data Literacy 5. Data-driven decision making

Broad Data Access Tenants PI Monetization CS ??? … means NO BARRIERS Product SEO Dev Platform Technical (DWH Modeling & SQL) ➢ Functional (Business knowledge) ➢

Assumptions We can only report to our tenants Traffic Metrics consistent with their own ● data ( Google Analytics ) We can only report to our tenants Revenue Metrics from SSPs & ADX ● These source provide limited granularity ● Granular, accurate data requires access to paid tools (such as Google ● AdManager Premium, GA 360, etc.) whose cost is prohibitive at our scale

Assumptions Implications High-level, accurate data → fit for reporting but not for analytics ➢ Granular, approximated data → fit for analytics but not for reporting ➢ Exploiting the available data currently requires such a degree of technical ➢ and business knowledge that’s unreasonable to expect from our stakeholders

What to do?

REAL Data Accessibility (options) Tenants PI Monetization Option # 1 A troop of data Monkeys CS ??? Product SEO Dev Platform

REAL Data Accessibility (options) Tenants PI Monetization Option # 2 An army of BI developers CS ??? Product SEO Dev Platform

REAL Data Accessibility (options) Tenants PI Monetization Option # 3 Technology CS ??? Product SEO Dev Platform ? A logical layer in which to store table ➢ relationships and business rules Flexible access to the DWH data ➢ (writes SQL for you)

BI Layer - Tech Comparison Looker supported DB ● Amazon Aurora Tableau Qlikview Looker Amazon Redshift ● Apache Spark 1.5+ ● Low Low Mid Latency (in-memory) (in-memory) (from DB) ● Apache Spark 2.0 ● Aster Data Loading & Pre-processing Required Required Not Required Clustrix ● Mid Mid Low Maintenance Exasol (if managed) (if managed) (centralized) ● ● Google BigQuery Legacy SQL Development & Deployment High High Low time (dashboards) (dashboards) (data model only) ● Google BigQuery Standard SQL Google Cloud PostgreSQL ● Logical Model N N Y Google Cloud SQL ● SQL Engine N N Y ● IBM Netezza ETL Layer Y Y Not required ● MariaDB OLAP Layer Y Y Limited MemSQL ● Visualization capability High Mid/High Mid Microsoft Azure PostgreSQL ● Mobile Adaptiveness Low Mid High ● Microsoft Azure SQL Data Warehouse ● Microsoft Azure SQL Database Learning Curve High High Mid (Developers) Microsoft SQL Server 2005 ● Microsoft SQL Server 2008+ ● Learning Curve High High Low (Business Users) ● MySQL ● Oracle Flexibility Low Low High PostgreSQL ● Price Mid Mid Mid PrestoDB ● True Self-Service ● Qubole Presto Fast (In-memory) Embeds business logic ● Qubole Presto Service Fast (In-memory) Pros OLAP Layer Only one model to build OLAP Layer SAP HANA ● Advanced Visualization Restricts data interactions Snowflake ● Unlimited scaling ● Teradata Rigid ● Vector Not as powerful Requires Dashboards Development Vertica 7.1+ Cons Slower ● Limited self-exploration (Relies mostly on DB) Limited scaling (Discarded all non-distributed & high-end corporate solutions) 16

DWH Layer - Tech Comparison AWS Google Snowflake Clickhouse Redshift Big Query Speed Mid/high Mid/high Mid/high High (?) Maintenance Mid Low Mid Mid/High Dynamic resizing Limited Y Y N Concurrency Low High Mid Mid Indexes Sort/Dist Key Self-tuning Self-tuning Sort Key (primary only) Real-time Ingestion AWS Kinesis Firehose Snowpipe Y Y Complex Types N Json/XML Nested Struct Types Array Join on Array/Nested DS N (UDF?) Y (JOIN on Json/XML) Hive-like (Explode) Hive-like (Explode) Approximated Calculations N N Y Y Transactions Y Y Y N Replication N Y Y Y Fault Resistance Backup Distributed (replication) Distributed (replication) Distributed (replication) Subqueries Y Y Y N Window Functions Y Y Y N UDF (Python, JS, etc.) Y Y Y N Connectivity Extensive Extensive Extensive JDBC/ODBC only Tableau Connectivity Y Y Y JDBC/ODBC only Looker Connectivity Y Y Y N LogStash Output S3 only S3 only Y N Google Analytics integration N N Native N Cost Mid Mid/High Mid/High Low Self-tuning, fully elastic, Fully managed, linear self-scaling, Highly Tunable, on AWS, widely Pros high-concurrency, Json/XML high-concurrency, Json/XML Very Fast. Open Source adopted, previous experience support, cheap storage support, Logstash/GA integration Storage and computing are High maintenance coupled Non-standard SQL,, No UDF, No Cons Speed? Price? Pricing model Not great at handling Window Functions, No Looker concurrency connectivity

DWH Layer Pros AWS, established, widely adopted ○ Highly Tunable ○ It works ○ Concerns Storage and computing are coupled (Spectrum doesn’t quite cut it) ○ Not great at handling concurrency ○ Didn’t evolve much since 2013. Outdated (?) ○

DWH Layer Pros ○ Google (strong relationship) ○ Fully managed, linear self-scaling, high-concurrency ○ Json/XML support, GA integration Concerns Awkward pricing model - Pay per query (flat rates start from 10K per month) ○ Quite Hadoop-like. More complex to use? ○

DWH Layer Pros ○ Open Source ○ Allegedly very fast ○ Some prestigious adopters (E.G. CloudFlare) Concerns High-maintenance (concerns about the Total Cost of Ownership ) ○ Even if it was cheaper, do we need to process SO much data? ○ Is it worth the trade off? ○ Non-standard SQL ■ ■ No Subqueries No Analytic functions ■ ■ No UDF No Looker connectivity ■

DWH Layer Pros ○ Separates storage and computing. Storage is cheap ○ Handles concurrency very well ○ Semi-structured data support (Json, XML) ○ Virtual warehouses (pay per usage, predictable cost) ○ Lots of advanced, handy functionality Concerns Will it be fast enough? ○ Cost ○

Coordination Layer Apache Airflow Open Source, great community ○ Well established ○ Extremely versatile ○ Powerful ○ Distributed ○ ...& it could do some of the heavy-lifting if Snowflake turned out to be too slow, or expensive

Hi-level data architecture proposal GA Java Day/Tenant/Platform Tenants App (2M records p.m.) Insight PI SSPs Monetization Mongo CS Prometheus Tenant ??? Master data DFP Config Jsons Product SEO Other Dev Apache Aggregated Sources Logs Elastic Metrics* Platform Airflow Search DSP Articles Kibana Session level tracking Deployments (sampled?) Mongo Tickets (Content) Github Looker JIRA Query VW (Snowflake) DWH Storage (Snowflake) Segment ELT VW (Snowflake) Tenants Integrations Tracking (Intercom, Salesforce, Amplitude, etc.) (MPress, Leroy, Insight usage, etc.)

Proof of Concept (PoC) - Objectives Before proceeding, we must validate that: Looker did a good job at: ○ Resolving the complexity and fragmentation of our data sources ■ Removing (most) barriers to Data Accessibility & Literacy by modeling the ■ required technical and business knowledge into its logical layer Snowflake could scale (both in terms of performance and costs) ○

Proof of Concept (PoC) CEO Product Monetization PI Looker dashboard and data exploration POC Looker Model Pageviews & Visits (varys) Data Warehouse Staging and processing (ELT) Logical layer Analytics data model Raw Json Data GA Connect, SSP, ADX, etc. Data PI

THE RESULT

Looking back...

Data at Marfeel Addressing complexity at scale with the latest - PowerPoint PPT Presentation

Data at Marfeel Addressing complexity at scale with the latest technologies Alessandro Pregnolato Head of Data What does Marfeel do? Optimize. Engage. Monetize. A lightning fast, responsive mobile web A sophisticated monetization

DataCamp Data Types for Data Science DataCamp Data Types for Data Science Data types Data type

Environmental Health Science Data Streams Data Streams Health Data Health Data Brian S.

Diagnose data for cleaning Cleaning Data in Python Cleaning data Prepare data for analysis

CS378 Introduction to Data Mining Data Exploration and Data Preprocessing Li Xiong Data

Data Preparation Discretization Data cleaning (Data pre-processing) Data

Business Statistics CONTENTS The role of data The data matrix Data types Aspects of data

Data Preparation Data cleaning Discretization (Data preprocessing) Data

Data Presentation and Collection Week 2 Prepared by: Nurazrin Jupri Types of data Data

Big Data Analytics Armistead Boyd SVP, Product & Data Partnerships October 25, 2016 What is

61A Lecture 30 Announcements Data Processing Data Processing 4 Data Processing Many data sets

Data Preparation Data cleaning Data integration and transformation (Data

DATA MINING LECTURE 2 What is data? The data mining pipeline What is Data Mining? Data

Introduction What is data mining? to Data Mining: On what kind of data? Data Mining

Data types and data structures Book definition An abstract data type is a programmer-defined data

arato@biconsulting.hu rstats.budapestbi.hu R and Big Data Master Code Code Code Data Data

Introduction and lists Jason Myers Instructor DataCamp Data Types for Data Science Data types

Google Analytics: Reporting & Metrics That Matter Presenter: Moby Siddique Moby Siddique

Google Analytics Joe Knox & Nick Manning Drupal Camp 2016 Hello. Joe Knox Nick Manning

10 Analytics & Optimization From Code to Product gidgreen.com/course Lecture 10

GOOGLE GOOGLE ANAL ANALYTICS YTICS 101 101 #WOWW WOWWEBIN EBINAR AR Private and

Good Morning! MCS2273/MJR2204 Introduction to Web Design January 2017 Ulrich Werner Three Ways

GA Technical Setup ACCount Installation & Google Tag Manager Views Tracking code

Interpreting Analytics Data-Informed Decision Making Ashley Martin Executive Director,

Social Media Analytics: Do you know your numbers? Quick Intro Cody Damon, President of Media

Sambuz

Useful Links

Newsletter

Mail Us

Data at Marfeel Addressing complexity at scale with the latest - PowerPoint PPT Presentation

Data at Marfeel Addressing complexity at scale with the latest technologies Alessandro Pregnolato Head of Data What does Marfeel do? Optimize. Engage. Monetize. A lightning fast, responsive mobile web A sophisticated monetization

DataCamp Data Types for Data Science DataCamp Data Types for Data Science Data types Data type

Environmental Health Science Data Streams Data Streams Health Data Health Data Brian S.

Diagnose data for cleaning Cleaning Data in Python Cleaning data Prepare data for analysis

CS378 Introduction to Data Mining Data Exploration and Data Preprocessing Li Xiong Data

Data Preparation Discretization Data cleaning (Data pre-processing) Data

Business Statistics CONTENTS The role of data The data matrix Data types Aspects of data

Data Preparation Data cleaning Discretization (Data preprocessing) Data

Data Presentation and Collection Week 2 Prepared by: Nurazrin Jupri Types of data Data

Big Data Analytics Armistead Boyd SVP, Product &amp; Data Partnerships October 25, 2016 What is

61A Lecture 30 Announcements Data Processing Data Processing 4 Data Processing Many data sets

Data Preparation Data cleaning Data integration and transformation (Data

DATA MINING LECTURE 2 What is data? The data mining pipeline What is Data Mining? Data

Introduction What is data mining? to Data Mining: On what kind of data? Data Mining

Data types and data structures Book definition An abstract data type is a programmer-defined data

arato@biconsulting.hu rstats.budapestbi.hu R and Big Data Master Code Code Code Data Data

Introduction and lists Jason Myers Instructor DataCamp Data Types for Data Science Data types

Google Analytics: Reporting &amp; Metrics That Matter Presenter: Moby Siddique Moby Siddique

Google Analytics Joe Knox &amp; Nick Manning Drupal Camp 2016 Hello. Joe Knox Nick Manning

10 Analytics &amp; Optimization From Code to Product gidgreen.com/course Lecture 10

GOOGLE GOOGLE ANAL ANALYTICS YTICS 101 101 #WOWW WOWWEBIN EBINAR AR Private and

Good Morning! MCS2273/MJR2204 Introduction to Web Design January 2017 Ulrich Werner Three Ways

GA Technical Setup ACCount Installation &amp; Google Tag Manager Views Tracking code

Interpreting Analytics Data-Informed Decision Making Ashley Martin Executive Director,

Social Media Analytics: Do you know your numbers? Quick Intro Cody Damon, President of Media

Sambuz

Useful Links

Newsletter

Mail Us

Big Data Analytics Armistead Boyd SVP, Product & Data Partnerships October 25, 2016 What is

Google Analytics: Reporting & Metrics That Matter Presenter: Moby Siddique Moby Siddique

Google Analytics Joe Knox & Nick Manning Drupal Camp 2016 Hello. Joe Knox Nick Manning

10 Analytics & Optimization From Code to Product gidgreen.com/course Lecture 10

GA Technical Setup ACCount Installation & Google Tag Manager Views Tracking code