Women in Big Data x Pinterest Welcome!


SLIDE 1

SLIDE 2

Women in Big Data x Pinterest

SLIDE 3

Welcome!

Regina Karson, WiBD Chapter Director
Tian-Ying Chang, Engineering Manager

SLIDE 4

SLIDE 5

SLIDE 6

Goku: Pinterest's in-house Time-Series Database

Tian-Ying Chang

  • Sr. Staff Engineer Manager

Pinterest

SLIDE 7

  • Discover new ideas and find inspiration to do the things they love
    ○ 300M+ MAU, billions of pins
  • Metrics for monitoring site health
    ○ Latency, QPS, CPU, memory
  • Metrics about product quality
    ○ MAU, impressions, etc.
  • Monitoring service needs to be fast, reliable and scalable
SLIDE 8

  • Graphite
    ○ Easy to set up at small scale
    ○ Down-sampling supports long-range queries well
    ○ Hard to scale
    ○ Deprecated at Pinterest's current scale
  • OpenTSDB
    ○ Rich query and tagging support
    ○ Easy to scale horizontally with the underlying HBase cluster
    ○ Long latency for high-cardinality data
    ○ Long latency for queries over longer time ranges
      ■ No down-sampling
    ○ Heavy GC, worsened by combined heavy write QPS and long-range scans

Monitoring at Pinterest
SLIDE 9

  • HBase schema
    ○ Row key: <metric><timestamp>[<tagk1><tagv1><tagk2><tagv2>...] (metric, tag keys and tag values are each encoded in 3 bytes)
    ○ Column qualifier: <delta to row key timestamp (up to 4 bytes)>
  • Unnecessary scan (see the sketch below)
    ○ Query: m1{rpc=delete} [t1 to t2]
    ○ <m1><t1><host=h1><rpc=delete>
    ○ <m1><t1><host=h1><rpc=get>
    ○ <m1><t1><host=h1><rpc=put>
    ○ <m1><t2><host=h2><rpc=delete>
  • Data size
    ○ 20 bytes per data point
  • Aggregation
    ○ Read data onto one OpenTSDB instance and aggregate there
    ○ Ex. ostrich.gauges.singer.processor.stuck_processors{host=*}
  • Serialization
    ○ JSON: super slow when there are many data points to return

Why OpenTSDB is not a good fit

[Diagram: a single OpenTSDB instance in front of multiple HBase region servers]
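A minimal Python sketch of the "unnecessary scan" problem listed above: because rows for one metric and time range sit contiguously by row key, a tag-filtered query still reads every row in the range and only discards the non-matching ones afterwards. The data and helper function are illustrative, not OpenTSDB code.

```python
# Rows as (metric, timestamp, tags, value); contiguous by (metric, timestamp).
rows = [
    ("m1", "t1", {"host": "h1", "rpc": "delete"}, 10.0),
    ("m1", "t1", {"host": "h1", "rpc": "get"},     4.0),
    ("m1", "t1", {"host": "h1", "rpc": "put"},     7.0),
    ("m1", "t2", {"host": "h2", "rpc": "delete"}, 12.0),
]

def query(metric, tag_filter, t_start, t_end):
    # The HBase scan covers [metric + t_start, metric + t_end]: every row is read...
    scanned = [r for r in rows if r[0] == metric and t_start <= r[1] <= t_end]
    # ...and only afterwards filtered by tags on the OpenTSDB side.
    matched = [r for r in scanned
               if all(r[2].get(k) == v for k, v in tag_filter.items())]
    return scanned, matched

scanned, matched = query("m1", {"rpc": "delete"}, "t1", "t2")
print(len(scanned), "rows scanned,", len(matched), "rows matched")  # 4 scanned, 2 matched
```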

SLIDE 10

Goku is here to save

SLIDE 11

OpenTSDB

[Diagram: Statsboard (read client) and Ingestor (write client, via Kafka) in front of OpenTSDB boxes backed by HBase region servers; separate write and read paths]

  • Read/write requests are sent to a randomly selected OpenTSDB box, and then routed to the corresponding region server based on row key range
  • Reads: raw data is read from the individual HBase region servers, sent to an OpenTSDB box, aggregated there, and the result is then sent to the client

SLIDE 12

Goku cluster

[Diagram: Statsboard (read client) and Ingestor (write client, via Kafka) in front of a cluster of Goku boxes; separate write and read paths]

  • A Goku box is not only a storage engine, but also:
    ○ a proxy that routes requests
    ○ an aggregation engine
  • Clients can send requests to any Goku box, which routes them onward
    ○ Scatter and gather

SLIDE 13

Two-level sharding

  • Group# hashed from the metric name
    ○ E.g. tc.metrics.rpc_latency
  • Shard# hashed from the metric + the set of tag keys and values
    ○ E.g. tc.metrics.rpc_latency{rpc=put,host=m1}
  • Controls read fanout while keeping it easy to scale out an individual group (see the sketch below)

[Diagram: shards G1:S1 ... G4:S2 spread across Goku boxes, with the query flow:
1. Request sent to a random Goku box
2. Compute sharding (e.g. to G2:S1 and G2:S2), then look up the shard config
3. Route requests
4. Retrieve data and aggregate locally
5. Another aggregation on the routing box
6. Return the response]
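A small Python sketch of the two-level routing described above. The group and shard counts, the hash function, and the function names are assumptions for illustration; only the idea (group from the metric name, shard from the full series identity) comes from the slide.

```python
import hashlib

NUM_GROUPS = 4          # assumed for illustration
SHARDS_PER_GROUP = 3    # assumed for illustration

def _hash(s: str) -> int:
    # Any stable hash works for the sketch; the real system's choice may differ.
    return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")

def route(metric: str, tags: dict) -> tuple:
    # Group# comes from the metric name alone, so a query on the bare metric
    # only fans out to the shards of one group.
    group = _hash(metric) % NUM_GROUPS
    # Shard# comes from the full series identity (metric + sorted tag set),
    # so all points of one time series land on one shard within that group.
    series_key = metric + "".join(f"{k}={v}" for k, v in sorted(tags.items()))
    shard = _hash(series_key) % SHARDS_PER_GROUP
    return group, shard

print(route("tc.metrics.rpc_latency", {"rpc": "put", "host": "m1"}))
```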
SLIDE 14

Goku #1. Time Series Database based on Beringei

SLIDE 15

Beringei

  • In-memory key-value store
    ○ Key: string
    ○ Value: list of <timestamp, value> pairs
  • Gorilla compression (see the sketch below)
    ○ Delta-of-delta encoding on timestamps
    ○ XOR-based encoding on values
  • Stores the most recent 24 hours of data (configurable)
  • One level of sharding to distribute data
  • Data point size reduced from 20 bytes to 1.37 bytes

[Diagram: per-shard time-series buckets with Gorilla encode on the write path and Gorilla decode on the read path, persisted to disk]
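A tiny Python sketch of the delta-of-delta idea behind the timestamp compression mentioned above: evenly spaced points collapse to a run of zeros, which Gorilla can then pack into very few bits. This shows only the arithmetic, not the actual bit packing.

```python
def delta_of_delta(timestamps):
    # Keep the first timestamp, then store how much each gap differs from the
    # previous gap; regular intervals (e.g. one point per minute) become 0.
    out = [timestamps[0]]
    prev_delta = 0
    for prev, cur in zip(timestamps, timestamps[1:]):
        delta = cur - prev
        out.append(delta - prev_delta)
        prev_delta = delta
    return out

print(delta_of_delta([1000, 1060, 1120, 1180, 1241]))
# [1000, 60, 0, 0, 1] -- mostly zeros for evenly spaced points
```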

SLIDE 16

Goku #2. Query Engine -- Inverted Index

SLIDE 17

Inverted Index

  • A map from a search term to its bitset of time series ids
  • Built while processing incoming data points
  • Fast lookup when serving queries
  • Supports query filters (see the sketch below)
    ○ ExactMatch: metricname{host=h1,api=get} => intersect the bitsets of metricname, host=h1 and api=get
    ○ Or: metricname{host=h1|h2} => union the bitsets of host=h1 and host=h2, then intersect with the bitset of metricname
    ○ Nor: metricname{host=not_literal_or(h1|h2)} => remove the bitsets of host=h1 and host=h2 from the bitset of metricname
    ○ Wildcard: (a) metricname{host=*} => intersect the bitsets of metricname and host=*; (b) metricname{host=h*} => convert to a regex filter
    ○ Regex: metricname{host=h[1|2].*, api=get, az=us-east-1} => apply the other filters first, then build a regex pattern from the filter values and iterate over the full metric names of all ids that survived the other filters

[Diagram: Goku Phase #1 -- per-shard buckets plus the inverted index, with Gorilla encode on the write path, Gorilla decode on the read path, and persistence to disk]
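A small Python sketch of the bitset operations behind the ExactMatch, Or and Nor filters above, using plain sets of series ids in place of real bitsets; the index contents are made up for illustration.

```python
# Sketch of the inverted index: search term -> set of time series ids.
index = {
    "metricname": {0, 1, 2, 3},
    "host=h1":    {0, 1},
    "host=h2":    {2},
    "api=get":    {0, 2, 3},
}

def exact_match(*terms):
    # ExactMatch: intersect the bitsets of every term.
    result = index[terms[0]].copy()
    for term in terms[1:]:
        result &= index[term]
    return result

def or_filter(metric, alternatives):
    # Or: union the alternatives, then intersect with the metric's bitset.
    union = set().union(*(index[a] for a in alternatives))
    return index[metric] & union

def nor_filter(metric, excluded):
    # Nor: remove the excluded terms' bitsets from the metric's bitset.
    return index[metric] - set().union(*(index[e] for e in excluded))

print(exact_match("metricname", "host=h1", "api=get"))   # {0}
print(or_filter("metricname", ["host=h1", "host=h2"]))   # {0, 1, 2}
print(nor_filter("metricname", ["host=h1", "host=h2"]))  # {3}
```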

SLIDE 18

Goku #3. Query Engine -- Aggregation

SLIDE 19

Aggregation

  • Post-processing after retrieving all relevant time series
  • Mimics OpenTSDB's aggregation layer
  • Supports basic aggregators, including SUM, AVG, MAX, MIN, COUNT, DEV, and downsampling
  • Versus OpenTSDB (see the sketch below)
    ○ OpenTSDB does aggregation on a single instance, since HBase region servers don't know how to aggregate
    ○ Goku does aggregation in two phases: first on each leaf Goku node, and second on the routing Goku node
    ○ This distributes the computation and saves data on the wire

[Diagram: Goku Phase #1 -- per-shard buckets with the inverted index and aggregation, Gorilla encode/decode, and persistence to disk]
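A minimal Python sketch of the two-phase aggregation for a SUM query: each leaf sums its own matching series first, so only one partial series per leaf crosses the wire, and the routing node combines the partials. The data is invented, and non-decomposable aggregators (e.g. percentiles) need more than this.

```python
leaf_data = {
    "leaf_1": [[1.0, 2.0, 3.0], [4.0, 4.0, 4.0]],  # per-series values, aligned timestamps
    "leaf_2": [[0.5, 0.5, 0.5]],
}

def leaf_aggregate(series_list):
    # Phase 1: each leaf Goku node sums its own series locally.
    return [sum(values) for values in zip(*series_list)]

def root_aggregate(partials):
    # Phase 2: the routing Goku node combines the partial results.
    return [sum(values) for values in zip(*partials)]

partials = [leaf_aggregate(series) for series in leaf_data.values()]
print(root_aggregate(partials))  # [5.5, 6.5, 7.5]
```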

SLIDE 20

AWS EFS

SLIDE 21

AWS EFS

  • Stores log and data files for recovery
  • POSIX compliant
  • Data durability
  • Operated asynchronously, so latency isn't an issue
  • Easy to move shards
  • Easy to use on AWS

[Diagram: Goku Phase #1 with AWS EFS as the persistence layer behind the per-shard buckets, inverted index, and aggregation engine]

SLIDE 22

Phase #2: Disk-based Goku

SLIDE 23

Goku Phase #2 -- Disk-based

  • A Hadoop job constantly runs to compact data onto disk with downsampling (see the sketch below)
  • Data is stored in S3 for better availability and lower cost
  • RocksDB is used for online serving of the data

[Diagram: Goku Phase #2 -- per-group/per-shard buckets with the inverted index, aggregation, and Gorilla encode/decode backed by AWS EFS, plus a distributed KV store (Rock Store) fed by the Hadoop compaction job, with data in S3]
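A minimal Python sketch of the kind of downsampling such a compaction job performs: averaging raw points into fixed-width buckets before writing the longer-retention copy. The bucket width and the choice of averaging are assumptions; the slide does not specify them.

```python
def downsample(points, bucket_seconds=3600):
    # Average raw (timestamp, value) points into fixed-width time buckets.
    buckets = {}
    for ts, value in points:
        bucket_start = ts - (ts % bucket_seconds)
        buckets.setdefault(bucket_start, []).append(value)
    return sorted((start, sum(vals) / len(vals)) for start, vals in buckets.items())

raw = [(0, 1.0), (1800, 3.0), (3600, 10.0)]
print(downsample(raw))  # [(0, 2.0), (3600, 10.0)]
```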

SLIDE 24

  • Replication
    ○ Currently dual-write to two clusters for fault tolerance
    ○ Replication to improve availability and consistency
  • More query possibilities
    ○ TopK
    ○ Percentile
  • Analytics use case
    ○ Another big consumer of time-series data

Next steps for Goku
SLIDE 25

Thanks!

SLIDE 26

Scheduling Asynchronous Tasks at Pinterest

Isabel Tallam, Data (Core Services) Team, Pinterest

SLIDE 27

Why asynchronous tasks?
Asynchronous task processing service
Design considerations

SLIDE 28

Why asynchronous tasks?

[Slide graphic: spam content]

SLIDE 29

Why asynchronous tasks?
Asynchronous task processing service
Design considerations

SLIDE 30

Pinlater

Asynchronous Task Processing Service

Pinlater features

  • High throughput
  • Easily create new tasks
  • At-least-once guarantee
  • Strict ack mechanism
  • Metrics and debugging support
  • Different task priorities
  • Scheduling future tasks
  • Python, Java support
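A toy Python sketch of how an at-least-once guarantee with a strict ack mechanism can work: a dequeued task is only leased, an explicit ack deletes it, and a timeout monitor puts expired, unacked tasks back in the queue. The class and method names are illustrative, not Pinlater's actual API.

```python
import time

class TaskQueue:
    def __init__(self, lease_seconds=60):
        self.lease_seconds = lease_seconds
        self.pending = {}    # task_id -> body
        self.in_flight = {}  # task_id -> (body, lease_expiry)
        self._next_id = 0

    def enqueue(self, body):
        self._next_id += 1
        self.pending[self._next_id] = body
        return self._next_id

    def dequeue(self):
        if not self.pending:
            return None
        task_id, body = self.pending.popitem()
        # The task is leased, not deleted: it is only gone once acked.
        self.in_flight[task_id] = (body, time.time() + self.lease_seconds)
        return task_id, body

    def ack(self, task_id):
        # Strict ack: only this call removes the task for good.
        self.in_flight.pop(task_id, None)

    def requeue_expired(self):
        # What a timeout monitor would do: put unacked, expired tasks back,
        # so a crashed worker's task is retried (at-least-once delivery).
        now = time.time()
        for task_id, (body, expiry) in list(self.in_flight.items()):
            if expiry < now:
                del self.in_flight[task_id]
                self.pending[task_id] = body
```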
SLIDE 31

Pinlater

Asynchronous Task Processing Service

Pinlater components

[Diagram: clients send insert requests to Pinlater servers; Pinlater workers dequeue tasks and ack them; tasks are persisted in sharded storage, each shard with a master and a slave]

SLIDE 32

Pinlater

Asynchronous Task Processing Service

Pinlater stats
  • ~1,000 different tasks defined
  • ~8 billion task instances processed per day
  • ~3,000 Pinlater hosts

SLIDE 33

Why asynchronous tasks?
Asynchronous task processing service
Design considerations

SLIDE 34

Pinlater

Asynchronous Task Processing Service

Storage layer

[Diagram: Pinlater servers backed by sharded storage, each shard with a master and a slave, plus a cache]

SLIDE 35

Pinlater

Asynchronous Task Processing Service

Handling failures in the system

[Diagram: clients, Pinlater servers, workers, and master/slave storage; a timeout monitor watches for tasks that were dequeued but never acked]

SLIDE 36

Pinlater

Asynchronous Task Processing Service

Thank You!

SLIDE 37

Experimentation at Pinterest

Lu Yang, Data (Data Analytics - Core Product Data) Team, Pinterest

SLIDE 38

Outline

1. Background
2. Platform
3. Architecture

SLIDE 39

What is an A/B experiment?

It is a method to compare two (or more) variations of something to determine which one performs better against your target metrics.

SLIDE 40

With an Experiment Mindset

Idea → Feature development → Release to a small % of users → Measure impact → Release to 100% of users based on the impact of the sample launch

A randomized, controlled trial with measurement (see the bucketing sketch below):
  • Existing code: CONTROL
  • Changed code: ENABLED
  • Everyone else: not in the experiment
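A hedged Python sketch of how users might be bucketed deterministically into those groups: hashing the experiment name and user id means a user always lands in the same group, and only a small slice of all users is in the experiment at first. The hash, split and percentages are assumptions, not Pinterest's actual implementation.

```python
import hashlib

def assign(experiment: str, user_id: str, percent_in_experiment: int = 10) -> str:
    # Same (experiment, user) always hashes to the same bucket in [0, 100).
    bucket = int(hashlib.md5(f"{experiment}:{user_id}".encode()).hexdigest(), 16) % 100
    if bucket >= percent_in_experiment:
        return "not_in_experiment"
    # Inside the experiment, split evenly between control and enabled.
    return "control" if bucket % 2 == 0 else "enabled"

print(assign("new_ranking_model", "user_42"))
```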

SLIDE 41

Number of Experiments Over Time

SLIDE 42

Experimentation by the Numbers

  • 140+ new experiments per week
  • 7 languages and platforms
  • 800+ different metrics
  • <5 min to set up an experiment
  • 8.3/10 developer satisfaction score

SLIDE 43

Outline

1. Background
2. Platform
3. Architecture

SLIDE 44

Typical Experiment Timeline

Experiments vary, but here is a typical timeline:

Idea → Form a hypothesis → Feature development → Experiment setup and launch → Analyze results → Make a decision and iterate

SLIDE 45

Experimentor

Platform

SLIDE 46

Gatekeeper

Platform

Languages: Python, Java, Scala, Go, ...
Platforms: web, iOS, Android, ...

SLIDE 47

Experiment Dashboard

Platform

SLIDE 48

Outline

1. Background
2. Platform
3. Architecture

SLIDE 49

SLIDE 50

Experiment Data Pipeline

[Diagram: batch processing and real-time analytics paths in the experiment data pipeline, backed by HBase]

SLIDE 51

Airflow at Pinterest

Indy Prentice, Data (Big Data Compute) Team, Pinterest

SLIDE 52

What is a workflow?

  • Define: what to run, dependencies, etc.
  • Schedule: when to run it (a minimal example follows below)
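Since the talk later moves to Airflow, here is a minimal, generic Airflow 2.x DAG showing how a workflow expresses both parts: what to run and its dependencies, plus when to run it. The DAG id, task ids, and commands are made-up placeholders, not Pinterest code.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="example_metrics_workflow",  # hypothetical name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",         # when to run it
    catchup=False,
) as dag:
    # What to run: two tasks and a dependency between them.
    extract = BashOperator(task_id="extract", bash_command="echo extract")
    aggregate = BashOperator(task_id="aggregate", bash_command="echo aggregate")

    extract >> aggregate                # aggregate runs only after extract succeeds
```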

SLIDE 53

Scheduled with Pinball

*check it out at https://github.com/pinterest/pinball!

[Diagram: a Pinball coordinator dispatching jobs to workers with fixed slots -- "train the visual search model", "make the search index", "rank the ads", "download all pin images", "count number of Pinterest users (get it?)", "find great recommendations", "calculate experiment metrics" -- plus a free spot]

SLIDE 54

Fixed number of workers and slots

Unfortunately…

SLIDE 55

Fixed number of workers and slots

[Diagram: the same coordinator/worker layout at a quiet moment, with most slots sitting free while only "train the visual search model" and "calculate experiment metrics" run]

SLIDE 56

Fixed number of workers and slots

[Diagram: the same layout at a busy moment, with every slot occupied and no free spot left for new jobs]

SLIDE 57

  • Fixed number of workers and slots
  • Shared environment

Unfortunately…

SLIDE 58

Shared host resources

[Diagram: many unrelated jobs packed onto the same coordinator and worker hosts, all competing for the hosts' resources]

SLIDE 59

Shared codebase

[Diagram: the same jobs again, all running out of a single shared codebase]

SLIDE 60

  • Fixed number of workers and slots
  • Shared environment
  • Implementation doesn't scale

Unfortunately…

SLIDE 61

  • Industry + community support
  • Will it support our scale?
    ○ We think so
  • Will it solve the problems with Pinball?
    ○ With the Kubernetes executor

SLIDE 62

Enter Spinner

(get it???)

  • Jobs run in containers
  • Containers are scheduled by Kubernetes
  • The scheduler submits jobs to run on k8s

[Diagram: a physical machine (8 CPUs, 10 GB of disk space, some reserved as "can't touch this") hosting containers with declared requests -- train the visual search model (4 CPUs, 4 GB of disk), download all images (10 GB of disk), generate experiment metrics (2 CPUs, 1 GB of disk); see the sketch below]
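A small Python sketch of why those declared requests matter: the scheduler can pack jobs onto a machine only while their summed requests still fit. The numbers come from the slide; the dictionary shapes and the helper are illustrative, not a Kubernetes API.

```python
machine = {"cpu": 8, "disk_gb": 10}

jobs = {
    "train_visual_search_model":   {"cpu": 4, "disk_gb": 4},
    "download_all_images":         {"cpu": 0, "disk_gb": 10},  # slide only gives disk
    "generate_experiment_metrics": {"cpu": 2, "disk_gb": 1},
}

def fits(machine, selected):
    # A job set fits only if the summed requests stay within the machine's capacity.
    used_cpu = sum(jobs[j]["cpu"] for j in selected)
    used_disk = sum(jobs[j]["disk_gb"] for j in selected)
    return used_cpu <= machine["cpu"] and used_disk <= machine["disk_gb"]

# The scheduler can co-locate jobs whose declared requests fit...
print(fits(machine, ["train_visual_search_model", "generate_experiment_metrics"]))  # True
# ...but not a job that would exhaust the machine's disk alongside others.
print(fits(machine, ["download_all_images", "generate_experiment_metrics"]))        # False
```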

SLIDE 63

Enter Spinner

(get it???)

  • Jobs run in containers
  • Containers are scheduled by Kubernetes
  • The scheduler submits jobs to run on k8s

[Diagram: tasks such as "train the visual search model", "make the search index", "rank the ads", "generate experiment metrics", "index the pins", "write 10TB of data to disk!", and "count number of Pinterest users" being placed onto servers]

SLIDE 64

Enter Spinner

(get it???)

  • Jobs run in containers
  • Containers are scheduled by Kubernetes
  • The scheduler submits jobs to run on k8s

SLIDE 65

  • Autoscaling happens at the k8s level
  • Jobs run in isolated containers
  • Lightweight maintenance

On Airflow...

SLIDE 66

  • Integration with Pinterest k8s infrastructure
  • Scheduler scalability
  • Integration with existing composers

Airflow @ Pinterest

Adoption challenges

SLIDE 67

Machine Learning and Big Data on K8S at Pinterest

June Liu, Infrastructure (Core Cloud - Cloud Management Platform) Team, Pinterest

SLIDE 68
  • Motivation
  • Architecture
  • Journey to K8S
  • The Future

Agenda

SLIDE 69

Motivation

  • Unify orchestration and big data infrastructure
    ○ Simplify the tech stack and reduce operational overhead
  • Trending community support for ML and big data workloads on K8S
  • Better interfaces and richer features, including CNI, CSI, autoscaling, etc.

SLIDE 70

Architecture

SLIDE 71

SLIDE 72

Journey to Kubernetes

SLIDE 73

Sidecar

No good support for the sidecar lifecycle of a run-to-finish Pod

  • A lot of sidecars to nanny workloads
    ○ Security
    ○ Metadata
    ○ Logs
    ○ Metrics
    ○ Traffic, service discovery
    ○ ...

SLIDE 74

Sidecar

No good support for the sidecar lifecycle of a run-to-finish Pod

  • Kill the Pod?
    ○ Pod gets recreated due to reconcile
  • Force-mark the Pod state?
    ○ Confuses the scheduler and the kubelet
  • docker kill the sidecar?
    ○ Messes up the restart policy
  • Inter-container signals?
    ○ Not scalable operationally

SLIDE 75

Sidecar

No good support for the sidecar lifecycle of a run-to-finish Pod

  • Write our own and contribute!
    ○ The main container obeys the restart policy; sidecars always restart
    ○ Kill sidecars after the main container quits
    ○ Pod phase is computed from the main container's exit code

https://github.com/zhan849/kubernetes/tree/pinterest-sidecar-1.14.5

SLIDE 76

Volume

Is PVC really an option for serving models?

  • Ideally
    ○ Flexibility to select the storage medium
    ○ Data isolation
    ○ Dynamic provisioning saves money
    ○ ...

SLIDE 77

Volume

Is PVC really an option for serving models?

  • Actually…
    ○ The EBS provisioner is not efficient, nor is EC2
    ○ ~100% 500s with batch EBS provisioning

SLIDE 78

Volume

Is PVC really an option for serving models?

  • DescribeInstance: by name -> by ID (#78140)
  • Optimize EBS provisioner cloud provider calls (#78276)

Batch Size | Total Calls (Original) | Total Calls (Optimized) | Peak QPS (Original) | Peak QPS (Optimized)
50         | 1360                   | 116                     | 52                  | 8
75         | 1464                   | 139                     | 70                  | 10
100        | 2427                   | 157                     | 75                  | 8
150        | 4384                   | 209                     | 93                  | 11

SLIDE 79

AutoScaling

Node scales slower than Pod

  • This is expected, or containers would become meaningless
  • Long-tail scheduling of Parameter Servers can cause the job to fail
  • Use bogus pods with low priority as cluster buffers to scale in advance (see the sketch below)
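A hedged sketch of what such a low-priority "bogus" buffer pod could look like, written as a Python dict mirroring a Kubernetes pod manifest. Real workload pods preempt it, and the evicted placeholder nudges the cluster autoscaler to add nodes ahead of demand. The priority class name, image, and resource numbers are assumptions, not Pinterest's actual configuration.

```python
# Placeholder pod that only reserves capacity; it runs a pause container that
# does nothing, so it is cheap to evict when real workloads need the room.
buffer_pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "capacity-buffer-0"},
    "spec": {
        "priorityClassName": "overprovisioning",   # assumed low-priority class
        "containers": [{
            "name": "pause",
            "image": "k8s.gcr.io/pause:3.2",        # idle container, just holds resources
            "resources": {"requests": {"cpu": "4", "memory": "16Gi"}},
        }],
    },
}
```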

SLIDE 80

Our Future

SLIDE 81

Future Work

Node scales slower than Pod

  • Gang scheduling
  • Federation layer takes care of data / network locality
  • Fine-grained preemption, task queuing
  • More caching storage options (PVC/EBS)
    ○ https://github.com/kubeflow/community/pull/263