Women in Big Data x Pinterest
Welcome!
Regina Karson, WiBD Chapter Director Tian-Ying Chang, Engineering Manager
Goku: Pinterest's in-house Time-Series Database
Tian-Ying Chang
- Sr. Staff Engineer Manager
- Pinterest: users discover new ideas and find inspiration to do the things they love
○ 300M+ MAU, billions of pins
- Metrics for monitoring site health
○ Latency, QPS, CPU, memory
- Metrics about product quality
○ MAU, Impressions, etc.
- Monitoring service needs to be fast, reliable, and scalable
- Graphite
○ Easy to set up at small scale
○ Down-sampling supports long-range queries well
○ Hard to scale
○ Deprecated at Pinterest's current scale
- OpenTSDB
○ Rich query and tagging support
○ Easy to scale horizontally with the underlying HBase cluster
○ Long latency for high-cardinality data
○ Long latency for queries over longer time ranges
■ No down-sampling
○ Heavy GC, worsened by combined heavy write QPS and long-range scans
Monitoring at Pinterest
- HBase Schema
○ Row key: <metric><timestamp>[<tagk1><tagv1><tagk2><tagv2>...] (metric, tag keys, and tag values are each encoded in 3 bytes)
○ Column qualifier: <delta to row key timestamp (up to 4 bytes)>
- Unnecessary Scan (see the sketch after this slide)
○ Query: m1{rpc=delete} [t1 to t2]
○ <m1><t1><host=h1><rpc=delete>
○ <m1><t1><host=h1><rpc=get>
○ <m1><t1><host=h1><rpc=put>
○ <m1><t2><host=h2><rpc=delete>
- Data size
○ 20 bytes per data point
- Aggregation
○ Read all data onto one OpenTSDB instance and aggregate there
○ Ex. ostrich.gauges.singer.processor.stuck_processors{host=*}
- Serialization
○ JSON: super slow when there are many, many data points to return
Why OpenTSDB is not good fit
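The row-key layout above can be sketched roughly as follows. This is an illustrative reconstruction (metric/tag IDs, byte widths, and helper names are assumptions, not OpenTSDB's actual UID tables), showing why a tag filter still scans rows it will discard:

```python
# Rough sketch of the OpenTSDB-style row key described above.
import struct

def row_key(metric_id, timestamp, tags):
    """<metric><timestamp>[<tagk><tagv>...] with 3-byte metric/tag IDs."""
    key = metric_id.to_bytes(3, "big") + struct.pack(">I", timestamp)
    for tagk_id, tagv_id in sorted(tags.items()):
        key += tagk_id.to_bytes(3, "big") + tagv_id.to_bytes(3, "big")
    return key

# Query m1{rpc=delete} over [t1, t2]: the scan range is defined only by
# <m1><t1>..<m1><t2>, so rows with rpc=get / rpc=put are read and discarded.
rows = [
    row_key(1, 1000, {10: 100, 20: 201}),  # host=h1, rpc=delete (wanted)
    row_key(1, 1000, {10: 100, 20: 202}),  # host=h1, rpc=get    (scanned, discarded)
    row_key(1, 1000, {10: 100, 20: 203}),  # host=h1, rpc=put    (scanned, discarded)
    row_key(1, 1060, {10: 101, 20: 201}),  # host=h2, rpc=delete (wanted)
]
scanned = [r for r in rows if r[:3] == (1).to_bytes(3, "big")]
print(f"{len(scanned)} rows scanned, only 2 match the rpc=delete filter")
```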
[Diagram: OpenTSDB instances on top of HBase RegionServers]
Goku is here to save
[Diagram: Statsboard (Read Client) and Ingestor (Write Client) feed Kafka into OpenTSDB instances backed by HBase RegionServers]
- Read/Write requests are sent to a randomly selected OpenTSDB box, and then routed to the corresponding RS based on row key range
- Reads: raw data is read from individual HBase RS, sent to the OpenTSDB box, aggregated at OpenTSDB, then the result is sent to the client
[Diagram: Statsboard (Read Client) and Ingestor (Write Client) feed Kafka into a cluster of Goku boxes]
Goku cluster
- A Goku box is not only a storage engine, but also:
○ A proxy that routes requests
○ An aggregation engine
- Clients can send requests to any Goku box, which will route them
○ Scatter and gather
Two level sharding
- Group# hashed from the metric name
○ E.g. tc.metrics.rpc_latency
- Shard# hashed from the metric + the set of tagk and tagv
○ E.g. tc.metrics.rpc_latency{rpc=put,host=m1}
- Controls read fanout while keeping it easy to scale out individual groups (sharding rule sketched below)
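A minimal sketch of this two-level sharding rule. The hash function, group count, and shard count below are illustrative assumptions, not Goku's actual configuration:

```python
import hashlib

NUM_GROUPS = 4        # assumption for illustration
SHARDS_PER_GROUP = 3  # assumption for illustration

def _hash(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

def locate(metric: str, tags: dict) -> tuple:
    """Group# from the metric name; shard# from metric + full tag set."""
    group = _hash(metric) % NUM_GROUPS
    tag_str = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    shard = _hash(f"{metric}{{{tag_str}}}") % SHARDS_PER_GROUP
    return group, shard

# All series of tc.metrics.rpc_latency land in one group, so a query only
# fans out to that group's shards; each group can be scaled out on its own.
print(locate("tc.metrics.rpc_latency", {"rpc": "put", "host": "m1"}))
```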
[Diagram: Goku boxes holding shards G1:S1 through G4:S2, plus the shard config]
1. Requests sent to a random Goku box
2. Compute sharding to G2:S1 and S2, then look up the shard config
3. Route requests
4. Retrieve data and aggregate locally
5. Another aggregation
6. Return response
Goku #1. Time Series Database based on Beringei
Beringei
- In-memory key-value store
○ Key: string
○ Value: list of <timestamp, value> pairs
- Gorilla compression
○ Delta-of-delta encoding on timestamps
○ Delta encoding on values
- Stores the most recent 24 hours of data (configurable)
- One level of sharding to distribute data
- Data point size reduced
○ from 20 bytes to 1.37 bytes
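A minimal sketch of the delta-of-delta idea behind the Gorilla timestamp compression mentioned above. Real Beringei bit-packs these values; the function names here are illustrative:

```python
def encode_timestamps(timestamps):
    """Return the first timestamp plus the delta-of-delta stream."""
    header, prev_ts, prev_delta, dods = timestamps[0], timestamps[0], 0, []
    for ts in timestamps[1:]:
        delta = ts - prev_ts
        dods.append(delta - prev_delta)  # almost always 0 for regular intervals
        prev_ts, prev_delta = ts, delta
    return header, dods

def decode_timestamps(header, dods):
    timestamps, prev_ts, prev_delta = [header], header, 0
    for dod in dods:
        prev_delta += dod
        prev_ts += prev_delta
        timestamps.append(prev_ts)
    return timestamps

# Metrics emitted every 60s produce mostly zero deltas-of-deltas, which is
# what lets a ~20-byte point shrink to ~1.37 bytes once bit-packed.
ts = [1000, 1060, 1120, 1180, 1239]
header, dods = encode_timestamps(ts)
assert decode_timestamps(header, dods) == ts
print(dods)  # [60, 0, 0, -1]
```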
[Diagram: a Beringei shard: writes and reads go through Gorilla encode/decode into in-memory time-series buckets, which are also persisted to disk]
Goku #2. Query Engine -- Inverted Index
Inverted Index
- A map from each search term to its bitset
- Built while processing incoming data points
- Fast lookup when serving queries
- Supports query filters
○ Match on the filter values, then iterate over the corresponding time series
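A hypothetical sketch of the term-to-bitset index described above. Class and method names are illustrative, not Goku's internals:

```python
class InvertedIndex:
    def __init__(self):
        self.term_to_bitset = {}  # e.g. "host=h1" -> int used as a bitset of series IDs

    def add(self, series_id, terms):
        # Called while processing an incoming data point's metric name and tags.
        for term in terms:
            self.term_to_bitset[term] = self.term_to_bitset.get(term, 0) | (1 << series_id)

    def query(self, terms):
        # AND the bitsets of all filter terms; set bits are the matching series.
        result = -1  # all bits set
        for term in terms:
            result &= self.term_to_bitset.get(term, 0)
        return [i for i in range(result.bit_length()) if (result >> i) & 1]

idx = InvertedIndex()
idx.add(0, ["m1", "host=h1", "rpc=get"])
idx.add(1, ["m1", "host=h1", "rpc=delete"])
idx.add(2, ["m1", "host=h2", "rpc=delete"])
print(idx.query(["m1", "rpc=delete"]))  # [1, 2] -- only matching series are touched
```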
[Diagram: Goku Phase #1 shard: inverted index plus Gorilla encode/decode, backed by disk]
Goku #3. Query Engine -- Aggregation
Aggregation
- Post-processing after retrieving all relevant time series
- Mimics OpenTSDB's aggregation layer
- Supports basic aggregators, including SUM, AVG, MAX, MIN, COUNT, DEV, and downsampling
- Versus OpenTSDB
○ OpenTSDB does aggregation on a single instance, since HBase RS don't know how to aggregate
○ Goku does aggregation in two phases: first on each leaf Goku node, then on the routing Goku node (sketched below)
○ Distributes the computation and saves data on the wire
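A minimal sketch of the two-phase idea: partial aggregates on the leaves, merged on the routing node. The structure and names are illustrative, not Goku's actual code:

```python
def leaf_aggregate(points):
    """Phase 1, on each leaf Goku node: reduce local points to a tiny partial result."""
    total = sum(value for _, value in points)
    return {"sum": total, "count": len(points)}

def routing_aggregate(partials, aggregator="avg"):
    """Phase 2, on the routing Goku node: merge partials instead of raw points."""
    total = sum(p["sum"] for p in partials)
    count = sum(p["count"] for p in partials)
    if aggregator == "sum":
        return total
    if aggregator == "count":
        return count
    return total / count if count else None  # avg

leaf1 = leaf_aggregate([(1000, 4.0), (1060, 6.0)])   # shard on Goku box A
leaf2 = leaf_aggregate([(1000, 10.0)])               # shard on Goku box B
print(routing_aggregate([leaf1, leaf2], "avg"))      # ~6.67, without shipping raw points
```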
[Diagram: Goku Phase #1 shard: aggregation and inverted index on top of Gorilla encode/decode buckets, backed by disk]
AWS EFS
- Stores log and data files for recovery
- POSIX compliant
- Data durability
- Operated asynchronously, so latency isn't an issue
- Easy to move shards
- Easy to use on AWS
[Diagram: Goku Phase #1 shard: aggregation, inverted index, and Gorilla encode/decode, with data persisted to AWS EFS]
Phase #2 Disk based Goku
[Diagram: Goku Phase #2: in-memory shard groups (aggregation, inverted index, Gorilla encode/decode) on AWS EFS, plus a distributed KV store (Rock Store) fed by a Hadoop job]
Goku Phase #2 -- Disk based
- A Hadoop job constantly runs to compact data onto disk with downsampling (see the sketch below)
- Data is stored in S3 for better availability and lower cost
- RocksDB is used for online serving data
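A minimal downsampling sketch for the compaction step above, assuming fixed-width time buckets and a configurable rollup function; names and the bucket width are illustrative:

```python
from collections import defaultdict

def downsample(points, bucket_seconds=300, agg="avg"):
    """points: iterable of (epoch_seconds, value) -> sorted list of (bucket_start, rollup)."""
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % bucket_seconds].append(value)
    rollup = {"avg": lambda v: sum(v) / len(v), "sum": sum, "max": max, "min": min}[agg]
    return sorted((start, rollup(values)) for start, values in buckets.items())

# Older data keeps one point per 5-minute bucket instead of one per minute.
print(downsample([(0, 1.0), (60, 3.0), (300, 10.0)]))  # [(0, 2.0), (300, 10.0)]
```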
- Replication
○ Currently dual-write to two clusters for fault tolerance
○ Replication to improve availability and consistency
- More query possibilities
○ TopK
○ Percentile
- Analytics use case
○ Another big consumer of time-series data
Next step for Goku
Thanks!
Scheduling Asynchronous Tasks at Pinterest
Isabel Tallam Data (Core Services) Team Pinterest
Why asynchronous tasks? Asynchronous task processing service Design considerations
Why asynchronous tasks?
[Illustration: spam content being detected and filtered]
Why asynchronous tasks? Asynchronous task processing service Design considerations
Pinlater
Asynchronous Task Processing Service
Pinlater features
- High throughput
- Easily create new tasks
- At-least-once guarantee
- Strict ack mechanism
- Metrics and debugging support
- Different task priorities
- Scheduling future tasks
- Python, Java support
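A hypothetical sketch of the enqueue / dequeue / ack cycle behind the at-least-once guarantee and strict ack mechanism. Class and method names are illustrative, not Pinlater's real API:

```python
import time
import uuid

class TaskQueue:
    """A task stays owned by a worker until it is acked; if the ack never
    arrives, the timeout monitor puts it back, so it runs at least once."""

    def __init__(self, ack_timeout=60):
        self.pending = {}    # task_id -> (priority, run_at, payload)
        self.in_flight = {}  # task_id -> (dequeue_time, payload)
        self.ack_timeout = ack_timeout

    def enqueue(self, payload, priority=1, run_at=None):
        task_id = str(uuid.uuid4())
        self.pending[task_id] = (priority, run_at or time.time(), payload)
        return task_id

    def dequeue(self):
        now = time.time()
        ready = [(prio, run_at, tid) for tid, (prio, run_at, _) in self.pending.items()
                 if run_at <= now]
        if not ready:
            return None
        _, _, task_id = min(ready)  # lowest priority number first, then earliest run_at
        priority, run_at, payload = self.pending.pop(task_id)
        self.in_flight[task_id] = (now, payload)
        return task_id, payload

    def ack(self, task_id):
        self.in_flight.pop(task_id, None)  # success: the task is done for good

    def requeue_timed_out(self):
        # The failure path: un-acked tasks go back to pending and will run again.
        now = time.time()
        for task_id, (started, payload) in list(self.in_flight.items()):
            if now - started > self.ack_timeout:
                del self.in_flight[task_id]
                self.pending[task_id] = (1, now, payload)

q = TaskQueue()
q.enqueue({"job": "resize_image", "pin_id": 42})
task_id, payload = q.dequeue()
q.ack(task_id)  # without this ack, requeue_timed_out() would hand the task out again
```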
Pinlater
Asynchronous Task Processing Service
[Diagram: clients send insert requests to Pinlater servers, which persist tasks in master-slave storage shards; Pinlater workers dequeue tasks from the servers and ack them on completion]
Pinlater components
Pinlater
Asynchronous Task Processing Service
Pinlater Stats
- ~1000 different tasks defined
- ~8 billion task instances processed per day
- ~3000 Pinlater hosts
Why asynchronous tasks? Asynchronous task processing service Design considerations
Storage
Pinlater
Asynchronous Task Processing Service
[Diagram: the storage layer is made up of master-slave storage shards behind the Pinlater servers, with a cache in front]
Storage Layer
Pinlater
Asynchronous Task Processing Service
[Diagram: clients, Pinlater servers, Pinlater workers, and master-slave storage, with an ack timeout monitor that reclaims tasks whose ack never arrives]
Handling failures in the system
Pinlater
Asynchronous Task Processing Service
Thank You!
Experimentation at Pinterest
Lu Yang Data (Data Analytics - Core Product Data) Team Pinterest
Outline
1 Background 2 Platform 3 Architecture
What is an a/b experiment?
It is a method to compare two (or more) variations of something to determine which one performs better against your target metrics
OR
With Experiment Mindset
Idea → Feature Development → Release to a small % of users → Measure impact → Release to 100% of users based on the impact of the sample launch
A randomized, controlled trial with measurement
[Diagram: All Users split into "Not in experiment", CONTROL (existing code), and ENABLED (changed code) groups]
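A hypothetical sketch of how users can be split deterministically into these groups. The hash function, group percentages, and names are illustrative, not Pinterest's actual allocation logic:

```python
import hashlib

def assign_group(user_id, experiment, control_pct=5, enabled_pct=5):
    """Hash user + experiment so a user always lands in the same group."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in [0, 100)
    if bucket < control_pct:
        return "CONTROL"            # existing code
    if bucket < control_pct + enabled_pct:
        return "ENABLED"            # changed code
    return None                     # not in experiment

print(assign_group("user_42", "new_ranking_model"))
```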
Number of Experiments Over Time
[Chart: number of new experiments per week, growing over time]
Experimentation by the Numbers
- 140+ new experiments/wk
- 7 languages and platforms
- 800+ different metrics
- <5 min experiment setup
- 8.3/10 developer satisfaction score
Outline
1 Background 2 Platform 3 Architecture
Typical Experiment Timeline
Experiments vary, but here is a typical timeline
Idea → Form a hypothesis → Feature Development → Experiment Setup and Launch → Analyze results → Make Decision and Iterate
Experimentor
Platform
Gatekeeper
Platform
Languages: python, java, scala, go, … Platform: web, ios, android, ...
Experiment Dashboard
Platform
Outline
1 Background 2 Platform 3 Architecture
Batch processing Real-time analytics
Experiment Data Pipeline
HBase
Airflow at Pinterest
Indy Prentice Data (Big Data Compute) Team Pinterest
What is a workflow?
Define: what to run, dependencies, etc.
Schedule: when to run it
Scheduled with Pinball
*check it out at https://github.com/pinterest/pinball!
Unfortunately…
[Diagram: a Pinball coordinator assigns jobs (train the visual search model, make the search index, rank the ads, download all pin images, count the number of Pinterest users (get it?), find great recommendations, calculate experiment metrics, find duplicate pin images, index the pins) into a fixed set of worker slots, leaving some slots free while other jobs wait]
- Fixed number of workers and slots
- Shared environment
○ Shared host resources
○ Shared codebase
- Implementation doesn't scale
- Industry + community support
- Will it support our scale?
○ We think so
- Will it solve the problems with Pinball?
○ With Kubernetes executor
Enter Spinner
(get it???)
- Jobs run in containers
- Containers are scheduled by kubernetes
- Scheduler submits jobs to run on k8s
[Diagram: a physical machine (8 CPUs, 10 GBs of disk space, "can't touch this") hosting containerized jobs: train the visual search model (4 CPUs, 4 GBs of disk space), download all images (10 GBs of disk space), generate experiment metrics (2 CPUs, 1 GB of disk space)]
Enter Spinner
(get it???)
- Jobs run in containers
- Containers are scheduled by kubernetes
- Scheduler submits jobs to run on k8s
[Diagram: tasks (train the visual search model, make the search index, rank the ads, generate experiment metrics, index the pins, write 10TB of data to disk!, count number of Pinterest users) scheduled by k8s onto servers]
Enter Spinner
(get it???)
- Jobs run in containers
- Containers are scheduled by kubernetes
- Scheduler submits jobs to run on k8s
- Autoscaling happens at k8s level
- Jobs run in isolated containers
- Lightweight maintenance
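A minimal sketch of a containerized job expressed as an Airflow DAG. The import path varies by Airflow version, and the DAG id, namespace, image, and command below are illustrative, not Pinterest's actual setup:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
    KubernetesPodOperator,
)

with DAG(
    dag_id="experiment_metrics",      # illustrative DAG id
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Each task runs in its own pod, so k8s handles isolation and autoscaling.
    generate_metrics = KubernetesPodOperator(
        task_id="generate_experiment_metrics",
        name="generate-experiment-metrics",
        namespace="workflows",
        image="example.registry/metrics-job:latest",  # illustrative image
        cmds=["python", "-m", "metrics.generate"],
    )
```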
On Airflow...
- Integration with Pinterest k8s infrastructure
- Scheduler scalability
- Integration with existing composers
Airflow @ Pinterest
Adoption challenges
Machine Learning and Big Data on K8S at Pinterest
June Liu Infrastructure (Core Cloud - Cloud Management Platform) Team Pinterest
- Motivation
- Architecture
- Journey to K8S
- The Future
Agenda
Motivation
- Unify orchestration and big data infrastructure
○ Simplify the tech stack and reduce operational overhead
- Trending community support for ML and BD workloads on K8S
- Better interfaces and richer features, including CNI, CSI, Autoscaling, etc.
Architecture
Journey to Kubernetes
Sidecar
No good support for sidecar lifecycle of run-to-finish Pod
- A lot of sidecars to nanny workloads
○ Security
○ Metadata
○ Logs
○ Metrics
○ Traffic, service discovery
○ ...
Sidecar
No good support for sidecar lifecycle of run-to-finish Pod
- Kill the Pod?
○ Pod recreated due to reconcile
- Force-mark the Pod state?
○ Confuses the scheduler and Kubelet
- Docker kill the sidecar?
○ Messes up the Restart Policy
- Inter-container signals?
○ Not scalable in operation
Sidecar
No good support for sidecar lifecycle of run-to-finish Pod
- Write our own and contribute!
○ Main container obeys the restart policy, sidecars always restart
○ Kill sidecars after the main containers quit
○ Pod phase computed based on main container exit code (sketched below)
https://github.com/zhan849/kubernetes/tree/pinterest-sidecar-1.14.5
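A rough sketch of the rule above, in illustrative Python rather than the actual kubelet patch: only main-container exit codes decide the pod phase, and sidecars are torn down once the main containers finish. The function names are hypothetical:

```python
def compute_pod_phase(main_exit_codes, sidecars_running, kill_sidecars):
    """main_exit_codes: exit code per main container, None if still running."""
    if any(code is None for code in main_exit_codes):
        return "Running"
    # All main containers have quit: sidecars get killed, and the pod phase
    # is computed only from the main containers' exit codes.
    if sidecars_running:
        kill_sidecars()
    return "Succeeded" if all(code == 0 for code in main_exit_codes) else "Failed"

print(compute_pod_phase([0, 0], True, lambda: print("stopping sidecars")))  # Succeeded
print(compute_pod_phase([0, 137], False, lambda: None))                     # Failed
```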
Volume
Is PVC really an option for serving models?
- Ideally
○ Flexibility to select the storage medium
○ Data isolation
○ Dynamic provisioning saves money
○ ...
Volume
Is PVC really an option for serving models?
- Actually…
○ EBS provisioner is not efficient, nor is EC2
○ ~100% 500s with batch EBS provisioning
Volume
Is PVC really an option for serving models?
- DescribeInstance: by name -> by ID (#78140)
- Optimize EBS provisioner cloud provider calls (#78276)
Batch Size | Total Calls (Original / Optimized) | Peak QPS (Original / Optimized)
50         | 1360 / 116                         | 52 / 8
75         | 1464 / 139                         | 70 / 10
100        | 2427 / 157                         | 75 / 8
150        | 4384 / 209                         | 93 / 11
AutoScaling
Node scales slower than Pod
- This is expected, or containers become meaningless
- Long-tail scheduling of Parameter Servers can cause the job to fail
- Use bogus pods with low priority as cluster buffers to scale in advance
Our Future
Future Works
Node scales slower than Pod
- Gang scheduling
- Federation layer takes care of data / network locality
- Fine-grained preemption, task queuing
- More caching storage options (PVC/EBS)
○ https://github.com/kubeflow/community/pull/263