Scaling Pinterest: Marty Weiner, Level 83 Interwebz Geek


slide-1
SLIDE 1

Scaling Pinterest

Marty Weiner Level 83 Interwebz Geek

slide-2
SLIDE 2 Scaling Pinterest

Evolution

slide-3
SLIDE 3 Scaling Pinterest

March 2010

Growth

  • RackSpace
  • 1 small Web Engine
  • 1 small MySQL DB
  • 1 Engineer + 2 Founders

[Chart: page views per day, Mar 2010 to May 2012]

slide-4
SLIDE 4 Scaling Pinterest

March 2010

Growth

slide-5
SLIDE 5 Scaling Pinterest

January 2011

Growth

  • Amazon EC2 + S3 + CloudFront
  • 1 NGinX, 4 Web Engines
  • 1 MySQL DB + 1 Read Slave
  • 1 Task Queue + 2 Task Processors
  • 1 MongoDB
  • 2 Engineers + 2 Founders

[Chart: page views per day, Mar 2010 to Jan 2012]

slide-6
SLIDE 6 Scaling Pinterest
slide-7
SLIDE 7 Scaling Pinterest

September 2011

Growth

  • Amazon EC2 + S3 + CloudFront
  • 2 NGinX, 16 Web Engines + 2 API Engines
  • 5 Functionally Sharded MySQL DB + 9 read slaves
  • 4 Cassandra Nodes
  • 15 Membase Nodes (3 separate clusters)
  • 8 Memcache Nodes
  • 10 Redis Nodes
  • 3 Task Routers + 4 Task Processors
  • 4 Elastic Search Nodes
  • 3 Mongo Clusters
  • 3 Engineers (8 Total)

[Chart: page views per day, Mar 2010 to May 2012]

slide-8
SLIDE 8 Scaling Pinterest

It will fail. Keep it simple.

slide-9
SLIDE 9 Scaling Pinterest

If you’re the biggest user of a technology, the challenges will be greatly amplified

slide-10
SLIDE 10 Scaling Pinterest

January 2012

Growth

slide-11
SLIDE 11 Scaling Pinterest

April 2012

Growth

  • Amazon EC2 + S3 + Edge Cast
  • 135 Web Engines + 75 API Engines
  • 10 Service Instances
  • 80 MySQL DBs (m1.xlarge) + 1 slave each
  • 110 Redis Instances
  • 60 Memcache Instances
  • 2 Redis Task Manager + 60 Task Processors
  • 3rd party sharded Solr
  • 12 Engineers
  • 1 Data Infrastructure
  • 1 Ops
  • 2 Mobile
  • 8 Generalists
  • 10 Non-Engineers

[Chart: page views per day, Mar 2010 to May 2012]

slide-12
SLIDE 12 Scaling Pinterest
slide-13
SLIDE 13 Scaling Pinterest

April 2013

Growth

  • Amazon EC2 + S3 + Edge Cast
  • 400+ Web Engines + 400+ API Engines
  • 70+ MySQL DBs (hi1.4xlarge on SSDs) + 1 slave each
  • 100+ Redis Instances
  • 230+ Memcache Instances
  • 10 Redis Task Manager + 500 Task Processors
  • 8 services (80 instances)
  • Sharded Solr
  • 20 HBase
  • 12 Kafka + Azkaban
  • 8 Zookeeper Instances
  • 12 Varnish
  • 65+ Engineers (130+ total)
  • 7 Data Infrastructure + Science
  • 7 Search and Discovery
  • 9 Business and Platform
  • 6 Spam, Abuse, Security
  • 9 Web
  • 9 Mobile
  • 2 Growth
  • 10 Infrastructure
  • 6 Ops
  • 65+ Non-Engineers

[Chart: page views per day, April 2012 to April 2013]

slide-14
SLIDE 14 Scaling Pinterest
slide-15
SLIDE 15 Scaling Pinterest
slide-16
SLIDE 16 Scaling Pinterest

Technologies

slide-17
SLIDE 17 Scaling Pinterest

Arch Overview

  • ELB
  • Routing & Filtering (Varnish)
  • API (Python)
  • Web App (Python / JS / HTML)
  • Task Processing (PinLater)
  • MySQL Service (Java/Finagle)
  • Memcache Mux (Nutcracker)
  • Follower Service (Python/Thrift)
  • Feed Service (Python/Thrift)
  • Search Service (Python/Thrift)
  • Spam Service (Python/Thrift)
  • Sharded MySQL
  • Memcache
  • Redis
  • HBase (Zen)
  • Pin Images (S3)
  • CDN
  • All connection pairings managed by ZooKeeper
  • Puppet
  • StatsD
slide-18
SLIDE 18 Scaling Pinterest

Data Pipeline

  • API App (Python)
  • Web App (Python)
  • Task Processing
  • Kafka
  • Secor
  • S3
  • Pinball
  • Qubole
  • Redshift
  • Spam Processing

slide-19
SLIDE 19 Scaling Pinterest

Our MySQL Sharding?

  • http://www.infoq.com/presentations/Pinterest
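
The slide only points to the InfoQ talk, so the following is a minimal sketch of one common sharding scheme, not a description of Pinterest's actual design. The shard number is packed into the high bits of each object ID, so any ID can be routed to its database without a lookup table. The bit widths, the `SHARD_HOSTS` map, the host names, and all function names are assumptions made for illustration.

```python
from typing import Tuple

# Hypothetical layout (assumption): 16 bits of shard ID above 46 bits of local ID.
SHARD_BITS = 16
LOCAL_BITS = 46

# Hypothetical mapping from shard ranges to database hosts.
SHARD_HOSTS = [
    (range(0, 512), "mysql-db-001"),
    (range(512, 1024), "mysql-db-002"),
]


def make_id(shard_id: int, local_id: int) -> int:
    """Pack a shard ID and a per-shard local ID into one integer."""
    if not 0 <= shard_id < (1 << SHARD_BITS):
        raise ValueError("shard_id out of range")
    if not 0 <= local_id < (1 << LOCAL_BITS):
        raise ValueError("local_id out of range")
    return (shard_id << LOCAL_BITS) | local_id


def split_id(object_id: int) -> Tuple[int, int]:
    """Recover (shard_id, local_id) from a packed ID."""
    return object_id >> LOCAL_BITS, object_id & ((1 << LOCAL_BITS) - 1)


def host_for(object_id: int) -> str:
    """Find the database host that owns the shard encoded in the ID."""
    shard_id, _ = split_id(object_id)
    for shard_range, host in SHARD_HOSTS:
        if shard_id in shard_range:
            return host
    raise LookupError(f"no host configured for shard {shard_id}")
```

With a layout like this, moving a shard to a new machine only requires editing the host map; the IDs themselves never change.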

slide-20
SLIDE 20 Scaling Pinterest

Choosing Your Tech

Questions to ask

  • Does it meet your needs?
  • How mature is the product?
  • Is it commonly used? Can you hire people who have used it?
  • Is the community active?
  • How robust is it to failure?
  • How well does it scale? Will you be the biggest user?
  • Does it have good debugging tools? Profiler? Backup software?
  • Is the cost justified?
slide-21
SLIDE 21 Scaling Pinterest

Maturity = Blood and Sweat / Complexity

slide-22
SLIDE 22 Scaling Pinterest

Choosing Your Tech

Questions to ask

  • Does it meet your needs?
  • How mature is the product?
  • Is it commonly used? Can you hire people who have used it?
  • Is the community active?
  • How robust is it to failure?
  • How well does it scale? Will you be the biggest user?
  • Does it have good debugging tools? Profiler? Backup software?
  • Is the cost justified?
slide-23
SLIDE 23 Scaling Pinterest

Hosting

Why Amazon Web Services (AWS)?

  • Variety of servers running Linux
  • Very good peripherals: load balancing, DNS, map reduce, basic security, and more

  • Good reliability
  • Very active dev community
  • Not cheap, but...
  • New instances ready in seconds
slide-24
SLIDE 24 Scaling Pinterest

Hosting

AWS Usage

  • Route 53 for DNS
  • ELB for first-tier load balancing
  • EC2 Ubuntu Linux
  • Varnish layer
  • All web, API, background appliances
  • All services
  • All databases and caches
  • S3 for images, logs
slide-25
SLIDE 25 Scaling Pinterest

Code

Why Python?

  • Extremely mature
  • Well known and well liked
  • Solid active community
  • Very good libraries specifically targeted to web development

  • Effective rapid prototyping
  • Open Source

Some Java and Go...

  • Faster, lower variance response time
slide-26
SLIDE 26 Scaling Pinterest

Code

Python Usage

  • All web backend, API, and related business logic

  • Most services

Java and Go Usage

  • Varnish plugins
  • Search indexers
  • High frequency services (e.g., MySQL service)
slide-27
SLIDE 27 Scaling Pinterest

Production Data

Why MySQL and Memcache?

  • Extremely mature
  • Well known and well liked
  • (MySQL) Rarely catastrophic loss of data
  • Response time increases linearly with request rate
  • Very good software support: XtraBackup, Innotop, Maatkit

  • Solid active community
  • Open Source
slide-28
SLIDE 28 Scaling Pinterest

Production Data

MySQL and Memcache Usage

  • Storage / Caching of core data
  • Users, boards, pins, comments, domains
  • Mappings (e.g., users to boards, user likes, repin info)

  • Legal compliance data
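
As a minimal sketch of how a cache such as Memcache can sit in front of a database such as MySQL, the cache-aside pattern below reads from the cache first and falls back to the database on a miss. Plain dictionaries stand in for both stores, and the class and method names are hypothetical; the slides do not describe Pinterest's actual caching code.

```python
from typing import Dict, Optional


class CacheAsideStore:
    """Cache-aside reads and invalidate-on-write, with dicts as stand-ins."""

    def __init__(self) -> None:
        self.database: Dict[str, dict] = {}  # stand-in for MySQL
        self.cache: Dict[str, dict] = {}     # stand-in for Memcache
        self.database_reads = 0              # counts cache misses

    def get(self, key: str) -> Optional[dict]:
        # Serve from the cache when possible.
        if key in self.cache:
            return self.cache[key]
        # On a miss, read the database and populate the cache.
        self.database_reads += 1
        row = self.database.get(key)
        if row is not None:
            self.cache[key] = row
        return row

    def put(self, key: str, row: dict) -> None:
        # Write to the database, then drop the stale cache entry.
        self.database[key] = row
        self.cache.pop(key, None)
```

Invalidating on write rather than updating the cache keeps the write path simple; the next read repopulates the entry from the database.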
slide-29
SLIDE 29 Scaling Pinterest

Production Data

Why Redis?

  • Well known and well liked
  • Active community
  • Consistently good performance
  • Variety of convenient and efficient data structures
  • 3 Flavors of Persistence: Now, Snapshot, Never
  • Open Source

slide-30
SLIDE 30 Scaling Pinterest

Production Data

Redis Usage

  • Follower data
  • Configurations
  • Public feed pin IDs
  • Caching of various core mappings (e.g., board to pins)

slide-31
SLIDE 31 Scaling Pinterest

Production Data

Why HBase?

  • Small, but growing loyal community
  • Difficult to hire for, but...
  • Non-volatile, O(1), extremely fast and efficient storage
  • Strong Hadoop integration
  • Consistently good performance
  • Used by Facebook (bigger than us)
  • Seems to work well
  • Open Source

slide-32
SLIDE 32 Scaling Pinterest

Production Data

HBase Usage

  • User feeds (pin IDs are pushed to feeds)
  • Rich pin details
  • Spam features
  • User relationships to pins
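
The note that pin IDs are pushed to feeds describes a fan-out-on-write design: when a user pins something, the new pin ID is written into each follower's feed, so reading a feed is a single lookup. Below is a minimal in-memory sketch of that idea. The dictionaries stand in for the follower store and the feed store, and every name and the feed length cap are assumptions, not details from the talk.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, List, Set

MAX_FEED_LENGTH = 3  # hypothetical cap, kept tiny for demonstration

# followee -> set of followers (stand-in for the follower store)
followers: Dict[str, Set[str]] = defaultdict(set)
# user -> newest-first pin IDs (stand-in for the feed store)
feeds: Dict[str, Deque[int]] = defaultdict(lambda: deque(maxlen=MAX_FEED_LENGTH))


def follow(follower: str, followee: str) -> None:
    followers[followee].add(follower)


def publish_pin(author: str, pin_id: int) -> None:
    """Push the new pin ID onto the front of every follower's feed."""
    for follower in followers[author]:
        # A full deque silently drops its oldest entry from the other end.
        feeds[follower].appendleft(pin_id)


def read_feed(user: str) -> List[int]:
    """Return the user's feed, newest pin first."""
    return list(feeds[user])
```

The trade-off is more work at write time (one write per follower) in exchange for cheap reads.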

slide-33
SLIDE 33 Scaling Pinterest

Production Data

What happened to Cassandra, Mongo, ES, and Membase?

  • Does it meet your needs?
  • How mature is the product?
  • Is it commonly used? Can you hire people who have used it?
  • Is the community active? Can you get help?
  • How robust is it to failure?
  • How well does it scale? Will you be the biggest user?
  • Does it have good debugging tools? Profiler? Backup software?
  • Is the cost justified?
slide-34
SLIDE 34 Scaling Pinterest

A 2nd chance...

slide-35
SLIDE 35 Scaling Pinterest

A 2nd Chance

Stuff we could have done better

  • Logging on day 1 (StatsD, Kafka, Map Reduce)
  • Log every request, event, signup
  • Basic analytics
  • Recovery from data corruption or failure
  • Alerting on day 1
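
As a sketch of how little code day-one metrics need, the snippet below formats counters and timers in the plain-text StatsD line format (`name:value|type`) and sends them over UDP. The address and metric names are placeholders and the helper names are hypothetical; a real deployment would normally use an existing StatsD client library instead.

```python
import socket

# Placeholder address (assumption); 8125 is the port StatsD listens on by default.
STATSD_ADDRESS = ("127.0.0.1", 8125)


def format_counter(name: str, value: int = 1) -> bytes:
    """Build a counter line such as b'signups:1|c'."""
    return f"{name}:{value}|c".encode("ascii")


def format_timer(name: str, milliseconds: float) -> bytes:
    """Build a timer line such as b'request_time:12.5|ms'."""
    return f"{name}:{milliseconds:g}|ms".encode("ascii")


def send_metric(payload: bytes, address=STATSD_ADDRESS) -> None:
    """Send one metric as a UDP datagram without waiting for any reply."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, address)
```

A request handler could then call `send_metric(format_counter("signups"))` on each signup. Because UDP is fire-and-forget, metrics can be lost, which is usually acceptable for counters and timings.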

slide-36
SLIDE 36 Scaling Pinterest

A 2nd Chance

Stuff we could have done better

  • Shard our MySQL storage much earlier
  • Once you start relying on read slaves, start the timebomb countdown
  • We also fell into the NoSQL trap (Membase, Cassandra, Mongo, etc.)
  • Pyres for background tasks on day 1
  • Hire technical operations engineers earlier
  • Chef / Puppet earlier
  • Unit testing earlier (Jenkins for builds)
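
The idea behind a background task system such as Pyres is that the web request only enqueues a job description, and a separate worker executes it later. The sketch below shows that shape using only Python's standard library. It is not the Pyres API, and the task names and handlers are hypothetical.

```python
import queue
import threading
from typing import Callable, Dict, List

task_queue: queue.Queue = queue.Queue()
completed: List[str] = []

# Hypothetical task handlers, looked up by name.
HANDLERS: Dict[str, Callable[..., str]] = {
    "send_email": lambda user: f"emailed {user}",
    "resize_image": lambda pin_id, width: f"resized pin {pin_id} to {width}px",
}


def enqueue(task_name: str, *args) -> None:
    """Called from the request path: record the job and return immediately."""
    task_queue.put((task_name, args))


def worker() -> None:
    """Run jobs in FIFO order until a None sentinel arrives."""
    while True:
        job = task_queue.get()
        if job is None:
            break
        task_name, args = job
        completed.append(HANDLERS[task_name](*args))


def run_all_pending() -> None:
    """Start one worker, let it drain the queue, then stop it."""
    thread = threading.Thread(target=worker)
    thread.start()
    task_queue.put(None)  # sentinel is queued behind all pending jobs
    thread.join()
```

In production the queue would live in an external store such as Redis so that jobs survive a process restart and workers can run on separate machines.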

slide-37
SLIDE 37 Scaling Pinterest

A 2nd Chance

Stuff we could have done better

  • A/B testing earlier
  • Decider on top of Zookeeper WATCH
  • Progressive roll out
  • Kill switches
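
Progressive rollout and kill switches can share one mechanism: a per-feature percentage that is checked on each request. The sketch below keeps the percentages in a local dictionary; in the setup the slide describes, they would live in ZooKeeper and be refreshed through watches. The `Decider` class here and the feature names are hypothetical, not Pinterest's implementation.

```python
import hashlib
from typing import Dict


class Decider:
    """Percentage-based feature gate with a kill switch (illustrative only)."""

    def __init__(self) -> None:
        # feature name -> percentage of users (0 to 100) who get the feature
        self.rollout_percent: Dict[str, int] = {}

    def set_percent(self, feature: str, percent: int) -> None:
        if not 0 <= percent <= 100:
            raise ValueError("percent must be between 0 and 100")
        self.rollout_percent[feature] = percent

    def kill(self, feature: str) -> None:
        """Kill switch: turn the feature off for everyone."""
        self.rollout_percent[feature] = 0

    def is_enabled(self, feature: str, user_id: int) -> bool:
        percent = self.rollout_percent.get(feature, 0)  # unknown features are off
        # Hash feature and user together so each user lands in a stable bucket
        # from 0 to 99 that differs from feature to feature.
        digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % 100
        return bucket < percent
```

Because a user's bucket never changes, raising the percentage only adds users; nobody who already has the feature loses it until the kill switch is used.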

slide-38
SLIDE 38 Scaling Pinterest

Looking Forward

  • Beyond 400 Pinployees
  • Continually improve Pinner experience
  • Help Pinners discover more of the things they love
  • Build better and faster
  • Continually improve collaboration and build bigger, better, faster products

What’s next?

slide-39
SLIDE 39 Scaling Pinterest

Have fun

slide-40
SLIDE 40 Scaling Pinterest

No Seriously, Have fun

slide-41
SLIDE 41

marty@pinterest.com pinterest.com/martaaay