SCALING INSTAGRAM INFRA
Lisa Guo, March 7th, 2017, lguo@instagram.com
INSTAGRAM HISTORY
2010
2012/4/9: joined Facebook
2014/1
2017: 600M users/month
INSTAGRAM EVERYDAY
400 Million Users
4+ Billion likes
100 Million photo/video uploads
Top account: 110 Million followers
Scale out
Scale up
Scale dev team
memcache RabbitMQ PostgreSQL Cassandra Celery Other Services
Django
PostgreSQL: user, media, friendship etc
Django writes to the Master and reads from a Replica
Master and Replicas are spread across DC1, DC2, DC3
Cassandra: user feeds, activities etc
Replica Replica Replica
Write - 2, Read - 1
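The "Write - 2, Read - 1" setting can be checked against the usual quorum rule. A minimal sketch, assuming the three replicas on the slide are the full replication factor; the function name is illustrative and not from the talk:

```python
def read_sees_latest_write(n_replicas: int, write_acks: int, read_acks: int) -> bool:
    """True when every read set must overlap every write set (W + R > N)."""
    return write_acks + read_acks > n_replicas

# Setting from the slide: 3 replicas, a write waits for 2, a read waits for 1.
slide_setting = read_sees_latest_write(3, 2, 1)   # 2 + 1 is not greater than 3
stricter_read = read_sees_latest_write(3, 2, 2)   # 2 + 2 is greater than 3
print(slide_setting, stricter_read)
```

Under that assumption a single-replica read is not guaranteed to hit a replica that has the newest write, which is the usual trade of consistency for read latency.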
Django RabbitMQ Celery
PostgreSQL Cassandra
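The Django, RabbitMQ, Celery arrangement is a producer, queue, worker pattern. The sketch below shows only the shape of it using the standard library: queue.Queue stands in for RabbitMQ and a thread stands in for a Celery worker. The task and the function names are hypothetical, not taken from the talk.

```python
import queue
import threading

task_queue = queue.Queue()   # stand-in for RabbitMQ
results = []

def worker():
    """Stand-in for a Celery worker: pull tasks until a None sentinel arrives."""
    while True:
        task = task_queue.get()
        if task is None:
            break
        func, args = task
        results.append(func(*args))

def fan_out_to_followers(media_id, follower_ids):
    # Hypothetical slow job that a web request should not wait for.
    return [(follower, media_id) for follower in follower_ids]

def handle_upload(media_id, follower_ids):
    """Stand-in for a Django view: enqueue the job and return immediately."""
    task_queue.put((fan_out_to_followers, (media_id, follower_ids)))
    return "accepted"

t = threading.Thread(target=worker)
t.start()
status = handle_upload(12345, [1, 2, 3])
task_queue.put(None)   # tell the worker to stop
t.join()
print(status, results)
```

The request path only pays for the enqueue; the slow work happens in the worker.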
memcache: one per DC (DC1, DC2)
DC1, User R: feed ===> Django: get (memcache)
DC1, User C: comment ===> Django: insert (PostgreSQL), set (memcache)
DC1, User C: comment ===> Django: insert (PostgreSQL), set (memcache)
DC2, User R: feed ===> Django: get (memcache)
PostgreSQL replication between the DCs
DC1, User C: comment ===> Django: insert (PostgreSQL), set (memcache)
DC2, User R: feed ===> Django: get (memcache)
PostgreSQL replication between the DCs, with Cache invalidate (x2) on memcache
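The cross-DC problem and the "Cache invalidate" fix can be replayed with plain dictionaries. This is a sketch under stated assumptions: each DC has its own memcache, only PostgreSQL is replicated, and the invalidation is triggered when the replicated row arrives. The class and function names are hypothetical.

```python
class DataCenter:
    def __init__(self, name):
        self.name = name
        self.db = {}        # stand-in for the local PostgreSQL copy
        self.cache = {}     # stand-in for the local memcache

    def read_comments(self, media_id):
        """Django read path: try memcache first, fall back to the local DB."""
        if media_id not in self.cache:
            self.cache[media_id] = list(self.db.get(media_id, []))
        return self.cache[media_id]

def replicate(src, dst, media_id, invalidate):
    """Copy the row to the other DC; optionally invalidate its cache too."""
    dst.db[media_id] = list(src.db[media_id])
    if invalidate:
        dst.cache.pop(media_id, None)

def comment_then_read(invalidate):
    dc1, dc2 = DataCenter("DC1"), DataCenter("DC2")
    dc2.read_comments(12345)                      # User R warms DC2's cache (no comments yet)
    dc1.db.setdefault(12345, []).append("nice!")  # User C: insert in DC1
    dc1.cache[12345] = list(dc1.db[12345])        # User C: set in DC1 memcache
    replicate(dc1, dc2, 12345, invalidate)
    return dc2.read_comments(12345)               # User R: get in DC2

without_invalidate = comment_then_read(invalidate=False)
with_invalidate = comment_then_read(invalidate=True)
print(without_invalidate, with_invalidate)
```

Without invalidation the DC2 read keeps returning the cached, comment-free entry even though the local database already has the row.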
select count(*) from user_likes_media where media_id=12345;   -- hundreds of ms
select count from media_likes where media_id=12345;   -- tens of us
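The second query works because the count is kept up to date on every write instead of being recomputed on every read. A minimal sketch using SQLite in memory as a stand-in for PostgreSQL; the table names follow the slide, while the column layout and the like() helper are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_likes_media (user_id INTEGER, media_id INTEGER)")
conn.execute('CREATE TABLE media_likes (media_id INTEGER PRIMARY KEY, "count" INTEGER NOT NULL)')

def like(user_id, media_id):
    """Record the like and bump the denormalized counter in one transaction."""
    with conn:
        conn.execute("INSERT INTO user_likes_media VALUES (?, ?)", (user_id, media_id))
        conn.execute("INSERT OR IGNORE INTO media_likes VALUES (?, 0)", (media_id,))
        conn.execute('UPDATE media_likes SET "count" = "count" + 1 WHERE media_id = ?', (media_id,))

for user_id in range(1000):
    like(user_id, 12345)

# Scans every like row for the media item.
slow = conn.execute("SELECT count(*) FROM user_likes_media WHERE media_id = 12345").fetchone()[0]
# Reads a single precomputed row.
fast = conn.execute('SELECT "count" FROM media_likes WHERE media_id = 12345').fetchone()[0]
print(slow, fast)
```

Both queries return the same number; the difference is that the first grows with the number of likes and the second is a primary-key lookup.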
Cache invalidated: all djangos try to access DB
Timeline across d1, d2, memcache, db:
d1: lease-get ===> fill
d2: lease-get ===> wait or use stale
d1: read from DB, lease-set
d2: lease-get ===> hit
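The lease timeline can be traced with a small in-memory cache. This is a sketch of the idea only: LeaseCache, its method names, and the return values are hypothetical and are not the memcached client API. It shows that only the first caller reads the database while the second waits and then gets a hit.

```python
import itertools

class LeaseCache:
    """In-memory stand-in for a cache with leases (not a real memcached client)."""

    def __init__(self):
        self.values = {}
        self.leases = {}                  # key -> token of the one allowed filler
        self._tokens = itertools.count(1)

    def lease_get(self, key):
        """Return (status, payload): 'hit', 'fill' (with a token), or 'wait'."""
        if key in self.values:
            return "hit", self.values[key]
        if key in self.leases:
            return "wait", None           # someone else is already reading the DB
        token = next(self._tokens)
        self.leases[key] = token
        return "fill", token

    def lease_set(self, key, token, value):
        """Store the value only if the caller still holds the lease."""
        if self.leases.get(key) != token:
            return False
        del self.leases[key]
        self.values[key] = value
        return True

db_reads = 0

def read_from_db(key):
    global db_reads
    db_reads += 1
    return f"row-for-{key}"

cache = LeaseCache()
d1_status, token = cache.lease_get("media:12345")   # d1: miss, told to fill
d2_status, _ = cache.lease_get("media:12345")       # d2: told to wait (or use stale)
cache.lease_set("media:12345", token, read_from_db("media:12345"))  # d1 fills
d2_retry, value = cache.lease_get("media:12345")    # d2: hit, no DB read
print(d1_status, d2_status, d2_retry, db_reads)
```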
DC1: Django RabbitMQ PostgreSQL Cassandra Celery memcache
DC2: Django RabbitMQ PostgreSQL Cassandra Celery memcache
Chart: User growth vs. Server growth
Use as few CPU instructions as possible
Use as few servers as possible
Monitor Optimize Analyze
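#include <linux/perf_event.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

struct perf_event_attr pe;
long long count;
memset(&pe, 0, sizeof(pe));               /* unused fields must be zero */
pe.size = sizeof(pe);
pe.type = PERF_TYPE_HARDWARE;
pe.config = PERF_COUNT_HW_INSTRUCTIONS;
/* glibc has no wrapper, so call through syscall(2): this process, any CPU */
int fd = syscall(SYS_perf_event_open, &pe, 0, -1, -1, 0);
ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
<code you want to measure>
ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
read(fd, &count, sizeof(long long));      /* instructions retired while enabled */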
Chart: Follow, Feed, Explore
Chart: With new feature vs. Without new feature
Monitor Optimize Analyze
import cProfile, pstats, io
pr = cProfile.Profile()
pr.enable()
# ... do something ...
pr.disable()
s = io.StringIO()
sortby = 'cumulative'
ps = pstats.Stats(pr, stream=s).sort_stats(sortby)
ps.print_stats()
print(s.getvalue())
continuous profiling
generate_profile explore --start <start-time> --duration <minutes>
continuous profiling
Call graph: Caller, Callee, Callee
Monitor Optimize Analyze
igcdn-photos-d-a.akamaihd.net/hphotos-ak-xpl1/t51.2885-19/s300x300/12345678_1234567890_987654321_a.jpg

igcdn-photos-d-a.akamaihd.net/hphotos-ak-xpl1/t51.2885-19/s150x150/12345678_1234567890_987654321_a.jpg
igcdn-photos-d-a.akamaihd.net/hphotos-ak-xpl1/t51.2885-19/s400x600/12345678_1234567890_987654321_a.jpg
igcdn-photos-d-a.akamaihd.net/hphotos-ak-xpl1/t51.2885-19/s200x200/12345678_1234567890_987654321_a.jpg
igcdn-photos-d-a.akamaihd.net/hphotos-ak-xpl1/t51.2885-19/s300x300/12345678_1234567890_987654321_a.jpg

igcdn-photos-d-a.akamaihd.net/hphotos-ak-xpl1/t51.2885-19/s300x300/12345678_1234567890_987654321_a.jpg
150x150 400x600 200x200
Use as few CPU instructions as possible
Process 1 Shared Memory Private Memory Process N
Synchronous processing model with long latency ===> worker starvation and fewer CPU instructions executed
Stories Feed
Django
Feed Stories Suggested Users
ASYNC IO
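A minimal sketch of why async IO helps when one request needs Feed, Stories, and Suggested Users. The fetch() coroutine and its latencies are made up for illustration; asyncio.sleep plays the part of the network wait.

```python
import asyncio
import time

async def fetch(name, latency):
    """Stand-in for a remote call; asyncio.sleep plays the network wait."""
    await asyncio.sleep(latency)
    return f"{name} data"

async def home_sync_style():
    # One call after another: the worker idles through every wait.
    return [await fetch("feed", 0.05), await fetch("stories", 0.05),
            await fetch("suggested users", 0.05)]

async def home_async_style():
    # All three waits overlap; total time is roughly that of the slowest call.
    return await asyncio.gather(fetch("feed", 0.05), fetch("stories", 0.05),
                                fetch("suggested users", 0.05))

def timed(coro_fn):
    start = time.perf_counter()
    result = asyncio.run(coro_fn())
    return result, time.perf_counter() - start

seq_result, seq_time = timed(home_sync_style)
par_result, par_time = timed(home_async_style)
print(seq_result == list(par_result), par_time < seq_time)
```

The results are identical; only the time spent waiting changes, which frees the worker to run more CPU instructions for other requests.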
Use as few CPU instructions as possible
Scale up
30% engineers joined in last 6 months
Bootcampers - 1 week
Hack-A-Month - 4 weeks
Intern - 12 weeks
Comment Filtering
Self-harm Prevention
Windows App
Multiple media in
Video View Notification
Saved Posts
First Story Notification
Instagram Live
Instagram Stories
New Table: Which server? What Index? Should I cache it? Will I lock up DB?
USER1 posted media; media posted by USER1
USER2 likes media; media liked by USER2
USER3 likes media; media liked by USER3
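The diagram reads as objects connected by typed edges, each with an inverse. A minimal sketch of that data model, assuming the edge names from the slide; the Graph class and its methods are hypothetical and are not the storage API used at Instagram.

```python
from collections import defaultdict

INVERSE = {"posted": "posted by", "likes": "liked by"}

class Graph:
    """Tiny in-memory stand-in for a graph store of objects and typed edges."""

    def __init__(self):
        self.edges = defaultdict(list)   # (source, edge type) -> [targets]

    def add(self, source, edge_type, target):
        """Write the edge and its inverse so both directions can be queried."""
        self.edges[(source, edge_type)].append(target)
        self.edges[(target, INVERSE[edge_type])].append(source)

    def get(self, source, edge_type):
        return list(self.edges[(source, edge_type)])

g = Graph()
g.add("USER1", "posted", "media")
g.add("USER2", "likes", "media")
g.add("USER3", "likes", "media")
liked_by = g.get("media", "liked by")
posted_by = g.get("media", "posted by")
print(liked_by, posted_by)
```

With an interface like this, a feature author asks for edges by type and does not have to pick a server, an index, or a caching strategy for a new table.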
Master Live Direct
With branches
Master Live Direct
No branches
Engineers Employees Dogfooder Some demographics World
Once a week? Once a day? Once a diff!!
40-60 rollouts per day
Code review, unittest ===> Code accepted, committed ===> Canary ===> To the Wild
Scale out
Scale up
Scale dev team
Scaling is everybody’s responsibility
Scaling is a continuous effort
Scaling is multi-dimensional