SCALING INSTAGRAM INFRA
Lisa Guo, Nov 7th, 2016, lguo@instagram.com
INSTAGRAM HISTORY
2010
2011: 14M users
2012/4/3: Android release
2012/4/9: Facebook acquisition
2014/1
INSTAGRAM EVERYDAY
300 Million Users
4.2 Billion likes
95 Million photo/video uploads
100 Million followers
Scale out
Scale up
Scale dev team
“To scale horizontally means to add more nodes to a system, such as adding a new computer to a distributed software application. An example might involve scaling out from one Web server system to three.”
[Diagram: scaling out in stages, including vertical partition and horizontal sharding]
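As a rough illustration of horizontal sharding, the sketch below routes each row to a shard by key. The shard count, the `shard_for` helper, and the in-memory dictionaries standing in for database servers are all assumptions made for this example, not details from the talk.

```python
NUM_SHARDS = 4  # hypothetical shard count


def shard_for(user_id: int, num_shards: int = NUM_SHARDS) -> int:
    """Map a user id to a shard index with simple modulo hashing."""
    return user_id % num_shards


# Each dict stands in for one database server.
shards = [dict() for _ in range(NUM_SHARDS)]


def save_user(user_id: int, name: str) -> None:
    # Writes go only to the shard that owns this key.
    shards[shard_for(user_id)][user_id] = name


def load_user(user_id: int):
    # Reads consult the same single shard; other shards are never touched.
    return shards[shard_for(user_id)].get(user_id)


for uid, name in [(1, "alice"), (2, "bob"), (6, "carol")]:
    save_user(uid, name)
```

Modulo routing is the simplest choice; it makes adding shards later expensive, which is one reason real systems often map keys to a larger fixed number of logical shards first.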
[Diagram: Django with its backing services: memcache, RabbitMQ, PostgreSQL, Cassandra, Celery, Other Services]
PostgreSQL: user, media, friendship etc
Cassandra: user feeds, stories, activities, and other logs
[Diagram: the stack (Django, RabbitMQ, Celery, PostgreSQL, Cassandra, memcache) across two data centers, DC1 and DC2]
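memcache in one data center (DC1)
User R: feed -> Django -> get from memcache
User C: comment -> Django -> insert into PostgreSQL, set in memcache

memcache across two data centers
DC1, User C: comment -> Django -> insert into PostgreSQL, set in memcache
DC2, User R: feed -> Django -> get from memcache
PostgreSQL replication between DC1 and DC2

memcache across two data centers, with cache invalidation
DC1, User C: comment -> Django -> insert into PostgreSQL, set in memcache
PostgreSQL replication between DC1 and DC2, with Cache invalidate in each data center
DC2, User R: feed -> Django -> get from memcache, then set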
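The cross-data-center flow can be sketched as a toy model: every replicated write also drops the matching entry from the local cache, so a reader in the other data center refills from its own replica. The `DataCenter` class, the synchronous "replication" loop, and the dictionaries standing in for PostgreSQL and memcache are assumptions for illustration only.

```python
class DataCenter:
    def __init__(self, name):
        self.name = name
        self.db = {}      # stand-in for the local PostgreSQL replica
        self.cache = {}   # stand-in for the local memcache tier

    def get_feed(self, media_id):
        # Read path: try the cache first, else read the local DB and fill.
        if media_id not in self.cache:
            self.cache[media_id] = list(self.db.get(media_id, []))
        return self.cache[media_id]

    def apply_replicated_write(self, media_id, comment):
        # Runs for every replicated row and invalidates the cached entry.
        self.db.setdefault(media_id, []).append(comment)
        self.cache.pop(media_id, None)  # cache invalidate


def post_comment(origin, replicas, media_id, comment):
    # The write lands in the origin DC, then reaches every replica.
    for dc in [origin] + replicas:
        dc.apply_replicated_write(media_id, comment)


dc1, dc2 = DataCenter("DC1"), DataCenter("DC2")
dc2.get_feed(12345)                       # User R warms DC2's cache (empty)
post_comment(dc1, [dc2], 12345, "nice!")  # User C comments in DC1
```

Without the `cache.pop` line, DC2 would keep serving the empty list it cached before the comment arrived.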
select count(*) from user_likes_media where media_id=12345;
100s of ms
COUNTERS
select count from media_likes where media_id=12345;
10s of us
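A minimal sketch of the counter idea, using SQLite in memory so it runs anywhere. The table and column names follow the two queries above; the `like` helper and the schema details are assumptions, and the timings on the slide are not reproduced here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE user_likes_media (user_id INTEGER, media_id INTEGER);
    CREATE TABLE media_likes (media_id INTEGER PRIMARY KEY, count INTEGER NOT NULL);
""")


def like(user_id, media_id):
    # Record the like and bump the denormalized counter in one transaction.
    with conn:
        conn.execute(
            "INSERT INTO user_likes_media VALUES (?, ?)", (user_id, media_id)
        )
        cur = conn.execute(
            "UPDATE media_likes SET count = count + 1 WHERE media_id = ?",
            (media_id,),
        )
        if cur.rowcount == 0:  # first like for this media
            conn.execute(
                "INSERT INTO media_likes (media_id, count) VALUES (?, 1)",
                (media_id,),
            )


for user_id in range(1, 4):
    like(user_id, 12345)

# Scans every like row for the media.
slow = conn.execute(
    "SELECT count(*) FROM user_likes_media WHERE media_id = 12345"
).fetchone()[0]
# Reads a single row by primary key.
fast = conn.execute(
    "SELECT count FROM media_likes WHERE media_id = 12345"
).fetchone()[0]
```

Both queries return the same number; the second trades a little extra work on every write for a constant-size read.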
Cache invalidated
All djangos try to access DB
[Sequence diagram over time: d1, d2, memcache, db]
d1: lease-get -> fill; read from DB; lease-set
d2: lease-get -> wait or use stale; lease-get -> hit
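The lease exchange between d1 and d2 can be sketched with a toy in-process cache. `LeaseCache`, its method names, and the token scheme are assumptions modeled on the slide's labels; this is not the API of any particular memcache client.

```python
import time


class LeaseCache:
    """Toy cache with lease-get / lease-set semantics."""

    def __init__(self, lease_ttl=2.0):
        self.values = {}    # key -> last known value (may be stale)
        self.fresh = set()  # keys whose value is current
        self.leases = {}    # key -> (token, expiry)
        self.lease_ttl = lease_ttl
        self._next_token = 0

    def invalidate(self, key):
        self.fresh.discard(key)  # keep the old value around as "stale"

    def lease_get(self, key):
        """Return (status, value, token); status is 'hit', 'fill' or 'wait'."""
        if key in self.fresh:
            return "hit", self.values[key], None
        lease = self.leases.get(key)
        if lease and lease[1] > time.monotonic():
            # Another client is refilling: wait, or use the stale value.
            return "wait", self.values.get(key), None
        self._next_token += 1
        expiry = time.monotonic() + self.lease_ttl
        self.leases[key] = (self._next_token, expiry)
        return "fill", self.values.get(key), self._next_token

    def lease_set(self, key, value, token):
        lease = self.leases.get(key)
        if not lease or lease[0] != token:
            return False  # lease expired or was superseded
        self.values[key] = value
        self.fresh.add(key)
        del self.leases[key]
        return True


cache = LeaseCache()
db_reads = 0


def read_from_db(key):
    global db_reads
    db_reads += 1
    return "likes=42"  # placeholder row


# d1 and d2 ask for the same invalidated key at about the same time.
d1 = cache.lease_get("media:12345")        # d1 is told to fill
d2 = cache.lease_get("media:12345")        # d2 is told to wait
cache.lease_set("media:12345", read_from_db("media:12345"), d1[2])
d2_retry = cache.lease_get("media:12345")  # d2 retries and hits
```

Only the lease holder reads from the database, so an invalidation costs one DB read instead of one per Django process.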
[Diagram: the full stack (Django, RabbitMQ, PostgreSQL, Cassandra, Celery, memcache) in both DC1 and DC2]
[Chart: Requests/second and Servers]
[Diagram: Load Balancer in front of Django Servers, Loaded vs. Regular, measured in CPU instructions]
[Chart: User growth vs. Server growth]
Scale up
Use as few CPU instructions as possible
Use as few servers as possible
Monitor Optimize Analyze
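#include <linux/perf_event.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

struct perf_event_attr pe;
long long count;
memset(&pe, 0, sizeof(pe));
pe.type = PERF_TYPE_HARDWARE;
pe.size = sizeof(pe);
pe.config = PERF_COUNT_HW_INSTRUCTIONS;
pe.disabled = 1;  /* start stopped; enabled explicitly below */
/* glibc has no perf_event_open() wrapper, so call it through syscall(2) */
int fd = syscall(SYS_perf_event_open, &pe, 0, -1, -1, 0);

ioctl(fd, PERF_EVENT_IOC_RESET, 0);
ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
<code you want to measure>
ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

read(fd, &count, sizeof(count));  /* instructions retired while enabled */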
[Chart: series per endpoint: Follow, Feed, Explore]
[Chart: With new feature vs. Without new feature]
Monitor Optimize Analyze
import cProfile, io, pstats

pr = cProfile.Profile()
pr.enable()
# ... do something ...
pr.disable()

s = io.StringIO()
sortby = 'cumulative'
ps = pstats.Stats(pr, stream=s).sort_stats(sortby)
ps.print_stats()
print(s.getvalue())
continuous profiling
generate_profile explore --start <start-time> --duration <minutes>
[Chart: continuous profiling]
Caller Callee
decorator
@log_stats
def get_photos():
    ……

def feed():
    get_photos()

@log_stats
def get_follows():
    ……

def follow():
    get_follows()
[Call graph with the decorator: feed and follow -> log_stats -> get_photos and get_follows]
[Call graph without the decorator: feed -> get_photos; follow -> get_follows]
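A small sketch of why a decorator blurs the caller/callee view in cProfile: the profiler records the decorator's inner function as the caller of the decorated function. The body of `log_stats` here is a hypothetical pass-through (the real one is not shown in the slides), and `callers_of` is a helper invented for this example.

```python
import cProfile
import functools
import pstats


def log_stats(func):
    """Hypothetical stats decorator; this version only forwards the call."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)
    return wrapper


@log_stats
def get_photos():
    return ["photo"]


@log_stats
def get_follows():
    return ["user"]


def feed():
    return get_photos()


def follow():
    return get_follows()


pr = cProfile.Profile()
pr.enable()
feed()
follow()
pr.disable()

stats = pstats.Stats(pr)


def callers_of(name):
    # stats.stats maps (file, line, func_name) -> (cc, nc, tt, ct, callers).
    return {
        caller[2]
        for func, (cc, nc, tt, ct, callers) in stats.stats.items()
        if func[2] == name
        for caller in callers
    }


photo_callers = callers_of("get_photos")
follow_callers = callers_of("get_follows")
```

Both sets name the shared wrapper rather than `feed` or `follow`, so the profile alone cannot tell which endpoint paid for which call.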
Monitor Optimize Analyze
igcdn-photos-d-a.akamaihd.net/hphotos-ak-xpl1/t51.2885-19/s300x300/12345678_1234567890_987654321_a.jpg

igcdn-photos-d-a.akamaihd.net/hphotos-ak-xpl1/t51.2885-19/s150x150/12345678_1234567890_987654321_a.jpg
igcdn-photos-d-a.akamaihd.net/hphotos-ak-xpl1/t51.2885-19/s400x600/12345678_1234567890_987654321_a.jpg
igcdn-photos-d-a.akamaihd.net/hphotos-ak-xpl1/t51.2885-19/s200x200/12345678_1234567890_987654321_a.jpg
igcdn-photos-d-a.akamaihd.net/hphotos-ak-xpl1/t51.2885-19/s300x300/12345678_1234567890_987654321_a.jpg

igcdn-photos-d-a.akamaihd.net/hphotos-ak-xpl1/t51.2885-19/s300x300/12345678_1234567890_987654321_a.jpg
150x150, 400x600, 200x200
cProfile is not free
False positive alerts
Better automation
Use as few CPU instructions as possible
Scale up
(memory budget/process) × (# of processes) < system memory
Less memory budget/process ===> Dies sooner ===> More processes
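The inequality can be turned into a quick calculation of how many processes fit on a host for a given per-process budget. The 32 GB host and the budgets below are hypothetical numbers chosen for illustration, and `max_processes` is a helper invented for this sketch.

```python
def max_processes(system_memory_mb: int, budget_per_process_mb: int) -> int:
    """Largest count with budget * count strictly below system memory."""
    return (system_memory_mb - 1) // budget_per_process_mb


SYSTEM_MEMORY_MB = 32 * 1024  # hypothetical 32 GB host

for budget_mb in (1024, 512, 256):
    count = max_processes(SYSTEM_MEMORY_MB, budget_mb)
    print(budget_mb, "MB/process ->", count, "processes")
```

Halving the budget roughly doubles the process count, at the cost of each process reaching its limit sooner.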
[Diagram: Load Balancer in front of Django Servers, Loaded vs. Regular, measured in CPU instructions]
Code
Large configuration
Synchronous Processing model ===> Worker starvation
Single service degradation ===> All user experience impacted
Longer latency ===> Fewer CPU instr executed
[Diagram: Django calling Feed, Stories, and Suggested Users]
ASYNC IO
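A minimal sketch of the async IO idea using Python's `asyncio`: one request waits on several backends at once instead of one after another, so a worker is not held idle per call. The service names echo the diagram, but the `fetch` helper, the latencies, and the use of `asyncio` itself are assumptions; the slides do not say which async framework was used.

```python
import asyncio


async def fetch(service: str, delay: float) -> str:
    # asyncio.sleep stands in for a network call to a backend service.
    await asyncio.sleep(delay)
    return f"{service} ok"


async def render_home() -> dict:
    # Hypothetical services and latencies, chosen for illustration.
    services = {"feed": 0.05, "stories": 0.05, "suggested_users": 0.05}
    # All three calls are in flight at once; total wait is about the slowest.
    results = await asyncio.gather(
        *(fetch(name, delay) for name, delay in services.items())
    )
    return dict(zip(services, results))


home = asyncio.run(render_home())
```

While one call is waiting on the network, the event loop is free to make progress on the others, which is the opposite of the worker starvation described for the synchronous model.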
Use as few CPU instructions as possible
Scale up
30% engineers joined in last 6 months
Bootcampers - 1 week
Hack-A-Month - 4 weeks
Intern - 12 weeks
Comment Filtering
Self-harm Prevention
Windows App
Story Viewer Ranking
Video View Notification
Save Draft
First Story Notification
New table
Which server?
What index?
Should I cache it?
Will I lock up DB?
[Diagram: graph of USER1, USER2, USER3, and a media object, connected by posted / posted by and likes / liked by edges]
60-80 daily diffs
>120 engineers committed code last month
gated by configuration
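One way to picture gating by configuration is a percentage rollout table checked at request time. The `GATES` table, its percentages, and the `is_enabled` helper are hypothetical; the feature names are borrowed from the feature list earlier in the deck only as labels.

```python
import zlib

# Hypothetical gate configuration: feature name -> rollout percentage.
GATES = {"story_viewer_ranking": 10, "save_draft": 100, "windows_app": 0}


def is_enabled(feature: str, user_id: int, gates=GATES) -> bool:
    """Deterministically place each user in a 0-99 bucket per feature."""
    percent = gates.get(feature, 0)  # unknown features stay off
    bucket = zlib.crc32(f"{feature}:{user_id}".encode()) % 100
    return bucket < percent


enabled_users = [
    uid for uid in range(1000) if is_enabled("story_viewer_ranking", uid)
]
```

Because the code ships dark and the table decides who sees it, a rollout can be widened or reverted by changing configuration rather than by deploying again.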
Once a week? Once a day? Once a diff!!
40-50 rollouts per day
Code review, unittest -> Code accepted, committed -> Canary -> To the Wild
Dark launch
Load test
Scaling is a continuous effort
Scaling is multi-dimensional
Scaling is everybody’s responsibility