Lessons from an Internet-Scale Notification System
Atul Adya
History
- End-client notification system Thialfi
○ Presented at SOSP 2011
- Since then:
○ Scaled by several orders of magnitude
○ Used by many more products and in different ways
- Several unexpected “lessons”
Case for Notifications
Ensuring cached data is fresh across users and devices
[Diagram: a "Colin is online" update propagating to Bob's browser, Phil's phones, and Alice's notebook]
Common Pattern #1: Polling
"Did it change yet? No! Did it change yet? No! Did it change yet? No! ... Did it change yet? Yes!"
Cost and speed issues at scale:
100M clients polling at 10-minute intervals => ~166K QPS (100,000,000 clients / 600 s ≈ 166,667 queries per second)
Common Pattern #2: App pushes updates over point-to-point channels
Complicated for every app to build:
- Bookkeeping: object IDs, endpoints, registrations, cursors, ACLs
- Plumbing: fan out to endpoints, manage channels (HTTP, XMPP, GCM), ensure reliable delivery
Our Solution: Thialfi
- Scalable: handles hundreds of millions of clients and objects
- Fast: notifies clients in less than a second
- Reliable: even when entire data centers fail
- Easy to use and deploy:
Chrome Sync (Desktop/Android), Google Plus, Contacts, Music, GDrive
Thialfi Programming Overview
[Diagram: clients C1 and C2 register for object X through the Thialfi client library; the application backend sends "Update X"; the Thialfi service, tracking "X: C1, C2" in the data center, delivers "Notify X" to both registered clients]
Thialfi Architecture
[Diagram: the client library exchanges registrations, notifications, and acknowledgments with the data center over HTTP/XMPP/GCM; the application backend publishes notifications through a Translation Bridge; inside the data center, the Registrar and Matcher maintain state in Bigtables]
- Matcher (object Bigtable): object → registered clients, version
- Registrar (client Bigtable): client ID → registered objects, unacked messages
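
To make the Matcher and Registrar tables concrete, here is a minimal sketch of the two row schemas as plain Java classes. The real system stores these rows in Bigtable; the field names are illustrative, not Thialfi's actual schema.

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    // Matcher row (object Bigtable): one row per object.
    class MatcherRow {
      long latestVersion;                               // latest known version of the object
      Set<String> registeredClients = new HashSet<>();  // clients registered for the object
    }

    // Registrar row (client Bigtable): one row per client.
    class RegistrarRow {
      Set<String> registeredObjects = new HashSet<>();          // objects this client tracks
      Map<String, Long> unackedNotifications = new HashMap<>(); // object id -> unacked version
    }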
Thialfi Abstraction
- Objects have unique IDs and version numbers, monotonically increasing on every update
- Delivery guarantee: registered clients learn the latest version number
- Reliable signal only: "cached object ID X is now at version Y"
(Think "Cache Invalidation")
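
The abstraction above can be summarized as a small client-facing interface. This is a hedged sketch, not the actual Thialfi client API: invalidateUnknownVersion is named later in this deck, but the other names are invented for illustration.

    // Callbacks the application implements to receive Thialfi's signals.
    interface NotificationListener {
      // Reliable signal: registered object `objectId` is now at `version`;
      // any cached copy at an older version is stale.
      void invalidate(String objectId, long version);

      // The notification signal (or its version) was lost:
      // the client must treat its cached copy as stale and refetch.
      void invalidateUnknownVersion(String objectId);
    }

    // Calls the application makes into the client library.
    interface NotificationClient {
      void register(String objectId);                  // start tracking an object
      void unregister(String objectId);                // stop tracking it
      void acknowledge(String objectId, long version); // confirm a delivered signal
    }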
Thialfi Characteristics
- Built around soft state
○ Recover registration state from clients
○ Lost notification signal: InvalidateUnknownVersion
- Registration-Sync:
○ Exchange a hash of registrations between client and server (see the digest sketch after this list)
○ Helps in edge cases, async storage, cluster switch
- Multi-Platform:
○ Libraries: C++, Java, JavaScript, Objective-C
○ OS: Windows/Mac/Linux, browsers, Android, iOS
○ Channels: HTTP, XMPP, GCM, Internal-RPC
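
A minimal sketch of the Registration-Sync digest exchange, assuming a sorted-set SHA-1 digest (the actual hashing scheme is not described in this deck, so treat the details as illustrative): client and server each compute a digest over their view of the registrations and compare; a mismatch triggers a full sync.

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.util.Set;
    import java.util.TreeSet;

    final class RegSync {
      // Order-independent digest: sort the ids so both sides hash the
      // same byte stream regardless of how their sets are stored.
      static byte[] registrationDigest(Set<String> objectIds) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        for (String id : new TreeSet<>(objectIds)) {
          md.update(id.getBytes(StandardCharsets.UTF_8));
          md.update((byte) 0);  // separator between ids
        }
        return md.digest();     // mismatch with the peer's digest => full Reg-Sync
      }
    }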
Some Lesions... Ouch! I mean, Lessons
Lesson 1: Is this thing on?
- Launch your system and no one is using it
○ How do I know it is working?
- People start using it
○ Is it working now?
- You magically know it works 99.999% of the time
○ Which 99.999%?
- How to distinguish among ephemeral, disconnected, and buggy clients?
You can never know
Lesson 1: Is this thing on?
What’s the best you can do?
- Continuous testing in production
○ But may not be able to get client monitoring
- Look at server graphs
○ End-to-end, e.g., latency
○ More detailed, e.g., Reg-Sync per client type
Lesson 1: Is this thing on?
- But graphs are not sufficient
○ Even when it looks right, averages can be deceptive
○ How do you know if you are "missing" some traffic?
- Have other ways of getting more reports:
customer monitoring, real customers, Twitter, ...
Lesson 2: And you thought you could debug?
- Monitoring indicates that there is a problem
○ Server text logs: but hard to correlate
○ Structured logging: may have to log selectively
■ E.g., cannot log incoming stream multiple times
○ Client logs: typically not available
○ Monitoring graphs: but can be too many signals
- Specific user has problem (needle-in-a-haystack)
○ Structured logging, if available
○ Custom production code!
War Story: VIP Customer
- Customer unable to receive notifications
- Whole team spent hours looking
- Early on, debugging support was poor
○ Text logs: had rolled over
○ Structured logs: not there yet
○ Persistent state: had no history
- Eventually got "lucky"
○ Version numbers were timestamps
○ Saw the last notification "version" was very old
○ Deflected the bug
Opportunity: Monitoring & Debugging Tools
- Automated tools to detect anomalies
○ Machine-learning based?
- Tools for root-cause analysis
○ Which signals to examine when a problem occurs
- Finding needles in a haystack
○ Dynamically switch on debugging for a "needle"
■ E.g., trace a client's registrations and notifications
Lesson 3: Clients considered harmful
- Started out: “Offloading work to clients is good”
- But, client code is painful:
○ Maintenance burden of multiple platforms
○ Upgrades: days, weeks, months, years ... never
○ Hurts evolution and agility
War Story: Worldwide crash of Chrome on Android (alpha)
- Switched a flag to route message delivery through a different client code path
- Tested this path extensively
- Unfortunately, our Android code did network
access from the main thread on this path
- Newer versions of the OS than in our tests
crashed the application when this happened
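
The general shape of that bug, as a hypothetical sketch (method names invented): on newer Android releases, network I/O on the main thread throws NetworkOnMainThreadException and crashes the app, while older releases allowed it, which is why tests on older OS versions passed.

    class MessageDeliveryPath {
      void onMessageReceived(String message) {
        // BUG: the new code path did network access right here, on the
        // main thread: fine on older Android versions, fatal on newer ones.

        // FIX: hop to a background thread before touching the network.
        new Thread(() -> openChannelAndFetch()).start();
      }

      void openChannelAndFetch() {
        // network I/O, e.g., an HTTP request on the notification channel
      }
    }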
War Story: Strange Reg-Sync Loops
- Discovered unnecessary registrations for a
(small) customer
- “Some JavaScript clients in Reg-Sync loop”
- Theories: races; bugs in the app, the library, Closure, ...
- Theory: HTTP clients switching too much
○ Nope!
War Story: Buggy Platform
- Logged platform of every Reg-sync looping client
- Found “6.0” and that meant Safari
- Wrote a test but failed to find the bug
- Engineer searched for "safari javascript runtime bug"
- Ran the test in a loop
○ SHA-1 hash not the same in all runs of the loop!
○ Safari's JavaScript JIT sometimes miscompiled i++ as ++i
Future direction: “Thin” client
- Move complexity to where it can be maintained
- Removing most code from client
○ Trying to make the library a thin wrapper around the API
- Planning to use Spanner (synchronous store)
- But still keeping soft-state aspects of Thialfi
Lesson 4: Getting your foot (code) in the door
- Developers will use a system iff it obviously
makes things better than doing it on their own
- Clean semantics and reliability are not the selling points you think they are
○ Clients care about features, not properties
Lesson 4: Getting your foot (code) in the door
- May need “unclean” features to get customers
○ Best-effort data along with versions
○ Support special object IDs for users
○ Added a new server (Bridge) for translating messages
- Customers may not be able to meet your strong
requirements
○ Version numbers not feasible for many systems
○ Allow time instead of version numbers
Lesson 4: Getting your foot (code) in the door
- Understand their architecture and review their
code for integrating with your system
○ "Error" path broken: invalidateUnknownVersion
○ Naming matters: changed it to mustResync
- Know where your customer's code is, so that you can migrate them to newer infrastructure
- Debugging tools also needed for “bug deflection”
War Story: “Thialfi is unreliable”
- A team used Thialfi for reliable “backup” path to
augment their unreliable “fast” path
- Experienced an outage when their fast path
became really unreliable
- Informed us Thialfi was dropping notifications!
- Investigation revealed:
○ Under stress, their backend dropped messages on its own path and gave up publishing into Thialfi after a few retries
Lesson 5: You are building your castle on sand
- You will do a reasonable job thinking through your own design, protocols, failures, etc.
- Your outage is likely to come from a violation of one of your assumptions, or from another system several levels of dependencies away
War Story: Delayed replication in Chrome Sync
- Chrome backend dependency stopped sending
notifications to Thialfi
- When it unwedged, traffic went up by more than 3X; we only had capacity for 2X
[Graph: incoming feed QPS]
War Story: Delayed replication in Chrome Sync
- Good news: Internal latency remained low and
system did not fall over
- Bad news: End-to-end latency spiked to
minutes for all customers
- Isolation not strong enough: not only Chrome Sync but all customers saw elevated latency
Opportunity: Resource Isolation
- Need the ability to isolate various customers
from each other
- General problem for shared infrastructure
services
War Story: Load balancer config change
- Thialfi needs clients to be stable w.r.t. clusters
○ Should not globally reshuffle during a single-cluster outage
- Change to inter-cluster load balancer config to
remove ad hoc cluster stickiness
○ Previously discussed with owning team
- Config change caused large-scale loss of
cluster stickiness for clients
War Story: Load balancer config change
- Client flapping between clusters caused an
explosion in the number of active clients
○ Same client was using resources many times over
[Graph: number of active clients]
Fix: Consistent hash routing
- Reverted load balancer config change
- Use consistent hashing for cluster selection
○ Routed clients based on client ID
○ Not geographically optimal
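
One simple way to get this behavior is rendezvous (highest-random-weight) hashing keyed by client ID; the deck says only "consistent hashing", so take this as an illustrative variant with a stand-in hash function.

    import java.util.List;

    final class ClusterRouter {
      // Each client scores every cluster and picks the highest; the choice is
      // stable per client, and adding or removing a cluster only moves the
      // ~1/N of clients whose top-scoring cluster changed.
      static String clusterFor(String clientId, List<String> clusters) {
        String best = null;
        long bestScore = Long.MIN_VALUE;
        for (String cluster : clusters) {
          long score = (clientId + "/" + cluster).hashCode(); // stand-in for a strong hash
          if (score > bestScore) {
            bestScore = score;
            best = cluster;
          }
        }
        return best;
      }
    }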
Opportunity: Geo-aware “stable” routing
- “Stable”: Client goes to same cluster for long
periods of time
- Geographically-aware
- How to ensure clients are somewhat uniformly distributed?
- How to add new clusters or shut down clusters (e.g., for maintenance)?
Lesson 6: The customer is always right
- Customers will ask for anything and everything
- Tension between keeping the system pure and well-structured and responding to customers' needs
○ C.f. “Getting your foot in the door”
Initial model: Lack of payload support
(Model we had in SOSP 2011)
- Developers want reliable, in-order data delivery
- But, adds complexity to Thialfi and application
○ Hard state, arbitrary buffering
○ Offline applications flooded with data on wakeup
- For most applications, reliable signal is enough
○ Invoke polling path on signal: simplifies integration
War Story: No payloads hurts Chrome Sync
- Logistics:
○ Requires a cache to handle backend fetches
○ Backend writers wanted one team to build a cache
- Technical: "lost updates" with multi-master async stores
○ No monotonically increasing version
○ Modify an object in clusters A and B
○ Need both updates to do conflict resolution, but only get the "last update" from one of them
Fix: Add payload support
- Expose a “Pubsub-like” API
○ All updates sent to the client
○ No version numbers
- What about the data problems mentioned earlier?
○ System can throw away data when there is "too much" and send a MustResync signal
○ Clients required to fetch only on MustResync
Still believe that the reliable signal is the most important aspect of a notification system; data is just the "icing on the cake"
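
A sketch of the drop-and-MustResync behavior described above, with an assumed per-client bound (the constant and names are illustrative): payloads are buffered per client, and on overflow the buffer is discarded and the client is flagged to do a full fetch.

    import java.util.ArrayDeque;
    import java.util.Deque;

    final class ClientQueue {
      private static final int MAX_PENDING = 100;  // assumed bound on buffered payloads
      private final Deque<byte[]> pending = new ArrayDeque<>();
      private boolean mustResync = false;

      synchronized void publish(byte[] payload) {
        if (mustResync) return;               // client will refetch everything anyway
        if (pending.size() >= MAX_PENDING) {
          pending.clear();                    // "too much data": throw it away ...
          mustResync = true;                  // ... and signal MustResync instead
        } else {
          pending.add(payload);
        }
      }
    }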
Lesson 6: … Except when they are not
Latency and SLAs
- If you ask, customers will tell you they need
○ <100 ms latency, 99.999% availability
○ 5-minute response times when paged
- Lesson: Don’t ask your customers
- Thialfi averages 0.5–1 sec: seems to be fine
War Story: Unused “big” feature
- Important customer wanted a large number of objects per client
- We wanted to scale in various dimensions
○ Optimized the architecture to never read all registrations together, never keep them in memory, etc.
○ For Reg-Sync, added Merkle tree support
○ But never shipped it ...
- Most apps use few (one!) objects per client
○ Why? Migrated from polling!
○ The same customer ended up with few objects per client!
Lesson 7: You cannot anticipate the hard parts
- The initial Thialfi design spent enormous energy on making the notification path efficient
- Once we got into production, we added 100s of
ms of batching for efficiency
○ No one cared ...
Lesson 7: You cannot anticipate the hard parts
Hard parts of Thialfi actually are:
- Registrations:
○ Getting client and data center to agree on registration state with asynchronous storage is tough
○ Reg-Sync solved a number of edge cases
- Wide-area routing:
○ Earliest Thialfi design ignored this issue completely
○ Had to hack it in on the fly
○ Took significant engineering effort to redo it properly
Lesson 7: You cannot anticipate the hard parts
- Client library and its protocol
○ Did not pay attention initially: grew "organically"
○ Had to redesign and rebuild this part completely
- Handling overload
○ Admission control to protect a server (see the sketch below)
○ Push back to the previous server in the pipeline
○ Sometimes better to drop data and issue MustResync
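
One common way to implement the admission-control bullet above is a token bucket checked before accepting each request. This is a generic sketch, not Thialfi's actual mechanism; the capacity and refill parameters are assumptions.

    final class AdmissionController {
      private final long capacity;       // burst size
      private final long refillPerSec;   // sustained rate
      private double tokens;
      private long lastNanos = System.nanoTime();

      AdmissionController(long capacity, long refillPerSec) {
        this.capacity = capacity;
        this.refillPerSec = refillPerSec;
        this.tokens = capacity;
      }

      // Returns false under overload; the caller rejects the request so the
      // previous server in the pipeline backs off or sheds the work.
      synchronized boolean admit() {
        long now = System.nanoTime();
        tokens = Math.min(capacity, tokens + (now - lastNanos) / 1e9 * refillPerSec);
        lastNanos = now;
        if (tokens < 1.0) return false;
        tokens -= 1.0;
        return true;
      }
    }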
- 1. Is this thing on?
- 2. And you thought you could debug
- 3. Clients considered harmful
- 4. Getting your foot (code) in the door
- 5. You are building your castle on sand
- 6. The customer is sometimes right
- 7. You cannot anticipate the hard parts