Counting is Hard: Probabilistically Counting Views at Reddit


SLIDE 1

Counting is Hard: Probabilistically Counting Views at Reddit

Krishnan Chandra, Data Engineer

SLIDE 2

Overview

  • What is probabilistic counting?
  • How did probabilistic counting help us scale?
  • What issues did we face along the way?

SLIDE 3

What is Reddit?

Reddit is the front page of the internet: a social network with tens of thousands of communities around whatever passions or interests you might have. It's where people converse about the things that are most important to them.

SLIDE 4

Reddit by the numbers

  • Alexa Rank (US/World): 4th / 7th
  • Monthly active users (MAU): 330M+
  • Active communities: 138K+
  • Posts per month: 10.7M
  • Screenviews per month: 14B

SLIDE 5

Counting Views

SLIDE 6

Why Count Views?

  • Includes logged-out users
  • Better measure of reach than votes
  • Currently exposed to moderators and content creators

SLIDE 7

[Example posts: "Cat Walking a Human"; "Cat Fist Bumping"]

SLIDE 8

Why is Counting Hard?

SLIDE 9

Product Requirements

  • Counts are over the life of a post
  • The same user should not count multiple times within a short time frame
  • Should build in some protections against spamming/cheating (similar to votes)
  • Should provide (near) real-time feedback
SLIDE 10

Exact vs. Approximate Counting

  • Exact counting:
    ○ Requires storing state per user per post
  • Approximate counting:
    ○ Requires much less state and storage
    ○ Provides an estimate of reach within a few percentage points of the exact number

SLIDE 11

HyperLogLog (And Friends)

  • HyperLogLog (HLL)
    ○ Hash-based probabilistic algorithm published in 2007
    ○ Approximates set cardinality
    ○ Works well for large cardinalities, but not for small ones
  • HyperLogLog++
    ○ Introduced by Google in 2013
    ○ Uses sparse and dense HLL representations
    ○ Switches from the sparse to the dense representation once the cardinality grows large enough

SLIDE 12
SLIDE 13

How does HLL work?

  • Hash table consisting of m registers or buckets, each of width k bits
  • Hash the input value, and split the hash value into 2 portions
  • First portion (log2(m) bits) used to index into a register
  • Second portion used to count the number of leading zeroes and set the register value

SLIDE 14

Worked example: assume m=8 registers, k=3 bits.

  • The top log2(8) = 3 bits of the input hash are 111, selecting register #7
  • The remaining bits begin with 3 leading zeroes, so record 3+1=4 into register #7 (stored as 100 in the 3-bit register)
  • Registers: r0 r1 r2 r3 r4 r5 r6 r7

Adapted from "HyperLogLog - A Layman's Overview"
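The register-update step above can be sketched in a few lines of Python, using the slide's parameters (m=8 registers, so 3 index bits). The 16-bit hash width is an illustrative assumption chosen for readability, not part of any real implementation.

```python
M = 8            # number of registers
INDEX_BITS = 3   # log2(M)
HASH_BITS = 16   # toy hash width, assumed for illustration

def split_and_rank(hash_value: int) -> tuple[int, int]:
    """Split a hash into (register index, rank): the top INDEX_BITS
    select a register, and rank = leading zeroes in the remainder + 1."""
    remainder_bits = HASH_BITS - INDEX_BITS
    register = hash_value >> remainder_bits               # top 3 bits
    remainder = hash_value & ((1 << remainder_bits) - 1)  # low 13 bits
    leading_zeroes = remainder_bits - remainder.bit_length()
    return register, leading_zeroes + 1

# Top bits 111 -> register 7; the 13-bit remainder 0b0001000000000 has
# 3 leading zeroes, so rank = 3 + 1 = 4, matching the slide's example.
register, rank = split_and_rank((0b111 << 13) | 0b0001000000000)
```

An insert then sets `registers[register] = max(registers[register], rank)`, which is what makes HLL inserts idempotent.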

SLIDE 15

Computing Cardinality

  • The cardinality estimate is computed by taking a bias-corrected harmonic mean of 2 raised to each register value
  • Intuition: HLL is like flipping a coin!
  • The largest run of heads gives an estimate of the total number of flips
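As a sketch, the raw estimator from the 2007 HLL paper looks like this. Note that production implementations (including Redis's) layer small- and large-range corrections on top, which are omitted here.

```python
def raw_hll_estimate(registers: list[int]) -> float:
    """Raw HLL estimate: a bias-corrected harmonic mean of 2**r over
    the m register values r. Small- and large-range corrections used
    by real implementations are omitted in this sketch."""
    m = len(registers)
    # Bias-correction constants from the original HLL paper.
    alpha = {16: 0.673, 32: 0.697, 64: 0.709}.get(m, 0.7213 / (1 + 1.079 / m))
    harmonic_sum = sum(2.0 ** -r for r in registers)
    return alpha * m * m / harmonic_sum
```

Larger register values correspond to longer "runs of heads", and so to a larger estimated cardinality.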

SLIDE 16

Counting Error

  • HLL standard error
    ○ Depends on the number of registers/hash buckets, m
    ○ Standard error = 1.04/sqrt(m)
    ○ Using Redis's HLL implementation, the standard error is 0.81%!
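The 0.81% figure falls straight out of the formula, since Redis's dense HLL uses m = 16384 registers:

```python
import math

def hll_standard_error(m: int) -> float:
    """Standard error of an HLL with m registers, as a fraction."""
    return 1.04 / math.sqrt(m)

# Redis: m = 16384, so 1.04 / sqrt(16384) = 1.04 / 128 = 0.008125,
# i.e. the 0.81% quoted on the slide.
```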

SLIDE 17

Using HLL to Count Views

  • 1 HLL per post
  • HLL inserts are idempotent!
    ○ Allows reprocessing data if needed
  • How to manage de-duping over a short time window?
    ○ Store user + truncated timestamp as the inserted value
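The "user + truncated timestamp" trick can be sketched like this. The 10-minute window and the string format are illustrative assumptions, not Reddit's actual values:

```python
def hll_element(user_id: str, timestamp: int, window_seconds: int = 600) -> str:
    """Value inserted into a post's HLL: the user id joined with the
    timestamp truncated to the de-dup window. Repeat views by one user
    inside a window produce the same element, so the idempotent HLL
    insert de-dupes them automatically."""
    return f"{user_id}:{timestamp // window_seconds}"
```

Because identical elements hash to the same register update, no extra per-user state is needed to enforce the short-window de-dup rule.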

SLIDE 18
SLIDE 19

Space Usage

  • Exact counting:
    ○ User id = 8-byte long
    ○ ~1.5M users × 8 bytes = 12 MB
  • HLL (Redis implementation):
    ○ Max size = 12 KB
    ○ 0.1% of the exact-counting storage
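A quick back-of-envelope check of the slide's numbers for one popular post (the slide rounds 12288 bytes to 12 KB):

```python
# Exact counting: one 8-byte user id per viewer.
users = 1_500_000
exact_bytes = users * 8          # 12,000,000 bytes, i.e. ~12 MB

# Approximate counting: Redis's dense HLL is capped at 12 KB.
hll_bytes = 12 * 1024

ratio = hll_bytes / exact_bytes  # ~0.001, i.e. roughly 0.1%
```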

SLIDE 20

Counting Architecture

SLIDE 21

Architecture Goals

  1. Consume a stream of view events and filter out spam/bad events
  2. For good events, insert into an HLL in real time
  3. Allow clients to consume view counts in real time
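The three goals can be sketched as a toy in-memory pipeline. Per-post "HLLs" are modeled as plain sets and the event field names are assumptions; the real pipeline uses Kafka for the event stream and one Redis HLL per post.

```python
from typing import Callable, Dict, Iterable, Set

def count_views(
    events: Iterable[dict],
    is_spam: Callable[[dict], bool],
    hlls: Dict[str, Set[str]],
) -> Dict[str, int]:
    """Toy version of the pipeline described above."""
    for event in events:
        if is_spam(event):                      # goal 1: drop bad events
            continue
        # goal 2: idempotent per-post insert (a set stands in for the HLL)
        hlls.setdefault(event["post_id"], set()).add(event["user_id"])
    # goal 3: clients read near-real-time counts
    return {post: len(members) for post, members in hlls.items()}
```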

SLIDE 22

[Architecture diagram: App Servers emit Server Side Events and Client Side Events, which flow through Anti-Spam into Counting]

SLIDE 23

Stream Processing Infrastructure

  • Kafka
    ○ Main message bus for view events
  • Redis
    ○ Used for storing state + HLLs
    ○ Intended as short-term storage
    ○ Functions as a cache for Cassandra
  • Cassandra
    ○ Used to store the final counts and HLLs in separate column families
    ○ Intended as long-term storage

SLIDE 24

Counting Application (Part 1)

  • Anti-Spam Consumer
    ○ Consumes the stream of views from Kafka
    ○ Basic rules engine backed by Redis
    ○ Outputs a decision to a Kafka topic

SLIDE 25

Counting Application (Part 2)

  • Counting Consumer
    ○ Consumes the decisions topic output by the anti-spam consumer
    ○ Creates/updates the HLL for the post in Redis
    ○ Stores both the count and the HLL out to Cassandra

SLIDE 26

Scaling Challenges

SLIDE 27

Redis

  • Problems
    ○ Rules engine is very memory-heavy
    ○ HLL counting is very CPU-heavy
    ○ Rules engine data is generally time-bound, with expiry
    ○ HLL data should be kept in Redis as long as possible to avoid reading from Cassandra

SLIDE 28

  • Solutions
    ○ Separate Redis instances for the 2 parts of the application
    ○ Different instance types to reflect the different workloads
    ○ allkeys-lru expiration on HLLs, volatile-ttl expiration on the rules engine
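As a sketch, those two eviction policies would be set in each instance's redis.conf roughly like this (one directive per instance, shown together here for comparison):

```
# HLL instance: evict the least-recently-used keys under memory
# pressure; cold HLLs can be re-read from Cassandra.
maxmemory-policy allkeys-lru

# Rules-engine instance: evict the keys closest to their TTL first,
# since rules-engine data is time-bound with expiry.
maxmemory-policy volatile-ttl
```

allkeys-lru may evict any key, which suits a cache; volatile-ttl only considers keys that carry an expiry, which suits data that is time-bound by design.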

SLIDE 29
SLIDE 30
Cassandra

  • Problems
    ○ 1 row per post, overwritten frequently
    ○ Read rate on page loads was overwhelming the cluster
    ○ Issues with load when "catching up"
    ○ Storage grows forever with the number of posts!

SLIDE 31

  • Solutions
    ○ Updates to the same row in Cassandra throttled to once every 10 seconds
    ○ Read caching
    ○ Slow the update rate when catching up
    ○ More disk!
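One way to sketch the per-row throttle (the slide doesn't describe the actual mechanism, so the names and the injectable clock here are illustrative assumptions):

```python
import time

def make_throttled_writer(write, min_interval: float = 10.0, clock=time.monotonic):
    """Wrap a per-row write so each row is flushed to long-term storage
    at most once every min_interval seconds. Skipped writes are safe to
    drop because the next flush carries the latest HLL/count anyway."""
    last_write = {}

    def maybe_write(row_key, value) -> bool:
        now = clock()
        if now - last_write.get(row_key, float("-inf")) >= min_interval:
            write(row_key, value)
            last_write[row_key] = now
            return True
        return False

    return maybe_write
```

Injecting the clock keeps the throttle testable without sleeping for real wall-clock time.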

SLIDE 32
SLIDE 33
Observations

  • Views on Reddit skew towards newer posts
    ○ Allows most views to be served by Redis
    ○ Keeps the read rate on Cassandra very low

SLIDE 34
SLIDE 35
Takeaways

  • Thanks to HLLs, counting views became much more efficient
    ○ Current storage usage is ~1 TB for a full year of posts!
  • Delivery was possible in a quarter with an engineering team of 3 (not always full time)

SLIDE 36

Thanks to our team!

  • /u/gooeyblob - Cassandra + Backend
  • /u/d3fect - Backend + API
  • /u/powerlanguage - Product Management

SLIDE 37

Thanks!

Krishnan Chandra krishnan@reddit.com u/shrink_and_an_arch

PS: We’re hiring! http://reddit.com/jobs

SLIDE 38

References

  • View Counting at Reddit (Reddit blog post, 2017)
  • Original HyperLogLog paper (Flajolet et al., 2007)
  • Redis blog post announcing HLL support
  • Google paper introducing the HyperLogLog++ algorithm (2013)
  • HyperLogLog - A Layman's Overview