Building real-time analytics applications using Pinot: A LinkedIn case study


slide-1
SLIDE 1

Building real-time analytics applications using Pinot

A LinkedIn case study

slide-2
SLIDE 2

Member Job Ad Post Company Course

slide-3
SLIDE 3

LinkedIn Activity Data Model

Member Job Ad Post Company Course

Actor Verb Object

slide-4
SLIDE 4

Activity Data Scale

  • 610+ million users
  • Tens of millions of posts liked/shared per day
  • 30 million companies
  • 3+ million jobs posted per month

Trillions of events/day

slide-5
SLIDE 5

Generate Analyze Create

LifeCycle

What can we do with all the activity data?

slide-6
SLIDE 6

Pinot @ LinkedIn

slide-7
SLIDE 7

Pinot @ LinkedIn

slide-8
SLIDE 8

ThirdEye

Who Am I

ESPRESSO

slide-9
SLIDE 9
slide-10
SLIDE 10
slide-11
SLIDE 11

Use case 1: Article Analytics

slide-12
SLIDE 12

Option 1: Join on the Fly

Activity Stream

Activity Table: Member Id, Action, ArticleId, Time
Member Table: Member Id, Industry, Geo, Skills, Company

SELECT M.industry, count(*) FROM Activity AS A INNER JOIN Member AS M ON A.memberId = M.memberId WHERE A.articleId = <111> GROUP BY M.industry

App Join

  • Real-time (depending on storage)
  • High latency

Like Comment Shares View
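
The join-on-the-fly option can be sketched with Python's built-in sqlite3 module. This is only a minimal illustration: the table contents are made up, the column is named action_type rather than Action, and SQLite stands in for whatever store holds the Activity and Member tables.

```python
import sqlite3

# Hypothetical miniature Activity and Member tables (sample data is invented).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Member (memberId INTEGER PRIMARY KEY, industry TEXT, geo TEXT);
    CREATE TABLE Activity (memberId INTEGER, action_type TEXT, articleId INTEGER, time INTEGER);
    INSERT INTO Member VALUES (1, 'Software', 'US'), (2, 'Finance', 'CA'), (3, 'Software', 'IE');
    INSERT INTO Activity VALUES
        (1, 'view', 111, 1000), (2, 'like', 111, 1001),
        (3, 'view', 111, 1002), (1, 'view', 222, 1003);
""")


def industry_breakdown(conn, article_id):
    """Join raw activity rows with member attributes at query time."""
    rows = conn.execute(
        """
        SELECT M.industry, count(*)
        FROM Activity AS A
        INNER JOIN Member AS M ON A.memberId = M.memberId
        WHERE A.articleId = ?
        GROUP BY M.industry
        """,
        (article_id,),
    ).fetchall()
    return dict(rows)


breakdown = industry_breakdown(conn, 111)
print(breakdown)
```

Every request repeats the join and the aggregation over raw rows, which is where the high latency noted on the slide comes from.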

slide-13
SLIDE 13

Option 2: Pre Join + Pre Aggregate

Activity Stream

Member Id Industry Geo Skills Company

SELECT industry, sum(count) FROM PreJoined_Activity_Member WHERE articleId = <111> GROUP BY industry

App

Member Table

Stream Processing Framework

Article Id Industry

...

Action Time Count

  • Near real-time ingestion
  • Low latency (unpredictable*)

Look Up Pre Join + Pre Agg

Like Comment Shares View
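
A rough sketch of the pre-join + pre-aggregate step, assuming the stream processor keeps a member lookup table in memory and rolls events up into hourly buckets. The names MEMBERS, pre_join_and_aggregate, and bucket_seconds are hypothetical and not part of Samza or Pinot.

```python
from collections import Counter

# Hypothetical member lookup table (the "Look Up" step in the diagram).
MEMBERS = {
    1: {"industry": "Software", "geo": "US"},
    2: {"industry": "Finance", "geo": "CA"},
    3: {"industry": "Software", "geo": "IE"},
}


def pre_join_and_aggregate(events, members, bucket_seconds=3600):
    """Enrich each event with member attributes and roll up counts.

    Key: (articleId, industry, geo, action, time bucket). Value: count.
    """
    counts = Counter()
    for event in events:
        member = members.get(event["memberId"])
        if member is None:
            continue  # assumption: events for unknown members are dropped
        bucket = event["time"] - event["time"] % bucket_seconds
        key = (event["articleId"], member["industry"], member["geo"],
               event["action"], bucket)
        counts[key] += 1
    return counts


def industry_breakdown(counts, article_id):
    """SELECT industry, sum(count) ... WHERE articleId = ? GROUP BY industry."""
    result = Counter()
    for (article, industry, _geo, _action, _bucket), count in counts.items():
        if article == article_id:
            result[industry] += count
    return dict(result)


events = [
    {"memberId": 1, "action": "view", "articleId": 111, "time": 10},
    {"memberId": 1, "action": "view", "articleId": 111, "time": 20},
    {"memberId": 2, "action": "like", "articleId": 111, "time": 30},
    {"memberId": 3, "action": "view", "articleId": 222, "time": 40},
]
table = pre_join_and_aggregate(events, MEMBERS)
print(len(table), industry_breakdown(table, 111))
```

Four events collapse into three pre-joined rows here. The query still scans every row for the article, so latency depends on how many distinct dimension combinations that article has.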

slide-14
SLIDE 14

Option 3: Pre Join + Pre Cube + Pre Agg

Activity Stream

Member Id Industry Geo Skills Company

SELECT industry, sum(count) FROM PreCubed_Activity_Member WHERE articleId = <111> AND company = '*' AND … = '*' GROUP BY industry

App

Member Table

Stream Processing Framework

Article Id Industry

...

Action Time Count

Look Up Pre Cube

  • Very fast (mostly lookup)
  • Batch (Hourly/Daily)
  • Extra storage (Curse of dimensionality)
  • Re-bootstrap on schema changes
  • Limited query capability

Like Comment Shares View
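
The extra storage cost of pre-cubing can be illustrated with a small sketch, assuming three member dimensions and '*' as the roll-up marker used in the query above. DIMENSIONS and pre_cube are hypothetical names, and the sample rows are invented.

```python
from collections import Counter
from itertools import product

DIMENSIONS = ("industry", "geo", "company")  # hypothetical subset of member attributes


def pre_cube(rows, dimensions=DIMENSIONS):
    """Materialize a count for every combination of concrete value or '*'."""
    cube = Counter()
    for row in rows:
        # For each dimension, choose either the real value or the wildcard.
        choices = [(row[d], "*") for d in dimensions]
        for combo in product(*choices):
            cube[(row["articleId"],) + combo] += row["count"]
    return cube


rows = [
    {"articleId": 111, "industry": "Software", "geo": "US", "company": "A", "count": 5},
    {"articleId": 111, "industry": "Software", "geo": "IE", "company": "B", "count": 2},
    {"articleId": 111, "industry": "Finance", "geo": "US", "company": "C", "count": 1},
]
cube = pre_cube(rows)

# WHERE articleId = 111 AND geo = '*' AND company = '*' GROUP BY industry
by_industry = {
    industry: count
    for (article, industry, geo, company), count in cube.items()
    if article == 111 and industry != "*" and geo == "*" and company == "*"
}
print(by_industry)
print(len(rows), "input rows ->", len(cube), "cube entries")
```

Each input row expands into 2^d keys for d dimensions, so the answer becomes a lookup, but storage grows quickly as dimensions are added.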

slide-15
SLIDE 15

Comparison

Activity Table, Member, Pre Join, PreAggregation, PreCubed

Axes: Latency, Flexibility

Systems: Presto, BigQuery, RedShift; Pinot, Druid, ElasticSearch, InfluxDB; Kylin, KV Store; Pinot

slide-16
SLIDE 16

Publisher Analytics Architecture

PINOT

Like Comment Shares View

Article Analytics Samza Member DB

Article Activity Article Activity + Member data

Espresso 2 year retention

slide-17
SLIDE 17

Can we use the activities data to improve the feed?

slide-18
SLIDE 18

Feed Relevance

01 Identity

  • Company
  • Geography
  • Skills

02 Content

  • Views, Likes, Comments
  • Age
  • Category

03 Behaviour

  • Prior interactions
  • Interests
  • Engagement

Rank the feed based on relevance

slide-19
SLIDE 19

Feed Ranking Architecture

PINOT Feed Ranker

Like Comment Shares View

Samza Member Table

Article Activity Activity + Member data + Article Data

Espresso Article Table 30 day retention

slide-20
SLIDE 20

Feed Ranking Perf Numbers

QPS    p50    p90     p99     p99.9
6400   5ms    25ms    45ms    100ms

Significant increase in engagement

SELECT sum(count) FROM T WHERE memberId = <> AND article IN (list of 1500 items) AND time >= (now - 14 days) GROUP BY action, item, position, time

slide-21
SLIDE 21

Site Facing use case: Pinot vs Druid

  • Sorted Index
  • Per query optimizer
  • Optional indexing
slide-22
SLIDE 22

What Business Insights can we generate from this data?

slide-23
SLIDE 23

Posts Published: Breakdown By Country

slide-24
SLIDE 24

Distribution: By Industry

slide-25
SLIDE 25

Views: Breakdown by Referrer

slide-26
SLIDE 26

Slice and Dice UI

  • 1000s of business metrics
  • Trillions of rows
slide-27
SLIDE 27

Dashboard Pipeline Architecture

HDFS UMP UI (Raptor) Metric Definition and Compute Logic Espresso Activity Data Member, Company, Article Data

slide-28
SLIDE 28

Dashboard use case: Pinot vs Druid

  • ~ 5000 random queries of the form

○ SELECT sum(views), time FROM T WHERE country = 'us' AND browser = 'chrome' AND … GROUP BY time

  • run sequentially one after the other

             Pinot        Druid
Total time   11 minutes   24 minutes
p50          84 ms        136 ms
p90          206 ms       667 ms

slide-29
SLIDE 29

Why don’t we monitor these metrics and alert on them?

Anomaly Detection

slide-30
SLIDE 30

ThirdEye: Anomaly Detection

slide-31
SLIDE 31

ThirdEye: Root Cause Analysis

18 Dimensions: action_type, Domain, article_type, author_type, #connections, industry, …, location, verb_type

  • Interactive breakdown
  • Sub-second
  • Multiple queries

slide-32
SLIDE 32

Anomaly Detection

TOP LEVEL

SELECT sum(view), time FROM PostView GROUP BY time

MULTI DIMENSIONAL

for d1 in [us, ca, … ]
  for d2 in [chrome, ie, … ]
    …
      SELECT sum(view), time FROM PostView WHERE country = d1 AND browser = d2 AND ... GROUP BY time

slide-33
SLIDE 33

Multi-dimensional anomaly detection challenges

for d1 in [us, ca, … ]
  for d2 in [chrome, ie, … ]
    …
      SELECT sum(view), time FROM PostView WHERE country = d1 AND browser = d2 AND ... GROUP BY time

MULTI DIMENSIONAL

1. Identifying issues requires monitoring all possible combinations
2. No Id column (ArticleId, Member Id)
3. Latency is unpredictable even with Inverted Index

select sum(view) where country = 'us'        scans 60-70% of the rows   Slow
select sum(view) where country = 'ireland'   scans <1% of the rows      Fast
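
The nested loops on this slide can be written out as a short sketch that generates one query per dimension combination. The dimension values and the build_queries helper are hypothetical, and the strings are built for illustration only rather than sent to a real store.

```python
from itertools import product

# Hypothetical dimension values; real deployments have many more of both.
DIMENSION_VALUES = {
    "country": ["us", "ca", "ie"],
    "browser": ["chrome", "ie", "firefox"],
}


def build_queries(dimension_values, table="PostView"):
    """Yield one time-series query per combination of dimension values."""
    names = list(dimension_values)
    for combo in product(*(dimension_values[n] for n in names)):
        predicates = " AND ".join(f"{n} = '{v}'" for n, v in zip(names, combo))
        yield f"SELECT sum(view), time FROM {table} WHERE {predicates} GROUP BY time"


queries = list(build_queries(DIMENSION_VALUES))
print(len(queries))
print(queries[0])
```

The number of queries is the product of the dimension cardinalities, so even a handful of dimensions yields a large monitoring workload.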

slide-34
SLIDE 34

Space-Time trade off

Latency vs. Storage

Columnar Store: No pre-computation
Startree Index: Partial pre-computation
KV Store (Pre-computed): Full Pre-Cube
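
The idea of partial pre-computation can be illustrated with a deliberately simplified sketch: pre-aggregate only those dimension values that would otherwise require scanning more than a chosen number of rows. This is not Pinot's star-tree implementation. The function names and the max_scan_rows threshold are assumptions made for the example.

```python
from collections import defaultdict


def build_partial_index(rows, dimension, metric, max_scan_rows):
    """Pre-aggregate only the values whose row count exceeds max_scan_rows."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[dimension]].append(row)
    precomputed, raw = {}, {}
    for value, members in groups.items():
        if len(members) > max_scan_rows:
            precomputed[value] = sum(r[metric] for r in members)
        else:
            raw[value] = members  # small enough to scan at query time
    return precomputed, raw


def query_sum(index, value, metric):
    """Return (sum, rows_scanned) for a filter on a single dimension value."""
    precomputed, raw = index
    if value in precomputed:
        return precomputed[value], 0
    rows = raw.get(value, [])
    return sum(r[metric] for r in rows), len(rows)


# Invented data: 'us' dominates the table, 'ireland' is rare.
rows = [{"country": "us", "view": 1} for _ in range(70)]
rows += [{"country": "ireland", "view": 1} for _ in range(1)]
rows += [{"country": "ca", "view": 1} for _ in range(29)]

index = build_partial_index(rows, "country", "view", max_scan_rows=30)
print(query_sum(index, "us", "view"))       # (70, 0): answered from the pre-aggregate
print(query_sum(index, "ireland", "view"))  # (1, 1): answered by scanning one row
```

With the threshold in place, no query scans more than max_scan_rows rows, and only the heavy values pay the extra storage.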

slide-35
SLIDE 35

Anomaly Detection: Druid vs Pinot

[Chart: Druid vs. Pinot (Inv Index) vs. Pinot (Star tree Index)]

slide-36
SLIDE 36

Anomaly Detection Architecture

HDFS UMP UI (Raptor) Metric Definition and Compute Logic Espresso ThirdEye

slide-37
SLIDE 37

Pinot usage

  ✓ MarketPlace
  ✓ UberPool
  ✓ UberFreight
  ✓ Jump
  ✓ UberEATS

50TB, 1000 qps

slide-38
SLIDE 38

Conclusion

Activity Data Site Facing Applications Dashboard: Business Analytics Anomaly Detection Key Value store OLAP Store Stream Processing Engine

slide-39
SLIDE 39

Questions

Website: http://pinot.apache.org
Slack: apache-pinot.slack.com
Twitter: @apachepinot, @kishoreBytes