Naiad: A Timely Dataflow System


SLIDE 1

Naiad: A Timely Dataflow System

Microsoft Research Silicon Valley

Presented by Braden Ehrat

Derek G. Murray Michael Isard Rebecca Isaacs Martin Abadi Frank McSherry Paul Barham

SLIDE 2

  • Batch processing
  • Stream processing
  • Graph processing

SLIDE 3

  • Batch processing: Hadoop, Dryad
  • Stream processing: Storm, MillWheel
  • Graph processing: GraphLab, PowerGraph

SLIDE 4

  • Batch processing
  • Stream processing
  • Graph processing

All three in one system: timely dataflow with Naiad

SLIDE 5

Timely dataflow

A new computational model for stream processing

  • Supports feedback loops
  • Stateful vertices with arbitrary data
  • Notifications for end of epoch
SLIDE 6

Low-latency, incremental stream processing

  • < 100ms interactive queries
  • < 1ms iterations
  • < 1s batch updates

SLIDE 7

Dataflow

[Diagram: a dataflow graph made of stages joined by connectors]

SLIDE 8

Dataflow: parallelism

[Diagram: stages B and C expanded into parallel vertices connected by edges]

SLIDE 9

Messages

[Diagram: vertices B, C, D]

B.SENDBY(edge, message, time)
C.ONRECV(edge, message, time)

✉ Messages are delivered asynchronously

SLIDE 10

Notifications

[Diagram: vertices B, C, D]

D.NOTIFYAT(time)
C.SENDBY(_, _, time)
D.ONRECV(_, _, time)
D.ONNOTIFY(time): no more messages at time or earlier

Notifications support batching
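The interplay of SENDBY, ONRECV, NOTIFYAT and ONNOTIFY can be sketched with a toy single-process runtime. This is only an illustration under stated assumptions, not Naiad's scheduler: `ToyVertex`, `ToyRuntime`, `Forward`, `Batcher` and `NotificationDemo` are hypothetical names, times are plain integers, the graph is an acyclic pipeline with one outgoing edge per vertex, and all input is supplied before `Run()` starts.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical toy types for illustration; this is not Naiad's API.
public abstract class ToyVertex
{
    internal ToyRuntime Runtime;
    internal int Index;              // position in the pipeline (edges go low -> high)
    public ToyVertex Downstream;     // single outgoing edge; null for a sink

    public virtual void OnRecv(string msg, int time) { }
    public virtual void OnNotify(int time) { }

    protected void SendBy(string msg, int time) => Runtime.Enqueue(Downstream, msg, time);
    protected void NotifyAt(int time) => Runtime.RequestNotification(this, time);
}

public class ToyRuntime
{
    private readonly Queue<(ToyVertex To, string Msg, int Time)> inFlight =
        new Queue<(ToyVertex To, string Msg, int Time)>();
    private readonly List<(int Time, ToyVertex At)> pending = new List<(int Time, ToyVertex At)>();
    private int nextIndex = 0;

    public T Add<T>(T v) where T : ToyVertex { v.Runtime = this; v.Index = nextIndex++; return v; }

    public void Enqueue(ToyVertex to, string msg, int time)
    {
        if (to != null) inFlight.Enqueue((to, msg, time));
    }

    public void RequestNotification(ToyVertex at, int time) => pending.Add((time, at));

    public void Run()
    {
        while (inFlight.Count > 0 || pending.Count > 0)
        {
            if (inFlight.Count > 0)
            {
                // Messages are delivered first, in whatever order they were queued.
                var m = inFlight.Dequeue();
                m.To.OnRecv(m.Msg, m.Time);
            }
            else
            {
                // Nothing is in flight, so the earliest pending notification is safe:
                // ties are broken upstream-first, and vertices only send at times
                // at or after the event they are handling.
                var next = pending.OrderBy(p => p.Time).ThenBy(p => p.At.Index).First();
                pending.Remove(next);
                next.At.OnNotify(next.Time);
            }
        }
    }
}

public class Forward : ToyVertex
{
    public override void OnRecv(string msg, int time) => SendBy(msg, time);
}

// Batches every message of one time and emits the batch when notified.
public class Batcher : ToyVertex
{
    private readonly Dictionary<int, List<string>> byTime = new Dictionary<int, List<string>>();
    public readonly List<string> Output = new List<string>();

    public override void OnRecv(string msg, int time)
    {
        if (!byTime.ContainsKey(time)) { byTime[time] = new List<string>(); NotifyAt(time); }
        byTime[time].Add(msg);
    }

    public override void OnNotify(int time)
    {
        Output.Add(time + ": " + string.Join(",", byTime[time]));
        byTime.Remove(time);
    }
}

public static class NotificationDemo
{
    public static List<string> Run()
    {
        var rt = new ToyRuntime();
        var c = rt.Add(new Forward());
        var d = rt.Add(new Batcher());
        c.Downstream = d;
        // Input arrives with times interleaved.
        rt.Enqueue(c, "x", 2);
        rt.Enqueue(c, "a", 1);
        rt.Enqueue(c, "y", 2);
        rt.Enqueue(c, "b", 1);
        rt.Run();
        return d.Output;
    }
}
```

Calling `NotificationDemo.Run()` from any entry point returns the two batches `"1: a,b"` and `"2: x,y"`: although messages for times 1 and 2 arrive interleaved, each batch is emitted only after every message for its time has been received. A real system cannot wait for all queues to drain, which is why progress tracking is needed.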

SLIDE 11

Progress tracking

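[Diagram: vertices A, B, C, D, E]

E.NOTIFYAT(t)
C.ONRECV(_, _, t)
C.SENDBY(_, _, tʹ) with tʹ ≥ t

Once no message with time ≤ t can still reach E, epoch t is complete and E can be notified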
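One way to picture this check is to count outstanding messages per (vertex, time) and allow a notification only when none of them could still lead to a message at or before the requested time. The sketch below is a simplification under stated assumptions, not Naiad's progress-tracking protocol: `ToyProgressTracker` and `ProgressDemo` are hypothetical names, times are plain integers, the graph is acyclic, and the A→B→C→D→E pipeline is assumed because the slide does not give the topology.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical toy tracker for an ACYCLIC graph with integer times; not Naiad's API.
public class ToyProgressTracker
{
    // downstream[v] = vertices directly reachable from v
    private readonly Dictionary<string, List<string>> downstream =
        new Dictionary<string, List<string>>();
    // outstanding[(v, t)] = number of unprocessed messages at vertex v with time t
    private readonly Dictionary<(string Vertex, int Time), int> outstanding =
        new Dictionary<(string Vertex, int Time), int>();

    public void AddEdge(string src, string dst)
    {
        if (!downstream.ContainsKey(src)) downstream[src] = new List<string>();
        downstream[src].Add(dst);
    }

    // A message was sent to `vertex` and has not been processed yet.
    public void MessageSent(string vertex, int time)
    {
        outstanding.TryGetValue((vertex, time), out var n);
        outstanding[(vertex, time)] = n + 1;
    }

    // `vertex` finished handling one message with this time.
    public void MessageProcessed(string vertex, int time)
    {
        var key = (vertex, time);
        int n = outstanding[key] - 1;
        if (n == 0) outstanding.Remove(key); else outstanding[key] = n;
    }

    // True if dst is src itself or lies downstream of src.
    private bool CanReach(string src, string dst)
    {
        var seen = new HashSet<string>();
        var stack = new Stack<string>();
        stack.Push(src);
        while (stack.Count > 0)
        {
            var v = stack.Pop();
            if (v == dst) return true;
            if (!seen.Add(v) || !downstream.ContainsKey(v)) continue;
            foreach (var next in downstream[v]) stack.Push(next);
        }
        return false;
    }

    // A notification at (vertex, time) is safe when no outstanding message
    // with an earlier-or-equal time sits at or upstream of the vertex.
    public bool CanNotify(string vertex, int time) =>
        !outstanding.Keys.Any(p => p.Time <= time && CanReach(p.Vertex, vertex));
}

public static class ProgressDemo
{
    public static bool[] Run()
    {
        var tracker = new ToyProgressTracker();
        foreach (var (src, dst) in new[] { ("A", "B"), ("B", "C"), ("C", "D"), ("D", "E") })
            tracker.AddEdge(src, dst);

        tracker.MessageSent("C", 1);                 // C still has a message at t = 1
        bool before = tracker.CanNotify("E", 1);     // blocked by that message

        tracker.MessageSent("D", 2);                 // C.OnRecv sends at t' = 2 >= 1 ...
        tracker.MessageProcessed("C", 1);            // ... and then finishes its message
        bool after = tracker.CanNotify("E", 1);      // nothing at time <= 1 remains
        bool later = tracker.CanNotify("E", 2);      // D still holds a message at 2

        return new[] { before, after, later };
    }
}
```

`ProgressDemo.Run()` returns `{ false, true, false }`. Recording the new message before retiring the old one keeps the outstanding set conservative at every step, so a notification is never allowed too early.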

SLIDE 12

Dataflow: iteration

SLIDE 13

Progress tracking

[Diagram: vertices A, B, C, D, E]

Problem: C depends on its own output

C.NOTIFYAT(t)

SLIDE 14

[Diagram: vertices A, B, C, D, E, plus vertex F]

A.SENDBY(_, _, 1)
C.NOTIFYAT((1, 6))
D.SENDBY(_, _, (1, 6))
B.SENDBY(_, _, (1, 7))
E.NOTIFYAT(?) becomes E.NOTIFYAT(1)
F: advances timestamp and loop counter

Solution: structured timestamps in loops

C.NOTIFYAT(t)
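A structured timestamp such as (1, 6) pairs an input epoch with a loop counter. The sketch below is one way to model it, assuming the rules described in the Naiad paper: entering a loop appends a counter set to 0, the feedback edge increments the innermost counter, leaving a loop drops it, and two timestamps at the same loop depth are ordered by epoch and by lexicographic comparison of their counters. `LoopTimestamp` and its method names are hypothetical, not Naiad's own types.

```csharp
using System;
using System.Linq;

// Hypothetical type for illustration; not Naiad's own timestamp class.
public sealed class LoopTimestamp
{
    public int Epoch { get; }
    public int[] Counters { get; }   // one counter per enclosing loop, outermost first

    public LoopTimestamp(int epoch, params int[] counters)
    {
        Epoch = epoch;
        Counters = counters.ToArray();   // defensive copy
    }

    // Entering a loop: append a new counter starting at 0.
    public LoopTimestamp EnterLoop() =>
        new LoopTimestamp(Epoch, Counters.Concat(new[] { 0 }).ToArray());

    // Leaving a loop: drop the innermost counter.
    public LoopTimestamp LeaveLoop()
    {
        if (Counters.Length == 0) throw new InvalidOperationException("not inside a loop");
        return new LoopTimestamp(Epoch, Counters.Take(Counters.Length - 1).ToArray());
    }

    // Feedback edge: increment the innermost counter.
    public LoopTimestamp NextIteration()
    {
        if (Counters.Length == 0) throw new InvalidOperationException("not inside a loop");
        var c = Counters.ToArray();
        c[c.Length - 1]++;
        return new LoopTimestamp(Epoch, c);
    }

    // this <= other iff the epoch is <= and the counters are <= lexicographically.
    // Only defined here for timestamps at the same loop depth.
    public bool LessOrEqual(LoopTimestamp other)
    {
        if (Counters.Length != other.Counters.Length)
            throw new ArgumentException("timestamps must have the same loop depth");
        if (Epoch > other.Epoch) return false;
        for (int i = 0; i < Counters.Length; i++)
        {
            if (Counters[i] != other.Counters[i]) return Counters[i] < other.Counters[i];
        }
        return true;
    }

    public override string ToString() =>
        Counters.Length == 0
            ? Epoch.ToString()
            : "(" + Epoch + ", " + string.Join(", ", Counters) + ")";
}
```

Following the slide's numbers: `new LoopTimestamp(1).EnterLoop()` gives (1, 0), six calls to `NextIteration()` give (1, 6), one more gives (1, 7), and `LeaveLoop()` returns to 1. Because (1, 6) is less than or equal to (1, 7) but not the reverse, a message that has gone around the loop again can no longer block a notification at (1, 6), which is what lets C be notified even though it depends on its own output.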

SLIDE 15

Simple API

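class DistinctCount<S,T> : Vertex<T>
{
    // For each time, the number of occurrences of each distinct message.
    Dictionary<T, Dictionary<S,int>> counts = new Dictionary<T, Dictionary<S,int>>();

    void OnRecv(Edge e, S msg, T time)
    {
        if (!counts.ContainsKey(time))
        {
            // First message at this time: start a table and request a notification.
            counts[time] = new Dictionary<S,int>();
            this.NotifyAt(time);
        }
        if (!counts[time].ContainsKey(msg))
        {
            // First sighting of msg at this time: emit it right away.
            counts[time][msg] = 0;
            this.SendBy(output1, msg, time);
        }
        counts[time][msg]++;
    }

    void OnNotify(T time)
    {
        // No more messages at this time will arrive: emit the final counts.
        foreach (var pair in counts[time])
            this.SendBy(output2, pair, time);
        counts.Remove(time);
    }
}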

SLIDE 16

Evaluation

All-to-all exchange throughput

  • Naiad exchanges 8-byte records between all processes
  • Shows low, linear overhead

SLIDE 17

Global barrier (Iteration) latency

  • Evaluates the time to achieve global coordination
  • No data was exchanged
  • Effect of micro-stragglers seen at 50-60 nodes

SLIDE 18

Real world calculations

Twitter follower graph

  • 42M nodes
  • 1.5B Edges
  • 6GB on disk

PageRank on Twitter followers

SLIDE 19

Real world calculations

Vowpal Wabbit: open-source distributed machine learning

Naiad is on par with specialized implementations

SLIDE 20

Query Latency

Compute connected components and top tweets

  • 32,000 tweets/s
  • 10 queries/s

  • Fresh: queries delayed behind updates
  • 1s delay: querying stale but consistent data

SLIDE 21

Conclusions

Timely Dataflow in Naiad achieves:

  • The performance of specialized frameworks
  • Generic flexibility

Open source: http://github.com/MicrosoftResearchSVC/naiad/