
Kineograph - Raymond Cheng (University of Washington, Microsoft Research) et al. - PowerPoint PPT Presentation



  1. Kineograph. Raymond Cheng (University of Washington, Microsoft Research) et al.

  2. The challenge
     ● Social networks (Facebook, Twitter) generate a lot of information
     ● Let's analyze it!
     ● Simple data-mining won't do:
       ○ too much data
       ○ constant influx of new data
       ○ long computation time

  3. A solution
     ● Process a live stream of data (e.g. tweets)
     ● Aggregate it as a dynamic graph
     ● Take snapshots regularly
     ● Run distributed graph-mining on the snapshots
       ○ support incremental computation

  4. Kineograph architecture

  5. Data influx (ingest node)
     [Diagram: the ingest node receives the tweet "@Alice: @Bob , check out these #kittens !" and turns it into transaction T. T is sent to node(Alice) with the edges @Alice -> #kittens and @Alice -> @Bob, to node(Bob) with @Bob -> @Alice, and to node(kittens) with #kittens -> @Alice. After receiving ACKs, the ingest node reports T to the progress table.]

  6. Data influx (ingest nodes)
     ● Parse data and convert it to graph updates (i.e. sets of edges)
     ● Send each transaction to the affected graph nodes
       ○ at this point, it is just stored in a queue
     ● Report each submitted transaction to the global vector clock
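
The ingest steps above can be sketched roughly as follows. This is a minimal illustration in Python, not the Kineograph code (which the deck says is written in C#). The names `Transaction`, `IngestNode`, `tweet_to_transaction`, the per-vertex `queues` dict and the `progress` dict standing in for the global vector clock are all assumptions made for the example.

```python
import re
from dataclasses import dataclass, field

MENTION = re.compile(r"@(\w+)")
HASHTAG = re.compile(r"#(\w+)")


@dataclass
class Transaction:
    seq: int                                   # per-ingest-node sequence number
    edges: list = field(default_factory=list)  # list of (source, target) pairs


def tweet_to_transaction(seq, text):
    """Parse '@Author: body' into one transaction made of directed edges."""
    author, _, body = text.partition(":")
    author = author.strip()
    edges = [(author, "@" + user) for user in MENTION.findall(body)]
    edges += [(author, "#" + tag) for tag in HASHTAG.findall(body)]
    return Transaction(seq, edges)


class IngestNode:
    def __init__(self, node_id, progress):
        self.node_id = node_id
        self.seq = 0
        self.progress = progress  # hypothetical progress table: ingest id -> last seq

    def ingest(self, text, queues):
        self.seq += 1
        tx = tweet_to_transaction(self.seq, text)
        # Queue the update at every vertex it touches; nothing is applied yet.
        for src, dst in tx.edges:
            for vertex in (src, dst):
                queues.setdefault(vertex, []).append((self.node_id, tx.seq, (src, dst)))
        # Only after the updates are queued is the transaction reported.
        self.progress[self.node_id] = tx.seq
        return tx
```

For the tweet on slide 5, `IngestNode("ingest-0", {}).ingest("@Alice: @Bob , check out these #kittens !", {})` yields the edges `@Alice -> @Bob` and `@Alice -> #kittens`. A real system would wait for acknowledgements from the graph nodes before reporting; the sketch skips that step.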

  7. Snapshot creation

  8. Snapshot creation
     ● The snapshooter initiates the process
       ○ in practice, every 10 seconds
     ● The snapshooter copies the current progress table and sends it to the graph nodes
     ● Graph nodes commit transactions up to the times specified in the progress table
       ○ new updates keep arriving in parallel
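
A small sketch of the snapshot cut described above, under simplifying assumptions: the progress table is a plain dict from ingest-node id to the last reported sequence number, and `GraphNode`, `enqueue`, `commit_up_to` and `take_snapshot` are hypothetical names, not the system's API.

```python
class GraphNode:
    def __init__(self):
        self.pending = []       # (ingest_id, seq, edge) queued but not yet committed
        self.committed = set()  # edges visible in the latest snapshot

    def enqueue(self, ingest_id, seq, edge):
        self.pending.append((ingest_id, seq, edge))

    def commit_up_to(self, cutoff):
        """Apply every queued update covered by the cutoff; keep the rest queued."""
        still_pending = []
        for ingest_id, seq, edge in self.pending:
            if seq <= cutoff.get(ingest_id, 0):
                self.committed.add(edge)
            else:
                still_pending.append((ingest_id, seq, edge))
        self.pending = still_pending


def take_snapshot(progress_table, graph_nodes):
    # Copy the table so that reports arriving later do not move the cutoff.
    cutoff = dict(progress_table)
    for node in graph_nodes:
        node.commit_up_to(cutoff)
    return cutoff
```

Because every graph node commits against the same copied cutoff, all of them agree on which transactions belong to the snapshot, even while new updates are still being queued.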

  9. Computation overview
     ● Runs on snapshots
     ● Algorithm-specific data is stored in vertices
     ● Alternating phases of computation and propagation

  10. Example: TunkRank
     ● similar to PageRank
     ● vertex value: a single real number
     ● add the ranks received from neighbours
     ● when the rank changes by more than ε, push an update to the neighbours
     ● repeat until stable
     Bonus: it's incremental between snapshots!
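
The push-until-stable loop above might look like the sketch below. The slide's formula did not survive extraction, so the update rule here is the commonly cited TunkRank definition, influence(x) = sum over followers y of (1 + p * influence(y)) / |following(y)|, which is an assumption rather than something taken from the deck. The function name, the `follows` dict layout and the `dirty` parameter are likewise illustrative.

```python
from collections import defaultdict, deque


def tunkrank(follows, p=0.5, eps=1e-6, ranks=None, dirty=None):
    """follows[y] is the set of accounts that y follows.

    ranks: values from the previous snapshot (optional).
    dirty: vertices whose incoming contributions changed (default: all).
    """
    followers = defaultdict(set)
    vertices = set(follows)
    for y, outs in follows.items():
        for x in outs:
            followers[x].add(y)
            vertices.add(x)

    ranks = dict(ranks or {})
    for v in vertices:
        ranks.setdefault(v, 0.0)

    work = deque(dirty if dirty is not None else vertices)
    queued = set(work)
    while work:
        x = work.popleft()
        queued.discard(x)
        # Computation phase: add up what each follower contributes.
        new = sum((1.0 + p * ranks[y]) / len(follows[y]) for y in followers[x])
        if abs(new - ranks[x]) > eps:
            ranks[x] = new
            # Propagation phase: everyone x follows must be recomputed.
            for z in follows.get(x, ()):
                if z not in queued:
                    queued.add(z)
                    work.append(z)
    return ranks
```

To reuse results between snapshots, pass the previous `ranks` and a `dirty` set containing every vertex followed by an account whose outgoing edges changed; only the affected part of the graph is then revisited.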

  11. Example: Shortest Paths
     ● Bellman-Ford with landmarks
       ○ landmarks: top vertices from TunkRank
       ○ calculate only paths passing through landmarks
     ● vertex data: distances to landmarks
     ● shorten distances by relaxing edges
     ● push new distances to neighbours
     ● repeat until stable
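
A minimal sketch of the landmark scheme above, assuming an undirected graph with unit edge weights given as an adjacency dict that has an entry for every vertex. `landmark_distances` and `estimate` are names invented for the example.

```python
import math
from collections import deque


def landmark_distances(adj, landmarks):
    """Return dist[v][l], the hop count from vertex v to landmark l."""
    dist = {v: {l: math.inf for l in landmarks} for v in adj}
    work = deque()
    for l in landmarks:
        dist[l][l] = 0
        work.append(l)
    while work:
        u = work.popleft()
        for v in adj[u]:
            improved = False
            for l in landmarks:
                # Relax edge (u, v) with respect to landmark l.
                cand = dist[u][l] + 1
                if cand < dist[v][l]:
                    dist[v][l] = cand
                    improved = True
            if improved:
                work.append(v)  # push the shorter distances onward
    return dist


def estimate(dist, a, b):
    """Length of the shortest a-b path that passes through some landmark."""
    return min(dist[a][l] + dist[b][l] for l in dist[a])
```

On the path graph a-b-c-d-e with the single landmark c, `estimate(dist, "a", "e")` is 4, which is exact, while `estimate(dist, "a", "b")` is 3 even though the two are adjacent. Restricting paths to landmarks gives an upper bound, which is the price of storing only a few numbers per vertex.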

  12. Evaluation
     ● 17,000 lines of C# code
     ● 50 Windows servers
       ○ Intel Xeon (quad-core, 2.8 GHz) with 8 GB RAM
     ● 100k tweets per second (10 times peak Twitter rate)

  13. Degree distribution

  14. Graph growth
     ● Decaying can help

  15. Throughput & timeliness

  16. Throughput

  17. Timeliness

  18. Incrementality helps! (TunkRank)

  19. Incrementality helps!

  20. Scalability (TunkRank)

  21. Fault tolerance
     ● Centralized services (progress table & snapshooter):
       ○ simple replication
       ○ Paxos-based consensus
     ● Ingest nodes:
       ○ input data is cached until it is committed to a snapshot
       ○ if an ingest node fails, all its transactions are discarded
       ○ another machine processes the data from the cache

  22. Replication of graph nodes
     ● Quorum-based: 3 replicas of each node
     ● An update must be acknowledged by 2 replicas
     ● If a replica misses an update, it retrieves it from the other replicas
     ● If a replica fails and is replaced, the replacement waits for the next snapshot and starts working normally from there
     ● For computation failures: rollback and redo
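
The 2-of-3 acknowledgement rule above can be sketched like this. It is a toy model under stated assumptions: each replica keeps an in-memory log keyed by sequence number, and `Replica`, `quorum_write` and `catch_up` are hypothetical names.

```python
class Replica:
    def __init__(self):
        self.log = {}   # seq -> update
        self.up = True  # False simulates a crashed or unreachable replica

    def apply(self, seq, update):
        if not self.up:
            return False  # no acknowledgement
        self.log[seq] = update
        return True


def quorum_write(replicas, seq, update, quorum=2):
    """Send the update to all replicas; succeed if enough acknowledge it."""
    acks = sum(r.apply(seq, update) for r in replicas)
    return acks >= quorum


def catch_up(replica, peers):
    """Fetch any updates this replica missed from the other replicas."""
    for peer in peers:
        for seq, update in peer.log.items():
            replica.log.setdefault(seq, update)
```

With one replica down, `quorum_write` still succeeds on the remaining two, and `catch_up` later brings the third back in line. With two replicas down the write is rejected. The sketch does not undo the update on the single replica that did accept it, which a real system would have to handle.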

  23. Incremental expansion
     ● Ingest nodes: trivial, just add a node
     ● Storage nodes:
       ○ maintain more logical partitions than nodes
       ○ to add a node, migrate some logical partitions to it
       ○ splitting logical partitions is possible too
       ○ the new node starts working from the next snapshot, just as in failure recovery
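
One way to picture the logical-partition scheme above is the sketch below. The partition count of 16, the CRC32 hash and the rebalancing policy are assumptions chosen for the example; the slides do not say how vertices are hashed or which partitions get moved.

```python
import zlib

NUM_PARTITIONS = 16  # assumed: fixed, and larger than the number of storage nodes


def partition_of(vertex):
    """Map a vertex id to a logical partition; this never changes on expansion."""
    return zlib.crc32(vertex.encode("utf-8")) % NUM_PARTITIONS


def add_node(assignment, new_node):
    """Move whole partitions to new_node until it holds its fair share.

    assignment: dict partition -> storage node (mutated in place).
    Returns the list of partitions that were migrated.
    """
    nodes = set(assignment.values()) | {new_node}
    target = len(assignment) // len(nodes)
    load = {n: [] for n in nodes}
    for part, node in sorted(assignment.items()):
        load[node].append(part)
    moved = []
    while len(load[new_node]) < target:
        # Take one partition from the currently most loaded existing node.
        donor = max((n for n in nodes if n != new_node),
                    key=lambda n: (len(load[n]), n))
        part = load[donor].pop()
        load[new_node].append(part)
        assignment[part] = new_node
        moved.append(part)
    return moved
```

Since only the partition-to-node table changes, vertices keep their partition and just the migrated partitions have to be copied. The new node can then join at the next snapshot, as the slide notes.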

  24. Failure recovery

  25. Thank you!
