

  1. CS6453 Big data Systems: Trends and Challenges Rachit Agarwal

  2. Instructor — Rachit Agarwal • Assistant Professor, Cornell • Previously: Postdoc, UC Berkeley • PhD, UIUC • Research interests: Systems, networking, theory • Conferences of interest: OSDI, NSDI, SIGCOMM, SODA • Non-research interests: I am an assistant professor ;-)

  3. Instructor — Rachit Agarwal
  • Interactive Queries [NSDI’16] [NSDI’15] [EuroSys’17] …
  • Resource Disaggregation: Systems for post-Moore’s-law hardware [OSDI’16] …
  • Graph Distance Oracles: Improvement over several decade-old results [SODA’13] [ESA’14] [PODC’13] …
  • Network Debugging: Debugging the data plane [OSDI’16] [SIGCOMM’11] [SOSR’15] …
  • Network Protocols: New routing and scheduling mechanisms [NSDI’16] [SIGMETRICS’11] [INFOCOM’11] …
  • Coding theory: Gap between linear and non-linear codes [ISIT’11] [ISIT’07] [INFOCOM’10]

  4. Big data systems: What is big? • Billion $ datacenters • Number of servers • Google, Microsoft: ~1 million • Facebook, Yahoo!, IBM, HP: several 100,000s each • Amazon, eBay, Intel, Akamai: >50,000 each • If each server stores 1TB of data • 10s of Petabytes — Exabytes of data
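
A quick back-of-envelope check of those numbers (a sketch in Python; the per-company server counts are the slide's rough figures, not exact data):

```python
# Back-of-envelope: total data if every server stores 1 TB.
# Server counts are the slide's rough estimates, not exact figures.
TB = 10**12  # bytes

fleets = {
    "Google / Microsoft": 1_000_000,
    "Facebook / Yahoo! / IBM / HP": 300_000,   # "several 100,000s each"
    "Amazon / eBay / Intel / Akamai": 50_000,  # ">50,000 each"
}

for name, servers in fleets.items():
    total_pb = servers * TB / 10**15
    print(f"{name}: {total_pb:,.0f} PB")

# Google / Microsoft: 1,000 PB (an exabyte)
# Facebook / Yahoo! / IBM / HP: 300 PB
# Amazon / eBay / Intel / Akamai: 50 PB
```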

  5. Big data — disrupting businesses

  6. Big data — what is fundamentally new? • Scale? • Applications? Complexity? • Workloads? Performance metrics? • Hardware? Or, are there fundamentally new “technical” problems?

  7. Scale and Complexity
  • Complexity: Batch analytics (collect, scan, index; e.g., Google) → Interactive queries (e.g., search all tweets that mention “Cornell”) → Streaming (e.g., Wall/Timeline)
  • Scale: TBs of semi-structured data a norm! Also, data growing faster than Moore’s law
  • Performance constraints unchanged: low latency, high throughput (e.g., #queries per second)

  8. Performance + Scale + Complexity → Fundamentally New Problems

  9. Example 1 — Search
  • Search queries over customer logs from a video company (Session: recordID, userID, Start_time, End_Time, …, Tags)
  • Single Amazon EC2 server, single core, 60GB RAM
  • [Chart: search throughput (queries per second) vs. “raw” input data size (1–128 GB) for Elasticsearch, MongoDB, Cassandra]
  • In-memory query execution key to query performance: secondary storage is 100x slower
  • 10% of queries executed off-memory: throughput reduces by 10x
  • 1% of queries executed off-memory: throughput reduces by 2x
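
The 10x and 2x drops are what a simple weighted-average cost model predicts, assuming (as the slide states) that off-memory execution is ~100x slower than in-memory; the sketch below is illustrative, not the benchmark code:

```python
# Relative throughput when a fraction of queries is served off-memory.
# Assumption (from the slide): off-memory execution is ~100x slower than in-memory.
def relative_throughput(off_memory_fraction, slowdown=100):
    """Throughput relative to serving every query from memory."""
    avg_cost = (1 - off_memory_fraction) * 1 + off_memory_fraction * slowdown
    return 1 / avg_cost

print(relative_throughput(0.10))  # ~0.09 -> roughly 10x lower throughput
print(relative_throughput(0.01))  # ~0.50 -> roughly 2x lower throughput
```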

  10. Example 1 — Search (traditional solutions fail)
  • Data scans: low storage, but low throughput (every query scans the data)
  • Indexes (e.g., an inverted index mapping each term to the list of matching record IDs): high storage, and still low throughput once the index no longer fits in memory
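
A toy inverted index (purely illustrative; not the design of any of the systems above) makes the storage cost concrete: every (term, record) occurrence is stored again as a posting, on top of the raw data:

```python
from collections import defaultdict

# Toy inverted index: term -> list of record IDs containing that term.
records = {
    0: "cornell big data systems",
    1: "big data at scale",
    2: "systems and networking at cornell",
}

index = defaultdict(list)
for rid, text in records.items():
    for term in sorted(set(text.split())):
        index[term].append(rid)

print(index["cornell"])  # [0, 2] -- answered without scanning the records
print(index["data"])     # [0, 1]

# Storage overhead: one posting per distinct (term, record) pair, kept in
# addition to the raw data; with positions and multiple indexed attributes
# the index can approach or exceed the size of the data itself.
print(sum(len(ids) for ids in index.values()))  # 13 postings for 3 tiny records
```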

  11. Example 2 — Ranking • Problem: How to rank search results on a social graph? • [LinkedIn, 2008] [Facebook, 2009]: • Want rankings based on “expected interest” … • Expected interest: distance on social graph • Challenge: • #hops not the right distance measure (small-world) • Assign edge weights (#messages exchanged) • Rank search results according to: • “shortest path distance on a weighted graph” • perhaps one of the oldest problems
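
For reference, the computation being asked for is ordinary shortest paths on a weighted graph; below is a minimal Dijkstra sketch (illustrative only, not LinkedIn's or Facebook's implementation), and the next slide shows why running it per query does not scale to billion-node graphs:

```python
import heapq

def shortest_distance(graph, src, dst):
    """Dijkstra on a weighted graph given as {node: [(neighbor, weight), ...]}."""
    dist = {src: 0.0}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            return d
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return float("inf")

# Edge weights could encode interaction strength, e.g., 1 / (#messages exchanged).
g = {"a": [("b", 0.5), ("c", 2.0)], "b": [("c", 0.5)], "c": []}
print(shortest_distance(g, "a", "c"))  # 1.0 (via b)
```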

  12. Example 2 — Ranking (traditional solutions fail)
  • Problem: How to compute shortest paths on a social graph?
  • Run a shortest path algorithm: 10s of seconds on a billion-node graph
  • Pre-compute and store shortest distances: 277 exabytes of storage
  • Approximate distances? [Plot: space vs. worst-case stretch — Θ(n²) space at stretch 1, Θ(n√n) at stretch 3, Θ̃(k·n^(1+2/(k+1))) at stretch k in general; the region below the curve is unattainable]
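
A back-of-envelope estimate shows why precomputation blows up (a sketch; the node count and bytes-per-entry are assumptions, and storing paths or per-pair metadata rather than bare distances pushes the total far higher, in line with the slide's exabyte-scale figure):

```python
# Back-of-envelope: storing one shortest-path distance per node pair.
# Assumptions (not from the slide): 10^9 nodes, 4 bytes per stored distance.
n = 10**9
pairs = n * (n - 1) // 2      # ~5 * 10^17 unordered node pairs
bytes_per_distance = 4        # assumed; real deployments store paths/metadata too
total_bytes = pairs * bytes_per_distance

print(total_bytes / 10**18)   # ~2.0 exabytes for bare distances alone
```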

  13. Example 3 — Fault Tolerance • Problem: How to recover failed data? • Traditional technique: 3x replication • Problem? • Erasure codes: • E.g., Reed-Solomon codes • Reduce storage to 1.2x • But require 10x larger bandwidth! • simply moved the bottleneck from storage to network
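
The 1.2x / 10x numbers match a (k = 10, r = 2) Reed-Solomon configuration; the parameters below are an assumption for illustration, since the slide does not name the exact code:

```python
# Storage and repair-bandwidth comparison: 3x replication vs. an assumed
# (k=10, r=2) Reed-Solomon code, which reproduces the slide's 1.2x / 10x numbers.
k, r = 10, 2          # data blocks and parity blocks per stripe (assumed)
block = 1             # size of one block, arbitrary units

# 3x replication: 3 copies of every block; repair copies one surviving replica.
replication_storage_overhead = 3.0
replication_repair_traffic = 1 * block

# Reed-Solomon: k+r blocks stored per k data blocks; classic repair of one lost
# block reads k surviving blocks to re-run the decode.
rs_storage_overhead = (k + r) / k
rs_repair_traffic = k * block

print(rs_storage_overhead)                              # 1.2
print(rs_repair_traffic / replication_repair_traffic)   # 10.0 -> 10x repair bandwidth
```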

  14. Big data — From problems to solutions Insight: Exploit the structure in the problem

  15. Example 1 — Search • Structure: Do not need to support arbitrary computations • A distributed “compressed” data store • Queries executed directly on compressed data! ➡ Complex queries: Search, range, random access, RegEx ✓ Scale: in-memory even when data sizes >= memory capacity ✓ Interactivity: Avoid data scans and decompression

  16. Impact • Adoption in industry : • Elsevier • Databricks • 19 other companies • Academic impact : • New techniques • Text, Graphs, Images • Very active area of research

  17. Example 2 — Ranking • Structure: Do not need to support arbitrary graphs • Graphs with m = Õ(n) edges • [Space vs. worst-case stretch plot from slide 12, revisited for sparse graphs]

  18. Impact • Adoption in industry : • LinkedIn • Apple Maps • Academic impact : • New routing protocols • New “compact” graph data structures • Still a very active area of research

  19. Example 3 — Fault tolerance • Structure: ? • A lot of work, but still an unresolved problem

  20. Big data — From problems to solutions Approach? Co-design systems and techniques

  21. [Diagram: Big Data Problems ↔ Scalable Algorithms and Techniques ↔ Scalable Systems ↔ System Resources] • Systems that fail to leverage the structure in the problem • Techniques that ignore advances in systems and hardware

  22. Big data systems — Trends & Challenges • Scale — need new algorithms & techniques • Applications — need new abstractions & systems • Workloads — need insights that enable new solutions • Hardware — need to co-design systems with hardware Welcome to 6453!

  23. 6453 — Plan • Learn about state-of-the-art research • 2-4 papers every week • Work on an exciting project • Hopefully, start the next generation of impactful directions

  24. 6453 — Reading papers Submit reviews before the lecture starts • Summary of problem being solved • Why is the problem interesting? • What are the main insights and technical contributions? • How does the paper advance the state-of-the-art? • Where may the solution not work well? • What are the next few problems you would solve? • What do you think is the holy grail in this direction?

  25. 6453 — Presenting papers Slides for Tuesday and Thursday lectures due by Saturday night and Monday night, respectively • 5-6 papers (depending on enrollment) • Similar to reading papers and writing reviews • but also provide broader overview of the sub-area • Please see course webpage • Please sign up for papers by Tuesday (next lecture)

  26. 6453 — Research Project • In groups of *maximum* 2 people • Interdisciplinary teams encouraged • New problem (may be in your sub-area) • Several deadlines: • Weekly project meetings • Survey — 02/14 • Mid-term report — 03/15 • Final report — 05/10 • Final presentation — 05/16 (does that work?)

  27. 6453 — Grade • Last thing you should worry about :-) • Paper reviews: 20% • Class participation: 10% • Research project: 70%
