

SLIDE 1
Big data Systems: Trends and Challenges

CS6453
Rachit Agarwal

SLIDE 2
Instructor — Rachit Agarwal

  • Assistant Professor, Cornell
  • Previously: Postdoc, UC Berkeley
  • PhD, UIUC
  • Research interests: Systems, networking, theory
  • Conferences of interest: OSDI, NSDI, SIGCOMM, SODA
  • Non-research interests: I am an assistant professor ;-)

SLIDE 3
Instructor — Rachit Agarwal

Interactive Queries [NSDI’16] [NSDI’15] [EuroSys’17] …

Resource Disaggregation: systems for post-Moore’s-law hardware [OSDI’16] …

Graph Distance Oracles: improvements over decades-old results [SODA’13] [ESA’14] [PODC’13] …

Network Debugging: debugging the data plane [OSDI’16] [SIGCOMM’11] [SOSR’15] …

Network Protocols: new routing and scheduling mechanisms [NSDI’16] [SIGMETRICS’11] [INFOCOM’11] …

Coding Theory: gap between linear and non-linear codes [ISIT’11] [ISIT’07] [INFOCOM’10]

SLIDE 4
Big data systems

What is big?
  • Billion-$ datacenters
  • Number of servers:
  • Google, Microsoft: ~1 million
  • Facebook, Yahoo!, IBM, HP: several 100,000s each
  • Amazon, eBay, Intel, Akamai: >50,000 each
  • If each server stores 1 TB of data: 10s of petabytes to exabytes of data (e.g., 100,000 servers × 1 TB = 100 PB; 1 million servers × 1 TB = 1 EB)

SLIDE 5

Big data — disrupting businesses

SLIDE 6
Big data — what is fundamentally new?

Or: are there fundamentally new “technical” problems?

  • Scale?
  • Applications? Complexity?
  • Workloads? Performance metrics?
  • Hardware?

SLIDE 7

[Figure: workload classes plotted along complexity and scale axes: Batch Analytics (collect, scan, index; e.g., Google); Streaming (e.g., Wall/Timeline); Interactive Queries (e.g., search all @tweets that mention “Cornell”)]

Scale: TBs of semi-structured data are the norm; also, data is growing faster than Moore’s law.

Performance constraints unchanged: low latency and high throughput (e.g., #queries per second).

SLIDE 8

[Figure: Scale + Complexity + Performance lead to Fundamentally New Problems]

SLIDE 9
Example 1 — Search

Search queries
  • Customer logs from a video company
  • Record schema: recordID, userID, Session Start_time, Session End_time, …, Tags
  • Single Amazon EC2 server, single core, 60 GB RAM

[Figure: search throughput (queries per second; 1 to 1000) vs. “raw” input data size (1 to 128 GB) for Elasticsearch, MongoDB, and Cassandra]

In-memory query execution is key to query performance
  • secondary storage is ~100x slower
  • with 1% of queries executed off-memory, throughput drops by ~2x
  • with 10% of queries executed off-memory, throughput drops by ~10x
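These drops follow from simple arithmetic: if a fraction p of queries hit storage that is about 100x slower (the slide’s own figure), mean per-query cost grows to (1 - p) + 100p, and throughput shrinks by the same factor. A minimal sketch of this back-of-the-envelope model (the model itself is our assumption, not something stated on the slide):

```python
def throughput_slowdown(off_mem_fraction, storage_penalty=100):
    """Factor by which throughput drops when some queries go off-memory.

    Assumes an off-memory query costs `storage_penalty` times an in-memory
    one, and that throughput is the inverse of the mean per-query cost.
    """
    return (1 - off_mem_fraction) + off_mem_fraction * storage_penalty

print(throughput_slowdown(0.01))  # 1.99 -> ~2x, as on the slide
print(throughput_slowdown(0.10))  # 10.9 -> ~10x, as on the slide
```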

SLIDE 10
Example 1 — Search (traditional solutions fail)

[Figure: Search(·) served either by data scans or by indexes, illustrated with inverted-index postings lists such as {0, 10, 14, 16, 19, 26, 29}, {1, 4, 5, 8, 20, 22, 24}, {2, 15, 17, 27}, {3, 6, 7, 9, 12, 13, 18, 23}, …, {11, 21}]

  • Data scans: low storage, but low throughput
  • Indexes: high storage, so data spills off-memory and throughput again suffers
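For concreteness, here is a hypothetical minimal inverted index of the kind the postings lists above come from; it avoids scans but pays for that in storage. The function and data below are illustrative, not from the slides:

```python
from collections import defaultdict

def build_inverted_index(records):
    """Map each term to the sorted list of record IDs that contain it."""
    index = defaultdict(list)
    for record_id, text in enumerate(records):
        for term in sorted(set(text.split())):
            index[term].append(record_id)
    return index

records = ["cornell big data", "big data systems", "cornell systems"]
index = build_inverted_index(records)
print(index["cornell"])  # [0, 2] -- a search with no data scan
# Tradeoff: a scan needs no extra storage but touches every record per
# query; the index is fast but adds an entry per term occurrence, and at
# scale that extra storage pushes data off-memory (see Slide 9).
```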

SLIDE 11
Example 2 — Ranking

  • Problem: How to rank search results on a social graph?
  • [LinkedIn, 2008] [Facebook, 2009]: want rankings based on “expected interest”
  • Expected interest: distance on the social graph
  • Challenge: #hops is not the right distance measure (small-world), so assign edge weights (e.g., #messages exchanged)
  • Rank search results according to: “shortest path distance on a weighted graph”
  • Perhaps one of the oldest problems
SLIDE 12
Example 2 — Ranking (traditional solutions fail)

  • Problem: How to compute shortest paths on a social graph?
  • Run a shortest path algorithm? 10s of seconds on a billion-node graph
  • Pre-compute and store shortest distances? 277 exabytes of storage
  • Approximate distances?

[Figure: space vs. worst-case stretch for approximate-distance oracles. The known tradeoff curve is Θ̃(k · n^(1 + 2/(k+1))) space at stretch k: Θ(n²) at stretch 1, Θ(n√n) at stretch 3; everything below the curve is unattainable]
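Why a single shortest-path query takes tens of seconds at this scale: the algorithm must settle essentially every reachable node and edge per query. A minimal Dijkstra sketch (Dijkstra is a standard choice here; the slide does not name a specific algorithm):

```python
import heapq

def dijkstra(adj, source):
    """Single-source shortest paths; adj maps node -> [(neighbor, weight)].

    O((n + m) log n) with a binary heap: each query touches every reachable
    node and edge, which is why a billion-node graph takes tens of seconds.
    """
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in adj.get(u, []):
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                heapq.heappush(heap, (dist[v], v))
    return dist

adj = {"a": [("b", 2), ("c", 5)], "b": [("c", 1)], "c": []}
print(dijkstra(adj, "a"))  # {'a': 0, 'b': 2, 'c': 3}
```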

SLIDE 13
Example 3 — Fault Tolerance

  • Problem: How to recover failed data?
  • Traditional technique: 3x replication
  • Problem? 3x the storage
  • Erasure codes (e.g., Reed-Solomon codes) reduce storage to 1.2x
  • But they require ~10x more bandwidth during recovery!
  • We have simply moved the bottleneck from storage to the network
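Both numbers follow from the code parameters. For a (k, r) erasure code, storage overhead is (k + r)/k, and rebuilding one lost block reads k surviving blocks, versus the single copy replication reads. The parameters k = 10, r = 2 below are hypothetical, chosen only to match the 1.2x figure on the slide:

```python
def erasure_overheads(k, r):
    """Overheads of a (k, r) MDS erasure code (k data + r parity blocks).

    Returns (storage overhead, blocks read to rebuild one lost block).
    """
    return (k + r) / k, k

storage, recovery_reads = erasure_overheads(10, 2)
print(storage)         # 1.2 -> 1.2x storage, vs. 3x for replication
print(recovery_reads)  # 10  -> 10x the single block a replica copy reads
```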
SLIDE 14

Big data — From problems to solutions

Insight: Exploit the structure in the problem

SLIDE 15
Example 1 — Search

Structure: Do not need to support arbitrary computations

A distributed “compressed” data store: queries executed directly on compressed data!
  ✓ Complex queries: search, range, random access, RegEx
  ✓ Scale: in-memory data sizes >= memory capacity
  ✓ Interactivity: avoids data scans and decompression
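The slides do not spell out the mechanism, but the flavor can be sketched with a suffix array: the sorted-suffix structure serves as both the stored representation and the search index, so substring queries need neither a full scan nor a separate inverted index. A toy, uncompressed sketch (real systems operate on a compressed representation):

```python
import bisect

def build_suffix_array(text):
    """Sorted starting positions of all suffixes of `text`."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def search(text, sa, query):
    """Positions where `query` occurs, via binary search over suffixes."""
    suffixes = [text[i:] for i in sa]          # materialized for clarity;
    lo = bisect.bisect_left(suffixes, query)   # real systems search in place
    hi = bisect.bisect_right(suffixes, query + "\uffff")
    return sorted(sa[lo:hi])

text = "cornell big data at cornell"
sa = build_suffix_array(text)
print(search(text, sa, "cornell"))  # [0, 20] -- offsets of both matches
```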

SLIDE 16
Impact

  • Adoption in industry:
    • Elsevier
    • Databricks
    • 19 other companies
  • Academic impact:
    • New techniques for text, graphs, images
    • Very active area of research

SLIDE 17
Example 2 — Ranking

Structure: Do not need to support arbitrary graphs; real social graphs are sparse, with m = Õ(n) edges

[Figure: the same space vs. worst-case stretch plot as Slide 12, with the Θ̃(k · n^(1 + 2/(k+1))) curve and its unattainable region, now annotated for graphs with m = Õ(n) edges]

SLIDE 18
Impact

  • Adoption in industry:
    • LinkedIn
    • Apple Maps
  • Academic impact:
    • New routing protocols
    • New “compact” graph data structures
    • Still a very active area of research

SLIDE 19
Example 3 — Fault tolerance

Structure: ?

Lots of work, but still an unresolved problem

SLIDE 20

Big data — From problems to solutions

Approach? Co-design systems and techniques

SLIDE 21

[Figure: co-design diagram. Big Data Problems and System Resources feed into both Scalable Algorithms and Techniques and Scalable Systems; techniques that ignore advances in systems and hardware fall short, as do systems that fail to leverage the structure in the problem]

SLIDE 22
Big data systems — Trends & Challenges

  • Scale — need new algorithms & techniques
  • Applications — need new abstractions & systems
  • Workloads — need insights that enable new solutions
  • Hardware — need to co-design systems with hardware

Welcome to 6453!

SLIDE 23
6453 — Plan

  • Learn about state-of-the-art research
  • 2-4 papers every week
  • Work on an exciting project
  • Hopefully, start the next generation of impactful directions

SLIDE 24
6453 — Reading papers

Submit reviews before the lecture starts:
  • Summary of the problem being solved
  • Why is the problem interesting?
  • What are the main insights and technical contributions?
  • How does the paper advance the state of the art?
  • Where may the solution not work well?
  • What are the next few problems you would solve?
  • What do you think is the holy grail in this direction?

SLIDE 25
6453 — Presenting papers

Slides for Tuesday and Thursday lectures are due by Saturday night and Monday night, respectively.

  • 5-6 papers (depending on enrollment)
  • Similar to reading papers and writing reviews, but also provide a broader overview of the sub-area
  • Please see the course webpage
  • Please sign up for papers by Tuesday (next lecture)

SLIDE 26
6453 — Research Project

  • In groups of *maximum* 2 people; interdisciplinary teams encouraged
  • New problem (may be in your sub-area)
  • Several deadlines:
    • Weekly project meetings
    • Survey — 02/14
    • Mid-term report — 03/15
    • Final report — 05/10
    • Final presentation — 05/16 (does that work?)

SLIDE 27
6453 — Grade

  • Last thing you should worry about :-)
  • Paper reviews: 20%
  • Class participation: 10%
  • Research project: 70%