

SLIDE 1
Big data Systems: Trends and Challenges

CS6453
Rachit Agarwal

SLIDE 2
Instructor — Rachit Agarwal

  • Assistant Professor, Cornell
  • Previously: Postdoc, UC Berkeley
  • PhD, UIUC
  • Research interests: Systems, networking, theory
  • Conferences of interest: OSDI, NSDI, SIGCOMM, SODA
  • Non-research interests: I am an assistant professor ;-)

SLIDE 3
Instructor — Rachit Agarwal

Interactive Queries [NSDI’16] [NSDI’15] [EuroSys’17] …

Resource Disaggregation: systems for post-Moore’s-law hardware [OSDI’16] …

Graph Distance Oracles: improvements over decades-old results [SODA’13] [ESA’14] [PODC’13] …

Network Debugging: debugging the data plane [OSDI’16] [SIGCOMM’11] [SOSR’15] …

Network Protocols: new routing and scheduling mechanisms [NSDI’16] [SIGMETRICS’11] [INFOCOM’11] …

Coding Theory: gap between linear and non-linear codes [ISIT’11] [ISIT’07] [INFOCOM’10]

SLIDE 4
Big data systems

What is big?
  • Billion-$ datacenters
  • Number of servers:
  • Google, Microsoft: ~1 million
  • Facebook, Yahoo!, IBM, HP: several 100,000s each
  • Amazon, eBay, Intel, Akamai: >50,000 each
  • If each server stores 1 TB of data: 10s of petabytes to exabytes of data (e.g., 100,000 servers × 1 TB = 100 PB; 1 million servers × 1 TB = 1 EB)

SLIDE 5

Big data — disrupting businesses

SLIDE 6
Big data — what is fundamentally new?

Or: are there fundamentally new “technical” problems?

  • Scale?
  • Applications? Complexity?
  • Workloads? Performance metrics?
  • Hardware?

SLIDE 7

[Figure: workload classes plotted along complexity and scale axes: Batch Analytics (collect, scan, index; e.g., Google); Streaming (e.g., Wall/Timeline); Interactive Queries (e.g., search all @tweets that mention “Cornell”)]

Scale: TBs of semi-structured data are the norm; also, data is growing faster than Moore’s law.

Performance constraints unchanged: low latency and high throughput (e.g., #queries per second).

SLIDE 8

[Figure: Scale + Complexity + Performance lead to Fundamentally New Problems]

SLIDE 9
Example 1 — Search

Search queries
  • Customer logs from a video company
  • Record schema: recordID, userID, Session Start_time, Session End_time, …, Tags
  • Single Amazon EC2 server, single core, 60 GB RAM

[Figure: search throughput (queries per second; 1 to 1000) vs. “raw” input data size (1 to 128 GB) for Elasticsearch, MongoDB, and Cassandra]

In-memory query execution is key to query performance
  • secondary storage is ~100x slower
  • with 1% of queries executed off-memory, throughput drops by ~2x
  • with 10% of queries executed off-memory, throughput drops by ~10x
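These drops follow from simple arithmetic: if a fraction p of queries hit storage that is about 100x slower (the slide’s own figure), mean per-query cost grows to (1 - p) + 100p, and throughput shrinks by the same factor. A minimal sketch of this back-of-the-envelope model (the model itself is our assumption, not something stated on the slide):

```python
def throughput_slowdown(off_mem_fraction, storage_penalty=100):
    """Factor by which throughput drops when some queries go off-memory.

    Assumes an off-memory query costs `storage_penalty` times an in-memory
    one, and that throughput is the inverse of the mean per-query cost.
    """
    return (1 - off_mem_fraction) + off_mem_fraction * storage_penalty

print(throughput_slowdown(0.01))  # 1.99 -> ~2x, as on the slide
print(throughput_slowdown(0.10))  # 10.9 -> ~10x, as on the slide
```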

SLIDE 10
Example 1 — Search (traditional solutions fail)

[Figure: Search(·) served either by data scans or by indexes, illustrated with inverted-index postings lists such as {0, 10, 14, 16, 19, 26, 29}, {1, 4, 5, 8, 20, 22, 24}, {2, 15, 17, 27}, {3, 6, 7, 9, 12, 13, 18, 23}, …, {11, 21}]

  • Data scans: low storage, but low throughput
  • Indexes: high storage, so data spills off-memory and throughput again suffers
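For concreteness, here is a hypothetical minimal inverted index of the kind the postings lists above come from; it avoids scans but pays for that in storage. The function and data below are illustrative, not from the slides:

```python
from collections import defaultdict

def build_inverted_index(records):
    """Map each term to the sorted list of record IDs that contain it."""
    index = defaultdict(list)
    for record_id, text in enumerate(records):
        for term in sorted(set(text.split())):
            index[term].append(record_id)
    return index

records = ["cornell big data", "big data systems", "cornell systems"]
index = build_inverted_index(records)
print(index["cornell"])  # [0, 2] -- a search with no data scan
# Tradeoff: a scan needs no extra storage but touches every record per
# query; the index is fast but adds an entry per term occurrence, and at
# scale that extra storage pushes data off-memory (see Slide 9).
```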

SLIDE 11
Example 2 — Ranking

  • Problem: How to rank search results on a social graph?
  • [LinkedIn, 2008] [Facebook, 2009]: want rankings based on “expected interest”
  • Expected interest: distance on the social graph
  • Challenge: #hops is not the right distance measure (small-world), so assign edge weights (e.g., #messages exchanged)
  • Rank search results according to: “shortest path distance on a weighted graph”
  • Perhaps one of the oldest problems
SLIDE 12
Example 2 — Ranking (traditional solutions fail)

  • Problem: How to compute shortest paths on a social graph?
  • Run a shortest path algorithm? 10s of seconds on a billion-node graph
  • Pre-compute and store shortest distances? 277 exabytes of storage
  • Approximate distances?

[Figure: space vs. worst-case stretch for approximate-distance oracles. The known tradeoff curve is Θ̃(k · n^(1 + 2/(k+1))) space at stretch k: Θ(n²) at stretch 1, Θ(n√n) at stretch 3; everything below the curve is unattainable]
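Why a single shortest-path query takes tens of seconds at this scale: the algorithm must settle essentially every reachable node and edge per query. A minimal Dijkstra sketch (Dijkstra is a standard choice here; the slide does not name a specific algorithm):

```python
import heapq

def dijkstra(adj, source):
    """Single-source shortest paths; adj maps node -> [(neighbor, weight)].

    O((n + m) log n) with a binary heap: each query touches every reachable
    node and edge, which is why a billion-node graph takes tens of seconds.
    """
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in adj.get(u, []):
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                heapq.heappush(heap, (dist[v], v))
    return dist

adj = {"a": [("b", 2), ("c", 5)], "b": [("c", 1)], "c": []}
print(dijkstra(adj, "a"))  # {'a': 0, 'b': 2, 'c': 3}
```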

SLIDE 13
Example 3 — Fault Tolerance

  • Problem: How to recover failed data?
  • Traditional technique: 3x replication
  • Problem? 3x the storage
  • Erasure codes (e.g., Reed-Solomon codes) reduce storage to 1.2x
  • But they require ~10x more bandwidth during recovery!
  • We have simply moved the bottleneck from storage to the network
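Both numbers follow from the code parameters. For a (k, r) erasure code, storage overhead is (k + r)/k, and rebuilding one lost block reads k surviving blocks, versus the single copy replication reads. The parameters k = 10, r = 2 below are hypothetical, chosen only to match the 1.2x figure on the slide:

```python
def erasure_overheads(k, r):
    """Overheads of a (k, r) MDS erasure code (k data + r parity blocks).

    Returns (storage overhead, blocks read to rebuild one lost block).
    """
    return (k + r) / k, k

storage, recovery_reads = erasure_overheads(10, 2)
print(storage)         # 1.2 -> 1.2x storage, vs. 3x for replication
print(recovery_reads)  # 10  -> 10x the single block a replica copy reads
```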
SLIDE 14

Big data — From problems to solutions

Insight: Exploit the structure in the problem

SLIDE 15
Example 1 — Search

Structure: Do not need to support arbitrary computations

A distributed “compressed” data store: queries executed directly on compressed data!
  ✓ Complex queries: search, range, random access, RegEx
  ✓ Scale: in-memory data sizes >= memory capacity
  ✓ Interactivity: avoids data scans and decompression
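The slides do not spell out the mechanism, but the flavor can be sketched with a suffix array: the sorted-suffix structure serves as both the stored representation and the search index, so substring queries need neither a full scan nor a separate inverted index. A toy, uncompressed sketch (real systems operate on a compressed representation):

```python
import bisect

def build_suffix_array(text):
    """Sorted starting positions of all suffixes of `text`."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def search(text, sa, query):
    """Positions where `query` occurs, via binary search over suffixes."""
    suffixes = [text[i:] for i in sa]          # materialized for clarity;
    lo = bisect.bisect_left(suffixes, query)   # real systems search in place
    hi = bisect.bisect_right(suffixes, query + "\uffff")
    return sorted(sa[lo:hi])

text = "cornell big data at cornell"
sa = build_suffix_array(text)
print(search(text, sa, "cornell"))  # [0, 20] -- offsets of both matches
```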

SLIDE 16
Impact

  • Adoption in industry:
    • Elsevier
    • Databricks
    • 19 other companies
  • Academic impact:
    • New techniques for text, graphs, images
    • Very active area of research

SLIDE 17
Example 2 — Ranking

Structure: Do not need to support arbitrary graphs; real social graphs are sparse, with m = Õ(n) edges

[Figure: the same space vs. worst-case stretch plot as Slide 12, with the Θ̃(k · n^(1 + 2/(k+1))) curve and its unattainable region, now annotated for graphs with m = Õ(n) edges]

SLIDE 18
Impact

  • Adoption in industry:
    • LinkedIn
    • Apple Maps
  • Academic impact:
    • New routing protocols
    • New “compact” graph data structures
    • Still a very active area of research

SLIDE 19
Example 3 — Fault tolerance

Structure: ?

Lots of work, but still an unresolved problem

SLIDE 20

Big data — From problems to solutions

Approach? Co-design systems and techniques

SLIDE 21

[Figure: co-design diagram. Big Data Problems and System Resources feed into both Scalable Algorithms and Techniques and Scalable Systems; techniques that ignore advances in systems and hardware fall short, as do systems that fail to leverage the structure in the problem]

SLIDE 22
Big data systems — Trends & Challenges

  • Scale — need new algorithms & techniques
  • Applications — need new abstractions & systems
  • Workloads — need insights that enable new solutions
  • Hardware — need to co-design systems with hardware

Welcome to 6453!

SLIDE 23
6453 — Plan

  • Learn about state-of-the-art research
  • 2-4 papers every week
  • Work on an exciting project
  • Hopefully, start the next generation of impactful directions

SLIDE 24
6453 — Reading papers

Submit reviews before the lecture starts:
  • Summary of the problem being solved
  • Why is the problem interesting?
  • What are the main insights and technical contributions?
  • How does the paper advance the state of the art?
  • Where may the solution not work well?
  • What are the next few problems you would solve?
  • What do you think is the holy grail in this direction?

SLIDE 25
6453 — Presenting papers

Slides for Tuesday and Thursday lectures are due by Saturday night and Monday night, respectively.

  • 5-6 papers (depending on enrollment)
  • Similar to reading papers and writing reviews, but also provide a broader overview of the sub-area
  • Please see the course webpage
  • Please sign up for papers by Tuesday (next lecture)

SLIDE 26
6453 — Research Project

  • In groups of *maximum* 2 people; interdisciplinary teams encouraged
  • New problem (may be in your sub-area)
  • Several deadlines:
    • Weekly project meetings
    • Survey — 02/14
    • Mid-term report — 03/15
    • Final report — 05/10
    • Final presentation — 05/16 (does that work?)

SLIDE 27
6453 — Grade

  • Last thing you should worry about :-)
  • Paper reviews: 20%
  • Class participation: 10%
  • Research project: 70%