
SLIDE 1

Systems for Data Science

Marco Serafini

COMPSCI 532 Lecture 1

SLIDE 2

Course Structure

  • Fundamentals you need to know about systems
  • Caching, Virtual memory, concurrency, etc…
  • Review of several “Big-data” systems
  • Learn how they work
  • Principles of systems design: Why systems are designed that way
  • Hands-on experience
  • No electronic devices during classes (not even in airplane mode)

SLIDE 3

Course Assignments

  • Reading research papers
  • 2-3 projects
  • Coding assignments
  • Midterm + final exam

http://marcoserafini.github.io/teaching/systems-for-data-science/fall19/

SLIDE 4

Course Grades

  • Midterm exam: 20%
  • Final exam: 30%
  • Projects: 50%
SLIDE 5

Questions

  • Teaching Assistant
  • Nathan Ng <kwanhong@umass.edu>
  • Office hours: Tuesday 4.30-5.30 PM @ CS 207
  • Piazza website
  • https://piazza.com/umass/fall2019/compsci532/home
  • Ask questions there rather than emailing me or Nathan
  • Credit if you are active
  • Well-thought-out questions and answers: be curious (but don’t just show off)
  • I will never penalize you for saying or asking something wrong
SLIDE 6

Projects

  • Groups of two people
  • See course website for details
  • High-level discussions with other colleagues: ok
  • “What are the requirements of the project?”
  • Low-level discussions with other colleagues: ok
  • “How do threads work in Java?”
  • Mid-level discussions: not ok
  • “How to design a solution for the project?”
  • Project delivery includes short oral exam
SLIDE 7

What are “systems for data science”?

SLIDE 8

Systems + Data Science

  • Data science research
  • New algorithms
  • New applications of existing algorithms
  • Validation: take small representative dataset, show accuracy
  • Systems research
  • Run these algorithms efficiently
  • Scale them to larger datasets
  • End-to-end pipelines
  • Applications of ML to system design and software engineering (seminar next Spring!)

  • Validation: build a prototype that others can use
  • These are ends of a spectrum
SLIDE 9

Overview

  • What type of systems will we target?
  • Storage systems
  • Data processing systems
  • Cloud analytics
  • Systems for machine learning
  • Goal: Hide complexity of underlying hardware
  • Parallelism: multi-core, distributed systems
  • Fault-tolerance: hardware fails
  • Focus on scalable systems
  • Scale to large datasets
  • Scale to computationally complex problems
SLIDE 10

Transactional vs. Analytical Systems

  • Transactional data management system
  • Real-time response
  • Concurrency
  • Updates
  • Analytical data management system
  • Non real-time responses
  • No concurrency
  • Read-only
  • These are ends of a spectrum
SLIDE 11

Example: Search Engine

  • Crawlers: download the Web
  • Hadoop file system (HDFS): store the Web
  • MapReduce: run massively parallel indexing (sketched below)
  • Key-value store: store index
  • Front-end
  • Serve client requests
  • Ranking → this is actually the data science
  • Q: Scalability issues?
  • Q: Which components are transactional / analytical?
  • Q: Where are storage/data processing/cloud/ML involved?
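The indexing step is the classic MapReduce workload. Below is a minimal single-machine sketch (in Python) of how building an inverted index can be expressed as a map and a reduce function; the names map_page, reduce_word, and the toy run_mapreduce driver are illustrative stand-ins, not the Hadoop API.

    from collections import defaultdict

    # map: emit a (word, doc_id) pair for every word on a crawled page
    def map_page(doc_id, text):
        for word in text.lower().split():
            yield (word, doc_id)

    # reduce: collect all doc_ids that contain a given word
    def reduce_word(word, doc_ids):
        return (word, sorted(set(doc_ids)))

    # toy in-memory driver standing in for the framework, which would
    # shuffle the intermediate pairs across many machines
    def run_mapreduce(docs):
        groups = defaultdict(list)
        for doc_id, text in docs.items():
            for word, d in map_page(doc_id, text):
                groups[word].append(d)
        return dict(reduce_word(w, ds) for w, ds in groups.items())

    index = run_mapreduce({1: "systems for data science", 2: "data systems"})
    # index["data"] == [1, 2]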
SLIDE 12

Design goals

SLIDE 13

Ease of Use

  • Good APIs / abstractions are key in a system
  • High-level API
  • Easier to use, better productivity, safer code
  • It makes some implementation choices for you
  • These choices are based on assumptions about the use cases
  • Are these choices really what you need?
  • Low-level API
  • Harder to use, lower productivity, less safe code
  • More flexible
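A minimal sketch of this trade-off, using a stand-in task (summing squares): the high-level version states what to compute and leaves the implementation choices to the library, while the low-level version manages partitioning, threads, and result collection by hand, which is more flexible but easier to get wrong.

    import threading

    data = list(range(1_000_000))

    # High-level API: one declarative line; the choices are made for you
    total_high = sum(x * x for x in data)

    # Low-level API: explicit partitioning, threads, and result collection
    # (note: in CPython the GIL limits true parallelism here; the point is
    # the contrast in the programming interface, not the speedup)
    def partial_sum(chunk, out, i):
        out[i] = sum(x * x for x in chunk)

    n = 4
    chunks = [data[i::n] for i in range(n)]
    results = [0] * n
    threads = [threading.Thread(target=partial_sum, args=(c, results, i))
               for i, c in enumerate(chunks)]
    for t in threads: t.start()
    for t in threads: t.join()
    total_low = sum(results)

    assert total_high == total_low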
SLIDE 14

Scalability

[Figure: speedup vs. degree of parallelism, ideal linear scaling vs. reality]

  • Ideal world
  • Linear scalability
  • Reality
  • Bottlenecks
  • For example: central coordinator
  • When do we stop scaling?
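The slide gives no formula, but a standard way to quantify when scaling stops paying off is Amdahl's law: if a fraction s of the work is serial (for example, everything that goes through a central coordinator), the speedup with n workers is at most 1 / (s + (1 - s) / n). A small sketch with an assumed 5% serial fraction:

    # Amdahl's law: speedup with n parallel workers when a fraction s is serial
    def speedup(n, s):
        return 1.0 / (s + (1.0 - s) / n)

    # a 5% serial bottleneck caps the speedup at 20x, no matter how many workers
    for n in (1, 10, 100, 1000):
        print(n, round(speedup(n, 0.05), 1))   # 1.0, 6.9, 16.8, 19.6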
SLIDE 15

Latency vs. Throughput

[Figure: throughput and latency as offered load grows from 1x to 100x requests; throughput flattens at its maximum while latency shoots up]

  • Pipe metaphor
  • System is a pipe
  • Requests are small marbles
  • Low load
  • Minimal latency
  • Increased load (e.g., 2x)
  • Higher throughput
  • Latency stable
  • High load
  • Saturation: no more throughput
  • Latency skyrockets
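To put rough numbers on the pipe metaphor, here is a sketch using a simple M/M/1 queueing model (an assumption, not something on the slide): with service rate mu, the average latency is 1 / (mu - load) while the offered load stays below mu, so latency sits near the service time at low load and blows up as the load approaches the pipe's capacity.

    # Toy queueing model of the pipe: it can serve at most mu requests/s
    mu = 100.0

    def avg_latency(load):
        # average time a request spends in the system (M/M/1), in seconds;
        # infinite once the offered load saturates the pipe
        return 1.0 / (mu - load) if load < mu else float("inf")

    for load in (1, 10, 50, 90, 99):                 # offered requests/s
        print(load, round(avg_latency(load) * 1000, 1), "ms")
    # 1 -> 10.1 ms, 10 -> 11.1 ms, 50 -> 20.0 ms, 90 -> 100.0 ms, 99 -> 1000.0 ms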
SLIDE 16

Fault Tolerance

  • Assume that your system crashes every month
  • If you run Python scripts on your laptop, that’s fine
  • But imagine you run a cluster
  • 10 nodes = a crash every 3 days
  • 100 nodes = a crash every seven hours
  • 1000 nodes = a crash every 50 minutes
  • Some computations run for more than one hour
  • Cannot simply restart when something goes wrong
  • Even when restarting, we need to keep metadata safe
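The numbers on this slide come from a simple back-of-the-envelope calculation: if each node crashes about once a month and failures are independent, a cluster of n nodes sees some crash roughly every (one month) / n. A quick sketch of that arithmetic (the slide rounds the figures):

    # expected time between crashes in a cluster where each node
    # fails about once per month, assuming independent failures
    HOURS_PER_MONTH = 30 * 24   # ~720 hours

    for nodes in (1, 10, 100, 1000):
        hours = HOURS_PER_MONTH / nodes
        print(f"{nodes:5d} nodes -> a crash roughly every {hours:6.1f} hours")
    # 10 nodes -> ~72 h (3 days), 100 -> ~7.2 h, 1000 -> ~0.7 h (under an hour)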
SLIDE 17

Why do we need parallelism?

SLIDE 18

Maximum Clock Rate is Stagnating

Source: https://queue.acm.org/detail.cfm?id=2181798

Two major “laws” are collapsing

  • Moore’s law
  • Dennard scaling
SLIDE 19

Moore’s Law

  • “Density of transistors in an integrated circuit doubles every two years”. Smaller transistors → changes propagate faster

So far so good, but the trend is slowing down and it won’t last for long (Intel’s prediction: until 2021 unless new technologies arise) [1]

[1] https://www.technologyreview.com/s/601441/moores-law-is-dead-now-what/

[Figure: transistor counts over time, plotted on an exponential (logarithmic) axis]

SLIDE 20

Dennard Scaling

  • “Reducing transistor size does not increase power density → power consumption proportional to chip area”

  • Stopped holding around 2006
  • Assumptions break when the physical system is close to its limits
  • Post-Dennard-scaling world of today
  • Huge cooling and power consumption issues
  • If we kept the same clock frequency trends, today a CPU would have the power density of a nuclear reactor

SLIDE 21

Heat Dissipation Problem

  • Large datacenters consume energy like large cities
  • Cooling is the main cost factor

[Photos: Google data center @ Columbia River valley (2006); Facebook data center @ Luleå (2015)]

SLIDE 22

Where is Luleå?

SLIDE 23

Single-Core Solutions

  • Dynamic Voltage and Frequency Scaling (DVFS)
  • E.g., Intel’s Turbo Boost
  • Only works under low load
  • Use part of the chip for coprocessors (e.g. graphics)
  • Lower power consumption
  • Limited number of generic functionalities to offload
SLIDE 24

Multi-Core Processors

[Diagram: several processor chips, each containing multiple cores, plugged into sockets on the motherboard and sharing the main memory]

SLIDE 25

Multi-Core Processors

  • Idea: scale computational power linearly
  • Instead of a single 5 GHz core, 2 * 2.5 GHz cores
  • Scale heat dissipation linearly
  • k cores have ~ k times the heat dissipation of a single core
  • Increasing frequency of a single core by k times creates a superlinear heat dissipation increase
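A commonly used first-order model (an assumption here, not stated on the slide) is that dynamic power grows as P ≈ C · V² · f, and since supply voltage has to rise roughly with frequency, power ends up growing roughly with f³. Under that model, two 2.5 GHz cores dissipate about a quarter of the heat of one 5 GHz core:

    # first-order dynamic power model: P ~ cores * f**3
    # (voltage assumed to scale roughly linearly with frequency)
    def relative_power(freq_ghz, cores=1):
        return cores * freq_ghz ** 3

    one_fast = relative_power(5.0)             # a single 5 GHz core
    two_slow = relative_power(2.5, cores=2)    # two 2.5 GHz cores
    print(one_fast / two_slow)                 # 4.0: the fast core runs ~4x hotter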

SLIDE 26

How to Leverage Multicores

  • Run multiple tasks in parallel
  • Multiprocessing
  • Multithreading
  • E.g. PCs have many parallel background apps
  • OS, music, antivirus, web browser, …
  • Parallelizing a single app is not trivial
  • Embarrassingly parallel tasks
  • Can be run by multiple threads
  • No coordination
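A minimal sketch of an embarrassingly parallel task in Python: every chunk is processed independently, so the workers need no coordination beyond handing back their results (multiprocessing rather than threads is used here to sidestep CPython's GIL; the squared-sum task is just an illustrative stand-in).

    from multiprocessing import Pool

    def process(chunk):
        # completely independent work: no shared state, no coordination
        return sum(x * x for x in chunk)

    if __name__ == "__main__":
        chunks = [range(i * 250_000, (i + 1) * 250_000) for i in range(4)]
        with Pool(processes=4) as pool:
            partials = pool.map(process, chunks)   # one chunk per worker
        print(sum(partials))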
SLIDE 27

Memory Bandwidth Bottleneck

  • Cores compete for the same main memory bus
  • Solution: caches help in two ways
  • They reduce latency (as we have discussed)
  • They also increase throughput by avoiding bus contention
SLIDE 28

SIMD Processors

  • Single Instruction Multiple Data (SIMD) processors
  • Example
  • Graphics Processing Units (GPUs)
  • Intel Xeon Phi coprocessors
  • Q: Possible SIMD snippets?

for i in [0, n-1] do
    v[i] = v[i] * pi

for i in [0, n-1] do
    if v[i] < 0.01 then v[i] = 0
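Both snippets apply the same operation to every element, which is exactly the single-instruction-multiple-data pattern. A sketch of the same two loops written with NumPy, whose array operations are typically executed with SIMD instructions (the random input vector is just an example):

    import numpy as np

    v = np.random.rand(1_000_000)

    v = v * np.pi        # element-wise multiply: one instruction, many data items
    v[v < 0.01] = 0      # the branch becomes a data-parallel mask instead of a jump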

SLIDE 29

Other Approaches

  • SIMD
  • Single Instruction Multiple Data
  • A massive number of simpler cores
  • FPGAs
  • Dedicated hardware designed for a specific task
SLIDE 30

Automatic Parallelization?

  • Holy grail in the multi-processor era
  • Approaches
  • Programming languages
  • Systems with APIs that help express parallelism
  • Efficient coordination mechanisms
SLIDE 31

Homework

The Anatomy of a Large-Scale Hypertextual Web Search Engine

Sergey Brin and Lawrence Page
Computer Science Department, Stanford University, Stanford, CA 94305, USA
sergey@cs.stanford.edu and page@cs.stanford.edu

Abstract: In this paper, we present Google, a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext. Google is designed to crawl and index the Web efficiently and produce much more satisfying search results than existing systems. The prototype with a full text and hyperlink database of at least 24 million pages is available at http://google.stanford.edu/ To engineer a search engine is a challenging task. Search engines index tens to hundreds of millions of web pages involving a comparable number of distinct terms. They answer tens of