
SLIDE 1

Systems for Data Science

Marco Serafini

COMPSCI 532 Lecture 1

SLIDE 2

Course Structure

  • Fundamentals you need to know about systems
  • Caching, Virtual memory, concurrency, etc…
  • Review of several “Big-data” systems
  • Learn how they work
  • Principles of systems design: Why systems are designed that way
  • Hands-on experience
  • No electronic devices during classes (not even in airplane mode)

SLIDE 3

Course Assignments

  • Reading research papers
  • 2-3 projects
  • Coding assignments
  • Midterm + final exam

http://marcoserafini.github.io/teaching/systems-for-data-science/fall19/

SLIDE 4

Course Grades

  • Midterm exam: 20%
  • Final exam: 30%
  • Projects: 50%
SLIDE 5

Questions

  • Teaching Assistant
  • Nathan Ng <kwanhong@umass.edu>
  • Office hours: Tuesday 4.30-5.30 PM @ CS 207
  • Piazza website
  • https://piazza.com/umass/fall2019/compsci532/home
  • Ask questions there rather than emailing me or Nathan
  • Credit if you are active
  • Well-thought-out questions and answers: be curious (but don’t just show off)
  • I will never penalize you for saying or asking something wrong
SLIDE 6

Projects

  • Groups of two people
  • See course website for details
  • High-level discussions with other colleagues: ok
  • “What are the requirements of the project?”
  • Low-level discussions with other colleagues: ok
  • “How do threads work in Java?”
  • Mid-level discussions: not ok
  • “How to design a solution for the project?”
  • Project delivery includes short oral exam
SLIDE 7

What are “systems for data science”?

SLIDE 8

Systems + Data Science

  • Data science research
  • New algorithms
  • New applications of existing algorithms
  • Validation: take small representative dataset, show accuracy
  • Systems research
  • Run these algorithms efficiently
  • Scale them to larger datasets
  • End-to-end pipelines
  • Applications of ML to system design and software engineering (seminar next Spring!)

  • Validation: build a prototype that others can use
  • These are ends of a spectrum
SLIDE 9

Overview

  • What type of systems will we target?
  • Storage systems
  • Data processing systems
  • Cloud analytics
  • Systems for machine learning
  • Goal: Hide complexity of underlying hardware
  • Parallelism: multi-core, distributed systems
  • Fault-tolerance: hardware fails
  • Focus on scalable systems
  • Scale to large datasets
  • Scale to computationally complex problems
SLIDE 10

Transactional vs. Analytical Systems

  • Transactional data management system
  • Real-time response
  • Concurrency
  • Updates
  • Analytical data management system
  • Non real-time responses
  • No concurrency
  • Read-only
  • These are ends of a spectrum
SLIDE 11

Example: Search Engine

  • Crawlers: download the Web
  • Hadoop file system (HDFS): store the Web
  • MapReduce: run massively parallel indexing (sketched below)
  • Key-value store: store index
  • Front-end
  • Serve client requests
  • Ranking → this is actually the data science
  • Q: Scalability issues?
  • Q: Which components are transactional / analytical?
  • Q: Where are storage/data processing/cloud/ML involved?
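The indexing step is the classic MapReduce workload. Below is a minimal single-machine sketch (in Python) of how building an inverted index can be expressed as a map and a reduce function; the names map_page, reduce_word, and the toy run_mapreduce driver are illustrative stand-ins, not the Hadoop API.

    from collections import defaultdict

    # map: emit a (word, doc_id) pair for every word on a crawled page
    def map_page(doc_id, text):
        for word in text.lower().split():
            yield (word, doc_id)

    # reduce: collect all doc_ids that contain a given word
    def reduce_word(word, doc_ids):
        return (word, sorted(set(doc_ids)))

    # toy in-memory driver standing in for the framework, which would
    # shuffle the intermediate pairs across many machines
    def run_mapreduce(docs):
        groups = defaultdict(list)
        for doc_id, text in docs.items():
            for word, d in map_page(doc_id, text):
                groups[word].append(d)
        return dict(reduce_word(w, ds) for w, ds in groups.items())

    index = run_mapreduce({1: "systems for data science", 2: "data systems"})
    # index["data"] == [1, 2]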
SLIDE 12

Design goals

SLIDE 13

Ease of Use

  • Good APIs / abstractions are key in a system
  • High-level API
  • Easier to use, better productivity, safer code
  • It makes some implementation choices for you
  • These choices are based on assumptions about the use cases
  • Are these choices really what you need?
  • Low-level API
  • Harder to use, lower productivity, less safe code
  • More flexible
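A minimal sketch of this trade-off, using a stand-in task (summing squares): the high-level version states what to compute and leaves the implementation choices to the library, while the low-level version manages partitioning, threads, and result collection by hand, which is more flexible but easier to get wrong.

    import threading

    data = list(range(1_000_000))

    # High-level API: one declarative line; the choices are made for you
    total_high = sum(x * x for x in data)

    # Low-level API: explicit partitioning, threads, and result collection
    # (note: in CPython the GIL limits true parallelism here; the point is
    # the contrast in the programming interface, not the speedup)
    def partial_sum(chunk, out, i):
        out[i] = sum(x * x for x in chunk)

    n = 4
    chunks = [data[i::n] for i in range(n)]
    results = [0] * n
    threads = [threading.Thread(target=partial_sum, args=(c, results, i))
               for i, c in enumerate(chunks)]
    for t in threads: t.start()
    for t in threads: t.join()
    total_low = sum(results)

    assert total_high == total_low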
SLIDE 14

Scalability

[Figure: speedup vs. degree of parallelism, ideal linear scaling vs. reality]

  • Ideal world
  • Linear scalability
  • Reality
  • Bottlenecks
  • For example: central coordinator
  • When do we stop scaling?
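The slide gives no formula, but a standard way to quantify when scaling stops paying off is Amdahl's law: if a fraction s of the work is serial (for example, everything that goes through a central coordinator), the speedup with n workers is at most 1 / (s + (1 - s) / n). A small sketch with an assumed 5% serial fraction:

    # Amdahl's law: speedup with n parallel workers when a fraction s is serial
    def speedup(n, s):
        return 1.0 / (s + (1.0 - s) / n)

    # a 5% serial bottleneck caps the speedup at 20x, no matter how many workers
    for n in (1, 10, 100, 1000):
        print(n, round(speedup(n, 0.05), 1))   # 1.0, 6.9, 16.8, 19.6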
SLIDE 15

Latency vs. Throughput

[Figure: throughput and latency as offered load grows from 1x to 100x requests; throughput flattens at its maximum while latency shoots up]

  • Pipe metaphor
  • System is a pipe
  • Requests are small marbles
  • Low load
  • Minimal latency
  • Increased load (e.g., 2x)
  • Higher throughput
  • Latency stable
  • High load
  • Saturation: no more throughput
  • Latency skyrockets
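To put rough numbers on the pipe metaphor, here is a sketch using a simple M/M/1 queueing model (an assumption, not something on the slide): with service rate mu, the average latency is 1 / (mu - load) while the offered load stays below mu, so latency sits near the service time at low load and blows up as the load approaches the pipe's capacity.

    # Toy queueing model of the pipe: it can serve at most mu requests/s
    mu = 100.0

    def avg_latency(load):
        # average time a request spends in the system (M/M/1), in seconds;
        # infinite once the offered load saturates the pipe
        return 1.0 / (mu - load) if load < mu else float("inf")

    for load in (1, 10, 50, 90, 99):                 # offered requests/s
        print(load, round(avg_latency(load) * 1000, 1), "ms")
    # 1 -> 10.1 ms, 10 -> 11.1 ms, 50 -> 20.0 ms, 90 -> 100.0 ms, 99 -> 1000.0 ms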
SLIDE 16

Fault Tolerance

  • Assume that your system crashes every month
  • If you run Python scripts on your laptop, that’s fine
  • But imagine you run a cluster
  • 10 nodes = a crash every 3 days
  • 100 nodes = a crash every seven hours
  • 1000 nodes = a crash every 50 minutes
  • Some computations run for more than one hour
  • Cannot simply restart when something goes wrong
  • Even when restarting, we need to keep metadata safe
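The numbers on this slide come from a simple back-of-the-envelope calculation: if each node crashes about once a month and failures are independent, a cluster of n nodes sees some crash roughly every (one month) / n. A quick sketch of that arithmetic (the slide rounds the figures):

    # expected time between crashes in a cluster where each node
    # fails about once per month, assuming independent failures
    HOURS_PER_MONTH = 30 * 24   # ~720 hours

    for nodes in (1, 10, 100, 1000):
        hours = HOURS_PER_MONTH / nodes
        print(f"{nodes:5d} nodes -> a crash roughly every {hours:6.1f} hours")
    # 10 nodes -> ~72 h (3 days), 100 -> ~7.2 h, 1000 -> ~0.7 h (under an hour)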
SLIDE 17

Why do we need parallelism?

SLIDE 18

Maximum Clock Rate is Stagnating

Source: https://queue.acm.org/detail.cfm?id=2181798

Two major “laws” are collapsing

  • Moore’s law
  • Dennard scaling
SLIDE 19

Moore’s Law

  • “Density of transistors in an integrated circuit doubles every two years”. Smaller transistors → changes propagate faster

So far so good, but the trend is slowing down and it won’t last for long (Intel’s prediction: until 2021 unless new technologies arise) [1]

[1] https://www.technologyreview.com/s/601441/moores-law-is-dead-now-what/

[Figure: transistor counts over time, plotted on an exponential (logarithmic) axis]

SLIDE 20

Dennard Scaling

  • “Reducing transistor size does not increase power density → power consumption proportional to chip area”

  • Stopped holding around 2006
  • Assumptions break when the physical system is close to its limits
  • Post-Dennard-scaling world of today
  • Huge cooling and power consumption issues
  • If we kept the same clock frequency trends, today a CPU would have the power density of a nuclear reactor

SLIDE 21

Heat Dissipation Problem

  • Large datacenters consume energy like large cities
  • Cooling is the main cost factor

[Photos: Google data center @ Columbia River valley (2006); Facebook data center @ Luleå (2015)]

SLIDE 22

Where is Luleå?

SLIDE 23

Single-Core Solutions

  • Dynamic Voltage and Frequency Scaling (DVFS)
  • E.g., Intel’s Turbo Boost
  • Only works under low load
  • Use part of the chip for coprocessors (e.g. graphics)
  • Lower power consumption
  • Limited number of generic functionalities to offload
SLIDE 24

Multi-Core Processors

[Diagram: several processor chips, each containing multiple cores, plugged into sockets on the motherboard and sharing the main memory]

SLIDE 25

Multi-Core Processors

  • Idea: scale computational power linearly
  • Instead of a single 5 GHz core, 2 * 2.5 GHz cores
  • Scale heat dissipation linearly
  • k cores have ~ k times the heat dissipation of a single core
  • Increasing frequency of a single core by k times creates a superlinear heat dissipation increase
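A commonly used first-order model (an assumption here, not stated on the slide) is that dynamic power grows as P ≈ C · V² · f, and since supply voltage has to rise roughly with frequency, power ends up growing roughly with f³. Under that model, two 2.5 GHz cores dissipate about a quarter of the heat of one 5 GHz core:

    # first-order dynamic power model: P ~ cores * f**3
    # (voltage assumed to scale roughly linearly with frequency)
    def relative_power(freq_ghz, cores=1):
        return cores * freq_ghz ** 3

    one_fast = relative_power(5.0)             # a single 5 GHz core
    two_slow = relative_power(2.5, cores=2)    # two 2.5 GHz cores
    print(one_fast / two_slow)                 # 4.0: the fast core runs ~4x hotter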

SLIDE 26

How to Leverage Multicores

  • Run multiple tasks in parallel
  • Multiprocessing
  • Multithreading
  • E.g. PCs have many parallel background apps
  • OS, music, antivirus, web browser, …
  • Parallelizing a single app is not trivial
  • Embarrassingly parallel tasks
  • Can be run by multiple threads
  • No coordination
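A minimal sketch of an embarrassingly parallel task in Python: every chunk is processed independently, so the workers need no coordination beyond handing back their results (multiprocessing rather than threads is used here to sidestep CPython's GIL; the squared-sum task is just an illustrative stand-in).

    from multiprocessing import Pool

    def process(chunk):
        # completely independent work: no shared state, no coordination
        return sum(x * x for x in chunk)

    if __name__ == "__main__":
        chunks = [range(i * 250_000, (i + 1) * 250_000) for i in range(4)]
        with Pool(processes=4) as pool:
            partials = pool.map(process, chunks)   # one chunk per worker
        print(sum(partials))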
SLIDE 27

Memory Bandwidth Bottleneck

  • Cores compete for the same main memory bus
  • Solution: caches help in two ways
  • They reduce latency (as we have discussed)
  • They also increase throughput by avoiding bus contention
SLIDE 28

SIMD Processors

  • Single Instruction Multiple Data (SIMD) processors
  • Example
  • Graphics Processing Units (GPUs)
  • Intel Xeon Phi coprocessors
  • Q: Possible SIMD snippets?

for i in [0, n-1] do
    v[i] = v[i] * pi

for i in [0, n-1] do
    if v[i] < 0.01 then v[i] = 0
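Both snippets apply the same operation to every element, which is exactly the single-instruction-multiple-data pattern. A sketch of the same two loops written with NumPy, whose array operations are typically executed with SIMD instructions (the random input vector is just an example):

    import numpy as np

    v = np.random.rand(1_000_000)

    v = v * np.pi        # element-wise multiply: one instruction, many data items
    v[v < 0.01] = 0      # the branch becomes a data-parallel mask instead of a jump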

SLIDE 29

Other Approaches

  • SIMD
  • Single Instruction Multiple Data
  • A massive number of simpler cores
  • FPGAs
  • Dedicated hardware designed for a specific task
SLIDE 30

Automatic Parallelization?

  • Holy grail in the multi-processor era
  • Approaches
  • Programming languages
  • Systems with APIs that help express parallelism
  • Efficient coordination mechanisms
SLIDE 31

Homework

The Anatomy of a Large-Scale Hypertextual Web Search Engine

Sergey Brin and Lawrence Page
Computer Science Department, Stanford University, Stanford, CA 94305, USA
sergey@cs.stanford.edu and page@cs.stanford.edu

Abstract: In this paper, we present Google, a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext. Google is designed to crawl and index the Web efficiently and produce much more satisfying search results than existing systems. The prototype with a full text and hyperlink database of at least 24 million pages is available at http://google.stanford.edu/ To engineer a search engine is a challenging task. Search engines index tens to hundreds of millions of web pages involving a comparable number of distinct terms. They answer tens of