Systems for Data Science
Marco Serafini
COMPSCI 532 Lecture 1
Course Structure
Fundamentals you need to know about systems: caching, virtual memory, concurrency, etc.
Review of several big-data systems
Learn how they
http://marcoserafini.github.io/teaching/systems-for-data-science/fall19/
next Spring!)
[Figure: speedup as a function of parallelism, comparing the ideal curve with reality]
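The gap between ideal and real speedup can be illustrated with a small calculation. The sketch below uses Amdahl's law, which is one common model for this gap; the slides do not say which model the figure is based on, so treat the choice of model, the function name `amdahl_speedup`, and the 95% parallel fraction as assumptions made for illustration only.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Speedup predicted by Amdahl's law.

    parallel_fraction: share of the work that can run in parallel (0.0 to 1.0).
    workers: number of parallel workers (for example, cores).
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


if __name__ == "__main__":
    # Assumed workload: 95% of the work is parallelizable (illustrative value).
    for workers in (1, 2, 4, 8, 16, 64):
        ideal = float(workers)  # ideal speedup grows linearly with workers
        modeled = amdahl_speedup(0.95, workers)
        print(f"{workers:>3} workers: ideal {ideal:5.1f}x, modeled {modeled:5.2f}x")
```

Under this model the serial part of the work bounds the speedup no matter how many workers are added, which is one reason a measured curve flattens while the ideal one keeps rising.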
[Figure: throughput and latency under 1x, 10x, 50x, and 100x request load, with the maximum throughput marked]
Source: https://queue.acm.org/detail.cfm?id=2181798
So far so good, but the trend is slowing down and it won’t last for long (Intel’s prediction: until 2021 unless new technologies arise) [1]
[1] https://www.technologyreview.com/s/601441/moores-law-is-dead-now-what/
Exponential axis
have the power density of a nuclear reactor
Google @ Columbia River valley (2006); Facebook @ Luleå (2015)
[Figure: three processors (chips) with four cores each; each processor connects to the motherboard through a socket]
heat dissipation increase
for i in [0,n-1] do
    v[i] = v[i] * pi

for i in [0,n-1] do
    if v[i] < 0.01 then
        v[i] = 0
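The two loops above can be written as runnable Python. This is a minimal sketch: the function names and the sample values are hypothetical, and the functions work on a copy of the list, whereas the pseudocode updates `v` in place. The second function, which does both steps in a single pass over the data, is not in the slides; it is included only as one possible point of comparison.

```python
import math


def scale_then_threshold(v, threshold=0.01):
    """Two passes, following the pseudocode: scale by pi, then zero small values."""
    out = list(v)  # copy so the caller's list is left unchanged
    n = len(out)
    for i in range(n):  # i runs over [0, n-1]
        out[i] = out[i] * math.pi
    for i in range(n):
        if out[i] < threshold:
            out[i] = 0
    return out


def scale_and_threshold_single_pass(v, threshold=0.01):
    """Same result in one pass (an assumed variant, not taken from the slides)."""
    out = list(v)
    for i in range(len(out)):
        scaled = out[i] * math.pi
        out[i] = 0 if scaled < threshold else scaled
    return out


if __name__ == "__main__":
    values = [0.001, 0.5, -2.0, 0.004]  # illustrative input
    print(scale_then_threshold(values))
    print(scale_and_threshold_single_pass(values))
```

Both functions return the same list for the same input. The two-pass version reads every element twice, while the single-pass version touches each element once.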
Sergey Brin and Lawrence Page
Computer Science Department, Stanford University, Stanford, CA 94305, USA
sergey@cs.stanford.edu and page@cs.stanford.edu
Abstract
In this paper, we present Google, a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext. Google is designed to crawl and index the Web efficiently and produce much more satisfying search results than existing systems. The prototype with a full text and hyperlink database of at least 24 million pages is available at http://google.stanford.edu/
To engineer a search engine is a challenging task. Search engines index tens to hundreds of millions of web pages involving a comparable number of distinct terms. They answer tens of