Extreme Computing: Admin and Overview

SLIDE 1

Extreme Computing

Admin and Overview

Outline: Administration, Your Background, Overview, Big data, Performance, Clusters

SLIDE 2

Course Staff

1/3 × Kenneth Heafield
2/3 × Volker Seeker

Currently 12 TAs/demonstrators/markers

SLIDE 4

Website

http://www.inf.ed.ac.uk/teaching/courses/exc

Piazza

https://piazza.com/class/j7m5dr4ns4dta (Linked from website)

Mailing List

exc-students at inf.ed.ac.uk is populated when you enroll.

⇒ Check website for announcements, especially first two weeks.

SLIDE 6

Assessment

25% Assignment 1
25% Assignment 2
50% Exam in May (December for visitors)

Don’t start the assignments yet; they are being updated.
Solve the assignments on your own. Don’t share code.
Exam is closed book.

SLIDE 7

Assignment Deadlines

We’ll provide you with a cluster to do assignments on.
The cluster will be offline on Sunday 22 October 2017.
→ Assignment 1 will probably be due before then.

SLIDE 8

Lectures: Online, subject to revision.
Labs: Practice on a cluster. Not marked, but in exam.
Papers: Linked from the website.
Books: Don’t buy them. They’re in the library:
    Data-Intensive Text Processing with MapReduce
    Hadoop: The Definitive Guide

The exam is based on the lectures and labs.

SLIDE 9

Labs

Run 2–27 October (four weeks) at these times:
    Monday 9am
    Monday 10am
    Tuesday 2pm
    Wednesday 10am
    Wednesday 2pm
    Thursday 9am
    Thursday 11am
    Friday 11am
    Friday 2pm

Lab groups will be chosen online: https://student.inf.ed.ac.uk.

SLIDE 11

Unix Command Line

We assume you know the Unix command line (typically bash).

    tar cJ . | ssh server "cd $PWD && tar xJ"
    diff <(zcat a.gz) <(zcat b.gz)

If you didn’t understand that, work through these:

http://www.ed.ac.uk/information-services/help-consultancy/is-skills/catalogue/program-op-sys-catalogue/unix1
https://www.lynda.com/Linux-tutorials/Linux-Bash-Shell-Scripts/504429-2.html
(The university has a subscription to lynda.com)
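For readers more at home in Python than in bash, the second command can be approximated as below. This is only a sketch, not course material: the helper name `diff_gz` is hypothetical, and it relies on the standard `gzip` and `difflib` modules. Like the process substitution `<(zcat ...)`, it compares the decompressed contents of two gzip files without writing temporary files; unlike bash, it loads both files into memory.

```python
import difflib
import gzip


def diff_gz(path_a, path_b):
    """Return unified-diff lines between the decompressed text of two gzip files.

    Hypothetical helper for illustration. It roughly mirrors
    `diff <(zcat a.gz) <(zcat b.gz)`, but emits unified-diff output
    instead of diff's default format.
    """
    with gzip.open(path_a, "rt", encoding="utf-8") as fa:
        lines_a = fa.readlines()
    with gzip.open(path_b, "rt", encoding="utf-8") as fb:
        lines_b = fb.readlines()
    return list(
        difflib.unified_diff(lines_a, lines_b, fromfile=path_a, tofile=path_b)
    )
```

An empty list means the two files decompress to identical text.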

SLIDE 13

Programming Languages

The only language we require is command line.
Examples are mostly Python and Java, with occasional C++.

Average submission length:

          Lines    Words     Characters
Python    45.54    140.60    1412.81
Java      57.53    153.99    1738.76

Hint: bash is a programming language.

SLIDE 14

Data Structures

Know and apply foundational data structures: hash tables, arrays, queues, ...
These are taught in our second year undergraduate course, Informatics 2B.
Inefficient data structure choices will lose marks.
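As an illustration of what an inefficient choice can look like, here is a minimal Python sketch. The function names are hypothetical and the examples are not taken from the assignments; they only contrast a hash-based set and a deque with the slower list-based alternatives.

```python
from collections import deque


def unique_in_order(items):
    """Return items with duplicates removed, keeping first occurrences.

    A set gives expected O(1) membership tests, so the whole pass is O(n).
    Testing membership against a list instead would make it O(n^2).
    """
    seen = set()
    result = []
    for item in items:
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result


def drain_queue(items):
    """Process items first-in, first-out using a deque.

    deque.popleft() is O(1); list.pop(0) is O(n) because it shifts
    every remaining element.
    """
    queue = deque(items)
    processed = []
    while queue:
        processed.append(queue.popleft())
    return processed
```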

SLIDE 16

Core Course Content

Working with big data
Cluster computing with 10,000 machines
How to pass a Google interview [1]
How clouds like Amazon Web Services work

Not Part of the Course

How to program (expected)
Unix command line (learn it yourself)
Mobile phones or Internet of things
GPUs and FPGAs

[1] Job at Google not guaranteed.

SLIDE 17

Topics

Big Data
Cloud Computing Infrastructure
MapReduce and Hadoop
Beyond MapReduce
Fault Tolerance and Replication
NoSQL
BASE vs ACID
BitTorrent
Data warehousing
Data streams
Virtualisation

SLIDE 20

What is big data?

“You can turn small data into big data by wrapping it in XML.”
“If things are breaking, you have big data.”

Big data is relative: not the same for Google and Informatics.
Sometimes Google’s big data is our small data! [Brants et al., 2007]

SLIDE 21

The Internet Archive

560,000,000,000  Unique URLs of Web Crawl
      4,000,000  eBooks
      3,000,000  Hours of Television
      2,400,000  Audio Recordings
      2,300,000  Book Archive
      2,000,000  Moving Images
         25,000  Software Titles

30 Petabytes total
17 Petabytes of websites (gzipped)
2-3 Petabytes/year growth

SLIDE 22

900 TB in one machine

90 hard drives, each 10 TB, in one server

SLIDE 23

General Big Data

Government: Demographics, communication
Large Hadron Collider: 15 PB/year
Fraud detection: Did your debit card work?
Social media: Who to follow?
Search: Can I borrow a copy of the web?
Online advertising: Placement, tracking, pricing

SLIDE 24

Common Source: Lots of Observations

Every web page
Mobile phone location reports
Twitter posts
Every Google search

SLIDE 25

Modeling Challenges of Big Data

Hard to understand and visualize
Tools often fail: need new algorithms
Models may not scale
Models that do scale may not show gains anymore

SLIDE 26

Performance

How do we access big data efficiently?
What patterns do we use for computation?

SLIDE 28

Disk Performance

Read speed on various devices:

Device             Random bytes/s    Sequential bytes/s
NVMe SSD                   24,732         2,774,080,000
Old SATA SSD                7,848           256,781,000
5 TB Hard drive                82           171,302,000

Sequential is roughly 30,000 to 2 million times faster!
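The speed-up per device can be checked directly from the read speeds listed on this slide. The sketch below is plain arithmetic; the dictionary and function names are hypothetical, and the numbers are copied from the slide.

```python
# Read speeds in bytes/s as (random, sequential), copied from the slide.
READ_SPEEDS = {
    "NVMe SSD": (24_732, 2_774_080_000),
    "Old SATA SSD": (7_848, 256_781_000),
    "5 TB Hard drive": (82, 171_302_000),
}


def sequential_speedup(random_bps, sequential_bps):
    """How many times faster sequential reads are than random reads."""
    return sequential_bps / random_bps


for device, (random_bps, sequential_bps) in READ_SPEEDS.items():
    ratio = sequential_speedup(random_bps, sequential_bps)
    print(f"{device}: {ratio:,.0f}x")
```

With these figures the ratio is roughly 112,000 for the NVMe SSD, 33,000 for the old SATA SSD and 2.1 million for the hard drive.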

SLIDE 29

Sequential access impacts algorithm choice:

              Complexity    Access
Hash table    O(n)          Random
Merge sort    O(n log n)    Sequential batches

Constant factors matter: merge sort is faster on disk.
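To make "sequential batches" concrete, here is a minimal sketch of an external merge sort in Python. It is an illustration under assumptions, not the course's reference implementation: the names `external_sort` and `_write_run` are hypothetical, the data are integers stored one per line, and the default chunk size is arbitrary. Each chunk is sorted in memory and written out sequentially, then all runs are read back sequentially and merged.

```python
import heapq
import tempfile


def _write_run(sorted_values):
    """Write one sorted run to a temporary file, one integer per line."""
    run = tempfile.TemporaryFile(mode="w+")
    run.writelines(f"{v}\n" for v in sorted_values)
    run.seek(0)  # rewind so the merge phase reads from the start
    return run


def external_sort(values, chunk_size=1000):
    """Yield integers from `values` in sorted order using bounded memory.

    Phase 1: read `chunk_size` values at a time, sort each chunk in
             memory, and write it sequentially to its own run file.
    Phase 2: k-way merge of all runs, reading each run file
             sequentially from start to end.
    """
    runs = []
    chunk = []
    for v in values:
        chunk.append(v)
        if len(chunk) >= chunk_size:
            runs.append(_write_run(sorted(chunk)))
            chunk = []
    if chunk:
        runs.append(_write_run(sorted(chunk)))

    try:
        streams = [(int(line) for line in run) for run in runs]
        yield from heapq.merge(*streams)
    finally:
        for run in runs:
            run.close()
```

Only one chunk plus one value per run is held in memory at a time, and every disk access is a sequential read or write.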

SLIDE 31

Power Law

Big data often follows a power law.
Modelling the head (e.g. common words) is easier, but unrepresentative.
Handling the tail is harder (e.g. selling all books, not just top 100).

The machine responsible for “the” will take longer.

SLIDE 32

Challenge: Load Balancing

Distributed computing is a natural way to tackle big data.
But we need to balance work across machines:
    Head of power law goes to one or two nodes ⇒ slow
    Tail balanced over nodes ⇒ fast
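A small simulation can show the imbalance. The sketch below is an assumption-laden illustration: the word counts are synthetic (the word of rank r occurs 1,000,000 // r times), keys are assigned to nodes by a CRC32 hash purely because it is deterministic, and all names are hypothetical.

```python
import zlib


def node_for(key, n_nodes):
    """Assign a key to a node by hashing (CRC32 chosen only for determinism)."""
    return zlib.crc32(key.encode("utf-8")) % n_nodes


def node_loads(key_counts, n_nodes):
    """Sum the work each node receives when every key lives on exactly one node."""
    loads = [0] * n_nodes
    for key, count in key_counts.items():
        loads[node_for(key, n_nodes)] += count
    return loads


# Synthetic power-law data (an assumption, not real measurements).
key_counts = {f"word{rank}": 1_000_000 // rank for rank in range(1, 1001)}
loads = node_loads(key_counts, n_nodes=10)

print("mean load:", sum(loads) / len(loads))
print("max load: ", max(loads))
```

Whichever node receives the most frequent key carries at least that key's full count, so the maximum load sits well above the mean and that node finishes last.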

SLIDE 33

Latency

How quickly does data move around the network?

High-frequency trading: put machines next to the exchange
Amazon (2007): sales decrease 1% for every 100 ms increase in load time
Google (2006): increasing page load time by 0.5 seconds produces a 20% drop in traffic
Google rankings include page load time

SLIDE 34

Data centres and clusters

SLIDE 35

Supercomputers

A pile of Linux boxes in the same room, with a fast network.

Top 3 (according to top500.org):

1 Sunway TaihuLight: 93,014 TFLOP/s, 10,649,600 cores
2 Tianhe-2: 33,863 TFLOP/s, 3,120,000 cores
3 Piz Daint: 19,590 TFLOP/s, 361,760 cores

SLIDE 36

Piz Daint

SLIDE 37

Economics of Servers: Own or Rent?

Many machines operate at 30% capacity.

Own
    Security
    Full control, customized hardware
    Tune for latency- or time-critical tasks
    Cheaper if machines will be used all the time

Rent
    Pay for servers, storage, and bandwidth by usage/hour
    Scale up to many servers when needed
    Compute is another commodity like electricity
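The trade-off can be sketched as a break-even calculation. Everything numeric here is hypothetical: the prices are made up for illustration, and the model assumes an owned machine costs the same whether busy or idle while a rented one is billed only for the fraction of time it is used.

```python
def hourly_cost(utilisation, own_cost_per_hour, rent_price_per_hour):
    """Return (own, rent) cost per wall-clock hour.

    utilisation is the fraction of time the machine is busy (0 to 1).
    Owning is a fixed cost; renting is assumed to scale with usage.
    """
    if not 0.0 <= utilisation <= 1.0:
        raise ValueError("utilisation must be between 0 and 1")
    return own_cost_per_hour, utilisation * rent_price_per_hour


def cheaper_option(utilisation, own_cost_per_hour, rent_price_per_hour):
    """Return "own" or "rent", whichever costs less per wall-clock hour."""
    own, rent = hourly_cost(utilisation, own_cost_per_hour, rent_price_per_hour)
    return "own" if own < rent else "rent"


# Hypothetical prices: owning costs 0.40 per hour, renting 1.00 per hour of use.
print(cheaper_option(0.30, 0.40, 1.00))  # machine busy 30% of the time
print(cheaper_option(1.00, 0.40, 1.00))  # machine busy all the time
```

Under these made-up prices the break-even utilisation is 40%, so a machine at 30% capacity is cheaper to rent, while one that is busy all the time is cheaper to own.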

SLIDE 38

Provisioning

Web traffic changes: time of day, shopping seasons, news, link from major site
High traffic → more machines
Low traffic → save cost
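The rule "high traffic → more machines, low traffic → save cost" can be written as a one-line sizing formula. This is a simplified sketch: the function name, the per-machine capacity and the traffic figures are all hypothetical, and real autoscalers also account for start-up delay and headroom.

```python
import math


def machines_needed(requests_per_second, capacity_per_machine, min_machines=1):
    """Number of machines required to serve the current traffic.

    capacity_per_machine is the (assumed) number of requests per second
    that one machine can handle.
    """
    if capacity_per_machine <= 0:
        raise ValueError("capacity_per_machine must be positive")
    needed = math.ceil(requests_per_second / capacity_per_machine)
    return max(min_machines, needed)


# Hypothetical traffic levels with 500 requests/s per machine.
print(machines_needed(10_000, 500))  # peak traffic
print(machines_needed(120, 500))     # quiet period
```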

Target (US Retailer)

Website target.com is hosted on Amazon Web Services
Busiest shopping day in 2009: 28 November
Day target.com went offline: 28 November

SLIDE 39

Data lock-in and third-party control

Some provider hosts our data:

    But we can only access it using proprietary (non-standard) APIs
    Lock-in makes customers vulnerable to price increases and dependent upon the provider

Providers may control our data in unexpected ways:

    July 2009: Amazon remotely removes books from Kindles
    Twitter prevents exporting tweets more than 3200 posts back
    Facebook locks user data in
    August 2010: Google drops Google Wave

SLIDE 40

Privacy and Security

Laundry list of breaches:
    Equifax
    NHS
    Ashley Madison hack
    US government HR database leaks, including security clearance
    Carphone Warehouse, Target, Health insurers

What if your cloud provider is hacked?
Who has access? The government? Which governments?

Need for privacy guarantees and measures.

SLIDE 41

Summary: Big Data

Scalable algorithms
Tools for cluster computing
Cloud providers and how they work
