61A Lecture 36: Announcements, Unix, Computer Systems

slide-1
SLIDE 1

61A Lecture 36

slide-2
SLIDE 2

Announcements

slide-3
SLIDE 3

Unix

slide-11
SLIDE 11

Computer Systems

Systems research enables application development by defining and implementing abstractions:

  • Operating systems provide a stable, consistent interface to unreliable, inconsistent hardware
  • Networks provide a robust data transfer interface to constantly evolving communications infrastructure
  • Databases provide a declarative interface to complex software that stores and retrieves information efficiently
  • Distributed systems provide a unified interface to a cluster of multiple machines

A unifying property of effective systems: Hide complexity, but retain flexibility

slide-26
SLIDE 26

Example: The Unix Operating System

Essential features of the Unix operating system (and variants):

  • Portability: The same operating system on different hardware
  • Multi-Tasking: Many processes run concurrently on a machine
  • Plain Text: Data is stored and shared in text format
  • Modularity: Small tools are composed flexibly via pipes

“We should have some ways of coupling programs like [a] garden hose – screw in another segment when it becomes necessary to massage data in another way,” Doug McIlroy in 1964.

The standard streams in a Unix-like operating system are similar to Python iterators

[Diagram: text input → standard input → process → standard output → text output, with standard error as a second output of the process]

(Demo)

cd .../assets/slides && ls *.pdf | cut -f 1 -d - | sort -r | uniq -c
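The pipeline above can be mirrored in Python to show why standard streams resemble iterators: each stage consumes lines from the previous stage and yields lines to the next. This is a minimal sketch, not the demo itself. The file names are hypothetical stand-ins for the output of `ls *.pdf`, and the helper names `cut_first_field` and `uniq_count` are invented for illustration.

```python
from itertools import groupby

def cut_first_field(lines, delimiter="-"):
    """Like `cut -f 1 -d -`: keep the text before the first delimiter."""
    for line in lines:
        yield line.split(delimiter)[0]

def uniq_count(lines):
    """Like `uniq -c`: count each run of adjacent equal lines."""
    for value, run in groupby(lines):
        yield (sum(1 for _ in run), value)

# Hypothetical file names standing in for the output of `ls *.pdf`.
names = [
    "01-intro.pdf",
    "02-functions.pdf",
    "02-functions-extra.pdf",
    "01-intro-notes.pdf",
]

fields = cut_first_field(names)           # a lazy stream of first fields
ordered = sorted(fields, reverse=True)    # like `sort -r`; must see all input
counts = list(uniq_count(ordered))        # [(count, value), ...]
```

As with `uniq -c`, the counting stage only merges adjacent duplicates, which is why the sort comes first.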

slide-32
SLIDE 32

Python Programs in a Unix Environment

The sys.stdin and sys.stdout values provide access to the Unix standard streams as files
A Python file has an interface that supports iteration, read, and write methods
Using these "files" takes advantage of the operating system text processing abstraction
The input and print functions also read from standard input and write to standard output

(Demo)
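A small filter illustrates the file interface described here. This is a sketch under assumptions: the function name `number_lines` is invented, and in-memory `io.StringIO` files stand in for the real standard streams so the example runs without a shell.

```python
import io
import sys

def number_lines(in_stream=None, out_stream=None):
    """Copy each input line to the output, prefixed with its line number."""
    in_stream = sys.stdin if in_stream is None else in_stream
    out_stream = sys.stdout if out_stream is None else out_stream
    for number, line in enumerate(in_stream, start=1):  # files support iteration
        out_stream.write(f"{number}: {line}")           # and a write method

# In-memory files stand in for the standard streams in this sketch.
demo_output = io.StringIO()
number_lines(io.StringIO("two households\nboth alike\n"), demo_output)
```

Called with no arguments from a script, the same function would read from standard input and write to standard output, so it could sit in the middle of a Unix pipeline.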

slide-33
SLIDE 33

Big Data

slide-43
SLIDE 43

Big Data Examples

  • Facebook's daily logs: 60 Terabytes (60,000 Gigabytes)
  • 1000 Genomes Project: 200 Terabytes
  • Google web index: 10+ Petabytes (10,000,000 Gigabytes)
  • Time to read 1 Terabyte from disk: about 3 hours (at 100 Megabytes/second)

Examples from Anthony Joseph

[Image: Facebook datacenter (2014)]

Typical hardware for big data applications:

  • Consumer-grade hard disks and processors
  • Independent computers are stored in racks
  • Concerns: networking, heat, power, monitoring
  • When using many computers, some will fail!
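The read-time figure can be checked with a few lines of arithmetic. The sketch below uses decimal units, as the slide does. The helper `hours_in_parallel` is a hypothetical illustration that assumes data is split evenly across disks read at the same time.

```python
TERABYTE = 10**12             # bytes, decimal units as on the slide
MEGABYTE = 10**6
READ_RATE = 100 * MEGABYTE    # bytes per second from one disk

seconds = TERABYTE / READ_RATE    # 10,000 seconds
hours = seconds / 3600            # a little under 3 hours

def hours_in_parallel(total_bytes, disks):
    """Assumption: data is spread evenly and all disks are read at once."""
    return total_bytes / (READ_RATE * disks) / 3600

# Under that assumption, 60 Terabytes on 60 disks takes as long as 1 Terabyte on one.
daily_logs_hours = hours_in_parallel(60 * TERABYTE, disks=60)
```

The same arithmetic suggests why large datasets are spread over many machines: a single disk would need roughly a week to read 60 Terabytes once.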

slide-44
SLIDE 44

Apache Spark

slide-53
SLIDE 53

Apache Spark

Apache Spark is a data processing system that provides a simple interface for large data

  • A Resilient Distributed Dataset (RDD) is a collection of values or key-value pairs
  • Supports common UNIX operations: sort, distinct (uniq in UNIX), count, pipe
  • Supports common sequence operations: map, filter, reduce
  • Supports common database operations: join, union, intersection

All of these operations can be performed on RDDs that are partitioned across machines
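For readers who have not met these operations before, the sketch below shows what each one means on ordinary in-memory Python values. It is not Spark code and says nothing about how Spark implements them. The sample words and key-value pairs are made up for illustration.

```python
from functools import reduce

# Made-up sample data: a list of values and two lists of key-value pairs.
words = ["two", "households", "both", "alike", "two"]
lengths = [("two", 3), ("both", 4)]
speakers = [("two", "chorus"), ("both", "chorus")]

# Unix-style operations
distinct_sorted = sorted(set(words))      # distinct, then sort
count = len(words)                        # count

# Sequence operations
upper = list(map(str.upper, words))
long_words = list(filter(lambda w: len(w) > 3, words))
total_chars = reduce(lambda a, b: a + b, map(len, words))

# Database-style operations
lookup = dict(speakers)
joined = [(k, (v, lookup[k])) for k, v in lengths if k in lookup]  # join on key
union = words + ["dignity"]
intersection = sorted(set(words) & {"two", "verona"})
```

On a single machine these are one-liners. The point of the slide is that the same vocabulary carries over when the collection is split across many machines.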

[Figure: text excerpts labeled King Lear and Romeo & Juliet; the Romeo & Juliet excerpt follows]

Two households , both alike in dignity , In fair Verona , where we lay our scene , From ancient grudge break to new mutiny , Where civil blood makes civil hands unclean . From forth the fatal loins of these two foes A pair of star-cross'd lovers take their life ; Whose misadventur'd piteous overthrows Do with their death bury their parents' strife . The fearful passage of their death-mark'd love , And the continuance of their parents' rage , Which , but their children's end , nought could remove , Is now the two hours' traffick of our stage ; The which if you with patient ears attend , What here shall miss , our toil shall strive to mend .

slide-61
SLIDE 61

Apache Spark Execution Model

Processing is defined centrally but executed remotely

  • A Resilient Distributed Dataset (RDD) is distributed in partitions to worker nodes
  • A driver program defines transformations and actions on an RDD
  • A cluster manager assigns tasks to individual worker nodes to carry them out
  • Worker nodes perform computation & communicate values to each other
  • Final results are communicated back to the driver program
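The steps above can be imitated on one machine to make the roles concrete. In this sketch a thread pool stands in for the worker nodes and the main program plays the driver. It is an analogy under those assumptions, not how Spark schedules work, and the helper names `partition` and `count_words` are invented for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def partition(values, n):
    """Split values into n partitions, dealing them out round-robin."""
    return [values[i::n] for i in range(n)]

def count_words(lines):
    """The task one worker runs on its own partition."""
    return Counter(word for line in lines for word in line.split())

lines = [
    "Two households , both alike in dignity ,",
    "In fair Verona , where we lay our scene ,",
    "From ancient grudge break to new mutiny ,",
    "Where civil blood makes civil hands unclean .",
]

# The "driver" defines the work; the pool hands one task to each "worker".
partitions = partition(lines, 2)
with ThreadPoolExecutor(max_workers=2) as workers:
    partial_counts = list(workers.map(count_words, partitions))

# Final results come back to the driver, which combines them.
totals = sum(partial_counts, Counter())
```

Each worker sees only its own partition, so the per-partition counts must be merged before the driver has a complete answer.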

slide-68
SLIDE 68

Apache Spark Interface

A SparkContext gives access to the cluster manager A RDD can be constructed from the lines of a text file The sortBy transformation and take action are methods

12

King Lear Romeo & Juliet

Two households , both alike in dignity , In fair Verona , where we lay our scene , From ancient grudge break to new mutiny , Where civil blood makes civil hands unclean . From forth the fatal loins of these two foes A pair of star-cross'd lovers take their life ; Whose misadventur'd piteous overthrows Do with their death bury their parents' strife . The fearful passage of their death-mark'd love , And the continuance of their parents' rage , Which , but their children's end , nought could remove , Is now the two hours' traffick of our stage ; The which if you with patient ears attend , What here shall miss , our toil shall strive to mend .

>>> sc <pyspark.context.SparkContext ...> >>> x = sc.textFile('shakespeare.txt') The Last Words of Shakespeare (Demo)

slide-69
SLIDE 69

Apache Spark Interface

A SparkContext gives access to the cluster manager A RDD can be constructed from the lines of a text file The sortBy transformation and take action are methods

12

King Lear Romeo & Juliet

Two households , both alike in dignity , In fair Verona , where we lay our scene , From ancient grudge break to new mutiny , Where civil blood makes civil hands unclean . From forth the fatal loins of these two foes A pair of star-cross'd lovers take their life ; Whose misadventur'd piteous overthrows Do with their death bury their parents' strife . The fearful passage of their death-mark'd love , And the continuance of their parents' rage , Which , but their children's end , nought could remove , Is now the two hours' traffick of our stage ; The which if you with patient ears attend , What here shall miss , our toil shall strive to mend .

>>> sc <pyspark.context.SparkContext ...> >>> x = sc.textFile('shakespeare.txt') >>> x.sortBy(lambda s: s, False).take(2) ['you shall ...', 'yet , a ...'] The Last Words of Shakespeare (Demo)

slide-70
SLIDE 70

Apache Spark Interface

A SparkContext gives access to the cluster manager
An RDD can be constructed from the lines of a text file
The sortBy transformation and take action are methods

King Lear Romeo & Juliet

Two households , both alike in dignity , In fair Verona , where we lay our scene , From ancient grudge break to new mutiny , Where civil blood makes civil hands unclean . From forth the fatal loins of these two foes A pair of star-cross'd lovers take their life ; Whose misadventur'd piteous overthrows Do with their death bury their parents' strife . The fearful passage of their death-mark'd love , And the continuance of their parents' rage , Which , but their children's end , nought could remove , Is now the two hours' traffick of our stage ; The which if you with patient ears attend , What here shall miss , our toil shall strive to mend .

>>> sc
<pyspark.context.SparkContext ...>
>>> x = sc.textFile('shakespeare.txt')
>>> x.sortBy(lambda s: s, False).take(2)
['you shall ...', 'yet , a ...']

The Last Words of Shakespeare (Demo)
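The session above needs a running cluster. As a rough local analogy (not Spark itself), a plain Python list can stand in for the RDD: sorted with reverse=True plays the role of sortBy(lambda s: s, False), and slicing plays the role of take(2). The function name last_lines and the sample lines are made up for illustration.

```python
def last_lines(lines, n):
    """Return the n lines that come last in sorted order, largest first.

    Local stand-in for x.sortBy(lambda s: s, False).take(n); `lines` is an
    ordinary list here, not an RDD.
    """
    return sorted(lines, key=lambda s: s, reverse=True)[:n]


# Hypothetical sample lines, not taken from shakespeare.txt.
sample = ["a rose", "yet more", "you too", "be brief"]
print(last_lines(sample, 2))
```

Unlike this list version, Spark distributes the sort across machines and only brings the requested lines back to the driver.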

slide-80
SLIDE 80

What Does Apache Spark Provide?

Fault tolerance: A machine or hard drive might crash

  • The cluster manager automatically re-runs failed tasks

Speed: Some machine might be slow because it's overloaded

  • The cluster manager can run multiple copies of a task and keep the result of the one that finishes first

Network locality: Data transfer is expensive

  • The cluster manager tries to schedule computation on the machines that hold the data to be processed

Monitoring: Will my job finish before dinner?!?

  • The cluster manager provides a web-based interface describing jobs
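The first two ideas, re-running failed tasks and running duplicate copies of a slow task, can be imitated on one machine with threads. This is only a sketch of the concepts, not how Spark's cluster manager is implemented; run_with_retries, run_speculatively, make_flaky, and simulated_task are hypothetical names.

```python
import concurrent.futures
import time


def run_with_retries(task, attempts=3):
    """Fault tolerance: call task() again if it raises, up to `attempts` times."""
    last_error = None
    for _ in range(attempts):
        try:
            return task()
        except Exception as error:  # a simulated crash
            last_error = error
    raise last_error


def run_speculatively(task, delays):
    """Speed: start one copy of task per delay, keep whichever finishes first."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=len(delays)) as pool:
        futures = [pool.submit(task, delay) for delay in delays]
        done, _ = concurrent.futures.wait(
            futures, return_when=concurrent.futures.FIRST_COMPLETED)
        return next(iter(done)).result()


def make_flaky(failures):
    """Build a task that raises `failures` times before succeeding."""
    state = {"left": failures}

    def task():
        if state["left"] > 0:
            state["left"] -= 1
            raise RuntimeError("simulated crash")
        return "done"

    return task


def simulated_task(delay):
    """Pretend to be a machine that takes `delay` seconds to compute 7 * 7."""
    time.sleep(delay)
    return 7 * 7


print(run_with_retries(make_flaky(2)))
print(run_speculatively(simulated_task, [0.2, 0.01]))
```

Every copy computes the same answer, so it does not matter which one wins; that is what makes duplicate execution safe.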


slide-81
SLIDE 81

MapReduce

slide-88
SLIDE 88

MapReduce Applications

An important early distributed processing system was MapReduce, developed at Google

Generic application structure that happened to capture many common data processing tasks

  • Step 1: Each element in an input collection produces zero or more key-value pairs (map)
  • Step 2: All key-value pairs that share a key are aggregated together (shuffle)
  • Step 3: The values for a key are processed as a sequence (reduce)
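The three steps above can be sketched on a single machine in plain Python. This is an illustration of the structure only, with no distribution or fault tolerance; map_reduce, word_mapper, and count_reducer are hypothetical names, and word counting is just one convenient example.

```python
from collections import defaultdict


def map_reduce(inputs, mapper, reducer):
    """Run map, shuffle, and reduce over an in-memory collection."""
    # Step 1 (map): each element produces zero or more key-value pairs.
    pairs = [pair for element in inputs for pair in mapper(element)]

    # Step 2 (shuffle): pairs that share a key are gathered together.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)

    # Step 3 (reduce): the values for each key are processed as a sequence.
    return {key: reducer(key, values) for key, values in groups.items()}


def word_mapper(line):
    """Emit (word, 1) for every word in a line."""
    for word in line.split():
        yield (word, 1)


def count_reducer(key, values):
    """Add up the counts collected for one word."""
    return sum(values)


counts = map_reduce(["a rose is a rose", "is it"], word_mapper, count_reducer)
print(counts)
```

Only the mapper and reducer are specific to the application; the shuffle in the middle is the same for every job.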

Early applications: indexing web pages, training language models, & computing PageRank


slide-101
SLIDE 101

MapReduce Evaluation Model

Map phase: Apply a mapper function to all inputs, emitting intermediate key-value pairs

  • The mapper yields zero or more key-value pairs for each input

Reduce phase: For each intermediate key, apply a reducer function to accumulate all values associated with that key

  • All key-value pairs with the same key are processed together
  • The reducer yields zero or more values, each associated with that intermediate key

Mapper output for each input line:

  Google MapReduce          o: 2  a: 1  u: 1  e: 3
  Is a Big Data framework   i: 1  a: 4  e: 1  o: 1
  For batch processing      a: 1  o: 2  e: 1  i: 1

slide-108
SLIDE 108

MapReduce Evaluation Model

Mapper output for each input line:

  Google MapReduce          o: 2  a: 1  u: 1  e: 3
  Is a Big Data framework   i: 1  a: 4  e: 1  o: 1
  For batch processing      a: 1  o: 2  e: 1  i: 1

Pairs grouped by key, then passed to a reducer:

  a: 4  a: 1  a: 1    reducer    a: 6
  e: 1  e: 3  e: 1    reducer    e: 5
  ...                            i: 2
                                 o: 5
                                 u: 1

Reduce phase: For each intermediate key, apply a reducer function to accumulate all values associated with that key

  • All key-value pairs with the same key are processed together
  • The reducer yields zero or more values, each associated with that intermediate key
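The vowel-counting figure can be reproduced with a small local sketch. It assumes, as the figure's numbers suggest, that only lowercase vowels are counted; vowel_mapper and vowel_reducer are hypothetical names, and sorting followed by groupby stands in for the grouping that a real framework performs between the two phases.

```python
from itertools import groupby
from operator import itemgetter


def vowel_mapper(line):
    """Yield a (vowel, count) pair for each lowercase vowel present in line."""
    for vowel in "aeiou":
        count = line.count(vowel)
        if count > 0:
            yield (vowel, count)


def vowel_reducer(key, values):
    """Yield one value for the key: the sum of all counts for that vowel."""
    yield (key, sum(values))


lines = ["Google MapReduce", "Is a Big Data framework", "For batch processing"]

# Map phase: apply the mapper to every input line.
intermediate = [pair for line in lines for pair in vowel_mapper(line)]

# Group pairs by key, then apply the reducer to each group.
totals = {}
for key, group in groupby(sorted(intermediate), key=itemgetter(0)):
    values = [value for _, value in group]
    for out_key, out_value in vowel_reducer(key, values):
        totals[out_key] = out_value

print(totals)  # {'a': 6, 'e': 5, 'i': 2, 'o': 5, 'u': 1}
```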


slide-126
SLIDE 126

MapReduce Applications on Apache Spark

Key-value pairs are just two-element Python tuples

Call Expression        Data              fn Input     fn Output                      Result
data.flatMap(fn)       Values            One value    Zero or more key-value pairs   All key-value pairs returned by calls to fn
data.reduceByKey(fn)   Key-value pairs   Two values   One value                      One key-value pair for each unique key

(Demo)
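To make the fn inputs and outputs in the table concrete, here is a local imitation of the two calls on ordinary Python lists. The functions flat_map and reduce_by_key are hypothetical stand-ins, not PySpark's implementation; note that the reducing fn takes two values and returns one, so it is applied pairwise. On an actual RDD the corresponding expression would presumably be data.flatMap(vowels).reduceByKey(add).

```python
from functools import reduce
from operator import add


def flat_map(data, fn):
    """fn takes one value and returns zero or more key-value pairs."""
    return [pair for value in data for pair in fn(value)]


def reduce_by_key(pairs, fn):
    """fn takes two values and returns one; result has one pair per unique key."""
    grouped = {}
    for key, value in pairs:
        grouped.setdefault(key, []).append(value)
    return [(key, reduce(fn, values)) for key, values in grouped.items()]


def vowels(word):
    """Return (vowel, count) two-element tuples for the vowels in word."""
    return [(v, word.count(v)) for v in "aeiou" if v in word]


data = ["map", "reduce", "shuffle"]
result = reduce_by_key(flat_map(data, vowels), add)
print(sorted(result))  # [('a', 1), ('e', 3), ('u', 2)]
```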