Massive Data Algorithmics, Gerth Stølting Brodal, Aarhus University



slide-1
SLIDE 1

Gerth Stølting Brodal

Aarhus University

Forskningsdag for Datamatikerlærere, Erhvervsakademiet Lillebælt, Vejle, November 2, 2010

Massive Data Algorithmics

slide-2
SLIDE 2

[Timeline figure. Stages: M.Sc., PhD, Post Doc, Faculty, Associate Professor. Places: AU, MPII. People: Erik Meineche Schmidt, Kurt Mehlhorn. Year labels: 1969, 1983, 1989, 1993, 1994, 1997, 1998, 2004, 1994-2006, 2007-. Month labels: September, August, January, November, April.]

Gerth Stølting Brodal

slide-3
SLIDE 3

Outline of Talk

  • – who, where, what?

– research areas

  • External memory algorithmics

– models
– searching and sorting

  • Fault-tolerant searching
  • Flow simulation
slide-4
SLIDE 4
slide-5
SLIDE 5

– Where?

slide-6
SLIDE 6
  • Center of
  • Lars Arge, Professor, Center leader
  • Gerth S. Brodal, Associate Professor
  • 5 Post Docs, 10 PhD students, 4 TAP (technical and administrative staff)
  • Total budget for 5 years ca. 60 million DKK

AU: Arge, Brodal; MIT: Demaine, Indyk; MPII: Mehlhorn; Frankfurt: Meyer

slide-7
SLIDE 7

Faculty: Lars Arge, Gerth Stølting Brodal

Researchers: Henrik Blunck, Brody Sandel, Nodari Sitchinava, Elad Verbin, Qin Zhang

PhD Students: Lasse Kosetski Deleuran, Jakob Truelsen, Freek van Walderveen, Kostas Tsakalidis, Casper Kejlberg-Rasmussen, Mark Greve, Kasper Dalgaard Larsen, Morten Revsbæk, Jesper Erenskjold Moeslund, Pooya Davoodi

slide-8
SLIDE 8

PhD Education @ AU

[Figure: three study timelines, each running from year 1 to year 8]

  • 1980s, ”5+3”: MSc, then Licentiat (PhD)
  • 1990s, ”4+4”: Bachelor, MSc, PhD Part A, PhD Part B
  • 2000s, ”3+5”: Bachelor, MSc, PhD Part A, PhD Part B

Other labels in the figure: datamatiker, merit (credit transfer)

slide-9
SLIDE 9

PhD Education @ MADALGO

[Figure: ”3+5” timeline running from year 1 to year 8: Bachelor, MSc, PhD Part A, PhD Part B. Other labels: 6 months abroad, merit (credit transfer). Students placed along the timeline: Kasper; Morten, Pooya, Freek; Mark, Jakob, Lasse, Kostas; Casper.]

slide-10
SLIDE 10
slide-11
SLIDE 11
  • High level objectives

– Advance algorithmic knowledge in “massive data” processing area
– Train researchers in world-leading international environment
– Be catalyst for multidisciplinary/industry collaboration

  • Building on

– Strong international team
– Vibrant international environment (focus on people)

slide-12
SLIDE 12
  • Pervasive use of computers and sensors
  • Increased ability to acquire/store/process data

→ Massive data collected everywhere

  • Society increasingly “data driven”

→ Access/process data anywhere any time

Nature special issues

– 2/06: “2020 – Future of computing”
– 9/08: “BIG DATA”

  • Scientific data size growing exponentially, while quality and availability improving
  • Paradigm shift: Science will be about mining data

→ Computer science paramount in all sciences

Massive Data

slide-13
SLIDE 13
  • Pervasive use of computers and sensors
  • Increased ability to acquire/store/process data

→ Massive data collected everywhere

  • Society increasingly “data driven”

→ Access/process data anywhere any time

Nature special issues

– 2/06: “2020 – Future of computing”
– 9/08: “BIG DATA”

  • Scientific data size growing exponentially, while quality and availability improving
  • Paradigm shift: Science will be about mining data

→ Computer science paramount in all sciences

Massive Data

Obviously not only in sciences:

  • Economist 02/10:

– From 150 billion gigabytes five years ago to 1200 billion today
– Managing data deluge difficult; doing so will transform business/public life

slide-14
SLIDE 14

Example: Massive Terrain Data

  • New technologies: Much easier/cheaper to collect detailed data

– Previous ‘manual’ or radar based methods

− Often 30 meter between data points
− Sometimes 10 meter data available

– New laser scanning methods (LIDAR)

− Less than 1 meter between data points
− Centimeter accuracy (previously meter)

  • Denmark

– ~2 million points at 30 meter (<1GB)
– ~18 billion points at 1 meter (>1TB)

slide-15
SLIDE 15

  • Cache Oblivious Algorithms
  • Streaming Algorithms
  • Algorithm Engineering
  • I/O Efficient Algorithms

slide-16
SLIDE 16

The problem...

[Plot: running time against input size for a normal algorithm and an I/O-efficient algorithm; bottleneck = memory size]

slide-17
SLIDE 17

Memory Hierarchies

[Figure: Processor (CPU), L1, L2, L3, RAM, Disk, with increasing access times and memory sizes; one transfer step is marked as the bottleneck]

slide-18
SLIDE 18

Memory Hierarchies vs. Running Time

[Plot: running time against input size, with the capacities of L1, L2, L3 and RAM marked on the input-size axis]

slide-19
SLIDE 19

Disk Mechanics

[Figure: disk drive with labels: track, magnetic surface, read/write arm, read/write head]

“The difference in speed between modern CPU and disk technologies is analogous to the difference in speed in sharpening a pencil using a sharpener on one’s desk or by taking an airplane to the other side of the world and using a sharpener on someone else’s desk.” (D. Comer)

slide-20
SLIDE 20

Disk Mechanics

[Figure: disk drive with labels: track, magnetic surface, read/write arm, read/write head]

  • I/O is often the bottleneck when handling massive datasets
  • Disk access is 10^7 times slower than main memory access!
  • Disk systems try to amortize the large access time by transferring large contiguous blocks of data
  • Need to store and access data so as to take advantage of blocks!
slide-21
SLIDE 21

Memory Access Times

              Latency    Relative to CPU
Register      0.5 ns     1
L1 cache      0.5 ns     1-2
L2 cache      3 ns       2-7
DRAM          150 ns     80-200
TLB           500+ ns    200-2000
Disk          10 ms      10^7

slide-22
SLIDE 22

I/O-Efficient Algorithms Matter

  • Example: Traversing linked list (List ranking)

– Array size N = 10 elements
– Disk block size B = 2 elements
– Main memory size M = 4 elements (2 blocks)

  • Difference between N and N/B large since block size is large

– Example: N = 256 × 10^6, B = 8000, 1 ms disk access time
→ N I/Os take 256 × 10^3 sec = 4266 min = 71 hr
→ N/B I/Os take 256/8 sec = 32 sec

Algorithm 1: N = 10 I/Os
Algorithm 2: N/B = 5 I/Os

[Figure: two array layouts of the list: 1 5 2 6 7 3 4 10 8 9 and 1 2 10 9 8 5 4 7 6 3]
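The gap between N and N/B I/Os can be reproduced with a small simulation. The sketch below is an illustration only: it assumes an LRU replacement policy, and the scattered layout `(3 * k) % N` is a hypothetical one chosen for the example, not one of the layouts drawn on the slide. The function name `count_ios` is likewise made up here.

```python
from collections import OrderedDict


def count_ios(positions, block_size, memory_blocks):
    """Count block reads for a sequence of array positions under LRU caching."""
    cache = OrderedDict()  # block ids, ordered from least to most recently used
    ios = 0
    for pos in positions:
        block = pos // block_size
        if block in cache:
            cache.move_to_end(block)       # hit: mark as most recently used
        else:
            ios += 1                       # miss: one I/O to fetch the block
            cache[block] = True
            if len(cache) > memory_blocks:
                cache.popitem(last=False)  # evict the least recently used block
    return ios


N, B, M = 10, 2, 4                           # values from the slide
sequential = list(range(N))                  # list stored in traversal order
scattered = [(3 * k) % N for k in range(N)]  # hypothetical scattered layout

print(count_ios(sequential, B, M // B))      # N/B I/Os
print(count_ios(scattered, B, M // B))       # N I/Os for this layout

# The slide's large example, 1 ms per I/O, reported in whole seconds
n_large, b_large = 256 * 10**6, 8000
print(n_large // 1000, "seconds for N I/Os")
print(n_large // b_large // 1000, "seconds for N/B I/Os")
```

With these inputs the sequential layout costs 5 I/Os and the scattered layout 10, matching the N/B versus N contrast on the slide.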

slide-23
SLIDE 23

I/O Efficient Scanning

[Figure: array A of N elements stored in blocks of B elements]

O(N/B) I/Os

slide-24
SLIDE 24

External-Memory Merging

Merging k sequences with N elements requires O(N/B) I/Os (provided k ≤ M/B – 1)

[Figure: k-way merger reading blocks from k sorted input sequences and writing a single sorted output sequence]
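A minimal sketch of the k-way merge, assuming the runs are plain Python lists standing in for files on disk. It keeps one block per run plus one output block in memory (hence the k ≤ M/B – 1 condition) and counts every block read and write. The function name `external_merge` and the sample runs are made up for this illustration.

```python
import heapq


def external_merge(runs, block_size):
    """Merge sorted runs block by block; return (merged list, number of I/Os)."""
    ios = 0
    buffers = []     # the block currently held in memory for each run
    next_block = []  # index of the next unread block of each run
    for run in runs:
        buffers.append(run[:block_size])
        next_block.append(1)
        if run:
            ios += 1  # read the first block of this run
    heap = [(buf[0], i, 0) for i, buf in enumerate(buffers) if buf]
    heapq.heapify(heap)
    output, out_buffer = [], []
    while heap:
        value, i, j = heapq.heappop(heap)
        out_buffer.append(value)
        if len(out_buffer) == block_size:
            output.extend(out_buffer)  # write a full output block
            out_buffer = []
            ios += 1
        j += 1
        if j == len(buffers[i]):       # block exhausted: fetch the next one
            start = next_block[i] * block_size
            buffers[i] = runs[i][start:start + block_size]
            next_block[i] += 1
            j = 0
            if buffers[i]:
                ios += 1
        if buffers[i]:
            heapq.heappush(heap, (buffers[i][j], i, j))
    if out_buffer:
        output.extend(out_buffer)      # write the last, partially filled block
        ios += 1
    return output, ios


runs = [[1, 4, 7, 10], [2, 3, 5, 6, 9], [8, 12, 16]]
merged, ios = external_merge(runs, block_size=2)
print(merged, ios)
```

Each input block is read once and each output block written once, so the I/O count is the sum of the per-run block counts plus the output block count.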

slide-25
SLIDE 25

External-Memory Sorting

  • MergeSort uses O((N/B)·log_{M/B}(N/B)) I/Os
  • In practice the number of I/Os is 4-6 × scanning the input

[Figure: the unsorted input of N elements is partitioned into runs of size M (Run 1, Run 2, ..., Run N/M); each run is sorted; merge passes I, II, ... combine the sorted runs into the sorted output]
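The run formation and merge passes can be sketched as follows. This is an in-memory illustration under stated assumptions, not an implementation from the talk: lists stand in for disk files, the merge fan-in is taken to be M/B – 1, and `external_merge_sort` is a name chosen here. It reports the number of merge passes, which is the logarithmic factor in the I/O bound.

```python
import heapq


def external_merge_sort(data, M, B):
    """Sort by run formation and repeated multiway merging; return (sorted, passes)."""
    # Run formation: sort memory-sized chunks of M elements.
    runs = [sorted(data[i:i + M]) for i in range(0, len(data), M)]
    fan_in = max(2, M // B - 1)  # one block per input run plus one output block
    passes = 0
    while len(runs) > 1:
        # One merge pass: merge groups of fan_in runs into longer runs.
        runs = [list(heapq.merge(*runs[i:i + fan_in]))
                for i in range(0, len(runs), fan_in)]
        passes += 1
    return (runs[0] if runs else []), passes


data = list(range(100, 0, -1))  # 100 elements in reverse order
result, passes = external_merge_sort(data, M=8, B=2)
print(result[:5], passes)
```

With M = 8 and B = 2 the 13 initial runs shrink to 5, then 2, then 1, so three merge passes are needed.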

slide-26
SLIDE 26

B-trees - The Basic Searching Structure

  • Searches

Practice: 4-5 I/Os

  • Repeated searching

Practice: 1-2 I/Os

[Figure: B-tree with labels: B, Search path, Internal memory]

slide-27
SLIDE 27

B-trees

  • Searches O(log_B N) I/Os
  • Updates O(log_B N) I/Os

Best possible?

[Figure: B-tree with labels: B, Search path, Internal memory]
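To get a feel for the log_B N bound, the sketch below counts how many levels a tree with a given fanout needs to cover N keys. The fanout of 1000 is a hypothetical value chosen for illustration, and `tree_levels` is a name made up here; real B-tree heights also depend on node fill and key size.

```python
def tree_levels(n, fanout):
    """Smallest number of levels h such that fanout**h >= n."""
    levels, capacity = 1, fanout
    while capacity < n:
        capacity *= fanout
        levels += 1
    return levels


n = 10**8
print(tree_levels(n, 2))     # binary search tree: about log_2 N levels
print(tree_levels(n, 1000))  # hypothetical B-tree fanout: about log_B N levels
```

For N = 10^8 a binary tree needs 27 levels while a fanout of 1000 needs 3, which is why each search touches only a handful of blocks.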

slide-28
SLIDE 28

B-trees with Buffered Updates

Brodal and Fagerberg (2003)

  • N updates cost O((N/√B) · log_B N) I/Os
  • Searches cost O(log_B N) I/Os

Trade-off between search and update times – optimal!

[Figure: B-tree with update buffers; labels: √B, B]

slide-29
SLIDE 29

B-trees with Buffered Updates

Brodal and Fagerberg (2003)

slide-30
SLIDE 30

B-trees with Buffered Updates Experimental Study

Hedegaard (2004)

  • 100,000,000 elements
  • Search time basically unchanged with buffers
  • Updates 100 times faster

slide-31
SLIDE 31

[Figure: a region of memory shown as a grid of bits]

slide-32
SLIDE 32

[Figure: the same grid of bits with a single bit flipped]

A bit in memory changed value because of e.g. background radiation, system heating, ...

slide-33
SLIDE 33

"You have to provide reliability on a software level. If you're running 10,000 machines, something is going to die every day."

― Google fellow Jeff Dean

slide-34
SLIDE 34

4 7 10 13 14 15 16 18 19 23 24 26 27 29 30 31 32 33 34 36 38

Binary Search for 16

O(log N) comparisons

slide-35
SLIDE 35

4 7 10 13 14 15 16 18 19 23 8 26 27 29 30 31 32 33 34 36 38

Soft memory error: 24 has been corrupted to 8

00011000₂ = 24
00001000₂ = 8

Binary Search for 16

Requirement: If the search key occurs in the array as an uncorrupted value, then we should report a match!
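The effect of the single flipped bit can be checked directly. The sketch below runs a textbook binary search (not the resilient algorithm discussed in the talk) on the slide's array before and after the corruption of 24 to 8. The function name `binary_search` is chosen here.

```python
def binary_search(a, x):
    """Standard binary search; returns the index of x in a, or -1 if not found."""
    lo, hi = 0, len(a) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if a[mid] == x:
            return mid
        if a[mid] < x:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1


clean = [4, 7, 10, 13, 14, 15, 16, 18, 19, 23, 24,
         26, 27, 29, 30, 31, 32, 33, 34, 36, 38]
corrupted = list(clean)
corrupted[10] ^= 0b00010000  # flip one bit: 00011000 (24) becomes 00001000 (8)

print(binary_search(clean, 16))      # found at index 6
print(binary_search(corrupted, 16))  # -1: the first probe reads 8 and goes right
```

The key 16 is still stored uncorrupted, yet the search misses it because its very first comparison reads the corrupted cell. This is exactly the situation the requirement above rules out.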

slide-36
SLIDE 36

Where is Michael ?

slide-37
SLIDE 37

Where is Michael ?

slide-38
SLIDE 38

Where is Michael ?

If at most 4 answers are faulty, then Jens is somewhere here

slide-39
SLIDE 39

Faulty-Memory RAM Model

Finocchi and Italiano, STOC’04

  • Content of memory cells can get corrupted
  • Corrupted and uncorrupted content cannot be distinguished
  • O(1) safe registers
  • Assumption: At most δ corruptions
slide-40
SLIDE 40

4 7 10 13 14 15 16 18 19 23 8 26 27 29 30 31 32 33 34 36 38

Faulty-Memory RAM: Searching

16?

Problem?

High confidence Low confidence

slide-41
SLIDE 41

Faulty-Memory RAM: Searching

When are we done (δ=3)?

Contradiction, i.e. at least one fault

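If the range is supported by at least δ+1 comparisons on each side (δ+1 elements compared smaller than x and δ+1 compared larger), then at least one comparison on each side is uncorrupted, i.e. x must be contained in the range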

slide-42
SLIDE 42

Faulty-Memory RAM: Θ(log N + δ) Searching

Brodal, Fagerberg, Finocchi, Grandoni, Italiano, Jørgensen, Moruz, Mølhave, ESA’07

If verification fails
→ contradiction, i.e. ≥1 memory-fault
→ ignore 4 last comparisons
→ backtrack one level of search

[Figure: search steps numbered 1 to 5]

slide-43
SLIDE 43
slide-44
SLIDE 44

Flood Prediction Important

  • Predicting areas susceptible to floods

– Due to e.g. rising sea level or heavy rainfall

  • Example: Hurricane Floyd, Sep. 15, 1999

[Figure: images at 7 am and 3 pm]

slide-45
SLIDE 45

Detailed Terrain Data Essential

[Figure: Mandø with a 2 meter sea-level rise, computed on an 80 meter terrain model and on a 2 meter terrain model]

slide-46
SLIDE 46

Surface Flow Modeling

  • Conceptually flow is modeled using two basic attributes

– Flow direction: The direction water flows at a point
– Flow accumulation: Amount of water flowing through a point

[Figure: Hurricane Floyd (September 15, 1999), images at 7 am and 3 pm]

slide-47
SLIDE 47
Flow Accumulation

  • Flow accumulation on grid terrain model:

– Initially one unit of water in each grid cell
– Water (initial and received) distributed from each cell to lowest lower neighbor cell (if existing)
– Flow accumulation of cell is total flow through it

  • Note:

– Flow accumulation of cell = size of “upstream area”
– Drainage network = cells with high flow accumulation
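The flow accumulation rule can be sketched in a few lines for a grid that fits in memory. This is an illustration only: it assumes each cell has up to 8 neighbors and simply sorts all cells by height in memory, which is precisely what does not scale to massive terrains. The name `flow_accumulation` and the sample grid are made up here.

```python
def flow_accumulation(height):
    """Route one unit of water per cell to the lowest lower neighbor."""
    rows, cols = len(height), len(height[0])
    flow = [[1] * cols for _ in range(rows)]  # one unit of water in each cell
    # Visit cells from highest to lowest, so that all water a cell receives
    # has arrived before the cell passes its total on.
    cells = sorted(((height[r][c], r, c)
                    for r in range(rows) for c in range(cols)), reverse=True)
    for h, r, c in cells:
        best = None  # lowest neighbor that is strictly lower than this cell
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                nr, nc = r + dr, c + dc
                if (dr or dc) and 0 <= nr < rows and 0 <= nc < cols:
                    if height[nr][nc] < h and (
                            best is None
                            or height[nr][nc] < height[best[0]][best[1]]):
                        best = (nr, nc)
        if best is not None:
            flow[best[0]][best[1]] += flow[r][c]
    return flow


terrain = [[9, 8, 7],
           [6, 5, 4],
           [3, 2, 1]]
for row in flow_accumulation(terrain):
    print(row)
```

On this sample terrain all water ends in the lowest corner, whose flow accumulation equals the number of cells, i.e. the size of its upstream area.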

slide-48
SLIDE 48

Massive Data Problems

  • Commercial systems:

– Often very slow
– Performance somewhat unpredictable
– Cannot handle 2-meter Denmark model

  • Collaboration with environmental researchers in the late 1990s

– US Appalachian mountains dataset

  • 800x800 km at 100 m resolution → a few gigabytes
  • Customized software on ½ GB machine: 14 days!!
  • Appalachian dataset would be terabytes in size at 1 m resolution!

slide-49
SLIDE 49

Flood Modeling

  • Not all terrain below height h is flooded when water rises to h meters!
  • Theoretically not too hard to compute area flooded when water rises to h meters

– But no software can do it for Denmark at 2-meter resolution

  • Use of I/O-efficient algorithms

→ Denmark in a few days

  • Even compute new terrain where terrain below h is flooded when water rises to h
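Why low terrain is not automatically flooded can be seen with a small in-memory sketch: a cell is flooded only if it lies below h and is connected to the water source through other cells below h. The code assumes a grid model, 4-neighbor connectivity and an explicit list of source (sea) cells. These choices, the name `flooded_cells` and the sample grid are assumptions made for illustration, not the method used in the talk.

```python
from collections import deque


def flooded_cells(height, sources, h):
    """Cells below h that are reachable from the sources via cells below h."""
    rows, cols = len(height), len(height[0])
    flooded = set()
    queue = deque()
    for r, c in sources:
        if height[r][c] < h and (r, c) not in flooded:
            flooded.add((r, c))
            queue.append((r, c))
    while queue:  # breadth-first search from the sea
        r, c = queue.popleft()
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if (0 <= nr < rows and 0 <= nc < cols
                    and (nr, nc) not in flooded and height[nr][nc] < h):
                flooded.add((nr, nc))
                queue.append((nr, nc))
    return flooded


# Column 0 is sea; column 2 is a ridge protecting the low land in column 3.
terrain = [[0, 1, 5, 1],
           [0, 1, 5, 1],
           [0, 1, 5, 1]]
sea = [(0, 0), (1, 0), (2, 0)]
wet = flooded_cells(terrain, sea, h=2)
print(len(wet), (0, 3) in wet)
```

The cells in column 3 lie below h = 2 but stay dry because the ridge separates them from the sea. Raising h above 5 floods the whole grid.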

slide-50
SLIDE 50

Interdisciplinary Collaboration

  • Flow/flooding work is an example of the center's interdisciplinary collaboration

– Flood modeling and efficient algorithms in biodiversity modeling
– Allow for use of global and/or detailed geographic data

  • Brings together researchers from

– Biodiversity
– Ecoinformatics
– Algorithms
– Datamining
– …

slide-51
SLIDE 51

TerraSTREAM

  • Flow/flooding work part of comprehensive software package

– TerraSTREAM: Whole pipeline of terrain data processing software

  • TerraSTREAM used on full 2 meter Denmark model (~25 billion points, ~1.5 TB)

– Terrain model (grid) from LIDAR point data
– Surface flow modeling: Flow directions and flow accumulation
– Flood modeling

slide-52
SLIDE 52

Summary

  • Basic research center in Aarhus

– Organization
– PhD education

  • Examples of research

– Theoretical external memory algorithmics
– Practical (flow simulation)

slide-53
SLIDE 53

Tau ∙ Jërë-jëf ∙ Tashakkur ∙ S.aHHa ∙ Sag olun Giihtu ∙ Djakujo ∙ Dâkujem vám ∙ Thank you Tesekkür ederim ∙ To-siä ∙ Merci ∙ Tashakur Taing ∙ Dankon ∙ Efharisto´ ∙ Shukriya ∙ Kiitos Dhanyabad ∙ Rakhmat ∙ Trugarez ∙ Asante Köszönöm ∙ Blagodarya ∙ Dziekuje ∙ Eskerrik asko Grazie ∙ Tak ∙ Bayarlaa ∙ Miigwech ∙ Dank u Spasibo ∙ Dêkuji vám ∙ Ngiyabonga ∙ Dziakuj Obrigado ∙ Gracias ∙ A dank aych ∙ Salamat Takk ∙ Arigatou ∙ Tack ∙ Tänan ∙ Aciu Korp kun kah ∙ Multumesk ∙ Terima kasih ∙ Danke Rahmat ∙ Gratias ∙ Mahalo ∙ Dhanyavaad Paldies ∙ Faleminderit ∙ Diolch ∙ Hvala Kam-sa-ham-ni-da ∙ Xìe xìe ∙ Mèrcie ∙ Dankie

Gerth Stølting Brodal gerth@cs.au.dk