CUDA-Based Implementation of GSLIB: The Geostatistical Software Library (PowerPoint PPT Presentation)



SLIDE 1

CUDA-Based Implementation of GSLIB: The Geostatistical Software Library

Daniel Baeza (dabaeza@alges.cl)

Oscar Peredo (operedo@alges.cl)
SLIDE 2

The Mining Process

Exploration → Evaluation → Planning → Operation
SLIDE 3

GSLIB: The Geostatistical Software Library

SLIDE 4

GSLIB: The Geostatistical Software Library

  • GSLIB is a software package composed of a set of utilities and applications related to geostatistics
  • Fully implemented in Fortran 77/90
  • Runs on OS X, Linux and Windows
  • Widely used by academics, researchers, and engineers

Deutsch, Clayton V. and Journel, André G., GSLIB: Geostatistical Software Library and User's Guide (1998)
SLIDE 5

Variogram calculation with GSLIB

  • gamv is the GSLIB variogram calculation program
  • It is a fundamental tool in geostatistics
  • It quantifies the spatial variability of a variable
  • Used in geostatistical estimation and simulation
  • High computational cost

[Figure: normal-scores semivariograms, γ vs. distance (0 to 500), for a low-solubility and a high-solubility variable, γ ranging from 0.00 to 1.20]
SLIDE 6

Variogram calculation

Z(u) and Z(u + h) are sample values at two points of the 3D domain separated by the lag vector h, with ||h|| = h:

2γ(h) = (1 / N(h)) Σ [z(u) − z(u + h)]²

where N(h) is the number of pairs separated by h. Larger γ(h) means more variability; smaller γ(h) means less variability.
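As a minimal numeric illustration of this formula (not GSLIB code), the experimental semivariogram of a 1D series at a given lag can be computed directly:

```python
def semivariogram(z, lag):
    # experimental semivariogram: gamma(h) = (1 / (2 * N(h))) * sum of squared differences
    pairs = [(z[i] - z[i + lag]) ** 2 for i in range(len(z) - lag)]
    return sum(pairs) / (2 * len(pairs))

z = [0.0, 1.0, 0.0, 1.0, 0.0]
print(semivariogram(z, 1))  # 4 pairs, each squared difference 1.0 -> 0.5
```

An alternating series has maximal variability at lag 1 (γ = 0.5) and none at lag 2 (γ = 0.0), matching the "more variability / less variability" reading of the curve.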

SLIDE 7

Variogram computation

  • Sequential & Parallel Implementations
SLIDE 8

Input:

  • (VR, Ω): sample data values VR (m columns) defined on a 3D domain of coordinates Ω
  • nvar: number of variables (nvar ≤ m)
  • nlag: number of lags
  • h: lag separation distance
  • ndir: number of directions
  • h_1, …, h_ndir: directions
  • τ_1, …, τ_ndir: geometrical tolerance parameters
  • nvarg: number of variograms
  • (ivtype_1, ivtail_1, ivhead_1), …, (ivtype_nvarg, ivtail_nvarg, ivhead_nvarg): variogram types

1  Read input parameter file
2  Read sample data values file
3  β ← zeros(nvar × nlag × ndir × nvarg)
4  for i ∈ {1, …, |Ω|} do
5      for j ∈ {i, …, |Ω|} do
6          for id ∈ {1, …, ndir} do
7              for iv ∈ {1, …, nvarg} do
8                  for il ∈ {1, …, nlag} do
9                      p_i = (x_i, y_i, z_i) ∈ Ω
10                     p_j = (x_j, y_j, z_j) ∈ Ω
11                     if (p_i, p_j) satisfy tolerances τ_id and ||p_i − p_j|| ≈ h_id × il × h then
12                         save (VR_i,ivhead_iv or VR_i,ivtail_iv) and (VR_j,ivhead_iv or VR_j,ivtail_iv) according to variogram type ivtype_iv into β
13 γ ← build variogram using statistics β; write γ to the output file

Output: output file with γ values

Sequential implementation

Setup parameters & load data → main computation (loop over pairs of points) → read computation results
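The loop structure above can be sketched in a few lines; this is a simplified omnidirectional Python sketch (directional tolerances and the multiple variogram types of gamv are omitted, and `gamv_sketch` is an illustrative name, not a GSLIB routine):

```python
import math

def gamv_sketch(points, values, nlag, h, tol):
    """Omnidirectional sketch of the gamv pair loop: accumulate squared
    differences per lag bin over all pairs (i, j) with j >= i, then average."""
    acc = [0.0] * nlag   # sum of squared differences per lag (the beta statistics)
    cnt = [0] * nlag     # number of admissible pairs per lag, N(h)
    n = len(points)
    for i in range(n):
        for j in range(i, n):
            d = math.dist(points[i], points[j])
            for il in range(1, nlag + 1):
                if abs(d - il * h) <= tol:   # lag-distance tolerance check
                    acc[il - 1] += (values[i] - values[j]) ** 2
                    cnt[il - 1] += 1
    return [a / (2 * c) if c else 0.0 for a, c in zip(acc, cnt)]
```

The double loop over (i, j) pairs is the O(|Ω|²) part that dominates the cost and is the target of the GPU parallelization.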

SLIDE 9 - SLIDE 10

Sequential implementation (same algorithm as slide 8)
SLIDE 11
Sequential implementation (same algorithm as slide 8); the innermost operation highlighted for parallelization:

compute statistics(p_i, p_j) → store result in shared memory
SLIDE 12
Sequential implementation (same algorithm as slide 8), with the pair loop split into chunks processed as STEP 1, STEP 2, STEP 3, …, STEP N:

compute statistics(p_i, p_j) → store result in shared memory
SLIDE 13

Parallel implementation

1  idx = blockIdx.x * blockDim.x + threadIdx.x    /* x thread coordinate in the GPU grid */
2  idy = blockIdx.y * blockDim.y + threadIdx.y    /* y thread coordinate in the GPU grid */
3  set and initialize shared memory in the block
4  __syncthreads()
5  iterx = idx
6  itery = idy
7  while (iterx, itery) ∈ ChunkPoints do          /* chunk of points belonging to thread (x, y) */
9      j = iterx + |Ω|/2
10     i = itery
11     compute statistics(p_i, p_j)
12     store result in shared memory via atomic functions
13     if iterx > itery then
14         i = itery
15         j = iterx
16         compute statistics(p_i, p_j)
17         store result in shared memory via atomic functions
18     else if iterx == itery then
20         i = itery
21         j = itery
22         compute statistics(p_i, p_j)
23         store result in shared memory via atomic functions
24         i = iterx + |Ω|/2
25         j = itery + |Ω|/2
26         compute statistics(p_i, p_j)
27         store result in shared memory via atomic functions
28     update(iterx, itery)
29 __syncthreads()
30 save statistics values from shared memory into global memory via atomic functions

Output: array β

GPU kernel: each thread computes only a couple of correlation values per step.
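The kernel's two-level reduction (per-block accumulation in shared memory, then atomic merges into global memory) can be mimicked on the CPU. A minimal Python sketch, illustrative only: `pair_statistics` and the round-robin chunking are stand-ins, not the actual GSLIB/CUDA code:

```python
def pair_statistics(values, i, j):
    # per-pair statistic: squared difference, as in the semivariogram sums
    return (values[i] - values[j]) ** 2

def partitioned_pair_sum(values, nworkers):
    """Each worker accumulates its chunk of pairs into a local accumulator
    (the shared-memory role); the local sums are merged at the end
    (the role played by atomic additions into global memory)."""
    n = len(values)
    pairs = [(i, j) for i in range(n) for j in range(i, n)]
    local = [0.0] * nworkers
    for k, (i, j) in enumerate(pairs):
        local[k % nworkers] += pair_statistics(values, i, j)  # round-robin chunking
    return sum(local)  # final merge of per-worker partial results
```

Whatever the partition, the merged result equals the sequential sum, which is why the CUDA and standard versions produce the same variograms.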
SLIDE 14 - SLIDE 18

Parallel implementation (same GPU kernel as slide 13, shown step by step)

[Figure: scheduling of threads 0-3 over the chunked pair matrix across STEP 1 … STEP 4]
SLIDE 19

Results

SLIDE 20

[Figure: omni-directional semivariograms, γ vs. distance, Standard version vs. CUDA version, for a 10x10x120 domain and a 510x510x1 domain]
SLIDE 21

Standard Version

Number of points   Seconds
5000               0.8
12000              4.68
65000              9.52
261000             143.2
500000             518.69
800000             1330
1000000            2095

  • Intel Xeon E5
  • 3.3 GHz CPU frequency
  • 10 MB cache
  • 16 GB RAM (1.6 GHz)
  • 1 TB HDD
  • Linux

Dell Precision T7600

SLIDE 22

CUDA version

Number of points   Seconds
5000               0.25
12000              0.53
65000              0.61
261000             3.62
500000             12.28
800000             28.45
1000000            43.2

  • 448 CUDA cores
  • 1.15 GHz CUDA core frequency
  • 6 GB RAM
  • 114 GB/s memory bandwidth

Tesla C2075
SLIDE 23

CUDA version vs. Standard version

Number of points   CUDA (s)   Standard (s)
5000               0.25       0.8
12000              0.53       4.68
65000              0.61       9.52
261000             3.62       143.2
500000             12.28      518.69
800000             28.45      1330
1000000            43.2       2095
SLIDE 24

SpeedUp

Number of points   Speedup
5000               3.2
12000              8.8
65000              15.6
261000             39.6
500000             42.2
800000             46.7
1000000            48.5

Largest case: CPU (sequential) 34 min vs. GPU (parallel) 43 s, i.e. 1x vs. 48x.
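The reported speedups are simply the ratios of the two timing tables (assuming each point count is paired with its measured time in ascending order):

```python
# timings in seconds, paired with point counts 5000 ... 1000000 in ascending order
cpu = [0.8, 4.68, 9.52, 143.2, 518.69, 1330, 2095]   # Standard (sequential) version
gpu = [0.25, 0.53, 0.61, 3.62, 12.28, 28.45, 43.2]   # CUDA version
speedups = [round(c / g, 1) for c, g in zip(cpu, gpu)]
print(speedups)  # [3.2, 8.8, 15.6, 39.6, 42.2, 46.7, 48.5]
```

The speedup grows with problem size, approaching 48x for the million-point case, since larger pair counts keep the GPU's threads saturated.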

SLIDE 25

Current and future work

SLIDE 26
  • Finish the CUDA versions of indicator simulation and Gaussian simulation from GSLIB
  • Release the first GSLIB-CUDA version with these three methods
SLIDE 27

Thanks! Questions?