Out-of-Core Proximity Computation for Particle-based Fluid Simulation


SLIDE 1

Out-of-Core Proximity Computation for Particle-based Fluid Simulation

Duksu Kim 1 (presenter), Myung-Bae Son 2, Young J. Kim 3, Jeong-Mo Hong 4, Sung-Eui Yoon 2

1 KISTI (Korea Institute of Science and Technology Information)
2 KAIST (Korea Advanced Institute of Science and Technology)
3 Ewha Womans University, Korea
4 Dongguk University, Korea

SLIDE 2

Particle-based Fluid Simulation

SLIDE 3

Motivation

  • To achieve higher realism, a large number of particles is required
    – Tens of millions of particles
  • In-core algorithms (prior work)
    – Manage all data in the GPU's video memory
    – Can handle up to 5 M particles with 1 GB of memory for particle-based fluid simulation
  • Recent commodity GPUs have 1–3 GB of memory (up to 12 GB)

SLIDE 4

Contributions

  • Propose out-of-core methods that utilize heterogeneous computing resources to process neighbor search for a large number of particles
  • Propose a memory footprint estimation method to identify a maximal work unit for efficient out-of-core processing

SLIDE 5

Results (preview)

  • Ours vs. Map-GPU
    – Map-GPU: NVIDIA mapped memory technique, which maps CPU memory space into the GPU memory address space
  • Ours handles up to 65.6 M particles; maximum data size: 13 GB
  • Test machine: two hexa-core CPUs (192 GB memory) and one GPU (3 GB memory)
SLIDE 6

Particle-based Fluid Simulation

Neighbor search → Compute force → Move particles

SLIDE 7

Particle-based Fluid Simulation

Neighbor search → Compute force → Move particles

  • Neighbor search (ε-Nearest Neighbor, ε-NN) is the performance bottleneck
    – Takes 60–80% of the simulation computation time

SLIDE 8

Preliminary: Grid-based ε-NN

  • Build a uniform grid over the particles with cell size l, where ε < l

SLIDE 9

Preliminary: Grid-based ε-NN

  • Since ε < l, the ε-neighbors of a particle lie in its own cell and the adjacent cells
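The grid construction and query above can be sketched in Python; `build_grid` and `epsilon_nn` are hypothetical names for illustration, a minimal sketch assuming ε < l so that one ring of adjacent cells suffices.

```python
import math
from collections import defaultdict

def build_grid(points, l):
    """Hash each particle index into a uniform grid with cell size l."""
    grid = defaultdict(list)
    for i, (x, y, z) in enumerate(points):
        grid[(int(x // l), int(y // l), int(z // l))].append(i)
    return grid

def epsilon_nn(points, grid, l, eps, i):
    """Neighbors of particle i within eps: because eps < l, scanning the
    particle's own cell and the 26 adjacent cells is sufficient."""
    x, y, z = points[i]
    cx, cy, cz = int(x // l), int(y // l), int(z // l)
    neighbors = []
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            for dz in (-1, 0, 1):
                for j in grid.get((cx + dx, cy + dy, cz + dz), ()):
                    if j != i and math.dist(points[i], points[j]) <= eps:
                        neighbors.append(j)
    return neighbors
```

A cell size close to ε keeps the scanned volume (27 cells) tight around the true search sphere.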

SLIDE 10

In-Core Algorithm (Data < Video Memory)

  • The grid data and particle data are copied from main memory (CPU side) to the GPU's video memory, ε-NN runs on the GPU, and the results are copied back
  • Assume main memory is large enough
    – A machine can be equipped with up to 4 TB
SLIDE 11

Data > Video Memory

  • The grid data and particle data no longer fit in the GPU's video memory, so the in-core approach fails

SLIDE 12

Out-of-Core Algorithm

  • Divide the grid into sub-grids (blocks)
  • Transfer one block's sub-grid and particle data from main memory (CPU side) to the GPU's video memory, run ε-NN, and copy the results back

SLIDE 13

Boundary Region

  • Requires data from adjacent blocks
  • Inefficient to handle in an out-of-core manner
SLIDE 14

Boundary Region

  • Requires data from adjacent blocks
  • Inefficient to handle in an out-of-core manner
  • Multi-core CPUs handle the boundary region
    – CPU (main) memory contains all the required data
    – The boundary region is usually much smaller than the inner regions
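The inner/boundary split above can be sketched as follows; `split_block` is a hypothetical helper that tags the cells on a block's faces as boundary cells (handled by the CPUs, which hold the whole grid in main memory) and the rest as inner cells (handled by the GPU).

```python
def split_block(block_lo, block_hi):
    """Split a block's cells into inner cells (all neighbor cells lie
    inside the block) and boundary cells (need data from adjacent
    blocks). block_lo/block_hi are inclusive cell-index bounds."""
    inner, boundary = [], []
    (x0, y0, z0), (x1, y1, z1) = block_lo, block_hi
    for cx in range(x0, x1 + 1):
        for cy in range(y0, y1 + 1):
            for cz in range(z0, z1 + 1):
                on_face = cx in (x0, x1) or cy in (y0, y1) or cz in (z0, z1)
                (boundary if on_face else inner).append((cx, cy, cz))
    return inner, boundary
```

For an n³ block the boundary is O(n²) cells while the inner region is O(n³), which is why the boundary workload stays comparatively small for large blocks.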

SLIDE 15

How to Divide the Grid?

SLIDE 16

How to Divide the Grid?

  • Goal: find the largest block that fits in the GPU memory
    – Improves parallel-computing efficiency
      • Processes a large number of particles at once
      • Minimizes the data transfer overhead
    – Reduces the boundary region
      • As the ratio of the boundary region increases, the workload of the CPU increases

SLIDE 17

Required Memory Size for Processing a Block B

S(B) = n_B * S_p + S_n * Σ_{p_i ∈ B} n(p_i)

  • S_p : data size for storing a particle
  • S_n : data size for storing one neighbor record
  • n_B : number of particles in B
  • n(p_i) : number of neighbor particles of particle i (p_i)
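The memory-size formula can be written directly as code; the function name and the choice of bytes as the unit are illustrative, not from the paper.

```python
def required_memory(num_particles, neighbor_counts, s_p, s_n):
    """S(B) = n_B * S_p + S_n * sum over p_i in B of n(p_i):
    storage for the particles themselves plus storage for every
    neighbor record the search will produce."""
    return num_particles * s_p + s_n * sum(neighbor_counts)
```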

SLIDE 18

Hierarchical Work Distribution

  • Build a workload tree over the grid; each node (block) stores
    – the number of particles in the block
    – the number of neighbors in the block
  • Front nodes: the largest blocks satisfying S(B) < GPU memory
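One way to realize the front-node selection is a top-down traversal that stops at the first node whose footprint fits in GPU memory; the `(block, children)` tuple encoding and the `collect_front` name are assumptions of this sketch, not the paper's data structure.

```python
def collect_front(node, fits, front=None):
    """Walk the workload tree top-down and collect the front: the
    largest blocks whose estimated footprint S(B) fits in GPU memory.
    A node is (block, children); leaves have no children (in practice
    a non-fitting leaf would be handed to the CPU side)."""
    if front is None:
        front = []
    block, children = node
    if fits(block) or not children:   # largest fitting block, or a leaf
        front.append(block)
    else:                             # too big: descend to sub-blocks
        for child in children:
            collect_front(child, fits, front)
    return front
```

Stopping as high in the tree as possible maximizes block size, which both improves GPU utilization and shrinks the total boundary region.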
SLIDE 19

Chicken-and-Egg Problem

S(B) = n_B * S_p + S_n * Σ_{p_i ∈ B} n(p_i)

  • n(p_i), the number of neighbor particles of particle i (p_i), is known only after the neighbor search has run, yet S(B) must be known before the search can be scheduled

SLIDE 20

Chicken-and-Egg Problem

S(B) = n_B * S_p + S_n * Σ_{p_i ∈ B} n(p_i)

  • Our approach: estimate the number of neighbors for the particles

SLIDE 21

Problem Formulation

  • Assumption
    – Particles are uniformly distributed in a cell
  • Idea
    – For a particle p, the number of neighbors in a cell is proportional to the overlap volume between the search sphere S(p, ε) and the cell, weighted by the number of particles in the cell

SLIDE 22

Expected Number of Neighbors of a Particle p Located at (x, y, z)

F(p_{x,y,z}) = Σ_j n(C_j) * Overlap(S(p_{x,y,z}, ε), C_j) / V(C_j)

  • C_j : the cell containing p_{x,y,z} and its adjacent cells
  • n(C_j) : the number of particles in the cell
  • Overlap(S(p_{x,y,z}, ε), C_j) : the overlap volume between the search sphere and the cell
  • V(C_j) : the volume of the cell
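A sketch of this formula in code; since the exact sphere–box overlap volume is tedious in closed form, this illustration estimates Overlap(·) itself by Monte Carlo sampling. All function names here are hypothetical.

```python
import random

def overlap_volume(center, eps, cell_lo, cell_hi, samples=20000, rng=None):
    """Monte Carlo estimate of Overlap(S(p, eps), C_j): the volume of the
    cell that lies inside the search sphere around p."""
    rng = rng or random.Random(0)
    cell_vol = 1.0
    for lo, hi in zip(cell_lo, cell_hi):
        cell_vol *= hi - lo
    hits = 0
    for _ in range(samples):
        q = [rng.uniform(lo, hi) for lo, hi in zip(cell_lo, cell_hi)]
        if sum((a - b) ** 2 for a, b in zip(q, center)) <= eps * eps:
            hits += 1
    return cell_vol * hits / samples

def expected_neighbors(p, eps, cells):
    """F(p) = sum_j n(C_j) * Overlap(S(p, eps), C_j) / V(C_j); cells is a
    list of (cell_lo, cell_hi, particle_count) for p's cell and its
    adjacent cells."""
    total = 0.0
    for lo, hi, count in cells:
        vol = 1.0
        for a, b in zip(lo, hi):
            vol *= b - a
        total += count * overlap_volume(p, eps, lo, hi) / vol
    return total
```

Under the uniform-distribution assumption, each cell contributes its particle count scaled by the fraction of the cell covered by the search sphere.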

SLIDE 23

Problem Formulation

  • Computing F(p_{x,y,z}) for every particle incurs a high computational overhead
  • Instead (approximation):
    – Compute the average F(p_{x,y,z}) over the particles in a cell
    – Use that value for all particles in the cell

SLIDE 24

The Average Expected Number of Neighbors of Particles in a Cell C_r

F(C_r) = (1 / V(C_r)) ∫_0^l ∫_0^l ∫_0^l F(p_{x,y,z}) dx dy dz

  • l is the length of a cell along each dimension
  • p_{x,y,z} is a particle positioned at (x, y, z) in the local coordinate space of C_r
  • Expensive to compute at runtime

SLIDE 25

The Average Expected Number of Neighbors of Particles in a Cell C_r

F(C_r) = (1 / V(C_r)) ∫_0^l ∫_0^l ∫_0^l F(p_{x,y,z}) dx dy dz
       = (1 / V(C_r)) * Σ_j (n(C_j) / V(C_j)) * E(C_r, C_j)

where

E(C_r, C_j) = ∫_0^l ∫_0^l ∫_0^l Overlap(S(p_{x,y,z}, ε), C_j) dx dy dz

SLIDE 26

The Average Expected Number of Neighbors of Particles in a Cell C_r

E(C_r, C_j) = ∫_0^l ∫_0^l ∫_0^l Overlap(S(p_{x,y,z}, ε), C_j) dx dy dz

  • Pre-compute E(C_r, C_j)
    – The value depends only on the ratio between the l and ε values
    – l and ε are not frequently changed by the user
    – Use the Monte Carlo method with many samples (e.g., 1 M)
  • Use a look-up table at runtime
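The precomputation step might look as follows: because E(C_r, C_j) depends only on the offset between the two cells (and on the ε/l ratio), a 27-entry table over cell offsets suffices. This is a sketch under that assumption, not the paper's implementation; `precompute_E` is a hypothetical name.

```python
import random

def precompute_E(l, eps, samples=100000, seed=0):
    """Monte Carlo precomputation of E(C_r, C_j) for the 27 cell offsets:
    E = integral over p in C_r of Overlap(S(p, eps), C_j)
      = V(C_r) * V(C_j) * P(|q - p| <= eps) for p, q uniform in the cells.
    Returns a dict mapping offset (ox, oy, oz) -> E value."""
    rng = random.Random(seed)
    offsets = [(ox, oy, oz) for ox in (-1, 0, 1)
                            for oy in (-1, 0, 1)
                            for oz in (-1, 0, 1)]
    table = {o: 0 for o in offsets}
    cell_vol = l ** 3
    for _ in range(samples):
        p = [rng.uniform(0, l) for _ in range(3)]
        q = [rng.uniform(0, l) for _ in range(3)]
        for (ox, oy, oz) in offsets:
            # shift q into the offset cell and test it against the sphere
            d2 = ((q[0] + ox * l - p[0]) ** 2 +
                  (q[1] + oy * l - p[1]) ** 2 +
                  (q[2] + oz * l - p[2]) ** 2)
            if d2 <= eps * eps:
                table[(ox, oy, oz)] += 1
    return {o: (hits / samples) * cell_vol * cell_vol
            for o, hits in table.items()}
```

A handy sanity check: when ε < l, summing the 27 entries and dividing by V(C_r) recovers the search-sphere volume (4/3)πε³.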

SLIDE 27

Validation

  • Correlation = 0.97
  • Root Mean Square Error (RMSE) = 3.7
SLIDE 28

Chicken-and-Egg Problem

S(B) = n_B * S_p + S_n * Σ_{p_i ∈ B} n'(p_i) + S_Aux

  • n'(p_i) : the expected number of neighbors of particle i
  • S_Aux : auxiliary space to cover the estimation error
    – S_Aux = 3.7 * n_B * S_n, where 3.7 is the RMSE of the estimate
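Plugging the estimate into the footprint formula, with the auxiliary term sized by the reported RMSE of 3.7; the function name and byte units are illustrative.

```python
RMSE = 3.7  # root mean square error of the neighbor-count estimate

def estimated_memory(num_particles, expected_neighbor_counts, s_p, s_n):
    """S(B) = n_B*S_p + S_n*sum(n'(p_i)) + S_Aux, where the auxiliary
    space S_Aux = RMSE * n_B * S_n absorbs the estimation error."""
    s_aux = RMSE * num_particles * s_n
    return (num_particles * s_p
            + s_n * sum(expected_neighbor_counts)
            + s_aux)
```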

SLIDE 29

Chicken-and-Egg Problem

S(B) = n_B * S_p + S_n * Σ_{p_i ∈ B} n'(p_i) + S_Aux

SLIDE 30

Results

  • Testing environment
    – Two hexa-core CPUs
    – 192 GB main memory (CPU side)
    – One GPU (GeForce GTX 780) with 3 GB video memory

SLIDE 31

Results

  • Ours vs. Map-GPU
    – Map-GPU: NVIDIA mapped memory technique, which maps CPU memory space into the GPU memory address space
  • Ours handles up to 65.6 M particles; maximum data size: 13 GB

SLIDE 32

  • Up to 32.7 M particles; maximum data size: 16 GB
  • 15.8 M particles; maximum data size: 6 GB

SLIDE 33

Results

[Chart: speedups of our method (12 CPU cores + one GPU) against a single CPU core, 12 CPU cores, and Map-GPU; up to 26X, 51X, 8.4X, and 6.3X]

SLIDE 34

Conclusion

  • Proposed an out-of-core ε-NN algorithm for particle-based fluid simulation
    – Utilizes heterogeneous computing resources
    – Utilizes GPUs in an out-of-core manner
    – Includes a hierarchical work distribution method

SLIDE 35

Conclusion

  • Proposed an out-of-core ε-NN algorithm for particle-based fluid simulation
  • Presented a novel memory estimation method
    – Based on the expected number of neighbors

SLIDE 36

Conclusion

  • Proposed an out-of-core ε-NN algorithm for particle-based fluid simulation
  • Presented a novel memory estimation method
  • Handled a large number of particles
  • Achieved much higher performance than a naive OOC-GPU approach

SLIDE 37

Future Work

  • Extend to support multiple GPUs
  • Improve parallelization efficiency by employing an optimization-based approach
  • Extend to other applications

SLIDE 38

Thanks!

Any questions?

(bluekdct@gmail.com)
Project homepage: http://sglab.kaist.ac.kr/OOCNNS

  • Benchmark scenes are available on the project homepage
SLIDE 39

Benefits of Our Memory Estimation Model

  • Fixed space vs. ours

SLIDE 40

Benefits of Hierarchical Workload Distribution

  • A larger block size shows better performance
    – E.g., using 32³ and 64³ block sizes takes 22% and 30% less processing time on the GPU than using 16³ blocks, on average

SLIDE 41

Benefits of Hierarchical Workload Distribution

  • But the maximal block size varies depending on the benchmark and the region of the scene
  • Compared with a manually set, fixed block size based on our estimation model, the hierarchical approach shows 33% higher performance on average