I want to increase the number of particles in my calculation. So far I have been able to compute 1 million particles on a single GPU. Is it possible to increase that to 2 million particles by using multiple GPUs?
Comments
Mehdi Gholam 28-Aug-13 0:15am    
It depends on your design, coding and hardware capabilities.
The_Inventor 4-Sep-13 5:47am    
I say, "GO4IT" !!

1 solution

If you can parallelize an algorithm and map it onto one GPU, then mapping it onto multiple GPUs is a very similar task.

The simplest approach is to allocate half of the work to GPU 1 and the other half to GPU 2, using one stream per device so the two GPUs' timelines overlap. If the GPUs are equally performant, this cuts the total compute time roughly in half.
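A minimal sketch of that split in CUDA, assuming a hypothetical `step` kernel that updates a range of particles (error checking omitted for brevity):

```cpp
#include <cuda_runtime.h>

// Placeholder kernel: updates particles[i] for i in [offset, offset + count).
__global__ void step(float4* p, int offset, int count)
{
    int i = offset + blockIdx.x * blockDim.x + threadIdx.x;
    if (i < offset + count) p[i].x += 0.001f * p[i].w;   // dummy update
}

void stepOnTwoGpus(float4* dev0, float4* dev1, int n)
{
    int half = n / 2;
    cudaStream_t s0, s1;

    cudaSetDevice(0);                 // following calls target GPU 0
    cudaStreamCreate(&s0);
    step<<<(half + 255) / 256, 256, 0, s0>>>(dev0, 0, half);

    cudaSetDevice(1);                 // switch to GPU 1
    cudaStreamCreate(&s1);
    step<<<(n - half + 255) / 256, 256, 0, s1>>>(dev1, half, n - half);

    // Launches are asynchronous, so both GPUs run concurrently;
    // synchronize each stream before touching the results.
    cudaSetDevice(0); cudaStreamSynchronize(s0); cudaStreamDestroy(s0);
    cudaSetDevice(1); cudaStreamSynchronize(s1); cudaStreamDestroy(s1);
}
```

Here `dev0` and `dev1` are buffers allocated on GPU 0 and GPU 1 respectively; because launches are asynchronous, the second kernel is issued before the first one finishes, which is what overlaps the two timelines.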

If the GPUs are different, you will need to work out what percentage of the work each GPU should get, but that is not too hard to solve.

Depending on how you choose to divide and conquer, the kernels on the two GPUs may be identical or different. The input data could be the same while the outputs differ. Memory visibility also gets easier if you use unified memory.
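A minimal sketch of the unified-memory route (same placeholder `step` kernel as above; note that on pre-Pascal cards like Kepler the driver may fall back to slower zero-copy placement for multi-GPU managed memory, so measure before committing to it):

```cpp
#include <cuda_runtime.h>

// Same placeholder kernel as in the sketch above.
__global__ void step(float4* p, int offset, int count)
{
    int i = offset + blockIdx.x * blockDim.x + threadIdx.x;
    if (i < offset + count) p[i].x += 0.001f * p[i].w;
}

int main()
{
    const int n = 2000000;
    float4* particles = nullptr;
    // One managed allocation, visible to the host and to every GPU:
    cudaMallocManaged(&particles, n * sizeof(float4));

    cudaSetDevice(0);
    step<<<(n / 2 + 255) / 256, 256>>>(particles, 0, n / 2);              // first half
    cudaSetDevice(1);
    step<<<(n - n / 2 + 255) / 256, 256>>>(particles, n / 2, n - n / 2);  // second half

    cudaDeviceSynchronize();          // wait for GPU 1
    cudaSetDevice(0);
    cudaDeviceSynchronize();          // wait for GPU 0

    cudaFree(particles);
    return 0;
}
```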

For example, I wrote a brute-force n-body kernel (for 64k particles) to run on two Quadro K420 (cc 3.0) GPUs and got 405 GFLOPS out of them (60% of peak).
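Not the poster's actual code, but a bare-bones version of such a brute-force kernel looks like this (xyz holds the position, w the mass; the softening term keeps r from ever reaching zero):

```cpp
#include <cuda_runtime.h>

// Each thread accumulates the gravitational acceleration on one particle
// from all n bodies: O(n^2) work, roughly 20 flops per interaction.
__global__ void bruteForceNBody(const float4* pos, float4* acc,
                                int n, float softening2)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float4 pi = pos[i];
    float3 a = make_float3(0.0f, 0.0f, 0.0f);
    for (int j = 0; j < n; ++j) {
        float4 pj = pos[j];
        float dx = pj.x - pi.x, dy = pj.y - pi.y, dz = pj.z - pi.z;
        float r2 = dx * dx + dy * dy + dz * dz + softening2;
        float invR = rsqrtf(r2);
        float s = pj.w * invR * invR * invR;     // m_j / r^3
        a.x += dx * s; a.y += dy * s; a.z += dz * s;
    }
    acc[i] = make_float4(a.x, a.y, a.z, 0.0f);
}
```

Splitting this across two GPUs is exactly the pattern above: each GPU runs the same kernel over its own half of `i`, but both need the full `pos` array.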

Things that can help you:

- cooperative kernels
- unified memory
- explicit device selection, with per-device streams for kernels and buffer copies

If you have a Kepler cc 3.0 card like me, then you can try the old way: splitting kernels and buffers per device and managing them explicitly, as sketched below.
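A minimal sketch of that explicit pattern, assuming each GPU keeps a mirrored full-size position buffer (`pos0` allocated on device 0, `pos1` on device 1; the names are illustrative):

```cpp
#include <cuda_runtime.h>

// After each GPU has integrated its own half of the positions, mirror the
// updated halves into the other GPU's copy. cudaMemcpyPeer works even when
// peer access is not enabled (it is staged through the host, just slower).
void exchangeHalves(float4* pos0, float4* pos1, int n)
{
    int half = n / 2;
    size_t loBytes = (size_t)half * sizeof(float4);
    size_t hiBytes = (size_t)(n - half) * sizeof(float4);
    cudaMemcpyPeer(pos1,        1, pos0,        0, loBytes); // GPU 0's half -> GPU 1
    cudaMemcpyPeer(pos0 + half, 0, pos1 + half, 1, hiBytes); // GPU 1's half -> GPU 0
}
```

If the hardware supports it, call `cudaDeviceEnablePeerAccess` once per device pair at startup so these copies go directly over PCIe instead of through host memory.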

What kind of particles are those? Fluid particles with short-range forces? Gravitationally interacting particles with long-range forces? Something else entirely? Does each "update" need hundreds of kernel calls with global data synchronization, or just a few kernels with a single data sync between all work items? Some algorithms scale nearly linearly with the number of GPUs while others do not. How much computation per byte does your kernel do? Does it use atomics?