The issue is not really parallelism of computation. The issue is locality.
Usually a hydro solver needs to solve two very different problems: short-range and long-range interactions. Therefore you "split" the problem into a particle-mesh part (long range) and a particle-to-particle part (short range).
In this case there is no long-range interaction (i.e. gravity, electrodynamics), so you would go for a pure p2p implementation.
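Not from the original answer, but to make "pure p2p" concrete: the standard trick is a cell list, which turns the naive O(N²) all-pairs search into an (on average) O(N) one by binning particles into cells at least as large as the interaction cutoff, so each particle only checks its own and the 26 adjacent cells. A minimal NumPy sketch (function names are mine, non-periodic box assumed), with a brute-force reference for comparison:

```python
import numpy as np

def pairs_bruteforce(pos, h):
    """O(N^2) reference: all unordered pairs (i, j) closer than cutoff h."""
    n = len(pos)
    d = np.linalg.norm(pos[:, None] - pos[None, :], axis=-1)
    i, j = np.where((d < h) & (np.arange(n)[:, None] < np.arange(n)))
    return set(zip(i.tolist(), j.tolist()))

def pairs_celllist(pos, h, box):
    """O(N) on average: bin particles into cells of size >= h, then each
    particle only checks its own cell and the 26 adjacent ones."""
    ncell = np.maximum((box / h).astype(int), 1)   # cells per axis
    csize = box / ncell                            # cell size, >= h by construction
    cidx = np.minimum((pos / csize).astype(int), ncell - 1)
    buckets = {}                                   # cell index -> particle ids
    for p, c in enumerate(map(tuple, cidx)):
        buckets.setdefault(c, []).append(p)
    found = set()
    for (cx, cy, cz), plist in buckets.items():
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for dz in (-1, 0, 1):
                    nb = buckets.get((cx + dx, cy + dy, cz + dz), [])
                    for i in plist:
                        for j in nb:
                            if i < j and np.linalg.norm(pos[i] - pos[j]) < h:
                                found.add((i, j))
    return found

rng = np.random.default_rng(0)
box = np.array([1.0, 1.0, 1.0])
pos = rng.random((200, 3)) * box
assert pairs_celllist(pos, 0.1, box) == pairs_bruteforce(pos, 0.1)
```

The dict-of-buckets version is the readable form; a production (GPU) version would replace it with a sort-by-cell-index plus prefix-sum layout, which is exactly where the locality problem discussed next comes from.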
Then, in p2p, very strong coupling between particles will ensure that neighbors stay neighbors (that will be the case with solids, or with very high viscosity). But in most cases you will need a rebalancing of the tree (and therefore of the memory layout) every time step. This rebalancing can in fact dominate the execution time, as the computation on a given particle usually represents just a few (order 100) flops. This rebalancing is then usually faster to do on the CPU than on the GPU. Evidently, you can do this rebalancing "well" on the GPU, but the effort to get a proper implementation will be huge ;-).
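To illustrate what "rebalancing the memory layout" means in practice (my example, not the poster's code): a common approach is to sort all particle arrays by a Morton (Z-order) key each time step, so particles that are close in space become contiguous in memory again. A hedged NumPy sketch, with hypothetical helper names:

```python
import numpy as np

def morton_key(pos, box, bits=10):
    """Quantize x, y, z to `bits` bits each and interleave them into a
    Morton / Z-order key: nearby particles get nearby keys, hence nearby
    memory addresses after sorting."""
    q = np.clip((pos / box * (1 << bits)).astype(np.uint64), 0, (1 << bits) - 1)
    key = np.zeros(len(pos), dtype=np.uint64)
    for b in range(bits):
        for axis in range(3):
            # place bit b of axis `axis` at position 3*b + axis of the key
            key |= ((q[:, axis] >> np.uint64(b)) & np.uint64(1)) << np.uint64(3 * b + axis)
    return key

def rebalance(pos, vel, box):
    """The per-time-step 'rebalancing': reorder every per-particle array
    by Morton key so spatial neighbors stay contiguous in memory."""
    order = np.argsort(morton_key(pos, box))
    return pos[order], vel[order]
```

On a CPU this is one cheap argsort plus a few gathers; doing the same efficiently on a GPU (radix sort of keys, then scattering every particle field) is exactly the "huge effort" mentioned above.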