Use mpi4py-fft with CuPy arrays? #14

Open

leofang opened this issue Jan 5, 2022 · 24 comments

@leofang
Member

leofang commented Jan 5, 2022

@dalcinl I had a tiny bit of time over the holidays to pick up this old task that has been on my mind for a couple of years. I noticed that DistArray is the workhorse behind mpi4py-fft; however, it requires subclassing numpy.ndarray

class DistArray(np.ndarray):

which makes it impossible to use cupy.ndarray as a drop-in replacement, as CuPy currently does not support subclassing (cupy/cupy#3972).

Is it absolutely necessary to make DistArray a subclass? Can it be tweaked (hopefully mildly) to become an array container that supports both NumPy and CuPy arrays (and potentially other array/tensor libraries)? Would this change break backward compatibility? At first glance it seems possible, but I'd like to discuss it further and see if I'm missing anything.

@mikaem
Member

mikaem commented Jan 5, 2022

Hi @leofang
I think it's more correct to consider DistArray a convenience class for creating distributed NumPy arrays of the right shape, whereas the PFFT class is the workhorse. So I don't really think DistArray is the right place to look for a CUDA drop-in replacement. Note that the NumPy work arrays of PFFT are created in the FFT class, so perhaps that is a better place to start? With a serial CUDA transform class?

@leofang
Member Author

leofang commented Jan 5, 2022

Thanks, @mikaem! I am not sure I fully follow, maybe I'm still missing something...

Yes, DistArray is a convenience class for creating an array with the right local shape. But my understanding of the workflow is:

  • Create a PFFT object with the desired transform specified
  • Use the PFFT object to create a local array using newDistArray
  • Perform the transform on the local array

In terms of code, I am looking at the tutorial:

import numpy as np
from mpi4py import MPI
from mpi4py_fft import PFFT, newDistArray

N = np.array([128, 128, 128], dtype=int)
fft = PFFT(MPI.COMM_WORLD, N, axes=(0, 1, 2), dtype=np.float64, grid=(-1,))
u = newDistArray(fft, False)
u[:] = np.random.random(u.shape).astype(u.dtype)
u_hat = fft.forward(u, normalize=True)
uj = np.zeros_like(u)
uj = fft.backward(u_hat, uj)
assert np.allclose(uj, u)
print(MPI.COMM_WORLD.Get_rank(), u.shape)

From a CuPy developer/user's perspective, I expect this code to just work after replacing numpy with cupy, but that would require newDistArray to return a CuPy array.
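Concretely, the hope is that a (hypothetical) CuPy version of the same snippet would look like this, assuming newDistArray learned to hand back a cupy.ndarray:

import cupy as cp
from mpi4py import MPI
from mpi4py_fft import PFFT, newDistArray

N = (128, 128, 128)
fft = PFFT(MPI.COMM_WORLD, N, axes=(0, 1, 2), dtype=cp.float64, grid=(-1,))
u = newDistArray(fft, False)  # would need to return a cupy.ndarray
u[:] = cp.random.random(u.shape).astype(u.dtype)
u_hat = fft.forward(u, normalize=True)
uj = cp.zeros_like(u)
uj = fft.backward(u_hat, uj)
assert cp.allclose(uj, u)
print(MPI.COMM_WORLD.Get_rank(), u.shape)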

@mikaem
Member

mikaem commented Jan 5, 2022

My point is that there are NumPy arrays within the PFFT class, or more precisely the FFT class. The arrays you set up with DistArray are copied into these before the transforms start, so just changing DistArray does not change the underlying routines. Transforms are not in-place, which is why I think you need to add a serial FFT class for CuPy just to get started. Is there a serial CuPy FFT library?

@brownbaerchen

What's the status of this? Support for subclassing has since been added to cupy (see the issue mentioned above) and cupy has FFTs as well.
I am just starting to look into this, but I suppose it may be possible to replace np with cp in both DistArray and PFFT and be done with it now? Of course I have no idea about the details...
Are there any plans to add cupy support from your side?

@mikaem
Member

mikaem commented Dec 7, 2023

There are no plans as far as I know. But contributions are welcome😃

@leofang
Member Author

leofang commented Dec 7, 2023

I didn't have bandwidth for this, so feel free to give it a shot and let us know how it goes!

@brownbaerchen

I started out with CuPyfying the DistArray class. There were a few issues, but I think so far nothing catastrophic.
It is not as simple as replacing np with cp after all, as the signature for __new__ is different and the get function needs to stay on CPU, I believe.
I will continue with the PFFT class next week.

I would like to merge this back here eventually. I guess you would reasonably ask for the changes to be tested. However, testing code on GPUs is not straightforward on GitHub, at least not without paying. So is there any chance of merging this here? Do you have a way of testing on GPUs? If not, I guess I would not have to pay as much attention to cleaning up the code. That's why I'm asking already now.
We are currently working on testing CuPy code in pySDC by mirroring to a local GitLab and running the tests on the Jülich supercomputers, but it's unclear when this will be running and how the accounting for the compute budget will work.

If you are interested, you can track my progress here.

@leofang
Member Author

leofang commented Dec 8, 2023

I won't have time before the holidays, unfortunately, but after that I could help test it locally. I also agree that it should be merged if everything works locally.

For accessing GPUs in GitHub Actions, that's a longer discussion that won't happen overnight; what I can share now is that interested parties are working on it.

@brownbaerchen

I made some progress, but there is still plenty of room for improvement and I was hoping you have some suggestions.
First of all, I ran some scaling tests with the current implementation on the JUWELS Booster cluster in Jülich, which has 96 CPU threads and 4 A100 40GB GPUs per node. The following shows 3D c2c single-precision forward and backward FFTs with slab decomposition:
[Figures: weak scaling and strong scaling plots]

You can see up to three lines:

  • fftw+MPI: Current state of mpi4py-fft on CPUs
  • cupy+customMPI: GPU version of mpi4py-fft with custom Alltoallw
  • cuFFTMp+NVSHMEM: Distributed FFT library from NVIDIA that doesn't have Python bindings, but is included for comparison

The x-axis is per node, to judge whether the compute budget is better spent on CPUs or GPUs. Clearly, the GPUs perform pretty well, and mpi4py-fft works on them with only minor changes.
The weak scaling plot includes one data point with a single GPU (1/4 node), which is basically a single 3D transform with no communication.

However, cuFFTMp is faster by about a factor of two in weak scaling. I see two big issues with the GPU version of mpi4py-fft. First, the default Alltoallw from mpi4py is really slow for some reason. I did not check what is going on under the hood, but the custom Alltoallw is based on a series of ring sends and is faster by about a factor of 10. To conserve memory, I block between each ring send; this can definitely be improved. Matching the performance of NVSHMEM with MPI is, however, not possible afaik, especially for strong scaling. Weak scaling is a higher priority for me, though.
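A minimal sketch of the idea (hypothetical buffer layout and helper name, not the actual code on my branch):

from mpi4py import MPI

def ring_alltoallw(comm, send_chunks, recv_chunks):
    """Pairwise ring exchange in place of MPI Alltoallw.

    send_chunks[r] / recv_chunks[r] are contiguous device buffers destined
    for / coming from rank r (hypothetical layout).
    """
    rank, size = comm.Get_rank(), comm.Get_size()
    recv_chunks[rank][...] = send_chunks[rank]  # local part, no communication
    for shift in range(1, size):
        dst = (rank + shift) % size
        src = (rank - shift) % size
        # blocking pairwise exchange: only one message in flight per rank,
        # which keeps the memory footprint small
        comm.Sendrecv(sendbuf=send_chunks[dst], dest=dst,
                      recvbuf=recv_chunks[src], source=src)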
But a second big issue is memory operations, which consume about 25% of the compute time. First a 2D transform is performed on the slabs, then the data is transposed and a 1D transform is applied along the remaining axis. In the second transform, I can see that a kernel cupy_copy__complex64_complex64 is launched before and after the actual transform and the data is returned in 'F' ordering. I cannot see this kind of memory operation in cuFFTMp, but maybe I missed it. Do you have suggestions as to how I could avoid these memory operations?

@leofang
Member Author

leofang commented Jan 12, 2024

Thanks for working on this and sharing your findings, @brownbaerchen, this is very encouraging!

First, the default Alltoallw from mpi4py is really slow for some reason.

I'd let @dalcinl or @mikaem comment on this, but my hunch is that it's related to the CUDA-aware MPI implementation -- which MPI did you use, and how did you build it?

In the second transform, I can see that a kernel cupy_copy__complex64_complex64 is launched before and after the actual transform and the data is returned in 'F' ordering.

I think this roughly explains the perf difference between CuPy + mpi4py-fft and cuFFTMp + NVSHMEM. If you share your source code of the mpi4py-fft integration, I could take a look next week and try to see if there's any improvement I can suggest.

btw how did you use cuFFTMp in Python?

@brownbaerchen

brownbaerchen commented Jan 12, 2024

Regarding the slow Alltoallw, I used the OpenMPI and ParaStationMPI (MPICH) modules as installed on the Jülich supercomputers. I trust they installed them well. Maybe the data is copied to the host and back due to the non-contiguous buffers? I should check that at some point.

I didn't actually run cuFFTMp in Python. I may try to build bindings for that at some point. I just ran this file from the NVIDIA samples repository.

My code is available on this branch.
I guess the main thing is the FFT backend, which has the following code:

def _Xfftn_plan_cupy(shape, axes, dtype, transforms, options):
    # np, fftw and _Yfftn_wrap come from the surrounding module (libfft.py)
    import cupy as cp

    transforms = {} if transforms is None else transforms
    if tuple(axes) in transforms:
        plan_fwd, plan_bck = transforms[tuple(axes)]
    else:
        if cp.issubdtype(dtype, cp.floating):
            plan_fwd = cp.fft.rfftn
            plan_bck = cp.fft.irfftn
        else:
            plan_fwd = cp.fft.fftn
            plan_bck = cp.fft.ifftn

    s = tuple(np.take(shape, axes))
    U = cp.array(fftw.aligned(shape, dtype=dtype))  # TODO: avoid going via CPU
    _V = plan_fwd(U, s=s, axes=axes)
    V = cp.array(fftw.aligned_like(_V.get()))  # TODO: avoid going via CPU
    M = np.prod(s)

    # CuPy's forward transform is unscaled and the backward one is scaled with 1/N
    return (
        _Yfftn_wrap(plan_fwd, U, V, 1, {'s': s, 'axes': axes}),
        _Yfftn_wrap(plan_bck, V, U, M, {'s': s, 'axes': axes}),
    )

It's basically copy-paste from the numpy backend. I also tried cupyx.scipy.fftpack.fftn with overwrite_x=True, but the result was the same.

I don't construct plans myself because they are cached during the first transform and that is fine by me. But maybe there is a plan that avoids going to F ordering?
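(For reference, building and reusing an explicit cuFFT plan through cupyx looks roughly like this; as far as I can tell it does not by itself change the output ordering.)

import cupy as cp
from cupyx.scipy import fftpack as cufft

a = cp.random.random((128, 128, 128)).astype(cp.complex64)

# build a cuFFT plan for the first two axes once, then reuse it
plan = cufft.get_fft_plan(a, axes=(0, 1), value_type='C2C')
out = cufft.fftn(a, axes=(0, 1), plan=plan, overwrite_x=True)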

@llukas

llukas commented Jan 12, 2024

@brownbaerchen would this https://github.com/NVIDIA/CUDALibrarySamples/tree/master/cuFFTMp/JAX_FFT help running cuFFTMp in Python?

@brownbaerchen

brownbaerchen commented Jan 15, 2024

@brownbaerchen would this https://github.com/NVIDIA/CUDALibrarySamples/tree/master/cuFFTMp/JAX_FFT help running cuFFTMp in Python?

Thanks for mentioning this! Since I haven't done any work with pybind11 or JAX before, this is a bit opaque to me. I am wondering why this was done with JAX and not with CuPy.
I want to do distributed FFTs in a library that relies on subclassing numpy arrays. Subclassing cupy arrays works just fine and transitioning to JAX would probably be feasible but I haven't tried. At this point, I don't know what the best course of action is.

I guess I have three options:

  • Figure out how to call cuFFT from CuPy more efficiently in the transform after transposing and continue in mpi4py-fft
  • Design a new datatype in my library based on JAX and use the bindings you mentioned
  • Make bindings for cuFFTMp for CuPy analogous to the ones using JAX

Any recommendations?

@leofang
Member Author

leofang commented Feb 7, 2024

Sorry for the long delay. @brownbaerchen Your snippet above caught my attention. You shouldn't need the lines calling fftw.aligned() and then converting to CuPy arrays; it's a waste of memory bandwidth. fftw.aligned() is just a convenience helper that allocates aligned CPU buffers. Could you try removing it and redoing the benchmarks?
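A rough sketch of what I mean (untested), allocating the work buffers directly on the device inside _Xfftn_plan_cupy:

import cupy as cp

# create the work arrays on the GPU directly, instead of allocating
# aligned CPU buffers with fftw.aligned() and copying them over
U = cp.zeros(shape, dtype=dtype)
_V = plan_fwd(U, s=s, axes=axes)
V = cp.zeros_like(_V)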

@brownbaerchen

Sorry for the long delay. @brownbaerchen Your snippet above caught my attention. You shouldn't need the lines calling fftw.aligned() and then converting to CuPy arrays; it's a waste of memory bandwidth. fftw.aligned() is just a convenience helper that allocates aligned CPU buffers. Could you try removing it and redoing the benchmarks?

You're right, thanks for your input! I had ignored this so far because it is only done once during setup. Indeed, just removing the detour through fftw.aligned() significantly speeds up instantiation of the PFFT object, but once it is instantiated it doesn't make any difference. Since I didn't include the setup in the plots, the results don't change; in particular, the expensive memory operations remain. Still, your comment is very useful in practice!

@leofang
Member Author

leofang commented Feb 8, 2024

Cool. If you send a draft PR to this repo, it might be easier for me and the maintainers here (Mikael/Lisandro) to review and suggest changes, and they might have thoughts on the PR scope and your custom Alltoallw. I did take another quick glance at your branch but didn't see other major issues.

Regarding the slow Alltoallw, I used the OpenMPI and ParaStationMPI (MPICH) modules as installed on the Jülich supercomputers. I trust they installed them well. Maybe the data is copied to the host and back due to the non-contiguous buffers? I should check that at some point.

Totally forgot to follow up, sorry. Since you tried two flavors of MPI implementations, do you happen to know if they are both CUDA-aware (that is, you can just pass a GPU array to mpi4py and the underlying MPI impl knows how to handle GPU pointers)? If you already have mpi4py/CuPy in your local env, you can easily test them following Step 4 in conda-forge/openmpi-feedstock#128 (review).
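(Roughly, such a check boils down to something like the following; this is just a sketch, not the exact script from that thread.)

# run with: mpiexec -n 2 python check_cuda_aware.py
from mpi4py import MPI
import cupy as cp

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

sendbuf = cp.arange(10, dtype=cp.float64) * (rank + 1)
recvbuf = cp.empty_like(sendbuf)

# passing device arrays directly only works if the MPI build is CUDA-aware;
# a non-CUDA-aware build typically crashes or raises an MPI error here
comm.Allreduce(sendbuf, recvbuf)
print(rank, recvbuf)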

@mikaem
Member

mikaem commented Feb 13, 2024

Hi
Sorry for not following up. This looks very promising, but unfortunately I have almost no experience with CUDA or GPUs in general and I don't think I can be of much help. Not sure about @dalcinl? If this works, I believe it would be a great addition to mpi4py-fft though. A PR is welcome :-)
Regarding Alltoallw, I remember we had some success experimenting with MPI environment variables on the supercomputer at KAUST, getting performance close to Alltoall. But it has been a while since I looked at this and I'm getting old and forgetful.

@brownbaerchen

@leofang, yes, both MPI implementations are built with CUDA support. I talked to the support staff here in Jülich and they also don't know why this is happening, but they assured me that I loaded the correct modules. It would be nice if someone could test this on a different machine. Since it seems that NCCL is preferable over MPI anyway, maybe this doesn't matter too much, though.

@leofang
Member Author

leofang commented Feb 13, 2024

@brownbaerchen It would be nice to gather more info. For Open MPI, please share the output of ompi_info. For MPICH, please share the output of mpichversion. Also, in both MPI implementations, is CUDA-aware UCX used?

@brownbaerchen

@brownbaerchen It would be nice to gather more info. For Open MPI, please share the output of ompi_info. For MPICH, please share the output of mpichversion. Also, in both MPI implementations, is CUDA-aware UCX used?

Since the output is quite lengthy, I uploaded the output of ompi_info here and the output of mpichversion here.
I do explicitly load a module enabling CUDA support for UCX and MPI. However, this is the default behaviour on this machine anyway, as it is very GPU-centric.

@brownbaerchen

After @mhrywniak gave me some tips on how to use Nsight, I could finally see where the expensive memory operations that I have been talking about occur. To recap: the launch of cupy_copy__complex64_complex64 kernels has held back the performance of CuPy in mpi4py-fft significantly. It turns out that this kernel is launched when doing array[...] = data and can be replaced by cupy.copyto(array, data), which launches a D2D memcpy operation that is much faster. For low node counts, the overhead compared to cuFFTMp is now about 1.5x instead of 2x as before.
However, there is still a kernel cupy_multiply__complex64_complex_complex64, which is very expensive. It takes at least as long as the serial FFTs and is launched when doing the normalisation for the FFT. In the code this looks like array *= float. I tried replacing this with cupy.multiply(array, float), but it was not any faster. It seems odd to me that this multiplication is so expensive, and I was wondering whether you have any idea how it could be done faster?
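For reference, the two patterns in question, as a minimal sketch (array names made up):

import cupy as cp

data = cp.random.random((64, 64, 64)).astype(cp.complex64)
array = cp.empty_like(data)

# launches the elementwise cupy_copy__complex64_complex64 kernel
array[...] = data

# issues a device-to-device memcpy instead, which is much cheaper here
cp.copyto(array, data)

# the normalisation that launches cupy_multiply__complex64_complex_complex64
array *= 1.0 / array.size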

Also, I experimented with streams in the NCCL Alltoallw implementation, but I was not able to gain significant speedup with this. Suggestions on how to improve this are welcome!

Any updates on cuFFTMp python bindings?

@dalcinl
Member

dalcinl commented Feb 23, 2024

However, there is still a kernel cupy_multiply__complex64_complex_complex64, which is very expensive.

Weird name, because ...

It takes at least as long as the serial FFTs and is launched when doing the normalisation for the FFT. In the code this looks like array *= float

... normalization is about scaling with a real value, not a complex value.
Is that kernel promoting the float scaling factor to complex and then doing complex * complex operations?
What if you create a view of the array as real floating point and then multiply in place by the scaling factor?
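Something along these lines (an untested sketch, assuming the array is C-contiguous):

import cupy as cp

a = cp.zeros((128, 128, 128), dtype=cp.complex64)
scale = 1.0 / a.size

# viewing the complex64 buffer as float32 doubles the last axis;
# the in-place multiply then runs a real-valued kernel over the same memory
a_real = a.view(cp.float32)
a_real *= scale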

@brownbaerchen

Good point @dalcinl! Indeed, multiplying a real view in-place results in a kernel called cupy_multiply__float32_float_float32. Unfortunately, it is not any faster...

I noticed something else, though, but I am not sure if it is 100% correct. So please confirm the following:
In the _Xfftn_plan_... classes, e.g. here, the factor M is used to revert any scaling done by the FFT library. So if NumPy's forward transform is unscaled and its backward transform is scaled by $1/n$, then M is 1 for the forward and $n$ for the backward transform, such that the data comes out entirely unscaled. Only in the FFT class, e.g. here, is the user-visible scaling of mpi4py-fft applied. But NumPy allows changing the normalisation, like so. So we can call the forward transform and tell NumPy not to scale it by passing norm='backward', and then call the backward transform and also tell NumPy not to scale it by passing norm='forward'. Then we can choose M=1 for both in the _Xfftn_plan_numpy wrapper and save many operations.
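In other words (a minimal NumPy sketch; cp.fft takes the same norm argument):

import numpy as np

x = np.random.random((8, 8, 8)) + 1j * np.random.random((8, 8, 8))

# forward unscaled: the default 'backward' normalisation puts the 1/N on the inverse
xh = np.fft.fftn(x, norm='backward')
# backward also unscaled: with norm='forward' the 1/N is attached to the
# forward direction, so ifftn applies no scaling at all
xr = np.fft.ifftn(xh, norm='forward')

# the round trip is off only by the factor N = x.size, which the FFT class
# can apply once instead of rescaling inside the plan wrappers (M=1)
assert np.allclose(xr, x * x.size)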
CuPy has the same normalisation as NumPy and the result seems to be the same if I prevent rescaling and choose M=1, but it's faster because it does two fewer cupy_multiply_... kernels.
Does this have any side effects that I didn't consider?

@dalcinl
Member

dalcinl commented Feb 26, 2024

So we can call the forward transform and tell NumPy not to scale it by passing norm='backward', and then call the backward transform and also tell NumPy not to scale it by passing norm='forward'. Then we can choose M=1 for both in the _Xfftn_plan_numpy wrapper and save many operations.

Definitely. Good catch! The NumPy backend was kind of a proof of concept; I don't think we really thought about the proper, performant way to do it.

CuPy has the same normalisation as NumPy and the result seems to be the same if I prevent rescaling and choose M=1, but it's faster because it does two fewer cupy_multiply_... kernels.

Of course.

Does this have any side effects that I didn't consider?

None that I can think of.
