
Update Chicoma-CPU and add Chicoma-GPU #73

Closed

Conversation

@xylar (Collaborator) commented Jan 27, 2024

This merge makes a few updates to Chicoma-CPU and adds support for Chicoma's GPU partition.

@xylar xylar changed the base branch from master to alternate January 27, 2024 17:01
@xylar xylar changed the base branch from alternate to master January 27, 2024 17:01
@xylar (Collaborator Author) commented Jan 27, 2024

@jonbob and @vanroekel, I haven't run any tests on this yet but wanted to give you a heads up that I'm working on it. I will test all the supported compilers (4 on CPU and 4 on GPU) early next week to make sure they can run a simple E3SM test.

After that, I'll ask for your input.

@xylar (Collaborator Author) commented Jan 31, 2024

@jonbob, I'm not having any luck testing this on Chicoma. Any runs of ./create_test hang, either in CREATE_NEWCASE or SETUP. It could be that it's related to running out of space, but I would have thought it would produce an error rather than hanging. Could you see if you have better luck whenever you have some time?

@jonbob (Collaborator) commented Jan 31, 2024

@xylar -- I'll try it later today

@jonbob (Collaborator) commented Jan 31, 2024

@xylar -- I was able to successfully build:

  • SMS.T62_oQU120.CMPASO-NYF.chicoma-cpu_gnu
  • SMS.T62_oQU120.CMPASO-NYF.chicoma-cpu_intel

but they failed at runtime. I added a comment where I think there's a problem. I'm testing a fix.

@xylar (Collaborator Author) commented Jan 31, 2024

I'll test again tomorrow. Thanks for the help, @jonbob!

@jonbob (Collaborator) commented Jan 31, 2024

After I fixed those lines, it's still complaining about "-m" when it tries to run. From the e3sm.log:

/bin/sh: line 1: -m: command not found

@jonbob (Collaborator) commented Jan 31, 2024

And here's more output:

run command is srun  --label  -n 64 -N 1 -c 2  --cpu_bind=cores
   -m plane=128 /lustre/scratch4/turquoise/jonbob/E3SM/scratch/chicoma-cpu/SMS.T62_oQU120.CMPASO-NYF.chicoma-cpu_gnu.20240131_130253_wft96a/bld/e3sm.exe   >> e3sm.log.$LID 2>&1  
2024-01-31 13:27:55 SAVE_PRERUN_PROVENANCE BEGINS HERE

maybe another line break?
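For comparison, here is the same command joined back onto a single line, which is what the shell should have received (a sketch; arguments and paths are copied verbatim from the log above):

```
srun --label -n 64 -N 1 -c 2 --cpu_bind=cores -m plane=128 \
  /lustre/scratch4/turquoise/jonbob/E3SM/scratch/chicoma-cpu/SMS.T62_oQU120.CMPASO-NYF.chicoma-cpu_gnu.20240131_130253_wft96a/bld/e3sm.exe >> e3sm.log.$LID 2>&1
```

With the stray newline, the shell instead tries to run `-m` as its own command, which matches the "-m: command not found" error above.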

@xylar (Collaborator Author) commented Jan 31, 2024

Yep, it's possible. I had missed the same formatting issues with chicoma-gpu. I appreciate you testing this. If my dumb mistakes get too annoying, you are certainly welcome to kick this back to me and ask me to clean up my mess before involving you further. Sorry about that!

@xylar xylar force-pushed the machinefiles/update-chicoma branch from a2b5a06 to b5db2b8 Compare February 3, 2024 13:39
@xylar xylar changed the base branch from master to alternate February 3, 2024 13:40
@xylar xylar changed the base branch from alternate to master February 3, 2024 13:40
@xylar xylar force-pushed the machinefiles/update-chicoma branch from b5db2b8 to a27fe2b Compare February 3, 2024 13:48
@xylar (Collaborator Author) commented Feb 3, 2024

With a few fixes, I am now able to run tests on chicoma-cpu:

./create_test SMS.T62_oQU120_ais20.MPAS_LISIO_TEST.chicoma-cpu_gnu --walltime 1:00:00 --wait -p w23_freddy
./create_test SMS.T62_oQU120_ais20.MPAS_LISIO_TEST.chicoma-cpu_intel --walltime 1:00:00 --wait -p w23_freddy
./create_test SMS.T62_oQU120_ais20.MPAS_LISIO_TEST.chicoma-cpu_nvidia --walltime 1:00:00 --wait -p w23_freddy

However, I'm not able to build mct with nvidiagpu:

configure:2948: checking whether we are cross compiling
configure:2956: cc -o conftest  -I/lustre/scratch4/turquoise/xylar/E3SM/scratch/chicoma-gpu/SMS.T62_oQU120_ais20.MPAS_LISIO_TEST.chicoma-gpu_nvidiagpu.20240203_071102_4rj99z/bld/nvidiagpu/mpich/nodebug/nothreads/mct/include  -Wl,--allow-multiple-definition -lstdc++ conftest.c  >&5
nvc-Warning-CUDA_HOME has been deprecated. Please, use NVHPC_CUDA_HOME instead.
"conftest.c", line 17: warning: statement is unreachable
    return 0;
    ^

/usr/bin/ld: warning: /tmp/pgcudafatDlMUx6vrhVXz.o: missing .note.GNU-stack section implies executable stack
/usr/bin/ld: NOTE: This behaviour is deprecated and will be removed in a future version of the linker
configure:2960: $? = 0
configure:2967: ./conftest
./conftest: error while loading shared libraries: libcuda.so.1: cannot open shared object file: No such file or directory

I haven't been able to figure out what is supposed to be providing libcuda.so.1. It doesn't seem to be in cudatoolkit, but that also seems to be true on Perlmutter, so I don't get the sense that it's supposed to be there. I couldn't find anything that was done for pm-gpu that I forgot to duplicate for chicoma-gpu that might account for this.

@xylar (Collaborator Author) commented Feb 3, 2024

On Perlmutter, libcuda.so.1 is in /usr/lib64/. It seems to be something installed as part of the NVIDIA drivers and not something that can (typically) be found in a module. I can't find the equivalent on Chicoma anywhere, nor any helpful documentation on LANL HPC.

@xylar (Collaborator Author) commented Feb 3, 2024

I'm giving up on this for now. @vanroekel, if this becomes pressing for you, I suggest getting some help from LANL IC on this.

@vanroekel (Collaborator)

Thank you for working on this, @xylar, I appreciate it. I'll try to pick this up and push on it later in the week.

@xylar (Collaborator Author) commented Feb 4, 2024

@vanroekel, one thought I had was that maybe libcuda.so.1 is only available on the GPU compute nodes, so I might try running tests on an interactive GPU node. You are welcome to give this a try if you beat me to it. A quick check would be whether /usr/lib64/libcuda.so.1 exists on a GPU node.
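Something like the following would do it (a sketch; the partition name and time limit are assumptions, so adjust to whatever is right on Chicoma):

```
# Grab a short interactive allocation on the GPU partition and check whether
# the driver-provided library is present there.
salloc --partition=gpu --gpus=1 --time=00:15:00
srun ls -l /usr/lib64/libcuda.so.1
srun ldconfig -p | grep libcuda   # alternative check via the loader cache
```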

@vanroekel (Collaborator)

@xylar, you were right on the money. When I logged onto the GPU partition, there is indeed a /usr/lib64/libcuda.so.1 file.

@xylar (Collaborator Author) commented Feb 5, 2024

Okay, that's going to mean that everything for chicoma-gpu has to be done from an interactive or batch job, not on a login node.

@xylar (Collaborator Author) commented Feb 5, 2024

I can test things there tomorrow if I find the time.

@vanroekel (Collaborator)

@xylar, I have a bit of time to push on this this morning. Do you have a test I could try on chicoma-gpu to verify? Would I change something like this one

SMS.T62_oQU120_ais20.MPAS_LISIO_TEST.chicoma-cpu_gnu

to

SMS.T62_oQU120_ais20.MPAS_LISIO_TEST.chicoma-gpu_nvidiagpu

?

@vanroekel (Collaborator)

My change to the test worked. There is an error (different from what you saw before) that I'll look into:

Error invoking pkg-config!
Package cudatoolkit_22.7_11.7 was not found in the pkg-config search path.
Perhaps you should add the directory containing `cudatoolkit_22.7_11.7.pc'
to the PKG_CONFIG_PATH environment variable
No package 'cudatoolkit_22.7_11.7' found

Review thread on the changed lines:

<MAX_MPITASKS_PER_NODE>64</MAX_MPITASKS_PER_NODE>
<MAX_MPITASKS_PER_NODE>128</MAX_MPITASKS_PER_NODE>
@xylar (Collaborator Author)

@mark-petersen, the issue you pointed out on Slack should be fixed here.


Yes, this should definitely be 128 for chicoma-cpu. Thanks.

@xylar (Collaborator Author) commented Feb 6, 2024

@vanroekel, that sounds to me like something isn't configured right in the cudatoolkit module on Chicoma, though that's just a guess. Can you load the cudatoolkit module (and all the modules before it) and do module show cudatoolkit, then compare with what you get on Perlmutter for the same? Specifically, how is it setting up $PKG_CONFIG_PATH? It seems like something isn't right on Chicoma.
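Something along these lines on both machines, for comparison (a sketch; the modules loaded before cudatoolkit are an assumption, use whatever the machine config actually loads):

```
# Inspect what the cudatoolkit module does to the environment, in particular
# how (or whether) it sets PKG_CONFIG_PATH.
module load PrgEnv-nvidia cudatoolkit
module show cudatoolkit 2>&1 | grep -i pkg_config
echo "$PKG_CONFIG_PATH" | tr ':' '\n'
# The package pkg-config failed to find in the error above:
pkg-config --exists cudatoolkit_22.7_11.7 && echo found || echo missing
```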

@xylar (Collaborator Author) commented Feb 6, 2024

SMS.T62_oQU120_ais20.MPAS_LISIO_TEST.chicoma-gpu_nvidiagpu

Yes, that seems like a good thing to try. Maybe @jonbob has something even simpler but given that most of the wait is compile time, it doesn't hurt to test MPAS-O, MPAS-Seaice and MALI all in one go.
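Concretely, something like this (a sketch; the project passed to -p is a placeholder for an account that actually has access to the GPU partition):

```
./create_test SMS.T62_oQU120_ais20.MPAS_LISIO_TEST.chicoma-gpu_nvidiagpu \
    --walltime 1:00:00 --wait -p <gpu_project>
```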

@jonbob (Collaborator) commented Feb 6, 2024

That's about as small a test as we could come up with for all three components. You could always try a C- or D-case and just have to build one active component, but it may not save you much with the parallel build.

@vanroekel (Collaborator)

A small update on the pkg-config issue: I've dug all over on Chicoma and cannot find the package config file that is needed. I even dug into the module load script, and it is pointing to a pkg_config directory that is non-existent! I've contacted LANL support, so I guess this work is on hold until I hear back from them.

@vanroekel (Collaborator)

Well, an unfortunate update: it seems the missing files are only visible on the front-end nodes, but libcuda.so is only visible on the compute nodes. I'm working with LANL support on how to address this.

@vanroekel (Collaborator)

A bit of progress: I have a workaround for the pkg-config error. I'm now able to build all the dependencies on the GPU partition, but I'm now getting an error in the MPAS build:

nvcc fatal   : Unknown option '-Wl,--allow-multiple-definition'
Target namelist_gen built in 0.002274 seconds
gmake[2]: *** [mpas-framework/src/tools/CMakeFiles/namelist_gen.dir/build.make:132: mpas-framework/src/tools/namelist_gen] Error 1
gmake[2]: Leaving directory '/lustre/scratch4/turquoise/.mdt2/lvanroekel/E3SM/scratch/chicoma-gpu/SMS.T62_oQU120_ais20.MPAS_LISIO_TEST.chicoma-gpu_nvidiagpu.20240207_114345_9m1jel/bld/cmake-bld'

@jonbob or @xylar, do either of you know where MPAS picks up build options, so I can try to remove the options that are 'unknown'?

@xylar (Collaborator Author) commented Feb 7, 2024

@vanroekel, these are presumably coming from here:
https://github.com/E3SM-Project/E3SM/blob/50485d0c929a873b5935ea3c2dc1da4d236319dd/cime_config/machines/cmake_macros/nvidiagpu.cmake#L27
But I would be suspicious that something else is going wrong because these are linker flags and I don't think nvcc should be getting invoked as the linker. (But maybe @jonbob or @philipwjones would be more helpful here.)
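For illustration only (not the fix applied in this PR): nvcc rejects GNU-ld-style flags passed directly, but will forward a flag to the host linker if asked, which is why a flag like this ending up on an nvcc command line fails outright:

```
# Illustrative only; hello.cu is a hypothetical file.
nvcc hello.cu -Wl,--allow-multiple-definition -o hello       # nvcc fatal: Unknown option
nvcc hello.cu -Xlinker --allow-multiple-definition -o hello  # forwarded to the host linker
```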

@xylar (Collaborator Author) commented Feb 7, 2024

@vanroekel, I suspect the problem might be that we're missing the equivalent of:
https://github.com/E3SM-Project/E3SM/blob/50485d0c929a873b5935ea3c2dc1da4d236319dd/cime_config/machines/cmake_macros/nvidiagpu_pm-gpu.cmake
I missed the cmake_macros in this PR, so I went ahead and copied them from their pm-gpu equivalents and pushed them to this branch.

@xylar (Collaborator Author) commented Feb 7, 2024

@vanroekel, see if the macros I just added make a difference.

@vanroekel (Collaborator)

Thanks @xylar! I took these and made one more change and I got it to build! Do you want me to pass you my small changes or push to this branch?

However, it still won't run or submit. I'm getting this error:

sbatch: error: Requested GRES option unsupported by configured SelectType plugin

@philipwjones, any suggestions on what this means? Here are the GPU directives:

```
<directive> --gpus-per-task=1</directive>
<directive> --gpu-bind=none</directive>
```

@xylar (Collaborator Author) commented Feb 8, 2024

@vanroekel, that sounds like progress! Yes, just push to this branch.

@mark-petersen

It would be convenient to remove badger in this PR, since chicoma replaced badger. Otherwise, we should remove the badger machine file section in a separate PR.

@philipwjones (Collaborator)

@vanroekel, do you have the actual batch submit command from the logs? GRES is a resource error, so I'm not sure what you actually asked for...
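If it's easier than digging through logs, CIME can also print it directly (a sketch; if I remember the tooling right, preview_run is available in the case directory):

```
# Print the batch submit command and the run (srun) command CIME will use,
# without submitting anything.
cd <case_directory>
./preview_run
```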

@vanroekel (Collaborator)

So, a bit of a funny story. I figured out the GRES error: it turns out it triggers when you use an account value in sbatch that doesn't have access to chicoma-gpu. My tests had been using an account that only has chicoma-cpu access. When I switched to a different account, it submitted! I'm testing the E3SM test again and will report back and push changes soon.
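For the record, one way to check which accounts are tied to which partitions (a sketch, assuming Slurm accounting is queryable by regular users here):

```
# List the account/partition associations for the current user; an account with
# no GPU-partition association would explain the GRES rejection at submit time.
sacctmgr -n show associations user=$USER format=Account%20,Partition%20
```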

@xylar (Collaborator Author) commented Feb 8, 2024

Sounds like progress! User errors are usually the easiest ones to fix (once you spot them).

nvidiagpu now works

@vanroekel (Collaborator)

Okay, I just pushed changes that allowed me to build, submit, and run:

 ./create_test SMS.T62_oQU120_ais20.MPAS_LISIO_TEST.chicoma-gpu_nvidiagpu --walltime 1:00:00 --wait -p g23_nonhydro_g

I've only tested nvidiagpu.

Anything else you'd like me to test?

@xylar (Collaborator Author) commented Feb 9, 2024

@vanroekel, could you run the same test with gnugpu? I think that would be all we need. I'd run it but I don't think I'm on any projects with access to the GPU partition.

After that, let's call it good. We can always make follow-up PRs to fix anything that comes up.
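For concreteness, that would be the same invocation with the compiler swapped (the project flag is copied from the nvidiagpu run above; adjust as needed):

```
./create_test SMS.T62_oQU120_ais20.MPAS_LISIO_TEST.chicoma-gpu_gnugpu \
    --walltime 1:00:00 --wait -p g23_nonhydro_g
```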

@xylar (Collaborator Author) commented Feb 9, 2024

I've moved to E3SM-Project#6228 so please report testing on gnugpu there, @vanroekel.

@xylar xylar closed this Feb 9, 2024