[WIP] {2023.06}[foss/2023a] PyTorch v2.1.2 with CUDA/12.1.1 #586

Draft
wants to merge 26 commits into
base: 2023.06-software.eessi.io

Conversation

trz42
Collaborator

@trz42 trz42 commented May 24, 2024

WORK IN PROGRESS

The eventual aim of this PR is to add PyTorch/2.1.2 with CUDA/12.1.1. Building it may not work out of the box, so the PR also documents the progress, the issues we hit, and the workarounds applied.

PyTorch with CUDA requires cuDNN, hence this PR also builds it using the same changes provided by #581 and #579 (however, the changes by the latter would have to be ingested, hence we need additional changes here; we try to document well what we do, and why).

Initially, we only build for compute capability 7.0. Later, we build for all architectures from Pascal onwards, excluding architectures for embedded GPUs and very special compute capabilities such as 9.0a. That is, the list of compute capabilities could be 6.0,6.1,7.0,7.5,8.0,8.6,8.9,9.0
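To illustrate what such a list means for the build, here is a minimal sketch that converts the comma-separated capabilities into a semicolon-separated value of the kind PyTorch's build reads from `TORCH_CUDA_ARCH_LIST`, and into `nvcc` `--generate-code` options. The helper names are hypothetical and not part of this PR; how the list is actually passed to EasyBuild is not shown here.

```python
def parse_compute_capabilities(spec):
    """Split a comma-separated list such as '6.0,7.0' into ['6.0', '7.0']."""
    return [cc.strip() for cc in spec.split(",") if cc.strip()]


def torch_cuda_arch_list(capabilities):
    """Join capabilities with ';' for use as a TORCH_CUDA_ARCH_LIST value."""
    return ";".join(capabilities)


def nvcc_generate_code_flags(capabilities):
    """Build one nvcc --generate-code option per compute capability."""
    flags = []
    for cc in capabilities:
        sm = cc.replace(".", "")  # '8.6' -> '86'
        flags.append(f"--generate-code=arch=compute_{sm},code=sm_{sm}")
    return flags


if __name__ == "__main__":
    ccs = parse_compute_capabilities("6.0,6.1,7.0,7.5,8.0,8.6,8.9,9.0")
    print(torch_cuda_arch_list(ccs))
    for flag in nvcc_generate_code_flags(ccs):
        print(flag)
```

Each additional capability adds another set of device code to every CUDA object, so build time and installation size are expected to grow with the length of the list.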

truib added 11 commits May 17, 2024 11:13
…-layer into 2023.06-software.eessi.io-cuDNN-8.9.2.26-system
- `EESSI-install-software.sh`
  - use `scripts/gpu_support/nvidia/install_cuda_and_libraries.sh` with
    `scripts/gpu_support/nvidia/eessi-2023.06-cuda-and-libraries.yml`
- `create_lmodsitepackage.py`
  - consolidate `eessi_{cuda,cudnn}_enabled_load_hook` functions in a single one
    (`eessi_cuda_and_libraries_enabled_load_hook`)
  - the remaining hook is prepared to easily add new modules, e.g., cuTENSOR
- `eb_hooks.py`
  - put code that iterates over all files replacing non-distributable ones with
    symlinks into `host_injections` with a common function
    (`replace_non_distributable_files_with_symlinks`)
- `install_scripts.sh`
  - add files to copy to CVMFS (see `nvidia_files`)
- `scripts/gpu_support/nvidia/install_cuda_and_libraries.sh`
  - improved creation of tmp directory
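The idea behind `replace_non_distributable_files_with_symlinks` can be sketched roughly as follows. Only the function name comes from the commit message above; the signature, the allowlist argument and the directory layout are assumptions made for illustration, and the real code in `eb_hooks.py` may differ.

```python
import os


def replace_non_distributable_files_with_symlinks(install_dir, allowlist, host_injections_dir):
    """Replace files that may not be redistributed with symlinks.

    install_dir:         root of the installed package (assumed layout)
    allowlist:           set of paths, relative to install_dir, that may be shipped
    host_injections_dir: directory under which a site provides the full installation

    Returns the sorted list of relative paths that were replaced.
    """
    replaced = []
    for root, _dirs, files in os.walk(install_dir):
        for name in files:
            path = os.path.join(root, name)
            rel = os.path.relpath(path, install_dir)
            # keep distributable files and anything that is already a symlink
            if rel in allowlist or os.path.islink(path):
                continue
            target = os.path.join(host_injections_dir, rel)
            os.remove(path)
            # the target may not exist on the build host (dangling symlink)
            os.symlink(target, path)
            replaced.append(rel)
    return sorted(replaced)
```

With this approach the same function can serve CUDA, cuDNN and any later addition such as cuTENSOR; only the allowlist differs per package.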
@trz42 trz42 added help wanted Extra attention is needed 2023.06-software.eessi.io 2023.06 version of software.eessi.io accel:nvidia labels May 24, 2024

eessi-bot bot commented May 24, 2024

Instance eessi-bot-mc-aws is configured to build:

  • arch x86_64/generic for repo eessi-hpc.org-2023.06-compat
  • arch x86_64/generic for repo eessi-hpc.org-2023.06-software
  • arch x86_64/generic for repo eessi.io-2023.06-compat
  • arch x86_64/generic for repo eessi.io-2023.06-software
  • arch x86_64/intel/haswell for repo eessi-hpc.org-2023.06-compat
  • arch x86_64/intel/haswell for repo eessi-hpc.org-2023.06-software
  • arch x86_64/intel/haswell for repo eessi.io-2023.06-compat
  • arch x86_64/intel/haswell for repo eessi.io-2023.06-software
  • arch x86_64/intel/skylake_avx512 for repo eessi-hpc.org-2023.06-compat
  • arch x86_64/intel/skylake_avx512 for repo eessi-hpc.org-2023.06-software
  • arch x86_64/intel/skylake_avx512 for repo eessi.io-2023.06-compat
  • arch x86_64/intel/skylake_avx512 for repo eessi.io-2023.06-software
  • arch x86_64/amd/zen2 for repo eessi-hpc.org-2023.06-compat
  • arch x86_64/amd/zen2 for repo eessi-hpc.org-2023.06-software
  • arch x86_64/amd/zen2 for repo eessi.io-2023.06-compat
  • arch x86_64/amd/zen2 for repo eessi.io-2023.06-software
  • arch x86_64/amd/zen3 for repo eessi-hpc.org-2023.06-compat
  • arch x86_64/amd/zen3 for repo eessi-hpc.org-2023.06-software
  • arch x86_64/amd/zen3 for repo eessi.io-2023.06-compat
  • arch x86_64/amd/zen3 for repo eessi.io-2023.06-software
  • arch aarch64/generic for repo eessi-hpc.org-2023.06-compat
  • arch aarch64/generic for repo eessi-hpc.org-2023.06-software
  • arch aarch64/generic for repo eessi.io-2023.06-compat
  • arch aarch64/generic for repo eessi.io-2023.06-software
  • arch aarch64/neoverse_n1 for repo eessi-hpc.org-2023.06-compat
  • arch aarch64/neoverse_n1 for repo eessi-hpc.org-2023.06-software
  • arch aarch64/neoverse_n1 for repo eessi.io-2023.06-compat
  • arch aarch64/neoverse_n1 for repo eessi.io-2023.06-software
  • arch aarch64/neoverse_v1 for repo eessi-hpc.org-2023.06-compat
  • arch aarch64/neoverse_v1 for repo eessi-hpc.org-2023.06-software
  • arch aarch64/neoverse_v1 for repo eessi.io-2023.06-compat
  • arch aarch64/neoverse_v1 for repo eessi.io-2023.06-software


eessi-bot bot commented May 24, 2024

Instance eessi-bot-mc-azure is configured to build:

  • arch x86_64/amd/zen4 for repo eessi-hpc.org-2023.06-compat
  • arch x86_64/amd/zen4 for repo eessi-hpc.org-2023.06-software
  • arch x86_64/amd/zen4 for repo eessi.io-2023.06-compat
  • arch x86_64/amd/zen4 for repo eessi.io-2023.06-software

@trz42
Collaborator Author

trz42 commented May 24, 2024

We run a first attempt without any modifications (e.g., without workarounds for known issues)...

bot: build inst:aws repo:eessi.io-2023.06-software arch:zen2


eessi-bot bot commented May 24, 2024

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • received bot command build inst:aws repo:eessi.io-2023.06-software arch:zen2 from trz42

    • expanded format: build instance:aws repository:eessi.io-2023.06-software architecture:zen2
  • handling command build instance:aws repository:eessi.io-2023.06-software architecture:zen2 resulted in:


eessi-bot bot commented May 24, 2024

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command build inst:aws repo:eessi.io-2023.06-software arch:zen2 from trz42

    • expanded format: build instance:aws repository:eessi.io-2023.06-software architecture:zen2
  • handling command build instance:aws repository:eessi.io-2023.06-software architecture:zen2 resulted in:

    • no jobs were submitted


eessi-bot bot commented May 24, 2024

New job on instance eessi-bot-mc-aws for architecture x86_64-amd-zen2 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.05/pr_586/11348

  • failed with
You requested to load UCX-CUDA  which relies on the CUDA runtime environment
and driver libraries. In order to be able to use the module, you will need to
make sure EESSI can find the GPU driver libraries on your host system.
For more information on how to do this, see https://www.eessi.io/docs/gpu/.

While processing the following module(s):
    Module fullname                             Module Filename
    ---------------                             ---------------
    UCX-CUDA/1.14.1-GCCcore-12.3.0-CUDA-12.1.1  /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen2/modules/all/UCX-CUDA/1.14.1-GCCcore-12.3.0-CUDA-12.1.1.lua
date job status comment
May 24 07:24:07 UTC 2024 submitted job id 11348 awaits release by job manager
May 24 07:24:09 UTC 2024 released job awaits launch by Slurm scheduler
May 24 07:25:11 UTC 2024 running job 11348 is running
May 24 07:39:24 UTC 2024 finished
😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-11348.out
❌ found message matching ERROR:
❌ found message matching FAILED:
❌ found message matching required modules missing:
✅ found message(s) matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-amd-zen2-1716535927.tar.gz
size: 698 MiB (732486169 bytes)
entries: 75
modules under 2023.06/software/linux/x86_64/amd/zen2/modules/all
cuDNN/8.9.2.26-CUDA-12.1.1.lua
software under 2023.06/software/linux/x86_64/amd/zen2/software
cuDNN/8.9.2.26-CUDA-12.1.1
other under 2023.06/software/linux/x86_64/amd/zen2
2023.06/init/easybuild/eb_hooks.py
2023.06/scripts/gpu_support/nvidia/eessi-2023.06-cuda-and-libraries.yml
2023.06/scripts/gpu_support/nvidia/install_cuda_and_libraries.sh
.lmod/SitePackage.lua
May 24 07:39:25 UTC 2024 test result
😁 SUCCESS (click triangle for details)
ReFrame Summary
[ PASSED ] Ran 10/10 test case(s) from 10 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
✅ job output file slurm-11348.out
❌ found message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case
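The build verdict in the report above appears to come from pattern checks on the job output file: some patterns must be absent, others must be present. A minimal sketch of such a check follows. It uses only the patterns visible in the report; the function and constant names are hypothetical, and this is not the bot's actual implementation.

```python
import re

# patterns taken from the bot report; all other names are illustrative
MUST_BE_ABSENT = [r"ERROR:", r"FAILED:", r"required modules missing:"]
MUST_BE_PRESENT = [r"No missing installations", r"\.tar\.gz created!"]


def check_build_output(text):
    """Return (success, per-pattern results) for the given job output text."""
    results = {}
    for pattern in MUST_BE_ABSENT:
        results[pattern] = re.search(pattern, text) is None
    for pattern in MUST_BE_PRESENT:
        results[pattern] = re.search(pattern, text) is not None
    return all(results.values()), results


if __name__ == "__main__":
    sample = "ERROR: build failed\nNo missing installations\nfoo.tar.gz created!\n"
    ok, details = check_build_output(sample)
    print("SUCCESS" if ok else "FAILURE")
    for pattern, passed in details.items():
        print("pass" if passed else "fail", pattern)
```

This explains why a job can be reported as a failure even though a tarball was created: a single `ERROR:` line anywhere in the output is enough to fail the check.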

@trz42 trz42 marked this pull request as draft May 24, 2024 07:24
@trz42
Collaborator Author

trz42 commented May 24, 2024

Building after applying the changes provided by #579...

bot: build inst:aws repo:eessi.io-2023.06-software arch:zen2


eessi-bot bot commented May 24, 2024

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • received bot command build inst:aws repo:eessi.io-2023.06-software arch:zen2 from trz42

    • expanded format: build instance:aws repository:eessi.io-2023.06-software architecture:zen2
  • handling command build instance:aws repository:eessi.io-2023.06-software architecture:zen2 resulted in:


eessi-bot bot commented May 24, 2024

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command build inst:aws repo:eessi.io-2023.06-software arch:zen2 from trz42

    • expanded format: build instance:aws repository:eessi.io-2023.06-software architecture:zen2
  • handling command build instance:aws repository:eessi.io-2023.06-software architecture:zen2 resulted in:

    • no jobs were submitted


eessi-bot bot commented May 24, 2024

New job on instance eessi-bot-mc-aws for architecture x86_64-amd-zen2 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.05/pr_586/11349

  • failed with the same error (possibly because the environment variable EESSI_OVERRIDE_GPU_CHECK is not set or not passed through to the Prefix shell)
You requested to load UCX-CUDA  which relies on the CUDA runtime environment
and driver libraries. In order to be able to use the module, you will need to
make sure EESSI can find the GPU driver libraries on your host system.
For more information on how to do this, see https://www.eessi.io/docs/gpu/.

While processing the following module(s):
    Module fullname                             Module Filename
    ---------------                             ---------------
    UCX-CUDA/1.14.1-GCCcore-12.3.0-CUDA-12.1.1  /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen2/modules/all/UCX-CUDA/1.14.1-GCCcore-12.3.0-CUDA-12.1.1.lua
  • need to add some code for passing that environment variable into the Prefix shell (see 58120d2)
date job status comment
May 24 08:07:29 UTC 2024 submitted job id 11349 awaits release by job manager
May 24 08:08:30 UTC 2024 released job awaits launch by Slurm scheduler
May 24 08:09:32 UTC 2024 running job 11349 is running
May 24 08:23:46 UTC 2024 finished
😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-11349.out
❌ found message matching ERROR:
❌ found message matching FAILED:
❌ found message matching required modules missing:
✅ found message(s) matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-amd-zen2-1716538578.tar.gz
size: 698 MiB (732497279 bytes)
entries: 75
modules under 2023.06/software/linux/x86_64/amd/zen2/modules/all
cuDNN/8.9.2.26-CUDA-12.1.1.lua
software under 2023.06/software/linux/x86_64/amd/zen2/software
cuDNN/8.9.2.26-CUDA-12.1.1
other under 2023.06/software/linux/x86_64/amd/zen2
2023.06/init/easybuild/eb_hooks.py
2023.06/scripts/gpu_support/nvidia/eessi-2023.06-cuda-and-libraries.yml
2023.06/scripts/gpu_support/nvidia/install_cuda_and_libraries.sh
.lmod/SitePackage.lua
May 24 08:23:46 UTC 2024 test result
😁 SUCCESS (click triangle for details)
ReFrame Summary
[ PASSED ] Ran 10/10 test case(s) from 10 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
✅ job output file slurm-11349.out
❌ found message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

@trz42
Collaborator Author

trz42 commented May 24, 2024

Trying again...

bot: build inst:aws repo:eessi.io-2023.06-software arch:zen2


eessi-bot bot commented May 24, 2024

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • received bot command build inst:aws repo:eessi.io-2023.06-software arch:zen2 from trz42

    • expanded format: build instance:aws repository:eessi.io-2023.06-software architecture:zen2
  • handling command build instance:aws repository:eessi.io-2023.06-software architecture:zen2 resulted in:


eessi-bot bot commented May 24, 2024

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command build inst:aws repo:eessi.io-2023.06-software arch:zen2 from trz42

    • expanded format: build instance:aws repository:eessi.io-2023.06-software architecture:zen2
  • handling command build instance:aws repository:eessi.io-2023.06-software architecture:zen2 resulted in:

    • no jobs were submitted

@trz42
Collaborator Author

trz42 commented May 25, 2024

Cleaned up the code for creating/updating the Lmod cfg files (lmodrc.lua and SitePackage.lua) and reinstated setting EESSI_OVERRIDE_GPU_CHECK=1 (but moved that to bot/build.sh). Thus, the build should now be able to load CUDA/12.1.1 and go on with the installation. We can let this run until it possibly hits the linker error.

bot: build inst:aws repo:eessi.io-2023.06-software arch:zen2


eessi-bot bot commented May 25, 2024

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • received bot command build inst:aws repo:eessi.io-2023.06-software arch:zen2 from trz42

    • expanded format: build instance:aws repository:eessi.io-2023.06-software architecture:zen2
  • handling command build instance:aws repository:eessi.io-2023.06-software architecture:zen2 resulted in:


eessi-bot bot commented May 25, 2024

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command build inst:aws repo:eessi.io-2023.06-software arch:zen2 from trz42

    • expanded format: build instance:aws repository:eessi.io-2023.06-software architecture:zen2
  • handling command build instance:aws repository:eessi.io-2023.06-software architecture:zen2 resulted in:

    • no jobs were submitted


eessi-bot bot commented May 25, 2024

New job on instance eessi-bot-mc-aws for architecture x86_64-amd-zen2 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.05/pr_586/11430

  • this hit a different error (might be a fluke, because the next job went on to build magma successfully):
[ 10%] Building CUDA object CMakeFiles/magma.dir/magmablas/zlarfbx.cu.o
/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen2/software/CUDA/12.1.1/bin/nvcc -forward-unknown-to-host-compiler -DMAGMA_CUDA_ARCH_MIN=700 -DMAGMA_HAVE_CUDA=1 --options-file CMakeFiles/magma.dir/includes_CUDA.rsp -O3 -DNDEBUG -std=c++17 --generate-code=arch=compute_52,code=[compute_52,sm_52] -Xcompiler=-fPIC --compiler-options -fPIC, -MD -MT CMakeFiles/magma.dir/magmablas/zlarfbx.cu.o -MF CMakeFiles/magma.dir/magmablas/zlarfbx.cu.o.d -x cu -c /tmp/bot/easybuild/build/magma/2.7.2/foss-2023a-CUDA-12.1.1/magma-2.7.2/magmablas/zlarfbx.cu -o CMakeFiles/magma.dir/magmablas/zlarfbx.cu.o
sh: 1: cudafe++: not found
make[2]: *** [CMakeFiles/magma.dir/build.make:4390: CMakeFiles/magma.dir/magmablas/zlacpy_sym_in.cu.o] Error 127
make[2]: *** Waiting for unfinished jobs....
sh: 1: cudafe++: not found
make[2]: *** [CMakeFiles/magma.dir/build.make:4405: CMakeFiles/magma.dir/magmablas/zlacpy_sym_out.cu.o] Error 127
sh: 1: cudafe++: not found
make[2]: *** [CMakeFiles/magma.dir/build.make:4420: CMakeFiles/magma.dir/magmablas/zlag2c.cu.o] Error 127
sh: 1: cudafe++: not found
make[2]: *** [CMakeFiles/magma.dir/build.make:4435: CMakeFiles/magma.dir/magmablas/clag2z.cu.o] Error 127
In file included from /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen2/software/CUDA/12.1.1/include/cuda_fp16.h:4019,
                 from /opt/eessi/2023.06/software/linux/x86_64/amd/zen2/software/CUDA/12.1.1/targets/x86_64-linux/include/cublas_api.h:77,
                 from /opt/eessi/2023.06/software/linux/x86_64/amd/zen2/software/CUDA/12.1.1/targets/x86_64-linux/include/cublas_v2.h:69,
                 from /tmp/bot/easybuild/build/magma/2.7.2/foss-2023a-CUDA-12.1.1/magma-2.7.2/include/magma_types.h:72,
                 from /tmp/bot/easybuild/build/magma/2.7.2/foss-2023a-CUDA-12.1.1/magma-2.7.2/include/magma_copy.h:12,
                 from /tmp/bot/easybuild/build/magma/2.7.2/foss-2023a-CUDA-12.1.1/magma-2.7.2/include/magmablas.h:12,
                 from /tmp/bot/easybuild/build/magma/2.7.2/foss-2023a-CUDA-12.1.1/magma-2.7.2/include/magma_v2.h:22,
                 from /tmp/bot/easybuild/build/magma/2.7.2/foss-2023a-CUDA-12.1.1/magma-2.7.2/control/magma_internal.h:63,
                 from /tmp/bot/easybuild/build/magma/2.7.2/foss-2023a-CUDA-12.1.1/magma-2.7.2/magmablas/zlanhe.cu:12:
/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen2/software/CUDA/12.1.1/include/cuda_fp16.hpp:65:10: fatal error: nv/target: No such file or directory
   65 | #include <nv/target>
      |          ^~~~~~~~~~~
compilation terminated.
make[2]: *** [CMakeFiles/magma.dir/build.make:4465: CMakeFiles/magma.dir/magmablas/zlanhe.cu.o] Error 1
sh: 1: cudafe++: not found
date job status comment
May 25 05:05:24 UTC 2024 submitted job id 11430 awaits release by job manager
May 25 05:06:38 UTC 2024 released job awaits launch by Slurm scheduler
May 25 05:07:57 UTC 2024 running job 11430 is running
May 25 05:40:33 UTC 2024 finished
😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-11430.out
❌ found message matching ERROR:
❌ found message matching FAILED:
❌ found message matching required modules missing:
✅ found message(s) matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-amd-zen2-1716615134.tar.gz
size: 698 MiB (732495353 bytes)
entries: 75
modules under 2023.06/software/linux/x86_64/amd/zen2/modules/all
cuDNN/8.9.2.26-CUDA-12.1.1.lua
software under 2023.06/software/linux/x86_64/amd/zen2/software
cuDNN/8.9.2.26-CUDA-12.1.1
other under 2023.06/software/linux/x86_64/amd/zen2
2023.06/init/easybuild/eb_hooks.py
2023.06/scripts/gpu_support/nvidia/eessi-2023.06-cuda-and-libraries.yml
2023.06/scripts/gpu_support/nvidia/install_cuda_and_libraries.sh
.lmod/SitePackage.lua
May 25 05:40:33 UTC 2024 test result
😁 SUCCESS (click triangle for details)
ReFrame Summary
[ PASSED ] Ran 10/10 test case(s) from 10 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
✅ job output file slurm-11430.out
❌ found message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

@trz42
Collaborator Author

trz42 commented May 25, 2024

Commented out code (in eessi_container.sh) that used different "tricks" to disable the GPU check in the Lmod hook. It should still work because we set EESSI_OVERRIDE_GPU_CHECK in bot/build.sh (and we pass it through into the Prefix shell in run_in_compat_layer_env.sh).
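The pass-through mentioned here can be pictured as follows: a child process is started with a nearly empty environment, and only explicitly listed variables are copied from the parent. This is a sketch under the assumption that the Prefix shell starts from a mostly clean environment; the real logic lives in the shell script `run_in_compat_layer_env.sh`, and the function names below are hypothetical.

```python
import subprocess

# variables that should survive the transition into the child environment
PASS_THROUGH = ("EESSI_OVERRIDE_GPU_CHECK",)


def build_child_env(parent_env, pass_through=PASS_THROUGH, base=None):
    """Return a new environment containing `base` plus the selected variables."""
    child = dict(base or {})
    for name in pass_through:
        if name in parent_env:  # only copy variables that are actually set
            child[name] = parent_env[name]
    return child


def run_in_clean_env(cmd, parent_env):
    """Run `cmd` with a minimal environment (PATH plus pass-through variables)."""
    env = build_child_env(parent_env, base={"PATH": parent_env.get("PATH", "")})
    return subprocess.run(cmd, env=env, capture_output=True, text=True, check=True)
```

If the variable is set in the outer script but missing from the pass-through list, the Lmod hook inside the child never sees it, which matches the behaviour observed in job 11349.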

bot: build inst:aws repo:eessi.io-2023.06-software arch:zen2


eessi-bot bot commented May 25, 2024

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • received bot command build inst:aws repo:eessi.io-2023.06-software arch:zen2 from trz42

    • expanded format: build instance:aws repository:eessi.io-2023.06-software architecture:zen2
  • handling command build instance:aws repository:eessi.io-2023.06-software architecture:zen2 resulted in:


eessi-bot bot commented May 25, 2024

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command build inst:aws repo:eessi.io-2023.06-software arch:zen2 from trz42

    • expanded format: build instance:aws repository:eessi.io-2023.06-software architecture:zen2
  • handling command build instance:aws repository:eessi.io-2023.06-software architecture:zen2 resulted in:

    • no jobs were submitted


eessi-bot bot commented May 25, 2024

New job on instance eessi-bot-mc-aws for architecture x86_64-amd-zen2 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.05/pr_586/11431

date job status comment
May 25 05:24:16 UTC 2024 submitted job id 11431 awaits release by job manager
May 25 05:25:06 UTC 2024 released job awaits launch by Slurm scheduler
May 25 05:31:02 UTC 2024 running job 11431 is running
May 25 07:19:16 UTC 2024 finished
😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-11431.out
❌ found message matching ERROR:
❌ found message matching FAILED:
❌ found message matching required modules missing:
✅ found message(s) matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-amd-zen2-1716621051.tar.gz
size: 1000 MiB (1049201591 bytes)
entries: 188
modules under 2023.06/software/linux/x86_64/amd/zen2/modules/all
cuDNN/8.9.2.26-CUDA-12.1.1.lua
magma/2.7.2-foss-2023a-CUDA-12.1.1.lua
software under 2023.06/software/linux/x86_64/amd/zen2/software
cuDNN/8.9.2.26-CUDA-12.1.1
magma/2.7.2-foss-2023a-CUDA-12.1.1
other under 2023.06/software/linux/x86_64/amd/zen2
2023.06/init/easybuild/eb_hooks.py
2023.06/scripts/gpu_support/nvidia/eessi-2023.06-cuda-and-libraries.yml
2023.06/scripts/gpu_support/nvidia/install_cuda_and_libraries.sh
.lmod/SitePackage.lua
May 25 07:19:16 UTC 2024 test result
😁 SUCCESS (click triangle for details)
ReFrame Summary
[ PASSED ] Ran 10/10 test case(s) from 10 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
✅ job output file slurm-11431.out
❌ found message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

@trz42
Collaborator Author

trz42 commented May 25, 2024

Added a pre_configure hook that adds the missing directory containing libcupti.so.12 to LIBRARY_PATH. The build should now proceed further, maybe even finish...
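A minimal sketch of what such a hook does to the environment is shown below. It assumes that `libcupti.so.12` sits under `extras/CUPTI/lib64` of the CUDA installation; the helper names are hypothetical, and the actual hook in `eb_hooks.py` runs inside EasyBuild rather than on a plain dictionary.

```python
import os


def prepend_to_path_var(env, name, directory):
    """Prepend `directory` to the path-like variable `name` in `env` (idempotent)."""
    parts = [p for p in env.get(name, "").split(os.pathsep) if p]
    if directory not in parts:
        parts.insert(0, directory)
    env[name] = os.pathsep.join(parts)
    return env[name]


def add_cupti_to_library_path(env, cuda_root):
    """Make the linker find libcupti by extending LIBRARY_PATH."""
    # assumption: CUPTI libraries live in <cuda_root>/extras/CUPTI/lib64
    cupti_libdir = os.path.join(cuda_root, "extras", "CUPTI", "lib64")
    return prepend_to_path_var(env, "LIBRARY_PATH", cupti_libdir)
```

`LIBRARY_PATH` is consulted by the compiler driver at link time, so extending it only affects the build and does not change what is loaded at run time.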

bot: build inst:aws repo:eessi.io-2023.06-software arch:zen2


eessi-bot bot commented May 25, 2024

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • received bot command build inst:aws repo:eessi.io-2023.06-software arch:zen2 from trz42

    • expanded format: build instance:aws repository:eessi.io-2023.06-software architecture:zen2
  • handling command build instance:aws repository:eessi.io-2023.06-software architecture:zen2 resulted in:


eessi-bot bot commented May 25, 2024

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command build inst:aws repo:eessi.io-2023.06-software arch:zen2 from trz42

    • expanded format: build instance:aws repository:eessi.io-2023.06-software architecture:zen2
  • handling command build instance:aws repository:eessi.io-2023.06-software architecture:zen2 resulted in:

    • no jobs were submitted


eessi-bot bot commented May 25, 2024

New job on instance eessi-bot-mc-aws for architecture x86_64-amd-zen2 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.05/pr_586/11432

  • failed due to an indentation error in eb_hooks.py
date job status comment
May 25 08:49:31 UTC 2024 submitted job id 11432 awaits release by job manager
May 25 08:50:44 UTC 2024 released job awaits launch by Slurm scheduler
May 25 08:55:05 UTC 2024 running job 11432 is running
May 25 09:05:20 UTC 2024 finished
😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-11432.out
❌ found message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
❌ no message matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-amd-zen2-1716627633.tar.gz
size: 0 MiB (15695 bytes)
entries: 4
modules under 2023.06/software/linux/x86_64/amd/zen2/modules/all
no module files in tarball
software under 2023.06/software/linux/x86_64/amd/zen2/software
no software packages in tarball
other under 2023.06/software/linux/x86_64/amd/zen2
2023.06/init/easybuild/eb_hooks.py
2023.06/scripts/gpu_support/nvidia/eessi-2023.06-cuda-and-libraries.yml
2023.06/scripts/gpu_support/nvidia/install_cuda_and_libraries.sh
.lmod/SitePackage.lua
May 25 09:05:20 UTC 2024 test result
😁 SUCCESS (click triangle for details)
ReFrame Summary
[ PASSED ] Ran 10/10 test case(s) from 10 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
✅ job output file slurm-11432.out
❌ found message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case
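A failure like the one in job 11432 (a syntax problem in `eb_hooks.py`) can be caught locally before a build job is submitted. The sketch below compiles the source without executing it; the function name is hypothetical. The standard library offers the same check from the command line via `python -m py_compile eb_hooks.py`.

```python
def check_python_source(source, filename="eb_hooks.py"):
    """Return None if `source` compiles, else a short error description."""
    try:
        # compile only; the hooks are not imported or executed
        compile(source, filename, "exec")
    except SyntaxError as err:  # IndentationError is a subclass of SyntaxError
        return f"{filename}:{err.lineno}: {type(err).__name__}: {err.msg}"
    return None


if __name__ == "__main__":
    broken = "def pre_configure_hook(self):\nreturn None\n"
    print(check_python_source(broken))
```

Running such a check takes well under a second, compared with the roughly ten minutes the failed job spent before reporting the error.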

@trz42
Collaborator Author

trz42 commented May 25, 2024

Try again...

bot: build inst:aws repo:eessi.io-2023.06-software arch:zen2


eessi-bot bot commented May 25, 2024

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • received bot command build inst:aws repo:eessi.io-2023.06-software arch:zen2 from trz42

    • expanded format: build instance:aws repository:eessi.io-2023.06-software architecture:zen2
  • handling command build instance:aws repository:eessi.io-2023.06-software architecture:zen2 resulted in:


eessi-bot bot commented May 25, 2024

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command build inst:aws repo:eessi.io-2023.06-software arch:zen2 from trz42

    • expanded format: build instance:aws repository:eessi.io-2023.06-software architecture:zen2
  • handling command build instance:aws repository:eessi.io-2023.06-software architecture:zen2 resulted in:

    • no jobs were submitted


eessi-bot bot commented May 25, 2024

New job on instance eessi-bot-mc-aws for architecture x86_64-amd-zen2 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.05/pr_586/11433

date job status comment
May 25 09:11:24 UTC 2024 submitted job id 11433 awaits release by job manager
May 25 09:12:41 UTC 2024 released job awaits launch by Slurm scheduler
May 25 09:14:01 UTC 2024 running job 11433 is running
May 25 18:49:05 UTC 2024 finished
😁 SUCCESS (click triangle for details)
Details
✅ job output file slurm-11433.out
✅ no message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
✅ found message(s) matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-amd-zen2-1716662368.tar.gz
size: 1208 MiB (1267078707 bytes)
entries: 12927
modules under 2023.06/software/linux/x86_64/amd/zen2/modules/all
cuDNN/8.9.2.26-CUDA-12.1.1.lua
magma/2.7.2-foss-2023a-CUDA-12.1.1.lua
PyTorch/2.1.2-foss-2023a-CUDA-12.1.1.lua
software under 2023.06/software/linux/x86_64/amd/zen2/software
cuDNN/8.9.2.26-CUDA-12.1.1
magma/2.7.2-foss-2023a-CUDA-12.1.1
PyTorch/2.1.2-foss-2023a-CUDA-12.1.1
other under 2023.06/software/linux/x86_64/amd/zen2
2023.06/init/easybuild/eb_hooks.py
2023.06/scripts/gpu_support/nvidia/eessi-2023.06-cuda-and-libraries.yml
2023.06/scripts/gpu_support/nvidia/install_cuda_and_libraries.sh
.lmod/SitePackage.lua
May 25 18:49:05 UTC 2024 test result
😁 SUCCESS (click triangle for details)
ReFrame Summary
[ PASSED ] Ran 10/10 test case(s) from 10 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
✅ job output file slurm-11433.out
✅ no message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

@trz42
Collaborator Author

trz42 commented May 25, 2024

Now, try building for multiple compute capabilities (6.0,6.1,7.0,7.5,8.0,8.6,8.9,9.0)...

bot: build inst:aws repo:eessi.io-2023.06-software arch:zen2


eessi-bot bot commented May 25, 2024

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • received bot command build inst:aws repo:eessi.io-2023.06-software arch:zen2 from trz42

    • expanded format: build instance:aws repository:eessi.io-2023.06-software architecture:zen2
  • handling command build instance:aws repository:eessi.io-2023.06-software architecture:zen2 resulted in:


eessi-bot bot commented May 25, 2024

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command build inst:aws repo:eessi.io-2023.06-software arch:zen2 from trz42

    • expanded format: build instance:aws repository:eessi.io-2023.06-software architecture:zen2
  • handling command build instance:aws repository:eessi.io-2023.06-software architecture:zen2 resulted in:

    • no jobs were submitted


eessi-bot bot commented May 25, 2024

New job on instance eessi-bot-mc-aws for architecture x86_64-amd-zen2 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.05/pr_586/11437

date job status comment
May 25 19:27:39 UTC 2024 submitted job id 11437 awaits release by job manager
May 25 19:28:29 UTC 2024 released job awaits launch by Slurm scheduler
May 25 19:29:49 UTC 2024 running job 11437 is running
May 26 06:30:09 UTC 2024 finished
😁 SUCCESS (click triangle for details)
Details
✅ job output file slurm-11437.out
✅ no message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
✅ found message(s) matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-amd-zen2-1716704446.tar.gz
size: 1634 MiB (1714103717 bytes)
entries: 12927
modules under 2023.06/software/linux/x86_64/amd/zen2/modules/all
cuDNN/8.9.2.26-CUDA-12.1.1.lua
magma/2.7.2-foss-2023a-CUDA-12.1.1.lua
PyTorch/2.1.2-foss-2023a-CUDA-12.1.1.lua
software under 2023.06/software/linux/x86_64/amd/zen2/software
cuDNN/8.9.2.26-CUDA-12.1.1
magma/2.7.2-foss-2023a-CUDA-12.1.1
PyTorch/2.1.2-foss-2023a-CUDA-12.1.1
other under 2023.06/software/linux/x86_64/amd/zen2
2023.06/init/easybuild/eb_hooks.py
2023.06/scripts/gpu_support/nvidia/eessi-2023.06-cuda-and-libraries.yml
2023.06/scripts/gpu_support/nvidia/install_cuda_and_libraries.sh
.lmod/SitePackage.lua
May 26 06:30:09 UTC 2024 test result
😁 SUCCESS (click triangle for details)
ReFrame Summary
[ PASSED ] Ran 10/10 test case(s) from 10 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
✅ job output file slurm-11437.out
✅ no message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

@trz42 trz42 removed the help wanted Extra attention is needed label Jun 18, 2024
@trz42
Collaborator Author

trz42 commented Jun 18, 2024

Currently not actively being worked on because we need to rework/implement support for building for GPUs, which also depends on support for dev.eessi.io.

Labels
2023.06-software.eessi.io 2023.06 version of software.eessi.io accel:nvidia

3 participants