The SI team delivered a way to use a hybrid decomposition per component. For a CPU/GPU hybrid, this means we could run the CPU-only component on small domains to maximize L2/L3 cache use, while on the GPU we would use a bigger domain and keep everything in VRAM to maximize bandwidth.
This task is to benchmark (and lightly validate) the decomposition for the dycore and/or moist.
Benchmark CPU/GPU hybrid decomposition vs Fortran and vs single-domain NDSL
Details per Hamid's email (the code might be merged by the time we get to the task):
Hi Florian,
After we resolved the layout reproducibility issue, the mixed hybrid code is now ready.
It works correctly (zero diff with the baseline) but needs some tuning – a known issue.
We can share screen when you have time to go over.
To get the code, you can do:
- mepo clone [email protected]:GEOS-ESM/GEOSgcm
- cd GEOSgcm
- mepo checkout-if-exists feature/aoloso/hybrid_112923
- mepo develop fvdycore
You build as usual.
To configure a run, please take a look at AGCM.rc in /discover/nobackup/aoloso/geos_hybrid5/c48_hybrid for a run that uses 3 OpenMP threads per MPI process in dyncore gridcomp.
The run uses a total of 36 PEs. On the dyncore side you have 12 MPI ranks x 3 OpenMP threads. Everywhere else you have 36 MPI ranks.
A more interesting run is in c720_splitField in the same directory. That run is configured to use 4 threads for dyncore. It uses 2400 PEs – 600 MPI ranks x 4 OpenMP threads for dyncore, 2400 MPI ranks everywhere else.
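The PE accounting in the two runs above can be checked with a short sketch. The function name `dycore_ranks` is hypothetical and is not part of GEOS or NDSL; the sketch only assumes that the total PE count is shared between both sides and that the dyncore side groups PEs into MPI ranks of equal thread count, as the email describes.

```python
def dycore_ranks(total_pes: int, threads_per_rank: int) -> int:
    """Return the number of dyncore MPI ranks when each rank drives
    `threads_per_rank` OpenMP threads and all PEs are used.

    Hypothetical helper for illustration only.
    """
    if threads_per_rank <= 0:
        raise ValueError("threads_per_rank must be positive")
    if total_pes % threads_per_rank != 0:
        raise ValueError("total PEs must be a multiple of the thread count")
    return total_pes // threads_per_rank


# The two configurations quoted in the email: (run name, total PEs, threads).
runs = [("c48_hybrid", 36, 3), ("c720_splitField", 2400, 4)]

for name, pes, threads in runs:
    ranks = dycore_ranks(pes, threads)
    print(
        f"{name}: dyncore = {ranks} MPI ranks x {threads} OpenMP threads, "
        f"everywhere else = {pes} MPI ranks"
    )
```

For the benchmark, the same arithmetic gives the dyncore rank count to request when trying other thread counts on a fixed PE budget.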
There are restrictions on how to chop the cubed sphere into subdomains. Checks are in the code to catch violations with explanations.
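The email does not list the restrictions, so the following is only a rough sketch of the kind of check that could be run before submitting a job. It assumes the generic cubed-sphere constraints: the MPI ranks are spread evenly over the 6 tiles, and each tile is split into an `nx` x `ny` grid. It further assumes, as a simplification, that `nx` and `ny` divide the tile resolution evenly. The actual checks in the hybrid branch may differ, and `tile_layouts` is a hypothetical name.

```python
def tile_layouts(resolution: int, mpi_ranks: int) -> list[tuple[int, int]]:
    """List (nx, ny) per-tile layouts for a cubed sphere of the given
    resolution (48 for c48) under the assumptions stated above.

    Returns an empty list when no layout satisfies them.
    """
    tiles = 6  # a cubed sphere always has six faces
    if mpi_ranks <= 0 or mpi_ranks % tiles != 0:
        return []
    per_tile = mpi_ranks // tiles
    layouts = []
    for nx in range(1, per_tile + 1):
        if per_tile % nx != 0:
            continue
        ny = per_tile // nx
        # Assumed restriction: subdomains have equal size in both directions.
        if resolution % nx == 0 and resolution % ny == 0:
            layouts.append((nx, ny))
    return layouts


# Dyncore side and "everywhere else" for the c48 run quoted in the email.
print(tile_layouts(48, 12))
print(tile_layouts(48, 36))
```

A hybrid configuration needs a valid layout on both sides, so a check like this would be run once per rank count.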