[develop] Build conda and environments in SRW. #938

christinaholtNOAA · 2023-10-10T22:57:38Z

DESCRIPTION OF CHANGES:

Modifies devbuild.sh to add the option to install miniforge (a version of miniconda that manages channels more strictly) in a user-specified location, defaulting to inside the user's clone. It also installs two environments needed for SRW -- srw_app, which is similar to the old workflow_tools environment, and srw_graphics, which is sufficient to support the plotting scripts in SRW.

A few additional details:

Does the conda installation right away so that MacOS users have the necessary bash utilities like readlink to use the devbuild.sh script.
Requires the user to provide the conda target to build conda -- it doesn't build by default.
Adds a conda module file that points to the user-installed location of miniconda.
Modifies the GitHub Actions workflows to use the same environments that are built by devbuild.sh.

Type of change

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
This change requires a documentation update

TESTS CONDUCTED:

~~Test suite is still pending on Hera.~~ Edit: The fundamental test suite has passed on Hera.

I also tested the conda installation bits on MacOS, but did not carry out an entire build. I did confirm that the environments were installed and could support the readlink utility.

This needs tests on all platforms, most likely.

DEPENDENCIES:

None.

DOCUMENTATION:

I updated the docs with this PR.

ISSUE:

N/A

CHECKLIST

My code follows the style guidelines in the Contributor's Guide
I have performed a self-review of my own code using the Code Reviewer's Guide
I have commented my code, particularly in hard-to-understand areas
My changes need updates to the documentation. I have made corresponding changes to the documentation
My changes do not require updates to the documentation (explain).
My changes generate no new warnings
New and existing tests pass with my changes
Any dependent changes have been merged and published

christinaholtNOAA · 2023-10-31T15:54:39Z

@MichaelLueken After giving this AQM issue some thought, I've reached out to @chan-hoo via email to ask some questions about whether AQM can tolerate this sort of change in develop. I will come back to the solution once we've had a chance to work out those kinks. Sorry for the delay here, and I'm glad you caught this!

MichaelLueken · 2023-11-01T16:35:08Z

@christinaholtNOAA - Given @chan-hoo's reply to your email, I was able to successfully build your build_conda branch on Hera with the -a=ATMAQ option and was able to run the aqm_grid_AQM_NA13km_suite_GFS_v16 test:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
aqm_grid_AQM_NA13km_suite_GFS_v16                                  COMPLETE            1083.12
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE            1083.12

What you currently have should be fine, but we will need to make sure that nothing is done to build the srw_app conda environment on WCOSS2.

MichaelLueken · 2023-11-10T17:15:56Z

@christinaholtNOAA - It should be noted that this test was run using the current settings in modulefiles/tasks/hera/miniconda_regional_workflow_cmaq.lua.

I have attempted to run the AQM WE2E test once again, replacing:

prepend_path("MODULEPATH","/scratch1/NCEPDEV/nems/role.epic/miniconda3/modulefiles")
load(pathJoin("miniconda3", os.getenv("miniconda3_ver") or "4.12.0"))

setenv("SRW_ENV", "regional_workflow_cmaq")

with:

load("conda")
setenv("SRW_ENV", "srw_app")

This time, the test failed in the following tasks:
point_source, nexus_emission_00, nexus_emission_01, and nexus_emission_02

The following traceback is in the point_source log files:

Traceback (most recent call last):
  File "/scratch2/NAGAPE/epic/Michael.Lueken/ufs-srweather-app/sorc/AQM-utils/python_utils/stack-pt-merge.py", line 12, in <module>
    import netCDF4 as nc
  File "/scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.4.1/envs/unified-env/install/intel/2021.5.0/py-netcdf4-1.5.3-ofq7pt3/lib/python3.9/site-packages/netCDF4/__init__.py", line 3, in <module>
    from ._netCDF4 import *
ModuleNotFoundError: No module named 'netCDF4._netCDF4'

The following traceback is in the nexus_emission_* log files:

Traceback (most recent call last):
  File "/scratch2/NAGAPE/epic/Michael.Lueken/ufs-srweather-app/sorc/arl_nexus/utils/python/make_nexus_output_pretty.py", line 144, in <module>
    raise SystemExit(main(**parse_args()))
                     ^^^^^^^^^^^^^^^^^^^^
  File "/scratch2/NAGAPE/epic/Michael.Lueken/ufs-srweather-app/sorc/arl_nexus/utils/python/make_nexus_output_pretty.py", line 50, in main
    import netCDF4 as nc
  File "/scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.4.1/envs/unified-env/install/intel/2021.5.0/py-netcdf4-1.5.3-ofq7pt3/lib/python3.9/site-packages/netCDF4/__init__.py", line 3, in <module>
    from ._netCDF4 import *
ModuleNotFoundError: No module named 'netCDF4._netCDF4'

So, continuing to use the current regional_workflow_cmaq conda environment works, but updating to srw_app leads to failures due to missing netCDF4 in the conda environment.
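The tracebacks above show netCDF4 being picked up from the spack-stack site-packages directory rather than from the conda environment. A quick way to check which copy of a module Python would import is sketched below. This is only an illustrative diagnostic and not part of the PR: the helper name `module_location` is hypothetical, and comparing the module's file against `sys.prefix` (the root of the active environment) is an assumption about what "inside the conda environment" means.

```python
import importlib.util
import os
import sys


def module_location(name, prefix=None):
    """Return (origin, inside_prefix) for a top-level module.

    origin is the file Python would load for `name`, or None if the
    module cannot be found or has no file on disk (e.g. built-ins).
    inside_prefix tells whether that file lives under `prefix`, which
    defaults to sys.prefix, the root of the active environment.
    """
    prefix = os.path.realpath(prefix or sys.prefix)
    spec = importlib.util.find_spec(name)
    if spec is None or not spec.origin or not os.path.isabs(spec.origin):
        return None, False
    origin = os.path.realpath(spec.origin)
    # commonpath equals the prefix only when origin sits beneath it.
    inside = os.path.commonpath([origin, prefix]) == prefix
    return origin, inside
```

Calling `module_location("netCDF4")` inside a task's environment would return the path that gets imported and whether it sits under the active environment; a result of `False` would suggest that another entry on the module search path (for example, one added through PYTHONPATH) is taking precedence.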

christinaholtNOAA · 2023-11-21T14:46:26Z

My apologies for the huge delay on getting back to the comments here. I have modified the code to install a srw_aqm environment when the AQM application is being built. I tested that on Hera and found that the test no longer worked -- data was unavailable. I'm now second-guessing that test though since it was an older version. I did confirm that the appropriate environment was being loaded for the tasks that were run.

I updated my branch to the top of develop, did a fresh build, and successfully ran the fundamental tests on Hera again.

christinaholtNOAA · 2023-11-21T14:47:07Z

I'm actually going to re-try that AQM test now.

christinaholtNOAA · 2023-11-21T18:54:01Z

I was able to more successfully run the AQM test and fixed the issue that @MichaelLueken reported above. My test is now failing in the aqm_lbcs task because it can't find data on Hera. Many other AQM tasks ran successfully and loaded the appropriate environment with NetCDF4.

MichaelLueken

@christinaholtNOAA -

Thank you very much for addressing my concerns regarding WCOSS2 and the AQM WE2E test! Unfortunately, it looks like the old WE2E staged data was overwritten with a much newer case, so the aqm_lbcs task will fail. As you have noted, however, following your latest updates, all of the rest of the AQM-specific tasks appear to be running smoothly now, so I'm fine with these changes. I have noted one update to the scripts/exregional_make_grid.sh script that appears to have been made for debugging purposes. If this is the case, then it would be a good idea to go ahead and remove this line from the script.

I will go ahead and approve this work now and I will submit the Jenkins tests in the morning.

scripts/exregional_make_grid.sh

MichaelLueken

@christinaholtNOAA -

Unfortunately, while working through the conversations and resolving those that I started, I noted the failure of the grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16 fundamental WE2E test on Orion. A rerun of this test on Orion this morning shows that the test is still failing. It should be noted that this is only an issue on Orion; the rest of the RDHPCS machines successfully run the fundamental WE2E tests without issue.

The test is failing in the run_MET_Pb2nc_obs task with the following error message in the log:

/work/noaa/epic/role-epic/spack-stack/spack-stack-1.4.1/envs/unified-env/install/intel/2022.0.2/met-10.1.1-eiz3e5i/bin/pb2nc: error while loading shared libraries: libpython3.9.so.1.0: cannot open shared object file: No such file or directory

This error message indicates that the pb2nc executable needs the Python shared library (libpython3.9.so.1.0) at run time, so a Python module must be loaded while running the verification tasks. Adding load("stack-python/3.9.7") back into build_orion_intel.lua allows the test to run through to completion without issue.
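One way to confirm which shared libraries a binary such as pb2nc cannot resolve in a given environment is to inspect the output of ldd. The sketch below is illustrative only and was not used in this PR: the function names are hypothetical, and it assumes a Linux system where ldd is available and reports unresolved libraries in the form "name => not found".

```python
import subprocess


def missing_libraries(ldd_output):
    """Return the names of libraries that ldd reports as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        name, sep, target = line.strip().partition(" => ")
        if sep and target.strip() == "not found":
            missing.append(name)
    return missing


def check_binary(path):
    """Run ldd on `path` and return the unresolved library names."""
    result = subprocess.run(
        ["ldd", path], capture_output=True, text=True, check=True
    )
    return missing_libraries(result.stdout)
```

Running `check_binary` on the pb2nc path from the log, once with and once without the stack-python module loaded, would show whether libpython3.9.so.1.0 is the only library that goes missing.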

The current develop branch fundamental WE2E tests pass on Orion, so I will not be able to merge this PR until the issue with this test is resolved. Once this correction has been made, I will re-approve this PR and submit the automated Jenkins tests.

christinaholtNOAA · 2023-11-27T14:55:08Z

@MichaelLueken Thanks for all the help with debugging other platforms.

I've added back the stack-python that was removed for Orion, but did it in the run_vx local module file for just those tasks. I was noticing that the stack python environment was interacting poorly with the conda environment on some tasks (when netcdf is expected, for example), so I want to limit the interaction between stack and conda environments as much as possible. Judging from this failure, it does look necessary for verification.

If you don't mind re-running that test again, it would be super helpful. Thanks!

MichaelLueken · 2023-11-27T15:08:00Z

Thanks, @christinaholtNOAA! Rerunning the test now.

MichaelLueken

@christinaholtNOAA -

Thank you very much for working with me to get these final changes in! The grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16 WE2E test is now successfully running on Orion:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta    COMPLETE              10.41
nco_grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_timeoffset_suite_  COMPLETE              11.26
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2        COMPLETE               8.58
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v17_p8_plot  COMPLETE              16.12
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_HRRR          COMPLETE              26.35
grid_SUBCONUS_Ind_3km_ics_HRRR_lbcs_RAP_suite_WoFS_v0              COMPLETE              13.75
grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16                COMPLETE              21.56
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE             108.03

Once again approving this PR. I will go ahead and launch the Jenkins tests as well. I'd like for @gspetro-NOAA to provide one last look to ensure that she is happy, then I'll be able to merge this PR. Thanks again!

MichaelLueken · 2023-11-27T17:24:30Z

@christinaholtNOAA -

The Functional UnitTests have failed on Hera GNU, Hera Intel, and Jet. The failure appears to occur because the Functional UnitTests phase of the pipeline, which runs the .cicd/scripts/srw_unittest.sh script, executes before the Build phase, so the srw_app conda environment does not exist yet. It looks like the Functional UnitTests phase will need to be moved to after the Build phase in .cicd/Jenkinsfile in order to ensure that the necessary conda environment is available before running the unit tests.

Additionally, the Functional WorkflowTaskTests phase in the pipeline, which is run using the .cicd/scripts/srw_ftest.sh script, failed because the conda activate is hardwired into this script:
conda activate workflow_tools
Please replace workflow_tools with srw_app in this script.

MichaelLueken · 2023-11-27T20:41:39Z

The WE2E coverage tests were manually run on Derecho and all successfully passed:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
custom_ESGgrid_IndianOcean_6km                                     COMPLETE              24.18
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot     COMPLETE              38.53
grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16                COMPLETE              46.45
grid_RRFS_CONUScompact_13km_ics_HRRR_lbcs_RAP_suite_HRRR           COMPLETE              44.70
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta    COMPLETE              18.82
grid_SUBCONUS_Ind_3km_ics_HRRR_lbcs_HRRR_suite_HRRR                COMPLETE              41.38
nco_grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_timeoffset_suite_  COMPLETE              24.88
pregen_grid_orog_sfc_climo                                         COMPLETE              17.04
specify_template_filenames                                         COMPLETE              19.63
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE             275.61

MichaelLueken · 2023-11-28T14:26:49Z

The only failures in the automated Jenkins tests came from Hera Intel:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used
----------------------------------------------------------------------------------------------------
custom_ESGgrid_Central_Asia_3km                                    COMPLETE              24.51
get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_grib2_2019061200          COMPLETE               6.04
get_from_HPSS_ics_GDAS_lbcs_GDAS_fmt_netcdf_2022040400_ensemble_2  COMPLETE             760.78
get_from_HPSS_ics_HRRR_lbcs_RAP                                    COMPLETE              13.83
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2        DEAD                   4.02
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot     DEAD                   3.94
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_RAP_suite_RAP                 COMPLETE              10.43
grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v15p2        COMPLETE               6.22
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2         DEAD                 112.91
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16           COMPLETE             305.82
grid_RRFS_CONUScompact_3km_ics_HRRR_lbcs_RAP_suite_HRRR            COMPLETE             324.28
pregen_grid_orog_sfc_climo                                         COMPLETE               7.04
----------------------------------------------------------------------------------------------------
Total                                                              DEAD                1579.82

The use of rocotorewind/rocotoboot allowed the three tests to pass:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
custom_ESGgrid_Central_Asia_3km                                    COMPLETE              24.51
get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_grib2_2019061200          COMPLETE               6.04
get_from_HPSS_ics_GDAS_lbcs_GDAS_fmt_netcdf_2022040400_ensemble_2  COMPLETE             760.78
get_from_HPSS_ics_HRRR_lbcs_RAP                                    COMPLETE              13.83
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2        COMPLETE               9.03
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot     COMPLETE              14.72
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_RAP_suite_RAP                 COMPLETE              10.43
grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v15p2        COMPLETE               6.22
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2         COMPLETE             236.20
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16           COMPLETE             305.82
grid_RRFS_CONUScompact_3km_ics_HRRR_lbcs_RAP_suite_HRRR            COMPLETE             324.28
pregen_grid_orog_sfc_climo                                         COMPLETE               7.04
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE            1718.90

I would like to check with @gspetro-NOAA to ensure that all of her concerns have been addressed, then I will move forward with merging this work.

docs/UsersGuide/source/BackgroundInfo/TechnicalOverview.rst

docs/UsersGuide/source/BuildingRunningTesting/ContainerQuickstart.rst

gspetro-NOAA

Looks good to me! I ran the fundamental tests on Jet for good measure, and they pass. :)

christinaholtNOAA and others added 21 commits October 6, 2023 22:02

WIP.

5a38a5d

Make the fundamental tests run.

2a5b05a

Modify the rest of the modulefiles accordingly.

aeccb68

Modifying the GHA workflows to use the environment

4611e70

Merge remote-tracking branch 'origin' into build_conda

c8f4cf9

Attempt to fix mamba

4bdf0bc

Fix autosyntax error.

c9735bc

Cleanup old flags.

64c5369

Try to install env with mamba install.

255765c

Try better default shell for activating mamba

e6bb3a3

Perhaps a typo in the micromamba docs?

4421031

Don't separate jobs

9ee47c3

Reorder and cleanup.

0f8b58a

Update docs.

66202d7

Try different syntax

5c644d8

Check if this limits this to only mac.

d3329d4

Revert "Check if this limits this to only mac."

a0e047e

This reverts commit d3329d4.

Fixes for building on MacOs.

17c5af2

Merge remote-tracking branch 'ufs/develop' into build_conda

16796b6

Configurable conda location.

2fcc2b2

Make it MacOS compatible.

d11254c

christinaholtNOAA requested review from mkavulich, gsketefian, JeffBeck-NOAA, RatkoVasic-NOAA, BenjaminBlake-NOAA, ywangwof, chan-hoo, panll and christopherwharrop-noaa as code owners October 10, 2023 22:57

Update docs/UsersGuide/source/BuildingRunningTesting/BuildSRW.rst

e4e9a27

Co-authored-by: Michael Lueken <[email protected]>

christinaholtNOAA added 4 commits November 20, 2023 13:39

Adding AQM to the supported environments

47860e5

Adding aqm environment file.

59039e8

Fix a typo for creating environment.

a358c52

Merge remote-tracking branch 'ufs/develop' into build_conda

65b05b7

christinaholtNOAA added 2 commits November 21, 2023 14:47

Merge branch 'build_conda' of https://github.com/christinaholtNOAA/uf…

18389b9

…s-srweather-app into build_conda

Fixing python environments for AQM.

fff3a25

MichaelLueken approved these changes Nov 21, 2023

scripts/exregional_make_grid.sh (outdated, resolved)

MichaelLueken requested changes Nov 22, 2023

Address review comments.

387373e

MichaelLueken approved these changes Nov 27, 2023

MichaelLueken added the run_we2e_coverage_tests label (Run the coverage set of SRW end-to-end tests) Nov 27, 2023

Changes for Jenkins pipeline failures.

1e3a8f9

gspetro-NOAA reviewed Nov 28, 2023

docs/UsersGuide/source/BackgroundInfo/TechnicalOverview.rst (outdated, resolved)

gspetro-NOAA reviewed Nov 28, 2023

docs/UsersGuide/source/BuildingRunningTesting/ContainerQuickstart.rst (outdated, resolved)

Updating docs.

e9b6318

gspetro-NOAA approved these changes Nov 29, 2023

MichaelLueken merged commit eecfbdd into ufs-community:develop Nov 29, 2023
1 check passed

MichaelLueken mentioned this pull request Nov 30, 2023

Updates to the authoritative develop branch since the SRW v2.2 release (10/31/2023) #981

Open
