Update Externals.cfg to cesm2_3_beta17 and remove mct #2539

Merged
merged 46 commits into ESCOMP:master from upd_externals_to_beta17
May 28, 2024

Conversation

slevis-lmwg
Contributor

@slevis-lmwg slevis-lmwg commented May 9, 2024

Description of changes

Specifics listed in issues
#2493 upd. externals to beta17
#2294 remove mct, but not entirely, so I'm removing this issue from the "fixed" list
#2546 fix error in cam4/cam5 test (unrelated)
#2279 Retire the /test/tools framework for CESM test system custom tests that do the same thing

Specific notes

Contributors other than yourself, if any:
@ekluzek @jedwards4b @billsacks

CTSM Issues Fixed (include github issue #):
Fixes #2493
Fixes #2546
Fixes #2279

Are answers expected to change (and if so in what way)?
No

Any User Interface Changes (namelist or namelist defaults changes)?
Yes, and it was done.

Does this create a need to change or add documentation? Did you do so?
I don't think so.

Testing performed, if any:
To play it safe, I will run the following tests:
PASS ./build-namelist_test.pl
PASS python tests -u and -s
PASS make black (make lint gives minor complaint but perfect score)
OK aux_clm on derecho (first time result; see subsequent results below)

@slevis-lmwg slevis-lmwg self-assigned this May 9, 2024
@slevis-lmwg slevis-lmwg added the "enhancement" (new capability or improved behavior of existing capability) and "code health" (improving internal code structure to make easier to maintain (sustainability)) labels May 9, 2024
@slevis-lmwg slevis-lmwg changed the title from "Update Externals.cfg to cesm2_3_beta17" to "Update Externals.cfg to cesm2_3_beta17 and remove mct" May 9, 2024
@slevis-lmwg
Contributor Author

Next I will go through the checklist in #2294 and rerun tests.

@slevis-lmwg
Contributor Author

git grep -i mct now returns very few matches.

Question: May I remove (or rename) this:

src/main/glc2lndMod.F90:     procedure, public  :: set_glc2lnd_fields_mct   ! set coupling fields sent from glc to lnd
src/main/glc2lndMod.F90:  subroutine set_glc2lnd_fields_mct(this, bounds, glc_present, x2l, &
src/main/glc2lndMod.F90:    character(len=*), parameter :: subname = 'set_glc2lnd_fields_mct'
src/main/glc2lndMod.F90:  end subroutine set_glc2lnd_fields_mct

@slevis-lmwg
Contributor Author

slevis-lmwg commented May 10, 2024

aux_clm
derecho FAIL. The two baseline diffs may be due to the new PE layouts, though the PE layouts differ for all the tests. The first three failures seem to be caused by the new externals. I will review these with Erik.

FAIL FUNITCTSM_P1x1.f10_f10_mg37.I2000Clm50Sp.derecho_intel RUN time=16
FAIL LILACSMOKE_D_Ld2.f10_f10_mg37.I2000Ctsm50NwpSpAsRs.derecho_intel.clm-lilac RUN time=0
FAIL SMS_D.f10_f10_mg37.I2000Clm60BgcCrop.derecho_nvhpc.clm-crop SETUP
FAIL SMS.f10_f10_mg37.I2000Clm50BgcCrop.derecho_nvhpc.clm-crop BASELINE ctsm5.2.004: DIFF
FAIL SMS.f45_f45_mg37.I2000Clm60FatesSpRsGs.derecho_nvhpc.clm-FatesColdSatPhen BASELINE ctsm5.2.004: DIFF
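
When triaging a list like the one above, it can help to group the failures by phase and by compiler. The following is a minimal Python sketch, not CIME's own parser: the field layout (`STATUS test_name PHASE [details]`) and the dot-separated test-name layout are inferred only from the lines shown in this comment, and `parse_status_line` is a hypothetical helper name.

```python
from collections import Counter

# Status lines copied from the aux_clm report above.
STATUS_LINES = """\
FAIL FUNITCTSM_P1x1.f10_f10_mg37.I2000Clm50Sp.derecho_intel RUN time=16
FAIL LILACSMOKE_D_Ld2.f10_f10_mg37.I2000Ctsm50NwpSpAsRs.derecho_intel.clm-lilac RUN time=0
FAIL SMS_D.f10_f10_mg37.I2000Clm60BgcCrop.derecho_nvhpc.clm-crop SETUP
FAIL SMS.f10_f10_mg37.I2000Clm50BgcCrop.derecho_nvhpc.clm-crop BASELINE ctsm5.2.004: DIFF
FAIL SMS.f45_f45_mg37.I2000Clm60FatesSpRsGs.derecho_nvhpc.clm-FatesColdSatPhen BASELINE ctsm5.2.004: DIFF
"""


def parse_status_line(line):
    """Split one 'STATUS test_name PHASE [details]' line into a dict.

    Assumed layout (inferred from the lines above, not from CIME docs):
    test names look like type.grid.compset.machine_compiler[.testmods].
    """
    status, test_name, phase, *details = line.split()
    parts = test_name.split(".")
    machine, compiler = parts[3].split("_", 1)
    return {
        "status": status,
        "test": test_name,
        "phase": phase,
        "details": " ".join(details),
        "machine": machine,
        "compiler": compiler,
    }


failures = [parse_status_line(l) for l in STATUS_LINES.splitlines() if l.strip()]
by_phase = Counter(f["phase"] for f in failures)
by_compiler = Counter(f["compiler"] for f in failures)

print(dict(by_phase))
print(dict(by_compiler))
```

Grouping this way makes it easy to see that the SETUP failure and both BASELINE diffs are all nvhpc tests, while the two RUN failures are intel tests.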

izumi FAIL
Several nag tests fail with

Runtime Error: /fs/cgd/data0/slevis/git/latest_master_new/components/cmeps/cime_config/../mediator/med_phases_restart_mod.F90, line 350: INTEGER(int32) overflow for 538976288 * 538976288
Program terminated by fatal error
/fs/cgd/data0/slevis/git/latest_master_new/components/cmeps/cime_config/../mediator/med_phases_restart_mod.F90, line 350: Error occurred in MED_PHASES_RESTART_MOD:MED_PHASES_RESTART_WRITE
/fs/cgd/data0/slevis/git/latest_master_new/components/cmeps/cime_config/../cesm/driver/esmApp.F90, line 141: Called by ESMAPP
[i027.cgd.ucar.edu:mpi_rank_0][error_sighandler] Caught error: Aborted (signal 6)
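
One observation that may help when reading this error: the operand 538976288 is 0x20202020 in hexadecimal, which is the bit pattern of four ASCII space characters. The short Python check below only verifies that arithmetic. The interpretation, that blank-filled character data or uninitialized memory might be getting read as an integer at that line, is an assumption and is not confirmed anywhere in this thread.

```python
INT32_MAX = 2**31 - 1

value = 538976288  # operand reported in the NAG runtime error above

# Four ASCII spaces (0x20 each) read as one 32-bit integer give the same
# value; byte order does not matter because all four bytes are identical.
blank_word = int.from_bytes(b"    ", byteorder="little")

print(hex(value))                 # hexadecimal form of the operand
print(blank_word == value)        # does it match four packed spaces?
print(value * value > INT32_MAX)  # does the product exceed the int32 range?
```

Python integers are arbitrary precision, so the product is computed exactly here; the comparison against `INT32_MAX` simply shows why a 32-bit multiply of these operands cannot be represented.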

@slevis-lmwg

This comment was marked as resolved.

@fischer-ncar
Contributor

@slevis-lmwg the SETUP failure looks like an issue with the debug libraries missing for ESMF and pio for nvhpc/24.3.

cannot be loaded as requested: "parallelio/2.6.2-debug", "esmf/8.6.0-debug"

@jedwards4b should these libraries be available, or do we need to switch to the non-debug versions?

@jedwards4b
Contributor

@slevis-lmwg I am currently working on the parallelio-debug issue, hope to have a resolution soon.

@jedwards4b
Contributor

@slevis-lmwg although you can remove cpl7 you cannot yet remove mct. However I don't expect you to remove any externals in this tag, I will remove them in the next tag.

@slevis-lmwg
Contributor Author

Ok, thanks @jedwards4b
I will put back mct then.

@jedwards4b
Contributor

@slevis-lmwg the issue (cannot be loaded as requested: "parallelio/2.6.2-debug", "esmf/8.6.0-debug") should now be fixed, please try again.

@slevis-lmwg
Contributor Author

@billsacks I'm guessing that I should push the change, but I wanted to make sure you didn't have any follow-up before I went ahead. And THANK YOU for helping with this!

@ekluzek
Collaborator

ekluzek commented May 24, 2024

@slevis-lmwg and @billsacks the one thought I have on where this should go is if this is JUST in the LILACSMOKE test, someone running LILAC outside of testing will still see this problem. So maybe that load_env should be done in an appropriate place in cime?

@billsacks
Member

That's great news, @slevis-lmwg - thank you! It's a shame that the LILAC test is so fragile, due to not using the same runner code that other tests use, but for now, yes, I'd say go ahead and push that change and we'll call it good!

@billsacks
Member

> the one thought I have on where this should go is if this is JUST in the LILACSMOKE test, someone running LILAC outside of testing will still see this problem. So maybe that load_env should be done in an appropriate place in cime?

@ekluzek this is a good point and something that occurred to me but which I honestly kind of buried my head in the sand about. My guess is that this problem only shows up in the LILAC testing because of the weird way it's set up. In a normal use of LILAC, we wouldn't be using CIME to run the model, but we use CIME for the LILACSMOKE test so that we can leverage its testing and batch submission infrastructure. So I think that, in normal usage, users will be loading the appropriate module environment on their own rather than relying on the CIME infrastructure to do it. My head has been out of this for long enough that I'm not confident of this, but my intuition leans enough in this direction that I personally feel okay with just putting it in the LILACSMOKE test for now and then waiting until we get a problem report that indicates that more work may be needed.

@ekluzek
Collaborator

ekluzek commented May 24, 2024

Ahh, that sounds reasonable to me. I'm not sure how big the LILAC user base is. And you are right they wouldn't be doing this within cime (unless users are doing unexpected things which is always possible).

Especially in the short term we should get the fix in for the testing so this tag can move forward. So thanks both for your thoughts as well as the work you did on this. Take care.

@slevis-lmwg
Contributor Author

slevis-lmwg commented May 24, 2024

derecho tests
PASS make black and make lint
PASS ./run_ctsm_py_tests -u
PASS ./run_ctsm_py_tests -s
PASS ./build-namelist_test.pl

aux_clm on izumi OK
aux_clm on derecho OK

@slevis-lmwg
Contributor Author

I have great news:
All tests pass.
This PR is ready, other than any final review that we should do @ekluzek.
Let me know if we need to do something before Tuesday.
Otherwise, I will plan on finishing this up on Tuesday.

Collaborator

@ekluzek ekluzek left a comment

I do ask for a couple of little things that shouldn't require testing to be redone. But this looks great and ready to come in, as far as I can see.

@ekluzek
Collaborator

ekluzek commented May 24, 2024

@slevis-lmwg awesome thanks for getting this done. And in time for a long weekend, so you can free your mind of work.

I sent a review approval in, but asked for a couple things that I don't think will require retesting. Although I do suggest running the ctsm_sci testlist, because we talked about doing that as a regular practice in a CTSM SE meeting a few weeks ago.

@jedwards4b
Contributor

@slevis-lmwg is your repository up to date with what you tested? If I run
git clone https://github.com/slevis-lmwg/ctsm.git -b upd_externals_to_beta17 ctsm5.2.006
I get share1.0.18 and cannot compile - you need share1.0.19.

@jedwards4b
Contributor

Also when I checkout upd_externals_to_beta17 (and update share to 1.0.19) and run test ERS_D_Ld15.f45_f45_mg37.I2000Clm50FatesRs.izumi_nag.clm-FatesColdTwoStream
I get a baseline difference???

@slevis-lmwg
Contributor Author

> @slevis-lmwg is your repository up to date with what you tested? If I run git clone https://github.com/slevis-lmwg/ctsm.git -b upd_externals_to_beta17 ctsm5.2.006 I get share1.0.18 and cannot compile - you need share1.0.19.

I think "share" was missed in the list of updates in #2493, so thank you for catching that @jedwards4b.

I will rerun all testing and see what happens...

@slevis-lmwg
Contributor Author

slevis-lmwg commented May 28, 2024

derecho tests
PASS make black and make lint
PASS ./run_ctsm_py_tests -u and -s
PASS ./build-namelist_test.pl
OK ctsm_sci

OK aux_clm on izumi
FAIL aux_clm on derecho (details in later post)

@ekluzek
Collaborator

ekluzek commented May 28, 2024

@jedwards4b I don't see the update to the share library documented anywhere. Does that update need to be added to the testdb for cesm3_0_alpha01a?

@slevis-lmwg
Contributor Author

slevis-lmwg commented May 28, 2024

> Also when I checkout upd_externals_to_beta17 (and update share to 1.0.19) and run test ERS_D_Ld15.f45_f45_mg37.I2000Clm50FatesRs.izumi_nag.clm-FatesColdTwoStream I get a baseline difference???

For the above test, I get this, so I think we're fine:
FAIL ERS_D_Ld15.f45_f45_mg37.I2000Clm50FatesRs.izumi_nag.clm-FatesColdTwoStream COMPARE_base_rest (EXPECTED FAILURE)

HOWEVER, I am now getting the following failures on derecho:

FAIL SMS.f10_f10_mg37.I2000Clm50BgcCrop.derecho_nvhpc.clm-crop MODEL_BUILD
FAIL SMS.f45_f45_mg37.I2000Clm60FatesSpRsGs.derecho_nvhpc.clm-FatesColdSatPhen BASELINE ctsm5.2.005: DIFF

I will try these again without the latest update to share, to confirm my pre-weekend results.

  • share1.0.18: now both tests appear DIFF from baseline again. I now see that I confused myself last week into thinking that this was fixed. I likely forgot to rerun ./manage_externals/checkout_externals after bisecting.
  • cmeps0.14.50: the f45 test is DIFF from ctsm5.2.005 and the diffs are identical when using cmeps0.14.63
  • ccs_config_cesm0.0.98: the f45 test FAILs to build
  • ctsm5.2.005 Externals.cfg except keep ccs_config_cesm0.0.106 and mizuRoute: the f45 test is DIFF from ctsm5.2.005 and the diffs are identical with ctsm5.2.006 Externals.cfg

I have convinced myself now that the last two test failures are due to the compiler update. Last week I confused myself by forgetting to rerun ./manage_externals/checkout_externals when necessary.
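
When bisecting externals like this, listing exactly which tags differ between two Externals.cfg versions can help keep track of what each trial changed. Below is a minimal Python sketch using the standard-library `configparser`. The two config fragments are hypothetical and heavily trimmed: the tags are ones mentioned in this thread, but the section names, the `example_unchanged` entry, and the helper names `read_tags` and `changed_externals` are assumptions for illustration. It only compares the files; as noted above, ./manage_externals/checkout_externals still has to be rerun after each change.

```python
import configparser

# Hypothetical, trimmed-down fragments; real Externals.cfg files carry more
# keys per section. Section names here are illustrative assumptions.
OLD_CFG = """
[cmeps]
tag = cmeps0.14.50
[share]
tag = share1.0.18
[ccs_config]
tag = ccs_config_cesm0.0.98
# hypothetical external, unchanged between the two files
[example_unchanged]
tag = v1.0.0
"""

NEW_CFG = """
[cmeps]
tag = cmeps0.14.63
[share]
tag = share1.0.19
[ccs_config]
tag = ccs_config_cesm0.0.106
# hypothetical external, unchanged between the two files
[example_unchanged]
tag = v1.0.0
"""


def read_tags(text):
    """Return {section_name: tag} for every section that defines a tag."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return {
        name: parser[name]["tag"]
        for name in parser.sections()
        if "tag" in parser[name]
    }


def changed_externals(old_text, new_text):
    """Return {name: (old_tag, new_tag)} for externals whose tag differs."""
    old_tags, new_tags = read_tags(old_text), read_tags(new_text)
    names = sorted(set(old_tags) | set(new_tags))
    return {
        name: (old_tags.get(name), new_tags.get(name))
        for name in names
        if old_tags.get(name) != new_tags.get(name)
    }


for name, (old, new) in changed_externals(OLD_CFG, NEW_CFG).items():
    print(f"{name}: {old} -> {new}")
```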

@jedwards4b
Contributor

> FAIL SMS.f45_f45_mg37.I2000Clm60FatesSpRsGs.derecho_nvhpc.clm-FatesColdSatPhen BASELINE ctsm5.2.005: DIFF

This is due to the compiler update.

For case SMS.f10_f10_mg37.I2000Clm50BgcCrop.derecho_nvhpc.clm-crop I was able to build to completion and run but it had the same baseline failure - again due to the compiler update.

@wwieder
Contributor

wwieder commented May 28, 2024

Thanks for working through this @jedwards4b and @slevis-lmwg .

@briandobbins is this the same issue you brought up in our conversation, or is there another task for gitflexi-mod that needs to happen on the CTSM side?

@jedwards4b
Contributor

@wwieder: git-fleximod

@wwieder
Contributor

wwieder commented May 28, 2024

I got the letters in the right place, just not the dash. For me that's pretty good ;)

@slevis-lmwg
Contributor Author

> FAIL SMS.f45_f45_mg37.I2000Clm60FatesSpRsGs.derecho_nvhpc.clm-FatesColdSatPhen BASELINE ctsm5.2.005: DIFF
>
> This is due to the compiler update.
>
> For case SMS.f10_f10_mg37.I2000Clm50BgcCrop.derecho_nvhpc.clm-crop I was able to build to completion and run but it had the same baseline failure - again due to the compiler update.

Thank you @jedwards4b, sorry about my confusion regarding the compiler update.

As soon as I finish rerunning the job that didn't build for me before, so that we can have correct baseline files for it, I will begin the process of merging this PR.

@slevis-lmwg slevis-lmwg merged commit 6aebaad into ESCOMP:master May 28, 2024
1 of 2 checks passed
@slevis-lmwg slevis-lmwg deleted the upd_externals_to_beta17 branch May 28, 2024 21:46
@slevis-lmwg
Contributor Author

I wanted to reiterate my thanks to @jedwards4b @billsacks @ekluzek for helping me with this PR, especially during the testing phase.

Labels
"code health" (improving internal code structure to make easier to maintain (sustainability)), "enhancement" (new capability or improved behavior of existing capability)
Projects
Status: Done
6 participants