Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create the provis_state subpool at RK4 initialization to avoid memory leak #6334

Merged
merged 4 commits into from
Apr 26, 2024

Conversation

sbrus89
Copy link
Contributor

@sbrus89 sbrus89 commented Apr 5, 2024

This PR fixes a memory leak I noticed in the RK4 timestepping when running 125 day single-layer barotropic tides cases with the vr45to5 mesh (MPAS-Dev/compass#802) on pm-cpu. Previously, I could only get through about 42 days of simulation before running out of memory.

This issue is related to creating/destroying the provis_state subpool at each timestep. We had a similar problem a few years back that required memory leaks fixes in the mpas_pool_destroy_pool (MPAS-Dev/MPAS-Model#367) subroutine. However, I believe there is still a memory leak in the mpas_pool_remove_subpool routine (which calls pool_remove_member) that is called following mpas_pool_destroy_pool. It seems like the TODO comment here:

!TODO: are there cases where we need to delete more data here?
deallocate(ptr_prev)
end if
suggests it is possible things aren't being completely cleaned up by this subroutine.

I'm not familiar enough with the low-level details of the pools framework to track down the memory leak itself. However, in any case, I think it makes more sense to create the provis_state subpool once at initialization as opposed to creating and destroying it every timestep. The main consequence of this approach is that the mpas_pool_copy_pool subroutine needs to have a overrideTimeLevel option similar to that used in mpas_pool_clone_pool used previously.

I've tested these changes with the vr45to5 tides test case and they do allow me to run for a full 125 days and are B4B with the previously create/destroy approach. The LTS and FB-LTS timestepping schemes are also updated here in the same way to prevent the same memory leak.

Since RK4 is not used in E3SM, this PR is B4B for all E3SM tests. The mpas_pool_copy_pool routine modified here is not used in MPAS-Seaice or MALI. See the Ocean-Discussions PR for additional background and testing.

[BFB]

@sbrus89 sbrus89 added mpas-ocean bug fix PR BFB PR leaves answers BFB MPAS-Ocean standalone Issues and features for standalone MPAS-Ocean code that dont impact E3SM. labels Apr 5, 2024
@sbrus89
Copy link
Contributor Author

sbrus89 commented Apr 5, 2024

@matthewhoffman and @akturner, I thought I'd tag you on this since it touches the framework, even though this is apparently the first usage of mpas_pool_copy_pool in E3SM.

@gcapodag
Copy link
Contributor

gcapodag commented Apr 6, 2024

great, thanks @sbrus89

@sbrus89
Copy link
Contributor Author

sbrus89 commented Apr 8, 2024

@erinethomas and @darincomeau, would you mind taking a look at the framework changes in this PR just in case you see any issues from the sea ice side?

@mark-petersen mark-petersen removed the request for review from gcapodag April 10, 2024 21:16
@mark-petersen
Copy link
Contributor

Repeating post from E3SM-Ocean-Discussion#87 (comment)

Passes nightly test suite and compares bfb with master branch point on chicoma with optimized gnu and chrysalis with optimized intel. Also passes nighty test suite with debug gnu on chicoma. Note this includes a series of RK4 tests:

00:34 PASS ocean_global_ocean_Icos240_WOA23_RK4_performance_test
01:40 PASS ocean_global_ocean_Icos240_WOA23_RK4_restart_test
01:20 PASS ocean_global_ocean_Icos240_WOA23_RK4_decomp_test
01:20 PASS ocean_global_ocean_Icos240_WOA23_RK4_threads_test

In E3SM, passes

./create_test SMS_Ln9.T62_oQU240.GMPAS-IAF.chrysalis_gnu
./create_test SMS_Ln9.T62_oQU240.GMPAS-IAF.chrysalis_intel

@mark-petersen
Copy link
Contributor

Repeating post from E3SM-Ocean-Discussion#87 (comment). Note these tests all do not use LTS.

Passes nightly test suite and compares bfb with master branch point on chicoma with optimized gnu and chrysalis with optimized intel. Also passes nighty test suite with debug gnu on chicoma. Note this includes a series of RK4 tests:

00:34 PASS ocean_global_ocean_Icos240_WOA23_RK4_performance_test
01:40 PASS ocean_global_ocean_Icos240_WOA23_RK4_restart_test
01:20 PASS ocean_global_ocean_Icos240_WOA23_RK4_decomp_test
01:20 PASS ocean_global_ocean_Icos240_WOA23_RK4_threads_test

In E3SM, passes

./create_test SMS_Ln9.T62_oQU240.GMPAS-IAF.chrysalis_gnu
./create_test SMS_Ln9.T62_oQU240.GMPAS-IAF.chrysalis_intel

Copy link
Contributor

@mark-petersen mark-petersen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested with LTS on using gnu on chicoma and intel on chryslis, both debug and optimized. The following tests compare bfb against the master branchpoint on both.

ocean/dam_break/40cm/default_lts
  * step: initial_state
  * step: lts_regions
  * step: forward
  * step: viz
  test execution:      SUCCESS
  baseline comparison: PASS
  test runtime:        05:11
ocean/dam_break/120cm/default_lts
  * step: initial_state
  * step: lts_regions
  * step: forward
  * step: viz
  test execution:      SUCCESS
  baseline comparison: PASS
  test runtime:        01:00
ocean/dam_break/120cm/ramp_lts
  * step: initial_state
  * step: lts_regions
  * step: forward
  * step: viz
  test execution:      SUCCESS
  baseline comparison: PASS
  test runtime:        01:04

Copy link
Contributor

@mark-petersen mark-petersen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The above tests show that we get the identical answer to master with standard time stepping and LTS. In addition, tests by @gcapodag show that this addition solves an out-of-memory error on perlmutter that he and @jeremy-lilly have been struggling with for the past two months. Thank you @sbrus89!

@mark-petersen
Copy link
Contributor

@xylar and @matthewhoffman these changes are only used by the ocean core, and only for RK4 and LTS timestepping. I think you are safe with a visual review, as I tried to be thorough in my testing of RK4 and LTS and saw BFB results.

git grep -in mpas_pool_clone_pool
mpas-framework/src/core_sw/mpas_sw_time_integration.F:120:        call mpas_pool_clone_pool(statePool, provisStatePool, 1)
mpas-framework/src/framework/mpas_pool_routines.F:434:!  routine mpas_pool_clone_pool
mpas-framework/src/framework/mpas_pool_routines.F:444:   recursive subroutine mpas_pool_clone_pool(srcPool, destPool, overrideTimeLevels)!{{{
mpas-framework/src/framework/mpas_pool_routines.F:466:            call pool_mesg('ERROR in mpas_pool_clone_pool: Input time levels cannot be less than 1.')
mpas-framework/src/framework/mpas_pool_routines.F:909:                call mpas_pool_clone_pool(ptr % data % p, newmem % data % p)
mpas-framework/src/framework/mpas_pool_routines.F:921:   end subroutine mpas_pool_clone_pool!}}}
mpas-framework/src/framework/mpas_stream_manager.F:411:           call mpas_pool_clone_pool(manager % defaultAtts, new_stream % att_pool)
mpas-ocean/src/mode_forward/mpas_ocn_time_integration_lts.F:259:    call mpas_pool_clone_pool(tendPool, tendSum1stPool, 1)
mpas-ocean/src/mode_forward/mpas_ocn_time_integration_lts.F:261:    call mpas_pool_clone_pool(tendPool, tendSum2ndPool, 1)
mpas-ocean/src/mode_forward/mpas_ocn_time_integration_lts.F:263:    call mpas_pool_clone_pool(tendPool, tendSum3rdPool, 1)
mpas-ocean/src/mode_forward/mpas_ocn_time_integration_lts.F:265:    call mpas_pool_clone_pool(tendPool, tendSlowPool, 1)
mpas-ocean/src/mode_forward/mpas_ocn_time_integration_rk4.F:233:         call mpas_pool_clone_pool(statePool, provisStatePool, 1)

@erinethomas
Copy link
Contributor

in line with @mark-petersen's comment above - I was not able to see any place where this would impact MPAS-SI this morning ... so I think its ok on the sea-ice side...

@sbrus89 sbrus89 removed the request for review from akturner April 11, 2024 20:58
@sbrus89
Copy link
Contributor Author

sbrus89 commented Apr 11, 2024

@mark-petersen, the mpas-ocean rk4, lts, and fblts time stepping routines are the only places that use the mpas_pool_copy_pool routine I modified:

git grep -in mpas_pool_copy_pool
components/mpas-framework/src/framework/mpas_pool_routines.F:925:!  routine mpas_pool_copy_pool
components/mpas-framework/src/framework/mpas_pool_routines.F:935:   recursive subroutine mpas_pool_copy_pool(srcPool, destPool, overrideTimeLevels)!{{{
components/mpas-framework/src/framework/mpas_pool_routines.F:958:            call pool_mesg('ERROR in mpas_pool_copy_pool: Input time levels cannot be less than 1.')
components/mpas-framework/src/framework/mpas_pool_routines.F:1171:                   call mpas_pool_copy_pool(ptr % data % p, mem % p)
components/mpas-framework/src/framework/mpas_pool_routines.F:1181:   end subroutine mpas_pool_copy_pool!}}}
components/mpas-ocean/src/mode_forward/mpas_ocn_time_integration_fblts.F:230:      call mpas_pool_copy_pool(tendPool, tendSum3rdPool, 1)
components/mpas-ocean/src/mode_forward/mpas_ocn_time_integration_fblts.F:231:      call mpas_pool_copy_pool(tendPool, tendSlowPool, 1)
components/mpas-ocean/src/mode_forward/mpas_ocn_time_integration_lts.F:264:    call mpas_pool_copy_pool(tendPool, tendSum1stPool, 1)
components/mpas-ocean/src/mode_forward/mpas_ocn_time_integration_lts.F:265:    call mpas_pool_copy_pool(tendPool, tendSum2ndPool, 1)
components/mpas-ocean/src/mode_forward/mpas_ocn_time_integration_lts.F:266:    call mpas_pool_copy_pool(tendPool, tendSum3rdPool, 1)
components/mpas-ocean/src/mode_forward/mpas_ocn_time_integration_lts.F:267:    call mpas_pool_copy_pool(tendPool, tendSlowPool, 1)
components/mpas-ocean/src/mode_forward/mpas_ocn_time_integration_rk4.F:233:         call mpas_pool_copy_pool(statePool, provisStatePool, 1)

Copy link
Contributor

@xylar xylar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked over the code and discussed improvements with @sbrus89, which he implemented. I'm happy with the changes now. I agree that the framework changes only affect MPAS-Ocean and only in configurations that are not used in E3SM production simulations.

Copy link
Contributor

@matthewhoffman matthewhoffman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sbrus89 , I'm not very familiar with this code, so what you've done to solve the problem makes sense to me, and I can't review it very critically. I have two small items, but they are not major concerns.

@matthewhoffman matthewhoffman self-requested a review April 24, 2024 02:11
jonbob added a commit that referenced this pull request Apr 25, 2024
Create the provis_state subpool at RK4 initialization to avoid memory leak

This PR fixes a memory leak in the RK4 timestepping when running 125 day
single-layer barotropic tides cases with the vr45to5 mesh on pm-cpu.
Previously, it could only get through about 42 days of simulation before
running out of memory. This issue is related to creating/destroying the
provis_state subpool at each timestep.

Since RK4 is not used in E3SM, this PR is B4B for all E3SM tests. The
mpas_pool_copy_pool routine modified here is not used in MPAS-Seaice or
MALI.

[BFB]
@jonbob
Copy link
Contributor

jonbob commented Apr 25, 2024

passes:

  • ERS.ne4pg2_oQU480.WCYCL1850NS.chrysalis_intel
  • SMS_D_Ld1.ne30pg2_r05_IcoswISC30E3r5.CRYO1850-DISMF.chrysalis_intel

merged to next

@jonbob jonbob merged commit 6b38b86 into E3SM-Project:master Apr 26, 2024
12 checks passed
@jonbob
Copy link
Contributor

jonbob commented Apr 26, 2024

merged to master

xylar added a commit to xylar/compass that referenced this pull request May 8, 2024
This merge updates the E3SM-Project submodule from [93e511d](https://github.com/E3SM-Project/E3SM/tree/93e511d) to [31e0924](https://github.com/E3SM-Project/E3SM/tree/31e0924).

This update includes the following MPAS-Ocean and MPAS-Frameworks PRs (check mark indicates bit-for-bit with previous PR in the list):
- [ ]  (ocn) E3SM-Project/E3SM#6256
- [ ]  (ocn) E3SM-Project/E3SM#6224
- [ ]  (ocn) E3SM-Project/E3SM#6270
- [ ]  (ocn) E3SM-Project/E3SM#6293
- [ ]  (ocn) E3SM-Project/E3SM#6321
- [ ]  (ocn) E3SM-Project/E3SM#6262
- [ ]  (ocn) E3SM-Project/E3SM#6300
- [ ]  (ocn) E3SM-Project/E3SM#6334
- [ ]  (ocn) E3SM-Project/E3SM#6371
- [ ]  (ocn) E3SM-Project/E3SM#6288
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
BFB PR leaves answers BFB bug fix PR MPAS-Ocean standalone Issues and features for standalone MPAS-Ocean code that dont impact E3SM. mpas-ocean
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants