fvcore_layout.rc >> input.nml #34

Open
wmputman opened this issue Aug 8, 2019 · 8 comments

wmputman commented Aug 8, 2019

In gcm_run.j and gcm_forecast.tmpl, fvcore_layout.rc occasionally does not cat properly into input.nml, which leaves input.nml as an empty file. The most prevalent symptom at the moment is that the FMS stack size does not get set properly, and the model fails because it exceeds stack limits.

Member

mathomp4 commented Aug 8, 2019

@wmputman I cannot see how that could fail. I mean, that's basic Linux 101 there.

I could easily belt-and-suspender it with file existence checks, using /bin/cat, checking status, etc., which I suppose we should do everywhere, but it's cat, which is pretty boring and doesn't do much.
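A minimal sketch of what that belt-and-suspenders version could look like. It is written in POSIX sh purely for illustration (the run scripts themselves may use a different shell), and the function name `safe_cat_to` is hypothetical. The idea is to write to a temporary file, confirm that `cat` succeeded and that the copy is byte-identical, and only then move it into place, so a failed copy cannot leave an empty input.nml behind:

```shell
#!/bin/sh
# Hypothetical helper, not part of gcm_run.j: copy SRC to DST defensively.
safe_cat_to() {
    src=$1
    dst=$2

    # -s is true only if the file exists and has a size greater than zero.
    if [ ! -s "$src" ]; then
        echo "ERROR: $src is missing or empty" >&2
        return 1
    fi

    # Write to a temporary file next to the destination first.
    tmp="$dst.tmp.$$"
    if ! /bin/cat "$src" > "$tmp"; then
        echo "ERROR: cat $src failed" >&2
        rm -f "$tmp"
        return 1
    fi

    # Verify the copy byte for byte before touching the destination.
    if ! cmp -s "$src" "$tmp"; then
        echo "ERROR: $tmp does not match $src" >&2
        rm -f "$tmp"
        return 1
    fi

    # Only a verified copy ever becomes DST.
    mv "$tmp" "$dst"
}
```

In a run script this would be called as `safe_cat_to fvcore_layout.rc input.nml || exit 1`, so the job stops at the copy step instead of failing later inside the model.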

Is this only happening on, say, SLES10? That system does have an older cat (v. 8.12) versus that on SLES12 (v. 8.25). I can't imagine cat having bugs, but I can check the changelogs for GNU and see.

Member

mathomp4 commented Aug 8, 2019

I looked at the Changelogs and the only mention of cat was in that for 8.17:

cp,mv,install,cat,split: now read and write a minimum of 64KiB at a time.
This was previously 32KiB and increasing to 64KiB was seen to increase
throughput by about 10% when reading cached files on 64 bit GNU/Linux.

That's not really a bug fix, though.

Author

wmputman commented Aug 8, 2019

This has happened on SLES11 (once) and SLES12 (frequently) over the last month.

Member

mathomp4 commented Aug 8, 2019

The only other thing I can see is perhaps in FV3. For example, fv_control.F90 has this bit:

    f_unit=open_namelist_file()
    rewind (f_unit)
 ! Read Main namelist
    read (f_unit,fv_grid_nml,iostat=ios)
    ierr = check_nml_error(ios,'fv_grid_nml')
    call close_file(f_unit)

gfdl_cloud_microphys.F90 has:

         nlunit=open_namelist_file()
         rewind (nlunit)
      ! Read Main namelist
         read (nlunit,gfdl_cloud_microphysics_nml,iostat=ios)
         ierr = check_nml_error(ios,'gfdl_cloud_microphysics_nml')
         call close_file(nlunit)

Fairly similar. Open, rewind, read, check, close. However, later on in fv_control.F90 there is this:

       if (size(Atm) == 1) then
          f_unit = open_namelist_file()
       else if (n == 1) then
          f_unit = open_namelist_file('input.nml')
       else
          write(nested_grid_filename,'(A10, I2.2, A4)') 'input_nest', n, '.nml'
          f_unit = open_namelist_file(nested_grid_filename)
       endif

    ! Read FVCORE namelist
       read (f_unit,fv_core_nml,iostat=ios)
       ierr = check_nml_error(ios,'fv_core_nml')

    ! Read Test_Case namelist
       rewind (f_unit)
       read (f_unit,test_case_nml,iostat=ios)
       ierr = check_nml_error(ios,'test_case_nml')
       call close_file(f_unit)

This is about the only place I could find that there was an open_namelist_file() without a rewind() right after it. But, the check_nml_error() call should catch any iostat issues.

Now, fms itself seems to read namelist files a bit differently than FV3. From fms.F90:

     if (file_exist('input.nml')) then
        unit = open_namelist_file ( )
        ierr=1; do while (ierr /= 0)
           read  (unit, nml=fms_nml, iostat=io, end=10)
           ierr = check_nml_error(io,'fms_nml')  ! also initializes nml error codes
        enddo
 10     call mpp_close (unit)
     endif

but it's equivalent, I think (close_file() is essentially a wrapper around mpp_close()).
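Since every one of these reads depends on input.nml actually containing the namelist group in question, one cheap guard is a pre-flight check in the run script before the model launches. The sketch below is in POSIX sh for illustration only; the function name `check_nml_groups` is hypothetical, and which groups are mandatory for a given experiment is an assumption the caller has to supply.

```shell
#!/bin/sh
# Hypothetical pre-flight check: confirm that FILE is non-empty and
# contains an "&group" header for every namelist group named after it.
check_nml_groups() {
    file=$1
    shift

    # -s is true only if the file exists and has a size greater than zero.
    if [ ! -s "$file" ]; then
        echo "ERROR: $file is missing or empty" >&2
        return 1
    fi

    missing=0
    for group in "$@"; do
        # Group headers look like "&fms_nml", optionally indented.
        # Fortran treats group names case-insensitively, hence -i.
        if ! grep -i -q -E "^[[:space:]]*&${group}([[:space:]]|\$)" "$file"; then
            echo "ERROR: &$group not found in $file" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

A call such as `check_nml_groups input.nml fms_nml fv_core_nml || exit 1` would turn an empty input.nml into an immediate, clearly labelled failure rather than a stack-limit crash later in the run.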

Member

mathomp4 commented Aug 8, 2019

I am trying one thing now. Per Rusty in the source:

 !-----------------------------------------------------------------------
 ! subroutine READ_INPUT_NML
 !
 !
 ! Reads an existing input.nml into a character array and broadcasts
 ! it to the non-root mpi-tasks. This allows the use of reads from an
 ! internal file for namelist settings (requires 2003 compliant compiler)
 !
 ! read(input_nml_file, nml=<name_nml>, iostat=status)

This seems to be triggered by the INTERNAL_FILE_NML codepath. It would save file closes and opens if it works.

I'm building with the appropriate macro set and we'll see if it can even run.

Member

mathomp4 commented Aug 8, 2019

Or wait, maybe not. FMS code is fun to read...

Member

mathomp4 commented Aug 8, 2019

Well, that wasn't too hard. I can definitely activate the INTERNAL_FILE_NML path. You have to change a few CMakeLists.txt files and edit the two microphysics files, but it seems to be zero-diff. Whether or not it helps is another ball of wax, but if either @wmputman or @sdrabenh hits the error more consistently, it could be something to try.

NOTE: I didn't change the compilation of MOM5 which of course would need the same ifdef activated, but baby steps first.

Member

mathomp4 commented Aug 8, 2019

@wmputman et al, I sent an email to Rusty about INTERNAL_FILE_NML:

My question is regarding namelist reading in FMS. Bill Putman seems to be having intermittent issues with it, so I took a look. I noticed your name in the INTERNAL_FILE_NML codepath.

My reading is that instead of say, all 96 processors reading the namelist every time it's processed in FMS, FV3, microphysics, etc., only root would read it and then broadcast the results to an internal file. Is that correct?

He replied with:

Your interpretation is correct. To use the internal file, you only need:

use mpp_mod,   only:  input_nml_file

read (input_nml_file, <namelist>, iostat=io)
ierr = check_nml_error (io, '<namelist>')

If you are using multiple namelist files, you can clear and re-read a new namelist as needed using the mpp_mod::read_input_nml subroutine.

As you run on many, many processors, @wmputman, it might be worth moving to INTERNAL_FILE_NML for some testing. It might cost some MPI time to broadcast the character array, but it's probably better than 1000s of processes all opening the same file.
