
Running GEOS CTM #15

Open

JulesKouatchou opened this issue Aug 13, 2019 · 55 comments

Labels: bug (Something isn't working)

Comments

@JulesKouatchou
Contributor

I cloned GEOS CTM and was able to compile it. The ctm_setup script did not properly create the experiment directory because it was still referring to the old configuration (Linux/ instead of install/), so I fixed the ctm_setup file. The code now crashes during the initialization steps because it cannot create the grid; it fails on Line 9193 of MAPL_Generic.F90:

call ESMF_ConfigGetAttribute(state%cf,gridname,label=trim(comp_name)//CF_COMPONENT_SEPARATOR//'GRIDNAME:',rc=status)
VERIFY_(status)

I can see why there is a problem: the label should only be 'GRIDNAME:'.

I checked a couple of CVS tags I have and could not locate any MAPL version similar to the one in the git repository. I am wondering if MAPL has to be updated before GEOS CTM can run.

@kgerheiser

kgerheiser commented Aug 13, 2019

I have the CTM running with this version of MAPL. You just need to add something like this into GEOSCTM.rc:

GEOSctm.GRID_TYPE: Cubed-Sphere
GEOSctm.GRIDNAME: PE360x2160-CF
GEOSctm.NF: 6
GEOSctm.LM: 72
GEOSctm.IM_WORLD: 360

And for Dynamics:

DYNAMICS.GRID_TYPE: Cubed-Sphere
DYNAMICS.GRIDNAME: PE360x2160-CF
DYNAMICS.NF: 6
DYNAMICS.LM: 72
DYNAMICS.IM_WORLD: 360

Though, there is some duplication and it could probably be cleaned up and added to ctm_setup.

@JulesKouatchou
Contributor Author

Thank you. Very interesting. If it works, I will change the setup script...

@JulesKouatchou
Contributor Author

JulesKouatchou commented Aug 13, 2019

Do you have by chance an experiment directory on discover I can look at? My code is still crashing. Thanks.

@kgerheiser

I do have an experiment located at: /discover/nobackup/kgerheis/experiments/ctm_test_experiment

I think there might be a few other small things you need to change to get it to run. If you run into any errors I can probably quickly diagnose them as I've gone through this process several times.

@JulesKouatchou
Contributor Author

Kyle: Thank you. I will get back to you if I need more assistance.

@JulesKouatchou
Contributor Author

Kyle:

It seems that I am still missing something as my code continues to crash. My working directory is:

/gpfsm/dnb32/jkouatch/GEOS_CTM/GitRepos/testTR

I believe that I have an rc setting issue that I cannot identify. I noticed that in my standard output file I have:

In MAPL_Shmem:
NumCores per Node = 96
NumNodes in use = 1
Total PEs = 96

In MAPL_InitializeShmem (NodeRootsComm):
NumNodes in use = 1

That is not correct as there should be 4 nodes in use.

@kgerheiser

You need to register the grid with the GridManager before the grid can be created. Add this to AdvCore_GridComp.F90:

subroutine register_grid_and_regridders()
    use MAPL_GridManagerMod, only: grid_manager
    use CubedSphereGridFactoryMod, only: CubedSphereGridFactory
    use MAPL_RegridderManagerMod, only: regridder_manager
    use MAPL_RegridderSpecMod, only: REGRID_METHOD_BILINEAR
    use LatLonToCubeRegridderMod
    use CubeToLatLonRegridderMod
    use CubeToCubeRegridderMod

    type (CubedSphereGridFactory) :: factory

    type (CubeToLatLonRegridder) :: cube_to_latlon_prototype
    type (LatLonToCubeRegridder) :: latlon_to_cube_prototype
    type (CubeToCubeRegridder) :: cube_to_cube_prototype

    call grid_manager%add_prototype('Cubed-Sphere',factory)
    associate (method => REGRID_METHOD_BILINEAR, mgr => regridder_manager)
      call mgr%add_prototype('Cubed-Sphere', 'LatLon', method, cube_to_latlon_prototype)
      call mgr%add_prototype('LatLon', 'Cubed-Sphere', method, latlon_to_cube_prototype)
      call mgr%add_prototype('Cubed-Sphere', 'Cubed-Sphere', method, cube_to_cube_prototype)
    end associate

  end subroutine register_grid_and_regridders

and calling it in AdvCore SetServices (line 318)

if (.NOT. FV3_DynCoreIsRunning) then
   call fv_init2(FV_Atm, dt, grids_on_my_pe, p_split)
   call register_grid_and_regridders() ! add this line
end if

This register_grid_and_regridders routine duplicates code in DynCore and should probably be moved to its own module that both AdvCore and DynCore can call.

@JulesKouatchou
Contributor Author

Kyle: That helps me get further. Thanks. The code still crashes. I am now running in a debugging mode...

@lizziel

lizziel commented Aug 14, 2019

Jules, keep reporting the problems you are running into since I had to go through these issues as well when getting GCHP to work with the new MAPL. I may be able to help.

@lizziel

lizziel commented Aug 14, 2019

One note of caution so that you do not make the same mistake I did. Regarding the earlier comment to add this to your config file:

GEOSctm.GRID_TYPE: Cubed-Sphere
GEOSctm.GRIDNAME: PE360x2160-CF
GEOSctm.NF: 6
GEOSctm.LM: 72
GEOSctm.IM_WORLD: 360

I made the mistake of adding the prefix (in my case GCHP. rather than GEOSctm.) to the existing lines instead of adding new prefixed lines. This caused a silent bug in AdvCore_GridCompMod.F90, which expects entries for IM, JM, and LM in the file without prefixes. If those lines are not found, the calculated areas get messed up because default values are used.

My fix was to change AdvCore_GridCompMod.F90 to expect lines with the prefix rather than without.

-      call MAPL_GetResource( MAPL, FV_Atm(1)%flagstruct%npx, 'IM:', default= 32, RC=STATUS )
+      ! Customize for GCHP (ewl, 4/9/2019)
+!      call MAPL_GetResource( MAPL, FV_Atm(1)%flagstruct%npx, 'IM:', default= 32, RC=STATUS )
+      call MAPL_GetResource( MAPL, FV_Atm(1)%flagstruct%npx, 'GCHP.IM:', default= 32, RC=STATUS )
       _VERIFY(STATUS)
-      call MAPL_GetResource( MAPL, FV_Atm(1)%flagstruct%npy, 'JM:', default=192, RC=STATUS )
+!      call MAPL_GetResource( MAPL, FV_Atm(1)%flagstruct%npy, 'JM:', default=192, RC=STATUS )
+      call MAPL_GetResource( MAPL, FV_Atm(1)%flagstruct%npy, 'GCHP.JM:', default=192, RC=STATUS )
       _VERIFY(STATUS)
-      call MAPL_GetResource( MAPL, FV_Atm(1)%flagstruct%npz, 'LM:', default= 72, RC=STATUS )
+!      call MAPL_GetResource( MAPL, FV_Atm(1)%flagstruct%npz, 'LM:', default= 72, RC=STATUS )
+      call MAPL_GetResource( MAPL, FV_Atm(1)%flagstruct%npz, 'GCHP.LM:', default= 72, RC=STATUS )

@JulesKouatchou
Contributor Author

Lizzie,

Thank you for your comments. I am wondering if it was necessary to make any change to my GEOSCTM.rc file, as the AdvCore source code has:

     call MAPL_GetResource( MAPL, FV_Atm(1)%flagstruct%npx, 'IM:', default= 32, RC=STATUS )
     call MAPL_GetResource( MAPL, FV_Atm(1)%flagstruct%npy, 'JM:', default=192, RC=STATUS )
     call MAPL_GetResource( MAPL, FV_Atm(1)%flagstruct%npz, 'LM:', default= 72, RC=STATUS )

In any case, my code is still crashing and I am still trying to figure out why.

@JulesKouatchou
Contributor Author

Something unusual that I mentioned before is the following printout at the beginning of the run:

In MAPL_Shmem:
NumCores per Node = 84
NumNodes in use = 1
Total PEs = 84

In MAPL_InitializeShmem (NodeRootsComm):
NumNodes in use = 1

I went ahead in MAPL_ShmemMod.F90 and printed the name of all the processors (nodes) after the call:
call MPI_AllGather(name, MPI_MAX_PROCESSOR_NAME, MPI_CHARACTER, &
                   names, MPI_MAX_PROCESSOR_NAME, MPI_CHARACTER, Comm, status)

All the entries of the variable "names" only had the name of the head node. Something is wrong but I do not know what. I am sure that there is a new rc setting I need to add.
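
As a quick sanity check on the node distribution, a minimal standalone diagnostic along these lines (the program and variable names below are hypothetical; only standard MPI calls are used) would be something like:

   program node_names
      use mpi
      implicit none
      integer :: ierr, rank, nranks, namelen
      character(len=MPI_MAX_PROCESSOR_NAME) :: pname

      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
      call MPI_Comm_size(MPI_COMM_WORLD, nranks, ierr)
      call MPI_Get_processor_name(pname, namelen, ierr)

      ! With a launcher that spreads ranks correctly (e.g. mpiexec_mpt under MPT,
      ! or esma_mpirun), several distinct node names should appear here.
      print '(a,i0,a,i0,2a)', 'rank ', rank, ' of ', nranks, ' on node ', trim(pname)

      call MPI_Finalize(ierr)
   end program node_names

If every rank reports the head node here as well, the job launcher rather than MAPL is the likely culprit.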

@mathomp4
Member

mathomp4 commented Aug 15, 2019

@JulesKouatchou This sounds suspiciously like calling MPT with mpirun. How are you running your executable? If you're using MPT (which you probably are), you need to use either mpiexec_mpt or esma_mpirun (which autodetects the MPI stack and will run mpiexec_mpt).

ETA: For everyone, MPT does have an mpirun command but it is weird. It can do some interesting things, but it doesn't perform like you'd expect without a lot of extra work. mpiexec_mpt understands SLURM (ish) and does what is expected.

@mathomp4
Member

@JulesKouatchou Actually, I might need to work with you on the ctm_setup script. There are some changes we had to make to gcm_setup for the move to CMake that aren't reflected in that script as I see it on this repo. It's possible it's getting confused and picking the wrong MPI stack or the like.

@kgerheiser

@mathomp4 That is something I had to change in ctm_run.j, and why I asked you about MPT's mpirun. I changed RUN_CMD to mpiexec.

@JulesKouatchou
Contributor Author

@mathomp4
I made changes to my ctm_setup in order to have a complete experiment directory; the script was still referring, for instance, to the Linux/ directory.
In my ctm_run.j file, RUN_CMD is set to "mpirun", not "mpiexec". I will make the change and see what happens.

@JulesKouatchou
Contributor Author

Things are moving in the right direction. I now have, as expected:

In MAPL_Shmem:
NumCores per Node = 28
NumNodes in use = 3
Total PEs = 84

In MAPL_InitializeShmem (NodeRootsComm):
NumNodes in use = 3

The code is still crashing but this time in:

GEOSctm.x 00000000048A9E47 fv_statemod_mp_co 3160 FV_StateMod.F90
GEOSctm.x 0000000004868BAF fv_statemod_mp_fv 2902 FV_StateMod.F90
GEOSctm.x 00000000004B2B6B geos_ctmenvgridco 926 GEOS_ctmEnvGridComp.F90

It appears to be in the manipulation of U & V. My guess is that U & V are not properly read in, or that I am missing an rc setting somewhere.

@tclune
Collaborator

tclune commented Aug 20, 2019

@JulesKouatchou Can you tell me which branch of GEOSgcm_GridComp is being used? Lines 3160 and 2902 don't seem right for this type of error. (I wanted to check on an uninitialized pointer that crops up in FV_StateMod from time to time.)

@tclune
Collaborator

tclune commented Aug 20, 2019

@JulesKouatchou Also - it is worth learning how to include references to code snippets in tickets. It is easier to show you when you are here, but you can also find it by googling. As an example, I'll show the sections around the line numbers you mentioned above:

https://github.com/GEOS-ESM/GEOSgcm_GridComp/blob/b4446a6504226c3933fd34f4a842a44ba8b7333b/GEOSagcm_GridComp/GEOSsuperdyn_GridComp/FVdycoreCubed_GridComp/FV_StateMod.F90#L3154-L3168

@JulesKouatchou
Contributor Author

@tclune I do not know how to include references to code-snippets of external components/modules of the CTM. I will try to figure it out.

@lizziel

lizziel commented Aug 20, 2019

@JulesKouatchou Check this out: https://help.github.com/en/articles/creating-a-permanent-link-to-a-code-snippet. Thanks @tclune (I did not know about this).

@tclune
Collaborator

tclune commented Aug 20, 2019

Note - this used to work better. Rather than a link, you would actually see the lines of code in the ticket (and the email). I see that someone has raised the issue with GitHub. It started working incorrectly (for some users) 2-3 weeks ago.

@tclune
Collaborator

tclune commented Aug 20, 2019

Oh - it's because we are linking to text from a different repo. For the same repo it works really nicely. E.g.

DO ic = 1, numSpecies
   ! Get field and name from tracer bundle
   !--------------------------------------
   call ESMF_FieldBundleGet(DiffTR, ic, FIELD, RC=STATUS)
   VERIFY_(STATUS)
   call ESMF_FieldGet(FIELD, name=NAME, RC=STATUS)
   VERIFY_(STATUS)
   ! Identify fixed species such as O2, N2, etc.
   IF (self%FIRST) THEN
      IF (ic .EQ. numSpecies) self%FIRST = .FALSE.
      IF (TRIM(NAME) == 'ACET' .OR. TRIM(NAME) == 'N2' .OR. &
          TRIM(NAME) == 'O2' .OR. TRIM(NAME) == 'NUMDENS') THEN
         self%isFixedConcentration(ic) = .TRUE.
      END IF
   END IF

@lizziel

lizziel commented Aug 20, 2019

Very nice. Not sure if you guys use Slack, but it shows up nicely in Slack chats as well.

@kgerheiser

kgerheiser commented Aug 26, 2019

@JulesKouatchou

In fv_computeMassFluxes_r8 you need to initialize uc and vc to 0.0 at the beginning of the subroutine before they are assigned in lines 2916-2917 in FV_StateMod.F90.

  !add these two lines to initialize to 0
  uc = 0d0 
  vc = 0d0 

  uc(is:ie,js:je,:) = ucI
  vc(is:ie,js:je,:) = vcI

@JulesKouatchou
Contributor Author

Kyle,

Thank you for your inputs. I was able to run the CTM code for a day. I now need to run the code longer under various configurations.

@tclune
Collaborator

tclune commented Aug 26, 2019

@kgerheiser I'm wary that these are not the correct points to do initializations. Was this solution found by trial-and-error, or pulled from some other version of the code?

@kgerheiser

I found this bug several weeks ago and tracked it down to uninitialized halo values.

The line uc(is:ie,js:je,:) = ucI only initializes the interior of the array, leaving the halos uninitialized. Then, later in the code, compute_utvt uses those halo values while they are still uninitialized, which causes the crash.
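
A reduced, self-contained sketch of that situation (the bounds, halo width, and program name below are made up for illustration):

   program halo_init_demo
      implicit none
      integer, parameter :: is = 1, ie = 4, js = 1, je = 4, ng = 3, npz = 2
      real :: uc(is-ng:ie+ng, js-ng:je+ng, npz)   ! array allocated with halo extents
      real :: ucI(is:ie, js:je, npz)              ! interior-only data

      ucI = 1.0
      uc  = 0.0                     ! the added initialization: halo cells are now defined
      uc(is:ie, js:je, :) = ucI     ! the original assignment only fills the interior

      ! Without the "uc = 0.0" line, any read of uc outside is:ie / js:je (as a
      ! halo-using routine would do before an exchange) touches undefined values.
      print *, 'halo value:', uc(is-1, js, 1)
   end program halo_init_demo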

@kgerheiser

I looked at the r4 version of the subroutine and it also assigns the variables to 0 at the same place.

@lizziel

lizziel commented Aug 26, 2019

It looks like someone at Harvard added this fix to the GCHP version of FV3 (r8) several years ago. Apologies if it never made it up the chain. I missed it as well when upgrading FV3 recently so will need to add it back in, although it hasn't caused a crash so far.

@JulesKouatchou
Contributor Author

I want to provide an update. The code does not run when using regular compilation options:

Image PC Routine Line Source
GEOSctm.x 000000000261DDAE Unknown Unknown Unknown
libpthread-2.11.3 00002AAAAEAAF850 Unknown Unknown Unknown
GEOSctm.x 00000000015CE155 tp_core_mod_mp_pe 1053 tp_core.F90
GEOSctm.x 00000000015D932C tp_core_mod_mp_yp 915 tp_core.F90
GEOSctm.x 00000000015C2D27 tp_core_mod_mp_fv 165 tp_core.F90
GEOSctm.x 00000000012CB321 fv_statemod_mp_fv 2979 FV_StateMod.F90

When I compile with debugging options, the code runs for 8 days and crashes:

AGCM Date: 2010/02/08 Time: 16:00:00 Throughput(days/day)[Avg Tot Run]: 328.0 310.2 340.4 TimeRemaining(Est) 001:46:49 91.0% Memory Committed
Insufficient memory to allocate Fortran RTL message buffer, message #41 = hex 00000029.
Insufficient memory to allocate Fortran RTL message buffer, message #41 = hex 00000029.

There seems to be a memory issue. Is there any setting I need to have?

My code is at:

/discover/nobackup/jkouatch/GEOS_CTM/GitRepos/GEOSctm

and my experiment directory at:

/discover/nobackup/jkouatch/GEOS_CTM/GitRepos/testTR

@tclune
Collaborator

tclune commented Aug 27, 2019

@lizziel has also reported some memory leak issues. @bena-nasa has tried to replicate with a more synthetic use case, but was unsuccessful.

I know that @wmputman has been using the advec core in his latest GCM development, but AFAIK he is using a slightly different version of FV than made it into Git. I.e., the git version of FV was really only vetted with the dycore and we are probably missing various minor fixes in the advec core that were only in CTM tags under CVS.

Someone knowledgeable needs to take a hard look at the diffs between the current FV and the working versions in CTMs under CVS.

@kgerheiser

I can reproduce the crash in tp_core with debugging turned off.

@kgerheiser

kgerheiser commented Sep 4, 2019

I'm not sure what to make of this:

I added a print statement to tp_core where the crash is occurring (divide by zero) to check if the divisor was really small, and adding the print statement made the crash go away.

And it turns out a4 on this line fmin = a0(i) + 0.25/a4*da1**2 + a4*r12 is on the order of 10^-16.

https://github.com/GEOS-ESM/GEOSgcm_GridComp/blob/9ab6f4e18386d0e7802711f9dd70849080b61f78/GEOSagcm_GridComp/GEOSsuperdyn_GridComp/FVdycoreCubed_GridComp/fvdycore/model/tp_core.F90#L1054
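
Purely as an illustration of why an a4 of that magnitude is trouble in that expression (the tolerance, values, and program name below are hypothetical; this is not a change that was made to tp_core):

   program small_divisor_demo
      implicit none
      real(8) :: a0, a4, da1, r12, fmin
      real(8), parameter :: tiny_a4 = 1.0d-14   ! hypothetical guard tolerance

      a0  = 1.0d0
      da1 = 0.1d0
      r12 = 1.0d0/12.0d0
      a4  = 1.0d-16                             ! the magnitude reported above

      if (abs(a4) > tiny_a4) then
         fmin = a0 + 0.25d0/a4*da1**2 + a4*r12
      else
         fmin = a0                              ! treat near-zero a4 as the degenerate case
      end if
      print *, 'a4 =', a4, ' fmin =', fmin
   end program small_divisor_demo

Whether a guard like this is appropriate for the limiter is a separate question; the point is only that with a4 of order 10^-16 the 0.25/a4 term becomes enormous, so the expression is very sensitive to how the division is evaluated.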

@tclune
Collaborator

tclune commented Sep 4, 2019

We seem to be accumulating a sizable number of mysteries in the FV layer recently.

Just yesterday Rahul made a change to the model entirely outside of FV, but it produced a runtime error in a write statement.

I certainly encourage you to use aggressive debugging options under both gfortran and Intel in the hopes that it exposes something. But if you've already done that ... Valgrind?

@wmputman

wmputman commented Sep 4, 2019 via email

@mathomp4
Member

mathomp4 commented Sep 5, 2019

> There is the outstanding issue that the cat fvcore_layout.rc >> input.nml in the run fails on occasion…. This can produce inconsistent and undesirable effects in FV3.

@wmputman,

Did you ever try using Rusty's INTERNAL_FILE_NML as seen in GEOS-ESM/GEOSgcm#34? Maybe it's just thousands of cores trying to open the same file that could cause issues?

@wmputman

wmputman commented Sep 5, 2019 via email

@mathomp4
Member

mathomp4 commented Sep 5, 2019

> The trouble is that sometimes the input.nml file is completely empty, before the executable even begins.

That sounds like an issue in the scripting then. No matter what, we can never not do an append because the coupled model (@yvikhlya) actually has an input.nml file that controls all of MOM. Thus, we need to append the rc file to it.

I could belt-and-suspenders it with scripting. We could put in detection before GEOSxxx.x runs: if input.nml is empty, die, or try the append again? Hmm.

@bena-nasa
Contributor

bena-nasa commented Sep 5, 2019

I was able to run the CTM version on GitHub that Jules pointed me to, with a few modifications outlined in the comments here. This appears to be based on Jason-3_0. I too was only able to run the debug build, but I can confirm that there does appear to be a memory leak. After 4 days at c90 with the tracer case (running from 21z on the 1st to 0z on the 5th of the month), here is the memory use on the root node of my compute session:

Jason-3_0 based GEOSctm: 18.4% to 56.4%
MAPL on the develop branch and MAPL-2.0 branch on other repos: 12.9% to 14.4%
Version of MAPL off develop that no longer uses CFIO: 11.2% to 13.4%

So apparently not using the little CFIO does not make a difference? It is also very hard to tell how correlated this is to ExtData.

@tclune
Collaborator

tclune commented Sep 6, 2019

That's very encouraging on the memory front.

Might need to up the priority for investigating the no-debug failure now.

@JulesKouatchou
Contributor Author

When I compile in a no-debug mode, the code crashes on Line 999 of:

       FVdycoreCubed_GridComp/fvdycore/model/fv_tracer2d.F90 

It is:
if (flagstruct%fill) call fillz(ie-is+1, npz, 1, q1(:,j,:), dp2(:,j,:))

If I comment out the line, the code can run. I have not yet figured out why the code crashes in the subroutine "fillz", which is inside:

      FVdycoreCubed_GridComp/fvdycore/model/fv_fill.F90

It seems that something is happening in this code segment (Lines 87-92):

  do i=1,im
     if( q(i,1,ic) < 0. ) then
         q(i,2,ic) = q(i,2,ic) + q(i,1,ic)*dp(i,1)/dp(i,2)   ! <-- the suspect line
         q(i,1,ic) = 0.
     endif
  enddo

@kgerheiser

kgerheiser commented Sep 6, 2019

It crashes in fv_tracer2d? Based on your previous post and from my testing, it crashes in tp_core with a floating divide by zero. Also, what does it crash with: segfault, divide by zero, etc.?

@JulesKouatchou
Contributor Author

Here is what I am getting:

MPT: #6 0x000000000173a21e in fv_fill_mod::fillz (
MPT: im=<error reading variable: Cannot access memory at address 0x14>,
MPT: km=<error reading variable: Cannot access memory at address 0x4>,
MPT: nq=<error reading variable: Cannot access memory at address 0x0>,
MPT: q=<error reading variable: Cannot access memory at address 0x7fffffff3590>, dp=...)
MPT: at /gpfsm/dnb32/jkouatch/GEOS_CTM/GitRepos/GEOSctm/src/Components/GEOSctm_GridComp/@GEOSgcm_GridComp/GEOSagcm_GridComp/GEOSsuperdyn_GridComp/FVdycoreCubed_GridComp/fvdycore/model/fv_fill.F90:89

I read somewhere that one way to remove the "error reading variable: Cannot access memory" error is to change the compilation options. Perhaps that is why the code can run with debugging options turned on.

@kgerheiser

Looks like it might be MPT related. I was building with Ifort and Intel MPI. Maybe that's why I don't get that bug.

Right now I'm working on building with gfortran and OpenMPI in the hope that it will expose some bugs.

@mathomp4
Member

mathomp4 commented Sep 6, 2019

@JulesKouatchou Do you know what MPT environment variables are set with the CTM? It's possible one of the many we set for the GCM is needed for the CTM?

@mathomp4
Member

mathomp4 commented Sep 6, 2019

Oh wow. I probably need the expertise of @tclune for my question with this code. So we have:

if (flagstruct%fill) call fillz(ie-is+1, npz, 1, q1(:,j,:), dp2(:,j,:))

and now the fillz routine:

 subroutine fillz(im, km, nq, q, dp)
   integer,  intent(in):: im                !< No. of longitudes
   integer,  intent(in):: km                !< No. of levels
   integer,  intent(in):: nq                !< Total number of tracers
   real , intent(in)::  dp(im,km)           !< pressure thickness
   real , intent(inout) :: q(im,km,nq)      !< tracer mixing ratio

Now q1 and dp2 are:

real ::   q1(bd%is :bd%ie ,bd%js :bd%je , npz   )! 3D Tracers
      real  dp2(bd%is:bd%ie,bd%js:bd%je,npz)

which means we are sending a weird 2-D slice, q1(:,j,:), to a 3-D q in fillz?

So does Fortran guarantee that q1(is:ie,j,:) maps element-for-element onto q(ie-is+1,npz,1) when passed through the interface of a subroutine? They are the same size in the end ((ie-is+1)*npz), but ouch. I guess I'd be dumb and fill a temporary (is:ie,npz,1) array with q and pass that in.

@JulesKouatchou
Contributor Author

Matt: I had the same concern. I created two temporary variables loc_q1(ie-is+1,npz,1) and loc_dp2(ie-is+1,npz). The code still crashed at the same location.

@kgerheiser

kgerheiser commented Sep 6, 2019

My explanation was sort of off before.

They are the same size, which is the important thing. A temporary array of the given size is created for the subroutine, the data is copied from your source array (which has the same number of elements) into it, and it is copied back when the subroutine returns. You're basically just interpreting the bounds differently.

Though if the sizes don't match, the code will still happily compile and run.

@tclune
Collaborator

tclune commented Sep 6, 2019

@mathomp4 Unfortunately, this style is acceptable according to the standard. (But I'd much prefer not to mix F90 and F77 array styles.)

As @kgerheiser explains, the compiler is forced to make copies due to the explicit shape of the dummy arguments. Without aggressive debugging flags, the sizes don't even have to agree.
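
A minimal, self-contained sketch of the sequence association being described (the names and sizes are made up; fillz_like only stands in for the real fillz):

   program sequence_assoc_demo
      implicit none
      integer, parameter :: is = 1, ie = 3, js = 1, je = 2, npz = 4
      real :: q1(is:ie, js:je, npz)
      integer :: j

      q1 = 2.0
      j  = js
      ! A rank-2 section is passed to a rank-3 explicit-shape dummy: the elements
      ! are copied to a contiguous temporary on entry and copied back on return.
      call fillz_like(ie-is+1, npz, 1, q1(:,j,:))

   contains

      subroutine fillz_like(im, km, nq, q)
         integer, intent(in)    :: im, km, nq
         real,    intent(inout) :: q(im, km, nq)   ! explicit-shape dummy, im*km*nq elements
         ! Same element count as q1(:,j,:), just reinterpreted with bounds (im, km, nq).
         print *, 'shape seen inside:', shape(q)
      end subroutine fillz_like

   end program sequence_assoc_demo

If the element counts disagreed, nothing would complain at compile time, which is exactly the trap described above.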

@JulesKouatchou
Contributor Author

I want to add that when I run the stand-alone DynAdvCores, I do not have an issue. My guess is that the fillz subroutine is not called.

@kgerheiser

kgerheiser commented Sep 10, 2019

Runs fine in non-debug mode with Gfortran + OpenMPI. Maybe it's a compiler bug.

So, we have:

ifort + Intel MPI (debug): works (with a memory leak somewhere)

ifort + Intel MPI (non-debug): divide by zero error in tp_core

ifort + MPT: Some sort of memory access issue in fillz

Gfortran + OpenMPI: Works

@kgerheiser

kgerheiser commented Oct 7, 2019

I've found that removing the entries associated with hord_* in fvcore_layout.rc causes the crash to switch to tp_core. Though, I somewhat suspect this isn't related to the actual bug.

I have been using TotalView and its memory debugger to catch the problem, but it yields nothing useful. I can see that memory is corrupted (at least according to TotalView), but if you add a print in the code the values are fine.

And because the problem is only present when optimization is enabled, the code jumps around when you step through it, so it's hard to see what's happening.

kgerheiser added the bug (Something isn't working) label on Oct 7, 2019
@kgerheiser

kgerheiser commented Oct 7, 2019

I have found that both crashes are during the AVX instruction vdivpd, and that turning off vectorization using the -no-vec flag when compiling allows it to run.

@mathomp4
Member

@kgerheiser I'm slowly getting caught up now on missed stuff. Is it only one file that needs -no-vec? I'd prefer not to build all of FV without vectorization since, well, it gave us some speedup.
