
Running GEOS CTM #15

Open

JulesKouatchou opened this issue Aug 13, 2019 · 55 comments

Labels: bug (Something isn't working)

Comments

@JulesKouatchou
Contributor

I cloned GEOS CTM and was able to compile it. The ctm_setup script did not properly create the experiment directory because it was still referring to the old configuration (Linux/ instead of install/), so I fixed the ctm_setup file. The code now crashes during the initialization steps because it cannot create the grid; it fails on Line 9193 of MAPL_Generic.F90:

call ESMF_ConfigGetAttribute(state%cf,gridname,label=trim(comp_name)//CF_COMPONENT_SEPARATOR//'GRIDNAME:',rc=status)
VERIFY_(status)

I can see why there is a problem: the label should only be 'GRIDNAME:'.

I checked a couple of CVS tags I have and could not locate any MAPL version similar to the one in the git repository. I am wondering if MAPL has to be updated before GEOS CTM can run.

@kgerheiser

kgerheiser commented Aug 13, 2019

I have the CTM running with this version of MAPL. You just need to add something like this into GEOSCTM.rc:

GEOSctm.GRID_TYPE: Cubed-Sphere
GEOSctm.GRIDNAME: PE360x2160-CF
GEOSctm.NF: 6
GEOSctm.LM: 72
GEOSctm.IM_WORLD: 360

And for Dynamics:

DYNAMICS.GRID_TYPE: Cubed-Sphere
DYNAMICS.GRIDNAME: PE360x2160-CF
DYNAMICS.NF: 6
DYNAMICS.LM: 72
DYNAMICS.IM_WORLD: 360

Though, there is some duplication and it could probably be cleaned up and added to ctm_setup.

@JulesKouatchou
Contributor Author

Thank you. Very interesting. If it works, I will change the setup script...

@JulesKouatchou
Contributor Author

JulesKouatchou commented Aug 13, 2019

Do you have by chance an experiment directory on discover I can look at? My code is still crashing. Thanks.

@kgerheiser

I do have an experiment located at: /discover/nobackup/kgerheis/experiments/ctm_test_experiment

I think there might be a few other small things you need to change to get it to run. If you run into any errors I can probably quickly diagnose them as I've gone through this process several times.

@JulesKouatchou
Contributor Author

Kyle: Thank you. I will get back to you if I need more assistance.

@JulesKouatchou
Contributor Author

Kyle:

It seems that I am still missing something as my code continues to crash. My working directory is:

/gpfsm/dnb32/jkouatch/GEOS_CTM/GitRepos/testTR

I believe that I have an rc setting issue that I cannot identify. I noticed that in my standard output file I have:

In MAPL_Shmem:
NumCores per Node = 96
NumNodes in use = 1
Total PEs = 96

In MAPL_InitializeShmem (NodeRootsComm):
NumNodes in use = 1

That is not correct as there should be 4 nodes in use.

@kgerheiser

You need to register the grid with the GridManager before the grid can be created. Add this to AdvCore_GridComp.F90:

subroutine register_grid_and_regridders()
    use MAPL_GridManagerMod, only: grid_manager
    use CubedSphereGridFactoryMod, only: CubedSphereGridFactory
    use MAPL_RegridderManagerMod, only: regridder_manager
    use MAPL_RegridderSpecMod, only: REGRID_METHOD_BILINEAR
    use LatLonToCubeRegridderMod
    use CubeToLatLonRegridderMod
    use CubeToCubeRegridderMod

    type (CubedSphereGridFactory) :: factory

    type (CubeToLatLonRegridder) :: cube_to_latlon_prototype
    type (LatLonToCubeRegridder) :: latlon_to_cube_prototype
    type (CubeToCubeRegridder) :: cube_to_cube_prototype

    call grid_manager%add_prototype('Cubed-Sphere',factory)
    associate (method => REGRID_METHOD_BILINEAR, mgr => regridder_manager)
      call mgr%add_prototype('Cubed-Sphere', 'LatLon', method, cube_to_latlon_prototype)
      call mgr%add_prototype('LatLon', 'Cubed-Sphere', method, latlon_to_cube_prototype)
      call mgr%add_prototype('Cubed-Sphere', 'Cubed-Sphere', method, cube_to_cube_prototype)
    end associate

  end subroutine register_grid_and_regridders

and calling it in AdvCore SetServices (line 318)

if (.NOT. FV3_DynCoreIsRunning) then
   call fv_init2(FV_Atm, dt, grids_on_my_pe, p_split)
   call register_grid_and_regridders() ! add this line
end if

This register_grid_and_regridders routine duplicates code in DynCore and should probably be moved to its own module that both AdvCore and DynCore can call.

@JulesKouatchou
Contributor Author

Kyle: That helps me get further. Thanks. The code still crashes. I am now running in a debugging mode...

@lizziel

lizziel commented Aug 14, 2019

Jules, keep reporting the problems you are running into since I had to go through these issues as well when getting GCHP to work with the new MAPL. I may be able to help.

@lizziel

lizziel commented Aug 14, 2019

One note of caution so that you do not make the same mistake I did. Regarding the earlier comment to add this to your config file:

GEOSctm.GRID_TYPE: Cubed-Sphere
GEOSctm.GRIDNAME: PE360x2160-CF
GEOSctm.NF: 6
GEOSctm.LM: 72
GEOSctm.IM_WORLD: 360

I made the mistake of adding the prefix (in my case GCHP. rather than GEOSctm.) to the existing lines instead of adding new prefixed lines. This caused a silent bug in AdvCore_GridCompMod.F90, which expects entries for IM, JM, and LM in the file without prefixes. If those lines are not found, the calculated areas get messed up because default values are used.

My fix was to change AdvCore_GridCompMod.F90 to expect lines with the prefix rather than without.

-      call MAPL_GetResource( MAPL, FV_Atm(1)%flagstruct%npx, 'IM:', default= 32, RC=STATUS )
+      ! Customize for GCHP (ewl, 4/9/2019)
+!      call MAPL_GetResource( MAPL, FV_Atm(1)%flagstruct%npx, 'IM:', default= 32, RC=STATUS )
+      call MAPL_GetResource( MAPL, FV_Atm(1)%flagstruct%npx, 'GCHP.IM:', default= 32, RC=STATUS )
       _VERIFY(STATUS)
-      call MAPL_GetResource( MAPL, FV_Atm(1)%flagstruct%npy, 'JM:', default=192, RC=STATUS )
+!      call MAPL_GetResource( MAPL, FV_Atm(1)%flagstruct%npy, 'JM:', default=192, RC=STATUS )
+      call MAPL_GetResource( MAPL, FV_Atm(1)%flagstruct%npy, 'GCHP.JM:', default=192, RC=STATUS )
       _VERIFY(STATUS)
-      call MAPL_GetResource( MAPL, FV_Atm(1)%flagstruct%npz, 'LM:', default= 72, RC=STATUS )
+!      call MAPL_GetResource( MAPL, FV_Atm(1)%flagstruct%npz, 'LM:', default= 72, RC=STATUS )
+      call MAPL_GetResource( MAPL, FV_Atm(1)%flagstruct%npz, 'GCHP.LM:', default= 72, RC=STATUS )

@JulesKouatchou
Contributor Author

Lizzie,

Thank you for your comments. I am wondering if it was necessary to make any change to my GEOSCTM.rc file, as the AdvCore source code has:

     call MAPL_GetResource( MAPL, FV_Atm(1)%flagstruct%npx, 'IM:', default= 32, RC=STATUS )
     call MAPL_GetResource( MAPL, FV_Atm(1)%flagstruct%npy, 'JM:', default=192, RC=STATUS )
     call MAPL_GetResource( MAPL, FV_Atm(1)%flagstruct%npz, 'LM:', default= 72, RC=STATUS )

In any case, my code is still crashing and I am still trying to figure out why.

@JulesKouatchou
Contributor Author

Something unusual that I mentioned before is the following printout at the beginning of the run:

In MAPL_Shmem:
NumCores per Node = 84
NumNodes in use = 1
Total PEs = 84

In MAPL_InitializeShmem (NodeRootsComm):
NumNodes in use = 1

I went ahead in MAPL_ShmemMod.F90 and printed the name of all the processors (nodes) after the call:
call MPI_AllGather(name, MPI_MAX_PROCESSOR_NAME, MPI_CHARACTER, &
                   names, MPI_MAX_PROCESSOR_NAME, MPI_CHARACTER, Comm, status)

All the entries of the variable "names" only had the name of the head node. Something is wrong but I do not know what. I am sure that there is a new rc setting I need to add.
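
As a quick sanity check on the node distribution, a minimal standalone diagnostic along these lines (the program and variable names below are hypothetical; only standard MPI calls are used) would be something like:

   program node_names
      use mpi
      implicit none
      integer :: ierr, rank, nranks, namelen
      character(len=MPI_MAX_PROCESSOR_NAME) :: pname

      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
      call MPI_Comm_size(MPI_COMM_WORLD, nranks, ierr)
      call MPI_Get_processor_name(pname, namelen, ierr)

      ! With a launcher that spreads ranks correctly (e.g. mpiexec_mpt under MPT,
      ! or esma_mpirun), several distinct node names should appear here.
      print '(a,i0,a,i0,2a)', 'rank ', rank, ' of ', nranks, ' on node ', trim(pname)

      call MPI_Finalize(ierr)
   end program node_names

If every rank reports the head node here as well, the job launcher rather than MAPL is the likely culprit.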

@mathomp4
Member

mathomp4 commented Aug 15, 2019

@JulesKouatchou This sounds suspiciously like calling MPT with mpirun. How are you running your executable? If you're using MPT (which you probably are), you need to use either mpiexec_mpt or esma_mpirun (which autodetects the MPI stack and will run mpiexec_mpt).

ETA: For everyone, MPT does have an mpirun command but it is weird. It can do some interesting things, but it doesn't perform like you'd expect without a lot of extra work. mpiexec_mpt understands SLURM (ish) and does what is expected.

@mathomp4
Member

@JulesKouatchou Actually, I might need to work with you on the ctm_setup script. There are some changes we had to make to gcm_setup for the move to CMake that aren't reflected in that script as I see it on this repo. It's possible it's getting confused and picking the wrong MPI stack or the like.

@kgerheiser

@mathomp4 That is something I had to change in ctm_run.j, and why I asked you about MPT's mpirun. I changed RUN_CMD to mpiexec.

@JulesKouatchou
Contributor Author

@mathomp4
I made changes to my ctm_setup in order to have a complete experiment directory; the script was still referring, for instance, to the Linux/ directory.
In my ctm_run.j file, RUN_CMD is set to "mpirun", not "mpiexec". I will make the change and see what happens.

@JulesKouatchou
Contributor Author

Things are moving in the right direction. I now have, as expected:

In MAPL_Shmem:
NumCores per Node = 28
NumNodes in use = 3
Total PEs = 84

In MAPL_InitializeShmem (NodeRootsComm):
NumNodes in use = 3

The code is still crashing but this time in:

GEOSctm.x 00000000048A9E47 fv_statemod_mp_co 3160 FV_StateMod.F90
GEOSctm.x 0000000004868BAF fv_statemod_mp_fv 2902 FV_StateMod.F90
GEOSctm.x 00000000004B2B6B geos_ctmenvgridco 926 GEOS_ctmEnvGridComp.F90

It appears to be in the manipulation of U & V. My guess is that U & V are not properly read in, or that I am missing an rc setting somewhere.

@tclune
Collaborator

tclune commented Aug 20, 2019

@JulesKouatchou Can you tell me which branch of GEOSgcm_GridComp is being used? Lines 3160 and 2902 don't seem right for this type of error. (I wanted to check on an uninitialized pointer that crops up in FV_StateMod from time to time.)

@tclune
Collaborator

tclune commented Aug 20, 2019

@JulesKouatchou Also - it is worth learning how to include references to code snippets in tickets. It is easier to show you when you are here, but you can also find it by googling. As an example, I'll show the sections around the line numbers you mentioned above:

https://github.com/GEOS-ESM/GEOSgcm_GridComp/blob/b4446a6504226c3933fd34f4a842a44ba8b7333b/GEOSagcm_GridComp/GEOSsuperdyn_GridComp/FVdycoreCubed_GridComp/FV_StateMod.F90#L3154-L3168

@JulesKouatchou
Contributor Author

@tclune I do not know how to include references to code-snippets of external components/modules of the CTM. I will try to figure it out.

@lizziel

lizziel commented Aug 20, 2019

@JulesKouatchou Check this out: https://help.github.com/en/articles/creating-a-permanent-link-to-a-code-snippet. Thanks @tclune (I did not know about this).

@tclune
Collaborator

tclune commented Aug 20, 2019

Note - this used to work better. Rather than a link, you would actually see the lines of code in the ticket (and the email). I see that someone has raised the issue with GitHub. It started working incorrectly (for some users) 2-3 weeks ago.

@tclune
Collaborator

tclune commented Aug 20, 2019

Oh - it's because we are linking to text from a different repo. For the same repo it works really nicely. E.g.

DO ic = 1, numSpecies
   ! Get field and name from tracer bundle
   !--------------------------------------
   call ESMF_FieldBundleGet(DiffTR, ic, FIELD, RC=STATUS)
   VERIFY_(STATUS)
   call ESMF_FieldGet(FIELD, name=NAME, RC=STATUS)
   VERIFY_(STATUS)
   ! Identify fixed species such as O2, N2, etc.
   IF (self%FIRST) THEN
      IF (ic .EQ. numSpecies) self%FIRST = .FALSE.
      IF (TRIM(NAME) == 'ACET' .OR. TRIM(NAME) == 'N2' .OR. &
          TRIM(NAME) == 'O2' .OR. TRIM(NAME) == 'NUMDENS') THEN
         self%isFixedConcentration(ic) = .TRUE.
      END IF
   END IF

@lizziel

lizziel commented Aug 20, 2019

Very nice. Not sure if you guys use Slack, but it shows up nicely in Slack chats as well.

@kgerheiser

kgerheiser commented Aug 26, 2019

@JulesKouatchou

In fv_computeMassFluxes_r8 you need to initialize uc and vc to 0.0 at the beginning of the subroutine before they are assigned in lines 2916-2917 in FV_StateMod.F90.

  !add these two lines to initialize to 0
  uc = 0d0 
  vc = 0d0 

  uc(is:ie,js:je,:) = ucI
  vc(is:ie,js:je,:) = vcI

@JulesKouatchou
Contributor Author

Kyle,

Thank you for your inputs. I was able to run the CTM code for a day. I now need to run the code longer under various configurations.

@tclune
Collaborator

tclune commented Aug 26, 2019

@kgerheiser I'm wary that these are not the correct points to do initializations. Was this solution found by trial-and-error, or pulled from some other version of the code?

@kgerheiser

I found this bug several weeks ago and tracked it down to uninitialized halo values.

The line uc(is:ie,js:je,:) = ucI only initializes the interior of the array, leaving the halos uninitialized. Then, later in the code, compute_utvt uses those halo values while they are still uninitialized, which causes the crash.
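
A reduced, self-contained sketch of that situation (the bounds, halo width, and program name below are made up for illustration):

   program halo_init_demo
      implicit none
      integer, parameter :: is = 1, ie = 4, js = 1, je = 4, ng = 3, npz = 2
      real :: uc(is-ng:ie+ng, js-ng:je+ng, npz)   ! array allocated with halo extents
      real :: ucI(is:ie, js:je, npz)              ! interior-only data

      ucI = 1.0
      uc  = 0.0                     ! the added initialization: halo cells are now defined
      uc(is:ie, js:je, :) = ucI     ! the original assignment only fills the interior

      ! Without the "uc = 0.0" line, any read of uc outside is:ie / js:je (as a
      ! halo-using routine would do before an exchange) touches undefined values.
      print *, 'halo value:', uc(is-1, js, 1)
   end program halo_init_demo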

@kgerheiser

I looked at the r4 version of the subroutine and it also assigns the variables to 0 at the same place.

@lizziel

lizziel commented Aug 26, 2019

It looks like someone at Harvard added this fix to the GCHP version of FV3 (r8) several years ago. Apologies if it never made it up the chain. I missed it as well when upgrading FV3 recently so will need to add it back in, although it hasn't caused a crash so far.

@JulesKouatchou
Contributor Author

I want to provide an update. The code does not run when using regular compilation options:

Image PC Routine Line Source
GEOSctm.x 000000000261DDAE Unknown Unknown Unknown
libpthread-2.11.3 00002AAAAEAAF850 Unknown Unknown Unknown
GEOSctm.x 00000000015CE155 tp_core_mod_mp_pe 1053 tp_core.F90
GEOSctm.x 00000000015D932C tp_core_mod_mp_yp 915 tp_core.F90
GEOSctm.x 00000000015C2D27 tp_core_mod_mp_fv 165 tp_core.F90
GEOSctm.x 00000000012CB321 fv_statemod_mp_fv 2979 FV_StateMod.F90

When I compile with debugging options, the code runs for 8 days and crashes:

AGCM Date: 2010/02/08 Time: 16:00:00 Throughput(days/day)[Avg Tot Run]: 328.0 310.2 340.4 TimeRemaining(Est) 001:46:49 91.0% Memory Committed
Insufficient memory to allocate Fortran RTL message buffer, message #41 = hex 00000029.
Insufficient memory to allocate Fortran RTL message buffer, message #41 = hex 00000029.

There seems to be a memory issue. Is there any setting I need to have?

My code is at:

/discover/nobackup/jkouatch/GEOS_CTM/GitRepos/GEOSctm

and my experiment directory at:

/discover/nobackup/jkouatch/GEOS_CTM/GitRepos/testTR

@tclune
Collaborator

tclune commented Aug 27, 2019

@lizziel has also reported some memory leak issues. @bena-nasa has tried to replicate with a more synthetic use case, but was unsuccessful.

I know that @wmputman has been using the advec core in his latest GCM development, but AFAIK he is using a slightly different version of FV than made it into Git. I.e., the git version of FV was really only vetted with the dycore and we are probably missing various minor fixes in the advec core that were only in CTM tags under CVS.

Someone knowledgeable needs to take a hard look at the diffs between the current FV and the working versions in CTMs under CVS.

@kgerheiser

I can reproduce the crash in tp_core with debugging turned off.

@kgerheiser

kgerheiser commented Sep 4, 2019

I'm not sure what to make of this:

I added a print statement to tp_core where the crash is occurring (divide by zero) to check if the divisor was really small, and adding the print statement made the crash go away.

And it turns out a4 on this line fmin = a0(i) + 0.25/a4*da1**2 + a4*r12 is on the order of 10^-16.

https://github.com/GEOS-ESM/GEOSgcm_GridComp/blob/9ab6f4e18386d0e7802711f9dd70849080b61f78/GEOSagcm_GridComp/GEOSsuperdyn_GridComp/FVdycoreCubed_GridComp/fvdycore/model/tp_core.F90#L1054
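
Purely as an illustration of why an a4 of that magnitude is trouble in that expression (the tolerance, values, and program name below are hypothetical; this is not a change that was made to tp_core):

   program small_divisor_demo
      implicit none
      real(8) :: a0, a4, da1, r12, fmin
      real(8), parameter :: tiny_a4 = 1.0d-14   ! hypothetical guard tolerance

      a0  = 1.0d0
      da1 = 0.1d0
      r12 = 1.0d0/12.0d0
      a4  = 1.0d-16                             ! the magnitude reported above

      if (abs(a4) > tiny_a4) then
         fmin = a0 + 0.25d0/a4*da1**2 + a4*r12
      else
         fmin = a0                              ! treat near-zero a4 as the degenerate case
      end if
      print *, 'a4 =', a4, ' fmin =', fmin
   end program small_divisor_demo

Whether a guard like this is appropriate for the limiter is a separate question; the point is only that with a4 of order 10^-16 the 0.25/a4 term becomes enormous, so the expression is very sensitive to how the division is evaluated.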

@tclune
Collaborator

tclune commented Sep 4, 2019

We seem to be accumulating a sizable number of mysteries in the FV layer recently.

Just yesterday Rahul made a change to the model entirely outside of FV, but it produced a runtime error in a write statement.

I certainly encourage you to use aggressive debugging options under both gfortran and Intel in the hopes that it exposes something. But if you've already done that ... Valgrind?

@wmputman

wmputman commented Sep 4, 2019 via email

@mathomp4
Member

mathomp4 commented Sep 5, 2019

> There is the outstanding issue that the cat fvcore_layout.rc >> input.nml in the run fails on occasion…. This can produce inconsistent and undesirable effects in FV3.

@wmputman,

Did you ever try using Rusty's INTERNAL_FILE_NML as seen in GEOS-ESM/GEOSgcm#34? Maybe it's just thousands of cores trying to open the same file that could cause issues?

@wmputman

wmputman commented Sep 5, 2019 via email

@mathomp4
Member

mathomp4 commented Sep 5, 2019

> The trouble is that sometimes the input.nml file is completely empty, before the executable even begins.

That sounds like an issue in the scripting then. No matter what, we can never not do an append because the coupled model (@yvikhlya) actually has an input.nml file that controls all of MOM. Thus, we need to append the rc file to it.

I could belt-and-suspenders it with scripting. We could put in detection before GEOSxxx.x runs: if input.nml is empty, die, or try the append again? Hmm.

@bena-nasa
Contributor

bena-nasa commented Sep 5, 2019

I was able to run the CTM version on GitHub that Jules pointed me to, with a few modifications outlined in the comments here. This appears to be based on Jason-3_0. I too was only able to run the debug build, but I can confirm that there does appear to be a memory leak. After 4 days at c90 with the tracer case (running from 21z on the 1st to 0z on the 5th of the month), here is the memory use on the root node of my compute session:

Jason-3_0 based GEOSctm: 18.4% to 56.4%
MAPL on the develop branch and MAPL-2.0 branch on other repos: 12.9% to 14.4%
Version of MAPL off develop that no longer uses CFIO: 11.2% to 13.4%

So apparently not using the little CFIO does not make a difference? It is also very hard to tell how correlated this is to ExtData.

@tclune
Collaborator

tclune commented Sep 6, 2019

That's very encouraging on the memory front.

Might need to up the priority for investigating the no-debug failure now.

@JulesKouatchou
Contributor Author

When I compile in a no-debug mode, the code crashes on Line 999 of:

       FVdycoreCubed_GridComp/fvdycore/model/fv_tracer2d.F90 

It is:
if (flagstruct%fill) call fillz(ie-is+1, npz, 1, q1(:,j,:), dp2(:,j,:))

If I comment out the line, the code can run. I have not yet figured out why the code crashes in the subroutine "fillz", which is inside:

      FVdycoreCubed_GridComp/fvdycore/model/fv_fill.F90

It seems that something is happening in this code segment (Lines 87-92):

  do i=1,im
     if( q(i,1,ic) < 0. ) then
         q(i,2,ic) = q(i,2,ic) + q(i,1,ic)*dp(i,1)/dp(i,2)   ! <-- the suspect line
         q(i,1,ic) = 0.
     endif
  enddo

@kgerheiser

kgerheiser commented Sep 6, 2019

It crashes in fv_tracer2d? Based on your previous post and from my testing, it crashes in tp_core with a floating divide by zero. Also, what does it crash with: segfault, divide by zero, etc.?

@JulesKouatchou
Contributor Author

Here is what I am getting:

MPT: #6 0x000000000173a21e in fv_fill_mod::fillz (
MPT: im=<error reading variable: Cannot access memory at address 0x14>,
MPT: km=<error reading variable: Cannot access memory at address 0x4>,
MPT: nq=<error reading variable: Cannot access memory at address 0x0>,
MPT: q=<error reading variable: Cannot access memory at address 0x7fffffff3590>, dp=...)
MPT: at /gpfsm/dnb32/jkouatch/GEOS_CTM/GitRepos/GEOSctm/src/Components/GEOSctm_GridComp/@GEOSgcm_GridComp/GEOSagcm_GridComp/GEOSsuperdyn_GridComp/FVdycoreCubed_GridComp/fvdycore/model/fv_fill.F90:89

I read somewhere that one way to remove the "error reading variable: Cannot access memory" error is to change the compilation options. Perhaps that is why the code can run with debugging options turned on.

@kgerheiser

Looks like it might be MPT related. I was building with Ifort and Intel MPI. Maybe that's why I don't get that bug.

Right now I'm working on building with gfortran and OpenMPI in the hope that it will expose some bugs.

@mathomp4
Member

mathomp4 commented Sep 6, 2019

@JulesKouatchou Do you know what MPT environment variables are set with the CTM? It's possible one of the many we set for the GCM is needed for the CTM?

@mathomp4
Member

mathomp4 commented Sep 6, 2019

Oh wow. I probably need the expertise of @tclune for my question with this code. So we have:

if (flagstruct%fill) call fillz(ie-is+1, npz, 1, q1(:,j,:), dp2(:,j,:))

and now the fillz routine:

 subroutine fillz(im, km, nq, q, dp)
   integer,  intent(in):: im                !< No. of longitudes
   integer,  intent(in):: km                !< No. of levels
   integer,  intent(in):: nq                !< Total number of tracers
   real , intent(in)::  dp(im,km)           !< pressure thickness
   real , intent(inout) :: q(im,km,nq)      !< tracer mixing ratio

Now q1 and dp2 are:

real ::   q1(bd%is :bd%ie ,bd%js :bd%je , npz   )! 3D Tracers
      real  dp2(bd%is:bd%ie,bd%js:bd%je,npz)

which means we are sending a weird 2-D slice, q1(:,j,:), to a 3-D q in fillz?

So does Fortran guarantee that q1(is:ie,j,:) maps element-for-element onto q(ie-is+1,npz,1) when passed through the interface of a subroutine? They are the same size in the end ((ie-is+1)*npz), but ouch. I guess I'd be dumb and fill a temporary (is:ie,npz,1) array with q and pass that in.

@JulesKouatchou
Contributor Author

Matt: I had the same concern. I created two temporary variables loc_q1(ie-is+1,npz,1) and loc_dp2(ie-is+1,npz). The code still crashed at the same location.

@kgerheiser

kgerheiser commented Sep 6, 2019

My explanation was sort of off before.

They are the same size, which is the important thing. A temporary array of the given size is created for the subroutine, the data is copied from your source array (which has the same number of elements) into it, and it is copied back when the subroutine returns. You're basically just interpreting the bounds differently.

Though if the sizes don't match, the code will still happily compile and run.

@tclune
Collaborator

tclune commented Sep 6, 2019

@mathomp4 Unfortunately, this style is acceptable according to the standard. (But I'd much prefer not to mix F90 and F77 array styles.)

As @kgerheiser explains, the compiler is forced to make copies due to the explicit shape of the dummy arguments. Without aggressive debugging flags, the sizes don't even have to agree.
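
A minimal, self-contained sketch of the sequence association being described (the names and sizes are made up; fillz_like only stands in for the real fillz):

   program sequence_assoc_demo
      implicit none
      integer, parameter :: is = 1, ie = 3, js = 1, je = 2, npz = 4
      real :: q1(is:ie, js:je, npz)
      integer :: j

      q1 = 2.0
      j  = js
      ! A rank-2 section is passed to a rank-3 explicit-shape dummy: the elements
      ! are copied to a contiguous temporary on entry and copied back on return.
      call fillz_like(ie-is+1, npz, 1, q1(:,j,:))

   contains

      subroutine fillz_like(im, km, nq, q)
         integer, intent(in)    :: im, km, nq
         real,    intent(inout) :: q(im, km, nq)   ! explicit-shape dummy, im*km*nq elements
         ! Same element count as q1(:,j,:), just reinterpreted with bounds (im, km, nq).
         print *, 'shape seen inside:', shape(q)
      end subroutine fillz_like

   end program sequence_assoc_demo

If the element counts disagreed, nothing would complain at compile time, which is exactly the trap described above.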

@JulesKouatchou
Contributor Author

I want to add that when I run the stand-alone DynAdvCores, I do not have an issue. My guess is that the fillz subroutine is not called.

@kgerheiser

kgerheiser commented Sep 10, 2019

Runs fine in non-debug mode with Gfortran + OpenMPI. Maybe it's a compiler bug.

So, we have:

ifort + Intel MPI (debug): works (with a memory leak somewhere)

ifort + Intel MPI (non-debug): divide by zero error in tp_core

ifort + MPT: Some sort of memory access issue in fillz

Gfortran + OpenMPI: Works

@kgerheiser

kgerheiser commented Oct 7, 2019

I've found that removing the entries associated with hord_* in fvcore_layout.rc causes the crash to switch to tp_core. Though, I somewhat suspect this isn't related to the actual bug.

I have been using TotalView and its memory debugger to catch the problem, but it yields nothing useful. I can see that memory is corrupted (at least according to TotalView), but if you add a print in the code the values are fine.

And because the problem is only present when optimization is enabled, the code jumps around when you step through it, so it's hard to see what's happening.

kgerheiser added the bug (Something isn't working) label on Oct 7, 2019
@kgerheiser

kgerheiser commented Oct 7, 2019

I have found that both crashes are during the AVX instruction vdivpd, and that turning off vectorization using the -no-vec flag when compiling allows it to run.

@mathomp4
Member

@kgerheiser I'm slowly getting caught up now on missed stuff. Is it only one file that needs -no-vec? I'd prefer not to build all of FV without vectorization since, well, it gave us some speedup.
