Thread 1 "gcclassic" received signal SIGSEGV, Segmentation fault. #2502

Open
YueZhang720 opened this issue Oct 8, 2024 · 4 comments
Labels: category: Debug Help (request for assistance debugging GEOS-Chem); topic: Runtime Error (related to runtime issues, e.g. simulation stopped w/ error)

Comments


YueZhang720 commented Oct 8, 2024

Your name

Yue Zhang

Your affiliation

HKUST(GZ)

What happened? What did you expect to happen?

I'm running a full-chemistry simulation from 2019/07/01 to 2019/08/01. My log file contains the following errors:

---> DATE: 2019/07/01  UTC: 00:00
 HEMCO already called for this timestep. Returning.
 Getting CH4 boundary conditions in GEOS-Chem from :NOAA_GMD_CH4
real 181.08
user 1026.99
sys 29.57

What are the steps to reproduce the bug?

I then ran the simulation under gdb to get a backtrace of the error:

********************************************
* B e g i n   T i m e   S t e p p i n g !! *
********************************************

---> DATE: 2019/07/01  UTC: 00:00
 HEMCO already called for this timestep. Returning.
 Getting CH4 boundary conditions in GEOS-Chem from :NOAA_GMD_CH4

Thread 1 "gcclassic" received signal SIGSEGV, Segmentation fault.
0x000000000105972a in blkslv (fj=<error reading variable: Cannot access memory at address 0x7fffff3ed118>, pomega=<error reading variable: Cannot access memory at address 0x7fffff3ed110>, fz=<error reading variable: Cannot access memory at address 0x7fffff3ed108>, ztau=<error reading variable: Cannot access memory at address 0x7fffff3ed100>, fsbot=<error reading variable: Cannot access memory at address 0x7fffff3ed0f8>, rfl=<error reading variable: Cannot access memory at address 0x7fffff3ed0f0>, pm=..., pm0=..., fjtop=..., fjbot=..., fibot=..., ldokr=..., nd=165) at /public/home/jingzhoujiang/gcruns/CodeDir/src/Cloud-J/src/Core/cldj_fjx_sub_mod.F90:1372
1372          subroutine BLKSLV &
(gdb) backtrace
#0  0x000000000105972a in blkslv (fj=<error reading variable: Cannot access memory at address 0x7fffff3ed118>, 
    pomega=<error reading variable: Cannot access memory at address 0x7fffff3ed110>, 
    fz=<error reading variable: Cannot access memory at address 0x7fffff3ed108>, 
    ztau=<error reading variable: Cannot access memory at address 0x7fffff3ed100>, 
    fsbot=<error reading variable: Cannot access memory at address 0x7fffff3ed0f8>, 
    rfl=<error reading variable: Cannot access memory at address 0x7fffff3ed0f0>, pm=..., pm0=..., fjtop=..., fjbot=..., fibot=..., 
    ldokr=..., nd=165) at /public/home/jingzhoujiang/gcruns/CodeDir/src/Cloud-J/src/Core/cldj_fjx_sub_mod.F90:1372
#1  0x0000000001062011 in miesct (fj=<error reading variable: value requires 129816 bytes, which is more than max-value-size>, fjt=..., 
    fjb=..., fib=..., pomega=<error reading variable: value requires 1038528 bytes, which is more than max-value-size>, 
    fz=<error reading variable: value requires 129816 bytes, which is more than max-value-size>, 
    ztau=<error reading variable: value requires 129816 bytes, which is more than max-value-size>, fsbot=..., rfl=..., 
    u0=0.41925676950732299, ldokr=..., nd=165) at /public/home/jingzhoujiang/gcruns/CodeDir/src/Cloud-J/src/Core/cldj_fjx_sub_mod.F90:1339
#2  0x00000000010683f6 in opmie (dtaux=..., 
    pomegax=<error reading variable: value requires 82944 bytes, which is more than max-value-size>, u0=0.41925676950732299, rfl=..., 
    amf=..., amg=..., jxtra=..., fjact=..., fjtop=..., fjbot=..., fibot=..., fsbot=..., fjflx=..., flxd=..., flxd0=..., ldokr=..., lu=47, 
    rc=0) at /public/home/jingzhoujiang/gcruns/CodeDir/src/Cloud-J/src/Core/cldj_fjx_sub_mod.F90:1243
#3  0x00000000010761d0 in photo_jx (u0=0.41925676950732299, sza=65.212326865588082, rfl=..., solf=0.96650251637029394, lprtj=.FALSE., 
    ppp=..., zzz=..., ttt=..., hhh=..., ddd=..., rrr=..., ooo=..., ccc=..., lwp=..., iwp=..., reffl=..., reffi=..., aersp=..., ndxaer=..., 
    l1u=48, anu=37, njxu=166, valjxx=..., skperd=..., swmsq=..., od18=..., ldark=.FALSE., rc=0)
    at /public/home/jingzhoujiang/gcruns/CodeDir/src/Cloud-J/src/Core/cldj_fjx_sub_mod.F90:587
#4  0x0000000001034cd3 in cloud_jx (u0=0.41925676950732299, sza=65.212326865588082, rfl=..., solf=0.96650251637029394, lprtj=.FALSE., 
    ppp=..., zzz=..., ttt=..., hhh=..., ddd=..., rrr=..., ooo=..., ccc=..., lwp=..., iwp=..., reffl=..., reffi=..., cldf=..., 
    cldcor=0.33000001311302185, cldiw=..., aersp=..., ndxaer=..., l1u=48, anu=37, njxu=166, valjxx=..., skperd=..., swmsq=..., od18=..., 
    iran=1, nica=0, jcount=0, ldark=.FALSE., wtqca=..., rc=0)
    at /public/home/jingzhoujiang/gcruns/CodeDir/src/Cloud-J/src/Core/cldj_sub_mod.F90:183
#5  0x0000000000b75b4b in __cldj_interface_mod_MOD_run_cloudj._omp_fn.0 ()
    at /public/home/jingzhoujiang/gcruns/CodeDir/src/GEOS-Chem/GeosCore/cldj_interface_mod.F90:898
#6  0x00007ffff79d0166 in GOMP_parallel (fn=0xb73218 <__cldj_interface_mod_MOD_run_cloudj._omp_fn.0>, data=0x7ffffffd9810, num_threads=32, 
    flags=0) at /tmp/jingzhoujiang/spack-stage/spack-stage-gcc-14.2.0-ynlf3gta5n3oegqmhph3urmk6k26txce/spack-src/libgomp/parallel.c:178
#7  0x0000000000b72ba0 in run_cloudj (input_opt=..., state_chm=..., state_diag=..., state_grid=..., state_met=..., rc=0)
    at /public/home/jingzhoujiang/gcruns/CodeDir/src/GEOS-Chem/GeosCore/cldj_interface_mod.F90:415
#8  0x00000000006df8ee in do_photolysis (input_opt=..., state_chm=..., state_diag=..., state_grid=..., state_met=..., rc=0)
    at /public/home/jingzhoujiang/gcruns/CodeDir/src/GEOS-Chem/GeosCore/photolysis_mod.F90:538
#9  0x00000000004e8a50 in do_fullchem (input_opt=..., state_chm=..., state_diag=..., state_grid=..., state_met=..., rc=0)
    at /public/home/jingzhoujiang/gcruns/CodeDir/src/GEOS-Chem/GeosCore/fullchem_mod.F90:393
#10 0x00000000004395de in do_chemistry (input_opt=..., state_chm=..., state_diag=..., state_grid=..., state_met=..., rc=0)
    at /public/home/jingzhoujiang/gcruns/CodeDir/src/GEOS-Chem/GeosCore/chemistry_mod.F90:248
#11 0x000000000040b372 in geos_chem () at /public/home/jingzhoujiang/gcruns/CodeDir/src/GEOS-Chem/Interfaces/GCClassic/main.F90:1456
#12 0x000000000040ed17 in main (argc=1, argv=0x7ffffffef2d2)
    at /public/home/jingzhoujiang/gcruns/CodeDir/src/GEOS-Chem/Interfaces/GCClassic/main.F90:32
#13 0x00007ffff7740d90 in __libc_start_call_main (main=main@entry=0x40eccf <main>, argc=argc@entry=1, argv=argv@entry=0x7ffffffeee28)
    at ../sysdeps/nptl/libc_start_call_main.h:58
#14 0x00007ffff7740e40 in __libc_start_main_impl (main=0x40eccf <main>, argc=1, argv=0x7ffffffeee28, init=<optimized out>, 
    fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffffffeee18) at ../csu/libc-start.c:392
#15 0x0000000000405b15 in _start ()

What should I do next?

Please attach any relevant configuration and log files.

GC.log
gcfiles.zip

What GEOS-Chem version were you using?

14.4.3

What environment were you running GEOS-Chem on?

Local cluster

What compiler and version were you using?

gcc 14.2.0

Will you be addressing this bug yourself?

Yes, but I will need some help

In what configuration were you running GEOS-Chem?

GCClassic

What simulation were you running?

Full chemistry

At what resolution were you running GEOS-Chem?

2x2.5

What meteorology fields did you use?

MERRA-2

Additional information

No response

@YueZhang720 YueZhang720 added the category: Bug Something isn't working label Oct 8, 2024
Contributor

yantosca commented Oct 8, 2024

Thanks for writing, @YueZhang720. I wonder if this is specific to the gcc 14.2.0 compilers. I can try to replicate it, but I will first need to build the libraries with spack for 14.2.0, so it may take me a while to get to this.

There are a couple of errors like this:

pomega=<error reading variable: value requires 1038528 bytes, which is more than max-value-size>, 
    fz=<error reading variable: value requires 129816 bytes, which is more than max-value-size>, 

so it may also be an issue on your cluster.
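
For context, the `max-value-size` messages are produced by gdb itself: it declines to print any value larger than its `max-value-size` setting (65536 bytes by default), so those lines on their own do not necessarily point to a fault in the program or the cluster. A minimal sketch of collecting the same backtrace without that limit and without pager prompts, assuming the executable is `./gcclassic` in the current run directory; the helper name `gdb_bt_cmd` is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: print a gdb command line that runs a program in
# batch mode and dumps a backtrace when the program stops on a signal.
gdb_bt_cmd() {
    exe="${1:-./gcclassic}"                            # assumed executable path
    printf '%s' "gdb -batch"
    printf ' %s' "-ex 'set pagination off'"            # no "--Type <RET> for more--" prompts
    printf ' %s' "-ex 'set max-value-size unlimited'"  # lift gdb's value-size limit
    printf ' %s' "-ex run -ex backtrace"
    printf ' %s\n' "$exe"
}

# Print the command. To actually run it and save the output:
#   gdb_bt_cmd ./gcclassic | sh > gdb_backtrace.log 2>&1
gdb_bt_cmd ./gcclassic
```

Printing the command rather than executing it keeps the sketch safe to paste; the commented line shows one way to run it from the run directory.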

If you have an older version of the GCC compilers (like 12.2.0) available, try that and see if you get the same error.

@yantosca yantosca self-assigned this Oct 8, 2024
@yantosca yantosca added the topic: Runtime Error Related to runtime issues (e.g. simulation stopped w/ error) label Oct 8, 2024

yantosca commented Oct 8, 2024

Also, which version of gdb are you using? You can type gdb --version to get that information.

@YueZhang720 YueZhang720 closed this as not planned Oct 10, 2024
@YueZhang720 YueZhang720 reopened this Oct 10, 2024
@YueZhang720
Author

In reply to @yantosca's question ("Also which version of gdb are you using? You can type gdb --version to get that information."):

Thanks for your help. This is my gdb version:

jingzhoujiang@login01:~/gcruns$ gdb --version
GNU gdb (Ubuntu 12.1-0ubuntu1~22.04.2) 12.1
Copyright (C) 2022 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

I have also tried gcc 12.2.0, but the run still ends in a segmentation fault, now with additional errors. Is there anything wrong with my software or environment settings?

At line 507 of file /public/home/jingzhoujiang/gcruns/CodeDir/src/Cloud-J/src/Core/cldj_fjx_sub_mod.F90
Fortran runtime error: Index '2146697216' of dimension 1 of array 'ooj' above upper bound of 48

Error termination. Backtrace:

Thread 30 "gcclassic" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffa5e289640 (LWP 3071783)]
0x0000000001063324 in opmie (dtaux=<error reading variable: value requires 17717160934075531264 bytes, which is more than max-value-size>, pomegax=<error reading variable: value requires 12610078956637388800 bytes, which is more than max-value-size>, u0=<error reading variable: Cannot access memory at address 0x7ff4000000000000>, rfl=<error reading variable: Cannot access memory at address 0x7ff4000000000000>, amf=..., amg=<error reading variable: value requires 18419722475945328640 bytes, which is more than max-value-size>, jxtra=<error reading variable: value requires 18433233274827440128 bytes, which is more than max-value-size>, fjact=<error reading variable: value requires 17717160934075531264 bytes, which is more than max-value-size>, fjtop=<error reading variable: Cannot access memory at address 0x7ff4000000000000>, fjbot=<error reading variable: Cannot access memory at address 0x7ff4000000000000>, fibot=<error reading variable: Cannot access memory at address 0x7ff4000000000000>, fsbot=<error reading variable: Cannot access memory at address 0x7ff4000000000000>, fjflx=<error reading variable: value requires 17717160934075531264 bytes, which is more than max-value-size>, flxd=<error reading variable: value requires 17717160934075531264 bytes, which is more than max-value-size>, flxd0=<error reading variable: Cannot access memory at address 0x7ff4000000000000>, ldokr=<error reading variable: Cannot access memory at address 0x7ff4000000000000>, lu=<error reading variable: Cannot access memory at address 0x7ff4000000000000>, rc=<error reading variable: Cannot access memory at address 0x7ff4000000000000>) at /public/home/jingzhoujiang/gcruns/CodeDir/src/Cloud-J/src/Core/cldj_fjx_sub_mod.F90:1040
1040              DTAU1(L) = DTAUX(L,K) * AMG(L)
(gdb) backtrace
#0  0x0000000001063324 in opmie (dtaux=<error reading variable: value requires 17717160934075531264 bytes, which is more than max-value-size>, 
    pomegax=<error reading variable: value requires 12610078956637388800 bytes, which is more than max-value-size>, 
    u0=<error reading variable: Cannot access memory at address 0x7ff4000000000000>, 
    rfl=<error reading variable: Cannot access memory at address 0x7ff4000000000000>, amf=..., 
    amg=<error reading variable: value requires 18419722475945328640 bytes, which is more than max-value-size>, 
    jxtra=<error reading variable: value requires 18433233274827440128 bytes, which is more than max-value-size>, 
    fjact=<error reading variable: value requires 17717160934075531264 bytes, which is more than max-value-size>, 
    fjtop=<error reading variable: Cannot access memory at address 0x7ff4000000000000>, 
    fjbot=<error reading variable: Cannot access memory at address 0x7ff4000000000000>, 
    fibot=<error reading variable: Cannot access memory at address 0x7ff4000000000000>, 
    fsbot=<error reading variable: Cannot access memory at address 0x7ff4000000000000>, 
    fjflx=<error reading variable: value requires 17717160934075531264 bytes, which is more than max-value-size>, 
    flxd=<error reading variable: value requires 17717160934075531264 bytes, which is more than max-value-size>, 
    flxd0=<error reading variable: Cannot access memory at address 0x7ff4000000000000>, 
    ldokr=<error reading variable: Cannot access memory at address 0x7ff4000000000000>, 
    lu=<error reading variable: Cannot access memory at address 0x7ff4000000000000>, 
    rc=<error reading variable: Cannot access memory at address 0x7ff4000000000000>)
 at /public/home/jingzhoujiang/gcruns/CodeDir/src/Cloud-J/src/Core/cldj_fjx_sub_mod.F90:1040
#1  0x7ff4000000000000 in ?? ()
#2  0x7ff4000000000000 in ?? ()
#3  0x7ff4000000000000 in ?? ()
#4  0x7ff4000000000000 in ?? ()
#5  0x7ff4000000000000 in ?? ()
#6  0x7ff4000000000000 in ?? ()
#7  0x7ff4000000000000 in ?? ()
#8  0x7ff4000000000000 in ?? ()
#9  0x7ff4000000000000 in ?? ()
#10 0x7ff4000000000000 in ?? ()
#11 0x7ff4000000000000 in ?? ()
#12 0x7ff4000000000000 in ?? ()
#13 0x7ff4000000000000 in ?? ()
#14 0x7ff4000000000000 in ?? ()
#15 0x7ff4000000000000 in ?? ()
#16 0x7ff4000000000000 in ?? ()
#17 0x7ff4000000000000 in ?? ()
#18 0x7ff4000000000000 in ?? ()
#19 0x7ff4000000000000 in ?? ()
#20 0x7ff4000000000000 in ?? ()
#21 0x7ff4000000000000 in ?? ()
#22 0x7ff4000000000000 in ?? ()
#23 0x7ff4000000000000 in ?? ()
#24 0x7ff4000000000000 in ?? ()
#25 0x7ff4000000000000 in ?? ()
#26 0x7ff4000000000000 in ?? ()
#27 0x7ff4000000000000 in ?? ()
#28 0x7ff4000000000000 in ?? ()
#29 0x7ff4000000000000 in ?? ()
#30 0x7ff4000000000000 in ?? ()
#31 0x7ff4000000000000 in ?? ()
#32 0x7ff4000000000000 in ?? ()
#33 0x7ff4000000000000 in ?? ()
#34 0x7ff4000000000000 in ?? ()
#35 0x7ff4000000000000 in ?? ()
#36 0x7ff4000000000000 in ?? ()
#37 0x7ff4000000000000 in ?? ()
#38 0x7ff4000000000000 in ?? ()
#39 0x7ff4000000000000 in ?? ()
#40 0x7ff4000000000000 in ?? ()
#41 0x7ff4000000000000 in ?? ()
#42 0x7ff4000000000000 in ?? ()
#43 0x7ff4000000000000 in ?? ()
#44 0x7ff4000000000000 in ?? ()
#45 0x7ff4000000000000 in ?? ()
#46 0x7ff4000000000000 in ?? ()
#47 0x7ff4000000000000 in ?? ()
#48 0x7ff4000000000000 in ?? ()
#49 0x7ff4000000000000 in ?? ()
#50 0x7ff4000000000000 in ?? ()
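
One environment setting worth ruling out for segmentation faults like the two above is the stack size. In the first backtrace the crash occurs on entry to `BLKSLV`, and the unreadable argument addresses (for example 0x7fffff3ed118) lie about 12 MiB below the `argv` address in frame #12 (0x7ffffffef2d2), which would be consistent with exceeding a typical 8 MiB default stack limit; the thread does not confirm this as the cause. A minimal sketch, assuming a bash-like shell; the `500m` value is an illustrative assumption, not a tuned recommendation for this cluster:

```shell
# Raise the stack limit for the main thread, if the system allows it.
ulimit -s unlimited 2>/dev/null || echo "could not raise stack limit; current: $(ulimit -s)"

# Stack size for each OpenMP worker thread (read by the OpenMP runtime at startup).
export OMP_STACKSIZE=500m

# Show the settings that the executable will inherit.
echo "stack limit: $(ulimit -s)"
echo "OMP_STACKSIZE: $OMP_STACKSIZE"
```

These lines would go in the run script, or be typed in the same shell session, before `./gcclassic` is started so that the executable inherits them.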

@yantosca
Contributor

Thanks @YueZhang720. This is an out-of-bounds error in Cloud-J:

At line 507 of file /public/home/jingzhoujiang/gcruns/CodeDir/src/Cloud-J/src/Core/cldj_fjx_sub_mod.F90
Fortran runtime error: Index '2146697216' of dimension 1 of array 'ooj' above upper bound of 48

@lizziel: Didn't we see a similar issue in Cloud-J before?
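
The out-of-range index and the address that fills the gdb backtrace appear to be related: 2146697216 is 0x7FF40000, the upper 32 bits of 0x7ff4000000000000, which is a NaN bit pattern for an IEEE 754 double (exponent bits all ones, nonzero mantissa). That would be consistent with an integer index being read from memory that holds NaN-valued reals, although nothing in this thread confirms the cause. A quick check of the arithmetic, assuming a shell with 64-bit integer arithmetic such as bash:

```shell
# Out-of-bounds index reported by the Fortran runtime, printed in hex.
printf '0x%X\n' 2146697216

# Upper 32 bits of the address repeated in the gdb backtrace, in decimal.
echo $(( 0x7ff4000000000000 >> 32 ))
```

The first command prints 0x7FF40000 and the second prints 2146697216, so the two values share the same bit pattern.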

@yantosca yantosca added category: Debug Help Request for assistance debugging GEOS-Chem and removed category: Bug Something isn't working labels Oct 17, 2024