
Running an AMIP Experiment

Ben Auer edited this page Jul 25, 2024 · 21 revisions

After you set up an AMIP experiment, you will have an experiment directory with many files and subdirectories:

  • AGCM.rc -- resource file with specifications of boundary conditions, initial conditions, parameters, etc.
  • CAP.rc -- resource file with run job parameters
  • GEOSgcm.x -- model executable
  • HISTORY.rc -- resource file specifying the fields in the model that are output as data
  • RC/ -- contains resource files for various components of the model
  • archive/ -- contains job script for archiving output
  • forecasts/ -- contains scripts used for data assimilation mode
  • fvcore_layout.rc -- settings for dynamical core
  • gcm_emip.setup -- script to setup an EMIP experiment (do not run unless you know what you are doing!)
  • gcm_run.j -- run script
  • logging.yaml -- settings for MAPL logger
  • plot/ -- contains plotting job script template and .rc file
  • post/ -- contains the script template and .rc file for post-processing model output
  • regress/ -- contains scripts for doing regression testing of model
  • src -- directory with a tarball of the model version's source code

Environment Setting

Before running the model, there is some more setup to be completed. The run scripts need some environment variables set in ~/.cshrc (regardless of which login shell you use -- the GEOS scripts use csh). Here are the minimum contents of a .cshrc:

 umask 0022
 unlimit
 limit stacksize unlimited

The umask 0022 is not strictly necessary, but it makes your files readable by others, which facilitates data sharing and user support. Your home directory is also inaccessible to others by default; running chmod 755 ~ will open it up.

Get the Restart Files

Copy the restart (initial condition) files and associated cap_restart into EXPDIR. For the example from our setting up page, we chose c48. You can get an arbitrary set of restarts by copying the contents of the directory:

/discover/nobackup/mathomp4/Restarts-J10/nc4/Reynolds/c48-NLv3

containing c48 (roughly 2-degree) cubed-sphere restarts and their corresponding cap_restart, which contains:

20000414 210000

which says they are for 2000-04-14 at 21z.

NOTE: You should NOT use these for science as they are intended for testing. If you wish to create your own restarts, you can use remap_restarts.py. If you do that, you'll need to rename the resulting restarts and provide a cap_restart file.

The model requires the following restarts:

catch_internal_rst
fvcore_internal_rst
lake_internal_rst
landice_internal_rst
moist_internal_rst
openwater_internal_rst
pchem_internal_rst
seaicethermo_internal_rst

Everything else can be bootstrapped.
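A quick way to verify that all of the required restarts made it into your experiment directory is a small shell loop. This is only a sketch; run it from your experiment directory:

```shell
#!/bin/sh
# Report any of the required restarts that are missing from the
# current directory (run this from your experiment directory).
for rst in catch_internal_rst fvcore_internal_rst lake_internal_rst \
           landice_internal_rst moist_internal_rst openwater_internal_rst \
           pchem_internal_rst seaicethermo_internal_rst; do
    [ -f "$rst" ] || echo "missing: $rst"
done
```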

Restart names

When you run remap_restarts.py, you'll get files with names like:

C48c.fvcore_internal_rst.20000414_21z.nc4
C48c.moist_internal_rst.20000414_21z.nc4
...

where the first field (here, C48c) might be different and the datestamp might be for a different yyyymmdd_hhz as well. But, GEOSgcm expects restarts to be named like:

fvcore_internal_rst
moist_internal_rst
...

as specified in AGCM.rc:

DYN_INTERNAL_RESTART_FILE:              fvcore_internal_rst
...
MOIST_INTERNAL_RESTART_FILE:            moist_internal_rst
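The renaming can be done with a short shell loop. This is only a sketch, assuming the C48c prefix and 20000414_21z datestamp from the example above; adjust the glob to match your own prefix and datestamp:

```shell
#!/bin/sh
# Rename remapped restarts to the names AGCM.rc expects, e.g.
#   C48c.fvcore_internal_rst.20000414_21z.nc4 -> fvcore_internal_rst
for f in C48c.*_internal_rst.*.nc4; do
    name=${f#*.}        # drop the leading experiment tag ("C48c.")
    name=${name%%.*}    # drop the trailing ".20000414_21z.nc4" datestamp
    mv "$f" "$name"
done
```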

cap_restart

The cap_restart file has a single line containing the starting date for your experiment, in the format:

yyyymmdd hhmmss

which should be set to the date of your restarts.
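For the example restarts above (2000-04-14 at 21z), you could create cap_restart with:

```shell
# One line: start date and time of the restarts (yyyymmdd hhmmss).
echo "20000414 210000" > cap_restart
```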

CAP.rc

In CAP.rc you'll see many configuration settings, but only these are usually edited (here is an example):

END_DATE:     29990302 210000
JOB_SGMT:     00000015 000000
NUM_SGMT:     20
HEARTBEAT_DT:       450

In general, these four fields are:

  • END_DATE: Date to end the run (yyyymmdd hhmmss)
  • JOB_SGMT: How long each segment of the run is (yyyymmdd hhmmss)
  • NUM_SGMT: How many segments to run in this submission
  • HEARTBEAT_DT: The time step of the model in seconds

Without changes, gcm_run.j will run NUM_SGMT segments of JOB_SGMT length per batch submission and then resubmit itself until END_DATE is reached.

So, if you'd instead like to run for just one day, you can set NUM_SGMT: 1, JOB_SGMT: 00000001 000000, and END_DATE to the desired end date.
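For example, to run a single one-day segment starting from the 2000-04-14 21z restarts above, the relevant CAP.rc lines would look like this (the END_DATE here is illustrative):

END_DATE:     20000415 210000
JOB_SGMT:     00000001 000000
NUM_SGMT:     1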

NOTE: HEARTBEAT_DT is usually set by gcm_setup in the HEARTBEAT question. If you want to change the HEARTBEAT here, you will probably also need to change the DTs in AGCM.rc, such as:

CHEMISTRY_DT: 450
GOCART_DT: 450
HEMCO_DT: 450
GF_DT: 450
UW_DT: 450

as these times must be at least as long as HEARTBEAT_DT. They can in some cases be longer, but not shorter.

BEG_DATE

In CAP.rc you'll also see BEG_DATE. Do not touch this. GEOSgcm is fine with not having an "exact" beginning date of a run. Your cap_restart is what really tells the model when your restarts are good for and when to start.

gcm_run.j

The main script you will submit to run an experiment is gcm_run.j. This script is a template that is filled in by gcm_setup with the appropriate values for your experiment. You should not need to edit this script, but you may want to look at it to see what it does. The script is divided into sections, each of which is described below. NOTE: Your script might not exactly match this one in any code blocks presented here.

Batch Scheduler Directives

The first section of the script contains the batch scheduler directives. These are either SLURM at NCCS or PBS at NAS. For this example, we'll focus on NCCS and SLURM.

#######################################################################
#                     Batch Parameters for Run Job
#######################################################################

#SBATCH --time=12:00:00
#SBATCH --nodes=3 --ntasks-per-node=45
#SBATCH --job-name=test-c48_RUN
#SBATCH --constraint=cas
#SBATCH --account=s1873
#@BATCH_NAME -o gcm_run.o@RSTDATE

Walltime

The first line, #SBATCH --time=12:00:00, is the wallclock time requested for the job. This is the maximum amount of time the job will be allowed to run; the job will stop earlier once it finishes running NUM_SGMT segments of JOB_SGMT length. Twelve hours is often more than is needed, so it's best to do some test runs and lower this value, as shorter requests allow the scheduler to find nodes for you more quickly.

Nodes and Constraints

The second line, #SBATCH --nodes=3 --ntasks-per-node=45, specifies the number of nodes (3) and tasks per node (45) you would like. In this case, we chose Cascade Lake as our node type with #SBATCH --constraint=cas.

In total, you need to make sure that the number of nodes multiplied by the number of tasks per node is greater than or equal to the number of cores you need. For that, look at the NX and NY fields in AGCM.rc. For example, if you have:

NX: 4
NY: 24

then the total number of cores you need is 4*24 = 96. So, you need to make sure that you request enough resources to cover this. Here we are asking for 3 nodes with 45 tasks per node, which is 135 tasks total. 135 is greater than 96, so we are good.
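The arithmetic above can be sketched as a small shell check (the values are the ones from the example; substitute your own):

```shell
#!/bin/sh
# Sanity check: tasks requested must cover the NX*NY decomposition.
NX=4; NY=24                  # decomposition from AGCM.rc
NODES=3; TASKS_PER_NODE=45   # from the #SBATCH lines
echo "cores needed:    $((NX * NY))"
echo "tasks requested: $((NODES * TASKS_PER_NODE))"
```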

NOTE: At some higher resolutions, we use what is called an IOserver to handle IO for the model. This is a separate process that runs on a separate node or nodes, so in this case we ask for more nodes compared to what the model itself needs. You'll see this number in AGCM.rc in the IOSERVER_NODES field. If you aren't sure what to do, the recommendation is to make a new experiment with gcm_setup and ask it to enable the IOserver. It will then calculate the number of nodes needed for you.

Job Name

The next line, #SBATCH --job-name=test-c48_RUN, is the name of the job. This is what you see when running squeue -u $USER.

Account

The next line, #SBATCH --account=s1873, is the account to charge the job to. This is filled by gcm_setup with the account you specified when you ran it.

Odd Output Line

The last line in this section, #@BATCH_NAME -o gcm_run.o@RSTDATE, is used when running EMIPs which is outside the scope of this document. You can safely ignore it.

The flow of the gcm_run.j script

Beyond the SLURM area of the script, the rest of the script is divided into sections that a user will rarely need to edit. So what we will do here is describe the general flow of the gcm_run.j script. When you run sbatch gcm_run.j, these are the steps that will happen. Below you will see references to variables defined in CAP.rc as described above.

  1. Preliminary Setup
    1. Set up various environment variables
    2. Create experiment subdirectories
    3. Set various experiment run variables used by the script
    4. Create scratch directory and copy RC files from the RC/ directory in the experiment
    5. Create History collection directories
    6. Link Boundary Conditions into scratch/
    7. Process restarts (mainly filling variables so the script can track them)
  2. Perform multiple iterations of the model
    1. Set various time variables for this iteration
    2. Run the model for JOB_SGMT length
    3. Copy resulting checkpoints to the restarts/ directory and tar up
    4. Rename the resulting _checkpoint files to _rst
    5. Copy HISTORY output to holding directories
    6. Update cap_restart with the new start date of these restarts
    7. Run post-processing
    8. Update iteration counter
  3. Repeat Step 2 until NUM_SGMT segments have run
  4. Copy final restarts and cap_restart back to main experiment directory
  5. Resubmit the script to run the next batch of NUM_SGMT segments until END_DATE is reached
```mermaid
flowchart TD
    A(Submit gcm_run.j)-->B[Preliminary Setup]
    B-->I[Run the model for JOB_SGMT length]
    I-->J[Copy resulting checkpoints to restarts directory and tar up]
    J-->K[Rename the resulting _checkpoint files to _rst]
    K-->L[Copy HISTORY output to holding directories]
    L-->M[Update cap_restart with the new start date of these restarts]
    M-->N[Run post-processing]
    N-->O[Update iteration counter]
    O-->P{NUM_SGMT reached?}
    P -- No -->I
    P -- Yes -->Q[Copy final restarts and cap_restart back to main experiment directory]
    Q-->R{END_DATE reached?}
    R -- No -->A
    R -- Yes -->T(Exit)
```

Postprocessing

The postprocessing step above is handled by gcmpost.script and has a few stages:

  1. Sorts files within the ../holding/$stream directories into the appropriate YYYYMM directories
  2. Performs monthly means if the YYYYMM directories are complete (i.e., contain all required files)
  3. Spawns the archive job if the monthly means are successful
  4. Spawns the plot job if the desired seasons are complete (by default JJA and DJF, controlled by plot.rc)

Exiting script before postprocessing

Above we said you might want to run for a limited time (say one JOB_SGMT) for testing. One other edit that can be useful is in gcm_run.j. In that script you'll see:

  $RUN_CMD $TOTAL_PES $GEOSEXE $IOSERVER_OPTIONS $IOSERVER_EXTRA --logging_config 'logging.yaml'
...
if( -e EGRESS ) then
   set rc = 0
else
   set rc = -1
endif
echo GEOSgcm Run Status: $rc
if ( $rc == -1 ) exit -1

The first line is the actual run command for GEOSgcm.x; afterwards, the script looks for an EGRESS file, which GEOSgcm.x produces on successful completion.

If you are testing and only care about running GEOSgcm and none of the subsequent post processing, you can put an exit after this code to tell the script to just stop here.
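A sketch of where such an exit could go (your script's surrounding lines may differ slightly):

echo GEOSgcm Run Status: $rc
if ( $rc == -1 ) exit -1
exit  # added for testing: stop here, skipping post-processing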

HISTORY.rc

A very good and complete document describing the History Component and the structure of the HISTORY.rc file can be found here:

https://github.com/GEOS-ESM/MAPL/wiki/MAPL-History-Component

Submit the Job

The script you submit, gcm_run.j, should be ready to go as is, though you may want to verify that the wallclock time you requested is appropriate.

At NCCS, you submit the job with

sbatch gcm_run.j

You can keep track of it with the command:

squeue -u USERNAME

or follow stdout with:

tail -f slurm-JOBID.out

where JOBID is returned by the sbatch command and displayed by squeue.

Jobs can be killed with:

scancel JOBID

At NAS, you submit the job with qsub gcm_run.j. You can keep track of it with qstat, and jobs can be killed with qdel JOBID.

Running a Replay Experiment

If you would like to replay to MERRA2, you should open up AGCM.rc and find the 4 lines that start with #M2:

#M2 REPLAY_ANA_EXPID:   MERRA-2
#M2 REPLAY_ANA_LOCATION: /discover/nobackup/projects/gmao/merra2/data
#M2 REPLAY_MODE: Regular
#M2 REPLAY_FILE: ana/MERRA2_all/Y%y4/M%m2/MERRA2.ana.eta.%y4%m2%d2_%h2z.nc4

and remove the #M2:

    REPLAY_ANA_EXPID:   MERRA-2
    REPLAY_ANA_LOCATION: /discover/nobackup/projects/gmao/merra2/data
    REPLAY_MODE: Regular
    REPLAY_FILE: ana/MERRA2_all/Y%y4/M%m2/MERRA2.ana.eta.%y4%m2%d2_%h2z.nc4
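Removing the #M2 prefixes can also be done in one step with sed. This is a sketch that edits AGCM.rc in place and keeps a .bak backup; check the result by hand afterwards:

```shell
#!/bin/sh
# Strip the "#M2 " prefix from the MERRA-2 replay lines in AGCM.rc,
# leaving all other lines untouched. A backup is saved as AGCM.rc.bak.
sed -i.bak 's/^#M2 //' AGCM.rc
```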