
Suggested NCCS Resources

Matthew Thompson edited this page May 15, 2023 · 32 revisions

The NASA Center for Climate Simulation (NCCS) has a computing infrastructure that allows users to run applications (using multiple cores and GPUs), perform visualization, and store data. The main NCCS platform that new GMAO staff need to be familiar with is discover. This document briefly describes how to access discover, the initial setup procedures, and how to use the system to compile and run your application.

1 General Overview of discover

A description of discover is available at:

Using Discover

If you do not already have an account, please send an email to NCCS Support: support at nccs dot nasa dot gov

Per that page:

The Discover cluster is the main compute cluster for processing batch jobs requiring significant compute resources. It consists of several scalable compute units (SCUs) that offer a variety of processor types. There are a variety of nodes dedicated to batch computing and interactive data analysis.

2 Logging in to NCCS

As soon as you receive your credentials (USERNAME, password, token) for discover, you can access the platform from your workstation by issuing the command:

ssh -XY <USERNAME>@login.nccs.nasa.gov

Once you are connected, you will be asked to authenticate your access using RSA SecurID authentication:

PASSCODE: Enter your hardware or software token code here
host: discover
password: YOUR_NCCS_PASSWORD

We recommend that you check the Bastion Host webpage to verify that you properly completed this step.

NOTE: If you do not have an RSA token yet and only have a PIV card, use the suggested SSH config below, which sets up SSH to use a PIV card.

3 Initial Steps on discover

The webpage Logging-In & Passwords gives more details on the steps presented here.

3.1 Selecting your Shell

When you are connected to discover, you may want to select your default shell. The default is bash. To switch to a different default shell (csh, tcsh, ksh), contact support at nccs dot nasa dot gov.

3.2 Passwordless SSH/SCP between NCCS Systems

Users can ssh or scp between NCCS systems without typing their NCCS passwords by setting up authorized keys. This step is required to run applications.

From your home directory on discover, create a new key pair by typing:

mkdir -p $HOME/.ssh
chmod 0700 $HOME/.ssh
cd $HOME/.ssh
ssh-keygen

Hit the enter/return key two times for the prompted questions. This will create a pair of private and public identity files, id_rsa and id_rsa.pub, under the .ssh directory.

Copy the file id_rsa.pub into authorized_keys in the same directory:

cat id_rsa.pub >> authorized_keys

Copy the id_rsa.pub file from discover to dirac:

scp $HOME/.ssh/id_rsa.pub <USERNAME>@dirac.nccs.nasa.gov:~/.ssh/id_rsa.pub.discover

Log in to dirac:

ssh dirac

and from there, type:

cat $HOME/.ssh/id_rsa.pub.discover >> $HOME/.ssh/authorized_keys
exit
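Key-based login can fail if the key files are accessible to other users: OpenSSH servers with the default StrictModes setting may ignore an authorized_keys file that other users can write to. The sketch below tightens permissions on the files created above. fix_ssh_perms is a hypothetical helper name, not an NCCS tool, and the assumption that permissions are the cause of a failure is only one possibility.

```shell
#!/bin/sh
# Hypothetical helper: restrict an SSH directory and its key files to the owner.
# Usage: fix_ssh_perms [DIR]   (defaults to $HOME/.ssh)
fix_ssh_perms() {
    dir="${1:-$HOME/.ssh}"
    [ -d "$dir" ] || { echo "no such directory: $dir" >&2; return 1; }
    chmod 700 "$dir" || return 1
    # Only the owner should be able to read or write these files.
    for f in "$dir/authorized_keys" "$dir/id_rsa"; do
        [ -f "$f" ] && chmod 600 "$f"
    done
    # Show the resulting permissions for a quick visual check.
    ls -ld "$dir" "$dir/authorized_keys" 2>/dev/null
    return 0
}
```

You could run fix_ssh_perms on both discover and dirac, then confirm that no password is requested with ssh -o BatchMode=yes dirac true, which fails instead of prompting when key authentication does not work.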

3.3 Suggested .ssh/config for access to NCCS

Below are the recommended settings for .ssh/config on the system you use to access discover (i.e., the system you run ssh discover from).

Edit your local .ssh/config to have:

Host github.com
   ForwardX11 no

Host *
   ForwardX11 yes
   ForwardX11Trusted yes
   ForwardX11Timeout 500h
   ServerAliveInterval 30

Host login.nccs.nasa.gov
   User <USERNAME>
   ForwardX11 yes
   ForwardX11Trusted yes
   ForwardX11Timeout 500h
   ServerAliveInterval 30
   PKCS11Provider /usr/lib/ssh-keychain.dylib

host discover discover?? discover.nccs.nasa.gov dirac dirac.nccs.nasa.gov dataportal.nccs.nasa.gov adapt.nccs.nasa.gov
   User                <USERNAME>
   LogLevel            Quiet
   ProxyCommand        ssh -l <USERNAME> login.nccs.nasa.gov direct %h
   ForwardX11          yes
   ForwardX11Trusted   yes
   ForwardX11Timeout   500h
   Protocol            2
   ServerAliveInterval 30
   PKCS11Provider      /usr/lib/ssh-keychain.dylib

This config is equivalent to the "Direct Mode" of SSH access to NCCS discussed here.

You now have the initial settings needed to properly use discover to run your GEOS-related applications.

4 Other Important Resources

4.1 Using SLURM (Discover’s job scheduler)

NCCS provides SchedMD's Slurm resource manager for users to control their applications on discover. The SLURM tools allow users to schedule their jobs and request the computing resources (such as CPU time, memory, etc.) they need to execute their applications. Please refer to the documentation below for more information:

Using SLURM

The webpage Running Jobs on Discover using Slurm explains how to submit jobs through the queueing system or work in an interactive session (for better productivity and for quick access to the processor resources you need).
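As a rough illustration of the batch route, a job script might look like the sketch below. The #SBATCH lines use standard sbatch options, and the values (4 Cascade Lake nodes, 45 tasks per node, 3 hours, account t1234) simply mirror the interactive example in this section. The job name and the echo body are placeholders, not a GEOS run script.

```shell
#!/bin/bash
#SBATCH --job-name=example        # placeholder job name
#SBATCH --account=t1234           # example account, replace with your own
#SBATCH --constraint=cas          # Cascade Lake nodes
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=45
#SBATCH --time=03:00:00

# Slurm sets SLURM_JOB_ID and SLURM_JOB_NUM_NODES inside a job;
# the fallbacks only apply when the script is run outside Slurm.
msg="Job ${SLURM_JOB_ID:-none} running on ${SLURM_JOB_NUM_NODES:-0} node(s)"
echo "$msg"
```

Assuming the script is saved as job.sh (a hypothetical name), it would be submitted with sbatch job.sh and monitored with squeue -u $USER.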

Getting interactive nodes

We recommend using only the Skylake and Cascade Lake nodes at NCCS for doing work. You can get these nodes interactively with these commands:

  • Cascade Lake
    xalloc --constraint=cas --nodes=N --ntasks-per-node=45 --job-name=Interactive --time=HH:MM:SS --account=ACCOUNT
    
  • Skylake
    xalloc --constraint=sky --nodes=N --ntasks-per-node=40 --job-name=Interactive --time=HH:MM:SS --account=ACCOUNT
    

You will need to fill in the actual number of nodes (--nodes=N), the time (HH:MM:SS), and the account to run under (ACCOUNT). For example, 4 Cascade Lake nodes for 3 hours using account t1234 would be:

  xalloc --constraint=cas --nodes=4 --ntasks-per-node=45 --job-name=Interactive --time=03:00:00 --account=t1234

If you need a node quickly, you can often use the Debug QOS as it has a higher priority by adding:

--qos=debug

but you are then limited to one job and to 1 hour.

If you have access to other partitions and QOSs, you can specify them with --partition=PART --qos=QOS.
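The pattern above can be wrapped in a small helper so that the tasks-per-node value always matches the node type. This is only a sketch: interactive_cmd is a hypothetical function name, not an NCCS tool, and it prints the xalloc command rather than running it so that you can review it first.

```shell
#!/bin/sh
# Hypothetical helper: print the xalloc command for an interactive session.
# Usage: interactive_cmd TYPE NODES HH:MM:SS ACCOUNT [QOS]
interactive_cmd() {
    case "$1" in
        cas) tasks=45 ;;   # Cascade Lake, as recommended above
        sky) tasks=40 ;;   # Skylake, as recommended above
        *) echo "unknown node type: $1" >&2; return 1 ;;
    esac
    # The optional fifth argument adds --qos=QOS (for example, debug).
    echo "xalloc --constraint=$1 --nodes=$2 --ntasks-per-node=$tasks --job-name=Interactive --time=$3 --account=$4${5:+ --qos=$5}"
}

# Prints the same command as the 4-node Cascade Lake example above.
interactive_cmd cas 4 03:00:00 t1234
```

To run the printed command directly, you could wrap the call in eval, for example eval "$(interactive_cmd sky 1 01:00:00 t1234 debug)".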

4.2 File System & Storage

The home directory on any NCCS platform is quite small and is regularly backed up. We recommend that users keep only source code files in their home directories and avoid storing any files or data there that take up significant disk space. The NCCS file storage systems provide options to store files for short-term and/or long-term periods. On discover, we recommend using the NOBACKUP file system to compile and run your application.

Disk resources are limited. For any file system you use, it is important to know the maximum number of files you can have and the maximum amount of disk space you can use. The page Show Quota shows how to determine the quota in each file system.
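Because quotas cover both file count and disk space, it can help to check how much a given directory contributes to each. The sketch below uses only standard find and du; usage_summary is a hypothetical helper name, and it does not replace the NCCS quota tools described on the Show Quota page.

```shell
#!/bin/sh
# Hypothetical helper: report the file count and disk usage of a directory.
# Usage: usage_summary [DIR]   (defaults to the current directory)
usage_summary() {
    dir="${1:-.}"
    [ -d "$dir" ] || { echo "no such directory: $dir" >&2; return 1; }
    # Count regular files at any depth below DIR.
    files=$(find "$dir" -type f | wc -l | tr -d ' ')
    # Total disk usage in kilobytes.
    kb=$(du -sk "$dir" | cut -f1)
    echo "$dir: $files files, $kb KB"
}
```

For example, running usage_summary on a run directory in your NOBACKUP space shows how many files and how many kilobytes that one directory holds. On very large directory trees the scan itself can take a while.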

4.3 File Transfer

Running applications on NCCS platforms requires input data files and generates output files, which are stored at different locations. Depending on the need, you may have to transfer files from one storage location to another. The File transfer webpage lists the available options and describes how to transfer files.


Contact

For more information, you can contact either the NCCS Support at support_AT_nccs.nasa.gov or the SI Team at siteam_AT_gmao.gsfc.nasa.gov