
Suggested NCCS Resources

Matt Thompson edited this page Jul 18, 2024 · 32 revisions

The NASA Center for Climate Simulation (NCCS) provides a computing infrastructure that allows users to run applications (using multiple cores and GPUs), perform visualization, and store data. The main NCCS platform that new GMAO staff need to be familiar with is discover. This document briefly describes how to access discover, the initial setup procedures, and how to use the system to compile and run your application.

1 General Overview of discover

A description of discover is available at:

Using Discover

If you do not already have an account, please send an email to NCCS Support: support at nccs dot nasa dot gov

Per that page:

The Discover cluster is the main compute cluster for processing batch jobs requiring significant compute resources. It consists of several scalable compute units (SCUs) that offer a variety of processor types. There are a variety of nodes dedicated to batch computing and interactive data analysis.

2 Logging in to NCCS

As soon as you receive your credentials (USERNAME, password, token) for discover, you can access the platform from your workstation by issuing the command:

ssh -XY <USERNAME>@login.nccs.nasa.gov

Once you are connected, you will be asked to authenticate your access using RSA SecurID authentication:

PASSCODE: Enter your hardware or software token code here
host: discover
password: YOUR_NCCS_PASSWORD

2.1 Suggested .ssh/config for access to NCCS

Below are the recommended settings for .ssh/config on the system you use to access discover (i.e., the system you ran ssh login.nccs.nasa.gov from above).

Edit your local .ssh/config (or create the file, if you don't have one) so that it contains the following, substituting your NCCS AUID for <USERNAME>:

Host github.com
   ForwardX11 no

Host *
   ForwardX11 yes
   ForwardX11Trusted yes
   ForwardX11Timeout 500h
   ServerAliveInterval 30

Host login.nccs.nasa.gov
   User <USERNAME>
   ForwardX11 yes
   ForwardX11Trusted yes
   ForwardX11Timeout 500h
   ServerAliveInterval 30
   PKCS11Provider /usr/lib/ssh-keychain.dylib

host discover discover?? discover-mil discover.nccs.nasa.gov dirac dirac.nccs.nasa.gov dataportal.nccs.nasa.gov adapt.nccs.nasa.gov
   User                <USERNAME>
   LogLevel            Quiet
   ProxyCommand        ssh -l <USERNAME> login.nccs.nasa.gov direct %h
   ForwardX11          yes
   ForwardX11Trusted   yes
   ForwardX11Timeout   500h
   Protocol            2
   ServerAliveInterval 30

This config is equivalent to the "PIV SSH" style SSH access to NCCS discussed here. If all works, you should no longer need to log in to login.nccs.nasa.gov first and can simply run ssh discover.

NOTE: If you do not have a PIV card and only have an RSA token, REMOVE the PKCS11Provider line above, because it is only for PIV access.
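
Before connecting, it can help to confirm how OpenSSH will interpret these settings. The sketch below relies on ssh -G, which prints the resolved options for a host and exits without opening a connection. The helper name show_ssh_settings is hypothetical, and its optional second argument (an alternate config file) is an assumption added so the check can be tried on a copy of the file.

```shell
# Print the user, hostname and proxy command that ssh would use for a host.
# Usage: show_ssh_settings HOST [CONFIG_FILE]
# show_ssh_settings is a hypothetical helper name, not an NCCS tool.
show_ssh_settings() {
   local host="$1"
   local config="${2:-$HOME/.ssh/config}"
   # -G prints the resolved configuration and exits without connecting
   ssh -F "$config" -G "$host" | grep -E -i '^(user|hostname|proxycommand) '
}
```

Running show_ssh_settings discover on your workstation should list your NCCS username on the user line and a proxycommand line that goes through login.nccs.nasa.gov.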

3 Initial Steps on discover

The webpage Logging-In & Passwords gives more details on the steps presented here.

3.1 Selecting your Shell

When you are connected to discover, you may want to select your default shell; bash is the default. To switch to a different default shell (csh, tcsh, ksh), contact support at nccs dot nasa dot gov.

3.1.1 Shell configuration

We recommend that users configure their shell startup files as shown below.

Recommended .bashrc for NCCS
umask 0022
ulimit -s unlimited

# Look for the OS version and set the module path accordingly
OS_VERSION=$(grep VERSION_ID /etc/os-release | cut -d= -f2 | cut -d. -f1 | sed 's/"//g')

# Run things in this if-block only if we're in an interactive shell
if [[ $- == *i* ]]
then

   # Only put module use or other module commands here
   # and in the correct OS version
   if [[ "$OS_VERSION" == "15" ]]
   then
      export LMOD_SYSTEM_NAME=SLES15
      module purge
      module unuse -a /discover/swdev/gmao_SIteam/modulefiles-SLES12
      module use -a /discover/swdev/gmao_SIteam/modulefiles-SLES15
      module load GEOSenv
   else
      export LMOD_SYSTEM_NAME=SLES12
      module purge
      module unuse -a /discover/swdev/gmao_SIteam/modulefiles-SLES15
      module use -a /discover/swdev/gmao_SIteam/modulefiles-SLES12
      module load GEOSenv
   fi

   # Add any other things you want with interactive shells here

fi

Recommended .tcshrc for NCCS
umask 0022
limit stacksize unlimited

# Look for the OS version and set the module path accordingly
set OS_VERSION=`grep VERSION_ID /etc/os-release | cut -d= -f2 | cut -d. -f1 | sed 's/"//g'`

# Run things in this if-block only if we are in an interactive shell
if ($?prompt) then

   # Only put module use or other module commands here
   # and in the correct OS version
   if ($OS_VERSION == 15) then
      setenv LMOD_SYSTEM_NAME SLES15
      module purge
      module unuse -a /discover/swdev/gmao_SIteam/modulefiles-SLES12
      module use -a /discover/swdev/gmao_SIteam/modulefiles-SLES15
      module load GEOSenv
   else
      setenv LMOD_SYSTEM_NAME SLES12
      module purge
      module unuse -a /discover/swdev/gmao_SIteam/modulefiles-SLES15
      module use -a /discover/swdev/gmao_SIteam/modulefiles-SLES12
      module load GEOSenv
   endif

   # Add any other things you want with interactive shells here

endif
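
The OS detection line in both startup files reduces the VERSION_ID entry of /etc/os-release to its major number. The sketch below wraps the same pipeline in a function so it can be tried against a sample file. The function name os_major_version and its optional file argument are assumptions made for illustration.

```shell
# Print the major OS version found in an os-release style file.
# Usage: os_major_version [FILE]   (FILE defaults to /etc/os-release)
# os_major_version is a hypothetical name; the pipeline is the one used above.
os_major_version() {
   local file="${1:-/etc/os-release}"
   # Keep the VERSION_ID line, take the value after "=", keep the part
   # before the first ".", then strip the double quotes
   grep VERSION_ID "$file" | cut -d= -f2 | cut -d. -f1 | sed 's/"//g'
}
```

For a file containing VERSION_ID="15.4" the function prints 15, which is the value the startup files compare against to choose between the SLES15 and SLES12 module directories.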

3.2 Passwordless SSH/SCP between NCCS Systems

Users can ssh or scp between NCCS systems without typing their NCCS passwords by setting up authorized keys. This step is required to run applications.

From your home directory on discover, create a new key pair by typing:

mkdir -p $HOME/.ssh
chmod 0700 $HOME/.ssh
cd $HOME/.ssh
ssh-keygen

Press the enter/return key at each prompt to accept the defaults. This will create a pair of private and public identity files, id_rsa and id_rsa.pub, under the .ssh directory.

Copy the file id_rsa.pub into authorized_keys in the same directory:

cat id_rsa.pub >> authorized_keys

NOTE: If you have an id_ed25519.pub key instead, use that file in place of id_rsa.pub. Either key type will work.
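
Because >> appends every time it is run, repeating the step above adds duplicate entries. A minimal sketch of a repeatable version follows. The helper name add_key_once is hypothetical, and the chmod 600 reflects the usual OpenSSH expectation that authorized_keys is not writable by others rather than a documented NCCS requirement.

```shell
# Append a public key to an authorized_keys file only if it is not there yet.
# Usage: add_key_once PUBLIC_KEY_FILE [AUTHORIZED_KEYS_FILE]
# add_key_once is a hypothetical helper name.
add_key_once() {
   local pub="$1"
   local auth="${2:-$HOME/.ssh/authorized_keys}"
   touch "$auth"
   # -F: fixed string, -x: match whole lines, -f: read the pattern from the key file
   if ! grep -qxFf "$pub" "$auth"; then
      cat "$pub" >> "$auth"
   fi
   chmod 600 "$auth"
}
```

On discover this could be called as add_key_once $HOME/.ssh/id_rsa.pub (or with id_ed25519.pub). Calling it a second time leaves the file unchanged.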

Passwordless access to dirac (if you have dirac account)

NOTE: This section is only needed if you have an account on dirac. If you do not, you can skip it.

Copy the id_rsa.pub file from discover to dirac:

ssh <USERNAME>@dirac.nccs.nasa.gov 'mkdir -p ~/.ssh && chmod 700 ~/.ssh'
scp $HOME/.ssh/id_rsa.pub <USERNAME>@dirac.nccs.nasa.gov:~/.ssh/id_rsa.pub.discover

Log in to dirac:

ssh dirac

and from there, type:

cat $HOME/.ssh/id_rsa.pub.discover >> $HOME/.ssh/authorized_keys
exit

3.3 Suggested .ssh/config for NCCS

Below are the recommended settings for .ssh/config on discover:

Host github.com
   ForwardX11 no

Host *
   ForwardX11 yes
   ForwardX11Trusted yes
   ForwardX11Timeout 500h
   ServerAliveInterval 30

You now have the initial settings needed to properly use discover to run your GEOS-related applications.

4 Other Important Resources

4.1 Using SLURM (Discover’s job scheduler)

NCCS provides SchedMD's Slurm resource manager for users to control their applications on discover. The SLURM tools allow users to schedule their jobs and request the computing resources (such as CPU time and memory) they need to execute their applications. Please refer to the documentation below for more information:

Using SLURM

The webpage Running Jobs on Discover using Slurm explains how to submit jobs through the queueing system or work in an interactive session (for better productivity and for quick access to the processor resources you need).
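
For batch work, resource requests are written as #SBATCH directives at the top of a script. The following is a minimal sketch, not an NCCS-provided template: the file name hello_job.sh, the job name, the one-node, ten-minute request, and the account t1234 are all placeholders to replace with your own values.

```shell
# Write a small batch script; every value in the #SBATCH lines is a placeholder
cat > hello_job.sh << 'EOF'
#!/bin/bash
#SBATCH --job-name=hello
#SBATCH --nodes=1
#SBATCH --time=00:10:00
#SBATCH --account=t1234
#SBATCH --output=hello.%j.out

# Report which compute node(s) the job landed on
srun hostname
EOF

# Submit it from discover with:
#   sbatch hello_job.sh
# and check on it with:
#   squeue -u $USER
```

The %j in the output file name is replaced by Slurm with the job ID, so repeated submissions do not overwrite each other's output.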

Getting interactive nodes

We recommend using only the Milan and Cascade Lake nodes at NCCS for doing work. You can get these nodes interactively with these commands:

  • Cascade Lake (SLES12)
    salloc --x11 --constraint=cas --nodes=N --job-name=Interactive --time=HH:MM:SS --account=ACCOUNT
    
  • Milan (SLES15)
    salloc --x11 --constraint=mil --nodes=N --job-name=Interactive --time=HH:MM:SS --account=ACCOUNT
    

You will need to fill in the actual number of nodes (--nodes=N), the time (HH:MM:SS), and the account to run under (ACCOUNT). For example, requesting 4 Milan nodes for 3 hours using account t1234 would be:

  salloc --x11 --constraint=mil --nodes=4 --job-name=Interactive --time=03:00:00 --account=t1234

If you need a node quickly, you can often use the debug QOS, which has a higher priority, by adding:

--qos=debug

but you are then limited to one job, and it can run for at most 1 hour.

If you have access to other partitions and QOSs, you can specify them with --partition=PART --qos=QOS.
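
Since the Cascade Lake and Milan commands differ only in the constraint, they are easy to wrap. The sketch below defines a hypothetical helper, interactive_cmd, that only prints the salloc command so it can be checked before it is run. The argument order and the optional QOS argument are assumptions of this example.

```shell
# Print (do not run) an salloc command for an interactive session.
# Usage: interactive_cmd CONSTRAINT NODES HH:MM:SS ACCOUNT [QOS]
# CONSTRAINT is cas (Cascade Lake) or mil (Milan), as in the commands above.
# interactive_cmd is a hypothetical helper name.
interactive_cmd() {
   local constraint="$1" nodes="$2" walltime="$3" account="$4" qos="${5:-}"
   local cmd="salloc --x11 --constraint=$constraint --nodes=$nodes"
   cmd="$cmd --job-name=Interactive --time=$walltime --account=$account"
   # Append a QOS such as debug only when one was requested
   if [ -n "$qos" ]; then
      cmd="$cmd --qos=$qos"
   fi
   echo "$cmd"
}
```

For instance, interactive_cmd mil 4 03:00:00 t1234 prints the Milan example command shown above, and adding debug as a fifth argument appends --qos=debug.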

X Forwarding

Note that the use of --x11 above forwards X11 to your local machine. If you do not want this, remove the --x11 option.

4.2 File System & Storage

The home directory on any NCCS platform is quite small and is regularly backed up. We recommend that users keep only source code files in their home directories and avoid storing files and data there that take up significant disk space. NCCS has a file storage system that provides options to store files for short-term and/or long-term periods. On discover, we recommend using the NOBACKUP file system to compile and run your application.

Disk resources are not unlimited. It is important to be aware of which file system you are using and to know the maximum number of files you can have and the maximum amount of disk space you can use. The page Show Quota shows how to determine the quota on each file system.
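
Alongside the quota tools, ordinary commands give a quick view of how much a single directory tree holds. A minimal sketch follows. The helper name dir_usage is hypothetical, and it does not replace the quota report, which remains the authoritative source for your limits.

```shell
# Print the number of regular files and the total size under a directory.
# Usage: dir_usage [DIRECTORY]   (defaults to the current directory)
# dir_usage is a hypothetical helper name.
dir_usage() {
   local dir="${1:-.}"
   local nfiles size
   # Count regular files only; directories and links are not included
   nfiles=$(find "$dir" -type f | wc -l | tr -d ' ')
   # du -sh prints "SIZE<tab>PATH"; keep only the size
   size=$(du -sh "$dir" | cut -f1)
   echo "files=$nfiles size=$size"
}
```

Running it on a large tree can take a while because every file is visited, so it is best pointed at a specific experiment or build directory rather than a whole file system.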

4.3 File Transfer

Running applications on NCCS platforms requires input data files and generates output files that are stored at different locations. Depending on the need, files must be transferred from one storage location to another. The File transfer webpage lists the available options and describes how to transfer files.


Contact

For more information, you can contact either NCCS Support at support_AT_nccs.nasa.gov or the SI Team at siteam_AT_gmao.gsfc.nasa.gov
