Getting started on Della

Installing the software on Della

Logging in to Della

  1. Make sure you use the VPN

  2. Run

ssh <username>@della.princeton.edu

Type checkquota to check your disk quota.

Step 1: Installing micromamba

Disk space in your home directory is limited, so environments should be created in /scratch/gpfs.

Create a new folder in /scratch/gpfs

cd /scratch/gpfs
mkdir $USER

$USER is an environment variable that holds your user name (you do not have to replace it).
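A quick sanity check (my addition, not part of the original instructions): the first command should print your user name, the second should show the folder you just created.

echo $USER
ls -ld /scratch/gpfs/$USER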

Install micromamba

export MAMBA_ROOT_PREFIX=/scratch/gpfs/$USER/micromamba
curl -Ls https://micro.mamba.pm/api/micromamba/linux-64/latest | tar -xvj bin/micromamba
eval "$(./bin/micromamba shell hook -s posix)"
./bin/micromamba shell init -s bash -p /scratch/gpfs/$USER/micromamba/
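The eval line enables micromamba in your current shell; the shell init line writes the same setup into your ~/.bashrc so that future logins pick it up automatically. As a quick sanity check (my addition):

micromamba --version

This should print a version number.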

Let's check

echo $MAMBA_ROOT_PREFIX

should point to /scratch/gpfs/$USER/micromamba

Step 2: Installing our environment

cd
git clone https://github.com/gnn-tracking/gnn_tracking.git
cd gnn_tracking/environments
micromamba env create --name gnn --file default.yml
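Before running the tests in the next step, activate the new environment (assuming you kept the name gnn):

micromamba activate gnn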

Step 3: Testing the installation

cd ..
# you should be in the top dir of the gnn_tracking package now
pytest
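If the full test suite takes too long for a first check, standard pytest flags can narrow it down (a generic pytest invocation, nothing specific to this repository):

# stop at the first failure, with terse output
pytest -x -q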

Running a Jupyter notebook from mydella (the web GUI)

Preparation (first time only)

If you haven't already, install the software (see the first sections of this document). If you are using the global anaconda installation instead of micromamba, you can skip to the next section.

Some context for what we're doing: If we had used the global anaconda installation for our environment rather than micromamba, everything would have worked out of the box. For micromamba, however, we need a small hack so that the web GUI finds our installation: we have to modify our [`$PATH`](https://www.digitalocean.com/community/tutorials/how-to-view-and-update-the-linux-path-environment-variable) environment variable. From the web GUI, the only way to do this is to load a custom [environment module](https://modules.sourceforge.net/), so we have to write our own small module.
  1. Change directories to your scratch directory: cd /scratch/gpfs/$USER
  2. Create a new file called micromamba-module: touch micromamba-module
  3. Open an editor, e.g. with nano micromamba-module
  4. Add the following lines:
#%Module1.0
module-whatis   "Set PATH for my mamba env"
prepend-path    PATH "/scratch/gpfs/kl5675/micromamba/envs/gnn/bin/"

In the last line, replace kl5675 with your user name (assuming that you named your environment gnn).
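Alternatively, you can generate the file with a heredoc so that your user name is filled in automatically (a sketch assuming your environment is named gnn; $USER is expanded when the file is written):

cat > /scratch/gpfs/$USER/micromamba-module <<EOF
#%Module1.0
module-whatis   "Set PATH for my mamba env"
prepend-path    PATH "/scratch/gpfs/$USER/micromamba/envs/gnn/bin/"
EOF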

Let's double check a few things (lines starting with $ are input; the lines below them are output):

$ file /scratch/gpfs/kl5675/micromamba/envs/gnn/bin/python
/scratch/gpfs/kl5675/micromamba/envs/gnn/bin/python: symbolic link to python3.10

If the Python version is different, that's fine (but it should be ≥ 3.10).

Let's do another check: Type

$ module load $PWD/micromamba-module
$ which python
/scratch/gpfs/kl5675/micromamba/envs/gnn/bin/python

(module load requires the absolute path to the module file, hence we prefix the file name with $PWD, the current working directory). The output of the second command should point to the python binary in your micromamba environment.

Finally, let's make sure that jupyter lab is available.

micromamba activate gnn
jupyter lab

If you see jupyter lab starting (lots of colorful output), then it's already installed. You can quit it by hitting Ctrl-C a couple of times. If you instead only see a help message that ends with

Jupyter command jupyter-lab not found.

then do

micromamba install -c conda-forge jupyterlab

to install it.
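To confirm the installation without starting the full server, you can also just print the version (my addition):

jupyter lab --version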

OK, we're all set :)

Starting a notebook from the web GUI

  1. Connect to the Princeton VPN
  2. Go to https://mydella.princeton.edu/
  3. From the top bar, choose "Interactive apps" > "Jupyter"

You need to enter the following information:

  1. Number of hours: your choice. Your job will be killed once the time is up, but the fewer hours you request, the shorter your initial wait in the queue.
  2. Custom partition: default
  3. Node type: any
  4. Number of cores: depends on what you're doing
  5. Memory allocated: depends on what you're doing
  6. Anaconda 3 version: custom
  7. Custom environment module paths: /scratch/gpfs/kl5675/, where you replace kl5675 with your own user name
  8. Modules to load instead of the default: micromamba-module
  9. Extra SLURM options: --gres=gpu:1. If you require an 80GB GPU instead of a 40GB one, add --constraint=gpu80 (also see the last section of this document)
  10. How to handle conda environments from your home directory: Only use those conda envs that already have ipykernel installed
  11. Click 'Launch'

The status will first be "Queued", then change to "Starting", then "Running".

  1. Click the "Connect to Jupyter" button.
  2. Navigate to an existing notebook or click "New" > "Python3 (ipykernel)". Do not choose one of the anaconda options.
  3. You should now have a Jupyter notebook in your environment.

Let's check that everything is working: Type

# check if import is working
import gnn_tracking
# check that we have a GPU
import torch
# should show True
torch.cuda.is_available()

After you finish your calculations, go back to the della web portal and click "Delete" for the running Jupyter session.

Debugging

You can access logs for the SLURM job that was created under the hood from the web GUI:

[Screenshot: the session card in the web GUI with the "Session ID" link]

To do this, click on the link after "Session ID". You'll see a file browser. Click on output.log for the log. If this file doesn't exist yet, your job probably hasn't started (it's still queued). Whenever something goes wrong, you should copy this log and attach it to your report.

After you leave the "Interactive app" page, you will no longer find this link. However, the files are in your home directory on della. For example at /home/$USER/ondemand/data/sys/dashboard/batch_connect/sys/jupyter/output/dd4824c3-866d-4cbe-b502-a004d610bfcc (the hashes will be different).

Please also include the files user_defined_context.json and job_script_options.json in your report.
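A small convenience sketch for collecting these three files from the most recent session (it assumes the directory layout shown above; adjust the destination to taste):

cd /home/$USER/ondemand/data/sys/dashboard/batch_connect/sys/jupyter/output
latest=$(ls -t | head -n 1)   # most recently modified session directory
cp "$latest"/output.log "$latest"/user_defined_context.json "$latest"/job_script_options.json ~/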

This is how the log looks if the session starts successfully:
Script starting...
Waiting for Jupyter server to open port 19096...
Starting main script...
TTT - Sat Jul  1 20:47:24 EDT 2023
Creating env launcher wrapper script...
Creating launcher wrapper script...
TTT - Sat Jul  1 20:47:24 EDT 2023
Creating custom Jupyter kernels...
cp: cannot stat '/home/kl5675/ondemand/data/sys/dashboard/batch_connect/sys/jupyter/output/65d258e3-ad5c-4d45-9ef5-b433fc3934dd/assets/python_/*': No such file or directory
TTT - Sat Jul  1 20:47:24 EDT 2023
Creating custom Jupyter kernels from user-created Conda environments...
Creating kernel for /home/kl5675/.conda/envs/*/...

EnvironmentLocationNotFound: Not a conda environment: /home/kl5675/.conda/envs/*

TTT - Sat Jul  1 20:47:24 EDT 2023
Creating custom Jupyter kernels from local anaconda installations...
Currently Loaded Modulefiles:
 1) micromamba-module
TTT - Sat Jul  1 20:47:24 EDT 2023
+ jupyter kernelspec list
Available kernels:
  sys_python27      /home/kl5675/ondemand/data/sys/dashboard/batch_connect/sys/jupyter/output/65d258e3-ad5c-4d45-9ef5-b433fc3934dd/share/jupyter/kernels/sys_python27
  sys_python36      /home/kl5675/ondemand/data/sys/dashboard/batch_connect/sys/jupyter/output/65d258e3-ad5c-4d45-9ef5-b433fc3934dd/share/jupyter/kernels/sys_python36
  sys_python37      /home/kl5675/ondemand/data/sys/dashboard/batch_connect/sys/jupyter/output/65d258e3-ad5c-4d45-9ef5-b433fc3934dd/share/jupyter/kernels/sys_python37
  sys_python37_2    /home/kl5675/ondemand/data/sys/dashboard/batch_connect/sys/jupyter/output/65d258e3-ad5c-4d45-9ef5-b433fc3934dd/share/jupyter/kernels/sys_python37_2
  sys_python38      /home/kl5675/ondemand/data/sys/dashboard/batch_connect/sys/jupyter/output/65d258e3-ad5c-4d45-9ef5-b433fc3934dd/share/jupyter/kernels/sys_python38
  python3           /scratch/gpfs/kl5675/micromamba/envs/gnn/share/jupyter/kernels/python3
TTT - Sat Jul  1 20:47:28 EDT 2023
+ jupyter notebook --config=/home/kl5675/ondemand/data/sys/dashboard/batch_connect/sys/jupyter/output/65d258e3-ad5c-4d45-9ef5-b433fc3934dd/config.py
[I 20:47:30.006 NotebookApp] Serving notebooks from local directory: /
[I 20:47:30.006 NotebookApp] Jupyter Notebook 6.5.4 is running at:
[I 20:47:30.006 NotebookApp] http://della-l09g2:19096/node/della-l09g2/19096/
[I 20:47:30.006 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
Discovered Jupyter server listening on port 19096!
Generating connection YAML file...
[I 20:47:46.288 NotebookApp] 302 POST /node/della-l09g2/19096/login (172.17.2.9) 1.220000ms
[I 20:47:46.401 NotebookApp] 302 GET /node/della-l09g2/19096/ (172.17.2.9) 0.460000ms
[W 20:48:56.752 NotebookApp] Notebook home/kl5675/Documents/23/git_sync/tutorials/notebooks/009_build_graphs_ml.ipynb is not trusted
[I 20:48:57.493 NotebookApp] Kernel started: 10d06d3a-1927-423f-ba19-4f5e27036228, name: python3

Starting a Jupyter notebook from the CLI (advanced)

Check out the simpler web GUI version above first.

  1. [your machine] Connect to the Princeton VPN
  2. Think of a random number between 6000 and 9000. We’ll use 8945 (but yours MUST be different; see the snippet after this list for one way to pick one)
  3. [your machine] Log in to della-gpu: ssh -L 8945:localhost:8945 <username>@della-gpu.princeton.edu
  4. [della-gpu] Start tmux by typing tmux
  5. [della-gpu] Split window in two with Ctrl-B %
  6. [della-gpu] In the left window, allocate resources with salloc --nodes=1 --ntasks=1 --time=01:00:00 --cpus-per-task=1 ; this is going to log you into a compute node. It’s going to tell you its name, e.g., della-r4c4n13 (we’ll need that later)
  7. [della-r4c4n13] micromamba activate gnn
  8. [della-r4c4n13] jupyter notebook --port 8945 --no-browser ; double-check that the URLs that are displayed contain localhost:8945
  9. [della-gpu] in the right window, type ssh -N -L 8945:localhost:8945 della-r4c4n13 (the name from before)
  10. [your machine] Open your browser and copy the link that is shown in step 8
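For step 2, one way to pick the random port number (a convenience sketch, not part of the original instructions):

shuf -i 6000-9000 -n 1

Use the printed number in place of 8945 throughout.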


Requesting resources with SLURM

The Princeton Research Computing page on Della has all the information on the cluster.

In particular, you will need to modify the salloc command above as follows. To get a node with either a 40GB or an 80GB GPU:

salloc --nodes=1 --ntasks=1 --time=01:00:00 --cpus-per-task=1 --gres=gpu:1

To get a node with an 80GB GPU:

salloc --nodes=1 --ntasks=1 --time=01:00:00 --cpus-per-task=1 --gres=gpu:1 --constraint=gpu80
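Once the allocation is granted and you are on the compute node, you can confirm that a GPU is visible with the standard NVIDIA tool (my addition):

nvidia-smi

It should list one GPU, with 40GB or 80GB of memory depending on the constraint.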