The McKelvey Engineering Compute Cluster

Getting Started

To request an account on the ENGR cluster, email support@seas.wustl.edu, and let us know which PI you are working with.

Special Notes

There are a handful of special cases you should be aware of when using the ENGR cluster - please keep the below in mind no matter what sort of jobs you run.

RIS Storage

Accessing RIS storage from the ENGR cluster is done at the path:

/storage1/piname/

where “piname” is the WUSTL Key of the owner of the storage. Access is granted via WUSTL Key - on the ENGR cluster, this translates to having a valid Kerberos ticket. If you’ve SSH’d into an ENGR host with your WUSTL Key, you will have a valid ticket. If you have logged in with an SSH key, you will not.

To generate or refresh a Kerberos ticket, use the command

kinit

and enter your WUSTL Key password when prompted. Kerberos tickets on the ENGR cluster expire after a limited time.

If you habitually leave live connections to the terminal machines open, get into the habit of running kinit to refresh your ticket before submitting a job.
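
A quick way to check whether your ticket is still valid is the standard klist utility (assumed to be available alongside kinit), which lists your current tickets and their expiration times:

klist

If klist reports no credentials or an expired ticket, run kinit again before submitting your job.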

Long Jobs With RIS Storage

If you are using RIS storage with your job, and the ticket expires, this will break the job. In order to avoid this, you can generate a keytab file that allows the cluster to renew your Kerberos ticket for much longer - this keytab will last until you change your WUSTL Key password, at which point it must be regenerated.

Generating a Keytab

First, please email support@seas.wustl.edu and request that EIT provision you a secure home directory.

Once we have let you know it is provisioned, from ssh.seas.wustl.edu, execute the command

/project/compute/bin/create_keytab.sh

and follow the prompts. The script will tell you exactly what to type in - feel free to cut and paste each line.

After that, new jobs started will use that keytab to refresh the Kerberos ticket indefinitely - until you change your WUSTL Key password.

When your WUSTL Key password changes, you must log back into ssh.seas.wustl.edu and rerun the script.

VNC Sessions and RIS Storage

At the current time, VNC jobs do not inherit a Kerberos ticket from your initial login to the web service. If you have created a keytab as described above, it will initialize when the VNC session starts. Otherwise, if you wish to access RIS storage from inside a VNC session, run

kinit

You will be prompted for your WUSTL Key password. You should also do this before submitting an LSF job from inside a VNC session. After you authenticate, you will have an active Kerberos ticket and will be able to access the RIS storage mounts.

OpenOnDemand

The OpenOnDemand interface is at https://compute.engr.wustl.edu - log in with your WUSTL Key, using Duo 2FA.

Files

OpenOnDemand provides a file browser interface. Please note that at this time, the file browser cannot access RIS storage.

Within the file browser you can upload, download, and move files. You can also edit plain text files in the browser, or open the current location in a web-based terminal.

VNC

There are several VNC sessions available. Many are marked with a specific PI’s name, and are only accessible to users within that PI’s lab.

Cluster Desktop - Virtual GL

This will start on one of three GPU hosts with older cards, specifically set up so you can tie your VNC session to a GPU and have GUI applications render correctly using that GPU.

When you start your GUI application, prepend the command with ‘virtualgl’ - for example

virtualgl glxgears

Jupyter Notebooks

Custom iPython Kernels

The Jupyter notebooks start with the same Anaconda installation you get by executing ‘module add seas-anaconda3’ in a terminal. From there you can build a custom Anaconda environment.

It’s not recommended to have Anaconda add itself to your .bashrc if you use VNC, as it interferes with the VNC environment’s ability to start.

Inside the new Anaconda environment, you can then execute

ipython kernel install --user --name=envname

Start a new Jupyter session, and you can then find that kernel from the ‘New’ dropdown within Jupyter.
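
Putting the steps together, a minimal sketch of the whole workflow - the environment name myenv and the package list are placeholders for whatever your project actually needs, and depending on how the module configures your shell you may need ‘source activate myenv’ instead of ‘conda activate myenv’:

module add seas-anaconda3
conda create -n myenv python numpy
conda activate myenv
conda install ipykernel
ipython kernel install --user --name=myenv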

VSCode

VSCode starts a Visual Studio Code interface in your browser.

The LSF Scheduler

Interactive Jobs

Interactive jobs come in several flavors.

You may also start interactive shell jobs. If you intend to have a long-running job, we suggest SSHing (or using the Clusters menu above for Shell Access) to ssh7.seas.wustl.edu and starting the screen command, which will place you in a virtual terminal session that continues to run after you disconnect.

If you do use screen, press CTRL-A followed by D to detach from the screen and let things continue in the background. To reconnect later, reconnect to ssh7.seas.wustl.edu and use the command screen -r. If you have multiple screens, it will list what is running.

You can start your screen with screen -R screenname, replacing screenname with a descriptive name you can use to reconnect later with screen -r screenname.

Another good utility for this is tmux.
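
If you prefer tmux, the equivalent workflow looks roughly like this - mysession is just an example name:

tmux new -s mysession

Detach with CTRL-B followed by D; later, reconnect from ssh7.seas.wustl.edu with:

tmux attach -t mysession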

You can start an interactive shell with the command:

bsub -q interactive -XF -Is /bin/bash

Generally, most options to bsub will work with interactive jobs, such as requesting a GPU:

bsub -gpu "num=2:mode=exclusive_process:gmodel=TeslaK40c" -q interactive -Is /bin/bash

The -Is /bin/bash must be the last item on the command line.

Note that your job submission will sit and wait if resources are not available!

Viewing Job Queues

The bqueues command gives you basic information about the status of the job queues in the system. Job queues are groups of nodes that are tasked to run certain types of jobs - nodes can be in more than one queue. Most queues are for special types of jobs, or for nodes that are dedicated to a certain research group.

  [seasuser@ssh ~]$ bqueues
  QUEUE_NAME      PRIO STATUS          MAX JL/U JL/P JL/H NJOBS  PEND   RUN  SUSP
  admin            50  Open:Active       -    -    -    -     0     0     0     0
  dataq            33  Open:Active       -    -    -    -     0     0     0     0
  normal           30  Open:Inact        -    -    -    -     0     0     0     0
  interactive      30  Open:Active       -    -    -    -     0     0     0     0
  SEAS-Lab-PhD      1  Open:Active       -    -    -    -     0     0     0     0

Viewing Cluster Nodes

bhosts gives information about the status of individual nodes in the cluster.

[seasuser@ssh ~]$ bhosts
HOST_NAME          STATUS       JL/U    MAX  NJOBS    RUN  SSUSP  USUSP    RSV
gnode01.seas.wustl ok              -      1      0      0      0      0      0
gnode02.seas.wustl ok              -     16      0      0      0      0      0
node01.seas.wustl. ok              -      1      0      0      0      0      0
node02.seas.wustl. ok              -      1      0      0      0      0      0


Viewing Running Jobs

bjobs gives information on running jobs.

[seasuser@ssh lsf]$ bjobs
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
123     seasuse RUN   SEAS-CPU   ssh.seas.wu node01.seas CPU-Test   Jan 01 12:00


More information on bjobs options is here.


More information on bqueues is here, and bhosts is here.

Basic Batch Jobs

Basic LSF Job Submission

bsub submits job script files.

A simple example of a job script file looks like so:

#BSUB -o cpu_test.%J
#BSUB -R '(!gpu)'
#BSUB -N
#BSUB -J PythonJob
#BSUB -R "rusage[mem=25]"
python mycode.py


The lines beginning with “#” are often referred to as pragmas - directives embedded in the script that pass options to the LSF scheduler.

This file submits a Python script to be run, with the output going to a file called cpu_test.XXX (-o cpu_test.%J, where %J becomes the job number). When the job is done, I’m mailed a notification (-N), and when I check status with bjobs, the job name is PythonJob (-J PythonJob).

The first -R option selects a machine type. At this time there are no specific types to request, however, this example requests that the scheduler avoid a GPU node – if you are not using GPU resources, it’s good cluster citizenship to avoid those nodes.

The last -R option requests an amount of RAM in GB your job will use. It’s good manners to do this on shared systems.

More information on bsub options is here.

To submit the job, if you have saved the above example as the file “cpujob”:

[seasuser@ssh lsf]$ bsub < cpujob
Job <123> is submitted to queue default

shows the job is submitted successfully.

Requesting Multiple CPUs

You can request that a job be assigned more than one CPU. It is imperative that the application you are running, or the code you are executing, respect that setting. To request multiple CPUs, add

#BSUB -n 4
#BSUB -R "span[hosts=1]"

to your script to request 4 CPUs as an example. Most hosts in the ENGR cluster have at least 16 cores, some up to 36 - the second line tells LSF to make sure all the CPUs requested are on the same host. To use CPUs on multiple hosts, your application/program must support MPI, addressed in another article.

Your application or code must be told to use the number of CPUs asked for. LSF sets an environment variable called LSB_DJOB_NUMPROC that matches the number of CPUs you requested. The method for doing this varies widely.

R, for example, can have this in its scripts:

cores <- as.numeric(Sys.getenv('LSB_DJOB_NUMPROC'))

to read that variable as the number of cores it can use. In Python, os.environ['LSB_DJOB_NUMPROC'] could be passed to the proper variable or function, depending on the code in use.
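
If your program instead takes a thread or worker count on the command line, you can simply pass the variable through in the job script - a minimal sketch, where the --threads flag and script name are placeholders for your own program’s interface:

#BSUB -n 4
#BSUB -R "span[hosts=1]"
python mycode.py --threads $LSB_DJOB_NUMPROC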

Avoiding GPUs

If you are not using GPUs, it is good manners to avoid those nodes. You can do that like so:

#BSUB -R '(!gpu)'


Requesting Memory

You should prerequest the amount of RAM (in GB) your job expects to need – it’s good manners on shared systems. Do this with the pragma:

#BSUB -R "rusage[mem=25]"

This example requests 25GB of system memory. Your job won’t start until there’s a node capable of giving you that much RAM.

Requesting GPUs

The SEAS Compute cluster has support for GPU computing. Requesting a GPU requires you to choose the queue you wish to submit to – many GPUs are within faculty-owned devices and have limited availability.

A simple example of a GPU job script file looks like so:

#BSUB -o mygpucode_out.%J
#BSUB -R "select[type==any]"
#BSUB -gpu "num=1:mode=exclusive_process:gmodel=TeslaK40c"
#BSUB -N
#BSUB -J PythonGPUJob
python mygpucode.py

This file submits a Python script with the output going to a file called mygpucode_out.XXX (-o mygpucode_out.%J, where %J becomes the job number). When the job is done, I’m mailed a notification (-N), and when I check status with bjobs, the job name is PythonGPUJob (-J PythonGPUJob). I’ve further requested a single Tesla K40c to run the job against (-gpu “num=1:mode=exclusive_process:gmodel=TeslaK40c”) that starts in exclusive process mode – so only that process can access the GPU.

In the open access cluster nodes, the available GPU types are:


The full name of a card must be used, and capitalization matters.

The complete list of pragmas available for requesting GPU resources is here: https://www.ibm.com/support/knowledgecenter/en/SSWRJV_10.1.0/lsf_command_ref/bsub.gpu.1.html

Requesting Specific Hosts or Groups of Hosts

If you have a need to run on a specific host, you can request one or more candidates by adding this to your submission file:

#BSUB -m "host1 host2"

where “host1 host2” are the hostnames of machines you’d want to allow your job to run on – you can specify one or more hosts, and LSF will choose either what’s available (in the case of multiple hosts) or wait for a specific request to become available.

Alternatively, you can choose a group of hosts (you can see existing groups with the bmgroups command):

#BSUB -m SEAS-CPUALL


The above pragma would pick any of the CPU-only, open-access nodes in the cluster.

Be very careful with this pragma - you can cause your jobs to run poorly or not execute as soon as they could if you pile your jobs on specific hosts, rather than choosing resources generically.


Deleting (Killing) Jobs

If you need to stop a job that is either running or pending, use the command

bkill jobnum

where “jobnum” is the number of the job to kill, as shown from the bjobs command.
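
bkill also accepts several job IDs at once if you need to clear a batch of jobs - the numbers below are only examples:

bkill 123 124 125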



Job Arrays

If you have a job that would benefit from running thousands of times, with input/output files changing against a job index number, you can start a job array.

Job arrays will run the same program multiple times, and an environment variable called $LSB_JOBINDEX will increment to provide a unique number to each job. An array job submission might look like so:

#BSUB -n 3
#BSUB -R "span[hosts=1]"
#BSUB -J "thisjobname[1-9999]"

python thiswork.py -i $LSB_JOBINDEX

This job script requests 3 CPUs on a single host for each array task and starts 9,999 tasks, with each task’s index passed as its input - so the first task expects an input file named “1”.
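
To check on the array after submission without listing every element, bjobs can display a summarized view of the array - the job number below is whatever bsub reported at submission time:

bjobs -A 123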

Basic MPI Jobs

NOTE: MPICH uses the regular networking interface of any given node to transfer MPI communications. You should only use this in cases where your program does not rely on heavy MPI traffic to accomplish its work - minor status coordination of embarrassingly parallel jobs is OK, large-scale attempts at memory sharing are not.

bsub submits job script files.

A simple example of a MPICH job script file looks like so:

#BSUB -G SEAS-Lab-Group
#BSUB -o mpi_test.%J
#BSUB -R '(!gpu)'
#BSUB -N
#BSUB -J MPIjob
#BSUB -a mpich
#BSUB -n 30

module add mpi/mpich-3.2-x86_64
mpiexec -np 30 /path/to/mpihello


The lines beginning with “#” are often referred to as pragmas - directives embedded in the script that pass options to the LSF scheduler.

This file submits an MPI “Hello World” program (available under /project/compute/bin) to the research group “SEAS-Lab-Group”. The output goes to a file “mpi_test.%J”, where %J becomes the job number.

When the job is done, I’m mailed a notification (-N), and when I check status with bjobs, the job name is MPIjob (-J).

Additionally, I’ve told the scheduler I want the “mpich” resource (-a mpich) and that I want to use 30 CPUs (-n 30).

The -G option is not necessary if your research group does not have any lab-owned priority hardware, and the group name “SEAS-Lab-Group” is a nonexistent placeholder.

The -R option selects a machine type. At this time there are no specific types to request, however, this example requests that the scheduler avoid a GPU node – if you are not using GPU resources, it’s good cluster citizenship to avoid those nodes.

More information on bsub options is here.

To submit the job, if you have saved the above example as the file “mpijob”:

[seasuser@ssh lsf]$ bsub < mpijob
Job <123> is submitted to queue default


Avoiding GPUs

If you are not using GPUs, it’s good manners to avoid those nodes. You can do that like so:

#BSUB -R '(!gpu)'


Requesting Memory

You should prerequest the amount of RAM (in GB) your job expects to need – it’s good manners on shared systems. Do this with the pragma:

#BSUB -R "rusage[mem=25]"


This example requests 25GB of system memory. Your job won’t start until there’s a node capable of giving you that much RAM.

Deleting (Killing) Jobs

If you need to stop a job that is either running or pending, use the command

bkill jobnum

where “jobnum” is the number of the job to kill, as shown from the bjobs command.



Singularity/Docker Containers

Singularity is a container platform that runs daemonless, executing much more like a user application. It can run or convert existing Docker containers, and can pull containers from Singularity, Docker, or personal container repositories.

Unlike Docker, it stores each container as a single file with the .sif extension. Also unlike Docker, the default pull action for Singularity places the container in the current working directory, not in a central store location.

Documentation on using Singularity is here.

In these basic instructions, we will use Singularity as a user application within LSF - starting jobs as normal, but executing Singularity and a container as the main content of the job.

Interactive Singularity Jobs

Start your interactive job as you normally would, requesting specific resources (like GPUs) as normal. Once in your session, you can pull a new container or start a container that already exists locally.

To pull a container, make sure you are in a directory that has enough space, or your shared project directory, and do:

singularity pull library://alpine

or

singularity pull docker://alpine

will bring down the default Alpine image (a small, basic Linux environment) from either the Singularity library or the Docker registry, respectively, saving it to the current directory as a .sif file (for example, alpine_latest.sif). After that:

singularity exec alpine_latest.sif /bin/sh

drops you into a shell inside the container. Alternatively:

singularity run alpine_latest.sif

will run the container’s defined runscript (do note the Alpine image does not have a runscript - it will simply exit, having no task to complete).

Accessing File Resources Inside Containers

Singularity will, by default, mount your home directory, /tmp, /var/tmp, and $PWD. You can bind additional locations if needed, either to the original location or to specific locations as required by your container.

If your container expects to see your data, stored on the system as /project/group/projdata, under /data, you can do:

singularity run --bind /project/group/projdata:/data mycontainer

A quick way to recreate the general directories you would expect to see from /project or /storage1 on the compute node would be:

singularity run --bind /opt,/project,/home,/storage1 mycontainer

A single directory listed as part of a --bind option simply binds the directory in-place.

GPUs and Singularity

Singularity fully supports GPU containers. To run a GPU container, add the --nv flag to your command line:

singularity run --bind /project,/opt,/home,/storage1 --nv tensorflow_latest-gpu.sif

The container then runs with access to the node’s GPUs, just within the TensorFlow environment. To run your own Python script inside the container rather than its default runscript, use exec instead, as sketched below.
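
A minimal sketch of that exec form - train.py stands in for your own code:

singularity exec --bind /project,/opt,/home,/storage1 --nv tensorflow_latest-gpu.sif python train.py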

Singularity Background Instances

Some containers are meant to provide services to other processes, such as the CARLA Simulator. Singularity containers can run as instances, meaning they will run as background processes. To run something as an instance:

singularity instance start --nv CARLA_latest.sif user_CARLA

user_CARLA here is the friendly name of the instance. To see running instances:

user@host> singularity instance list
INSTANCE NAME    PID    IP    IMAGE
user_CARLA       5222         /tmp/CARLA_latest.sif

To connect to a running instance, you can use either the run or shell subcommands:

singularity run instance://user_CARLA
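
The shell form drops you into an interactive shell inside the running instance, and when you are finished, stopping the instance releases its resources:

singularity shell instance://user_CARLA
singularity instance stop user_CARLA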

Singularity in Batch Jobs

Using Singularity in a batch job is just as easy as running any other batch job. Craft your job script as you normally would, substituting a Singularity command to run the container in place of an on-host executable.
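
For example, a minimal batch sketch along the lines of the earlier GPU container example - the image name and train.py script are placeholders for your own container and code:

#BSUB -o singularity_out.%J
#BSUB -gpu "num=1:mode=exclusive_process:gmodel=TeslaK40c"
#BSUB -J SingularityJob

singularity exec --bind /project,/opt,/home,/storage1 --nv tensorflow_latest-gpu.sif python train.py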


Special Queues - CPU and GPU

Special CPU Queues

There are three special CPU queues in the cluster meant to service batch jobs with our best systems.

| Queue             | Time Limit | Interactive? |
|-------------------|------------|--------------|
| cpu-compute       | 7 days     | n            |
| cpu-compute-long  | 21 days    | n            |
| cpu-compute-debug | 4 hours    | y            |

Jobs will be killed at the specified time limits. Only the Debug queue allows interactive jobs under a short time limit to maximize availability of these limited resources.

You may submit to these queues with the “-q” option, indicating the queue name after the flag.
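
For example, as a pragma in your job script:

#BSUB -q cpu-compute

or directly on the command line when submitting (reusing the cpujob script from the earlier batch example):

bsub -q cpu-compute < cpujob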

Scratch Space

You may utilize the path /scratch on all of these machines to access a large slice of NVME scratch space. Data is cleaned up automatically after 10 days, and you are responsible for moving data from /scratch to a permanent storage location.

The directory /scratch/long is cleaned after 24 days, for jobs using the cpu-compute-long queue.
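
One common pattern, sketched below with placeholder paths, is to stage input data into a directory of your own under /scratch at the start of a job, compute against the fast local copy, and copy results back to permanent storage before the job ends:

mkdir -p /scratch/$USER
cp -r /storage1/piname/Active/project/input /scratch/$USER/
cd /scratch/$USER/input
# ... run the computation against the local copy ...
cp -r results /storage1/piname/Active/project/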

Special GPU Queues

There are three special GPU queues in the cluster meant to service batch jobs with our best GPUs, which at this time includes nVIDIA A40s and A100s.

| Queue             | Time Limit | Interactive? |
|-------------------|------------|--------------|
| gpu-compute       | 7 days     | n            |
| gpu-compute-long  | 21 days    | n            |
| gpu-compute-debug | 4 hours    | y            |

Jobs will be killed at the specified time limits. Only the Debug queue allows interactive jobs under a short time limit to maximize availability of these limited resources.

You may submit to these queues with the “-q” option, indicating the queue name after the flag. The request string for each GPU model:

| GPU              | LSF Device Name    | # Devices                  |
|------------------|--------------------|----------------------------|
| nVIDIA A100 80GB | NVIDIAA10080GBPCIe | 8 (see MIG section below)  |
| nVIDIA A40 48GB  | NVIDIAA40          | 12                         |
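
For example, a batch request for a single A40 in the standard GPU queue might look like this sketch:

#BSUB -q gpu-compute
#BSUB -gpu "num=1:gmodel=NVIDIAA40"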

Scratch Space

You may utilize the path /scratch on all of these machines to access a large slice of NVME scratch space. Data is cleaned up automatically after 10 days, and you are responsible for moving data from /scratch to a permanent storage location.

The directory /scratch/long is cleaned after 24 days, for jobs using the gpu-compute-long queue.

Multi-Instance A100 GPUs

Two of the 8 A100s are currently configured for Multi-Instance GPU (MIG). This allows users to request a logical subset of a GPU, leaving the remainder available for other users. This is highly recommended for interactive debug sessions, and would yield a maximum of 14 10GB GPUs for interactive use.

Each GPU can be subdivided up to 7 ways, with each slice receiving a certain portion of the RAM (listed in GB) and compute capacity (listed in sevenths of the whole) of the card.

| MIG Option | MIG GPU Size            | Maximum MIG GPUs per card                  |
|------------|-------------------------|--------------------------------------------|
| mig=1      | 10GB / 1 compute slice  | 7 GPUs with 10GB RAM and 1 compute slice   |
| mig=2      | 20GB / 2 compute slices | 3 GPUs with 20GB RAM and 2 compute slices  |
| mig=3      | 40GB / 3 compute slices | 2 GPUs with 40GB RAM and 3 compute slices  |
| mig=4      | 40GB / 4 compute slices | 1 GPU with 40GB of RAM and 4 compute slices |
| mig=7      | 80GB / 7 compute slices | 1 GPU with 80GB of RAM and 7 compute slices |

The above table shows the effects of requesting MIG GPUs. When multiple MIG sizes are requested on the same card, the availability of a given size depends on what has already been subdivided: each card provides seven slices, allocated contiguously, and a new MIG instance can only occupy slices that are not already in use.

For example, if a job requests a “mig=3” GPU, subsequent jobs on that card could request either one “mig=2” GPU or three “mig=1” GPUs. The “mig=4” GPU is a special case, taking 4/7ths of the compute against half the RAM, limiting other users on that card to either one “mig=2” GPU or two “mig=1” GPUs.

“mig=3” and “mig=4” GPUs are granted 2 NVDEC units; “mig=2” GPUs are granted one; “mig=1” GPUs do not receive an NVDEC unit.

Requesting a MIG would look like so, requesting a mig=1 GPU:

#BSUB -gpu "num=1:gmodel=NVIDIAA10080GBPCIe:mig=1"
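
Because MIG slices are particularly well suited to interactive debugging, an interactive request on the debug queue (combining the pieces shown earlier) might look like:

bsub -q gpu-compute-debug -gpu "num=1:gmodel=NVIDIAA10080GBPCIe:mig=1" -Is /bin/bash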

Engineering Applications

ANSYS Fluent 21

ANSYS 21 is available as ‘ansys-workbench-v211’ and ‘fluent211’ in the default path within a VNC job.

Fluent 21 MPI Jobs (Batch)

You will want to have created an SSH key, and run /project/compute/bin/update_hostkeys.sh to pre-accept all cluster host keys.

A sample batch job for running Fluent is below, geared towards running an Ethernet-based MPI job on the cpu-compute queue. It expects to run from a current working directory of “/storage1/piname/Active/project/”, which should be modified everywhere in the script for your specific requirements.

#BSUB -R '(!gpu)'
#BSUB -n 16
#BSUB -o /storage1/piname/Active/project/ansys.out
#BSUB -J fluentJob
#BSUB -R "rusage[mem=10]"
#BSUB -R "span[ptile=8]"
#BSUB -q cpu-compute

export LSF_ENABLED=1
cd $LS_SUBCWD
FL_SCHEDULER_HOST_FILE=lsf.${LSB_JOBID}.hosts
/bin/rm -rf ${FL_SCHEDULER_HOST_FILE}
if [ -n "$LSB_MCPU_HOSTS" ]; then
    HOST=""
    COUNT=0
    for i in $LSB_MCPU_HOSTS
    do
      if [ -z "$HOST" ]; then
         HOST="$i"
      else
         echo "$HOST:$i" >> $FL_SCHEDULER_HOST_FILE
         COUNT=`expr $COUNT + $i`
         HOST=""
      fi
    done
fi

/project/research/ansys21/fluent21 2ddp -g -t16 -scheduler_tight_coupling -peth.ofed -i/storage1/piname/Active/project/testJournal.jou -pcheck -setenv=FLUENT_ARCH=lnamd64 -alnamd64 -env -setenv=LD_LIBRARY_PATH=/project/research/ansys21/gcc/lib2:/project/research/ansys21/gcc/lib2:/opt/ibm/lsfsuite/lsf/10.1/linux2.6-glibc2.3-x86_64/lib -setenv=FLUENT_ARCH=lnamd64 -cnf=${FL_SCHEDULER_HOST_FILE}

rm lsf.${LSB_JOBID}.hosts

The various BSUB parameters are explained earlier in this documentation.

The first part of this script aids in capturing the hosts the job is destined to run on, and is required.

The second part starts Fluent in 2ddp mode: it indicates the number of tasks, scheduler options, Ethernet MPI, the input journal file, and the parallel check function, and then sets various environment variables and architecture settings.

The last line cleans up the MPI hosts file.

Fluent 21 MPI Jobs (GUI)

You will want to have created an SSH key, and run /project/compute/bin/update_hostkeys.sh to pre-accept all cluster host keys.

Start a VNC job as normal. Once there, you can reserve original Infiniband nodes with:

/project/research/ansys/reserve_fluent.sh X

where X is the number of cores you wish to reserve.

You may reserve nodes within the CPU Compute queue (Ethernet) with:

/project/research/ansys/reserve_fluent_cpu.sh X
/project/research/ansys/reserve_fluent_cpu_long.sh X

for either the 7 day or long 21 day queues. Keep in mind this reservation job will end independently of any other Ansys application utilizing it as a target for jobs.

The script will output information you need to continue:

Starting a 4 node Fluent Reservation
Reserving a 4 IB CPU Fluent Job
...starting job:

Your job is:
1234567 seasuser   PEND  ib1        ssh.engr.wustl.edu    -        Fluent-MPI-Waiter Nov  1 11:11

Please look for a file in the root of your home directory :
fluenthosts.1234567
(where 1234567 is the job number you are given above)
and pass that to Fluent as:
fluent -cnf=/home/research/username/fluenthosts.1234567
  (where username is your own WUSTL Key)

You must remember to kill this job when you are done with Fluent!
Use the command:
bkill 1234567
The nodes you reserve will not be released until you do.

Note the job number and the “fluenthosts.1234567” filename - your number will be different for your own started job.

Run Fluent in the VNC session as in this example, using the information above and modifying it for your own needs:

/project/research/ansys21/fluent21 3ddp -tX -cnf=/home/research/username/fluenthosts.1234567 -pib.ofed -gui_machine=$HOSTNAME -i test.jou

For the CPU Compute queues via Ethernet, change

-pib.ofed

to

-peth.ofed