If you are working in academia on projects which involve big data, it is likely that you will, sooner or later, make use of a high-performance computing (HPC) cluster.
Often, these are managed with the Slurm workload manager, which provides a framework for job scheduling and resource allocation. In essence, it efficiently coordinates “supply and demand” in the cluster.
As working on an HPC cluster is not something that I do every day, here are a few tips that I keep re-discovering:
First, start with a small sample on your local machine. Iron out all the bugs and make sure your script is clean, simple, and working.
Create a template job script that includes the parameters you need to edit, as well as an outline of how to execute your code.
For example, you will want to load modules, activate environments, and print the date and time when the script starts executing.
It could look something like this:
#!/bin/bash
#SBATCH --job-name=my_r_job
#SBATCH --output=output.txt
#SBATCH --error=error.txt
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=4
#SBATCH --mem-per-cpu=2G
#SBATCH --time=02:00:00
# Print the date and time at the start of the job
echo "Job started at: $(date)"
# Load required modules
module load R
# Run R script
Rscript my_script.R
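If you find yourself editing the same few parameters again and again, one option is to let a small shell function write the job script for you. The sketch below is only an illustration: the function name make_job_script, its arguments, and the log file names are my own assumptions. The %j in the file names is Slurm's placeholder for the job ID, so each run gets its own log files.

```shell
# Hypothetical helper: write a Slurm job script for an R script.
# Arguments: job name, R script, wall time (default 02:00:00), CPUs (default 4).
make_job_script() {
    local job_name="$1"
    local r_script="$2"
    local wall_time="${3:-02:00:00}"
    local cpus="${4:-4}"

    # The heredoc is unquoted, so ${...} is filled in now,
    # while \$(date) is kept literal and only runs inside the job.
    cat > "${job_name}.sh" <<EOF
#!/bin/bash
#SBATCH --job-name=${job_name}
#SBATCH --output=${job_name}_%j.out
#SBATCH --error=${job_name}_%j.err
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=${cpus}
#SBATCH --mem-per-cpu=2G
#SBATCH --time=${wall_time}

echo "Job started at: \$(date)"
module load R
Rscript "${r_script}"
echo "Job finished at: \$(date)"
EOF
}
```

Calling make_job_script my_r_job my_script.R 02:00:00 4 would then write my_r_job.sh into the current directory, with roughly the same settings as the template above.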
Set up shortcuts for connecting to the cluster, downloading files, uploading files, and, most importantly, updating your code.
When you have these shortcuts, you don’t even need to connect your IDE to the cluster (although that is a great idea); you can simply edit your code locally, save it, and run the shortcut to update your code on the cluster. Trust me, you will use this over and over again.
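Such shortcuts can be as simple as a few shell functions in your ~/.bashrc. This is a minimal sketch, not a recommendation for a particular setup: the host name, the project directory, the folder names, and the function names are all placeholders that I made up for illustration.

```shell
# Hypothetical values: replace with your own login, host, and remote folder.
HPC_HOST="${HPC_HOST:-user@cluster.example.org}"
HPC_DIR="${HPC_DIR:-projects/my_project}"

# Open an interactive session on the cluster.
hpc_connect() {
    ssh "$HPC_HOST"
}

# Upload a local folder (default: code) into the remote project directory.
hpc_push() {
    scp -r "${1:-code}" "$HPC_HOST:$HPC_DIR/"
}

# Download a remote folder (default: results) into the current directory.
hpc_pull() {
    scp -r "$HPC_HOST:$HPC_DIR/${1:-results}" .
}
```

After editing your code locally, running hpc_push is then enough to update the copy on the cluster, and hpc_pull fetches the results once the job is done.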
Filter the system messages of the cluster into a separate folder in your email inbox. You will get a lot of them.
scp is your friend, as it helps you to upload and download files and folders easily.
The cluster might only offer versions of your programming language that differ from the one on your local machine. Prepare yourself for some package compatibility issues.
There will be tiny issues. For example, I noticed that, when calling R from the terminal, it matters whether you have an .r or an .R file.
Consider removing intermediary data objects to save RAM.
Streamline your code.
Consider creating a logger, so that you can monitor progress more easily. You will need it. Also, print out the resources used by your program at intermediate steps. This will be very helpful with debugging.
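The same idea also works at the level of the job script itself. Below is a small sketch of a timestamped logger in shell, assuming that the ps command is available on the compute node; the function names log and log_memory are my own. Inside an R script you would do the equivalent with R's own tools.

```shell
# Print a message with a timestamp in front of it.
log() {
    printf '[%s] %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$*"
}

# Log the resident memory (in MB) of a process (default: the current shell).
# Falls back to 0 if ps returns nothing.
log_memory() {
    local pid="${1:-$$}"
    local rss_kb
    rss_kb="$(ps -o rss= -p "$pid" 2>/dev/null | tr -d ' ')"
    log "Memory used by PID $pid: $(( ${rss_kb:-0} / 1024 )) MB"
}
```

In a job script, you could then call log "Starting step 1" before each major step and log_memory in between, so that the output file shows how far the job got and when.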
Submit the script with the sbatch command, which adds your job to the queue and requests the necessary resources. Monitor the job status with squeue.
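Submitting and checking can also be wrapped into two small helpers. This is a hedged sketch: the function names submit_job and job_status are hypothetical. It relies on the --parsable option of sbatch, which prints just the job ID (optionally followed by a semicolon and the cluster name) instead of a full sentence.

```shell
# Submit a job script and print only the numeric job ID.
submit_job() {
    local out
    out="$(sbatch --parsable "$1")" || return 1
    # Strip an optional ";clustername" suffix.
    echo "${out%%;*}"
}

# Show the queue entry for one job, or all of your jobs if no ID is given.
job_status() {
    if [ -n "${1:-}" ]; then
        squeue -j "$1"
    else
        squeue -u "$USER"
    fi
}
```

A typical session would be job_id="$(submit_job my_job.sh)" followed by job_status "$job_id". Keep in mind that squeue lists pending and running jobs by default, so a job that has already finished will no longer show up there.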
By following this workflow, you can efficiently submit and monitor your jobs, retrieve output, and identify and resolve any issues that may arise.