How to use HADDOCK on a SLURM system

Dear all,

I want to run HADDOCK2.2 on a high-performance cluster that uses the Simple Linux Utility for Resource Management (SLURM) as its scheduling system. According to the cluster's documentation, I need to create a SLURM script like the following:

#!/bin/bash
#
# NOTE: Lines starting with "#SBATCH" are valid SLURM commands or statements,
# while those starting with "#" and "##SBATCH" are comments. Uncommenting a
# "##SBATCH" line means removing one "#" so that it starts with "#SBATCH" and
# becomes a SLURM command or statement.
#
#SBATCH -J slurm_job                    # Slurm job name
#
# Set the maximum runtime, uncomment if you need it
##SBATCH -t 48:00:00                    # Maximum runtime of 48 hours
#
# Enable email notifications when the job begins and ends, uncomment if you need it
##SBATCH --mail-user=user_name@ust.hk   # Update your email address
##SBATCH --mail-type=begin
##SBATCH --mail-type=end
#
# Choose partition (queue), for example, partition "standard"
#SBATCH -p standard
#
# Use 2 nodes and 48 cores
#SBATCH -N 2 -n 48
#
# Setup runtime environment if necessary
# For example, setup the MPI environment
source /usr/local/setup/pgicdk-15.10.sh
# or you can source ~/.bashrc or ~/.bash_profile
#
# Go to the job submission directory and run your application
cd $HOME/apps/slurm
mpirun ./your_mpi_application

Now I am confused by the "queue command" in HADDOCK and by MPI. If I just create and run the SLURM script above with the "haddock2.2" command and use "csh" as the queue command in run.cns, will SLURM give me all of the nodes and CPU cores that I request in run.cns?

Or do I need to use a SLURM submission script as the queue command in run.cns?

Best wishes

Hi there,

The queue setting in run.cns is used to launch the jobs created by HADDOCK. In short, you should replace csh by srun or sbatch, depending on your SLURM configuration, and with the appropriate flags.
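
For illustration, a minimal run.cns sketch of that change, assuming srun is on the PATH; the partition name, CNS path, and cpu number are placeholders:

{===>} queue_1="srun -p standard -n 1";
{===>} cns_exe_1="/path/to/cns";
{===>} cpunumber_1=48;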

Thanks for your reply.

In fact, I am confused about two possible ways:

The first way is to just submit the SLURM script with sbatch in the cluster shell. In that case, should the "queue command" in run.cns be "srun" or "csh"?

The second way is to just type "haddock2.2" in the cluster shell, with the "queue command" in run.cns set to "sbatch".

Best wishes,

The first way would then indeed use csh - meaning you are sending the entire HADDOCK process to the batch system - BUT it probably only works fine provided you are using only one node (my guess - I have never tested SLURM).

As Alex said, you can request ONE node interactively via srun/sdev and then run HADDOCK with csh as the queue command and the cpu number set to the number of cores you requested. This is probably not efficient, but it's simple.

To make full use of SLURM you'd have to edit the Queue.py file in Haddock/Main/, add the proper SLURM headers, and then put sbatch as the queue command. More complicated, but more efficient and practical in the long term.

Actually in the haddock package under Haddock/Main you will find the following file: QueueSubmit_slurm.py

Check and adapt its content, and to enable it, link QueueSubmit.py to QueueSubmit_slurm.py

But no warranty that it will work properly (again, it was only tested once, and not recently).
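
For reference, enabling it amounts to something like the following, a sketch assuming $HADDOCK points at your HADDOCK installation and using a hypothetical backup name:

cd $HADDOCK/Haddock/Main
mv QueueSubmit.py QueueSubmit_orig.py      # hypothetical backup of the default
ln -s QueueSubmit_slurm.py QueueSubmit.py  # HADDOCK will now use the SLURM variant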

Hi all
back after ages to a great piece of software :slight_smile:
I have successfully installed it on our new cluster (the CNS compilation was beautifully painless), which has SLURM as its queue system and 20 nodes with 32 cores each.
I did
set QUEUESUB=QueueSubmit_slurm.py
and edited Haddock/Main/MHaddock.py to contain:

#values for running via a batch system
jobconcat["0"] = 5
jobconcat["1"] = 2
jobconcat["2"] = 2

Then I edited run.cns as follows:

{===>} queue_1="sbatch";
{===>} cns_exe_1="/SOFTW/DOCKING/HADDOCK/cns_solve_1.3/intel-x86_64bit-linux/bin/cns";
{===>} cpunumber_1=999;

as suggested in Haddock/Main/QueueSubmit_slurm.py.

HADDOCK2.4 runs fine, and I currently have 10 jobs running (R) and 192 pending (PD) in the queue.
Now, before the sysadmin complains about this long queue, is there any chance to submit a single wrapper job, similar to what is described in the HADDOCK2.4 manual - Frequently Asked Questions – Bonvin Lab?
I tried to edit ssub but it did not work.
I am testing with examples/protein-protein in the HADDOCK directory.

Any help?
Thanks
Andrea

Actually, I can increase jobconcat["0"] to 20 to reduce the number of jobs,
but a single wrapper script would be better.

What you want then is to send the entire HADDOCK process to the node, which means that one docking run will use one node exclusively.

And in run.cns you would use e.g. csh as queue command with the number of jobs set to 32 to use all the cores of the node.

We have also written a simple pilot mechanism to execute a large number of docking runs on an HPC system, running one docking run per node.

Check: https://github.com/haddocking/haddock-pilot

Hi Alexandre
so you are suggesting to put
{===>} queue_1="csh";
{===>} cns_exe_1="/SOFTW/DOCKING/HADDOCK/cns_solve_1.3/intel-x86_64bit-linux/bin/cns";
{===>} cpunumber_1=32;

and then create a wrapper SLURM script in order to submit the whole job to a single node, i.e.

#SBATCH --job-name=HADDOCK
#SBATCH --partition=workq
#SBATCH --account spitaleri.andrea
#SBATCH --mem=60GB # amount of RAM required (and max RAM available); use either this or --mem-per-cpu
#SBATCH --time=INFINITE ## or e.g. --time=10:00 (10 minutes) or --time=01:00:00 (1 hour)
#SBATCH --cpus-per-task=32
#SBATCH --nodes=1 # not really needed for non-MPI jobs
#SBATCH --mail-type=ALL ## BEGIN, END, FAIL or ALL
#SBATCH --error="err"
#SBATCH --output="out"

haddock2.4 > log 2>&1

If so, do I need to re-configure Haddock using

set QUEUESUB=QueueSubmit_concat.py

and keep

jobconcat["0"] = 5
jobconcat["1"] = 2
jobconcat["2"] = 2

Actually, the second option, increasing jobconcat["0"] to 20, should reduce the number of jobs in the queue.

Thanks

so you are suggesting to put
{===>} queue_1="csh";
{===>} cns_exe_1="/SOFTW/DOCKING/HADDOCK/cns_solve_1.3/intel-x86_64bit-linux/bin/cns";
{===>} cpunumber_1=32;

Yes

and then create a wrapper SLURM script in order to submit the whole job to a single node, i.e.

#SBATCH --job-name=HADDOCK
#SBATCH --partition=workq
#SBATCH --account spitaleri.andrea
#SBATCH --mem=60GB # amount of RAM required (and max RAM available); use either this or --mem-per-cpu
#SBATCH --time=INFINITE ## or e.g. --time=10:00 (10 minutes) or --time=01:00:00 (1 hour)
#SBATCH --cpus-per-task=32
#SBATCH --nodes=1 # not really needed for non-MPI jobs
#SBATCH --mail-type=ALL ## BEGIN, END, FAIL or ALL
#SBATCH --error="err"
#SBATCH --output="out"

haddock2.4 > log1 2>&1

If so, do I need to re-configure Haddock using

set QUEUESUB=QueueSubmit_concat.py

and keep

jobconcat["0"] = 5
jobconcat["1"] = 2
jobconcat["2"] = 2

No, I would put all values to 1 in that case.
You will only have one job in the queue.

Also, you could consider creating a tmp directory under /tmp on the node (if space allows), moving the full run there and running it locally (paths should then be relative to ./ in the run.cns file), and then moving the completed run back to your home directory. In that way you would minimise network traffic. This is roughly what the pilot mechanism does.
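
A minimal sbatch sketch of that staging idea, assuming a run directory called run1 under $HOME/docking (both names are illustrative) and node-local space under /tmp:

#!/bin/bash
#SBATCH --job-name=HADDOCK-run1
#SBATCH --nodes=1
#SBATCH --cpus-per-task=32

# stage the run to node-local scratch to minimise network traffic
SCRATCH=/tmp/$USER.$SLURM_JOB_ID
mkdir -p $SCRATCH
cp -r $HOME/docking/run1 $SCRATCH/
cd $SCRATCH/run1

# run HADDOCK locally (with queue_1="/bin/csh" and cpunumber_1=32 in run.cns)
haddock2.4 > haddock.out 2>&1

# move the completed run back and clean up
cp -r $SCRATCH/run1 $HOME/docking/run1.done
rm -rf $SCRATCH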

Note that when having HADDOCK send its jobs to the queue via SLURM, we use the QueueSubmit_concat.py mechanism. If you set the cpunumber to, say, 50, you won't fill the queue that much, and your run will still proceed quite fast.
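
As an illustration of that setup, a run.cns sketch, assuming sbatch is on the PATH; the CNS path is a placeholder and 50 is just an example limit on concurrent jobs:

{===>} queue_1="sbatch";
{===>} cns_exe_1="/path/to/cns";
{===>} cpunumber_1=50;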

Hi @amjjbonvin,
I am able to run my docking on a single node with 80 cpus, using:

{===>} queue_1="/bin/csh";
{===>} cns_exe_1="~/software/bin/cns_solve-2202230644.exe";
{===>} cpunumber_1=80;

Now I am trying to use 2 nodes with SLURM:
I have looked into the QueueSubmit_slurm.py file and linked QueueSubmit.py to it, as suggested in earlier postings. The new run.cns was modified to contain:

{===>} queue_1="8: /opt/slurm/bin/srun";
{===>} cns_exe_1="/home/p/pmkim/mkhatami/software/bin/cns_solve-2202230644.exe";
{===>} cpunumber_1=999999;

However, this makes all structures start running simultaneously on a single node, while my second node stays free!
I tried editing queue_2, too:

{===>} queue_1="8: /opt/slurm/bin/srun";
{===>} cns_exe_1="/home/p/pmkim/mkhatami/software/bin/cns_solve-2202230644.exe";
{===>} cpunumber_1= 999999;
{===>} queue_2="8: /opt/slurm/bin/srun";
{===>} cns_exe_2="/home/p/pmkim/mkhatami/software/bin/cns_solve-2202230644.exe";
{===>} cpunumber_2= 999999;

But still, all jobs go to a single node, leaving the other one free.

Here is my submission script, in case it helps:

#!/bin/bash
#SBATCH --job-name=haddock-6pro
#SBATCH --output=%x-%j.out
#SBATCH --nodes=2           # number of nodes
#SBATCH --ntasks-per-node=4     # request 4 MPI tasks per node
#SBATCH --cpus-per-task=20        # => total: 20 x 4 = 80 CPUs/node
#SBATCH --mem=0                  # request all available memory on the node
#SBATCH --time=0-00:30           # time limit (D-HH:MM)
# SBATCH --constraint=[dragonfly5|dragonfly4] # restrict to AVX512 capable nodes.

module purge --force
module load CCEnv
module load arch/avx512   # switch architecture for up to 30% speedup 
module load StdEnv/2020 gcc/9.3.0  openmpi/4.0.3

module load python/2.7
HADDOCK="~/software/haddock2.4-2021-05"
HADDOCKTOOLS="$HADDOCK/tools"
PYTHONPATH="/usr/bin/python2.7:$HADDOCK"
alias haddock2.4="/usr/bin/python2.7 $HADDOCK/Haddock/RunHaddock.py"
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"


python2.7 $HADDOCK/Haddock/RunHaddock.py

Don’t use the QueueSubmit_slurm.py file! We should actually remove it. Old stuff. Use the concat script.

The way we run it on our cluster using SLURM is to start HADDOCK on the master node and have it submit jobs to SLURM via the queue_1 command defined in run.cns.
This could simply be sbatch, possibly with some additional options.

I.e. we are not submitting the entire HADDOCK process to the queue.
That can be done as well, but then it can only run within the node it has been assigned (no MPI support).
In that case the queue command in run.cns would simply be /bin/csh, with the cpunumber set to the number of cores available/requested on the node.

For our system, we have a wrapper script that creates and submits a job file. This is what we define under queue_1, and we set the cpunumber to the number of concurrent jobs we want to have in the queue. We have different queue lengths, which are reflected in the wrapper script. Within run.cns we thus define queue_1 = "submit-slurm short" to target the short queue.

#!/bin/csh -f

# syntax highlighting : alt-x global-font-lock-mode
if ($# < 2) then
  echo "Usage : submit-slurm queue jobname"
  echo "queue : valid queue destination are : short | medium | long | haddock | verylong"
  exit 1
endif

# check queue
set queue=$1
if ($queue != "short" && $queue != "medium" && $queue != "long" && $queue != "verylong" && $queue != "minoes" && $queue != "haddock" ) then
  echo "Wrong queue destination"
  exit 1
endif

# define time limit based on queue
if ($queue == short ) then
  set timelimit=240
endif
if ($queue == medium ) then
  set timelimit=720
endif
if ($queue == long ) then
  set timelimit=1440
endif
if ($queue == verylong ) then
  set timelimit=7200
endif
if ($queue == haddock ) then
  set timelimit=720
endif

# check if job exists + make it executable
set jobname=$2
if (! -e $2) then
  echo "job file does not exist"
  exit 1
endif
if (! -x $jobname) chmod +x $jobname

# write slurm script
set slurmjob=$jobname.slurmjob.$$
if (-e $slurmjob) then
  \rm $slurmjob
endif
touch $slurmjob
set PWD=`pwd`

if ($jobname =~ *rmsd*.job) then
  set queue=haddock
endif
if ($jobname =~ *ene-residue*.job) then
  set queue=verylong
endif

echo "#!"$SHELL >> $slurmjob
set outfile=$PWD/$jobname.out.$$
echo "#SBATCH --output="$outfile >> $slurmjob
set errorfile=$PWD/$jobname.err.$$
echo "#SBATCH --error="$errorfile >> $slurmjob
echo "#SBATCH --partition=$queue" >> $slurmjob
echo "#SBATCH --time="$timelimit >> $slurmjob
echo "#SBATCH --cpus-per-task=1" >> $slurmjob
echo "#SBATCH --threads-per-core=1" >> $slurmjob
echo "cd "$PWD >> $slurmjob
echo "./"$jobname >> $slurmjob

chmod +x $slurmjob
sbatch $slurmjob
set success=$?
\rm $slurmjob
exit $success
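
With that wrapper available on the PATH as submit-slurm, run.cns would then reference it roughly like this (a sketch; the CNS path is a placeholder and 50 is just an example number of concurrent jobs):

{===>} queue_1="submit-slurm short";
{===>} cns_exe_1="/path/to/cns";
{===>} cpunumber_1=50;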

Thank you @amjjbonvin.
I will try to adapt it to our cluster.