Haddock KeepAlive.py

Good morning -

I have a question regarding SBATCH submissions. The submission runs, but the .out files are gzipped before completion, which I think is why the pdb files are not being created. Basically, the submission just sits there with the .out files gzipped and never continues to the next series of .out files.

I changed a line in KeepAlive.py, thinking it would delay the gzipping of the file:
from: time.sleep(15)
to: time.sleep(30)
I even tried: time.sleep(130)
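
For context, here is roughly what I understand KeepAlive.py to be doing; this is just a sketch of the usual keep-alive pattern, not the actual HADDOCK file, and the outfile argument is my assumption:

import os
import sys
import time

outfile = sys.argv[1]  # the .out file that CNS is writing (my assumption)

while True:
    time.sleep(15)  # the interval I changed above
    if os.path.exists(outfile):
        # report on the growing file so the queue sees activity;
        # the loop itself is ended by the "kill -9" in the .job file
        print("%s: %d bytes" % (outfile, os.path.getsize(outfile)))
        sys.stdout.flush()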

However, the .out files are always gzipped at the same point. I tested this by copying the .out.gz files produced before the time.sleep change to another location for comparison.

Here is where it stops (though I don’t think the content itself is relevant):
NBONDS: found 1947 intra-atom interactions
NBONDS: found 2015 intra-atom interactions
NBONDS: found 2030 intra-atom interactions
NBONDS: found 2016 intra-atom interactions
NBONDS: found 2385 intra-atom interactions
NBONDS: found 2203 intra-atom interactions

--------------- cycle= 50 --------------------------------------------------
| Etotal =83442.792 grad(E)=15.480 E(VDW )=369.692 E(ELEC)=0.079 |
| E(NOE )=83073.021 |

Basically, what triggers the move from line 11 to line 13 of a .job file?

Any help would be greatly appreciated.

Ben

Hi Ben,

Are those lines the very end of your *.out files? Also, at which stage is this occurring? Does your queueing system have some sort of error logging, e.g. job.err files? If so, what is their content?

As a side note, your NOE violation energy is extremely high. I’d check your restraint files and either use fewer restraints or check whether you have conflicting sets of restraints.
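
For a quick sanity check, you could count how many restraints each file defines. This is a rough sketch assuming one restraint per "assign" statement, and the file name is just an example:

def count_restraints(tbl_path):
    # count "assign" statements, roughly one per restraint
    with open(tbl_path) as f:
        return sum(1 for line in f if line.lstrip().lower().startswith("assign"))

print(count_restraints("unambig.tbl"))  # example file name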

It could also be that you have been kicked out of the queue because your job exceeded its CPU time…

Gents -

Thanks for your reply.

Yes, that is the very end of the .out files written by Haddock. This is occurring at stage it0, with the very first 16 files (I set the number of cpus to 16). It will not proceed to the next 16 files of it0; it just sits there.

The queuing system does have logging files, but there is nothing in the error file (haddock.err), and the log.out file (haddock.out) contains:

------------------------------------------------------------
Structure 1: running
Structure 2: running
Structure 3: running
Structure 4: running
Structure 5: running
Structure 6: running
Structure 7: running
Structure 8: running
Structure 9: running
Structure 10: running
Structure 11: running
Structure 12: running
Structure 13: running
Structure 14: running
Structure 15: running
Structure 16: running

… (and that is the end of the log.out files as well).

From the .job files, it is almost as if the “kill -9” command is issued before the “protocols/cns1” line has finished.

I thought it could be the CPU time, but I did not set anything for that in my script, and the “TIMELIMIT” is “infinite” (see below). Nor does the queue actually stop: even after the .out files are gzipped, the queue still shows the job running on the node, with no new files created (I left the job in the queue for an hour and no new files appeared).

PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
all* up infinite 17 alloc node[1-17]

Here is my sbatch script:
#!/bin/bash
#SBATCH -J test #jobname
#SBATCH -o /home2/bgbobay/bird/receptome/haddock/CLE_test/ #set the output directory
#SBATCH -D /home2/bgbobay/bird/receptome/haddock/CLE_test/ #set the working directory
#SBATCH --mail-type=end #email me when it ends
#SBATCH --mail-type=begin #email me when it begins
#SBATCH --mail-user=ben.bobay@duke.edu #email address
#SBATCH --error=haddock.err #specify the error file
#SBATCH --output=haddock.out #specify the output file

/usr/bin/python -O /home2/bgbobay/programs/haddock/haddock2.1/Haddock/RunHaddock.py

Ben

Would your jobs complete if run simply using csh as the queue command?
Just as a test, and thus not using the batch system.

It does. It runs just fine on the headnode with csh.

I noticed that the MHaddock.py script differs between my local installation and the one on the cluster.

The cluster version has extra lines in it when generating the .job files:
setenv CURRIT %s
setenv RUN ./
setenv NEWIT $RUN/%s
setenv PREVIT $RUN/%s
setenv TEMPTRASH $RUN
python protocols/KeepAlive.py %s &
setenv TMPDIR /tmp/$$
mkdir $TMPDIR
echo STARTED >>%s
%s < %s >! $TMPDIR/%s
kill -9 %%1 >&/dev/null
gzip -f $TMPDIR/%s
python protocols/RemoveBadPDB.py %s
rm -f %s
mv -f $TMPDIR/%s.gz %s.gz
rm -rf $TMPDIR

vs

setenv CURRIT %s
setenv RUN ./
setenv NEWIT $RUN/%s
setenv PREVIT $RUN/%s
setenv TEMPTRASH $RUN
python protocols/KeepAlive.py %s &
%s < %s >! %s
python protocols/RemoveBadPDB.py %s
kill -9 %%1 >&/dev/null
gzip -f %s
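
If I read the csh correctly, this also answers my earlier question about the trigger: the commands simply run sequentially, so the kill and gzip lines fire as soon as the CNS command on the line above exits, regardless of the sleep interval in KeepAlive.py. In Python terms, the cluster template amounts to something like this (a sketch with made-up path names):

import subprocess

cns_binary = "protocols/cns1"   # placeholder names, just for illustration
inp = "run.inp"
tmp_out = "/tmp/run.out"

keepalive = subprocess.Popen(["python", "protocols/KeepAlive.py", tmp_out])
with open(inp) as fin, open(tmp_out, "w") as fout:
    subprocess.call([cns_binary], stdin=fin, stdout=fout)  # blocks until CNS exits
keepalive.kill()                          # the "kill -9 %1" step
subprocess.call(["gzip", "-f", tmp_out])  # fires immediately after CNS returns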

Maybe a fresh install of Haddock and moving to 2.2?

Ben

Indeed, better to upgrade to 2.2 and see if it works; no sense spending time solving issues with the old version.

PS: On our cluster, we write the out files to the /tmp dir to minimise communication, and only zip and move them to their proper location once the job has completed. You should check that /tmp has enough space for this to work properly.
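
A quick way to check the free space on /tmp on each node, for example (a plain-Python sketch using os.statvfs, which works on Unix):

import os

st = os.statvfs("/tmp")  # filesystem statistics for /tmp
free_gb = st.f_bavail * st.f_frsize / 1024.0 ** 3
print("free space on /tmp: %.1f GB" % free_gb)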