I’m running HADDOCK 2.4 locally, using qsub. The task is to dock two proteins (800 aa and 300 aa) to get some hints about how they interact. I use the coarse-grained representation, with 50000 structures in it0, 5000 in it1 and 5000 in it2.
The point at which the computation becomes inefficient is shown in this fragment of the HADDOCK output:
BEGIN: Mon Nov 15 08:36:18 2021
Read 5000x5000 distance matrix in 14 seconds
Writing 354 Clusters
Coverage 67.34% (3367/5000)
END: Mon Nov 15 08:36:33 2021 [14.86 seconds]
Clustering in /home/project/cg_prim_core_50k/run1/structures/it1/analysis DONE
Check file /home/project/cg_prim_core_50k/run1/structures/it1/analysis /cluster.out
waiting for the ene-residue file in it1/analysis…
It runs on a single CPU, but it could easily be made parallel, since it just writes energies between pairs of residues to a single file and the pipeline could merge several files afterwards. Does parallelizing this step make sense?
Also, does analyzing 5000 structures in it1 make sense? Or should I generate 100,000 in it0 and then analyze only 1000 in it1? I really wanted to sample the space of possibilities, as I have no clues about the interaction, and the result from the web server with a limited number of structures was inconclusive.
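To make the idea concrete, here is a minimal sketch of the split-and-merge pattern described above. It is not HADDOCK code: the file names, the function names and the placeholder "energy" line are all hypothetical, and threads are used only to keep the example short (real CPU-bound work would need separate processes or queue jobs).

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def write_partial(chunk_id, structures, out_dir):
    """Hypothetical worker: write one line per structure to its own partial file."""
    path = Path(out_dir) / f"ene-residue.part{chunk_id:04d}"
    with path.open("w") as fh:
        for name in structures:
            # Placeholder for the real per-residue energy calculation.
            fh.write(f"{name}\tPLACEHOLDER\n")
    return path


def merge_partials(paths, merged_path):
    """Concatenate the partial files, in sorted order, into a single file."""
    with open(merged_path, "w") as out:
        for part in sorted(paths):
            out.write(Path(part).read_text())
    return merged_path


def run_parallel(structures, n_workers, out_dir):
    """Split the structure list into n_workers chunks, process them, then merge."""
    chunks = [structures[i::n_workers] for i in range(n_workers)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        parts = list(
            pool.map(write_partial, range(n_workers), chunks, [out_dir] * n_workers)
        )
    return merge_partials(parts, Path(out_dir) / "ene-residue.merged")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        names = [f"complex_{i}.pdb" for i in range(10)]
        merged = run_parallel(names, n_workers=3, out_dir=tmp)
        print(Path(merged).read_text())
```

Each worker writes only to its own file, so no locking is needed, and the merge is a plain concatenation.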
By inefficient I mean only that ene-residue job, and I am asking about the overall approach rather than criticizing the development - HADDOCK runs impressively fast (I have one 128-thread CPU).
Maybe I should reduce it1 to 1000 structures, since it would still have a considerable dataset generated in it0 to choose from.
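For a rough sense of scale, here is a minimal sketch of the arithmetic behind that choice. It assumes (this is an assumption, not something measured on this run) that the clustering step scales with the number of structure pairs in the distance matrix, while a per-structure analysis scales linearly with the number of structures. The function names are made up for illustration.

```python
def n_pairs(n: int) -> int:
    """Number of unordered pairs among n structures: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def coverage_percent(clustered: int, total: int) -> float:
    """Share of structures assigned to a cluster, as a percentage."""
    return round(100.0 * clustered / total, 2)


if __name__ == "__main__":
    for n in (5000, 1000):
        print(f"{n} structures -> {n_pairs(n)} pairs")
    # Ratio of pairwise work between 5000 and 1000 structures (about 25x),
    # compared with a 5x ratio for work that is linear in the structure count.
    print(n_pairs(5000) / n_pairs(1000))
    # Coverage reported in the log above: 3367 of 5000 structures clustered.
    print(coverage_percent(3367, 5000))
```

Under these assumptions, going from 5000 to 1000 structures in it1 cuts pairwise work by roughly a factor of 25 and per-structure work by a factor of 5.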
I skip the water refinement because I’m using the coarse-grained representation:
{===>} firstwater="no"; ← when this option was left on (the default), the procedure crashed in it1
{===>} waterdock=false; ← I wasn’t sure whether I wanted that, so I disabled it as well.
By inefficient I mean only that ene-residue job, and I am asking about the overall approach rather than criticizing the development - HADDOCK runs impressively fast (I have one 128-thread CPU).
I would simply skip that part unless you need to extract those metrics. For ab-initio docking in particular it is very inefficient, as you are basically calculating it for the entire surface.
Setting the analysis to clustering only will be much more efficient.
I skip the water refinement because I’m using the coarse-grained representation:
{===>} firstwater="no"; ← when this option was left on (the default), the procedure crashed in it1
It would be strange for it to crash in it1, since that step is only performed after it1…
{===>} waterdock=false; ← I wasn’t sure whether I wanted that, so I disabled it as well.
You don’t want it - trust the default settings. It is not possible with CG anyway.