Ene-residue analysis takes long

Dear HADDOCK community,

I’m running HADDOCK2.4 locally, submitting jobs via qsub. The task is to dock two proteins (800 aa and 300 aa) to get some hints about their interaction. I use the coarse-grained representation, with 50000 structures in it0, 5000 in it1 and 5000 in it2.

The point where the computation becomes inefficient is here (fragment of the HADDOCK output):

  BEGIN: Mon Nov 15 08:36:18 2021
  Read 5000x5000 distance matrix in 14 seconds
  Writing 354 Clusters
  Coverage 67.34% (3367/5000)
  END: Mon Nov 15 08:36:33 2021 [14.86 seconds]
  Clustering in /home/project/cg_prim_core_50k/run1/structures/it1/analysis DONE
  Check file /home/project/cg_prim_core_50k/run1/structures/it1/analysis/cluster.out
  waiting for the ene-residue file in it1/analysis…

This step runs on a single CPU (but it looks like it could easily be parallelized, as it just writes energies between pairs of residues to a single file; the pipeline could merge several such files afterwards). Does parallelization of this step make sense?

Also, does analyzing 5000 structures in it1 make sense? Or should I rather generate 100,000 in it0 and then analyze 1000 in it1? I really wanted to sample the space of possibilities, as I have no clues about the interaction, and the run on the web server, with its limited number of structures, was inconclusive.
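For reference, if I understand the run.cns parameters correctly, these numbers would be controlled by something like the following (values shown for the 100,000 / 1000 scenario; please correct me if I have the names wrong):

{===>} structures_0=100000; ← number of structures generated in it0 (rigid-body docking)
{===>} structures_1=1000; ← number of structures refined in it1 (semi-flexible refinement)
{===>} anastruc_1=1000; ← number of structures analyzed after it1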

I’d be grateful for any suggestions.

Our default sampling is 1000/200/200 or 10000/400/400 in ab-initio mode.
And from time to time we may push it to 50000 in it0 / 1000 in it1.

Your sampling numbers are much higher…

This is of course coming at a price.

Especially if you have some information to drive the docking, you don’t need such high sampling.

PS: The numbers you report are actually quite efficient! Clustering of 5000 models in 15 s…

It1 and water are the more costly parts.

So I’m not sure what you qualify as inefficient…

I meant inefficient only for that ene-residue job, and not as a criticism of the software itself but rather of my overall approach. HADDOCK itself runs impressively fast (I have a single 128-thread CPU).
Maybe I should decrease it1 to 1000, since there will be a considerable set of models generated in it0 to choose from anyway.

I skip water as I’m using the coarse-grained representation:
{===>} firstwater="no"; ← when this option was left on (the default), the procedure crashed in it1

{===>} waterdock=false; ← I wasn’t sure whether I wanted that, so I disabled it as well.

I meant inefficient only for that ene-residue job, and not as a criticism of the software itself but rather of my overall approach. HADDOCK itself runs impressively fast (I have a single 128-thread CPU).

I would simply skip that part, unless you need to extract those metrics. For ab-initio docking it is certainly very inefficient, as you are basically calculating it for the entire surface.

Setting the analysis to clustering only will be much more efficient.
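If I am quoting the run.cns parameter correctly, this corresponds to something like:

{===>} runana="cluster"; ← cluster-based analysis only, instead of the full per-structure analysis ("full"); please double-check the exact name and values in your run.cns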

I skip water as I’m using the coarse-grained representation:
{===>} firstwater="no"; ← when this option was left on (the default), the procedure crashed in it1

It would be strange for it to crash at it1, since the water refinement is only performed after it1…

{===>} waterdock=false; ← I wasn’t sure whether I wanted that, so I disabled it as well.

You don’t want it - trust the default settings… And it is not possible with CG anyway.

Ah, sorry, yes, it crashed after it1, in a smaller coarse-grained test run.