I would like to perform large-scale scoring of my protein models using HADDOCK3. I have approximately 10,000 models that I need to evaluate. What would be the best strategy or recommended approach to handle this scale efficiently?
Dear dongl,
HADDOCK3 is limited to 20 input molecule files, but there is no limit on the number of conformations within a single input ensemble.
To process 10,000 models, I would suggest merging them into a single file using pdb_mkensemble (from the pdb-tools suite).
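As a minimal sketch, assuming your models are individual PDB files in the current directory (the model_*.pdb pattern and the output name are placeholders, adjust them to your actual files):

pdb_mkensemble model_*.pdb > 10000_ensemble.pdb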
Then, you can process them using a standard scoring workflow:
run_dir = "big_scoring_run"
molecules = ["10000_ensemble.pdb"]
# Generation of topologies
[topoaa]
# An energy minimisation step followed by scoring with the HADDOCK scoring function
[emscoring]
# Clustering by Fraction of common contacts
[clustfcc]
clust_cutoff = 0.9 # Group together models having >= 90% similar contacts
min_population = 1 # Keep all models, even those that are not clustered (singletons)
# Grouping models by clusters
[seletopclusts]
top_cluster = 10000 # keep all clusters, in case every model ends up in its own cluster
top_models = 10000 # keep all models per cluster, in case they all fall in the same cluster
# A final analysis step to generate the plots
[caprieval]
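To launch the run, save the workflow above to a configuration file (the name scoring.cfg below is just an example) and pass it to the haddock3 executable:

haddock3 scoring.cfg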
For reference, please see what we did for the scoring challenge in the CAPRI rounds using HADDOCK3: https://doi.org/10.1002/prot.26789