In principle, your approach is correct. We usually warn users about the risk of comparing HADDOCK scores between two different docking runs, but in this scenario the comparison is fair, provided your restraints are good enough to sample the same interface in both runs.
However, I see that the cluster sizes are rather small in both cases (which RMSD threshold did you use, and how many models did you generate?) and that both clusters are very far (14-15 Å RMSD) from the best HADDOCK model, i.e. the one with the lowest HADDOCK score. You may want to revise the parameters of your post-docking analysis to obtain larger clusters and, hopefully, a cluster that includes the lowest-scoring model.
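To illustrate how the RMSD cutoff and the minimum cluster size affect what you get, here is a minimal sketch of a greedy neighbour-count clustering on a pairwise RMSD matrix. It is not HADDOCK's own clustering code: the function name `cluster_by_rmsd`, the toy matrix and the cutoff values are assumptions made for illustration only.

```python
def cluster_by_rmsd(rmsd, cutoff, min_size):
    """Greedy neighbour-count clustering on a symmetric pairwise RMSD matrix.

    rmsd     : square list of lists, rmsd[i][j] in angstroms (hypothetical input)
    cutoff   : two models are neighbours if their RMSD is <= cutoff
    min_size : clusters with fewer members than this are discarded
    """
    remaining = set(range(len(rmsd)))
    clusters = []
    while remaining:
        # Neighbours of each unassigned model (the model itself is included).
        neighbours = {
            i: [j for j in remaining if rmsd[i][j] <= cutoff]
            for i in remaining
        }
        # Cluster centre: the model with the most neighbours (lowest index on ties).
        centre = min(remaining, key=lambda i: (-len(neighbours[i]), i))
        members = sorted(neighbours[centre])
        if len(members) < min_size:
            break  # no remaining model can form a large enough cluster
        clusters.append(members)
        remaining -= set(members)
    return clusters


# Toy 5-model matrix: models 0-2 are loosely similar, models 3-4 are close.
toy_rmsd = [
    [0.0, 5.0, 6.0, 14.0, 15.0],
    [5.0, 0.0, 6.5, 14.5, 15.0],
    [6.0, 6.5, 0.0, 15.0, 14.0],
    [14.0, 14.5, 15.0, 0.0, 3.0],
    [15.0, 15.0, 14.0, 3.0, 0.0],
]

print(cluster_by_rmsd(toy_rmsd, cutoff=4.0, min_size=2))  # [[3, 4]]
print(cluster_by_rmsd(toy_rmsd, cutoff=7.5, min_size=2))  # [[0, 1, 2], [3, 4]]
```

With the tighter cutoff, models 0-2 stay unclustered; relaxing it brings them into a cluster. The same logic applies to your runs: a looser cutoff or a smaller minimum cluster size may pull the lowest-scoring model into a cluster.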
You can find more information on how to cluster the solutions manually here: