About SMILES database

Hi,
How can you make sure that the SMILES database used is complete? And how can you account for a specific family of drugs that has only a few members in the much larger dataset? Thank you :slight_smile:

REINVENT is trained on a very large set of drug-like molecules, but the main effect of this is that the model learns the SMILES grammar. It can in fact generate molecules that are very different from anything seen in the training set, so the dataset does not need to cover every family. If you want the model to generate only molecules with certain properties (e.g. a family with a common scaffold), you can use appropriate scores to guide the generation, such as some measure of similarity or a common substructure.
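To give a rough idea of what a similarity score like this could look like, here is a minimal sketch in plain Python. It is not REINVENT's own scoring code. It assumes each molecule has already been turned into a fingerprint, represented here as a set of "on" bit indices (in practice you would compute these with a cheminformatics toolkit). The names `tanimoto` and `family_score` and the fingerprint values are made up for illustration.

```python
from typing import FrozenSet, Iterable

# A fingerprint is modelled as the set of indices of its "on" bits.
Fingerprint = FrozenSet[int]


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto (Jaccard) similarity between two bit-set fingerprints."""
    union = len(a | b)
    if union == 0:
        return 0.0  # two empty fingerprints: treat as no similarity
    return len(a & b) / union


def family_score(candidate: Fingerprint, family: Iterable[Fingerprint]) -> float:
    """Score in [0, 1]: highest similarity of the candidate to any family member."""
    return max((tanimoto(candidate, ref) for ref in family), default=0.0)


# Made-up fingerprints for a small reference family (illustration only).
family = [frozenset({1, 4, 7, 9}), frozenset({1, 4, 8, 9})]
close_candidate = frozenset({1, 4, 7, 8, 9})
far_candidate = frozenset({2, 3, 5})

print(family_score(close_candidate, family))  # 0.8
print(family_score(far_candidate, family))    # 0.0
```

A score like this rewards generated molecules that resemble the few known members of the family, which is why a small family inside a much larger training set is not necessarily a problem.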