The amount of high-throughput screening (HTS) data readily available has significantly increased
because of the PubChem project (http://pubchem.ncbi.nlm.nih.gov/). There is considerable opportunity
for data mining of small molecules for a variety of biological systems using cheminformatic tools and the
resources available through PubChem. In this work, we trained a support vector machine (SVM) classifier
using the Signature molecular descriptor on factor XIa inhibitor HTS data. The optimal number of
Signatures was selected by a feature selection algorithm based on highly correlated clusters. Our
method included an improvement that allowed clusters to work together to improve accuracy,
whereas previous methods scored clusters individually. The resulting model had a 10-fold
cross-validation accuracy of 89%, and additional validation was provided by two independent test sets.
We applied the SVM to rapidly predict activity for approximately 12 million compounds also deposited in
PubChem. Confidence in these predictions was assessed by considering the number of Signatures within
the training set range for a given compound, defined as the overlap metric. To further evaluate
compounds identified as active by the SVM, docking studies were performed using AutoDock. A focused
database of compounds predicted to be active was obtained, with several of the compounds appreciably
dissimilar to those used in training the SVM. This focused database is suitable for further study. The data
mining technique presented here is not specific to factor XIa inhibitors, and could be applied to other
bioassays in PubChem where one is looking to expand the search for small molecules as chemical probes.
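
As a rough illustration of the overlap metric described above, the following Python sketch computes, for a single compound, the fraction of its Signatures whose occurrence counts fall within the range observed in the training set. This is one possible reading of the metric and not necessarily the exact definition used in this work: the function names, the representation of a compound as a Counter of Signature strings, the choice to report a fraction instead of a raw count, and the treatment of Signatures unseen in training as out of range are all assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Range = Tuple[int, int]


def training_ranges(training_set: Iterable[Counter]) -> Dict[str, Range]:
    """Return the (min, max) occurrence count of each Signature over the training compounds.

    A Signature absent from a training compound contributes a count of 0,
    so the minimum is 0 unless every training compound contains it.
    """
    compounds = list(training_set)
    all_signatures = set().union(*compounds) if compounds else set()
    return {
        sig: (min(c[sig] for c in compounds), max(c[sig] for c in compounds))
        for sig in all_signatures
    }


def overlap_metric(compound: Counter, ranges: Dict[str, Range]) -> float:
    """Fraction of the compound's Signatures whose counts lie within the training range.

    Assumption: a Signature never seen in training counts as out of range.
    """
    signatures = [sig for sig, count in compound.items() if count > 0]
    if not signatures:
        return 0.0
    in_range = sum(
        1
        for sig in signatures
        if sig in ranges and ranges[sig][0] <= compound[sig] <= ranges[sig][1]
    )
    return in_range / len(signatures)


if __name__ == "__main__":
    # Hypothetical placeholder Signature strings, not taken from the actual data set.
    train = [
        Counter({"[C]([C][O])": 2, "[O]([C])": 1}),
        Counter({"[C]([C][O])": 1, "[N]([C])": 1}),
    ]
    ranges = training_ranges(train)
    query = Counter({"[C]([C][O])": 3, "[O]([C])": 1, "[S]([C])": 1})
    print(f"overlap = {overlap_metric(query, ranges):.2f}")
```

Under this reading, a compound with a low overlap value lies largely outside the chemical space covered by the training set, so its predicted activity would warrant less confidence.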