Cluster mols


cluster_mols is a PyMOL plugin that allows the user to quickly select compounds from a virtual screen to be purchased or synthesized.

It helps the user by automatically clustering input compounds based on their molecular fingerprints [1] and loading them into the PyMOL window. cluster_mols also highlights both good and bad polar interactions between the ligands and a user-specified receptor. Additionally, there are a number of keyboard controls for selecting and extracting compounds, as well as functionality for searching online to see if there are vendors for a selected compound.


The basic workflow of cluster_mols can be broken up into three parts.

  1. Computing a similarity matrix from the input compounds
  2. Performing hierarchical clustering on the results from step 1
  3. Cutting the tree at a user-specified height and creating and sorting clusters
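The three steps above can be sketched in plain Python. This is only an illustration under stated assumptions, not the plugin's own code: fingerprints are modeled as sets of on-bits, similarity is Tanimoto, distance is 1 - similarity, and the tree is assumed to be single-linkage (the plugin relies on chemfp and fastcluster, and its actual linkage method is not stated here). Cutting a single-linkage tree at a given height is equivalent to joining every pair of compounds whose distance is at or below that height, which is what the hypothetical `cut_single_linkage` function does.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def similarity_matrix(fps):
    """Step 1: all-against-all similarity matrix (symmetric, 1.0 on the diagonal)."""
    n = len(fps)
    sim = [[1.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        sim[i][j] = sim[j][i] = tanimoto(fps[i], fps[j])
    return sim


def cut_single_linkage(sim, height):
    """Steps 2 and 3 (assumed single linkage): clusters left after cutting at `height`."""
    n = len(sim)
    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in combinations(range(n), 2):
        if 1.0 - sim[i][j] <= height:  # distance = 1 - similarity
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    # Largest clusters first; ties broken by member indices.
    return sorted(clusters.values(), key=lambda c: (-len(c), c))


# Toy fingerprints (made-up bit positions, for illustration only).
fps = [{1, 2, 3, 4}, {1, 2, 3, 5}, {10, 11, 12}, {10, 11, 13}, {20}]
sim = similarity_matrix(fps)
print(cut_single_linkage(sim, 0.45))
print(cut_single_linkage(sim, 0.6))
```

Raising the cut height merges more compounds into each cluster, so the second call returns fewer, larger clusters than the first.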

The results of steps 1 and 2 are saved to Python pickle files so you do not have to recompute them in subsequent runs.

cluster_mols also highlights both good and bad polar contacts between the ligand and a user-specified protein using the 'show_contacts' module described below.

This script also integrates keyboard controls that allow WASD movement through the clusters, as well as keyboard shortcuts for pulling out compounds. See below for usage.


The most up to date version (recommended) of cluster_mols is available through BitBucket at:


This plugin requires a number of dependencies and is currently supported only on Linux and OSX.

Python packages (install using easy_install or pip)

  1. openbabel
  2. chemfp
  3. numpy
  4. scipy
  5. Tkinter
  6. fastcluster
  7. argparse (optional: for command line only)

Command line tools (These must be accessible through your PATH environment variable):

  1. babel -- from

Recent versions of cluster_mols do not require sdsorter, but it is still a very useful tool for dealing with sdf files.

  1. sdsorter --

Once you have the required dependencies, install it through PyMOL's Plugin menu.

PyMOL > Plugin > Install Plugin


The GUI is relatively straightforward if you follow it from top to bottom, and then left to right through the tabs.

The program requires that the input be a '.sdf' or '.sdf.gz' file. If your compounds are not in that format, use the 'babel' tool from OpenBabel to convert them.
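As a rough sketch of the conversion step, the snippet below wraps the 'babel' command line tool from Python. It assumes the classic `babel <infile> <outfile>` calling form, in which the formats are inferred from the file extensions; the function names and the file names `hits.mol2` and `hits.sdf` are hypothetical.

```python
import shutil
import subprocess


def babel_command(src, dst):
    """Build the argument list for a babel conversion (formats inferred from extensions)."""
    return ["babel", src, dst]


def convert_to_sdf(src, dst):
    """Run the conversion, failing early if babel is not on PATH."""
    if shutil.which("babel") is None:
        raise RuntimeError("babel was not found on PATH")
    subprocess.run(babel_command(src, dst), check=True)


# Hypothetical file names; prints the command without running it.
print(" ".join(babel_command("hits.mol2", "hits.sdf")))
```

Calling `convert_to_sdf("hits.mol2", "hits.sdf")` would perform the actual conversion on a machine where babel is installed.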

GUI Options


In the 'Compute Similarities' tab, there are options for selecting a new ligand and for specifying how many CPUs you want to run the similarity calculation on. Clicking the 'Compute Similarity' button will start the similarity calculations. If you check the 'Ignore saved results?' box it will ignore any saved intermediate results files. This could be useful if you change the contents of the original input file while keeping the file name the same.

Depending on how many compounds there are, the similarity calculations may take between 1 and 10 minutes. If you launched PyMOL from the command line, you will be able to see the progress printing out in the console. The similarity results are saved to a file so if you want to re-cluster the same input file, you do not need to wait to recompute the similarities.
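The caching behavior described above, including the effect of the 'Ignore saved results?' box, could look roughly like the following. This is a minimal sketch, not the plugin's code: the function name `load_or_compute`, the `ignore_saved` parameter, and the cache file name are assumptions made for illustration.

```python
import os
import pickle


def load_or_compute(cache_path, compute, ignore_saved=False):
    """Return cached results if present, otherwise compute and save them.

    `compute` is any zero-argument callable that produces the result.
    `ignore_saved=True` mimics checking the 'Ignore saved results?' box.
    """
    if not ignore_saved and os.path.exists(cache_path):
        with open(cache_path, "rb") as handle:
            return pickle.load(handle)
    result = compute()
    with open(cache_path, "wb") as handle:
        pickle.dump(result, handle, protocol=pickle.HIGHEST_PROTOCOL)
    return result
```

Because the cache is looked up by file name only, editing the input file without renaming it would return stale results unless `ignore_saved` is set, which is the situation the check box is meant to handle.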

The first option on the Cluster Compounds tab defines how the clusters will be sorted. The default is to sort by the 'minimizedAffinity' tag, which is inserted into the output sdf file after minimization with 'smina' (an enhanced version of AutoDock Vina, available at: ). You can also sort the clusters by any SD tag that exists in the input file, by the Title (alphabetically), or by the size of the cluster.

The second option is the height at which the hierarchical clustering tree is cut. The units are arbitrary, but a higher cutoff leads to fewer, larger clusters of less similar compounds, and a lower cutoff leads to more, smaller clusters of more similar compounds. Experiment with the cutoff until you get a clustering that you like. The third option is a check box for whether to group clusters with only one compound into one ‘singletons’ cluster. The fourth option enables the show_contacts tool that is described below. There is also a field to enter a PyMOL selection string specifying the atoms to compute hydrogen bonds to. Finally, there is a button to create the clusters and load them into PyMOL.
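To make the sorting and singleton options concrete, here is a small sketch. It assumes each compound is a dict of SD tags, that lower 'minimizedAffinity' values are better (as with Vina-style scores), and that clusters are ordered by their best member; the function name `organize_clusters` and this ordering rule are illustrative assumptions, not taken from the plugin.

```python
def organize_clusters(clusters, sort_key="minimizedAffinity", group_singletons=True):
    """Sort compounds within clusters and clusters by best score (lowest first).

    With group_singletons=True, all one-compound clusters are merged into a
    single cluster that is placed last.
    """
    def score(mol):
        return float(mol[sort_key])

    kept = [sorted(c, key=score) for c in clusters
            if len(c) > 1 or not group_singletons]
    kept.sort(key=lambda c: score(c[0]))
    if group_singletons:
        singles = sorted((c[0] for c in clusters if len(c) == 1), key=score)
        if singles:
            kept.append(singles)
    return kept


# Made-up compounds and scores, for illustration only.
clusters = [
    [{"title": "a", "minimizedAffinity": "-7.1"}, {"title": "b", "minimizedAffinity": "-9.3"}],
    [{"title": "c", "minimizedAffinity": "-8.0"}],
    [{"title": "d", "minimizedAffinity": "-10.2"}],
    [{"title": "e", "minimizedAffinity": "-9.9"}, {"title": "f", "minimizedAffinity": "-6.0"}],
]
for cluster in organize_clusters(clusters):
    print([mol["title"] for mol in cluster])
```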

Keyboard Controls

Once you have finished the similarity calculations and clustering mentioned above, you can navigate the clusters using the keyboard. As gamers will recognize, you can move through clusters using the WASD keys (W for up, S for down, A for left, D for right). The one important caveat is that, due to limitations in PyMOL, the WASD movement needs to be used with the Control (or Alt) key, so Ctrl-W moves up. It seems weird, but you quickly get used to it.
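The navigation logic can be pictured as a pair of indices, one for the current cluster and one for the compound within it. The class below is a sketch under assumptions: the name `ClusterNavigator`, the wrap-around at the ends, and resetting to the first compound when changing clusters are illustrative choices, not documented plugin behavior.

```python
class ClusterNavigator:
    """Track the current cluster and the current compound within it."""

    def __init__(self, clusters):
        self.clusters = clusters
        self.cluster = 0
        self.compound = 0

    def current(self):
        return self.clusters[self.cluster][self.compound]

    def move_cluster(self, step):
        """Ctrl-W would call this with -1, Ctrl-S with +1 (assumed direction)."""
        self.cluster = (self.cluster + step) % len(self.clusters)
        self.compound = 0
        return self.current()

    def move_compound(self, step):
        """Ctrl-A would call this with -1, Ctrl-D with +1."""
        self.compound = (self.compound + step) % len(self.clusters[self.cluster])
        return self.current()


# Inside PyMOL, callbacks like these could be bound with cmd.set_key, e.g.
# cmd.set_key('CTRL-W', nav.move_cluster, (-1,)); shown as a comment so the
# sketch runs without PyMOL.
nav = ClusterNavigator([["a", "b"], ["c"], ["d", "e", "f"]])
print(nav.move_cluster(1), nav.move_cluster(1), nav.move_compound(-1))
```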

Navigation Controls

Ctrl-W – Move up a cluster

Ctrl-S – Move down a cluster

Ctrl-A – Move to the previous compound in a cluster

Ctrl-D – Move to the next compound in the cluster

Ctrl-F – Check for vendors

If you acquired your compounds from ZINCPharmer and/or your compounds have titles that start with a ZINC ID or a MolPort ID, you can press Ctrl-F to see if there are any vendors available.
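Recognizing which titles qualify for a vendor lookup might be done with a pair of regular expressions. The patterns below assume that ZINC IDs look like `ZINC` followed by digits and that MolPort IDs look like `MolPort-` followed by three groups of three digits; both patterns and the function name `vendor_id` are assumptions for illustration, and the actual online lookup is not shown.

```python
import re

# Assumed ID formats; adjust if your titles differ.
ID_PATTERNS = (
    ("ZINC", re.compile(r"^ZINC\d+")),
    ("MolPort", re.compile(r"^MolPort-\d{3}-\d{3}-\d{3}")),
)


def vendor_id(title):
    """Return (source, id) if the title starts with a known ID, else None."""
    cleaned = title.strip()
    for source, pattern in ID_PATTERNS:
        match = pattern.match(cleaned)
        if match:
            return source, match.group(0)
    return None


print(vendor_id("ZINC00000123_conf2"))
print(vendor_id("ligand_17"))
```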

Compound selection

In addition to moving through the clusters, you can also extract compounds that you like for later viewing using the following controls. Pressing F3 appends the currently visible compound to a new object with the suffix '_selected'.

F1 – Print title of currently selected molecule

F2 – Remove most recently added compound

F3 – Add currently visible compound to list (Most commonly used)

F4, F12 – Print List


show_contacts is an expanded version of list_hbonds[2] that shows both favorable and unfavorable contacts between ligands and a protein receptor. show_contacts has been integrated into cluster_mols as a function and is executed automatically when clustering. It can also be run by itself, not in the context of cluster_mols. In the standalone case, the usage is as follows:

show_contacts(selection, selection2, result="contacts", cutoff=3.6, bigcutoff=4.0)

The arguments are as follows:

  1. selection -- pymol selection string for the protein
  2. selection2 -- pymol selection string for the ligands
  3. result -- prefix of the object that the distances should be shown in. (Default "contacts")
  4. cutoff -- Distance cutoff for what is considered an ideal hydrogen bond. (Default 3.6)
  5. bigcutoff -- Distance cutoff for a non-ideal hydrogen bond. (Default 4.0)

Output: The output of show_contacts is a set of PyMOL distance objects. They are color-coded and size-coded to indicate different interactions between the ligand and protein. Each type is controlled by the parameter listed at the end of its entry.

  1. thin-purple lines -- All possible polar contacts (acc-acc, don-don, acc-don) -- bigcutoff
  2. thick-yellow lines -- All ideal hydrogen bonds -- cutoff
  3. thin-yellow lines -- Non-ideal hydrogen bonds -- bigcutoff
  4. thick-red lines -- Polar clashes, i.e. donor-donor, acceptor-acceptor -- cutoff
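The color scheme above can be read as a decision rule on the two atom roles and their distance. The sketch below is a deliberate simplification and not how show_contacts itself works: it uses distance alone (no angle criteria), treats a donor-acceptor pair between cutoff and bigcutoff as "non-ideal", considers only atoms labeled strictly "donor" or "acceptor", and returns only the most specific line style for a pair. The function name `classify_contact` is hypothetical.

```python
def classify_contact(role_a, role_b, distance, cutoff=3.6, bigcutoff=4.0):
    """Return the most specific line style for one polar atom pair, or None.

    role_a and role_b are "donor" or "acceptor"; distance uses the same
    units as cutoff and bigcutoff.
    """
    if distance > bigcutoff:
        return None  # too far apart to be drawn at all
    if {role_a, role_b} == {"donor", "acceptor"}:
        # Complementary pair: hydrogen bond, ideal only within the tight cutoff.
        return "thick-yellow" if distance <= cutoff else "thin-yellow"
    if distance <= cutoff:
        return "thick-red"  # donor-donor or acceptor-acceptor clash
    return "thin-purple"  # polar contact that is neither H-bond nor clash


print(classify_contact("donor", "acceptor", 2.9))
print(classify_contact("donor", "donor", 3.0))
```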


A client-server architecture has been implemented as well to allow for offloading the calculations to a remote server. The server component runs on the remote machine, listens for input from clients, and returns the results. Use of the remote server can be enabled or disabled with the RUN_REMOTELY boolean near the top of the script. Setting RUN_REMOTELY = True attempts to offload the work to the remote server, and setting it to False runs it locally.
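The effect of the RUN_REMOTELY switch can be sketched as a simple dispatch. Only the flag name comes from the script; the function `compute_similarities`, its `local_fn` and `remote_fn` parameters, and the fallback to local execution when the connection fails are assumptions made for illustration.

```python
RUN_REMOTELY = False  # the flag near the top of the script


def compute_similarities(fingerprints, local_fn, remote_fn, run_remotely=RUN_REMOTELY):
    """Send the work to the remote server if requested, otherwise run locally.

    Falling back to local execution on a connection error is an assumption,
    one possible reading of "attempts to offload".
    """
    if run_remotely:
        try:
            return remote_fn(fingerprints)
        except OSError:
            return local_fn(fingerprints)
    return local_fn(fingerprints)
```

Passing the two workers in as callables keeps the sketch independent of any particular network protocol.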

Citing ClusterMols

If you use ClusterMols in your work, please cite the following.

Baumgartner, Matthew (2016) IMPROVING RATIONAL DRUG DESIGN BY INCORPORATING NOVEL BIOPHYSICAL INSIGHT. Doctoral Dissertation, University of Pittsburgh.


The main script was conceived by Matthew P Baumgartner (mpb21 [at]) and Dr. David Koes while working in the lab of Dr. Carlos Camacho at the University of Pittsburgh. The script was implemented (and later rewritten) by MPB. The show_contacts functionality and the first version of the keyboard controls were written by DK.

Please send questions/comments/bug reports to matthew.p.baumgartner [at]