FilterByMol

Overview

This script scans all the PDB files in the parent directory (you can easily change the directory it scans). For each file, it saves just the ligands/heteroatoms (excluding waters), writing one output file per molecule. This gives you a simple way to filter through a database of proteins looking only at their ligands.

As noted in the code comments below, this script works on objects at the level of a molecule. While PyMOL can iterate over atom number (ID), residue number (resi), etc., there is no equivalent "MOLID" property, so this script provides a simple workaround. You might need it when a single residue (like #111 from 3BEP) consists of a molecule and a separate atom: there is then no other way to save the separate pieces into two (or more) files. As the following listing shows, iterating over the heteroatoms (excluding waters) in 3BEP gives:

PyMOL>iterate bymol het, print resi, resn, ID, chain, segi, alt
111 5CY 6473 C  
111 5CY 6474 C  
111 5CY 6476 C  
111 5CY 6477 C  
111 5CY 6478 C  
111 5CY 6479 C  
111 5CY 6480 C  
111 5CY 6481 C  
111 5CY 6482 C  
111 5CY 6483 C  
111 5CY 6484 C  
111 5CY 6485 C  
111 5CY 6486 C  
111 5CY 6487 C  
111 5CY 6488 C  
111 5CY 6489 C  
111 5CY 6490 C

Every atom reports the same residue number and residue name, so this output does not allow us to separate the two pieces.
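To make "the level of a molecule" concrete, here is a minimal pure-Python sketch of the idea, assuming a molecule is simply a set of atoms connected by bonds. The function name `split_into_molecules` and the toy atom IDs and bonds are hypothetical illustrations, not data from 3BEP and not PyMOL's internal implementation.

```python
from collections import defaultdict


def split_into_molecules(atom_ids, bonds):
    """Group atom IDs into bonded components (one group per molecule).

    Hypothetical helper: uses a small union-find over the bond list.
    """
    parent = {a: a for a in atom_ids}

    def find(a):
        # Follow parent links to the representative, compressing the path.
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    for a, b in bonds:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    groups = defaultdict(list)
    for a in atom_ids:
        groups[find(a)].append(a)
    return sorted(sorted(g) for g in groups.values())


# Toy data (assumed): five atoms sharing one residue number.
# Atoms 1-4 are bonded in a chain; atom 5 is a lone, unbonded atom.
atoms = [1, 2, 3, 4, 5]
bonds = [(1, 2), (2, 3), (3, 4)]
molecules = split_into_molecules(atoms, bonds)
print(molecules)  # [[1, 2, 3, 4], [5]]
```

Grouping by residue number would lump all five atoms together, while grouping by bonded connectivity yields two pieces. This is the distinction the script relies on when it selects with "bymolecule".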

The Code

python

#
# This simple script will filter through all PDBs in a directory, and for each one
# save all the ligands/heteroatoms (that aren't waters) to their own file.  This
# script operates at the level of molecules, not residues, atoms, etc.  Thus, if
# you have a ligand that PyMOL is treating as ONE residue, but is actually two
# separate molecules, or a molecule and an atom, then you will get multiple files.
#

from glob import glob
from os import path
from pymol import cmd

# collect every PDB file in the parent directory; edit the pattern to scan elsewhere
theFiles = glob("../*.pdb")

for f in theFiles:
    # load the file
    cmd.load(f)
    # remove the protein and waters
    cmd.remove("polymer or resn HOH")

    cmd.select("input", "all")
    cmd.select("processed", "none")
    mol_cnt = 0

    while cmd.count_atoms("input"):
        # filter through the selections, updating the lists
        cmd.select("current","bymolecule first input")
        cmd.select("processed","processed or current")
        cmd.select("input","input and not current")

        # prepare the output parameters
        curOut = path.basename(f).split(".")[0] + "_" + str(mol_cnt).zfill(5) + "_het.pdb"
        curSel = "current"
        
        # save the file
        cmd.save(curOut, curSel)
        print("Saved " + curSel + " to " + curOut)

        mol_cnt = mol_cnt + 1

    # delete all objects and selections before loading the next PDB file
    cmd.delete("*")

python end
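The directory scan and the output naming scheme can be tried outside PyMOL. The following is a small sketch that mirrors the expressions used in the script above; the helper names `find_pdbs` and `het_output_name` are hypothetical and do not appear in the original script.

```python
from glob import glob
from os import path


def find_pdbs(directory=".."):
    """Return the PDB files in `directory` (sorted here for a stable order)."""
    return sorted(glob(path.join(directory, "*.pdb")))


def het_output_name(pdb_path, mol_index):
    """Build the output name the same way the script does:
    <text before the first dot>_<zero-padded index>_het.pdb
    """
    stem = path.basename(pdb_path).split(".")[0]
    return stem + "_" + str(mol_index).zfill(5) + "_het.pdb"


print(het_output_name("../3bep.pdb", 0))   # 3bep_00000_het.pdb
print(het_output_name("../3bep.pdb", 12))  # 3bep_00012_het.pdb
```

Because the stem is taken as everything before the first dot, a file named like "3bep.clean.pdb" would also produce names starting with "3bep_", so two such inputs in the same directory would overwrite each other's output.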