Difference between revisions of "Cealign plugin"

Revision as of 21:27, 12 January 2007

Introduction

This script is a Python implementation of the CE algorithm pioneered by Drs. Shindyalov and Bourne (see References). CE is a fast, accurate, structure-based protein alignment algorithm. There are a few changes from the original code (see Notes), and "fast" depends on your machine and the implementation. For example, on my machine (a relatively fast 64-bit machine) the C++ implementation aligns two 400+ amino acid structures in about 0.300 s. In Python, however, two 165 amino acid proteins took about 35 seconds!

When coupled to the Kabsch algorithm, this should be able to align any two protein structures, using just the alpha carbon coordinates.
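To make the superposition step concrete, here is a minimal sketch of an SVD-based Kabsch fit on paired alpha-carbon coordinates. It is not the plugin's qkabsch.py; the function name kabsch_superpose and its return values are assumptions chosen for illustration, and it presumes the residue pairing has already been decided (for example by CE).

```python
import numpy as np

def kabsch_superpose(P, Q):
    """Superpose P onto Q (both (N, 3) arrays of paired CA coordinates).

    Hypothetical helper for illustration; returns (fitted copy of P, RMSD).
    """
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    if P.shape != Q.shape or P.ndim != 2 or P.shape[1] != 3:
        raise ValueError("P and Q must both have shape (N, 3)")

    # Center both coordinate sets on their centroids.
    p_center = P.mean(axis=0)
    q_center = Q.mean(axis=0)
    Pc = P - p_center
    Qc = Q - q_center

    # 3x3 covariance matrix and its singular value decomposition.
    H = np.dot(Pc.T, Qc)
    U, S, Vt = np.linalg.svd(H)

    # Guard against an improper rotation (reflection).
    d = 1.0 if np.linalg.det(np.dot(Vt.T, U.T)) >= 0 else -1.0
    D = np.diag([1.0, 1.0, d])

    # Rotation R such that R applied to a centered P point approximates Q.
    R = np.dot(Vt.T, np.dot(D, U.T))

    # Rotate the centered P (row vectors) and move it onto Q's centroid.
    P_fit = np.dot(Pc, R.T) + q_center
    rmsd = np.sqrt(np.mean(np.sum((P_fit - Q) ** 2, axis=1)))
    return P_fit, rmsd
```

The two arrays must have the same length and the same ordering; row i of P is treated as the partner of row i of Q.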

This plugs into PyMol very easily. See The Code and The Examples for installation and usage.

Documentation is forthcoming.

Comparison to PyMol

Why should you use this?

PyMol's structure alignment algorithm is fast and robust. However, its first step is to perform a sequence alignment of the two selections. Thus, proteins in the twilight zone, or those with low sequence identity, may not align well. Because CE is a structure-based alignment, this is not a problem. Look at the following example. The image at LEFT is the result of CE-aligning two proteins (1C0M to 1BCO): 88 aligned residues (alpha carbons only, not all atoms) at 2.78 Angstroms. The image on the RIGHT shows the result of PyMol's align command: an alignment of 221 atoms (not residues) at an RMSD of 15.7 Angstroms. To make the alignment easier to see, cealign (actually the Kabsch code) colors the aligned residues differently.

Cealign's results
PyMol's results

Examples

Usage

cealign 1cll and i. 42-55, 1ggz and c. A
cealign 1kao, 1ctq
cealign 1fao, 1eaz

Results

EASY: 1FAO vs. 1EAZ; 88 residues, 1.16 Ang
EASY: 1CBS vs. 1HMT; 120 residues, 2.07 Ang
MODERATE: 1A15 vs 1B50; 56 residues, 6.67 Ang.
EASY: 1OAN vs. 1S6N; aligned to 2.26 Ang. RMSD.
HARD: 1RLW to 1BYN; 104 residues; 3.94 Ang.
HARD: 1TEN vs. 3HHR; 72 residues, 3.13 Ang.
HARD: 2SIM vs. 1NSB; 280 residues, 5.00 Ang.
HARD: 1CEW vs. 1MOL; 72 residues, 3.63 Ang.

Installation

Requirements

Numpy
Python 2.4+

Directions

uncompress the distribution file cealign-VERSION.tgz
cd cealign-VERSION
sudo python setup.py install
insert "run DIR_TO_CEALIGN/cealign.py" and "run DIR_TO_CEALIGN/qkabsch.py" into your .pymolrc file
load some molecules
run: cealign molecule1, molecule2
enjoy

The Code

In testing stages. Coming very soon.

Updates

2007-01-11

The first version of the C-module code is complete. I fixed handling (multiple) missing residues, the centering problem, and the problem of multiple chains. I'll package and provide the code soon.

2007-01-10

Trying to remedy missing residues. If a user's selections are "protA and i. 10-20" and "prot2 and i. 10-20", and prot2 is missing residue 14, the SVD is undefined/inappropriate because the two coordinate sets no longer pair up one-to-one. I have to weed out residues that don't have partners in the PDB file. Alignments do this implicitly, since the XYZ values they see are only the ones with coordinates. Also, CE only works on individual chains. If someone can find a consistent method to map residues and chains to ints and then back to residues and chains, that might work. Ha!
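The two ideas in this note can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the fix that went into the C module: the helper names paired_coords and encode_residues are hypothetical, selections are assumed to be dictionaries keyed by residue number, and insertion codes are ignored.

```python
def paired_coords(sel_a, sel_b):
    """Keep only residues that have coordinates in BOTH selections.

    sel_a and sel_b are assumed to map residue number -> (x, y, z) of the
    alpha carbon. Returns the shared residue numbers and two coordinate
    lists of equal length, in the same order.
    """
    common = sorted(set(sel_a) & set(sel_b))
    return common, [sel_a[r] for r in common], [sel_b[r] for r in common]


def encode_residues(keys):
    """Map (chain, residue number) keys to consecutive ints and back.

    Returns (to_int, to_key): a dict from key to int, and a list where
    to_key[i] recovers the original (chain, residue number) pair.
    """
    to_key = sorted(set(keys))
    to_int = dict((key, i) for i, key in enumerate(to_key))
    return to_int, to_key
```

With the selections from the note, residue 14 drops out of both coordinate lists, so the lists passed on to the superposition step have the same length again.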

If more than a week lapses after this comment, I'll just wrap up the code and post the first version. There seems to be some interest in this plugin, so the more eyes the easier it may be to fix the bugs. I will also need testers for the Mac and Windows editions.

2007-01-08

Yeah! The C code that plugs into PyMol has been completed. It's a little slower than the plain C++ code I wrote, but that's what you get when passing data from PyMol to Python to C, fiddling with it, and passing it back through Python to PyMol for some more quick math. The alignment time for the two proteins mentioned above (1B50 and 1C0M) on my machine with the new C module is about 1-3 seconds, measured with the CPU fully loaded by other intensive tasks running in the background; this is a great improvement over the pure Python alignment times. Once the code is cleaned up (and I'm not too embarrassed to post it) and some bugs are worked out, I'll post it. The current bugs are:

Some alignments don't center right
Missing residues cause problems
Memory leaks galore, I'm sure

The code consists of:

qkabsch.py
cealign.py
ccealignmodule.c
ccealignmodule.h
setup.py

Also, I provide the option of aligning based solely upon RMSD or upon the better CE-Score. See the References for information on the CE Score.

References

Text taken from PubMed and formatted for the wiki. The first reference is the most important for this code.

Shindyalov IN, Bourne PE. Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng. 1998 Sep;11(9):739-47. PMID: 9796821 [PubMed - indexed for MEDLINE]
Jia Y, Dewey TG, Shindyalov IN, Bourne PE. A new scoring function and associated statistical significance for structure alignment by CE. J Comput Biol. 2004;11(5):787-99. PMID: 15700402 [PubMed - indexed for MEDLINE]
Pekurovsky D, Shindyalov IN, Bourne PE. A case study of high-throughput biological data processing on parallel platforms. Bioinformatics. 2004 Aug 12;20(12):1940-7. Epub 2004 Mar 25. PMID: 15044237 [PubMed - indexed for MEDLINE]
Shindyalov IN, Bourne PE. An alternative view of protein fold space. Proteins. 2000 Feb 15;38(3):247-60. PMID: 10713986 [PubMed - indexed for MEDLINE]

@@ Line 1: / Line 1: @@
-== Updates ==
-===2007-01-11===
-The first version of the C-module code is complete.  I fixed handling (multiple) missing residues, the centering problem, and the problem of multiple chains.  I'll package and provide the code soon.
-===2007-01-10===
-Trying to remedy missing residues.  If a user's selections are '''protA and i. 10-20''' and '''prot2 and i. 10-20''', and if prot2 is missing residue 14, the SVD is undefined/inappropriate.  I have to weed out residues that don't have partners in the PDB file.  Alignments do this implicitly since the XYZ values it sees are only the ones with coordinates.  Also, CE only works on individual chains.  If someone can find a consistent method to map residues and chains to ints and then back to residues and chains -- that might work.  Ha!
-If more than a week lapses after this comment, I'll just wrap up the code and post the first version.  There seems to be some interest in this plugin, so the more eyes the easier it may be to fix the bugs.  I will also need testers for the Mac and Windows editions.
-=== 2007-01-08===
-'''Yeah!'''
-The C code that plugs into PyMol has been completed.  It's a little slower than the plain C++ code I wrote, but that's what you get when passing data from PyMol to Python to C, fiddle with it,  pass it back to Python to PyMol for some more quick math.  The alignment times for the two proteins mentioned below (1B50 and 1C0M) on my machine with the new C module is about 1-3 second (with a full CPU load for other intensive tasks running in the background; this shows great improvement over the pure Python alignment times).  Once the code is cleaned up (and I'm not too embarrassed to post it) and some bugs are worked out, I'll post it. The current bugs are:
-# Some alignments don't center right
-# Missing residues cause problems
-# Memory leaks galore, I'm sure
-The code consists of:
-* qkabsch.py
-* cealign.py
-* ccealignmodule.c
-* ccealignmodule.h
-* setup.py
-Also, I provide the option of aligning based solely upon RMSD or upon the better CE-Score.  See the '''References''' for information on the '''CE Score'''.
 == Introduction ==
 This script is a Python implementation of the CE algorithm pioneered by Drs. Shindyalov and Bourne (See References).  It is a fast, accurate structure-based protein alignment algorithm.  There are a few changes from the original code (See Notes), and "fast" depends on your machine and the implementation.  That is, on my machine --- a relatively fast 64-bit machine --- I can align two 400+ amino acid structures in about 0.300 s with the C++ implementation.  In Python however, two 165 amino acid proteins took about 35 seconds!
@@ Line 36: / Line 10: @@
 == Comparison to PyMol ==
+'''Why should you use this?'''
 PyMol's structure alignment algorithm is fast and robust.  However, its first step is to perform a sequence alignment of the two selections.  Thus, proteins in the '''twilight zone''' or those having a low sequence identity, may not align well.  Because CE is a structure-based alignment, this is not a problem.  Look at the following example.  The image at LEFT was the result of CE-aligning two proteins (1C0M to 1BCO).  The result is '''88''' aligned (alpha carbons) residues (not atoms) at '''2.78 Angstroms'''.  The image on the RIGHT shows the results from PyMol's align command: an alignment of '''221 atoms''' (not residues) at an RMSD of '''15.7 Angstroms'''.  To make the alignment easier to see, cealign (actually the [[Kabsch]] code) colors the aligned residues differently.
@@ Line 43: / Line 19: @@
 </gallery>
-== Notes ==
-# The Python implementation is slow.  This is most likely due to the fact that I'm not a very good Python coder.  This is the initial version; if you can improve it, got for it.  That's what open source is all about.
+== Examples ==
-# This implementation requires the [[Kabsch]] algorithm I wrote to do the optimal superposition of the two structures once the residue pairings are determined.
+=== Usage ===
-# This implementation also uses the "CE-score" which is a statistically determined score that performs more reliably than does RMSD.  I also provide the RMSD if you don't like the CE-score.
+<source lang="python">
-# I deviate from the original publication in that I use Kabsch's algorithm to align the two structures; nothing iterative.
+cealign 1cll and i. 42-55, 1ggz and c. A
-# I deviate from Kabsch's algorithm by using the SVD solution, which is fast, accurate and easy to code (in comparison to the original elegant proof).
+cealign 1kao, 1ctq
-# This code is essentially a poor-man's translation of my C++ code.
+cealign 1fao, 1eaz
-# I deliberately left out the final optimization step (wiggling gaps on high scoring alignments) from the original paper.  It is not relevant for my project.  Someone else will have to code that.
+</source>
+=== Results ===
+<gallery>
+Image:Cealign1.png|EASY: 1FAO vs. 1EAZ; 88 residues, 1.16 Ang
+Image:Cealign2.png|EASY: 1CBS vs. 1HMT; 120 residues, 2.07 Ang
+Image:Cealign3.png|MODERATE: 1A15 vs 1B50; 56 residues, 6.67 Ang.
+Image:Align.png|EASY: 1OAN vs. 1S6N; aligned to 2.26 Ang. RMSD.
+Image:Cealign_ex_hard.png|HARD: 1RLW to 1BYN; 104 residues; 3.94 Ang.
+Image:1ten_3hhr.png|HARD: 1TEN vs. 3HHR; 72 residues, 3.13 Ang.
+Image:2SIM_1NSB.png|HARD: 2SIM vs. 1NSB; 280 residues, 5.00 Ang.
+Image:1CEW_1MOL.png|HARD: 1CEW vs. 1MOL; 72 residues, 3.63 Ang.
+</gallery>
 == Installation ==
@@ Line 61: / Line 50: @@
 # cd cealign-VERSION
 # sudo python setup.py install
-# insert "run DIR_TO_CEALIGN/cealign.py" into your '''.pymolrc''' file
+# insert "run DIR_TO_CEALIGN/cealign.py" and "run DIR_TO_CEALIGN/qkabsch.py" into your '''.pymolrc''' file
 # load some molecules
 # run, '''cealign molecule1, molecule2'''
@@ Line 71: / Line 60: @@
 In testing stages.  Coming very soon.
-== Examples ==
-<source lang="python">
-cealign 1cll, 1ggz
-cealign 1kao, 1ctq
-cealign 1fao, 1eaz
-</source>
-<gallery>
+== Updates ==
-Image:Cealign1.png|EASY: 1FAO vs. 1EAZ; 88 residues, 1.16 Ang
+===2007-01-11===
-Image:Cealign2.png|EASY: 1CBS vs. 1HMT; 120 residues, 2.07 Ang
+The first version of the C-module code is complete.  I fixed handling (multiple) missing residues, the centering problem, and the problem of multiple chains.  I'll package and provide the code soon.
-Image:Cealign3.png|MODERATE: 1A15 vs 1B50; 56 residues, 6.67 Ang.
-Image:Align.png|EASY: 1OAN vs. 1S6N; aligned to 2.26 Ang. RMSD.
+===2007-01-10===
-Image:Cealign_ex_hard.png|HARD: 1RLW to 1BYN; 104 residues; 3.94 Ang.
+Trying to remedy missing residues.  If a user's selections are '''protA and i. 10-20''' and '''prot2 and i. 10-20''', and if prot2 is missing residue 14, the SVD is undefined/inappropriate.  I have to weed out residues that don't have partners in the PDB file.  Alignments do this implicitly since the XYZ values it sees are only the ones with coordinates.  Also, CE only works on individual chains.  If someone can find a consistent method to map residues and chains to ints and then back to residues and chains -- that might work.  Ha!
-Image:1ten_3hhr.png|HARD: 1TEN vs. 3HHR; 72 residues, 3.13 Ang.
-Image:2SIM_1NSB.png|HARD: 2SIM vs. 1NSB; 280 residues, 5.00 Ang.
+If more than a week lapses after this comment, I'll just wrap up the code and post the first version.  There seems to be some interest in this plugin, so the more eyes the easier it may be to fix the bugs.  I will also need testers for the Mac and Windows editions.
-Image:1CEW_1MOL.png|HARD: 1CEW vs. 1MOL; 72 residues, 3.63 Ang.
-</gallery>
+=== 2007-01-08===
+'''Yeah!'''
+The C code that plugs into PyMol has been completed.  It's a little slower than the plain C++ code I wrote, but that's what you get when passing data from PyMol to Python to C, fiddle with it,  pass it back to Python to PyMol for some more quick math.  The alignment times for the two proteins mentioned below (1B50 and 1C0M) on my machine with the new C module is about 1-3 second (with a full CPU load for other intensive tasks running in the background; this shows great improvement over the pure Python alignment times).  Once the code is cleaned up (and I'm not too embarrassed to post it) and some bugs are worked out, I'll post it. The current bugs are:
+# Some alignments don't center right
+# Missing residues cause problems
+# Memory leaks galore, I'm sure
+The code consists of:
+* qkabsch.py
+* cealign.py
+* ccealignmodule.c
+* ccealignmodule.h
+* setup.py
+Also, I provide the option of aligning based solely upon RMSD or upon the better CE-Score.  See the '''References''' for information on the '''CE Score'''.
 == References ==