
Paul

The planet LiSA library is now hosted in "Amsterdam"


Introduction

PAUL (Protein Alignment Using Lagrangian relaxation) [3] is a tool for the computation of protein structure alignments based on sparse protein distance matrices. It is based on an approach for the alignment of protein contact maps by Lancia and Caprara [1]. The output of PAUL serves as input for the sequence-based alignment program T-Coffee [2], which constructs a global alignment that contains the conserved structural elements.

Section "Quick Start" gives the basic program call. More advanced program calls are given in section "Sample Commands"; the files they use are in the folder "example". The optional parameter file is described in section "Parameter File". If you still have questions concerning the use and handling of PAUL, please read the frequently asked questions in section "FAQs".


Installation

T-Coffee Installation

In order to output a global alignment, PAUL requires T-Coffee. By default, PAUL uses the T-Coffee version that ships with this release, so no extra T-Coffee installation is required. Alternatively, one can specify the installation path to another T-Coffee version in the optional parameter file (see section "Parameter File" and "Scenario 1" in section "Sample Commands").

DSSP Installation

PAUL uses the external program DSSP for SSE prediction. We recommend using PAUL with DSSP. To get DSSP, go to http://swift.cmbi.kun.nl/gv/dssp/, fill in the license agreement, then download and install it. After installation, move the DSSP binary to the "dssp" folder and name it "dssp" (alternatively, set the "dssp_bin" parameter in the parameter file to your DSSP binary).

Environment variables

PAUL requires the environment variable PAUL_ROOT to be set to the root directory of the PAUL installation. When using bash as your default shell, type

$ export PAUL_ROOT=<PAUL's installation path>

or add this line to the file .bashrc in your home directory to enable it for future sessions.
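
As a quick sanity check before running PAUL, the following minimal Python sketch tests whether PAUL_ROOT is set and points to an existing directory. The function name check_paul_root is hypothetical and not part of PAUL; the sketch only inspects the environment and does not verify the contents of the installation.

```python
import os


def check_paul_root(env=None):
    """Return None if PAUL_ROOT looks usable, else a short problem description.

    `env` defaults to the process environment; passing a dict makes the
    check easy to try out. (Hypothetical helper, not part of PAUL.)
    """
    env = os.environ if env is None else env
    root = env.get("PAUL_ROOT")
    if not root:
        return "PAUL_ROOT is not set"
    if not os.path.isdir(root):
        return f"PAUL_ROOT does not point to a directory: {root}"
    return None


if __name__ == "__main__":
    problem = check_paul_root()
    print(problem if problem else "PAUL_ROOT looks fine")
```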


Quick Start

$ ./paul -i <input_file>

"<input_file>": This file contains the names of the input files, one per line. You can use either full path names or path names relative to the directory in which PAUL is executed. In the "standard mode" the input files are expected to be pdb files (two other input file types are also supported; see section "Sample Commands" for details).

By default, the program writes four output files:

".pw.lib": The T-Coffee input library. This file is used by the T-Coffee program to compute the sequence alignment respecting the structurally conserved residues. Furthermore, it contains the aligned residues; residues are numbered starting from 1.

".pw.aln": Holds the sequence alignment of the input structures computed by T-Coffee.

"run_cm_alig.rsl": This file contains the aligned residues, numbered starting from 1. It holds the same alignment as ".pw.aln", just in a different format.

".pw.results": Holds information about the run and the results of a PAUL run, e.g. (bounds on the) score, run time, gap costs, sequence penalty, etc., see also "FAQ".


Sample Commands

This section gives some rather advanced program calls.

If you type

$ ./paul --help

or

$ ./paul -h

the complete list of options is printed. In the following, we sketch a couple of scenarios to illustrate what these options do. You can find all files used in the following examples in the folder "example" of this PAUL release. The example calls below should be executed in this folder. For more information on the files and more test calls, consult "INFO" in "example".

Scenario 1: How to use the optional parameter file

The optional parameter file lets you change the default settings. Type

$ ../paul -P > my.params

to write the default parameters into the file 'my.params'. Edit them to suit your requirements and provide them to the program with

$ ../paul -p my.params -i pdb/input.txt

See section "Parameter File" for further details on the parameters.

Scenario 2: How to use different alignment modes

PAUL offers three different alignment modes, based on Cα, Cβ, or all-atom inter-residue distances (for all-atom distances, the distance between the two closest atoms of two residues is used). The default is the use of Cβ distances. This is the same as applying the mode cb in the following way:

$ ../paul -i pdb/input.txt -m cb

Cα matrices can be used via the mode ca

$ ../paul -i pdb/input.txt -m ca

and all-atom distance matrices can be used via the mode all-atom

$ ../paul -i pdb/input.txt -m all-atom

Scoring function parameters are different for Cα, Cβ and all-atom distance matrices. For further information refer to the parameter files in "example/params". These are the default parameters for the modes ca, cb and all-atom. Feel free to adjust sensitive parameters like the run time, the number of Lagrange iterations or the distance threshold (a longer run time and a higher number of iterations improve accuracy in many cases, as does a higher distance threshold paired with a longer run time). The default maximum run time is 30 CPU minutes.

Scenario 3: How to provide external sequence sources

Sometimes the residues in a pdb file are not fully identified. If you know the real sequence and do not want to edit the pdb file, you can provide the sequence to the program in a fasta file.

$ ../paul -p my.params -i pdb/input.txt -s fasta/input_fasta.txt -S

In doing so, you have to provide an external sequence source for every instance in the input file. The file must either contain all fasta entries in the same order as they appear in the file 'input.txt', or contain only links to fasta files. The sequence retrieval in the latter case is rather greedy; the program consumes as many fasta entries from the linked fasta files as there are instances in the input file 'input.txt'. The sequence order within these files again has to be the same as the order of instances in 'input.txt'.

(Instead of using option '-S' one can also set option "pdb-sequence-source" in section "pdb" in the parameter file to false.)

Scenario 4: How to provide external SSE sequences via dssp files

By default, PAUL uses SSE information, which is not mandatory, but speeds up computation and slightly improves accuracy. For pdb files, the SSE of each residue is determined via dssp files. These files are computed on the fly using a dssp binary that comes with this release. Dssp files can also be provided externally. This is mandatory if the input files are not pdb files and the option "compute-dssp" in the parameter file is set to true. A file containing the dssp file names in the same order as given in the input file has to be provided via the -Y option.

$ ../paul -i pdb/input.txt -Y dssp/input_dssp.txt

Scenario 5: How to use other input formats (distance matrices)

Instead of pdb files, PAUL can also be provided with distance matrix files. As these files do not contain any sequence information, the sequences have to be provided as external sequence data. The command line looks much the same as in 'Scenario 3' (note that we do not need to invoke option '-S'!):

$ ../paul -p my.params -i dm/input_CA.txt -s fasta/input_fasta.txt -Y dssp/input_dssp.txt

A distance matrix for a protein of length 5 might look like this:
   0  5.5 13.5  7.8  6.3
 5.5    0 18.3 17.9 20.2
13.5 18.3    0  4.4  5.1
 7.8 17.9  4.4    0  3.7
 6.3 20.2  5.1  3.7    0  

A protein's distance matrix file contains n rows, where n is the length of the protein. Each row contains the distances from one residue to all n residues (including itself), which gives n entries per row. The distance matrix must be symmetric and have a zero diagonal. Note that the distance matrices provided to PAUL may contain Cα, Cβ, or all-atom distances (in the example they are Cα distances).
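
The three requirements stated above (n rows of n entries, symmetry, zero diagonal) can be checked before handing a file to PAUL. The following is a minimal Python sketch under the assumption that entries are separated by whitespace, as in the example; the function names and the tolerance are illustrative and not part of PAUL.

```python
def read_distance_matrix(text):
    """Parse whitespace-separated rows of numbers, skipping blank lines."""
    return [[float(tok) for tok in line.split()]
            for line in text.splitlines() if line.strip()]


def check_distance_matrix(rows, tol=1e-9):
    """Return a list of problems; an empty list means all checks passed."""
    n = len(rows)
    problems = [f"row {i} has {len(row)} entries, expected {n}"
                for i, row in enumerate(rows) if len(row) != n]
    if problems:
        return problems  # shape is wrong, further checks make no sense
    for i in range(n):
        if abs(rows[i][i]) > tol:
            problems.append(f"diagonal entry ({i},{i}) is not zero")
        for j in range(i + 1, n):
            if abs(rows[i][j] - rows[j][i]) > tol:
                problems.append(f"entries ({i},{j}) and ({j},{i}) differ")
    return problems


example = """\
   0  5.5 13.5  7.8  6.3
 5.5    0 18.3 17.9 20.2
13.5 18.3    0  4.4  5.1
 7.8 17.9  4.4    0  3.7
 6.3 20.2  5.1  3.7    0
"""
print(check_distance_matrix(read_distance_matrix(example)))  # prints []
```

Collecting all problems in a list rather than stopping at the first one makes it easier to repair a hand-edited matrix in one pass.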

Scenario 6: How to use other input formats (lists of distances)

In addition to pdb files as the common representation of protein structures, PAUL also accepts files containing lists of distances. As these files do not contain any sequence information, the sequences have to be provided as external sequence data. The command line looks much the same as in 'Scenario 3' (note that we do not need to invoke option '-S'!):

$ ../paul -p my.params -i cm/input_CA.txt -s fasta/input_fasta.txt -Y dssp/input_dssp.txt

A file containing a list of distances looks like this:

 8 # number of residues
 7 # number of distances
0 3 3.5
0 5 4.7
1 3 8.3
2 5 5.5
3 6 3.5
4 7 6.6
5 7 4.3

The first line gives the number of residues and the second line the number of distances. From the third line on, the distances are given: the first entry in a line is the source residue of the distance, the second entry is its target residue, and the third entry is the distance itself. In the example, residues are indexed starting from 0. The comments in the first and second line are not mandatory. To comment a line, precede it with a '#' character. Note that the distances provided to PAUL may be Cα, Cβ, or all-atom distances (in the example they are Cα distances).
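
To make the layout concrete, here is a minimal Python sketch that reads such a file and checks its internal consistency. It is not PAUL's own parser: the function name is hypothetical, and zero-based residue indices and the handling of trailing '#' comments are assumptions inferred from the example above.

```python
def parse_distance_list(text):
    """Parse a list-of-distances file into (n_residues, entries).

    Each entry is a (source, target, distance) tuple. Everything from a
    '#' to the end of a line is treated as a comment (assumption).
    """
    lines = []
    for raw in text.splitlines():
        content = raw.split("#", 1)[0].strip()
        if content:
            lines.append(content)
    if len(lines) < 2:
        raise ValueError("expected a residue count and a distance count")

    n_residues = int(lines[0])
    n_distances = int(lines[1])

    entries = []
    for line in lines[2:]:
        src, dst, dist = line.split()
        src, dst = int(src), int(dst)
        # Zero-based indices, as suggested by the example (assumption).
        if not (0 <= src < n_residues and 0 <= dst < n_residues):
            raise ValueError(f"residue index out of range: {line!r}")
        entries.append((src, dst, float(dist)))

    if len(entries) != n_distances:
        raise ValueError(
            f"header announces {n_distances} distances, found {len(entries)}")
    return n_residues, entries


example = """\
 8 # number of residues
 7 # number of distances
0 3 3.5
0 5 4.7
1 3 8.3
2 5 5.5
3 6 3.5
4 7 6.6
5 7 4.3
"""
n, entries = parse_distance_list(example)
print(n, len(entries))  # prints "8 7"
```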


Parameter File

To inspect the default parameters of PAUL, write them into a file by typing

$ ../paul -P > my.params

or

$ ../paul --print-parameters > my.params

We can edit them to suit our requirements and apply them with the '-p' option, e.g.,

$ ../paul -p my.params -i pdb/input.txt

The parameter file is subdivided into several sections; each one defines options for a certain feature of the program. The most important sections are described briefly below.

save: Provides a bunch of options to control the output of the program. Enable the postscript options to get some nice pictures of the alignment and the aligned distances.

solver: This section controls the accuracy of the program and its running time. To increase the precision of the computation, increase the number of iterations using the option 'noofiterations'.

pdb: This section gives the default instructions for the handling of pdb input files. By default, the protein sequence for pdb instances is deduced from the pdb file itself. One can change this behavior by providing PAUL with an additional fasta file (see Scenario 3). For the alignment of Cα distance matrices contact-matrix-type is set to A, for the alignment of Cβ distance matrices to B, and for all-atom distance matrices to Z.

lcmoa: This section holds all options to the actual pairwise structure alignment routine. Not to be confused with section alignment!

alignment: Everything connected to general alignment strategies is adjustable via this section.

Remark: The computation time depends strongly on the sizes of the input structures. For pairs of short proteins we suggest decreasing the run time using utime_limit_pairwise in section alignment. You can also set the parameter utime_limit_pairwise to an arbitrarily high value in order to make sure that the maximum number of iterations is reached.


References

[1] Caprara A, Carr R, Istrail S, Lancia G, Walenz B. 1001 optimal PDB structure alignments: integer programming methods for finding the maximum contact map overlap. J Comput Biol. 2004; 11(1): 27-52.

[2] Notredame C, Higgins DG, Heringa J. T-Coffee: A novel method for fast and accurate multiple sequence alignment. J Mol Biol. 2000; 302(1): 205-17.


FAQs

Do pdb files have to have the *.pdb file ending?

No. PAUL tries to determine the type of each input structure file by the file format.

What is the meaning of the items in the result-section in file *.pw.results?

dual_value: Value of the dual problem that yields the highest value of the original (primal) problem, i.e., max. score of the alignment between the affected distance matrices.

primal_value: Highest value of the original problem, i.e., structure score of aligned distances minus gap penalty for aligning two residues and minus sequence penalty.

overlap: Total number of aligned distances of two distance matrices in the highest scoring solution of the primal problem.

max_overlap: Highest number of aligned distances reached by any feasible solution to the primal problem. (Note: 'max_overlap' could be much greater than 'overlap' as we do not optimize the overlap count but the value of the primal solution (->'primal_value')).

sequence_score: the overall penalty for aligned residues.

gap_costs: the gap costs.

iterations: Number of iterations computed. If the problem is solved to optimality, meaning dual_value equals primal_value, this value can be smaller than the maximum number of iterations that was set. In addition, if any other limit on the optimization process is exceeded, the iteration stops, which also results in a smaller number of iterations. Most of these events are logged to the standard error stream.

computation time: the time needed to compute the alignment.

solution_status: either SOLVED or UNSOLVED. If dual_value = primal_value then solution_status is SOLVED and the alignment is provably optimal.

number_duals: the number of y-variables (=pairs of distances) in the integer linear program.
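
The relation between dual_value, primal_value and solution_status described above can be illustrated with a small Python sketch. It is not PAUL's code: the function name and the numerical tolerance are assumptions, since the exact comparison PAUL performs is not documented here. The returned gap (dual minus primal) indicates how far an UNSOLVED alignment may still be from the bound.

```python
def solution_status(dual_value, primal_value, tol=1e-6):
    """Return (status, gap) for a pair of dual and primal values.

    status is "SOLVED" when the two values coincide up to `tol`
    (hypothetical tolerance), otherwise "UNSOLVED".
    """
    gap = dual_value - primal_value
    status = "SOLVED" if abs(gap) <= tol else "UNSOLVED"
    return status, gap


print(solution_status(120.0, 120.0))  # prints ('SOLVED', 0.0)
print(solution_status(125.5, 120.0))  # prints ('UNSOLVED', 5.5)
```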

Why does Ctrl-C not work to kill PAUL?

Ctrl-C is reserved for interrupting the current iteration. To kill PAUL, use the command line tool kill (on Unix).


Download

Our development platform is 32 and/or 64 bit Linux. We do not support other platforms. If you are interested in compiling the code on your own to run it on other platforms please contact the authors. paul 2.0 is the newest, pairwise version of paul which is based on aligning distance matrices. The documentation available at this website refers only to this new version. paul 0.9 is a different, older program version that computes multiple alignment of protein contact maps. The documentation of paul 0.9 is available with the program only.

version      date           link       comment
2.0 32 bit   7 April 2010   paul 2.0   New & improved version
2.0 64 bit   7 April 2010   paul 2.0   New & improved version
1.0 32 bit   28 May 2009    paul 1.0   Distance matrix based, used for GCB submission
1.0 64 bit   28 May 2009    paul 1.0   Distance matrix based, used for GCB submission
0.9          04 Jun 2007    paul 0.9   Multiple Contact Map alignment

data              date           link              comment
PAUL alignments   19 July 2010   Paul alignments   The PAUL alignments and corresponding alignment accuracies evaluated and reported in the Bioinformatics paper


Topic revision: r20 - 28 Apr 2015, UnknownUser
 