CAPE: Convergent and Parallel Evolution at the Amino Acid Sequence Level

(c) Copyright July 1997 by Jianzhi Zhang and the Pennsylvania State University. Permission is granted to copy this document provided that no fee is charged for it and that this copyright notice is not removed.

CAPE is distributed free of charge by

    Jianzhi Zhang
    Institute of Molecular Evolutionary Genetics and Department of Biology
    322 Mueller Laboratory
    The Pennsylvania State University
    University Park, PA 16802, USA
    Telephone: 814-8657030
    Fax: 814-8637336
    Email: jxz128@email.psu.edu

Suggested citation:

    Zhang J, Kumar S (1997) Detection of convergent and parallel evolution at the amino acid sequence level. Mol. Biol. Evol. 14: 527-536.

1. Introduction

CAPE is designed to test for convergent and parallel evolution at the amino acid sequence level. It computes the probabilities that the observed convergent and parallel substitutions are attributable to random chance. This package contains one program, converg2.exe, which is written in C. The program can be used on IBM PC compatible computers running Windows 95.

2. Installation

First make sure that the diskette you have received contains the following files.

    converg2.c    (source code)
    converg2.exe  (executable file)
    jtt.pro       (matrix for JTT substitution model)
    poisson.pro   (matrix for Poisson substitution model)
    lysozyme.aa   (an example data file, see Stewart et al. 1987)
    manual        (this file)

To install CAPE on your computer's hard disk drive (the "C" drive is used here as an example), create a directory to hold the files of this package. To do this, type the following

    c:\md cape (Enter)

To copy the CAPE files onto your hard disk drive, insert the floppy disk containing the programs into your floppy drive (the "A" drive is used here as an example). Then enter the following command

    c:\copy a:*.* c:\cape\*.* (Enter)

3. Input file

To use the program, you need an input file containing the amino acid sequences and the tree topology of these sequences (see lysozyme.aa for an example). The file begins with two numbers: the number of sequences and the number of amino acid sites (sequence length). The second line is the name of the first sequence, the third line is the first sequence itself, and so on. Only the one-letter codes (capitalized) for the 20 amino acids are allowed in the sequences. The sequences must already be aligned, with gaps and any other symbols removed. The last line of the file is the tree topology of the sequences. The tree format is the same as that used in the PHYLIP package (Felsenstein 1995). Note that the tree is unrooted, so a trifurcation rather than a bifurcation is required at the deepest branching node. For example, the topology of the following tree can be expressed as (((1,3),2),6,((4,7),(5,8)))

                        11 |------ 1
            10 |-----------|
   |-----------|           |------ 3
   |           |------------------ 2
 9 |
   |------------------------------ 6
   |
   |                    13 |------ 4
   |        12 |-----------|
   |-----------|           |------ 7
               |           |------ 5
               |-----------|
                        14 |------ 8

Note that in the topology expression, the numbers refer to the order of the sequences given in the input file. Also note that the topology expression contains only numbers, parentheses, and commas, without any spaces.

The tree of the lysozyme sequences in the example data file is (((1,2),3),4,(5,6)) (Stewart et al. 1987)

                         9 |------ 1 langur
             8 |-----------|
   |-----------|           |------ 2 baboon
   |           |------------------ 3 human
 7 |
   |------------------------------ 4 rat
   |
   |                       |------ 5 cow
   |-----------------------|
                        10 |------ 6 horse

You should also know the notation system for interior (ancestral) nodes, because you will be asked to input the branches (each given by its two end nodes) on which convergent and parallel evolution is to be examined.
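
The input file layout described in this section can be checked before running the program. The following Python sketch is only an illustration and is not part of CAPE: the function name read_cape_input and the toy sequences are hypothetical, and it assumes that each name and each sequence occupies exactly one line and that blank lines can be ignored. It does not claim to reproduce the checks that converg2 itself performs.

```python
import re

AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 capitalized one-letter codes


def read_cape_input(text):
    """Parse CAPE-style input text into (names, sequences, topology).

    Assumes one line per name and one line per sequence (hypothetical
    helper, not part of CAPE). Raises ValueError when a rule stated in
    the manual is violated.
    """
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        raise ValueError("empty input")

    # First line: number of sequences and number of amino acid sites.
    header = lines[0].split()
    if len(header) != 2 or not all(tok.isdigit() for tok in header):
        raise ValueError("first line must hold two numbers")
    n_seqs, n_sites = int(header[0]), int(header[1])

    # Then N (name, sequence) pairs, and the topology on the last line.
    if len(lines) != 2 * n_seqs + 2:
        raise ValueError(f"expected {2 * n_seqs + 2} lines, found {len(lines)}")
    names = lines[1:-1:2]
    sequences = lines[2:-1:2]

    for name, seq in zip(names, sequences):
        if len(seq) != n_sites:
            raise ValueError(f"{name}: length {len(seq)}, expected {n_sites}")
        bad = set(seq) - AMINO_ACIDS
        if bad:
            raise ValueError(f"{name}: symbols not allowed: {sorted(bad)}")

    # Topology: only digits, commas and parentheses, and no spaces.
    topology = lines[-1]
    if not re.fullmatch(r"[0-9(),]+", topology):
        raise ValueError("topology may contain only digits, commas, parentheses")
    labels = sorted(int(tok) for tok in re.findall(r"[0-9]+", topology))
    if labels != list(range(1, n_seqs + 1)):
        raise ValueError("topology must use each sequence number 1..N once")
    return names, sequences, topology


# Toy data invented for illustration (not the lysozyme example file).
toy = "4 5\nseqA\nACDEF\nseqB\nACDEY\nseqC\nACDQF\nseqD\nGCDQF\n((1,2),3,4)\n"
names, sequences, topology = read_cape_input(toy)
print(names, topology)
```

A file that passes these checks satisfies the rules stated above; whether converg2 accepts it still has to be confirmed by running the program.
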
The deepest node (the trifurcation node specified in your tree topology expression) is denoted by N+1, where N is the number of sequences in the tree. In the lysozyme example above, N=6 and the deepest node links the groups ((1,2),3), 4, and (5,6), so "7" is given to the deepest node, as shown in the tree. The notation of the nodes can be read from the output file (RESULT) of the program ancestor.exe, which you may use to infer the ancestral sequences. In the file RESULT, I describe the branches by their two ends (nodes). Note that the ancestral (interior) nodes are numbered from N+1 to 2N-2.

4. Computation

To compute the probabilities fc (the probability that the observed convergent substitutions are attributable to random chance) and fp (the probability that the observed parallel substitutions are attributable to random chance), you first have to decide which two branches to examine for convergent and parallel evolution, and determine the observed numbers of convergent and parallel substitutions on these branches. These observed numbers can be counted by inferring the ancestral amino acid sequences at the interior nodes of the tree. For this purpose, you may use the program ancestor.exe of the package ANCESTOR, distributed by Jianzhi Zhang. After you have obtained this information, you can compute fc and fp by typing

    c:\cape\converg2 filename

For example, to try the lysozyme.aa data, type

    c:\cape\converg2 lysozyme.aa

You will be asked to choose a substitution model. I have provided the matrices for the Poisson and JTT models. You will also be asked to input the branches to be examined and the observed numbers of convergent and parallel substitutions.

5. Limitations of the program

The program is designed for testing convergent and parallel evolution on two branches. For tests on 3 branches, please contact me.
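
The interior node numbers needed when specifying branches can be previewed with a short script. The Python sketch below is not part of CAPE, and its function names are hypothetical. It assumes that interior nodes are numbered from N+1 in the order in which their opening parentheses appear in the topology expression. The manual does not state this rule; it is inferred from the two example trees drawn in the input file section, so the numbering should be confirmed against the RESULT file of ancestor.exe.

```python
import re


def number_interior_nodes(topology):
    """Map each interior node number to the list of its child nodes.

    Assumption (inferred from the manual's two example trees, not stated
    in it): interior nodes are numbered N+1, N+2, ... in the order their
    opening parenthesis appears in the topology expression.
    """
    n_tips = len(re.findall(r"[0-9]+", topology))
    children = {}
    stack = []              # interior nodes whose ")" has not been read yet
    next_id = n_tips + 1    # the deepest node is N+1
    for token in re.findall(r"[0-9]+|[()]", topology):
        if token == "(":
            node = next_id
            next_id += 1
            children[node] = []
            if stack:
                children[stack[-1]].append(node)
            stack.append(node)
        elif token == ")":
            stack.pop()
        else:
            children[stack[-1]].append(int(token))
    return children


def list_branches(topology):
    """Return every branch as a (node, node) pair of its two ends."""
    return [(parent, child)
            for parent, kids in number_interior_nodes(topology).items()
            for child in kids]


# Topology of the lysozyme example tree given in the manual.
print(number_interior_nodes("(((1,2),3),4,(5,6))"))
# {7: [8, 4, 10], 8: [9, 3], 9: [1, 2], 10: [5, 6]}
print(list_branches("(((1,2),3),4,(5,6))"))
```

Under this assumption the lysozyme tree has interior nodes 7 to 10 (that is, N+1 to 2N-2 with N=6) and 2N-3 = 9 branches, in agreement with the drawing in the input file section.
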