GZ-Gamma: Estimation of the Expected Number of Substitutions at each Amino Acid (Nucleotide) Site and the Parameter for Rate Variation among Sites

(c) Copyright Dec. 1997, 2000 by Jianzhi Zhang, Xun Gu and the Pennsylvania State University. Permission is granted to copy this document provided that no fee is charged for it and that this copyright notice is not removed.

GZ-Gamma is distributed free of charge by

Jianzhi Zhang
Institute of Molecular Evolutionary Genetics and Department of Biology
322 Mueller Laboratory
The Pennsylvania State University
University Park, PA 16802, USA
Telephone: 814-865-7030
Fax: 814-863-7336
Email: jzhang@niaid.nih.gov

and

Xun Gu
Institute of Molecular Evolutionary Genetics and Department of Biology
322 Mueller Laboratory
The Pennsylvania State University
University Park, PA 16802, USA
Telephone: 814-865-1034
Fax: 814-863-7336
Email: xungu@imeg.bio.psu.edu

Suggested citation:
Gu, X., and J. Zhang (1997) A simple method for estimating the parameter of substitution rate variation among sites. Mol. Biol. Evol. 14:1106-1113.

1. Introduction

GZ-gamma estimates the expected number of substitutions at each amino acid (nucleotide) site and the gamma shape parameter for the rate variation among sites. It combines ancestral sequence inference with maximum likelihood estimation, and it requires that the phylogenetic relationships of the homologous sequences be known. This package contains two programs, both written in C: gz-aa.exe for amino acid sequences and gz-DNA.exe for DNA sequences. The programs run on IBM PC compatible computers under the Windows 95 and Windows NT operating systems.

2. Installation

First make sure that the diskette you have received contains the following files.

gz-aa.c      (source code)
gz-DNA.c     (source code)
gz-aa.exe    (executable file)
gz-DNA.exe   (executable file)
jtt.pro      (JTT substitution matrix, for amino acid sequences)
atp6.aa      (an example data file for amino acid sequences)
cox1.dna     (an example data file for DNA sequences)
manual       (this file)
alpha        (output file from running gz-aa.exe)

To install GZ-gamma on your computer's hard disk drive (drive "C" in this example), create a directory to hold the files of this package. To do this, type the following

c:\md GZ-gamma (Enter)

To copy the GZ-gamma files onto your hard disk drive, insert the floppy disk containing the programs into your floppy drive (drive "A" in this example). Then enter the following command

c:\copy a:*.* c:\GZ-gamma\*.* (Enter)

3. Input file

To use the program, you need one input file containing the amino acid (or nucleotide) sequences and the tree topology of these sequences (see atp6.aa for an example). The file begins with two numbers: the number of sequences and the number of amino acid or nucleotide sites (sequence length). The second line is the name of the first sequence, the third line is the first sequence itself, and so on for the remaining sequences. Each sequence must occupy a single line without any interruption. Only the capital letters for the 20 amino acids (or 4 nucleotides) are allowed in the sequences; gaps and any other symbols must be removed beforehand. The last line of the file is the tree topology of the sequences. The tree format is the same as that used in the PHYLIP package (Felsenstein 1995). Note that the tree is unrooted, so a trifurcation rather than a bifurcation is required at the deepest branching node.
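
The file layout described above can be checked before a run. The following Python sketch is only an illustration and is not part of the GZ-gamma package; the function name parse_gz_input and the two alphabet constants are hypothetical, and it assumes that the two leading numbers are separated by whitespace on the first line and that blank lines can be ignored.

```python
# Illustrative checker for the GZ-gamma input layout (not part of the package).
# Assumption: the first line holds "number_of_sequences number_of_sites",
# separated by whitespace, and blank lines carry no meaning.

AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")  # one-letter codes, capitalized
NUCLEOTIDES = set("ACGT")


def parse_gz_input(text, alphabet=AMINO_ACIDS):
    """Return (names, sequences, topology) or raise ValueError."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        raise ValueError("empty input")
    n_seq, n_sites = map(int, lines[0].split())

    # One header line, a name line and a sequence line per sequence,
    # and the topology on the last line.
    if len(lines) != 2 * n_seq + 2:
        raise ValueError(f"expected {2 * n_seq + 2} lines, found {len(lines)}")

    names, seqs = [], []
    for i in range(n_seq):
        name, seq = lines[1 + 2 * i], lines[2 + 2 * i]
        if len(seq) != n_sites:
            raise ValueError(f"{name}: length {len(seq)}, expected {n_sites}")
        bad = set(seq) - alphabet
        if bad:
            raise ValueError(f"{name}: symbols not allowed: {sorted(bad)}")
        names.append(name)
        seqs.append(seq)
    return names, seqs, lines[-1]


if __name__ == "__main__":
    # A made-up three-sequence DNA example, for illustration only.
    demo = "3 4\nhuman\nACGT\nmouse\nACGA\ncow\nACGG\n(1,2,3)\n"
    print(parse_gz_input(demo, NUCLEOTIDES))
```

Rejecting gaps and lowercase letters up front mirrors the requirement that only capital letters for the 20 amino acids (or 4 nucleotides) appear in the sequences.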

For example, the topology of the following tree can be expressed by

(((1,3),2),6,((4,7),(5,8)))

                            |----------- 1
                 |----------| 11
      |----------| 10       |---------------- 3
      |          |
      |          |------------------------ 2
      |
    9 |----------------------------- 6
      |
      |                     |---------- 4
      |            |--------| 13
      |            |        |------ 7
      |------------| 12
                   |        |---- 5
                   |--------| 14
                            |----- 8

Note that in the topology expression, the numbers refer to the order of the sequences given in the input file. The tree of the atp6 and cox1 sequences in the example data files atp6.aa and cox1.dna is

(((1,2),((3,4),(5,6))),(7,8),9)

                |------------------ 1 mouse
          |-----|
          |     |------------------ 2 rat
      |---|
      |   |        |---------------- 3 human
      |   |     |--|
      |   |     |  |------------- 4 gibbon
      |   |-----|
      |         |       |------------ 5 whale
      |         |-------|
      |                 |------------ 6 cow
      |
      |           |---------------- 7 opossum
      |-----------|
      |           |---------------- 8 wallaroo
      |
      |---------------------------------- 9 platypus

3. Computation

Open the MS-DOS prompt in Windows (Windows 95 or Windows NT).

For amino acid sequences, type

c:\GZ-gamma\gz-aa filename

For DNA sequences, type

c:\GZ-gamma\gz-DNA filename

where filename is the name of the data file. For the atp6.aa data, for example, type

c:\GZ-gamma\gz-aa atp6.aa

The detailed procedure for the computation is described in Gu and Zhang (1997). The procedure involves 5 steps.

(1) Estimation of pairwise distances among the sequences. The gamma distance with alpha=2.4 is used for protein data; this corresponds to the JTT model. Kimura's model is used for DNA data.

(2) Estimation of tree branch lengths from the distances using the least-squares method.

(3) Estimation of ancestral sequences using the Bayesian approach (Zhang and Nei 1997). The JTT-f model is used for protein data and Kimura's model is used for DNA data. You will be given three options for the inference of ancestral sequences. These options handle different levels of sequence divergence and different numbers of OTUs.
Option 3 may be slower than options 1 and 2, but it always works, regardless of the amount of data used. Use option 3 when options 1 and 2 fail because of insufficient computer memory.

(4) The expected number of substitutions for each site is estimated by the maximum likelihood approach.

(5) The ML estimate of the gamma shape parameter (alpha) is obtained from the distribution of the expected numbers of substitutions.

4. Output file

The output of gz-aa.exe or gz-DNA.exe is written to the file named "alpha". The estimate of the gamma shape parameter (alpha) is given on the first line. On the following lines, the first column (#) gives the position number of each amino acid (nucleotide) site; the second column (m') gives the minimum number of required substitutions inferred by the conventional parsimony method (Fitch 1971); the third column (m) gives the minimum number of required substitutions inferred by the method of Zhang and Nei (1997); and the fourth column (k) gives the expected number of substitutions estimated by the method of Gu and Zhang (1997), which is used for estimating alpha.

5. Usefulness

The program provides two results: the estimate of the gamma shape parameter (alpha) for the rate variation among sites, and the expected number of substitutions at each amino acid (or nucleotide) site. These results are useful in molecular evolutionary analysis.

(1) Distance estimation

(2) Divergence time dating between genes and species

(3) Phylogenetic reconstruction

The estimate of alpha helps to rule out the possibility that the inferred phylogenetic tree is misleading because rate variation among sites was neglected. The following iteration is suggested. First, estimate alpha with the current program, using the tree reconstructed under the assumption of a uniform rate among sites. Second, recompute the distance matrix, taking into account the gamma distribution for the rate variation among sites, and infer the phylogenetic tree again.

(4) Profile of rate variability among sites

The output file (alpha) can be used as input for most commercially available software (e.g., EXCEL), so that the profile of rate variability among sites can be presented graphically by plotting k against the site position.

(5) Comparison of evolutionary rates between different regions (domains)

6. Further development

This program is the first step in developing a user-friendly computer package for molecular evolutionary analysis at the genome level. We will continue to improve the program by adding more options and new results. For example, a program for estimating a default neighbor-joining tree is under development so that users will not need to input the topology.


GZ-GAMMA: A program package for estimating the parameter of substitution rate variation among sites of an amino acid or nucleotide sequence

(c) Copyright June 1997 by Jianzhi Zhang, Xun Gu, and the Pennsylvania State University. Permission is granted to copy this document provided that no fee is charged for it and that this copyright notice is not removed.

GZ-GAMMA is distributed free of charge by

Jianzhi Zhang
Institute of Molecular Evolutionary Genetics and Department of Biology
322 Mueller Laboratory
The Pennsylvania State University
University Park, PA 16802, USA
Telephone: 814-865-7030
Fax: 814-863-7336
Email: jxz128@psu.edu

Suggested citation:
Gu X, Zhang J (1997) A simple method for estimating the parameter of substitution rate variation among sites. Mol. Biol. Evol. (submitted).

1. Introduction

GZ-GAMMA is designed to estimate the shape parameter (alpha) of the gamma distribution that describes the substitution rate variation among sites, from a set of homologous amino acid or nucleotide sequences whose phylogenetic relationships are known. This package contains two programs, AAgamma.exe and DNAgamma.exe, which are written in C. The programs can be used on IBM PC compatible computers running Windows.

2. Installation

First make sure that the diskette you have received contains the following files.

AAgamma.c      (source code)
DNAgamma.c     (source code)
AAgamma.exe    (executable file)
DNAgamma.exe   (executable file)
jtt.pro        (JTT substitution matrix)
atp6.aa        (an example data file of amino acid sequences)
cox1.dna       (an example data file of nucleotide sequences)
manual         (this file)

To install GZ-GAMMA on your computer's hard disk drive (drive "C" in this example), create a directory to hold the files of this package. To do this, type the following

c:\md GZ-GAMMA (Enter)

To copy the GZ-GAMMA files onto your hard disk drive, insert the floppy disk containing the programs into your floppy drive (drive "A" in this example). Then enter the following command

c:\copy a:*.* c:\GZ-GAMMA\*.* (Enter)

3. Input file

To use the program, you need an input file containing the amino acid or nucleotide sequences and the tree topology of these sequences (see file atp6.aa for an example). The file begins with two numbers: the number of sequences and the number of amino acid or nucleotide sites (sequence length). The second line is the name of the first sequence, the third line is the first sequence itself, and so on for the remaining sequences. Only the one-letter codes (capitalized) for the 20 amino acids or 4 nucleotides are allowed in the sequences. The sequences must be aligned, and gaps and any other symbols must be removed beforehand. The last line of the file is the tree topology of the sequences. The topology format is the same as that used in the PHYLIP package (Felsenstein 1995). Note that the tree is unrooted, so a trifurcation rather than a bifurcation is required at the deepest branching node.
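
The trifurcation requirement above can be tested on a topology string before it is placed in the input file. The Python sketch below is an illustration only and is not part of GZ-GAMMA; the function names parse_topology, leaves and check_topology are hypothetical. It assumes the topology is written with sequence numbers, commas and parentheses only, with no spaces and no trailing semicolon, as in the examples in this manual.

```python
# Illustrative topology checker (not part of the package).
# Assumption: the string uses only digits, commas and parentheses,
# with no spaces and no trailing semicolon.

def parse_topology(s):
    """Parse e.g. '((1,2),3,4)' into nested lists of ints: [[1, 2], 3, 4]."""
    pos = 0

    def node():
        nonlocal pos
        if pos < len(s) and s[pos] == "(":
            pos += 1
            children = [node()]
            while pos < len(s) and s[pos] == ",":
                pos += 1
                children.append(node())
            if pos >= len(s) or s[pos] != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1
            return children
        start = pos
        while pos < len(s) and s[pos].isdigit():
            pos += 1
        if start == pos:
            raise ValueError(f"expected a sequence number at position {pos}")
        return int(s[start:pos])

    tree = node()
    if pos != len(s):
        raise ValueError(f"unexpected character at position {pos}")
    return tree


def leaves(tree):
    """List the sequence numbers at the tips, left to right."""
    if isinstance(tree, int):
        return [tree]
    return [tip for child in tree for tip in leaves(child)]


def check_topology(s, n_seq):
    """Require a trifurcation at the outermost node and tips 1..n_seq once each."""
    tree = parse_topology(s)
    if isinstance(tree, int) or len(tree) != 3:
        raise ValueError("outermost node must join exactly three subtrees")
    if sorted(leaves(tree)) != list(range(1, n_seq + 1)):
        raise ValueError("each sequence number 1..n must appear exactly once")
    return tree


if __name__ == "__main__":
    print(check_topology("(((1,2),((3,4),(5,6))),(7,8),9)", 9))
```

A rooted expression such as ((1,2),(3,4)) is rejected by this check because its outermost node joins only two subtrees.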

For example, the topology of the following tree can be expressed by

(((1,3),2),6,((4,7),(5,8)))

                            |----------- 1
                 |----------| 11
      |----------| 10       |---------------- 3
      |          |
      |          |------------------------ 2
      |
    9 |----------------------------- 6
      |
      |                     |---------- 4
      |            |--------| 13
      |            |        |------ 7
      |------------| 12
                   |        |---- 5
                   |--------| 14
                            |----- 8

Note that in the topology expression, the numbers refer to the order of the sequences given in the input file. Also note that the topology expression contains only numbers, commas, and parentheses, without any spaces. The tree of the atp6 sequences in the example data file atp6.aa is

(((1,2),((3,4),(5,6))),(7,8),9)

                |------------------ 1 mouse
          |-----|
          |     |------------------ 2 rat
      |---|
      |   |        |---------------- 3 human
      |   |     |--|
      |   |     |  |------------- 4 gibbon
      |   |-----|
      |         |       |------------ 5 whale
      |         |-------|
      |                 |------------ 6 cow
      |
      |           |---------------- 7 opossum
      |-----------|
      |           |---------------- 8 wallaroo
      |
      |---------------------------------- 9 platypus

3. Computation

To estimate alpha for amino acid sequence data, type

c:\gz-gamma\AAgamma filename

For example, to try the atp6.aa data, type

c:\gz-gamma\AAgamma atp6.aa

To estimate alpha for nucleotide sequence data, type

c:\gz-gamma\DNAgamma filename

For example, to try the cox1.dna data, type

c:\gz-gamma\DNAgamma cox1.dna

4. Output

The estimated alpha value is displayed at the DOS prompt.