MAP Help

A MULTIPLE ALIGNMENT PROGRAM (MAP): copyright (c) 1992 Xiaoqiu Huang

The distribution of the program is granted provided no charge is made and the copyright notice is included.

E-mail: huang@cs.mtu.edu

Proper attribution of the author as the source of the software would be appreciated:

    Huang, Xiaoqiu
    On Global Sequence Alignment,
    Computer Applications in the Biosciences, 10(3), 227-235, 1994.

    Xiaoqiu Huang
    Department of Computer Science
    Michigan Technological University
    Houghton, MI 49931

The MAP program computes a multiple global alignment of sequences using an iterative pairwise method. The underlying algorithm for aligning two sequences computes a best overlapping alignment between the two sequences without penalizing terminal gaps. In addition, long internal gaps in short sequences are not heavily penalized. MAP is therefore good at producing an alignment where some sequences have long terminal or internal gaps. The program is designed in a space-efficient manner, so long sequences can be aligned.

Users supply scoring parameters. In the simplest form, users provide 3 integers: ms, q and r, where ms is the score of a mismatch and the score of an i-symbol indel is -(q + r * i). Each match automatically receives score 10. In addition, an integer gs is provided so that any gap of length > gs in a short sequence is given a penalty of -(q + r * gs), the linear penalty for a gap of length gs. In other words, long gaps in the short sequence are given a constant penalty. This simple scoring scheme may be used for DNA sequences. NOTE: all scores are integers.

In general, users can define an alphabet of characters to appear in the sequences and a matrix that gives the substitution score for each pair of symbols in the alphabet. The 127 ASCII characters are eligible. The alphabet and matrix are given in a file, where the first line lists the characters in the alphabet and the lower triangle of the matrix comes next.
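The two scoring schemes described above can be sketched in a few lines of Python. This is only an illustration of the stated rules, not code from MAP itself: the function names, the `in_short_sequence` flag, and the assumption that the matrix is symmetric and given as whitespace-separated integers are all choices made for this sketch.

```python
MATCH_SCORE = 10  # stated in the text: each match receives score 10


def pair_score(a, b, ms):
    """Simple scheme: score of aligning symbol a with symbol b."""
    return MATCH_SCORE if a == b else ms


def gap_score(length, q, r, gs, in_short_sequence=True):
    """Score of a gap of `length` symbols.

    The ordinary score is -(q + r * length). For a gap in a short
    sequence, any length above gs is charged as if it were gs.
    `in_short_sequence` is a hypothetical flag for this sketch; the
    text does not say how MAP decides which sequence is the short one.
    """
    if length <= 0:
        return 0
    if in_short_sequence and length > gs:
        length = gs
    return -(q + r * length)


def parse_score_file(text):
    """Read an alphabet line followed by a lower-triangular matrix.

    Returns (alphabet, scores), where scores[(x, y)] is the substitution
    score. Assumes the matrix is symmetric and its entries are
    whitespace-separated integers listed row by row.
    """
    lines = text.splitlines()
    alphabet = lines[0].strip()
    values = [int(tok) for tok in " ".join(lines[1:]).split()]
    n = len(alphabet)
    expected = n * (n + 1) // 2
    if len(values) != expected:
        raise ValueError("expected %d matrix entries, got %d"
                         % (expected, len(values)))
    scores = {}
    k = 0
    for i in range(n):
        for j in range(i + 1):
            # Fill both orientations from the lower-triangle entry.
            scores[(alphabet[i], alphabet[j])] = values[k]
            scores[(alphabet[j], alphabet[i])] = values[k]
            k += 1
    return alphabet, scores
```

With illustrative values q = 4, r = 2 and gs = 5, a 3-symbol gap scores -(4 + 2 * 3) = -10, while a 9-symbol gap in a short sequence is capped at -(4 + 2 * 5) = -14 instead of -22.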
An example scoring-matrix file looks as follows:

    ARNDC
     13
    -15  19
    -10 -22  11
    -20 -10 -20  18
    -10 -20 -10 -20  12

Here the -22 at position (3,2) is the score of replacing N by R. This general scoring scheme is useful for protein sequences, where the set of protein characters and a Dayhoff matrix are specified in the file. Note that the characters in the alphabet must match exactly (including upper or lower case) those appearing in the sequences.

The MAP program is written in C and runs under Unix systems on Sun workstations and under DOS systems on PCs. We think that the program is portable to many machines.

Sequences to be aligned are stored in one file. A sample file of sequences looks like:

    >Human-beta
    VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKV
    KAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGK
    EFTPPVQAAYQKVVAGVANALAHKYH
    >Horse-beta
    VQLSGEEKAAVLALWDKVNEEEVGGEALGRLLVVYPWTQRFFDSFGDLSNPGAVMGNPKV
    KAHGKKVLHSFGEGVHHLDNLKGTFAALSELHCDKLHVDPENFRLLGNVLVVVLARHFGK
    DFTPELQASYQKVVAGVANALAHKYH
    >Human-alpha
    VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGK
    KVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPA
    VHASLDKFLASVSTVLTSKYR
    >Horse-alpha
    VLSAADKTNVKAAWSKVGGHAGEYGAEALERMFLGFPTTKTYFPHFDLSHGSAQVKAHGK
    KVGDALTLAVGHLDDLPGALSNLSDLHAHKLRVDPVNFKLLSHCLLSTLAVHLPNDFTPA
    VHASLDKFLSSVSTVLTSKYR
    >Sea-lamprey
    PIVDTGSVAPLSAAEKTKIRSAWAPVYSDYETSGVDILVKFFTSTPAAEEFFPKFKGLTT
    ADELKKSADVRWHAERIIDAVDDAVASMDDTEKMSSMKDLSGKHAKSFEVDPEYFKVLAA
    VIADTVAAGDAGFEKLLRMICIL
    LRSAY
    >Sperm-whale
    VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASED
    LKKHGVTVLTALGAILKKKGHHEAELKPLAQSHATKHKIPIKYLEFISEAIIHVLHSRHP
    GDFGADAQGAMNKALELFRKDIAAKYKELG
    YQG
    >Yellow-lupin
    GALTESQAALVKSSWEEFNANIPKHTHRFFILVLEIAPAAKDLFSSFLKGGTSEVPQNNPE
    LQAHAGKVFKLVYEAAIQLEVTGVVASDATLKNLGSVHVSKGVVADAHFPVVKEAILKTIK
    EVVGAKWSEELNSAWTIAYDELAIVIKKEMDDAA

The string after ">" is the name of the following sequence.
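For readers who want to inspect or pre-check such a sequence file before running MAP, the layout above can be read with a short Python routine. This is a minimal sketch, not MAP's own reader: the function name `read_sequences` is hypothetical, and it assumes that blank lines and whitespace inside sequence lines carry no meaning.

```python
def read_sequences(text):
    """Return a list of (name, sequence) pairs from a MAP-style file.

    A line starting with ">" gives the name of the sequence that
    follows; all later lines up to the next ">" are concatenated.
    """
    records = []
    name = None
    parts = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # assumption: blank lines are ignored
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(parts)))
            name = line[1:].strip()
            parts = []
        elif name is not None:
            # Assumption: whitespace inside a sequence line is dropped.
            parts.append("".join(line.split()))
    if name is not None:
        records.append((name, "".join(parts)))
    return records
```

Calling `read_sequences` on the contents of the sample file would give seven records, starting with the name `Human-beta`. A quick check of the sequence lengths or of the characters used (against the alphabet line of a scoring file) can then be done on the returned pairs.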
To find the best alignment of the sequences in file A, use a command of the form

    map A gs ms q r > result

where map is the name of the object code, gs is the minimum length of any gap in a short sequence that is charged a constant gap penalty, ms is a negative integer specifying the mismatch weight, and q and r are non-negative integers specifying the gap-open and gap-extend penalties, respectively. Output alignments are saved in the file "result".

To use a scoring matrix defined in file S, use a command of the form

    map A gs S q r > result

Note that ms is replaced by the file S.

Acknowledgments

The function diff() from Gene Myers is modified and used here. The author thanks Chunwei Wang for pointing out the problem with existing multiple alignment software. The author also thanks Dave Gordon and John Hunt for suggesting that the alignment be produced in flat and interleaved formats so that it can be read by some phylogenetic analysis programs.