(1) Human Genome Center, Department of Molecular and Human Biology, and
Department of Cell Biology,
Baylor College of Medicine, Houston, TX 77030, U.S.A.
(2) Department of Genetics, Eotvos University,
Muzeum krt. 4/a, Budapest, H-1088. Hungary
Internet: istvanl@bcm.tmc.edu
Search strategy: We recommend that you first perform a BLAST search (BLASTP, BLASTX, TBLASTN), possibly enhanced by the BEAUTY search utility. If you do not get significant results, try a FASTA search. If the results are still unsatisfactory, run a FASTA-SWAP search (or, for queries longer than 1000 residues, a FASTA-PAT search).
Available pattern databases:
Pattern scoring matrices:
We expect to install, no later than January 20, hypertext links to the multiple alignments, the GenInfo IDs (gi's), locus names, and titles of the sequences included in each multiple sequence alignment/pattern. We apologize for the inconvenience in the meantime.
BACKGROUND
FASTA-SWAP and FASTA-PAT are FASTA-based pattern database search tools that can be used to identify functions of highly variable proteins and/or proteins lacking known close relatives that may be missed by standard database search methods. Both programs compare protein query sequences against several databases of protein sequence patterns. Each pattern is derived from a multiple sequence alignment and expresses the conservation/variation inherent in the underlying set of aligned proteins. An increase in database search sensitivity and selectivity is achieved by assigning higher weights to conserved positions. Compared to standard databases, the lower redundancy of pattern databases reduces the probability of random hits to unrelated proteins (an increasing problem due to rapid database growth).
In each database, each multiple alignment is coded by BINREP, a binary representation that is rapidly converted to standard amino acid letters for displaying search results. Our tools analyze only the presence or absence of amino acids in an aligned position, NOT their actual frequencies. The biological relevance of positional frequencies of amino acids is unclear. If, e.g., a substitution of glycine for alanine occurs in one sequence, chances are low that this position has highly glycine-specific functions. Another difficulty with positional frequencies is their unbiased estimation, which cannot be guaranteed by sequence weighting. For overly degenerate patterns (over 40 sequences), however, frequency-dependent methods (MoST, PROFILE, BLOCKS) may perform better than FASTA-SWAP or FASTA-PAT.
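The presence/absence coding described above can be illustrated with a minimal Python sketch. The actual BINREP bit layout is not given here, so the residue order in `AMINO_ACIDS`, the one-bit-per-residue scheme, and all function names below are assumptions made for illustration only.

```python
# Hypothetical bit layout: one bit per standard amino acid, in this order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
BIT = {aa: 1 << i for i, aa in enumerate(AMINO_ACIDS)}

def encode_column(residues):
    """Return a 20-bit mask recording which residues occur in one aligned column."""
    mask = 0
    for aa in residues:
        if aa in BIT:  # gap characters and non-standard letters are skipped
            mask |= BIT[aa]
    return mask

def decode_mask(mask):
    """Return the letters whose bits are set, in AMINO_ACIDS order."""
    return "".join(aa for aa in AMINO_ACIDS if mask & BIT[aa])

def encode_alignment(rows):
    """Encode every column of an alignment given as equal-length strings."""
    return [encode_column(column) for column in zip(*rows)]

alignment = ["GAVL", "GAIL", "GSVL"]
pattern = encode_alignment(alignment)
print([decode_mask(m) for m in pattern])  # ['G', 'AS', 'IV', 'L']
```

Because only set membership is stored, a column containing A, A, A, G encodes to the same mask as a column containing A, G; the frequencies are deliberately discarded.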
Protein families for our PIMA and EntrezClus10 pattern databases were constructed as follows. Pattern databases to be searched by either FASTA-SWAP or FASTA-PAT were generated by first clustering the protein sequences in the NCBI's Entrez database (Rel. 14) by the maximal linkage method. In the PIMA database, this generated 12,669 sequence families of 2 or more sequences, encompassing 97,521 total sequences. For the EntrezClus10 databases, only families with 10 or more sequences were used. In the PIMA database, each family was then multiply aligned using PIMA, our Pattern-Induced Multiple-sequence Alignment program (RF Smith and TF Smith, 1992, Protein Engng 5:35). In the EntrezClus10 databases, families were aligned by i) the PIMA; ii) the CLUSTALW; or iii) the MAP program.
The EC Database is generated from sequences of known Enzyme Catalogue numbers. In the first step, sequences were agglomerated by using the 50 percent linkage method with a BLAST probability threshold of 0.1 (with an expectation value of 0.2!) to reduce the number of singletons and patterns with few sequences. In the second step, large clusters with degenerate patterns were split by a more restrictive 75 percent linkage clustering using a BLAST probability threshold of 0.001 with the same expectation value. This latter step eliminated uninformative degenerate families. The multiple alignments were generated by the MAP program.
The multiple alignments obtained by the PIMA program were then scanned for the presence of sequence fragments. If an alignment contained one or more fragments, then additional alignments were created by removing each of the fragments from the original alignment. Each of the alignments was then given a unique cluster identifier; alignments generated by removing fragments (as well as the original alignment) were given a unique extension based on the relative position of the fragments in the original alignment (e.g., 52.74, 52.94, 52.100; the original alignment has the highest numbered extension in each set). This process generated 22,416 multiple sequence alignments, with each alignment contributing a single pattern to the pattern database.
Gaps internal to a pattern are denoted by '='s.
Construction of New Pattern Log-Odds Scoring Matrices: two new log-odds scoring matrices that utilize all of the 1,048,575 possible nonrepetitive combinations of the 20 amino acids have been developed specifically for sequence-to-pattern searches. The large (20 by 1 million) matrices are calculated "on the fly" by rapid bitwise operations. In contrast to standard scoring matrices (PAM, BLOSUM), these new pattern-based scoring matrices distinguish between conserved and variable positions. In the weighted minimum-average method (WMM), a 20 by 20 matrix is calculated from the frequencies of fully conserved and two-residue combinations in the databases of multiple alignments. Scores for higher-order combinations are given as the weighted average of these scores over all possible pairs formed by the query residue and the residues constituting the library combination in the aligned position. Scores in the empirical matrix (EMMA) are calculated from the actual (target) frequency of the union of the query residue and the library combination.
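As a rough illustration of the combination count and of averaging pair scores over a combination, consider the following Python sketch. It is not the WMM or EMMA implementation: the bit order, the placeholder `toy_pair_score` values, and the uniform default weights are all assumptions, since the real 20 by 20 values and weighting scheme are not given here.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed bit order, for illustration
BIT = {aa: 1 << i for i, aa in enumerate(AMINO_ACIDS)}

# Every non-empty subset of the 20 residues is one possible combination.
N_COMBINATIONS = 2 ** len(AMINO_ACIDS) - 1  # 1,048,575

def toy_pair_score(a, b):
    """Placeholder residue-pair score; NOT the published WMM values."""
    return 4 if a == b else -1

def members(mask):
    """List the residues whose bits are set in a combination mask."""
    return [aa for aa in AMINO_ACIDS if mask & BIT[aa]]

def combination_score(query, mask, pair_score=toy_pair_score, weights=None):
    """Weighted average of pair scores between the query residue and each
    residue in the combination. Uniform weights are an assumption."""
    residues = members(mask)
    if not residues:
        raise ValueError("empty combination")
    if weights is None:
        weights = {aa: 1.0 for aa in residues}
    total_weight = sum(weights[aa] for aa in residues)
    weighted_sum = sum(weights[aa] * pair_score(query, aa) for aa in residues)
    return weighted_sum / total_weight

print(N_COMBINATIONS)                                  # 1048575
print(combination_score("A", BIT["A"]))                # conserved position
print(combination_score("A", BIT["A"] | BIT["S"]))     # variable position
```

With these placeholder values, a fully conserved A scores higher against a query A than the variable combination A/S does, which is the qualitative behavior that separates conserved from variable positions.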
Pattern scores are forced to obey the following rules: an overlapping match should not be assigned a negative score and should not score higher than the identity score of the matching residue in the library combination; a combination mismatch should not be assigned a more negative score than the minimum of the scores of its subsets; etc.
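The two rules stated above might be enforced along the following lines. This Python sketch reflects one reading of the rules only; the function name `constrain_score`, its parameters, and the example values are hypothetical, and any further rules covered by "etc." are not modeled.

```python
def constrain_score(raw, query_in_combination, identity_score, subset_scores=()):
    """Apply the two stated constraints to a raw pattern score.

    raw                  -- unconstrained score for query residue vs. combination
    query_in_combination -- True if the query residue occurs in the combination
    identity_score       -- score of the query residue matched against itself
    subset_scores        -- scores of the query against subsets of the combination
    """
    if query_in_combination:
        # Overlapping match: never negative, never above the identity score.
        return max(0, min(raw, identity_score))
    if subset_scores:
        # Mismatch: never more negative than the lowest-scoring subset.
        return max(raw, min(subset_scores))
    return raw

print(constrain_score(-3, True, 50))                 # 0
print(constrain_score(70, True, 50))                 # 50
print(constrain_score(-40, False, 50, [-10, -25]))   # -25
```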
Please note: Scores are scaled up by a factor of 10. Also, even without scaling, the scores obtained cannot be compared with scores from other matrices. The self-score of a pattern is always less than that of a 20-letter sequence. By default, query sequences are filtered with the programs XNU and SEG in order to eliminate longer runs of a particular residue or segments with short repetitive motifs such as those in collagens. Such regions frequently result in hundreds of biologically insignificant matches. Filtering replaces such regions with X characters.
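The effect of filtering on a query can be pictured with the simplified Python sketch below. It is only a stand-in: XNU and SEG use statistical measures of repetitiveness and sequence complexity, whereas this hypothetical `mask_homopolymer_runs` function masks nothing but runs of a single repeated residue, with an arbitrarily chosen `min_run` threshold.

```python
import re

def mask_homopolymer_runs(sequence, min_run=5):
    """Replace every run of min_run or more identical residues with X's,
    keeping the sequence length unchanged."""
    # ([A-Z]) captures one residue; \1{n,} requires at least n more copies.
    run = re.compile(r"([A-Z])\1{%d,}" % (min_run - 1))
    return run.sub(lambda m: "X" * len(m.group(0)), sequence)

print(mask_homopolymer_runs("MKAAAAAAGT"))  # MKXXXXXXGT
print(mask_homopolymer_runs("MKAAAAGT"))    # MKAAAAGT (run of 4 is kept)
```

Masking with X rather than deleting the region preserves residue numbering, so positions reported in search results still refer to the original query.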