Two of the custom sequence databases developed for use with BEAUTY were derived by first clustering the protein sequences in the NCBI's Entrez database (Rel. 14). This generated 12,669 sequence families of 2 or more sequences, encompassing 97,521 total sequences. Each family was then multiply aligned using PIMA, our Pattern-Induced Multiple-sequence Alignment program (RF Smith and TF Smith, 1992, Protein Engng 5:35).
The multiple alignments were then scanned for the presence of sequence fragments. If an alignment contained one or more fragments, additional alignments were created by removing each of the fragments from the original alignment. Each alignment was then given a unique cluster identifier; the alignments generated by removing fragments, as well as the original alignment, were given a unique extension based on the relative position of the fragments in the original alignment (e.g., 52.74, 52.94, 52.100; the original alignment has the highest-numbered extension in each set).
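The identifier convention can be illustrated with a minimal Python sketch. It assumes identifiers are strings of the form `cluster.extension`, and that the extension must be compared as an integer (so that 100 ranks above 94). The function name `original_alignments` is hypothetical, and treating an identifier with no extension as its own original is an assumption, not something the BEAUTY documentation states.

```python
from collections import defaultdict
from typing import Dict, Iterable


def original_alignments(identifiers: Iterable[str]) -> Dict[str, str]:
    """Map each cluster number to the identifier with the highest extension.

    Per the text, the original (fragment-containing) alignment carries the
    highest-numbered extension in its set.
    """
    groups = defaultdict(list)
    for ident in identifiers:
        cluster, dot, extension = ident.partition(".")
        # Assumption: an identifier without an extension stands alone.
        number = int(extension) if dot else -1
        groups[cluster].append((number, ident))
    # Compare extensions numerically, not as strings ("100" < "94" as text).
    return {cluster: max(members)[1] for cluster, members in groups.items()}


if __name__ == "__main__":
    print(original_alignments(["52.74", "52.94", "52.100"]))
```

Comparing the extensions as integers matters here: a plain string sort would rank 52.94 above 52.100.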
Next, the resulting multiple alignments were scanned for information-dense regions using a program we have previously developed (ibid). This program extracts all local regions within a multiple alignment of length n or longer that have an information density (ID) above a threshold, T. To generate a list of conserved regions (domains) for each sequence in each family, T was set to 1.2 times the average ID of the entire alignment. The positions of the conserved regions within each sequence along with the information on sequence family membership were then collected and stored in a local database.
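The thresholding step can be sketched as follows. This is a simplified illustration, not the program cited above: it assumes a precomputed information-density value per alignment column (the source does not define how ID is calculated) and treats a region as a run of consecutive columns that each exceed T. The function name `conserved_regions` and its parameters are hypothetical.

```python
from typing import List, Tuple


def conserved_regions(
    id_per_column: List[float], n: int, factor: float = 1.2
) -> List[Tuple[int, int]]:
    """Return half-open (start, end) column ranges of length >= n whose
    per-column information density exceeds T = factor * average ID."""
    if not id_per_column:
        return []
    threshold = factor * sum(id_per_column) / len(id_per_column)
    regions = []
    start = None  # start column of the run currently being extended
    for i, value in enumerate(id_per_column):
        if value > threshold:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= n:
                regions.append((start, i))
            start = None
    # Close a run that extends to the last column of the alignment.
    if start is not None and len(id_per_column) - start >= n:
        regions.append((start, len(id_per_column)))
    return regions
```

Because T is defined relative to the average ID of each alignment, the threshold adapts to how well conserved the family is overall, rather than applying one absolute cutoff to every family.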
In addition, a database of annotated domains/sites was created by 1) scanning the Entrez database for those protein sequences with annotations describing known domains and sites within the sequence, 2) matching each Entrez sequence against the sequence motifs in the PROSITE pattern database and storing the location of each hit, 3) extracting the locations of the conserved blocks within the sequences represented in the BLOCKS database, and 4) extracting the locations of the domains identified in the sequences in the PRINTS protein fingerprint database.
Three different sequence databases have been constructed for use with BEAUTY:
BEAUTY incorporates information on sequence family membership, the location of the conserved domains, and the locations of any annotated domains and sites directly into BLAST search results:
1) A table is added to the BLAST output that lists, for each database hit, the sequence family to which the database sequence belongs, the number of sequences in that family matched in the search (M), and the total number of sequences in the family (T), e.g.:
                Clustered in sequence family:                           Sequences
Locus_ID        Number  Title                                           M / T
gi|44804|lcl|2  256.36  protein threonine kinase pkn1 pkn2              2/3
pir||S21533|gi  256.36  protein threonine kinase pkn1 pkn2              2/3
gi|44804|lcl|2  256.40  a-raf b-raf v-mil raf proto-oncogene serine/th  2/41
pir||S21533|gi  256.40  a-raf b-raf v-mil raf proto-oncogene serine/th  2/41
sp|P13186|KIN2  79.54   protein kinase kin1 kin2                        4/4
sp|P13186|KIN2  79.66   carbon putative serine/threonine protein catab  4/12
sp|P13186|KIN2  79.77   carbon serine/threonine protein catabolite ser  4/29
gi|297102|lcl|  73.75   cell division cdc2 control protein 2 kinase p   2/55
gi|407487|lcl|  360.30  extracellular signal-regulated mitogen-activat  11/31
sp|Q04899|KPT3  73.75   cell division cdc2 control protein 2 kinase p   2/55
gi|172183|lcl|  256.37  eukaryotic initiation protein factor eif-2 gcn  1/4
2) A figure is added showing the relative location of each hit (HSP) within the query sequence. In addition, the query sequence is matched against the PROSITE pattern database, and the locations of all pattern matches within the query sequence are displayed:
Locally-aligned regions (HSPs) with respect to query sequence:
Locus_ID Cluster
gi|44804|lcl|2 256.36 | __________ _______
sp|P13186|KIN2 79.54 | ____ ____ _______
sp|P27704|ERK3 360.31 | ___________
gi|4229|lcl|13 1393.9 | _______ ________
gi|393281|lcl| 6930 | ________
sp|P32361|IRE1 6930 | ________
gi|450233|lcl| 26.152 | ____ _______
pir||B40466|gi 360.30 | ________
sp|P08414|KCC4 79.32 | _______
gi|306479|lcl| 79.32 | _______
sp|P13185|KIN1 79.77 | ____ ____ _______
Prosite Hits: __
__________________________________________________
Query sequence: | | | | | | 224
0 50 100 150 200
__________________
Prosite hits:
PROTEIN_KINASE_TYR Tyrosine protein kinases specific active 138..150
__________________
3) A figure is added for each BLAST hit showing:
a) the positions of the local hits (HSPs) relative to the positions of
the known conserved regions within the database sequence, and
b) the location of any annotated domains and sites within
each matched sequence, e.g.:
Cluster: 73.75 cell division cdc2 control protein 2 kinase p to homolog
Conserved regions:
Cluster 73.75 |_________________ _ |
Local hits (HSPs): _ ______ __
Annotated domains: _ _ _
__________________________________________________
Database sequence: | | | || 451
0 150 300 450
__________________
Annotated domains:
np-binding site ATP. 127..135
binding site ATP. 150
active site 242
__________________
Note: The vertical bars ('|') bounding the conserved regions mark the start and end positions of the region of the alignment that was scanned for conserved domains. These bounds are determined by the length of the shortest sequence in that cluster and thus usually cover less than the full length of the sequence hit (as in the above example). The locations of the local hits (HSPs) can therefore be compared directly with the locations of the conserved domains only within these bounds.
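The restriction described in the note can be made concrete with a short sketch. It assumes HSPs and conserved regions are available as inclusive (start, end) positions on the database sequence, and it counts positions shared by both only inside the scanned bounds. The function names and the coordinates in the usage note are hypothetical; BEAUTY itself presents this comparison graphically.

```python
from typing import Iterable, Set, Tuple

Interval = Tuple[int, int]  # inclusive (start, end) sequence positions


def covered(intervals: Iterable[Interval], bounds: Interval) -> Set[int]:
    """Positions covered by the intervals after clipping to the bounds."""
    positions: Set[int] = set()
    for start, end in intervals:
        lo, hi = max(start, bounds[0]), min(end, bounds[1])
        # range() is empty when the interval lies entirely outside the bounds.
        positions.update(range(lo, hi + 1))
    return positions


def shared_positions(
    hsps: Iterable[Interval], conserved: Iterable[Interval], bounds: Interval
) -> int:
    """Count positions lying in both an HSP and a conserved region,
    considering only the region of the alignment that was scanned."""
    return len(covered(hsps, bounds) & covered(conserved, bounds))
```

With made-up coordinates, `shared_positions([(100, 200), (400, 450)], [(150, 250)], (1, 300))` counts only positions 150 to 200. The second HSP falls outside the scanned bounds, so it says nothing about conservation in either direction.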
In summary, by incorporating sequence family, conserved domain, and annotated domain and site information directly into BLAST search results, BEAUTY can greatly improve the identification of weak, but functionally significant, matches in BLAST database searches.
Reference: Kim C. Worley, Brent A. Wiese, and Randall F. Smith (1995). BEAUTY: An enhanced BLAST-based search tool that integrates multiple biological information resources into sequence similarity search results. Genome Research 5:173-184. Human Genome Sequencing Center, Baylor College of Medicine.