BEAUTY Program Description

BEAUTY (BLAST Enhanced Alignment Utility) is an enhanced version of the NCBI's BLAST database search tool. BEAUTY, when used to search three new custom sequence databases that we have developed, incorporates information on sequence family membership, the location of the conserved domains, and the locations of any annotated domains and sites directly into BLAST search results. These enhancements make it much easier to detect weak, but functionally significant, matches in BLAST database searches.

Two of the custom sequence databases developed for use with BEAUTY were derived by first clustering the protein sequences in the NCBI's Entrez database (Rel. 14). This generated 12,669 sequence families of 2 or more sequences, encompassing 97,521 total sequences. Each family was then multiply aligned using PIMA, our Pattern-Induced Multiple-sequence Alignment program (RF Smith and TF Smith, 1992, Protein Engng 5:35).

The multiple alignments were then scanned for the presence of sequence fragments. If an alignment contained one or more fragments, additional alignments were created by removing each of the fragments from the original alignment. Each alignment was then given a unique cluster identifier; the alignments generated by removing fragments, as well as the original alignment, were given a unique extension based on the relative position of the fragments in the original alignment (e.g., 52.74, 52.94, 52.100; the original alignment has the highest-numbered extension in each set).

Next, the resulting multiple alignments were scanned for information-dense regions using a program we developed previously (Smith and Smith, 1992). This program extracts all local regions of length n or longer within a multiple alignment that have an information density (ID) above a threshold, T. To generate a list of conserved regions (domains) for each sequence in each family, T was set to 1.2 times the average ID of the entire alignment. The positions of the conserved regions within each sequence, along with the information on sequence family membership, were then collected and stored in a local database.
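The thresholding step can be illustrated with a minimal sketch. It is not the original program: the per-column score used here (log2 of the alphabet size minus the Shannon entropy, scaled by the fraction of non-gap residues) is an assumed stand-in for the published ID measure, and the names column_information and conserved_regions are hypothetical. Only the rule "T = 1.2 times the average ID, regions of length n or longer" comes from the description above.

```python
import math
from collections import Counter


def column_information(column, alphabet_size=20):
    """Assumed stand-in for information density: log2(alphabet) minus entropy,
    scaled by the fraction of non-gap residues in the column."""
    residues = [r for r in column if r != "-"]
    if not residues:
        return 0.0
    total = len(residues)
    counts = Counter(residues)
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    occupancy = total / len(column)
    return (math.log2(alphabet_size) - entropy) * occupancy


def conserved_regions(alignment, min_len, factor=1.2):
    """Return 1-based (start, end) alignment columns of runs scoring above T.

    T is factor times the mean column score; runs shorter than min_len are dropped.
    """
    scores = [column_information(col) for col in zip(*alignment)]
    threshold = factor * (sum(scores) / len(scores))
    regions = []
    start = None
    # A sentinel score below any threshold closes a run that reaches the last column.
    for i, score in enumerate(scores + [float("-inf")]):
        if score > threshold and start is None:
            start = i
        elif score <= threshold and start is not None:
            if i - start >= min_len:
                regions.append((start + 1, i))
            start = None
    return regions


# Toy alignment (hypothetical): four identical columns followed by four variable ones.
toy = ["AAAAKLMN", "AAAACDEF", "AAAAGHIK"]
print(conserved_regions(toy, min_len=3))
```

The coordinates returned are alignment columns; mapping them back to positions within each individual sequence (by skipping gap characters) would be a further step not shown here.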

In addition, a database of annotated domains/sites was created by 1) scanning the Entrez database for those protein sequences with annotations describing known domains and sites within the sequence, 2) matching each Entrez sequence against the sequence motifs in the PROSITE pattern database and storing the location of each hit, 3) extracting the locations of the conserved blocks within the sequences represented in the BLOCKS database, and 4) extracting the locations of the domains identified in the sequences in the PRINTS protein fingerprint database.
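Step 2 above, matching a sequence against PROSITE patterns and recording the location of each hit, can be sketched as follows. This is an illustrative sketch, not the code used to build the BEAUTY databases. The function names prosite_to_regex and find_pattern_hits are hypothetical, the translation covers only the common pattern elements (x, [..], {..}, repeat counts, and leading "<" or trailing ">" anchors), and the pattern used in the example is the well-known N-glycosylation site pattern.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression.

    Handles x, [ABC], {ABC}, (n) and (n,m) repeats, and <, > anchors at the
    ends of an element. Anchors inside brackets are not handled.
    """
    body = pattern.strip().rstrip(".")
    pieces = []
    for element in body.split("-"):
        prefix = "^" if element.startswith("<") else ""
        suffix = "$" if element.endswith(">") else ""
        element = element.strip("<>")
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = match.groups()
        if core == "x":
            core = "."                      # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"  # any residue except those listed
        repeat = ""
        if low is not None:
            repeat = "{" + low + ("," + high if high else "") + "}"
        pieces.append(prefix + core + repeat + suffix)
    return "".join(pieces)


def find_pattern_hits(pattern, sequence):
    """Return 1-based inclusive (start, end) positions of all matches,
    including overlapping ones (found via a lookahead)."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start(1) + 1, m.end(1)) for m in regex.finditer(sequence)]


# Toy sequence (hypothetical) scanned with the N-glycosylation pattern.
print(find_pattern_hits("N-{P}-[ST]-{P}.", "AANGSAA"))
```

Positions are reported as 1-based inclusive ranges, the same convention as the "138..150" style used in the BEAUTY output.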

Three different sequence databases have been constructed for use with BEAUTY:

BEAUTY incorporates information on sequence family membership, the location of the conserved domains, and the locations of any annotated domains and sites directly into BLAST search results:

1) A table is added to the BLAST output that lists, for each database hit, the sequence family to which the database sequence belongs, the number of sequences in that family matched in the search (M), and the total number of sequences in the family (T), e.g.:

                Clustered in sequence family:                         Sequences
Locus_ID        Number   Title                                           M / T

gi|44804|lcl|2  256.36   protein threonine kinase pkn1 pkn2               2/3   
pir||S21533|gi  256.36   protein threonine kinase pkn1 pkn2               2/3   
gi|44804|lcl|2  256.40   a-raf b-raf v-mil raf proto-oncogene serine/th   2/41  
pir||S21533|gi  256.40   a-raf b-raf v-mil raf proto-oncogene serine/th   2/41  
sp|P13186|KIN2  79.54    protein kinase kin1 kin2                         4/4   
sp|P13186|KIN2  79.66    carbon putative serine/threonine protein catab   4/12  
sp|P13186|KIN2  79.77    carbon serine/threonine protein catabolite ser   4/29  
gi|297102|lcl|  73.75    cell division cdc2 control protein 2 kinase p    2/55  
gi|407487|lcl|  360.30   extracellular signal-regulated mitogen-activat  11/31  
sp|Q04899|KPT3  73.75    cell division cdc2 control protein 2 kinase p    2/55  
gi|172183|lcl|  256.37   eukaryotic initiation protein factor eif-2 gcn   1/4   

2) A figure is added showing the relative location of each hit (HSP) within the query sequence. In addition, the query sequence is matched against the PROSITE pattern database, and the locations of all pattern matches within the query sequence are displayed:

Locally-aligned regions (HSPs) with respect to query sequence:

Locus_ID        Cluster
gi|44804|lcl|2  256.36 |                __________ _______                
sp|P13186|KIN2  79.54  |  ____        ____         _______                
sp|P27704|ERK3  360.31 |                         ___________              
gi|4229|lcl|13  1393.9 |                   _______    ________            
gi|393281|lcl|  6930   |                          ________                
sp|P32361|IRE1  6930   |                          ________                
gi|450233|lcl|  26.152 |      ____                 _______                
pir||B40466|gi  360.30 |                             ________             
sp|P08414|KCC4  79.32  |                           _______                
gi|306479|lcl|  79.32  |                           _______   
sp|P13185|KIN1  79.77  |  ____        ____         _______                

Prosite Hits:                                          __                 
                        __________________________________________________
Query sequence:        |          |          |          |           |     | 224
                       0         50        100        150         200
__________________
Prosite hits:
   PROTEIN_KINASE_TYR   Tyrosine protein kinases specific active 138..150
__________________

3) A figure is added for each BLAST hit showing:
a) the positions of the local hits (HSPs) relative to the positions of the known conserved regions within the database sequence, and
b) the locations of any annotated domains and sites within each matched sequence, e.g.:

Cluster: 73.75 cell division cdc2 control protein 2 kinase p to homolog

Conserved regions:       
   Cluster 73.75                      |_________________    _       |

Local hits (HSPs):                    _   ______  __                      
Annotated domains:                    _  _         _                      
                        __________________________________________________
Database sequence:     |                |               |                || 451
                       0              150             300              450
__________________

Annotated domains:
   np-binding site      ATP.                                     127..135
   binding site         ATP.                                     150
   active site                                                   242
__________________

Note: The vertical bars ('|') bounding the conserved regions mark the start and end positions of the region of the alignment that was scanned for conserved domains. These bounds are determined by the length of the shortest sequence in that cluster, so the scanned region is usually shorter than the sequence hit (as in the above example). The locations of the local hits (HSPs) and the locations of the conserved domains are therefore directly comparable only within these bounds.
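The figures above place residue ranges on a fixed-width text ruler. A minimal sketch of that idea follows, together with a helper that restricts HSP ranges to the scanned bounds described in the note. The function names render_track and clip_to_bounds, the 50-column width, and the linear scaling rule are all assumptions made for illustration; BEAUTY's actual layout code may scale and round differently, so the sketch is not expected to reproduce the figures column for column.

```python
def render_track(intervals, seq_len, width=50, mark="_"):
    """Draw 1-based inclusive residue intervals on a ruler of `width` columns.

    Residue r is mapped to column (r - 1) * width // seq_len (assumed scaling).
    """
    cells = [" "] * width
    for start, end in intervals:
        first = (start - 1) * width // seq_len
        last = (end - 1) * width // seq_len
        for col in range(first, last + 1):
            cells[col] = mark
    return "".join(cells)


def clip_to_bounds(intervals, lower, upper):
    """Keep only the parts of each interval inside [lower, upper], the region
    in which HSPs and conserved domains can be compared directly."""
    clipped = []
    for start, end in intervals:
        s, e = max(start, lower), min(end, upper)
        if s <= e:
            clipped.append((s, e))
    return clipped


# A pattern hit at residues 138..150 on a 224-residue query.
print("|" + render_track([(138, 150)], seq_len=224) + "|")

# Hypothetical HSP ranges restricted to hypothetical scanned bounds 50..250.
print(clip_to_bounds([(10, 60), (200, 300), (400, 420)], 50, 250))
```

Clipping before comparison keeps an HSP that extends past the scanned region from being read as falling outside every conserved domain when that part of the sequence was simply never scanned.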

In summary, by incorporating sequence family, conserved domain, and annotated domain and site information directly into BLAST search results, BEAUTY can greatly improve the identification of weak, but functionally significant, matches in BLAST database searches.

Reference: Kim C. Worley, Brent A. Wiese, and Randall F. Smith (1995). BEAUTY: An enhanced BLAST-based search tool that integrates multiple biological information resources into sequence similarity search results. Genome Research 5:173-184. Human Genome Sequencing Center, Baylor College of Medicine.

