Reading BLAST and BEAUTY database search results:

NOTE:

Three different parameters influence the format of the search results. The primary factor is the tool used for the search (in this case, BLAST with BEAUTY post-processing). In addition, both the query sequence (the sequence the user inputs) and the database searched (for example, the nr protein database) affect the search output.

 

  1. Histogram of Expectation

  2. Sequences producing High-scoring Segment Pairs

  3. Alignment of HSPs with the Query Sequence

  4. Display of BLAST sequence alignments with BEAUTY Post-Processing

    1. Synopsis of sequence alignment with BEAUTY processing

    2. Statistical Data and Alignment

      1. How to read statistical data

      2. How to read the sequence alignment



  1. Histogram of Expectation:

    Observed Numbers of Database Sequences Satisfying Various EXPECTation Thresholds (E parameter values)

    The Expect value (E parameter) represents the number of times this match or a better one would be expected to occur purely by chance in a search of the entire database. Thus, the lower the Expect value, the more significant the match between the input sequence and the database sequence. Each line of the histogram corresponds to one E threshold, and any database sequence whose alignment satisfies that threshold is counted in that line. Thus, for each E parameter value, there is a graphical representation of the number of database sequences that fit that E value.

    This section can be used to determine how many database sequences achieved a (user-defined) level of statistical significance. The Histogram marks the point at which the E values drop into the range of statistical significance (in the sense that they are probably in some way related to the input sequence) with the following marker:

    >>>>>>>>>>>>>>>>>>>>> Expect = 10.0, Observed = 38 <<<<<<<<<<<<<<<<<

    An Expect value of 10.0 is the default cutoff for statistical significance, but the user can adjust this number.
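    The counting behind the histogram can be sketched in a few lines of Python. This is only an illustration: the E-values and thresholds below are made up, the function names are hypothetical, and the text layout is not the exact layout printed by BLAST.

```python
# Hypothetical E-values, one per database sequence (best alignment each).
e_values = [1e-50, 3e-12, 0.002, 0.5, 4.0, 9.9, 25.0, 80.0]

# Illustrative Expect thresholds, one per histogram line.
thresholds = [100.0, 10.0, 1.0, 0.1, 0.001]


def observed_counts(e_values, thresholds):
    """For each threshold, count the sequences with E <= threshold."""
    return {t: sum(1 for e in e_values if e <= t) for t in thresholds}


def render_histogram(counts, cutoff=10.0):
    """Return one text line per threshold and flag the cutoff line."""
    lines = []
    for t in sorted(counts, reverse=True):
        marker = " <<< cutoff" if t == cutoff else ""
        lines.append(f"{t:>8g} {counts[t]:>4d} |{'=' * counts[t]}{marker}")
    return lines


counts = observed_counts(e_values, thresholds)
for line in render_histogram(counts):
    print(line)
```

    Because every threshold counts all sequences at or below it, the counts can only shrink as the threshold becomes stricter.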

    Example of Histogram:


  2. Sequence Display:

    This section displays the HSPs (High-scoring Segment Pairs) that satisfy the defined cutoff for statistical significance, up to a user-defined maximum number of results.

    This table summarizes the search results.

    Example of a sequence display:

     

    1. The first column contains a sequence identifier. Clicking on the sequence identifier takes you to the alignment data for that HSP.

    2. Following the sequence identifier is the truncated title line of the sequence. It is sometimes possible to identify the function of the database sequence from the title line.

    3. The second column contains the "High Score" for each database match. This is the score of the highest scoring HSP found within that database sequence.

    4. Columns three and four contain the 'Smallest Sum Probability'

      1. Column 3, the "P(N)" column, contains the lowest P-value assigned to a set of HSPs for each database sequence. The P-value represents the probability (in the range 0-1) of such a match occurring by chance. It is less accurate than the E-value and depends on N.

      2. Column 4, the "N" column, displays the number of HSPs in the set that was assigned the lowest P-value. If a sequence matches the query in several regions separated by large gaps (as when exons in a genomic sequence match a protein sequence), then N indicates the number of regions (for example, the number of coding exons).
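    As a rough sketch of how the columns described above map onto one line of the table, the Python below splits a summary line into its fields. The sample line, its identifier and its title are invented for illustration, and `SummaryRow` and `parse_summary_line` are hypothetical names; real output may space or truncate the columns differently.

```python
from typing import NamedTuple


class SummaryRow(NamedTuple):
    identifier: str   # first column: sequence identifier
    title: str        # truncated title line
    high_score: int   # "High Score" column
    p_value: float    # "P(N)" column
    n: int            # "N" column


def parse_summary_line(line: str) -> SummaryRow:
    """Split one summary line into its columns.

    The last three whitespace-separated fields are read as High Score,
    P(N) and N; the first field is the identifier; the rest is the title.
    """
    head, score, p_value, n = line.rsplit(None, 3)
    parts = head.split(None, 1)
    identifier = parts[0]
    title = parts[1].strip() if len(parts) > 1 else ""
    return SummaryRow(identifier, title, int(score), float(p_value), int(n))


# Hypothetical summary line, made up for this example.
row = parse_summary_line(
    "db|X00000|EXAMPLE  Hypothetical kinase-like protein...   523  1.2e-67  2"
)
print(row)
```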

     


  3. BEAUTY Summary Figure of the Alignment of HSPs with the Query Sequence

    This graphical display is part of the BEAUTY post-processing. It shows the placement of locally aligned regions (HSPs) with respect to the query sequence. It can be used to identify which regions of the sequence appear to be most and least highly conserved.

    For example, in the output displayed below, two regions of the sequence (around 150 bp and 275 bp) appear to be consistently conserved across many of the HSPs.

    In addition, query sequences are searched against Prosite, a database of protein families and domains, and the locations of these matches are shown. Note, however, that in the display below the Prosite hit appears to be false, as it fails to overlap significantly with any of the HSPs. A true hit should clearly overlap regions covered by HSPs.
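    The reasoning in this section (stacked HSPs suggest conservation, and a Prosite match with no HSP overlap is suspect) can be sketched as follows. All coordinates are hypothetical, the function names are illustrative, and BEAUTY itself draws a figure rather than computing these numbers.

```python
def coverage(query_length, hsps):
    """Count how many HSPs cover each query position (1-based, inclusive)."""
    depth = [0] * (query_length + 1)  # index 0 is unused
    for start, end in hsps:
        for pos in range(start, end + 1):
            depth[pos] += 1
    return depth


def overlap_fraction(feature, hsps):
    """Fraction of the feature's positions covered by at least one HSP."""
    f_start, f_end = feature
    covered = sum(
        1
        for pos in range(f_start, f_end + 1)
        if any(start <= pos <= end for start, end in hsps)
    )
    return covered / (f_end - f_start + 1)


# Hypothetical query-coordinate ranges of HSPs from several database sequences.
hsps = [(120, 180), (140, 170), (260, 300), (270, 290), (145, 160)]
depth = coverage(320, hsps)
print(depth[150], depth[275], depth[50])

# Hypothetical Prosite matches: one far from every HSP, one inside an HSP cluster.
isolated_hit = overlap_fraction((20, 35), hsps)
supported_hit = overlap_fraction((150, 165), hsps)
print(isolated_hit, supported_hit)
```

    Positions with a high depth are candidates for conserved regions; a Prosite match whose overlap fraction is near zero deserves the same caution as the false hit described above.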


  4. Display of BLAST sequence alignments with BEAUTY Post-Processing

    Each of the matched sequences summarized in the above figure is shown below in greater detail.

     

    1. Synopsis of sequence alignment with BEAUTY processing:

      Example:

      i. To the left of the title line of the matched database sequence are links to other databases with more information about this sequence.

      Note: If you would like additional information about any of the search browsers listed below, please click on the links to the homepages of each one.

         Retrieves Entrez links (e.g., Medline abstracts, FASTA-formatted sequence reports)
         Retrieves links to Related sequences (neighbors) from Entrez
         Retrieves links to the Sequence Retrieval System (SRS)
         Retrieves links to the Ligand Enzyme and Chemical Compound Database, including pathway diagrams

      ii. The figure below the sequence description is a graphical display of alignment locations:

      This figure displays the location of the local hits (HSPs) and any annotated domains with respect to the database sequence (not the search sequence).

      Annotated Domains are displayed if there are annotations in the BEAUTY database. Below the graphical display is a list of the annotated domains, with the source of the annotation, the domain name, and its location in the database sequence.

      Example of annotated domain information:

       

    2. Statistical Data and Alignment

      1. How to read the statistical data

      2. How to read the sequence alignment

      Example:

      1. How to read the statistical data:

        Score:

        For the segment pair display, the score is the sum of the scoring matrix values in the segment pair being displayed. The "bits" value is the raw score converted to bits of information by multiplying by lambda.

        Expect:

        The number of times one might expect to see such a match (or a better one) merely by chance.

        P or Sum P:

        The P-value for observing such a match. 'Sum', when present, indicates that Sum statistics were used to calculate the Expect and P values. The value in parentheses following Sum P, shown as Sum P(*), is the N parameter, indicating the number of HSPs used in the statistics.

        Identities:

        The number and fraction of total residues in the HSP that are identical.

        Positives:

        The number and fraction of residues for which the alignment scores have positive values.

         

      2. How to read the sequence alignment:

        Row 1: The query sequence

        Row 3: The database sequence aligned with the query

        Note: Runs of dashes ("------") indicate gaps in the sequence alignment. Thus, the total length of the sequence in Row 1 is 53 amino acids, and the total length of the sequence in Row 3 is 58 amino acids.

        Row 2:

        1. Identical residues are indicated by the capital letter of the amino acid.

        2. Similar, nonidentical residues with positive alignment scores are indicated with a +.

        3. Aligned residue pairs with zero or negative scores are indicated with a space (" ").
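        The three rules for Row 2, together with the Identities and Positives counts, can be sketched in Python. The set of positive-scoring pairs below is a small illustrative stand-in (an assumption); a real display takes the sign of each score from the scoring matrix used in the search. The two aligned rows are invented, and the function names are hypothetical.

```python
# Illustrative non-identical residue pairs treated as scoring above zero.
# Assumption: a real display derives this from the search's scoring matrix.
POSITIVE_PAIRS = {frozenset(p) for p in ("IV", "IL", "LV", "KR", "DE", "ST", "FY")}


def middle_row(query, subject):
    """Build Row 2 for two aligned rows of equal length ('-' marks a gap)."""
    if len(query) != len(subject):
        raise ValueError("aligned rows must have the same length")
    out = []
    for q, s in zip(query, subject):
        if q == "-" or s == "-":
            out.append(" ")          # gap: nothing to compare
        elif q == s:
            out.append(q)            # identical residue: print the letter
        elif frozenset((q, s)) in POSITIVE_PAIRS:
            out.append("+")          # similar residue with a positive score
        else:
            out.append(" ")          # zero or negative score
    return "".join(out)


def summarize(query, subject):
    """Return Row 2 plus the identity count, positive count and length."""
    mid = middle_row(query, subject)
    identities = sum(1 for c in mid if c.isalpha())
    positives = identities + mid.count("+")  # identities also score positive
    return mid, identities, positives, len(mid)


# Hypothetical aligned rows (Row 1 and Row 3).
query = "MKVLSDE-TF"
subject = "MRILSEEATY"
mid, identities, positives, length = summarize(query, subject)
print(query)
print(mid)
print(subject)
print(f"Identities = {identities}/{length}, Positives = {positives}/{length}")
```

        Positives are counted here as identities plus "+" positions, since an identical pair also has a positive score.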


    GLOSSARY:

    HSP (High-scoring Segment Pair)

    The High-scoring Segment Pair (HSP) is the fundamental unit of BLAST algorithm output. An HSP consists of two sequence fragments of arbitrary but equal length whose alignment is locally maximal and for which the alignment score meets or exceeds a threshold or cutoff score. A set of HSPs is thus defined by two sequences, a scoring system, and a cutoff score; this set may be empty if the cutoff score is sufficiently high. In the programmatic implementations of the BLAST algorithm described here, each HSP consists of a segment from the query sequence and one from a database sequence. The sensitivity and speed of the programs can be adjusted via the standard BLAST algorithm parameters W, T, and X (Altschul et al., 1990); selectivity of the programs can be adjusted via the cutoff score.

    BEAUTY Annotated Domain Database

    A database of annotated domains/sites was created for use with the BEAUTY Post-Processor by:

    1. scanning the Entrez database for those protein sequences with annotations describing known domains and sites within the sequence

    2. matching each Entrez sequence against the sequence motifs in the PROSITE pattern database and storing the location of each hit

    3. extracting the locations of the conserved blocks within the sequences represented in the BLOCKS database

    4. extracting the locations of the domains identified in the sequences in the PRINTS protein fingerprint database

    5. extracting the locations of the domains identified in the sequences in PFAM, Protein families database of alignments and HMMs.

    E-values

    The E-value reported for each match in your sequence search represents the number of alternative alignments, with the same or better total score, that could be expected to occur within the database purely by chance. Thus, the lower the E-value, the more significant the match. The value depends on the score given to the alignment, as well as on the lengths of the search sequence and the database searched, and lies in the range 0 < E < 1000.
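    The dependence on score and on sequence and database length can be illustrated with the standard Karlin-Altschul estimate E = K * m * n * exp(-lambda * S). This formula is not given on this page and is assumed here as the usual form; the values of lambda and K below are hypothetical, and the program may apply further corrections (for example Sum statistics) before reporting E.

```python
import math


def expect_value(raw_score, query_length, database_length, lam, k):
    """Karlin-Altschul estimate E = K * m * n * exp(-lambda * S)."""
    return k * query_length * database_length * math.exp(-lam * raw_score)


# Hypothetical statistical parameters, chosen only to show the trends.
LAMBDA, K = 0.3, 0.1

e_small_db = expect_value(100, 250, 1_000_000, LAMBDA, K)
e_large_db = expect_value(100, 250, 10_000_000, LAMBDA, K)
e_higher_score = expect_value(120, 250, 1_000_000, LAMBDA, K)

print(e_small_db, e_large_db, e_higher_score)
```

    Under this form, a tenfold larger database gives a tenfold larger E for the same score, while a higher score lowers E exponentially.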

    P-value

    The P-value represents the probability (in the range 0-1) of such a match occurring by chance. It is less accurate than the E-value and depends on N, the number of HSPs in the set.
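    For a single match, P and E are commonly related through the Poisson approximation P = 1 - exp(-E). That relation is assumed here rather than taken from this page, and the function names are hypothetical. A minimal sketch:

```python
import math


def p_from_e(e_value):
    """P = 1 - exp(-E): chance of at least one such match (Poisson model)."""
    return -math.expm1(-e_value)


def e_from_p(p_value):
    """Inverse relation: E = -ln(1 - P)."""
    return -math.log1p(-p_value)


for e in (1e-5, 0.1, 1.0, 10.0):
    print(e, p_from_e(e))
```

    For small values the two numbers are nearly equal, which is why very significant matches show almost the same P and E; for large E the P-value saturates near 1.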

    The Statistics of Sequence Similarity Scores

     


BCM HGSC