Reading BLAST and BEAUTY database search results:
NOTE:
Three different factors influence the format of the search
results. The primary factor is the tool used for the search
(in this case, BLAST with BEAUTY post-processing). In addition,
both the query sequence (the sequence the user inputs)
and the database searched (for example, the nr protein
database) affect the search output.
Histogram of Expectation
Sequences producing High-scoring Segment Pairs
Alignment of HSPs with the Query Sequence
Display of BLAST sequence alignments with BEAUTY Post-Processing
- Synopsis of sequence alignment with BEAUTY processing
- Statistical Data and Alignment
- How to read statistical data
- How to read the sequence alignment
- Histogram of Expectation:
Observed Numbers of Database Sequences Satisfying Various EXPECTation
Thresholds (E parameter values)
The Expect value (E parameter) represents the number
of times this match or a better one would be expected to occur
purely by chance in a search of the entire database. Thus, the
lower the Expect value, the less likely the match is due to chance
and the more significant the similarity between the
input sequence and the match. Each line of the histogram corresponds
to one E parameter value and reports every database sequence whose
alignment satisfies that threshold. Thus, for each E parameter value, there
is a graphical representation of the number of database sequences
that meet that E value.
This section can be used to determine how many database sequences
achieved a (user-defined) level of statistical significance. The
histogram marks the point at which the E values drop into the
range of statistical significance (that is, where the matched sequences
are probably related in some way to the input sequence) with
the following marker:
>>>>>>>>>>>>>>>>>>>>> Expect = 10.0, Observed = 38 <<<<<<<<<<<<<<<<<
An Expect value of 10.0 is the default cutoff for statistical
significance, but this number can be adjusted by the user.
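The counting behind the histogram can be sketched in a few lines of Python. This is only an illustration: the E-values and thresholds are made up, and count_at_thresholds is a hypothetical helper, not part of BLAST or BEAUTY.

```python
def count_at_thresholds(e_values, thresholds):
    """For each threshold, count the sequences whose E-value is at or below it."""
    return {t: sum(1 for e in e_values if e <= t) for t in thresholds}

# Hypothetical E-values, one per database sequence (not real search output).
e_values = [2e-30, 5e-12, 0.003, 0.4, 2.5, 9.0, 40.0]
# Hypothetical thresholds; 10.0 is the default cutoff mentioned above.
thresholds = [10.0, 1.0, 0.01]

counts = count_at_thresholds(e_values, thresholds)
for t in thresholds:
    print(f"Expect = {t:g}, Observed = {counts[t]}")
```

Each printed line mirrors one row of the histogram: tightening the threshold can only keep or reduce the observed count.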
Example of Histogram:
- Sequence Display:
This section displays the HSPs
(High-scoring Segment Pairs) that satisfy the defined cutoff
for statistical significance, up to a user-defined maximum number
of results.
This table summarizes the search results.
Example of a sequence display:
The first column contains a sequence identifier. Clicking on the sequence identifier takes you to the alignment
data for that HSP.
Following the sequence identifier is the truncated
title line of the sequence. It is sometimes possible to identify
the function of the database sequence from the title line.
The second column contains the "High Score"
for each database match. This is the score
of the highest scoring HSP found within that database sequence.
Columns three and four contain the "Smallest Sum
Probability" values.
- Column 3, the "P(N)"
column, contains the lowest P-value assigned to a set of
HSPs for each database sequence. The P-value represents the probability
(in the range of 0-1) of a given match occurring by chance.
It is less accurate than the E-value
and depends on N.
- Column 4, the "N"
column displays the number of HSPs in the set which was assigned
the lowest P-value. If a sequence matches to the query in a number
of regions with large gaps interspersed (as with genomic sequence
exon matches to protein sequence) then N will indicate the number
of regions (for example, the number of coding exons).
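The "High Score" column described above can be illustrated with a short Python sketch. It assumes the HSPs are available as (sequence identifier, score) pairs. That layout, the identifiers, and the helper name high_scores are all hypothetical and do not reflect an actual BLAST data format.

```python
def high_scores(hsps):
    """Map each database sequence to the score of its highest-scoring HSP."""
    best = {}
    for seq_id, score in hsps:
        if seq_id not in best or score > best[seq_id]:
            best[seq_id] = score
    return best

# Hypothetical HSPs: (database sequence identifier, HSP score).
hsps = [("seqA", 310), ("seqB", 95), ("seqA", 120), ("seqB", 140)]

summary = high_scores(hsps)
# List sequences from highest to lowest "High Score".
for seq_id in sorted(summary, key=summary.get, reverse=True):
    print(seq_id, summary[seq_id])
```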
- BEAUTY Summary Figure of the
Alignment of HSPs with the Query
Sequence
This graphical display is part of the BEAUTY
post processing. It shows the placement of locally-aligned
regions (HSPs) with respect to query sequence. It can be used
to identify which regions of the sequence appear to be most and
least highly conserved.
For example, in the output displayed below, two regions
of the sequence (around 150 bp and 275 bp) appear to
be consistently conserved across many of the HSPs.
In addition, query sequences are searched against Prosite,
a database of protein families and domains, and the locations of these matches
are shown. Note, however, that in the display below the Prosite hit appears
to be false, as the sequence fails to overlap significantly with
any of the HSPs. A true hit should show clear sequence overlap with
regions of the HSPs.
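The overlap check described above can be sketched as a simple interval comparison. This is an assumption-laden illustration: the coordinates are invented, positions are treated as inclusive (start, end) pairs on the query, and the helper names and the min_overlap parameter are hypothetical rather than part of BEAUTY.

```python
def overlap_length(a, b):
    """Number of positions shared by two inclusive (start, end) intervals."""
    start = max(a[0], b[0])
    end = min(a[1], b[1])
    return max(0, end - start + 1)

def overlaps_any_hsp(motif, hsps, min_overlap=1):
    """True if the motif shares at least min_overlap positions with some HSP."""
    return any(overlap_length(motif, hsp) >= min_overlap for hsp in hsps)

# Hypothetical HSP locations on the query sequence.
hsps = [(140, 165), (260, 290)]

print(overlaps_any_hsp((150, 160), hsps))  # motif inside the first HSP
print(overlaps_any_hsp((10, 25), hsps))    # motif far from every HSP
```

How much overlap counts as "significant" is a judgment call; raising min_overlap makes the check stricter.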

- Display of BLAST sequence
alignments with BEAUTY Post-Processing
Each of the matched sequences summarized in the above figure
is shown below in greater detail.
- Synopsis of sequence
alignment with BEAUTY processing:
Example:

i. To the left of the title line of the matched database sequence
are links to other databases with more information about this sequence.
Note: If you would like additional
information about any of the search browsers listed below, please
click on the links to the homepages of each one.
ii. The figure below the sequence description is a graphical
display of alignment locations:
This figure displays the location
of the local hits (HSPs) and any annotated domains with respect
to the database sequence (not the search sequence).
Annotated Domains are displayed
if there are annotations in the BEAUTY
database. Below the graphical display is a list of the annotated
domains, with the source of the annotation, the domain name, and
its location in the database sequence.
Example of annotated domain information:
- Statistical Data and Alignment
How to read the statistical data
How to read the sequence alignment
Example:
- How to read the statistical data:
Score:
For the segment pair display, the score is the sum of the scoring
matrix values in the segment pair being displayed. The "bits"
value is the raw score converted to bits of information by multiplying
by lambda.
Expect:
The number of times one might expect to see such a match (or
a better one) merely by chance.
P or Sum P:
The P-value for observing such a
match. "Sum", when present, indicates that Sum statistics were
used to calculate the Expect and P values. The value following Sum
P in parentheses, Sum P(*), is the N parameter indicating the
number of HSPs used in the statistics.
Identities:
The number and fraction of total residues in the HSP which
are identical.
Positives:
The number and fraction of residues for which the alignment
scores have positive values.
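As a rough numerical illustration of the Score entry above, the conversion from a raw score to bits can be sketched as follows. The sketch assumes the simple form bits = lambda * S / ln 2, in which multiplying by lambda scales the raw score to natural-log units and dividing by ln 2 expresses it in bits. The lambda value is purely illustrative; the real value depends on the scoring matrix and is reported by the search program, and some BLAST versions also subtract a ln K term that is not shown here.

```python
import math

def raw_to_bits(raw_score, lam):
    """Convert a raw alignment score to bits, assuming bits = lam * S / ln 2."""
    return lam * raw_score / math.log(2)

# Illustrative lambda only; the actual value is reported in the search output.
lam = 0.32
for raw in (50, 100, 200):
    print(raw, round(raw_to_bits(raw, lam), 1))
```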
- How to read the sequence alignment:
Row 1: The query sequence.
Row 3: The database sequence aligned with the query.
Note: Runs of "------" indicate gaps in the sequence
alignment. Thus, the total length of the sequence in Row 1 is
53 amino acids, and the total length of the sequence in Row 3
is 58 amino acids.
Row 2:
Identical residues are indicated by the one-letter code of the
amino acid.
Similar, nonidentical residues with positive alignment scores
are indicated with a +.
Aligned residue pairs with zero or negative scores are indicated with
a space (" ").
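The Identities and Positives figures can be recovered from Row 2 alone, as the following sketch shows. The three rows are an invented toy alignment, not real output, and count_identities_and_positives is a hypothetical helper. The sketch counts letters as identities and counts letters plus "+" characters as positives, on the assumption that identical residues also score positively.

```python
def count_identities_and_positives(midline):
    """Count identities (letters) and positives (letters plus '+') in Row 2."""
    identities = sum(1 for ch in midline if ch.isalpha())
    positives = identities + midline.count("+")
    return identities, positives

# Toy alignment (hypothetical): Row 1 = query, Row 2 = midline, Row 3 = database.
query   = "MKV-LAAGIT"
midline = "MK+ L  G+T"
sbjct   = "MKIQLWPGVT"

identities, positives = count_identities_and_positives(midline)
length = len(midline)
print(f"Identities = {identities}/{length}, Positives = {positives}/{length}")
```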
GLOSSARY:
HSP (High-scoring Segment Pair)
The High-scoring Segment Pair (HSP) is the fundamental unit
of BLAST algorithm output. An HSP consists of two sequence fragments
of arbitrary but equal length whose alignment is locally maximal
and for which the alignment score meets or exceeds a threshold
or cutoff score. A set of HSPs is thus defined by two sequences,
a scoring system, and a cutoff score; this set may be empty if
the cutoff score is sufficiently high. In the programmatic implementations
of the BLAST algorithm described here, each HSP consists of a
segment from the query sequence and one from a database sequence.
The sensitivity and speed of the programs can be adjusted via
the standard BLAST algorithm parameters W, T, and X (Altschul
et al., 1990); selectivity of the programs can be adjusted via
the cutoff score.
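The ideas of "locally maximal" and "cutoff score" can be illustrated with a small sketch. It is not the BLAST algorithm: there is no word seeding (W, T) and no drop-off (X). It simply compares two equal-length fragments position by position and finds the highest-scoring ungapped segment. The match/mismatch scores, the cutoff, and the sequences are hypothetical; real searches use a substitution matrix.

```python
def best_segment(seq_a, seq_b, match=5, mismatch=-4):
    """Return (score, start, end) of the highest-scoring ungapped segment
    when seq_a and seq_b are compared position by position (end exclusive)."""
    best = (0, 0, 0)
    run_score, run_start = 0, 0
    for i, (a, b) in enumerate(zip(seq_a, seq_b)):
        if run_score <= 0:
            # A non-positive running score cannot help; start a new segment here.
            run_score, run_start = 0, i
        run_score += match if a == b else mismatch
        if run_score > best[0]:
            best = (run_score, run_start, i + 1)
    return best

# Hypothetical fragments and cutoff score.
score, start, end = best_segment("AAACGTTT", "CCACGTGG")
cutoff = 15
print(score, start, end, score >= cutoff)
```

A segment is reported as an HSP in this toy setting only when its score meets or exceeds the cutoff; with a sufficiently high cutoff, nothing is reported.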
BEAUTY Annotated Domain Database
A database of annotated domains/sites was created for use with
the BEAUTY Post-Processor by:
- scanning the Entrez database for those protein sequences
with annotations describing known domains and sites within the
sequence
- matching each Entrez sequence against the sequence motifs
in the PROSITE pattern database and storing the location of each hit
- extracting the locations of the conserved blocks within
the sequences represented in the BLOCKS database
- extracting the locations of the domains identified in the
sequences in the PRINTS protein fingerprint database
- extracting the locations of the domains identified in the
sequences in PFAM, the Protein families database of alignments and HMMs
E-values
The E-value reported for each match in your sequence search
represents the number of alternate alignments, with the same
or better total score, that could be expected to occur within
the database purely by chance. Thus, the lower your E-value, the
better the match. The value depends upon the score given to the
alignment, as well as the lengths of the search sequence and the
database searched, and varies in the range 0 < E < 1000.
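The dependence on score and on sequence and database length can be made concrete with the Karlin-Altschul form E = K * m * n * exp(-lambda * S), where m is the query length, n is the database length, and S is the raw score. The sketch below is an illustration under stated assumptions: the K and lambda values are invented for the example, and the length corrections that real BLAST programs apply are omitted.

```python
import math

def expect_value(raw_score, query_len, db_len, lam, k):
    """Expected number of chance matches scoring at least raw_score,
    assuming E = K * m * n * exp(-lambda * S)."""
    return k * query_len * db_len * math.exp(-lam * raw_score)

# Illustrative parameters only (not taken from any real search).
lam, k = 0.32, 0.13
for raw in (40, 80, 120):
    print(raw, expect_value(raw, query_len=300, db_len=50_000_000, lam=lam, k=k))
```

Raising the score lowers E exponentially, while doubling either length doubles E.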
P-value
The P-value represents the probability (in the range of 0-1)
of a given match occurring by chance. It is less accurate than
the E-value and depends on N.
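For a single match, the P-value and E-value are commonly related by P = 1 - exp(-E), which follows from treating chance matches as Poisson-distributed. The sketch below assumes that relation; it does not reproduce the Sum statistics used when several HSPs are combined, and p_from_e is a hypothetical helper name.

```python
import math

def p_from_e(e_value):
    """Probability of at least one chance match, assuming P = 1 - exp(-E)."""
    # expm1 keeps precision when E is very small.
    return -math.expm1(-e_value)

for e in (1e-10, 0.01, 1.0, 10.0):
    print(e, p_from_e(e))
```

For small E the two values are nearly identical; for large E the P-value saturates near 1 while E keeps growing.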
The Statistics of Sequence Similarity Scores