Alignment Report
Explore Report File
______________________________________________________________________
______________________________________________________________________
MATCH-BOX_server 1.3 13-May-99 20:01:05
Molecular Biology. University of NAMUR - BELGIUM
Internet address: matchbox@biq.fundp.ac.be
WEB: http://www.fundp.ac.be/sciences/biologie/bms/matchbox_submit.html
______________________________________________________________________
EXPLORE
______________________________________________________________________
REFERENCE
Match-Box_server: a multiple sequence alignment tool placing emphasis
on reliability. E. Depiereux, G. Baudoux, P. Briffeuil, I. Reginster,
X. De Bolle, C. Vinals and E. Feytmans (1997) CABIOS 13(3) 249-256
Please choose the font Monaco or Courier 11.
Table 1 : submitted sequences, as read by the mailer.
----------------------------------------------------
Please check that they are correct and do not contain embedded comments.
The current score matrix used for the sequence analysis is
blosum45.sco
>gi sequence number 1 190 aa
MATHHTLWMG LALLGVLGDL QAAPEAQVSV QPNFQQDKFL GRWFSAGLAS NSSWLREKKA ALSMCKSVVA
PATDGGLNLT STFLRKNQCE TRTMLLQPAG SLGSYSYRSP HWGSTYSVSV VETDYDQYAL LYSQGSKGPG
EDFRMATLYS RTQTPRAELK EKFTAFCKAQ GFTEDTIVFL PQTDKCMTEQ
>gi sequence number 2 168 aa
APEAQVSVQP NFQPDKFLGR WFSAGLASNS SWLQEKKAAL SMCKSVVAPA ADGGFNLTST FLRKNQCETR
TMLLQPGDSL GSYSYRSPHW GSTYSVSVVE TDYDHYALLY SQGSKGPGED FRMATLYSRT QTPRAELKEK
FTAFCKAQGF TEDSIVFLPQ TDKCMTEQ
>gi sequence number 3 189 aa
MAALRMLWMG LVLLGLLGFP QTPAQGHDTV QPNFQQDKFL GRWYSAGLAS NSSWFREKKA VLYMCKTVVA
PSTEGGLNLT STFLRKNQCE TKIMVLQPAG APGHYTYSSP HSGSIHSVSV VEANYDEYAL LFSRGTKGPG
QDFRMATLYS RTQTLKDELK EKFTTFSKAQ GLTEEDIVFL PQPDKCIQE
>gi sequence number 4 189 aa
MAALPMLWTG LVLLGLLGFP QTPAQGHDTV QPNFQQDKFL GRWYSAGLAS NSSWFREKKE LLFMCQTVVA
PSTEGGLNLT STFLRKNQCE TKVMVLQPAG VPGQYTYNSP HWGSFHSLSV VETDYDEYAF LFSKGTKGPG
QDFRMATLYS RAQLLKEELK EKFITFSKDQ GLTEEDIVFL PQPDKCIQE
>gi sequence number 5 184 aa
MMRILLALSL GVACCSLWVG AEVQVQPDFQ KEKVLGKWYG IGLASNSNWF KDRKSHMKMC TTIITPTADG
NLEVTATYPK MDRCETKSMT YFKTEQLGGF RAKSPRYGSE HDMRVVETNY DEYILMYTVK TKGSETNQIV
SLFGRDKDLR PELLDKFQNF AKSQGLADDN IIILPHTDQC MTEA
Table 2
--------
Frequency distribution of observed matches between all possible segments
of length 9.
(1) in ALL the sequences submitted
(2) in the same sequences after shuffling their residues
the distance between segments being calculated from the score matrix.
The differences between observed and random frequencies
are tested by a chi-square statistic; NS:p>0.05, S:p<=0.05
A significant difference indicates that similarity between AT LEAST SOME
sequences departs from randomness.
-----------------------------------------
 Distance       (1)       (2)    Proba.
-----------------------------------------
  288.000         1         0    NS
  306.000         6         0    S
  324.000        16         0    S
  342.000        70         0    S
  360.000        95         0    S
  378.000       101         0    S
  396.000       162         0    S
  414.000       114         0    S
  432.000       170         0    S
  450.000       154         0    S
  468.000       130         2    S
  486.000       126         7    S
  504.000       123        12    S
  522.000       137        56    S
  540.000       218       192    NS
  558.000       337       453    NS
  576.000      1415      1651    NS
  594.000      2531      2874    NS
  612.000      8405      8779    NS
  630.000     18276     19544    NS
  648.000     25118     26515    NS
  666.000     54360     55296    NS
  684.000     53085     54067    NS
  702.000     74934     73179    NS
  720.000     51321     49057    NS
  738.000     14516     14127    NS
  756.000      3522      3588    NS
  774.000       142       184    NS
  792.000         4         6    NS
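The counting behind this table can be sketched as follows. This is only an
illustration under stated assumptions: the distance used here is a plain
mismatch count standing in for the blosum45-derived distance, the function
names are hypothetical, and the two inputs are short excerpts of sequences
2 and 3, so the numbers will not reproduce the table above.

```python
import random
from collections import Counter
from itertools import combinations

def segments(seq, length=9):
    """All overlapping segments of the given length."""
    return [seq[i:i + length] for i in range(len(seq) - length + 1)]

def toy_distance(a, b):
    # Assumption: mismatch count, NOT the score-matrix distance of the server.
    return sum(x != y for x, y in zip(a, b))

def match_histogram(seqs, length=9, distance=toy_distance):
    """Frequency of each distance over all segment pairs of all sequence pairs."""
    hist = Counter()
    for s, t in combinations(seqs, 2):
        for a in segments(s, length):
            for b in segments(t, length):
                hist[distance(a, b)] += 1
    return hist

def shuffled(seqs, seed=0):
    """Shuffle the residues of each sequence (fixed seed for repeatability)."""
    rng = random.Random(seed)
    out = []
    for s in seqs:
        chars = list(s)
        rng.shuffle(chars)
        out.append("".join(chars))
    return out

# Excerpts of submitted sequences 2 and 3 (28 residues each).
seqs = ["APEAQVSVQPNFQPDKFLGRWFSAGLAS", "PAQGHDTVQPNFQQDKFLGRWYSAGLAS"]
observed = match_histogram(seqs)
random_hist = match_histogram(shuffled(seqs))
for d in sorted(set(observed) | set(random_hist)):
    print(d, observed[d], random_hist[d])
```

Each printed line gives a distance, the observed frequency (1) and the
frequency after shuffling (2), in the same column order as the table.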
Figure 1
----------
Comparison between the observed matches in the submitted sequences (*)
and in the same sequences after shuffling (o) for ALL THE SEQUENCES.
More matches than expected by chance indicate
that similarity between at least some sequences departs from randomness.
[ASCII plot not reproduced: log cumulated frequency (vertical axis) against
distance (horizontal axis); * = observed, o = after shuffling.]
Distance calculated from the score matrix blosum45.
Table 3
--------
Frequency distribution of observed matches between all possible segments
of length 9.
(1) in the LEAST RELATED pair of sequences
(2) in the same sequences after shuffling their residues
the distance between segments being calculated from the score matrix.
The differences between observed and random frequencies
are tested by a chi-square statistic; NS:p>0.05, S:p<=0.05
A significant difference indicates that similarity between the least
related sequences departs from randomness.
-----------------------------------------
 Distance       (1)       (2)    Proba.
-----------------------------------------
  396.000         1         0    NS
  414.000         5         0    S
  432.000         7         0    S
  450.000         7         0    S
  468.000         8         0    S
  486.000        12         0    S
  504.000        20         2    S
  522.000        17         3    S
  540.000        26         6    S
  558.000        30        25    NS
  576.000        40        43    NS
  594.000       165       171    NS
  612.000       244       262    NS
  630.000       838       869    NS
  648.000      1730      1892    NS
  666.000      2560      2591    NS
  684.000      5345      5250    NS
  702.000      5456      5527    NS
  720.000      7613      7634    NS
  738.000      5477      5623    NS
  756.000      1739      1510    S
  774.000       486       432    S
  792.000        27        16    S
  810.000         3         0    S
Figure 3
---------
Comparison between the observed matches for the LEAST RELATED pair
of sequences (*) and in the same sequences after shuffling (o).
The least related pair appears to be sequences 4 and 5.
More matches than expected by chance indicate that similarity between
the least related sequences departs from randomness.
[ASCII plot not reproduced: log cumulated frequency (vertical axis) against
distance (horizontal axis); * = observed, o = after shuffling.]
Distance calculated from the score matrix blosum45.
Table 4
--------
SIMILARITY MATRIX between the sequences
1) The coefficient Rij (0 <= Rij <=1) is the proportion of segments
of sequence i matching with at least one segment of sequence j.
2) An asterisk marks the pairs of sequences with more matches
than expected by chance.
Sequences       1       2       3       4       5
   1 gi     1.00*   0.91*   0.97*   0.97*   0.90*
   2 gi     1.00*   1.00*   1.00*   1.00*   0.90*
   3 gi     0.96*   0.89*   1.00*   1.00*   0.85*
   4 gi     0.96*   0.90*   1.00*   1.00*   0.84*
   5 gi     0.86*   0.86*   0.86*   0.88*   1.00*
= Computational notes =
1) Matches are defined with respect to a statistical cutoff.
To get an optimal discrimination, it is computed as the average
of the cutoff at which random noise appears and the one
at which it equals the signal observed between identical sequences.
Thus a very low, or even numerically null, coefficient may
be associated with a significant difference when
only short segments of a pair of sequences appear very similar.
2) This matrix is not symmetrical:
If a sequence i is shorter than a sequence j, then sequence i
can be very similar to a part of sequence j, but sequence j
can be only partly similar to sequence i, so that Rij > Rji.
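The definition of Rij and the asymmetry described in note 2 can be sketched
as follows. This is an illustration under stated assumptions, not the
server's implementation: the distance is a plain mismatch count instead of
the score-matrix distance, the cutoff of 0 is arbitrary rather than the
statistical cutoff of note 1, and the function names and the two test
sequences are hypothetical.

```python
def segments(seq, length=9):
    """All overlapping segments of the given length."""
    return [seq[i:i + length] for i in range(len(seq) - length + 1)]

def mismatch_distance(a, b):
    # Assumption: mismatch count stands in for the score-matrix distance.
    return sum(x != y for x, y in zip(a, b))

def similarity_coefficient(seq_i, seq_j, length=9, cutoff=0,
                           distance=mismatch_distance):
    """Proportion of segments of seq_i matching at least one segment of seq_j."""
    segs_i = segments(seq_i, length)
    segs_j = segments(seq_j, length)
    if not segs_i:
        return 0.0
    matched = sum(
        1 for a in segs_i
        if any(distance(a, b) <= cutoff for b in segs_j)
    )
    return matched / len(segs_i)

# Hypothetical pair: the short sequence is contained in the long one.
short = "DKFLGRWFSAG"
long_ = "MATHHTLWMGDKFLGRWFSAGLASNSSW"
r_short_long = similarity_coefficient(short, long_)
r_long_short = similarity_coefficient(long_, short)
print(r_short_long, r_long_short)
```

Every segment of the short sequence finds a match in the long one, while
only a few segments of the long sequence find a match in the short one,
so the coefficient is larger in the first direction.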
Table 5
--------
The similarity matrix is treated by principal coordinates analysis to
produce a graphical representation of the similarity between the sequences.
Sequences are represented in a three-dimensional space, each factor
being associated with a percentage of the total variability between the sequences.
The first factor is generally trivial, and the grouping of the sequences is
performed in the plane of factors 2 & 3.
Sequences |   95.9 %    3.6 %    0.8 %   of variability between the sequences
gi        |   -0.988    0.021    0.157
gi        |   -1.002    0.070    0.029
gi        |   -0.988    0.142   -0.070
gi        |   -0.990    0.118   -0.081
gi        |   -0.926   -0.376   -0.038
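The percentages printed above each factor can be checked against the
coordinates themselves. The sketch below rests on an assumption that the
report does not state: that the share of a factor is the sum of its squared
coordinates divided by the number of sequences (the trace of a similarity
matrix with a unit diagonal). The function name is hypothetical.

```python
# Coordinates copied from Table 5 (one row per sequence, factors 1 to 3).
coords = [
    (-0.988,  0.021,  0.157),
    (-1.002,  0.070,  0.029),
    (-0.988,  0.142, -0.070),
    (-0.990,  0.118, -0.081),
    (-0.926, -0.376, -0.038),
]

def factor_percentages(rows):
    """Percentage of variability per factor, assuming total = number of rows."""
    total = len(rows)  # assumption: trace of a unit-diagonal similarity matrix
    n_factors = len(rows[0])
    return [
        round(100 * sum(row[k] ** 2 for row in rows) / total, 1)
        for k in range(n_factors)
    ]

percentages = factor_percentages(coords)
print(percentages)
```

Under this assumption the computed values agree with the printed 95.9 %,
3.6 % and 0.8 % to one decimal place.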
Figure 4
----------
Graphical representation of the sequences in the plane of factors 2 & 3.
Superimposed labels are printed in uppercase and listed below.
Use the landscape output format to increase the resolution.
[ASCII scatter plot not reproduced. Horizontal axis: factor 2, from -0.376
to 0.142. Vertical axis: factor 3, from -0.081 to 0.157.]
Superimposed labels
___________________
Printed row col Superimposed
label # # labels
____________________________________
GI======== 20 69 gi________
___________________________________________________________________________
MATCH-BOX_server 1.3 13-May-99 20:01:05
Execution successful
___________________________________________________________________________
Alignment Report File
___________________________________________________________________________
MATCH-BOX_server 1.3 13-May-99 20:01:10
Molecular Biology - University of NAMUR - BELGIUM
Internet address: matchbox@biq.fundp.ac.be
WEB: http://www.fundp.ac.be/sciences/biologie/bms/matchbox_submit.html
___________________________________________________________________________
ALIGN
___________________________________________________________________________
Please choose the font Monaco or Courier 11.
Table 1: submitted set of 5 sequences
----------------------------------------
>gi sequence number 1 190 aa
MATHHTLWMG LALLGVLGDL QAAPEAQVSV QPNFQQDKFL GRWFSAGLAS NSSWLREKKA ALSMCKSVVA
PATDGGLNLT STFLRKNQCE TRTMLLQPAG SLGSYSYRSP HWGSTYSVSV VETDYDQYAL LYSQGSKGPG
EDFRMATLYS RTQTPRAELK EKFTAFCKAQ GFTEDTIVFL PQTDKCMTEQ
>gi sequence number 2 168 aa
APEAQVSVQP NFQPDKFLGR WFSAGLASNS SWLQEKKAAL SMCKSVVAPA ADGGFNLTST FLRKNQCETR
TMLLQPGDSL GSYSYRSPHW GSTYSVSVVE TDYDHYALLY SQGSKGPGED FRMATLYSRT QTPRAELKEK
FTAFCKAQGF TEDSIVFLPQ TDKCMTEQ
>gi sequence number 3 189 aa
MAALRMLWMG LVLLGLLGFP QTPAQGHDTV QPNFQQDKFL GRWYSAGLAS NSSWFREKKA VLYMCKTVVA
PSTEGGLNLT STFLRKNQCE TKIMVLQPAG APGHYTYSSP HSGSIHSVSV VEANYDEYAL LFSRGTKGPG
QDFRMATLYS RTQTLKDELK EKFTTFSKAQ GLTEEDIVFL PQPDKCIQE
>gi sequence number 4 189 aa
MAALPMLWTG LVLLGLLGFP QTPAQGHDTV QPNFQQDKFL GRWYSAGLAS NSSWFREKKE LLFMCQTVVA
PSTEGGLNLT STFLRKNQCE TKVMVLQPAG VPGQYTYNSP HWGSFHSLSV VETDYDEYAF LFSKGTKGPG
QDFRMATLYS RAQLLKEELK EKFITFSKDQ GLTEEDIVFL PQPDKCIQE
>gi sequence number 5 184 aa
MMRILLALSL GVACCSLWVG AEVQVQPDFQ KEKVLGKWYG IGLASNSNWF KDRKSHMKMC TTIITPTADG
NLEVTATYPK MDRCETKSMT YFKTEQLGGF RAKSPRYGSE HDMRVVETNY DEYILMYTVK TKGSETNQIV
SLFGRDKDLR PELLDKFQNF AKSQGLADDN IIILPHTDQC MTEA
The basic principle of Match-Box is to delineate boxes of similar segments in
ALL the sequences. In one box, any segment is significantly similar to any
other one. Similarity between segments is computed from the scoring matrix,
and the matching criterion is defined by a statistical cutoff.
The current score matrix used for the sequence analysis is
blosum45.sco
In the final alignment, the selected boxes are only a subset of all the
boxes found. Boxes incompatible with the proposed alignment, if any, are
rejected. Table 2 shows how many boxes have been selected and rejected,
and their length. Table 3 displays the boxes selected in the final alignment.
In a successful alignment, rejected boxes are normally short boxes.
A large rejected box would be an indication of a possible misalignment.
Table 2: Boxes length distribution
------------------------------------
Length        Frequency
          Selected  Rejected
    39         1         0
   127         1         0
Table 3
--------
Boxes selected for the optimal alignment
(1) box number
(2) pattern of gaps
(3) first residue number
(4) sequences
(5) last residue number
1 22 23 apeaqvsvqpnfqqdkflgrwfsaglasnsswlrekkaalsmcksvvapatdgg 76
1 0 1 apeaqvsvqpnfqpdkflgrwfsaglasnsswlqekkaalsmcksvvapaadgg 54
1 22 23 paqghdtvqpnfqqdkflgrwysaglasnsswfrekkavlymcktvvapstegg 76
1 22 23 paqghdtvqpnfqqdkflgrwysaglasnsswfrekkellfmcqtvvapstegg 76
1 17 18 wvgaevqvqpdfqkekvlgkwygiglasnsnwfkdrkshmkmcttiitptadgn 71
1 22 77 lnltstflrknqcetrtmllqpagslgsysyrsphwgstysvsvvetdydqyal 130
1 0 55 fnltstflrknqcetrtmllqpgdslgsysyrsphwgstysvsvvetdydhyal 108
1 22 77 lnltstflrknqcetkimvlqpagapghytyssphsgsihsvsvveanydeyal 130
1 22 77 lnltstflrknqcetkvmvlqpagvpgqytynsphwgsfhslsvvetdydeyaf 130
1 17 72 levtatypkmdrcetksmtyfkteqlggfraksprygsehdmrvvetnydeyil 125
1 22 131 lysqgskgpgedfrmatly 149
1 0 109 lysqgskgpgedfrmatly 127
1 22 131 lfsrgtkgpgqdfrmatly 149
1 22 131 lfskgtkgpgqdfrmatly 149
1 17 126 mytvktkgsetnqivslfg 144
2 22 151 rtqtpraelkekftafckaqgftedtivflpqtdkcmte 189
2 0 129 rtqtpraelkekftafckaqgftedsivflpqtdkcmte 167
2 22 151 rtqtlkdelkekfttfskaqglteedivflpqpdkciqe 189
2 22 151 raqllkeelkekfitfskdqglteedivflpqpdkciqe 189
2 16 145 rdkdlrpelldkfqnfaksqgladdniiilphtdqcmte 183
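The box lengths listed in Table 2 can be recovered from the residue numbers
of Table 3. The short sketch below does that arithmetic. The spans are
copied from the table (box 1 is printed in three chunks per sequence, so
its span runs from the first residue of the first chunk to the last residue
of the last chunk); the function name is hypothetical.

```python
# (first residue, last residue) of each box in sequences 1 to 5, from Table 3.
boxes = {
    1: [(23, 149), (1, 127), (23, 149), (23, 149), (18, 144)],
    2: [(151, 189), (129, 167), (151, 189), (151, 189), (145, 183)],
}

def box_lengths(boxes):
    """Length of each box, checking that it is the same in every sequence."""
    lengths = {}
    for number, spans in boxes.items():
        sizes = {last - first + 1 for first, last in spans}
        if len(sizes) != 1:
            raise ValueError(f"box {number} has inconsistent lengths: {sizes}")
        lengths[number] = sizes.pop()
    return lengths

lengths = box_lengths(boxes)
print(lengths)
```

The result agrees with the two selected boxes of length 127 and 39 reported
in Table 2.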
Table 4 : optimal multiple alignment with indices of reliability
----------------------------------------------------------------
Sequences number, length and name
_________________________________
1 190 gi 2 168 gi 3 189 gi
4 189 gi 5 184 gi
          10        20        30        40        50        60        70
           +         +         +         +         +         +         +
1 MATHHTLWMGLALLGVLGDLQAapeaqvsvqpnfqqdkflgrwfsaglasnsswlrekkaalsmcksvva
2 ----------------------apeaqvsvqpnfqpdkflgrwfsaglasnsswlqekkaalsmcksvva
3 MAALRMLWMGLVLLGLLGFPQTpaqghdtvqpnfqqdkflgrwysaglasnsswfrekkavlymcktvva
4 MAALPMLWTGLVLLGLLGFPQTpaqghdtvqpnfqqdkflgrwysaglasnsswfrekkellfmcqtvva
5 -----MMRILLALSLGVACCSLwvgaevqvqpdfqkekvlgkwygiglasnsnwfkdrkshmkmcttiit
                        977444422222222222222222222222222244444444444444
          80        90       100       110       120       130       140
           +         +         +         +         +         +         +
1 patdgglnltstflrknqcetrtmllqpagslgsysyrsphwgstysvsvvetdydqyallysqgskgpg
2 paadggfnltstflrknqcetrtmllqpgdslgsysyrsphwgstysvsvvetdydhyallysqgskgpg
3 pstegglnltstflrknqcetkimvlqpagapghytyssphsgsihsvsvveanydeyallfsrgtkgpg
4 pstegglnltstflrknqcetkvmvlqpagvpgqytynsphwgsfhslsvvetdydeyaflfskgtkgpg
5 ptadgnlevtatypkmdrcetksmtyfkteqlggfraksprygsehdmrvvetnydeyilmytvktkgse
  4444444444444444444444444447799777777444444444444222222222224444777777
         150       160       170       180       190       200       210
           +         +         +         +         +         +         +
1 edfrmatlySrtqtpraelkekftafckaqgftedtivflpqtdkcmteQ
2 edfrmatlySrtqtpraelkekftafckaqgftedsivflpqtdkcmteQ
3 qdfrmatlySrtqtlkdelkekfttfskaqglteedivflpqpdkciqe-
4 qdfrmatlySraqllkeelkekfitfskdqglteedivflpqpdkciqe-
5 tnqivslfg-rdkdlrpelldkfqnfaksqgladdniiilphtdqcmteA
  777799999 444444444444444444444444444444444444444
Table 4 : Aligned residues (included in boxes) are printed in
lowercase. Other residues (uppercase) are NOT aligned.
Only the multiple alignment of the WHOLE set of sequences is performed.
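The lowercase convention makes it easy to pull the aligned stretches out of
a row of Table 4. The sketch below is a minimal illustration; the function
name is hypothetical, and the columns it reports are counted within the
printed block, not along the whole alignment.

```python
import re

def aligned_runs(row):
    """Return (start column, residues) for each run of lowercase residues.

    Columns are 1-based and relative to the start of the given row.
    """
    return [(m.start() + 1, m.group()) for m in re.finditer(r"[a-z]+", row)]

# Last block of sequence 5 in Table 4.
row = "tnqivslfg-rdkdlrpelldkfqnfaksqgladdniiilphtdqcmteA"
runs = aligned_runs(row)
for start, residues in runs:
    print(start, len(residues), residues)
```

The gap and the final uppercase residue are left out, so the two runs
correspond to the end of box 1 and the whole of box 2 for this sequence.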
RELIABILITY SCORES
A score from 1 to 9 is written below each position in the boxes.
It is related to the statistical significance of the alignment at this
position. A score of 5 corresponds to a similarity of equal occurrence in
related and unrelated sequences. The lower the score, the higher the reliability
of the alignment. As an example, the following results have been obtained
on 20 families of known structures sharing between 9% and 44% of conserved
residues.
Percentage of correctly predicted aligned residues obtained in TESTS:
Reliability    Minimum    Maximum
   Score          %          %
------------------------------------
     6           41.3       86.8
     5           48.8      100
     4           73.9      100
     3           84.6      100
     2          100        100
------------------------------------
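A line of reliability scores can be summarised with a few lines of code.
The sketch below counts how often each score occurs and which fraction of
the scored positions falls at or below a threshold. The threshold of 4 is
an assumption chosen for illustration (scores below 5 being on the reliable
side), the function name is hypothetical, and the input is a shortened
example in the style of the score lines of Table 4.

```python
from collections import Counter

def reliability_summary(score_line, threshold=4):
    """Count the scores in a line and the fraction at or below the threshold.

    Characters that are not digits (spaces under unaligned columns) are ignored.
    """
    digits = [int(c) for c in score_line if c.isdigit()]
    counts = Counter(digits)
    if not digits:
        return counts, 0.0
    reliable = sum(n for score, n in counts.items() if score <= threshold)
    return counts, reliable / len(digits)

# Shortened illustrative score line, not a full line of the report.
counts, fraction = reliability_summary("777799999 4444")
print(dict(counts), fraction)
```

The same function can be applied to each score line of Table 4 to see how
much of the alignment lies in the more reliable range.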
GAPS
When lowercase amino acids are aligned to gaps, it means that the position
of the gaps is not completely defined. If two successive selected boxes
overlap by a maximum of k amino acids in one of the sequences,
the final alignment will show a gap aligned with lowercase amino acids.
Part of this gap, or the whole gap, can then be moved partially or totally
to the right by r positions (r being less than or equal to k).
It means that Match-Box is not able to fix exactly the position of this
gap, but that the gap can be placed somewhere to the right within
a range of k amino acids.
Please refer to Table 3 for the precise limits of the boxes.
You may resubmit a subset of your sequences in order to
refine the within-group alignment. Results of EXPLORE may help you
in defining groups of sequences.
A PostScript file with the boxes outlined can be obtained.
___________________________________________________________________________
MATCH-BOX_server 1.3 13-May-99 20:01:10
Execution successful
___________________________________________________________________________