ClustalW Multiple Sequence Alignment -- for DNA or proteins
(version 1.8, June 1999)
References:
- Jeanmougin, F., Thompson, J. D., Gouy, M., Higgins, D. G. and Gibson, T. J. (1998)
Multiple sequence alignment with Clustal X. Trends Biochem Sci, 23, 403-5.
- Thompson, J. D., Gibson, T. J., Plewniak, F., Jeanmougin, F. and Higgins, D. G. (1997)
The ClustalX windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools. Nucleic Acids Research, 25:4876-4882.
- Higgins, D. G., Thompson, J. D. and Gibson, T. J. (1996) Using CLUSTAL for
multiple sequence alignments. Methods Enzymol., 266, 383-402.
- Thompson, J. D., Higgins, D. G. and Gibson, T. J. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Research, 22:4673-4680.
- Higgins, D. G., Bleasby, A. J. and Fuchs, R. (1992) CLUSTAL V: improved software for multiple sequence alignment. CABIOS 8,189-191.
- Higgins, D. G. and Sharp, P. M. (1989) Fast and sensitive multiple sequence alignments on a microcomputer. CABIOS 5,151-153.
- Higgins, D. G. and Sharp, P. M. (1988) CLUSTAL: a package for performing multiple sequence alignment on a microcomputer. Gene 73,237-244.
If you have any questions or comments, please contact one of:
- Des Higgins, Higgins@EBI.ac.uk
- Julie Thompson, thompson@embl-heidelberg.de
- Toby Gibson, Gibson@EMBL-Heidelberg.DE
Basic Help
- How does the program recognise protein/DNA sequences?
The program tries to guess automatically whether each sequence is amino acid or nucleotide. This is not foolproof. If 85% or more of the characters in the sequence are A, C, G, T, U or N, the sequence is assumed to be nucleotide. This works in 97.3% of cases, but watch out!
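The 85% rule above can be sketched in a few lines of Python. This is only an illustration of the rule as described on this page, not the ClustalW source; the function name and the choice to count alphabetic characters only are assumptions.

```python
# Characters counted as nucleotide codes by the rule described above.
NUCLEOTIDE_CODES = set("ACGTUN")


def guess_sequence_type(sequence: str, threshold: float = 0.85) -> str:
    """Return 'DNA' or 'PROTEIN' using the 85% rule (hypothetical helper)."""
    # Assumption: only alphabetic characters take part in the count.
    letters = [c for c in sequence.upper() if c.isalpha()]
    if not letters:
        raise ValueError("sequence contains no alphabetic characters")
    nucleotide_fraction = sum(c in NUCLEOTIDE_CODES for c in letters) / len(letters)
    return "DNA" if nucleotide_fraction >= threshold else "PROTEIN"
```

A short peptide made up only of A, C, G and T residues would be classified as DNA by this rule, which is the kind of case the warning above refers to.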
- Choose the right sequence format:
7 formats are automatically recognised:
NBRF/PIR, EMBL/SWISSPROT, Pearson (Fasta), Clustal (*.aln), GCG/MSF (Pileup), GCG9/RSF and GDE flat file.
- Input sequence data:
Copy and paste sequence data into the text box. All sequences must be in the same format. All non-alphabetic characters (spaces, digits, punctuation marks) are ignored except "-" which is used to indicate a GAP ("." in GCG/MSF format).
Example protein sequence in Fasta format:
>gi|730305|
MATHHTLWMGLALLGVLGDLQAAPEAQVSVQPNFQQDKFLGRWFSAGLAS
NSSWLREKKAALSMCKSVVAPATDGGLNLTSTFLRKNQCETRTMLLQPAG
SLGSYSYRSPHWGSTYSVSVVETDYDQYALLYSQGSKGPGEDFRMATLYS
RTQTPRAELKEKFTAFCKAQGFTEDTIVFLPQTDKCMTEQ
>gi|404390|
APEAQVSVQPNFQPDKFLGRWFSAGLASNSSWLQEKKAALSMCKSVVAPA
ADGGFNLTSTFLRKNQCETRTMLLQPGDSLGSYSYRSPHWGSTYSVSVVE
TDYDHYALLYSQGSKGPGEDFRMATLYSRTQTPRAELKEKFTAFCKAQGF
TEDSIVFLPQTDKCMTEQ
>gi|895868
MAALRMLWMGLVLLGLLGFPQTPAQGHDTVQPNFQQDKFLGRWYSAGLAS
NSSWFREKKAVLYMCKTVVAPSTEGGLNLTSTFLRKNQCETKIMVLQPAG
APGHYTYSSPHSGSIHSVSVVEANYDEYALLFSRGTKGPGQDFRMATLYS
RTQTLKDELKEKFTTFSKAQGLTEEDIVFLPQPDKCIQE
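To check pasted data before submitting it, the cleaning rule described above (non-alphabetic characters are ignored, except "-" for a gap) can be mimicked with a small Fasta reader. This is a minimal sketch; the function name is hypothetical and it does not reproduce the server's own parser.

```python
from typing import Dict, List


def parse_fasta(text: str) -> Dict[str, str]:
    """Read Fasta text into {header: sequence} (hypothetical helper).

    Spaces, digits and punctuation in sequence lines are dropped;
    '-' is kept because it marks a gap.
    """
    records: Dict[str, List[str]] = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = []
        elif name is not None:
            cleaned = "".join(c for c in line if c.isalpha() or c == "-")
            records[name].append(cleaned)
    # Note: a repeated header overwrites the earlier record in this sketch.
    return {n: "".join(parts) for n, parts in records.items()}
```

Comparing the number of records returned with the number of sequences you intended to paste is a quick way to spot a missing ">" line.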
- Start alignment:
Click the radio button to the left of the ClustalW search to select it, then click the Submit button to start the alignment.
- Default parameter settings:
The default parameter settings of ClustalW are described on the Parameter page.
- Output:
'*' indicates positions which have a single, fully conserved residue.
':' indicates that one of the following 'strong' groups is fully conserved: STA NEQK NHQK NDEQ QHRK MILV MILF HY FYW
'.' indicates that one of the following 'weaker' groups is fully conserved: CSA ATV SAG STNK STPA SGND SNDEQK NDEQHK NEQHRK FVLIM HFY
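The way these symbols are assigned to an alignment column can be sketched as follows. This is an illustration rather than the ClustalW code: the function names are hypothetical, the last two weak groups are written as FVLIM and HFY as in the ClustalW distribution, and leaving columns that contain a gap unmarked is an assumption.

```python
STRONG_GROUPS = ["STA", "NEQK", "NHQK", "NDEQ", "QHRK", "MILV", "MILF", "HY", "FYW"]
# Assumption: last two weak groups as listed in the ClustalW distribution.
WEAK_GROUPS = ["CSA", "ATV", "SAG", "STNK", "STPA", "SGND",
               "SNDEQK", "NDEQHK", "NEQHRK", "FVLIM", "HFY"]


def conservation_symbol(column: str) -> str:
    """Return '*', ':', '.' or ' ' for one alignment column."""
    residues = set(column.upper())
    if not residues or "-" in residues:
        return " "  # assumption: columns with gaps get no symbol
    if len(residues) == 1:
        return "*"
    if any(residues <= set(group) for group in STRONG_GROUPS):
        return ":"
    if any(residues <= set(group) for group in WEAK_GROUPS):
        return "."
    return " "


def conservation_line(aligned_sequences) -> str:
    """Build the symbol line for equal-length aligned sequences."""
    return "".join(conservation_symbol("".join(col)) for col in zip(*aligned_sequences))
```

A column is "fully conserved" for a group when every residue in the column belongs to that group, which is what the subset test expresses.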
Advanced Help
- To change the parameter settings:
The default parameter settings are described on the Parameter page. The Option page lets you change these parameters to suit your requirements.
- DNA sequence parameters for Fast/Approximate Pairwise Alignment:
K-tuple size:
This is the size of the exactly matching fragment that is used. Increase for speed (max = 2 for proteins; 4 for DNA); decrease for sensitivity. For longer sequences (e.g. >1000 residues) you may need to increase the default.
Window size:
This is the number of diagonals around each of the 'best' diagonals that will be used. Decrease for speed; increase for sensitivity.
Scoring method:
You will have two options -- percentage or absolute.
Top diagonals:
The number of k-tuple matches on each diagonal (in an imaginary dot-matrix plot) is calculated. Only the best ones (with most matches) are used in the alignment. This parameter specifies how many. Decrease for speed; increase for sensitivity.
Gap penalty:
This is a penalty for each gap in the fast alignments. It has little effect on the speed or sensitivity except for extreme values.
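The roles of the k-tuple size and top diagonals parameters can be illustrated with a small sketch that counts exact k-tuple matches per diagonal of the imaginary dot-matrix plot and keeps the best ones. It is a simplified illustration only; the function name and default values are hypothetical, and the window, gap penalty and scoring steps are left out.

```python
from collections import Counter


def top_diagonals(seq_a: str, seq_b: str, ktuple: int = 2, top: int = 4):
    """Return the `top` diagonals with the most exact k-tuple matches.

    A diagonal is identified by (position in seq_a) - (position in seq_b).
    """
    # Index every k-tuple of seq_a by its start positions.
    positions = {}
    for i in range(len(seq_a) - ktuple + 1):
        positions.setdefault(seq_a[i:i + ktuple], []).append(i)

    # Count matches on each diagonal while scanning seq_b.
    diagonal_counts = Counter()
    for j in range(len(seq_b) - ktuple + 1):
        for i in positions.get(seq_b[j:j + ktuple], []):
            diagonal_counts[i - j] += 1

    return diagonal_counts.most_common(top)
```

A larger k-tuple size gives fewer chance matches to count, which is why it is faster but less sensitive; a larger `top` keeps more candidate diagonals, which is slower but more sensitive.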
- DNA sequence parameters for Multiple Alignment:
Gap opening penalty:
Increasing the gap opening penalty will make gaps less frequent.
Gap extension penalty:
Increasing the gap extension penalty will make gaps shorter. Terminal gaps are not penalized.
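The combined effect of the two penalties can be seen with a generic affine gap cost. This is a sketch under stated assumptions, not the exact ClustalW scoring, which also adjusts penalties by position; the function name is hypothetical, and charging the extension penalty for every gap position is one common convention among several.

```python
def gap_cost(length: int, gap_open: float, gap_extend: float,
             terminal: bool = False) -> float:
    """Generic affine cost of a single gap of `length` positions."""
    if length <= 0 or terminal:
        return 0.0  # terminal gaps are not penalised
    return gap_open + gap_extend * length


# With an opening penalty of 10.0 and an extension penalty of 0.5,
# one gap of three positions is cheaper than three separate gaps.
one_long_gap = gap_cost(3, 10.0, 0.5)
three_short_gaps = 3 * gap_cost(1, 10.0, 0.5)
```

Raising `gap_open` increases the cost of every additional gap, so gaps become less frequent; raising `gap_extend` increases the cost of each additional position, so gaps become shorter.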
Weight transition:
Gives transitions (A <--> G or C <--> T i.e. purine-purine or pyrimidine-pyrimidine substitutions) a weight between 0 and 1; a weight of zero means that the transitions are scored as mismatches, while a weight of 1 gives the transitions the match score. For distantly related DNA sequences, the weight should be near to zero; for closely related sequences it can be useful to assign a higher score.
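A minimal sketch of how such a weight could enter a nucleotide score is shown below. The match and mismatch values and the linear interpolation between them are assumptions made for illustration; they are not taken from the ClustalW source.

```python
PURINES = set("AG")
PYRIMIDINES = set("CT")


def is_transition(a: str, b: str) -> bool:
    """True for A<->G or C<->T substitutions."""
    return a != b and ({a, b} <= PURINES or {a, b} <= PYRIMIDINES)


def base_score(a: str, b: str, transition_weight: float,
               match: float = 1.0, mismatch: float = 0.0) -> float:
    """Score a pair of bases; match/mismatch values are hypothetical."""
    a, b = a.upper(), b.upper()
    if a == b:
        return match
    if is_transition(a, b):
        # Weight 0 scores the transition as a mismatch, weight 1 as a match.
        return mismatch + transition_weight * (match - mismatch)
    return mismatch
```

Transversions (purine to pyrimidine or the reverse) always receive the mismatch score, whatever the weight.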
- Protein sequence parameters for Fast/Approximate Pairwise Alignment:
K-tuple size:
This is the size of the exactly matching fragment that is used. Increase for speed (max = 2 for proteins; 4 for DNA); decrease for sensitivity. For longer sequences (e.g. >1000 residues) you may need to increase the default.
Window size:
This is the number of diagonals around each of the 'best' diagonals that will be used. Decrease for speed; increase for sensitivity.
Scoring method:
You will have two options -- percentage or absolute.
Top diagonals:
The number of k-tuple matches on each diagonal (in an imaginary dot-matrix plot) is calculated. Only the best ones (with most matches) are used in the alignment. This parameter specifies how many. Decrease for speed; increase for sensitivity.
Gap penalty:
This is a penalty for each gap in the fast alignments. It has little effect on the speed or sensitivity except for extreme values.
- Protein sequence parameters for Multiple Alignment:
Protein weight matrix:
There are two options for the protein weight matrix: BLOSUM and PAM. Each is a series of matrices; the actual matrix used at a given alignment step depends on how similar the sequences being aligned at that step are, because different matrices work best at different evolutionary distances. For further help, please refer to the original ClustalW documentation.
Gap opening penalty:
Increasing the gap opening penalty will make gaps less frequent.
Gap extension penalty:
Increasing the gap extension penalty will make gaps shorter. Terminal gaps are not penalised.
Hydrophilic gap penalties:
This will increase the chances of a gap within a run (5 or more residues) of hydrophilic amino acids; these are likely to be loop or random coil regions where gaps are more common.
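To see which regions of a sequence would be affected, runs of hydrophilic residues can be located as in the sketch below. The residue set GPSNDQEKR is assumed here to be the ClustalW default and may differ from what this server uses; the function name is hypothetical.

```python
import re

# Assumption: default hydrophilic residue set from the ClustalW distribution.
HYDROPHILIC = "GPSNDQEKR"


def hydrophilic_runs(sequence: str, residues: str = HYDROPHILIC,
                     min_length: int = 5):
    """Return (start, end) index pairs of hydrophilic runs, end exclusive."""
    pattern = re.compile("[%s]{%d,}" % (re.escape(residues), min_length))
    return [(m.start(), m.end()) for m in pattern.finditer(sequence.upper())]
```

For example, `hydrophilic_runs("MVLGPSNDQAAK")` reports the single stretch GPSNDQ; the final K is hydrophilic but too short a run to count.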
Residue specific penalties:
These are amino acid specific gap penalties that reduce or increase the gap opening penalties at each position in the alignment or sequence. For example, positions that are rich in glycine are more likely to have an adjacent gap than positions that are rich in valine.
- Other parameters:
Quicktree:
Always on. The guide tree for the alignment is built from the fast/approximate pairwise alignments.
Divergence cutoff:
For small divergence (say <10%) this option makes no difference. For greater divergence, this option corrects for the fact that observed distances underestimate actual evolutionary distances. This is because, as sequences diverge, more than one substitution will happen at many sites. However, you only see one difference when you look at the present day sequences. Therefore, this option has the effect of stretching branch lengths in trees (especially long branches).
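The size of this effect can be illustrated with Kimura's empirical correction for protein distances, K = -ln(1 - D - D^2/5), where D is the observed fraction of differing sites. Whether this page applies exactly this formula is an assumption; the sketch only shows how a correction of this kind stretches large distances while leaving small ones almost unchanged.

```python
import math


def kimura_protein_distance(observed: float) -> float:
    """Kimura's correction for multiple substitutions (protein sequences).

    `observed` is the fraction of differing sites, between 0 and 1.
    """
    inner = 1.0 - observed - 0.2 * observed ** 2
    if observed < 0.0 or inner <= 0.0:
        raise ValueError("observed distance is outside the usable range")
    return -math.log(inner)


small = kimura_protein_distance(0.05)  # stays close to 0.05
large = kimura_protein_distance(0.50)  # stretched to roughly 0.8
```

The formula stops being usable once the observed distance approaches about 0.85, where the logarithm's argument reaches zero.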
Gap separation distance:
Tries to decrease the chances of gaps being too close to each other. Gaps that are less than this distance apart are penalised more than other gaps. This does not prevent close gaps; it makes them less frequent, promoting a block-like appearance of the alignment.
End gap separation:
Treats end gaps just like internal gaps for the purposes of avoiding gaps that are too close (set by GAP SEPARATION DISTANCE above). If you turn this off, end gaps will be ignored for this purpose. This is useful when you wish to align fragments where the end gaps are not biologically meaningful.
Output order:
Controls the order of the sequences in the output alignment. If ALIGNED is set, the output order corresponds to the order in which the sequences were aligned (from the guide tree/dendrogram), thus automatically grouping closely related sequences. If it is switched to INPUT, the output will have the same order as the input sequences.
For further help on ClustalW, please refer to the ClustalW documentation.