PIMA Help
NAME
pima - Pattern-Induced Multi-sequence Alignment program
SYNOPSIS
pima [options] cluster_name seq_filename
[ref_seq_name sec_struct_seq_filename]
EXAMPLES
pima SAMPLE sample-family.fa
pima SAMPLE-STRUCT sample-struct.fa 1ldm pdb-dssp.ss
DESCRIPTION
pima performs a multi-sequence alignment of a set of
(presumably related) sequences using an extension of our
covering pattern construction algorithm (Smith and Smith
1990, 1992). All pairwise comparisons between sequences in
the set are performed and the resulting scores clustered
into one or more families using two different linkage
rules: 1) maximal linkage (Smith and Smith, 1990) and 2)
sequential branching (see Smith and Smith, 1992). For the
latter, all pairwise scores are sorted high-to-low, the
first sequence from the highest scoring pair is chosen as
the "reference sequence", and the sequences are clustered
based strictly on the order of similarity to the reference
sequence. Each cluster is then multiply-aligned using a
pattern-based alignment algorithm (Smith and Smith, 1992).
Patterns are constructed using one of two extended amino
acid alphabets (see below).
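
The sequential branching rule described above can be sketched in a
few lines of Python. This is only an illustration of the ordering
step, not the code pima uses: the function name, the dictionary of
pairwise scores, and the tie-breaking by name are all assumptions,
and the cluster score cutoff (-c) is not applied.

```python
def sequential_branching_order(pair_scores):
    """Pick a reference sequence and order the rest by similarity to it.

    pair_scores maps (name_a, name_b) tuples to a pairwise match score.
    This input layout is hypothetical; it is not a pima file format.
    """
    # Sort all pairwise scores high-to-low.
    ranked = sorted(pair_scores.items(), key=lambda item: item[1], reverse=True)

    # The first sequence of the highest scoring pair is the reference.
    (reference, _), _ = ranked[0]

    def score_to_reference(name):
        # Scores are stored once per pair, so look up both orientations.
        if (reference, name) in pair_scores:
            return pair_scores[(reference, name)]
        return pair_scores[(name, reference)]

    others = sorted({n for pair in pair_scores for n in pair if n != reference})

    # Order the remaining sequences strictly by similarity to the reference.
    order = sorted(others, key=score_to_reference, reverse=True)
    return reference, order


scores = {("A", "B"): 200.0, ("A", "C"): 150.0, ("B", "C"): 120.0}
reference, order = sequential_branching_order(scores)
print(reference, order)
```

The sketch assumes a score is available for every pair that includes
the reference sequence, which holds when all pairwise comparisons
have been performed.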
If secondary structure sequences are provided for one or
more of the primary sequences (one of which must be
designated as the "reference sequence"), then the sequences
are clustered using the sequential branching rule and the
set is multiply-aligned using a secondary structure-dependent
gap penalty algorithm (Smith and Smith, 1992).
Original Amino Acid Class Hierarchy Alphabet (Class1 alphabet):
Amino Acid Classes Match score
-2
_______________ X __________________ 0
/ / \ \
_ f _ / ______r _______ \ 1
/ / \ / / / \ \ \
/ c \ e / m p \ _ j __ 2
/ / \ \ / \ / / \ / \ \ / \ \
/ a b d \ / l k o n i h \ 3
/ / \ / \ /|\ \ / / \ / \ / \ /\ / \ / \ \
C I V L M F W Y H N D E Q K R S T A G P 5
New 83 Character Pattern Alphabet (Patgen alphabet):
We have recently developed an alternate pattern alphabet
that includes the standard IUPAC codes for the 20 amino
acids plus additional characters for 63 combinations of
amino acids. These combinations carry the most information
(i.e., they are the most over-represented relative to random
expectation) in our database of aligned sequence families
(Ladunga I, Wiese B, and Smith RF, in preparation):
J IV f LV n AV t QK 1 PT 8 AE ( QP ; NH _ NE
U RK h AG Z QE u RQ 2 NG 9 AL ) AST < QS { IF
a DE i ILV o AT v DG 3 QH ! NT * ILM ? QL | SV
b IL j LF p PS w LP 4 LS # ES + KT @ MV } RP
c FY B ND q NS y EG 5 TV $ IT , GP [ EP ~ RH
d ST k LM r AP z RG 6 HY % DS / KS ] AGS . GK
e AS m GS s EK 0 NK 7 IM & RS : LT ^ GT X (wildcard)
For both alphabets, gaps are denoted by "g"s.
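
As a rough illustration of how a Patgen pattern character stands for
a set of residues, the sketch below checks whether a residue belongs
to the set named by a symbol. Only a handful of entries from the
table above are included, the names PATGEN_SUBSET and symbol_matches
are hypothetical, and the check is plain set membership; the actual
match scores come from the matrix file and are not modelled here.
The gap character "g" is deliberately not handled.

```python
# A few entries copied from the Patgen table above (not the full alphabet).
PATGEN_SUBSET = {
    "J": "IV", "U": "RK", "a": "DE", "b": "IL", "c": "FY",
    "d": "ST", "e": "AS", "i": "ILV", ")": "AST", "]": "AGS",
}

# The 20 standard amino acid one-letter codes.
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def residues_for(symbol):
    """Return the set of residues a pattern symbol stands for."""
    if symbol == "X":                 # wildcard: any residue
        return set(AMINO_ACIDS)
    if symbol in AMINO_ACIDS:         # a plain residue stands for itself
        return {symbol}
    if symbol in PATGEN_SUBSET:       # a combination character
        return set(PATGEN_SUBSET[symbol])
    raise KeyError(f"symbol {symbol!r} is not in this partial table")


def symbol_matches(symbol, residue):
    """True if the residue is a member of the symbol's residue set."""
    return residue in residues_for(symbol)


print(symbol_matches("i", "V"))   # 'i' stands for I, L or V
print(symbol_matches("i", "A"))
```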
PARAMETERS
cluster_name
An arbitrary name used to label the cluster.
seq_filename
Name of the input file containing the sequences to be
clustered and multi-aligned. Sequences can be in any
of the following formats: IG/Stanford, GenBank/GB,
NBRF, EMBL, Pearson/Fasta, PIR/CODATA, Table
(LOCUS_NAME SEQUENCE [one seq/line]). Locus names
cannot contain left or right parentheses. The output
sequence files are written in the same format as this
input file.
ref_seq_name
[optional; if specified, then sec_struct_seq_filename
must also be specified] Locus name of one of the
primary sequences whose secondary structure is given
in the file sec_struct_seq_filename.
sec_struct_seq_filename
[optional; if specified, then ref_seq_name must also
be specified] Name of a file containing secondary
structure sequences for one or more of the primary
sequences in the set. The secondary structure
sequences in this file must be in one of the formats
listed above (see seq_filename). The locus name of
each sequence must be the locus name of its
corresponding primary sequence with the suffix '.ss'
(e.g. 1ldm.ss). Alpha-helix, 3-10 helix, and
beta-strand residues must be designated 'h', 'g', and
'e', respectively. All other characters in the
secondary structure sequences are ignored by the
structure-dependent gap penalty. To allow gaps to be
placed between the first and the second and the last
elements of these structures, the first and last 2
elements of each should be changed to another
character designation. In the secondary structure
sequence file pdb-dssp.ss provided with this package,
these end cap elements are designated 'i', 'f', and
'd', for alpha-helices, 3-10 helices, and
beta-strands, respectively.
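
The end-cap convention can be illustrated with a short Python sketch
that rewrites the ends of each run of 'h', 'g' or 'e' using the cap
characters 'i', 'f' and 'd'. The function names and the n_cap
parameter are hypothetical. The text above can be read as capping
either one or two elements at each end, so the width is left
adjustable and defaults to one purely as an assumption. The second
function shows, under the further assumption that every
non-structure position simply receives the ordinary gap-opening
penalty, why capped positions become cheaper places for a gap; the
penalty values are made up, since the real defaults depend on the
matrix file.

```python
import re

# Cap characters used in pdb-dssp.ss for each structure character.
CAP_FOR = {"h": "i", "g": "f", "e": "d"}


def cap_ends(ss, n_cap=1):
    """Replace the first and last n_cap elements of each structure run."""
    if n_cap < 1:
        raise ValueError("n_cap must be at least 1")

    def recap(match):
        run = match.group(0)
        cap = CAP_FOR[run[0]]
        if len(run) <= 2 * n_cap:
            # Run too short to keep an interior: cap all of it.
            return cap * len(run)
        return cap * n_cap + run[n_cap:-n_cap] + cap * n_cap

    # Each run of identical structure characters is treated separately.
    return re.sub(r"h+|g+|e+", recap, ss)


def gap_penalties(ss, structure_chars="hge", open_penalty=1.0,
                  structure_penalty=None):
    """Per-position gap cost: higher where a structure character sits."""
    if structure_penalty is None:
        # Documented default ratio: 10 times the length independent penalty.
        structure_penalty = 10 * open_penalty
    return [structure_penalty if c in structure_chars else open_penalty
            for c in ss]


capped = cap_ends("--hhhhh--eee-")
print(capped)
print(gap_penalties(capped, open_penalty=2.0))
```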
OPTIONS
-c number Use a cluster score cutoff of number. This
is the lowest match score to be used to
incorporate a sequence into a cluster. The
default value of 0.0 will force all input
sequences into 1 cluster, but the final pat-
tern may be completely degenerate.
-d number Use a length dependent gap penalty of number.
This is the cost of extending a gap. The
default value is dependent on the matrix file
used.
-h This option will print a short help message
and quit.
-i number Use a length independent gap penalty of
number. This is the cost of opening a gap.
The default value is dependent on the matrix
file used.
-l number Use a minimum local score of number. This is
the lowest score a quadrant can have before
an attempt is made to join this local align-
ment with the local alignment at the previous
step. The default value is dependent on the
matrix file used.
-m file Use matrix file with the name file. The
default matrix (class1.mat) uses the original
amino acid class hierarchy alphabet. The
matrix file patgen.mat uses the new 83
character pattern alphabet.
-n Do not use numerical extensions on each step
of the alignment.
-t number Use a secondary structure gap penalty of
number. This is the cost of a gap at a posi-
tion matching a secondary structure
character. The default value is dependent on
the matrix file used and is always 10 times
the value of the length independent gap
penalty of the matrix file.
-u characters Use characters as the list of secondary
structure characters instead of the default
characters of hge.
-w number Use a minimum local alignment width of
number instead of the default 15. A quadrant
with a width less than this value is ignored,
and no attempt is made to join this local
alignment with the local alignment at the
previous step.
-M Only perform maximal linkage. This option
will also drop the -ML from the output file
names.
To see the default values for a given matrix, run the program
pima-pm and enter the name of the matrix for which you want to
see the default values. Press return until you see the
default value of the parameter you are interested in, and then
interrupt the program (control-C).
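
When pima is driven from a script, the command line from the
SYNOPSIS can be assembled programmatically. The helper below is a
minimal sketch: build_pima_command is a hypothetical name, and it
only uses the option letters documented above. It also enforces the
rule that ref_seq_name and sec_struct_seq_filename must be given
together.

```python
VALUE_FLAGS = set("cdilmtuw")   # options that take an argument
BARE_FLAGS = set("hnM")         # options that take no argument


def build_pima_command(cluster_name, seq_filename, ref_seq_name=None,
                       sec_struct_seq_filename=None, options=None):
    """Return the pima argument list for the documented synopsis."""
    if (ref_seq_name is None) != (sec_struct_seq_filename is None):
        raise ValueError("ref_seq_name and sec_struct_seq_filename "
                         "must be specified together")

    command = ["pima"]
    for flag, value in (options or {}).items():
        if flag in VALUE_FLAGS:
            command += [f"-{flag}", str(value)]
        elif flag in BARE_FLAGS:
            if value:               # include the flag only when switched on
                command.append(f"-{flag}")
        else:
            raise ValueError(f"unknown option -{flag}")

    command += [cluster_name, seq_filename]
    if ref_seq_name is not None:
        command += [ref_seq_name, sec_struct_seq_filename]
    return command


print(build_pima_command("SAMPLE", "sample-family.fa",
                         options={"m": "patgen.mat", "M": True}))
```

The resulting list can be passed to subprocess.run(command,
check=True) from the Python standard library, provided pima and its
auxiliary programs are on the search path.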
OUTPUT FILES CREATED
cluster_name[-ML|-SB][.ext].cluster
The cluster tree(s) created by the clustering
algorithm(s): maximal linkage clusters are labeled
with '-ML' appended to the cluster_name; sequential
branching clusters are labeled '-SB'. If more than one
cluster is generated from the input sequence set, each
cluster is given an extension (cluster_name-ML.1,
cluster_name-ML.2, etc). Each cluster in a cluster
file is represented as a nested list with sequence
names separated by a match score, e.g.:
CLUSTER_NAME-ML((A 200.0 B) 150.0 C)
File format: cluster_name-[ML|SB][.ext]cluster_nested_list
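
A nested list such as the example above can be read back into a tree
with a small recursive parser. This is a sketch under assumptions:
parse_cluster_line is a hypothetical name, every parenthesized node
is assumed to hold exactly two children separated by one score (as
in the example), and the label is taken to be everything before the
first parenthesis. Because locus names cannot contain parentheses,
splitting on them is safe.

```python
import re


def parse_cluster_line(line):
    """Parse 'LABEL((A 200.0 B) 150.0 C)' into (label, nested tuple tree)."""
    start = line.index("(")
    label = line[:start].strip()

    # Tokens are parentheses or runs of non-space, non-parenthesis text.
    tokens = re.findall(r"[()]|[^\s()]+", line[start:])
    pos = 0

    def parse_node():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        if token != "(":
            return token                      # leaf: a sequence name
        left = parse_node()
        score = float(tokens[pos])            # match score joining the two
        pos += 1
        right = parse_node()
        if tokens[pos] != ")":
            raise ValueError("expected ')' in cluster list")
        pos += 1
        return (left, score, right)

    return label, parse_node()


def leaf_names(tree):
    """List the sequence names in a parsed tree, left to right."""
    if isinstance(tree, str):
        return [tree]
    left, _score, right = tree
    return leaf_names(left) + leaf_names(right)


label, tree = parse_cluster_line("CLUSTER_NAME-ML((A 200.0 B) 150.0 C)")
print(label, tree, leaf_names(tree))
```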
cluster_name[-ML|-SB][.ext].pattern
The "root" AACC pattern constructed from each cluster.
File format: cluster_name-[ML|SB][.ext]AACC_sequence
cluster_name[-ML|-SB][.ext].pima
The pattern-induced multiple-sequence alignment of
each clustered sequence set; includes the "nodal"
patterns used to align the sequences. The nodal
patterns have the locus name cluster_name-[ML|SB].ext;
extensions added to the sequence names match the
extension of the nodal pattern used to align the
corresponding sequence subset (e.g. seq_1-ML.1 and
seq_2-ML.1 would be aligned by nodal pattern
cluster_name-ML.1).
File format: Same as the format of the input sequence
file, seq_filename.
REQUIRED AUXILIARY PROGRAMS/SCRIPTS/FILES
Programs: cluster-pima, pima-mso, pima-pm,
extract-cluster-loci, extract-records, extract-root-pat,
print-cluster, trim-root-num, print-pima, make-cluster,
make-pattern
Files: class1.mat, patgen.mat
NOTES
Only minimal sequence information is maintained by the
sequence input and output routines. Additionally, not every
aspect of the various sequence file formats is handled
correctly. If in doubt, please use sequence files that are
in Fasta or table format.
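
For sequence files in other formats, one option is to convert them
to the table format first. The sketch below turns Fasta text into
one "LOCUS_NAME SEQUENCE" line per sequence. The function name is
hypothetical, and two details are assumptions: that the locus name
is the first word of the Fasta header, and that a single space
separates name and sequence in table format. It rejects names
containing parentheses, which pima does not allow.

```python
def fasta_to_table(fasta_text):
    """Convert Fasta text to table format, one sequence per line."""
    records = []
    name, chunks = None, []

    for raw_line in fasta_text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            header = line[1:].split()
            if not header:
                raise ValueError("Fasta header without a locus name")
            name, chunks = header[0], []
        elif name is not None:
            chunks.append(line)      # sequence may span several lines

    if name is not None:
        records.append((name, "".join(chunks)))

    for locus, _sequence in records:
        if "(" in locus or ")" in locus:
            raise ValueError(f"locus name {locus!r} contains a parenthesis")

    return "".join(f"{locus} {sequence}\n" for locus, sequence in records)


print(fasta_to_table(">seq1 example\nACDE\nFGH\n>seq2\nKLMN\n"), end="")
```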
REFERENCES
Smith, Randall F. and Smith, Temple F. (1990). Automatic
generation of primary sequence patterns from sets of related
protein sequences. PNAS 87:118-122.
Smith, Randall F. and Smith, Temple F. (1992).
Pattern-Induced Multi-sequence Alignment (PIMA) algorithm
employing secondary structure-dependent gap penalties for
comparative protein modelling. Protein Engineering 5:35-41.
Randall F. Smith
Human Genome Center, Dept. of Molecular and Human Genetics,
Baylor College of Medicine, Houston TX 77096
rsmith@bcm.tmc.edu
Temple F. Smith
Molecular Bio-Engineering Research Center
Boston Univ., 36 Cummington St, Boston, MA 02115
tsmith@darwin.bu.edu
Copyright (c) 1990, 1991, 1992, MBCRR, Dana-Farber Cancer Institute and Harvard University.
Copyright (c) 1993, 1994, Baylor College of Medicine.