SAPS output for unknown

[ISREC-Server] Date: Thu Oct 14 20:26:32 MET 1999


SAPS.  Version of April 11, 1996.
Date run: Thu Oct 14 20:26:32 1999


 SAPS (Statistical Analysis of Protein Sequences) evaluates by  statistical
 criteria a wide variety of protein sequence properties. A full description
 of the methods is given in the paper referred to below. The output is  or-
 ganized  in the following sections: file name, sequence printout, composi-
 tional analysis, charge distributional  analysis  (charge  clusters;  high
 scoring  (un)charged  segments; charge runs and patterns), distribution of
 other amino acid types (high scoring hydrophobic  and  transmembrane  seg-
 ments; cysteine spacings), repetitive structures (in the amino acid alpha-
 bet and in an 11-letter reduced alphabet), multiplets  (counts,  spacings,
 and  clusters  in  the  amino  acid  and  charge  alphabets),  periodicity
 analysis, spacing analysis. Each section is annotated below under its sec-
 tion title.
      The SAPS program was developed in the group of Prof. Samuel Karlin at
 Stanford  University.  Correspondence relating to SAPS should be addressed
 to either Volker Brendel or Samuel Karlin at the Department  of  Mathemat-
 ics,  Stanford  University,  Stanford  CA 94305, U.S.A.; phone: (415) 723-
 2209; fax: (415) 725-2040; email: volker@gnomic.stanford.edu. Users of the
 program  should  cite  the  following  reference: Brendel, V., Bucher, P.,
 Nourbakhsh, I., Blaisdell, B.E., Karlin, S. (1992) Methods and  algorithms
 for statistical analysis of protein sequences.  Proc. Natl. Acad. Sci. USA
 89: 2002-2006.



********************************************************************************
Protein    1 (File: wwwtmp/.SAPS.29342.5242.seq)

SWISS-PROT ANNOTATION:
ID   unknown
DE   unknown, 184 bases, 89091B8 checksum.

number of residues:  184;   molecular weight:  20.9 kdal
 
         1  MMRILLALSL GVACCSLWVG AEVQVQPDFQ KEKVLGKWYG IGLASNSNWF KDRKSHMKMC 
        61  TTIITPTADG NLEVTATYPK MDRCETKSMT YFKTEQLGGF RAKSPRYGSE HDMRVVETNY 
       121  DEYILMYTVK TKGSETNQIV SLFGRDKDLR PELLDKFQNF AKSQGLADDN IIILPHTDQC 
       181  MTEA

--------------------------------------------------------------------------------
COMPOSITIONAL ANALYSIS (extremes relative to: swp23s.q)

 The composition of the input sequence is evaluated relative to the residue
 usage  quantile  table  specified with the `-s species' flag. Low usage in
 the 1% quantile is indicated by the label `--' (e.g., Y-- means that  the
 input  sequence uses tyrosine as little as the 1% least tyrosine contain-
 ing proteins in the reference set); low usage in the 5% quantile is indi-
 cated  by  the  label  `-'  (e.g., L-); high usage above the 95% quantile
 point is indicated by the label `+' (e.g., A+); and high usage  above  the
 99%  quantile  point  is indicated by the label `++' (e.g., LIVFM++). The
 usage is evaluated for all 20 amino acids, positive (KR) and negative (ED)
 charge,  total  charge  (KRED),  net  charge  (KR-ED),  major hydrophobics
 (LVIFM), and the groupings ST, AGP (encoded by CCN, GCN, and GGN  codons),
 and FIKMNY (encoded by AAN, AUN, UAN, and UUN codons).


A  : 10( 5.4%); C  :  5( 2.7%); D  : 12( 6.5%); E  : 11( 6.0%); F  :  7( 3.8%)
G  : 12( 6.5%); H  :  3( 1.6%); I  :  9( 4.9%); K  : 15( 8.2%); L  : 16( 8.7%)
M  :  9( 4.9%); N  :  7( 3.8%); P  :  6( 3.3%); Q  :  8( 4.3%); R  :  8( 4.3%)
S  : 11( 6.0%); T  : 15( 8.2%); V  : 10( 5.4%); W  :  3( 1.6%); Y  :  7( 3.8%)

KR      :   23 ( 12.5%);   ED      :   23 ( 12.5%);   AGP     :   28 ( 15.2%);
KRED    :   46 ( 25.0%);   KR-ED   :    0 (  0.0%);   FIKMNY  :   54 ( 29.3%);
LVIFM   :   51 ( 27.7%);   ST      :   26 ( 14.1%).
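
The counts and percentages above can be recomputed directly from the sequence printout. The following Python sketch is an illustration only, not SAPS code; the names SEQ, GROUPS and composition are chosen here for the example. The quantile labels (-, --, +, ++) are not reproduced because they depend on the reference table swp23s.q.

```python
from collections import Counter

# Sequence as printed in the report (184 residues).
SEQ = (
    "MMRILLALSLGVACCSLWVGAEVQVQPDFQKEKVLGKWYGIGLASNSNWFKDRKSHMKMC"
    "TTIITPTADGNLEVTATYPKMDRCETKSMTYFKTEQLGGFRAKSPRYGSEHDMRVVETNY"
    "DEYILMYTVKTKGSETNQIVSLFGRDKDLRPELLDKFQNFAKSQGLADDNIIILPHTDQC"
    "MTEA"
)

# Residue groupings named in the report.
GROUPS = ["KR", "ED", "AGP", "KRED", "FIKMNY", "LVIFM", "ST"]

def composition(seq):
    """Return ({residue: count}, {group: count}) for a protein sequence."""
    counts = Counter(seq)
    group_counts = {g: sum(counts[r] for r in g) for g in GROUPS}
    return counts, group_counts

counts, group_counts = composition(SEQ)
for group, n in group_counts.items():
    print(f"{group:7s}: {n:4d} ({100 * n / len(SEQ):5.1f}%)")
# Net charge KR-ED is the difference of the two charge groups.
print("KR-ED  :", group_counts["KR"] - group_counts["ED"])
```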

--------------------------------------------------------------------------------
CHARGE DISTRIBUTIONAL ANALYSIS

 The distribution of charges in the protein sequence is evaluated in  terms
 of  clusters, high scoring segments, and runs and periodic patterns. Clus-
 ters indicate regions of typically 30 to 60 residues  exhibiting  a  rela-
 tively  high charge concentration. For high scoring charge segments, posi-
 tive scores are assigned to charge residues of the  appropriate  type  and
 negative  scores  to all other residues. A significant cumulative positive
 score again indicates a region of high charge concentration.  The  cluster
 method  and  the  scoring method will generally pick out the same segments
 (with the scoring method  often  delimiting  the  segment  to  a  narrower
 range),  conferring  robustness  to  the  results.  Short segments of high
 charge concentration are displayed as runs (with  errors).  Periodic  pat-
 terns  focus  on  those  with charges every second or third position, with
 possible relevance to amphipathic  secondary  structures;  other  periodic
 patterns  are displayed in the general periodicity analysis section of the
 output.

 
         1  00+0000000 0000000000 0-00000-00 +-+000+000 0000000000 +-++000+00 
        61  00000000-0 00-000000+ 0-+0-0+000 00+0-00000 +0+00+000- 0-0+00-000 
       121  --0000000+ 0+00-00000 0000+-+-0+ 0-00-+0000 0+00000--0 0000000-00 
       181  00-0
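
The charge printout above is a residue-by-residue recoding of the sequence. A minimal Python sketch of that recoding follows, assuming (as the later section headers state) that + stands for K or R, - for E or D, and 0 for everything else; the function name charge_string is hypothetical.

```python
def charge_string(seq):
    """Map K/R to '+', E/D to '-', and every other residue to '0'."""
    table = {"K": "+", "R": "+", "E": "-", "D": "-"}
    return "".join(table.get(res, "0") for res in seq)

# First 60 residues of the sequence printout.
first_row = "MMRILLALSLGVACCSLWVGAEVQVQPDFQKEKVLGKWYGIGLASNSNWFKDRKSHMKMC"
encoded = charge_string(first_row)

# Print in blocks of 10, as in the report.
print(" ".join(encoded[i:i + 10] for i in range(0, len(encoded), 10)))
```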

A. CHARGE CLUSTERS.

 Positive, negative, and mixed charge clusters are distinguished.  In  each
 case, cmin indicates the minimum number of charges required for a signifi-
 cant charge cluster corresponding to the given window size; e.g.,  cmin  =
 9/30 or 12/45 or 15/60 means that significance requires at least 9 charges
 in a segment of 30 (or fewer) residues, or 12  charges  in  a  segment  of
 length  45,  or 15 charges in a segment of length 60. In the case of posi-
 tive and negative charge clusters, these counts refer to net charge, i.e.,
 charges  of  the  opposite  sign  within the window are counted as -1. The
 sizes of the clusters are optimized for display to indicate the segment of
 highest  charge  concentration,  but  a  minimum  size  of  20 residues is
 required.  A mixed charge cluster that begins and ends within 15  residues
 of the endpoints of a pure charge cluster is not displayed (since its sig-
 nificance rests mostly on the charged residues  comprising  the  displayed
 pure charge cluster), unless the -v (verbose output) flag is set, in which
 case both the pure and the mixed charge  cluster  are  displayed.  On  the
 other  hand,  pure charge clusters that are embedded in mixed charge clus-
 ters are displayed separately (indicated by a * preceding  the  specifica-
 tion of location).
      For each cluster are given its location in the sequence  (From,  to),
 the  quartile  of  the  location  (1st,  2nd,  3rd,  or 4th quarter of the
 sequence), length, count, and t-value (standard deviations above the mean;
 to  accommodate  the  multiple  tests  performed, the t-value significance
 threshold is set to 4.0 for sequences up  to  750  residues,  to  4.5  for
 sequences  of  length 750-1500 residues, and to 5.0 for longer sequences);
 also indicated are residues comprising at least 10% of the cluster.



Positive charge clusters (cmin = 10/30 or 14/45 or 17/60):  none


Negative charge clusters (cmin = 10/30 or 14/45 or 17/60):  none


Mixed charge clusters (cmin = 16/30 or 22/45 or 28/60):  none
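
The window counts that are compared against cmin can be illustrated with a sliding-window sketch in Python. This is an assumption-laden illustration, not the SAPS procedure: it uses a fixed window width, does not optimize the displayed cluster size, and does not compute the t-value. The names best_window, MIXED and POSITIVE are hypothetical.

```python
def best_window(symbols, weights, width):
    """Return (start, score) of the highest-scoring window; start is 1-based.

    weights maps a charge symbol to its contribution; unlisted symbols count 0.
    """
    values = [weights.get(s, 0) for s in symbols]
    score = sum(values[:width])
    best = (1, score)
    for i in range(width, len(values)):
        # Slide the window one position to the right.
        score += values[i] - values[i - width]
        if score > best[1]:
            best = (i - width + 2, score)
    return best

MIXED = {"+": 1, "-": 1}       # every charge counts
POSITIVE = {"+": 1, "-": -1}   # opposite-sign charges count as -1 (net charge)

# First row of the charge printout.
row = "00+0000000" "0000000000" "0-00000-00" "+-+000+000" "0000000000" "+-++000+00"
start, count = best_window(row, MIXED, 30)
print(f"best 30-residue window starts at {start} with {count} charges (cmin = 16)")
```

Passing POSITIVE instead of MIXED gives the net-charge count used for positive charge clusters; swapping the signs gives the negative case.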


B. HIGH SCORING (UN)CHARGED SEGMENTS.

 For each scoring scheme (scores assigned to residues as  displayed),  SAPS
 displays  segments of the sequence with aggregate score exceeding the par-
 ticular threshold values M_0.01 (1% significance level, segments  labeled
 with  **),  M_0.05 (5% significance level, segments labeled *), or other-
 wise as indicated. A minimal segment length is set as shown.  The expected
 score/letter should be sufficiently large negative, and the average infor-
 mation per letter should be sufficiently large positive in order  for  the
 scoring statistics to apply properly (the program prints out when the con-
 ditions are not met and skips evaluations).



______________________________________
High scoring positive charge segments:

score=   2.00 frequency=   0.125  ( KR )
score=   0.00 frequency=   0.000  ( BZX )
score=  -1.00 frequency=   0.750  ( LAGSVTIPNFQYHMCW )
score=  -2.00 frequency=   0.125  ( ED )

 Expected score/letter:  -0.750;    Average information/letter:   1.082
 Minimal length of displayed segments set to:  20

M_0.01= 10.48  (cv=  6.35, lambda=  0.82113, k=  0.29898, x=  4.13;
                90% confidence interval for segment length:  11 +-  11)
M_0.05=  8.50  (x=  2.15)

# of segments (>=20 residues) exceeding M_0.05: none
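
The expected score/letter, lambda, and average information/letter printed for a scoring scheme can be checked numerically. The Python sketch below assumes the standard theory for maximal segment scores: lambda is the positive root of sum p_i exp(lambda s_i) = 1, and the information per letter (in bits) is lambda times the mean score under the tilted frequencies p_i exp(lambda s_i). With that assumption it reproduces the values printed in this report; the function names are hypothetical, and the constant k is not derived here.

```python
from math import exp, log

def expected_score(scheme):
    """scheme is a list of (score, frequency) pairs."""
    return sum(s * p for s, p in scheme)

def solve_lambda(scheme, hi=10.0, tol=1e-12):
    """Positive root of sum(p * exp(lam * s)) = 1, found by bisection."""
    def f(lam):
        return sum(p * exp(lam * s) for s, p in scheme) - 1.0
    lo = 1e-9  # f is negative just above 0 when the expected score is negative
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def information_bits(scheme, lam):
    """Average information per letter in bits."""
    return lam * sum(s * p * exp(lam * s) for s, p in scheme) / log(2)

# Schemes as printed in the report (the zero-frequency BZX class is omitted).
positive = [(2, 0.125), (-1, 0.750), (-2, 0.125)]
mixed = [(1, 0.250), (-1, 0.750)]

for name, scheme in [("positive", positive), ("mixed", mixed)]:
    lam = solve_lambda(scheme)
    print(name, round(expected_score(scheme), 3), round(lam, 5),
          round(information_bits(scheme, lam), 3))
```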


______________________________________
High scoring negative charge segments:

score=   2.00 frequency=   0.125  ( ED )
score=   0.00 frequency=   0.000  ( BZX )
score=  -1.00 frequency=   0.750  ( LAGSVTIPNFQYHMCW )
score=  -2.00 frequency=   0.125  ( KR )

 Expected score/letter:  -0.750;    Average information/letter:   1.082
 Minimal length of displayed segments set to:  20

M_0.01= 10.48  (cv=  6.35, lambda=  0.82113, k=  0.29898, x=  4.13;
                90% confidence interval for segment length:  11 +-  11)
M_0.05=  8.50  (x=  2.15)

# of segments (>=20 residues) exceeding M_0.05: none


___________________________________
High scoring mixed charge segments:

score=   1.00 frequency=   0.250  ( KEDR )
score=   0.00 frequency=   0.000  ( BZX )
score=  -1.00 frequency=   0.750  ( LAGSVTIPNFQYHMCW )

 Expected score/letter:  -0.500;    Average information/letter:   0.792
 Minimal length of displayed segments set to:  20

M_0.01=  7.93  (cv=  4.75, lambda=  1.09861, k=  0.33333, x=  3.19;
                90% confidence interval for segment length:  16 +-  14)
M_0.05=  6.45  (x=  1.70)

# of segments (>=20 residues) exceeding M_0.05: none


________________________________
High scoring uncharged segments:

score=   1.00 frequency=   0.750  ( LAGSVTIPNFQYHMCW )
score=   0.00 frequency=   0.000  ( BZX )
score=  -8.00 frequency=   0.250  ( KEDR )

 Expected score/letter:  -1.250;    Average information/letter:   0.259
 Minimal length of displayed segments set to:  20

M_0.01= 31.33  (cv= 20.49, lambda=  0.25450, k=  0.15869, x= 10.84;
                90% confidence interval for segment length:  44 +-  30)
M_0.05= 24.93  (x=  4.44)

# of segments (>=20 residues) exceeding M_0.05: none
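
The thresholds M_0.01 and M_0.05 printed for each scheme are consistent with an extreme-value formula: cv = ln(N)/lambda and x = -ln(-ln(1-p)/k)/lambda, with M_p = cv + x. The Python sketch below takes lambda and k from the report and N = 184; that SAPS computes the thresholds in exactly this way is an assumption, although the formula matches the printed values to two decimals. The function name threshold is hypothetical.

```python
from math import log

def threshold(n, lam, k, p):
    """Return (cv, x, M_p) for a sequence of n residues at significance level p."""
    cv = log(n) / lam
    x = -log(-log(1.0 - p) / k) / lam
    return cv, x, cv + x

N = 184
# (lambda, k) copied from the report for each scoring scheme.
SCHEMES = {
    "positive/negative charge": (0.82113, 0.29898),
    "mixed charge": (1.09861, 0.33333),
    "uncharged": (0.25450, 0.15869),
}

for name, (lam, k) in SCHEMES.items():
    for p in (0.01, 0.05):
        cv, x, m = threshold(N, lam, k, p)
        print(f"{name:25s} M_{p:.2f}={m:6.2f}  (cv={cv:6.2f}, x={x:5.2f})")
```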


C. CHARGE RUNS AND PATTERNS.

 The table below shows the charge runs and patterns searched for (*  stands
 for  +  or  -)  and  the required minimum number of matches to the pattern
 allowing for at most 0 (lmin0), 1 (lmin1),  or  2  (lmin2)  mismatches  or
 insertions/deletions (1% significance level). Occurrences are arranged in
 the order in which they appear in the sequence. For each  run  or  pattern
 are  displayed  its  length  (number  of matches) and a triplet giving the
 number of mismatches, insertions and deletions. 0-runs are further charac-
 terized  by  their  composition (residues comprising more than 10% of the
 run).
      Run count statistics are compiled for runs of lengths at least 2/3 of
 the minimal significant length (lmin0); given are the number and locations
 of such runs.


pattern  (+)|  (-)|  (*)|  (0)| (+0)| (-0)| (*0)|(+00)|(-00)|(*00)| (H.)|(H..)|
lmin0     5 |   5 |   7 |  29 |   9 |   9 |  12 |  10 |  10 |  14 |   6 |   7 | 
lmin1     6 |   6 |   9 |  36 |  11 |  11 |  15 |  13 |  13 |  17 |   7 |   9 | 
lmin2     7 |   7 |  10 |  39 |  12 |  12 |  16 |  14 |  14 |  19 |   8 |  10 | 
 (Significance level: 0.010000; Minimal displayed length:  6)
There are no charge runs or patterns exceeding the given minimal lengths.

Run count statistics:

  +  runs >=   3:   0
  -  runs >=   3:   0
  *  runs >=   5:   0
  0  runs >=  20:   0
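
The run count statistics can be checked against the charge printout with a few lines of Python. The sketch below only measures exact runs (no mismatches, insertions or deletions) and does not handle the periodic patterns; the name longest_run is hypothetical.

```python
from itertools import groupby

# Charge printout copied from the report (184 positions).
CHARGES = (
    "00+0000000" "0000000000" "0-00000-00" "+-+000+000" "0000000000" "+-++000+00"
    "00000000-0" "00-000000+" "0-+0-0+000" "00+0-00000" "+0+00+000-" "0-0+00-000"
    "--0000000+" "0+00-00000" "0000+-+-0+" "0-00-+0000" "0+00000--0" "0000000-00"
    "00-0"
)

def longest_run(symbols, members):
    """Length of the longest uninterrupted run of symbols drawn from members."""
    best = 0
    for is_member, group in groupby(symbols, key=lambda s: s in members):
        if is_member:
            best = max(best, sum(1 for _ in group))
    return best

# '*' in the report stands for either charge sign.
for label, members in [("+", "+"), ("-", "-"), ("*", "+-"), ("0", "0")]:
    print(f"longest {label} run: {longest_run(CHARGES, members)}")
```

All four values fall below the reporting limits listed above (3, 3, 5 and 20), which agrees with the zero counts.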

--------------------------------------------------------------------------------
DISTRIBUTION OF OTHER AMINO ACID TYPES

 Routinely, SAPS indicates high scoring hydrophobic and transmembrane  seg-
 ments.  The display is as described above for high scoring charge segments.
 The scores for the hydrophobic segments correspond to a  digitized  hydro-
 pathy  scale.   The transmembrane scores were derived from target frequen-
 cies in putative transmembrane proteins (see the paper referred to  above;
 note, however, that the scores used in the program have been rederived and
 differ from the ones given in the paper). With the -a command  line  flag,
 the  user  can invoke a similar analysis for other residue types.  In view
 of the special role of cysteines for protein structure,  the  spacings  of
 the  cysteine residues in the sequence are displayed separately, with par-
 ticular emphasis on close pairs of cysteines and  distances  between  such
 pairs.


1. HIGH SCORING SEGMENTS.

__________________________________
High scoring hydrophobic segments:

   2.00 (LVIFM)   1.00 (AGYCW)   0.00 (BZX)  -2.00 (PH)  -4.00 (STNQ)
  -8.00 (KEDR)

 Expected score/letter:  -2.234;    Average information/letter:   0.729
 Minimal length of displayed segments set to:  15

M_0.01= 21.38  (cv= 12.88, lambda=  0.40484, k=  0.31321, x=  8.50;
                90% confidence interval for segment length:  17 +-  11)
M_0.05= 17.35  (x=  4.47)

# of segments (>=15 residues) exceeding M_0.05: none


____________________________________
High scoring transmembrane segments:

   5.00 (LVIF)   2.00 (AGM)   0.00 (BZX)  -1.00 (YCW)  -2.00 (ST)
  -6.00 (P)  -8.00 (H) -10.00 (NQ) -16.00 (KR) -17.00 (ED)

 Expected score/letter:  -4.152;    Average information/letter:   0.631
 Minimal length of displayed segments set to:  15

M_0.01= 47.62  (cv= 29.68, lambda=  0.17571, k=  0.23490, x= 17.94;
                90% confidence interval for segment length:  19 +-  13)
M_0.05= 38.34  (x=  8.66);     M_0.30= 27.30  (x= -2.38)

 1) From    4 to   21:  length= 18, score=43.00  * 
       4  ILLALSLGVA CCSLWVGA
    L:  5(27.8%);  A:  3(16.7%);  G:  2(11.1%);  S:  2(11.1%);
    V:  2(11.1%);  C:  2(11.1%);

# of segments (>=15 residues) exceeding M_0.30:  1
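
The score of the transmembrane segment reported above can be reproduced with a maximal-scoring-segment scan. The Python sketch below uses the transmembrane scores as printed and a single-pass scan that returns only the single best segment; it is restricted to the first 60 residues for brevity, and it is not the SAPS implementation (which may report several segments). The names TM_SCORES and best_segment are hypothetical.

```python
# Transmembrane scores as printed in the report.
TM_SCORES = {}
for residues, score in [("LVIF", 5), ("AGM", 2), ("BZX", 0), ("YCW", -1),
                        ("ST", -2), ("P", -6), ("H", -8), ("NQ", -10),
                        ("KR", -16), ("ED", -17)]:
    for res in residues:
        TM_SCORES[res] = score

def best_segment(seq, scores):
    """Return (start, end, score) of the maximal-scoring segment, 1-based inclusive."""
    best = (0, 0, float("-inf"))
    running, start = 0, 1
    for pos, res in enumerate(seq, start=1):
        if running <= 0:
            # A non-positive prefix can never help; start a new segment here.
            running, start = 0, pos
        running += scores[res]
        if running > best[2]:
            best = (start, pos, running)
    return best

# First 60 residues of the sequence printout.
first_row = "MMRILLALSLGVACCSLWVGAEVQVQPDFQKEKVLGKWYGIGLASNSNWFKDRKSHMKMC"
start, end, score = best_segment(first_row, TM_SCORES)
print(f"From {start} to {end}: length={end - start + 1}, score={score}")
```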


2. SPACINGS OF C.


H2N-13
CC   at   14
  -44-C-23-C-95-C-4-COOH
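
The cysteine spacing display lists the number of residues between successive cysteines, starting from the N-terminus (H2N) and ending at the C-terminus (COOH). A short Python sketch that recovers the same numbers from the sequence follows; the helper name residue_positions is hypothetical.

```python
# Sequence as printed in the report (184 residues).
SEQ = (
    "MMRILLALSLGVACCSLWVGAEVQVQPDFQKEKVLGKWYGIGLASNSNWFKDRKSHMKMC"
    "TTIITPTADGNLEVTATYPKMDRCETKSMTYFKTEQLGGFRAKSPRYGSEHDMRVVETNY"
    "DEYILMYTVKTKGSETNQIVSLFGRDKDLRPELLDKFQNFAKSQGLADDNIIILPHTDQC"
    "MTEA"
)

def residue_positions(seq, residue):
    """1-based positions of a residue type in the sequence."""
    return [i for i, res in enumerate(seq, start=1) if res == residue]

positions = residue_positions(SEQ, "C")
# Number of residues strictly between consecutive cysteines (0 means adjacent).
gaps = [b - a - 1 for a, b in zip(positions, positions[1:])]
n_term = positions[0] - 1            # residues before the first cysteine
c_term = len(SEQ) - positions[-1]    # residues after the last cysteine

print("positions:", positions)
print("H2N gap:", n_term, "internal gaps:", gaps, "COOH gap:", c_term)
```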

--------------------------------------------------------------------------------
REPETITIVE STRUCTURES.

 Repeats are indicated for two alphabets: the 20-letter amino  acid  alpha-
 bet,  and  a  reduced  11-letter  alphabet in which the major hydrophobics
 LVIF, the charged residues KR and ED, the small residues AG, the  hydroxyl
 group  residues  ST, the amide group residues NQ, and the aromatics YW are
 treated as combined letters.  For each alphabet, three classes of  repeats
 are  distinguished: separated repeats, simple tandem repeats, and periodic
 repeats. The separated  repeats  are  largely  non-overlapping.  They  are
 displayed  in  groups  of  matching  blocks  (exceeding a given core block
 length of contiguous  exact  matches)  and  intervening  spacer  distances
 (which  may  be  negative,  signifying  a partial overlap). The core block
 length in case of the amino acid alphabet is set to 4 for sequences up  to
 500  residues,  to 5 for sequences between 500 and 2000 residues, and to 6
 for longer sequences (same values increased by 4 for  the  reduced  alpha-
 bet).   Simple  tandem  repeats  are  displayed  in  similar  layout,  but
 separately. Sequence segments that are highly repetitive  with  relatively
 short repeats are displayed as periodic repeats.


A. SEPARATED, TANDEM, AND PERIODIC REPEATS: amino acid alphabet.
Repeat core block length:  4

B. SEPARATED AND TANDEM REPEATS: 11-letter reduced alphabet.
   (i= LVIF; += KR; -= ED; s= AG; o= ST; n= NQ; a= YW; p= P; h= H; m= M; c= C)
Repeat core block length:  8
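
The reduced alphabet and the notion of a core block can be illustrated in Python. The sketch below translates a sequence into the 11-letter alphabet listed above and reports exact blocks of a given length that occur more than once. It is a simplification: the extension of core blocks, spacer distances, and the tandem/periodic classification are not attempted, and the ambiguity codes B, Z and X are not handled. The names REDUCED, reduce_sequence and repeated_blocks are hypothetical.

```python
# Reduced alphabet as listed in the report.
REDUCED = {}
for letter, residues in [("i", "LVIF"), ("+", "KR"), ("-", "ED"), ("s", "AG"),
                         ("o", "ST"), ("n", "NQ"), ("a", "YW"), ("p", "P"),
                         ("h", "H"), ("m", "M"), ("c", "C")]:
    for res in residues:
        REDUCED[res] = letter

def reduce_sequence(seq):
    """Translate a 20-letter amino acid sequence into the 11-letter alphabet."""
    return "".join(REDUCED[res] for res in seq)

def repeated_blocks(text, k):
    """Map each k-letter block that occurs more than once to its 1-based starts."""
    seen = {}
    for i in range(len(text) - k + 1):
        seen.setdefault(text[i:i + k], []).append(i + 1)
    return {block: starts for block, starts in seen.items() if len(starts) > 1}

print(reduce_sequence("MMRILLALSL"))
print(repeated_blocks("ABCDXXABCD", 4))
```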

--------------------------------------------------------------------------------

MULTIPLETS.

 Multiplets refer to homooligopeptides of any length (e.g., A2, Q7,  etc.);
 altplets  refer  to  reiterations  of  two  different  residues (e.g., RG,
 EAEAEA, etc.). The  multiplet  composition  of  the  protein  sequence  is
 evaluated  for  both the amino acid and the charge alphabet. (High) Aggre-
 gate altplet counts are evaluated only for the charge alphabet. The multi-
 plet  sequence  is  displayed  whenever  the  total multiplet count of the
 sequence falls outside the expected range (i.e., beyond 3 standard  devia-
 tions of the mean). Printed are also the histogram of the spacings between
 consecutive multiplets (differences between starting positions) as well as
 clusters  of multiplets (multiplet clusters are determined in the same way
 as charge clusters are determined; the  binomial  test  is  applied  to  a
 compressed sequence over the alphabet {M,S}, where M signifies a multiplet
 and S signifies a singlet; i.e., the amino acid sequence AADFFFGHRRT... is
 translated  as MSMSSMS..., and the binomial cluster test is applied to the
 latter sequence). Multiplets and altplets of specific residue content that
 individually show an unusually high count are indicated, and the positions
 of all multiplets exceeding a minimum length of 5 residues are shown.
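
The translation into the {M,S} alphabet described above can be written in a few lines of Python. The sketch below covers only the compression step, using the example from the text; the binomial cluster test is not included, and the function name multiplet_singlet is hypothetical.

```python
from itertools import groupby

def multiplet_singlet(seq):
    """Collapse each run of identical residues to 'M' (length >= 2) or 'S' (length 1)."""
    return "".join(
        "M" if sum(1 for _ in run) >= 2 else "S"
        for _, run in groupby(seq)
    )

compressed = multiplet_singlet("AADFFFGHRRT")
print(compressed)                                   # MSMSSMS
print("multiplets:", compressed.count("M"))
```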


A. AMINO ACID ALPHABET.

1. Total number of amino acid multiplets:  10  (Expected range:   0-- 20)

2. Histogram of spacings between consecutive amino acid multiplets:
   (1-5) 4   (6-10) 1   (11-20) 3   (>=21) 3

3. Clusters of amino acid multiplets (cmin = 10/30 or 13/45 or 16/60):  none


B. CHARGE ALPHABET.

1. Total number of charge multiplets:   3  (Expected range:   0-- 12)
   1 +plets (f+: 12.5%), 2 -plets (f-: 12.5%)
   Total number of charge altplets: 5 (Critical number: 14)

2. Histogram of spacings between consecutive charge multiplets:
   (1-5) 0   (6-10) 0   (11-20) 1   (>=21) 3

--------------------------------------------------------------------------------
PERIODICITY ANALYSIS.

 The program identifies periodic elements of periods between 1 and  10  for
 the amino acid alphabet, for the charge alphabet, and for a hydrophobicity
 alphabet. Each periodic element consists of an error-free core pattern (of
 length  at least 4 for the amino acid alphabet, 5 for the charge alphabet,
 and 6 for the hydrophobicity alphabet)  which  is  extended  allowing  for
 errors.   The  numbers  of  errors are given for each position in the con-
 sensus of a periodic pattern involving more than one letter. The displayed
 periodic patterns would generally not be statistically significant but are
 listed for the sake of a general interactive appraisal  of  the  sequence.
 Periodicities  of  exceptionally  high copy number are indicated with a !-
 mark.


A. AMINO ACID ALPHABET (core:  4; !-core: 5)

Location	Period	Element		Copies	Core	Errors

There are no periodicities of the prescribed length.

B. CHARGE ALPHABET ({+= KR; -= ED; 0}; core:  5; !-core: 6)
   and HYDROPHOBICITY ALPHABET ({*= KRED; i= LVIF; 0}; core:  6; !-core: 9)

Location	Period	Element		Copies	Core	Errors
   2-  13	 2	i0        	 6	 6  	/0/2/


--------------------------------------------------------------------------------
SPACING ANALYSIS.

 The spacings between consecutive residues of the same type (all  20  amino
 acids,  +  and - charge, and combined charge *) are evaluated for signifi-
 cantly large or small maximal and minimal spacings. The output is  ordered
 by  the beginning point of the significant spacing. Entries are identified
 by the residue type, spacing (number of amino acids between the identified
 positions),  rank  of  the  displayed  spacing  (e.g.,  50 alanines in the
 sequence induce 51 spacings, ranked by decreasing length from  1  to  51),
 and  p-value  (probability  of exceeding the displayed spacing). A maximal
 spacing with p-value 0.01 or less is  considered  significantly  large;  a
 maximal  spacing  with  p-value 0.99 or larger is considered significantly
 small. Similarly, a minimal spacing with p-value 0.99 or  larger  is  con-
 sidered  significantly  small,  and a minimal spacing with p-value 0.01 or
 less is considered significantly large (excluding doublets). If the  first
 maximal  spacing  (rank  1)  of a residue is significantly large or small,
 then also the second maximal spacing (rank 2) is evaluated. Large  maximal
 and small minimal spacings indicate clustering effects, whereas small max-
 imal and large minimal spacings indicate excessive evenness in the distri-
 bution of the residues.


There are no unusual spacings.
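
The ranked spacings that enter this analysis can be sketched in Python. The example assumes that the two terminal stretches are counted as spacings, which is what makes n residues induce n + 1 spacings as described above; the p-value calculation is not reproduced. The function name ranked_spacings is hypothetical, and the cysteine positions are those of the sequence in this report.

```python
def ranked_spacings(positions, length):
    """Spacings induced by n residues (n + 1 values, ends included), largest first.

    A spacing is the number of amino acids between two identified positions.
    """
    bounds = [0] + sorted(positions) + [length + 1]
    spacings = [b - a - 1 for a, b in zip(bounds, bounds[1:])]
    return sorted(spacings, reverse=True)

# Cysteine positions in the 184-residue sequence of this report.
cys = [14, 15, 60, 84, 180]
for rank, spacing in enumerate(ranked_spacings(cys, 184), start=1):
    print(f"rank {rank}: spacing {spacing}")
```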
