MISCLUSTER version 1.0

MISCORE Based CLUSTERing Algorithms for DNA Motif Discovery

Introduction

MISCLUSTER is a divisive hierarchical clustering method for de novo motif discovery. MISCLUSTER employs a mismatch value for cluster initialization. This initialization reduces the sequence search space and thus the running time. Unlike most standard hierarchical clustering algorithms, our hierarchical clustering framework is tailored to motif discovery by taking into account the clusterability of the Kmers and the effect of the background sequences. MISCLUSTER works under the observation that there is an upper bound on the number of mismatches between the true binding sites of a protein. The pre-selected initial clusters are further branched using a binary divisive algorithm to improve specificity and sensitivity. Heuristic rules are introduced for sub-cluster selection and as the branching stopping criterion.
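
The mismatch bound mentioned above can be illustrated with a minimal Python sketch. It is not MISCLUSTER's implementation: the function names `kmers`, `mismatches` and `within_bound` are hypothetical, and filtering candidates against a single seed is only the simplest way to apply a mismatch threshold m.

```python
def kmers(sequence, k):
    """Return every length-k substring (Kmer) of a DNA sequence."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def mismatches(a, b):
    """Count the positions at which two equal-length Kmers differ."""
    if len(a) != len(b):
        raise ValueError("Kmers must have the same length")
    return sum(x != y for x, y in zip(a, b))


def within_bound(seed, candidates, m):
    """Keep the candidates that differ from the seed in at most m positions."""
    return [c for c in candidates if mismatches(seed, c) <= m]


# Two GAL4 sites taken from the sample output further down this page.
seed = "CGGTCAACAGTTGTCCG"
other = "CGGACAACTGTTGACCG"
print(mismatches(seed, other))

# Scan a toy sequence (the second site with two flanking bases on each side).
print(within_bound(seed, kmers("TT" + other + "AA", len(seed)), 7))
```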

Contact

For technical assistance or to report bugs, please email A/Prof Dianhui Wang.

Download and Installation

The program can be downloaded from the following links.

Executable file	Windows platform Download here.
Markov chain model generator (perl script)	Any platform. Download here.

MISCLUSTER is implemented in C/C++ and was developed using Dev-C++ version 4.9.9.2, which is available at http://bloodshed.net. The program has also been compiled successfully in Microsoft Visual Studio 7.0.

To run the program, simply extract the executable file (e.g. .exe) and run it from the command prompt. The usage of the Markov chain Perl script is described in a section below.

Usage

MISCLUSTER assumes that at least one true site appears in each input DNA sequence when establishing the core motifs. The command line is
miscluster [parameters]

Parameter	Required	Description
-h	N	Show the help menu.
-f [filename]	Y	Name of the input sequence file in FASTA format.
-l x	Y	Specify the expected motif length x. Valid values are [6, 20].
-m y	Y	Specify the maximum mismatch y between the Kmers. Integer value in (0, l/2).
-b [filename]	Y	Name of the file that stores the 7th order background Markov chain generated using the Perl script. We have prepared background files for several species (see the next sub-section).
-z [filename]	Y	Name of the file that stores the 7th order foreground Markov chain generated using the Perl script.
-k [opt]	N	Specify the cluster initialization method. Options are: 1 - using core motif, 2 - using enumeration.
-r	N	If specified, only the given strand of the input sequences is used in motif discovery.
-o [filename]	Y	Name of the output file that stores the discovered motifs.
-thr [value]	N	Specify the post-scanning threshold in (0, 1]. The default is 0.75. This threshold is very important, as it determines the final motif set returned after the candidate motifs have been selected at the end of hierarchical branching. Guidelines for this parameter are discussed in the paper.
-s	N	By default, the scoring function used to select the core motif is the MAP score. If this switch is specified, the Kmer surprise score is used instead.
-v [value]	N	Specify the number of initial clusters in [10, 100]. The default is 40.
-t [value]	N	Specify the number of top ranked motifs to return. The default is 10.
-j	N	If specified, the initial sequences are selected based on the Markov chain model.
-i [value]	N	Number of initial sequences used to generate the core motifs. Warning: the computational time can be very long if too many initial sequences are used.
-c [value]	N	Minimum complexity value of a motif. Valid values are [0, 10]. The default value is 0.15.
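
The numeric ranges in the table can be checked before launching a run. The following Python sketch is only an illustration of those documented ranges; the helper `check_parameters` is hypothetical and is not part of MISCLUSTER, which may perform its own validation differently.

```python
def check_parameters(length, mismatch, threshold=0.75, clusters=40):
    """Return a list of problems; an empty list means all values are in range.

    Ranges follow the parameter table: -l in [6, 20], -m an integer in
    (0, l/2), -thr in (0, 1] and -v in [10, 100].
    """
    problems = []
    if not 6 <= length <= 20:
        problems.append("-l must be in [6, 20]")
    if not isinstance(mismatch, int) or not 0 < mismatch < length / 2:
        problems.append("-m must be an integer in (0, l/2)")
    if not 0 < threshold <= 1:
        problems.append("-thr must be in (0, 1]")
    if not 10 <= clusters <= 100:
        problems.append("-v must be in [10, 100]")
    return problems


# Values from the GAL4 example on this page: -l 17 -m 7 -thr 0.75
print(check_parameters(17, 7, threshold=0.75))
```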

Generating background/foreground Markov Chain Model

The Perl script is modified from Mahony et al. (2006). It generates the Markov chain model and the frequency of each Kmer in the input sequences.
The command line for the Perl script is:
perl BackExtract_modi.pl -seq fastafile -x len [-out outfilename]

fastafile is the name of the FASTA file that stores the background sequences (or the input sequences, when generating the foreground model).
len is the Markov order (default is 3). MISCLUSTER uses a seventh order Markov chain model.
outfilename is the name of the output file.
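
To make the idea of an order-n Markov chain model concrete, here is a minimal Python sketch that estimates the probability of each base given the preceding n bases. It is an illustration only: the function `markov_model` is hypothetical, it does not reproduce the output format of BackExtract_modi.pl, and it uses plain frequency estimates without pseudocounts.

```python
from collections import defaultdict


def markov_model(sequences, order):
    """Estimate P(next base | previous `order` bases) from DNA sequences."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - order):
            context = seq[i:i + order]
            nxt = seq[i + order]
            # Skip windows containing ambiguous symbols such as N.
            if set(context + nxt) <= set("ACGT"):
                counts[context][nxt] += 1

    model = {}
    for context, nexts in counts.items():
        total = sum(nexts.values())
        model[context] = {base: n / total for base, n in nexts.items()}
    return model


# Toy example with a first order model; MISCLUSTER itself uses order 7.
print(markov_model(["ACGTACGT"], 1))
```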

Saccharomyces cerevisiae/Yeast	Download
Human	Download
Mouse	Download
Escherichia coli	Download

The yeast and E. coli background sequences were retrieved from RSAT (van Helden, 2003). The human and mouse background sequences were retrieved from http://biowulf.bu.edu/zlab/promoser/.

Using MISCLUSTER

Step 1:
MISCLUSTER requires users to prepare input sequences that are believed to be co-regulated by the same protein. Co-regulation can be established through gene expression analysis or other in vivo methods such as ChIP-chip, ChIP-PET, etc. (ref). These methods have been widely used for genome-wide motif discovery.

Step 2:
Users have to prepare Markov models of the background sequences and of the input sequences. Both are 7th order Markov models, which can be generated using the Perl script provided above.

Step 3:
Decide the motif length (-l) and mismatch value (-m). For example, the command line to discover motifs from upstream DNA sequences of genes believed to be co-regulated by the GAL4 protein is:

miscluster -f GAL4_YPD.fsa -l 17 -m 7 -k 2 -o gal4.output -thr 0.75 -b yeast_7c.back -z gal4.fore

The mismatch value is usually set between 0 and l/2. One should start with a smaller mismatch value before proceeding to larger values.
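
The advice to start with small mismatch values can be scripted. The Python sketch below only builds the command lines, one per mismatch value in increasing order; it does not run them. The helper `miscluster_commands` and the output naming scheme are hypothetical, and the upper limit assumes the open interval (0, l/2) given in the parameter table.

```python
import shlex


def miscluster_commands(fasta, length, background, foreground, out_prefix):
    """Build one miscluster command per mismatch value, smallest first."""
    largest = (length - 1) // 2  # largest integer strictly below l/2
    commands = []
    for m in range(1, largest + 1):
        commands.append([
            "miscluster",
            "-f", fasta,
            "-l", str(length),
            "-m", str(m),
            "-b", background,
            "-z", foreground,
            "-o", f"{out_prefix}_m{m}.output",  # hypothetical naming scheme
        ])
    return commands


# File names are the ones used in the GAL4 example above.
for argv in miscluster_commands("GAL4_YPD.fsa", 17, "yeast_7c.back",
                                "gal4.fore", "gal4"):
    print(shlex.join(argv))
```

Each list could then be passed to `subprocess.run`, stopping at the first mismatch value that gives a satisfactory motif.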

Step 4:
The output file gal4.output contains the top 10 motifs returned by MISCLUSTER. The top-ranked motif is shown below.

Motif rank #1
Motif Score: 1.528609    Complexity: 0.343109    
Relative Entropy: 18.602956 bits
No. of sites:13
Con : CGGGCTACTCTCCTCCG  Deg. Con: CGGGCWACTSTCSTCCG
rCon: CGGAGGAGAGTAGCCCG

Position Count Matrix
       1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
A       3   0   1   3   1   4   8   1   3   0   0   3   0   1   0   0   0
C       9   0   1   3   9   2   3  12   0   6   1   4   7   1  10  13   0
G       0  13  11   4   3   3   0   0   0   5   0   3   4   2   0   0  13
T       1   0   0   3   0   4   2   0  10   2  12   3   2   9   3   0   0

>iYGR295C-1  AGGGCTCCTCTACTTCG    1    852    0.8690
>iYML052W    CGGCGCACTCTCGCCCG    1    112    0.8621
>TEL15R-2    AGGGCTCCTCTACTTCG    1    275    0.8690
>iYDR544C    AGGGCTCCTCTACTTCG    1    244    0.8690
>iYHR091C    TGACCTACTTTTTTCCG    1    316    0.8000
>iYBR018C    CGCTCAACAGTGCTCCG    1    273    0.8621
>iYBR018C    CGGACAACTGTTGACCG    0    186    0.9034
>iYBR018C    CGGTCAACAGTTGTCCG    1    186    0.9103
>iYDR008C    CGGTCCACTGTGTGCCG    1    503    0.8828
>iYLR080W    CGGAGATATCTGCGCCG    1    578    0.7793
>iYLR080W    CGGCGGTCTTTCGTCCG    1    559    0.8552
>iYBR019C    CGGAAGACTCTCCTCCG    1    264    0.9310
>iYBR019C    CGGGCGACAGCCCTCCG    1    246    0.8621

Descriptions:

Motif Score is the maximum a posteriori probability (MAP) score of the motif.
Complexity is the motif complexity value.
Relative Entropy is the relative entropy of the motif, in bits.
Each line in the list of predicted sites is arranged in the following order: a) sequence name; b) actual site (forward/reverse complement); c) strand of the site, 1 - forward, 0 - reverse complement; d) the position where the site appears in the given input strand; e) normalized average mismatch score in [0, 1].
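
The fields described above can be read back with a short script. The Python sketch below parses the site lines of the sample motif, rebuilds the position count matrix, and computes a relative entropy as the sum over positions and bases of f * log2(f / p), where f is the observed base frequency and p the background frequency. All function names are hypothetical. The uniform background and the absence of pseudocounts are assumptions, so the result need not match the value reported by MISCLUSTER.

```python
import math

SITE_LINES = """\
>iYGR295C-1  AGGGCTCCTCTACTTCG    1    852    0.8690
>iYML052W    CGGCGCACTCTCGCCCG    1    112    0.8621
>TEL15R-2    AGGGCTCCTCTACTTCG    1    275    0.8690
>iYDR544C    AGGGCTCCTCTACTTCG    1    244    0.8690
>iYHR091C    TGACCTACTTTTTTCCG    1    316    0.8000
>iYBR018C    CGCTCAACAGTGCTCCG    1    273    0.8621
>iYBR018C    CGGACAACTGTTGACCG    0    186    0.9034
>iYBR018C    CGGTCAACAGTTGTCCG    1    186    0.9103
>iYDR008C    CGGTCCACTGTGTGCCG    1    503    0.8828
>iYLR080W    CGGAGATATCTGCGCCG    1    578    0.7793
>iYLR080W    CGGCGGTCTTTCGTCCG    1    559    0.8552
>iYBR019C    CGGAAGACTCTCCTCCG    1    264    0.9310
>iYBR019C    CGGGCGACAGCCCTCCG    1    246    0.8621
""".splitlines()


def parse_site(line):
    """Split one site line into the five fields a) to e)."""
    name, site, strand, position, score = line.split()
    return {"name": name.lstrip(">"), "site": site, "strand": int(strand),
            "position": int(position), "score": float(score)}


def count_matrix(sites):
    """Count A, C, G and T at each position of equal-length sites."""
    counts = [{base: 0 for base in "ACGT"} for _ in range(len(sites[0]))]
    for site in sites:
        for i, base in enumerate(site):
            counts[i][base] += 1
    return counts


def relative_entropy(counts, background=None):
    """Relative entropy in bits; uniform background unless one is given."""
    background = background or {base: 0.25 for base in "ACGT"}
    total = 0.0
    for column in counts:
        n = sum(column.values())
        for base, c in column.items():
            if c:  # a zero count contributes nothing
                f = c / n
                total += f * math.log2(f / background[base])
    return total


def reverse_complement(seq):
    """Reverse complement of an unambiguous DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


sites = [parse_site(line)["site"] for line in SITE_LINES]
counts = count_matrix(sites)
print(counts[0])                                # first matrix column
print(relative_entropy(counts))                 # assumption-dependent value
print(reverse_complement("CGGGCTACTCTCCTCCG"))  # Con -> rCon
```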

Materials and Results

The prediction results for Tompa's datasets are available here. a) Conserved datasets; b) Non-conserved datasets.

Detailed MISCLUSTER statistics for Tompa's datasets. Download here.

Detailed results on the 10 real datasets for MEME, AlignACE, SOMBRERO and MISCLUSTER. Download here.

References

1. Mahony, S.; Benos, P. V.; Smith, T. J. & Golden, A. Self-organizing neural networks to support the discovery of DNA-binding motifs. Neural Networks, 2006, 19, 950-962.
2. van Helden, J. Regulatory sequence analysis tools. Nucl. Acids Res., 2003, 31, 3593-3596.