MISCORE Based CLUSTERing Algorithms for DNA Motif Discovery
Introduction
MISCLUSTER is a divisive hierarchical clustering method for de novo motif discovery. MISCLUSTER uses a mismatch value for cluster initialization. This initialization reduces the sequence search space and therefore the running time. Unlike most standard hierarchical clustering algorithms, our framework is tailored to motif discovery: it takes into account the clusterability of the k-mers and the effect of the background sequences. MISCLUSTER relies on the observation that there is an upper bound on the mismatch between the true binding sites of a protein. The pre-selected initial clusters are further branched using a binary divisive algorithm to improve specificity and sensitivity. Heuristic rules are introduced for sub-cluster selection and as the branching stopping criterion.
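To make the mismatch idea concrete, the sketch below counts mismatches between equal-length k-mers and keeps those within a given bound of a seed. It assumes that "mismatch" means the number of differing positions (Hamming distance); the function names and example k-mers are hypothetical, and this is not MISCLUSTER's actual initialization routine.

```python
from typing import List, Sequence


def mismatches(a: str, b: str) -> int:
    """Count the positions at which two equal-length k-mers differ."""
    if len(a) != len(b):
        raise ValueError("k-mers must have the same length")
    return sum(x != y for x, y in zip(a, b))


def within_mismatch(seed: str, kmers: Sequence[str], m: int) -> List[str]:
    """Keep the k-mers that differ from the seed in at most m positions.

    Assumption: this simple filter only illustrates how an upper bound on
    mismatch shrinks the candidate set; it is not the MISCLUSTER procedure.
    """
    return [k for k in kmers if mismatches(seed, k) <= m]


if __name__ == "__main__":
    # Hypothetical 6-mers, chosen only for illustration.
    candidates = ["CGGAGT", "CGGACT", "TTTTTT", "CGCAGA"]
    print(within_mismatch("CGGAGT", candidates, m=2))
```

With a bound of 2, the unrelated k-mer TTTTTT (5 mismatches from the seed) is discarded, while the three similar k-mers are kept.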
Contact
For technical assistance or to report bugs, please email A/Prof Dianhui Wang.
Download and Installation
The program can be downloaded from the following link.
MISCLUSTER is implemented in C/C++ and was developed using Dev-C++ version 4.9.9.2, which is available at http://bloodshed.net. The program has also been compiled successfully in Microsoft Visual Studio 7.0. To run the program, extract the executable file (e.g. .exe) and run it from the command prompt. Usage of the Markov chain Perl script is described in the section below.
Usage
MISCLUSTER assumes that at least one true site appears in each input DNA sequence when establishing the core motifs. The command line is:
miscluster [parameters]
Generating background/foreground Markov Chain Model
The Perl script is modified from Mahony et al. (2006). It generates the Markov chain model and the frequency of each k-mer in the input sequences. The command line for the Perl script is:
perl BackExtract_modi.pl -seq fastafile -x len [-out outfilename]
fastafile is the name of the FASTA file that stores the background sequences.
len is the Markov order (default is 3). MISCLUSTER uses a seventh-order Markov chain model.
outfilename is the output file name.
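As a rough illustration of what such a model contains, the sketch below counts k-mers in FASTA sequences and estimates the probability of each base given the preceding bases. It is a minimal maximum-likelihood estimate without pseudocounts, written under the assumption that an order-n model is built from (n+1)-mer counts; the function names are hypothetical, and it does not reproduce the output format of BackExtract_modi.pl.

```python
from collections import Counter


def read_fasta(text):
    """Return the sequences in a FASTA-formatted string, upper-cased."""
    seqs, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            # A header line closes the previous record, if any.
            if current:
                seqs.append("".join(current))
                current = []
        elif line:
            current.append(line.upper())
    if current:
        seqs.append("".join(current))
    return seqs


def kmer_counts(seqs, k):
    """Count every overlapping k-mer across all sequences."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts


def markov_model(seqs, order):
    """Map each (order+1)-mer to P(last base | preceding `order` bases)."""
    counts = kmer_counts(seqs, order + 1)
    totals = Counter()
    for kmer, n in counts.items():
        totals[kmer[:-1]] += n  # total occurrences of each context
    return {kmer: n / totals[kmer[:-1]] for kmer, n in counts.items()}


if __name__ == "__main__":
    seqs = read_fasta(">s1\nACGTACGA\n")
    print(kmer_counts(seqs, 2))
    print(markov_model(seqs, order=1))
```

The toy example uses order 1 so the counts are easy to check by hand; for the seventh-order model mentioned above, the same function would be called with order=7 on a much larger background set.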
The yeast and E. coli background sequences were retrieved from RSAT (van Helden, 2003). The human and mouse background sequences were retrieved from http://biowulf.bu.edu/zlab/promoser/.
Using MISCLUSTER
Step 1: Prepare input sequences that are believed to be co-regulated by the same protein. Co-regulation can be established through gene expression analysis or other in vivo methods such as ChIP-chip, ChIP-PET, etc. (ref). These methods have been widely used for genome-wide motif discovery.
Step 2: Prepare the Markov models of the background and input sequences. MISCLUSTER uses 7th-order Markov models, which can be generated with the Perl script described above.
Step 3: Decide the motif length (-l) and mismatch value (-m). For example, the command line to discover motifs from upstream DNA sequences of genes believed to be co-regulated by the GAL4 protein is:
miscluster -f GAL4_YPD.fsa -l 17 -m 7 -k 2 -o gal4.output -thr 0.75 -b yeast_7c.back -z gal4.fore
The mismatch value is usually set in the range (0, l/2], where l is the motif length. Try smaller mismatch values first before proceeding to larger ones.
Step 4: The output file gal4.output contains the top 10 motifs returned by MISCLUSTER. The following is the top-ranked motif.
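The advice in Step 3, to try small mismatch values first, can be scripted. The sketch below lists the usual mismatch values for a given motif length and assembles the corresponding command line. It uses only the options shown in the GAL4 example; the values of -k and -thr are copied from that example unchanged because their meaning is not described here, and the helper names and per-run output file names are hypothetical.

```python
def mismatch_schedule(motif_length):
    """Integer mismatch values in (0, l/2], smallest first."""
    return list(range(1, motif_length // 2 + 1))


def miscluster_command(fasta, motif_length, mismatch, output,
                       background, foreground, k=2, thr=0.75):
    """Build the argument list for one MISCLUSTER run.

    Assumption: only the flags from the GAL4 example are used, and the
    defaults for -k and -thr are simply the values from that example.
    """
    return [
        "miscluster",
        "-f", fasta,
        "-l", str(motif_length),
        "-m", str(mismatch),
        "-k", str(k),
        "-o", output,
        "-thr", str(thr),
        "-b", background,
        "-z", foreground,
    ]


if __name__ == "__main__":
    for m in mismatch_schedule(17):
        cmd = miscluster_command(
            "GAL4_YPD.fsa", 17, m,
            "gal4_m%d.output" % m,  # hypothetical per-run output name
            "yeast_7c.back", "gal4.fore",
        )
        print(" ".join(cmd))
```

Printing the commands rather than running them keeps the sketch independent of where the executable is installed; each printed line can be pasted into the command prompt.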
Materials and Results
References
1. Mahony, S.; Benos, P. V.; Smith, T. J. & Golden, A. Self-organizing neural networks to support the discovery of DNA-binding motifs. Neural Networks, 2006, 19, 950-962.
2. van Helden, J. Regulatory sequence analysis tools. Nucl. Acids Res., 2003, 31, 3593-3596.
Copyright (2009). Disclaimer: The code and executable are provided as is, and users use them at their own risk.