Background
Features of a DNA sequence can be found by compressing the sequence under a suitable model; good compression indicates low information content. [...] chromosomes of Cyanidioschyzon merolae. We present a tool that provides useful linear transformations to investigate and save new sequences. Various examples illustrate the strategy, finding features for sequences alone and in different contexts. We also show how to highlight all sets of self-repetition features, in this case within Plasmodium falciparum chromosome 2.

Conclusion
The strategy finds features that are significant and that biologists confirm. The exploration of long information sequences in linear time and space is fast, and the saved results are self-documenting.

Background
The paper presents a strategy for exploring long DNA sequences, of the order of millions of bases, by means of their information content. We bring together two pieces of our work, a Bayesian compression model and a graphical exploration tool, and give examples illustrating the strategy. Compression is used to find the features of a sequence and the common features that relate one sequence to another. Linear information content sequences are then used to locate various kinds of common information. Genomic subsequences or areas identified through this process can then be further investigated.

The compression problem is to calculate the information content per base, generating an information sequence. Information is relative, i.e. it depends on the context. The context can include one or more other sequences, and hence information content can relate two or more sequences. Note that an information sequence is 1-dimensional; operations such as difference, zoom, smooth and threshold are efficient, taking linear time and space.
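
As a rough illustration of why such operations are linear, the sketch below treats an information sequence as a plain list of per-base values in bits. The function names (`difference`, `smooth`, `below_threshold`) and the choice of a centred moving average for smoothing are assumptions made for this example; they are not taken from the tool described in the paper.

```python
from typing import List, Tuple


def difference(a: List[float], b: List[float]) -> List[float]:
    """Element-wise difference of two equal-length information sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return [x - y for x, y in zip(a, b)]


def smooth(seq: List[float], window: int) -> List[float]:
    """Centred moving average, computed in O(n) with prefix sums."""
    if window < 1:
        raise ValueError("window must be at least 1")
    half = window // 2
    prefix = [0.0]
    for x in seq:
        prefix.append(prefix[-1] + x)
    n = len(seq)
    out = []
    for i in range(n):
        lo = max(0, i - half)          # window is clipped at the ends
        hi = min(n, i + half + 1)
        out.append((prefix[hi] - prefix[lo]) / (hi - lo))
    return out


def below_threshold(seq: List[float], threshold: float) -> List[Tuple[int, int]]:
    """Half-open (start, end) runs where the information is below threshold."""
    runs = []
    start = None
    for i, x in enumerate(seq):
        if x < threshold:
            if start is None:
                start = i
        elif start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(seq)))
    return runs


if __name__ == "__main__":
    info = [2.0, 2.0, 0.5, 0.5, 0.5, 2.0]   # made-up values, in bits per base
    print(below_threshold(info, 1.0))        # candidate low-information region
```

Each operation is a single pass over a 1-dimensional list, so time and memory grow linearly with sequence length.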
By contrast, the traditional 2-dimensional plots of one sequence against another must be stored at low resolution for long sequences. Any per element compression model can be used to generate an information sequence. Here we use our Approximate Repeats Model (ARM) [1-3]; however, other statistical models that produce an information sequence could be used. We present the ARM, introduce our tool to manipulate information sequences, and explore its use for the red alga Cyanidioschyzon merolae [4] and the malaria parasite Plasmodium falciparum [5].

Methods

DNA sequence compression
We wish to examine the information content of sequences. Information content and compressibility are inherently related: low information content indicates high compressibility and high information content indicates low compressibility. So, if one has an efficient encoding of a sequence, then it can be argued that one has a good model of that sequence. From Shannon [6] we know that the length of an efficient encoding of a message is related to its probability through the log probability. That is, information I(m) = -log P(m), where P(m) is the probability of m occurring.

When trying to make an inference from some data using a Bayesian technique, we attempt to maximize the posterior probability, P(H|D) = P(D|H) P(H)/P(D), for hypothesis H and data D. If our model (hypothesis) has a nuisance parameter about which we do not care to make an inference, we should sum over all possible values of this parameter. This is necessary when using sequence alignment to infer how related two sequences are. If we are only interested in whether the sequences are related or not, we should sum over all possible alignments [7].

The way that compression models for DNA handle repetition can be broadly classified as substitutional or statistical. A substitutional model uses some form of pointer back to an earlier instance of a repeated subsequence to encode a later instance.
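
The pointer idea can be sketched with a toy greedy encoder in the style of LZ77. This is only an illustration of the substitutional principle, not BioCompress, DNACompress or any other published method; the token format, the `min_match` parameter and the quadratic search are all assumptions chosen for brevity.

```python
from typing import List, Tuple, Union

Token = Union[str, Tuple[int, int]]  # a literal base, or an (offset, length) pointer


def substitutional_encode(seq: str, min_match: int = 4) -> List[Token]:
    """Greedily replace repeats by back-pointers to their earliest longest match."""
    tokens: List[Token] = []
    i = 0
    while i < len(seq):
        best_len, best_start = 0, -1
        for start in range(i):  # naive search over all earlier positions
            length = 0
            while (i + length < len(seq)
                   and seq[start + length] == seq[i + length]):
                length += 1
            if length > best_len:
                best_len, best_start = length, start
        if best_len >= min_match:
            tokens.append((i - best_start, best_len))
            i += best_len
        else:
            tokens.append(seq[i])  # too short to be worth a pointer
            i += 1
    return tokens


def substitutional_decode(tokens: List[Token]) -> str:
    """Rebuild the sequence by copying from the already decoded output."""
    out: List[str] = []
    for token in tokens:
        if isinstance(token, str):
            out.append(token)
        else:
            offset, length = token
            for _ in range(length):
                out.append(out[-offset])  # base by base, so overlaps work
    return "".join(out)


if __name__ == "__main__":
    s = "ACGTACGTACGTTTGA"
    encoded = substitutional_encode(s)
    print(encoded)
    print(substitutional_decode(encoded) == s)
```

Note that a pointer covers a whole run of bases at once, so the cost of the encoding is attached to the repeat as a unit rather than to each individual base.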
A statistical model, by contrast, encodes the sequence element by element using a probability distribution over the possible values of the next element in the sequence. The distribution can be formed as a blend of opinions derived from the base distribution and from the length and fidelity of matches between the recent history and earlier parts of the sequence. A statistical method can directly yield a per element information sequence, in addition to deriving a compressed encoded sequence. However, there is no simple natural way to derive a per element information sequence for a substitutional model. Significant improvements in substitutional compression models for DNA include BioCompress [8], BioCompress-2 [9] and the more recent DNACompress [10]. Statistical models include those of Loewenstern and Yianilos [11], Korodi and Tabus [12], and Cao et al.
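
To make the statistical approach concrete, the following minimal sketch produces a per-base information sequence using I = -log2 P. It is not the ARM: it blends a uniform base distribution with adaptive counts from a fixed-length context, which is a much simpler stand-in. The function name `information_sequence`, the `order` and `weight` parameters, the fixed blend weight and the use of base-2 logarithms (bits) are all assumptions for illustration.

```python
import math
from collections import defaultdict
from typing import Dict, List

ALPHABET = "ACGT"


def information_sequence(seq: str, order: int = 2, weight: float = 0.9) -> List[float]:
    """Per-base information in bits under an adaptive blend of two predictors.

    One predictor is the uniform base distribution; the other is the
    frequency of each base after the preceding `order` bases, learned
    only from the part of the sequence already seen.
    """
    if not 0.0 <= weight < 1.0:
        raise ValueError("weight must be in [0, 1)")
    counts: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    uniform = 1.0 / len(ALPHABET)
    info = []
    for i, base in enumerate(seq):
        if base not in ALPHABET:
            raise ValueError(f"unexpected symbol {base!r} at position {i}")
        context = seq[max(0, i - order):i]
        seen = counts[context]
        total = sum(seen.values())
        if total:
            p = weight * seen[base] / total + (1.0 - weight) * uniform
        else:
            p = uniform                     # nothing learned for this context yet
        info.append(-math.log2(p))          # Shannon information of this base
        seen[base] += 1                     # update the model after encoding
    return info


if __name__ == "__main__":
    info = information_sequence("ACGT" * 50)
    print(info[0], info[-1])
```

Because the model is updated only after each base is scored, a decoder could reproduce the same probabilities, and the repeated region in the usage example ends up costing far less than the 2 bits per base of the uniform distribution.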