An approach to multiple DNA sequences compression

Wu, Choi-ping Paula

Please use this identifier to cite or link to this item: http://hdl.handle.net/10397/85471

DC Field	Value	Language
dc.contributor	Department of Electronic and Information Engineering	-
dc.creator	Wu, Choi-ping Paula	-
dc.identifier.uri	https://theses.lib.polyu.edu.hk/handle/200/4162	-
dc.language.iso	English	-
dc.title	An approach to multiple DNA sequences compression	-
dc.type	Thesis	-
dcterms.abstract	Deoxyribonucleic acid (DNA) technologies have been widely used in genetic engineering, forensics and anthropology applications. A DNA sequence is a long sequence consisting of four types of nucleotides called bases. The number of bases of the 24 chromosomes in humans ranges from 50 to 250 million. Without any compression, two bits per base are required for storage which results in a large number of bits for encoding DNA sequences. An effective way to compress these sequences is thus desirable in order to reduce the storage space required. General-purpose compression tools such as gzip use more than two bits to encode a base. This is because these tools do not make use of the special characteristics of a DNA sequence. For example, it is well known that a DNA sequence has long-term correlation in that subsequences in different regions of a DNA sequence are similar to each other. State-of-the-art DNA compression schemes all rely on exploiting this long-term correlation. In particular, repetitions within the DNA sequence are searched so that similar subsequences can be encoded with reference to each other. For these DNA compression schemes, the reduced average bits per base (bpb) are around 0.25 for the benchmark DNA sequences. Thus, the compression gain is not large. It is well known that there are similarities among different chromosome sequences. All state-of-the-art compression algorithms exploit only self-sequence similarities, and specifically ignore the cross-sequence similarities. We have performed a thorough study of similarities within the same chromosome sequence as well as similarities among different chromosome sequences. These similarities are characterized by the existence of similar subsequences in different chromosome sequences. Our study indicates that these cross-sequence similarities are often significant when compared to self-sequence similarities. In the experimental results from the sixteen chromosome sequences in S. cerevisiae, the average repetitive length from cross-sequence prediction was almost fourteen times of that from self-sequence prediction. To make use of both self-sequence and cross-sequence similarities in DNA compression, we have proposed a multi-sequence compression algorithm. Our scheme compresses two or more sequences together so that similar subsequences found among multiple sequences can be encoded together. In this scheme, we first create a list of similar subsequences, from either the reference sequence or the current sequence which are ordered according to their importance. The list is then modified by removing the overlapping similar subsequences. After reordering the list according to their position and removing similar subsequences from the current sequence, Arith-2 coder is used to further compress the non-repetitive regions. Our experimental results show that compressing a sequence with reference to another sequence achieves an average of 5.5% saving in bpb as compared with that without reference to another sequence, hence the bpb of compressing two chromosome sequences together is consistently better than that of compressing them separately. This shows the importance of cross-sequence similarities. We have also extended the cross-sequence predictions to more than two chromosome sequences. We found that there is always additional savings in bpb by compressing a number of chromosome sequences together. Since by making use of the cross-sequence similarities, our proposed multiple sequence compression algorithm can outperform other single sequence-based compression algorithms.	-
dcterms.accessRights	open access	-
dcterms.educationLevel	M.Phil.	-
dcterms.extent	xiii, 115 p. : ill. (some col.) ; 30 cm.	-
dcterms.issued	2009	-
dcterms.LCSH	Hong Kong Polytechnic University -- Dissertations.	-
dcterms.LCSH	Nucleotide sequence -- Mathematical models.	-
dcterms.LCSH	Chromosomes -- Mathematical models.	-
Appears in Collections:	Thesis

Access

View full-text via https://theses.lib.polyu.edu.hk/handle/200/4162

Show simple item record

Page views

184

Last Week
1

Last month

Citations as of Dec 7, 2025

Google Scholar^TM

Check

Access

Page views

Google ScholarTM

Google Scholar^TM