Please use this identifier to cite or link to this item:
Title: An approach to multiple DNA sequences compression
Authors: Wu, Choi-ping Paula
Keywords: Hong Kong Polytechnic University -- Dissertations
Nucleotide sequence -- Mathematical models
Chromosomes -- Mathematical models
Issue Date: 2009
Publisher: The Hong Kong Polytechnic University
Abstract: Deoxyribonucleic acid (DNA) technologies have been widely used in genetic engineering, forensics and anthropology applications. A DNA sequence is a long sequence consisting of four types of nucleotides called bases. The number of bases of the 24 chromosomes in humans ranges from 50 to 250 million. Without any compression, two bits per base are required for storage which results in a large number of bits for encoding DNA sequences. An effective way to compress these sequences is thus desirable in order to reduce the storage space required. General-purpose compression tools such as gzip use more than two bits to encode a base. This is because these tools do not make use of the special characteristics of a DNA sequence. For example, it is well known that a DNA sequence has long-term correlation in that subsequences in different regions of a DNA sequence are similar to each other. State-of-the-art DNA compression schemes all rely on exploiting this long-term correlation. In particular, repetitions within the DNA sequence are searched so that similar subsequences can be encoded with reference to each other. For these DNA compression schemes, the reduced average bits per base (bpb) are around 0.25 for the benchmark DNA sequences. Thus, the compression gain is not large. It is well known that there are similarities among different chromosome sequences. All state-of-the-art compression algorithms exploit only self-sequence similarities, and specifically ignore the cross-sequence similarities. We have performed a thorough study of similarities within the same chromosome sequence as well as similarities among different chromosome sequences. These similarities are characterized by the existence of similar subsequences in different chromosome sequences. Our study indicates that these cross-sequence similarities are often significant when compared to self-sequence similarities. In the experimental results from the sixteen chromosome sequences in S. cerevisiae, the average repetitive length from cross-sequence prediction was almost fourteen times of that from self-sequence prediction. To make use of both self-sequence and cross-sequence similarities in DNA compression, we have proposed a multi-sequence compression algorithm. Our scheme compresses two or more sequences together so that similar subsequences found among multiple sequences can be encoded together. In this scheme, we first create a list of similar subsequences, from either the reference sequence or the current sequence which are ordered according to their importance. The list is then modified by removing the overlapping similar subsequences. After reordering the list according to their position and removing similar subsequences from the current sequence, Arith-2 coder is used to further compress the non-repetitive regions. Our experimental results show that compressing a sequence with reference to another sequence achieves an average of 5.5% saving in bpb as compared with that without reference to another sequence, hence the bpb of compressing two chromosome sequences together is consistently better than that of compressing them separately. This shows the importance of cross-sequence similarities. We have also extended the cross-sequence predictions to more than two chromosome sequences. We found that there is always additional savings in bpb by compressing a number of chromosome sequences together. Since by making use of the cross-sequence similarities, our proposed multiple sequence compression algorithm can outperform other single sequence-based compression algorithms.
Description: xiii, 115 p. : ill. (some col.) ; 30 cm.
PolyU Library Call No.: [THS] LG51 .H577M EIE 2009 Wu
Rights: All rights reserved.
Appears in Collections:Thesis

Files in This Item:
File Description SizeFormat 
b23067883_link.htmFor PolyU Users 162 BHTMLView/Open
b23067883_ir.pdfFor All Users (Non-printable) 1.74 MBAdobe PDFView/Open
Show full item record
PIRA download icon_1.1View/Download Contents

Page view(s)

Last Week
Last month
Citations as of Oct 22, 2018


Citations as of Oct 22, 2018

Google ScholarTM


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.