The objective of this work is to find a strategy that improves the compression of genomic sequence files (`.fasta`, `.fa`, etc.). A plethora of compression tools is already available, such as NAF, MBGC, and gzip, among others.
However, if we first work on a sorted file, where the most similar sequences are grouped together, it is reasonable to expect the compressed size to decrease. To achieve that, an executable (built from C++ code) is provided, dedicated to sorting this type of file. The sorting criterion is chosen by the user when running `./FASTA_ANALY`: it ranges from the absolute count or percentage of nucleotide pairs (AT or CG) to the sequence size, in absolute or percentage terms.
There is also a compression script that combines the 5 sorting scenarios with 7 different compressors: 3 general-purpose and 4 DNA-specific. The script reports not only the sizes of the compressed files and the compression times, but also a comparison between compression with and without sorting, through the creation of CSV files and plots.
For example, to sort a file by AT content:

```
./FASTA_ANALY -sort=AT unsorted_file.fasta sorted_file.fasta 1
```
A description of the available options can be obtained by invoking:
```
./FASTA_ANALY -h
```
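As a rough illustration of the sorting idea, here is a minimal C++ sketch that orders FASTA records by absolute AT count. It is not the actual FASTA_ANALY implementation; the record structure and I/O handling are simplified assumptions.

```cpp
// Minimal sketch: sort FASTA records by absolute AT count.
// Illustrative only -- not the actual FASTA_ANALY implementation.
#include <algorithm>
#include <fstream>
#include <iostream>
#include <string>
#include <vector>

struct Record {
    std::string header;   // line starting with '>'
    std::string sequence; // concatenated sequence lines
};

// Count A and T nucleotides (the -sort=AT criterion, by absolute count).
static std::size_t at_count(const std::string& seq) {
    std::size_t n = 0;
    for (char c : seq)
        if (c == 'A' || c == 'a' || c == 'T' || c == 't') ++n;
    return n;
}

int main(int argc, char** argv) {
    if (argc != 3) {
        std::cerr << "usage: " << argv[0] << " in.fasta out.fasta\n";
        return 1;
    }
    std::ifstream in(argv[1]);
    std::vector<Record> records;
    std::string line;
    while (std::getline(in, line)) {
        if (!line.empty() && line[0] == '>')
            records.push_back({line, ""});
        else if (!records.empty())
            records.back().sequence += line;
    }
    // Group similar sequences: here, ascending AT count.
    std::sort(records.begin(), records.end(),
              [](const Record& a, const Record& b) {
                  return at_count(a.sequence) < at_count(b.sequence);
              });
    std::ofstream out(argv[2]);
    for (const auto& r : records)
        out << r.header << '\n' << r.sequence << '\n';
    return 0;
}
```

The sketch reads all records into memory for simplicity; handling very large files may require streaming or indexing.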
The compressors used are listed below:

Data Compressor | Repository | Article |
---|---|---|
NAF | code | article |
MFCompress | code | article |
JARVIS3 | code | |
gzip | code | article |
lzma | code | |
bzip2 | code | article |
MBGC | code | article |
Change directory and give execute permissions:
```
cd src/Compression_Scripts
chmod +x *.sh
```
Run all compression commands:
```
./compression_test_script.sh
```
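Conceptually, the script loops over the compressors, timing each run and recording the compressed size in a CSV. The sketch below illustrates that loop in C++ for two of the general-purpose compressors; the actual pipeline is a shell script, and the command templates and output file names here are assumptions.

```cpp
// Conceptual sketch of the benchmark loop: for each compressor, time the
// compression and record the output size in a CSV. The real pipeline is a
// shell script; commands and file names here are assumptions.
#include <chrono>
#include <cstdlib>
#include <filesystem>
#include <fstream>
#include <string>
#include <utility>
#include <vector>

int main() {
    const std::string input = "sorted_file.fasta"; // hypothetical input
    // Hypothetical command templates; real flags depend on each tool.
    const std::vector<std::pair<std::string, std::string>> compressors = {
        {"gzip",  "gzip -k -f " + input},
        {"bzip2", "bzip2 -k -f " + input},
    };
    std::ofstream csv("results.csv");
    csv << "compressor,seconds,bytes\n";
    for (const auto& [name, cmd] : compressors) {
        auto t0 = std::chrono::steady_clock::now();
        if (std::system(cmd.c_str()) != 0) continue; // skip failed runs
        auto t1 = std::chrono::steady_clock::now();
        double secs = std::chrono::duration<double>(t1 - t0).count();
        std::string out = input + (name == "gzip" ? ".gz" : ".bz2");
        csv << name << ',' << secs << ','
            << std::filesystem::file_size(out) << '\n';
    }
    return 0;
}
```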
Run an isolated compression command:

```
./compression_COMPRESSORNAME.sh SORTING_TYPE INPUT_FILE 0
```
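For example, assuming a per-compressor script named after NAF and the AT sorting type from the example above (both names are inferred from this README, not confirmed):

```
./compression_NAF.sh AT sorted_file.fasta 0
```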