FASTA_ANALYSIS

The objective of this work is to find a strategy that improves the compression of genomic sequence files (FASTA, FA, etc.). Many compression tools are already available for this purpose, such as NAF, MBGC, and gzip, among others.

However, if we work on a previously sorted file, in which the most similar sequences are grouped together, it is reasonable to expect compression to improve (that is, a smaller compressed file). To that end, an executable, FASTA_ANALY (built from C++ code), is dedicated to sorting this type of file. The sorting criterion is chosen by the user when running ./FASTA_ANALY. It can range from the absolute number or percentage of nucleotide pairs (AT or CG) to size or percentage.
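To make the idea concrete, here is a minimal shell sketch of one such criterion: ordering records by their absolute A+T count. It is not the FASTA_ANALY implementation. The function name `sort_fasta_by_at`, the descending order, and the single-line sequence output are all assumptions made for illustration, and it assumes header lines contain no tab characters.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of FASTA_ANALY): print the records of a
# FASTA file ordered by descending A+T count. Sequences are written
# back on a single line each.
sort_fasta_by_at() {
    # Step 1: join every record into one line: header<TAB>sequence
    awk '
        /^>/ { if (header != "") print header "\t" seq
               header = $0; seq = ""; next }
             { seq = seq $0 }
        END  { if (header != "") print header "\t" seq }
    ' "$1" |
    # Step 2: prefix each record with its A+T count (case-insensitive);
    # gsub returns the number of characters it removed from the copy s
    awk -F '\t' '{ s = toupper($2); n = gsub(/[AT]/, "", s)
                   print n "\t" $1 "\t" $2 }' |
    # Step 3: stable numeric sort on the count, highest first
    sort -t "$(printf '\t')" -k1,1nr -s |
    # Step 4: drop the count and restore the two-line record layout
    awk -F '\t' '{ print $2; print $3 }'
}
```

For example, `sort_fasta_by_at unsorted_file.fasta > sorted_file.fasta` would write the reordered records to a new file, which can then be passed to any compressor.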

In addition, a compression script makes it possible to combine the 5 sorting scenarios with 7 different compressors, of which 3 are general-purpose and 4 are DNA-specific. The script reports the sizes of the compressed files and the compression times, and it also compares compression with and without sorting by generating CSV files and plots.
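The comparison with and without sorting can be sketched for a single compressor as follows. This is only an illustration using gzip, not the repository's benchmark script. The function names `compressed_size` and `compare_sizes` and the CSV column names are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not the repository's scripts).

# compressed_size FILE: print the size in bytes of FILE after gzip -9.
# Reading from stdin keeps the file name out of the gzip header, so the
# result depends only on the file contents.
compressed_size() {
    gzip -9 -c < "$1" | wc -c | tr -d ' '
}

# compare_sizes UNSORTED SORTED: print a two-line CSV report with the
# compressed size of each file.
compare_sizes() {
    printf 'compressor,unsorted_bytes,sorted_bytes\n'
    printf 'gzip,%s,%s\n' "$(compressed_size "$1")" "$(compressed_size "$2")"
}
```

Calling `compare_sizes unsorted_file.fasta sorted_file.fasta` prints a header row and one data row. A smaller value in the last column would indicate that sorting helped for that input.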

Usage Example

./FASTA_ANALY -sort=AT unsorted_file.fasta sorted_file.fasta 1

A description of the available options can be obtained by invoking:

./FASTA_ANALY -h

Data compression tools used


| Data Compressor | Repository | Description |
|---|---|---|
| NAF | code | article |
| MFCompress | code | article |
| JARVIS3 | code | |
| gzip | code | article |
| lzma | code | |
| bzip2 | code | article |
| MBGC | code | article |

Compression Benchmark Reproducibility:

Change directory and give execute permissions:

cd src/Compression_Scripts
chmod +x *.sh

Compression Benchmark Usage:

Run all compression commands:

./compression_test_script.sh

Run isolated compression commands:

./compression_COMPRESSORNAME.sh SORTING_TYPE INPUT_FILE 0