The objective of this work is to find a strategy that improves the compression of genomic sequence files (`.fasta`, `.fa`, etc.). A plethora of compression tools is already available, such as NAF, MBGC, and gzip, among others.
However, if we first work on a sorted file, where the most similar sequences are grouped together, it is reasonable to expect the compressed size to decrease. To achieve that, an executable (built from C++ code) is provided, dedicated to sorting this type of file. The sorting criterion is chosen by the user when running `./FASTA_ANALY`: it ranges from the absolute count or percentage of nucleotide pairs (AT or CG) to the sequence size, in absolute or percentage terms.
There is also a compression script that combines the 5 sorting scenarios with 7 different compressors: 3 general-purpose and 4 DNA-specific. The script reports not only the sizes of the compressed files and the compression times, but also a comparison between compression with and without sorting, through the creation of CSV files and plots.
For example, to sort a file by AT content:

```
./FASTA_ANALY -sort=AT unsorted_file.fasta sorted_file.fasta 1
```
A description of the available options can be obtained by invoking:
```
./FASTA_ANALY -h
```
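As a rough illustration of the sorting idea, here is a minimal C++ sketch that orders FASTA records by absolute AT count. It is not the actual FASTA_ANALY implementation; the record structure and I/O handling are simplified assumptions.

```cpp
// Minimal sketch: sort FASTA records by absolute AT count.
// Illustrative only -- not the actual FASTA_ANALY implementation.
#include <algorithm>
#include <fstream>
#include <iostream>
#include <string>
#include <vector>

struct Record {
    std::string header;   // line starting with '>'
    std::string sequence; // concatenated sequence lines
};

// Count A and T nucleotides (the -sort=AT criterion, by absolute count).
static std::size_t at_count(const std::string& seq) {
    std::size_t n = 0;
    for (char c : seq)
        if (c == 'A' || c == 'a' || c == 'T' || c == 't') ++n;
    return n;
}

int main(int argc, char** argv) {
    if (argc != 3) {
        std::cerr << "usage: " << argv[0] << " in.fasta out.fasta\n";
        return 1;
    }
    std::ifstream in(argv[1]);
    std::vector<Record> records;
    std::string line;
    while (std::getline(in, line)) {
        if (!line.empty() && line[0] == '>')
            records.push_back({line, ""});
        else if (!records.empty())
            records.back().sequence += line;
    }
    // Group similar sequences: here, ascending AT count.
    std::sort(records.begin(), records.end(),
              [](const Record& a, const Record& b) {
                  return at_count(a.sequence) < at_count(b.sequence);
              });
    std::ofstream out(argv[2]);
    for (const auto& r : records)
        out << r.header << '\n' << r.sequence << '\n';
    return 0;
}
```

The sketch reads all records into memory for simplicity; handling very large files may require streaming or indexing.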
The compressors used are listed below:

Data Compressor | Repository | Article |
---|---|---|
NAF | code | article |
MFCompress | code | article |
JARVIS3 | code | |
gzip | code | article |
lzma | code | |
bzip2 | code | article |
MBGC | code | article |
Change directory and give execute permissions:
```
cd src/Compression_Scripts
chmod +x *.sh
```
Run all compression commands:
```
./compression_test_script.sh
```
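Conceptually, the script loops over the compressors, timing each run and recording the compressed size in a CSV. The sketch below illustrates that loop in C++ for two of the general-purpose compressors; the actual pipeline is a shell script, and the command templates and output file names here are assumptions.

```cpp
// Conceptual sketch of the benchmark loop: for each compressor, time the
// compression and record the output size in a CSV. The real pipeline is a
// shell script; commands and file names here are assumptions.
#include <chrono>
#include <cstdlib>
#include <filesystem>
#include <fstream>
#include <string>
#include <utility>
#include <vector>

int main() {
    const std::string input = "sorted_file.fasta"; // hypothetical input
    // Hypothetical command templates; real flags depend on each tool.
    const std::vector<std::pair<std::string, std::string>> compressors = {
        {"gzip",  "gzip -k -f " + input},
        {"bzip2", "bzip2 -k -f " + input},
    };
    std::ofstream csv("results.csv");
    csv << "compressor,seconds,bytes\n";
    for (const auto& [name, cmd] : compressors) {
        auto t0 = std::chrono::steady_clock::now();
        if (std::system(cmd.c_str()) != 0) continue; // skip failed runs
        auto t1 = std::chrono::steady_clock::now();
        double secs = std::chrono::duration<double>(t1 - t0).count();
        std::string out = input + (name == "gzip" ? ".gz" : ".bz2");
        csv << name << ',' << secs << ','
            << std::filesystem::file_size(out) << '\n';
    }
    return 0;
}
```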
Run an isolated compression command:

```
./compression_COMPRESSORNAME.sh SORTING_TYPE INPUT_FILE 0
```
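For example, assuming a per-compressor script named after NAF and the AT sorting type from the example above (both names are inferred from this README, not confirmed):

```
./compression_NAF.sh AT sorted_file.fasta 0
```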