Skip to content

Supplementary data for "A survey and evaluations of histogram-based statistics in alignment-free sequence comparison"

Notifications You must be signed in to change notification settings

BioinformaticsToolsmith/Alignment-Free-Kmer-Statistics

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

6 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Manual for using the Benchmark

Authors: Brian Luczak, Ben James, and Hani Girgis The Bioinformatics Toolsmith Laboratory, the University of Tulsa, OK, USA

  • This Alignment-Free Benchmarking Tool is designed to run on Mac and Linux systems.

Requirements

  • GNU g++ 7.1 or higher
  • MATLAB with Statistics and Machine Learning Toolbox R2017a or higher
  • Perl
  • BioPerl

After downloading and unzipping the Benchmark directory, there are 2 primary folders: Cpp (C++ code), and MATLAB.

  • The C++ code is used for creating the k-mer histograms for the sequences and align them using the Needleman-Wunsch alignment algorithm.
  • The MATLAB code is to load the alignment/histograms files and evaluate the statistics.
  • There are 3 Perl scripts.

Compilation and installation

  • git clone https://github.com/TulsaBioinformaticsToolsmith/Alignment-Free-Kmer-Statistics/
  • cd Alignment-Free-Kmer-Statistics
  • ./install.pl
  • cd Cpp/Align
  • CXX=g++-7 make or just make if GNU g++ is the default compiler
  • cd ../Hist
  • CXX=g++-7 make or just make if GNU g++ is the default compiler

Testing the Benchmark with a given folder of sequences

  • Next, after obtaining a chosen selection of sequences in the FASTA format, the Perl script alignmentFree.pl will allow you to evaluate the same experiments on any selection of data.
  • Example: alignmentFree.pl ~/user/myFASTASeqfolder ~/user/Results
  • Notes:
    • MATLAB should be a searchable application within the system, i.e. it can be executed from any directory without the full path.
    • The output directory (second command line argument) will be deleted if it already exists and a new one will be created.
  • Steps of the Process:
    1. Each of the sequences is aligned pairwise with every other sequence (no repetitions)
      • This establishes the benchmark for comparison
      • Due to the relative inefficiency of alignment, this can take quite a bit of time depending on the number of sequences and their length
        • The program will output the progress in .2% intervals
    2. K-mer histograms are generated for k=1 through k=10, this will cover any sequences with lengths less than or equal to one million base pairs
    3. MATLAB is opened and the EVALUATE function is called
      • This method procedurally picks an appropriate k-value based on the average sequence length
      • The files are converted into matrices using dlmread, delimiter is !, this delimiter should prevent any collisions with sequence names in the FASTA format
      • Warning: only1 million sequence comparisons are allowed
    4. The REVIEW_PAPER function is called
      • Three primary experiments from the paper are run on the histogram and alignment data
        • The K Nearest Neighbor Experiment is repeated with mer-size minus 1 and minus 2 (as long as the initial kmer-size is big enough)
    5. The results can be found in the designated output directory provided by the user -There are 3 folders provided in the results:
      • Align: contains the file with the alignment identity scores for the sequences provided
      • Hist: contains the histogram files, k-mer size 1 through 10 are provided
      • Matlab_EVAL: contains the complete MATLAB evaluation with figures
        • exp1 is the Sensitivity/Specificity experiment, it contains pdf versions of the figures as well as the Distribution of Alignment Identity score values by 5% bins
          • exp2 is the Linear Correlation experiment, the text files will allow you to see the specific R^2 values, Fig holds all the figures of the top performing statistics, the first number is the threshold percentage, 1-5 are the rankings
          • exp3 is the K Nearest Neighbor experiment
  • Replicating the Review Paper Results
    • After opening the MATLAB code with your MATLAB (R2017a), simply run the following script in the command window:
      > Review_Paper_Results('output_dir')
              
    • Here, output_dir is the path to the output directory where the Results will be created.

    -This will load the data from the workspace Example_Data.mat and run each of the experiments with the same parameters used in the paper.

  • MATLAB Conventions
    • EVALUATE is the driver MATLAB Function that alignmentFree.pl calls
    • REVIEW_PAPER is the primary MATLAB function that runs each of the experiments from the review paper

    -All alignment identity score matrices have the numbers for the sequences being compared in cols 2 and 3.

    -Example: see build_all_features 91.8976 2 4 in an Alignment matrix means that sequences 2 and 4 (rows 2 and 4) in the histogram matrix have an id score of 89.76%

About

Supplementary data for "A survey and evaluations of histogram-based statistics in alignment-free sequence comparison"

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published