Several major updates #78

Merged 58 commits on Oct 20, 2024

Commits
b88ab25
src/smithlab_cpp: updating submodules
andrewdavidsmith Oct 18, 2024
e714421
Fixing hashes on test outputs and fixing script for testing lc_extrap
andrewdavidsmith Oct 18, 2024
300ca41
src/moment_sequence.hcpp: removing unused verbose arguments to functions
andrewdavidsmith Oct 18, 2024
75a7a5e
src/c_curve.cpp: removed code that was not used for c_curve from this…
andrewdavidsmith Oct 18, 2024
dba838d
src/load_data_for_complexity.hcpp: made filenames pass by const refer…
andrewdavidsmith Oct 18, 2024
744b7cb
src/lc_extrap.cpp: removing c_str
andrewdavidsmith Oct 18, 2024
ef0f47e
src/gc_extrap.cpp: removing unused verbose arg to load_coverage_count…
andrewdavidsmith Oct 18, 2024
a04135f
src/bound_pop.cpp: removing unused verbose arg to quadrature rules fu…
andrewdavidsmith Oct 18, 2024
dc9d4f8
src/bamxx: Adding bamxx submodule
andrewdavidsmith Oct 18, 2024
cc115f6
Makefile.am: removing the to-mr target and adding bamxx and src/bam_r…
andrewdavidsmith Oct 18, 2024
350c990
src/load_data_for_complexity.hcpp: Updating the function to load coun…
andrewdavidsmith Oct 18, 2024
118a092
Adding back functionality to work with BAM in paired-end mode now usi…
andrewdavidsmith Oct 18, 2024
a0bc9b3
src/common.hcpp: adding these files
andrewdavidsmith Oct 18, 2024
5c1483a
src/bam_record_utils.hcpp: adding these source file
andrewdavidsmith Oct 18, 2024
b6aaa24
src/dnmt_error.hpp: adding this file because bam_record_utils.cpp nee…
andrewdavidsmith Oct 18, 2024
d3454eb
Formatting with clang-format and linting with cpplint
andrewdavidsmith Oct 18, 2024
05c2081
src/load_data_for_complexity.hpp: adding a function to load coverage …
andrewdavidsmith Oct 19, 2024
90148c2
src/load_data_for_complexity.cpp: adding structs to gather data toget…
andrewdavidsmith Oct 19, 2024
47cd233
src/gc_extrap.cpp: added functionality to do genome coverage from BAM…
andrewdavidsmith Oct 19, 2024
3245e62
Adding threads for functions that read BAM input
andrewdavidsmith Oct 19, 2024
c72b53f
src/load_data_for_complexity.cpp: factoring out sorted order check on…
andrewdavidsmith Oct 19, 2024
0a94079
src/load_data_for_complexity.cpp: linting with cpplint
andrewdavidsmith Oct 19, 2024
251d39f
src/load_data_for_complexity.cpp: fixing a bug in checking whether a …
andrewdavidsmith Oct 19, 2024
2b21984
src/load_data_for_complexity.cpp: another bugfix related to checking …
andrewdavidsmith Oct 19, 2024
6562e4a
src/load_data_for_complexity.cpp: fixing a bug in checking sorted ord…
andrewdavidsmith Oct 19, 2024
7b38e63
src/smithlab_cpp: updating submodule
andrewdavidsmith Oct 19, 2024
335ecc9
src/load_data_for_complexity.cpp: formatting and comments
andrewdavidsmith Oct 19, 2024
197c9e5
src/lc_extrap.cpp: adding code to format the histogram and write it t…
andrewdavidsmith Oct 19, 2024
c063d4a
src/common.hpp: adding the report_histogram to the common sources
andrewdavidsmith Oct 19, 2024
8cc2b43
src/load_data_for_complexity.cpp: fixing a bug in setting endpoints o…
andrewdavidsmith Oct 19, 2024
87082db
src/c_curve.cpp: adding the ability to output the counts histogram
andrewdavidsmith Oct 19, 2024
ce52e2b
src/gc_extrap.cpp: adding the ability to output the counts histogram
andrewdavidsmith Oct 19, 2024
99edc6b
src/lc_extrap.cpp: moving the report_histogram function to common
andrewdavidsmith Oct 19, 2024
094584a
src/load_data_for_complexity.cpp: fixed the priority queue which was …
andrewdavidsmith Oct 19, 2024
d15c897
src/load_data_for_complexity.cpp: removing some code that was aimed a…
andrewdavidsmith Oct 19, 2024
8598eaf
src/lc_extrap.cpp: merging option to report the 'about' information w…
andrewdavidsmith Oct 19, 2024
4761cc0
src/gc_extrap.cpp: merging option to report the 'about' information w…
andrewdavidsmith Oct 19, 2024
96714f9
src/bound_pop.cpp: providing option to report histogram to a file
andrewdavidsmith Oct 19, 2024
c62b649
src/pop_size.cpp: providing option to report histogram to a file
andrewdavidsmith Oct 19, 2024
5976618
src/load_data_for_complexity.cpp: satsifying cpplint
andrewdavidsmith Oct 19, 2024
ea1ee85
src/to-mr.cpp: removing this file as its only remaining functionality…
andrewdavidsmith Oct 20, 2024
539bc81
src/moment_sequence.cpp: fixing a var shadowing earlier var, adding c…
andrewdavidsmith Oct 20, 2024
ef0dc93
src/moment_sequence.cpp: removing the ACCEPT_HANKEL variable because …
andrewdavidsmith Oct 20, 2024
fd8fdd2
src/continued_fraction.hpp: removing the return_degree function as it…
andrewdavidsmith Oct 20, 2024
6b42ed7
Collapsing consecutive verbose conditions and factoring a big chunk o…
andrewdavidsmith Oct 20, 2024
eda26a5
.cppcheck_suppress: adding a config file for cppcheck
andrewdavidsmith Oct 20, 2024
7c9f9c7
src/continued_fraction.hcpp: removing an unused function
andrewdavidsmith Oct 20, 2024
0f896af
src/bam_record_utils.hpp: fixing the use of bool values in less-than …
andrewdavidsmith Oct 20, 2024
519239d
src/bam_record_utils.cpp: fixing some issues from cppcheck including …
andrewdavidsmith Oct 20, 2024
6b4c49b
src/load_data_for_complexity.cpp: removing some unused variables
andrewdavidsmith Oct 20, 2024
12c73b5
.github/workflows/cpplint.yml: removing the step of checking version
andrewdavidsmith Oct 20, 2024
e1dee37
.cppcheck_suppress: adding newline before eof
andrewdavidsmith Oct 20, 2024
afd38e7
cppcheck.yml: adding workflow for cppcheck
andrewdavidsmith Oct 20, 2024
402a67f
only writing histograms to files and this is separate from verbose
andrewdavidsmith Oct 20, 2024
d35eaf4
Removing the unused srand with pid and time because it was never used…
andrewdavidsmith Oct 20, 2024
c1005a5
clang-format
andrewdavidsmith Oct 20, 2024
e4cc220
.github/workflows/cppcheck.yml: trying to find cppcheck
andrewdavidsmith Oct 20, 2024
dbcf158
.github/workflows/cppcheck.yml: found cppcheck
andrewdavidsmith Oct 20, 2024
src/c_curve.cpp: adding the ability to output the counts histogram
andrewdavidsmith committed Oct 19, 2024

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature.
commit 87082dbb58a2c1d20395c6e6fc28adf5a8cf5a26
93 changes: 45 additions & 48 deletions src/c_curve.cpp
@@ -25,7 +25,6 @@
#include "load_data_for_complexity.hpp"
#include "moment_sequence.hpp"

#include <GenomicRegion.hpp>
#include <OptionParser.hpp>
#include <smithlab_os.hpp>
#include <smithlab_utils.hpp>
@@ -38,17 +37,21 @@
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <filesystem>
#include <fstream>
#include <iomanip>
#include <iostream>
#include <numeric>
#include <random>
#include <string>
#include <unordered_map>
#include <vector>

namespace fs = std::filesystem;

using std::accumulate;
using std::array;
using std::cbegin;
using std::cend;
using std::cerr;
using std::endl;
using std::isfinite;
@@ -59,37 +62,33 @@ using std::runtime_error;
using std::setprecision;
using std::size;
using std::string;
using std::to_string;
using std::uint64_t;
using std::unordered_map;
using std::vector;

template <typename T>
T
median_from_sorted_vector(const vector<T> sorted_data, const size_t stride,
median_from_sorted_vector(const vector<T> &sorted_data, const size_t stride,
const size_t n) {
if (n == 0 || sorted_data.empty())
return 0.0;

const size_t lhs = (n - 1) / 2;
const size_t rhs = n / 2;

if (lhs == rhs)
return sorted_data[lhs * stride];

return (sorted_data[lhs * stride] + sorted_data[rhs * stride]) / 2.0;
}
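
Review note on `median_from_sorted_vector`: taking `sorted_data` by `const vector<T> &` instead of by value avoids copying the whole vector on every call, and the body is otherwise unchanged. The sketch below is a stand-alone illustration of the same strided-median logic, not the project's code. The name `median_strided` is hypothetical, and returning `T{}` for empty input is an assumption made here (the diff returns `0.0`).

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-alone variant of the strided median in the diff.
// Reads n elements at positions 0, stride, 2*stride, ... of a sorted
// vector and returns their median. The vector is passed by const
// reference, so no copy is made.
template <typename T>
T median_strided(const std::vector<T> &sorted_data, const std::size_t stride,
                 const std::size_t n) {
  if (n == 0 || sorted_data.empty())
    return T{};  // assumption: value-initialized result for empty input

  const std::size_t lhs = (n - 1) / 2;  // lower middle index
  const std::size_t rhs = n / 2;        // upper middle index

  if (lhs == rhs)  // odd n: a single middle element
    return sorted_data[lhs * stride];

  // even n: average of the two middle elements
  return (sorted_data[lhs * stride] + sorted_data[rhs * stride]) / 2.0;
}
```

The caller is responsible for ensuring `(n - 1) * stride` is a valid index; neither this sketch nor the function in the diff checks bounds.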

int
c_curve_main(const int argc, const char *argv[]) {
try {
bool VERBOSE = false;
bool verbose = false;
bool PAIRED_END = false;
bool HIST_INPUT = false;
bool VALS_INPUT = false;
uint64_t seed = 408;

string outfile;
string histogram_outfile;

size_t upper_limit = 0;
double step_size = 1e6;
@@ -99,25 +98,32 @@ c_curve_main(const int argc, const char *argv[]) {
uint32_t n_threads{1};
#endif

const string description = R"(
Generate the complexity curve for data. This does not extrapolate, \
but instead resamples from the given data.)";
const string description =
R"(
Generate the complexity curve for data. This does not extrapolate, but
instead resamples from the given data.
)";
string program_name = fs::path(argv[0]).filename();
program_name += " " + string(argv[1]);

/********** GET COMMAND LINE ARGUMENTS FOR C_CURVE ***********/
OptionParser opt_parse(strip_path(argv[1]), description, "<input-file>");
OptionParser opt_parse(program_name, description, "<input-file>");
opt_parse.add_opt("output", 'o', "yield output file (default: stdout)",
false, outfile);
opt_parse.add_opt("step", 's', "step size in extrapolations", false,
step_size);
opt_parse.add_opt("verbose", 'v', "print more information", false, VERBOSE);
opt_parse.add_opt("pe", 'P', "input is paired end read file", false,
opt_parse.add_opt("verbose", 'v', "print more information", false, verbose);
opt_parse.add_opt("pe", 'P', "input paired end read file", false,
PAIRED_END);
opt_parse.add_opt("hist", 'H',
"input is a text file containing the observed histogram",
false, HIST_INPUT);
opt_parse.add_opt(
"vals", 'V', "input is a text file containing only the observed counts",
false, VALS_INPUT);
"input is text file containing observed histogram", false,
HIST_INPUT);
opt_parse.add_opt("hist-out", '\0',
"output histogram to this file (for non-hist input)",
false, histogram_outfile);
opt_parse.add_opt("vals", 'V',
"input is text file containing only observed counts",
false, VALS_INPUT);
#ifdef HAVE_HTSLIB
opt_parse.add_opt("bam", 'B', "input is in BAM format", false,
BAM_FORMAT_INPUT);
@@ -136,6 +142,7 @@ but instead resamples from the given data.)";
opt_parse.parse(argc - 1, argv + 1, leftover_args);
if (argc == 2 || opt_parse.help_requested()) {
cerr << opt_parse.help_message() << endl;
cerr << opt_parse.about_message() << endl;
return EXIT_SUCCESS;
}
if (opt_parse.about_requested()) {
@@ -154,101 +161,91 @@ but instead resamples from the given data.)";
/******************************************************************/

// Setup the random number generator
srand(time(0) + getpid()); // give the random fxn a new seed
srand(time(0) + getpid()); // random seed
mt19937 rng(seed);

vector<double> counts_hist;
size_t n_reads = 0;

// LOAD VALUES
if (HIST_INPUT) {
if (VERBOSE)
if (verbose)
cerr << "INPUT_HIST" << endl;
n_reads = load_histogram(input_file_name, counts_hist);
}
else if (VALS_INPUT) {
if (VERBOSE)
if (verbose)
cerr << "VALS_INPUT" << endl;
n_reads = load_counts(input_file_name, counts_hist);
}
#ifdef HAVE_HTSLIB
else if (BAM_FORMAT_INPUT && PAIRED_END) {
if (VERBOSE)
if (verbose)
cerr << "PAIRED_END_BAM_INPUT" << endl;
n_reads = load_counts_BAM_pe(n_threads, input_file_name, counts_hist);
}
else if (BAM_FORMAT_INPUT) {
if (VERBOSE)
if (verbose)
cerr << "BAM_INPUT" << endl;
n_reads = load_counts_BAM_se(n_threads, input_file_name, counts_hist);
}
#endif
else if (PAIRED_END) {
if (VERBOSE)
if (verbose)
cerr << "PAIRED_END_BED_INPUT" << endl;
n_reads = load_counts_BED_pe(input_file_name, counts_hist);
}
else { // default is single end bed file
if (VERBOSE)
if (verbose)
cerr << "BED_INPUT" << endl;
n_reads = load_counts_BED_se(input_file_name, counts_hist);
}

const size_t max_observed_count = counts_hist.size() - 1;
const double distinct_reads =
accumulate(begin(counts_hist), end(counts_hist), 0.0);
accumulate(cbegin(counts_hist), cend(counts_hist), 0.0);

const size_t total_reads = get_counts_from_hist(counts_hist);

const size_t distinct_counts =
std::count_if(begin(counts_hist), end(counts_hist),
std::count_if(cbegin(counts_hist), cend(counts_hist),
[](const double x) { return x > 0.0; });

if (VERBOSE)
if (verbose)
cerr << "TOTAL READS = " << n_reads << endl
<< "COUNTS_SUM = " << total_reads << endl
<< "DISTINCT READS = " << distinct_reads << endl
<< "DISTINCT COUNTS = " << distinct_counts << endl
<< "MAX COUNT = " << max_observed_count << endl
<< "COUNTS OF 1 = " << counts_hist[1] << endl;

if (VERBOSE) {
// output the original histogram
cerr << "OBSERVED COUNTS (" << counts_hist.size() << ")" << endl;
for (size_t i = 0; i < counts_hist.size(); i++)
if (counts_hist[i] > 0)
cerr << i << '\t' << static_cast<size_t>(counts_hist[i]) << endl;
cerr << endl;
}
if (verbose)
report_histogram(histogram_outfile, counts_hist);

if (upper_limit == 0)
upper_limit = n_reads; // set upper limit to equal the number of
upper_limit = n_reads; // set upper limit equal to number of
// molecules

// handles output of c_curve
// setup for output of the complexity curve
std::ofstream of;
if (!outfile.empty())
of.open(outfile.c_str());
of.open(outfile);
std::ostream out(outfile.empty() ? std::cout.rdbuf() : of.rdbuf());

// prints the complexity curve
out << "total_reads" << "\t" << "distinct_reads" << endl;
out << 0 << '\t' << 0 << endl;
for (size_t i = step_size; i <= upper_limit; i += step_size) {
if (VERBOSE)
if (verbose)
cerr << "sample size: " << i << endl;
out << i << "\t"
<< interpolate_distinct(counts_hist, total_reads, distinct_reads, i)
<< endl;
}
}
catch (runtime_error &e) {
catch (const std::exception &e) {
cerr << "ERROR:\t" << e.what() << endl;
return EXIT_FAILURE;
}
catch (std::bad_alloc &ba) {
cerr << "ERROR: could not allocate memory" << endl;
return EXIT_FAILURE;
}
return EXIT_SUCCESS;
}
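
Review note on the exception handling: the handler in this diff appears to change from `catch (runtime_error &e)` to `catch (const std::exception &e)`. Both `std::runtime_error` and `std::bad_alloc` derive from `std::exception`, so a single handler on the base class covers both, and any `std::bad_alloc` handler placed after it would never be reached. The sketch below illustrates that behavior in isolation. The helper `run_and_describe` is hypothetical and exists only for this example.

```cpp
#include <exception>
#include <new>
#include <stdexcept>
#include <string>

// Hypothetical helper: runs a callable and reports how it ended.
// A single handler on const std::exception & catches every standard
// exception type, including std::runtime_error and std::bad_alloc.
template <typename F>
std::string run_and_describe(F &&f) {
  try {
    f();
  }
  catch (const std::exception &e) {
    // Same message shape as the handler in the diff.
    return std::string("ERROR:\t") + e.what();
  }
  return "ok";
}
```

For example, `run_and_describe([] { throw std::runtime_error("bad input"); })` yields `"ERROR:\tbad input"`. The text returned by `what()` for `std::bad_alloc` is implementation-defined, so only the `ERROR:` prefix is predictable in that case.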