Skip to content

Commit

Permalink
Merge pull request #14 from dpryan79/iterators
Browse files Browse the repository at this point in the history
Iterators
  • Loading branch information
dpryan79 authored Nov 21, 2016
2 parents a478e90 + 55469e1 commit d6a2884
Show file tree
Hide file tree
Showing 58 changed files with 1,040 additions and 131 deletions.
7 changes: 5 additions & 2 deletions Makefile
Original file line number Diff line number Diff line change
Expand Up @@ -61,11 +61,14 @@ test/exampleWrite: libBigWig.so
test/testBigBed: libBigWig.a
$(CC) -o $@ -I. $(CFLAGS) test/testBigBed.c libBigWig.a $(LIBS)

test: test/testLocal test/testRemote test/testWrite test/testLocal test/exampleWrite test/testRemoteManyContigs test/testBigBed
test/testIterator: libBigWig.a
$(CC) -o $@ -I. $(CFLAGS) test/testIterator.c libBigWig.a $(LIBS)

test: test/testLocal test/testRemote test/testWrite test/testLocal test/exampleWrite test/testRemoteManyContigs test/testBigBed test/testIterator
./test/test.py

clean:
rm -f *.o libBigWig.a libBigWig.so *.pico test/testLocal test/testRemote test/testWrite test/exampleWrite test/testRemoteManyContigs test/testBigBed example_output.bw
rm -f *.o libBigWig.a libBigWig.so *.pico test/testLocal test/testRemote test/testWrite test/exampleWrite test/testRemoteManyContigs test/testBigBed test/testIterator example_output.bw

install: libBigWig.a libBigWig.so
install -d $(prefix)/lib $(prefix)/include
Expand Down
31 changes: 31 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -196,6 +196,37 @@ Entries will then be of the form (one per line):

Note that chromosome and start/end intervals are stored separately, so there's no need to parse them out of string. libBigWig can return these entries, either with or without the above associated strings. Parsing these string is left to the application requiring them and is currently outside the scope of this library.

# Interval/Entry iterators

Sometimes it is desirable to request a large number of intervals from a bigWig file or entries from a bigBed file, but not hold them all in memory at once (e.g., due to saving memory). To support this, libBigWig (since version 0.3.0) supports two kinds of iterators. The general process of using iterators is: (1) iterator creation, (2) traversal, and finally (3) iterator destruction. Only iterator creation differs between bigWig and bigBed files.

Importantly, iterators return results by one or more blocks. This is for convenience, since bigWig intervals and bigBed entries are stored in together in fixed-size groups, called blocks. The number of blocks of entries returned, therefore, is an option that can be specified to balance performance and memory usage.

## Iterator creation

For bigwig files, iterators are created with the `bwOverlappingIntervalsIterator()`. This function takes chromosomal bounds (chromosome name, start, and end position) as well as a number of blocks. The equivalent function for bigBed files is `bbOverlappingEntriesIterator()`, which additionally takes a `withString` argutment, which dictates whether the returned entries include the associated string values or not.

Each of the aforementioned files returns a pointer to a `bwOverlapIterator_t` object. The only important parts of this structure for end users are the following members: `entries`, `intervals`, and `data`. `entries` is a pointer to a `bbOverlappingEntries_t` object, or `NULL` if a bigWig file is being used. Likewise, `intervals` is a pointer to a `bwOverlappingIntervals_t` object, or `NULL` if a bigBed file is being used. `data` is a special pointer, used to signify the end of iteration. Thus, when `data` is a `NULL` pointer, iteration has ended.

## Iterator traversal

Regardless of whether a bigWig or bigBed file is being used, the `bwIteratorNext()` function will free currently used memory and load the appropriate intervals or entries for the next block(s). On error, this will return a NULL pointer (memory is already internally freed in this case).

## Iterator destruction

`bwOverlapIterator_t` objects MUST be destroyed after use. This can be done with the `bwIteratorDestroy()` function.

## Example

A full example is provided in `tests/testIterator.c`, but a small example of iterating over all bigWig intervals in `chr1:0-10000000` in chunks of 5 blocks follows:

iter = bwOverlappingIntervalsIterator(fp, "chr1", 0, 10000000, 5);
while(iter->data) {
//Do stuff with iter->intervals
iter = bwIteratorNext(iter);
}
bwIteratorDestroy(iter);

# A note on bigWig statistics

The results of `min`, `max`, and `mean` should be the same as those from `BigWigSummary`. `stdev` and `coverage`, however, may differ due to Kent's tools producing incorrect results (at least for `coverage`, though the same appears to be the case for `stdev`).
Expand Down
71 changes: 71 additions & 0 deletions bigWig.h
Original file line number Diff line number Diff line change
Expand Up @@ -33,6 +33,10 @@ extern "C" {
*
* As of version 0.3.0, libBigWig supports reading bigBed files. If an application needs to support both bigBed and bigWig input, then the `bwIsBigWig` and `bbIsBigBed` functions can be used to determine the file type. These both use the "magic" number at the beginning of the file to determine the file type.
*
* \section Interval and entry iterators
*
* As of version 0.3.0, libBigWig supports iterating over intervals in bigWig files and entries in bigBed files. The number of intervals/entries returned with each iteration can be controlled by setting the number of blocks processed in each iteration (intervals and entries are group inside of bigWig and bigBed files into blocks of entries). See `test/testIterator.c` for an example.
*
* \section Examples
*
* Please see [README.md](README.md) and the files under `test/` for examples.
Expand Down Expand Up @@ -217,6 +221,24 @@ typedef struct {
char **str; /**<The strings associated with a given entry.*/
} bbOverlappingEntries_t;

/*!
* @brief A structure to hold iterations
* One of intervals and entries should be used to access records from bigWig or bigBed files, respectively.
*/
typedef struct {
bigWigFile_t *bw; /**<Pointer to the bigWig/bigBed file.*/
uint32_t tid; /**<The contig/chromosome ID.*/
uint32_t start; /**<Start position of the query interval.*/
uint32_t end; /**<End position of the query interval.*/
uint64_t offset; /**<Offset into the blocks.*/
uint32_t blocksPerIteration; /**<Number of blocks to use per iteration.*/
int withString; /**<For bigBed entries, whether to return the string with the entries.*/
void *blocks; /**<Overlapping blocks.*/
bwOverlappingIntervals_t *intervals; /**<Overlapping intervals (or NULL).*/
bbOverlappingEntries_t *entries; /**<Overlapping entries (or NULL).*/
void *data; /**<Points to either intervals or entries. If there are no further intervals/entries, then this is NULL. Use this to test for whether to continue iterating.*/
} bwOverlapIterator_t;

/*!
* @brief Initializes curl and global variables. This *MUST* be called before other functions (at least if you want to connect to remote files).
* For remote file, curl must be initialized and regions of a file read into an internal buffer. If the buffer is too small then an excessive number of connections will be made. If the buffer is too large than more data than required is fetched. 128KiB is likely sufficient for most needs.
Expand Down Expand Up @@ -343,6 +365,55 @@ bwOverlappingIntervals_t *bwGetOverlappingIntervals(bigWigFile_t *fp, char *chro
*/
bbOverlappingEntries_t *bbGetOverlappingEntries(bigWigFile_t *fp, char *chrom, uint32_t start, uint32_t end, int withString);

/*!
* @brief Creates an iterator over intervals in a bigWig file
* Iterators can be traversed with `bwIteratorNext()` and destroyed with `bwIteratorDestroy()`.
* Intervals are in the `intervals` member and `data` can be used to determine when to end iteration.
* @param fp A valid bigWigFile_t pointer. This MUST be for a bigWig file!
* @param chrom A valid chromosome name.
* @param start The start position of the interval. This is 0-based half open, so 0 is the first base.
* @param end The end position of the interval. Again, this is 0-based half open, so 100 will include the 100th base...which is at position 99.
* @param blocksPerIteration The number of blocks (internal groupings of intervals in bigWig files) to return per iteration.
* @return NULL on error, otherwise a bwOverlapIterator_t pointer
* @see bwOverlapIterator_t
* @see bwIteratorNext
* @see bwIteratorDestroy
*/
bwOverlapIterator_t *bwOverlappingIntervalsIterator(bigWigFile_t *fp, char *chrom, uint32_t start, uint32_t end, uint32_t blocksPerIteration);

/*!
* @brief Creates an iterator over entries in a bigBed file
* Iterators can be traversed with `bwIteratorNext()` and destroyed with `bwIteratorDestroy()`.
* Entries are in the `entries` member and `data` can be used to determine when to end iteration.
* @param fp A valid bigWigFile_t pointer. This MUST be for a bigBed file!
* @param chrom A valid chromosome name.
* @param start The start position of the interval. This is 0-based half open, so 0 is the first base.
* @param end The end position of the interval. Again, this is 0-based half open, so 100 will include the 100th base...which is at position 99.
* @param withString Whether the returned entries should include their associated strings.
* @param blocksPerIteration The number of blocks (internal groupings of entries in bigBed files) to return per iteration.
* @return NULL on error, otherwise a bwOverlapIterator_t pointer
* @see bbGetOverlappingEntries
* @see bwOverlapIterator_t
* @see bwIteratorNext
* @see bwIteratorDestroy
*/
bwOverlapIterator_t *bbOverlappingEntriesIterator(bigWigFile_t *fp, char *chrom, uint32_t start, uint32_t end, int withString, uint32_t blocksPerIteration);

/*!
* @brief Traverses to the entries/intervals in the next group of blocks.
* @param iter A bwOverlapIterator_t pointer that is updated (or destroyed on error)
* @return NULL on error, otherwise a bwOverlapIterator_t pointer with the intervals or entries from the next set of blocks.
* @see bwOverlapIterator_t
* @see bwIteratorDestroy
*/
bwOverlapIterator_t *bwIteratorNext(bwOverlapIterator_t *iter);

/*!
* @brief Destroys a bwOverlapIterator_t
* @param iter The bwOverlapIterator_t that should be destroyed
*/
void bwIteratorDestroy(bwOverlapIterator_t *iter);

/*!
* @brief Return all per-base bigWig values in a given interval.
* Given an interval (e.g., chr1:0-100), return the value at each position in a bigWig file. Positions without associated values are suppressed by default, but may be returned if `includeNA` is not 0.
Expand Down
123 changes: 123 additions & 0 deletions bwValues.c
Original file line number Diff line number Diff line change
Expand Up @@ -578,6 +578,129 @@ bbOverlappingEntries_t *bbGetOverlappingEntries(bigWigFile_t *fp, char *chrom, u
return output;
}

//Returns NULL on error
bwOverlapIterator_t *bwOverlappingIntervalsIterator(bigWigFile_t *bw, char *chrom, uint32_t start, uint32_t end, uint32_t blocksPerIteration) {
bwOverlapIterator_t *output = NULL;
uint64_t n;
uint32_t tid = bwGetTid(bw, chrom);
if(tid == (uint32_t) -1) return output;
output = calloc(1, sizeof(bwOverlapIterator_t));
if(!output) return output;
bwOverlapBlock_t *blocks = bwGetOverlappingBlocks(bw, chrom, start, end);

output->bw = bw;
output->tid = tid;
output->start = start;
output->end = end;
output->blocks = blocks;
output->blocksPerIteration = blocksPerIteration;

if(blocks) {
n = blocks->n;
if(n>blocksPerIteration) blocks->n = blocksPerIteration;
output->intervals = bwGetOverlappingIntervalsCore(bw, blocks,tid, start, end);
blocks->n = n;
output->offset = blocksPerIteration;
}
output->data = output->intervals;
return output;
}

//Returns NULL on error
bwOverlapIterator_t *bbOverlappingEntriesIterator(bigWigFile_t *bw, char *chrom, uint32_t start, uint32_t end, int withString, uint32_t blocksPerIteration) {
bwOverlapIterator_t *output = NULL;
uint64_t n;
uint32_t tid = bwGetTid(bw, chrom);
if(tid == (uint32_t) -1) return output;
output = calloc(1, sizeof(bwOverlapIterator_t));
if(!output) return output;
bwOverlapBlock_t *blocks = bwGetOverlappingBlocks(bw, chrom, start, end);

output->bw = bw;
output->tid = tid;
output->start = start;
output->end = end;
output->blocks = blocks;
output->blocksPerIteration = blocksPerIteration;
output->withString = withString;

if(blocks) {
n = blocks->n;
if(n>blocksPerIteration) blocks->n = blocksPerIteration;
output->entries = bbGetOverlappingEntriesCore(bw, blocks,tid, start, end, withString);
blocks->n = n;
output->offset = blocksPerIteration;
}
output->data = output->entries;
return output;
}

void bwIteratorDestroy(bwOverlapIterator_t *iter) {
if(!iter) return;
if(iter->blocks) destroyBWOverlapBlock((bwOverlapBlock_t*) iter->blocks);
if(iter->intervals) bwDestroyOverlappingIntervals(iter->intervals);
if(iter->entries) bbDestroyOverlappingEntries(iter->entries);
free(iter);
}

//On error, points to NULL and destroys the input
bwOverlapIterator_t *bwIteratorNext(bwOverlapIterator_t *iter) {
uint64_t n, *offset, *size;
bwOverlapBlock_t *blocks = iter->blocks;

if(iter->intervals) {
bwDestroyOverlappingIntervals(iter->intervals);
iter->intervals = NULL;
}
if(iter->entries) {
bbDestroyOverlappingEntries(iter->entries);
iter->entries = NULL;
}
iter->data = NULL;

if(iter->offset < blocks->n) {
//store the previous values
n = blocks->n;
offset = blocks->offset;
size = blocks->size;

//Move the start of the blocks
blocks->offset += iter->offset;
blocks->size += iter->offset;
if(iter->offset + iter->blocksPerIteration > n) {
blocks->n = blocks->n - iter->offset;
} else {
blocks->n = iter->blocksPerIteration;
}

//Get the intervals or entries, as appropriate
if(iter->bw->type == 0) {
//bigWig
iter->intervals = bwGetOverlappingIntervalsCore(iter->bw, blocks, iter->tid, iter->start, iter->end);
iter->data = iter->intervals;
} else {
//bigBed
iter->entries = bbGetOverlappingEntriesCore(iter->bw, blocks, iter->tid, iter->start, iter->end, iter->withString);
iter->data = iter->entries;
}
iter->offset += iter->blocksPerIteration;

//reset the values in iter->blocks
blocks->n = n;
blocks->offset = offset;
blocks->size = size;

//Check for error
if(!iter->intervals && !iter->entries) goto error;
}

return iter;

error:
bwIteratorDestroy(iter);
return NULL;
}

//This is like bwGetOverlappingIntervals, except it returns 1 base windows. If includeNA is not 0, then a value will be returned for every position in the range (defaulting to NAN).
//The ->end member is NULL
//If includeNA is not 0 then ->start is also NULL, since it's implied
Expand Down
Loading

0 comments on commit d6a2884

Please sign in to comment.