long vectors not supported yet #72

Open
daisy238 opened this issue Feb 8, 2024 · 3 comments

Comments

@daisy238

daisy238 commented Feb 8, 2024

Hi Caitlin,

Thanks for developing TreeWAS!

I'm trying to use unitigs with TreeWAS and have been running into the error below:

Error in unlist(snps[!is.na(snps)]) : 
  long vectors not supported yet: ../../src/include/Rinlinedfuns.h:537
Calls: treeWAS -> unique -> as.vector -> unlist
Execution halted

I've already run TreeWAS successfully on a smaller gene presence/absence dataset, so this looks to me like a memory-based issue. I've therefore added the mem.lim parameter, but I still receive the same error. I'm running the job on the cluster with 925 GB of memory. The unitigs file is around 27 GB, with 2806 genomes and 5682556 unitigs/columns.

My TreeWAS command is below:

unitigs <- treeWAS(snps = unitig_matrix,
                   phen = phenotypes,
                   tree = data_tree,
                   mem.lim = 900,
                   seed = 1)

I've also tried using mem.lim = TRUE, but this gives the same error.

If I reduce the number of columns in the unitig matrix down to 1000, TreeWAS then works.
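
For reference, the reduced test run looked roughly like this (the column subset is just illustrative):

test_run <- treeWAS(snps = unitig_matrix[, 1:1000],
                    phen = phenotypes,
                    tree = data_tree,
                    seed = 1)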

Do you have any advice, please, for dealing with a large unitig matrix?

@caitiecollins
Owner

Hi Daisy,

I just pushed a change that should resolve the issue you're facing. If you re-download and install the treeWAS package from GitHub (with dependencies=TRUE), it should work without hitting that error.
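
For reference, reinstalling via devtools would look something like this (assuming the package lives at caitiecollins/treeWAS):

install.packages("devtools")   # if not already installed
devtools::install_github("caitiecollins/treeWAS", dependencies = TRUE)
library(treeWAS)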

Unfortunately, the line causing the error runs before the code that is adjusted by the memory-limit setting (which subdivides your snps data into chunks to make it more manageable), which is why it wasn't affected by the mem.lim parameter.

That's a mighty large dataset you're working with (gotta love unitigs), so you may run into other issues. If you do, please let me know and I'll try to get back to you more quickly with a fix. I'm keen to make the package more scalable.

Best,
Caitlin.

@daisy238
Author

daisy238 commented Feb 29, 2024

Hi Caitlin,

Thanks for looking into this and making the change. It has resolved the previous long-vector error, but we are now encountering another issue:

Error in `dplyr::n_distinct()`:
! Can't recycle `..1` (size 1605613362) to size 1605613362.
Backtrace:
    ▆
 1. ├─treeWAS::treeWAS(...)
 2. │ └─dplyr::n_distinct(snps[!is.na(snps)])
 3. │   └─vctrs::vec_recycle_common(!!!args, .size = size)
 4. └─vctrs:::stop_recycle_incompatible_size(...)
 5.   └─vctrs:::stop_vctrs(...)
 6.     └─rlang::abort(message, class = c(class, "vctrs_error"), ..., call = vctrs_error_call(call))
Execution halted

On another note, I tried removing the following lines from the previous treeWAS.R code to get around the long-vectors issue. With 900GB of memory this split the unitig matrix into 83 chunks; however, each chunk took around 24 hours to process. Is this to be expected?

The portion of treeWAS.R code I removed:

 ## CHECK IF BINARY:
 if(length(unique(as.vector(unlist(snps[!is.na(snps)])))) != 2){
    stop("snps must be a binary matrix")
 }
@caitiecollins
Owner

caitiecollins commented Mar 14, 2024

Hi Daisy,

That sounds like far longer per chunk than I would expect. It sounds like you may still be bumping up against memory constraints, which could be slowing things down.

I would suggest running a larger number of smaller chunks. Typically this doesn't take longer than running fewer, larger chunks, and it may help if you're still approaching an unseen memory limit within each chunk.

Try setting chunk.size=10000.
I just ran a toy example with 2806 rows and 10000 columns on my laptop and it finished in 8.5 minutes (and only ~8.9 minutes whether I subdivided that into 2, 5, or 10 chunks). So with your 5682556 columns, at that rate chunk.size=10000 could finish in under 4 days.
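
Applied to the call from your first post, that would look something like this (keeping your other arguments as they were):

unitigs <- treeWAS(snps = unitig_matrix,
                   phen = phenotypes,
                   tree = data_tree,
                   chunk.size = 10000,
                   mem.lim = 900,
                   seed = 1)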
