ensembl conversion #3
Hello, I tried changing the code in the cloned repository, but for some reason when I import the functions from the repo it doesn't recognize the changes I made. I'm assuming it's an issue on my end, but to work around it I simply used: from CellPLM.pipeline.experimental import symbol_to_ensembl For whatever reason, after I did this, the following stopped working because some values were not strings (I'm unsure whether this refers to .var): train_data = sp_adata.concatenate( So I replaced it with the updated version: train_data = ad.concat([sp_adata, sc_adata], join='outer') When I go to fit, I get the following error: After filtering, 816 genes remain. I'm assuming it's referring to the "0" in the converted gene name list. I'm unsure why this is happening, since it works when I run this outside of the function: import torch as T Any suggestions on how to run the fine-tuning process appropriately? Should I just remove all Ensembl duplicates that were converted into "0"? |
Quick update: even when I remove all instances of "0" from my single-cell adata and spatial adata, I still get a key value error with: Any suggestions moving forward would be greatly appreciated! |
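For readers following along, the cleanup described in the two comments above (dropping the "0" placeholders and repeated IDs) could look roughly like this. It is a minimal sketch, assuming unmapped symbols come back as "0" as reported in this thread; `clean_converted_ids` is a hypothetical helper and not part of CellPLM.

```python
def clean_converted_ids(ids, unmapped="0"):
    """Return (kept_positions, kept_ids), dropping unmapped and repeated IDs.

    `unmapped` is the placeholder assumed for symbols with no Ensembl match.
    Hypothetical helper for illustration only.
    """
    seen = set()
    kept_positions, kept_ids = [], []
    for pos, gene_id in enumerate(ids):
        gene_id = str(gene_id)  # guard against non-string entries
        if gene_id == unmapped or gene_id in seen:
            continue  # skip placeholders and later duplicates
        seen.add(gene_id)
        kept_positions.append(pos)
        kept_ids.append(gene_id)
    return kept_positions, kept_ids


# Example IDs chosen for illustration; the last entry is a non-string 0.
converted = ["ENSG00000141510", "0", "ENSG00000012048", "ENSG00000141510", 0]
positions, ids = clean_converted_ids(converted)
print(positions, ids)
```

The returned positions can then be used to subset the gene axis of an AnnData object (for example `adata[:, positions].copy()`) before assigning the cleaned IDs as `var_names`. As the later replies show, this cleanup alone does not guarantee that every remaining gene is in the pretrained gene list.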
Hello, apologies for the string of replies, but I'm currently unable to troubleshoot the imputation pipeline. There is a key value error with: I figured out that even if I remove the duplicates from my spatial and sc AnnData objects and use the Ensembl-converted genes as their var_names, I get a key value error at gene "ENSG00000282633", which is the gene name at index 406. I noticed that in: query_dataset = 'HumanLiverCancerPatient2_filtered_ensg.h5ad' there are 407 genes. I'm wondering whether this is occurring because I'm using the pretrained version "20231027_85M". If so, should I use "20230926_85M" instead? Or should I be subsetting my object to the 407 gene names that are in the sc and sp data you used in your tutorials? That doesn't seem right to me, but I think there is a logical fix that I'm not aware of. Any advice would be helpful! Jordan |
Sorry for the late reply; I was busy with several paper submissions over the past month. I appreciate the detailed information you provided, but it is currently hard to locate the issue without a reproducible code sample. I would like to clarify a few points about your question:
The genes that are not present in the pretrained gene list are supposed to be removed, which is done here: CellPLM/CellPLM/pipeline/__init__.py Line 71 in bd34dd0
This step is crucial, and I suspect it is somehow being skipped in your current program. If your target genes are not covered by our current pretrained gene list, then there is no solution within the current fine-tuning framework. I apologize for the inconvenience and we will add an I would appreciate it if you could provide a minimal reproducible sample so that we can try it on our side. |
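To illustrate why skipping that filtering step can surface as a key error, here is a minimal sketch. It is not the code at the referenced line of CellPLM/pipeline/__init__.py; `split_by_pretrained` is a hypothetical helper, and the pretrained list below is made up (whether "ENSG00000282633" is actually absent from the real list is not confirmed in this thread).

```python
def split_by_pretrained(var_names, pretrained_genes):
    """Split gene names into those covered by the pretrained list and the rest."""
    pretrained = set(pretrained_genes)
    kept = [g for g in var_names if g in pretrained]
    dropped = [g for g in var_names if g not in pretrained]
    return kept, dropped


# Made-up pretrained gene list, for illustration only.
pretrained_genes = ["ENSG00000141510", "ENSG00000012048"]
gene_to_index = {g: i for i, g in enumerate(pretrained_genes)}

query_genes = ["ENSG00000141510", "ENSG00000282633"]

# Without filtering, looking up an uncovered gene raises KeyError.
try:
    indices = [gene_to_index[g] for g in query_genes]
except KeyError as err:
    missing = err.args[0]
    print("lookup failed for", missing)

# With filtering, only covered genes reach the lookup.
kept, dropped = split_by_pretrained(query_genes, pretrained_genes)
indices = [gene_to_index[g] for g in kept]
print(kept, dropped, indices)
```

Checking the `dropped` list before fitting is a quick way to see which genes would be lost and whether the filter ran at all.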
Hey @wehos , thanks for your input! That makes sense. In terms of a minimum reproducible sample, do you mean just a subset of the data to try on your end, or also some of the code I'm using? |
Yes. Anything convenient for you as long as it can help reproduce the issue |
Hey thanks for your patience! Here is an example script + some example code. Let me know if for some reason it doesn't load properly. should be an example script and two h5ad files for sc and sp adata structures that have required metadata cols. Best, Jordan |
Hello, just wanted to follow up to see whether there was any luck replicating the error? Thanks for your help! Jordan |
Hi Jordan, Sorry for the delay. The example you provided was very helpful! I have looked into it and figured out the issue. Some updates needed to be made to the data module (see commit here). I have pushed the updates to GitHub, so you can pull the latest version. Here are a few friendly reminders about your current code:
and set (2) I found that some genes have zero readout across all cells in your sample sp data. This would result in a NaN overall correlation score in our evaluation. It should not affect the training process, though. |
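The NaN mentioned in point (2) follows from how Pearson correlation is defined: a gene with zero variance has an undefined correlation, and one undefined value is enough to turn a plain average into NaN. The sketch below shows this in pure Python. It assumes the overall score is a mean of per-gene correlations, which is not shown in this thread, and the function names are hypothetical rather than CellPLM's evaluation code.

```python
import math


def pearson(x, y):
    """Pearson correlation; NaN when either input has zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")  # correlation is undefined for a constant vector
    return cov / math.sqrt(vx * vy)


def mean_gene_correlation(truth, pred, skip_undefined=True):
    """Average per-gene correlation; optionally ignore undefined genes."""
    scores = [pearson(truth[g], pred[g]) for g in truth]
    if skip_undefined:
        scores = [s for s in scores if not math.isnan(s)]
    return sum(scores) / len(scores) if scores else float("nan")


# Toy data: geneB has zero readout in every cell.
truth = {"geneA": [1, 2, 3, 4], "geneB": [0, 0, 0, 0]}
pred = {"geneA": [2, 4, 6, 8], "geneB": [0.1, 0.0, 0.2, 0.1]}

naive = mean_gene_correlation(truth, pred, skip_undefined=False)
masked = mean_gene_correlation(truth, pred)
print(naive, masked)
```

Removing all-zero genes from the spatial data before evaluation, or masking them as above, keeps the score finite.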
Hey! No problem, thanks for making time for this! I'll check out the update and keep your reminders in mind. Thanks for your input, excited to see how it performs 👍 |
Hi there! I'm excited to use this tool. I noticed that in the fit function for imputation fine-tuning, ensembl_auto_conversion is set to False. Is there a reason for not allowing automatic conversion during fitting, or was it simply an oversight not to pass it through as:
ensembl_auto_conversion = ensembl_auto_conversion, so the user can specify?
adata = self.common_preprocess(adata, 0, covariate_fields, ensembl_auto_conversion=False)
I'm also wondering whether it's better, for some reason, to convert the single-cell and spatial gene names ahead of time. I'm updating my cloned repository for now.
Thanks!
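The change suggested in this comment amounts to forwarding the caller's flag instead of hard-coding it. A minimal sketch of that pattern follows. The class, its simplified signatures, and the `symbol_map` argument are hypothetical stand-ins for illustration; they do not reproduce the CellPLM pipeline or its `common_preprocess` signature.

```python
class ImputationPipelineSketch:
    """Hypothetical stand-in used only to show the flag being forwarded."""

    def common_preprocess(self, gene_names, ensembl_auto_conversion=False,
                          symbol_map=None):
        # Toy conversion: map symbols to IDs only when the flag is set.
        if ensembl_auto_conversion and symbol_map is not None:
            return [symbol_map.get(g, g) for g in gene_names]
        return list(gene_names)

    def fit(self, gene_names, ensembl_auto_conversion=False, symbol_map=None):
        # Forward the caller's choice rather than passing a literal False.
        return self.common_preprocess(
            gene_names,
            ensembl_auto_conversion=ensembl_auto_conversion,
            symbol_map=symbol_map,
        )


# Toy mapping, for illustration only.
symbol_map = {"TP53": "ENSG00000141510"}
pipeline = ImputationPipelineSketch()

unchanged = pipeline.fit(["TP53"], symbol_map=symbol_map)
converted = pipeline.fit(["TP53"], ensembl_auto_conversion=True,
                         symbol_map=symbol_map)
print(unchanged, converted)
```

Keeping the default at False preserves the existing behavior for callers who already convert gene names ahead of time.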