ensembl conversion #3

Open
jayypaul opened this issue Jan 22, 2024 · 10 comments

@jayypaul

Hi there! I'm excited to use this tool. I noticed that in the fit function for imputation fine-tuning, ensembl_auto_conversion is hard-coded to False. Is there a reason automatic conversion is not allowed during fitting, or was it simply an oversight not to pass the argument through as ensembl_auto_conversion = ensembl_auto_conversion, so that the user can specify it? The current call is:

adata = self.common_preprocess(adata, 0, covariate_fields, ensembl_auto_conversion=False)

I'm just wondering whether there is some reason it's better to convert the single-cell and spatial gene names ahead of time. I'm updating my cloned repository for now.

Thanks!

@jayypaul
Author

Hello, I tried changing the code in the cloned repository, but for some reason the changes aren't recognized when I import the functions from the repo. I assume that's an issue on my end, but to work around it I simply used:

from CellPLM.pipeline.experimental import symbol_to_ensembl
I noticed that common_preprocess calls .var_names_make_unique(), which appends numbers to duplicated gene names.
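Because duplicated names get renamed at that step, it can help to look for duplicates in the converted gene names before handing the data to the pipeline. The following is a minimal sketch, not part of CellPLM: `find_duplicates` is a hypothetical helper and the gene names are toy values, including a "0" placeholder like the one that shows up in the error further down.

```python
from collections import Counter


def find_duplicates(gene_names):
    """Return {name: count} for every gene name that appears more than once."""
    counts = Counter(gene_names)
    return {name: n for name, n in counts.items() if n > 1}


# Toy gene names (made up for illustration), including a "0" placeholder
# of the kind a failed symbol-to-Ensembl conversion might leave behind.
genes = ["ENSG00000141510", "0", "ENSG00000012048", "0", "ENSG00000141510"]

duplicates = find_duplicates(genes)
print(duplicates)
```

On real data, the list would come from something like `adata.var_names.tolist()`.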

For whatever reason, after I did this, the following stopped working because some values were not strings (I'm unsure whether this refers to .var):

train_data = sp_adata.concatenate(
    sc_adata, join='outer', batch_key=None, index_unique=None)

So I replaced it with the updated version:

train_data = ad.concat([sp_adata, sc_adata], join='outer')
train_data.obs

When I call fit, I get the following error:
pipeline.fit(
    train_data,  # An AnnData object
    pipeline_config,  # The config dictionary we created previously, optional
    split_field = 'split',  # Specify a column in .obs that contains split information
    train_split = 'train',
    valid_split = 'valid',
    batch_gene_list = batch_gene_list,  # Specify genes that are measured in each batch
    device = DEVICE,
    # ensembl_auto_conversion = True
)

After filtering, 816 genes remain.
Traceback (most recent call last):
  File "", line 1, in <module>
  File "/home/jaaayy/miniconda3/envs/cellplm/lib/python3.9/site-packages/CellPLM/pipeline/imputation.py", line 108, in fit
    dataset = TranscriptomicDataset(adata, split_field, covariate_fields, label_fields, batch_gene_list)
  File "/home/jaaayy/miniconda3/envs/cellplm/lib/python3.9/site-packages/CellPLM/utils/data.py", line 45, in __init__
    idx = torch.LongTensor([g2id[g] for g in batch_gene_list[batch]])
  File "/home/jaaayy/miniconda3/envs/cellplm/lib/python3.9/site-packages/CellPLM/utils/data.py", line 45, in <listcomp>
    idx = torch.LongTensor([g2id[g] for g in batch_gene_list[batch]])
KeyError: '0'
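The paths in this traceback point into site-packages, which suggests Python may be importing an installed copy of CellPLM rather than the edited clone. That would be one possible explanation for local changes not being picked up. A quick way to check which file a module is loaded from is sketched below; `module_origin` is a hypothetical helper, and the standard-library module `json` is only a stand-in for `CellPLM`.

```python
import importlib


def module_origin(name):
    """Return the file path a module is loaded from (None if it has no file)."""
    module = importlib.import_module(name)
    return getattr(module, "__file__", None)


# "json" is a stand-in so the sketch runs anywhere; use "CellPLM" in practice.
origin = module_origin("json")
print(origin)
```

If the printed path is inside site-packages rather than the clone, edits to the clone will not take effect until the package is reinstalled from it, for example as an editable install with `pip install -e .` run inside the clone (assuming the repository is pip-installable).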

I assume it refers to the "0" in the converted gene name list. I'm unsure why this happens, because the same logic works when I run it outside of the function:

import torch as T
batch_gene_mask = {}
gene_list = train_data.var.index.tolist()
g2id = dict(zip(gene_list, list(range(len(gene_list)))))
for batch in batch_gene_list:
    print(batch)
    idx = T.LongTensor([g2id[g] for g in batch_gene_list[batch]])
    batch_gene_mask[batch] = T.zeros(len(g2id)).bool()
    batch_gene_mask[batch][idx] = True

Any suggestions on how to run the fine-tuning process correctly? Should I just remove all the Ensembl duplicates that were converted into "0"?

@jayypaul
Author

Quick update: even when I remove all instances of "0" from my single-cell and spatial adata, I still get a KeyError at:
idx = torch.LongTensor([g2id[g] for g in batch_gene_list[batch]])

Any suggestions moving forward would be greatly appreciated!
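The failing line looks up every gene in batch_gene_list[batch] in g2id, so a KeyError means at least one listed gene is missing from the gene list the dataset was built from. Since fit reports "After filtering, 816 genes remain", that gene list may be smaller inside fit than in train_data outside it, which could explain why the same loop succeeds when run by hand. The sketch below, using toy data and a hypothetical helper `missing_genes`, reports all missing genes at once instead of stopping at the first one.

```python
def missing_genes(gene_list, batch_gene_list):
    """Map each batch to the genes it lists that are absent from gene_list."""
    known = set(gene_list)
    report = {}
    for batch, genes in batch_gene_list.items():
        absent = [g for g in genes if g not in known]
        if absent:
            report[batch] = absent
    return report


# Toy data: the gene names and batch labels are made up for illustration.
gene_list = ["ENSG_A", "ENSG_B", "ENSG_C"]
batch_gene_list = {
    "sc": ["ENSG_A", "ENSG_B", "ENSG_C"],
    "sp": ["ENSG_A", "0", "ENSG_D"],
}

report = missing_genes(gene_list, batch_gene_list)
print(report)
```

An empty report means every gene in batch_gene_list can be resolved against that gene list.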

@jayypaul
Author

jayypaul commented Feb 1, 2024

Hello, apologies for the string of replies, but I'm still unable to troubleshoot the imputation pipeline. There is a KeyError at:
idx = torch.LongTensor([g2id[g] for g in batch_gene_list[batch]])

I figured out that even if I remove the duplicates from my spatial and sc AnnData objects and use Ensembl-converted genes as their var_names, I get a KeyError at gene "ENSG00000282633", which is the gene name at index 406.

I noticed that in:

query_dataset = 'HumanLiverCancerPatient2_filtered_ensg.h5ad'
ref_dataset = 'GSE151530_Liver_ensg.h5ad'
query_data = ad.read_h5ad(f'CellPLM/data/{query_dataset}')
query_data.var
ref_data = ad.read_h5ad(f'CellPLM/data/{ref_dataset}')

there are 407 genes. I'm wondering whether this is occurring because I'm using the pretrained version "20231027_85M"; if so, should I use "20230926_85M" instead?

Or should I be subsetting my object to the 407 gene names in the sc and sp data used in your tutorials? That doesn't seem right to me, but I think there is a logical fix that I'm not aware of. Any advice would be helpful!

Jordan

@wehos
Contributor

wehos commented Feb 14, 2024

Sorry for the late reply, I was busy with several paper submissions in the past month.

I appreciate the detailed information you provided; however, it's currently hard to locate the issue without a reproducible code sample. Here are a few comments on your questions:

  1. ensembl_auto_conversion is set to False because it is currently an experimental function and we encourage users to process their data on their own.

  2. You wrote: "Even when I remove all instances of 0 from my single cell adata, and spatial adata, I still get a KeyError with:"
    idx = torch.LongTensor([g2id[g] for g in batch_gene_list[batch]])

The genes that are not present in the pretrained gene list are supposed to be removed, which is done here:

adata = adata[:, [x for x in adata.var.index.tolist() if x in self.model.gene_set]]

This step is crucial, and I suspect it is somehow being skipped in your current program. If your target genes are not covered by our current pretrained gene list, then there is no solution within the current fine-tuning framework. I apologize for the inconvenience; we will add an add_gene solution in the near future.

I would appreciate it if you could provide a minimal reproducible example so that we can try it on our side.
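As an illustration of the filtering step described above, the sketch below restricts a gene list to a pretrained gene set and applies the same restriction to batch_gene_list, so that the per-batch lists only contain genes that survive the filter. This is a user-side sketch under stated assumptions, not the pipeline's own code: `restrict_to_gene_set` is a hypothetical helper, `gene_set` stands in for `self.model.gene_set`, and all gene names are toy values.

```python
def restrict_to_gene_set(gene_list, batch_gene_list, gene_set):
    """Keep only genes found in gene_set, in gene_list and in every batch list."""
    kept = [g for g in gene_list if g in gene_set]
    kept_lookup = set(kept)
    filtered_batches = {
        batch: [g for g in genes if g in kept_lookup]
        for batch, genes in batch_gene_list.items()
    }
    return kept, filtered_batches


# Toy stand-in for the pretrained gene set (self.model.gene_set in the thread).
gene_set = {"ENSG_A", "ENSG_B"}
gene_list = ["ENSG_A", "ENSG_B", "ENSG_C"]
batch_gene_list = {"sc": ["ENSG_A", "ENSG_B", "ENSG_C"], "sp": ["ENSG_A", "ENSG_C"]}

kept, filtered_batches = restrict_to_gene_set(gene_list, batch_gene_list, gene_set)
print(kept)
print(filtered_batches)
```

Genes dropped this way are not imputed; as noted above, target genes outside the pretrained gene list are not supported by the current fine-tuning framework.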

@jayypaul
Author

Hey @wehos, thanks for your input! That makes sense. By a minimal reproducible sample, do you mean just a subset of the data to try on your end, or also some of the code I'm using?

@wehos
Contributor

wehos commented Feb 19, 2024

Yes, anything convenient for you, as long as it helps reproduce the issue.

@jayypaul
Author

Hey, thanks for your patience! Here is an example script plus some example code.
cellplm_eg.zip

Let me know if for some reason it doesn't load properly. It should contain an example script and two h5ad files for the sc and sp adata structures, with the required metadata columns.

Best,

Jordan

@jayypaul
Author

Hello, just wanted to follow up to see whether there was any luck replicating the error?

Thanks for your help!

Jordan

@wehos
Contributor

wehos commented Mar 28, 2024

Hi Jordan,

Sorry for the delay. The example you provided is very helpful! I have looked into it and figured out the issue. Some updates needed to be made to the data module (see commit here). I have pushed the updates to GitHub, and you can pull the latest version.

Here are a few reminders about your current code:
(1) If you want to run the evaluation, simply use sp_adata as the query data when running the score function, and you will see the evaluation scores. After the evaluation, if you'd like to move on to real-world applications, don't forget to comment these lines out:

sp_adata.obsm['truth'] = sp_adata[:, target_genes].X.toarray()
sp_adata[:, target_genes].X = 0

and include as many genes as you can in query_genes to facilitate the training process. In this scenario, we cannot test the model's performance on unmeasured genes because the ground truth is not available.

(2) I found that some genes have an all-zero readout across all cells in your sample sp data. This results in a NaN overall correlation score in our evaluation. It should not affect the training process, though.
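To see why an all-zero gene leads to NaN, note that the Pearson correlation divides by the standard deviation of each vector, which is zero for a constant vector. The sketch below is a generic illustration with toy numbers, not CellPLM's scoring code; `pearson` and `all_zero_genes` are hypothetical helpers, and the assumption that per-gene scores are then averaged (so one NaN propagates to the overall score) is not confirmed by the thread.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length lists; NaN if either is constant."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return math.nan  # correlation is undefined for a constant vector
    return cov / math.sqrt(var_x * var_y)


def all_zero_genes(matrix, gene_names):
    """List genes whose column is zero in every row of a cells x genes matrix."""
    return [g for j, g in enumerate(gene_names)
            if all(row[j] == 0 for row in matrix)]


# Toy cells x genes matrices: gene_b is never detected in the ground truth.
genes = ["gene_a", "gene_b"]
truth = [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]
pred = [[1.1, 0.2], [1.9, 0.1], [3.2, 0.3]]

zero_genes = all_zero_genes(truth, genes)
corr_b = pearson([row[1] for row in truth], [row[1] for row in pred])
print(zero_genes, corr_b)
```

Dropping such genes from the evaluation targets beforehand is one way to keep the overall score defined.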

@jayypaul
Author

Hey! No problem, thanks for making time for this! I'll check out the update and keep your reminders in mind. Thanks for your input, excited to see how it performs 👍
