ensembl conversion #3

jayypaul · 2024-01-22T19:43:27Z

Hi there! I'm excited to use this tool. I noticed that in the fit function for imputation fine-tuning, ensembl_auto_conversion is hard-coded to False. Is there a reason automatic conversion isn't allowed during fitting, or was it simply an oversight not to pass it through as
ensembl_auto_conversion = ensembl_auto_conversion, so the user can specify it?

adata = self.common_preprocess(adata, 0, covariate_fields, ensembl_auto_conversion=False)

Just wondering whether, for some reason, it's better to convert the single-cell and spatial gene names ahead of time. I'm updating my cloned repository for now.

Thanks!

jayypaul · 2024-01-22T21:50:04Z

Hello, I tried changing the code in the cloned repository, but for some reason when I import the functions from the repo it doesn't recognize the changes I made. I'm assuming it's an issue on my end, but as a workaround I simply used:

from CellPLM.pipeline.experimental import symbol_to_ensembl

I also noticed that in "common_preprocess" you call .var_names_make_unique(), which appends numeric suffixes to duplicated gene names.

For some reason, after I did this, the following stopped working because some values were not strings (I'm unsure whether this refers to .var):

train_data = sp_adata.concatenate(
    sc_adata, join='outer', batch_key=None, index_unique=None)

So I replaced it with the newer function:

train_data = ad.concat([sp_adata, sc_adata], join='outer')
train_data.obs

When I call fit, I get the following error:
pipeline.fit(
    train_data,  # An AnnData object
    pipeline_config,  # The config dictionary we created previously, optional
    split_field = 'split',  # Specify a column in .obs that contains split information
    train_split = 'train',
    valid_split = 'valid',
    batch_gene_list = batch_gene_list,  # Specify genes that are measured in each batch
    device = DEVICE,
    # ensembl_auto_conversion = True
)

After filtering, 816 genes remain.
Traceback (most recent call last):
File "", line 1, in
File "/home/jaaayy/miniconda3/envs/cellplm/lib/python3.9/site-packages/CellPLM/pipeline/imputation.py", line 108, in fit
dataset = TranscriptomicDataset(adata, split_field, covariate_fields, label_fields, batch_gene_list)
File "/home/jaaayy/miniconda3/envs/cellplm/lib/python3.9/site-packages/CellPLM/utils/data.py", line 45, in __init__
idx = torch.LongTensor([g2id[g] for g in batch_gene_list[batch]])
File "/home/jaaayy/miniconda3/envs/cellplm/lib/python3.9/site-packages/CellPLM/utils/data.py", line 45, in <listcomp>
idx = torch.LongTensor([g2id[g] for g in batch_gene_list[batch]])
KeyError: '0'

I'm assuming it's referring to the "0" in the converted gene name list. I'm unsure why this is happening, since the same logic works when I run it outside of the function:

import torch as T
batch_gene_mask = {}
gene_list = train_data.var.index.tolist()
g2id = dict(zip(gene_list, list(range(len(gene_list)))))
for batch in batch_gene_list:
    print(batch)
    idx = T.LongTensor([g2id[g] for g in batch_gene_list[batch]])
    batch_gene_mask[batch] = T.zeros(len(g2id)).bool()
    batch_gene_mask[batch][idx] = True

Any suggestions on how to run the fine-tuning process properly? Should I just remove all Ensembl duplicates that were converted into "0"?
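On the edits to the cloned repository not taking effect: the traceback paths above point into site-packages, which suggests Python may be importing an installed copy of the package rather than the edited clone. A minimal sketch for checking which file a module is loaded from follows. The helper name module_origin is made up for illustration, and json is only a stand-in so the snippet runs anywhere; in this thread the name to check would be "CellPLM".

```python
import importlib.util


def module_origin(name):
    """Return the file path Python would load `name` from, or None if not found."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


# `json` is a stand-in; replace it with "CellPLM" to check the real package.
print(module_origin("json"))

# A top-level module that is not installed yields None.
print(module_origin("surely_not_an_installed_module"))
```

If the printed path sits under site-packages, local edits to the clone will not be picked up. Assuming the repository ships packaging metadata, reinstalling from the clone in editable mode (pip install -e . from the repository root) is the usual way to make them visible.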

jayypaul · 2024-01-22T22:47:09Z

Quick update: even when I remove all instances of "0" from my single-cell and spatial adata, I still get a KeyError at:
idx = torch.LongTensor([g2id[g] for g in batch_gene_list[batch]])

any suggestions moving forward would be greatly appreciated!
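One way to narrow this down is to list, per batch, which genes in batch_gene_list have no entry in the gene index. The sketch below is plain Python and not part of CellPLM; the helper name missing_genes and the toy gene IDs are illustrative. Judging by the "After filtering, 816 genes remain" message and the traceback, the lookup inside fit seems to run after the genes have been filtered, so the gene list to test against is probably the filtered one rather than train_data.var as loaded. That would also explain why the same loop succeeds outside the function.

```python
def missing_genes(batch_gene_list, gene_list):
    """Map each batch to the genes it lists that are absent from gene_list."""
    known = set(gene_list)
    report = {}
    for batch, genes in batch_gene_list.items():
        absent = [g for g in genes if g not in known]
        if absent:
            report[batch] = absent
    return report


# Toy data: ENSG_A/B/C are placeholders, not real Ensembl IDs.
gene_list = ["ENSG_A", "ENSG_B", "ENSG_C"]
batch_gene_list = {
    "sc": ["ENSG_A", "ENSG_B", "ENSG_C"],
    "sp": ["ENSG_A", "0", "ENSG00000282633"],
}
print(missing_genes(batch_gene_list, gene_list))
# {'sp': ['0', 'ENSG00000282633']}
```

Any gene reported here would raise a KeyError in the g2id[g] lookup.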

jayypaul · 2024-02-01T21:29:25Z

Hello, apologies for the string of replies, but I'm currently unable to troubleshoot the imputation pipeline. There is a KeyError at:
idx = torch.LongTensor([g2id[g] for g in batch_gene_list[batch]])

I figured out that even if I remove the duplicates from my spatial and sc AnnData objects and use Ensembl-converted genes as their var_names, I get a KeyError at gene "ENSG00000282633", which is the gene name at index 406.

i noticed that in:

query_dataset = 'HumanLiverCancerPatient2_filtered_ensg.h5ad'
ref_dataset = 'GSE151530_Liver_ensg.h5ad'
query_data = ad.read_h5ad(f'CellPLM/data/{query_dataset}')
query_data.var
ref_data = ad.read_h5ad(f'CellPLM/data/{ref_dataset}')

there are 407 genes. I'm wondering if this is occurring because I'm using the pretrained version "20231027_85M"; if so, should I use "20230926_85M" instead?

Or should I be subsetting my object to the 407 gene names that are in the sc and sp data you used in your tutorials? That doesn't seem right to me, but I think there is a logical fix that I'm not aware of. Any advice would be helpful!

Jordan

wehos · 2024-02-14T01:31:02Z

Sorry for the late reply, I was busy with several paper submissions in the past month.

I appreciate the detailed information you provided; however, it's currently hard to locate the issue without a reproducible code sample. Let me clarify a few points about your questions:

ensembl_auto_conversion is set to False because it is currently an experimental function, and we encourage users to preprocess their data on their own.

Regarding your note that, even after removing all instances of "0" from the single-cell and spatial adata, you still get a KeyError at:
idx = torch.LongTensor([g2id[g] for g in batch_gene_list[batch]])

The genes that are not present in the pretrained gene list are supposed to be removed, which is done here:

CellPLM/CellPLM/pipeline/__init__.py

Line 71 in bd34dd0

    
           adata = adata[:, [x for x in adata.var.index.tolist() if x in self.model.gene_set]]

This step is crucial, and I suspect it is somehow being skipped in your current program. If your target genes are not covered by our current pretrained gene list, then there is no solution in the current fine-tuning framework. I apologize for the inconvenience; we will add an add_gene solution in the near future.
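To illustrate the idea of that filtering step, here is a library-free sketch. It is not the CellPLM implementation and not the eventual fix; the helper restrict_to_gene_set is hypothetical, and gene_set is a plain Python set standing in for self.model.gene_set. The point it shows is that the gene names and every entry of batch_gene_list need to be restricted to the same pretrained gene set, otherwise a later lookup can hit a gene that was dropped.

```python
def restrict_to_gene_set(var_names, batch_gene_list, gene_set):
    """Keep only genes in gene_set, in both var_names and each batch's list."""
    kept = [g for g in var_names if g in gene_set]
    kept_set = set(kept)
    filtered_batches = {
        batch: [g for g in genes if g in kept_set]
        for batch, genes in batch_gene_list.items()
    }
    return kept, filtered_batches


# Toy data: all identifiers below are placeholders.
gene_set = {"ENSG_A", "ENSG_B"}  # stand-in for the pretrained gene list
var_names = ["ENSG_A", "ENSG_B", "ENSG_X"]
batch_gene_list = {"sc": ["ENSG_A", "ENSG_B", "ENSG_X"], "sp": ["ENSG_A", "ENSG_X"]}

kept, filtered = restrict_to_gene_set(var_names, batch_gene_list, gene_set)
print(kept)      # ['ENSG_A', 'ENSG_B']
print(filtered)  # {'sc': ['ENSG_A', 'ENSG_B'], 'sp': ['ENSG_A']}
```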

I would appreciate it if you could provide a minimal reproducible sample so that we can try it on our side.

jayypaul · 2024-02-19T14:23:22Z

Hey @wehos , thanks for your input! That makes sense. In terms of a minimum reproducible sample, do you mean just a subset of the data to try on your end, or also some of the code I'm using?

wehos · 2024-02-19T20:21:50Z

Yes. Anything convenient for you as long as it can help reproduce the issue

jayypaul · 2024-02-28T17:23:17Z

Hey thanks for your patience! Here is an example script + some example code.
cellplm_eg.zip

Let me know if for some reason it doesn't load properly. It should contain an example script and two h5ad files for the sc and sp adata structures, with the required metadata columns.

Best,

Jordan

jayypaul · 2024-03-19T14:28:33Z

Hello, just wanted to follow up to see whether there was any luck replicating the error?

Thanks for your help!

Jordan

wehos · 2024-03-28T20:08:09Z

Hi Jordan,

Sorry for the delay. The example you provided is very helpful! I have looked into it and figured out the issue. Some updates needed to be made to the data module (see commit here). I have pushed the updates to GitHub, so you can pull the latest version.

Here are a few reminders about your current code:
(1) If you want to do the evaluation, simply use sp_adata as the query data and run the score function. You will then see the evaluation scores. After the evaluation, if you'd like to move on to real-world applications, don't forget to comment these lines out:

sp_adata.obsm['truth'] = sp_adata[:, target_genes].X.toarray()
sp_adata[:, target_genes].X = 0

and include as many genes as you can in query_genes to facilitate the training process. In this scenario, we would not be able to test the performance of the model on unmeasured genes because the ground truth is not given.

(2) I found that some genes have zero readout across all cells in your sample sp data. This results in a NaN overall correlation score in our evaluation. It should not affect the training process, though.
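On point (2), a correlation coefficient is undefined for a vector with zero variance, which is the usual reason a constant (here all-zero) gene produces NaN. A rough, library-free sketch for spotting such genes before evaluation is below. The helper all_zero_genes and the toy matrix are illustrative only; the nested list stands in for a dense cells-by-genes array such as the one obtained from sp_adata.X.

```python
def all_zero_genes(matrix, gene_names):
    """Return names of genes (columns) that are zero in every cell (row)."""
    return [
        name
        for j, name in enumerate(gene_names)
        if all(row[j] == 0 for row in matrix)
    ]


# Toy cells-by-genes matrix: 3 cells, 3 genes; g2 is never detected.
cells_by_genes = [
    [1, 0, 3],
    [0, 0, 2],
    [4, 0, 0],
]
print(all_zero_genes(cells_by_genes, ["g1", "g2", "g3"]))
# ['g2']
```

Dropping the reported genes from the evaluated set, or scoring them separately, is one way to keep an aggregate correlation from turning into NaN.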

jayypaul · 2024-03-29T14:25:02Z

Hey! No problem, thanks for making time for this! I'll check out the update and keep your reminders in mind. Thanks for your input, excited to see how it performs 👍

wehos mentioned this issue Mar 12, 2024

Bugs of running own codes for imputation #8

Open
