Bugs of running own codes for imputation #8

Open
HelloWorldLTY opened this issue Mar 9, 2024 · 5 comments

HelloWorldLTY commented Mar 9, 2024

Hi, I tried to impute my own spatial dataset (mouse) by following the imputation tutorial, but it fails with this error:

ValueError: None of AnnData.var.index found in pre-trained gene set. In case the input gene names are gene symbols, please enable `ensembl_auto_conversion`, or manually convert gene symbols to ensembl ids in the input dataset.

I checked that my dataset is indexed by gene names (the gene names are all upper-case here because I tried to use orthologous genes).


Contributor

wehos commented Mar 11, 2024

Sorry for the inconvenience. Our method uses Ensembl IDs as the gene index. We provide an automatic method, based on mygene, to map gene names to Ensembl IDs here.
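
For readers who want to see what such a conversion amounts to, here is a minimal sketch. It assumes a symbol-to-Ensembl lookup table has already been obtained (for example from a mygene query); the function name `symbols_to_ensembl` and the toy table are hypothetical and are not part of CellPLM. Unmapped symbols are returned separately instead of being given a placeholder ID.

```python
from typing import Dict, List, Tuple


def symbols_to_ensembl(
    symbols: List[str], lookup: Dict[str, str]
) -> Tuple[Dict[str, str], List[str]]:
    """Split gene symbols into mapped (symbol -> Ensembl ID) and unmapped.

    `lookup` is an assumed, pre-built symbol -> Ensembl ID table.
    """
    mapped: Dict[str, str] = {}
    unmapped: List[str] = []
    for symbol in symbols:
        ensembl_id = lookup.get(symbol)
        if ensembl_id is None:
            unmapped.append(symbol)  # no ID known for this symbol
        else:
            mapped[symbol] = ensembl_id
    return mapped, unmapped


# Toy lookup table, for illustration only.
toy_lookup = {"GAPDH": "ENSG00000111640", "ACTB": "ENSG00000075624"}
mapped, unmapped = symbols_to_ensembl(["GAPDH", "ACTB", "NOT_A_GENE"], toy_lookup)
print(mapped)
print(unmapped)
```

Inspecting `unmapped` before training makes it easier to notice symbols that failed to convert, which is the situation reported later in this thread.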

@HelloWorldLTY
Author

Hi, thanks. After converting the gene names with this method, I hit a new error in this call:

pipeline.fit(train_data,  # An AnnData object
             pipeline_config,  # The config dictionary we created previously, optional
             split_field='split',  # Specify a column in .obs that contains split information
             train_split='train',
             valid_split='valid',
             batch_gene_list=batch_gene_list,  # Specify genes that are measured in each batch, see previous section for more details
             device=DEVICE,
             )

     43 g2id = dict(zip(self.gene_list, list(range(len(self.gene_list)))))
     44 for batch in batch_gene_list:
---> 45     idx = torch.LongTensor([g2id[g] for g in batch_gene_list[batch]])
     46     self.batch_gene_mask[batch] = torch.zeros(len(g2id)).bool()
     47     self.batch_gene_mask[batch][idx] = True

KeyError: '0'

I think the reason is that, after converting the gene names, some strange gene IDs appear:

  'ENSG00000137547',
  'ENSG00000120992',
  'ENSG00000187735',
  'ENSG00000047249',
  'ENSG00000023287',
  '0',
  'ENSG00000168300',
  '0-1',
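
A quick way to spot such entries is to check each identifier against the usual Ensembl gene ID pattern. The sketch below assumes stable IDs of the form `ENS`, an optional species code, `G`, and 11 digits, without version suffixes; `split_ensembl_ids` is a hypothetical helper, not a CellPLM function.

```python
import re
from typing import List, Tuple

# Assumed pattern: "ENS" + optional species letters + "G" + 11 digits,
# e.g. ENSG00000137547 (human) or ENSMUSG00000000001 (mouse).
ENSEMBL_GENE_ID = re.compile(r"^ENS[A-Z]*G\d{11}$")


def split_ensembl_ids(ids: List[str]) -> Tuple[List[str], List[str]]:
    """Return (well-formed IDs, everything else), preserving input order."""
    valid = [g for g in ids if ENSEMBL_GENE_ID.match(g)]
    invalid = [g for g in ids if not ENSEMBL_GENE_ID.match(g)]
    return valid, invalid


# The converted IDs quoted above.
converted = [
    "ENSG00000137547", "ENSG00000120992", "ENSG00000187735",
    "ENSG00000047249", "ENSG00000023287", "0",
    "ENSG00000168300", "0-1",
]
valid, invalid = split_ensembl_ids(converted)
print(invalid)
```

A well-formed ID can still be absent from the pretrained gene set, so this check only catches placeholder-like entries such as `'0'` and `'0-1'`.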

Contributor

wehos commented Mar 12, 2024

This is generally the same issue as here. Did you follow the tutorial? The tutorial should automatically remove gene IDs that are not in the pretrained gene list.
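
The filtering described here can be sketched as follows. This is an illustration and not the tutorial's actual code: `restrict_to_known_genes` and the toy gene lists are hypothetical, and `batch_gene_list` is assumed to be a dict mapping each batch name to a list of gene IDs, as the traceback above suggests.

```python
from typing import Dict, Iterable, List


def restrict_to_known_genes(
    batch_gene_list: Dict[str, List[str]], known_genes: Iterable[str]
) -> Dict[str, List[str]]:
    """Keep, per batch, only the genes present in `known_genes` (order preserved)."""
    known = set(known_genes)
    return {
        batch: [gene for gene in genes if gene in known]
        for batch, genes in batch_gene_list.items()
    }


# Toy stand-ins: a "pretrained" gene list and one batch containing stray IDs.
pretrained_genes = ["ENSG00000137547", "ENSG00000120992", "ENSG00000187735"]
raw_batches = {"batch0": ["ENSG00000137547", "0", "ENSG00000187735", "0-1"]}

filtered = restrict_to_known_genes(raw_batches, pretrained_genes)

# The same kind of lookup that raised KeyError now succeeds for every gene.
g2id = {gene: i for i, gene in enumerate(pretrained_genes)}
indices = {batch: [g2id[g] for g in genes] for batch, genes in filtered.items()}
print(indices)
```

The AnnData columns would presumably need the same subsetting so that the data and `batch_gene_list` stay consistent.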

@HelloWorldLTY
Author

Yes, I followed the tutorial but used my own datasets. The dataset I used is from tangram: https://github.com/broadinstitute/Tangram/blob/master/tutorial_tangram_with_squidpy.ipynb

I will try removing all the genes whose ID is '0' or of the form '0-1' and then run it again 🤔

Contributor

wehos commented Mar 28, 2024

Hello, I have updated the code, so it should now work more smoothly. If you previously installed CellPLM with pip, please run `pip install -U cellplm` to update it. Thanks!
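
To confirm that the upgrade took effect, the installed version can be read from the package metadata. A minimal sketch, assuming the distribution is named `cellplm` as in the pip command above; `installed_version` is a hypothetical helper.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


cellplm_version = installed_version("cellplm")
print("cellplm version:", cellplm_version or "not installed")
```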
