A.s edited this page Jun 17, 2024 · 27 revisions

<--

Answers to Polls

14_FirstSteps

  • How many genes and cells does this dataset have?
  • dim(pbmc.data)
  • 32738 genes, 2700 cells
  • How many genes are not expressed in any cell?
  • (rowSums(pbmc.data) == 0) %>% sum()
  • 16104
  • Which are the top 3 genes with the highest total count?
  • rowSums(pbmc.data) %>% sort(decreasing = TRUE) %>% head(3)
  • MALAT1, TMSB4X, B2M
  • In cell "AAATTCGATTCTCA-1", how many reads map to gene "ACTB"?
  • pbmc.data["ACTB","AAATTCGATTCTCA-1"]
  • 10
  • How many cells have 2000 counts or fewer?
  • (colSums(pbmc.data) <= 2000) %>% sum()
  • 1025
  • What's the current number of cells after this step?
  • dim(pbmc)
  • 2638
  • Which are the 3 most highly variable genes?
  • VariableFeatures(pbmc) %>% head(3)
  • LYZ, S100A9, PPBP
  • What's the variance of the gene PYCARD?
  • HVFInfo(pbmc)["PYCARD",]
  • 5.05
  • How many components should we choose to include?
  • 7
  • Which is the default value of the parameter K?
  • k.param = 20
  • How many clusters did we find?
  • pbmc$seurat_clusters %>% unique() %>% length()
  • 9
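The matrix questions above can be tried on a toy example. This is a minimal base R sketch, assuming a made-up 4 x 3 count matrix (`counts`, its gene and cell names, and the threshold of 10 are all hypothetical, not taken from the PBMC data). It uses plain function calls instead of `%>%`, so the comparison is evaluated before `sum()` without needing extra parentheses.

```r
# Toy count matrix: genes in rows, cells in columns (values are made up).
counts <- matrix(
  c(0, 0, 0,
    5, 1, 0,
    2, 9, 4,
    0, 3, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("geneA", "geneB", "geneC", "geneD"),
                  c("cell1", "cell2", "cell3"))
)

# Number of genes (rows) and cells (columns).
matrix_dim <- dim(counts)

# Genes with zero counts in every cell.
n_unexpressed <- sum(rowSums(counts) == 0)

# Names of the two genes with the highest total count.
top_genes <- names(head(sort(rowSums(counts), decreasing = TRUE), 2))

# Count for one gene in one cell, indexed by name.
one_entry <- counts["geneC", "cell2"]

# Cells whose total count is below a (hypothetical) threshold of 10.
n_small_cells <- sum(colSums(counts) < 10)
```

With `%>%`, the comparison needs its own parentheses, as in `(rowSums(counts) == 0) %>% sum()`, because the pipe binds more tightly than `==`.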

16_UMAP

  • Is t-SNE affected by the seed?
  • Yes. A seed is always set, because the algorithm is stochastic (the S in SNE!).
  • (perplexity constraint) Is nrow(X) the number of genes or cells?
  • nrow(X) is the number of cells: "observations" are in the rows of the data matrix, while each column is a variable (gene).
  • Is UMAP affected more than t-SNE by the seed?
  • t-SNE is more sensitive. UMAP's emphasis on global structure helps (a lot).
  • Would decreasing the number of PCs fed to the UMAP algorithm change our visualization? Would you say the results are 'better'?
  • Yes, and it's better with few PCs. That's why we look for a "sweet spot" with the elbow plot or JackStraw. The variability in each PC is a mix (not always 1:1) of relevant biological variability and technical variability.
  • Could you have the UMAP projection onto 3 axes instead of 2?
  • All answers are correct, except for "no".
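Two of the points above (cells as rows of X, and the search for a "sweet spot" in the number of PCs) can be illustrated with base R. This is only a sketch on simulated data: `X`, `n_cells`, `n_genes`, and the injected `signal` are assumptions for illustration, and `prcomp` stands in for the PCA step of the course pipeline. The last lines show why fixing a seed makes a stochastic step reproducible.

```r
set.seed(42)  # fix the seed so the simulated data are reproducible
n_cells <- 50
n_genes <- 10

# Observations (cells) in rows, variables (genes) in columns.
X <- matrix(rnorm(n_cells * n_genes), nrow = n_cells, ncol = n_genes)

# Give the first two genes a strong shared signal so the first PC stands out.
signal <- rnorm(n_cells, sd = 5)
X[, 1] <- X[, 1] + signal
X[, 2] <- X[, 2] + signal

pca <- prcomp(X, center = TRUE, scale. = FALSE)

# Proportion of variance per PC: the quantity an elbow plot helps us inspect.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Same seed, same random draws.
set.seed(1); a <- rnorm(3)
set.seed(1); b <- rnorm(3)
same_draws <- identical(a, b)
```

Plotting `var_explained` against the PC index gives a simple elbow plot; PCs after the bend add mostly noise in this simulated setting.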

21_DE

  • What would happen if we ran FindVariableFeatures after ScaleData?
  • Nothing: the resulting list of genes would be the same. Data scaling matters for ML methods (e.g. dimensionality reduction), whereas the methods for finding variable features are more akin to 'classic' statistical modelling techniques (e.g. regression).
  • What would happen if we skip ScaleData? Would PCA be affected?
  • The resulting PCA would be biased by the absolute values in our count matrix. Genes with higher counts would dominate the components, even if their variability is small.
  • Would there be any difference if we ran FindNeighbors and FindClusters only after UMAP?
  • It depends: only if we told FindNeighbors to use reduction='umap'. The first option is incorrect in the context of single-cell data and the Seurat processing pipeline.

  • Given that the genes used for clustering are the same genes tested for differential expression, would you interpret the (adjusted) p-values without concerns?
  • Yeah. There's a bit of a chicken-and-egg problem... but that's just how it is.

  • Which marker gene is the most expressed in cluster 1 when comparing it to cluster 2?

  • What would happen if we used ident.1 = 2, and ident.2 = 1 instead?

  • How would you extract the 'gene signature' of a given cluster?
  • By DE testing with… any test, against the remaining clusters.
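The answer about skipping ScaleData can be illustrated with a small base R sketch. It assumes simulated expression values (`expr` and its three genes are hypothetical) and uses `prcomp` with `scale. = TRUE` as a stand-in for per-gene scaling; it is not Seurat's ScaleData itself. The gene with large absolute values also has the largest raw variance here, so it takes over the first component when the data are left unscaled.

```r
set.seed(7)
n_cells <- 100

# Cells in rows, genes in columns (the layout prcomp expects).
expr <- cbind(
  high = rnorm(n_cells, mean = 1000, sd = 50),  # large absolute values
  low1 = rnorm(n_cells, mean = 1, sd = 1),
  low2 = rnorm(n_cells, mean = 1, sd = 1)
)

# PCA on centred but unscaled data versus centred and scaled data.
pca_raw    <- prcomp(expr, center = TRUE, scale. = FALSE)
pca_scaled <- prcomp(expr, center = TRUE, scale. = TRUE)

# Absolute PC1 loadings: how much each gene contributes to the first component.
load_raw    <- abs(pca_raw$rotation[, "PC1"])
load_scaled <- abs(pca_scaled$rotation[, "PC1"])

# Gene that dominates PC1 when scaling is skipped.
dominant_raw <- names(which.max(load_raw))
```

Comparing `load_raw` with `load_scaled` shows the effect: once every gene has unit variance, no gene can dominate the first component purely because of its magnitude.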

Day 3

  • paste question here

Day 4

  • paste question here

-->
