Home
<!--
- How many genes and cells does this dataset have?
- dim(pbmc.data)
- 32738 genes, 2700 cells
- How many genes are not expressed in any cell?
- (rowSums(pbmc.data) == 0) %>% sum()
- 16104
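Why the parentheses matter: in R, `%>%` binds more tightly than `==`, so without them the comparison is piped last instead of first. A minimal sketch on a made-up toy matrix (`toy_counts` is hypothetical, not the PBMC data), assuming the magrittr package is installed:

```r
library(magrittr)  # provides %>%

# Hypothetical toy count matrix: 4 genes (rows) x 3 cells (columns).
toy_counts <- matrix(
  c(0, 0, 0,
    1, 0, 2,
    0, 0, 0,
    3, 1, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene", 1:4), paste0("cell", 1:3))
)

# Without parentheses this parses as rowSums(toy_counts) == (0 %>% sum()),
# i.e. a logical vector with one entry per gene, not a count.
unparenthesized <- rowSums(toy_counts) == 0 %>% sum()

# With parentheses the comparison happens first, then the TRUEs are summed.
n_unexpressed <- (rowSums(toy_counts) == 0) %>% sum()

print(unparenthesized)
print(n_unexpressed)
```

The same pattern applies to the `colSums(...)` question further down, which is already parenthesized.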
- Which are the top 3 genes with the highest total count?
- rowSums(pbmc.data) %>% sort(decreasing = TRUE) %>% head(3)
- MALAT1, TMSB4X, B2M
- In cell "AAATTCGATTCTCA-1", how many reads map to gene "ACTB"?
- pbmc.data["ACTB","AAATTCGATTCTCA-1"]
- 10
- How many cells have 2000 counts or fewer?
- (colSums(pbmc.data) <= 2000) %>% sum()
- 1025
- What's the current number of cells after this step?
- ncol(pbmc)
- 2638
- Which are the 3 most highly variable genes?
- VariableFeatures(pbmc) %>% head(3)
- LYZ, S100A9, PPBP
- What's the variance of the gene PYCARD?
- HVFInfo(pbmc)["PYCARD",]
- 5.05
- How many components should we choose to include?
- 7
- What is the default value of the parameter K?
- k.param = 20
- How many clusters did we find?
- pbmc$seurat_clusters %>% unique() %>% length()
- 9
- Is t-SNE affected by the seed?
- Yes. A seed is always set, because the algorithm is stochastic (the S in SNE!).
- (perplexity constraint) Is nrow(X) the number of genes or cells?
- nrow(X) is the number of cells: the rows of the data matrix are the "observations", while each column is a variable (gene).
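To make the perplexity constraint concrete, here is a small sketch. It assumes the check has the form `3 * perplexity <= nrow(X) - 1`, which is how we understand the Rtsne package to enforce it (treat that as an assumption and check the Rtsne docs). `max_perplexity` is a hypothetical helper, not part of Rtsne or Seurat.

```r
# Hypothetical helper: largest whole-number perplexity allowed under the
# assumed constraint 3 * perplexity <= nrow(X) - 1.
max_perplexity <- function(X) {
  floor((nrow(X) - 1) / 3)
}

# Toy stand-in for the input matrix: 2700 cells (rows) x 10 variables (columns).
# Only the shape matters here, so the values are all zero.
X <- matrix(0, nrow = 2700, ncol = 10)

n_cells <- nrow(X)
limit <- max_perplexity(X)

print(n_cells)
print(limit)
```

Because the rows are cells, filtering out cells lowers the limit, while changing the number of columns does not.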
- Is UMAP affected more than t-SNE by the seed?
- t-SNE is more sensitive. UMAP's emphasis on global structure helps (a lot).
- Would decreasing the number of PCs fed to the UMAP algorithm change our visualization? Would you say the results are 'better'?
- Yes, and it's better with few PCs. That's why we look for a "sweet spot" with the Elbow plot or JackStraw. The variability in each PC is a mix (not always 1:1) of relevant biological variability and technical variability.
- Could you have the UMAP projection onto 3 axes instead of 2?
- All answers are correct, except for "no".
- What would happen if we ran FindVariableFeatures after ScaleData?
- Nothing; the resulting list of genes would be the same. Data scaling is important for ML (e.g. dimensionality reduction); the methods for finding variable features are more akin to 'classic' statistical modelling techniques (e.g. regression).
- What would happen if we skip ScaleData? Would PCA be affected?
- The resulting PCA would be biased by the absolute values in our count matrix. Genes with higher counts would dominate the components, even if their variability is not large.
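The effect of skipping scaling can be illustrated with base R alone. This is a toy sketch, not Seurat's implementation: `prcomp(scale. = TRUE)` stands in for the per-gene centering and scaling step, and the simulated values (`low_genes`, `high_gene`) are made up for illustration.

```r
set.seed(1)
n_cells <- 200

# Hypothetical toy data: rows are cells, columns are genes.
# Four low-abundance genes (mean 1, sd 1) and one high-abundance gene
# (mean 1000, sd 50): large absolute spread, small spread relative to its mean.
low_genes <- matrix(rnorm(n_cells * 4, mean = 1, sd = 1), ncol = 4)
high_gene <- rnorm(n_cells, mean = 1000, sd = 50)
X <- cbind(low_genes, high_gene)
colnames(X) <- c(paste0("gene", 1:4), "high_count_gene")

# PCA on centered but unscaled values vs. centered and scaled values.
pca_raw <- prcomp(X, center = TRUE, scale. = FALSE)
pca_scaled <- prcomp(X, center = TRUE, scale. = TRUE)

# Share of the total variance captured by PC1 in each case.
pc1_share_raw <- pca_raw$sdev[1]^2 / sum(pca_raw$sdev^2)
pc1_share_scaled <- pca_scaled$sdev[1]^2 / sum(pca_scaled$sdev^2)

# Absolute weight of the high-abundance gene on PC1 when scaling is skipped.
loading_raw <- abs(pca_raw$rotation["high_count_gene", "PC1"])

print(c(raw = pc1_share_raw, scaled = pc1_share_scaled))
print(loading_raw)
```

Without scaling, PC1 is essentially the high-abundance gene on its own; with scaling, no single gene takes over the first component.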
- Would it make any difference if we ran FindNeighbors and FindClusters only after UMAP?
- It depends! It would only matter if we wanted FindNeighbors to use reduction='umap', and that option is incorrect in the context of single-cell data and the Seurat processing pipeline.
- Given that the genes used for clustering are the same genes tested for differential expression, would you interpret the (adjusted) p-values without concerns?
- Yeah. There's a bit of a chicken-and-egg problem... but that's just how it is.
- Which marker gene is the most expressed in cluster 1 when comparing it to cluster 2?
- What would happen if we used ident.1 = 2 and ident.2 = 1 instead?
- How would you extract the 'gene signature' of a given cluster?
- By DE testing with… Any test, against the remaining clusters
- paste question here
- paste question here
-->
This wiki is not empty; the content is hidden inside an HTML comment. Click the EDIT button in the top right corner! ........ (not this pencil --->)