how to limit the number of variants used in step 1 #587

Open
shengqh opened this issue Dec 24, 2024 · 6 comments

Comments

@shengqh

shengqh commented Dec 24, 2024

According to the introduction, about 500,000 variants should be enough to fit the model. We usually apply a few criteria to keep high-quality variants, for example:

--mac 200 --geno 0.1 --maf 0.1

However, too many variants might still remain after filtering.

My question is: which of the following would be better for Regenie model fitting and testing?

  • Randomly sample the QC-filtered variants down to the required number, such as 500,000. The average MAF of the variants would not be very high.

  • Tighten the QC criteria, for example "--mac 800 --geno 0.05" and so on, which might result in a lot of very high MAF variants.
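To make the first option concrete, here is a minimal sketch of random downsampling in Python. It assumes the QC-filtered variant IDs are already available as a list (for example, read from a one-ID-per-line file); the function name `sample_variant_ids` and the `rs...` IDs are hypothetical and only for illustration. A fixed seed keeps the subset reproducible, and the original variant order is preserved.

```python
import random


def sample_variant_ids(variant_ids, target, seed=0):
    """Draw up to `target` IDs uniformly without replacement, keeping input order."""
    ids = list(variant_ids)
    if len(ids) <= target:
        # Nothing to thin: fewer variants than requested.
        return ids
    rng = random.Random(seed)  # fixed seed so the same subset is drawn every run
    keep = set(rng.sample(range(len(ids)), target))
    return [v for i, v in enumerate(ids) if i in keep]


if __name__ == "__main__":
    # Hypothetical IDs standing in for the lines of a QC-filtered variant list.
    qc_pass = [f"rs{i}" for i in range(1, 1001)]
    subset = sample_variant_ids(qc_pass, target=100, seed=42)
    print(len(subset), subset[:3])
```

The resulting list could be written to a file, one ID per line, and passed to a variant-extraction option. PLINK also offers `--thin-count` for random thinning to a fixed number of variants, which may be simpler if PLINK is already part of the pipeline.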

@Ojami

Ojami commented Dec 27, 2024

Please see #497 and #530.

Hope this helps.

@shengqh
Author

shengqh commented Dec 28, 2024

In your post:

From REGENIE paper:

a minor allele frequency of ≥1%, a Hardy–Weinberg equilibrium test not exceeding P = 1 × 10⁻¹⁵, a genotyping rate above 99%, not present in low-complexity regions, not involved in inter-chromosomal LD and LD pruning using a R² threshold of 0.9 with a window size of 1,000 markers and a step size of 100 markers. This resulted in up to 471,762 genotyped SNPs that were kept in the analyses

Were those 471,762 genotyped SNPs used in both step 1 and step 2, or just step 1?

@joellembatchou
Collaborator

Hi,

You could also perform LD pruning to reduce the number of variants used in step 1, in addition to using more stringent QC parameters.

For your question on the analysis in the paper, the 471,762 variants were used in step 1, and for step 2 we tested on those variants as well as a larger set of imputed variants (using the same step 1 output file).

Cheers,
Joelle
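As an illustration of the QC-plus-pruning route, the sketch below assembles a PLINK 2 pruning command and a REGENIE step 1 command as argument lists in Python. The thresholds mirror those quoted from the paper earlier in this thread (MAF ≥ 1%, HWE P = 1 × 10⁻¹⁵, genotyping rate above 99%, R² 0.9 with a 1,000-marker window and a 100-marker step). The file prefixes (`genotypes`, `qc`, `pheno.txt`, `step1_fit`), the helper names, and the block size of 1000 are assumptions for the example, not values taken from the thread, and the low-complexity-region and inter-chromosomal LD filters are not covered.

```python
import shlex


def plink2_prune_cmd(bfile, out, maf=0.01, geno=0.01, hwe=1e-15,
                     window=1000, step=100, r2=0.9):
    """Build a PLINK 2 command that applies QC filters and LD-prunes variants.

    PLINK writes the retained variant IDs to `<out>.prune.in`.
    """
    return [
        "plink2", "--bfile", bfile,
        "--maf", str(maf),    # minor allele frequency floor
        "--geno", str(geno),  # maximum per-variant missingness
        "--hwe", str(hwe),    # Hardy-Weinberg P-value threshold
        "--indep-pairwise", str(window), str(step), str(r2),
        "--out", out,
    ]


def regenie_step1_cmd(bed, extract, pheno, out, bsize=1000):
    """Build a REGENIE step 1 command restricted to the pruned variant list."""
    return [
        "regenie", "--step", "1",
        "--bed", bed,
        "--extract", extract,
        "--phenoFile", pheno,
        "--bsize", str(bsize),
        "--out", out,
    ]


if __name__ == "__main__":
    # All file names here are hypothetical placeholders.
    prune = plink2_prune_cmd("genotypes", "qc")
    fit = regenie_step1_cmd("genotypes", "qc.prune.in", "pheno.txt", "step1_fit")
    print(shlex.join(prune))
    print(shlex.join(fit))
```

The commands are only printed here; they could be run with `subprocess.run(cmd, check=True)` once the paths point at real data. Step 2 would then reuse the step 1 output while testing a larger variant set, as described above.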

@shengqh
Author

shengqh commented Jan 7, 2025

Thank you so much. Another related question: if we want to do rare variant analysis, which approach is best when building the model in the first step? Using common variants filtered by a higher MAF (for example 0.1) followed by LD pruning, to include more confident SNVs in the model, or using variants filtered by a low MAF (for example 0.001) followed by LD pruning, to include more rare SNVs in the model?

@joellembatchou
Collaborator

The goal of step 1 is to capture common genetic variation genome-wide, so using a MAF threshold of 1% or 5% is sufficient (combined with stringent LD pruning to reduce the number of variants).

@shengqh
Author

shengqh commented Jan 8, 2025

That makes sense. Thank you so much for the quick response.

For "not present in low-complexity regions, not involved in inter-chromosomal LD", how did you achieve this goal?
