Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Automatic threshold determination #126

Open
ckuenne opened this issue May 14, 2020 · 10 comments
Open

Automatic threshold determination #126

ckuenne opened this issue May 14, 2020 · 10 comments

Comments

@ckuenne
Copy link

ckuenne commented May 14, 2020

Hi,

is there a systematic way to find meaningful cutoffs for -s/-w? How do you decide on these?

Best,
Carsten

@VJalili
Copy link
Member

VJalili commented May 14, 2020

That's a good question!

The choice of stringent (-s) and weak (-w) thresholds are subject to your experiment setup and peak calling pipeline.

You may consider this example for choosing thresholds:
Based on your experiment setup, you may decide to use, e.g., 1e-8 as the p-value threshold on MACS so to minimize the number of false-positives. In order to increase the number of true-positives without the penalty of significantly increasing the number of false-positives, we suggest you use a more permissive threshold for peak calling (e.g., 1e-4), and then apply mspc to rescue weak but reproducible regions. Specifically, using 1e-4 on MACS will increase the number of true-positives with the penalty of more false-positives, then you can run MSPC on the replicates with -w 1e-4 and -s 1e-8, which will rescue the reproducible weak binding sites (i.e., increase the number of true-positives without increasing the number of false-positives). Additionally, in this setup, you may choose -g 1e-6 for a more permissive process; by default -g (combined stringency) is set to be equal to -s.

You may need to choose different thresholds if you use a different peak caller or different experiment setup.

We are interested in developing a method that can suggest values for these thresholds based on given input. Do you have any particular method in mind?

ping @marziacremona @fernandoPalluzzi

@ckuenne
Copy link
Author

ckuenne commented May 15, 2020

Not really. I'm working with multiple setups including ChIP-Seq (histones, tf), ATAC-Seq, RIP-Seq, and so on, which result in different peak sizes, distributions, and signal-to-noise ratio. Peak calling is done with MACS2, EPIC2, and MUSIC. If you plot these peaks after sorting for ascending -log10(fdr) you get something like this (example from using MACS2 with q-value 1e-4 on ATAC-Seq resulting in ~80k peaks).

image

A generalized algorithm would have to dynamically find lower/upper bounds here, which I have no good idea about. You could look for the slope, i.e. where it changes from linear to exponential. Or area under the curve, but that basically amounts to percentiles, which seems kinda crude.

BTW: in my experience the "extreme" peaks are often artefacts that result from erroneously calling multiple peaks as one, or are located in weird regions that should really be blacklisted. I know you can't fix issues with the peak calling itself, this should be done before your software is run. Just saying that the super peaks are often fishy, despite being highly significant and reproducible.

Edit: I just included your two suggested thresholds.

@VJalili
Copy link
Member

VJalili commented May 19, 2020

This is a good suggestion, we can give it a try. We can work on it together if you're interested.

Regarding the second point: Yes, agree. We discussed excluding black listed regions (#85), probably we can prioritize it.

@marziacremona
Copy link
Contributor

The automatic thresholding would be indeed a cool improvement for MSPC, but it’s very hard to do since it depends a lot on the particular experiment and on the peak caller used.
It’s no surprise that peak callers don’t do that, but they just propose a default fixed threshold… so I don’t think this should be a priority.

Regarding the extreme peaks, excluding ENCODE blacklisted regions as a first MSPC step would be certainty useful, and not complex to do.

@ckuenne
Copy link
Author

ckuenne commented May 25, 2020

I agree that it's probably impossible to automatically find the perfect threshold for every experimental variation automatically. But you might be able to identify certain types of distributions that behave similarly. And an imperfect auto threshold option would already be a great help. Just something to start off from a not totally arbitrary point.

Considering the ENCODE blacklisted regions: these were already removed from my example peaks. I guess that most people do this as part of the normal peak calling. But it sure can't hurt to have the option here.

@VJalili
Copy link
Member

VJalili commented May 25, 2020

@ckuenne I agree suggesting a value for these thresholds is important. I think it would be a very useful addition; would you be interested leading its method development and implementation? I can help you as much needed.

Some questions probably we can start with:

  • Can a value be suggested for these thresholds based on the info in the input BED files (e.g., p-value)? I guess we'd prefer doing our best to avoid asking users for any additional files/information.
  • I guess we want to suggest values for the weak (-w) and stringent (-s) thresholds; it may also worth suggesting a value for the optional combined stringency (-g) threshold .

@ckuenne
Copy link
Author

ckuenne commented May 26, 2020

Well, if I had a viable idea for this I would probably just have written a small wrapper and that's that. Sadly, I don't have any. And this is not really something i can devote myself to right now. Sorry.

@VJalili
Copy link
Member

VJalili commented May 26, 2020

Thanks for the suggestion anyway :) I'd keep issue open for a while in case anyone comes up a method.

@callum-b
Copy link

Hello! I've been experimenting with MSPC a bit to call peaks from two replicates and I've also been struggling with setting thresholds.
Could a starting point not be to use the distribution of pvals in the input files? For example, pool all peaks together, and set W=Quartile1 of the combined distribution and S=Quartile3 of the combined distribution? This could at least be set for a first run, and then perhaps print a message to the user that these params should probably be adjusted to give more fine-tuned results.

@VJalili
Copy link
Member

VJalili commented Nov 20, 2024

That is a good suggestion! Some thoughts:

  • The values for those two thresholds can hugely impact the method's output, and their value should be set based on the experiment setup.

  • I agree the quartile can be a very helpful starting point. My only concern is that their value also depends on the experiment setup; it may work great in most cases but could also lead to tricky defaults. For instance, in a noisy batch, you may have a lot of peaks with signals close to the background, so the defaults could also be set to very permissive thresholds, yielding noisy output. Indeed, those are the default thresholds, and users can adjust them as needed in such cases, though I am afraid most users run programs with default values assuming they are set to reasonable values and do not commonly experiment with optional parameters.

  • Maybe an alternative could be training a regression model using data from various experiment setups. However, that would probably take some metadata about the experiment setup that has to be provided to MSPC in addition to the peaks (it may need some thinking around UX).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants