Automatic threshold determination #126

ckuenne · 2020-05-14T15:40:01Z

Hi,

is there a systematic way to find meaningful cutoffs for -s/-w? How do you decide on these?

Best,
Carsten

VJalili · 2020-05-14T17:36:33Z

That's a good question!

The choice of stringent (-s) and weak (-w) thresholds are subject to your experiment setup and peak calling pipeline.

You may consider this example for choosing thresholds:
Based on your experiment setup, you may decide to use, e.g., 1e-8 as the p-value threshold on MACS so to minimize the number of false-positives. In order to increase the number of true-positives without the penalty of significantly increasing the number of false-positives, we suggest you use a more permissive threshold for peak calling (e.g., 1e-4), and then apply mspc to rescue weak but reproducible regions. Specifically, using 1e-4 on MACS will increase the number of true-positives with the penalty of more false-positives, then you can run MSPC on the replicates with -w 1e-4 and -s 1e-8, which will rescue the reproducible weak binding sites (i.e., increase the number of true-positives without increasing the number of false-positives). Additionally, in this setup, you may choose -g 1e-6 for a more permissive process; by default -g (combined stringency) is set to be equal to -s.

You may need to choose different thresholds if you use a different peak caller or different experiment setup.

We are interested in developing a method that can suggest values for these thresholds based on given input. Do you have any particular method in mind?

ping @marziacremona @fernandoPalluzzi

ckuenne · 2020-05-15T08:02:14Z

Not really. I'm working with multiple setups including ChIP-Seq (histones, tf), ATAC-Seq, RIP-Seq, and so on, which result in different peak sizes, distributions, and signal-to-noise ratio. Peak calling is done with MACS2, EPIC2, and MUSIC. If you plot these peaks after sorting for ascending -log10(fdr) you get something like this (example from using MACS2 with q-value 1e-4 on ATAC-Seq resulting in ~80k peaks).

A generalized algorithm would have to dynamically find lower/upper bounds here, which I have no good idea about. You could look for the slope, i.e. where it changes from linear to exponential. Or area under the curve, but that basically amounts to percentiles, which seems kinda crude.

BTW: in my experience the "extreme" peaks are often artefacts that result from erroneously calling multiple peaks as one, or are located in weird regions that should really be blacklisted. I know you can't fix issues with the peak calling itself, this should be done before your software is run. Just saying that the super peaks are often fishy, despite being highly significant and reproducible.

Edit: I just included your two suggested thresholds.

VJalili · 2020-05-19T03:46:40Z

This is a good suggestion, we can give it a try. We can work on it together if you're interested.

Regarding the second point: Yes, agree. We discussed excluding black listed regions (#85), probably we can prioritize it.

marziacremona · 2020-05-20T18:38:34Z

The automatic thresholding would be indeed a cool improvement for MSPC, but it’s very hard to do since it depends a lot on the particular experiment and on the peak caller used.
It’s no surprise that peak callers don’t do that, but they just propose a default fixed threshold… so I don’t think this should be a priority.

Regarding the extreme peaks, excluding ENCODE blacklisted regions as a first MSPC step would be certainty useful, and not complex to do.

ckuenne · 2020-05-25T07:25:16Z

I agree that it's probably impossible to automatically find the perfect threshold for every experimental variation automatically. But you might be able to identify certain types of distributions that behave similarly. And an imperfect auto threshold option would already be a great help. Just something to start off from a not totally arbitrary point.

Considering the ENCODE blacklisted regions: these were already removed from my example peaks. I guess that most people do this as part of the normal peak calling. But it sure can't hurt to have the option here.

VJalili · 2020-05-25T18:20:09Z

@ckuenne I agree suggesting a value for these thresholds is important. I think it would be a very useful addition; would you be interested leading its method development and implementation? I can help you as much needed.

Some questions probably we can start with:

Can a value be suggested for these thresholds based on the info in the input BED files (e.g., p-value)? I guess we'd prefer doing our best to avoid asking users for any additional files/information.
I guess we want to suggest values for the weak (-w) and stringent (-s) thresholds; it may also worth suggesting a value for the optional combined stringency (-g) threshold .

ckuenne · 2020-05-26T07:11:41Z

Well, if I had a viable idea for this I would probably just have written a small wrapper and that's that. Sadly, I don't have any. And this is not really something i can devote myself to right now. Sorry.

VJalili · 2020-05-26T15:58:40Z

Thanks for the suggestion anyway :) I'd keep issue open for a while in case anyone comes up a method.

callum-b · 2024-11-19T10:31:24Z

Hello! I've been experimenting with MSPC a bit to call peaks from two replicates and I've also been struggling with setting thresholds.
Could a starting point not be to use the distribution of pvals in the input files? For example, pool all peaks together, and set W=Quartile1 of the combined distribution and S=Quartile3 of the combined distribution? This could at least be set for a first run, and then perhaps print a message to the user that these params should probably be adjusted to give more fine-tuned results.

VJalili · 2024-11-20T02:17:14Z

That is a good suggestion! Some thoughts:

The values for those two thresholds can hugely impact the method's output, and their value should be set based on the experiment setup.
I agree the quartile can be a very helpful starting point. My only concern is that their value also depends on the experiment setup; it may work great in most cases but could also lead to tricky defaults. For instance, in a noisy batch, you may have a lot of peaks with signals close to the background, so the defaults could also be set to very permissive thresholds, yielding noisy output. Indeed, those are the default thresholds, and users can adjust them as needed in such cases, though I am afraid most users run programs with default values assuming they are set to reasonable values and do not commonly experiment with optional parameters.
Maybe an alternative could be training a regression model using data from various experiment setups. However, that would probably take some metadata about the experiment setup that has to be provided to MSPC in addition to the peaks (it may need some thinking around UX).

VJalili added documentation question labels May 14, 2020

VJalili added the enhancement label May 25, 2020

VJalili removed the enhancement label Aug 7, 2021

VJalili added the enhancement label Nov 21, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Automatic threshold determination #126

Automatic threshold determination #126

ckuenne commented May 14, 2020

VJalili commented May 14, 2020 •

edited

Loading

ckuenne commented May 15, 2020 •

edited

Loading

VJalili commented May 19, 2020

marziacremona commented May 20, 2020

ckuenne commented May 25, 2020

VJalili commented May 25, 2020

ckuenne commented May 26, 2020

VJalili commented May 26, 2020

callum-b commented Nov 19, 2024

VJalili commented Nov 20, 2024

Automatic threshold determination #126

Automatic threshold determination #126

Comments

ckuenne commented May 14, 2020

VJalili commented May 14, 2020 • edited Loading

ckuenne commented May 15, 2020 • edited Loading

VJalili commented May 19, 2020

marziacremona commented May 20, 2020

ckuenne commented May 25, 2020

VJalili commented May 25, 2020

ckuenne commented May 26, 2020

VJalili commented May 26, 2020

callum-b commented Nov 19, 2024

VJalili commented Nov 20, 2024

VJalili commented May 14, 2020 •

edited

Loading

ckuenne commented May 15, 2020 •

edited

Loading