Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature request/idea: option to plot highest-density intervals #153

Open
elnjensen opened this issue Mar 11, 2021 · 3 comments
Open

Feature request/idea: option to plot highest-density intervals #153

elnjensen opened this issue Mar 11, 2021 · 3 comments

Comments

@elnjensen
Copy link

Now that arviz is an optional dependence, it would be nice to have the option to easily plot the highest-density intervals (from arviz.hdi) on the histograms rather than percentiles - in the case of asymmetric distributions, I find these more informative than the current quantiles behavior (though in the limit of a symmetric distribution they would be the same).

If you think this is a useful addition, I'd be happy to write up a PR. Before I do that, though, I thought I'd ask:

  • Do you think this is useful?
  • If so, I'd imagine it behaving similarly to quantiles by drawing lines on the 1-D histograms, though each probability value would give two lines, for the upper and lower bounds of that interval, rather than a single line.
  • It could take an iterable of probabilities to draw, or just a single probability. If multiple probabilities were given, would it make sense to have different line styles by default?

Here's a quick example showing the difference with symmetric vs. asymmetric distributions - HDI 50% interval is in green, vs. 25% and 75% quantiles.

import corner
import numpy as np
import arviz as az 
import matplotlib.pyplot as plt

ndim, nsamples = 2, 20000
np.random.seed(42)
data =(np.random.randn(ndim * nsamples)).reshape([nsamples, ndim])

for samples in data, np.cos(data):
    figure = corner.corner(samples, quantiles=[0.25, 0.75])

    axes = np.array(figure.axes).reshape((ndim, ndim))

    # Loop over the diagonal
    for i in range(ndim):
        ax = axes[i, i]
        for prob in az.hdi(samples[:, i], hdi_prob=0.50):
            ax.axvline(prob, color="g")

Fig1
Fig2

@dfm
Copy link
Owner

dfm commented Mar 11, 2021

This would be great! Do you know if the arviz functions handle weighted samples properly? That would be crucial for this to be included. I'm probably not going to be able to get to this too soon, but I'd be happy to collaborate on a PR.

@elnjensen
Copy link
Author

The arviz.hdi function does not have an option for including weights, so if that's crucial to include here, it would involve new code (though possibly with some borrowing of ideas and/or code from arviz). Looking at the arviz code and the algorithm for how it calculates HDI, it's not obvious to me that it lends itself easily to bringing weights in. The algorithm looks like this:

  • Sort the input array of n samples;
  • Calculate the number of samples m that represents the desired probability fraction, m = hdi_prob * n;
  • Look at all the sub-arrays of length m in the sorted array and find the one with the smallest difference between first and last element - that range of values is the highest-density interval.

To do this with weights, you can't follow quite the same logic because the intervals with the same weighted fractional probability wouldn't all have the same length. But maybe you could do something along these lines:

  1. Sort the input array of n samples;
  2. Sort the array of weights in the same order (e.g. using np.argsort)
  3. Calculate the cumulative sum of the normalized array of weights.
  4. Step through that cumulative-sum array (up until the element with a value of 1 – hdi_prob), at each point using np.searchsorted to find the endpoint of the subarray that would encompass the desired probability span from the starting point.
  5. Find the subarray with the smallest difference between start and end values.

For step 4, I don't see any obvious way to vectorize, so it could be slow for large arrays (but the larger the probability interval is, the fewer subarrays there are to consider).

Does this logic seem reasonable? I guess the upside here is that if it's new code, it is not dependent on having arviz installed.

@dfm
Copy link
Owner

dfm commented Mar 12, 2021

This does sound like the right algorithm to me, and I'd be open to including something like this. We already have a custom quantile function

def quantile(x, q, weights=None):

because the numpy percentile and quantile functions don't support weighted samples, and you'll see that the logic is similar to what you're describing (without the hard part :D). The computational scaling does sound potentially pretty bad, but it might be worth benchmarking.

An alternative would be to add a tutorial to the documentation that demonstrates this procedure using ArviZ directly (it might be worth using the overplot_lines function) so that people could easily copy and paste rather than adding more options to corner itself. In that tutorial, we could resample the weighted points to get uniformly weighted points and then use the same procedure.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants