
Handling of quantities #133

Closed
toddrjen opened this issue Feb 4, 2014 · 14 comments

@toddrjen
Contributor

toddrjen commented Feb 4, 2014

I split this off from issue #123

The current unit-handling library, quantities, has some major issues. The most pressing is that it is unmaintained, and it is unknown how well it will handle future Python 3.x releases. It also has bugs that are affecting us and will probably never be fixed.

However, it has other issues. For one thing, since it subclasses numpy's ndarray, it requires multiple copy operations for many of the tasks we need. Further, it prevents us from using numpy-like classes that support lazy loading (such as h5py datasets). It also means we can't attach units to anything other than numpy arrays; if we want a single number with units, we have to use a numpy scalar.

There are also other problems. For example, telling whether two quantities have the same type of unit (such as time) is difficult and slow. Similarly, copying the units from one variable to another is difficult.

I have found another project, pint, that seems to solve most or all of these problems. It also seems to have a smaller code base, and defining new units is easier.

https://pint.readthedocs.org/

@rproepp
Member

rproepp commented Feb 4, 2014

Andrew mentioned Pint in the upcoming Neo paper and I've already had a quick look at it. It seems very promising, and we should definitely consider moving from quantities to Pint. As I've written earlier, I think we should do such a switch in the same release as our other API-breaking changes.

If we decide to use Pint, we should contact the authors and talk about sustainability. It has 12 contributors but seems to be largely the effort of a single author.

@toddrjen
Contributor Author

toddrjen commented Feb 4, 2014

Yes, there does seem to be one person driving it. However, that was also the case with quantities, and pint seems to have more outside contributions than quantities did (for example, pint has more accepted and total pull requests despite being younger). It also seems to do a better job of dealing with issues.

@rproepp
Member

rproepp commented Feb 4, 2014

Yes, I completely agree. I only want to make sure that we don't switch and then run into a similar situation as with quantities one or two years down the line.

@samuelgarcia
Copy link
Contributor

Trevor did a good comparison of unit handling in Python here:
http://conference.scipy.org/scipy2013/presentation_detail.php?id=174

@toddrjen
Contributor Author

I know that several of the projects he presented have advanced a lot since then. I managed to install and run his comparison script, but I haven't been able to get his tables working yet. I will report back when I have an updated version of the analysis complete.

@physicalist

Has anyone made any progress on this issue? In the aforementioned talk by Trevor, someone in the audience claimed pint was based on quantities, but that's not (or no longer?) the case. The benchmarks on operations weren't really in pint's favor, but if it has been rewritten without quantities as a dependency, that might have changed drastically. It's worth re-evaluating in any case, unless someone has already done that?

@toddrjen
Contributor Author

I have re-run the data analysis from the presentation, and pint still has performance issues, although its coverage of Python and numpy mathematical operations is improving.

If we have a concrete interest in switching, I could approach the developers about it. But I didn't want to do that unless we were really willing to switch if things improved, and there didn't seem to be a consensus yet that switching was the right move.

@toddrjen
Contributor Author

Anybody have any thoughts on this?

@rproepp
Member

rproepp commented Apr 11, 2014

I think I would now prefer to stay with quantities, which might require adopting the project or finding a new maintainer to get the bugs fixed (some already have pull requests). Performance is important for us, and it seems that the issues with Pint are inherent in its architecture; see the comments at the end of this page: https://pint.readthedocs.org/en/latest/numpy.html. On the same page the author mentions that Pint might be based on numpy inheritance in the future, which would remove the performance issues but reintroduce some of the problems we have with quantities.

The h5py lazy-loading object is nice, but it might not be what we need. First, it supports only a very limited subset of numpy array functionality: mostly just indexing. Second, lazy loading from HDF5 arrays generally depends on the chunking defined when the array was created. A common case is an n×m array with n signals, each m samples long. If we're interested in only one of the channels and try to load it, we will often touch the whole dataset, because the default chunking is rectangular. If we instead use e.g. one chunk per channel, loading temporally defined subsets becomes pointless. Chunks that are too small hurt general loading performance, and so on. Lazy loading can still save memory, but with a potentially huge performance hit.
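The chunking trade-off described above can be sketched as follows. This is an illustration, not code from the thread; it assumes h5py is installed, and the dataset names are made up. An in-memory HDF5 file is used so nothing is written to disk.

```python
import numpy as np
import h5py

n_channels, n_samples = 8, 100_000
data = np.random.rand(n_channels, n_samples).astype("float32")

with h5py.File("signals.h5", "w", driver="core", backing_store=False) as f:
    # chunks=True lets h5py pick a chunk shape, which tends to be
    # rectangular: reading one channel then touches chunks that also
    # hold data from other channels.
    f.create_dataset("auto", data=data, chunks=True)

    # One chunk per channel: a whole-channel read touches exactly one
    # chunk, but a short time slice across all channels touches every chunk.
    f.create_dataset("per_channel", data=data, chunks=(1, n_samples))

    # Lazy access: only the chunk(s) containing this channel are read.
    channel = f["per_channel"][3, :]
    assert channel.shape == (n_samples,)
```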

@toddrjen
Contributor Author

I've done some benchmarking on pint, and the issues raised in that link are not the primary performance bottleneck. Most of the time seems to be taken up combining compound units, which is done in a recursive manner.

@rproepp
Member

rproepp commented Apr 11, 2014

OK, so the performance issues could be solvable? Maybe we could contact the author about that, or try ourselves? I'm not sure switching would be worth it even if performance got closer to quantities (but was still often only 50% as fast, if I understand the implications of the issues from the link correctly). Further opinions?

@toddrjen
Contributor Author

I don't know if they are solvable; we would have to talk to the developers. I also don't know how much pint can be optimized, since I don't yet understand how the operations work internally; I was just able to determine which methods took most of the time.

The performance could end up better than quantities'. The primary performance bottleneck of quantities is also not the numpy operations, and quantities may very well have the same issue with multiple numpy operations. I really need to understand how things are handled in practice to know.

pint, however, has the advantage that in performance-critical situations you can bypass the issue entirely by working with the numpy array directly. This is not possible with quantities, at least not in a documented manner.

@apdavison apdavison modified the milestones: 0.6, 0.4 Jul 4, 2016
@apdavison apdavison added bug and removed defect labels Jan 25, 2017
@samuelgarcia
Contributor

Closing this and leaving #278 open.
