Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Behavior of readBytes for nonexistent paths (returning None) is unexpected. #12

Open
bdice opened this issue Dec 6, 2018 · 4 comments
Labels
bug Something isn't working minor

Comments

@bdice
Copy link
Member

bdice commented Dec 6, 2018

Original report by me.


I am currently debugging a problem in glotzformats where the gtar file reader is misbehaving, when one frame has a type_shapes.json file but another frame does not.

I've traced it back to a behavior in libgetar that I find unusual. When readBytes is called on a nonexistent path, like traj.readBytes('potato'), it returns None. When getRecord is called on a nonexistent path, like traj.getRecord(gtar.Record('potato')), it returns an empty string.

My expectation was that both of these would raise a KeyError. It's not clear what use case the current behavior is enabling / preserving. The behavior of getRecord in particular is non-obvious, and caused JSON decoding errors when trying to decipher an empty string.

@bdice
Copy link
Member Author

bdice commented Dec 6, 2018

(It is possible to work around this in glotzformats, but I'm reporting this since I don't understand why the library behaves this way.)

@bdice
Copy link
Member Author

bdice commented Dec 10, 2018

Original comment by Matthew Spellings (Bitbucket: mspells, GitHub: klarh).


The intent of getRecord is that it always returns an array, string, or bytestring of the appropriate type for the record you pass in. Since Record('potato') doesn't specify any storage format, it will be assumed to be some sort of string-type value, as described here, if you're interested. Records that aren't present yield arrays/strings of length 0, which is why your getRecord call returns the empty string. The intended usage of getRecord is to always give it a valid record and index after finding the record you want via getRecordTypes and the frames from queryFrames.

readBytes and friends are part of the "simple" API that we take liberties with for the sake of convenience; in particular, we aren't assuming that the record or frames are valid, so it is more likely that users will ask for something that doesn't exist in the archive. In this case, we return None.

It's not really clear to me that raising a KeyError is any better in either case. Do you happen to have a motivating example where the KeyError feels obviously superior?

@bdice
Copy link
Member Author

bdice commented Dec 10, 2018

In my understanding, the current API cannot differentiate between a record that was written with an empty string and one that does not exist. Raising a KeyError could distinguish these cases.

In particular, glotzformats needs an efficient way to see if several records exist for a known frame (as opposed to the usual case of determining frames for a known record). I could iterate over framesWithRecordsNamed to see if the frame of interest is in that list, but that's needlessly O(N). I can think of two ways this might work.

  1. Don't raise KeyErrors, keep current behavior and add an exists method.
if traj.exists(record):
    return traj.getRecord(record)
else:
    return defaults / raise / whatever
  1. Change current behavior to raise KeyError if the record doesn't exist.
try:
    return traj.getRecord(record)
except KeyError:
    return defaults / raise / whatever

The second one is EAFP, and would be my preference (as it is "more Pythonic", but that may lose some meaning when discussing a Python library that is meant to be a C++ interface).

@bdice
Copy link
Member Author

bdice commented Dec 11, 2018

Original comment by Matthew Spellings (Bitbucket: mspells, GitHub: klarh).


FWIW, I always intended things like glotzformats to take the most recent version of whatever quantity with respect to the timestep it is given for things that aren't dumped on every frame (like type_shapes.json). I was originally thinking that this type of behavior would go inside libgetar, but I've since come to think that it makes more sense for it to go in whatever is applying a schema and converting the libgetar output into trajectories, like we do in pyqaxe. I think the constructive method, like pyqaxe is using, is more straightforward than the postprocessing way, like GetarLoader and perhaps what you're thinking of for glotzformats, but that is a separate issue that I'm happy to talk with you about offline if you like.

Adding an exists/contains function would certainly be doable with very little effort; most backends already have an std::map lying around that contains all the filenames present in the archive. I'm hesitant to break the API (as in method 2) for such a small-feeling detail, but we can definitely keep the idea around for the next major version, if method 1 isn't satisfactory!

@bdice bdice added minor bug Something isn't working labels Feb 9, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working minor
Projects
None yet
Development

No branches or pull requests

1 participant