Behavior of readBytes for nonexistent paths (returning None) is unexpected. #12

bdice · 2018-12-06T03:59:19Z

I am currently debugging a problem in glotzformats where the gtar file reader is misbehaving, when one frame has a type_shapes.json file but another frame does not.

I've traced it back to a behavior in libgetar that I find unusual. When readBytes is called on a nonexistent path, like traj.readBytes('potato'), it returns None. When getRecord is called on a nonexistent path, like traj.getRecord(gtar.Record('potato')), it returns an empty string.

My expectation was that both of these would raise a KeyError. It's not clear what use case the current behavior is enabling / preserving. The behavior of getRecord in particular is non-obvious, and caused JSON decoding errors when trying to decipher an empty string.

The text was updated successfully, but these errors were encountered:

bdice · 2018-12-06T04:00:31Z

(It is possible to work around this in glotzformats, but I'm reporting this since I don't understand why the library behaves this way.)

bdice · 2018-12-10T19:31:31Z

Original comment by Matthew Spellings (Bitbucket: mspells, GitHub: klarh).

The intent of getRecord is that it always returns an array, string, or bytestring of the appropriate type for the record you pass in. Since Record('potato') doesn't specify any storage format, it will be assumed to be some sort of string-type value, as described here, if you're interested. Records that aren't present yield arrays/strings of length 0, which is why your getRecord call returns the empty string. The intended usage of getRecord is to always give it a valid record and index after finding the record you want via getRecordTypes and the frames from queryFrames.

readBytes and friends are part of the "simple" API that we take liberties with for the sake of convenience; in particular, we aren't assuming that the record or frames are valid, so it is more likely that users will ask for something that doesn't exist in the archive. In this case, we return None.

It's not really clear to me that raising a KeyError is any better in either case. Do you happen to have a motivating example where the KeyError feels obviously superior?

bdice · 2018-12-10T22:53:03Z

In my understanding, the current API cannot differentiate between a record that was written with an empty string and one that does not exist. Raising a KeyError could distinguish these cases.

In particular, glotzformats needs an efficient way to see if several records exist for a known frame (as opposed to the usual case of determining frames for a known record). I could iterate over framesWithRecordsNamed to see if the frame of interest is in that list, but that's needlessly O(N). I can think of two ways this might work.

Don't raise KeyErrors, keep current behavior and add an exists method.

if traj.exists(record):
    return traj.getRecord(record)
else:
    return defaults / raise / whatever

Change current behavior to raise KeyError if the record doesn't exist.

try:
    return traj.getRecord(record)
except KeyError:
    return defaults / raise / whatever

The second one is EAFP, and would be my preference (as it is "more Pythonic", but that may lose some meaning when discussing a Python library that is meant to be a C++ interface).

bdice · 2018-12-11T13:40:42Z

Original comment by Matthew Spellings (Bitbucket: mspells, GitHub: klarh).

FWIW, I always intended things like glotzformats to take the most recent version of whatever quantity with respect to the timestep it is given for things that aren't dumped on every frame (like type_shapes.json). I was originally thinking that this type of behavior would go inside libgetar, but I've since come to think that it makes more sense for it to go in whatever is applying a schema and converting the libgetar output into trajectories, like we do in pyqaxe. I think the constructive method, like pyqaxe is using, is more straightforward than the postprocessing way, like GetarLoader and perhaps what you're thinking of for glotzformats, but that is a separate issue that I'm happy to talk with you about offline if you like.

Adding an exists/contains function would certainly be doable with very little effort; most backends already have an std::map lying around that contains all the filenames present in the archive. I'm hesitant to break the API (as in method 2) for such a small-feeling detail, but we can definitely keep the idea around for the next major version, if method 1 isn't satisfactory!

bdice added minor bug Something isn't working labels Feb 9, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Behavior of readBytes for nonexistent paths (returning None) is unexpected. #12

Behavior of readBytes for nonexistent paths (returning None) is unexpected. #12

bdice commented Dec 6, 2018

bdice commented Dec 6, 2018

bdice commented Dec 10, 2018

bdice commented Dec 10, 2018

bdice commented Dec 11, 2018

Behavior of readBytes for nonexistent paths (returning None) is unexpected. #12

Behavior of readBytes for nonexistent paths (returning None) is unexpected. #12

Comments

bdice commented Dec 6, 2018

bdice commented Dec 6, 2018

bdice commented Dec 10, 2018

bdice commented Dec 10, 2018

bdice commented Dec 11, 2018