Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal: Map object #282

Open
tjlane opened this issue Nov 16, 2024 · 3 comments
Open

Proposal: Map object #282

tjlane opened this issue Nov 16, 2024 · 3 comments
Assignees
Labels
wishlist New feature or request for the future

Comments

@tjlane
Copy link
Collaborator

tjlane commented Nov 16, 2024

In meteor, we found it very useful to implement a Map class. I actually tried to avoid doing this, but eventually found it necessary to keep the rest of the code clean/sane. Otherwise, you start to pass around a huge number of column labels, and the entire thing becomes a crazy bookkeeping exercise.

We have an implementation that should be considered a functional prototype, not a final version:
https://github.com/rs-station/meteor/blob/main/meteor/rsmap.py

The current implementation has a good interface, but the implementation is sub-optimal. It is basically a subclass of rs.DataSet that restricts the columns, only allowing amplitudes, phases, uncertainties, and H/K/L. This has a big downside in that it can affect inherited pandas.DataFrame or rs.DataSet methods in unexpected ways, making for a less than ideal user and developer experience.

That said, such a class could be generally useful to rs users, so it would be great to push a better version of this "upstream" into rs, and have meteor simply use that implementation.


what is a Map?

We can conceptualize it as a special case of an rs.DataSet, where you flag specific columns as:

  • amplitudes
  • phases
  • uncertainties (optional)

And you can only have ONE of each of these special columns. This opens up a lot of nice logic: converting to real space and back, comparing two map objects (think of a scaling function: scale(map1, map2)), etc. The simple big feature is that by declaring which columns are amplitudes/phases at the time of instantiation, you get around having to keep track of column label names.


what decisions do we face?

First, feedback would be appreciated as to if we want to implement a Map object in rs. I am a yes vote on this. But obviously we need some consensus before proceeding.

If we decide to do it, we need to decide on a general implementation strategy. Here are the possibilities as I see them -- if I didn't have the imagination to come up with a good option, please speak up!

  1. Subclass rs.DataSet and simply annotate what columns are amplitudes, phases, uncertainties. Place dtype requirements on these columns. The instance, though, can have other, unrelated columns (this differs from current implementation). This is probably the simplest model to implement.
  2. Build a new class from the ground up. This is more code, but provides for maximum control, and would allow us to be strict (only keep columns we want).
  3. Try and clean up the current model. Subclass rs.DataSet, be strict about what columns are allowed. Aggressively test inherited methods to ensure they work.

These appear in my order of preference. Each model has advantages and disadvantages, though. Feedback wanted.

@JBGreisman @kmdalton @alisiafadini

@tjlane tjlane added the wishlist New feature or request for the future label Nov 16, 2024
@tjlane tjlane self-assigned this Nov 16, 2024
@JBGreisman
Copy link
Member

I like the idea of having "limited scope, defined content" objects that can be used to simplify the maintenance of certain code. Big picture, I think something like Map could be very useful for providing Amplitudes/Phases/HKLs/crystal metadata (optionally uncertainties) in a way that downstream code can use.

My main design concern is with regard to subclassing DataSet. In general, that opens up a lot of corner cases that can lead to "reversions" from Map-->parent (DataSet), which will require a lot of overloading of methods to correct. Given that the purposes of this Map class are more restricted than the general-purpose DataSet, I think there are two options that could simplify our lives:

  1. Store underlying data as numpy arrays and write getters/setters/IO methods to support uses. IO methods can convert to rs.DataSet to simplify implementation:

This could look something like this, and we can also provide :

class Map:
    millers : n x 3 np array
    amplitudes : n np array
    phases : n np array
    uncertainties : None or n np array
    cell : gemmi.UnitCell
    spacegroup : gemmi.SpaceGroup
    ...
  1. Similar to above, but store rs.DataSet as an attribute. Use methods to access underlying content
class Map:
    _dataset : rs.DataSet
    ...
    def getters/setters
    ...

I think these sorts of implementations will be easier to maintain than starting with a general-purpose object and restricting its use. I'm not sure of all the intended uses (and I haven't yet dug into much of the meteor code), so I don't have a great sense of what's best

@kmdalton
Copy link
Member

Jack is correct in that it is much easier to maintain classes that have a rs.DataSet attribute than ones that subclass rs.DataSet. We should probably express this sentiment somewhere prominent on the documentation page.

Now the question becomes whether the overhead from subclassing is worth it in this case. I can't quite answer that without knowing more about the use cases for rs.Map. @tjlane , can you

  • share some illustrative examples where the Map class is used inside meteor?
  • give an estimate (in lines of code perhaps) of what it would take to implement a satisfactory "Map" class from scratch which has a DataSet attribute rather than as a subclass.

@tjlane
Copy link
Collaborator Author

tjlane commented Nov 18, 2024

Hi guys,

First, some examples of why this is useful. Just check out these clean call signatures, which don't need to refer to any column labels:

https://github.com/rs-station/meteor/blob/7a5ae5552a88ba17be6fa19529ad30212c34bafd/meteor/scale.py#L143-L148

https://github.com/rs-station/meteor/blob/7a5ae5552a88ba17be6fa19529ad30212c34bafd/meteor/tv.py#L107-L112

If you check out those functions, you can see the Map class in action. We use it everywhere.

I think @JBGreisman has convinced me to give a "non-inheriting" implementation a go, where I'd just store the underlying amplitudes, phases, and uncertainties as attributes, who's values are rs.DataSeries or maybe numpy.arrays. It shouldn't be too much work.

My plan is to experiment with that, perhaps with the code in meteor, and see if I hit an unsolvable roadblock. If not, I'll ask for a review of that code.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
wishlist New feature or request for the future
Projects
None yet
Development

No branches or pull requests

3 participants