Best way to map existing files to Model instances #236

Open
kgpayne opened this issue Oct 11, 2021 · 6 comments
Labels
question Further information is requested

Comments

@kgpayne

kgpayne commented Oct 11, 2021

Is there a way to map existing files with the same schema that do not match a repeatable pattern on disk to a datafiles Model instance manually? The use case is config files spread across arbitrary-depth subfolders below a top-level project directory. Using glob I can find the files I am interested in mapping, but I am not having much success creating mapped instances of those discovered files.

I have tried:

  • Overriding Model.Meta.datafile_pattern with each discovered file's path and calling Model.objects.get()
  • Creating instances of a model with default values and no pattern defined, then overriding both the instance.Meta.datafile_pattern and instance.datafile.path attributes with the correct path for the discovered file, before calling instance.datafile.load().

However, in both cases this results in odd behaviour: all instances with nested attributes point to the most recently loaded file's nested object rather than their own 🤦‍♂️

Is this a completely unsupported use-case, or is there another way to use datafiles to map files discovered outside of the supported 'pattern' construct to instances of a datafiles Model? Thank you!

@jacebrowning
Owner

There are ways to make this work, but they're not well documented.

Internally, datafiles replaces /*/ with /**/ in patterns for searching arbitrary depths with iglob():

pattern = str(path.resolve())
splatted = pattern.format(self=Splats()).replace(
    f'{os.sep}*{os.sep}', f'{os.sep}**{os.sep}'
)
log.info(f'Finding files matching pattern: {splatted}')
for index, filename in enumerate(iglob(splatted, recursive=True)):

So, if you include part of the path in the pattern as an attribute with a default value of *, then Model.objects.all() should find all matching config files and set that partial-path attribute on the loaded instances. Here's an example of me doing that in another library that uses datafiles:

https://github.com/jacebrowning/pomace/blob/6511f04e502c5980f1172504ee5dc35224524c79/pomace/models.py#L294-L301
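
The substitution in the snippet above can be tried out with the standard library alone. This is a minimal sketch rather than datafiles code: the temporary directory layout and file names are invented for illustration, and only the `/*/` to `/**/` replacement and the `iglob(..., recursive=True)` call come from the quoted excerpt.

```python
import os
import tempfile
from glob import iglob

# Hypothetical layout: YAML files at three different depths.
root = tempfile.mkdtemp()
for parts in (["a.yaml"], ["team_one", "b.yaml"], ["team_one", "deep", "c.yaml"]):
    path = os.path.join(root, *parts)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    open(path, "w").close()

# A pattern with a single wildcard directory, as it might be written by hand.
pattern = os.path.join(root, "*", "*.yaml")

# Same substitution as the quoted snippet: /*/ becomes /**/.
splatted = pattern.replace(f"{os.sep}*{os.sep}", f"{os.sep}**{os.sep}")

# Without the substitution only files exactly one directory down match.
single_level = sorted(os.path.relpath(p, root) for p in iglob(pattern))
# With it (and recursive=True) files at any depth match, including the root.
any_depth = sorted(os.path.relpath(p, root) for p in iglob(splatted, recursive=True))

print(single_level)
print(any_depth)
```

With `recursive=True`, `**` matches zero or more directories, which is why the file directly under the root is found as well.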


Let me know if that works for you! I think the feature needs to be made more explicit and documented.

@kgpayne
Author

kgpayne commented Oct 11, 2021

Thanks for getting back to me! I have a really basic implementation working (using a simple my_project/{self.name}.yaml pattern) but adding a /*/ to my pattern didn't work 🤔 Still, to complicate matters, there are multiple kinds of YAML config in the folder hierarchy. We are trying to allow our users to break up one large project.yaml file into a parent project.yaml and an arbitrary number of child configs, referenced as glob patterns under a key in the parent project.yaml. Here is a pared-down example:

# my_project/project.yaml
include_paths:
  - '**.yaml'  # list of discovered files will always exclude the statically-located project.yaml at the root of the project to avoid duplication
plugins:
  extractors:
    - name: project-tap-1
      variant: meltano

# my_project/team_one/subfile_1.yaml
plugins:
  extractors:
    - name: subfile-1-tap-1
      variant: custom

# my_project/subfile_2.yaml
plugins:
  extractors:
    - name: subfile-2-tap-1
      variant: custom

# all plain dataclasses
from dataclasses import dataclass, field
from typing import List

from .base import ConfigBase, ExtractorConfig, LoaderConfig, ScheduleConfig

@dataclass
class Plugins:
    extractors: List[ExtractorConfig] = field(default_factory=list)
    loaders: List[LoaderConfig] = field(default_factory=list)


@dataclass
class MeltanoFile:
    plugins: Plugins = Plugins()
    schedules: List[ScheduleConfig] = field(default_factory=list)
    include_paths: List[str] = field(default_factory=list)
    version: int = 1


@dataclass
class SubFile:
    plugins: Plugins = Plugins()
    schedules: List[ScheduleConfig] = field(default_factory=list)

I want to be able to take over responsibility for discovering the 'root' project.yaml file and then, using the glob patterns in include_paths, discovering any matching file paths and passing them to SubFile to be mapped. Does that make sense?

If this is possible, we can then build a Project class to index plugins (in this case extractors) and provide a CRUD interface to modify plugin config wherever the actual files are in the project hierarchy 😅

@kgpayne
Author

kgpayne commented Oct 11, 2021

The way I am thinking about this is conceptually similar to how SQLAlchemy's Classical Mapper works: the object and its persistence are defined separately and then explicitly mapped 🙂 Ideally the schema and converters would be attached to a File class, with instances representing individual files on the filesystem. Then, by mapping one of the file instances to a dataclass with matching attribute names/types, you get a mutable Python object whose changes are reflected on disk.

It looks like datafiles is doing this under the hood, but I can't figure out the Mapping step.
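
To make the idea concrete, here is a rough sketch of what such an explicit mapping step could look like. Everything in it is hypothetical: `MappedFile` is not a datafiles class, the simplified `SubFile` holds plain strings, and JSON is used instead of YAML only so the example needs nothing outside the standard library.

```python
import json
import tempfile
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import Generic, List, Type, TypeVar

T = TypeVar("T")


@dataclass
class SubFile:
    # Simplified stand-in for the schema: extractor names only.
    extractors: List[str] = field(default_factory=list)


class MappedFile(Generic[T]):
    """Hypothetical mapper that binds one path to one dataclass schema."""

    def __init__(self, schema: Type[T], path: Path) -> None:
        self.schema = schema
        self.path = path

    def load(self) -> T:
        # Build a fresh schema instance from the file's contents.
        return self.schema(**json.loads(self.path.read_text()))

    def save(self, obj: T) -> None:
        # Write the object's current state back to the same file.
        self.path.write_text(json.dumps(asdict(obj), indent=2))


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "team_one" / "subfile_1.json"
    path.parent.mkdir()
    path.write_text('{"extractors": ["subfile-1-tap-1"]}')

    mapped = MappedFile(SubFile, path)  # the explicit mapping step
    config = mapped.load()
    config.extractors.append("subfile-1-tap-2")
    mapped.save(config)

    reloaded = mapped.load().extractors
```

Unlike datafiles, this sketch does not sync attribute changes automatically; the point is only that the path and the schema are supplied separately and joined at one explicit step.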

@jacebrowning
Owner

> but adding a /*/ to my pattern didn't work

I'd be curious to see more sample code of what you tried and the result.

> and then, using the glob patterns in include_paths, discovering any matching file paths

For that, you could possibly use create_model directly:

from datafiles.model import create_model

parent_config = MeltanoFile(name='project')

for pathname in _iterate_globs(parent_config.include_paths):
    model = create_model(SubFile, pattern=pathname)
    child_config = model()  # 'pattern' should only match a single file
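
The `_iterate_globs` helper used above is not defined anywhere in this thread. One possible shape for it, using only the standard library, is sketched below; the function name, its parameters, and the `project.yaml` exclusion rule are assumptions based on the example config earlier in the thread.

```python
import os
import tempfile
from glob import iglob
from typing import Iterable, Iterator


def iterate_globs(
    root: str, include_paths: Iterable[str], parent_name: str = "project.yaml"
) -> Iterator[str]:
    """Yield each matching file under root once, skipping the root-level parent file."""
    parent = os.path.normpath(os.path.join(root, parent_name))
    seen = set()
    for pattern in include_paths:
        for match in sorted(iglob(os.path.join(root, pattern), recursive=True)):
            pathname = os.path.normpath(match)
            if pathname == parent or pathname in seen or not os.path.isfile(pathname):
                continue
            seen.add(pathname)
            yield pathname


# Hypothetical project layout mirroring the example config files.
with tempfile.TemporaryDirectory() as root:
    for parts in (["project.yaml"], ["subfile_2.yaml"], ["team_one", "subfile_1.yaml"]):
        path = os.path.join(root, *parts)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        open(path, "w").close()

    found = [os.path.relpath(p, root) for p in iterate_globs(root, ["**/*.yaml", "*.yaml"])]

print(found)
```

Note that the sketch uses `**/*.yaml` rather than the `**.yaml` from the example config: in Python's `glob` module, `**` is only recursive when it is a whole path component, so `**.yaml` would behave like `*.yaml`.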

@kgpayne
Author

kgpayne commented Oct 12, 2021

Glad we are thinking along the same lines. I tried create_model first, before playing with subclassing. Here is the full PoC codebase, with a notebook I have been working in. For the project file (meltano.yaml), which is only instantiated once, everything works as expected. However, the subfiles are garbled: e.g. the subfile_1.datafile.text of subfile_1 is a strange concatenation of the nested objects from both subfile_1 and subfile_3, even though the path is correct 🤔 This of course means the information written on .save() is incorrect. Hopefully this is just a bug with nested objects and this use case isn't as far out of scope as I imagined 🤞
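
One possible contributor, not confirmed anywhere in this thread, is the `plugins: Plugins = Plugins()` default in the dataclasses posted earlier. A default written that way is evaluated once, when the class is defined, so every instance that relies on it refers to the same `Plugins` object. The sketch below shows the usual `default_factory` alternative with a simplified schema (extractor names as plain strings); whether this change resolves the datafiles behaviour described here is an assumption that would need testing.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Plugins:
    extractors: List[str] = field(default_factory=list)


@dataclass
class SubFile:
    # default_factory builds a fresh Plugins for every SubFile. A bare
    # `Plugins()` default would be created once and shared by all instances
    # (Python 3.11+ rejects such an unhashable default with a ValueError).
    plugins: Plugins = field(default_factory=Plugins)


first = SubFile()
second = SubFile()
first.plugins.extractors.append("subfile-1-tap-1")

# The nested objects are independent, so the second instance is unaffected.
print(first.plugins is second.plugins)
print(second.plugins.extractors)
```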

@jacebrowning
Owner

Since create_model patches the class, I could see how calling it multiple times with the same class could create strange results -- the expectation is that pattern defines all possible instances' files.
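
The general hazard can be shown in plain Python, without datafiles. In this sketch `patch_in_place` and `patch_subclass` are hypothetical stand-ins, not the library's implementation: the first mutates the class it receives, the second attaches the pattern to a fresh subclass instead. Whether passing a fresh subclass to create_model avoids the garbled output is untested here.

```python
from dataclasses import dataclass


@dataclass
class SubFile:
    version: int = 1


def patch_in_place(cls, pattern):
    """Hypothetical helper that mutates the class it is given."""
    cls.pattern = pattern
    return cls


def patch_subclass(cls, pattern):
    """Hypothetical alternative: put the pattern on a new subclass of cls."""
    return type(cls.__name__, (cls,), {"pattern": pattern})


first = patch_in_place(SubFile, "team_one/subfile_1.yaml")
second = patch_in_place(SubFile, "subfile_2.yaml")
# Both names refer to the same class, so the first pattern was overwritten.
print(first is second, first.pattern)

third = patch_subclass(SubFile, "team_one/subfile_1.yaml")
fourth = patch_subclass(SubFile, "subfile_2.yaml")
# Each subclass keeps its own pattern.
print(third.pattern, fourth.pattern)
```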

> Hopefully this is just a bug with nested objects

To confirm that, perhaps you could try paring down SubFile to include only built-in types?

@jacebrowning jacebrowning added the question Further information is requested label Nov 20, 2021