Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hardcoded extension ".root" in _get_gbasf2_dataset_query in b2luigi/batch/processes/gbasf2.py #180

Closed
schmitca opened this issue Jan 12, 2023 · 2 comments · Fixed by #181
Labels
bug Something isn't working gbasf2 Concerns the gbasf2/grid b2luigi wrapper good first issue Good for newcomers

Comments

@schmitca
Copy link
Contributor

schmitca commented Jan 12, 2023

When using b2luigi with gbasf2 jobs with udst outputs, these will have the ending ".udst.root". Then the download gbasf2_data_query will be faulty:

f"sub*/{output_file_stem}_*{output_file_ext}"= sub*/filename.udst_*.root

instead of "sub*/filename_*.udst.root". This will cause the download to fail.

@meliache meliache added gbasf2 Concerns the gbasf2/grid b2luigi wrapper bug Something isn't working help wanted Extra attention is needed labels Jan 12, 2023
@meliache
Copy link
Collaborator

I'm not 100% sure, can you clarify

gbasf2_data_query will be faulty:
f"sub*/{output_file_stem}_*{output_file_ext}"= sub*/filename.udst_*.root
instead of "sub*/filename_*.udst.root". This will cause the download to fail.

Which is correct, the first glob-expression or the second glob-expression?

I looked the b2luigi code for _get_gbasf2_dataset_query to see what's actually happening, the important part is:

   output_file_stem, output_file_ext = os.path.splitext(output_file_name)
   # [...]
   dataset_query_string = \
            f"{output_lpn_dir}/{self.gbasf2_project_name}/sub*/{output_file_stem}_*{output_file_ext}"

We didn't hardcode .root as you claim in the issue name. Instead, os.path.splitext just gives the top-level filename-extension , i.e. the part after the last dot. I tested it in IPython:

os.path.splitext("/path/to/filename.udst.root")
Out[11]: ('/path/to/filename.udst', '.root')

which results in the glob pattern "sub*/filename.udst*.root".
Does gbasf2 always insert the job number after the first dot in the filename instead? If so, I'm wondering what happens if you add arbitrary dots to your filename, e.g. a.b.c.root. Will it create outputs like a_*.b.c.root? If I know that this is generally the case, and they don't explicityly check for udst in the filename, then it will be easy to implement.

Btw, from reading the release notes I thought that gbasf2 puts udst and mdst files into separate subdirectories, which is why I had created issue #58 in the past, but it seems that this is not the case anymore. I never worked with udst/mdst files and currently am not even using gbasf2 actively anymore. Actually I don't have time to work on b2luigi due to my thesis but this seems like it should be easy to fix. Thanks for reporting 🙏 .

@meliache
Copy link
Collaborator

I created a hotfix in pull request #181, but I didn't test that at all and don't have time for that right now, so I would wait for your feedback on that and if it works merge. I'd also be happy if you code check my code if it makes sense.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working gbasf2 Concerns the gbasf2/grid b2luigi wrapper good first issue Good for newcomers
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants