feat: add provider for downloading datasets from scicat #18
Conversation
conda/meta.yaml (outdated)
@@ -14,6 +14,7 @@ requirements:
     - python>=3.10
     - scipp>=24.02.0
     - scippnexus>=24.03.0
+    - scitacean[sftp]
This is not allowed in conda. You need to add paramiko as a dependency manually.
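A minimal sketch of what that could look like in conda/meta.yaml, assuming paramiko is the only runtime dependency the sftp extra pulls in (as the comment above suggests):

```yaml
requirements:
  run:
    - python>=3.10
    - scipp>=24.02.0
    - scippnexus>=24.03.0
    - scitacean
    # conda does not understand pip extras such as scitacean[sftp],
    # so the extra's dependency is listed explicitly:
    - paramiko
```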
src/ess/reduce/scicat.py (outdated)
token: str,
version: Optional[str] = None,
Better parametrize over the client. This allows users to construct their own if they have special needs (e.g., special SSH auth) and we can use a fake in tests.
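A hedged sketch of what parametrizing over the client could look like; the names and signatures here are illustrative, not the final API, and the client is duck-typed so a fake can be substituted in tests:

```python
from pathlib import Path
from typing import Optional, Protocol


class ScicatClient(Protocol):
    """The minimal client interface the provider needs (illustrative)."""

    def get_dataset(self, dataset_id: str): ...

    def download_files(self, dset, *, target: Path, select: str): ...


def download_scicat_file(
    client: ScicatClient,
    dataset_id: str,
    filename: str,
    target: Optional[Path] = None,
) -> Path:
    # The caller constructs the client (a real scitacean.Client, one with
    # custom SSH auth, or a fake in tests) and passes it in.
    if target is None:
        target = Path(f'~/.cache/essreduce/{dataset_id}').expanduser()
    dset = client.get_dataset(dataset_id)
    client.download_files(dset, target=target, select=filename)
    return target / filename
```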
tests/scicat_test.py (outdated)
return setup_fake_client()


ess.reduce.scicat.Client = Client
See my comment about parametrizing the download function.
tests/scicat_test.py (outdated)
from ess.reduce.scicat import download_scicat_file


class Client:
Use https://scicatproject.github.io/scitacean/generated/modules/scitacean.testing.backend.fixtures.fake_client.html instead. (This is a pytest fixture)
How do you want to use this function? It's not a provider, and it's quite restrictive in how it downloads data. In a real workflow, we need the dataset (metadata) along with the file, and we may also need more than one file per dataset.
So I would split it: one provider that downloads a dataset and a separate one that downloads files.
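A rough sketch of that split, with hypothetical provider names and a duck-typed client (this is not the API of this PR, just an illustration of the two-provider idea):

```python
from pathlib import Path


def scicat_dataset(client, dataset_id: str):
    """Provider: fetch only the dataset (metadata) from SciCat."""
    return client.get_dataset(dataset_id)


def scicat_files(client, dset, filenames: list[str], target: Path) -> list[Path]:
    """Provider: download selected files of an already-fetched dataset."""
    client.download_files(dset, target=target, select=filenames)
    return [target / name for name in filenames]
```

Splitting this way lets a workflow request the metadata without touching the filesystem, and lets the file provider download several files of one dataset in a single step.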
That makes sense. I wasn't sure how to write it so that it is directly useful in the workflows as a provider without needing wrapping.
But splitting it into a dataset provider and a separate file download provider is a good start.
It definitely needs to be wrapped, just like the nexus loaders. This means that the basic providers here will be very simple, so I think the main task is figuring out what parameters should be requested.
pyproject.toml (outdated)
@@ -32,6 +32,7 @@ requires-python = ">=3.10"
 dependencies = [
     "scipp >= 24.02.0",
     "scippnexus >= 24.03.0",
+    "scitacean[sftp, test]",
If we avoid using extras then the new way of defining meta.yaml automatically (see copier_template) will work. Also, why do we depend on the test extra? That seems odd.
Yes, I'll remove the test extra. How can we avoid depending on the sftp extra? By manually adding the dependencies?
src/ess/reduce/scicat.py (outdated)
if target is None:
    target = Path(f'~/.cache/essreduce/{dataset_id}')
dset = client.get_dataset(dataset_id)
dset = client.download_files(dset, target=target, select=filename)
Strictly speaking this is a "side effect", which we recommend avoiding... but if I understand correctly, this is an exception because it is more like caching a file?
)


def download_scicat_file(
- Will this re-download a file if it has been fetched before?
- Is this provider thread-safe? What if Dask calls it multiple times?
pyproject.toml (outdated)
@@ -32,7 +32,7 @@ requires-python = ">=3.10"
 dependencies = [
     "scipp >= 24.02.0",
     "scippnexus >= 24.03.0",
-    "scitacean[sftp, test]",
+    "scitacean[sftp]",
Can we avoid using extras?
src/ess/reduce/scicat.py (outdated)
with _scicat_download_lock:
    dset = client.get_dataset(dataset_id)
    dset = client.download_files(dset, target=target, select=filename)
I suppose this will break when multiple processes / nodes are used that share the same filesystem? Can we entirely avoid downloading in the pipeline?
Yes, it will not work in that situation.
If the user has already downloaded the dataset then client.download_files will not download anything (right, @jl-wynen?), and then there won't be any race condition.
So the answer to the question is yes, we can avoid downloading in the pipeline. Is that good enough?
What kind of solution would you like to see, @SimonHeybrock?
I do not have any specific solution in mind. I just think that we should take a step back and see if we can take a different approach (such as running something before the pipeline) that can avoid the problem, instead of trying to fix the symptoms.
Downloading the data before running the pipeline is already an option. The user can do that using scitacean or calling the function here directly.
I'll leave this PR as is for now until we figure out what kind of solution we want here.
We need to be careful with how we pass datasets and files to pipelines; see the service blueprint (https://project.esss.dk/nextcloud/index.php/apps/files/?dir=%2F&openfile=18026378). This will require prefetching files and providing their paths, or a dataset that contains the local path, as parameters. Based on this, I don't see much utility in the simple functions implemented in this PR. @jokasimr, can you check how they fit into the blueprint? If they don't, I would suggest closing this PR.
In the blueprint, I think the parts that are relevant to this PR are the following entries:

(1) already exists in scitacean and I don't think it makes sense to wrap it further. This is a bare-bones implementation, sure, but I don't think it makes sense to build much more before we've tried integrating SciCat datasets/files in the workflows and discovered what more is needed. Maybe it's easier to talk about this in person so you can explain what kind of solution you had in mind for this issue.
Please check out scipp/esssans#136 and in particular the addition of helper functions in https://github.com/scipp/esssans/pull/136/files#diff-21119170cad13c69e1a8e14143c2527950f4aa5308d96851c51a3fc280faff1d such as …

I think my question really was: would such an approach fit into the user story / service blueprint?
Essentially, yes. This is what the user story looks like: 1. select and download the datasets + files; 2. provide them to the workflow.
Fixes #15