
feat: add provider for downloading datasets from scicat #18

Closed
wants to merge 12 commits

Conversation

@jokasimr (Contributor) commented Apr 8, 2024

Fixes #15

@jokasimr requested a review from @jl-wynen on April 8, 2024, 08:41
conda/meta.yaml Outdated
@@ -14,6 +14,7 @@ requirements:
- python>=3.10
- scipp>=24.02.0
- scippnexus>=24.03.0
- scitacean[sftp]
Member

Extras like [sftp] are not supported in conda, so you need to add paramiko as a dependency manually.

Comment on lines 14 to 15
token: str,
version: Optional[str] = None,
Member

Better to parametrize over the client. This allows users to construct their own if they have special needs (e.g., special SSH auth), and we can use a fake in tests.
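
A rough sketch of what that could look like (the function body is essentially the one from this PR; the injected-client signature is the suggested change, and the cache path is illustrative):

from pathlib import Path

from scitacean import Client, Dataset


def download_scicat_file(
    client: Client,   # caller supplies a real Client, or a fake one in tests
    dataset_id: str,
    filename: str,
    target: Path | None = None,
) -> Dataset:
    if target is None:
        target = Path('~/.cache/essreduce').expanduser() / dataset_id
    dset = client.get_dataset(dataset_id)
    return client.download_files(dset, target=target, select=filename)

A fake client (like the one produced by setup_fake_client in the tests further down) could then be passed in directly, without monkeypatching ess.reduce.scicat.Client.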

return setup_fake_client()


ess.reduce.scicat.Client = Client
Member

See my comment about parametrizing the download function.

from ess.reduce.scicat import download_scicat_file


class Client:
Member

How do you want to use this function? It's not a provider, and it is quite restrictive in how it downloads data. In a real workflow, we need the dataset (metadata) along with the file, and we may also need more than one file per dataset.
So I would split it: one provider that downloads a dataset and a separate one that downloads files.
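
A minimal sketch of that split (the type names DatasetId and DownloadDir and the provider names are hypothetical, not part of this PR; only get_dataset and download_files come from the code under review):

from pathlib import Path
from typing import NewType

from scitacean import Client, Dataset

DatasetId = NewType('DatasetId', str)
DownloadDir = NewType('DownloadDir', Path)


def get_scicat_dataset(client: Client, dataset_id: DatasetId) -> Dataset:
    # Provider for the dataset metadata only.
    return client.get_dataset(dataset_id)


def download_dataset_files(client: Client, dset: Dataset, target: DownloadDir) -> Dataset:
    # Provider that downloads the dataset's files (possibly several per dataset)
    # and returns the dataset with local paths attached.
    return client.download_files(dset, target=target)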

Contributor Author

That makes sense. I wasn't sure how to write it so that it is directly useful as a provider in the workflows without needing to be wrapped.

But splitting it into a dataset provider and a separate file-download provider is a good start.

Member

It definitely needs to be wrapped, just like the nexus loaders. This means that the basic providers here will be very simple. So I think the main task here is figuring out what parameters should be requested.

pyproject.toml Outdated
@@ -32,6 +32,7 @@ requires-python = ">=3.10"
dependencies = [
"scipp >= 24.02.0",
"scippnexus >= 24.03.0",
"scitacean[sftp, test]",
Member

If we avoid using extras, then the new way of generating meta.yaml automatically (see copier_template) will work. Also, why do we depend on the test extra? That seems odd.

Contributor Author

Yes, I'll remove the test extra. How can we avoid depending on the sftp extra? By adding its dependencies manually?

if target is None:
    target = Path(f'~/.cache/essreduce/{dataset_id}')
dset = client.get_dataset(dataset_id)
dset = client.download_files(dset, target=target, select=filename)
Member

Strictly speaking this is a "side effect", which we recommend avoiding... but if I understand correctly, this is an exception because it is more like caching a file?

)


def download_scicat_file(
Member

  • Will this re-download a file if it has been fetched before?
  • Is this provider thread-safe? What if Dask calls it multiple times?

pyproject.toml Outdated
@@ -32,7 +32,7 @@ requires-python = ">=3.10"
dependencies = [
"scipp >= 24.02.0",
"scippnexus >= 24.03.0",
"scitacean[sftp, test]",
"scitacean[sftp]",
Member

Can we avoid using extras?

Comment on lines 47 to 49
with _scicat_download_lock:
    dset = client.get_dataset(dataset_id)
    dset = client.download_files(dset, target=target, select=filename)
@SimonHeybrock (Member) commented Apr 9, 2024

I suppose this will break when multiple processes/nodes that share the same filesystem are used? Can we avoid downloading in the pipeline entirely?

@jokasimr (Contributor Author) commented Apr 9, 2024

Yes, it will not work in that situation.
If the user has already downloaded the dataset, then client.download_files will not download anything (right, @jl-wynen?), and then there won't be any race condition.
So the answer to the question is yes, we can avoid downloading in the pipeline. Is that good enough?

What kind of solution would you like to see, @SimonHeybrock?

Member

I do not have any specific solution in mind. I just think that we should take a step back and see if we can take a different approach (such as running something before the pipeline) that can avoid the problem, instead of trying to fix the symptoms.

Contributor Author

Downloading the data before running the pipeline is already an option: the user can do that with scitacean or by calling the function here directly.

I'll leave this PR as is for now until we figure out what kind of solution we want here.
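
For reference, a minimal sketch of that pre-download step using scitacean directly (the URL, token, dataset PID, and SFTP host are placeholders; SFTPFileTransfer assumes the sftp extra, i.e. paramiko, is installed):

from pathlib import Path

from scitacean import Client
from scitacean.transfer.sftp import SFTPFileTransfer

client = Client.from_token(
    url='https://scicat.example.com/api/v3',                  # placeholder SciCat URL
    token='...',                                              # user's SciCat token
    file_transfer=SFTPFileTransfer(host='sftp.example.com'),  # placeholder host
)
dset = client.get_dataset('20.500.12269/example-pid')         # placeholder dataset PID
dset = client.download_files(dset, target=Path('~/.cache/essreduce').expanduser())
# dset.files now reference local paths that can be passed to the pipeline as parameters.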

@jl-wynen (Member)

We need to be careful with how we pass datasets and files to pipelines; see the service blueprint (https://project.esss.dk/nextcloud/index.php/apps/files/?dir=%2F&openfile=18026378). This will require prefetching files and providing their paths, or a dataset that contains the local paths, as parameters.

Based on this, I don't see much utility in the simple functions implemented in this PR. @jokasimr can you check how they fit into the blueprint? If they don't, I would suggest closing this PR.

@jokasimr (Contributor Author)

In the blueprint, I think the parts that are relevant to this PR are the following entries:

  1. Request a dataset by id.
  2. Extract the relevant related datasets.
  3. Request files that are not already local or are out of date.

(1) already exists in scitacean, and I don't think it makes sense to wrap it further.
(2) and (3) are implemented in this PR.

This is a bare-bones implementation, sure, but I don't think it makes sense to build much more before we've tried integrating scicat datasets/files in the workflows and discovered what more is needed.

Maybe it's easier to talk about this in person so you can explain what kind of solution you had in mind for this issue.

@SimonHeybrock (Member)

Please check out scipp/esssans#136 and in particular the addition of helper functions in https://github.com/scipp/esssans/pull/136/files#diff-21119170cad13c69e1a8e14143c2527950f4aa5308d96851c51a3fc280faff1d such as get_sans2d_tutorial_data_folder, which runs before the workflow and handles downloading files. Can we take a similar approach with Scicat data?

@jl-wynen (Member) commented May 2, 2024

get_sans2d_tutorial_data_folder is specialised to the tutorial. We need something more general to work with user data. But yes, we may be able to write a function that takes a dataset ID and filter to identify the data file and downloads the dataset, file, and all required associated files.
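
Such a function might look roughly like this (the name, defaults, and the caching behaviour noted in the comments are assumptions, not an existing API):

from pathlib import Path

from scitacean import Client, Dataset


def fetch_scicat_data(
    client: Client,
    dataset_id: str,
    select: str | bool = True,   # filename filter, passed through to scitacean
    target: Path = Path('~/.cache/essreduce'),
) -> Dataset:
    # Download a dataset's metadata and the selected files before running the workflow.
    dset = client.get_dataset(dataset_id)
    # Assumption based on the discussion above: files that are already present
    # locally are not downloaded again.
    return client.download_files(dset, target=target.expanduser() / dataset_id, select=select)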

@SimonHeybrock (Member)

> get_sans2d_tutorial_data_folder is specialised to the tutorial. We need something more general to work with user data. But yes, we may be able to write a function that takes a dataset ID and filter to identify the data file and downloads the dataset, file, and all required associated files.

I think my question really was: Would such an approach fit into the user story / service blueprint?

@jl-wynen (Member) commented May 2, 2024

> I think my question really was: Would such an approach fit into the user story / service blueprint?

Essentially, yes, that is what the user story looks like: 1. select and download the datasets + files; 2. provide them to the workflow.
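
A minimal end-to-end sketch of those two steps, assuming a sciline pipeline; the workflow types (Filename, FileSize), the trivial provider, and all URLs/IDs below are placeholders, not part of this PR:

from pathlib import Path
from typing import NewType

import sciline
from scitacean import Client
from scitacean.transfer.sftp import SFTPFileTransfer

Filename = NewType('Filename', Path)   # hypothetical workflow parameter
FileSize = NewType('FileSize', int)    # hypothetical workflow result


def file_size(filename: Filename) -> FileSize:
    # Trivial stand-in provider; a real workflow would load and reduce the data.
    return FileSize(Path(filename).stat().st_size)


# 1. Select and download the dataset + files before the workflow runs.
client = Client.from_token(
    url='https://scicat.example.com/api/v3',                  # placeholder URL
    token='...',
    file_transfer=SFTPFileTransfer(host='sftp.example.com'),  # placeholder host
)
dset = client.download_files(
    client.get_dataset('20.500.12269/example-pid'),           # placeholder PID
    target=Path('~/.cache/essreduce').expanduser(),
)

# 2. Provide the local file path to the workflow as a parameter.
pipeline = sciline.Pipeline([file_size])
pipeline[Filename] = Filename(dset.files[0].local_path)
result = pipeline.compute(FileSize)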

Successfully merging this pull request may close these issues: Mixing local files and files from SciCat.