
feat: add provider for downloading datasets from scicat #18

Closed
wants to merge 12 commits

Conversation

@jokasimr (Contributor) commented Apr 8, 2024

Fixes #15

@jokasimr requested a review from @jl-wynen on April 8, 2024, 08:41
conda/meta.yaml Outdated
@@ -14,6 +14,7 @@ requirements:
- python>=3.10
- scipp>=24.02.0
- scippnexus>=24.03.0
- scitacean[sftp]
Member

Extras like [sftp] are not supported in conda, so you need to add paramiko as a dependency manually.

Comment on lines 14 to 15
token: str,
version: Optional[str] = None,
Member

Better to parametrize over the client. This allows users to construct their own if they have special needs (e.g., special SSH auth), and we can use a fake in tests.
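
A rough sketch of what that could look like (the function body is essentially the one from this PR; the injected-client signature is the suggested change, and the cache path is illustrative):

from pathlib import Path

from scitacean import Client, Dataset


def download_scicat_file(
    client: Client,   # caller supplies a real Client, or a fake one in tests
    dataset_id: str,
    filename: str,
    target: Path | None = None,
) -> Dataset:
    if target is None:
        target = Path('~/.cache/essreduce').expanduser() / dataset_id
    dset = client.get_dataset(dataset_id)
    return client.download_files(dset, target=target, select=filename)

A fake client (like the one produced by setup_fake_client in the tests further down) could then be passed in directly, without monkeypatching ess.reduce.scicat.Client.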

return setup_fake_client()


ess.reduce.scicat.Client = Client
Member

See my comment about parametrizing the download function.

from ess.reduce.scicat import download_scicat_file


class Client:
Member

How do you want to use this function? It's not a provider, and it is quite restrictive in how it downloads data. In a real workflow, we need the dataset (metadata) along with the file, and we may also need more than one file per dataset.
So I would split it: one provider that downloads a dataset and a separate one that downloads files.
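
A minimal sketch of that split (the type names DatasetId and DownloadDir and the provider names are hypothetical, not part of this PR; only get_dataset and download_files come from the code under review):

from pathlib import Path
from typing import NewType

from scitacean import Client, Dataset

DatasetId = NewType('DatasetId', str)
DownloadDir = NewType('DownloadDir', Path)


def get_scicat_dataset(client: Client, dataset_id: DatasetId) -> Dataset:
    # Provider for the dataset metadata only.
    return client.get_dataset(dataset_id)


def download_dataset_files(client: Client, dset: Dataset, target: DownloadDir) -> Dataset:
    # Provider that downloads the dataset's files (possibly several per dataset)
    # and returns the dataset with local paths attached.
    return client.download_files(dset, target=target)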

Contributor Author

That makes sense. I wasn't sure how to write it so that it is directly useful as a provider in the workflows without needing to be wrapped.

But splitting it into a dataset provider and a separate file-download provider is a good start.

Member

It definitely needs to be wrapped, just like the nexus loaders. This means that the basic providers here will be very simple. So I think the main task here is figuring out what parameters should be requested.

pyproject.toml Outdated
@@ -32,6 +32,7 @@ requires-python = ">=3.10"
dependencies = [
"scipp >= 24.02.0",
"scippnexus >= 24.03.0",
"scitacean[sftp, test]",
Member

If we avoid using extras, then the new way of generating meta.yaml automatically (see copier_template) will work. Also, why do we depend on the test extra? That seems odd.

Contributor Author

Yes, I'll remove the test extra. How can we avoid depending on the sftp extra? By adding its dependencies manually?

if target is None:
    target = Path(f'~/.cache/essreduce/{dataset_id}')
dset = client.get_dataset(dataset_id)
dset = client.download_files(dset, target=target, select=filename)
Member

Strictly speaking this is a "side effect", which we recommend avoiding... but if I understand correctly, this is an exception because it is more like caching a file?

)


def download_scicat_file(
Member

  • Will this re-download a file if it has been fetched before?
  • Is this provider thread-safe? What if Dask calls it multiple times?

pyproject.toml Outdated
@@ -32,7 +32,7 @@ requires-python = ">=3.10"
dependencies = [
"scipp >= 24.02.0",
"scippnexus >= 24.03.0",
"scitacean[sftp, test]",
"scitacean[sftp]",
Member

Can we avoid using extras?

Comment on lines 47 to 49
with _scicat_download_lock:
    dset = client.get_dataset(dataset_id)
    dset = client.download_files(dset, target=target, select=filename)
@SimonHeybrock (Member) commented Apr 9, 2024

I suppose this will break when multiple processes/nodes that share the same filesystem are used? Can we avoid downloading in the pipeline entirely?

@jokasimr (Contributor Author) commented Apr 9, 2024

Yes, it will not work in that situation.
If the user has already downloaded the dataset, then client.download_files will not download anything (right, @jl-wynen?), and then there won't be any race condition.
So the answer to the question is yes, we can avoid downloading in the pipeline. Is that good enough?

What kind of solution would you like to see, @SimonHeybrock?

Member

I do not have any specific solution in mind. I just think that we should take a step back and see if we can take a different approach (such as running something before the pipeline) that can avoid the problem, instead of trying to fix the symptoms.

Contributor Author

Downloading the data before running the pipeline is already an option: the user can do that with scitacean or by calling the function here directly.

I'll leave this PR as is for now until we figure out what kind of solution we want here.
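
For reference, a minimal sketch of that pre-download step using scitacean directly (the URL, token, dataset PID, and SFTP host are placeholders; SFTPFileTransfer assumes the sftp extra, i.e. paramiko, is installed):

from pathlib import Path

from scitacean import Client
from scitacean.transfer.sftp import SFTPFileTransfer

client = Client.from_token(
    url='https://scicat.example.com/api/v3',                  # placeholder SciCat URL
    token='...',                                              # user's SciCat token
    file_transfer=SFTPFileTransfer(host='sftp.example.com'),  # placeholder host
)
dset = client.get_dataset('20.500.12269/example-pid')         # placeholder dataset PID
dset = client.download_files(dset, target=Path('~/.cache/essreduce').expanduser())
# dset.files now reference local paths that can be passed to the pipeline as parameters.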

@jl-wynen (Member)

We need to be careful with how we pass datasets and files to pipelines; see the service blueprint (https://project.esss.dk/nextcloud/index.php/apps/files/?dir=%2F&openfile=18026378). This will require prefetching files and providing their paths, or a dataset that contains the local paths, as parameters.

Based on this, I don't see much utility in the simple functions implemented in this PR. @jokasimr can you check how they fit into the blueprint? If they don't, I would suggest closing this PR.

@jokasimr (Contributor Author)

In the blueprint, I think the parts that are relevant to this PR are the following entries:

  1. Request a dataset by id.
  2. Extract the relevant related datasets.
  3. Request files that are not already local or are out of date.

(1) already exists in scitacean, and I don't think it makes sense to wrap it further.
(2) and (3) are implemented in this PR.

This is a bare-bones implementation, sure, but I don't think it makes sense to build much more before we've tried integrating scicat datasets/files in the workflows and discovered what more is needed.

Maybe it's easier to talk about this in person so you can explain what kind of solution you had in mind for this issue.

@SimonHeybrock (Member)

Please check out scipp/esssans#136 and in particular the addition of helper functions in https://github.com/scipp/esssans/pull/136/files#diff-21119170cad13c69e1a8e14143c2527950f4aa5308d96851c51a3fc280faff1d such as get_sans2d_tutorial_data_folder, which runs before the workflow and handles downloading files. Can we take a similar approach with Scicat data?

@jl-wynen (Member) commented May 2, 2024

get_sans2d_tutorial_data_folder is specialised to the tutorial. We need something more general to work with user data. But yes, we may be able to write a function that takes a dataset ID and filter to identify the data file and downloads the dataset, file, and all required associated files.
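
Such a function might look roughly like this (the name, defaults, and the caching behaviour noted in the comments are assumptions, not an existing API):

from pathlib import Path

from scitacean import Client, Dataset


def fetch_scicat_data(
    client: Client,
    dataset_id: str,
    select: str | bool = True,   # filename filter, passed through to scitacean
    target: Path = Path('~/.cache/essreduce'),
) -> Dataset:
    # Download a dataset's metadata and the selected files before running the workflow.
    dset = client.get_dataset(dataset_id)
    # Assumption based on the discussion above: files that are already present
    # locally are not downloaded again.
    return client.download_files(dset, target=target.expanduser() / dataset_id, select=select)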

@SimonHeybrock (Member)

> get_sans2d_tutorial_data_folder is specialised to the tutorial. We need something more general to work with user data. But yes, we may be able to write a function that takes a dataset ID and filter to identify the data file and downloads the dataset, file, and all required associated files.

I think my question really was: Would such an approach fit into the user story / service blueprint?

@jl-wynen (Member) commented May 2, 2024

> I think my question really was: Would such an approach fit into the user story / service blueprint?

Essentially, yes, that is what the user story looks like: 1. select and download the datasets + files; 2. provide them to the workflow.
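
A minimal end-to-end sketch of those two steps, assuming a sciline pipeline; the workflow types (Filename, FileSize), the trivial provider, and all URLs/IDs below are placeholders, not part of this PR:

from pathlib import Path
from typing import NewType

import sciline
from scitacean import Client
from scitacean.transfer.sftp import SFTPFileTransfer

Filename = NewType('Filename', Path)   # hypothetical workflow parameter
FileSize = NewType('FileSize', int)    # hypothetical workflow result


def file_size(filename: Filename) -> FileSize:
    # Trivial stand-in provider; a real workflow would load and reduce the data.
    return FileSize(Path(filename).stat().st_size)


# 1. Select and download the dataset + files before the workflow runs.
client = Client.from_token(
    url='https://scicat.example.com/api/v3',                  # placeholder URL
    token='...',
    file_transfer=SFTPFileTransfer(host='sftp.example.com'),  # placeholder host
)
dset = client.download_files(
    client.get_dataset('20.500.12269/example-pid'),           # placeholder PID
    target=Path('~/.cache/essreduce').expanduser(),
)

# 2. Provide the local file path to the workflow as a parameter.
pipeline = sciline.Pipeline([file_size])
pipeline[Filename] = Filename(dset.files[0].local_path)
result = pipeline.compute(FileSize)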

Successfully merging this pull request may close these issues: Mixing local files and files from SciCat.