This repository has been archived by the owner on Oct 19, 2024. It is now read-only.
Describe the bug
When running inference, AbstractWriterCallback loops over all datasets to construct the _dataset_size dict. This opens each slide from cache several times, which can take 1-3 seconds per slide. For a dataset of 1500 WSIs, this often takes 20 minutes.
To Reproduce
Run inference on-the-fly (#87) with your data_dir and glob_pattern set up to find many whole-slide images.
Expected behavior
You'll find that after printing the dataset statistics, it takes a long time to start setting up callback workers.
In my case
[2024-06-07 12:24:32,332][ahcore.data.dataset.DlupDataModule][INFO] - Dataset for stage predict has 773079 samples and the following statistics:
- Mean: 485.30
- Std: 145.56
- Min: 48.00
- Max: 1056.00
[2024-06-07 12:29:30,294][ahcore.callbacks.converters.common][INFO] - Starting worker for TiffConverterCallback
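To put a number on the delay in this log, the gap between the statistics line and the first worker line can be computed from their timestamps. This is a minimal sketch using only the standard library; the two timestamp strings are copied from the log above, and the format string is an assumption about how the logger lays out its timestamps.

```python
from datetime import datetime

# Assumed layout of the logger's timestamp: date, time, comma, milliseconds.
LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f"

# Timestamps copied from the two log lines above.
stats_logged = datetime.strptime("2024-06-07 12:24:32,332", LOG_TIME_FORMAT)
worker_started = datetime.strptime("2024-06-07 12:29:30,294", LOG_TIME_FORMAT)

# Time spent between printing the statistics and starting the first worker.
elapsed = worker_started - stats_logged
print(f"Gap before callback workers start: {elapsed}")
```

For this run the gap is just under five minutes (4 min 57.962 s), during which no tiles are being processed.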
Environment
dlup version: 0.3.38
How installed: unsure
Python version: 3.11.9
Operating System: linux
AbstractFileWriterCallback._dataset_sizes is only used internally, to track the size of each dataset and whether the current batch is the last one.
The keys of AbstractFileWriterCallback._dataset_sizes are currently set using the slide_identifier, which requires opening the slide, which is slow.
If it is not explicitly set, the slide_identifier of a SlideImage is generally the file path.
Later in AbstractFileWriterCallback, the file path is in fact what the code expects as the key, e.g. on line 302 we see if self._tile_counter[curr_filename] == self._dataset_sizes[curr_filename]:....
Hence, setting the keys to the identifier currently adds no value. If the identifier WERE explicitly set, this part of the code would break, since it looks up the filename.
Hence, it is best to simply use the _path from the dataset class to set the _dataset_sizes keys, which will be faster and will not lose any functionality.
If, in the future, we want to support identifier WITHIN this class, this can be considered a feature request that requires some more refactoring.
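The reasoning above can be illustrated with a small, self-contained sketch. It does not use the dlup or ahcore APIs: FakeSlideDataset, sizes_by_identifier, and sizes_by_path are hypothetical names made up for this example, and the open counter only stands in for the cost of opening a slide. The only names taken from the discussion are _path, slide_identifier, and the counter-versus-size comparison.

```python
from pathlib import Path


class FakeSlideDataset:
    """Hypothetical stand-in for a per-slide tile dataset (not the dlup API)."""

    open_count = 0  # how many times any slide was "opened"

    def __init__(self, path, num_tiles, identifier=None):
        self._path = Path(path)
        self._num_tiles = num_tiles
        self._identifier = identifier

    def __len__(self):
        return self._num_tiles

    @property
    def slide_identifier(self):
        # Stands in for the slow step: the slide has to be opened first.
        FakeSlideDataset.open_count += 1
        if self._identifier is not None:
            return self._identifier
        return str(self._path)  # default: the identifier is the file path


def sizes_by_identifier(datasets):
    # Current behaviour as described above: one slide open per dataset.
    return {ds.slide_identifier: len(ds) for ds in datasets}


def sizes_by_path(datasets):
    # Proposed behaviour: read the stored path, never open the slide.
    return {str(ds._path): len(ds) for ds in datasets}


datasets = [FakeSlideDataset(f"/data/slide_{i}.svs", 10 + i) for i in range(3)]

by_identifier = sizes_by_identifier(datasets)
opens_for_identifier = FakeSlideDataset.open_count

FakeSlideDataset.open_count = 0
by_path = sizes_by_path(datasets)
opens_for_path = FakeSlideDataset.open_count

# Last-tile check in the style of the comparison quoted above.
curr_filename = str(datasets[0]._path)
tile_counter = {name: 0 for name in by_path}
tile_counter[curr_filename] += len(datasets[0])
is_last_tile = tile_counter[curr_filename] == by_path[curr_filename]

# With an explicit identifier, identifier-keyed sizes no longer match filenames.
custom = FakeSlideDataset("/data/slide_x.svs", 5, identifier="case-42")
custom_sizes = sizes_by_identifier([custom])
filename_found = str(custom._path) in custom_sizes
```

When no identifier is set, both dictionaries are identical, but only the path-keyed version avoids touching the slides. With an explicit identifier, the identifier-keyed dictionary has no entry for the filename, which is the breakage described above.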
Quick solution: changing line 181 of ahcore/ahcore/callbacks/abstract_writer_callback.py (at commit 93274e5) will likely reduce the time by half.