Dataset Series support #298

Open
amercader opened this issue Sep 2, 2024 · 7 comments

@amercader
Member

amercader commented Sep 2, 2024

DCAT 3 introduced a new class for Dataset Series, essentially defined as a collection of datasets that have a common characteristic (both DCAT-AP and DCAT-US provide implementation guidance and examples).

Note that the definition of Dataset Series is very loose and not necessarily restricted to time series. Some potential examples:

  • Budget data released monthly or yearly
  • Data split by country / region
  • Large data split into smaller chunks
  • Geospatial data distributed in grids

The concept of "collections" of datasets is not new in CKAN and there have been previous implementations:

These are all conceptually similar: a higher-level entity that datasets can belong to. In DCAT terms this is expressed using the dcat:DatasetSeries class and the dcat:inSeries property in dcat:Datasets.

ex:budget a dcat:DatasetSeries ;
  dcterms:title "Budget data"@en ;
  .
  
ex:budget-2024 a dcat:Dataset ;
  dcterms:title "Budget data for year 2024"@en ;
  dcat:inSeries ex:budget ;
  .

dcat:DatasetSeries shares a subset of dcat:Dataset properties (e.g. in DCAT-AP v3 or DCAT-US v3).

When there is a linear relation between the datasets in the series, links and navigation can be implemented using dcat:first, dcat:prev, and dcat:last (and the inverse dcat:next), e.g.:

ex:budget a dcat:DatasetSeries ;
  dcterms:title "Budget data"@en ;
  dcat:first ex:budget-2018 ;
  dcat:last ex:budget-2020 ;
  .
  
ex:budget-2018 a dcat:Dataset ;
  dcterms:title "Budget data for year 2018"@en ;
  dcat:inSeries ex:budget ;
  dcat:next ex:budget-2019 ;
  .
  
ex:budget-2019 a dcat:Dataset ;
  dcterms:title "Budget data for year 2019"@en ;
  dcat:inSeries ex:budget ;
  dcat:prev ex:budget-2018 ;
  dcat:next ex:budget-2020 ;
  .
  
ex:budget-2020 a dcat:Dataset ;
  dcterms:title "Budget data for year 2020"@en ;
  dcat:inSeries ex:budget ;
  dcat:prev ex:budget-2019 ;
  .
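To make the linking concrete, here is a minimal sketch (plain Python, not CKAN or ckanext-dcat code) of how the first/last and prev/next values in the example above could be derived from an ordered list of member ids. The function names `series_navigation` and `series_bounds` are made up for illustration.

```python
from typing import Dict, List, Optional


def series_navigation(ordered_ids: List[str]) -> Dict[str, Dict[str, Optional[str]]]:
    """Map each member id to its previous and next neighbour in the series."""
    nav = {}
    last_index = len(ordered_ids) - 1
    for i, dataset_id in enumerate(ordered_ids):
        nav[dataset_id] = {
            # The first member has no dcat:prev, the last one has no dcat:next
            "prev": ordered_ids[i - 1] if i > 0 else None,
            "next": ordered_ids[i + 1] if i < last_index else None,
        }
    return nav


def series_bounds(ordered_ids: List[str]) -> Dict[str, Optional[str]]:
    """Return the values that dcat:first and dcat:last would point to."""
    if not ordered_ids:
        return {"first": None, "last": None}
    return {"first": ordered_ids[0], "last": ordered_ids[-1]}


members = ["budget-2018", "budget-2019", "budget-2020"]
nav = series_navigation(members)
bounds = series_bounds(members)
```

Everything here depends on already knowing the order of the members, which is the hard part discussed further down.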

Potential implementations

Setting aside the navigation part of it, I think that implementing series using a custom dataset type (dataset_series) rather than Groups has more benefits. For starters, a Dataset Series is a subclass of Dataset and shares several of its properties. Secondly, we can index them for free so they can be returned as results of the standard dataset search, or excluded depending on the instance needs.

A custom in_series field can be added to the dataset schema to store the id of the dataset_series dataset it belongs to. This field will also allow member datasets to be hidden from the default search if that is a requirement.
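As a rough illustration of hiding member datasets from the default search, the sketch below adds a Solr filter clause to a dict of search parameters. It assumes the field is indexed in Solr under the name `in_series`; the helper name `exclude_series_members` is hypothetical, and where it would be called from (for instance a plugin hook that receives the search parameters) is left open.

```python
def exclude_series_members(search_params: dict, field: str = "in_series") -> dict:
    """Return a copy of search_params that filters out datasets with a value in `field`."""
    params = dict(search_params)
    # In Solr query syntax, a negated open-ended range matches documents
    # where the field has no value at all.
    clause = f"-{field}:[* TO *]"
    existing = params.get("fq", "").strip()
    params["fq"] = f"{existing} {clause}".strip()
    return params


default_search = exclude_series_members({"q": "budget", "fq": "dataset_type:dataset"})
```

Sites that do want members in the default results would simply not apply the filter.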

Navigation (prev/next) within a series should be optional.

UI changes needed:

  • Some indicator in a Dataset Series search result
  • Some indicator in a Dataset Series dataset page
  • A "Datasets" tab in a Dataset Series dataset page that lists the member datasets and allows adding / removing them (and reordering them, see below)
  • Indicator that a dataset belongs to a series in search result page
  • Indicator that a dataset belongs to a series in the dataset page
  • (Optional) Next/Prev links on the Dataset page

The linking / navigation part is what presents more challenges. We need an efficient way to:

  1. Present which datasets are previous and next at the dataset level
  2. Present which datasets are first and last at the dataset series level

So essentially we need to store the order of the datasets within the series. Both previous/next and first/last could be computed at index time.

We have similar cases in CKAN core with the resource and resource view ordering. Resource order is stored in the database (resources have a position field), but it's not comparable because all resources are updated with a single package_update call. Resource views do have a dedicated order field and the resource_view_reorder action updates all DB records.

We could follow a similar approach to resource views in series. Although it would be nice to not have to rely on a new table (dataset_series with series_id, dataset_id, order columns), I don't think we can update the order efficiently for big series by updating custom fields in the dataset.
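A minimal sketch of what such a table and a reorder operation could look like, using an in-memory SQLite database purely for illustration (CKAN itself uses PostgreSQL and SQLAlchemy). The table and column names come from the paragraph above; the function names are hypothetical.

```python
import sqlite3


def create_table(conn: sqlite3.Connection) -> None:
    # "order" is an SQL keyword, so the column name has to be quoted
    conn.execute(
        """
        CREATE TABLE dataset_series (
            series_id  TEXT NOT NULL,
            dataset_id TEXT NOT NULL,
            "order"    INTEGER NOT NULL,
            PRIMARY KEY (series_id, dataset_id)
        )
        """
    )


def reorder_series(conn: sqlite3.Connection, series_id: str, ordered_ids: list) -> None:
    """Rewrite all rows of one series in a single transaction."""
    with conn:  # commits on success, rolls back on error
        conn.execute("DELETE FROM dataset_series WHERE series_id = ?", (series_id,))
        conn.executemany(
            'INSERT INTO dataset_series (series_id, dataset_id, "order") VALUES (?, ?, ?)',
            [(series_id, dataset_id, pos) for pos, dataset_id in enumerate(ordered_ids)],
        )


def series_members(conn: sqlite3.Connection, series_id: str) -> list:
    """Return the dataset ids of a series in their stored order."""
    rows = conn.execute(
        'SELECT dataset_id FROM dataset_series WHERE series_id = ? ORDER BY "order"',
        (series_id,),
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
create_table(conn)
reorder_series(conn, "budget", ["budget-2018", "budget-2019", "budget-2020"])
reorder_series(conn, "budget", ["budget-2020", "budget-2018", "budget-2019"])
```

Like a reorder that updates all records, this touches every row of the series, but it never touches the datasets themselves, so no per-dataset update is needed.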

We would need to test performance for very large series.

@wardi
Contributor

wardi commented Sep 2, 2024

Can a dataset be part of multiple series, e.g. separated by release and separated by region?

For ordering, the metadata should already include a field that can be used to order the results so we don't need to track order separately e.g. a release date field or a region field. Adjusting the order within a series becomes updating the metadata for a dataset.

We could allow navigation through dataset series from the display of the metadata field the series is based on, e.g. a date picker for release (only the dates with releases selectable) or a choice list for region.

How about we identify fields that define a series as part of the schema definition, something like:

data_series_fields:
  # blank for non-data-series datasets, set to the same value for all datasets in the same series
  # required
  identifier_field: my_series_group
  # enable chronological series e.g. a date or year+month field:
  temporal_field: my_release_date
  # enable geographic region series e.g. a choice list of locations:
  region_field: my_region

  # enable series by chunk of data for large datasets e.g. an integer field:
  # partial_field: my_part_number
  # enable series by geographic grid identifier:
  # grid_field: my_grid

Facets in search will give us all the neighboring series datasets, and we don't need new tables or extensive changes to the UI.
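To illustrate the idea, here is a small sketch in plain Python (no CKAN or Solr involved) that groups dataset dicts by the configured identifier field and orders each group by the temporal field. The field names are the ones from the schema snippet above; the sample datasets and the `group_series` helper are made up.

```python
from collections import defaultdict

# Mirrors the proposed schema keys; the values are the metadata field names
SERIES_FIELDS = {
    "identifier_field": "my_series_group",
    "temporal_field": "my_release_date",
}


def group_series(datasets, fields=SERIES_FIELDS):
    """Group datasets by series identifier and sort each series chronologically."""
    id_field = fields["identifier_field"]
    order_field = fields["temporal_field"]
    series = defaultdict(list)
    for dataset in datasets:
        series_id = dataset.get(id_field)
        if series_id:  # blank means the dataset is not part of a series
            series[series_id].append(dataset)
    return {
        series_id: sorted(members, key=lambda d: d[order_field])
        for series_id, members in series.items()
    }


datasets = [
    {"name": "budget-2020", "my_series_group": "budget", "my_release_date": "2020-01-15"},
    {"name": "standalone", "my_series_group": "", "my_release_date": "2019-06-01"},
    {"name": "budget-2018", "my_series_group": "budget", "my_release_date": "2018-01-15"},
    {"name": "budget-2019", "my_series_group": "budget", "my_release_date": "2019-01-15"},
]
series = group_series(datasets)
```

ISO 8601 date strings sort correctly as plain text, which is why no date parsing is needed in this sketch.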

@amercader
Member Author

Thanks for the feedback @wardi

> Can a dataset be part of multiple series, e.g. separated by release and separated by region?

Technically yes. DCAT-AP discourages it though, to avoid complexity. Also, in theory there is even support for nested series, but again this would add even more complexity.

> For ordering, the metadata should already include a field that can be used to order the results so we don't need to track order separately

I get how this simplifies the implementation, but wouldn't that mean that we need to support each of these cases separately, as they all have slightly different implementations (e.g. sorting by date vs alphabetically by code vs chunk of a larger file)? Perhaps we can abstract the cases a bit and have one for time-based sort, one for numeric sort and one for text sort, and maybe one that allows defining a callable for more complex sorts.
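Something along these lines is what I have in mind for the generic sort types. This is only a sketch: `SORT_KEYS` and `sort_members` are hypothetical names, and values are assumed to arrive as strings.

```python
from datetime import date

# One key function per generic sort type
SORT_KEYS = {
    "time": date.fromisoformat,  # expects ISO dates such as "2024-01-31"
    "numeric": float,
    "text": str.casefold,        # case-insensitive text ordering
}


def sort_members(members, field, sort_type="text", key=None):
    """Sort member dicts by `field` using a generic type or a custom callable."""
    convert = key if key is not None else SORT_KEYS[sort_type]
    return sorted(members, key=lambda member: convert(member[field]))


chunks = [{"part": "10"}, {"part": "2"}]
by_number = sort_members(chunks, "part", sort_type="numeric")
by_text = sort_members(chunks, "part", sort_type="text")

# A custom callable for a more complex case, e.g. grid codes like "A10"
grids = [{"grid": "B1"}, {"grid": "A10"}, {"grid": "A2"}]
by_grid = sort_members(grids, "grid", key=lambda code: (code[0], int(code[1:])))
```

The chunk example shows why the type matters: sorted as text, "10" comes before "2".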

If we are not tracking order separately, would you call these sorting algorithms at index time to calculate the first, last, prev and next fields? I guess this would mean re-indexing the whole series when a member is updated. On the other hand, if we compute them at view time (i.e. dataset page or when generating a DCAT representation) that might affect performance.

@wardi
Contributor

wardi commented Sep 10, 2024

I like it. Having time sort, numeric sort, text sort as generic types would be more flexible.

We could have an index on the order fields in the search back end. That way the search can give us all the datasets in a series in order without needing to re-index them.

@wardi
Contributor

wardi commented Sep 10, 2024

Regarding re-indexing a whole series, were you thinking of the case where we've assigned an integer numeric index and need to insert a dataset in the series and bump all the later numbers?

Maybe we could store a float value instead, insert at the half-way value and somehow display integer indexes from the indexed search result.
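A quick sketch of the float idea, in plain Python with made-up helper names (`position_between`, `display_indexes`): new members get a position half-way between their neighbours, and integer indexes are derived from the sorted result only for display.

```python
def position_between(before=None, after=None, step=1.0):
    """Pick a float position between two neighbours (None means no neighbour)."""
    if before is None and after is None:
        return 0.0
    if before is None:
        return after - step   # insert at the start of the series
    if after is None:
        return before + step  # append at the end of the series
    return (before + after) / 2


def display_indexes(positions):
    """Map each dataset id to a 1-based integer index derived from its position."""
    ranked = sorted(positions, key=positions.get)
    return {dataset_id: index for index, dataset_id in enumerate(ranked, start=1)}


positions = {"budget-2018": 0.0, "budget-2020": 1.0}
# Insert 2019 between the two existing members without touching them
positions["budget-2019"] = position_between(positions["budget-2018"], positions["budget-2020"])
indexes = display_indexes(positions)
```

One caveat: repeatedly inserting into the same gap halves the interval each time, so float precision eventually runs out and an occasional renumbering would still be needed.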

@alexandradanyi

Hi all, not sure if this is the right place to ask but would it be possible to get a rough timeline for when this can be implemented? Thanks in advance!

@amercader
Member Author

Quick status update. After discussing the implementation in more detail with @wardi and @smotornyuk (thanks for the great feedback!), I now have a POC for this that I think is a great way forward for a first implementation.

This will be implemented with a custom dataset type for Dataset Series plus computed fields for series navigation based on Solr queries (i.e. no persistent navigation items stored). This allows a really straightforward implementation and avoids out-of-sync problems. As this is a valuable feature for all sites and to avoid adding complexity to this extension, the bulk of the logic will live in a new, separate ckanext-dataset-series extension:

  • ckanext-dataset-series:
    • Example schema for dataset-series
    • Extend package_show to include navigation links
    • Custom indexing to store in_series field as a list
    • UI widgets, e.g.:
      • scheming form widget to select Dataset Series
      • Navigation widget to show next/prev links in dataset pages
      • Custom page for Dataset Series with member dataset list
      • ...
  • ckanext-dcat
    • Adapt profiles to serialize Dataset Series following the DCAT spec
    • Support parsing and creating Dataset Series (?)
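To give an idea of what computing navigation from Solr queries could look like, here is a rough sketch (not the actual POC code) that builds search parameters for the previous and next members of a series. It assumes an indexed `in_series` field plus some sortable order field; `release_year` and the `neighbour_queries` helper are hypothetical, and the exact query syntax would need checking against the real Solr schema.

```python
def neighbour_queries(series_id, order_field, current_value, series_field="in_series"):
    """Build fq/sort/rows parameters for the previous and next series members."""
    member_filter = f'{series_field}:"{series_id}"'
    return {
        # Closest member strictly below the current value: exclusive range, descending sort
        "prev": {
            "fq": f'{member_filter} AND {order_field}:{{* TO "{current_value}"}}',
            "sort": f"{order_field} desc",
            "rows": 1,
        },
        # Closest member strictly above the current value: exclusive range, ascending sort
        "next": {
            "fq": f'{member_filter} AND {order_field}:{{"{current_value}" TO *}}',
            "sort": f"{order_field} asc",
            "rows": 1,
        },
    }


queries = neighbour_queries("budget", "release_year", "2019")
```

Each dict is shaped like the fq, sort and rows parameters that package_search accepts. Because nothing is stored, adding or removing a member needs no updates to its neighbours.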

I'll push a first draft of ckanext-dataset-series early next week so we can create specific issues there. Once that is in place and the overall feature is documented, it will be much easier to contribute the different missing bits, @alexandradanyi @hcvdwerf. You can focus on the ones needed for your particular use case.
I'll let you know when that is in place.

@hcvdwerf
Contributor

hcvdwerf commented Jan 27, 2025

Sounds like a good approach @amercader! Thanks a lot!

We discussed a use case:

Use Case: Publishing Regional and Annual Environmental Data

One concrete use case is the publication of regional environmental data collected annually. For example:

Scenario
A government agency collects air quality data for different regions (e.g., Texas, California) on a yearly basis. Each year, datasets are generated per region. Over time, these datasets need to be grouped into series for easier navigation and analysis.

Implementation with Dataset Series
Datasets: Individual files for air quality data for a specific region and year (e.g., "Air Quality - Texas 2014").
Dataset Series: Group datasets into series per year (e.g., "Air Quality 2014") or per region (e.g., "Air Quality Texas").

Benefits:
Navigation: Users can seamlessly navigate datasets within a series (e.g., from "Air Quality - Texas 2014" to "Air Quality - California 2015").

Overview:
A series page provides a comprehensive view of all datasets for a given year or region.

in_series is indeed the field to link datasets with their dataset series.

Another confusing example is ;-): https://catalogue.bbmri.nl/menu/main/app-molgenis-app-biobank-explorer#/collection/bbmri-eric:ID:NL_aaaacz5nbabrsacqk2mgyyqaae:collection:IBD. A collection can be seen as a dataset series and a subcollection as a dataset.
