Dataset Series support #298
Comments
Can a dataset be part of multiple series, e.g. separated by release and separated by region?

For ordering, the metadata should already include a field that can be used to order the results, e.g. a release date field or a region field, so we don't need to track order separately. Adjusting the order within a series becomes updating the metadata for a dataset.

We could allow navigation through dataset series from the display of the metadata field the series is based on, e.g. a date picker for release (only the dates with releases selectable) or a choice list for region.

How about we identify fields that define a series as part of the schema definition, something like:

data_series_fields:
  # blank for non-data-series datasets, set to the same value for all datasets in the same series
  # required
  identifier_field: my_series_group
  # enable chronological series e.g. a date or year+month field:
  temporal_field: my_release_date
  # enable geographic region series e.g. a choice list of locations:
  region_field: my_region
  # enable series by chunk of data for large datasets e.g. an integer field:
  # partial_field: my_part_number
  # enable series by geographic grid identifier:
  # grid_field: my_grid

Facets in search will give us all the neighboring series datasets and we don't need new tables or extensive changes to the UI.
Thanks for the feedback @wardi
Technically yes. DCAT-AP discourages it though to avoid complexity. Also in theory there is even support for nested series, but again this would add even more complexity.
I get how this simplifies the implementation, but wouldn't that mean that we need to support each of these cases separately, as they all have slightly different implementations (e.g. sorting by date vs alphabetically by code vs chunk of a larger file)? Perhaps we can abstract the cases a bit and have one for time-based sort, one for numeric sort and one for text sort, and maybe one that allows defining a callable for more complex sorts. If we are not tracking order separately, would you call these sorting algorithms at index time to calculate the first, last, prev and next fields? I guess this would mean re-indexing the whole series when a member is updated. On the other hand, if we compute them at view time (i.e. on the dataset page or when generating a DCAT representation), that might affect performance.
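A minimal sketch of the "generic sort types plus optional callable" idea, just to make the abstraction concrete. The `SORT_KEYS` registry, the `order_series` helper and the sample dataset dicts are all hypothetical names for illustration; none of them exist in CKAN or in any extension.

```python
from datetime import date

# Hypothetical registry of generic sort types (illustrative only).
SORT_KEYS = {
    "temporal": lambda value: date.fromisoformat(value),
    "numeric": lambda value: float(value),
    "text": lambda value: str(value).casefold(),
}


def order_series(datasets, field, sort_type="text", key=None):
    """Return the member datasets of a series ordered by `field`.

    `key` is an optional callable for sorts that the generic
    types above do not cover; it takes precedence over `sort_type`.
    """
    convert = key or SORT_KEYS[sort_type]
    return sorted(datasets, key=lambda d: convert(d[field]))


members = [
    {"name": "air-2023", "my_release_date": "2023-01-15"},
    {"name": "air-2021", "my_release_date": "2021-02-01"},
    {"name": "air-2022", "my_release_date": "2022-01-20"},
]
ordered = order_series(members, "my_release_date", sort_type="temporal")
print([d["name"] for d in ordered])
```

The numeric type matters for cases like part numbers, where a plain text sort would put "10" before "9".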
I like it. Having time sort, numeric sort and text sort as generic types would be more flexible. We could have an index on the order fields in the search backend. That way the search can give us all the datasets in a series in order without needing to re-index them.
Regarding re-indexing a whole series, were you thinking of the case where we've assigned an integer numeric index and need to insert a dataset in the series and bump all the later numbers? Maybe we could store a float value instead, insert at the half-way value and somehow display integer indexes from the indexed search result.
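A rough sketch of the float idea, assuming positions are kept as plain floats per dataset. `position_between` and `display_indexes` are hypothetical helper names, and the dict stands in for whatever the search index would return.

```python
def position_between(before=None, after=None):
    """Return a float position between two neighbours (None = open end)."""
    if before is None and after is None:
        return 1.0
    if before is None:
        return after - 1.0
    if after is None:
        return before + 1.0
    return (before + after) / 2.0


def display_indexes(positions):
    """Map stored float positions to 1-based integer indexes for display."""
    ranked = sorted(positions.items(), key=lambda item: item[1])
    return {name: i for i, (name, _) in enumerate(ranked, start=1)}


positions = {"part-a": 1.0, "part-b": 2.0, "part-c": 3.0}
# Insert a new member between part-a and part-b without touching the others.
positions["part-new"] = position_between(positions["part-a"], positions["part-b"])
print(display_indexes(positions))
# {'part-a': 1, 'part-new': 2, 'part-b': 3, 'part-c': 4}
```

One caveat with this approach: repeatedly halving the same gap eventually exhausts float precision (after roughly fifty insertions between two adjacent values), so an occasional renumbering of the series would still be needed.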
Hi all, not sure if this is the right place to ask, but would it be possible to get a rough timeline for when this can be implemented? Thanks in advance!
Quick status update. After discussing the implementation in more detail with @wardi and @smotornyuk (thanks for the great feedback!) I now have a POC for this that I think is a great way forward for a first implementation. This will be implemented with a custom dataset type for Dataset Series plus computed fields for series navigation based on Solr queries (i.e. no persistent navigation items stored). This allows a really straightforward implementation and avoids out-of-sync problems. As this is a valuable feature for all sites and to avoid adding complexity to this extension, the bulk of the logic will live in a new, separate ckanext-dataset-series extension:
I'll push a first draft of ckanext-dataset-series early next week so we can create specific issues there. Once that is in place and the overall feature is documented, it will be much easier to contribute the different bits that are missing, @alexandradanyi @hcvdwerf. You can focus on the ones needed for your particular use case.
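To illustrate what "computed fields for series navigation" could look like, here is a minimal sketch. It assumes the ordered list of member ids has already been obtained (in the approach described above that would come from a Solr query; a plain Python list stands in for it here). `series_navigation` is a hypothetical helper, not the API of ckanext-dataset-series.

```python
def series_navigation(ordered_ids, current_id):
    """Compute navigation links for one member of an ordered series.

    Nothing is persisted: the links are derived from the ordered list
    every time, so they cannot get out of sync with the members.
    """
    if current_id not in ordered_ids:
        raise ValueError(f"{current_id!r} is not part of the series")
    i = ordered_ids.index(current_id)
    return {
        "first": ordered_ids[0],
        "last": ordered_ids[-1],
        "previous": ordered_ids[i - 1] if i > 0 else None,
        "next": ordered_ids[i + 1] if i < len(ordered_ids) - 1 else None,
    }


ordered_ids = ["air-2021", "air-2022", "air-2023"]
print(series_navigation(ordered_ids, "air-2022"))
```

For the first member `previous` is `None` and for the last member `next` is `None`, which maps naturally onto omitting the corresponding link in the UI or in the DCAT output.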
Sounds like a good approach @amercader! Thanks a lot! We discussed a use case:

Use Case: Publishing Regional and Annual Environmental Data

One concrete use case is the publication of regional environmental data collected annually. For example:

Scenario

Implementation with Dataset Series

Benefits:

Overview:

in_series is indeed the field to link datasets with dataset series.

Another confusing example is ;-): https://catalogue.bbmri.nl/menu/main/app-molgenis-app-biobank-explorer#/collection/bbmri-eric:ID:NL_aaaacz5nbabrsacqk2mgyyqaae:collection:IBD. The collection can be seen as a dataset series and the subcollection as a dataset.
DCAT 3 introduced a new class for Dataset Series, essentially defined as a collection of datasets that have a common characteristic (both DCAT-AP and DCAT-US provide implementation guidance and examples).
Note that the definition of Dataset Series is very loose and not necessarily restricted to time series. Some potential examples:
The concept of "collections" of datasets is not new in CKAN and there have been previous implementations:
These are all conceptually similar, just a higher level entity that datasets can belong to. In DCAT terms this is expressed using the dcat:DatasetSeries class and the dcat:inSeries property in each dcat:Dataset. dcat:DatasetSeries shares a subset of dcat:Dataset properties (e.g. in DCAT AP v3 or DCAT US v3).

When there is a linear relation between the datasets in the series, links and navigation can be implemented using dcat:first, dcat:prev, and dcat:last (and the inverse dcat:next), e.g.:

Potential implementations
Setting aside the navigation part of it, I think that implementing series using a custom dataset type (dataset_series) rather than Groups has more benefits. For starters, Dataset Series are subclasses of Datasets and share several properties. Secondly, we can index them for free so they can be returned as results of the standard dataset search, or excluded depending on the instance needs.

A custom in_series field can be added to the dataset schema, to store the id of the dataset_series dataset it belongs to. This field will also allow member datasets to not be displayed in the default search if that is a requirement.

Navigation (prev/next) within a series should be optional.
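As a sketch of how the in_series field could drive what shows up in the default search, the helper below builds a Solr filter-query string that could be passed as the `fq` parameter of a search. `build_search_filter` is a hypothetical function, and the Solr field names are assumptions: depending on how the custom field is indexed, the actual field name in the index may differ from `in_series`.

```python
def build_search_filter(hide_members=False, hide_series=False,
                        series_field="in_series",
                        series_type="dataset_series"):
    """Build a Solr filter-query string for the default dataset search.

    The field and type names are assumptions for illustration only.
    """
    clauses = []
    if hide_members:
        # Negated open range: keeps only documents with no value in the field,
        # i.e. datasets that do not belong to any series.
        clauses.append(f"-{series_field}:[* TO *]")
    if hide_series:
        # Exclude the series entities themselves from the results.
        clauses.append(f"-dataset_type:{series_type}")
    return " ".join(clauses)


print(build_search_filter(hide_members=True))
print(build_search_filter(hide_series=True))
```

Keeping both switches independent reflects the point that sites may want series in the results, members in the results, or both.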
UI changes needed:
The linking / navigation part is what presents more challenges. We need an efficient way of
So essentially storing the order of the datasets within the series. Both previous/next and first/last could be computed at index time.
We have similar cases in CKAN core with the resource and resource view ordering.

Resources have a position field, but it's not comparable because all resources are updated with a single package_update call. Resource views do have a dedicated order field and the resource_view_reorder action updates all DB records.

We could follow a similar approach to resource views in series. Although it would be nice to not have to rely on a new table (dataset_series with series_id, dataset_id, order columns), I don't think we can update the order efficiently for big series by updating custom fields in the dataset.

We would need to test performance for very large series.
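For reference, a minimal sketch of what the dedicated table approach could look like, using an in-memory SQLite database purely for illustration (CKAN itself uses PostgreSQL through SQLAlchemy, and the `reorder` and `members` helpers are hypothetical). The table and column names follow the ones mentioned above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE dataset_series (
        series_id  TEXT NOT NULL,
        dataset_id TEXT NOT NULL,
        "order"    INTEGER NOT NULL,
        PRIMARY KEY (series_id, dataset_id)
    )
    """
)


def reorder(conn, series_id, dataset_ids):
    """Rewrite the order of every member of a series in one transaction."""
    with conn:  # commits on success, rolls back on error
        conn.execute("DELETE FROM dataset_series WHERE series_id = ?", (series_id,))
        conn.executemany(
            'INSERT INTO dataset_series (series_id, dataset_id, "order") '
            "VALUES (?, ?, ?)",
            [(series_id, ds, i) for i, ds in enumerate(dataset_ids)],
        )


def members(conn, series_id):
    """Return the dataset ids of a series in their stored order."""
    rows = conn.execute(
        'SELECT dataset_id FROM dataset_series WHERE series_id = ? ORDER BY "order"',
        (series_id,),
    )
    return [row[0] for row in rows]


reorder(conn, "air-quality", ["air-2021", "air-2022", "air-2023"])
reorder(conn, "air-quality", ["air-2023", "air-2021", "air-2022"])
print(members(conn, "air-quality"))
```

Like the resource view case, a reorder here touches every row of the series, which is the cost that would need measuring for very large series.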