Support csw dcat harvest #2800
udata/harvest/backends/dcat.py
Outdated
</csw:GetRecords>'''
headers = {"Content-Type": "application/xml"}

content = requests.post(url, data=body.format(start=1), headers=headers).text
Which would be better: making a POST with our crafted body, or expecting a complete URL and making a GET query on it?
I would say a POST with our body would make things easier on the user side, but it may prevent the user from adding custom params to the targeted endpoint.
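To make the trade-off concrete, here is a minimal sketch of both options using only the standard library. The body is a shortened stand-in for the one in the PR, and `build_post_body`, `build_get_url`, the default page size of 10, and the `example.org` URL are illustrative names and values, not part of udata. The query parameters are the usual CSW 2.0.2 KVP names.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical, shortened GetRecords body; the real one in the PR is longer.
POST_BODY = '''<csw:GetRecords xmlns:csw="http://www.opengis.net/cat/csw/2.0.2"
    service="CSW" version="2.0.2" resultType="results"
    startPosition="{start}" maxRecords="{size}">
  <csw:Query typeNames="csw:Record"/>
</csw:GetRecords>'''


def build_post_body(start=1, size=10):
    """Option 1: we own the whole request; the user only supplies the endpoint."""
    return POST_BODY.format(start=start, size=size)


def build_get_url(endpoint, start=1, size=10):
    """Option 2: merge our paging params into the user's URL.

    Any custom query parameters already present on the endpoint are kept.
    """
    parts = urlsplit(endpoint)
    params = dict(parse_qsl(parts.query))
    params.update({
        "service": "CSW",
        "version": "2.0.2",
        "request": "GetRecords",
        "resultType": "results",
        "typeNames": "csw:Record",
        "startPosition": str(start),
        "maxRecords": str(size),
    })
    return urlunsplit(parts._replace(query=urlencode(params)))


if __name__ == "__main__":
    print(build_get_url("https://example.org/csw?token=abc", start=11))
    print(build_post_body(start=11))
```

With GET the user's own parameters survive, at the cost of expressing filters and sorting as URL parameters; with POST the request stays fully under the harvester's control.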
The CSW spec allows GET and/or POST, and the supported methods are advertised in the GetCapabilities response:
<ows:Operation name="GetRecords">
<ows:DCP>
<ows:HTTP>
<ows:Get xlink:href="https://apps.titellus.net/geonetwork/srv/eng/csw"/>
<ows:Post xlink:href="https://apps.titellus.net/geonetwork/srv/eng/csw"/>
but most implementations probably support both.
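If the harvester ever needs to check which methods a catalogue advertises, the GetCapabilities document could be inspected along these lines. This is a sketch, not udata code: `get_records_endpoints` is a hypothetical helper, the sample document is a trimmed, completed version of the fragment above, and the OWS namespace is assumed to be the one used with CSW 2.0.2.

```python
import xml.etree.ElementTree as ET

# Namespaces assumed for a CSW 2.0.2 capabilities document.
OWS = "http://www.opengis.net/ows"
XLINK = "http://www.w3.org/1999/xlink"

# Trimmed sample built around the fragment quoted in the comment.
SAMPLE_CAPABILITIES = '''<csw:Capabilities
    xmlns:csw="http://www.opengis.net/cat/csw/2.0.2"
    xmlns:ows="http://www.opengis.net/ows"
    xmlns:xlink="http://www.w3.org/1999/xlink">
  <ows:OperationsMetadata>
    <ows:Operation name="GetRecords">
      <ows:DCP>
        <ows:HTTP>
          <ows:Get xlink:href="https://apps.titellus.net/geonetwork/srv/eng/csw"/>
          <ows:Post xlink:href="https://apps.titellus.net/geonetwork/srv/eng/csw"/>
        </ows:HTTP>
      </ows:DCP>
    </ows:Operation>
  </ows:OperationsMetadata>
</csw:Capabilities>'''


def get_records_endpoints(capabilities_xml):
    """Return {"GET": url, "POST": url} for the methods GetRecords advertises."""
    root = ET.fromstring(capabilities_xml)
    endpoints = {}
    for operation in root.iter(f"{{{OWS}}}Operation"):
        if operation.get("name") != "GetRecords":
            continue
        for method in ("Get", "Post"):
            node = operation.find(
                f"{{{OWS}}}DCP/{{{OWS}}}HTTP/{{{OWS}}}{method}"
            )
            if node is not None:
                endpoints[method.upper()] = node.get(f"{{{XLINK}}}href")
    return endpoints


if __name__ == "__main__":
    print(get_records_endpoints(SAMPLE_CAPABILITIES))
```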
documentStandard is quite specific to GeoNetwork. In this case it is there because we only want to request metadata records in a standard that provides a mapping to DCAT (currently ISO19139 and ISO19115-3), and to avoid failures on catalogues containing records in other standards (e.g. Dublin Core or ISO19110). No criterion is needed if the target catalogue contains only ISO19139.
You can probably increase the page size: we have been using 200 records per page for INSPIRE member states harvesting with no issue.
It can be good to add a sort-by clause to avoid the same record appearing on several pages if index changes occur during harvesting:
<ogc:SortBy xmlns:ogc="http://www.opengis.net/ogc">
<ogc:SortProperty>
<ogc:PropertyName>identifier</ogc:PropertyName>
<ogc:SortOrder>ASC</ogc:SortOrder>
</ogc:SortProperty>
</ogc:SortBy>
</csw:Query>
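Putting the two suggestions together, a paging loop with a page size of 200 and the sort clause might look like the sketch below. It is an assumption-laden illustration rather than the PR's code: `iter_pages` and the `fetch` callable are hypothetical, the body is shortened, and the stop condition relies on the `nextRecord` and `numberOfRecordsMatched` attributes of `csw:SearchResults` from CSW 2.0.2.

```python
import xml.etree.ElementTree as ET

CSW_NS = "http://www.opengis.net/cat/csw/2.0.2"

# Shortened, hypothetical body including the suggested sort clause.
BODY = '''<csw:GetRecords xmlns:csw="http://www.opengis.net/cat/csw/2.0.2"
    service="CSW" version="2.0.2" resultType="results"
    startPosition="{start}" maxRecords="{size}">
  <csw:Query typeNames="csw:Record">
    <csw:ElementSetName>full</csw:ElementSetName>
    <ogc:SortBy xmlns:ogc="http://www.opengis.net/ogc">
      <ogc:SortProperty>
        <ogc:PropertyName>identifier</ogc:PropertyName>
        <ogc:SortOrder>ASC</ogc:SortOrder>
      </ogc:SortProperty>
    </ogc:SortBy>
  </csw:Query>
</csw:GetRecords>'''


def iter_pages(fetch, size=200):
    """Yield each response page as text.

    `fetch` is any callable taking the request body and returning the
    response text, for example a thin wrapper around an HTTP POST.
    """
    start = 1
    while True:
        xml_text = fetch(BODY.format(start=start, size=size))
        yield xml_text
        results = ET.fromstring(xml_text).find(f"{{{CSW_NS}}}SearchResults")
        if results is None:
            break
        next_record = int(results.get("nextRecord", "0"))
        matched = int(results.get("numberOfRecordsMatched", "0"))
        # Stop when the server signals the end, or fails to advance.
        if next_record == 0 or next_record > matched or next_record <= start:
            break
        start = next_record
```

Passing `fetch` in keeps the loop independent of the HTTP client, so it can be exercised with a fake server response.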
About the documentStandard filter: you should probably remove it on your side and ask GeoNetwork users to set up a dedicated portal on their catalogue, so that they filter on their side the records they want harvested. They can then filter on documentStandard themselves if they have metadata records that do not support mapping to DCAT, which should not happen often.
Thank you for the inputs!
- We're sticking with the POST request for an improved user experience. We'll mention it in our harvesting doc (along with an example of the expected endpoint to provide).
- I've removed documentStandard. We should probably also mention in our documentation the need to expose a catalog whose records support mapping to DCAT!
- I've increased the page size to 200 and added the sort clause.
It only adds a toggle between the default and GeoDCAT-AP. We may want to make it entirely configurable later if other options make sense. However, only HarvestFeature and HarvestFilter exist for now.
GeoDCAT-AP schema mapping is not working yet with our harvester
A proposed solution for datagouv/data.gouv.fr#913.
Use DCAT export served by the CSW endpoint, available in GeoNetwork for harvesting.
This CSW backend is actually based on the DCAT backend with a different parse_graph function.