Support csw dcat harvest #2800

Merged: 38 commits, Oct 19, 2023
Changes from 12 commits
Commits
8e28bd7
Test parsing CSW graph
maudetes Oct 24, 2022
c1d6f50
Merge branch 'master' into feat/support-csw-dcat-harvest
maudetes Jan 9, 2023
89079e1
More work in progress
maudetes Jan 12, 2023
3633007
Fix CSW pagination
maudetes Jan 13, 2023
9cc93ad
Add dedicated CSW backend
maudetes Jan 13, 2023
7156205
Remove temporary verify=False param in requests
maudetes Jan 13, 2023
be50028
Add TODO on the node to parse in CSW response
maudetes Jan 13, 2023
0dbfc2d
Merge branch 'master' into feat/support-csw-dcat-harvest
maudetes Jan 13, 2023
02b5624
Find SearchResults in tree
maudetes Jan 20, 2023
0f2bf14
Merge branch 'master' into feat/support-csw-dcat-harvest
maudetes Jan 20, 2023
f0e736c
Comment out on identifier missing ValueError
maudetes Jan 20, 2023
7617e66
Merge branch 'master' into feat/support-csw-dcat-harvest
maudetes Feb 2, 2023
dcac080
Add max items limit in CSW harvest
maudetes Feb 23, 2023
7b60d7f
Merge branch 'master' into feat/support-csw-dcat-harvest
maudetes Mar 3, 2023
ef27113
Clean up a bit
maudetes Mar 3, 2023
23ab5c3
Add tests on paginated CSW GeoNetwork v4
maudetes Mar 3, 2023
382772a
Merge branch 'master' into feat/support-csw-dcat-harvest
maudetes Mar 7, 2023
496ee8c
Improve query based on review comments
maudetes Mar 8, 2023
e573378
Improve break conditions to deal with edge cases
maudetes Mar 8, 2023
e324779
Merge branch 'master' into feat/support-csw-dcat-harvest
maudetes Mar 8, 2023
70dd0b3
Specify CSW namespace instead of guessing
maudetes Mar 9, 2023
966c2e9
Merge branch 'master' into feat/support-csw-dcat-harvest
maudetes Mar 17, 2023
74ff3e3
Mock better SearchResults pagination value
maudetes Mar 17, 2023
3a7992f
Update changelog
maudetes Mar 17, 2023
f3c8793
Merge branch 'master' into feat/support-csw-dcat-harvest
maudetes Mar 29, 2023
b1f2d86
Merge branch 'master' into feat/support-csw-dcat-harvest
maudetes Apr 13, 2023
7b48b8e
Merge branch 'master' into feat/support-csw-dcat-harvest
maudetes Apr 18, 2023
b409af8
Add support for CswDcatBackend in parse_url command
maudetes Apr 20, 2023
26cbbfc
Merge branch 'master' into feat/support-csw-dcat-harvest
maudetes May 10, 2023
8d9539f
Merge branch 'master' into feat/support-csw-dcat-harvest
maudetes May 17, 2023
f2fa89a
Merge branch 'master' into feat/support-csw-dcat-harvest
maudetes Jun 19, 2023
d1117ee
Merge branch 'master' into feat/support-csw-dcat-harvest
maudetes Jul 19, 2023
dd98a8b
Merge branch 'master' into feat/support-csw-dcat-harvest
maudetes Sep 1, 2023
f44c56e
Add a schema feature on CswDcatBackend
maudetes Sep 6, 2023
29e3021
Merge branch 'master' into feat/support-csw-dcat-harvest
maudetes Oct 9, 2023
e9fd92b
Remove configurable schema feature for now
maudetes Oct 18, 2023
7be2386
Update changelog
maudetes Oct 18, 2023
8ebe100
Merge branch 'master' into feat/support-csw-dcat-harvest
maudetes Oct 19, 2023
1 change: 1 addition & 0 deletions setup.py
@@ -44,6 +44,7 @@ def pip(filename):
         ],
         'udata.harvesters': [
             'dcat = udata.harvest.backends.dcat:DcatBackend',
+            'csw = udata.harvest.backends.dcat:CswBackend',
         ],
         'udata.avatars': [
             'internal = udata.features.identicon.backends:internal',
5 changes: 3 additions & 2 deletions udata/harvest/backends/base.py
@@ -268,8 +268,9 @@ def process(self, item):
         raise NotImplementedError

     def add_item(self, identifier, *args, **kwargs):
-        if identifier is None:
-            raise ValueError('DCT.identifier is required for all DCAT.Dataset records')
+        # TODO: Raise an error on identifier missing in dcat dedicated code instead of add_item?
+        # if identifier is None:
+        #     raise ValueError('DCT.identifier is required for all DCAT.Dataset records')
         item = HarvestItem(remote_id=str(identifier), args=args, kwargs=kwargs)
         self.job.items.append(item)
         return item
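
The TODO asks whether the missing-identifier check belongs in DCAT-specific code rather than in the generic add_item. One side effect of only commenting it out: str(None) is the string 'None', so a record without an identifier is stored with remote_id 'None' instead of being rejected. A minimal sketch of the alternative, using stand-in classes (BaseBackendSketch, DcatBackendSketch and add_dataset_item are hypothetical names for illustration, not udata APIs):

```python
class BaseBackendSketch:
    """Stand-in for the generic backend: accepts any identifier."""

    def __init__(self):
        self.items = []

    def add_item(self, identifier, **kwargs):
        # Mirrors the patched add_item: no validation, str() applied as-is.
        item = {'remote_id': str(identifier), 'kwargs': kwargs}
        self.items.append(item)
        return item


class DcatBackendSketch(BaseBackendSketch):
    """Stand-in for DCAT-specific code that enforces the identifier rule."""

    def add_dataset_item(self, identifier, **kwargs):
        # Hypothetical wrapper: validate here, then delegate to the generic method.
        if identifier is None:
            raise ValueError('DCT.identifier is required for all DCAT.Dataset records')
        return self.add_item(identifier, **kwargs)


generic = BaseBackendSketch()
unchecked = generic.add_item(None)  # accepted without complaint

dcat = DcatBackendSketch()
checked = dcat.add_dataset_item('dataset-1', page=0)
```

With this split, non-DCAT backends keep a permissive add_item while DCAT code still fails loudly on a missing dct:identifier.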
60 changes: 60 additions & 0 deletions udata/harvest/backends/dcat.py
@@ -4,6 +4,7 @@

 from rdflib import Graph, URIRef, BNode
 from rdflib.namespace import RDF
+import xml.etree.ElementTree as ET
 from typing import List

 from udata.rdf import (
@@ -120,3 +121,62 @@ def process(self, item):
         dataset = self.get_dataset(item.remote_id)
         dataset = dataset_from_rdf(graph, dataset, node=node)
         return dataset
+
+
+class CswBackend(DcatBackend):
+    display_name = 'CSW'
+
+    def parse_graph(self, url, fmt):
+        graphs = []
+
+        page = 0
+        body = '''<csw:GetRecords xmlns:csw="http://www.opengis.net/cat/csw/2.0.2"
+                                  xmlns:gmd="http://www.isotc211.org/2005/gmd"
+                                  service="CSW" version="2.0.2" resultType="results"
+                                  startPosition="{start}" maxPosition="15"
+                                  outputSchema="http://www.w3.org/ns/dcat#">
+                    <csw:Query typeNames="gmd:MD_Metadata">
+                        <csw:ElementSetName>full</csw:ElementSetName>
+                        <csw:Constraint version="1.1.0">
+                            <Filter xmlns="http://www.opengis.net/ogc"><PropertyIsEqualTo>
+                                <PropertyName>documentStandard</PropertyName>
+                                <Literal>iso19139</Literal>
+                            </PropertyIsEqualTo></Filter>
+                        </csw:Constraint>
+                    </csw:Query>
+                </csw:GetRecords>'''
+        headers = {"Content-Type": "application/xml"}
+
+        content = requests.post(url, data=body.format(start=1), headers=headers).text
Comment from the PR author (Contributor):

Which would be better: making a POST with our crafted body, or expecting a complete URL and making a GET query on it?

I would say a POST with our body would make it easier on the user side, but it may prevent the user from adding custom params to the targeted endpoint.

Reply:

The CSW spec allows GET and/or POST, and the supported methods are advertised in the GetCapabilities response:

    <ows:Operation name="GetRecords">
      <ows:DCP>
        <ows:HTTP>
          <ows:Get xlink:href="https://apps.titellus.net/geonetwork/srv/eng/csw"/>
          <ows:Post xlink:href="https://apps.titellus.net/geonetwork/srv/eng/csw"/>

but most implementations probably support both.
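
A harvester could check this rather than assume it. Below is a rough sketch of reading the advertised GetRecords endpoints from a GetCapabilities document with the standard library only. The function name get_records_endpoints, the sample document and the csw.example.org URL are made up for illustration, and the ows namespace URI is the one I would expect for CSW 2.0.2, so treat it as an assumption to verify against the target server:

```python
import xml.etree.ElementTree as ET

# Assumed namespaces for a CSW 2.0.2 capabilities document.
NS = {
    'ows': 'http://www.opengis.net/ows',
    'xlink': 'http://www.w3.org/1999/xlink',
}


def get_records_endpoints(capabilities_xml):
    """Return {'Get': url_or_None, 'Post': url_or_None} for GetRecords."""
    root = ET.fromstring(capabilities_xml)
    endpoints = {'Get': None, 'Post': None}
    for operation in root.iter('{%s}Operation' % NS['ows']):
        if operation.get('name') != 'GetRecords':
            continue
        for method in endpoints:
            element = operation.find('ows:DCP/ows:HTTP/ows:%s' % method, NS)
            if element is not None:
                endpoints[method] = element.get('{%s}href' % NS['xlink'])
    return endpoints


# Hypothetical capabilities excerpt, shaped like the snippet quoted above.
SAMPLE = '''<csw:Capabilities xmlns:csw="http://www.opengis.net/cat/csw/2.0.2"
    xmlns:ows="http://www.opengis.net/ows"
    xmlns:xlink="http://www.w3.org/1999/xlink" version="2.0.2">
  <ows:OperationsMetadata>
    <ows:Operation name="GetCapabilities">
      <ows:DCP><ows:HTTP>
        <ows:Get xlink:href="https://csw.example.org/other"/>
      </ows:HTTP></ows:DCP>
    </ows:Operation>
    <ows:Operation name="GetRecords">
      <ows:DCP><ows:HTTP>
        <ows:Get xlink:href="https://csw.example.org/csw"/>
        <ows:Post xlink:href="https://csw.example.org/csw"/>
      </ows:HTTP></ows:DCP>
    </ows:Operation>
  </ows:OperationsMetadata>
</csw:Capabilities>'''

endpoints = get_records_endpoints(SAMPLE)
```

If 'Post' comes back as None, the backend could fall back to a GET query or report a clear error instead of failing on the first request.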

documentStandard is quite specific to GeoNetwork. Here it is used to request only metadata records in a standard that provides a mapping to DCAT (currently ISO19139 and ISO19115-3), and so to avoid failures on catalogues containing records in other standards (e.g. Dublin Core or ISO19110). No criterion is needed if the target catalogue contains only ISO19139.

You can probably increase the page size: we have been using 200 records per page for INSPIRE member states harvesting with no issue.

It can be good to add a SortBy clause, to avoid records appearing on several pages if the index changes during harvesting:

    <ogc:SortBy xmlns:ogc="http://www.opengis.net/ogc">
      <ogc:SortProperty>
        <ogc:PropertyName>identifier</ogc:PropertyName>
        <ogc:SortOrder>ASC</ogc:SortOrder>
      </ogc:SortProperty>
    </ogc:SortBy>
  </csw:Query>

Reply:

About the documentStandard filter: you should probably remove it on your side and ask GeoNetwork users to set up a dedicated portal on their catalogue, so that they choose on their side which records get harvested. They can then filter on documentStandard themselves if they have metadata records that do not support mapping to DCAT, which should not happen often.

Reply from the PR author (Contributor):

Thank you for the inputs!

  • We're keeping the POST request for a better user experience. We'll mention it in our harvesting doc, along with an example of the endpoint we expect users to provide.
  • I've removed documentStandard. We should probably also mention in our documentation that the catalog needs to expose records that support mapping to DCAT!
  • I've increased the page size to 200 and added the sort clause.
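
Putting the three points above together, the revised request body could look roughly like the sketch below: no documentStandard constraint, 200 records per page, and a SortBy on identifier. The names CSW_BODY and build_get_records_body are hypothetical. The sketch also writes the page size as maxRecords, which is the attribute name CSW 2.0.2 defines for GetRecords, whereas the diff in this PR writes maxPosition; that substitution is my reading of the intent, not something confirmed in the thread.

```python
import xml.etree.ElementTree as ET

# Hypothetical template: sorted, unfiltered GetRecords request asking for DCAT output.
CSW_BODY = '''<csw:GetRecords xmlns:csw="http://www.opengis.net/cat/csw/2.0.2"
    xmlns:gmd="http://www.isotc211.org/2005/gmd"
    service="CSW" version="2.0.2" resultType="results"
    startPosition="{start}" maxRecords="{page_size}"
    outputSchema="http://www.w3.org/ns/dcat#">
  <csw:Query typeNames="gmd:MD_Metadata">
    <csw:ElementSetName>full</csw:ElementSetName>
    <ogc:SortBy xmlns:ogc="http://www.opengis.net/ogc">
      <ogc:SortProperty>
        <ogc:PropertyName>identifier</ogc:PropertyName>
        <ogc:SortOrder>ASC</ogc:SortOrder>
      </ogc:SortProperty>
    </ogc:SortBy>
  </csw:Query>
</csw:GetRecords>'''


def build_get_records_body(start=1, page_size=200):
    """Fill in the paging attributes; int() rejects non-numeric input early."""
    return CSW_BODY.format(start=int(start), page_size=int(page_size))


# Parse the result back to confirm the template stays well-formed XML.
first_page = ET.fromstring(build_get_records_body())
next_page = ET.fromstring(build_get_records_body(start='201'))
```

The body would then be sent with requests.post(url, data=..., headers={"Content-Type": "application/xml"}), as in the diff.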

+        tree = ET.fromstring(content)
+
+        while tree:
+            graph = Graph(namespace_manager=namespace_manager)
+            # TODO: could we find a better way to deal with namespaces?
+            namespace = tree.tag.split('}')[0].strip('{}')
+            search_results = tree.find('csw:SearchResults', {'csw': namespace})
+            if not search_results:
+                # TODO: may be worth an investigation if it happens
+                log.error(f'No search results found for {url} on page {page}')
+                break
+            for child in search_results:
+                subgraph = Graph(namespace_manager=namespace_manager)
+                subgraph.parse(data=ET.tostring(child), format=fmt)
+                graph += subgraph
+
+                for node in subgraph.subjects(RDF.type, DCAT.Dataset):
+                    id = subgraph.value(node, DCT.identifier)
+                    kwargs = {'nid': str(node), 'page': page}
+                    kwargs['type'] = 'uriref' if isinstance(node, URIRef) else 'blank'
+                    self.add_item(id, **kwargs)
+            graphs.append(graph)
+            page += 1
+
+            if search_results.attrib['nextRecord'] != '0':
+                tree = ET.fromstring(
+                    requests.post(url, data=body.format(start=search_results.attrib['nextRecord']),
+                                  headers=headers).text)
+            else:
+                break
+
+        return graphs
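
One thing worth noting about the loop above: an ElementTree Element is falsy when it has no children, so `while tree:` and `if not search_results:` test for "has child elements", not for "was found". A SearchResults element that is present but empty is therefore logged as missing. Below is a rough sketch of the same nextRecord pagination with explicit `is None` checks, with the HTTP call replaced by an injected function so it can run offline. iter_csw_pages, make_page and FAKE_PAGES are hypothetical names, and the fixed CSW namespace is an assumption in line with the later "Specify CSW namespace instead of guessing" commit.

```python
import xml.etree.ElementTree as ET

CSW_NS = {'csw': 'http://www.opengis.net/cat/csw/2.0.2'}


def iter_csw_pages(fetch_page, max_pages=1000):
    """Yield (page_index, SearchResults element) until nextRecord says stop.

    fetch_page(start) must return the XML text of the page starting at `start`.
    """
    start = '1'
    for page in range(max_pages):
        tree = ET.fromstring(fetch_page(start))
        search_results = tree.find('csw:SearchResults', CSW_NS)
        if search_results is None:  # truly absent, as opposed to empty
            break
        yield page, search_results
        next_record = search_results.get('nextRecord', '0')
        # Stop on the end marker, or if the server keeps returning the same position.
        if next_record in ('0', start):
            break
        start = next_record


def make_page(next_record, ids):
    """Build a fake GetRecordsResponse holding dummy <record> children."""
    records = ''.join('<record id="%s"/>' % i for i in ids)
    return ('<csw:GetRecordsResponse xmlns:csw="http://www.opengis.net/cat/csw/2.0.2">'
            '<csw:SearchResults nextRecord="%s">%s</csw:SearchResults>'
            '</csw:GetRecordsResponse>' % (next_record, records))


FAKE_PAGES = {'1': make_page('3', ['a', 'b']), '3': make_page('0', ['c'])}

pages = [(page, [record.get('id') for record in results])
         for page, results in iter_csw_pages(lambda start: FAKE_PAGES[start])]
```

The max_pages bound and the repeated-position check are extra safety nets against a server that never returns nextRecord="0"; they are not part of this diff.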