doc: add guidance on quality assets #567

Merged
merged 6 commits into from
Jan 23, 2024
Changes from 4 commits
@@ -11,7 +11,93 @@ sidebar_position: 4

## Quality Data Offers at EDC

When a Data Consumer calls the Catalog of a Data Provider, the Data Provider must signal in each Data Offer what exactly
a Consumer can negotiate for. Data Offers in the Catalog are organized as dcat:Datasets, which are registered in the
EDC Management API as edc:Asset. Each Asset has private and public properties. The public properties are shown in the
catalog and tell the Data Consumer which API and data to expect. Some properties are mandatory for the entire Catena-X
network, and some are mandatory only in specific Business Scenarios (like Quality).

The structure of the dataAddress object is determined by the data plane implementation, as it configures the details of
the data transfer. Its contents are not visible via the catalog.

The following suggestion is a non-standardized draft of how Assets (and thus, by proxy, dcat:Datasets) should be
registered in the Quality Use-Case.

```json
{
  "@context": {
    "cx-taxo": "https://w3id.org/catenax/taxonomy#",
    "cx-common": "https://w3id.org/catenax/ontology/common#",
    "dct": "https://purl.org/dc/terms/",
    "dcat": "http://www.w3.org/ns/dcat#",
    "edc": "https://w3id.org/edc/v0.0.1/ns/"
  },
  "@id": "someId",
  "@type": "edc:Asset",
  "edc:properties": {
    "dct:type": {
      "@id": "cx-taxo:ProductDescription"
    },
    "cx-common:version": "1.0",
    "dct:language": {
      "@id": "https://w3id.org/idsa/code/EN"
    },
    "dcat:qualifiedRelation": {
      "dct:isPartOf": {
        "@id": "http://my.quality/task"
      }
    },
    "dct:conformsTo": {
      "@id": "urn:samm:io.catenax.vehicle.product_description:3.0.0#ProductDescription"
    },
    "dct:description": "TBD",
    "dct:format": "application/octet-stream;type=parquet-snappy",
    "edc:type": "AmazonS3"
  },
  "edc:dataAddress": {
    "@type": "edc:DataAddress",
    "edc:type": "AmazonS3",
    "edc:region": "eu-west-1",
    "edc:bucketName": "int-xcod-quality-aspect-models-eu-west-1",
    "edc:keyName": "myCompany/myTag/QualityTask.parquet",
    "edc:accessKeyId": "…",
    "edc:secretAccessKey": "…"
  }
}
```
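
The Asset body above can also be assembled programmatically. The following is a minimal sketch, not an EDC client: `build_quality_asset` and its parameter names are hypothetical helpers introduced here for illustration, and the property values are copied from the draft example. It only builds the JSON document; how it is sent to the Management API is out of scope.

```python
import json

# Prefix-to-namespace mapping copied from the draft example above.
CONTEXT = {
    "cx-taxo": "https://w3id.org/catenax/taxonomy#",
    "cx-common": "https://w3id.org/catenax/ontology/common#",
    "dct": "https://purl.org/dc/terms/",
    "dcat": "http://www.w3.org/ns/dcat#",
    "edc": "https://w3id.org/edc/v0.0.1/ns/",
}


def build_quality_asset(asset_id, asset_type, quality_task_id,
                        aspect_model_urn, data_address, description="TBD"):
    """Assemble an edc:Asset document (hypothetical helper, not an EDC API)."""
    return {
        "@context": CONTEXT,
        "@id": asset_id,
        "@type": "edc:Asset",
        "edc:properties": {
            # IRIs are always wrapped in an object with an "@id" key.
            "dct:type": {"@id": asset_type},
            "cx-common:version": "1.0",
            "dct:language": {"@id": "https://w3id.org/idsa/code/EN"},
            # Links the asset to the QualityTask id, not to an EDC asset id.
            "dcat:qualifiedRelation": {"dct:isPartOf": {"@id": quality_task_id}},
            "dct:conformsTo": {"@id": aspect_model_urn},
            "dct:description": description,
            "dct:format": "application/octet-stream;type=parquet-snappy",
            # Repeats the data plane type so that it is visible in the catalog.
            "edc:type": data_address["edc:type"],
        },
        "edc:dataAddress": data_address,
    }


if __name__ == "__main__":
    asset = build_quality_asset(
        asset_id="someId",
        asset_type="cx-taxo:ProductDescription",
        quality_task_id="http://my.quality/task",
        aspect_model_urn="urn:samm:io.catenax.vehicle.product_description:3.0.0#ProductDescription",
        data_address={"@type": "edc:DataAddress", "edc:type": "AmazonS3"},
    )
    print(json.dumps(asset, indent=2))
```

Taking `edc:type` from the data address keeps the public property and the private data address from drifting apart.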

### S3 Data Address

This section is not use-case specific, but since the EDC's AmazonS3 data plane is largely undocumented, its properties
are explained here:

| Property | Value | Description |
|-----------------------|------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `edc:type` | `"AmazonS3"` | This shows which data source the Data Plane will query. It also determines what other content the `dataAddress` object must hold. |
| `edc:region` | `"eu-west-1"` | This property represents the AWS-region where the source bucket is located. |
| `edc:bucketName` | `"provider-quality-bucket"` | This is the name of the source bucket that the data to-be-transferred resides in. |
| `edc:keyName` | `"path/through/provider/s3"` | This is the path of the file that shall be offered to the dataspace. |
| `edc:accessKeyId` | `"<keyId>"` | Amazon S3 uses this property similarly to how OAuth2 client credentials use the `clientId`. Note that this can also be set at deployment time for the whole S3 data plane. If it is set here, it overrides the default config. |
| `edc:secretAccessKey` | `"<secretAccessKey>"` | This secret is used similarly to a `clientSecret` in OAuth2 client credentials. |
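
The properties in this table can be collected into a `dataAddress` object with a small helper. This is a sketch under assumptions: `build_s3_data_address` is a hypothetical function, not part of the EDC, and the rule that `edc:accessKeyId` and `edc:secretAccessKey` must be given together (or both left to the deployment-time config) is an assumption made here, not something the table states.

```python
def build_s3_data_address(region, bucket_name, key_name,
                          access_key_id=None, secret_access_key=None):
    """Build an AmazonS3 dataAddress object (hypothetical helper)."""
    # Assumption: credentials come as a pair or not at all.
    if (access_key_id is None) != (secret_access_key is None):
        raise ValueError("access_key_id and secret_access_key must be set together")

    address = {
        "@type": "edc:DataAddress",
        "edc:type": "AmazonS3",
        "edc:region": region,
        "edc:bucketName": bucket_name,
        "edc:keyName": key_name,
    }
    if access_key_id is not None:
        # Credentials set here override the data plane's default config.
        address["edc:accessKeyId"] = access_key_id
        address["edc:secretAccessKey"] = secret_access_key
    return address


if __name__ == "__main__":
    # Placeholder values taken from the table above.
    print(build_s3_data_address("eu-west-1", "provider-quality-bucket",
                                "path/through/provider/s3"))
```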

### Properties

| Property | Value | Description |
|----------------------------------------------------|------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `https://purl.org/dc/terms/type` | `{"@id": "cx-taxo:QualityAsset"}` | CX-0018 mandates the usage of the dct:type property to signal what kind of Asset a consumer can expect behind a dcat:Dataset. In the Quality Use-Case, this is identified as `https://w3id.org/catenax/taxonomy#QualityAsset`. The expected payload this API serves is determined by the `dct:conformsTo` property. |
| `https://purl.org/dc/terms/language` | `{"@id": "https://w3id.org/idsa/code/EN"}` | This property is QM-specific. As it points to an IRI, it must be embedded in a JSON object with the `@id` key. The purpose of this property is unclear. |
| `https://purl.org/dc/terms/format` | `"application/octet-stream;type=parquet-snappy"` | This property is QM-specific. dct:format usually points to the matching IANA Media Type. As currently only parquet files are used, the type application/octet-stream with the added parameter type=parquet-snappy must be used. The syntax is explained [here](https://www.iana.org/assignments/media-types-parameters/media-types-parameters.xhtml). If csv is to be supported in the future, the value could also be `text/csv`. |
| `https://purl.org/dc/terms/description` | `<whatever>` | This property is QM-specific. For human-readable content, rdfs:comment is the usual property, but it would introduce another namespace, so the dct-native property is chosen here. |
| `https://purl.org/dc/terms/conformsTo` | `{"@id":"<urnOfTheCorrespondingAspectModel>"}` | This property is QM-specific. It holds the exact aspect-model-URN that defines the schema of the presented dataset, including its version. The version given here refers to the data model's version, while the EDC-property `cx-common:version` defines the version of the underlying API serving the data. |
| `http://www.w3.org/ns/dcat#qualifiedRelation` | `{"dct:isPartOf": {"@id": "<idOfTheCorrespondingQualityTask>"}}` | This property is QM-specific. All Asset types defined in this Kit must include this property, as it links the data behind an asset with the correct QualityTask. Note that the id of the QualityTask must be used, not the id of the EDC-Asset shielding said QualityTask. |
| `https://w3id.org/edc/v0.0.1/ns/type` | `AmazonS3` | This property signifies the EDC data plane that the QM data will be transferred over. The expectation that this would be signaled via the `dcat:distribution` property of the dcat:Dataset in the catalog is currently not implemented in the EDC. Thus the data must be replicated here and is presented via the same property that the consumer-side `transferprocesses` API uses for this same signal. |
| `https://w3id.org/catenax/ontology/common#version` | `"1.0"` | CX-0018 recommends using cx-common:version to signify the API's version. Since QM has a tight connection between the API and the data model, this value could describe the version of the CX-API-standard for the Quality use-case. That standard is currently being drafted as CX-0123 v1.0.0. |
Contributor

Version is fine, but I think we need to make sure that the Quality KIT clearly states which versions are included in the current release and which versions should be used. As of now, this information is missing, I think.

Contributor Author

cx-common:version is the version of CX-0123.
CX-0123 will likely link to CX-0018 v2.1.0.
CX-0018 demands DSP 0.8.

I don't see any trouble here. Should I explain the logical reason above in the Kit?

Contributor

Ok, understood and fine for me.
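
The Properties table lists full IRIs, while the JSON draft uses prefixed names such as `dct:format`. The sketch below shows how the two relate and how a consumer could read the `type` parameter out of the `dct:format` value. Both functions are hypothetical helpers introduced here; the prefix expansion is a simplification that covers only the prefixes in the draft's `@context`, and a real implementation would more likely use a JSON-LD library.

```python
from email.message import Message

# Subset of the "@context" from the draft Asset example.
CONTEXT = {
    "cx-common": "https://w3id.org/catenax/ontology/common#",
    "dct": "https://purl.org/dc/terms/",
    "dcat": "http://www.w3.org/ns/dcat#",
    "edc": "https://w3id.org/edc/v0.0.1/ns/",
}


def expand_term(term, context):
    """Expand 'prefix:local' to a full IRI if the prefix is known."""
    prefix, sep, local = term.partition(":")
    if sep and prefix in context:
        return context[prefix] + local
    return term  # already a full IRI, or a keyword such as "@id"


def parse_format(value):
    """Split a dct:format value into (media type, 'type' parameter)."""
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_type(), msg.get_param("type")


if __name__ == "__main__":
    print(expand_term("dct:format", CONTEXT))
    print(parse_format("application/octet-stream;type=parquet-snappy"))
```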


For the process described in the KIT Quality, data exchange between the participating partner companies necessarily
covers large vehicle and product populations. The data exchange should therefore be done as a file download via
EDC according to the following specifications.

### **Asset File type**

@@ -23,13 +109,18 @@ File transfer is recommended to be done via EDC S3 plane, The transfer via EDC h

### **Asset consumption**

**File** flattening **rules**: The data provided in the asset is built from 4 to 6 structures. To ensure a secure and
smooth exchange, flattening rules for the file (csv / xls / Parquet / json) must be applied. This includes checks for
format and possible values for each column and will be part of the next version of the regulations. If the rules are
not applied correctly, the content cannot be mapped without manual handling effort.

## Sample Data

Standard version from: 09.2023

In the following, example data for the standardized data models is provided as a download in zip format. The sample data
is generated according to the current standards. It contains a virtual fleet of 50,000 vehicles in which two quality
issues are implemented.

- Production failure of product "zehn" at Tier 1
- Specification failure
@@ -143,3 +234,5 @@ CX_release32_partsanalyses_200_testdata_100_json
As **data provider** please add the **PARQUET file** from folder tesdata_CX32
as EDC asset id to **EDC S3 data plane**:
CX_release32_partsanalyses_200_testdata_100_parquet
