Join specification when logical source is the same #74

Open
bjdmeest opened this issue Jan 18, 2023 · 12 comments
Labels: proposal (issue has a proposal to be solved), RML-join (issues relate to joins of RML)


@bjdmeest
Member

bjdmeest commented Jan 18, 2023

Let's say we have two triple maps that refer to the same logical source (and with same, we really mean same URI, not "same because the descriptions lead to the semantically same logical source").

Sample source (CSV)

id,parent_id
1,2
2,1

Base mapping (YARRRML)

prefixes:
  ex: http://example.com#
sources:
  test: [data.csv]
mappings:
  test1:
    s: ex:$(id)
    po:
      p: ex:parent
      o:
        mapping: test2
  test2:
    s: ex:$(parent_id)

We have the following use cases that are underspecified in the spec.

The spec currently says: "If the logical source of the child triples map and the logical source of the parent triples map of a referencing object map are not identical, then the referencing object map must have at least one join condition."

  1. If a join condition is specified AND the logical source is not the same: common case, execute the join condition between each iteration pair
  2. If a join condition is specified AND the logical source is the same: same as above
  3. If no join condition is specified AND the logical source is not the same: do a full join (i.e., take all iterations into account)
  • example output: ex:1 ex:parent ex:2, ex:1 ex:parent ex:1, ex:2 ex:parent ex:2, ex:2 ex:parent ex:1
  4. If no join condition is specified AND the logical source is the same: don't do a full join, but take only the current iteration into account
  • example output: ex:1 ex:parent ex:2, ex:2 ex:parent ex:1
  • this last one is the edge case, but it allows a 'join per iteration'. The question is: should we make this edge case explicit, or should there be a different way to tackle it?
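
The two behaviours without a join condition can be sketched in a few lines of Python. This is only an illustration on the sample CSV above, not how any RML processor implements it; the names `ROWS`, `per_iteration`, and `full_join` are made up for this sketch.

```python
from itertools import product

# Rows of the sample CSV, one dict per iteration of the logical source.
ROWS = [
    {"id": "1", "parent_id": "2"},
    {"id": "2", "parent_id": "1"},
]


def per_iteration(rows):
    """No join condition, same logical source: pair each iteration with itself."""
    return [(f"ex:{r['id']}", "ex:parent", f"ex:{r['parent_id']}") for r in rows]


def full_join(child_rows, parent_rows):
    """No join condition, different logical sources: pair every child
    iteration with every parent iteration (cartesian product)."""
    return [
        (f"ex:{c['id']}", "ex:parent", f"ex:{p['parent_id']}")
        for c, p in product(child_rows, parent_rows)
    ]


if __name__ == "__main__":
    print(per_iteration(ROWS))    # two triples
    print(full_join(ROWS, ROWS))  # four triples
```

The per-iteration variant yields the two triples of the "same logical source" example output, while the full join yields the four triples of the "not the same" example output.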
@dachafra dachafra added the documentation Improvements or additions to documentation label Jan 20, 2023
@pmaria
Collaborator

pmaria commented Jan 22, 2023

  • If no join condition is specified AND the logical source is the same: don't do a full join, but take the current iteration into account

    • example output: ex:1 ex:parent ex:2, ex:2 ex:parent ex:1
    • this last one is the edge case, but allows to 'join per iteration'. Question is: should we make this edge case explicit, or should there be a different way to tackle this edge case?

Why is this last one an edge case? IMO this is regular and intended behavior.

R2RML states:

The child query of a referencing object map is the effective SQL query of the logical table of the term map containing the referencing object map.

The parent query of a referencing object map is the effective SQL query of the logical table of its parent triples map.

If the child query and parent query of a referencing object map are not identical, then the referencing object map must have at least one join condition.

The joint SQL query of a referencing object map is:

If the referencing object map has no join condition:
SELECT * FROM ({child-query}) AS tmp

[...]
The joint SQL query is used when generating RDF triples from referencing object maps.

So if the query of the logical table / iterator + ref. formulation + source of the logical source are equal, the child query / child logical source will be used for the reference object values. Thus the join will be executed per iteration.
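
As a rough illustration of the R2RML rule quoted above, the following sketch runs the joint SQL query for the "no join condition" case against an in-memory SQLite table. The table name `data`, the column names, and the triple formatting are assumptions made for this example; only the `SELECT * FROM ({child-query}) AS tmp` shape comes from R2RML.

```python
import sqlite3

# In-memory table standing in for the logical table (assumed name: data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (id TEXT, parent_id TEXT)")
conn.executemany("INSERT INTO data VALUES (?, ?)", [("1", "2"), ("2", "1")])

# Child and parent query are identical, so no join condition is required.
child_query = "SELECT id, parent_id FROM data"

# Joint SQL query for a referencing object map without join condition.
joint_query = f"SELECT * FROM ({child_query}) AS tmp"

# Subject and object are both generated from the same row, so the
# "join" is effectively evaluated per iteration.
triples = [
    (f"ex:{row_id}", "ex:parent", f"ex:{parent_id}")
    for row_id, parent_id in conn.execute(joint_query)
]

if __name__ == "__main__":
    for triple in triples:
        print(*triple)
```

Because the joint query only reads the child query, each result row produces exactly one triple, which is the per-iteration behaviour described above.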

@bjdmeest
Member Author

Ah, so then the current spec doesn't allow the use case "join over all iterations without join condition".

I see 2 potential paths:

  1. if we agree that we don't really need that use case: maybe clarify the spec a bit that both "the same logical source ID is reused in a different mapping" and "a different logical source with the same description (i.e. equal query of the logical table / iterator + ref. formulation + source of the logical source)" are interpreted as "identical logical sources".
  • we can still support this use case by, e.g., adding a join condition that always returns true
  2. if we agree that we do need that use case: one way to resolve that is to make the distinction between reusing the logical source ID vs. having the same descriptions in different logical sources.

If there are no alternative opinions, I think I would vote for option 1: fewer edge cases, and potential full joins are made explicit. The trade-off is that, for every logical source type, we need to specify what its 'identifying attributes' are (e.g. a CSVW source requires separator, null value specifier, ... all in its 'identifying attributes'), and that the edge case is more verbose.

@dachafra
Member

As it's a join issue, I'm going to move it to its corresponding repo

@dachafra dachafra transferred this issue from kg-construct/rml-core Aug 29, 2023
@elsdvlee
Collaborator

elsdvlee commented Sep 22, 2023

The ambiguity of 'identical logical sources' came up again during the TF Meeting on join 8/9/2023.

  1. Can we agree to add to the spec that 'identical logical sources' are logical sources with the same identifier (and not "same because the descriptions lead to the semantically same logical source")? This allows a clean definition that cannot be misunderstood. I see no need to also accept 'descriptions that lead to the semantically same logical source' as 'identical sources', as you can always choose that first option when you write a mapping file. Mapping engines can still implement strategies to optimize (some) joins with join conditions over 'semantically same logical sources'; however, that doesn't have to be described in the spec.
  2. In the new spec draft (https://kg-construct.github.io/rml-core/spec/docs/#joins) I don't find the reference to the joint queries of R2RML nor the requirement that a join condition is needed when logical sources are not identical (see the corresponding description of the old spec below). Can we agree to also add the old description to the new spec? Was there any reason to leave this out? Or is this underspecified because we are waiting for the outcome of the TF Meeting on Joins?

The joint query is used when generating RDF triples from referencing object maps. If the logical source of the child triples map and the logical source of the parent triples map of a referencing object map are not identical, then the referencing object map must have at least one join condition.

@pmaria
Collaborator

pmaria commented Sep 22, 2023

The ambiguity of 'identical logical sources' came up again during the TF Meeting on join 8/9/2023.

  1. Can we agree to add to the spec that 'identical logical sources' are logical sources with the same identifier (and not "same because the descriptions lead to the semantically same logical source")? This allows a clean definition that cannot be misunderstood. I see no need to also accept 'descriptions that lead to the semantically same logical source' as 'identical sources', as you can always choose that first option when you write a mapping file. Mapping engines can still implement strategies to optimize (some) joins with join conditions over 'semantically same logical sources'; however, that doesn't have to be described in the spec.

I don't see why we would need this. Nowhere else is this currently needed. It is, however, important to define that descriptions that lead to the same logical source are considered equal. For example, if you describe the same logical source on different triples maps with inline blank nodes. Equality via URI is basically a free gift from RDF.

I would also prefer to not use RDF language specifics to define behavior, since this may impede the introduction of non-RDF language bindings.

@elsdvlee
Collaborator

elsdvlee commented Jan 22, 2024

We still need an agreement and specification in RML-core on:

  1. the definition of same logical source:
    option 1.1: same identifier
    option 1.2: descriptions that lead to the same logical source.
  2. the behaviour in case of a referencing object map / referencing term map without join conditions:
    option 2.1: natural join in case of same logical sources, error in case of different logical sources (this is in line with the R2RML spec and the old RML spec)
    option 2.2: natural join in case of same logical sources, cartesian product in case of different logical sources
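
The difference between options 2.1 and 2.2 could be sketched roughly as follows. This is a hypothetical helper, not part of any spec or engine: the function name `pair_iterations`, the `same_source` flag, and the `on_different` switch are all assumptions for illustration, and "natural join" is read here as pairing each iteration with itself.

```python
from itertools import product


def pair_iterations(child, parent, same_source, on_different="error"):
    """Pair child and parent iterations when no join condition is given."""
    if same_source:
        # Options 2.1 and 2.2 agree: each iteration is paired with itself.
        return [(row, row) for row in child]
    if on_different == "error":
        # Option 2.1: a join condition is mandatory across different sources.
        raise ValueError("join condition required for different logical sources")
    # Option 2.2: fall back to the cartesian product of all iterations.
    return list(product(child, parent))


if __name__ == "__main__":
    rows = ["row1", "row2"]
    print(pair_iterations(rows, rows, same_source=True))
    print(pair_iterations(rows, rows, same_source=False, on_different="cartesian"))
```

How `same_source` is decided is exactly the open question of options 1.1 and 1.2, so it is passed in as a plain boolean here.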

In case of option 1.2: can we also add a clear description in the spec of how to compare the descriptions, e.g. string-wise comparison of rml:source and rml:iterator objects? Or should a full comparison of the nested source also be done? What about, e.g., 2 DCAT descriptions with different identifiers leading to the same source?

A proposal for the spec, see join types in: https://github.com/elsdvlee/rml-core/blob/main/spec/docs/joinconditions.md
Feel free to suggest improvements. This proposal is still to be adapted to any decision from the community on the open questions raised above.

@elsdvlee
Collaborator

elsdvlee commented Jan 26, 2024

@dachafra I propose to move this issue back to the rml-core repo.

@dachafra
Member

yep! Transferring it....!

@dachafra dachafra transferred this issue from kg-construct/rml-jc Jan 26, 2024
@dachafra
Member

To continue the discussion of this issue, and considering that there is already a spec written, I would suggest making a PR @elsdvlee so the rest can review it and provide comments!

@dachafra dachafra added the proposal issue has a proposal to be solved label Jan 31, 2024
@dachafra dachafra added the RML-join issues relate to joins of RML label Jan 31, 2024
@elsdvlee
Collaborator

elsdvlee commented Jan 31, 2024

@dachafra see pull request #78

@dachafra
Member

awesome! Please assign @andimou @pmaria @bjdmeest @DylanVanAssche as potential reviewers

@pmaria
Collaborator

pmaria commented Apr 24, 2024

My view on defining equality of logical sources:

Object equality in programming languages is used as the basis for many things, for example comparison in data structures that rely on uniqueness and hashing (think dictionaries, sets, etc.).
I strongly believe we should be able to leverage this for logical sources. I think source and logical source equality is something that is very useful to have when building RML processors.

Therefore, I would propose to come up with a definition of equality which can be implemented as such.

My proposal would be to define a logical source or source to be equal to another logical source or source if the RML-defined properties of the description of both are equal.

RML-defined: Those properties that are defined by a specification to have behavior that influences the behavior of an RML processor.

These properties MUST be listed for the rml:LogicalSource specification.

These properties MUST be listed for any rml:Source description.

Doing so will allow RML processors to map these descriptions to standard object equality mechanisms in their respective programming languages to best leverage the language's abilities.
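
A minimal Python sketch of what such a mapping to standard object equality might look like, using frozen dataclasses. The classes and their fields are assumptions for illustration: the actual list of RML-defined properties would have to come from the specification, and `path` and `encoding` are invented stand-ins for a source description.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Source:
    # Hypothetical stand-in for an rml:Source description.
    path: str
    encoding: Optional[str] = None


@dataclass(frozen=True)
class LogicalSource:
    # Assumed RML-defined properties: source, reference formulation, iterator.
    source: Source
    reference_formulation: str
    iterator: Optional[str] = None


# Two separately written descriptions with equal RML-defined properties.
a = LogicalSource(Source("data.csv"), "CSV")
b = LogicalSource(Source("data.csv"), "CSV")
c = LogicalSource(Source("other.csv"), "CSV")

if __name__ == "__main__":
    print(a == b)          # equal by value, although distinct objects
    print(len({a, b, c}))  # a and b collapse into one set member
```

With `frozen=True`, the dataclass generates both `__eq__` and `__hash__` from the listed fields, so equal descriptions compare equal and can be used directly as dictionary keys or set members, regardless of whether they were identified by a URI or written as inline blank nodes.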

@dachafra dachafra removed the documentation Improvements or additions to documentation label Jul 3, 2024

4 participants