KGCreator Interface #7

Open
heindorf opened this issue Aug 25, 2020 · 3 comments

Comments

@heindorf
Member

Currently, KGCreator is initialized with __init__(self, path, logger=None) and the transform method returns a file path.

I suggest omitting both parameters:

  • The transform method should return the actual data rather than a path. This makes the KGCreator class more independent of the file system and matches sklearn, whose transform methods return the data itself instead of a file path.
  • I suggest that the transform method return an rdflib Graph.
  • I believe it is bad practice to pass a logger as an argument. All logging should be handled via the logging module, e.g., via logging.getLogger(...).
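
To make the logging suggestion concrete, here is a minimal sketch of a module-level logger obtained via logging.getLogger. The class name KGCreatorSketch and its transform signature are hypothetical stand-ins, not the project's actual code.

```python
import logging

# One logger per module, looked up by name instead of passed in as an argument.
logger = logging.getLogger(__name__)


class KGCreatorSketch:
    """Hypothetical stand-in for KGCreator; takes no logger parameter."""

    def transform(self, rows):
        rows = list(rows)
        logger.info("Transforming %d rows", len(rows))
        return rows


if __name__ == "__main__":
    # The application, not the class, decides where log output goes.
    logging.basicConfig(level=logging.INFO)
    KGCreatorSketch().transform([("s", "p", "o")])
```

With this layout, callers can still redirect or silence the output by configuring the logger for this module's name, so nothing needs to be threaded through __init__.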
@heindorf
Member Author

heindorf commented Aug 25, 2020

Possible implementation:

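import urllib.parse

import rdflib
from sklearn.base import BaseEstimator, TransformerMixin


class DataFrame2Graph(BaseEstimator, TransformerMixin):
    """Turn each DataFrame cell into a triple (row index, column name, cell value)."""

    def __init__(self, base_url='http://example.com/'):
        self.base_url = base_url

    def fit(self, x, y=None):
        # Stateless transformer: nothing to learn.
        return self

    def transform(self, df):
        df = df.astype(str)
        graph = rdflib.Graph()

        for subject, row in df.iterrows():
            # astype(str) does not convert the index, so cast it explicitly.
            subject = urllib.parse.quote(str(subject))
            subject = rdflib.term.URIRef(self.base_url + subject)
            # Series.iteritems() was removed in pandas 2.0; items() replaces it.
            for predicate, obj in row.items():
                predicate = urllib.parse.quote(str(predicate))
                obj = urllib.parse.quote(obj)

                predicate = rdflib.term.URIRef(self.base_url + predicate)
                obj = rdflib.term.URIRef(self.base_url + obj)

                graph.add((subject, predicate, obj))

        return graph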

@Demirrr
Member

Demirrr commented Sep 2, 2020

We omitted rdflib for scalability reasons. Please try your suggested implementation on the provided datasets. You will see that it takes ages :)

@Demirrr
Member

Demirrr commented Oct 22, 2020

A class similar to the one suggested above is now implemented here, although creating the KG with rdflib appears to cost more than writing RDF N-Triples from scratch, i.e. avoiding graph.add(). We used the Graph class of rdflib so that the invalid N-Triples issue does not occur again (see #6).
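
For comparison, writing N-Triples from scratch, without graph.add(), could look roughly like the sketch below. The function name write_ntriples, its parameters, and the example base URL are assumptions for illustration only; this is not the project's actual writer. It skips the validation that rdflib performs, which is the trade-off mentioned above.

```python
import io
import urllib.parse


def write_ntriples(rows, out, base_url="http://example.com/"):
    """Write one N-Triples line per cell and return the number of triples.

    `rows` yields (subject, mapping) pairs, e.g. what df.iterrows() produces.
    """
    def iri(value):
        # Percent-encode the value so the resulting IRI contains no spaces.
        return "<" + base_url + urllib.parse.quote(str(value)) + ">"

    count = 0
    for subject, row in rows:
        s = iri(subject)
        for predicate, obj in row.items():
            # Each triple is streamed out immediately; no graph is kept in memory.
            out.write(f"{s} {iri(predicate)} {iri(obj)} .\n")
            count += 1
    return count


buffer = io.StringIO()
n_triples = write_ntriples([("alice", {"knows": "bob"})], buffer)
```

Because `out` is any writable text stream, the same function can write straight to an open file, which fits a workflow where only the path of the serialised KG is passed on.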

FYI:

  1. The transform method still returns the path of the serialised graph. This differs from the sklearn workflow because sklearn is concerned neither with graphs nor with graph serialisation. Moreover, the path of the KG is sent to PYKE, so we do not need to keep the graph in memory.

  2. Passing a logger as an argument might indeed be bad practice. However, it is better than not having logging at all, right? Moreover, your suggestion is not clear to me, as we already use the logging module. Consequently, no changes have been made to the logging: since the project started, we have not experienced any negative outcome from this style.

@heindorf, please close this issue if these answers are satisfactory.
