KGCreator Interface #7

heindorf · 2020-08-25T11:10:39Z

Currently, KGCreator is initialized with __init__(self, path, logger=None) and the transform method returns a file path.

I would suggest omitting both parameters:

The transform method should not return a path but the actual data. This makes the KGCreator class more independent of the file system, and it matches sklearn, whose transform method returns the actual data instead of a file path.
I would suggest that the return type of the transform method be an rdflib Graph.
I believe it is bad practice to pass a logger as an argument. All logging should be handled via the logging module, e.g., via logging.getLogger(...).
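The two suggestions above can be sketched roughly as follows. This is only an illustration under stated assumptions: the class body, the `base_url` parameter, and the dict-of-dicts input are hypothetical and not taken from the project, and a plain list of tuples stands in for the rdflib Graph so that the sketch runs with the standard library alone.

```python
import logging

# Module-level logger obtained from the logging module; the application,
# not the class, decides how log records are handled.
logger = logging.getLogger(__name__)


class KGCreator:
    """Hypothetical stand-in for the class discussed in this issue."""

    def __init__(self, base_url="http://example.com/"):
        # No path and no logger parameter (assumption for this sketch).
        self.base_url = base_url

    def transform(self, rows):
        """Return the triples as data (a list of tuples), not a file path."""
        triples = []
        for subject, row in rows.items():
            for predicate, obj in row.items():
                triples.append((self.base_url + subject,
                                self.base_url + predicate,
                                self.base_url + obj))
        logger.info("Created %d triples", len(triples))
        return triples


if __name__ == "__main__":
    # The caller configures logging once, e.g. at program start.
    logging.basicConfig(level=logging.INFO)
    print(KGCreator().transform({"alice": {"knows": "bob"}}))
```

With this layout, callers who want the log output configure it themselves via `logging.basicConfig(...)` or their own handlers, and the class needs no knowledge of that configuration.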


heindorf · 2020-08-25T14:15:55Z

Possible implementation:

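import urllib.parse

import rdflib
from sklearn.base import BaseEstimator, TransformerMixin


class DataFrame2Graph(BaseEstimator, TransformerMixin):
    """Turn each DataFrame cell into a triple (row index, column, value)."""

    def __init__(self, base_url='http://example.com/'):
        self.base_url = base_url

    def fit(self, x, y=None):
        # Stateless transformer: nothing to learn.
        return self

    def transform(self, df):
        df = df.astype(str)
        graph = rdflib.Graph()

        for subject, row in df.iterrows():
            # astype(str) does not convert the index, so cast it explicitly.
            subject = urllib.parse.quote(str(subject))
            subject = rdflib.term.URIRef(self.base_url + subject)
            # Series.iteritems() was removed in pandas 2.0; use items().
            for predicate, obj in row.items():
                predicate = urllib.parse.quote(str(predicate))
                obj = urllib.parse.quote(obj)

                predicate = rdflib.term.URIRef(self.base_url + predicate)
                obj = rdflib.term.URIRef(self.base_url + obj)

                graph.add((subject, predicate, obj))

        return graph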

Demirrr · 2020-09-02T08:16:59Z

We omitted rdflib for scalability reasons. Please try your suggested implementation on the provided datasets. You will see that it takes ages :)

Demirrr · 2020-10-22T07:04:35Z

A class similar to the one suggested above is implemented here, although creating the KG with rdflib appears to be more expensive than writing RDF N-Triples from scratch, i.e. avoiding graph.add(). We used the Graph class of rdflib so that the invalid N-Triples issue will not occur again (see #6).
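For readers unfamiliar with the alternative mentioned here, writing N-Triples "from scratch" could look roughly like the sketch below. It is not the project's code: the function name `write_ntriples`, the dict-of-dicts input, and the `base_url` default are assumptions made for illustration. Each value is percent-encoded so that spaces and angle brackets cannot end up inside an IRI.

```python
import io
from urllib.parse import quote


def write_ntriples(rows, out, base_url="http://example.com/"):
    """Write one N-Triples line per cell to `out` and return the line count.

    `rows` maps a subject name to a {predicate: object} dict
    (hypothetical input shape for this sketch).
    """
    count = 0
    for subject, row in rows.items():
        s = f"<{base_url}{quote(str(subject))}>"
        for predicate, obj in row.items():
            p = f"<{base_url}{quote(str(predicate))}>"
            o = f"<{base_url}{quote(str(obj))}>"
            # N-Triples: subject, predicate, object, terminated by " ."
            out.write(f"{s} {p} {o} .\n")
            count += 1
    return count


buffer = io.StringIO()
n = write_ntriples({"alice": {"knows": "bob smith"}}, buffer)
print(n, buffer.getvalue(), end="")
```

Streaming lines to a file object like this avoids building a graph in memory, at the cost of losing the validation that a library such as rdflib performs.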

FYI:

The transform method still returns the path of the serialised graph. This differs from the workflow in sklearn because sklearn is concerned neither with graphs nor with graph serialisation. Moreover, the path of the KG is sent to PYKE, so we do not need to keep the graph in memory.
Passing a logger as an argument might indeed be bad practice. However, it is better than not having a logging module, right? Moreover, your suggestion is not clear to me, as we already use the logging module. Consequently, no changes were made regarding logging: since the project started, we have not experienced any negative outcome from this style.

@heindorf, please close this issue if the answers are satisfactory.
