Welcome to oc_graphenricher’s documentation!
OC GraphEnricher
A tool to enrich any OpenCitations Data Model (OCDM) compliant Knowledge Graph, finding new identifiers and deduplicating entities.
You can integrate this package into your own Python program or use it from the CLI.
License
Distributed under the ISC License. See LICENSE for more information.
Contact
Gabriele Pisciotta - ga.pisciotta@gmail.com
Project Link: https://github.com/opencitations/oc_graphenricher
Acknowledgements
This project has been developed as part of the Wikipedia Citations in Wikidata research project, under the supervision of prof. Silvio Peroni.
How to install
Installing from PyPI
To get the latest official release of this package, follow these steps:
Install Python >= 3.8:
sudo apt install python3
Install oc_graphenricher via pip:
pip install oc-graphenricher
Installing from the sources
You can also build the package from the sources. With Python already installed, follow these steps:
Clone this repository and move into it:
git clone https://github.com/opencitations/oc_graphenricher
cd ./oc_graphenricher
Install poetry:
pip install poetry
Install all the dependencies:
poetry install
Build the package:
poetry build
Install the package:
pip install ./dist/oc_graphenricher-<VERSION>.tar.gz
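Whichever installation route you take, a quick way to confirm that the package landed in the active environment is to ask Python for the installed distribution's version. The sketch below uses only the standard library. The helper name installed_version is hypothetical, and the distribution name is assumed to be the one used in the pip command above.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str = "oc-graphenricher") -> Optional[str]:
    """Return the installed version of dist_name, or None if it is not installed."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        # The distribution is not present in the active environment.
        return None


if __name__ == "__main__":
    version = installed_version()
    if version is None:
        print("oc-graphenricher is not installed in this environment")
    else:
        print(f"oc-graphenricher {version} is installed")
```

If the function returns None right after installing, the most likely cause is that pip and the Python interpreter you are running belong to different environments.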
Run the tests
To run the tests (from the root of the project):
poetry run test
Tutorial
OC GraphEnricher accepts only GraphSet objects as input. To create one:
from oc_ocdm.reader import Reader
from oc_ocdm.graph import GraphSet
from rdflib import Graph

g = Graph()
g = g.parse('../data/test_dump.ttl', format='nt11')
reader = Reader()
g_set = GraphSet(base_iri='https://w3id.org/oc/meta/')
entities = reader.import_entities_from_graph(g_set, g, enable_validation=False, resp_agent='https://w3id.org/oc/meta/prov/pa/2')
Enrichment
At this point, to run the enrichment phase:
from oc_graphenricher.enricher import GraphEnricher

enricher = GraphEnricher(g_set)
enricher.enrich()
A progress bar shows an estimate of the time needed and the average time spent on each Bibliographic Resource (BR) enriched.
Deduplication
Once the enriched graph set has been serialized and read back in as the g_set object, run the deduplication step:
from oc_graphenricher.instancematching import InstanceMatching

matcher = InstanceMatching(g_set)
matcher.match()
The match method runs the following steps sequentially:
- deduplication of Responsible Agents (RAs)
- deduplication of Bibliographic Resources (BRs)
- deduplication of Identifiers (IDs)
- save to file
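To give an intuition of what deduplication involves, here is a toy sketch. It is not the InstanceMatching implementation. It assumes (the text above does not say so) that two entities count as duplicates when they share at least one identifier value, and it groups them with a small union-find structure. The function name, entity names and identifier values are all made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


def cluster_by_shared_identifier(entities: Dict[str, Iterable[str]]) -> List[Set[str]]:
    """Group entity names linked, directly or transitively, by a shared identifier."""
    parent = {name: name for name in entities}

    def find(name: str) -> str:
        # Walk up to the cluster representative, shortening the path on the way.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    first_owner: Dict[str, str] = {}
    for name, identifiers in entities.items():
        for identifier in identifiers:
            if identifier in first_owner:
                # Identifier already seen on another entity: merge the two clusters.
                parent[find(name)] = find(first_owner[identifier])
            else:
                first_owner[identifier] = name

    clusters: Dict[str, Set[str]] = defaultdict(set)
    for name in entities:
        clusters[find(name)].add(name)
    return list(clusters.values())


if __name__ == "__main__":
    # Hypothetical data: br/1, br/2 and br/3 are chained by a shared DOI and PMID.
    sample = {
        "br/1": ["doi:10.1/a"],
        "br/2": ["doi:10.1/a", "pmid:7"],
        "br/3": ["pmid:7"],
        "br/4": ["doi:10.9/z"],
    }
    for cluster in cluster_by_shared_identifier(sample):
        print(sorted(cluster))
```

In this toy model each cluster with more than one member would be collapsed into a single entity. How the real matcher decides which entity survives and how it rewrites references is not covered here.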
If needed, you can also run each of the three deduplication steps independently of the others.
To deduplicate Responsible Agents (RAs):
from oc_graphenricher.instancematching import InstanceMatching

matcher = InstanceMatching(g_set)
matcher.instance_matching_ra()
matcher.save()
To deduplicate Bibliographic Resources (BRs):
from oc_graphenricher.instancematching import InstanceMatching

matcher = InstanceMatching(g_set)
matcher.instance_matching_br()
matcher.save()
To deduplicate Identifiers (IDs):
from oc_graphenricher.instancematching import InstanceMatching

matcher = InstanceMatching(g_set)
matcher.instance_matching_id()
matcher.save()
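If you often run only some of these steps, the three snippets above can be folded into one small helper. The following is a minimal sketch: run_dedup_steps and the short step keys are hypothetical names introduced here, while the method names are the ones used in the examples above. It assumes that calling save() once after several steps is acceptable, which mirrors the order described for match().

```python
from typing import Iterable

# Method names come from the examples above; the short keys are made up for this sketch.
STEP_METHODS = {
    "ra": "instance_matching_ra",
    "br": "instance_matching_br",
    "id": "instance_matching_id",
}


def run_dedup_steps(matcher, steps: Iterable[str] = ("ra", "br", "id")) -> None:
    """Run the selected deduplication steps on matcher in the given order, then save once."""
    steps = list(steps)
    unknown = [step for step in steps if step not in STEP_METHODS]
    if unknown:
        # Fail before touching the graph if any step name is misspelled.
        raise ValueError(f"unknown deduplication step(s): {unknown}")
    for step in steps:
        getattr(matcher, STEP_METHODS[step])()
    matcher.save()
```

With a matcher built as shown above, run_dedup_steps(matcher, ["ra", "br"]) would deduplicate RAs and BRs and skip the identifier step.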