oc_graphenricher.enricher package

Module contents

Copyright 2021 Gabriele Pisciotta - ga.pisciotta@gmail.com

Permission to use, copy, modify, and/or distribute this software for any purpose with or without fee is hereby granted, provided that the above copyright notice and this permission notice appear in all copies.

THE SOFTWARE IS PROVIDED “AS IS” AND THE AUTHOR DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.

class oc_graphenricher.enricher.GraphEnricher(g_set, graph_filename='enriched.rdf', provenance_filename='provenance.rdf', info_dir='', debug=False, serialize_in_the_middle=False)[source]

Bases: object

The GraphEnricher class enriches all the entities in a given graph set that complies with the OpenCitations Data Model (OCDM). The graph set is the only required input; the output file name of the enriched graph and the provenance file name can also be specified, and default to 'enriched.rdf' and 'provenance.rdf' respectively. A debug flag is available to get more details about the enrichment process.

Parameters:
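  • g_set (oc_ocdm.graph.GraphSet) – the graph set to enrich

  • graph_filename (str) – the output file name of the enriched graph

  • provenance_filename (str) – the provenance file name

  • info_dir (str) –

  • debug (bool) – a flag to get more details about the enrichment process

  • serialize_in_the_middle (bool) –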

enrich()[source]

The enricher iterates over each BR (bibliographic resource) contained in the graph set. For each BR (skipping issues and journals), it gets the list of the identifiers already contained in the graph set and checks whether the BR already has a DOI, an ISSN and a Wikidata ID:

  • If an ISSN is specified, it queries Crossref to extract other ISSNs.

  • If there’s no DOI, it queries Crossref to get one by means of all the other data extracted.

  • If there’s no Wikidata ID, it queries Wikidata to get one by means of all the other identifiers.

Any new identifier found will be added to the BR.

Then, for each AR (agent role) related to the BR, it gets the list of all the identifiers already contained and:
  • If it doesn’t have an ORCID, it queries ORCID to get it.

  • If it doesn’t have a VIAF, it queries VIAF to get it.

  • If it doesn’t have a Wikidata ID, it queries Wikidata by means of all the other identifiers to get one.

  • If the AR is related to a publisher, it queries Crossref to get its ID by means of its DOI.

Any new identifier found will be added to the AR.

Finally, it stores the new graph set and its provenance.
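The decision logic described for enrich() can be sketched in plain Python as follows. This is only an illustration of which lookups the description implies, not the library's code: the function names, the schema keys ("doi", "issn", "wikidata", "orcid", "viaf") and the type labels ("issue", "journal") are assumptions made for the example, and no external service is actually queried.

```python
from typing import Dict, List

# Assumed labels for the BR types that the description says are skipped.
SKIPPED_BR_TYPES = {"issue", "journal"}


def plan_br_lookups(br_type: str, ids: Dict[str, List[str]]) -> List[str]:
    """List the lookups implied for one BR, given its known identifiers.

    `ids` maps a schema name (e.g. "doi") to the literals already attached.
    """
    if br_type in SKIPPED_BR_TYPES:
        return []
    lookups = []
    if ids.get("issn"):
        lookups.append("crossref: other ISSNs")
    if not ids.get("doi"):
        lookups.append("crossref: DOI")
    if not ids.get("wikidata"):
        lookups.append("wikidata: ID")
    return lookups


def plan_ar_lookups(ids: Dict[str, List[str]], is_publisher: bool) -> List[str]:
    """List the lookups implied for one AR related to the BR."""
    lookups = []
    if not ids.get("orcid"):
        lookups.append("orcid: ORCID")
    if not ids.get("viaf"):
        lookups.append("viaf: VIAF")
    if not ids.get("wikidata"):
        lookups.append("wikidata: ID")
    if is_publisher:
        lookups.append("crossref: publisher ID")
    return lookups


# Example inputs are made up for illustration.
print(plan_br_lookups("article", {"issn": ["1234-5678"]}))
print(plan_ar_lookups({"orcid": ["0000-0002-1825-0097"]}, is_publisher=False))
```

Separating the "what to look up" decision from the actual API calls, as done here, is a choice made for the sketch; it keeps the example runnable offline.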

NB: Although an identifier cannot be duplicated within the same entity, the graph set as a whole may contain different identifiers that share the same schema and literal. To handle this, use the instancematching module after you have enriched the graph set.
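To make the overlap mentioned in the note concrete, here is a minimal sketch that groups identifiers by their (schema, literal) pair and reports the pairs that occur more than once. It does not reproduce the instancematching module: the function name, the tuple layout and the sample URIs are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def find_shared_identifiers(
    identifiers: Iterable[Tuple[str, str, str]],
) -> Dict[Tuple[str, str], List[str]]:
    """Group (identifier_uri, schema, literal) triples by schema and literal.

    Returns only the groups holding more than one identifier URI, i.e. the
    distinct identifier entities that carry the same schema and literal.
    """
    groups: Dict[Tuple[str, str], List[str]] = defaultdict(list)
    for uri, schema, literal in identifiers:
        groups[(schema, literal)].append(uri)
    return {key: uris for key, uris in groups.items() if len(uris) > 1}


# Hypothetical identifier entities after an enrichment run.
sample = [
    ("id/1", "doi", "10.1000/xyz"),
    ("id/2", "orcid", "0000-0002-1825-0097"),
    ("id/3", "doi", "10.1000/xyz"),
]
print(find_shared_identifiers(sample))
```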

Return type:

None

_add_id(entity, literal, schema, by_means_of=None)[source]

Adds a new identifier to an entity, given the literal value, the schema and, optionally, the API used.

Parameters:
  • entity (Union[oc_ocdm.graph.entities.bibliographic.bibliographic_resource.BibliographicResource, oc_ocdm.graph.entities.bibliographic.responsible_agent.ResponsibleAgent]) – a bibliographic resource or an agent role

  • literal (str) – the literal value of the identifier

  • schema (str) – the schema of the identifier

  • by_means_of (Optional[str]) – an optional string that lets you specify the API used

Return type:

None
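The call shape of this method can be illustrated with a toy in-memory model. The classes ToyEntity and ToyIdentifier and the function add_id below are hypothetical stand-ins, not the oc_ocdm entities or the library's implementation; they only show how the literal, the schema and the optional API name travel together.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class ToyIdentifier:
    """Hypothetical identifier record: schema, literal and the API used."""
    schema: str
    literal: str
    by_means_of: Optional[str] = None


@dataclass
class ToyEntity:
    """Hypothetical stand-in for a bibliographic resource or an agent."""
    identifiers: List[ToyIdentifier] = field(default_factory=list)


def add_id(entity: ToyEntity, literal: str, schema: str,
           by_means_of: Optional[str] = None) -> None:
    """Attach a new identifier to the entity, recording the API if given."""
    entity.identifiers.append(ToyIdentifier(schema, literal, by_means_of))


br = ToyEntity()
add_id(br, "10.1000/xyz", "doi", by_means_of="crossref")
add_id(br, "Q42", "wikidata")
print(br.identifiers)
```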

__std_out_err_redirect_tqdm()

This method is used to print messages while the tqdm progress bar is active.
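A common way to do this is to temporarily replace sys.stdout and sys.stderr with a file-like object that forwards every message to a single writer. The sketch below shows that general pattern using only the standard library; it is an assumption about the approach, not the library's code, and the names redirect_std_streams and _ForwardingFile are hypothetical. With tqdm installed, the callback could be tqdm.write, which prints a message without overwriting the bar; here a plain list is used so the example runs without tqdm.

```python
import contextlib
import sys
from typing import Callable, Iterator, List, TextIO


class _ForwardingFile:
    """File-like object that forwards each non-empty write to a callback."""

    def __init__(self, emit: Callable[[str], None]) -> None:
        self._emit = emit

    def write(self, text: str) -> int:
        # print() sends the message and the trailing newline as separate
        # writes; whitespace-only chunks are dropped.
        if text.strip():
            self._emit(text.rstrip("\n"))
        return len(text)

    def flush(self) -> None:
        pass


@contextlib.contextmanager
def redirect_std_streams(emit: Callable[[str], None]) -> Iterator[TextIO]:
    """Route stdout and stderr through `emit`; restore them on exit."""
    original = sys.stdout, sys.stderr
    try:
        sys.stdout = sys.stderr = _ForwardingFile(emit)
        yield original[0]  # the real stdout, e.g. for the progress bar
    finally:
        sys.stdout, sys.stderr = original


messages: List[str] = []
with redirect_std_streams(messages.append):
    print("enriching BR 1")
print(messages)
```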