oc_graphenricher.instancematching package

Submodules

oc_graphenricher.instancematching.generate_test_graphset module

Copyright 2021 Gabriele Pisciotta - ga.pisciotta@gmail.com

Permission to use, copy, modify, and/or distribute this software for any purpose with or without fee is hereby granted, provided that the above copyright notice and this permission notice appear in all copies.

THE SOFTWARE IS PROVIDED “AS IS” AND THE AUTHOR DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.

oc_graphenricher.instancematching.generate_test_graphset.add_one_author_with_single_id(type, literal)
oc_graphenricher.instancematching.generate_test_graphset.add_one_author_with_two_id(type, literal)
oc_graphenricher.instancematching.generate_test_graphset.add_article()
oc_graphenricher.instancematching.generate_test_graphset.add_br_with_one_author(name)
oc_graphenricher.instancematching.generate_test_graphset.add_id(entity, literal, schema, g_set)

oc_graphenricher.instancematching.test_instancematching module

class oc_graphenricher.instancematching.test_instancematching.TestInstanceMatching(methodName='runTest')

Bases: TestCase

setUp()

Hook method for setting up the test fixture before exercising it.

Return type:

None

test_ras_merged()
test_ids_not_duplicated()
test_orphan_ra()
test_orphan_ar()
test_brs_merged()
test_brs_have_only_one_list_of_authors()
test_remove_files()

Module contents

class oc_graphenricher.instancematching.InstanceMatching(g_set, graph_filename='matched.rdf', provenance_filename='provenance.rdf', info_dir='', debug=False)

Bases: object

The InstanceMatching class deduplicates all the entities (Bibliographic Resources, Agent Roles) in a given graph set compliant with the OpenCitations Data Model (OCDM). The graph set is the only required argument. You can also specify the output file name of the deduplicated graph, the provenance file name, and a debug flag to get more details about the enrichment process.

Parameters:
  • g_set (oc_ocdm.graph.GraphSet) –

  • info_dir (str) –

match()

Start the matching process, which performs the following steps in sequence:
  • match the Responsible Agents (RAs)

  • match the Bibliographic Resources (BRs)

  • match the IDs

In the end, this process will produce:
  • matched.rdf, which will contain the graph set specified previously, without the duplicates.

  • provenance.rdf, which will contain the provenance, that is, a record of all the changes made.

save()

Serialize the graph set into the specified RDF file, and the provenance in another specified RDF file.

instance_matching_ra()

Discover all the Responsible Agents (RAs) that share the same identifier literal, and build a graph of them. Then merge each connected component (a cluster of RAs linked by the same identifier) into one. For each pair of RAs to be merged, replace the references to the RA that will no longer exist: remove that RA from each Agent Role (AR) that refers to it and add the merged RA instead.

If the Responsible Agent (RA) linked by the Agent Role (AR) that will no longer exist is not linked by any other Agent Role (AR), it is marked for deletion; otherwise it is kept.

In the end, generate the provenance and commit the pending changes in the graph set.
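The clustering step described above can be illustrated with a minimal, library-independent sketch. It is not the package's implementation: the function name cluster_by_shared_identifier, the plain-string agent names, and the identifier literals are all assumptions made for illustration. It only shows how agents linked, directly or transitively, by a shared identifier literal end up in the same connected component.

```python
from collections import defaultdict


def cluster_by_shared_identifier(agent_ids):
    """Group agents linked, directly or transitively, by a shared identifier.

    agent_ids maps a (hypothetical) agent name to the set of identifier
    literals attached to it. Returns one sorted cluster per connected component.
    """
    # Index: identifier literal -> agents carrying it.
    by_literal = defaultdict(set)
    for agent, literals in agent_ids.items():
        for literal in literals:
            by_literal[literal].add(agent)

    # Undirected graph: two agents are adjacent when they share a literal.
    neighbours = {agent: set() for agent in agent_ids}
    for agents in by_literal.values():
        for agent in agents:
            neighbours[agent] |= agents - {agent}

    # Depth-first search visits each connected component exactly once.
    seen = set()
    clusters = []
    for start in agent_ids:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:
            node = stack.pop()
            component.append(node)
            for nxt in neighbours[node] - seen:
                seen.add(nxt)
                stack.append(nxt)
        clusters.append(sorted(component))
    return clusters


# Illustrative data only: ra/1 and ra/3 share no identifier directly,
# but both are linked to ra/2, so all three form one cluster.
agents = {
    "ra/1": {"orcid:0000-0001"},
    "ra/2": {"orcid:0000-0001", "viaf:42"},
    "ra/3": {"viaf:42"},
    "ra/4": {"orcid:0000-0002"},
}
clusters = cluster_by_shared_identifier(agents)
```

Each resulting cluster with more than one member corresponds to a group of agents that would be merged into a single one; singleton clusters are left untouched.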

instance_matching_br()

Discover all the Bibliographic Resources (BRs) that share the same identifier literal, and build a graph of them. Then merge each connected component (a cluster of BRs linked by the same identifier) into one. For each pair of BRs to be merged, also merge:

  • their containers, matching them by type (issue of BR1 -> issue of BR2)

  • their publishers

In the end, generate the provenance and commit the pending changes in the graph set.
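The type-based pairing of containers mentioned above might look roughly like the following sketch. It assumes, purely for illustration, that each resource's container hierarchy is available as a list of (type, name) tuples; the function name pair_containers_by_type and the sample names are hypothetical and do not come from the package.

```python
def pair_containers_by_type(chain_a, chain_b):
    """Pair the containers of two resources that have the same type.

    Each chain is a list of (type, name) tuples describing the containers of
    one resource. Containers whose type appears in only one chain are skipped.
    """
    by_type_b = {kind: name for kind, name in chain_b}
    return [
        (name_a, by_type_b[kind])
        for kind, name_a in chain_a
        if kind in by_type_b
    ]


# Illustrative data only: the second resource has no volume, so only the
# issues and the journals are paired.
chain_br1 = [("issue", "br/10"), ("volume", "br/11"), ("journal", "br/12")]
chain_br2 = [("issue", "br/20"), ("journal", "br/22")]
pairs = pair_containers_by_type(chain_br1, chain_br2)
```

Each pair lists two containers of the same type, which are the candidates for merging when their contained resources are merged.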

instance_matching_id()

Discover all the IDs related to Bibliographic Resources (BRs) and Responsible Agents (RAs) that share the same schema and literal, then merge each such group into one ID and replace all the references with the merged one. In the end, generate the provenance and commit the pending changes in the graph set.
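As a rough illustration of this step, the sketch below groups identifiers by their (schema, literal) pair and rewrites references to point at one survivor per group. The data layout (plain dictionaries), the function names, and the choice of keeping the first identifier seen are assumptions for the example, not the package's behaviour.

```python
from collections import defaultdict


def merge_duplicate_ids(ids):
    """Map every ID name to the survivor of its (schema, literal) group.

    ids maps a (hypothetical) ID name to a (schema, literal) tuple.
    The first ID seen in each group is kept; this choice is arbitrary.
    """
    groups = defaultdict(list)
    for name, key in ids.items():
        groups[key].append(name)

    replacement = {}
    for names in groups.values():
        survivor = names[0]
        for name in names:
            replacement[name] = survivor
    return replacement


def rewrite_references(references, replacement):
    """Replace each referenced ID with its survivor, dropping repeats in order."""
    return {
        entity: list(dict.fromkeys(replacement[i] for i in id_names))
        for entity, id_names in references.items()
    }


# Illustrative data only: id/1 and id/2 share schema and literal; id/4 has the
# same literal but a different schema, so it is not a duplicate.
ids = {
    "id/1": ("doi", "10.1000/a"),
    "id/2": ("doi", "10.1000/a"),
    "id/3": ("orcid", "0000-0001"),
    "id/4": ("issn", "10.1000/a"),
}
references = {"br/1": ["id/1"], "br/2": ["id/2", "id/4"], "ra/1": ["id/3"]}
replacement = merge_duplicate_ids(ids)
rewritten = rewrite_references(references, replacement)
```

Matching on the pair rather than on the literal alone keeps identifiers with the same text but different schemas apart.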

__get_part_of()

Given a Bibliographic Resource (BR) as input (e.g. a journal article), walk the full ‘part-of’ chain. Returns a list of BRs that form the hierarchy of containers (e.g. given an article -> [issue, journal]).

Parameters:

br (oc_ocdm.graph.entities.bibliographic.bibliographic_resource.BibliographicResource) – a Bibliographic Resource (BR)

Return partofs:

a list that contains the Bibliographic Resources (BRs) of the hierarchy
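The chain walk can be sketched without the library as follows. The part_of dictionary standing in for the graph, the function name get_part_of_chain, and the cycle guard are assumptions for illustration; the actual private method operates on oc_ocdm entities.

```python
def get_part_of_chain(resource, part_of):
    """Return the containers of a resource, from the nearest to the outermost.

    part_of maps a (hypothetical) resource name to the name of its direct
    container. Resources without a container are absent from the mapping.
    """
    chain = []
    visited = {resource}
    current = part_of.get(resource)
    # Stop at the top of the hierarchy, or if malformed data forms a cycle.
    while current is not None and current not in visited:
        chain.append(current)
        visited.add(current)
        current = part_of.get(current)
    return chain


# Illustrative data only.
part_of = {"article": "issue", "issue": "volume", "volume": "journal"}
chain = get_part_of_chain("article", part_of)
```

A resource with no container, such as the journal here, yields an empty list.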

__get_publisher()

Given a Bibliographic Resource (BR) as input, return the Agent Role (AR) that is its publisher.

__get_association_ar_ra()

Collect all the Agent Roles (ARs) associated with the same Responsible Agent (RA).

Return association:

a dictionary with a Responsible Agent (RA) as key and a list of Agent Roles (ARs) as value
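The shape of the returned dictionary can be illustrated with a short sketch that inverts a role-to-agent mapping. The input mapping, the string names, and the function name associate_roles_with_agents are hypothetical; the real method works on the entities of the graph set.

```python
from collections import defaultdict


def associate_roles_with_agents(role_to_agent):
    """Invert a role -> agent mapping into agent -> list of roles."""
    association = defaultdict(list)
    for role, agent in role_to_agent.items():
        association[agent].append(role)
    return dict(association)


# Illustrative data only: two roles are held by the same agent.
role_to_agent = {"ar/1": "ra/1", "ar/2": "ra/1", "ar/3": "ra/2"}
association = associate_roles_with_agents(role_to_agent)
```

With such a dictionary, checking whether an agent is still referenced by any role reduces to a key lookup.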

__get_association_ar_br()

Collect all the Bibliographic Resources (BRs) associated with the same Agent Role (AR).

Return association:

a dictionary with an Agent Role (AR) as key and a list of Bibliographic Resources (BRs) as value