Finding Linksets among Linked Data Bubbles
- CKAN - a walk through of how to add and annotate dataset entries (and the extra requirements to suit the lodcloud group).
- One click data dump - easy access to the list of all URIs in a csv2rdf4lod node.
- https://github.com/jimmccusker/twc-healthdata/wiki/Listing-twc-healthdata-as-a-LOD-Cloud-Bubble
- The analysis described in this page can be enabled as part of csv2rdf4lod's Secondary Derivative Datasets framework.
This page describes how to calculate VoID Linksets between a csv2rdf4lod node and all other bubbles in the Linked Data Diagram, using csv2rdf4lod-automations' one-click data dump and lodcloud's "namespace" annotations. Calculating the Linksets makes it easier to find out how a bubble is connected to others, which also makes it easier to assert the CKAN lodcloud annotation required to get into the diagram.
To find links, we need two things:
- A list of all RDF nodes in a bubble. We can get this rather easily by running csv2rdf4lod's one-click data dump through nt-nodes.sh.
- The namespace for each Linked Data bubble, which is given with the "namespace" annotation in CKAN. For example,
  - http://datahub.io/dataset/2000-us-census-rdf's namespace is http://www.rdfabout.com/rdf/usgov/geo/, and
  - http://datahub.io/dataset/a-seobook-dataset's namespace is http://seobook.blog.com.
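The first ingredient, the list of RDF nodes, can be illustrated with a minimal Python sketch. It is not nt-nodes.sh itself; it assumes the data dump is N-Triples with one triple per line, and that "nodes" means the URIs in subject and object position (predicates, literals, datatype URIs, and blank nodes are skipped). The function name `uri_nodes` is hypothetical.

```python
import re

# Subject (URI or blank node), predicate URI, then the rest of the line as the object.
TRIPLE = re.compile(r'^\s*(<[^>]*>|_:\S+)\s+<[^>]*>\s+(.*?)\s*\.\s*$')


def uri_nodes(ntriples_lines):
    """Return the distinct URIs that appear as subject or object (assumed meaning of 'node')."""
    nodes = set()
    for line in ntriples_lines:
        match = TRIPLE.match(line)
        if not match:
            continue  # comments, blank lines, or anything this simple pattern cannot read
        for term in (match.group(1), match.group(2)):
            # Literals start with a quote and blank nodes with "_:", so only URIs pass.
            if term.startswith('<') and term.endswith('>'):
                nodes.add(term[1:-1])
    return nodes
```

A regular expression is enough for a sketch like this, but a real N-Triples parser is the safer choice for production data.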
We can get a bubble's namespace by POSTing its URI to a deployed instance of lift-ckan.py (such as the one addressed in the command below), which provides a good RDF description of the contorted annotations in the CKAN data entry.
curl -H "Content-Type: text/turtle" \
-d '<http://datahub.io/dataset/2000-us-census-rdf> a <http://purl.org/twc/vocab/datafaqs#CKANDataset> .' \
http://aquarius.tw.rpi.edu/projects/datafaqs/services/sadi/ckan/lift-ckan
returns the following RDF triples (among others). The one we need is void:uriSpace.
<http://datahub.io/dataset/2000-us-census-rdf> a datafaqs:CKANDataset;
ov:shortName "US Census (rdfabout)";
dcterms:title "2000 U.S. Census in RDF (rdfabout.com)";
void:sparqlEndpoint <http://www.rdfabout.com/sparql>;
void:triples 1002848918;
void:uriSpace "http://www.rdfabout.com/rdf/usgov/geo/" .
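The same request can be sketched in Python with only the standard library. This is an illustration under assumptions, not part of csv2rdf4lod: the function names are hypothetical, the service URL is whatever lift-ckan.py deployment you have access to, and the regular expression only handles the simple `void:uriSpace "..."` form shown above rather than parsing Turtle properly.

```python
import re
import urllib.request

DATAFAQS_CKAN_DATASET = "http://purl.org/twc/vocab/datafaqs#CKANDataset"


def build_lift_request(dataset_uri, service_url):
    """Build the POST request that the curl command above sends."""
    body = "<{}> a <{}> .".format(dataset_uri, DATAFAQS_CKAN_DATASET)
    return urllib.request.Request(
        service_url,
        data=body.encode("utf-8"),  # supplying data makes urllib issue a POST
        headers={"Content-Type": "text/turtle"},
    )


def extract_uri_spaces(turtle_text):
    """Pull void:uriSpace values out of a Turtle response (regex shortcut, not a parser)."""
    return re.findall(r'void:uriSpace\s+"([^"]*)"', turtle_text)


def fetch_uri_spaces(dataset_uri, service_url):
    """Send the request and return the namespaces; needs a reachable service."""
    request = build_lift_request(dataset_uri, service_url)
    with urllib.request.urlopen(request) as response:
        return extract_uri_spaces(response.read().decode("utf-8"))
```

Calling `fetch_uri_spaces` with the dataset URI and service URL from the curl command would be expected to yield the namespace shown above, provided the service is still running.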
For example, http://datahub.io/dataset/twc-logd's namespace is http://logd.tw.rpi.edu/, and http://datahub.io/dataset/twc-healthdata references the URIs http://logd.tw.rpi.edu/id/medicare-gov/provider/340070 and http://logd.tw.rpi.edu/id/medicare-gov/provider/340071. Both URIs fall within twc-logd's namespace, so they link twc-healthdata to twc-logd.
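The matching step itself amounts to a prefix test of each node URI against each bubble's namespace. The sketch below shows that idea only; it is not the code in cr-linksets.sh, and the function name `find_linksets` and the dictionary layout are assumptions made for illustration.

```python
def find_linksets(node_uris, namespaces):
    """Group our node URIs by the bubble whose namespace they start with.

    node_uris:  iterable of URIs mentioned in our bubble.
    namespaces: dict mapping a bubble's CKAN dataset URI to its namespace.
    Returns a dict mapping each linked bubble to the set of matching URIs.
    """
    linksets = {}
    for uri in node_uris:
        for dataset, namespace in namespaces.items():
            if uri.startswith(namespace):
                linksets.setdefault(dataset, set()).add(uri)
    return linksets


namespaces = {
    "http://datahub.io/dataset/twc-logd": "http://logd.tw.rpi.edu/",
    "http://datahub.io/dataset/2000-us-census-rdf": "http://www.rdfabout.com/rdf/usgov/geo/",
}
healthdata_nodes = [
    "http://logd.tw.rpi.edu/id/medicare-gov/provider/340070",
    "http://logd.tw.rpi.edu/id/medicare-gov/provider/340071",
]
links = find_linksets(healthdata_nodes, namespaces)
```

Here `links` contains a single entry, for twc-logd, holding both provider URIs. A caller would normally leave its own bubble out of `namespaces` so that it does not report a Linkset to itself.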
cr-linksets.sh creates a versioned dataset. To find the non-empty linksets, run:
find automatic -type f -size +0b -name linkset.txt
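For post-processing in a script, the same search can be sketched in Python. The directory name `automatic` and the file name `linkset.txt` come from the command above; the function name is hypothetical.

```python
from pathlib import Path


def non_empty_linksets(root="automatic"):
    """Return every linkset.txt under root that is a regular file with at least one byte."""
    return sorted(
        path
        for path in Path(root).rglob("linkset.txt")
        if path.is_file() and path.stat().st_size > 0
    )
```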
When 50 URIs occur in both http://datahub.io/dataset/twc-healthdata and http://datahub.io/dataset/2000-us-census-rdf, it is represented in VoID like this:
<http://datahub.io/dataset/twc-healthdata>
void:subset :linkset_2000c93158fafa9776550172052af7dc .
:linkset_2000c93158fafa9776550172052af7dc
a void:Linkset, void:Dataset;
void:target
<http://datahub.io/dataset/twc-healthdata>,
<http://datahub.io/dataset/2000-us-census-rdf>;
void:triples 50;
.
<http://www.rdfabout.com/rdf/usgov/geo/blah_1> void:inDataset :linkset_2000c93158fafa9776550172052af7dc .
<http://www.rdfabout.com/rdf/usgov/geo/blah_2> void:inDataset :linkset_2000c93158fafa9776550172052af7dc .
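A minimal sketch of how such a description could be generated from a set of shared URIs follows. It mirrors the layout of the example above and, like that example, omits the prefix declarations; the function name `linkset_turtle` is hypothetical and this is not necessarily how cr-linksets.sh writes its output.

```python
def linkset_turtle(source, target, linkset_id, shared_uris):
    """Render a VoID Linkset description in the shape of the example above."""
    name = ":linkset_" + linkset_id
    lines = [
        "<{}>".format(source),
        "   void:subset {} .".format(name),
        name,
        "   a void:Linkset, void:Dataset;",
        "   void:target",
        "      <{}>,".format(source),
        "      <{}>;".format(target),
        # The count reported is the number of distinct shared URIs.
        "   void:triples {};".format(len(set(shared_uris))),
        ".",
    ]
    for uri in sorted(set(shared_uris)):
        lines.append("<{}> void:inDataset {} .".format(uri, name))
    return "\n".join(lines) + "\n"
```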
We can name the Linkset by hashing the targets and current date. For example:
md5.sh -qs http://datahub.io/dataset/twc-healthdata`date +%s`http://datahub.io/dataset/2000-us-census-rdf
2000c93158fafa9776550172052af7dc
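The same naming scheme can be sketched in Python with `hashlib`. It concatenates the source dataset, the epoch seconds, and the target dataset in the order used above, then takes the MD5 hex digest. The function name and the optional `timestamp` parameter are assumptions added so that the result can be reproduced.

```python
import hashlib
import time


def linkset_id(source, target, timestamp=None):
    """Hash source + epoch seconds + target into a 32-character hex name."""
    if timestamp is None:
        timestamp = int(time.time())  # plays the role of `date +%s`
    text = "{}{}{}".format(source, timestamp, target)
    return hashlib.md5(text.encode("utf-8")).hexdigest()
```

Because the current time is part of the hashed string, recalculating a Linkset later yields a different name, which fits the versioned dataset that cr-linksets.sh creates.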
- http://opendap.tw.rpi.edu/instances/void:Linkset
- http://ieeevis.tw.rpi.edu/instances/void:Linkset
- http://healthdata.tw.rpi.edu/instances/void:Linkset
This is cheaper to calculate because we don't need to go through the hassle of finding and retrieving the full data dump of each bubble, and we have less instance data to process. However, it misses connections where another bubble mentions the same URIs that we do when those URIs are not in that bubble's own namespace.
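The trade-off can be made concrete with a small sketch that contrasts the two strategies. All URIs and function names here are hypothetical: the other bubble's namespace is http://example.org/other/, and both bubbles also mention a third-party URI under http://example.net/.

```python
def links_by_namespace(our_nodes, their_namespace):
    """Cheap check: which of our URIs fall inside the other bubble's namespace?"""
    return {uri for uri in our_nodes if uri.startswith(their_namespace)}


def links_by_dump(our_nodes, their_nodes):
    """Expensive check: which URIs appear in both bubbles' full node lists?"""
    return set(our_nodes) & set(their_nodes)


our_nodes = {
    "http://example.org/ours/thing/1",
    "http://example.net/shared/topic",  # third-party URI that we mention
}
their_nodes = {
    "http://example.org/other/item/9",
    "http://example.net/shared/topic",  # the other bubble mentions it too
}

cheap = links_by_namespace(our_nodes, "http://example.org/other/")
full = links_by_dump(our_nodes, their_nodes)
```

With this data `cheap` is empty while `full` contains the shared third-party URI, which is exactly the kind of connection the namespace-based calculation cannot see.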
- How hard is it to get one click data dumps for bubbles that do not use csv2rdf4lod-automation?
- What is the disparity between the manual assertion on the CKAN entry and what was actually found?
- How can we model the Linkset calculation so that it naturally provides justification for the resulting CKAN annotation? (SIO-qualifying the void:triples triple and saying it prov:wasDerivedFrom the analysis that produced it. Tie into Jim's aggregation thesis?)
- Some thoughts on How to characterize a list of RDF node URIs
- CKAN lodcloud RDF vocabulary to use add-metadata.py to submit the Linksets to CKAN (done automatically with cr pingback).
- Finding Vocabularies that Datasets Use
- https://github.com/timrdf/vsr/wiki/Centrifuge - a new view on the lodcloud, instead of the bubble blob.
- edu.rpi.tw.string.uri.NamingAuthorityMatrix implements the PLD citation graph.