Script: cache-queries.sh
This page describes how to use cache-queries.sh to query a SPARQL endpoint and capture the provenance of your query.
Path: $CSV2RDF4LOD_HOME/bin/util/cache-queries.sh
usage: cache-queries.sh <endpoint> [-p {output,format}] [-o {sparql,gvds,xml,exhibit,csv}+] [-q a.sparql b.sparql ...]*
execute SPARQL queries against an endpoint requesting the given output formats
-p : the URL parameter name used to request a different output/format.
default -p : output
-o : the URL parameter value(s) to request.
default -o : sparql xml
default -q : *.sparql *.rq
Generalized to handle LOGD's SparqlProxy and DBpedia (they use different URL parameter names to control the output format).
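As a rough sketch of what this amounts to (in Python, not the script itself): the -p parameter name (default "output") and each -o value become one extra URL parameter alongside the encoded query. The function name build_request_url is illustrative, not part of cache-queries.sh.

```python
# Illustrative sketch only: mimics how a cache-queries.sh request URL could be
# assembled. The endpoint and query are from the examples on this page.
from urllib.parse import urlencode

def build_request_url(endpoint, query, output_param="output", output_value="xml"):
    # -p supplies output_param (DBpedia uses "format"); -o supplies output_value
    return endpoint + "?" + urlencode({"query": query, output_param: output_value})

url = build_request_url("http://logd.tw.rpi.edu/sparql",
                        "SELECT distinct ?type WHERE { [] a ?type }",
                        output_value="csv")
print(url)
```

The same function called with output_param="format" would produce a DBpedia-style request, which is the generalization the -p flag provides.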
Have a SPARQL query in a file:
bash-3.2$ cat types.sparql
SELECT distinct ?type
WHERE {
  GRAPH <http://logd.tw.rpi.edu/vocab/Dataset> {
    [] a ?type
  }
} order by ?type
Submit the query to an endpoint, requesting csv results format:
bash-3.2$ cache-queries.sh http://logd.tw.rpi.edu/sparql -o csv -q types.sparql
cache-queries.sh sets up a results directory with the results for each query (one here: types.sparql) and each result format (one here: csv). It also stores the provenance of the results.
bash-3.2$ ls -l results/
total 24
-rw-r--r-- 1 lebot staff 4301 Mar 2 09:24 types.sparql.csv
-rw-r--r-- 1 lebot staff 2621 Mar 2 09:24 types.sparql.csv.pml.ttl
See some of the results:
bash-3.2$ head results/types.sparql.csv
"type"
"http://inference-web.org/2.0/pml-justification.owl#InferenceStep"
"http://inference-web.org/2.0/pml-justification.owl#NodeSet"
"http://inference-web.org/2.0/pml-provenance.owl#DocumentFragmentByRowCol"
"http://inference-web.org/2.0/pml-provenance.owl#InferenceEngine"
"http://inference-web.org/2.0/pml-provenance.owl#Information"
"http://inference-web.org/2.0/pml-provenance.owl#Source"
"http://inference-web.org/2.0/pml-provenance.owl#SourceUsage"
"http://inference-web.org/2.1exper/pml-provenance.owl#AntecedentRole"
"http://logd.tw.rpi.edu/source/ars-usda-gov/vocab/Dataset"
See all of the provenance:
bash-3.2$ cat results/types.sparql.csv.pml.ttl
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix sioc: <http://rdfs.org/sioc/ns#> .
@prefix pmlp: <http://inference-web.org/2.0/pml-provenance.owl#> .
@prefix pmlb: <http://inference-web.org/2.b/pml-provenance.owl#> .
@prefix pmlj: <http://inference-web.org/2.0/pml-justification.owl#> .
@prefix conv: <http://purl.org/twc/vocab/conversion/> .
@prefix nfo: <http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#> .
<types.sparql.csv>
a pmlp:Information;
nfo:hasHash <md5_409b9fc94fff24f9ac2547fe2714c33c>;
.
<md5_409b9fc94fff24f9ac2547fe2714c33c>
a nfo:FileHash;
nfo:hashAlgorithm "md5";
nfo:hashValue "409b9fc94fff24f9ac2547fe2714c33c";
.
<types.sparql.csv>
a pmlp:Information;
pmlp:hasModificationDateTime "2011-03-02T09:24:58-05:00"^^xsd:dateTime;
pmlp:hasReferenceSourceUsage <sourceusage_79d94ae0-ea25-48f0-a67b-2d7fb4c553b7>;
.
<sourceusage_79d94ae0-ea25-48f0-a67b-2d7fb4c553b7>
a pmlp:SourceUsage;
pmlp:hasSource <http://logd.tw.rpi.edu/sparql?query=SELECT%20distinct%20%3Ftype%0AWHERE%20%7B%0A%20%20GRAPH%20%3Chttp%3A%2F%2Flogd.tw.rpi.edu%2Fvocab%2FDataset%3E%20%20%7B%0A%20%20%20%20%5B%5D%20a%20%3Ftype%0A%20%20%7D%0A%7D%20order%20by%20%3Ftype%0A&output=csv>;
pmlp:hasUsageDateTime "2011-03-02T09:24:58-05:00"^^xsd:dateTime;
.
<http://logd.tw.rpi.edu/sparql?query=SELECT%20distinct%20%3Ftype%0AWHERE%20%7B%0A%20%20GRAPH%20%3Chttp%3A%2F%2Flogd.tw.rpi.edu%2Fvocab%2FDataset%3E%20%20%7B%0A%20%20%20%20%5B%5D%20a%20%3Ftype%0A%20%20%7D%0A%7D%20order%20by%20%3Ftype%0A&output=csv>
a pmlj:Query, pmlp:Source;
pmlj:isFromEngine <http://logd.tw.rpi.edu/sparql>;
pmlj:hasAnswer <nodeset_79d94ae0-ea25-48f0-a67b-2d7fb4c553b7>;
.
<http://logd.tw.rpi.edu/sparql>
a pmlp:InferenceEngine, pmlp:WebService;
.
<nodeset_79d94ae0-ea25-48f0-a67b-2d7fb4c553b7>
a pmlj:NodeSet;
pmlj:hasConclusion <types.sparql.csv>;
pmlj:isConsequentOf [
a pmlj:InferenceStep;
pmlj:hasIndex 0;
pmlj:hasAntecedentList (
[ a pmlj:NodeSet; pmlj:hasConclusion <query_79d94ae0-ea25-48f0-a67b-2d7fb4c553b7> ]
[ a pmlj:NodeSet; pmlj:hasConclusion [
a pmlb:AttributeValuePair;
pmlb:attribute "output"; pmlb:value "csv"
]
]
);
];
.
<query_79d94ae0-ea25-48f0-a67b-2d7fb4c553b7>
a pmlb:AttributeValuePair;
pmlb:attribute "query";
pmlb:value """SELECT distinct ?type
WHERE {
GRAPH <http://logd.tw.rpi.edu/vocab/Dataset> {
[] a ?type
}
} order by ?type""";
.
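The provenance above records an md5 FileHash (nfo:hasHash) for the cached result file. As a hedged sketch of how that recorded value could be checked against the file, assuming the hash helper below (md5_of is illustrative, not part of the script):

```python
# Illustrative: recompute the md5 recorded in the nfo:FileHash above and
# compare it to the nfo:hashValue in the .pml.ttl provenance file.
import hashlib

def md5_of(path):
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

# e.g. md5_of("results/types.sparql.csv") should equal the recorded
# "409b9fc94fff24f9ac2547fe2714c33c"
```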
The --limit-offset argument causes cache-queries.sh to iterate the given query with the OFFSET keyword until no useful results are returned. For example, suppose the SPARQL query file svn-files.rq already contains a LIMIT:
construct {
  ?svn a rdfs:Resource;
       prov:wasAttributedTo ?developer .
}
where {
  ...
}
limit 1000000
We can run the query without --limit-offset to get the single result that follows from submitting the query exactly as-is:
data/source/us/opendap-svn-file-hierarchy/version/blah$
cache-queries.sh http://opendap.tw.rpi.edu/sparql -o ttl -q ../../src/svn-files.rq -od source
data/source/us/opendap-svn-file-hierarchy/version/blah$
ls -ln source/
-rw-r--r-- 1 528 301 221661 2014-01-19 16:30 svn-files.rq.ttl
-rw-r--r-- 1 528 301 7719 2014-01-19 16:30 svn-files.rq.ttl.prov.ttl
But if we want all of the results, we can add --limit-offset to retrieve them in chunks of the LIMIT defined in the original query:
data/source/us/opendap-svn-file-hierarchy/version/blah$
cache-queries.sh http://opendap.tw.rpi.edu/sparql -o ttl -q ../../src/svn-files.rq --limit-offset -od source
data/source/us/opendap-svn-file-hierarchy/version/blah$
ls -ln source/
???
If --limit-offset is specified but no LIMIT is found in the original query, a default LIMIT of 10,000 is used. Alternatively, the LIMIT to use can be given explicitly, e.g. --limit-offset 100000
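The iteration itself amounts to re-issuing the query with an increasing OFFSET until an empty page comes back. A minimal sketch of that loop (fetch_page stands in for the HTTP request the script performs; it is not part of cache-queries.sh):

```python
# Illustrative sketch of the --limit-offset loop: request LIMIT-sized pages
# at increasing OFFSETs and stop when a page returns no results.
def fetch_all(fetch_page, limit=10000):  # 10,000 mirrors the default LIMIT
    rows, offset = [], 0
    while True:
        page = fetch_page(limit, offset)
        if not page:         # "no useful results" -> stop iterating
            break
        rows.extend(page)
        offset += limit
    return rows

# toy stand-in for an endpoint holding 25 rows, paged 10 at a time
data = list(range(25))
assert fetch_all(lambda limit, offset: data[offset:offset+limit], limit=10) == data
```

Note the loop always issues one final request that returns nothing, which is how it detects the end of the results.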