
Script: cache-queries.sh

Tim L edited this page Jan 19, 2014 · 16 revisions
csv2rdf4lod-automation is licensed under the [Apache License, Version 2.0](https://github.com/timrdf/csv2rdf4lod-automation/wiki/License)

What is first

What we will cover

This page describes how to use cache-queries.sh to query a SPARQL endpoint and capture the provenance of your query.

Let's get to it!

Path: $CSV2RDF4LOD_HOME/bin/util/cache-queries.sh

Usage:

usage: cache-queries.sh <endpoint> [-p {output,format}] [-o {sparql,gvds,xml,exhibit,csv}+] [-q a.sparql b.sparql ...]*
    execute SPARQL queries against an endpoint requesting the given output formats
            -p : the URL parameter name used to request a different output/format.
    default -p : output
            -o : the URL parameter value(s) to request.
    default -o : sparql xml
    default -q : *.sparql *.rq

Description

Generalized to handle LOGD's SparqlProxy and DBpedia (they use different URL parameter names to control the output format).
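The -p and -o options control the name and value of the URL parameter appended to the query request. The following sketch illustrates how they map onto the request URL (this is an illustration based on the usage text above, not the script's actual internals; the endpoint and query are reused from the example below, and the encoding step only handles spaces for brevity):

```shell
endpoint='http://logd.tw.rpi.edu/sparql'
param='output'          # -p: URL parameter name (DBpedia would need e.g. 'format')
value='csv'             # -o: the requested result format
query='SELECT ?s WHERE { ?s ?p ?o } LIMIT 1'

# Minimal URL-encoding (spaces only) to keep the sketch short.
encoded=$(printf '%s' "$query" | sed 's/ /%20/g')
url="${endpoint}?query=${encoded}&${param}=${value}"
echo "$url"
```

Switching `param` to `format` is all it takes to target an endpoint like DBpedia that names the parameter differently, which is why the script exposes it via -p.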

Example Usage

Have a SPARQL query in a file:

bash-3.2$ cat types.sparql 
SELECT distinct ?type
WHERE {
  GRAPH <http://logd.tw.rpi.edu/vocab/Dataset>  {
    [] a ?type
  }
} order by ?type

Submit the query to an endpoint, requesting csv results format:

bash-3.2$ cache-queries.sh http://logd.tw.rpi.edu/sparql -o csv -q types.sparql
types.sparql
  csv

cache-queries.sh sets up a results directory with the results for each query (one, here) and result format (one, here). It also stores the provenance of the results.

bash-3.2$ ls -l results/
total 24
-rw-r--r--  1 lebot  staff  4301 Mar  2 09:24 types.sparql.csv
-rw-r--r--  1 lebot  staff  2621 Mar  2 09:24 types.sparql.csv.pml.ttl

See some of the results:

bash-3.2$ head results/types.sparql.csv
"type"
"http://inference-web.org/2.0/pml-justification.owl#InferenceStep"
"http://inference-web.org/2.0/pml-justification.owl#NodeSet"
"http://inference-web.org/2.0/pml-provenance.owl#DocumentFragmentByRowCol"
"http://inference-web.org/2.0/pml-provenance.owl#InferenceEngine"
"http://inference-web.org/2.0/pml-provenance.owl#Information"
"http://inference-web.org/2.0/pml-provenance.owl#Source"
"http://inference-web.org/2.0/pml-provenance.owl#SourceUsage"
"http://inference-web.org/2.1exper/pml-provenance.owl#AntecedentRole"
"http://logd.tw.rpi.edu/source/ars-usda-gov/vocab/Dataset"

See all of the provenance:

bash-3.2$ cat results/types.sparql.csv.pml.ttl 
@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd:     <http://www.w3.org/2001/XMLSchema#> .
@prefix foaf:    <http://xmlns.com/foaf/0.1/> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix sioc:    <http://rdfs.org/sioc/ns#> .
@prefix pmlp:    <http://inference-web.org/2.0/pml-provenance.owl#> .
@prefix pmlb:    <http://inference-web.org/2.b/pml-provenance.owl#> .
@prefix pmlj:    <http://inference-web.org/2.0/pml-justification.owl#> .
@prefix conv:    <http://purl.org/twc/vocab/conversion/> .
@prefix nfo:     <http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#> .

<types.sparql.csv>
   a pmlp:Information;
   nfo:hasHash <md5_409b9fc94fff24f9ac2547fe2714c33c>;
.

<md5_409b9fc94fff24f9ac2547fe2714c33c>
   a nfo:FileHash; 
   nfo:hashAlgorithm "md5";
   nfo:hashValue "409b9fc94fff24f9ac2547fe2714c33c";
.

<types.sparql.csv>
   a pmlp:Information;
   pmlp:hasModificationDateTime "2011-03-02T09:24:58-05:00"^^xsd:dateTime;
   pmlp:hasReferenceSourceUsage <sourceusage_79d94ae0-ea25-48f0-a67b-2d7fb4c553b7>;
.
<sourceusage_79d94ae0-ea25-48f0-a67b-2d7fb4c553b7>
   a pmlp:SourceUsage;
   pmlp:hasSource        <http://logd.tw.rpi.edu/sparql?query=SELECT%20distinct%20%3Ftype%0AWHERE%20%7B%0A%20%20GRAPH%20%3Chttp%3A%2F%2Flogd.tw.rpi.edu%2Fvocab%2FDataset%3E%20%20%7B%0A%20%20%20%20%5B%5D%20a%20%3Ftype%0A%20%20%7D%0A%7D%20order%20by%20%3Ftype%0A&output=csv>;
   pmlp:hasUsageDateTime "2011-03-02T09:24:58-05:00"^^xsd:dateTime;
.

<http://logd.tw.rpi.edu/sparql?query=SELECT%20distinct%20%3Ftype%0AWHERE%20%7B%0A%20%20GRAPH%20%3Chttp%3A%2F%2Flogd.tw.rpi.edu%2Fvocab%2FDataset%3E%20%20%7B%0A%20%20%20%20%5B%5D%20a%20%3Ftype%0A%20%20%7D%0A%7D%20order%20by%20%3Ftype%0A&output=csv>
   a pmlj:Query, pmlp:Source;
   pmlj:isFromEngine <http://logd.tw.rpi.edu/sparql>;
   pmlj:hasAnswer    <nodeset_79d94ae0-ea25-48f0-a67b-2d7fb4c553b7>;
.
<http://logd.tw.rpi.edu/sparql>
   a pmlp:InferenceEngine, pmlp:WebService;
.

<nodeset_79d94ae0-ea25-48f0-a67b-2d7fb4c553b7>
   a pmlj:NodeSet;
   pmlj:hasConclusion <types.sparql.csv>;
   pmlj:isConsequentOf [
      a pmlj:InferenceStep;
      pmlj:hasIndex 0;
      pmlj:hasAntecedentList (
         [ a pmlj:NodeSet; pmlp:hasConclusion <query_79d94ae0-ea25-48f0-a67b-2d7fb4c553b7> ]
         [ a pmlj:NodeSet; pmlp:hasConclusion [
               a pmlb:AttributeValuePair;
               pmlb:attribute "output"; pmlb:value "csv"
             ]
         ]
      );
   ];
.

<query_79d94ae0-ea25-48f0-a67b-2d7fb4c553b7>
   a pmlb:AttributeValuePair;
   pmlb:attribute "query";
   pmlb:value     """SELECT distinct ?type
WHERE {
  GRAPH <http://logd.tw.rpi.edu/vocab/Dataset>  {
    [] a ?type
  }
} order by ?type""";
.

--limit-offset

The --limit-offset argument causes cache-queries.sh to re-run the given query with an increasing OFFSET until no useful results are returned.
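The iteration can be sketched in shell as follows. This is a minimal illustration of the idea, not the script's actual code; `run_query` is a stand-in for the real HTTP request and simulates an endpoint holding 25 rows:

```shell
# Simulated endpoint: returns up to $1 (LIMIT) rows starting at $2 (OFFSET),
# out of 25 total rows. A stand-in for the real SPARQL request.
run_query() {
   limit=$1; offset=$2; total=25
   i=$offset
   end=$(( offset + limit ))
   [ "$end" -gt "$total" ] && end=$total
   while [ "$i" -lt "$end" ]; do
      echo "row$(( i + 1 ))"
      i=$(( i + 1 ))
   done
}

limit=10
offset=0
pages=0
while : ; do
   results=$(run_query "$limit" "$offset")
   [ -z "$results" ] && break        # no useful results: stop paging
   pages=$(( pages + 1 ))
   offset=$(( offset + limit ))
done
echo "$pages pages"                  # prints: 3 pages
```

With 25 rows and a LIMIT of 10, the loop fetches pages at OFFSET 0, 10, and 20, then stops when OFFSET 30 returns nothing.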

For example, with a SPARQL query file svn-files.rq that already contains a LIMIT:

construct {
   ?svn a rdfs:Resource;
      prov:wasAttributedTo ?developer .
}
where {
...
}
limit 1000000

We can run the query without --limit-offset to get the single result file that comes from submitting the query exactly as-is:

data/source/us/opendap-svn-file-hierarchy/version/blah$ 
 cache-queries.sh http://opendap.tw.rpi.edu/sparql -o ttl -q ../../src/svn-files.rq -od source

data/source/us/opendap-svn-file-hierarchy/version/blah$ 
 ls -ln source/

-rw-r--r-- 1 528 301 221661 2014-01-19 16:30 svn-files.rq.ttl
-rw-r--r-- 1 528 301   7719 2014-01-19 16:30 svn-files.rq.ttl.prov.ttl

But if we want all of the results, we can add --limit-offset to retrieve them in chunks of the LIMIT defined in the original query:

data/source/us/opendap-svn-file-hierarchy/version/blah$ 
 cache-queries.sh http://opendap.tw.rpi.edu/sparql -o ttl -q ../../src/svn-files.rq --limit-offset -od source

data/source/us/opendap-svn-file-hierarchy/version/blah$ 
 ls -ln source/

???

If --limit-offset is specified but no LIMIT is found in the original query, a default LIMIT of 10,000 is used. Alternatively, the LIMIT can be given explicitly, e.g. --limit-offset 100000.
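The LIMIT-selection rule just described can be sketched as a small helper. `pick_limit` is a hypothetical function for illustration, not part of cache-queries.sh; it prefers a LIMIT found in the query, then an explicit --limit-offset value, then the 10,000 default:

```shell
# Hypothetical helper illustrating the LIMIT selection described above.
# args: query-string [explicit-limit-from---limit-offset]
pick_limit() {
   query=$1; explicit=${2:-}
   # Pull the first "LIMIT <n>" out of the query text, case-insensitively.
   in_query=$(printf '%s\n' "$query" | grep -io 'limit [0-9][0-9]*' | head -1 | awk '{print $2}')
   if [ -n "$in_query" ]; then
      echo "$in_query"       # LIMIT already in the query: page by that
   elif [ -n "$explicit" ]; then
      echo "$explicit"       # explicit --limit-offset value
   else
      echo 10000             # default when no LIMIT is found
   fi
}
```

For example, `pick_limit 'SELECT * WHERE { ?s ?p ?o }' 100000` yields 100000, while the same query with no second argument yields the 10,000 default.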

What is next?
