Jan 19, 2014
csv2rdf4lod-automation is licensed under the Apache License, Version 2.0

This page describes how to query a SPARQL endpoint and capture the provenance of your query.

Path: $CSV2RDF4LOD_HOME/bin/util/


usage: <endpoint> [-p {output,format}] [-o {sparql,gvds,xml,exhibit,csv}+] [-q a.sparql b.sparql ...]*
    execute SPARQL queries against an endpoint requesting the given output formats
            -p : the URL parameter name used to request a different output/format.
    default -p : output
            -o : the URL parameter value(s) to request.
    default -o : sparql xml
    default -q : *.sparql *.rq


Generalized to handle both LOGD's SparqlProxy and DBpedia, which use different URL parameter names (attribute-value pairs) to control the output format.
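To make the role of -p and -o concrete, here is a minimal sketch of how a request URL could be assembled from an endpoint, a parameter name, a parameter value, and a query. It is an illustration only: the function names urlencode and build_url and the endpoint http://example.org/sparql are hypothetical, and the assumption that the query travels in a "query" URL parameter comes from the attribute-value pairs recorded in the provenance, not from the script's source.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration; not part of csv2rdf4lod-automation.

# Percent-encode a string (intended for ASCII input such as a SPARQL query).
urlencode() {
  local LC_ALL=C s="$1" out="" c i
  for (( i = 0; i < ${#s}; i++ )); do
    c="${s:i:1}"
    case "$c" in
      [a-zA-Z0-9.~_-]) out+="$c" ;;                    # unreserved: keep as-is
      *) printf -v c '%%%02X' "'$c"; out+="$c" ;;      # everything else: %XX
    esac
  done
  printf '%s' "$out"
}

# build_url <endpoint> <param-name (-p)> <param-value (-o)> <query-text>
# Assumption: the endpoint reads the query from a "query" URL parameter.
build_url() {
  local endpoint="$1" param="$2" value="$3" query="$4"
  printf '%s?query=%s&%s=%s\n' \
    "$endpoint" "$(urlencode "$query")" "$param" "$(urlencode "$value")"
}

# Same query and format, two different parameter names:
build_url 'http://example.org/sparql' output csv 'SELECT ?s WHERE { ?s ?p ?o }'
build_url 'http://example.org/sparql' format csv 'SELECT ?s WHERE { ?s ?p ?o }'
```

The two calls differ only in the parameter name, which is the distinction that -p exists to capture.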

Example Usage

Have a SPARQL query in a file:

bash-3.2$ cat types.sparql 
SELECT distinct ?type
  GRAPH <>  {
    [] a ?type
} order by ?type

Submit the query to an endpoint, requesting csv results format:

bash-3.2$ -o csv -q types.sparql
  This sets up a results directory containing the results for each query (one, here) in each result format (one, here). It also stores the provenance of the results.

bash-3.2$ l results/
total 24
-rw-r--r--  1 lebot  staff  4301 Mar  2 09:24 types.sparql.csv
-rw-r--r--  1 lebot  staff  2621 Mar  2 09:24 types.sparql.csv.pml.ttl

See some of the results:

bash-3.2$ head results/types.sparql.csv

See all of the provenance:

bash-3.2$ cat results/types.sparql.csv.pml.ttl 
@prefix rdfs:    <> .
@prefix xsd:     <> .
@prefix foaf:    <> .
@prefix dcterms: <> .
@prefix sioc:    <> .
@prefix pmlp:    <> .
@prefix pmlb:    <> .
@prefix pmlj:    <> .
@prefix conv:    <> .

   a pmlp:Information;
   nfo:hasHash <md5_409b9fc94fff24f9ac2547fe2714c33c>;

   a nfo:FileHash; 
   nfo:hashAlgorithm "md5";
   nfo:hashValue "409b9fc94fff24f9ac2547fe2714c33c";

   a pmlp:Information;
   pmlp:hasModificationDateTime "2011-03-02T09:24:58-05:00"^^xsd:dateTime;
   pmlp:hasReferenceSourceUsage <sourceusage_79d94ae0-ea25-48f0-a67b-2d7fb4c553b7>;
   a pmlp:SourceUsage;
   pmlp:hasSource        <>;
   pmlp:hasUsageDateTime "2011-03-02T09:24:58-05:00"^^xsd:dateTime;

   a pmlj:Query, pmlp:Source;
   pmlj:isFromEngine <>;
   pmlj:hasAnswer    <nodeset_79d94ae0-ea25-48f0-a67b-2d7fb4c553b7>;
   a pmlp:InferenceEngine, pmlp:WebService;

   a pmlj:NodeSet;
   pmlj:hasConclusion <types.sparql.csv>;
   pmlj:isConsequentOf [
      a pmlj:InferenceStep;
      pmlj:hasIndex 0;
      pmlj:hasAntecedentList (
         [ a pmlj:NodeSet; pmlp:hasConclusion <query_79d94ae0-ea25-48f0-a67b-2d7fb4c553b7> ]
         [ a pmlj:NodeSet; pmlp:hasConclusion [
               a pmlb:AttributeValuePair;
               pmlb:attribute "output"; pmlb:value "csv"

   a pmlb:AttributeValuePair;
   pmlb:attribute "query";
   pmlb:value     """SELECT distinct ?type
  GRAPH <>  {
    [] a ?type
} order by ?type""";
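The recorded hash can be used to check whether a cached result file has changed since it was retrieved. The following is a minimal sketch under two assumptions: that the nfo:hashValue in the provenance file is the MD5 of the results file next to it, and that either md5sum or the BSD md5 command is available. The function names file_md5, recorded_md5, and verify_result are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration; not part of csv2rdf4lod-automation.

# Print the MD5 hex digest of a file (GNU md5sum, else BSD/macOS md5).
file_md5() {
  if command -v md5sum >/dev/null 2>&1; then
    md5sum "$1" | cut -d ' ' -f 1
  else
    md5 -q "$1"
  fi
}

# Print the first nfo:hashValue literal found in a provenance file.
recorded_md5() {
  grep -oE 'nfo:hashValue[[:space:]]+"[0-9a-f]{32}"' "$1" \
    | grep -oE '[0-9a-f]{32}' | head -n 1
}

# verify_result <results-file> <provenance-file>
# Succeeds only if a hash was recorded and it equals the file's current MD5.
verify_result() {
  local recorded
  recorded="$(recorded_md5 "$2")"
  [ -n "$recorded" ] && [ "$(file_md5 "$1")" = "$recorded" ]
}

# Demo on throwaway files, so the sketch runs without a real results directory.
tmp="$(mktemp -d)"
printf 'type\n' > "$tmp/types.sparql.csv"
printf '   nfo:hashValue "%s";\n' "$(file_md5 "$tmp/types.sparql.csv")" \
  > "$tmp/types.sparql.csv.pml.ttl"
verify_result "$tmp/types.sparql.csv" "$tmp/types.sparql.csv.pml.ttl" \
  && echo "hash matches"
rm -rf "$tmp"
```

Against the example above, the call would be verify_result results/types.sparql.csv results/types.sparql.csv.pml.ttl.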


The --limit-offset argument causes the given query to be submitted repeatedly, with an increasing OFFSET each time, until no useful results are returned.

For example, take a SPARQL query file svn-files.rq that already contains a LIMIT:

construct {
   ?svn a rdfs:Resource;
      prov:wasAttributedTo ?developer .
where {
limit 1000000

Running the query without --limit-offset submits it exactly as-is and produces a single result file:

data/source/us/opendap-svn-file-hierarchy/version/blah$ -o ttl -q ../../src/svn-files.rq -od source

 ls -ln source/

-rw-r--r-- 1 528 301 221661 2014-01-19 16:30 svn-files.rq.ttl
-rw-r--r-- 1 528 301   7719 2014-01-19 16:30 svn-files.rq.ttl.prov.ttl

To get all results, add --limit-offset; the results are then retrieved in chunks of the LIMIT defined in the original query:

data/source/us/opendap-svn-file-hierarchy/version/blah$ -o ttl -q ../../src/svn-files.rq --limit-offset -od source

 ls -ln source/


If --limit-offset is specified but no LIMIT is found in the original query, a default LIMIT of 10,000 is used. Alternatively, the LIMIT can be given explicitly, for example --limit-offset 100000.
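The paging behaviour can be sketched as a loop that re-issues the query with a growing OFFSET and stops at the first empty page. This is an illustration under stated assumptions, not the script's actual code: all function names are hypothetical, "no useful results" is approximated as an empty response, the second argument to fetch_all is only used when the query has no LIMIT of its own, and run_query is a stand-in that serves rows from a fake 25-row result set instead of contacting an endpoint.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of LIMIT/OFFSET paging; names are made up for illustration.

# Print the last LIMIT value found in the query text, or the fallback.
query_limit() {
  local query="$1" fallback="${2:-10000}" found
  found="$(printf '%s\n' "$query" | grep -ioE 'limit[[:space:]]+[0-9]+' \
            | grep -oE '[0-9]+' | tail -n 1)"
  printf '%s\n' "${found:-$fallback}"
}

# Print the query for a 0-based page, adding LIMIT when the query has none.
paged_query() {
  local query="$1" limit="$2" page="$3"
  if ! printf '%s\n' "$query" | grep -iqE 'limit[[:space:]]+[0-9]+'; then
    query="$query"$'\n'"limit $limit"
  fi
  printf '%s\noffset %s\n' "$query" "$(( limit * page ))"
}

# Stand-in for the HTTP request: serves rows from a fake 25-row result set.
run_query() {
  local limit offset
  limit="$(printf '%s\n' "$1" | grep -ioE 'limit[[:space:]]+[0-9]+' | grep -oE '[0-9]+')"
  offset="$(printf '%s\n' "$1" | grep -ioE 'offset[[:space:]]+[0-9]+' | grep -oE '[0-9]+')"
  seq 1 25 | tail -n "+$(( offset + 1 ))" | head -n "$limit"
}

# fetch_all <query-text> [limit-if-query-has-none]
# Keep requesting pages until one comes back empty.
fetch_all() {
  local query="$1" limit page=0 rows
  limit="$(query_limit "$query" "${2:-10000}")"
  while :; do
    rows="$(run_query "$(paged_query "$query" "$limit" "$page")")"
    [ -z "$rows" ] && break
    printf '%s\n' "$rows"
    page=$(( page + 1 ))
  done
}

# With LIMIT 10 and 25 fake rows, pages at OFFSET 0, 10 and 20 return rows
# and the page at OFFSET 30 is empty, which ends the loop.
fetch_all 'select ?s where { ?s ?p ?o } limit 10' | wc -l
```

A real implementation would also need a rule for what counts as an empty page in each output format, since a header-only CSV or an empty Turtle document is not a zero-byte response.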


