LubmDataGeneration
This page describes how we loaded the data generated by LUBM into Quest.
At the moment Quest uses the OWL API to load TBoxes and ABoxes, which is very inefficient for large ABoxes. We need a lighter mechanism that does little parsing and can stream triples.
Solution:
- Generate all LUBM data files.
- Transform and merge all the data into a simple triple file (e.g., N-Triples).
- Create a new ABox assertion streamer that reads the file line by line with very simple parsing.
The LUBM data files are generated with the standard LUBM data generator tool, using the command:
java -cp classes/ edu.lehigh.swat.bench.uba.Generator -univ 1000 -onto http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl
Transforming and merging (Source)
To do this we will use Jena, in particular the command rdfcat.
- Setting up Jena. Download Jena and set up your environment as follows:
- Add the following to your .bashrc file:
export JENAROOT=~/Documents/OBDA/related_software/Jena-2.6.4
export PATH=$JENAROOT/bin:$PATH
- Execute the command:
chmod u+x $JENAROOT/bin/*
With Jena configured, we can now process the original data and dump it as N-Triples with the command
find . -type f -name "University*.owl" -exec rdfcat -out N-TRIPLE -x {} >> University0-99.nt \;
We also need to remove imports and other non-data triples with the commands:
cat University0-99.nt | grep -v http://www.w3.org/2002/07/owl#Ont > University0-99-clean.nt
cat University0-99-clean.nt | grep -v http://www.w3.org/2002/07/owl#imports > University0-99-clean2.nt
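The two filtering passes above can also be done in a single pass. The following is a minimal sketch rather than the commands used on this page: the function name strip_owl_metadata is hypothetical, and it relies only on the POSIX grep options -F, -v and -e.

```shell
#!/bin/bash
# Hypothetical helper: drop ontology declarations and owl:imports triples in one pass.
# Reads N-Triples on stdin and writes the remaining triples to stdout.
strip_owl_metadata() {
  # -F: treat the patterns as fixed strings, so "." and "#" are matched literally
  # -v: keep only the lines that match none of the patterns
  grep -F -v \
    -e 'http://www.w3.org/2002/07/owl#Ont' \
    -e 'http://www.w3.org/2002/07/owl#imports'
}

# Usage (file names taken from the commands above):
#   strip_owl_metadata < University0-99.nt > University0-99-clean2.nt
```

Doing both filters in one grep avoids writing the intermediate University0-99-clean.nt file, which matters when the merged file is large.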
To merge each university into a single nt file we used the following bash script:
#!/bin/bash
echo "Generating nt files"
for i in {0..99}
do
  echo "Doing uni $i"
  # ${i}_ keeps the underscore out of the variable name; "$i_" would expand the (unset) variable i_
  find . -type f -name "University${i}_*.owl" -exec rdfcat -out N-TRIPLE -x {} >> uni$i.nt \;
done
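To clean all files we ran the following script. Note that it expects the files to be named university-data-$i.nt, whereas the merge script above writes uni$i.nt; adjust one of the two names before running it.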
#!/bin/bash
echo "Cleaning nt files"
for i in {0..99}
do
  echo "Doing uni $i"
  cat university-data-$i.nt | grep -v http://www.w3.org/2002/07/owl#Ont | grep -v http://www.w3.org/2002/07/owl#imports > university-data-$i.nt.tmp
  rm university-data-$i.nt
  mv university-data-$i.nt.tmp university-data-$i.nt
done
To load the triples we are going to use the Quest method .load(Iterator<Assertion>), and we will implement an N-Triples reader that generates an Iterator for the data it reads. The reader is very simple and doesn't support all features of N-Triples. Specifically, blank nodes and literal typing are not supported yet.
Requirements
- One triple per line, finished with "."
- URIs delimited by <>
- Literals delimited by ""
- No other content in the file.
<http://www.Department9.University9.edu/UndergraduateStudent73> <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#takesCourse> <http://www.Department9.University9.edu/Course2> .
<http://www.Department9.University9.edu/UndergraduateStudent73> <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#takesCourse> <http://www.Department9.University9.edu/Course21> .
<http://www.Department9.University9.edu/UndergraduateStudent306> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#UndergraduateStudent> .
<http://www.Department9.University9.edu/UndergraduateStudent306> <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#name> "UndergraduateStudent306" .
<http://www.Department9.University9.edu/UndergraduateStudent306> <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#memberOf> <http://www.Department9.University9.edu> .
<http://www.Department9.University9.edu/UndergraduateStudent306> <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#emailAddress> "UndergraduateStudent306@Department9.University9.edu" .
<http://www.Department9.University9.edu/UndergraduateStudent306> <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#telephone> "xxx-xxx-xxxx" .
<http://www.Department9.University9.edu/UndergraduateStudent306> <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#takesCourse> <http://www.Department9.University9.edu/Course20> .
<http://www.Department9.University9.edu/UndergraduateStudent306> <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#takesCourse> <http://www.Department9.University9.edu/Course10> .
<http://www.Department9.University9.edu/UndergraduateStudent306> <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#advisor> <http://www.Department9.University9.edu/AssociateProfessor5> .
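Because the reader only accepts the restricted form listed in the requirements, it can help to check a file before loading it. The sketch below is an illustration, not part of Quest: the function name find_unsupported_lines and the regular expression are hypothetical, it assumes exactly one space between terms (as in the sample above), and it reports literals containing escaped quotes as unsupported.

```shell
#!/bin/bash
# Hypothetical check: print (with line numbers) every line that is NOT of the form
#   <uri> <uri> <uri> .     or     <uri> <uri> "literal" .
# Blank nodes, typed literals, comments and empty lines are all reported.
find_unsupported_lines() {
  grep -E -v -n '^<[^>]+> <[^>]+> (<[^>]+>|"[^"]*") \.$' "$1"
}

# Usage: no output (and exit status 1 from grep) means every line fits the subset.
#   find_unsupported_lines University0-99-clean2.nt
```

Any line this prints would have to be removed or rewritten before the file is handed to the simple reader.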
Before (PostgreSQL settings in postgresql.conf):
shared_buffers = 32MB
#work_mem = 1MB # min 64kB
#maintenance_work_mem = 16MB # min 1MB
#max_stack_depth = 2MB # min 100kB
#wal_level = minimal # minimal, archive, or hot_standby
#checkpoint_segments = 3 # in logfile segments, min 1, 16MB each
#archive_mode = off
#max_wal_senders = 0 # max number of walsender processes
checkpoint_timeout = 5min # range 30s-1h
#effective_cache_size = 128MB
#fsync = on # turns forced synchronization on or off
#synchronous_commit = on # synchronization level; on, off, or local
Now
shared_buffers = 3GB
work_mem = 24MB # min 64kB
maintenance_work_mem = 256MB # min 1MB
max_stack_depth = 7680kB # min 100kB
wal_level = minimal # minimal, archive, or hot_standby
checkpoint_segments = 15 # in logfile segments, min 1, 16MB each
archive_mode = off
max_wal_senders = 0 # max number of walsender processes
checkpoint_timeout = 10min # range 30s-1h
effective_cache_size = 4GB
Now
shared_buffers = 2GB
work_mem = 10MB # min 64kB
maintenance_work_mem = 128MB # min 1MB
max_stack_depth = 4MB # min 100kB
wal_level = minimal # minimal, archive, or hot_standby
checkpoint_segments = 10 # in logfile segments, min 1, 16MB each
archive_mode = off
max_wal_senders = 0 # max number of walsender processes
checkpoint_timeout = 10min # range 30s-1h
effective_cache_size = 1GB
fsync = off # turns forced synchronization on or off
synchronous_commit = off # synchronization level; on, off, or local