frbr: CSHALS 2011 tutorial
http://bitly.com/hfQTNE (this page)
If you have questions, please email http://tw.rpi.edu/instances/TimLebo.
Jim McCusker used csv2rdf4lod to incorporate some data for his Semantic Healthcare and Life Sciences Tutorial (his slides). This (on-the-fly!) tutorial provides some more detail on how he did it. I am piecing it together from the Provenance captured by csv2rdf4lod while Jim originally used it for his demo.
Blog about our tutorials: http://www.genomeweb.com/informatics/semantic-technologies-bear-fruit-spite-development-challenges.
You can get the source at:
https://github.com/timrdf/csv2rdf4lod-automation/tree/master/doc/examples/source/ncbi-nih-gov
Installing csv2rdf4lod automation
Data: ftp://ftp.ncbi.nih.gov/gene/DATA/gene2go.gz
Step a.1: [name](Conversion process phase: name) the data:
- base URI:
http://sparql.tw.rpi.edu/ontowiki/
- source:
ncbi-nih-gov
- dataset:
gene2go
- version:
2011-Feb-23
(see Conversion process phase: name)
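The four names above determine both the local working directory used in the later steps and the dataset URI that appears in the converted output in Step a.8. A minimal sketch of that mapping follows; the function names `version_dir` and `version_uri` are hypothetical helpers for illustration and are not part of csv2rdf4lod, and the patterns are inferred only from the paths and URIs shown on this page.

```shell
#!/bin/bash
# Hypothetical helpers (not csv2rdf4lod commands); patterns inferred from this page.

# Local working directory for one version of a dataset.
version_dir() {
  local root="$1" source_id="$2" dataset_id="$3" version_id="$4"
  echo "${root%/}/source/${source_id}/${dataset_id}/version/${version_id}"
}

# URI of the versioned dataset; a trailing slash on the base URI is dropped
# so the result contains a single slash before "source".
version_uri() {
  local base_uri="$1" source_id="$2" dataset_id="$3" version_id="$4"
  echo "${base_uri%/}/source/${source_id}/dataset/${dataset_id}/version/${version_id}"
}

version_dir "$HOME/Desktop/cshals-2011-demo" ncbi-nih-gov gene2go 2011-Feb-23
version_uri http://sparql.tw.rpi.edu/ontowiki/ ncbi-nih-gov gene2go 2011-Feb-23
```

Note that the URI carries a `dataset/` segment that the local directory path does not.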
Use the HTTP modification date to name the version:
bash-3.2$ curl -I ftp://ftp.ncbi.nih.gov/gene/DATA/gene2go.gz
Last-Modified: Wed, 23 Feb 2011 07:49:05 GMT
Content-Length: 12359614
Accept-ranges: bytes
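Turning the `Last-Modified` header into a version name like `2011-Feb-23` can be scripted. The sketch below assumes the header has the layout shown above (weekday, day, month, year); `version_from_last_modified` is a hypothetical helper name, not a csv2rdf4lod command.

```shell
#!/bin/bash
# Hypothetical helper: read response headers on stdin and print YEAR-Mon-DD
# taken from the first Last-Modified line.
version_from_last_modified() {
  awk 'tolower($1) == "last-modified:" { printf "%s-%s-%s\n", $5, $4, $3; exit }'
}

# Demo using the header text shown above.
printf 'Last-Modified: Wed, 23 Feb 2011 07:49:05 GMT\nContent-Length: 12359614\n' \
  | version_from_last_modified
```

In practice the headers would be piped in from `curl -sI <url>`, provided the server reports a `Last-Modified` value as it does in the transcript above.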
Step a.2: [retrieve](Conversion process phase: retrieve) the data: Create the directory to keep a local copy of NIH's data:
mkdir -p ~/Desktop/cshals-2011-demo/source/ncbi-nih-gov/gene2go/version/2011-Feb-23/source
cd ~/Desktop/cshals-2011-demo/source/ncbi-nih-gov/gene2go/version/2011-Feb-23/source
Step a.3: Get the gzip file, uncompress it, and log the provenance:
pcurl.sh ftp://ftp.ncbi.nih.gov/gene/DATA/gene2go.gz
gunzip -c gene2go.gz > gene2go
justify.sh gene2go.gz gene2go uncompress
cd ~/Desktop/cshals-2011-demo/source/ncbi-nih-gov/gene2go/version/2011-Feb-23/
Step a.4: [csv-ify](Conversion process phase: csv-ify) the data: We only need the Homo sapiens rows (taxonomy ID 9606), and the file is tab-delimited (we need comma-separated). We also want to strip out the GO: prefix so that we can construct URIs that overlap with bio2rdf. We make a manual tweak and store it in manual/ (capturing the provenance):
mkdir manual/
grep "^9606" source/gene2go | perl -pe 's/^/"/; s/GO://; s/\t/","/g; s/$/"/' > manual/gene2go-9606.csv
Step a.5: Create verbatim interpretation of tabular literals ([create](Conversion process phase: create conversion trigger) and [pull](Conversion process phase: pull conversion trigger) the conversion trigger):
cr-create-convert-sh.sh -w manual/gene2go-9606.csv
./convert-gene2go.sh
Step a.6: Cheat and get Jim's [tweaked](Conversion process phase: tweak enhancement parameters) enhanced interpretation parameters:
curl -L https://github.com/timrdf/csv2rdf4lod-automation/raw/master/doc/examples/source/ncbi-nih-gov/gene2go/version/2011-Feb-23/manual/gene2go-9606.csv.e1.params.ttl > manual/gene2go-9606.csv.e1.params.ttl
Step a.7: Create enhanced interpretation of tabular literals ([pull](Conversion process phase: pull conversion trigger) the conversion trigger again):
./convert-gene2go.sh
Step a.8: Check out automatic/gene2go-9606.csv.e1.ttl
<http://bio2rdf.org/geneid:2>
dcterms:isReferencedBy <http://sparql.tw.rpi.edu/ontowiki/source/ncbi-nih-gov/dataset/gene2go/version/2011-Feb-23> ;
a gene2go_vocab:Gene ;
dcterms:identifier "2" ;
e1:has_species <http://bio2rdf.org/taxon:9606> ;
e1:has_evidence_code "IDA" ;
e1:has_evidence_code "TAS" ;
e1:has_evidence_code "IPI" ;
e1:has_evidence_code "NAS" ;
e1:has_evidence_code "IEA" ;
skos:broadMatch <http://purl.org/obo/owl/GO#GO_0001869> ;
skos:broadMatch <http://purl.org/obo/owl/GO#GO_0002576> ;
skos:broadMatch <http://purl.org/obo/owl/GO#GO_0004867> ;
skos:broadMatch <http://purl.org/obo/owl/GO#GO_0005096> ;
skos:broadMatch <http://purl.org/obo/owl/GO#GO_0005515> ;
skos:broadMatch <http://purl.org/obo/owl/GO#GO_0005576> ;
skos:broadMatch <http://purl.org/obo/owl/GO#GO_0005615> ;
skos:broadMatch <http://purl.org/obo/owl/GO#GO_0005829> ;
skos:broadMatch <http://purl.org/obo/owl/GO#GO_0006953> ;
skos:broadMatch <http://purl.org/obo/owl/GO#GO_0007264> ;
skos:broadMatch <http://purl.org/obo/owl/GO#GO_0007584> ;
skos:broadMatch <http://purl.org/obo/owl/GO#GO_0007596> ;
skos:broadMatch <http://purl.org/obo/owl/GO#GO_0007597> ;
skos:broadMatch <http://purl.org/obo/owl/GO#GO_0010037> ;
skos:broadMatch <http://purl.org/obo/owl/GO#GO_0019838> ;
skos:broadMatch <http://purl.org/obo/owl/GO#GO_0019899> ;
skos:broadMatch <http://purl.org/obo/owl/GO#GO_0019959> ;
skos:broadMatch <http://purl.org/obo/owl/GO#GO_0019966> ;
skos:broadMatch <http://purl.org/obo/owl/GO#GO_0030168> ;
skos:broadMatch <http://purl.org/obo/owl/GO#GO_0031093> ;
skos:broadMatch <http://purl.org/obo/owl/GO#GO_0043120> ;
skos:broadMatch <http://purl.org/obo/owl/GO#GO_0051056> ;
skos:broadMatch <http://purl.org/obo/owl/GO#GO_0051384> ;
ov:csvRow "4"^^xsd:integer ;
ov:csvRow "5"^^xsd:integer ;
ov:csvRow "6"^^xsd:integer ;
ov:csvRow "7"^^xsd:integer ;
ov:csvRow "8"^^xsd:integer ;
ov:csvRow "9"^^xsd:integer , "10"^^xsd:integer ;
ov:csvRow "11"^^xsd:integer ;
ov:csvRow "12"^^xsd:integer ;
ov:csvRow "13"^^xsd:integer ;
ov:csvRow "14"^^xsd:integer ;
ov:csvRow "15"^^xsd:integer ;
ov:csvRow "16"^^xsd:integer ;
ov:csvRow "17"^^xsd:integer ;
ov:csvRow "18"^^xsd:integer ;
ov:csvRow "19"^^xsd:integer ;
ov:csvRow "20"^^xsd:integer ;
ov:csvRow "21"^^xsd:integer ;
ov:csvRow "22"^^xsd:integer ;
ov:csvRow "23"^^xsd:integer ;
ov:csvRow "24"^^xsd:integer ;
ov:csvRow "25"^^xsd:integer ;
ov:csvRow "26"^^xsd:integer ;
ov:csvRow "27"^^xsd:integer .
Data: ftp://ftp.ncbi.nih.gov/gene/DATA/GENE_INFO/Mammalia/Homo_sapiens.gene_info.gz
Step b.1: [name](Conversion process phase: name) the data:
- base URI:
http://sparql.tw.rpi.edu/ontowiki/
- source:
ncbi-nih-gov
- dataset:
gene-mammalia-homo-sapien
- version:
2011-Feb-23
Use the HTTP modification date to name the version:
bash-3.2$ curl -I ftp://ftp.ncbi.nih.gov/gene/DATA/GENE_INFO/Mammalia/Homo_sapiens.gene_info.gz
Last-Modified: Wed, 23 Feb 2011 08:04:26 GMT
Content-Length: 2402004
Accept-ranges: bytes
Step b.2: [retrieve](Conversion process phase: retrieve) the data: Create the directory to keep a local copy of NIH's data:
mkdir -p ~/Desktop/cshals-2011-demo/source/ncbi-nih-gov/gene-mammalia-homo-sapien/version/2011-Feb-23/source
cd ~/Desktop/cshals-2011-demo/source/ncbi-nih-gov/gene-mammalia-homo-sapien/version/2011-Feb-23/source
Step b.3: Get the gzip file, uncompress it, and log the provenance:
pcurl.sh ftp://ftp.ncbi.nih.gov/gene/DATA/GENE_INFO/Mammalia/Homo_sapiens.gene_info.gz
gunzip -c Homo_sapiens.gene_info.gz > Homo_sapiens.gene_info
justify.sh Homo_sapiens.gene_info.gz Homo_sapiens.gene_info uncompress
cd ~/Desktop/cshals-2011-demo/source/ncbi-nih-gov/gene-mammalia-homo-sapien/version/2011-Feb-23/
Step b.4: [csv-ify](Conversion process phase: csv-ify) the data: NIH's data is tab-delimited (we need comma-separated). We make a manual tweak and store it in manual/ (capturing the provenance):
cat source/Homo_sapiens.gene_info | perl -pe 's/^/"/; s/\t/","/g; s/$/"/' > manual/Homo_sapiens.gene_info.csv
justify.sh source/Homo_sapiens.gene_info manual/Homo_sapiens.gene_info.csv tab2comma
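Because the tab-to-comma tweak is purely textual, it can help to confirm that the tab-delimited source has a consistent number of columns before converting it. This optional check is not part of the original demo; `column_counts` is a hypothetical helper name, and the sample input is made up.

```shell
#!/bin/bash
# Hypothetical helper: for tab-delimited input on stdin, print each distinct
# column count followed by the number of lines that have it.
column_counts() {
  awk -F'\t' '{ n[NF]++ } END { for (k in n) print k, n[k] }' | sort -n
}

# Made-up input: two lines with 3 columns and one line with 2.
printf 'a\tb\tc\n1\t2\t3\n4\t5\n' | column_counts
# prints:
# 2 1
# 3 2
```

More than one line of output means some rows differ in width; a header or comment line in the source may account for one of them.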
Step b.5: Create verbatim interpretation of tabular literals ([pull](Conversion process phase: pull conversion trigger) the conversion trigger):
cr-create-convert-sh.sh -w manual/Homo_sapiens.gene_info.csv
./convert-gene-mammalia-homo-sapien.sh
Step b.6: Cheat and get Jim's [tweaked](Conversion process phase: tweak enhancement parameters) enhanced interpretation parameters:
curl -L https://github.com/timrdf/csv2rdf4lod-automation/raw/master/doc/examples/source/ncbi-nih-gov/gene-mammalia-homo-sapien/version/2011-Feb-23/manual/Homo_sapiens.gene_info.csv.e1.params.ttl > manual/Homo_sapiens.gene_info.csv.e1.params.ttl
Step b.7: Create enhanced interpretation of tabular literals ([pull](Conversion process phase: pull conversion trigger) the conversion trigger again):
./convert-gene-mammalia-homo-sapien.sh
Step b.8: Check out automatic/Homo_sapiens.gene_info.csv.e1.ttl
<http://bio2rdf.org/geneid:1>
dcterms:isReferencedBy <http://logd.tw.rpi.edu/source/ncbi-nih-gov/dataset/gene-mammalia-homo-sapien/version/2011-Feb-23> ;
a <http://purl.obolibrary.org/obo/SO_0000704> , local_vocab:Gene ;
e1:has_species <http://bio2rdf.org/taxon:9606> ;
dcterms:identifier "1" ;
jim:has_symbol "A1BG" ;
rdfs:label "A1BG" ;
e1:has_symbol "HYST2477" , "DKFZp686F0970" , "ABG" , "GAB" , "A1B" ;
dcterms:identifier "MIM:138670" , "HGNC:5" , "HPRD:00726" , "Ensembl:ENSG00000121410" ;
e1:has_location <http://bio2rdf.org/mapviewer:19q13_4> ;
dcterms:description "alpha-1-B glycoprotein" ;
e1:has_gene_type "protein-coding" ;
e1:has_symbol_from_nomenclature_authority "A1BG" ;
e1:has_name_from_nomenclature_authority "alpha-1-B glycoprotein" ;
e1:has_nomenclature_status "Official" ;
e1:has_other_designation "alpha-1B-glycoprotein" ;
dcterms:modified "2011-02-06"^^xsd:date ;
ov:csvRow "1"^^xsd:integer .