Conversion process phase: name
[up](Conversion process phases)
- Installing csv2rdf4lod automation
- Know of a dataset that you want to convert to RDF.
Consistent naming conventions make working with our data easier. For all identifiers, we highly recommend that you:
- Use lower case
- Replace spaces with underscores
- Avoid acronyms; try to expand them
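For instance (a made-up label, purely for illustration), applying these conventions might look like:

```
# Hypothetical illustration of the conventions above:
#   "Bureau of Labor Statistics CPI Tables"  ->  bureau_of_labor_statistics_consumer_price_index_tables
#   (lower case, underscores for spaces, the acronym "CPI" expanded)
```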
Establishing identifiers for the source, dataset, and version affects the naming of the directories used to organize all of the aggregated data (see Directory Conventions), the Enhancement parameters given to the converter, the URI naming of the resulting RDF datasets, and the URI naming of instances within those datasets. Choose these identifiers with thought and care. Keep in mind that any single URI you create could end up in someone else's hands in isolation, and it is incredibly useful if a human can make a good guess about what it identifies before dereferencing it and starting to crawl it as linked data.
conversion:base_uri "http://sparql.tw.rpi.edu/ontowiki"^^xsd:anyURI;
NOTE: do not include a slash at the end of this; we'll add it for you.
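As a rough sketch (the identifiers below are hypothetical, and the URI pattern is an assumption following the Directory Conventions), the base URI composes with the three identifiers you are about to choose roughly like this:

```
# Hypothetical identifiers, invented for illustration:
conversion:base_uri           "http://sparql.tw.rpi.edu/ontowiki"^^xsd:anyURI;
conversion:source_identifier  "whitehouse-gov";
conversion:dataset_identifier "visitor-records";
conversion:version_identifier "2010-Dec-31";

# These would compose into a versioned dataset URI of roughly this shape:
#   http://sparql.tw.rpi.edu/ontowiki/source/whitehouse-gov/dataset/visitor-records/version/2010-Dec-31
```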
Here, `source` indicates a person, organization, or other agent providing you the data that you want to convert. The intent here is a living or social entity, not something rote like a web service or external hard drive. If you grabbed the White House visitors list, identify your source as `whitehouse-gov`. If you got the data from a new acquaintance, identify them as the source using something like `hotmail-com-joey`. If you've got an inside scoop and someone from the White House handed you next week's visitors list on a thumb drive, identify them as the source using something like `whitehouse-gov-potus`. Make like an investigative reporter and mind your source. For several examples, see the list of source identifiers that LOGD has used.
The source identifier will be encoded in the conversion parameter `conversion:source_identifier`. It also determines:
- the directory that will hold all datasets from this source,
- the URI of the source, and
- the web page describing the source.

When choosing a source identifier:
- Reuse the organization's DNS name, ignoring all non-organization-identifying fragments such as "www", "www2", etc.
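Continuing the hypothetical identifiers above (the directory layout and URI pattern here are assumptions based on the Directory Conventions, not values from this page):

```
# DNS name www.whitehouse.gov, with "www" dropped, yields the source identifier:
conversion:source_identifier "whitehouse-gov";

# Assumed directory holding all datasets from this source:
#   source/whitehouse-gov/
#
# Assumed URI of the source:
#   http://sparql.tw.rpi.edu/ontowiki/source/whitehouse-gov
```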
(Note that this perspective of `source` does not align with dcterms:source, because our dataset is not derived from the source that we are citing. Dublin Core's dcterms:publisher is closer to what we are referencing, though our `source` may be an intermediary that was not the original publisher -- as is the case in hand-me-down data sharing such as scraperwiki.com (which scrapes government sites and rehosts them as CSV), impacteen.org (which aggregates statistics from many federal surveys that are not readily accessible), and Xian's company financial earnings (which states that the reports are from the government -- but one cannot be sure -- when in reality the reports were submitted to the government by the individual companies).)
The dataset identifier will be encoded in the conversion parameter: `conversion:dataset_identifier`
- Reuse the source organization's identifier for the dataset whenever possible.
- If none is given, construct a clear, descriptive name based on the web pages' descriptions of the dataset.
For several examples, see the list of dataset identifiers that LOGD has used. Note that most of the identifiers at the bottom are reused from data.gov's numeric convention.
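Continuing the hypothetical example (the URI pattern is again an assumption following the Directory Conventions):

```
# Descriptive identifier constructed from the source's description of the dataset;
# reuse the source's own identifier (e.g. a data.gov dataset number) when one exists.
conversion:dataset_identifier "visitor-records";

# Assumed URI of the (unversioned) dataset:
#   http://sparql.tw.rpi.edu/ontowiki/source/whitehouse-gov/dataset/visitor-records
```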
It is almost certain that any dataset you retrieved from another organization has been updated since the last time you grabbed it. For example, http://www.uniprot.org/downloads releases new data every four weeks. So, when you tell a colleague that your analysis results showed X, they might want to know which data you analyzed. The version identifier handles this situation.
The version identifier will be encoded in the conversion parameter: `conversion:version_identifier`
- It is highly recommended to REUSE the source organization's own name for the version.
- Next, the Last-Modified date reported by the retrieved URLs should be used.
- Finally, today's date (i.e., date of retrieval) can be used.
- Optionally, a curator's tag could be used (e.g., we used "mashathon" during a mashathon)
When using a date, use the "year-Mon-day" form, e.g. `2010-Dec-31`. This follows the "larger to smaller" convention of the URI decomposition. Also, `Dec` instead of `12` avoids confusion for less technical folks and for those used to international conventions -- these identifiers are NOT intended for parsing, only as human aids. If date modeling is required, do so with appropriate RDF vocabularies and `xsd:date` formats elsewhere.
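A sketch of what that might look like (the property used for the machine-readable date is just an illustration, not prescribed by this page):

```
# Human-oriented version identifier, named by retrieval date:
conversion:version_identifier "2010-Dec-31";

# Assumed versioned dataset URI:
#   http://sparql.tw.rpi.edu/ontowiki/source/whitehouse-gov/dataset/visitor-records/version/2010-Dec-31

# If the date itself must be modeled, do it separately with a machine-readable literal, e.g.:
#   <http://sparql.tw.rpi.edu/ontowiki/source/whitehouse-gov/dataset/visitor-records/version/2010-Dec-31>
#      dcterms:created "2010-12-31"^^xsd:date .
```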
For several examples, see the list of version identifiers that LOGD has used.
There are a lot of `conversion:version_identifier`s that look like `2010-Dec-09` and dataset URIs that contain `version/2010-Dec-09`. What does that date mean? Does it have to be a date? What methodology should a curator use to name the version?
After you have established identifiers for the source, dataset, and version, they can be used to construct the conversion cockpit -- the place to be when converting a dataset.
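Assuming the directory conventions sketched above, the conversion cockpit for the hypothetical identifiers would be a directory of roughly this shape:

```
# Assumed conversion cockpit location (see Directory Conventions):
#   source/whitehouse-gov/visitor-records/version/2010-Dec-31/
# The remaining phases (retrieve, csv-ify, trigger, publish) are carried out from here.
```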
- Conversion process phase: retrieve
- Conversion process phase: csv-ify
- Conversion process phase: create conversion trigger
- Conversion process phase: pull conversion trigger
- ... (rinse and repeat; flavor to taste) ...
- Conversion process phase: tweak enhancement parameters
- Conversion process phase: pull conversion trigger
- Conversion process phase: publish
This page aggregates and replaces: