
Conversion process phase: name

timrdf edited this page Mar 21, 2011 · 85 revisions


What's first?

How do I name?

Consistent naming conventions make working with our data easier. For all identifiers, we highly recommend that you:

  • Use lower case
  • Replace spaces with underscores
  • Avoid acronyms; try to expand them
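
As a rough illustration of the three recommendations above, here is a minimal Python sketch. The function name `normalize_identifier` and the optional `expansions` mapping are hypothetical and not part of the converter; acronym expansion in particular needs human judgment, so the mapping only covers acronyms you supply yourself.

```python
def normalize_identifier(raw, expansions=None):
    """Lower-case a label and replace whitespace with underscores.

    `expansions` is a hypothetical, caller-supplied mapping from an
    acronym to its expanded form, e.g. {"EPA": "Environmental Protection Agency"}.
    """
    expansions = expansions or {}
    # Expand any known acronyms word by word.
    expanded = " ".join(expansions.get(word, word) for word in raw.split())
    # Collapse runs of whitespace into single underscores, then lower-case.
    return "_".join(expanded.split()).lower()


print(normalize_identifier("White House Visitors List"))
# prints: white_house_visitors_list
```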

The identifiers you establish for source, dataset, and version determine the names of the directories used to organize all of the aggregated data (see Directory Conventions), the Enhancement parameters given to the converter, the URIs of the resulting RDF datasets, and the URIs of instances within those datasets. Choose them with care. Keep in mind that any single URI you create could end up in someone else's hands in isolation; it is incredibly useful to humans if they can make a good guess at what it identifies before dereferencing it and starting to crawl it as linked data.

Step 0 of 3: Establish your base URI

conversion:base_uri           "http://sparql.tw.rpi.edu/ontowiki"^^xsd:anyURI;

NOTE: do not include a slash at the end of this; we'll add it for you.
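
If the base URI comes from user input or a configuration file, a small guard can enforce the no-trailing-slash rule before the value is written into the parameters. This is only a sketch; `clean_base_uri` is a hypothetical helper, not something the converter provides.

```python
def clean_base_uri(base_uri):
    """Return the base URI without surrounding whitespace or trailing slashes."""
    return base_uri.strip().rstrip("/")


print(clean_base_uri("http://sparql.tw.rpi.edu/ontowiki/"))
# prints: http://sparql.tw.rpi.edu/ontowiki
```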

Step 1 of 3: Establish an identifier for the source

Here, source indicates a person, organization, or other agent providing you the data that you want to convert. The intent here is a living or social entity and not something rote like a web service or external hard drive. If you grabbed the White House visitors list, identify your source as whitehouse-gov. If you got the data from a new acquaintance, identify them as the source using something like hotmail-com-joey. If you've got an inside scoop and someone from the White House handed you next week's visitors list on a thumb drive, identify them as the source using something like whitehouse-gov-potus. Make like an investigative reporter and mind your source. For several examples, see the list of source identifiers that LOGD has used.

Directory holding all datasets from this source will be:
URI of source will become:
The source identifier will be encoded in the conversion parameter: `conversion:source_identifier`
The web page describing the source identified will be:
  • Reuse DNS name for the organization, ignoring all non-organization identifying fragments such as "www", "www2", etc.
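
The DNS-based rule can be sketched in Python as follows. The helper name `source_identifier_from_url` is hypothetical, and replacing dots with hyphens is an assumption inferred from the `whitehouse-gov` example above rather than a documented rule. The sketch also assumes the URL includes a scheme such as `http://`.

```python
import re
from urllib.parse import urlparse


def source_identifier_from_url(url):
    """Derive a source identifier from the host name of a URL.

    Leading labels such as "www" or "www2" are dropped because they do
    not identify the organization; remaining labels are joined with
    hyphens (an assumption based on the "whitehouse-gov" example).
    """
    host = urlparse(url).hostname or ""
    labels = host.lower().split(".")
    # Strip "www", "www2", ... but always keep at least two labels.
    while len(labels) > 2 and re.fullmatch(r"www\d*", labels[0]):
        labels.pop(0)
    return "-".join(labels)


print(source_identifier_from_url("http://www.whitehouse.gov/briefing-room"))
# prints: whitehouse-gov
```

Identifiers for individuals, such as `hotmail-com-joey`, still have to be composed by hand.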

(Note that this perspective of source does not align with dcterms:source, because our dataset is not derived from the source that we are citing. Dublin Core's dcterms:publisher is closer to what we are referencing, though our source may be an intermediary that was not the original publisher. This is the case in hand-me-down data sharing such as scraperwiki.com (which scrapes gov sites and rehosts the data as CSV), impacteen.org (which aggregates statistics from many federal surveys that are not readily accessible), and Xian's company financial earnings (which state that the reports are from the gov, although one cannot be sure, and in reality the reports were submitted to the government by the individual companies).)

Step 2 of 3: Establish an identifier for the dataset

The dataset identifier will be encoded in the conversion parameter: `conversion:dataset_identifier`
  • Reuse the source organization's identifier for the dataset whenever possible.
  • If none given, construct a clear descriptive name based on the web pages' descriptions of the dataset.

For several examples, see the list of dataset identifiers that LOGD has used. Note that most of the identifiers at the bottom are reused from data.gov's numeric convention.

Step 3 of 3: Establish an identifier for the version

It is almost certain that any dataset you retrieved from another organization has been updated since the last time you grabbed it. For example, UniProt (http://www.uniprot.org/downloads) publishes a new release every four weeks. So, when you tell a colleague that your analysis results showed X, they might want to know which data you analyzed. The version identifier handles this situation.

The version identifier will be encoded in the conversion parameter: `conversion:version_identifier`
  • It is highly recommended to REUSE the source organization's name for the version.
  • Failing that, use the Last Modified date as reported by the URLs.
  • Finally, if neither is available, use today's date (i.e., the date of retrieval).
  • Optionally, use a curator's tag (e.g., we used "mashathon" during a mashathon).

When using a date, write it in "year-mon-day" form, e.g. 2010-Dec-31. This follows the "larger to smaller" convention of the URI decomposition. Using Dec instead of 12 avoids confusion for less technical folks and for those used to other international date conventions. These dates ARE NOT INTENDED FOR PARSING, ONLY AS HUMAN AIDS. If date modeling is required, do so elsewhere with appropriate RDF vocabularies and xsd:date values.
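
The date form and the order of preference for version identifiers can be sketched as follows. Both function names are hypothetical, and treating the recommendations as a strict fallback chain is an assumption; the optional curator's tag is left out because it is a judgment call. The month abbreviations are spelled out explicitly so the result does not depend on the machine's locale.

```python
from datetime import date

MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")


def format_version_date(d):
    """Format a date in the human-oriented year-Mon-day form, e.g. 2010-Dec-31."""
    return f"{d.year:04d}-{MONTHS[d.month - 1]}-{d.day:02d}"


def choose_version_identifier(source_version, last_modified, retrieved):
    """Pick a version identifier, assuming a strict fallback order.

    source_version: the source organization's own version name, or None
    last_modified:  Last Modified date reported for the URL, or None
    retrieved:      the date of retrieval (always available)
    """
    if source_version:
        return source_version
    if last_modified is not None:
        return format_version_date(last_modified)
    return format_version_date(retrieved)


print(choose_version_identifier(None, date(2010, 12, 9), date(2011, 3, 21)))
# prints: 2010-Dec-09
```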

For several examples, see the list of version identifiers that LOGD has used.

There are a lot of conversion:version_identifiers that look like 2010-Dec-09 and dataset URIs that have version/2010-Dec-09. What does that date mean? Does it have to be a date? What methodology should a curator use to name the Version?

After establishing identifiers for source, dataset, and version, they can be used to construct the conversion cockpit -- the place to be when converting a dataset.
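
To show how the pieces fit together, the sketch below renders the base URI and the three identifiers as parameter lines in the style of the `conversion:base_uri` line from Step 0. The function name is hypothetical, writing the identifiers as plain string literals is an assumption, and `visitor_records` is a made-up dataset identifier used only for illustration. The real Enhancement parameters file may contain more than these four lines.

```python
def render_identifier_parameters(base_uri, source_id, dataset_id, version_id):
    """Render the four naming parameters as aligned, Turtle-style lines."""
    base = base_uri.rstrip("/")  # no trailing slash, per Step 0
    rows = [
        ("conversion:base_uri", f'"{base}"^^xsd:anyURI'),
        # Plain string literals for the identifiers are an assumption.
        ("conversion:source_identifier", f'"{source_id}"'),
        ("conversion:dataset_identifier", f'"{dataset_id}"'),
        ("conversion:version_identifier", f'"{version_id}"'),
    ]
    width = max(len(name) for name, _ in rows)
    return "\n".join(f"{name.ljust(width)} {value};" for name, value in rows)


print(render_identifier_parameters(
    "http://sparql.tw.rpi.edu/ontowiki",
    "whitehouse-gov",
    "visitor_records",  # hypothetical dataset identifier
    "2010-Dec-09",
))
```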

What's next?

Historical note

This page aggregates and replaces:
