This repository has been archived by the owner on Apr 4, 2024. It is now read-only.

Translation File Format

Till Helge Helwig edited this page Jun 29, 2020 · 11 revisions

Goals

  • Ideally we use a well-established format
    • This would allow us to later move more easily to a TMS (Translation Management System)
    • We could make more use of existing libraries and would not have to re-invent the wheel
    • Translation software is available for most such formats, which might be easier for translators to use than plain GitHub
  • We want to make it as easy as possible for translators
    • The translation files should follow the structure of the corresponding page as closely as possible
    • Text blocks should not be separated into multiple entries in the translation file
  • Content will be updated regularly (especially in the beginning), so changes should be easily recognizable

Principles

  • We use simple inline formatting in text blocks rather than splitting text blocks on formatting changes
  • For inline formatting we want to use Markdown rather than HTML
  • We want to have one translation file per page rather than one big translation file

Possible Solutions

Gettext aka PO/MO

GNU Gettext is the de facto i18n standard for many applications, and many programming languages ship a Gettext implementation by default.

Example

msgid "Start folding"
msgstr "Jetzt Mitfalten"

msgid "The Folding@home software runs while you do other things."
msgstr "Die Folding@home Software läuft im Hintergrund während Sie andere Dinge tun."

msgid "While you keep going with your everyday activities, your computer will be working to help us find cures for diseases like cancer, ALS, Parkinson’s, Huntington’s, Influenza and many others."
msgstr "Während Sie Ihrem Alltag nachgehen, arbeitet Ihr Computer daran uns dabei zu helfen Therapien für Krankheiten wie Krebs, ALS, Parkinson, Chorea Huntington, Influenza und viele andere zu finden."

# ...

Documentation
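
To illustrate how little structure the PO example above has, here is a minimal sketch of a reader for single-line `msgid`/`msgstr` pairs. It is not a full PO parser: multi-line strings, plural forms, and `msgctxt` are ignored, and the helper name `parse_simple_po` is made up for this illustration. A real setup would more likely rely on an existing Gettext implementation (Python's standard `gettext` module, for instance, reads compiled `.mo` files).

```python
import ast


def parse_simple_po(text):
    """Collect single-line msgid/msgstr pairs into a dict (sketch only)."""
    catalog = {}
    current_id = None
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and PO comments, which start with "#".
        if not line or line.startswith("#"):
            continue
        keyword, _, quoted = line.partition(" ")
        if keyword not in ("msgid", "msgstr"):
            continue  # anything else is out of scope for this sketch
        # PO strings use C-style quoting; for simple cases this matches
        # Python string literal syntax, so literal_eval can decode them.
        value = ast.literal_eval(quoted)
        if keyword == "msgid":
            current_id = value
        elif current_id is not None:
            catalog[current_id] = value
            current_id = None
    return catalog


PO_SAMPLE = '''
msgid "Start folding"
msgstr "Jetzt Mitfalten"

msgid "The Folding@home software runs while you do other things."
msgstr "Die Folding@home Software läuft im Hintergrund während Sie andere Dinge tun."
'''

catalog = parse_simple_po(PO_SAMPLE)
print(catalog["Start folding"])
```

Because each text block is a single entry keyed by its source text, a changed source sentence shows up as a changed `msgid`, which keeps updates easy to spot in a diff.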

XLIFF

XLIFF (XML Localization Interchange File Format) is an OASIS standard for exchanging translations between tools.

Example

<xliff xmlns="urn:oasis:names:tc:xliff:document:2.0" version="2.0" srcLang="en-US" trgLang="de-DE">
  <file id="start-folding" original="start-folding.html">
    <unit id="start-folding">
      <segment>
        <source>Start folding</source>
        <target>Jetzt Mitfalten</target>
      </segment>
      <segment>
        <source>The Folding@home software runs while you do other things.</source>
        <target>Die Folding@home Software läuft im Hintergrund während Sie andere Dinge tun.</target>
      </segment>
      <segment>
        <source>While you keep going with your everyday activities, your computer will be working to help us find cures for diseases like cancer, ALS, Parkinson’s, Huntington’s, Influenza and many others.</source>
        <target>Während Sie Ihrem Alltag nachgehen, arbeitet Ihr Computer daran uns dabei zu helfen Therapien für Krankheiten wie Krebs, ALS, Parkinson, Chorea Huntington, Influenza und viele andere zu finden.</target>
      </segment>
      <!-- ... -->
    </unit>
  </file>
</xliff>

Documentation
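
As a rough sketch of what reading such a file could look like, the following uses only Python's standard `xml.etree.ElementTree` module to pull the source/target pairs out of an XLIFF 2.0 document. The function name `read_segments` and the shortened sample are assumptions for this illustration; a full i18n workflow would need more than this (notes, states, inline markup).

```python
import xml.etree.ElementTree as ET

# XLIFF 2.0 elements live in this namespace, so lookups must be qualified.
XLIFF_NS = {"x": "urn:oasis:names:tc:xliff:document:2.0"}

XLIFF_SAMPLE = """<xliff xmlns="urn:oasis:names:tc:xliff:document:2.0" version="2.0" srcLang="en-US" trgLang="de-DE">
  <file id="start-folding" original="start-folding.html">
    <unit id="start-folding">
      <segment>
        <source>Start folding</source>
        <target>Jetzt Mitfalten</target>
      </segment>
    </unit>
  </file>
</xliff>"""


def read_segments(xliff_text):
    """Return a list of (source, target) tuples, one per <segment>."""
    root = ET.fromstring(xliff_text)
    pairs = []
    for segment in root.iterfind(".//x:segment", XLIFF_NS):
        source = segment.findtext("x:source", default="", namespaces=XLIFF_NS)
        target = segment.findtext("x:target", default="", namespaces=XLIFF_NS)
        pairs.append((source, target))
    return pairs


print(read_segments(XLIFF_SAMPLE))
```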

YAML

YAML files can be structured in many different ways. One layout that is somewhat standardized is the one used by Ruby on Rails.

Example

de:
   start-folding:
      headline: Jetzt Mitfalten
      welcome_text: Die Folding@home Software läuft im Hintergrund während Sie andere Dinge tun.
      while_you_keep_going: "Während Sie Ihrem Alltag nachgehen, arbeitet Ihr Computer daran uns dabei zu helfen Therapien für Krankheiten wie Krebs, ALS, Parkinson, Chorea Huntington, Influenza und viele andere zu finden."

Documentation
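
In the Rails-style layout, entries are addressed by a locale plus a dot-separated key path. The sketch below shows that lookup in Python. To stay free of third-party dependencies, a plain dictionary stands in for the parsed YAML file; in practice the data would come from a YAML parser. The `translate` helper and its fall-back-to-key behaviour are assumptions made for this example, not part of any library.

```python
# Stand-in for the parsed YAML file shown above (shortened).
TRANSLATIONS = {
    "de": {
        "start-folding": {
            "headline": "Jetzt Mitfalten",
        }
    }
}


def translate(key, locale, translations=TRANSLATIONS):
    """Resolve a dotted key such as 'start-folding.headline' for a locale.

    Returns the key itself when no translation exists, so that missing
    entries remain visible on the rendered page.
    """
    node = translations.get(locale, {})
    for part in key.split("."):
        if not isinstance(node, dict) or part not in node:
            return key
        node = node[part]
    return node


print(translate("start-folding.headline", "de"))
```

Unlike Gettext and XLIFF, the source text is not part of the file here: each entry is reached through an invented key, so the page template and the translation file have to agree on key names.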

Comparison of Solution Options

Criterion Gettext XLIFF YAML
human-readable
format is easy to validate
text blocks with inline formatting ✔ (not HTML)
no need to escape special characters
Python library available (✔)* (✔)*

*) Parser is available, but no full i18n library

Recommendations

Till (Tar-Minyatur): Gettext / PO Files

I think this format is the one with the most convincing history. There are countless projects that build on GNU Gettext and they cannot all be wrong. Gettext will also integrate really well with other applications we intend to translate later on. There is plenty of tooling around to make it easier for translators to handle these files. And there are a bunch of additional features in PO files that might come in handy at some point (like providing the context for a translation or having different kinds of semantic comments in the file).

A ready-to-use implementation of Gettext is available for most programming languages. We only need to integrate the additional concept of having inline Markdown formatting, which can be easily achieved by decorating the Gettext library.

Here is how I would imagine the parsing/template generation/language extraction and page rendering to work: https://drive.google.com/file/d/19DBFlUnolDjnKqV1wBgEgcj-sofihCII/view

By wrapping each translatable text in a custom tag, we can leave the original enclosing tag untouched, and during rendering we can use XPath to find all translatable nodes directly.
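
The rendering step described here could look roughly like the following sketch. It assumes a hypothetical marker tag named `t`, a page that is well-formed XML, and that each `<t>` element is the only content of its parent; none of these details are fixed by the proposal. It uses the limited XPath support in Python's standard `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalog and page; <t> is an assumed marker tag name.
CATALOG = {"Start folding": "Jetzt Mitfalten"}
PAGE = "<div><h1><t>Start folding</t></h1><p><t>Untranslated text</t></p></div>"


def render(page_xml, catalog):
    """Replace every <t> marker with its translation, keeping the parent tag."""
    root = ET.fromstring(page_xml)
    # ".//t/.." selects each element that has a <t> child.
    for parent in root.findall(".//t/.."):
        for marker in parent.findall("t"):
            source = marker.text or ""
            parent.remove(marker)
            # Fall back to the source text when no translation exists.
            parent.text = catalog.get(source, source)
    return ET.tostring(root, encoding="unicode")


print(render(PAGE, CATALOG))
```

The original `<h1>` and `<p>` elements survive unchanged, and only the marker is swapped for the translated text.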

Migration to the final file format

  • We should not overload the translators while they're still working on old formats

  • Ideally, conversion from the old to the new format should be automatic

  • @kn1cht: I suggest the following migration steps:

    1. We create a directory like "version-next" and work on the new format there.
    2. Once we are done, we rename "version-next" to "version-{release date}" and generate YAMLs for all languages.
      • I think we should keep the old directories (such as Localization/de-DE) in this step because someone may still open PRs against them.
    3. After we have confirmed that all forked repositories are up to date, we move the old directories into a dedicated directory.
    4. When the texts of the source website change, we add more directories for the new version.
    • versioning