From a software engineering perspective, the localization process can be an entropy-increasing stage in your DevOps pipeline.
Localization tools need to extract a snapshot of the user experience, usually from resource files, and generate translated equivalents without adversely affecting the integrity of the application. User interface strings must be unpicked from (sometimes deeply nested) mark-up and presented to translators, who prepare target language strings, which must be ready to nest back into place within identically structured mark-up.
Small inconsistencies in the source tend to become large ones in target language files, and non-breaking anomalies tend to become breaking ones. This is entropy in UI projects.
At Rubric, we use a mix of automated tests and manual checks by both linguists and engineers to help minimize this effect. Below I’ll work through a typical example to show how you can help your global content partner by minimizing entropy at the start of the process. (Look out for the inconsistencies in the original source.)
An example resource file
The following XML is based on a typical resource file for an Android app:
<strings>
  <check_mobile_devices_wifi>
    <![CDATA[Check your mobile device’s Wi-Fi settings and make sure your mobile device is connected to your home network##REPLACE_WITH_HOME_NETWORK##.<br /><br />Or, if you still can't connect, click START OVER.]]>
  </check_mobile_devices_wifi>
  <we_are_here_to_help>
    <![CDATA[We’re here to help]]>
  </we_are_here_to_help>
  <firmware_system_setup>
    <![CDATA[How would you like to connect your speaker to your network?]]>
  </firmware_system_setup>
</strings>
Step 1 – Identify content type and unwrap nested formats
The file is first put through an Android Strings XML parser to extract the value of each key. The content type within CDATA sections (HTML) is identified and handed off to a secondary parser.
- Note: the source contains two right single quotation marks. One of them (in the second key) is HTML-encoded as a character entity (rendered here as ’), but the other is a literal ’ character. This is an example of an inconsistency that could lead to problems down the line.
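As a rough illustration of this first step, the sketch below uses Python's standard `xml.etree.ElementTree` to pull out the value of each key. It is not Rubric's actual parser: the `extract_strings` helper and the shortened sample are hypothetical, and the sample assumes the encoded apostrophe was written as `&#8217;` (the exact entity is not shown above).

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened version of the resource file above.
# Assumption: the encoded apostrophe is written as the numeric entity &#8217;.
SOURCE = """<strings>
  <we_are_here_to_help>
    <![CDATA[We&#8217;re here to help]]>
  </we_are_here_to_help>
  <firmware_system_setup>
    <![CDATA[How would you like to connect your speaker to your network?]]>
  </firmware_system_setup>
</strings>"""


def extract_strings(xml_text):
    """Return {key: raw string} for each child of the root element."""
    root = ET.fromstring(xml_text)
    # The XML parser hands back CDATA content as plain text and does not
    # decode entities inside it, so the HTML layer is still intact here.
    return {child.tag: (child.text or "").strip() for child in root}


strings = extract_strings(SOURCE)
for key, value in strings.items():
    print(key, "=>", value)
```

Because the entity survives this stage untouched, decoding it is left to the secondary (HTML) parser.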
Step 2 – Parse HTML and protect tags and placeholders
Here the entities are decoded (second key), and HTML tags and application-specific placeholders are protected.
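A minimal sketch of this step, assuming placeholders always follow the `##NAME##` pattern seen in the example and that protected items are swapped for numbered tokens such as `{1}`. The token format and the `protect` helper are illustrative assumptions, not the format a real localization tool uses.

```python
import html
import re

# HTML tags such as <br /> and application placeholders such as ##NAME##.
PROTECTED = re.compile(r"(<[^<>]+>|##[A-Z0-9_]+##)")


def protect(raw):
    """Return (translatable text, protected items) for one raw string."""
    segments = []
    protected_items = []
    # Splitting on a capturing group keeps the matched tags/placeholders
    # in the result, alternating with the plain-text parts.
    for part in PROTECTED.split(raw):
        if PROTECTED.fullmatch(part):
            protected_items.append(part)
            segments.append("{%d}" % len(protected_items))
        else:
            # Entities are decoded only in plain text, so an encoded "&lt;"
            # is never mistaken for the start of a tag.
            segments.append(html.unescape(part))
    return "".join(segments), protected_items


text, items = protect("We&#8217;re connected##REPLACE_WITH_HOME_NETWORK##.<br />OK")
print(text)
print(items)
```

A real implementation would also need to handle source text that already contains something that looks like a token.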
Step 3 – Present translatable strings to translators
Translations are pre-populated from translation memory where possible, and the translator fills any remaining gaps. The protected placeholders cannot be altered by the translator but may be rearranged if required by the sentence structure of the target language.
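The "may be rearranged but not altered" rule lends itself to an automated check. The sketch below assumes protected items are shown to the translator as numbered tokens such as `{1}`; that convention and the `placeholders_match` helper are hypothetical.

```python
import re
from collections import Counter

# Assumed token convention: {1}, {2}, ...
TOKEN = re.compile(r"\{\d+\}")


def placeholders_match(source, target):
    """True if the target uses exactly the same tokens, in any order."""
    return Counter(TOKEN.findall(source)) == Counter(TOKEN.findall(target))


# Reordering is fine; dropping or duplicating a token is not.
print(placeholders_match("Connect {1} to {2}.", "{2} : connectez {1}."))
print(placeholders_match("Connect {1} to {2}.", "Connectez {1}."))
```

Comparing counts rather than sets means a token that was accidentally duplicated is caught as well.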
Step 4 – Write out target files
This is often the most technically complex part of the process, and it is where inconsistencies in the source can become amplified. The translated segments are processed (through each of the above steps in reverse), eventually reconstituting the original format.
First, placeholders and tags are re-injected and special characters are re-encoded or escaped:
The escaped single quote will probably not do any harm if it is decoded at the right points down the line in your DevOps pipeline. However, if the source structure is internally consistent (less entropy!), this kind of ambiguity can be avoided.
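One way to catch this kind of inconsistency before files are sent for localization is a small lint pass over the source strings. The sketch below is an illustration only: the `apostrophe_forms` and `is_consistent` helpers are hypothetical, and it only looks at the right single quotation mark discussed here.

```python
import re

# The two representations of a right single quotation mark we look for.
FORMS = {
    "literal": re.compile("\u2019"),
    "entity": re.compile(r"&(?:rsquo|#8217|#[xX]2019);"),
}


def apostrophe_forms(strings):
    """Map each representation to the list of keys that use it."""
    found = {}
    for key, value in strings.items():
        for name, pattern in FORMS.items():
            if pattern.search(value):
                found.setdefault(name, []).append(key)
    return found


def is_consistent(strings):
    """True if at most one representation is used across all strings."""
    return len(apostrophe_forms(strings)) <= 1


# Hypothetical sample mirroring the inconsistency in the example file.
sample = {
    "check_mobile_devices_wifi": "Check your mobile device\u2019s Wi-Fi settings",
    "we_are_here_to_help": "We&#8217;re here to help",
}
print(apostrophe_forms(sample))
print(is_consistent(sample))
```

Running a check like this in the build makes the choice of representation explicit, whichever one you settle on.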
Finally, the translated strings are re-injected into the original markup:
<strings>
  <check_mobile_devices_wifi>
    <![CDATA[Vérifiez les paramètres Wi-Fi de votre périphérique mobile pour vous assurer que ce dernier est connecté à votre réseau domestique##REPLACE_WITH_HOME_NETWORK##.<br /><br />Si vous ne pouvez toujours pas vous connecter, cliquez sur RECOMMENCER.]]>
  </check_mobile_devices_wifi>
  <we_are_here_to_help>
    <![CDATA[Nous sommes là pour vous aider]]>
  </we_are_here_to_help>
  <firmware_system_setup>
    <![CDATA[Comment souhaitez-vous connecter l’enceinte à votre réseau?]]>
  </firmware_system_setup>
</strings>
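Since the target file should be structurally identical to the source, a simple automated test can compare the two. The sketch below is an assumption about what such a test might look like, not a description of Rubric's tooling: it checks that both files contain the same keys in the same order, and that each string carries the same tags and `##NAME##` placeholders.

```python
import re
import xml.etree.ElementTree as ET

# HTML tags and ##NAME## placeholders, as seen in the example file.
PROTECTED = re.compile(r"<[^<>]+>|##[A-Z0-9_]+##")


def fingerprint(xml_text):
    """Return [(key, sorted protected items)] for each string in the file."""
    root = ET.fromstring(xml_text)
    return [
        (child.tag, sorted(PROTECTED.findall(child.text or "")))
        for child in root
    ]


def same_structure(source_xml, target_xml):
    """True if keys, key order, tags and placeholders all line up."""
    return fingerprint(source_xml) == fingerprint(target_xml)


# Hypothetical miniature files for illustration.
source = "<strings><greeting><![CDATA[Hi ##NAME##.<br />Bye]]></greeting></strings>"
target = "<strings><greeting><![CDATA[Salut ##NAME##.<br />Au revoir]]></greeting></strings>"
broken = "<strings><greeting><![CDATA[Salut.<br />Au revoir]]></greeting></strings>"

print(same_structure(source, target))
print(same_structure(source, broken))
```

Sorting the protected items per key allows for legitimate reordering in the target language while still flagging anything dropped or added.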
How you can help your Global Content partner
As well as providing source files which are structured in a consistent way, there are a couple of other ways in which you can help optimize the localization process and enhance the quality of the end product:
Provide a complete set of files with every localization request
At Rubric, we typically run diff reports at the end of every localization project in order to review changes in the English source and compare those against changes in the target files. This helps us to pick up any unexpected changes (for example, escaped characters introduced in error). Working with a complete set of files for each revision simplifies the diff process and makes reports easier to analyze.
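To give a flavor of what such a report can look like, here is a minimal sketch using Python's standard `difflib`. The helper names are hypothetical, and the check for a backslash-escaped apostrophe is just one example of an "unexpected change" worth flagging.

```python
import difflib


def diff_report(old_text, new_text, old_name="previous", new_name="current"):
    """Return a unified diff of two file revisions as a single string."""
    lines = difflib.unified_diff(
        old_text.splitlines(),
        new_text.splitlines(),
        fromfile=old_name,
        tofile=new_name,
        lineterm="",
    )
    return "\n".join(lines)


def escaped_quotes_introduced(old_text, new_text):
    """Return added lines that contain a backslash-escaped apostrophe."""
    delta = difflib.ndiff(old_text.splitlines(), new_text.splitlines())
    # ndiff marks added lines with a two-character "+ " prefix.
    added = [line[2:] for line in delta if line.startswith("+ ")]
    return [line for line in added if "\\'" in line]


# Hypothetical target-file revisions.
old = "Bonjour\nC'est ici"
new = "Bonjour\nC\\'est ici"
print(diff_report(old, new))
print(escaped_quotes_introduced(old, new))
```

Reports like this are only meaningful when both revisions contain the full set of files, which is why a complete delivery matters.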
Say something when you find anomalies
If you find that you are having to apply fixes to localized resource files, please tell your Global Content partner, as this will enable them to correct any misconfigurations.
*first image of a black hole courtesy of the Event Horizon Telescope (EHT) network.