One of the most common challenges for Localization Engineers, Linguists, DTP specialists and many other people in this industry is character encoding. Some letters of the target language may not appear at all, or may not display properly, because the source and target languages use different character systems. Mr. Tex Texin, from XenCraft, a consulting firm specializing in software globalization, gave us a comprehensive and detailed introduction to Unicode, to earlier computer character systems such as ASCII, and to how computers process text.
To understand encoding, we first need to understand how computers record the information we enter from the keyboard. Every time we press a key, a row number and a column number are generated, and each character has a corresponding value, just like a coordinate system. To make a system support a language, we first create a character set containing letters, digits, punctuation and symbols, and assign each one a specific value. ASCII and Unicode are two examples of such character sets: ASCII defines 128 characters, while Unicode defines far more, well over 100,000.
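The idea that every character has a numeric value is easy to see in code. Here is a minimal Python sketch of my own (not from the lecture) that prints the code point assigned to a few characters:

```python
# ord() returns the numeric code point of a character; chr() goes the other way.
for ch in ["A", "z", "€", "中"]:
    print(f"{ch!r} -> U+{ord(ch):04X} (decimal {ord(ch)})")

# 'A' -> U+0041 (decimal 65)     -- one of the 128 characters ASCII defines
# '中' -> U+4E2D (decimal 20013)  -- far beyond ASCII, but covered by Unicode
```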
However, different countries may use different character sets because of differences in language and culture. If a person types a character whose value is not found in the character set used on the other end, or the text is interpreted with a character set that assigns different values than the one it was written with, the result is typically mojibake, that is, garbled text. Non-Roman languages such as Chinese, Japanese, Korean and Russian are especially likely to encounter such problems because their letters are so different from those of Roman-alphabet languages.
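A small Python sketch (again my own example) shows how quickly mojibake appears when the decoding side assumes the wrong encoding:

```python
# Mojibake happens when bytes written with one encoding are decoded with another.
# Here the Chinese greeting "你好" is encoded as GBK, then wrongly decoded as Latin-1.
text = "你好"
garbled = text.encode("gbk").decode("latin-1")
print(garbled)                                  # prints 'ÄãºÃ' -- mojibake

# Decoding with the encoding that was actually used restores the original text.
print(garbled.encode("latin-1").decode("gbk"))  # prints '你好'
```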
The mojibake issue reminds me of one of my own experiences handling characters. I was on a team project to train a statistical machine translation (SMT) engine. The goal was to improve translation quality by “feeding” bilingual and monolingual texts to the machine, then tuning and testing it with sample translations. The BLEU score, which reflects translation quality, was extremely low at first; later we found that the problem was caused by garbled text in our corpus files. However, the original files that were uploaded to the machine were free of such errors. We asked the teammate who had prepared the files how she did it: the files were downloaded as SRT files (the text format widely used for subtitles) and then transformed into TMX files. She also mentioned that some of the files were first copied and pasted into Microsoft Word. We then realized that the different encodings used by Microsoft Word and by the machine translation system were very likely the cause of the mojibake, and they were! Even though the text we pasted into Word appeared normal, that does not mean every character kept the same encoding value when we output the file.
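Looking back, a simple pre-flight check on the corpus files would have saved us a lot of time. The sketch below is not what we actually ran; it only illustrates the idea, and the folder name, file pattern and cp1252 fallback are assumptions for illustration:

```python
# Hedged sketch: verify each corpus file decodes as UTF-8, and if not,
# re-save it as UTF-8 (cp1252 is a common encoding for text saved via
# Word on Windows, so it is used here as an assumed fallback).
from pathlib import Path

def normalize_to_utf8(path: Path, fallback: str = "cp1252") -> None:
    raw = path.read_bytes()
    try:
        raw.decode("utf-8")                      # already fine, nothing to do
    except UnicodeDecodeError:
        text = raw.decode(fallback)              # assume the legacy code page
        path.write_text(text, encoding="utf-8")
        print(f"re-encoded {path} from {fallback} to UTF-8")

for srt in Path("corpus").glob("*.srt"):         # hypothetical corpus folder
    normalize_to_utf8(srt)
```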
In my opinion, it is a good habit for localizers to always check the encoding of the text editor we are using when outputting a file. For example, when we save a .txt file, there is usually a little dropdown menu for the encoding next to the “Save” button. If someone types Chinese characters and saves the file with ANSI encoding, mojibake may occur when the same file is opened in Microsoft Word as UTF-8. Therefore, it is good practice to keep using the same encoding, whenever possible, to save and open a file.
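The same habit applies when handling files in code. A short Python sketch (my own example) of always naming the encoding on both the writing and the reading side:

```python
# Being explicit about the encoding avoids the ANSI-vs-UTF-8 mismatch
# described above, instead of relying on the platform default.
text = "本地化"   # "localization" in Chinese

with open("note.txt", "w", encoding="utf-8") as f:
    f.write(text)

# Read it back with the same encoding; reading with a legacy code page
# such as cp1252 would either fail or silently produce mojibake.
with open("note.txt", "r", encoding="utf-8") as f:
    print(f.read())   # 本地化
```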
Mr. Texin also mentioned text direction: some languages, such as Arabic, are written from right to left, and some texts are even bidirectional. Character systems are an interesting topic in the localization industry. It is not a deeply technical problem, but it requires us to remain sensitive to such issues.
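Out of curiosity, Unicode itself records the writing direction of every character; a tiny Python sketch using the standard unicodedata module shows the bidirectional categories:

```python
# unicodedata reports each character's bidirectional category:
# 'L' for left-to-right letters, 'AL' for Arabic letters, 'R' for Hebrew.
import unicodedata

for ch in ["A", "م", "א", "中"]:
    print(ch, unicodedata.bidirectional(ch))
# A L
# م AL
# א R
# 中 L
```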