Statistical analysis of strings in popular Open Source Projects
3rd April 2007
At Arabeyes we have several Open Source projects for translation, totaling more than 300000 strings. Our biggest challenge is preserving consistency and correctness across all of these projects. From experience, while some words seem obvious in English, their counterparts in another language (such as Arabic) can spark heated debates. A while back, in an effort to tackle this, we introduced the Arabic Technical Computing Dictionary and hosted it on a wiki for open discussion by any translator. A few scripts extract the messages every week into various neat formats for translators, including .csv, .po and even a .pdf that is suitable for printing (I still need to fix some issues with the latter).
However, we still had problems prioritising discussions: Which words should we discuss first? Which need immediate attention? Are we missing any important words? Are we over-analysing words that are not important? I believe these are important questions that every translation project, into any language, should consider. They are especially important for languages that do not yet have a concise and established list of terminology translations.
The solution seems quite obvious: analysis of existing projects. While computers are quite bad at translating with human-level accuracy, they are extremely good at statistics and counting. So why not exploit that?
So I put together a number of scripts that analyse .po files and output statistical data that can help us answer the previous questions. I ran them on the four biggest open source projects we have: KDE, GNOME, OpenOffice.org and Mozilla (including Firefox and Thunderbird). The string pool had nearly 300000 strings. Reading the .po files, the scripts count the number of occurrences of each word. The top 10 most used words* are:
- 4734 file
- 3002 name
- 2538 error
- 2268 text
- 2110 use
- 1946 list
- 1931 window
- 1869 select
- 1826 open
- 1825 show
Again, this list may seem obvious, but a word like “select” has a few equivalents in Arabic, and we struggled to agree on one term. The complete list is available in this file. A .pot [0.5 MB] template is also available, but beware that it contains a lot of rubbish, and with nearly 20000 entries I can’t clean it all. If you clean it, I’d be interested in having a copy.
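For readers who want to try the counting step on their own language, here is a minimal sketch in Python. It is not the actual Arabeyes script. The parser is deliberately simplified (it only reads msgid lines and their continuation lines), and the function names, the removal of the "&" and "_" accelerator markers, and the dropping of one-letter tokens are all my assumptions for illustration.

```python
import re
from collections import Counter
from pathlib import Path

WORD_RE = re.compile(r"[a-z]+")


def _unquote(s):
    """Return the text between the outer double quotes of a PO string line."""
    s = s.strip()
    if len(s) >= 2 and s[0] == '"' and s[-1] == '"':
        return s[1:-1]
    return ""


def iter_msgids(po_text):
    """Yield every msgid in the text of a .po file (simplified parser)."""
    current = None  # pieces of the msgid being collected, or None
    for raw in po_text.splitlines():
        line = raw.strip()
        if line.startswith("msgid "):
            if current is not None:
                yield "".join(current)
            current = [_unquote(line[len("msgid "):])]
        elif line.startswith('"') and current is not None:
            # Continuation line of a multi-line msgid.
            current.append(_unquote(line))
        elif current is not None:
            # Any other line (msgstr, msgid_plural, comment, blank) ends the msgid.
            yield "".join(current)
            current = None
    if current is not None:
        yield "".join(current)


def count_words(msgids):
    """Count lowercase words across msgids.

    Assumption: "&" and "_" are accelerator markers and are removed;
    one-letter tokens (often leftovers of "%s" or "\\n") are dropped.
    """
    counts = Counter()
    for msgid in msgids:
        text = msgid.lower().replace("&", "").replace("_", "")
        counts.update(w for w in WORD_RE.findall(text) if len(w) > 1)
    return counts


def count_tree(root):
    """Count words in every .po file found below the directory `root`."""
    counts = Counter()
    for path in Path(root).rglob("*.po"):
        text = path.read_text(encoding="utf-8", errors="replace")
        counts.update(count_words(iter_msgids(text)))
    return counts
```

Calling `count_tree("some/checkout").most_common(10)` (the path is a placeholder) would give a top 10 list in the same shape as the one above. Because the directory is walked inside the script, no long list of file names is ever passed on the command line.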
This only gives us the most popular words. We also want the most popular technical dictionary entries, including combinations such as “system administrator” (the previous list contains only single words). The most important technical dictionary entries are in this list.
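Counting dictionary entries that span more than one word could be done roughly as follows. This is only a sketch under my own assumptions: `count_terms` is a hypothetical helper, terms are matched as whole-word sequences regardless of case, and a short term is also counted when it appears inside a longer one (so “system” is counted inside “system administrator”).

```python
import re
from collections import Counter

WORD_RE = re.compile(r"[a-z]+")


def count_terms(msgids, terms):
    """Count how often each dictionary term (one or more words) occurs."""
    # Represent every term as a tuple of lowercase words.
    term_set = {tuple(WORD_RE.findall(t.lower())) for t in terms}
    term_set.discard(())
    max_len = max((len(t) for t in term_set), default=0)

    counts = Counter()
    for msgid in msgids:
        words = WORD_RE.findall(msgid.lower())
        # Slide a window of every relevant length over the message.
        for n in range(1, max_len + 1):
            for i in range(len(words) - n + 1):
                gram = tuple(words[i:i + n])
                if gram in term_set:
                    counts[" ".join(gram)] += 1
    return counts
```

Whether overlapping matches should count for both the short and the long term is a design choice; counting both keeps the code simple but inflates the totals of single words that are part of popular combinations.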
The difference between the complete list and the technical dictionary gives us the list of words that are not in the technical dictionary. Many of them are very important; I was honestly surprised to see words like “toolbar” and “tab” not being in the wiki.
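The difference itself is a small set operation. A possible sketch, assuming the word counts are held in a mapping and the dictionary is a plain list of terms (the helper name is hypothetical, and the counts for “toolbar” and “tab” in the usage comment are made-up placeholders, not real results):

```python
def missing_from_dictionary(word_counts, dictionary_terms):
    """Return (word, count) pairs for words absent from the dictionary,
    most frequent first, ties broken alphabetically."""
    known = {term.lower() for term in dictionary_terms}
    ranked = sorted(word_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(word, count) for word, count in ranked if word not in known]


# Example with two real counts from the top 10 and two made-up ones:
# missing_from_dictionary(
#     {"file": 4734, "name": 3002, "toolbar": 120, "tab": 95},
#     ["File", "Name"],
# )
```

Sorting by frequency first means the words at the top of the result are the ones most worth adding to the dictionary.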
Analysis of individual projects is also available. Here are the most popular words for KDE, GNOME, Mozilla and OpenOffice.org.
The complete set of scripts and results resides in Arabeyes CVS. Feel free to make use of them. The scripts are GPL but the data follows the license of the individual projects**. If you have a different way of analysis, or have another set of words from your language, I would be very interested in hearing from you.
Special thanks to Chahibi for helping me with some ideas.
* KDE was excluded because bash complained of too many files (arguments). If you know of a way to increase the limit please let me know.
** I believe they are compatible with the GPL. If you disagree, please send me an email (no need to yell).
April 4th, 2007 at 3:03 am
Rumours are that importing Mozilla translations to Open-tran.eu infringes copyrights of Mozilla as cited in the project’s blog.
One important thing is to link to the servers from which the POT files were imported for analysis. I also wonder if documentation is not to be ignored from the statistics.
One has to know that these statistics do not necessarily mean how important the terms are, since some very frequent term could only be translated in a library’s PO file and thus reused by the software in numerous applications. I guess analysing documentation could bring more results.
Options should be added to compute statistics on multi-word collocations.
April 4th, 2007 at 12:07 pm
Youssef, documentation for GNOME and KDE has been included.
“One has to know that these statistics do not necessarily mean how important the terms are, since some very frequent term could only be translated in a library’s PO file and thus reused by the software in numerous applications.” While this may be true, it’s only a small factor, and the results are still a good indication of what is important and what is not, to an acceptable degree of accuracy. As with all statistics, they are not 100% accurate. Maybe they downplay a few words, but they do not overestimate the high-ranking terms. On the other hand, one has to place enough emphasis on the few highly recurring terms in important packages such as GTK+ in GNOME (as we already do).
On Mozilla licensing, they triple license with the GPL, and I included a notice in the README.
On collocations, I’ll add that when I have time.