Djihed Afifi

Archive for April, 2007

Downloads of the Technical Dictionary

18th April 2007

About a month ago I wrote some scripts to export the technical dictionary contents from the Wiki to a PDF.

At the time I wondered how many people would download it, so I did not spend much time on formatting it neatly. The resulting PDF was not very good.

Today, however, I decided to check whether people are actually downloading it. After some data mining on the Apache logs, I was quite surprised to see 387 downloads, 197 of them from unique IP addresses. The breakdown of unique downloaders by country is shown below.

Encouraging. Time to go back, beautify it and make it look good.

On another front, parsing the referrers (the pages that directed people to the PDF) showed that about 70% came from the Arabic page and 30% from the English page. This highlights the importance of having pages in both languages for Arabic and English speakers.
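For anyone who wants to run a similar check, here is a minimal Python sketch of this kind of log analysis. It is not the script I used. It assumes Apache's combined log format and counts only plain GET requests answered with status 200. The function name and the path argument are made up for illustration.

```python
import re
from collections import Counter

# Combined Log Format:
# host ident user [time] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)"'
)


def summarise_downloads(lines, target_path):
    """Return (total downloads, unique IP count, Counter of referers).

    Only successful GET requests (status 200) for target_path are counted.
    Partial downloads (status 206) are ignored, which is an assumption.
    """
    total = 0
    ips = set()
    referers = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that are not in the expected format
        if match.group("method") != "GET":
            continue
        if match.group("path") != target_path:
            continue
        if match.group("status") != "200":
            continue
        total += 1
        ips.add(match.group("ip"))
        referers[match.group("referer")] += 1
    return total, len(ips), referers
```

Passing an open log file as `lines` works, since the function only iterates over it. Dividing each referer count by the total gives the share per referring page. Mapping the unique IPs to countries needs a GeoIP database and is left out here.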

Breakdown of unique Technical Dictionary downloads by country:

38 : EG, Egypt
22 : US, United States
12 : SA, Saudi Arabia
10 : GB, United Kingdom
9 : PS, Palestine
9 : DZ, Algeria
8 : TR, Turkey
8 : AE, United Arab Emirates
7 : MA, Morocco
6 : JO, Jordan
6 : DE, Germany
5 : IL, Israel
3 : TN, Tunisia
3 : QA, Qatar
3 : OM, Oman
3 : --, N/A
3 : IT, Italy
3 : FR, France
3 : CZ, Czech Republic
2 : SD, Sudan
2 : LY, Libyan Arab Jamahiriya
2 : KW, Kuwait
2 : IN, India
2 : FI, Finland
2 : CN, China
2 : BH, Bahrain
2 : A2, Satellite Provider
1 : ZA, South Africa
1 : UA, Ukraine
1 : TH, Thailand
1 : SY, Syrian Arab Republic
1 : RU, Russian Federation
1 : PK, Pakistan
1 : NZ, New Zealand
1 : NO, Norway
1 : MG, Madagascar
1 : LB, Lebanon
1 : HK, Hong Kong
1 : ES, Spain
1 : BG, Bulgaria
1 : AU, Australia
1 : AL, Albania

Posted in Arabisation | 9 Comments »

tclgeoip 0.2

15th April 2007

I would like to announce a new version of tclgeoip, the Tcl extension for GeoIP.

Amongst the changes in 0.2:

  1. Fixed a segfault when loading databases.
  2. Introduced a new function to check for the presence of databases (db_avail).
  3. Updated documentation.

Grab the .tar.gz sources here. The sources are also available from the SVN repository.

A Debian package is in the works.

Posted in Linux, tclgeoip | No Comments »

Statistical analysis of strings in popular Open Source Projects

3rd April 2007

At Arabeyes we have several Open Source projects for translation, totalling more than 300,000 strings. Our biggest challenge is preserving consistency and correctness across all of these projects. From experience, while some words seem obvious in English, their counterparts in another language (such as Arabic) can spark heated debates. A while back, in an effort to tackle this, we introduced the Arabic Technical Computing Dictionary and hosted it on a Wiki for open discussion by any translator. A few scripts extract the messages every week into various neat formats for translators, including .csv, .po and even a .pdf that is suitable for printing (I still need to fix some issues with the latter).

However, we still had problems prioritising discussions: Which words should we discuss first? Which need immediate attention? Are we missing any important words? Are we over-analysing words that are not important? I believe these are important questions that every translation project, into any language, should consider. They are especially important for languages that do not yet have a concise and established list of terminology translations.

The solution seems quite obvious: analysis of existing projects. While computers are quite bad at translating with human-level accuracy, they are extremely good at statistics and counting. So why not exploit that?

So I put together a number of scripts that analyse .po files and output statistical data that can help us answer the previous questions. I ran them on the four biggest open source projects we have: KDE, GNOME, OpenOffice.org and Mozilla (including Firefox and Thunderbird). The string pool had nearly 300,000 strings. Reading the .po files, the scripts count the number of occurrences of each word. The top 10 most used words* are:

  1. 4734 file
  2. 3002 name
  3. 2538 error
  4. 2268 text
  5. 2110 use
  6. 1946 list
  7. 1931 window
  8. 1869 select
  9. 1826 open
  10. 1825 show

Again, this list may seem obvious, but a word like “select” has a few equivalents in Arabic, and we struggled to agree on one term. The complete list is available in this file. A .pot [0.5 MB] template is also available, but beware that it contains a lot of rubbish, and with nearly 20000 entries I can’t clean it all. If you clean it, I’d be interested in having a copy.
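The counting step described above can be sketched in Python as follows. This is not the script from Arabeyes CVS, only a minimal illustration. The function names, the simplified msgid parser (plural forms are skipped), the stripping of accelerator markers (&, _, ~) and the dropping of one-letter tokens are all assumptions of mine.

```python
import re
from collections import Counter

WORD = re.compile(r"[a-z]+")


def unquote(text):
    """Remove the surrounding double quotes of a .po string literal."""
    text = text.strip()
    if len(text) >= 2 and text[0] == '"' and text[-1] == '"':
        return text[1:-1]
    return text


def iter_msgids(lines):
    """Yield each non-empty msgid, joining its continuation lines."""
    current = None
    for raw in lines:
        line = raw.strip()
        if line.startswith("msgid "):
            if current:
                yield current
            current = unquote(line[len("msgid "):])
        elif line.startswith('"') and current is not None:
            current += unquote(line)  # continuation of the msgid
        else:
            # msgstr, msgid_plural, comments and blank lines end the msgid
            if current:
                yield current
            current = None
    if current:
        yield current


def count_words(msgids):
    """Count lower-cased English words across all msgids."""
    counts = Counter()
    for msgid in msgids:
        text = re.sub(r"\\.", " ", msgid)   # drop escapes such as \n
        text = re.sub(r"[&_~]", "", text)   # drop accelerator markers
        words = WORD.findall(text.lower())
        counts.update(w for w in words if len(w) > 1)
    return counts
```

Calling `count_words(iter_msgids(open_file)).most_common(10)` on an open .po file gives the ten most frequent words in that file. Counters from several files can be added together with `+` to get totals per project.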

This only gives us the most popular words. We also want the most popular technical dictionary entries, including combinations such as “system administrator” (the previous list contains only single words). The most important technical dictionary entries are in this list.
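Counting dictionary entries that span several words needs a slightly different approach from counting single words. One possible way, sketched here in Python, is to slide a window of each entry's length over the words of every string. The function name and the tokenising rule are assumptions for illustration, not the actual scripts.

```python
import re
from collections import Counter

WORD = re.compile(r"[a-z]+")


def count_terms(msgids, terms):
    """Count how often each dictionary term (one or more words) occurs."""
    # Group the terms by word count: {length: {(word, ...): original term}}
    by_length = {}
    for term in terms:
        words = tuple(WORD.findall(term.lower()))
        if words:
            by_length.setdefault(len(words), {})[words] = term

    counts = Counter()
    for msgid in msgids:
        tokens = WORD.findall(msgid.lower())
        for length, lookup in by_length.items():
            # Slide a window of this length over the tokens of the string.
            for start in range(len(tokens) - length + 1):
                window = tuple(tokens[start:start + length])
                term = lookup.get(window)
                if term is not None:
                    counts[term] += 1
    return counts
```

Note that overlapping entries are counted independently: a string containing “system administrator” adds one to that entry and also one to “system” if both are in the dictionary. Whether that is desirable depends on how the results are used.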

The difference between the complete list and the technical dictionary gives us the list of words that are not in the technical dictionary. Many of them are very important; I was honestly surprised to see that words like “toolbar” and “tab” were not in the wiki.
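The difference itself is a simple set operation. A minimal Python sketch, assuming the word counts are held in a mapping from word to count and the dictionary is a plain list of terms (both names are hypothetical):

```python
def missing_from_dictionary(word_counts, dictionary_terms):
    """Return (word, count) pairs for words absent from the dictionary.

    The result is sorted by descending count, then alphabetically, so the
    most frequent missing words come first.
    """
    known = {term.lower() for term in dictionary_terms}
    missing = [
        (word, count)
        for word, count in word_counts.items()
        if word.lower() not in known
    ]
    return sorted(missing, key=lambda pair: (-pair[1], pair[0]))
```

Sorting by frequency puts the words that most urgently need a dictionary entry at the top of the list.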

Analysis of individual projects is also available. Here are the most popular words for KDE, GNOME, Mozilla and OpenOffice.org.

The complete set of scripts and results resides in Arabeyes CVS. Feel free to make use of them. The scripts are GPL, but the data follows the license of the individual projects**. If you have a different way of analysing, or have another set of words from your language, I would be very interested in hearing from you.

Special thanks to Chahibi for helping me with some ideas.

* KDE was excluded because bash complained of too many files (arguments). If you know of a way to increase the limit please let me know.
** I believe they are compatible with the GPL. If you disagree, please send me an email (no need to yell).
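On the first footnote: the error most likely comes from the operating system's limit on the size of the argument list passed to a new process, which is hit when a wildcard such as *.po expands to thousands of paths. Rather than raising the limit, one way around it is to avoid passing the file names as arguments at all. A minimal Python sketch, assuming the .po files sit somewhere below a single directory (the function name is made up):

```python
import os


def find_po_files(root):
    """Yield the path of every .po file below root, one at a time.

    Walking the tree inside the program means no long argument list is
    ever built, so the number of files does not matter.
    """
    for dirpath, dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if name.endswith(".po"):
                yield os.path.join(dirpath, name)
```

At the shell level the usual equivalents are `find` with `-exec`, or piping the file names to `xargs`, both of which process the files in batches.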

Posted in Arabisation | 2 Comments »