May 14, 2007

Subversion in a multinational context

Working in a multinational context is not without challenges, and sometimes it's not the cultural barriers or the language that gets in the way. Sometimes the simplest things can go wrong, like character encoding:

$ svn update
subversion/libsvn_subr/utf.c:466: (apr_err=22)
svn: Can't convert string from 'UTF-8' to native encoding:
subversion/libsvn_subr/utf.c:464: (apr_err=22)
svn: foo/bar/?\226?\128?\147v.1.3-1.xls
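The escaped numbers in that last line are the decimal values of the bytes svn could not convert. A minimal sketch (the helper name decimal_to_hex is my own invention, not part of svn) prints them in hex, which shows that they are the UTF-8 encoding of an en dash (U+2013) sitting in the filename:

```shell
# Print the given decimal byte values as a hex string.
decimal_to_hex() {
    # Turn each decimal value into a \0ooo octal escape, let printf %b
    # expand those escapes into raw bytes, then dump the bytes as hex.
    printf '%b' "$(printf '\\0%03o' "$@")" | od -An -tx1 | tr -d '[:space:]'
    echo
}

# Byte values copied from the svn error message above.
decimal_to_hex 226 128 147   # prints e28093
```

The "C" locale only covers ASCII, so there is no native character for svn to convert those three bytes to.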

The SVN book is quite comprehensive on the subject of svn localisation (even though they spell it with a 'z'), providing both technical and sociological advice:

The solution is either to set your locale to something which can represent the incoming UTF-8 data, or to change the filename or log message in the repository. (And don't forget to slap your collaborator's hand—projects should decide on common languages ahead of time, so that all participants are using the same locale.)

Nice idea, but I don't think it's entirely reasonable for me to demand that everyone on a multinational team switches their default locale to English, especially in their own working repositories. A far simpler technical solution is just to switch to UTF-8 myself, via a modified version of Torsten's hint. Before the switch:

$ locale
LANG=
LC_COLLATE="C"
LC_CTYPE="C"
LC_MESSAGES="C"
LC_MONETARY="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_ALL="C"

And the switch to Great British English (now added to my .bash_profile):

$ export LC_ALL=en_GB.UTF-8
$ export LC_CTYPE=UTF-8
$ export LANG=en_GB

And the result:

$ locale
LANG="en_GB"
LC_COLLATE="en_GB.UTF-8"
LC_CTYPE="en_GB.UTF-8"
LC_MESSAGES="en_GB.UTF-8"
LC_MONETARY="en_GB.UTF-8"
LC_NUMERIC="en_GB.UTF-8"
LC_TIME="en_GB.UTF-8"
LC_ALL="en_GB.UTF-8"
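Two caveats about the exports above: LC_ALL takes precedence over every other LC_* variable and over LANG, so the first line is the one doing the work, and en_GB.UTF-8 has to be installed on the machine for it to take effect. A hedged sketch for .bash_profile that checks first (has_locale is a hypothetical helper name, and the loose matching is there because `locale -a` may spell the codeset as utf8 or UTF-8 depending on the system):

```shell
# Succeed if locale name $1 appears in the newline-separated list $2.
# Case and hyphens are ignored so "en_GB.UTF-8" matches "en_GB.utf8".
has_locale() {
    want=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]' | sed 's/-//g')
    printf '%s\n' "$2" | tr '[:upper:]' '[:lower:]' | sed 's/-//g' | grep -qxF "$want"
}

# Only switch if the locale is actually installed on this machine.
if has_locale "en_GB.UTF-8" "$(locale -a 2>/dev/null)"; then
    export LC_ALL=en_GB.UTF-8
fi
```

Passing the list in as an argument keeps the check easy to try out with a made-up list before trusting it in a login script.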

Now SVN at least seems happy. And lo, there was much rejoicing and repository updating. Hopefully this won't break everything else.

Posted by savs at May 14, 2007 11:56 AM
Comments

My 'why I hate svn' blog post is ever growing. I really ought to write it sometime!

Posted by: Thom May at May 15, 2007 1:11 PM

svn: foo/bar/?\226?\128?\147v.1.3-1.xls

This is complete nonsense. Filenames contain no metadata so you can't be sure about the encoding used. A filename should *never* contain Unicode characters. I was discussing this with the main engineer responsible for Walmart's upgrade to Java 1.5, back in the day. I told him we were enforcing ASCII-compliant filenames for our files (a subset of ASCII actually) and he loved the idea. I work with Japan, and it makes no sense at all to have a Unicode kanji appear in a filename (btw is the Japanese dev using Unicode or, more likely, Shift-JIS!? Once more, filenames have no metadata on most systems, so I can't even know which encoding to use). How am I supposed to search for such a file!? I can't even *enter* part of its name on my computer. It is exactly the same for, say, American developers: they can't enter your country/language-specific letters.

Not only should *project* filenames never ever use non-ASCII chars, but moreover this should be enforced by some script. I emphasize "projects": developers should really know better about this. In a real world where developers from various cultures may collaborate on a project you do want to enforce strict guidelines on such a basic thing (and if you plan to do any automated scripting on different platforms, you probably want to forbid spaces in filenames... This is easy to enforce by some guidelines and it will save the team lots of parsing/escaping headaches).
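As a rough sketch of what such a script could look like: the allowed character set and the function name is_portable_path are assumptions for illustration, and the pre-commit hook wiring is only indicated in a comment rather than tested here.

```shell
# Succeed only if the path sticks to a conservative ASCII subset:
# letters, digits, dot, underscore, hyphen and slash (an assumed policy).
is_portable_path() {
    [ -n "$1" ] || return 1
    # In the C locale grep works byte by byte, so any non-ASCII byte
    # (or a space) matches the negated class and the path is rejected.
    if printf '%s' "$1" | LC_ALL=C grep -q '[^A-Za-z0-9._/-]'; then
        return 1
    fi
    return 0
}

# In a Subversion pre-commit hook the paths might come from something like
#   svnlook changed -t "$TXN" "$REPOS" | cut -c5-
# (assumed wiring); here two sample paths stand in for that list.
for path in "foo/bar/report-v1.3.xls" "foo/bar/my file.xls"; do
    if is_portable_path "$path"; then
        echo "ok: $path"
    else
        echo "rejected: $path"
    fi
done
```

A real hook would exit non-zero on the first rejected path so the commit is refused.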

For personal files it's different: these are your files and you ain't sharing them with people from all over the world.

But for a collaborative project... You are shooting yourself and your whole team in the foot.

Tools ain't perfect (and in the Japanese shift-JIS/Unicode filename case some problems simply have no solution), developers must use brains.

Common sense 101.

Posted by: Anonymous Coward at May 24, 2007 2:01 PM

Oh god, I thought I was a nerd! Respect...

Posted by: James Gosling at May 27, 2007 6:44 PM