Geographically isolated bugs

I find it interesting how a certain class of programming problems may be so common in one geographical location, that are considered to be common knowledge, yet completely unknown in the rest of the world.
Consider this piece of code in C/C++

// ロボット機能
if (robotRequired)
{
    CreateRobot();
}

See anything wrong with it? No?

Well, if you happen to save that little piece of code in Shift-JIS, which is an encoding very commonly used for Japanese text, on some compilers the “if” statement will get completely ignored, and the contents will be executed every time.

What’s going on? If your compiler is not designed or configured to recognize Shift-JIS sequences as multi-byte characters, most of the times your program will work as intended, since everything after the “//” will get ignored as a comment. However, in this case, this is what the compiler will see:

// ƒƒ{ƒbƒg‹@”\
if (robotRequired)
{
    CreateRobot();
}

Notice the last character in the comment? That backslash will escape the line break, and make the comment continue on to the next line. After preprocessing, this is what the code looks like:

{
    CreateRobot();
}

This happens because multi-byte sequences in Shift-JIS do not guarantee that any bytes after the first one will be above 0x7f. In this case, the character 能 is encoded as 0x94 0x5c. That second byte happens to be the same as the backslash, which causes this problem.

This is certainly not isolated to the 能 symbol. Any symbol whose last byte is 0x5c will exhibit the same problem. These are some symbols that end in 0x5c, (list courtesy of this site):

ーソЫⅨ噂浬欺圭構蚕十申曾箪貼能表暴予禄兔喀媾彌拿杤歃濬畚秉綵臀藹觸軆鐔饅鷭偆砡纊犾

Most of these are not obscure symbols. A lot of them are actually very likely to appear in comments, string literals or anywhere else where a stray backslash is likely to cause a lot of trouble.

It turns out that in Japan, this is very common knowledge for programmers. and you should not call yourself a programmer in Japan in you don’t know about this. The problem is that, in the existence of such a problem, it can be very difficult to find what’s causing a problem, since at first sight there is nothing wrong with the code, and in a 100000+ line program, the comments are the last place you will start looking for problems.

Of course, I had never heard of this issue, and faced it a few days ago. Thankfully, I was not the one who inserted the problem, just the guy who found out about it after noticing how no assembly code was being generated by the compiler for that line, even with all optimizations disabled.

What can you do to avoid this problem? Use UTF-8. Always.

UTF-8 multi-byte sequences contain only bytes set to 0x80 or above, so this problem will never exist if you use UTF-8 characters. And your compiler does not really have to be UTF-8 aware to compile UTF-8 source files, as long as you don’t start naming your variables and functions with multi-byte characters.

Cool bug though. Exploiting it would make an awesome entry in the underhanded C contest.