Geographically isolated bugs

I find it interesting how a certain class of programming problems may be so common in one geographical location, that are considered to be common knowledge, yet completely unknown in the rest of the world.
Consider this piece of code in C/C++

// ロボット機能
if (robotRequired)
{
    CreateRobot();
}

See anything wrong with it? No?

Well, if you happen to save that little piece of code in Shift-JIS, which is an encoding very commonly used for Japanese text, on some compilers the “if” statement will get completely ignored, and the contents will be executed every time.

What’s going on? If your compiler is not designed or configured to recognize Shift-JIS sequences as multi-byte characters, most of the times your program will work as intended, since everything after the “//” will get ignored as a comment. However, in this case, this is what the compiler will see:

// ƒƒ{ƒbƒg‹@”\
if (robotRequired)
{
    CreateRobot();
}

Notice the last character in the comment? That backslash will escape the line break, and make the comment continue on to the next line. After preprocessing, this is what the code looks like:

{
    CreateRobot();
}

This happens because multi-byte sequences in Shift-JIS do not guarantee that any bytes after the first one will be above 0x7f. In this case, the character 能 is encoded as 0x94 0x5c. That second byte happens to be the same as the backslash, which causes this problem.

This is certainly not isolated to the 能 symbol. Any symbol whose last byte is 0x5c will exhibit the same problem. These are some symbols that end in 0x5c, (list courtesy of this site):

ーソЫⅨ噂浬欺圭構蚕十申曾箪貼能表暴予禄兔喀媾彌拿杤歃濬畚秉綵臀藹觸軆鐔饅鷭偆砡纊犾

Most of these are not obscure symbols. A lot of them are actually very likely to appear in comments, string literals or anywhere else where a stray backslash is likely to cause a lot of trouble.

It turns out that in Japan, this is very common knowledge for programmers. and you should not call yourself a programmer in Japan in you don’t know about this. The problem is that, in the existence of such a problem, it can be very difficult to find what’s causing a problem, since at first sight there is nothing wrong with the code, and in a 100000+ line program, the comments are the last place you will start looking for problems.

Of course, I had never heard of this issue, and faced it a few days ago. Thankfully, I was not the one who inserted the problem, just the guy who found out about it after noticing how no assembly code was being generated by the compiler for that line, even with all optimizations disabled.

What can you do to avoid this problem? Use UTF-8. Always.

UTF-8 multi-byte sequences contain only bytes set to 0x80 or above, so this problem will never exist if you use UTF-8 characters. And your compiler does not really have to be UTF-8 aware to compile UTF-8 source files, as long as you don’t start naming your variables and functions with multi-byte characters.

Cool bug though. Exploiting it would make an awesome entry in the underhanded C contest.

How to do UDP hole punching

I will now show you how to do UDP hole punching, with code in C. Hole punching is an advanced networking concept, so you’re expected to at least know how to compile/run this code.

I have written lots of comments inside the code to explain what is happening.

To use this, run the server code in a computer with a public IP address, and then run the client code in two or more different computers, each behind a different NAT.

I compiled and tested this on Ubuntu Server 11.04 and CentOS 5, it should work easily in probably all other linuces and BSDs. It could run in Windows using the WSA code. This code is also not endianness-safe, so it would be best to run it on x86-64 or similar in all machines.

I release this code to the public domain, but you’re a complete lunatic if you plan to use this code in any real program.

server code:

// UDP hole punching example, server code
// Base UDP code stolen from http://www.abc.se/~m6695/udp.html
// By Oscar Rodriguez
// This code is public domain, but you're a complete lunatic
// if you plan to use this code in any real program.

#include <arpa/inet.h>
#include <netinet/in.h>
#include 
#include <sys/types.h>
#include <sys/socket.h>
#include 
#include 
#include 
#define BUFLEN 512
#define NPACK 10
#define PORT 9930

// A small struct to hold a UDP endpoint. We'll use this to hold each client's endpoint.
struct client
{
    int host;
    short port;
};

// Just a function to kill the program when something goes wrong.
void diep(char *s)
{
    perror(s);
    exit(1);
}

int main(void)
{
    struct sockaddr_in si_me, si_other;
    int s, i, j, slen=sizeof(si_other);
    char buf[BUFLEN];
    struct client clients[10]; // 10 clients. Notice that we're not doing any bound checking.
    int n = 0;
    // Create a UDP socket
    if ((s=socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP))==-1)
        diep("socket");
    // si_me stores our local endpoint. Remember that this program
    // has to be run in a network with UDP endpoint previously known
    // and directly accessible by all clients. In simpler terms, the
    // server cannot be behind a NAT.
    memset((char *) &si_me, 0, sizeof(si_me));
    si_me.sin_family = AF_INET;
    si_me.sin_port = htons(PORT);
    si_me.sin_addr.s_addr = htonl(INADDR_ANY);
    if (bind(s, (struct sockaddr*)(&si_me), sizeof(si_me))==-1)
        diep("bind");
    while (1)
    {
        // When a new client sends a datagram...
        if (recvfrom(s, buf, BUFLEN, 0, (struct sockaddr*)(&si_other), &slen)==-1)
            diep("recvfrom");
        // The client's public UDP endpoint data is now in si_other.
        // Notice that we're completely ignoring the datagram payload.
        // If we want to support multiple clients inside the same NAT,
        // we'd have clients send their own private UDP endpoints
        // encoded in some way inside the payload, and store those as
        // well.
        printf("Received packet from %s:%d\n", inet_ntoa(si_other.sin_addr), ntohs(si_other.sin_port));
        // Now we add the client's UDP endpoint in our list.
        clients[n].host = si_other.sin_addr.s_addr;
        clients[n].port = si_other.sin_port;
        n++;
        // And then tell everybody about everybody's public UDP endpoints
        for (i = 0; i < n; i++)
        {
            si_other.sin_addr.s_addr = clients[i].host;
            si_other.sin_port = clients[i].port;
            // We send a datagram for each client in our list. Of course,
            // we could also assemble a single datagram and send that.
            for (j = 0; j < n; j++)
            {
                // The payload is the client's public UDP endpoint, clients[j]
                printf("Sending to %s:%d\n", inet_ntoa(si_other.sin_addr), ntohs(si_other.sin_port));
                // We're sending binary data here, using the server's byte order.
                // In your code, you should make sure every client agrees on the endianness.
                if (sendto(s, &clients[j], 6, 0, (struct sockaddr*)(&si_other), slen)==-1)
                    diep("sendto");
            }
        }
        printf("Now we have %d clients\n", n);
        // And we go back to listening. Notice that since UDP has no notion
        // of connections, we can use the same socket to listen for data
        // from different clients.
    }
    // Actually, we never reach this point...
    close(s);
    return 0;
}

client code:

// UDP hole punching example, client code
// Base UDP code stolen from http://www.abc.se/~m6695/udp.html
// By Oscar Rodriguez
// This code is public domain, but you're a complete lunatic
// if you plan to use this code in any real program.

#include <arpa/inet.h>
#include <netinet/in.h>
#include 
#include <sys/types.h>
#include <sys/socket.h>
#include 
#include 
#include 
#define BUFLEN 512
#define NPACK 10
#define PORT 9930

// This is our server's IP address. In case you're wondering, this one is an RFC 5737 address.
#define SRV_IP "203.0.113.61"

// A small struct to hold a UDP endpoint. We'll use this to hold each peer's endpoint.
struct client
{
    int host;
    short port;
};

// Just a function to kill the program when something goes wrong.
void diep(char *s)
{
    perror(s);
    exit(1);
}

int main(int argc, char* argv[])
{
    struct sockaddr_in si_me, si_other;
    int s, i, f, j, k, slen=sizeof(si_other);
    struct client buf;
    struct client server;
    struct client peers[10]; // 10 peers. Notice that we're not doing any bound checking.
    int n = 0;
    if ((s=socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP))==-1)
        diep("socket");
    // Our own endpoint data
    memset((char *) &si_me, 0, sizeof(si_me));
    si_me.sin_family = AF_INET;
    si_me.sin_port = htons(PORT); // This is not really necessary, we can also use 0 (any port)
    si_me.sin_addr.s_addr = htonl(INADDR_ANY);
    // The server's endpoint data
    memset((char *) &si_other, 0, sizeof(si_other));
    si_other.sin_family = AF_INET;
    si_other.sin_port = htons(PORT);
    if (inet_aton(SRV_IP, &si_other.sin_addr)==0)
        diep("aton");
    // Store the server's endpoint data so we can easily discriminate between server and peer datagrams.
    server.host = si_other.sin_addr.s_addr;
    server.port = si_other.sin_port;
    // Send a simple datagram to the server to let it know of our public UDP endpoint.
    // Not only the server, but other clients will send their data through this endpoint.
    // The datagram payload is irrelevant, but if we wanted to support multiple
    // clients behind the same NAT, we'd send our won private UDP endpoint information
    // as well.
    if (sendto(s, "hi", 2, 0, (struct sockaddr*)(&si_other), slen)==-1)
        diep("sendto");
    // Right here, our NAT should have a session entry between our host and the server.
    // We can only hope our NAT maps the same public endpoint (both host and port) when we
    // send datagrams to other clients using our same private endpoint.
    while (1)
    {
        // Receive data from the socket. Notice that we use the same socket for server and
        // peer communications. We discriminate by using the remote host endpoint data, but
        // remember that IP addresses are easily spoofed (actually, that's what the NAT is
        // doing), so remember to do some kind of validation in here.
        if (recvfrom(s, &buf, sizeof(buf), 0, (struct sockaddr*)(&si_other), &slen)==-1)
            diep("recvfrom");
        printf("Received packet from %s:%d\n", inet_ntoa(si_other.sin_addr), ntohs(si_other.sin_port));
        if (server.host == si_other.sin_addr.s_addr && server.port == (short)(si_other.sin_port))
        {
            // The datagram came from the server. The server code is set to send us a
            // datagram for each peer, in which the payload contains the peer's UDP
            // endpoint data. We're receiving binary data here, sent using the server's
            // byte ordering. We should make sure we agree on the endianness in any
            // serious code.
            f = 0;
            // Now we just have to add the reported peer into our peer list
            for (i = 0; i < n && f == 0; i++)
            {
                if (peers[i].host == buf.host && peers[i].port == buf.port)
                {
                    f = 1;
                }
            }
            // Only add it if we didn't have it before.
            if (f == 0)
            {
                peers[n].host = buf.host;
                peers[n].port = buf.port;
                n++;
            }
            si_other.sin_addr.s_addr = buf.host;
            si_other.sin_port = buf.port;
            printf("Added peer %s:%d\n", inet_ntoa(si_other.sin_addr), ntohs(si_other.sin_port));
            printf("Now we have %d peers\n", n);
            // And here is where the actual hole punching happens. We are going to send
            // a bunch of datagrams to each peer. Since we're using the same socket we
            // previously used to send data to the server, our local endpoint is the same
            // as before.
            // If the NAT maps our local endpoint to the same public endpoint
            // regardless of the remote endpoint, after the first datagram we send, we
            // have an open session (the hole punch) between our local endpoint and the
            // peer's public endpoint. The first datagram will probably not go through
            // the peer's NAT, but since UDP is stateless, there is no way for our NAT
            // to know that the datagram we sent got dropped by the peer's NAT (well,
            // our NAT may get an ICMP Destination Unreachable, but most NATs are
            // configured to simply discard them) but when the peer sends us a datagram,
            // it will pass through the hole punch into our local endpoint.
            for (k = 0; k < 10; k++)
            {
                // Send 10 datagrams.
                for (i = 0; i < n; i++)
                {
                    si_other.sin_addr.s_addr = peers[i].host;
                    si_other.sin_port = peers[i].port;
                    // Once again, the payload is irrelevant. Feel free to send your VoIP
                    // data in here.
                    if (sendto(s, "hi", 2, 0, (struct sockaddr*)(&si_other), slen)==-1)
                        diep("sendto()");
                }
            }
        }
        else
        {
            // The datagram came from a peer
            for (i = 0; i < n; i++)
            {
                // Identify which peer it came from
                if (peers[i].host == buf.host && peers[i].port == (short)(buf.port))
                {
                    // And do something useful with the received payload
                    printf("Received from peer %d!\n", i);
                    break;
                }
            }
            // It is possible to get data from an unregistered peer. These are some reasons
            // I quickly came up with, in which this can happen:
            // 1. The server's datagram notifying us with the peer's address got lost,
            //    or it hasn't arrived yet (likely)
            // 2. A malicious (or clueless) user is sending you data on this endpoint (maybe
            //    unlikely)
            // 3. The peer's public endpoint changed either because the session timed out,
            //    or because its NAT did not assign the same public endpoint when sending
            //    datagrams to different remote endpoints. If this happens, and we're able
            //    to detect this situation, we could change our peer's endpoint data to
            //    the correct one. If we manage to pull this off correctly, even if at most
            //    one client has a NAT that doesn't support hole punching, we can communicate
            //    directly between both peers.
        }
    }]
    // Actually, we never reach this point...
    close(s);
    return 0;
}