When creating cross-platform frameworks, text encodings can be a hairy topic. There are multiple encodings to choose from, as well as edge cases to be concerned about. This post is going to cover some suggestions on how to handle string encodings, but keep in mind that there’s more than one way to skin a cat. This post assumes you have at least a rudimentary understanding of what text encodings are and how they work.
As with most things in framework design, consistency is the key to creating a compelling interface. So the first suggestion is probably the most important one as well: pick an encoding, and ensure all of your APIs adhere to it. Which encoding you choose is up to you, but not all encodings were created equal. For starters, any encoding that’s not a Unicode encoding is unlikely to cover all of the characters your users may expect. For instance, Windows Latin 1 and MacRoman each support only a small subset of characters. So avoid these system-specific code pages, as they will cause more problems than they solve. Instead, pick from UTF-8, UTF-16, UTF-32, UCS-2, or UCS-4. Any one of these will cover the characters users are likely to work with (though note that UCS-2 is limited to the Basic Multilingual Plane).
My personal recommendation is to use UTF-8, as it has many helpful properties. For starters, ASCII is a subset of UTF-8, so any ASCII character is encoded the same way in UTF-8 as it is in ASCII. This means no special handling is required for ASCII characters; they “just work.” But beyond that, UTF-8 can be represented using the plain char datatype in C. One important thing to keep in mind is that one byte does not always equal one character in UTF-8. Be wary of strlen, as it returns the number of bytes in a string, not the number of characters. Thankfully, needing the character count of a string comes up far less frequently than needing the byte count. If you do need the character count, you can call mbstowcs(NULL, str, 0) to get it in a portable fashion, provided the current locale’s multibyte encoding is UTF-8 (set via setlocale).
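Because every non-ASCII code point in UTF-8 is encoded as a lead byte followed by continuation bytes of the form 10xxxxxx, you can also count code points yourself with a small loop. Here’s a minimal sketch, assuming the input is valid UTF-8 (the function name is my own):

```c
#include <stddef.h>

/* Count the code points in a valid UTF-8 string by counting every
 * byte that is NOT a continuation byte (10xxxxxx). */
size_t utf8_strlen(const char *s)
{
    size_t count = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```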
The reason I usually choose UTF-8 has less to do with it being a “better” encoding and more to do with the underlying datatypes I can use in C. Using wchar_t seems like it’d be a great idea because it’s already set up to be a “Unicode” string datatype. Unfortunately, wchar_t is a terrible cross-platform datatype because its size and encoding differ from platform to platform. For instance, on Windows, wchar_t is two bytes wide and holds UTF-16; on the Mac, it’s four bytes wide and holds UTF-32. So wchar_t is a viable option for single-platform frameworks, but for cross-platform work it causes too much heartache.
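A quick way to see the problem for yourself is to print the size of wchar_t on each platform:

```c
#include <stdio.h>
#include <wchar.h>

int main(void)
{
    /* Prints 2 on Windows (UTF-16 code units) but 4 on the Mac
     * (UTF-32 code units). */
    printf("sizeof(wchar_t) = %zu\n", sizeof(wchar_t));
    return 0;
}
```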
Once you’ve picked your encoding, don’t forget to document your APIs! There’s nothing more frustrating for a user who cares about encodings than seeing “const char *” as the datatype with no indication of whether it holds UTF-8, ASCII, or some system-specific encoding. The same goes for std::string if you happen to be making a C++ framework. So be sure to document your datatypes accordingly!
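Something as simple as a comment on the declaration is enough. Here’s a sketch using a hypothetical framework function, shown only to illustrate the idea:

```c
/* Sets the window title.
 *
 * title: a NUL-terminated, UTF-8 encoded string. It is copied
 * internally, so the caller retains ownership of the buffer.
 * (fw_set_window_title is a hypothetical example function.) */
void fw_set_window_title(const char *title);
```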
The next point to cover is external usage of the strings themselves. For most APIs, simply passing the UTF-8 data back and forth is sufficient. However, there are some exceptions to this rule. If you are going to send data to an external destination (over a socket, to a file, etc.), then you should include a BOM, which lets the receiving end know it’s getting UTF-8 encoded data. To apply the BOM, simply prepend the bytes 0xEF, 0xBB, 0xBF to your string. This is the UTF-8 BOM; byte order doesn’t actually matter for UTF-8, since it’s a sequence of single bytes, but the BOM still acts as a signature for the encoding. As an aside, if you are writing to an external destination, you may want to let the user control the output encoding via a parameter. Be sure to write out the BOM for whatever encoding the user chooses (if it has one), because the BOM is especially important for cross-platform data: UTF-16LE written on Windows is unreadable on the Mac unless a BOM is included, and likewise UTF-16BE written on the Mac is unreadable on Windows.
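As a sketch, writing the BOM is just a matter of emitting those three bytes before the payload (write_utf8_with_bom is a hypothetical helper name; error handling is minimal):

```c
#include <stdio.h>

/* Prepend the UTF-8 BOM (0xEF 0xBB 0xBF) before writing UTF-8 text
 * to an external destination. Returns 0 on success, -1 on error. */
int write_utf8_with_bom(FILE *out, const char *utf8, size_t len)
{
    static const unsigned char bom[3] = { 0xEF, 0xBB, 0xBF };
    if (fwrite(bom, 1, sizeof bom, out) != sizeof bom)
        return -1;
    if (fwrite(utf8, 1, len, out) != len)
        return -1;
    return 0;
}
```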
Similar to writing out a BOM, when reading in data from an external source, you should not make assumptions about the encoding of the text. Check for a BOM at the start of the string, and that will tell you what encoding you are receiving. If there’s no BOM present, then you’re on your own for determining the string encoding, unfortunately.
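Here’s a sketch of BOM detection (the enum and function names are my own). Note that the 4-byte UTF-32 BOMs must be checked before the 2-byte UTF-16 ones, because a UTF-32LE BOM begins with the same two bytes as a UTF-16LE BOM:

```c
#include <stddef.h>
#include <string.h>

typedef enum {
    ENC_UNKNOWN, ENC_UTF8, ENC_UTF16_LE, ENC_UTF16_BE,
    ENC_UTF32_LE, ENC_UTF32_BE
} encoding_t;

/* Inspect the first bytes of a buffer for a known BOM. */
encoding_t detect_bom(const unsigned char *buf, size_t len)
{
    if (len >= 4 && memcmp(buf, "\xFF\xFE\x00\x00", 4) == 0) return ENC_UTF32_LE;
    if (len >= 4 && memcmp(buf, "\x00\x00\xFE\xFF", 4) == 0) return ENC_UTF32_BE;
    if (len >= 3 && memcmp(buf, "\xEF\xBB\xBF", 3) == 0)     return ENC_UTF8;
    if (len >= 2 && memcmp(buf, "\xFF\xFE", 2) == 0)         return ENC_UTF16_LE;
    if (len >= 2 && memcmp(buf, "\xFE\xFF", 2) == 0)         return ENC_UTF16_BE;
    return ENC_UNKNOWN;
}
```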
The last thing to cover is how to use the string data within your framework to perform actual work. Understand that the OS APIs have their own expectations about string encodings. On Windows, you should be using the “W” versions of the APIs these days (there’s no excuse for using the “A” versions any longer). Those APIs expect UTF-16 data, so use MultiByteToWideChar to convert your UTF-8 strings as needed. On the Mac, the OS APIs generally expect UTF-8 data, so you can typically pass your strings directly. But if you’re using Cocoa or Carbon, you may need to create a CFString (CFStringCreateWithCString) or NSString (stringWithCString) object. Because the OS representation of string data may differ from your framework’s, you may want to consider creating a helper that converts from UTF-8 to the OS-specific datatype.
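Such a helper might look like the following sketch (the function name is hypothetical, and error handling is minimal):

```c
#ifdef _WIN32
#include <windows.h>
#include <stdlib.h>

/* Convert a NUL-terminated UTF-8 string to a newly allocated UTF-16
 * string for the "W" Win32 APIs. Caller frees the result; returns
 * NULL on failure. */
wchar_t *utf8_to_os_string(const char *utf8)
{
    int n = MultiByteToWideChar(CP_UTF8, 0, utf8, -1, NULL, 0);
    if (n <= 0)
        return NULL;
    wchar_t *wide = malloc(n * sizeof(wchar_t));
    if (wide != NULL)
        MultiByteToWideChar(CP_UTF8, 0, utf8, -1, wide, n);
    return wide;
}
#else
#include <CoreFoundation/CoreFoundation.h>

/* Wrap a NUL-terminated UTF-8 string in a CFString for Carbon or
 * Cocoa (toll-free bridged to NSString). Caller calls CFRelease. */
CFStringRef utf8_to_os_string(const char *utf8)
{
    return CFStringCreateWithCString(kCFAllocatorDefault, utf8,
                                     kCFStringEncodingUTF8);
}
#endif
```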
Text encoding is a fundamentally important part of framework design if your framework accepts string data. Not everyone is from the same locale as you, and using a Unicode encoding will go a long way toward including developers around the world instead of excluding them.
My friend Joe wanted me to clarify the NSString creation function: it shouldn’t be the vanilla stringWithCString, as that method is deprecated. Use stringWithCString:encoding: instead.
Thanks Joe!