On bytes, chars, Strings, XML and Unicode
Strings
What does this print?
byte[] buf = new byte[]{'H', 'e', 'l', 'l', 'o', ' ', 'W', 'o', 'r', 'l', 'd'};
String s = new String(buf);
System.out.println(s);
Obviously, the answer is the infamous “Hello World”. Unless you live in China or Japan. More about that later.
Let’s make it a bit more exciting:
byte[] buf = new byte[]{'H', (byte) 0xE9, 'l', 'l', 'o', ' ', 'W', (byte)0xF8, 'r', 'l', 'd'};
String s = new String(buf);
System.out.println(s);
Before you answer, let me tell you that 0xE9 is the ISO-8859-1 representation of é, and 0xF8 is the representation of ø. ISO-8859-1 is the character encoding used in most West-European countries. So, you would figure that this would print “Héllo Wørld”, right?
Well, it depends. On my Mac, it prints “HÎllo W¯rld”. On my Windows VMWare instance, it does print the correct string. What’s up with that?
The issue here is the String constructor that’s used. According to the documentation of the String(byte[] bytes) constructor, this “constructs a new String by decoding the specified array of bytes using the platform’s default charset.” The default character encoding on OS X is Mac Roman, which maps 0xE9 and 0xF8 to entirely different characters. On Windows, it’s Windows-1252, which agrees with ISO-8859-1 everywhere except the 0x80 to 0x9F range, so these two bytes happen to decode correctly there. Hence the decode mixup. The way to make it work everywhere would be to use the other constructor, where you can specify a charset:
String s = new String(buf, "ISO-8859-1");
After working with Java for more than ten years, I still can’t see why Sun added the byte array “convenience” constructor. It’s not convenient at all. If anything, it’s inconvenient, because it causes many bugs. This is especially true in Enterprise apps, where you really don’t want to depend on the language settings of the underlying operating system to figure out how to decode your strings. It all works fine in the US on Windows, but when someone deploys your app in, say, Japan, you’re screwed.
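To make the difference concrete, here is a minimal sketch that decodes the same byte array with two explicitly named charsets. The class and method names (ExplicitCharsetDemo, decode) are made up for this illustration; the only API it relies on is String(byte[], Charset) and the constants in java.nio.charset.StandardCharsets.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ExplicitCharsetDemo {
    // The bytes from the example above: 0xE9 and 0xF8 are é and ø in ISO-8859-1.
    static final byte[] BUF = {'H', (byte) 0xE9, 'l', 'l', 'o', ' ', 'W', (byte) 0xF8, 'r', 'l', 'd'};

    // Decode with a charset we name ourselves, so the platform default never matters.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // Shows which charset new String(byte[]) would have picked on this machine.
        System.out.println("platform default: " + Charset.defaultCharset());
        // Same String contents on every platform.
        System.out.println(decode(BUF, StandardCharsets.ISO_8859_1));
        // The wrong charset gives a different String: these bytes are not valid UTF-8.
        System.out.println(decode(BUF, StandardCharsets.UTF_8));
    }
}
```

Note that what actually shows up on your console still depends on how System.out encodes its output, which is yet another default to watch out for.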
InputStreams
There are a whole bunch of these “inconvenience” constructors in Java. Consider this:
byte[] buf = new byte[]{'H', (byte) 0xE9, 'l', 'l', 'o', ' ', 'W', (byte)0xF8, 'r', 'l', 'd'};
BufferedReader rdr = new BufferedReader(new InputStreamReader(new ByteArrayInputStream(buf)));
String s = rdr.readLine();
System.out.println(s);
As it turns out, the InputStreamReader(InputStream in) constructor also uses the default character set. Bad Java! Bad! We should have done this instead:
BufferedReader rdr = new BufferedReader(new InputStreamReader(new ByteArrayInputStream(buf), "ISO-8859-1"));
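For completeness, a self-contained sketch of the same fix, using the InputStreamReader(InputStream, Charset) constructor with a StandardCharsets constant instead of a charset name. The class name ExplicitReaderDemo and the helper readFirstLine are hypothetical, chosen only for this example.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ExplicitReaderDemo {
    static final byte[] BUF = {'H', (byte) 0xE9, 'l', 'l', 'o', ' ', 'W', (byte) 0xF8, 'r', 'l', 'd'};

    // Reads the first line of the byte array, decoding it with the given charset.
    static String readFirstLine(byte[] bytes, Charset charset) {
        try (BufferedReader rdr = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(bytes), charset))) {
            return rdr.readLine();
        } catch (IOException e) {
            // Not expected for an in-memory stream; rethrown unchecked to keep the sketch short.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(readFirstLine(BUF, StandardCharsets.ISO_8859_1));
    }
}
```

Passing a Charset object rather than a String also means there is no UnsupportedEncodingException to catch.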
By this point, Ruby developers will think that this is further proof as to why Java is so bloated, because of the verbosity. Well, screw you, and don’t come back until your language has proper Unicode support.
Java developers, on the other hand, will think that if they stick to using Strings everywhere, they’re good. After all, Java’s String is Unicode, right? Well, not really. As explained in the String javadoc, a String is made up of UTF-16 encoded chars, which are exposed by toCharArray(), for instance. So the String is still encoded, but as a wide character array, rather than a byte array, and a single Unicode character may take up two chars. The only way to properly deal with Unicode in Java is to work with code points, using the Character class and more specifically codePointAt and related methods.
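A small sketch of what that means in practice, using a character outside the Basic Multilingual Plane. The class name CodePointDemo is made up for this example; the character is written as a Unicode escape so that the source file encoding cannot get in the way.

```java
class CodePointDemo {
    // U+1D11E (musical symbol G clef) does not fit in one 16-bit char,
    // so UTF-16 represents it as a surrogate pair: two chars.
    static final String CLEF = "\uD834\uDD1E";

    public static void main(String[] args) {
        String s = "G" + CLEF;

        System.out.println(s.length());                      // 3 chars (UTF-16 code units)
        System.out.println(s.codePointCount(0, s.length())); // 2 code points (actual characters)

        // charAt(1) returns only the first half of the surrogate pair...
        System.out.println(Character.isHighSurrogate(s.charAt(1)));
        // ...while codePointAt(1) combines both halves into the real code point.
        System.out.printf("U+%04X%n", s.codePointAt(1));

        // Iterating by code point instead of by char.
        s.codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp));
    }
}
```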
If you’d been a good boy or girl, you would have read Joel’s The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!), and you’d already know this. But wait, there’s more!
XML
If you add XML to the mix, it gets more interesting. You probably know that every XML file starts with a declaration, like so
<?xml version="1.0" encoding="UTF-8"?>
In fact, the encoding part is optional. If you leave it out, an XML parser will default to UTF-8, unless the file begins with a UTF-16 Byte Order Mark, in which case it’s UTF-16. So far, so good.
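The defaulting rule can be sketched as a check on the first few bytes of the file. This is a deliberately simplified illustration, not what any particular parser does: a real parser also reads the encoding declaration and recognizes several more byte patterns (see Appendix F of the XML 1.0 specification). The class BomSniffer and its methods are hypothetical names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class BomSniffer {
    // Guess the charset of an XML document that has no encoding declaration.
    static Charset guessCharset(byte[] head) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) {
            return StandardCharsets.UTF_8;      // UTF-8 with an (optional) BOM
        }
        if (startsWith(head, 0xFE, 0xFF)) {
            return StandardCharsets.UTF_16BE;   // big-endian BOM
        }
        if (startsWith(head, 0xFF, 0xFE)) {
            return StandardCharsets.UTF_16LE;   // little-endian BOM (UTF-32 is ignored in this sketch)
        }
        return StandardCharsets.UTF_8;          // no BOM: fall back to UTF-8
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            // Mask to compare as unsigned values; Java bytes are signed.
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(guessCharset(new byte[]{(byte) 0xFE, (byte) 0xFF, 0, '<'}));
        System.out.println(guessCharset("<content/>".getBytes(StandardCharsets.UTF_8)));
    }
}
```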
So what if I create a String containing XML, like so:
String s = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><content>Héllo Wørld</content>";
In effect, we have two character encodings at work here: UTF-8 as defined in the declaration, but the String itself is UTF-16, as we just discovered. Doesn’t that confuse an XML parser? Let’s see, by using SAX:
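// Print every chunk of character data the parser reports.
ContentHandler handler = new DefaultHandler() {
    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        System.out.println(new String(ch, start, length));
    }
};
// Note: XMLReaderFactory is deprecated since Java 9; SAXParserFactory is the usual replacement.
XMLReader xmlReader = XMLReaderFactory.createXMLReader();
xmlReader.setContentHandler(handler);
String s = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><content>Héllo Wørld</content>";
// A StringReader hands the parser characters, not bytes.
xmlReader.parse(new InputSource(new StringReader(s)));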
which prints the by now familiar “Héllo Wørld”. This is a nice trick, and it works because a Reader hands the parser characters that have already been decoded. According to the InputSource javadoc, when a character stream is available the parser reads that stream directly and disregards any encoding declaration found in it. This doesn’t mean that we can’t confuse it further, by replacing that last line with:
xmlReader.parse(new InputSource(new ByteArrayInputStream(s.getBytes())));
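Here, getBytes() encodes the String using the platform default charset again. Unless that default happens to be UTF-8, the bytes no longer match the encoding in the declaration, and the parser either fails or gives you garbage. So the XML parser is smart, but not brilliant. I, for one, don’t want to rely on this automagical encoding process, and I would recommend you handle raw XML as bytes, not Strings. It’s the XML parser’s job to turn the bytes into Strings, and it is probably a lot better at it than you are. Coincidentally, this is also the reason why, in Spring Web Services, the JMS transport defaults to using a BytesMessage, rather than TextMessage.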
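The following sketch feeds the same document to a SAX parser as bytes, once encoded as UTF-8 (matching the declaration) and once as ISO-8859-1 (contradicting it). The names XmlBytesDemo and parseText are made up for this example, and it uses SAXParserFactory rather than the XMLReaderFactory from the snippet above. I would expect the JDK’s bundled parser to reject the mismatched bytes as malformed UTF-8, but that is an assumption about one implementation, so the sketch simply returns null whenever parsing fails.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class XmlBytesDemo {
    // Unicode escapes keep this example independent of the source file encoding.
    static final String XML =
            "<?xml version=\"1.0\" encoding=\"UTF-8\"?><content>H\u00E9llo W\u00F8rld</content>";

    // Parses raw XML bytes and returns the character data, or null if the parser rejects the input.
    static String parseText(byte[] xmlBytes) {
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                // characters() may be called several times, so accumulate the chunks.
                text.append(ch, start, length);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new ByteArrayInputStream(xmlBytes), handler);
            return text.toString();
        } catch (SAXException | IOException | ParserConfigurationException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // Bytes agree with the declaration: the parser decodes them itself.
        System.out.println(parseText(XML.getBytes(StandardCharsets.UTF_8)));
        // Bytes contradict the declaration: expect null or mangled text.
        System.out.println(parseText(XML.getBytes(StandardCharsets.ISO_8859_1)));
    }
}
```

The point is that the bytes and the declaration travel together, and the parser is the only party that has to interpret them.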
Conclusion
There are two simple lessons here, which keep you out of encoding hell:
- Never rely on the default encoding when converting bytes to Strings.
- Handle raw XML as a series of bytes. Use a parser to turn those bytes into Strings.

Iwein said,
May 16, 2008 @ 13:04
> String s = “Hëllo Wørld”;
> xmlReader.parse(new InputSource(new StringReader(s)));
> which prints the all familiar “Héllo Wørld”.
Is that a typo or are we seeing more magic here?
Arjen Poutsma said,
May 16, 2008 @ 13:09
Typo eradicated.
James said,
May 17, 2008 @ 9:24
Hi
Nice article. Would you be interested in reposting this to JavaLobby - I think it would be useful there.
Contact me if you’re interested and we can organise it
James
Eric said,
May 19, 2008 @ 17:42
> which prints the all familiar “Héllo Wørld”. This is a nice trick, and I am not completely sure how it works.
If you change the code to create a byte[] for the XML string ‘s’, does it work like the first case? There may be some magic in the file encoding that the javac compiler sees: ‘Héllo Wørld’ is read by the compiler using some default encoding (e.g. other than UTF-8) and magically transformed to the correct UTF-16 representation.
From javac
-encoding encoding
Set the source file encoding name, such as EUCJIS/SJIS. If -encoding is not specified, the platform default converter is used.