
Faster and Cleaner Code since Java 7

23.4.2014 | 3 minutes of reading time

Every Java developer with more than a few months of coding experience has written code like this before:

try {
  "Hello World".getBytes("UTF-8");
} catch (UnsupportedEncodingException e) {
  // Every implementation of the Java platform is required to support UTF-8
  // Why the $!?% do I have to catch an exception which can never happen
}

What I realized just recently is that Java 7 already provided a fix for this ugly code, which not many people have adopted:

"Hello World".getBytes(StandardCharsets.UTF_8);

Yay! No Exception! But it is not only nicer, it is also faster! You will be surprised to see by how much!
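
To make the comparison concrete, here is a minimal sketch that puts both variants side by side and checks that they yield the same bytes. The class and method names (GetBytesComparison, encodeByName, encodeByConstant) are made up for this illustration and are not part of any library.

```java
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class GetBytesComparison {

    // Pre-Java 7 style: the charset is identified by name, so the checked
    // exception must be handled even though UTF-8 is always available.
    static byte[] encodeByName(String text) {
        try {
            return text.getBytes("UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new AssertionError("UTF-8 must be supported by every Java platform", e);
        }
    }

    // Java 7 style: the Charset object is passed directly, nothing to catch.
    static byte[] encodeByConstant(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String text = "Hello World";
        // Both variants encode to identical byte arrays.
        System.out.println(Arrays.equals(encodeByName(text), encodeByConstant(text)));
    }
}
```

The result is the same either way; only the way the charset is handed to String differs.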

Let us first look at the implementations for both getBytes() calls:

return StringCoding.encode(charset, value, 0, value.length);

Not exciting. Let us dig deeper:

static byte[] encode(String charsetName, char[] ca, int off, int len)
    throws UnsupportedEncodingException
{
    StringEncoder se = deref(encoder);
    String csn = (charsetName == null) ? "ISO-8859-1" : charsetName;
    if ((se == null) || !(csn.equals(se.requestedCharsetName())
                          || csn.equals(se.charsetName()))) {
        se = null;
        try {
            Charset cs = lookupCharset(csn);
            if (cs != null)
                se = new StringEncoder(cs, csn);
        } catch (IllegalCharsetNameException x) {}
        if (se == null)
            throw new UnsupportedEncodingException (csn);
        set(encoder, se);
    }
    return se.encode(ca, off, len);
}

and

static byte[] encode(Charset cs, char[] ca, int off, int len) {
  CharsetEncoder ce = cs.newEncoder();
  int en = scale(len, ce.maxBytesPerChar());
  byte[] ba = new byte[en];
  if (len == 0)
      return ba;
  boolean isTrusted = false;
  if (System.getSecurityManager() != null) {
      if (!(isTrusted = (cs.getClass().getClassLoader0() == null))) {
          ca =  Arrays.copyOfRange(ca, off, off + len);
          off = 0;
      }
  }
  ce.onMalformedInput(CodingErrorAction.REPLACE)
    .onUnmappableCharacter(CodingErrorAction.REPLACE)
    .reset();
  if (ce instanceof ArrayEncoder) {
      int blen = ((ArrayEncoder)ce).encode(ca, off, len, ba);
      return safeTrim(ba, blen, cs, isTrusted);
  } else {
      ByteBuffer bb = ByteBuffer.wrap(ba);
      CharBuffer cb = CharBuffer.wrap(ca, off, len);
      try {
          CoderResult cr = ce.encode(cb, bb, true);
          if (!cr.isUnderflow())
              cr.throwException();
          cr = ce.flush(bb);
          if (!cr.isUnderflow())
              cr.throwException();
      } catch (CharacterCodingException x) {
          throw new Error(x);
      }
      return safeTrim(ba, bb.position(), cs, isTrusted);
  }
}

Whoa. It looks like the one taking a Charset is more complicated, right? Wrong. The last line of encode(String charsetName, char[] ca, int off, int len) is se.encode(ca, off, len), and the source of that method looks mostly like the source of encode(Charset cs, char[] ca, int off, int len). Put very simply, everything else in encode(String charsetName, char[] ca, int off, int len) is just overhead.
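
Seen from the outside, that overhead boils down to turning a name into a Charset and rejecting names that cannot be resolved. The following sketch only uses public API to show this; the class and method names (CharsetNameResolution, resolvesToUtf8, isRejected) and the bogus charset name are invented for the example.

```java
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetNameResolution {

    // True if the given name resolves to the same charset as the constant.
    // Charset.forName throws an unchecked exception for unknown names.
    static boolean resolvesToUtf8(String name) {
        return Charset.forName(name).equals(StandardCharsets.UTF_8);
    }

    // True if getBytes(String) rejects the name at run time.
    static boolean isRejected(String name) {
        try {
            "Hello World".getBytes(name);
            return false;
        } catch (UnsupportedEncodingException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(resolvesToUtf8("UTF-8"));
        // "no-such-charset" is a made-up name used only to trigger the failure path.
        System.out.println(isRejected("no-such-charset"));
    }
}
```

With StandardCharsets.UTF_8 the charset is already resolved when the call is made, so neither the lookup nor the failure path is needed.
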
Worth noting is the line Charset cs = lookupCharset(csn); which eventually ends up in this code:

private static Charset lookup(String charsetName) {
  if (charsetName == null)
      throw new IllegalArgumentException("Null charset name");

  Object[] a;
  if ((a = cache1) != null && charsetName.equals(a[0]))
      return (Charset)a[1];
  // We expect most programs to use one Charset repeatedly.
  // We convey a hint to this effect to the VM by putting the
  // level 1 cache miss code in a separate method.
  return lookup2(charsetName);
}

private static Charset lookup2(String charsetName) {
  Object[] a;
  if ((a = cache2) != null && charsetName.equals(a[0])) {
      cache2 = cache1;
      cache1 = a;
      return (Charset)a[1];
  }

  Charset cs;
  if ((cs = standardProvider.charsetForName(charsetName)) != null ||
      (cs = lookupExtendedCharset(charsetName))           != null ||
      (cs = lookupViaProviders(charsetName))              != null)
  {
      cache(charsetName, cs);
      return cs;
  }

  /* Only need to check the name if we didn't find a charset for it */
  checkName(charsetName);
  return null;
}

Whoa again. That's quite impressive code. Also note the comment // We expect most programs to use one Charset repeatedly. Well, that's not exactly true: we need charsets precisely when we have more than one and have to convert between them. But yes, for most internal usage this will hold.
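
The two-slot caching idea from lookup() and lookup2() can be sketched in isolation. This is a simplified illustration, not the JDK code: the class TwoEntryCharsetCache and its slowLookups counter are hypothetical, Charset.forName stands in for the provider search, and the way a miss fills the slots is an assumption, because the cache() method is not shown in the excerpt.

```java
import java.nio.charset.Charset;

public class TwoEntryCharsetCache {

    // Each slot holds { name, charset }, like cache1 and cache2 in the excerpt.
    private Object[] cache1;
    private Object[] cache2;

    // Counts full resolutions; exists only to make the behavior observable.
    public int slowLookups;

    public Charset lookup(String name) {
        Object[] a;
        if ((a = cache1) != null && name.equals(a[0]))
            return (Charset) a[1];              // level 1 hit
        if ((a = cache2) != null && name.equals(a[0])) {
            cache2 = cache1;                    // level 2 hit: swap the slots
            cache1 = a;
            return (Charset) a[1];
        }
        slowLookups++;
        Charset cs = Charset.forName(name);     // stand-in for the provider search
        cache2 = cache1;                        // assumed policy: newest entry goes to slot 1
        cache1 = new Object[] { name, cs };
        return cs;
    }

    public static void main(String[] args) {
        TwoEntryCharsetCache cache = new TwoEntryCharsetCache();
        String[] names = { "UTF-8", "ISO-8859-1", "UTF-8", "ISO-8859-1" };
        for (String name : names)
            cache.lookup(name);
        // Two charsets used alternately fit into the two slots.
        System.out.println(cache.slowLookups);  // prints 2
    }
}
```

Under these assumptions, a program that alternates between two charsets stays in the cache, while one that cycles through three or more names falls through to the slow path on every call.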

Equipped with this knowledge, I can easily write a JMH benchmark that will nicely show the performance difference between these two String.getBytes() calls.
The benchmark can be found in this gist. On my machine it produces this result:

Benchmark                Mean      Mean error  Units
preJava7CharsetLookup    3956.537  144.562     ops/ms
postJava7CharsetLookup   7138.064  179.101     ops/ms

The whole result can be found in the gist, or better: obtain it by running the benchmark yourself.
But the numbers already speak for themselves: by using StandardCharsets, you not only avoid catching a pointless exception, you also get roughly 1.8 times the throughput (7138 vs. 3957 ops/ms) 🙂
