Every Java developer with more than a few months of coding experience has written code like this before:
```java
try {
    "Hello World".getBytes("UTF-8");
} catch (UnsupportedEncodingException e) {
    // Every implementation of the Java platform is required to support UTF-8
    // Why the $!?% do I have to catch an exception which can never happen
}
```
What I realized just recently is that Java 7 already provided a fix for this ugly code, which not many people have adopted:
```java
"Hello World".getBytes(StandardCharsets.UTF_8);
```
Yay! No Exception! But it is not only nicer, it is also faster! You will be surprised to see by how much!
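Both calls produce exactly the same bytes; only the ceremony differs. A quick check (class name is mine):

```java
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class CharsetEquivalence {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // Old style: charset passed by name, forces a checked exception
        byte[] byName = "Hello World".getBytes("UTF-8");
        // Java 7 style: charset passed as a constant, nothing to catch
        byte[] byCharset = "Hello World".getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(byName, byCharset)); // prints "true"
    }
}
```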
Let us first look at the implementations for both getBytes() calls:
```java
return StringCoding.encode(charset, value, 0, value.length);
```
Not exciting. We shall dig on:
```java
static byte[] encode(String charsetName, char[] ca, int off, int len)
    throws UnsupportedEncodingException
{
    StringEncoder se = deref(encoder);
    String csn = (charsetName == null) ? "ISO-8859-1" : charsetName;
    if ((se == null) || !(csn.equals(se.requestedCharsetName())
                          || csn.equals(se.charsetName()))) {
        se = null;
        try {
            Charset cs = lookupCharset(csn);
            if (cs != null)
                se = new StringEncoder(cs, csn);
        } catch (IllegalCharsetNameException x) {}
        if (se == null)
            throw new UnsupportedEncodingException(csn);
        set(encoder, se);
    }
    return se.encode(ca, off, len);
}
```
and
```java
static byte[] encode(Charset cs, char[] ca, int off, int len) {
    CharsetEncoder ce = cs.newEncoder();
    int en = scale(len, ce.maxBytesPerChar());
    byte[] ba = new byte[en];
    if (len == 0)
        return ba;
    boolean isTrusted = false;
    if (System.getSecurityManager() != null) {
        if (!(isTrusted = (cs.getClass().getClassLoader0() == null))) {
            ca = Arrays.copyOfRange(ca, off, off + len);
            off = 0;
        }
    }
    ce.onMalformedInput(CodingErrorAction.REPLACE)
      .onUnmappableCharacter(CodingErrorAction.REPLACE)
      .reset();
    if (ce instanceof ArrayEncoder) {
        int blen = ((ArrayEncoder)ce).encode(ca, off, len, ba);
        return safeTrim(ba, blen, cs, isTrusted);
    } else {
        ByteBuffer bb = ByteBuffer.wrap(ba);
        CharBuffer cb = CharBuffer.wrap(ca, off, len);
        try {
            CoderResult cr = ce.encode(cb, bb, true);
            if (!cr.isUnderflow())
                cr.throwException();
            cr = ce.flush(bb);
            if (!cr.isUnderflow())
                cr.throwException();
        } catch (CharacterCodingException x) {
            throw new Error(x);
        }
        return safeTrim(ba, bb.position(), cs, isTrusted);
    }
}
```
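As a side note, the two `CodingErrorAction.REPLACE` calls in `StringCoding.encode` explain why `getBytes()` can never report an encoding error: characters the target charset cannot represent are silently replaced. A small demo of mine:

```java
import java.nio.charset.StandardCharsets;

public class ReplaceDemo {
    public static void main(String[] args) {
        // '€' has no mapping in ISO-8859-1, so CodingErrorAction.REPLACE
        // substitutes the charset's replacement byte, which is '?'
        byte[] bytes = "€".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(bytes.length);    // 1
        System.out.println((char) bytes[0]); // ?
    }
}
```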
Wooha. Well, it looks like the one taking a `Charset` is more complicated, right? Wrong. The last line of `encode(String charsetName, char[] ca, int off, int len)` is `se.encode(ca, off, len)`, and the source of that looks mostly like the source of `encode(Charset cs, char[] ca, int off, int len)`. Very much simplified, this makes the whole extra code in the `String`-based `encode` basically just overhead.
Worth noting is the line `Charset cs = lookupCharset(csn);`, which in the end will do this:
```java
private static Charset lookup(String charsetName) {
    if (charsetName == null)
        throw new IllegalArgumentException("Null charset name");

    Object[] a;
    if ((a = cache1) != null && charsetName.equals(a[0]))
        return (Charset)a[1];
    // We expect most programs to use one Charset repeatedly.
    // We convey a hint to this effect to the VM by putting the
    // level 1 cache miss code in a separate method.
    return lookup2(charsetName);
}

private static Charset lookup2(String charsetName) {
    Object[] a;
    if ((a = cache2) != null && charsetName.equals(a[0])) {
        cache2 = cache1;
        cache1 = a;
        return (Charset)a[1];
    }

    Charset cs;
    if ((cs = standardProvider.charsetForName(charsetName)) != null ||
        (cs = lookupExtendedCharset(charsetName)) != null ||
        (cs = lookupViaProviders(charsetName)) != null)
    {
        cache(charsetName, cs);
        return cs;
    }

    /* Only need to check the name if we didn't find a charset for it */
    checkName(charsetName);
    return null;
}
```
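The effect of this cache is easy to observe from user code: repeated lookups of the same name return the same cached instance, and on OpenJDK it is even the same object that sits behind `StandardCharsets` (a small check of mine; the identity guarantees are an implementation detail, not part of the spec):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class LookupCacheDemo {
    public static void main(String[] args) {
        // Two lookups by name: the second is served from the level 1 cache
        Charset first  = Charset.forName("UTF-8");
        Charset second = Charset.forName("UTF-8");
        System.out.println(first == second);                 // true
        System.out.println(first == StandardCharsets.UTF_8); // true on OpenJDK
    }
}
```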
Wooha again. That's quite impressive code. Also note the comment `// We expect most programs to use one Charset repeatedly.` Well, that's not exactly true: we only need to pass charsets around when we deal with more than one and convert between them. But yes, for most internal usage it will hold.
Equipped with this knowledge, I can easily write a JMH benchmark that will nicely show the performance difference between these two `String.getBytes()` calls.
The benchmark can be found in this gist. On my machine it produces this result:
| Benchmark              | Mean     | Mean error | Units  |
|------------------------|----------|------------|--------|
| preJava7CharsetLookup  | 3956.537 | 144.562    | ops/ms |
| postJava7CharsetLookup | 7138.064 | 179.101    | ops/ms |
The whole result can be found in the gist, or better: obtained by running the benchmark yourself.
But the numbers already speak for themselves: by using `StandardCharsets`, you not only avoid catching a pointless exception, you also almost double the performance of the code 🙂
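The same trick applies to other APIs that accept a charset name as a `String`: most of them gained a `Charset` overload along the way (`StandardCharsets` itself arrived in Java 7). Two examples of mine:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class MoreCharsetOverloads {
    public static void main(String[] args) throws IOException {
        byte[] utf8 = "Hello World".getBytes(StandardCharsets.UTF_8);

        // new String(bytes, "UTF-8") -> Charset overload, no checked exception
        String decoded = new String(utf8, StandardCharsets.UTF_8);

        // new InputStreamReader(in, "UTF-8") -> Charset overload, same benefit
        InputStreamReader reader =
                new InputStreamReader(new ByteArrayInputStream(utf8), StandardCharsets.UTF_8);

        System.out.println(decoded);             // Hello World
        System.out.println((char) reader.read()); // H
    }
}
```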
Blog author: Fabian Lange