Overview
This post is in response to an interesting article http://puredanger.com/kablooie/2008/03/14/hacking-on-javalangstring/
The post talks about compressing long Strings in memory and dynamically decompressing them. Some downsides with the previous solution is that its requires an instrumentation framework and knowledge of byte code manipulation.
This is an alternative solution which does not use instrumentation and would work with any version of Java.
Don't do this in production unless you have tested this carefully! And even then.... |
A self compressing String.
The approach taken is to create a custom version of the String class and add it to a jar under the lib/endorsed directory. This results in the class taking preference over the built in implementation.
The custom String class stores the text as either a char[] or a byte[]. If its a byte[] this is decompressed the char[] as required. Note: this will mean some operations will perform poorly, however these cases could be optimised with time.
String.java - The custom implementation.
mystring.jar - Compiled for Java 5+ as a jar.
StringTest.java - A unit test which outputs the following
String of length=38890 uses 18527 bytes.
Took 25 ms to iterate over characters.
This would be faster once the VM has warmed up but it would still be slower than the standard, uncompressed version.
Another way to use compression would be to add a constructor which creates a compressed string while the standard one create an uncompressed string. This would leave it up tot he developer to chose when a String should be compressed.
Note: By putting the jar in lib/endorsed, every java program is using it, even javac...
The code
This method allows the calling code to get char[] as required. The decompressed version is cached as methods like charAt() tend to be called many times.
private char[] getValueAsChar() { if (value instanceof char[]) return (char[]) value; if (value instanceof CachedByteArray) { CachedByteArray value = (CachedByteArray) this.value; Reference<char[]> valueRef = value.chars; if (valueRef != null) { char[] chars = valueRef.get(); if (chars != null) return chars; } Reader isr = new InputStreamReader(new InflaterInputStream(new ByteArrayInputStream(value.bytes))); char[] buffer = new char[offset + count]; try { isr.read(buffer); } catch (IOException e) { throw new AssertionError(e); } value.chars = new WeakReference<char[]>(buffer); return buffer; } throw new AssertionError("Unable to decode " + value); }
And the following method compresses the String as required.
private static CachedByteArray compress(char[] value, int offset, int count) { ByteArrayOutputStream baos = new ByteArrayOutputStream(count); Writer osw = new OutputStreamWriter(new DeflaterOutputStream(baos)); try { osw.write(value, offset, count); osw.close(); } catch (IOException e) { throw new AssertionError(e); } byte[] bytes = baos.toByteArray(); return new CachedByteArray(bytes); }
Another use for overriding classes in the JDK.
I recently had a problem where a Buffer was being randomly clobbered it what appears to be a race condition between threads. None of the usual approaches worked an the debugger could only confirm that occasionally the Buffer was being written to by two threads.
To determine where this was happening I change the following methods
public abstract class Buffer { // fields deleted. private Thread thread; private Throwable flippedHere; public final Buffer flip() { thread = Thread.currentThread(); flippedHere = new Throwable("flipped here"); // code deleted. } public final int remaining() { if (thread != Thread.currentThread()) { throw new IllegalStateException("Accessed by a different thread " + thread + " flipped here", flippedHere); } return limit - position; } }
The assumption here is that remaining should only be called by the thread which last called flip(). This found the problem very quickly.
A thought experiment
Which class would you need to customise to change the behaviour of toString() for arrays to get the following?
byte[] bytes = { 1, 1, 2, 3, 5, 8, 13 }; System.out.println(bytes);
to print
[1, 1, 2, 3, 5, 8, 13]
As a hint here is some sample code.
public String toString() { final Class<? extends Object> thisClass = getClass(); if (thisClass.isArray()) { if (thisClass == boolean[].class) return Arrays.toString((boolean[]) this); if (thisClass == byte[].class) return Arrays.toString((byte[]) this); if (thisClass == char[].class) return Arrays.toString((char[]) this); if (thisClass == short[].class) return Arrays.toString((short[]) this); if (thisClass == int[].class) return Arrays.toString((int[]) this); if (thisClass == float[].class) return Arrays.toString((float[]) this); if (thisClass == double[].class) return Arrays.toString((double[]) this); return Arrays.toString((Object[]) this); } return thisClass.getName() + "@" + Integer.toHexString(hashCode()); }
Actually there are three classes you could customise, but only one would change the behaviour of toString().
Add Comment