Compressing java.lang.String

Overview

This post is in response to an interesting article http://puredanger.com/kablooie/2008/03/14/hacking-on-javalangstring/

The post talks about compressing long Strings in memory and dynamically decompressing them. Some downsides with the previous solution is that its requires an instrumentation framework and knowledge of byte code manipulation.

This is an alternative solution which does not use instrumentation and would work with any version of Java.

Don't do this in production unless you have tested this carefully! And even then....
Additionally, overriding a class in rt.jar might contravene the Java 2 Runtime Environment binary code license. See the note for -Xbootclasspath in java command line options

A self compressing String.

The approach taken is to create a custom version of the String class and add it to a jar under the lib/endorsed directory. This results in the class taking preference over the built in implementation.

The custom String class stores the text as either a char[] or a byte[]. If its a byte[] this is decompressed the char[] as required. Note: this will mean some operations will perform poorly, however these cases could be optimised with time.

String.java - The custom implementation.
mystring.jar - Compiled for Java 5+ as a jar.
StringTest.java - A unit test which outputs the following

String of length=38890 uses 18527 bytes.
Took 25 ms to iterate over characters.

This would be faster once the VM has warmed up but it would still be slower than the standard, uncompressed version.
Another way to use compression would be to add a constructor which creates a compressed string while the standard one create an uncompressed string. This would leave it up tot he developer to chose when a String should be compressed.
Note: By putting the jar in lib/endorsed, every java program is using it, even javac...

The code

This method allows the calling code to get char[] as required. The decompressed version is cached as methods like charAt() tend to be called many times.

private char[] getValueAsChar() {
    if (value instanceof char[])
        return (char[]) value;
    if (value instanceof CachedByteArray) {
        CachedByteArray value = (CachedByteArray) this.value;
        Reference<char[]> valueRef = value.chars;
        if (valueRef != null) {
            char[] chars = valueRef.get();
            if (chars != null)
                return chars;
        }
        Reader isr = new InputStreamReader(new InflaterInputStream(new ByteArrayInputStream(value.bytes)));
        char[] buffer = new char[offset + count];
        try {
            isr.read(buffer);
        } catch (IOException e) {
            throw new AssertionError(e);
        }
        value.chars = new WeakReference<char[]>(buffer);
        return buffer;
    }
    throw new AssertionError("Unable to decode " + value);
}

And the following method compresses the String as required.

private static CachedByteArray compress(char[] value, int offset, int count) {
    ByteArrayOutputStream baos = new ByteArrayOutputStream(count);
    Writer osw = new OutputStreamWriter(new DeflaterOutputStream(baos));
    try {
        osw.write(value, offset, count);
        osw.close();
    } catch (IOException e) {
        throw new AssertionError(e);
    }
    byte[] bytes = baos.toByteArray();
    return new CachedByteArray(bytes);
}

A thought experiment

Which class would you need to customise to change the behaviour of toString() for arrays to get the following?

byte[] bytes = { 1, 1, 2, 3, 5, 8, 13 };
System.out.println(bytes);

to print

[1, 1, 2, 3, 5, 8, 13]

As a hint here is some sample code.

public String toString() {
    final Class<? extends Object> thisClass = getClass();
    if (thisClass.isArray()) {
        if (thisClass == boolean[].class) return Arrays.toString((boolean[]) this);
        if (thisClass == byte[].class) return Arrays.toString((byte[]) this);
        if (thisClass == char[].class) return Arrays.toString((char[]) this);
        if (thisClass == short[].class) return Arrays.toString((short[]) this);
        if (thisClass == int[].class) return Arrays.toString((int[]) this);
        if (thisClass == float[].class) return Arrays.toString((float[]) this);
        if (thisClass == double[].class) return Arrays.toString((double[]) this);
        return Arrays.toString((Object[]) this);
    }
    return thisClass.getName() + "@" + Integer.toHexString(hashCode());
}

Actually there are three classes you could customise, but only one would change the behaviour of toString().

Enter labels to add to this page:
Please wait 
Looking for a label? Just start typing.