SSJX.CO.UK

Performance Optimisation in D - Squish File Compressor

Introduction

Squish is a Windows command line program that I created years (decades?) ago to shrink text and image files for the Cybiko. It was not a great method, but it could reduce some files by a third; ultimately it was too slow to be much use. It worked by using bit codes to represent bytes: if the file had a small number of different bytes, the compression worked pretty well.

E.g. Hello World!

Letter   Occurrences   Bits
l        3             01
o        2             001
(space)  1             101
!        1             0001
H        1             1001
W        1             1101
d        1             00001
e        1             10001
r        1             11001

Compressed bit stream:
10011000 10101001 10111010 01110010 10000100 01000000

In this example, the number of bytes required goes from 12 to 6! The downside is that as the number of different characters increases, the compression gets worse; it can end up taking 4 bytes to represent 1 byte!

This method had the advantage of being easy to understand and not too hard to implement.
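To make the scheme concrete, here is a minimal sketch of such a bit-code encoder in D. This is a fresh illustration, not the original Squish code; the `encode` function and the table layout are invented for this example, using the "Hello World!" codes from the table above.

```d
import std.stdio;

// Pack each input byte's bit code (a string of '0'/'1') into a byte stream.
ubyte[] encode(const(char)[] input, string[char] codes)
{
    ubyte[] output;
    ubyte cur = 0;
    int used = 0; // bits filled in the current byte

    foreach (ch; input)
    {
        foreach (bit; codes[ch])
        {
            if (bit == '1')
                cur |= cast(ubyte)(0x80 >> used);
            used++;
            if (used == 8)
            {
                output ~= cur;
                cur = 0;
                used = 0;
            }
        }
    }
    if (used > 0)
        output ~= cur; // flush the final, zero-padded byte
    return output;
}

void main()
{
    string[char] codes = [
        'l': "01",    'o': "001",   ' ': "101",
        '!': "0001",  'H': "1001",  'W': "1101",
        'd': "00001", 'e': "10001", 'r': "11001"
    ];
    auto packed = encode("Hello World!", codes);
    writeln(packed.length, " bytes"); // 12 input bytes -> 6 output bytes
}
```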

Note: If you are looking for a compression method to use, there are many much better ones available!

Porting to the D Programming Language

Looking through my old C programs, I thought this would be an interesting one to port to D. It would gain some memory safety and would give me the chance to improve it.

The port was fairly straightforward; the other guides cover most of what was done. Loops were changed to foreach, dynamic arrays were used, etc...

After this, the program worked, but on my test text file it seemed slow at decompressing. Compressing was fine, but decompressing was noticeably slow.

Benchmarking

To get a rough idea of the time taken, we can use StopWatch from std.datetime:

import std.datetime.stopwatch;
import std.stdio;
...
StopWatch sw;
sw.start();

go_decompress(output,message);

long msecs = sw.peek.total!"msecs";
writeln("Elapsed time: ",msecs,"ms");

This showed that compression took around 12ms, which seemed fine, but decompression took over 300ms! This was terrible! My computer is not the fastest (an old Pentium Dual-Core), which may be why I noticed this in the first place. The downside of faster PCs is that they often mask inefficiency. On an i5, I doubt this problem would even have been noticed...

The Original Decompression Function

This is pretty much the same method as the C version. A block of bits gets built up and compared against a structure containing bit codes; if there is a match, a value is returned.

bool decompress(ubyte[] input,ref ubyte[] output){
	int l;     // number of bits collected so far
	uint code; // the bits themselves, packed from the MSB down
	
	foreach(ubyte ch;input){
		ubyte va=cast(ubyte)0x80;
		foreach(i;0..8){
			if ((ch & va)==va ){
				code+=(0x80000000>>l);
			}
			va=(va & 0xff)>>1;
			
			l++;
			if (l>1){
				// Linear search of the code table, once per bit!
				auto outcode=codematch(code);
				if (outcode>=0){
					code=0;
					l=0;
					output~=node[outcode].value;
				}
			}
		}
	}
	return true;
}

It's simple but involves a lot of looping. The C version saved directly to a file, but this version outputs to an array and then saves the array.
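codematch() itself is not shown above. A plausible reconstruction (hypothetical; the Node layout and the explicit table parameter are my own, the original uses a global) is a linear scan of the code table, and since it runs once for every input bit, decompression ends up costing roughly bits-in-file times table-size comparisons:

```d
// Hypothetical reconstruction of codematch(): a linear scan, run per bit.
struct Node { uint code; ubyte value; }

int codematch(const Node[] node, uint code)
{
    foreach (i, n; node)
        if (n.code == code)
            return cast(int)i;
    return -1; // the current bit pattern matches no code yet
}

void main()
{
    auto node = [Node(0x40000000, cast(ubyte)'l'),
                 Node(0x20000000, cast(ubyte)'o')];
    assert(codematch(node, 0x20000000) == 1);
    assert(codematch(node, 0x10000000) == -1);
}
```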

Optimising!

This involved a bit of trial and error to find the actual issue, so in order of things tried:

1. Pre-sizing the output array (growing it in 1024-byte chunks when needed) instead of appending one byte at a time.
2. Replacing the bit shift on every iteration with a small precomputed array of bit masks.
3. Replacing the linear codematch() search with an associative array lookup, built once before the main loop.
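The cost of growing an array one append at a time can be seen with a quick micro-benchmark. This is my own illustrative example (function names invented, timings will vary by machine), not code from Squish:

```d
import std.datetime.stopwatch;
import std.stdio;

// Grow the array one append at a time: may reallocate repeatedly.
ubyte[] byAppend(size_t n)
{
    ubyte[] a;
    foreach (i; 0 .. n)
        a ~= cast(ubyte)i;
    return a;
}

// Pre-size the array once, then fill it in place.
ubyte[] preSized(size_t n)
{
    ubyte[] b;
    b.length = n; // one allocation up front
    foreach (i; 0 .. n)
        b[i] = cast(ubyte)i;
    return b;
}

void main()
{
    enum N = 1_000_000;

    auto sw = StopWatch(AutoStart.yes);
    auto a = byAppend(N);
    auto t1 = sw.peek.total!"msecs";

    sw.reset();
    auto b = preSized(N);
    auto t2 = sw.peek.total!"msecs";

    assert(a == b); // same result, different cost
    writeln("append: ", t1, "ms  pre-sized: ", t2, "ms");
}
```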

Final Version

This is the current version of the decompression code with all of the above changes.

bool decompress(ubyte[] input,ref ubyte[] output){
	int l;
	uint code;
	auto size=0;
	
	// Assuming that output is at least as big
	output.length=input.length;
	
	// Saves doing lots of shifts
	ubyte[8] va=[0x80,0x40,0x20,0x10,0x08,0x04,0x02,0x01];
	
	// Make our Associative Array
	ubyte[int] map;
	foreach(n;node){
		map[n.code]=n.value;
	}
	
	foreach(ubyte ch;input){
		foreach(i;0..8){
			if ((ch & va[i])==va[i] ){
				code+=(0x80000000>>l);
			}
			
			l++;
			
			// See if the bit code exists, then get the value			
			if (code in map){
				output[size]=map[code];
				code=0;
				l=0;
				size++;
				
				// Reached the current limit of array so increase!
				if (size==output.length){
					output.length+=1024;
				}
			}
		}
	}
	output.length=size;
	return true;
}
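One further tweak worth knowing (not applied above): `in` on a D associative array returns a pointer to the stored value, or null if the key is absent, so the `code in map` test and the `map[code]` fetch could be merged into a single lookup. A small standalone sketch, with illustrative table entries:

```d
import std.stdio;

void main()
{
    // Same shape of table as in decompress(); entries are made up.
    ubyte[int] map = [0x40000000: cast(ubyte)'l', 0x20000000: cast(ubyte)'o'];
    uint code = 0x40000000;

    // "in" yields a pointer (null if absent), so one hash lookup
    // replaces the lookup-then-index pair used in the loop above.
    if (auto p = code in map)
        writeln(cast(char)*p); // prints "l"
}
```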

Conclusions

To improve your D programs, a good starting point is using StopWatch and checking the Garbage Collector profile log. Not reinventing the wheel is normally a good call too; generally a built-in or standard library function is faster, better and/or easier to use.
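For the GC profile log mentioned above, the reference DMD compiler has a flag for it (file names here are hypothetical):

```shell
# -profile=gc makes the program write profilegc.log on exit,
# listing bytes allocated per call site (DMD flag).
dmd -profile=gc squish.d
./squish mytest.txt
cat profilegc.log
```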

In this example, although the speed-up was large, this program is not critical and waiting 0.3 seconds instead of 0.01 seconds is not the end of the world. However, if this was a key program that ran often, then the improvements could be the difference between needing a single server or 30! For the benefit of your users and the planet, it is worth running some benchmarks on your code!

Created 11/02/24