Porting the Postcard Sized Path Tracer to the D Programming Language

Introduction

This page details my porting of the Postcard Sized Path Tracer by Andrew Kensler to the D Programming Language. This started as 2037 byte obfuscated C++ program that was on the back of a Pixar recruitment flyer. Fabien Sanglard explains all about the C++ program and the differences between a path tracer and a ray tracer in the link below!

The main reason for doing this is that it would be cool if I could actually get it to work and it would be a good learning experience!

Update 15/02/2024: This program has been recompiled using the LDC D Compiler which is LLVM based and produces even faster programs! All the times haves been updated accordingly and are around 5 times faster than before!

This took 75,929ms to render...
(320 x 200 Sample count 16)

Download

To keep the download size small, this is provided as a 7zip archive. Archive includes the compiled exe and source.

Download Path Tracer (195KB 7Zip)
Updated (15/02/2024)

⚠️ If you get a message about missing msvcr120.dll or vcruntime140.dll (or similar), you need either the Visual C++ Runtime 2013 or Visual C++ Runtime 2015-2022 from Microsoft installed. You can get them from here. The programs on this page are all 32bit so you will need that runtime even if you are running 64bit.

Acknowledgements

As mentioned above, the original Postcard Path Tracer was created by Andrew Kensler. While making this, I made use of the work / websites by the following people:

Deciphering the Postcard Sized PathTracer by Fabien Sanglard
Fabien unobfuscates the the code and explains how it all works. This is where I got the C++ version from upon which my D version is based. Definitely worth a read!

The following sites were also used as a reference when trying to get the program to work:

Is C# a low-level language? by Matt Warren
This site shows Matt converting the path tracer to C# and getting it to run almost as fast as the C++ version!
From 46s to 5s - Optimizing a 350 Line Raytracer in Rust by Carl Fredrik Samson
Carl does a Rust port of the C# version then optimises the memory usage and speed. At the end, he uses the Rayon library to add parallelism to his version which results in a big speed increase. I steal the parallelism idea for my D version, details further down!

Thanks to all the people above!

Porting to the D Programming Language

The following is a list of some the changes made:

Simple Stuff

These are examples of simple things that were changed to be more D like and/or are better practice.
- Defines to Constants
  For example, from #define HIT_NONE 0 to const int HIT_NONE=0;.
- Function variables passed by Ref
  Such as QueryDatabase(Vec position, int &hitType) to QueryDatabase(Vec position, ref int hitType).
- Removing the min() function
  Why reinvent the wheel? There is a min() function in std.algorithm, made sense to use that.
- Float max
  From float distance = 1e9; to float distance=float.max;. No real benefit besides readability.
- Maths functions
  A lot of the maths functions lost the 'f' as the ones in D could handle floating point numbers. E.g. sqrtf to sqrt, cosf to cos.
- Loops
  Where possible, foreach (and foreach_reverse) loops are used.

Struct Operator Overloads

To enable us to add / multiply our 3D point values, we needed to create some overloads for the main Vec struct. These are a little different looking from their C++ equivalents E.g.

struct Vec {
    float x, y, z;
	...
	// To add two Vec together, allows the following:
	// Vec(1,2,3) + Vec(10,11,12) = Vec(11,13,15)
	Vec opBinary(string s)(Vec r) if (s == "+"){
		return Vec(x + r.x, y + r.y, z + r.z);
	}
	...
}

Inverse Square Root Function

The C++ version had a struct overload for the inverse square root using the '!' operator. I could not get a D version to work so eventually gave up and created a standard function for this. Simple and does the job:
```
Vec inverse(Vec r){
	float sq=1 / sqrt( r%r);
	return r*sq;
}
```
Arrays Moved Out Of Functions

The C++ version had two arrays (letters / curves) in the QueryDatabase() function. These don't change so were put outside the function. Not sure if these would be collected or just take up memory during each call so were taken out as a precaution!
Save As A BMP file

Removed the PPM saving code and copied and pasted the bitmap saving module from Anim2Bmp, also on this site. This needed a block of memory allocated which was useful for the parallel optimisation below!

Initial Benchmark

To get a rough idea of the time taken, we can use the StopWatch:

import std.datetime.stopwatch;
..
StopWatch sw;
sw.start();

// Rendering happens here...

long msecs = sw.peek.total!"msecs";
writeln("Elapsed time: ",msecs,"ms");

The settings used to test are a resolution of 320x200 and a sample count of 2, the program was compiled with ldc2 -O -i pixar.d

Non-Parallel Version : 18,248ms (LDC)
Non-Parallel Version : 93,405ms (DMD)

Parallelism

As mentioned above, the Rust version made use of parallelism to get a large boost in speed. I was able to use a similar method in D to get the same effect! The idea is to structure the rendering so that it can be run in a parallel foreach loop, this involve pre calculating where the pixels go.

struct mypoint{
	int x;
	int y;
	int pos;
}

mypoint[] points=new mypoint[w*h];

// Pre calc the locations of all the points
auto c=0;
foreach_reverse(y;0..h){
	auto px=y*(w*3);
	foreach_reverse(x;0..w){
		points[c].x=x;
		points[c].y=y;
		points[c].pos=px;
		px+=3;
		c++;
	}
}


// Parellel render the points
foreach(pt;parallel(points)){
	auto x=pt.x;
	auto y=pt.y;
	auto px=pt.pos;
	
	Vec color=0;
	foreach(p;0..samplesCount){
		color = color + Trace(position, inverse(goal + left * (x - w / 2 + randomVal()) + up * (y - h / 2 + randomVal())));
	}
	// Reinhard tone mapping
	color = color * (1.0f / samplesCount) + 14.0f / 241;
	Vec o = color + 1;
	color = Vec(color.x / o.x, color.y / o.y, color.z / o.z) * 255;

	// Write our bitmap
	output[px]=cast(ubyte)color.z;
	output[px+1]=cast(ubyte)color.y;
	output[px+2]=cast(ubyte)color.x;
}

The code within the parallel loop is self contained and not dependant on previous loops so can be done across multiple cores without any problems. The order the work is done in does not matter, it is not a problem if point #2 is finished before point #1 for example.

Final Benchmark

With the above changes and the same settings, the new time was:

Parallel Version : 9,328ms (LDC)
Parallel Version : 47,502ms (DMD)

Given I have two cores, it looks like the load was split fairly evenly between them as it did finish in almost half the time!

Conclusions

The aim was to make something cool and learn along the way, both missions accomplished! I have no doubt that this program can be improved upon and made even faster but I am pleased with the result and impressed by how easy it was to get the program using multiple cores!

Updated 15/02/24
Created 14/02/24