Cluster Computing Guide

FracTest supports parallel processing, which allows it to make use of multiple processors in one computer. More than this, though, it can distribute its computing workload around multiple computers in a network, which can let you render fractals much faster. This section describes how it works, and how to set it up.

Cluster computing is very handy when used in batch mode, though there is no specific connection between the two features — they can be used together or separately.

The FracTest Architecture

FracTest is divided into two components — the controller, and one or more servers. The controller operates the user interface, and manages a fractal computation. As described in the concepts page, it does this by dividing a fractal view into tiles, which are placed on a queue to be computed. The servers request tiles from the queue, compute their values, and submit the results back. The controller composes the results together into the final fractal image.

In more detail:

Each server has a number of logical CPUs (different from the number of physical CPUs, due to hyperthreading), and a number of worker threads, which is the same unless you override this. When running, each thread will compute one tile at a time.
Each server also has a queue of tiles awaiting execution (i.e. not currently running). This is used to make sure that as soon as a tile finishes, the next one is already on the server waiting to run; we don't have to wait for it to come over the network.
When a computation is started, all tiles to be computed are added to the master queue in the controller. This queue is initially sorted by screen position. The controller tells all the servers that we're running.
Once notified that we're running, the servers will request work items from the controller to occupy all of their threads, plus more to prime their queues. Each server will try to keep enough work in its queue to keep it busy for 20 seconds, but generally not fewer than 2 tiles.
When a server has less than 20 seconds' worth of work in its queue, it will request work from the controller; as long as the controller's queue isn't empty, it will pass back the next set of tiles waiting to be executed. The server will add these tiles to its queue, and execute them when ready.
When a tile executes on the server, each pixel is computed, with anti-aliasing if enabled, and coloured using the palette.
When a tile completes execution, the results (the pixel values for that tile) are sent back over the network; the controller paints them into the generated image.
When a tile completes a progressive rendering pass, but hasn't completed all passes, the controller will put it back on the queue.
In progressive mode, after the first pass, we have a rough idea of how long each tile will take to compute, and tiles are sorted slowest-first. All tiles of a lower pass will be sent to servers before tiles of a higher pass.
When all tiles have completed all passes, the controller finishes the computation and tells all the servers that we're no longer running.

A server can only be connected to one controller at a time. If you want to have a machine's processors shared between multiple FracTest front-ends, start multiple servers on the machine (you can do this by running them on different port numbers).

Note that this architecture does not depend on the application mode — there is always at least one server, and all work goes through the communications link between the controller and server. When running in the normal interactive mode, FracTest automatically starts an internal server, and uses that for all processing. To make use of network computing, you can start additional servers on other machines, and integrate them into FracTest. You can also suppress the setup of the internal server.

When a server is running locally, the network stack will automatically bypass as much of the network system as possible, making communications efficient. When using the internal server, communications can use shared memory, which is even more efficient. Hence, this networked architecture does not cost much in terms of efficiency, specially when computing the deep, processor-intensive images for which FracTest is optimised. Also, the internal server is started on a unique port number from a range reserved for internal servers; so you can have multiple standalone FracTest instances running at once.

Setting Up a Cluster

Setting up a compute network will require use of FracTest's command-line features; so access to a command prompt, terminal emulator, etc., is required. Also, automation will be very handy; so the ability to create very simple (one- or two-line) shell scripts / batch files will be useful.

Note that you will need to install FracTest on all the systems in the cluster. All of the machines will need to be running the same version of FracTest (it will check this on startup).

The basic steps to set up a cluster are as follows.

Start the Servers

On each of the remote machines, start a FracTest server. You will need to install FracTest in the usual way, then run it in server mode by entering this command:

java -jar FracTest.jar --server

FracTest will display the server UI. This is very different to the normal FracTest UI, and is shown at right (click to expand). This panel displays the current performance and status information for the server; when you first start it, it will show <no client>. This indicates that it is waiting for a connection.

If you need to start the server on a different port number, use the --port argument. For example, to run on port 1234:

java -jar FracTest.jar --server --port 1234

Set Up the Controller

Sample cluster management user interface (click to expand).

Having started servers on all the machines you want to use as servers, you need to start the controller. The simplest way is to start FracTest as normal (bringing up the main application window), then use the menu command "Work" → "Manage Cluster" to bring up the cluster management user interface. This panel displays all known servers; when you first start it, it will show one server called localhost. This is the built-in internal server, which performs all the computation when you use FracTest in its normal standalone mode.

An example of the UI is shown at right. This illustrates an app with its own built-in server localhost ready to run. The left-hand panel shows the overall status and statistics for the cluster as a whole.

You can use the cluster management UI to add all the servers you started earlier. Simply enter the name of each server in the text field and click "Add". As you do, its panel should be added to the UI. Each server's panel in the cluster management UI basically mirrors the panel in the server's own UI.

The computer name you enter should be whatever name the computer is known by on your network. This can be an IP address, but bear in mind that IP addresses can change when the network is re-configured.

You can append a port number to the name using a colon, if you have started the server on a non-standard port number; for example, Bill:1234. This can be handy if you want to set up multiple servers on one machine, e.g. for debugging, as only one server can be running on a particular port number on a given machine.

If you start a given cluster configuration a lot, entering the server names via the UI each time will get tedious. Instead, you can enter them on the command line, and have FracTest start up with those servers connected (as long as they are already running). The command to do start FracTest communicating with servers on machines Bill and Ben is:

java -jar FracTest.jar --servers Bill,Ben

Note that in this case, FracTest will still have its internal server running; so you would actually have three machines in the cluster, one of them being the controller itself.

If this makes the controller too laggy, you can tell FracTest not to start the internal server by specifying --no-local; as in:

java -jar FracTest.jar --servers Bill,Ben --no-local

Even if you do want a server on the controller, you might find it easier to control the cluster if the local server is running as a separate process, rather than the internal server. This will make it easier to stop and start it, should you need to. In that case, start up a server on the local machine in the same way as on the remote machines; then start the controller with:

java -jar FracTest.jar --servers localhost,Bill,Ben --no-local

Once you have the command set up, saving it in a shell script, BAT file, or similar, will obviously make life easier.

Running a Cluster

Once you have configured a cluster, just start rendering a view — you should see all the servers come to life. Each server's UI, and its shadow in the cluster management UI, will turn blue to indicate that it is running, and you will start to see performance statistics being displayed.

The image below illustrates the cluster management window in an app which is running a computation with three servers connected: the app's own built-in server localhost, which has been taken offline; and two networked servers, Fire and Ice, both running. The latter each have 32 logical CPUs.

In each server's panel, the following information appears:

The name by which the server was added to the cluster. Note that a computer can have many names, and also IP addresses; the name we display here is the name you specified to add it to the cluster.
The number of logical CPUs on the server, and the number of work threads we have running on it. Normally we start one thread per logical CPU; command-line options exist to allow you to override this.
The current performance stats ("Now"). 4 lines of data, as described below.
The heartbeat monitor; this animates to show you that the server is reporting in. If it goes amber, or worse red, we haven't heard from that server in a while.
The overall performance stats ("Overall"). 6 lines of data, as described below.
The "Remove" button; clicking this removes the server from the cluster. Any outstanding work will be put back on the queue to be executed on other servers.

The two areas of performance information are similar. The top one shows the performance stats for the last 4 seconds; this gives you an idea of what the server is doing right now. The lower one shows statistics averaged over the whole computation. The specific stats are:

Statistic	"Now"	"Overall"
The elapsed wall-clock time, and the compute time spent. Note that the compute time is totalled across all CPUs. Hence, if the system is perfectly efficient, the compute time should be equal to the wallclock time multiplied by the number of CPUs.	N / A	over the whole computation
The number of tiles completed, and the average time each tile took to complete.	in this 4-second interval	over the whole computation
The compute performance in megaflops or gigaflops. This counts the number of logical floating-point operations computed per second across all CPUs, so it shows the performance of the machine as a whole.	averaged over this interval	averaged over the whole computation
The computing efficiency of the server. This is the ratio of the time spent actually computing, to wall-clock time. A high number (well over 90%) is good; a low number indicates that servers are waiting for network data a lot of the time.	averaged over this interval	averaged over the whole computation
The number of threads running on the server.	N / A	averaged over the whole computation
The number of tiles requested from the controller (`Rq`); queued on the server (`Q`); and the number actually running (`R`).	at the time of the stats report	averaged over the whole computation

The panel on the left displays the status and totals for the whole cluster.

Some things to note:

The tile time is the average time one tile takes to compute on one processor. If tiles are taking an hour each, and there are 64 processors, expect to see a tile complete every minute or so.
The "Now" interval is quite short (currently 4 seconds) because it's nice to glance at the status panel and see that things are running, without having to wait too long. However, when tile times get long (an hour plus), this means that the numbers will often be zero, because no tile reported stats in that interval (even though tiles report quite often, not just at the end). Consequently, you will sometimes see several tiles finish in one interval, resulting in very high performance numbers, and efficiency scores above 100%. It's just a trade-off between responsiveness and accuracy.
The FLOPS scores are based on logical floating-point operations; i.e. software operations. A thigh precisions, each software operation takes many machine instructions; so don't be surprised if the numbers are very low. These are provided basically for comparison, not because they have any intrinsic meaning.

Configuration Changes

If a server fails, it will attempt to notify the controller, which will display an error message in the cluster management UI.

Of course if it's host system fails completely, it will be unable to do this, and will simply vanish. In that case, the controller will eventually show its heartbeat indicator red, to show that it hasn't been heard from in a while.

At present, that's it — the server will still be in the cluster, but failed. To take it out of the cluster, you need to remove it by clicking its "Remove" button. (In the future, this may be automated.)

You can also remove a server manually at any time, for example if you need to take it down for maintenance.

When a server is removed, all tiles which were queued or running on it are put back on the controller's work queue. They will then be sent to the remaining servers for execution. Thus, no work is lost. The server remains in the cluster management UI, but greyed out; this means that the server is gone, but remembered. This means that to add it back in (presumably after it is repaired), you just need to click it's "Add" button, rather than typing its name in again. (Of course, you will need to have a FracTest server running on the machine before you can add it.)

If one server has failed, and all the others have finished all available work, the computation will be stalled, because of the unfinished tiles on the failed server. If you then remove the failed server, its work will be sent to the other servers, and when they complete it, the computation will be done.

All of this means that the cluster is pretty resilient to server failures. You can easily take a server — or even all of them — down for maintenance, re-add them to the cluster, and have work picked up where it left off. FracTest doesn't care which servers are involved, so you can migrate a running computation to a completely different set of hardware, as long as it's running the same FracTest version.

The one point of weakness is the controller — if it fails, the current computation is lost. The safeguard against this is the checkpointing feature. With checkpointing enabled, in the event that the controller crashes, you can retrieve the most recent complete checkpoint and re-start from there.

Configuration Considerations

There are a few issues related to setting up a network which may be worth considering.

Network Type

FracTest is designed to work over a local network, such as you would typically find within a single home, office, etc. Specifically, FracTest is optimised for these situations:

every computer can see every other, with a specific IP address (whether you use a host name or not)
latency between computers is low — actually, this is managed to some extent by the local work queue on each server
bandwidth is reasonably high — anything in megabits is reasonable, although fractals with shorter render times will require more; conversely, very deep, slow views require very little bandwidth
the connection is reliable

FracTest does not try to be a widely-distributed collaborative computing system like BOINCOpen-source software for volunteer computing
An open framework for widely-distributed collaborative computing projects (BOINC)
https://boinc.berkeley.edu/. If you try to network systems which are in different locations, you will typically run into the problem that computers behind a router don't have a real IP addressNetwork Address Translation
Article on Network Address Translation. (Wikipedia)
https://en.wikipedia.org/wiki/Network_address_translation — if your IP address is something like 192.168.x.y, then that's a private address which only works in your local network.

Having said that, it is possible to set up computers in a local network with real IP addresses — it's just really tricky. Still, if you have such a situation, by all means try FracTest on it; but don't be surprised if it gets messed up by latency issues, or some other problem.

The Controller

The controller runs the FracTest user interface, and co-ordinates the rendering of all fractal images. This system is essential to the running of the cluster, so it needs to be available all the time that the cluster will be operating.

The controller can also be a compute server; whether it is, is up to you. Bear in mind that a compute server loads its host system rather heavily, so a controller running on the same system will be a bit laggy. The controller needs very little in the way of computing power; so a simple fanless media PC, while being too weak to bother running a server on, can make a great cluster controller.

Network Load

Every tile is passed to a server over the network; and every rendered tile's image is passed back. This constitutes the overwhelming majority of the network load generated by FracTest.

Bear in mind that in progressive mode, the tile has to be passed over the network both ways for each pass. At the beginning of the first pass, there is no pixel data, and the tile's data is tiny; but at the beginning of each subsequent pass, the pixel data has to be passed to whichever server is doing the next pass, and at the end of every pass, it is passed back again. The network traffic is therefore the size of the pixel buffer (the number of pixels per tile multiplied by 4), times (the number of passes times 2, minus 1). Add up the sizes of these objects, and multiply by the tile processing rate you expect, to see how much data needs to flow over the network.

This means that using many progressive passes can choke the network. However, this is generally only an issue on early passes, since the processing time goes up three times per pass.

Based on the amount of traffic you are generating, you may need to consider how your network is connected. If your computers are linked over your home WiFi, and especially if you're in a noisy area, the bandwidth may be too low, and your servers may spend much of their time waiting for network traffic. Another way to gauge this is with the "efficiency" rating in the server's statistics window — if it is much less then 100%, the server is probably being held back by the network. In this case, it may be worth connecting the servers to the cluster controller with a dedicated, wired router. The cluster controller can then connect this network to WiFi using network bridging.

In general, this is usually not too much of an issue. A fast fractal render may use 10Mb/s of network bandwidth, or even more; but you typically won't be using a cluster for those. Once you get down to deep views, network load should drop below 10kb/s — and for really deep, multi-day renders, it will be much less than that.

Port Numbers

In Internet-based networking, a port number is used to identify a server running on a particular machine. Only one server (of any kind, whether FracTest or anything else) can be running on a given port on a specific machine; but any number of services can be running on different ports. Standardised ports are used for many services, to make things easier; for example, web servers usually run on port 80. However, this is not a rule, and can be overridden.

The default port number for a Fractest server is 51215. Normally, you shouldn't have to mess with this; however, if you want to run multiple servers on one machine, for example for testing, then it's very easy to do so by assigning them different port numbers.

To do this, you would first start your service on whichever port you like, using the --port parameter as described above. (To be safe, use a number above 1023).

Then, to connect to the server, just add the port number to the name, with a ":"; as in myserver:1234, etc. This can be done when specifying the server name on the command line, or in the cluster management window.

Internal servers, as used when FracTest is running in standalone mode, are allocated from the range 51315 - 51414. An unused port number is chosen for each standalone instance. Hence, you can have multiple standalone FracTest instances running at once, and they don't interfere with separate servers.