## Brute Force Parallelization

posted by Craig Gidney on December 17, 2013

In this post: a technique for parallelizing accumulation when the accumulator only takes on a small number of states.

### Trigger

Today I watched the talk How to Think about Parallel Programming: Not! by Guy Steele. (It’s really good. The first third is off topic and still really good.)

About an hour into the talk, Steele was talking about a strategy for parallelizing accumulation code: instead of trying to apply a function in parallel, try to compose it in parallel. This triggered an odd idea in my head: what if you just brute forced the composition? That is to say, for every possible value the accumulator might take, just compute what the result would be in that case. You could do this for sections of the list in parallel, and then accumulate the result in section-sized steps.

This is an interesting idea because its applicability depends entirely on the type of the accumulator, as opposed to associativity or other algebraic properties of the function usually associated with parallelization.

The best case scenario for this idea should be a boolean accumulator, which only takes on two values. Anything simpler than that and there’d be no need to even look at the list, so let’s go over the boolean case.

### Boolean Accumulation

Suppose we’re tasked with optimizing the following method:

```csharp
bool BooleanAccumulate<T>(T[] vals, Func<bool, T, bool> accumulate) {
    bool accumulator = false;
    foreach (var e in vals) {
        accumulator = accumulate(accumulator, e);
    }
    return accumulator;
}
```

or, more succinctly:

```csharp
bool BooleanAccumulate<T>(T[] vals, Func<bool, T, bool> accumulate) {
    return vals.Aggregate(false, accumulate);
}
```

We’re guaranteed that the accumulation function given to our function is pure (has no side effects), but otherwise it could be anything. In an ideal world it would satisfy the nice algebraic properties so often used to parallelize accumulation, like associativity or idempotence, but clearly we can’t rely on that. Luckily, because the accumulator can only take on two states, we have the option of brute forcing the accumulator transitions.

For example, suppose we had three processors. While one processor is scanning the first half of the list, the other two can be scanning the second half to figure out how to finish once the midway result is known. One will start with the assumption that the midway accumulator was false, and the other with the assumption that it was true. When the first processor finishes, we'll know which of the other two results to keep and which to discard. All three will finish at approximately the same time, in about half the time it would have taken to sequentially scan the entire list. In code, with tasks standing in for the dedicated processors:

```csharp
bool BooleanAccumulate<T>(T[] vals, Func<bool, T, bool> accumulate) {
    var mid = vals.Length / 2;
    // Scan the first half while speculatively scanning the second half
    // from both possible midway accumulator values.
    var firstHalf = Task.Run(() => vals.Take(mid).Aggregate(false, accumulate));
    var secondHalfFalse = Task.Run(() => vals.Skip(mid).Aggregate(false, accumulate));
    var secondHalfTrue = Task.Run(() => vals.Skip(mid).Aggregate(true, accumulate));
    // The first half's result tells us which speculative result to keep.
    return firstHalf.Result ? secondHalfTrue.Result : secondHalfFalse.Result;
}
```

With even more processors, we can split the list into more sections.

Supposing we had five processors, we could divide the list into three sections. The first processor would advance the computation to the end of the first section. While it was doing that, processors 2 and 3 would brute force the transition function for the center section and processors 4 and 5 would do the same for the transition function of the last section. They’d all finish in about a third of the time it would have taken to work sequentially, at which point we’d run the result from the first third through the transition functions for the other two thirds to get the final result right away.
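That scheme can be sketched directly: brute force each section's transition function in parallel, then thread the running accumulator through the precomputed answers. Here's a minimal sketch using .NET tasks (the `SectionedAccumulate` name and the chunking details are mine, and tasks only approximate processors pinned to sections):

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;

static class BruteForceParallel {
    // Splits the list into 'sections' chunks, brute forces each chunk's
    // transition function (its result when fed false, and when fed true)
    // in parallel, then chains the transitions in 'sections' cheap steps.
    public static bool SectionedAccumulate<T>(
            T[] vals, Func<bool, T, bool> accumulate, int sections) {
        var chunkSize = (vals.Length + sections - 1) / sections;
        var transitions = Enumerable.Range(0, sections)
            .Select(i => Task.Run(() => {
                var chunk = vals.Skip(i * chunkSize).Take(chunkSize);
                // Brute force both possible incoming accumulator values.
                return (ifFalse: chunk.Aggregate(false, accumulate),
                        ifTrue: chunk.Aggregate(true, accumulate));
            }))
            .ToArray();
        // Thread the accumulator through the precomputed tables.
        var acc = false;
        foreach (var t in transitions) {
            acc = acc ? t.Result.ifTrue : t.Result.ifFalse;
        }
        return acc;
    }
}
```

The final loop runs in one cheap step per section instead of one step per element, which is where the speedup comes from.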

Because our accumulator has just two possible states, we can use p processors to cut the list into roughly p/2 sections. Each section's transition function can be determined concurrently, and once the transition functions are known they can be folded together concurrently. This gives us an overall complexity of O(n/p + lg p), where n is the size of the list and p is the number of available processors. The O(n/p) bit is for computing the transition functions of the sections, while the O(lg p) cost is for combining those transition functions together.
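The concurrent fold works because two-state transition functions compose cheaply: represent each one as a pair (result when fed false, result when fed true). A sketch of the compose step (the names are mine):

```csharp
using System;

static class TransitionCompose {
    // A transition function over a boolean accumulator is fully described
    // by a pair: what it outputs when fed false, and when fed true.
    public static (bool ifFalse, bool ifTrue) Compose(
            (bool ifFalse, bool ifTrue) first,
            (bool ifFalse, bool ifTrue) second) {
        // Run 'first', then feed its output into 'second'.
        return (ifFalse: first.ifFalse ? second.ifTrue : second.ifFalse,
                ifTrue: first.ifTrue ? second.ifTrue : second.ifFalse);
    }
}
```

Since function composition is associative, the section transitions can be folded together pairwise in logarithmically many rounds.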

Note that the asymptotic notation is hiding the factor of 2 cost from having to compute everything for both false and true. If we parametrize the number of possible accumulator states as s, then the cost of the algorithm is actually O(s·n/p + s·lg p). That's why this technique would utterly fail for, say, summing 32-bit integers (where s = 2^32): brute forcing a section would be four billion times as expensive as scanning it normally!

### Use In Practice

There are in fact practical cases where this technique can be applied. For example, when adding two numbers you need to know what the carry is in order to advance the digit-by-digit computation. This technique allows you to divide the digits into sections and propagate the carry a lot–

Wait a second.

That sounds really familiar.

searching

Oh come on. The Kogge-Stone adder, one of the fastest known adder circuits, makes use of exactly the technique I've been talking about to propagate carries in logarithmic time.
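To make the connection concrete: when adding two binary numbers, the carry rippling through the bit positions is exactly a two-state accumulator, with each bit pair either generating, propagating, or killing the incoming carry. A small illustrative sketch (the encoding and names here are mine, not standard adder terminology):

```csharp
using System;
using System.Linq;

static class CarryExample {
    // The carry out of each bit position is a boolean accumulator:
    //     carryOut = (a & b) | (carryIn & (a ^ b))
    // Because the accumulator has only two states, sections of bit
    // positions can be brute forced for carryIn = false and
    // carryIn = true in parallel, which is the trick carry-lookahead
    // adders exploit.
    public static bool FinalCarry((bool a, bool b)[] bitPairs) {
        Func<bool, (bool a, bool b), bool> step =
            (carry, bits) => (bits.a && bits.b) || (carry && (bits.a ^ bits.b));
        return bitPairs.Aggregate(false, step);
    }
}
```

Feeding this to the sectioned accumulation above would propagate carries across whole sections at once instead of bit by bit.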

I’m not sure why I’m so surprised. The idea is kind of obvious, at least in hindsight. It’s probably a standard technique I should have already known about.

One day I'll have an idea and it won't have been discovered over a decade before I was born. One day. In the meantime, I'll keep the accumulator's state size in mind if I need to parallelize some code.

### Summary

When an accumulation function has a small number of outputs, you can parallelize its evaluation by concurrently brute forcing transition functions for entire sections of the list and then concurrently folding function composition across those transition functions.

This technique is part of what makes the Kogge-Stone adder so efficient at propagating carries.