## Brute Force Parallelization

In this post: a technique for parallelizing accumulation when the accumulator only takes on a small number of states.

### Trigger

Today I watched the talk *How to Think about Parallel Programming: Not!* by Guy Steele. (It’s really good. The first third is off topic and *still* really good.)

About an hour into the talk, Steele was talking about a strategy for parallelizing accumulation code: instead of trying to apply a function in parallel, try to compose it in parallel. This triggered an odd idea in my head: what if you just brute forced the composition? That is to say, for every possible value the accumulator might take, just compute what the result would be in that case. You could do this for sections of the list in parallel, and then accumulate the result in section-sized steps.

This is an interesting idea because its applicability depends entirely on the type of the accumulator, as opposed to associativity or other algebraic properties of the function usually associated with parallelization.

The best case scenario for this idea should be a boolean accumulator, which only takes on two values. Anything simpler than that and there’d be no need to even look at the list, so let’s go over the boolean case.

### Boolean Accumulation

Suppose we’re tasked with optimizing the following method:

```csharp
bool BooleanAccumulate<T>(T[] vals, Func<bool, T, bool> accumulate) {
    bool accumulator = false;
    foreach (var e in vals) {
        accumulator = accumulate(accumulator, e);
    }
    return accumulator;
}
```

or, more succinctly:

```csharp
bool BooleanAccumulate<T>(T[] vals, Func<bool, T, bool> accumulate) {
    return vals.Aggregate(false, accumulate);
}
```

We’re guaranteed that the accumulation function given to our function is pure (has no side effects), but otherwise it could be anything. In an ideal world it would satisfy the nice algebraic properties so often used to parallelize accumulation, like associativity or idempotence, but clearly we can’t rely on that. Luckily, because the accumulator can only take on two states, we have the option of brute forcing the accumulator transitions.

For example, suppose we had three processors. While one processor is scanning the first half of the list, the other two can be scanning the second half to figure out how to complete once the midway result is known. One will start with the assumption that the midway accumulator was false, and the other with the assumption that it was true. When the first processor finishes, we’ll know which of the other two results to discard and which to keep. All three processors will finish at approximately the same time, in about half the time it would have taken to sequentially scan the entire list. Pseudocode:

```
bool BooleanAccumulate<T>(T[] vals, Func<bool, T, bool> accumulate) {
    var mid = vals.Length/2;
    var firstHalf = OnProcessor1(() => vals.Take(mid).Aggregate(false, accumulate));
    var secondHalfFalse = OnProcessor2(() => vals.Skip(mid).Aggregate(false, accumulate));
    var secondHalfTrue = OnProcessor3(() => vals.Skip(mid).Aggregate(true, accumulate));
    var chosen = wait(firstHalf) ? secondHalfTrue : secondHalfFalse;
    return wait(chosen);
}
```
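The pseudocode above can be made runnable. Here’s a sketch in Python (since `OnProcessor1` and `wait` are hypothetical helpers), using a thread pool to stand in for the three processors and `functools.reduce` in place of `Aggregate`:

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def boolean_accumulate_parallel(vals, accumulate):
    """Parallel boolean fold: one worker scans the first half while two
    workers pre-compute the second half for both possible midway states."""
    mid = len(vals) // 2
    with ThreadPoolExecutor(max_workers=3) as pool:
        first_half = pool.submit(reduce, accumulate, vals[:mid], False)
        second_if_false = pool.submit(reduce, accumulate, vals[mid:], False)
        second_if_true = pool.submit(reduce, accumulate, vals[mid:], True)
        # Once the midway result is known, keep the matching precomputed
        # future and discard the other.
        chosen = second_if_true if first_half.result() else second_if_false
        return chosen.result()
```

Note that Python threads won’t actually speed up CPU-bound work, so this is a sketch of the structure rather than a benchmarkable implementation.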

With even more processors, we can split the list into more sections.

Supposing we had five processors, we could divide the list into three sections. The first processor would advance the computation to the end of the first section. While it was doing that, processors 2 and 3 would brute force the transition function for the center section and processors 4 and 5 would do the same for the transition function of the last section. They’d all finish in about a third of the time it would have taken to work sequentially, at which point we’d run the result from the first third through the transition functions for the other two thirds to get the final result right away.

Because our accumulator has just two possible states, we can use processors to cut the list into sections. Each section’s transition function can be determined concurrently, and once the transition functions are known they can be folded together concurrently. This gives us an overall complexity of O(n/p + lg(p)), where n is the size of the list and p is the number of available processors. The O(n/p) bit is for computing the transition functions of the sections, while the O(lg(p)) cost is for combining those transition functions together.
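Here’s one way that general scheme might look as a Python sketch. The function name, the section count, and the explicit `states` parameter are my illustrative choices, not from the post; each section’s transition table is brute forced concurrently, while the composition step is shown sequentially even though it too could be done as a parallel prefix:

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def parallel_accumulate(vals, accumulate, states, initial):
    """Fold `accumulate` over `vals` by brute-forcing, for each section,
    the table mapping every possible incoming accumulator state to the
    outgoing state, then composing the sections' tables."""
    num_sections = 4  # in practice, derived from the processor count
    size = max(1, -(-len(vals) // num_sections))  # ceiling division
    sections = [vals[i:i + size] for i in range(0, len(vals), size)]

    def transition_table(section):
        # Brute force: scan the section once per possible starting state.
        return {s: reduce(accumulate, section, s) for s in states}

    with ThreadPoolExecutor() as pool:
        tables = list(pool.map(transition_table, sections))

    # Compose the per-section tables. Shown sequentially for clarity;
    # a parallel prefix would combine them in O(lg p) rounds.
    state = initial
    for table in tables:
        state = table[state]
    return state
```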

Note that the asymptotic notation is hiding the factor-2 cost from having to compute everything for both false and true. If we parametrize the number of possible accumulator states as s, then the cost of the algorithm is actually O(s·n/p + s·lg(p)). That’s why this technique would utterly fail for, say, summing 32-bit integers (where s = 2^32): brute forcing a section would be four billion times as expensive as scanning it normally!

### Use In Practice

There are in fact practical cases where this technique can be applied. For example, when adding two numbers you need to know what the carry is in order to advance the digit-by-digit computation. This technique allows you to divide the digits into sections and propagate the carry a lot–

Wait a second.

That sounds really familiar.

…

… *searching* …

Oh *come on*. The Kogge-Stone adder, believed to be the fastest possible adder circuit, makes use of exactly the technique I’ve been talking about to propagate carries in logarithmic time.

I’m not sure why I’m so surprised. The idea is kind of obvious, at least in hindsight. It’s probably a standard technique I should have already known about.

One day I’ll have an idea and it won’t have been discovered over a decade before I was born. One day. In the meantime, I’ll keep the accumulator’s state size in mind if I need to parallelize some code.

### Summary

When an accumulation function has a small number of outputs, you can parallelize its evaluation by concurrently brute forcing transition functions for entire sections of the list and then concurrently folding function composition across those transition functions.

This technique is part of what makes the Kogge-Stone adder so efficient at propagating carries.
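To illustrate that connection, here’s a simplified software model (emphatically not a hardware description) of carry propagation via brute-forced transition functions. Each bit position’s carry function is one of only three possibilities (“generate”, “propagate”, “kill”), and a Kogge-Stone-style parallel prefix composes them in logarithmically many rounds; the loop below simulates those rounds sequentially:

```python
def bit_transition(a, b):
    # Transition for one bit position: (carry out if carry in is 0,
    # carry out if carry in is 1). Only three tables can occur:
    # (1,1) "generate", (0,1) "propagate", (0,0) "kill".
    return (a & b, a | b)

def compose(f, g):
    # Brute-forced composition: run f's section first, then g's.
    return (g[f[0]], g[f[1]])

def add_with_prefix(x, y, width):
    """Add two `width`-bit numbers via a Kogge-Stone-style parallel
    prefix over carry-transition functions (rounds simulated in a loop)."""
    bits = [((x >> i) & 1, (y >> i) & 1) for i in range(width)]
    prefix = [bit_transition(a, b) for a, b in bits]
    step = 1
    while step < width:  # lg(width) rounds; each round is parallelizable
        prefix = [prefix[i] if i < step else compose(prefix[i - step], prefix[i])
                  for i in range(width)]
        step *= 2
    # The carry into bit i is the prefix of bits 0..i-1 applied to carry 0.
    carries = [0] + [prefix[i][0] for i in range(width - 1)]
    return sum((a ^ b ^ c) << i for i, ((a, b), c) in enumerate(zip(bits, carries)))
```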


### My Twitter: @CraigGidney

