Achieving Exponential Slowdown by Enumerating Twice

posted by Craig Gidney on November 5, 2013

In this post: the cost of iterating over a sequence multiple times.

Deferred Danger

One of the many useful warnings in ReSharper is “Possible multiple enumeration of IEnumerable”. If you enumerate an enumerable more than once, ReSharper detects that and warns you about it. Although this warning may seem pointless at first, there are two good reasons to pay attention to it.

First, sometimes enumerating an enumerable is very expensive. For example, an enumerable might be backed by a database. Re-enumerating may force you to wait another network round trip. You don’t want to pay that cost twice.

The second reason, the one I’m discussing in this post, is that the performance problems caused by multiple enumerations compound on themselves.

Test Case: Odds and Evens

Suppose we’re tasked with writing a method that reorders the items in an enumerable so that the odd items come first.

A straightforward way to implement that method is to enumerate the given enumerable twice: once for the odd items, and once for the even items. Here’s what that looks like:

public static IEnumerable<T> OddsThenEvens<T>(this IEnumerable<T> items) {
    // First pass: yield the items at odd positions.
    var even = true;
    foreach (var item in items) {
        if (!even) yield return item;
        even ^= true;
    }

    // Second pass: yield the items at even positions.
    even = true;
    foreach (var item in items) {
        if (even) yield return item;
        even ^= true;
    }
}

Testing it out with LinqPad:

Enumerable.Range(0, 20).OddsThenEvens().Dump();
// 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 0, 2, 4, 6, 8, 10, 12, 14, 16, 18

Seems to work fine. On the other hand, the above test isn’t very representative of real-world code.

In practice you tend to chain lots of transformations onto the same enumerable. You don’t just reorder: you project and filter and reorder and group and join and zip and flatten and … you get the idea. Also, if you’re extracting functions appropriately, it won’t even all happen in one place. It’s more like the enumerable gradually accrues transformations as it travels across your program to where it’s ultimately consumed.

Let’s run a test that better reflects this layering: we’ll stack OddsThenEvens onto itself several times and see what happens. We’ll push it a bit and go to twenty layers:

var r = Enumerable.Range(0, 20);

// count passes by watching for first item going by
int passes = 0;
r = r.Select(item => {
    if (item == 0) passes++;
    return item;
});

// layer OddsThenEvens 20 times
for (var i = 0; i < 20; i++) {
    r = r.OddsThenEvens();
}

Testing that out:

string.Join(", ", r).Dump();
// (wait 5 seconds)
// 3, 7, 11, 15, 19, 2, 6, 10, 14, 18, 1, 5, 9, 13, 17, 0, 4, 8, 12, 16
passes.Dump();
// 1048576

Huh, that took a while to finish. Hold on. How many passes does that say we did over the sequence? A MILLION? That's not good.

The root cause of those million passes is OddsThenEvens using multiple enumerations. Each layer of OddsThenEvens is doing two passes of the underlying layer. Each additional layer isn't just increasing the total number of passes by two, it's multiplying the total number of passes by two. By the time we get to the twentieth layer that's more than a million passes.

Performance penalties for multiple enumeration don't add up, they multiply up.
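To see the doubling directly, we can vary the number of layers and watch the pass count. This snippet is a sketch of mine reusing the counting Select trick from above; the pass counts assume the two-pass OddsThenEvens:

```csharp
for (var layers = 0; layers <= 5; layers++) {
    var passes = 0;
    IEnumerable<int> r = Enumerable.Range(0, 20)
        .Select(item => { if (item == 0) passes++; return item; });

    // Stack the two-pass OddsThenEvens 'layers' times.
    for (var i = 0; i < layers; i++) {
        r = r.OddsThenEvens();
    }

    r.Count(); // force one full enumeration of the top layer
    Console.WriteLine("{0} layers: {1} passes", layers, passes);
}
// 0 layers: 1 passes
// 1 layers: 2 passes
// 2 layers: 4 passes
// 3 layers: 8 passes
// 4 layers: 16 passes
// 5 layers: 32 passes
```

Each layer enumerates the layer beneath it twice, so the pass count is exactly 2^layers.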

Trading Space against Compound Interest

To fix our performance-when-layering problem, we need to fix the multiple enumerations problem. In the case of OddsThenEvens, all we need to do is remember the even items. We just stash them into a queue, as we're yielding the odd items, for later consumption.

Using a queue to stash the even items will ensure we do only one pass, while still allowing us to trickle out items as we go (the caller doesn't have to wait for us to scan the whole frickin' world before they get their first frickin' item). Here's what it looks like:

public static IEnumerable<T> OddsThenEvens<T>(this IEnumerable<T> items) {
    // Single pass: yield the odd-positioned items immediately,
    // stashing the even-positioned items for later.
    var evens = new Queue<T>();
    var even = true;
    foreach (var item in items) {
        if (even) {
            evens.Enqueue(item);
        } else {
            yield return item;
        }
        even ^= true;
    }

    // Trickle out the stashed even-positioned items.
    while (evens.Count > 0) {
        yield return evens.Dequeue();
    }
}

Caching items as you enumerate them the first time is a standard trick for avoiding multiple enumerations. Another common technique is smearing all your passes together into one, but that wouldn't have worked for OddsThenEvens. We could also have checked whether the underlying sequence was an IList&lt;T&gt; (i.e. supported random access), and taken advantage of that when possible.
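As a sketch of that last idea (the method name and structure here are mine, not from the post), an IList fast path avoids the queue entirely by doing two cheap index passes:

```csharp
// Hypothetical sketch: take the random-access fast path when available.
public static IEnumerable<T> OddsThenEvensFast<T>(this IEnumerable<T> items) {
    var list = items as IList<T>;
    if (list != null) {
        // Indexing is O(1), so two index passes cost no extra enumerations.
        for (var i = 1; i < list.Count; i += 2) yield return list[i];
        for (var i = 0; i < list.Count; i += 2) yield return list[i];
    } else {
        // Fall back to the single-pass caching variant.
        foreach (var item in items.OddsThenEvens()) yield return item;
    }
}
```

Indexing into a list twice isn't "multiple enumeration" in the dangerous sense, because the list isn't recomputing anything on each access.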

Testing our caching variant of OddsThenEvens out:

Enumerable.Range(0, 20).OddsThenEvens().Dump();
// 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 0, 2, 4, 6, 8, 10, 12, 14, 16, 18

Same result as last time. That's good. Now let's try the nesting test:

... // same deal as last time

string.Join(", ", r).Dump();
// (wait 0.05 milliseconds)
// 3, 7, 11, 15, 19, 2, 6, 10, 14, 18, 1, 5, 9, 13, 17, 0, 4, 8, 12, 16
passes.Dump();
// 1

That's a rather large improvement. We've gone from a million passes to just one pass, and the improvement is directly reflected in the elapsed time, which dropped by roughly five orders of magnitude.

A bit of asymptotic analysis really highlights the improvement we've made in the running time. If the number of layers is r and the number of items is n, then the original method costs O(n*2^r) time in total (while each layer uses only O(1) space). The caching variant's layers each cost O(n) time and O(n) space, for O(n*r) time in total.

The caching variant isn't just a lot cheaper in terms of time, it's a lot more predictable. In practice, the value of r is hard to pin down. It depends on the entire program and can vary based on what happens at runtime. When marginal costs depend on r, it becomes very difficult to guess how changes will affect the running time of the program. Questions like "Can I chain these two components together?" become much harder, because the inefficiencies of each component might "resonate".

Of course, the original method's running time doesn't just depend on r, it's exponential in r. When your marginal cost is exponentially unpredictable... well, you're gonna have a bad time.

Other Applications

"Don't do it twice" doesn't just apply to enumerating. It applies any time a function is performing actions or invoking functions given as an argument.

For example, simple implementations of MaxBy have a tendency to repeatedly invoke the value-to-compare selector on the largest item so far, instead of stashing that value. You don't know how expensive that call is, but you're repeating it n times instead of once! It's not inconceivable for the function given to MaxBy to involve a smaller MaxBy and... well, the same thing that happened with OddsThenEvens will happen.
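As a sketch of the fix (this is my code, not from the post, assuming the usual "largest key wins" semantics), a MaxBy that invokes the selector exactly once per item by caching the best key so far:

```csharp
// Sketch: call keySelector once per item instead of re-calling it
// on the best-so-far item for every comparison.
public static T MaxBy<T, TKey>(this IEnumerable<T> items,
                               Func<T, TKey> keySelector,
                               IComparer<TKey> comparer = null) {
    comparer = comparer ?? Comparer<TKey>.Default;
    using (var e = items.GetEnumerator()) {
        if (!e.MoveNext()) throw new InvalidOperationException("Sequence was empty.");
        var best = e.Current;
        var bestKey = keySelector(best); // cached, never recomputed
        while (e.MoveNext()) {
            var key = keySelector(e.Current);
            if (comparer.Compare(key, bestKey) > 0) {
                best = e.Current;
                bestKey = key;
            }
        }
        return best;
    }
}
```

The selector runs n times total, instead of up to 2(n-1) times in the naive version that recomputes the best-so-far key on every comparison.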

Recursive functions are another example, although a bit less direct. The difference between a Fibonacci function that calls itself twice and one that uses dynamic programming is substantial (and don't forget to turn it into matrix exponentiation while you're at it).
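As a sketch of the contrast (names here are mine, and Fib(0)=0, Fib(1)=1 is assumed), compare a Fibonacci that calls itself twice with one that just remembers the previous two values:

```csharp
// Exponential time: calls itself twice per level, just like
// layered re-enumeration doubles the passes per layer.
public static long SlowFib(int n) {
    return n < 2 ? n : SlowFib(n - 1) + SlowFib(n - 2);
}

// Linear time: cache the last two values instead of recomputing them.
public static long FastFib(int n) {
    long a = 0, b = 1;
    for (var i = 0; i < n; i++) {
        var next = a + b;
        a = b;
        b = next;
    }
    return a;
}
```

It's the same trade as OddsThenEvens: a little extra state in exchange for not redoing work exponentially many times.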

You can also combine recursing with enumerating multiple times (e.g. that's one way to implement an AllSubsequencesOfSize that won't miss any results when given an infinite sequence). It's a great way to keep the room warm.

Summary

Enumerating a sequence twice can multiply the cost of an entire path through your program by two. When it happens several times on the same path, the slowdown compounds and becomes unmanageable. Prefer caching during the first enumeration over re-enumerating.
