Referencing Substrings Faster, without Leaking Memory

posted by Craig Gidney on December 3, 2013

In this post: computing substrings in less than linear time, without introducing tricky memory leaks.

Java Substrings

Last month, Java changed how it computes substrings.

Previous versions of Java computed substrings by just grabbing a reference to the parent string, and noting the offset/length of the substring within it. This was extremely efficient, taking only a constant amount of time, but had the downside of the substring keeping the parent alive past when it would have normally been collected. Since the parent string could be significantly larger, tiny strings could be backed by huge amounts of data. It was a tricky memory leak.

Java now uses a different method: always copying the substring. This is a lot more expensive in terms of time and (initially) memory, but fixes the memory leak. For more discussion of the tradeoffs involved, check out the reddit thread where the author of the change discusses it.
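To make the tradeoff concrete, here's a minimal sketch of the two strategies (my illustration, with made-up names; this is not Java's actual implementation):

// Hypothetical sketch of the two substring strategies, not real library code.
sealed class SharedSubstring {
    private readonly char[] _backing; // shared with the parent string
    private readonly int _offset;
    private readonly int _length;

    // O(1): just remember where we are in the parent's array...
    // ...but the whole backing array stays alive as long as we do.
    public SharedSubstring(char[] backing, int offset, int length) {
        _backing = backing;
        _offset = offset;
        _length = length;
    }

    // O(length): copy the characters out, so the result references
    // exactly what it needs and nothing more.
    public static char[] CopiedSubstring(char[] backing, int offset, int length) {
        var copy = new char[length];
        System.Array.Copy(backing, offset, copy, 0, length);
        return copy;
    }
}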

Reading about this made me wonder: why can’t the garbage collector just detect when only a tiny subset of a string is still referenced, and reclaim the unreferenced holes? I see two main obstacles: the massive engineering challenge of changing how the JVM works with arrays internally, and the algorithmic challenge of tracking the intervals and efficiently detecting when holes appear. I find the latter more interesting, which is how I ended up spending my weekend solving it.

Referencing Intervals

Our goal is to make a system where objects can reference a whole section of an array, while still allowing the non-referenced bits to be cleaned up. (Also, we’re not happy just turning our arrays into trees. We want the random access performance.) Assuming we can get these interval references to work correctly and efficiently, implementing substring will be trivial.

The problem we need to solve is that, as interval references are created and collected, holes may or may not open up. We need to be able to detect those holes so we can reclaim the associated memory. Here’s an example:

Interval references

In the above diagram, a program starts with a string referencing an array of characters containing “The quick brown fox jumps over the lazy dog.”. Then substrings are created, referencing “The quick brown fox”, “the lazy dog”, and “fox jumps over the lazy”. When the original string is collected, the only part of the character array not covered is the final period. The memory for the final period is freed, leaving the rest unchanged. Then the center substring, “fox jumps over the lazy”, is collected and opens up a hole that is freed. We’re left with two substrings that can be independently moved/compacted/etc.

Visually, the problem seems really straightforward. However, all the obvious ways to actually compute the holes involve scanning memory or iterating over the intervals. Both of those can end up using linear time per added substring, or more, so we need something more efficient.

Coming up with Ideas

I didn’t go into this problem already knowing the solution. I wasn’t even sure if it had an efficient solution. But, at least to me, the fact that we’re searching a 1-d space screams “logarithmic time”. There should be something that does it that fast; I just need to come up with ideas for what that something could be.

So I started thinking.

Idea #1: Just use Interval Trees

The first thing I thought of is interval trees. An interval tree can efficiently query a region and tell you all the intervals overlapping it. To solve the problem, we’d put each referenced interval into an interval tree and then query the… uh… Oh. We want to find holes, where stuff isn’t, but interval trees tell us where stuff is.

An interval tree can be used to confirm an area is totally empty, or to limit our focus to the intervals overlapping a particular area, but it doesn’t handle our hard case. For example, if we have a huge area with a dense mass of tiny overlapping intervals in the center, then the interval tree’s contributions amount to roughly “None of the intervals can be excluded! Have fun!”.

Idea #2: Trees of Intervals

Interval trees almost do what’s needed, so my second approach was to look into other ways of arranging intervals into trees. Maybe I could somehow sacrifice the efficient querying of intervals for efficient hole detection.

So I grabbed some paper and started doodling trees, had a few promising false starts, but ultimately failed to make this approach work. No matter how I decided to structure the trees, I could either find ways to force the tree out of balance or create cases where a hole was not detected.

I began to suspect there was no way to avoid linear time in the worst case.

Idea #3: Go Randomized

Some problems can be solved more efficiently on average than in the worst case; usually by doing something random. What if I used an interval tree, but found holes probabilistically by querying random spots? That was my third idea.

The interesting thing about this idea is that the bigger a hole is (i.e. the more we care about it), the more likely we are to stumble onto it with the random queries. Tiny holes could easily be overlooked, but they were probably insignificant anyways. The probability of missing a hole of width w when the original string had size n, assuming we query c times, is (1-\frac{w}{n})^c. The fact that the probability decays exponentially with respect to the amount of work we do is very promising.
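As a quick sanity check of that formula (these particular numbers are mine, not measurements):

// Chance that c uniformly random probes all miss a hole of width w
// in an array of size n: (1 - w/n)^c.
static double MissProbability(long w, long n, int c) {
    return System.Math.Pow(1.0 - (double)w / n, c);
}
// MissProbability(n / 2, n, 20) ≈ 0.000001: a huge hole almost never escapes.
// MissProbability(1000, 1000000, 100) ≈ 0.9: a tiny hole usually does.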

I figured that if I couldn’t come up with anything else, this route might just work out as a consolation solution. I didn’t want to give up on finding a deterministic solution yet, though. It had only been a few hours, after all, and sometimes it takes a while for the right idea to hit. Sounds like an opportunity for the best problem-solving technique: taking a nap.

Idea #4: Depth

I didn’t even get to sleep before I realized what I’d been doing wrong. I’d been keeping track of the intervals themselves but, as far as locating the holes is concerned, that’s way more information than necessary. For example, here are two cases that have exactly the same effect on hole tracking:

Congruent overlaps w.r.t. holes

Ooooooooooooh, we only care about the depth! We can separate the starts of the intervals from their ends! It’s like every interval-start is an open-bracket, every interval-end is a close-bracket, and we’re trying to find spots that aren’t nested inside any brackets. So we’d interpret both cases in the above diagram as the same bracket problem: (()).
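To make the bracket interpretation concrete, here's a throwaway linear-sweep version (mine, for illustration only; the whole point of what follows is to get the same answer incrementally, in logarithmic time instead of linear):

// Each interval [start, end) is an open-bracket at start and a
// close-bracket at end; holes are maximal runs where the depth is zero.
static List<(int Start, int End)> FindHoles(
        IEnumerable<(int Start, int End)> intervals, int arrayLength) {
    var delta = new int[arrayLength + 1];
    foreach (var (s, e) in intervals) {
        delta[s] += 1;
        delta[e] -= 1;
    }
    var holes = new List<(int Start, int End)>();
    int depth = 0, holeStart = 0;
    for (var i = 0; i < arrayLength; i++) {
        depth += delta[i];
        if (depth > 0) {
            if (holeStart < i) holes.Add((holeStart, i));
            holeStart = i + 1;
        }
    }
    if (holeStart < arrayLength) holes.Add((holeStart, arrayLength));
    return holes;
}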

I can definitely use a tree to keep track of nesting depth.

Trees

Before we continue, I have to get something off my chest: I really, really hate implementing balanced binary trees. They are the tweakiest, corner case-iest, time sucking-est things that solve every problem ever.

I also dislike trying to find existing implementations of trees because, for some odd reason, trees tend to be named by abbreviations that only make sense in hindsight (e.g. a kd-tree is for searching k-dimensional spaces). This makes them nearly impossible to find, because you don’t know what arbitrary letter corresponds to the tree you want. What should I search for? Nesting depth tree? Hole tree? Series tree? Delta summation tree? I’d probably have more luck searching for “nd-tree”, “h-tree”, and “s-tree”…

Nesting Depth Tree

In order to ensure we can adjust nesting depth across entire ranges, without having to scan and modify that entire range of the tree, the values we put into our tree will not be “how deep is the nesting at X?”. Instead, each node will store “how much does the nesting depth change at X?”. That allows changes in one part of the tree to implicitly affect the others.

Determining the absolute nesting depth at a given point requires summing up all the changes that are to the left of it in the tree. To avoid needing to scan the entire tree, we’ll have each subtree keep track of the total change in nesting depth contributed by all of its nodes. Instead of scanning a subtree, we’ll just read off its total. That allows us to query the nesting depth at any particular point, in time proportional to the height of the tree, by summing up the adjustments of lesser subtrees as we tunnel down to the point in question.
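In sketch form, the query looks something like this (a simplification, reusing the field names from the snippets below):

// Sketch: the nesting depth at `offset` is the sum of every adjustment
// at or before it. Each time we descend rightward, the entire left
// subtree (plus the current node) is at or before the target, so we
// add its cached total instead of scanning it.
private int NestingDepthAt(int offset) {
    var node = this;
    var depth = 0;
    while (node != null) {
        if (offset < node._offset) {
            node = node._less;
        } else {
            depth += GetTotalAdjust(node._less) + node._adjust;
            node = node._more;
        }
    }
    return depth;
}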

Another thing we need to track is how much each subtree “dives”. You can think of this as the number of unmatched close-brackets covered by the subtree. We can use a subtree’s “dive” to determine if it contains a hole or not, based on whether or not it dives enough to overcome the nesting depth contributed by parts of the tree to its left.
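In sketch form, that check looks something like this (again a simplification, using the relative-minimum aggregate maintained below; the real version also has to report where the hole starts and ends):

// Sketch: descend toward the leftmost point where the running nesting
// depth dips to zero, i.e. the left edge of a hole. Returns null when
// the depth stays positive everywhere inside this subtree.
private Node FindLeftmostZero(int depthToLeft) {
    // Does the left subtree dive deep enough to hit zero? Go left.
    if (_less != null && depthToLeft + _less._subTreeRelativeMinimum <= 0) {
        return _less.FindLeftmostZero(depthToLeft);
    }
    // Does the depth hit zero just after this node's own adjustment?
    var depthHere = depthToLeft + GetTotalAdjust(_less) + _adjust;
    if (depthHere <= 0) return this;
    // Otherwise the only remaining hope is the right subtree.
    if (_more != null && depthHere + _more._subTreeRelativeMinimum <= 0) {
        return _more.FindLeftmostZero(depthHere);
    }
    return null;
}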

Here is how we update the total change in nesting depth and the dive (or “relative minimum”):

private void RecomputeAggregates() {
    // total change in nesting depth
    _subTreeTotalAdjust = GetTotalAdjust(_less) + _adjust + GetTotalAdjust(_more);

    // lowest (relative) dive in nesting depth
    var totalAdjustJustBefore = GetTotalAdjust(_less) + _adjust;
    var totalAdjustJustAfter = totalAdjustJustBefore + (_more == null ? 0 : _more._subTreeRelativeMinimum);
    _subTreeRelativeMinimum = Math.Min(totalAdjustJustBefore, totalAdjustJustAfter);
    if (_less != null) _subTreeRelativeMinimum = Math.Min(_less._subTreeRelativeMinimum, _subTreeRelativeMinimum);
}

and here’s an example situation, where those values have been calculated:

Example tree

In the above diagram, each node is showing its own change-in-nesting-depth (the value labelled “d”), the total change-in-nesting-depth of its subtree (labelled “t”), and the minimum it reaches by diving (labelled “m”). You can see that the root has a total change-in-nesting-depth of zero, which is good: if it weren’t zero, that would mean the number of interval starts and ends didn’t match.

Our two aggregate values, total and dive, basically define our tree. It’s just a matter of making a balanced binary search tree that keeps them up to date. Then we need to use those values for our hole detection. Shouldn’t be too hard.

18 hours later…

I hate balanced search trees so much.

Implementation Aside: Hacky Balancing

While I was implementing my tree, I came up with a way to cheat on balancing. I would just use each node’s offset (i.e. where it applies in the array) to decide who should be whose parent. It works like this:

private bool ShouldBeParentOf(Node other) {
    if (other == null) return false;
    return PowerOf2Ness(other._offset) < PowerOf2Ness(this._offset);
}
private static int PowerOf2Ness(int value) {
    // only the lowest set bit affects the result
    // xxxxxxxxx1000000 - 1 == xxxxxxxxx0111111
    // xxxxxxxxx1000000 ^ xxxxxxxxx0111111 == 0000000001111111
    // larger powers of 2 have their first set bit higher
    return value ^ (value - 1);
}

The expression value ^ (value - 1) is a bit twiddling hack similar to the one used to count the number of set bits in O(|set bits|) time. It returns larger numbers when the input is a multiple of a larger power of 2: it evaluates to 1 for odd numbers, 3 for even numbers that aren't a multiple of four, 7 for multiples of four that aren't multiples of eight, and so forth.

The reason I'm using "power of 2-ness" is that it mimics the structure of a perfectly balanced binary tree. For example, if you evaluate the power-of-2-ness of 1 through 15 you get 1,3,1,7,1,3,1,15,1,3,1,7,1,3,1. Draw that out on paper and you should see a familiar structure.
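Or, if you don't feel like drawing, a quick loop over PowerOf2Ness confirms the pattern:

// Prints 1,3,1,7,1,3,1,15,1,3,1,7,1,3,1
for (var i = 1; i <= 15; i++) {
    System.Console.Write(PowerOf2Ness(i) + (i < 15 ? "," : "\n"));
}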

Unfortunately, this only ended up sorta-kinda working out. The problem is that our tree can be sparse compared to the number of offsets in an array. If the initial string has